Lab 7: Twitter Word Clouds

This lab uses the Twitter API as a source for sampling tweets in real time as well as grabbing tweets from a particular user. The focus is on constructing working software from existing libraries and documentation while also gaining exposure to JSON and text processing.

Here is a word cloud created from the text of the last 100 tweets by Williams College.

Step 0: Lab Preparation

  • You should begin by reading a blog post about the Twitter streaming API.
  • To use the Twitter API, you must have a Twitter account and be credentialed. The credentials take the form of a Consumer (API) key, a Consumer (API) secret, an Access Token, and an Access Token Secret. To gain credentials, perform the following steps:
    • Create a Twitter account if you do not already have one.
    • Head to https://apps.twitter.com/ and log in.
    • Click Create New App.
    • Fill out the Application Details form with reasonable data (you can use http://www.cs.williams.edu as your website), agree to the Developer Agreement terms, and click Create your Twitter application.
    • Click on the Keys and Access Tokens tab.
    • Click on Create my access token in the Token Actions area.
    • In the lab you will need the consumer key, the consumer secret, the access token, and the access token secret.
  • Also read the Getting Started guide on the Tweepy API.

Step 1: Source Code

Step 2: Twitter Data

Make sure that you've reviewed this blog post on the Twitter API. The tutorial shows how to use Tweepy to sample data from Twitter's public stream and filter it based on keywords. Just read it; you'll do something similar in this lab.

The heart of the code lies in the StdOutListener class, which subclasses StreamListener. The streaming portion of Tweepy is event-based: when new data is available, it calls the on_data method with the new data. This data is always valid JSON, and it always corresponds to some sort of Twitter event, like a status update (i.e., a tweet), blocking a user, following a user, deleting an account, etc. If the method returns False, then the listener stops; if it returns True or None, it continues listening.

  from tweepy.streaming import StreamListener

  class StdOutListener(StreamListener):

      def on_data(self, data):
          # data is a raw JSON string describing a single Twitter event
          print(data)
          return True   # keep listening; returning False stops the stream

The file twitter.py contains a sample program that you can run from the command line; it samples Twitter events from users who tweet in English:

  $ python twitter.py
  

Notice the JSON objects streaming past on the screen. You can exit this program by using CTRL-C.
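
For reference, the wiring inside twitter.py probably resembles the following minimal sketch, which assumes the pre-4.0 Tweepy streaming interface from the blog post; the four credential variables are placeholders for the keys and tokens you generated in Step 0.

  from tweepy import OAuthHandler, Stream
  from tweepy.streaming import StreamListener

  # Placeholders: substitute your own credentials from Step 0.
  consumer_key = "..."
  consumer_secret = "..."
  access_token = "..."
  access_token_secret = "..."

  if __name__ == "__main__":
      auth = OAuthHandler(consumer_key, consumer_secret)
      auth.set_access_token(access_token, access_token_secret)
      stream = Stream(auth, StdOutListener())
      stream.sample(languages=["en"])   # sample public events in English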

Step 2a: Dumping Tweets in Real Time

Alter the StdOutListener class so that it listens for and stores N tweets. Once it reaches its limit, it should dump those tweets to stdout as a single valid JSON list (one possible shape is sketched below, after the example output). Here are some implementation notes:

Test your program from the command line:

  $ python twitter.py 5 > tweets.json
  

Examining the file should reveal a JSON list of dictionaries...

  [{"source": ...
  
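One possible shape for the altered listener, sketched under the assumption that tweets are parsed with the json module as they arrive; the names limit and self.tweets are illustrative, not required:

  import json
  from tweepy.streaming import StreamListener

  class StdOutListener(StreamListener):
      """Collect a fixed number of tweets, then dump them as one JSON list."""

      def __init__(self, limit):
          super().__init__()
          self.limit = limit        # how many tweets to collect
          self.tweets = []          # parsed JSON objects collected so far

      def on_data(self, data):
          self.tweets.append(json.loads(data))    # parse the raw JSON string
          if len(self.tweets) < self.limit:
              return True                         # keep listening
          print(json.dumps(self.tweets))          # one valid JSON list
          return False                            # stop the stream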

Step 2b: Dumping Historical Tweets by User

Instead of sampling from the public Twitter feed in real time, we can grab historical tweets from arbitrary Twitter users. Here we will define a function timeline that grabs the last N tweets from a Twitter user and dumps them to standard out.

Here are some implementation details:
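As a rough sketch of one possible approach (Tweepy's user_timeline call is real, but this function's signature and its use of the _json attribute are choices, not requirements):

  import json

  def timeline(api, screen_name, n):
      """Dump the last n tweets from screen_name to stdout as a JSON list.
      api is an authenticated tweepy.API instance."""
      # user_timeline returns Status objects; ._json holds the raw dictionary.
      # Note: Twitter caps a single request at 200 statuses.
      statuses = api.user_timeline(screen_name=screen_name, count=n)
      print(json.dumps([status._json for status in statuses]))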

Step 3: Word Counts

The file wordcounts.py contains the class WordCounts, which represents a dictionary of words and their associated counts. This class also supports filtering: you can pass in a list of functions that operate on strings. If a function returns True, then the string should not be included in the counts. There is an example below.

  class WordCounts:

      """A class representing word counts.  Words are only included if they
      are not filtered."""

      def __init__(self, filters=[]):
          """filters is a list of functions, each of which operates on a single
          string and returns True if and only if the word should be
          filtered (i.e., not stored)"""

      def addwords(self, words):
          """Update the word count for each word in words.  Only include the
          word if it passes all the filters (i.e., all the filters return False)"""

      def cloudstr(self):
          """return a single string where each word in the dictionary
             is repeated N times (with spaces) where N is its associated count.
             In other words, if {'Brent':2,'Courtney':3} then the string would be
             'Brent Brent Courtney Courtney Courtney' """

      def __str__(self):
          return ''

Here is an example that uses some filtering functions.

  >>> from wordcounts import WordCounts
  >>> def filter_dog(word): return word == "dog"
  ...
  >>> def filter_starts_with_hi(word): return word.startswith("hi")
  ...
  >>> counts = WordCounts(filters=[filter_dog, filter_starts_with_hi])
  >>> counts.addwords("Brent Brent dog hiBrent hiCourtney Courtney Courtney Courtney".split())
  >>> counts.addwords(["dog", "hidog", "Brent"])
  >>> print(counts)
  {'Brent': 3, 'Courtney': 3}
  >>> counts.cloudstr()
  'Brent Brent Brent Courtney Courtney Courtney'
  

Here are a few implementation notes:
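One way the pieces could fit together, as a non-authoritative sketch (the plain dict and the any() test over the filters are implementation choices, not requirements):

  class WordCounts:
      """A dictionary of words and counts; filtered words are excluded."""

      def __init__(self, filters=[]):
          self.filters = filters
          self.counts = {}          # word -> count

      def addwords(self, words):
          for word in words:
              # keep the word only if every filter returns False
              if not any(f(word) for f in self.filters):
                  self.counts[word] = self.counts.get(word, 0) + 1

      def cloudstr(self):
          # each word repeated once per count, separated by spaces
          return ' '.join(' '.join([word] * n)
                          for word, n in self.counts.items())

      def __str__(self):
          return str(self.counts)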

Step 4: Word Clouds

In this section you will write code to create word clouds from JSON data representing status updates. Given a filename containing JSON data and a filename for an output image, your program should build a word cloud from the tweet text and save it to the image file.

The command line usage should be:

  $ python cloud.py tweets.json tweets.png
  

However, there is a problem with matplotlib and virtual environments that prevents Python from running correctly. We still need to create and activate a virtualenv, but instead of using the Python version installed in the virtualenv, we want to use the python3 version installed on the system.

On OIT machines, use:

  $ PYTHONHOME=$VIRTUAL_ENV /opt/local/bin/python3 cloud.py tweets.json tweets.png
  
On TCL 216/217 machines, use:

  $ PYTHONHOME=$VIRTUAL_ENV /usr/local/bin/python3 cloud.py tweets.json tweets.png
  

The PYTHONHOME environment variable tells Python where to look for modules and libraries. More information on matplotlib and virtualenv.

Step 4a: Word Clouds

Here are some details about creating word clouds:
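A minimal sketch of what cloud.py might look like, assuming the wordcloud package (its WordCloud.generate and to_file methods are real; the overall structure, the empty filter list, and the use of the tweet "text" field here are illustrative):

  import json
  import sys
  from wordcloud import WordCloud
  from wordcounts import WordCounts

  def main(json_filename, image_filename):
      with open(json_filename) as f:
          tweets = json.load(f)             # a JSON list of status dictionaries
      # Each status stores its message under the "text" key.
      words = ' '.join(tweet["text"] for tweet in tweets).split()
      counts = WordCounts(filters=[])       # add your Step 4b filters here
      counts.addwords(words)
      cloud = WordCloud().generate(counts.cloudstr())
      cloud.to_file(image_filename)         # write the image (e.g., a PNG)

  if __name__ == "__main__":
      main(sys.argv[1], sys.argv[2])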

Step 4b: Filters

You should define at least 4 different kinds of filters:

These filters will be passed to your WordCounts instance.
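Purely as illustration (the specific kinds you choose are up to you), filters matching the WordCounts convention, where returning True excludes a word, might look like:

  # Hypothetical examples only; pick kinds that improve your clouds.
  STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

  def filter_stopword(word):
      return word.lower() in STOPWORDS   # very common English words

  def filter_url(word):
      return word.startswith("http")     # links add little meaning

  def filter_mention(word):
      return word.startswith("@")        # Twitter @-mentions

  def filter_short(word):
      return len(word) < 3               # drop tiny tokens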

Step 5: Make some Clouds

You should create word clouds for some Twitter username and for a sample of 1000 tweets from the real-time Twitter feed. Add and commit these PNG files to your repo.
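
Assuming the command-line interfaces from the earlier steps (the filenames here are placeholders, and the python3 path is the TCL machine path from Step 4), the workflow for the sampled cloud might look like:

  $ python twitter.py 1000 > sample.json
  $ PYTHONHOME=$VIRTUAL_ENV /usr/local/bin/python3 cloud.py sample.json sample.png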

Step E: Extra Credit

The WordCloud module lets us customize our word clouds in many ways. You can find documentation and examples in the WordCloud GitHub repository. Experiment with these options to make word clouds that are unique! You may find inspiration in the Hillary and Donald word clouds below, but be creative. Some ideas include: adding an image mask; changing fonts or colors; transforming text (e.g., pig Latin or "real" language translation); non-Twitter data sources (remember, we can dump dictionaries to JSON files); or new/better filters.

Please indicate, using comments in your code, what enhancements you have made and how they work so that we can properly credit your efforts. Your final submission should produce word clouds as normal with the default program arguments, but you should provide easy-to-follow documentation in the README.md that comes with your repository explaining how to run your enhancements.

Step 6: Submission