Lab 8: Twitter Word Clouds
This lab uses the Twitter API as a source for sampling tweets in real time, as well as for grabbing tweets from a particular user. The focus is on constructing working software from existing libraries and documentation while also gaining exposure to JSON and text processing.
Here is a word cloud created from the text of the last 100 tweets by Williams College.

Step 0: Lab Preparation
- You should begin by reading a blog post about the Twitter streaming API.
- To use the Twitter API, you must have a Twitter account and be credentialed. The credentials take the form of a Consumer (API) key, a Consumer (API) secret, an Access Token, and an Access Token Secret. To gain credentials, perform the following steps:
- Create a Twitter account if you do not already have one.
- Head to https://apps.twitter.com/ and log in.
- Click Create New App.
- Fill out the Application Details form with reasonable data (you can use http://www.cs.williams.edu as your website), agree to the Developer Agreement terms, and click Create your Twitter application.
- Click on the Keys and Access Tokens tab.
- Click on Create my access token in the Token Actions area.
- In the lab you will need the consumer key, the consumer secret, the access token, and the access token secret.
- Also read the Getting Started guide on the Tweepy API.
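To give a sense of how these credentials are eventually used, here is a minimal sketch of authenticating with Tweepy; the placeholder strings are assumptions, and you should substitute your own keys:
import tweepy

# substitute the credentials from apps.twitter.com
CONSUMER_KEY = 'your-consumer-key'
CONSUMER_SECRET = 'your-consumer-secret'
ACCESS_TOKEN = 'your-access-token'
ACCESS_TOKEN_SECRET = 'your-access-token-secret'

# OAuthHandler manages the OAuth handshake with Twitter
oauth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
oauth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(oauth)   # an authenticated API object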
Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
$ git clone git@github.com:williams-cs/<git-username>-cs135-lab8.git
Remember, you can always get the repo address by using the ssh copy-to-clipboard link on GitHub.
- Once inside your <git-username>-cs135-lab8 directory, create a virtual environment using
$ virtualenv --system-site-packages -p python3 venv
The --system-site-packages flag will let us use the matplotlib package.
- Activate your environment by typing:
$ . venv/bin/activate
- Use pip to install the Pillow imaging library:
$ pip install pillow
- Use pip to install Cython, which allows C-extensions to Python:
$ pip install cython
- Use pip to install tweepy, which is a Python wrapper around the Twitter API:
$ pip install tweepy
- Use pip to install wordcloud from GitHub:
$ pip install git+git://github.com/amueller/word_cloud.git
- Remember that you must always activate your virtual environment when opening a new terminal.
- Type
$ git branch
and notice that you are currently editing the master branch.
- Create a new branch with
$ git branch twitter
- Checkout this branch by typing
$ git checkout twitter
- Any changes you make to the repository are now isolated on this branch.
Step 2: Twitter Data
Make sure that you've reviewed this blog post on the Twitter API. This tutorial shows how to use Tweepy to sample data from Twitter's public stream and filter it based on keywords.
The heart of the code lies in the StdOutListener class, which subclasses StreamListener. The streaming portion of Tweepy is event-based: when new data is available, it calls the on_data method with the new data. This data is always valid JSON, and it always corresponds to some sort of Twitter event, like a status update (i.e., a tweet), blocking users, following users, deleting accounts, etc. If the method returns False, then the listener stops. If it returns True or None, then it continues listening.
class StdOutListener(StreamListener):
    def on_data(self, data):
        print(data)
        return True
The file twitter.py contains a sample program that you can run on the command line, which samples Twitter events from users who speak English:
$ python3 twitter.py
Notice that JSON objects stream past on the screen. You can exit this program by using CTRL-C.
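For reference, here is a minimal sketch of how a listener like this might be wired to the public stream, assuming oauth is the OAuthHandler built from your credentials in Step 0:
from tweepy import Stream

# attach the listener to Twitter's public sample stream,
# restricted to tweets marked as English
stream = Stream(oauth, StdOutListener())
stream.sample(languages=['en'])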
Step 2a: Dumping Tweets in Real Time
Alter the StdOutListener class so that it listens for and stores the JSON associated with the first N status updates. Here are some implementation details:
- The data argument to on_data is a valid JSON string that you can parse with json.loads.
- If the data contains the key 'in_reply_to_status_id', then it is a status update. You are only interested in collecting the JSON associated with status updates. But you want to store all of the associated JSON, not just the text of the update.
- Add an __init__ method to StdOutListener that provides enough state to count and store JSON objects until you have N objects in total. Remember that on_data will be called automatically by Tweepy every time there is a new Twitter event.
- Once you've collected N JSON objects in a list L, you can dump them to standard out using json.dump(L, sys.stdout, ensure_ascii=False), where the final keyword parameter ensures that the UTF-8 encoding is maintained. Remember to return False at this point too.
- Add an argument num_tweets to the sample function so that you can pass it along to your StdOutListener constructor. Edit your program so that this number comes from the command line. (A sketch of one possible listener follows this list.)
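To make the moving parts concrete, here is a minimal sketch of one possible listener. The structure, and where the dump-and-stop logic lives, is an assumption, not the required design:
import json
import sys
from tweepy.streaming import StreamListener

class StdOutListener(StreamListener):
    def __init__(self, num_tweets):
        super().__init__()
        self.num_tweets = num_tweets   # how many status updates to collect
        self.statuses = []             # JSON dictionaries collected so far

    def on_data(self, data):
        d = json.loads(data)                  # parse the raw JSON string
        if 'in_reply_to_status_id' in d:      # keep only status updates
            self.statuses.append(d)
        if len(self.statuses) >= self.num_tweets:
            # dump everything as one JSON list, preserving UTF-8
            json.dump(self.statuses, sys.stdout, ensure_ascii=False)
            return False                      # stop listening
        return True                           # keep listening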
Test your program from the command line:
$ python twitter.py 5 > tweets.json
Examining the file should reveal a JSON list with a dictionary inside...
[{"source": ...
Step 2b: Dumping Historical Tweets by User
Instead of sampling from the public Twitter feed in real time, we can grab historical tweets from arbitrary Twitter users. Here we will define a function timeline that grabs the last num_tweets tweets from a given user's timeline.
Here are some implementation details:
- Begin by reviewing the Getting Started guide on the Tweepy API. We will be using the method user_timeline instead of home_timeline.
- Write a function called timeline that has three arguments: oauth, screenname, and num_tweets. This function should create a new Twitter API object using oauth. It then should call the user_timeline method making use of the keyword arguments screen_name and count. (A sketch follows this list.)
- The user_timeline method returns a list of Status objects. The Status object provides a friendly wrapper around the JSON, but we only want the JSON. Luckily, you can get direct access to it through the _json attribute. In other words, if s is a Status object, then s._json will give you its underlying JSON.
- You should dump a list of JSON dictionaries, each representing a status update, to standard out.
- Update your if __name__ == '__main__' code so that it respects the following command line syntax:
$ python twitter.py sample 100
$ python twitter.py timeline iamdiddy 50
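Here is a minimal sketch of one possible timeline, assuming oauth is the OAuthHandler built from your credentials in Step 0:
import json
import sys
import tweepy

def timeline(oauth, screenname, num_tweets):
    """Dump the JSON of the last num_tweets tweets by screenname to standard out."""
    api = tweepy.API(oauth)   # build an API object from the credentials
    statuses = api.user_timeline(screen_name=screenname, count=num_tweets)
    # each Status object exposes its underlying JSON via the _json attribute
    json.dump([s._json for s in statuses], sys.stdout, ensure_ascii=False)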
Step 3: Word Counts
The file wordcounts.py contains the class WordCounts, which represents a dictionary of words and their associated counts. This class also features filtering: you can pass in a list of functions that operate on strings. If a function returns True, then the string should not be included in the count list.
class WordCounts:
    """A class representing word counts. Words are only included if they
    are not filtered."""

    def __init__(self, filters=[]):
        """filters is a list of functions, each of which operates on a
        single string and returns True if and only if the word should be
        filtered (i.e., not stored)"""

    def addwords(self, words):
        """Update the word count for each word in words. Only include the
        word if it passes all the filters (i.e., all the filters return
        False)"""

    def cloudstr(self):
        """Return a single string where each word in the dictionary is
        repeated N times (with spaces), where N is its associated count.
        In other words, given {"Brent": 2, "Courtney": 3}, the string
        would be "Brent Brent Courtney Courtney Courtney" """

    def __str__(self):
        return ""
>>> from wordcounts import WordCounts
>>> def filter_dog(word): return word == "dog"
...
>>> def filter_starts_with_hi(word): return word.startswith("hi")
...
>>> counts = WordCounts(filters=[filter_dog, filter_starts_with_hi])
>>> counts.addwords("Brent Brent dog hiBrent hiCourtney Courtney Courtney Courtney".split())
>>> counts.addwords(["dog", "hidog", "Brent"])
>>> print(counts)
{'Brent': 3, 'Courtney': 3}
>>> counts.cloudstr()
'Brent Brent Brent Courtney Courtney Courtney'
Here are a few implementation notes:
- Remember that functions are valid Python objects, so if you have a list of function objects funcs, then [f("Brent") for f in funcs] hands back a list of the result of each function applied to the string "Brent".
- Try to implement cloudstr in a single line. This string will be useful when making word clouds. (See the sketch after this list.)
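As a starting point, here is a minimal sketch of the two methods, assuming __init__ stores the filter list in self.filters and an empty dictionary in self.counts; that representation is an assumption, not a requirement:
# inside the WordCounts class
    def addwords(self, words):
        for word in words:
            # keep a word only if every filter returns False on it
            if not any(f(word) for f in self.filters):
                self.counts[word] = self.counts.get(word, 0) + 1

    def cloudstr(self):
        # one line: repeat each word count times, separated by spaces
        return ' '.join(' '.join([word] * count) for word, count in self.counts.items())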
Step 4: Word Clouds
In this section you will write code to create word clouds from JSON data representing status updates. Given a filename containing JSON data and a filename for an output image, your program should
- create an instance of WordCounts using the filters described below;
- load the JSON from disk;
- extract the text of each tweet from its JSON dictionary (use the key "text");
- add words from the text of each tweet to your WordCounts instance;
- create a word cloud using the string returned by cloudstr; and
- save the word cloud to the appropriate file.
The command line usage should be
$ python cloud.py tweets.json tweets.png
Step 4a: Word Clouds
Here are some details about creating word clouds:
- You can import the WordCloud class as well as the STOPWORDS list, which is useful for telling the cloud to ignore words like "the" and "a".
- Use the following syntax to create an instance of WordCloud:
cloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf',
                  stopwords=STOPWORDS,
                  background_color='white',
                  width=1800,
                  height=1400)
You can change fonts by providing a different path to a font file.
- Use cloud.generate(s) to internally generate a word cloud from a string of whitespace-delimited words. You'll want to make use of the cloudstr method of your WordCounts class.
- Use cloud.to_file(filename) to save the word cloud to disk. (A sketch of the whole program follows this list.)
Step 4b: Filters
You should define at least 4 different kinds of filters:
- a filter for URLs (i.e., anything with "http://" in it);
- a filter for the word "RT";
- a filter for anything starting with "@"; and
- a filter for the encoded ampersand "&amp;".
These filters will be passed to your WordCounts instance. One possible set is sketched below.
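Here is a minimal sketch of one possible set of filters; the names are illustrative and match the cloud.py sketch above:
def filter_url(word):
    return 'http://' in word

def filter_rt(word):
    return word == 'RT'

def filter_mention(word):
    return word.startswith('@')

def filter_amp(word):
    return '&amp;' in word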
Step 5: Make some Clouds
You should create word clouds for some Twitter username and for a sample of 1000 tweets from the real-time Twitter feed. Add and commit these PNG files to your repo.
Step 6: Submission
- Now commit those additions to the repository:
$ git commit -a -m "some log message"
- Push your changes back to the GitHub repo:
$ git push
You will probably be asked to type
$ git push --set-upstream origin twitter
which you should do. This pushes your twitter branch back up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.