Lab 8: Twitter Word Clouds
This lab uses the Twitter API as a source for sampling tweets in real time, as well as for grabbing tweets from a particular user. The focus is on constructing working software from existing libraries and documentation while also gaining exposure to JSON and text processing.
Here is a word cloud created from the text of the last 100 tweets by Williams College.

Step 0: Lab Preparation
- You should begin by reading a blog post about the Twitter streaming API.
- To use the Twitter API, you must have a Twitter account and be credentialed. The credentials take the form of a Consumer (API) key, a Consumer (API) secret, an Access Token, and an Access Token Secret. To gain credentials, perform the following steps:
- Create a Twitter account if you do not already have one.
- Head to https://apps.twitter.com/ and log in.
- Click Create New App.
- Fill out the Application Details form with reasonable data (you can use http://www.cs.williams.edu as your website), agree to the Developer Agreement terms, and click Create your Twitter application.
- Click on the Keys and Access Tokens tab.
- Click on Create my access token in the Token Actions area.
- In the lab you will need the consumer key, the consumer secret, the access token, and the access token secret.
- Also read the Getting Started guide on the Tweepy API.
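To give a sense of how these credentials are eventually used, here is a minimal sketch of authenticating with Tweepy; the placeholder strings are assumptions, and you should substitute your own keys:
import tweepy

# substitute the credentials from apps.twitter.com
CONSUMER_KEY = 'your-consumer-key'
CONSUMER_SECRET = 'your-consumer-secret'
ACCESS_TOKEN = 'your-access-token'
ACCESS_TOKEN_SECRET = 'your-access-token-secret'

# OAuthHandler manages the OAuth handshake with Twitter
oauth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
oauth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(oauth)   # an authenticated API object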
Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
$ git clone git@github.com:williams-cs/<git-username>-cs135-lab8.git
Remember, you can always get the repo address by using the ssh copy-to-clipboard link on GitHub.
- Once inside your <git-username>-cs135-lab8 directory, create a virtual environment using
$ virtualenv --system-site-packages -p python3 venv
The --system-site-packages flag will let us use the matplotlib package.
- Activate your environment by typing:
$ . venv/bin/activate
- Use pip to install the Pillow imaging library:
$ pip install pillow
- Use pip to install Cython, which allows C-extensions to Python:
$ pip install cython
- Use pip to install tweepy, which is a Python wrapper around the Twitter API:
$ pip install tweepy
- Use pip to install wordcloud from GitHub:
$ pip install git+git://github.com/amueller/word_cloud.git
- Remember that you must always activate your virtual environment when opening a new terminal.
- Type
$ git branch
and notice that you are currently editing the master branch.
- Create a new branch with
$ git branch twitter
- Checkout this branch by typing
$ git checkout twitter
- Any changes you make to the repository are now isolated on this branch.
Step 2: Twitter Data
Make sure that you've reviewed this blog post on the Twitter API. This tutorial shows how to use Tweepy to sample data from Twitter's public stream and filter it based on keywords.
The heart of the code lies in the StdOutListener class, which subclasses StreamListener. The streaming portion of Tweepy is event-based: when new data is available, it calls the on_data method with the new data. This data is always valid JSON, and it always corresponds to some sort of Twitter event, like a status update (i.e., a tweet), blocking users, following users, deleting accounts, etc. If the method returns False, then the listener stops. If it returns True or None, then it continues listening.
class StdOutListener(StreamListener):
    def on_data(self, data):
        print(data)
        return True
The file twitter.py contains a sample program that you can run on the command line, which samples Twitter events from users who speak English:
$ python3 twitter.py
Notice that JSON objects stream past on the screen. You can exit this program by using CTRL-C.
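For reference, here is a minimal sketch of how a listener like this might be wired to the public stream, assuming oauth is the OAuthHandler built from your credentials in Step 0:
from tweepy import Stream

# attach the listener to Twitter's public sample stream,
# restricted to tweets marked as English
stream = Stream(oauth, StdOutListener())
stream.sample(languages=['en'])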
Step 2a: Dumping Tweets in Real Time
Alter the StdOutListener class so that it listens for and stores the JSON associated with the first N status updates. Here are some implementation details:
- The data argument to on_data is a valid JSON string that you can parse with json.loads.
- If the data contains the key 'in_reply_to_status_id', then it is a status update. You are only interested in collecting the JSON associated with status updates. But you want to store all of the associated JSON, not just the text of the update.
- Add an __init__ method to StdOutListener that provides enough state to count and store JSON objects until you have N objects in total. Remember that on_data will be called automatically by Tweepy every time there is a new Twitter event.
- Once you've collected N JSON objects in a list L, you can dump them to standard out using json.dump(L, sys.stdout, ensure_ascii=False), where the final keyword parameter ensures that the UTF-8 encoding is maintained. Remember to return False at this point too.
- Add an argument num_tweets to the sample function so that you can pass it along to your StdOutListener constructor. Edit your program so that this number comes from the command line. (A sketch of one possible listener follows this list.)
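To make the moving parts concrete, here is a minimal sketch of one possible listener. The structure, and where the dump-and-stop logic lives, is an assumption, not the required design:
import json
import sys
from tweepy.streaming import StreamListener

class StdOutListener(StreamListener):
    def __init__(self, num_tweets):
        super().__init__()
        self.num_tweets = num_tweets   # how many status updates to collect
        self.statuses = []             # JSON dictionaries collected so far

    def on_data(self, data):
        d = json.loads(data)                  # parse the raw JSON string
        if 'in_reply_to_status_id' in d:      # keep only status updates
            self.statuses.append(d)
        if len(self.statuses) >= self.num_tweets:
            # dump everything as one JSON list, preserving UTF-8
            json.dump(self.statuses, sys.stdout, ensure_ascii=False)
            return False                      # stop listening
        return True                           # keep listening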
Test your program from the command line:
$ python twitter.py 5 > tweets.json
Examining the file should reveal a JSON list with a dictionary inside...
[{"source": ...
Step 2b: Dumping Historical Tweets by User
Instead of sampling from the public Twitter feed in real time, we can grab historical tweets from arbitrary Twitter users. Here we will define a function timeline that grabs the last num_tweets tweets from a given user's timeline.
Here are some implementation details:
- Begin by reviewing the Getting Started guide on the Tweepy API. We will be using the method user_timeline instead of home_timeline.
- Write a function called timeline that has three arguments: oauth, screenname, and num_tweets. This function should create a new Twitter API object using oauth. It then should call the user_timeline method making use of the keyword arguments screen_name and count. (A sketch follows this list.)
- The user_timeline method returns a list of Status objects. The Status object provides a friendly wrapper around the JSON, but we only want the JSON. Luckily, you can get direct access to it through the _json attribute. In other words, if s is a Status object, then s._json will give you its underlying JSON.
- You should dump a list of JSON dictionaries, each representing a status update, to standard out.
- Update your if __name__ == '__main__' code so that it respects the following command line syntax:
$ python twitter.py sample 100
$ python twitter.py timeline iamdiddy 50
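Here is a minimal sketch of one possible timeline, assuming oauth is the OAuthHandler built from your credentials in Step 0:
import json
import sys
import tweepy

def timeline(oauth, screenname, num_tweets):
    """Dump the JSON of the last num_tweets tweets by screenname to standard out."""
    api = tweepy.API(oauth)   # build an API object from the credentials
    statuses = api.user_timeline(screen_name=screenname, count=num_tweets)
    # each Status object exposes its underlying JSON via the _json attribute
    json.dump([s._json for s in statuses], sys.stdout, ensure_ascii=False)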
Step 3: Word Counts
The file wordcounts.py contains the class WordCounts, which represents a dictionary of words and their associated counts. This class also features filtering: you can pass in a list of functions that operate on strings. If a function returns True, then the string should not be included in the count list.
class WordCounts:
    """A class representing word counts. Words are only included if they
    are not filtered."""

    def __init__(self, filters=[]):
        """filters is a list of functions, each of which operates on a
        single string and returns True if and only if the word should be
        filtered (i.e., not stored)"""

    def addwords(self, words):
        """Update the word count for each word in words. Only include the
        word if it passes all the filters (i.e., all the filters return
        False)"""

    def cloudstr(self):
        """Return a single string where each word in the dictionary is
        repeated N times (with spaces), where N is its associated count.
        In other words, given {"Brent": 2, "Courtney": 3}, the string
        would be "Brent Brent Courtney Courtney Courtney" """

    def __str__(self):
        return ""
>>> from wordcounts import WordCounts
>>> def filter_dog(word): return word == "dog"
...
>>> def filter_starts_with_hi(word): return word.startswith("hi")
...
>>> counts = WordCounts(filters=[filter_dog, filter_starts_with_hi])
>>> counts.addwords("Brent Brent dog hiBrent hiCourtney Courtney Courtney Courtney".split())
>>> counts.addwords(["dog", "hidog", "Brent"])
>>> print(counts)
{'Brent': 3, 'Courtney': 3}
>>> counts.cloudstr()
'Brent Brent Brent Courtney Courtney Courtney'
Here are a few implementation notes:
- Remember that functions are valid Python objects, so if you have a list of function objects funcs, then [f("Brent") for f in funcs] hands back a list of the result of each function applied to the string "Brent".
- Try to implement cloudstr in a single line. This string will be useful when making word clouds. (See the sketch after this list.)
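As a starting point, here is a minimal sketch of the two methods, assuming __init__ stores the filter list in self.filters and an empty dictionary in self.counts; that representation is an assumption, not a requirement:
# inside the WordCounts class
    def addwords(self, words):
        for word in words:
            # keep a word only if every filter returns False on it
            if not any(f(word) for f in self.filters):
                self.counts[word] = self.counts.get(word, 0) + 1

    def cloudstr(self):
        # one line: repeat each word count times, separated by spaces
        return ' '.join(' '.join([word] * count) for word, count in self.counts.items())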
Step 4: Word Clouds
In this section you will write code to create word clouds from JSON data representing status updates. Given a filename containing JSON data and a filename for an output image, your program should
- create an instance of WordCounts using the filters described below;
- load the JSON from disk;
- extract the text of each tweet from its JSON dictionary (use the key "text");
- add words from the text of each tweet to your WordCounts instance;
- create a word cloud using the string returned by cloudstr; and
- save the word cloud to the appropriate file.
The command line usage should be
$ python cloud.py tweets.json tweets.png
Step 4a: Word Clouds
Here are some details about creating word clouds:
- You can import the WordCloud class as well as the STOPWORDS list, which is useful for telling the cloud to ignore words like "the" and "a".
- Use the following syntax to create an instance of WordCloud:
cloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf',
                  stopwords=STOPWORDS,
                  background_color='white',
                  width=1800,
                  height=1400)
You can change fonts by providing a different path to a font file.
- Use cloud.generate(s) to internally generate a word cloud from a string of whitespace-delimited words. You'll want to make use of the cloudstr method of your WordCounts class.
- Use cloud.to_file(filename) to save the word cloud to disk. (A sketch of the whole program follows this list.)
Step 4b: Filters
You should define at least 4 different kinds of filters:
- a filter for URLs (i.e., anything with "http://" in it);
- a filter for the word "RT";
- a filter for anything starting with "@"; and
- a filter for the encoded ampersand "&amp;".
These filters will be passed to your WordCounts instance. One possible set is sketched below.
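Here is a minimal sketch of one possible set of filters; the names are illustrative and match the cloud.py sketch above:
def filter_url(word):
    return 'http://' in word

def filter_rt(word):
    return word == 'RT'

def filter_mention(word):
    return word.startswith('@')

def filter_amp(word):
    return '&amp;' in word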
Step 5: Make some Clouds
You should create word clouds for some Twitter username and for a sample of 1000 tweets from the real-time Twitter feed. Add and commit these PNG files to your repo.
Step 6: Submission
- Now commit those additions to the repository:
$ git commit -a -m "some log message"
- Push your changes back to the GitHub repo:
$ git push
You will probably be asked to type
$ git push --set-upstream origin twitter
which you should do. This pushes your twitter branch back up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.