Lab 7: Twitter Word Clouds
This lab uses the Twitter API as a source for sampling tweets in real time as well as grabbing tweets from a particular user. The focus is on constructing working software from existing libraries and documentation while also gaining exposure to JSON and text processing.
Here is a word cloud created from the text of the last 100 tweets by Williams College.

Step 0: Lab Preparation
- You should begin by reading a blog post about the Twitter streaming API.
- To use the Twitter API, you must have a Twitter account and be credentialed. The credentials take the form of a Consumer (API) key, a Consumer (API) secret, an Access Token, and an Access Token Secret. To gain credentials, perform the following steps:
  - Create a Twitter account if you do not already have one.
  - Head to https://apps.twitter.com/ and log in.
  - Click Create New App.
  - Fill out the Application Details form with reasonable data (you can use http://www.cs.williams.edu as your website), agree to the Developer Agreement terms, and click Create your Twitter application.
  - Click on the Keys and Access Tokens tab.
  - Click on Create my access token in the Token Actions area.
- In the lab you will need the consumer key, the consumer secret, the access token, and the access token secret.
- Also read the Getting Started guide on the Tweepy API.
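Once you have all four values, they plug into Tweepy following a standard pattern. The sketch below uses placeholder strings that stand in for your own credentials (Tweepy itself gets installed in Step 1):

import tweepy

# Placeholders -- substitute the four values from apps.twitter.com.
CONSUMER_KEY = '<your consumer (API) key>'
CONSUMER_SECRET = '<your consumer (API) secret>'
ACCESS_TOKEN = '<your access token>'
ACCESS_TOKEN_SECRET = '<your access token secret>'

# Authenticate and build an API object.
oauth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
oauth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(oauth)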
Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
$ git clone https://github.com/williams-cs/<git-username>-lab7.git
Remember, you can always get the repo address by using the https copy-to-clipboard link on GitHub.
- Once inside your <git-username>-lab7 directory, create a virtual environment using
$ virtualenv -p python3 venv
- Activate your environment by typing:
$ . venv/bin/activate
- Use pip to install the Pillow imaging library:
$ pip install pillow
- Use pip to install Cython, which allows C extensions to Python:
$ pip install cython
- Use pip to install tweepy, which is a Python wrapper around the Twitter API:
$ pip install tweepy
- Use pip to install wordcloud from GitHub:
$ pip install git+https://github.com/amueller/word_cloud.git
- Remember that you must always activate your virtual environment when opening a new terminal.
- Type
$ git branch
and notice that you are currently editing the master branch.
- Create a new branch with
$ git branch twitter
- Check out this branch by typing
$ git checkout twitter
- Any changes you make to the repository are now isolated on this branch.
Step 2: Twitter Data
Make sure that you've reviewed this blog post on the Twitter API. This tutorial shows how to use Tweepy to sample data from Twitter's public stream and filter it based on keywords. Just read it; you'll do something similar in this lab.
The heart of the code lies in the `StdOutListener` class, which subclasses `StreamListener`. The streaming portion of Tweepy is event-based: when new data is available, it calls the `on_data` method with the new data. This data is always valid JSON, and it always corresponds to some sort of Twitter event, like a status update (i.e., a tweet), blocking users, following users, deleting accounts, etc. If the method returns False, then the listener stops. If it returns True or None, then it continues listening.
class StdOutListener(StreamListener):
    def on_data(self, data):
        print(data)
        return True
The file `twitter.py` contains a sample program that you can run from the command line; it samples Twitter events from users who tweet in English:
$ python twitter.py
Notice that JSON objects stream by on the screen. You can exit this program by using CTRL-C.
Step 2a: Dumping Tweets in Real Time
Alter the `StdOutListener` class so that it listens for and stores the first N status updates it sees. Here are some implementation details (a sketch follows the list):

- The `data` argument to `on_data` is a valid JSON string that you can parse with `json.loads`.
- If the data contains the key `'in_reply_to_status_id'`, then it is a status update. You are only interested in collecting the JSON associated with status updates, but you want to store all of the associated JSON, not just the text of the update.
- Add an `__init__` method to `StdOutListener` that provides enough state to count and store JSON objects until you have N objects in total. Remember that `on_data` will be called automatically by Tweepy every time there is a new Twitter event.
- Once you've collected N JSON objects in a list `L`, you can dump them to standard out using `json.dump(L, sys.stdout, ensure_ascii=False)`, where the final keyword parameter just ensures that the UTF-8 encoding is maintained. Remember to return False at this point too.
- Add an argument `num_tweets` to the `sample` function so that you can pass it along to your `StdOutListener` constructor. Edit your program so that this number comes from the command line.
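To make those details concrete, here is a minimal sketch of one way the modified listener might look; your design may differ, and the import path below assumes Tweepy 3.x (the version used in the blog post):

import json
import sys

from tweepy.streaming import StreamListener

class StdOutListener(StreamListener):
    """Collect the JSON for num_tweets status updates, then dump them."""

    def __init__(self, num_tweets):
        super().__init__()
        self.num_tweets = num_tweets  # number of status updates to collect
        self.statuses = []            # JSON objects collected so far

    def on_data(self, data):
        event = json.loads(data)  # data is always a valid JSON string
        # Only status updates carry the 'in_reply_to_status_id' key.
        if 'in_reply_to_status_id' in event:
            self.statuses.append(event)
        if len(self.statuses) >= self.num_tweets:
            # Dump the collected list to standard out and stop listening.
            json.dump(self.statuses, sys.stdout, ensure_ascii=False)
            return False
        return True  # keep listening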
Test your program from the command line:
$ python twitter.py 5 > tweets.json
Examining the file should reveal a JSON list with a dictionary inside...
[{"source": ...
Step 2b: Dumping Historical Tweets by User
Instead of sampling from the public Twitter feed in real time, we can grab historical tweets from arbitrary Twitter users. Here we will define a function `timeline` that grabs the last `num_tweets` tweets posted by a given user.
Here are some implementation details (a sketch follows the list):
- Begin by reviewing the Getting Started Guide on the Tweepy API. We will be using the method `user_timeline` instead of `home_timeline`.
- Write a function called `timeline` that has three arguments: `oauth`, `screenname`, and `num_tweets`. This function should create a new Twitter API object using `oauth`. It should then call the `user_timeline` method, making use of the keyword arguments `screen_name` and `count`.
- The `user_timeline` method returns a list of `Status` objects. The `Status` object provides a friendly wrapper around the JSON, but we only want the JSON. Luckily, you can get direct access to it through the `_json` attribute. In other words, if `s` is a `Status` object, then `s._json` will give you its underlying JSON.
- You should dump a list of JSON dictionaries, each representing a status update, to standard out.
- Update your `if __name__ == '__main__'` code so that it respects the following command line syntax:
$ python twitter.py sample 100
$ python twitter.py timeline iamdiddy 50
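Here is a sketch of one possible `timeline`; it assumes `oauth` is an already-configured `tweepy.OAuthHandler` (as in Step 0) and dumps the list of JSON dictionaries to standard out:

import json
import sys

import tweepy

def timeline(oauth, screenname, num_tweets):
    """Dump the JSON for the last num_tweets tweets by screenname."""
    api = tweepy.API(oauth)
    statuses = api.user_timeline(screen_name=screenname, count=num_tweets)
    # Each Status object exposes its underlying JSON via _json.
    json.dump([s._json for s in statuses], sys.stdout, ensure_ascii=False)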
Step 3: Word Counts
The file `wordcounts.py` contains the class `WordCounts`, which represents a dictionary of words and their associated counts. This class also features filtering: you can pass in a list of functions that operate on strings. If a function returns True, then the string should not be included in the count list. There is an example below.
class WordCounts:
    """A class representing word counts. Words are only included if they
    are not filtered."""

    def __init__(self, filters=[]):
        """filters is a list of functions, each of which operates on a
        single string and returns True if and only if the word should be
        filtered (i.e., not stored)"""

    def addwords(self, words):
        """Update the word count for each word in words. Only include the
        word if it passes all the filters (i.e., all the filters return
        False)"""

    def cloudstr(self):
        """Return a single string where each word in the dictionary is
        repeated N times (with spaces), where N is its associated count.
        In other words, if the counts are {'Brent': 2, 'Courtney': 3},
        then the string would be 'Brent Brent Courtney Courtney Courtney'"""

    def __str__(self):
        return ''
Here is an example that uses some filtering functions.
>>> from wordcounts import WordCounts
>>> def filter_dog(word): return word == "dog"
...
>>> def filter_starts_with_hi(word): return word.startswith("hi")
...
>>> counts = WordCounts(filters=[filter_dog, filter_starts_with_hi])
>>> counts.addwords("Brent Brent dog hiBrent hiCourtney Courtney Courtney Courtney".split())
>>> counts.addwords(["dog", "hidog", "Brent"])
>>> print(counts)
{'Brent': 3, 'Courtney': 3}
>>> counts.cloudstr()
'Brent Brent Brent Courtney Courtney Courtney'
Here are a few implementation notes (see the sketch after this list):

- Remember that functions are valid Python objects, so if you have a list of function objects `funcs`, then `[f("brent") for f in funcs]` hands back a list of the results of applying each function to the string "brent".
- Try to implement `cloudstr` in a single line. This string will be useful when making word clouds.
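If you get stuck, here is one minimal way the pieces might fit together; this is a sketch under the assumption that a plain dict holds the counts, not the only correct design:

class WordCounts:
    """A dictionary of word counts; filtered words are never stored."""

    def __init__(self, filters=[]):
        self.filters = filters
        self.counts = {}  # maps word -> count

    def addwords(self, words):
        for word in words:
            # Keep the word only if every filter returns False.
            if not any(f(word) for f in self.filters):
                self.counts[word] = self.counts.get(word, 0) + 1

    def cloudstr(self):
        # One line: repeat each word by its count, joined with spaces.
        return ' '.join(' '.join([w] * n) for w, n in self.counts.items())

    def __str__(self):
        return str(self.counts)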
Step 4: Word Clouds
In this section you will write code to create word clouds from JSON data representing status updates. Given a filename containing JSON data and a filename for an output image, your program should

- create an instance of `WordCounts` using the filters described below;
- load the JSON from disk;
- extract the text of each tweet from its JSON dictionary (use the key "text");
- add the words from the text of each tweet to your `WordCounts` instance;
- create a word cloud using the string returned by `cloudstr`; and
- save the word cloud to the appropriate file.
The command line usage should be:
$ python cloud.py tweets.json tweets.png
However, there is a problem with matplotlib and virtual environments that prevents Python from running correctly. We still need to create and activate the virtual environment, but instead of using the Python version we installed in our virtualenv, we want to use the python3 version installed on the system.
On OIT machines, use:
$ PYTHONHOME=$VIRTUAL_ENV /opt/local/bin/python3 cloud.py tweets.json tweets.png
On TCL 216/217 machines, use:
$ PYTHONHOME=$VIRTUAL_ENV /usr/local/bin/python3 cloud.py tweets.json tweets.png
The PYTHONHOME environment variable tells Python where to look for modules and libraries. More information on matplotlib and virtualenv is available online.
Step 4a: Creating Word Clouds
Here are some details about creating word clouds (a short usage example follows the list):

- You can import the `WordCloud` class as well as the `STOPWORDS` list, which is useful for telling the cloud to ignore words like "the" and "a".
- Use the following syntax to create an instance of `WordCloud`:
cloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf', collocations=False,
                  stopwords=STOPWORDS, background_color='white',
                  width=1800, height=1400)
You can change fonts by providing a different path to a font file.
- Use `cloud.generate(s)` to internally generate a word cloud from a string of whitespace-delimited words. You'll want to make use of the `cloudstr` method of your `WordCounts` class.
- Use `cloud.to_file(filename)` to save the word cloud to disk.
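Putting those calls together, and assuming `counts` is an already-populated `WordCounts` instance, the cloud-making portion of `cloud.py` might look like this sketch (the font path above is macOS-specific, so `font_path` is omitted here; the helper name `make_cloud` is just for illustration):

from wordcloud import WordCloud, STOPWORDS

def make_cloud(counts, outfile):
    """Build a word cloud from a WordCounts instance and save it."""
    cloud = WordCloud(collocations=False, stopwords=STOPWORDS,
                      background_color='white', width=1800, height=1400)
    cloud.generate(counts.cloudstr())  # the space-separated word string
    cloud.to_file(outfile)             # write the image to disk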
Step 4b: Filters
You should define at least 4 different kinds of filters:
- a filter for URLs (i.e., anything with "http://" or "https://" in it);
- a filter for the word "RT";
- a filter for anything starting with "@"; and
- a filter for the encoded ampersand character "&amp;".
These filters will be passed to your `WordCounts` instance. A sketch of possible filter definitions appears below.
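Under those requirements, the filters might look something like the following; the exact predicates (substring match versus equality, for instance) are up to you:

from wordcounts import WordCounts

def filter_url(word):
    return 'http://' in word or 'https://' in word

def filter_rt(word):
    return word == 'RT'

def filter_mention(word):
    return word.startswith('@')

def filter_amp(word):
    return '&amp;' in word

counts = WordCounts(filters=[filter_url, filter_rt, filter_mention, filter_amp])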
Step 5: Make Some Clouds
You should create word clouds for the timeline of some Twitter user and for a sample of 1000 tweets from the real-time Twitter feed. Add and commit these PNG files to your repo.
Step E: Extra Credit
The WordCloud module lets us customize our word clouds in many ways. You can find documentation and examples in the WordCloud GitHub repository. Experiment with these options to make word clouds that are unique! You may find inspiration in the Hillary and Donald word clouds below, but be creative. Some ideas include: adding an image mask; changing fonts or colors; transforming text (e.g., Pig Latin or "real" language translation); non-Twitter data sources (remember, we can dump dictionaries to JSON files); or new/better filters.


Please indicate, using comments in your code, what enhancements you have made and how they work so that we can properly credit your efforts. Your final submission should produce word clouds as normal with the default program arguments, but easy-to-follow documentation explaining how to run your enhancements should be provided in the README.md that comes with your repository.
Step 6: Submission
- Now commit those additions to the repository:
$ git commit -a -m "some log message"
- Push your changes back to the GitHub repo:
$ git push
You will probably be asked to type
$ git push --set-upstream origin twitter
which you should do. This pushes your twitter branch back up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.