Lab 7: Twitter Word Clouds
This lab uses the Twitter API as a source for sampling tweets in real time as well as grabbing tweets from a particular user. The focus is on constructing working software from existing libraries and documentation while also gaining exposure to JSON and text processing.
Here is a word cloud created from the text of the last 100 tweets by Williams College.
 
  Step 0: Lab Preparation
- You should begin by reading a blog post about the twitter streaming API.
- To use the Twitter API, you must have a twitter account and be credentialed. The credentials take the form of a Consumer (API) key, a Consumer (API) secret, an Access Token, and an Access Token Secret. To gain credentials, perform the following steps:
      - Create a twitter account if you do not already have one.
- Head to https://apps.twitter.com/ and log in.
- Click Create New App.
- Fill out the Application Details form with reasonable data (you can use http://www.cs.williams.edu as your website), agree to the Developer Agreement terms, and click Create your Twitter application.
- Click on the Keys and Access Tokens tab.
- Click on Create my access token in the Token Actions area.
- In the lab you will need the consumer key, the consumer secret, the access token, and the access token secret.
 
- Also read the Getting Started guide on the Tweepy API.
Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
  $ git clone https://github.com/williams-cs/<git-username>-lab7.git 
  Remember, you can always get the repo address by using the https copy-to-clipboard link on github.
- Once inside your <git-username>-lab7 directory, create a virtual environment using $ virtualenv -p python3 venv 
- Activate your environment by typing: $ . venv/bin/activate 
- Use pip to install the Pillow imaging library:
      $ pip install pillow 
- Use pip to install Cython, which allows C-extensions to Python:
      $ pip install cython 
- Use pip to install tweepy, which is a python wrapper around the Twitter API:
      $ pip install tweepy 
- Use pip to install wordcloud from GitHub:
      $ pip install git+git://github.com/amueller/word_cloud.git 
- Remember that you must always activate your virtual environment when opening a new terminal.
- Type
  $ git branch 
  and notice that you are currently editing the master branch.
- Create a new branch with
      $ git branch twitter 
- Checkout this branch by typing
      $ git checkout twitter 
- Any changes you make to the repository are now isolated on this branch.
Step 2: Twitter Data
Make sure that you've reviewed this blog post on the twitter API. This tutorial shows how to use Tweepy to sample data from Twitter's public stream and filter it based on keywords. Just read it---you'll do something similar in this lab.
  The heart of the code lies in the StdOutListener class,
  which subclasses StreamListener.  The streaming portion
  of Tweepy is event-based: when new data is available, it calls
  the on_data method with the new data.  This data is
  always valid JSON and it always corresponds to some sort of Twitter
  event, like a status update (i.e., a tweet), blocking users,
  following users, deleting accounts, etc.  If the method returns
  False, then the listener stops.  If it returns True or None then it
  continues listening.
  
  class StdOutListener(StreamListener):
    def on_data(self, data):
        print(data)
        return True
  
  The file twitter.py contains a sample program that you
  can run on the command line, which samples twitter events from users
  who speak English:

$ python twitter.py
Notice that JSON objects are streaming by the screen. You can exit this program by using CTRL-C.
Step 2a: Dumping Tweets Realtime
  Alter the StdOutListener class so that it listens for
  and stores the JSON of N status updates. Here are some implementation details:
- The data argument to on_data is a valid JSON string that you can parse with json.loads.
- If the data contains the key 'in_reply_to_status_id' then it is a status update. You are only interested in collecting the JSON associated with status updates. But you want to store all of the associated JSON, not just the text of the update.
- Add an __init__ method to StdOutListener that provides enough state to count and store JSON objects until you have N objects in total. Remember that on_data will be called automatically by Tweepy every time there is a new Twitter event.
- Once you've collected N JSON objects in a list L, you can dump them to standard out using json.dump(L, sys.stdout, ensure_ascii=False), where the final keyword parameter just ensures that the UTF-8 encoding is maintained. Remember to return False at this point too.
- Add an argument num_tweets to the sample function so that you can pass it along to your StdOutListener constructor. Edit your program so that this number comes from the command line.
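The listener described above might be sketched as follows. This is only an illustration of the counting logic: the tweepy base class is omitted so the sketch stands on its own, but in your solution the class should subclass StreamListener.

```python
import json
import sys

class StdOutListener:
    # In the lab this class subclasses tweepy's StreamListener; the parent
    # class is omitted here so the counting logic can be read on its own.
    def __init__(self, num_tweets):
        self.num_tweets = num_tweets  # N: how many status updates to collect
        self.statuses = []            # the collected JSON objects

    def on_data(self, data):
        status = json.loads(data)
        # Only status updates carry the 'in_reply_to_status_id' key.
        if 'in_reply_to_status_id' not in status:
            return True               # some other Twitter event; keep listening
        self.statuses.append(status)
        if len(self.statuses) == self.num_tweets:
            # Dump everything collected and tell Tweepy to stop the stream.
            json.dump(self.statuses, sys.stdout, ensure_ascii=False)
            return False
        return True
```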
Test your program from the command line:
$ python twitter.py 5 > tweets.json
Examining the file should reveal a JSON list with a dictionary inside...
  [{"source": ...
  
  
  Step 2b: Dumping Historical Tweets by User
  Instead of sampling from the public twitter feed in real time, we can
  grab historical tweets from arbitrary twitter users.  Here we will
  define a function timeline that grabs the
  last num_tweets tweets posted by a given user.
Here are some implementation details:
- Begin by reviewing the Getting Started Guide on the Tweepy API. We will be using the method user_timeline instead of home_timeline.
- Write a function called timeline that has three arguments: oauth, screenname, and num_tweets. This function should create a new Twitter API object using oauth. It then should call the user_timeline method making use of the keyword arguments screen_name and count.
- The user_timeline method returns a list of Status objects. The Status object provides a friendly wrapper around the JSON, but we only want the JSON. Luckily, you can get direct access to it through the _json attribute. In other words, if s is a Status object, then s._json will give you its underlying JSON.
- You should dump a list of JSON dictionaries, each representing a status update, to standard out.
- Update your if __name__ == '__main__' code so that it respects the following command line syntax:
  $ python twitter.py sample 100 
  $ python twitter.py timeline iamdiddy 50 
Step 3: Word Counts
  The file wordcounts.py contains the
  class WordCounts, which represents a dictionary of
  words and their associated counts.  This class also features
  filtering — you can pass in a list of functions that operate
  on strings. If a function returns True then the string should not
  be included in the count list.  There is an example below.
  
  class WordCounts:
      """A class representing word counts.  Words are only included if they
      are not filtered."""
      def __init__(self, filters=[]):
          """filters is a list of functions, each of which operates on a single
          string and returns True if and only if the word should be
          filtered (i.e., not stored)"""
      def addwords(self, words):
          """Update the word count for each word in words.  Only include the
          word if it passes all the filters (i.e., all the filters return False)"""
      def cloudstr(self):
          """return a single string where each word in the dictionary
             is repeated N times (with spaces) where N is its associated count.
             In other words if {'Brent':2,'Courtney':3} then the string would be
             'Brent Brent Courtney Courtney Courtney' """
      def __str__(self):
          return ''
 
  Here is an example that uses some filtering functions.
  >>> from wordcounts import WordCounts
  >>> def filter_dog(word): return word == "dog"
  ...
  >>> def filter_starts_with_hi(word): return word.startswith("hi")
  ...
  >>> counts = WordCounts(filters=[filter_dog, filter_starts_with_hi])
  >>> counts.addwords("Brent Brent dog hiBrent hiCourtney Courtney Courtney Courtney".split())
  >>> counts.addwords(["dog", "hidog", "Brent"])
  >>> print(counts)
  {'Brent': 3, 'Courtney': 3}
  >>> counts.cloudstr()
  'Brent Brent Brent Courtney Courtney Courtney'
  
  
  Here are a few implementation notes:
- Remember that functions are valid python objects, so if you have a list of function objects funcs then [f("Brent") for f in funcs] hands back a list of the results of each function applied to the string "Brent".
- Try to implement cloudstr in a single line. This string will be useful when making word clouds.
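The filtering-and-counting idea can be illustrated on plain dictionaries rather than the WordCounts class itself; this sketch is not the lab solution, just the pattern:

```python
def count_words(words, filters=()):
    # Keep a word only if every filter returns False for it.
    counts = {}
    for w in words:
        if not any(f(w) for f in filters):
            counts[w] = counts.get(w, 0) + 1
    return counts

def cloudstr(counts):
    # The suggested one-liner: repeat each word by its count, space-separated.
    return ' '.join(' '.join([word] * n) for word, n in counts.items())
```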
Step 4: Word Clouds
In this section you will write code to create word clouds from JSON data representing status updates. Given a filename containing JSON data and a filename for an output image, your program should
- create an instance of WordCounts using the filters described below;
- load the JSON from disk;
- extract the text of each tweet from each JSON dictionary (use the key "text");
- add words from the text of each tweet to your WordCounts instance;
- create a word cloud using the string returned by cloudstr; and
- save the word cloud to the appropriate file.
The command line usage should be:
$ python cloud.py tweets.json tweets.png
However, there is a problem with matplotlib and virtual environments that prevents python from running correctly. We still need to create and activate a virtualenv, but instead of using the python version installed in the virtualenv, we want to use the python3 version installed on the system.
On OIT machines, use:
$ PYTHONHOME=$VIRTUAL_ENV /opt/local/bin/python3 cloud.py tweets.json tweets.png 
On TCL 216/217 machines, use:
$ PYTHONHOME=$VIRTUAL_ENV /usr/local/bin/python3 cloud.py tweets.json tweets.png
The PYTHONHOME environment variable tells Python where to look for modules and libraries. More information on matplotlib and virtualenv.
Step 4a: Word Clouds
Here are some details about creating word clouds:
- You can import the WordCloud class as well as the STOPWORDS list, which is useful for telling the cloud to ignore words like "the" and "a".
- Use the following syntax to create an instance of WordCloud:
  cloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf', collocations=False, stopwords=STOPWORDS, background_color='white', width=1800, height=1400) 
  You can change fonts by providing a different path to a font file.
- Use cloud.generate(s) to internally generate a word cloud from a string of whitespace-delimited words. You'll want to make use of the cloudstr method of your WordCounts class.
- Use cloud.to_file(filename) to save the word cloud to disk.
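The JSON-loading half of this pipeline might look like the sketch below; the helper name tweet_texts is an assumption, not part of the lab skeleton. The word cloud construction is left as comments since it requires the wordcloud package installed in Step 1.

```python
import json

def tweet_texts(json_filename):
    # Load the JSON list dumped by twitter.py and pull out each tweet's text.
    with open(json_filename) as f:
        statuses = json.load(f)
    return [status['text'] for status in statuses]

# The cloud itself would then be built roughly as follows:
#   cloud = WordCloud(collocations=False, stopwords=STOPWORDS,
#                     background_color='white', width=1800, height=1400)
#   cloud.generate(counts.cloudstr())
#   cloud.to_file(image_filename)
```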
Step 4b: Filters
You should define at least 4 different kinds of filters:
- a filter for URLs (i.e., anything with "http://" or "https://" in it);
- a filter for the word "RT";
- a filter for anything starting with "@"; and
- a filter for the encoded ampersand character ("&amp;").
  These filters will be passed to your WordCounts instance.
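One plausible set of such filters, sketched under the assumption that tweet text encodes the ampersand as "&amp;" (the function names here are illustrative, not required by the lab):

```python
def filter_url(word):
    # Drop anything containing a URL.
    return 'http://' in word or 'https://' in word

def filter_rt(word):
    # Drop the retweet marker.
    return word == 'RT'

def filter_mention(word):
    # Drop @-mentions of other users.
    return word.startswith('@')

def filter_amp(word):
    # Tweet text encodes '&' as '&amp;'.
    return '&amp;' in word

FILTERS = [filter_url, filter_rt, filter_mention, filter_amp]
```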
  
Step 5: Make some Clouds
You should create word clouds for some Twitter username and for a sample of 1000 tweets from the real time twitter feed. Add and commit these PNG files to your repo.
Step E: Extra Credit
The WordCloud module lets us customize our word clouds in many ways. You can find documentation and examples in the WordCloud github repository. Experiment with these options to make word clouds that are unique! You may find inspiration in the Hillary and Donald word clouds below, but be creative. Some ideas include: adding an image mask; changing fonts or colors; transforming text (e.g., pig Latin or "real" language translation); non-twitter data sources (remember, we can dump dictionaries to json files); or new/better filters.
 
	 
  Please indicate, using comments in your code, what enhancements you have made and how they work so that we can properly credit your efforts. Your final submission should produce word clouds as normal with the default program arguments, but easy-to-follow documentation explaining how to run your enhancements should be provided in the README.md that comes with your repository.
Step 6: Submission
- Now commit those additions to the repository:
  $ git commit -a -m "some log message" 
- Push your changes back to your GitHub repo:
  $ git push 
  You will probably be asked to type
  $ git push --set-upstream origin twitter 
  which you should do. This pushes your twitter branch up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.