Diving Into the Deluge of Data :: Lab 8 :: Dictionaries, Baby Names and Ngrams

Lab 8: Dictionaries, Baby Names and Ngrams

This lab focuses on applying dictionaries to two types of data: counts of baby names in the United States by year and counts of baby names referenced in literature by year. Our goal is to produce a visualization that fuses these two data sources.

Step 1: Source Code

Data

This lab features two data sources.

Representing Baby Names

Develop a class called BabyNames in baby.py that encapsulates the counts of baby names for a particular range of years. Besides __init__, your class should support three other methods:

  class BabyNames:

      def __init__(self):

      def add(self, name, year, count):
          """
          Add 'count' to the total for 'name' in 'year'. If the name/year
          pair does not yet exist, start it at 'count'.
          """

      def count(self, name, year):
          """Return count associated with name / year"""

      def counts(self, name, years):
          """Return a list of counts associated with 'name' for 'years'"""

Here are some implementation notes:
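
As one illustration of the dictionary approach, the three methods can sit on top of a single dictionary keyed by (name, year) tuples. This is only a sketch, not the required design: the class name BabyNamesSketch is hypothetical, returning 0 for an unseen name/year pair is an assumption, and the counts in the usage lines are made up.

```python
class BabyNamesSketch:
    """Hypothetical stand-in for BabyNames, backed by one dictionary."""

    def __init__(self):
        # Maps (name, year) tuples to an integer count.
        self._counts = {}

    def add(self, name, year, count):
        key = (name, year)
        # dict.get supplies 0 the first time a name/year pair is seen.
        self._counts[key] = self._counts.get(key, 0) + count

    def count(self, name, year):
        # Assumption: an unseen name/year pair reports a count of 0.
        return self._counts.get((name, year), 0)

    def counts(self, name, years):
        return [self.count(name, year) for year in years]


bn = BabyNamesSketch()
bn.add("Brent", 1970, 10)   # made-up numbers, for illustration only
bn.add("Brent", 1970, 5)
bn.add("Brent", 1972, 7)
print(bn.counts("Brent", [1970, 1971, 1972]))  # [15, 0, 7]
```

A dictionary of dictionaries (name to year to count) would work equally well; the tuple key simply keeps every lookup to a single get call.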

Step 4: babynames_from_files

In baby.py, write a function called babynames_from_files that creates an instance of BabyNames populated with data from basedir for the given list of years.

  def babynames_from_files(basedir, prefix, years):
      """Return a BabyNames object populated from data in 'basedir'
         for the given years"""
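
The file-reading loop might look roughly like the sketch below. It rests on two assumptions that this handout does not state: that each year lives in a file named prefix + year + ".txt" inside basedir, and that each line has the form name,sex,count. The helper name load_counts is hypothetical, and it fills a plain dictionary rather than a BabyNames object so that it stands on its own.

```python
import csv
import os


def load_counts(basedir, prefix, years):
    """Hypothetical helper: return {(name, year): count} read from basedir."""
    totals = {}
    for year in years:
        # Assumed file naming: <basedir>/<prefix><year>.txt
        path = os.path.join(basedir, f"{prefix}{year}.txt")
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                if not row:
                    continue  # skip blank lines
                # Assumed line layout: name,sex,count
                name, _sex, count = row
                key = (name, year)
                # A name listed under both sexes is summed into one total.
                totals[key] = totals.get(key, 0) + int(count)
    return totals
```

In the real function, the dictionary update would be replaced by a call to the add method of a BabyNames instance.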

Some implementation notes:

Test your code:

  >>> import baby
  >>> bn = baby.babynames_from_files("./names", "yob", list(range(1900,2001)))
  >>> bn.counts("Brent", list(range(1970,1981)))
  [4304, 4074, 3556, 3306, 3441, 3387, 3426, 3448, 3202, 3479, 3566]
  

Step 5: Google Ngrams

Slide on over to the Google Ngram Viewer and try a few searches. You'll notice that the URL is composed of three parts:

The only key/value pairs that we care about are the following:

The requests library that we used in Lab 4 has excellent support for making URL requests with key/value parameters.

    >>> import requests
    >>> params = {"names" : "Brent,Courtney,Oscar,George", "year" : 2012}
    >>> r = requests.get("https://www.somewhere.com/foobar", params=params)
    >>> print(r.url)
    https://www.somewhere.com/foobar?names=Brent%2CCourtney%2COscar%2CGeorge&year=2012
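
To see how the key/value pairs become a query string without making a network request, the same encoding can be reproduced with the standard library. This sketch reuses the made-up names and year parameters from the example above; they are not the keys that the Ngram Viewer expects.

```python
from urllib.parse import urlencode

tokens = ["Brent", "Courtney", "Oscar", "George"]

# Join the list into one comma-separated value, then percent-encode it.
params = {"names": ",".join(tokens), "year": 2012}
query = urlencode(params)

print(query)  # names=Brent%2CCourtney%2COscar%2CGeorge&year=2012
```

Each comma is encoded as %2C, which matches the URL printed by requests.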
  

In the file ngrams.py, write a function called google_ngram_request that takes a list of strings (the tokens), a start year and an end year, and returns the underlying response content.

  def google_ngram_request(tokens, start_year, end_year):
      """
      Return the text of the Google ngram results for a list of 'tokens'
      starting with 'start_year' and ending with 'end_year'
      """

The function google_ngram_request can be used in conjunction with the supplied parse function to extract the returned data into a dictionary. Test your code:

  >>> import ngrams
  >>> ngrams.parse(ngrams.google_ngram_request(['Brent', 'Courtney'], 1970, 1972))
  {'Brent': [1.0628228134616318e-06, 1.0628228134616318e-06, 1.0628228134616318e-06],
   'Courtney': [1.1416135142402102e-06, 1.1416135142402102e-06, 1.1416135142402102e-06]}
  

Step 6: Plotting

Define a function in baby.py called plot that accepts five arguments:

This function should construct a plot similar to that shown above and save it in filename. Similar means it should contain all the major characteristics: filled plots, a 2:1 plotting ratio between the top figure and the bottom figure, a legend, proper labels, etc. The color scheme can be slightly different as can some of the font choices and sizes. Here are some implementation notes:
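
One way to get filled curves, a 2:1 height ratio and a legend is sketched below. It assumes matplotlib is the plotting library, and it does not use the five-argument signature required for plot: the helper name plot_sketch, its four parameters, the axis labels, and the choice of which data goes in the top panel are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt


def plot_sketch(filename, years, top_series, bottom_series):
    """Hypothetical helper: each *_series maps a label to a list of values."""
    fig, (top, bottom) = plt.subplots(
        2, 1, sharex=True, figsize=(8, 6),
        gridspec_kw={"height_ratios": [2, 1]},  # top panel twice as tall
    )
    for label, values in top_series.items():
        # fill_between shades the area between the curve and zero.
        top.fill_between(years, values, alpha=0.5, label=label)
    for label, values in bottom_series.items():
        bottom.fill_between(years, values, alpha=0.5, label=label)
    top.set_ylabel("Births")               # assumed label
    bottom.set_ylabel("Ngram frequency")   # assumed label
    bottom.set_xlabel("Year")
    top.legend(loc="upper left")
    fig.savefig(filename)
    plt.close(fig)
```

The height_ratios entry in gridspec_kw is what produces the 2:1 split, and sharex keeps the year axis aligned across both panels.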

You can run your code from the command line using

  $ python3 baby.py names.png ./names 1900 2000 name1 name2 name3 ...
  
where ./names is the directory where the baby name data is kept.
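
The command line above could be unpacked along the following lines. This is a sketch under two assumptions: parse_args is a hypothetical helper name, and the end year is treated as inclusive (so 1900 2000 covers the same years as range(1900, 2001) in the earlier test).

```python
def parse_args(argv):
    """Hypothetical helper: split argv into (filename, basedir, years, names)."""
    if len(argv) < 6:
        raise SystemExit(
            "usage: baby.py FILENAME BASEDIR START_YEAR END_YEAR NAME [NAME ...]"
        )
    filename, basedir = argv[1], argv[2]
    # Assumption: the end year is inclusive.
    years = list(range(int(argv[3]), int(argv[4]) + 1))
    names = argv[5:]
    return filename, basedir, years, names


# In baby.py the argument would be sys.argv; a literal list is used here.
example = ["baby.py", "names.png", "./names", "1900", "2000", "Brent", "Courtney"]
filename, basedir, years, names = parse_args(example)
print(filename, basedir, years[0], years[-1], names)
# names.png ./names 1900 2000 ['Brent', 'Courtney']
```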

Step 7: Submission