Lab 7: Dictionaries, Baby Names and Ngrams
This lab focuses on applying dictionaries to two types of data: counts of baby names in the United States by year and counts of baby names referenced in literature by year. Our goal is to produce a visualization that fuses these two data sources together.

Step 0: Lab Preparation
Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
$ git clone git@github.com:williams-cs/<git-username>-cs135-lab7.git
Remember, you can always get the repo address by using the SSH copy-to-clipboard link on GitHub.
- Once inside your <git-username>-cs135-lab7 directory, create a virtual environment using
$ virtualenv --system-site-packages -p python3 venv
The --system-site-packages flag will let us use the matplotlib package.
- Activate your environment by typing:
$ . venv/bin/activate
- Use pip to install the Pillow imaging library:
$ pip install pillow
- Use pip to install the requests library:
$ pip install requests
- Remember that you must always activate your virtual environment when opening a new terminal.
- Type
$ git branch
and notice that you are currently editing the master branch.
- Create a new branch with
$ git branch plot
- Check out this branch by typing
$ git checkout plot
- Any changes you make to the repository are now isolated on this branch.
Data
This lab features two data sources.
- Baby names from Social Security card applications by year from 1880-2014. This data has already been downloaded and is available to you in /usr/mac-cs-local/share/cs135/names/. Each year YYYY has a file with the name yobYYYY.txt. Each file is in CSV format. Each row has the format:
NAME,SEX,COUNT
Feel free to copy this data into your lab7 directory using
$ cp -r /usr/mac-cs-local/share/cs135/names ./names/
- Ngram data from the Google Ngram Viewer. Here we will actually make an HTTP request and scrape the data from the result. This data shows what percentage of books in a given year feature the given Ngrams.
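As a small illustration of the NAME,SEX,COUNT row format, the sketch below parses a few rows with Python's csv module. The sample rows are made up for illustration (they are not taken from the real data files), and the variable names are hypothetical.

```python
import csv
import io

# Made-up sample in the NAME,SEX,COUNT format (not real data).
sample = "Mary,F,100\nJohn,M,90\nMary,M,3\n"

rows = []
for name, sex, count in csv.reader(io.StringIO(sample)):
    # csv.reader yields strings, so the count must be converted explicitly.
    rows.append((name, sex, int(count)))

print(rows)
```

Note that the same name can appear once per sex within a single file, which is worth keeping in mind when totals are accumulated later.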
Representing Baby Names
Develop a class called BabyNames in baby.py that encapsulates the counts of baby names for a particular range of years. Besides __init__, your class should support three other methods:
- add: update the count associated with name/year by count, or initialize it to count if it doesn't already exist.
- count: return the count associated with name for year.
- counts: return the counts associated with name for the given list of years.
class BabyNames:

    def __init__(self):
        pass  # set up your storage here

    def add(self, name, year, count):
        """
        Add 'count' to 'name' in 'year' (or make name / year have
        count if name / year does not yet exist)
        """

    def count(self, name, year):
        """Return count associated with name / year"""

    def counts(self, name, years):
        """Return counts associated with 'name' for 'years'"""
Here are some implementation notes:
- You probably want to store a dictionary keyed by names whose values are dictionaries keyed by year.
- count should return 0 if the name / year pair does not exist.
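The nested-dictionary idea from the notes above can be sketched on a plain dictionary, outside of any class. This is only one possible approach; the names record and counts_by_name are hypothetical and not part of the lab's starter code.

```python
# Outer dictionary: name -> inner dictionary; inner dictionary: year -> count.
counts_by_name = {}

def record(table, name, year, count):
    """Add 'count' to the entry for name / year, creating it if needed."""
    by_year = table.setdefault(name, {})          # inner dict for this name
    by_year[year] = by_year.get(year, 0) + count  # missing years start at 0

record(counts_by_name, "Brent", 1970, 10)
record(counts_by_name, "Brent", 1970, 5)

# Chained .get calls give a default of 0 for a missing name or year.
missing = counts_by_name.get("Brent", {}).get(1971, 0)
unknown = counts_by_name.get("Nobody", {}).get(1970, 0)
```

Using setdefault and get avoids KeyError checks for names or years that have not been seen yet.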
Step 4: babynames_from_files
In baby.py, write a function called babynames_from_files that creates an instance of BabyNames populated with data from dir for the given list of years.
def babynames_from_files(dir, prefix, years):
    """Return a BabyNames object populated from data in 'dir' for the given years"""
Some implementation notes:
- The dir parameter is a directory where files of the form prefixYYYY.txt live.
- You may find the function os.path.join helpful.
>>> import baby
>>> bn = baby.babynames_from_files("./names", "yob", list(range(1900,2001)))
>>> bn.counts("Brent", list(range(1970,1981)))
[4303, 4074, 3558, 3306, 3440, 3382, 3425, 3449, 3202, 3479, 3564]
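The prefixYYYY.txt naming convention combines naturally with os.path.join. The helper below is a minimal sketch of building one path per year; year_path is a hypothetical name, not a function the lab requires.

```python
import os

def year_path(directory, prefix, year):
    """Build the path of the data file for one year, e.g. <directory>/yob1900.txt."""
    return os.path.join(directory, "{}{}.txt".format(prefix, year))

# One path per requested year.
paths = [year_path("./names", "yob", year) for year in range(1900, 1903)]
print(paths)
```

Letting os.path.join insert the separator means the code does not need to care whether the directory argument ends with a slash.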
Step 5: Google Ngrams
Slide on over to the Google Ngram Viewer and try a few searches. You'll notice that the URL is composed of three parts:
- https://books.google.com/ngrams/graph gives the base address;
- ? signals that we are about to pass some parameters; and
- key1=value1&key2=value2&...&keyN=valueN gives the parameters, where the key and value strings are encoded properly.
The only key/value pairs that we care about are the following:
- content=Albert+Einstein%2CFrankenstein
- year_start=1800
- year_end=2000
- corpus=17
- smoothing=3
The requests library that we used in Lab 4 has excellent support for making URL requests with key/value parameters.
>>> import requests
>>> params = {"names" : "Brent,Courtney,Oscar,George", "year" : 2012}
>>> r = requests.get("http://www.somewhere.com/foobar", params=params)
>>> print(r.url)
https://www.somewhere.com/foobar?names=Brent%2CCourtney%2COscar%2CGeorge&year=2012
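To see how the Ngram key/value pairs listed earlier turn into a query string without touching the network, the sketch below uses the standard library's urllib.parse.urlencode in place of requests. It only illustrates the encoding; the variable names are hypothetical, and the parameter values are the ones listed above.

```python
from urllib.parse import urlencode

tokens = ["Albert Einstein", "Frankenstein"]

# Keys and values copied from the list of parameters in this step.
params = {
    "content": ",".join(tokens),  # tokens are comma-separated in one value
    "year_start": 1800,
    "year_end": 2000,
    "corpus": 17,
    "smoothing": 3,
}

query = urlencode(params)  # spaces become '+', commas become '%2C'
url = "https://books.google.com/ngrams/graph" + "?" + query
print(url)
```

Passing the same dictionary as params= to requests.get lets the requests library do this encoding for you.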
In the file ngrams.py, write a function called google_ngram_request that takes a list of strings (the tokens), a start year and a finish year, and returns the underlying response content.
def google_ngram_request(tokens, start_year, end_year):
    """
    Return the text of the google ngram results for a list of 'tokens'
    starting with 'start_year' and ending with 'end_year'
    """
The function google_ngram_request can be used in conjunction with the supplied parse function to extract the given data into a dictionary. Test your code:
>>> ngrams.parse(ngrams.google_ngram_request(['Brent', 'Courtney'], 1970, 1972))
{'Brent': [1.0628228134616318e-06, 1.0628228134616318e-06, 1.0628228134616318e-06],
 'Courtney': [1.1416135142402102e-06, 1.1416135142402102e-06, 1.1416135142402102e-06]}
Step 6: Plotting
Define a function in baby.py called plot that accepts five arguments:
- filename, the filename in which to output the plot;
- bn, a BabyNames instance storing name / year / count data;
- ngrams, a dictionary mapping names to percentages over a range of years;
- names, a list of names; and
- years, a list of years.
This function should construct a plot similar to that shown above and save it in filename. Similar means it should contain all the major characteristics: filled plots, a 2:1 plotting ratio between the top figure and the bottom figure, a legend, proper labels, etc. The color scheme can be slightly different, as can some of the font choices and sizes.
Here are some implementation notes:
- Use subplot2grid to create 3 rows and 1 column. Your top plot should occupy 2 rows. Your bottom plot should occupy 1 row. Use the rowspan keyword argument in one of your subplot2grid calls to make this possible. See the matplotlib documentation for subplot2grid.
- Consider making the tick labels on your bottom plot invisible.
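The layout described in these notes might look roughly like the following sketch. It covers only the 2:1 grid, filled curves, a legend, and hidden tick labels, and it uses generic label-to-values dictionaries rather than a BabyNames instance. The function name sketch_layout, its parameters, and the axis labels are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

def sketch_layout(filename, years, top_series, bottom_series):
    """Save a stacked 2:1 figure; each *_series maps a label to a list of values."""
    fig = plt.figure(figsize=(8, 6))
    # A 3x1 grid: the top axes spans rows 0-1, the bottom axes takes row 2.
    top = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    bottom = plt.subplot2grid((3, 1), (2, 0))
    for label, values in top_series.items():
        top.fill_between(years, values, alpha=0.5, label=label)
    for label, values in bottom_series.items():
        bottom.fill_between(years, values, alpha=0.5, label=label)
    top.legend(loc="upper left")
    top.set_ylabel("count")         # placeholder label
    bottom.set_ylabel("percent")    # placeholder label
    # Hide the x tick labels on the bottom plot.
    plt.setp(bottom.get_xticklabels(), visible=False)
    fig.savefig(filename)
    plt.close(fig)
```

Calling sketch_layout("demo.png", [1970, 1971, 1972], {"A": [3, 4, 5]}, {"A": [0.1, 0.2, 0.1]}) writes a small test image; the values in that call are arbitrary.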
You can run your code from the command line using
$ python3 baby.py names.png ./names 1900 2000 name1 name2 name3 ...
where ./names is the directory where the baby name data is kept.
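One way to unpack that command line is sketched below. The function name parse_command_line is hypothetical, and treating the end year as inclusive is an assumption based on the earlier range(1900,2001) example.

```python
import sys

def parse_command_line(argv):
    """Split argv into (filename, directory, years, names)."""
    filename, directory = argv[1], argv[2]
    start, end = int(argv[3]), int(argv[4])
    years = list(range(start, end + 1))  # assumption: the end year is inclusive
    names = argv[5:]                     # every remaining argument is a name
    return filename, directory, years, names

if __name__ == "__main__" and len(sys.argv) >= 6:
    print(parse_command_line(sys.argv))
```

The pieces returned here line up with the arguments that babynames_from_files and plot expect.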
Step 7: Submission
- Now commit those additions to the repository:
$ git commit -a -m "some log message"
- Push your changes back to your GitHub repo:
$ git push
You will probably be asked to type
$ git push --set-upstream origin plot
which you should do. This pushes your plot branch back up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.