Lab 9: Regular Expressions and Recursion
This lab uses regular expressions to grab semi-structured data from a website and save it as CSV. The data is historical population data from US counties. We will use this data to create choropleth maps using Vincent and Vega, and we will create an animated GIF from the pictures so that one can see trends through time. Finally, we will also use recursion to create a Sierpinski Triangle.
Here is a picture of population densities over time from 1790 to 2010 by decade:

Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
$ git clone git@github.com:williams-cs/<git-username>-cs135-lab9.git
Remember, you can always get the repo address by using the ssh copy-to-clipboard link on github.
- Once inside your <git-username>-cs135-lab9 directory, create a virtual environment using
$ virtualenv --system-site-packages -p python3 venv
The --system-site-packages flag will let us use the numpy package.
- Activate your environment by typing:
$ . venv/bin/activate
- Use pip to install the python imaging library:
$ pip install pillow
- Use pip to install the requests library:
$ pip install requests
- Use pip to install pandas, which is a statistical package for Python, similar to R:
$ pip install pandas
- Use pip to install vincent, which is a python framework for generating visualizations using the Vega grammar:
$ pip install vincent
- Clone the Vega git repository using:
$ git clone https://github.com/trifacta/vega.git
- Make the Vega command-line tools using:
$ cd vega && make install && cd ..
This just changes the directory to vega, makes the tools, and backs up one directory.
- Remember that you must always activate your virtual environment when opening a new terminal.
- Type
$ git branch
and notice that you are currently editing the master branch.
- Create a new branch with
$ git branch regex
- Checkout this branch by typing
$ git checkout regex
- Any changes you make to the repository are now isolated on this branch.
Step 2: Population Data
The Census Bureau has a pretty visualization of coastline and interior population densities from 1790 to 2010. We will make a similar visualization using an animated gif created from a sequence of choropleth images. Below the Census Bureau visualization is a link to the data table. Your primary exercise this week will be to extract this data from the web page source using regular expressions.
- Begin by looking at the source of the data table. You can use COMMAND-SHIFT-U to view the source. You should scroll down far enough to see the actual data embedded within an HTML table:
<div id='tcontainer'><table><tr><td class='c1'>State, County FIPS</td>...
The <tr> tag means table row and the <td> tag means table data. Each row always starts with <tr><td class='c1'> and ends with </tr>.
- Write a function called html_source(URL) that, given URL, downloads the page source using the requests library and returns it as a string (a sketch follows the transcript below). Test your code by inspecting the string and seeing if it contains all the data.
- Begin by writing a function called make_regex() that takes no arguments and returns a regular expression string that matches an entire row. Test your code in the REPL:
>>> URL = 'http://www.census.gov/dataviz/visualizations/039/508.php'
>>> html = html_source(URL)
>>> re.findall(make_regex(), html)[0]
"<tr><td class='c1'>State, County FIPS</td><td>STATE</td><td>County name</td>
<td>Coastline status</td><td>Estimated population 1790</td>
<td>Estimated population 1800</td><td>Estimated population 1810</td>
<td>Estimated population 1820</td><td>Estimated population 1830</td>
<td>Estimated population 1840</td><td>Estimated population 1850</td>
<td>Estimated population 1860</td><td>Estimated population 1870</td>
<td>Estimated population 1880</td><td>Estimated population 1890</td>
<td>Estimated population 1900</td><td>Estimated population 1910</td>
<td>Estimated population 1920</td><td>Estimated population 1930</td>
<td>Estimated population 1940</td><td>Estimated population 1950</td>
<td>Estimated population 1960</td><td>Estimated population 1970</td>
<td>Estimated population 1980</td><td>Estimated population 1990</td>
<td>Estimated population 2000</td><td>Estimated population 2010</td></tr>"
>>> re.findall(make_regex(), html)[1]
"<tr><td class='c1'>01001</td><td>Alabama</td><td>Autauga County</td><td>non-coastline</td>
<td>0</td><td>0</td><td>0</td><td>2,000</td><td>6,000</td><td>8,000</td>
<td>8,000</td><td>9,000</td><td>12,000</td><td>13,000</td><td>13,000</td>
<td>18,000</td><td>20,000</td><td>19,000</td><td>20,000</td><td>21,000</td>
<td>18,000</td><td>19,000</td><td>24,000</td><td>32,000</td><td>34,000</td>
<td>44,000</td><td>55,000</td></tr>"
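Here is a minimal sketch of html_source using the requests library; a real version might also check the response status, which is omitted here:

import requests

def html_source(URL):
    # download the page at URL and return its source as a string
    return requests.get(URL).text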
- There are 27 columns of data. The first one is the FIPS column, which is keyed by <td class='c1'>. The remaining 26 columns are keyed by <td>...</td>. You should alter your regular expression code to capture the data in these 27 columns in groups so that it will be easy to extract. You should generate the regular expression programmatically (i.e., create the string representing the regular expression using some Python code); see the sketch after these details.
- Write a function called grabdata(URL) that uses make_regex and html_source to create a list of rows, where each row is represented as a list of 27 values. You should return this list. Make use of re.finditer(...), which returns an iterator of match objects. The header row, which is the first row, should contain all strings, but the remaining rows contain integer values that should be converted. Here are some details:
- Use a boolean variable to help tell you when you are or are not parsing the header row.
- Consider writing an inner function called row_from_match that takes a match object and returns a properly formatted row. Remember that group(0) refers to the entire match, while group(1) through group(27) refer to the individual data items.
- The first four data items (i.e., groups 1-4) should be strings. The remaining 23 items (i.e., groups 5-27) should be converted to integers. Notice, however, that the integers contain commas. To parse these integers correctly, consider the following code:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.atoi("10,000,000")
10000000
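Here is one possible shape for make_regex and grabdata, reusing html_source from the earlier sketch. It assumes the cells of each row appear back-to-back in the source, as in the transcript above; if the real page breaks rows across lines, you may need re.DOTALL or a whitespace-tolerant pattern:

import re
import locale

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

def make_regex():
    # build the row pattern programmatically: one c1 cell followed
    # by 26 ordinary cells, each captured in its own group
    cell = r"<td>(.*?)</td>"
    return r"<tr><td class='c1'>(.*?)</td>" + 26 * cell + r"</tr>"

def grabdata(URL):
    # return a list of rows, each a list of 27 values
    def row_from_match(match):
        # groups 1-4 stay strings; groups 5-27 are comma-formatted integers
        strings = [match.group(i) for i in range(1, 5)]
        numbers = [locale.atoi(match.group(i)) for i in range(5, 28)]
        return strings + numbers

    rows = []
    parsing_header = True
    for match in re.finditer(make_regex(), html_source(URL)):
        if parsing_header:
            # the header row keeps all 27 items as strings
            rows.append([match.group(i) for i in range(1, 28)])
            parsing_header = False
        else:
            rows.append(row_from_match(match))
    return rows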
- Test your grabdata function.
>>> rows = grabdata(URL)
>>> rows[0]
['State, County FIPS', 'STATE', 'County name', 'Coastline status', 'Estimated population 1790', 'Estimated population 1800', 'Estimated population 1810', 'Estimated population 1820', 'Estimated population 1830', 'Estimated population 1840', 'Estimated population 1850', 'Estimated population 1860', 'Estimated population 1870', 'Estimated population 1880', 'Estimated population 1890', 'Estimated population 1900', 'Estimated population 1910', 'Estimated population 1920', 'Estimated population 1930', 'Estimated population 1940', 'Estimated population 1950', 'Estimated population 1960', 'Estimated population 1970', 'Estimated population 1980', 'Estimated population 1990', 'Estimated population 2000', 'Estimated population 2010']
>>> rows[1]
['01001', 'Alabama', 'Autauga County', 'non-coastline', 0, 0, 0, 2000, 6000, 8000, 8000, 9000, 12000, 13000, 13000, 18000, 20000, 19000, 20000, 21000, 18000, 19000, 24000, 32000, 34000, 44000, 55000]
>>> rows[2]
['01003', 'Alabama', 'Baldwin County', 'coastline', 0, 0, 0, 1000, 2000, 2000, 3000, 6000, 6000, 9000, 9000, 13000, 18000, 21000, 28000, 32000, 41000, 49000, 59000, 79000, 98000, 140000, 182000]
- Write a function called writedata(rows, output) that writes rows to a CSV file called output (a sketch follows below).
- Test your data.py program on the command line:
$ python3 data.py data.csv
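A minimal sketch of writedata using Python's csv module, plus one way to wire up the command line (the URL is the data page from earlier):

import csv
import sys

def writedata(rows, output):
    # write the rows to a CSV file named output
    with open(output, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

if __name__ == '__main__':
    # usage: python3 data.py data.csv
    URL = 'http://www.census.gov/dataviz/visualizations/039/508.php'
    writedata(grabdata(URL), sys.argv[1])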
Step 3: Creating Maps
The file map.py
is provided for you. Take a look at the code. We use the Vincent library along with a counties datafile us_counties.topo.json
to bind county population data to map counties based on their FIPS code. Vincent outputs a Vega graph visualization in JSON format. The syntax is as follows:
$ python3 map.py data.csv us_counties.topo.json pop 1790 2010
where 1790 is the start decade and 2010 is the finish decade. This will output 23 maps named pop.YYYY.json, one per decade. You can use the Vega command-line tools to convert these JSON files into PNG files. Here is the syntax:
$ vega/bin/vg2png pop.2000.json pop.2000.png
To convert all the files using bash's built-in for loop, use:
$ for file in `ls pop*.json`; do vega/bin/vg2png $file "${file%.*}".png; done
To create an animated gif called pop.gif
from these PNG files, use the convert
command.
$ convert -delay 20 -loop 0 pop*.png pop.gif
You can view your animated gif in a web browser using
$ open -a safari pop.gif
Step 4: Recursion
Below are five Sierpinski Triangles. You should write a program called triangle.py that takes three command-line arguments: a recursion depth, an output filename, and an image size in pixels. For example:
$ python triangle.py 8 tri8.png 1000

Here are some implementation details:
- Think about writing a triangle(n) generator that yields all the triangles up to depth n for the unit square. A value returned by this generator would be a list of three points.
- Your triangle generator might make use of an internal helper function helper(n, x0, y0, x1, y1, x2, y2) that you call recursively. The 6 coordinates provided to the function define the triangle within which you want to yield the embedded triangle. The embedded triangle will help you make 3 recursive calls to generate other triangles.
- Remember that when you make a recursive call to a generator, the return value is a generator, so you must use the syntax
for x in helper(...): yield x
- Use your generator, along with the polygon function of PIL's ImageDraw module, to create your properly interpolated image. A sketch of how these pieces might fit together appears after this list.
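Here is one way the pieces might fit together; the triangle orientation, the use of outlines rather than fills, and the handling of the outermost triangle are design choices, not requirements:

import sys
from PIL import Image, ImageDraw

def triangle(n):
    # yield all triangles up to depth n inside the unit square,
    # each as a list of three (x, y) points
    def helper(n, x0, y0, x1, y1, x2, y2):
        if n == 0:
            return
        # the midpoints of the three sides form the embedded triangle
        ax, ay = (x0 + x1) / 2, (y0 + y1) / 2
        bx, by = (x1 + x2) / 2, (y1 + y2) / 2
        cx, cy = (x2 + x0) / 2, (y2 + y0) / 2
        yield [(ax, ay), (bx, by), (cx, cy)]
        # recurse into the three corner triangles
        for t in helper(n - 1, x0, y0, ax, ay, cx, cy): yield t
        for t in helper(n - 1, ax, ay, x1, y1, bx, by): yield t
        for t in helper(n - 1, cx, cy, bx, by, x2, y2): yield t

    # the outermost triangle spans the unit square, apex at the top
    yield [(0.0, 1.0), (0.5, 0.0), (1.0, 1.0)]
    for t in helper(n, 0.0, 1.0, 0.5, 0.0, 1.0, 1.0):
        yield t

if __name__ == '__main__':
    # usage: python triangle.py <depth> <outfile> <size>
    depth, outfile, size = int(sys.argv[1]), sys.argv[2], int(sys.argv[3])
    img = Image.new('RGB', (size, size), 'white')
    draw = ImageDraw.Draw(img)
    for tri in triangle(depth):
        # scale unit-square coordinates up to pixel coordinates
        draw.polygon([(x * size, y * size) for (x, y) in tri], outline='black')
    img.save(outfile)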
Step 5: Submission
- Now commit those additions to the repository:
$ git commit -a -m "some log message"
- Push your changes back to the github repo:
$ git push
You will probably be asked to type
$ git push --set-upstream origin regex
which you should do. This pushes your regex branch back up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.