Lab 9: Regular Expressions and Recursion
This lab uses regular expressions to grab semi-structured data from a website and save it as CSV. The data is historical population data from US counties. We will use this data to create choropleth maps using Vincent and Vega, and we will create an animated GIF from the pictures so that one can see trends through time. Finally, we will also use recursion to create a Sierpinski Triangle.
Here is a picture of population densities over time from 1790 to 2010 by decade:

Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
$ git clone git@github.com:williams-cs/<git-username>-cs135-lab9.git
Remember, you can always get the repo address by using the ssh copy-to-clipboard link on github.
- Once inside your <git-username>-cs135-lab9 directory, create a virtual environment using
$ virtualenv --system-site-packages -p python3 venv
The --system-site-packages flag will let us use the numpy package.
- Activate your environment by typing:
$ . venv/bin/activate
- Use pip to install the python imaging library:
$ pip install pillow
- Use pip to install the requests library:
$ pip install requests
- Use pip to install pandas, which is a statistical package for Python, similar to R:
$ pip install pandas
- Use pip to install vincent, which is a python framework for generating visualizations using the Vega grammar:
$ pip install vincent
- Clone the Vega git repository using:
$ git clone https://github.com/trifacta/vega.git
- Make the Vega command-line tools using:
$ cd vega && make install && cd ..
This just changes the directory to vega, makes the tools, and backs up one directory.
- Remember that you must always activate your virtual environment when opening a new terminal.
- Type
$ git branch
and notice that you are currently editing the master branch.
- Create a new branch with
$ git branch regex
- Checkout this branch by typing
$ git checkout regex
- Any changes you make to the repository are now isolated on this branch.
Step 2: Population Data
The Census Bureau has a pretty visualization of coastline and interior population densities from 1790 to 2010. We will make a similar visualization using an animated gif created from a sequence of choropleth images. Below the Census Bureau visualization is a link to the data table. Your primary exercise this week will be to extract this data from the web page source using regular expressions.
- Begin by looking at the source of the data table. You can use COMMAND-SHIFT-U to view the source. You should scroll down far enough to see the actual data embedded within an HTML table:
<div id='tcontainer'><table><tr><td class='c1'>State, County FIPS</td>...
The <tr> tag means table row and the <td> tag means table data. Each row always starts with <tr><td class='c1'> and ends with </tr>.
- Write a function called html_source(URL) that, given URL, downloads the page source using the requests library and returns it as a string (a sketch follows the transcript below). Test your code by inspecting the string and seeing if it contains all the data.
- Begin by writing a function called make_regex() that takes no arguments and returns a regular expression string that matches an entire row. Test your code in the REPL:
>>> URL = 'http://www.census.gov/dataviz/visualizations/039/508.php'
>>> html = html_source(URL)
>>> re.findall(make_regex(), html)[0]
"<tr><td class='c1'>State, County FIPS</td><td>STATE</td><td>County name</td>
<td>Coastline status</td><td>Estimated population 1790</td>
<td>Estimated population 1800</td><td>Estimated population 1810</td>
<td>Estimated population 1820</td><td>Estimated population 1830</td>
<td>Estimated population 1840</td><td>Estimated population 1850</td>
<td>Estimated population 1860</td><td>Estimated population 1870</td>
<td>Estimated population 1880</td><td>Estimated population 1890</td>
<td>Estimated population 1900</td><td>Estimated population 1910</td>
<td>Estimated population 1920</td><td>Estimated population 1930</td>
<td>Estimated population 1940</td><td>Estimated population 1950</td>
<td>Estimated population 1960</td><td>Estimated population 1970</td>
<td>Estimated population 1980</td><td>Estimated population 1990</td>
<td>Estimated population 2000</td><td>Estimated population 2010</td></tr>"
>>> re.findall(make_regex(), html)[1]
"<tr><td class='c1'>01001</td><td>Alabama</td><td>Autauga County</td><td>non-coastline</td>
<td>0</td><td>0</td><td>0</td><td>2,000</td><td>6,000</td><td>8,000</td>
<td>8,000</td><td>9,000</td><td>12,000</td><td>13,000</td><td>13,000</td>
<td>18,000</td><td>20,000</td><td>19,000</td><td>20,000</td><td>21,000</td>
<td>18,000</td><td>19,000</td><td>24,000</td><td>32,000</td><td>34,000</td>
<td>44,000</td><td>55,000</td></tr>"
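Here is a minimal sketch of html_source using the requests library; a real version might also check the response status, which is omitted here:

import requests

def html_source(URL):
    # download the page at URL and return its source as a string
    return requests.get(URL).text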
- There are 27 columns of data. The first one is the FIPS column, which is keyed by <td class='c1'>. The remaining 26 columns are keyed by <td>...</td>. You should alter your regular expression code to capture the data in these 27 columns in groups so that it will be easy to extract. You should generate the regular expression programmatically (i.e., create the string representing the regular expression using some Python code); see the sketch after these details.
- Write a function called grabdata(URL) that uses make_regex and html_source to create a list of rows, where each row is represented as a list of 27 values. You should return this list. Make use of re.finditer(...), which returns an iterator of match objects. The header row, which is the first row, should contain all strings, but the remaining rows contain integer values that should be converted. Here are some details:
- Use a boolean variable to help tell you when you are or are not parsing the header row.
- Consider writing an inner function called row_from_match that takes a match object and returns a properly formatted row. Remember that group(0) refers to the entire match, while group(1) through group(27) refer to the individual data items.
- The first four data items (i.e., groups 1-4) should be strings. The remaining 23 items (i.e., groups 5-27) should be converted to integers. Notice, however, that the integers contain commas. To parse these integers correctly, consider the following code:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.atoi("10,000,000")
10000000
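Here is one possible shape for make_regex and grabdata, reusing html_source from the earlier sketch. It assumes the cells of each row appear back-to-back in the source, as in the transcript above; if the real page breaks rows across lines, you may need re.DOTALL or a whitespace-tolerant pattern:

import re
import locale

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

def make_regex():
    # build the row pattern programmatically: one c1 cell followed
    # by 26 ordinary cells, each captured in its own group
    cell = r"<td>(.*?)</td>"
    return r"<tr><td class='c1'>(.*?)</td>" + 26 * cell + r"</tr>"

def grabdata(URL):
    # return a list of rows, each a list of 27 values
    def row_from_match(match):
        # groups 1-4 stay strings; groups 5-27 are comma-formatted integers
        strings = [match.group(i) for i in range(1, 5)]
        numbers = [locale.atoi(match.group(i)) for i in range(5, 28)]
        return strings + numbers

    rows = []
    parsing_header = True
    for match in re.finditer(make_regex(), html_source(URL)):
        if parsing_header:
            # the header row keeps all 27 items as strings
            rows.append([match.group(i) for i in range(1, 28)])
            parsing_header = False
        else:
            rows.append(row_from_match(match))
    return rows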
- Test your grabdata function.
>>> rows = grabdata(URL)
>>> rows[0]
['State, County FIPS', 'STATE', 'County name', 'Coastline status', 'Estimated population 1790', 'Estimated population 1800', 'Estimated population 1810', 'Estimated population 1820', 'Estimated population 1830', 'Estimated population 1840', 'Estimated population 1850', 'Estimated population 1860', 'Estimated population 1870', 'Estimated population 1880', 'Estimated population 1890', 'Estimated population 1900', 'Estimated population 1910', 'Estimated population 1920', 'Estimated population 1930', 'Estimated population 1940', 'Estimated population 1950', 'Estimated population 1960', 'Estimated population 1970', 'Estimated population 1980', 'Estimated population 1990', 'Estimated population 2000', 'Estimated population 2010']
>>> rows[1]
['01001', 'Alabama', 'Autauga County', 'non-coastline', 0, 0, 0, 2000, 6000, 8000, 8000, 9000, 12000, 13000, 13000, 18000, 20000, 19000, 20000, 21000, 18000, 19000, 24000, 32000, 34000, 44000, 55000]
>>> rows[2]
['01003', 'Alabama', 'Baldwin County', 'coastline', 0, 0, 0, 1000, 2000, 2000, 3000, 6000, 6000, 9000, 9000, 13000, 18000, 21000, 28000, 32000, 41000, 49000, 59000, 79000, 98000, 140000, 182000]
- Write a function called writedata(rows, output) that writes rows to a CSV file called output (a sketch follows below).
- Test your data.py program on the command line:
$ python3 data.py data.csv
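A minimal sketch of writedata using Python's csv module, plus one way to wire up the command line (the URL is the data page from earlier):

import csv
import sys

def writedata(rows, output):
    # write the rows to a CSV file named output
    with open(output, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

if __name__ == '__main__':
    # usage: python3 data.py data.csv
    URL = 'http://www.census.gov/dataviz/visualizations/039/508.php'
    writedata(grabdata(URL), sys.argv[1])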
Step 3: Creating Maps
The file map.py
is provided for you. Take a look at the code. We use the Vincent library along with a counties datafile us_counties.topo.json
to bind county population data to map counties based on their FIPS code. Vincent outputs a Vega graph visualization in JSON format. The syntax is as follows:
$ python3 map.py data.csv us_counties.topo.json pop 1790 2010
where 1790 is the start decade and 2010 is the finish decade. This will output 23 maps named pop.YYYY.json, one per decade. You can use the Vega command-line tools to convert these JSON files into PNG files. Here is the syntax:
$ vega/bin/vg2png pop.2000.json pop.2000.png
To convert all the files using bash's built-in for loop, use:
$ for file in `ls pop*.json`; do vega/bin/vg2png $file "${file%.*}".png; done
To create an animated gif called pop.gif
from these PNG files, use the convert
command.
$ convert -delay 20 -loop 0 pop*.png pop.gif
You can view your animated gif in a web browser using
$ open -a safari pop.gif
Step 4: Recursion
Below are five Sierpinski Triangles. You should write a program called triangle.py that takes three command-line arguments: a recursion depth, an output filename, and an image size in pixels. For example:
$ python triangle.py 8 tri8.png 1000

Here are some implementation details:
- Think about writing a triangle(n) generator that yields all the triangles up to depth n for the unit square. A value returned by this generator would be a list of three points.
- Your triangle generator might make use of an internal helper function helper(n, x0, y0, x1, y1, x2, y2) that you call recursively. The 6 coordinates provided to the function define the triangle within which you want to yield the embedded triangle. The embedded triangle will help you make 3 recursive calls to generate other triangles.
- Remember that when you make a recursive call to a generator, the return value is a generator, so you must use the syntax
for x in helper(...): yield x
- Use your generator, along with the polygon function of PIL's ImageDraw module, to create your properly interpolated image. A sketch of how these pieces might fit together appears after this list.
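Here is one way the pieces might fit together; the triangle orientation, the use of outlines rather than fills, and the handling of the outermost triangle are design choices, not requirements:

import sys
from PIL import Image, ImageDraw

def triangle(n):
    # yield all triangles up to depth n inside the unit square,
    # each as a list of three (x, y) points
    def helper(n, x0, y0, x1, y1, x2, y2):
        if n == 0:
            return
        # the midpoints of the three sides form the embedded triangle
        ax, ay = (x0 + x1) / 2, (y0 + y1) / 2
        bx, by = (x1 + x2) / 2, (y1 + y2) / 2
        cx, cy = (x2 + x0) / 2, (y2 + y0) / 2
        yield [(ax, ay), (bx, by), (cx, cy)]
        # recurse into the three corner triangles
        for t in helper(n - 1, x0, y0, ax, ay, cx, cy): yield t
        for t in helper(n - 1, ax, ay, x1, y1, bx, by): yield t
        for t in helper(n - 1, cx, cy, bx, by, x2, y2): yield t

    # the outermost triangle spans the unit square, apex at the top
    yield [(0.0, 1.0), (0.5, 0.0), (1.0, 1.0)]
    for t in helper(n, 0.0, 1.0, 0.5, 0.0, 1.0, 1.0):
        yield t

if __name__ == '__main__':
    # usage: python triangle.py <depth> <outfile> <size>
    depth, outfile, size = int(sys.argv[1]), sys.argv[2], int(sys.argv[3])
    img = Image.new('RGB', (size, size), 'white')
    draw = ImageDraw.Draw(img)
    for tri in triangle(depth):
        # scale unit-square coordinates up to pixel coordinates
        draw.polygon([(x * size, y * size) for (x, y) in tri], outline='black')
    img.save(outfile)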
Step 5: Submission
- Now commit those additions to the repository:
$ git commit -a -m "some log message"
- Push your changes back to the github repo:
$ git push
You will probably be asked to type
$ git push --set-upstream origin regex
which you should do. This pushes your regex branch back up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.