Files and Comprehensions

So far in the course, we have learnt how to read from a text file and turn its contents into a Python data structure (such as a list of words). Today we will look at how to read from a CSV (comma-separated values) file, process the entries, and write or append the results to a different text file.

In the process, we will look at some code patterns involving lists, strings and counters that are useful when analyzing data.

Acknowledgement. This notebook has been adapted from the Wellesley CS111 Spring 2019 course materials (http://cs111.wellesley.edu/spring19).

Reading in a CSV File

CSV Format. A CSV (Comma Separated Values) file is a type of plain text file that stores tabular data. Each row of a table is a line in the text file, with the columns of that row separated by commas. This format is the most common import and export format for spreadsheets and databases.

For example, a simple table with the columns Name and Age, such as the following, would be represented in a CSV file as:

Table:

Name        Age
Harry       14
Hermoine    14
Dumbledor   60

CSV:

Name,Age
Harry,14
Hermoine,14
Dumbledor,60

Python’s csv module provides an easy way to read and iterate over a CSV file.

import csv # the module must be explicitly imported
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    print(csvf)
# the file is closed implicitly when the with block ends
# csvf is a csv reader object that can be iterated over
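To try the reader without roster.csv on disk, here is a minimal sketch that feeds the Name/Age table from above to csv.reader. It assumes io.StringIO as a stand-in for an open file; the variable names sample, reader and rows are made up for this illustration.

```python
import csv
import io

# io.StringIO behaves like an open text file, so no file on disk is needed.
sample = io.StringIO("Name,Age\nHarry,14\nHermoine,14\nDumbledor,60\n")

reader = csv.reader(sample)
print(reader)        # a csv reader object, not the rows themselves

rows = list(reader)  # consuming the reader yields one list per row
print(rows)
# [['Name', 'Age'], ['Harry', '14'], ['Hermoine', '14'], ['Dumbledor', '60']]
```

Note that every entry, including the ages, comes back as a string.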

Iterating over a CSV object

When we iterate over a regular text file, the loop variable is a string and takes the value of each line in the file, one by one, in order. When we iterate over a CSV reader object, the loop variable is a list and takes the value of each row, one by one, in order.

with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    for row in csvf:
        print(row)
        
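The difference between iterating over lines and iterating over rows can be sketched on a tiny in-memory example. The row below is made up, and its layout (a quoted "Last,First Middle" name followed by a class code) is only an assumption about what a roster row might look like.

```python
import csv
import io

text = 'name,year\n"Potter,Harry James",23AAA\n'  # made-up data

# Iterating over a file-like object gives raw strings, newline included.
lines = [line for line in io.StringIO(text)]

# Splitting on commas by hand breaks the quoted name into two pieces.
naive = lines[1].strip().split(',')

# csv.reader understands the quotes and keeps the name as one field.
rows = [row for row in csv.reader(io.StringIO(text))]

print(naive)    # ['"Potter', 'Harry James"', '23AAA']
print(rows[1])  # ['Potter,Harry James', '23AAA']
```

This is one reason to prefer the csv module over calling split(',') on each line.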

Accumulating the rows of the CSV as a Nested List

We can iterate over a CSV file and accumulate all rows (each of which is a list) into a mega list.

rosterList = []
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    for row in csvf:
        rosterList.append(row)
rosterList # let's see what is in the rosterList
[]

List of lists format. Notice that each item in the list is a row in the original file (in order) and the overall list is a list of rowLists. How can we access the information of a particular student from this nested list?

len(rosterList)  # number of students in class
0
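As a sketch of how nested indexing answers the question above, the first index picks a row and the second picks an item within that row. The rows here are made up and only assume a layout of name, class code and one extra column.

```python
# Made-up rows in an assumed roster layout: [name, class code, extra column]
sampleRoster = [
    ['Potter,Harry James', '23AAA', 'x'],
    ['Granger,Hermione Jean', '22AAA', 'y'],
]

firstRow = sampleRoster[0]       # the whole first row (a list)
firstName = sampleRoster[0][0]   # first item of the first row
secondCode = sampleRoster[1][1]  # second item of the second row

print(firstRow)    # ['Potter,Harry James', '23AAA', 'x']
print(firstName)   # Potter,Harry James
print(secondCode)  # 22AAA
```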

Generating random indices. Remember Homework 1, where you were asked to design an algorithm for generating random numbers? Let’s play a game where we generate random numbers between 0 and 31 (inclusive) and index our list with that number to see whose name comes up.

import random # import module to help generate random numbers
randomIndex = random.randint(0, 31)  
# generates a random integer between 0 and 31
rosterList[randomIndex]
randomIndex = random.randint(0, 31)
rosterList[randomIndex]  # great way of cold calling in lectures !
rosterList[random.randint(0,31)][0]   
# Accessing just the name
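The hard-coded upper bound of 31 only works for a roster with at least 32 rows; on a shorter (or empty) list it raises IndexError. A more defensive sketch derives the bound from the list itself. The list sampleRoster is made up for this illustration.

```python
import random

sampleRoster = [
    ['Potter,Harry James', '23AAA', 'x'],
    ['Granger,Hermione Jean', '22AAA', 'y'],
    ['Weasley,Ron Bilius', '23AAA', 'z'],
]

# randrange(n) returns an integer from 0 to n-1, so it is always a valid index.
i = random.randrange(len(sampleRoster))
print(sampleRoster[i])

# random.choice picks an element directly, skipping the index altogether.
row = random.choice(sampleRoster)
print(row[0])  # just the name
```

Both calls still raise an error on an empty list, so it is worth checking that the file was actually read before indexing.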

Reorganizing Data

Sometimes your CSV may have unnecessary data that you want to discard (such as the last column in our class roster). Additionally, your rows might have integer values stored as strings (such as class year) that you may want to convert to integers. Let us write a helper function that takes as input a list (which is a row of the CSV file) and outputs a cleaned row as a tuple. The returned tuple must have three items:

  • First item of the returned tuple must be the student first name as a string

  • Second item of the returned tuple must be the student last name as a string

  • Third item of the returned tuple must represent the graduation year (23, 22, 21, 20) as an int

def reorgData(rowList):
    """Takes a row of a CSV (as a list) and returns
    a tuple of student information"""
    # tuple assignment, splitting last name
    # and first(with middle) name
    lName, fmName = rowList[0].split(',')  
    fName = fmName.split()[0]
    year = rowList[1]  # takes the form '23AAA'
    yy = int(year[:2])
    return fName, lName, yy

Let us test our reorgData function on a particular random rowList from the rosterList.

randomIndex = random.randint(0, 31)
reorgData(rosterList[randomIndex])
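To see what reorgData produces without the roster file, here is a sketch that applies the same function to a single made-up row. The row layout (a "Last,First Middle" name, a class code such as '23AAA', and one extra column) is inferred from the function body and is an assumption about the real file.

```python
def reorgData(rowList):
    """Takes a row of a CSV (as a list) and returns
    a tuple of student information"""
    lName, fmName = rowList[0].split(',')  # last name, first (with middle) name
    fName = fmName.split()[0]              # drop any middle name
    year = rowList[1]                      # takes the form '23AAA'
    yy = int(year[:2])                     # first two characters as an int
    return fName, lName, yy

sampleRow = ['Potter,Harry James', '23AAA', 'unused']  # made-up row
cleaned = reorgData(sampleRow)
print(cleaned)  # ('Harry', 'Potter', 23)
```

The extra third column is simply ignored, and the year comes back as an int rather than a string.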

Accumulation with Lists

In previous lectures we have seen that it is common to use loops in conjunction with accumulation variables that collect results from processing elements within the loop. Let us write some functions that exercise commonly seen accumulation patterns using lists.

Exercise: Number of Students by Year

Let’s get to know our class better! We will write a function yearList which takes in two arguments, classList (a list of lists) and year (an int), and returns the list of students in the class with that graduating year.

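def yearList(classList, year):
    """Returns the names of the students in classList graduating in year."""
    result = []
    for sList in classList:  # iterate over the parameter, not the global rosterList
        # tuple assignment:
        fName, lName, yy = reorgData(sList)
        if yy == year:
            result.append(fName + ' ' + lName)
    return result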
len(yearList(rosterList, 23)) # how many first years in class?
0
yearList(rosterList, 23)  # Names of first years
[]
len(yearList(rosterList, 22)) # how many sophomores?
0
yearList(rosterList, 22)  # Names of sophomores
[]
len(yearList(rosterList, 21))  # how many juniors?
0
yearList(rosterList, 21) # names of juniors
[]
len(yearList(rosterList, 20))  # how many seniors
0
yearList(rosterList, 20)  # name of seniors
[]
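Calling yearList once per year walks the roster four times. As an alternative sketch (not part of the original notebook), collections.Counter from the standard library can tally all years in a single pass. The tuples below are made up and follow the (first name, last name, year) shape returned by reorgData.

```python
from collections import Counter

# Made-up cleaned rows in the shape (first name, last name, year)
students = [
    ('Harry', 'Potter', 23),
    ('Hermione', 'Granger', 23),
    ('Albus', 'Dumbledore', 20),
]

# Count how many students fall in each graduation year.
yearCounts = Counter(yy for fName, lName, yy in students)

print(yearCounts[23])  # 2
print(yearCounts[20])  # 1
print(yearCounts[21])  # 0, a Counter reports zero for keys it has not seen
```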

Exercise: Use our sequenceTools

We built an assortment of functions last week as part of our sequences toolkit. Let's use some of those functions now to find out fun facts about the class. Function names in the __all__ variable of our toolkit:

  • isVowel

  • countAllVowels

  • countChar

  • wordStartEndCount

  • wordStartEndList

  • isPalindrome

We can import these functions from our module into our current interactive Python session using the import statement.

from sequenceTools import *
help(countAllVowels)
countAllVowels('onomatopoeia')  # test if the import worked
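The sequenceTools module itself is not shown in this notebook. If it is unavailable, the rest of the examples can still be followed with stand-ins like the ones below. These are hypothetical versions written for this illustration; they assume that isVowel accepts a single character and ignores case, and the real toolkit functions may behave differently.

```python
def isVowel(char):
    """Hypothetical stand-in: True if char is a single vowel, ignoring case."""
    return len(char) == 1 and char.lower() in 'aeiou'

def countAllVowels(word):
    """Hypothetical stand-in: number of vowels in word."""
    count = 0
    for char in word:
        if isVowel(char):
            count += 1
    return count

print(countAllVowels('onomatopoeia'))  # 8
```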

Another helper function. As we will be analyzing student names, let's create a helper function which extracts names out of the CSV rows (lists).

def getName(sInfo):
    """Takes in a row of the CSV (as a list) and returns the
    student's first name concatenated with their last name"""
    fName, lName, yy = reorgData(sInfo)
    return fName + ' ' + lName
getName(rosterList[random.randint(0, 31)])  # test on a random student!

Fun Facts. Who has the most vowels in their name?

def mostVowelName(classList):
    currentMax = 0 # initialize max value
    persons = []  # initialize list for names
    for sInfo in classList:
        name = getName(sInfo)
        numVowels = countAllVowels(name)
        if numVowels > currentMax:
            # found someone whose name has more vowels
            # than current max: update persons, currentMax
            currentMax = numVowels
            persons = [name] # start a new list with this name
        elif numVowels == currentMax:
            # does this name tie the current max?
            persons.append(name)
    return persons, currentMax
mostVowelName(rosterList)  # which student has most vowels in their name?
([], 0)
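The same question can also be answered in two passes: first find the largest count, then collect every name that reaches it. The sketch below is an alternative to the loop above, not the notebook's method. It uses a made-up list of names and a local countVowels helper so that it runs without sequenceTools.

```python
def countVowels(name):
    """Local helper for this sketch: number of vowels in name, ignoring case."""
    return sum(1 for ch in name.lower() if ch in 'aeiou')

names = ['Harry Potter', 'Hermione Granger', 'Luna Lovegood']  # made-up names

counts = [countVowels(n) for n in names]  # pass 1: one count per name
best = max(counts)                        # largest count seen
# pass 2: keep every name whose count ties the maximum
winners = [n for n, c in zip(names, counts) if c == best]

print(winners, best)  # ['Hermione Granger', 'Luna Lovegood'] 6
```

One difference from the single-loop version: max raises ValueError on an empty list, so this form needs a non-empty roster.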

Fun Facts. How about the fewest vowels? We can reuse our getName helper function to extract the student names again.

def leastVowelName(classList):
    currentMin = float('inf') # initialize min value above any possible count
    persons = []  # initialize list for names
    for sInfo in classList:
        name = getName(sInfo)
        numVowels = countAllVowels(name)
        if numVowels < currentMin:
            # found a name with fewer vowels: start a new list
            currentMin = numVowels # update state of current min
            persons = [name]
        elif numVowels == currentMin:
            persons.append(name)  # tie with the current min
    return persons, currentMin
leastVowelName(rosterList)  # which student has the fewest vowels in their name?

Writing to Files

We can write all the results that we are computing into a file (a persistent structure). To open a file for writing, we use open with the mode ‘w’.

The following code will create a new file named studentFacts.txt in the current working directory and write in it results of our function calls.

with open('studentFacts.txt', 'w') as sFile:
    sFile.write('Fun facts about CS134 students.\n')  # need newlines
    sFile.write('No. of first years in CS134: {}\n'.format(len(yearList(rosterList, 23))))
    sFile.write('No. of sophomores in CS134: {}\n'.format(len(yearList(rosterList, 22))))
    sFile.write('No. of juniors in CS134: {}\n'.format(len(yearList(rosterList, 21))))
    sFile.write('No. of seniors in CS134: {}\n'.format(len(yearList(rosterList, 20))))

We can use ls to see that a new file studentFacts.txt has been created:

ls # new file information
__pycache__/                    lec_listPatterns_solns.ipynb
csv/                            sequenceTools.py
lec_listPatterns-jeannie.ipynb  studentFacts.txt
lec_listPatterns.ipynb          textfiles/

Use the OS command more to view the contents of the file:

more studentFacts.txt

Alternatively, go to Finder (on a Mac) or Windows Explorer (PC) to view the contents of the file.
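The file can also be read back from within Python. The sketch below writes two lines and then reads them again. It uses a temporary directory and a made-up file name (facts.txt) so that it does not touch studentFacts.txt.

```python
import os
import tempfile

# A throwaway location so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), 'facts.txt')

with open(path, 'w') as f:             # 'w' creates the file (or empties it)
    f.write('first line\n')            # write does not add newlines for us
    f.write('count: {}\n'.format(3))   # format turns the number into text

with open(path) as f:                  # default mode is reading
    contents = f.read()

print(contents)
```

Unlike print, write adds nothing on its own, which is why each string ends with an explicit '\n'.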

Appending to files

How do we add lines to the end of an existing file? We can’t open the file in write mode (with a ‘w’), because that erases all previous contents and starts with an empty file.

Instead, we open the file in append mode (with an ‘a’). Any subsequent writes are made after the existing contents.

with open('studentFacts.txt', 'a') as sFile:
    sFile.write('Name with most vowels: {}\n'.format(mostVowelName(rosterList)))
    sFile.write('Name with least vowels: {}\n'.format(leastVowelName(rosterList)))

Open the file studentFacts.txt again to view it, or use the OS command more:

more studentFacts.txt
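The difference between write mode and append mode can be checked directly. This is a small sketch on a temporary file with a made-up name (modes.txt), so that studentFacts.txt is left alone.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:   # write mode: start from an empty file
    f.write('one\n')
with open(path, 'a') as f:   # append mode: keep existing contents
    f.write('two\n')
with open(path) as f:
    afterAppend = f.read()

with open(path, 'w') as f:   # write mode again: previous contents are erased
    f.write('three\n')
with open(path) as f:
    afterRewrite = f.read()

print(repr(afterAppend))   # 'one\ntwo\n'
print(repr(afterRewrite))  # 'three\n'
```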

List Accumulation Patterns

When iterating over lists, there are several accumulation patterns which come up a lot. In the following questions, the premise is that we have a list we are iterating over and we are returning a new list. There are two common categories of tasks:

  • Mapping patterns: when you want to perform the same action on every item in the list

  • Filter patterns: when you want to retain only some items of the list

We can simplify the mapping/filtering patterns with a syntactic device called list comprehension. Let's take an example of each.

Mapping Pattern via List Comprehension

We can generate a new list by performing an operation on every element in a given list. This is called mapping.

def mapDouble(nums):
    """Given a list of numbers, returns a *new* list,
    in which each element is twice the corresponding
    element in the input list.
    """
    result = []
    for n in nums:
        result.append(2*n)
    return result
mapDouble([2, 3, 4, 5])
[4, 6, 8, 10]

Succinct form using list comprehension.

def mapDoubleShort(nums):
    return [2*n for n in nums]
mapDoubleShort([6, 7, 8])
[12, 14, 16]

List of Names. Suppose we want to iterate over our nested list rosterList and collect all the student names in a list. We can do that with a simple mapping list comprehension!

nameList = [getName(sInfo) for sInfo in rosterList]
nameList
[]

Another example. Suppose we want to iterate over a list of names and return a list of first names in lower case.

def firstNames(nameList):
    """Given a list of names of the form 'firstName lastName', returns
    a list of the first names in lower case.
    """
    return [name.split()[0].lower() for name in nameList]  
firstNames(['Shikha Singh', 'Iris Howley', 'Lida Doret'])
['shikha', 'iris', 'lida']

Filtering Pattern via List Comprehensions

Another common way to produce a new list is to filter an existing list, keeping only those elements that satisfy a certain predicate.

def filterNames(nameList):
    """Given a list of first names, returns a *new* list of all
    names in the input list that have length >= 9.
    """
    result = []
    for name in nameList:
        if len(name) >= 9:
            result.append(name)
    return result
filterNames(firstNames(nameList))
[]

We can also do this filtering pattern very succinctly using list comprehensions!

def filterNamesShort(nameList):
    return [name for name in nameList if len(name) >= 9]
filterNamesShort(firstNames(nameList))
[]

List Comprehensions Exercises

# Given a list of numbers nums
# Create a list of all numbers that are even
nums = [1, 2, 3, 4, 5, 6, 7]
result = [n for n in nums if n%2 == 0]
print(result)
[2, 4, 6]
# add the ending 'th' to all words in a phrase
phrase = "mine dog ate your shoe"
# expected phrase: ["mineth", "dogth", "ateth", "yourth", "shoeth"]
newPhrase = [word + 'th' for word in phrase.split()]
newPhrase
['mineth', 'dogth', 'ateth', 'yourth', 'shoeth']

List Comprehensions with Mapping and Filtering

It is possible to do both mapping and filtering in a single list comprehension. Examine the example below which filters a list by even numbers and creates a new list of their squares.

[(x**2) for x in range(10) if x % 2 == 0]
[0, 4, 16, 36, 64]

Note that our expression for mapping still comes before the “for” and our filtering with “if” still comes after our sequence. Below is the equivalent code without list comprehensions.

newList = []
for x in range(10):
    if x % 2 == 0:
        newList.append(x**2)
newList
[0, 4, 16, 36, 64]
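The same shape, [expression for item in sequence if condition], also works on tuples of student data. The sketch below combines tuple unpacking, a filter on the year and a mapping to upper case. The tuples are made up and follow the (first name, last name, year) shape returned by reorgData.

```python
# Made-up cleaned rows in the shape (first name, last name, year)
students = [
    ('Harry', 'Potter', 23),
    ('Hermione', 'Granger', 23),
    ('Albus', 'Dumbledore', 20),
]

# Filter: keep only year 23. Map: upper-case first name.
firstYears = [fName.upper() for fName, lName, yy in students if yy == 23]
print(firstYears)  # ['HARRY', 'HERMIONE']

# Equivalent loop, for comparison.
loopResult = []
for fName, lName, yy in students:
    if yy == 23:
        loopResult.append(fName.upper())
```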

YOUR TURN: Try to write the following list comprehension examples:

# Example 1: Write a list comprehension that filters the vowels from a word 
# such as beauteous and returns a list of its capitalized vowels.
word = "beauteous"
newList = [char.upper() for char in word if isVowel(char)]
newList
['E', 'A', 'U', 'E', 'O', 'U']
# Example 2: Write a list comprehension that filters a list of proper nouns by length.
# It should extract nouns of length greater than 4 but less than 8 and return a list
# where the first letter is properly capitalized
# This is a challenge!
properNouns = ["cher", "bjork", "sting", "beethoven", "prince", "madonna"]
newList = [word[0].upper() + word[1:] for word in properNouns if 4 < len(word) < 8]
newList
['Bjork', 'Sting', 'Prince', 'Madonna']