Files and Comprehensions
So far in the course, we have learnt how to read from a text file and turn it into a Python data structure (such as a list of words). Today we will look at how to read from a CSV (comma-separated values) file, process the entries, and write or append to a different text file.
In the process, we will look at some code patterns involving lists, strings and counters that are useful when analyzing data.
Acknowledgement. This notebook has been adapted from the Wellesley CS111 Spring 2019 course materials (http://cs111.wellesley.edu/spring19).
Reading in a CSV File
CSV Format. A CSV (Comma Separated Values) file is a type of plain text file that stores tabular data. Each row of a table is a line in the text file, with each column on the row separated by commas. This format is the most common import and export format for spreadsheets and databases.
For example, a simple table such as the following, with columns for names and ages, would be represented in a CSV as:
Table:
| Name | Age |
|---|---|
| Harry | 14 |
| Hermoine | 14 |
| Dumbledor | 60 |
CSV:
Name,Age
Harry,14
Hermoine,14
Dumbledor,60
Python’s csv module provides an easy way to read and iterate over a CSV file.
import csv # the module must be explicitly imported
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    print(csvf)
# leaving the with block implicitly closes the file
# csvf is a csv reader object (not a file object) that can be iterated over
Iterating over a CSV object
When we iterate over a regular text file, the loop variable is a string and takes the role of each line in the file one by one in order. When we iterate over a CSV object, the loop variable is a list and takes the value of each row one by one in order.
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    for row in csvf:
        print(row)
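Since roster.csv may not be on hand, here is a minimal self-contained sketch of the same contrast. It uses io.StringIO as a stand-in for an open file, and the sample text is simply the small Name/Age table from earlier, not the course roster.

```python
import csv
import io

# Sample CSV text (the small table from earlier), standing in for a file on disk.
text = "Name,Age\nHarry,14\nHermoine,14\nDumbledor,60\n"

# Iterating over a plain text file yields one string per line.
for line in io.StringIO(text):
    print(repr(line))

# Iterating over a csv.reader yields one list of strings per row.
rows = list(csv.reader(io.StringIO(text)))
for row in rows:
    print(row)
```

Note that every value in a row is a string, so the ages come back as '14' rather than 14.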
Accumulating the rows of the CSV as a Nested List
We can iterate over a CSV file and accumulate all rows (each of which is a list) into a mega list.
rosterList = []
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    for row in csvf:
        rosterList.append(row)
rosterList # let’s see what is in the rosterList
[]
List of lists format. Notice that each item in the list is a row in the original file (in order) and the overall list is a list of rowLists. How can we access the information of a particular student from this nested list?
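One way to answer that question is with two indices: the first selects a row and the second selects a column within that row. The sketch below uses a made-up nested list (sampleRoster, built from the small table earlier) in the same shape as rosterList.

```python
# Hypothetical nested list in the same shape as rosterList:
# each inner list is one row of the CSV.
sampleRoster = [
    ['Name', 'Age'],
    ['Harry', '14'],
    ['Hermoine', '14'],
    ['Dumbledor', '60'],
]

secondRow = sampleRoster[1]     # the whole row: ['Harry', '14']
name = sampleRoster[1][0]       # first index picks the row, second picks the column
age = int(sampleRoster[1][1])   # CSV values are strings, so convert before doing math
print(secondRow, name, age)
```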
len(rosterList) # number of students in class
0
Generating random indices. Remember Homework 1 where you were asked to design an algorithm for generating random numbers? Let’s play a game where we generate random numbers between 0 and 31 and index our list with that number to see whose name comes up.
import random # import module to help generate random numbers
randomIndex = random.randint(0, 31)
# generates a random integer between 0 and 31
rosterList[randomIndex]
randomIndex = random.randint(0, 31)
rosterList[randomIndex] # great way of cold calling in lectures !
rosterList[random.randint(0,31)][0]
# Accessing just the name
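The calls above hard-code the upper bound 31, which can raise an IndexError whenever the list holds fewer than 32 rows. A hedged sketch of two alternatives that derive the bound from the list itself follows; sampleRoster is a made-up stand-in for rosterList.

```python
import random

# Made-up stand-in for rosterList.
sampleRoster = [['Harry', '14'], ['Hermoine', '14'], ['Dumbledor', '60']]

# randint includes both endpoints, so the largest valid index is len - 1.
randomIndex = random.randint(0, len(sampleRoster) - 1)
row = sampleRoster[randomIndex]

# random.choice picks a random element directly, with no index arithmetic.
otherRow = random.choice(sampleRoster)
print(row[0], otherRow[0])
```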
Reorganizing Data
Sometimes your CSV may have unnecessary data that you want to discard (such as the last column in our class roster). Additionally, your rows might have integer values stored as strings (such as class year) that you may want to convert to integers. Let us write a helper function that takes as input a list (which is a row of the CSV file) and outputs a cleaned row as a tuple. The returned tuple must have three items:
The first item must be the student's first name as a string
The second item must be the student's last name as a string
The third item must represent the graduation year (23, 22, 21, 20) as an int
def reorgData(rowList):
    """Takes a row of a CSV (as a list) and returns
    a tuple of student information"""
    # tuple assignment, splitting last name
    # and first (with middle) name
    lName, fmName = rowList[0].split(',')
    fName = fmName.split()[0]
    year = rowList[1] # takes the form '23AAA'
    yy = int(year[:2])
    return fName, lName, yy
Let us test our reorgData function on a particular random rowList from the rosterList.
randomIndex = random.randint(0, 31)
reorgData(rosterList[randomIndex])
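To see what reorgData produces without the roster file, here is a sketch on a single invented row. The row layout ('Last,First Middle', then a year code such as '23AAA', then a column to discard) is an assumption inferred from the function body, and the function is repeated only so the example runs on its own.

```python
def reorgData(rowList):
    """Same logic as reorgData above, repeated so this example is self-contained."""
    lName, fmName = rowList[0].split(',')
    fName = fmName.split()[0]
    yy = int(rowList[1][:2])
    return fName, lName, yy

# Hypothetical row: 'Last,First Middle', a year code, and a column we discard.
sampleRow = ['Granger,Hermione Jean', '23AAA', 'unused']
cleaned = reorgData(sampleRow)
print(cleaned)  # ('Hermione', 'Granger', 23)
```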
Accumulation with Lists
In previous lectures we have seen that it is common to use loops in conjunction with accumulation variables that collect results from processing elements within the loop. Let us write some functions that exercise commonly seen accumulation patterns using lists.
Exercise: Number of Students by Year
Let’s get to know our class better! We will write a function yearList which takes in two arguments, classList (list of lists) and year (int), and returns the list of students in the class with that graduating year.
def yearList(classList, year):
    result = []
    for sList in classList: # iterate over the parameter, not the global rosterList
        # tuple assignment:
        fName, lName, yy = reorgData(sList)
        if yy == year:
            result.append(fName + ' ' + lName)
    return result
len(yearList(rosterList, 23)) # how many first years in class?
0
yearList(rosterList, 23) # Names of first years
[]
len(yearList(rosterList, 22)) # how many sophomores?
0
yearList(rosterList, 22) # names of sophomores
[]
len(yearList(rosterList, 21)) # how many juniors?
0
yearList(rosterList, 21) # names of juniors
[]
len(yearList(rosterList, 20)) # how many seniors
0
yearList(rosterList, 20) # name of seniors
[]
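For a sense of what this accumulation pattern returns on non-empty data, here is a small sketch. The function name namesInYear and the sample tuples are invented for illustration; the tuples are assumed to be already cleaned into the (first name, last name, year) shape that reorgData returns.

```python
def namesInYear(students, year):
    """Accumulate the full names of students whose year matches."""
    result = []
    for fName, lName, yy in students:
        if yy == year:
            result.append(fName + ' ' + lName)
    return result

# Invented cleaned rows in the (fName, lName, yy) shape.
students = [
    ('Harry', 'Potter', 23),
    ('Hermione', 'Granger', 23),
    ('Luna', 'Lovegood', 22),
    ('Cedric', 'Diggory', 20),
]

print(namesInYear(students, 23))       # ['Harry Potter', 'Hermione Granger']
print(len(namesInYear(students, 21)))  # 0
```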
Exercise: Use our sequenceTools
We built an assortment of functions last week as part of our sequences toolkit. Let’s use some of those functions now to find out fun facts about the class. Function names in the __all__ variable of our toolkit:
isVowel
countAllVowels
countChar
wordStartEndCount
wordStartEndList
isPalindrome
We can import these functions from our module into our current interactive Python session, using the import command.
from sequenceTools import *
help(countAllVowels)
countAllVowels('onomatopoeia') # test if import works
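The sequenceTools source is not reproduced in this notebook. If the module is unavailable, the two functions used later could be sketched roughly as below. Their behavior is guessed from their names alone, so treat these as stand-ins rather than the toolkit's actual code.

```python
def isVowel(char):
    """Return True if char is a single vowel letter (either case)."""
    return len(char) == 1 and char.lower() in 'aeiou'

def countAllVowels(word):
    """Return the number of vowels in word."""
    count = 0
    for char in word:
        if isVowel(char):
            count += 1
    return count

print(countAllVowels('onomatopoeia'))  # 8
```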
Another helper function. As we will be analyzing student names, let’s create a helper function which extracts names out of the CSV rows (lists).
def getName(sInfo):
    """Takes in a row of the CSV (as a list) and returns the string
    first name concatenated with last name"""
    fName, lName, yy = reorgData(sInfo)
    return fName + ' ' + lName
getName(rosterList[random.randint(0, 31)]) # test on a random student!
Fun Facts. Who has the most vowels in their name?
def mostVowelName(classList):
    currentMax = 0 # initialize max value
    persons = [] # initialize list for names
    for sInfo in classList:
        name = getName(sInfo)
        numVowels = countAllVowels(name)
        if numVowels > currentMax:
            # found someone whose name has more vowels
            # than the current max: update persons, currentMax
            currentMax = numVowels
            persons = [name] # restart the list
        elif numVowels == currentMax:
            # does this name tie the current max?
            persons.append(name)
    return persons, currentMax
mostVowelName(rosterList) # which student has most vowels in their name?
([], 0)
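The pattern here is a running maximum that also remembers ties. A self-contained sketch on plain name strings is given below; countVowels and mostVowels are hypothetical stand-ins so the example does not depend on sequenceTools or the roster.

```python
def countVowels(name):
    """Stand-in vowel counter so the example runs without sequenceTools."""
    return sum(1 for ch in name.lower() if ch in 'aeiou')

def mostVowels(names):
    """Return (list of names tied for the most vowels, that vowel count)."""
    currentMax = 0
    best = []
    for name in names:
        n = countVowels(name)
        if n > currentMax:
            currentMax = n
            best = [name]       # new maximum: start a fresh list
        elif n == currentMax:
            best.append(name)   # tie: keep both names
    return best, currentMax

print(mostVowels(['Ron', 'Minerva', 'Luna', 'Neville']))
# (['Minerva', 'Neville'], 3)
```

Replacing the list on a new maximum and appending on a tie is what lets the function report every name that shares the top count.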
Fun Facts. How about the fewest vowels? We can reuse our getName helper to extract the student names.
def leastVowelName(classList):
    currentMin = float('inf') # initialize min value above any possible count
    persons = [] # initialize list for names
    for sInfo in classList:
        name = getName(sInfo)
        numVowels = countAllVowels(name)
        if numVowels < currentMin:
            currentMin = numVowels # update state of current min
            persons = [name] # new minimum: restart the list
        elif numVowels == currentMin:
            persons.append(name)
    return persons, currentMin
leastVowelName(rosterList) # which student has the fewest vowels in their name?
Writing to Files
We can write all the results that we are computing into a file (a persistent structure). To open a file for writing, we use open with the mode ‘w’.
The following code will create a new file named studentFacts.txt in the current working directory and write into it the results of our function calls.
with open('studentFacts.txt', 'w') as sFile:
    sFile.write('Fun facts about CS134 students.\n') # need newlines
    sFile.write('No. of first years in CS134: {}\n'.format(len(yearList(rosterList, 23))))
    sFile.write('No. of sophomores in CS134: {}\n'.format(len(yearList(rosterList, 22))))
    sFile.write('No. of juniors in CS134: {}\n'.format(len(yearList(rosterList, 21))))
    sFile.write('No. of seniors in CS134: {}\n'.format(len(yearList(rosterList, 20))))
We can use ls to see that a new file studentFacts.txt has been created:
ls # new file information
__pycache__/ lec_listPatterns_solns.ipynb
csv/ sequenceTools.py
lec_listPatterns-jeannie.ipynb studentFacts.txt
lec_listPatterns.ipynb textfiles/
Use the OS command more to view the contents of the file:
more studentFacts.txt
Alternatively, go to Finder (on a Mac) or Windows Explorer (PC) to view the contents of the file.
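Another option that works the same on any operating system is to read the file back from Python. The sketch below writes to a temporary directory so that no real file is overwritten; the count 3 is a made-up value used only for illustration.

```python
import os
import tempfile

# Write into a temporary directory so no real file is overwritten.
with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, 'studentFacts.txt')

    with open(path, 'w') as sFile:
        sFile.write('Fun facts about CS134 students.\n')
        sFile.write('No. of first years in CS134: {}\n'.format(3))  # 3 is a made-up count

    # Read the file back to check what was written.
    with open(path) as sFile:
        contents = sFile.read()

print(contents)
```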
Appending to files
How do we add lines to the end of an existing file? We can’t open the file in write mode (with a ‘w’), because that erases all previous contents and starts with an empty file.
Instead, we open the file in append mode (with an ‘a’). Any subsequent writes are made after the existing contents.
with open('studentFacts.txt', 'a') as sFile:
    sFile.write('Name with most vowels: {}\n'.format(mostVowelName(rosterList)))
    sFile.write('Name with least vowels: {}\n'.format(leastVowelName(rosterList)))
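The difference between the two modes is easiest to see side by side. This sketch uses a temporary directory and a hypothetical file name (demo.txt) so it leaves nothing behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, 'demo.txt')   # hypothetical file name

    with open(path, 'w') as f:
        f.write('first\n')
    with open(path, 'a') as f:    # append: keeps 'first', adds after it
        f.write('second\n')
    with open(path) as f:
        afterAppend = f.read()

    with open(path, 'w') as f:    # write: erases everything already there
        f.write('third\n')
    with open(path) as f:
        afterWrite = f.read()

print(repr(afterAppend))  # 'first\nsecond\n'
print(repr(afterWrite))   # 'third\n'
```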
Open the file studentFacts.txt again to view it, or use the OS command more:
more studentFacts.txt
List Accumulation Patterns
When iterating over lists, several accumulation patterns come up often. In the following questions, the premise is that we have a list we are iterating over and we are returning a new list. There are two common categories of tasks:
Mapping patterns: when you want to perform the same action on every item in the list
Filter patterns: when you want to retain only some items of the list
We can simplify the mapping/filtering patterns with a syntactic device called list comprehension. Let’s take an example of each.
Mapping Pattern via List Comprehension
We can generate a new list by performing an operation on every element in a given list. This is called mapping.
def mapDouble(nums):
    """Given a list of numbers, returns a *new* list,
    in which each element is twice the corresponding
    element in the input list.
    """
    result = []
    for n in nums:
        result.append(2*n)
    return result
mapDouble([2, 3, 4, 5])
[4, 6, 8, 10]
Succinct form using list comprehension.
def mapDoubleShort(nums):
    return [2*n for n in nums]
mapDoubleShort([6, 7, 8])
[12, 14, 16]
List of Names. Suppose we want to iterate over our nested list rosterList and collect all the student names in a list. We can do that with a simple mapping list comprehension!
nameList = [getName(sInfo) for sInfo in rosterList]
nameList
[]
Another example. Suppose we want to iterate over a list of names and return a list of first names in lower case.
def firstNames(nameList):
    """Given a list of names as 'firstName lastName', returns a list
    of the first names in lower case.
    """
    return [name.split()[0].lower() for name in nameList]
firstNames(['Shikha Singh', 'Iris Howley', 'Lida Doret'])
['shikha', 'iris', 'lida']
Filtering Pattern via List Comprehensions
Another common way to produce a new list is to filter an existing list, keeping only those elements that satisfy a certain predicate.
def filterNames(nameList):
    """Given a list of first names, returns a *new* list of all
    names in the input list that have length >= 9.
    """
    result = []
    for name in nameList:
        if len(name) >= 9:
            result.append(name)
    return result
filterNames(firstNames(nameList))
[]
We can also do this filtering pattern very succinctly using list comprehensions!
def filterNamesShort(nameList):
    return [name for name in nameList if len(name) >= 9]
filterNamesShort(firstNames(nameList))
[]
List Comprehensions Exercises
# Given a list of numbers nums
# Create a list of all numbers that are even
nums = [1, 2, 3, 4, 5, 6, 7]
result = [n for n in nums if n%2 == 0]
print(result)
[2, 4, 6]
# add the ending 'th' to all words in a phrase
phrase = "mine dog ate your shoe"
# expected phrase: ["mineth", "dogth", "ateth", "yourth", "shoeth"]
newPhrase = [word + 'th' for word in phrase.split()]
newPhrase
['mineth', 'dogth', 'ateth', 'yourth', 'shoeth']
List Comprehensions with Mapping and Filtering
It is possible to do both mapping and filtering in a single list comprehension. Examine the example below which filters a list by even numbers and creates a new list of their squares.
[(x**2) for x in range(10) if x % 2 == 0]
[0, 4, 16, 36, 64]
Note that our expression for mapping still comes before the “for” and our filtering with “if” still comes after our sequence. Below is the equivalent code without list comprehensions.
newList = []
for x in range(10):
    if x % 2 == 0:
        newList.append(x**2)
newList
[0, 4, 16, 36, 64]
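The same shape applies to roster-style data. As a sketch, assuming rows already cleaned into (first name, last name, year) tuples (the sample values are invented), one comprehension can both filter by year and map each tuple to a full name:

```python
# Invented cleaned rows in the (fName, lName, yy) shape.
students = [
    ('Harry', 'Potter', 23),
    ('Hermione', 'Granger', 23),
    ('Luna', 'Lovegood', 22),
]

# Filter: keep only year 23. Map: turn each tuple into a 'First Last' string.
firstYears = [fName + ' ' + lName for fName, lName, yy in students if yy == 23]
print(firstYears)  # ['Harry Potter', 'Hermione Granger']
```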
YOUR TURN: Try to write the following list comprehension examples:
# Example 1: Write a list comprehension that filters the vowels from a word
# such as beauteous and returns a list of its capitalized vowels.
word = "beauteous"
newList = [char.upper() for char in word if isVowel(char)]
newList
['E', 'A', 'U', 'E', 'O', 'U']
# Example 2: Write a list comprehension that filters a list of proper nouns by length.
# It should extract nouns of length greater than 4 but less than 8 and return a list
# where the first letter is properly capitalized
# This is a challenge!
properNouns = ["cher", "bjork", "sting", "beethoven", "prince", "madonna"]
newList = [word[0].upper() + word[1:] for word in properNouns if 4 < len(word) < 8]
newList
['Bjork', 'Sting', 'Prince', 'Madonna']