Files¶
Today, we will discuss the following:
- Representing data using the file abstraction
- Opening and reading data from files
- String manipulation to "clean" and "format" file data
- Storing file data inside data structures we've learned about to perform useful tasks
Files¶
Files are persistent data, usable between sessions and applications!
File Extensions¶
Every file has a name, and most files have an "extension" that signals the way that the file is formatted
- Python files typically end in .py
- PDF documents typically end in .pdf
The extension doesn't restrict or change the file's contents, but it signals how we should interpret the data stored inside
- Try the following at home:
  - Take your last lab and copy one of the files so that its extension is no longer .py: cp runtests.py runtests.copy
  - Open that file inside Visual Studio Code. Does it look different?
    - Yes! There is no longer syntax highlighting for Python keywords
  - Run that file as a script using the python3 program: python3 runtests.copy
    - It behaves exactly the same!
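The same experiment can be sketched from Python itself. This is a minimal illustration, not part of the lab: the file names and contents are made up, and a temporary directory is used so that nothing on your computer is changed.

```python
import os
import shutil
import tempfile

# Hypothetical file names and contents, used only for illustration.
with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "hello.py")
    renamed = os.path.join(folder, "hello.copy")

    # Write a tiny Python program to the original file.
    with open(original, "w") as out_file:
        out_file.write('print("hello")\n')

    # Copy it to a file with a different extension (like `cp` in the terminal).
    shutil.copy(original, renamed)

    # Read both files back so we can compare their contents.
    with open(original) as first:
        original_text = first.read()
    with open(renamed) as second:
        renamed_text = second.read()

# The extension changed, but the stored data did not.
same_contents = (original_text == renamed_text)
print(same_contents)
```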
Text Files¶
We will primarily work with "text files" when reading data into our Python programs.
- Text files contain words, numbers, and punctuation characters that are "human readable"
- Text files can be opened and read/written in a text editor like Visual Studio Code
Example text files:
- Python program code (typically ends in .py)
- Comma-separated data sets (typically end in .csv)
Example non-text files:
- A PDF document
- PDFs contain metadata that describes formatting and other information
- Image files
- Microsoft Word Document
- Like a PDF, Word documents contain information used by the Microsoft Word program to format the document's data
Line-based Files¶
We often think of text files as having a "line-based" organization, where every line is a separate unit of text
- The end of a line in the text file is determined by the special newline character '\n'
- Inside our text editor, hitting the "enter" key inserts a '\n' in the file's data
- Our text editor interprets this character as a "newline" and starts the next character on the next line
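A small sketch can make the newline character visible. The string below is made up for illustration; repr shows the string the way Python stores it rather than the way a text editor displays it.

```python
# A made-up string holding two lines of text.
text = "first line\nsecond line\n"

print(text)        # displayed as two separate lines
print(repr(text))  # shows the '\n' characters explicitly

# Count the newline characters by iterating over the string.
newline_count = 0
for char in text:
    if char == '\n':
        newline_count += 1
print(newline_count)
```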
Reading from a text file¶
Within a with open(filename) as input_file: block, we can iterate over the lines in the file just as we would iterate over any sequence such as lists, strings, or ranges.
Example: We have a text file mountains.txt within a directory data. We can iterate and print each line as follows:
with open("data/mountains.txt") as lyrics_file:
    for line in lyrics_file:
        print(line)
Because each line in a file ends with a newline character '\n', and print(some_string) adds another newline after whatever it prints, we end up with an empty line between each printed line! We can see the '\n' stored at the end of each line by collecting the lines in a list:
with open("data/mountains.txt") as lyrics_file:
    result = []
    for line in lyrics_file:
        result += [line]
print(result)
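If you don't have mountains.txt handy, the same effect can be reproduced with a stand-in. This sketch assumes io.StringIO, a standard-library object that behaves like an open text file but lives in memory; the two lines of text are made up.

```python
import io

# Stand-in for an open text file; the contents are made up.
fake_file = io.StringIO("Everest\nK2\n")

lines = []
for line in fake_file:
    lines += [line]

# Each line still ends with its own '\n' ...
print(lines)

# ... and print() adds one more, so a blank line appears after each name.
for line in lines:
    print(line)
```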
So, we want to remove that newline at the end of each line, if it exists.
Note that the \n character is only one character, not two!
len('\n')
To follow Python's naming conventions, we'll call the function that removes all whitespace from the "end" of a string rstrip (right strip):
def rstrip(line):
    ''' Removes all whitespace RIGHT of str line'''
    if not line: # handle empty line
        return line
    SPACE_CHARS = ['\n', '\t', '\r', ' ']
    # find the last non-space character
    end_index = len(line)-1
    # use >= 0 so that an all-whitespace line such as '\n' becomes ''
    while end_index>=0 and line[end_index] in SPACE_CHARS:
        end_index -= 1
    return line[:end_index+1]
While we're at it, we can also remove all space characters at the "start" of a line in lstrip:
def lstrip(line):
    ''' Removes all whitespace LEFT of str line'''
    if not line: # handle empty line
        return line
    SPACE_CHARS = ['\n', '\t', '\r', ' ']
    # find the first non-space character
    start_index = 0
    while start_index<len(line) and line[start_index] in SPACE_CHARS:
        start_index += 1
    return line[start_index:]
And putting the strips together, we can strip all whitespace from both the start and the end:
def strip(line):
    return lstrip(rstrip(line))
print(strip("\n\n\n\nNewlines in this string.\n\n\n\n\n"))
print(strip(" Spaces are here "))
print(strip("\t\n A mix\tof things\nin different, spots\t\r\n "))
with open("data/mountains.txt") as lyrics_file:
    for line in lyrics_file:
        print(strip(line)) # remove all `\n` from line so the only `\n` is the one inserted by the print function
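For comparison, Python strings already come with methods named rstrip, lstrip, and strip, which is where our function names come from. The sketch below is only a preview of those built-in methods on a made-up string; called with no arguments, they remove spaces, tabs, newlines, and other whitespace characters.

```python
# A made-up string with whitespace on both ends and in the middle.
messy = "\t\n  A mix\tof things \n"

right_stripped = messy.rstrip()  # whitespace removed from the end only
left_stripped = messy.lstrip()   # whitespace removed from the start only
both_stripped = messy.strip()    # whitespace removed from both ends

# repr makes any remaining whitespace characters visible.
print(repr(right_stripped))
print(repr(left_stripped))
print(repr(both_stripped))
```

Whitespace in the middle of the string (the tab between "mix" and "of") is left alone in all three cases, just as in our hand-written versions.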
Useful String and List Functions in File Reading¶
When reading files, we may need to use some common string and list operations to work with the data. We'll learn about Python's built-in features for these later in the semester, but we can write our own by iterating over strings and using accumulator variables!
- strip(line): Remove any leading/trailing white space or \n
- split(line, ','): Separate a comma-separated sequence of words and create a list of strings
- join(' ', string_list): Create a single “big” string with words separated by spaces instead of commas
- get_count(elem, l): Count the number of occurrences of some element within a list
- …and so on!
We've written get_count(..) in our Lab 4 Pre-Lab, and we'll be writing more of these in future Labs!
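As one illustration of the loop-and-accumulator approach, here is a possible sketch of join. The argument order follows the list above (separator first), but the body is an assumption about how one might write it, not the course's official solution.

```python
def join(separator, string_list):
    """Combine the strings in string_list into one string,
    placing separator between neighboring items."""
    result = ""  # accumulator variable
    for index in range(len(string_list)):
        if index > 0:
            result += separator  # separator goes between items only
        result += string_list[index]
    return result

print(join(' ', ['A', 'quick', 'brown', 'fox']))
```

Checking index > 0 before adding the separator avoids a stray separator at the start or end of the result.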
Comma Separated Values Files¶
Python can read in any file, but text files (such as .txt, .csv, .py, .html, etc.) often store the data we care about in our programs.
A CSV (Comma Separated Values) file is a specific type of plain text file that stores “tabular” data. Each row of a table is a line in the text file, with each column on the row separated by commas. This format is a common import and export format for spreadsheets and databases.
CSV Example¶
Name,Age
Charlie Brown,8
Snoopy,72
Patty,7
Since CSVs are just text files, we can process them in the same way as any other text file. And we can exploit their well-defined structure by using additional post-processing/splitting using string methods!
with open("data/superheroes.csv") as roster:
    for line in roster:
        print(strip(line))
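To make the row/column structure concrete, the sketch below processes the small Name,Age table from the CSV example. It assumes io.StringIO as an in-memory stand-in for an open file, and for brevity it uses the built-in strip and split string methods rather than hand-written versions. The first line of that table is a header, so it is skipped before the ages are converted to integers.

```python
import io

# In-memory stand-in for a CSV file containing the example table.
csv_file = io.StringIO("Name,Age\nCharlie Brown,8\nSnoopy,72\nPatty,7\n")

names = []  # accumulator for the first column
ages = []   # accumulator for the second column
is_header = True
for line in csv_file:
    if is_header:
        is_header = False  # skip the "Name,Age" header row
    else:
        columns = line.strip().split(',')
        names += [columns[0]]
        ages += [int(columns[1])]  # convert the age text to an integer

print(names)
print(ages)
```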
# WE'LL DO THIS AS A PRE-LAB QUESTION!
def split(line, delimiter):
    """ Splits a given str, line, into separate elements divided
    by str, delimiter. Returns items as a list of str.
    >>> split("A,quick,brown,fox", ",")
    ['A', 'quick', 'brown', 'fox']
    """
    return line.split(delimiter) # DO NOT USE THIS IN YOUR PRE-LAB. USE LOOPS & ACCUMULATOR VARIABLES.
# accumulator variables
names = []
powers = []
movies = []
with open("data/superheroes.csv") as roster:
    for line in roster:
        line = split(strip(line), ',') # remove trailing newline & split on commas
        names = names + [line[0]]
        powers = powers + [line[1]]
        movies = movies + [int(line[2])] # convert movie count to integer!
print(powers)
print(movies)
Now let's actually count the number of appearances of each type of superpower! We'll need the get_count() function we've previously written.
def get_count(ele, lst):
    count = 0
    for item in lst:
        if ele == item:
            count += 1
    return count
unique_powers = list(set(powers))
count_list = []
for pwr in unique_powers:
    count_list += [get_count(pwr, powers)]
print(unique_powers)
print(count_list)
There are 4 superheroes with Strength power, 1 with Telekinesis, 1 with Light, etc. etc.
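Because unique_powers and count_list line up index by index, we can print each power next to its count. The sketch below uses a small made-up powers list (not the contents of superheroes.csv) and repeats get_count so that it runs on its own. Converting a set to a list gives no guaranteed order, so this sketch sorts the unique powers to make the output predictable.

```python
def get_count(ele, lst):
    count = 0
    for item in lst:
        if ele == item:
            count += 1
    return count

# Made-up data for illustration only.
powers = ["Strength", "Flight", "Strength", "Light", "Flight", "Strength"]

# sorted() returns a list of the unique powers in alphabetical order.
unique_powers = sorted(set(powers))
count_list = []
for pwr in unique_powers:
    count_list += [get_count(pwr, powers)]

# Items at the same index in the two lists belong together.
for index in range(len(unique_powers)):
    print(unique_powers[index], count_list[index])
```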
Using matplotlib to visualize the data¶
Now from these two lists, we can start to plot the data to visualize it.
matplotlib is a Python package for visualizing data. Let's explore some of its basic functionality.
# let's first look at some basic matplotlib functionality
import matplotlib.pyplot as plt
# we can create a simple line plot as follows
# plt.plot(xValuesList, yValuesList)
# the following example corresponds to points (1,19), (2,14), etc
plt.plot([1, 2, 3, 4], [19, 14, 15, 18])
plt.show()
Decorating a Plot¶
We can specify the figure's width and height, a title, axis labels, tick labels, etc. to decorate our plot and add important details.
# a more advanced example where we customize the line plot
# create a 4 by 4 figure
plt.figure(figsize=(4, 4))
plt.plot([0, 5, 10], [4, 12, 14])
plt.xticks([0, 5, 10],          # x values of axis `ticks`
           ['x1', 'x2', 'x3'])  # values to show for `ticks`
# specify y-tick locations
plt.yticks([4, 12, 14])
# axis labels and title
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.title("Custom plot")
plt.show()
We can use these features to make interesting plots for data analysis:
x_values = [1970, 1980, 1990, 2000, 2010, 2020]
y_values = [1, 4, 0, 2, 3, 3]
# create 4x4 figure
plt.figure(figsize=(4, 4))
plt.plot(x_values, y_values)
plt.xticks([1970, 1990, 2010],           # x values of axis `ticks`
           ["1970s", "1990s", "2010s"])  # values to show for `ticks`
# specify y-tick locations
plt.yticks([0, 2, 4])
# axis labels and title
plt.xlabel("Decade")
plt.ylabel("Count")
plt.title("Superman Movies By Decade")
plt.show()
Creating a bar plot¶
Let's return to our superhero data and create a bar graph that illustrates how many superheroes have each superpower.
# Going back to our initial goal of creating a bar plot
x_values = unique_powers
y_values = count_list
# Create a new figure:
plt.figure()
# Make it a bar chart
plt.bar(x_values, y_values)
# rotate by 90 so labels are vertical and do not overlap
plt.xticks(x_values, rotation=90)
# Set title and label axes
plt.title("Count of Superpowers")
plt.xlabel("Superpowers")
plt.ylabel("Count")
# specify y axis range
plt.ylim([0, 10])
# Show our chart:
plt.show()
# if you'd like to save the plot as a PDF:
# this line just ensures the longer labels on the x axis don't get cut out
# plt.tight_layout()
# plt.savefig('superpowers.pdf')
Next class we'll explore data analysis and wrangling even more, to produce more interesting plots!