In the “real world”, data is everywhere. But the real world is, unfortunately, a very messy place.
You will often find that your first task when interacting with data
is to convert the data to a usable form. We will build some of these
skills by searching, parsing, and transforming text. In particular, this
lab will further our experience with the Python string data type
(str
), the Python list data type (list
), and
with using the for-each loop pattern to iterate through data stored in
sequences.
We will start with a few seemingly unrelated functions, and we will combine them to complete a concrete task: at the end of this lab, we will have written a small program to solve Madlib-style puzzles, which you can create and solve on your own.
This lab handout is like a “recipe” in the sense that adding the “ingredients” in order will lead to something more-or-less “edible”. But there is a difference between following a recipe and becoming a chef. That transition takes time and practice, and we’ll get there. Our hope is that, as we continue to build our CS toolboxes, the lab handouts will leave more room for creativity and problem-solving. For now, the best thing to do is to read through the entire handout first so you can appreciate the way the pieces will fit together. Nothing we’ll be doing in this assignment exists in isolation; the “big picture” is important for appreciating the utility of each function.
Your pre-lab task is to write a function that takes two string
arguments: string
, which can be any string value, and
char
, which should be a string of length 1 (i.e.,
char
should be a single character). For example,
string
could be the string "Hello world!"
, and
char
could be the string "r"
. Given these
arguments, your function should return a new string defined as
follows:
char
appears inside the sequence of characters in
string
, then your function should return all of the
characters in string
that appear before the first
occurrence of char
.char
is not a single character or if
char
does not appear inside the sequence of characters in
string
, then your program should return all of the
characters in string
Below is the result of calling
all_text_before(string, char)
on a few different inputs
inside an interactive Python session. Notice that string values are
being displayed; this is because the function returns
those values. Your function should not print the value it computes (in
fact, it should not print anything!). Instead, it should
return a string value.
>>> all_text_before("Hello World!", "r")
'Hello Wo'
>>> all_text_before("Hello World!", " ")
'Hello'
>>> all_text_before("Hello World!", "")
'Hello World!'
>>> all_text_before("Hello World!", "World")
'Hello World!'
>>> all_text_before("Hello World!", "H")
''
Why bother implementing this function? Why might it be useful? Well, we often want to manipulate data that conforms to particular formatting rules. For example, a common file format is a CSV file (a file with “comma separated values”), where each line in the file contains a list of individual values each separated by a comma. We could use this function to get the first column from a CSV, or, with slight modifications, we could break apart an entire line into its individual values.
Later in this lab, you will write a related function,
all_text_after(string, char)
as a building block for
solving a Madlibs-style puzzle (details below!), so taking the time to
think about this pre-lab function will pay dividends later.
As you work on this problem, try to make a mental note of any decisions you found yourself making, especially when you weren’t sure which choice was the “best” one. We’d love to talk about these choices and why you made the choice you did. We’ve found that this pattern of reflection rapidly advances our understanding. The “why” is just as important as the “what”!
Open the Terminal and go to your cs134
directory
using the cd
(change directory) command you used during Lab
1. (If you are working at a new or different computer, you may want to
create a new cs134
directory if one does not already exist
with the mkdir
command.)
Now we must retrieve the files we need to complete this week’s
lab. Navigate to https://evolene.cs.williams.edu
in your browser and log in using your CS credentials. Under
Projects, you should see a repository named
cs134-labs/23xyz3/lab03
(where 23xyz3
is your
CS username). This repository contains your starter files for this
week’s lab.
Clone the repository: find the blue button that is a drop-down
menu that says Clone
. Click on it and click on the
“clipboard” icon (Copy URL) option next to the option to
Clone with HTTPS
.
Return to the Terminal and type git clone
followed by
the URL of the lab project you just copied (you can paste on Windows by
pressing Ctrl-V
and on a Mac by pressing
Command-V
). This should look like the following:
git clone https://evolene.cs.williams.edu/cs134-labs/23xyz3/lab03.git
(where 23xyz3
is again a place-holder for your CS
username.)
Navigate to your newly created lab03 folder in the Terminal:
cd lab03
Explore the contents of the directory using the ls
command in the Terminal.
Open VS Code. Then go to File
menu option and choose
Open Folder
. Navigate to your lab03
directory
and click Open
. You should see the starter files of today’s
lab, including madlibs.py
, on the left pane of VS Code. All
of the code that you write this week (other than tests) will be in
madlibs.py
.
This lab is designed to be completed using only material we have
discussed so far in class. You will need to use for
loops
to iterate through sequences, +=
to update “accumulator
variables”, ==
to test the equality of two expressions, and
if
/elif
/else
to execute code when
certain conditions hold.
Although not strictly necessary, you may also benefit from other
Python language features, including sequence “slicing”, using the
len()
function to calculate a sequence’s length, using
range()
to generate a sequence of integers, or accessing
specific sequence elements using square brackets ([]
).
You SHOULD NOT use any language features or concepts that we have not yet discussed in lectures, even if they may seem like reasonable tools for parts of this assignment. We have intentionally chosen to focus on algorithmic thinking instead of Python-specific ways of doing things, and we will explore these so-called “Pythonisms” later. In particular, do NOT use any “string methods” or “list methods” (and if you don’t know what those are yet, that is the point of this note!).
A key skill for an algorithmic thinker is the ability to break a complex task into a set of smaller problems, that when solved individually, help solve the larger problem. We’ve referred to this as problem decomposition, and we’ve done a lot of that decomposition for you. The next steps of this lab will introduce those building blocks to you in an order that allows you to develop and test your code incrementally. Then, we will discuss the format that we will be using to represent our Madlibs puzzles, and we will strategize ways to use our building blocks to print the text of completed Madlibs puzzles.
A sequence pre
is said to be a prefix of another
sequence seq
if each successive element in pre
appears in the same order at the beginning of seq
. For
example, here are all prefixes of word
:
""
, "w"
, "wo"
,
"wor"
, "word"
Since a Python string is an ordered sequence of individual characters, we can determine that one string is a prefix of another by comparing the characters in those strings, in order, and confirming that they match.
You should implement this functionality in the Python function
is_prefix(prefix, string)
, which takes two strings as
input: prefix
and string
. Your function should
return True
if prefix
is a prefix of
string
, and False
otherwise. Below is the
result of calling is_prefix()
on a few inputs inside an
interactive Python session. Notice that the values True
and
False
are being displayed; this is because the function
returns those values. Your function should not print
“True” or “False” (in fact, it should not print anything!). Instead, it
should return a boolean
value.
>>> is_prefix("pre", "prefix")
True
>>> is_prefix("abc", "prefix")
False
>>> is_prefix("", "prefix")
True
>>> is_prefix("prefix", "prefix")
True
>>> is_prefix("prefixing", "prefix")
False
At first glance, it may seem like we need to iterate through the two sequences in unison. There is no way to do this in a single for-each loop without additional Python tricks, so for now, we’ll need to think of another strategy. Luckily, we are equipped with the tools to tackle this problem in several different ways!
One strategy is to, instead of iterating through either
string
or prefix
, iterate through something
else. The trick is to choose a “something else” that lets us refer to
the appropriate elements from our two actual sequences so we can compare
them. An integer lets us index into a sequence using the “square
bracket” notation (e.g., string[0]
), so constructing the
right integer sequence to serve as our indices would allow us to compare
the corresponding elements of string
and
prefix
to each other. Python’s range()
may
help!
Another observation that could help is that, when given two strings
str_a
and str_b
, the boolean expression
str_a == str_b
evaluates to True
if the
characters in the strings str_a
and str_b
are
an exact match. We may be able to construct appropriate subsequence of
string
and prefix
that we can use for
comparison.
To test your is_prefix()
implementation, you can use the
runtests.py
program. Since there are multiple functions we
will be testing in this program, we’ve added the ability to specify
which function’s tests to run. To run the is_prefix()
tests, you should give the additional argument “pre”:
python3 runtests.py pre
A related-but-slightly-trickier problem to the “prefix identification” problem above is the task of identifying suffixes. Before diving into what makes it tricky, let’s define the task.
A sequence suf
is said to be a suffix of another
sequence seq
if each successive element in suf
appears in the same order at the ending of seq
. For
example, here are all suffixes of word
:
""
, "d"
, "rd"
,
"ord"
, "word"
You should implement the functionality to identify whether a string
is a suffix of another inside is_suffix(suffix, string)
.
Like is_prefix()
, this function will also take two strings
as input: suffix
and string
. It should return
True
if suffix
is a suffix of
string
, and False
otherwise. Below are some
sample invocations from within an interactive Python session:
>>> is_suffix("fix", "suffix")
True
>>> is_suffix("abc", "suffix")
False
>>> is_suffix("", "suffix")
True
>>> is_suffix("suffix", "suffix")
True
>>> is_suffix("fixing", "suffix")
False
Why did we suggest that this task might be “harder” than identifying
a prefix? When we iterate through a sequence using the for-each pattern,
our iteration always starts at the sequence’s first element. For the
prefix version of this problem, the first elements of
string
and prefix
both correspond to the
characters that we want to compare. So do the second elements, the third
elements, and so on. However, when identifying most suffixes, we’ll want
to start somewhere in the middle of string
—not at its
beginning.
One way to approach this problem is to structure our for-each loop so
that it iterates over something other than the elements of
string
. Suppose we instead iterate over a sequence of
integers. As long as we can convert those integers into the elements of
our strings that we care about, we can compare them appropriately. The
question, then, is what sequence of integers, and how do we convert
those integers into the desired indices within string
and
suffix
?
Another option is modify string
so that when we
do iterate through its elements, we are only iterating through
elements that we care about. “Slicing” string
may let us
“start in the middle” and treat this problem similarly to prefix
identification. In fact, we may even be able to reuse the work we did in
that function!
Like many problems in computer science, there are multiple correct ways to attack it (including strategies not described here). You should choose the approach that matches your thought process—it will be easier to code and debug that way!
To test your is_suffix()
implementation, you can use the
runtests.py
program with the “suf” argument:
python3 runtests.py suf
In our pre-lab task, we wrote a function that returns the sequence of all characters in a string that precede the first occurrence of some other character. Although we didn’t use the term, this function extracted a prefix from that string. Now we will invert this problem and extract a string’s suffix.
Your third lab task is to write a function called
all_text_after(string, char)
that takes two string
arguments: string
, which can be any string value, and
char
, which should be a string of length 1 (i.e.,
char
should always be a single character). For example,
string
could be the string "Hello world!"
, and
char
could be the string "r"
. Given these
arguments, your function should return a new string
defined
as follows:
char
is not a single character (i.e., the string’s
length is not 1), then your function should return the empty stringchar
appears inside the sequence of characters in
string
, then your function should return all of the
characters in string
that appear after the first
occurrence of char
.char
does not appear inside the sequence of
characters in string
, then your function should return the
empty string (""
)Below is the result of calling
all_text_after(string, char)
on a few different inputs
inside an interactive Python session. Notice that string values are
being displayed; this is because the function returns
those values. Your function should not print the value it computes (in
fact, it should not print anything!). Instead, it should
return a string value.
>>> all_text_after("Hello World!", "r")
'ld'
>>> all_text_after("Hello World!", " ")
'World!'
>>> all_text_after("Hello World!", "")
''
>>> all_text_after("Hello World!", "World")
''
>>> all_text_after("Hello World!", "!")
''
>>> all_text_after("Hello World!", "H")
'ello World!'
The idea for this function is very similar to
all_text_before()
, but instead of stopping our
accumulation of characters when we find the first occurrence of
char
, we want to start accumulating characters
after the first occurrence of char
. We may pass through
several iterations of our for
loop before identifing a
match, so it may be helpful to create a boolean variable that changes
its value once the first match is found. You can use an if
inside your for
loop to selectively execute the
“accumulation” of characters after your condition is met.
To test your all_text_after()
implementation, you can
use the runtests.py
program with the “after” argument:
python3 runtests.py after
At this point we’ve developed several “utility functions” that will serve as useful building blocks when writing our Madlibs puzzle game. But what exactly is a “Madlibs puzzle”, and how is it represented as data in our program?
We’ve broken each Madlibs Puzzle into two files:
.story
).answerkey
)Puzzle story files contain text interspersed with
one or more “placeholders”. Placeholders are all encoded as single words
that start with a less-than symbol (<
), end with a
greater-than symbol (>
), and have one or more
alphanumeric characters between those symbols (non-punctuation,
non-whitespace characters). Typically, the text between the symbols is a
descriptor and a number. For example, <adjective1>
and <noun4>
are two placeholders in the provided
stories. These placeholders will eventually be substituted with words
that match the placeholder’s description, completing the story.
As an example, an excerpt from a story file named
nursery_rhyme.story
might contain the following text:
Mary had a/an <adjective1> lamb. It's <noun1> was <adjective2> as <noun2>.
If we replaced “<adjective1>” with “little”, “<noun1>” with “fleece”, “<adjective2>” with “white”, and “<noun2>” with “snow”, we’d have a (possibly) familiar excerpt from a nursery rhyme:
| Mary had a/an little lamb. It’s fleece was white as snow.
On the other hand, if we replaced “<adjective1>” with “enormous”, “<noun1>” with “mousepad”, “<adjective2>” with “loud”, and “<noun2>” with “rhombus”, we’d have something that is grammatically correct, but rather silly.
| Mary had a/an enormous lamb. It’s mousepad was loud as rhombus.
Puzzle key files are lists of “swap-key=swap-value
pairs” that describe the way that the placeholders from a story file
will be updated. Each “swap-key=swap-value pair” consists of a
placeholder (as described above), an equals sign (=
), and
text that will be substituted into the puzzle as a replacement for all
occurrences of the placeholder. For example, a puzzle key file for the
substitution above would include the lines:
<adjective1>=enormous
<noun1>=mousepad
<adjective2>=loud
<noun2>=rhombus
We called these “swap-key=swap-value” pairs because when we see an occurrence of a “swap-key” in our puzzle story file, we will look into our puzzle key file to find the matching “swap-value” that should replace it.
The game. Hopefully the above example illustrates
that a Madlib puzzle is, in some sense, a game. Outside of your
program, you can play this game by filling in the contents of a puzzle
key file. To have the most fun, you should avoid reading the puzzle
files (files that end in .story
). Instead, open up a puzzle
key file (a file that ends in .answerkey
) inside VSCode,
and fill in the puzzle key with words that you would like to use as
substitutions for the placeholders. The placeholder should describe the
qualities of a word that will fit into the story—grammatically and
thematically—but without knowing the context of the puzzle, the words
that you choose will often lead to humorous substitutions.
The rest of the lab will walk us through the steps to finish a program that takes a puzzle story file and a completed puzzle key file, and outputs the (hopefully humorous) version of the story that includes the substitutions.
To implement our Madlibs game, we need a way to store the contents of puzzle story files and puzzle key files in variables that our program can manipulate. Since our primary focus is on building our skills with strings and lists, we have implemented this file-reading functionality for you. At the top of your starter code, you should see two lines:
from text_utils import read_stringlist_from_file
from text_utils import formatted_madlib
These two lines allow you to invoke the functions
read_stringlist_from_file(filename)
and
formatted_madlib(string_list)
from inside your code. We’ve
provided the full implementations of these functions for you inside the
file text_utils.py
. You should not change the contents of
that file.
The function read_stringlist_from_file(filename)
takes
the location of a file as input (the filename
parameter’s
type is str
), and it returns a list of all of the words and
punctuation in that file. So for example, if I had a file named
"sample.story"
with the contents:
Uh-oh, I forgot to <verb1> for the <schoolsubject1> exam!
and a file named "sample.answerkey"
with the
contents:
<verb1>=study
<schoolsubject1>=CS134
Then the following interactive Python session output displays its behavior:
>>>read_stringlist_from_file("sample.story")
['Uh', '-', 'oh', ',', 'I', 'forgot', 'to', '<verb1>', 'for', 'the', '<schoolsubject1>', 'exam', '!']
>>>read_stringlist_from_file("sample.answerkey")
['<verb1>=study', '<schoolsubject1>=CS134']
The behavior of the inverse function, which takes a list
of strings (e.g., the text of a Madlibs puzzle—solved or not) and prints
it to the screen with nice formatting, is shown below:
>>> story_list = read_stringlist_from_file("sample.story")
>>> formatted_madlib(story_list)
'Uh-oh, I forgot to <verb1> for my <schoolsubject1> exam!'
Now is the point in the lab where we start to put all the pieces together. We’ll first present the algorithm, and then complete one last helper function before implementing the algorithm itself—problem decomposition is a continuous process!
So, at a high level, what is it that we need to do? We need to read in a puzzle’s story and key, then go through the story and substitute each placeholder with its matching swap-value from the puzzle’s key. The concrete steps below describe this “Madlibs algorithm” more explicitly to give us some context for our next task. You may be tempted to start coding it up, but hold of on implementing this algorithm! That step is coming up soon.
list
of strings.list
of strings.list
.
If a string is not a placeholder, then add the string
to our puzzle solution’s accumulator variable. If the string is
a placeholder (i.e., it begins with a "<"
and
ends with a ">"
), then it needs to be replaced. We can
do this by iterating through all the swap-keys in our puzzle
key, and when we find a match, adding the associated swap-value
to our puzzle solution accumulator variable.Now that we have described our algorithm, can you identify any steps from this algorithm that we can further decompose? In step 4, there are quite a few moving parts. Particularly, the task of searching for a swap-key and retrieving its associated swap-value looks daunting. Let’s break this down into its own function that we can use as our last building block.
Every time we identify a placeholder within our puzzle
story, we need to find the corresponding swap-value from within
the puzzle key. For example, if we see "<adjective1>"
in a story, we need to look through the puzzle key to find the string
that starts with "<adjective1>"
(a
swap-key), and then extract the matching swap-value.
We will implement this behavior inside the function
get_madlibs_replacement(placeholder, madlibs_key_list)
.
This function has two parameters: placeholder
, which is
a swap-key such as "<adjective1>"
, and
madlibs_key_list
, which is a list of “swap-key=swap-value
pairs” that we read from a puzzle key file.
Luckily, we’ve already written the hardest parts of this function.
Since placeholder
is the swap-key in one of the
“swap-key=swap-value pairs” found in madlibs_key_list
, we
just need to iterate through the elements in
madlibs_key_list
to find it. Once we’ve found a match, the
swap-value is all of the characters that appear after the
"="
. (If we don’t find a match, just return
placeholder
since no subsitution is possible.)
We can use our is_prefix()
function to identify the
matching swap-key, and the all_text_after()
function to extract the swap-value. Finally, we return that
suffix.
To test your get_madlibs_replacement()
implementation,
you can use the runtests.py
program with the “replace”
argument:
python3 runtests.py replace
The last step is to write
solved_madlibs(madlibs_story_list, madlibs_key_list)
, which
implements the Madlibs algorithm and returns a string containing the
completed story. As input, the function takes two parameters:
madlibs_story_list
is a list
of strings
that stores the contents of a Madlibs puzzle story.
This parameter will typically be the return value from calling
read_stringlist_from_file()
on a story file (ends in
.story
)madlibs_key_list
is a list
of strings that
stores the contents of a Madlibs puzzle key. This
parameter will typically be the return value from calling
read_stringlist_from_file()
on a key file (ends in
.answerkey
). Each element of this list
is a
“swap-key=swap-value” pair.The function should return a string
.
Given these two lists, you should do the following:
madlibs_puzzle
(the
contents of the story), and every time you encounter a string that is
not a placeholder, add it to your accumulator variable (that is
a list
of strings
) that stores every string in
the solved story. Every time you encounter a string that is a
placeholder, use your get_madlibs_replacement()
function to look up the matching swap-value from
madlibs_key_list
, and add that swap-value to your
accumulator variable.list
that contains all the
strings that are part of your completed puzzle. Convert the contents of
this list
into a string using the provided
formatted_madlib(solved_puzzle)
function that we’ve
implemented for you in text_utils.py
. The argument to this
function should be your accumulator variable. Your function should
return the resulting string.To test your solved_madlibs()
implementation, you can
use the runtests.py
program with the “solve” argument:
python3 runtests.py solve
At this point in the lab, you’ve written all of the functions you need to complete a Madlibs puzzle game. However, until we invoke those functions, we aren’t actually done. At the bottom of your program, there is a section that resembles the lines:
if __name__ == "__main__":
# comments ...
# more comments ...
pass
The section of your program is a standard part of many
.py
files. It is a conditional (an if
statement followed by a boolean expression, followed by an indented
section of code), which means that the indented region of code in its
body will only be executed if the condition evaluates to
True
. But what does the expression
__name__ == "__main__"
mean, and in what circumstances is
it True
or False
?
We will explore this syntax in more detail later this semester, but the important things to know are this:
program.py
as a
script (i.e., from the Terminal you type
python3 program.py
), this statement evaluates to
True
and the code is executed.import program
), this condition evaluates
to False
and the code is not executed.This is a very helpful feature. It gives us a place where we can put code in our programs that only executes when we “run” the program. This is often where we’ll write tests or where we’ll include code that reads arguments from the terminal and executes our functions using those arguments.
Now, make it so that running your program actually plays your game! You may wish to modify your program to resemble the following:
if __name__ == "__main__":
# Ask the user which Madlibs puzzle they want us to solve
= input("What puzzle story file would you like to use? ")
story_filename = input("What puzzle key file would you like to use? ")
key_filename
# Read the contents of the puzzle files into variables we can use
= read_stringlist_from_file(story_filename)
story_list = read_stringlist_from_file(key_filename)
key_list
print("Here is your completed Malib:")
print("")
# Solve and print the puzzle's solution
print(solved_madlibs(story_list, key_list))
Finally, to run your completed program, you can type the following command in the Terminal:
python3 madlibs.py