Lab 3

In the “real world”, data is everywhere. But the real world is, unfortunately, a very messy place.

You will often find that your first task when interacting with data is to convert the data to a usable form. We will build some of these skills by searching, parsing, and transforming text. In particular, this lab will further our experience with the Python string data type (str), the Python list data type (list), and with using the for-each loop pattern to iterate through data stored in sequences.

We will start with a few seemingly unrelated functions, and we will combine them to complete a concrete task: at the end of this lab, we will have written a small program to solve Madlib-style puzzles, which you can create and solve on your own.

This lab handout is like a “recipe” in the sense that adding the “ingredients” in order will lead to something more-or-less “edible”. But there is a difference between following a recipe and becoming a chef. That transition takes time and practice, and we’ll get there. Our hope is that, as we continue to build our CS toolboxes, the lab handouts will leave more room for creativity and problem-solving. For now, the best thing to do is to read through the entire handout first so you can appreciate the way the pieces will fit together. Nothing we’ll be doing in this assignment exists in isolation; the “big picture” is important for appreciating the utility of each function.

Pre-Lab Function

Your pre-lab task is to write a function that takes two string arguments: string, which can be any string value, and char, which should be a string of length 1 (i.e., char should be a single character). For example, string could be the string "Hello world!", and char could be the string "r". Given these arguments, your function should return a new string defined as follows:

Below is the result of calling all_text_before(string, char) on a few different inputs inside an interactive Python session. Notice that string values are being displayed; this is because the function returns those values. Your function should not print the value it computes (in fact, it should not print anything!). Instead, it should return a string value.

>>> all_text_before("Hello World!", "r")
'Hello Wo'
>>> all_text_before("Hello World!", " ")
'Hello'
>>> all_text_before("Hello World!", "")
'Hello World!'
>>> all_text_before("Hello World!", "World")
'Hello World!'
>>> all_text_before("Hello World!", "H")
''

Why bother implementing this function? Why might it be useful? Well, we often want to manipulate data that conforms to particular formatting rules. For example, a common file format is a CSV file (a file with “comma separated values”), where each line in the file contains a list of individual values each separated by a comma. We could use this function to get the first column from a CSV, or, with slight modifications, we could break apart an entire line into its individual values.

Later in this lab, you will write a related function, all_text_after(string, char) as a building block for solving a Madlibs-style puzzle (details below!), so taking the time to think about this pre-lab function will pay dividends later.

As you work on this problem, try to make a mental note of any decisions you found yourself making, especially when you weren’t sure which choice was the “best” one. We’d love to talk about these choices and why you made the choice you did. We’ve found that this pattern of reflection rapidly advances our understanding. The “why” is just as important as the “what”!

Getting Started in Lab

Rules and Guidelines

This lab is designed to be completed using only material we have discussed so far in class. You will need to use for loops to iterate through sequences, += to update “accumulator variables”, == to test the equality of two expressions, and if/elif/else to execute code when certain conditions hold.

Although not strictly necessary, you may also benefit from other Python language features, including sequence “slicing”, using the len() function to calculate a sequence’s length, using range() to generate a sequence of integers, or accessing specific sequence elements using square brackets ([]).

You SHOULD NOT use any language features or concepts that we have not yet discussed in lectures, even if they may seem like reasonable tools for parts of this assignment. We have intentionally chosen to focus on algorithmic thinking instead of Python-specific ways of doing things, and we will explore these so-called “Pythonisms” later. In particular, do NOT use any “string methods” or “list methods” (and if you don’t know what those are yet, that is the point of this note!).

Decomposing the Problem

A key skill for an algorithmic thinker is the ability to break a complex task into a set of smaller problems, that when solved individually, help solve the larger problem. We’ve referred to this as problem decomposition, and we’ve done a lot of that decomposition for you. The next steps of this lab will introduce those building blocks to you in an order that allows you to develop and test your code incrementally. Then, we will discuss the format that we will be using to represent our Madlibs puzzles, and we will strategize ways to use our building blocks to print the text of completed Madlibs puzzles.

Building Block: Identifying Prefixes

A sequence pre is said to be a prefix of another sequence seq if each successive element in pre appears in the same order at the beginning of seq. For example, here are all prefixes of word:

Since a Python string is an ordered sequence of individual characters, we can determine that one string is a prefix of another by comparing the characters in those strings, in order, and confirming that they match.

You should implement this functionality in the Python function is_prefix(prefix, string), which takes two strings as input: prefix and string. Your function should return True if prefix is a prefix of string, and False otherwise. Below is the result of calling is_prefix() on a few inputs inside an interactive Python session. Notice that the values True and False are being displayed; this is because the function returns those values. Your function should not print “True” or “False” (in fact, it should not print anything!). Instead, it should return a boolean value.

>>> is_prefix("pre", "prefix")
True
>>> is_prefix("abc", "prefix")
False
>>> is_prefix("", "prefix")
True
>>> is_prefix("prefix", "prefix")
True
>>> is_prefix("prefixing", "prefix")
False

At first glance, it may seem like we need to iterate through the two sequences in unison. There is no way to do this in a single for-each loop without additional Python tricks, so for now, we’ll need to think of another strategy. Luckily, we are equipped with the tools to tackle this problem in several different ways!

One strategy is to, instead of iterating through either string or prefix, iterate through something else. The trick is to choose a “something else” that lets us refer to the appropriate elements from our two actual sequences so we can compare them. An integer lets us index into a sequence using the “square bracket” notation (e.g., string[0]), so constructing the right integer sequence to serve as our indices would allow us to compare the corresponding elements of string and prefix to each other. Python’s range() may help!

Another observation that could help is that, when given two strings str_a and str_b, the boolean expression str_a == str_b evaluates to True if the characters in the strings str_a and str_b are an exact match. We may be able to construct appropriate subsequence of string and prefix that we can use for comparison.

To test your is_prefix() implementation, you can use the runtests.py program. Since there are multiple functions we will be testing in this program, we’ve added the ability to specify which function’s tests to run. To run the is_prefix() tests, you should give the additional argument “pre”:

python3 runtests.py pre

Building Block: Identifying Suffixes

A related-but-slightly-trickier problem to the “prefix identification” problem above is the task of identifying suffixes. Before diving into what makes it tricky, let’s define the task.

A sequence suf is said to be a suffix of another sequence seq if each successive element in suf appears in the same order at the ending of seq. For example, here are all suffixes of word:

You should implement the functionality to identify whether a string is a suffix of another inside is_suffix(suffix, string). Like is_prefix(), this function will also take two strings as input: suffix and string. It should return True if suffix is a suffix of string, and False otherwise. Below are some sample invocations from within an interactive Python session:

>>> is_suffix("fix", "suffix")
True
>>> is_suffix("abc", "suffix")
False
>>> is_suffix("", "suffix")
True
>>> is_suffix("suffix", "suffix")
True
>>> is_suffix("fixing", "suffix")
False

Why did we suggest that this task might be “harder” than identifying a prefix? When we iterate through a sequence using the for-each pattern, our iteration always starts at the sequence’s first element. For the prefix version of this problem, the first elements of string and prefix both correspond to the characters that we want to compare. So do the second elements, the third elements, and so on. However, when identifying most suffixes, we’ll want to start somewhere in the middle of string—not at its beginning.

One way to approach this problem is to structure our for-each loop so that it iterates over something other than the elements of string. Suppose we instead iterate over a sequence of integers. As long as we can convert those integers into the elements of our strings that we care about, we can compare them appropriately. The question, then, is what sequence of integers, and how do we convert those integers into the desired indices within string and suffix?

Another option is modify string so that when we do iterate through its elements, we are only iterating through elements that we care about. “Slicing” string may let us “start in the middle” and treat this problem similarly to prefix identification. In fact, we may even be able to reuse the work we did in that function!

Like many problems in computer science, there are multiple correct ways to attack it (including strategies not described here). You should choose the approach that matches your thought process—it will be easier to code and debug that way!

To test your is_suffix() implementation, you can use the runtests.py program with the “suf” argument:

python3 runtests.py suf

Building Block: Extracting Suffixes

In our pre-lab task, we wrote a function that returns the sequence of all characters in a string that precede the first occurrence of some other character. Although we didn’t use the term, this function extracted a prefix from that string. Now we will invert this problem and extract a string’s suffix.

Your third lab task is to write a function called all_text_after(string, char) that takes two string arguments: string, which can be any string value, and char, which should be a string of length 1 (i.e., char should always be a single character). For example, string could be the string "Hello world!", and char could be the string "r". Given these arguments, your function should return a new string defined as follows:

Below is the result of calling all_text_after(string, char) on a few different inputs inside an interactive Python session. Notice that string values are being displayed; this is because the function returns those values. Your function should not print the value it computes (in fact, it should not print anything!). Instead, it should return a string value.

>>> all_text_after("Hello World!", "r")
'ld'
>>> all_text_after("Hello World!", " ")
'World!'
>>> all_text_after("Hello World!", "")
''
>>> all_text_after("Hello World!", "World")
''
>>> all_text_after("Hello World!", "!")
''
>>> all_text_after("Hello World!", "H")
'ello World!'

The idea for this function is very similar to all_text_before(), but instead of stopping our accumulation of characters when we find the first occurrence of char, we want to start accumulating characters after the first occurrence of char. We may pass through several iterations of our for loop before identifing a match, so it may be helpful to create a boolean variable that changes its value once the first match is found. You can use an if inside your for loop to selectively execute the “accumulation” of characters after your condition is met.

To test your all_text_after() implementation, you can use the runtests.py program with the “after” argument:

python3 runtests.py after

Madlibs puzzles

At this point we’ve developed several “utility functions” that will serve as useful building blocks when writing our Madlibs puzzle game. But what exactly is a “Madlibs puzzle”, and how is it represented as data in our program?

Puzzle story files contain text interspersed with one or more “placeholders”. Placeholders are all encoded as single words that start with a less-than symbol (<), end with a greater-than symbol (>), and have one or more alphanumeric characters between those symbols (non-punctuation, non-whitespace characters). Typically, the text between the symbols is a descriptor and a number. For example, <adjective1> and <noun4> are two placeholders in the provided stories. These placeholders will eventually be substituted with words that match the placeholder’s description, completing the story.

As an example, an excerpt from a story file named nursery_rhyme.story might contain the following text:

If we replaced “<adjective1>” with “little”, “<noun1>” with “fleece”, “<adjective2>” with “white”, and “<noun2>” with “snow”, we’d have a (possibly) familiar excerpt from a nursery rhyme:

On the other hand, if we replaced “<adjective1>” with “enormous”, “<noun1>” with “mousepad”, “<adjective2>” with “loud”, and “<noun2>” with “rhombus”, we’d have something that is grammatically correct, but rather silly.

Puzzle key files are lists of “swap-key=swap-value pairs” that describe the way that the placeholders from a story file will be updated. Each “swap-key=swap-value pair” consists of a placeholder (as described above), an equals sign (=), and text that will be substituted into the puzzle as a replacement for all occurrences of the placeholder. For example, a puzzle key file for the substitution above would include the lines:

We called these “swap-key=swap-value” pairs because when we see an occurrence of a “swap-key” in our puzzle story file, we will look into our puzzle key file to find the matching “swap-value” that should replace it.

The game. Hopefully the above example illustrates that a Madlib puzzle is, in some sense, a game. Outside of your program, you can play this game by filling in the contents of a puzzle key file. To have the most fun, you should avoid reading the puzzle files (files that end in .story). Instead, open up a puzzle key file (a file that ends in .answerkey) inside VSCode, and fill in the puzzle key with words that you would like to use as substitutions for the placeholders. The placeholder should describe the qualities of a word that will fit into the story—grammatically and thematically—but without knowing the context of the puzzle, the words that you choose will often lead to humorous substitutions.

The rest of the lab will walk us through the steps to finish a program that takes a puzzle story file and a completed puzzle key file, and outputs the (hopefully humorous) version of the story that includes the substitutions.

Reading in Files

To implement our Madlibs game, we need a way to store the contents of puzzle story files and puzzle key files in variables that our program can manipulate. Since our primary focus is on building our skills with strings and lists, we have implemented this file-reading functionality for you. At the top of your starter code, you should see two lines:

from text_utils import read_stringlist_from_file
from text_utils import formatted_madlib

These two lines allow you to invoke the functions read_stringlist_from_file(filename) and formatted_madlib(string_list) from inside your code. We’ve provided the full implementations of these functions for you inside the file text_utils.py. You should not change the contents of that file.

The function read_stringlist_from_file(filename) takes the location of a file as input (the filename parameter’s type is str), and it returns a list of all of the words and punctuation in that file. So for example, if I had a file named "sample.story" with the contents:

>>>read_stringlist_from_file("sample.story")
['Uh', '-', 'oh', ',', 'I', 'forgot', 'to', '<verb1>', 'for', 'the', '<schoolsubject1>', 'exam', '!']
>>>read_stringlist_from_file("sample.answerkey")
['<verb1>=study', '<schoolsubject1>=CS134']

The behavior of the inverse function, which takes a list of strings (e.g., the text of a Madlibs puzzle—solved or not) and prints it to the screen with nice formatting, is shown below:

>>> story_list = read_stringlist_from_file("sample.story")
>>> formatted_madlib(story_list)
'Uh-oh, I forgot to <verb1> for my <schoolsubject1> exam!'

The Madlibs Algorithm

Now is the point in the lab where we start to put all the pieces together. We’ll first present the algorithm, and then complete one last helper function before implementing the algorithm itself—problem decomposition is a continuous process!

So, at a high level, what is it that we need to do? We need to read in a puzzle’s story and key, then go through the story and substitute each placeholder with its matching swap-value from the puzzle’s key. The concrete steps below describe this “Madlibs algorithm” more explicitly to give us some context for our next task. You may be tempted to start coding it up, but hold of on implementing this algorithm! That step is coming up soon.

Now that we have described our algorithm, can you identify any steps from this algorithm that we can further decompose? In step 4, there are quite a few moving parts. Particularly, the task of searching for a swap-key and retrieving its associated swap-value looks daunting. Let’s break this down into its own function that we can use as our last building block.

Identifying Swap-Values

Every time we identify a placeholder within our puzzle story, we need to find the corresponding swap-value from within the puzzle key. For example, if we see "<adjective1>" in a story, we need to look through the puzzle key to find the string that starts with "<adjective1>" (a swap-key), and then extract the matching swap-value. We will implement this behavior inside the function get_madlibs_replacement(placeholder, madlibs_key_list).

This function has two parameters: placeholder, which is a swap-key such as "<adjective1>", and madlibs_key_list, which is a list of “swap-key=swap-value pairs” that we read from a puzzle key file.

Luckily, we’ve already written the hardest parts of this function. Since placeholder is the swap-key in one of the “swap-key=swap-value pairs” found in madlibs_key_list, we just need to iterate through the elements in madlibs_key_list to find it. Once we’ve found a match, the swap-value is all of the characters that appear after the "=". (If we don’t find a match, just return placeholder since no subsitution is possible.)

We can use our is_prefix() function to identify the matching swap-key, and the all_text_after() function to extract the swap-value. Finally, we return that suffix.

To test your get_madlibs_replacement() implementation, you can use the runtests.py program with the “replace” argument:

python3 runtests.py replace

Solving the Puzzle

The last step is to write solved_madlibs(madlibs_story_list, madlibs_key_list), which implements the Madlibs algorithm and returns a string containing the completed story. As input, the function takes two parameters:

To test your solved_madlibs() implementation, you can use the runtests.py program with the “solve” argument:

python3 runtests.py solve

The Grand Finale

At this point in the lab, you’ve written all of the functions you need to complete a Madlibs puzzle game. However, until we invoke those functions, we aren’t actually done. At the bottom of your program, there is a section that resembles the lines:

if __name__ == "__main__":
    # comments ...
    # more comments ...
    pass

The section of your program is a standard part of many .py files. It is a conditional (an if statement followed by a boolean expression, followed by an indented section of code), which means that the indented region of code in its body will only be executed if the condition evaluates to True. But what does the expression __name__ == "__main__" mean, and in what circumstances is it True or False?

We will explore this syntax in more detail later this semester, but the important things to know are this:

This is a very helpful feature. It gives us a place where we can put code in our programs that only executes when we “run” the program. This is often where we’ll write tests or where we’ll include code that reads arguments from the terminal and executes our functions using those arguments.

Now, make it so that running your program actually plays your game! You may wish to modify your program to resemble the following:

if __name__ == "__main__":
    # Ask the user which Madlibs puzzle they want us to solve
    story_filename = input("What puzzle story file would you like to use? ")
    key_filename = input("What puzzle key file would you like to use? ")
    
    
    # Read the contents of the puzzle files into variables we can use
    story_list = read_stringlist_from_file(story_filename)
    key_list = read_stringlist_from_file(key_filename)
    
    print("Here is your completed Malib:")
    print("")
    
    # Solve and print the puzzle's solution
    print(solved_madlibs(story_list, key_list))

Finally, to run your completed program, you can type the following command in the Terminal:

python3 madlibs.py

Overview