Diving Into the Deluge of Data :: Lab 3 :: String Theory

Lab 3: String Theory

In the "real world", data is everywhere. But the real world is, unfortunately, a very messy place.

You will often find that your first task when interacting with data is to convert it to a usable form. We will build some of these skills by searching, parsing, and transforming large bodies of text. In particular, this lab will build our expertise with many of Python's string functions.

We will start with a few small functions, and build our way up to more complex tasks. At the end of this lab, we will build a small program to solve Mad-Lib-style puzzles.

Step 0: Lab Preparation

  • Please review Lectures 7 and 8
  • Please read the Python documentation for the built-in text sequence type, str
  • Explore the wc and grep Unix utilities. You can type
    $ man wc
    and
    $ man grep
    to view documentation. You should also try them out yourself.

Step 1: Source Code

Step 2: Reading in Text

We haven't yet completed our in-class discussion of file input/output, so we have provided the textIO.py module. The module is very simple: it has one wrapper function that reads a source text file and stores its contents in a string.

      def read_text_from_file(filename):
          '''Returns the contents of 'filename', where 'filename' is a file path string.'''
          with open(filename, 'r') as myfile:
              return myfile.read()
    

Using this module, we can read the text from any file and use those contents within our programs. Here's an example program that imports and uses read_text_from_file to read the text from a file and print it out. Notice that we use from MODULE import NAME instead of import MODULE. This allows us to use NAME directly instead of typing MODULE.NAME.

      import sys

      from textIO import read_text_from_file

      if __name__ == '__main__':
          # sys.argv[1] is the first command-line argument: the path of the file to print.
          print(read_text_from_file(sys.argv[1]))
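
The difference between the two import styles can be tried with any module. Here is a minimal sketch that uses the standard library's math module as a stand-in, so it runs even when textIO.py is not in the current directory:

```python
# Style 1: import the module, then qualify each name with the module name.
import math
print(math.sqrt(16.0))   # 4.0

# Style 2: import one name from the module and use it directly.
from math import sqrt
print(sqrt(16.0))        # 4.0
```

Both styles call the same function; the only difference is how much you type at each call site.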
    

Step 3: Useful Unix Utilities

The Unix program wc can be found on all of the lab computers. According to the manual page (which you can access from the command line by typing `man wc`), wc prints the newline, word, and byte counts for each file. Please write one function for each of these counts, where each function takes an input string (see below). However, instead of counting bytes, we want to count characters.

You should put these functions in a file called stringutils.py, which is provided. Reviewing the split and splitlines methods might help. To test each piece of code you can start a python interpreter and import a function as follows:

	>>> from stringutils import line_count
	>>> line_count("there's no\n'b' in\n team")
	3
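
As a quick refresher on the two methods mentioned above, here is a small sketch of how split and splitlines behave on the same string. It only illustrates the string methods; how you use them in your own functions is up to you.

```python
text = "there's no\n'b' in\n team"

# split() with no arguments breaks on any run of whitespace,
# including spaces, tabs, and newlines.
words = text.split()
print(words)   # ["there's", 'no', "'b'", 'in', 'team']

# splitlines() breaks only at line boundaries and drops the newline characters.
lines = text.splitlines()
print(lines)   # ["there's no", "'b' in", ' team']

# A trailing newline does not produce an extra empty entry with splitlines(),
# but it does with split("\n").
print("one\ntwo\n".splitlines())   # ['one', 'two']
print("one\ntwo\n".split("\n"))    # ['one', 'two', '']
```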
      

If you make changes to your stringutils file, they will not be reflected in an interpreter that is already running, because it keeps the version it imported earlier. You must exit Python and start it again.

    def character_count(text):
        """
        Returns the number of characters in the string 'text'.
        """

    def word_count(text):
        """
        Returns the number of words in the string 'text'.
        """

    def line_count(text):
        """
        Returns the number of lines in the string 'text'.
        """
  

Using these functions, we can create our own word counting utility and run it as a script. In the file wc.py, please use the character_count(), word_count(), and line_count() functions that you just wrote to implement the default wc functionality. Pay attention to the way that wc formats its output. You can compare by typing

$ wc wc.py

and

$ python3 wc.py wc.py

to confirm that you get the same result.
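
To match the layout of wc, it helps to right-justify each count in a column. The sketch below shows one way to do that with an f-string. The helper name format_counts and the default column width of 8 are assumptions made for illustration; column widths vary between wc implementations, so compare against the output on your lab machine.

```python
def format_counts(lines, words, chars, filename, width=8):
    """Hypothetical helper: right-justify each count in a column of 'width' characters."""
    return f"{lines:>{width}}{words:>{width}}{chars:>{width}} {filename}"

# The counts here are made-up numbers, used only to show the layout.
print(format_counts(12, 48, 301, "wc.py"))
print(format_counts(1, 2, 3, "tiny.txt", width=3))   #   1  2  3 tiny.txt
```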


Like wc, the Unix program grep can be found on all of the lab computers. The grep program is much more complex than wc, but it is also much more versatile. At a high level, it searches a file (or set of files) for instances of a pattern, and it prints the lines that contain that pattern. If you pass it the -i flag, grep ignores case.

Please complete the grep-like function, grep_lite(), that searches the string 'text' for all lines that contain the string 'query' and returns those lines in a new list. Please add this code to the stringutils.py file. You may find the string method casefold, which is similar to lower, to be helpful.

    def grep_lite(text, query):
        """
        Given an input string 'text', and a target string 'query',
        return a list of all the lines of 'text' that contain the string 'query'.
        grep_lite ignores case.  In other words, it also returns lines of 'text'
        where capitalization does not match.
        """
  
    >>> from stringutils import grep_lite
    >>> l = "a man\na plan\na canal\npanama\nnot this line"
    >>> print(l)
    a man
    a plan
    a canal
    panama
    not this line
    >>> grep_lite(l, "an")
    ['a man', 'a plan', 'a canal', 'panama']
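
The sketch below shows how casefold compares with lower, and how folding both strings gives a case-insensitive containment test. The helper name contains_ignoring_case is hypothetical and is not part of stringutils.py.

```python
# lower() and casefold() agree on plain ASCII text ...
print("PaNaMa".lower())      # panama
print("PaNaMa".casefold())   # panama

# ... but casefold() is more aggressive for some non-ASCII characters.
print("Straße".lower())      # straße
print("Straße".casefold())   # strasse

def contains_ignoring_case(line, query):
    """Hypothetical helper: True if 'query' occurs in 'line', ignoring case."""
    return query.casefold() in line.casefold()

print(contains_ignoring_case("A Canal", "an"))         # True
print(contains_ignoring_case("not this line", "an"))   # False
```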
  

Step 4: Search Party

Finding and replacing text is a very common—and important—task. Please implement functions to count the instances of a given string, replace one string with another, and remove all instances of a string entirely.

    def count_instances(text, query):
        """
        Given an input 'text', and a target string 'query',
        return the number of times that 'query' occurs in 'text'.
        count_instances() also ignores case. In other words, count_instances()
        also counts occurrences of 'query' where capitalization does not match.
        """

    def replace_instances(text, query, replacement):
        """
        Given an input string 'text', a target string 'query',
        and a replacement string 'replacement', return a copy of
        'text' where all instances of 'query' are replaced by
        the string 'replacement'.
        replace_instances() also ignores case. In other words, it
        replaces instances of 'query' where capitalization does not match.
        """

    def drop_instances(text, query):
        """
        Given an input string 'text', a target string 'query',
        return a copy of 'text' where all instances of 'query'
        are removed.
        drop_instances() ignores case. In other words, it also
        removes instances of 'query' where capitalization does not match.
        """
    
  

Remember to test your functions using the python interpreter.

    >>> from stringutils import count_instances, replace_instances, drop_instances
    >>> l = "the rain in spain stays mainly on the plAin"
    >>> count_instances(l, "rain")
    1
    >>> count_instances(l, "ain")
    4
    >>> replace_instances(l, "rain", "snow")
    'the snow in spain stays mainly on the plAin'
    >>> replace_instances(l, "ain", "ite")
    'the rite in spite stays mitely on the plite'
    >>> drop_instances(l, "ain")
    'the r in sp stays mly on the pl'
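
The built-in str.count and str.replace methods are case-sensitive, which is why these functions take some care. The sketch below shows the problem and one possible workaround that uses the re module. The helper name replace_ignoring_case is hypothetical, and this is not necessarily the approach the lab expects; a solution built only from string methods is also possible.

```python
import re

text = "the rain in spain stays mainly on the plAin"

# str.count and str.replace are case-sensitive, so 'plAin' is missed.
print(text.count("ain"))            # 3
print(text.replace("ain", "ite"))   # the rite in spite stays mitely on the plAin

# Folding both strings first is enough for counting ...
print(text.casefold().count("ain".casefold()))   # 4

# ... but replacing must keep the untouched parts of the original text,
# so this sketch uses a case-insensitive regular expression instead.
def replace_ignoring_case(text, query, replacement):
    """Hypothetical helper, not the lab's required implementation."""
    # re.escape makes 'query' match literally, even if it contains
    # characters that are special in regular expressions.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    # A function replacement inserts 'replacement' exactly as given,
    # without interpreting backslashes in it.
    return pattern.sub(lambda match: replacement, text)

print(replace_ignoring_case(text, "ain", "ite"))
# the rite in spite stays mitely on the plite
```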
  

Step 4: ________-ing it all Together
         (verb)

Without realizing it, we have implemented the necessary functionality for completing a Mad Lib puzzle (at least we have implemented the tricky parts). What remains is for us to fill in some puzzles ourselves.

In your lab repository, you can find a support/ directory. This directory contains pairs of files, and for each pair, there is a puzzle and a solution key (they can be easily identified by their file extension). The puzzle is a Mad Lib story—a short text where several words have been removed and replaced by their part of speech. The solution key is a list of substitutions, one per line, that will be used to solve the puzzle.

We have provided the solvemadlib.py module to do the work of reading in a puzzle and a solution key. The substitute function goes through each substitution in your solution key, and calls your replace_instances function to update the puzzle.
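
To make the idea concrete, here is a rough sketch of what such a substitution loop could look like. Everything in it is an assumption made for illustration: the function name apply_key, the 'placeholder=replacement' key format, and the bracketed placeholders are all hypothetical, and the provided solvemadlib.py and the files in support/ may work differently. The built-in str.replace stands in for your replace_instances function.

```python
def apply_key(puzzle_text, key_text, replace=str.replace):
    """Hypothetical sketch of a substitution loop.

    ASSUMPTION: each key line looks like 'placeholder=replacement'.
    The real key files and solvemadlib.substitute may differ.
    """
    for line in key_text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        placeholder, replacement = line.split("=", 1)
        # Apply one substitution at a time, carrying the result forward.
        puzzle_text = replace(puzzle_text, placeholder.strip(), replacement.strip())
    return puzzle_text

# Made-up puzzle and key, in the assumed format.
puzzle = "The [adjective] student studied [noun] all night."
key = "[adjective]=sleepy\n[noun]=botany\n"
print(apply_key(puzzle, key))
# The sleepy student studied botany all night.
```

Passing the replacement function as a parameter keeps the loop independent of how the replacement itself is implemented.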

Please complete the solution key files and use the solvemadlib.py module to fill in the puzzles. Remember, you can run test functions directly by running python in interpreter mode and typing:

      >>> from solvemadlib import substitute
      >>> substitute("support/study.madlib", "support/study.madlib.key")
      >>> substitute("support/president.madlib", "support/president.madlib.key")
    

If you implemented your search functions correctly, you should see completed puzzles. There are many Mad Lib puzzles available online. Feel free to create your own puzzle and key files.

Step 5: Submission