# Lab 8: Auto(Complete)¶

Objectives

In this lab we will use classes to implement a version of an algorithm that is ubiquitous on modern smart phones – autocomplete! During this lab, you’ll gain experience with the following concepts:

• Writing classes and methods

• Thinking through some decisions to improve the efficiency of your code

## The Autocomplete Algorithm¶

Where would the world be without decent autocomplete? Stuck in the paste? Royally skewed? Up ship creek without a poodle? Fortunately, our phones are better than that. Most of the time…

As soon as you start typing in a word, you’ll notice that it suggests some possible completions for the word based on the letters typed in so far. For example, for the input `auto`, the phone might suggest the completions `auto, automatic, automobile`. Ideally, these suggestions also apply some clever rules to maximize their utility to the user – one way to ensure this is to say that the first suggestion will be the input itself if it already corresponds to a word in the dictionary, while the rest of the suggestions (including the first suggestion if the input isn’t a word in the dictionary) are presented in order of how commonly they are used in everyday speaking. We will implement a version of this algorithm in this week’s lab.

In the last part of the lab, we’ll also consider an alternative algorithm where the user can enter a word containing “wildcard characters” that match any letter. If the user enters `r***s`, our algorithm will return the three most common words starting with “r”, ending with “s”, and having any three letters in between. The output here may be `ranks, rooms, rules`, for example. While you may not find this particular feature on a cell phone, you may appreciate its utility if you’ve ever stared at a picture like this: .

The final product will be a program that takes words or patterns to complete from the command line and generates the three best completions. (Here the bracketed numbers are how common each completion is.)

```\$ python3 autocomplete.py moo cow "r***s"
moo --> mood[51] | moon[24] | moonlight[18]
cow --> coward[8] | cowboy[8] | cow[7]
r***s --> ranks[98] | rooms[86] | rules[58]
```

## Getting Started¶

Before you begin, clone this week’s repository using:

```https://evolene.cs.williams.edu/cs134-labs/usernameA-usernameB/lab08.git
```

where `usernameA-usernameB` are you and your partner’s usernames sorted alphabetically.

There are three Python files for this assignment: `freqword.py`, `result.py`, and `autocomplete.py`. Each will contain one class definition, as outlined below. You will also find a couple of CSV files called `gutenberg.csv` and `miniGutenberg.csv` in the data folder of your repository. Each line in these file corresponds to a word and the number of times it occurs in all of the books downloaded from Project Gutenberg. It contains 29,053 words. The `miniGutenberg.csv`, on the other hand, contains only five words. We’ll mostly use that version for testing and debugging purposes so you have a small file to look at to ensure your code is working as intended. We will use the full `gutenberg.csv` file corpus for determining the frequency with which words are used to order our autocomplete suggestions.

## Part 1: The `FreqWord` Class¶

The `FreqWord` class is one of two helper classes that will make writing your autocompletion code more elegant. A `FreqWord` object represents one word from a corpus (or collection of words), as well as the number of times that word appears in the corpus. This class should contain two protected attributes to record that information: `_text` that stores a string and `_count` that stores an integer.

Your first task is to implement the following methods appearing in the `FreqWord` class in the `freqword.py` file:

• The constructor `__init__(self, text, count)`, that creates a new `FreqWord` object with the given `text` and `count` (make sure `_count` is stored as an int);

• The accessor methods `getText(self)` and `getCount(self)` that return the object’s attribute values;

• The method `__str__(self)` that returns a string representing the objects attributes in a readable form; and

• The method `hasPrefix(self, prefix)` that returns True if the word starts with `prefix`. You might find the `.startswith()` string method useful here.

Here is an example of using these methods in interactive Python. Note the string printed by the `print(w)` test. Your `__str__()` method should return a string representing a `FreqWord` object using that format.

```>>> from freqword import *
>>> w = FreqWord("cow", 5)
>>> w.getText()
'cow'
>>> w.getCount()
5
>>> print(w)
cow[5]
>>> w.hasPrefix("co")
True
>>> w.hasPrefix("moo")
False
```

Once you have completed the aforementioned methods and your code passes the doctests for them, write two functions that will be useful for sorting lists of `FreqWords`:

• `textKey(freqWord)` that returns the text for a `FreqWord` object; and

• `countKey(freqWord)` that returns the count for a `FreqWord` object.

Note: These functions are not part of your class definition. Although they appear in the same script, they are not indented. Thus, they should not access your object’s attributes directly but instead use your accessor methods. We will use these functions in the AutoComplete class later.

Make sure that all doctests pass.

There is one additional method in this class – `matchesPattern()` – that is currently commented out. We’ll return to that method later. For now, just ignore it.

## Part 2: The `Result` Class¶

Our second helper class, `Result`, helps the autocompleter present results to the user in a readable format. This class should contain two protected attributes – `_input` that stores a string that the user entered for autocompletion, and `_completions` that stores a `list` of `FreqWords` corresponding to suggested completions.

In the `Result` class, implement the following methods:

• The constructor `__init__(self, inputWord, completionList)` that creates an instance of `Result` with the given input word and list of possible completions.

• The method `__str__(self)` that constructs a string representing the attributes of an instance in a readable format.

A demonstration of creating an instance of this class and printing its string representation in interactive Python is shown below.

```>>> from result import *
>>> r = Result("the", [FreqWord("the",4), FreqWord("theirs",3), FreqWord("then",2)])
>>> print(r)
the --> the[4] | theirs[3] | then[2]
```

As always, make sure that all doctests pass.

## Part 3: The `AutoComplete` Class¶

We are now ready to implement the `AutoComplete` class. Before starting, take a look at the contents of the class provided to you in `autocomplete.py` to familiarize yourself with some of the attributes and methods.

`AutoComplete` has one protected attribute: `_words`. This is a `list` of `FreqWord` objects, sorted in alphabetical order. You will initialize this attribute in the constructor. The class also has the following methods, which you should implement and test:

• The constructor `__init__(self, corpus)` : This method should read the contents of a CSV file (where `corpus` is a string representing the filename) that contains word-frequency pairs on each line. It initializes the attribute `_words` according to the description above. (Hint: To accomplish this, you will need to sort using a key function defined in FreqWord. Since they are functions, you can access them directly.)

• The method `__str__(self)`: This method should generate a string with each FreqWord in `_words` on a separate line as shown below.

```>>> print(AutoComplete("data/miniGutenberg.csv"))
circumstances[107]
scold[3]
scraped[21]
wooded[8]
wooden[37]
```
• The private method `_matchWords(self, criteria)`: This helper method takes as input a string `criteria` and returns a list of all `FreqWord` objects in `_words` whose text begins with that string. Take a look at the corresponding doctest in the starter code for an example of how the method works. (Hint: You should call methods in `FreqWords` whenever possible to simplify your code.)

• The method `suggestCompletions(self, inputString)`: This method takes as input a string called `inputString` and returns an instance of the `Result` class, where the input corresponds to the same input provided, and the output is the top suggested autocompletions generated according to the following two step algorithm:

1. Generate possible completions using `_matchWords` to find all words having `inputString` as a prefix.

2. Sort the possible completions according to their frequency of occurrence (hint: you will need to sort using a key function here), and return a `Result` instance with output corresponding to the top 3 frequently occurring words. Note: If there are less than 3 possible completions, this list may be shorter (possibly even empty corresponding to no possible completions).

Putting these all together you should be able to get fun outputs like the ones shown below in interactive Python.

```>>> auto = AutoComplete('data/gutenberg.csv')
>>> print(auto.suggestCompletions('cool'))
cool --> cool[11] | cooling[4] | cooled[3]
>>> print(auto.suggestCompletions('iris'))
iris --> irish[48] | iris[3] | irishman[1]
>>> print(str(auto.suggestCompletions('jeannie')).strip())
jeannie -->
>>> print(str(auto.suggestCompletions('lida')).strip())
lida -->
```

Make sure that all doctests pass.

Caution

Doctests that have no completions: Note that there is always a space after the `-->` in the outputs above, even for lida and jeannie. Be careful when creating new doctests and use `strip()` to prevent issues.

### Testing and Running your code¶

• Make sure all of your code passes the relevant doctests. We encourage you to write additional doctests for any sophisticated methods that you implement.

• We have configured the `autocomplete.py` file to use command line arguments, as we did last week in Lab 7. To generate autocompletions for one or more prefixes, just list them on the command line:

```\$ python3 autocomplete.py moo cow
moo --> mood[51] | moon[24] | moonlight[18]
cow --> coward[8] | cowboy[8] | cow[7]
```

## Part 4: Matching Patterns Instead of Prefixes¶

We’ll now extend your autocompleter to allow for more general matching based on patterns. For example, computing the completions for the pattern `'c**l'` will produce the three most common 4-letter words starting with `c` and ending the `l`. To do this:

1. Uncomment and implement the `matchesPattern(self, pattern)` method in your `FreqWord` class. This method takes as input a string `pattern`, which contains a mix of letters and ‘wildcard characters’ denoted as `'*'`, and returns whether or not the text of the `FreqWord` matches that pattern. The wildcard characters are used to denote that any letter is acceptable in the given spot where it appears. Here are a few examples in interactive Python:

```>>> FreqWord('contemplate', 100).matchesPattern('c***emp*at*')
True
>>> FreqWord('contemplate', 100).matchesPattern('contemp**')
False
>>> FreqWord('test', 100).matchesPattern('text')
False
>>> FreqWord('test', 100).matchesPattern('ne*t')
False
```
2. Modify your `_matchWords(self, criteria)` helper method in the `AutoComplete` class to handle input strings containing wildcards. Specifically, if `criteria` (a string) does not contain wildcard characters, `_matchWords()` should behave exactly as before. If `criteria` does have wildcards, it should instead use the `matchesPattern()` method in `FreqWord` to construct a list of all words in `_words` matching the pattern.

For example, if `_words` is a list of `FreqWord` instances for the words `'call', 'cat', 'chill', 'cool'` and the given pattern is `'c**l'`, this method should return a list containing only the instances for `'call', 'cool'`. Note that ‘chill’ is not returned as it consists of 5 letters rather than 4 as required by the pattern.

Here are two additional tests you can add to `suggestCompletions()` to test your pattern matching. You will likely want to add several additional tests as well.

```>>> print(str(AutoComplete("data/miniGutenberg.csv").suggestCompletions("woo*e*")).strip())
woo*e* --> wooden[37] | wooded[8]
>>> print(str(AutoComplete("data/gutenberg.csv").suggestCompletions("woo*e*")).strip())
woo*e* --> wooden[37] | woolen[15] | wooded[8]
```

Note

The Terminal has its own autocomplete that tries to match words with wildcards to filenames. So, to use command line arguments as patterns, put quotes around them to tell the terminal not to process them:

```\$ python3 autocomplete.py "r***s"
r***s --> ranks[98] | rooms[86] | rules[58]
```

## Extra Credit¶

There are many ways to extend this program. Here are a few ideas:

• Prefer prefix if it’s a complete word: Override the priority of the first element of the top ranked completions to be the input string itself if it happens to be an actual word that already exists in `_words`. That is, the top 3 choices in the case where `inputString` is a real word (and not just a prefix or expression) are (i) `inputString` itself, (ii) The top frequency word besides `inputString`, and (iii) The second most frequent word besides `inputString`. (Note: there are some subtle cases to consider – what if `inputString` is a complete word and also the second most frequent word?)

Implementing this feature may break some of your doctests. Be sure to go back and fix them to ensure all tests pass.

• `_matchWords()` Performance: Your `_matchWords()` method may be doing a lot more work than necessary in some cases. It is likely your initial implementation iterates over all words in `_words`. However, we can take advantage of that fact that `_words` is sorted, and stop processing words as soon as we know that all unprocessed words will not match our criteria. For example, if the provided prefix is `'ca'` and `_words` contains ‘apple’, ‘cat’, ‘catastrophe’, ‘dog’, and ‘doge’, the loop should stop executing as soon as it passes the `FreqWord` object for ‘catastrophe’.

Redesign `_matchWords` to stop early when matching prefixes. Also consider a similar way to stop early when matching patterns. That case will be somewhat more subtle.

## Submitting your work¶

When you’re finished, commit and push your work to the server as in previous labs.

Functionality and programming style are important, just as both the content and the writing style are important when writing an essay. Make sure your variables are named well, and your use of comments, white space, and line breaks promote readability. We expect to see code that makes your logic as clear and easy to follow as possible. Python Style Guide is available on the course website to help you with stylistic decisions.

Do not forget to add, commit, and push your work as it progresses! Test your code often to simplify debugging.