CS010 Assignment 3

Pointers, Arrays, Structures

Due: January 30, 5 PM

This assignment should be turned in for evaluation. To turn in the assignment, you will use the turnin command:

turnin -c 010 concordance.c

It is ok to turn in the same file more than once. This will overwrite the earlier version. You may want to do this if you discover a mistake or make an improvement after turning in your assignment.

  1. A concordance is an alphabetical list of the words that appear in a document. For this assignment, you will build a concordance of the words contained in a file. A word that appears more than once should be listed only once in the concordance. To simplify the program, you may use case-sensitive comparisons of words. So, if a word appears both capitalized and in lower case, it is ok for that word to appear twice in the concordance.

    To read input, simply use scanf to break the input up into words:

    scanf ("%s", nextWord);

    Keep in mind that memory for nextWord must be allocated prior to calling scanf, and it must be large enough to hold the longest word in the input. scanf will then store in nextWord the next run of consecutive characters appearing between whitespace (blanks, carriage returns, tabs, ...). Unfortunately, this means that some "words" will actually be numbers, punctuation, or a mix of letters and non-letters. For the purposes of this assignment, a word should contain only alphabetic characters. So, after scanf reads in a "word", send it to a function isWord (that you need to write) that should return true if the word contains only alphabetic characters. If it does, add it to the concordance.
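    One possible shape for isWord is sketched below. This is only a minimal sketch: it assumes a C99 compiler (for bool from stdbool.h), and the choice to reject the empty string is an assumption, not a requirement of the assignment.

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true if word is non-empty and every character is alphabetic.
   Rejecting the empty string is an assumption made for this sketch. */
bool isWord(const char *word)
{
    if (word[0] == '\0') {
        return false;
    }
    for (const char *p = word; *p != '\0'; p++) {
        /* The cast keeps isalpha well defined when char is signed. */
        if (!isalpha((unsigned char)*p)) {
            return false;
        }
    }
    return true;
}
```

    With this version, isWord("Alice") is true, while isWord("do:") and isWord("bank,") are false because of the trailing punctuation.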

    scanf reads from standard input. It makes sense for this program to read from a file instead. Since we haven't done file manipulation yet, you don't know how to do that in C. To read from a file you should use Unix input redirection to tell your program that it should get its standard input from a file instead of from the keyboard. Here is how you do that:

    -> concordance < my_input_file

    If there is a file called my_input_file in the current directory, its contents will be sent to the concordance program just as if they had been entered from the keyboard. When the end of the file is reached, scanf will detect this and return EOF, rather than the usual count of 1, to indicate that it was unable to set nextWord because it had run out of input.
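    The read loop can be driven by the value scanf returns. The sketch below only counts the tokens it reads; the name countTokens and the 99-character limit are assumptions made for illustration. In the real program, the body of the loop would call isWord and add the word to the concordance.

```c
#include <stdio.h>

#define MAX_WORD_LEN 99   /* assumed limit; the assignment does not give one */

/* Reads whitespace-separated tokens from standard input until scanf
   stops succeeding, and returns how many tokens were read. */
int countTokens(void)
{
    char nextWord[MAX_WORD_LEN + 1];   /* +1 for the terminating '\0' */
    int count = 0;

    /* scanf returns the number of items it assigned, which is 1 for a
       successful "%s". The width 99 must match MAX_WORD_LEN; it keeps
       scanf from writing past the end of nextWord. */
    while (scanf("%99s", nextWord) == 1) {
        count++;
    }
    return count;
}
```

    Comparing the result with 1 stops the loop as soon as scanf fails to assign a word, whatever the reason. Note that with a width limit, a token longer than 99 characters would be split into several pieces.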

    Here is an example of how the program should behave:
    -> more sample.txt
    Alice was beginning to get very tired of sitting by her sister on the
    bank, and of having nothing to do: once or twice she had peeped into
    the book her sister was reading, but it had no pictures or
    conversations in it, `and what is the use of a book,' thought Alice
    `without pictures or conversation?'
    -> concordance < sample.txt

    You must decide how to define the concordance itself. Here are some options:

    • Use an array. How can you make it big enough so that you don't overflow it?
    • Use a vector. This addresses the growth problem. The downside, though, is that since you want to keep the concordance sorted, you may need to insert elements near the beginning of the vector, which means moving a lot of elements to make room. This could get slow as the vector grows.
    • Use a linked list. Now, insertion is quick, but you may need to walk a long list just to figure out where to put the new element. This can also get slow as the list grows, although not as slow as a vector.
    • If you've taken 136, you might try to use other structures you learned about there to maintain your sorted list, such as a tree.
    • If you haven't taken 136, you might try using an array whose elements are vectors or linked lists. In this case, each element of the array would contain a vector/linked list of all the words beginning with a particular letter. Knowing the first letter of the new word to add to the list, you can immediately decide which of the 26 vectors/linked lists to put it in. Then you need to walk/update a smaller list than if all the words were kept in one long list.
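    The last option above could be sketched as follows, using an array of 26 sorted linked lists. The names struct node, bucketIndex, and insertSorted are hypothetical, and the sketch assumes that the word has already passed isWord and that the letters 'a' to 'z' are contiguous in the character set (true for ASCII). It is one possible approach, not the required one.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 26

struct node {
    char *word;           /* heap-allocated copy of the word */
    struct node *next;
};

/* Maps a word to a bucket 0..25 by its first letter, ignoring case.
   Assumes word is non-empty and alphabetic. */
int bucketIndex(const char *word)
{
    return tolower((unsigned char)word[0]) - 'a';
}

/* Inserts a copy of word into the sorted list *head unless it is
   already there. Returns 1 if inserted, 0 if it was a duplicate,
   and -1 if memory could not be allocated. */
int insertSorted(struct node **head, const char *word)
{
    struct node **link = head;

    /* Advance to the first node whose word is not less than the new one. */
    while (*link != NULL && strcmp((*link)->word, word) < 0) {
        link = &(*link)->next;
    }
    if (*link != NULL && strcmp((*link)->word, word) == 0) {
        return 0;                       /* already listed */
    }

    struct node *n = malloc(sizeof *n);
    if (n == NULL) {
        return -1;
    }
    n->word = malloc(strlen(word) + 1);
    if (n->word == NULL) {
        free(n);
        return -1;
    }
    strcpy(n->word, word);
    n->next = *link;                    /* splice in before *link */
    *link = n;
    return 1;
}
```

    A caller would declare struct node *buckets[NUM_BUCKETS] = {0}; and add a word with insertSorted(&buckets[bucketIndex(word)], word). Because strcmp is case-sensitive, capitalized words sort ahead of lower-case words within each bucket.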

    I'll leave it up to you to decide which implementation to use. I'm more concerned that you get practice with C than that you produce the fastest concordance around. To be sure that you do get more practice, don't use either of the first two solutions (one long array or one long vector). If you do use a vector as part of one of the other suggested approaches, you can reuse the vector implementation from class and the practice assignments in your solution.

    Be sure to free the concordance after printing it out. Along the way, be sure to watch out for memory leaks and dangling references!
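    If the concordance is built from linked lists, freeing it might look like the sketch below. The struct node layout is a hypothetical one (repeated here so the sketch stands alone), and freeList is an illustrative name. The function returns the number of nodes freed, which is a convenient check against leaks while testing.

```c
#include <stdlib.h>

struct node {
    char *word;           /* heap-allocated copy of the word */
    struct node *next;
};

/* Frees every node in the list, along with each node's word, and
   returns how many nodes were freed. *head is set to NULL so the
   caller is not left holding a dangling pointer. */
int freeList(struct node **head)
{
    int freed = 0;
    struct node *cur = *head;

    while (cur != NULL) {
        struct node *next = cur->next;  /* save before cur is freed */
        free(cur->word);
        free(cur);
        cur = next;
        freed++;
    }
    *head = NULL;
    return freed;
}
```

    The key detail is reading cur->next before calling free(cur); reading it afterwards would be a dangling reference.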

    As always, a good strategy is to write one function and test it. Don't go on to the next function until the first one is working.

    If you run your program on alice.txt, it should generate a list of about 2200 words.

    The sample.txt and alice.txt files are in my shared/cs010/assignment3 directory.
