CSCI 432

Operating Systems

Project 0 - Inverted Index

For this assignment, you will write a program in C++ that generates an "inverted index" of all the words in a list of text files. (See Wikipedia for more details regarding inverted indexes.)

The goal of this project is to give you experience writing a (very) simple C++ program using the Standard Template Library (STL). It also gives you some practice with the autograder, and allows me to make sure everything is configured correctly. It is worth fewer points than the other projects.

The starter files (including a Makefile) can be found here. You should create a file called inverter.cc to implement your solution. More details are discussed below.

Note: Compile with g++ -std=c++11. (C++17 removed dynamic exception specifications. Using them makes catching errors much simpler in our assignments, so we will stick with C++11 this semester.)

Inverter Input

Your inverter will take exactly one argument: the name of a file that contains a list of filenames, one per line. Each file named in that list contains text from which you will build your index.

For example:

inputs.txt
-----
foo1.txt
foo2.txt

foo1.txt
-----
this is a test. cool.

foo2.txt
-----
this is also a test.
boring.

Inverter Output

Your inverter should print every word found in the inputs, one word per line, in "alphabetical" order. Each word is followed by a colon and then the numbers of the documents in which it appears, in ascending order. For the example inputs above, your program must produce exactly this output:

a: 0 1
also: 1
boring: 1
cool: 0
is: 0 1
test: 0 1
this: 0 1

"Alphabetical" order means ASCII order, so "The" and "the" are separate words, and "The" comes first. A word is a maximal run of alphabetic characters; digits, spaces, punctuation, and all other characters separate words and are never part of one. For example, "Th3e" is two words, "Th" and "e".
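
As a rough illustration of the word rule above, here is a minimal sketch that splits a string into runs of alphabetic characters. The helper name splitWords is hypothetical (it is not part of the starter files), and the sketch assumes the default "C" locale, where std::isalpha accepts only A-Z and a-z.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not from the starter files): returns the maximal
// runs of alphabetic characters found in text, in order of appearance.
std::vector<std::string> splitWords(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char ch : text) {
        // Cast to unsigned char: passing a negative char to isalpha is undefined.
        if (std::isalpha(static_cast<unsigned char>(ch))) {
            current += ch;
        } else if (!current.empty()) {
            // Any non-alphabetic character ends the current word.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);  // word that runs to the end of the text
    }
    return words;
}
```

Under these assumptions, splitWords("Th3e") yields "Th" and "e", and case is left untouched so that "The" and "the" stay distinct.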

Files are incrementally numbered, starting with 0. Only valid, openable files should be included in the count. (is_open comes in handy here.) You may assume that you will not be given any duplicate files in the input file (i.e., foo1.txt will only appear once).
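
One possible way to apply the numbering rule is sketched below: only names that can actually be opened are kept, so a file's position in the result serves as its document number. The helper name numberOpenableFiles is an assumption made for illustration, not something the starter files provide.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the names that could be opened, in their
// original order. The index of a name in the result is its document number,
// so names that fail to open do not consume a number.
std::vector<std::string> numberOpenableFiles(const std::vector<std::string>& names) {
    std::vector<std::string> numbered;
    for (const std::string& name : names) {
        std::ifstream in(name.c_str());
        if (in.is_open()) {
            numbered.push_back(name);
        }
    }
    return numbered;
}
```

In this sketch an empty file still opens successfully, so it still receives a number even though it contributes no words.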

Your program must not produce any other output. Extraneous output, or incorrectly formatted output (extra spaces, etc.), will make the autograder mark your solution as incorrect. Please leave yourself extra time to work these problems out.

Implementation Hints

Implement the data structure using the C++ Standard Template Library (STL) as a map of sets, as in:

map<string, set<int> > invertedIndex;

Use C++ strings and file streams. Sample code:
#include <string>
#include <fstream>

Make sure that your project uses an ifstream, not an fstream. Both are declared in the <fstream> header.

Remember, your program needs to be robust to errors. Files may be empty, etc. Please handle these cases gracefully and with no extra output.

The noskipws manipulator may be useful in parsing the input. It stops the stream from skipping whitespace when extracting characters:
input >> noskipws >> c;

Project Turnin

Your code (inverter.cc) will be handed in using the autograding system. Please see the autograder web page for details on how to submit your solution.

Project Writeup

No writeup is required for this project.