CSCI 432

Operating Systems


Project 0 - Inverted Index

This project will give you experience writing a simple C++ program using the STL. It also gives you some practice with the autograder, and allows me to make sure everything is configured correctly.

For this assignment, you will write a program in C++ that generates an "inverted index" of all the words in a list of text files. (See Wikipedia for more details regarding inverted indexes.) The goal of this assignment is to ensure that you are sufficiently up to speed in C++ to handle the rest of the course.

Future assignments require you to use C++11. You should compile with g++ -std=c++11. I also strongly encourage you to use a simple Makefile for compiling your code.

Inverter Input

Your inverter will take exactly one argument: a file that contains a list of filenames. Each filename will appear on a separate line. Each of the files described in the first file will contain text from which you will build your index. For convenience, you can download a sample Makefile and the files described below here.

For example:

inputs.txt
-----
foo1.txt
foo2.txt

foo1.txt
-----
this is a test. cool.

foo2.txt
-----
this is also a test.
boring.

Inverter Output

Your inverter should print every word from all of the inputs, in "alphabetical" order, each followed by the numbers of the documents in which it appears, in increasing order. For example (note: your program must produce exactly this output):

a: 0 1
also: 1
boring: 1
cool: 0
is: 0 1
test: 0 1
this: 0 1

Alphabetical order is defined by ASCII value, so "The" and "the" are separate words, and "The" comes first. Only certain words should be indexed. A word is any run of alphabetic characters; digits, spaces, punctuation, and everything else act as separators. For example, "Th3e" is two words, "Th" and "e".
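One way to read the word rule above is as "maximal runs of alphabetic characters." The sketch below is only an illustration of that reading, not the required implementation. The function name extractWords is made up for this example, and it assumes the default "C" locale, where std::isalpha accepts only the ASCII letters.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text into maximal runs of alphabetic characters.
// Anything else (digits, spaces, punctuation) ends the current word.
std::vector<std::string> extractWords(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char ch : text) {
        // Cast to unsigned char: passing a negative char to isalpha is undefined.
        if (std::isalpha(static_cast<unsigned char>(ch))) {
            current += ch;
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);  // flush a word that runs to the end of the text
    }
    return words;
}
```

Under this reading, extractWords("Th3e") yields "Th" and "e", and no case folding happens, so "The" and "the" stay distinct.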

Files are incrementally numbered, starting with 0. Only valid, openable files should be included in the count. (is_open comes in handy here.) You may assume that you will not be given any duplicate files in the input file (i.e., foo1.txt will only appear once).

Your program should absolutely not produce any other output. Extraneous output, or output formatted incorrectly (extra spaces etc.) will make the autograder mark your solution as incorrect. Please leave yourself extra time to work these problems out.

Implementation Hints

Implement the data structure using the C++ Standard Template Library (STL) as a map of sets, as in:

map<string, set<int> > invertedIndex;
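A short sketch of why this structure suits the assignment: std::map keeps its keys sorted by std::string comparison, which orders by character value and so matches the ASCII ordering the output requires. std::set keeps document numbers sorted and discards duplicates. The typedef InvertedIndex and the function addWord are names invented for this illustration.

```cpp
#include <map>
#include <set>
#include <string>

// Illustrative alias for the structure declared above.
typedef std::map<std::string, std::set<int> > InvertedIndex;

// Hypothetical helper: record that `word` occurs in document `doc`.
// operator[] creates an empty set the first time a word is seen,
// and set::insert ignores a document number that is already present.
void addWord(InvertedIndex& index, const std::string& word, int doc) {
    index[word].insert(doc);
}
```

Adding "the" for documents 1, 0, and 1 again leaves a single entry whose set holds 0 and 1, and a later "The" sorts ahead of it.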

Use C++ strings and file streams. Sample Code:
#include <string>
#include <fstream>

Make sure that your project uses an ifstream, not an fstream. Both ifstream and fstream are declared in the <fstream> header.

Remember, your program needs to be robust to errors. Files may be empty, etc. Please handle these cases gracefully and with no extra output.

The noskipws manipulator may be useful in parsing the input:
input >> noskipws >> c;

Handing Project In

Your project will be handed in using the autograding system. Please see the autograder web page for details on how to submit your solution.

Project Writeup

No writeup required for this project.