CSCI 339
Distributed Systems
Assignment 3: Inverted Index
Due: Monday, April 21, 2008, 11:59pm
The goal of this project is to become familiar with MapReduce, a popular model for programming distributed systems that was developed by Google, and with Hadoop, its open-source implementation released by Apache.
This project should be done in teams of two (or three with prior approval).
Part 0: Cluster Setup
You have all been provided with a set of virtual machines. These machines are only accessible via SSH from sysnet.cs.williams.edu. Please log in to sysnet with the class account, and then SSH to your VMs from there. Do not store files or run anything directly on sysnet; it is only a gateway to your private VMs.
You should have already completed the following steps to set up your cluster:
Part 1: Build an Inverted Index
An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines use
some form of inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table that maps
each word in the documents to some sort of document identifier. For example, given the following two documents:
Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Doc2: Buffalo are mammals.
Splitting on whitespace alone, we could construct the following inverted index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
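The example above can be reproduced with a minimal in-memory sketch in plain Java. It splits on whitespace only, as the example does, and keeps the index in a sorted map. The class name WhitespaceIndexSketch and the build method are made up for illustration, and nothing here uses Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class WhitespaceIndexSketch {
    /** Builds word -> sorted set of document ids, splitting on whitespace only. */
    static Map<String, SortedSet<String>> build(Map<String, String> docs) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String token : doc.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                // The set removes duplicates, so each document is listed once per word.
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("Doc1", "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.");
        docs.put("Doc2", "Buffalo are mammals.");
        build(docs).forEach((word, ids) ->
                System.out.println(word + " -> " + String.join(", ", ids)));
    }
}
```

Because no punctuation or case is stripped, "Buffalo", "buffalo" and "buffalo." end up as three separate keys, exactly as in the example.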
Your goal is to build an inverted index of words to the documents which contain them. Please do this on the files in the dataset located here. You will need to copy these files to your cluster. This may take a while!
Your end result will be something of the form: (word, docid[]).
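One way to picture how a (word, docid[]) result arises from a map step and a reduce step is the following plain-Java simulation. It is only a sketch and does not use Hadoop's API: the class name MapReduceShapeSketch is hypothetical, and the run helper merely stands in for the framework's grouping of pairs by key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class MapReduceShapeSketch {
    /** Map step: emits one (word, docId) pair per whitespace-separated token. */
    static List<Map.Entry<String, String>> map(String docId, String line) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, docId));
            }
        }
        return pairs;
    }

    /** Reduce step: collapses the doc ids seen for one word into a sorted, duplicate-free list. */
    static List<String> reduce(String word, List<String> docIds) {
        return new ArrayList<>(new TreeSet<>(docIds));
    }

    /** Stand-in for the framework: runs map on (docId, line) records, groups by word, runs reduce. */
    static Map<String, List<String>> run(List<Map.Entry<String, String>> input) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> record : input) {
            for (Map.Entry<String, String> pair : map(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, List<String>> result = new TreeMap<>();
        grouped.forEach((word, ids) -> result.put(word, reduce(word, ids)));
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = new ArrayList<>();
        input.add(new SimpleEntry<>("Doc1", "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."));
        input.add(new SimpleEntry<>("Doc2", "Buffalo are mammals."));
        run(input).forEach((word, ids) -> System.out.println("(" + word + ", " + ids + ")"));
    }
}
```

In a real job the grouping happens between the map and reduce phases across machines; here it is a single in-memory map so that the shape of the data is easy to follow.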
The main obstacle here is that the default input format does not give your map function the name of the file being read. You will have to find some way around this, such as putting the name of the file on each line of the file or, even better, writing a new InputFormat class. This will make you more familiar with the inner workings of Hadoop.
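The first workaround, putting the file name on each line, could be done as a preprocessing step before the files are copied to the cluster. The sketch below is one possible approach in plain Java, not a required one. The class name FilenameTaggerSketch, the tab separator, and the assumption that the files are UTF-8 text are all choices made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class FilenameTaggerSketch {
    /** Returns the file's lines, each rewritten as "<file name>\t<original line>". */
    static List<String> tagLines(Path file) throws IOException {
        String name = file.getFileName().toString();
        List<String> tagged = new ArrayList<>();
        // Assumption: the input is UTF-8 text; other encodings need a different Charset.
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            tagged.add(name + "\t" + line);
        }
        return tagged;
    }

    /** For use inside a map function: splits a tagged line back into {docId, text}. */
    static String[] untag(String taggedLine) {
        int tab = taggedLine.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("line has no file-name prefix: " + taggedLine);
        }
        return new String[] { taggedLine.substring(0, tab), taggedLine.substring(tab + 1) };
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: FilenameTaggerSketch <input file> <output file>");
            return;
        }
        Files.write(Paths.get(args[1]), tagLines(Paths.get(args[0])), StandardCharsets.UTF_8);
    }
}
```

The trade-off is that every line grows by the length of the file name, which is one reason a custom InputFormat is the cleaner solution.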
You'll also have to do a little more than simply tokenize the texts on whitespace. Make sure that punctuation and case are also stripped (the example above strips neither). However, make sure that contractions don't change meaning, as when "it's" becomes the possessive "its".
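A minimal sketch of such a tokenizer in plain Java follows. It lowercases the text and treats a word as a run of letters or digits, optionally joined by internal apostrophes, so "it's" stays distinct from "its" while leading and trailing punctuation is dropped. The class name TokenNormalizerSketch and the exact pattern are assumptions. Other choices, such as how to treat hyphens or curly apostrophes, are equally defensible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenNormalizerSketch {
    // A word: letters/digits, with apostrophes allowed only between such runs.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'[\\p{L}\\p{N}]+)*");

    /** Lowercases the text and returns its words with surrounding punctuation removed. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [buffalo, buffalo, it's, not, its]
        System.out.println(tokenize("Buffalo buffalo. It's not \"its\"!"));
    }
}
```

With this pattern a hyphenated word such as "well-known" is split into two tokens, and only the straight ASCII apostrophe is recognized inside contractions.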
Optional Extensions
If you get this working and want to try something else, here are some optional extensions that you can experiment with:
Part 2: Writeup
To submit your assignment, email a gzipped tarball to jeannie@cs.williams.edu.
Please include the following files in your tarball.
1) Your writeup (preferably PDF).
2) All the files for your source code only. Please do not include any executables.
3) Your Makefile (or compile instructions) and your jar file (build.jar).
4) A snippet of your inverted index (sample output). This can be part of your writeup.
5) General thoughts/reflection. What did you think of this assignment? How long did it take you to complete (approximately)? Do you recommend that I use it again in future classes? This can also be included as part of your writeup.
Resources