CSCI 339

Distributed Systems


Writing a MapReduce

Introduction

MapReduce is based on two higher-order functions found in many functional programming languages: map and reduce (also called fold).

The Map function takes a {key, value} pair, performs some computation on it, then outputs a {key, value} pair.

Map: (key, value) -> (key, value)

The Reduce function takes a key and all the values collected for that key, {key, value[]}, performs some computation on them, then outputs a list of {key, value} pairs.

Reduce: (key, value[]) -> (key, value)[]

Map and Reduce functions emit their {key, value} pairs by calling output.collect(key, value).

Writing a MapReduce is as simple as writing these two functions (or using already-provided functions), then telling a job object which functions you want to use.
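To make the two signatures concrete, here is a minimal in-memory sketch in plain Java, with no Hadoop involved. It counts words: map emits a (word, 1) pair per word, the driver groups the pairs by key, and reduce sums each group. The Collector interface, the class name MapReduceSketch, and the run driver are hypothetical stand-ins invented for this illustration; they only mimic the role of output.collect and of the framework's grouping step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSketch {

  // Hypothetical stand-in for Hadoop's OutputCollector: receives emitted pairs.
  interface Collector {
    void collect(String key, int value);
  }

  // Map: (key, value) -> (key, value). The value is one line of text;
  // a (word, 1) pair is emitted per word. The input key (an offset) is unused.
  static void map(long offset, String line, Collector output) {
    for (String word : line.trim().split("\\s+")) {
      if (!word.isEmpty()) {
        output.collect(word, 1);
      }
    }
  }

  // Reduce: (key, value[]) -> (key, value)[]. Sums the counts for one word.
  static void reduce(String word, List<Integer> counts, Collector output) {
    int sum = 0;
    for (int c : counts) {
      sum += c;
    }
    output.collect(word, sum);
  }

  // In-memory driver: map every record, group values by key, reduce each group.
  static Map<String, Integer> run(List<String> lines) {
    final Map<String, List<Integer>> grouped = new TreeMap<>();
    long offset = 0;
    for (String line : lines) {
      map(offset, line, (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
      offset += line.length() + 1;
    }
    final Map<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      reduce(e.getKey(), e.getValue(), (k, v) -> result.put(k, v));
    }
    return result;
  }

  public static void main(String[] args) {
    // prints {a=2, b=2, c=1}
    System.out.println(run(Arrays.asList("a b a", "b c")));
  }
}
```

In a real job the grouping step is done by the framework between the map and reduce phases, across machines; the sketch only shows how data flows through the two functions.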

Trivial Identity Example

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Trivial {

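  public static void main(String[] args) throws IOException {
    // Check the arguments before configuring anything.
    if (args.length < 2) {
      System.err.println("ERROR: Wrong number of parameters");
      System.err.println("trivial <input_path> <output_path>");
      System.exit(1);
    }

    // The job configuration; Trivial.class tells Hadoop which jar to ship.
    JobConf conf = new JobConf(Trivial.class);
    conf.setJobName("trivial");

    // Tell the job which map and reduce functions to use.
    conf.setMapperClass(Trivial.IdentityMapper.class);
    conf.setReducerClass(Trivial.IdentityReducer.class);

    // Input and output locations in the DFS.
    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    // Submit the job and wait for it to finish.
    JobClient.runJob(conf);
  }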

  public static class IdentityMapper extends MapReduceBase implements Mapper {
    public void map(WritableComparable key, Writable val,
                    OutputCollector output, Reporter reporter)
      throws IOException {
      output.collect(key, val);
    }
  }

  public static class IdentityReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
      throws IOException {
      while (values.hasNext()) {
        output.collect(key, (Writable)values.next());
      }
    }
  }
}
    

This may seem like a lot of code to set up, but most of it is imports and job configuration; the map and reduce functions themselves are only a few lines each.

Compiling your MapReduce Program

In this example, we will assume that the above program resides in a directory called example in your home directory. Copy this Makefile and put it into the example directory. Now do:

    $ cd example
    $ make
    

At this point you should have a file called build.jar in the example directory. You should now be able to launch your first Hadoop job!

    $ cd ~/hadoop-0.15.3
    $ bin/hadoop jar ~/example/build.jar Trivial input-file output-dir
    

Note that this ASSUMES that you have already done the following:

Test this with a small input file first. Hadoop reports the progress of your job on standard output. For a more detailed status report, check the status pages. Once the job finishes, the output is available under output-dir inside your DFS, so to view or analyze it directly you must first copy it to your local file system with bin/hadoop dfs -get.

JobConf

There are many options for JobConf objects. For complete details see JobConf on the Hadoop API page. I shall cover what I have found to be the most useful options here.

Mapper & Reducer

While map is the only function you are required to implement in a Mapper class, and reduce is the only one required in a Reducer class, a few other useful functions are inherited from MapReduceBase, notably configure(JobConf) and close(). For complete details see MapReduceBase on the Hadoop API page. I shall cover what I have found to be the most useful options here.
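The two hooks that MapReduceBase provides besides map and reduce are configure(JobConf), called once before a task processes its first record, and close(), called once after its last. The sketch below models only that call order in plain Java so it runs without Hadoop. TaggingMapper, runTask, and the String setting passed to configure are hypothetical stand-ins, not Hadoop types; in a real Mapper, configure receives the JobConf and the framework plays the role of runTask.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LifecycleSketch {

  // Hypothetical stand-in for a mapper class; NOT a Hadoop type.
  static class TaggingMapper {
    private String tag = "";
    private int seen = 0;
    final List<String> calls = new ArrayList<>();

    // Runs once, before the first record. In Hadoop this hook receives the
    // JobConf; a plain String keeps the sketch dependency-free.
    void configure(String tagSetting) {
      tag = tagSetting;
      calls.add("configure");
    }

    // Runs once per input record.
    void map(String key, String value, List<String> output) {
      seen++;
      output.add(tag + ":" + key + "=" + value);
    }

    // Runs once, after the last record; a natural place to release resources.
    void close() {
      calls.add("close after " + seen + " records");
    }
  }

  // Hypothetical driver playing the role of the framework for a single task.
  static List<String> runTask(TaggingMapper mapper, String setting, List<String[]> records) {
    List<String> output = new ArrayList<>();
    mapper.configure(setting);
    for (String[] r : records) {
      mapper.map(r[0], r[1], output);
    }
    mapper.close();
    return output;
  }

  public static void main(String[] args) {
    TaggingMapper mapper = new TaggingMapper();
    List<String[]> records = Arrays.asList(
        new String[] {"k1", "v1"},
        new String[] {"k2", "v2"});
    System.out.println(runTask(mapper, "job42", records));
    System.out.println(mapper.calls);
  }
}
```

Per-task state such as a setting read in configure lives in instance fields, which is why the hook runs before any call to map.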