Void’s Vault

Knowledge source for efficiency.

Testing MapReduce With MRUnit

I’ve recently started a big data project with Mathieu Dumoulin. We are using Mahout with Hadoop to do machine learning with MapReduce, in order to deal with big data the right way. We’ve found a good way to test our MapReduce code, and that’s what I present in this post.

Long story short, we are working on porting a learning algorithm to Mahout, for the sole purpose of contributing to this framework, which is growing in popularity. Since scaling an algorithm involves some MapReduce programming, we had to find a way to test our code.

Oh, by the way, if you want more information about Mahout, Hadoop or MapReduce, just click on the names, use Google or contact me.

The right way to test MapReduce Java code is, as you’ve guessed, with JUnit, but with the help of another tool called MRUnit. The best tutorial on its usage, for now, can be found there.

Basically, there are three classes that allow you to test your mappers and your reducers. These classes are drivers that manage the context by themselves, leaving you to worry only about what goes into and what comes out of the mappers and reducers.

The first one is MapDriver. You instantiate it by passing your mapper as a parameter, then you call the withInput method once or multiple times to define the inputs to your mapper. After that, you use one or more calls to the withOutput method to declare the expected outputs, and runTest validates the mapping. Another option is to call the run function and make assertions on the result yourself. See the code samples provided on the last link.
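As a sketch of that second, run-and-assert style, here is what such a test could look like for the tokenizing mapper presented later in this post (the test class name is mine):

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Test;

import static org.junit.Assert.assertEquals;

public class TokenizingMapperRunTest {
    @Test
    public void testMapWithRun() throws IOException {
        MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
                MapDriver.newMapDriver(new TokenizingMapper());

        // run() returns the raw output pairs instead of asserting for you
        List<Pair<Text, IntWritable>> output =
                mapDriver.withInput(new LongWritable(0), new Text("foo bar")).run();

        assertEquals(2, output.size());
        assertEquals(new Text("foo"), output.get(0).getFirst());
        assertEquals(new IntWritable(1), output.get(0).getSecond());
    }
}
```

This style is handy when the expected output is easier to express programmatically than as a fixed list of withOutput calls.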

The second one is ReduceDriver. Just like with the MapDriver, you call the withInput method as many times as you need, then either get the result of the reduce with the run function and assert on it, or declare expected outputs and call runTest.
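Here is a minimal sketch of a ReduceDriver test, written for a hypothetical SummingReducer (not from the project, defined inline here for illustration) that adds up the counts emitted for a word:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class SummingReducerTest {
    // hypothetical reducer: sums all counts received for a given word
    public static class SummingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    @Test
    public void testReduce() throws IOException {
        // a reducer input is one key plus the list of all values grouped under it
        ReduceDriver.newReduceDriver(new SummingReducer())
                .withInput(new Text("bar"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("bar"), new IntWritable(2))
                .runTest();
    }
}
```

Note that withInput takes the key together with a list of values, mirroring the grouping Hadoop performs before calling your reducer.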

The last one is MapReduceDriver. Basically, it lets you test both the mapper and the reducer at the same time. Pretty cool for integration tests.
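As a sketch, such an integration test could look like the following, assuming the TokenizingMapper from this post and a hypothetical summing reducer like the one sketched above (here referenced as SummingReducer):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Test;

public class WordCountJobTest {
    @Test
    public void testMapperAndReducerTogether() throws IOException {
        // MapReduceDriver shuffles and sorts between the map and reduce phases,
        // so expected outputs must be declared in key order ("bar" before "foo")
        MapReduceDriver.newMapReduceDriver(new TokenizingMapper(), new SummingReducer())
                .withInput(new LongWritable(0), new Text("foo bar bar"))
                .withOutput(new Text("bar"), new IntWritable(2))
                .withOutput(new Text("foo"), new IntWritable(1))
                .runTest();
    }
}
```

The driver takes care of grouping the mapper’s output by key before feeding it to the reducer, just like a real job would.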

Here’s a little extra code sample from the project I’m currently doing with Mathieu Dumoulin. We created a tokenizing mapper to split a line of text into words. Here’s the test written for that mapper:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizingMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setup() {
        TokenizingMapper mapper = new TokenizingMapper();
        mapDriver = MapDriver.newMapDriver(mapper);
    }

    @Test
    public void testMap() throws IOException, InterruptedException {
        mapDriver.withInput(new LongWritable(0), new Text("foo bar bar"));
        mapDriver.withOutput(new Text("foo"), new IntWritable(1));
        mapDriver.withOutput(new Text("bar"), new IntWritable(1));
        mapDriver.withOutput(new Text("bar"), new IntWritable(1));
        mapDriver.runTest();
    }
}

Since it’s tested, you don’t have to resort to ugly tricks like printing lines just to make sure the mapper did its job, or write helper functions that verify by hand that things are mapped correctly. The code stays as clean as the following:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable longWritable, Text textToTokenize, Context context) throws IOException, InterruptedException {
        StringTokenizer stringTokenizer = new StringTokenizer(textToTokenize.toString());

        while (stringTokenizer.hasMoreTokens()) {
            Text word = new Text(stringTokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Enjoy!