I recently started a big data project with Mathieu Dumoulin. We are using Mahout on Hadoop to run machine learning as MapReduce jobs, so that it scales to large datasets. Along the way we found a way to test our MapReduce code, and that is what I present in this post.
Long story short, we are porting a learning algorithm to Mahout, with the sole purpose of contributing to this framework, which is growing in popularity. Since scaling an algorithm involves some MapReduce programming, we had to find a way to test our code.
Basically, there are three classes that let you test your mappers and your reducers. These classes are drivers that manage the context by themselves, leaving you with only two concerns: what goes into the mappers and reducers, and what comes out of them.
The first one is MapDriver. You instantiate it by passing your mapper to the constructor, and then all you have to do is call the withInput method once or several times to define the inputs to your mapper. After that, you use one or more calls to the withOutput method to validate the mapping. Another option is to call the run function and make assertions on the result. See the code samples provided at the last link.
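To make the withInput / withOutput / run flow concrete, here is a minimal plain-Java sketch of the idea. It is not the real MapDriver: the `SimpleMapper` interface and the `Driver` class are hypothetical stand-ins written for this post, and they work on ordinary Java types instead of Hadoop's. The method names simply mirror the ones described above (the driver class names match those in Apache MRUnit, but nothing here calls that library).

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Plain-Java stand-in for the driver idea; NOT the real MapDriver class. */
final class MapDriverSketch {

    /** Hypothetical mapper shape: one input record in, zero or more pairs out. */
    interface SimpleMapper<KI, VI, KO, VO> {
        void map(KI key, VI value, BiConsumer<KO, VO> emit);
    }

    /** Hypothetical driver: collects inputs and expected outputs, then compares. */
    static final class Driver<KI, VI, KO, VO> {
        private final SimpleMapper<KI, VI, KO, VO> mapper;
        private final List<Map.Entry<KI, VI>> inputs = new ArrayList<>();
        private final List<Map.Entry<KO, VO>> expected = new ArrayList<>();

        Driver(SimpleMapper<KI, VI, KO, VO> mapper) {
            this.mapper = mapper;
        }

        Driver<KI, VI, KO, VO> withInput(KI key, VI value) {
            inputs.add(new SimpleImmutableEntry<>(key, value));
            return this;
        }

        Driver<KI, VI, KO, VO> withOutput(KO key, VO value) {
            expected.add(new SimpleImmutableEntry<>(key, value));
            return this;
        }

        /** Runs the mapper over every input and returns what it emitted, in order. */
        List<Map.Entry<KO, VO>> run() {
            List<Map.Entry<KO, VO>> actual = new ArrayList<>();
            for (Map.Entry<KI, VI> in : inputs) {
                mapper.map(in.getKey(), in.getValue(),
                        (k, v) -> actual.add(new SimpleImmutableEntry<>(k, v)));
            }
            return actual;
        }

        /** Fails if the emitted pairs differ from the declared expectations. */
        void runTest() {
            List<Map.Entry<KO, VO>> actual = run();
            if (!actual.equals(expected)) {
                throw new AssertionError("expected " + expected + " but got " + actual);
            }
        }
    }

    public static void main(String[] args) {
        // Example mapper: emits each line as the key and its length as the value.
        SimpleMapper<Long, String, String, Integer> lengthMapper =
                (offset, line, emit) -> emit.accept(line, line.length());

        // Style 1: declare the expected output and let the driver compare.
        new Driver<>(lengthMapper)
                .withInput(0L, "hadoop")
                .withOutput("hadoop", 6)
                .runTest();

        // Style 2: run the mapper and assert on the returned pairs yourself.
        List<Map.Entry<String, Integer>> result =
                new Driver<>(lengthMapper).withInput(0L, "mahout").run();
        if (result.size() != 1 || result.get(0).getValue() != 6) {
            throw new AssertionError("unexpected result " + result);
        }
    }
}
```

The point of the pattern is that the test never builds a context by hand: it only states inputs and expected outputs.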
The second one is ReduceDriver. Just like with MapDriver, you call the withInput method as many times as you need, get the result of the reduce with the run function, and assert on that.
The last one is MapReduceDriver. Basically, it lets you test both the Mapper and the Reducer at the same time. Pretty cool for integration tests.
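The reducer case differs mainly in the shape of the input: one key together with all the values grouped under it. The sketch below shows that shape in plain Java. `SimpleReducer`, `run`, and the `sum` reducer are hypothetical names made up for illustration, not classes from Hadoop or from the real ReduceDriver.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Plain-Java stand-in for the reducer-testing idea; NOT the real ReduceDriver. */
final class ReduceDriverSketch {

    /** Hypothetical reducer shape: one key plus all of its values in, pairs out. */
    interface SimpleReducer<K, V, KO, VO> {
        void reduce(K key, Iterable<V> values, BiConsumer<KO, VO> emit);
    }

    /** Feeds one (key, values) group to the reducer and returns what it emitted. */
    static <K, V, KO, VO> List<Map.Entry<KO, VO>> run(
            SimpleReducer<K, V, KO, VO> reducer, K key, List<V> values) {
        List<Map.Entry<KO, VO>> out = new ArrayList<>();
        reducer.reduce(key, values, (k, v) -> out.add(new SimpleImmutableEntry<>(k, v)));
        return out;
    }

    /** Example reducer: sums the counts received for a word. */
    static void sum(String word, Iterable<Integer> counts, BiConsumer<String, Integer> emit) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        emit.accept(word, total);
    }

    public static void main(String[] args) {
        SimpleReducer<String, Integer, String, Integer> sumReducer = ReduceDriverSketch::sum;

        // Run the reducer on one group, then assert on the result.
        List<Map.Entry<String, Integer>> result =
                run(sumReducer, "hadoop", Arrays.asList(1, 1, 1));

        if (result.size() != 1 || result.get(0).getValue() != 3) {
            throw new AssertionError("unexpected result " + result);
        }
    }
}
```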
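What makes this an integration test is the step between the two phases: the mapper's output is grouped by key before the reducer sees it. Here is a rough plain-Java sketch of that pipeline using a word count, which is an example chosen for illustration and not the algorithm from our project. It does not use Hadoop or the real MapReduceDriver, and every name in it is hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of map, group-by-key, reduce; NOT the real MapReduceDriver. */
final class MapReducePipelineSketch {

    /** Map step: emit (word, 1) for every whitespace-separated word. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleImmutableEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce step: sum all the counts collected for one word. */
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    /** Maps every line, groups the values by key (sorted), then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(Arrays.asList("big data", "big test"));
        System.out.println(counts); // prints {big=2, data=1, test=1}
    }
}
```

A test at this level only looks at the lines going in and the final pairs coming out, so it catches mismatches between what the mapper emits and what the reducer expects.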
Here’s a little extra code sample from the project I’m currently doing with Mathieu Dumoulin. We created a tokenizing mapper to split a line of text into words. Here’s the test written for that mapper:
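As a rough idea of what a test for a tokenizing mapper can look like, here is a plain-Java sketch. It is not the test from our project: the `map` method is a hypothetical stand-in that assumes the mapper splits on whitespace and emits each word, and the `emitted` and `check` helpers are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of a tokenizer test in plain Java; all names here are hypothetical. */
final class TokenizerMapperTestSketch {

    /** Stand-in for the mapper under test (assumed behavior: split on whitespace). */
    static void map(String line, Consumer<String> emit) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                emit.accept(token);
            }
        }
    }

    /** Test helper: runs the mapper on one line and returns the emitted words. */
    static List<String> emitted(String line) {
        List<String> words = new ArrayList<>();
        map(line, words::add);
        return words;
    }

    /** Test helper: fails with a readable message when the lists differ. */
    static void check(List<String> expected, List<String> actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        // A normal line is split into its words, in order.
        check(Arrays.asList("big", "data"), emitted("big data"));
        // Extra whitespace must not produce empty tokens.
        check(Arrays.asList("big", "data"), emitted("  big   data "));
        // An empty line emits nothing.
        check(Collections.emptyList(), emitted(""));
    }
}
```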
Since it’s tested, you don’t have to resort to ugly things like print statements just to make sure the mapper did its job, or write functions that verify by hand that things are mapped correctly. The code stays as clean as the following:
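For illustration, a tokenizing mapper with no debugging code in it can be as small as the sketch below. This is plain Java, not our actual Hadoop mapper: the `Consumer<String>` parameter is an assumed stand-in for the context a real mapper writes to, and splitting on whitespace is an assumption about what "tokenizing" means here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

/** Plain-Java sketch of a tokenizing mapper; not an actual Hadoop Mapper subclass. */
final class TokenizingMapperSketch {

    // Compiled once, since map() is called for every line of input.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /**
     * Splits one line into words and hands each word to the output.
     * The offset is the position of the line in its file; it is unused here.
     */
    static void map(long offset, String line, Consumer<String> output) {
        for (String word : WHITESPACE.split(line.trim())) {
            if (!word.isEmpty()) {
                output.accept(word);
            }
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        map(0L, "testing map reduce code", words::add);
        System.out.println(words); // prints [testing, map, reduce, code]
    }
}
```

Nothing in the mapper exists only for checking it: no print statements and no verification helpers, because that work lives in the tests.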