Reduce the Pain of MapReduce

Posted on: September 6th, 2012 by Ross Farrelly 1 Comment

Recently I re-read Dean and Ghemawat’s much-cited 2004 paper, which did so much to popularize MapReduce. I thought it would be nice to implement a couple of the problems they cite as well suited to MapReduce. What made it even nicer was the ease with which these could be implemented using Aster’s Integrated Development Environment (IDE).

"Aster Data's new integrated development environment will enable us to automate parts of development that currently take days, allowing us to build rich analytic applications significantly faster and more easily."

Let’s take one example: the Reverse Web-Link Graph. Here we are given a list of web pages and, for each page, a list of the links on that page. We want, for each page, a list of all the pages that link to it (the kind of list that plays a major part in Google’s PageRank algorithm). Dean and Ghemawat describe the algorithm as follows: “The map function outputs <target, source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>”. Writing this in the Aster IDE is a piece of cake, since the MapReduce wrapper is taken care of for us and we just need to write the logic for the map step and the reduce step.
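To make the two steps concrete, here is a minimal sketch of the same logic in plain Java. It does not use the Aster SQL-MR API and is not the code from the figures: the class name ReverseLinkGraph, the method names, and the in-memory run method that stands in for the framework's grouping of pairs by target are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, no Aster or Hadoop classes involved.
public class ReverseLinkGraph {

    // Map step: for one source page, emit a {target, source} pair per outgoing link.
    static List<String[]> map(String source, List<String> targets) {
        List<String[]> pairs = new ArrayList<>();
        for (String target : targets) {
            pairs.add(new String[] {target, source});
        }
        return pairs;
    }

    // Reduce step: concatenate every source collected for a single target.
    static String reduce(String target, List<String> sources) {
        return target + " <- " + String.join(",", sources);
    }

    // Stand-in for the framework: map every page, group pairs by target, reduce each group.
    static List<String> run(Map<String, List<String>> pages) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : pages.entrySet()) {
            for (String[] pair : map(page.getKey(), page.getValue())) {
                grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            output.add(reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical input: each page name mapped to the links found on that page.
        Map<String, List<String>> pages = new TreeMap<>();
        pages.put("a.html", Arrays.asList("b.html", "c.html"));
        pages.put("b.html", Arrays.asList("c.html"));
        pages.put("c.html", Arrays.asList("a.html"));
        for (String line : run(pages)) {
            System.out.println(line);
        }
    }
}
```

On a real cluster the grouping in run is what the platform does between the two steps, which is why only the bodies of map and reduce need to be written by hand.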

In Eclipse (with the Aster plug-in installed, of course) we create a new Aster MapReduce project and add a new Aster map function. The IDE builds the wrapper, and all we need to do is add nine lines of Java code (Figure 1) to create the map function, which ingests the source page names and a list of target URLs and emits the <target, source> pairs.


Figure 1

In the IDE we now create a new input file and a test configuration and check that the map function is working. Now for the reduce function, which is equally quick and easy to write. Using the built-in wizard we create the wrapper for the reduce function and this time add just 13 lines of code to complete the function (Figure 2).


Figure 2

After testing the reduce function we are ready to install and run it on our cluster. This is easily done from within the IDE: we create and run an SQL file in Eclipse, which automatically zips the Java MapReduce functions, installs them on the database and calls them. The functions are now available to be called from SQL-MR scripts via PuTTY or any other SQL client (Figure 3).


Figure 3

So, with very little effort, we can write, test, debug and install a MapReduce function on our cluster using the Aster IDE. For more information on how Aster and Hadoop compare with regard to ease of use and execution times, see this report recently released by the Enterprise Strategy Group.

Ross Farrelly
Chief Data Scientist
Teradata Australia and New Zealand

One Response

  1. Phillip Flop

    October 18, 2012

    Fascinating.
