Dr. Jimmy Lin, associate professor at the University of Maryland gave a lecture in Amsterdam on Friday about Hadoop MapReduce. Dr. Lin was invited to the Science Park by BigGrid – the Dutch NGI – and by the Amsterdam Information Retrieval community (AIR). Despite Friday was the submission deadline of the EGI Community Forum I decided to take a break from writing and listen to the talk. (BREAKING: Community Forum submission deadline has been extended until Wednesday!)
I was sure I would hear a good lecture. Jimmy Lin is the author of a book
I recently read about MapReduce and he has been working with MapReduce applications at Maryland and at Twitter for several years. What I did not know is that he has excellent presentation skill and very powerful slides. In one hour I not only learnt what MapReduce is and what it can and cannot be used for, but I also learnt how Google translate works and how the problem of DNA sequencing could be translated to the problem of shredding sentences of a Dickens book. On top of that the lecture was followed by a visit to the data centre of SARA, member of the Dutch NGI, operating an impressive number of clusters and a tape robot in the Science Park. So let’s see what really happened in one hour.
The first part of the talk was an overview of the “large data” concept, which, in a simplified way can be described as “a simpler technique on more data beats a more sophisticated technique on less data”. So if you are worried that the results of your simulations are not good enough, don’t blame the algorithm. Simply throw more data into the calculation! This is what Google describes as “The Unreasonable Effectiveness of Data
”. MapReduce, and its open source implementation Hadoop, helps us become successful in this “unreasonable world” as it provides a simple distributed computing model which can run on cheap commodity clusters. Hadoop simplifies distributed applications by saying that “the data centre is the computer” and by providing two functions for application developers to utilise data centres. The two functions are “map” and “reduce”:
- "map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. The worker nodes apply the same transformation function to each sub-problem then pass the answers back to the master node.
- "reduce" step: The master node clusters the received answers based on unique key values then – through another redistribution to the workers – these values are combined through another type of transformation function.
Hadoop is quite simple to learn, yet quite powerful to process huge amounts of data (in the range of TeraBytes). But does it provide the right level of abstraction for distributed computing, or as the lecturer said, “is it the best thing since sliced bread?”. I was glad to see that Dr. Lin’s answer was of course it isn’t! MapReduce is more like a big, heavy hammer. One can hammer in almost all sorts of nails with it, but it can be brutal sometimes.
So, what type of applications is MapReduce good for and for what types is it insufficient? The final part of the talk and the question and answer session revolved around this question. Roughly speaking MapReduce can be good for any application that does not require tightly coupled parallel processing. MPI applications with large number of processes to simulate for example the simulation of nuclear explosions, floods or the weather cannot be translated efficiently to Hadoop. However, embarrassingly parallel applications and many other types of applications that are “somewhere in between these and nuclear explosions” can efficiently run with MapReduce. Language translation and DNA sequencing were used as examples during the talk. Based on what I heard on Friday I can imagine that many of the applications that currently use EGI middleware could be partly or completely redefined in terms of Map and Reduce stages and executed on EGI sites with Hadoop. Rewriting EGI applications for the sake of a more efficient middleware is rarely an option. However, I believe that Hadoop can be ideal for many new applications and should be considered by EGI user communities and by the NGIs who support them as a potential platform. The Dutch NGI already knows this and provides Hadoop services
for its national users. Do other NGIs also experiment with Hadoop? Let us know about your use cases in email, or by submitting an abstract to the EGI Community Forum. (before Wednseday midnight!)
Dr. Jimmy Lin’s slides and an overview of his MapReduce book are available on his website