samskivert: MapReduce: Simplified Data Processing on Large Clusters – Dean & Ghemawat

09 November 2009

Summary
Overview of the map/reduce programming model. Examples: distributed grep, URL access count, reverse web-link graph, term-vector per host, inverted index, distributed sort. Overview of the implementation in Google's data center environment. Execution overview, fault tolerance approach, importance of locality, impact of task granularity, cost/benefit of the backup task technique. Refinements: manual data partitioning, ordering guarantees, post-map combiner, I/O types, side-effects support, bad-record skipping, debuggable local execution, status reporting, counter support. Performance summary for grep and sort on a terabyte input data set, effects of backup tasks and machine failures. Stats on adoption inside Google. A minimal sketch of the programming model follows.
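To make the model concrete, here is a minimal single-process sketch in Python of the map/shuffle/reduce flow, using the paper's canonical word-count example. This is illustrative only; names like `run_mapreduce` are my own, and Google's actual library is a distributed C++ system, not this toy driver.

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Toy sequential simulation of the MapReduce model (illustrative,
    not Google's implementation)."""
    # Map phase: apply the user's map function to every input record,
    # collecting intermediate (key, value) pairs.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in mapper(key, value):
            intermediate[ikey].append(ivalue)
    # The grouping above plays the role of the shuffle: values are
    # bucketed by intermediate key. Reduce phase: merge each key's
    # value list into a final result.
    return {k: reducer(k, vs) for k, vs in sorted(intermediate.items())}

# Word count, the paper's canonical example: map emits (word, 1) for
# every word; reduce sums the counts for each word.
def word_count_map(doc_name, contents):
    for word in contents.split():
        yield word, 1

def word_count_reduce(word, counts):
    return sum(counts)

if __name__ == "__main__":
    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run_mapreduce(docs, word_count_map, word_count_reduce))
    # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The combiner refinement mentioned above amounts to running `word_count_reduce` locally on each map task's output before the shuffle, which is valid here because summation is associative and commutative.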

Comments
I have nothing to say about map/reduce that hasn't surely been said before. I will say that this paper does a great job of going "by the book": they introduce their technique, give a number of concrete examples of its use, explain its implementation, cover some extensions, show its adoption by real engineers, give lots of hard data on how it performs, survey related work, and close with a rousing conclusion. About the only thing they left out was a section on future work. We'll assume that since they are Google engineers, the implied future work is "use map/reduce to take over the world."

Source: PDF, ACM
