MapReduce

Introduction

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It is a framework that allows for the easy development of scalable parallel applications to process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as their original formsFunctional Programming.

Concept

MapReduce involves splitting the input data into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. The MapReduce system orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

Components

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Map Stage

The Map function takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Map function is applied in parallel to every pair (keyed by the keys) in the input dataset. This step is the first brush sorting of the data. As different keys could go to any reducer, a degree of randomness is introduced into which keys go to which reducer.

Reduce Stage

The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain. Each Reduce function processes the intermediate values for a particular key generated by the Map function and produces zero or more output values.

Advantages

MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.

Applications

MapReduce is widely used for data-intensive applications such as indexing the Web, data mining, log file analysis, and machine learning. It is used by many Internet companies, including Google, Yahoo!, Amazon, and Facebook, for various tasks, including ad placement, data analysis, and machine learning. MapReduce is also used for scientific applications, including large-scale data processing, such as DNA sequencing, astronomical data analysis, and social network analysis.

References

1. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. 2. Lin, J., & Dyer, C. (2010). Data-intensive text processing with MapReduce. Synthesis Lectures on Human Language Technologies, 3(1), 1-177. 3. White, T. (2012). Hadoop: The definitive guide. " O'Reilly Media, Inc.". 4. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010, December). Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (Vol. 10, pp. 10-10).