June 12, 2015
Hadoop MapReduce is a software framework for developing applications that can process huge amounts of data, typically running into terabytes in size. The programming model was initially developed at Google to power their search engine; Hadoop later adopted it under the Apache open-source license.
MapReduce has two main functions at its core: map() and reduce(). These two operations are inspired by the functional programming language Lisp.
• Given an input file to process, the framework divides it into smaller chunks (input splits) and creates a new map task for each input split.
• Each map task reads the records in its split and maps input key-value pairs to intermediate key-value pairs.
• map(k1, v1) → list(k2, v2), where (k2, v2) is an intermediate key/value pair.
The number of map tasks depends on the total size of the input; typically it is equal to the number of blocks in the input file.
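As a rough illustration (not the Hadoop API), the map-task count can be estimated from the file size and the HDFS block size; the 128 MB default below is an assumption for Hadoop 2.x:

```python
import math

def estimate_map_tasks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Estimate the map task count as the number of HDFS blocks
    the input file occupies (one map task per block)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 1 TB input file with 128 MB blocks yields 8192 map tasks.
print(estimate_map_tasks(1024 ** 4))  # → 8192
```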
• Mapper outputs are sorted and fed to a partitioner, which distributes the intermediate key/value pairs among the reducers (all intermediate pairs with the same key are sent to the same reducer).
• The MapReduce framework then takes all the intermediate values for a given key and combines them into a list.
• Each reduce task receives the grouped map output (key/list-of-values pairs), performs an operation on the list of values for each key, and emits output key-value pairs.
• reduce(k2, [v2]) → (k2, v3)
The number of reduce tasks is typically set to about 0.95 * no_of_nodes; this leaves room for failed reduce tasks that need to be restarted.
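The partitioning behaviour described above can be sketched in plain Python (not the Hadoop API); CRC32 stands in here for Hadoop's default hash partitioner, the point being that identical keys always land on the same reducer:

```python
import zlib

def partition(key, num_reducers):
    """Route an intermediate key to one of num_reducers reducers.
    A deterministic hash guarantees that all pairs sharing a key
    are partitioned to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# The same key always maps to the same reducer index.
print(partition("apple", 4) == partition("apple", 4))  # → True
```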
Let us take an example to see how MapReduce works. Consider an application that takes an input file and counts the number of occurrences of each word in it. This can be modelled as a map-reduce application.
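The word-count job can be sketched end to end as a minimal in-memory simulation (plain Python, not the Hadoop API), with one function per phase:

```python
from collections import defaultdict

def map_phase(text):
    # map(k1, v1) -> list(k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    # group intermediate values by key: (k2, v2) -> (k2, [v2])
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # reduce(k2, [v2]) -> (k2, v3): sum the counts for each word
    return key, sum(values)

text = "the quick brown fox jumps over the lazy dog the end"
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(map_phase(text)).items())
print(counts["the"])  # → 3
```

In a real Hadoop job the map and reduce phases run as separate distributed tasks and the shuffle is performed by the framework, but the data flow is the same.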
In this post, we learnt about the MapReduce programming model. In the next few posts, I will show how to model various real-world applications using this framework and walk through their implementations.