May 29, 2018
Big Data refers to datasets, or collections of datasets, so large that traditional systems cannot process them. Big Data has become a subject in its own right, covering the tools, techniques and frameworks for working with such data rather than just the data itself. MapReduce is a framework for building applications that process huge volumes of data on large clusters of commodity hardware.
Traditional systems tend to use a centralized server for storing and retrieving data. Standard database servers cannot accommodate such huge amounts of data, and a centralized system becomes a bottleneck when many files must be processed simultaneously.
Google came up with MapReduce to solve this bottleneck. MapReduce divides a task into small parts and processes each part independently by assigning it to a different machine. After all the parts have been processed and analyzed, the output of each machine is collected in a single location and combined into the final output dataset for the given problem.
MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster (source: Wikipedia). It plays a crucial role in Big Data analytics: combined with HDFS, MapReduce lets us handle Big Data.
The basic unit of information in MapReduce is the key-value pair. All data, whether structured or unstructured, must be translated into key-value pairs before it is passed through the MapReduce model.
As the name suggests, the MapReduce model has two different functions: a Map-function and a Reduce-function. The order of operation is always Map | Shuffle | Reduce.
Let us understand each phase in detail:
The mapper works on one key-value pair at a time, and one input may produce any number of outputs. Essentially, the Map-function processes the data and breaks it into several small chunks.
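A minimal sketch of what such a Map-function looks like, using Python rather than Hadoop's actual Java API (the function name and input format here are illustrative, not part of any framework):

```python
def map_colours(line):
    """Take one input record (a line of text) and emit a
    (key, value) pair for every colour word found in it.
    One input may yield any number of output pairs."""
    for word in line.split():
        yield (word, 1)

pairs = list(map_colours("Red Green Blue"))
# pairs == [("Red", 1), ("Green", 1), ("Blue", 1)]
```

Note that the mapper only emits pairs; it never needs to see the whole dataset at once.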
Looking at the Map and Reduce stages, we can see that MapReduce is a sequential computation: a Reducer cannot start until the Mappers have completed their execution. Because a Reducer receives all the values that share a key, it can find every value with the same key and perform its computation on the complete group. And since different reducers work on different keys, they can run simultaneously, which is how parallelism is achieved.
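The grouping-by-key step (the shuffle) and the per-key reduce can be sketched as follows; again this is a simplified illustration in Python, not the framework's real API:

```python
from collections import defaultdict

def shuffle(pairs):
    """Group all values that share a key, so each reducer
    can work on one key's values independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_count(key, values):
    """Reduce-function for counting: sum the values for one key."""
    return (key, sum(values))

groups = shuffle([("Red", 1), ("Blue", 1), ("Red", 1)])
results = [reduce_count(k, v) for k, v in groups.items()]
# Each reduce_count call touches only its own key's values,
# so the calls are independent and could run in parallel.
```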
Let us understand this by an example:
Suppose we have 4 sentences for processing:
When this input is passed to the mapper, the mapper divides it into two subsets.
The first subset contains the first two sentences and the second contains the remaining two. Now the Mapper has:
The Mapper makes a key-value pair for each item in a subset. In our example, the key is the colour and the value is the number of times it appears. So the key-value pairs for subset 1 are (Red, 1), (Green, 1), (Blue, 1) and so on, and similarly for subset 2.
Once this is done, the key-value pairs are given to the reducer as input. The reducer produces the final count of every colour, combining the outputs from the two subsets. The reducer's output is (Red, 4), (Green, 3), (Blue, 4), (Brown, 1), (Yellow, 3), (Orange, 2).
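The whole flow above can be put together in a few lines of Python. The four sentences themselves are not shown in this article, so the input below is a hypothetical one, chosen only so that the counts match the reducer output given above:

```python
from collections import defaultdict

def map_colours(line):
    # Map: emit a (colour, 1) pair for every colour word.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Hypothetical stand-in for the four input sentences.
sentences = [
    "Red Green Blue Yellow",
    "Red Blue Orange Yellow",
    "Red Green Blue Brown",
    "Red Green Blue Yellow Orange",
]

pairs = [p for line in sentences for p in map_colours(line)]
# Reduce: sum the grouped values for each colour.
counts = {key: sum(values) for key, values in shuffle(pairs).items()}
# counts == {"Red": 4, "Green": 3, "Blue": 4,
#            "Yellow": 3, "Orange": 2, "Brown": 1}
```

In a real cluster, the map calls would run on different machines over different input splits, and the framework would perform the shuffle over the network before the reducers run.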
The flow we just walked through is sequential, and the example was deliberately small and basic, meant to introduce MapReduce at a beginner's level. The reason MapReduce is so admired is its capacity for parallelism and its key-value style of analysis, which lets it do wonders with Big Data. On real-world problems it is a great choice for processing data of virtually any volume, so if Big Data is what you are looking forward to, MapReduce should be the first thing that comes to your mind.