
You might be wondering how Facebook designed Hive. The Hive data warehouse plays an important role in the Hadoop ecosystem, and newer technologies are integrating with it: Impala, Drill, and Spark SQL all work with Hive. Once you read this blog, you will know how Hive queries are actually mapped to MapReduce.

Sharing knowledge is important, and in that spirit we are documenting our observations and research on Hive-to-MapReduce mapping.

Note: Readers are expected to have some knowledge of MapReduce and HQL.

Not everyone can learn Java and MapReduce. What next?

In companies like Facebook, where data storage runs into petabytes, analyzing that data is the utmost priority. As everyone knows, Hadoop is known for scalable data storage and parallel processing, delivering results with low latency.

Now the question is: should Facebook train everyone on MapReduce? To learn MapReduce, one should first learn a programming language; the most preferred languages are Python and Java. Learning Java and MapReduce is time consuming and difficult, as it involves understanding key-value pairs and their importance in parallel computing. Let's quickly walk through the different modules in MapReduce:

Input format: Creates input splits and divides them into records.

Input split: Input is divided into fixed-size pieces called input splits, where each split is processed by a single map task.

Record reader: Generates a key-value pair (key, value) for each record in the split.

Map phase

Mapper: Generates the key-value pairs required for a given problem statement.

Combiner: Performs local aggregation; also known as a mini reducer.

Partitioner: Partitions the key space across the reducers.

Reduce phase

Shuffle: Groups records by key and gathers the corresponding values.

Sort: Sorts keys in ascending order by default.

Reduce: Performs aggregation.

Record writer: Writes the results back to HDFS.

MapReduce data flow diagram:

[Figure: MapReduce data flow]

Now, as a programmer, after understanding the different stages of MapReduce, the next step is to learn how to choose key-value pairs that solve a given business problem. Once you are well versed in the stages of MapReduce and in choosing key-value pairs, the next step is to learn Java and the MapReduce API.

Facebook primarily deals with data related to customer interactions, check-ins, ad clicks, and so on. It becomes very important to bring insights out of this data as quickly as possible, and writing a custom program for every quick insight or report is not a workable option.
Suppose Facebook wants to know the number of customers from India who like the page Reebok. Let us try to solve this using MapReduce, assuming the data is on HDFS.

Input format: TextInputFormat
Mapper input (key, value): byte offset, line
Filter condition in mapper: page == Reebok
Mapper output (key, value): Reebok, 1
Combiner output (key, value): Reebok, sum(<1,1,…,1>)
Shuffle output (key, value): Reebok, <5,7,…>
Reducer output (key, value): Reebok, sum(<5,7,…>)

The steps above must be coded inside a Java program, which has 3 parts.

Driver class: Contains the properties and settings for the MapReduce program.

Map class: Contains the map logic and the filter condition.

Reduce class: Contains the logic for aggregation.
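
To make the effort concrete, here is a minimal sketch of what such a three-part program could look like. The class names, the India filter, and the comma-separated userId,country,page input layout are illustrative assumptions, not Facebook's actual code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageLikeCount {

  // Map class: applies the filter condition (country == India,
  // page == Reebok) and emits (Reebok, 1) pairs.
  public static class LikeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length == 3
          && "India".equals(fields[1]) && "Reebok".equals(fields[2])) {
        context.write(new Text("Reebok"), ONE);
      }
    }
  }

  // Reduce class: aggregation -- sums the 1s (or the combiner's
  // partial sums) for each key.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text page, Iterable<IntWritable> counts,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(page, new IntWritable(sum));
    }
  }

  // Driver class: properties and settings for the MapReduce job.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "page like count");
    job.setJarByClass(PageLikeCount.class);
    job.setMapperClass(LikeMapper.class);
    job.setCombinerClass(SumReducer.class);  // local aggregation
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}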

This clearly looks complicated. Say there are 100 employees at Facebook, and every day they each perform 30 similar tasks. Are we really expecting them to write a program for every business case like this?

Let us estimate how many man-hours would be needed to solve 500 different business cases:

Assuming each business case takes 2 hours to code in MapReduce, that is 1,000 man-hours. If the work must be done in one 40-hour week, Facebook would need 25 expert MapReduce and Java programmers. But Facebook already has employees who are well versed in data analysis tools, not programming. Should Facebook make them learn MapReduce and start coding?

What could the answer be? Thinking? Let's crack it!

Facebook nailed it. Now is the time!

Now you know that not every Facebook employee can be trained on MapReduce and Java, or on some other programming language. Facebook's goal, then, was to create a tool that makes life simple for analysts and programmers, letting them analyze big data with zero knowledge of MapReduce.

Let us understand how Facebook designed Hive and mapped it to MapReduce.

Let us work on the following sample dataset:

[Figure: sample sales dataset]

The dataset consists of 4 fields, related to the sales domain:

Custid: ID of the customer.
Region: Region the customer visited.
Sales: Value of the transaction.
Gender: Gender of the customer.

Let us do some data analysis, with the following assumptions:

  • The data is structured.
  • It is distributed across two machines.

Now I want to find each customer's total sales. How would I do that with MapReduce? In MapReduce, as in any distributed parallel programming, choosing the keys and values matters most.

In the map phase, I will select custid as the key and sales as the value.

In the reduce phase, the sum of sales will be calculated for each key.

[Figure: MapReduce flow for total sales per customer]
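
As a sketch, the map and reduce logic could look like the following, reusing the imports and driver pattern from the earlier PageLikeCount example and assuming comma-separated custid,region,sales,gender records (a hypothetical layout):

// Imports and driver boilerplate as in the PageLikeCount sketch above.
public static class SalesMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] f = line.toString().split(",");
    // Key: custid, value: sales -- exactly the choice described above.
    context.write(new Text(f[0]), new IntWritable(Integer.parseInt(f[2])));
  }
}

public static class SalesReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text custid, Iterable<IntWritable> sales,
      Context context) throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable s : sales) total += s.get();  // sum of sales per key
    context.write(custid, new IntWritable(total));
  }
}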

Now, how did Facebook convert a Hive query into a MapReduce program? How did they map a query to key-value pairs? Is it actually possible? If so, can we understand how it happened? Let's crack the secret sauce!

The query in Hive will be: select custid, sum(sales) from table group by custid

[Figure: MapReduce flow for the group-by-custid query]

Whoa! It seems that whatever goes into the select statement is coded in the mapper, grouping happens in the shuffle, and the sum happens in the reduce. Let us dig deeper.

In the mapper, custid is the key and sales is the value. In the Hive query, custid and sales are selected.

In the shuffle, the sales for each custid are grouped and brought to one place. In the query, the same task is expressed with group by custid.

In the reduce, a sum operation is performed on the list of sales for each custid. In the query, the sum keyword does the same thing.

To summarize, Facebook automated this with an application, Hive, which reads the query, works out the key-value pairs, and converts it into a MapReduce program.

Selection in the query happens in the mapper, grouping happens in the shuffle, and aggregation happens in the reduce.

Now tell me, would you prefer writing a Hive query, or programs in MapReduce? Facebook wanted their analysts to understand data and deliver quick insights. They did not want their analysts to learn MapReduce, master the key-value pair concept, and implement it in some programming language.

Data analysts, be happy!

90% of real-world business problems can be solved with HQL on Hive. Let's look at more complex queries, map them to MapReduce, and uncover the hidden secrets of Hive.

Now your reporting team has a requirement to generate reports that show customer spendings across different regions.

Region-wise customer spendings:

Dataset:

[Figure: region-wise sales dataset]

Query:

Select region, custid, sum(sales) from table group by region, custid

This query gives us customer spendings across different regions.

[Figure: MapReduce flow for region-wise customer spendings]

Mapping HQL to MapReduce:

Mapper: key: (region, custid), value: sales. In the query: select region, custid, sales.
Shuffle: key: (region, custid), value: grouped sales. In the query: group by region, custid.
Reduce: key: (region, custid), value: sum(grouped sales). In the query: sum(sales).
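
In a hand-written MapReduce job, one simple way to sketch this composite key is to pack both fields into a single Text key, as below. This is an illustrative shortcut (a custom WritableComparable would be the more robust choice), and the class name and field layout are again assumptions:

// Imports, driver, and summing reducer as in the earlier sketches.
public static class RegionCustMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] f = line.toString().split(",");
    // Key: "region,custid" -- mirrors "group by region, custid".
    context.write(new Text(f[1] + "," + f[0]),
                  new IntWritable(Integer.parseInt(f[2])));
  }
}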

There are two key takeaways:

  • If you are a MapReduce developer trying to figure out which key-value pairs to select, just write a query: whatever goes into the selection forms your key-value pairs.
  • If you are a data analyst trying to figure out how to analyze big data, just learn Hive. It will write the MapReduce programs for you.

Now let us solve a more complex problem in MapReduce:

Your finance team is looking for the most valuable customers. In this case, we sort the customers in descending order based on their spendings. How can we do this in MapReduce, and in Hive?

In MapReduce, we will have two stages. Stage 1 computes each customer's total spendings; stage 2 sorts on spendings.

This is actually complicated. The output of stage 1 is fed into another MapReduce program, and in MapReduce, sorting happens only on keys. So, to sort on spendings, the key-value pairs from stage 1 are flipped in the mapper of stage 2. That way, sorting happens on spendings; in the reduce, the key-value pairs are flipped back and persisted to disk.

Let's see diagrammatically how this happens:

[Figure: two-stage MapReduce for sorting customers by spendings]

In Hive, a simple query (select custid, sum(sales) from table group by custid order by sum(sales) desc) gets the expected results. To do the same in MapReduce, you have to write two MapReduce programs.

Mapping HQL to MapReduce:

Phase 1:

Mapper: key: custid, value: sales. In the query: select custid, sales.
Shuffle: key: custid, value: grouped sales. In the query: group by custid.
Reduce: key: custid, value: sum(grouped sales). In the query: sum(sales).

Phase 2:

Mapper output: key: total sales, value: custid.
Sort output: key: total sales, value: custid. In the query: order by sum(sales).
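
Here is a sketch of just the phase-2 flip, assuming phase 1 wrote tab-separated custid and total-sales lines; the class name and the descending-sort comparator are illustration choices:

// Imports and driver boilerplate as in the earlier sketches.
// Flip (custid, totalSales) to (totalSales, custid) so the framework
// sorts on spendings.
public static class FlipMapper
    extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t");  // custid <tab> totalSales
    context.write(new LongWritable(Long.parseLong(f[1])), new Text(f[0]));
  }
}
// In the driver, sort descending instead of the default ascending:
// job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
// The phase-2 reducer then flips each pair back to (custid, totalSales)
// before writing it to disk.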

To summarize: the Hive compiler decodes the query, figures out the key-value pairs, works out how many MapReduce stages are needed, and generates a MapReduce program with two stages.

Hive's HQL is a gift to the world of big data analysts: simple querying, quicker insights and results.

Data Analysis has become simple!

Now your marketing team wants to identify customers who can be offered platinum credit cards. For this, it asked the analytics team for the list of customers whose total spendings exceed 150.

Query: Select custid, sum(sales) from table group by custid having sum(sales)>150

Mapper: key: custid, value: sales.
Shuffle: key: custid, value: grouped sales.
Reduce: key: custid, value: sum(sales), emitted only if sum(sales) > 150.

[Figure: MapReduce flow for the having-clause query]

Mapping HQL to MapReduce:

Selection of custid and sales in the query happens in the mapper.

Group by custid in the query happens in the shuffle.

sum(sales) and having sum(sales) > 150 in the query happen in the reduce phase.
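
A sketch of that reducer, with the having threshold as a hard-coded constant for illustration:

// Imports, driver, and mapper as in the earlier total-sales sketch.
public static class HavingReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private static final int THRESHOLD = 150;  // from the having clause

  @Override
  protected void reduce(Text custid, Iterable<IntWritable> sales,
      Context context) throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable s : sales) total += s.get();  // sum(sales)
    if (total > THRESHOLD) {  // having sum(sales) > 150
      context.write(custid, new IntWritable(total));
    }
  }
}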

Now decide: would you still like to write MapReduce programs for simple data analysis?

If yes, let me convince you with one final example.

Now I want to find customer spendings, excluding region C.

Query: Select custid, sum(sales) from table where region <> 'C' group by custid

Mapper: key: custid, value: sales. Key-value pairs are emitted to the next phase only when the region is not C.
Shuffle: key: custid, value: grouped sales.
Reduce: key: custid, value: sum(sales).

[Figure: MapReduce flow with where-clause filtering in the mapper]

Mapping HQL to MapReduce:

Selection of custid and sales, along with the where clause, happens in the mapper.

Group by custid in the query happens in the shuffle.

sum(sales) in the query happens in the reduce phase.
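
And a sketch of that mapper-side filter, again assuming the hypothetical custid,region,sales,gender layout:

// Imports, driver, and summing reducer as in the earlier sketches.
public static class RegionFilterMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] f = line.toString().split(",");
    if (!"C".equals(f[1])) {  // where region <> 'C': drop these records early
      context.write(new Text(f[0]), new IntWritable(Integer.parseInt(f[2])));
    }
  }
}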

By now I am pretty sure you are convinced of how Facebook created Hive, and of how Hive converts a query into a MapReduce program.

Conclusion:

Facebook designed Hive to analyze data and bring out insights quickly, rather than to spend time writing programs over the data.

Key points:

Let us map the functionalities of Hive to MapReduce.

[Table: Hive query clauses mapped to MapReduce stages]

  • Selection in a query happens in the mapper.
  • The where condition happens in the mapper.
  • Grouping happens in the shuffle stage.
  • Case statements and if statements happen in the mapper.
  • Functions like UPPER and LOWER happen in the mapper.
  • Ordering, ascending or descending, happens in the sort phase.
  • Group-level aggregations like average, sum, max, and min happen in the reduce phase.
  • The having clause in a query happens in the reduce phase.

This is how Facebook designed Hive, which converts queries into MapReduce.

For data analysis, Hive is the most preferred tool: it has data warehousing capabilities, and at the same time HQL makes our life easier.

On the other hand, Hive does not work for complex analysis like machine learning, or for highly unstructured data such as multimedia datasets.

Now you know how Facebook designed Hive on MapReduce!

Share knowledge, and happy learning!