Jobs & Responsibilities of Hadoop Professional

What is Big Data & Hadoop?

Big Data is a term used for large, complex sets of structured and unstructured data that are analyzed using modern data-processing software and tools. Hadoop is a software framework for storing and processing Big Data. It is used to address the challenges of capturing, storing, analyzing, creating, sharing, transferring, visualizing, querying and updating such data sets within an acceptable elapsed time.

Though Big Data is a broader concept than Hadoop, Hadoop has become such an integral part of Big Data that the two terms are often used synonymously.

Who can pursue Hadoop Course?

Hadoop comprises multiple concepts and modules such as HDFS, MapReduce, HBase, Pig, Hive, Sqoop and ZooKeeper. These tools are used to process high-volume, high-velocity and high-variety data to generate value. To pursue this course you need hands-on experience in Core Java and good analytical skills.

  • Managers looking for the latest technologies to implement in their organization to meet the current and upcoming challenges of data management
  • Any graduate or post-graduate aspiring to a great career in cutting-edge IT technologies
  • Software engineers working in ETL/programming and exploring job opportunities in Hadoop

Job roles after Hadoop Course:

Big Data professionals who can analyze data and turn it into useful information are highly sought after by companies across the world. There are many job titles and opportunities available to Hadoop developers.

Here is a list of job titles to help you make the right decision when choosing your desired role as a Hadoop expert. It is not just tech companies that offer Hadoop jobs; all types of companies do, including financial firms, retail organizations, banks and healthcare organizations.

Hadoop Developer

A Hadoop Developer is responsible for the actual coding or programming of Hadoop applications. The role is similar to that of a Software Developer, but within the Big Data domain.

Hadoop Developer Job Description

A Hadoop Developer has many responsibilities. The following are the tasks a Hadoop Developer is responsible for:

  • Hadoop development and implementation
  • Pre-processing using Hive and Pig
  • Designing, building, installing, configuring and supporting Hadoop
  • Perform analysis of vast data stores and uncover insights
  • Create scalable and high-performance web services for data tracking
  • Managing and deploying HBase
  • Test prototypes and oversee handover to operational teams

Salary and career opportunities

Hadoop certification helps you climb the ladder and rise quickly in your career. It helps people attempting to move into Hadoop from different technical backgrounds. The starting salary of a Hadoop Developer after completing the course is around 6 lakhs per annum.

Hadoop Architect

A Big Data Architect designs and builds efficient yet cost-effective Big Data applications that help clients answer their business questions. Sounds like what Enterprise Architects or Solution Architects typically do? But Big Data Architects are required to look at traditional data-processing problems through a different lens. You must love data. Lots of data. Of good and bad quality. It also helps to be nimble, especially when it comes to newer technology. You should be judicious about tool selection and able to embrace open-source technologies, with all their good and challenging aspects.

Job Description of Hadoop Architect

Hadoop Architects, as the name suggests, are entrusted with the tremendous responsibility of dictating where the organization will go in terms of Big Data Hadoop deployment. They plan, design and strategize the roadmap and decide how the organization moves forward.

This is what you can expect as part of your Hadoop Architect work routine:

  • Take end-to-end responsibility of the Hadoop Life Cycle in the organization
  • Be the bridge between data scientists, engineers and the organizational needs
  • Do in-depth requirement analysis and exclusively choose the work platform
  • Full knowledge of Hadoop Architecture and HDFS is a must
  • Working knowledge of MapReduce, HBase, Pig, Java and Hive
  • Ensuring the chosen Hadoop solution is being deployed without any hindrance

Salary and Career Opportunities

Hadoop Architects have lucrative opportunities globally. The market demands professionals with in-depth knowledge of the Hadoop ecosystem. The average salary of a Hadoop Architect is 4 to 10 lakhs per annum.

Big Data Visualizer

A Big Data Visualizer is a creative thinker who understands user interface design as well as other visualization skills such as typography, user experience design and visual art design. One of the most important aspects of Big Data is the ability to visualize the data in a way that is easy to understand and that reveals new patterns and insights.

Job Description of a Data Visualizer

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to grasp analytics presented visually. Data visualization training is designed to help you understand and use the important concepts and techniques in Tableau, moving from simple to complex visualizations. After data visualization training you will be able to:

  • Understand Tableau terminology
  • Use the Tableau interface to effectively create powerful visualizations
  • Use Reference Lines to highlight elements of your data
  • Use bins, hierarchies, sorts, sets, and filters to create focused and effective visualizations
  • Create visualizations with multiple measures and dimensions
  • Share visualized data
  • Combine visualized data into interactive dashboards

Salary and Career Opportunities

There are visualization jobs pretty much everywhere there is data. A good data visualizer will be able to read raw Big Data analyses, or even perform the analyses, as well as design, illustrate and present the results. That is why various career options are available to you, such as:

  • Data Infographic Designer in News Agency
  • Enhanced Assimilation of Business Information
  • Quick Access to Relevant Business Insights
  • Operational & Business Activities
  • Predictive Sales Analyzer

After completing a data visualization course, lucrative, well-paid jobs await you. The average salary of a data visualization professional is almost 9 lakhs per annum.

Big Data Analyst

Data analysts translate numbers into plain business data, whether it’s sales figures, market research, logistics, or transportation costs. A Data Analyst focuses on analysis of data and solving problems related to types of data, and relationships among data elements within a business system or IT system.

Job Description of Data Analyst:

Data Analysts, also called Business Analysts, are responsible for conducting full-lifecycle analysis of Big Data. The following are important roles of a Data Analyst:

  • Monitor performance and quality-control plans to identify improvements
  • Develop data analytics and other strategies that optimize statistical efficiency and quality
  • Filter and clean data by reviewing computer reports, and performance indicators to locate and correct code problems
  • Work with management to prioritize business and information needs
  • Locate and define new process improvement opportunities

Salary and Career Opportunities
There are different types of data analysts in the field, including operations analysts, marketing analysts, financial analysts, etc. In the current business climate, skilled data analysts are some of the most sought-after professionals in the world.
Data analysts work in different fields such as:

  • Big investment banks
  • Hedge funds and private equity firms
  • Large insurance companies
  • Credit bureaus
  • Technology firms

Data analysts command huge salaries and excellent perks, even at the entry level.


Data Scientist

Data scientists come together to solve some of the hardest data problems an organization might face. These professionals are skilled in automating methods of collecting and analyzing data and utilizing inquisitive exploring techniques to discover previously hidden insight from this data that can profoundly impact the success of any business.

Data Scientist Job Description:

A data scientist is able to take a business problem and translate it into a data question. Data science is the study of the generalizable extraction of knowledge from data. A data science course is not restricted to Big Data alone; it also covers Business Intelligence and analysis. Data scientists have several key responsibilities, such as:

  • Building and optimizing complex data models using machine learning techniques
  • Data mining using state-of-the-art methods
  • Enhancing data collection procedures
  • Processing, cleansing, and verifying the integrity of data used for analysis
  • Doing ad-hoc analysis and presenting results in a clear manner

Salary and Career Opportunities

The career opportunities in data science are growing dramatically, and Data Scientist jobs are among the most sought-after in the tech world today. Expected job roles for a Data Scientist include:

  • The Statistician
  • The Database Administrator
  • The Business Analyst
  • Data and Analytics Manager

The average salary of a Data Scientist is 6 lakhs per annum.


At EduPristine, our counselors can help you make a career in Big Data and Hadoop. They have counseled many professionals and students. If you wish to have a one-on-one discussion with one of our counselors regarding a career in Hadoop, register here.

Top and trending Hadoop tools in Big Data

Big Data certification is one of the most desired IT skills of 2017. It gives you an edge over other professionals and accelerates career growth. The certificate is proof that you can be trusted to handle huge amounts of data. Big Data professionals enjoy many benefits, such as better career growth, salary and job opportunities.

Top Hadoop Tools

Importance of Big Data:

Big Data is becoming popular all over the world. Companies across verticals like retail, media, pharmaceuticals and many more are pursuing it. Big Data tools and techniques help companies process huge amounts of data faster, which raises production efficiency and enables new data-driven products and services.

Uses of Hadoop in Big Data:

A Big Data developer is responsible for the actual coding/programming of Hadoop applications. Mentioned below is some information on the Hadoop architecture:

  • It includes a variety of the latest Hadoop features and tools
  • Apache Hadoop enables massive amounts of data to be processed across clusters of computers using simple programming models.
  • Hadoop has two chief parts: a data processing framework and a distributed file system for data storage.
  • It stores large files, in the range of gigabytes to terabytes, across different machines.
  • Hadoop makes it easier to run applications on systems with a large number of commodity hardware nodes.

9 most popular Big Data tools:

To save your time and help you pick the right tool, we have constructed a list of top Big Data tools in the areas of data extracting, storing, cleaning, mining, visualizing, analyzing and integrating.

Data Extraction Tool:


Talend is a software vendor specializing in Big Data integration. Talend Open Studio for Data Integration helps you efficiently and effectively manage all facets of data extraction, data transformation and data loading using its ETL system.

How Talend’s ETL Works:

In computing, Extract, Transform, Load (ETL) refers to a process in database usage, and especially in data warehousing. In data extraction, data is pulled from source systems; in data transformation, the data is converted into the proper format for storage; in data loading, the data is loaded into the final target database.
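The three stages can be illustrated with a minimal sketch in plain Python. This is not Talend itself; the rows, field names and target table are invented for illustration:

```python
import sqlite3

# Extract: raw records pulled from a source (here, a hardcoded CSV-like list)
raw_rows = ["1,Alice, 2500 ", "2,Bob, 1800 "]

# Transform: parse each row, trim whitespace, and cast to proper types
records = []
for row in raw_rows:
    cust_id, name, sales = row.split(",")
    records.append((int(cust_id), name.strip(), float(sales)))

# Load: insert the cleaned rows into the target database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (cust_id INTEGER, name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 4300.0
```

A real ETL tool adds connectors, scheduling and error handling around this same extract-transform-load skeleton.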

Features of Talend:

The Talend ETL tool boosts developer productivity with a rich set of features, including:

  • Its graphical integrated development environment
  • Drag-and-drop job design
  • More than 900 components and built-in connectors
  • Robust ETL functionality: string manipulations, automatic lookup handling


Pentaho Data Integration (PDI), also called Kettle, is the component of Pentaho responsible for Extract, Transform and Load (ETL) processes. PDI jobs are created with a graphical tool in which you specify what to do without writing code to indicate how to do it.

How Pentaho Works:

PDI can be used as a standalone application, or it can be used as part of the larger Pentaho Suite. As an ETL tool, it is the most popular open source tool available. PDI supports a vast array of input and output formats, including text files, data sheets, and commercial and free database engines.

Important Features of Pentaho

  • Graphical extract-transform-load (ETL) designing system
  • Powerful orchestration capabilities
  • Complete visual big data integration tools

Data Storing Tool:


MongoDB is an open source database that uses a document-oriented data model.

How it Works:

MongoDB stores data using a flexible document data model that is similar to JSON. Documents contain one or more fields, including arrays, binary data and sub-documents. Fields can vary from document to document.
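The flexible document model can be sketched with plain Python dicts standing in for BSON documents (this is an illustration of the data model, not the pymongo API; the collection and field names are invented):

```python
# Two documents in the same hypothetical "customers" collection;
# fields vary per document, unlike rows in a fixed relational schema.
customers = [
    {"_id": 1, "name": "Alice", "tags": ["premium"],   # array field
     "address": {"city": "Pune", "zip": "411001"}},    # sub-document
    {"_id": 2, "name": "Bob", "phone": "555-0100"},    # entirely different fields
]

# An "ad hoc query" over the collection: customers tagged premium
premium = [doc["name"] for doc in customers
           if "premium" in doc.get("tags", [])]
print(premium)  # ['Alice']
```

Because each document carries its own structure, new fields can be added to some documents without migrating the rest of the collection.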

Some features of MongoDB Tool

MongoDB can be used as a file system with load balancing and data replication features over multiple machines for storing files. Following are the main features of MongoDB.

  • Ad hoc queries
  • Indexing
  • Replication
  • Load balancing
  • Aggregation
  • Server-side JavaScript execution
  • Capped collections


Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

How it Works:

Hive has three main functions: data summarization, query and analysis. It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop.

Features of Apache Hive

Apache Hive supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3 file system. Other features of Hive include:

  • Index types, including compaction and bitmap indexes (as of version 0.10)
  • A variety of storage types, such as plain text, RCFile, HBase, ORC and others
  • Operating on compressed data using algorithms including DEFLATE, BWT, Snappy, etc.


Sqoop (SQL-to-Hadoop) is a Big Data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop, and then load the data into HDFS.

How it Works

Sqoop gets its name from SQL + Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import.
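The incremental-append idea (import only rows whose check column exceeds the last imported value) can be sketched independently of Sqoop. This is a plain-Python illustration of the concept, with an invented source table:

```python
# Source table rows, keyed by an ever-increasing id (the "check column")
source_rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

def incremental_import(rows, last_value):
    """Return rows added since the previous import, plus the new last value."""
    new_rows = [r for r in rows if r[0] > last_value]
    new_last = max((r[0] for r in new_rows), default=last_value)
    return new_rows, new_last

# The first run imports everything; a saved job would persist last_value
batch1, last = incremental_import(source_rows, last_value=0)

# Two new rows appear in the source; the next run picks up only those
source_rows += [(5, "e"), (6, "f")]
batch2, last = incremental_import(source_rows, last_value=last)
print(len(batch1), len(batch2))  # 4 2
```

Sqoop's saved jobs store the last imported value for you, so repeated runs stay incremental without re-reading the whole table.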

Here are some important and usable features of Sqoop

  • Parallel import/export
  • Import results of SQL query
  • Connectors for all major RDBMS Databases
  • Kerberos Security Integration
  • Support for Accumulo

Data Mining Tool:

Oracle Data Mining

Oracle Data Mining (ODM), a component of the Oracle Advanced Analytics Database Option, provides powerful data mining algorithms that enable data analysts to discover insights, make predictions and leverage their Oracle data and investment.

How it Works

Oracle Corporation has implemented a variety of data mining algorithms inside the Oracle relational database. With the Oracle Data Mining system, you can build and apply predictive models inside the Oracle Database to help you predict customer behavior, develop customer profiles, identify cross-selling opportunities and detect potential fraud.

Features of Oracle Data Mining

The Oracle Data Miner tool is an extension to Oracle SQL Developer that lets analysts work directly with data inside the database using:

  • A graphical "drag and drop" workflow and component palette
  • Workflows that capture and document the user's analytical methodology
  • Generation of SQL and PL/SQL scripts

Data Analyzing Tool:


HBase is an open source, non-relational, distributed database written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop.

How it works

Apache HBase is a NoSQL database that runs on top of Hadoop as a distributed and scalable Big Data store, leveraging the distributed processing of the Hadoop Distributed File System. It is meant to host large tables with billions of rows and potentially millions of columns, running across a cluster of commodity hardware.

Apache HBase offers the following features:

  • Linear and modular scalability
  • Convenient base classes for backing Hadoop MapReduce jobs
  • Easy to use Java API for client access
  • Block cache and Bloom Filters for real-time queries
  • Query predicate push down via server side Filters
  • Support for exporting metrics


Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. Pig is complete in the sense that you can do all required data manipulations in Apache Hadoop with Pig alone.

How it Works

Pig enables data workers to write complex data transformations without knowing Java. Pig's language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.


Pig Features:

Pig Tool has the following key properties:

  • It is trivial to achieve parallel execution of simple data analysis tasks
  • It permits the system to optimize task execution automatically
  • Users can create their own functions for special-purpose processing

Data Integrating Tool:


Apache ZooKeeper is a coordination service for distributed applications that enables synchronization across a cluster. ZooKeeper in Hadoop can be viewed as a centralized repository where distributed applications can put data and get data out of it.

How it Works

Zookeeper provides an infrastructure for cross-node synchronization and can be used by applications to ensure that tasks across the cluster are serialized or synchronized. ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application.

Features of Zookeeper

ZooKeeper helps with coordination between Hadoop nodes. The following are its important features:

  • Managing and configuring nodes
  • Implement reliable messaging
  • Implement redundant services
  • Synchronize process execution

Further, if you want to start a career in Big Data and learn Hadoop and other Big Data technologies, I would recommend going through structured training on Big Data and Hadoop. With structured training, you will find it much easier to land your dream job in Hadoop.

Hope this post helps you. If you have any questions you can comment below and I will be glad to help you out.

How Hadoop Training benefits Java Developers

What is Hadoop?

Hadoop is an efficient open source framework that allows the distributed processing of large data sets using simple programming models. This open source platform is capable of enormous processing and can control a virtually limitless number of concurrent jobs.

Hadoop Training for Java Developers:

  1. For Java developers, the Hadoop framework is easy to learn, as it is written entirely in Java; for professionals planning to switch from Java to Hadoop it is the easiest option, because Hadoop MapReduce jobs are written in Java.
  2. By switching your career from Java to Hadoop, you can expect highly lucrative packages and a secure career.
  3. Advanced knowledge of Hadoop and Big Data can increase your chances of working with top companies such as IBM, Facebook, Twitter, Yahoo and Amazon.
  4. To stay ahead of the competition and work with top MNCs, you only need to attend a Hadoop training program and learn all the concepts from trained experts.
  5. Learning Hadoop can be beneficial for professionals, as it will help you deal with complex projects and improve your quality of work.
  6. The Hadoop market is likely to grow at a compound annual growth rate (CAGR) of 43.4%, exceeding $40.69 billion by 2021, so professionals have a great opportunity to switch from Java to Hadoop and upgrade their skills to grow with the industry.


Why Hadoop Training is Important?

Advanced Hadoop training from a reputed and authorized training provider is a must to get started. Once you complete the training and earn a Hadoop certification, you can apply for various profiles such as Hadoop Analyst, Hadoop Administrator, Hadoop Architect and Hadoop Developer.

A Hadoop Administrator must possess knowledge of Linux, Java and database management, along with in-depth programming and algorithm skills for data processing. Hadoop Architects have to be experts in Hadoop MapReduce programming, HBase, Pig, Hive and Java. Hadoop Developers should know SQL and Core Java to get started creating Big Data Hadoop solutions. Hadoop Analysts must understand data analysis software such as R, SAS, SPSS, etc.

When you deal with Big Data, the following challenges occur:

  1. A lot of time is spent on query building.
  2. Tracking and sorting large data volumes requires a lot of manual intervention.
  3. The capital investment in purchasing a server with extremely high processing capacity is substantial.

Hadoop Advantage

  1. Hadoop is compatible with all platforms, which is the main reason it is so popular among government organizations, online marketing agencies, top multinational companies and financial services.
  2. Hadoop library has been designed to detect and handle failures.
  3. Hadoop can operate without any interruption, even if you add or remove server from the cluster.
  4. Hadoop framework is the quickest and smartest way to store and process huge volumes of data.

Importance of Hadoop Training

In the finance and banking sector, Hadoop can find solutions that ease tasks and expand efficiency. The healthcare industry holds massive amounts of data related to patient files and financial and clinical records. Hadoop has an important role in the sports industry too, as every team exploits Big Data for health, player fitness and game analysis. In the retail sector, Hadoop helps connect with customers and makes buying more convenient.

Hadoop is also widely used in industries such as call centers, IT analytics and social media, and opting for Big Data Hadoop training from EduPristine could be the best way to lift your Hadoop career. This 60-hour Hadoop classroom training is designed to build a career in Big Data analytics using the Hadoop framework. EduPristine is the only institute that provides a complimentary "Java Essentials for Hadoop" course to all candidates who enroll for the Hadoop training. Once you complete your Hadoop training, you will find that staying updated with the latest applications is a lot easier, and getting into the biggest Silicon Valley companies will never be just 'a dream'.


You might be wondering how Facebook designed Hive. The Hive data warehouse plays an important role in the Hadoop ecosystem, and the latest technologies integrate with it: Impala, Drill and Spark SQL all work on Hive. Once you read this post, you will know how Hive queries are actually mapped to MapReduce.

Sharing knowledge is very important and in an attempt to do it, we are documenting our observations, our research in Hive to MapReduce Mapping.

Note: Readers are expected to have some knowledge of MapReduce and HQL.

Not everyone can learn Java and MapReduce. What next?

In companies like Facebook, where data storage runs into petabytes, analysis of this data is the utmost priority. As everyone knows, Hadoop is known for scalable storage of data and parallel processing that delivers results with low latency.

Now the question is: should Facebook train everyone on MapReduce? To learn MapReduce, one should learn a programming language; the most preferred languages are Python and Java. Learning Java and MapReduce is time consuming and difficult, as it involves understanding key-value pairs and their importance in parallel computing. Let's quickly go through the different stages in MapReduce:

Input format: Creates input splits and divides them into records.

Input split: The input is divided into fixed-size pieces called input splits; each split is processed by a single map task.

Record reader: Generates key-value pairs (key, value) for each record in the split.

Map phase

Mapper: Generates the key-value pairs required for a given problem statement.

Combiner: Performs local aggregation; known as the mini reducer.

Partitioner: Partitions the key-value space across reducers.

Reduce Phase

Shuffle: Performs a grouping operation on keys and gathers the corresponding values.

Sort: Sorts keys, in ascending order by default.

Reduce: Performs aggregation.

Record writer: Writes the results back to HDFS.

Map Reduce Data Flow diagram:


Now, as a programmer, after understanding the different stages of MapReduce, the next step is to learn how to choose key-value pairs to solve a given business problem. Once you are well versed in the MapReduce stages and in choosing key-value pairs, the next step is to learn Java and the MapReduce API.

Facebook primarily deals with data related to customer interactions, check-ins, ad clicks, etc. It becomes very important to draw insights from this data as quickly as possible, and writing a program for every quick insight or report is not a workable option.
Suppose Facebook wants to know the number of customers from India who like the page Reebok. Let us try to solve this using MapReduce, assuming the data is on HDFS.

Input key-value format     : TextInputFormat
Mapper I/P key, value      : offset address, line
Filter condition in mapper : page == Reebok
Mapper O/P key, value      : Reebok, 1
Combiner O/P key, value    : Reebok, sum(<1,1,…,1>)
Shuffle                    : Reebok, <5,7,…>
Reducer                    : Reebok, sum(<5,7,…>)
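The flow above can be simulated in miniature with plain Python. This is a teaching sketch, not Hadoop code; the sample records and the two-split layout are invented:

```python
from collections import defaultdict

# Toy input: (country, page) like-records, pre-split across two mapper inputs
splits = [
    [("India", "Reebok"), ("India", "Nike"), ("India", "Reebok")],
    [("India", "Reebok"), ("US", "Reebok"), ("India", "Reebok")],
]

def mapper(records):
    # Filter condition: page == Reebok and country == India; emit (key, 1)
    return [("Reebok", 1) for country, page in records
            if page == "Reebok" and country == "India"]

def combiner(pairs):
    # Local aggregation on each split: the "mini reducer"
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def shuffle(all_pairs):
    # Group values by key across all mapper outputs
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return grouped

def reducer(grouped):
    return {key: sum(values) for key, values in grouped.items()}

combined = [pair for split in splits for pair in combiner(mapper(split))]
print(reducer(shuffle(combined)))  # {'Reebok': 4}
```

Each split yields a partial count from its combiner; shuffle brings the partial counts for the key Reebok together, and reduce sums them.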

The steps mentioned above should be coded inside a Java program, which has three parts:

Driver class: Contains the properties and settings for the MapReduce program.

Map class: Contains the logic for the map and the filter condition.

Reduce class: Contains the logic for aggregation.

This clearly looks complicated. Say there are 100 employees in Facebook, and daily they perform 30 similar tasks like this. Are we expecting them to write that many programs to solve such business cases?

Let us estimate how many man-hours would be needed to solve 500 different business cases:

Assuming each business case takes 2 hours to code in MapReduce, that is 1,000 man-hours. If this has to be done in one week of 40 hours, Facebook would need 25 expert programmers in MapReduce and Java. Facebook already has employees who are well versed in data analysis tools, not programming. Should Facebook make them learn MapReduce and start coding?

What can be the answer? Let's crack it!

Facebook nailed it. Now is the time!

Now you know that not every employee of Facebook can be trained on MapReduce and Java or some other programming language. Facebook's goal was to create a tool that would make analysts' and programmers' lives simple, letting them start analyzing Big Data with zero knowledge of MapReduce.

Let us understand how Facebook designed Hive and mapped it to MapReduce.

Let us work on the following sample dataset:


The data set consists of 4 fields, related to the sales domain:

Custid : ID of the customer.
Region : Region the customer visited.
Sales  : Value of the transaction.
Gender : Gender of the customer.

Let us do some data analysis, with the following assumptions:

  • The data is structured.
  • It is distributed across two machines.

Now I want to find each customer's total sales. How would I do this with MapReduce? In MapReduce's distributed parallel programming model, choosing the keys and values is what matters most.

In the map, I will select custid as the key and sales as the value.

In reduce, the sum of sales is calculated for each key.
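These map, shuffle and reduce steps can be sketched in plain Python (invented sample rows, not actual Hadoop code):

```python
from collections import defaultdict

# Sample (custid, region, sales, gender) rows; values invented for illustration
rows = [(101, "East", 50, "M"), (102, "West", 40, "F"),
        (101, "West", 70, "M"), (102, "East", 60, "F")]

# Map: emit custid as the key, sales as the value
pairs = [(custid, sales) for custid, region, sales, gender in rows]

# Shuffle: group the sales values by custid
grouped = defaultdict(list)
for custid, sales in pairs:
    grouped[custid].append(sales)

# Reduce: sum the grouped sales for each key
totals = {custid: sum(sales) for custid, sales in grouped.items()}
print(totals)  # {101: 120, 102: 100}
```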


Now, how did Facebook convert Hive to a MapReduce program? How did they map a query to key-value pairs? Is it actually possible? If so, let us understand how it happened. Let's crack the secret sauce!

The query in Hive will be: select custid, sum(sales) from table group by custid


Whoa! It seems that whatever goes into the select statement is coded in the mapper, grouping happens in shuffle, and the sum happens in reduce. Let us dig deeper.

In the mapper, custid is the key and sales is the value. In the Hive query, custid and sales are selected.

In shuffle, for each custid, the sales are grouped and brought to one place. In the query, the same task is performed by group by custid.

In reduce, a sum operation is performed on the list of sales for each custid. In the query, the sum keyword does the same thing.

To summarize, Facebook automated this with an application, Hive, which reads the query, works out the key-value pairs, and converts it into a MapReduce program.

Selection in the query happens in the mapper, grouping happens in shuffle, and aggregation happens in reduce.

Now tell me, would you prefer writing a Hive query, or programs in MapReduce? Facebook wanted its analysts to understand data and produce quick insights. It did not want them to learn MapReduce, understand the key-value pair concept and implement it in some programming language.

Data Analysts, Be Happy!

90% of real-world business problems can be solved with HQL on Hive. Let's look at more complex queries, map them to MapReduce, and uncover the hidden secrets of Hive.

Now your reporting team has a requirement to generate reports to understand customer spending across different regions.

Region-wise customer spending:




Select region, custid, sum(sales) from table group by region, custid. This query gives us customer spending across different regions.


Mapping HQL to Map Reduce:

Mapper: key: region, custid; value: sales. In query: select region, custid, sales.
Shuffle: key: region, custid; value: grouped sales. In query: group by region, custid.
Reduce: key: region, custid; value: sum(grouped sales). In query: sum(sales).
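The only change from single-key grouping is that the key is now the composite (region, custid) pair, which a Python sketch can represent as a tuple (invented sample data, not Hadoop code):

```python
from collections import defaultdict

# Sample (region, custid, sales) rows; values invented for illustration
rows = [("East", 101, 50), ("West", 102, 40),
        ("East", 101, 70), ("East", 102, 60)]

# Map + shuffle: the composite (region, custid) tuple is the key, sales the value
grouped = defaultdict(list)
for region, custid, sales in rows:
    grouped[(region, custid)].append(sales)

# Reduce: one total per (region, custid) pair
spend = {key: sum(vals) for key, vals in grouped.items()}
print(spend)  # {('East', 101): 120, ('West', 102): 40, ('East', 102): 60}
```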

There are two key takeaways:

  • If you are a MapReduce developer trying to figure out which key-value pairs to select, just write a query: whatever goes into the selection will form your key-value pairs.
  • If you are a data analyst trying to figure out how to analyze Big Data, just learn Hive. It will write the MapReduce programs for you.

Now let us solve a more complex problem in MapReduce:

Your finance team is looking for the most valuable customers. In this case, sort the customers in descending order based on their spending. How can we do this in MapReduce and in Hive?

In MapReduce, we will have two stages. Stage 1 computes each customer's total spending; stage 2 sorts on spending.

This is actually complicated. The output of Stage 1 is sent to another MapReduce program, and in MapReduce, sorting happens only on keys. So, to sort on spending, the key-value pairs from Stage 1 are flipped in the Mapper of Stage 2. This way, sorting happens on spending; then, in the Reduce, the key-value pairs are flipped back and persisted to disk.

Let's see diagrammatically how this happens:

[Figure: MapReduce phases]

In Hive, a simple query gets the expected results. If you were to do the same in MapReduce, you would have to write two MapReduce programs.

Mapping HQL to MapReduce:

Phase 1:

Mapper: key: custid, value: sales. In the query: select custid, sales.
Shuffle: key: custid, value: grouped sales. In the query: group by custid.
Reduce: key: custid, value: sum(grouped sales). In the query: sum(sales).


Phase 2:

Mapper output: key: total sales, value: custid (key-value pairs flipped).
Sort output: key: total sales, value: custid. In the query: order by sum(sales).
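As a sketch, the two stages can be simulated in Python on toy data (Stage 1's shuffle and reduce are collapsed into a running total for brevity):

```python
from collections import defaultdict

rows = [(1, 100), (1, 50), (2, 30), (3, 200)]  # (custid, sales)

# Stage 1: map/shuffle/reduce to get each customer's total spending.
totals = defaultdict(int)
for cust, sales in rows:
    totals[cust] += sales

# Stage 2 mapper: flip to (total, custid) -- MapReduce sorts only on keys,
# so the value we want to sort on must become the key.
flipped = [(total, cust) for cust, total in totals.items()]
flipped.sort(reverse=True)  # descending: most valuable customers first

# Stage 2 reduce: flip back to (custid, total) before persisting.
ranked = [(cust, total) for total, cust in flipped]
print(ranked)  # [(3, 200), (1, 150), (2, 30)]
```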

To summarize: the Hive compiler decodes the query, figures out the key-value pairs, understands how many stages of MapReduce are needed, and generates a MapReduce program with two stages.

Hive's HQL is a gift to the world of Big Data analysts. Simple querying, quicker insights and results.

Data Analysis has become simple!

Now your marketing team wants to identify customers who can be offered platinum credit cards. To do this, it requested the analytics team to give the list of customers whose total spending is more than 150.

Query: Select custid, sum(sales) from table group by custid having sum(sales)>150

Mapper: key: custid and value: sales.
Shuffle: key: custid and value: grouped sales.
Reduce: key: custid and value: sum(sales), kept only where sum(sales) > 150.

[Figure: MapReduce sorting]

Mapping HQL to MapReduce:

Selection of custid and sales in the query happens in the Mapper of MapReduce.

Group by custid in the query happens in the Shuffle of MapReduce.

Sum(sales) and having sum(sales) > 150 in the query happen in the Reduce phase of MapReduce.
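The having-clause job can be sketched in Python on toy data (the threshold of 150 comes from the query above):

```python
from collections import defaultdict

rows = [(1, 100), (1, 60), (2, 30), (3, 200)]  # (custid, sales)

# Mapper + shuffle: group sales by custid.
groups = defaultdict(list)
for cust, sales in rows:
    groups[cust].append(sales)

# Reduce: aggregate, then apply "having sum(sales) > 150" to each group.
result = {}
for cust, values in groups.items():
    total = sum(values)
    if total > 150:  # the having clause runs here, after aggregation
        result[cust] = total
print(result)  # {1: 160, 3: 200}
```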

Now decide: would you still like to write MapReduce programs for simple data analysis?

If yes, let me convince you with one final example.

Now I want to find customer spending, but I am not interested in region C.

Query: Select custid, sum(sales) from table where region <> 'C' group by custid

Mapper: key: custid and value: sales. Here, only if the region is not C are the key-value pairs sent to the next phase.
Shuffle: key: custid and value: grouped sales.
Reduce: key: custid and value: sum(sales).

[Figure: mapper and reducer]

Mapping HQL to MapReduce:

Selection of custid and sales, and the where clause in the query, happen in the Mapper of MapReduce.

Group by custid in the query happens in the Shuffle of MapReduce.

Sum(sales) in the query happens in the Reduce phase of MapReduce.
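The filter-in-the-mapper behaviour can be sketched in Python on toy data:

```python
from collections import defaultdict

rows = [("A", 1, 100), ("C", 1, 50), ("B", 2, 30), ("C", 3, 200)]

# Mapper: the "where region <> 'C'" filter runs here, so filtered rows
# never even reach the shuffle.
mapped = [(cust, sales) for region, cust, sales in rows if region != "C"]

# Shuffle: group by custid.
groups = defaultdict(list)
for cust, sales in mapped:
    groups[cust].append(sales)

# Reduce: sum each group.
result = {cust: sum(values) for cust, values in groups.items()}
print(result)  # {1: 100, 2: 30}
```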

Now I am pretty sure you are convinced of how Facebook created the tool Hive, and how Hive converts a query into a MapReduce program.


Facebook designed Hive to analyze data and bring out insights quickly, rather than spending more time writing programs over the data.

Key points:

Let us map the functionalities of Hive to Map Reduce.

[Table: query clauses mapped to MapReduce phases]

  • Selection in the query happens in the Mapper.
  • The where condition happens in the Mapper.
  • Grouping happens in the Shuffle stage.
  • Case statements and if statements happen in the Mapper.
  • Functions like UPPER and LOWER happen in the Mapper.
  • Ordering, ascending or descending, happens in the Sort phase.
  • Group-level aggregations like average, sum, max and min happen in the Reduce phase.
  • The having clause in the query happens in the Reduce phase.

This is how Facebook designed Hive, which converts queries into MapReduce.

For data analysis, Hive is the most preferred tool, as it has data-warehousing capabilities, and at the same time HQL makes our life easier.

On the other hand, it does not work for complex data analysis like machine learning, or for highly unstructured data analysis like multimedia datasets.

Now you know how Facebook designed Hive on top of MapReduce!

Share knowledge, happy learning!!

Problems of Small Data and How to Handle Them

You've heard enough about Big Data. It's a fad that is rising and fading simultaneously, depending on whom you talk to! While the need for Big Data tools and skills continues to rise, the buzz around Big Data is either a paradigm shift or over-hype. So it may come as a surprise that not all data is Big Data. In fields such as medicine, sociology, psychology, and geology, small samples are not rare occurrences but the norm. Most experiments involving primary research with real people will have small data due to the sheer cost of conducting in-person interviews. Sometimes the population from which a sample is drawn may itself be small to begin with, say, the number of countries in the world, or the number of exoplanets discovered in the Universe.

In times such as these, the nuances of statistics come in handy. While an expert data scientist is generally familiar with the algorithms behind modeling techniques, he or she is often not familiar with, or doesn't give importance to, the assumptions behind those techniques. Some people believe data science to be statistics without checking for assumptions, and that is true for the most part: using a repertoire of tools and algorithms, performance validation on test data, and an ensemble approach to counter the high-variance problem, most models do a good job. As they say, the proof of the pudding is in the eating; so as long as a model makes accurate predictions on unseen data, it does its job. Except when it doesn't.

Problems of small data

So what are the problems if you have small data? Surely, you can use the same methods and models. Well, small data exacerbates certain issues, like…

  • Outliers – Outlier handling is important for many models, but can be lived with if the proportion of outliers is small. This is obviously not the case with small data, since even a few outliers will form a large proportion and significantly alter the model.
  • Train and Test Data – A good design choice in model training is to split off the data on which the model is trained ("train data") and report generalized performance on unseen data ("test data" or "holdout data"). If holdout data is also used for tuning model parameters (sometimes called "cross-validation data"), you may need to split all observations into three sets. With small data, one doesn't have the luxury of keeping out many samples, and even when one does so, the number of observations in the test data may be too few to give a meaningful performance estimate, and/or the number of observations in the cross-validation data may be too few to guide the parameter search optimally.
  • Overfitting – If your training dataset itself is small, overfitting is more likely to occur. And using cross-validation to reduce overfitting carries the risk mentioned above.
  • Measurement Errors – Each metric, whether a predictor or the target, is measured in the real world and has an associated measurement error. At small scale, the effects of such errors become important and affect the model adversely.
  • Missing Values – Missing values in data have an effect in a similar direction as measurement errors, but perhaps larger in magnitude. A limited number of observations means that imputing missing values can be difficult. Further, if the target has missing values, then the whole observation may have to be dropped, which is not desirable in such cases.
  • Sampling Bias – The problem with small data can be worse if the data is biased and not sampled randomly from the population. This is often a problem in sociology research, if not controlled for in the design, where test subjects are often people in the same circle or environment as the researcher, say, undergraduate students of the college where the researcher teaches.
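To make the train/test tension concrete, here is a small pure-Python sketch of leave-one-out cross-validation on a toy one-feature regression; with only five points, each held-out "test set" is a single observation, which is exactly why performance estimates from small data are noisy. All numbers and names here are illustrative.

```python
# Toy data: five (x, y) observations, roughly y = x with noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]

def fit_slope(x, y):
    # Least-squares slope for a through-origin model: b = sum(x*y) / sum(x*x).
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Leave-one-out: train on n-1 points, test on the single held-out point.
squared_errors = []
for i in range(len(xs)):
    train_x = xs[:i] + xs[i + 1:]
    train_y = ys[:i] + ys[i + 1:]
    b = fit_slope(train_x, train_y)
    squared_errors.append((ys[i] - b * xs[i]) ** 2)

mse = sum(squared_errors) / len(squared_errors)
print(round(mse, 4))
```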

How to handle them

  • Data review – Since abnormal data values impact predictive capacity more for small data, spend time reviewing, cleaning, and managing your data. This means detecting outliers, imputing missing values or deciding how to use them, and understanding the impact of measurement errors.
  • Simpler models – The fewer the degrees of freedom compared to the number of training observations, the more robust the parameter estimates. Prefer simpler models when possible and limit the number of parameters to be estimated. This means going for Logistic Regression rather than a Neural Network, or k-Nearest Neighbours rather than Regression Splines. Use simplifying assumptions, such as those that favour Linear Discriminant Analysis over Quadratic Discriminant Analysis.
  • Domain expertise – Use prior experience and domain expertise to decide on the model form. Small data doesn't offer the luxury of testing different model forms, and hence expert opinion counts more. Use domain knowledge to design features effectively and do feature selection. We cannot afford to throw all possible features into the mix and let the model figure out the right set.
  • Consortium approach – Build and grow your data over time and across sectors. Even small data adds up over time. Using slightly unrelated data to increase the number of observations, and then subtracting the impact of the unrelatedness mathematically, can still produce better-performing models. For example, use Panel Regression instead of separate Linear Regressions for different groups within the data.
  • Ensemble approach – Build multiple simple models rather than one best model, and use a bagging or stacking approach. Ensemble models tend to reduce overfitting without increasing the number of parameters to be estimated.
  • No cross-validation data – This extends the idea of simpler models. Don't over-use cross-validation data for hyper-parameter optimization. If the number of observations is really small, do not hold out cross-validation data at all.
  • Regularization – Regularization is a way to produce more robust parameter estimates and is very useful in the small-data space. While regularization does add one more parameter to the modeling process, this increase is often worthwhile. Lasso or L1 regularization produces fewer non-zero parameters and indirectly does feature selection. Ridge or L2 regularization produces smaller (in absolute value), more conservative coefficient estimates.
  • Confidence intervals – Try to predict with a margin of error rather than point estimates. Models on small data will have large confidence intervals, but it is better to be aware of the range when making actionable decisions on predictions than not to know it.
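As a sketch of the L2 point above, here is a one-feature ridge regression in closed form; the coefficient shrinks toward zero as the penalty grows, which is the "smaller, conservative estimates" behaviour described. The data and penalty values are illustrative.

```python
# One-feature, no-intercept ridge regression has the closed form
#   b = sum(x*y) / (sum(x*x) + lam)
# so the effect of the L2 penalty lam is easy to see directly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalized slope is 2.0

def ridge_slope(x, y, lam):
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
# lam = 0.0 recovers the slope 2.0; larger lam shrinks the estimate toward zero
```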

A good experiment design is important if one expects the sample size to be small. Data scientists should get involved from the data-gathering step itself, to ensure that the data is not biased, not missing, and is representative of the population. This is surely better than trying to make do with small and unclean data later on.

So how small is small data? There is probably no clear boundary, but you would know it if you saw it. I would hazard a guess that fewer than a couple of thousand observations is small in most cases, although the number depends on the dimensions of the data and the problem being attempted.


What You Should Know About Big Data as an Entrepreneur

Data is king in modern business. The more information entrepreneurs can get, accurately analyze, and use effectively, the better their chances of success. The key is not to become overwhelmed by the sheer volume of big data that's available today. That volume can leave some entrepreneurs paralyzed by fear, doubt, and indecision, so that they fail to take effective and timely action. Use the data to your advantage. Let it inform your decisions. If your research and instincts say success is possible, then forge ahead. Don't research a subject to death and miss great opportunities. Creating a marketing dashboard can help.


1. Use Big Data To Understand The Customer’s Problem.

Successful entrepreneurs create solutions to the problems plaguing large numbers of people. Begin by doing research on the problem and the potential solutions. Weigh the value of the solution you are proposing using the data you've collected. Developing personas for potential customers, based on the data you've generated, can help you better understand your targeted customers. Look at the common questions they have and the decisions they make. Doing this early in the process can save you time and money.

You should document the results you gather and use the insight you gain to identify the pertinent data sources and prioritize them. Those data sources should help you capture valuable, usable information about the needs, products, and operations of your potential customers. You can then prioritize the data based on its business value and its ability to ease the implementation of your business plan. Deciding how much of the data you receive is important and which parts are trivial often makes the difference between wasted time and efficient marketing. Your dashboard should provide a clear, concise, easy-to-use formula for doing this.

2. Use Data to Understand the Customer’s Environment and how Your Product Fits Into It

Many companies make large investments in their technology and data environments. They loathe wasting that investment. If you can show them how the solution you offer can help them leverage their investment in existing data and technology, then these companies are more likely to look at what you have to offer.  The investments companies make in data, analytics, dashboard tools, reports, and SQL are considered valued strategic organizational assets. Your ability to show potential customers ways to more efficiently and effectively use their existing big data technologies, capabilities, and products can lead to your success.

If you can show companies how spending a small amount today can improve the return on the investments they have already made in data warehousing and business intelligence exponentially, they will beat a path to your door.  Using big data to help your potential customers to better leverage their investment is a winning strategy which can help you to find success as you grow your business.  Helping customers add value to previously made investments makes them feel wise and can make both of you money.

3. Use Big Data To Identify And Leverage Cloud And Open Source Technologies.

There are many free, scalable open source technologies that can help organizations generate, gather, and use big data to expedite product development and the speed with which products are released into the marketplace and begin to generate profits. These include technologies like Hadoop, Spark, YARN, Mahout, HBase, Hive, and R.  Savvy entrepreneurs are able to use these tools to create unique, compelling solutions to common business problems and differentiate themselves in the marketplace.

If entrepreneurs leverage cloud and open sources products for their development environment, they can use big data to help create initial prototypes and get them to market in a flash.  Entrepreneurs should also heavily instrument their products.  That will give them details on the means and methods customers are attempting to use with their products.  The entrepreneur must then use the data generated from this instrumentation to quickly learn, evolve, and improve.  In modern business an entrepreneur’s ability to use big data to improve speed, innovative ideas, and customer service is a surefire path to success.

4. Big Data Can Help Create Short Payback ROI.

Successful entrepreneurs understand how to assist organizations in finding innovative, cost-effective new methods to monetize their analytic assets and data. Focusing on providing products and solutions business stakeholders can utilize to optimize their essential business processes is a pathway to success. Entrepreneurs that create and develop a good ROI for their products and understand how to use it as compelling evidence to encourage their customers to try and subsequently buy their product will do well in the marketplace.

The ability to generate, analyze, and utilize big data effectively is key in modern business.  Entrepreneurs that can empower the front-line employees of their customers to monetize data and analytic assets and improve ROI are destined for success.