Introduction to Hive – A Data Warehouse on top of Hadoop

April 2, 2015
, , ,

Most of us might have already heard of the history of Hadoop and how Hadoop is being used in more and more organizations today for batch processing of large sets of data. There are many components in the Hadoop eco-system, each serving a definite purpose. For e.g. HDFS is the storage layer and a File System, Map-Reduce is a programming model and an execution framework. Now talking about Hive, it is another component in Hadoop stack; which was built by Facebook and contributed back to the community.

A little history about Apache Hive will be a nice story and will help you understand why it came into existence. When Facebook started gathering data and ingesting it into Hadoop, the data was incoming at the rate of 10s of GBs/day around 2006. Then in 2007, it grew it 1 TB/day and in few years increased to around 15 TBs/day. Initially, they had written Python scripts to ingest data in Oracle databases, but with the growth in data rate and also the sources/types of data incoming, this was getting difficult. The Oracle instances were getting filled pretty fast and it was time to get a new kind of system that handled large amounts of data. So, they built Hive so most of the people who had SQL skills could use the new system with minimal changes compared to other RDBMs.

Hive is a data warehouse implementation on top of HDFS. It provides a SQL like dialect to interact with data. It is most close to the MySQL syntax, however it is not a complete database. It does not conform to ANSI SQL standard. Also, it is not OLTP ready yet. It can be used as an OLAP system. One of the disadvantages is there are no provisions for row level Insert, Update or Delete. Hive works on the files stored on HDFS. The data is exposed via tables in Hive. It is more suited for large data warehouse like operation for analytics, report generation and mining.

Hive can be used by the Database Administrators, architects and analysts without them being required to learn anything new. Architects who have been designing databases and tables, can do all of those and now on even more data. Hive supports the following as part of its data model; tables, partitions and buckets. Analysts will be productive without having to learn a lot of new stuffs, by writing queries similar to SQL. Many organizations have started using HIVE as part of the ETL and have discontinued the older systems.

Image source:

The architecture of Hive is as follows. We will progress with the assumption that we are having knowledge of Hadoop working and architecture. There are 3 components in Hive; a Client, Driver and Metastore. The client can be a CLI (command line interface), or an application connecting via JDBC/ODBC, which in turn goes through Thrift server, or Web based GUI like HWI (Hive Web Interface) or Hue. The Driver is made up of 3 sub-modules; Compiler, Optimizer and Executor. Hive queries get converted into Map-Reduce programs, which are then executed on Hadoop. The task of compiling and executing these jobs are carried out by the Driver. Metastore is like the namespace. It contains all the metadata of databases, tables, columns, partitions and the physical locations of the files that are part of the tables. The metastore is installed on an external RDBMS, MySQL being the popular choice. Hive also supports User Defined Functions (UDF) thus allowing the user to implement any operations and using it on a cell or column.

Some of the differences between RDBMS and Hive are:

• Hive does not support row-level updates, transactions and indexing.

• Hive follows the Schema on Read rule. So, a data load is just copying or moving files and there is no checking or parsing of data, nor any validation.

• Hive works on the concept of Write Once, Read many times. However, RDBMS functions work on Read and Write Many times

In the next article we will discuss setting up HIVE and using hive to query data.

The pre-requisite would be to install Hadoop on your Linux box. HDFS and Map Reduce should be running.


About the Author

Skanda Bhargav is an Engineering graduate from Visvesvaraya Technological University (VTU), Belgaum in Karnataka, India. He did his majors in Computer Science Engineering. He is currently employed with a MNC based out of Bangalore. He is a Cloudera-certified developer in Apache Hadoop. His interests are Big Data and Hadoop.

Popular Courses

Global Association of Risk Professionals, Inc. (GARP®) does not endorse, promote, review or warrant the accuracy of the products or services offered by EduPristine for FRM® related information, nor does it endorse any pass rates claimed by the provider. Further, GARP® is not responsible for any fees or costs paid by the user to EduPristine nor is GARP® responsible for any fees or costs of any person or entity providing any services to EduPristine Study Program. FRM®, GARP® and Global Association of Risk Professionals®, are trademarks owned by the Global Association of Risk Professionals, Inc

CFA® Institute does not endorse, promote, or warrant the accuracy or quality of the products or services offered by EduPristine. CFA® Institute, CFA® Program, CFA® Institute Investment Foundations and Chartered Financial Analyst® are trademarks owned by CFA® Institute.

Utmost care has been taken to ensure that there is no copyright violation or infringement in any of our content. Still, in case you feel that there is any copyright violation of any kind please send a mail to and we will rectify it.

Post ID = 73689