April 2, 2015
Most of us might have already heard of the history of Hadoop and how Hadoop is being used in more and more organizations today for batch processing of large sets of data. There are many components in the Hadoop eco-system, each serving a definite purpose. For e.g. HDFS is the storage layer and a File System, Map-Reduce is a programming model and an execution framework. Now talking about Hive, it is another component in Hadoop stack; which was built by Facebook and contributed back to the community.
A little history about Apache Hive will be a nice story and will help you understand why it came into existence. When Facebook started gathering data and ingesting it into Hadoop, the data was incoming at the rate of 10s of GBs/day around 2006. Then in 2007, it grew it 1 TB/day and in few years increased to around 15 TBs/day. Initially, they had written Python scripts to ingest data in Oracle databases, but with the growth in data rate and also the sources/types of data incoming, this was getting difficult. The Oracle instances were getting filled pretty fast and it was time to get a new kind of system that handled large amounts of data. So, they built Hive so most of the people who had SQL skills could use the new system with minimal changes compared to other RDBMs.
Hive is a data warehouse implementation on top of HDFS. It provides a SQL like dialect to interact with data. It is most close to the MySQL syntax, however it is not a complete database. It does not conform to ANSI SQL standard. Also, it is not OLTP ready yet. It can be used as an OLAP system. One of the disadvantages is there are no provisions for row level Insert, Update or Delete. Hive works on the files stored on HDFS. The data is exposed via tables in Hive. It is more suited for large data warehouse like operation for analytics, report generation and mining.
Hive can be used by the Database Administrators, architects and analysts without them being required to learn anything new. Architects who have been designing databases and tables, can do all of those and now on even more data. Hive supports the following as part of its data model; tables, partitions and buckets. Analysts will be productive without having to learn a lot of new stuffs, by writing queries similar to SQL. Many organizations have started using HIVE as part of the ETL and have discontinued the older systems.
Image source: http://www.stratapps.net/images/HiveComponents.jpg
The architecture of Hive is as follows. We will progress with the assumption that we are having knowledge of Hadoop working and architecture. There are 3 components in Hive; a Client, Driver and Metastore. The client can be a CLI (command line interface), or an application connecting via JDBC/ODBC, which in turn goes through Thrift server, or Web based GUI like HWI (Hive Web Interface) or Hue. The Driver is made up of 3 sub-modules; Compiler, Optimizer and Executor. Hive queries get converted into Map-Reduce programs, which are then executed on Hadoop. The task of compiling and executing these jobs are carried out by the Driver. Metastore is like the namespace. It contains all the metadata of databases, tables, columns, partitions and the physical locations of the files that are part of the tables. The metastore is installed on an external RDBMS, MySQL being the popular choice. Hive also supports User Defined Functions (UDF) thus allowing the user to implement any operations and using it on a cell or column.
Some of the differences between RDBMS and Hive are:
• Hive does not support row-level updates, transactions and indexing.
• Hive follows the Schema on Read rule. So, a data load is just copying or moving files and there is no checking or parsing of data, nor any validation.
• Hive works on the concept of Write Once, Read many times. However, RDBMS functions work on Read and Write Many times
In the next article we will discuss setting up HIVE and using hive to query data.
The pre-requisite would be to install Hadoop on your Linux box. HDFS and Map Reduce should be running.