Hive Partitions – Explained with Example

The previous post covered all the concepts related to partitions. In this post, we will practically design and implement a Hive table with partitions. We will use the same data: the geographical hierarchy of India. The table will be named India (for the country), and one of its columns, state, will be the partitioned column. Pay attention to the syntax while defining the partitioned table.

Let me first share the sample data we have for this exercise. We have a dataset from a Government of India website, related to the postal department. It lists post offices along with their type, PIN code, city, state and so on. Following is the schema of the data we have.

OFFICE NAME

OFFICE STATUS

PINCODE

TELEPHONE

TALUK

DISTRICT

STATE

POSTAL DIVISION

POSTAL REGION

POSTAL CIRCLE

There are two ways to load data into a partitioned table; today we will look at the first one. First, we will create a temporary table without partitions and load the data into it. Then, we will create the actual table with partitions and load the data from the temporary table into the partitioned table.

1. Create a database for this exercise

CREATE DATABASE HIVE_PARTITION;

USE HIVE_PARTITION;

2. Create a temporary table

CREATE TABLE temp_India (
OFFICE_NAME       STRING,
OFFICE_STATUS     STRING,
PINCODE           INT,
TELEPHONE         BIGINT,
TALUK             STRING,
DISTRICT          STRING,
STATE             STRING,
POSTAL_DIVISION   STRING,
POSTAL_REGION     STRING,
POSTAL_CIRCLE     STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

3. Load data into temporary table

LOAD DATA LOCAL INPATH '/home/cloudera/All_States_PinCode.csv'
INTO TABLE temp_India;

4. Create partitioned table

CREATE TABLE India (
OFFICE_NAME       STRING,
OFFICE_STATUS     STRING,
PINCODE           INT,
TELEPHONE         BIGINT,
TALUK             STRING,
DISTRICT          STRING,
POSTAL_DIVISION   STRING,
POSTAL_REGION     STRING,
POSTAL_CIRCLE     STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Notice that STATE does not appear in the column list; the partitioned column is declared only in the PARTITIONED BY clause.

5. Instruct Hive to load partitions dynamically

SET hive.exec.dynamic.partition = true;

SET hive.exec.dynamic.partition.mode = nonstrict;
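For contrast, without dynamic partitioning you would have to load each partition with an explicit static partition spec, one INSERT per state. A sketch (the state value here is just an example):

```sql
-- Static-partition load: the partition value is fixed in the statement,
-- so the partition column is left out of the SELECT list.
INSERT OVERWRITE TABLE India PARTITION (STATE = 'Punjab')
SELECT OFFICE_NAME, OFFICE_STATUS, PINCODE, TELEPHONE, TALUK,
       DISTRICT, POSTAL_DIVISION, POSTAL_REGION, POSTAL_CIRCLE
FROM temp_India
WHERE STATE = 'Punjab';
```

With dozens of states this quickly becomes tedious, which is why we enable dynamic partitioning instead.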

6. Import data into partitioned table from temporary table

INSERT OVERWRITE TABLE India PARTITION (STATE)
SELECT
OFFICE_NAME,
OFFICE_STATUS,
PINCODE,
TELEPHONE,
TALUK,
DISTRICT,
POSTAL_DIVISION,
POSTAL_REGION,
POSTAL_CIRCLE,
STATE
FROM temp_India;

Note that STATE comes last in the SELECT list: Hive fills the dynamic partition column from the last column of the SELECT, so it must appear after all the regular columns.

We have successfully loaded data into our partitioned table. You can verify this by checking the HDFS folder for our table under the Hive warehouse: there will be a separate folder for each partition. We can also see the schema of the partitioned table using the following command:

desc formatted india;


To view the partitions for a particular table, use the following command inside Hive:

show partitions india;

The output lists one row per partition, named in the form state=<value>.

One observation we can make is about the names of the partitions. Each partition folder is named with the column name and its value combined: here the column name is state and the values are the various state names, and the two together form the folder name. If you go inside a folder and open the data files, you will not see the state column. This is what we discussed in the previous post: the records themselves do not store the partitioned column's value.
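You can see these folders from within the Hive shell itself. Assuming the default warehouse location (the path may differ on your installation):

```sql
-- Hive passes dfs commands through to HDFS; the warehouse path is an assumption.
dfs -ls /user/hive/warehouse/hive_partition.db/india;
```

Each entry listed will be a folder such as state=Punjab rather than a plain data file.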

Let us now try and run a query on top of the partitioned table. You will observe that there isn’t any difference between the query you issue on a partitioned table and the one on a non-partitioned table.

select count(*) from india where state='Punjab';


If you run the same query on the non-partitioned table, as follows, you will see the same result.

select count(*) from temp_india where state='Punjab';
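If you want to confirm that Hive is actually pruning partitions rather than scanning everything, you can compare the query plans (the exact plan output varies by Hive version):

```sql
-- On the partitioned table the plan should show only the state=Punjab
-- partition being read; on temp_india the whole table is scanned.
EXPLAIN SELECT count(*) FROM india WHERE state = 'Punjab';
EXPLAIN SELECT count(*) FROM temp_india WHERE state = 'Punjab';
```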

In this post we learnt about creating, loading and querying data on a partitioned table using a temporary staging table. In the next post we will learn how to load data directly into a Hive partitioned table without using a temporary staging table.

The dataset for this exercise is available here.

Hive Partitions

In the previous posts we learned about Hive as a data warehouse on top of HDFS data. We looked at the basics of creating a database, creating tables, loading data, querying data in tables and viewing the schema structure of tables. Today, we are going to learn about partitions in Hive: what is meant by partitioning a table, how to create partitions, and why partitions are useful and recommended. So fasten your seat belts and let the journey begin on Hive partitions.

Data in HDFS is stored in huge volumes, in the order of terabytes and petabytes. All of this data is stored under some folders, but may not be organized; if it is stored in some random order under different folders, accessing it can be slow. Hive partitions work on the concept of creating a different folder for each partition. That is, for each value of the partitioned column, there will be a separate folder under the table's location in HDFS. Also, data for the column chosen for partitioning will not be present in the files themselves; the value is inferred directly from the folder name. One advantage is that the value is not repeated across n-number of rows or records, saving a little space in each partition. However, the most important benefit of partitioning is faster querying. A partitioned table returns results faster than a non-partitioned table, especially when the columns being queried in the condition are the partitioned ones: Hive searches only those folders whose names match the column value and ignores the other folders, so the amount of data to be read is much smaller.

Let us now understand this by an example.

Consider the geographical hierarchy of India. At the top is the country, India. India is made up of many states, 29 to be precise, plus some Union Territories. Let us take only states into consideration for now. Each state has cities and towns, and all the states and cities are identified by name. Now, if we wanted to search for Mumbai, we would look in the state Maharashtra. When we specify this state column as part of the query, Hive will look only into the Maharashtra folder and search for the city Mumbai there. It will ignore all the other 28 states. Isn't this a performance optimization and faster result retrieval? But the important aspect is to design properly before creating a table. If you partition by city, with a lot of cities and not much data within each one, your table design may not be the most appropriate: too many folders will be created, one per city value, which in turn increases the load on the NameNode and affects its performance.

Graphically, we can represent the hierarchy as follows. The directories which store the data for partitioned columns are arranged in a tree structure, much like most operating systems arrange folders. Here are graphical representations of both.

graphical representation of partitioned column

graphical representation of partitioned column

The first image describes how we can visually structure the hierarchy from Country -> States -> Cities. Note that the cities are just entities here, not actual folders. The data belonging to the various cities can be in the same file or spread across different files. We will look at how to organize cities into specific files in a later post, when we discuss bucketing.

The second image depicts a tree structure for the table country. The states are the folder names here, and each city's data is placed in the folder of the state it belongs to. The actual folder names will be slightly different, as we will see in the next post.
In the next post, we will practically implement the partitioned table in Hive. We will look at loading data into partitioned tables, how the folders are organized, and querying partitioned tables.
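As a rough sketch of that tree structure (the folder names here are illustrative; the actual names use the column=value form):

```
country/                      <- table folder in the Hive warehouse
├── Karnataka/                <- one folder per state (partition)
│   └── data files with rows for Bengaluru, Mysuru, ...
└── Maharashtra/
    └── data files with rows for Mumbai, Pune, ...
```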

Working with Hive

In the last post we discussed Hive and how it can be used as a data warehouse over the data available in HDFS. In today's post we will talk about writing Hive queries. We will look at creating databases and tables and querying the data. We will also look at loading data into Hive tables.

Hive provides a logical view of HDFS data files in the form of tables. To view the data as tables, we first create a table in Hive. Assuming you already have Hive installed on your machine, let us begin by starting Hive. When you type 'hive' in your shell and press the enter key, the Hive shell will be loaded, with the database named 'default' selected by default. If you want to create a new database, the command is as follows:

hive> create database my_first_db;

Remember to end all statements or commands in Hive with a semicolon (;). This is part of Hive's syntax, which is similar to MySQL syntax but not exactly the same. After creating your first database, let us switch to the newly created database. Use the following command:

hive> use my_first_db;

Now you will be switched to the database ‘my_first_db’. Let us create a table in the new database. The command to create a simple table is as follows:

hive> CREATE TABLE emp_details ( id INT, name STRING )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;


Let us go through the above syntax line by line. The words in ALL CAPS are keywords. Make sure you do not use keywords as table or column names, or the commands will fail. Here we are creating a table named 'emp_details' with two columns, 'id' and 'name'. The data type of column 'id' is integer, represented by INT; for 'name' the data type is text, represented by STRING. The columns and their corresponding data types must be placed in parentheses '( )'. The next three lines tell Hive the delimiters for fields and lines in the file: here our fields are separated by ',' and records are separated by newlines. Check the sample file in the next section and you will be able to relate it to the delimiters. The last line of our create table command tells Hive the file type; here it is stored as a text file.

Create a sample file in your favorite editor as following:

1,Skanda
2,Bhargav
3,Nilesh
4,Abhi


Save this file on your desktop as 'data1'. Let us load the data into the Hive table created above. Use the following command to load the data in file 'data1' into table 'emp_details' in database 'my_first_db':

hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/data1' INTO TABLE my_first_db.emp_details;


The keyword LOCAL says the data is in your local file system, and the path to the file goes inside single quotes; it is always best practice to specify the absolute path. Then we specify the table name. Here we have included the database name along with the table name, separated by a period '.'. If everything went well, you will see an 'OK' message; otherwise you will see a relevant error message.
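Two related variants of this command are worth knowing (standard Hive syntax; the path is just an example): INTO appends to whatever is already in the table, while adding OVERWRITE replaces it.

```sql
-- Appends: running this twice would load the file's rows twice.
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/data1' INTO TABLE my_first_db.emp_details;

-- Replaces the table's existing contents with the file's contents.
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/data1' OVERWRITE INTO TABLE my_first_db.emp_details;
```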

Time to write your first query on the data you just loaded. We will first list all of the data in this table. Then, we will query with a condition.

hive> SELECT * FROM my_first_db.emp_details;

This command will display all the rows in the table 'emp_details'. The next command queries with a specific condition and should return the single row with id 3.

hive> SELECT * FROM my_first_db.emp_details WHERE id=3;


To view all the databases in your Hive environment use the following command:

hive> show databases;

To view all the tables in a database, first switch to that database and then use the following command:

hive> show tables;

To view the schema/metadata of a particular table, use the command:

hive> desc emp_details;

Today we saw how to create a database and a table, and how to load data into the table from local files. In the next post we will see how to load data from HDFS into Hive, and also different types of tables.