The previous post had all the concepts covered related to partitions. In this post, we will practically design and implement a Hive table with partitions. Let us take in consideration the same data. Geographical hierarchy of India. We will consider the table name as India – country, then columns and one of those columns state we will make this as partitioned column. Pay attention to the syntax while defining the partitioned table.
Let me first share the sample data we have for this exercise. We have a dataset from Govt. Of India website, related to the postal department. These are the post office, type, pin code, city, state and so on. Following is the schema of the data we have.
OFFICE NAME | OFFICE STATUS | PINCODE | TELEPHONE | TALUK | DISTRICT | STATE. | POSTAL DIVISION | POSTAL REGION | POSTAL CIRCLE |
There are two ways to load data to a partitioned table, today we will look at the first one. First we will create a temporary table, without partitions. Then load the data into this temporary non-partitioned table. Next, we create the actual table with partitions and load data from temporary table into partitioned table.
1. Create a database for this exercise
CREATE DATABASE HIVE_PARTITION;
USE HIVE_PARTITION;
2. Create a temporary table
CREATE TABLE temp_India (
OFFICE_NAME STRING,
OFFICE_STATUSÂ Â Â Â STRING,
PINCODEÂ Â Â Â Â Â Â Â Â Â INT,
TELEPHONEÂ Â BIGINT,
TALUKÂ Â Â Â Â Â STRING,
DISTRICTÂ Â Â STRING,
STATEÂ Â Â Â Â Â STRING,
POSTAL_DIVISIONÂ Â STRING,
POSTAL_REGIONÂ Â Â Â STRING,
POSTAL_CIRCLEÂ Â Â Â STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
3. Load data into temporary table
LOAD DATA LOCAL INPATH '/home/cloudera/All_States_PinCode.csv' INTO
TABLE temp_India;
4. Create partitioned table
CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUSÂ Â Â Â STRING,
PINCODEÂ Â Â Â Â Â Â Â Â Â INT,
TELEPHONEÂ Â BIGINT,
TALUKÂ Â Â Â Â Â STRING,
DISTRICTÂ Â Â STRING,
POSTAL_DIVISIONÂ Â STRING,
POSTAL_REGIONÂ Â Â Â STRING,
POSTAL_CIRCLEÂ Â Â Â STRING
)
PARTITIONED BY (STATEÂ Â STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
5. Instruct hive to dynamically load partitions
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
6. Import data into partitioned table from temporary table
INSERT OVERWRITE TABLE India PARTITION (STATE)
SELECT
OFFICE_NAMEÂ Â Â Â Â Â ,
OFFICE_STATUSÂ Â Â Â ,
PINCODEÂ Â Â Â Â Â Â Â Â ,
TELEPHONEÂ Â Â Â Â Â Â Â ,
TALUK Â Â Â Â Â Â Â Â Â Â Â ,
DISTRICTÂ Â Â Â Â Â Â Â Â ,
STATE Â Â Â Â Â Â Â Â Â Â Â ,
POSTAL_DIVISIONÂ Â ,
POSTAL_REGIONÂ Â Â Â ,
POSTAL_CIRCLE
FROM temp_India;
We have successfully loaded data into our partitioned table. You can verify this by the following screenshots. One, we check the HDFS folder under the hive warehouse for our table and verify there are folders present for each partition. Also, we can see the schema of the partitioned table using the following command:
desc formatted india;


To view the partitions for a particular table, use the following command inside Hive:
show partitions india;
Output would be similar to the following screenshot.

One of the observations we can make is the name of the partitions. The partitions will be named along with column name. Meaning, here we have the column name as state and value of column name are the various state names. Both column + form combined together are the folder names. And if you go inside the folder and open the data files, you will not see the state column. This is the same we discussed in the previous post, where records themselves will not have column value stored.
Let us now try and run a query on top of the partitioned table. You will observe that there isn’t any difference between the query you issue on a partitioned table and the one on a non-partitioned table.
select count(*) from india where state=’Punjab’;

If you run the same query on non-partitioned table, as follows you will see the same result.
select count(*) from temp_india where state=’Punjab’;
In this section we learnt about creating, loading and querying data on partitioned table using a temporary staging table. In the next post we will learn on how to load data directly into Hive partitioned without using a temporary staging hive table.
The dataset for this exercise is available here.
Related links you will like:
Comments