June 3, 2015
In the previous posts we learned about Hive as a Data Warehouse on top of HDFS data. We looked at the basics of creating a database, creating tables, loading data, querying data in the table and viewing the schema structure of the tables. Today, we are going to learn about partitions in Hive. What is meant by partitioning in table, how to create partitions and why partitions are useful and recommended? So fasten your seat belts and let the journey begin on Hive partitions.
Data in HDFS is stored in huge volumes and in the order of Tera Bytes and Peta Bytes. Now all of this data is stored under some folders but may not be organized. If the data is stored in some random order under different folders then accessing data can be slower. Hive partitions work with the concept of creating a different folder for each partition. This means, for each column value of the partitioned column, there will be a separate folder under the table’s location in HDFS. Also, data for the column which is chosen for partition will not be present as part of the files. The value is directly referred to from folder name. Advantage is, there isn’t repetition of values for n-number of rows or records, thereby saving a little space from each partition. However, the most important use of partitioning the table is faster querying. A partitioned table will return the results faster compared to non-partitioned tables, and especially when the column beings queried for on condition are the partitioned ones. Hive will go and search only those folders where the column value matches the folder name. This implies, that it will ignore other folders and hence, the data to be read is relatively lot lesser.
Consider the geographical hierarchy of India. At the top is country, India. India is made up of so many states, 29 to be precise with some Union territories. Let us take only states into consideration for now. Each state has cities and towns. All the states and cities are identified by name. Now, if we wanted to search for Mumbai, we will look into state Maharashtra. When we specify this state column as part of query, then Hive will look only into Maharashtra folder and search for Mumbai city. It will ignore all other 28 states. Now, isn’t this a performance optimization and faster results retrieval. But the important aspect to consider is to design properly before creating a table. If you have a partition for each city and not much info within each city and a lot of cities, then your table design may not be the most appropriate one. This will lead to too many folders being created with city name as value and in turn will increase load on Name Node, thereby affecting its performance too.
Graphically, we can represent the hierarchy as follows. The directories which store the data for partitioned columns will be in tree structure, like most Operating System arrange the folder. Here are graphical representations of both.
The first image describes how we can visually structure the hierarchy from Country -> States -> City. Note that the cities are just entities here and not actual folder. The data belonging to various cities can be in same file or spread across different files. We will look at how to organize cities into specific files in a post later when we discuss about bucketing.
The second image depicts a tree structure for the table country. The states are the folder names here, and each city will be placed in its corresponding folder according to which state it belongs to. The folder names will be slightly different, and we are going to see this in next post.
In the next post, we will be practically implementing the partitioned table in Hive. We will look at loading data into partitioned tables, how the folders are organized and querying partitioned tables.