347 647 9001+1 714 797 8196Request a Call
Call Me

Introduction of Hadoop Distributed File System (HDFS)

April 6, 2015
, , , ,

Hadoop Distributed File System (HDFS) is a distributed file system which is designed to run on commodity hardware. Commodity hardware is cheaper in cost. Since Hadoop requires processing power of multiple machines and since it is expensive to deploy costly hardware, we use commodity hardware. When commodity hardware is used, failures are more common rather than an exception. HDFS is highly fault-tolerant and is designed to run on commodity hardware.

HDFS provides high throughput access to the data stored. So it is extremely useful when you want to build applications which require large data sets.

HDFS was originally built as infrastructure layer for Apache Nutch. It is now pretty much part of Apache Hadoop project.

Hadoop Distributed File System Architecture

HDFS has master/slave architecture. In this architecture one of the machines will be designated as a master node (or name node). Every other machine would be acting as slave (or data node). NameNode/DataNode are java processes that run on the machines when Hadoop software is installed.

NameNode is responsible for managing the metadata about the HDFS Files. This metadata includes various information about the HDFS File such as Name of the file, File Permissions, FileSize, Blocks etc. It is also responsible for performing various namespace operations like opening, closing, renaming the files or directories.

Whenever a file is to be stored in HDFS, it is divided into blocks. By default, blocksize is 64MB (Configurable). These blocks are replicated (default is 3) and stored across various datanodes to take care of hardware failures and for faster data transfers. NameNode maintains a mapping of blocks to DataNodes.

DataNodes serves the read and write requests from HDFS file system clients. They are also responsible for creation of block replicas and for checking if blocks are corrupted or not. It sends the ping messages to the NameNode in the form of block mappings.

How communication happens?

1. HDFS exposes Java/C API using which user can write an application to interact with HDFS. Application using this API Interacts with Client Library (present on the same client machine).

2. Client (Library) connects to the NameNode using RPC. The communication between them happens using ClientProtocol. Major functionality in ClientProtocol includes Create (creates a file in name space), Append (add to the end of already existing file), Complete (client has finished writing to file), Read etc.

3. Client (Library) interacts with DataNode directly using DataTransferProtocol. The DataTransferProtocol defines operations to read a block, write to block, get checksum of block, copy the block etc.

4. Interaction between NameNode and DataNode. It’s always DataNode which initiates the communication first and NameNode just responds to the requests intiated. The communication usually involves DataNode Registration, DataNode sending heart beat messages, DataNode sending blockreport, DataNode notifying the receipt of Block from a client or another DataNode during replication of blocks.

In this post, we have discussed the high level architecture of HDFS and then we understood various daemons that are running behind the scenes for HDFS. We also saw how communication happens between client vs HDFS and also among various daemons of HDFS.

In next few posts, let’s dig deeper and understand how HDFS achieves its robustness, data availability and high data transfers.

Happy Learning!


About the Author

Kalyan Nelluri is an open-source enthusiast and technologist who enjoys working on research areas including Machine Learning, Data Mining and Big Data Processing. He holds a master's degree in computer science from IIIT Hyderabad. He is currently working as a Software Engineer in Amazon India


Global Association of Risk Professionals, Inc. (GARP®) does not endorse, promote, review or warrant the accuracy of the products or services offered by EduPristine for FRM® related information, nor does it endorse any pass rates claimed by the provider. Further, GARP® is not responsible for any fees or costs paid by the user to EduPristine nor is GARP® responsible for any fees or costs of any person or entity providing any services to EduPristine Study Program. FRM®, GARP® and Global Association of Risk Professionals®, are trademarks owned by the Global Association of Risk Professionals, Inc

CFA Institute does not endorse, promote, or warrant the accuracy or quality of the products or services offered by EduPristine. CFA Institute, CFA®, Claritas® and Chartered Financial Analyst® are trademarks owned by CFA Institute.

Utmost care has been taken to ensure that there is no copyright violation or infringement in any of our content. Still, in case you feel that there is any copyright violation of any kind please send a mail to and we will rectify it.

Popular Blogs: Whatsapp Revenue Model | CFA vs CPA | CMA vs CPA | ACCA vs CPA | CFA vs FRM

Post ID = 73821