December 5, 2014
A Hadoop cluster is a special type of computational cluster designed to store and analyse huge amounts of unstructured data in a distributed computing
environment.
Such clusters run Hadoop’s open source distributed processing software on low-cost commodity computers.
Hadoop Cluster Architecture:
A Hadoop cluster has 3 components: client, master, and slave.
Client: The client is neither a master nor a slave. Its job is to submit MapReduce jobs that describe how the data should be processed, and then to
retrieve the results once the job has completed.
Master: The master consists of 3 components, namely NameNode, Secondary NameNode, and JobTracker.
a. NameNode: The NameNode does not store the actual files; it stores only the metadata of the files. It also oversees the health of the DataNodes and
coordinates access to the data.
b. JobTracker: JobTracker coordinates the parallel processing of data using MapReduce. To know more about JobTracker, please read the
article All You Want to Know about MapReduce (The Heart of Hadoop).
c. Secondary NameNode: The Secondary NameNode periodically contacts the NameNode, retrieves the filesystem metadata, saves it to a clean file, and sends
it back to the NameNode. Essentially, the Secondary NameNode does the housekeeping. Because the NameNode keeps its metadata in RAM, that metadata can be
rebuilt from the copy saved by the Secondary NameNode if the NameNode fails.
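The housekeeping done by the Secondary NameNode can be pictured as merging a log of recent changes into the last saved copy of the metadata. The snippet below is only a minimal sketch of that idea in plain Python. The function name `apply_edits`, the dictionary layout, and the tuple format of the log are all invented for illustration and are not part of Hadoop; in a real cluster the saved copy and the log correspond to the fsimage and edits files.

```python
# Toy model of a checkpoint: merge an edit log into a metadata snapshot.
# All names here are hypothetical; this is not Hadoop's API.

def apply_edits(snapshot, edit_log):
    """Return a new snapshot with every logged edit applied in order."""
    merged = dict(snapshot)  # copy, so the original snapshot is untouched
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = list(blocks)
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged

# Metadata as last saved: file path -> list of block IDs.
snapshot = {"/logs/a.txt": ["blk_1", "blk_2"]}

# Changes recorded since that snapshot was taken.
edit_log = [
    ("create", "/logs/b.txt", ["blk_3"]),
    ("delete", "/logs/a.txt", None),
]

checkpoint = apply_edits(snapshot, edit_log)
print(checkpoint)  # {'/logs/b.txt': ['blk_3']}
```

The merged result is what would be handed back to the NameNode, so that a restart can begin from a compact, up-to-date copy rather than replaying a long log.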
Slave: Slave nodes make up the majority of the machines in a Hadoop cluster and are responsible for storing the data and running the computations.
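To make the division of labour between the NameNode and the slave nodes more concrete, here is a small, simplified sketch in plain Python. It assumes a NameNode that keeps only a lookup table (which blocks make up a file, and where each block lives) while the DataNodes hold the actual bytes. The classes `NameNode` and `DataNode`, the `read_file` helper, and the round-robin placement are hypothetical stand-ins, not Hadoop's real API.

```python
# Toy model of the split between NameNode (metadata) and DataNodes (data).
# Class and function names are hypothetical, not Hadoop's API.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes actually stored on this node


class NameNode:
    def __init__(self):
        self.file_blocks = {}     # file path -> ordered list of block IDs
        self.block_location = {}  # block ID -> DataNode holding that block

    def add_file(self, path, chunks, datanodes):
        """Record metadata and place each chunk on a DataNode, round-robin."""
        self.file_blocks[path] = []
        for i, chunk in enumerate(chunks):
            block_id = f"{path}#blk{i}"
            node = datanodes[i % len(datanodes)]
            node.blocks[block_id] = chunk  # the data lives on the DataNode
            self.file_blocks[path].append(block_id)
            self.block_location[block_id] = node


def read_file(namenode, path):
    """Client side: look up block locations, then read from the DataNodes."""
    data = b""
    for block_id in namenode.file_blocks[path]:
        node = namenode.block_location[block_id]
        data += node.blocks[block_id]
    return data


nodes = [DataNode("dn1"), DataNode("dn2")]
nn = NameNode()
nn.add_file("/data/sample.txt", [b"hello ", b"hadoop ", b"cluster"], nodes)
print(read_file(nn, "/data/sample.txt"))  # b'hello hadoop cluster'
```

Note that the `NameNode` object never holds any file contents, only identifiers and locations, which mirrors the description of the NameNode above.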
Why use Hadoop Clusters:
Hadoop clusters are particularly known for boosting the speed of data analysis applications and for their scalability. If a cluster’s processing power
comes under stress from growing volumes of data, more nodes can be added to the cluster to increase throughput. Hadoop clusters are also highly
resistant to failure, because each block of data is copied onto other nodes, so the data is not lost if a single node fails.
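The fault tolerance described above can be illustrated with a short simulation. This is a rough sketch under stated assumptions: each block is copied to 3 distinct nodes (3 is the usual default replication factor in HDFS), and replicas are placed round-robin, which is much simpler than the placement a real cluster uses. The function names `place_replicas` and `lost_blocks` are made up for this example.

```python
# Toy model of block replication; names and the placement rule are hypothetical.

def place_replicas(block_ids, node_names, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(node_names):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block_id in enumerate(block_ids):
        placement[block_id] = {
            node_names[(i + k) % len(node_names)] for k in range(replication)
        }
    return placement


def lost_blocks(placement, failed_nodes):
    """Return the blocks whose every replica sits on a failed node."""
    failed = set(failed_nodes)
    return [b for b, nodes in placement.items() if nodes <= failed]


nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3"]
placement = place_replicas(blocks, nodes, replication=3)

print(lost_blocks(placement, ["dn2"]))                # []
print(lost_blocks(placement, ["dn1", "dn2", "dn3"]))  # ['blk_0']
```

With 3 copies of every block, losing a single node loses no data; in this toy layout a block only disappears when all 3 nodes holding its replicas fail together.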