December 11, 2014
Hadoop is a free, Java-based programming framework that enables the processing of large data sets in a distributed computing environment. It is part of the
Apache open source project sponsored by the Apache Software Foundation.
Hadoop makes it possible to run applications on systems with thousands of nodes involving Hadoop enables running applications on systems with thousands of
nodes involving thousands of terabytes. Its distributed file system, HDFS, assists rapid data transfer rates among the nodes while allowing the system to
continue operating uninterruptedly despite a node failure. This process lowers the overall risk of system failure at a catastrophic level, even if a
significant number of nodes become inoperative.
Inspiration for Hadoop was Google’s MapReduce, a framework in which an application is broken down into numerous small parts, also known as fragments or
blocks. Any of these can be run on any node in the cluster. The name Hadoop is given by its creator, Doug Cutting, after his child’s stuffed toy elephant.
Currently, the Apache Hadoop ecosystem consists of Hadoop Kernel, MapReduce, HDFS (Hadoop Distributed File System) and a few related projects such as
Apache Hive, HBase and Zookeeper.
The hadoop framework was initially used by major players such as Google, yahoo and IBM for applications involving search engines and advertising. The
preferred OS is either Windows or Linux for Hadoop but it can also be run on OS X and BSD.
So what’s the fuss all about?
The reason Hadoop is so popular is because Hadoop enables a computing solution that is:
– New nodes can always be added based on the needs without any requirement to change the data formats or its loading, how jobs are written or the
– Hadoop brings the massive parallel computing to commodity servers. As a result, there is a sizeable decrease in the cost per terabyte of storage.
This makes modelling all of data pretty affordable.
– Hadoop is schema-less, and is capable of absorbing any type of structured or unstructured data from any number of sources. Multiple source data
can be joined and aggregated in arbitrary ways enabling deeper analysis that cannot be provided by any one system.
– When you lose a node, the system redirects work to another location of the data and continues processing without missing a fright beat.