September 24, 2015
The underlying data needed to be used to gain right outcomes for all above tasks is comparatively very large. It cannot be handled efficiently (in terms of both space and time) by traditional systems.
These are all big data scenarios.
To collect, store and do computations on this kind of voluminous data we need a specialized cluster computing system. Apache Hadoop has solved this problem for us.
It offers a distributed storage system (HDFS) and parallel computing platform (MapReduce).
Hadoop framework works as below:
Hadoop has got specialized tools in its ecosystem to perform different tasks. So, if you want to run an end to end lifecycle of an application you need to go with multiple tools. For example, for SQL queries u will use, hive/pig, for streaming sources you have to go with Hadoop inbuilt streaming or Apache Storm (Which is not part of Hadoop ecosystem) or for machine learning algorithms you have to use Mahout. Integrating all these systems together to build a single data pipeline use case is quite a task.
In MapReduce job,
This process makes job slower causing Disk I/O and network I/O. This also makes Mapreduce unfit for iterative processing where you have to apply machine learning algorithms to same group of data over and over again.
Apache Spark is developed in UC Berkeley AMPLAB in 2009 and in 2010 it went to become Apache top contributed open source project till date.
Apache Spark is more generalised system, where you can run both batch and streaming jobs at a time. It supersedes its predecessor MapReduce in speed by adding capabilities to process data faster in memory. It is also more efficient on disk. It leverages in memory processing using its basic data unit RDD (Resilient Distributed Dataset). These hold as much dataset as possible in memory for complete lifecycle of job hence saving on disk I/O. Some data can get spilled over disk after memory upper limits.
Below graph shows running time in seconds of both Apache Hadoop and Spark for calculating logistic regression. Hadoop took 110 seconds while spark finished same job in only 0.9 seconds.
Spark does not store all data in memory. But if data is in memory it makes best use of LRU cache to process it faster. It is 100x faster while computing data in memory and still faster on disk than Hadoop.
Spark’s distributed data storage model, resilient distributed datasets (RDD), guarantees fault tolerance which in turn minimizes network I/O. Spark paper says:
"RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition."
So you don’t need to replicate data to achieve fault tolerance.
In Spark MapReduce, mappers output is kept in OS buffer cache and reducers pull it to their side and write it directly to their memory, unlike Hadoop where output gets spilled to disk and read it again.
Spark’s in memory cache makes it fit for machine learning algorithms where you need to use same data over and over again. Spark can run complex jobs, multiple steps data pipelines using Direct Acyclic Graph (DAGs).
Spark is written in Scala and it runs on JVM (Java Virtual Machine). Spark offers development APIs for Java, Scala, Python and R languages. Spark runs on Hadoop YARN, Apache Mesos as well as it has its own standalone cluster manager.
In 2014 it secured 1st place in world record for sorting 100TB data (1 trillion records) benchmark in just 23 minutes, where as previous record of Hadoop by Yahoo was about 72 minutes. This proves that spark sorted data 3 times faster and with 10 times fewer machines. All sorting happened on disk (HDFS), without actually using spark in-memory cache capability.
Spark is meant for doing advanced analytics in one go, for achieving that it offers following components:
Spark core API is base of Apache Spark framework, which handles job scheduling, task distribution, memory management, I/O operations and recovering from failures. Main logical data unit in spark is called RDD (Resilient Distributed Dataset), which stores data in distributed way to be processed parallel later. It lazily computes operations. Therefore, memory need not be occupied all the time, and other jobs can utilize it.
It offers interactive querying capabilities with low latency. New DataFrame API can hold both structured and semi-structured data and allow all SQL operations and functions to do computations.
It provides real time streaming APIs, which collects and processes data in micro batches.
It uses Dstreams which is nothing but continuous sequence of RDDs, to compute business logics on incoming data and generate results immediately.
It is spark’s machine learning library (almost 9 times faster than Mahout) which provides machine learning as well as statistical algorithms like classification, regression, collaborative filtering etc.
GraphX API provides capabilities to handle graphs and perform graph-parallel computations. It includes graph algorithms like PageRank and various functions to analyze graphs.
Spark is still young system, not as matured as Hadoop. There is no tool for NOSQL like HBase. Considering high memory requirement for faster data processing you cannot really say it runs on commodity hardware. Spark does not have its own storage system. It relies on HDFS for that.
So, Hadoop MapReduce is still good for certain batch jobs, which does not include much data pipelining.
“New technology never completely replaces old one; they both would rather coexist.”
In this blog we looked at why you need tool like Spark, what makes it faster cluster computing system and its core components. Next part we will go deeper into Spark core API RDDs, transformations and actions.