December 29, 2015
Apache Hadoop has become a de-facto software framework for reliable, scalable, distributed and large scale computing. Since its inception, Hadoop has been first choice framework for Batch processing. From large banks to online retail giants, everyone is using Hadoop for their periodic report generation, computations and for many more use cases. Typically, these use cases are batch-oriented processes and require few hours before we get meaning out of data. Today’s swift world requires meaning or actions out of raw data almost at the time it is generated. This has led to a concept of stream processing. Though Hadoop is not originally considered fit for stream processing, invention of YARN (Hadoop 2.0) has made it good candidate for it. Currently there are multiple stream processing frameworks in Hadoop eco-system and Apex is brand new one entering in this crowded market.
Apache Apex is native YARN based platform that helps application developer to write stream oriented as well as Batch oriented applications. It is designed to process data in-motion, in distributed, highly performant, fault tolerant way. Icing over this is the easy API, which enables users to write their java code with limited knowledge of stream processing.
The Apex is based on separate functional and operational specifications rather compounding them together. This makes application developers focus on writing User Defined Functions without having to think about how they will operate in distributed environment.
Apache Apex has rich library of commonly used functions. These are added as part of Apache Apex-Malhar library. This library has operators to access different filesystems, databases, message queues. Community is adding operators’ day-by-day making application developers life more easy.
The architecture of Apex is very simple. Apex has Malhar, an operator library and core engine to work with. Core of Apex could be depicted as follow, they are often referred as major blocks of Apache Apex.
You can clearly separate layers and can get overview where it fits. Let us see information about these blocks.
Apex provides partitioning based on keys and dynamic load balancing. Even user can define its own partitioning scheme.
Apache Apex has this very useful and unique feature. It supports Logical DAG change, Physical execution plan change.
Apex supports Kerberos. Underlying secured Hadoop cluster, it can access with inherent Kerberos integration.
Malhar has many operators that will help in actual business logic implementation. This library has
Buffer servers lie at each operator boundary. In case of data, local operator buffer servers can be after strings of operators. The main purpose of them is temporary hold data at edges before forwarding to next. They important role when node is recovered from failure. Buffer servers load data from last check-pointed state to replay
This features rich framework and Malhar library that means application developers only need to connect operators and launch application. Therefore, your application is nothing but a sequence of operators.
This is how rich framework makes developer’s life easy. So let’s see how this demo application runs
So let’s start with ‘Hello World of Big Data J ’, a small demo of word count using Apache Apex.
To run this demo, we need to configure Apex. You can install Apache Apex on your existing cluster or there is simple way to try-out, you can download pre install sandbox VM from DataTorrent website from here. For this demo we will use pre-installed VM.
Apex comes with a very intuitively and beautifully design UI Console which you can use for launching, monitoring and managing applications. It includes various statistics regarding different components that are deployed.
After, you have downloaded the sandbox VM, UnTar it and load it in your favorite VM player (I use VMWare VM player). All the software and tools that are required for running Apex are already configured in this VM and all startup scripts are configured to execute at OS boot. So, when your VM will be up, you will be having running instance of Apache Apex. Now, to view the console, just hit http://locahost:9090 in your favorite web browser and login into console. Default username: password for sandbox VM is dtadmin: dtadmin. You will see console as below
This page gives us complete overview of all system like CPU and Memory usage, Applications, Performance, Issues etc.
To deploy an application, go to Develop tab at the top of page.
Here you can deploy your application packages and can manage the tuple schemas for the data inside Apex.
Apex provides you with number of applications out of box, which you can see listed below:
Now, let’s launch the wordcount application. You can do this by clicking on launch application option next to DataTorrent Wordcount Demo. Next you can to provide a different to application and modify configuration details if required (We will not do so as most of defaults work fine, let’s just modify app name to “MyWordCountDemo”). You will see a message which says application was deployed successfully with a link to the application. Click on that link.
This opens a new page. Wait for few seconds until the application status changes from Accepted to Running. You will now see a page that is full of various statistics and information. Next two screenshot depicts them.
This pages show us various information like logical, physical and metric view of the application, along with statistics of various tuples/records which processed by application every second. It shows graphical representation of tuples that are emitted and the latencies etc.
You can click on any of logical operator, inspect its records, and even record a sample. Let’s do that for console operator. Click on console operator and you will get detailed information about operator as below:
Next, select one of the partitions and click on record a sample.
After a few seconds, you will see tuples are populated, click on tuple to view its content. As you can see from content, application has performed wordcount on data based on windows and there were 2 “to”, 4 “the”, 4 “a” etc. in the 0th input tuple for this window. You can now stop the application by clicking on “Shutdown” or “Kill” on main page of application.
That’s it, we have successfully deployed and run wordcount application.
So that was the introduction to a new Streaming tool – Apache Apex and successful running of an application in Apache Apex. Apache Apex has many salient features that gives it an edge over other existing framework that I will cover in subsequent posts.