Apache Hadoop has become a de-facto software framework for reliable, scalable, distributed and large scale computing. Since its inception, Hadoop has been first choice framework for Batch processing. From large banks to online retail giants, everyone is using Hadoop for their periodic report generation, computations and for many more use cases. Typically, these use cases are batch-oriented processes and require few hours before we get meaning out of data. Today’s swift world requires meaning or actions out of raw data almost at the time it is generated. This has led to a concept of stream processing. Though Hadoop is not originally considered fit for stream processing, invention of YARN (Hadoop 2.0) has made it good candidate for it. Currently there are multiple stream processing frameworks in Hadoop eco-system and Apex is brand new one entering in this crowded market.
What is Apache Apex?
Apache Apex is native YARN based platform that helps application developer to write stream oriented as well as Batch oriented applications. It is designed to process data in-motion, in distributed, highly performant, fault tolerant way. Icing over this is the easy API, which enables users to write their java code with limited knowledge of stream processing.
The Apex is based on separate functional and operational specifications rather compounding them together. This makes application developers focus on writing User Defined Functions without having to think about how they will operate in distributed environment.
Apache Apex has rich library of commonly used functions. These are added as part of Apache Apex-Malhar library. This library has operators to access different filesystems, databases, message queues. Community is adding operators’ day-by-day making application developers life more easy.
What are core Blocks of Apache Apex?
The architecture of Apex is very simple. Apex has Malhar, an operator library and core engine to work with. Core of Apex could be depicted as follow, they are often referred as major blocks of Apache Apex.
You can clearly separate layers and can get overview where it fits. Let us see information about these blocks.
- StrAM (Stream Application Master)
StrAM is a YARN Application Master. Its responsibility includes stream application launch, resource allocation, scheduling logical DAG’s. Along with these YARN operations, StrAM initializes operators, streams. StrAM also gathers statistics from its children.
- State Snapshot
Stream processing frameworks could not afford losing processed results. In addition, they must know how much they have processed in order to correctly process records after they recovered from failure. So periodically, check pointing is important in stream processing. In Apex, StrAM keeps track of check pointing and at operator boundary, periodically check pointing is carried out on HDFS.
- REST API
StrAM is access point for REST API. External tools can access this REST API and do integrate with any external application.
Apex provides CLI to launch and monitor Apex applications. Even we can build our own with the help of REST API’s. Along with CLI, application can configure with static configuration scripts for automated launch.
- Dynamic Modifications
- SLA Analysis
Apache Apex do SLA analysis on its own periodically. It does latency, bottleneck and throughput analysis and adds more resources to meet SLA configured.
- High Availability
Apache Apex utilizes YARN’s restarting functionality and restarts from last check pointed state.
Apache Apex –Malhar is operator’s library with numerous operators. These operators are categorized into
Apex provides partitioning based on keys and dynamic load balancing. Even user can define its own partitioning scheme.
Apache Apex has this very useful and unique feature. It supports Logical DAG change, Physical execution plan change.
Apex supports Kerberos. Underlying secured Hadoop cluster, it can access with inherent Kerberos integration.
- Input / Output operators –
Under this category, currently Malhar has operators to read/write from
- File System
- NoSQL stores
- Message Queues
- In-memory databases
- Social Media
Malhar has many operators that will help in actual business logic implementation. This library has
- Pattern Matching
- Stats and Math
- Machine Learning
- Social Media
Buffer servers lie at each operator boundary. In case of data, local operator buffer servers can be after strings of operators. The main purpose of them is temporary hold data at edges before forwarding to next. They important role when node is recovered from failure. Buffer servers load data from last check-pointed state to replay
What is Apex Application programming model?
This features rich framework and Malhar library that means application developers only need to connect operators and launch application. Therefore, your application is nothing but a sequence of operators.
This is how rich framework makes developer’s life easy. So let’s see how this demo application runs
Apache Apex Demo
So let’s start with ‘Hello World of Big Data J ’, a small demo of word count using Apache Apex.
Setting up Apache Apex
To run this demo, we need to configure Apex. You can install Apache Apex on your existing cluster or there is simple way to try-out, you can download pre install sandbox VM from DataTorrent website from here. For this demo we will use pre-installed VM.
Walkthrough Apex UI Console
Apex comes with a very intuitively and beautifully design UI Console which you can use for launching, monitoring and managing applications. It includes various statistics regarding different components that are deployed.
After, you have downloaded the sandbox VM, UnTar it and load it in your favorite VM player (I use VMWare VM player). All the software and tools that are required for running Apex are already configured in this VM and all startup scripts are configured to execute at OS boot. So, when your VM will be up, you will be having running instance of Apache Apex. Now, to view the console, just hit http://locahost:9090 in your favorite web browser and login into console. Default username: password for sandbox VM is dtadmin: dtadmin. You will see console as below
This page gives us complete overview of all system like CPU and Memory usage, Applications, Performance, Issues etc.
To deploy an application, go to Develop tab at the top of page.
Here you can deploy your application packages and can manage the tuple schemas for the data inside Apex.
Apex provides you with number of applications out of box, which you can see listed below:
Now, let’s launch the wordcount application. You can do this by clicking on launch application option next to DataTorrent Wordcount Demo. Next you can to provide a different to application and modify configuration details if required (We will not do so as most of defaults work fine, let’s just modify app name to “MyWordCountDemo”). You will see a message which says application was deployed successfully with a link to the application. Click on that link.
This opens a new page. Wait for few seconds until the application status changes from Accepted to Running. You will now see a page that is full of various statistics and information. Next two screenshot depicts them.
This pages show us various information like logical, physical and metric view of the application, along with statistics of various tuples/records which processed by application every second. It shows graphical representation of tuples that are emitted and the latencies etc.
You can click on any of logical operator, inspect its records, and even record a sample. Let’s do that for console operator. Click on console operator and you will get detailed information about operator as below:
Next, select one of the partitions and click on record a sample.
After a few seconds, you will see tuples are populated, click on tuple to view its content. As you can see from content, application has performed wordcount on data based on windows and there were 2 “to”, 4 “the”, 4 “a” etc. in the 0th input tuple for this window. You can now stop the application by clicking on “Shutdown” or “Kill” on main page of application.
That’s it, we have successfully deployed and run wordcount application.
So that was the introduction to a new Streaming tool – Apache Apex and successful running of an application in Apache Apex. Apache Apex has many salient features that gives it an edge over other existing framework that I will cover in subsequent posts.