Share with your network!

Apache Hadoop has become a de-facto software framework for reliable, scalable, distributed and large scale computing.  Since its inception, Hadoop has been first choice framework for Batch processing. From large banks to online retail giants, everyone is using Hadoop for their periodic report generation, computations and for many more use cases. Typically, these use cases are batch-oriented processes and require few hours before we get meaning out of data. Today’s swift world requires meaning or actions out of raw data almost at the time it is generated. This has led to a concept of stream processing. Though Hadoop is not originally considered fit for stream processing, invention of YARN (Hadoop 2.0) has made it good candidate for it. Currently there are multiple stream processing frameworks in Hadoop eco-system and Apex is brand new one entering in this crowded market.

What is Apache Apex?

Apache Apex is native YARN based platform that helps application developer to write stream oriented as well as Batch oriented applications. It is designed to process data in-motion, in distributed, highly performant, fault tolerant way. Icing over this is the easy API, which enables users to write their java code with limited knowledge of stream processing.

The Apex is based on separate functional and operational specifications rather compounding them together. This makes application developers focus on writing User Defined Functions without having to think about how they will operate in distributed environment.

Apache Apex has rich library of commonly used functions. These are added as part of Apache Apex-Malhar library. This library has operators to access different filesystems, databases, message queues. Community is adding operators’ day-by-day making application developers life more easy.

What are core Blocks of Apache Apex?

The architecture of Apex is very simple. Apex has Malhar, an operator library and core engine to work with. Core of Apex could be depicted as follow, they are often referred as major blocks of Apache Apex.

core blocks of Apache Apex

You can clearly separate layers and can get overview where it fits. Let us see information about these blocks.

  1. StrAM (Stream Application Master)
    StrAM is a YARN Application Master. Its responsibility includes stream application launch, resource allocation, scheduling logical DAG’s. Along with these YARN operations, StrAM initializes operators, streams. StrAM also gathers statistics from its children.
  2. State Snapshot
    Stream processing frameworks could not afford losing processed results. In addition, they must know how much they have processed in order to correctly process records after they recovered from failure. So periodically, check pointing is important in stream processing.  In Apex, StrAM keeps track of check pointing and at operator boundary, periodically check pointing is carried out on HDFS.
    StrAM is access point for REST API. External tools can access this REST API and do integrate with any external application.
  4. Tools
    Apex provides CLI to launch and monitor Apex applications. Even we can build our own with the help of REST API’s. Along with CLI, application can configure with static configuration scripts for automated launch.
  5. Partitioning
  6. Apex provides partitioning based on keys and dynamic load balancing. Even user can define its own partitioning scheme.

  7. Dynamic Modifications
  8. Apache Apex has this very useful and unique feature. It supports Logical DAG change, Physical execution plan change.

  9. SLA Analysis
    Apache Apex do SLA analysis on its own periodically. It does latency, bottleneck and throughput analysis and adds more resources to meet SLA configured.
  10. Security
  11. Apex supports Kerberos. Underlying secured Hadoop cluster, it can access with inherent Kerberos integration.

  12. High Availability
    Apache Apex utilizes YARN’s restarting functionality and restarts from last check pointed state. 
  13. Malhar
    Apache Apex –Malhar is operator’s library with numerous operators. These operators are categorized into
    • Input / Output operators
      Under this category, currently Malhar has operators to read/write from
    • File System
    • RDBMS
    • NoSQL stores
    • Message Queues
    • In-memory databases
    • Social Media
  14. Computation Operators –
  15. Malhar has many operators that will help in actual business logic implementation. This library has

        • Pattern Matching
        • Stats and Math
        • Machine Learning
        • Parsers
        • Social Media
      • Buffer Servers

      Buffer servers lie at each operator boundary. In case of data, local operator buffer servers can be after strings of operators. The main purpose of them is temporary hold data at edges before forwarding to next. They important role when node is recovered from failure. Buffer servers load data from last check-pointed state to replay

      What is Apex Application programming model?

      This features rich framework and Malhar library that means application developers only need to connect operators and launch application. Therefore, your application is nothing but a sequence of operators.

      Apex Application programming  model

      This is how rich framework makes developer’s life easy. So let’s see how this demo application runs

      Apache Apex Demo

      So let’s start with ‘Hello World of Big Data J ’, a small demo of word count using Apache Apex.

      Setting up Apache Apex

      To run this demo, we need to configure Apex. You can install Apache Apex on your existing cluster or there is simple way to try-out, you can download pre install sandbox VM from DataTorrent website from here. For this demo we will use pre-installed VM.

      Walkthrough Apex UI Console

      Apex comes with a very intuitively and beautifully design UI Console which you can use for launching, monitoring and managing applications. It includes various statistics regarding different components that are deployed.

      After, you have downloaded the sandbox VM, UnTar it and load it in your favorite VM player (I use VMWare VM player). All the software and tools that are required for running Apex are already configured in this VM and all startup scripts are configured to execute at OS boot. So, when your VM will be up, you will be having running instance of Apache Apex. Now, to view the console, just hit http://locahost:9090 in your favorite web browser and login into console. Default username: password for sandbox VM is dtadmin: dtadmin. You will see console as below

      Apache Apex UI  console

      This page gives us complete overview of all system like CPU and Memory usage, Applications, Performance, Issues etc.

      To deploy an application, go to Develop tab at the top of page.

      data torrent

      Here you can deploy your application packages and can manage the tuple schemas for the data inside Apex.

      Apex provides you with number of applications out of box, which you can see listed below:

      data torrent

      WordCount Demo

      Now, let’s launch the wordcount application. You can do this by clicking on launch application option next to DataTorrent Wordcount Demo. Next you can to provide a different to application and modify configuration details if required (We will not do so as most of defaults work fine, let’s just modify app name to “MyWordCountDemo”). You will see a message which says application was deployed successfully with a link to the application. Click on that link.

      wordcount demo

      This opens a new page. Wait for few seconds until the application status changes from Accepted to Running. You will now see a page that is full of various statistics and information. Next two screenshot depicts them.

      wordcount demo

      Apache Apex UI  console

      This pages show us various information like logical, physical and metric view of the application, along with statistics of various tuples/records which processed by application every second. It shows graphical representation of tuples that are emitted and the latencies etc.

      You can click on any of logical operator, inspect its records, and even record a sample. Let’s do that for console operator. Click on console operator and you will get detailed information about operator as below:

      Apache Apex UI  console

      Next, select one of the partitions and click on record a sample.

      Apache Apex UI  console

      After a few seconds, you will see tuples are populated, click on tuple to view its content. As you can see from content, application has performed wordcount on data based on windows and there were 2 “to”, 4 “the”, 4 “a” etc. in the 0th input tuple for this window. You can now stop the application by clicking on “Shutdown” or “Kill” on main page of application.

      That’s it, we have successfully deployed and run wordcount application.


      So that was the introduction to a new Streaming tool – Apache Apex and successful running of an application in Apache Apex. Apache Apex has many salient features that gives it an edge over other existing framework that I will cover in subsequent posts.