Apache Hadoop has become the de facto software framework for reliable, scalable, distributed, large-scale computing. Unlike many other computing systems, it brings computation to the data rather than sending data to the computation. Hadoop was created in 2006 at Yahoo by Doug Cutting, based on papers published by Google. As Hadoop matured over the years, many new components and tools were added to its ecosystem to enhance its usability and functionality: Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop, to name a few.
With the growing popularity of Hadoop, many developers jump into this technology to get a taste of it. But, as they say, Hadoop is not for the faint-hearted, and many developers cannot even get past the barrier of installing it. Many distributions offer a pre-installed sandbox VM to try things out, but that does not give you the feel of distributed computing. Installing a multi-node cluster, however, is not an easy task, and with the growing number of components it is very tricky to handle so many configuration parameters. Thankfully, Apache Ambari comes to our rescue!
What is Ambari?
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides a dashboard for viewing cluster health, including heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner. It has a very simple, interactive UI for installing various tools and performing management, configuration, and monitoring tasks. Below we take you through the steps of installing Hadoop and its various ecosystem components on a multi-node cluster.
The Ambari architecture is shown below.
Ambari has two components:
- Ambari server – This is the master process, which communicates with the Ambari agents installed on each node participating in the cluster. It has a PostgreSQL database instance that is used to maintain all cluster-related metadata.
- Ambari agent – These are the acting agents for Ambari on each node. Each agent periodically sends its own health status, along with various metrics, the status of installed services, and more. Based on this, the master decides on the next action and conveys it back to the agent to carry out.
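To make the server/agent exchange above a little more concrete, here is a toy sketch in Python. It is not Ambari's actual protocol or code: the class names, the state strings, and the host name node2 are all made up for illustration. It only shows the idea that an agent reports its state and the server answers with the actions needed to reach the desired state.

```python
from dataclasses import dataclass, field


@dataclass
class Heartbeat:
    """Status report an agent sends to the server (fields are illustrative)."""
    host: str
    healthy: bool
    service_status: dict  # e.g. {"DATANODE": "STARTED"}


@dataclass
class ToyServer:
    """Stands in for the Ambari server: holds desired state, answers heartbeats."""
    desired: dict = field(default_factory=dict)  # host -> {service: state}

    def handle_heartbeat(self, hb: Heartbeat) -> list:
        """Compare reported state with desired state and return commands."""
        if not hb.healthy:
            return []  # a real server would flag the host instead
        commands = []
        for service, want in sorted(self.desired.get(hb.host, {}).items()):
            have = hb.service_status.get(service, "NOT_INSTALLED")
            if have != want:
                commands.append((service, want))
        return commands


@dataclass
class ToyAgent:
    """Stands in for an Ambari agent running on one node."""
    host: str
    services: dict = field(default_factory=dict)

    def heartbeat(self) -> Heartbeat:
        return Heartbeat(self.host, True, dict(self.services))

    def apply(self, commands: list) -> None:
        # Pretend every command succeeds immediately.
        for service, state in commands:
            self.services[service] = state


server = ToyServer(desired={"node2": {"DATANODE": "STARTED", "NODEMANAGER": "STARTED"}})
agent = ToyAgent("node2", {"DATANODE": "INSTALLED"})

commands = server.handle_heartbeat(agent.heartbeat())
agent.apply(commands)
print(commands)
```

After the agent applies the commands, its next heartbeat matches the desired state and the server has nothing more to send.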
How to install Ambari?
Installing Ambari is an easy task that takes only a few commands.
We will cover Ambari installation and cluster setup. We assume we have 4 nodes: Node1, Node2, Node3 and Node4. We pick Node1 as our Ambari server.
These are the installation steps for RHEL-based systems; for Debian and other systems the steps will vary a little.
- Installation of Ambari:
From the Ambari server node (Node1, as we decided):
i. Download the Ambari public repo
Download the ambari.repo file that matches your OS and Ambari version into /etc/yum.repos.d/. This adds the Hortonworks Ambari repository to yum, which is the default package manager for RHEL systems.
ii. Install Ambari RPMs
Run yum install ambari-server. This will take some time and will install Ambari on this system.
iii. Configure the Ambari server
The next thing to do after installing Ambari is to configure it and set it up to provision the cluster.
The following step takes care of this: run ambari-server setup and answer its prompts.
iv. Start the server and log in to the web UI
Start the server with ambari-server start.
Now we can access the Ambari web UI (hosted on port 8080).
Log in to Ambari with the default username “admin” and the default password “admin”.
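Besides the web UI, the Ambari server also exposes a REST API under /api/v1 on the same port, using the same credentials. As a quick sanity check that the server is up, something like the following Python sketch could be used. The host name node1.example.com is a placeholder for your own Ambari server, and the actual network call is left commented out because it needs a running server.

```python
import base64
import urllib.request


def build_clusters_request(host: str, user: str = "admin", password: str = "admin",
                           port: int = 8080) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists the clusters Ambari knows."""
    url = f"http://{host}:{port}/api/v1/clusters"
    # HTTP basic auth: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Basic {token}")
    return req


# "node1.example.com" is a hypothetical host name; use your Ambari server's FQDN.
request = build_clusters_request("node1.example.com")
print(request.full_url)

# To actually send it (requires a running Ambari server):
# with urllib.request.urlopen(request, timeout=10) as resp:
#     print(resp.read().decode("utf-8"))
```

Right after installation the list of clusters will be empty; it fills in once the install wizard below has finished. It is also a good idea to change the default admin password before exposing the server to anyone else.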
Setting up Hadoop cluster
1. Landing page
Click on “Launch Install Wizard” to start cluster setup
2. Cluster Name
Give your cluster a good name.
Note: This is just a simple name for the cluster. It is not that significant, so don’t worry about it and choose any name you like.
3. Stack selection
This page lists the stacks available to install. Each stack comes pre-packaged with Hadoop ecosystem components. These stacks are from Hortonworks. (We can install plain Hadoop too; we will cover that in later posts.)
4. Hosts entry and SSH key entry
Before moving on with this step, we should have passwordless SSH set up for all the participating nodes.
Add the hostnames of the nodes, one entry per line. [Add the FQDN, which can be obtained with the hostname -f command.] Select the private key used while setting up passwordless SSH and the username for which the key was created.
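Typos in the host list are a common reason for registration failures in the next step, so it can help to sanity-check the list before pasting it into the wizard. The Python sketch below is only a rough check under my own assumptions (at least two dot-separated labels, no duplicates); it is not the validation Ambari itself performs, and the example.com names are placeholders.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def looks_like_fqdn(name: str) -> bool:
    """Rough check: at least two valid dot-separated labels."""
    labels = name.rstrip(".").split(".")
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)


def parse_host_entries(text: str) -> list:
    """Parse a one-host-per-line list, rejecting short names and duplicates."""
    hosts = []
    for line in text.splitlines():
        name = line.strip()
        if not name:
            continue  # ignore blank lines
        if not looks_like_fqdn(name):
            raise ValueError(f"not a fully qualified name: {name!r}")
        if name in hosts:
            raise ValueError(f"duplicate host: {name!r}")
        hosts.append(name)
    return hosts


# Placeholder host names for the four nodes used in this post.
entries = """
node1.example.com
node2.example.com
node3.example.com
node4.example.com
"""
hosts = parse_host_entries(entries)
print(hosts)
```

A short name such as node1 is rejected here, which mirrors the advice above to enter the output of hostname -f rather than the short host name.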
5. Hosts registration status
You can see some operations being performed. These include setting up the Ambari agent on each node and creating the basic setup on each node. Once we see ALL GREEN, we are ready to move on. This may take a while, as it installs a few packages.
6. Choose services you wish to install
Depending on the stack selected in step 3, there are a number of services that we can install in the cluster. You can choose the ones you want. Ambari intelligently selects dependent services if you haven’t selected them. For instance, if you selected HBase but not ZooKeeper, it will prompt you about this and add ZooKeeper to the cluster as well.
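The dependency handling described above boils down to computing a transitive closure over a "depends on" table. Here is a small Python sketch of that idea. The table is a hand-written illustration, not the dependency metadata that ships with any real stack, and the function names are my own.

```python
# Illustrative dependency table only; real stacks define their own.
DEPENDS_ON = {
    "HBASE": ["HDFS", "ZOOKEEPER"],
    "HIVE": ["HDFS", "YARN"],
    "YARN": ["HDFS"],
    "HDFS": [],
    "ZOOKEEPER": [],
}


def with_dependencies(selected):
    """Return the selected services plus everything they depend on."""
    chosen = set()
    pending = list(selected)
    while pending:
        service = pending.pop()
        if service in chosen:
            continue
        chosen.add(service)
        pending.extend(DEPENDS_ON.get(service, []))
    return sorted(chosen)


def added_automatically(selected):
    """Services that were pulled in without being explicitly selected."""
    return sorted(set(with_dependencies(selected)) - set(selected))


print(with_dependencies(["HBASE"]))
print(added_automatically(["HBASE", "HDFS"]))
```

With this table, selecting only HBase pulls in HDFS and ZooKeeper, which is the kind of prompt the wizard shows.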
7. Master services mapping with Nodes
As you are aware, the Hadoop ecosystem has tools that are based on a master-slave architecture. In this step we associate the master processes with nodes. Make sure you balance your cluster properly here. Also make sure that primary and secondary services, such as the NameNode and the Secondary NameNode, are not placed on the same machine.
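The two rules of thumb above (keep the NameNode and Secondary NameNode apart, and don't pile every master onto one node) are easy to check mechanically. The Python sketch below does that for a planned layout. The component names, the host names, and the limit of three masters per host are assumptions made for the example, not Ambari settings.

```python
from collections import Counter


def check_master_layout(layout, max_per_host=3):
    """Return a list of human-readable problems with a master -> host mapping."""
    problems = []
    namenode = layout.get("NAMENODE")
    secondary = layout.get("SECONDARY_NAMENODE")
    if namenode is not None and namenode == secondary:
        problems.append(f"NAMENODE and SECONDARY_NAMENODE are both on {namenode}")
    # max_per_host is an arbitrary threshold chosen for this example.
    for host, count in sorted(Counter(layout.values()).items()):
        if count > max_per_host:
            problems.append(f"{host} runs {count} masters (limit {max_per_host})")
    return problems


# Hypothetical layouts for the four-node cluster used in this post.
bad_layout = {
    "NAMENODE": "node1",
    "SECONDARY_NAMENODE": "node1",
    "RESOURCEMANAGER": "node1",
    "HBASE_MASTER": "node1",
}
good_layout = {
    "NAMENODE": "node1",
    "SECONDARY_NAMENODE": "node2",
    "RESOURCEMANAGER": "node2",
    "HBASE_MASTER": "node3",
}

print(check_master_layout(bad_layout))
print(check_master_layout(good_layout))
```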
8. Slaves mapping with Nodes
As with the masters, map the slave services to nodes. In general, all the nodes will run slave processes, at least DataNodes and NodeManagers.
9. Customize services
This is a very important page for admins.
Here you can configure properties for your cluster to make it best suited to your use cases.
It also has some required properties, such as the Hive metastore password (if Hive is selected). These are marked with red, error-like symbols.
10. Review and start provisioning
Make sure you review the cluster configuration before launch, as this will save you from unknowingly setting wrong configurations.
11. Launch and wait until the status becomes GREEN.
Yaay! We have successfully installed Hadoop and all the components on all the nodes of the cluster. Now we can start playing with Hadoop.
Ambari runs a MapReduce wordcount job to verify that everything is running fine. Let’s check the log of the job that was run by the ambari-qa user.
As you can see in the above screenshot, the WordCount job completed successfully. This confirms that our cluster is working fine.
That’s it. We have now learned how to install Hadoop and its components on a multi-node cluster using a simple web-based tool called Apache Ambari. Apache Ambari gives us a simpler interface and saves a lot of effort on installation, monitoring and management, which would have been very tedious with so many components and their different installation steps and monitoring controls.
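If you have not met WordCount before, here is what that smoke-test job computes, sketched in plain Python on a single machine. This is not the Hadoop example job itself (that one is written in Java and runs across the cluster); it only mimics the map, shuffle and reduce phases so the idea is clear. The function names and the sample lines are my own.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


lines = ["Hello Hadoop", "hello Ambari"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

On a real cluster the map and reduce steps run in parallel on different nodes, which is why a successful run is a reasonable sign that HDFS and YARN are both working.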
Let me leave you with a hack
The Ambari installer checks /etc/lsb-release to get OS details. In Linux Mint, the same file for the underlying Ubuntu version is at /etc/upstream-release/lsb-release. To fool the installer, just replace the former with the latter (you should back up the file first).
At some point after your install is done, you can restore the original from your backup.
P.S. This is a hack without any guarantees. It worked for me, so I thought of sharing it with you.
Are you a developer or dev-ops engineer who needs to install Hadoop quickly? We have good news for you: Ambari provides a way to skip the whole wizard and complete the installation with a single script. I will cover it in the next post, so stay tuned, and till then, Happy Hadooping!
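For the curious, the back-up, swap and restore steps of this hack can be written as two small helpers. The Python sketch below runs against throwaway files in a temporary directory, so it is safe to try. To apply it for real you would point it at /etc/lsb-release and /etc/upstream-release/lsb-release and run it as root. The helper names, the ".original" suffix and the one-line file contents are my own choices for the demo.

```python
import shutil
import tempfile
from pathlib import Path


def swap_in(target: Path, replacement: Path, suffix: str = ".original") -> Path:
    """Back up `target`, then overwrite it with `replacement`. Returns the backup path."""
    backup = target.with_name(target.name + suffix)
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.copy2(target, backup)
    shutil.copy2(replacement, target)
    return backup


def restore(target: Path, backup: Path) -> None:
    """Put the backed-up file back in place, overwriting the swapped-in copy."""
    backup.replace(target)


# Demo on temporary files; the contents are simplified stand-ins.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    mint_file = root / "lsb-release"
    ubuntu_file = root / "upstream-lsb-release"
    mint_file.write_text("DISTRIB_ID=LinuxMint\n")
    ubuntu_file.write_text("DISTRIB_ID=Ubuntu\n")

    backup = swap_in(mint_file, ubuntu_file)
    during_install = mint_file.read_text()  # what the installer would see
    restore(mint_file, backup)
    after_restore = mint_file.read_text()

print(during_install.strip(), "->", after_restore.strip())
```

Refusing to overwrite an existing backup is deliberate: running the swap twice would otherwise replace your only copy of the original file with the Ubuntu one.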