Setting up a Big Data Cluster within Minutes in 3 Easy Steps

HDP is a truly secure, enterprise-ready open source Apache™ Hadoop® distribution based on a centralized architecture (YARN). HDP addresses the complete needs of data-at-rest, powers real-time customer applications, and delivers robust big data analytics that accelerate decision making and innovation. (Source: https://hortonworks.com/products/data-platforms/hdp/)

I am going to install HDP to create a five-node cluster deployed on CloudSigma. CloudSigma provides easy deployment, a vast library of operating systems, and an easy-to-use interface, so you can set up a Big Data platform within minutes.

Step 1: Set up and Configure your Desired Server Infrastructure

I have already created five machines on CloudSigma. Each machine has 16 GB RAM, 8 cores (2.5 GHz each), and a 256 GB SSD; this configuration costs around 20 cents per hour per machine to run on CloudSigma. I have installed Ubuntu 16.04 on each machine by cloning the following Ubuntu drive from CloudSigma’s library:

Ubuntu 16.04 with VirtIO drivers, Python 3 and 2.7.12, Pip 9.0.1, OpenSSL 1.0.2l, Cloud-init 0.7.9, and the latest updates as of 2017-12-26

Step 2: Set up the Master/Slave Configuration

For our big data tools to work properly, the host (master) must be able to communicate with each of the nodes (slaves). So, we create another sudo user account, say m1, with the following commands on each machine:
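    sudo adduser m1                 # create the new user account (m1 is just an example name)
    sudo usermod -aG sudo m1        # grant m1 sudo privileges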

Now, for the machines to be able to communicate with each other, we first give each machine a name in the /etc/hosts file:

Add entries similar to these, with the IPs of your machines and the names you want to give them, for example:

    IP_1 machine1.CloudSigma.dann machine1
    IP_2 machine2.CloudSigma.dann machine2
    IP_3 machine3.CloudSigma.dann machine3
    IP_4 machine4.CloudSigma.dann machine4
    IP_5 machine5.CloudSigma.dann machine5
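To verify that the names resolve, you can ping each machine by name from any of the others, for example:

    ping -c 2 machine2              # should reach the IP you mapped to machine2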

Now we want the m1 user on machine1 to be able to access the m1 user on the other machines without being asked for a password. For that, we set up passwordless SSH.

On machine1:

    I. Log in as user m1

    II. Create an SSH key

    III. Copy the key to the other machines (see the sketch below)
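Here is a minimal sketch of these three steps, run on machine1 and assuming the m1 user and the host names defined in /etc/hosts above:

    su - m1                         # I. log in as user m1
    ssh-keygen -t rsa               # II. create an SSH key (accept the defaults, leave the passphrase empty)
    ssh-copy-id m1@machine2         # III. copy the key to each of the other machines
    ssh-copy-id m1@machine3
    ssh-copy-id m1@machine4
    ssh-copy-id m1@machine5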

Step 3: Get Ambari Up and Running

Go to the Hortonworks HDP download page and choose your preferred option. We are going to install HDP 2.6.4 (Automated) with Ambari 2.6.1. Click Download and you will be redirected to the Apache Ambari installation page. Select your base OS; in our case, the machines run Ubuntu 16.

Log in to the host machine as root.

Download the Ambari repository file by executing the commands given on the page.
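At the time of writing, the page instructs roughly the following for Ambari 2.6.1 on Ubuntu 16 (check the page for the current repository URL and signing key):

    wget -O /etc/apt/sources.list.d/ambari.list http://public-repo-1.hortonworks.com/ambari/ubuntu16/2.x/updates/2.6.1.0/ambari.list
    apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD
    apt-get update                  # refresh the package index so apt sees the new repo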

Now that we have the repo file, we can install Ambari. The installation downloads around 750 MB of files. A cloud platform is preferable for such clusters, as it provides average download speeds of around 40 MB/s, so the download takes only seconds on CloudSigma.
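With the repository in place, the installation itself is a single apt command, run as root:

    apt-get install ambari-server   # pulls in Ambari and its dependencies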

It’s time to set up the Ambari Server.
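Still as root on the same host, run the setup command:

    ambari-server setup             # interactive setup wizard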

The setup will ask several questions, but the default options are fine for our purposes. So, we can simply hit Enter at each prompt and the setup will complete.

Finally, with the following command you can start Ambari:
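    ambari-server start             # starts the Ambari server on port 8080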

To access the Ambari UI, go to http://<your-server-IP>:8080 in your browser from any computer or tablet.

For example, if my IP is 213.125.36.21, then I will go to the address http://213.125.36.21:8080.

Now that you are in the Ambari UI, you can log in using the default username admin and password admin. You should change these to something secure straight away.

Voilà!

For more tutorials, go ahead and explore our Community Section on the website.

Happy Computing!
