Setting up a Big Cluster in 3 Easy Steps

HDP is the industry’s truly secure, enterprise-ready open source Apache™ Hadoop® distribution based on a centralized architecture (YARN). HDP addresses the complete needs of data-at-rest, powers real-time customer applications and delivers robust big data analytics that accelerate decision making and innovation. (Source: https://hortonworks.com/products/data-platforms/hdp/)

I am going to install HDP to create a big data cluster of five nodes deployed on CloudSigma. CloudSigma provides easy deployment, a vast library of operating systems and an easy-to-use interface to set up a Big Data platform within minutes.

Step 1: Set up and Configure your Desired Server Infrastructure

To begin, I have already created five machines at CloudSigma. Each machine has 16 GB RAM, 8 cores (2.5 GHz each) and a 256 GB SSD. This configuration costs around 20 cents per hour per machine to run on CloudSigma. I have installed Ubuntu 16.04 on each of the machines by cloning a drive from CloudSigma’s library, which includes the following software:

    Ubuntu 16.04 with VirtIO drivers
    Python 3 and 2.7.12
    Pip 9.0.1
    OpenSSL 1.0.2l
    Cloud-init 0.7.9
    Latest updates until 2017-12-26

Step 2: Set up the Master/Slave Configuration

Next, for our big data tools to work properly, our host (master) must be able to communicate with each of the nodes (slaves). So, we create another sudo user account, say m1, on each machine.
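A minimal sketch of creating such a user looks like this (run as root on every machine; the user name m1 is just our example choice):

```shell
# Create the user m1 (adduser will prompt for a password)
adduser --gecos "" m1

# Grant m1 sudo rights by adding it to the sudo group
usermod -aG sudo m1
```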

Now, for the machines to be able to communicate with each other, we first give each of them a name in the /etc/hosts file:

Add entries similar to these with the IPs of your machines and the names you want to give them, for example:

    IP_1 machine1.CloudSigma.dann machine1
    IP_2 machine2.CloudSigma.dann machine2
    IP_3 machine3.CloudSigma.dann machine3
    IP_4 machine4.CloudSigma.dann machine4
    IP_5 machine5.CloudSigma.dann machine5

Now we want the m1 user on machine1 to be able to access the m1 user on the other machines without being asked for a password. For that, we set up passwordless SSH.

On machine1:

    1. Log in as user m1.
    2. Create an SSH key.
    3. Copy the key to the other machines.
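The steps above can be sketched with the standard OpenSSH tools (using the hostnames we configured in /etc/hosts; accept the defaults at the ssh-keygen prompts):

```shell
# On machine1, logged in as m1: generate a key pair
ssh-keygen -t rsa

# Copy the public key to the m1 user on each of the other machines
# (you will be asked for m1's password once per machine)
for host in machine2 machine3 machine4 machine5; do
    ssh-copy-id m1@"$host"
done
```

You can verify the setup with `ssh m1@machine2` — it should log you in without a password prompt.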
Step 3: Get Ambari Up and Running

Go to Hortonworks’ HDP download page and choose your preferred option. We are going to install HDP 2.6.4 (Automated) with Ambari 2.6.1. Click Download and it will redirect you to the Apache Ambari installation page. Select your base OS; in our case, the machines run Ubuntu 16.

Following that, login to the host machine as root.

Next, download the Ambari repository file to a directory of choice. Execute the commands as mentioned on the page to download the repository file.
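At the time of writing, the commands for Ubuntu 16 and Ambari 2.6.1 looked roughly like this — always copy the exact commands and repository URL from the installation page, as they change between releases:

```shell
# Download the Ambari repository file into APT's sources directory
wget -O /etc/apt/sources.list.d/ambari.list \
    http://public-repo-1.hortonworks.com/ambari/ubuntu16/2.x/updates/2.6.1.0/ambari.list

# Import the Hortonworks signing key and refresh the package index
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD
apt-get update
```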

Now that we have the repo file, we can install Ambari. The installation downloads around 750 MB of packages, which is why a cloud platform is preferable for such clusters: with an average download speed of around 40 MB/s, the download takes only seconds on CloudSigma.
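With the repository in place, the install itself is a single APT command (run as root on the host machine):

```shell
# Install the Ambari server package and its dependencies
apt-get install -y ambari-server
```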

It’s time to set up the Ambari Server next.
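The setup is driven by a single interactive command (run as root on the host machine):

```shell
# Launch the interactive Ambari server setup wizard
ambari-server setup
```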

It will ask several questions, but the default options are fine for our purposes. So, we can just hit Enter while going through them and the setup will be done.

Finally, you can start Ambari with the following command:
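```shell
# Start the Ambari server (run as root on the host machine)
ambari-server start
```

You can check it at any time with `ambari-server status`, and stop it with `ambari-server stop`.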

In order to access the Ambari UI, go to port 8080 of your host machine’s IP address in a browser on any computer or tablet.

For example, if my IP is 213.125.36.21, then I will go to the address http://213.125.36.21:8080.

Now that you are in the Ambari UI, you can log in using the default username admin and password admin. You should change these to something secure straight away.

And voilà – we are finally finished! This was our tutorial on how to set up a big cluster in 3 simple steps.

For more tutorials, go ahead and explore our Community Section on the website.

Happy Computing!

About Akshay Nagpal

Big Data Analytics and ML enthusiast.