Installing Hadoop Tutorial

Installing Hadoop on a Single Node in Five Simple Steps

Welcome to our guide on installing Hadoop in five simple steps.
To start with, the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. (Source: the Apache Hadoop website)

As of now, the Apache Hadoop project consists of the following modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • YARN: A framework for job scheduling and cluster resource management.
  • MapReduce: A YARN-based system for parallel processing of large data sets.

In this tutorial, I am going to install Hadoop 2.9.1 on a single node running Ubuntu.

Step 1: Setting up the instance on CloudSigma

I am using a machine with the following resources:

  • 12 GHz CPU
  • 16 GB RAM
  • 100 GB SSD

I am cloning Ubuntu 18.04 from the library and resizing the drive to 100 GB. Ubuntu 18.04 in the library comes with VirtIO drivers, Python 3, Python 2.7.15, Pip 10.0.1, OpenSSL 1.1.0g, and the latest updates as of 2018-06-11.

I am creating a password for the default user, cloudsigma, using the following command:
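For example, using the standard passwd utility:

sudo passwd cloudsigma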

To configure SSH, I am running the following command to generate SSH keys:
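A common approach is to generate a passphrase-less key and authorize it for logins to localhost, which the Hadoop start scripts rely on:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys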


Step 2: Installing Prerequisites

On the server, I will first update the package list and then upgrade the already installed packages. This ensures that we are working with the latest versions of all installed packages and software.
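On Ubuntu, this is done with apt:

sudo apt update
sudo apt upgrade -y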

For Hadoop to be installed, the java, ssh, and rsync packages need to be present. Once all the software packages are at their latest versions, we can proceed with the rest of the process.
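Assuming OpenJDK 8 (to match the java-8-openjdk-amd64 path used later), these can be installed with:

sudo apt install -y openjdk-8-jdk ssh rsync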

Now, I am going to set the JAVA_HOME directory. In my case it is /usr/lib/jvm/java-8-openjdk-amd64/jre.

To find out the JAVA_HOME directory, enter the command:
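which java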

It gives me /usr/bin/java as a result. However, this path is just a symbolic link, not the actual location of Java. So I will enter the following command:
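ls -l /usr/bin/java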

This command shows where /usr/bin/java points. I get the following as a result:

lrwxrwxrwx 1 root root 22 Sep 13 13:19 /usr/bin/java -> /etc/alternatives/java

This shows that /usr/bin/java points to /etc/alternatives/java, which is itself another symlink. I will run the following command now:
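ls -l /etc/alternatives/java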

The above command gives the following result:

lrwxrwxrwx 1 root root 46 Sep 13 13:19 /etc/alternatives/java -> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

This shows that /etc/alternatives/java points to /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java. From this path, remove the trailing /bin/java. The remaining path, /usr/lib/jvm/java-8-openjdk-amd64/jre, becomes my JAVA_HOME directory.

To set JAVA_HOME, use the following command:
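One way to set it for the current session and persist it across logins (assuming ~/.bashrc is used):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre' >> ~/.bashrc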


Step 3: Downloading Hadoop

For downloading Hadoop, go to this link. I am choosing HTTP and then selecting the stable folder for a stable release. Under the stable folder, I can see that version 2.9.1 is available. I will copy the link for hadoop-2.9.1.tar.gz (the binary tarball), not the source archive.

Now, on my instance, I will download the file using the following command:
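For example, using the Apache archive mirror for version 2.9.1 (substitute the link you copied):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.9.1/hadoop-2.9.1.tar.gz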

Now that I have downloaded the file, I will extract it:
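tar -xzf hadoop-2.9.1.tar.gz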

Next, I am setting the Hadoop home directory. In my case, /home/cloudsigma/hadoop-2.9.1 is where I have extracted the files:
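Again, I export the variable for the current session and append it to ~/.bashrc so it persists:

export HADOOP_HOME=/home/cloudsigma/hadoop-2.9.1
echo 'export HADOOP_HOME=/home/cloudsigma/hadoop-2.9.1' >> ~/.bashrc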

Step 4: Hadoop Configurations

Now that HADOOP_HOME is set, I am going to make some configuration changes.

Firstly, I will add the fs.defaultFS property in HADOOP_HOME/etc/hadoop/core-site.xml.

The final file should look like this:
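Here is a minimal single-node sketch, assuming the NameNode address hdfs://localhost:9000 (the host and port are my choice):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>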

This property provides the HDFS address to our dfs commands. If it isn't specified, we would need to supply the HDFS address in every dfs command.

Further, I am creating two folders, namenode and datanode, which will serve as the storage directories for the namenode and datanode respectively:
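For example, placing both under /home/cloudsigma/hdfs (any location works, as long as it matches the paths given in hdfs-site.xml below):

mkdir -p /home/cloudsigma/hdfs/namenode /home/cloudsigma/hdfs/datanode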

Next, I am going to edit HADOOP_HOME/etc/hadoop/hdfs-site.xml. I am adding the replication factor, the namenode directory, and the datanode directory.

The final file should look like this:
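Here is a minimal sketch, assuming the namenode and datanode folders were created under /home/cloudsigma/hdfs as above:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/cloudsigma/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/cloudsigma/hdfs/datanode</value>
  </property>
</configuration>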

The replication factor controls how many copies of each data block are stored. I have specified 1, which means no additional replicas will be made.

dfs.namenode.name.dir determines where the DFS name node should store the name table (fsimage) on the local filesystem.

dfs.datanode.data.dir determines where the DFS data node should store its blocks on the local filesystem.

In HADOOP_HOME/etc/hadoop/hadoop-env.sh, I am hard-coding JAVA_HOME, since Hadoop is sometimes unable to pick up this value from the local session. Next, I will edit the line:
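In the stock hadoop-env.sh shipped with Hadoop 2.x, this line reads:

export JAVA_HOME=${JAVA_HOME}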

to make it:
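export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre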


To make the hdfs and hadoop commands easily accessible from anywhere, I am adding their bin and sbin directories to the PATH variable:
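For the current session and for future logins (appending to ~/.bashrc):

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc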


Since we will be starting HDFS for the first time, I am formatting the namenode:
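hdfs namenode -format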

However, I got the following error while running this command:

18/09/13 14:22:39 WARN net.DNS: Unable to determine address of the host-falling back to “localhost” address
java.net.UnknownHostException: rev-xxx.xxx.xx.xxx-static.atman.pl: rev-xxx.xxx.xx.xxx-static.atman.pl: Name or service not known

In order to resolve this issue, add a line to /etc/hosts (with sudo permissions) in the following format:

IP_Address    rev-xxx.xxx.xx.xxx-static.atman.pl

This resolves the issue. Now, run the namenode format command again.

Step 5: Starting Hadoop

Firstly, from anywhere on the machine, start HDFS by entering the command:
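start-dfs.sh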

The output will show the namenode, datanode, and secondary namenode daemons being started.

Now, we will start YARN using the command:
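start-yarn.sh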

The output will show the resourcemanager and nodemanager daemons being started.

Secondly, we can verify that HDFS is working by running a few hdfs commands:
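For example, we can create a test directory and list the root of the new filesystem; jps (from the JDK) additionally shows which Hadoop daemons are running:

hdfs dfs -mkdir /test
hdfs dfs -ls /
jps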

If these commands complete without errors, HDFS is up and running.

Finally, we’ve reached the end of this tutorial. Hadoop is now installed and fully operational!