Introduction to Hadoop

Data is a collection of facts organized in a meaningful way so that it can be used for various purposes. An IT company might use it to analyze the productivity of employees across a set of projects, while a consulting firm might use it to predict the best investment options based on past performance. On a more personal level, we store our memories as data: photos and videos of family and travel excursions.

Continuous generation and storage of data has become a natural part of our daily lives with the advent of technology. From looking up daily weather reports on our phones to paying a highway toll with an RFID tag, we generate data knowingly or unknowingly. All of this data can be used to build better facilities, businesses, and solutions.

With social media booming, the amount of data being generated is unprecedented. The next big driver of data generation is the Internet of Things (IoT): data generated by IoT devices has taken the scale to another level. With data as big as a haystack, the tools we once used to analyze needle-sized datasets could no longer cope. This vast amount of varied data, generated at great velocity, is called Big Data.

For analyzing and working with Big Data, new tools had to be created. Hadoop, first released in April 2006, was one such groundbreaking tool. It was based on two of Google’s papers, “The Google File System” and “MapReduce: Simplified Data Processing on Large Clusters”.

With the traditional approach, huge amounts of data first had to be stored and then queried or processed. The limited storage capacity and computational power of a single machine presented huge challenges, and processing took a long time. With Hadoop, one can connect many commodity machines and store parts of the data on each of them. The file system takes care of replicating the data in case any machine fails, much like RAID. The MapReduce component converts the data into key-value pairs and then aggregates the results. Rather than bringing the data to the machine where the program resides, Hadoop packages the program and sends it to each machine where the data resides. The Map part of the program converts the data into key-value pairs based on the business logic and emits the resulting pairs. The machines working out the final result collect these key-value pairs from each node and reduce them to the final answer. This way of solving big data problems is called the MapReduce paradigm.
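The paradigm can be sketched in a few lines of Python. This is a single-machine simulation of the classic word-count example, not real Hadoop code: in a cluster, each node would run the map phase on its local data blocks, and the framework would shuffle the pairs to reducers by key.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) key-value pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: aggregate the counts for each key (word)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data needs big tools", "hadoop handles big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(pairs)
print(word_counts["big"])  # 3
```

The key point is that map and reduce are independent per key, which is what lets Hadoop run them on thousands of machines in parallel.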

The project is currently developed by the Apache Software Foundation. From Hadoop v1 to v3, it has evolved significantly, and many new modules have been introduced.


Hadoop v1 consists mainly of just the MapReduce processing model and HDFS.
MapReduce has two components, the Job Tracker and the Task Tracker. The Job Tracker oversees the state of the Task Trackers and allocates jobs/tasks to them, whereas a Task Tracker simply completes the jobs/tasks assigned to it by the Job Tracker and reports their status back.

In v1, MapReduce is responsible for both processing and resource management in the cluster. V1 is limited to clusters of about 4,000 nodes and can only process batch applications; real-time and streaming applications were not supported. Moreover, it was only available for Linux.

HDFS also has two components, the Name Node and the Data Node. The Name Node runs on the master node and stores metadata, i.e. information about the primary data. There is only a single Name Node managing the namespace, which makes it a single point of failure. Data Nodes run on the slave nodes and store the actual data.
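The division of labor between Name Node and Data Nodes can be illustrated with a toy simulation. This is not real HDFS code; the block size and node layout are scaled-down stand-ins (HDFS v2 defaults to 128 MB blocks and a replication factor of 3), and the functions `put_file` and `read_file` are invented for illustration.

```python
BLOCK_SIZE = 4     # bytes here; a stand-in for HDFS's 128 MB default
REPLICATION = 3    # HDFS's default replication factor

data_nodes = {i: {} for i in range(5)}  # node id -> {block id: bytes}
name_node = {}                          # filename -> [(block id, node ids)]

def put_file(name, content: bytes):
    """Split a file into blocks and replicate each block across nodes."""
    blocks = [content[i:i + BLOCK_SIZE] for i in range(0, len(content), BLOCK_SIZE)]
    name_node[name] = []
    for idx, block in enumerate(blocks):
        block_id = f"{name}#{idx}"
        # place each replica on a different node (round-robin for simplicity)
        targets = [(idx + r) % len(data_nodes) for r in range(REPLICATION)]
        for t in targets:
            data_nodes[t][block_id] = block
        name_node[name].append((block_id, targets))

def read_file(name):
    """Reassemble a file; any surviving replica of each block will do."""
    out = b""
    for block_id, targets in name_node[name]:
        for t in targets:
            if block_id in data_nodes[t]:
                out += data_nodes[t][block_id]
                break
    return out

put_file("demo.txt", b"hello hadoop!")
data_nodes[0].clear()          # simulate a Data Node failure
print(read_file("demo.txt"))   # b'hello hadoop!'
```

Note that the Name Node holds only the small mapping of blocks to nodes, while the bulky block contents live on the Data Nodes, which is why losing the single Name Node in v1 is so much more serious than losing a Data Node.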


As of today, Hadoop v2 is the latest stable version available.
Hadoop 2 introduces the possibility of using other computing models, such as Spark. Instead of MapReduce handling everything, YARN is introduced to handle cluster resource management, while processing is taken care of by the various processing models. YARN works on the concept of containers: resources are divided among containers on an as-needed basis, and containers are created and killed per task. YARN’s Resource Manager has two components, the Scheduler and the Applications Manager. The Scheduler allocates the required resources to applications but performs no monitoring; accepting job submissions and tracking application status is the job of the Applications Manager.
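The container idea can be sketched as follows. This is a toy model, not the real YARN API: the `Container` and `Scheduler` classes and the `grant`/`release` methods are invented names for illustration of how resources are carved out per task and returned when the task ends.

```python
class Container:
    """A slice of a node's resources, handed to one task."""
    def __init__(self, cid, memory_mb, vcores):
        self.cid, self.memory_mb, self.vcores = cid, memory_mb, vcores

class Scheduler:
    """Grants containers from a pool of free capacity."""
    def __init__(self, total_memory_mb, total_vcores):
        self.free_memory = total_memory_mb
        self.free_vcores = total_vcores
        self._next_id = 0

    def grant(self, memory_mb, vcores):
        """Allocate a container if capacity allows, else return None."""
        if memory_mb > self.free_memory or vcores > self.free_vcores:
            return None
        self.free_memory -= memory_mb
        self.free_vcores -= vcores
        self._next_id += 1
        return Container(self._next_id, memory_mb, vcores)

    def release(self, container):
        """Task finished: return the container's resources to the pool."""
        self.free_memory += container.memory_mb
        self.free_vcores += container.vcores

sched = Scheduler(total_memory_mb=8192, total_vcores=4)
c1 = sched.grant(2048, 1)   # a map task gets a container
c2 = sched.grant(8192, 4)   # denied: not enough left while c1 runs
print(c2)                   # None
sched.release(c1)           # c1 finishes; capacity is restored
print(sched.free_memory)    # 8192
```

Because containers are created and destroyed per task rather than tied to one processing model, frameworks like Spark can share the same cluster alongside MapReduce.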

Instead of Hadoop v1’s maximum of roughly 4,000 nodes, Hadoop v2 can scale up to 10,000 nodes per cluster. In v2, multiple Name Nodes are used, and they manage multiple namespaces. Standby Name Nodes are present in case the active one fails. Not just batch processing is supported: real-time operations, event processing, and streaming jobs are also possible. V2 also added support for Windows.


Hadoop v3 is available now but isn’t a stable release yet. It adds numerous enhancements over v2.

In v2, the minimum Java requirement was version 7; Hadoop v3 raises it to version 8.

Erasure Coding is introduced, which reduces the standard 3x storage overhead of replication. Rather than replicating the entire data, it stores parity blocks, much like RAID, to recover data. Erasure coding is generally used for less frequently accessed data, since it adds network and CPU overhead during reconstruction. It is optional, and users can choose whether to deploy it.

YARN Timeline Service v2 is introduced, which tackles reliability and scaling issues and adds flows and aggregation for enhanced usability.

V3 also supports scheduling additional resource types, such as disks and GPUs, for better integration with containers and with deep learning and machine learning workloads. This gives the data science and AI communities the opportunity to run their computations in much less time.

Intra-node disk balancing is introduced, which evens out the data across the disks within a single Data Node.

Since Hadoop v2, many other tools, such as Spark and Flink, have come into the picture; these have different processing models than MapReduce and process data faster in some cases. You can read on CloudSigma’s blog about many of these components and how to make the best use of them. You can start by installing Hadoop v2 on CloudSigma’s machines within minutes and following some of the tutorials.


About Akshay Nagpal

Big Data Analytics and ML enthusiast.
