Apache Hadoop

Introduction to Hadoop

Data can be described as a collection of useful information, organized in a meaningful way, that can be put to many purposes. An IT company can use it to analyze the productivity of employees across a set of projects, while a consulting firm can use it to predict the best investment options based on past performance. On a more personal level, we store our memories as data: photos and videos of family and travel excursions.

Continuous generation and storage of data has become a natural part of our daily lives with the advent of technology. From checking daily weather reports on our phones to paying a highway toll with an RFID tag, we generate data knowingly or unknowingly. We can use this data to build better facilities, businesses, and solutions.

With social media booming, the amount of data being generated is unprecedented. The next big thing in the field of data generation is the Internet of Things (IoT); data generated by IoT devices has taken the scale to another level. With data piling up into a haystack, the tools we once used to analyze needle-sized datasets could no longer cope. This vast amount of varied data, generated at great velocity, is called Big Data.

What is HADOOP

To analyze and work with Big Data, new tools had to be created. Hadoop, first released in 2006, was a groundbreaking tool for this purpose. It was based on two Google papers, "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters".

With the traditional approach, the problem was having to store huge amounts of data on a single machine and then query and process it there. The limited storage capacity and computational power of individual machines presented huge challenges, and processing took a very long time. With Hadoop, one can connect many commodity machines and store a part of the entire dataset on each of them. The file system takes care of replicating the data so that, much like RAID, nothing is lost if a machine fails.
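To make this concrete, here is a minimal sketch of what storing a file looks like from a client's point of view, assuming a reachable HDFS cluster and Hadoop's standard org.apache.hadoop.fs Java API; the path used is purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath,
            // including the address of the file system (fs.defaultFS).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path used for illustration.
            Path file = new Path("/demo/hello.txt");

            // Write the file; HDFS splits it into blocks and replicates them
            // across the cluster's machines.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("Hello, HDFS");
            }

            // Ask for 3 copies of each block (the usual default).
            fs.setReplication(file, (short) 3);
            fs.close();
        }
    }

Note that the client never decides which machines hold which blocks; it only names the file and the desired replication factor, and the file system does the rest.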

The MapReduce part converts the data into key-value pairs and then aggregates them into a solution. Rather than bringing the data to the machine where the program resides, Hadoop packages the program and sends it to each machine where the data resides. The Map part of the program converts the local data into key-value pairs according to our business logic and emits the resulting pairs. The machines working out the final results collect these key-value pairs from each machine and reduce them to the final result. This way of solving Big Data problems is called the MapReduce paradigm.
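To make the Map and Reduce steps concrete, here is a sketch of the classic word-count job written against Hadoop's Java MapReduce API; the class name and the input/output paths passed on the command line are illustrative. The mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: turn each line of input into (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum all the 1s emitted for the same word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Hadoop ships this program to the nodes holding the input splits, runs the mapper locally on each split, shuffles the emitted pairs by key, and runs the reducer on each group, which is exactly the flow described above.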

Since that first release in 2006, the project has been developed under the Apache Software Foundation. From Hadoop v1 to v3, it has evolved considerably, with many new modules introduced along the way.

HADOOP v1:

Hadoop v1 consists mainly of the MapReduce processing model and HDFS.
MapReduce has two components, the JobTracker and the TaskTracker. The JobTracker oversees the state of the TaskTrackers and allocates jobs/tasks to them, whereas a TaskTracker simply executes the tasks assigned to it by the JobTracker and reports their status back to it.

In v1, MapReduce is responsible for both processing and resource management in the cluster. A v1 cluster is limited to about 4,000 nodes and can only run batch applications; Hadoop did not support real-time or streaming applications at the time, and it was available only on Linux.

HDFS likewise has two components, the NameNode and the DataNode. The NameNode runs on the master node and holds the metadata: information about the files and where their blocks are stored. In v1 there is only a single NameNode managing the namespace, so it is a single point of failure. DataNodes run on the slave nodes and store the actual data blocks.
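As a rough illustration of this split between metadata and data, the following sketch (again assuming the org.apache.hadoop.fs API and a hypothetical file that already exists in HDFS) asks the NameNode which DataNodes hold the blocks of a file, without reading the data itself:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file assumed to already exist in HDFS.
            Path file = new Path("/demo/hello.txt");
            FileStatus status = fs.getFileStatus(file);

            // Pure metadata served by the NameNode: block offsets and the
            // DataNodes that hold each block.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }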

HADOOP v2:

As of today, Hadoop v2 is the latest stable version available.
Hadoop 2 introduces the possibility of running other computing models, such as Spark, on the same cluster. It also introduces YARN, so MapReduce no longer has to handle everything: YARN takes over cluster resource management and leaves the processing to the various processing models. YARN works on the concept of containers. Resources are divided among containers on an as-needed basis, and containers are created and killed per task. YARN's ResourceManager has two components, the Scheduler and the ApplicationsManager. The Scheduler allocates the required resources to applications, while the ApplicationsManager accepts job submissions and launches a per-application ApplicationMaster that monitors the application's progress.
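To get a feel for how nodes and applications appear to YARN, here is a hedged sketch that uses the org.apache.hadoop.yarn.client.api.YarnClient API to list the cluster's NodeManagers and the applications the ResourceManager currently knows about; it assumes a running cluster and default client configuration:

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterOverview {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // NodeManagers that can host containers, with their capacities.
            for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId() + " capability: "
                        + node.getCapability());
            }

            // Applications the ResourceManager currently knows about.
            for (ApplicationReport app : yarnClient.getApplications()) {
                System.out.println(app.getApplicationId() + " ("
                        + app.getApplicationType() + "): "
                        + app.getYarnApplicationState());
            }

            yarnClient.stop();
        }
    }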

Instead of the roughly 4,000-node scalability limit of Hadoop v1, Hadoop v2 can scale up to 10,000 nodes per cluster. In v2, multiple NameNodes can be used, each managing its own namespace, and standby NameNodes can take over if the active one fails. It doesn't just support batch processing: real-time operations, event processing, and streaming jobs are also possible. V2 also adds support for Windows.

HADOOP v3:

Hadoop v3 is available now but isn’t a stable release yet. It adds numerous enhancements over v2.

In v2, the minimum Java requirement was version 7; Hadoop v3 raises it to version 8.

V3 introduces Erasure Coding, which reduces the standard 3x storage overhead of replication. Rather than replicating the entire data, it stores parity blocks, much like RAID, which can be used to reconstruct lost data. With an RS(6,3) policy such as the built-in RS-6-3-1024k, for example, 6 data blocks are protected by 3 parity blocks, giving a 1.5x overhead instead of 3x while still tolerating the loss of any 3 blocks. Erasure coding is generally used for less frequently accessed data, since it adds extra network and CPU overhead during reconstruction. It is optional, and users can choose whether to deploy it or not.
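As a sketch of how this can be switched on per directory (assuming Hadoop 3's DistributedFileSystem API, an HDFS default file system, and that the RS-6-3-1024k policy has been enabled on the cluster), the snippet below marks a directory so that files written under it are erasure coded instead of triple-replicated:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ErasureCodingExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.defaultFS points at an HDFS cluster.
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(conf);

            // Hypothetical directory holding cold, rarely accessed data.
            Path coldData = new Path("/archive/cold");
            dfs.mkdirs(coldData);

            // Files written under this directory are stored as 6 data blocks
            // plus 3 parity blocks (1.5x overhead) instead of 3 full replicas.
            dfs.setErasureCodingPolicy(coldData, "RS-6-3-1024k");

            System.out.println("Policy now in effect: "
                    + dfs.getErasureCodingPolicy(coldData));
            dfs.close();
        }
    }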

V3 also introduces YARN Timeline Service v.2, which tackles the reliability and scaling issues of the earlier timeline service and introduces flows and aggregation for enhanced usability.

V3 also supports scheduling of additional resource types, such as disks and GPUs, for better integration with containers, deep learning, and machine learning workloads. This gives the data science and AI communities a great opportunity to run their computations in much less time.

V3 also introduces intra-node disk balancing, which evens out the data across the disks within a single DataNode rather than across the cluster.

Between Hadoop v2 and v3, many other tools such as Spark and Flink have come into the picture; they use different processing models from MapReduce and process data faster in some cases. You can read about many of these components on CloudSigma's blog and learn how to make the best of those tools. You can start by installing Hadoop v2 on CloudSigma's machines within minutes and following some of the tutorials.