Before learning about Hadoop, let’s discuss some terminology that will help us understand why Hadoop came into the data picture. We are dealing with a huge amount of data that is being generated at an enormous rate: for example, 87,432 emails arriving in inboxes per second, 48,329 tweets, 23,483 images uploaded to Instagram, and so on. This massive data generation is big data. To handle it, we need Hadoop, a framework that itself consists of multiple components which together are called the Hadoop ecosystem.
• Data on Instagram, YouTube, Facebook, etc. is mostly in unstructured formats, while earlier systems were designed to handle only structured data.
• In traditional databases, one needs to add more memory, processing power, etc. to a single machine as requirements grow, and this can be costly.
• Data is stored across different schemas and tables, so fetching all of it together and analyzing it can be quite a difficult task.
This is where Hadoop comes in.
Hadoop is an open-source framework able to handle big data. It does so in a distributed environment: a cluster of machines that work together closely to give the impression of a single big machine.
• It is very scalable, as data is handled in a distributed manner.
• Unlike RDBMSs, which offer vertical scaling, Hadoop scales horizontally.
• It keeps replicas of the data on multiple nodes, which makes it fault tolerant.
• Hadoop deals with all types of data, be it structured, semi-structured, or unstructured. This is very important, as today’s data has no defined format.
• The cluster nodes are ordinary commodity hardware, so it is economical.
HDFS stands for Hadoop Distributed File System. It is the actual storage layer of Hadoop and stores data in files. Every file is divided into 128 MB blocks (the size is configurable), and these blocks are stored on different machines.
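To make the block splitting concrete, here is a minimal sketch (the 1 GB file size is just an assumed example) of how many 128 MB blocks a file occupies:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file of the given size is split into."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# Example: a 1 GB file is split into 8 blocks of 128 MB each,
# and those blocks are spread across different machines.
print(num_blocks(1 * 1024 * 1024 * 1024))  # -> 8
```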
HDFS follows a master-slave structure.
The name node, also called the master node, is a single node per cluster. It holds all the metadata, including the locations of all the blocks in the cluster.
Data nodes, also called slave nodes, are many per cluster. A data node stores the actual data blocks and serves them when they are needed.
YARN (Yet Another Resource Negotiator) supervises the resources in the cluster and manages the applications running over Hadoop. With the help of YARN, data in HDFS can be processed by various processing engines, such as interactive processing, batch processing, etc. YARN basically increases efficiency.
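As a rough illustration, the sketch below queries the ResourceManager’s REST API to see the resources and applications YARN is supervising. The host name is a placeholder and port 8088 is only the usual default, so treat this as an assumption rather than a guaranteed setup.

```python
import requests

# Placeholder host; 8088 is the ResourceManager's usual default web port.
RM = "http://resourcemanager-host:8088"

# Cluster-wide resource metrics: memory, vcores, running applications, etc.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()
print(metrics["clusterMetrics"]["appsRunning"])

# Applications currently managed by YARN (MapReduce, Spark, and other engines).
apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
print(apps)
```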
Hadoop processes big data with the help of MapReduce, which splits one task into multiple smaller tasks and runs them on multiple machines.
There are two important phases:
Map phase: In the map phase, data is sorted, filtered, and grouped. The input data is split across multiple machines, and each map task outputs key-value pairs.
Reduce phase: The reduce task takes the output of the map phase, aggregates the data (for example, summing the values per key), and stores the result in HDFS.
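As a minimal sketch of the two phases, here is a word count in Python written in the Hadoop Streaming spirit: the mapper emits (word, 1) key-value pairs and the reducer sums them per key. Running both phases locally in one script is purely for illustration; on a real cluster each phase runs on many machines.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: turn input lines into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts per key (input must be sorted by key)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation: in Hadoop, the framework sorts and shuffles
    # between the map and reduce phases and writes the result to HDFS.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```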
Pig deals with large volumes of data and is used when writing MapReduce functions by hand would be a problem.
It has two parts:
Pig Latin: Pig Latin is a scripting language similar to SQL.
Pig Engine: Pig Latin runs on an execution engine, the Pig Engine. Code written in Pig Latin is converted into MapReduce jobs, which makes life easier for programmers who are not comfortable with Java.
Sqoop transfers data from relational databases into HDFS and vice versa. It works with most databases, such as SQLite, PostgreSQL, MySQL, etc.
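For illustration, a Sqoop import is usually launched as a command line; the sketch below builds one from Python. The connection string, credentials, table name, and target directory are hypothetical placeholders.

```python
import subprocess

# All connection details below are placeholders for illustration only.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",    # source relational database
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",   # keep credentials out of the command
    "--table", "orders",                        # table to copy
    "--target-dir", "/data/raw/orders",         # destination directory in HDFS
]
subprocess.run(cmd, check=True)
```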
HBase is a NoSQL database that sits on top of HDFS and allows it to handle any type of data. It supports random read/write operations on data and some real-time processing.
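A small sketch of reading and writing HBase from Python, assuming the third-party happybase client and an HBase Thrift server running locally; the table and column names are made up for the example.

```python
import happybase  # third-party client; assumes an HBase Thrift server is running

connection = happybase.Connection("localhost")
table = connection.table("users")  # hypothetical table with column family "info"

# Write a row: HBase stores cells under column families.
table.put(b"user-001", {b"info:name": b"Alice", b"info:city": b"Pune"})

# Random read by row key -- the real-time access HBase adds on top of HDFS.
print(table.row(b"user-001"))
```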
Kafka connects the applications that generate data with the ones that consume it; in effect, it acts as a bridge between the producers and consumers of data.
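A minimal producer/consumer sketch using the kafka-python package, assuming a broker on localhost:9092; the topic name and message are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer  # third-party kafka-python package

# Producer side: an application generating data publishes to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": "u1", "page": "/home"}')
producer.flush()

# Consumer side: a downstream application (for example, one loading into HDFS)
# reads the same topic back.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no more messages arrive
)
for message in consumer:
    print(message.value)
```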
Oozie is the scheduler that runs jobs written on several platforms such as Hive, Pig, MapReduce, etc. With the help of Oozie, a pipeline of jobs that need to execute sequentially or in parallel can be created. Oozie is often used to perform ETL operations and store the results in HDFS.
With all these components, it can be difficult to see how they fit into the overall structure of Hadoop. Let’s break it down.
Sqoop and Kafka are used to ingest data from various sources into HDFS.
The storage unit of Hadoop is HDFS, where the data from HBase is also stored.
To perform various tasks simultaneously on different machines, MapReduce is used.
Pig and Hive are used to analyze the data.
Oozie is used as a scheduler for jobs.
ZooKeeper synchronizes the different machines and is used at almost every stage of processing.
Cost effective: It is open source and runs on commodity hardware that is cheap and easy to buy.
Fault tolerant: It has very high fault tolerance compared to Spark or Flink, as data in Hadoop is replicated across multiple nodes and a replica can be used in case of any issue.
Scalable: It is easier to scale than Spark, which depends heavily on RAM for computations; in Hadoop, you just add nodes and more disks for more storage.
Security: If we compare the security of Spark and Hadoop, Hadoop wins here too, as it supports different access-control and authentication methods. Spark can reach a good level of security when it is integrated with Hadoop.