This Edureka "What is Hadoop" tutorial (check our Hadoop blog series here: https://goo.gl/lQKjL8) will help you understand the basics of Hadoop. Learn in detail how the traditional way of storing and processing data differs from the Hadoop way. Below are the topics covered in this tutorial:
1) Traditional Way of Processing - SEARS
2) Big Data Growth Drivers
3) Problem Associated with Big Data
4) Hadoop: Solution to Big Data Problem
5) What is Hadoop?
6) HDFS
7) MapReduce
8) Hadoop Ecosystem
9) Demo: Hadoop Case Study - Orbitz
Subscribe to our channel to get updates.
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
Agenda for Today
❖ Traditional Way of Processing
❖ Big Data Growth Drivers
❖ Problem Associated with Big Data
❖ Hadoop: Solution to Big Data Problem
❖ What is Hadoop?
❖ HDFS
❖ MapReduce
❖ Hadoop Ecosystem
❖ Hadoop Case Study - Orbitz
Hadoop Case Study – Sears (Traditional Way of Processing)
(Diagram: instrumentation and collection feed a storage-only grid holding the original raw data; an ETL compute grid loads aggregated data into an RDBMS, which serves BI reports and interactive apps. The flow is mostly append; the grids are split into storage and processing.)
Pain points of the traditional setup:
▪ Premature data death: 90% of the ~2 PB of data is archived and never analyzed
▪ A meagre 10% of the ~2 PB of data is available for BI
▪ Moving data to compute doesn’t scale
▪ The original high-fidelity raw data can’t be explored
IoT: 50 Billion Devices by 2020
▪ Rapid adoption rate of digital infrastructure: 5x faster than electricity & telephony
▪ Tablets, laptops, and phones: roughly “~6 things online” per person
▪ Sensors, smart objects, and clustered device systems push the count toward 50 billion smart objects
(Chart: smart objects vs. world population, with an inflection point where connected devices overtake people.)

Year: 2003 / 2008 / 2010 / 2015 / 2020
World population (billions): 6.307 / 6.721 / 6.894 / 7.347 / 7.83
What is Big Data?
“23 exabytes of information was recorded and replicated in 2002. We now record and transfer that much information every 7 days.”
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
Big Data Problems
1. Storing huge and exponentially growing datasets:
▪ More data has been generated in the past two years than in the entire previous history of the human race
▪ By 2020, total digital data will grow to approximately 44 zettabytes (44 trillion gigabytes)
▪ By 2020, about 1.7 MB of new information will be created every second for every person
Big Data Problems
2. Processing data having complex structure:

Structured:
▪ Organized data format
▪ Fixed data schema
▪ Ex: RDBMS data, etc.

Semi-structured:
▪ Partially organized data
▪ Lacks the formal structure of a data model
▪ Ex: XML & JSON files, etc.

Unstructured:
▪ Unorganized data
▪ Unknown schema
▪ Ex: multimedia files, etc.
Big Data Problems
3. Processing data faster:
➢ Data is growing at a much faster rate than disk read/write speeds (source: Tom’s Hardware)
➢ Bringing huge amounts of data to the computation unit becomes a bottleneck
(Diagram: a master node pulling data from slaves A–E for centralized processing.)
DFS – Distributed File System
▪ Stores and manages data, i.e. files or folders, across multiple computers or servers
▪ Provides the abstraction of a single, large file system
Before DFS:
▪ Server1/accounts
▪ Server2/finance
▪ Server3/customer
▪ Server4/reports
After DFS consolidation:
▪ Edureka/accounts
▪ Edureka/finance
▪ Edureka/customer
▪ Edureka/reports
What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
A Hadoop cluster (one master plus slaves) has two core components:
▪ HDFS (storage): lets you dump any kind of data across the cluster
▪ MapReduce (processing): allows parallel processing of the data stored in HDFS
Hadoop to the Rescue
Problem 1: Storing huge and exponentially growing datasets
Solution: HDFS
▪ The storage unit of Hadoop
▪ A distributed file system
▪ Divides files (input data) into smaller blocks and stores them across the cluster
▪ Scales as per requirement
(Diagram: a 512 MB file is divided into four 128 MB blocks.)
Hadoop to the Rescue
Problem 2: Storing unstructured data
Solution: HDFS
▪ Allows you to store any kind of data, be it structured, semi-structured, or unstructured
▪ Follows WORM (Write Once, Read Many)
▪ No schema validation is done while dumping data
(Diagram: data is written to HDFS once and read many times.)
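To make the storage side concrete, here is a minimal sketch of the write-once/read-many pattern using Hadoop's Java FileSystem API. The NameNode URI and the /edureka path are invented for illustration, not taken from the slides.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWormDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address, for illustration only.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/edureka/demo.txt");

        // Write once: no schema validation happens while dumping data.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("any kind of data: structured, semi-structured or unstructured");
        }

        // Read many: the same file can be read back any number of times.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}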
Hadoop to the Rescue
Problem 3: Processing data faster
Solution: Hadoop MapReduce
▪ Provides parallel processing of the data present in HDFS
▪ Processes data locally, i.e. each node works on the part of the data stored on it
(Diagram: processing the file on a single machine takes 4 hours, while processing it in parallel across the cluster takes about 1 hour.)
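To make "processing data locally" concrete, here is the canonical WordCount example written against the Hadoop MapReduce Java API (the standard illustration, not an example from the slides): each mapper runs on the node that stores its block of the file and emits partial results, which reducers then aggregate.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs on each node against the HDFS blocks stored locally,
// emitting (word, 1) for every word it sees.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: aggregates the partial counts produced in parallel.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}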
HDFS Components
NameNode
▪ Master daemon
▪ Maintains and manages the DataNodes
▪ Records metadata, e.g. the location of stored blocks, file sizes, permissions, hierarchy, etc.
▪ Receives heartbeats and block reports from all the DataNodes
DataNode
▪ Slave daemon
▪ Stores the actual data
▪ Serves read and write requests from the clients
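The metadata the NameNode keeps is what clients consult to find data. A minimal sketch, with an invented cluster URI and file path, that asks which DataNodes hold each block of a file:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/edureka/example.txt"));

        // The NameNode answers this from its metadata: for each block,
        // which DataNodes hold a replica.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " length " + block.getLength()
                + " hosts " + String.join(",", block.getHosts()));
        }
    }
}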
HDFS Blocks
• Each file is stored on HDFS as blocks
• The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x)
• Say I have a file example.txt of size 248 MB: it is stored as one 128 MB block plus one 120 MB block, since the last block only occupies as much space as the data it holds
How many blocks will be created if a file of size 514 MB is copied to HDFS?
Answer: five blocks: four full 128 MB blocks (512 MB) plus one final block of just 2 MB.
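The arithmetic behind the answer is just ceiling division. A tiny hypothetical Java helper (not part of any Hadoop API):

// Hypothetical helper: ceiling division gives the HDFS block count.
static long blockCount(long fileSizeBytes, long blockSizeBytes) {
    return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
}
// blockCount(514L * 1024 * 1024, 128L * 1024 * 1024) == 5
// (four full 128 MB blocks + one 2 MB block)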
YARN Components
ResourceManager:
▪ Cluster-level resource manager
▪ Long-lived, runs on high-quality hardware
NodeManager:
▪ One per DataNode
▪ Monitors resources on the DataNode
ApplicationMaster:
▪ One per application
▪ Short-lived
▪ Coordinates and manages MapReduce jobs
▪ Negotiates with the ResourceManager to schedule tasks
Container:
▪ Created by the NodeManager when requested
▪ Allocates a certain amount of resources (memory, CPU, etc.) on a slave node
MapReduce Application Workflow
Execution Sequence:
1. Client submits an application
2. RM allocates a container to start the AM
3. AM registers with the RM
4. AM asks the RM for containers
5. AM notifies the NMs to launch the containers
6. Application code is executed in the containers
7. Client contacts the RM/AM to monitor the application’s status
8. AM unregisters with the RM
(Diagram: Client, ResourceManager (RM), NodeManagers (NM), and ApplicationMaster (AM), with arrows numbered 1–8 matching the steps above.)
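Step 1, "client submits an application", corresponds to a small driver program. Here is a sketch that wires the WordCount mapper and reducer from earlier into a Job and submits it; the input and output paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/edureka/input"));
        FileOutputFormat.setOutputPath(job, new Path("/edureka/output"));

        // Submitting the job kicks off the whole sequence above: the RM
        // starts an AM, the AM asks for containers, and the map/reduce
        // tasks run inside those containers until completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}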
Hadoop Case Study – Orbitz Worldwide
Challenges:
▪ The existing data infrastructure could not store and process the data generated by users every day
▪ Enhancing the existing infrastructure was very expensive
(Diagram: users on orbitz.com generate 1.5 million flight and 1 million hotel searches every day, producing 500 GB of log data per day that must be processed into the warehouse.)
Hadoop Case Study – Orbitz Worldwide
Requirement:
▪ An efficient, long-term storage system that can store any kind of data
▪ An analytical tool for making important business decisions
▪ Cost-effectiveness
Solution: Apache Hadoop
▪ An open-source framework used to store and process huge data sets
▪ Easily scalable as per need
▪ Comes with various analytical tools
Hadoop Case Study – Orbitz Worldwide
Apache Hive is a data warehousing tool on top of Hadoop that allows you to perform analytics on huge datasets using HiveQL, a query language very similar to SQL.
(Diagram: Hive compiles HiveQL queries into MapReduce jobs that run over data stored in HDFS.)
Example: given a cust_details table

Name    Age
John    24
Mike    37
Ashley  29

the query

SELECT name FROM cust_details;

runs as a MapReduce job and returns

Name
John
Mike
Ashley
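One common way to run such a query programmatically is through Hive's JDBC driver (HiveServer2). A minimal sketch; the host, port, and database are illustrative, and cust_details is the slide's example table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2's JDBC driver; the host and port are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = con.createStatement();
             // Hive compiles this into a MapReduce job behind the scenes.
             ResultSet rs = stmt.executeQuery("SELECT name FROM cust_details")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}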
Hadoop Case Study – Orbitz Worldwide
(Diagram: Orbitz’s Hadoop pipeline.) A large amount of unstructured log data is generated every day → HDFS can store any type of data → MapReduce can process the data in parallel, faster → the output is structured data → Hive Query Language is used to query it → a query analyzes each hotel’s position in the search bar using the log data → the result is an analytical report.
Hadoop Case Study – Orbitz Worldwide
Types of Website Logs
1. Impression List:
▪ Contains the ranking of each hotel in the search bar, along with the session ID of the visitor who clicked on it
▪ Format of the Impression List:
(session_id, hotel_id, position, rate)
Hadoop Case Study – Orbitz Worldwide
2. WebTrends Log:
▪ Contains the details of customers who booked a hotel on the website
▪ Format of the WebTrends Log:
(session_id, visitors_ip, hotel_id, booking_date, number_of_guests, booking_time)
Hadoop Case Study – Orbitz Worldwide
Query for analyzing a hotel’s position in the search bar on the website. Steps:
▪ Clean the website log data using MapReduce (a sketch follows this slide)
▪ Load the cleaned data into Hive
▪ Compare the ranking of a hotel in the search list with its booking frequency using a Hive query
Pipeline: Website Log (uncleaned) → MapReduce (data cleaning) → HDFS → Hive (analytics)
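As a sketch of the first step, a map-only job can keep well-formed impression-list records and drop the rest. The comma delimiter and field checks are assumptions for illustration; the field layout follows the slide's (session_id, hotel_id, position, rate) format:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only cleaning job: emit only records that parse as
// (session_id, hotel_id, position, rate); drop malformed lines.
public class ImpressionCleanerMapper
        extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");   // assumed CSV layout
        if (fields.length != 4) return;                  // wrong arity: drop
        try {
            Integer.parseInt(fields[2].trim());          // position must be numeric
            Double.parseDouble(fields[3].trim());        // rate must be numeric
        } catch (NumberFormatException e) {
            return;                                      // malformed: drop
        }
        context.write(value, NullWritable.get());        // keep the clean record
    }
}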
Hadoop Case Study – Orbitz Worldwide
Accomplishments with Hadoop:
▪ The performance of the previous methodology can be compared directly with the Hadoop implementation
▪ Months’ worth of data is archived easily
▪ The earlier process took 109m 14s to extract and process the logs, whereas the MapReduce process took only 25m 58s
▪ Various metrics for analytics, previously tedious to compute, are now easily derived