This Edureka "What is Hadoop" tutorial (check our Hadoop blog series here: https://goo.gl/lQKjL8) will help you understand the basics of Hadoop. Learn in detail how the traditional way of storing and processing data differs from the Hadoop way. Below are the topics covered in this tutorial:
1) Traditional Way of Processing - SEARS
2) Big Data Growth Drivers
3) Problem Associated with Big Data
4) Hadoop: Solution to Big Data Problem
5) What is Hadoop?
6) HDFS
7) MapReduce
8) Hadoop Ecosystem
9) Demo: Hadoop Case Study - Orbitz
Subscribe to our channel to get updates.
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
Agenda for Today
❖ Traditional Way of Processing
❖ Big Data Growth Drivers
❖ Problem Associated with Big Data
❖ Hadoop: Solution to Big Data Problem
❖ What is Hadoop?
❖ HDFS
❖ MapReduce
❖ Hadoop Ecosystem
❖ Hadoop Case Study - Orbitz
Hadoop Case Study – Sears (Traditional Way of Processing)
(Diagram: instrumentation and collection feed a storage-only grid holding the original raw data; an ETL compute grid loads aggregated data into an RDBMS, which serves BI reports and interactive apps. The flow is mostly append; the grids are split into storage and processing.)
Pain points of the traditional setup:
▪ Premature data death: 90% of the ~2 PB of data is archived and never analyzed
▪ A meagre 10% of the ~2 PB of data is available for BI
▪ Moving data to compute doesn’t scale
▪ The original high-fidelity raw data can’t be explored
IoT: 50 Billion Devices by 2020
▪ Rapid adoption rate of digital infrastructure: 5x faster than electricity & telephony
▪ Tablets, laptops, and phones: roughly “~6 things online” per person
▪ Sensors, smart objects, and clustered device systems push the count toward 50 billion smart objects
(Chart: smart objects vs. world population, with an inflection point where connected devices overtake people.)

Year: 2003 / 2008 / 2010 / 2015 / 2020
World population (billions): 6.307 / 6.721 / 6.894 / 7.347 / 7.83
What is Big Data?
“23 exabytes of information was recorded and replicated in 2002. We now record and transfer that much information every 7 days.”
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
Big Data Problems
1. Storing huge and exponentially growing datasets:
▪ More data has been generated in the past two years than in the entire previous history of the human race
▪ By 2020, total digital data will grow to approximately 44 zettabytes (44 trillion gigabytes)
▪ By 2020, about 1.7 MB of new information will be created every second for every person
Big Data Problems
2. Processing data having complex structure:

Structured:
▪ Organized data format
▪ Fixed data schema
▪ Ex: RDBMS data, etc.

Semi-structured:
▪ Partially organized data
▪ Lacks the formal structure of a data model
▪ Ex: XML & JSON files, etc.

Unstructured:
▪ Unorganized data
▪ Unknown schema
▪ Ex: multimedia files, etc.
Big Data Problems
3. Processing data faster:
➢ Data is growing at a much faster rate than disk read/write speeds (source: Tom’s Hardware)
➢ Bringing huge amounts of data to the computation unit becomes a bottleneck
(Diagram: a master node pulling data from slaves A–E for centralized processing.)
DFS – Distributed File System
▪ Stores and manages data, i.e. files or folders, across multiple computers or servers
▪ Provides the abstraction of a single, large file system
Before DFS:
▪ Server1/accounts
▪ Server2/finance
▪ Server3/customer
▪ Server4/reports
After DFS consolidation:
▪ Edureka/accounts
▪ Edureka/finance
▪ Edureka/customer
▪ Edureka/reports
What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
A Hadoop cluster (one master plus slaves) has two core components:
▪ HDFS (storage): lets you dump any kind of data across the cluster
▪ MapReduce (processing): allows parallel processing of the data stored in HDFS
Hadoop to the Rescue
Problem 1: Storing huge and exponentially growing datasets
Solution: HDFS
▪ The storage unit of Hadoop
▪ A distributed file system
▪ Divides files (input data) into smaller blocks and stores them across the cluster
▪ Scales as per requirement
(Diagram: a 512 MB file is divided into four 128 MB blocks.)
Hadoop to the Rescue
Problem 2: Storing unstructured data
Solution: HDFS
▪ Allows you to store any kind of data, be it structured, semi-structured, or unstructured
▪ Follows WORM (Write Once, Read Many)
▪ No schema validation is done while dumping data
(Diagram: data is written to HDFS once and read many times.)
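To make the storage side concrete, here is a minimal sketch of the write-once/read-many pattern using Hadoop's Java FileSystem API. The NameNode URI and the /edureka path are invented for illustration, not taken from the slides.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWormDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address, for illustration only.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/edureka/demo.txt");

        // Write once: no schema validation happens while dumping data.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("any kind of data: structured, semi-structured or unstructured");
        }

        // Read many: the same file can be read back any number of times.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}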
Hadoop to the Rescue
Problem 3: Processing data faster
Solution: Hadoop MapReduce
▪ Provides parallel processing of the data present in HDFS
▪ Processes data locally, i.e. each node works on the part of the data stored on it
(Diagram: processing the file on a single machine takes 4 hours, while processing it in parallel across the cluster takes about 1 hour.)
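To make "processing data locally" concrete, here is the canonical WordCount example written against the Hadoop MapReduce Java API (the standard illustration, not an example from the slides): each mapper runs on the node that stores its block of the file and emits partial results, which reducers then aggregate.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs on each node against the HDFS blocks stored locally,
// emitting (word, 1) for every word it sees.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: aggregates the partial counts produced in parallel.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}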
HDFS Components
NameNode
▪ Master daemon
▪ Maintains and manages the DataNodes
▪ Records metadata, e.g. the location of stored blocks, file sizes, permissions, hierarchy, etc.
▪ Receives heartbeats and block reports from all the DataNodes
DataNode
▪ Slave daemon
▪ Stores the actual data
▪ Serves read and write requests from the clients
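The metadata the NameNode keeps is what clients consult to find data. A minimal sketch, with an invented cluster URI and file path, that asks which DataNodes hold each block of a file:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/edureka/example.txt"));

        // The NameNode answers this from its metadata: for each block,
        // which DataNodes hold a replica.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " length " + block.getLength()
                + " hosts " + String.join(",", block.getHosts()));
        }
    }
}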
HDFS Blocks
• Each file is stored on HDFS as blocks
• The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x)
• Say I have a file example.txt of size 248 MB: it is stored as one 128 MB block plus one 120 MB block, since the last block only occupies as much space as the data it holds
How many blocks will be created if a file of size 514 MB is copied to HDFS?
Answer: five blocks: four full 128 MB blocks (512 MB) plus one final block of just 2 MB.
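The arithmetic behind the answer is just ceiling division. A tiny hypothetical Java helper (not part of any Hadoop API):

// Hypothetical helper: ceiling division gives the HDFS block count.
static long blockCount(long fileSizeBytes, long blockSizeBytes) {
    return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
}
// blockCount(514L * 1024 * 1024, 128L * 1024 * 1024) == 5
// (four full 128 MB blocks + one 2 MB block)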
YARN Components
ResourceManager:
▪ Cluster-level resource manager
▪ Long-lived, runs on high-quality hardware
NodeManager:
▪ One per DataNode
▪ Monitors resources on the DataNode
ApplicationMaster:
▪ One per application
▪ Short-lived
▪ Coordinates and manages MapReduce jobs
▪ Negotiates with the ResourceManager to schedule tasks
Container:
▪ Created by the NodeManager when requested
▪ Allocates a certain amount of resources (memory, CPU, etc.) on a slave node
MapReduce Application Workflow
Execution Sequence:
1. Client submits an application
2. RM allocates a container to start the AM
3. AM registers with the RM
4. AM asks the RM for containers
5. AM notifies the NMs to launch the containers
6. Application code is executed in the containers
7. Client contacts the RM/AM to monitor the application’s status
8. AM unregisters with the RM
(Diagram: Client, ResourceManager (RM), NodeManagers (NM), and ApplicationMaster (AM), with arrows numbered 1–8 matching the steps above.)
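Step 1, "client submits an application", corresponds to a small driver program. Here is a sketch that wires the WordCount mapper and reducer from earlier into a Job and submits it; the input and output paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/edureka/input"));
        FileOutputFormat.setOutputPath(job, new Path("/edureka/output"));

        // Submitting the job kicks off the whole sequence above: the RM
        // starts an AM, the AM asks for containers, and the map/reduce
        // tasks run inside those containers until completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}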
Hadoop Case Study – Orbitz Worldwide
Challenges:
▪ The existing data infrastructure could not store and process the data generated by users every day
▪ Enhancing the existing infrastructure was very expensive
(Diagram: users on orbitz.com generate 1.5 million flight and 1 million hotel searches every day, producing 500 GB of log data per day that must be processed into the warehouse.)
Hadoop Case Study – Orbitz Worldwide
Requirement:
▪ An efficient, long-term storage system that can store any kind of data
▪ An analytical tool for making important business decisions
▪ Cost-effectiveness
Solution: Apache Hadoop
▪ An open-source framework used to store and process huge data sets
▪ Easily scalable as per need
▪ Comes with various analytical tools
Hadoop Case Study – Orbitz Worldwide
Apache Hive is a data warehousing tool on top of Hadoop that allows you to perform analytics on huge datasets using HiveQL, a query language very similar to SQL.
(Diagram: Hive compiles HiveQL queries into MapReduce jobs that run over data stored in HDFS.)
Example: given a cust_details table

Name    Age
John    24
Mike    37
Ashley  29

the query

SELECT name FROM cust_details;

runs as a MapReduce job and returns

Name
John
Mike
Ashley
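One common way to run such a query programmatically is through Hive's JDBC driver (HiveServer2). A minimal sketch; the host, port, and database are illustrative, and cust_details is the slide's example table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2's JDBC driver; the host and port are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = con.createStatement();
             // Hive compiles this into a MapReduce job behind the scenes.
             ResultSet rs = stmt.executeQuery("SELECT name FROM cust_details")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}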
Hadoop Case Study – Orbitz Worldwide
(Diagram: Orbitz’s Hadoop pipeline.) A large amount of unstructured log data is generated every day → HDFS can store any type of data → MapReduce can process the data in parallel, faster → the output is structured data → Hive Query Language is used to query it → a query analyzes each hotel’s position in the search bar using the log data → the result is an analytical report.
Hadoop Case Study – Orbitz Worldwide
Types of Website Logs
1. Impression List:
▪ Contains the ranking of each hotel in the search bar, along with the session ID of the visitor who clicked on it
▪ Format of the Impression List:
(session_id, hotel_id, position, rate)
Hadoop Case Study – Orbitz Worldwide
2. WebTrends Log:
▪ Contains the details of customers who booked a hotel on the website
▪ Format of the WebTrends Log:
(session_id, visitors_ip, hotel_id, booking_date, number_of_guests, booking_time)
Hadoop Case Study – Orbitz Worldwide
Query for analyzing a hotel’s position in the search bar on the website. Steps:
▪ Clean the website log data using MapReduce (a sketch follows this slide)
▪ Load the cleaned data into Hive
▪ Compare the ranking of a hotel in the search list with its booking frequency using a Hive query
Pipeline: Website Log (uncleaned) → MapReduce (data cleaning) → HDFS → Hive (analytics)
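As a sketch of the first step, a map-only job can keep well-formed impression-list records and drop the rest. The comma delimiter and field checks are assumptions for illustration; the field layout follows the slide's (session_id, hotel_id, position, rate) format:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only cleaning job: emit only records that parse as
// (session_id, hotel_id, position, rate); drop malformed lines.
public class ImpressionCleanerMapper
        extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");   // assumed CSV layout
        if (fields.length != 4) return;                  // wrong arity: drop
        try {
            Integer.parseInt(fields[2].trim());          // position must be numeric
            Double.parseDouble(fields[3].trim());        // rate must be numeric
        } catch (NumberFormatException e) {
            return;                                      // malformed: drop
        }
        context.write(value, NullWritable.get());        // keep the clean record
    }
}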
Hadoop Case Study – Orbitz Worldwide
Accomplishments with Hadoop:
▪ The performance of the previous methodology can be compared directly with the Hadoop implementation
▪ Months’ worth of data is archived easily
▪ The earlier process took 109m 14s to extract and process the logs, whereas the MapReduce process took only 25m 58s
▪ Various metrics for analytics, previously tedious to compute, are now easily derived