1. Introduction to Hadoop Administration
View Hadoop Administration course details at www.edureka.co/hadoop-admin
2. LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module-wise Quizzes
Project Work on a Large Database
Verifiable Certificate
www.edureka.co/hadoop-admin | Slide 2 | Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
How it Works?
3. Objectives of this Session
At the end of this module, you will be able to:
» Understand how Hadoop overcame the limitations of traditional technologies
» Understand the key responsibilities of a Hadoop Administrator
» Understand Hadoop Federation and High Availability
» Understand Hadoop Cluster Modes
» Set up a Hadoop Cluster
» Commission and decommission a DataNode
4. What is Big Data?
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Example: the stock market generates about one terabyte of new trade data per day, which is analysed to determine trends for optimal trades.
5. IBM's Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
The three Vs: VOLUME, VELOCITY, VARIETY
Examples of variety: web logs, images, videos, audio, sensor data
6. Limitations of Existing Data Analytics Architecture
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
[Diagram] Pipeline: Instrumentation → Collection → Storage-only Grid (original raw data, mostly append) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps
1. Can't explore original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
90% of the ~2PB is archived; a meagre 10% of the ~2PB of data is available for BI.
7. Solution: A Combined Storage and Compute Layer
[Diagram] Pipeline: Instrumentation → Collection → Hadoop: Storage + Compute Grid (mostly append; both storage and processing) → RDBMS (aggregated data) → BI Reports + Interactive Apps
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
No data archiving: the entire ~2PB of data is available for processing.
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing non-Hadoop solutions.
9. Why Hadoop?
The Hadoop platform is designed to solve the problems posed by Big Data: both the size of data and the variety of data.
10. What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is open-source data management with scale-out storage and distributed processing.
11. Hadoop Key Characteristics
Hadoop Features: Reliable, Scalable, Flexible, Economical
14. Skills Required
» General operational expertise, such as good troubleshooting skills and an understanding of capacity planning.
» Hadoop ecosystem skills: HBase, Hive, Pig, Mahout, etc.
» Ability to deploy a Hadoop cluster and to monitor and scale critical parts of it.
» Good knowledge of Linux, as Hadoop runs on Linux.
» Familiarity with open-source configuration management and deployment tools such as Puppet or Chef, and with Linux scripting.
» Knowledge of troubleshooting core Java applications is a plus.
15. Hadoop Admin Responsibilities
» Implementation and administration of Hadoop infrastructure.
» Testing HDFS, Hive, Pig and MapReduce access for applications.
» Cluster maintenance tasks such as backup, recovery, upgrades and patching.
» Performance tuning and capacity planning for clusters.
» Monitoring the Hadoop cluster and deploying security.
16. Hadoop 1.x and Hadoop 2.x Ecosystem
Hadoop 1.x:
» Pig Latin (Data Analysis), Hive (DW System), HBase
» MapReduce Framework
» Apache Oozie (Workflow)
» HDFS (Hadoop Distributed File System)
Hadoop 2.x:
» Pig Latin (Data Analysis), Hive (DW System), HBase, Other YARN Frameworks (MPI, GIRAPH)
» MapReduce Framework
» YARN (Cluster Resource Management)
» Apache Oozie (Workflow)
» HDFS (Hadoop Distributed File System)
Inputs: unstructured/semi-structured data (via HDFS) and structured data (via HBase).
17. Hadoop 1.x Core Components
Hadoop is a system for large-scale data processing.
2 Main Hadoop 1.x Core Components:
Storage: HDFS
» Distributed across "nodes"
» Natively redundant
» NameNode tracks locations
» Self-healing, high bandwidth
» Clustered storage
Processing: MapReduce
» Splits a task across processors "near" the data & assembles results
» JobTracker manages the TaskTrackers
Additional Administration Tools:
» Filesystem utilities
» Job scheduling and monitoring
» Web UI
18. Hadoop 2.x Core Components
Hadoop is a system for large-scale data processing.
2 Main Hadoop 2.x Core Components:
Storage: HDFS
» Highly available
» Distributed across "nodes"
» NameNode tracks locations
» Clustered storage
Processing: MapReduce NextGen / YARN / MRv2
» Splits a task across processors "near" the data & assembles results
» Resource management and job scheduling/monitoring
» Individual applications can utilize cluster resources in a shared, secure and multi-tenant manner
» Maintains API compatibility with previous stable releases of Hadoop
19. Main Components of HDFS
NameNode:
» Master of the system
» Maintains and manages the blocks present on the DataNodes
DataNodes:
» Slaves deployed on each machine, providing the actual storage
» Responsible for serving read and write requests from the clients
20. NameNode Metadata
Metadata in memory
» The entire metadata is in main memory
» No demand paging of FS metadata
Types of metadata
» List of files
» List of blocks for each file
» List of DataNodes for each block
» File attributes, e.g. access time, replication factor
A transaction log
» Records file creations, file deletions, etc.
The NameNode (stores metadata only) keeps track of the overall file directory structure and the placement of data blocks, e.g.:
METADATA:
/user/doug/hinfo -> 1 3 5
/user/doug/pdetail -> 4 2
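The metadata behaviour above is controlled through hdfs-site.xml. A minimal sketch follows; the property names are standard Hadoop 2.x, but the directory path is a placeholder you would adapt to your cluster:

```shell
# Sketch: a minimal hdfs-site.xml fragment controlling NameNode metadata.
# The directory path below is a placeholder, not a required location.
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Where the NameNode persists its fsimage and edit (transaction) log -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/namenode</value>
  </property>
  <!-- Default replication factor, recorded per file in the metadata -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF
```

Pointing dfs.namenode.name.dir at a comma-separated list of directories (e.g. a local disk plus an NFS mount) makes the NameNode write its metadata to all of them, a common safeguard.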
21. Secondary NameNode
Secondary NameNode:
» Not a hot standby for the NameNode
» Connects to the NameNode every hour*
» Housekeeping and backup of NameNode metadata
» The saved metadata can be used to rebuild a failed NameNode
("You give me metadata every hour, I will keep it safe.")
The NameNode is a single point of failure only in Hadoop 1.x, not in Hadoop 2.x.
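The hourly connection interval mentioned above is configurable. A minimal hdfs-site.xml sketch (standard Hadoop 2.x property names; the checkpoint directory is a placeholder):

```shell
# Sketch: checkpoint settings for the Secondary NameNode (placeholder path).
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- How often the Secondary NameNode checkpoints, in seconds (3600 = the "every hour" default) -->
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
  <!-- Where the Secondary NameNode stores the merged fsimage -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/data/hadoop/secondary</value>
  </property>
</configuration>
EOF
```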
22. Hadoop 2.x – In Summary
[Diagram] Two pillars: NameNode High Availability (HDFS) and Next-Generation MapReduce (YARN).
Masters: Active NameNode and Standby NameNode (sharing edit logs via shared storage OR a Journal Node), Secondary NameNode, and the Resource Manager (containing the Scheduler and the Applications Manager, AsM). Clients talk to HDFS for distributed data storage and to YARN for distributed data processing.
Slaves: each slave runs a DataNode and a Node Manager; Node Managers host Containers and per-application App Masters.
23. Hadoop 2.x Cluster Architecture - Federation
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
[Diagram] Hadoop 1.0: a single NameNode holds one namespace (NS) and one block storage layer (block management plus physical storage on the DataNodes).
Hadoop 2.0: multiple independent NameNodes (NN-1 … NN-k … NN-n), each managing its own namespace (NS1 … NSk … NSn) and its own block pool (Pool 1 … Pool k … Pool n). All block pools share common storage on the DataNodes (Datanode 1, Datanode 2 … Datanode m).
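Federation is configured by listing the nameservices and giving each NameNode its own addresses. A minimal two-NameNode sketch follows; the property names come from the Federation documentation linked above, while the hostnames and ports are placeholders:

```shell
# Sketch: HDFS Federation with two independent namespaces (hosts are placeholders).
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Two independent namespaces, each served by its own NameNode -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn-host1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>nn-host1:50070</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn-host2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>nn-host2:50070</value>
  </property>
</configuration>
EOF
```

Every DataNode registers with all NameNodes in the list and maintains one block pool per namespace, which is how the shared common storage in the diagram is realised.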
24. Hadoop 2.x – High Availability
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
[Diagram] HDFS HIGH AVAILABILITY: all namespace edits are logged by the Active NameNode to shared NFS storage, with a single writer enforced by fencing; the Standby NameNode reads the edit logs and applies them to its own namespace. DataNodes report to both NameNodes.
*When HA is configured, it is not necessary to configure a Secondary NameNode.
Next Generation MapReduce: the client submits work to the Resource Manager, which coordinates Node Managers; each Node Manager hosts Containers and a per-application App Master.
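Following the NFS-based HA document linked above, the shared-edits setup can be sketched like this. The property names are the standard Hadoop 2.x HA settings; the logical names (mycluster, nn1, nn2) and all hostnames/paths are placeholders:

```shell
# Sketch: NFS-based NameNode HA configuration (all names and paths are placeholders).
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two NameNodes: one active, one standby -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn-host1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn-host2:8020</value>
  </property>
  <!-- Shared NFS directory where the active NameNode writes edits (single writer, fencing) -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
  </property>
  <!-- How clients locate the currently active NameNode -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
EOF
```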
25. Hadoop 2.x – Resource Management
[Diagram: the same architecture as the previous slide, focusing on YARN.] The Resource Manager is the master for cluster resource management. Each slave runs a Node Manager; applications run in Containers, coordinated by a per-application App Master that negotiates resources with the Resource Manager on behalf of the client.
26. Hadoop Cluster: A Typical Use Case
NameNode:
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: redundant power supply
Secondary NameNode:
» RAM: 32 GB
» Hard disk: 1 TB
» Processor: Xeon with 4 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: redundant power supply
DataNode (per node):
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
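A back-of-envelope capacity check for the DataNode spec above: with 6 x 2 TB disks per node and the default replication factor of 3, usable capacity is roughly raw capacity divided by three. The node count below is an assumption for illustration:

```shell
# Assumed: 10 DataNodes with the per-node spec from the slide.
DATANODES=10
DISKS_PER_NODE=6
DISK_TB=2
REPLICATION=3

RAW_TB=$((DATANODES * DISKS_PER_NODE * DISK_TB))   # 10 * 6 * 2 = 120 TB raw
USABLE_TB=$((RAW_TB / REPLICATION))                # / 3 replicas = 40 TB usable
echo "raw=${RAW_TB}TB usable=${USABLE_TB}TB"
```

In practice you would also reserve headroom for intermediate MapReduce output and the OS, so the usable figure is an upper bound.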
30. Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
» No daemons; everything runs in a single JVM.
» Suitable for running MapReduce programs during development.
» Has no DFS.
Pseudo-Distributed Mode
» All Hadoop daemons run on the local machine.
Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines.
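Pseudo-distributed mode is typically enabled with two small configuration changes. A minimal sketch follows (the port and replication value are common choices for a single-machine Hadoop 2.x setup, not requirements):

```shell
# core-site.xml: point the default filesystem at an HDFS daemon on this machine
cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: only one machine, so a replication factor of 1
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```

With these files in place you would format the NameNode once (hdfs namenode -format) and start the daemons; all of them then run as separate JVMs on the one host.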