1. Introduction to Hadoop Administration
View Hadoop Administration course details at www.edureka.co/hadoop-admin
2. LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module-wise Quizzes
Project Work on a Large Database
Verifiable Certificate
www.edureka.co/hadoop-admin | Slide 2 | Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
How it Works?
3. Objectives of this Session
At the end of this module, you will be able to:
» Understand how Hadoop overcame the limitations of traditional technologies
» Understand the key responsibilities of a Hadoop Administrator
» Understand Hadoop Federation and High Availability
» Understand Hadoop Cluster Modes
» Set up a Hadoop Cluster
» Commission and decommission a DataNode
4. What is Big Data?
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Example: the stock market generates about one terabyte of new trade data per day, which is analysed to determine trends for optimal trades.
5. IBM's Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
The three Vs: VOLUME, VELOCITY, VARIETY
Examples of variety: web logs, images, videos, audio, sensor data
6. Limitations of Existing Data Analytics Architecture
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
[Diagram] Pipeline: Instrumentation → Collection → Storage-only Grid (original raw data, mostly append) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps
1. Can't explore original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
90% of the ~2PB is archived; a meagre 10% of the ~2PB of data is available for BI.
7. Solution: A Combined Storage and Compute Layer
[Diagram] Pipeline: Instrumentation → Collection → Hadoop: Storage + Compute Grid (mostly append; both storage and processing) → RDBMS (aggregated data) → BI Reports + Interactive Apps
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
No data archiving: the entire ~2PB of data is available for processing.
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing non-Hadoop solutions.
9. Why Hadoop?
The Hadoop platform is designed to solve the problems posed by Big Data: both the size of data and the variety of data.
10. What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is open-source data management with scale-out storage and distributed processing.
11. Hadoop Key Characteristics
Hadoop Features: Reliable, Scalable, Flexible, Economical
14. Skills Required
» General operational expertise, such as good troubleshooting skills and an understanding of capacity planning.
» Hadoop ecosystem skills: HBase, Hive, Pig, Mahout, etc.
» Ability to deploy a Hadoop cluster and to monitor and scale critical parts of it.
» Good knowledge of Linux, as Hadoop runs on Linux.
» Familiarity with open-source configuration management and deployment tools such as Puppet or Chef, and with Linux scripting.
» Knowledge of troubleshooting core Java applications is a plus.
15. Hadoop Admin Responsibilities
» Implementation and administration of Hadoop infrastructure.
» Testing HDFS, Hive, Pig and MapReduce access for applications.
» Cluster maintenance tasks such as backup, recovery, upgrades and patching.
» Performance tuning and capacity planning for clusters.
» Monitoring the Hadoop cluster and deploying security.
16. Hadoop 1.x and Hadoop 2.x Ecosystem
Hadoop 1.x:
» Pig Latin (Data Analysis), Hive (DW System), HBase
» MapReduce Framework
» Apache Oozie (Workflow)
» HDFS (Hadoop Distributed File System)
Hadoop 2.x:
» Pig Latin (Data Analysis), Hive (DW System), HBase, Other YARN Frameworks (MPI, GIRAPH)
» MapReduce Framework
» YARN (Cluster Resource Management)
» Apache Oozie (Workflow)
» HDFS (Hadoop Distributed File System)
Inputs: unstructured/semi-structured data (via HDFS) and structured data (via HBase).
17. Hadoop 1.x Core Components
Hadoop is a system for large-scale data processing.
2 Main Hadoop 1.x Core Components:
Storage: HDFS
» Distributed across "nodes"
» Natively redundant
» NameNode tracks locations
» Self-healing, high bandwidth
» Clustered storage
Processing: MapReduce
» Splits a task across processors "near" the data & assembles results
» JobTracker manages the TaskTrackers
Additional Administration Tools:
» Filesystem utilities
» Job scheduling and monitoring
» Web UI
18. Hadoop 2.x Core Components
Hadoop is a system for large-scale data processing.
2 Main Hadoop 2.x Core Components:
Storage: HDFS
» Highly available
» Distributed across "nodes"
» NameNode tracks locations
» Clustered storage
Processing: MapReduce NextGen / YARN / MRv2
» Splits a task across processors "near" the data & assembles results
» Resource management and job scheduling/monitoring
» Individual applications can utilize cluster resources in a shared, secure and multi-tenant manner
» Maintains API compatibility with previous stable releases of Hadoop
19. Main Components of HDFS
NameNode:
» Master of the system
» Maintains and manages the blocks present on the DataNodes
DataNodes:
» Slaves deployed on each machine, providing the actual storage
» Responsible for serving read and write requests from the clients
20. NameNode Metadata
Metadata in memory
» The entire metadata is in main memory
» No demand paging of FS metadata
Types of metadata
» List of files
» List of blocks for each file
» List of DataNodes for each block
» File attributes, e.g. access time, replication factor
A transaction log
» Records file creations, file deletions, etc.
The NameNode (stores metadata only) keeps track of the overall file directory structure and the placement of data blocks, e.g.:
METADATA:
/user/doug/hinfo -> 1 3 5
/user/doug/pdetail -> 4 2
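The metadata behaviour above is controlled through hdfs-site.xml. A minimal sketch follows; the property names are standard Hadoop 2.x, but the directory path is a placeholder you would adapt to your cluster:

```shell
# Sketch: a minimal hdfs-site.xml fragment controlling NameNode metadata.
# The directory path below is a placeholder, not a required location.
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Where the NameNode persists its fsimage and edit (transaction) log -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/namenode</value>
  </property>
  <!-- Default replication factor, recorded per file in the metadata -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF
```

Pointing dfs.namenode.name.dir at a comma-separated list of directories (e.g. a local disk plus an NFS mount) makes the NameNode write its metadata to all of them, a common safeguard.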
21. Secondary NameNode
Secondary NameNode:
» Not a hot standby for the NameNode
» Connects to the NameNode every hour*
» Housekeeping and backup of NameNode metadata
» The saved metadata can be used to rebuild a failed NameNode
("You give me metadata every hour, I will keep it safe.")
The NameNode is a single point of failure only in Hadoop 1.x, not in Hadoop 2.x.
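The hourly connection interval mentioned above is configurable. A minimal hdfs-site.xml sketch (standard Hadoop 2.x property names; the checkpoint directory is a placeholder):

```shell
# Sketch: checkpoint settings for the Secondary NameNode (placeholder path).
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- How often the Secondary NameNode checkpoints, in seconds (3600 = the "every hour" default) -->
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
  <!-- Where the Secondary NameNode stores the merged fsimage -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/data/hadoop/secondary</value>
  </property>
</configuration>
EOF
```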
22. Hadoop 2.x – In Summary
[Diagram] Two pillars: NameNode High Availability (HDFS) and Next-Generation MapReduce (YARN).
Masters: Active NameNode and Standby NameNode (sharing edit logs via shared storage OR a Journal Node), Secondary NameNode, and the Resource Manager (containing the Scheduler and the Applications Manager, AsM). Clients talk to HDFS for distributed data storage and to YARN for distributed data processing.
Slaves: each slave runs a DataNode and a Node Manager; Node Managers host Containers and per-application App Masters.
23. Hadoop 2.x Cluster Architecture - Federation
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
[Diagram] Hadoop 1.0: a single NameNode holds one namespace (NS) and one block storage layer (block management plus physical storage on the DataNodes).
Hadoop 2.0: multiple independent NameNodes (NN-1 … NN-k … NN-n), each managing its own namespace (NS1 … NSk … NSn) and its own block pool (Pool 1 … Pool k … Pool n). All block pools share common storage on the DataNodes (Datanode 1, Datanode 2 … Datanode m).
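Federation is configured by listing the nameservices and giving each NameNode its own addresses. A minimal two-NameNode sketch follows; the property names come from the Federation documentation linked above, while the hostnames and ports are placeholders:

```shell
# Sketch: HDFS Federation with two independent namespaces (hosts are placeholders).
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Two independent namespaces, each served by its own NameNode -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn-host1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>nn-host1:50070</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn-host2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>nn-host2:50070</value>
  </property>
</configuration>
EOF
```

Every DataNode registers with all NameNodes in the list and maintains one block pool per namespace, which is how the shared common storage in the diagram is realised.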
24. Hadoop 2.x – High Availability
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
[Diagram] HDFS HIGH AVAILABILITY: all namespace edits are logged by the Active NameNode to shared NFS storage, with a single writer enforced by fencing; the Standby NameNode reads the edit logs and applies them to its own namespace. DataNodes report to both NameNodes.
*When HA is configured, it is not necessary to configure a Secondary NameNode.
Next Generation MapReduce: the client submits work to the Resource Manager, which coordinates Node Managers; each Node Manager hosts Containers and a per-application App Master.
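Following the NFS-based HA document linked above, the shared-edits setup can be sketched like this. The property names are the standard Hadoop 2.x HA settings; the logical names (mycluster, nn1, nn2) and all hostnames/paths are placeholders:

```shell
# Sketch: NFS-based NameNode HA configuration (all names and paths are placeholders).
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two NameNodes: one active, one standby -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn-host1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn-host2:8020</value>
  </property>
  <!-- Shared NFS directory where the active NameNode writes edits (single writer, fencing) -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
  </property>
  <!-- How clients locate the currently active NameNode -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
EOF
```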
25. Hadoop 2.x – Resource Management
[Diagram: the same architecture as the previous slide, focusing on YARN.] The Resource Manager is the master for cluster resource management. Each slave runs a Node Manager; applications run in Containers, coordinated by a per-application App Master that negotiates resources with the Resource Manager on behalf of the client.
26. Hadoop Cluster: A Typical Use Case
NameNode:
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: redundant power supply
Secondary NameNode:
» RAM: 32 GB
» Hard disk: 1 TB
» Processor: Xeon with 4 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: redundant power supply
DataNode (per node):
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
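A back-of-envelope capacity check for the DataNode spec above: with 6 x 2 TB disks per node and the default replication factor of 3, usable capacity is roughly raw capacity divided by three. The node count below is an assumption for illustration:

```shell
# Assumed: 10 DataNodes with the per-node spec from the slide.
DATANODES=10
DISKS_PER_NODE=6
DISK_TB=2
REPLICATION=3

RAW_TB=$((DATANODES * DISKS_PER_NODE * DISK_TB))   # 10 * 6 * 2 = 120 TB raw
USABLE_TB=$((RAW_TB / REPLICATION))                # / 3 replicas = 40 TB usable
echo "raw=${RAW_TB}TB usable=${USABLE_TB}TB"
```

In practice you would also reserve headroom for intermediate MapReduce output and the OS, so the usable figure is an upper bound.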
30. Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
» No daemons; everything runs in a single JVM.
» Suitable for running MapReduce programs during development.
» Has no DFS.
Pseudo-Distributed Mode
» All Hadoop daemons run on the local machine.
Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines.
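Pseudo-distributed mode is typically enabled with two small configuration changes. A minimal sketch follows (the port and replication value are common choices for a single-machine Hadoop 2.x setup, not requirements):

```shell
# core-site.xml: point the default filesystem at an HDFS daemon on this machine
cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: only one machine, so a replication factor of 1
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```

With these files in place you would format the NameNode once (hdfs namenode -format) and start the daemons; all of them then run as separate JVMs on the one host.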