HDFS HA Deep Dive

HDFS High Availability

Suresh Srinivas - Hortonworks

Aaron T. Myers - Cloudera

Overview

•  Part 1 – Suresh Srinivas (Hortonworks)
  − HDFS Availability and Reliability – what is the record?
  − HA Use Cases
  − HA Design
•  Part 2 – Aaron T. Myers (Cloudera)
  − NN HA Design Details
    ✓ Automatic failure detection and NN failover
    ✓ Client-NN connection failover
  − Operations and Admin of HA
  − Future Work


Availability, Reliability and Maintainability

Reliability = MTTF/(1 + MTTF)
•  Probability a system performs its functions without failure for a desired period of time

Maintainability = 1/(1 + MTTR)
•  Probability that a failed system can be restored within a given timeframe

Availability = MTTF/MTBF
•  Probability that a system is up when requested for use
•  Depends on both Reliability and Maintainability

Mean Time To Failure (MTTF): Average time the system runs between a repair and the next failure
Mean Time To Repair/Restore (MTTR): Average time to repair a failed system
Mean Time Between Failures (MTBF): Average time between successive failures = MTTR + MTTF
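
The relationship between the three quantities can be checked with a few lines of Python. This is a minimal sketch that uses only the slide's definitions (MTBF = MTTR + MTTF, Availability = MTTF/MTBF); the input figures are hypothetical and are not measurements from any cluster.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF = MTTR + MTTF, as defined on the slide."""
    return mttr_hours + mttf_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical input: the system runs 1000 hours between repair and failure.
MTTF_HOURS = 1000.0

# Same MTTF (same reliability), two different repair times.
slow_repair = availability(MTTF_HOURS, mttr_hours=1.0)
fast_repair = availability(MTTF_HOURS, mttr_hours=1.0 / 60.0)

print(f"MTTR = 1 hour:   availability = {slow_repair:.6f}")
print(f"MTTR = 1 minute: availability = {fast_repair:.6f}")
```

With MTTF held fixed, the only way to raise availability is to shrink MTTR, which is the argument the rest of the deck makes for failover.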


Current HDFS Availability & Data Integrity

•  Simple design for Higher Reliability
  − Storage: Rely on Native file system on the OS rather than use raw disk
  − Single NameNode master
    ✓ Entire file system state is in memory
  − DataNodes simply store and deliver blocks
    ✓ All sophisticated recovery mechanisms in NN
•  Fault Tolerance
  − Design assumes disks, nodes and racks fail
  − Multiple replicas of blocks
    ✓ Active monitoring and replication
    ✓ DNs actively monitor for block deletion and corruption
  − Restart/migrate the NameNode on failure
    ✓ Persistent state: multiple copies + checkpoints
    ✓ Functions as Cold Standby
  − Restart/replace the DNs on failure
  − DNs tolerate individual disk failures


How Well Did HDFS Work?

•  Data Reliability
  − Lost 19 out of 329 Million blocks on 10 clusters with 20K nodes in 2009
  − Seven 9's of reliability
  − Related bugs fixed in Hadoop 0.20 and 0.21
•  NameNode Availability
  − 18-month study: 22 failures on 25 clusters - 0.58 failures per year per cluster
  − Only 8 would have benefitted from HA failover!! (0.23 failures per cluster year)
  − NN is very reliable
    ✓ Resilient against overload caused by misbehaving apps
•  Maintainability
  − Large clusters see failure of one DataNode/day and more frequent disk failures
  − Maintenance once in 3 months to repair or replace DataNodes
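
The "seven 9's" figure follows from the block-loss numbers on this slide. A minimal sketch of the arithmetic, where the helper name `nines` is an illustrative assumption and not part of any Hadoop tool:

```python
import math


def nines(loss_fraction: float) -> int:
    """Count the leading 9s in (1 - loss_fraction), e.g. 0.0005 -> 3."""
    return math.floor(-math.log10(loss_fraction))


# Figures quoted on the slide: 19 blocks lost out of 329 million.
LOST_BLOCKS = 19
TOTAL_BLOCKS = 329_000_000

loss = LOST_BLOCKS / TOTAL_BLOCKS
block_nines = nines(loss)

print(f"fraction of blocks lost: {loss:.3e}")
print(f"nines of reliability:    {block_nines}")
```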


Why NameNode HA?

•  NameNode is highly reliable (high MTTF)
  − But Availability is not the same as Reliability
•  NameNode MTTR depends on
  − Restarting NameNode daemon on failure
    ✓ Operator restart – (failure detection + manual restore) time
    ✓ Automatic restart – 1-2 minutes
  − NameNode Startup time
    ✓ Small/medium cluster – 1-2 minutes
    ✓ Very large cluster – 5-15 minutes
•  Affects applications that have real time requirement
•  For higher HDFS Availability
  − Need redundant NameNode to eliminate SPOF
  − Need automatic failover to reduce MTTR and improve Maintainability
  − Need Hot standby to reduce MTTR for very large clusters
    ✓ Cold standby is sufficient for small clusters
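
To get a feel for what these restart and startup times mean over a year, the sketch below combines them with the 0.58 failures per cluster-year quoted earlier. Treating MTTR as "restart time plus startup time" is a simplifying assumption made here for illustration. The deck only says that MTTR depends on both, and the function names are hypothetical.

```python
def mttr_minutes(restart_minutes: float, startup_minutes: float) -> float:
    """Assumed additive model: time to restart the daemon plus NN startup time."""
    return restart_minutes + startup_minutes


def downtime_minutes_per_year(failures_per_year: float, mttr: float) -> float:
    """Expected unplanned downtime per cluster-year."""
    return failures_per_year * mttr


FAILURES_PER_YEAR = 0.58  # per cluster, from the 18-month study

# Upper ends of the ranges on the slide: automatic restart (2 minutes)
# followed by NameNode startup on a small or a very large cluster.
small_cluster = downtime_minutes_per_year(FAILURES_PER_YEAR, mttr_minutes(2, 2))
large_cluster = downtime_minutes_per_year(FAILURES_PER_YEAR, mttr_minutes(2, 15))

print(f"small/medium cluster: {small_cluster:.2f} minutes/year")
print(f"very large cluster:   {large_cluster:.2f} minutes/year")
```

Under this model the startup time dominates on very large clusters, which is why a hot standby matters most there.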


NameNode HA – Initial Goals

•  Support for Active and a single Standby
  − Active and Standby with manual failover
    ✓ Standby could be cold/warm/hot
    ✓ Addresses downtime during upgrades – main cause of unavailability
  − Active and Standby with automatic failover
    ✓ Hot standby
    ✓ Addresses downtime during upgrades and other failures
•  Backward compatible configuration
•  Standby performs checkpointing
  − Secondary NameNode not needed
•  Management and monitoring tools
•  Design philosophy – choose data integrity over service availability


High Level Use Cases

•  Planned downtime
  − Upgrades
  − Config changes
  − Main reason for downtime
•  Unplanned downtime
  − Hardware failure
  − Server unresponsive
  − Software failures
  − Occurs infrequently

Supported failures
•  Single hardware failure
  − Double hardware failure not supported
•  Some software failures
  − Same software failure affects both active and standby


High Level Design

•  Service monitoring and leader election outside NN
  − Similar to industry standard HA frameworks
•  Parallel Block reports to both Active and Standby NN
•  Shared or non-shared NN file system state
•  Fencing of shared resources/data
  − DataNodes
  − Shared NN state (if any)
•  Client failover
  − Client side failover (based on configuration or ZooKeeper)
  − IP Failover


Design Considerations

•  Sharing state between Active and Hot Standby
  − File system state and Block locations
•  Automatic Failover
  − Monitoring Active NN and performing failover on failure
•  Making a NameNode active during startup
  − Reliable mechanism for choosing only one NN as active and the other as standby
•  Prevent data corruption on split brain
  − Shared Resource Fencing
    ✓ DataNodes and shared storage for NN metadata
  − NameNode Fencing
    ✓ When shared resource cannot be fenced
•  Client failover
  − Clients connect to the new Active NN during failover


Failover Control Outside NN

•  Similar to Industry Standard HA frameworks
•  HA daemon outside NameNode
  − Simpler to build
  − Immune to NN failures
•  Daemon manages resources
  − Resources – OS, HW, Network etc.
  − NameNode is just another resource
•  Performs
  − Active NN election during startup
  − Automatic Failover
  − Fencing
    ✓ Shared resources
    ✓ NameNode

[Diagram labels: ZooKeeper; Failover Controller; Actions: start, stop, failover, monitor, …; Resources; Shared Resources]

Architecture

[Diagram labels: ZK ×3 (leader election); Failover Controller (Active) and Failover Controller (Standby), each sending Cmds to its NN and monitoring its health; NN (Active) and NN (Standby) sharing editlogs (fencing); DNs sending Block Reports to both NNs]

First Phase – Hot Standby

[Diagram labels: NN (Active) and NN (Standby) sharing editlogs on shared NFS storage, which needs to be HA; Manual Failover; DNs sending Block Reports to both NNs; DN fencing]

HA Design Details


Client Failover Design Details

•  Smart clients (client side failover)
  − Users use one logical URI, client selects correct NN to connect to
  − Clients know which operations are idempotent, therefore safe to retry on a failover
  − Clients have configurable failover/retry strategies
•  Current implementation
  − Client configured with the addresses of all NNs
•  Other implementations in the future (more later)
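
The retry rule above can be sketched in a few lines. This is an illustrative model, not the Hadoop client: `FailoverClient`, `StandbyError`, and the callables standing in for NameNode RPC proxies are all hypothetical names introduced here. The sketch assumes that a rejection by a standby is always safe to fail over, while a connection error is only retried for idempotent operations.

```python
class StandbyError(Exception):
    """Raised by a (simulated) NameNode that is not active."""


class FailoverClient:
    """Tries each configured NameNode in turn (hypothetical sketch)."""

    def __init__(self, namenodes, max_failovers=3):
        self.namenodes = list(namenodes)  # callables standing in for RPC proxies
        self.current = 0                  # index of the NN we believe is active
        self.max_failovers = max_failovers

    def call(self, op, idempotent):
        failovers = 0
        while True:
            try:
                return self.namenodes[self.current](op)
            except StandbyError:
                # The standby rejected the call without executing it,
                # so any operation may move on to the other NN.
                pass
            except ConnectionError:
                # The operation may or may not have been applied;
                # only idempotent operations are safe to retry.
                if not idempotent:
                    raise
            failovers += 1
            if failovers > self.max_failovers:
                raise RuntimeError("no active NameNode found")
            self.current = (self.current + 1) % len(self.namenodes)


def standby_nn(op):
    raise StandbyError(op)


def dead_nn(op):
    raise ConnectionError("unreachable")


def active_nn(op):
    return f"ok: {op}"


client = FailoverClient([standby_nn, active_nn])
result = client.call("getFileInfo /tmp", idempotent=True)
print(result)
```

The client remembers which NameNode answered last, so later calls go straight to the active one instead of probing the standby again.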


Client Failover Configuration Example

...
<property>
<name>dfs.namenode.rpc-address.name-service1.nn1</name>
<value>host1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.name-service1.nn2</name>
</property>
<property>
<name>dfs.namenode.http-address.name-service1.nn1</name>
</property>
...
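
As a rough illustration of how a client could turn such properties into a list of NameNode addresses, the sketch below parses a self-contained sample with Python's standard library. The sample wraps the properties in a `<configuration>` root, and the `host2.example.com:8020` value is a hypothetical placeholder added only so the example has two addresses to return. The function name `rpc_addresses` is likewise an assumption, not a Hadoop API.

```python
import xml.etree.ElementTree as ET

# Self-contained sample; the nn2 value is a hypothetical placeholder.
SAMPLE = """
<configuration>
  <property>
    <name>dfs.namenode.rpc-address.name-service1.nn1</name>
    <value>host1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.name-service1.nn2</name>
    <value>host2.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.name-service1.nn1</name>
  </property>
</configuration>
"""

PREFIX = "dfs.namenode.rpc-address."


def rpc_addresses(xml_text, nameservice):
    """Map NameNode id -> RPC address for one logical nameservice."""
    wanted = PREFIX + nameservice + "."
    result = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name", default="")
        value = prop.findtext("value")
        # Skip other keys and properties that carry no value.
        if name.startswith(wanted) and value:
            result[name[len(wanted):]] = value.strip()
    return result


addresses = rpc_addresses(SAMPLE, "name-service1")
print(addresses)
```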


Automatic Failover Design Details

•  Automatic failover requires ZooKeeper
  − Not required for manual failover
  − ZK makes it easy to:
    ✓ Detect failure of the active NN
    ✓ Determine which NN should become the Active NN
•  On both NN machines, run another daemon
  − ZKFailoverController (ZooKeeper Failover Controller)
•  Each ZKFC is responsible for:
  − Health monitoring of its associated NameNode
  − ZK session management / ZK-based leader election
•  See HDFS-2185 and HADOOP-8206 for more details
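
The two ZKFC responsibilities, health monitoring and leader election, can be modelled with a toy simulation. Everything here is an assumption made for illustration: `FakeLock` is an in-memory stand-in for a ZooKeeper ephemeral lock node, `FailoverController` is not the real ZKFailoverController, and fencing is left out entirely.

```python
class FakeLock:
    """In-memory stand-in for a ZooKeeper ephemeral lock node (assumption)."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, who):
        if self.holder is None:
            self.holder = who
        return self.holder == who

    def release(self, who):
        if self.holder == who:
            self.holder = None


class FailoverController:
    """Toy controller: one per NameNode, sharing a single lock."""

    def __init__(self, name, lock, is_healthy):
        self.name = name
        self.lock = lock
        self.is_healthy = is_healthy  # callable returning True/False
        self.state = "standby"

    def tick(self):
        """One monitoring round: check health, then join or leave the election."""
        if not self.is_healthy():
            # Unhealthy NN: give up the lock so the peer can take over.
            self.lock.release(self.name)
            self.state = "standby"
        elif self.lock.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"
        return self.state


health = {"nn1": True, "nn2": True}
lock = FakeLock()
zkfc1 = FailoverController("nn1", lock, lambda: health["nn1"])
zkfc2 = FailoverController("nn2", lock, lambda: health["nn2"])

states = [(zkfc1.tick(), zkfc2.tick())]      # nn1 wins the election
health["nn1"] = False                        # simulate a failure of nn1
states.append((zkfc1.tick(), zkfc2.tick()))  # nn2 takes over
print(states)
```

In this model, releasing the lock plays the role of the controller giving up its ZK session. A real deployment would also have to fence the old active before the new one starts serving.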


Automatic Failover Design Details (cont)


Ops/Admin: Shared Storage

•  To share NN state, need shared storage
  − Needs to be HA itself to avoid just shifting SPOF
  − Many shared storage systems come with IP fencing options
  − Recommended NFS mount options:
    ✓ tcp,soft,intr,timeo=60,retrans=10
•  Still configure local edits dirs, but shared dir is special
•  Work is currently underway to do away with shared storage requirement (more later)


Ops/Admin: NN fencing

•  Critical for correctness that only one NN is active at a time
•  Out of the box
  − RPC to active NN to tell it to go to standby (graceful failover)
  − SSH to active NN and `kill -9` the NN
•  Pluggable options
  − Many filers have protocols for IP-based fencing options
  − Many PDUs have protocols for IP-based plug-pulling (STONITH)
    ✓ Nuke the node from orbit. It's the only way to be sure.
•  Configure extra options if available to you
  − Will be tried in order during a failover event
  − Escalate the aggressiveness of the method
  − Fencing is critical for correctness of NN metadata
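
The "tried in order, escalating" behaviour can be sketched as a simple loop. The method names and the simulated outcomes below are hypothetical. The sketch assumes that a failover is refused when no method succeeds, in line with the deck's stated preference for data integrity over availability.

```python
def fence(methods, target):
    """Try each (name, method) pair in order; return the first name that succeeds."""
    for name, method in methods:
        try:
            if method(target):
                return name
        except Exception:
            # A method that errors out counts as a failed attempt.
            continue
    raise RuntimeError(f"unable to fence {target}; refusing to fail over")


# Simulated fencing methods, ordered from gentle to aggressive.
def graceful_rpc(target):
    return False  # pretend the old active NN is hung and ignores the RPC


def ssh_kill(target):
    return True   # pretend `kill -9` over SSH succeeded


def pdu_power_off(target):
    return True   # last resort; never reached in this run


METHODS = [
    ("graceful-rpc", graceful_rpc),
    ("ssh-kill", ssh_kill),
    ("pdu-power-off", pdu_power_off),
]

used = fence(METHODS, "nn1")
print(f"fenced nn1 using: {used}")
```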


Ops/Admin: Automatic Failover

•  Deploy ZK as usual (3 or 5 nodes) or reuse existing ZK
  − ZK daemons have light resource requirement
  − OK to collocate 1 on each NN, many collocate 3rd on the YARN RM
  − Advisable to configure ZK daemons with dedicated disks for isolation
  − Fine to use the same ZK quorum as for HBase, etc.
•  Fencing methods still required
  − The ZKFC that wins the election is responsible for performing fencing
  − Fencing script(s) must be configured and work from the NNs
•  Admin commands which manually initiate failovers still work
  − But rather than coordinating the failover themselves, they use the ZKFCs


Ops/Admin: Monitoring

•  New NN metrics
  − Size of pending DN message queues
  − Seconds since the standby NN last read from shared edit log
  − DN block report lag
  − All measurements of standby NN lag – monitor/alert on all of these
•  Monitor shared storage solution
  − Volumes fill up, disks go bad, etc.
  − Should configure paranoid edit log retention policy (default is 2)
•  Canary-based monitoring of HDFS a good idea
  − Pinging both NNs not sufficient
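
A canary check exercises the whole write/read path instead of only pinging the NameNodes. The sketch below is a generic outline: `run_canary` and its `write`, `read`, and `delete` parameters are hypothetical hooks that a real check would bind to an HDFS client. Here they are bound to an in-memory dict so the example runs on its own.

```python
import time
import uuid


def run_canary(write, read, delete, max_seconds=5.0):
    """Write a small file, read it back, delete it, and time the round trip."""
    payload = uuid.uuid4().hex
    path = f"/tmp/canary-{payload}"
    start = time.monotonic()
    try:
        write(path, payload)
        ok = read(path) == payload
        delete(path)
    except Exception:
        # Any failure along the path counts as an unhealthy file system.
        return False
    return ok and (time.monotonic() - start) <= max_seconds


# In-memory stand-in for a file system client (assumption for the demo).
fake_fs = {}
healthy = run_canary(fake_fs.__setitem__, fake_fs.__getitem__, fake_fs.pop)
print(f"canary passed: {healthy}")
```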


Ops/Admin: Hardware

•  Active/Standby NNs should be on separate racks
•  Shared storage system should be on separate rack
•  Active/Standby NNs should have close to the same hardware
  − Same amount of RAM – need to store the same things
  − Same # of processors - need to serve same number of clients
•  All the same recommendations still apply for NN
  − ECC memory, 48GB
  − Several separate disks for NN metadata directories
  − Redundant disks for OS drives, probably RAID 5 or mirroring
  − Redundant power


Future Work

•  Other options to share NN metadata
  − Journal daemons with list of active JDs stored in ZK (HDFS-3092)
  − Journal daemons with quorum writes (HDFS-3077)
•  More advanced client failover/load shedding
  − Serve stale reads from the standby NN
  − Speculative RPC
  − Non-RPC clients (IP failover, DNS failover, proxy, etc.)
  − Less client-side configuration (ZK, custom DNS records, HDFS-3043)
•  Even Higher HA
  − Multiple standby NNs


QA

•  HA design: HDFS-1623
  − First released in Hadoop 2.0.0-alpha
•  Auto failover design: HDFS-3042 / HDFS-2185
  − First released in Hadoop 2.0.1-alpha
•  Community effort

