HBase Low Latency
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
HBaseCon, May 5, 2014
  
Agenda
• Latency: what is it, how to measure it
• Write path
• Read path
• Next steps
  
What’s low latency
Latency is about percentiles
• Average != 50th percentile
• There are often orders of magnitude between « average » and « 95th percentile »
• Post 99% = « magical 1% ». Work in progress here.
• Meaning ranges from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
  
Measure latency
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More options related to HBase: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis
YCSB - Yahoo! Cloud Serving Benchmark
• Useful for comparison between databases
• Set of workloads already defined
  
Write path
• Two parts
• Single put (WAL)
• The client just sends the put
• Multiple puts from the client (new behavior since 0.96)
• The client is much smarter
• Four stages to look at for latency
• Start (establish TCP connections, etc.)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system
  
Single put: communication & scheduling
• Client: TCP connection to the server
• Shared: multiple threads on the same client use the same TCP connection
• Pooling is possible and does improve performance in some circumstances (see the sketch below)
• hbase.client.ipc.pool.size
• Server: multiple calls from multiple threads on multiple machines
• Can become thousands of simultaneous queries
• Scheduling is required
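A minimal sketch of opting into connection pooling from the client, assuming the 0.96-era HTable API; the pool size and table name are arbitrary examples, not recommendations:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;

  public class PooledClient {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // Use several TCP connections per server instead of one shared socket
      // (5 is only an example value; measure before and after).
      conf.setInt("hbase.client.ipc.pool.size", 5);
      HTable table = new HTable(conf, "usertable");  // placeholder table name
      // ... issue puts/gets from multiple threads ...
      table.close();
    }
  }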
  
	
  
Single put: real work
• The server must
• Write into the WAL queue
• Sync the WAL queue (HDFS flush)
• Write into the memstore
• The WAL queue is shared between all the regions/handlers
• Sync is avoided if another handler already did the work
• You may flush more than expected
  
Simple put: a small run

Percentile    Time in ms
Mean          1.21
50%           0.95
95%           1.50
99%           2.12
  
Latency sources
• Candidate one: network
• 0.5 ms within a datacenter
• Much less between nodes in the same rack

Percentile    Time in ms
Mean          0.13
50%           0.12
95%           0.15
99%           0.47
  
Latency sources
• Candidate two: HDFS flush
• We can still do better: HADOOP-7714 & sons.

Percentile    Time in ms
Mean          0.33
50%           0.26
95%           0.59
99%           1.24
  
Latency sources
• Millisecond world: everything can go wrong
• JVM
• Network
• OS scheduler
• File system
• All this goes into the post-99% percentile
• Requires monitoring
• Usually, using the latest version helps.
  
Latency sources
• Splits (and presplits)
• Autosharding is great!
• Puts have to wait
• Impact: seconds
• Balance
• Regions move
• Triggers a retry for the client
• hbase.client.pause = 100ms since HBase 0.96
• Garbage collection
• Impact: 10's of ms, even with a good config
• Covered with the read path of this talk
  
From steady to loaded and overloaded
• Number of concurrent tasks is a factor of
• Number of cores
• Number of disks
• Number of remote machines used
• Difficult to estimate
• Queues are doomed to happen
• hbase.regionserver.handler.count
• So, for low latency
• Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
• RPC priorities: work in progress (HBASE-11048)
  
From loaded to overloaded
• MemStore takes too much room: flush, then blocks quite quickly
• hbase.regionserver.global.memstore.size.lower.limit
• hbase.regionserver.global.memstore.size
• hbase.hregion.memstore.block.multiplier
• Too many HFiles: block until compactions keep up
• hbase.hstore.blockingStoreFiles
• Too many WAL files: flush and block
• hbase.regionserver.maxlogs
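These are RegionServer-side properties (normally set in hbase-site.xml); the sketch below only illustrates their names and types through the Hadoop Configuration API, and the values are made-up examples rather than recommendations:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class WriteBlockingLimits {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      conf.setFloat("hbase.regionserver.global.memstore.size", 0.40f);             // heap fraction where writes block (example)
      conf.setFloat("hbase.regionserver.global.memstore.size.lower.limit", 0.95f); // force flushes at this fraction of the global limit (example)
      conf.setInt("hbase.hregion.memstore.block.multiplier", 4);                   // per-region slack before blocking updates (example)
      conf.setInt("hbase.hstore.blockingStoreFiles", 20);                          // HFiles per store tolerated before blocking (example)
      conf.setInt("hbase.regionserver.maxlogs", 64);                               // WAL files tolerated before a forced flush (example)
    }
  }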
Machine failure
• Failure
• Detect
• Reallocate
• Replay WAL
• Replaying the WAL is NOT required for puts
• hbase.master.distributed.log.replay
• (default true in 1.0)
• Failure = Detect + Reallocate + Retry
• That's in the range of ~1s for simple failures
• Silent failures put you in the 10s range if the hardware does not help
• zookeeper.session.timeout
Single puts
• Millisecond range
• Spikes do happen in steady mode
• 100ms
• Causes: GC, load, splits
  
Streaming puts
HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits
• As simple puts, but
• Puts are grouped and sent in the background
• Load is taken into account
• Does not block
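A minimal, self-contained sketch of the streaming pattern above, using the 0.96-era HTable API; table, family, and qualifier names are placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class StreamingPuts {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "events");   // placeholder table name
      table.setAutoFlushTo(false);                 // buffer puts on the client
      try {
        for (long i = 0; i < 1000000; i++) {
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
          table.put(put);                          // returns quickly; sent in the background
        }
        table.flushCommits();                      // drain whatever is still buffered
      } finally {
        table.close();                             // close() also flushes
      }
    }
  }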
  
Multiple puts
hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)
• Decouple the client from a latency spike of a region server
• Increase the throughput by 50% compared to the old multiput
• Makes splits and GC more transparent
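A sketch combining the client-side limits above with the multi-put API; the values shown just restate the defaults, and the table and column names are placeholders:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class MultiPut {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("hbase.client.max.total.tasks", 100);    // total in-flight requests from this client
      conf.setInt("hbase.client.max.perserver.tasks", 5);  // a slow RegionServer cannot monopolize the client
      conf.setInt("hbase.client.max.perregion.tasks", 1);  // at most one outstanding request per region
      HTable table = new HTable(conf, "events");           // placeholder table name
      try {
        List<Put> batch = new ArrayList<Put>();
        for (int i = 0; i < 1000; i++) {
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
          batch.add(put);
        }
        table.put(batch);   // one call; the client spreads it across RegionServers
      } finally {
        table.close();
      }
    }
  }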
  
Conclusion on write path
• Single puts can be very fast
• It's not a « hard real time » system: there are spikes
• Most latency spikes can be hidden when streaming puts
• Failures are NOT that difficult for the write path
• No WAL to replay
  
And now for the read path
Read path
• Gets/short scans are assumed for low-latency operations
• Again, two APIs
• Single get: HTable#get(Get)
• Multi-get: HTable#get(List<Get>)
• Four stages, same as the write path
• Start (TCP connection, …)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload
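A minimal sketch of both read APIs; table, row, and column names are placeholders:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class Reads {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "events");   // placeholder table name
      try {
        // Single get: one RPC to the RegionServer hosting the row.
        Result one = table.get(new Get(Bytes.toBytes("row-42")));
        byte[] value = one.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"));

        // Multi-get: the client groups the Gets by RegionServer (next slide).
        List<Get> gets = new ArrayList<Get>();
        for (int i = 0; i < 100; i++) {
          gets.add(new Get(Bytes.toBytes("row-" + i)));
        }
        Result[] results = table.get(gets);
      } finally {
        table.close();
      }
    }
  }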
  
Multi get / Client
Group Gets by RegionServer
Execute them one by one
  
Multi get / Server
Multi get / Server
Access latency magnitudes
Storage hierarchy: a different view (Dean, 2009)
Memory is 100,000x faster than disk!
Disk seek = 10ms
  
Known unknowns
• For each candidate HFile
• Exclude by file metadata
• Timestamp
• Rowkey range
• Exclude by bloom filter
StoreFileScanner#shouldUseScanner()
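Bloom filters are a per-column-family schema choice; a hedged sketch of enabling a row-level bloom when defining a table (table and family names are placeholders):

  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.regionserver.BloomType;

  public class BloomSchema {
    public static void main(String[] args) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events")); // placeholder table
      HColumnDescriptor family = new HColumnDescriptor("f");                     // placeholder family
      family.setBloomFilterType(BloomType.ROW);  // lets a get skip HFiles that cannot contain the row
      desc.addFamily(family);
      // hand 'desc' to HBaseAdmin#createTable(...) when creating the table
    }
  }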
  
Unknown knowns
• Merge sort results polled from Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk
• Multiple HFiles => multiple seeks
• hbase.storescanner.parallel.seek.enable=true
• Short-circuit reads
• dfs.client.read.shortcircuit=true
• Block locality
• Happy clusters compact!
HFileBlock#readBlockData()
  
BlockCache
• Reuse previously read data
• Maximize cache hit rate
• Larger cache
• Temporal access locality
• Physical access locality
BlockCache#getBlock()
  
BlockCache Showdown
• LruBlockCache
• Default, on-heap
• Quite good most of the time
• Evictions impact GC
• BucketCache
• Off-heap alternative
• Serialization overhead
• Large memory configurations
http://www.n10k.com/blog/blockcache-showdown/
The L2 off-heap BucketCache makes a strong showing
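A hedged sketch of turning on an off-heap BucketCache with the 0.96/0.98-era property names. These are RegionServer-side settings (normally hbase-site.xml), shown here only through the Configuration API; the sizes are arbitrary examples, and the JVM additionally needs a large enough -XX:MaxDirectMemorySize:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class OffHeapCache {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      conf.set("hbase.bucketcache.ioengine", "offheap"); // back the BucketCache with direct memory
      conf.set("hbase.bucketcache.size", "4096");        // example capacity in MB
      conf.setFloat("hfile.block.cache.size", 0.2f);     // keep the on-heap LRU portion modest (example)
    }
  }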
  
Latency enemies: Garbage Collection
• Use heap. Not too much. With CMS.
• Max heap
• 30GB (compressed pointers)
• 8-16GB if you care about 9's
• Healthy cluster load
• regular, reliable collections
• 25-100ms pause on regular interval
• Overloaded RegionServer suffers GC overmuch
  
	
  
Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al (HBASE-10191)
  
Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict index blocks!!
• hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
• hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
• Compacted data is still fresh
• Better than going all the way back to disk
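The two cache-on-write properties above are RegionServer-side settings; a small sketch of their names and types (whether enabling them pays off is workload-dependent):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class CacheOnWrite {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      // Re-populate the BlockCache with index and bloom blocks as compactions write new HFiles,
      // instead of paying a disk seek the first time they are read back.
      conf.setBoolean("hfile.block.index.cacheonwrite", true);
      conf.setBoolean("hfile.block.bloom.cacheonwrite", true);
    }
  }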
  
Failure
• Detect + Reassign + Replay
• Strong consistency requires replay
• Locality drops to 0
• Cache starts from scratch
  
Hedging our bets
• HDFS hedged reads (2.4, HDFS-5776)
• Reads on secondary DataNodes
• Strongly consistent
• Works at the HDFS level
• Timeline consistency (HBASE-10070)
• Reads on « replica regions »
• Not strongly consistent
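Two hedged notes on the options above. HDFS hedged reads are enabled on the RegionServer's HDFS client, and the property values below are placeholders. The timeline-consistency API is shown as it eventually shipped with HBASE-10070 (HBase 1.0+), so it is not available in the 0.96/0.98 releases this talk targets:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Consistency;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HedgedAndTimeline {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // HDFS-5776: fire a second read to another DataNode when the first one is slow.
      // These are RegionServer-side settings; the values are placeholders.
      conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
      conf.setLong("dfs.client.hedged.read.threshold.millis", 10);

      // HBASE-10070 (HBase 1.0+): accept a possibly stale answer from a replica region.
      HTable table = new HTable(conf, "events");   // placeholder table name
      try {
        Get get = new Get(Bytes.toBytes("row-42"));
        get.setConsistency(Consistency.TIMELINE);
        Result result = table.get(get);
        boolean stale = result.isStale();          // true when served by a replica
      } finally {
        table.close();
      }
    }
  }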
  
Read latency in summary
• Steady mode
• Cache hit: < 1 ms
• Cache miss: + 10 ms per seek
• Writing while reading => cache churn
• GC: 25-100ms pause on regular interval

Network request + (1 - P(cache hit)) * (10 ms * seeks)

For example, with a 95% cache hit rate and two seeks per miss, that adds 0.05 * 20 ms = 1 ms on top of the network round trip.

• Same long-tail issues as the write path
• Overloaded: same scheduling issues as the write path
• Partial failures hurt a lot
  
HBase ranges for 99% latency

          Put                Streamed Multiput    Get                Timeline get
Steady    milliseconds       milliseconds         milliseconds       milliseconds
Failure   seconds            seconds              seconds            milliseconds
GC        10's of ms         milliseconds         10's of ms         milliseconds
  
What’s next
• Less GC
• Use fewer objects
• Off-heap
• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)
• The « magical 1% »
• Most tools stop at the 99% latency
• What happens after is much more complex
  
Thanks!
Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
HBaseCon, May 5, 2014
  
