
HBase Low Latency

  1. HBase Low Latency
     Nick Dimiduk, Hortonworks (@xefyr)
     Nicolas Liochon, Scaled Risk (@nkeywal)
     Hadoop Summit, June 4, 2014
  2. Agenda
     • Latency: what it is, how to measure it
     • Write path
     • Read path
     • Next steps
  3. What’s low latency?
     • Latency is about percentiles
       • Average != 50th percentile
       • There are often orders of magnitude between « average » and « 95th percentile »
       • Post-99% = the « magical 1% ». Work in progress here.
     • Meaning anything from microseconds (high-frequency trading) to seconds (interactive queries)
     • In this talk: milliseconds
  4. Measure latency
     bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
     • More options related to HBase: autoflush, replicas, …
     • Latency measured in microseconds
     • Easier for internal analysis
     YCSB - Yahoo! Cloud Serving Benchmark
     • Useful for comparison between databases
     • Set of workloads already defined
     (a do-it-yourself sketch follows below)
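For a quick ad-hoc measurement outside these tools, a client loop can time each put and report percentiles directly. A minimal sketch against the 0.96-era client API, assuming a hypothetical existing table named `test` with a column family `f`:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutLatency {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test"); // hypothetical table with family "f"
    List<Long> samples = new ArrayList<Long>();
    for (int i = 0; i < 10000; i++) {
      Put put = new Put(Bytes.toBytes(String.format("row-%05d", i)));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      long start = System.nanoTime();
      table.put(put); // autoflush is on by default: one RPC per put
      samples.add(System.nanoTime() - start);
    }
    table.close();
    Collections.sort(samples);
    System.out.printf("mean: %.2f ms%n", mean(samples) / 1e6);
    System.out.printf("50%%:  %.2f ms%n", samples.get(samples.size() / 2) / 1e6);
    System.out.printf("95%%:  %.2f ms%n", samples.get((int) (samples.size() * 0.95)) / 1e6);
    System.out.printf("99%%:  %.2f ms%n", samples.get((int) (samples.size() * 0.99)) / 1e6);
  }

  private static double mean(List<Long> xs) {
    long sum = 0;
    for (long x : xs) sum += x;
    return (double) sum / xs.size();
  }
}
```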
  5. Write path
     • Two parts
       • Single put (WAL): the client just sends the put
       • Multiple puts from the client (new behavior since 0.96): the client is much smarter
     • Four stages to look at for latency
       • Start (establish TCP connections, etc.)
       • Steady: when expected conditions are met
       • Machine failure: expected as well
       • Overloaded system
  6. Single put: communication & scheduling
     • Client: TCP connection to the server
       • Shared: multiple threads on the same client use the same TCP connection
       • Pooling is possible and does improve performance in some circumstances (sketch below)
         • hbase.client.ipc.pool.size
     • Server: multiple calls from multiple threads on multiple machines
       • Can become thousands of simultaneous queries
       • Scheduling is required
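A minimal client-side sketch of that pooling knob; the value is illustrative, not a recommendation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PooledClientConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // By default all threads of a client share one TCP connection per server;
    // a small pool can help under heavy multi-threaded load.
    conf.setInt("hbase.client.ipc.pool.size", 5); // illustrative value
    return conf;
  }
}
```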
  7. Single put: real work
     • The server must
       • Write into the WAL queue
       • Sync the WAL queue (HDFS flush)
       • Write into the memstore
     • The WAL queue is shared between all the regions/handlers
       • Sync is avoided if another handler already did the work
       • Your handler may flush more data than expected
  8. Simple put: a small run

     Percentile   Time in ms
     Mean         1.21
     50%          0.95
     95%          1.50
     99%          2.12
  9. Latency sources
     • Candidate one: the network
       • 0.5ms within a datacenter
       • Much less between nodes in the same rack

     Percentile   Time in ms
     Mean         0.13
     50%          0.12
     95%          0.15
     99%          0.47
  10. Latency sources
      • Candidate two: HDFS flush
        • We can still do better: HADOOP-7714 & sons.

      Percentile   Time in ms
      Mean         0.33
      50%          0.26
      95%          0.59
      99%          1.24
  11. Latency sources
      • Millisecond world: everything can go wrong
        • JVM
        • Network
        • OS scheduler
        • File system
      • All this goes into the post-99% percentile
      • Requires monitoring
      • Usually, using the latest version helps
  12. Latency sources
      • Splits (and presplits)
        • Autosharding is great!
        • Puts have to wait
        • Impact: seconds
      • Balance
        • Regions move
        • Triggers a retry for the client
        • hbase.client.pause = 100ms since HBase 0.96
      • Garbage collection
        • Impact: 10’s of ms, even with a good config
        • Covered with the read path of this talk
  13. From steady to loaded and overloaded
      • The number of concurrent tasks is a function of
        • Number of cores
        • Number of disks
        • Number of remote machines used
        • Difficult to estimate
      • Queues are doomed to happen
        • hbase.regionserver.handler.count
      • So, for low latency
        • Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
        • RPC priorities: work in progress (HBASE-11048)
  14. From loaded to overloaded
      • MemStore takes too much room: flush, then block quite quickly
        • hbase.regionserver.global.memstore.size.lower.limit
        • hbase.regionserver.global.memstore.size
        • hbase.hregion.memstore.block.multiplier
      • Too many HFiles: block until compactions keep up
        • hbase.hstore.blockingStoreFiles
      • Too many WAL files: flush and block
        • hbase.regionserver.maxlogs
      (a configuration sketch follows below)
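These are region-server settings that normally live in hbase-site.xml; a Java Configuration sketch of the same knobs, with values that are illustrative rather than recommendations (defaults varied across releases):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OverloadGuards {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // MemStore pressure: start flushing at the lower limit,
    // block writers entirely at the upper limit.
    conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);              // fraction of heap
    conf.setFloat("hbase.regionserver.global.memstore.size.lower.limit", 0.95f); // fraction of the limit above
    conf.setInt("hbase.hregion.memstore.block.multiplier", 4);
    // Block new writes once a store has this many HFiles, until compactions catch up.
    conf.setInt("hbase.hstore.blockingStoreFiles", 10);
    // Force flushes rather than letting WAL files pile up.
    conf.setInt("hbase.regionserver.maxlogs", 32);
    return conf;
  }
}
```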
  15. Machine failure
      • Failure
        • Detect
        • Reallocate
        • Replay WAL
      • Replaying the WAL is NOT required for puts
        • hbase.master.distributed.log.replay (default true in 1.0)
      • Failure = detect + reallocate + retry
        • That’s in the range of ~1s for simple failures
        • Silent failures put you in the 10s range if the hardware does not help
        • zookeeper.session.timeout
  16. Single puts
      • Millisecond range
      • Spikes do happen in steady mode
        • 100ms
        • Causes: GC, load, splits
  17. Streaming puts
      HTable#setAutoFlushTo(false)
      HTable#put
      HTable#flushCommits
      • Same as simple puts, but
        • Puts are grouped and sent in the background
        • Load is taken into account
        • Does not block
      (a sketch follows below)
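A minimal sketch of that streaming pattern, again assuming a hypothetical table `test` with column family `f`:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamingPuts {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "test"); // assumed table
    table.setAutoFlushTo(false); // buffer puts client-side
    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      table.put(put); // returns quickly; puts are grouped and sent in the background
    }
    table.flushCommits(); // push whatever is still buffered
    table.close();
  }
}
```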
  18. Multiple puts
      hbase.client.max.total.tasks (default 100)
      hbase.client.max.perserver.tasks (default 5)
      hbase.client.max.perregion.tasks (default 1)
      • Decouples the client from a latency spike of a region server
      • Increases throughput by 50% compared to the old multiput
      • Makes splits and GC more transparent
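A sketch of how a client would set those limits explicitly; the values shown are simply the defaults quoted above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MultiputThrottles {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.max.total.tasks", 100);   // in-flight tasks for the whole client (default)
    conf.setInt("hbase.client.max.perserver.tasks", 5); // per region server (default)
    conf.setInt("hbase.client.max.perregion.tasks", 1); // per region (default)
    return conf;
  }
}
```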
  19. Conclusion on the write path
      • Single puts can be very fast
        • It’s not a « hard real time » system: there are spikes
      • Most latency spikes can be hidden when streaming puts
      • Failures are NOT that difficult for the write path
        • No WAL to replay
  20. And now for the read path
  21. Read path
      • Get/short scan are assumed for low-latency operations
      • Again, two APIs
        • Single get: HTable#get(Get)
        • Multi-get: HTable#get(List<Get>)
        • (a sketch of both follows below)
      • Four stages, same as the write path
        • Start (TCP connection, …)
        • Steady: when expected conditions are met
        • Machine failure: expected as well
        • Overloaded system: you may need to add machines or tune your workload
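A minimal sketch of the two APIs, once more assuming a hypothetical table `test`:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class Gets {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "test"); // assumed table
    // Single get: one RPC to the region server holding the row
    Result one = table.get(new Get(Bytes.toBytes("row-1")));
    System.out.println(one);
    // Multi-get: the client groups the Gets by region server
    List<Get> gets = new ArrayList<Get>();
    for (int i = 0; i < 100; i++) {
      gets.add(new Get(Bytes.toBytes("row-" + i)));
    }
    Result[] results = table.get(gets);
    System.out.println(results.length + " results");
    table.close();
  }
}
```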
  22. Multi get / Client
      • Group Gets by RegionServer
      • Execute them one by one
  23. Multi get / Server (diagram)
  24. Multi get / Server (diagram, continued)
  25. Access latency magnitudes
      Storage hierarchy: a different view (Dean, 2009)
      • Memory is 100,000x faster than disk!
      • Disk seek = 10ms
  26. Known unknowns
      • For each candidate HFile
        • Exclude by file metadata
          • Timestamp
          • Rowkey range
        • Exclude by bloom filter
      StoreFileScanner#shouldUseScanner()
  27. Unknown knowns
      • Merge-sort results polled from Stores
        • Seek each scanner to a reference KeyValue
        • Retrieve candidate data from disk
      • Multiple HFiles => multiple seeks
        • hbase.storescanner.parallel.seek.enable=true
      • Short-circuit reads
        • dfs.client.read.shortcircuit=true
      • Block locality
        • Happy clusters compact!
      HFileBlock#readBlockData()
      (a configuration sketch follows below)
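A sketch of the two read-path switches named above; note that short-circuit reads also need matching HDFS-side configuration on the DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ReadPathTuning {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Seek the scanners of multiple HFiles in parallel (region server setting)
    conf.setBoolean("hbase.storescanner.parallel.seek.enable", true);
    // Let local reads bypass the DataNode (requires HDFS configured for it)
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    return conf;
  }
}
```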
  28. BlockCache
      • Reuse previously read data
      • Maximize cache hit rate
        • Larger cache
        • Temporal access locality
        • Physical access locality
      BlockCache#getBlock()
  29. BlockCache showdown
      • LruBlockCache
        • Default, on-heap
        • Quite good most of the time
        • Evictions impact GC
      • BucketCache
        • Off-heap alternative
        • Serialization overhead
        • Large memory configurations
      http://www.n10k.com/blog/blockcache-showdown/
      L2 off-heap BucketCache makes a strong showing
      (a configuration sketch follows below)
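A sketch of enabling the off-heap BucketCache on a region server. The exact semantics of hbase.bucketcache.size (megabytes versus fraction of memory) varied across releases, so treat the value as an assumption to verify against your version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffheapBlockCache {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Off-heap L2 cache; the region server JVM also needs -XX:MaxDirectMemorySize
    conf.set("hbase.bucketcache.ioengine", "offheap");
    conf.setInt("hbase.bucketcache.size", 4096); // capacity in MB; illustrative
    return conf;
  }
}
```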
  30. Latency enemies: garbage collection
      • Use heap. Not too much. With CMS.
      • Max heap
        • 30GB (compressed pointers)
        • 8-16GB if you care about 9’s
      • Healthy cluster load
        • Regular, reliable collections
        • 25-100ms pause on regular interval
      • An overloaded RegionServer suffers GC overmuch
  31. Off-heap to the rescue?
      • BucketCache (0.96, HBASE-7404)
      • Network interfaces (HBASE-9535)
      • MemStore et al (HBASE-10191)
  32. Latency enemies: compactions
      • Fewer HFiles => fewer seeks
      • Evict data blocks!
      • Evict index blocks!!
        • hfile.block.index.cacheonwrite
      • Evict bloom blocks!!!
        • hfile.block.bloom.cacheonwrite
      • OS buffer cache to the rescue
        • Compacted data is still fresh
        • Better than going all the way back to disk
  33. Failure
      • Detect + reassign + replay
      • Strong consistency requires replay
        • Locality drops to 0
        • Cache starts from scratch
  34. Hedging our bets
      • HDFS hedged reads (2.4, HDFS-5776)
        • Reads on secondary DataNodes
        • Strongly consistent
        • Works at the HDFS level
      • Timeline consistency (HBASE-10070)
        • Reads on « Replica Regions »
        • Not strongly consistent
        • (a sketch follows below)
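A sketch of what a timeline-consistent read looks like with the client API from HBASE-10070. That API shipped after this talk, in HBase 1.0, so treat it as forward-looking; the table is again hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineGet {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "test"); // assumed table
    Get get = new Get(Bytes.toBytes("row-1"));
    get.setConsistency(Consistency.TIMELINE); // allow serving from a replica region
    Result r = table.get(get);
    if (r.isStale()) {
      // Answered by a secondary replica: the data may lag the primary.
      System.out.println("stale read");
    }
    table.close();
  }
}
```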
  35. Read latency in summary
      • Steady mode
        • Cache hit: < 1 ms
        • Cache miss: + 10 ms per seek
        • Writing while reading => cache churn
        • GC: 25-100ms pause on regular interval

      Network request + (1 - P(cache hit)) * (10 ms * seeks)

      For example, with a 0.5ms network round trip, a 95% cache hit rate, and 3 seeks on a miss: 0.5ms + 0.05 * 30ms = 2ms expected latency.

      • Same long-tail issues as the write path
      • Overloaded: same scheduling issues as the write path
      • Partial failures hurt a lot
  36. HBase ranges for 99% latency

               Put                    Streamed multiput   Get                    Timeline get
      Steady   milliseconds           milliseconds        milliseconds           milliseconds
      Failure  seconds                seconds             seconds                milliseconds
      GC       10’s of milliseconds   milliseconds        10’s of milliseconds   milliseconds
  37. What’s next
      • Less GC
        • Use fewer objects
        • Off-heap
      • Compressed BlockCache (HBASE-8894)
      • Preferred location (HBASE-4755)
      • The « magical 1% »
        • Most tools stop at the 99% latency
        • What happens after is much more complex
  38. Thanks!
      Nick Dimiduk, Hortonworks (@xefyr)
      Nicolas Liochon, Scaled Risk (@nkeywal)
      Hadoop Summit, June 4, 2014
