Speakers: Liang Xie and Honghua Feng (Xiaomi)
This talk covers the HBase environment at Xiaomi, including thoughts and practices around latency, hardware/OS/VM configuration, GC tuning, the use of a new write thread model and reverse scan, and block index optimization. It will also include some discussion of planned JIRAs based on these approaches.
About Xiaomi
Mobile internet company founded in 2010
Sold 18.7 million phones in 2013
Over $5 billion revenue in 2013
Sold 11 million phones in Q1, 2014
www.mi.com
About Our HBase Team
Founded in October 2012
5 members
Liang Xie
Shaohui Liu
Jianwei Cui
Liangliang He
Honghua Feng
Resolved 130+ JIRAs so far
Other Meaningful Latency Work
Long first "put" issue (HBASE-10010)
Token invalid (HDFS-5637)
Retry/timeout settings in DFSClient
Reduce write traffic? (HLog compression)
HDFS IO priority (HADOOP-10410)
Wish List
Real-time HDFS, especially priority related
GC-friendly core data structures
More off-heap; Shenandoah GC
TCP/disk IO characteristic analysis
Need more eyes on the OS
Stay tuned…
Some Patches Xiaomi Contributed
New write thread model (HBASE-8755)
Reverse scan (HBASE-4811)
Per table/CF replication (HBASE-8751)
Block index key optimization (HBASE-7845)
1. New Write Thread Model
Old model:
[Diagram: 256 WriteHandler threads, each appending to a local buffer, writing to HDFS, and syncing to HDFS itself]
Problem: each WriteHandler does everything, causing severe lock contention!
New Write Thread Model
New model:
[Diagram: 256 WriteHandler threads append to a local buffer; 1 AsyncWriter writes to HDFS; 4 AsyncSyncers sync to HDFS; 1 AsyncNotifier notifies the waiting writers]
New Write Thread Model
Low load: no improvement
Heavy load: huge improvement (3.5x)
2. Reverse Scan
[Diagram: the key-values of rows 1-6 spread across three sorted scanners (HFiles/Memstores)]
1. All scanners seek to 'previous' rows (SeekBefore)
2. Figure out the next row: the max 'previous' row
3. All scanners seek to the first KV of the next row (SeekTo)
Performance: 70% of forward scan
3. Per Table/CF Replication
Source cluster tables: T1 (cfA, cfB), T2 (cfX, cfY)
PeerA (backup): replicates everything (T1:cfA,cfB; T2:cfX,cfY)
PeerB only needs T2:cfX
PeerB creates T2 only: replication can't work!
PeerB creates both T1 and T2: all data gets replicated!
Need a way to specify which data to replicate!
1. Cross-Row Transaction: Themis
http://github.com/xiaomi/themis
Based on Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
Two-phase commit: strong cross-table/row consistency
Global timestamp server: globally, strictly incremental timestamps
No changes to HBase internals: built on the HBase client and coprocessors
Performance vs. raw HBase: read 90%, write 23% (same downgrade as Google Percolator)
More details: HBASE-10999
2. HLog Compactor
[Diagram: HLogs 1,2,3 hold entries from Region 1, Region 2, … Region x; memstores flush to HFiles]
Region x: few writes, but they scatter across many HLogs
PeriodicMemstoreFlusher flushes old memstores forcefully, but:
'flushCheckInterval'/'flushPerChanges' are hard to configure
Forced flushes result in 'tiny' HFiles
HBASE-10499: a problematic region can't be flushed at all!
HLog Compactor
[Diagram: live entries from HLogs 1,2,3,4 rewritten into HLog x]
Compact: HLogs 1,2,3,4 → HLog x
Archive: HLogs 1,2,3,4
3. Adjusted Delete Semantics
Scenario 1:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out
Scenario 2:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out
Fix: "a delete can't mask KVs with a larger mvcc (put later)"
4. Coordinated Compaction
[Diagram: multiple RSes compacting against HDFS (a global resource) at the same time: compact storm!]
Compaction uses a global resource (HDFS), but whether to compact is decided locally by each regionserver!
Coordinated Compaction
[Diagram: each RS asks the Master "Can I?"; the Master answers "OK" or "NO" based on the load on HDFS (the global resource)]
Compactions are scheduled by the master: no more compact storms
5. Quorum Master
[Diagram: active master (A) and standby master; RSes; a 3-node ZooKeeper ensemble (zk1, zk2, zk3) holding info/states; on failover the standby reads info/states from ZooKeeper]
While the active master serves, the standby master stays 'really' idle
When the standby master becomes active, it must rebuild its in-memory state
Quorum Master
[Diagram: Masters 1-3 form a quorum, one active (A), the others standby; RSes talk to the active master; no ZooKeeper]
Better master failover performance: no phase to rebuild in-memory state
No external (ZooKeeper) dependency
No potential consistency issues
Simpler deployment
Better restart performance for BIG clusters (10K+ regions)
Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang
Xing Yong, Hao Huang, Hailei Li
Shaohui Liu, Jianwei Cui, Liangliang He
Dihao Chen
This is the throughput comparison against a single regionserver: when the write load is low there is almost no improvement, but as the write load gets heavier and heavier the improvement is pretty amazing, up to 3.5x.
Actually, when the write load is very low the new model has a small downgrade (about 10%); Michael Stack fixed this downgrade in another patch. Thanks, Stack!
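To make the old-vs-new contrast concrete, here is a minimal sketch of the new pipeline's shape (hypothetical names and structure, not the actual HBASE-8755 code): many handler threads only append edits to a shared buffer and wait, while a single background thread drains the buffer, writes and syncs the whole batch at once, and then wakes every waiting handler. Batching the sync across handlers is what removes the per-handler lock contention of the old model.

```python
import threading
import queue

class MiniWALPipeline:
    """Toy model of the handler / async-writer split (hypothetical)."""

    def __init__(self):
        self.buffer = queue.Queue()          # local buffer appended by handlers
        self.synced = []                     # stands in for durable HDFS state
        self._writer = threading.Thread(target=self._writer_loop, daemon=True)
        self._writer.start()

    def append(self, edit):
        done = threading.Event()             # handler blocks on this, not on a lock
        self.buffer.put((edit, done))
        return done

    def _writer_loop(self):
        while True:
            batch = [self.buffer.get()]      # block for the first edit...
            while not self.buffer.empty():   # ...then grab everything pending
                batch.append(self.buffer.get())
            # "AsyncWriter/AsyncSyncer" step: write the batch, one sync for all
            self.synced.extend(e for e, _ in batch)
            # "AsyncNotifier" step: wake every handler whose edit is durable
            for _, done in batch:
                done.set()

pipeline = MiniWALPipeline()
events = [pipeline.append(f"edit-{i}") for i in range(100)]
for ev in events:
    ev.wait()
print(len(pipeline.synced))  # 100 edits made durable with batched syncs
```

The real implementation uses dedicated AsyncWriter/AsyncSyncer/AsyncNotifier threads rather than one loop, but the handler-side behavior (append, then wait for a notification) is the same idea.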
The second one is reverse scan. Before explaining how reverse scan works, I want to point out an important fact that helps in understanding this patch: the granularity of a scan is the row, not the key-value.
All key-values of a row are read out in order from HFiles or the Memstore, assembled into a result row in the RegionServer's memory, and then returned to the client. This work is the same for both forward and reverse scans.
So the difficulty of reverse scan is, when the current row is done, figuring out which row comes next, jumping to that row, and starting to scan. Let's see how we do it.
Since there are two extra seek operations compared to forward scan, there is a 30% performance downgrade relative to forward scan, almost the same as in LevelDB. Finally, thanks very much to Chunhui for porting our patch to trunk!
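The row-jump logic above can be sketched as follows (hypothetical helper functions, not the actual HBase scanner code). Each "scanner" is a sorted list of (row, kv) pairs standing in for one HFile or Memstore; given the current row, every scanner seeks before it (SeekBefore), and the next row to return is the maximum of those 'previous' rows:

```python
from bisect import bisect_left

def previous_row(scanner, current_row):
    """Largest row in this scanner strictly smaller than current_row."""
    rows = [row for row, _ in scanner]
    i = bisect_left(rows, current_row)
    return rows[i - 1] if i > 0 else None

def next_row_backward(scanners, current_row):
    """Step 2 of the slide: the next row is the max 'previous' row."""
    candidates = [previous_row(s, current_row) for s in scanners]
    candidates = [r for r in candidates if r is not None]
    return max(candidates) if candidates else None

# Three sorted scanners, like the three columns of the slide's diagram:
s1 = [("row2", "kv2"), ("row3", "kv1"), ("row3", "kv3"),
      ("row4", "kv2"), ("row4", "kv5"), ("row5", "kv2")]
s2 = [("row1", "kv2"), ("row3", "kv2"), ("row3", "kv4"),
      ("row4", "kv4"), ("row4", "kv6"), ("row5", "kv3")]
s3 = [("row1", "kv1"), ("row2", "kv1"), ("row2", "kv3"),
      ("row4", "kv1"), ("row4", "kv3"), ("row6", "kv1")]

print(next_row_backward([s1, s2, s3], "row5"))  # row4
print(next_row_backward([s1, s2, s3], "row3"))  # row2
```

After the next row is chosen, step 3 is a normal forward SeekTo on every scanner to the first KV of that row, and the row is assembled exactly as in a forward scan.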
This is the third patch: per table/CF replication. Suppose we have a source cluster with two tables and four column families, all of which can be replicated. For data safety we deployed a peer cluster for backup, and the source cluster replicates all its data to this backup cluster; that's just what we want, and the replication works pretty well.
Then, for some reason such as data analysis or an experiment, we deployed another peer cluster, and our experimental program only needs data from cfX of table T2. What kind of replication do we expect?
Ideally we expect only the data from cfX of T2 to be replicated… but replication can't work that way!
So we have to create all tables and column families in PeerB, and all the data gets replicated. That's really bad, both in terms of bandwidth between the source and PeerB and in terms of PeerB's resource usage.
So we implemented this feature, which allows specifying which data will be replicated to a peer cluster. For PeerA the add_peer command is the same as before, since PeerA wants to replicate all the data; but for PeerB, add_peer takes an additional argument specifying which tables or column families to replicate.
The implementation change is quite straightforward: in the source cluster, when parsing the log entries, the replication source thread ignores everything else and replicates only the entries from cfX of table T2.
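The filtering on the source side can be sketched like this (a hypothetical stand-in for the replication source's check, not the actual HBASE-8751 code). The peer's table-CF map decides which WAL entries are shipped; an empty CF list would mean "all column families of that table":

```python
def should_replicate(table_cfs, table, cf):
    """Decide whether a WAL entry for (table, cf) goes to this peer."""
    if table_cfs is None:          # no filter configured: replicate everything
        return True
    if table not in table_cfs:
        return False
    cfs = table_cfs[table]
    return not cfs or cf in cfs    # empty list = every CF of the table

peer_a = None                      # backup peer: replicate all data
peer_b = {"T2": ["cfX"]}           # experimental peer: only T2:cfX

entries = [("T1", "cfA"), ("T1", "cfB"), ("T2", "cfX"), ("T2", "cfY")]
print([e for e in entries if should_replicate(peer_b, *e)])  # [('T2', 'cfX')]
print(all(should_replicate(peer_a, *e) for e in entries))    # True
```

With this check in the replication source thread, PeerB only ever receives T2:cfX entries, so PeerB only needs to create that table.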
This is the fourth patch: block index key optimization. Its goal is to reduce the overall block index size.
Suppose two contiguous blocks: the last key-value's row in Block1 is "ab", and the first key-value's row in Block2 is "ah, hello world". Before our patch, the block index key of Block2 is "ah, hello world" (the first key-value of Block2). After our patch, the block index key is "ac": a fake key, the minimal key that is larger than the last key-value of Block1 and less than or equal to the first key-value of Block2, with the shortest row length. The new block index key is much shorter than the old one.
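The fake-key construction is the same idea as LevelDB's shortest-separator trick; a minimal sketch over plain strings (hypothetical helper, not the actual HBASE-7845 patch) looks like this:

```python
def shortest_separator(last_row, first_row):
    """Shortest row key r with last_row < r <= first_row (sketch)."""
    # Find the shared prefix of the two rows.
    common = 0
    while (common < len(last_row) and common < len(first_row)
           and last_row[common] == first_row[common]):
        common += 1
    if common == len(last_row):
        # last_row is a prefix of first_row: cannot shorten below first_row
        return first_row
    # Bump the first differing character of last_row, if the result
    # still sorts at or before first_row.
    candidate = last_row[:common] + chr(ord(last_row[common]) + 1)
    return candidate if candidate <= first_row else first_row

sep = shortest_separator("ab", "ah, hello world")
print(sep)  # 'ac', much shorter than storing 'ah, hello world'
```

Since the index only needs any key that separates the two blocks, storing "ac" instead of the full first key of Block2 shrinks the block index without changing lookup results.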
Now let’s continue to talk about some work items we are currently working on
The second one is the HLog compactor. Its goal is to keep as few HLogs as possible, which ultimately improves regionserver failover performance: the fewer HLog files there are to split, the faster the failover.
We know a regionserver typically serves many regions, and the write patterns of these regions can be quite different, so their flush frequency and timing can also differ a lot. Consider a region x whose memstore contains very few entries: no flush is triggered for a long time, and all of its entries scatter across many HLogs. Even though every other entry in those HLogs has been flushed to HFiles, they still can't be archived, because they contain entries from region x…
We do have a background flusher thread that flushes old memstores forcefully, but it has some obvious drawbacks. First, it's hard to configure good-enough values for flushCheckInterval and flushPerChanges; second, forceful flushes result in tiny HFiles; and last, as in HBASE-10499, some problematic regions just can't be flushed at all by this background flusher thread!
Our patch works like this: we introduce another background thread, the HLog compactor. When the total HLog size is too large compared to the memstore size (meaning we have flushed enough but not archived enough), we trigger the HLog compactor. It reads the entries from all active HLog files: if an entry is still in some region's memstore, it is written to a new HLog file; if it is not in any memstore (meaning it has already been flushed to an HFile), it is ignored. After the compaction, we can archive all the old HLog files without flushing any memstore.
We have finished this feature and are testing it in our test cluster; we'll share the patch after the testing.
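The compaction pass described above can be sketched as follows (hypothetical data layout; the real patch operates on WAL files). Each active HLog is a list of (region, sequence id, edit) entries, and an entry is still "live" only if its region's memstore has not flushed past its sequence id; live entries are rewritten into one new HLog, after which all old HLogs can be archived:

```python
def compact_hlogs(hlogs, flushed_seq):
    """Keep only entries not yet flushed; flushed_seq maps region ->
    highest sequence id already persisted to HFiles."""
    new_hlog = []
    for hlog in hlogs:
        for region, seq, edit in hlog:
            if seq > flushed_seq.get(region, -1):   # still only in memstore
                new_hlog.append((region, seq, edit))
    return new_hlog

hlogs = [
    [("r1", 1, "a"), ("rx", 2, "b")],
    [("r2", 3, "c"), ("rx", 4, "d")],
    [("r1", 5, "e")],
]
# r1 and r2 have flushed everything; the rarely-written region rx has not.
print(compact_hlogs(hlogs, {"r1": 5, "r2": 3}))  # [('rx', 2, 'b'), ('rx', 4, 'd')]
```

Only region x's two live entries survive into the new HLog, so all four old HLogs become archivable without forcing region x to flush a tiny HFile.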
Let’s consider two scenarios
The first scenario: first we write kvA at timestamp t0, then delete it and flush, then write it again, and finally we try to read it. The result is that we can't read it out, since both writes are masked by the delete.
The second scenario is the same as the first, except that before writing kvA for the second time we trigger a major compaction. This time kvA can be read out, since the delete was collected by the major compaction.
This is inconsistent: major compaction is transparent to the client, yet the read results differ depending on whether a major compaction occurred. The root cause is that a delete can mask even key-values put later than it.
The fix is simple: since mvcc represents the order in which all writes (puts and deletes) enter HBase, we use it as an additional criterion to prevent a delete from masking a later put.
We have had some heated discussion on this patch; personally I still believe it deserves further thought and discussion.
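The adjusted check can be sketched like this (hypothetical helper, not the actual patch): a KV is masked by a matching delete only if the delete's mvcc is not older, so a delete can no longer hide a put that entered HBase after it, even when both carry the same user timestamp t0:

```python
def is_masked(put, delete):
    """Adjusted delete semantics: a delete only masks puts that
    entered HBase before it (smaller or equal mvcc)."""
    same_cell = put["row"] == delete["row"] and put["ts"] == delete["ts"]
    return same_cell and delete["mvcc"] >= put["mvcc"]

delete     = {"row": "kvA", "ts": 0, "mvcc": 2}
put_before = {"row": "kvA", "ts": 0, "mvcc": 1}   # written, then deleted
put_after  = {"row": "kvA", "ts": 0, "mvcc": 3}   # re-written after the delete

print(is_masked(put_before, delete))  # True: the delete still masks it
print(is_masked(put_after, delete))   # False: the later put survives (the fix)
```

With mvcc as the extra criterion, both scenarios return the same answer whether or not a major compaction happened in between.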
The fourth item is coordinated compaction. We talk about compact storms from time to time; let's look at how one happens. When a regionserver wants to compact, it just triggers the compaction, which reads from HDFS and writes back to HDFS, and a regionserver can trigger a new compaction no matter how overloaded the whole system is.
So the problem is that what compaction ultimately uses is a global HDFS, but whether to trigger a compaction is a local decision made by each regionserver.
What we propose is using the master as a coordinator for compaction scheduling. It works this way: when a regionserver wants to compact, it asks the master; if the master says yes, it can trigger the compaction, but if the master thinks the system is loaded, it rejects all later compaction requests until the system is no longer loaded.
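A minimal sketch of the master-side admission logic (hypothetical class and load criterion; the proposal does not prescribe a specific policy) could grant at most a fixed number of concurrent compactions cluster-wide:

```python
class CompactionCoordinator:
    """Toy master-side scheduler: at most `limit` compactions at once."""

    def __init__(self, limit):
        self.limit = limit
        self.running = set()       # regionservers currently compacting

    def request(self, rs):
        if len(self.running) < self.limit:
            self.running.add(rs)
            return True            # "OK"
        return False               # "NO": system loaded, ask again later

    def finished(self, rs):
        self.running.discard(rs)

master = CompactionCoordinator(limit=2)
print(master.request("rs1"))  # True
print(master.request("rs2"))  # True
print(master.request("rs3"))  # False: would be a compact storm
master.finished("rs1")
print(master.request("rs3"))  # True: capacity freed up
```

The key point is only that the admit/reject decision sees global state; a real policy would look at HDFS load rather than a fixed permit count.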
The last item is the quorum master. This is a master redesign, and there has already been some discussion of it; I noticed that Jimmy Xiang from Cloudera and Mikhail from WANdisco have put some effort into it. That's great!
Current master design has 2 problems:
1. The first problem is that some system-wide metadata and state are maintained only in the active master. To support master failover, this metadata and state are stored in ZooKeeper as well, and during failover the new active master must read from ZooKeeper to rebuild its in-memory state.
2. The second problem is how ZooKeeper is used as the communication channel between the master and regionservers for the region-assignment state machine: ZooKeeper's asynchronous notification mechanism is just not suitable for state-machine logic, and it is the root cause of many tricky bugs found over the years.
We propose this new design: instead of storing the in-memory state in ZooKeeper, we replicate it among all master instances using a consensus protocol such as Raft or Paxos. This way, when the active master fails, a new active master is elected via the consensus protocol among the surviving standby masters, and it can serve immediately without reading state from anywhere else.
Quorum master has some advantages:
Better master failover performance
Better restart performance for big clusters, since communication between the master and ZooKeeper is the bottleneck when a large number of region-assignment tasks run concurrently
No external dependency on ZooKeeper
No more potential consistency issues
Simpler deployment