In this session, you will learn about the work Xiaomi has done to improve the availability and stability of our HBase clusters, including cross-site data and service backup and a coordinated compaction framework. You'll also learn about the Themis framework, which supports cross-row transactions on HBase based on Google's Percolator algorithm, and its usage in Xiaomi's applications.
5. Clusters and Scenarios
Mainland China
20 online clusters / 2 offline clusters in 3 data centers
AWS
4 online clusters / 1 offline cluster in 2 regions
Online Service
Mi Cloud, Mi Push, Galaxy, Mi Message,...
Offline Processing
User Profile, Trace, Recommendation, ...
6. Scenario A: Mi Cloud
Personal cloud storage for smart phones
Numbers
90+ million users, 3X increase in 2014
500 billion rows, 6X increase in 2014
1000+ regions in the largest table
See: https://i.mi.com
7. Scenario B: Mi Push
Push service on Android
Data stored in HBase
Pub-sub relations of topics and devices
Messages to each device
Numbers
200+ million users
Push 2 billion+ messages every day
200,000+ requests per second at peak
8. Deployment
Two clusters with master-master replication in different data centers
Clients switch clusters through configs stored on ZooKeeper (see the sketch after this list)
A canary is used for availability checks and alerts
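Since the switch is driven by ZooKeeper, the client side can be sketched roughly as below; the znode path and payload format are illustrative assumptions, not Xiaomi's actual implementation.

```java
import org.apache.zookeeper.ZooKeeper;

// Sketch: resolve which cluster a client should talk to from a ZooKeeper znode.
// The znode path and payload format are illustrative assumptions.
public class ActiveClusterResolver {
    private static final String SWITCH_ZNODE = "/client-config/active-hbase-cluster"; // hypothetical path

    public static String resolveActiveCluster(String configZkQuorum) throws Exception {
        ZooKeeper zk = new ZooKeeper(configZkQuorum, 30000, event -> { /* re-resolve on change */ });
        try {
            // The znode is assumed to hold the connection string of the currently active cluster.
            byte[] data = zk.getData(SWITCH_ZNODE, true, null);
            return new String(data, "UTF-8");
        } finally {
            zk.close();
        }
    }
}
```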
10. Long Full GC Pauses for RegionServer
Problem: Long full GC pauses cause the ZooKeeper session to expire
zookeeper.session.timeout = 30s
A full GC pause of a RegionServer with a 30 GB heap can last 40 s
Solution:
Multiple RegionServer instances per node
More memory off-heap using the bucket cache (see the config sketch below)
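For the off-heap bucket cache, the relevant hbase-site.xml properties look roughly like this; the sizes are examples rather than Xiaomi's production values, and the RegionServer's direct memory limit (-XX:MaxDirectMemorySize) has to be raised accordingly in hbase-env.sh.

```xml
<!-- Illustrative hbase-site.xml snippet: keep a small on-heap LRU cache and
     push the bulk of the block cache off-heap. Values are examples only. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>16384</value> <!-- off-heap bucket cache capacity, in MB -->
</property>
```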
11. Hotspot for Temporal Data
Problem: Writes of temporal data go to a small set of regions
Solution: Salted Table
Based on SaltedHTable, open-sourced by the Intel Hadoop team
See: https://github.com/intel-hadoop/SaltedHTable
Transparent to applications through table schema support (see the key-salting sketch below)
MapReduce support
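The salting idea, in rough form: derive a small deterministic prefix from the original key so that time-ordered writes spread across many regions. The bucket count and prefix layout below are illustrative assumptions, not SaltedHTable's exact scheme.

```java
import java.util.Arrays;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of row-key salting for temporal data: a hash-derived one-byte prefix
// spreads sequential (timestamp-leading) keys across N buckets so writes hit
// many regions instead of one hot region.
public class SaltedKey {
    private static final int BUCKETS = 16; // illustrative bucket count

    // Prepend a deterministic salt byte; readers can recompute it from the key,
    // and scans fan out to all BUCKETS prefixes and merge the results.
    public static byte[] salt(byte[] originalKey) {
        int bucket = (Arrays.hashCode(originalKey) & 0x7fffffff) % BUCKETS;
        return Bytes.add(new byte[] { (byte) bucket }, originalKey);
    }
}
```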
12. Coordinated Compaction
Problem: Compaction storm
Solution:
A compaction manager in HMaster coordinates all compactions in the cluster
Before a compaction starts, a RegionServer must acquire a compaction quota (see the sketch below)
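The slides do not show the implementation, but the coordination idea can be sketched as a quota manager living in HMaster that bounds the number of concurrent compactions; the class name and the RPC plumbing are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the coordination idea: a manager (running in HMaster)
// hands out a bounded number of compaction quotas, so only a limited number of
// compactions run in the cluster at the same time.
public class CompactionQuotaManager {
    private final Semaphore quota;

    public CompactionQuotaManager(int maxConcurrentCompactions) {
        this.quota = new Semaphore(maxConcurrentCompactions);
    }

    // Called (via RPC in the real design) by a RegionServer before it starts a compaction.
    public boolean tryAcquire(String regionServer) {
        return quota.tryAcquire();
    }

    // Called when the compaction finishes or is aborted.
    public void release(String regionServer) {
        quota.release();
    }
}
```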
13. Exception Aggregation
Purpose: Find potential bugs in the clusters
Solution:
Write HMaster/RegionServer logs asynchronously to HDFS through Scribe
Use MapReduce to aggregate errors and exceptions across clusters (see the mapper sketch below)
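A minimal sketch of the aggregation job's map side, assuming one log line per input record; the regex and the job wiring are illustrative.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper that scans HMaster/RegionServer log lines (already shipped to HDFS via
// Scribe) and emits exception class names, to be counted by a standard sum reducer.
public class ExceptionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Pattern EXCEPTION = Pattern.compile("([\\w.]+(?:Exception|Error))");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        Matcher m = EXCEPTION.matcher(line.toString());
        if (m.find()) {
            context.write(new Text(m.group(1)), ONE);
        }
    }
}
```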
14. Table Based Replication Queue (in progress)
Problems:
Too much data stored on ZooKeeper
Over 200 MB of replication data for a disabled peer
Too many writes to ZooKeeper
5k/s writes to ZooKeeper in a cluster with 100k/s writes (HBASE-12636)
15. Table Based Replication Queue (in progress)
Solution: Move replication queue to a system table
Row key: server name + peer id + HLog name
One column records the offset up to which the log has been replicated (see the sketch below)
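A sketch of one row in such a system table, assuming the row-key layout above; the column family, qualifier, and separator are illustrative assumptions.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the proposed queue layout: one row per (server name, peer id, hlog name),
// with a single column holding the offset up to which that log has been replicated.
public class ReplicationQueueRow {
    private static final byte[] FAMILY = Bytes.toBytes("q");       // illustrative
    private static final byte[] OFFSET = Bytes.toBytes("offset");  // illustrative

    public static Put offsetPut(String serverName, String peerId, String hlogName, long offset) {
        byte[] rowKey = Bytes.toBytes(serverName + "," + peerId + "," + hlogName);
        return new Put(rowKey).addColumn(FAMILY, OFFSET, Bytes.toBytes(offset));
    }
}
```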
16. Asynchronous Event Notification (in progress)
Purposes:
Incremental statistics of data in HBase
Table schema transformation
Asynchronous data indexing
Solution:
An asynchronous event notification framework on HBase (HBASE-12884)
Replication based implementation:
Add a fake replication peer, which can receive the WAL edits from HBase clusters
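The consumer side of this idea can be sketched as a listener fed by the fake peer; the interface and the counting example below are hypothetical illustrations, not the API proposed in HBASE-12884.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.hbase.Cell;

// Hypothetical consumer of the notification framework: the fake replication peer
// ships WAL edits to listeners that keep incremental statistics, transform schemas,
// or build indexes asynchronously. The onEdits signature is an illustration only.
public class EditCountListener {
    private final Map<String, AtomicLong> editsPerTable = new ConcurrentHashMap<>();

    // Called for each batch of cells the fake peer receives from the WAL.
    public void onEdits(String table, List<Cell> cells) {
        editsPerTable.computeIfAbsent(table, t -> new AtomicLong()).addAndGet(cells.size());
    }

    public long editsSeen(String table) {
        AtomicLong count = editsPerTable.get(table);
        return count == null ? 0L : count.get();
    }
}
```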
18. Cross-Row Transaction
Why cross-row transaction?
Cross-row data consistency
Rows in different regions / tables
Example
Music index building
19. Cross-Row Transaction
Features
ACID
No central coordinator
Integrated without changes to HBase code
Google’s Percolator
Large-scale Incremental Processing Using Distributed Transactions and Notifications, by Daniel Peng and Frank Dabek, 2010
Themis
https://github.com/Xiaomi/themis
Provides cross-row transactions on HBase based on Percolator
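For a feel of the programming model, here is a hedged sketch of the cash-transfer example from the later slides. The Transaction and ThemisPut class and method names are modeled on the examples in the Themis repository and may not match the actual API exactly; treat this as pseudocode for the programming model rather than a verified snippet.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.util.Bytes;

// Percolator-style cross-row transaction sketch. Transaction and ThemisPut are
// assumed to come from the Themis client jar; names may differ from the real API.
public class TransferExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HConnection connection = HConnectionManager.createConnection(conf);

        Transaction transaction = new Transaction(conf, connection);
        // New balances after moving $3 from Joe to Bob (Joe: $17, Bob: $12, matching
        // the read example on the later slides); both rows commit atomically even
        // though they may live in different regions.
        transaction.put(Bytes.toBytes("CashTable"), new ThemisPut(Bytes.toBytes("Joe"))
            .add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes(17L)));
        transaction.put(Bytes.toBytes("CashTable"), new ThemisPut(Bytes.toBytes("Bob"))
            .add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes(12L)));
        transaction.commit();

        connection.close();
    }
}
```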
21. Timestamp Server
Separate a long timestamp into two parts:
Higher 46 bits: synced with system time
Lower 18 bits: incremental counter
Hundreds of thousands of unique timestamps in each millisecond
Monotonically increasing within one timestamp server (see the sketch below)
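A minimal sketch of that layout: the low 18 bits give 2^18 = 262,144 counter values per millisecond, which matches the "hundreds of thousands of timestamps per millisecond" figure. Class and variable names are illustrative.

```java
// Sketch of the 46 + 18 bit layout: the high 46 bits track wall-clock milliseconds
// and the low 18 bits are a per-millisecond counter.
public class TimestampOracle {
    private static final int COUNTER_BITS = 18;

    private long lastMillis = 0;
    private long counter = 0;

    // Monotonically increasing within one timestamp server.
    public synchronized long next() {
        long now = System.currentTimeMillis();
        if (now > lastMillis) {
            lastMillis = now;  // advance the millisecond part, never move it backwards
            counter = 0;
        } else {
            counter++;         // same millisecond (or clock skew): keep counting;
                               // a real server would wait for the next ms on overflow
        }
        return (lastMillis << COUNTER_BITS) | counter;
    }
}
```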
22. Timestamp Server
Monotonically increasing across timestamp servers
Periodically save a future timestamp into ZooKeeper
Allocated timestamps must be smaller than the saved timestamp
Another server must read the saved timestamp when it starts (see the sketch after this list)
High availability
High throughput : 600,000 RPCs per second
Batch concurrent requests in one RPC
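The reserve-ahead idea can be sketched as follows, with a small store interface standing in for the ZooKeeper znode; the reserve window, names, and store abstraction are illustrative assumptions.

```java
// Sketch of the reserve-ahead scheme: the active server periodically persists a
// timestamp well in the future and never hands out anything >= that value; a
// server taking over starts from the persisted value, so allocations stay
// increasing across servers.
public class ReserveAheadAllocator {
    public interface PersistedLimitStore {   // stands in for a ZooKeeper znode
        long read() throws Exception;
        void write(long limit) throws Exception;
    }

    private static final long RESERVE_WINDOW = 1L << 30; // how far ahead to reserve (illustrative)

    private final PersistedLimitStore store;
    private long next;           // next timestamp to hand out
    private long persistedLimit; // allocations must stay strictly below this value

    public ReserveAheadAllocator(PersistedLimitStore store) throws Exception {
        this.store = store;
        this.persistedLimit = store.read(); // a new leader starts from the saved limit
        this.next = persistedLimit;
        reserveMore();
    }

    private void reserveMore() throws Exception {
        persistedLimit = next + RESERVE_WINDOW;
        store.write(persistedLimit); // must be durable before allocating past the old limit
    }

    public synchronized long allocate() throws Exception {
        if (next + 1 >= persistedLimit) {
            reserveMore(); // done ahead of time in a real server, off the hot path
        }
        return next++;
    }
}
```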
23. Themis
Cross-row mutation example : Cash Table
Rows for Bob and Joe are in different regions
Transfer $3 from Joe to Bob atomically
Two auxiliary columns : lock column and commit column
Two-phase commit
Prewrite Phase
Commit Phase
checkAndMutate of HBase: guarantees atomicity for a single row
24. Prewrite Phase
Fetch a prewrite timestamp from timestamp server (prewriteTs=99)
Select primary and secondary columns
Primary
Column : (Joe, f:c)
PrimaryLock: {secondaries : [(Bob, f:c)]}
Secondaries
Column : (Bob, f:c)
SecondaryLock: {primary : (Joe, f:c)}
25. Prewrite Phase
Prewrite primary column
Write primary lock and data if no lock exists in lock column
checkAndMutate of HBase to guarantee the atomicity
Prevents other clients from mutating the same column concurrently
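In terms of the plain HBase 1.x client API, prewriting one column can be sketched with checkAndPut as below. The lock column naming is illustrative, and the real Themis implementation performs additional conflict checks on the server side.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Prewrite sketch: write the data cell and the lock cell at prewriteTs only if
// the lock column is still empty.
public class PrewriteSketch {
    static final byte[] DATA_FAMILY = Bytes.toBytes("f");
    static final byte[] DATA_QUALIFIER = Bytes.toBytes("c");
    static final byte[] LOCK_FAMILY = Bytes.toBytes("l");      // illustrative
    static final byte[] LOCK_QUALIFIER = Bytes.toBytes("f#c"); // illustrative

    // Returns false if another transaction already holds a lock on this column.
    static boolean prewrite(Table table, byte[] row, long prewriteTs,
                            byte[] value, byte[] serializedLock) throws IOException {
        Put put = new Put(row);
        put.addColumn(DATA_FAMILY, DATA_QUALIFIER, prewriteTs, value);          // the data itself
        put.addColumn(LOCK_FAMILY, LOCK_QUALIFIER, prewriteTs, serializedLock); // primary or secondary lock
        // With a null expected value, checkAndPut succeeds only if no lock cell exists,
        // so two clients cannot prewrite the same column concurrently.
        return table.checkAndPut(row, LOCK_FAMILY, LOCK_QUALIFIER, null, put);
    }
}
```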
27. Commit Phase
Fetch commit timestamp from timestamp server (commitTs=100)
Commit Primary
Delete the lock and write commit column if the lock exists
checkAndMutate of HBase to guarantee the atomicity
Decides the success or failure of the whole transaction
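With the HBase 1.x client API, committing the primary column can be sketched using checkAndMutate over a RowMutations, so the lock check, lock deletion, and commit-column write happen in one atomic row operation. Column names are illustrative; the commit column follows the slide notation c:f#c.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RowMutations;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.util.Bytes;

// Commit sketch: verify our lock is still present, delete it, and write the commit
// column (whose value is prewriteTs) at commitTs, all in one atomic row operation.
public class CommitSketch {
    static final byte[] LOCK_FAMILY = Bytes.toBytes("l");      // illustrative
    static final byte[] LOCK_QUALIFIER = Bytes.toBytes("f#c"); // illustrative
    static final byte[] COMMIT_FAMILY = Bytes.toBytes("c");
    static final byte[] COMMIT_QUALIFIER = Bytes.toBytes("f#c");

    // Returns false if the lock is already gone, i.e. the whole transaction has failed.
    static boolean commitPrimary(Table table, byte[] row, long prewriteTs, long commitTs,
                                 byte[] serializedLock) throws IOException {
        RowMutations mutations = new RowMutations(row);
        mutations.add(new Delete(row).addColumn(LOCK_FAMILY, LOCK_QUALIFIER, prewriteTs));
        mutations.add(new Put(row).addColumn(COMMIT_FAMILY, COMMIT_QUALIFIER, commitTs,
            Bytes.toBytes(prewriteTs)));
        // Succeeds only if the lock written during prewrite is still there; this single
        // atomic step decides the success or failure of the whole transaction.
        return table.checkAndMutate(row, LOCK_FAMILY, LOCK_QUALIFIER, CompareOp.EQUAL,
            serializedLock, mutations);
    }
}
```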
29. Themis Read
Fetch a read timestamp (readTs=101)
Read commit columns with commitTs < readTs
(Joe, c:f#c) => (100 : 99)
(Bob, c:f#c) => (100 : 99)
Read data column with prewriteTs
prewriteTs is the value stored in the commit column
(Joe, 99: $17) and (Bob, 99: $12)
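The two-step read can be sketched with plain Gets: first find the newest commit record below readTs, then read the data cell at the prewriteTs it points to. Column names follow the slide's notation (data column f:c, commit column c:f#c) but are still illustrative, and the handling of pending locks is omitted.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Snapshot-read sketch: resolve the commit record visible at readTs, then fetch
// the data version written at the prewriteTs it records.
public class ThemisReadSketch {
    static final byte[] DATA_FAMILY = Bytes.toBytes("f");
    static final byte[] DATA_QUALIFIER = Bytes.toBytes("c");
    static final byte[] COMMIT_FAMILY = Bytes.toBytes("c");
    static final byte[] COMMIT_QUALIFIER = Bytes.toBytes("f#c");

    static byte[] snapshotRead(Table table, byte[] row, long readTs) throws IOException {
        // Step 1: newest commit entry with commitTs < readTs; its value is the prewriteTs.
        Get commitGet = new Get(row).addColumn(COMMIT_FAMILY, COMMIT_QUALIFIER);
        commitGet.setTimeRange(0, readTs); // the upper bound of the range is exclusive
        Result commit = table.get(commitGet);
        if (commit.isEmpty()) {
            return null; // nothing committed before readTs
        }
        long prewriteTs = Bytes.toLong(commit.getValue(COMMIT_FAMILY, COMMIT_QUALIFIER));

        // Step 2: read the data cell written at exactly prewriteTs.
        Get dataGet = new Get(row).addColumn(DATA_FAMILY, DATA_QUALIFIER);
        dataGet.setTimeStamp(prewriteTs);
        return table.get(dataGet).getValue(DATA_FAMILY, DATA_QUALIFIER);
    }
}
```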
30. Performance Comparison
Single-Row Transaction
The worst case for Themis compared with raw HBase
One region server with 10GB heap memory
Write Performance : Preload 3 million rows, 256MB LRU cache
Read Performance : Preload 30GB data