Since 2013, Yahoo! has been successfully running multi-tenant HBase clusters. Our tenants run applications ranging from real-time processing (e.g. content personalization, Ad targeting) to operational warehouses (e.g. advertising, content). Tenants are guaranteed an adequate level of resource isolation and security. This is achieved through the use of open source and in-house developed HBase features such as region server groups, group-based replication, and group-based favored nodes.
Today, with growing adoption and new use cases, we are working towards scaling our HBase clusters to support petabytes of data without compromising performance or operability. A common tradeoff when scaling a cluster to this size is to increase the region size, which avoids the problem of having too many regions on the cluster. However, large regions hurt the performance and operability of a cluster, mainly because region size determines both the granularity of load distribution and the amount of write amplification due to compaction. We are therefore working towards enabling an HBase cluster to host at least a million regions.
In this presentation, we will walk through the key features we have implemented as well as share our experiences working on multi-tenancy and scaling the cluster.
6. Region Server Groups - Overview
▪ Member Tables
▪ Resource Isolation
▪ Flexibility with configuration
[Diagram: two region server groups with per-group configs. Group Foo = Region Servers 1…4 (RS1…RS4) hosting Table1 and Table2; Group Bar = Region Servers 5…8 (RS5…RS8) hosting Table3 and Table4.]
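As an illustration of the member-table and isolation bullets above, here is a minimal sketch of setting up the two groups from the diagram using the upstream rsgroup client (the feature was contributed upstream as HBASE-6721); the in-house API described in this deck may differ, and the host names, ports, and table names are placeholders mirroring the diagram.

```java
// Illustrative sketch: manage region server groups with the upstream rsgroup
// client. Class and method names follow the HBase 1.4+/2.x rsgroup module and
// may differ from the in-house version described in this deck.
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.net.Address;
import org.apache.hadoop.hbase.rsgroup.RSGroupAdminClient;

public class GroupSetup {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
      RSGroupAdminClient groups = new RSGroupAdminClient(conn);

      // Create the two groups from the diagram.
      groups.addRSGroup("foo");
      groups.addRSGroup("bar");

      // Move RS1..RS4 into group foo (host:port values are placeholders).
      Set<Address> fooServers = Stream.of("rs1:16020", "rs2:16020", "rs3:16020", "rs4:16020")
          .map(Address::fromString).collect(Collectors.toSet());
      groups.moveServers(fooServers, "foo");

      // Pin member tables to the group; their regions will only ever be
      // assigned to servers in that group, giving resource isolation.
      Set<TableName> fooTables = Stream.of("Table1", "Table2")
          .map(TableName::valueOf).collect(Collectors.toSet());
      groups.moveTables(fooTables, "foo");
    }
  }
}
```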
7. Region Server Groups - Implementation
[Diagram: inside the HMaster, the default LoadBalancer is replaced by a GroupBasedLoadBalancer that filters regions by group (foo, bar) before balancing; a GroupAdminEndpoint and a GroupMasterObserver handle group administration; a GroupInfoManager persists group membership in a group table and a group znode.]
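For reference, a sketch of how these components are typically wired in via configuration. The property keys are standard HBase master settings; the class names are the upstream rsgroup equivalents (RSGroupBasedLoadBalancer / RSGroupAdminEndpoint) of the GroupBasedLoadBalancer / GroupAdminEndpoint named above, and may differ from the in-house class names.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class GroupBalancerWiring {
  // Returns an hbase-site style configuration that enables group-based balancing.
  public static Configuration groupEnabled() {
    Configuration conf = HBaseConfiguration.create();
    // Replace the default LoadBalancer with the group-aware one; it filters
    // candidate servers by group before delegating to an internal balancer.
    conf.set("hbase.master.loadbalancer.class",
        "org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer");
    // Master coprocessor hosting the group admin RPCs and master observer hooks
    // (GroupAdminEndpoint / GroupMasterObserver in this deck).
    conf.set("hbase.coprocessor.master.classes",
        "org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint");
    return conf;
  }
}
```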
8. Namespace
▪ Analogous to Database
▪ Full Table Name: <table namespace>:<table name>
▪ e.g. my_ns:my_table
▪ Reserved namespaces
› default – tables with no explicit namespace
› hbase – system tables (e.g. hbase:meta, hbase:acl)
▪ Table Path: /<hbaseRoot>/data/<namespace>/<tableName>
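As an illustration, creating the namespace and a table under it with the Java client; this sketch assumes the HBase 2.x admin/builder API, and "cf" is a placeholder column family.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class NamespaceExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Create the namespace; its tables live under /<hbaseRoot>/data/my_ns/
      admin.createNamespace(NamespaceDescriptor.create("my_ns").build());

      // Full table name is <namespace>:<table>, i.e. my_ns:my_table
      TableName table = TableName.valueOf("my_ns", "my_table");
      admin.createTable(TableDescriptorBuilder.newBuilder(table)
          .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
          .build());
    }
  }
}
```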
10. Replication
▪ Sinks are randomly picked
▪ Sources recover any queue
▪ Shared RPC Quality of Protection config
source: https://hbase.apache.org/replication.html
11. Replication + Group
▪ Region Server Group Aware
▪ Rule based API
› Source: {namespace}, [Table], [CF]
› Slave: {Peer}
› Effective Time
[Diagram: Table1 and Table2 in Group Foo on the source cluster (which also has Group Bar) replicating to Group Foo on the peer cluster.]
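The rule-based API is in-house and not shown in detail here, so the following is a purely hypothetical sketch of a rule carrying the fields listed above (source namespace/table/column family, slave peer, effective time); none of the type or field names come from the real implementation.

```java
import java.time.Instant;
import java.util.Optional;

// Hypothetical rule object illustrating the fields listed on this slide;
// the real in-house API is not public and will differ.
public final class ReplicationRule {
  private final String sourceNamespace;        // required: {namespace}
  private final Optional<String> sourceTable;  // optional: [Table]
  private final Optional<String> sourceFamily; // optional: [CF]
  private final String slavePeerId;            // required: {Peer}
  private final Instant effectiveTime;         // when the rule takes effect

  public ReplicationRule(String sourceNamespace, Optional<String> sourceTable,
      Optional<String> sourceFamily, String slavePeerId, Instant effectiveTime) {
    this.sourceNamespace = sourceNamespace;
    this.sourceTable = sourceTable;
    this.sourceFamily = sourceFamily;
    this.slavePeerId = slavePeerId;
    this.effectiveTime = effectiveTime;
  }

  // True if an edit for the given namespace/table/family should ship to the peer.
  public boolean matches(String namespace, String table, String family) {
    return sourceNamespace.equals(namespace)
        && sourceTable.map(table::equals).orElse(true)
        && sourceFamily.map(family::equals).orElse(true)
        && !Instant.now().isBefore(effectiveTime);
  }

  public String slavePeerId() {
    return slavePeerId;
  }
}
```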
13. Favored Nodes
▪ What are Favored Nodes ?
› When writing data, a set of preferred hosts can be passed to the HDFS client for placing block replicas.
› preferred hosts => "Favored Nodes"
› Usually 3 hosts: primary, secondary, and tertiary.
› Constraint: the primary host is on one rack; the secondary and tertiary hosts are on a different rack.
▪ Favored Nodes of regions are scattered across various groups.
› No guarantees about data locality within a region server group.
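For context, this is roughly how a writer passes favored nodes to HDFS. A minimal sketch assuming the DistributedFileSystem#create overload that accepts a favoredNodes array; the path, host names, and ports are placeholders.

```java
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at HDFS.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    // Primary on one rack, secondary and tertiary on another (placeholder hosts).
    InetSocketAddress[] favoredNodes = new InetSocketAddress[] {
        new InetSocketAddress("dn1.rack1.example.com", 50010),
        new InetSocketAddress("dn5.rack2.example.com", 50010),
        new InetSocketAddress("dn6.rack2.example.com", 50010),
    };

    // The namenode tries to place the block replicas on these datanodes.
    try (FSDataOutputStream out = dfs.create(
        new Path("/hbase/data/my_ns/my_table/somefile"),
        FsPermission.getFileDefault(), true /* overwrite */, 4096 /* bufferSize */,
        (short) 3 /* replication */, 128 * 1024 * 1024L /* blockSize */,
        null /* progress */, favoredNodes)) {
      out.write("hello".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```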
15. Example
▪ Locality is lost when region server RS1 dies.
[Diagram: RS Group A (RS1…RS4 with DN1…DN4) and RS Group B (RS5…RS8 with DN5…DN8); RS1 dies and the locality of its regions is lost.]
16. Group Aware Favored Nodes
▪ Fix the data locality problem by
› choosing favored nodes within the region server group
› assigning regions only to their favored nodes
[Diagram: with group-aware favored nodes, the favored nodes of regions in RS Group A (RS1…RS4, DN1…DN4) stay within the group, separate from RS Group B (RS5…RS8, DN5…DN8).]
17. FavoredGroupLoadBalancer
▪ Region server groups aware
▪ Region assignment on favored nodes
▪ Region balancing done using Stochastic Load Balancer
▪ Favored Node Management
› Generate favored nodes for regions
› Favored nodes are inherited during region split/merge events.
› Favored nodes do not change unless required.
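A loose sketch of the "generate favored nodes" step under the constraints from the Favored Nodes slide (three hosts, primary on one rack, secondary/tertiary on another, all drawn from the region's group). This only illustrates the idea and is not the balancer's actual code; a real implementation would also balance replica load across hosts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative only: pick 3 favored nodes for a region from its group's servers,
// with the primary on one rack and the secondary/tertiary on a different rack.
public class FavoredNodePicker {
  private final Random random = new Random();

  public List<String> pickFavoredNodes(Map<String, List<String>> serversByRack) {
    List<String> racks = new ArrayList<>(serversByRack.keySet());
    if (racks.size() < 2) {
      throw new IllegalStateException("need at least two racks in the group");
    }
    // Choose two distinct racks within the group.
    String primaryRack = racks.get(random.nextInt(racks.size()));
    String otherRack;
    do {
      otherRack = racks.get(random.nextInt(racks.size()));
    } while (otherRack.equals(primaryRack));

    List<String> favored = new ArrayList<>(3);
    favored.add(pick(serversByRack.get(primaryRack), null));     // primary
    String secondary = pick(serversByRack.get(otherRack), null); // secondary
    favored.add(secondary);
    favored.add(pick(serversByRack.get(otherRack), secondary));  // tertiary
    return favored;
  }

  // Pick a random server from the rack, avoiding 'exclude' when possible.
  private String pick(List<String> servers, String exclude) {
    String s;
    do {
      s = servers.get(random.nextInt(servers.size()));
    } while (s.equals(exclude) && servers.size() > 1);
    return s;
  }
}
```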
19. Favored Node Management APIs
▪ Redistribute
› Ability to expand region block replicas to newly added nodes.
› Change favored nodes of regions such that replicas spread to newly added nodes
[Diagram: RS Group A before and after a redistribute. Initially RS1…RS4 (DN1…DN4) hold all favored-node replicas; a new node RS5/DN5 is added, and after redistribute some regions' favored nodes are moved onto the new node.]
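A hypothetical sketch of what redistribute does conceptually: regions whose favored nodes do not yet include a newly added node swap one favored node for the new one. The names and structure are illustrative only; the real API is in-house, and a real implementation would preserve rack constraints and only move enough regions to even out the replica count.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: spread favored-node replicas onto newly added nodes.
public class Redistribute {
  /**
   * @param favoredNodesByRegion mutable map of region -> [primary, secondary, tertiary]
   * @param newNodes             nodes recently added to the group
   */
  public static void redistribute(Map<String, List<String>> favoredNodesByRegion,
                                  List<String> newNodes) {
    if (newNodes.isEmpty()) {
      return;
    }
    int next = 0;
    for (List<String> favored : favoredNodesByRegion.values()) {
      // Skip regions that already use one of the new nodes.
      boolean touchesNewNode = favored.stream().anyMatch(newNodes::contains);
      if (!touchesNewNode) {
        // Round-robin the new nodes in, replacing the tertiary favored node so
        // the existing primary/secondary locality is left untouched.
        favored.set(2, newNodes.get(next));
        next = (next + 1) % newNodes.size();
      }
    }
  }
}
```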
20. Favored Node Management APIs
▪ Complete_Redistribute
› Ability to recreate entire set of favored nodes in balanced fashion
› Balances the replica load evenly among all the nodes
[Diagram: RS Group A (RS1…RS4 with DN1…DN4) before and after complete_redistribute; the entire set of favored nodes is regenerated so that replicas are spread evenly, favoring the host with the least number of replicas.]
21. Enhancements
▪ Improvements to Stochastic Load Balancer (HBASE-13376)
▪ Improvements to Region Placement Maintainer Tool
› Ability to view the locality of a region on each of its favored nodes.
› Ability to view the primary, secondary, and tertiary node distribution across region servers.
▪ Hadoop JIRAs
› HDFS-7300
› HDFS-7795
▪ Configuration changes made on Hadoop side
› Set “dfs.namenode.replication.considerLoad” to false in small clusters
22. Scaling to 1M and beyond (HBASE-11165)
▪ Store Petabytes of data
▪ Support mixed workload (batch and near real-time)
▪ Performance
› Latency, throughput
▪ Operability
› Load balancing, compactions, etc.
23. Experience at Scale
▪ Web Crawl Cache
› ~2.3PB Table
› 80GB regions -> 20GB regions
› Batch workload
▪ Hot Regions
▪ Large compactions (Write amplification)
▪ Longer failover time
▪ Less parallel, imbalanced MapReduce tasks
▪ Large MapReduce tasks
24. Scaling Region Count
▪ Master Region Management
› Creation, Assign, Balance, etc.
› Meta table
▪ Metadata
› HDFS scalability
› Zookeeper
› Region Server density
25. Observations
[Diagram: Master, ZooKeeper, the meta region, and region servers hosting Region 1 and Region 2, with arrows for assignment communication and write ops.]
▪ Assignment
› ZK assignment - complex and more storage
› High CPU usage on master
▪ Single hot meta
› 7GB in size for 1M regions
› Master writing at 400 ops/second
› Longer scanning times
▪ HDFS
› Longer directory creation time
26. Enhancements - Assignment
▪ Assignment
› ZK-less assignment (HBASE-11059)
› Simpler
› No involvement of ZooKeeper
› Unlock region states (HBASE-11290)
[Diagram: the Master and region servers hosting user regions and the meta region; region assignment no longer goes through ZooKeeper.]
27. Enhancements - Split Meta
▪ Split meta (HBASE-11288)
› Distributed IO load
› Distributed caching
› Shorter scan time
› Distributed compaction
[Diagram: instead of a single meta region, multiple meta regions are hosted on different region servers alongside user regions.]
28. Enhancements - Hierarchical region dir
● Scaling namenode operations - Table dir has millions of region files
● Approach - Buckets within table directory
● E.g. 3 letters of bucket names gives 4k buckets

Region dir creation time - 4k buckets:
                    1M regions        5M regions            10M regions
  normal table      20 mins           4 hours 23 minutes    Doesn't finish
  humongous table   15 mins 48 secs   1 hour 27 minutes     2 hr 53 minutes
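A sketch of how a bucketed ("humongous") table layout can be derived: the bucket is taken from the leading characters of the region's encoded name (an MD5 hex string in HBase), so 3 hex characters give 16^3 = 4096 (~4k) buckets under the table directory. The path scheme and names below are illustrative; the actual on-disk layout used by this work may differ.

```java
import org.apache.hadoop.fs.Path;

// Illustrative only: derive a bucketed region directory from the first 3
// characters of the encoded region name.
// Flat layout:     /<hbaseRoot>/data/<namespace>/<table>/<encodedRegionName>
// Bucketed layout: /<hbaseRoot>/data/<namespace>/<table>/<bucket>/<encodedRegionName>
public class HierarchicalRegionDir {
  public static Path regionDir(Path hbaseRoot, String namespace, String table,
                               String encodedRegionName) {
    String bucket = encodedRegionName.substring(0, 3); // 16^3 = 4096 buckets
    return new Path(hbaseRoot,
        String.format("data/%s/%s/%s/%s", namespace, table, bucket, encodedRegionName));
  }

  public static void main(String[] args) {
    Path dir = regionDir(new Path("/hbase"), "my_ns", "my_table",
        "abc1f2e3d4c5b6a7980123456789abcd"); // placeholder encoded region name
    // Prints /hbase/data/my_ns/my_table/abc/abc1f2e3d4c5b6a7980123456789abcd
    System.out.println(dir);
  }
}
```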