Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments

© Cloudera, Inc. All rights reserved.
Hadoop Summit EU, 16 Apr 2015
Jonathan Hsieh | HBase Tech Lead @ Cloudera, Apache HBase PMC
Dima Spivak | HBase QE Lead @ Cloudera

Who are we?
• Jonathan Hsieh
• Tech Lead, HBase Team @ Cloudera
• Apache HBase PMC Member
• Apache Flume founder
• Contact
• jon@cloudera.com
• @jmhsieh
• Dima Spivak
• QE Lead, HBase Team @ Cloudera
• Contact
• dspivak@cloudera.com
• @dimaspivak

What is Apache HBase?
Apache HBase is a consistent, low-latency, random-access, non-relational database built on Apache Hadoop.

Some HBase Contributors, Users, and Providers

Challenges as usage increases
• How does one:
• Isolate different application workloads.
• Share datasets between different workloads.
• Prepare for geographic redundancy and availability.
• Manage cluster migrations.
• Test and prototype (multi-)cluster deployments.
• There are multiple solutions!

Multiple Multi- Solutions
• Multi-Cluster: using more than one cluster for an application.
• Multi-Tenant: using one cluster for more than one application.
• Multi-Container: using one machine to run [one or more] multi-node clusters.

Multi-Cluster
Safety in numbers

Multi-Cluster Deployments
• Deploy multiple HBase cluster instances.
• Motivation:
• Isolating different workloads from each other.
• Geographic disaster recovery, redundancy, and availability.

Isolation
• Isolation is usually done where many apps share one data center.
• Two different workloads on the same dataset.
• Perform latency-sensitive workloads on the same set of data as an analytic MR workload.
• Two disjoint application workloads and datasets.
• Deploy OpenTSDB on HBase in the same data center, but as a separate cluster, to monitor the production HBase cluster.

Isolation: Operational with Analytical access pattern
[Diagram: HBase Clients issue Get, Scan, Put, Incr, and Append requests against a low-latency cluster that is isolated from full scans. HBase Replication links it to a high-throughput cluster that serves MapReduce (HBase Scanner) and Bulk Import.]

Geographic Recovery, Redundancy, and Availability
• Run multiple HBase clusters in multiple data centers.
• Often using “Podding” schemes.
• Primarily for backups of data in case of data center outages.
• Locality for Performance.
• Locality for Compliance.
• Availability while a datacenter is down.
• Deploy with:
• HBase replication: master-master, master-slave.
• Multi-cluster clients.

Master-Master Replication
[Diagram: clusters replicating to each other by shipping logs.]
Replicating data reduces chances of data loss.

HBase Multi-Cluster Client
• High Availability with Eventual
Consistency when using replication.
• Simple implementation.
• Hedged operations. If primary takes
too long, go to the failover cluster.
• Same HConnection interface, just a different factory:
HConnectionManagerMultiClusterWrapper.getConnection(conf)
• HBase.MCC to be available in Cloudera
Labs.
Work by Ted Malaska (Cloudera Solution Architect)
https://github.com/tmalaska/HBase.MCC
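
The hedged-operation idea on this slide can be sketched without any HBase dependency. This is a minimal Python illustration, not the HBase.MCC implementation: `hedged_get`, `hedge_after_s`, and the two callables standing in for the primary and failover clusters are hypothetical names. Because the failover cluster is fed by replication, an answer from it may lag the primary (eventual consistency).

```python
import concurrent.futures as cf


def hedged_get(primary, failover, key, hedge_after_s=0.05):
    """Read from the primary; if it is slow, also ask the failover.

    `primary` and `failover` are plain callables taking a key
    (hypothetical stand-ins for per-cluster clients).
    Returns (source_name, value) from whichever side answers first.
    """
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(primary, key)
        try:
            # Fast path: the primary answers within the hedge delay.
            return "primary", first.result(timeout=hedge_after_s)
        except cf.TimeoutError:
            pass  # Primary is taking too long: hedge to the failover.
        second = pool.submit(failover, key)
        names = {first: "primary", second: "failover"}
        done, _ = cf.wait(names, return_when=cf.FIRST_COMPLETED)
        # Prefer the primary if both finished at the same time.
        winner = first if first in done else second
        return names[winner], winner.result()
    finally:
        # Do not block the caller on the slower request.
        pool.shutdown(wait=False)
```

The hedge delay is the tuning knob: too short and the failover cluster sees most of the read traffic, too long and the client gains little over a plain timeout.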

Multi-Tenant
We’re all in this together

Multi-tenant deployments
• Deploy multiple workloads on one cluster.
• Motivation:
• Better resource utilization.
• Cost efficiency.
• Simpler operations.
• Shared data.
• Multiple services on one cluster.
• Running HBase, Spark, Impala and MR on the same cluster.

Security and namespaces
• Challenges:
• Resource management, prioritizing and fairness.
• Authentication and Authorization.
• Mechanisms:
• HBase Security – Authentication, Authorization for commands via ACLs.
• Namespaces – Isolate administrative domains for ACLs.
• Proxy Impersonation – Thrift proxy doAs, and REST proxy doAs.

Request Throttling
• Idea: some tables or users get a limited
budget of ops or throughput, while others
do not.
• Multiple workloads on one dataset.
• Production/real-time user: unthrottled.
• Analytic/ad hoc workload user: throttled.
• Caveat: if all users are throttled, we may not use all machine resources.
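
A per-user ops budget of this kind is commonly modeled as a token bucket. The sketch below is an illustration under that assumption, not HBase's quota code: `Throttle`, `admit`, the user names, and the limits are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class Throttle:
    """Token bucket: `ops_per_sec` refill rate, at most `burst` saved-up ops."""

    def __init__(self, ops_per_sec, burst, now=0.0):
        self.rate = ops_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Only the analytic user has a budget (hypothetical names and limits).
LIMITS = {"analytics": Throttle(ops_per_sec=2, burst=2)}


def admit(user, now):
    """Users without an entry in LIMITS are unthrottled."""
    throttle = LIMITS.get(user)
    return True if throttle is None else throttle.allow(now)
```

The caveat on the slide shows up directly here: if every user has an entry in `LIMITS`, the sum of the budgets caps cluster usage even when machines are idle.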

Request Scheduling
• Idea: gets should have high priority while
scans should get deprioritized the more
they are used (HBASE-10994).
• Multiple workloads on one dataset.
• Production real-time gets: immediately
scheduled.
• Analytic scan workloads: delay
scheduled.
• All resources are used.
• Caveat: requires manual tuning.
[Diagram: a request queue in which new requests are delayed by long scan requests, and the same queue rescheduled so that new requests get priority.]
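
One way to picture the scheduling idea is a priority queue in which each scan request is pushed back in proportion to how many requests its scanner has already issued, while gets keep their arrival position. This is a sketch under that assumption and not the HBase scheduler: `Scheduler`, `scan_penalty`, and the request names are hypothetical.

```python
import heapq
import itertools


class Scheduler:
    def __init__(self, scan_penalty=5):
        self._heap = []
        self._arrivals = itertools.count()
        self._scan_calls = {}  # scanner id -> requests issued so far
        self.scan_penalty = scan_penalty  # the manual tuning knob

    def submit(self, kind, request_id, scanner_id=None):
        arrival = next(self._arrivals)
        delay = 0
        if kind == "scan":
            used = self._scan_calls.get(scanner_id, 0)
            self._scan_calls[scanner_id] = used + 1
            # The more a scanner has been used, the further back it goes.
            delay = used * self.scan_penalty
        heapq.heappush(self._heap, (arrival + delay, arrival, request_id))

    def next_request(self):
        return heapq.heappop(self._heap)[2]

    def drain(self):
        out = []
        while self._heap:
            out.append(self.next_request())
        return out


sched = Scheduler(scan_penalty=5)
for request_id in ("scan-a1", "scan-a2", "scan-a3"):
    sched.submit("scan", request_id, scanner_id="a")
sched.submit("get", "get-1")
sched.submit("get", "get-2")
order = sched.drain()
print(order)  # ['scan-a1', 'get-1', 'get-2', 'scan-a2', 'scan-a3']
```

The two gets arrive after three scan requests but are served after only the first one. No request is dropped, so all resources stay in use; the cost is that `scan_penalty` has to be tuned by hand.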

Performance Isolation inside a cluster
• Region Server Groups (under review).
• Limit the performance impact that load on one table has on others (HBASE-6721).
• Multiple workloads on multiple data sets
on one HBase cluster.
• Two separate apps on one cluster.
[Diagram: mixed workload vs. isolated workload.]
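
Conceptually, a region server group pins a table's regions to a subset of the servers. The sketch below only illustrates that placement rule; it is not the design under review in HBASE-6721, and the group names, server names, table names, and round-robin placement are all assumptions.

```python
# Hypothetical groups: each region server belongs to exactly one group.
GROUPS = {
    "default": ["rs1", "rs2", "rs3"],
    "realtime": ["rs4", "rs5"],
}

# Hypothetical table-to-group mapping; unlisted tables use "default".
TABLE_GROUP = {"user_profiles": "realtime"}


def servers_for(table):
    """Servers that are allowed to host regions of `table`."""
    return GROUPS[TABLE_GROUP.get(table, "default")]


def assign_regions(table, regions):
    """Round-robin the table's regions over its group's servers only."""
    servers = servers_for(table)
    return {region: servers[i % len(servers)] for i, region in enumerate(regions)}
```

Because the groups are disjoint, heavy load on a table in `default` cannot land on the servers that host `user_profiles`.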

Service Isolation
• Today, the easiest strategy for isolating a latency-sensitive HBase deployment from other services is static partitioning.
• Future:
• Improve IO isolation via YARN/Slider/Mesos.
• Separate HBase actions into separate processes.
• e.g. externalize compaction for better resource management.
[Diagram: multi-service deployment, where every node runs YARN NM/MR, HBase RS, impalad, and HDFS DN, vs. statically partitioned service deployment, where some nodes run only HBase RS and HDFS DN and the others run YARN NM/MR, impalad, and HDFS DN.]

Multi-Container
My name is Jonah

Multi-container deployments
• Run a distributed HBase cluster on a single host.
• Testing applications.
• Use cases requiring quick cluster stand-up.

Linux containers
• cgroups (Linux kernel 2.6.24+).
• Isolating resources (memory, CPU, networking).
• Namespace isolation (filesystems, process trees).

Virtual Machines vs Linux Containers
[Diagram: with virtual machines, a hypervisor on the host operating system runs several guest OSes, each with its own libraries and user processes. With containers, the user processes share libraries and run directly on the host operating system.]

Docker
• User front-end for containers.
• Container management (start, stop,
pause).
• docker run
• Images (templates for containers).
• docker commit
• Registries (repository for images).
• docker push

Integration testing
• Automate long-running tests from hbase-it module.
• $ hbase org.apache.hadoop.hbase.IntegrationTest…
• Integration with fault injection framework (Chaos Monkey).

Starting container cluster
[Diagram: "Start cluster" brings up a DNS server and four nodes.]
• dnsserver (10.0.0.2): DNS server
• node-1 (10.0.0.3): Master
• node-2 (10.0.0.4): Slave
• node-3 (10.0.0.5): Slave
• node-4 (10.0.0.6): Slave

Automation
• Replace fragile infrastructure.
• Set up a distributed cluster as part of test execution.

In progress
• Extend this workflow to upstream Apache HBase (HBASE-12721)
• Upstream integration testing (builds.apache.org)
• Multi-cluster use cases (e.g. MCC, replication)
• Upgrades

Conclusions
Multi multi multi

Summary
Goal | Multi-Cluster | Multi-Tenant | Multi-Container
Isolate workloads | One cluster per workload. | Region Server Groups. | cgroups.
Multiple workloads on same dataset (real-time vs analytic workload) | Separate cluster per workload. | Request throttling, request scheduling. | Containers as “VMs” or microservices.
Reliability and Availability | Disaster recovery, master-master replication, multi-cluster client. | Multiple tables with Region Server Groups. | More realistic testing.
Cost Savings | Disaster recovery. | One cluster, multiple use cases. | One machine, multiple nodes.

Futures
• We are seeing more and more deployments that are multi-cluster and/or multi-tenant.
• Traditional workflows are giving way to hybrid ones.
• More knobs to turn to optimize for performance and value.
• Multi-container deployments are a way forward to make prototyping and testing these deployments easier.

Thank you!

HBaseCon 2015 is Coming!
Thurs., May 7, in San Francisco
Presentations from the world’s biggest HBase operators:
Bloomberg, Dropbox, eBay, Facebook, Google, Pinterest, Xiaomi, Yahoo!, more!
Seats are limited; register at hbasecon.com
