Improving HBase availability in a multi-tenant
environment
James Moore (jcmoore@hubspot.com)
Kahlil Oppenheimer (kahlil@hubspot.com)
Outline
● Who are we?
● What problems are we solving?
● Reducing the cost of Region Server failure
● Improving the stochastic load balancer
● Eliminating workload driven failure
● Reducing the impact of hardware failure
Who are we?
● HubSpot’s Big Data Team provides HBase as a Service to
our 50+ Product and Engineering teams.
● HBase is HubSpot’s canonical data store and reliably
stores all of our customers’ data
What problems are we solving?
● In a typical month we lose ~3.5% of our Region Servers to
total network failure
● 3500+ microservices interacting with 350+ tables
● 500+ TBs of raw data served from HBase
● We run an internally patched version of CDH 5.9.0
● We’ve reached 99.99% availability in every cluster.
Reducing the cost of failure
What does a Region Server crash look like?
Distributed log splitting performance
● Minimum of 5 seconds
● Scales linearly with the
number of HLog files
● Performance becomes increasingly erratic as the number
of HLogs grows
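Since recovery time scales with the number of HLogs a crashed server leaves behind, that number can be bounded up front. A minimal hbase-site.xml sketch; the property is a standard HBase setting, but the value here is only an illustrative assumption, not a HubSpot recommendation:

    <property>
      <name>hbase.regionserver.maxlogs</name>
      <!-- Fewer WAL files per server means less log to split after a crash,
           at the cost of more frequent memstore flushes. -->
      <value>16</value>
    </property>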
Region assignment & log replay
● This process is non-linear
● Scales at least with Regions × HLogs
Improving mean time to recovery (MTTR)
● Running more and smaller instances can improve overall
availability
● Migrating from d2.8xls to d2.2xls reduced our MTTR from
90 seconds to 30 seconds
d2.8xl: 244 GB RAM, 24 × 2 TB HDD, 36 cores
d2.2xl: 61 GB RAM, 7 × 2 TB HDD, 8 cores
… But the AsyncProcess can get in the way
● AsyncProcess & AsyncRequestFutureImpl didn’t respect
operation timeout
● Meta fetches didn’t respect operation timeout
● Open connections to an offline meta server could cause
the client to block until socket failure far into the
future.
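These were client-side bugs that had to be patched, but the general lesson is to bound client waits explicitly. A minimal sketch of the relevant client settings; the property names are standard HBase client configuration, while the values are illustrative assumptions rather than HubSpot’s:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class ClientTimeoutConfig {
      public static Connection connect() throws java.io.IOException {
        Configuration conf = HBaseConfiguration.create();
        // Upper bound on a whole client operation, including retries.
        conf.setInt("hbase.client.operation.timeout", 10000);
        // Upper bound on any single RPC within that operation.
        conf.setInt("hbase.rpc.timeout", 2000);
        // Give up on a dead region server quickly instead of retrying for minutes.
        conf.setInt("hbase.client.retries.number", 3);
        return ConnectionFactory.createConnection(conf);
      }
    }

Note that until the AsyncProcess and meta-fetch bugs above were fixed, these settings were not always honored by the client.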
What about stalled instances?
What is the cause?
● Degraded hardware performance?
○ Slow I/O
○ Stuck kernel threads
○ High packet loss
● Long GC pauses?
● Hot regions?
● Noisy user?
Strategy
● Eliminate usage related failures
● Tune GC (bit.ly/hbasegc); an illustrative flag set is sketched after this list
● Monitor and proactively remove misbehaving hardware
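For the GC item, the linked guide has the full treatment. Purely as an illustration, with assumed values that are not the settings from bit.ly/hbasegc or HubSpot’s production flags, a G1 configuration for a region server might be set in hbase-env.sh like this:

    # hbase-env.sh (illustrative values only)
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -XX:+UseG1GC \
      -XX:MaxGCPauseMillis=50 \
      -XX:G1HeapRegionSize=32m \
      -XX:InitiatingHeapOccupancyPercent=65"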
Eliminating usage failures
1. Keep regions less than 30GB
2. Use the Normalizer to ensure tables do not have unevenly
sized regions (see the sketch after this list)
3. Optimize the balancer for multi-tenancy
4. Usage limits and Guardrails
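A minimal Java sketch of items 1 and 2, using the HBase 1.x Admin and HTableDescriptor APIs that ship with CDH 5.9; the 30 GB figure comes from the slide, but whether HubSpot applies it exactly this way is an assumption:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;

    public class RegionSizingSketch {
      static void apply(Connection connection, TableName table) throws IOException {
        try (Admin admin = connection.getAdmin()) {
          HTableDescriptor descriptor = admin.getTableDescriptor(table);
          // 1. Split any region that grows past 30GB.
          descriptor.setMaxFileSize(30L * 1024 * 1024 * 1024);
          // 2. Let the normalizer split/merge unevenly sized regions for this table.
          descriptor.setNormalizationEnabled(true);
          admin.modifyTable(table, descriptor);
          // Cluster-wide switch for the normalizer chore.
          admin.setNormalizerRunning(true);
        }
      }
    }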
Improving the balancer
Problem: HBase is unstable, unbalanced
We investigated and found three issues...
1. Balancer stops if servers host same number of regions
2. Requests for tables aren’t distributed across cluster
3. Balancer performance doesn’t scale well with cluster size
Quick refresher on the stochastic load balancer
There will be pictures...
The first issue... turned out to be intentional
1. Balancer stops if servers host same number of regions
Theory: design choice when balancer was simpler
Solution: disable the “slop” logic for the stochastic load balancer
The second issue involved some theorizing
2. Requests for tables aren’t distributed across cluster
Theory: cluster has high table skew
But every good theory requires a bit of background...
What is table skew?
Problem: When unbalanced, all requests to R1-R7 and Q1-Q5 hit Server 1
If cluster has high table skew, we expect...
● High table skew cost
● High responsiveness to tuning table skew weight
But we found...
● Low table skew cost
● Low responsiveness to tuning table skew weight
So we dug in and discovered...
● Regions are not evenly distributed across the cluster
● Table skew is present, contrary to what the balancer reports
Time for a new theory... and a solution!
2. Requests for single tables hit subset of region servers
Theory: table skew not properly computed
Solution: Create new table skew cost function and generator
New table skew cost function (HBASE-17707)
● Computes the number of moves/swaps required to evenly
distribute each table’s regions
● Outputs a value in [0, 1] proportional to the number of
moves required
● Analogous to “edit distance” between two words
New table skew candidate generator (HBASE-17707)
Propose region moves/swaps to optimize TableSkewCostFunction
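The actual HBASE-17707 change is more involved; purely as a simplified illustration of the idea, with hypothetical class and method names, the cost can be framed as the fraction of moves needed to even a table out, scaled against the worst case:

    import java.util.Collection;

    public final class TableSkewSketch {
      /**
       * Cost in [0, 1] for one table: roughly the fraction of moves needed, out of
       * the worst case, to give every server an even share of that table's regions.
       * regionsPerServer must include an entry (possibly 0) for every server.
       */
      static double tableSkewCost(int totalRegions, Collection<Integer> regionsPerServer) {
        int servers = regionsPerServer.size();
        if (totalRegions == 0 || servers == 0) {
          return 0.0;
        }
        int evenShare = (totalRegions + servers - 1) / servers;   // ceiling share per server
        int movesNeeded = 0;
        for (int count : regionsPerServer) {
          movesNeeded += Math.max(0, count - evenShare);          // regions above an even share
        }
        int worstCase = Math.max(1, totalRegions - evenShare);    // everything piled on one server
        return (double) movesNeeded / worstCase;
      }
    }

The companion candidate generator then proposes moving or swapping regions from servers above their even share to servers below it, steering the balancer’s random search toward states this cost function scores well.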
How did our fix work in practice?
75th Percentile Request Queue Times
Spot the deploy
Spot the deploy (and the bug fix)
Back to investigating...
3. Balancer performance doesn’t scale well with cluster size
Theory: table skew calculation takes too long
But new table skew code didn’t improve performance...
So we benchmarked cost functions
But found a surprise
We benchmarked generators too...
And...surprise!
What is locality?
A measure of how close the data is to the computation
For HBase: How much of a region’s data is on its host’s disk
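In code terms the metric is just a byte ratio; a tiny illustrative helper (hypothetical name, not an HBase API):

    public final class LocalitySketch {
      /** Fraction of a region's store-file bytes that sit on its host's local disks. */
      static float localityOn(long localBytes, long totalBytes) {
        return totalBytes == 0 ? 1.0f : (float) localBytes / totalBytes;
      }
    }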
Back to theorizing (and solution-izing)
3. Balancer performance doesn’t scale well with cluster size
Old theory: table skew calculation takes too long
New theory: locality computation takes too long
Solution: Redesign locality cost function and generator
The old locality code
● Recomputes locality of every region on every server
● Reads every HDFS block location of every region
● For every action the balancer proposes!
New locality code (HBASE-18164)
● Caches localities of all regions on all servers
● Incrementally updates locality cost based on proposed
region moves/swaps
● Add per-rack locality computation
○ Reduce AWS charges for inter-rack data transfer
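A rough sketch of the caching idea, not the actual HBASE-18164 implementation: localities are read into a lookup table once, and each proposed move only adjusts a running total by the difference for the region being moved.

    public final class CachedLocalityCost {
      // localityOf[region][server]: fraction of that region's data local to that server,
      // read once from HDFS block locations instead of on every proposed action.
      private final float[][] localityOf;
      private double totalLocality;   // running sum of each region's locality on its current host

      CachedLocalityCost(float[][] localityOf, int[] currentServerOfRegion) {
        this.localityOf = localityOf;
        for (int region = 0; region < localityOf.length; region++) {
          totalLocality += localityOf[region][currentServerOfRegion[region]];
        }
      }

      /** O(1) update when the balancer proposes moving one region. */
      void onRegionMoved(int region, int fromServer, int toServer) {
        totalLocality += localityOf[region][toServer] - localityOf[region][fromServer];
      }

      /** Cost in [0, 1]: 0 when every region is fully local, 1 when nothing is. */
      double cost() {
        return localityOf.length == 0 ? 0.0 : 1.0 - totalLocality / localityOf.length;
      }
    }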
But does it work? It does.
● 8-10x more balancer actions considered in all clusters
○ 20x in big clusters
● LocalityCandidateGenerator from 30,485ms to 4,721ms
○ ~6x faster
● LocalityCostFunction from 28,267ms to 57ms
○ ~495x faster
Balancer changes summarized
1. Balancer is smarter about table skew
2. Balancer has more time to “think” with less time spent in
locality computation
3. Cluster stability increased with better balance
Eliminating workload driven failure
Limit RPC handler usage by service
● We use a semaphore to limit the number of active
CallRunners any given Hadoop user can have open
● If insufficient RPC handlers are available to a Hadoop user,
we return a retryable exception and rely on client-side
exponential backoff
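The real change hooks into the region server’s RPC path around CallRunner; this standalone sketch only shows the semaphore-plus-retryable-exception pattern, and every class and method name here is hypothetical rather than from the actual patch:

    import java.io.IOException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.Semaphore;

    public class PerUserHandlerLimit {
      /** A plain IOException, so the HBase client treats it as retryable and backs off. */
      public static class HandlerQuotaExceededException extends IOException {
        HandlerQuotaExceededException(String user) {
          super("RPC handler quota exceeded for user " + user + ", please retry");
        }
      }

      private final ConcurrentMap<String, Semaphore> permitsByUser = new ConcurrentHashMap<>();
      private final int maxHandlersPerUser;

      public PerUserHandlerLimit(int maxHandlersPerUser) {
        this.maxHandlersPerUser = maxHandlersPerUser;
      }

      /** Call before running a request; release the returned semaphore in a finally block. */
      public Semaphore acquire(String user) throws HandlerQuotaExceededException {
        Semaphore permits =
            permitsByUser.computeIfAbsent(user, u -> new Semaphore(maxHandlersPerUser));
        if (!permits.tryAcquire()) {
          throw new HandlerQuotaExceededException(user);
        }
        return permits;
      }
    }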
Client guardrails
● With 3,500+ services, communicating best practices is
difficult
● Communicate best practices with code! (a guardrail sketch
follows the limits table below)
Max Request Size 10 MB
Max Response Size 10 MB
Max Batch Size 20,000
Max Read Columns 20,000
Max Write Cells 20,000
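A minimal sketch of what “best practices as code” can look like on the client side, using the batch and request limits from the table above; the wrapper and its names are hypothetical, not HubSpot’s actual client:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Mutation;

    public final class GuardrailsSketch {
      private static final int MAX_BATCH_SIZE = 20_000;
      private static final long MAX_REQUEST_BYTES = 10L * 1024 * 1024;   // 10 MB

      /** Reject an oversized batch before it ever reaches a region server. */
      static void checkBatch(List<? extends Mutation> batch) throws IOException {
        if (batch.size() > MAX_BATCH_SIZE) {
          throw new IOException("Batch of " + batch.size() + " mutations exceeds " + MAX_BATCH_SIZE);
        }
        long bytes = 0;
        for (Mutation mutation : batch) {
          bytes += mutation.heapSize();   // rough in-memory estimate, good enough for a guardrail
        }
        if (bytes > MAX_REQUEST_BYTES) {
          throw new IOException("Request of ~" + bytes + " bytes exceeds the 10 MB limit");
        }
      }
    }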
Reducing the impact of hardware failure
Prioritize removal of bad hardware
● Watch kernel logs for
○ I/O failures
○ Stuck threads
○ General kernel errors
● Prioritize removal of these instances to prevent
region server stalls
● We use an in-house orchestration tool known as
Tranquility for this
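The kernel-log watching can be as simple as pattern matching on dmesg output; a minimal sketch in which the patterns and the dmesg invocation are illustrative assumptions. The decision to drain and replace a flagged host is then handed to orchestration (Tranquility at HubSpot), which is not shown here.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.regex.Pattern;

    public final class KernelLogWatcher {
      // Illustrative signatures of failing disks, stuck kernel threads, and general errors.
      private static final Pattern BAD_SIGNS = Pattern.compile(
          "I/O error|blocked for more than|hung_task|Call Trace|EXT4-fs error");

      static boolean hostLooksUnhealthy() throws Exception {
        Process dmesg = new ProcessBuilder("dmesg").start();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(dmesg.getInputStream()))) {
          return reader.lines().anyMatch(line -> BAD_SIGNS.matcher(line).find());
        }
      }
    }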
Now we’ll take questions :)
Thanks for listening. Feel free to reach out!
jcmoore@hubspot.com, kahlil@hubspot.com
