Sizing Your HBase Cluster
Lars George | @larsgeorge 
EMEA Chief Architect @ Cloudera
2 
Agenda 
• Introduction 
• Technical Background/Primer 
• Best Practices 
• Summary 
©2014 Cloudera, Inc. All rights reserved.
3 
Who I am… 
Lars George [EMEA Chief Architect] 
• Clouderan since October 2010 
• Hadooper since mid 2007 
• HBase/Whirr Committer (of Hearts) 
• github.com/larsgeorge 
4 
Bruce Lee: “As you think, so shall you become.”
5 
Introduction 
6 
HBase Sizing Is... 
• Making the most out of the cluster you have by... 
– Understanding how HBase uses low-level resources 
– Helping HBase understand your use-case by configuring it appropriately - and/or -
– Designing the use-case to help HBase along
• Being able to gauge how many servers are needed for a given use-case
7 
Technical Background 
“To understand your fear is the beginning of 
really seeing…” 
— Bruce Lee 
8 
HBase Dilemma 
Although HBase can host many applications, they may require completely opposite features
Events        Entities
Time Series   Message Store
9 
Competing Resources 
• Reads and Writes compete for the same low-level resources 
– Disk (HDFS) and Network I/O 
– RPC Handlers and Threads 
– Memory (Java Heap) 
• Otherwise they do exercise completely separate code paths
10 
Memory Sharing 
• By default every region server divides its memory (i.e. the given maximum heap) into
– 40% for in-memory stores (write ops)
– 20% (or 40%) for block caching (read ops)
– Remaining space (here 40% or 20%) goes towards usual Java heap usage
• Objects etc. 
• Region information (HFile metadata) 
• Share of memory needs to be tweaked
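As a rough illustration of the split above, the following Python sketch computes the three shares for a given maximum heap. The function name and the fractions passed in are illustrative assumptions; the real values come from the region server configuration.

```python
def heap_shares(heap_gb, memstore_fraction=0.4, block_cache_fraction=0.2):
    """Split a region server heap into memstore, block cache, and remainder (GB)."""
    if memstore_fraction + block_cache_fraction >= 1.0:
        raise ValueError("memstore and block cache must leave room for general heap usage")
    memstore = heap_gb * memstore_fraction
    block_cache = heap_gb * block_cache_fraction
    remainder = heap_gb - memstore - block_cache
    return memstore, block_cache, remainder


# Compare the two block cache shares mentioned on the slide for a 10GB heap.
for cache_fraction in (0.2, 0.4):
    memstore, cache, rest = heap_shares(10, block_cache_fraction=cache_fraction)
    print(f"block cache {cache_fraction:.0%}: memstore={memstore:.1f}GB "
          f"cache={cache:.1f}GB remaining={rest:.1f}GB")
```

Raising one share always takes memory from the others, which is why the split needs to be tuned per workload.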
11 
Writes 
• The cluster size is often determined by the write performance 
– Simple schema design implies writing to all (entities) or only one region (events) 
• Works like a log-structured merge tree
– Store mutation in in-memory store and write-ahead log 
– Flush out aggregated, sorted maps at specified threshold - or - when under pressure 
– Discard logs with no pending edits 
– Perform regular compactions of store files
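The write path above can be sketched as a toy model. This is not HBase code: the `Region` class, the count-based flush threshold (HBase uses bytes), and the single-file compaction are simplifying assumptions made only to show the order of steps.

```python
class Region:
    """Toy log-structured write path: WAL + memstore, flush, compaction."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # number of keys, for simplicity
        self.wal = []          # append-only log of edits not yet flushed
        self.memstore = {}     # in-memory store, latest value per key
        self.store_files = []  # flushed, sorted, immutable files (oldest first)

    def put(self, key, value):
        self.wal.append((key, value))   # log first for durability
        self.memstore[key] = value      # then apply to the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore out as one sorted file, then drop the log,
        # since it no longer holds pending edits.
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()

    def compact(self):
        # Merge all store files into one; newer files overwrite older values.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [sorted(merged.items())]


region = Region(flush_threshold=3)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10), ("d", 4), ("e", 5)]:
    region.put(key, value)
print(len(region.store_files), "store files before compaction")
region.compact()
print(region.store_files)
```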
12 
Writes: Flushes and Compactions 
[Figure: store file sizes (MB, 0 to 1000) over time, from older to newer]
13 
Flushes 
• Every mutation call (put, delete etc.) causes a check for a flush 
• If threshold is met, flush file to disk and schedule a compaction 
– Try to compact newly flushed files quickly 
• The compaction returns - if necessary - where a region should be split
14 
Compaction Storms 
• Premature flushing because of # of logs or memory pressure 
– Files will be smaller than the configured flush size 
• The background compactions are hard at work merging small flush files into the existing, larger store files
– Rewrites hundreds of MB over and over
15 
Dependencies 
• Flushes happen across all stores/column families, even if just one triggers it 
• The flush size is compared to the size of all stores combined 
– Many column families dilute the size 
– Example: 55MB + 5MB + 4MB
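To make the dilution effect concrete, here is a minimal sketch. It assumes a 64MB flush threshold (matching the sum in the example) and a hypothetical `flush_check` helper; it only models the rule that the combined size triggers a flush of every store.

```python
FLUSH_SIZE_MB = 64  # assumed flush threshold for this illustration


def flush_check(store_sizes_mb, flush_size_mb=FLUSH_SIZE_MB):
    """Return the flushed file sizes if the combined stores reach the threshold, else None."""
    if sum(store_sizes_mb) >= flush_size_mb:
        # Every store is flushed, regardless of how small it is.
        return list(store_sizes_mb)
    return None


print(flush_check([55, 5, 4]))  # threshold reached: three files, two of them tiny
print(flush_check([40, 5, 4]))  # below threshold: nothing is flushed
```

The small 5MB and 4MB files are what later feed the compactions, which is the reason for keeping uneven column families apart.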
16 
Write-Ahead Log 
• Currently only one per region server 
– Shared across all stores (i.e. column families) 
– Synchronized on file append calls 
• Work being done on mitigating this 
– WAL Compression 
– Multithreaded WAL with Ring Buffer 
– Multiple WALs per region server ➜ Start more than one region server per node?
17 
Write-Ahead Log (cont.) 
• Log size is set to 95% of the default block size
– 64MB or 128MB, but check config!
• Keep the number of logs low to reduce recovery time
– Limit is set to 32, but can be increased
• Increase size of logs - and/or - increase the number of logs before blocking 
• Compute number based on fill distribution and flush frequencies
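A small sketch of the arithmetic behind these settings, assuming the 95% factor and the limit of 32 logs quoted above. The function name `wal_budget` is made up for this example.

```python
def wal_budget(block_size_mb=128, roll_fraction=0.95, max_logs=32):
    """Return (size of one log in MB, total MB of edits held when the log limit is reached)."""
    log_size_mb = block_size_mb * roll_fraction
    return log_size_mb, log_size_mb * max_logs


for block_size in (64, 128):
    log_size, total = wal_budget(block_size_mb=block_size)
    print(f"block={block_size}MB: one log={log_size:.1f}MB, "
          f"{total:.0f}MB across 32 logs")
```

If the memstores together can hold more unflushed data than this total, the log limit is hit first and flushes happen early, so either the log size or the log count has to grow.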
18 
Write-Ahead Log (cont.) 
• Writes are synchronized across all stores 
– A large cell in one family can stop all writes of another 
– In this case the RPC handlers go binary, i.e. either work or all block 
• Can be bypassed on writes, but means no real durability and no replication 
– Maybe use coprocessor to restore dependent data sets (preWALRestore)
19 
Some Numbers 
• Typical write performance of HDFS is 35-50MB/s
Cell Size   OPS
0.5MB       70-100
100KB       350-500
10KB        3500-5000 ??
1KB         35000-50000 ????
This is way too high in practice - Contention!
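The table is simply throughput divided by cell size. A minimal sketch that reproduces it, assuming 1MB = 1000KB as the slide's figures imply:

```python
def theoretical_ops(throughput_mb_s, cell_size_kb):
    """Upper bound on operations per second, using 1MB = 1000KB."""
    return throughput_mb_s * 1000 / cell_size_kb


for cell_size_kb in (500, 100, 10, 1):
    low = theoretical_ops(35, cell_size_kb)
    high = theoretical_ops(50, cell_size_kb)
    print(f"{cell_size_kb:>4}KB cells: {low:.0f}-{high:.0f} ops/s")
```

This bound ignores per-operation overhead entirely, which is why the small-cell rows are unrealistic.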
20 
Some More Numbers 
• Under real-world conditions the rate is lower, more like 15MB/s or less
– Thread contention and serialization overhead cause a massive slowdown
Cell Size   OPS
0.5MB       10
100KB       100
10KB        800
1KB         6000
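Multiplying the observed rates back out gives the effective throughput per cell size. The sketch below uses only the numbers from the table and assumes 1MB = 1000KB.

```python
def effective_mb_per_s(cell_size_kb, ops_per_s):
    """Effective write throughput in MB/s for a given cell size and operation rate."""
    return cell_size_kb * ops_per_s / 1000


# (cell size in KB, observed operations per second) from the table above
observed = [(500, 10), (100, 100), (10, 800), (1, 6000)]
for cell_size_kb, ops in observed:
    rate = effective_mb_per_s(cell_size_kb, ops)
    print(f"{cell_size_kb:>4}KB x {ops:>4} ops/s = {rate:.0f}MB/s")
```

Every row lands at or below the 15MB/s mark.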
21 
Write Performance 
• There are many factors to the overall write performance of a cluster 
– Key Distribution ➜ Avoid region hotspots
– Handlers ➜ Do not pile up too early
– Write-ahead log ➜ Bottleneck #1
– Compactions ➜ If badly tuned, they can cause ever-increasing background noise
22 
Cheat Sheet 
• Ensure you have enough or large enough write-ahead logs 
• Ensure you do not oversubscribe available memstore space 
• Ensure the flush size is set large enough, but not too large
• Check write-ahead log usage carefully 
• Enable compression to store more data per node 
• Tweak compaction algorithm to peg background I/O at some level 
• Consider putting uneven column families in separate tables 
• Check metrics carefully for block cache, memstore, and all queues
23 
Example: Write to All Regions 
• Java Xmx heap at 10GB 
• Memstore share at 40% (default) 
– 10GB Heap x 0.4 = 4GB 
• Desired flush size at 128MB 
– 4GB / 128MB = 32 regions max! 
• For WAL size of 128MB x 0.95
– 4GB / (128MB x 0.95) = ~33 partially uncommitted logs to keep around 
• Region size at 20GB 
– 20GB x 32 regions = 640GB raw storage used
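The same calculation as a short Python sketch, so the inputs can be varied. The function and parameter names are illustrative assumptions; the default values are the ones from the example.

```python
def write_to_all_regions(heap_gb=10, memstore_fraction=0.4, flush_size_mb=128,
                         wal_block_mb=128, wal_roll_fraction=0.95, region_size_gb=20):
    """Derive region count, log count, and raw storage for one region server."""
    memstore_mb = round(heap_gb * 1024 * memstore_fraction)
    max_regions = memstore_mb // flush_size_mb            # regions that can fill a memstore
    logs_to_keep = int(memstore_mb / (wal_block_mb * wal_roll_fraction))
    raw_storage_gb = max_regions * region_size_gb
    return {
        "memstore_mb": memstore_mb,
        "max_regions": max_regions,
        "logs_to_keep": logs_to_keep,
        "raw_storage_gb": raw_storage_gb,
    }


print(write_to_all_regions())
```

Doubling the flush size to 256MB halves the number of regions that can be written to at once, and with it the raw storage a single node can address under this model.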
24 
Notes 
• Compute memstore sizes based on number of written-to regions x flush size 
• Compute number of logs to keep based on fill and flush rate 
• Ultimately the capacity is driven by 
– Java Heap 
– Region Count and Size 
– Key Distribution
25 
Reads 
• Locate and route request to appropriate region server 
– Client caches information for faster lookups 
• Eliminate store files if possible using time ranges or Bloom filter 
• Try block cache, if block is missing then load from disk
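The store file elimination step can be sketched as follows. This is a simplified model, not HBase code: `StoreFile` is a hypothetical class, and a plain Python set stands in for the Bloom filter (a real Bloom filter can also return false positives).

```python
from dataclasses import dataclass, field


@dataclass
class StoreFile:
    name: str
    min_ts: int
    max_ts: int
    row_keys: set = field(default_factory=set)  # stand-in for a Bloom filter

    def might_contain(self, row_key):
        return row_key in self.row_keys


def candidate_files(store_files, row_key, ts_from, ts_to):
    """Keep only files whose time range overlaps [ts_from, ts_to] and that may hold the row."""
    return [
        f for f in store_files
        if f.max_ts >= ts_from and f.min_ts <= ts_to and f.might_contain(row_key)
    ]


files = [
    StoreFile("old", 0, 99, {"row-a", "row-b"}),
    StoreFile("mid", 100, 199, {"row-b"}),
    StoreFile("new", 200, 299, {"row-a"}),
]
# "old" is outside the time range, "mid" cannot contain the row: only "new" is read.
print([f.name for f in candidate_files(files, "row-a", 150, 300)])
```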
26 
Seeking with Bloom Filters
27 
Writes: Where’s the Data at?
[Figure: store file sizes (MB, 0 to 1000) over time, from older to newer; legend: Existing Row Mutations, Unique Row Inserts]
28 
Block Cache 
• Use exported metrics to see effectiveness of block cache 
– Check fill and eviction rate, as well as hit ratios ➜ random reads are not ideal 
• Tweak up or down as needed, but watch overall heap usage 
• You absolutely need the block cache 
– Set to 10% at least for short term benefits
29 
Testing: Scans 
HBase scan performance 
• Use available tools to test 
• Determine raw and KeyValue read performance 
– Raw is just bytes, while KeyValue means block parsing 
• Insert data using YCSB, then compact table 
– Single region enforced 
• Two test cases 
– Small data: 1 column with 1 byte value 
– Large(r) data: 1 column with 1KB value 
• About same size for both in total: 15GB 
30 
Testing: Scans 
31 
Scan Row Range 
• Set start and end key to limit scan size
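Because rows are stored sorted by key, a bounded scan only touches the keys between the two bounds. The sketch below models this over a sorted Python list with `bisect`; it is not the HBase client API, and the `scan` helper and sample keys are made up. It follows the HBase convention of an inclusive start key and an exclusive end key.

```python
import bisect


def scan(sorted_rows, start_key=None, stop_key=None):
    """Return (key, value) pairs with start_key <= key < stop_key from rows sorted by key."""
    keys = [key for key, _ in sorted_rows]
    lo = 0 if start_key is None else bisect.bisect_left(keys, start_key)
    hi = len(keys) if stop_key is None else bisect.bisect_left(keys, stop_key)
    return sorted_rows[lo:hi]


rows = [("user-001", "a"), ("user-002", "b"), ("user-010", "c"), ("user-020", "d")]
print(scan(rows, "user-002", "user-020"))  # bounded: two rows
print(len(scan(rows)))                     # unbounded: full table scan
```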
32 
Best Practices 
“If you spend too much time thinking about a thing, you'll never get it done.” 
— Bruce Lee 
33 
How to Plan 
Advice on 
• Number of nodes 
• Number of disks and total disk capacity
• RAM capacity 
• Region sizes and count 
• Compaction tuning 
34 
Advice on Nodes 
• Use previous example to compute effective storage based on heap size, region 
count and size 
– 10GB heap x 0.4 / 128MB x 20GB = 640GB, if all regions are active 
– Address more storage with read-from-only regions 
• Typical advice is to use more nodes with fewer, smaller disks (6 x 1TB SATA or 
600GB SAS, or SSDs) 
• CPU is not an issue, I/O is (even with compression) 
35 
Advice on Nodes 
• Memory is not an issue; heap sizes are kept small because of Java garbage collection limitations
– Up to 20GB has been used 
– Newer versions of Java should help 
– Use off-heap cache 
• Current servers typically have 48GB+ memory 
36 
Advice on Tuning 
• Trade off throughput against size of single data points 
– This might cause schema redesign 
• Trade off read performance against write amplification 
– Advise users to understand read/write performance and background write amplification 
Ø This drives the number of nodes needed! 
37 
Advice on Cluster Sizing 
• Compute the number of nodes needed based on 
– Total storage needed 
– Throughput required for both reads and writes
• Assume ≈15MB/s minimum for each read and write 
– Increasing the KeyValue sizes improves this 
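A back-of-the-envelope version of this calculation, under stated assumptions: the 15MB/s figure is treated as a per-node rate, 640GB per node is taken from the earlier write example, and the workload numbers in the usage line are hypothetical. Replication and growth headroom are left out.

```python
import math


def nodes_needed(total_storage_gb, storage_per_node_gb,
                 write_mb_s, read_mb_s, per_node_mb_s=15):
    """Return the node count required by the tightest of storage, writes, and reads."""
    by_storage = math.ceil(total_storage_gb / storage_per_node_gb)
    by_writes = math.ceil(write_mb_s / per_node_mb_s)
    by_reads = math.ceil(read_mb_s / per_node_mb_s)
    return max(by_storage, by_writes, by_reads)


# Hypothetical workload: 10TB of data, 300MB/s of writes, 150MB/s of reads.
print(nodes_needed(10 * 1024, 640, write_mb_s=300, read_mb_s=150))
```

In this made-up case the write rate, not the storage volume, sets the cluster size.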
38 
Example: Twitter Firehose 
39 
Example: Consume Data 
40 
HBase Heap Usage 
• Overall addressable amount of data is driven by heap size
– Only read-from regions need space for indexes, filters
– Written-to regions also need MemStore space
• Java heap space is still limited, as garbage collections will cause pauses
– Typically up to 20GB heap
– Or invest in pause-less GC
41 
Summary 
“All fixed set patterns are incapable of 
adaptability or pliability. The truth is 
outside of all fixed patterns.” 
— Bruce Lee 
42 
WHAT, BRUCE? IT DEPENDS? :(
43 
Checklist 
To plan for the size of an HBase cluster you have to: 
• Know the use-case 
– Read/write mix 
– Expected throughput 
– Retention policy 
• Optimize the schema and compaction strategy 
– Devise a schema that allows for only some regions being written to 
• Take “known” numbers to compute cluster size 
Thank you 
@larsgeorge

More Related Content

What's hot

Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionKaran Singh
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path HBaseCon
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersCloudera, Inc.
 
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...xKinAnx
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
 
How to Use EXAchk Effectively to Manage Exadata Environments
How to Use EXAchk Effectively to Manage Exadata EnvironmentsHow to Use EXAchk Effectively to Manage Exadata Environments
How to Use EXAchk Effectively to Manage Exadata EnvironmentsSandesh Rao
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Kafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presentedKafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presentedSumant Tambe
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Oracle ACFS High Availability NFS Services (HANFS)
Oracle ACFS High Availability NFS Services (HANFS)Oracle ACFS High Availability NFS Services (HANFS)
Oracle ACFS High Availability NFS Services (HANFS)Anju Garg
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 

What's hot (20)

Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
 
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
 
How to Use EXAchk Effectively to Manage Exadata Environments
How to Use EXAchk Effectively to Manage Exadata EnvironmentsHow to Use EXAchk Effectively to Manage Exadata Environments
How to Use EXAchk Effectively to Manage Exadata Environments
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Kafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presentedKafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presented
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Oracle ACFS High Availability NFS Services (HANFS)
Oracle ACFS High Availability NFS Services (HANFS)Oracle ACFS High Availability NFS Services (HANFS)
Oracle ACFS High Availability NFS Services (HANFS)
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 

Viewers also liked

HBase Sizing Notes
HBase Sizing NotesHBase Sizing Notes
HBase Sizing Noteslarsgeorge
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best PracticesVenu Anuganti
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014larsgeorge
 
Big Data is not Rocket Science
Big Data is not Rocket ScienceBig Data is not Rocket Science
Big Data is not Rocket Sciencelarsgeorge
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014larsgeorge
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLVenu Anuganti
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
Spark Streaming Data Pipelines
Spark Streaming Data PipelinesSpark Streaming Data Pipelines
Spark Streaming Data PipelinesMapR Technologies
 
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Spark Summit
 
Social Networks and the Richness of Data
Social Networks and the Richness of DataSocial Networks and the Richness of Data
Social Networks and the Richness of Datalarsgeorge
 
Ysance conference - cloud computing - aws - 3 mai 2010
Ysance   conference - cloud computing - aws - 3 mai 2010Ysance   conference - cloud computing - aws - 3 mai 2010
Ysance conference - cloud computing - aws - 3 mai 2010Ysance
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Productiontrihug
 
See who is using MemSQL
See who is using MemSQLSee who is using MemSQL
See who is using MemSQLjenjermain
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoalarsgeorge
 

Viewers also liked (20)

HBase Sizing Notes
HBase Sizing NotesHBase Sizing Notes
HBase Sizing Notes
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
 
Hbase at Salesforce.com
Hbase at Salesforce.comHbase at Salesforce.com
Hbase at Salesforce.com
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014
 
Big Data is not Rocket Science
Big Data is not Rocket ScienceBig Data is not Rocket Science
Big Data is not Rocket Science
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQL
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Spark Streaming Data Pipelines
Spark Streaming Data PipelinesSpark Streaming Data Pipelines
Spark Streaming Data Pipelines
 
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
 
HBASE Overview
HBASE OverviewHBASE Overview
HBASE Overview
 
Social Networks and the Richness of Data
Social Networks and the Richness of DataSocial Networks and the Richness of Data
Social Networks and the Richness of Data
 
Ysance conference - cloud computing - aws - 3 mai 2010
Ysance   conference - cloud computing - aws - 3 mai 2010Ysance   conference - cloud computing - aws - 3 mai 2010
Ysance conference - cloud computing - aws - 3 mai 2010
 
Hadoop unit
Hadoop unitHadoop unit
Hadoop unit
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Production
 
See who is using MemSQL
See who is using MemSQLSee who is using MemSQL
See who is using MemSQL
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
 

Similar to HBase Sizing Guide

In-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesIn-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesHazelcast
 
SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)Lars Marowsky-Brée
 
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Lars Marowsky-Brée
 
#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future DesignPivotalOpenSourceHub
 
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Community
 
Responding rapidly when you have 100+ GB data sets in Java
Responding rapidly when you have 100+ GB data sets in JavaResponding rapidly when you have 100+ GB data sets in Java
Responding rapidly when you have 100+ GB data sets in JavaPeter Lawrey
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionSplunk
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteDataWorks Summit
 
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsJava one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsSpeedment, Inc.
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar AhmedPGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar AhmedEqunix Business Solutions
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxManish Maheshwari
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Mark Kromer
 
Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014marvin herrera
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaManish Maheshwari
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
MariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB plc
 
MariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB plc
 

Similar to HBase Sizing Guide (20)

In-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesIn-memory Data Management Trends & Techniques
In-memory Data Management Trends & Techniques
 
SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)
 
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
 
#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
 
Responding rapidly when you have 100+ GB data sets in Java

HBase Sizing Guide

  • 1. Sizing Your HBase Cluster Lars George | @larsgeorge EMEA Chief Architect @ Cloudera
  • 2. 2 Agenda • Introduction • Technical Background/Primer • Best Practices • Summary ©2014 Cloudera, Inc. All rights reserved.
  • 3. 3 Who I am… Lars George [EMEA Chief Architect] • Clouderan since October 2010 • Hadooper since mid 2007 • HBase/Whirr Committer (of Hearts) • github.com/larsgeorge
  • 4. 4 Bruce Lee: “As you think, so shall you become.”
  • 5. 5 Introduction
  • 6. 6 HBase Sizing Is... • Making the most out of the cluster you have by... – Understanding how HBase uses low-level resources – Helping HBase understand your use-case by configuring it appropriately - and/or - – Designing the use-case to help HBase along • Being able to gauge how many servers are needed for a given use-case
  • 7. 7 Technical Background “To understand your fear is the beginning of really seeing…” — Bruce Lee
  • 8. 8 HBase Dilemma Although HBase can host many applications, they may require completely opposite features Events Entities Time Series Message Store
  • 9. 9 Competing Resources • Reads and Writes compete for the same low-level resources – Disk (HDFS) and Network I/O – RPC Handlers and Threads – Memory (Java Heap) • Otherwise they do exercise completely separate code paths
  • 10. 10 Memory Sharing • By default every region server divides its memory (i.e. the given maximum heap) into – 40% for in-memory stores (write ops) – 20% (40%) for block caching (read ops) – Remaining space (here 40% or 20%) goes towards usual Java heap usage • Objects etc. • Region information (HFile metadata) • Share of memory needs to be tweaked
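The split described on this slide can be worked through with a few lines of arithmetic. This is only a sketch: the 10GB heap is an assumed example value, and the function and key names are made up for illustration, not HBase settings.

```python
def split_heap(heap_mb, memstore_pct=40, block_cache_pct=20):
    """Split a region server heap (in MB) into the three shares named on the slide."""
    memstore_mb = heap_mb * memstore_pct // 100
    block_cache_mb = heap_mb * block_cache_pct // 100
    # Whatever is left serves regular Java heap usage (objects, HFile metadata).
    other_mb = heap_mb - memstore_mb - block_cache_mb
    return {"memstore": memstore_mb, "block_cache": block_cache_mb, "other": other_mb}


# Assumed example: a 10GB heap with the 40% / 20% split.
default_split = split_heap(10 * 1024)
# Same heap with the block cache raised to 40%, leaving only 20% for the rest.
read_heavy_split = split_heap(10 * 1024, block_cache_pct=40)

print(default_split)
print(read_heavy_split)
```

Raising one share always comes out of another, which is why the slide says the shares need to be tweaked per use-case.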
  • 11. 11 Writes • The cluster size is often determined by the write performance – Simple schema design implies writing to all (entities) or only one region (events) • Works like a log-structured merge tree – Store mutation in in-memory store and write-ahead log – Flush out aggregated, sorted maps at specified threshold - or - when under pressure – Discard logs with no pending edits – Perform regular compactions of store files
  • 12. 12 Writes: Flushes and Compactions [Figure: store file SIZE (MB, 0 to 1000) over TIME, older to newer]
  • 13. 13 Flushes • Every mutation call (put, delete etc.) causes a check for a flush • If threshold is met, flush file to disk and schedule a compaction – Try to compact newly flushed files quickly • The compaction returns - if necessary - where a region should be split
  • 14. 14 Compaction Storms • Premature flushing because of # of logs or memory pressure – Files will be smaller than the configured flush size • The background compactions are hard at work merging small flush files into the existing, larger store files – Rewrite hundreds of MB over and over
  • 15. 15 Dependencies • Flushes happen across all stores/column families, even if just one triggers it • The flush size is compared to the size of all stores combined – Many column families dilute the size – Example: 55MB + 5MB + 4MB
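The dilution effect from the example can be sketched as follows. It assumes the slide's 55MB + 5MB + 4MB example is measured against a 64MB flush size (the slide does not state the threshold), and the function and family names are hypothetical.

```python
def flushed_files(store_sizes_mb, flush_size_mb):
    """Return the file size written per column family if a flush is triggered.

    Models the behaviour described on the slide: the flush size is compared
    to the combined size of all stores, and a flush covers every store.
    """
    total_mb = sum(store_sizes_mb.values())
    if total_mb >= flush_size_mb:
        # All families are flushed together, however small each one is.
        return dict(store_sizes_mb)
    return {}


# Three column families with uneven fill (names are made up).
stores = {"cf_large": 55, "cf_small_a": 5, "cf_small_b": 4}

# Assumed 64MB flush size: the combined 64MB triggers a flush and two tiny files result.
print(flushed_files(stores, flush_size_mb=64))
# With a 128MB flush size nothing is flushed yet.
print(flushed_files(stores, flush_size_mb=128))
```

The small 5MB and 4MB files are the kind that later feed the background compactions.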
  • 16. 16 Write-Ahead Log • Currently only one per region server – Shared across all stores (i.e. column families) – Synchronized on file append calls • Work being done on mitigating this – WAL Compression – Multithreaded WAL with Ring Buffer – Multiple WALs per region server ➜ Start more than one region server per node?
  • 17. 17 Write-Ahead Log (cont.) • Size set to 95% of default block size – 64MB or 128MB, but check config! • Keep number low to reduce recovery time – Limit set to 32, but can be increased • Increase size of logs - and/or - increase the number of logs before blocking • Compute number based on fill distribution and flush frequencies
  • 18. 18 Write-Ahead Log (cont.) • Writes are synchronized across all stores – A large cell in one family can stop all writes of another – In this case the RPC handlers go binary, i.e. either work or all block • Can be bypassed on writes, but means no real durability and no replication – Maybe use coprocessor to restore dependent data sets (preWALRestore)
  • 19. 19 Some Numbers • Typical write performance of HDFS is 35-50MB/s • Cell Size ➜ OPS – 0.5MB ➜ 70-100 – 100KB ➜ 350-500 – 10KB ➜ 3500-5000 ?? – 1KB ➜ 35000-50000 ???? • This is way too high in practice - Contention!
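The figures on this slide follow from dividing the HDFS write rate by the cell size. A minimal sketch, assuming decimal units (1MB = 1000KB) so the results match the slide's round numbers; the function name is illustrative.

```python
def ops_per_second(throughput_mb_s, cell_size_kb):
    """Upper bound on writes per second if raw throughput were the only limit."""
    # Decimal units (1MB = 1000KB) to match the round figures on the slide.
    return throughput_mb_s * 1000 / cell_size_kb


# Cell sizes from the slide: 0.5MB, 100KB, 10KB, 1KB.
for cell_kb in (500, 100, 10, 1):
    low = ops_per_second(35, cell_kb)
    high = ops_per_second(50, cell_kb)
    print(f"{cell_kb}KB cells: {low:.0f}-{high:.0f} ops/s")
```

This is a ceiling computed from bandwidth alone; it ignores the contention that the slide flags.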
  • 20. 20 Some More Numbers • Under real world conditions the rate is less, more like 15MB/s or less – Thread contention and serialization overhead cause a massive slowdown • Cell Size ➜ OPS – 0.5MB ➜ 10 – 100KB ➜ 100 – 10KB ➜ 800 – 1KB ➜ 6000
  • 21. 21 Write Performance • There are many factors to the overall write performance of a cluster – Key Distribution ➜ Avoid region hotspot – Handlers ➜ Do not pile up too early – Write-ahead log ➜ Bottleneck #1 – Compactions ➜ Badly tuned can cause ever increasing background noise
  • 22. 22 Cheat Sheet • Ensure you have enough or large enough write-ahead logs • Ensure you do not oversubscribe available memstore space • Ensure the flush size is large enough but not too large • Check write-ahead log usage carefully • Enable compression to store more data per node • Tweak compaction algorithm to peg background I/O at some level • Consider putting uneven column families in separate tables • Check metrics carefully for block cache, memstore, and all queues
  • 23. 23 Example: Write to All Regions • Java Xmx heap at 10GB • Memstore share at 40% (default) – 10GB Heap x 0.4 = 4GB • Desired flush size at 128MB – 4GB / 128MB = 32 regions max! • For WAL size of 128MB x 0.95 (95% of the block size) – 4GB / (128MB x 0.95) = ~33 partially uncommitted logs to keep around • Region size at 20GB – 20GB x 32 regions = 640GB raw storage used
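The arithmetic of this example can be reproduced step by step. The input values are the ones on the slide; the variable names are chosen for illustration, and rounding the log count down to 33 is an assumption made to match the slide's "~33".

```python
import math

heap_mb = 10 * 1024          # Java Xmx heap of 10GB
memstore_pct = 40            # default memstore share
flush_size_mb = 128          # desired flush size
wal_size_mb = 128 * 0.95     # WAL rolls at 95% of a 128MB block
region_size_gb = 20

# Heap available to all memstores on the region server.
memstore_mb = heap_mb * memstore_pct // 100

# Regions that can fill a memstore to the flush size at the same time.
max_active_regions = memstore_mb // flush_size_mb

# Logs needed to cover the edits still held in the memstores.
wal_count = math.floor(memstore_mb / wal_size_mb)

# Raw storage addressed if every region is written to.
raw_storage_gb = max_active_regions * region_size_gb

print(memstore_mb, max_active_regions, wal_count, raw_storage_gb)
```

Changing the heap, flush size, or region size in this sketch shows how directly each one moves the storage a single node can actively serve.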
  • 24. 24 Notes • Compute memstore sizes based on number of written-to regions x flush size • Compute number of logs to keep based on fill and flush rate • Ultimately the capacity is driven by – Java Heap – Region Count and Size – Key Distribution
  • 25. 25 Reads • Locate and route request to appropriate region server – Client caches information for faster lookups • Eliminate store files if possible using time ranges or Bloom filter • Try block cache, if block is missing then load from disk
  • 26. 26 Seeking with Bloom Filters
  • 27. 27 Writes: Where’s the Data at? [Figure: store file SIZE (MB, 0 to 1000) over TIME, older to newer; series: Existing Row Mutations, Unique Row Inserts]
  • 28. 28 Block Cache • Use exported metrics to see effectiveness of block cache – Check fill and eviction rate, as well as hit ratios ➜ random reads are not ideal • Tweak up or down as needed, but watch overall heap usage • You absolutely need the block cache – Set to 10% at least for short term benefits
  • 29. 29 Testing: Scans HBase scan performance • Use available tools to test • Determine raw and KeyValue read performance – Raw is just bytes, while KeyValue means block parsing • Insert data using YCSB, then compact table – Single region enforced • Two test cases – Small data: 1 column with 1 byte value – Large(r) data: 1 column with 1KB value • About same size for both in total: 15GB
  • 30. 30 Testing: Scans
  • 31. 31 Scan Row Range • Set start and end key to limit scan size
  • 32. 32 Best Practices “If you spend too much time thinking about a thing, you'll never get it done.” — Bruce Lee
  • 33. 33 How to Plan Advice on • Number of nodes • Number of disks and total disk capacity • RAM capacity • Region sizes and count • Compaction tuning
  • 34. 34 Advice on Nodes • Use previous example to compute effective storage based on heap size, region count and size – 10GB heap x 0.4 / 128MB x 20GB = 640GB, if all regions are active – Address more storage with read-from-only regions • Typical advice is to use more nodes with fewer, smaller disks (6 x 1TB SATA or 600GB SAS, or SSDs) • CPU is not an issue, I/O is (even with compression)
  • 35. 35 Advice on Nodes • Memory is not an issue, heap sizes are kept small because of Java garbage collection limitations – Up to 20GB has been used – Newer versions of Java should help – Use off-heap cache • Current servers typically have 48GB+ memory
  • 36. 36 Advice on Tuning • Trade off throughput against size of single data points – This might cause schema redesign • Trade off read performance against write amplification – Advise users to understand read/write performance and background write amplification ➜ This drives the number of nodes needed!
  • 37. 37 Advice on Cluster Sizing • Compute the number of nodes needed based on – Total storage needed – Throughput required for both reads and writes • Assume ≈15MB/s minimum for each read and write – Increasing the KeyValue sizes improves this
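A rough node count can be derived by taking the larger of the storage-driven and throughput-driven estimates. This sketch assumes the 640GB of actively written storage per node from the earlier example and reads the ≈15MB/s figure as a per-node rate; the workload numbers (10TB, 300MB/s) and the function name are hypothetical.

```python
import math


def nodes_needed(total_storage_gb, required_mb_s,
                 storage_per_node_gb=640, throughput_per_node_mb_s=15):
    """Estimate the node count as the larger of two independent requirements."""
    by_storage = math.ceil(total_storage_gb / storage_per_node_gb)
    by_throughput = math.ceil(required_mb_s / throughput_per_node_mb_s)
    # The cluster has to satisfy both constraints, so the larger one wins.
    return max(by_storage, by_throughput)


# Hypothetical workload: 10TB of data and 300MB/s of sustained writes.
print(nodes_needed(total_storage_gb=10_000, required_mb_s=300))
```

In this made-up case throughput, not capacity, sets the cluster size, which mirrors the earlier point that write performance often determines it.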
  • 38. 38 Example: Twitter Firehose
  • 39. 39 Example: Consume Data
  • 40. 40 HBase Heap Usage • Overall addressable amount of data is driven by heap size – Only read-from regions need space for indexes, filters – Written-to regions also need MemStore space • Java heap space is limited still as garbage collections will cause pauses – Typically up to 20GB heap – Or invest in pause-less GC
  • 41. 41 Summary “All fixed set patterns are incapable of adaptability or pliability. The truth is outside of all fixed patterns.” — Bruce Lee
  • 42. 42 WHAT BRUCE? IT DEPENDS? :(
  • 43. 43 Checklist To plan for the size of an HBase cluster you have to: • Know the use-case – Read/write mix – Expected throughput – Retention policy • Optimize the schema and compaction strategy – Devise a schema that allows for only some regions being written to • Take “known” numbers to compute cluster size