SlideShare a Scribd company logo
1 of 18
Download to read offline
© Cloudera, Inc. All rights reserved.
HBase Tales From the Trenches
Wellington Chevreuil
© Cloudera, Inc. All rights reserved.
Agenda
• Common types of problems
• Most affected features
• Common reasons
• Case stories review
• General best practices
© Cloudera, Inc. All rights reserved.
Common types of problems
• RegionServers/Master crashes
• Performance
• Regions stuck in transition (RIT)
• FileSystem usage
• Corruption / Data loss / Replication data consistency
© Cloudera, Inc. All rights reserved.
Most affected features or sub-systems
• AssignmentManager
• Replication
• Snapshot
• Region Server Memory Management
• Master Initialisation
• WAL/StoreFile
• Custom applications / filters / co-processors
© Cloudera, Inc. All rights reserved.
Common reasons
• Performance
• Overload/under-dimensioned cluster
• Non optimal configurations
• Crashes
• Memory exhaustion
• File system issues
• Known bugs
• RIT
• Bugs
• Self induced (wrong hbck commands triggered)
© Cloudera, Inc. All rights reserved.
Common reasons
• RIT (cont.)
• Can also happen as side effect of performance, crash or corruption issues
• File system usage exhaustion
• Replication related issues
• Too many snapshots
• Corruption / Data loss / Replication data consistency
• Bugs
• Faulty peers / custom or third party components
• FileSystem problems
© Cloudera, Inc. All rights reserved.
Case Stories Review
© Cloudera, Inc. All rights reserved.
Case story - RegionServers slow/crashing randomly
Type: Process crash | Service outage | Performance
Feature: Region Server Memory Management
Reason: Long GC pauses, due to mismatching heap sizes and workloads
Diagnosing:
• Frequent JvmPauseMonitor alerts on RSes logs
• Occasionally OOME on stdout
• Too many regions per RS (more than 200 regions)
• JVM Heap usage charts show wide heap usage (JVisualVM/Jconsole)
Resolution:
• Initially increase the heap size, but CMS may experience slowness with large heaps
• For heaps larger than 20GB, general G1 recommendations from Cloudera engineering
blog post had provided good results
• Horizontal scaling by adding more RSes
© Cloudera, Inc. All rights reserved.
Case story - Slow scans, compactions delayed
Type: Performance
Feature: Store File
Reason: PrefixTree HFile encoding issues (HBASE-17375)
Diagnosing:
• Compaction queue piling up
• jstacks from RSes show below trace over several frames:
"regionserver/hadoop30-r5.phx.impactradius.net/10.16.20.138:60020-longCompactions-1550194360449" #111 prio=5 os_prio=0
tid=0x00007fcb429b3800 nid=0x243a8 runnable [0x00007fc3481c7000]
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:214)
at org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.next(PrefixTreeSeeker.java:127)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1278)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:628)
Resolution: Disable table, manual compact it with CompactionTool, disable PrefixTree encoding
Mitigation: Disable PrefixTree encoding (Not supported anymore from 2.0 onwards)
© Cloudera, Inc. All rights reserved.
Case story - My HBase is slow
Type: Performance
Feature: Client Application Read/Write operations
Reason:
• Dependency services underperforming
• Poor client implementation not reusing connections
• Faulty CPs or custom filters
Diagnosing:
• Client application/RSes jstacks
• General HBase stats such as: compaction queue size, data locality, cache hit ratio
• HDFS/ZK logs
Resolution: Usually require tunings on dependency services or redesign of
client application/custom CPs/Filters
© Cloudera, Inc. All rights reserved.
Case story - Client scans failing | HBCK reports
inconsistencies
Type: RIT
Feature: AssignmentManager
Reason: Various
• Misuse of hbck
• Snapshots cold backups out of hdfs
• Busy/overloaded clusters where regions keep moving constantly
Diagnosing:
• Evident from hbck reports/Master Web UI
• Master logs would show RS opening/hosting region
• RS holding region should have relevant error message logs
Resolution:
• There's no single recipe
• Each case may require a combination of hbck/hbck2 commands
© Cloudera, Inc. All rights reserved.
Case story - Master timing out during initialisation
Type: Master Crash | Service Outage
Feature: Master initialisation
Reason: Different bugs can cause procedures to pile up:
• HBASE-22263, HBASE-16488, HBASE-18109
Diagnosing:
• Listing "/hbase/MasterProcWALs" shows hundreds or more files
• Master times out and crashes before assigning namespace region
ERROR org.apache.hadoop.hbase.master.HMaster: Master failed to complete initialization after
Resolution:
• Stop Master and clean "/hbase/MasterProcWALs" folder
• Caution when on hbase > 2.x releases
Mitigation:
• Increase init timeout, number of open region threads
© Cloudera, Inc. All rights reserved.
Case story - Replication lags
Type: Replication stuck
Feature: Replication
Reason: Single WAL entries with too many OPs, leading to RPCs larger than
"hbase.ipc.server.max.callqueue.size"
Diagnosing: Destination peer RSes showing type of log messages below
2018-09-07 10:40:59,506 WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=MY_TABLE, attempt=4/4 failed=2ops,
last exception: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call
queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small? on
region-server-1.example.com,,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
Resolution: requires wal copy and replay from source to destination, plus
manual znode cleanup
Mitigation: releases including HBASE-18027 would prevent this situation
© Cloudera, Inc. All rights reserved.
Case story - Client scan failing on specific regions
Type: HFile Corruption
Feature: Store File
Reason: Unknown
Diagnosing: Following errors when scanning specific regions
java.lang.InternalError: Could not decompress data. Input is invalid.
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163) Caused by: java.lang.NegativeArraySizeException at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1718) at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542) at
Or
Resolution: Requires sideline of affecting files and re-ingestion of row keys
stored on this file. Potential data loss
© Cloudera, Inc. All rights reserved.
Case story - HBase is eating HDFS space
Type: FileSystem usage
Feature: Replication | Snapshot | Compaction | Cleaners
Reason: Various
• Replication: Slow, faulty or disabled peer, missing tables on remote peer
• Snapshot: Too many snapshots being retained
• Compaction and Cleaners threads stuck or not running
Diagnosing:
• Check usage for "archive" and "oldWALS"
• Master logs would show if cleaner threads are running
• Is replication stuck or lagging?
• How about snapshot retention policy?
Resolution:
• If cleaner threads are not running, restart master
• For disabled peers, only enabling it again, or remove it, if no replication is wanted
• Too many snapshots would require some cleaning or cold backup
• Replication lags reason may vary
© Cloudera, Inc. All rights reserved.
General best practices
• Heap usage monitoring
• Keep regions/RS on low hundreds
• Consider G1 GC for heaps > 20GB
• Data locality
• Adjust caching according to workload
• Compaction Policy (Consider offline compactions using CompactionTool)
• Consider an exclusive Zookeeper for HBase
© Cloudera, Inc. All rights reserved.
General best practices
• Adjust Master initialization timeout accordingly
• Consider increase number of "open region" handlers
• Define reasonable snapshot retention policy
• Caution with experimental/non stable features (snappy/prefixtree)
• Define deployment policy for custom applications/CPs/Filters
• Define patch/bug fix upgrades schedule
© Cloudera, Inc. All rights reserved.
Q&A

More Related Content

What's hot

Handling Schema Changes Using pt-online-schema change.
Handling Schema Changes Using pt-online-schema change.Handling Schema Changes Using pt-online-schema change.
Handling Schema Changes Using pt-online-schema change.Mydbops
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBaseCon
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
HBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, SolutionsHBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, SolutionsDataWorks Summit
 
Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformJean-Paul Azar
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
How to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams SafeHow to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams Safeconfluent
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...confluent
 
Securing Kafka
Securing Kafka Securing Kafka
Securing Kafka confluent
 

What's hot (20)

Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Handling Schema Changes Using pt-online-schema change.
Handling Schema Changes Using pt-online-schema change.Handling Schema Changes Using pt-online-schema change.
Handling Schema Changes Using pt-online-schema change.
 
YARN High Availability
YARN High AvailabilityYARN High Availability
YARN High Availability
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
HBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, SolutionsHBase coprocessors, Uses, Abuses, Solutions
HBase coprocessors, Uses, Abuses, Solutions
 
SQL Server Clustering Part1
SQL Server Clustering Part1SQL Server Clustering Part1
SQL Server Clustering Part1
 
Apache phoenix
Apache phoenixApache phoenix
Apache phoenix
 
Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platform
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
How to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams SafeHow to Lock Down Apache Kafka and Keep Your Streams Safe
How to Lock Down Apache Kafka and Keep Your Streams Safe
 
kafka
kafkakafka
kafka
 
HDFS Selective Wire Encryption
HDFS Selective Wire EncryptionHDFS Selective Wire Encryption
HDFS Selective Wire Encryption
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
 
IBM GPFS
IBM GPFSIBM GPFS
IBM GPFS
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Securing Kafka
Securing Kafka Securing Kafka
Securing Kafka
 

Similar to HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field.

HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trencheswchevreuil
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxManish Maheshwari
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaManish Maheshwari
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaJason Shih
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaCloudera, Inc.
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Tingjaxconf
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best PracticesVenu Anuganti
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera FieldHBaseCon
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBaseCon
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014Nick Dimiduk
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Perforce
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Chris Nauroth
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 

Similar to HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field. (20)

HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trenches
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptx
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Ting
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 

More from DataWorks Summit

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart Cities
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field.

  • 1. © Cloudera, Inc. All rights reserved. HBase Tales From the Trenches Wellington Chevreuil
  • 2. © Cloudera, Inc. All rights reserved. Agenda • Common types of problems • Most affected features • Common reasons • Case stories review • General best practices
  • 3. © Cloudera, Inc. All rights reserved. Common types of problems • RegionServers/Master crashes • Performance • Regions stuck in transition (RIT) • FileSystem usage • Corruption / Data loss / Replication data consistency
  • 4. © Cloudera, Inc. All rights reserved. Most affected features or sub-systems • AssignmentManager • Replication • Snapshot • Region Server Memory Management • Master Initialisation • WAL/StoreFile • Custom applications / filters / co-processors
  • 5. © Cloudera, Inc. All rights reserved. Common reasons • Performance • Overload/under-dimensioned cluster • Non optimal configurations • Crashes • Memory exhaustion • File system issues • Known bugs • RIT • Bugs • Self induced (wrong hbck commands triggered)
  • 6. © Cloudera, Inc. All rights reserved. Common reasons • RIT (cont.) • Can also happen as side effect of performance, crash or corruption issues • File system usage exhaustion • Replication related issues • Too many snapshots • Corruption / Data loss / Replication data consistency • Bugs • Faulty peers / custom or third party components • FileSystem problems
  • 7. © Cloudera, Inc. All rights reserved. Case Stories Review
  • 8. © Cloudera, Inc. All rights reserved. Case story - RegionServers slow/crashing randomly Type: Process crash | Service outage | Performance Feature: Region Server Memory Management Reason: Long GC pauses, due to mismatching heap sizes and workloads Diagnosing: • Frequent JvmPauseMonitor alerts on RSes logs • Occasionally OOME on stdout • Too many regions per RS (more than 200 regions) • JVM Heap usage charts show wide heap usage (JVisualVM/Jconsole) Resolution: • Initially increase the heap size, but CMS may experience slowness with large heaps • For heaps larger than 20GB, general G1 recommendations from Cloudera engineering blog post had provided good results • Horizontal scaling by adding more RSes
  • 9. © Cloudera, Inc. All rights reserved. Case story - Slow scans, compactions delayed Type: Performance Feature: Store File Reason: PrefixTree HFile encoding issues (HBASE-17375) Diagnosing: • Compaction queue piling up • jstacks from RSes show below trace over several frames: "regionserver/hadoop30-r5.phx.impactradius.net/10.16.20.138:60020-longCompactions-1550194360449" #111 prio=5 os_prio=0 tid=0x00007fcb429b3800 nid=0x243a8 runnable [0x00007fc3481c7000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:214) at org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.next(PrefixTreeSeeker.java:127) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1278) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108) at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:628) Resolution: Disable table, manual compact it with CompactionTool, disable PrefixTree encoding Mitigation: Disable PrefixTree encoding (Not supported anymore from 2.0 onwards)
  • 10. © Cloudera, Inc. All rights reserved. Case story - My HBase is slow Type: Performance Feature: Client Application Read/Write operations Reason: • Dependency services underperforming • Poor client implementation not reusing connections • Faulty CPs or custom filters Diagnosing: • Client application/RSes jstacks • General HBase stats such as: compaction queue size, data locality, cache hit ratio • HDFS/ZK logs Resolution: Usually require tunings on dependency services or redesign of client application/custom CPs/Filters
  • 11. © Cloudera, Inc. All rights reserved. Case story - Client scans failing | HBCK reports inconsistencies Type: RIT Feature: AssignmentManager Reason: Various • Misuse of hbck • Snapshots cold backups out of hdfs • Busy/overloaded clusters where regions keep moving constantly Diagnosing: • Evident from hbck reports/Master Web UI • Master logs would show RS opening/hosting region • RS holding region should have relevant error message logs Resolution: • There's no single recipe • Each case may require a combination of hbck/hbck2 commands
  • 12. © Cloudera, Inc. All rights reserved. Case story - Master timing out during initialisation Type: Master Crash | Service Outage Feature: Master initialisation Reason: Different bugs can cause procedures to pile up: • HBASE-22263, HBASE-16488, HBASE-18109 Diagnosing: • Listing "/hbase/MasterProcWALs" shows hundreds or more files • Master times out and crashes before assigning namespace region ERROR org.apache.hadoop.hbase.master.HMaster: Master failed to complete initialization after Resolution: • Stop Master and clean "/hbase/MasterProcWALs" folder • Caution when on hbase > 2.x releases Mitigation: • Increase init timeout, number of open region threads
  • 13. © Cloudera, Inc. All rights reserved. Case story - Replication lags Type: Replication stuck Feature: Replication Reason: Single WAL entries with too many OPs, leading to RPCs larger than "hbase.ipc.server.max.callqueue.size" Diagnosing: Destination peer RSes showing type of log messages below 2018-09-07 10:40:59,506 WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=MY_TABLE, attempt=4/4 failed=2ops, last exception: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small? on region-server-1.example.com,,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure 2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because: Resolution: requires wal copy and replay from source to destination, plus manual znode cleanup Mitigation: releases including HBASE-18027 would prevent this situation
  • 14. © Cloudera, Inc. All rights reserved. Case story - Client scan failing on specific regions Type: HFile Corruption Feature: Store File Reason: Unknown Diagnosing: Following errors when scanning specific regions java.lang.InternalError: Could not decompress data. Input is invalid. at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method) at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239) org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163) Caused by: java.lang.NegativeArraySizeException at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1718) at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542) at Or Resolution: Requires sideline of affecting files and re-ingestion of row keys stored on this file. Potential data loss
  • 15. © Cloudera, Inc. All rights reserved. Case story - HBase is eating HDFS space Type: FileSystem usage Feature: Replication | Snapshot | Compaction | Cleaners Reason: Various • Replication: Slow, faulty or disabled peer, missing tables on remote peer • Snapshot: Too many snapshots being retained • Compaction and Cleaners threads stuck or not running Diagnosing: • Check usage for "archive" and "oldWALS" • Master logs would show if cleaner threads are running • Is replication stuck or lagging? • How about snapshot retention policy? Resolution: • If cleaner threads are not running, restart master • For disabled peers, only enabling it again, or remove it, if no replication is wanted • Too many snapshots would require some cleaning or cold backup • Replication lags reason may vary
  • 16. © Cloudera, Inc. All rights reserved. General best practices • Heap usage monitoring • Keep regions/RS on low hundreds • Consider G1 GC for heaps > 20GB • Data locality • Adjust caching according to workload • Compaction Policy (Consider offline compactions using CompactionTool) • Consider an exclusive Zookeeper for HBase
  • 17. © Cloudera, Inc. All rights reserved. General best practices • Adjust Master initialization timeout accordingly • Consider increase number of "open region" handlers • Define reasonable snapshot retention policy • Caution with experimental/non stable features (snappy/prefixtree) • Define deployment policy for custom applications/CPs/Filters • Define patch/bug fix upgrades schedule
  • 18. © Cloudera, Inc. All rights reserved. Q&A