SlideShare a Scribd company logo
1 of 20
Floating on a Raft
HBase Durability with Apache Ratis
NoSQL Day 2019
Washington, D.C.
Ankit Singhal, Josh Elser
Apache, Apache HBase, HBase, Apache Ratis, Ratis are (registered) trademarks of the Apache Software Foundation.
Distributed Consensus
Problem: How do a collection of computers agree on state in the face of failures?
A = 1
A = 2
A = 1
CC BY-SA 3.0 https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Gnome-computer.svg/1024px-Gnome-computer.svg.png
Distributed Consensus
Goals: Low-latency, high-throughput, fault-tolerant
Algorithms: Paxos, Raft, ZooKeeper Atomic Broadcast (ZAB), Viewstamped
Replication
Variants: Multi-Paxos, Fast Paxos, Byzantine Paxos, MultiRaft
Implementations: Chubby, Apache ZooKeeper, etcd, CockroachDB, Apache
Kudu, Apache Ratis, HashiCorp Raft/Consul, RethinkDB, Akka Raft,
Hazelcast Raft, Neo4j, WANdisco...
Easy to understand, easy to implement.
“New” (2013) -- Diego Ongaro, John Ousterhout
Proven correctness via TLA+
Paxos is “old” (1989), but still hard
Raft
Apache Ratis
Incubating project at the Apache Software Foundation
A library-oriented, Java implementation of Raft (not a service!)
Pluggable pieces:
● Transport (gRPC, Netty, Hadoop RPC)
● State Machine (your code!)
● Raft Log (In-memory, segmented files on disk)
A StateMachine is the abstraction
point for user-code.
Interface to query and modify “state”
Ratis Arithmetic Example:
Maintain variables (e.g. a = 1) and
apply mathematical operations.
Read expr’s: add, subtract, multiply, divide
Write expr’s: assignment
Ratis State Machines
class Arithmetic implements StateMachine {
Map<String,Double> variables;
Message query(Message req) {
Expression exp = parseReadExp(req);
try (ReadLock rlock = getReadlock()) {
return exp.eval(variables);
}
}
Message update(Message req) {
Expression exp = parseWriteExp(req);
try (WriteLock wlock = getWriteLock()) {
return exp.eval(variables);
}
}
}
Ratis LogService
Recipe that provides a facade of a log (append-only, immutable bytes)
Maintain little-to-no state. Storage “provided” by the Raft Log.
interface Reader {
void seek(long offset);
byte[] readMsg();
List<byte[]> readBulk(int numMsgs);
}
interface Writer {
long write(byte[] msg);
List<Long> writeBulk(
List<byte[]> msgs);
}
interface Client {
List<String> list();
Log getLog(String name);
void archive(String name);
void close(String name);
void delete(String name);
}
interface Log {
Reader createReader();
Writer createWriter();
Metadata getMetadata();
void addListener();
}
Ratis LogService Architecture
Log Name
transactions
gps_coordinates
sensors
query_durations
Client
Metadata
Workers
LogService Testing
Docker-compose simplicity: 3 metadata services, >=3 workers
$ mvn package assembly:single && ./build-docker.sh
$ docker-compose up -d
$ ./client-env.sh
Utilities: interactive shell, verification tool
$ ./bin/shell -q <...>
$ ./bin/load-test -q <...>
LogService Testing
Goal: Generate some non-trivial data sizes
Environment:
● Intel i5-5250U
● 16GB of RAM
● Samsung SSD 850 M.2
● Gentoo Linux: Kernel 4.19.27
● Docker 18.09.4
● Write ~50MB per scenario
● Single client program, one log/thread, no batching
● JDK8, 3GB LogWorker heaps (no other tuning)
LogService Testing Results
Logs/Threads Value Size Num Records Duration
1 50 1,100,000 5h+
4 50 275,000 35m
5 100 105,000 13m 30s
5 500 22,000 2m 48s
8 100 66,000 16m 20s
8 500 13,200 2m 30s
4 1000 13,200 1m 40s
Does HBase want this?
Assumption: we can more efficiently run HBase in cloud environments without
HDFS for WALs.
● Running HDFS is expensive, hard
○ Data is “heavy” (10’s mins to 1’s of hours to decommission)
○ Unexpected DataNode failure requires slow re-replication
● More things to monitor -- twice as many JVMs
Ideal Case:
● Scale up HBase by just adding a more RegionServers, then balance
● Scale down by gently (order 1’s of minutes) removing RegionServers
Asynchronous
flushing to generate
HFiles
Write Path
Store
Durability in HBase
Put
Delete
Incr
RegionServer
wal
MemStore
1
2
Region1
Store
MemStore
RegionN
3
3
Store File
Store File
Append
and sync
KVs
Life cycle of WAL
RegionServer
WAL
WALs
zookeeper
Flush
Log Roller
Roll Wal
Flush
Tracking for
Replication
Backup
Cleaner
chore
WALs
Archived
Regionserver Recovery
Identification
- Master(ServerManager) observes when
a region server is deemed dead due to
their ephemeral node being deleted
Splitting
- Reading the WAL and creating
separate files for each region
Re-assignment
- Assigning the regions from dead
server to live regionservers
Fencing
- Fencing for half dead region server
(server which undergoes long GC
pause and comes back after GC
finishes)
- Currently done through renaming
HDFS directory
Replaying
- Reading the WAL recovered edits
produced by WAL splitting and
replaying the edits that were not
flushed
Regionserver Recovery Refactoring
Identification
- No change is required
Splitting
interface WALProvider {
public Map<Region, WAL> split(WAL
wal);
}
Re-assignment
- No change is required
Fencing
interface ServerFence {
public void fence(ServerName server);
}
In case of Ratis, Implementation could be to
close the log to prevent further writes by dead
regionserver.
Replaying
interface WALProvider {
public Reader getRecoveredEditsReader(
Region region );
}
Disclaimer: These Interfaces are for reference only , may change during the actual development
Replication
- Async and Serial Replication rely on reading WALs
- Need a long-term storage for WALs
- Ratis LogService uses local disk
Proposed Solution
- Can we upload Ratis WALs to distributed, cheap storage?
- If we can hold onto WALs indefinitely, we don’t have to rewrite Replication.
Why Ratis for WAL?
Choices are: Apache Kafka, Distributed Log, Apache Ratis, HDFS
● Fully embeddable(No dependency on External System)
● Low Latency
● High throughput
● Enable HBase for cloud deployment
Disclaimer: We are not suggesting Ratis is the only solution, HBase refactoring will be done in such a way that any storage is pluggable
What’s next?
More testing for LogService
● Easy to cause leader-election storms
● Better insight/understanding into internals
A Ratis LogService WalProvider
● Wire up the LogService with the new WAL APIs
References
Ratis LogService
● https://github.com/apache/incubator-ratis/tree/master/ratis-logservice
HBase WAL Refactoring
● https://issues.apache.org/jira/browse/HBASE-20951
● https://issues.apache.org/jira/browse/HBASE-20952
Authors
● ankit,elserj@apache.org

More Related Content

What's hot

Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detailMIJIN AN
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in KafkaJoel Koshy
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?Kai Wähner
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetupiwrigley
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's EvolutionDataWorks Summit
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLScyllaDB
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureKai Wähner
 
Scylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScyllaDB
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 

What's hot (20)

Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
 
Scylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with Raft
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 

Similar to Floating on a RAFT: HBase Durability with Apache Ratis

NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache RatisNoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache RatisAnkit Singhal
 
Cassandra Internals Overview
Cassandra Internals OverviewCassandra Internals Overview
Cassandra Internals Overviewbeobal
 
Open stack in sina
Open stack in sinaOpen stack in sina
Open stack in sinaHui Cheng
 
Encode x NEAR: Technical Overview of NEAR 1
Encode x NEAR: Technical Overview of NEAR 1Encode x NEAR: Technical Overview of NEAR 1
Encode x NEAR: Technical Overview of NEAR 1KlaraOrban
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...DynamicInfraDays
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in ScalaAlex Payne
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Community
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgwzhouyuan
 
Web Server Clustering - OSSCAMP
Web Server Clustering - OSSCAMPWeb Server Clustering - OSSCAMP
Web Server Clustering - OSSCAMPAbhishek Singh
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider projectSteve Loughran
 
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and PerformanceVMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and PerformanceVMworld
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streamingphanleson
 
Loadbalancing In-depth study for scale @ 80K TPS
Loadbalancing In-depth study for scale @ 80K TPS Loadbalancing In-depth study for scale @ 80K TPS
Loadbalancing In-depth study for scale @ 80K TPS Shrey Agarwal
 
Bare Metal to OpenStack with Razor and Chef
Bare Metal to OpenStack with Razor and ChefBare Metal to OpenStack with Razor and Chef
Bare Metal to OpenStack with Razor and ChefMatt Ray
 
Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesMesosphere Inc.
 
Clusters (Distributed computing)
Clusters (Distributed computing)Clusters (Distributed computing)
Clusters (Distributed computing)Sri Prasanna
 
London Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSLondon Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSCeph Community
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 

Similar to Floating on a RAFT: HBase Durability with Apache Ratis (20)

NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache RatisNoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
 
Cassandra Internals Overview
Cassandra Internals OverviewCassandra Internals Overview
Cassandra Internals Overview
 
Open stack in sina
Open stack in sinaOpen stack in sina
Open stack in sina
 
Encode x NEAR: Technical Overview of NEAR 1
Encode x NEAR: Technical Overview of NEAR 1Encode x NEAR: Technical Overview of NEAR 1
Encode x NEAR: Technical Overview of NEAR 1
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
 
Proxy
ProxyProxy
Proxy
 
YARN Services
YARN ServicesYARN Services
YARN Services
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgw
 
Web Server Clustering - OSSCAMP
Web Server Clustering - OSSCAMPWeb Server Clustering - OSSCAMP
Web Server Clustering - OSSCAMP
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and PerformanceVMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
 
Loadbalancing In-depth study for scale @ 80K TPS
Loadbalancing In-depth study for scale @ 80K TPS Loadbalancing In-depth study for scale @ 80K TPS
Loadbalancing In-depth study for scale @ 80K TPS
 
Bare Metal to OpenStack with Razor and Chef
Bare Metal to OpenStack with Razor and ChefBare Metal to OpenStack with Razor and Chef
Bare Metal to OpenStack with Razor and Chef
 
Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data Services
 
Clusters (Distributed computing)
Clusters (Distributed computing)Clusters (Distributed computing)
Clusters (Distributed computing)
 
London Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSLondon Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFS
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 

More from DataWorks Summit

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 

Recently uploaded

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Recently uploaded (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Floating on a RAFT: HBase Durability with Apache Ratis

  • 1. Floating on a Raft HBase Durability with Apache Ratis NoSQL Day 2019 Washington, D.C. Ankit Singhal, Josh Elser Apache, Apache HBase, HBase, Apache Ratis, Ratis are (registered) trademarks of the Apache Software Foundation.
  • 2. Distributed Consensus Problem: How do a collection of computers agree on state in the face of failures? A = 1 A = 2 A = 1 CC BY-SA 3.0 https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Gnome-computer.svg/1024px-Gnome-computer.svg.png
  • 3. Distributed Consensus Goals: Low-latency, high-throughput, fault-tolerant Algorithms: Paxos, Raft, ZooKeeper Atomic Broadcast (ZAB), Viewstamped Replication Variants: Multi-Paxos, Fast Paxos, Byzantine Paxos, MultiRaft Implementations: Chubby, Apache ZooKeeper, etcd, CockroachDB, Apache Kudu, Apache Ratis, HashiCorp Raft/Consul, RethinkDB, Akka Raft, Hazelcast Raft, Neo4j, WANdisco...
  • 4. Easy to understand, easy to implement. “New” (2013) -- Diego Ongaro, John Ousterhout Proven correctness via TLA+ Paxos is “old” (1989), but still hard Raft
  • 5. Apache Ratis Incubating project at the Apache Software Foundation A library-oriented, Java implementation of Raft (not a service!) Pluggable pieces: ● Transport (gRPC, Netty, Hadoop RPC) ● State Machine (your code!) ● Raft Log (In-memory, segmented files on disk)
  • 6. A StateMachine is the abstraction point for user-code. Interface to query and modify “state” Ratis Arithmetic Example: Maintain variables (e.g. a = 1) and apply mathematical operations. Read expr’s: add, subtract, multiply, divide Write expr’s: assignment Ratis State Machines class Arithmetic implements StateMachine { Map<String,Double> variables; Message query(Message req) { Expression exp = parseReadExp(req); try (ReadLock rlock = getReadlock()) { return exp.eval(variables); } } Message update(Message req) { Expression exp = parseWriteExp(req); try (WriteLock wlock = getWriteLock()) { return exp.eval(variables); } } }
  • 7. Ratis LogService Recipe that provides a facade of a log (append-only, immutable bytes) Maintain little-to-no state. Storage “provided” by the Raft Log. interface Reader { void seek(long offset); byte[] readMsg(); List<byte[]> readBulk(int numMsgs); } interface Writer { long write(byte[] msg); List<Long> writeBulk( List<byte[]> msgs); } interface Client { List<String> list(); Log getLog(String name); void archive(String name); void close(String name); void delete(String name); } interface Log { Reader createReader(); Writer createWriter(); Metadata getMetadata(); void addListener(); }
  • 8. Ratis LogService Architecture Log Name transactions gps_coordinates sensors query_durations Client Metadata Workers
  • 9. LogService Testing Docker-compose simplicity: 3 metadata services, >=3 workers $ mvn package assembly:single && ./build-docker.sh $ docker-compose up -d $ ./client-env.sh Utilities: interactive shell, verification tool $ ./bin/shell -q <...> $ ./bin/load-test -q <...>
  • 10. LogService Testing Goal: Generate some non-trivial data sizes Environment: ● Intel i5-5250U ● 16GB of RAM ● Samsung SSD 850 M.2 ● Gentoo Linux: Kernel 4.19.27 ● Docker 18.09.4 ● Write ~50MB per scenario ● Single client program, one log/thread, no batching ● JDK8, 3GB LogWorker heaps (no other tuning)
  • 11. LogService Testing Results Logs/Threads Value Size Num Records Duration 1 50 1,100,000 5h+ 4 50 275,000 35m 5 100 105,000 13m 30s 5 500 22,000 2m 48s 8 100 66,000 16m 20s 8 500 13,200 2m 30s 4 1000 13,200 1m 40s
  • 12. Does HBase want this? Assumption: we can more efficiently run HBase in cloud environments without HDFS for WALs. ● Running HDFS is expensive, hard ○ Data is “heavy” (10’s mins to 1’s of hours to decommission) ○ Unexpected DataNode failure requires slow re-replication ● More things to monitor -- twice as many JVMs Ideal Case: ● Scale up HBase by just adding a more RegionServers, then balance ● Scale down by gently (order 1’s of minutes) removing RegionServers
  • 13. Asynchronous flushing to generate HFiles Write Path Store Durability in HBase Put Delete Incr RegionServer wal MemStore 1 2 Region1 Store MemStore RegionN 3 3 Store File Store File Append and sync KVs
  • 14. Life cycle of WAL RegionServer WAL WALs zookeeper Flush Log Roller Roll Wal Flush Tracking for Replication Backup Cleaner chore WALs Archived
  • 15. Regionserver Recovery Identification - Master(ServerManager) observes when a region server is deemed dead due to their ephemeral node being deleted Splitting - Reading the WAL and creating separate files for each region Re-assignment - Assigning the regions from dead server to live regionservers Fencing - Fencing for half dead region server (server which undergoes long GC pause and comes back after GC finishes) - Currently done through renaming HDFS directory Replaying - Reading the WAL recovered edits produced by WAL splitting and replaying the edits that were not flushed
  • 16. Regionserver Recovery Refactoring Identification - No change is required Splitting interface WALProvider { public Map<Region, WAL> split(WAL wal); } Re-assignment - No change is required Fencing interface ServerFence { public void fence(ServerName server); } In case of Ratis, Implementation could be to close the log to prevent further writes by dead regionserver. Replaying interface WALProvider { public Reader getRecoveredEditsReader( Region region ); } Disclaimer: These Interfaces are for reference only , may change during the actual development
  • 17. Replication - Async and Serial Replication rely on reading WALs - Need a long-term storage for WALs - Ratis LogService uses local disk Proposed Solution - Can we upload Ratis WALs to distributed, cheap storage? - If we can hold onto WALs indefinitely, we don’t have to rewrite Replication.
  • 18. Why Ratis for WAL? Choices are: Apache Kafka, Distributed Log, Apache Ratis, HDFS ● Fully embeddable(No dependency on External System) ● Low Latency ● High throughput ● Enable HBase for cloud deployment Disclaimer: We are not suggesting Ratis is the only solution, HBase refactoring will be done in such a way that any storage is pluggable
  • 19. What’s next? More testing for LogService ● Easy to cause leader-election storms ● Better insight/understanding into internals A Ratis LogService WalProvider ● Wire up the LogService with the new WAL APIs
  • 20. References Ratis LogService ● https://github.com/apache/incubator-ratis/tree/master/ratis-logservice HBase WAL Refactoring ● https://issues.apache.org/jira/browse/HBASE-20951 ● https://issues.apache.org/jira/browse/HBASE-20952 Authors ● ankit,elserj@apache.org