Covers numerous internal features, concepts, and implementations of Apache HBase. The focus is operational: investigating each component just enough to understand its role in Apache HBase and the generic problems it is trying to solve. Topics will range from HBase’s RPC system to the new Procedure v2 framework, to filesystem and ZooKeeper use, to backup and replication features, to region assignment and row locks. Each topic will be covered at a high level, attempting to distill the often complicated details down to the most salient information.
Architecture-wise, BigTable as a system is well understood and simple; it has been a decade since the paper was published.
Distributed systems are complex! They are easier to reason about if we consider them as smaller units.
It is important to be able to grep the logs and to know what to look for. DNS is important for ensuring consistent naming across all nodes.
HBase needs a distributed, resilient filesystem (see also Azure tech). Data that is written and sync’ed must be present! It relies on one writer per file (HDFS leases).
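A minimal sketch of that durability contract using the stock Hadoop FileSystem API (the path below is illustrative, not a real HBase WAL path):

    // Illustrative only: the HDFS durability primitive HBase's WAL relies on.
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WalDurabilitySketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // HDFS grants a lease, so only one writer may hold this file at a time.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example-wal"))) {
          out.write("edit-1".getBytes(StandardCharsets.UTF_8));
          // After hsync() returns, the data must survive a crash -- that is
          // the "written and sync'ed must be present" guarantee above.
          out.hsync();
        }
      }
    }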
HBase tables on the filesystem: not just Key-Values (HFiles) but also serialized table metadata.
WAL durability is key here.
/hbase/data = All table data
/hbase/archive = HFiles before deletion
/hbase/WALs = Write-ahead logs
/hbase/oldWALs = WALs before deletion
/hbase/corrupt = Corrupt WALs
.regioninfo = metadata about this region
.tmp = general temporary space (compactions)
recovered.edits = artifact of WAL recovery
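To see the layout above on a live cluster, one can walk the root directory with the Hadoop FileSystem API (a hedged sketch; /hbase is the default hbase.rootdir, but it is configurable):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHBaseRoot {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Expect children like data/, archive/, WALs/, oldWALs/, corrupt/
        for (FileStatus status : fs.listStatus(new Path("/hbase"))) {
          System.out.println(status.getPath().getName());
        }
      }
    }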
Compactions == fewer files, more efficient lookups
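For illustration, a major compaction can also be requested by hand through the client Admin API, rewriting many small HFiles into fewer, larger ones (the table name t1 is an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CompactExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          admin.majorCompact(TableName.valueOf("t1"));  // asynchronous request
        }
      }
    }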
“What happens when meta is unassigned?”
ZooKeeper provides authentication and authorization as well (for HBase: either no authentication, or Kerberos authentication via SASL).
ACLs are used to prevent users from changing sensitive data in ZK; only the HBase nodes themselves can change it.
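As a hedged sketch, the effective ACLs on a znode can be inspected with the plain ZooKeeper Java client (the connect string and znode path are illustrative; adjust for your cluster):

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.ACL;
    import org.apache.zookeeper.data.Stat;

    public class ZkAclCheck {
      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        // On a secure cluster, expect SASL-scheme ACLs granting full rights
        // only to the HBase principals.
        List<ACL> acls = zk.getACL("/hbase/master", new Stat());
        acls.forEach(System.out::println);
        zk.close();
      }
    }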
Resilience is hard. How do we make sure that an operation will succeed if servers fail? How do we distinguish previous failed attempts from users concurrently trying to perform the same operation?
Table creation: a unique name, directories in HDFS, creating the initial region in HDFS, updating meta, enabling the table, etc.
The ProcV2 implementation is tricky/complicated, but it provides an internal API that makes operations easy to implement and reason about going forward. State is easy to inspect.
The model is proven by Accumulo’s FATE; a toy sketch of the state-machine idea follows.
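This is NOT the real ProcV2 API, only an illustration of the core idea: every step of (e.g.) table creation is an explicit state, and the current state is persisted before it executes, so a restarted master can resume from the last durable step:

    public class CreateTableSketch {
      enum State { WRITE_FS_LAYOUT, ADD_TO_META, ASSIGN_REGIONS, ENABLE_TABLE, DONE }

      private State state = State.WRITE_FS_LAYOUT;

      // The real framework serializes state to a procedure store (WAL);
      // this stand-in only marks where that durable write would happen.
      private void persist(State s) { /* write s durably before acting on it */ }

      public void run() {
        while (state != State.DONE) {
          persist(state);
          switch (state) {
            case WRITE_FS_LAYOUT: /* create table dirs, .regioninfo */  state = State.ADD_TO_META; break;
            case ADD_TO_META:     /* insert region rows in hbase:meta */ state = State.ASSIGN_REGIONS; break;
            case ASSIGN_REGIONS:  /* hand regions to RegionServers */    state = State.ENABLE_TABLE; break;
            case ENABLE_TABLE:    /* flip table state to ENABLED */      state = State.DONE; break;
            default: break;
          }
        }
      }
    }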
Lots of knobs, because we want to be able to optimize for throughput, latency, and fairness, which are often at odds with one another.
The Listener accepts the socket and dispatches to Readers. A Reader reads bytes off the wire (its Selector channel), deserializes the request, and sends it to the Scheduler, which places it on a call queue that a Handler will eventually drain.
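A much-simplified sketch of that reader/queue/handler split (illustrative only; the real RpcServer is considerably more involved):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class RpcPipelineSketch {
      // The "call queue" the Scheduler feeds; its size is one of the many knobs.
      private final BlockingQueue<Runnable> callQueue = new ArrayBlockingQueue<>(100);

      // A Reader would deserialize bytes from its Selector channel into a call,
      // then hand it to the Scheduler, which enqueues it:
      void schedule(Runnable call) throws InterruptedException {
        callQueue.put(call);  // blocks when the queue is full
      }

      // Handlers are a fixed pool of threads draining the queue:
      void startHandlers(int numHandlers) {
        for (int i = 0; i < numHandlers; i++) {
          Thread handler = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
              try {
                callQueue.take().run();  // execute the deserialized request
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
            }
          });
          handler.setDaemon(true);
          handler.start();
        }
      }
    }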
Aka “you dun goofed up”
CopyTable – slow, and requires both the source and destination clusters to be up. Not really desirable.
Snapshots – great for one-offs, though they can grow DFS usage. Requires coordinating a flush for a full backup (see the snapshot sketch after this list).
B&R (Backup & Restore) – snapshots plus the ability to track WALs for incremental backups since the last full backup.
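As a concrete example of the snapshot option, a hedged sketch using the client Admin API (table and snapshot names are assumptions for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class SnapshotExample {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          TableName table = TableName.valueOf("t1");
          admin.snapshot("t1-backup", table);  // one-off, point-in-time copy
          // Restoring to a new table leaves the original untouched:
          admin.cloneSnapshot("t1-backup", TableName.valueOf("t1_restored"));
        }
      }
    }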