1. HBase Storage Internals, present and future!
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe
2. What is HBase?
• Open source Storage Manager that provides random
read/write on top of HDFS
• Provides Tables with a “Key:Column/Value” interface
• Dynamic columns (qualifiers), no schema needed
• “Fixed” column groups (families)
• table[row:family:column] = value
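The “table[row:family:column] = value” model above can be sketched with plain Java collections; a toy illustration only (the class and method names are made up, not the HBase client API):

```java
import java.util.*;

// Toy model of HBase's data model: table[row][family:qualifier] = value.
// Rows are kept sorted by key, like in HBase; columns are dynamic.
class ToyTable {
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    void put(String row, String family, String qualifier, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(family + ":" + qualifier, value);
    }

    String get(String row, String family, String qualifier) {
        NavigableMap<String, String> cols = rows.get(row);
        return cols == null ? null : cols.get(family + ":" + qualifier);
    }
}
```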
3. HBase ecosystem
• Apache Hadoop HDFS for data durability and
reliability (Write-Ahead Log)
• Apache ZooKeeper for distributed coordination
• Apache Hadoop MapReduce built-in support
for running MapReduce jobs
[Diagram: App and MapReduce on top of HBase, which sits on ZooKeeper and HDFS]
5. Master, Region Servers and Regions
• Region Server
• Server that contains a set of Regions
• Responsible for handling reads and writes
• Region
• The basic unit of scalability in HBase
• Subset of the table’s data
• Contiguous, sorted range of rows stored together
• Master
• Coordinates the HBase Cluster
• Assignment/Balancing of the Regions
• Handles admin operations
• create/delete/modify table, …
[Diagram: Client → ZooKeeper/Master → Region Servers (each hosting several Regions) → HDFS]
6. Autosharding and .META. table
• A Region is a Subset of the table’s data
• When there is too much data in a Region…
• a split is triggered, creating 2 regions
• The association “Region -> Server” is stored in a System Table
• The location of .META. is stored in ZooKeeper
Table       Start Key   Region ID   Region Server
testTable   Key-00      1           machine01.host
testTable   Key-31      2           machine03.host
testTable   Key-65      3           machine02.host
testTable   Key-83      4           machine01.host
…           …           …           …
users       Key-AB      1           machine03.host
users       Key-KG      2           machine02.host

[Diagram: machine01 hosts Regions 1 and 4 of testTable; machine02 hosts Region 3 of testTable and Region 1 of users; machine03 hosts Region 2 of testTable and Region 2 of users]
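The .META. lookup above boils down to a sorted-map search: find the region whose start key is the greatest one not exceeding the row key. A minimal sketch using the slide’s example data (illustrative, not the real client code):

```java
import java.util.*;

// Toy region locator: regions of one table keyed by start key; floorEntry
// returns the region whose start key is the greatest one <= the row key.
class RegionLocator {
    private final NavigableMap<String, String> regions = new TreeMap<>();

    void addRegion(String startKey, String server) { regions.put(startKey, server); }

    // Which server hosts the region covering this row key?
    String locate(String rowKey) {
        Map.Entry<String, String> e = regions.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }
}
```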
7. The Write Path – Create a New Table
• The client asks the Master to create a new Table
• hbase> create ‘myTable’, ‘cf’
• The Master
• Stores the Table information (“schema”)
• Creates Regions based on the key-splits provided
• if no splits are provided, one single region by default
• Assigns the Regions to the Region Servers
• The assignment Region -> Server is written to a system table called “.META.”
[Diagram: Client → createTable() → Master (store table “metadata”, assign the regions, “enable”) → Region Servers]
8. The Write Path – “Inserting” data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server responsible for handling the Key
• The client asks that Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for handling the Key
• The operation is written to a Write-Ahead Log (WAL)
• …and the KeyValues are added to the Store: the “MemStore”
[Diagram: Client → ZooKeeper (“Where is .META.?”) → Region Server (scan .META.) → Region Server (insert KeyValue)]
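The last two steps — WAL first, then MemStore — can be sketched with a toy model (names are illustrative, not HBase internals):

```java
import java.util.*;

// Toy region-server write path: every mutation is appended to the
// write-ahead log first (durability), then applied to the in-memory
// sorted MemStore.
class WriteSketch {
    final List<String> wal = new ArrayList<>();                    // append-only log
    final NavigableMap<String, String> memstore = new TreeMap<>(); // sorted by key

    void put(String key, String value) {
        wal.add("PUT " + key + "=" + value); // log first, so a crash loses nothing
        memstore.put(key, value);            // then the in-memory store
    }
}
```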
9. The Write Path – Append Only to Random R/W
• Files in HDFS are
• Append-Only
• Immutable once closed
• HBase provides Random Writes?
• …not really from a storage point of view
• KeyValues are stored in memory and written to disk on pressure
• Don’t worry, your data is safe in the WAL!
• (The Region Server can recover data from the WAL in case of crash)
• But this allows sorting data by Key before writing it to disk
• Deletes are like Inserts, but with a “remove me” flag
[Diagram: Region Server with Regions, a WAL, and MemStore + Store Files (HFiles); a Store File holds Key0–Key5 sorted by key with their values]
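A sketch of how the MemStore turns random writes into sorted, append-only files: flushing simply walks the sorted map in key order, and a delete is just another entry carrying a tombstone marker (the marker name here is invented for illustration):

```java
import java.util.*;

// Toy MemStore flush: writing out a sorted map yields a file sorted by key,
// so no sorting is needed at read time. Deletes travel as tombstone values.
class FlushSketch {
    static final String TOMBSTONE = "__deleted__"; // illustrative "remove me" flag

    // Walk the sorted map and emit one line per KeyValue; the resulting
    // "file" is immutable once written, like an HDFS file.
    static List<String> flush(NavigableMap<String, String> memstore) {
        List<String> file = new ArrayList<>();
        for (Map.Entry<String, String> e : memstore.entrySet())
            file.add(e.getKey() + "=" + e.getValue());
        return file;
    }
}
```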
10. The Read Path – “reading” data
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server responsible for handling the Key
• The client asks that Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for handling the Key
• The MemStore and the Store Files are scanned to find the key
[Diagram: Client → ZooKeeper (“Where is .META.?”) → Region Server (scan .META.) → Region Server (Get Key)]
11. The Read Path – Append Only to Random R/W
• Each flush creates a new file
• Each file has its KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a Key you need to scan all the files
• …with some optimizations
• Filter files by Start/End Key
• Have a bloom filter on each file
[Diagram: multiple store files, each sorted by key; newer files can contain updated values (e.g. Key0 – value 0.1 alongside an older Key0 – value 0.0) and tombstones (Key5 – [deleted])]
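The multi-file read described above can be sketched as: scan the files from newest to oldest, let the first occurrence of the key win, and treat a tombstone as “not found” (a toy model, not the real scanner):

```java
import java.util.*;

// Toy read across store files: files are ordered newest first; the first
// version of a key found is the current one; a tombstone hides older values.
class ReadSketch {
    static final String TOMBSTONE = "__deleted__"; // illustrative marker

    static String get(String key, List<Map<String, String>> newestFirst) {
        for (Map<String, String> file : newestFirst) {
            String v = file.get(key);
            if (v != null) return TOMBSTONE.equals(v) ? null : v;
        }
        return null; // key not present in any file
    }
}
```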
13. HFile Format
• Only Sequential Writes, just append(key, value)
• Large Sequential Reads are better
• Why group records in blocks?
• Easy to split
• Easy to read
• Easy to cache
• Easy to index (if records are sorted)
• Block Compression (snappy, lz4, gz, …)

Key/Value (record) layout:
• Key Length : int
• Value Length : int
• Key : byte[]
• Value : byte[]

File layout: Header, then Blocks (each holding Record 0 … Record N), then Index 0 … Index N, then Trailer
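The record layout above (key length, value length, key bytes, value bytes, appended sequentially) can be round-tripped in a few lines of plain Java; a simplified sketch, not the actual HFile writer:

```java
import java.io.*;

// Toy version of the HFile record layout from the slide:
// [key length: int][value length: int][key bytes][value bytes], append-only.
class RecordFormat {
    static void append(DataOutputStream out, byte[] key, byte[] value) throws IOException {
        out.writeInt(key.length);
        out.writeInt(value.length);
        out.write(key);
        out.write(value);
    }

    // Read back one record as { key, value }.
    static byte[][] read(DataInputStream in) throws IOException {
        byte[] key = new byte[in.readInt()];
        byte[] value = new byte[in.readInt()];
        in.readFully(key);
        in.readFully(value);
        return new byte[][] { key, value };
    }

    // In-memory round-trip helper for the example below.
    static byte[][] roundTrip(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            append(new DataOutputStream(buf), key, value);
            return read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for in-memory streams
        }
    }
}
```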
14. Data Block Encoding
• “Be aware of the data”
• Block Encoding allows compressing the Key based on what we know
• Keys are sorted… prefixes may be similar in most cases
• One file contains keys from one Family only
• Timestamps are “similar”, we can store the diff
• Type is “put” most of the time…

“On-disk” KeyValue layout:
• Row Length : short
• Row : byte[]
• Family Length : byte
• Family : byte[]
• Qualifier : byte[]
• Timestamp : long
• Type : byte
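To see why sorted keys compress well, here is a toy prefix encoding: each key is stored as the length of the prefix it shares with the previous key plus the remaining suffix (illustrative only; HBase’s real data block encodings are more involved):

```java
// Toy prefix encoding over a sorted run of keys: because adjacent keys share
// long prefixes, storing (shared-prefix length, suffix) shrinks each entry.
class PrefixEncoding {
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Encode each key as "prefixLen|suffix" relative to the previous key.
    static String[] encode(String[] sortedKeys) {
        String[] out = new String[sortedKeys.length];
        String prev = "";
        for (int i = 0; i < sortedKeys.length; i++) {
            int p = commonPrefix(prev, sortedKeys[i]);
            out[i] = p + "|" + sortedKeys[i].substring(p);
            prev = sortedKeys[i];
        }
        return out;
    }
}
```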
16. Compactions
• Reduces the number of files to look into during a scan
• Removes duplicated keys (updated values)
• Removes deleted keys
• Creates a new file by merging the content of two or more files
• Removes the old files
[Diagram: two store files merged into one; Key0 keeps only the newest value 0.1, and the tombstoned Key5 disappears]
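The merge described above can be sketched as: combine the files from oldest to newest so newer versions overwrite older ones, then drop tombstones (a toy model of a major compaction):

```java
import java.util.*;

// Toy compaction: merge several store files (modeled as sorted maps,
// newest first) into one, keeping only the latest version of each key
// and discarding tombstoned keys entirely.
class CompactionSketch {
    static final String TOMBSTONE = "__deleted__"; // illustrative marker

    static NavigableMap<String, String> compact(List<NavigableMap<String, String>> newestFirst) {
        NavigableMap<String, String> merged = new TreeMap<>();
        // apply oldest -> newest so newer entries overwrite older ones
        for (int i = newestFirst.size() - 1; i >= 0; i--)
            merged.putAll(newestFirst.get(i));
        merged.values().removeIf(TOMBSTONE::equals); // deleted keys vanish
        return merged;
    }
}
```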
17. Pluggable Compactions
• Try different algorithms
• Be aware of the data
• Time Series? I guess no updates from the ’80s
• Be aware of the requests
• Compact based on statistics
• which files are hot and which are not
• which keys are hot and which are not
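“Pluggable” here means the file-selection policy sits behind an interface, so different strategies can be swapped in. A hypothetical sketch of one simple policy (the interface and class names are made up, not HBase’s actual API):

```java
import java.util.*;

// A compaction policy decides which store files to merge; here files are
// modeled just by their sizes, to keep the sketch self-contained.
interface CompactionPolicy {
    List<Long> select(List<Long> fileSizes);
}

// One possible strategy: pick the N smallest files. Merging small files is
// cheap and quickly reduces the file count a scan must touch.
class SmallestFirstPolicy implements CompactionPolicy {
    private final int maxFiles;
    SmallestFirstPolicy(int maxFiles) { this.maxFiles = maxFiles; }

    public List<Long> select(List<Long> fileSizes) {
        List<Long> sorted = new ArrayList<>(fileSizes);
        Collections.sort(sorted);
        return sorted.subList(0, Math.min(maxFiles, sorted.size()));
    }
}
```

A statistics-driven policy would implement the same interface but weigh file hotness instead of size.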
18. Snapshots
Zero-Copy Snapshots and Table Clones
19. What Is a Snapshot?
• “a Snapshot is not a copy of the table”
• a Snapshot is a set of metadata information
• The table “schema” (column families and attributes)
• The Regions information (start key, end key, …)
• The list of Store Files
• The list of active WALs
[Diagram: Master and Region Servers coordinate via ZooKeeper; each Region Server hosts Regions, a WAL, and Store Files (HFiles)]
20. How Does Taking a Snapshot Work?
• The Master orchestrates the RSs
• the communication is done via ZooKeeper
• using a “2-phase-commit-like” transaction (prepare/commit)
• Each RS is responsible for taking its “piece” of the snapshot
• For each Region, store the metadata information needed
• (list of Store Files, WALs, region start/end keys, …)
[Diagram: Master and Region Servers coordinate via ZooKeeper; each Region Server hosts Regions, a WAL, and Store Files (HFiles)]
21. Cloning a Table from a Snapshot
• hbase> clone_snapshot ‘snapshotName’, ‘tableName’
• Creates a new table with the data “contained” in the snapshot
• No data copies involved
• HFiles are immutable, and shared between tables and snapshots
• You can insert/update/remove data from the new table
• No repercussions on the snapshot, original tables or other cloned tables
22. Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If one of these files is referenced by a snapshot or a cloned table
• The file is moved to an “archive” directory
• And deleted later, when there are no references to it
24. 0.96 is coming up
• Moving RPC to Protobuf
• Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Remove -ROOT-
• Table Locks
25. 0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
• Compaction policies, based on the data needs
• Managing blocks directly (instead of files)
26. Questions?
Matteo Bertozzi | @Cloudera