HBase Storage Internals, present and future!
    Matteo Bertozzi | @Cloudera
    March 2013 - Hadoop Summit Europe




1
What is HBase?
    • Open source Storage Manager that provides random
      read/write on top of HDFS
    • Provides Tables with a “Key:Column/Value” interface
        • Dynamic columns (qualifiers), no schema needed
        • “Fixed” column groups (families)
        • table[row:family:column] = value




2
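The “table[row:family:column] = value” interface above can be modelled with nested maps. A minimal sketch (illustrative names, not the real HBase client API): families are fixed at table creation, qualifiers are dynamic.

```python
# Conceptual model of HBase's key:column/value interface using nested
# dicts. SketchTable and its methods are hypothetical, for illustration
# only -- this is not the HBase client API.

class SketchTable:
    def __init__(self, families):
        self.families = set(families)   # "fixed" column groups
        self.rows = {}                  # row-key -> {(family, qualifier): value}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError("unknown family: %s" % family)
        # qualifiers are dynamic: no schema needed beyond the family
        self.rows.setdefault(row, {})[(family, qualifier)] = value

    def get(self, row, family, qualifier):
        return self.rows.get(row, {}).get((family, qualifier))

t = SketchTable(families=["cf"])
t.put("row-1", "cf", "name", "matteo")
print(t.get("row-1", "cf", "name"))   # -> matteo
```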
HBase ecosystem
    • Apache Hadoop HDFS for data durability and
      reliability (Write-Ahead Log)
    • Apache ZooKeeper for distributed coordination
    • Apache Hadoop MapReduce built-in support
      for running MapReduce jobs

    [Diagram: App and MR clients on top of ZK and HDFS]
3
How HBase Works
    “View from 10000ft”




4
Master, Region Servers and Regions
    • Region Server
         • Server that contains a set of Regions
         • Responsible for handling reads and writes
    • Region
         • The basic unit of scalability in HBase
         • Subset of the table’s data
         • Contiguous, sorted range of rows stored together
    • Master
         • Coordinates the HBase Cluster
              • Assignment/Balancing of the Regions
         • Handles admin operations
              • create/delete/modify table, …

    [Diagram: Client talks to ZooKeeper and the Master; three Region
     Servers, each hosting a set of Regions, on top of HDFS]
5
Autosharding and .META. table
    •   A Region is a Subset of the table’s data
    •   When there is too much data in a Region…
           • a split is triggered, creating 2 regions
    •   The association “Region -> Server” is stored in a System Table
    •   The Location of .META. Is stored in ZooKeeper
                                             Table      Start Key   Region ID   Region Server      machine01
                                                                                                 Region 1 - testTable
                                            testTable    Key-00        1        machine01.host   Region 4 - testTable

                                            testTable    Key-31        2        machine03.host
                                                                                                   machine02
                                            testTable    Key-65        3        machine02.host
                                                                                                 Region 3 - testTable
                                            testTable    Key-83        4        machine01.host    Region 1 - users

                                               …           …           …              …
                                                                                                   machine03
                                             users       Key-AB        1        machine03.host   Region 2 - testTable

                                             users       Key-KG        2        machine02.host    Region 2 - users




6
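Since each region covers a contiguous, sorted range of rows, resolving a key to its Region Server boils down to a binary search over the sorted region start keys. A sketch under that assumption (the data below is illustrative, not the real .META. schema):

```python
# Sketch of a client-side region lookup: find the rightmost region
# whose start key is <= the row key. The (start_key, server) pairs
# are hypothetical example data.
import bisect

regions = [   # sorted by start key
    ("Key-00", "machine01.host"),
    ("Key-31", "machine03.host"),
    ("Key-65", "machine02.host"),
    ("Key-83", "machine01.host"),
]
start_keys = [start for start, _ in regions]

def locate(row_key):
    # rightmost region whose start key is <= row_key
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return regions[max(idx, 0)][1]

print(locate("Key-42"))   # falls in [Key-31, Key-65) -> machine03.host
```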
The Write Path – Create a New Table
• The client asks the master to create a new Table
    • hbase> create ‘myTable’, ‘cf’
• The Master
    • Stores the Table information (“schema”)
    • Creates Regions based on the key-splits provided
          • if no splits are provided, one single region by default
    • Assigns the Regions to the Region Servers
          • The assignment Region -> Server
            is written to a system table called “.META.”
7
The Write Path – “Inserting” data
•   table.put(row-key:family:column, value)
•   The client asks ZooKeeper for the location of .META.
•   The client scans .META. searching for the Region Server
    responsible for handling the Key
•   The client asks the Region Server to insert/update/delete
    the specified key/value.
•   The Region Server processes the request and dispatches it to
    the Region responsible for handling the Key
       • The operation is written to a Write-Ahead Log (WAL)
       • …and the KeyValues are added to the Store: the “MemStore”
8
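The last two steps, WAL append then MemStore update, can be sketched in a few lines. This is a toy model under stated assumptions (JSON log entries, a plain dict as MemStore); the real WAL format and MemStore are quite different:

```python
# Sketch of the write path inside a Region: 1) append the mutation to
# an append-only log for durability, 2) apply it to the in-memory
# store. SketchRegion and the JSON log format are illustrative.
import json, os, tempfile

class SketchRegion:
    def __init__(self, wal_path):
        self.wal = open(wal_path, "a")   # append-only log
        self.memstore = {}               # key -> value, sorted at flush time

    def put(self, key, value):
        # 1. durability first: append the operation to the WAL
        self.wal.write(json.dumps({"op": "put", "k": key, "v": value}) + "\n")
        self.wal.flush()
        # 2. apply to the MemStore; reads see it immediately
        self.memstore[key] = value

path = os.path.join(tempfile.mkdtemp(), "wal.log")
region = SketchRegion(path)
region.put("Key0", "value 0")
region.put("Key1", "value 1")

# replaying the WAL (e.g. after a crash) rebuilds the MemStore
replayed = {}
with open(path) as f:
    for line in f:
        entry = json.loads(line)
        replayed[entry["k"]] = entry["v"]
print(replayed == region.memstore)   # -> True
```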
The Write Path – Append Only to Random R/W
• Files in HDFS are
      • Append-Only
      • Immutable once closed
• HBase provides Random Writes?
      • …not really from a storage point of view
      • KeyValues are stored in memory and written to disk on memory pressure
           • Don’t worry, your data is safe in the WAL!
                •   (The Region Server can recover data from the WAL in case of crash)
           • But this allows sorting the data by Key before writing it to disk
      • Deletes are like Inserts but with a “remove me flag”
9
The Read Path – “reading” data
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server
  responsible for handling the Key
• The client asks the Region Server to get the specified key/value.
• The Region Server processes the request and dispatches it to the
  Region responsible for handling the Key
     • MemStore and Store Files are scanned to find the key
10
The Read Path – Append Only to Random R/W
•    Each flush creates a new file
•    Each file has its KeyValues sorted by key
•    Two or more files can contain the same key
     (updates/deletes)
•    To find a Key you need to scan all the files
        • …with some optimizations
        • Filter files by Start/End Key
        • Having a bloom filter on each file

     [Example: two flushed files, each sorted by key; Key0 and Key5
      appear in both, with the newer file holding an updated value for
      Key0 and a delete marker for Key5]
11
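Because every file is sorted, scanning across them is a k-way merge where the newest version of a key wins and delete markers suppress older versions. A sketch of that merge (`None` stands in for a tombstone; this is a simplification, real HBase versions cells by timestamp):

```python
# Sketch of a scan across several sorted store files: k-way merge with
# newest-wins semantics and tombstone suppression.
import heapq

def scan(files):
    """files: newest first; each is a sorted list of (key, value)."""
    # tag each entry with the file's age so ties sort newest-first
    tagged = ([(k, age, v) for k, v in f] for age, f in enumerate(files))
    result, last_key = [], None
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                  # older version of an already-seen key
        last_key = key
        if value is not None:         # None models a delete tombstone
            result.append((key, value))
    return result

newer = [("Key0", "value 0.1"), ("Key1", "value 1.0"), ("Key5", None)]
older = [("Key0", "value 0.0"), ("Key3", "value 3.0"), ("Key5", "value 5.0")]
print(scan([newer, older]))
# -> [('Key0', 'value 0.1'), ('Key1', 'value 1.0'), ('Key3', 'value 3.0')]
```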
HFile
     HBase Store File Format




12
HFile Format
•    Only Sequential Writes, just append(key, value)
•    Large Sequential Reads are better
•    Why group records in blocks?
        • Easy to split
        • Easy to read
        • Easy to cache
        • Easy to index (if records are sorted)
        • Block Compression (snappy, lz4, gz, …)

     Key/Value (record):
        Key Length : int
        Value Length : int
        Key : byte[]
        Value : byte[]

     [File layout: a sequence of blocks, each with a Header and
      Records 0..N, followed by the block Index entries and a Trailer]
13
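The length-prefixed record layout above is easy to sketch: key length, value length, then the raw bytes. This mimics the idea only, not the exact HFile on-disk encoding:

```python
# Sketch of a length-prefixed key/value record: two big-endian ints
# (key length, value length) followed by the key and value bytes.
import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    return struct.pack(">ii", len(key), len(value)) + key + value

def decode_record(buf: bytes, offset: int = 0):
    klen, vlen = struct.unpack_from(">ii", buf, offset)
    offset += 8
    key = buf[offset:offset + klen]
    value = buf[offset + klen:offset + klen + vlen]
    return key, value, offset + klen + vlen   # offset of the next record

# records append sequentially, which is all an append-only file needs
block = encode_record(b"row1", b"hello") + encode_record(b"row2", b"world")
k, v, nxt = decode_record(block)
print(k, v)   # -> b'row1' b'hello'
```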
Data Block Encoding
•    “Be aware of the data”
•    Block Encoding allows compressing the Key based on what we know
        • Keys are sorted… prefixes may be similar in most cases
        • One file contains keys from one Family only
        • Timestamps are “similar”, so we can store the diff
        • Type is “put” most of the time…

     “on-disk” KeyValue:
        Row Length : short
        Row : byte[]
        Family Length : byte
        Family : byte[]
        Qualifier : byte[]
        Timestamp : long
        Type : byte
14
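The “keys are sorted, so prefixes repeat” observation can be reduced to its simplest form: store each key as (shared-prefix length, suffix). This is only a sketch of the idea behind prefix data block encoding, not HBase's actual encoder:

```python
# Sketch of prefix encoding over sorted keys: each key stores how many
# leading characters it shares with the previous key, plus its suffix.

def prefix_encode(sorted_keys):
    out, prev = [], ""
    for key in sorted_keys:
        common = 0
        while common < min(len(prev), len(key)) and prev[common] == key[common]:
            common += 1
        out.append((common, key[common:]))
        prev = key
    return out

def prefix_decode(encoded):
    keys, prev = [], ""
    for common, suffix in encoded:
        prev = prev[:common] + suffix   # rebuild from the previous key
        keys.append(prev)
    return keys

keys = ["row-0001", "row-0002", "row-0010", "row-1000"]
enc = prefix_encode(keys)
print(enc)   # -> [(0, 'row-0001'), (7, '2'), (6, '10'), (4, '1000')]
```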
Compactions
     Optimize the read-path




15
Compactions
     Reduce the number of files to look into during a scan
                                                                                                Key0 – value 0.1
•
                                                                      Key0 – value 0.0
                                                                      Key2 – value 2.0          Key1 – value 1.0
                                                                      Key3 – value 3.0          Key4– value 4.0
                                                                      Key5 – value 5.0          Key5 – [deleted]

        • Removing duplicated keys (updated values)
                                                                      Key8 – value 8.0          Key6 – value 6.0
                                                                      Key9 – value 9.0          Key7– value 7.0




        • Removing deleted keys
•    Creates a new file by merging the content of two or more files              Key0 – value 0.1
                                                                                 Key1 – value 1.0
                                                                                 Key2 – value 2.0
                                                                                 Key4– value 4.0
        • Remove the old files                                                   Key6 – value 6.0
                                                                                 Key7– value 7.0
                                                                                 Key8– value 8.0
                                                                                 Key9– value 9.0




16
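A major compaction can be sketched as replaying the selected files oldest-to-newest into one map, so newer values overwrite older ones and tombstones can finally be dropped, then writing the result as a single sorted file. A toy version (with `None` as the delete marker, ignoring timestamps and multiple cell versions):

```python
# Sketch of a major compaction: merge the content of several store
# files into one sorted file, keeping only the newest value per key
# and dropping delete markers for good.

def compact(files_oldest_first):
    merged = {}
    for store_file in files_oldest_first:
        for key, value in store_file:
            merged[key] = value                  # newer overwrites older
    # tombstones (None) can be dropped entirely in a major compaction
    return sorted((k, v) for k, v in merged.items() if v is not None)

older = [("Key0", "value 0.0"), ("Key3", "value 3.0"), ("Key5", "value 5.0")]
newer = [("Key0", "value 0.1"), ("Key1", "value 1.0"), ("Key5", None)]
print(compact([older, newer]))
# -> [('Key0', 'value 0.1'), ('Key1', 'value 1.0'), ('Key3', 'value 3.0')]
```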
Pluggable Compactions
•    Try different algorithms
•    Be aware of the data
        • Time Series? I guess no updates from the 80s
•    Be aware of the requests
        • Compact based on statistics
              • which files are hot and which are not
              • which keys are hot and which are not
17
Snapshots
     Zero-Copy Snapshots and Table Clones




18
What Is a Snapshot?
     • “a Snapshot is not a copy of the table”
     • a Snapshot is a set of metadata information
          • The table “schema” (column families and attributes)
          • The Regions information (start key, end key, …)
          • The list of Store Files
          • The list of active WALs
19
How Taking a Snapshot Works
     •   The master orchestrates the RSs
           • the communication is done via ZooKeeper
           • using a “2-phase commit like” transaction (prepare/commit)
     •   Each RS is responsible for taking its “piece” of the snapshot
           • For each Region, it stores the metadata information needed
             (list of Store Files, WALs, region start/end keys, …)
20
Cloning a Table from a Snapshot
     •   hbase> clone_snapshot ‘snapshotName’, ‘tableName’
     •   Creates a new table with the data “contained” in the snapshot
     •   No data copies involved
            • HFiles are immutable, and shared between tables and snapshots
     •   You can insert/update/remove data from the new table
            • No repercussions on the snapshot, the original table or other cloned tables
21
Compactions & Archiving
     •   HFiles are immutable, and shared between tables and snapshots

     •   On compaction or table deletion, files are removed from disk
     •   If one of these files is referenced by a snapshot or a cloned table
             • The file is moved to an “archive” directory
             • And deleted later, when there are no more references to it




22
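The archiving rule above is essentially reference counting on immutable files. A sketch, with a fake in-memory filesystem and an illustrative archive path (the real layout and reference tracking differ):

```python
# Sketch of the archiving rule: a store file that is still referenced
# by a snapshot or clone is moved to an "archive" area instead of being
# deleted. FakeFS and the paths are hypothetical.

class FakeFS:
    def __init__(self, files):
        self.files = set(files)
    def rename(self, src, dst):
        self.files.remove(src)
        self.files.add(dst)
    def delete(self, path):
        self.files.remove(path)

def remove_store_file(path, referrers, fs):
    """referrers: set of snapshots/clones still pointing at this file."""
    if referrers:
        # still referenced: park it in the archive directory for now
        fs.rename(path, "/hbase/.archive" + path)
    else:
        fs.delete(path)

fs = FakeFS({"/hbase/t1/f1", "/hbase/t1/f2"})
remove_store_file("/hbase/t1/f1", {"snapshot-A"}, fs)   # archived
remove_store_file("/hbase/t1/f2", set(), fs)            # deleted
print(sorted(fs.files))   # -> ['/hbase/.archive/hbase/t1/f1']
```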
Future
     What can be improved?




23
0.96 is coming up
     •   Moving RPC to Protobuf
           •   Allows rolling upgrades with no surprises
     • HBase Snapshots
     • Pluggable Compactions
     • Remove -ROOT-
     • Table Locks




24
0.98 and Beyond
     • Transparent Table/Column-Family Encryption
     • Cell-level security
     • Multiple WALs per Region Server (MTTR)
     • Data Placement Awareness (MTTR)
     • Data Type Awareness
     • Compaction policies, based on the data needs
     • Managing blocks directly (instead of files)


25
     Questions?
     Matteo Bertozzi | @Cloudera
26
27

More Related Content

What's hot

Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
Cloudera, Inc.
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Exadata and the Oracle Optimizer: The Untold Story
Exadata and the Oracle Optimizer: The Untold StoryExadata and the Oracle Optimizer: The Untold Story
Exadata and the Oracle Optimizer: The Untold Story
Enkitec
 

What's hot (20)

Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
Kudu Deep-Dive
Kudu Deep-DiveKudu Deep-Dive
Kudu Deep-Dive
 
MyDUMPER : Faster logical backups and restores
MyDUMPER : Faster logical backups and restores MyDUMPER : Faster logical backups and restores
MyDUMPER : Faster logical backups and restores
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and future
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStore
 
Trino at linkedIn - 2021
Trino at linkedIn - 2021Trino at linkedIn - 2021
Trino at linkedIn - 2021
 
Exadata and the Oracle Optimizer: The Untold Story
Exadata and the Oracle Optimizer: The Untold StoryExadata and the Oracle Optimizer: The Untold Story
Exadata and the Oracle Optimizer: The Untold Story
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
 

Similar to HBase internals

Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
Yiwei Ma
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
yongboy
 
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
Bhaskar Naik
 
Omid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBaseOmid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBase
DataWorks Summit
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
DataWorks Summit
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
James Chen
 

Similar to HBase internals (20)

Realtime Apache Hadoop at Facebook
Realtime Apache Hadoop at FacebookRealtime Apache Hadoop at Facebook
Realtime Apache Hadoop at Facebook
 
Introduction To Maxtable
Introduction To MaxtableIntroduction To Maxtable
Introduction To Maxtable
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth WiesmanWebinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
 
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
 
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
Weblogicserveroverviewtopologyconfigurationadministration 1227546826890714-9
 
Clustering
Clustering Clustering
Clustering
 
Omid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBaseOmid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBase
 
1 Introduction at CloudStack Developer Day
1 Introduction at CloudStack Developer Day 1 Introduction at CloudStack Developer Day
1 Introduction at CloudStack Developer Day
 
CloudStack technical overview
CloudStack technical overviewCloudStack technical overview
CloudStack technical overview
 
Apache Hadoop YARN State of the Union
Apache Hadoop YARN State of the UnionApache Hadoop YARN State of the Union
Apache Hadoop YARN State of the Union
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 
CloudStack Architecture Future
CloudStack Architecture FutureCloudStack Architecture Future
CloudStack Architecture Future
 
CloudStack Performance Testing
CloudStack Performance TestingCloudStack Performance Testing
CloudStack Performance Testing
 

HBase internals

  • 1. HBase Storage Internals, present and future! Matteo Bertozzi | @Cloudera Speaker Name or Subhead Goes Here March 2013 - Hadoop Summit Europe 1
  • 2. What is HBase? • Open source Storage Manager that provides random read/write on top of HDFS • Provides Tables with a “Key:Column/Value” interface • Dynamic columns (qualifiers), no schema needed • “Fixed” column groups (families) • table[row:family:column] = value 2
  • 3. HBase ecosystem • Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log) • Apache ZooKeeper for distributed coordination App MR • Apache Hadoop MapReduce built-in support for running MapReduce jobs ZK HDFS 3
  • 4. How HBase Works “View from 10000ft” 4
  • 5. Master, Region Servers and Regions Client • Region Server • Server that contains a set of Regions • Responsible to handle reads and writes ZooKeeper • Region Master • The basic unit of scalability in HBase • Subset of the table’s data • Contiguous, sorted range of rows stored together. Region Server Region Server Region Server • Master Region Region Region • Coordinates the HBase Cluster Region Region Region • Assignment/Balancing of the Regions Region Region Region • Handles admin operations • create/delete/modify table, … HDFS 5
  • 6. Autosharding and .META. table • A Region is a Subset of the table’s data • When there is too much data in a Region… • a split is triggered, creating 2 regions • The association “Region -> Server” is stored in a System Table • The Location of .META. Is stored in ZooKeeper Table Start Key Region ID Region Server machine01 Region 1 - testTable testTable Key-00 1 machine01.host Region 4 - testTable testTable Key-31 2 machine03.host machine02 testTable Key-65 3 machine02.host Region 3 - testTable testTable Key-83 4 machine01.host Region 1 - users … … … … machine03 users Key-AB 1 machine03.host Region 2 - testTable users Key-KG 2 machine02.host Region 2 - users 6
  • 7. The Write Path – Create a New Table • The client asks to the master to create a new Table Client • hbase> create ‘myTable’, ‘cf’ createTable() • The Master Master • Store the Table information (“schema”) Store Table “Metadata” • Create Regions based on the key-splits provided Assign the Regions • no splits provided, one single region by default “enable” • Assign the Regions to the Region Servers Region Region Region • The assignment Region -> Server Server Server Server Region is written to a system table called “.META.” Region Region Region Region Region 7
The Write Path – “Inserting” data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server responsible for the Key
• The client asks that Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for the Key
    • The operation is written to a Write-Ahead Log (WAL)
    • …and the KeyValues are added to the Store: “MemStore”
8
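The WAL-first ordering above is the heart of the write path: durability comes from the log, freshness from memory. A toy model of a Region applying that rule (class and method names are illustrative, not the HBase implementation):

```python
class MiniRegion:
    """Toy model of the write path: every mutation is appended to the WAL
    first, then applied to the in-memory MemStore."""
    def __init__(self):
        self.wal = []        # durable, append-only log (a plain list here)
        self.memstore = {}   # in-memory store, keyed by row:family:column

    def put(self, key, value):
        self.wal.append(("put", key, value))  # 1. durability first
        self.memstore[key] = value            # 2. then the MemStore

    def recover(self):
        """Rebuild the MemStore by replaying the WAL, as a Region Server
        does after a crash."""
        self.memstore = {}
        for op, key, value in self.wal:
            if op == "put":
                self.memstore[key] = value

region = MiniRegion()
region.put("row1:cf:col", "v1")
region.memstore = {}   # simulate a crash losing all in-memory state
region.recover()
print(region.memstore)  # {'row1:cf:col': 'v1'} -- restored from the WAL
```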
The Write Path – Append Only to Random R/W
• Files in HDFS are
    • Append-Only
    • Immutable once closed
• HBase provides Random Writes?
    • …not really from a storage point of view
    • KeyValues are stored in memory and written to disk on pressure
        • Don’t worry, your data is safe in the WAL!
        • (The Region Server can recover data from the WAL in case of crash)
    • But this allows sorting data by Key before writing it to disk
• Deletes are like Inserts but with a “remove me” flag

(Diagram: a Region Server with its WAL and, per Region, a MemStore plus Store Files (HFiles); a Store File holds Key0 – value 0 through Key5 – value 5, sorted by key.)
9
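The flush step that turns buffered random writes into one sequential write can be sketched as: sort the MemStore by key and emit it as an immutable file. A minimal illustration (the `None` tombstone convention is an assumption of this sketch, standing in for the “remove me” flag):

```python
def flush(memstore):
    """Sketch of a MemStore flush: the in-memory KeyValues are written out
    as an immutable list sorted by key, so random writes in memory become
    one sequential, sorted write on disk."""
    return sorted(memstore.items())

# None marks a delete tombstone (the "remove me" flag) in this sketch
mem = {"Key3": "value 3", "Key0": "value 0", "Key5": None}
print(flush(mem))
# [('Key0', 'value 0'), ('Key3', 'value 3'), ('Key5', None)]
```

Note that the delete is written like any other entry; it only takes effect at read and compaction time.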
The Read Path – “reading” data
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server responsible for the Key
• The client asks that Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for the Key
    • MemStore and Store Files are scanned to find the key
10
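The last step, scanning the MemStore and then the Store Files, can be modeled simply: the MemStore holds the freshest data, and store files are consulted from newest to oldest. A hypothetical sketch (not the HBase implementation, which merges all sources through scanners):

```python
def get(key, memstore, store_files):
    """Sketch of a point read: check the MemStore first, then the store
    files from newest to oldest. Files are ordered oldest-first in the
    list, each file being a list of (key, value) pairs."""
    if key in memstore:
        return memstore[key]
    for store_file in reversed(store_files):  # newest file last in the list
        data = dict(store_file)
        if key in data:
            return data[key]
    return None

files = [[("Key0", "value 0.0")], [("Key0", "value 0.1")]]
print(get("Key0", {}, files))               # 'value 0.1' -- newest version wins
print(get("Key0", {"Key0": "mem"}, files))  # 'mem' -- MemStore first
```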
The Read Path – Append Only to Random R/W
• Each flush creates a new file
• Each file has its KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a Key you need to scan all the files
    • …with some optimizations
    • Filter files by Start/End Key
    • Having a bloom filter on each file

(Diagram: three store files; Key0 and Key5 appear in more than one file, and the newest file marks Key5 as [deleted].)
11
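The start/end-key optimization can be sketched as pruning any file whose key range cannot contain the requested key (the file metadata shape here is an assumption of the sketch; a per-file bloom filter would prune further and is omitted):

```python
def candidate_files(key, files):
    """Sketch of the read-path file filter: skip any store file whose
    [first_key, last_key] range cannot contain the requested key."""
    return [f for f in files if f["first_key"] <= key <= f["last_key"]]

files = [
    {"name": "hfile-1", "first_key": "Key0", "last_key": "Key5"},
    {"name": "hfile-2", "first_key": "Key6", "last_key": "Key9"},
]
print([f["name"] for f in candidate_files("Key7", files)])  # ['hfile-2']
```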
HFile
HBase Store File Format
12
HFile Format
• Only Sequential Writes, just append(key, value)
• Large Sequential Reads are better
• Why group records in blocks?
    • Easy to split
    • Easy to read
    • Easy to cache
    • Easy to index (if records are sorted)
    • Block Compression (snappy, lz4, gz, …)

File layout: Header, Blocks (Record 0 … Record N), Index 0 … Index N, Trailer
Key/Value (record) layout:
    Key Length : int
    Value Length : int
    Key : byte[]
    Value : byte[]
13
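The length-prefixed record layout above makes append and sequential scan trivial. A simplified sketch of serializing and reading back such records (big-endian ints are an assumption of this sketch; the real HFile format adds block headers, checksums and indexes):

```python
import struct

def write_record(buf, key, value):
    """Append one record: key length (int), value length (int), then the
    key and value bytes."""
    return buf + struct.pack(">ii", len(key), len(value)) + key + value

def read_record(buf, offset):
    """Read one record at offset; return (key, value, next_offset)."""
    klen, vlen = struct.unpack_from(">ii", buf, offset)
    offset += 8
    key = buf[offset:offset + klen]
    value = buf[offset + klen:offset + klen + vlen]
    return key, value, offset + klen + vlen

block = b""
block = write_record(block, b"row1", b"value1")
block = write_record(block, b"row2", b"value2")
key, value, next_off = read_record(block, 0)
print(key, value)  # b'row1' b'value1'
```

Because every record is self-describing, a block can be split, scanned or cached without any external schema.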
Data Block Encoding
• “Be aware of the data”
• Block Encoding allows compressing the Key based on what we know
    • Keys are sorted… prefixes may be similar in most cases
    • One file contains keys from one Family only
    • Timestamps are “similar”, we can store the diff
    • Type is “put” most of the time…

“On-disk” KeyValue layout:
    Row Length : short
    Row : byte[]
    Family Length : byte
    Family : byte[]
    Qualifier : byte[]
    Timestamp : long
    Type : byte
14
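The “sorted keys share prefixes” observation is what prefix-style block encoding exploits: store only the length of the prefix shared with the previous key, plus the differing suffix. A simplified sketch of the idea (operating on strings rather than the full on-disk KeyValue layout):

```python
def prefix_encode(keys):
    """Encode sorted keys as (shared-prefix length, suffix) pairs relative
    to the previous key."""
    out, prev = [], ""
    for key in keys:
        common = 0
        while common < min(len(prev), len(key)) and prev[common] == key[common]:
            common += 1
        out.append((common, key[common:]))
        prev = key
    return out

def prefix_decode(encoded):
    """Reverse the encoding by rebuilding each key from the previous one."""
    keys, prev = [], ""
    for common, suffix in encoded:
        key = prev[:common] + suffix
        keys.append(key)
        prev = key
    return keys

keys = ["row-0001:cf:a", "row-0001:cf:b", "row-0002:cf:a"]
encoded = prefix_encode(keys)
print(encoded)  # long shared prefixes are stored only once
assert prefix_decode(encoded) == keys
```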
Compactions
Optimize the read-path
15
Compactions
• Reduce the number of files to look into during a scan
• Remove duplicated keys (updated values)
• Remove deleted keys
• Create a new file by merging the content of two or more files
• Remove the old files

(Diagram: two files containing duplicate versions of Key0 and a deleted Key5 are merged into a single file holding only the live, latest values, Key0 – value 0.1 through Key9 – value 9.0.)
16
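The merge described above can be sketched in a few lines: later (newer) files win for duplicated keys, and delete markers drop the key entirely. A toy model of a major compaction (the `None` tombstone convention is an assumption of this sketch):

```python
def compact(files):
    """Sketch of a (major) compaction: merge sorted files, keep only the
    newest version of each key, and drop delete tombstones (None values).
    Files are ordered oldest to newest."""
    merged = {}
    for store_file in files:  # newer files overwrite older entries
        merged.update(dict(store_file))
    return sorted((k, v) for k, v in merged.items() if v is not None)

old = [("Key0", "value 0.0"), ("Key1", "value 1.0"), ("Key5", "value 5.0")]
new = [("Key0", "value 0.1"), ("Key5", None)]  # Key5 deleted
print(compact([old, new]))
# [('Key0', 'value 0.1'), ('Key1', 'value 1.0')]
```

After the merged file is written, the input files can be removed, shrinking the set of files every read has to consult.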
Pluggable Compactions
• Try different algorithms
• Be aware of the data
    • Time Series? I guess no updates from the 80s
• Be aware of the requests
    • Compact based on statistics
        • which files are hot and which are not
        • which keys are hot and which are not
17
Snapshots
Zero-Copy Snapshots and Table Clones
18
What Is a Snapshot?
• “a Snapshot is not a copy of the table”
• a Snapshot is a set of metadata information
    • The table “schema” (column families and attributes)
    • The Regions information (start key, end key, …)
    • The list of Store Files
    • The list of active WALs

(Diagram: the Master and ZooKeeper coordinating two Region Servers, each with its Regions, WAL and Store Files (HFiles).)
19
How Taking a Snapshot Works
• The master orchestrates the RSs
    • the communication is done via ZooKeeper
    • using a “2-phase commit like” transaction (prepare/commit)
• Each RS is responsible for taking its “piece” of the snapshot
    • For each Region it stores the metadata information needed
        • (list of Store Files, WALs, region start/end keys, …)
20
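The “2-phase commit like” flow can be sketched as: the master asks every Region Server to prepare its piece, and commits only if all of them succeed. A toy model (class and method names are illustrative; the real coordination goes through ZooKeeper znodes, elided here):

```python
class RS:
    """Toy Region Server that can prepare/commit/abort its snapshot piece."""
    def __init__(self, ok=True):
        self.ok, self.state = ok, "idle"
    def prepare_snapshot(self):
        self.state = "prepared" if self.ok else "failed"
        return self.ok
    def commit_snapshot(self):
        self.state = "committed"
    def abort_snapshot(self):
        self.state = "aborted"

def take_snapshot(region_servers):
    """Sketch of the master's 2-phase flow: prepare on every RS, then
    commit only if all prepared, otherwise abort everywhere."""
    prepared = [rs.prepare_snapshot() for rs in region_servers]
    if all(prepared):
        for rs in region_servers:
            rs.commit_snapshot()
        return "committed"
    for rs in region_servers:
        rs.abort_snapshot()
    return "aborted"

print(take_snapshot([RS(), RS()]))          # committed
print(take_snapshot([RS(), RS(ok=False)]))  # aborted
```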
Cloning a Table from a Snapshot
• hbase> clone_snapshot ‘snapshotName’, ‘tableName’
• Creates a new table with the data “contained” in the snapshot
    • No data copies involved
    • HFiles are immutable, and shared between tables and snapshots
• You can insert/update/remove data from the new table
    • No repercussions on the snapshot, original tables or other cloned tables
21
Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If one of these files is referenced by a snapshot or a cloned table
    • The file is moved to an “archive” directory
    • And deleted later, when there are no more references to it
22
Future
What can be improved?
23
0.96 is coming up
• Moving RPC to Protobuf
    • Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Remove -ROOT-
• Table Locks
24
0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
• Compaction policies, based on the data needs
• Managing blocks directly (instead of files)
25
Questions?
Matteo Bertozzi | @Cloudera
26