Christopher Shain
Software Development Lead
Tresata
   Hadoop for Financial Services
   First completely Hadoop-powered analytics
    application
   Widely recognized as “Big Data Startup to
    Watch”
   Winner of the 2011 NCTA Award for Emerging
    Tech Company of the Year

   Based in Charlotte NC
   We are hiring! curious@tresata.com
   Software Development Lead at Tresata
   Background in Financial Services IT
     End-User Applications
     Data Warehousing
     ETL
   Email: chris@tresata.com
   Twitter: @chrisshain
   What is HBase?
     From: http://hbase.apache.org:
      ▪ “HBase is the Hadoop database.”
     I think this is a confusing statement
   ‘Database’, to many, means:
     Transactions
     Joins
     Indexes


   HBase has none of these
     More on this later
   HBase is a data storage platform designed to
    work hand-in-hand with Hadoop
       Distributed
       Failure-tolerant
       Semi-structured
       Low latency
       Strictly consistent
       HDFS-Aware
       “NoSQL”
   Need for a low-latency, distributed datastore
    with unlimited horizontal scale

   Hadoop (MapReduce) doesn’t provide low latency

   Traditional RDBMS don’t scale out
    horizontally
   November 2006: Google BigTable whitepaper
    published:
    http://research.google.com/archive/bigtable.html
 February 2007: Initial HBase Prototype
 October 2007: First ‘usable’ HBase
 January 2008: HBase becomes Apache
  subproject of Hadoop
 March 2009: HBase 0.20.0
 May 10th, 2010: HBase becomes Apache Top
  Level Project
   Web Indexing

   Social Graph

   Messaging (Email etc.)
   HBase is written almost entirely in Java
     JVM clients are first-class citizens
   [Diagram: JVM Clients; Non-JVM Clients via Proxy (Thrift or REST); HBase Master; four RegionServers]
   All data is stored in Tables
   Table rows have exactly one Key, and all rows in a
    table are physically ordered by key
   Tables have a fixed number of Column Families
    (more on this later!)
   Each row can have many Columns in each column
    family
   Each column has a set of values, each with a
    timestamp
   Each row:family:column:timestamp combination
    represents coordinates for a Cell
   Column Families are defined by the Table
   A Column Family is a group of related
    columns with its own name
   All columns must be in a column family
   Each row can have a completely different set
    of columns for a column family
Row:     Column Family:   Columns:
Chris    Friends          Friends:Bob
Bob      Friends          Friends:Chris, Friends:James
James    Friends          Friends:Bob
      HBase rows are not exactly the same as rows in a traditional
        RDBMS
         Key: a byte array (usually a UTF-8 String)
         Data: Cells, qualified by column family, column, and
          timestamp (not shown here)
Row Key:   Column Families:          Columns:                  Cells:
           (Defined by the Table)    (Defined by the Row;      (Created with Columns)
                                      may vary between rows)
Chris      Attributes                Attributes:Age            30
                                     Attributes:Height         68
           Friends                   Friends:Bob               1 (Bob’s a cool guy)
                                     Friends:Jane              0 (Jane and I don’t get along)
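
The row / column family / column / timestamp coordinates can be sketched with a small in-memory model. This is only a toy illustration in Python, not the HBase client API; the class name `ToyTable`, its methods, and the Age value of 29 at timestamp 1 are all made up for the example.

```python
class ToyTable:
    """Toy model of a table: cells addressed by row:family:column:timestamp."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed by the table
        self.cells = {}                 # (row, family, column, timestamp) -> value

    def put(self, row, family, column, timestamp, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.cells[(row, family, column, timestamp)] = value

    def get(self, row, family, column):
        """Return the value with the newest timestamp, or None if absent."""
        versions = [(ts, v) for (r, f, c, ts), v in self.cells.items()
                    if (r, f, c) == (row, family, column)]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def rows(self):
        """Row keys in sorted order, mirroring the physical ordering by key."""
        return sorted({r for (r, _, _, _) in self.cells})


table = ToyTable(families=["Attributes", "Friends"])
table.put("Chris", "Attributes", "Age", 1, 29)   # older version (made-up value)
table.put("Chris", "Attributes", "Age", 2, 30)   # newer version of the same cell
table.put("Chris", "Friends", "Bob", 1, 1)
table.put("Bob", "Friends", "Chris", 1, 1)       # a different row, different columns

print(table.rows())
print(table.get("Chris", "Attributes", "Age"))
```

Columns are created simply by writing to them, while writing to an undeclared column family fails, which is the distinction the slide draws between "defined by the Table" and "defined by the Row".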
   All cells are created with a timestamp
   Column family defines how many versions of
    a cell to keep
   Updates always create a new cell
   Deletes create a tombstone (more on that
    later)
   Queries can include an “as-of” timestamp to
    return point-in-time values
    HBase deletes are a form of write called a
     “tombstone”
    Indicates that “beyond this point any previously
     written value is dead”
    Old values can still be read using point-in-time
     queries
    Timestamp   Write Type         Resulting Value   Point-In-Time Value “as of” T+1
    T+0         PUT (“Foo”)        “Foo”             “Foo”
    T+1         PUT (“Bar”)        “Bar”             “Bar”
    T+2         DELETE             <none>            “Bar”
    T+3         PUT (“Foo Too”)    “Foo Too”         “Bar”
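
The table above can be replayed with a toy model of a single cell's write history. This is a sketch of the behaviour as the slide describes it, not of how HBase stores or resolves versions internally; `ToyCell` and its methods are hypothetical names.

```python
class ToyCell:
    """Toy model of one cell's write history (not HBase's storage format)."""

    def __init__(self):
        self.writes = []   # list of (timestamp, kind, value)

    def put(self, timestamp, value):
        self.writes.append((timestamp, "PUT", value))

    def delete(self, timestamp):
        # A delete is just another write: a tombstone with no value.
        self.writes.append((timestamp, "DELETE", None))

    def get(self, as_of=None):
        """Value visible at `as_of` (latest if None); None if a tombstone wins."""
        visible = [w for w in self.writes if as_of is None or w[0] <= as_of]
        if not visible:
            return None
        _, kind, value = max(visible, key=lambda w: w[0])
        return value if kind == "PUT" else None


cell = ToyCell()
cell.put(0, "Foo")
cell.put(1, "Bar")
cell.delete(2)
print(cell.get())          # the tombstone hides the earlier values
print(cell.get(as_of=1))   # a point-in-time read still sees the old value
cell.put(3, "Foo Too")
print(cell.get())
```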
   Requirement: Store real-time stock tick data
     Ticker   Timestamp      Sequence    Bid      Ask
      IBM     09:15:03:001      1       179.16   179.18
     MSFT     09:15:04:112      2       28.25    28.27
     GOOG     09:15:04:114      3       624.94   624.99
      IBM     09:15:04:155      4       179.18   179.19

   Requirement: Accommodate many
    simultaneous readers & writers
   Requirement: Allow for reading of current
    price for any ticker at any point in time
Historical Prices:
Keys   Column            DataType
PK     Ticker            Varchar
PK     Timestamp         DateTime
PK     Sequence_Number   Integer
       Bid_Price         Decimal
       Ask_Price         Decimal

Latest Prices:
Keys   Column            DataType
PK     Ticker            Varchar
       Bid_Price         Decimal
       Ask_Price         Decimal
Row Key                                           Family:Column
[Ticker].[Rev_Timestamp].[Rev_Sequence_Number]    Prices:Bid
                                                  Prices:Ask



   HBase throughput should scale roughly linearly with the number
    of nodes

   No need to keep a separate “latest price” table
     Because timestamps are stored in reverse, a scan starting at
      “ticker” returns the latest price row first
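
A minimal sketch of why the reversed row key makes a separate "latest price" table unnecessary. It models the key-ordered table as a sorted Python list; the padding widths, the `MAX_TS` and `MAX_SEQ` bounds, and the function names are assumptions made for illustration (real HBase keys are byte arrays).

```python
import bisect

MAX_TS = 10**13 - 1    # assumed upper bound for a millisecond timestamp
MAX_SEQ = 10**10 - 1   # assumed upper bound for a sequence number


def row_key(ticker, ts_millis, seq):
    """[Ticker].[Rev_Timestamp].[Rev_Sequence_Number], zero-padded so that
    lexicographic order matches numeric order."""
    return f"{ticker}.{MAX_TS - ts_millis:013d}.{MAX_SEQ - seq:010d}"


# Ticks from the requirement table; timestamps are milliseconds since midnight.
ticks = [
    ("IBM", 33303001, 1, 179.16, 179.18),
    ("MSFT", 33304112, 2, 28.25, 28.27),
    ("GOOG", 33304114, 3, 624.94, 624.99),
    ("IBM", 33304155, 4, 179.18, 179.19),
]

# A sorted list of (key, value) pairs stands in for the key-ordered table.
table = sorted((row_key(t, ts, seq), (bid, ask)) for t, ts, seq, bid, ask in ticks)
keys = [k for k, _ in table]


def latest_price(ticker):
    """Scan forward from the ticker prefix; the first row is the newest tick."""
    prefix = ticker + "."
    i = bisect.bisect_left(keys, prefix)
    if i < len(keys) and keys[i].startswith(prefix):
        return table[i][1]
    return None


print(latest_price("IBM"))
```

Because larger timestamps produce smaller reversed values, the newest tick for each ticker sorts first within that ticker's key range.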
   HBase scales horizontally

   Needs to split data over many RegionServers

   Regions are the unit of scale
   All HBase tables are broken into 1 or more
    regions
   Regions have a start row key and an end row
    key
   Each Region lives on exactly one
    RegionServer
   RegionServers may host many Regions
   When RegionServers die, Master detects this
    and assigns Regions to other RegionServers
.META. Table
Table    Region                     Region Server
Users    “Aaron” – “George”         Node01
Users    “George” – “Matthew”       Node02
Users    “Matthew” – “Zachary”      Node01

“Users” Table
Row Keys in Region “Aaron” – “George”:      “Aaron”, “Bob”, “Chris”
Row Keys in Region “George” – “Matthew”:    “George”
Row Keys in Region “Matthew” – “Zachary”:   “Matthew”, “Nancy”, “Zachary”
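
Finding the region for a row key amounts to a search over sorted region start keys. The sketch below uses the "Users" regions from the slide; it assumes start keys are inclusive and end keys exclusive, and it treats the last region as open-ended so that "Zachary" lands in it as the slide shows. The function name `locate` is hypothetical.

```python
import bisect

# (start_key, end_key, region_server), sorted by start key.
# Assumption: start keys are inclusive, end keys exclusive, and the last
# region has no upper bound (None).
regions = [
    ("Aaron", "George", "Node01"),
    ("George", "Matthew", "Node02"),
    ("Matthew", None, "Node01"),
]
start_keys = [start for start, _, _ in regions]


def locate(row_key):
    """Return the RegionServer hosting row_key, or None if no region covers it."""
    i = bisect.bisect_right(start_keys, row_key) - 1
    if i < 0:
        return None
    _, end, server = regions[i]
    if end is None or row_key < end:
        return server
    return None


for name in ("Bob", "George", "Nancy", "Zachary"):
    print(name, locate(name))
```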
Deceptively simple
   [Diagram: ZooKeeper Cluster; HBase Master and Backup HBase Master; JVM Clients; Non-JVM Clients via Proxy (Thrift or REST); four RegionServers]
   ZooKeeper
     Keeps track of which server is the current HBase
     Master

   HBase Master
     Keeps track of Region/RegionServer mapping
     Manages the -ROOT- and .META. tables
     Responsible for updating ZooKeeper when these
     change
   RegionServer
     Stores table regions
   Clients
     Need to be smarter than RDBMS clients
     First connect to ZooKeeper to get RegionServer
      for a given Table/Region
     Then connect directly to RegionServer to interact
      with the data
     All connections over Hadoop RPC – non-JVM
      clients use proxy (Thrift or REST (Stargate))
-ROOT- Table
Row Key: .META.[region]
Columns: info:regioninfo, info:server, info:serverstartcode
Points to the RegionServer hosting the .META. region.

.META. Table
Row Key: [table],[region start key],[region id]
Columns: info:regioninfo, info:server, info:serverstartcode
Points to the RegionServer hosting the table region.

Regular User Table
… whatever …
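
The chain from ZooKeeper through -ROOT- and .META. to a user table region can be sketched as a series of dictionary lookups. This is a simplified illustration of the lookup order only; every host name, the ZooKeeper key, and the region ids below are hypothetical, and a real client would also cache the results.

```python
# All host names and ids below are hypothetical.
zookeeper = {"root-region-server": "rs-a"}

# -ROOT- maps each .META. region to the server hosting it.
root_table = {".META.,,1": {"info:server": "rs-b"}}

# .META. maps [table],[region start key],[region id] to a server.
meta_table = {
    "Users,Aaron,1": {"info:server": "rs-c"},
    "Users,George,2": {"info:server": "rs-d"},
    "Users,Matthew,3": {"info:server": "rs-c"},
}


def find_region_server(table, row_key):
    """Return the servers visited, in order; the last one hosts the row."""
    hops = [zookeeper["root-region-server"]]              # 1. where is -ROOT-?
    hops.append(root_table[".META.,,1"]["info:server"])   # 2. where is .META.?
    # 3. In .META., pick the region with the greatest start key <= row_key.
    candidates = []
    for key, columns in meta_table.items():
        key_table, start, _region_id = key.split(",")
        if key_table == table and start <= row_key:
            candidates.append((start, columns["info:server"]))
    if not candidates:
        raise KeyError(f"no region for {table}/{row_key}")
    hops.append(max(candidates)[1])
    return hops


print(find_region_server("Users", "Nancy"))
```

Once the final server is known, the client talks to it directly, which is why the Master is not on the read/write path.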
   HBase Master is not necessarily a single point of
    failure (SPOF)
     Multiple masters can be running
     Current ‘active’ Master controlled via ZooKeeper
     Make sure you have enough ZooKeeper nodes!

   Master is not needed for client connectivity
     Clients connect directly to ZooKeeper to find Regions
     Everything Master does can be put off until one is
      elected
   [Diagram: ZooKeeper Quorum of three ZooKeeper Nodes; three HBase Masters: one Current, two Standby]
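
Master failover through ZooKeeper can be pictured as a race to claim a single slot that is released when the holder's session ends. The sketch below is a toy model of that idea only; `ToyZooKeeper`, its methods, and the master names are hypothetical and do not correspond to the ZooKeeper or HBase APIs.

```python
class ToyZooKeeper:
    """Toy stand-in for the single entry that names the active master."""

    def __init__(self):
        self.active_master = None

    def try_become_active(self, master):
        # Only the first caller wins; later callers stay on standby.
        if self.active_master is None:
            self.active_master = master
            return True
        return False

    def session_expired(self, master, standbys):
        # When the active master's session ends, its claim disappears and
        # the standbys race to take over.
        if self.active_master == master:
            self.active_master = None
            for standby in standbys:
                if self.try_become_active(standby):
                    break


zk = ToyZooKeeper()
masters = ["master-1", "master-2", "master-3"]   # hypothetical names
results = [zk.try_become_active(m) for m in masters]
print(results, zk.active_master)

zk.session_expired("master-1", standbys=["master-2", "master-3"])
print(zk.active_master)
```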
   HBase tolerates RegionServer failure when
    running on HDFS
     Data is replicated by HDFS (dfs.replication setting)
     Lots of issues around fsync, failure before data is
      flushed - some probably still not fixed
     Thus, data can still be lost if node fails after a write


   HDFS NameNode is still SPOF, even for HBase
    The Write-Ahead Log (WAL) is similar to the log in many RDBMS
    All operations by default written to log before considered
     ‘committed’ (can be overridden for ‘disposable fast writes’)
    Log can be replayed when region is moved to another
     RegionServer
    One WAL per RegionServer
    [Diagram: Writes -> WAL -> HFile, flushed periodically (10s by default); Writes -> MemStore -> HFile, flushed when MemStore gets too big]
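
The role of the WAL can be shown with a toy write path: every write is appended to a log before it is applied in memory, and the log is replayed when the region moves to another server. This is a sketch under simplifying assumptions (a Python list stands in for the log file, and `ToyRegionServer` is a hypothetical name), not HBase's implementation.

```python
class ToyRegionServer:
    """Toy write path: log first, then memory."""

    def __init__(self, wal):
        self.wal = wal        # stand-in for the durable log
        self.memstore = {}    # in-memory only; lost on a crash

    def put(self, row, value, write_to_wal=True):
        if write_to_wal:
            self.wal.append((row, value))   # logged before the write is applied
        self.memstore[row] = value

    @classmethod
    def recover(cls, wal):
        """Rebuild the in-memory state on another server by replaying the log."""
        server = cls(wal)
        for row, value in wal:
            server.memstore[row] = value
        return server


wal = []
rs1 = ToyRegionServer(wal)
rs1.put("row1", "a")
rs1.put("row2", "b")
rs1.put("row3", "c", write_to_wal=False)   # a 'disposable fast write'
del rs1                                    # simulate a crash

rs2 = ToyRegionServer.recover(wal)
print(sorted(rs2.memstore))
```

The write that skipped the log is gone after recovery, which is the trade-off behind "disposable fast writes".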
   [Diagram: a RegionServer holds a Log and a Region; the Region contains three Stores, each with a MemStore and a StoreFile (HFile); HFiles are written through the HDFS Client as Blocks spread across four HDFS DataNodes]
   A RegionServer is not guaranteed to be on
    the same physical node as its data

   Compaction causes RegionServer to write
    preferentially to local node
     But this is a function of HDFS Client, not HBase
   All data is in memory initially (memstore)
   HBase is effectively an append-only system
     Modifications and deletes are just writes with
      later timestamps
     A consequence of HDFS being append-only
   Eventually old writes need to be discarded
   2 Types of Compactions:
     Minor
     Major
   All HBase edits are initially stored in memory
    (memstore)

   Flushes occur when memstore reaches a
    certain size
     By default 67,108,864 bytes
     Controlled by hbase.hregion.memstore.flush.size
     configuration property

   Each flush creates a new HFile
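
A minimal sketch of the flush rule: edits accumulate in memory and are written out as a new, immutable file once a size threshold is crossed. The size accounting here is a crude assumption (key length plus value length), and `ToyStore` is a hypothetical name; only the default threshold of 67,108,864 bytes comes from the slide.

```python
class ToyStore:
    """Toy memstore that flushes to a new 'HFile' when it grows too large."""

    def __init__(self, flush_size=67_108_864):
        self.flush_size = flush_size   # plays the role of hbase.hregion.memstore.flush.size
        self.memstore = {}
        self.memstore_bytes = 0
        self.hfiles = []               # each flush appends one sorted snapshot

    def put(self, key, value):
        self.memstore[key] = value
        self.memstore_bytes += len(key) + len(value)   # crude size estimate
        if self.memstore_bytes >= self.flush_size:
            self.flush()

    def flush(self):
        if self.memstore:
            self.hfiles.append(dict(sorted(self.memstore.items())))
            self.memstore = {}
            self.memstore_bytes = 0


store = ToyStore(flush_size=20)   # tiny threshold so the example flushes quickly
for i in range(6):
    store.put(f"row{i}".encode(), b"value")

print(len(store.hfiles), len(store.memstore))
```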
   Minor compactions are triggered when a certain number of HFiles are created for
    a given Region Store (+ some other conditions)
     By default 3 HFiles
     Controlled by hbase.hstore.compactionThreshold configuration
      property

   Compacts most recent HFiles into one
     By default, uses RegionServer-local HDFS node

   Does not eliminate deletes
     Only touches most recent HFiles

   NOTE: All column families are compacted at once (this
    might change in the future)
   Major compactions are triggered every 24 hours (with random
    offset) or manually
     Large HBase installations usually disable the automatic trigger and
     run major compactions manually

   Re-writes all HFiles into one

   Processes deletes
     Eliminates tombstones
     Erases earlier entries
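
The difference between the two compaction types can be sketched with HFiles modelled as dictionaries. This toy version keeps only the newest entry per key and ignores versions and selection heuristics; the function names and the `TOMBSTONE` marker are hypothetical. It does show why a minor compaction must keep tombstones: an older file outside the compaction may still hold the value being deleted.

```python
TOMBSTONE = object()   # marker standing in for a delete


def compact(hfiles, major):
    """Merge HFiles (oldest first); each HFile is {key: (timestamp, value)}."""
    merged = {}
    for hfile in hfiles:
        for key, (ts, value) in hfile.items():
            if key not in merged or ts >= merged[key][0]:
                merged[key] = (ts, value)
    if major:
        # Only a major compaction sees every file, so only it may drop deletes.
        merged = {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}
    return dict(sorted(merged.items()))


def minor_compaction(hfiles, threshold=3):
    """Merge the most recent `threshold` files into one, keeping tombstones."""
    if len(hfiles) < threshold:
        return hfiles
    return hfiles[:-threshold] + [compact(hfiles[-threshold:], major=False)]


def major_compaction(hfiles):
    """Rewrite all files into one, eliminating tombstones and deleted entries."""
    return [compact(hfiles, major=True)]


def read(hfiles, key):
    newest = None
    for hfile in hfiles:
        if key in hfile and (newest is None or hfile[key][0] >= newest[0]):
            newest = hfile[key]
    return None if newest is None or newest[1] is TOMBSTONE else newest[1]


hfiles = [
    {"a": (1, "old")},
    {"b": (2, "x")},
    {"a": (3, TOMBSTONE)},
    {"c": (4, "y")},
]
after_minor = minor_compaction(hfiles)
after_major = major_compaction(hfiles)
print(len(after_minor), len(after_major))
print(read(after_minor, "a"), read(after_major, "b"))
```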
   HBase does not have transactions
   However:
     Row-level modifications are atomic: All
      modifications to a row will succeed or fail as a unit
     Gets are consistent for a given point in time
      ▪ But Scans may return 2 rows from different points in
        time
     All data read has been ‘durably stored’
      ▪ Does NOT mean flushed to disk; it can still be lost!
   DO: Design your schema for linear range scans on your
    most common queries.
     Scans are the most efficient way to query a lot of rows
      quickly

   DON’T: Use more than 2 or 3 column families.
     Some operations (flushing and compacting) operate
      on all column families of a region at once

   DO: Be aware of the relative cardinality of column
    families
     Wildly differing cardinality leads to sparsity and bad
      scanning results.
   DO: Be mindful of the size of your row and column
    keys
     They are used in indexes and queries, can be quite
      large!

   DON’T: Use monotonically increasing row keys
     Can lead to hotspots on writes

   DO: Store timestamp keys in reverse
     Rows in a table need to be read in order, usually
      you want most recent
   DO: Query single rows using exact-match on key
    (Gets) or Scans for multiple rows
     Scans allow efficient I/O vs. multiple gets

   DON’T: Use regex-based or non-prefix column filters
     Very inefficient

   DO: Tune the scan cache and batch size parameters
     Drastically improves performance when returning lots of
      rows
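
A rough way to see why the scan cache size matters: if the client fetches a fixed number of rows per round trip, the number of round trips falls in proportion as that number grows. The arithmetic below is a simplification that ignores the batch size and the final "no more rows" call; the function name is hypothetical. In the Java client this setting is presumably what `Scan.setCaching` controls.

```python
import math


def scan_round_trips(total_rows, caching):
    """Approximate client/server round trips for a scan that fetches
    `caching` rows per trip."""
    return math.ceil(total_rows / caching)


for caching in (1, 100, 1000):
    print(caching, scan_round_trips(100_000, caching))
```

Larger values trade client memory for fewer round trips, so the right setting depends on row size.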
   Requirement: Store an arbitrary set of
    preferences for all users
   Requirement: Each user may choose to store
    a different set of preferences
   Requirement: Preferences may be of
    different data types (Strings, Integers, etc)
   Requirement: Developers will add new
    preference options all the time, so we
    shouldn’t need to modify the database
    structure when adding them
   One possible RDBMS solution:
     Key/Value table
     All values as strings
     Flexible, but wastes space




Keys:   Column:           Data Type:
PK      UserID            Int
PK      PreferenceName    Varchar
        PreferenceValue   Varchar
   Store all preferences in the Preferences
    column family
   Preference name as column name,
    preference value as (serialized) byte array:
     HBase client library provides methods for
     serializing many common data types
Row Key:   Family:        Column:      Value:
Chris      Preferences    Age          30
                          Hometown     “Mineola, NY”
Joe        Preferences    Birthdate    11/13/1987
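
Storing typed preferences as byte arrays can be sketched as follows. The helpers are hypothetical Python stand-ins, not the HBase client library; they assume integers are encoded as 4-byte big-endian values and strings as UTF-8, which is how the Java client's serialization helpers are generally understood to behave.

```python
import struct


def int_to_bytes(n):
    """Encode an int as 4 bytes, big-endian (assumed encoding)."""
    return struct.pack(">i", n)


def bytes_to_int(b):
    return struct.unpack(">i", b)[0]


def str_to_bytes(s):
    return s.encode("utf-8")


# Each row stores only the preferences that user has chosen to set.
preferences = {
    "Chris": {
        "Preferences:Age": int_to_bytes(30),
        "Preferences:Hometown": str_to_bytes("Mineola, NY"),
    },
    "Joe": {
        "Preferences:Birthdate": str_to_bytes("11/13/1987"),
    },
}

print(preferences["Chris"]["Preferences:Age"])
print(bytes_to_int(preferences["Chris"]["Preferences:Age"]))
print(preferences["Joe"]["Preferences:Birthdate"].decode("utf-8"))
```

The table itself stores only bytes, so the application has to remember which type each preference uses when reading it back.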

TriHUG January 2012 Talk by Chris Shain

  • 2. Hadoop for Financial Services  First completely Hadoop-powered analytics application  Widely recognized as “Big Data Startup to Watch”  Winner of the 2011 NCTA Award for Emerging Tech Company of the Year  Based in Charlotte NC  We are hiring! curious@tresata.com
  • 3. Software Development Lead at Tresata  Background in Financial Services IT  End-User Applications  Data Warehousing  ETL  Email: chris@tresata.com  Twitter: @chrisshain
  • 4. What is HBase?  From: http://hbase.apache.org: ▪ “HBase is the Hadoop database.”  I think this is a confusing statement
  • 5. ‘Database’, to many, means:  Transactions  Joins  Indexes  HBase has none of these  More on this later
  • 6. HBase is a data storage platform designed to work hand-in-hand with Hadoop  Distributed  Failure-tolerant  Semi-structured  Low latency  Strictly consistent  HDFS-Aware  “NoSQL”
  • 7. Need for a low-latency, distributed datastore with unlimited horizontal scale  Hadoop (MapReduce) doesn’t provide low- latency  Traditional RDBMS don’t scale out horizontally
  • 8. November 2006: Google BigTable whitepaper published: http://research.google.com/archive/bigtable.html  February 2007: Initial HBase Prototype  October 2007: First ‘usable’ HBase  January 2008: HBase becomes Apache subproject of Hadoop  March 2009: HBase 0.20.0  May 10th, 2010: HBase becomes Apache Top Level Project
  • 9. Web Indexing  Social Graph  Messaging (Email etc.)
  • 10. HBase is written almost entirely in Java  JVM clients are first-class citizens HBase Master RegionServer JVM Clients RegionServer Non-JVM Proxy RegionServer Clients (Thrift or REST) RegionServer
  • 11. All data is stored in Tables  Table rows have exactly one Key, and all rows in a table are physically ordered by key  Tables have a fixed number of Column Families (more on this later!)  Each row can have many Columns in each column family  Each column has a set of values, each with a timestamp  Each row:family:column:timestamp combination represents coordinates for a Cell
  • 12. Defined by the Table  A Column Family is a group of related columns with it’s own name  All columns must be in a column family  Each row can have a completely different set of columns for a column family Row: Column Family: Columns: Chris Friends:Bob Bob Friends Friends:Chris Friends:James James Friends:Bob
  • 13. Not exactly the same as rows in a traditional RDBMS  Key: a byte array (usually a UTF-8 String)  Data: Cells, qualified by column family, column, and timestamp (not shown here) Row Key: Column Families : Columns: Cells: (Defined by the (Defined by the Row) (Created with Columns) Table) (May vary between rows) Attributes Attributes:Age 30 Attributes:Height 68 Chris Friends Friends:Bob 1 (Bob’s a cool guy) Friends:Jane 0 (Jane and I don’t get along)
  • 14. All cells are created with a timestamp  Column family defines how many versions of a cell to keep  Updates always create a new cell  Deletes create a tombstone (more on that later)  Queries can include an “as-of” timestamp to return point-in-time values
  • 15. HBase deletes are a form of write called a “tombstone”  Indicates that “beyond this point any previously written value is dead”  Old values can still be read using point-in-time queries Timestamp Write Type Resulting Value Point-In-Time Value “as of” T+1 T+0 PUT (“Foo”) “Foo” “Foo” T+1 PUT (“Bar”) “Bar” “Bar” T+2 DELETE <none> “Bar” T+3 PUT (“Foo Too”) “Foo Too” “Bar”
  • 16. Requirement: Store real-time stock tick data Ticker Timestamp Sequence Bid Ask IBM 09:15:03:001 1 179.16 179.18 MSFT 09:15:04:112 2 28.25 28.27 GOOG 09:15:04:114 3 624.94 624.99 IBM 09:15:04:155 4 179.18 179.19  Requirement: Accommodate many simultaneous readers & writers  Requirement: Allow for reading of current price for any ticker at any point in time
  • 17. Historical Prices: Keys Column DataType Ticker Varchar PK Timestamp DateTime Sequence_Number Integer Bid_Price Decimal Ask_Price Decimal Latest Prices: Keys Column DataType PK Ticker Varchar Bid_Price Decimal Ask_Price Decimal
  • 18. Row Key Family:Column Prices:Bid [Ticker].[Rev_Timestamp].[Rev_Sequence_Number] Prices:Ask  HBase throughput will scale linearly with # of nodes  No need to keep separate “latest price” table  A scan starting at “ticker” will always return the latest price row
  • 19. HBase scales horizontally  Needs to split data over many RegionServers  Regions are the unit of scale
  • 20. All HBase tables are broken into 1 or more regions  Regions have a start row key and an end row key  Each Region lives on exactly one RegionServer  RegionServers may host many Regions  When RegionServers die, Master detects this and assigns Regions to other RegionServers
  • 21. “Users” Table
      Region “Aaron” – “George”:      row keys “Aaron”, “Bob”, “Chris”
      Region “George” – “Matthew”:    row key “George”
      Region “Matthew” – “Zachary”:   row keys “Matthew”, “Nancy”, “Zachary”
    -META- Table
      Table   Region                  Server
      Users   “Aaron” – “George”      Node01
      Users   “George” – “Matthew”    Node02
      Users   “Matthew” – “Zachary”   Node01
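The lookup this slide implies (find the region whose key range covers a row key, then the server hosting it) can be sketched with a binary search over region start keys. The `REGIONS` list and `locate` function are hypothetical names for this illustration. The sketch also assumes a key belongs to the region with the greatest start key not exceeding it, so the last region is treated as open-ended, which matches “Zachary” appearing in the last region on the slide.

```python
import bisect

# (start_key, end_key, server), sorted by start key, as in the -META- table above.
REGIONS = [
    ("Aaron", "George", "Node01"),
    ("George", "Matthew", "Node02"),
    ("Matthew", "Zachary", "Node01"),
]
START_KEYS = [region[0] for region in REGIONS]


def locate(row_key: str):
    """Return the (start, end, server) tuple of the region covering row_key."""
    # Index of the last region whose start key is <= row_key.
    i = bisect.bisect_right(START_KEYS, row_key) - 1
    if i < 0:
        return None  # key sorts before the first region in this toy table
    return REGIONS[i]


for key in ("Chris", "George", "Nancy"):
    print(key, "->", locate(key))
```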
  • 23. [Architecture diagram: ZooKeeper Cluster; HBase Master and Backup HBase Master; four RegionServers; JVM Clients; Non-JVM Clients connecting through a Proxy (Thrift or REST)]
  • 24. ZooKeeper
       Keeps track of which server is the current HBase Master
      HBase Master
       Keeps track of Region/RegionServer mapping
       Manages the -ROOT- and .META. tables
       Responsible for updating ZooKeeper when these change
  • 25. RegionServer
       Stores table regions
      Clients
       Need to be smarter than RDBMS clients
       First connect to ZooKeeper to get RegionServer for a given Table/Region
       Then connect directly to RegionServer to interact with the data
       All connections over Hadoop RPC; non-JVM clients use proxy (Thrift or REST (Stargate))
  • 26. -ROOT- Table
      .META.[region]
        info:regioninfo
        info:server            Points to DataNode hosting .META. region.
        info:serverstartcode
      …
    .META. Table
      [table],[region start key],[region id]
        info:regioninfo
        info:server            Points to DataNode hosting table region.
        info:serverstartcode
      …
    Regular User Table
      … whatever …
  • 27. HBase Master is not necessarily a single point of failure (SPOF)  Multiple masters can be running  Current ‘active’ Master controlled via ZooKeeper  Make sure you have enough ZooKeeper nodes!  Master is not needed for client connectivity  Clients connect directly to ZooKeeper to find Regions  Everything Master does can be put off until one is elected
  • 28. [Diagram: ZooKeeper Quorum of three ZooKeeper Nodes; three HBase Masters, one Current and two Standby]
  • 29. HBase tolerates RegionServer failure when running on HDFS  Data is replicated by HDFS (dfs.replication setting)  Lots of issues around fsync, failure before data is flushed - some probably still not fixed  Thus, data can still be lost if node fails after a write  HDFS NameNode is still SPOF, even for HBase
  • 30. Similar to log in many RDBMS  All operations by default written to log before considered ‘committed’ (can be overridden for ‘disposable fast writes’)  Log can be replayed when region is moved to another RegionServer  One WAL per RegionServer
    [Diagram: Writes → WAL (flushed periodically, 10s by default); Writes → MemStore → HFile (flushed when MemStore gets too big)]
  • 31. [Diagram: a RegionServer hosts a Region made up of three Stores, each with a MemStore and a StoreFile (HFile), plus the Log; an HDFS Client writes these files as Blocks spread across four HDFS DataNodes]
  • 32. A RegionServer is not guaranteed to be on the same physical node as its data  Compaction causes RegionServer to write preferentially to local node  But this is a function of HDFS Client, not HBase
  • 33. All data is in memory initially (memstore)  HBase is a write-only system  Modifications and deletes are just writes with later timestamps  Function of HDFS being append-only  Eventually old writes need to be discarded  2 Types of Compactions:  Minor  Major
  • 34. All HBase edits are initially stored in memory (memstore)  Flushes occur when memstore reaches a certain size  By default 67,108,864 bytes  Controlled by hbase.hregion.memstore.flush.size configuration property  Each flush creates a new HFile
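A toy model of the flush rule described here: edits accumulate in memory, and once their size reaches a threshold they are written out as one new, sorted, immutable file. The `Store` class is hypothetical and the size accounting is deliberately rough (key bytes plus value bytes); only the default threshold of 67,108,864 bytes comes from the slide.

```python
class Store:
    """Toy store: an in-memory memstore that flushes to sorted 'HFiles'."""

    def __init__(self, flush_size: int = 67_108_864) -> None:
        self.flush_size = flush_size   # plays the role of hbase.hregion.memstore.flush.size
        self.memstore = {}             # row key -> value, held in memory
        self.memstore_bytes = 0        # approximate size of buffered edits
        self.hfiles = []               # each flush appends one sorted list

    def put(self, row: bytes, value: bytes) -> None:
        self.memstore[row] = value
        self.memstore_bytes += len(row) + len(value)
        if self.memstore_bytes >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        if not self.memstore:
            return
        # Each flush creates a new file; existing files are never modified.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.memstore_bytes = 0


store = Store(flush_size=32)  # tiny threshold so the example flushes quickly
for i in range(5):
    store.put(f"row{i}".encode("utf-8"), b"x" * 10)
print(len(store.hfiles), "HFile(s),", len(store.memstore), "row(s) still in memory")
```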
  • 35. Triggered when a certain number of HFiles are created for a given Region Store (+ some other conditions)  By default 3 HFiles  Controlled by hbase.hstore.compactionThreshold configuration property  Compacts most recent HFiles into one  By default, uses RegionServer-local HDFS node  Does not eliminate deletes  Only touches most recent HFiles  NOTE: All column families are compacted at once (this might change in the future)
  • 36. Triggered every 24 hours (with random offset) or manually  Large HBase installations usually leave this for manual operators  Re-writes all HFiles into one  Processes deletes  Eliminates tombstones  Erases earlier entries
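The difference between the two compaction types can be sketched on toy data. This is an illustration only, not HBase's compaction code: an "HFile" here is a list of `(row, timestamp, value)` tuples, `None` stands in for a tombstone, and the function names and the `max_versions` parameter are assumptions of this sketch.

```python
TOMBSTONE = None  # assumption: None stands in for a delete marker


def minor_compact(hfiles, n=3):
    """Merge the n most recent HFiles into one; tombstones and old versions survive."""
    merged = sorted(
        (entry for hfile in hfiles[-n:] for entry in hfile),
        key=lambda e: (e[0], e[1]),
    )
    return hfiles[:-n] + [merged]


def major_compact(hfiles, max_versions=1):
    """Rewrite all HFiles into one, dropping tombstones and the entries they shadow."""
    by_row = {}
    for hfile in hfiles:
        for row, ts, value in hfile:
            by_row.setdefault(row, []).append((ts, value))
    merged = []
    for row in sorted(by_row):
        kept = []
        # Walk versions newest first; stop at a tombstone or the version limit.
        for ts, value in sorted(by_row[row], key=lambda v: v[0], reverse=True):
            if value is TOMBSTONE or len(kept) == max_versions:
                break
            kept.append((row, ts, value))
        merged.extend(reversed(kept))  # store oldest first
    return [merged]


hfiles = [
    [("r", 0, "Foo"), ("r", 1, "Bar")],
    [("r", 2, TOMBSTONE)],
    [("r", 3, "Foo Too")],
]
print(minor_compact(hfiles, n=2))  # two files remain; the tombstone is still there
print(major_compact(hfiles))       # [[('r', 3, 'Foo Too')]]
```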
  • 37. HBase does not have transactions  However:  Row-level modifications are atomic: All modifications to a row will succeed or fail as a unit  Gets are consistent for a given point in time ▪ But Scans may return 2 rows from different points in time  All data read has been ‘durably stored’ ▪ Does NOT mean flushed to disk; it can still be lost!
  • 39. DO: Design your schema for linear range scans on your most common queries.  Scans are the most efficient way to query a lot of rows quickly  DON’T: Use more than 2 or 3 column families.  Some operations (flushing and compacting) operate on the whole row  DO: Be aware of the relative cardinality of column families  Wildly differing cardinality leads to sparsity and bad scanning results.
  • 40. DO: Be mindful of the size of your row and column keys  They are used in indexes and queries, can be quite large!  DON’T: Use monotonically increasing row keys  Can lead to hotspots on writes  DO: Store timestamp keys in reverse  Rows in a table need to be read in order, usually you want most recent
  • 41. DO: Query single rows using exact-match on key (Gets) or Scans for multiple rows  Scans allow efficient I/O vs. multiple gets  DON’T: Use regex-based or non-prefix column filters  Very inefficient  DO: Tune the scan cache and batch size parameters  Drastically improves performance when returning lots of rows
  • 42. Deceptively simple
    [Architecture diagram: HBase Master; four RegionServers; JVM Clients; Non-JVM Clients connecting through a Proxy (Thrift or REST)]
  • 45. Requirement: Store an arbitrary set of preferences for all users  Requirement: Each user may choose to store a different set of preferences  Requirement: Preferences may be of different data types (Strings, Integers, etc)  Requirement: Developers will add new preference options all the time, so we shouldn’t need to modify the database structure when adding them
  • 46. One possible RDBMS solution:  Key/Value table  All values as strings  Flexible, but wastes space
    Keys:   Column:           Data Type:
    PK      UserID            Int
    PK      PreferenceName    Varchar
            PreferenceValue   Varchar
  • 47. Store all preferences in the Preferences column family  Preference name as column name, preference value as (serialized) byte array:  HBase client library provides methods for serializing many common data types
    Row Key:   Family:       Column:     Value:
    Chris      Preferences   Age         30
                             Hometown    “Mineola, NY”
    Joe        Preferences   Birthdate   11/13/1987
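A minimal sketch of the "preference value as serialized byte array" idea, using Python's standard `struct` module in place of the HBase client's serialization helpers. The byte layouts (4-byte big-endian integers, UTF-8 strings) and the function names are choices made for this illustration. The dict keyed by (row, family, column) is only a stand-in for a table.

```python
import struct


def encode_int(value: int) -> bytes:
    return struct.pack(">i", value)  # 4-byte big-endian signed integer


def decode_int(raw: bytes) -> int:
    return struct.unpack(">i", raw)[0]


def encode_str(value: str) -> bytes:
    return value.encode("utf-8")


def decode_str(raw: bytes) -> str:
    return raw.decode("utf-8")


# (row key, column family, column) -> serialized value
table = {
    (b"Chris", b"Preferences", b"Age"): encode_int(30),
    (b"Chris", b"Preferences", b"Hometown"): encode_str("Mineola, NY"),
    (b"Joe", b"Preferences", b"Birthdate"): encode_str("11/13/1987"),
}

# The store only sees bytes; the reader must know each preference's type.
print(decode_int(table[(b"Chris", b"Preferences", b"Age")]))        # 30
print(decode_str(table[(b"Chris", b"Preferences", b"Hometown")]))   # Mineola, NY
```

Adding a new preference is just a new column name under the same family, so no schema change is needed, and each row carries only the preferences that user actually set.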