TriHUG January 2012 Talk by Chris Shain

Christopher Shain
Software Development Lead
Tresata

 Hadoop for Financial Services
 First completely Hadoop-powered analytics
application
 Widely recognized as “Big Data Startup to
Watch”
 Winner of the 2011 NCTA Award for Emerging
Tech Company of the Year

 Based in Charlotte NC
 We are hiring! curious@tresata.com

 Software Development Lead at Tresata
 Background in Financial Services IT
 End-User Applications
 Data Warehousing
 ETL
 Email: chris@tresata.com
 Twitter: @chrisshain

 What is HBase?
 From: http://hbase.apache.org:
▪ “HBase is the Hadoop database.”
 I think this is a confusing statement

 ‘Database’, to many, means:
 Transactions
 Joins
 Indexes

 HBase has none of these
 More on this later

 HBase is a data storage platform designed to
work hand-in-hand with Hadoop
 Distributed
 Failure-tolerant
 Semi-structured
 Low latency
 Strictly consistent
 HDFS-Aware
 “NoSQL”

 Need for a low-latency, distributed datastore
with unlimited horizontal scale

 Hadoop (MapReduce) doesn’t provide low-
latency

 Traditional RDBMS don’t scale out
horizontally

 November 2006: Google BigTable whitepaper
published:
http://research.google.com/archive/bigtable.html
 February 2007: Initial HBase Prototype
 October 2007: First ‘usable’ HBase
 January 2008: HBase becomes Apache
subproject of Hadoop
 March 2009: HBase 0.20.0
 May 10th, 2010: HBase becomes Apache Top
Level Project

 Web Indexing

 Social Graph

 Messaging (Email etc.)

 HBase is written almost entirely in Java
 JVM clients are first-class citizens
HBase Master

RegionServer
JVM Clients

RegionServer

Non-JVM Proxy
RegionServer
Clients (Thrift or REST)

RegionServer

 All data is stored in Tables
 Table rows have exactly one Key, and all rows in a
table are physically ordered by key
 Tables have a fixed number of Column Families
(more on this later!)
 Each row can have many Columns in each column
family
 Each column has a set of values, each with a
timestamp
 Each row:family:column:timestamp combination
represents coordinates for a Cell

 Defined by the Table
 A Column Family is a group of related
columns with it’s own name
 All columns must be in a column family
 Each row can have a completely different set
of columns for a column family
Row: Column Family: Columns:
Chris Friends:Bob
Bob Friends Friends:Chris Friends:James
James Friends:Bob

 Not exactly the same as rows in a traditional
RDBMS
 Key: a byte array (usually a UTF-8 String)
 Data: Cells, qualified by column family, column, and
timestamp (not shown here)
Row Key: Column Families : Columns: Cells:
(Defined by the (Defined by the Row) (Created with Columns)
Table) (May vary between
rows)
Attributes Attributes:Age 30
Attributes:Height 68
Chris
Friends Friends:Bob 1 (Bob’s a cool guy)
Friends:Jane 0 (Jane and I don’t get along)

 All cells are created with a timestamp
 Column family defines how many versions of
a cell to keep
 Updates always create a new cell
 Deletes create a tombstone (more on that
later)
 Queries can include an “as-of” timestamp to
return point-in-time values

 HBase deletes are a form of write called a
“tombstone”
 Indicates that “beyond this point any previously
written value is dead”
 Old values can still be read using point-in-time
queries
Timestamp Write Type Resulting Value Point-In-Time
Value “as of” T+1
T+0 PUT (“Foo”) “Foo” “Foo”
T+1 PUT (“Bar”) “Bar” “Bar”
T+2 DELETE <none> “Bar”
T+3 PUT (“Foo Too”) “Foo Too” “Bar”

 Requirement: Store real-time stock tick data
Ticker Timestamp Sequence Bid Ask
IBM 09:15:03:001 1 179.16 179.18
MSFT 09:15:04:112 2 28.25 28.27
GOOG 09:15:04:114 3 624.94 624.99
IBM 09:15:04:155 4 179.18 179.19

 Requirement: Accommodate many
simultaneous readers & writers
 Requirement: Allow for reading of current
price for any ticker at any point in time

Historical Prices:
Keys Column DataType
Ticker Varchar
PK Timestamp DateTime
Sequence_Number Integer
Bid_Price Decimal
Ask_Price Decimal

Latest Prices:
Keys Column DataType
PK Ticker Varchar
Bid_Price Decimal
Ask_Price Decimal

Row Key Family:Column

Prices:Bid
[Ticker].[Rev_Timestamp].[Rev_Sequence_Number]
Prices:Ask

 HBase throughput will scale linearly with # of
nodes

 No need to keep separate “latest price” table
 A scan starting at “ticker” will always
return the latest price row

 HBase scales horizontally

 Needs to split data over many RegionServers

 Regions are the unit of scale

 All HBase tables are broken into 1 or more
regions
 Regions have a start row key and an end row
key
 Each Region lives on exactly one
RegionServer
 RegionServers may host many Regions
 When RegionServers die, Master detects this
and assigns Regions to other RegionServers

“Users” Table
Row Keys in Region
“Aaron” – “George”
-META- Table “Aaron”
Region “Bob”
Table Region
Server “Chris”
“Aaron” – “George” Node01 Row Keys in Region
Users “George” – “Matthew” Node02 “George” – “Matthew”
“Matthew” – “Zachary” Node01 “George”
Row Keys in Region
“Matthew” – “Zachary”
“Matthew”
“Nancy”
“Zachary”

ZooKeeper
Cluster

Backup HBase
HBase Master
Master

JVM Clients
RegionServer

RegionServer

RegionServer
Non-JVM Proxy
RegionServer

 ZooKeeper
 Keeps track of which server is the current HBase
Master

 HBase Master
 Keeps track of Region/RegionServer mapping
 Manages the -ROOT- and .META. tables
 Responsible for updating ZooKeeper when these
change

 RegionServer
 Stores table regions
 Clients
 Need to be smarter than RDBMS clients
 First connect to ZooKeeper to get RegionServer
for a given Table/Region
 Then connect directly to RegionServer to interact
with the data
 All connections over Hadoop RPC – non-JVM
clients use proxy (Thrift or REST (Stargate))

-ROOT- Table
info:regioninfo
.META.[region] info:server Points to DataNode hosting
.META. region.…
info:serverstartcode

.META. Table
info:regioninfo
[table],[region start key],[region id] info:server Points to DataNode
info:serverstartcode hosting table region.

Regular User Table
… whatever … …

 HBase Master is not necessarily a single point of
failure (SPOF)
 Multiple masters can be running
 Current ‘active’ Master controlled via ZooKeeper
 Make sure you have enough ZooKeeper nodes!

 Master is not needed for client connectivity
 Clients connect directly to ZooKeeper to find Regions
 Everything Master does can be put off until one is
elected

ZooKeeper Quorum
ZooKeeper
Node

ZooKeeper ZooKeeper
Node Node

HBase HBase HBase
Master Master Master
(Current) (Standby) (Standby)

 HBase tolerates RegionServer failure when
running on HDFS
 Data is replicated by HDFS (dfs.replication setting)
 Lots of issues around fsync, failure before data is
flushed - some probably still not fixed
 Thus, data can still be lost if node fails after a write

 HDFS NameNode is still SPOF, even for HBase

 Similar to log in many RDBMS
 All operations by default written to log before considered
‘committed’ (can be overridden for ‘disposable fast writes’)
 Log can be replayed when region is moved to another
RegionServer
 One WAL per RegionServer
Flushed periodically
(10s by default)
WAL HFile

Writes

MemStore HFile
Flushed when
MemStore gets too big

RegionServer
Region
Store Store Store
MemStore

MemStore

MemStore
StoreFile StoreFile StoreFile
Log

HFile HFile HFile

HDFS Client Block Block Block Block Block Block Block

Block Block Block Block Block Block Block Block

Block Block Block Block Block Block Block Block
HDFS DataNode HDFS DataNode HDFS DataNode HDFS DataNode

 A RegionServer is not guaranteed to be on
the same physical node as it’s data

 Compaction causes RegionServer to write
preferentially to local node
 But this is a function of HDFS Client, not HBase

 All data is in memory initially (memstore)
 HBase is a write-only system
 Modifications and deletes are just writes with
later timestamps
 Function of HDFS being append-only
 Eventually old writes need to be discarded
 2 Types of Compactions:
 Minor
 Major

 All HBase edits are initially stored in memory
(memstore)

 Flushes occur when memstore reaches a
certain size
 By default 67,108,864 bytes
 Controlled by hbase.hregion.memstore.flush.size
configuration property

 Each flush creates a new HFile

 Triggered when a certain number of HFiles are created for
a given Region Store (+ some other conditions)
 By default 3 HFiles
 Controlled by hbase.hstore.compactionThreshold configuration
property

 Compacts most recent HFiles into one
 By default, uses RegionServer-local HDFS node

 Does not eliminate deletes
 Only touches most recent HFiles

 NOTE: All column families are compacted at once (this
might change in the future)

 Triggered every 24 hours (with random
offset) or manually
 Large HBase installations usually leave this for
manual operators

 Re-writes all HFiles into one

 Processes deletes
 Eliminates tombstones
 Erases earlier entries

 HBase does not have transactions
 However:
 Row-level modifications are atomic: All
modifications to a row will succeed or fail as a unit
 Gets are consistent for a given point in time
▪ But Scans may return 2 rows from different points in
time
 All data read has been ‘durably stored’
▪ Does NOT mean flushed to disk- can still be lost!

 DO: Design your schema for linear range scans on your
most common queries.
 Scans are the most efficient way to query a lot of rows
quickly

 DON’T: Use more than 2 or 3 column families.
 Some operations (flushing and compacting) operate
on the whole row

 DO: Be aware of the relative cardinality of column
families
 Wildly differing cardinality leads to sparsity and bad
scanning results.

 DO: Be mindful of the size of your row and column
keys
 They are used in indexes and queries, can be quite
large!

 DON’T: Use monotonically increasing row keys
 Can lead to hotspots on writes

 DO: Store timestamp keys in reverse
 Rows in a table need to be read in order, usually
you want most recent

 DO: Query single rows using exact-match on key
(Gets) or Scans for multiple rows
 Scans allow efficient I/O vs. multiple gets

 DON’T: Use regex-based or non-prefix column filters
 Very inefficient

 DO: Tune the scan cache and batch size parameters
 Drastically improves performance when returning lots of
rows

 Deceptively simple
HBase Master

JVM Clients RegionServer

RegionServer

Non-JVM Proxy RegionServer

RegionServer

 Requirement: Store an arbitrary set of
preferences for all users
 Requirement: Each user may choose to store
a different set of preferences
 Requirement: Preferences may be of
different data types (Strings, Integers, etc)
 Requirement: Developers will add new
preference options all the time, so we
shouldn’t need to modify the database
structure when adding them

 One possible RDBMS solution:
 Key/Value table
 All values as strings
 Flexible, but wastes space

Keys: Column: Data Type:
UserID Int
PK
PreferenceName Varchar

PreferenceValue Varchar

 Store all preferences in the Preferences
column family
 Preference name as column name,
preference value as (serialized) byte array:
 HBase client library provides methods for
serializing many common data types
Row Key: Family: Column: Value:
Age 30
Chris Preferences
Hometown “Mineola, NY”
Joe Preferences Birthdate 11/13/1987

TriHUG January 2012 Talk by Chris Shain

Recommended

Recommended

More Related Content

Similar to TriHUG January 2012 Talk by Chris Shain

Similar to TriHUG January 2012 Talk by Chris Shain (20)

More from trihug

More from trihug (11)

Recently uploaded

Recently uploaded (20)

TriHUG January 2012 Talk by Chris Shain