Apache HBase is the Hadoop ecosystem's open-source, distributed, versioned storage manager, well suited for random, real-time read/write access. This talk gives an overview of how HBase achieves random I/O, focusing on the storage-layer internals: starting from how the client interacts with Region Servers and the Master, then going into the WAL, MemStore, compactions, and on-disk format details, and looking at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance, and space efficiency.
1. HBase Storage Internals, present and future!
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe
2. What is HBase
• Open source Storage Manager that provides random read/write on top of HDFS
• Provides Tables with a "Key:Column/Value" interface
  • Dynamic columns (qualifiers), no schema needed
  • "Fixed" column groups (families)
• table[row:family:column] = value
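The data model above can be sketched as nested sorted maps. This is an illustrative simplification (real HBase cells also carry timestamps and are byte-oriented), with all names chosen here for the example:

```java
import java.util.TreeMap;

// Illustrative sketch of HBase's logical data model:
// table[row][family:column] = value, with rows kept sorted by key.
public class KvModel {
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    public void put(String row, String family, String qualifier, String value) {
        // Columns are dynamic: no schema check, any qualifier can appear.
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .put(family + ":" + qualifier, value);
    }

    public String get(String row, String family, String qualifier) {
        TreeMap<String, String> cols = rows.get(row);
        return cols == null ? null : cols.get(family + ":" + qualifier);
    }
}
```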
3. HBase EcoSystem
• Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
• Apache ZooKeeper for distributed coordination
• Apache Hadoop MapReduce built-in support for running MapReduce jobs
5. Master, Region Servers and Regions
• Region Server
  • Server that contains a set of Regions
  • Responsible for handling reads and writes
• Region
  • The basic unit of scalability in HBase
  • Subset of the table's data
  • Contiguous, sorted range of rows stored together
• Master
  • Coordinates the HBase Cluster
  • Assignment/Balancing of the Regions
  • Handles admin operations
    • create/delete/modify table, …
6. Autosharding and .META. table
• A Region is a Subset of the table's data
• When there is too much data in a Region…
  • a split is triggered, creating 2 regions
• The association "Region -> Server" is stored in a System Table
• The location of .META. is stored in ZooKeeper

  Table     | Start Key | Region ID | Region Server
  ----------|-----------|-----------|---------------
  testTable | Key-00    | 1         | machine01.host
  testTable | Key-31    | 2         | machine03.host
  testTable | Key-65    | 3         | machine02.host
  testTable | Key-83    | 4         | machine01.host
  …         | …         | …         | …
  users     | Key-AB    | 1         | machine03.host
  users     | Key-KG    | 2         | machine02.host
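Since each region holds a contiguous, sorted range of rows, locating the server for a key is a "greatest start key <= lookup key" search over the .META. entries. A minimal sketch of that lookup (class and method names are illustrative, not the HBase client API):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: find the region server for a row key, the way a client would
// after scanning .META. Regions are keyed by their start key.
public class RegionLocator {
    // region start key -> region server host
    private final TreeMap<String, String> regions = new TreeMap<>();

    public void addRegion(String startKey, String server) {
        regions.put(startKey, server);
    }

    public String serverFor(String rowKey) {
        // The owning region is the one with the greatest start key <= rowKey.
        Map.Entry<String, String> e = regions.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }
}
```

Using the .META. rows from the table above, `serverFor("Key-40")` falls in the region starting at Key-31 and resolves to machine03.host.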
7. The Write Path – Create a New Table
• The client asks the master to create a new Table
  • hbase> create 'myTable', 'cf'
• The Master
  • Stores the Table information ("schema")
  • Creates Regions based on the key-splits provided
    • if no splits are provided, one single region by default
  • Assigns the Regions to the Region Servers
    • The assignment Region -> Server is written to a system table called ".META."
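How the master could turn user-provided split keys into region boundaries can be sketched as follows (a simplified model, not the master's actual code; an empty string stands in for the open-ended first/last boundary):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: derive region [start, end) ranges from split keys at table creation.
// No splits -> one region covering the whole key space.
public class RegionPlan {
    // "" means an open-ended boundary (first start / last end).
    public static List<String[]> regionsFor(List<String> splitKeys) {
        List<String[]> regions = new ArrayList<>();
        String start = "";
        for (String split : splitKeys) {
            regions.add(new String[]{start, split});
            start = split;
        }
        regions.add(new String[]{start, ""}); // last region, open-ended
        return regions;
    }
}
```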
8. The Write Path – "Inserting" data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper the location of .META.
• The client scans .META. searching for the Region Server responsible for handling the Key
• The client asks the Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for handling the Key
  • The operation is written to a Write-Ahead Log (WAL)
  • …and the KeyValues added to the Store: "MemStore"
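The WAL-then-MemStore ordering inside a region can be sketched like this (a simplified, single-family model with illustrative names, not the Region Server implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the write path inside a region: every mutation is appended to
// the WAL first (durability), then applied to the in-memory MemStore.
public class WritePath {
    public final List<String> wal = new ArrayList<>();              // append-only log
    public final TreeMap<String, String> memstore = new TreeMap<>(); // sorted by key

    public void put(String key, String value) {
        wal.add("PUT " + key + "=" + value); // written to the log first
        memstore.put(key, value);            // then visible in the MemStore
    }
}
```

If the server crashes after the WAL append, replaying the log rebuilds the MemStore, which is why the data is safe even though nothing reached a store file yet.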
9. The Write Path – Append Only to Random R/W
• Files in HDFS are
  • Append-Only
  • Immutable once closed
• HBase provides Random Writes?
  • …not really from a storage point of view
  • KeyValues are stored in memory and written to disk on memory pressure
    • Don't worry, your data is safe in the WAL!
    • (The Region Server can recover data from the WAL in case of crash)
  • But this allows sorting data by Key before writing to disk
• Deletes are like Inserts, but with a "remove me" flag
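The two points above — sorted flushes and delete markers — can be sketched together (the tombstone string is an illustrative stand-in for HBase's delete-type flag):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch: deletes are written like inserts but carry a tombstone marker,
// and a flush emits the MemStore contents already sorted by key.
public class Memstore {
    public static final String TOMBSTONE = "__DELETED__"; // illustrative marker
    private final TreeMap<String, String> kvs = new TreeMap<>();

    public void put(String key, String value) { kvs.put(key, value); }

    public void delete(String key) { kvs.put(key, TOMBSTONE); } // "remove me" flag

    // Flush: returns "key=value" lines in key order, as they would hit disk.
    public List<String> flush() {
        List<String> out = new ArrayList<>();
        kvs.forEach((k, v) -> out.add(k + "=" + v));
        kvs.clear();
        return out;
    }
}
```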
10. The Read Path – "reading" data
• The client asks ZooKeeper the location of .META.
• The client scans .META. searching for the Region Server responsible for handling the Key
• The client asks the Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for handling the Key
  • MemStore and Store Files are scanned to find the key
11. The Read Path – Append Only to Random R/W
• On each flush a new file is created
• Each file has KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a Key you need to scan all the files
  • …with some optimizations
  • Filter files by Start/End Key
  • Having a bloom filter on each file
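Because updates and deletes land in newer files without touching older ones, a read has to consult the files newest-first and let the most recent version win. A minimal sketch of that merge (files modeled as maps; the tombstone string is illustrative):

```java
import java.util.List;
import java.util.Map;

// Sketch of a read across MemStore + store files: any file may contain the
// key, so check them newest-first and return the first match. A tombstone
// means the key was deleted after the older value was written.
public class MergedRead {
    public static final String TOMBSTONE = "__DELETED__";

    // files are ordered newest to oldest
    public static String get(String key, List<Map<String, String>> files) {
        for (Map<String, String> file : files) {
            String v = file.get(key);
            if (v != null) {
                return TOMBSTONE.equals(v) ? null : v;
            }
        }
        return null; // not present in any file
    }
}
```

The start/end-key filter and per-file bloom filters mentioned above simply let this loop skip files that provably cannot contain the key.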
13. HFile format
• Only Sequential Writes, just append(key, value)
• Large Sequential Reads are better
• Why grouping records in blocks?
  • Easy to split
  • Easy to read
  • Easy to cache
  • Easy to index (if records are sorted)
  • Block Compression (snappy, lz4, gz, …)

  File layout: Header, Record 0 … Record N, Index 0 … Index N, Trailer

  Key/Value (record) layout:
    Key Length   : int
    Value Length : int
    Key          : byte[]
    Value        : byte[]
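The record layout above can be serialized with plain length-prefixed framing. A sketch of the encoding (just the record framing shown on the slide, not the full HFile writer):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the on-disk record layout:
// key length (int), value length (int), key bytes, value bytes.
public class HFileRecord {
    public static byte[] encode(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(key.length);   // 4 bytes
            out.writeInt(value.length); // 4 bytes
            out.write(key);
            out.write(value);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new AssertionError(e); // in-memory stream: cannot happen
        }
    }
}
```

Fixed-width length prefixes are what make the block easy to split, scan, and index: a reader can jump record-to-record without parsing the payload.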
14. Data Block Encoding
• "Be aware of the data"
• Block Encoding allows compressing the Key based on what we know
  • Keys are sorted… prefixes may be similar in most cases
  • One file contains keys from one Family only
  • Timestamps are "similar", we can store the diff "on-disk"
  • Type is "put" most of the time…

  KeyValue layout:
    Row Length    : short
    Row           : byte[]
    Family Length : byte
    Family        : byte[]
    Qualifier     : byte[]
    Timestamp     : long
    Type          : byte
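The "sorted keys share prefixes" observation is the core of prefix-style encoding: store each key as (shared-prefix length, suffix) relative to the previous key. A simplified sketch on plain strings (real HBase encoders operate on the full KeyValue layout above):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of prefix encoding for a block of sorted keys: each entry records
// how many leading characters it shares with the previous key, plus the rest.
public class PrefixEncoder {
    public static List<String> encode(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int common = 0;
            int max = Math.min(prev.length(), key.length());
            while (common < max && prev.charAt(common) == key.charAt(common)) {
                common++;
            }
            out.add(common + ":" + key.substring(common));
            prev = key;
        }
        return out;
    }
}
```

For "row-001", "row-002", "row-010" this stores "0:row-001", "6:2", "5:10" — most of each key collapses into a small prefix count.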
16. Compactions
• Reduce the number of files to look into during a scan
• Removing duplicated keys (updated values)
• Removing deleted keys
• Creates a new file by merging the content of 2+ files
• Remove the old files
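The merge step can be sketched as follows (a simplified major-compaction model: files as maps ordered newest-first, tombstone string illustrative; a real compaction streams sorted records rather than materializing maps):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a major compaction: merge several store files into one,
// keeping only the newest version of each key and dropping tombstones.
public class Compactor {
    public static final String TOMBSTONE = "__DELETED__";

    public static Map<String, String> compact(List<Map<String, String>> newestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        // Apply oldest -> newest so newer values overwrite older ones.
        for (int i = newestFirst.size() - 1; i >= 0; i--) {
            merged.putAll(newestFirst.get(i));
        }
        // Deleted keys can finally be dropped for good.
        merged.values().removeIf(Compactor.TOMBSTONE::equals);
        return merged;
    }
}
```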
17. Pluggable Compactions
• Try different algorithms
• Be aware of the data
  • Time Series? I guess no updates from the 80s
• Be aware of the requests
  • Compact based on statistics
    • which files are hot and which are not
    • which keys are hot and which are not
18. Snapshots
Zero-copy snapshots and table clones
19. How does taking a snapshot work?
• The master orchestrates the RSs
  • the communication is done via ZooKeeper
  • using a "2-phase commit like" transaction (prepare/commit)
• Each RS is responsible for taking its "piece" of the snapshot
  • For each Region, store the metadata information needed
    • (list of Store Files, WALs, region start/end keys, …)
20. What is a Snapshot?
• "a Snapshot is not a copy of the table"
• a Snapshot is a set of metadata information
  • The table "schema" (column families and attributes)
  • The Regions information (start key, end key, …)
  • The list of Store Files
  • The list of active WALs
21. Cloning a Table from a Snapshot
• hbase> clone_snapshot 'snapshotName', 'tableName'
• Creates a new table with the data "contained" in the snapshot
• No data copies involved
  • HFiles are immutable
  • and shared between tables and snapshots
• You can insert/update/remove data from the new table
  • No repercussions on the snapshot, the original table, or other cloned tables
22. Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If files are referenced by a snapshot or a cloned table
  • The file is moved to an "archive" directory
  • and deleted later, when there are no references to it
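The archiving rule above is essentially reference counting over shared HFiles. A minimal sketch of the bookkeeping (class and method names are illustrative, not the real HFile cleaner API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: an HFile removed by a compaction or table deletion is only
// physically deleted once no table, clone, or snapshot still references it;
// until then it is parked in the "archive" area.
public class HFileArchiver {
    private final Map<String, Set<String>> refs = new HashMap<>(); // file -> referrers
    private final Set<String> archive = new HashSet<>();

    public void addReference(String file, String referrer) {
        refs.computeIfAbsent(file, f -> new HashSet<>()).add(referrer);
    }

    // Called when one referrer drops the file: archive if still shared, else delete.
    public String remove(String file, String referrer) {
        Set<String> r = refs.getOrDefault(file, new HashSet<>());
        r.remove(referrer);
        if (r.isEmpty()) {
            archive.remove(file);
            return "deleted";
        }
        archive.add(file);
        return "archived";
    }
}
```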
24. 0.96 is coming up
• Moving RPC to Protobuf
  • Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Remove -ROOT-
• Table Locks
25. 0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
  • Compaction policies, based on the data needs
• Managing blocks directly (instead of files)