Apache HBase is the Hadoop ecosystem's open-source, distributed, versioned storage manager, well suited for random, real-time read/write access. This talk gives an overview of how HBase achieves random I/O, focusing on the storage-layer internals: starting from how the client interacts with Region Servers and the Master, then going into the WAL, MemStore, compactions, and on-disk format details, and looking at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance, and space efficiency.
1. HBase Storage Internals, present and future!
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe
2. What is HBase
• Open source Storage Manager that provides random read/write on top of HDFS
• Provides Tables with a "Key:Column/Value" interface
  • Dynamic columns (qualifiers), no schema needed
  • "Fixed" column groups (families)
• table[row:family:column] = value
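The data model above can be sketched as nested sorted maps. This is an illustrative simplification (real HBase cells also carry timestamps and are byte-oriented), with all names chosen here for the example:

```java
import java.util.TreeMap;

// Illustrative sketch of HBase's logical data model:
// table[row][family:column] = value, with rows kept sorted by key.
public class KvModel {
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    public void put(String row, String family, String qualifier, String value) {
        // Columns are dynamic: no schema check, any qualifier can appear.
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .put(family + ":" + qualifier, value);
    }

    public String get(String row, String family, String qualifier) {
        TreeMap<String, String> cols = rows.get(row);
        return cols == null ? null : cols.get(family + ":" + qualifier);
    }
}
```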
3. HBase EcoSystem
• Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
• Apache ZooKeeper for distributed coordination
• Apache Hadoop MapReduce built-in support for running MapReduce jobs
5. Master, Region Servers and Regions
• Region Server
  • Server that contains a set of Regions
  • Responsible for handling reads and writes
• Region
  • The basic unit of scalability in HBase
  • Subset of the table's data
  • Contiguous, sorted range of rows stored together
• Master
  • Coordinates the HBase Cluster
  • Assignment/Balancing of the Regions
  • Handles admin operations
    • create/delete/modify table, …
6. Autosharding and .META. table
• A Region is a Subset of the table's data
• When there is too much data in a Region…
  • a split is triggered, creating 2 regions
• The association "Region -> Server" is stored in a System Table
• The location of .META. is stored in ZooKeeper

  Table     | Start Key | Region ID | Region Server
  ----------|-----------|-----------|---------------
  testTable | Key-00    | 1         | machine01.host
  testTable | Key-31    | 2         | machine03.host
  testTable | Key-65    | 3         | machine02.host
  testTable | Key-83    | 4         | machine01.host
  …         | …         | …         | …
  users     | Key-AB    | 1         | machine03.host
  users     | Key-KG    | 2         | machine02.host
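Since each region holds a contiguous, sorted range of rows, locating the server for a key is a "greatest start key <= lookup key" search over the .META. entries. A minimal sketch of that lookup (class and method names are illustrative, not the HBase client API):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: find the region server for a row key, the way a client would
// after scanning .META. Regions are keyed by their start key.
public class RegionLocator {
    // region start key -> region server host
    private final TreeMap<String, String> regions = new TreeMap<>();

    public void addRegion(String startKey, String server) {
        regions.put(startKey, server);
    }

    public String serverFor(String rowKey) {
        // The owning region is the one with the greatest start key <= rowKey.
        Map.Entry<String, String> e = regions.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }
}
```

Using the .META. rows from the table above, `serverFor("Key-40")` falls in the region starting at Key-31 and resolves to machine03.host.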
7. The Write Path – Create a New Table
• The client asks the master to create a new Table
  • hbase> create 'myTable', 'cf'
• The Master
  • Stores the Table information ("schema")
  • Creates Regions based on the key-splits provided
    • if no splits are provided, one single region by default
  • Assigns the Regions to the Region Servers
    • The assignment Region -> Server is written to a system table called ".META."
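How the master could turn user-provided split keys into region boundaries can be sketched as follows (a simplified model, not the master's actual code; an empty string stands in for the open-ended first/last boundary):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: derive region [start, end) ranges from split keys at table creation.
// No splits -> one region covering the whole key space.
public class RegionPlan {
    // "" means an open-ended boundary (first start / last end).
    public static List<String[]> regionsFor(List<String> splitKeys) {
        List<String[]> regions = new ArrayList<>();
        String start = "";
        for (String split : splitKeys) {
            regions.add(new String[]{start, split});
            start = split;
        }
        regions.add(new String[]{start, ""}); // last region, open-ended
        return regions;
    }
}
```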
8. The Write Path – "Inserting" data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper the location of .META.
• The client scans .META. searching for the Region Server responsible for handling the Key
• The client asks the Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for handling the Key
  • The operation is written to a Write-Ahead Log (WAL)
  • …and the KeyValues added to the Store: "MemStore"
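The WAL-then-MemStore ordering inside a region can be sketched like this (a simplified, single-family model with illustrative names, not the Region Server implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the write path inside a region: every mutation is appended to
// the WAL first (durability), then applied to the in-memory MemStore.
public class WritePath {
    public final List<String> wal = new ArrayList<>();              // append-only log
    public final TreeMap<String, String> memstore = new TreeMap<>(); // sorted by key

    public void put(String key, String value) {
        wal.add("PUT " + key + "=" + value); // written to the log first
        memstore.put(key, value);            // then visible in the MemStore
    }
}
```

If the server crashes after the WAL append, replaying the log rebuilds the MemStore, which is why the data is safe even though nothing reached a store file yet.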
9. The Write Path – Append Only to Random R/W
• Files in HDFS are
  • Append-Only
  • Immutable once closed
• HBase provides Random Writes?
  • …not really from a storage point of view
  • KeyValues are stored in memory and written to disk on memory pressure
    • Don't worry, your data is safe in the WAL!
    • (The Region Server can recover data from the WAL in case of crash)
  • But this allows sorting data by Key before writing to disk
• Deletes are like Inserts, but with a "remove me" flag
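The two points above — sorted flushes and delete markers — can be sketched together (the tombstone string is an illustrative stand-in for HBase's delete-type flag):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch: deletes are written like inserts but carry a tombstone marker,
// and a flush emits the MemStore contents already sorted by key.
public class Memstore {
    public static final String TOMBSTONE = "__DELETED__"; // illustrative marker
    private final TreeMap<String, String> kvs = new TreeMap<>();

    public void put(String key, String value) { kvs.put(key, value); }

    public void delete(String key) { kvs.put(key, TOMBSTONE); } // "remove me" flag

    // Flush: returns "key=value" lines in key order, as they would hit disk.
    public List<String> flush() {
        List<String> out = new ArrayList<>();
        kvs.forEach((k, v) -> out.add(k + "=" + v));
        kvs.clear();
        return out;
    }
}
```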
10. The Read Path – "reading" data
• The client asks ZooKeeper the location of .META.
• The client scans .META. searching for the Region Server responsible for handling the Key
• The client asks the Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible for handling the Key
  • MemStore and Store Files are scanned to find the key
11. The Read Path – Append Only to Random R/W
• On each flush a new file is created
• Each file has KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a Key you need to scan all the files
  • …with some optimizations
  • Filter files by Start/End Key
  • Having a bloom filter on each file
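Because updates and deletes land in newer files without touching older ones, a read has to consult the files newest-first and let the most recent version win. A minimal sketch of that merge (files modeled as maps; the tombstone string is illustrative):

```java
import java.util.List;
import java.util.Map;

// Sketch of a read across MemStore + store files: any file may contain the
// key, so check them newest-first and return the first match. A tombstone
// means the key was deleted after the older value was written.
public class MergedRead {
    public static final String TOMBSTONE = "__DELETED__";

    // files are ordered newest to oldest
    public static String get(String key, List<Map<String, String>> files) {
        for (Map<String, String> file : files) {
            String v = file.get(key);
            if (v != null) {
                return TOMBSTONE.equals(v) ? null : v;
            }
        }
        return null; // not present in any file
    }
}
```

The start/end-key filter and per-file bloom filters mentioned above simply let this loop skip files that provably cannot contain the key.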
13. HFile format
• Only Sequential Writes, just append(key, value)
• Large Sequential Reads are better
• Why grouping records in blocks?
  • Easy to split
  • Easy to read
  • Easy to cache
  • Easy to index (if records are sorted)
  • Block Compression (snappy, lz4, gz, …)

  File layout: Header, Record 0 … Record N, Index 0 … Index N, Trailer

  Key/Value (record) layout:
    Key Length   : int
    Value Length : int
    Key          : byte[]
    Value        : byte[]
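The record layout above can be serialized with plain length-prefixed framing. A sketch of the encoding (just the record framing shown on the slide, not the full HFile writer):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the on-disk record layout:
// key length (int), value length (int), key bytes, value bytes.
public class HFileRecord {
    public static byte[] encode(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(key.length);   // 4 bytes
            out.writeInt(value.length); // 4 bytes
            out.write(key);
            out.write(value);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new AssertionError(e); // in-memory stream: cannot happen
        }
    }
}
```

Fixed-width length prefixes are what make the block easy to split, scan, and index: a reader can jump record-to-record without parsing the payload.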
14. Data Block Encoding
• "Be aware of the data"
• Block Encoding allows compressing the Key based on what we know
  • Keys are sorted… prefixes may be similar in most cases
  • One file contains keys from one Family only
  • Timestamps are "similar", we can store the diff "on-disk"
  • Type is "put" most of the time…

  KeyValue layout:
    Row Length    : short
    Row           : byte[]
    Family Length : byte
    Family        : byte[]
    Qualifier     : byte[]
    Timestamp     : long
    Type          : byte
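The "sorted keys share prefixes" observation is the core of prefix-style encoding: store each key as (shared-prefix length, suffix) relative to the previous key. A simplified sketch on plain strings (real HBase encoders operate on the full KeyValue layout above):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of prefix encoding for a block of sorted keys: each entry records
// how many leading characters it shares with the previous key, plus the rest.
public class PrefixEncoder {
    public static List<String> encode(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int common = 0;
            int max = Math.min(prev.length(), key.length());
            while (common < max && prev.charAt(common) == key.charAt(common)) {
                common++;
            }
            out.add(common + ":" + key.substring(common));
            prev = key;
        }
        return out;
    }
}
```

For "row-001", "row-002", "row-010" this stores "0:row-001", "6:2", "5:10" — most of each key collapses into a small prefix count.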
16. Compactions
• Reduce the number of files to look into during a scan
• Removing duplicated keys (updated values)
• Removing deleted keys
• Creates a new file by merging the content of 2+ files
• Remove the old files
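The merge step can be sketched as follows (a simplified major-compaction model: files as maps ordered newest-first, tombstone string illustrative; a real compaction streams sorted records rather than materializing maps):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a major compaction: merge several store files into one,
// keeping only the newest version of each key and dropping tombstones.
public class Compactor {
    public static final String TOMBSTONE = "__DELETED__";

    public static Map<String, String> compact(List<Map<String, String>> newestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        // Apply oldest -> newest so newer values overwrite older ones.
        for (int i = newestFirst.size() - 1; i >= 0; i--) {
            merged.putAll(newestFirst.get(i));
        }
        // Deleted keys can finally be dropped for good.
        merged.values().removeIf(Compactor.TOMBSTONE::equals);
        return merged;
    }
}
```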
17. Pluggable Compactions
• Try different algorithms
• Be aware of the data
  • Time Series? I guess no updates from the 80s
• Be aware of the requests
  • Compact based on statistics
    • which files are hot and which are not
    • which keys are hot and which are not
18. Snapshots
Zero-copy snapshots and table clones
19. How does taking a snapshot work?
• The master orchestrates the RSs
  • the communication is done via ZooKeeper
  • using a "2-phase commit like" transaction (prepare/commit)
• Each RS is responsible for taking its "piece" of the snapshot
  • For each Region, store the metadata information needed
    • (list of Store Files, WALs, region start/end keys, …)
20. What is a Snapshot?
• "a Snapshot is not a copy of the table"
• a Snapshot is a set of metadata information
  • The table "schema" (column families and attributes)
  • The Regions information (start key, end key, …)
  • The list of Store Files
  • The list of active WALs
21. Cloning a Table from a Snapshot
• hbase> clone_snapshot 'snapshotName', 'tableName'
• Creates a new table with the data "contained" in the snapshot
• No data copies involved
  • HFiles are immutable
  • and shared between tables and snapshots
• You can insert/update/remove data from the new table
  • No repercussions on the snapshot, the original table, or other cloned tables
22. Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If files are referenced by a snapshot or a cloned table
  • The file is moved to an "archive" directory
  • and deleted later, when there are no references to it
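The archiving rule above is essentially reference counting over shared HFiles. A minimal sketch of the bookkeeping (class and method names are illustrative, not the real HFile cleaner API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: an HFile removed by a compaction or table deletion is only
// physically deleted once no table, clone, or snapshot still references it;
// until then it is parked in the "archive" area.
public class HFileArchiver {
    private final Map<String, Set<String>> refs = new HashMap<>(); // file -> referrers
    private final Set<String> archive = new HashSet<>();

    public void addReference(String file, String referrer) {
        refs.computeIfAbsent(file, f -> new HashSet<>()).add(referrer);
    }

    // Called when one referrer drops the file: archive if still shared, else delete.
    public String remove(String file, String referrer) {
        Set<String> r = refs.getOrDefault(file, new HashSet<>());
        r.remove(referrer);
        if (r.isEmpty()) {
            archive.remove(file);
            return "deleted";
        }
        archive.add(file);
        return "archived";
    }
}
```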
24. 0.96 is coming up
• Moving RPC to Protobuf
  • Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Remove -ROOT-
• Table Locks
25. 0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
  • Compaction policies, based on the data needs
• Managing blocks directly (instead of files)