HBase and HDFS: Understanding FileSystem Usage in HBase

© Hortonworks Inc. 2011
Enis Söztutar
enis [ at ] apache [dot] org
@enissoz

About Me
• In the Hadoop space since 2007
• Committer and PMC Member in Apache HBase and Hadoop
• Working at Hortonworks as a Member of Technical Staff
• Twitter: @enissoz

Motivation
• HBase as a database depends on FileSystem for many things
• HBase has to work over HDFS, Linux and Windows
• HBase is the most advanced user of HDFS
• To tune IO performance, you have to understand how HBase does IO
• MapReduce:
– Large files
– Few random seeks
– Batch oriented
– High throughput
– Failure handling at task level
– Computation moves to data
• HBase:
– Large files
– A lot of random seeks
– Latency sensitive
– Durability guarantees with sync
– Computation generates local data
– Large number of open files

Agenda
• Overview of file types in HBase
• Durability semantics
• IO Fencing / Lease recovery
• Data locality
– Short circuit reads (SSR)
– Checksums
– Block Placement
• Open topics

HBase file types

Overview of file types
• Mainly three types of files in HBase
– Write Ahead Logs (a.k.a. WALs, logs)
– Data files (a.k.a. store files, hfiles)
– References / symbolic or logical links (0 length files)
• Every file is 3-way replicated

Overview of file types
/hbase/.archive
/hbase/.logs/
/hbase/.logs/server1,60020,1370043265148/
/hbase/.logs/server1,60020,1370043265148/server1%2C60020%2C1370043265148.1370050467720
/hbase/.logs/server1,60020,1370043265105/server1%2C60020%2C1370043265105.1370046867591
…
/hbase/.oldlogs
/hbase/usertable/0711fd70ce0df641e9440e4979d67995/family/449e2fa173c14747b9d2e5..
/hbase/usertable/0711fd70ce0df641e9440e4979d67995/family/9103f38174ab48aa898a4b..
/hbase/table1/565bfb6220ca3edf02ac1f425cf18524/f1/49b32d3ee94543fb9055..
/hbase/.hbase-snapshot/usertable_snapshot/0ae3d2a93d3cf34a7cd30../family/12f114..
…
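Legend: entries under /hbase/.logs are Write Ahead Logs; entries under the table directories are Data files; entries under /hbase/.hbase-snapshot are Links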

Data Files (HFile)
• Immutable once written
• Generated by flush or compactions (sequential writes)
• Read randomly (preads), or sequentially
• Large (from the flush size up to tens of GBs)
• All data is in blocks (HFile blocks, not to be confused with HDFS blocks)
• Data blocks have target size:
– BLOCKSIZE in column family descriptor
– 64K by default
– Uncompressed and un-encoded size
• Index blocks (leaf, intermediate, root) have target size:
– hfile.index.block.max.size, 128K by default
• Bloom filter blocks have target size:
– io.storefile.bloom.block.size, 128K by default

Data Files (HFile version 2.x)

Data Files
• IO happens at block boundaries
– Random reads => disk seek + read whole block sequentially
– Read blocks are put into the block cache
– Leaf index blocks and bloom filter blocks also go to the block cache
• Use smaller block sizes for faster random access
– Smaller read + faster in-block search
– Block index becomes bigger, more memory consumption
• Use larger block sizes for faster scans
• Think about how many key values will fit in an average block
• Try compression and Data Block Encoding (PREFIX, DIFF, FAST_DIFF, PREFIX_TREE)
– Reduces file sizes and on-disk block sizes
KeyValue layout (fields in serialized order):
Field              Type
Key length         Int (4)
Value length       Int (4)
Row length         Short (2)
Row key            Byte[]
Family length      byte
Family             Byte[]
Column qualifier   Byte[]
Timestamp          Long (8)
KeyType            byte
Value              Byte[]
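
To make "how many key values will fit in an average block" concrete, the following is a minimal sketch that serializes one cell in the field order listed above and divides a 64K block by the result. It is not HBase's KeyValue class: the class name, the helper names, the sample row/family/qualifier/value, and the key type code 4 are assumptions chosen for illustration, and the estimate ignores block headers, compression, and data block encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative encoder for the field layout listed above; NOT HBase's KeyValue class.
final class KeyValueLayoutSketch {

    /** Serializes one cell in the order shown in the layout table. */
    static byte[] encode(byte[] row, byte[] family, byte[] qualifier,
                         long timestamp, byte keyType, byte[] value) {
        // Key = row length (2) + row + family length (1) + family
        //       + qualifier + timestamp (8) + key type (1)
        int keyLength = 2 + row.length + 1 + family.length + qualifier.length + 8 + 1;
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + keyLength + value.length);
        buf.putInt(keyLength);            // Key length, Int (4)
        buf.putInt(value.length);         // Value length, Int (4)
        buf.putShort((short) row.length); // Row length, Short (2)
        buf.put(row);                     // Row key
        buf.put((byte) family.length);    // Family length, byte
        buf.put(family);                  // Family
        buf.put(qualifier);               // Column qualifier
        buf.putLong(timestamp);           // Timestamp, Long (8)
        buf.put(keyType);                 // KeyType, byte
        buf.put(value);                   // Value
        return buf.array();
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Rough count of cells of one encoded size that fit in a block of the target size. */
    static int cellsPerBlock(int blockSizeBytes, int encodedCellBytes) {
        return blockSizeBytes / encodedCellBytes;
    }

    public static void main(String[] args) {
        // Sample cell; the type code 4 is only an example value (assumption).
        byte[] cell = encode(utf8("row-0001"), utf8("family"), utf8("q1"),
                1370043265148L, (byte) 4, utf8("value-1234"));
        System.out.println("encoded size: " + cell.length + " bytes");
        System.out.println("cells per 64K block: " + cellsPerBlock(64 * 1024, cell.length));
    }
}
```

With short values the fixed fields and the repeated row/family/qualifier bytes dominate the cell size, which is one reason the data block encodings listed above can help.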

Reference Files / Links
• When a region is split, “reference files” are created that refer to the top or bottom half of the parent store file according to the split key
• HBase does not delete data/WAL files directly; it “archives” them under:
/hbase/.oldlogs
/hbase/.archive
• Archived logs/hfiles are kept until their TTL has passed and neither replication nor snapshots refer to them
– (hbase.master.logcleaner.ttl, 10min)
– (hbase.master.hfilecleaner.ttl, 5min)
• HFileLink: an application-specific kind of hard / soft link
• HBase snapshots are logical links to files (with backrefs)

Write Ahead Logs
• One logical WAL per region / one physical per regionserver
• Rolled frequently
– hbase.regionserver.logroll.multiplier (0.95)
– hbase.regionserver.hlog.blocksize (default file system block size)
• Chronologically ordered set of files, only last one is open for writing
• Exceeding hbase.regionserver.maxlogs (32) forces a flush
• Old log files are deleted as a whole
• Every edit is appended
• Sequential writes to the WAL, synced very frequently (hundreds of times per sec)
• Read only sequentially, by replication and crash recovery
• One log file per region server limits the write throughput per Region Server
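
The two thresholds above can be sketched as simple predicates. This is a minimal illustration, not HBase code: the class and method names are hypothetical, and the 64 MB block size in the example is an assumed value for the "default file system block size".

```java
// Hypothetical helper mirroring the roll and force-flush thresholds named above.
final class WalRollSketch {

    /** Log size at which the current WAL file is rolled: block size x multiplier. */
    static long rollSizeBytes(long blockSizeBytes, double multiplier) {
        return (long) (blockSizeBytes * multiplier);
    }

    /** True once the open log file has grown past the roll size. */
    static boolean shouldRoll(long currentLogBytes, long blockSizeBytes, double multiplier) {
        return currentLogBytes > rollSizeBytes(blockSizeBytes, multiplier);
    }

    /** True once the number of log files exceeds hbase.regionserver.maxlogs. */
    static boolean shouldForceFlush(int logFileCount, int maxLogs) {
        return logFileCount > maxLogs;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed file system block size
        System.out.println("roll size: " + rollSizeBytes(blockSize, 0.95) + " bytes");
        System.out.println("force flush at 33 logs: " + shouldForceFlush(33, 32));
    }
}
```

Rolling slightly below the block size keeps each log file within a single HDFS block.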

Durability
(as in ACID)

Overview of Write Path
1. Client sends the operations over RPC (Put/Delete)
2. Obtain row locks
3. Obtain the next mvcc write number
4. Tag the cells with the mvcc write number
5. Add the cells to the memstores (changes not visible yet)
6. Append a WALEdit to WAL, do not sync
7. Release row locks
8. Sync WAL
9. Advance mvcc, make changes visible
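
The ordering of the steps above, in particular that an edit sits in the memstore before it is synced and only becomes visible after the sync, can be sketched with a toy model. This is a single-threaded illustration under stated assumptions, not the region server implementation: all class and method names are hypothetical, row locks are omitted, and the "WAL" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

// Toy, single-threaded model of the write path ordering; NOT HBase code.
final class WritePathSketch {

    /** A cell tagged with its MVCC write number (step 4). */
    static final class Cell {
        final String row;
        final String value;
        final long mvcc;

        Cell(String row, String value, long mvcc) {
            this.row = row;
            this.value = value;
            this.mvcc = mvcc;
        }
    }

    private final List<Cell> memstore = new ArrayList<>();
    private final List<String> wal = new ArrayList<>();
    private int syncedEntries = 0;    // WAL entries covered by the last sync
    private long nextWriteNumber = 1; // source of MVCC write numbers
    private long readPoint = 0;       // highest write number visible to readers

    /** Steps 3-6: tag the cell, add it to the memstore, append to the WAL without syncing. */
    long stage(String row, String value) {
        long writeNumber = nextWriteNumber++;            // step 3
        memstore.add(new Cell(row, value, writeNumber)); // steps 4-5, not visible yet
        wal.add(row + "=" + value);                      // step 6, append only
        return writeNumber;
    }

    /** Steps 8-9: sync the WAL, then advance the read point to publish the edit. */
    void syncAndPublish(long writeNumber) {
        syncedEntries = wal.size();                      // step 8
        readPoint = Math.max(readPoint, writeNumber);    // step 9
    }

    /** Readers see only cells whose write number is at or below the read point. */
    String get(String row) {
        String latest = null;
        for (Cell c : memstore) {
            if (c.row.equals(row) && c.mvcc <= readPoint) {
                latest = c.value;
            }
        }
        return latest;
    }

    int syncedEntries() {
        return syncedEntries;
    }

    public static void main(String[] args) {
        WritePathSketch region = new WritePathSketch();
        long w = region.stage("r1", "v1");
        System.out.println("before sync: " + region.get("r1"));
        region.syncAndPublish(w);
        System.out.println("after sync: " + region.get("r1"));
    }
}
```

Releasing the row lock before the sync (step 7) keeps the lock hold time short; the MVCC read point is what prevents readers from seeing an edit that is not yet durable.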

Durability
• 0.94 and before:
– HTable property “DEFERRED_LOG_FLUSH” and
– Mutation.setWriteToWAL(false)
• 0.94 and 0.96:
Durability     Semantics
USE_DEFAULT    Use global HBase default, or table default (SYNC_WAL)
SKIP_WAL       Do not write updates to WAL
ASYNC_WAL      Write entries to WAL asynchronously (hbase.regionserver.optionallogflushinterval, 1 sec default)
SYNC_WAL       Write entries to WAL, flush to datanodes
FSYNC_WAL      Write entries to WAL, fsync in datanodes

Durability
• 0.94: Durability setting per Mutation (HBASE-7801) / per table (HBASE-8375)
• Allows intermixing different durability settings for updates to the same table
• Durability is chosen from the mutation, unless it is USE_DEFAULT, in which case the table’s Durability is used
• Limit the amount of time an edit can live in the memstore (HBASE-5930)
– hbase.regionserver.optionalcacheflushinterval
– Default 1hr
– Important for SKIP_WAL
– Causes a flush if there are unflushed edits older than hbase.regionserver.optionalcacheflushinterval
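
The resolution rule and the periodic flush check can be sketched as follows. This is a minimal sketch, not the HBase implementation: the enum is a local copy of the values listed in this deck so that the example compiles without HBase on the classpath, and the class and method names are hypothetical.

```java
// Hypothetical sketch of durability resolution and the periodic memstore flush check.
final class DurabilityResolutionSketch {

    // Local stand-in for the Durability values listed in this deck (assumption:
    // declared here only so the sketch is self-contained).
    enum Durability { USE_DEFAULT, SKIP_WAL, ASYNC_WAL, SYNC_WAL, FSYNC_WAL }

    /** The mutation's setting wins unless it is USE_DEFAULT; then the table's is used. */
    static Durability effective(Durability mutation, Durability table) {
        if (mutation != Durability.USE_DEFAULT) {
            return mutation;
        }
        // A table that is itself USE_DEFAULT falls back to SYNC_WAL.
        return table == Durability.USE_DEFAULT ? Durability.SYNC_WAL : table;
    }

    /** True if the oldest unflushed edit has lived in the memstore longer than the interval. */
    static boolean isFlushOverdue(long oldestUnflushedEditMs, long nowMs, long intervalMs) {
        return nowMs - oldestUnflushedEditMs > intervalMs;
    }

    public static void main(String[] args) {
        System.out.println(effective(Durability.USE_DEFAULT, Durability.ASYNC_WAL));
        System.out.println(effective(Durability.SKIP_WAL, Durability.ASYNC_WAL));
    }
}
```

The flush check matters most for SKIP_WAL edits, because until they are flushed the memstore holds their only copy.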

Durability
public enum Durability {
    USE_DEFAULT,
    SKIP_WAL,
    ASYNC_WAL,
    SYNC_WAL,
    FSYNC_WAL
}

Per table:
    HTableDescriptor htd = new HTableDescriptor("myTable");
    htd.setDurability(Durability.ASYNC_WAL);
    admin.createTable(htd);

Shell:
    hbase(main):007:0> create 't12', 'f1', DURABILITY=>'ASYNC_WAL'

Per mutation:
    Put put = new Put(rowKey);
    put.setDurability(Durability.ASYNC_WAL);
    table.put(put);

Durability (Hflush / Hsync)
• hflush(): Flush the data packet down the datanode pipeline. Wait for acks.
• hsync(): Flush the data packet down the pipeline. Have datanodes execute the FSYNC equivalent. Wait for acks.
• hflush is currently the default; hsync() usage in HBase is not implemented yet (HBASE-5954). hsync is also not optimized (2x slower) and is available only in Hadoop 2.0.
• hflush does not lose data, unless all 3 replicas die without syncing to disk (datacenter power failure)
• Ensure that the log is replicated 3 times:
hbase.regionserver.hlog.tolerable.lowreplication
defaults to the FileSystem default replication count (3 for HDFS)
public interface Syncable {
    public void hflush() throws IOException;
    public void hsync() throws IOException;
}


IO Fencing
Fencing is the process of isolating a node of a computer cluster or protecting shared resources when a node appears to be malfunctioning

IO Fencing
[Figure: IO fencing sequence diagram. Participants: Client, Region Server A (dying), Region Server B, Region1, WAL, Master, Zookeeper. Events shown: edit, Append+sync, ack, RegionServer A session timeout, RegionServer A znode deleted, assign, YouAreDeadException, abort.]

IO Fencing
• Split Brain
• Ensure that a region is only hosted by a single region server at any time
• If the master thinks that a region server no longer hosts the region, that RS must not be able to accept and sync() updates
• Master renames the region server logs directory on HDFS:
– Current WAL cannot be rolled, new log file cannot be created
– recoverLease() is called on each WAL before it is replayed
– recoverLease => lease recovery + block recovery
– Ensures that the WAL is closed, and all data is visible (file length)
• Guarantees for region data files:
– Compactions => remove files + add files
– Flushes => allowed, since the resulting data is idempotent
• HBASE-2231, HBASE-7878, HBASE-8449

Data Locality
Short circuit reads, checksums, block placement

HDFS local reads (short circuit reads)
• Bypasses the datanode layer and goes directly to the OS files
• Hadoop 1.x implementation:
– DFSClient asks the local datanode for the local paths of a block
– Datanode checks whether the user has permission
– Client gets the path for the block, opens the file with FileInputStream
hdfs-site.xml:
    dfs.block.local-path-access.user = hbase
    dfs.datanode.data.dir.perm = 750
hbase-site.xml:
    dfs.client.read.shortcircuit = true
[Figure: read path layers. HBase Client, RPC, RegionServer, Hadoop FileSystem, DFSClient, BlockReader, RPC, Datanode, OS Filesystem (ext3), Disks.]

HDFS local reads (short circuit reads)
• Hadoop 2.0 implementation (HDFS-347)
– Keep the legacy implementation
– Use Unix Domain sockets to pass the File Descriptor (FD)
– Datanode opens the block file and passes the FD to the BlockReaderLocal running in the Regionserver process
– More secure than previous implementation
– Windows also supports domain sockets, but the native APIs still need to be implemented
• Local buffer size: dfs.client.read.shortcircuit.buffer.size
– BlockReaderLocal fills this whole buffer every time HBase tries to read an HFile block
– dfs.client.read.shortcircuit.buffer.size = 1MB vs 64KB HFile block size
– SSR buffer is a direct buffer (in Hadoop 2, not in Hadoop 1)
– Memory estimate: # regions x # stores x # avg store files x # avg HDFS blocks per file x SSR buffer size
– 10 regions x 2 x 4 x (1GB / 64MB) x 1MB = 1280MB (roughly 1.28GB) of non-heap memory usage
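
The estimate above is a plain product, which makes it easy to re-run with a cluster's own numbers. The sketch below only reproduces the arithmetic from the bullets; the class and method names are hypothetical, and it does not query any HDFS or HBase API.

```java
// Hypothetical calculator for the short-circuit read (SSR) buffer estimate above.
final class SsrBufferEstimate {

    /** regions x stores x avg store files x avg HDFS blocks per file x SSR buffer size. */
    static long estimateBytes(long regions, long storesPerRegion, long filesPerStore,
                              long blocksPerFile, long bufferBytes) {
        return regions * storesPerRegion * filesPerStore * blocksPerFile * bufferBytes;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long blocksPerFile = (1024 * mb) / (64 * mb); // 1GB store file / 64MB HDFS block
        long bytes = estimateBytes(10, 2, 4, blocksPerFile, 1 * mb);
        System.out.println(bytes / mb + " MB of direct (non-heap) buffers");
    }
}
```

Because every factor multiplies, lowering dfs.client.read.shortcircuit.buffer.size toward the HFile block size shrinks the estimate proportionally.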

Checksums
• HDFS checksums are not inlined
• Two files per block, one for data, one for checksums (HDFS-2699)
• A random positioned read therefore causes 2 seeks
• HBase checksums come with 0.94 (HDP 1.2+), HBASE-5074
[Figure: the block file blk_123456789 holds the data chunks (dfs.bytes-per-checksum, 512 bytes each); the separate file .blk_123456789.meta holds the checksum chunks (4 bytes each).]

Checksums
• HFile version 2.1 writes checksums per HFile block
• HDFS checksum verification is bypassed on block read; verification is done by HBase
• If the checksum fails, we go back to reading checksums from HDFS for “some time”
• Due to the double checksum bug (HDFS-3429) in remote reads in Hadoop 1, this is not enabled by default for now. Benchmark it yourself
hbase.regionserver.checksum.verify = true
hbase.hstore.bytes.per.checksum = 16384
hbase.hstore.checksum.algorithm = CRC32C
Never set this:
dfs.client.read.shortcircuit.skip.checksum = false
[Figure: HFile layout with inline checksums. Each HFile block consists of a block header, HFile data block chunks, and checksum chunks.]
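
To compare the two chunk sizes mentioned in these slides (512 bytes for HDFS, 16384 bytes for HBase), the sketch below computes how many 4-byte checksums one 64K block needs and checksums each chunk separately. It is a minimal illustration with hypothetical names, not HBase or HDFS code, and it uses java.util.zip.CRC32 as a stand-in for the CRC32C algorithm named in the configuration.

```java
import java.util.zip.CRC32;

// Hypothetical sketch of per-chunk checksums; CRC32 stands in for CRC32C.
final class ChecksumOverheadSketch {

    static final int CHECKSUM_BYTES = 4; // one 4-byte checksum per chunk

    /** Number of chunks needed to cover a block (last chunk may be partial). */
    static int chunkCount(int blockBytes, int bytesPerChecksum) {
        return (blockBytes + bytesPerChecksum - 1) / bytesPerChecksum;
    }

    /** Total checksum bytes stored for one block. */
    static int checksumBytes(int blockBytes, int bytesPerChecksum) {
        return chunkCount(blockBytes, bytesPerChecksum) * CHECKSUM_BYTES;
    }

    /** Computes one checksum per chunk of the block. */
    static long[] chunkChecksums(byte[] block, int bytesPerChecksum) {
        long[] sums = new long[chunkCount(block.length, bytesPerChecksum)];
        CRC32 crc = new CRC32();
        for (int i = 0; i < sums.length; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, block.length - offset);
            crc.reset();
            crc.update(block, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        int block = 64 * 1024;
        System.out.println("512-byte chunks: " + checksumBytes(block, 512) + " checksum bytes");
        System.out.println("16384-byte chunks: " + checksumBytes(block, 16384) + " checksum bytes");
    }
}
```

Larger chunks mean fewer checksum bytes per block, at the cost of re-reading more data to locate a corruption.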

Default Block Placement Policy
[Figure: four nodes (Rack 1 / Server 1, Rack N / Server M, Rack L / Server K, Rack X / Server Y), each running a RegionServer and a DataNode. The RegionServer on Rack 1 / Server 1 hosts Region A and Region B with their StoreFiles; block replicas (b1, b2, b3, b9) are shown on the DataNodes of all four nodes.]

Data locality for HBase
• Poor data locality when the region is moved:
– As a result of load balancing
– Region server crash + failover
• Most of the data won’t be local unless the files are compacted
• Idea (from Facebook): Regions have affiliated nodes (primary, secondary, tertiary), HBASE-4755
• When writing a data file, give hints to the NN that we want these locations for block replicas (HDFS-2576)
• LB should assign the region to one of the affiliated nodes on server crash
– Keep data locality
– SSR will still work
• Reduces data loss probability

Default Block Placement Policy
[Figure: the same four nodes (Rack 1 / Server 1, Rack N / Server M, Rack L / Server K, Rack X / Server Y), each with a RegionServer and a DataNode. Region A and Region B and their StoreFiles are hosted on Rack 1 / Server 1; each DataNode is shown with its block replicas (b1, b2, b3, b9).]

Other considerations
• HBase riding over Namenode HA
– Both Hadoop 1 (NFS based) and Hadoop 2 HA (QJM, etc)
– Heavily tested with full stack HA
• Retry HDFS operations
• Isolate FileSystem usage from HBase internals
• Hadoop 2 vs Hadoop 1 performance
– Hadoop 2 is coming!
• HDFS snapshots vs HBase snapshots
– HBase DOES NOT use HDFS snapshots
– Need hardlinks
– Super flush API
• HBase security vs HDFS security
– All files are owned by HBase principal
– No ACLs in HDFS. Allowing a user to read HFiles / snapshots directly is hard

Open Topics
• HDFS hard links
– Rethink how we do snapshots, backups, etc
• Parallel writes for WAL
– Reduce latency on WAL syncs
• SSD storage, cache
– SSD storage type in Hadoop or local filesystem
– Using SSDs as a secondary cache
– Selectively place tables / column families on SSD
• HDFS zero-copy reads (HDFS-3051, HADOOP-8148)
• HDFS inline checksums (HDFS-2699)
• HDFS Quorum reads (HBASE-7509)

Thanks
Questions?
