This presentation describes why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB, and how it differs from other open-source key-value stores. Dhruba describes some of the salient features of RocksDB that are needed to support embedded-storage deployments, explains the typical workloads that are its primary use cases, and lays out the roadmap to make RocksDB the key-value store of choice for highly multi-core processors and RAM-speed storage devices.
4. A Client-Server Architecture with disks

[Diagram: an application server talks over the network to a database server with locally attached disks. Network roundtrip = 50 microseconds; disk access = 10 milliseconds.]
5. Client-Server Architecture with fast storage

[Diagram: the same setup, but the database server is backed by SSD (100 microseconds) or RAM (100 nanoseconds). With a 50-microsecond network roundtrip, latency is dominated by the network.]
9. Architecture of an Embedded Database

[Diagram: storage (SSD, 100 microseconds; RAM, 100 nanoseconds) is attached directly to the application server, eliminating the 50-microsecond network roundtrip to a separate database server.]
17. Comparison of open source databases

                Random Reads       Random Writes
LevelDB         129,000 ops/sec    164,000 ops/sec
Kyoto TreeDB    151,000 ops/sec     88,500 ops/sec
SQLite3         134,000 ops/sec      9,860 ops/sec
20. HBase and HDFS (in April 2012)

Random Reads:
HDFS (1 node):  93,000 ops/sec
HBase (1 node): 35,000 ops/sec

Details of this experiment:
http://hadoopblog.blogspot.com/2012/05/hadoop-and-solid-state-drives.html
30. Log Structured Merge Architecture

[Diagram: write requests from the application go to read-write data in RAM (the memtable) and are recorded in a transaction log; the data later moves to read-only data in RAM and on disk, which periodic compaction reorganizes. Scan requests from the application consult both the read-write and the read-only data.]
32. LevelDB has low write rates

Facebook Application 1:
• Write rate of only 2 MB/sec per machine
• Only one CPU was used

We developed multithreaded compaction: a 10x improvement in write rate, with 100% of CPUs in use.
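The multithreaded compaction itself is internal, but the degree of parallelism is exposed through public options in the open-source RocksDB. A minimal sketch, assuming the option names from the current C++ headers (the thread counts are illustrative):

```cpp
#include <rocksdb/env.h>
#include <rocksdb/options.h>

// Minimal sketch: let several compactions run in parallel instead of one.
rocksdb::Options MultithreadedCompactionOptions() {
  rocksdb::Options options;
  // Up to 4 concurrent background compactions.
  options.max_background_compactions = 4;
  // Compaction threads are drawn from the shared low-priority Env pool,
  // which must be sized to match.
  options.env->SetBackgroundThreads(4, rocksdb::Env::LOW);
  return options;
}
```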
34. LevelDB has stalls

Facebook Feed:
• P99 latencies were tens of seconds
• Single-threaded compaction

We implemented thread-aware compaction: dedicated thread(s) to flush the memtable, plus pipelined memtables. P99 was reduced to less than a second.
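In the open-source builds, the dedicated flush threads and pipelined memtables are reachable through public options. A sketch under that assumption (the values are illustrative):

```cpp
#include <rocksdb/env.h>
#include <rocksdb/options.h>

// Minimal sketch: keep flushes on their own high-priority thread so a long
// compaction cannot stall them, and allow several memtables in flight.
rocksdb::Options ThreadAwareOptions() {
  rocksdb::Options options;
  // Dedicated thread(s) to flush memtables.
  options.max_background_flushes = 1;
  options.env->SetBackgroundThreads(1, rocksdb::Env::HIGH);
  // Pipelined memtables: writers fill a fresh memtable while older
  // read-only memtables are still being flushed.
  options.max_write_buffer_number = 4;
  return options;
}
```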
36. LevelDB has high write amplification

Facebook Application 2:
• Level-style compaction
• A write amplification of 70 is very high

[Diagram: two compactions under LevelDB-style compaction. Stage 1: Level-0 holds 5 bytes, Level-1 holds 6 bytes, Level-2 holds three 10-byte files. Stage 2: Level-0 and Level-1 are compacted into an 11-byte file in Level-1. Stage 3: that file is compacted again into Level-2.]
38. Our solution: lower write amplification

Facebook Application 2:
• We implemented Universal Style Compaction
• Start from the newest file, and include the next file in the candidate set if:
  • candidate set size >= size of the next file

[Diagram: a single compaction under Universal Style Compaction. Stage 1: Level-0 holds 5 bytes, Level-1 holds 6 bytes, Level-2 holds two 10-byte files. Stage 2: one compaction merges them all.]

Write amplification was reduced to <10.
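Universal-style compaction is selectable through the public API. A minimal sketch, assuming the option names in the open-source headers (the size_ratio value is illustrative):

```cpp
#include <rocksdb/options.h>

// Minimal sketch: switch from level-style to universal-style compaction.
rocksdb::Options UniversalCompactionOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  // size_ratio is the slack (in percent) used when testing whether the
  // candidate set is at least as large as the next file.
  options.compaction_options_universal.size_ratio = 1;
  return options;
}
```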
41. LevelDB has high read amplification

Secondary Index Service:
• LevelDB does not use bloom filters for scans
• We implemented prefix scans:
  • Range scans within the same key prefix
  • Blooms created per prefix
  • Reduces read amplification
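A hedged sketch of the prefix-scan setup using the public API; the 8-byte fixed prefix and the prefix_same_as_start flag are assumptions based on the current open-source headers:

```cpp
#include <memory>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>

// Configure a prefix extractor so blooms can be built per key prefix.
rocksdb::Options PrefixScanOptions() {
  rocksdb::Options options;
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  return options;
}

// Range scan restricted to keys sharing one prefix. Assumes the DB was
// opened with a prefix extractor configured, as above.
void ScanPrefix(rocksdb::DB* db, const rocksdb::Slice& prefix) {
  rocksdb::ReadOptions ropts;
  ropts.prefix_same_as_start = true;  // stop at the end of this prefix
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ropts));
  for (it->Seek(prefix); it->Valid(); it->Next()) {
    // it->key() and it->value() all belong to the prefix.
  }
}
```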
50. The changes we made to LevelDB

Inherited from LevelDB:
• Log Structured Merge DB
• Gets/Puts/Scans of keys
• Forward and reverse iteration

New in RocksDB:
• 10x higher write rate
• Fewer stalls
• 7x lower write amplification
• Blooms for range scans
• Ability to avoid read-modify-write
• Optimizations for flash or RAM
• And many more…
52. RocksDB is born!
• Key-Value persistent store
• Embedded
• Optimized for fast storage
• Server workloads
54. What is it not?
• Not distributed
• No failover
• Not highly available: if the machine dies, you lose your data
55. RocksDB API
▪ Keys and values are arbitrary byte arrays.
▪ Data is stored sorted by key.
▪ The basic operations are Put(key, value), Get(key), Delete(key) and Merge(key, delta).
▪ Forward and backward iteration is supported over the data.
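A minimal sketch of that API in C++, with an illustrative path and keys; Merge additionally requires a merge operator to be configured:

```cpp
#include <cassert>
#include <string>
#include <rocksdb/db.h>

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
  assert(s.ok());

  s = db->Put(rocksdb::WriteOptions(), "key1", "value1");  // Put(key, value)
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);     // Get(key)
  s = db->Delete(rocksdb::WriteOptions(), "key1");         // Delete(key)
  // Merge(key, delta) also exists, but needs options.merge_operator set.

  // Forward iteration over the sorted key space (reverse iteration uses
  // SeekToLast()/Prev()).
  rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // it->key(), it->value()
  }
  delete it;
  delete db;
  return 0;
}
```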
61. Log Structured Merge Tree -- Writes
▪ Log Structured Merge Tree
▪ New Puts are written to memory, and optionally to the transaction log
▪ A log-write sync option can also be specified for each individual write
▪ We say RocksDB is optimized for writes; what does this mean?
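The per-write log options map to the public WriteOptions struct; a sketch, with field names as they appear in the open-source headers:

```cpp
#include <rocksdb/db.h>

void WriteExamples(rocksdb::DB* db) {
  // Memory-only write: skip the transaction log entirely.
  rocksdb::WriteOptions no_log;
  no_log.disableWAL = true;
  db->Put(no_log, "k1", "v1");

  // Durable write: sync the log to storage before acknowledging this write.
  rocksdb::WriteOptions synced;
  synced.sync = true;
  db->Put(synced, "k2", "v2");
}
```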
63. RocksDB Write Path

[Diagram: a write request goes to the active MemTable and its log. On switch, the active MemTable becomes a read-only MemTable (each with its own log) and is flushed into sst files, which form the LSM tree that compaction reorganizes.]
64. Log Structured Merge Tree -- Reads
▪ Data could be in memory or on disk
▪ Consult multiple files to find the latest instance of the key
▪ Use bloom filters to reduce IO
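A sketch of enabling blooms on the sst files through the current open-source table API (the exact option surface has moved over time); 10 bits per key is the commonly used value:

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Minimal sketch: a bloom filter lets a Get skip sst files that cannot
// contain the key, cutting IO on the read path.
rocksdb::Options BloomFilterOptions() {
  rocksdb::Options options;
  rocksdb::BlockBasedTableOptions table_options;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```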
78. RocksDB: Open & Pluggable

[Diagram: write requests from the application pass through a pluggable memtable format in RAM, a customizable WAL (the transaction log), pluggable compaction, and a pluggable sst data format on storage. Get or Scan requests from the application are served with the help of blooms.]
79. Example: Customizable WAL Logging
• Our in-house replication solution wants to embed arbitrary blobs in the RocksDB WAL stream for log annotation
• Use case: indicate where a log record came from in multi-master replication
• Solution: a Put that only speaks to the log
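The "Put that only speaks to the log" exists in the public API as WriteBatch::PutLogData: the blob is appended to the WAL stream but never applied to the memtable. A sketch, with an invented annotation format:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Minimal sketch: annotate the WAL without touching the key space.
void AnnotatedWrite(rocksdb::DB* db) {
  rocksdb::WriteBatch batch;
  batch.Put("key1", "value1");
  // Seen only by consumers of the log stream (e.g. a replication tailer);
  // "origin=master-2" is an invented annotation format.
  batch.PutLogData("origin=master-2");
  db->Write(rocksdb::WriteOptions(), &batch);
}
```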
82. Example: Pluggable SST format
• One Facebook use case needs an extremely fast response but can tolerate some loss of durability
• Quick hack: mount the sst files in tmpfs
• Still not performant: the existing sst format is block based
• Solution: a much simpler format that just stores sorted key/value pairs sequentially
  • no blocks, no caching, mmap the whole file
  • build an efficient lookup index on load
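This format later appeared in the open-source tree as PlainTable. A sketch assuming the current factory API; the 8-byte prefix is illustrative:

```cpp
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

// Minimal sketch: a non-block sst format for data living in tmpfs/ramfs.
rocksdb::Options PlainTableFormatOptions() {
  rocksdb::Options options;
  // No blocks, no block cache; the whole file is mmap'ed.
  options.table_factory.reset(rocksdb::NewPlainTableFactory());
  options.allow_mmap_reads = true;
  // The lookup index built on load is hash-based, so a prefix is needed.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  return options;
}
```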
83. Example: Blooms for the MemTable
• Same use case: after we optimized sst access, memtable lookup became a major cost per query
• Problem: a Get must walk memtable lookups that eventually return no data
• Solution: just add a bloom filter to the memtable!
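In the current open-source headers the memtable bloom is a single option (the original knob was a raw bit count); a sketch with illustrative values:

```cpp
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>

// Minimal sketch: a bloom over the memtable so a missing key is rejected
// without walking the memtable at all.
rocksdb::Options MemtableBloomOptions() {
  rocksdb::Options options;
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  // Spend up to 10% of the write buffer on the memtable bloom.
  options.memtable_prefix_bloom_size_ratio = 0.1;
  return options;
}
```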
88. Example: Pluggable memtable format
• Another Facebook use case has a distinct load phase during which no queries are issued
• Problem: write throughput is limited by the single writer thread
• Solution: a new memtable representation that does not keep keys sorted
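The unsorted memtable shipped in the open-source tree as the vector representation; a sketch assuming that factory:

```cpp
#include <rocksdb/memtablerep.h>
#include <rocksdb/options.h>

// Minimal sketch: an append-only, unsorted memtable for a pure load phase.
// Keys are sorted only when the memtable is flushed to an sst file, so
// point lookups during the load would be linear scans.
rocksdb::Options BulkLoadOptions() {
  rocksdb::Options options;
  options.memtable_factory.reset(new rocksdb::VectorRepFactory());
  // Non-skiplist memtables do not support concurrent memtable writes.
  options.allow_concurrent_memtable_write = false;
  return options;
}
```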
89. Example: Pluggable memtable format

[Diagram: the write path from slide 63, with the active MemTable replaced by an unsorted MemTable; on switch, the read-only MemTables are sorted as they are flushed into the sst files of the LSM tree.]
94. Possible workloads for RocksDB?
▪ Serving data to users via a website
▪ A spam detection backend that needs fast access to data
▪ A graph search query that needs to scan a dataset in realtime
▪ Distributed configuration management systems
▪ Fast serving of Hive data
▪ A queue that needs a high rate of inserts and deletes
100. Come Hack with us
• RocksDB is open sourced
• http://rocksdb.org
• Developers group: https://www.facebook.com/groups/rocksdb.dev/
• Help us HACK RocksDB