Apache BookKeeper Distributed Store: a Salesforce Use Case
1. Apache BookKeeper
DISTRIBUTED STORE
a Salesforce Use Case
Venkateswararao Jujjuri (JV)
Cloud Storage Architect
vjujjuri@salesforce.com
jujjuri@gmail.com
@jvjujjuri | Twitter
https://www.linkedin.com/in/jvjujjuri
2. Agenda
Salesforce needs and requirements
Hunt and Selection
BookKeeper Introduction
Improvements and Enhancements
As a Service at Scale @ Salesforce
Performance
Community
Q & A
3. Salesforce Application Storage Needs
Store for Persistent WAL, data, and objects
Low, constant write latencies
• Transaction Log, Smaller writes
Low, constant Random Read latencies
Highly available for immutable data
• Append Only entries
• Objects
Highly Consistent for immutable data
Long Term Storage
Distributed and linearly scalable.
On commodity hardware
Low Operating Cost
4. What Did We Consider?
Build vs. Buy
• Time-To-Market, resources, cost etc.
Finalists
• Ceph
• A CP system
• With unreliable reads, the read path can behave like an AP system.
• Getting AP behavior on the write path takes a lot of effort.
• Remember: immutable data.
• BookKeeper
• Effectively a CAP system, thanks to the immutable/append-only store.
• Came close to what we wanted.
• Almost there, but not everything.
5. Apache Bookkeeper
A highly consistent, available, replicated distributed log service.
Immutable, append-only store.
Thick Client, Simple and Elegant placement policy
• No Central Master
• No complicated hashing/computing for placement
Low latency, both on writes and reads.
Runs on commodity hardware.
Built for WAL use case, but can be expanded to other storage needs
Uses ZooKeeper as a consensus resolver and metadata store.
Awesome Community.
7. Apache BookKeeper
A system to reliably log streams of records.
Designed to store write-ahead logs for database-like applications.
Inspired by, and designed to solve, the availability deficiencies of the HDFS NameNode.
Opensource Chronology
• 2008: Open-sourced as a contribution to ZooKeeper
• 2011: Sub-project of ZooKeeper
• 2012: In production
8. Terminology
Journal: Write-ahead log
Ledger: Log stream
Entry: Each entry of a log record
Client: Library that lives with the application.
Bookie: Server
Ensemble: Set of Bookies across which a ledger is striped.
Cluster: All bookies belonging to a given instance of BookKeeper
Write Quorum Size: Number of replicas.
Ack Quorum Size: Number of responses needed before client’s write is satisfied.
LAC: Last Add Confirmed.
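The quorum terms above compose: each entry is striped across a write-quorum-sized subset of the ensemble, rotating with the entry id (BookKeeper's round-robin distribution schedule). A minimal Python sketch of that mapping, with illustrative names rather than the real client API:

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that store a given entry under round-robin striping."""
    size = len(ensemble)
    return [ensemble[(entry_id + i) % size] for i in range(write_quorum)]

ensemble = ["bookie-1", "bookie-2", "bookie-3"]

# Ensemble of 3, write quorum of 2: each entry lands on two bookies,
# and the pair rotates as the entry id advances.
print(write_set(0, ensemble, 2))  # ['bookie-1', 'bookie-2']
print(write_set(1, ensemble, 2))  # ['bookie-2', 'bookie-3']
print(write_set(2, ensemble, 2))  # ['bookie-3', 'bookie-1']
```

The ack quorum then says how many of those write-set bookies must respond before the client's add is acknowledged.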
9. Guarantees
• If an entry has been acknowledged, it must be readable.
• If an entry is read once, it must always be readable.
• If write of entry ‘n’ is successful, all entries until ‘n’ are successfully committed.
Major Components
• Thick Client; Carries heavy weight in the protocol.
• Thin Server, Bookie. Bookies never initiate any interaction with ZooKeeper or fellow Bookies.
• Zookeeper monitors Bookies.
• Metadata is stored on Zookeeper.
• Auditor to monitor bookies and identify under replicated ledgers.
• Replication workers to replicate under replicated ledger copies.
Highlights
10. Create Ledger
• Gets Writer Ledger Handle
Add an entry to the Ledger
• Write To the Ledger
Open Ledger
• Gives a read-only ledger handle.
• May ask for a non-recovery read handle.
Get an entry from the ledger
• Read from the ledger
Close the ledger.
Basic Operations
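The lifecycle above can be modeled with a toy in-memory ledger; this illustrates the semantics (append-only while open, immutable once closed), not the real BookKeeper client API:

```python
class ToyLedger:
    """In-memory model of a ledger: append-only while open, sealed on close."""
    def __init__(self, ledger_id):
        self.ledger_id = ledger_id
        self.entries = []
        self.closed = False

    def add_entry(self, data):
        """Writer path: append and return the new entry id."""
        if self.closed:
            raise IOError("ledger is sealed; no further writes allowed")
        self.entries.append(data)
        return len(self.entries) - 1

    def read_entry(self, entry_id):
        """Reader path: entries are addressed by id."""
        return self.entries[entry_id]

    def close(self):
        """Seal the ledger; from here on it is read-only."""
        self.closed = True

lh = ToyLedger(42)                       # create ledger -> writer handle
assert lh.add_entry(b"record-0") == 0    # add entries
assert lh.add_entry(b"record-1") == 1
lh.close()                               # close the ledger
assert lh.read_entry(1) == b"record-1"   # reads still work after close
```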
11. Out-of-order write and In-Order Ack.
• The application is free to pre-allocate entry IDs
• Multiple application threads can write in parallel.
User defined Ledger Names
• Not restricted to BK-generated ledger IDs
Explicit LAC updates
• Added ReadLac, WriteLac to the protocol.
• Maintain both piggy-back LAC and explicit LAC simultaneously.
Enhancements: in our internal branch, working to push upstream
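Out-of-order writes with in-order acknowledgment boil down to how the Last Add Confirmed pointer advances: an entry is only confirmed once every lower-numbered entry is durable. A sketch of that rule (illustrative, not the client implementation):

```python
def advance_lac(lac, durable):
    """LAC moves to the highest id n such that every entry <= n is durable."""
    while lac + 1 in durable:
        lac += 1
    return lac

durable = set()
lac = -1  # nothing confirmed yet
history = []
for entry_id in [2, 0, 3, 1]:  # writes complete out of order
    durable.add(entry_id)
    lac = advance_lac(lac, durable)
    history.append((entry_id, lac))

# Entry 2 completing first confirms nothing; once entry 1 lands,
# the LAC jumps straight from 0 to 3.
print(history)  # [(2, -1), (0, 0), (3, 0), (1, 3)]
```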
12. Conventional Name Space.
• User defined Names
• Treat LedgerId as an i-node.
Disk scrubbers and Repairs
• Actively hunt for and repair bit rot and corruption
Scalable Metadata Store
• Separate and dedicated metadata store
• Not restricted by ZK limitations
Enhancements: Future
13. Salesforce Application with BookKeeper
[Diagram: the application links a store interface built on the BookKeeper client library, which talks to the Bookies and the ZooKeeper servers.]
14. Guarantees
• If an entry has been acknowledged, it must be readable.
• If an entry is read once, it must always be readable.
• If write of entry ‘n’ is successful, all entries until ‘n’ are successfully committed.
Consistencies
• Last Add Confirmed provides consistency among readers.
• Fencing provides consistency among writers.
Consistencies
15. Out of order write and in order Ack
[Diagram: writer apps A, B, and C appending entries 0–8 to the ledger out of order.]
16. Last Add Confirmed
[Diagram: writer apps A, B, and C appending entries 0–8 while reader app D follows the ledger; the reader can see entries only up to the Last Add Confirmed (LAC) point.]
18. What Can Happen?
Client side
• Client restarts
• Client loses connection with ZooKeeper
• Client loses connection with bookies.
Bookie Side
• Bookie Goes down
• Disk(s) on bookie go bad, IO issues
• Bookie gets disconnected from network.
Zookeeper
• Gets disconnected from the rest of the cluster
19. Writing Client Crash
[Diagram: three bookies and ZooKeeper; recovery asks, "What is the last entry?"]
• Nothing happens until a reader attempts to read.
• The recovery process is initiated when a process opens the ledger for reading.
• Close the ledger on ZooKeeper.
• Identify the last entry of the ledger.
• Update metadata on ZooKeeper with the Last Add Confirmed (LAC).
20. Client Gets Disconnected from Bookies
Either the bookie is down, or the network between client and bookie has issues.
Contact ZooKeeper to get the list of available bookies.
Update the ensemble set and register it with BookKeeper.
Continue with the new set.
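The ensemble change in the steps above can be sketched as picking any available bookie that is not already in the ensemble and splicing it in place of the failed one (bookie names below are made up):

```python
def replace_bookie(ensemble, failed, available):
    """Return a new ensemble with the failed bookie swapped out."""
    for candidate in available:
        if candidate not in ensemble:
            new_ensemble = list(ensemble)
            new_ensemble[new_ensemble.index(failed)] = candidate
            return new_ensemble
    raise RuntimeError("no spare bookie available for ensemble change")

ensemble = ["bookie-1", "bookie-2", "bookie-3"]
available = ["bookie-1", "bookie-3", "bookie-4", "bookie-5"]  # from ZooKeeper

# bookie-2 is unreachable; the client splices in bookie-4 and continues.
print(replace_bookie(ensemble, "bookie-2", available))
# ['bookie-1', 'bookie-4', 'bookie-3']
```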
21. Client Gets Disconnected from ZooKeeper
Tries to reestablish the connection.
Can continue to read from and write to the ledger.
Until the connection is back, no metadata operations can be performed:
• Cannot create a ledger.
• Cannot seal a ledger.
• Cannot open a ledger.
22. Reader Opens While the Writer Is Active
Must be avoided by the application.
BK guarantees correctness regardless.
The reader initiates the recovery process:
• Fences the ledger on ZooKeeper.
• Informs all bookies in the ensemble that recovery has started.
• After these steps the writer will get write errors (if actively writing).
• The reader contacts all bookies to learn the last entry.
• Replicates the last entry if it doesn't have enough replicas.
• Updates ZooKeeper with the LAC and closes the ledger.
23. Recovery begins when the ledger is opened by the reader in recovery mode
• Check whether the ledger needs recovery (i.e., it is not closed).
• Fence the ledger first, then initiate recovery:
• Step 1: Flag that the ledger is in recovery by updating its ZooKeeper state.
• Step 2: Fence the bookies.
• Step 3: Recover the ledger.
Fencing and Recovery
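Fencing and the "learn the last entry" step can be sketched with a toy per-bookie ledger state: once fenced, a bookie rejects further adds, and the reader takes the highest entry id reported across the ensemble (a simplification; the real protocol reasons quorum by quorum):

```python
class BookieLedger:
    """A single bookie's view of one ledger (toy model)."""
    def __init__(self):
        self.entries = {}
        self.fenced = False

    def add(self, entry_id, data):
        if self.fenced:
            raise IOError("fenced: writer must stop")
        self.entries[entry_id] = data

    def fence(self):
        """Reject future writes and report the highest entry seen."""
        self.fenced = True
        return max(self.entries, default=-1)

ensemble = [BookieLedger() for _ in range(3)]
for b in ensemble:
    b.add(0, b"e0")
ensemble[0].add(1, b"e1")  # entry 1 reached only one bookie before the crash

# Reader-driven recovery: fence every bookie, then learn the last entry.
last_entry = max(b.fence() for b in ensemble)
print(last_entry)  # 1 -> entry 1 must be re-replicated before closing

# The old writer now gets write errors.
try:
    ensemble[1].add(2, b"e2")
except IOError as err:
    print("writer rejected:", err)
```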
26. Auditor
• Starts on every Bookie machine; a leader is elected through ZooKeeper.
• One active auditor per cluster.
• Watches for Bookie failures and manages the under-replicated ledgers list.
Replication Workers
• Responsible for re-replicating data to maintain quorum copies.
• Can run on any machine in the cluster; usually runs on each Bookie machine.
• Work from the under-replicated ledgers list published by the Auditor.
• Pick one ledger at a time, take a lock on ZooKeeper, and replicate it to the local bookie.
• If the local bookie is already part of the ensemble, drop the lock and move to the next one in the list.
Auto Recovery Components
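The auditor's core check can be sketched as: a ledger is under-replicated when any bookie in its ensemble is no longer alive (ledger ids and bookie names below are made up):

```python
def find_under_replicated(ledger_ensembles, live_bookies):
    """Ledgers with at least one replica on a bookie that is gone."""
    live = set(live_bookies)
    return sorted(lid for lid, ensemble in ledger_ensembles.items()
                  if any(b not in live for b in ensemble))

ledgers = {
    1: ["bookie-1", "bookie-2", "bookie-3"],
    2: ["bookie-2", "bookie-3", "bookie-4"],
    3: ["bookie-3", "bookie-4", "bookie-5"],
}

# bookie-2 died: every ledger with a replica there goes on the
# under-replicated list for the replication workers.
print(find_under_replicated(ledgers, ["bookie-1", "bookie-3", "bookie-4", "bookie-5"]))
# [1, 2]
```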
36. • Journal
• A journal file contains the BookKeeper transaction logs.
• One journal per bookie at a time.
• A new journal file is created once the old one reaches the max file size.
• Entry Log
• Entries from different ledgers are aggregated and written sequentially.
• Offsets are kept as pointers in the LedgerCache for fast lookup.
• One entry log per bookie at a time.
• A new entry log file is created once the old one reaches max size.
• Old entry log files are removed by the garbage collector thread once they are no longer associated with any active ledger.
• Index Files
• One per ledger.
• Holds the offsets of the entries in that ledger.
Data Management in Bookies
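The entry log plus index layout can be sketched as one append-only byte log shared by all ledgers, with a per-(ledger, entry) offset map standing in for the ledger cache and index files (a toy model, not the on-disk format):

```python
class EntryLog:
    """Entries from many ledgers interleaved in one sequential log."""
    def __init__(self):
        self.log = bytearray()
        self.index = {}  # (ledger_id, entry_id) -> (offset, length)

    def append(self, ledger_id, entry_id, data):
        """Sequential write; remember where the entry landed."""
        self.index[(ledger_id, entry_id)] = (len(self.log), len(data))
        self.log.extend(data)

    def read(self, ledger_id, entry_id):
        """Random read: one index lookup, one slice of the log."""
        offset, length = self.index[(ledger_id, entry_id)]
        return bytes(self.log[offset:offset + length])

elog = EntryLog()
elog.append(ledger_id=7, entry_id=0, data=b"seven-0")
elog.append(ledger_id=9, entry_id=0, data=b"nine-0")   # different ledger, same log
elog.append(ledger_id=7, entry_id=1, data=b"seven-1")

print(elog.read(7, 1))  # b'seven-1'
print(elog.read(9, 0))  # b'nine-0'
```

Writes stay sequential (good for the spindle), while the index keeps random reads to one lookup plus one seek.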