DISTRIBUTED STORAGE AND COMPUTE WITH
LIBRADOS
SAGE WEIL – VAULT – 2015.03.11
2
AGENDA
● motivation
● what is Ceph?
● what is librados?
● what can it do?
● other RADOS goodies
● a few use cases
MOTIVATION
4
MY FIRST WEB APP
● a bunch of data files
/srv/myapp/12312763.jpg
/srv/myapp/87436413.jpg
/srv/myapp/47464721.jpg
…
5
ACTUAL USERS
● scale up
– buy a bigger, more expensive file server
6
SOMEBODY TWEETED
● multiple web frontends
– NFS mount /srv/myapp ($$$)
7
NAS COSTS ARE NON-LINEAR
● scale out: hash files across servers
/srv/myapp/1/1237436.jpg
/srv/myapp/2/2736228.jpg
/srv/myapp/3/3472722.jpg
...
8
SERVERS FILL UP
● ...and directories get too big
● hash to shards that are smaller than servers
9
LOAD IS NOT BALANCED
● migrate smaller shards
– probably some rsync hackery
– maybe some trickery to maintain a consistent view of data
10
IT'S 2014 ALREADY
● don't reinvent the wheel
– ad hoc sharding
– load balancing
● reliability? replication?
11
DISTRIBUTED OBJECT STORES
● we want transparent
– scaling, sharding, rebalancing
– replication, migration, healing
● simple, flat(ish) namespace
magic!
CEPH
13
CEPH MOTIVATING PRINCIPLES
● everything must scale horizontally
● no single point of failure
● commodity hardware
● self-manage whenever possible
● move beyond legacy approaches
– client/cluster instead of client/server
– avoid ad hoc high-availability
● open source (LGPL)
14
ARCHITECTURAL FEATURES
● smart storage daemons
– centralized coordination of dumb devices does not scale
– peer to peer, emergent behavior
● flexible object placement
– “smart” hash-based placement (CRUSH)
– awareness of hardware infrastructure, failure domains
● no metadata server or proxy for finding objects
● strong consistency (CP instead of AP)
15
CEPH COMPONENTS
● RGW – web services gateway for object storage, compatible with S3 and Swift (accessed by an APP)
● RBD – reliable, fully-distributed block device with cloud platform integration (accessed by a HOST/VM)
● CEPHFS – distributed file system with POSIX semantics and scale-out metadata management (accessed by a CLIENT)
● LIBRADOS – client library allowing apps to access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
16
CEPH COMPONENTS
● ENLIGHTENED APP – talks to the cluster directly through LIBRADOS
● LIBRADOS – client library allowing apps to access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
LIBRADOS
18
LIBRADOS
● native library for accessing RADOS
– librados.so shared library
– C, C++, Python, Erlang, Haskell, PHP, Java (JNA)
● direct data path to storage nodes
– speaks native Ceph protocol with cluster
● exposes
– mutable objects
– rich per-object API and data model
● hides
– data distribution, migration, replication, failures
19
OBJECTS
● name
– alphanumeric
– no rename
● data
– opaque byte array
– bytes to 100s of MB
– byte-granularity access (just like a file)
● attributes
– small
– e.g., “version=12”
● key/value data
– random access insert, remove, list
– keys (bytes to 100s of bytes)
– values (bytes to megabytes)
– key-granularity access
20
POOLS
● name
● many objects
– bazillions
– independent namespace
● replication and placement policy
– 3 replicas separated across racks
– 8+2 erasure coded, separated across hosts
● sharding, (cache) tiering parameters
21
DATA PLACEMENT
● there is no metadata server, only OSDMap
– pools, their ids, and sharding parameters
– OSDs (storage daemons), their IPs, and up/down state
– CRUSH hierarchy and placement rules
– 10s to 100s of KB
[placement pipeline: object “foo” → hash → 0x2d872c31 → modulo pg_num → PG 2.c31 (pool “my_objects” has pool_id 2) → CRUSH(hierarchy, cluster state) → OSDs [56, 23, 131]]
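This lookup needs no directory service; a pseudocode sketch of what the client computes from the OSDMap alone (hash() and crush() are hypothetical stand-ins for Ceph's rjenkins hash and the CRUSH implementation):

```cpp
// pseudocode: the whole lookup is a pure function of the small OSDMap
uint32_t h  = hash("foo");                 // 0x2d872c31
uint32_t ps = h % pg_num;                  // placement group seed: 0xc31
// PG id = <pool_id>.<seed>  ->  "2.c31" for pool "my_objects" (pool_id 2)
std::vector<int> osds = crush(pool_id, ps, crush_hierarchy, cluster_state);
// -> [56, 23, 131]: the OSDs storing this object, no lookup service needed
```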
22
EXPLICIT DATA PLACEMENT
● you don't choose data location
● except relative to other objects
– normally we hash the object name
– you can also explicitly specify a different string
– and remember it on read, too
[diagram: object “foo” → hash → 0x2d872c31; object “bar” with key “foo” → hash → 0x2d872c31, so both land in the same place]
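A hedged sketch of the locator-key trick using the librados C++ API, assuming an open librados::IoCtx p (object names follow the diagram):

```cpp
// co-locate "bar" with "foo": hash an explicit key instead of the object name
p.locator_set_key("foo");    // subsequent ops hash "foo" for placement
librados::bufferlist bl;
bl.append("data that should live next to foo");
p.write_full("bar", bl);     // "bar" lands on the same PG as "foo"
// remember to set the same locator key when reading "bar" back, too
p.locator_set_key("");       // clear the locator for later operations
```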
23
HELLO, WORLD
● connect to the cluster
● p is like a file descriptor
● atomically write/replace object
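The slide's code did not survive extraction; a minimal C++ sketch of the same three steps (pool "my_objects" and object "greeting" are illustrative names, and error checking is omitted):

```cpp
#include <rados/librados.hpp>

int main() {
  librados::Rados cluster;
  cluster.init("admin");             // connect to the cluster as client.admin
  cluster.conf_read_file(nullptr);   // default ceph.conf search path
  cluster.connect();

  librados::IoCtx p;                 // p is like a file descriptor for a pool
  cluster.ioctx_create("my_objects", p);

  librados::bufferlist bl;
  bl.append("hello, world");
  p.write_full("greeting", bl);      // atomically write/replace the object

  p.close();
  cluster.shutdown();
  return 0;
}
```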
24
COMPOUND OBJECT OPERATIONS
● group operations on object into single request
– atomic: all operations commit or do not commit
– idempotent: request applied exactly once
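A hedged sketch of a compound write built with librados::ObjectWriteOperation, continuing with the IoCtx p and object "greeting" from the hello-world sketch:

```cpp
librados::ObjectWriteOperation op;
librados::bufferlist data, ver;
data.append("hello, world");
ver.append("12");
op.create(false);             // ensure the object exists (non-exclusive)
op.write_full(data);          // replace the byte data...
op.setxattr("version", ver);  // ...and set an attribute in the same request
int r = p.operate("greeting", &op);  // everything commits, or nothing does
```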
25
CONDITIONAL OPERATIONS
● mix read and write ops
● overall operation aborts if any step fails
● 'guard' read operations verify condition is true
– verify xattr has specific value
– assert object is a specific version
● allows atomic compare-and-swap
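A hedged compare-and-swap sketch: a cmpxattr guard makes the write conditional on the current "version" attribute (names continue from the previous sketch):

```cpp
librados::ObjectWriteOperation op;
librados::bufferlist expected, newdata;
expected.append("12");
newdata.append("updated contents");
op.cmpxattr("version", LIBRADOS_CMPXATTR_OP_EQ, expected);  // guard read op
op.write_full(newdata);              // applied only if the guard holds
int r = p.operate("greeting", &op);  // fails (-ECANCELED) if the xattr changed
```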
26
KEY/VALUE DATA
● each object can contain key/value data
– independent of byte data or attributes
– random access insertion, deletion, range query/list
● good for structured data
– avoid read/modify/write cycles
– RGW bucket index
● enumerate objects and their sizes to support listing
– CephFS directories
● efficient file creation, deletion, inode updates
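A hedged sketch of a key/value (omap) update in the style of a bucket index; the object name "bucket.index" and the entries are illustrative:

```cpp
std::map<std::string, librados::bufferlist> entries;
entries["12312763.jpg"].append("size=1048576");   // value is an opaque blob
entries["87436413.jpg"].append("size=524288");

librados::ObjectWriteOperation op;
op.omap_set(entries);        // random-access insert, independent of byte data
int r = p.operate("bucket.index", &op);
```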
27
SNAPSHOTS
● object granularity
– RBD has per-image snapshots
– CephFS can snapshot any subdirectory
● librados user must cooperate
– provide “snap context” at write time
– allows for point-in-time consistency without flushing caches
● triggers copy-on-write inside RADOS
– consume space only when snapshotted data is overwritten
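A hedged sketch of the self-managed snapshot flow, where the client allocates a snap id and supplies the snap context on each write:

```cpp
uint64_t snapid;
p.selfmanaged_snap_create(&snapid);   // allocate a new snapshot id

std::vector<uint64_t> snaps = { snapid };          // snap context, newest first
p.selfmanaged_snap_set_write_ctx(snapid, snaps);   // attach it to future writes

librados::bufferlist bl;
bl.append("new contents");
p.write_full("greeting", bl);  // RADOS copy-on-writes the pre-snapshot data
```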
28
RADOS CLASSES
● write new RADOS “methods”
– code runs directly inside storage server I/O path
– simple plugin API; admin deploys a .so
● read-side methods
– process data, return result
● write-side methods
– process, write; read, modify, write
– generate an update transaction that is applied atomically
29
A SIMPLE CLASS METHOD
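The slide's code is not in the transcript; a sketch modeled on Ceph's cls_hello example, a read-side method compiled into a .so that the OSD loads into its I/O path:

```cpp
#include <errno.h>
#include "objclass/objclass.h"

CLS_VER(1, 0)
CLS_NAME(hello)

cls_handle_t h_class;
cls_method_handle_t h_say_hello;

// read-side method: process the request payload, return a result
static int say_hello(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  if (in->length() > 100)
    return -EINVAL;            // reject oversized input
  out->append("Hello, ");
  out->append(*in);
  return 0;
}

extern "C" void __cls_init()   // called when the OSD loads the plugin
{
  cls_register("hello", &h_class);
  cls_register_cxx_method(h_class, "say_hello", CLS_METHOD_RD,
                          say_hello, &h_say_hello);
}
```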
30
INVOKING A METHOD
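Again the slide's code is missing; a hedged sketch of calling the method above from a librados client (IoCtx p and object "greeting" as before):

```cpp
librados::bufferlist in, out;
in.append("librados");
int r = p.exec("greeting", "hello", "say_hello", in, out);  // runs on the OSD
// on success r == 0 and out holds "Hello, librados"
```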
31
EXAMPLE: RBD
● RBD (RADOS block device)
● image data striped across 4MB data objects
● image header object
– image size, snapshot info, lock state
● image operations may be initiated by any client
– image attached to KVM virtual machine
– 'rbd' CLI may trigger snapshot or resize
● need to communicate between librados clients!
32
WATCH/NOTIFY
● establish stateful 'watch' on an object
– client interest persistently registered with object
– client keeps connection to OSD open
● send 'notify' messages to all watchers
– notify message (and payload) sent to all watchers
– notification (and reply payloads) on completion
● strictly time-bounded liveness check on watch
– no notifier falsely believes we got a message
● example: distributed cache w/ cache invalidations
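A hedged sketch of the hammer-era watch2/notify2 C++ API for exactly this cache-invalidation pattern; the header object name is illustrative:

```cpp
struct Invalidator : public librados::WatchCtx2 {
  librados::IoCtx &io;
  explicit Invalidator(librados::IoCtx &i) : io(i) {}
  void handle_notify(uint64_t notify_id, uint64_t cookie,
                     uint64_t notifier_id, librados::bufferlist &bl) override {
    // drop the cache entry named in bl here, then acknowledge
    librados::bufferlist ack;
    io.notify_ack("rbd_header.myimage", notify_id, cookie, ack);
  }
  void handle_error(uint64_t cookie, int err) override {}
};

Invalidator ctx(p);
uint64_t handle;
p.watch2("rbd_header.myimage", &handle, &ctx);  // register persistent interest

// elsewhere, a writer notifies every watcher and blocks for their acks:
librados::bufferlist msg, replies;
msg.append("please invalidate cache entry foo");
p.notify2("rbd_header.myimage", msg, 5000 /* ms timeout */, &replies);
```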
33
WATCH/NOTIFY
[sequence diagram: three clients each hold a watch on the object; one sends notify “please invalidate cache entry foo”; once the notify is persisted, it fans out to every watcher; each watcher invalidates its cache entry and replies notify-ack; when all acks arrive the notifier receives complete]
A FEW USE CASES
35
SIMPLE APPLICATIONS
● cls_lock – cooperative locking
● cls_refcount – simple object refcounting
● images
– rotate, resize, filter images
● log or time series data
– filter data, return only matching records
● structured metadata (e.g., for RBD and RGW)
– stable interface for metadata objects
– safe and atomic update operations
36
DYNAMIC OBJECTS IN LUA
● Noah Watkins (UCSC)
– http://ceph.com/rados/dynamic-object-interfaces-with-lua/
● write rados class methods in Lua
– code is sent to the OSD from the client
– provides a Lua view of the RADOS class runtime
● Lua client wrapper for librados
– makes it easy to send code to exec on the OSD
37
VAULTAIRE
● Andrew Cowie (Anchor Systems)
● a data vault for metrics
– https://github.com/anchor/vaultaire
– http://linux.conf.au/schedule/30074/view_talk
– http://mirror.linux.org.au/pub/linux.conf.au/2015/OGGB3/Thursday/
● preserve all data points (no MRTG-style consolidation)
● append-only RADOS objects
● dedup repeat writes on read
● stateless daemons for inject, analytics, etc.
38
ZLOG – CORFU ON RADOS
● Noah Watkins (UCSC)
– http://noahdesu.github.io/2014/10/26/corfu-on-ceph.html
● high performance distributed shared log
– use RADOS for storing log shards instead of CORFU's special-purpose storage backend for flash
– let RADOS handle replication and durability
● cls_zlog
– maintain log structure in object
– enforce epoch invariants
39
OTHERS
● radosfs
– simple POSIX-like metadata-server-less file system
– https://github.com/cern-eos/radosfs
● glados
– gluster translator on RADOS
● several dropbox-like file sharing services
● iRODS
– simple backend for an archival storage system
● Synnefo
– open source cloud stack used by GRNET
– Pithos block device layer implements virtual disks on top of librados (similar to RBD)
OTHER RADOS GOODIES
41
ERASURE CODING
[diagram: an object written to a replicated pool is stored as three full copies; written to an erasure coded pool it is split into chunks 1–4 plus coding chunks X and Y]
● replicated pool: full copies of stored objects
– very high durability
– 3x (200% overhead)
– quicker recovery
● erasure coded pool: one copy plus parity
– cost-effective durability
– 1.5x (50% overhead)
– expensive recovery
42
ERASURE CODING
● subset of operations supported for EC
– attributes and byte data
– append-only on stripe boundaries
– snapshots
– compound operations
● but not
– key/value data
– rados classes (yet)
– object overwrites
– non-stripe aligned appends
43
TIERED STORAGE
[diagram: an application using librados writes to a cache pool (replicated, SSDs) layered in front of a backing pool (erasure coded, HDDs), both inside one Ceph storage cluster]
44
WHAT (LIB)RADOS DOESN'T DO
● stripe large objects for you
– see libradosstriper
● rename objects
– (although we do have a “copy” operation)
● multi-object transactions
– roll your own two-phase commit or intent log
● secondary object index
– can find objects by name only
– can't query RADOS to find objects with some attribute
● list objects by prefix
– can only enumerate in hash(object name) order
– with confusing results from cache tiers
45
PERSPECTIVE
● Swift
– AP, last writer wins
– large objects
– simpler data model (whole object GET/PUT)
● GlusterFS
– CP (usually)
– file-based data model
● Riak
– AP, flexible conflict resolution
– simple key/value data model (small objects)
– secondary indexes
● Cassandra
– AP
– table-based data model
– secondary indexes
46
CONCLUSIONS
● file systems are a poor match for scale-out apps
– usually require ad hoc sharding
– directory hierarchies and rename are unnecessary
– opaque byte streams require ad hoc locking
● librados
– transparent scaling, replication or erasure coding
– richer object data model (bytes, attrs, key/value)
– rich API (compound operations, snapshots, watch/notify)
– extensible via rados class plugins
THANK YOU!
Sage Weil
CEPH PRINCIPAL
ARCHITECT
sage@redhat.com
@liewegas