Storage tiering and erasure coding in Ceph (SCaLE13x)


Ceph is designed around the assumption that all components of the system (disks, hosts, networks) can fail, and has traditionally leveraged replication to provide data durability and reliability. The CRUSH placement algorithm is used to allow failure domains to be defined across hosts, racks, rows, or datacenters, depending on the deployment scale and requirements.

Recent releases have added support for erasure coding, which can provide much higher data durability and lower storage overheads. However, in practice erasure codes have different performance characteristics than traditional replication and, under some workloads, come at some expense. At the same time, we have introduced a storage tiering infrastructure and cache pools that allow alternate hardware backends (like high-end flash) to be leveraged for active data sets while cold data are transparently migrated to slower backends. The combination of these two features enables a surprisingly broad range of new applications and deployment configurations.

This talk will cover a few Ceph fundamentals, discuss the new tiering and erasure coding features, and then discuss a variety of ways that the new capabilities can be leveraged.


ERASURE CODING AND CACHE TIERING
Sage Weil – SCaLE13x – 2015.02.22

AGENDA
● Ceph architectural overview
● RADOS background
● Cache tiering
● Erasure coding
● Project status, roadmap

ARCHITECTURE

CEPH MOTIVATING PRINCIPLES
● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● Move beyond legacy approaches
  – Client/cluster instead of client/server
  – Ad hoc HA

CEPH COMPONENTS
● RGW – a web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD – a reliable, fully-distributed block device with cloud platform integration
● CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management

ROBUST SERVICES BUILT ON RADOS

ARCHITECTURAL COMPONENTS
(same component overview as above)

THE RADOS GATEWAY
[diagram: applications speak REST to RADOSGW instances, which use LIBRADOS over a socket to reach the RADOS cluster of OSDs and monitors]

MULTI-SITE OBJECT STORAGE
[diagram: web applications reach app servers and Ceph Object Gateways (RGW) in front of separate Ceph storage clusters in US-EAST and EU-WEST]

RADOSGW MAKES RADOS WEBBY
RADOSGW:
● REST-based object storage proxy
● Uses RADOS to store objects
  – Stripes large RESTful objects across many RADOS objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
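
Because RGW presents the S3 (and Swift) API, standard S3 clients work against it unchanged. A minimal sketch using boto3; the endpoint, port, bucket name, and credentials are illustrative placeholders, not values from the talk:

    # Sketch: talking to RADOSGW through its S3-compatible API with boto3.
    # Endpoint, bucket, and credentials below are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",   # RGW endpoint instead of AWS
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    s3.create_bucket(Bucket="demo-bucket")
    s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"stored in RADOS")
    print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())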

ARCHITECTURAL COMPONENTS
(same component overview as above)

STORING VIRTUAL DISKS
[diagram: a VM on a hypervisor uses LIBRBD to store its virtual disk in the RADOS cluster]

KERNEL MODULE
[diagram: a Linux host uses the KRBD kernel module to access RBD images in the RADOS cluster directly]

RBD FEATURES
● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones
● Broad integration
  – Qemu
  – Linux kernel
  – iSCSI (STGT, LIO)
  – OpenStack, CloudStack, Nebula, Ganeti, Proxmox
● Incremental backup (relative to snapshots)
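
These features are also exposed through librbd's language bindings. A minimal sketch with the Python rados/rbd modules; pool name, image name, and size are illustrative, and a reachable cluster with a standard ceph.conf is assumed:

    # Sketch: create, write, and snapshot an RBD image via python-rbd.
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")                  # image is striped over this pool

    rbd.RBD().create(ioctx, "vm-disk", 10 * 1024**3)   # 10 GiB image
    with rbd.Image(ioctx, "vm-disk") as image:
        image.write(b"bootloader...", 0)               # data lands in RADOS objects
        image.create_snap("base")                      # read-only snapshot
        # copy-on-write clones are created from (protected) snapshots

    ioctx.close()
    cluster.shutdown()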

ARCHITECTURAL COMPONENTS
(same component overview as above)

SEPARATE METADATA SERVER
[diagram: a Linux host's kernel module sends file data directly to the RADOS cluster while metadata goes through the metadata server]

SCALABLE METADATA SERVERS
METADATA SERVER
● Manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Snapshots on any directory
● Clients stripe file data in RADOS
  – MDS not in data path
● MDS stores metadata in RADOS
● Dynamic MDS cluster scales to 10s or 100s
● Only required for shared filesystem

RADOS

ARCHITECTURAL COMPONENTS
(same component overview as above)

RADOS
● Flat object namespace within each pool
● Rich object API (librados)
  – Bytes, attributes, key/value data
  – Partial overwrite of existing data (mutable objects)
  – Single-object compound operations
  – RADOS classes (stored procedures)
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
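
A minimal librados sketch from Python showing a few of these API features; pool and object names are placeholders, and omap key/value data and compound operations (available through write-op contexts) are omitted for brevity:

    # Sketch: bytes, partial overwrite, and attributes via the librados Python binding.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("mypool")

    ioctx.write_full("greeting", b"hello world")     # whole-object write
    ioctx.write("greeting", b"HELLO", offset=0)      # partial overwrite in place
    ioctx.set_xattr("greeting", "owner", b"sage")    # per-object attribute
    print(ioctx.read("greeting"))                    # b"HELLO world"

    ioctx.close()
    cluster.shutdown()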

RADOS CLUSTER
[diagram: an application talks directly to a RADOS cluster of OSDs and monitors (M)]

RADOS COMPONENTS
OSDs:
● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number (e.g., 5)
● Not part of data path

OBJECT STORAGE DAEMONS
[diagram: each OSD runs on top of a local filesystem (xfs, btrfs, or ext4) on its disk; monitors sit alongside the OSDs]

DATA PLACEMENT

WHERE DO OBJECTS LIVE?
[diagram: an application holds an object and asks where in the cluster it should be stored]

A METADATA SERVER?
[diagram: option 1, a lookup service: (1) the application asks a central server where the object lives, (2) then goes to the indicated location]

CALCULATED PLACEMENT
[diagram: option 2, a function: the application computes placement itself, e.g. hashing object names into ranges A-G, H-N, O-T, U-Z]

CRUSH
[diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto a set of OSDs in the cluster]

CRUSH IS A QUICK CALCULATION
[diagram: the client computes the object-to-PG-to-OSD mapping itself; no lookup against the RADOS cluster is needed]

CRUSH AVOIDS FAILED DEVICES
[diagram: when an OSD fails, CRUSH maps the affected PGs onto other, healthy OSDs]

CRUSH: DECLUSTERED PLACEMENT
● Each PG independently maps to a pseudorandom set of OSDs
● PGs that map to the same OSD generally have replicas that do not
● When an OSD fails, each PG it stored will generally be re-replicated by a different OSD
  – Highly parallel recovery
  – Avoid single-disk recovery bottleneck

CRUSH: DYNAMIC DATA PLACEMENT
CRUSH:
● Pseudo-random placement algorithm
  – Fast calculation, no lookup
  – Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
  – Limited data migration on change
● Rule-based configuration
  – Infrastructure topology aware
  – Adjustable replication
  – Weighted devices (different sizes)
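
To make the calculated-placement idea concrete, here is a toy sketch (not the real CRUSH algorithm, which hashes against a weighted device hierarchy described by the CRUSH map and its rules): the object name hashes to a PG, and the PG id seeds a deterministic pseudo-random choice of OSDs, so every client computes the same answer without any lookup.

    # Toy illustration of hash-based placement; CRUSH itself walks a weighted
    # hierarchy and honors failure-domain rules, but the shape is the same.
    import hashlib
    import random

    PG_NUM = 64
    OSDS = ["osd.%d" % i for i in range(12)]

    def object_to_pg(name):
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return h % PG_NUM                    # stable: same name -> same PG

    def pg_to_osds(pg, replicas=3):
        rng = random.Random(pg)              # deterministic per PG
        return rng.sample(OSDS, replicas)    # pseudo-random but repeatable

    pg = object_to_pg("foo")
    print(pg, pg_to_osds(pg))                # every client computes the same mapping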

DATA IS ORGANIZED INTO POOLS
[diagram: objects from different applications hash into the PGs of separate pools (POOL A, B, C, D), and each pool's PGs map onto the cluster]

TIERED STORAGE

TWO WAYS TO CACHE
● Within each OSD
  – Combine SSD and HDD under each OSD
  – Make localized promote/demote decisions
  – Leverage existing tools
    ● dm-cache, bcache, FlashCache
    ● Variety of caching controllers
  – We can help with hints
● Cache on separate devices/nodes
  – Different hardware for different tiers
    ● Slow nodes for cold data
    ● High performance nodes for hot data
  – Add, remove, scale each tier independently
    ● Unlikely to choose right ratios at procurement time

TIERED STORAGE
[diagram: the application talks to a replicated cache pool layered over an erasure coded backing pool, both within the Ceph storage cluster]

RADOS TIERING PRINCIPLES
● Each tier is a RADOS pool
  – May be replicated or erasure coded
● Tiers are durable
  – e.g., replicate across SSDs in multiple hosts
● Each tier has its own CRUSH policy
  – e.g., map cache pool to SSD devices/hosts only
● librados adapts to tiering topology
  – Transparently directs requests accordingly (e.g., to the cache)
  – No changes to RBD, RGW, CephFS, etc.
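
In practice this is configured by layering one pool over another. A sketch of the kind of commands involved, shelling out to the ceph CLI from Python; the pool names and the hit-set setting are illustrative, and the full set of cache-tier parameters is in the Ceph documentation:

    # Sketch: a replicated SSD pool as a writeback cache over an EC base pool.
    # Pool names are placeholders.
    import subprocess

    def ceph(*args):
        subprocess.run(["ceph", *args], check=True)

    ceph("osd", "tier", "add", "cold-ec", "hot-ssd")             # attach cache tier
    ceph("osd", "tier", "cache-mode", "hot-ssd", "writeback")    # absorb reads and writes
    ceph("osd", "tier", "set-overlay", "cold-ec", "hot-ssd")     # redirect client I/O
    ceph("osd", "pool", "set", "hot-ssd", "hit_set_type", "bloom")  # track temperature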

READ (CACHE HIT)
[diagram: the Ceph client's READ is answered directly by the writeback cache pool (SSD); the backing pool (HDD) is not touched]

READ (CACHE MISS)
[diagram: on a miss, the cache pool issues a PROXY READ to the backing pool and returns the READ REPLY to the client]

READ (CACHE MISS)
[diagram: alternatively, the object is PROMOTEd from the backing pool into the cache pool and the READ REPLY is served from there]

WRITE (HIT)
[diagram: the client's WRITE is absorbed and ACKed by the writeback cache pool]

WRITE (MISS)
[diagram: on a write miss, the object is PROMOTEd into the cache pool, then the WRITE is applied and ACKed]

WRITE (MISS) (COMING SOON)
[diagram: a write miss is instead PROXY WRITEd through to the backing pool and ACKed, without promoting the object]

ESTIMATING TEMPERATURE
● Each PG constructs in-memory bloom filters
  – Insert records on both read and write
  – Each filter covers a configurable period (e.g., 1 hour)
  – Tunable false positive probability (e.g., 5%)
  – Store most recent N periods on disk (e.g., last 24 hours)
● Estimate temperature
  – Has the object been accessed in any of the last N periods?
  – ...in how many of them?
  – Informs the flush/evict decision
● Estimate “recency”
  – How many periods has it been since the object was last accessed?
  – Informs read miss behavior: proxy vs promote
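
A toy sketch of the hit-set idea: one small bloom filter per time period, with temperature estimated as the number of recent periods in which an object (probably) appears. The sizes and period count are illustrative, and this is not Ceph's implementation:

    # Toy hit sets: a rotating list of bloom filters, one per period.
    import hashlib

    class BloomFilter:
        def __init__(self, bits=8192, hashes=4):
            self.bits, self.hashes, self.field = bits, hashes, 0

        def _positions(self, name):
            for i in range(self.hashes):
                h = hashlib.sha1(("%d:%s" % (i, name)).encode()).hexdigest()
                yield int(h, 16) % self.bits

        def insert(self, name):
            for p in self._positions(name):
                self.field |= 1 << p

        def maybe_contains(self, name):       # false positives possible, no false negatives
            return all(self.field >> p & 1 for p in self._positions(name))

    hit_sets = [BloomFilter() for _ in range(24)]   # e.g. the last 24 one-hour periods

    def record_access(name):
        hit_sets[0].insert(name)                    # current period; rotate the list hourly

    def temperature(name):
        return sum(hs.maybe_contains(name) for hs in hit_sets)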

FLUSH AND/OR EVICT COLD DATA
[diagram: cold objects are FLUSHed from the cache pool to the backing pool (ACKed), then EVICTed from the cache]

TIERING AGENT
● Each PG has an internal tiering agent
  – Manages the PG based on administrator-defined policy
● Flush dirty objects
  – When the pool reaches the target dirty ratio
  – Tries to select cold objects
  – Marks objects clean when they have been written back to the base pool
● Evict (delete) clean objects
  – Greater “effort” as the cache pool approaches its target size
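
In pseudocode terms the agent's policy amounts to something like the sketch below; the pg helper methods and thresholds are hypothetical, purely for illustration:

    # Pseudocode sketch of a per-PG tiering agent tick (hypothetical helpers).
    def agent_tick(pg, target_dirty_ratio=0.4, target_full_ratio=0.8):
        if pg.dirty_ratio() > target_dirty_ratio:
            for obj in pg.coldest_dirty_objects():
                pg.flush(obj)      # write back to the base pool, then mark clean
        if pg.full_ratio() > target_full_ratio:
            for obj in pg.coldest_clean_objects():
                pg.evict(obj)      # clean objects can simply be deleted from the cache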

CACHE TIER USAGE
● Cache tier should be faster than the base tier
● Cache tier should be replicated (not erasure coded)
● Promote and flush are expensive
  – Best results when object temperatures are skewed
    ● Most I/O goes to a small number of hot objects
  – Cache should be big enough to capture most of the acting set
● Challenging to benchmark
  – Need a realistic workload (e.g., not 'dd') to determine how it will perform in practice
  – Takes a long time to “warm up” the cache

ERASURE CODING

ERASURE CODING
Replicated pool: full copies of stored objects
● Very high durability
● 3x (200% overhead)
● Quicker recovery
Erasure coded pool: one copy plus parity
● Cost-effective durability
● 1.5x (50% overhead)
● Expensive recovery
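
The overhead figures above are simple arithmetic: replication stores n full copies, while an erasure code with k data shards and m coding shards stores (k+m)/k times the data:

    # Raw-to-usable ratios behind the slide's 3x and 1.5x figures.
    def replication_overhead(copies):
        return float(copies)          # 3 copies -> 3.0x raw per usable (200% extra)

    def ec_overhead(k, m):
        return (k + m) / k            # k data shards + m coding shards

    print(replication_overhead(3))    # 3.0
    print(ec_overhead(4, 2))          # 1.5, i.e. the "1.5x (50% overhead)" case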

ERASURE CODING SHARDS
[diagram: an object in the erasure coded pool is split across OSDs holding data shards 1-4 and coding shards X and Y]

ERASURE CODING SHARDS
[diagram: stripe units 0-19 are dealt round-robin across the four data shards (0,4,8,12,16 / 1,5,9,13,17 / 2,6,10,14,18 / 3,7,11,15,19), with coding shards A-E and A'-E' on OSDs X and Y]
● Variable stripe size (e.g., 4 KB)
● Zero-fill shards (logically) in partial tail stripe
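
A toy sketch of the striping step: the object is cut into stripe units, dealt round-robin across k data shards with the partial tail stripe zero-filled, and coding shards are computed from the data shards. A single XOR parity is used here for simplicity; real pools use pluggable codes (e.g., jerasure) that produce m coding shards:

    # Toy striping + single XOR parity (real EC pools compute m coding shards).
    def shard(data, k, stripe_unit=4096):
        chunk = k * stripe_unit
        data += b"\x00" * (-len(data) % chunk)          # zero-fill the partial tail stripe
        shards = [bytearray() for _ in range(k)]
        for off in range(0, len(data), stripe_unit):
            shards[(off // stripe_unit) % k] += data[off:off + stripe_unit]
        parity = bytearray(len(shards[0]))
        for s in shards:
            parity = bytearray(a ^ b for a, b in zip(parity, s))
        return [bytes(s) for s in shards], bytes(parity)

    data_shards, parity_shard = shard(b"some object payload", k=4)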

PRIMARY COORDINATES
[diagram: one OSD in the erasure coded pool acts as the primary and coordinates the other shard OSDs]

EC READ
[diagram: the Ceph client sends a READ to the primary OSD of the erasure coded pool]

EC READ
[diagram: the primary issues READS to the OSDs holding the data shards]

EC READ
[diagram: the primary reassembles the shards and sends the READ REPLY to the client]

EC WRITE
[diagram: the client sends a WRITE to the primary OSD]

EC WRITE
[diagram: the primary encodes the update and issues WRITES to the shard OSDs]

EC WRITE
[diagram: once the shards are written, the primary sends an ACK to the client]

EC WRITE: DEGRADED
[diagram: with one OSD down, the primary still issues WRITES to the remaining shard OSDs]

EC WRITE: PARTIAL FAILURE
[diagram: a WRITE's shard updates reach only some of the OSDs before a failure]

EC WRITE: PARTIAL FAILURE
[diagram: the shards are left inconsistent, with some OSDs holding the new version (B) and others the old version (A)]

EC RESTRICTIONS
● Overwrite in place will not work in general
● Log and 2PC would increase complexity, latency
● We chose to restrict allowed operations
  – create
  – append (on stripe boundary)
  – remove (keep previous generation of object for some time)
● These operations can all easily be rolled back locally
  – create → delete
  – append → truncate
  – remove → roll back to previous generation
● Object attrs preserved in existing PG logs (they are small)
● Key/value data is not allowed on EC pools

EC WRITE: PARTIAL FAILURE
[diagram: the mix of new (B) and old (A) shards across the OSDs must be resolved]

EC WRITE: PARTIAL FAILURE
[diagram: the partially applied write is rolled back locally, returning all shards to the old version (A)]

EC RESTRICTIONS
● This is a small subset of allowed librados operations
  – Notably cannot (over)write any extent
● Coincidentally, unsupported operations are also inefficient for erasure codes
  – Generally require read/modify/write of affected stripe(s)
● Some can consume EC directly
  – RGW (no object data update in place)
● Others can combine EC with a cache tier (RBD, CephFS)
  – Replication for warm/hot data
  – Erasure coding for cold data
  – Tiering agent skips objects with key/value data

WHICH ERASURE CODE?
● The EC algorithm and implementation are pluggable
  – jerasure/gf-complete (free, open, and very fast)
  – ISA-L (Intel library; optimized for modern Intel procs)
  – LRC (local recovery code – layers over existing plugins)
  – SHEC (trades extra storage for recovery efficiency – new from Fujitsu)
● Parameterized
  – Pick “k” and “m”, stripe size
● OSD handles data path, placement, rollback, etc.
● Erasure plugin handles
  – Encode and decode math
  – Given these available shards, which ones should I fetch to satisfy a read?
  – Given these available shards and these missing shards, which ones should I fetch to recover?
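
The code and its parameters are selected through an erasure-code profile when the pool is created. A sketch via the ceph CLI; the profile name, k/m values, and pool name are illustrative:

    # Sketch: define an EC profile and create a pool that uses it.
    import subprocess

    def ceph(*args):
        subprocess.run(["ceph", *args], check=True)

    ceph("osd", "erasure-code-profile", "set", "ec42",
         "k=4", "m=2", "plugin=jerasure", "technique=reed_sol_van")
    ceph("osd", "pool", "create", "cold-ec", "128", "128", "erasure", "ec42")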

COST OF RECOVERY
[diagram: a 1 TB OSD]

COST OF RECOVERY
[diagram: the 1 TB OSD fails]

COST OF RECOVERY (REPLICATION)
[diagram: rebuilding the failed OSD means re-replicating 1 TB of data]

COST OF RECOVERY (REPLICATION)
[diagram: with declustered placement, that 1 TB is read in many small pieces (.01 TB each) from many surviving OSDs in parallel]

COST OF RECOVERY (REPLICATION)
[diagram: in total, about 1 TB is read to rebuild the 1 TB OSD]

COST OF RECOVERY (EC)
[diagram: with erasure coding, rebuilding the lost shards requires reading several surviving shards for each one lost, e.g. 4 x 1 TB for a 1 TB OSD]
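
Back-of-envelope arithmetic for the recovery traffic pictured above, assuming a failed 1 TB OSD: replication reads roughly the lost 1 TB from surviving replicas (spread across many OSDs thanks to declustering), while a k+m erasure code must read about k surviving shards for every shard it rebuilds:

    # Illustrative recovery read volume for a failed 1 TB OSD.
    failed_tb = 1.0

    replication_reads = failed_tb          # ~1 TB total, spread over many OSDs

    def ec_reads(k):
        return k * failed_tb               # each lost shard is rebuilt from k survivors

    print(replication_reads)               # 1.0 TB
    print(ec_reads(4))                     # 4.0 TB for a k=4 code (as on the slide)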

LOCAL RECOVERY CODE (LRC)
[diagram: besides the data shards (1-4) and global coding shards (X, Y), LRC stores additional local parity shards (A, B, C) so a single lost shard can be rebuilt from a small local group rather than from k shards]

BIG THANKS TO
● Ceph
  – Loic Dachary (CloudWatt, FSF France, Red Hat)
  – Andreas Peters (CERN)
  – Sam Just (Inktank / Red Hat)
  – David Zafman (Inktank / Red Hat)
● jerasure / gf-complete
  – Jim Plank (University of Tennessee)
  – Kevin Greenan (Box.com)
● Intel (ISA-L plugin)
● Fujitsu (SHEC plugin)

ROADMAP

WHAT'S NEXT
● Erasure coding
  – Allow (optimistic) client reads directly from shards
  – ARM optimizations for jerasure
● Cache pools
  – Better agent decisions (when to flush or evict)
  – Supporting different performance profiles
    ● e.g., slow / “cheap” flash can read just as fast
  – Complex topologies
    ● Multiple read-only cache tiers in multiple sites
● Tiering
  – Support “redirects” to a (very) cold tier below the base pool
  – Enable dynamic spin-down, dedup, and other features

OTHER ONGOING WORK
● Performance optimization (SanDisk, Intel, Mellanox)
● Alternative OSD backends
  – New backend: hybrid key/value and file system
  – leveldb, rocksdb, LMDB
● Messenger (network layer) improvements
  – RDMA support (libxio – Mellanox)
  – Event-driven TCP implementation (UnitedStack)
● CephFS
  – Online consistency checking and repair tools
  – Performance, robustness
● Multi-datacenter RBD, RADOS replication

FOR MORE INFORMATION
● http://ceph.com
● http://github.com/ceph
● http://tracker.ceph.com
● Mailing lists
  – ceph-users@ceph.com
  – ceph-devel@vger.kernel.org
● irc.oftc.net
  – #ceph
  – #ceph-devel
● Twitter
  – @ceph

THANK YOU!
Sage Weil
Ceph Principal Architect
sage@redhat.com
@liewegas