In this presentation, I explain how Ceph Object Storage performance can be improved drastically, along with object storage best practices, recommendations, and tips. I have also covered the Ceph Shared Data Lake, which is becoming very popular.
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
1. Unlocking The Performance Secrets of Ceph Object Storage
and a Teaser for the Shared Data Lake Solution
Karan Singh
Sr. Storage Architect
Storage Solution Architectures Team
2. WHO DO WE HELP?
COMMUNITY · CUSTOMERS · SOLUTIONS ARCHITECTS
3. HOW DO WE HELP?
ARCHITECTURE · PERFORMANCE OPTIMIZATIONS
5. WHAT IS OBJECT STORAGE GOOD FOR?
DIGITAL MEDIA · DATA WAREHOUSE · ARCHIVE · BACKUP
MULTI-PETABYTE!
6. OBJECT STORAGE PERFORMANCE & SIZING
● Executed ~1,500 unique tests
● Evaluated small, medium & large object sizes
● Stored millions of objects (~130 million, to be precise)
■ 400 hours of testing for one of the scenarios
● Observed performance improvements as high as ~500%
● Testing was conducted in a QCT-sponsored lab
LAB PARTNER: QCT
14. Small Object (Read Ops)
● Client read ops scaled linearly while increasing RGW hosts (OSDs on standard density servers)
● Performance limited by the number of RGW hosts available in our test environment
● Best observed performance was 8,900 read OPS with bucket index on flash media on standard density servers
Higher read ops could have been achieved by adding more RGW hosts, until disk saturation was reached.
Bucket index on flash media was optimal for the small object workload.
19. Put bucket index on Flash Media
Diagram: the client sends a write request to the RADOS Gateway; the rgw.buckets.index pool is placed on flash media, while the object's HEAD and TAIL pieces are written to the rgw.buckets.data pool.
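One way to realize this placement (a minimal sketch, assuming a Luminous-or-later cluster where the flash OSDs report the ssd device class and the index pool uses the default zone's pool name) is to give the bucket index pool a CRUSH rule restricted to flash devices:

    # Create a replicated CRUSH rule that selects only OSDs with the ssd device class
    ceph osd crush rule create-replicated index-on-flash default host ssd
    # Point the RGW bucket index pool at that rule; the data pool stays on HDD
    ceph osd pool set default.rgw.buckets.index crush_rule index-on-flash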
20. Small Object (Write Ops)
● Client write ops scaled sub-linearly while increasing RGW hosts
● Performance limited by disk saturation on Ceph OSD hosts
● Best observed write OPS was 5,000 on high density servers with bucket index on flash media
Higher write ops could have been achieved by adding more OSD hosts.
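For a sense of what these small-object ops workloads look like on the wire, here is a rough, hypothetical sketch (the deck does not name its benchmark tool; the endpoint, bucket, and credentials are placeholders, and real tests would drive many clients concurrently) using the AWS CLI against an RGW S3 endpoint:

    # Create one 64K payload, then issue sequential PUTs and GETs against RGW
    dd if=/dev/urandom of=obj64k bs=64K count=1
    for i in $(seq 1 1000); do
      aws --endpoint-url http://rgw.example.com:8080 s3 cp obj64k s3://testbucket/obj-$i
    done
    for i in $(seq 1 1000); do
      aws --endpoint-url http://rgw.example.com:8080 s3 cp s3://testbucket/obj-$i - > /dev/null
    done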
22. Latency Improvements
SERVER CONFIGURATION | TOTAL OSDs IN CLUSTER | OBJECT SIZE | BUCKET INDEX CONFIGURATION | AVG READ LATENCY (ms) | AVG WRITE LATENCY (ms)
Standard Density | 72 | 64 K | HDD Index | 142 | 318
Standard Density | 72 | 64 K | NVMe Index | 57 | 229
High Density | 210 | 64 K | HDD Index | 51 | 176
High Density | 210 | 64 K | NVMe Index | 54 | 104
● Configuring the bucket index on flash media reduced latency:
■ P99 read latency reduced by ~149% (standard density servers)
■ P99 write latency reduced by ~69% (high density servers)
24. Large Object (Read MBps)
● Client read throughput scaled near-linearly while increasing RGW hosts
● Performance limited by the number of RGW hosts available in our test environment
● Best observed read throughput was ~4 GB/s with 32M object size on standard density servers
Higher read throughput could have been achieved by adding more RGW hosts.
28. RGW Write Operation: the client sends a write request to the RGW; the object is split into 4 MB RGW object stripes, and each stripe into RGW object chunks (512K default).
29. RGW Write Operation: the 512K RGW object chunks are then erasure-coded into EC (K + M) chunks.
30. RGW Write Operation: the EC (K + M) chunks are written to the OSDs/disks.
32. RGW Write Operation (Before Tuning): client write request → 4 MB RGW object stripes → RGW object chunks (512K default) → EC (K + M) chunks → written to OSDs/disks.
33. RGW Write Operation (Where to Tune): the stage to tune is the RGW object chunking (512K default).
34. RGW Write Operation (After Tuning): with the chunk size tuned to 4M, each 4 MB stripe maps to a single RGW object chunk, greatly reducing the number of RGW chunks handed to EC (K + M) encoding and written to the OSDs/disks.
35. Improved Large Object Write (MBps): DEFAULT vs. TUNED
Tuning rgw_max_chunk_size to 4M helped improve write throughput: ~40% higher throughput after tuning.
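A configuration sketch of that tuning (assuming a ceph.conf-managed deployment; client.rgw.gateway1 stands in for your actual RGW instance name, and the RGW service must be restarted after the change):

    # ceph.conf on the RGW hosts
    [client.rgw.gateway1]
    rgw_max_chunk_size = 4194304   # 4 MiB: one RGW chunk per 4 MB object stripe; do not go beyond 4M (see recommendations below)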
36. Large Object Write (MBps)
● Client write throughput scaled near-linearly while increasing RGW hosts
● Best observed (untuned) write throughput was 2.1 GB/s with 32M object size on high density servers
Where is the bottleneck?
38. Large Object Write (MBps)
● Performance limited by disk saturation on Ceph OSD hosts
Higher write throughput could have been achieved by adding more OSD hosts.
45. Higher Object Count Test (Latency)
● Steady write latency with Intel CAS as the cluster grew from 0 to 100M+ objects
● Latency was ~100% lower compared to the default Ceph configuration
47. #1 Architectural Considerations for an Object Storage Cluster
● Object size
● Object count
● Server density
● Client : RGW : OSD ratio
● Bucket index placement scheme
● Caching
● Data protection scheme
● Price/performance
48. #2 Recommendations for Small Object Workloads
● 12 Bay OSD Hosts
● 10GbE RGW Hosts
● Bucket Indices on Flash Media
49. #3 Recommendations for Large Object Workloads
● 12 Bay OSD Hosts, 10GbE - Performance
● 35 Bay OSD Hosts, 40GbE - Price / Performance
● 10GbE RGW Hosts
● Tune rgw_max_chunk_size to 4M
● Bucket Indices on Flash Media
Caution: do not increase rgw_max_chunk_size beyond 4M; doing so causes OSD slow requests and OSD flapping issues.
50. #4 Recommendations for High Object Count Workloads
● 35 Bay OSD Hosts, 40GbE (Price/Performance)
● Bucket Indices on Flash Media
● Intel Cache Acceleration Software (CAS)
○ Only Metadata Caching
51. #5 Recommendations for RGW Sizing
For Small Object (64K) 100% Write:
● 1 RGW for every 50 OSDs (HDD)
For Large Object (32M) 100% Write:
● 1 RGW for every 100 OSDs (HDD)
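As a worked example of these ratios (using a hypothetical 700-OSD HDD cluster, the same scale as the lab described later in this deck):

    # 1 RGW per 50 OSDs for 64K 100% write; 1 RGW per 100 OSDs for 32M 100% write
    echo $(( (700 + 49) / 50 ))    # -> 14 RGW hosts for the small-object write case
    echo $(( (700 + 99) / 100 ))   # -> 7 RGW hosts for the large-object write case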
52. #6 RGW: 10GbE or 40GbE / Dedicated or Colocated?
● Take Away: a dedicated RGW node with 1 x 10GbE connectivity
54. Discontinuity in Big Data Infrastructure - Why?
Before: hadoop
Now: hadoop, spark, hive, tez, presto, impala, kafka, hbase, etc.
● Congestion causing delayed responses from analytics applications
● Multiple teams competing for the same big data resources
55. Discontinuity Presents Choice
Get a bigger cluster (combined compute + storage)
● Lacks isolation - still have noisy neighbors
● Lacks elasticity - rigid cluster size
● Can't scale compute/storage costs separately
Get more clusters
● Cost of duplicating big datasets
● Lacks on-demand provisioning
● Can't scale compute/storage costs separately
On-demand compute and storage pools (shared storage)
● Isolation of high-priority workloads
● Shared big datasets
● On-demand provisioning
● Compute/storage costs scale separately
56. Data Analytics Emerging Patterns
Multiple analytic clusters, provisioned on-demand, sourcing from a common object store.
Diagram: a traditional Hadoop cluster (combined compute + storage workers) alongside Analytic Clusters 1, 2, and 3, each with its own compute and local HDFS tmp, all backed by a common persistent store.
57. Red Hat Elastic Infrastructure for Analytics
Elastic compute resource pool: Kafka, Hive/MapReduce, Spark, Spark SQL, and Presto compute instances at Platinum / Gold / Silver / Bronze SLAs, covering ingest, ETL, batch query & joins, and interactive query.
Shared Data Lake on Ceph, accessed via S3A.
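A sketch of the S3A wiring this implies (assumptions: the endpoint URL, credentials, bucket, and job script below are placeholders; fs.s3a.* are standard hadoop-aws settings, passed here as spark.hadoop.* options):

    # Point a Spark job at the shared data lake on Ceph via s3a:// paths
    spark-submit \
      --conf spark.hadoop.fs.s3a.endpoint=http://rgw.example.com:8080 \
      --conf spark.hadoop.fs.s3a.access.key=ACCESS_KEY \
      --conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY \
      --conf spark.hadoop.fs.s3a.path.style.access=true \
      etl_job.py s3a://datalake/source/ s3a://datalake/transformed/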
58. Analytics vendors focus on analytics. Red Hat focuses on infrastructure.
● Analytics: analytics vendors provide the analytics software
● Infrastructure: Red Hat provides the infrastructure software - an OpenStack or OpenShift provisioned compute pool on top of a Shared Data Lake on Ceph Object Storage
Together: Red Hat Elastic Infrastructure for Analytics.
59. Red Hat Solution Development Lab (at QCT)
● Data generation / ingest: TPC-DS & logsynth data generators
● ETL / transformation: Hive or Spark ETL to ORC or Parquet (1/10/100 TB)
● Query workloads: n concurrent Impala and Spark SQL compute instances; 54 deep TPC-DS queries (interactive query); enrichment dataset joins over 100 days of logs (batch query)
● Ceph RBD block volume pool (intermediate shuffle files, hadoop.tmp, yarn.log, etc.)
● Ceph S3A object bucket pool (source data, transformed data)
● HA Proxy tier in front of an RGW tier (12x), backed by 20x Ceph hosts (700 OSDs)
60. Shared Data Lake on Ceph Object Storage Reference Architecture - coming this fall ...