Unlocking The Performance Secrets of
Ceph Object Storage
Karan Singh
Sr. Storage Architect
Storage Solution Architectures Team
and a Teaser for Shared Data Lake Solution
WHO DO WE HELP?
COMMUNITY CUSTOMERS
SOLUTIONS ARCHITECTS
HOW DO WE HELP?
ARCHITECTURE PERFORMANCE
OPTIMIZATIONS
OBJECT STORAGE
WHAT IS OBJECT STORAGE GOOD FOR?
DIGITAL MEDIA
DATA WAREHOUSE
ARCHIVE
BACKUP
MULTI-PETABYTE!
OBJECT STORAGE PERFORMANCE & SIZING ?
● Executed ~1500 unique tests
● Evaluated Small, Medium & Large Object sizes
● Storing millions of Objects (~130 Million to be precise)
■ 400 hours of testing for one of the scenarios
● Have seen performance improvements as high as ~500%
● Testing was conducted in a QCT-sponsored lab
LAB PARTNER
PERFORMANCE & SIZING GUIDE
Benchmarking Environment & Methodology
Hardware Configurations Tested
STANDARD DENSITY SERVERS
● 6x OSD Nodes
○ 12x HDD (7.2K rpm, 6TB SATA)
○ 1x Intel P3700 800G NVMe
○ 128G Memory
○ 1x Intel Xeon E5-2660 v3
○ 1x 40GbE
● 4x RGW Nodes
○ 1x 40GbE
○ 96G Memory
○ 1x Intel Xeon E5-2670 v3
● 8x Client Nodes
○ 1x 10GbE
○ 96G Memory
○ 1x Intel Xeon E5-2670
● 1x Ceph MON (*)
○ 1x 10GbE
○ 96G Memory
○ 1x Intel Xeon E5-2670
HIGH DENSITY SERVERS
● 6x OSD Nodes
○ 35x HDD (7.2K rpm, 6TB SATA)
○ 2x Intel P3700 800G NVMe
○ 128G Memory
○ 2x Intel Xeon E5-2660 v3
○ 1x 40GbE
● 4x RGW Nodes
○ 1x 40GbE
○ 96G Memory
○ 1x Intel Xeon E5-2670 v3
● 8x Client Nodes
○ 1x 10GbE
○ 96G Memory
○ 1x Intel Xeon E5-2670
● 1x Ceph MON(*)
○ 1x 10GbE
○ 96G Memory
○ 1x Intel Xeon E5-2670
LAB PARTNER
(*) For production use, deploy at least 3 MONs
Lab Setup
Front View Rear View
LAB PARTNER
Benchmarking Methodology & Tools
CEPH: Red Hat Ceph Storage 2.0 (Jewel)
OS: Red Hat Enterprise Linux 7.2
TOOLS
● Ceph Benchmarking Tool (CBT)
● FIO 2.11-12
● iPerf3
● COSBench 0.4.2.c3
● Intel Cache Acceleration Software (CAS) 03.01.01
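Before benchmarking, the network and raw disks can be baselined with the iPerf3 and FIO tools listed above. A minimal sketch, with placeholder hostnames and device paths (run the FIO test only against an unused device, it is destructive):

# Network baseline between a client and an RGW host
iperf3 -s                                # on the RGW host
iperf3 -c rgw1.example.com -P 4 -t 60    # on the client: 4 parallel streams for 60 seconds

# Raw sequential-write baseline for one HDD on an OSD host
fio --name=hdd-seq-write --filename=/dev/sdX --rw=write --bs=4M \
    --ioengine=libaio --direct=1 --runtime=60 --time_based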
Payload Selection
● 64K - Small size object (Thumbnail images, small files etc.)
● 1M - Medium size object (Images, text files etc.)
● 32M - Large size object (Analytics Engines, HD images, log files, backup etc.)
● 64M - Large size object (Analytics Engines, Videos etc.)
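To illustrate how these payload sizes map onto COSBench runs, here is a minimal sketch of a 64K write stage. The endpoint, credentials, bucket/object counts, worker count and runtime are placeholders, not the values used in this study.

<workload name="small-object-64k" description="64K object write sketch">
  <storage type="s3" config="accesskey=ACCESS_KEY;secretkey=SECRET_KEY;endpoint=http://rgw1.example.com:8080" />
  <workflow>
    <workstage name="write-64k">
      <work name="write" workers="64" runtime="300">
        <!-- constant 64 KB objects spread uniformly across 10 buckets -->
        <operation type="write" ratio="100"
                   config="cprefix=bench;containers=u(1,10);objects=u(1,100000);sizes=c(64)KB" />
      </work>
    </workstage>
  </workflow>
</workload>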
Optimizing for
Small Object Operations (OPS)
Small Object (Read Ops)
Higher read ops could have been achieved by adding more RGW hosts, until reaching disk saturation.
Bucket index on flash media was optimal for small object workloads
● Client read ops scaled linearly while increasing RGW hosts (OSDs on standard density servers)
● Performance limited by number of RGW hosts available in our test environment
● Best observed perf. was 8900 Read OPS with bucket index on Flash Media on standard density servers
Read Operations are simple !!
… Let’s Talk About Write …
RGW Object Consists of
Object Index Object Data
HEAD TAIL TAIL TAIL TAIL
RADOS
GATEWAY
rgw.buckets.index Pool rgw.buckets.data Pool
HEAD TAIL TAIL TAIL TAIL
RGW Write Operation
Sends write req. to RGW
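On a live cluster this head/tail layout can be inspected directly. A hedged sketch; the pool name follows the default RGW zone naming and the bucket/object names are examples only:

# Show the manifest of an uploaded object, including its head and tail parts
radosgw-admin object stat --bucket=mybucket --object=myobject

# List the RADOS objects backing the data pool (head plus tail/shadow objects)
rados -p default.rgw.buckets.data ls | head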
Can we Improve This ?
… Yes We Can !!
Put bucket index on Flash Media
RADOS
GATEWAY
rgw.buckets.index Pool rgw.buckets.data Pool
HEAD TAIL TAIL TAIL TAIL
Flash Flash
Sends write req. to RGW
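A minimal sketch of pinning the index pool to flash on a Jewel-era cluster, assuming the NVMe OSDs already sit under a separate CRUSH root named flash and the default zone pool names; rule names, IDs and pool names are placeholders (Luminous and later would use device classes and crush_rule instead):

# Create a replicated CRUSH rule that places data on hosts under the 'flash' root
ceph osd crush rule create-simple flash-index-rule flash host

# Look up the numeric ruleset id of the new rule
ceph osd crush rule dump flash-index-rule

# Point the bucket index pool at the flash-backed rule (Jewel syntax)
ceph osd pool set default.rgw.buckets.index crush_ruleset <ruleset-id>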
Small Object (Write Ops)
● Client write ops scaled sub-linearly while increasing RGW hosts
● Performance limited by disk saturation on Ceph OSD hosts
● Best observed Write OPS was 5000 on High Density Servers with bucket index on flash media
Higher write ops could have been achieved by adding more OSD hosts
Small Object (Write Ops) Saturating HDDs
Latency Improvements
SERVER CONFIGURATION | TOTAL OSDs IN CLUSTER | OBJECT SIZE | BUCKET INDEX CONFIGURATION | AVERAGE READ LATENCY (ms) | AVERAGE WRITE LATENCY (ms)
Standard Density     | 72                    | 64K         | HDD Index                  | 142                       | 318
Standard Density     | 72                    | 64K         | NVMe Index                 | 57                        | 229
High Density         | 210                   | 64K         | HDD Index                  | 51                        | 176
High Density         | 210                   | 64K         | NVMe Index                 | 54                        | 104
● Configuring the bucket index on flash media reduced tail latency:
○ P99 read latency was ~149% higher with the HDD index than with the flash index (standard density servers)
○ P99 write latency was ~69% higher with the HDD index than with the flash index (high density servers)
Optimizing for
Large Object Throughput (MBps)
● Client read throughput scaled near-linearly while increasing RGW hosts
● Performance limited by number of RGW hosts available in our test environment
● Best observed read throughput was ~4GB/s with 32M object size on standard density servers
Large Object (Read MBps)
Higher read throughput could have been achieved by adding more RGW hosts
Read Operations are simple !!
… Let’s Talk About Write …
RADOS GATEWAY
RGW Write Operation Sends write req. to RGW
4MB 4MB 4MB 4MB RGW Object Stripes
RADOS GATEWAY
RGW Write Operation Sends write req. to RGW
4MB 4MB 4MB 4MB RGW Object Stripes
C C C C C C C C RGW Object Chunks (512K Default)
RADOS GATEWAY
RGW Write Operation Sends write req. to RGW
4MB 4MB 4MB 4MB RGW Object Stripes
C C C C C C C C RGW Object Chunks (512K Default)
K K K K M M EC ( K + M ) Chunks
RADOS GATEWAY
RGW Write Operation Sends write req. to RGW
4MB 4MB 4MB 4MB RGW Object Stripes
C C C C C C C C RGW Object Chunks (512K Default)
K K K K M M EC ( K + M ) Chunks
OSD OSD OSD OSD OSD OSD EC Chunks Written to OSD/Disk
RADOS GATEWAY
RGW Write Operation Sends write req. to RGW
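For reference, a hedged sketch of creating an erasure-coded data pool matching the 4+2 (K+M) layout drawn above; the profile name, PG counts and failure domain are illustrative, and the deck does not state the exact EC profile used:

# Define a 4+2 erasure-code profile with host-level failure domain (Jewel option names)
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 ruleset-failure-domain=host

# Create the RGW data pool as an erasure-coded pool using that profile
ceph osd pool create default.rgw.buckets.data 2048 2048 erasure ec-4-2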
Can we Improve This ?
… Yes We Can !!
4MB 4MB 4MB 4MB RGW Object Stripes
C C C C C C C C RGW Object Chunks (512K Default)
K K K K M M EC ( K + M ) Chunks
OSD OSD OSD OSD OSD OSD EC Chunks Written to OSD/Disk
RADOS GATEWAY
RGW Write Operation (Before Tuning)
Sends write req. to RGW
4MB 4MB 4MB 4MB RGW Object Stripes
C C C C C C C C RGW Object Chunks (512K Default)
K K K K M M EC ( K + M ) Chunks
OSD OSD OSD OSD OSD OSD EC Chunks Written to OSD/Disk
Tune This Stage
RADOS GATEWAY
RGW Write Operation (Where to Tune)
Sends write req. to RGW
4MB 4MB 4MB 4MB RGW Object Stripes
C RGW Object Chunks (4M Tuned)
K K K K M M EC ( K + M ) Chunks
OSD OSD OSD OSD OSD OSD EC Chunks Written to OSD/Disk
Reduced RGW Chunks
RADOS GATEWAY
RGW Write Operation (After Tuning)
Sends write req. to RGW
DEFAULT vs. TUNED
Tuning rgw_max_chunk_size to 4M helped improve write throughput (see the config sketch below)
~40% Higher
Throughput
After Tuning
Improved Large Object Write (MBps)
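A hedged sketch of where this tuning could live in ceph.conf on the RGW hosts; the instance name is a placeholder, and (as the caution later in the deck notes) values above 4M are not recommended:

# ceph.conf on the RGW hosts (illustrative instance name)
[client.rgw.rgw1]
    rgw max chunk size = 4194304    # 4 MiB, up from the 512 KiB default
# Restart the RGW service on each gateway host for the change to take effect.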
● Client write throughput scaled near-linearly while increasing RGW hosts
● Best observed (untuned) write throughput was 2.1 GB/s with 32M object size on high density servers
Large Object Write (MBps)
Where is the Bottleneck ??
Large Object Write
Bottleneck : DISK SATURATION
● Performance limited by disk saturation on Ceph OSD hosts
Large Object Write (MBps)
Higher write throughput could have been achieved by adding more OSD hosts
Disk Saturation
Designing for
Higher Object Count (100M+ Objects)
Designing for Higher Object Count
100M+ OBJECTS
● Tested Configurations
○ Default OSD Filestore settings
○ Tuned OSD Filestore settings (illustrative example after this slide)
○ Default OSD Filestore settings + Intel CAS (Metadata Caching)
○ Tuned OSD Filestore settings + Intel CAS (Metadata Caching)
● Test Details
○ High Density Servers
○ 64K object size
○ 50 hours per test run (200 hours of testing in total)
○ Cluster filled up to ~130 Million Objects
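The deck does not list the exact filestore tunables that were tested. As an illustration only, these are the kind of filestore split/merge settings commonly adjusted for high object counts; the values below are examples, not the study's configuration:

# ceph.conf on the OSD hosts (illustrative values only)
[osd]
    filestore merge threshold = 40
    filestore split multiple = 8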
Best Config: Default OSD Filestore settings + Intel CAS (Metadata caching)
Designing for Higher Object Count (Read)
Designing for Higher Object Count (Read)
COMPARING DIFFERENT CONFIGURATIONS
Best Config: Default OSD Filestore settings + Intel CAS (Metadata caching)
Designing for Higher Object Count (Write)
Designing for Higher Object Count (Write)
COMPARING DIFFERENT CONFIGURATIONS
● Steady write latency with Intel CAS as the cluster grew from 0 to 100M+ objects
● Write latency with the default Ceph configuration was ~100% higher (roughly double) by comparison
Higher Object Count Test (Latency)
Secrets to
Object Storage Performance
(Key takeaways)
● Object size
● Object count
● Server density
● Client : RGW : OSD Ratio
● Bucket index placement scheme
● Caching
● Data protection scheme
● Price/Performance
#1 Architectural Considerations for an object storage cluster
● 12 Bay OSD Hosts
● 10GbE RGW Hosts
● Bucket Indices on Flash Media
#2 Recommendations for Small Object Workloads
#3 Recommendations for Large Object Workloads
● 12 Bay OSD Hosts, 10GbE - Performance
● 35 Bay OSD Hosts, 40GbE - Price / Performance
● 10GbE RGW Hosts
● Tune rgw_max_chunk_size to 4M
● Bucket Indices on Flash Media
Caution: do not increase rgw_max_chunk_size beyond 4M; doing so causes OSD slow requests and OSD flapping issues.
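A hedged way to watch for these symptoms while experimenting with chunk sizes (standard Ceph CLI, nothing specific to this study):

# Check cluster status and look for slow-request or flapping-OSD warnings
ceph -s
ceph health detail | grep -i slow

# Stream cluster log events live while a test is running
ceph -w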
● 35 Bay OSD Hosts, 40GbE (Price/Performance)
● Bucket Indices on Flash Media
● Intel Cache Acceleration Software (CAS)
○ Only Metadata Caching
#4 Recommendations for High Object Count Workloads
For Small Object (64K) 100% Write
● 1 RGW for every 50 OSDs (HDD)
For Large Object (32M) 100% Write
● 1 RGW for every 100 OSDs (HDD)
#5 Recommendations for RGW Sizing
#6 RGW: 10GbE or 40GbE? Dedicated or Colocated?
● Takeaway: dedicated RGW nodes with 1x 10GbE connectivity
Shared Data Lake Solution
Using
Ceph Object Storage
(Teaser)
Discontinuity in Big Data Infrastructure - Why?
Before: hadoop
Now: hadoop, spark, hive, tez, presto, impala, kafka, hbase, etc.
● Congestion causing delay in response from analytics applications
● Multiple teams competing for the same big data resources
Discontinuity Presents Choice
Get a bigger cluster
● Lacks isolation - still have noisy neighbors
● Lacks elasticity - rigid cluster size
● Can’t scale compute/storage costs separately
Get more clusters
● Cost of duplicating big datasets
● Lacks on-demand provisioning
● Can’t scale compute/storage costs separately
On-demand compute and storage pools
● Isolation of high-priority workloads
● Shared big datasets
● On-demand provisioning
● Compute/storage costs scale separately
Data Analytics Emerging Patterns
Multiple analytic clusters, provisioned on-demand, sourcing from a common object store
[Diagram: a traditional Hadoop cluster with compute and storage co-located on each worker, contrasted with Analytic Clusters 1-3, each running compute with local HDFS/TMP only, all sourcing from a COMMON PERSISTENT STORE]
Red Hat Elastic Infrastructure for Analytics
[Diagram: an Elastic Compute Resource Pool on a Shared Data Lake, accessed over S3A. Compute instance pools for Presto, Spark, Spark SQL, Hive/MapReduce and Kafka serve interactive query, batch query & joins, ETL and ingest workloads under Platinum, Gold, Silver and Bronze SLAs]
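A hedged sketch of the Hadoop S3A client settings that point an analytics cluster at a Ceph RGW endpoint (core-site.xml); the endpoint, credentials and SSL choice are placeholders:

<!-- core-site.xml (illustrative values) -->
<property><name>fs.s3a.endpoint</name><value>rgw.example.com:8080</value></property>
<property><name>fs.s3a.access.key</name><value>ACCESS_KEY</value></property>
<property><name>fs.s3a.secret.key</name><value>SECRET_KEY</value></property>
<property><name>fs.s3a.path.style.access</name><value>true</value></property>
<property><name>fs.s3a.connection.ssl.enabled</name><value>false</value></property>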
Analytics vendors focus on analytics. Red Hat on infrastructure.
● Analytics: analytics vendors provide the analytics software
● Infrastructure: Red Hat provides the infrastructure software
○ OpenStack or OpenShift provisioned compute pool
○ Shared Data Lake on Ceph Object Storage
Red Hat Elastic Infrastructure for Analytics
Red Hat Solution Development Lab (at QCT)
● Ingest: TPC-DS & logsynth data generation
● ETL / transformation: Hive or Spark, ORC or Parquet (1/10/100 TB)
● Interactive query: n concurrent Impala and Spark SQL compute instances
● Batch query: Spark SQL compute, 54 deep TPC-DS queries; enrichment dataset joins over 100 days of logs
● Ceph RBD block volume pool (intermediate shuffle files, hadoop.tmp, yarn.log, etc.)
● Ceph S3A object bucket pool (source data, transformed data)
● HA Proxy tier in front of the RGW tier (12x)
● 20x Ceph hosts (700 OSDs)
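A hedged sketch of what an HA Proxy tier in front of the RGW hosts could look like; hostnames, ports and the number of backends shown are placeholders:

# haproxy.cfg (fragment, illustrative)
frontend rgw_s3
    bind *:80
    default_backend rgw_servers

backend rgw_servers
    balance roundrobin
    option httpchk GET /
    server rgw1 rgw1.example.com:8080 check
    server rgw2 rgw2.example.com:8080 check
    # one 'server' line per RGW host in the tier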
Shared Data Lake on Ceph Object Storage
Reference Architecture
Coming This Fall ...
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
Get Ceph Object Storage P&S Guide : http://bit.ly/object-ra