SlideShare a Scribd company logo
1 of 40
Download to read offline
Solving the Linux storage
scalability bottlenecks
Jens Axboe
Software Engineer Kernel Recipes 2015, Oct 2nd
2015
• Devices went from “hundreds of IOPS” to “hundreds of
thousands of IOPS”
• Increases in core count, and NUMA
• Existing IO stack has a lot of data sharing
●
For applications
●
And between submission and completion
• Existing heuristics and optimizations centered around
slower storage
What are the issues?
• The old stack had severe scaling issues
●
Even negative scaling
●
Wasting lots of CPU cycles
• This also lead to much higher latencies
• But where are the real scaling bottlenecks hidden?
Observed problems
SCSI driver request_fn driver Bypass driver
File system
BIO layer (struct bio)
Block layer (struct request)
SCSI stack
IO stack
App
CPU A
App
CPU B
App
CPU C
App
CPU D
BIO layer
Block layer
Driver
File system
Seen from the application
App
CPU A
App
CPU B
App
CPU C
App
CPU D
BIO layer
Block layer
Driver
File system
Seen from the application
Hmmmm!
• At this point we may have a suspicion of where the
bottleneck might be. Let's run a test and see if it backs up
the theory.
• We use null_blk
●
queue_mode=1 completion_nsec=0 irqmode=0
• Fio
●
Each thread does pread(2), 4k, randomly, O_DIRECT
• Each added thread alternates between the two available
NUMA nodes (2 socket system, 32 threads)
Testing the theory
That looks like a lot of lock
contention… Fio reports spending
95% of the time in the kernel, looks
like ~75% of that time is spinning
on locks.
Looking at call graphs, it's a good
mix of queue vs completion, and
queue vs queue (and queue-to-block
vs queue-to-driver).
Driver
App
CPU A
App
CPU B
App
CPU C
App
CPU D
- Requests placed for processing
- Requests retrieved by driver
- Requests completion signaled
== Lots of shared state!
Block layer
• We have good scalability until we reach the block layer
●
The shared state is a massive issue
• A bypass mode driver could work around the problem
• We need a real and future proof solution!
Problem areas
• Shares basic name with similar networking functionality,
but was built from scratch
• Basic idea is to separate shared state
●
Between applications
●
Between completion and submission
• Improve scaling on non-mq hardware was a criteria
• Provide a full pool of helper functionality
●
Implement and debug once
• Become THE queuing model, not “the 3rd
one”
Enter block multiqueue
• Started in 2011
• Original design reworked, finalized around 2012
• Merged in 3.13
History
Per-cpu
software
queues
(blk_mq_ctx)
App
CPU A
App
CPU B
App
CPU C
App
CPU D
App
CPU E
App
CPU F
Hardware mapping
queues
(blk_mq_hw_ctx)
Hardware queue Hardware queue Hardware queueHardware and driver
Per-cpu
software
queues
(blk_mq_ctx)
App
CPU A
App
CPU B
Hardware mapping
queues
(blk_mq_hw_ctx)
Hardware queue
Hardware and driver
Completions
Submissions
• Application touches private per-cpu queue
●
Software queues
●
Submission is now almost fully privatized
Per-cpu
software
queues
(blk_mq_ctx)
App
CPU A
App
CPU B
Hardware mapping
queues
(blk_mq_hw_ctx)
Hardware queue
Hardware and driver
Completions
Submissions
• Software queues map M:N to hardware
queues
●
There are always as many software queues
as CPUs
●
With enough hardware queues, it's a 1:1
mapping
●
Fewer, and we map based on topology of
the system
Per-cpu
software
queues
(blk_mq_ctx)
App
CPU A
App
CPU B
Hardware mapping
queues
(blk_mq_hw_ctx)
Hardware queue
Hardware and driver
Completions
Submissions
• Hardware queues handle dispatch to
hardware and completions
• Efficient and fast versions of:
●
Tagging
●
Timeout handling
●
Allocation eliminations
●
Local completions
• Provides intelligent queue ↔ CPU mappings
●
Can be used for IRQ mappings as well
• Clean API
●
Driver conversions generally remove more code than they
add
Features
Allocate bio Find free request Map bio to request
Insert into software
queue
Signal hardware queue
run?
Hardware queue runs
Submit to hardware
Sleep on free request
Free resources
(bio, mark rq as free)
Complete IO Hardware IRQ event
blk-mq IO flow
Allocate bio Allocate request Map bio to request Insert into queue Signal driver (?)
Driver runsAllocate tag
Submit to hardware
Sleep on resources
Free resources
(request, bio, tag,
Hardware, etc)
Complete IO Hardware IRQ event
Allocate driver
command and SG
list
Block layer IO flow
Pull request off block
layer queue
• Want completions as local as possible
●
Even without queue shared state, there's still the request
• Particularly for fewer/single hardware queue design, care
must be taken to minimize sharing
• If completion queue can place event, we use that
●
If not, IPI
Completions
Per-cpu
software
queues
(blk_mq_ctx)
App
CPU A
App
CPU B
Hardware mapping
queues
(blk_mq_hw_ctx)
Hardware queue
Hardware and driver
Completions
Submissions
IRQ in right location?
Yes
No IPI to right CPU
Complete IO
IRQ
• Almost all hardware uses tags to identify IO requests
●
Must get a free tag on request issue
●
Must return tag to pool on completion
Tagging
Driver Hardware
“This is a request identified by tag=0x13”
“This is the completion event for the
request identified by tag=0x13”
• Must have features:
●
Efficient at or near tag exhaustion
●
Efficient for shared tag maps
• Blk-mq implements a novel bitmap tag approach
●
Software queue hinting (sticky)
●
Sparse layout
●
Rolling wakeups
Tag support
Sparse tag maps $ cat /sys/block/sda/mq/0/tags
nr_tags=31, reserved_tags=0, bits_per_word=2
nr_free=31, nr_reserved=0
|- Cacheline (generally 64b) -|
Tag values 0-1 Tag values 2-3 Tag values 4-5 Tag values 6-7
App A App B App C App D
• Applications tend to stick to software queues
●
Utilize that concept to make them stick to tag cachelines
●
Cache last tag in software queue
• We use null_blk
• Fio
●
Each thread does pread(2), 4k, randomly, O_DIRECT
• queue_mode=2 completion_nsec=0 irqmode=0
submit_queues=32
• Each added thread alternates between the two available
NUMA nodes (2 socket system)
Rerunning the test case
Single queue mode, basically all
system time is spent banging on
the device queue lock. Fio reports
95% of the time spent in the
Kernel. Max completion time is
10x higher than blk-mq mode,
50th
percentile is 24usec.
In blk-mq mode, locking time is
drastically reduced and the profile
Is much cleaner. Fio reports 74%
of the time spent in the kernel.
50th
percentile is 3 usec.
“But Jens, isn't most storage hardware still
single queue? What about single queue
performance on blk-mq?”
— Astute audience member
• SCSI had severe scaling issues
●
Per LUN performance limited to ~150K IOPS
• SCSI queuing layered on top of blk-mq
• Initially by Nic Bellinger (Datera), later continued by
Christoph Hellwig
• Merged in 3.17
●
CONFIG_SCSI_MQ_DEFAULT=y
●
scsi_mod.use_blk_mq=1
• Helped drive some blk-mq features
Scsi-mq
Graph from Christoph Hellwig
• Backport
• Ran a pilot last half, results were so good it was
immediately put in production.
• Running in production at Facebook
●
TAO, cache
• Biggest win was in latency reductions
●
FB workloads not that IOPS intensive
●
But still saw sys % wins too
At Facebook
• As of 4.3-rc2
●
mtip32xx (micron SSD)
●
NVMe
●
virtio_blk, xen block driver
●
rbd (ceph block)
●
loop
●
ubi
●
SCSI
• All over the map (which is good)
Conversion progress
• An IO scheduler
• Better helpers for IRQ affinity mappings
• IO accounting
• IO polling
• More conversions
●
Long term goal remains killing off request_fn
Future work
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks

More Related Content

What's hot

Kernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisKernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisAnne Nicolas
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugginglibfetion
 
YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageKernel TLV
 
Understand and optimize Linux I/O
Understand and optimize Linux I/OUnderstand and optimize Linux I/O
Understand and optimize Linux I/OAndrea Righi
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsKernel TLV
 
Kernel Recipes 2015 - Porting Linux to a new processor architecture
Kernel Recipes 2015 - Porting Linux to a new processor architectureKernel Recipes 2015 - Porting Linux to a new processor architecture
Kernel Recipes 2015 - Porting Linux to a new processor architectureAnne Nicolas
 
Kernel Recipes 2015: Speed up your kernel development cycle with QEMU
Kernel Recipes 2015: Speed up your kernel development cycle with QEMUKernel Recipes 2015: Speed up your kernel development cycle with QEMU
Kernel Recipes 2015: Speed up your kernel development cycle with QEMUAnne Nicolas
 
Linux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - WonokaerunLinux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - Wonokaerunidsecconf
 
Spying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitSpying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitAndrea Righi
 
The Silence of the Canaries
The Silence of the CanariesThe Silence of the Canaries
The Silence of the CanariesKernel TLV
 
Linux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisLinux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisPaul V. Novarese
 
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...Anne Nicolas
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems PerformanceBrendan Gregg
 
Performance: Observe and Tune
Performance: Observe and TunePerformance: Observe and Tune
Performance: Observe and TunePaul V. Novarese
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Brendan Gregg
 
AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!Zubair Nabi
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningRenaldas Zioma
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 

What's hot (20)

Kernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisKernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysis
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
 
Understand and optimize Linux I/O
Understand and optimize Linux I/OUnderstand and optimize Linux I/O
Understand and optimize Linux I/O
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance Tools
 
Kernel Recipes 2015 - Porting Linux to a new processor architecture
Kernel Recipes 2015 - Porting Linux to a new processor architectureKernel Recipes 2015 - Porting Linux to a new processor architecture
Kernel Recipes 2015 - Porting Linux to a new processor architecture
 
Kernel Recipes 2015: Speed up your kernel development cycle with QEMU
Kernel Recipes 2015: Speed up your kernel development cycle with QEMUKernel Recipes 2015: Speed up your kernel development cycle with QEMU
Kernel Recipes 2015: Speed up your kernel development cycle with QEMU
 
Linux Kernel I/O Schedulers
Linux Kernel I/O SchedulersLinux Kernel I/O Schedulers
Linux Kernel I/O Schedulers
 
Linux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - WonokaerunLinux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - Wonokaerun
 
Spying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitSpying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profit
 
The Silence of the Canaries
The Silence of the CanariesThe Silence of the Canaries
The Silence of the Canaries
 
Linux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisLinux Crash Dump Capture and Analysis
Linux Crash Dump Capture and Analysis
 
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems Performance
 
Performance: Observe and Tune
Performance: Observe and TunePerformance: Observe and Tune
Performance: Observe and Tune
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016
 
AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 

Similar to Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks

Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)Tibo Beijen
 
Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Nikolay Savvinov
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPrashant Rane
 
Sheepdog Status Report
Sheepdog Status ReportSheepdog Status Report
Sheepdog Status ReportLiu Yuan
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansPeter Clapham
 
Dataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsDataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsStefano Salsano
 
Lec 10-linux-review
Lec 10-linux-reviewLec 10-linux-review
Lec 10-linux-reviewabinaya m
 
Scaling application servers for efficiency
Scaling application servers for efficiencyScaling application servers for efficiency
Scaling application servers for efficiencyTomas Doran
 
asap2013-khoa-presentation
asap2013-khoa-presentationasap2013-khoa-presentation
asap2013-khoa-presentationAbhishek Jain
 
Coverage Solutions on Emulators
Coverage Solutions on EmulatorsCoverage Solutions on Emulators
Coverage Solutions on EmulatorsDVClub
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyserAlex Moskvin
 
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012TEST Huddle
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceEnkitec
 

Similar to Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks (20)

Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
 
Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...
 
Os lectures
Os lecturesOs lectures
Os lectures
 
Lec 3
Lec 3 Lec 3
Lec 3
 
Introduction to multicore .ppt
Introduction to multicore .pptIntroduction to multicore .ppt
Introduction to multicore .ppt
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
 
Sheepdog Status Report
Sheepdog Status ReportSheepdog Status Report
Sheepdog Status Report
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
SDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLSSDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLS
 
Dataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsDataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and tools
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
Lec 10-linux-review
Lec 10-linux-reviewLec 10-linux-review
Lec 10-linux-review
 
Scaling application servers for efficiency
Scaling application servers for efficiencyScaling application servers for efficiency
Scaling application servers for efficiency
 
asap2013-khoa-presentation
asap2013-khoa-presentationasap2013-khoa-presentation
asap2013-khoa-presentation
 
Coverage Solutions on Emulators
Coverage Solutions on EmulatorsCoverage Solutions on Emulators
Coverage Solutions on Emulators
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture Performance
 

More from Anne Nicolas

Kernel Recipes 2019 - Driving the industry toward upstream first
Kernel Recipes 2019 - Driving the industry toward upstream firstKernel Recipes 2019 - Driving the industry toward upstream first
Kernel Recipes 2019 - Driving the industry toward upstream firstAnne Nicolas
 
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMI
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMIKernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMI
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMIAnne Nicolas
 
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelKernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelAnne Nicolas
 
Kernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyKernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyAnne Nicolas
 
Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureAnne Nicolas
 
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...Anne Nicolas
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataAnne Nicolas
 
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...Anne Nicolas
 
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and BareboxEmbedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and BareboxAnne Nicolas
 
Embedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less specialEmbedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less specialAnne Nicolas
 
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre SiliconEmbedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre SiliconAnne Nicolas
 
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) pictureEmbedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) pictureAnne Nicolas
 
Embedded Recipes 2019 - Testing firmware the devops way
Embedded Recipes 2019 - Testing firmware the devops wayEmbedded Recipes 2019 - Testing firmware the devops way
Embedded Recipes 2019 - Testing firmware the devops wayAnne Nicolas
 
Embedded Recipes 2019 - Herd your socs become a matchmaker
Embedded Recipes 2019 - Herd your socs become a matchmakerEmbedded Recipes 2019 - Herd your socs become a matchmaker
Embedded Recipes 2019 - Herd your socs become a matchmakerAnne Nicolas
 
Embedded Recipes 2019 - LLVM / Clang integration
Embedded Recipes 2019 - LLVM / Clang integrationEmbedded Recipes 2019 - LLVM / Clang integration
Embedded Recipes 2019 - LLVM / Clang integrationAnne Nicolas
 
Embedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingEmbedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingAnne Nicolas
 
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimediaEmbedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimediaAnne Nicolas
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedAnne Nicolas
 
Kernel Recipes 2019 - Suricata and XDP
Kernel Recipes 2019 - Suricata and XDPKernel Recipes 2019 - Suricata and XDP
Kernel Recipes 2019 - Suricata and XDPAnne Nicolas
 
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Anne Nicolas
 

More from Anne Nicolas (20)

Kernel Recipes 2019 - Driving the industry toward upstream first
Kernel Recipes 2019 - Driving the industry toward upstream firstKernel Recipes 2019 - Driving the industry toward upstream first
Kernel Recipes 2019 - Driving the industry toward upstream first
 
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMI
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMIKernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMI
Kernel Recipes 2019 - No NMI? No Problem! – Implementing Arm64 Pseudo-NMI
 
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelKernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
 
Kernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyKernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are money
 
Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and future
 
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
Embedded Recipes 2019 - Knowing your ARM from your ARSE: wading through the t...
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
 
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
Kernel Recipes 2019 - Analyzing changes to the binary interface exposed by th...
 
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and BareboxEmbedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox
 
Embedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less specialEmbedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less special
 
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre SiliconEmbedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
Embedded Recipes 2019 - Linux on Open Source Hardware and Libre Silicon
 
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) pictureEmbedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
Embedded Recipes 2019 - From maintaining I2C to the big (embedded) picture
 
Embedded Recipes 2019 - Testing firmware the devops way
Embedded Recipes 2019 - Testing firmware the devops wayEmbedded Recipes 2019 - Testing firmware the devops way
Embedded Recipes 2019 - Testing firmware the devops way
 
Embedded Recipes 2019 - Herd your socs become a matchmaker
Embedded Recipes 2019 - Herd your socs become a matchmakerEmbedded Recipes 2019 - Herd your socs become a matchmaker
Embedded Recipes 2019 - Herd your socs become a matchmaker
 
Embedded Recipes 2019 - LLVM / Clang integration
Embedded Recipes 2019 - LLVM / Clang integrationEmbedded Recipes 2019 - LLVM / Clang integration
Embedded Recipes 2019 - LLVM / Clang integration
 
Embedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingEmbedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debugging
 
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimediaEmbedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
Embedded Recipes 2019 - Pipewire a new foundation for embedded multimedia
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
 
Kernel Recipes 2019 - Suricata and XDP
Kernel Recipes 2019 - Suricata and XDPKernel Recipes 2019 - Suricata and XDP
Kernel Recipes 2019 - Suricata and XDP
 
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
Kernel Recipes 2019 - Marvels of Memory Auto-configuration (SPD)
 

Recently uploaded

Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsDEEPRAJ PATHAK
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxAS Design & AST.
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapIshara Amarasekera
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 

Recently uploaded (20)

Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software Projects
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery Roadmap
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 

Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks

  • 1.
  • 2. Solving the Linux storage scalability bottlenecks Jens Axboe Software Engineer Kernel Recipes 2015, Oct 2nd 2015
  • 3. • Devices went from “hundreds of IOPS” to “hundreds of thousands of IOPS” • Increases in core count, and NUMA • Existing IO stack has a lot of data sharing ● For applications ● And between submission and completion • Existing heuristics and optimizations centered around slower storage What are the issues?
  • 4. • The old stack had severe scaling issues ● Even negative scaling ● Wasting lots of CPU cycles • This also lead to much higher latencies • But where are the real scaling bottlenecks hidden? Observed problems
  • 5. SCSI driver request_fn driver Bypass driver File system BIO layer (struct bio) Block layer (struct request) SCSI stack IO stack
  • 6. App CPU A App CPU B App CPU C App CPU D BIO layer Block layer Driver File system Seen from the application
  • 7. App CPU A App CPU B App CPU C App CPU D BIO layer Block layer Driver File system Seen from the application Hmmmm!
  • 8. • At this point we may have a suspicion of where the bottleneck might be. Let's run a test and see if it backs up the theory. • We use null_blk ● queue_mode=1 completion_nsec=0 irqmode=0 • Fio ● Each thread does pread(2), 4k, randomly, O_DIRECT • Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads) Testing the theory
  • 9.
  • 10. That looks like a lot of lock contention… Fio reports spending 95% of the time in the kernel, looks like ~75% of that time is spinning on locks. Looking at call graphs, it's a good mix of queue vs completion, and queue vs queue (and queue-to-block vs queue-to-driver).
  • 11. Driver App CPU A App CPU B App CPU C App CPU D - Requests placed for processing - Requests retrieved by driver - Requests completion signaled == Lots of shared state! Block layer
  • 12. • We have good scalability until we reach the block layer ● The shared state is a massive issue • A bypass mode driver could work around the problem • We need a real and future proof solution! Problem areas
  • 13. • Shares basic name with similar networking functionality, but was built from scratch • Basic idea is to separate shared state ● Between applications ● Between completion and submission • Improve scaling on non-mq hardware was a criteria • Provide a full pool of helper functionality ● Implement and debug once • Become THE queuing model, not “the 3rd one” Enter block multiqueue
  • 14. • Started in 2011 • Original design reworked, finalized around 2012 • Merged in 3.13 History
  • 15. Per-cpu software queues (blk_mq_ctx) App CPU A App CPU B App CPU C App CPU D App CPU E App CPU F Hardware mapping queues (blk_mq_hw_ctx) Hardware queue Hardware queue Hardware queueHardware and driver
  • 16. Per-cpu software queues (blk_mq_ctx) App CPU A App CPU B Hardware mapping queues (blk_mq_hw_ctx) Hardware queue Hardware and driver Completions Submissions • Application touches private per-cpu queue ● Software queues ● Submission is now almost fully privatized
  • 17. Per-cpu software queues (blk_mq_ctx) App CPU A App CPU B Hardware mapping queues (blk_mq_hw_ctx) Hardware queue Hardware and driver Completions Submissions • Software queues map M:N to hardware queues ● There are always as many software queues as CPUs ● With enough hardware queues, it's a 1:1 mapping ● Fewer, and we map based on topology of the system
  • 18. Per-cpu software queues (blk_mq_ctx) App CPU A App CPU B Hardware mapping queues (blk_mq_hw_ctx) Hardware queue Hardware and driver Completions Submissions • Hardware queues handle dispatch to hardware and completions
  • 19. • Efficient and fast versions of: ● Tagging ● Timeout handling ● Allocation eliminations ● Local completions • Provides intelligent queue ↔ CPU mappings ● Can be used for IRQ mappings as well • Clean API ● Driver conversions generally remove more code than they add Features
  • 20. Allocate bio Find free request Map bio to request Insert into software queue Signal hardware queue run? Hardware queue runs Submit to hardware Sleep on free request Free resources (bio, mark rq as free) Complete IO Hardware IRQ event blk-mq IO flow
  • 21. Allocate bio Allocate request Map bio to request Insert into queue Signal driver (?) Driver runsAllocate tag Submit to hardware Sleep on resources Free resources (request, bio, tag, Hardware, etc) Complete IO Hardware IRQ event Allocate driver command and SG list Block layer IO flow Pull request off block layer queue
  • 22. • Want completions as local as possible ● Even without queue shared state, there's still the request • Particularly for fewer/single hardware queue design, care must be taken to minimize sharing • If completion queue can place event, we use that ● If not, IPI Completions
  • 23. Per-cpu software queues (blk_mq_ctx) App CPU A App CPU B Hardware mapping queues (blk_mq_hw_ctx) Hardware queue Hardware and driver Completions Submissions IRQ in right location? Yes No IPI to right CPU Complete IO IRQ
  • 24. • Almost all hardware uses tags to identify IO requests ● Must get a free tag on request issue ● Must return tag to pool on completion Tagging Driver Hardware “This is a request identified by tag=0x13” “This is the completion event for the request identified by tag=0x13”
  • 25. • Must have features: ● Efficient at or near tag exhaustion ● Efficient for shared tag maps • Blk-mq implements a novel bitmap tag approach ● Software queue hinting (sticky) ● Sparse layout ● Rolling wakeups Tag support
  • 26. Sparse tag maps $ cat /sys/block/sda/mq/0/tags nr_tags=31, reserved_tags=0, bits_per_word=2 nr_free=31, nr_reserved=0 |- Cacheline (generally 64b) -| Tag values 0-1 Tag values 2-3 Tag values 4-5 Tag values 6-7 App A App B App C App D • Applications tend to stick to software queues ● Utilize that concept to make them stick to tag cachelines ● Cache last tag in software queue
  • 27. • We use null_blk • Fio ● Each thread does pread(2), 4k, randomly, O_DIRECT • queue_mode=2 completion_nsec=0 irqmode=0 submit_queues=32 • Each added thread alternates between the two available NUMA nodes (2 socket system) Rerunning the test case
  • 28.
  • 29.
  • 30. Single queue mode, basically all system time is spent banging on the device queue lock. Fio reports 95% of the time spent in the Kernel. Max completion time is 10x higher than blk-mq mode, 50th percentile is 24usec. In blk-mq mode, locking time is drastically reduced and the profile Is much cleaner. Fio reports 74% of the time spent in the kernel. 50th percentile is 3 usec.
  • 31. “But Jens, isn't most storage hardware still single queue? What about single queue performance on blk-mq?” — Astute audience member
  • 32.
  • 33. • SCSI had severe scaling issues ● Per LUN performance limited to ~150K IOPS • SCSI queuing layered on top of blk-mq • Initially by Nic Bellinger (Datera), later continued by Christoph Hellwig • Merged in 3.17 ● CONFIG_SCSI_MQ_DEFAULT=y ● scsi_mod.use_blk_mq=1 • Helped drive some blk-mq features Scsi-mq
  • 35. • Backport • Ran a pilot last half, results were so good it was immediately put in production. • Running in production at Facebook ● TAO, cache • Biggest win was in latency reductions ● FB workloads not that IOPS intensive ● But still saw sys % wins too At Facebook
  • 36.
  • 37.
  • 38. • As of 4.3-rc2 ● mtip32xx (micron SSD) ● NVMe ● virtio_blk, xen block driver ● rbd (ceph block) ● loop ● ubi ● SCSI • All over the map (which is good) Conversion progress
  • 39. • An IO scheduler • Better helpers for IRQ affinity mappings • IO accounting • IO polling • More conversions ● Long term goal remains killing off request_fn Future work