The Linux Block Layer
Built for Fast Storage
Light up your cloud!
Sagi Grimberg KernelTLV
27/6/18
First off, Happy 1st birthday Roni!
Who am I?
• Co-founder and Principal Architect @ Lightbits Labs
• Lightbits Labs is a stealth-mode startup pushing the software and hardware technology
boundaries in cloud-scale storage.
• We are looking for excellent people who enjoy a challenge for a variety of positions, including
both software and hardware. More information on our website at
http://www.lightbitslabs.com/#join or talk with me after the talk.
• Active contributor to Linux I/O and RDMA stack
• I am the maintainer of the iSCSI over RDMA (iSER) drivers
• I co-maintain the Linux NVMe subsystem
• Used to work for Mellanox on Storage, Networking, RDMA and
pretty much everything in between…
Where were we 10 years ago?
• Only rotating storage devices existed
• Devices were limited to hundreds of IOPS
• Device access latency was in the milliseconds ballpark
• The Linux block layer was sufficient to handle these devices
• High performance applications found clever ways to avoid storage
access as much as possible
What happened? (hint: HW)
• Flash SSDs started appearing in the data center
• IOPS went from hundreds to hundreds of thousands to millions
• Latency went from milliseconds to microseconds
• Fast interfaces evolved: PCIe (NVMe)
• Processor core counts increased a lot!
• And NUMA...
I/O Stack
What are the issues?
• Existing I/O stack had a lot of data sharing
• Between different applications (running on different cores)
• Between submission and completion
• Locking for synchronization
• Zero NUMA awareness
• All stack heuristics and optimizations centered around slow
storage
• The result is very bad (even negative) scaling, lots of wasted CPU
cycles and much higher latencies.
I/O Stack - A Little Deeper
I/O Stack - A Little Deeper
Hmmm...
- Requests are serialized
- Placed for staging
- Retrieved by the drivers
⇒ Lots of shared state!
I/O Stack - Performance
I/O Stack - Performance
Workaround: Bypass the request layer
Problems with bypass:
● Give up flow control
● Give up error handling
● Give up statistics
● Give up tagging and indexing
● Give up I/O deadlines
● Give up I/O scheduling
● Crazy code duplication -
mistakes are copied because
people get stuff wrong...
Most importantly, this is not the
Linux design approach!
Enter Block Multiqueue
• The old stack's bottleneck is not just “one serialization point”
• The stack needed a complete rewrite from the ground up
• What do we do:
• Go look at the networking stack, which solved the exact same issue 10+
years ago.
• But build from scratch for storage devices
Block Multiqueue - Goals
• Linear Scaling with CPU cores
• Split shared state between applications and
submission/completion
• Careful locality awareness: Cachelines, NUMA
• Pre-allocate resources as much as possible
• Provide full helper functionality - ease of implementation
• Support all existing HW
• Become THE queueing mode, not a “3rd one”
Block Multiqueue - Architecture
Block Multiqueue - Features
• Efficient tagging
• Locality of submissions and completions
• Careful to minimize cache pollution
• Smart error handling - minimum intrusion to the hot path
• Smart CPU <-> queue mappings
• Clean API
• Easy conversion (usually just cleanup old cruft)
Block Multiqueue - I/O Flow
Block Multiqueue - Completions
● Applications are usually “CPU-sticky”
● If the I/O completion arrives on the
“correct” CPU, complete it there
● Else, send an IPI to the “correct” CPU
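The completion rule above can be modelled in a few lines. This is only a user-space sketch, not kernel code: the names `Request`, `CompletionSteering`, `irq_complete` and `run_ipi_handler` are made up for illustration, and the "IPI" is simulated by parking the request on a per-CPU list that the target CPU drains later.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    tag: int
    submit_cpu: int  # CPU the application submitted the I/O from


class CompletionSteering:
    """Toy model: complete locally if possible, otherwise hand off to the submitter's CPU."""

    def __init__(self):
        self.completed_on = {}              # tag -> CPU that ran the completion
        self.ipi_queue = defaultdict(list)  # target CPU -> requests handed over
        self.ipis_sent = 0

    def irq_complete(self, req, irq_cpu):
        if irq_cpu == req.submit_cpu:
            # Interrupt landed on the "correct" CPU: complete right here.
            self._finish(req, irq_cpu)
        else:
            # Wrong CPU: simulate an IPI by queueing the request for the submitter's CPU.
            self.ipi_queue[req.submit_cpu].append(req)
            self.ipis_sent += 1

    def run_ipi_handler(self, cpu):
        # What the target CPU would do when the (simulated) IPI arrives.
        for req in self.ipi_queue.pop(cpu, []):
            self._finish(req, cpu)

    def _finish(self, req, cpu):
        self.completed_on[req.tag] = cpu
```

Either way the completion ends up running on the CPU that submitted the I/O, so the application's cache-hot data stays local; the hand-off path just costs one extra cross-CPU signal.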
Block Multiqueue - Tagging
• Almost every modern HW supports queueing
• Tags are used to identify individual I/Os in the presence of
out-of-order completions
• Tags are limited by the capabilities of the HW, so the driver needs to apply flow
control
Block Multiqueue - Tagging
• PerCPU Cacheline aware scalable bitmaps
• Efficient at near-exhaustion
• Rolling wake-ups
• Maps 1:1 with HW usage - no driver-specific tagging
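To make the bitmap idea concrete, here is a deliberately simplified sketch. It is not the kernel's implementation: `TagBitmap`, the tiny `bits_per_word` default and the per-CPU `hint` dictionary are assumptions chosen for illustration. The point it shows is that splitting the bitmap into words and letting each CPU start its search in a different word keeps CPUs from fighting over the same word (cacheline) until the tag space is nearly exhausted.

```python
class TagBitmap:
    """Toy tag allocator: one integer per 'word', each CPU prefers its own word."""

    def __init__(self, depth, bits_per_word=8):
        self.depth = depth
        self.bits_per_word = bits_per_word
        nwords = (depth + bits_per_word - 1) // bits_per_word
        self.words = [0] * nwords
        self.hint = {}  # cpu -> index of the word it last allocated from

    def _word_depth(self, w):
        # The last word may hold fewer valid bits than the others.
        return min(self.bits_per_word, self.depth - w * self.bits_per_word)

    def get(self, cpu):
        nwords = len(self.words)
        start = self.hint.get(cpu, cpu % nwords)
        for i in range(nwords):
            w = (start + i) % nwords
            for bit in range(self._word_depth(w)):
                if not self.words[w] & (1 << bit):
                    self.words[w] |= 1 << bit
                    self.hint[cpu] = w
                    return w * self.bits_per_word + bit
        return None  # out of tags: the caller has to wait (flow control)

    def put(self, tag):
        w, bit = divmod(tag, self.bits_per_word)
        self.words[w] &= ~(1 << bit)
```

The returned integer is used directly as the hardware tag, which is the 1:1 mapping the slide refers to.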
Block Multiqueue - Pre-allocations
• Eliminate hot path allocations
• Allocate all the request memory at initialization time
• Tag and request allocations are combined (no two step allocation)
• No driver per-request allocation
• Driver context and SG lists are placed in “slack space” behind the request
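A rough illustration of the pre-allocation idea, assuming a hypothetical `RequestPool` class and a caller-supplied `make_driver_ctx` factory (neither is a kernel API): every request, together with its driver-private context, is created once at setup time, and the tag is simply the index into that array, so getting a tag and getting a request are the same step.

```python
class RequestPool:
    """Toy pool: requests and their driver context are allocated once, up front."""

    def __init__(self, depth, make_driver_ctx):
        # All allocation happens here, never on the I/O hot path.
        self.requests = [
            {"tag": tag, "driver_ctx": make_driver_ctx()} for tag in range(depth)
        ]
        self.free_tags = list(range(depth - 1, -1, -1))

    def alloc(self):
        # Hot path: pop a tag and index into the preallocated array.
        if not self.free_tags:
            return None
        return self.requests[self.free_tags.pop()]

    def free(self, req):
        self.free_tags.append(req["tag"])
```

Freed requests are reused as-is, so a driver never needs its own per-request allocation.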
Block Multiqueue - Performance
Test-Case:
- null_blk driver
- fio
- 4K sync random read
- Dual socket system
Block Multiqueue - perf profiling
• Locking time is drastically reduced
• FIO reports much less “system time”
• Average and tail latencies are much lower and more consistent
Next on the Agenda: SCSI, NVMe and friends
• NVMe started as a bypass driver - converted to blk-mq
• mtip32xx (Micron)
• virtio_blk, xen
• rbd (ceph)
• loop
• more...
• SCSI midlayer was a bigger project..
SCSI multiqueue
• Needed the concept of “shared tag sets”
• Tags are now a property of the HBA and not the storage device
• Needed chunking of scatter-gather lists
• SCSI HBAs support huge sg lists, too much to allocate up front
• Needed “Head of queue” insertion
• For SCSI's complex error handling
• Removed the “big scsi host_busy lock”
• reduced the huge contention on the scsi target “busy” atomic
• Needed Partial completion support
• Needed BIDI support (yukk..)
• Hardened the stack a lot with lots of user bug reports.
Block multiqueue - MSI(X) based queue mapping
● Motivation: Eliminate the IPI case
● Expose MSI(X) vector affinity
mappings to the block layer
● Map the HW context mappings via
the underlying device IRQ mappings
● Offer MSI(X) allocation and correct
affinity spreading via the PCI
subsystem
● Take advantage in pci based drivers
(nvme, rdma, fc, hpsa, etc..)
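The mapping step can be sketched as follows. This is a simplified model, not the kernel helper: the function name `map_queues_from_irq_affinity`, the input format (one set of CPUs per vector) and the modulo fallback for uncovered CPUs are all assumptions made for the example.

```python
def map_queues_from_irq_affinity(vector_affinity, nr_cpus):
    """Build a CPU -> hardware queue table from per-vector IRQ affinity masks.

    vector_affinity[q] is the set of CPUs that the interrupt vector of
    hardware queue q is affined to.
    """
    cpu_to_queue = [None] * nr_cpus
    for queue, cpus in enumerate(vector_affinity):
        for cpu in cpus:
            # Submit on the queue whose interrupt will fire on this same CPU.
            cpu_to_queue[cpu] = queue
    for cpu in range(nr_cpus):
        if cpu_to_queue[cpu] is None:
            # Assumed fallback for CPUs no vector covers: spread them round-robin.
            cpu_to_queue[cpu] = cpu % len(vector_affinity)
    return cpu_to_queue
```

With this table, a request submitted on a CPU goes to a queue whose completion interrupt is delivered to that same CPU (or a close neighbour), which is what removes the IPI from the completion path.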
But wait, what about I/O schedulers?
• What we didn’t mention was that block multiqueue lacked a
proper I/O scheduler for approximately 3 years!
• A fundamental part of the I/O stack functionality is scheduling
• To optimize I/O sequentiality - Elevator algorithm
• Prevent write vs. read starvation (e.g. deadline scheduler)
• Fairness enforcement (e.g. CFQ)
• One can argue that I/O scheduling was designed for rotating media
• Optimized for reducing actuator seek time
NOT NECESSARILY TRUE - Flash can benefit from scheduling!
Start from ground up: WriteBack Throttling
• Since the dawn of time, Linux has sucked at buffered I/O
• Writes are naturally buffered and committed to disk in the
background
• Needs to have little or no impact on foreground activity
• What was needed:
• Plumb I/O stats for submitted reads and writes
• Track average latency in window granularity and what is currently enqueued
• Scale queue depth accordingly
• Prefer reads over buffered (non-direct I/O) writes
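The feedback loop described above could look roughly like this. It is a sketch under stated assumptions: `WritebackThrottle` is a hypothetical class, and the concrete policy (halve the allowed background-write depth when the window's mean read latency misses the target, otherwise grow it by one) is picked for illustration rather than taken from the kernel.

```python
class WritebackThrottle:
    """Toy model: shrink the background-write queue depth when reads get slow."""

    def __init__(self, target_read_latency_us, max_depth):
        self.target = target_read_latency_us
        self.max_depth = max_depth
        self.depth = max_depth  # how many background writes may be in flight
        self.window = []        # read latencies seen in the current window

    def record_read(self, latency_us):
        self.window.append(latency_us)

    def end_window(self):
        # Called once per monitoring window; returns the new allowed depth.
        if self.window:
            mean = sum(self.window) / len(self.window)
            if mean > self.target:
                self.depth = max(1, self.depth // 2)  # reads suffer: back off
            else:
                self.depth = min(self.max_depth, self.depth + 1)  # recover slowly
        self.window = []
        return self.depth
```

The depth never drops below one, so background writeback keeps making progress even while foreground reads are being protected.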
WriteBack Throttling - Performance
Before... After...
Now we are ready for I/O schedulers - MQ-Deadline
• Added I/O interception of requests for building schedulers on top
• First MQ conversion was for deadline scheduler
• Pretty easy and straightforward
• Just hold writes in a FIFO until their deadline hits
• The read FIFO is pass-through
• All percpu context - tradeoff?
• Remember: an I/O scheduler can hurt synthetic workloads, but the impact on
real-life workloads is what matters.
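A minimal sketch of the policy as the slide states it, with reads passed straight through and writes held back until their deadline expires. `DeadlineSketch`, its `write_expire` parameter and the abstract integer clock are illustrative assumptions; the real scheduler also sorts requests and batches dispatches, which is omitted here.

```python
from collections import deque


class DeadlineSketch:
    """Toy deadline policy: reads go first, writes wait until they expire."""

    def __init__(self, write_expire):
        self.write_expire = write_expire
        self.reads = deque()
        self.writes = deque()  # entries are (deadline, request)

    def insert(self, req, is_write, now):
        if is_write:
            self.writes.append((now + self.write_expire, req))
        else:
            self.reads.append(req)

    def dispatch(self, now):
        # An expired write wins, so writes cannot starve behind a stream of reads.
        if self.writes and self.writes[0][0] <= now:
            return self.writes.popleft()[1]
        if self.reads:
            return self.reads.popleft()
        # Nothing else to do: drain writes early.
        if self.writes:
            return self.writes.popleft()[1]
        return None
```

For example, a write queued at time 0 with `write_expire=5` is overtaken by reads dispatched before time 5, and is dispatched ahead of any remaining reads once time 5 is reached.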
Next: Kyber I/O Scheduler
• Targeted for fast multi-queue devices
• Lightweight
• Prefers reads over writes
• All I/Os are split into two queues (reads and writes)
• Reads are typically preferred
• Writes are throttled but not to a point of starvation
• The key is to keep submission queues short to guarantee latency
targets
• Kyber tracks I/O latency stats and adjusts queue size accordingly
• Aware of flash background operations.
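The "short queues per domain" idea can be pictured like this. It is an illustrative model only: `KyberSketch`, the two string keys and the halve/increment adjustment rule are assumptions, not the scheduler's actual heuristics.

```python
class KyberSketch:
    """Toy model: reads and writes each get their own cap on in-flight requests."""

    def __init__(self, read_depth, write_depth):
        self.max_write_depth = write_depth
        self.limit = {"read": read_depth, "write": write_depth}
        self.inflight = {"read": 0, "write": 0}

    def try_dispatch(self, kind):
        # A domain that has used up its budget must wait for a completion.
        if self.inflight[kind] >= self.limit[kind]:
            return False
        self.inflight[kind] += 1
        return True

    def complete(self, kind):
        self.inflight[kind] -= 1

    def adjust(self, observed_read_latency_us, target_us):
        # Reads missing their target: throttle writes, but never below one in flight.
        if observed_read_latency_us > target_us:
            self.limit["write"] = max(1, self.limit["write"] // 2)
        else:
            self.limit["write"] = min(self.max_write_depth, self.limit["write"] + 1)
```

Because the write limit bottoms out at one, writes are throttled but never starved, matching the bullet above.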
Next: BFQ I/O Scheduler
• Budget Fair Queueing scheduler
• A lot heavier
• Maintains a per-process I/O budget
• Maintains a bunch of per-process heuristics
• Yields the “best” I/O to queue at any given time
• A better fit for slower storage, especially rotating media and cheap &
deep SSDs.
But wait #2: What about ultra-low latency devices?
• New media is emerging with ultra-low latency (1-2 us)
• 3D XPoint
• Z-NAND
• Even with block MQ, the Linux I/O stack still has issues providing these
latencies
• It starts with IRQ (interrupt handling)
• If I/O is so fast, we might want to poll for completion and avoid paying the
cost of MSI(X) interrupt
Interrupt based I/O completion model
Polling based I/O completion model
IRQ vs. Polling
• Polling can remove the extra context switch from the completion
handling
So we should support polling!
• Add selective polling syscall interface:
• Use preadv2/pwritev2 with the RWF_HIPRI flag (IOCB_HIPRI inside the kernel)
• Saves roughly 25% of added latency
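From user space, requesting a polled read looks roughly like the sketch below. `os.preadv` and `os.RWF_HIPRI` are real Python APIs (Python 3.7+ on Linux) that wrap preadv2; the helper name `read_hipri` and its fallback behaviour are assumptions for this example. Note that the flag is only a hint: as far as I know, polling actually takes effect only for direct I/O on a device whose queue has polling enabled, and is silently ignored otherwise.

```python
import os


def read_hipri(fd, length, offset):
    """Read up to `length` bytes at `offset`, asking for polled completion if possible."""
    flags = getattr(os, "RWF_HIPRI", 0)  # missing on non-Linux or older Pythons
    if not hasattr(os, "preadv"):
        return os.pread(fd, length, offset)  # platform without preadv at all
    buf = bytearray(length)
    try:
        n = os.preadv(fd, [buf], offset, flags)
    except (OSError, NotImplementedError):
        # Kernel or libc without preadv2 flag support: retry as a plain read.
        n = os.preadv(fd, [buf], offset)
    return bytes(buf[:n])
```

The write side is symmetric, using `os.pwritev` with the same flag.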
But what about CPU% - can we go hybrid?
• Yes!
• We have all the statistics framework in place, let’s use it for hybrid polling!
• Wake up poller after ½ of the mean latency.
Hybrid polling - Performance
Hybrid polling - Adjust to I/O size
• Block layer sees I/Os of different sizes.
• Some are 4k, some are 256k and some are 1-2MB
• We need to consider that when tracking stats for Polling considerations
• Simple solution: Bucketize stats...
• 0-4k
• 4-16k
• 16k-64k
• >64k
• Now Hybrid polling has good QoS!
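Putting the two hybrid-polling ideas together (sleep for half the mean latency, and keep that mean per size bucket), a sketch might look like this. The bucket boundaries come from the slide; the names `bucket_index` and `HybridPollStats`, and the choice to poll immediately when a bucket has no samples yet, are assumptions.

```python
BUCKET_LIMITS = [4 * 1024, 16 * 1024, 64 * 1024]  # upper bounds; the last bucket is >64k


def bucket_index(size):
    for i, limit in enumerate(BUCKET_LIMITS):
        if size <= limit:
            return i
    return len(BUCKET_LIMITS)


class HybridPollStats:
    """Toy model: per-size-bucket mean latency, used to decide how long to sleep."""

    def __init__(self):
        nbuckets = len(BUCKET_LIMITS) + 1
        self.total_us = [0.0] * nbuckets
        self.count = [0] * nbuckets

    def record(self, size, latency_us):
        b = bucket_index(size)
        self.total_us[b] += latency_us
        self.count[b] += 1

    def sleep_before_poll_us(self, size):
        b = bucket_index(size)
        if self.count[b] == 0:
            return 0.0  # no history for this size class: start polling right away
        mean = self.total_us[b] / self.count[b]
        return mean / 2  # wake the poller after half of the mean latency
```

Without the buckets, a few slow 1-2MB I/Os would inflate the mean and make small 4k I/Os sleep past their own completion; keeping the stats separate avoids that.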
To Conclude
• Lots of interesting stuff happening in Linux
• Linux belongs to everyone, Get involved!
• We always welcome patches and bug reports :)
LIGHT UP YOUR CLOUD!

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 

The Linux Block Layer - Built for Fast Storage

  • 1. The Linux Block Layer: Built for Fast Storage. Light up your cloud! Sagi Grimberg, KernelTLV, 27/6/18
  • 2. First off, Happy 1st birthday Roni!
  • 3. Who am I? • Co-founder and Principal Architect @ Lightbits Labs • LightBits Labs is a stealth-mode startup pushing the software and hardware technology boundaries in cloud-scale storage. • We are looking for excellent people who enjoy a challenge for a variety of positions, including both software and hardware. More information on our website at http://www.lightbitslabs.com/#join or talk with me after the talk. • Active contributor to Linux I/O and RDMA stack • I am the maintainer of the iSCSI over RDMA (iSER) drivers • I co-maintain the Linux NVMe subsystem • Used to work for Mellanox on Storage, Networking, RDMA and pretty much everything in between…
  • 4. Where were we 10 years ago • Only rotating storage devices existed. • Devices were limited to hundreds of IOPS • Device access latency was in the milliseconds ballpark • The Linux block layer was sufficient to handle these devices • High performance applications found clever ways to avoid storage access as much as possible
  • 5. What happened? (hint: HW) • Flash SSDs started appearing in the DataCenter • IOPS went from Hundreds to Hundreds of thousands to Millions • Latency went from Milliseconds to Microseconds • Fast Interfaces evolved: PCIe (NVMe) • Processor core counts increased a lot! • And NUMA...
  • 7. What are the issues? • Existing I/O stack had a lot of data sharing • Between different applications (running on different cores) • Between submission and completion • Locking for synchronization • Zero NUMA awareness • All stack heuristics and optimizations centered around slow storage • The result is very bad (even negative) scaling, spending lots of CPU cycles and much, much higher latencies.
  • 8. I/O Stack - A little deeper
  • 9. I/O Stack - A little deeper. Hmmm... - Requests are serialized - Placed for staging - Retrieved by the drivers ⇒ Lots of shared state!
  • 10. I/O Stack - Performance
  • 11. I/O Stack - Performance
  • 12. Workaround: Bypass the request layer. Problems with bypass: ● Give up flow control ● Give up error handling ● Give up statistics ● Give up tagging and indexing ● Give up I/O deadlines ● Give up I/O scheduling ● Crazy code duplication - mistakes are copied because people get stuff wrong... Most importantly, this is not the Linux design approach!
  • 13. Enter Block Multiqueue • The old stack does not consist of “one serialization point” • The stack needed a complete re-write from the ground up • What do we do: • Go look at the networking stack, which solved the exact same issue 10+ years ago. • But build from scratch for storage devices
  • 14. Block Multiqueue - Goals • Linear Scaling with CPU cores • Split shared state between applications and submission/completion • Careful locality awareness: Cachelines, NUMA • Pre-allocate resources as much as possible • Provide full helper functionality - ease of implementation • Support all existing HW • Become THE queueing mode, not a “3rd one”
  • 15. Block Multiqueue - Architecture
  • 16. Block Multiqueue - Features • Efficient tagging • Locality of submissions and completions • Careful to minimize cache pollution • Smart error handling - minimum intrusion to the hot path • Smart CPU <-> queue mappings • Clean API • Easy conversion (usually just cleaning up old cruft)
  • 17. Block Multiqueue - I/O Flow
  • 18. Block Multiqueue - Completions ● Applications are usually “cpu-sticky” ● If the I/O completion arrives on the “correct” CPU, complete it there ● Otherwise, send an IPI to the “correct” CPU
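
The completion rule on this slide can be sketched as a small decision function. This is an illustrative model only, not kernel code: the `SubmittedIO` type and the `route_completion` name are hypothetical, and the "correct" CPU is assumed to be the one the I/O was submitted from.

```python
from dataclasses import dataclass


@dataclass
class SubmittedIO:
    """Hypothetical record of an in-flight I/O."""
    tag: int
    submit_cpu: int  # CPU the application submitted from (assumed "correct" CPU)


def route_completion(io: SubmittedIO, irq_cpu: int) -> str:
    """Decide where the completion runs, following the slide's rule."""
    if irq_cpu == io.submit_cpu:
        # Interrupt landed on the submitting CPU: complete in place.
        return "complete-local"
    # Interrupt landed elsewhere: bounce to the submitting CPU via an IPI.
    return f"ipi-to-cpu{io.submit_cpu}"
```

The cross-CPU branch is the costly one, which is what the MSI(X)-based queue mapping later in the deck sets out to eliminate.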
  • 19. Block Multiqueue - Tagging • Almost all modern HW supports queueing • Tags are used to identify individual I/Os in the presence of out-of-order completions • Tags are limited by the capabilities of the HW, so the driver needs to apply flow control
  • 20. Block Multiqueue - Tagging • PerCPU Cacheline aware scalable bitmaps • Efficient at near-exhaustion • Rolling wake-ups • Maps 1:1 to HW usage - no driver specific tagging
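
To make the tagging idea concrete, here is a minimal sketch of a bitmap tag allocator split into words, with a per-CPU hint so that different CPUs tend to start searching in different words. It only models the idea on the slide: the `TagMap` class, its parameters and the hint policy are assumptions for illustration, and the real kernel bitmap additionally handles atomics, cacheline alignment and wake-ups.

```python
class TagMap:
    """Toy tag allocator: a bitmap split into words plus per-CPU search hints."""

    def __init__(self, depth, bits_per_word=8, nr_cpus=4):
        self.depth = depth
        self.bits_per_word = bits_per_word
        nr_words = (depth + bits_per_word - 1) // bits_per_word
        self.words = [0] * nr_words
        # Spread the initial hints so CPUs start in different words.
        self.hint = [(cpu * bits_per_word) % depth for cpu in range(nr_cpus)]

    def _word_depth(self, w):
        # The last word may hold fewer valid bits than the others.
        return min(self.bits_per_word, self.depth - w * self.bits_per_word)

    def alloc(self, cpu):
        """Return a free tag, or None when all tags are in use."""
        nr_words = len(self.words)
        start = self.hint[cpu] // self.bits_per_word
        for i in range(nr_words):
            w = (start + i) % nr_words
            for bit in range(self._word_depth(w)):
                if not self.words[w] & (1 << bit):
                    self.words[w] |= 1 << bit
                    tag = w * self.bits_per_word + bit
                    self.hint[cpu] = tag
                    return tag
        return None  # exhausted: the caller has to wait (flow control)

    def free(self, tag, cpu):
        """Release a tag and remember it as the next search position."""
        w, bit = divmod(tag, self.bits_per_word)
        self.words[w] &= ~(1 << bit)
        self.hint[cpu] = tag
```

Because the returned tag is just an index below the queue depth, it can be handed to the hardware as-is, which is the 1:1 mapping the slide refers to.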
  • 21. Block Multiqueue - Pre-allocations • Eliminate hot path allocations • Allocate all request memory at initialization time • Tag and request allocations are combined (no two-step allocation) • No driver per-request allocation • Driver context and SG lists are placed in “slack space” behind the request
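
The pre-allocation scheme can be illustrated with a toy request pool in which every request, including its driver-private area, is created once at setup and then looked up by tag. The names `PreallocRequest`, `RequestPool` and `pdu_size` are hypothetical, and a Python `bytearray` only stands in for the "slack space" that the kernel places directly behind each request.

```python
class PreallocRequest:
    """One pre-allocated request with a driver-private area attached."""
    __slots__ = ("tag", "sector", "nr_bytes", "pdu")

    def __init__(self, tag, pdu_size):
        self.tag = tag
        self.sector = 0
        self.nr_bytes = 0
        self.pdu = bytearray(pdu_size)  # stands in for the driver "slack space"


class RequestPool:
    """All requests are allocated up front; getting one is just taking a tag."""

    def __init__(self, depth, pdu_size):
        self.requests = [PreallocRequest(t, pdu_size) for t in range(depth)]
        self.free_tags = list(range(depth - 1, -1, -1))

    def get_request(self, sector, nr_bytes):
        if not self.free_tags:
            return None  # no tag available, so no request either
        tag = self.free_tags.pop()
        rq = self.requests[tag]  # tag lookup doubles as request allocation
        rq.sector, rq.nr_bytes = sector, nr_bytes
        return rq

    def put_request(self, rq):
        self.free_tags.append(rq.tag)
```

Nothing is allocated in `get_request`, which is the point of the slide: the hot path only hands out memory that already exists.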
  • 22. Block Multiqueue - Performance. Test-Case: - null_blk driver - fio - 4K sync random read - Dual socket system
  • 23. Block Multiqueue - perf profiling • Locking time is drastically reduced • FIO reports much less “system time” • Average and tail latencies are much lower and more consistent
  • 24. Next on the Agenda: SCSI, NVMe and friends • NVMe started as a bypass driver - converted to blk-mq • mtip32xx (Micron) • virtio_blk, xen • rbd (ceph) • loop • more... • SCSI midlayer was a bigger project...
  • 25. SCSI multiqueue • Needed the concept of “shared tag sets” • Tags are now a property of the HBA and not the storage device • Needed chunking of scatter-gather lists • SCSI HBAs support huge sg lists, too much to allocate up front • Needed “Head of queue” insertion • For SCSI complex error handling • Removed the “big scsi host_busy lock” • Reduced the huge contention on the scsi target “busy” atomic • Needed Partial completion support • Needed BIDI support (yukk..) • Hardened the stack a lot with lots of user bug reports.
  • 26. Block multiqueue - MSI(X) based queue mapping ● Motivation: Eliminate the IPI case ● Expose MSI(X) vector affinity mappings to the block layer ● Map the HW context mappings via the underlying device IRQ mappings ● Offer MSI(X) allocation and correct affinity spreading via the PCI subsystem ● Take advantage in PCI-based drivers (nvme, rdma, fc, hpsa, etc.)
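
A rough sketch of the mapping idea: each hardware queue has an interrupt vector with a set of CPUs it targets, and every CPU is pointed at the queue whose vector targets it, so completions arrive on the submitting CPU. The function name, the list-of-sets input and the round-robin fallback for CPUs not covered by any vector are assumptions made for this example.

```python
def map_queues_from_irq_affinity(nr_cpus, vector_affinity):
    """Build a CPU -> HW queue table.

    vector_affinity[q] is the set of CPUs targeted by queue q's
    interrupt vector (a hypothetical input format).
    """
    cpu_to_queue = [None] * nr_cpus
    for queue, cpus in enumerate(vector_affinity):
        for cpu in cpus:
            cpu_to_queue[cpu] = queue
    # Assumed fallback: CPUs no vector targets are spread round-robin.
    for cpu in range(nr_cpus):
        if cpu_to_queue[cpu] is None:
            cpu_to_queue[cpu] = cpu % len(vector_affinity)
    return cpu_to_queue
```

For example, with four CPUs and two vectors targeting `{0, 1}` and `{2, 3}`, CPUs 0 and 1 submit to queue 0 and CPUs 2 and 3 submit to queue 1.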
  • 27. But wait, what about I/O schedulers? • What we didn’t mention was that block multiqueue lacked a proper I/O scheduler for approximately 3 years! • A fundamental part of the I/O stack functionality is scheduling • To optimize I/O sequentiality - Elevator algorithm • Prevent write vs. read starvation (e.g. deadline scheduler) • Fairness enforcement (e.g. CFQ) • One can argue that I/O scheduling was designed for rotating media • Optimized for reducing actuator seek time. NOT NECESSARILY TRUE - Flash can benefit from scheduling!
  • 28. Start from the ground up: WriteBack Throttling • Since the dawn of time, Linux has sucked at buffered I/O • Writes are naturally buffered and committed to disk in the background • Needs to have little or no impact on foreground activity • What was needed: • Plumb I/O stats for submitted reads and writes • Track average latency in window granularity and what is currently enqueued • Scale queue depth accordingly • Prefer reads over non-directIO writes
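
The feedback loop described on this slide can be sketched as follows: read latencies are collected per window, and at the end of each window the allowed depth of background writes is scaled down or up. This is a simplified model under stated assumptions: the `WritebackThrottle` class, the halve/increment policy and the use of the window mean are illustrative choices, not the kernel's actual algorithm.

```python
class WritebackThrottle:
    """Toy model: scale background write depth from observed read latency."""

    def __init__(self, target_read_lat_us, max_depth, min_depth=1):
        self.target = target_read_lat_us
        self.max_depth = max_depth
        self.min_depth = min_depth
        self.depth = max_depth   # current limit on in-flight buffered writes
        self.window = []         # read latencies seen in the current window

    def record_read(self, lat_us):
        self.window.append(lat_us)

    def end_window(self):
        """Close the monitoring window and adjust the write depth."""
        if self.window:
            mean = sum(self.window) / len(self.window)
            if mean > self.target:
                # Reads are suffering: back off background writes quickly.
                self.depth = max(self.min_depth, self.depth // 2)
            else:
                # Reads are fine: cautiously allow more writeback.
                self.depth = min(self.max_depth, self.depth + 1)
            self.window.clear()
        return self.depth

    def may_submit_write(self, inflight_writes):
        return inflight_writes < self.depth
```

Backing off fast and recovering slowly is a design choice of this sketch; it keeps foreground reads protected while writeback pressure is high.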
  • 29. WriteBack Throttling - Performance. Before... After...
  • 30. Now we are ready for I/O schedulers - MQ-Deadline • Added I/O interception of requests for building schedulers on top • First MQ conversion was for the deadline scheduler • Pretty easy and straightforward • Just delay the writes FIFO until the deadline hits • The reads FIFO is pass-through • All percpu context - tradeoff? • Remember: an I/O scheduler can hurt synthetic workloads, but can have a positive impact on real-life workloads.
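
The dispatch rule as stated on the slide (reads pass through, writes wait in a FIFO until their deadline hits) can be sketched like this. It models only that description: the `DeadlineSketch` class and its time units are hypothetical, and the in-kernel scheduler does more than this (for example sector sorting and batching).

```python
from collections import deque


class DeadlineSketch:
    """Toy deadline scheduler: reads first, writes once their deadline expires."""

    def __init__(self, write_expire):
        self.write_expire = write_expire
        self.reads = deque()
        self.writes = deque()  # entries are (deadline, request)

    def insert(self, op, request, now):
        if op == "read":
            self.reads.append(request)
        else:
            self.writes.append((now + self.write_expire, request))

    def dispatch(self, now):
        """Pick the next request to send to the driver, or None if idle."""
        # An expired write goes first so writes cannot starve.
        if self.writes and self.writes[0][0] <= now:
            return self.writes.popleft()[1]
        # Otherwise reads pass straight through in FIFO order.
        if self.reads:
            return self.reads.popleft()
        # Nothing to read: drain pending writes early.
        if self.writes:
            return self.writes.popleft()[1]
        return None
```

With `write_expire=5`, a write inserted at time 0 is held back behind reads until time 5 and is then dispatched ahead of any remaining reads.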
  • 31. Next: Kyber I/O Scheduler • Targeted for fast multi-queue devices • Lightweight • Prefers reads over writes • All I/Os are split into two queues (reads and writes) • Reads are typically preferred • Writes are throttled but not to the point of starvation • The key is to keep submission queues short to guarantee latency targets • Kyber tracks I/O latency stats and adjusts queue size accordingly • Aware of flash background operations.
  • 32. Next: BFQ I/O Scheduler • Budget fair queueing scheduler • A lot heavier • Maintains a Per-Process I/O budget • Maintains a bunch of Per-Process heuristics • Yields the “best” I/O to queue at any given time • A better fit for slower storage, especially rotating media and cheap & deep SSDs.
  • 33. But wait #2: What about Ultra-low latency devices • New media is emerging with Ultra-low latency (1-2 us) • 3D XPoint • Z-NAND • Even with block MQ, the Linux I/O stack still has issues providing these latencies • It starts with IRQ (interrupt handling) • If I/O is so fast, we might want to poll for completion and avoid paying the cost of an MSI(X) interrupt
  • 34. Interrupt based I/O completion model
  • 35. Polling based I/O completion model
  • 36. IRQ vs. Polling • Polling can remove the extra context switch from the completion handling
  • 37. So we should support polling! • Add a selective polling syscall interface: • Use preadv2/pwritev2 with the RWF_HIPRI flag (IOCB_HIPRI inside the kernel) • Saves roughly 25% of added latency
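
From user space, the high-priority flag can be passed through Python's `os.preadv`, which wraps `preadv2` where the platform provides it. The helper below is a minimal sketch: `read_high_priority` is a hypothetical name, the fallback to a plain read is an assumption for portability, and whether the kernel actually polls depends on the device, driver and I/O mode (polling is generally tied to direct I/O, which this sketch does not set up).

```python
import os


def read_high_priority(path, length, offset=0):
    """Read `length` bytes at `offset`, asking for high-priority completion."""
    buf = bytearray(length)
    # RWF_HIPRI is only defined where the platform supports it.
    flags = getattr(os, "RWF_HIPRI", 0)
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            n = os.preadv(fd, [buf], offset, flags)
        except OSError:
            # Flag rejected by this kernel or file system: plain positional read.
            n = os.preadv(fd, [buf], offset)
        return bytes(buf[:n])
    finally:
        os.close(fd)
```

The flag is a per-call hint, which is what makes the interface selective: only the I/Os that care about latency pay the CPU cost of polling.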
  • 38. But what about CPU% - can we go hybrid? • Yes! • We have all the statistics framework in place, let’s use it for hybrid polling! • Wake up poller after ½ of the mean latency.
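
The hybrid rule trades a little latency for CPU time: sleep for half of the mean completion latency, then busy-poll for the remainder. The function below is a back-of-the-envelope model of one I/O, with hypothetical names and microsecond units chosen for the example.

```python
def simulate_hybrid_poll(actual_latency_us, mean_latency_us):
    """Return (sleep_us, busy_poll_us) for one I/O under hybrid polling."""
    # Sleep for half the mean latency observed so far.
    sleep_us = mean_latency_us / 2
    # After waking, spin only for whatever time the I/O still needs.
    busy_poll_us = max(0.0, actual_latency_us - sleep_us)
    return sleep_us, busy_poll_us
```

For an I/O that takes 10 us while the mean is 8 us, the poller sleeps 4 us and spins for 6 us instead of 10 us. If an I/O finishes before the sleep ends, the completion is noticed late, which is why the mean latency estimate has to be accurate.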
  • 39. Hybrid polling - Performance
  • 40. Hybrid polling - Adjust to I/O size • Block layer sees I/Os of different sizes. • Some are 4k, some are 256K and some are 1-2MB • We need to consider that when tracking stats for Polling considerations • Simple solution: Bucketize stats... • 0-4k • 4-16k • 16k-64k • >64k • Now Hybrid polling has good QoS!
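
Bucketized statistics can be sketched as one running mean per size class, using the four ranges listed on the slide. The names below are hypothetical, and treating each upper bound as inclusive is an assumption, since the slide does not say which bucket a boundary size falls into.

```python
# Upper bounds of the first three buckets; anything larger goes in the last.
BUCKET_LIMITS = (4 * 1024, 16 * 1024, 64 * 1024)


def bucket_index(nr_bytes):
    """Map an I/O size to its stats bucket (upper bounds assumed inclusive)."""
    for i, limit in enumerate(BUCKET_LIMITS):
        if nr_bytes <= limit:
            return i
    return len(BUCKET_LIMITS)


class BucketedLatency:
    """Keep a separate mean completion latency per I/O size bucket."""

    def __init__(self):
        nr_buckets = len(BUCKET_LIMITS) + 1
        self.total = [0.0] * nr_buckets
        self.count = [0] * nr_buckets

    def record(self, nr_bytes, latency_us):
        b = bucket_index(nr_bytes)
        self.total[b] += latency_us
        self.count[b] += 1

    def mean(self, nr_bytes):
        """Mean latency for I/Os of this size class, or None if no samples."""
        b = bucket_index(nr_bytes)
        if self.count[b] == 0:
            return None
        return self.total[b] / self.count[b]
```

With per-bucket means, a 4k read picks its sleep time from other small I/Os rather than from a mean skewed by 1-2MB transfers.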
  • 41. To Conclude • Lots of interesting stuff happening in Linux • Linux belongs to everyone, get involved! • We always welcome patches and bug reports :)