Andrea Righi - andrea@betterlinux.com
Understand and optimize Linux I/O
Agenda
● Overview
● I/O Monitoring
● I/O Tuning
● Reliability
● Q/A
Overview
File I/O in Linux
READ vs WRITE
● READ
● synchronous: the CPU must wait for the READ to complete before continuing
● cached pages are easy to reclaim
● WRITE
● asynchronous: the CPU doesn't need to wait for the WRITE to complete before continuing
● cached pages are hard to reclaim (require I/O)
SYNC vs ASYNC
● SYNC I/O READ: the kernel queues a read operation for the data and returns only after the entire block of data has been read back; the process is in the waiting-for-I/O state (D)
● SYNC I/O WRITE: the kernel queues a write operation for the data and returns only after the entire block of data has been written; the process is in the waiting-for-I/O state (D)
● ASYNC I/O READ: the process repeatedly calls read() with the size of the data remaining, until the entire block is read (use select()/poll() to determine when some data is available)
● ASYNC I/O WRITE: the kernel updates the corresponding pages in the page cache and marks them dirty; control then quickly returns to the process, which can continue to run; the data is flushed later from a different context in more optimal ways (e.g., sequential rather than seeky blocks)
Block I/O subsystem
(simplified view)
● Processes submit I/O
requests to request queues
● The block I/O layer saves
the context of the process
that submits the request
● Requests can be merged
and reordered by the I/O
scheduler
● Minimize disk seeks,
optimize performance,
provide fairness among
processes
Plug / unplug
● When I/O is queued to a device, that device enters a plugged state
● I/O isn't immediately dispatched to the low-level device driver
● When a process is going to wait on the I/O to finish, the device is unplugged
● Allows merging of sequential requests (writing and reading bigger chunks of data avoids re-writes of the same hardware blocks and improves I/O throughput)
Flash memory
● Limited number of erase cycles
● Flash memory blocks have to be explicitly
erased before they can be written to
● Writes decrease flash memory lifetime
● Wear leveling: logical mapping to distribute
writes evenly among the available physical
blocks
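The wear-leveling idea above can be illustrated with a toy model. This is only a sketch under strong assumptions: a tiny device with four physical blocks, two logical blocks, and a policy that always picks the least-erased free block. Real flash translation layers are far more elaborate, and every name here (struct ftl, ftl_init, ftl_write) is hypothetical.

```c
#include <stddef.h>

#define NPHYS    4                 /* physical erase blocks (assumed) */
#define NLOGICAL 2                 /* logical blocks seen by the host (assumed) */
#define UNMAPPED ((size_t)-1)

struct ftl {
    size_t   map[NLOGICAL];        /* logical -> physical mapping */
    unsigned erases[NPHYS];        /* erase count per physical block */
    int      in_use[NPHYS];        /* 1 if the block holds live data */
};

void ftl_init(struct ftl *f)
{
    for (size_t i = 0; i < NLOGICAL; i++)
        f->map[i] = UNMAPPED;
    for (size_t i = 0; i < NPHYS; i++) {
        f->erases[i] = 0;
        f->in_use[i] = 0;
    }
}

/* Rewrite one logical block: erase and program the least-worn free
 * physical block, then release the block that held the old copy.
 * Returns the physical block chosen, or UNMAPPED for a bad index. */
size_t ftl_write(struct ftl *f, size_t lblock)
{
    if (lblock >= NLOGICAL)
        return UNMAPPED;

    /* NLOGICAL < NPHYS guarantees that at least one block is free. */
    size_t best = UNMAPPED;
    for (size_t i = 0; i < NPHYS; i++)
        if (!f->in_use[i] &&
            (best == UNMAPPED || f->erases[i] < f->erases[best]))
            best = i;

    f->erases[best]++;             /* explicit erase before the write */
    f->in_use[best] = 1;
    if (f->map[lblock] != UNMAPPED)
        f->in_use[f->map[lblock]] = 0;
    f->map[lblock] = best;
    return best;
}
```

In this model, rewriting the same logical block over and over still spreads the erase cycles across all physical blocks instead of wearing out a single one.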
I/O Monitoring
iostat
● Information about request queues associated with specific block devices
● Number of blocks read/written, average I/O wait time, disk utilization %, ...
● It does not provide detailed per-I/O information (pid? uid? ...)
iotop
● top-like I/O monitoring tool
● Disk read, write, I/O wait time percentage
● Still does not provide enough information on a per-I/O basis:
● per block device statistics are missing
● no statistics about the nature of each request
blktrace
● Low-overhead monitoring tool
● detailed per user / cgroup / thread and block
device statistics
● allows tracing events for specific operations performed on each I/O entering the block I/O layer
blktrace events
● Request queue entry allocated
● Sleep during request queue allocation
● Request queue insertion
● Front/back merge
● Re-queue of a request
● Request issued to underlying block device
● Request queue plug/unplug
● I/O remap (DM / MD)
● I/O split/bounce operation
● Request completed
● ...
blktrace operations
● RWBS
● 'R' - read
● 'W' - write
● 'D' - discard
● 'B' - barrier
● 'A' - ahead
● 'S' - synchronous
● 'M' - meta-data
● 'N' - No data
static void fill_rwbs(char *rwbs, struct blk_io_trace *t)
{
    int i = 0;

    /* operation type: discard, write, read, or no data */
    if (t->action & BLK_TC_DISCARD) rwbs[i++] = 'D';
    else if (t->action & BLK_TC_WRITE) rwbs[i++] = 'W';
    else if (t->bytes) rwbs[i++] = 'R';
    else rwbs[i++] = 'N';

    /* optional modifier flags */
    if (t->action & BLK_TC_AHEAD) rwbs[i++] = 'A';
    if (t->action & BLK_TC_BARRIER) rwbs[i++] = 'B';
    if (t->action & BLK_TC_SYNC) rwbs[i++] = 'S';
    if (t->action & BLK_TC_META) rwbs[i++] = 'M';

    rwbs[i] = '\0'; /* terminate the string */
}
blktrace actions
● Actions
● C -- complete
● D -- issued
● I -- inserted
● Q -- queued
● B -- bounced
● M -- back merge
● F -- front merge
● G -- get request
● S -- sleep
● P -- plug
● U -- unplug
● T -- unplug due to timer
● X -- split
● A -- remap
● m -- message
blktrace output
# btrace /dev/sda
...
8,0 1 26 0.054596889 228 Q WS 237891152 + 8 [jbd2/sda3-8]
8,0 1 27 0.054597204 228 M WS 237891152 + 8 [jbd2/sda3-8]
8,0 1 28 0.054597816 228 A WS 237891160 + 8 <- (8,3) 230983256
8,0 1 29 0.054598137 228 Q WS 237891160 + 8 [jbd2/sda3-8]
8,0 1 30 0.054598457 228 M WS 237891160 + 8 [jbd2/sda3-8]
8,0 1 31 0.054599094 228 A WS 237891168 + 8 <- (8,3) 230983264
8,0 1 32 0.054599399 228 Q WS 237891168 + 8 [jbd2/sda3-8]
8,0 1 33 0.054599725 228 M WS 237891168 + 8 [jbd2/sda3-8]
Fields: device (major,minor), CPU, sequence number, timestamp, PID, action, RWBS, start block + number of blocks, process name
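The field layout of the output above can be made concrete with a small parser. This is a sketch, assuming the default text format shown on the slide; struct bt_event and parse_btrace_line() are hypothetical names. It only accepts lines that end with a process name in square brackets, so remap (A) lines are rejected.

```c
#include <stdio.h>

/* Hypothetical container for one line of btrace text output. */
struct bt_event {
    unsigned major, minor;         /* device, e.g. 8,0 */
    unsigned cpu;
    unsigned long seq;             /* sequence number */
    double timestamp;              /* seconds since trace start */
    int pid;
    char action;                   /* Q, M, A, D, C, ... */
    char rwbs[8];                  /* e.g. "WS" */
    unsigned long long sector;     /* start block */
    unsigned nblocks;              /* number of blocks */
    char comm[32];                 /* process name */
};

/* Returns 0 if all eleven fields were parsed, -1 otherwise. */
int parse_btrace_line(const char *line, struct bt_event *ev)
{
    int n = sscanf(line, "%u,%u %u %lu %lf %d %c %7s %llu + %u [%31[^]]]",
                   &ev->major, &ev->minor, &ev->cpu, &ev->seq,
                   &ev->timestamp, &ev->pid, &ev->action, ev->rwbs,
                   &ev->sector, &ev->nblocks, ev->comm);
    return n == 11 ? 0 : -1;
}
```

Feeding it the first line of the trace yields device 8,0, CPU 1, sequence number 26, PID 228, action Q, RWBS "WS", start block 237891152, 8 blocks, and process name jbd2/sda3-8.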
I/O Tuning
Dirty pages writeback
● Writeback is the process of writing pages back to
persistent storage
● Dirty pages (grep Dirty /proc/meminfo)
● Slow down tasks that are creating more dirty pages than the system can handle: balance_dirty_pages()
● direct reclaim (bad I/O pattern)
● pause
● IO-less dirty throttling (>= 3.2)
● pdflush vs per backing device writeback (>= 2.6.32)
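The grep Dirty /proc/meminfo check mentioned above can also be done programmatically. The following is a minimal sketch, assuming the usual "Key:   value kB" line layout of /proc/meminfo; meminfo_lookup_kb() is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Scan a /proc/meminfo-style stream for "key:" and store its value (kB).
 * Returns 0 on success, -1 if the key is missing or malformed. */
int meminfo_lookup_kb(FILE *f, const char *key, unsigned long *kb)
{
    char line[256];
    size_t klen = strlen(key);

    while (fgets(line, sizeof line, f)) {
        /* Match the whole key, so that "Dirt" does not match "Dirty:". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return sscanf(line + klen + 1, "%lu", kb) == 1 ? 0 : -1;
    }
    return -1;
}
```

A caller would open /proc/meminfo with fopen() and pass "Dirty" (or "Writeback") as the key; polling the value once per second during a heavy write shows how quickly dirty pages build up and drain.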
Background vs direct cleaning
● From Documentation/sysctl/vm.txt:
● Background cleaning (kernel flusher threads):
– /proc/sys/vm/dirty_background_ratio
– /proc/sys/vm/dirty_background_bytes
● Direct cleaning (normal tasks generating disk
writes):
– /proc/sys/vm/dirty_ratio
– /proc/sys/vm/dirty_bytes
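Each pair of knobs above expresses one threshold in two ways: Documentation/sysctl/vm.txt describes the *_bytes file as the counterpart of the *_ratio file, with only one of the two in effect at a time (the other reads back as 0). A rough sketch of how the effective threshold could be derived follows; dirty_threshold() is a hypothetical helper, and the caller is assumed to supply the amount of dirtyable memory, which the kernel accounts for in a more involved way.

```c
#include <stdint.h>

/* Effective threshold in bytes.
 * bytes_knob:      value of dirty_bytes or dirty_background_bytes
 * ratio_knob:      value of dirty_ratio or dirty_background_ratio (percent)
 * dirtyable_bytes: memory the ratio applies to (assumption: given by caller) */
uint64_t dirty_threshold(uint64_t bytes_knob, unsigned ratio_knob,
                         uint64_t dirtyable_bytes)
{
    if (bytes_knob != 0)
        return bytes_knob;                 /* the bytes knob takes precedence */
    return dirtyable_bytes * ratio_knob / 100;
}
```

With 1000 bytes of dirtyable memory, a background ratio of 10 and a direct ratio of 20, flusher threads would start at 100 dirty bytes and writing tasks would be throttled at 200.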
Flusher thread tuning
● /proc/sys/vm/dirty_writeback_centisecs
● Wake up the kernel flusher threads every dirty_writeback_centisecs (hundredths of a second)
● /proc/sys/vm/dirty_expire_centisecs
● Define when dirty data is old enough to be eligible
for writeout by kernel flusher threads
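Both files hold a single integer expressed in hundredths of a second. A minimal sketch for reading such a knob and converting it to milliseconds is shown below; read_knob() and centisecs_to_ms() are hypothetical helper names.

```c
#include <stdio.h>

/* Read one integer from a sysctl-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_knob(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%ld", value) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Convert hundredths of a second to milliseconds. */
long centisecs_to_ms(long cs)
{
    return cs * 10;
}
```

For instance, a value of 500 read from /proc/sys/vm/dirty_writeback_centisecs corresponds to a wake-up every 5000 ms, that is, every 5 seconds.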
Swap I/O
● /proc/sys/vm/swappiness
● anonymous vs file LRU scanning ratio
– high value: aggressive swap
– low value: aggressive file pages reclaim
Filesystem I/O
● ext3: data=journal | ordered | writeback
● journal: meta-data + data committed in the journal
● ordered: data committed before its meta-data
● writeback: meta-data and data committed out-of-order
● ext4: delayed allocation
● block allocation deferred until background writeback
● improve chances of using contiguous blocks (threads writing at
different offsets simultaneously)
● xfs, ext4, zfs, …
● zero-length file problem:
– open-write-close-rename
Filesystem I/O tuning
● noatime, nodiratime:
● do not update inode access times
● relatime:
● access time is updated if the previous access time was earlier than the current modify or change time (doesn't break applications like mutt that need to know if a file has been read since the last time it was modified)
● commit=N
● sync data and meta-data every N seconds (default = 5s)
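The relatime rule stated above boils down to a simple comparison of timestamps. The sketch below encodes only that rule; relatime_needs_update() is a hypothetical name, and treating equal timestamps as "needs update" is an assumption of this example. Any additional conditions the kernel applies are left out.

```c
#include <stdbool.h>
#include <time.h>

/* Should a read update the access time under relatime?
 * atime: previous access time
 * mtime: last modification of the file data
 * chg:   last change of the inode (ctime) */
bool relatime_needs_update(time_t atime, time_t mtime, time_t chg)
{
    /* Update only when the file was modified or changed since the last
     * recorded access (equal timestamps count as "since": an assumption). */
    return atime <= mtime || atime <= chg;
}
```

A mail reader can therefore still detect unread mail: a mailbox whose modification time is newer than its access time has not been read since the last delivery.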
I/O tuning at different layers
● Applications
● LD_PRELOAD
● VM
● caching
● Filesystem
● mount options / filesystem tuning
● Block device
● caching
Reliability
I/O data flow
● Application to library buffer
● fwrite(), fprintf(), etc.
● Library to OS buffer
● write()
● OS buffer to disk
● paged out, periodic flush (5 sec usually)
● fsync(), fdatasync(), sync(), sync_file_range()
Simple use case
● User hits “Save” in Word Processor
● Expects that data to be on disk when saved
● If power goes out
● The last saved version of my data is there
● If there isn't an explicit save, some recent version of
my data should be okay
Buggy implementation
struct wp_doc {
    char *document;
    size_t len;
};
struct wp_doc d;
...
FILE *f;
f = fopen("document.txt", "w");
fwrite(d.document, d.len, 1, f);
fclose(f);
Bugs
● No error checking
● fopen (did we open the file?)
● fwrite (did we write the entire file?)
● Crash in the middle of fwrite()
● document corrupted
● No sync
● close does not imply sync()!
Reliable implementation
struct wp_doc {
    char *document;
    size_t len;
};
struct wp_doc d;
...
FILE *f;
size_t n;
int e;
/* temp file: never overwrite the last good version in place */
f = fopen(".document.txt", "w");
/* error checking on every call */
if (!f) return errno;
n = fwrite(d.document, d.len, 1, f);
if (n != 1) { e = errno; fclose(f); return e; }
/* flush libc buffer */
if (fflush(f) != 0) { e = errno; fclose(f); return e; }
/* sync to disk before rename */
if (fsync(fileno(f)) == -1) { e = errno; fclose(f); return e; }
if (fclose(f) != 0) return errno;
if (rename(".document.txt", "document.txt") == -1) return errno;
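The fragment above can be packaged as a self-contained function. This is a sketch, not the author's code: save_document() and its parameters are hypothetical, the temporary file is assumed to live in the same directory as the target so that rename() stays on one filesystem, and EIO is returned as a fallback when a failing call leaves errno unset.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `path` with `len` bytes of `data`, staged in `tmp_path`.
 * Returns 0 on success or an errno value on failure. */
int save_document(const char *path, const char *tmp_path,
                  const char *data, size_t len)
{
    int err;
    FILE *f = fopen(tmp_path, "w");        /* temp file */
    if (!f)
        return errno;

    if (len > 0 && fwrite(data, len, 1, f) != 1)
        goto fail;
    if (fflush(f) != 0)                    /* flush libc buffer */
        goto fail;
    if (fsync(fileno(f)) == -1)            /* sync to disk before rename */
        goto fail;

    if (fclose(f) != 0) {
        err = errno;
        unlink(tmp_path);
        return err ? err : EIO;
    }
    if (rename(tmp_path, path) == -1) {    /* atomically publish new version */
        err = errno;
        unlink(tmp_path);
        return err ? err : EIO;
    }
    return 0;

fail:
    err = errno;                           /* save before fclose() clobbers it */
    fclose(f);
    unlink(tmp_path);
    return err ? err : EIO;
}
```

A call such as save_document("document.txt", ".document.txt", d.document, d.len) leaves either the old or the new version of the document on disk, never a partially written one. Some applications additionally fsync() the containing directory so that the rename itself is durable; that step is omitted here.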
References
● Block I/O layer tracing - blktrace:
http://www.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf
● Eat my data:
http://www.flamingspork.com/talks/2007/06/eat_my_data.odp
● fsync() problems with Firefox:
http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/
● Linux documentation
● Documentation/sysctl/vm.txt
● Documentation/laptops/laptop-mode.txt
Q/A
● You're very welcome!
● Twitter
● @arighi
● #bem2014

Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 

Understand and optimize Linux I/O

  • 1. Andrea Righi - andrea@betterlinux.com Understanding and optimizing I/O on Linux (original title: Conoscere e ottimizzare l'I/O su Linux)
  • 2. Agenda
      ● Overview
      ● I/O Monitoring
      ● I/O Tuning
      ● Reliability
      ● Q/A
  • 3. Overview
  • 4. File I/O in Linux
  • 5. READ vs WRITE
      ● READ
          ● synchronous: the CPU must wait for the READ to complete before it can continue
          ● cached pages are easy to reclaim
      ● WRITE
          ● asynchronous: the CPU does not need to wait for the WRITE to complete before it continues
          ● cached pages are hard to reclaim (they require I/O)
  • 6. SYNC vs ASYNC
      ● SYNC I/O READ: the kernel queues a read operation for the data and returns only after the entire block of data has been read back; meanwhile the process is in the waiting-for-I/O state (D)
      ● SYNC I/O WRITE: the kernel queues a write operation for the data and returns only after the entire block of data has been written; meanwhile the process is in the waiting-for-I/O state
      ● ASYNC I/O READ: the process repeatedly calls read() with the size of the data remaining, until the entire block is read (use select()/poll() to determine when some data is available)
      ● ASYNC I/O WRITE: the kernel updates the corresponding pages in the page cache and marks them dirty; control then quickly returns to the process, which can continue to run; the data is flushed later from a different context in a more optimal order (e.g., sequential rather than seeky blocks)
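The ASYNC I/O READ pattern above can be sketched in C roughly as follows. This is a minimal illustration, not code from the talk: `read_full` is a hypothetical helper name, and the sketch assumes a descriptor such as a pipe or socket (possibly opened with O_NONBLOCK), since reads from regular files normally do not return EAGAIN.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly `len` bytes from `fd`, waiting with
 * poll() whenever no data is available yet. Returns the number of bytes
 * read (less than `len` only on end-of-file), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    while (done < len) {
        /* Ask only for the data still remaining. */
        ssize_t n = read(fd, p + done, len - done);

        if (n > 0) {
            done += (size_t)n;
        } else if (n == 0) {
            break;                      /* end-of-file */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Nothing to read right now: sleep until data arrives. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
        } else if (errno != EINTR) {
            return -1;                  /* real error */
        }
    }
    return (ssize_t)done;
}
```

A caller would use it as `read_full(fd, buf, sizeof(buf))` and compare the return value with the requested size to detect a short read at end-of-file.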
  • 7. Block I/O subsystem (simplified view)
      ● Processes submit I/O requests to request queues
      ● The block I/O layer saves the context of the process that submits the request
      ● Requests can be merged and reordered by the I/O scheduler
      ● Minimize disk seeks, optimize performance, provide fairness among processes
  • 8. Plug / unplug
      ● When I/O is queued to a device, that device enters a plugged state
      ● I/O isn't immediately dispatched to the low-level device driver
      ● When a process is going to wait on the I/O to finish, the device is unplugged
      ● This allows sequential requests to be merged (writing and reading bigger chunks of data avoids re-writes of the same hardware blocks and improves I/O throughput)
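The merging of sequential requests mentioned above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `struct io_req`, `try_back_merge` and `try_front_merge` are hypothetical names, and the real block layer checks much more (direction, size limits, device boundaries) before merging.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy request: a contiguous run of sectors, e.g. sector 1000, length 8. */
struct io_req {
    uint64_t sector;    /* first sector of the request */
    uint32_t nsectors;  /* length in sectors */
};

/* Back merge: the new I/O starts exactly where `req` ends,
 * so it can be appended to the existing request. */
bool try_back_merge(struct io_req *req, uint64_t sector, uint32_t nsectors)
{
    if (req->sector + req->nsectors != sector)
        return false;
    req->nsectors += nsectors;
    return true;
}

/* Front merge: the new I/O ends exactly where `req` starts,
 * so it can be prepended to the existing request. */
bool try_front_merge(struct io_req *req, uint64_t sector, uint32_t nsectors)
{
    if (sector + nsectors != req->sector)
        return false;
    req->sector = sector;
    req->nsectors += nsectors;
    return true;
}
```

While the queue is plugged, each newly queued I/O would first be offered to the pending requests through checks like these, and only inserted as a separate request if no merge is possible.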
  • 9. Flash memory
      ● Limited amount of erase cycles
      ● Flash memory blocks have to be explicitly erased before they can be written to
      ● Writes decrease flash memory lifetime
      ● Wear leveling: logical mapping to distribute writes evenly among the available physical blocks
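The wear-leveling idea above can be sketched with a toy model. Everything here is an assumption made for illustration (the names `struct flash`, `flash_init`, `flash_write`, the four-block size, and the "pick the least-erased free block" policy); real flash translation layers are far more elaborate.

```c
#include <stdint.h>

#define NBLOCKS 4   /* illustrative number of physical blocks */

struct flash {
    uint32_t erase_count[NBLOCKS]; /* erases performed on each physical block */
    int      in_use[NBLOCKS];      /* 1 if the physical block holds live data */
    int      map[NBLOCKS];         /* logical -> physical block, -1 = unmapped */
};

void flash_init(struct flash *f)
{
    for (int i = 0; i < NBLOCKS; i++) {
        f->erase_count[i] = 0;
        f->in_use[i] = 0;
        f->map[i] = -1;
    }
}

/* Rewrite one logical block: pick the least-erased free physical block,
 * erase it (a block must be erased before it can be written), remap the
 * logical block to it and release the old copy.
 * Returns the physical block used, or -1 if no free block exists. */
int flash_write(struct flash *f, int logical)
{
    int best = -1;

    for (int p = 0; p < NBLOCKS; p++) {
        if (f->in_use[p])
            continue;
        if (best < 0 || f->erase_count[p] < f->erase_count[best])
            best = p;
    }
    if (best < 0)
        return -1;

    f->erase_count[best]++;
    f->in_use[best] = 1;
    if (f->map[logical] >= 0)
        f->in_use[f->map[logical]] = 0;   /* old copy becomes free */
    f->map[logical] = best;
    return best;
}
```

Rewriting the same logical block repeatedly then rotates through the physical blocks instead of wearing out a single one.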
  • 10. I/O Monitoring
  • 11. iostat
      ● Information about the request queues associated with specific block devices
      ● Number of blocks read/written, average I/O wait time, disk utilization %, ...
      ● It does not provide detailed per-I/O information (pid? uid? ...)
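As a rough illustration of where such per-device numbers come from, the sketch below parses one line in the format of /proc/diskstats and derives a utilization percentage from two samples. The struct and function names are hypothetical, and only the first 13 fields of a line are read; this is not how iostat itself is implemented.

```c
#include <stdio.h>

/* A subset of the per-device counters found in a /proc/diskstats line. */
struct disk_stats {
    char name[32];
    unsigned long long reads;           /* reads completed */
    unsigned long long sectors_read;
    unsigned long long writes;          /* writes completed */
    unsigned long long sectors_written;
    unsigned long long io_ticks_ms;     /* time spent doing I/O, in ms */
};

/* Parse one diskstats-style line. Returns 0 on success, -1 otherwise. */
int parse_diskstats_line(const char *line, struct disk_stats *s)
{
    unsigned int major, minor;
    unsigned long long rd_merges, rd_ms, wr_merges, wr_ms, in_flight;

    int n = sscanf(line,
                   "%u %u %31s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &major, &minor, s->name,
                   &s->reads, &rd_merges, &s->sectors_read, &rd_ms,
                   &s->writes, &wr_merges, &s->sectors_written, &wr_ms,
                   &in_flight, &s->io_ticks_ms);
    return n == 13 ? 0 : -1;
}

/* Disk utilization over an interval: share of time the device was busy. */
double disk_util_percent(const struct disk_stats *prev,
                         const struct disk_stats *cur,
                         double interval_ms)
{
    return 100.0 * (double)(cur->io_ticks_ms - prev->io_ticks_ms) / interval_ms;
}
```

Sampling the same device twice, one second apart, and passing `1000.0` as the interval gives a utilization figure; note that the counters are per device, which is why no pid or uid can be recovered from them.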
  • 12. iotop
      ● top-like I/O monitoring tool
      ● Disk read, write, I/O wait time percentage
      ● Still does not provide enough information on a per-I/O basis:
          ● per block device statistics are missing
          ● no statistics about the nature of each request
  • 13. blktrace
      ● Low-overhead monitoring tool
      ● detailed per user / cgroup / thread and block device statistics
      ● allows tracing events for specific operations performed on each I/O entering the block I/O layer
  • 14. blktrace events
      ● Request queue entry allocated
      ● Sleep during request queue allocation
      ● Request queue insertion
      ● Front/back merge
      ● Re-queue of a request
      ● Request issued to underlying block device
      ● Request queue plug/unplug
      ● I/O remap (DM / MD)
      ● I/O split/bounce operation
      ● Request completed
      ● ...
  • 15. blktrace operations
      ● RWBS
          ● 'R' - read
          ● 'W' - write
          ● 'D' - discard
          ● 'B' - barrier
          ● 'A' - ahead (read-ahead)
          ● 'S' - synchronous
          ● 'M' - meta-data
          ● 'N' - no data

      static void fill_rwbs(char *rwbs, struct blk_io_trace *t)
      {
              int i = 0;

              /* First character: the kind of operation */
              if (t->action & BLK_TC_DISCARD)
                      rwbs[i++] = 'D';
              else if (t->action & BLK_TC_WRITE)
                      rwbs[i++] = 'W';
              else if (t->bytes)
                      rwbs[i++] = 'R';
              else
                      rwbs[i++] = 'N';

              /* Following characters: optional modifier flags */
              if (t->action & BLK_TC_AHEAD)
                      rwbs[i++] = 'A';
              if (t->action & BLK_TC_BARRIER)
                      rwbs[i++] = 'B';
              if (t->action & BLK_TC_SYNC)
                      rwbs[i++] = 'S';
              if (t->action & BLK_TC_META)
                      rwbs[i++] = 'M';

              rwbs[i] = '\0';   /* NUL-terminate the string */
      }
  • 16. blktrace actions
      ● Actions
          ● C -- complete
          ● D -- issued
          ● I -- inserted
          ● Q -- queued
          ● B -- bounced
          ● M -- back merge
          ● F -- front merge
          ● G -- get request
          ● S -- sleep
          ● P -- plug
          ● U -- unplug
          ● T -- unplug due to timer
          ● X -- split
          ● A -- remap
          ● m -- message
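When post-processing traces, the action letters listed above can be turned back into readable names with a small lookup. The function name `blk_action_name` is hypothetical; the mapping itself is taken directly from the list on the slide.

```c
/* Map a blktrace action character to the description given on the slide. */
const char *blk_action_name(char action)
{
    switch (action) {
    case 'C': return "complete";
    case 'D': return "issued";
    case 'I': return "inserted";
    case 'Q': return "queued";
    case 'B': return "bounced";
    case 'M': return "back merge";
    case 'F': return "front merge";
    case 'G': return "get request";
    case 'S': return "sleep";
    case 'P': return "plug";
    case 'U': return "unplug";
    case 'T': return "unplug due to timer";
    case 'X': return "split";
    case 'A': return "remap";
    case 'm': return "message";
    default:  return "unknown";
    }
}
```

Note that the lookup is case-sensitive: 'M' is a back merge while 'm' is a message.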
  • 17. blktrace output
      # btrace /dev/sda
      ...
      8,0 1 26 0.054596889 228 Q WS 237891152 + 8 [jbd2/sda3-8]
      8,0 1 27 0.054597204 228 M WS 237891152 + 8 [jbd2/sda3-8]
      8,0 1 28 0.054597816 228 A WS 237891160 + 8 <- (8,3) 230983256
      8,0 1 29 0.054598137 228 Q WS 237891160 + 8 [jbd2/sda3-8]
      8,0 1 30 0.054598457 228 M WS 237891160 + 8 [jbd2/sda3-8]
      8,0 1 31 0.054599094 228 A WS 237891168 + 8 <- (8,3) 230983264
      8,0 1 32 0.054599399 228 Q WS 237891168 + 8 [jbd2/sda3-8]
      8,0 1 33 0.054599725 228 M WS 237891168 + 8 [jbd2/sda3-8]

      Fields: device (major,minor), CPU, sequence number, timestamp, PID, action, RWBS, start block + number of blocks, process name
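The field layout described above can be parsed with a single sscanf() call. This is a minimal sketch: `struct bt_event` and `parse_btrace_line` are hypothetical names, and the format string only covers lines shaped like the ones shown (events without a "start + count" pair would be rejected).

```c
#include <stdio.h>

/* One parsed line of btrace output, up to the "start + count" pair. */
struct bt_event {
    unsigned int major, minor;   /* device */
    unsigned int cpu;
    unsigned long long seq;      /* sequence number */
    double timestamp;            /* seconds since the trace started */
    int pid;
    char action[4];              /* e.g. "Q", "M", "A" */
    char rwbs[8];                /* e.g. "WS" */
    unsigned long long sector;   /* start block */
    unsigned int nblocks;        /* number of blocks */
};

/* Returns 0 if the line matched the expected layout, -1 otherwise. */
int parse_btrace_line(const char *line, struct bt_event *e)
{
    int n = sscanf(line, "%u,%u %u %llu %lf %d %3s %7s %llu + %u",
                   &e->major, &e->minor, &e->cpu, &e->seq, &e->timestamp,
                   &e->pid, e->action, e->rwbs, &e->sector, &e->nblocks);
    return n == 10 ? 0 : -1;
}
```

Whatever follows the block count (the process name in brackets, or the remap source after `<-`) is left unparsed here and could be handled per action type.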
  • 18. I/O Tuning
  • 19. Dirty pages writeback
      ● Writeback is the process of writing pages back to persistent storage
      ● Dirty pages (grep Dirty /proc/meminfo)
      ● Slow down tasks that are creating more dirty pages than the system can handle: balance_dirty_pages()
          ● direct reclaim (bad I/O pattern)
          ● pause
          ● IO-less dirty throttling (>= 3.2)
      ● pdflush vs per backing device writeback (>= 2.6.32)
  • 20. Background vs direct cleaning
      ● From Documentation/sysctl/vm.txt:
          ● Background cleaning (kernel flusher threads):
              – /proc/sys/vm/dirty_background_ratio
              – /proc/sys/vm/dirty_background_bytes
          ● Direct cleaning (normal tasks generating disk writes):
              – /proc/sys/vm/dirty_ratio
              – /proc/sys/vm/dirty_bytes
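How the ratio and bytes variants of these settings combine can be sketched as below. This is a simplified model, not the kernel's code: the function and enum names are hypothetical, it assumes that a non-zero *_bytes value takes precedence over the corresponding *_ratio (only one of the two is active at a time), and it ignores the smoother throttling the kernel actually applies.

```c
#include <stdint.h>

/* Effective limit in bytes: an explicit *_bytes value wins, otherwise
 * the *_ratio percentage of the dirtyable memory is used. */
uint64_t dirty_limit_bytes(uint64_t bytes_setting, unsigned int ratio_setting,
                           uint64_t dirtyable_bytes)
{
    if (bytes_setting != 0)
        return bytes_setting;
    return dirtyable_bytes / 100 * ratio_setting;
}

enum wb_state {
    WB_IDLE,              /* below the background limit: nothing to do */
    WB_BACKGROUND_FLUSH,  /* flusher threads write back in the background */
    WB_THROTTLE_TASK      /* the task generating writes is slowed down */
};

/* Simplified decision based on the current amount of dirty memory. */
enum wb_state writeback_state(uint64_t dirty, uint64_t background_limit,
                              uint64_t dirty_limit)
{
    if (dirty >= dirty_limit)
        return WB_THROTTLE_TASK;
    if (dirty >= background_limit)
        return WB_BACKGROUND_FLUSH;
    return WB_IDLE;
}
```

For example, with 1,000,000 bytes of dirtyable memory, a background ratio of 10 and a dirty ratio of 20, background flushing would start at 100,000 dirty bytes and writers would be throttled at 200,000.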
  • 21. Flusher thread tuning
      ● /proc/sys/vm/dirty_writeback_centisecs
          ● Wake up kernel flusher threads every dirty_writeback_centisecs
      ● /proc/sys/vm/dirty_expire_centisecs
          ● Define when dirty data is old enough to be eligible for writeout by kernel flusher threads
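To inspect these tunables from a program, a value can be read from its /proc/sys file and converted from centiseconds to seconds. The helper names below are hypothetical, and the sketch assumes the file contains a single non-negative integer.

```c
#include <stdio.h>

/* Read a single integer from a /proc/sys-style file.
 * Returns the value, or -1 if the file cannot be opened or parsed. */
long read_sysctl_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long value;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* The *_centisecs tunables are expressed in hundredths of a second. */
double centisecs_to_seconds(long centisecs)
{
    return centisecs / 100.0;
}
```

A call such as `centisecs_to_seconds(read_sysctl_long("/proc/sys/vm/dirty_writeback_centisecs"))` would then give the flusher wake-up period in seconds.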
  • 22. Swap I/O
      ● /proc/sys/vm/swappiness
          ● anonymous vs file LRU scanning ratio
              – high value: aggressive swap
              – low value: aggressive file pages reclaim
  • 23. Filesystem I/O
      ● ext3: data=journal | ordered | writeback
          ● journal: meta-data + data committed in the journal
          ● ordered: data committed before its meta-data
          ● writeback: meta-data and data committed out-of-order
      ● ext4: delayed allocation
          ● block allocation deferred until background writeback
          ● improve chances of using contiguous blocks (threads writing at different offsets simultaneously)
      ● xfs, ext4, zfs, …
          ● zero-length file problem:
              – open-write-close-rename
  • 24. Filesystem I/O tuning
      ● noatime, nodiratime:
          ● do not update inode access times
      ● relatime:
          ● access time is updated only if the previous access time was earlier than the current modify or change time (this doesn't break applications like mutt that need to know whether a file has been read since the last time it was modified)
      ● commit=N
          ● sync data and meta-data every N seconds (default = 5s)
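The relatime rule above can be written down as a small predicate. This is a sketch, not kernel code: the function name is hypothetical, and the final 24-hour check is not on the slide; it reflects my understanding of current kernel behaviour and should be treated as an assumption.

```c
#include <stdbool.h>
#include <time.h>

/* Decide whether an access should update the inode's access time
 * under relatime semantics. */
bool relatime_need_update(time_t atime, time_t mtime, time_t chg_time, time_t now)
{
    /* File modified since (or at) the last recorded access. */
    if (mtime >= atime)
        return true;
    /* Inode changed since (or at) the last recorded access. */
    if (chg_time >= atime)
        return true;
    /* Assumption (not on the slide): refresh atime at least once a day. */
    if (now - atime >= 24 * 60 * 60)
        return true;
    return false;
}
```

A mail reader checking for unread mail relies on exactly the first condition: if the mailbox was modified after it was last read, the next read still updates the access time.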
  • 25. I/O tuning at different layers
      ● Applications
          ● LD_PRELOAD
      ● VM
          ● caching
      ● Filesystem
          ● mount options / filesystem tuning
      ● Block device
          ● caching
  • 26. Reliability
  • 27. I/O data flow
      ● Application to library buffer
          ● fwrite(), fprintf(), etc.
      ● Library to OS buffer
          ● write()
      ● OS buffer to disk
          ● paged out, periodic flush (5 sec usually)
          ● fsync(), fdatasync(), sync(), sync_file_range()
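The last two stages of this data flow can be made explicit with the plain system-call interface. The sketch below is illustrative only: `write_durable` is a hypothetical helper, it skips the library buffer entirely by calling write() directly, and it overwrites the target in place (so it does not address the crash-during-write problem).

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `len` bytes to `path` and return only once the kernel reports
 * that the data has reached stable storage. Returns 0 on success, -1 on error. */
int write_durable(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* Stage 1: user buffer -> OS buffer (page cache). write() may be
     * partial, so loop until everything has been handed to the kernel. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Stage 2: OS buffer -> disk. Without this the data may sit in the
     * page cache until the next periodic flush. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

When stdio is used instead, an fflush() call is needed before fsync() so that the library buffer is first pushed into the OS buffer.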
  • 28. Simple use case
      ● User hits “Save” in a word processor
          ● Expects the data to be on disk once it has been saved
      ● If power goes out
          ● The last saved version of my data is there
          ● If there isn't an explicit save, some recent version of my data should be okay
  • 29. Buggy implementation
      struct wp_doc {
              char *document;
              size_t len;
      };
      struct wp_doc d;
      ...
      FILE *f;
      f = fopen("document.txt", "w");
      fwrite(d.document, d.len, 1, f);
      fclose(f);
  • 30. Bugs
      ● No error checking
          ● fopen (did we open the file?)
          ● fwrite (did we write the entire file?)
      ● Crash in the middle of fwrite()
          ● document corrupted
      ● No sync
          ● close does not imply sync()!
  • 31. Reliable implementation
      #include <errno.h>
      #include <stdio.h>
      #include <unistd.h>

      struct wp_doc {
              char *document;
              size_t len;
      };
      struct wp_doc d;
      ...
      FILE *f;
      size_t len;

      /* temp file: never overwrite the last good copy in place */
      f = fopen(".document.txt", "w");
      if (!f)
              return errno;
      /* error checking: did we write the entire document? */
      len = fwrite(d.document, d.len, 1, f);
      if (len != 1) { fclose(f); return errno; }
      /* flush libc buffer */
      if (fflush(f) != 0) { fclose(f); return errno; }
      /* sync to disk before rename */
      if (fsync(fileno(f)) == -1) { fclose(f); return errno; }
      fclose(f);
      if (rename(".document.txt", "document.txt") != 0)
              return errno;
  • 32. References
      ● Block I/O layer tracing - blktrace: http://www.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf
      ● Eat my data: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp
      ● fsync() problems with Firefox: http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/
      ● Linux documentation
          ● Documentation/sysctl/vm.txt
          ● Documentation/laptops/laptop-mode.txt
  • 33. Q/A
      ● You're very welcome!
      ● Twitter
          ● @arighi
          ● #bem2014