Cluster 2010 Presentation

Optimization Techniques at the I/O Forwarding Layer

Kazuki Ohta (presenter):
Preferred Infrastructure, Inc., University of Tokyo

Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross:
Argonne National Laboratory

Yutaka Ishikawa:
University of Tokyo

                                                      Contact: kazuki.ohta@gmail.com
Background: Compute and Storage Imbalance

• Leadership-class computational scale:
  • 100,000+ processes
  • Advanced multi-core architectures, compute-node OSs
• Leadership-class storage scale:
  • 100+ servers
  • Commercial storage hardware, cluster file systems

• Current leadership-class machines supply only 1 GB/s of storage
  throughput for every 10 TF of compute performance, and this gap has
  grown by a factor of 10 in recent years.
• Bridging this imbalance between compute and storage is a critical
  problem for large-scale computation.
Previous Studies: Current I/O Software Stack


[Diagram: the current I/O software stack, top to bottom]
• Parallel/Serial Applications
• High-Level I/O Libraries: storage abstraction and data portability
  (HDF5, NetCDF, ADIOS)
• MPI-IO: organizing accesses from many clients (ROMIO) | POSIX I/O (VFS, FUSE)
• Parallel File Systems: a logical file system over many storage devices
  (PVFS2, Lustre, GPFS, PanFS, Ceph, etc.)
• Storage Devices
Challenge: Millions of Concurrent Clients

• 1,000,000+ concurrent clients present a challenge to the current I/O stack
   • e.g., metadata performance, locking, the network incast problem, etc.
• The I/O forwarding layer is introduced to address this.
   • All I/O requests are delegated to a dedicated I/O forwarder process.
   • The I/O forwarder reduces the number of clients seen by the file system
     for all applications, without requiring collective I/O.

[Diagram: the I/O path. Many compute processes funnel through a small
number of I/O forwarders, which issue the requests to the parallel file
system (PVFS2 servers). A minimal sketch of this delegation follows.]
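To make the delegation idea concrete, here is a minimal conceptual sketch (not IOFSL code): many compute "clients" hand their writes to one forwarder thread, which is the only party that touches the file system. The `Request` and `Forwarder` names are hypothetical, and POSIX `os.pwrite` stands in for a real parallel file system client.

```python
# Conceptual sketch of I/O delegation: the file system sees one client
# (the forwarder) instead of every compute process. Names are illustrative.
import os
import queue
import threading

class Request:
    def __init__(self, path, offset, data):
        self.path, self.offset, self.data = path, offset, data
        self.done = threading.Event()

class Forwarder:
    def __init__(self):
        self.inbox = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        # Only the forwarder performs file system I/O (POSIX pwrite here,
        # standing in for a parallel file system client).
        while True:
            req = self.inbox.get()
            fd = os.open(req.path, os.O_WRONLY | os.O_CREAT)
            os.pwrite(fd, req.data, req.offset)
            os.close(fd)
            req.done.set()

    def write(self, path, offset, data):
        # Called by each compute process; blocks until the I/O completes.
        req = Request(path, offset, data)
        self.inbox.put(req)
        req.done.wait()
```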
I/O Software Stack with I/O Forwarding


[Diagram: the I/O software stack with I/O forwarding inserted]
• Parallel/Serial Applications
• High-Level I/O Libraries
• MPI-IO | POSIX I/O (VFS, FUSE)
• I/O Forwarding: the bridge between compute processes and the storage
  system (IBM ciod, Cray DVS, IOFSL)
• Parallel File Systems
• Storage Devices
Example I/O System: Blue Gene/P Architecture




[Figure: Blue Gene/P architecture diagram]
I/O Forwarding Challenges

• Large requests
  • Forwarding latency
  • Memory limits at the I/O forwarding node
  • Variation in backend file system node performance
• Small requests
  • The current I/O forwarding mechanism reduces the number of clients, but
    does not reduce the number of requests.
  • Request-processing overhead at the file systems

• We proposed two optimization techniques for the I/O forwarding layer:
  • Out-Of-Order I/O Pipelining, for large requests.
  • I/O Request Scheduler, for small requests.
Out-Of-Order I/O Pipelining
• Split large I/O requests into small fixed-size chunks.
• These chunks are forwarded out of order (see the sketch below).

• Good points
   • Reduces forwarding latency by overlapping the I/O requests and the
     network transfer.
   • I/O sizes are not limited by the memory size at the forwarding node.
   • Little effect from the slowest file system node.

[Diagram: message flow between Client, IOFSL, and FileSystem threads,
comparing no-pipelining with out-of-order pipelining]
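A minimal sketch of the technique, under stated assumptions: `CHUNK_SIZE` and `pipelined_write` are hypothetical names, POSIX `os.pwrite` stands in for the forwarder's file system backend, and a thread pool supplies the out-of-order concurrency.

```python
# Sketch of out-of-order I/O pipelining: one large request becomes many
# fixed-size chunks that complete in arbitrary order.
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

CHUNK_SIZE = 512 * 1024          # fixed-size pipeline unit (e.g. 512 KB)

def pipelined_write(fd, data, offset, workers=8):
    """Split one large request into chunks and forward them out of order.

    Each chunk is written independently, so the forwarder never needs a
    buffer larger than CHUNK_SIZE per worker, and a slow file system node
    only delays its own chunks.
    """
    chunks = [(offset + i, data[i:i + CHUNK_SIZE])
              for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(os.pwrite, fd, buf, off)
                   for off, buf in chunks]
        # Completion order is arbitrary; we only wait for all chunks.
        for fut in as_completed(futures):
            fut.result()          # re-raise any I/O error
```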
I/O Request Scheduler

• Schedules and merges small requests at the forwarder:
  • Reduces the number of seeks.
  • Reduces the number of requests the file systems actually see.
• Scheduling overhead must be minimal:
  • A Handle-Based Round-Robin (HBRR) algorithm provides fairness between files.
  • Ranges are managed by an interval tree.
     • Contiguous requests are merged.
A sketch of this scheduler appears after the diagram below.

[Diagram: a queue of per-handle read/write requests; the scheduler picks
N requests and issues the I/O]
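The following sketch illustrates handle-based round-robin with contiguous-range merging. A sorted list stands in for the interval tree used in the real implementation, and all names (`HBRRScheduler`, `submit`, `next_batch`) are illustrative, not IOFSL's actual API.

```python
# Sketch of HBRR scheduling: rotate fairly across file handles, then merge
# contiguous or overlapping ranges per handle into fewer, larger requests.
from collections import OrderedDict, deque

class HBRRScheduler:
    def __init__(self):
        self.queues = OrderedDict()   # file handle -> deque of (offset, size)

    def submit(self, handle, offset, size):
        self.queues.setdefault(handle, deque()).append((offset, size))

    def _merge(self, requests):
        # Merge adjacent/overlapping ranges (the interval tree's job).
        merged = []
        for off, size in sorted(requests):
            if merged and off <= merged[-1][0] + merged[-1][1]:
                last_off, last_size = merged[-1]
                merged[-1] = (last_off, max(last_size, off + size - last_off))
            else:
                merged.append((off, size))
        return merged

    def next_batch(self, n):
        # Pick up to n requests round-robin across handles for fairness.
        batch = []
        while self.queues and len(batch) < n:
            handle, q = next(iter(self.queues.items()))
            self.queues.move_to_end(handle)    # rotate this handle to the back
            batch.append((handle, q.popleft()))
            if not q:
                del self.queues[handle]
        by_handle = {}
        for handle, rng in batch:
            by_handle.setdefault(handle, []).append(rng)
        return {h: self._merge(r) for h, r in by_handle.items()}
```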
I/O Forwarding and Scalability Layer (IOFSL)

• IOFSL Project [Nawab 2009]
  • Open-Source I/O Forwarding Implementation
  • http://www.iofsl.org/
• Portable across most HPC environments
  • Network independent
    • All network communication is done by BMI [Carns 2005]
       • TCP/IP, InfiniBand, Myrinet, Blue Gene/P tree network,
         Portals, etc.
  • File system independent
  • MPI-IO (ROMIO) / FUSE clients
IOFSL Software Stack

[Diagram: on the client side, an application uses the POSIX interface
(FUSE) or the MPI-IO interface (ROMIO-ZOIDFS) on top of the ZOIDFS client
API and BMI; the ZOIDFS protocol (TCP, InfiniBand, Myrinet, etc.) carries
requests to the IOFSL server, whose file system dispatcher issues I/O
through PVFS2 or libsysio.]

• Out-Of-Order I/O Pipelining and the I/O request scheduler have been
  implemented in IOFSL and evaluated on two environments:
  • T2K Tokyo (Linux cluster) and ANL Surveyor (Blue Gene/P).
A conceptual sketch of the dispatcher follows.
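To show what "file system independent" means on the server side, here is a conceptual sketch of the dispatcher idea. The driver classes and method names are placeholders, not IOFSL's real dispatcher interface: one request format arrives over the network, and routing selects whichever backend serves the target file system.

```python
# Conceptual sketch of a server-side file system dispatcher. The client
# speaks one protocol (ZOIDFS over BMI); only the dispatcher knows which
# concrete backend (PVFS2, libsysio, ...) sits underneath.
class PVFS2Driver:
    def write(self, handle, offset, data):
        # Placeholder: a real driver would call the PVFS2 client library.
        return len(data)

class SysioDriver:
    def write(self, handle, offset, data):
        # Placeholder: a real driver would go through libsysio/POSIX.
        return len(data)

class Dispatcher:
    def __init__(self):
        self.backends = {"pvfs2": PVFS2Driver(), "posix": SysioDriver()}

    def handle_write(self, fs_type, handle, offset, data):
        # Route one uniform request to the matching backend driver.
        return self.backends[fs_type].write(handle, offset, data)
```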
Evaluation on T2K: Spec

• T2K Open Supercomputer, Tokyo site
  • http://www.open-supercomputer.org/
  • 32-node research cluster
  • 16 cores per node: four 2.3 GHz quad-core Opterons
  • 32 GB memory
  • 10 Gbps Myrinet network
  • SATA disks (read: 49.52 MB/s, write: 39.76 MB/s)
• One IOFSL server, four PVFS2 servers, 128 MPI processes
• Software
  • MPICH2 1.1.1p1
  • PVFS2 CVS snapshot (close to 2.8.2)
     • Both are configured to use Myrinet.
Evaluation on T2K: IOR Benchmark

• Each process issues the same amount of I/O.
• The message size is gradually increased to observe how the bandwidth
  changes.
  • Note: IOR was modified to call fsync() for MPI-IO.
An IOR-like access-pattern sketch follows.

[Diagram: each process's I/O divided into messages of a given message size]
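The sketch below mirrors the access pattern only; the actual runs used the IOR benchmark in C, modified to call fsync(). This mpi4py version (an assumption for illustration; `ior_like_write` is a hypothetical name) has each rank write the same amount of data at disjoint offsets in a shared file.

```python
# IOR-like write pattern: every process issues identical I/O at strided,
# disjoint offsets; bandwidth is computed over the slowest rank's time.
from mpi4py import MPI

def ior_like_write(filename, message_size, messages_per_rank=64):
    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()
    buf = bytearray(message_size)
    fh = MPI.File.Open(comm, filename,
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    comm.Barrier()
    t0 = MPI.Wtime()
    for i in range(messages_per_rank):
        # Disjoint, strided offsets across all ranks.
        offset = (i * nprocs + rank) * message_size
        fh.Write_at(offset, buf)
    fh.Sync()                      # the fsync() the slides mention
    comm.Barrier()
    elapsed = MPI.Wtime() - t0
    fh.Close()
    total = nprocs * messages_per_rank * message_size
    if rank == 0:
        print(f"{message_size} B messages: {total / elapsed / 2**20:.1f} MB/s")
```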
Evaluation on T2K: IOR Benchmark, 128procs

[Chart: bandwidth (MB/s, 0-120) vs. message size (4 KB to 65536 KB) for
ROMIO PVFS2, IOFSL none, and IOFSL hbrr. Annotations: Out-Of-Order
Pipelining improves bandwidth by 29.5%; the I/O scheduler improves it by
40.0%.]
Evaluation on Blue Gene/P: Spec

• Argonne National Laboratory BG/P “Surveyor”
  • Blue Gene/P platform for research and development
  • 1,024 nodes, 4,096 cores
  • Four PVFS2 servers
  • DataDirect Networks S2A9550 SAN
• 256 compute nodes with 4 I/O nodes were used.

[Figure: BG/P packaging — node card: 4 cores; node board: 128 cores;
rack: 4,096 cores]
Evaluation on BG/P: BMI PingPong
[Chart: bandwidth (MB/s, 0-900) vs. buffer size (1 KB to 4096 KB) for BMI
TCP/IP, BMI ZOID, and the CNK BG/P tree network. A generic ping-pong
measurement sketch follows.]
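For readers unfamiliar with the benchmark, here is a generic ping-pong bandwidth sketch. BMI itself is a C API and is not shown; mpi4py point-to-point calls stand in for it, and `pingpong` is a hypothetical name.

```python
# Generic ping-pong bandwidth measurement between ranks 0 and 1.
from mpi4py import MPI

def pingpong(buffer_size, iters=100):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    buf = bytearray(buffer_size)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = MPI.Wtime() - t0
    if rank == 0:
        # Each iteration moves the buffer twice (there and back).
        mbps = 2 * iters * buffer_size / elapsed / 2**20
        print(f"{buffer_size} B: {mbps:.1f} MB/s")
```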
Evaluation on BG/P: IOR Benchmark, 256nodes

[Chart: bandwidth (MiB/s, 0-800) vs. message size (1 KB to 65536 KB) for
CIOD and IOFSL FIFO (32 threads). Annotations: a 42.0% performance
improvement over CIOD, and a 38.5% performance drop at larger message
sizes.]
Evaluation on BG/P: Thread Count Effect

[Chart: bandwidth (MiB/s, 0-800) vs. message size (1 KB to 65536 KB) for
IOFSL FIFO with 16 threads and with 32 threads. Annotation: 16 threads
outperforms 32 threads.]
Related Work

• Computational Plant Project @ Sandia National Laboratory
  • First introduced the I/O forwarding layer.
• IBM Blue Gene/L, Blue Gene/P
  • All I/O requests are forwarded to I/O nodes.
     • The compute-node OS can be stripped down to minimum
       functionality, reducing OS noise.
  • ZOID: I/O forwarding project [Kamil 2008]
     • Only on Blue Gene.
• Lustre Network Request Scheduler (NRS) [Qian 2009]
  • A request scheduler at the parallel file system nodes.
  • Only simulation results.
Future Work

• Event-driven server architecture
  • Reduced thread contention
• Collaborative caching at the I/O forwarding layer
  • Multiple I/O forwarders work collaboratively to cache data
    and metadata.
• Hints from MPI-IO
  • Better cooperation with collective I/O
• Evaluation on other leadership-scale machines
  • ORNL Jaguar; Cray XT4 and XT5 systems
Conclusions
• Implemented and evaluated two optimization techniques at the I/O
  forwarding layer:
  • I/O pipelining that overlaps the file system requests and the network
    communication.
  • An I/O scheduler that reduces the number of independent, non-contiguous
    file system accesses.
• Demonstrated a portable I/O forwarding layer and compared its performance
  with the existing HPC I/O software stack.
  • Two environments:
     • T2K Tokyo Linux cluster
     • ANL Blue Gene/P Surveyor
  • First I/O forwarding evaluations on both a Linux cluster and Blue Gene/P.
  • First comparison of the IBM Blue Gene/P I/O stack with an open-source
    I/O stack.
Thanks!

Kazuki Ohta (presenter):
Preferred Infrastructure, Inc., University of Tokyo

Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross:
Argonne National Laboratory

Yutaka Ishikawa:
University of Tokyo
                                                      Contact: kazuki.ohta@gmail.com
