Running with the Devil:
Mechanical Sympathetic Networking

Todd L. Montgomery
@toddlmontgomery
Informatica Ultra Messaging Architecture
Tail of a Networking Stack

Beastie: direct descendants
  FreeBSD, NetBSD, OpenBSD, ...
  Darwin (Mac OS X)
also: Windows, Solaris, even Linux, Android, ...

Domain: TCP & UDP
It’s a Trap!

A big request (256 KB) goes out as a train of TCP MSS-sized segments
(1448 bytes), followed by a response.

Symptom: Overly long round-trip time for a request + response.
         Only for specific request sizes? Only on some OSs?

Solution 1: Set SO_RCVBUF. Pops up again!
Solution 2: Set TCP_NODELAY. YES! but CPU skyrockets!

Well understood bad interaction!
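Both "solutions" on this slide map to ordinary socket options. A minimal sketch in Python; the buffer size is illustrative, and whether the larger buffer actually takes effect depends on OS limits:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Solution 1: enlarge the receive buffer (helps, until it pops up again).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)

# Solution 2: disable Nagle's algorithm. Latency improves, but CPU can
# skyrocket: every small write now goes straight to the wire.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

Note that the OS may clamp or double the requested SO_RCVBUF; read it back with `getsockopt` to see what you actually got.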
Challenges with TCP

Nagle: “Don’t send ‘small’ segments if unacknowledged segments exist”

  + Delayed ACKs: Don’t acknowledge data immediately. Wait a small
    period of time (200 ms) for responses to be generated and piggyback
    the response with the acknowledgement.

  = Temporary Deadlock: Waiting on an acknowledgement to send any
    waiting small segment. But the acknowledgement is delayed, waiting
    on more data or a timeout.

Solutions?
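The interaction can be captured in a toy predictor. This is not the real TCP state machines: it assumes the receiver ACKs every second full segment immediately and otherwise delays the ACK (~200 ms), and that Nagle holds a trailing sub-MSS segment until all prior data is acknowledged. Under those assumptions, an odd number of full segments plus a short tail is exactly the shape that stalls, which is why only specific request sizes trigger it:

```python
MSS = 1448  # from the earlier slide

def stalls(request_bytes, mss=MSS):
    """Toy predictor for the Nagle x delayed-ACK temporary deadlock.

    full_segments % 2 == 1 means the receiver is sitting on one unACKed
    segment (waiting ~200 ms for a second); remainder > 0 means Nagle is
    holding a short tail segment behind that delayed ACK.
    """
    full_segments, remainder = divmod(request_bytes, mss)
    return remainder > 0 and full_segments % 2 == 1

# 256 KB = 181 full segments (odd) + a 56-byte tail: the slide's request stalls.
```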
Little Experiment

A BIG request (32 MB) is streamed over loopback (TCP MSS 16344 bytes),
followed by a 1 KB response.

Question: Does the size of a send matter that much?

  Chunk Size    RTT (msec)
     1500           32
     4096           16       Dramatically
     8192           12       higher CPU

Take Away(s)
  “Small” messages are evil? Chunks smaller than MSS are evil?
  ... no, or not quite ...
  OS pagesize (4096 bytes) matters! Why?
  Kernel boundary crossings matter!

What about sendfile(2) and FileChannel.transferTo()?
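The experiment can be sketched as a scaled-down loopback benchmark. Sizes here are smaller than the slide's 32 MB so it runs quickly; the in-process echo server stands in for the real peer, and absolute numbers will differ by machine:

```python
import socket
import threading
import time

def measure_rtt(total_bytes, chunk_size, response_bytes=1024):
    """Stream total_bytes over loopback in chunk_size writes, then wait
    for a small response. Returns the wall-clock request + response time."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def server():
        conn, _ = srv.accept()
        received = 0
        while received < total_bytes:          # drain the whole request
            data = conn.recv(65536)
            if not data:
                break
            received += len(data)
        conn.sendall(b"\x00" * response_bytes)  # then send the response
        conn.close()

    t = threading.Thread(target=server)
    t.start()
    cli = socket.create_connection(srv.getsockname())
    chunk = b"\x01" * chunk_size
    start = time.perf_counter()
    sent = 0
    while sent < total_bytes:                   # one syscall per chunk
        cli.sendall(chunk[: total_bytes - sent])
        sent += min(chunk_size, total_bytes - sent)
    resp = b""
    while len(resp) < response_bytes:
        resp += cli.recv(65536)
    rtt = time.perf_counter() - start
    cli.close()
    t.join()
    srv.close()
    return rtt

# e.g. compare measure_rtt(1 << 20, 1500) against measure_rtt(1 << 20, 8192)
```

Varying `chunk_size` across 1500 / 4096 / 8192 is how the table above was produced; the per-chunk syscall is the kernel boundary crossing the takeaways point at.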
Challenges with UDP

Not Reliable: loss recovery is the app’s responsibility.
Not a Stream: message boundaries matter! (kernel boundary crossings)
No Nagle: small messages are not batched.
No Flow Control: potential to overrun a receiver.
No Congestion Control: potential impact to all competing traffic!!
(unconstrained flow)

Causes of Loss
‣ Receiver buffer overrun
‣ Network congestion
(neither is strictly the app’s fault)
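Because UDP is not a stream and has no Nagle, batching has to come from the application: pack several small messages into one datagram with explicit boundaries. A minimal length-prefixed framing sketch; the 2-byte big-endian prefix is an illustrative choice, not a standard format:

```python
import struct

def pack_batch(messages):
    """Pack small app messages into one UDP datagram payload,
    each prefixed with a 2-byte big-endian length."""
    return b"".join(struct.pack(">H", len(m)) + m for m in messages)

def unpack_batch(datagram):
    """Split a batched datagram back into the original messages."""
    out, i = [], 0
    while i < len(datagram):
        (n,) = struct.unpack_from(">H", datagram, i)
        out.append(datagram[i + 2 : i + 2 + n])
        i += 2 + n
    return out
```

One `sendto` of `pack_batch(msgs)` replaces `len(msgs)` kernel boundary crossings, which is exactly the utilization win quantified on the next slide.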
Network Utilization & Datagrams

Utilization: Data / (Data + Control), i.e. “the percentage of traffic
that is data”.

  No. of 200 Byte App Messages    Utilization (%)
              1                        87.7
              5                        97.3
             20                        99.3
  * IP Header = 20 bytes, UDP Header = 8 bytes, no response

Plus fewer interrupts! Batching?
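The table follows directly from the header arithmetic in the footnote; a few lines reproduce it, assuming the batched messages share a single datagram:

```python
IP_HEADER = 20   # bytes, per the footnote
UDP_HEADER = 8   # bytes, per the footnote

def utilization(n_msgs, msg_bytes=200):
    """Percent of on-the-wire traffic that is application data when
    n_msgs messages share one UDP datagram (no response traffic)."""
    data = n_msgs * msg_bytes
    return 100.0 * data / (data + IP_HEADER + UDP_HEADER)

for n in (1, 5, 20):
    print(n, round(utilization(n), 1))  # 1 87.7, 5 97.3, 20 99.3
```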
Application-Level Batching?

Application-Specific Knowledge: applications sometimes know when to
send small and when to batch.*

  + Performance Limitations & Tradeoffs: Nagle, Delayed ACKs, chunk
    sizes, UDP network utilization, etc.

  = Batching by the Application: applications can optimize and make
    the tradeoffs necessary at the time they are needed.

* HTTP (headers + body), etc.

Addressing
‣ Request/Response idiosyncrasies
‣ Send-side optimizations
Batching setsockopt()s

TCP_CORK
‣ Linux only
‣ Only send when MSS full, when unCORKed, or ...
‣ ... after 200 msec
‣ unCORKing requires a kernel boundary crossing
‣ Intended to work with TCP_NODELAY

TCP_NOPUSH
‣ BSD (some) only
‣ Only send when SO_SNDBUF full
‣ Mostly broken on Darwin

When to batch? When to flush?
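Since TCP_CORK is Linux-only and TCP_NOPUSH is BSD-only (and shaky on Darwin), portable code has to probe for whichever exists. A hedged helper sketch; `set_cork` is an illustrative name, not a standard API:

```python
import socket

def set_cork(sock, on):
    """Cork (on=True) or uncork (on=False) a TCP socket where the
    platform supports it. Returns True if an option was applied,
    False if neither TCP_CORK nor TCP_NOPUSH is available."""
    for opt in ("TCP_CORK", "TCP_NOPUSH"):
        if hasattr(socket, opt):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, opt),
                            1 if on else 0)
            return True
    return False
```

Uncorking (`set_cork(sock, False)`) is what forces the flush, and it costs the kernel boundary crossing noted in the bullets above.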
Flush? Batch?

           Batch when...                                 Flush when...

1. Application logic                          1. Application logic
2. More data is likely to follow              2. More data is unlikely to follow
3. Unlikely to get data out before next one   3. Timeout (200 msec?)
                                              4. Likely to get data out before next one



             An Automatic Transmission for Batching

      1. Always default to flushing
      2. Batch when Mean time between sends < Mean time to send (EWMA?)
      3. Flush on timeout as safety measure



 Question: Can you batch too much?
Flush? Batch?

           Batch when...                                  Flush when...

1. Application logic                           1. Application logic
2. More data is likely to follow               2. More data is unlikely to follow
3. Unlikely to get data out before next one    3. Timeout (200 msec?)
                                               4. Likely to get data out before next one



             An Automatic Transmission for Batching

      1. Always default to flushing
      2. Batch when Mean time between sends < Mean time to send (EWMA?)
      3. Flush on timeout as safety measure



 Question: Can you batch too much?            YES!
Flush? Batch?

Batch when...
1. Application logic
2. More data is likely to follow
3. Unlikely to get data out before next one

Flush when...
1. Application logic
2. More data is unlikely to follow
3. Timeout (200 msec?)
4. Likely to get data out before next one


An Automatic Transmission for Batching

1. Always default to flushing
2. Batch when Mean time between sends < Mean time to send (EWMA?)
3. Flush on timeout as safety measure


Question: Can you batch too much? YES!
Large UDP (fragmentation) + non-trivial loss probability
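The "automatic transmission" above can be sketched as a small decision module. This is a hypothetical illustration, not code from the talk (the struct, function names, and 1/8 smoothing factor are assumptions): keep EWMAs of the mean time between sends (MTBS) and the mean time to send on the socket (MTTS), batch only while MTBS < MTTS, and always flush once the safety timeout expires.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the automatic transmission for batching:
 * names and the alpha = 1/8 smoothing factor are illustrative. */
typedef struct
{
    double mtbs_ns;        /* EWMA of time between application sends */
    double mtts_ns;        /* EWMA of time a socket send takes */
    long long timeout_ns;  /* safety flush timeout (e.g. 200 msec) */
} batcher_t;

static double ewma(double current, double sample)
{
    return current + (sample - current) / 8.0;  /* alpha = 1/8 */
}

void batcher_on_app_send(batcher_t *b, long long ns_since_last_send)
{
    b->mtbs_ns = ewma(b->mtbs_ns, (double)ns_since_last_send);
}

void batcher_on_socket_send(batcher_t *b, long long send_duration_ns)
{
    b->mtts_ns = ewma(b->mtts_ns, (double)send_duration_ns);
}

/* Default to flushing; batch only while sends arrive faster than the
 * socket drains them; always flush once the timeout has expired. */
bool batcher_should_flush(const batcher_t *b, long long ns_since_last_flush)
{
    if (ns_since_last_flush >= b->timeout_ns)
        return true;                    /* safety measure */
    return b->mtbs_ns >= b->mtts_ns;    /* batch while MTBS < MTTS */
}
```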
A Batching Architecture

Blocking Sends → Socket(s)

MTBS: Mean Time Between Sends
MTTS: Mean Time To Send (on socket)

“Smart” Batching
Pulls off all waiting data when possible (automatically batches when MTBS < MTTS)

Advantages
‣ Non-contended send threads
‣ Decoupled API and socket sends
‣ Single writer principle for sockets
‣ Built-in back pressure (bounded ring buffer)
‣ Easy to add (async) send notification
‣ Easy to add rate limiting

Can be re-used for other batching tasks (like file I/O, DB writes, and pipeline requests)!
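One way the drain stage of this architecture might look, as a minimal sketch (all names are illustrative, not from the talk): API threads offer messages onto a bounded ring buffer, and the single writer that owns the socket drains everything waiting into one batched send, so batching happens automatically whenever offers arrive faster than the socket drains (MTBS < MTTS), and a full ring is the back pressure.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative smart-batching drain over a bounded ring buffer. */
#define RING_CAPACITY 1024          /* bounded: a full ring = back pressure */

typedef struct
{
    const void *msgs[RING_CAPACITY];
    size_t head;                     /* next slot to consume */
    size_t tail;                     /* next slot to produce */
} send_ring_t;

/* Returns 0 on success, -1 when full (caller sees back pressure). */
int ring_offer(send_ring_t *r, const void *msg)
{
    if (r->tail - r->head == RING_CAPACITY)
        return -1;
    r->msgs[r->tail % RING_CAPACITY] = msg;
    r->tail++;
    return 0;
}

/* Single-writer drain: hand every waiting message to a batched socket
 * send (one or two contiguous spans if the ring wraps). Returns the
 * total number of messages drained. */
size_t ring_drain(send_ring_t *r,
                  void (*batch_send)(const void **msgs, size_t count, void *clientd),
                  void *clientd)
{
    size_t drained = r->tail - r->head;

    while (r->head != r->tail)
    {
        size_t index = r->head % RING_CAPACITY;
        size_t contiguous = r->tail - r->head;
        if (contiguous > RING_CAPACITY - index)
            contiguous = RING_CAPACITY - index;   /* stop at the wrap point */
        batch_send(&r->msgs[index], contiguous, clientd);
        r->head += contiguous;
    }

    return drained;
}
```

A production version would make head and tail atomic for concurrent producers; this single-threaded sketch only shows the drain-to-empty shape.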
Multi-Message Send/Receive

sendmmsg(2)
‣ Linux 3.x only
‣ Send multiple datagrams with single call
‣ Fits nicely with batching architecture
Complements gather send (sendmsg, writev) - which you can do in the same call!

recvmmsg(2)
‣ Linux 3.x only
‣ Receive multiple datagrams with single call
‣ So, so, sooo SMOKIN’ HOT!
Scatter recv (recvmsg, readv) is usually not worth the trouble.

Advantages
‣ Reduced kernel boundary crossings
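A minimal Linux-only sketch of the pairing (function name and payloads are illustrative): three datagrams go out with one sendmmsg(2) call and come back with one recvmmsg(2) call over a loopback UDP pair. It assumes loopback delivery has completed by the time recvmmsg runs, so MSG_DONTWAIT is used to avoid blocking.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_MSGS 3

/* Returns the number of datagrams received back, or -1 on any error. */
int mmsg_roundtrip(void)
{
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    if (tx < 0 || rx < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                               /* kernel picks a port */
    socklen_t len = sizeof(addr);
    if (bind(rx, (struct sockaddr *)&addr, len) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &len) < 0 ||
        connect(tx, (struct sockaddr *)&addr, len) < 0)
        return -1;

    /* One mmsghdr per datagram; each msg_hdr can itself gather an iovec,
     * so gather send and multi-message send combine in the same call. */
    char payloads[NUM_MSGS][16] = { "datagram-0", "datagram-1", "datagram-2" };
    struct iovec iov[NUM_MSGS];
    struct mmsghdr msgs[NUM_MSGS];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < NUM_MSGS; i++)
    {
        iov[i].iov_base = payloads[i];
        iov[i].iov_len = strlen(payloads[i]);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* One kernel boundary crossing for all three datagrams. */
    if (sendmmsg(tx, msgs, NUM_MSGS, 0) != NUM_MSGS)
        return -1;

    char bufs[NUM_MSGS][16];
    struct iovec riov[NUM_MSGS];
    struct mmsghdr rmsgs[NUM_MSGS];
    memset(rmsgs, 0, sizeof(rmsgs));
    for (int i = 0; i < NUM_MSGS; i++)
    {
        riov[i].iov_base = bufs[i];
        riov[i].iov_len = sizeof(bufs[i]);
        rmsgs[i].msg_hdr.msg_iov = &riov[i];
        rmsgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* One kernel boundary crossing to drain everything already queued. */
    int received = recvmmsg(rx, rmsgs, NUM_MSGS, MSG_DONTWAIT, NULL);

    close(tx);
    close(rx);
    return received;
}
```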
Domain: Protocol Design
Application-Level Framing

File to Transfer (0 .. X bytes)
Split into Application Data Units (constant size, except maybe the last)

S:        ADU 1  ADU 2  ADU 3  ADU 4  ADU 5  ADU 6  ADU 7  ADU 8  ADU 9

R:        ADU 1  ADU 2         ADU 4  ADU 5  ADU 6  ADU 7  ADU 8

Recover:                ADU 3                                      ADU 9

Advantages
‣ Optimize recovery until end (or checkpoints)
‣ Works well with multicast and unicast
‣ Works best over UDP (message boundaries)

Clark and Tennenhouse, ACM SIGCOMM CCR, Volume 20, Issue 4, Sept. 1990
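The receiver side of this scheme can be sketched as follows (names and sizes are illustrative assumptions): because each ADU's sequence number alone determines where its bytes land in the file, ADUs can be applied in any order, and the ones still missing can be NAKed at the end or at a checkpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative ALF receiver for a small file transfer: fixed-size ADUs
 * (the last may be short), applied out of order, recovered at the end. */
#define ADU_SIZE 8

typedef struct
{
    unsigned char data[64];  /* reassembled file contents */
    bool have_adu[8];        /* which ADUs have arrived */
    size_t file_len;
    size_t num_adus;
} alf_receiver_t;

void alf_init(alf_receiver_t *rx, size_t file_len)
{
    memset(rx, 0, sizeof(*rx));
    rx->file_len = file_len;
    rx->num_adus = (file_len + ADU_SIZE - 1) / ADU_SIZE;  /* ceiling */
}

/* An ADU is (sequence number, payload); offset = seq * ADU_SIZE, so
 * out-of-order and duplicate arrivals are harmless. */
void alf_on_adu(alf_receiver_t *rx, size_t seq,
                const unsigned char *payload, size_t len)
{
    memcpy(&rx->data[seq * ADU_SIZE], payload, len);
    rx->have_adu[seq] = true;
}

/* At the end (or a checkpoint), request exactly the ADUs still missing. */
size_t alf_missing(const alf_receiver_t *rx, size_t *out_seqs)
{
    size_t n = 0;
    for (size_t seq = 0; seq < rx->num_adus; seq++)
        if (!rx->have_adu[seq])
            out_seqs[n++] = seq;
    return n;
}
```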
PGM Router Assist

A sender S multicasts down a forwarding tree to receivers R. Loss on a link
high in the tree affects both downstream links below it, while the other
subtrees see no loss.

NAK Path: NAKs traverse hop-by-hop back up the forwarding tree, saving state
in each router for the retransmits to follow.

Retransmit Path: a retransmit only needs to traverse the lossy link once for
both downstream links.

Advantages
‣ NAKs follow reverse path
‣ Retransmissions localized
‣ Optimization of bandwidth!

Pragmatic General Multicast (PGM), IETF RFC 3208
Questions?

QCon London 2015 Protocols of Interaction
 
IoTaConf 2014 - IoT Connectivity, Standards, and Architecture
IoTaConf 2014 - IoT Connectivity, Standards, and ArchitectureIoTaConf 2014 - IoT Connectivity, Standards, and Architecture
IoTaConf 2014 - IoT Connectivity, Standards, and Architecture
 
Tema5
Tema5Tema5
Tema5
 
QCon NY 2014 - Evolving REST for an IoT World
QCon NY 2014 - Evolving REST for an IoT WorldQCon NY 2014 - Evolving REST for an IoT World
QCon NY 2014 - Evolving REST for an IoT World
 
Reactive Programming Models for IoT
Reactive Programming Models for IoTReactive Programming Models for IoT
Reactive Programming Models for IoT
 
Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems
Go Reactive: Event-Driven, Scalable, Resilient & Responsive SystemsGo Reactive: Event-Driven, Scalable, Resilient & Responsive Systems
Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems
 
Chris Kypriotis various recommendations
Chris Kypriotis various recommendationsChris Kypriotis various recommendations
Chris Kypriotis various recommendations
 

Similar to Running with the Devil: Mechanical Sympathetic Networking

What every Java developer should know about network?
What every Java developer should know about network?What every Java developer should know about network?
What every Java developer should know about network?aragozin
 
New flaws in WPA-TKIP
New flaws in WPA-TKIPNew flaws in WPA-TKIP
New flaws in WPA-TKIPvanhoefm
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014Amazon Web Services
 
Tuning the Kernel for Varnish Cache
Tuning the Kernel for Varnish CacheTuning the Kernel for Varnish Cache
Tuning the Kernel for Varnish CachePer Buer
 
Informal Presentation on WPA-TKIP
Informal Presentation on WPA-TKIPInformal Presentation on WPA-TKIP
Informal Presentation on WPA-TKIPvanhoefm
 
DNS-OARC-36: Measurement of DNSSEC Validation with RSA-4096
DNS-OARC-36: Measurement of DNSSEC Validation with RSA-4096DNS-OARC-36: Measurement of DNSSEC Validation with RSA-4096
DNS-OARC-36: Measurement of DNSSEC Validation with RSA-4096APNIC
 
Computer network (13)
Computer network (13)Computer network (13)
Computer network (13)NYversity
 
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)Art Schanz
 
Disque: a detailed overview of the distributed implementation - Salvatore San...
Disque: a detailed overview of the distributed implementation - Salvatore San...Disque: a detailed overview of the distributed implementation - Salvatore San...
Disque: a detailed overview of the distributed implementation - Salvatore San...distributed matters
 
CS4344 09/10 Lecture 10: Transport Protocol for Networked Games
CS4344 09/10 Lecture 10: Transport Protocol for Networked GamesCS4344 09/10 Lecture 10: Transport Protocol for Networked Games
CS4344 09/10 Lecture 10: Transport Protocol for Networked GamesWei Tsang Ooi
 
加快互联网核心协议,提高Web速度yuchungcheng
加快互联网核心协议,提高Web速度yuchungcheng加快互联网核心协议,提高Web速度yuchungcheng
加快互联网核心协议,提高Web速度yuchungchengMichael Zhang
 
RIPE 76: Measuring ATR
RIPE 76: Measuring ATRRIPE 76: Measuring ATR
RIPE 76: Measuring ATRAPNIC
 
AWS re:Invent 2016: Making Every Packet Count (NET404)
AWS re:Invent 2016: Making Every Packet Count (NET404)AWS re:Invent 2016: Making Every Packet Count (NET404)
AWS re:Invent 2016: Making Every Packet Count (NET404)Amazon Web Services
 
XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng,...
XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng,...XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng,...
XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng,...The Linux Foundation
 

Similar to Running with the Devil: Mechanical Sympathetic Networking (20)

What every Java developer should know about network?
What every Java developer should know about network?What every Java developer should know about network?
What every Java developer should know about network?
 
T.Pollak y C.Yaconi - Prey
T.Pollak y C.Yaconi - PreyT.Pollak y C.Yaconi - Prey
T.Pollak y C.Yaconi - Prey
 
New flaws in WPA-TKIP
New flaws in WPA-TKIPNew flaws in WPA-TKIP
New flaws in WPA-TKIP
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
 
Tuning the Kernel for Varnish Cache
Tuning the Kernel for Varnish CacheTuning the Kernel for Varnish Cache
Tuning the Kernel for Varnish Cache
 
Informal Presentation on WPA-TKIP
Informal Presentation on WPA-TKIPInformal Presentation on WPA-TKIP
Informal Presentation on WPA-TKIP
 
DNS-OARC-36: Measurement of DNSSEC Validation with RSA-4096
DNS-OARC-36: Measurement of DNSSEC Validation with RSA-4096DNS-OARC-36: Measurement of DNSSEC Validation with RSA-4096
DNS-OARC-36: Measurement of DNSSEC Validation with RSA-4096
 
TCPIP Networks for DBAs
TCPIP Networks for DBAsTCPIP Networks for DBAs
TCPIP Networks for DBAs
 
Computer network (13)
Computer network (13)Computer network (13)
Computer network (13)
 
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)
 
Aerospike & GCE (LSPE Talk)
Aerospike & GCE (LSPE Talk)Aerospike & GCE (LSPE Talk)
Aerospike & GCE (LSPE Talk)
 
Disque: a detailed overview of the distributed implementation - Salvatore San...
Disque: a detailed overview of the distributed implementation - Salvatore San...Disque: a detailed overview of the distributed implementation - Salvatore San...
Disque: a detailed overview of the distributed implementation - Salvatore San...
 
CS4344 09/10 Lecture 10: Transport Protocol for Networked Games
CS4344 09/10 Lecture 10: Transport Protocol for Networked GamesCS4344 09/10 Lecture 10: Transport Protocol for Networked Games
CS4344 09/10 Lecture 10: Transport Protocol for Networked Games
 
A Baker's dozen of TCP
A Baker's dozen of TCPA Baker's dozen of TCP
A Baker's dozen of TCP
 
加快互联网核心协议,提高Web速度yuchungcheng
加快互联网核心协议,提高Web速度yuchungcheng加快互联网核心协议,提高Web速度yuchungcheng
加快互联网核心协议,提高Web速度yuchungcheng
 
RIPE 76: Measuring ATR
RIPE 76: Measuring ATRRIPE 76: Measuring ATR
RIPE 76: Measuring ATR
 
AWS re:Invent 2016: Making Every Packet Count (NET404)
AWS re:Invent 2016: Making Every Packet Count (NET404)AWS re:Invent 2016: Making Every Packet Count (NET404)
AWS re:Invent 2016: Making Every Packet Count (NET404)
 
XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng,...
XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng,...XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng,...
XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng,...
 
Networking.ppt
Networking.pptNetworking.ppt
Networking.ppt
 
Lect9 (1)
Lect9 (1)Lect9 (1)
Lect9 (1)
 

Running with the Devil: Mechanical Sympathetic Networking

  • 1. Running with the Devil: Mechanical Sympathetic Networking. Todd L. Montgomery (@toddlmontgomery), Ultra Messaging Architecture, Informatica.
  • 2. Tail of a Networking Stack. Beastie’s direct descendants: FreeBSD, NetBSD, OpenBSD, ... and Darwin (Mac OS X); also Windows, Solaris, even Linux, Android, ...
  • 4-12. It’s a Trap! Symptom: an overly long round-trip time for a request + response. A big request (256 KB) goes out as MSS-sized TCP segments (MSS 1448 bytes), and the response arrives only after an overly long RTT. Does it happen only for specific request sizes? Only on specific OSs? Solution 1: set SO_RCVBUF; the problem pops up again. Solution 2: set TCP_NODELAY; YES, that fixes the RTT, but CPU skyrockets! This is a well-understood bad interaction.
  • 14-17. Challenges with TCP. Nagle: “Don’t send ‘small’ segments if unacknowledged segments exist.” Plus Delayed ACKs: don’t acknowledge data immediately; wait a small period of time (200 ms) for responses to be generated and piggyback the response with the acknowledgement. Together they produce a temporary deadlock: the sender waits on an acknowledgement before sending any waiting small segment, but the acknowledgement is delayed waiting on more data or a timeout. Solutions?
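The standard first answer to that “Solutions?” is to disable Nagle with TCP_NODELAY, so small segments go out without waiting on outstanding ACKs. A minimal sketch (error handling reduced to the return code; the function name is illustrative):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle: small segments are sent immediately instead of
 * waiting for previously sent data to be acknowledged. */
int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}
```

As the earlier slides warn, this trades the temporary deadlock for more packets on the wire and higher CPU.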
  • 18-28. Little Experiment. Question: does the size of a send matter that much? Send a BIG request (32 MB) over loopback (TCP MSS 16344 bytes) and get a 1 KB response. Results: a 1500-byte chunk size gives a 32 msec RTT, 4096 bytes gives 16 msec, and 8192 bytes gives 12 msec, with dramatically higher CPU at the smaller sizes. Takeaways: are “small” messages evil? Are chunks smaller than the MSS evil? No, or not quite. The OS page size (4096 bytes) matters! Why? Kernel boundary crossings matter! What about sendfile(2) and FileChannel.transferTo()?
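The closing question points at the other extreme: hand the kernel the whole transfer. With sendfile(2) the data never visits user space, so there is no per-chunk read/write pair of boundary crossings at all (FileChannel.transferTo() typically maps to this on Linux). A hedged, Linux-only sketch; the function name is illustrative:

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Push a whole file down a descriptor with sendfile(2).  The copy
 * loop stays in the kernel; user space only drives the offset. */
ssize_t send_file_to(int out_fd, const char *path)
{
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0) {            /* error or unexpected EOF */
            close(in_fd);
            return -1;
        }
    }
    close(in_fd);
    return (ssize_t)offset;      /* bytes transferred */
}
```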
  • 30-35. Challenges with UDP. Not a stream: message boundaries matter (kernel boundary crossings). No Nagle: small messages are not batched. Not reliable: loss recovery is the application’s responsibility, and the causes of loss (receiver buffer overrun, network congestion) are neither strictly the application’s fault. No flow control: potential to overrun a receiver. No congestion control: potential impact to all competing traffic (an unconstrained flow)!
  • 37-41. Network Utilization & Datagrams. Utilization is Data / (Data + Control): “the percentage of traffic that is data.” Assuming a 20-byte IP header, an 8-byte UDP header, and no response: one 200-byte app message per datagram is 87.7% utilization, 5 messages are 97.3%, and 20 messages are 99.3%. Plus fewer interrupts! Batching?
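Those utilization figures follow directly from the 28 bytes of per-datagram overhead; a one-line helper reproduces the table (assuming one datagram per batch and the header sizes stated on the slide):

```c
/* Utilization (%) of a single UDP datagram carrying n application
 * messages of msg_size bytes each: data / (data + headers).
 * Assumes a 20-byte IP header, an 8-byte UDP header, no response. */
double utilization(int n, int msg_size)
{
    double data = (double)n * (double)msg_size;
    return 100.0 * data / (data + 20.0 + 8.0);
}
```

utilization(1, 200) comes out near 87.7 and utilization(20, 200) near 99.3, matching the slide.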
  • 43-51. Application-Level Batching? Application-specific knowledge (applications sometimes know when to send small and when to batch: HTTP headers + body, etc.) plus performance limitations and tradeoffs (Nagle, delayed ACKs, chunk sizes, UDP network utilization, etc.) equals batching by the application: applications can optimize and make the tradeoffs necessary at the time they are needed. Addressing: request/response idiosyncrasies and send-side optimizations.
  • 54-63. Batching setsockopt()s. TCP_CORK: Linux only; only send when the MSS is full, when unCORKed, or after 200 msec; unCORKing requires a kernel boundary crossing; intended to work with TCP_NODELAY. TCP_NOPUSH: some BSDs only; only send when SO_SNDBUF is full; mostly broken on Darwin. When to batch? When to flush?
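Corking looks like this in practice. A hedged, Linux-only sketch; note that each call, including the uncork, is the extra kernel boundary crossing the slide warns about:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hold partial segments in the kernel while assembling output;
 * uncork (on = 0) to flush whatever partial segment remains. */
int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}
```

Typical use: set_cork(fd, 1), write the headers and body as separate calls, then set_cork(fd, 0) to flush the final partial segment.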
  • 65-80. Flush? Batch? Batch when: 1. application logic says so; 2. more data is likely to follow; 3. the data is unlikely to get out before the next send. Flush when: 1. application logic says so; 2. more data is unlikely to follow; 3. a timeout expires (200 msec?); 4. the data is likely to get out before the next send. An automatic transmission for batching: 1. always default to flushing; 2. batch when the mean time between sends is less than the mean time to send (EWMA?); 3. flush on timeout as a safety measure. Question: can you batch too much? YES! Large UDP datagrams (fragmentation) plus a non-trivial loss probability.
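The three automatic-transmission rules can be sketched as a tiny EWMA tracker. The struct, function names, and the alpha weight below are illustrative, not the talk’s implementation; the timeout safety measure would sit in the send loop around this:

```c
/* "Automatic transmission" for batching: track exponentially
 * weighted moving averages of the time between API sends (MTBS)
 * and the time a socket send takes (MTTS); batch only while sends
 * arrive faster than the socket can drain them. */
struct auto_batch {
    double mtbs_us;   /* mean time between sends, microseconds */
    double mtts_us;   /* mean time to send on the socket */
    double alpha;     /* EWMA weight for new samples, e.g. 0.5 */
};

static double ewma(double mean, double sample, double alpha)
{
    return mean + alpha * (sample - mean);
}

void record_arrival(struct auto_batch *b, double gap_us)
{
    b->mtbs_us = ewma(b->mtbs_us, gap_us, b->alpha);
}

void record_send(struct auto_batch *b, double cost_us)
{
    b->mtts_us = ewma(b->mtts_us, cost_us, b->alpha);
}

/* Default to flushing; batch only when MTBS < MTTS. */
int should_batch(const struct auto_batch *b)
{
    return b->mtbs_us < b->mtts_us;
}
```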
  • 83-92. A Batching Architecture. Blocking sends feed a bounded ring buffer, and “smart” batching pulls off all waiting data when possible before writing to the socket(s), automatically batching when MTBS < MTTS (MTBS: mean time between sends; MTTS: mean time to send on the socket). Advantages: non-contended send threads; decoupled API and socket sends; the single writer principle for sockets; built-in back pressure (bounded ring buffer); easy to add (async) send notification; easy to add rate limiting. The same structure can be re-used for other batching tasks (like file I/O, DB writes, and pipelined requests)!
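A minimal sketch of the ring at the heart of this architecture, assuming a single producer and a single consumer (the talk’s multiple non-contended send threads would need an MPSC queue instead; names and capacity are illustrative):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Bounded SPSC ring: senders enqueue message pointers, one drain
 * thread pulls everything available and writes it to the socket in
 * a single batch.  A full ring is the built-in back pressure. */
#define RING_CAP 1024            /* power of two */

struct ring {
    void *slots[RING_CAP];
    _Atomic size_t head;         /* next slot to consume */
    _Atomic size_t tail;         /* next slot to fill */
};

/* Returns 0 on success, -1 when full (back pressure). */
int ring_offer(struct ring *r, void *msg)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAP)
        return -1;
    r->slots[tail & (RING_CAP - 1)] = msg;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* Pull off all waiting data: this is where batching happens. */
size_t ring_drain(struct ring *r, void **out, size_t max)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t n = 0;
    while (head != tail && n < max)
        out[n++] = r->slots[head++ & (RING_CAP - 1)];
    atomic_store_explicit(&r->head, head, memory_order_release);
    return n;
}
```

The single-writer drain side is what makes the socket sends non-contended.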
  • 95-106. Multi-Message Send/Receive. sendmmsg(2): Linux 3.x only; sends multiple datagrams with a single call; fits nicely with the batching architecture; complements gather send (sendmsg, writev), which you can do in the same call! recvmmsg(2): Linux 3.x only; receives multiple datagrams with a single call; so, so, sooo SMOKIN’ HOT! Advantage: reduced kernel boundary crossings. Scatter receive (recvmsg, readv) is usually not worth the trouble.
  • 118. Application-Level Framing
File to Transfer: 0 to X bytes
Split into Application Data Units (constant size except maybe the last)
S: ADU 1 ADU 2 ADU 3 ADU 4 ADU 5 ADU 6 ADU 7 ADU 8 ADU 9
R: ADU 1 ADU 2 ADU 4 ADU 5 ADU 6 ADU 7 ADU 8
Recover: ADU 3, ADU 9
Advantages
‣Optimize recovery until end (or checkpoints)
‣Works well with multicast and unicast
‣Works best over UDP (message boundaries)
Clark and Tennenhouse, ACM SIGCOMM CCR, Volume 20, Issue 4, Sept. 1990
  • 129. PGM Router Assist
Loss here affects both downstream links; no loss on the other subtrees.
NAK Path: NAKs traverse hop-by-hop back up the forwarding tree, saving state in each router for the retransmits to follow.
Retransmit Path: the retransmit only needs to traverse the lossy link once for both downstream links.
Advantages
‣NAKs follow reverse path
‣Retransmissions localized
‣Optimization of bandwidth!
Pragmatic General Multicast (PGM), IETF RFC 3208

Editor's Notes

  2. Getting the most out of the hardware at our disposal today takes a deep understanding of the entire software and hardware stack. Nowhere is this more evident than in current CPU architectures. However, great gains are also ripe for the taking when dealing with networks and the TCP/IP stack. In this talk, we will discuss some techniques, well known and not so well known, for getting the most out of the TCP/IP stack of any modern OS. We'll discuss: (1) how application-level batching can be leveraged to avoid common TCP pitfalls, (2) how UDP datagram size influences CPU and network efficiency, (3) why new OS system calls like sendmmsg/recvmmsg are so hot, and (4) how easy it is to leverage asynchronous calls for fun and profit.