This document discusses the benefits of 10 Gigabit Ethernet (10GE) for reducing latency. 10GE allows for lower CPU Utilization and reduced In-Host latency compared to Gigabit Ethernet. While Infiniband promised low latency, it required application rewrites and added complexity, since traffic leaving the local network must be translated back to Ethernet. The document covers 10GE cabling options, with SFP+ providing the lowest latency, and NIC technologies such as RDMA that further reduce In-Host latency. With the right hardware and software, organizations can see over an 80% reduction in overall End-to-End latency by moving from Gigabit Ethernet to 10GE.
Why 10 Gigabit Ethernet? Why Now?
With the advent of Ultra Low-Latency switches, such as the Nexus 5000, which provides consistent sub-3 uSec latency regardless of load or packet size, End-to-End latency is becoming the more important metric. From an End-to-End Latency perspective, 90% of latency is In-Host, as opposed to In-Network. In addition to faster switches and decreased serialization delay, 10GE NIC technology allows for lower CPU Utilization and reduced In-Host latency.
Figure 1: Cisco Nexus 5000 Series 10 Gigabit Ethernet Switches
Nexus 5000 Data Sheet: http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/data_sheet_c78-461802.html
What About Infiniband?
Infiniband (IB) came about with the promise of Ultra Low Latency and Low CPU Utilization, but it brought a new set of problems with it. Ethernet has become the ubiquitous standard in the industry. We are
even starting to see conventional High Performance Computing vendors such as Myricom and Voltaire
and SAN vendors such as Brocade develop Ethernet products as Network, Storage, and HPC
environments are being converged into a single Unified 10GE Fabric. When communicating outside of
your LAN, to an exchange, for example, traffic must pass through an Infiniband Gateway to translate IB
to Ethernet, making any theoretical latency gain negligible in real-world scenarios. To take advantage of
the latencies Infiniband promised, applications had to be re-written to use RDMA, a sacrifice many were
not willing to make. IB also does not provide features that are standard in the Ethernet world such as
ACLs or QoS. In addition, conventional Network Monitoring Tools and Sniffers do not work with IB.
What is RDMA?
RDMA, or Remote Direct Memory Access, is a technology that allows a sender to write directly to a receiver's memory, bypassing the kernel. With conventional NICs, packets entering a NIC are processed by the server's CPU using the Operating System's UDP/IP Stack. This process requires multiple interrupts, context switches, and copies of the data before it ends up in application memory, available for use.
Figure 2: Conventional Server I/O
In an RDMA environment, this process is much simpler. It is called Kernel Bypass/Zero Copy. With
RDMA, the packet is processed by the NIC and is then copied directly to application memory without
requiring processing by the CPU. This ultimately produces reduced In-Host Latency and lower CPU
Utilization.
Figure 3: Kernel Bypass/Zero Copy Server I/O
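The contrast between the two paths can be sketched in userspace. This is only an analogy, not real kernel bypass — Python's `recv_into()` still traverses the kernel stack — but it illustrates the Zero Copy idea of payloads landing directly in memory the application allocated up front:

```python
import socket

# Conventional path: recv() returns a brand-new bytes object,
# i.e. an extra allocation and copy on every packet.
rx, tx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
tx.send(b"market data update")
data = rx.recv(2048)          # kernel copies into newly allocated memory

# Pre-allocated buffer path: recv_into() places the payload into a
# buffer the application owns, loosely analogous to RDMA writing
# into pre-registered application memory.
buf = bytearray(2048)         # allocated once, reused for every packet
tx.send(b"market data update")
n = rx.recv_into(buf)         # payload lands in the existing buffer
print(bytes(buf[:n]))
```

With real RDMA the NIC performs that final placement itself, so the CPU never touches the packet at all.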
What Cables Can I Use?
10GbaseT
10GbaseT will allow for 10GE speeds over Cat6a cables. This is the eventual low-cost solution for 10G/1G/100Mbps communication; however, the technology is in its infancy. Today's 10GbaseT PHYs consume ~8W and induce 2.5 uSec of latency per port. This will eventually be reduced and will be incorporated into future Cisco products; however, it is not supported today.
TwinAx (CX1)
The current low-cost, low-power solution for 10GE is TwinAx cabling. This consists of a copper (CX1) cable with an SFP+ Transceiver directly attached to each end.
Note: Each Transceiver induces an additional 50 nSec of latency, 100 nSec total per cable.
Figure 4: TwinAx Cable
SFP+
SFP+ provides the lowest-latency solution today. With a variety of SFP+ transceivers available for multi-mode and single-mode fiber, there are plenty of options for 10GE cabling without the added latency of 10GbaseT or TwinAx. With its smaller form factor, lower cost, and lower power consumption compared to previous X2 and XENPAK transceivers, SFP+ allows for much higher port densities than previously possible. The Nexus 5010 currently supports up to 26 Line Rate 10GE ports in a compact 1 RU form factor, with the 5020 providing 52 Line Rate 10GE ports in 2 RUs. SFP+ transceivers look, smell, and feel like SFP transceivers, but operate at 10 Gbps speeds. A limited number of SFP+ ports will also accept GE SFP Transceivers for backwards compatibility.
What NICs Should I Use?
iWARP
iWARP utilizes RDMA over Ethernet instead of Infiniband. This provides the same Kernel Bypass/Zero Copy functionality, without the need for a secondary infrastructure. However, with iWARP, just as with IB, applications must be written to the libibverbs library to take advantage of this functionality.
Key Players: NetEffect (Intel), Chelsio, Mellanox, ServerEngines
Supported Operating Systems: Linux
User Space APIs
Numerous NIC vendors are now developing User Space APIs which give you all the benefits of iWARP,
without having to re-write your application. This middleware translates between sockets programming
and the hardware.
Key Players: Myricom and Solarflare
Figure 5: User Space Library Software Block Diagram
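The interception idea can be sketched as follows. This is a hypothetical Python illustration only — real products interpose on the C sockets API (for example via LD_PRELOAD) and hand traffic to the NIC vendor's user-space library — but the principle is the same: the application keeps making ordinary sockets calls, and the middleware decides where they go.

```python
import socket

intercepted = []  # record of calls the "middleware" has seen

class AcceleratedSocket(socket.socket):
    """Stand-in for a sockets-intercepting user-space API layer."""
    def sendto(self, data, addr):
        # Middleware decision point: a real implementation would hand
        # the payload to the NIC's user-space API here, bypassing the
        # kernel. We just record the call and fall through.
        intercepted.append((len(data), addr))
        return super().sendto(data, addr)

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))                 # pick any free port
tx = AcceleratedSocket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"tick", rx.getsockname())      # application code is unchanged
data, _ = rx.recvfrom(64)
```

The key property is that the application's sockets calls are unmodified; only the layer underneath changes.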
MX (Myrinet Express)
Myricom has its roots in High Performance Computing. They originally developed an HPC interconnect called Myrinet, but have since shifted their development toward 10GE.
Key Players: Myricom
Supported Operating Systems: Linux and Windows
TCP/UDP Acceleration
SR-IOV (Single Root Input/Output Virtualization)
SR-IOV was originally designed for a Virtual Machine environment. SR-IOV allows for a single 10GE NIC
to be divided into multiple Virtual NICs (vNIC), which are then mapped to Virtual Machines. This same
concept can be applied in a non-virtualized environment, mapping each vNIC to Application Memory
once again providing Kernel Bypass/Zero Copy functionality.
Key Players: ServerEngines (Chelsio, NetEffect, Mellanox, Broadcom, and Neterion in the future)
Supported Operating Systems:
TCP/UDP Acceleration
Figure 7: SR-IOV in a Virtualized Server Environment
How Does This Affect My Applications?
Cisco has teamed with NetEffect (Intel), to provide a solution which provides the theoretical advantages
of Infiniband, without the drawbacks. Cisco and NetEffect combined forces to write a middleware called
RAB, or RDMA Accelerated Buffers, which is optimized for use with Wombat Data Fabric. Cisco is also
exploring another middleware called DAL, or Datagram Acceleration Layer, which could be used with TIBCO RV or any other application using UDP Multicast. This middleware allows for decreased CPU Utilization and reduced In-Host Latency with no modifications to your application.
Figure 8: RAB and DAL Middleware Software Block Diagram
Conventional Server I/O requires packets to be processed by the server’s CPU using the Operating
System’s UDP/IP Stack. This involves multiple interrupts, context switches, and copies of the data. This
ultimately leads to high CPU Utilization and unnecessary In-Host Latency.
Figure 9: Conventional UDP/IP Communication
DAL Middleware intercepts conventional sockets calls and writes them directly to the NetEffect NIC,
providing the Kernel Bypass/Zero Copy functionality without the headaches of re-writing your
application, as was required with Infiniband.
Figure 10: Kernel Bypass, Zero Copy Communication with DAL
How Does This Impact Latency?
As mentioned earlier, with today's low-latency networks, the source of 90% of latency is actually within the server itself, rather than in the network. We performed a baseline test and found ping-pong latency to be on the order of 35-40 uSec, with roughly 30 uSec residing within the server and only 7 uSec from the core switching infrastructure.
Figure 11: Sources of End to End Latency
We are seeing Market Data and High Performance Computing environments move to 10GE not only for added throughput, but for reduced latency. Moving to 10GE reduces Serialization Delay by an order of magnitude; for Jumbo Size Frames, serialization delay drops from 72 uSec to 7.2 uSec, as seen below.
Table 1: Serialization Delay Comparison
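Serialization delay is simply frame size in bits divided by line rate, so the figures above can be reproduced directly:

```python
# Serialization delay: the time to clock a frame onto the wire.
# bits / (Gbps * 1000) yields the delay in microseconds.
def serialization_delay_usec(frame_bytes: int, line_rate_gbps: float) -> float:
    return frame_bytes * 8 / (line_rate_gbps * 1000)

JUMBO = 9000      # Jumbo Frame size, bytes
STANDARD = 1500   # standard Ethernet frame size, bytes

print(serialization_delay_usec(JUMBO, 1))    # 72.0 uSec on GE
print(serialization_delay_usec(JUMBO, 10))   # 7.2 uSec on 10GE
print(serialization_delay_usec(STANDARD, 1)) # 12.0 uSec on GE
```

The tenfold line rate gives exactly the order-of-magnitude reduction quoted above, for any frame size.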
Furthermore, by utilizing the User Space APIs and the Kernel Bypass/Zero Copy functionality they
provide, we have seen Application Layer to Application Layer Latency reduced to less than 6 uSec.
Table 2: Latency Comparison
The end result of moving from GE to 10GE is an overall End-to-End latency decrease of over 80%.
Figure 12: End to End Latency Comparison
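As a back-of-the-envelope check of that 80% figure, using the numbers quoted in this section (~30 uSec In-Host plus ~7 uSec In-Network on GE as the baseline, versus under 6 uSec Application Layer to Application Layer with Kernel Bypass on 10GE):

```python
# Figures quoted above: ~30 uSec In-Host + ~7 uSec In-Network baseline,
# versus < 6 uSec Application-to-Application with Kernel Bypass on 10GE.
baseline_usec = 30 + 7
optimized_usec = 6
reduction = (baseline_usec - optimized_usec) / baseline_usec
print(f"End-to-End latency reduction: {reduction:.0%}")  # roughly 84%
```

which is consistent with the over-80% reduction stated above.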