1. © 2016 NETRONOME SYSTEMS, INC.
Ron Swartzentruber
Senior Principal Engineer, Silicon Development
9/8/2016
SoC Solutions Enabling Server-
Based Networking
2. © 2016 NETRONOME SYSTEMS, INC. 2
The Challenge
Demands on silicon have dramatically increased due to the rapid pace of innovation in the fields
of software-defined networks and network functions virtualization
Current server-based solutions do not efficiently handle the applications they need to run
▶ Low throughput of server-based networking datapath limits application performance
▶ High CPU load of server-based networking limits compute available to applications
Economies of scale require that applications run on commercial off-the-shelf hardware instead of
traditional, expensive data center networking equipment
The continued need for higher overall network bandwidth challenges the pace of Moore’s law
Higher packet processing performance is now required to meet the classification, filtering and
forwarding demands of the latest technologies
▶ Brought on by Open vSwitch, Contrail vRouter, OpenStack and P4 applications
3. © 2016 NETRONOME SYSTEMS, INC. 3
The Solution
1. Develop the silicon and software together to form an efficient and cohesive solution
2. Lower cost of ownership by offloading datapath processing to efficient network flow
processors connected to standard server platforms
▶ Improve efficiency of server-based networking
3. Design a modular, chip multithreaded, 200Gb/s Network Flow Processor
▶ Distribute datapath packet processing to large pools of processor engines
▶ Meet bandwidth needs with multiple high speed I/O and large internal memories
▶ Programmable to allow new features to be deployed rapidly
4. Develop software that transparently offloads and accelerates networking data plane
functions
5. Enable the open source community to easily and rapidly test and deploy next
generation network technologies
4. © 2016 NETRONOME SYSTEMS, INC. 4
Background: About Netronome
Inventor of the Network Flow Processor and pioneer of hardware-accelerated
server-based networking
Provider of commercial-off-the-shelf intelligent server adapters for the data
center
▶ Delivering significantly higher performance for x86 environments
▶ Production-ready software
▶ Programmable silicon
Solutions for software-defined networks that optimize security, load balancing
and virtualization
Supporter of the academic and research community, enabling open source
projects through Open-NFP
5. © 2016 NETRONOME SYSTEMS, INC. 5
What is Server-Based Networking?
Leverages open source networking software used in servers
Transparently offloads and accelerates networking data plane functions such as virtual
switching, virtual routing, connection tracking and virtual network functions
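To make the offloaded data plane concrete, here is a minimal C sketch of the per-packet match-action lookup a virtual switch datapath performs; the flow-key layout, hash function, and action set are simplified illustrations, not Netronome's actual datapath code.

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* Simplified 5-tuple flow key; a real datapath also matches on
   * tunnel metadata, VLAN tags, etc. Keys are assumed fully
   * zero-initialized so struct padding compares equal. */
  struct flow_key {
      uint32_t src_ip, dst_ip;
      uint16_t src_port, dst_port;
      uint8_t  proto;
  };

  enum action { ACT_DROP, ACT_FORWARD, ACT_SEND_TO_HOST };

  struct flow_entry {
      struct flow_key key;
      enum action     act;
      uint64_t        packets, bytes;   /* per-flow statistics */
  };

  #define TABLE_SIZE 4096
  static struct flow_entry table[TABLE_SIZE];

  /* Toy hash; production datapaths use stronger hashes, often in
   * hardware accelerators. */
  static size_t hash_key(const struct flow_key *k)
  {
      return (k->src_ip ^ k->dst_ip ^ k->src_port ^
              ((uint32_t)k->dst_port << 16) ^ k->proto) % TABLE_SIZE;
  }

  /* Per-packet fast path: match, update statistics, return action.
   * A miss is punted to the host slow path, which can install a
   * new flow entry. */
  enum action lookup(const struct flow_key *k, uint32_t pkt_len)
  {
      struct flow_entry *e = &table[hash_key(k)];
      if (memcmp(&e->key, k, sizeof(*k)) != 0)
          return ACT_SEND_TO_HOST;       /* miss: deliver to host */
      e->packets++;
      e->bytes += pkt_len;
      return e->act;
  }

The point of the offload is that this lookup, the statistics update, and the tunnel encap/decap run on the adapter instead of consuming host CPU cycles.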
6. © 2016 NETRONOME SYSTEMS, INC. 6
Open vSwitch (OVS) Example
[Diagram: a compute node hosting VMs with SR-IOV connectivity to an Agilio CX adapter. The OVS datapath (match tables, actions, tunnels) is transparently offloaded from the Linux kernel to the adapter; exception traffic is delivered to the host and flow statistics are updated.]
7. © 2016 NETRONOME SYSTEMS, INC. 7
Per Server CPU Core Efficiency
[Chart: throughput achieved with a single server CPU core, in millions of packets per second]
• 50X Efficiency Gain vs. Kernel OVS
• 20X Efficiency Gain vs. User OVS
https://www.netronome.com/media/redactor_files/WP_OVS_Benchmarking.pdf
8. © 2016 NETRONOME SYSTEMS, INC. 8
NFV Use Case: 2,000Kpps per VNF or Application
[Diagram: two racks compared, each with a TOR switch and 20 servers with 2x40GbE]

OVS on server with traditional NIC (server core allocation):
• 16 cores consumed by OVS, delivering 9.6 Mpps of VXLAN processing
• 8 cores left for VMs: 4 apps or VNFs at 2,000 Kpps each
• Rack throughput: 168 Mpps; VNFs per rack: 80
• Racks needed to support 220 VNFs: 2.8

OVS on server with Netronome Agilio platform (server core allocation):
• OVS datapath offloaded to the adapter, delivering 22 Mpps of VXLAN processing
• 23 cores left for VMs: 11 apps or VNFs at 2,000 Kpps each
• Rack throughput: 440 Mpps; VNFs per rack: 220
• Racks needed to support 220 VNFs: 1

Result: 3X lower TCO (the rack arithmetic is reproduced in the sketch below)
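The rack figures above follow directly from the per-server numbers; the short C program below reproduces the arithmetic (an illustrative calculation using only the values quoted on this slide).

  #include <stdio.h>

  int main(void)
  {
      const int servers_per_rack = 20;   /* 2x40GbE each */
      const int vnfs_traditional = 4;    /* per server; 16 cores consumed by OVS */
      const int vnfs_agilio      = 11;   /* per server; OVS offloaded, 23 cores for VMs */
      const int vnfs_needed      = 220;

      int rack_trad   = vnfs_traditional * servers_per_rack;  /* 80  */
      int rack_agilio = vnfs_agilio      * servers_per_rack;  /* 220 */

      printf("VNFs per rack: %d vs %d\n", rack_trad, rack_agilio);
      /* 220/80 = 2.75 racks (2.8 on the slide) vs 220/220 = 1 rack */
      printf("Racks for %d VNFs: %.2f vs %.2f\n", vnfs_needed,
             (double)vnfs_needed / rack_trad,
             (double)vnfs_needed / rack_agilio);
      return 0;
  }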
10. © 2016 NETRONOME SYSTEMS, INC. 10
The Network Flow Processor Architecture
• Hardware accelerators
perform compute
intensive functions such
as hashing, crypto,
CAM, atomic and other
functions
• Delivers multi-terabit
bidirectional bandwidth
between processing elements
• Avoids bus contention and
saturation issues
• Packets autonomously pushed
to processing cores
• Pool of highly multi-threaded
parallel processing cores
• Production-ready OVS and
vRouter datapath code
• Datapath extensibility using
P4 and C programming tools
• Multi-threaded memory
engines and banks of
SRAM tightly coupled with
atomic and other hardware
accelerator functions
Latency Tolerant
Multi-threading between Processing Cores,
H/W Accelerators and Memory Banks
Delivers Highest Scale & Best Price-Performance
11. © 2016 NETRONOME SYSTEMS, INC. 11
The Flow Processing Core
Flow Processing Core
▶ The principal data processing element inside the NFP
▶ 8K Instruction control store, with capability to share
▶ 40-bit address space
▶ Eight processing threads with unique wake up control, state and PC
▶ Two-cycle switch between contexts
▶ 6-stage main pipeline
▶ 32-bit ALU with shift, multiply, CAM
▶ Easily programmable using assembly, C or P4
12. © 2016 NETRONOME SYSTEMS, INC. 12
Latency Tolerant Processing
Multiple parallel processing threads
Delays are incurred to/from hardware accelerators and memory
Threads can be de-scheduled or yielded while waiting
Result: latency is hidden from the software application
[Diagram: a Flow Processing Core (FPC) with 8 threads; while one thread waits out the latency to accelerators (CRC, hash, LUT, XOR, prefix match) or to internal/external memory, the other FPC threads run]
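A sketch of the programming pattern this enables follows, assuming hypothetical intrinsics (nfp_mem_read_async, nfp_ctx_yield) that stand in for the real NFP toolchain primitives: a thread issues a long-latency request, yields the core, and is rescheduled when the result arrives.

  #include <stdint.h>

  /* Hypothetical intrinsics, for illustration only; the actual
   * NFP toolchain exposes its own primitives for these steps. */
  extern void nfp_mem_read_async(void *dst, uint64_t addr, unsigned len);
  extern void nfp_ctx_yield(void);   /* two-cycle switch to another thread */

  extern void process(const uint8_t *hdr);

  void handle_packet(uint64_t pkt_addr)
  {
      uint8_t hdr[64];

      /* Issue the read, then hand the core to one of the other seven
       * threads instead of stalling for the memory round trip. */
      nfp_mem_read_async(hdr, pkt_addr, sizeof(hdr));
      nfp_ctx_yield();               /* resumes once the read completes */

      /* By the time this thread runs again, hdr is valid; the memory
       * latency was hidden behind the other threads' work. */
      process(hdr);
  }

With eight such threads per core, the core stays busy as long as at least one thread has a completed request to work on.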
13. © 2016 NETRONOME SYSTEMS, INC. 13
Memory-Centric Processing
Switch fabric interface
▶ 2 billion commands
per second
▶ 500Gb/s data bandwidth
Multi-bank SRAM
▶ Eight crossbar inputs
▶ Eight transactions per cycle
▶ 1 Tb/s bandwidth
Multiple processing engines
▶ No locking between engines
▶ Engines are distributed across the different processing memories in the device
▶ Different engines support different processing operations
▶ Highly threaded to maintain 100% throughput when required
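As a software analogy for the no-locking point above: per-flow statistics can be maintained with single atomic operations instead of locks. On the NFP the equivalent atomic executes inside the memory engine, next to the data; the C11 sketch below only illustrates the programming model.

  #include <stdatomic.h>
  #include <stdint.h>

  struct flow_stats {
      atomic_uint_least64_t packets;
      atomic_uint_least64_t bytes;
  };

  /* Any number of engines or threads may call this concurrently with
   * no lock: each update is a single atomic add, applied at the memory. */
  void count_packet(struct flow_stats *s, uint32_t pkt_len)
  {
      atomic_fetch_add_explicit(&s->packets, 1, memory_order_relaxed);
      atomic_fetch_add_explicit(&s->bytes, pkt_len, memory_order_relaxed);
  }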
14. © 2016 NETRONOME SYSTEMS, INC. 14
Memory Hierarchy
Philosophy: Processing in the optimal location
Process data where the data resides
• External DDR memory units
(EMU)
• Locks, hash tables,
microqueues
• Linked lists, rings
• Recursive lookups
• >300 different processing
operations
• >200 threads per unit
• Cluster Target Memory (CTM)
• Locks, hash tables, microqueues
• Packet buffering, delivery, transmit
offload
• Rings
• >250 different processing operations
• >100 threads per unit
• Cluster Local Scratch (CLS)
• Locks, hash tables,
microqueues
• Rings, stacks
• Regular expression NFA
• >100 different processing
operations
• Internal memory units (IMU)
• Locks, hash tables,
microqueues
• Recursive lookups
• Statistics, load balancing
• >300 different processing
operations
• >200 threads per unit
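Rings and microqueues recur at every level of this hierarchy. The minimal single-producer/single-consumer ring below sketches the operation the memory units implement in hardware; the layout and sizes are illustrative, not the NFP's actual ring format, and a hardware memory engine (unlike this portable C) serializes the updates itself.

  #include <stdint.h>
  #include <stdbool.h>

  #define RING_SLOTS 256   /* power of two for cheap wrap-around */

  struct ring {
      uint32_t head, tail;        /* free-running; head = consumer */
      uint64_t slot[RING_SLOTS]; /* e.g. packet buffer handles */
  };

  static bool ring_put(struct ring *r, uint64_t v)   /* enqueue */
  {
      if (r->tail - r->head == RING_SLOTS)
          return false;                          /* ring is full */
      r->slot[r->tail++ & (RING_SLOTS - 1)] = v;
      return true;
  }

  static bool ring_get(struct ring *r, uint64_t *v)  /* dequeue */
  {
      if (r->head == r->tail)
          return false;                          /* ring is empty */
      *v = r->slot[r->head++ & (RING_SLOTS - 1)];
      return true;
  }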
15. © 2016 NETRONOME SYSTEMS, INC. 15
Fabric Interconnect
Distributed switch fabric
▶ 6-way crossbar routing
▶ 768Gb/s bandwidth across each island
Island-based design methodology
Island interconnect at fixed pin locations
connected by abutment
▶ Fabric ports
▶ Register interface
▶ Interrupts and events
▶ Test logic
16. © 2016 NETRONOME SYSTEMS, INC. 16
Island APR Block Topology
Modular
▶ Allows software to scale as
processing requirements
increase
Re-usable
▶ Blocks can be replaced and
interchanged across the
floorplan
[Diagram: example island floorplans assembled from reusable, interchangeable blocks]
17. © 2016 NETRONOME SYSTEMS, INC. 17
Technology
Intel 22nm
▶ Intel 3D Tri-Gate transistors manufactured on a 22nm process
▶ 37% performance increase at low voltage (0.7V)
▶ 50% power reduction at typical performance vs. 32nm
Specifics
▶ Low leakage SoC process
▶ Foundry support for industry-standard SoC development tools
18. © 2016 NETRONOME SYSTEMS, INC. 18
SoC Verification
Today’s SoC requires co-verification of silicon and software
Simulation and emulation are both required to fully verify the design
Enable server-based networking software applications to run pre-silicon in order to
prove out the design
Scalable test environment
▶ Python used to create Verilog module and test bench
▶ Instantiated UVCs based on the I/Os of interest
[Diagram: island-level test bench with UVCs instantiated on the I/Os of interest]
19. © 2016 NETRONOME SYSTEMS, INC. 19
Software Emulation
Run tests at 500 to 2,000X the speed of simulation by using emulation
Run real world software applications to validate performance and find potential
bottlenecks
Test many thousands of packets in a fraction of the time
Make/run environment that allows any SW engineer to test NFP application code pre-
silicon
Treat the DUT as a "smartNIC" connected to a VM via a PCIe Speedbridge interface and
loaded via the external PCIe interface
[Diagram: the DUT emulated as a host-PCIe-to-network smartNIC, with external DDR memory modeled by a BFM, a PCIe I/O connection to the host, and an Ethernet network]
20. ©2016 Open-NFP 20
Open-NFP www.open-nfp.org
Support and grow reusable research in accelerating dataplane network functions processing
Reduce/eliminate the cost and technology barriers to research in this space
• Technologies: P4, SDN, OpenFlow, Open vSwitch (OVS) offload
• Tools: Discounted hardware, development tools, software, cloud access
• Community: Website (www.open-nfp.org): learning & training materials, active Google group
https://groups.google.com/d/forum/open-nfp, open project descriptions, code repository
• Learning/Education/Research support: summer seminar series, developer conferences, tutorials, and
support for research proposals to the NSF and state agencies
23. ©2016 Open-NFP 23
Conference Attendees / Open-NFP Projects*
[Logos: participating universities and companies]
*This does not imply that these organizations endorse Open-NFP or Netronome