2. About me
Nat Morris
• Based in Haverfordwest (beyond the M4)
• Team lead, Cumulus Networks
• Director & Board Member, UK Network Operators Forum (UKNOF)
• Feeder of dogs
• Attended first SWLUG meeting in 2001
• Twitter: @natmorris
cumulusnetworks.com 2
3. About Cumulus Networks
Team
JR Rivers, co-founder and CEO
Nolan Leake, co-founder and CTO
Shrijeet Mukherjee, VP Engineering
Reza Malekzadeh, VP Business
Jason Martin, VP Customer Experience
Investors
Andreessen Horowitz
Battery Ventures
Sequoia Capital
Wing VC (Peter Wagner)
Ed Bugnion, Diane Greene and Mendel Rosenblum (VMware founders)
8. Understanding Characteristics of a Leaf Switch
• 10/40 Gigabit spine uplink ports
• Serial console port
• Ethernet out-of-band management port
* SFP+ ports can be grouped into a single 40G QSFP port via reverse breakout cable options
* QSFP ports can be broken out into four SFP+ ports via copper or optical transceiver options
9. Understanding Characteristics of a Spine Switch
• Serial console port
• Ethernet out-of-band management port
* QSFP ports can be broken out into four SFP+ ports via copper or optical breakout cable options
11. Anatomy of a Network Switch
[Diagram] Management interfaces: a CPU SoC with DRAM, boot flash, and mass storage, plus a serial console port and an Ethernet management port. Data plane: the switching ASIC driving the 10Gb and 40Gb front panel ports. The CPU complex connects to the ASIC over PCIe.
12. Bare Metal Switch Provisioning
Similar approach to installing an OS on a server:
BIOS + PXE on a server = U-Boot + ONIE (Open Network Install Environment) on a switch
Supported hardware (HCL) preloaded with ONIE
ONIE available on GitHub
• http://onie.github.io/onie/
[Diagram] A bare metal server boots its operating system and apps via BIOS and PXE; a bare metal switch boots its operating system and apps via U-Boot and ONIE.
15. Hardware Compatibility List (HCL)
Switch Model Number | Description | Merchant Silicon | Cumulus Linux Release
Dell S6000-ON | 32 x 40G-QSFP+ | Trident II | 2.1 or later
Edge-Core AS6700-32X with ONIE | 32 x 40G-QSFP+ | Trident II | 2.0.1 or later
Penguin Computing Arctica 3200XL | 32 x 40G-QSFP+ | Trident II | 2.0 or later
Quanta QCT QuantaMesh T5032-LY6 | 32 x 40G-QSFP+ | Trident II | 2.0.1 or later
Agema AG-7448CU | 48 x 10G-SFP+ and 4 x 40G-QSFP+ | Trident | 1.5.0 or later
Dell S4810-ON | 48 x 10G-SFP+ and 4 x 40G-QSFP+ | Trident | 2.0.2 or later
Edge-Core AS5600-52X with ONIE | 48 x 10G-SFP+ and 4 x 40G-QSFP+ | Trident+ | 1.5.0 or later
Edge-Core AS5610-52X with ONIE | 48 x 10G-SFP+ and 4 x 40G-QSFP+ | Trident+ | 2.0.1 or later
Edge-Core AS5710-54X with ONIE | 48 x 10G-SFP+ and 6 x 40G-QSFP+ | Trident II | 2.1.x or later
Penguin Computing Arctica 4804X | 48 x 10G-SFP+ and 4 x 40G-QSFP+ | Trident+ | 1.5.1 or later
Quanta QCT QuantaMesh T-3048-LY2 | 48 x 10G-SFP+ and 4 x 40G-QSFP+ | Trident+ | 1.5.0 or later
Quanta QCT QuantaMesh T-3048-LY2R | 48 x 10G-SFP+ and 4 x 40G-QSFP+ | Trident+ | 1.5.0 or later
Quanta QCT QuantaMesh T5048-LY8 | 48 x 10G-SFP+ and 6 x 40G-QSFP+ | Trident II | 2.1.x or later*
Edge-Core AS4600-54T with ONIE | 48 x 1G-T and 4 x 10G-SFP+ | Apollo2 | 2.0 or later
Penguin Computing Arctica 4804i | 48 x 1G-T and 4 x 10G-SFP+ | Triumph2 | 1.5.1 or later
Quanta QCT QuantaMesh T1048-LB9 | 48 x 1G-T and 4 x 10G-SFP+ | FireBolt3 | 1.5.0 or later
18. ONIE: Bare Metal Install – First Time Boot Up
Boot flow: Boot Loader (HW vendor supplied) → ONIE (HW vendor supplied) → Installer (OS vendor)
Boot Loader
• Low-level boot loader that configures the CPU complex
• Loads and boots ONIE
ONIE
• Linux Kernel with Busybox
• Configures management Ethernet interface
• Locates and executes an OS installer
• Provides tools and environment for installer
OS Installer
• Available from network or USB
• Linux executable
• Installs vendor OS into mass storage
The installer fetches the network OS (OS vendor supplied) and installs it.
19. ONIE: Network OS Installer Discovery and Install Behavior
Steps: Configure Network Interface → Locate Installer → Run Installer
• Uses DHCPv4, DHCPv6
• Configures Ethernet interface for IPv4 / IPv6
• Configures DNS and hostname
• Determines the location of an installer executable
• Examines local file systems, e.g. USB flash drives
• Uses DHCP options, DNS Service Discovery, Multicast DNS, and IPv6 neighbor discovery
• Downloads installer via URL
• Passes various environment variables to installer
• Launches installer
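The DHCP side of the discovery steps above can be configured on the server. A minimal sketch using dnsmasq, where the subnet, interface name, and installer URL are all illustrative (ONIE reads DHCP option 114, the "default-url" option, as the installer location; check the ONIE documentation for the full discovery waterfall):

```
# /etc/dnsmasq.conf (sketch): serve leases on the management network
interface=eth1
dhcp-range=192.0.2.50,192.0.2.150,12h

# Point ONIE at the installer image via DHCP option 114 (default-url)
dhcp-option=114,http://192.0.2.1/onie-installer-x86_64
```

With this in place, a switch that boots into ONIE on this segment gets a lease, downloads the installer from the URL, and runs it.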
20. Networking Interfaces in Linux
Interface | Description
eth0 | Physical interface for out-of-band management
lo | Loopback logical interface; 127.0.0.1 in /etc/hosts (Debian also lists a secondary 127.0.1.1)
swpN | Physical interface for data plane traffic; N corresponds to the front panel port number
bridge | Logical interface creating a single Layer 2 broadcast domain; traffic on member sub-interfaces can be untagged or tagged; commonly called a “VLAN”
bond | Logical interface aggregating two or more interfaces; commonly called a “LAG” or “port channel”
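The interface types in the table compose in a single /etc/network/interfaces file. A sketch only: the addresses, port numbers, and the bond and bridge names are illustrative, and bond options vary by ifupdown version:

```
# Out-of-band management via DHCP
auto eth0
iface eth0 inet dhcp

# Front panel port with a static address
auto swp1
iface swp1 inet static
    address 10.0.0.1/31

# Bond (LAG) aggregating two front panel ports
auto bond0
iface bond0 inet manual
    bond-slaves swp2 swp3

# Bridge (single L2 domain) over tagged sub-interfaces
auto br-vlan100
iface br-vlan100 inet manual
    bridge_ports swp4.100 swp5.100
```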
21. Pushing Changes Down
[Diagram] In user space, the Quagga routing suite (Quagga daemon, Quagga.conf, vtysh), mstpd, lldpd, snmpd, and the CLI tools brctl, vconfig, iproute2, iptables, ebtables, and ip6tables, together with /etc/network/interfaces, push configuration down into the Linux kernel. The kernel holds the resulting state: routing tables, the ARP table, bridge FDB, filter tables, devices, bonds, VLANs, VXLAN bridges, and virtual kernel ports. switchd, acting as the switch HAL, programs that kernel state through the switch driver into the switch silicon behind the front panel ports; the CPU, RAM, and flash run the Linux side.
22. Show Interface Statistics
High level statistics for an interface
cumulus@switch:~$ ip -s link show dev swp1
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
21780 242 0 0 0 242
TX: bytes packets errors dropped carrier collsns
1145554 11325 0 0 0 0
Low level statistics for an interface
cumulus@switch:~$ sudo ethtool -S swp1
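Because switchd mirrors the ASIC counters into the kernel, the same per-interface statistics are also exposed as plain files under /sys/class/net. A sketch reading them; on a switch you would use swp1, and lo is used here only so the snippet runs on any Linux machine:

```shell
# Each counter is a one-number-per-file entry under statistics/
iface=lo
rx_bytes=$(cat /sys/class/net/$iface/statistics/rx_bytes)
tx_bytes=$(cat /sys/class/net/$iface/statistics/tx_bytes)
echo "$iface RX bytes: $rx_bytes  TX bytes: $tx_bytes"
```

This is why ordinary Linux monitoring tools work unchanged on switch ports: they read the same kernel counters.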
23. Deconstructing /etc/network/interfaces
auto swp1
iface swp1 inet static
address 192.168.0.11/30
gateway 192.168.0.1
up ip link set $IFACE up
down ip link set $IFACE down
Annotations:
• auto: bring up the interface during boot, via ifup, or via service networking reload
• swp1: interface name
• inet static: method (manual, static, or dhcp)
• address / gateway: IP address settings for the interface, used only with the static method
• up: command ifup runs to bring up the interface
• down: command ifdown runs to bring down the interface
Method | Action
manual | No IP address configured by default
static | IP address configured using the address and gateway options
dhcp | Obtain an IP address from a DHCP server
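One stanza per method, as a sketch (the interface names and addresses are illustrative):

```
# manual: no IP address configured by default
auto swp10
iface swp10 inet manual

# static: configured via the address and gateway options
auto swp11
iface swp11 inet static
    address 192.168.0.11/30
    gateway 192.168.0.1

# dhcp: address obtained from a DHCP server
auto eth0
iface eth0 inet dhcp
```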
24. Bridging
Bridge = a single isolated Layer 2 broadcast domain
Allows hosts connected to bridge ports (members) to discover each other without having to define routes
Traffic on ports is tagged (802.1Q VLAN ID) or untagged (native)
• Tagging involves using sub-interfaces, e.g. swpN.ID
Commonly called a “VLAN” in traditional networking
25. Defining a Bridge
auto br-vlan100
iface br-vlan100 inet manual
bridge_ports swp4.100 swp5.100
up ip link set $IFACE up
down ip link set $IFACE down
Annotations:
• auto: bring up the interface during boot, via ifup, or via service networking reload
• br-vlan100: interface name
• inet manual: method (manual, static, or dhcp)
• bridge_ports: bridge members; swp4, swp4.100, swp5, and swp5.100 must be defined first, and the .100 suffix creates a sub-interface (turning the swp port into a trunk port)
• up: command ifup runs to bring up the interface
• down: command ifdown runs to bring down the interface
26. Show Bridge
Show bridges
cumulus@switch:~$ brctl show
bridge name bridge id STP enabled interfaces
br-vlan100 8000.089e01f89511 no swp5
 swp6

Show bridge MAC addresses
cumulus@switch:~$ brctl showmacs br-red
port name mac addr is local? ageing timer
swp4 06:90:70:22:a6:2e no 19.47
swp1 12:12:36:43:6f:9d no 40.50
swp1 44:38:39:00:12:9b yes 0.00
swp2 44:38:39:00:12:9c yes 0.00
27. Cumulus Linux Packaging and Support
Debian.org provides 40K+ packages; Cumulus Linux organizes its own and curated packages into five repositories:
• main: 250 packages, ~20 of them Cumulus Linux packages. Examples: Ruby, Perl, Python, Bash, iptables, LLDP.
• updates: revised versions of packages in main.
• security-updates: packages addressing known concerns and CVEs.
• addons: user-identified utilities and libraries, e.g. Puppet, Facter, Chef, collectd.
• testing: early-access utilities and libraries, e.g. BIRD (CL 1.5).
Support ranges from fully supported (main, updates, security-updates) to best effort (addons, testing); starred tiers on the slide mark packages not controlled by Cumulus.
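These repositories map onto ordinary apt sources. A sketch of what /etc/apt/sources.list entries might look like; the hostname and distribution name here are illustrative placeholders, so consult the release notes for the real values:

```
# Supported components
deb http://repo.example.com/cumulus CumulusLinux-2.2 main updates security-updates

# Best-effort components; testing is experimental and not QA'ed
deb http://repo.example.com/cumulus CumulusLinux-2.2 addons
deb http://repo.example.com/cumulus CumulusLinux-2.2 testing
```

Standard apt-get workflows (update, install, upgrade) then work exactly as on Debian.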
28. Traditional Hierarchical Network Topology
[Diagram] Three tiers (Access, Aggregation, Core) with the L2/L3 boundary at the aggregation layer.
Legacy and limitations
• Not designed for today’s data center running modern workloads: server density, increased server-to-server traffic
• Numerous proprietary protocols: STP/RSTP/PVST+, VTP, HSRP, MLAG, LACP
• “This is what we’ve been taught”
29. L3 Is the Future
[Diagram] Clos network (“spine/leaf”): an L3 fabric using ECMP between leaf and spine, with L2 confined to the edge.
1. Simpler network: fewer protocols
2. Standards-based: fewer proprietary features
3. Predictable latency: every leaf is 1 hop away
4. Horizontally scalable
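A single standard protocol providing ECMP is the heart of the design. A Quagga bgpd sketch for a leaf with two spine uplinks; the AS numbers, addresses, and the maximum-paths value are illustrative, not from the deck:

```
! /etc/quagga/Quagga.conf (sketch)
router bgp 65101
 bgp router-id 10.0.0.11
 ! one eBGP session per spine uplink
 neighbor 10.1.0.0 remote-as 65201
 neighbor 10.1.0.2 remote-as 65202
 ! install up to 8 equal-cost BGP paths, giving ECMP across the spines
 maximum-paths 8
 ! advertise the locally attached server subnet
 network 10.0.1.0/24
```

With equal-cost paths installed in the kernel, the ASIC hashes flows across all spine uplinks with no blocked ports.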
30. Basic Clos Architecture (2-Tier Spine/Leaf)
Optimized for high-bandwidth east-west traffic patterns
[Diagram] The leaf layer connects compute and storage as well as network services; the spine layer aggregates the leaves and connects to the core or WAN.
31. Basic Clos Architecture (3-Tier or 5-Stage)
[Diagram] Multiple pods, each with its own leaf and spine tiers, are joined by an inter-pod spine; network services attach at the leaf layer.
The logos on the right of the team slide represent the companies where each team member previously served.
Competitive notes vs. Arista:
• Cumulus Networks’ HCL focuses on fixed boxes (leaf/spine). Same Broadcom silicon as Arista switches, so the same hardware performance at a lower price point; Arista has additional hardware platforms for special purposes.
• Choice: Cumulus focuses on breadth of platforms and vendors for best of breed. Arista supports black boxes and many different configurations; Cumulus does not need differentiated price points for low-end configurations because they are already cheaper.
• Cumulus Linux is a Linux OS, and the network services apps that run on top of it are very rich. Arista, in contrast, is a Linux-based OS: EOS integrates all apps in one image, and control is limited to some Linux containers.
• Cloud networking designs include L2/host multi-homing*, L3/ECMP, and L2-over-L3 VXLAN. Customers are moving to L3 Clos fabrics, so L2/host multi-homing is all that is needed, not MLAG.
• Orchestration: a comprehensive set of tools today, on par with Arista, with rapid innovation. Our model offers the same orchestration tools and more due to the rapid pace of innovation (e.g. Midokura). OpenFlow is supported with other OSes such as Big Switch.
• Automation: Cumulus Linux has zero touch provisioning, automated install, and better DevOps integration (thanks to unmodified Linux and scripting languages).
• Application visibility: leverage server-style tools and hardware counters and functionality. Arista may have stronger network tracers, advanced mirroring (DANZ), and advanced congestion management (LANZ) tools today. Congestion management counters will be enabled with the switchd file system; more can be done for simplification, but similar capability can be enabled through scripting.
• Programmable foundation: driver abstractions, eAPI, unmodified Linux. Cumulus Linux driver abstractions are unchanged (in contrast, Arista uses SysDB to provide visibility into their own driver), and Cumulus Linux networking data structures are unchanged (Arista uses its own, so the user is limited to management-plane and control-plane box changes).
Bare metal switches have been around for a while, but each had a proprietary OS and was not robust. Now we have the best OS to ride along with the best switches.
Just like BIOS and PXE allow you to install an OS on a server using a remote image, the combination of U-Boot and ONIE allows that for bare metal switches. We require ONIE preloaded on HCL hardware because U-Boot differs across vendor devices, and U-Boot itself is not very user friendly. We created ONIE and gave it to the Open Compute Project (OCP); it facilitates easy network OS installation, and not just of Cumulus Linux (Pica8 is a competitive example). Now you have your choice of installing whatever OS you want, not just what comes with the switch (e.g. Cisco IOS as an OEM example, or FASTPATH, Broadcom’s OS). Think of ONIE as PXE on steroids: a small BusyBox Linux distribution with a set of fetch-and-execute Bash scripts. It leverages modern ways of discovering networks using what is built into Linux, e.g. IPv6 neighbor discovery, DHCPv6, and DHCPv4. U-Boot is very good at probing the bus and takes about 1 MB; it lives in boot flash dedicated to booting the hardware, separate from the operating system flash. ONIE builds on top of this and takes about 3.5 MB. ONIE is extremely well documented and flexible, and embraced by the open source community (the source has been on GitHub since summer 2013).
2.1 releases are subject to hardware availability. Make sure to explicitly order the ONIE part, as some models ship without ONIE.
swp numbering starts at 1 instead of 0 to match the numbering typically found on the front panel silk screen.
Within Linux is a construct called netlink, the communication channel between user space and the Linux kernel. Everything in the user space box talks to the kernel through netlink (not shown on the diagram). switchd snoops the netlink traffic and can react, e.g. whenever you add or remove a route. Color decode: green boxes with an orange border push things down to the kernel.
As packets flow across the merchant silicon, switchd updates the counters in the kernel in real time. The advantage of this approach is that you can deploy any interface monitoring tool you would normally use in Linux. How do you access counters? Use netstat. It is the same as on a server, even for a switch port.
Technically the up and down commands are needed only if the method is manual, but we add them here for consistency. In the forthcoming CL 2.1, the commands for ifup and ifdown will be simplified.
The basic use of bridging is to connect all of the physical interfaces in the system into a single Layer 2 domain, resulting in a switch that behaves like a Cisco Catalyst device. VLANs are called bridging because under Linux the way you create a VLAN is by creating a bridge; we try to stick to Linux terms because Cumulus Linux is Linux. VLAN tags are implemented as VLAN sub-interfaces. Traffic from multiple bridges or VLAN segments can be multiplexed on the same data link: Cumulus Linux supports the 802.1Q VLAN trunk interface, which carries traffic from multiple VLANs, each packet encapsulated with an 802.1Q VLAN tag. The VLAN ID carried in the tag associates the packet with the corresponding VLAN segment, and each VLAN sub-interface of the trunk can be added as a member interface of the corresponding bridge.
Add bridge_waitport 0 if you do not want Cumulus Linux to wait while trying to connect to a switch port; it waits 30 seconds by default. Setting it to 0 is handy if ports are not connected or CL is not licensed.
Cumulus Linux-specific packages are organized into 5 repositories (compiled for the proper architecture: PowerPC, MIPS, ARM, x86) that we manage and support, in contrast to Debian.org, which we do not control. main: all packages in the CL image to support base functionality. updates: updates to any packages in main, not security related. security-updates: security-related updates to packages in main (addressing known exploits), so you should prioritize them. addons: additional packages not in the image, e.g. Puppet from Puppet Labs. testing: packages undergoing development, experimental and not QA’ed. The repository is publicly accessible from the internet, and we use the same apt-get infrastructure as Debian; see the KB article if you need to set up a local apt-get repository. switchd is the main item whose source you cannot see; most of our commands are written in Python and you can see the source. We have vetted and touched in some respect all the daemons we ship. (If some customers are concerned about apparent reliability or support problems with Quagga in the community: we have a large customer that has run Quagga for over a year to manage over 6,000 switches for OSPF and IPv4 without problems.) Knowing what customers use and/or need will help us decide and prioritize what to test and include (e.g. vi, Emacs, Ruby, Perl, Python; some customers want Puppet, Chef, Ansible). The * on the slide means we do not control the source, but we will do as much as we can accordingly.
The traditional hierarchical network was designed for server traffic that mostly went out and in, not server-to-server within the datacenter. Complexity increases with more protocols, particularly proprietary ones (vs. standard, open protocols). With virtualization, today’s datacenters have far more nodes: each server runs dozens of VMs that need to talk to each other, or VMs need to move between server hosts through vMotion. Just because this is the way Cisco has taught us does not mean it is the most efficient way.
We have taken a page out of the playbook of the largest datacenters. Simpler: a single protocol (BGP or OSPF) provides ECMP. Predictable latency: everything is a single hop away. Horizontally scalable: scale beyond two aggregation switches, with higher bandwidth through ECMP and no blocked ports. Better failure behavior: if one spine fails, the impact is smaller than an aggregation switch failure in a traditional hierarchical topology; you do not lose 50% of the bandwidth as in traditional or MLAG-based designs. The two leaves connecting to the core are known as datacenter leaves. Why don’t we connect core switches directly to spine switches? We want to isolate external traffic, and core ports are expensive, so hooking the core to all the spines becomes cost prohibitive as you scale.