Network failures continue to plague datacenter
operators as their symptoms may not have
direct correlation with where or why they occur. We
introduce 007, a lightweight, always-on diagnosis application
that can find problematic links and also
pinpoint problems for each TCP connection. 007 is
completely contained within the end host. During
its two month deployment in a tier-1 datacenter, it
detected every problem found by previously deployed
monitoring tools while also finding the sources of
other problems previously undetected.
1. Papers We Love Sept. 2018
007: Democratically Finding The
Cause of Packet Drops
Michael Kehoe
Staff Site Reliability Engineer
NSDI - https://www.usenix.org/conference/nsdi18/presentation/arzani
2. 007: Democratically Finding The Cause of Packet Drops
Behnaz Arzani Selim Ciraci Luiz Chamon Yibo Zhu
Hongqiang Liu Jitu Padhye Boon Thau Loo Geoff Outhred
5. Introduction & Motivation
“Even a small network outage
or a few lossy links can cause
the VM to “panic” and reboot.
In fact, 17% of our VM
reboots are due to network
issues and in over 70% of
these none of our monitoring
tools were able to find the
links that caused the
problem.”
9. “In a network of ≥ 10^6 links it’s a reasonable
assumption that there is a non-zero chance that
a number (> 10) of these links are bad (due to
device, port, or cable, etc.)…However, currently
we do not have a direct way to correlate
customer impact with bad links".
Introduction & Motivation
10. “007 records the path of TCP connections
(flows) suffering from one or more
retransmissions and assigns proportional
“blame” to each link on the path. It then
provides a ranking of links that represents their
relative drop rates.”
Introduction & Motivation
11. Introduction & Motivation
1. Does not require any changes to
network infrastructure
2. Does not require any changes to client
software
3. Detects in-band failures
4. Resilient to noise
5. Negligible overhead
12. Assumptions
DISCUSSION
1. L2 networks are not supported unless they:
1. Support path discovery methods
2. Support EverFlow
2. No use of Source NATs (SNATs)
3. Assumes ECMP (L3) Clos network
design
4. Don’t try to reverse-engineer ECMP
15. Design Overview
• TCP monitoring agent: detects
retransmissions at each end-host.
• Path discovery agent: identifies
the flow’s path to the
Destination IP (DIP)
• At the end-hosts, a voting scheme
is used based on the paths of
flows that had retransmissions. At
regular intervals of 30s the votes
are tallied by a centralized
analysis agent to find the top-
voted links.
16. Design Overview
• 6000 lines of C++ code
• 600KB memory usage
• 1-3% CPU Usage
• 200 KBs bandwidth
utilization
18. TCP Monitoring Agent
• TCP Monitoring agent
notifies Path Discovery
Agent immediately after any
retransmit
• Use of ‘Event Tracing for
Windows’ (ETW)
• Could use BPF in Linux
20. Path Discovery Agent
“The path discovery agent
uses traceroute packets to
find the path of flows that
suffer retransmissions. These
packets are used solely to
identify the path of a flow.
They do not need to be
dropped for 007 to operate”
21. Path Discovery Agent
“Once the TCP monitoring
agent notifies the path
discovery agent that a flow
has suffered a
retransmission, the path
discovery agent checks its
cache of discovered paths for
that epoch…It then sends 15
appropriately crafted TCP
packets with TTL values
ranging from 1–15.”
22. Path Discovery Agent
ENGINEERING CHALLENGES – ECMP
• ECMP algorithms are
unknown
• All packets of a given flow,
defined by the five-tuple,
follow the same path
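To picture why reusing the flow’s five-tuple keeps probes on the flow’s path, here is a purely illustrative sketch of ECMP next-hop selection. Real switch hash functions are vendor-specific and unknown to 007 (which is exactly why it reuses the original five-tuple rather than reverse-engineering the hash); the function and names below are assumptions for illustration only.

```python
# Illustrative only: a toy ECMP next-hop selection keyed on the five-tuple.
# Any packet (including a TTL-limited probe) that reuses the same five-tuple
# hashes to the same next hop, so probes follow the original flow's path.
import hashlib

def ecmp_next_hop(five_tuple, next_hops):
    key = "|".join(map(str, five_tuple)).encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

flow = ("10.0.0.1", 51000, "10.0.1.2", 443, "TCP")   # src ip/port, dst ip/port, proto
print(ecmp_next_hop(flow, ["t1-1", "t1-2", "t1-3", "t1-4"]))
```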
23. Path Discovery Agent
ENGINEERING CHALLENGES – RE-ROUTING & PACKET DROPS
• Traceroute itself may fail
• A lossy link may cause one
or more BGP sessions to
fail, triggering rerouting
24. Path Discovery Agent
ENGINEERING CHALLENGES – ROUTER ALIASING
• Have a pre-mapped
topology of:
• Switch/Router names
• Router/interface IP
addresses
26. Analysis Agent
VOTING BASED SCHEME
• Good votes are 0
• Bad votes are 1/h, where h is
the number of hops on the path
• Each link on the path is
given a vote
28. Analysis Agent
VOTING BASED SCHEME
• Congestion & single drops
are akin to noise
• Single flow is unlikely to go
through more than one
failed link
• Probability of errors in
results diminishes
exponentially with the
number of flows
30. Simulations
PERFORMANCE
• Accuracy: Proportion of
correctly identified drop
causes
• Recall: How many of the
failures are detected (false
negatives)
• Precision: How trusted are
the results (false positives)
31. Evaluation: Simulations
PERFORMANCE: OPTIMAL CASE
• 0.05–1% drop rate
• Accuracy is > 96%
• Recall/ Precision is
almost always 100%
https://github.com/behnazak/Vigil-007SourceCode
39. Evaluation: Production
• 007 located the bad link
correctly in 281 cases of
VM reboots in Microsoft’s
DCN
• Identifies an average of 0.45 ±
0.12 links as bad per epoch
• Of links dropping
packets:
• 48%: Server to ToR
• 24%: T1 – ToR
41. Discussion
• Congestion detection
• Ranking with bias
• Finding the cause of other
problems
• 007 can also be used for:
• Detection of switch
failures
So from the beginning of the paper, they very precisely state the problem they are trying to solve:
Find the link that dropped the packet and do so with negligible overhead and no changes to the network infrastructure
“Even a small network outage or a few lossy links can cause the VM to “panic” and reboot. In fact, 17% of our VM reboots are due to network issues and in over 70% of these none of our monitoring tools were able to find the links that caused the problem.”
This problem of course isn’t new at all. There has been a reasonable amount of research into link and forwarding-failure detection. The paper highlights some very similar research at the beginning, and Section 10 has a deep dive on related work and how this solution meets the stated goals.
Firstly they mention Pingmesh, which is a Microsoft project.
Pingmesh has gaps: it doesn’t guarantee that it will cover all links
There is non-trivial overhead on both CPU and network
It also does out-of-band link testing, so it doesn’t necessarily recreate the conditions the application sees
Roy et al
Monitors all paths
Requires modifications to routers
Requires special features in the switch
EverFlow
Requires all traffic to be captured
Not scalable
They also dive a little deeper into the motivation and why you need reliable faulty-link detection:
So if you have a network of 10^6 (a million) or more links, there is a non-zero chance that more than 10 of them are bad for various reasons (device, port, cable, etc.)
Generally it’s hard to determine customer impact and then prioritize the remediation of these links
So how does 007 do this?
“007 records the path of TCP connections (flows) suffering from one or more retransmissions and assigns proportional “blame” to each link on the path. It then provides a ranking of links that represents their relative drop rates.”
Some direct benefits are:
Does not require any changes to network infrastructure
Does not require any changes to client software
Detects in-band failures
Resilient to noise (able to distinguish congestion from faulty links)
Negligible overhead
So the assumptions part of the paper is actually at the back, but I think these are worth mentioning now:
L2 networks are not supported unless there are path-discovery methods or EverFlow support
It’s assumed that there is no source NATting (SNAT)
007 assumes a Clos topology (which is an L3 network)
007 does not try to reverse-engineer ECMP
So let’s look at the design of 007
A TCP Monitoring Agent detects retransmissions on each end-host
The Path Discovery Agent identifies the flow’s path to the destination IP
Finally, a centralized analysis agent tallies the votes every 30 seconds
If we (a) only track the path of those flows that have retransmissions, (b) assign each link on the path of such a flow a vote of 1/h, where h is the path length, and (c) sum up the votes during a given period, then the top-voted links are almost always the ones dropping packets
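A minimal sketch of that tally (the function and link names are hypothetical; the real analysis agent is part of 007’s C++ code base and runs every 30-second epoch):

```python
from collections import defaultdict

def tally_votes(bad_flow_paths):
    """bad_flow_paths: one entry per flow that saw retransmissions this epoch;
    each entry is the flow's path as a list of (switch_a, switch_b) links."""
    votes = defaultdict(float)
    for path in bad_flow_paths:
        h = len(path)
        if h == 0:
            continue
        for link in path:
            votes[link] += 1.0 / h          # proportional "blame" per link
    # rank links by accumulated votes, most suspect first
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

# two bad flows sharing one link: the shared link floats to the top
paths = [
    [("t0-1", "t1-2"), ("t1-2", "t2-1")],
    [("t0-3", "t1-2"), ("t1-2", "t2-1")],
]
print(tally_votes(paths))
```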
If you recall, one of the key objectives of the project is to have low resource utilization.
As you can see, this is all pretty lightweight
So first is the TCP Monitoring Agent
This agent simply notifies the path discovery agent that a flow has suffered a TCP retransmission, passing along the flow’s five-tuple.
In Microsoft’s case, they use the Event Tracing for Windows (or ETW) to handle this process.
In Linux, it’s possible to use BPF to do this; you can find similar sample code from Brendan Gregg, along the lines of the sketch below.
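As a rough idea of what the Linux side could look like (an assumption, not 007’s actual code, which uses ETW on Windows), here is a minimal BCC sketch in the spirit of Brendan Gregg’s tcpretrans tool. It just counts retransmissions per process by attaching a kprobe to tcp_retransmit_skb, whereas 007 would extract the flow’s five-tuple and hand it to the path discovery agent:

```python
from time import sleep
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);

// Fires on every kernel TCP retransmission.
int trace_retransmit(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

print("Counting TCP retransmits per process... Ctrl-C to stop")
while True:
    try:
        sleep(30)  # 007 works in 30-second epochs
    except KeyboardInterrupt:
        break
    for pid, count in b["counts"].items():
        print(f"pid {pid.value}: {count.value} retransmits")
    b["counts"].clear()
```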
So the Path Discovery Agent gets the five-tuple and then runs a traceroute from the source end-host to the destination.
These packets are ONLY used to identify the path of the flow; they do not need to be dropped to be of use.
There is a cache of discovered paths (per epoch) that helps lower the number of traceroutes we need to do
If the cache doesn’t have a hit, 15 TCP packets are sent with increasing TTLs (1–15) to discover the path, roughly as in the sketch below.
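A minimal sketch of that probing step, assuming Scapy and raw-socket privileges (discover_path and the addresses are hypothetical, not 007’s API; the real agent crafts probes that look like part of the existing connection, while this sketch simply sends SYNs). It reuses the flow’s ports so ECMP hashes the probes onto the same path, and reads each hop’s address from the ICMP time-exceeded replies:

```python
from scapy.all import IP, TCP, ICMP, sr1

def discover_path(src_port, dst_ip, dst_port, max_ttl=15, timeout=1.0):
    """Return the list of hop IPs toward dst_ip for one flow (hypothetical helper)."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        # Reusing the flow's ports keeps the five-tuple constant, so ECMP
        # hashes these probes onto the same path as the original flow.
        probe = IP(dst=dst_ip, ttl=ttl) / TCP(sport=src_port, dport=dst_port, flags="S")
        reply = sr1(probe, timeout=timeout, verbose=False)
        if reply is None:
            hops.append(None)                    # this hop did not answer
        elif reply.haslayer(ICMP) and reply[ICMP].type == 11:
            hops.append(reply.src)               # ICMP time-exceeded: hop revealed
        else:
            hops.append(reply.src)               # reached the destination (SYN-ACK/RST)
            break
    return hops

print(discover_path(51000, "10.0.1.2", 443))
```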
The paper assumes that it’s possible to translate link IPs to router and switch names and to understand the network topology
007 has been designed for a specific use case, namely finding the cause of packet drops on individual connections in order to provide application context. This resulted in a number of design choices:
Congestion usually comes in the form of very small packet loss. In this paper’s case, 92% of congestion-related drop rates were between 10^-5 and 10^-8; these are treated as noise
The ranking approach will be biased toward heavily trafficked links. This is generally OK, as faults on those links impact the most traffic.
007 can also be used to find issues with switches, not just links