Network failures continue to plague datacenter
operators as their symptoms may not have
direct correlation with where or why they occur. We
introduce 007, a lightweight, always-on diagnosis application
that can find problematic links and also
pinpoint problems for each TCP connection. 007 is
completely contained within the end host. During
its two month deployment in a tier-1 datacenter, it
detected every problem found by previously deployed
monitoring tools while also finding the sources of
other problems previously undetected.
1. Papers We Love Sept. 2018
007: Democratically Finding The
Cause of Packet Drops
Michael Kehoe
Staff Site Reliability Engineer
NSDI - https://www.usenix.org/conference/nsdi18/presentation/arzani
2. 007: Democratically Finding The Cause of Packet Drops
Behnaz Arzani Selim Ciraci Luiz Chamon Yibo Zhu
Hongqiang Liu Jitu Padhye Boon Thau Loo Geoff Outhred
5. Introduction & Motivation
“Even a small network outage
or a few lossy links can cause
the VM to “panic” and reboot.
In fact, 17% of our VM
reboots are due to network
issues and in over 70% of
these none of our monitoring
tools were able to find the
links that caused the
problem.”
9. “In a network of ≥ 10^6 links it’s a reasonable
assumption that there is a non-zero chance that
a number (> 10) of these links are bad (due to
device, port, or cable, etc.)…However, currently
we do not have a direct way to correlate
customer impact with bad links".
Introduction & Motivation
10. “007 records the path of TCP connections
(flows) suffering from one or more
retransmissions and assigns proportional
“blame” to each link on the path. It then
provides a ranking of links that represents their
relative drop rates.”
Introduction & Motivation
11. Introduction & Motivation
1. Does not require any changes to
network infrastructure
2. Does not require any changes to client
software
3. Detects in-band failures
4. Resilient to noise
5. Negligible overhead
12. Assumptions
DISCUSSION
1. L2 networks are not supported unless they:
1. Support path discovery methods
2. Support EverFlow
2. No use of Source NATs (SNATs)
3. Assumes ECMP (L3) Clos network
design
4. Don’t try to reverse-engineer ECMP
15. Design Overview
• TCP monitoring agent: detects
retransmissions at each end-host.
• Path discovery agent: identifies
the flow’s path to the
Destination IP (DIP)
• At the end-hosts, a voting scheme
is used based on the paths of
flows that had retransmissions. At
regular intervals of 30s the votes
are tallied by a centralized
analysis agent to find the top-
voted links.
16. Design Overview
• 6000 lines of C++ code
• 600KB memory usage
• 1-3% CPU Usage
• 200 KBs bandwidth
utilization
18. TCP Monitoring Agent
• TCP Monitoring agent
notifies Path Discovery
Agent immediately after any
retransmit
• Use of ‘Event Tracing for
Windows’ (ETW)
• Could use BPF in Linux
20. Path Discovery Agent
“The path discovery agent
uses traceroute packets to
find the path of flows that
suffer retransmissions. These
packets are used solely to
identify the path of a flow.
They do not need to be
dropped for 007 to operate”
21. Path Discovery Agent
“Once the TCP monitoring
agent notifies the path
discovery agent that a flow
has suffered a
retransmission, the path
discovery agent checks its
cache of discovered paths for
that epoch…It then sends 15
appropriately crafted TCP
packets with TTL values
ranging from 1–15.”
22. Path Discovery Agent
ENGINEERING CHALLENGES – ECMP
• ECMP algorithms are
unknown
• All packets of a given flow,
defined by the five-tuple,
follow the same path
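To picture why reusing the flow’s five-tuple keeps probes on the flow’s path, here is a purely illustrative sketch of ECMP next-hop selection. Real switch hash functions are vendor-specific and unknown to 007 (which is exactly why it reuses the original five-tuple rather than reverse-engineering the hash); the function and names below are assumptions for illustration only.

```python
# Illustrative only: a toy ECMP next-hop selection keyed on the five-tuple.
# Any packet (including a TTL-limited probe) that reuses the same five-tuple
# hashes to the same next hop, so probes follow the original flow's path.
import hashlib

def ecmp_next_hop(five_tuple, next_hops):
    key = "|".join(map(str, five_tuple)).encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

flow = ("10.0.0.1", 51000, "10.0.1.2", 443, "TCP")   # src ip/port, dst ip/port, proto
print(ecmp_next_hop(flow, ["t1-1", "t1-2", "t1-3", "t1-4"]))
```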
23. Path Discovery Agent
ENGINEERING CHALLENGES – RE-ROUTING & PACKET DROPS
• Traceroute itself may fail
• A lossy link may cause one
or more BGP sessions to
fail, triggering rerouting
24. Path Discovery Agent
ENGINEERING CHALLENGES – ROUTER ALIASING
• Have a pre-mapped
topology of:
• Switch/Router names
• Router/interface IP
addresses
26. Analysis Agent
VOTING BASED SCHEME
• Good votes are 0
• Bad votes are 1/h, where h is
the number of hops on the path
• Each link on the path is
given a vote
28. Analysis Agent
VOTING BASED SCHEME
• Congestion & single drops
are akin to noise
• Single flow is unlikely to go
through more than one
failed link
• Probability of errors in
results diminishes
exponentially with the
number of flows
30. Simulations
PERFORMANCE
• Accuracy: Proportion of
correctly identified drop
causes
• Recall: How many of the
failures are detected (false
negatives)
• Precision: How trusted are
the results (false positives)
31. Evaluation: Simulations
PERFORMANCE: OPTIMAL CASE
• 0.05–1% drop rate
• Accuracy is > 96%
• Recall/ Precision is
almost always 100%
https://github.com/behnazak/Vigil-007SourceCode
39. Evaluation: Production
• 007 located the bad link
correctly in 281 cases of
VM reboots in Microsoft’s
DCN
• Identifies an average of 0.45 ±
0.12 links as bad per epoch
• Of links dropping
packets:
• 48%: Server to ToR
• 24%: T1 – ToR
41. Discussion
• Congestion detection
• Ranking with bias
• Finding the cause of other
problems
• 007 can also be used for:
• Detection of switch
failures
So from the beginning of the paper, they very precisely state the problem they are trying to solve:
Find the link that dropped the packet and do so with negligible overhead and no changes to the network infrastructure
“Even a small network outage or a few lossy links can cause the VM to “panic” and reboot. In fact, 17% of our VM reboots are due to network issues and in over 70% of these none of our monitoring tools were able to find the links that caused the problem.”
This problem of course isn’t new at all. There has been a reasonable amount of research into link and forwarding-failure detection. The paper highlights some very similar research at the beginning, and Section 10 has a deep dive on related work and how this solution meets the stated goals.
Firstly they mention Pingmesh, which is a Microsoft project.
Pingmesh has gaps: it doesn’t guarantee that it will cover all links
There is non-trivial overhead on both CPU and network
It also does out-of-band link testing, so it doesn’t necessarily recreate the conditions the application sees
Roy et al
Monitors all paths
Requires modifications to routers
Requires special features in the switch
EverFlow
Requires all traffic to be captured
Not scalable
They also dive a little deeper into the motivation and why you need reliable faulty-link detection:
So if you have a network of 10^6 (a million) or more links, there is a non-zero chance that more than 10 of them are bad for various reasons (device, port, cable, etc.)
Generally it’s hard to determine customer impact and then prioritize the remediation of these links
So how does 007 do this?
“007 records the path of TCP connections (flows) suffering from one or more retransmissions and assigns proportional “blame” to each link on the path. It then provides a ranking of links that represents their relative drop rates.”
Some direct benefits are:
Does not require any changes to network infrastructure
Does not require any changes to client software
Detects in-band failures
Resilient to noise (able to distinguish congestion from faulty links)
Negligible overhead
So the assumptions part of the paper is actually at the back, but I think these are worth mentioning now:
L2 networks are not supported unless there are path-discovery methods or EverFlow support
It’s assumed that there is no source NATting (SNAT)
007 assumes a Clos topology (which is an L3 network)
007 does not try to reverse-engineer ECMP
So let’s look at the design of 007
A TCP Monitoring Agent detects retransmissions on each end-host
The Path Discovery Agent identifies the flow’s path to the destination IP
Finally, a centralized analysis agent tallies the votes every 30 seconds
If we (a) only track the path of those flows that have retransmissions, (b) assign each link on the path of such a flow a vote of 1/h, where h is the path length, and (c) sum up the votes during a given period, then the top-voted links are almost always the ones dropping packets
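A minimal sketch of that tally (the function and link names are hypothetical; the real analysis agent is part of 007’s C++ code base and runs every 30-second epoch):

```python
from collections import defaultdict

def tally_votes(bad_flow_paths):
    """bad_flow_paths: one entry per flow that saw retransmissions this epoch;
    each entry is the flow's path as a list of (switch_a, switch_b) links."""
    votes = defaultdict(float)
    for path in bad_flow_paths:
        h = len(path)
        if h == 0:
            continue
        for link in path:
            votes[link] += 1.0 / h          # proportional "blame" per link
    # rank links by accumulated votes, most suspect first
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

# two bad flows sharing one link: the shared link floats to the top
paths = [
    [("t0-1", "t1-2"), ("t1-2", "t2-1")],
    [("t0-3", "t1-2"), ("t1-2", "t2-1")],
]
print(tally_votes(paths))
```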
If you recall, one of the key objectives of the project is to have low resource utilization.
As you can see, this is all pretty lightweight
So first is the TCP Monitoring Agent
This agent simply notifies the path discovery agent that a flow has suffered a TCP retransmission, passing along the flow’s five-tuple.
In Microsoft’s case, they use the Event Tracing for Windows (or ETW) to handle this process.
In Linux, it’s possible to use BPF to do this; you can find similar sample code from Brendan Gregg, along the lines of the sketch below.
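As a rough idea of what the Linux side could look like (an assumption, not 007’s actual code, which uses ETW on Windows), here is a minimal BCC sketch in the spirit of Brendan Gregg’s tcpretrans tool. It just counts retransmissions per process by attaching a kprobe to tcp_retransmit_skb, whereas 007 would extract the flow’s five-tuple and hand it to the path discovery agent:

```python
from time import sleep
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);

// Fires on every kernel TCP retransmission.
int trace_retransmit(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

print("Counting TCP retransmits per process... Ctrl-C to stop")
while True:
    try:
        sleep(30)  # 007 works in 30-second epochs
    except KeyboardInterrupt:
        break
    for pid, count in b["counts"].items():
        print(f"pid {pid.value}: {count.value} retransmits")
    b["counts"].clear()
```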
So the Path Discovery Agent gets the five-tuple and then runs a traceroute from the source end-host to the destination.
These packets are ONLY used to identify the path of the flow; they do not need to be dropped to be of use.
There is a cache of discovered paths (per epoch) that helps lower the number of traceroutes we need to do
If the cache doesn’t have a hit, 15 TCP packets are sent with increasing TTLs (1–15) to discover the path, roughly as in the sketch below.
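A minimal sketch of that probing step, assuming Scapy and raw-socket privileges (discover_path and the addresses are hypothetical, not 007’s API; the real agent crafts probes that look like part of the existing connection, while this sketch simply sends SYNs). It reuses the flow’s ports so ECMP hashes the probes onto the same path, and reads each hop’s address from the ICMP time-exceeded replies:

```python
from scapy.all import IP, TCP, ICMP, sr1

def discover_path(src_port, dst_ip, dst_port, max_ttl=15, timeout=1.0):
    """Return the list of hop IPs toward dst_ip for one flow (hypothetical helper)."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        # Reusing the flow's ports keeps the five-tuple constant, so ECMP
        # hashes these probes onto the same path as the original flow.
        probe = IP(dst=dst_ip, ttl=ttl) / TCP(sport=src_port, dport=dst_port, flags="S")
        reply = sr1(probe, timeout=timeout, verbose=False)
        if reply is None:
            hops.append(None)                    # this hop did not answer
        elif reply.haslayer(ICMP) and reply[ICMP].type == 11:
            hops.append(reply.src)               # ICMP time-exceeded: hop revealed
        else:
            hops.append(reply.src)               # reached the destination (SYN-ACK/RST)
            break
    return hops

print(discover_path(51000, "10.0.1.2", 443))
```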
The paper assumes that it’s possible to translate link IPs to router and switch names and to understand the network topology
007 has been designed for a specific use case, namely finding the cause of packet drops on individual connections in order to provide application context. This resulted in a number of design choices:
Congestion usually comes in the form of very small packet loss. In this paper’s case, 92% of congestion-related drop rates were between 10^-5 and 10^-8; these are treated as noise
The ranking approach will be biased toward heavily trafficked links. This is generally OK, as faults on those links impact the most traffic.
007 can also be used to find issues with switches, not just links