Papers We Love Sept. 2018
007: Democratically Finding The
Cause of Packet Drops
Michael Kehoe
Staff Site Reliability Engineer
NSDI - https://www.usenix.org/conference/nsdi18/presentation/arzani
007: Democratically Finding The Cause of Packet Drops
Behnaz Arzani Selim Ciraci Luiz Chamon Yibo Zhu
Hongqiang Liu Jitu Padhye Boon Thau Loo Geoff Outhred
Today’s agenda
1 Introduction & Motivation
2 TCP Monitoring Agent
3 Path Discovery Agent
4 Analysis Agent
5 Evaluations: Simulations
6 Evaluations: Production
7 Discussion
Introduction & Motivation
“Even a small network outage
or a few lossy links can cause
the VM to “panic” and reboot.
In fact, 17% of our VM
reboots are due to network
issues and in over 70% of
these none of our monitoring
tools were able to find the
links that caused the
problem.”
Introduction & Motivation
• Pingmesh [1]
• Leaves gaps
• Overhead
• Out-of-band
Introduction & Motivation
• Roy et al [2]
• Requires modifications
to routers
• Requires additional
features on switches
Introduction & Motivation
• Everflow [3]
• Requires all traffic to be
captured
“In a network of ≥ 10^6 links it’s a reasonable
assumption that there is a non-zero chance that
a number (> 10) of these links are bad (due to
device, port, or cable, etc.)…However, currently
we do not have a direct way to correlate
customer impact with bad links.”
Introduction & Motivation
“007 records the path of TCP connections
(flows) suffering from one or more
retransmissions and assigns proportional
“blame” to each link on the path. It then
provides a ranking of links that represents their
relative drop rates.”
Introduction & Motivation
1. Does not require any changes to
network infrastructure
2. Does not require any changes to client
software
3. Detects in-band failures
4. Resilient to noise
5. Negligible overhead
Assumptions
DISCUSSION
1. L2 networks are not viable unless they:
   1. Support path discovery methods
   2. Support Everflow
2. No use of Source NATs (SNATs)
3. Assumes ECMP (L3) Clos network
design
4. Don’t try to reverse-engineer ECMP
Design Overview
• TCP monitoring agent: detects
retransmissions at each end-host.
• Path discovery agent: identifies
the flow’s path to the
destination IP (DIP)
• At the end-hosts, a voting scheme
is used based on the paths of
flows that had retransmissions. At
regular intervals of 30s the votes
are tallied by a centralized
analysis agent to find the top-
voted links.
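The 30-second tallying step can be sketched as follows. This is a minimal illustration, not 007's implementation: the `AnalysisAgent` class, its method names, and the link identifiers are assumptions made up for the example. Only the 30 s interval and the idea of a central tally come from the slide.

```python
from collections import Counter, defaultdict

EPOCH_SECONDS = 30  # tally interval stated on the slide


def epoch_of(timestamp: float) -> int:
    """Map a timestamp in seconds to the index of its 30 s epoch."""
    return int(timestamp // EPOCH_SECONDS)


class AnalysisAgent:
    """Hypothetical central collector that sums per-host votes per epoch."""

    def __init__(self) -> None:
        # epoch index -> Counter mapping link -> accumulated votes
        self.votes = defaultdict(Counter)

    def submit(self, timestamp: float, host_votes: dict) -> None:
        """Add one end-host's votes to the epoch the timestamp falls in."""
        self.votes[epoch_of(timestamp)].update(host_votes)

    def top_links(self, epoch: int, n: int = 1) -> list:
        """Return the n links with the most votes in the given epoch."""
        return [link for link, _ in self.votes[epoch].most_common(n)]


agent = AnalysisAgent()
agent.submit(3.0, {("tor1", "t1a"): 0.5, ("host1", "tor1"): 0.5})
agent.submit(12.0, {("tor1", "t1a"): 0.25})
agent.submit(31.0, {("host2", "tor2"): 1.0})  # lands in the next epoch
```

Keeping a separate tally per epoch means votes from an old failure do not leak into later rankings.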
Design Overview
• 6000 lines of C++ code
• 600KB memory usage
• 1-3% CPU Usage
• 200 KBs bandwidth
utilization
TCP Monitoring Agent
• TCP Monitoring agent
notifies Path Discovery
Agent immediately after any
retransmit
• Use of ‘Event Tracing for
Windows’ (ETW)
• Could use BPF in Linux
Path Discovery Agent
“The path discovery agent
uses traceroute packets to
find the path of flows that
suffer retransmissions. These
packets are used solely to
identify the path of a flow.
They do not need to be
dropped for 007 to operate”
Path Discovery Agent
“Once the TCP monitoring
agent notifies the path
discovery agent that a flow
has suffered a
retransmission, the path
discovery agent checks its
cache of discovered path for
that epoch…It then sends 15
appropriately crafted TCP
packets with TTL values
ranging from 1–15.”
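A rough sketch of the probe plan described in the quote is shown below. It only builds probe descriptions and sends nothing. The `Probe` and `PathCache` types, the cache key (epoch plus five-tuple), and the choice to skip probing on a cache hit are all assumptions for illustration; the quote elides what the real agent does after checking its cache.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# src IP, src port, dst IP, dst port, protocol
FiveTuple = Tuple[str, int, str, int, str]
MAX_TTL = 15  # the quote mentions TTL values 1-15


@dataclass(frozen=True)
class Probe:
    """Description of one traceroute-style packet (illustrative only)."""
    flow: FiveTuple
    ttl: int


def build_probes(flow: FiveTuple) -> List[Probe]:
    """One probe per TTL, all carrying the flow's own five-tuple."""
    return [Probe(flow, ttl) for ttl in range(1, MAX_TTL + 1)]


class PathCache:
    """Hypothetical per-epoch cache of already discovered paths."""

    def __init__(self) -> None:
        self._paths: Dict[Tuple[int, FiveTuple], List[str]] = {}

    def lookup(self, epoch: int, flow: FiveTuple) -> Optional[List[str]]:
        return self._paths.get((epoch, flow))

    def store(self, epoch: int, flow: FiveTuple, path: List[str]) -> None:
        self._paths[(epoch, flow)] = path


def on_retransmission(cache: PathCache, epoch: int, flow: FiveTuple) -> List[Probe]:
    """Return the probes to send; none if the path is cached (an assumption)."""
    if cache.lookup(epoch, flow) is not None:
        return []
    return build_probes(flow)


flow = ("10.0.0.5", 51000, "10.0.9.7", 443, "tcp")
cache = PathCache()
probes = on_retransmission(cache, epoch=0, flow=flow)
```

Reusing the flow's five-tuple in every probe is what lets the probes follow the same ECMP path as the original flow.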
Path Discovery Agent
ENGINEERING CHALLENGES – ECMP
• ECMP algorithms are
unknown
• All packets of a given flow,
defined by the five-tuple,
follow the same path
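The second bullet can be illustrated with a toy hash-based next-hop picker. This is not how any particular switch implements ECMP (the slide's point is that the real algorithms are unknown); the function name and the SHA-256 hash are stand-ins chosen for the example. It only shows why packets with an identical five-tuple land on the same next hop.

```python
import hashlib


def toy_ecmp_next_hop(five_tuple, next_hops):
    """Toy stand-in for ECMP: hash the five-tuple to choose one next hop.

    Real switches use vendor-specific hash functions; this only
    demonstrates that the choice is a deterministic function of the tuple.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


uplinks = ["t1-a", "t1-b", "t1-c", "t1-d"]
flow = ("10.0.0.5", 51000, "10.0.9.7", 443, "tcp")
first = toy_ecmp_next_hop(flow, uplinks)
second = toy_ecmp_next_hop(flow, uplinks)
```

Because the mapping cannot be predicted from outside the switch, probing with the flow's own five-tuple is a way to learn the path without reverse-engineering the hash.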
Path Discovery Agent
ENGINEERING CHALLENGES – RE-ROUTING & PACKET DROPS
• Traceroute itself may fail
• A lossy link may cause one
or more BGP sessions to
fail, triggering rerouting
Path Discovery Agent
ENGINEERING CHALLENGES – ROUTER ALIASING
• Have a pre-mapped
topology of:
• Switch/Router names
• Router/ Interface IP
addresses
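As an illustration of how a pre-mapped topology resolves router aliasing, the sketch below converts the interface IPs returned by traceroute into switch names and then into links. The table contents, the names, and the `hops_to_links` helper are hypothetical; a real deployment would load the mapping from its own topology database.

```python
# Hypothetical pre-mapped topology: interface IP -> switch name.
# Several interface IPs can belong to the same switch (aliases).
INTERFACE_TO_SWITCH = {
    "10.0.1.1": "tor-1",
    "10.0.1.2": "tor-1",
    "10.0.2.1": "t1-3",
    "10.0.3.1": "t2-7",
}


def hops_to_links(source_host, hop_ips):
    """Turn traceroute hop IPs into (node, node) links using the mapping."""
    nodes = [source_host]
    for ip in hop_ips:
        # Keep unmapped IPs visible instead of silently dropping them.
        name = INTERFACE_TO_SWITCH.get(ip, f"unknown({ip})")
        if name != nodes[-1]:  # skip repeated replies from the same switch
            nodes.append(name)
    return list(zip(nodes, nodes[1:]))


links = hops_to_links("host-a", ["10.0.1.2", "10.0.2.1", "10.0.3.1"])
```

With this mapping, two traceroutes that report different interface IPs of the same switch produce the same link identifiers, so their votes accumulate on one link.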
Analysis Agent
VOTING BASED SCHEME
• Good votes are 0
• Bad votes are 1/h, where h is
the number of hops on the
path
• Each link on the path is
given a vote
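The voting rule above can be sketched in a few lines. This is a minimal illustration that assumes a path is given as a list of link identifiers and that h equals the number of links in that list; the function names and link labels are made up for the example.

```python
from collections import Counter


def vote_for_path(path_links, tally):
    """Add a bad vote of 1/h to every link on a path that saw a retransmission."""
    h = len(path_links)
    if h == 0:
        return
    for link in path_links:
        tally[link] += 1.0 / h


def rank_links(tally):
    """Links ordered from most to fewest votes."""
    return [link for link, _ in sorted(tally.items(), key=lambda kv: kv[1], reverse=True)]


tally = Counter()
# Two flows with retransmissions; both cross link "B".
vote_for_path(["A", "B"], tally)
vote_for_path(["B", "C"], tally)
ranking = rank_links(tally)
```

Flows without retransmissions cast good votes of 0, so they add nothing, and a link that never appears on a bad path simply reads as 0 in the tally. Here link "B" collects 1/2 + 1/2 and ranks first.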
Analysis Agent
[Figure: worked voting example on a small topology (labels 1–4), with per-link tallies such as 0, 0 + 1/2, and 1/2 + 1/2 + 1/2]
Analysis Agent
VOTING BASED SCHEME
• Congestion & single drops
are akin to noise
• Single flow is unlikely to go
through more than one
failed link
• Probability of errors in
results diminishes
exponentially with the
number of flows
Simulations
PERFORMANCE
• Accuracy: Proportion of
correctly identified drop
causes
• Recall: How many of the
failures are detected (false
negatives)
• Precision: How trusted are
the results (false positives)
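The three metrics can be computed as in the sketch below. It assumes that detected and actual bad links are available as sets and that per-flow blame is a mapping from a flow ID to a link; the function names, the sample data, and the convention of returning 1.0 for empty inputs are assumptions for illustration.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected bad links against the true set."""
    detected, actual = set(detected), set(actual)
    true_pos = len(detected & actual)
    # Empty-input convention (an assumption): nothing to get wrong -> 1.0
    precision = true_pos / len(detected) if detected else 1.0
    recall = true_pos / len(actual) if actual else 1.0
    return precision, recall


def accuracy(blamed, true_cause):
    """Fraction of flows whose blamed link is the link that dropped the packet."""
    if not true_cause:
        return 1.0
    correct = sum(1 for flow, link in true_cause.items() if blamed.get(flow) == link)
    return correct / len(true_cause)


p, r = precision_recall({"l1", "l2", "l3"}, {"l1", "l2", "l4", "l5"})
acc = accuracy(
    {"f1": "l1", "f2": "l2", "f3": "l9"},
    {"f1": "l1", "f2": "l2", "f3": "l3", "f4": "l1"},
)
```

In this sample, one false positive ("l3") lowers precision, two missed links ("l4", "l5") lower recall, and two of four flows are blamed on the right link.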
Evaluation: Simulations
PERFORMANCE: OPTIMAL CASE
• 0.05–1% drop rate
• Accuracy is > 96%
• Recall/precision is
almost always 100%
https://github.com/behnazak/Vigil-007SourceCode
Evaluation: Simulations
PERFORMANCE: VARYING DROP RATES
• Maintains accuracy for
both single and multiple
failures
Evaluation: Simulations
PERFORMANCE: IMPACT OF NOISE
• Almost no impact
Evaluation: Simulations
PERFORMANCE: NUMBER OF CONNECTIONS
• Almost no impact
Evaluation: Simulations
PERFORMANCE: TRAFFIC SKEWS
• Can tolerate 50% skew
• When TOR traffic >50%
& >10 failures, accuracy
suffers
Evaluation: Simulations
PERFORMANCE: BAD LINKS
• 007 can detect up to 7
failures with accuracy >
90%
Evaluation: Simulations
PERFORMANCE: NETWORK SIZE
• Single failure:
• Accuracy >98% for up to
6 pods
• Multiple failures:
• Accuracy >98.01% for
30 failed links
Evaluations: Production
Evaluation: Production
• 007 located bad link
correctly in 281 cases of
VM reboot in Microsoft
DCN
• Identifies on average 0.45 ±
0.12 links as bad per epoch
• Of links dropping
packets:
• 48%: Server to TOR
• 24%: T1 – TOR
Discussion
• Congestion detection
• Ranking with bias
• Finding the cause of other
problems
• 007 can also be used for:
• Detection of switch
failures
Questions?
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops

  • 1. Papers We Love Sept. 2018 007: Democratically Finding The Cause of Packet Drops Michael Kehoe Staff Site Reliability Engineer NDSI - https://www.usenix.org/conference/nsdi18/presentation/arzani
  • 2. 007: Democratically Finding The Cause of Packet Drops Behnaz Arzani Selim Ciraci Luiz Chamon Yibo Zhu Hongqiang Liu Jitu Padhye Boon Thau Loo Geoff Outhred
  • 3. Today’s agenda 1 Introduction & Motivation 2 TCP Monitoring Agent 3 Path Discovery Agent 4 Analysis Agent 5 Evaluations: Simulations 6 Evaluations: Production 7 Discussion
  • 5. Introduction & Motivation “Even a small network outage or a few lossy links can cause the VM to “panic” and reboot. In fact, 17% of our VM reboots are due to network issues and in over 70% of these none of our monitoring tools were able to find the links that caused the problem.”
  • 6. Introduction & Motivation • Pingmesh [1] • Leaves gaps • Overhead • Out-of-band
  • 7. Introduction & Motivation • Roy et al [2] • Requires modifications to routers • Requires additional features on switches
  • 8. Introduction & Motivation • Everflow [3] • Requires all traffic to be captured
  • 9. “In a network of ≥ 106 links it’s a reasonable assumption that there is a non-zero chance that a number (> 10) of these links are bad (due to device, port, or cable, etc.)…However, currently we do not have a direct way to correlate customer impact with bad links". Introduction & Motivation
  • 10. “007 records the path of TCP connections (flows) suffering from one or more retransmissions and assigns proportional “blame” to each link on the path. It then provides a ranking of links that represents their relative drop rates.” Introduction & Motivation
  • 11. Introduction & Motivation 1. Does not require any changes to network infrastructure 2. Does not require any changes to client software 3. Detects in-band failures 4. Resilient to noise 5. Negligible overhead
  • 12. Assumptions DISCUSSION 1. L2 networks are not viable unless they: a. Support path discovery methods, or b. Support EverFlow 2. No use of Source NATs (SNATs) 3. Assumes ECMP (L3) Clos network design 4. Don’t try to reverse-engineer ECMP
  • 15. Design Overview • TCP monitoring agent: detects retransmissions at each end-host. • Path discovery agent: identifies the flow’s path to the Destination IP (DIP). • At the end-hosts, a voting scheme is used based on the paths of flows that had retransmissions. At regular intervals of 30s the votes are tallied by a centralized analysis agent to find the top-voted links.
  • 16. Design Overview • 6000 lines of C++ code • 600 KB memory usage • 1-3% CPU usage • 200 KB/s bandwidth utilization
  • 18. TCP Monitoring Agent • TCP Monitoring agent notifies Path Discovery Agent immediately after any retransmit • Use of ‘Event Tracing for Windows’ (ETW) • Could use BPF in Linux
  • 20. Path Discovery Agent “The path discovery agent uses traceroute packets to find the path of flows that suffer retransmissions. These packets are used solely to identify the path of a flow. They do not need to be dropped for 007 to operate”
  • 21. Path Discovery Agent “Once the TCP monitoring agent notifies the path discovery agent that a flow has suffered a retransmission, the path discovery agent checks its cache of discovered path for that epoch…It then sends 15 appropriately crafted TCP packets with TTL values ranging from 1–15.”
  • 22. Path Discovery Agent ENGINEERING CHALLENGES – ECMP • ECMP algorithms are unknown • All packets of a given flow, defined by the five-tuple, follow the same path
  • 23. Path Discovery Agent ENGINEERING CHALLENGES – RE-ROUTING & PACKET DROPS • Traceroute itself may fail • A lossy link may cause one or more BGP sessions to fail, triggering rerouting
  • 24. Path Discovery Agent ENGINEERING CHALLENGES – ROUTER ALIASING • Have a pre-mapped topology of: • Switch/Router names • Router/ Interface IP addresses
  • 26. Analysis Agent VOTING BASED SCHEME • Good votes are 0 • Bad votes are 1/h, where h is the number of hops on the path • Each link on the path is given a vote
  • 27. Analysis Agent [Figure: worked example of per-link vote tallies]
  • 28. Analysis Agent VOTING BASED SCHEME • Congestion & single drops are akin to noise • Single flow is unlikely to go through more than one failed link • Probability of errors in results diminishes exponentially with the number of flows
  • 30. Simulations PERFORMANCE • Accuracy: Proportion of correctly identified drop causes • Recall: How many of the failures are detected (false negatives) • Precision: How trusted are the results (false positives)
  • 31. Evaluation: Simulations PERFORMANCE: OPTIMAL CASE • 0.05–1% drop rate • Accuracy is > 96% • Recall/Precision is almost always 100% https://github.com/behnazak/Vigil-007SourceCode
  • 32. Evaluation: Simulations PERFORMANCE: VARYING DROP RATES • Maintains accuracy for both single and multiple failures
  • 33. Evaluation: Simulations PERFORMANCE: IMPACT OF NOISE • Almost no impact
  • 34. Evaluation: Simulations PERFORMANCE: NUMBER OF CONNECTIONS • Almost no impact
  • 35. Evaluation: Simulations PERFORMANCE: TRAFFIC SKEWS • Can tolerate 50% skew • When TOR traffic >50% & >10 failures, accuracy suffers
  • 36. Evaluation: Simulations PERFORMANCE: BAD LINKS • 007 can detect up to 7 failures with accuracy > 90%
  • 37. Evaluation: Simulations PERFORMANCE: NETWORK SIZE • Single failure: • Accuracy >98% for up to 6 pods • Multiple failures: • Accuracy >98.01% for 30 failed links
  • 39. Evaluation: Production • 007 located the bad link correctly in 281 cases of VM reboot in the Microsoft DCN • Identifies an average of 0.45 ± 0.12 links as bad per epoch • Of links dropping packets: • 48%: Server to TOR • 24%: T1 – TOR
  • 41. Discussion • Congestion detection • Ranking with bias • Finding the cause of other problems • 007 can also be used for: • Detection of switch failures
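
The voting scheme from slides 26 to 28 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C++ implementation: the function name `tally_votes`, the link labels, and the sample paths are all hypothetical. Each flow that saw a retransmission gives a vote of 1/h to each of the h links on its path, and flows without retransmissions contribute nothing.

```python
from collections import defaultdict
from fractions import Fraction


def tally_votes(bad_flow_paths):
    """Sum 1/h votes per link over the paths of flows that saw retransmissions.

    bad_flow_paths: iterable of paths, each a list of link identifiers.
    Returns (link, votes) pairs ranked from most to least voted.
    """
    votes = defaultdict(Fraction)  # Fraction() starts at 0
    for path in bad_flow_paths:
        if not path:
            continue  # no discovered path, so nothing to vote on
        share = Fraction(1, len(path))  # h = number of links on the path
        for link in path:
            votes[link] += share
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical epoch: three flows with retransmissions, all crossing "tor1-t1a".
paths = [
    ["host1-tor1", "tor1-t1a"],
    ["host2-tor1", "tor1-t1a"],
    ["tor1-t1a", "t1a-tor2"],
]
ranking = tally_votes(paths)
print(ranking[0])  # the shared link collects the most votes
```

Exact fractions are used here only to keep the sums easy to check by hand; floating-point votes would work equally well for ranking.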

Editor's Notes

  1. List citations (count) Particular background on authors
  2. So from the beginning of the paper, they very precisely state the problem they are trying to solve: find the link that dropped the packet, and do so with negligible overhead and no changes to the network infrastructure. “Even a small network outage or a few lossy links can cause the VM to “panic” and reboot. In fact, 17% of our VM reboots are due to network issues and in over 70% of these none of our monitoring tools were able to find the links that caused the problem.”
  3. This problem of course isn’t new at all. There has been a reasonable amount of research into link & forwarding failure detection. The paper highlights some very similar research at the beginning, and section 10 has a deep-dive on related work and how this solution meets the stated goals. Firstly they mention Pingmesh, which is a Microsoft project. Pingmesh does have gaps: it doesn’t guarantee that it will cover all links. There is a reasonable overhead on both CPU & network. It also does out-of-band link testing, so you don’t necessarily create the same conditions the application sees.
  4. Roy et al.: monitors all paths, requires modifications to routers, and needs special features in the switch.
  5. Everflow: requires all traffic to be captured, so it is not scalable.
  6. They also dive a little deeper into the motivation and the reason why you need reliable faulty-link detection. If you have a network of 10^6 (1 million) or more links, there is a non-zero chance that more than 10 of these links are bad for various reasons. Generally it’s hard to determine customer impact and then prioritize the remediation of these links.
  7. So how does 007 do this? “007 records the path of TCP connections (flows) suffering from one or more retransmissions and assigns proportional “blame” to each link on the path. It then provides a ranking of links that represents their relative drop rates.”
  8. Some direct benefits are: it does not require any changes to network infrastructure; it does not require any changes to client software; it detects in-band failures; it is resilient to noise (it can tell congestion apart from a faulty link); and it has negligible overhead.
  9. So the assumptions part of the paper is actually at the back, but I think these are worth mentioning now. L2 networks are not supported unless they support path discovery methods or Everflow. It’s assumed that there is no source NAT. 007 assumes a Clos topology (which is an L3 network). It does not try to reverse-engineer ECMP.
  10. So let’s look at the design of 007.
  11. The TCP Monitoring Agent detects retransmissions on each end-host. The Path Discovery Agent identifies the flow’s path to the destination IP. Finally, an off-site analysis agent gathers votes every 30 seconds. If (a) we only track the path of those flows that have retransmissions, (b) assign each link on the path of such a flow a vote of 1/h, where h is the path length, and (c) sum up the votes during a given period, then the top-voted links are almost always the ones dropping packets.
  12. If you recall, one of the key objectives of the project is to have low resource utilization. As you can see, this is all pretty lightweight.
  13. So first is the TCP Monitoring Agent. This code simply notifies the path discovery agent that there is a TCP retransmission, and passes on the flow’s five-tuple. In Microsoft’s case, they use Event Tracing for Windows (ETW) to handle this process. In Linux, it’s possible to use BPF to do this, and you can find similar sample code from Brendan Gregg.
  14. So the Path Discovery Agent gets the five-tuple and then runs a traceroute between the destination end-host and the source. These packets are ONLY used to identify the path of the flow; they do not need to be dropped to be of use.
  15. There is a cache of discovered paths that helps lower the number of traceroutes we need to do. If the cache doesn’t have a hit, 15 TCP packets are sent, with TTLs increasing from 1 to 15, to discover the path.
  16. The paper assumes that it’s possible to translate link IPs to router and switch names and understand the network topology.
  17. 007 has been designed for a specific use case, namely finding the cause of packet drops on individual connections in order to provide application context. This resulted in a number of design choices. Congestion usually comes in the form of very small packet loss: in this paper’s case, 92% of congestion-related loss rates fell between 10^-8 and 10^-5. These are treated as noise. The ranking approach will become biased towards heavily trafficked links. This is generally OK, as these links are the most impacted by faults. 007 can also be used to find issues with switches, not just links.
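
The path discovery behaviour described in notes 14 and 15 (check a per-epoch cache, otherwise send probes with TTLs 1 to 15) can be sketched as follows. This is a rough illustration, not the paper's code: the class `PathCache`, the injected `probe` callable, and the addresses are all hypothetical, and no real packets are sent. A real agent would craft TCP probes matching the flow's five-tuple and read the router replies.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (source IP, source port, destination IP, destination port, protocol)
FiveTuple = Tuple[str, int, str, int, str]
MAX_TTL = 15  # probes are sent with TTL values 1..15


class PathCache:
    """Per-epoch cache of discovered paths, keyed by five-tuple (hypothetical)."""

    def __init__(self, probe: Callable[[FiveTuple, int], Optional[str]]):
        self._probe = probe  # returns the replying hop's IP, or None
        self._paths: Dict[FiveTuple, List[str]] = {}
        self.probes_sent = 0

    def path_for(self, flow: FiveTuple) -> List[str]:
        if flow not in self._paths:  # cache miss: probe every TTL once
            replies = []
            for ttl in range(1, MAX_TTL + 1):
                self.probes_sent += 1
                replies.append(self._probe(flow, ttl))
            # Keep only the hops that answered.
            self._paths[flow] = [hop for hop in replies if hop is not None]
        return self._paths[flow]

    def new_epoch(self) -> None:
        self._paths.clear()  # paths are only trusted within one epoch


def fake_probe(flow: FiveTuple, ttl: int) -> Optional[str]:
    """Stand-in for a real traceroute probe: a fixed three-hop path."""
    hops = ["10.0.0.1", "10.0.1.1", "10.0.2.1"]
    return hops[ttl - 1] if ttl <= len(hops) else None


cache = PathCache(fake_probe)
flow = ("10.1.1.5", 50000, "10.2.2.7", 443, "tcp")
first = cache.path_for(flow)   # triggers 15 probes
second = cache.path_for(flow)  # served from the cache, no new probes
```

Injecting the probe function keeps the caching logic testable without network access; swapping in a real prober would not change the cache code.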