SlideShare a Scribd company logo
1 of 130
Download to read offline
All of Your Network Monitoring
is (probably) Wrong
joe damato
packagecloud.io
greetings
i’m joe
i like computers
i once had a blog
called timetobleed.com
@joedamato
packagecloud.io
@packagecloudio
follow along
blog.packagecloud.io
cognitive load
too much stuff
cognitive load
copy & paste configs
BTW
This is actually part of another talk I’m working on called
Programmers should
get paid more & work
less
anw
cognitive load
ever copy/paste !
conf or tuning settings
you didn’t understand?
(probably)
there’s too much
damn code
similarly...
cognitive load
do you really understand every
single graph you are generating?
(probably not)
there’s too much
damn code
If there’s too much damn code to
configure and tune
what makes you think
you can actually
monitor it?
spoiler: you can’t
(prob. doesn't matter,
more on this later)
claim: the more complex the
system is, the harder it is to monitor
NOTE
complexity != bad
from: https://www.flickr.com/photos/49128298@N04/24290342544/
NOTE
• you want to:
• download and play the cat game on your an Phone (cause
small portable electronic devices)
• while messaging your buds (cause ur lonely)
• with in app purchases (cause you need more gold fish)
• over a TLS encrypted connection (cause payments)
• over a VPN (cause china)
• while flying (cause boredom)
and that’s OK
it doesn’t mean complexity is bad
2 complicated things that aren’t
necessarily bad
from: https://flic.kr/p/aWXpWZ
from: https://flic.kr/p/56XWHr
so, like, you know one thing that’s
p. complicated?
the Linux networking stack
all kiiiiiiiiiiiiiiiiiiiiiiiinda features
• different NICs have different rx and tx queue size limits and defaults
• ethernet bonding
• IRQ modulation, ntuple filtering, ….
• RSS, RPS, RFS, aRFS
• GRO, GSO, hw accelerated VLAN IDs, timestamping, ….
• you are probably using at least 2 protocol stacks (IP and TCP/UDP)
• all kiiiiiiiiiiiiiiinda tuning levers and knobs for everything from top to bottom
all kiiiiiinda bugs
and
basically
literally
actually
really
no docs
https://www.redhat.com/archives/rhl-list/2007-September/msg03735.html
> How can I find out the /proc/net info
>
> eg: softnet_stat is for what purpose
!
Much of this is only well-documented in the
code. Here's an attempt
at interpreting softnet_stat [no guarantee
that it is correct; read the code!]:
total
# of packets (not including netpoll)
received by the interrupt handler.
There might be some double
counting going on [ … ]
I think the intention was that these
were originally on separate receive paths
Full networking writeup
literally 90 pages
literally everything about linux networking
literally available here:
http://bit.ly/linux-networking
it’s fine, as long as we are
honest that it’s just reality
[random os] has a better/faster/
leaner/whatever networking stack
than linux
anw
complex
• not necessarily inefficient
• not necessarily bad
• people expect a lot of complicated features
• so theres a lot of code needed to support all
this random stuff you want to do
• see also: cat game example
complex
• the bad news…..
• you are supposed to monitor this complicated code
• and then you are supposed to look at some graphs
• and then you are supposed to Know The Answer™
sounds difficult… but it gets better ;)
what if i told you…
a driver bug caused stats
to output incorrectly in
/proc/net/dev?
igb
• driver stats updated via a timer every 2 secs
• reading stats via /proc/net/dev produced stale stats
• but not via ethtool (different code path)
• fixed by forcing stat update whenever stats are read
• i saw this in production —— did you????
only matters if you are
monitoring your network stats
more often than every 2 sec.
maybe you aren’t
because you dont care
(that’s fine and you are prob. right)
but if you do care…
your future status
• you’d need to:
• notice the problem in your graph
• start reading your stats collecting code/plugin
• realize the bug is not there
• read your driver code
• realize the bug is in the code path that /proc/net/dev hits
• write a patch to fix it
• rebuild the driver and deploy it everywhere
that’s a lot of work
just to monitor bytes tx/rx
greetings!
things that dont exist
• descending order of probability it doesn't exist:
• free open source
• the an singularity
• calorie free chocolate covered bacon
• etc
but joe, my devops are the literal
strongest and they doesn't afraid of
anything
i’ll just use the ethtool
ethtool
• a command line tool
• uses ioctl system call to talk to network drivers
• not all drivers actually implement the interface
• and the ones which do, generally, do it differently
what if i told you…
ethtool
• no standardized way of outputting driver stats
• some drivers don’t even implement the interface
• the ones which do use diff field names
¡dale, comparemos!
• ec2 vif driver
• ixgbe driver
• igb driver
ec2 vif driver
$ sudo ethtool -S eth0
NIC statistics:
rx_gso_checksum_fixup: 0
ethtool outputs 1 statistic
what even is
rx_gso_checksum_fixup?
joe’s ixgbe driver on an Real Computer
ethtool outputs 377 statistics
NIC statistics:
rx_packets: 9665600259
tx_packets: 12198470686
rx_bytes: 6790400019470
tx_bytes: 2169046666156
rx_pkts_nic: 11107310349
tx_pkts_nic: 12198470686
rx_bytes_nic: 6929982126806
tx_bytes_nic: 2217848697965
lsc_int: 1
tx_busy: 0
non_eop_descs: 1044042523
rx_errors: 0
tx_errors: 0
rx_dropped: 1
tx_dropped: 0
multicast: 7876979
broadcast: 2633
rx_no_buffer_count: 0
collisions: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
hw_rsc_aggregated: 6600573569
hw_rsc_flushed: 5158863479
fdir_match: 175127
fdir_miss: 11098854004
fdir_overflow: 1
rx_fifo_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_timeout_count: 0
tx_restart_queue: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
tx_flow_control_xon: 0
rx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_flow_control_xoff: 0
rx_csum_offload_errors: 0
alloc_rx_page_failed: 0
alloc_rx_buff_failed: 0
rx_no_dma_resources: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
fcoe_bad_fccrc: 0
rx_fcoe_dropped: 0
rx_fcoe_packets: 0
rx_fcoe_dwords: 0
fcoe_noddp: 0
fcoe_noddp_ext_buff: 0
tx_fcoe_packets: 0
tx_fcoe_dwords: 0
tx_queue_0_packets: 650250933
tx_queue_0_bytes: 109734794973
tx_queue_1_packets: 734133738
tx_queue_1_bytes: 123318917069
tx_queue_2_packets: 772808083
tx_queue_2_bytes: 131183014063
tx_queue_3_packets: 741428236
tx_queue_3_bytes: 125821603228
tx_queue_4_packets: 692281561
tx_queue_4_bytes: 118278086880
tx_queue_5_packets: 783438226
tx_queue_5_bytes: 133234307795
tx_queue_6_packets: 719335931
tx_queue_6_bytes: 123662184314
tx_queue_7_packets: 668577198
tx_queue_7_bytes: 114915688397
tx_queue_8_packets: 711699909
tx_queue_8_bytes: 122460443627
tx_queue_9_packets: 681741781
tx_queue_9_bytes: 118032356999
tx_queue_10_packets: 585639061
tx_queue_10_bytes: 98009207733
tx_queue_11_packets: 640487443
tx_queue_11_bytes: 107781535416
tx_queue_12_packets: 706304786
tx_queue_12_bytes: 118963058912
tx_queue_13_packets: 716825472
tx_queue_13_bytes: 121032769231
tx_queue_14_packets: 699280537
tx_queue_14_bytes: 118119557225
tx_queue_15_packets: 675274048
tx_queue_15_bytes: 114916452394
tx_queue_16_packets: 123509474
tx_queue_16_bytes: 25473914817
tx_queue_17_packets: 101309066
tx_queue_17_bytes: 23513562050
tx_queue_18_packets: 92291301
tx_queue_18_bytes: 21830243983
tx_queue_19_packets: 87287348
tx_queue_19_bytes: 20887753665
tx_queue_20_packets: 34518707
tx_queue_20_bytes: 9837323388
tx_queue_21_packets: 24009284
tx_queue_21_bytes: 6760172375
tx_queue_22_packets: 23628875
tx_queue_22_bytes: 6707751077
tx_queue_23_packets: 25969617
tx_queue_23_bytes: 7343742932
tx_queue_24_packets: 30112206
tx_queue_24_bytes: 8614816667
tx_queue_25_packets: 28812367
tx_queue_25_bytes: 8186825345
tx_queue_26_packets: 31710307
tx_queue_26_bytes: 9139202059
tx_queue_27_packets: 40835241
tx_queue_27_bytes: 11499713701
tx_queue_28_packets: 39265877
tx_queue_28_bytes: 11045989548
tx_queue_29_packets: 41775414
tx_queue_29_bytes: 11804871879
tx_queue_30_packets: 12497615
tx_queue_30_bytes: 3405490173
tx_queue_31_packets: 11021513
tx_queue_31_bytes: 2659215149
tx_queue_32_packets: 10464342
tx_queue_32_bytes: 2632864135
tx_queue_33_packets: 11341007
tx_queue_33_bytes: 2818638887
tx_queue_34_packets: 12782059
tx_queue_34_bytes: 3307226594
tx_queue_35_packets: 12795212
tx_queue_35_bytes: 3400547658
tx_queue_36_packets: 59272452
tx_queue_36_bytes: 17286517363
tx_queue_37_packets: 85631445
tx_queue_37_bytes: 25126772743
tx_queue_38_packets: 84708817
tx_queue_38_bytes: 24920451495
tx_queue_39_packets: 83763431
tx_queue_39_bytes: 24662854523
tx_queue_40_packets: 0
tx_queue_40_bytes: 0
tx_queue_41_packets: 0
tx_queue_41_bytes: 0
tx_queue_42_packets: 0
tx_queue_42_bytes: 0
tx_queue_43_packets: 0
tx_queue_43_bytes: 0
tx_queue_44_packets: 0
tx_queue_44_bytes: 0
tx_queue_45_packets: 0
tx_queue_45_bytes: 0
tx_queue_46_packets: 0
tx_queue_46_bytes: 0
tx_queue_47_packets: 0
tx_queue_47_bytes: 0
tx_queue_48_packets: 0
tx_queue_48_bytes: 0
tx_queue_49_packets: 0
tx_queue_49_bytes: 0
tx_queue_50_packets: 0
tx_queue_50_bytes: 0
tx_queue_51_packets: 0
tx_queue_51_bytes: 0
tx_queue_52_packets: 0
tx_queue_52_bytes: 0
tx_queue_53_packets: 0
tx_queue_53_bytes: 0
tx_queue_54_packets: 0
tx_queue_54_bytes: 0
tx_queue_55_packets: 0
tx_queue_55_bytes: 0
tx_queue_56_packets: 0
tx_queue_56_bytes: 0
tx_queue_57_packets: 0
tx_queue_57_bytes: 0
tx_queue_58_packets: 0
tx_queue_58_bytes: 0
tx_queue_59_packets: 0
tx_queue_59_bytes: 0
tx_queue_60_packets: 0
tx_queue_60_bytes: 0
tx_queue_61_packets: 0
tx_queue_61_bytes: 0
tx_queue_62_packets: 0
tx_queue_62_bytes: 0
tx_queue_63_packets: 0
tx_queue_63_bytes: 0
tx_queue_64_packets: 0
tx_queue_64_bytes: 0
tx_queue_65_packets: 0
tx_queue_65_bytes: 0
tx_queue_66_packets: 0
tx_queue_66_bytes: 0
tx_queue_67_packets: 0
tx_queue_67_bytes: 0
tx_queue_68_packets: 0
tx_queue_68_bytes: 0
tx_queue_69_packets: 0
tx_queue_69_bytes: 0
tx_queue_70_packets: 0
tx_queue_70_bytes: 0
tx_queue_71_packets: 0
tx_queue_71_bytes: 0
rx_queue_0_packets: 67
rx_queue_0_bytes: 4688
rx_queue_1_packets: 75
rx_queue_1_bytes: 5523
rx_queue_2_packets: 79
rx_queue_2_bytes: 5987
rx_queue_3_packets: 75
rx_queue_3_bytes: 5631
rx_queue_4_packets: 71
rx_queue_4_bytes: 5273
rx_queue_5_packets: 81
of those 377….
none of them are:
rx_gso_checksum_fixup
joe’s igb driver on an Real Computer
ethtool outputs 112 statistics
similarly, of those 112….
none of them are:
rx_gso_checksum_fixup
surely 2 intel drivers
will have similar stats
ixgbe diff igb =>
316 diff stats
and it gets better!
some measured in driver
some measured in hw
monitor all the things!!11!!1!
non_eop_descs ???
os2bmc_rx_by_bmc ???
rx_no_dma_resources ???
this is fine
i’ll read the driver source
i’m really good at the kernels
this is fine
case ixgbe_mac_82599EB:
for (i = 0; i < 16; i++)
adapter->hw_rx_no_dma_resources +=
IXGBE_READ_REG(hw, IXGBE_QPRDC(i));
IXGBE_QPRDC
uh, wat?
similar?
register read
• this driver, like many others, gets some stats from the NIC
• it does this by reading register values
documentation?
so we should be able to find this in
the NIC data sheet………. right?
page 689
success
• so just repeat this process:
• get the driver code
• read it for every stat
• figure out if stat is in software or hardware
• if its in software read the driver and figure out what it means
• if its in hardware find the data sheet and figure out what it means
• then graph it
• and then figure out what the graph means
greetings
what if i told you…
some of these stats aren’t
documented in the data
sheet?
so, like, theres nothing you
can do except literally
guess.
you could email the device
manufacturer….
no one cares, joe
• no one cares about NIC level stats
• too low level
• /proc/net/dev works on my computer for tx/rx
• and it has high level summaries
• errors! drops! fifo! frame! compressed!
but what do
errors! drops! fifo! frame! compressed!
mean?
/proc/net/dev
OK, so
• all these fields come from the driver
• some are from software, others from the NIC
• some fields are sums of the other fields
• this reduces your data sheet search space
• just search for the fields you care about
but what if…
the drivers don’t agree with
each other on what the
individual statistics represent?
in other words..
what if:
driver_stats->rx_missed_errors
means something different for
each driver you ask?
greetings
meaning of driver stats are
not standardized
BTW
stat meanings for a
driver/device can change
over time.
so:
• you need to figure out which NICs are in prod for all boxes
• which firmware versions used on each NIC
• which versions of drivers used for each NIC
• read the all driver sources for the fields you care about
• read the data sheet to figure out what the fields mean
• build An Collectd plugin (or w/e) to encapsulate this
knowledge
maybe you don’t care
too low level
you care about protocol level
stats
odd b/c ethtool settings
can eliminate protocol
stack problems
you dont care?
but, w/e
let’s just read protocol
stats from /proc/net/snmp!
/proc/net/snmp
• there’s an RFC !!!!!!!! (rfc 2013)
• the fields are standardized!!!!
• it’s higher level, so i can figure out where the
protocol layers are breaking down!!!
• they are gathered mostly in software
• much easier than reading a 1200 pg data sheet
what if i told you…
BUGS
• several cases where counters are incremented in
the wrong place
• several cases where counters double count
BUGS
several cases where counters aren’t
incremented where you might think they
should be
BUGS
/*!
* ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.!
* Reporting ENOBUFS might not be good!
* (it's not tunable per se), but otherwise!
* we don't have a good statistic (IpOutDiscards but it can be too many!
* things). We could add another new stat but at least for now that!
* seems like overkill.!
*/!
from linux 3.13.0 net/core/udp.c:
If this is an important statistic for you,
your monitoring might be wrong
so, what does this mean?
so, what does this mean?
• monitoring something requires very
deep understanding
• otherwise your graphs, alerts, etc
might not actually be measuring what
you think they are measuring
so, what does it mean?
• this is why people build entire businesses around
monitoring networks (or other stuff).
• resist the urge to think you can solve every problem
with a “quick” bash script
so, what does it mean?
• properly monitoring, setting alerts, etc requires
significant investment
• i.e. not a bash script over the weekend
• again, not necessarily bad, just important to think
about
• nothing is free
• this doesn’t mean that the software is bad
• so plz don't jump to that conclusion
• this is just reality
so, what does it mean?
and now an aside
time, money, and business
• engineers always think they can solve everything by
writing enough software
• the problem is that: sometimes spending your time
doing that makes no business sense.
• other times doing that is actively detrimental
time, money, and business
• how long would it take you to:
• figure out if all your networking metrics are right
• figure out what they all mean
• set alerts that are sensible
• remember: you need to read a lot of code, data
sheets, and potentially several versions of different
drivers.
time, money, and business
https://baremetrics.com/calculator
(add at least 35% overhead to salaries)
time, money, and business
and this is why monitoring all the things
makes no business sense
for most businesses below a certain revenue
level
time, money, and business
and this is also why it doesn’t really
matter if these stats are wrong
if these stats actually matter to
your business, your business will
invest the $$$$ to figure this out
in conclusion:
• complexity is not necessarily bad
• even simple software is buggy and hard to monitor
correctly
• it all comes down to value and time
• your network monitoring is probably wrong
• but it probably doesn't matter because if it did, your
company would invest $$$$ in figuring it out
?packagecloud.io
@packagecloudio

More Related Content

What's hot

Linux Performance Tools 2014
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014Brendan Gregg
 
Kernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uringKernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uringAnne Nicolas
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
Spying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitSpying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitAndrea Righi
 
Power of linked list
Power of linked listPower of linked list
Power of linked listPeter Hlavaty
 
DTrace Topics: Introduction
DTrace Topics: IntroductionDTrace Topics: Introduction
DTrace Topics: IntroductionBrendan Gregg
 
Stateless Hypervisors at Scale
Stateless Hypervisors at ScaleStateless Hypervisors at Scale
Stateless Hypervisors at ScaleAntony Messerl
 
Troubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesTroubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesMichael Klishin
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernellcplcp1
 
Testing Wi-Fi with OSS Tools
Testing Wi-Fi with OSS ToolsTesting Wi-Fi with OSS Tools
Testing Wi-Fi with OSS ToolsAll Things Open
 
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel" You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel" Peter Hlavaty
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems PerformanceBrendan Gregg
 
Run Run Trema Test
Run Run Trema TestRun Run Trema Test
Run Run Trema TestHiroshi Ota
 
Rapid Application Design in Financial Services
Rapid Application Design in Financial ServicesRapid Application Design in Financial Services
Rapid Application Design in Financial ServicesAerospike
 
LISA2010 visualizations
LISA2010 visualizationsLISA2010 visualizations
LISA2010 visualizationsBrendan Gregg
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudAndrea Righi
 
Linux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - WonokaerunLinux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - Wonokaerunidsecconf
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsKernel TLV
 

What's hot (20)

Linux Performance Tools 2014
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014
 
Kernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uringKernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uring
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Spying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitSpying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profit
 
Racing with Droids
Racing with DroidsRacing with Droids
Racing with Droids
 
Power of linked list
Power of linked listPower of linked list
Power of linked list
 
DTrace Topics: Introduction
DTrace Topics: IntroductionDTrace Topics: Introduction
DTrace Topics: Introduction
 
Stateless Hypervisors at Scale
Stateless Hypervisors at ScaleStateless Hypervisors at Scale
Stateless Hypervisors at Scale
 
Troubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesTroubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issues
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernel
 
Testing Wi-Fi with OSS Tools
Testing Wi-Fi with OSS ToolsTesting Wi-Fi with OSS Tools
Testing Wi-Fi with OSS Tools
 
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel" You didnt see it’s coming? "Dawn of hardened Windows Kernel"
You didnt see it’s coming? "Dawn of hardened Windows Kernel"
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems Performance
 
Run Run Trema Test
Run Run Trema TestRun Run Trema Test
Run Run Trema Test
 
Rapid Application Design in Financial Services
Rapid Application Design in Financial ServicesRapid Application Design in Financial Services
Rapid Application Design in Financial Services
 
LISA2010 visualizations
LISA2010 visualizationsLISA2010 visualizations
LISA2010 visualizations
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloud
 
Linux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - WonokaerunLinux kernel-rootkit-dev - Wonokaerun
Linux kernel-rootkit-dev - Wonokaerun
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance Tools
 
Back to the CORE
Back to the COREBack to the CORE
Back to the CORE
 

Similar to Your Network Monitoring is (probably) Wrong

Kernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyKernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyAnne Nicolas
 
Malware Analysis For The Enterprise
Malware Analysis For The EnterpriseMalware Analysis For The Enterprise
Malware Analysis For The EnterpriseJason Ross
 
Analyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodAnalyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodBrendan Gregg
 
Cfgmgmt Challenges aren't technical anymore
Cfgmgmt Challenges aren't technical anymoreCfgmgmt Challenges aren't technical anymore
Cfgmgmt Challenges aren't technical anymoreJulien Pivotto
 
Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015lpgauth
 
BSides London 2015 - Proprietary network protocols - risky business on the wire.
BSides London 2015 - Proprietary network protocols - risky business on the wire.BSides London 2015 - Proprietary network protocols - risky business on the wire.
BSides London 2015 - Proprietary network protocols - risky business on the wire.Jakub Kałużny
 
Tips from Support: Always Carry a Towel and Don’t Panic!
Tips from Support: Always Carry a Towel and Don’t Panic!Tips from Support: Always Carry a Towel and Don’t Panic!
Tips from Support: Always Carry a Towel and Don’t Panic!Perforce
 
Final ProjectFinal Project Details Description Given a spec.docx
Final ProjectFinal Project Details Description  Given a spec.docxFinal ProjectFinal Project Details Description  Given a spec.docx
Final ProjectFinal Project Details Description Given a spec.docxAKHIL969626
 
CONFidence 2014: Jakub Kałużny: Shameful secrets of proprietary protocols
CONFidence 2014: Jakub Kałużny: Shameful secrets of proprietary protocolsCONFidence 2014: Jakub Kałużny: Shameful secrets of proprietary protocols
CONFidence 2014: Jakub Kałużny: Shameful secrets of proprietary protocolsPROIDEA
 
Shameful Secrets of Proprietary Network Protocols - OWASP AppSec EU 2014
Shameful Secrets of Proprietary Network Protocols - OWASP AppSec EU 2014Shameful Secrets of Proprietary Network Protocols - OWASP AppSec EU 2014
Shameful Secrets of Proprietary Network Protocols - OWASP AppSec EU 2014Jakub Kałużny
 
The internet of $h1t
The internet of $h1tThe internet of $h1t
The internet of $h1tAmit Serper
 
When DevOps and Networking Intersect by Brent Salisbury of socketplane.io
When DevOps and Networking Intersect by Brent Salisbury of socketplane.ioWhen DevOps and Networking Intersect by Brent Salisbury of socketplane.io
When DevOps and Networking Intersect by Brent Salisbury of socketplane.ioDevOps4Networks
 
Troubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversTroubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversSatpal Parmar
 
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1Jagadisha Maiya
 

Similar to Your Network Monitoring is (probably) Wrong (20)

Kernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are moneyKernel Recipes 2019 - Metrics are money
Kernel Recipes 2019 - Metrics are money
 
Malware Analysis For The Enterprise
Malware Analysis For The EnterpriseMalware Analysis For The Enterprise
Malware Analysis For The Enterprise
 
Audit
AuditAudit
Audit
 
Analyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodAnalyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE Method
 
Cfgmgmt Challenges aren't technical anymore
Cfgmgmt Challenges aren't technical anymoreCfgmgmt Challenges aren't technical anymore
Cfgmgmt Challenges aren't technical anymore
 
Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015
 
BSides London 2015 - Proprietary network protocols - risky business on the wire.
BSides London 2015 - Proprietary network protocols - risky business on the wire.BSides London 2015 - Proprietary network protocols - risky business on the wire.
BSides London 2015 - Proprietary network protocols - risky business on the wire.
 
Tips from Support: Always Carry a Towel and Don’t Panic!
Tips from Support: Always Carry a Towel and Don’t Panic!Tips from Support: Always Carry a Towel and Don’t Panic!
Tips from Support: Always Carry a Towel and Don’t Panic!
 
Final ProjectFinal Project Details Description Given a spec.docx
Final ProjectFinal Project Details Description  Given a spec.docxFinal ProjectFinal Project Details Description  Given a spec.docx
Final ProjectFinal Project Details Description Given a spec.docx
 
CONFidence 2014: Jakub Kałużny: Shameful secrets of proprietary protocols
CONFidence 2014: Jakub Kałużny: Shameful secrets of proprietary protocolsCONFidence 2014: Jakub Kałużny: Shameful secrets of proprietary protocols
CONFidence 2014: Jakub Kałużny: Shameful secrets of proprietary protocols
 
Shameful Secrets of Proprietary Network Protocols - OWASP AppSec EU 2014
Shameful Secrets of Proprietary Network Protocols - OWASP AppSec EU 2014Shameful Secrets of Proprietary Network Protocols - OWASP AppSec EU 2014
Shameful Secrets of Proprietary Network Protocols - OWASP AppSec EU 2014
 
Surge2012
Surge2012Surge2012
Surge2012
 
HPC Examples
HPC ExamplesHPC Examples
HPC Examples
 
The internet of $h1t
The internet of $h1tThe internet of $h1t
The internet of $h1t
 
New204
New204New204
New204
 
Lxbrand
LxbrandLxbrand
Lxbrand
 
Monitoring your API
Monitoring your APIMonitoring your API
Monitoring your API
 
When DevOps and Networking Intersect by Brent Salisbury of socketplane.io
When DevOps and Networking Intersect by Brent Salisbury of socketplane.ioWhen DevOps and Networking Intersect by Brent Salisbury of socketplane.io
When DevOps and Networking Intersect by Brent Salisbury of socketplane.io
 
Troubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversTroubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device Drivers
 
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Your Network Monitoring is (probably) Wrong

  • 1. All of Your Network Monitoring is (probably) Wrong joe damato packagecloud.io
  • 3. i’m joe i like computers i once had a blog called timetobleed.com @joedamato
  • 8. cognitive load copy & paste configs
  • 9. BTW This is actually part of another talk I’m working on called Programmers should get paid more & work less
  • 10. anw
  • 11. cognitive load ever copy/paste ! conf or tuning settings you didn’t understand?
  • 15. cognitive load do you really understand every single graph you are generating?
  • 18. If there’s too much damn code to configure and tune
  • 19. what makes you think you can actually monitor it?
  • 21. (prob. doesn't matter, more on this later)
  • 22. claim: the more complex the system is, the harder it is to monitor
  • 25. NOTE • you want to: • download and play the cat game on your an Phone (cause small portable electronic devices) • while messaging your buds (cause ur lonely) • with in app purchases (cause you need more gold fish) • over a TLS encrypted connection (cause payments) • over a VPN (cause china) • while flying (cause boredom)
  • 26. and that’s OK it doesn’t mean complexity is bad
  • 27. 2 complicated things that aren’t necessarily bad
  • 30. so, like, you know one thing that’s p. complicated?
  • 32. all kiiiiiiiiiiiiiiiiiiiiiiiinda features • different NICs have different rx and tx queue size limits and defaults • ethernet bonding • IRQ modulation, ntuple filtering, …. • RSS, RPS, RFS, aRFS • GRO, GSO, hw accelerated VLAN IDs, timestamping, …. • you are probably using at least 2 protocol stacks (IP and TCP/UDP) • all kiiiiiiiiiiiiiiinda tuning levers and knobs for everything from top to bottom
  • 34. and
  • 40. https://www.redhat.com/archives/rhl-list/2007-September/msg03735.html > How can I find out the /proc/net info > > eg: softnet_stat is for what purpose ! Much of this is only well-documented in the code. Here's an attempt at interpreting softnet_stat [no guarantee that it is correct; read the code!]:
  • 41. total # of packets (not including netpoll) received by the interrupt handler. There might be some double counting going on [ … ] I think the intention was that these were originally on separate receive paths
  • 42. Full networking writeup literally 90 pages literally everything about linux networking literally available here: http://bit.ly/linux-networking
  • 43. it’s fine, as long as we are honest that it’s just reality
  • 44.
  • 45. [random os] has a better/faster/ leaner/whatever networking stack than linux
  • 46.
  • 47. anw
  • 48. complex • not necessarily inefficient • not necessarily bad • people expect a lot of complicated features • so theres a lot of code needed to support all this random stuff you want to do • see also: cat game example
  • 49. complex • the bad news….. • you are supposed to monitor this complicated code • and then you are supposed to look at some graphs • and then you are supposed to Know The Answer™
  • 50. sounds difficult… but it gets better ;)
  • 51. what if i told you…
  • 52. a driver bug caused stats to output incorrectly in /proc/net/dev?
  • 53. igb • driver stats updated via a timer every 2 secs • reading stats via /proc/net/dev produced stale stats • but not via ethtool (different code path) • fixed by forcing stat update whenever stats are read • i saw this in production —— did you????
  • 54. only matters if you are monitoring your network stats more often than every 2 sec.
  • 55. maybe you aren’t because you dont care (that’s fine and you are prob. right)
  • 56. but if you do care…
  • 57. your future status • you’d need to: • notice the problem in your graph • start reading your stats collecting code/plugin • realize the bug is not there • read your driver code • realize the bug is in the code path that /proc/net/dev hits • write a patch to fix it • rebuild the driver and deploy it everywhere
  • 58. that’s a lot of work just to monitor bytes tx/rx greetings!
  • 59. things that dont exist • descending order of probability it doesn't exist: • free open source • the an singularity • calorie free chocolate covered bacon • etc
  • 60.
  • 61. but joe, my devops are the literal strongest and they doesn't afraid of anything
  • 62. i’ll just use the ethtool
  • 63. ethtool • a command line tool • uses ioctl system call to talk to network drivers • not all drivers actually implement the interface • and the ones which do, generally, do it differently
  • 64. what if i told you…
  • 65. ethtool • no standardized way of outputting driver stats • some drivers don’t even implement the interface • the ones which do use diff field names
  • 66. ¡dale, comparemos! • ec2 vif driver • ixgbe driver • igb driver
  • 67. ec2 vif driver $ sudo ethtool -S eth0 NIC statistics: rx_gso_checksum_fixup: 0 ethtool outputs 1 statistic
  • 69. joe’s ixgbe driver on an Real Computer ethtool outputs 377 statistics
  • 70. NIC statistics: rx_packets: 9665600259 tx_packets: 12198470686 rx_bytes: 6790400019470 tx_bytes: 2169046666156 rx_pkts_nic: 11107310349 tx_pkts_nic: 12198470686 rx_bytes_nic: 6929982126806 tx_bytes_nic: 2217848697965 lsc_int: 1 tx_busy: 0 non_eop_descs: 1044042523 rx_errors: 0 tx_errors: 0 rx_dropped: 1 tx_dropped: 0 multicast: 7876979 broadcast: 2633 rx_no_buffer_count: 0 collisions: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 hw_rsc_aggregated: 6600573569 hw_rsc_flushed: 5158863479 fdir_match: 175127 fdir_miss: 11098854004 fdir_overflow: 1 rx_fifo_errors: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_timeout_count: 0 tx_restart_queue: 0 rx_long_length_errors: 0 rx_short_length_errors: 0 tx_flow_control_xon: 0 rx_flow_control_xon: 0 tx_flow_control_xoff: 0 rx_flow_control_xoff: 0 rx_csum_offload_errors: 0 alloc_rx_page_failed: 0 alloc_rx_buff_failed: 0 rx_no_dma_resources: 0 os2bmc_rx_by_bmc: 0 os2bmc_tx_by_bmc: 0 os2bmc_tx_by_host: 0 os2bmc_rx_by_host: 0 fcoe_bad_fccrc: 0 rx_fcoe_dropped: 0 rx_fcoe_packets: 0 rx_fcoe_dwords: 0 fcoe_noddp: 0 fcoe_noddp_ext_buff: 0 tx_fcoe_packets: 0 tx_fcoe_dwords: 0 tx_queue_0_packets: 650250933 tx_queue_0_bytes: 109734794973 tx_queue_1_packets: 734133738 tx_queue_1_bytes: 123318917069 tx_queue_2_packets: 772808083 tx_queue_2_bytes: 131183014063 tx_queue_3_packets: 741428236 tx_queue_3_bytes: 125821603228 tx_queue_4_packets: 692281561 tx_queue_4_bytes: 118278086880 tx_queue_5_packets: 783438226 tx_queue_5_bytes: 133234307795 tx_queue_6_packets: 719335931 tx_queue_6_bytes: 123662184314 tx_queue_7_packets: 668577198 tx_queue_7_bytes: 114915688397 tx_queue_8_packets: 711699909 tx_queue_8_bytes: 122460443627 tx_queue_9_packets: 681741781 tx_queue_9_bytes: 118032356999 tx_queue_10_packets: 585639061 tx_queue_10_bytes: 98009207733 tx_queue_11_packets: 640487443 tx_queue_11_bytes: 107781535416 tx_queue_12_packets: 706304786 tx_queue_12_bytes: 118963058912 tx_queue_13_packets: 716825472 tx_queue_13_bytes: 121032769231 tx_queue_14_packets: 699280537 tx_queue_14_bytes: 118119557225 tx_queue_15_packets: 675274048 tx_queue_15_bytes: 114916452394 tx_queue_16_packets: 123509474 tx_queue_16_bytes: 25473914817 tx_queue_17_packets: 101309066 tx_queue_17_bytes: 23513562050 tx_queue_18_packets: 92291301 tx_queue_18_bytes: 21830243983 tx_queue_19_packets: 87287348 tx_queue_19_bytes: 20887753665 tx_queue_20_packets: 34518707 tx_queue_20_bytes: 9837323388 tx_queue_21_packets: 24009284 tx_queue_21_bytes: 6760172375 tx_queue_22_packets: 23628875 tx_queue_22_bytes: 6707751077 tx_queue_23_packets: 25969617 tx_queue_23_bytes: 7343742932 tx_queue_24_packets: 30112206 tx_queue_24_bytes: 8614816667 tx_queue_25_packets: 28812367 tx_queue_25_bytes: 8186825345 tx_queue_26_packets: 31710307 tx_queue_26_bytes: 9139202059 tx_queue_27_packets: 40835241 tx_queue_27_bytes: 11499713701 tx_queue_28_packets: 39265877 tx_queue_28_bytes: 11045989548 tx_queue_29_packets: 41775414 tx_queue_29_bytes: 11804871879 tx_queue_30_packets: 12497615 tx_queue_30_bytes: 3405490173 tx_queue_31_packets: 11021513 tx_queue_31_bytes: 2659215149 tx_queue_32_packets: 10464342 tx_queue_32_bytes: 2632864135 tx_queue_33_packets: 11341007 tx_queue_33_bytes: 2818638887 tx_queue_34_packets: 12782059 tx_queue_34_bytes: 3307226594 tx_queue_35_packets: 12795212 tx_queue_35_bytes: 3400547658 tx_queue_36_packets: 59272452 tx_queue_36_bytes: 17286517363 tx_queue_37_packets: 85631445 tx_queue_37_bytes: 25126772743 tx_queue_38_packets: 84708817 tx_queue_38_bytes: 24920451495 tx_queue_39_packets: 83763431 tx_queue_39_bytes: 24662854523 tx_queue_40_packets: 0 tx_queue_40_bytes: 0 tx_queue_41_packets: 0 tx_queue_41_bytes: 0 tx_queue_42_packets: 0 tx_queue_42_bytes: 0 tx_queue_43_packets: 0 tx_queue_43_bytes: 0 tx_queue_44_packets: 0 tx_queue_44_bytes: 0 tx_queue_45_packets: 0 tx_queue_45_bytes: 0 tx_queue_46_packets: 0 tx_queue_46_bytes: 0 tx_queue_47_packets: 0 tx_queue_47_bytes: 0 tx_queue_48_packets: 0 tx_queue_48_bytes: 0 tx_queue_49_packets: 0 tx_queue_49_bytes: 0 tx_queue_50_packets: 0 tx_queue_50_bytes: 0 tx_queue_51_packets: 0 tx_queue_51_bytes: 0 tx_queue_52_packets: 0 tx_queue_52_bytes: 0 tx_queue_53_packets: 0 tx_queue_53_bytes: 0 tx_queue_54_packets: 0 tx_queue_54_bytes: 0 tx_queue_55_packets: 0 tx_queue_55_bytes: 0 tx_queue_56_packets: 0 tx_queue_56_bytes: 0 tx_queue_57_packets: 0 tx_queue_57_bytes: 0 tx_queue_58_packets: 0 tx_queue_58_bytes: 0 tx_queue_59_packets: 0 tx_queue_59_bytes: 0 tx_queue_60_packets: 0 tx_queue_60_bytes: 0 tx_queue_61_packets: 0 tx_queue_61_bytes: 0 tx_queue_62_packets: 0 tx_queue_62_bytes: 0 tx_queue_63_packets: 0 tx_queue_63_bytes: 0 tx_queue_64_packets: 0 tx_queue_64_bytes: 0 tx_queue_65_packets: 0 tx_queue_65_bytes: 0 tx_queue_66_packets: 0 tx_queue_66_bytes: 0 tx_queue_67_packets: 0 tx_queue_67_bytes: 0 tx_queue_68_packets: 0 tx_queue_68_bytes: 0 tx_queue_69_packets: 0 tx_queue_69_bytes: 0 tx_queue_70_packets: 0 tx_queue_70_bytes: 0 tx_queue_71_packets: 0 tx_queue_71_bytes: 0 rx_queue_0_packets: 67 rx_queue_0_bytes: 4688 rx_queue_1_packets: 75 rx_queue_1_bytes: 5523 rx_queue_2_packets: 79 rx_queue_2_bytes: 5987 rx_queue_3_packets: 75 rx_queue_3_bytes: 5631 rx_queue_4_packets: 71 rx_queue_4_bytes: 5273 rx_queue_5_packets: 81
  • 71. of those 377…. none of them are: rx_gso_checksum_fixup
  • 72. joe’s igb driver on an Real Computer ethtool outputs 112 statistics
  • 73. similarly, of those 112…. none of them are: rx_gso_checksum_fixup
  • 74. surely 2 intel drivers will have similar stats
  • 75. ixgbe diff igb => 316 diff stats
  • 76. and it gets better!
  • 77. some measured in driver some measured in hw
  • 78. monitor all the things!!11!!1!
  • 80.
  • 81. this is fine i’ll read the driver source i’m really good at the kernels
  • 82. this is fine case ixgbe_mac_82599EB: for (i = 0; i < 16; i++) adapter->hw_rx_no_dma_resources += IXGBE_READ_REG(hw, IXGBE_QPRDC(i));
  • 85. register read • this driver, like many others, gets some stats from the NIC • it does this by reading register values
  • 86. documentation? so we should be able to find this in the NIC data sheet………. right?
  • 87.
  • 89. success • so just repeat this process: • get the driver code • read it for every stat • figure out if stat is in software or hardware • if its in software read the driver and figure out what it means • if its in hardware find the data sheet and figure out what it means • then graph it • and then figure out what the graph means
  • 91. what if i told you…
  • 92. some of these stats aren’t documented in the data sheet?
  • 93. so, like, theres nothing you can do except literally guess.
  • 94. you could email the device manufacturer….
  • 95.
  • 96.
  • 97. no one cares, joe • no one cares about NIC level stats • too low level • /proc/net/dev works on my computer for tx/rx • and it has high level summaries • errors! drops! fifo! frame! compressed!
  • 98. but what do errors! drops! fifo! frame! compressed! mean?
  • 100. OK, so • all these fields come from the driver • some are from software, others from the NIC • some fields are sums of the other fields • this reduces your data sheet search space • just search for the fields you care about
  • 101. but what if… the drivers don’t agree with each other on what the individual statistics represent?
  • 102. in other words.. what if: driver_stats->rx_missed_errors means something different for each driver you ask?
  • 104. meaning of driver stats are not standardized
  • 105. BTW
  • 106. stat meanings for a driver/device can change over time.
  • 107. so: • you need to figure out which NICs are in prod for all boxes • which firmware versions used on each NIC • which versions of drivers used for each NIC • read the all driver sources for the fields you care about • read the data sheet to figure out what the fields mean • build An Collectd plugin (or w/e) to encapsulate this knowledge
  • 108. maybe you don’t care too low level you care about protocol level stats
  • 109. odd b/c ethtool settings can eliminate protocol stack problems you dont care?
  • 111. let’s just read protocol stats from /proc/net/snmp!
  • 112. /proc/net/snmp • there’s an RFC !!!!!!!! (rfc 2013) • the fields are standardized!!!! • it’s higher level, so i can figure out where the protocol layers are breaking down!!! • they are gathered mostly in software • much easier than reading a 1200 pg data sheet
  • 113. what if i told you…
  • 114. BUGS • several cases where counters are incremented in the wrong place • several cases where counters double count
  • 115. BUGS several cases where counters aren’t incremented where you might think they should be
  • 116. BUGS /*! * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.! * Reporting ENOBUFS might not be good! * (it's not tunable per se), but otherwise! * we don't have a good statistic (IpOutDiscards but it can be too many! * things). We could add another new stat but at least for now that! * seems like overkill.! */! from linux 3.13.0 net/core/udp.c:
  • 117. If this is an important statistic for you, your monitoring might be wrong
  • 118. so, what does this mean?
  • 119. so, what does this mean? • monitoring something requires very deep understanding • otherwise your graphs, alerts, etc might not actually be measuring what you think they are measuring
  • 120. so, what does it mean? • this is why people build entire businesses around monitoring networks (or other stuff). • resist the urge to think you can solve every problem with a “quick” bash script
  • 121. so, what does it mean? • properly monitoring, setting alerts, etc requires significant investment • i.e. not a bash script over the weekend • again, not necessarily bad, just important to think about
  • 122. • nothing is free • this doesn’t mean that the software is bad • so plz don't jump to that conclusion • this is just reality so, what does it mean?
  • 123. and now an aside
  • 124. time, money, and business • engineers always think they can solve everything by writing enough software • the problem is that: sometimes spending your time doing that makes no business sense. • other times doing that is actively detrimental
  • 125. time, money, and business • how long would it take you to: • figure out if all your networking metrics are right • figure out what they all mean • set alerts that are sensible • remember: you need to read a lot of code, data sheets, and potentially several versions of different drivers.
  • 126. time, money, and business https://baremetrics.com/calculator (add at least 35% overhead to salaries)
  • 127. time, money, and business and this is why monitoring all the things makes no business sense for most businesses below a certain revenue level
  • 128. time, money, and business and this is also why it doesn’t really matter if these stats are wrong if these stats actually matter to your business, your business will invest the $$$$ to figure this out
  • 129. in conclusion: • complexity is not necessarily bad • even simple software is buggy and hard to monitor correctly • it all comes down to value and time • your network monitoring is probably wrong • but it probably doesn't matter because if it did, your company would invest $$$$ in figuring it out