Anomalies are interesting because they tell a different story from the norm. Anomaly detection is used in many applications, including detecting fraudulent credit card transactions and attacks in computer networks. But we do not want anomaly detection algorithms to be “alarm factories”, because if too many anomalies are detected on a regular basis, decision makers tend to ignore them. Also, many anomaly detection methods have parameters that can only be set by experts, making them difficult for lay people to use. Therefore, it is important to have “parameter-free” anomaly detection methods that minimize false positives.
In this talk, we introduce lookout, an anomaly detection method that uses extreme value theory and topological data analysis. Lookout is essentially parameter-free and has low false positive rates. We also delve into the world of computer networks and show how lookout can be used to detect suspicious nodes in computer network traffic.
2. Why anomalies?
• They tell a different story
• Fraudulent credit card transactions amongst billions of legitimate transactions
• Computer network intrusions
• Astronomical anomalies – solar flares
• Weather anomalies – tsunamis
• Stock market anomalies – heralding a crash?
• Important to detect anomalies in a timely manner
3. Current challenges
Anomaly detection (AD) methods rank observations in terms of anomalousness
• They don’t identify anomalies
• So, the user needs to define a threshold and identify anomalies
High false positives
• Do not want an “alarm factory” – confidence in the system goes down
Parameters need to be defined by the user
• But expert knowledge is needed
5. Lookout – leave-one-out KDE for outlier detection
Sevvandi Kandanaarachchi, Rob Hyndman
Preprint - https://bit.ly/lookoutliers
6. Kernel density estimation (KDE)
• A density estimation technique using kernels
• A set of points on the real line
• Placing the kernel at every point
• Kernel density estimate with kernel $K$: $f(x, h) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$
• ℎ - the bandwidth parameter
• https://mathisonian.github.io/kde/
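As a minimal sketch of the estimate above, the snippet below evaluates the sum directly on a handful of made-up points. The Epanechnikov kernel is used because it appears later in the talk; the data, the bandwidth `h = 1.0` and the function names are assumptions chosen only for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x, points, h):
    """Evaluate f(x, h) = (1 / (n h)) * sum_i K((x - X_i) / h)."""
    n = len(points)
    return sum(epanechnikov((x - xi) / h) for xi in points) / (n * h)


# Illustrative data: a cluster near 1.5 and one isolated point at 8.0.
points = [1.0, 1.2, 1.5, 1.7, 2.0, 8.0]
h = 1.0  # bandwidth picked by hand for this illustration

dense = kde(1.5, points, h)   # inside the cluster
sparse = kde(8.0, points, h)  # only the point's own kernel contributes
print(dense, sparse)
```

The isolated point receives a much lower density than a point inside the cluster, which is the property the following slides rely on.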
7. KDE for anomaly detection
• What do we want?
• Anomalies to have much lower kde values than other points.
• Why?
• Because anomalies are in low density regions.
• The literature on bandwidth selection focusses on representing the data
• Minimize MISE (Mean Integrated Square Error)
• But, this doesn’t work for us.
15. We are interested in . . .
• The end-point diameter (death diameter) sequences
• We want the maximum gap
• Diameter that starts the maximum gap = 𝑑
• ℎ = 5 𝑑 for Epanechnikov
kernel
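The maximum-gap step can be sketched as follows. This is not the authors' implementation: it assumes the death diameters are those of connected components in a Vietoris-Rips filtration, which coincide with the edge lengths of a minimum spanning tree (computed here with Prim's algorithm). The function names are hypothetical, and the kernel-specific scaling from 𝑑 to ℎ is deliberately left to the caller.

```python
import math


def death_diameters(points):
    """Sorted 0-dimensional death diameters of a Vietoris-Rips filtration.

    Assumption: these equal the edge lengths of a Euclidean minimum
    spanning tree, built here with Prim's algorithm.
    """
    n = len(points)
    best = [math.inf] * n      # cheapest known link from each point to the tree
    in_tree = [False] * n
    best[0] = 0.0
    edges = []
    for step in range(n):
        j = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[j] = True
        if step > 0:           # the starting point joins without an edge
            edges.append(best[j])
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[j], points[i]))
    return sorted(edges)


def gap_diameter(diameters):
    """Diameter that starts the largest gap between consecutive sorted diameters."""
    gaps = [b - a for a, b in zip(diameters, diameters[1:])]
    return diameters[gaps.index(max(gaps))]


# Illustrative 1-D data (needs at least three points): a cluster and one far point.
points = [(0.0,), (0.1,), (0.3,), (0.6,), (5.0,)]
diameters = death_diameters(points)
d = gap_diameter(diameters)
print(diameters, d)
```

Here the far point only merges with the cluster at a much larger diameter, so the largest gap starts at the last within-cluster diameter.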
16. Using this bandwidth
• Compute the kde values
• Anomalies will have very low kde values
• We can rank the anomalies using the low kde values
• Low kde – anomalous
• High kde – not anomalous
17. But, we want to identify anomalies!
Just because the kde is low, is it an anomaly?
18. We want to have a cut off!
For that we use Extreme Value Theory!
19. EVT – Peak Over Threshold method (POT)
• Pick a threshold – 90%
• Model the exceedances
• Generalized Pareto distribution
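A minimal sketch of the peaks-over-threshold idea, under stated assumptions: the threshold is the empirical 90% quantile, and the generalized Pareto distribution (GPD) is fitted by the method of moments to keep the example dependency-free (maximum likelihood is the more common choice). The function names and the simulated data are illustrative only.

```python
import math
import random
import statistics


def fit_gpd_pot(values, quantile=0.90):
    """Fit a GPD to exceedances above an empirical quantile.

    Method-of-moments estimates (an assumption made for simplicity).
    Returns (threshold, shape, scale).
    """
    ordered = sorted(values)
    threshold = ordered[int(quantile * (len(ordered) - 1))]
    excess = [v - threshold for v in ordered if v > threshold]
    mean = statistics.fmean(excess)
    var = statistics.variance(excess)
    ratio = mean * mean / var
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (1.0 + ratio)
    return threshold, shape, scale


def gpd_survival(y, shape, scale):
    """P(Y > y) for an exceedance y under a GPD with the given shape and scale."""
    if y <= 0.0:
        return 1.0
    if abs(shape) < 1e-12:
        return math.exp(-y / scale)
    base = 1.0 + shape * y / scale
    return base ** (-1.0 / shape) if base > 0.0 else 0.0


# Simulated data: exponential draws, whose exceedances are again exponential.
random.seed(1)
sample = [random.expovariate(1.0) for _ in range(2000)]
threshold, shape, scale = fit_gpd_pot(sample)
print(threshold, shape, scale)
print(gpd_survival(1.0, shape, scale), gpd_survival(3.0, shape, scale))
```

The survival function gives the probability of exceeding the threshold by more than a given amount, which is what a cut-off can then be applied to.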
20. Method lookout
• Fit a GPD using the kde values
• Then use the leave-one-out kde values to determine the probability of points according to the GPD
• We have a set of probabilities
• Low probabilities are more likely to be anomalies
• Have a pre-defined cut-off $\alpha$; this is your threshold
• If $\text{probability}(x_i) < \alpha$, then $x_i$ is an anomaly.
• So you can identify anomalies.
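The steps on this slide can be sketched end to end for one-dimensional data. This is an illustration, not the lookout R package: it assumes densities are moved to the -log scale so that low density becomes a large value, fits the GPD by the method of moments, and takes a hand-picked bandwidth `h` instead of the one found from the death diameters. The names `lookout_sketch`, `neg_log_kde` and `gpd_survival`, and the defaults `alpha=0.05` and `quantile=0.90`, are assumptions.

```python
import math
import random
import statistics


def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def neg_log_kde(x, sample, h):
    """-log of the KDE at x built from `sample`; inf when the density is zero."""
    dens = sum(epanechnikov((x - xi) / h) for xi in sample) / (len(sample) * h)
    return -math.log(dens) if dens > 0.0 else math.inf


def gpd_survival(y, shape, scale):
    """P(Y > y) for an exceedance y under a GPD with the given shape and scale."""
    if y <= 0.0:
        return 1.0
    if abs(shape) < 1e-12:
        return math.exp(-y / scale)
    base = 1.0 + shape * y / scale
    return base ** (-1.0 / shape) if base > 0.0 else 0.0


def lookout_sketch(data, h, alpha=0.05, quantile=0.90):
    """Return (flagged indices, probabilities) for 1-D data and bandwidth h."""
    n = len(data)
    # Full-sample and leave-one-out densities on the -log scale.
    full = [neg_log_kde(x, data, h) for x in data]
    loo = [neg_log_kde(x, data[:i] + data[i + 1:], h) for i, x in enumerate(data)]

    # Peaks over threshold on the full-sample values (method-of-moments fit).
    threshold = sorted(full)[int(quantile * (n - 1))]
    excess = [v - threshold for v in full if v > threshold]
    mean, var = statistics.fmean(excess), statistics.variance(excess)
    ratio = mean * mean / var
    shape, scale = 0.5 * (1.0 - ratio), 0.5 * mean * (1.0 + ratio)

    # Score each point with its leave-one-out value under the fitted GPD.
    probs = [gpd_survival(v - threshold, shape, scale) for v in loo]
    return [i for i, p in enumerate(probs) if p < alpha], probs


# Simulated data: 200 standard normal draws plus one planted outlier at 10.
random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [10.0]
flagged, probs = lookout_sketch(data, h=0.5)
print(flagged)
```

With a compactly supported kernel, the planted point has zero leave-one-out density, so its probability under the GPD is zero and it falls below any positive cut-off.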
22. Practical advantages of lookout
The user does not need to specify a bandwidth parameter
• The user can be anyone – not necessarily a mathematician
EVT-based methods have low false positive rates
• Attractive for many applications
• Not an alarm factory
23. For the mathematician/statistician in me
• Coming together of
• Topological data analysis
• Extreme Value Theory
• Kernel density estimates
• To find anomalies
25. Anomaly Persistence
• What if a data-point is identified as an anomaly for different bandwidth values?
• Visual representation of anomaly persistence
• Big picture
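One way to picture persistence is a grid with one row per bandwidth and one column per observation. The sketch below assumes a deliberately simple stand-in detector (a point is flagged when no other observation lies within the bandwidth, i.e. its leave-one-out Epanechnikov density is zero) rather than the full method; the data, bandwidths and function names are illustrative.

```python
def isolated_at(data, h):
    """Stand-in detector: flag points with no other observation closer than h."""
    flags = []
    for i, x in enumerate(data):
        nearest = min(abs(x - y) for j, y in enumerate(data) if j != i)
        flags.append(nearest >= h)
    return flags


def persistence(data, bandwidths, detect=isolated_at):
    """Rows are bandwidths, columns are points; True where a point is flagged."""
    return [detect(data, h) for h in bandwidths]


data = [0.0, 0.2, 0.4, 0.5, 0.7, 2.0, 6.0]
bandwidths = [0.25, 0.5, 1.0, 2.0, 3.0]
grid = persistence(data, bandwidths)

# Number of bandwidths at which each point is flagged.
counts = [sum(row[i] for row in grid) for i in range(len(data))]

# Text rendering of the persistence picture: '#' means flagged.
for h, row in zip(bandwidths, grid):
    print(f"h={h:<5}", "".join("#" if f else "." for f in row))
print(counts)
```

Points that stay flagged over a wide range of bandwidths are the persistent ones; any detector with the same `(data, h)` signature can be passed in as `detect`.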
26. Application: Computer Network Security
Honeyboost: Boosting honeypot performance with data fusion and anomaly detection – Sevvandi Kandanaarachchi, Hideya Ochiai, Asha Rao
Preprint - https://arxiv.org/abs/2105.02526
27. LAN Security Monitoring
• ‘LAN-Security Monitoring Device’ to capture suspicious/malicious activities that happen inside a LAN (Local Area Network).
• Though it is not a real camera, it works like a ‘cyber-space surveillance camera’.
• It captures all the broadcast packets, and direct packets to the monitoring device.
[Diagram: a LAN connecting smartphones, a printer, smart appliances, a data server and the LAN-Security Monitoring Device]
28. LAN Security Monitoring
• The LAN-Security Monitoring Device is a honeypot - a trap for attackers
29. Honeypot data
• ARP data – a big shout out to everyone (broadcast to the network)
• These nodes do not access the honeypot
• “Who has got this address? I need to communicate with you”
• Generally not a suspicious activity
• But malicious nodes can also make ARP calls
• TCP and UDP data – targeted at the honeypot
• These nodes have accessed the honeypot using TCP/UDP protocols
• Oooh suspicious!
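The distinction between the two kinds of traffic can be sketched as a simple grouping step. The packet records, addresses and the function name `honeypot_access` are hypothetical; real captures carry far more fields.

```python
from collections import defaultdict

# Hypothetical packet records: (source node, protocol).
packets = [
    ("10.0.0.5", "ARP"),
    ("10.0.0.7", "ARP"),
    ("10.0.0.7", "TCP"),
    ("10.0.0.9", "UDP"),
    ("10.0.0.5", "ARP"),
]


def honeypot_access(packets):
    """Map each node to True if it sent TCP or UDP traffic to the honeypot."""
    protocols = defaultdict(set)
    for node, proto in packets:
        protocols[node].add(proto)
    return {node: bool(seen & {"TCP", "UDP"}) for node, seen in protocols.items()}


accessed = honeypot_access(packets)
print(accessed)
```

Nodes mapped to `False` were seen only through ARP broadcasts, so they never touched the honeypot directly.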
30. A bit more on honeypots
• An intruder can be there without accessing the honeypot
• Limited vision of honeypots
• Honeypots are never standalone security devices
• Identifying anomalous nodes is important - Honeyboost
31. Generally . . .
• Anomalies detected based on individual packets – packet-based
• Packet features separately for each packet
• Of all the traffic, which packets are anomalous
• Our contribution: we find anomalous nodes – node-based
• Features of nodes using the traffic – using multivariate time series
• Of all the nodes, which nodes are anomalous
32. Varying-dimensional time series
• Different protocols have different header features
• Finding anomalies from varying dimensional time series
• 200 computers/nodes = 200 varying-dimensional time series
• Which one is anomalous, if at all?
33. Varying-dimensional time series for each node
[Diagram labels, in extraction order: multivariate time series; Compute features; Window model and process; Feature space for all nodes; Lookout]
34. Varying-dimensional time series for each node
multivariate time series
Node A:
Timestamp  Protocol  ARP count  ARP degree  TCP PC1  TCP PC2  UDP PC1  UDP PC2
30         ARP       10         12          0        0        0        0
55         TCP       0          0           -2.15    1.75     0        0
85         UDP       0          0           0        0        3.56     0.45
36. Features
• The total length of line segments in $\mathbb{R}^6$
• The maximum time difference
• Number of protocols used
• Number of TCP calls/UDP calls
• Total length of line segments in each protocol space
• Line of best fit in each protocol space
• Sum of errors squared for the line of best fit
[Figure: TCP protocol space with axes TCP PC1 and TCP PC2]
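A few of these features can be sketched on the three "Node A" rows from slide 34. The interpretation is an assumption: "total length of line segments" is taken to be the length of the path joining consecutive observations in the six-dimensional feature space, and "maximum time difference" the largest gap between consecutive timestamps. The function and key names are hypothetical.

```python
import math

# Rows for Node A:
# (timestamp, protocol, ARP count, ARP degree, TCP PC1, TCP PC2, UDP PC1, UDP PC2)
node_a = [
    (30, "ARP", 10, 12, 0, 0, 0, 0),
    (55, "TCP", 0, 0, -2.15, 1.75, 0, 0),
    (85, "UDP", 0, 0, 0, 0, 3.56, 0.45),
]


def node_features(rows):
    """Compute a handful of per-node features (needs at least two rows)."""
    rows = sorted(rows, key=lambda r: r[0])   # order by timestamp
    times = [r[0] for r in rows]
    points = [r[2:] for r in rows]            # the six numeric columns
    return {
        # Length of the path through consecutive points in R^6.
        "path_length": sum(math.dist(p, q) for p, q in zip(points, points[1:])),
        # Largest gap between consecutive timestamps.
        "max_time_gap": max(b - a for a, b in zip(times, times[1:])),
        "n_protocols": len({r[1] for r in rows}),
        "tcp_calls": sum(r[1] == "TCP" for r in rows),
        "udp_calls": sum(r[1] == "UDP" for r in rows),
    }


features = node_features(node_a)
print(features)
```

Each node yields one such feature vector, so nodes with very different traffic patterns become comparable points in a common feature space.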
37. Findings
• Suspicious nodes that do not access the honeypot
[Figure: feature space for all nodes, passed to lookout; two nodes are each annotated "This node does not access the honeypot"]
38. Insights
• Identify some nodes before they access the honeypot
• Gain insights – find anomalies and look back at the original data
• Anomaly has set suspicious flags – PSH flag and URG flag
• PSH flag – PUSH flag – push packet to the application layer
• URG flag – URGENT flag – treat packet as urgent. Why, when accessing the honeypot?
• Can be used to derive new rules
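A rule of the kind hinted at here could be sketched as a check on the TCP flags byte. The bit values are the standard TCP control bits (PSH is 0x08, URG is 0x20); the rule itself and the function name `suspicious_flags` are hypothetical, and PSH on its own is common in ordinary traffic, so such a rule would only make sense for packets aimed at the honeypot.

```python
PSH = 0x08  # push: hand buffered data to the application without delay
URG = 0x20  # urgent: the urgent pointer field is significant


def suspicious_flags(tcp_flags):
    """Return the names of the PSH/URG bits set in a TCP flags byte."""
    return [name for name, bit in (("PSH", PSH), ("URG", URG)) if tcp_flags & bit]


print(suspicious_flags(0x18))  # ACK + PSH
print(suspicious_flags(0x38))  # ACK + PSH + URG
print(suspicious_flags(0x02))  # SYN only
```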
39. Summary
• Lookout - an EVT-based method to find anomalies (using TDA)
• An application in computer network security
• R package lookout is on CRAN
• Both preprints available
• https://bit.ly/lookoutliers
• https://arxiv.org/abs/2105.02526