Making & Breaking Machine Learning Anomaly Detectors in Real Life by Clarence Chio - CODE BLUE 2015

Making & Breaking
Machine Learning
Anomaly Detectors
in Real Life

My goal
• give an overview of Machine Learning Anomaly Detectors
• spark discussions on when/where/how to create these
• explore how “safe” these systems are
• discuss where we go from here

Anomaly Detection vs. Machine Learning
Taxonomy

Taxonomy
Anomaly Detection / Machine Learning
Heuristics/Rule-based
Predictive ML

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0
205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985
how to find anomalies in these???

Why are ML-based techniques
attractive compared to
threshold/rule-based heuristics?
• adaptive
• dynamic
• minimal human intervention (theoretically)

Why are threshold/rule-based
heuristics good?
• easy to reason about
• simple & understandable
• can also be dynamic/adaptive

The big ML + Anomaly Detection Problem
a lot of machine learning + anomaly
detection research, but not a lot of
successful systems in the real world.
WHY?

The big ML + Anomaly Detection Problem
Anomaly Detection:
find novel attacks,
identify never-seen-before things
Traditional Machine Learning:
learn patterns,
identify similar things

What makes Anomaly Detection so
different?
fundamentally different from other ML problems
• very high cost of errors
• lack of training data
• “semantic gap”
• difficulties in evaluation
• adversarial setting

Really bad if the system is wrong…
• compared to other learning applications, very intolerant to errors
• what happens if we have a high false positive rate?
• high false negative rate?

Lack of training data…
•what data to train model on?
•so hard to clean input data!

Hard to interpret the results/alerts…
the “semantic gap”
ok… I got the alert…
why did I get the alert…?

The evaluation problem
• devising a sound evaluation scheme is even more
difficult than building the system itself
• problems with relying on ML Anomaly Detection
evaluations in academic research papers

Adversarial impact
advanced actors can
(and will) spend the
time and effort to
bypass the system

How have real world AD systems failed?
• many false positives
• hard to find attack-free training data
• used without deep understanding
• model-poisoning

Doing it!
• generate time-series
• select representative features
• train/tune model of ‘normality’
• alert if incoming points deviate from model
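The four steps above can be sketched end to end. This is a minimal illustration only, assuming Common Log Format input like the NASA sample, a single feature (requests per minute), and a mean/standard-deviation model of "normality"; the function names and the threshold `k` are hypothetical choices, not something the talk prescribes.

```python
import re
import statistics
from collections import Counter
from datetime import datetime

# Matches the timestamp of a Common Log Format line, e.g. [01/Jul/1995:00:00:01 -0400]
TS = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")

def requests_per_minute(lines):
    """Steps 1-2: turn raw log lines into a time series of one feature,
    the number of requests seen in each minute."""
    counts = Counter()
    for line in lines:
        m = TS.search(line)
        if m:
            t = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S")
            counts[t.replace(second=0)] += 1
    return counts

def fit_normal_model(series):
    """Step 3: model 'normality' as the mean and standard deviation."""
    return statistics.mean(series), statistics.pstdev(series)

def is_anomalous(value, model, k=3.0):
    """Step 4: alert when a point lies more than k standard deviations
    from the mean (k is an assumed tuning parameter)."""
    mean, std = model
    return abs(value - mean) > k * max(std, 1e-9)

# Hypothetical per-minute counts observed during a training window.
training = [20, 22, 19, 21, 23, 20, 18, 22]
model = fit_normal_model(training)
print(is_anomalous(21, model), is_anomalous(90, model))
```

A real system would use several features and a richer model; the point here is only the shape of the pipeline.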

Example infrastructure
Sensitivity of PCA for Traffic Anomaly Detection, Ringberg et al.

Common Techniques
• density-based
• subspace/correlation-based
• support vector machines
• clustering
• neural networks

“Model”?
clusters
• centroid clusters
• good for “online learning”
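One way to read "centroid clusters, good for online learning" is a model that nudges the nearest centroid toward each new point, so it never needs to be retrained from scratch. The sketch below is an assumption-laden illustration (the class name, the running-mean update, and the distance-based score are hypothetical choices), not the model used in the talk.

```python
import math

class OnlineCentroids:
    """Keeps a fixed set of centroids and updates them one point at a time."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [1] * len(self.centroids)

    def nearest(self, point):
        """Return (index, distance) of the centroid closest to point."""
        dists = [math.dist(point, c) for c in self.centroids]
        i = min(range(len(dists)), key=dists.__getitem__)
        return i, dists[i]

    def update(self, point):
        """Move the nearest centroid toward point (running-mean update)."""
        i, _ = self.nearest(point)
        self.counts[i] += 1
        lr = 1.0 / self.counts[i]
        self.centroids[i] = [c + lr * (p - c)
                             for c, p in zip(self.centroids[i], point)]

    def score(self, point):
        """Anomaly score: distance to the closest centroid."""
        return self.nearest(point)[1]

model = OnlineCentroids([(0.0, 0.0), (10.0, 10.0)])
model.update((1.0, 1.0))
print(model.centroids[0], model.score((5.0, 5.0)))
```

Because every accepted point moves a centroid, this kind of model is also the kind that an attacker can slowly steer, which is the concern raised later in the deck.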

How to select features?
• often ends up being the most challenging
piece of the problem
• isn’t it just a parameter optimization problem?

How to select features?
Difficulties:
• too many possible combinations to iterate over!
• hard to evaluate
• the “optimal” feature set changes frequently
• accuracy is not the only criterion; also:
  • improved model interpretability
  • shorter training times
  • enhanced generalization / reduced overfitting

Principal Component Analysis
• common statistical method to automatically
select features
How?
• transform data into different dimensions
• returns an ordered list of dimensions
(principal components) that best
represent the data’s variance
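A minimal sketch of that transformation, assuming NumPy and using the singular value decomposition of the centered data. The function name `pca` and the synthetic data are illustrative only.

```python
import numpy as np

def pca(X):
    """Return the principal components (one per row, ordered by variance
    captured) and the fraction of total variance each one explains."""
    Xc = X - X.mean(axis=0)                      # center every feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (len(X) - 1)                  # variance along each component
    return Vt, var / var.sum()

# Synthetic data that mostly varies along the direction (3, 1).
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.outer(t, [3.0, 1.0]) + 0.1 * rng.normal(size=(500, 2))

components, ratio = pca(X)
print(components[0], ratio)
```

The first row of `components` should point (up to sign) along the direction in which the data varies most, and `ratio` is sorted from largest to smallest.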

http://setosa.io/ev/principal-component-analysis/
projection

true PCA result,
maximize variance capture

choose principal components
that cover 80-90% of the dataset's variance
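The 80-90% rule of thumb can be applied mechanically: compute the cumulative variance captured and keep the smallest number of components that reaches the target. A sketch assuming NumPy; the function name and the default `target=0.9` are assumptions.

```python
import numpy as np

def components_for_variance(X, target=0.9):
    """Smallest number of principal components whose cumulative share of
    variance reaches target, plus the cumulative curve itself."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values only
    cumulative = np.cumsum(s ** 2 / np.sum(s ** 2))
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic features with very different scales (std 10, 1, and 0.1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) * [10.0, 1.0, 0.1]

k, cumulative = components_for_variance(X, target=0.9)
print(k, cumulative)
```

The `cumulative` array is exactly what a cumulative variance plot shows on its vertical axis.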

“Scree” Plot
[Figure: Cumulative % Variance Capture (0 to 100) vs. Number of Principal Components Used (log scale, 10^0 to 10^3); curves labeled “PCA more effective” and “PCA less effective”]

How to avoid common pitfalls?
• Understand your threat model well
• Keep the detection scope narrow
• Reduce the cost of false negatives/positives

How good is my anomaly detector?
how easily can you filter out
false positives?

How good is my anomaly detector?
evaluating true positives?

how do we attack this?
the most important question…

How do we attack this?
manipulate the learning system to
permit a specific attack
degrade the performance of the
system to compromise its reliability

Attacking PCA-based systems
[Figure: center before attack, center after attack, “chaff”, attack direction, decision boundary]

[Figure: before attack, after attack, “chaff”, no clear attack direction]

chaff volume vs. injection period
to avoid detection, go slow!
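The poisoning idea can be illustrated with a toy simulation: chaff added along the flow the attacker cares about inflates the variance there, so the first principal component (the "normal" subspace) rotates toward the attack and the attack's residual shrinks. Everything below is a simplified assumption (two flows, 30% of training points poisoned, synthetic Gaussian traffic), not the talk's experimental setup.

```python
import numpy as np

def top_component(X):
    """First principal component of X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def residual_norm(x, mean, pc):
    """Size of the part of x that the 'normal' subspace (spanned by pc)
    does not explain; a detector would alert when this is large."""
    d = x - mean
    return float(np.linalg.norm(d - (d @ pc) * pc))

rng = np.random.default_rng(0)
# Hypothetical clean traffic on two flows; most variance is on the first.
clean = rng.normal(size=(1000, 2)) * [5.0, 1.0]
attack = np.array([0.0, 20.0])   # attack volume lands on the second flow

# Poison 30% of the training points with extra volume on the target flow.
poisoned = clean.copy()
idx = rng.choice(len(poisoned), size=300, replace=False)
poisoned[idx, 1] += np.abs(rng.normal(scale=20.0, size=300))

before = residual_norm(attack, clean.mean(axis=0), top_component(clean))
after = residual_norm(attack, poisoned.mean(axis=0), top_component(poisoned))
print(before, after)
```

Spreading the same chaff over many training periods changes each retrained model only a little at a time, which is why the slow variant is harder to notice.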

How do you defend against this?
maintain a calibration test set
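One possible reading of this defense: keep a small, trusted set of labeled examples outside the training loop and re-check the detector against it after every retrain. The sketch below assumes a detector exposed as a boolean function; the function name and the `min_tpr` / `max_fpr` limits are hypothetical.

```python
def calibration_check(is_anomalous, known_normal, known_anomalous,
                      min_tpr=0.9, max_fpr=0.1):
    """Re-evaluate a freshly trained detector on a trusted, held-out set.
    Returns (ok, true_positive_rate, false_positive_rate)."""
    tpr = sum(1 for x in known_anomalous if is_anomalous(x)) / len(known_anomalous)
    fpr = sum(1 for x in known_normal if is_anomalous(x)) / len(known_normal)
    return tpr >= min_tpr and fpr <= max_fpr, tpr, fpr

# Trusted examples kept out of the (possibly poisoned) training stream.
known_normal = [10, 20, 30]
known_anomalous = [100, 200, 300]

healthy = lambda x: x > 50      # detector before poisoning
drifted = lambda x: x > 500     # detector after its threshold was pushed up

print(calibration_check(healthy, known_normal, known_anomalous))
print(calibration_check(drifted, known_normal, known_anomalous))
```

A failed check is a signal that the model has drifted, whether through poisoning or through a legitimate change in traffic, and that the last retrain should not be trusted blindly.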

[Figure: before vs. after]

decision boundary
ratio-detection

[Figure: before attack, after attack, decision boundary region]

Can Machine Learning be secure?
not easy to achieve for unsupervised, online learning
slow adversaries down
gives you time to
detect when you’re
being targeted

Improved PCA
• Antidote
• Principal component pursuit
• Robust PCA

Robust statistics
use median instead of mean
PCA’s ‘variance’ maximization vs. Antidote’s ‘median absolute deviation’
find an appropriate distribution that models your dataset
normal/Gaussian vs. Laplacian
distributions
use robust PCA
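Why the median and the median absolute deviation (MAD) resist poisoning better than the mean and standard deviation can be seen on a handful of numbers. This is a generic illustration of robust statistics with made-up values, not Antidote itself.

```python
import statistics

def mad(values):
    """Median absolute deviation: a robust counterpart of the standard deviation."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

clean = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10]
poisoned = clean + [100, 120]          # two injected "chaff" points

for name, data in (("clean", clean), ("poisoned", poisoned)):
    print(name,
          "mean:", statistics.mean(data),
          "stdev:", statistics.pstdev(data),
          "median:", statistics.median(data),
          "MAD:", mad(data))
```

Two extreme points drag the mean and standard deviation far from the bulk of the data, while the median stays put and the MAD barely moves.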

My own tests.
I ran my own simulations with
some real data…
why did I do this?

[Figure: Projection on 1st Principal Component vs. Projection onto “Target Flow”]

Naive PCA
Robust PCA

by the way,
generating this
chaff is hard

Robust PCA
Naive PCA

[Figure axis: # of Training Periods (2, 4, 6, 8, 10)]

true positive rate vs. false positive rate
[Figure: Poisoning Detection Rate (True Positive Rate, 0 to 1.0) vs. False Alarm Rate (False Positive Rate, 0 to 1.0); series: Random Detector; RPCA - No Poisoning; RPCA - 30% chaff; RPCA - 50% chaff; RPCA - Boiling Frog, 50% chaff over 10 training periods]

Evasion success rates
[Figure: Evasion Success Rate (False Negative Rate, 0 to 1.0) vs. Attack Duration (# of training periods, 0 to 10) for Boiling Frog and vs. Chaff Injected (0% to 50%) for Naive Injection; series: RPCA - Boiling Frog, 50% chaff spread over x periods; RPCA - Naive Injection]

                                                                Naive PCA                Robust(er) PCA
Naive Chaff Injection (50% injection, single training period)   ~ 76% evasion success    ~ 14% evasion success
Boiling Frog Injection (10 training periods)                    ~ 87% evasion success    ~ 38% evasion success

Anomaly detection systems today
• not so good, but improving…
• pure ML-based anomaly detectors are still vulnerable
to compromise
• use ML to find features and thresholds, then run
streaming anomaly detection using static rules
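The last recommendation can be sketched as a two-phase setup: learn a threshold offline from historical data, then apply it as a fixed rule to the live stream. The choice of median + k * MAD and the value of `k` are assumptions made for illustration.

```python
import statistics

def learn_threshold(training_values, k=5.0):
    """Offline phase: derive a fixed threshold from historical data
    (median + k * MAD; k is an assumed tuning parameter)."""
    med = statistics.median(training_values)
    mad = statistics.median([abs(v - med) for v in training_values])
    return med + k * max(mad, 1e-9)

def stream_alerts(stream, threshold):
    """Online phase: a static rule. The threshold does not change with
    incoming data, so live traffic cannot shift it."""
    for i, value in enumerate(stream):
        if value > threshold:
            yield i, value

history = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10]
threshold = learn_threshold(history)
alerts = list(stream_alerts([10, 13, 11, 40], threshold))
print(threshold, alerts)
```

The trade-off is that the rule stops adapting; the threshold has to be relearned deliberately, on data that has been checked, rather than continuously.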

What next?
• do more tests on AD systems that others have created
• other defenses against poisoning techniques
• experiment on more resilient ML models

cchio@shapesecurity.com
@cchio
