
Making & Breaking Machine Learning Anomaly Detectors in Real Life by Clarence Chio - CODE BLUE 2015

Machine learning (ML) techniques for network intrusion detection have gained notable traction in the web security industry over the past decade. Some Intrusion Detection Systems (IDS) have used these techniques successfully to detect and deflect network intrusions before they could cause significant harm to network services. Simply put, such an IDS constructs a statistical model of what normal traffic looks like, using data retrieved from web access logs as input. An online processing system then maintains this model of expected traffic, and/or a model of what malicious traffic looks like. When incoming traffic deviates from the expected model by more than a defined threshold, the IDS flags it as malicious. The theory is that the more data the system sees, the more accurate the model becomes. This makes for a flexible traffic-analysis system, seemingly perfect for constantly evolving and growing web traffic patterns.
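The thresholding idea described above can be sketched in a few lines. This is an illustrative toy, not any specific IDS; all function names and traffic numbers are made up:

```python
# Toy sketch of the core IDS idea: model "normal" traffic as a mean request
# rate, then flag windows whose deviation from that model exceeds a threshold.
import statistics

def build_model(training_rates):
    """Summarize normal traffic as (mean, stdev) of requests per window."""
    return statistics.mean(training_rates), statistics.stdev(training_rates)

def is_anomalous(rate, model, k=3.0):
    """Flag a window whose rate deviates more than k standard deviations."""
    mean, stdev = model
    return abs(rate - mean) > k * stdev

# Train on benign per-minute request counts (made-up data).
normal = [100, 98, 103, 97, 101, 99, 102, 100]
model = build_model(normal)

print(is_anomalous(101, model))  # typical traffic -> False
print(is_anomalous(500, model))  # sudden burst   -> True
```

The appeal is that the threshold adapts to whatever the training data looks like, rather than being hand-tuned per deployment.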

However, this fairytale did not last long. It was soon found that attackers had been evading detection by ‘poisoning’ the classifier models that these (often PCA-based) systems learn. By sending large volumes of seemingly benign web traffic, adversaries slowly retrain the detection model to become more tolerant of outliers, and therefore of actual malicious attempts. They succeeded.
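A toy illustration of the poisoning idea (assumed numbers, not the talk's experiment): injected chaff inflates the learned variance, so a rate that was an outlier under the clean model slips under the threshold after poisoning:

```python
# Chaff inflates the model's tolerance: a formerly detected attack rate
# falls below the alert threshold once benign-looking chaff is absorbed.
import statistics

def threshold(data, k=3.0):
    """Alert threshold: mean plus k standard deviations of training data."""
    return statistics.mean(data) + k * statistics.stdev(data)

clean = [100, 98, 103, 97, 101, 99, 102, 100]
attack_rate = 160

# Before poisoning: the attack exceeds the alert threshold.
print(attack_rate > threshold(clean))      # True  -> detected

# Adversary injects "benign-looking" chaff at gradually higher rates.
poisoned = clean + [120, 130, 140, 150]
print(attack_rate > threshold(poisoned))   # False -> evades detection
```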

In this talk, we will give a live demo of this 'model-poisoning' attack and analyze methods that have been proposed to reduce the susceptibility of ML-based network anomaly detection systems to manipulation by attackers. Instead of diving into the ML theory behind this, we will focus on examples of these systems working in the real world, the attacks that render them impotent, and what this means for developers looking to protect themselves from network intrusion. Most importantly, we will look towards the future of ML-based network intrusion detection.



  1. 1. Making & Breaking Machine Learning Anomaly Detectors in Real Life
  2. 2. Machine Learning
  3. 3. My goal • give an overview of Machine Learning Anomaly Detectors • spark discussions on when/where/how to create these • explore how “safe” these systems are • discuss where we go from here
  4. 4. Taxonomy: Anomaly Detection vs. Machine Learning
  5. 5. Taxonomy: Machine Learning vs. Anomaly Detection; Heuristics/Rule-based vs. Predictive ML
  6. 6. Intrusion detection?
  7. 7. - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245 - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179 - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985 how to find anomalies in these???
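One hypothetical first step toward the slide's question is parsing such Common Log Format lines into structured records before computing features. The regex and field names here are illustrative, not from the talk:

```python
# Parse access-log lines like those on the slide into dicts, then derive a
# simple per-window feature (counts per HTTP status code).
import re
from collections import Counter

LOG_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_RE.search(line)
    return m.groupdict() if m else None

lines = [
    '- - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    '- - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
    '- - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
]

records = [parse(l) for l in lines]
status_counts = Counter(r["status"] for r in records)
print(status_counts)  # per-status-code counts, e.g. 2 x '200', 1 x '304'
```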
  8. 8. Why are ML-based techniques attractive compared to threshold/rule-based heuristics? • adaptive • dynamic • minimal human intervention (theoretically)
  9. 9. but…
  10. 10. Why are threshold/rule-based heuristics good? • easy to reason • simple & understandable • can also be dynamic/adaptive
  11. 11. Successful ML Applications
  12. 12. Successful ML Applications
  13. 13. Successful ML Applications
  14. 14. Setting Expectations
  15. 15. The big ML + Anomaly Detection Problem a lot of machine learning + anomaly detection research, but not a lot of successful systems in the real world. WHY?
  16. 16. The big ML + Anomaly Detection Problem. Anomaly Detection: find novel attacks, identify never-seen-before things; Traditional Machine Learning: learn patterns, identify similar things
  17. 17. What makes Anomaly Detection so different? fundamentally different from other ML problems • very high cost of errors • lack of training data • “semantic gap” • difficulties in evaluation • adversarial setting
  18. 18. Really bad if the system is wrong… • compared to other learning applications, very intolerant to errors • what happens if we have a high false positive rate? • a high false negative rate?
  19. 19. Lack of training data… • what data to train the model on? • so hard to clean input data!
  20. 20. Hard to interpret the results/alerts… the “semantic gap” ok… I got the alert… why did I get the alert…?
  21. 21. The evaluation problem • devising a sound evaluation scheme is even more difficult than building the system itself • problems with relying on ML Anomaly Detection evaluations in academic research papers
  22. 22. Adversarial impact advanced actors can (and will) spend the time and effort to bypass the system
  23. 23. How have real world AD systems failed? • many false positives • hard to find attack-free training data • used without deep understanding • model-poisoning
  24. 24. So…is it hopeless?
  25. 25. Doing it! • generate time-series • select representative features • train/tune model of ‘normality’ • alert if incoming points deviate from model
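The four bullets above could be wired together roughly like this. It is a toy sketch with made-up features (requests/min, error ratio); the distance-from-centroid alert rule is an assumption for illustration, not the talk's exact method:

```python
# Minimal pipeline: time-series of feature vectors -> model of 'normality'
# (per-feature mean) -> alert when a new point deviates beyond a radius.
import math

def train(points):
    """Model 'normality' as (centroid, max training distance)."""
    n, dim = len(points), len(points[0])
    centroid = [sum(p[i] for p in points) / n for i in range(dim)]
    radius = max(math.dist(p, centroid) for p in points)
    return centroid, radius

def alert(point, model, slack=1.5):
    """Alert if the point lies well outside the training radius."""
    centroid, radius = model
    return math.dist(point, centroid) > slack * radius

# Time-series of (requests/min, error ratio) per window -- made-up data.
history = [(100, 0.01), (98, 0.02), (103, 0.01), (99, 0.03)]
model = train(history)

print(alert((101, 0.02), model))  # normal window  -> False
print(alert((400, 0.40), model))  # deviant window -> True
```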
  26. 26. Example infrastructure: Sensitivity of PCA for Traffic Anomaly Detection, Ringberg et al.
  27. 27. Common Techniques: density-based, subspace/correlation-based, support vector machines, clustering, neural networks
  28. 28. “Model”? • centroid clusters • good for “online learning”
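Why centroid clusters suit “online learning”: the centroid can be updated incrementally per point, with no need to store past traffic. A minimal sketch (illustrative class, not from the talk):

```python
# Incremental (online) centroid update: c += (x - c) / n after each point,
# so the model keeps only the running centroid and a count.
class OnlineCentroid:
    def __init__(self, dim):
        self.n = 0
        self.centroid = [0.0] * dim

    def update(self, point):
        """Fold one new observation into the running mean."""
        self.n += 1
        for i, x in enumerate(point):
            self.centroid[i] += (x - self.centroid[i]) / self.n

c = OnlineCentroid(2)
for p in [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]:
    c.update(p)
print(c.centroid)  # [2.0, 2.0]
```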
  29. 29. How to select features? • often ends up being the most challenging piece of the problem • isn’t it just a parameter optimization problem?
  30. 30. How to select features? Difficulties: • too many possible combinations to iterate! • hard to evaluate • the “optimal” set changes frequently • accuracy is not the only criterion: also improved model interpretability, shorter training times, enhanced generalization / reduced overfitting
  31. 31. Principal Component Analysis • a common statistical method to automatically select features How? • transforms the data into a new set of dimensions • returns an ordered list of dimensions (principal components) that best represent the data’s variance
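The two “How?” bullets can be sketched directly as a covariance eigendecomposition (illustrative data; a real system would use a library implementation):

```python
# PCA from scratch: center the data, eigendecompose the covariance matrix,
# and order components by the variance they capture.
import numpy as np

def pca(X):
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # returned in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort descending by variance
    return eigvals[order], eigvecs[:, order]

# Data stretched along y = x: the first principal component should point
# roughly along (1, 1) / sqrt(2).
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
variances, components = pca(X)
print(variances)         # first value dominates
print(components[:, 0])  # ~[0.7, 0.7] up to sign
```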
  32. 32. Principal Component Analysis projection
  33. 33. Principal Component Analysis
  34. 34. Principal Component Analysis true PCA result, maximize variance capture
  35. 35. Principal Component Analysis choose principal components that cover 80-90% of the dataset's variance
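The 80-90% rule on this slide amounts to a cumulative-variance cutoff; the eigenvalue spectrum below is hypothetical:

```python
# Keep the smallest number of principal components whose cumulative share
# of total variance reaches the target (e.g. 0.9 = 90%).
import numpy as np

def n_components(eigvals, target=0.9):
    """eigvals must be sorted descending; returns the component count."""
    share = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(share, target) + 1)

# Hypothetical eigenvalue spectrum, sorted descending.
eigvals = np.array([5.0, 2.5, 1.5, 0.7, 0.3])
print(n_components(eigvals))  # 3 components cover >= 90% of variance
```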
  36. 36. “Scree” Plot [chart: cumulative % variance capture vs. number of principal components used (log scale), with regions where PCA is more/less effective]
  37. 37. How to avoid common pitfalls? • Understand your threat model well • Keep the detection scope narrow • Reduce the cost of false negatives/positives
  39. 39. How good is my anomaly detector? how easily can you filter out false positives?
  40. 40. How good is my anomaly detector? evaluating true positives?
  41. 41. how do we attack this? the most important question…
  42. 42. How do we attack this? • manipulate the learning system to permit a specific attack • degrade the performance of the system to compromise its reliability
  43. 43. Chaff
  44. 44. Attacking PCA-based systems center before attack center after attack “chaff” attack direction decision boundary
  45. 45. Attacking PCA-based systems before attack after attack “chaff” no clear attack direction
  47. 47. Attacking PCA-based systems: chaff volume vs. injection period; to avoid detection, go slow!
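“Go slow” can be illustrated with a toy “boiling frog” schedule (assumed traffic rates and an assumed 3.5-sigma detector, not the talk's data): the same target rate that is flagged when injected at once evades a model that retrains on each period's absorbed chaff:

```python
# Boiling frog: ramp chaff up in small steps; each period's unflagged chaff
# is absorbed into the retrained model, widening the threshold.
import statistics

def alert_threshold(model, k=3.5):
    return statistics.mean(model) + k * statistics.stdev(model)

baseline = [100, 98, 103, 97, 101, 99, 102, 100]
target = 115  # the rate the attacker ultimately wants to reach

# One-shot injection at the target rate is flagged immediately.
print(target > alert_threshold(baseline))  # True -> detected

# Gradual injection, retraining after each period.
model = list(baseline)
evaded = True
for level in [105, 108, 111, 115]:
    if level > alert_threshold(model):
        evaded = False
        break
    model.append(level)  # unflagged chaff absorbed on retrain
print(evaded)  # True -> the same target rate now goes unflagged
```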
  48. 48. How do you defend against this? maintain a calibration test set
  49. 49. How do you defend against this? afterbefore
  50. 50. How do you defend against this? decision boundary ratio-detection
  51. 51. Attacking PCA-based systems before attack after attack decision boundary region
  52. 52. Can Machine Learning be secure? not easy to achieve for unsupervised, online learning; slowing adversaries down gives you time to detect when you’re being targeted
  53. 53. How do you defend against this? Improved PCA • Antidote • Principal component pursuit • Robust PCA
  54. 54. Robust statistics • use median instead of mean • PCA’s ‘variance’ maximization vs. Antidote’s ‘median absolute deviation’ • find an appropriate distribution that models your dataset: normal/Gaussian vs. Laplacian distributions • use robust PCA
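The median/MAD point can be demonstrated directly on toy data: one large chaff value drags the mean and standard deviation, while the median absolute deviation barely moves:

```python
# Robust statistics demo: stdev is sensitive to a single outlier, while
# MAD (median absolute deviation) is nearly unchanged by it.
import statistics

def mad(data):
    """Median absolute deviation from the median."""
    med = statistics.median(data)
    return statistics.median(abs(x - med) for x in data)

clean = [100, 98, 103, 97, 101, 99, 102, 100]
poisoned = clean + [500]  # one injected chaff point

print(statistics.stdev(clean), statistics.stdev(poisoned))  # ~2.0 vs ~133
print(mad(clean), mad(poisoned))                            # 1.5 vs 2.0
```

This is why an Antidote-style detector built on median/MAD is far harder to poison than one built on mean/variance.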
  55. 55. My own tests. I ran my own simulations with some real data… why did I do this?
  56. 56. [chart: projection onto “Target Flow” vs. projection on 1st principal component]
  57. 57. [chart: same axes, Naive PCA vs. Robust PCA]
  58. 58. [chart: projection onto “Target Flow” vs. projection on 1st principal component]
  59. 59. [chart: same axes] by the way, generating this chaff is hard
  60. 60. [chart: same axes, Robust PCA vs. Naive PCA]
  61. 61. [chart: same axes, Robust PCA vs. Naive PCA]
  62. 62. [chart: same axes, across 2/4/6/8/10 training periods]
  63. 63. [chart: projection onto “Target Flow” vs. projection on 1st principal component]
  64. 64. [chart: same axes, Robust PCA vs. Naive PCA]
  65. 65. [chart: same axes, Robust PCA vs. Naive PCA]
  66. 66. [ROC chart: poisoning detection rate (true positive rate) vs. false alarm rate (false positive rate), comparing: Random Detector; RPCA - No Poisoning; RPCA - Boiling Frog, 50% chaff over 10 training periods; RPCA - 30% chaff; RPCA - 50% chaff]
  67. 67. Evasion success rates [chart: evasion success rate (false negative rate) vs. attack duration (# of training periods), chaff injected 0%-50%; RPCA - Boiling Frog, 50% chaff spread over x periods, vs. RPCA - Naive Injection]
  68. 68. Naive Chaff Injection (50% injection, single training period): ~76% evasion success against Naive PCA, ~14% against Robust(er) PCA. Boiling Frog Injection (10 training periods): ~87% evasion success against Naive PCA, ~38% against Robust(er) PCA.
  69. 69. Anomaly detection systems today • not so good, but improving… • pure ML-based anomaly detectors are still vulnerable to compromise • use ML to find features and thresholds, then run streaming anomaly detection using static rules
  70. 70. What next? • do more tests on AD systems that others have created • explore other defenses against poisoning techniques • experiment with more resilient ML models
  71. 71. @cchio