ember
an open source
malware classifier and
dataset
whoami
Phil Roth
Data Scientist
@mrphilroth
proth@endgame.com
Learned ML at IceCube
Applying it at Endgame
whoami
Hyrum Anderson
Technical Director of Data Science
@drhyrum
Open datasets push ML research
forward
source: https://twitter.com/benhamner/status/938123380074610688
Datasets cited in NIPS papers over time
One example: MNIST
MNIST: http://yann.lecun.com/exdb/mnist/
Database of 70k (60k/10k
training/test split) images of
handwritten digits
“MNIST is the new unit test” –Ian
Goodfellow
Even when the dataset can no
longer effectively measure
performance improvements, it’s
still useful as a sanity check.
Another example: CIFAR 10/100
CIFAR-10:
Database of 60k (50k/10k training/test
split) images of 10 different classes
CIFAR-100:
60k images of 100 different classes
CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
Security lacks these datasets
2014 Corporate Blog
2015 RSA FloorTalk
Reasons security lacks these
datasets
Personally identifiable information
Communicating vulnerabilities to attackers
Intellectual property
Existing Security Datasets
Mike Sconzo’s SecRepo: http://www.secrepo.com/
DGA Detection
Domain generation algorithms create large numbers of domain names to serve as
rendezvous for C&C servers.
Datasets available:
Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/
DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
Network Intrusion Detection
Unsupervised learning problem looking for anomalous network events. (To me, this
turns into an alert ordering problem)
Datasets available:
DARPA Datasets:
https://www.ll.mit.edu//ideval/data/1998data.html
https://www.ll.mit.edu//ideval/data/1999data.html
KDD Cup 1999:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
OLD!!!!
Static Classification of Malware
Basically the antivirus problem solved with machine learning.
Datasets available:
Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/
VirusShare [Malicious Only]: https://virusshare.com/
Microsoft Malware Challenge [Malicious Only. Headers Stripped]:
https://www.kaggle.com/c/malware-classification
Static Classification of Malware
Benign and malicious samples can
be distributed in a feature space
(using attributes like file size and
number of imports)
Goal is to predict samples that we
haven’t seen yet
Static Classification of Malware
A YARA rule can divide these two
classes. But a simple rule won’t be
generalizable.
Static Classification of Malware
A machine learning model can
define a better boundary that
makes more accurate predictions
There are so many options for
machine learning algorithms. How
do we know which one is best?
Endgame Malware BEnchmark for Research
“MNIST for malware”
ember
“I know... But, if I tried to avoid
the name of every Javascript
framework, there wouldn’t be
any names left.”
Endgame Malware BEnchmark for Research
An open source collection of 1.1 million PE file SHA-256 hashes that were
scanned by VirusTotal sometime in 2017.
The dataset includes metadata, derived features from the PE files, a model
trained on those features, and accompanying code.
It does NOT include the files themselves.
ember
The dataset is divided into a 900k training set and a
200k testing set
Training set includes 300k each of benign, malicious, and
unlabeled samples
data
Training set data appears
chronologically prior to the test data
Date metadata allows:
• Chronological cross validation
• Quantifying model performance
degradation over time
[Figure legend: train, test]
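A minimal sketch of how the date metadata could drive a chronological split, assuming each record carries an `appeared` key in `YYYY-MM` form. The helper name `chronological_split`, the sample records, and the cutoff value are illustrative and not part of the ember codebase.

```python
import json

def chronological_split(lines, cutoff):
    """Split JSON-lines records into (before, on_or_after) by 'appeared' month.

    'YYYY-MM' strings sort chronologically, so plain string comparison works.
    """
    before, after = [], []
    for line in lines:
        record = json.loads(line)
        (before if record["appeared"] < cutoff else after).append(record)
    return before, after

# Hypothetical records that mimic the metadata keys in the dataset.
sample_lines = [
    '{"sha256": "a", "appeared": "2017-01", "label": 0}',
    '{"sha256": "b", "appeared": "2017-11", "label": 1}',
    '{"sha256": "c", "appeared": "2017-12", "label": -1}',
]
train, test = chronological_split(sample_lines, cutoff="2017-11")
print(len(train), len(test))
```

Moving the cutoff forward one month at a time gives the folds for a chronological cross validation.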
data
7 JSON Lines files containing extracted features
data
[proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2
-rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2
[proth@proth-mbp data]$ cd ember
[proth@proth-mbp ember]$ ls -lh
total 9.2G
-rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl
-rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl
-rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl
-rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
The first three keys of each line are metadata
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
{
"sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
"appeared": "2006-12",
"label": 0,
The rest of the keys are feature categories
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys"
[
"byteentropy",
"exports",
"general",
"header",
"histogram",
"imports",
"section",
"strings"
]
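The same separation of metadata from feature categories can be sketched in Python. This is an illustration only: the function name `split_record` is hypothetical, and the sample line is a heavily truncated stand-in for a real record.

```python
import json

METADATA_KEYS = ("sha256", "appeared", "label")

def split_record(line):
    """Return (metadata, features) dicts for one JSON line."""
    record = json.loads(line)
    metadata = {k: record[k] for k in METADATA_KEYS}
    features = {k: v for k, v in record.items() if k not in METADATA_KEYS}
    return metadata, features

# Hypothetical, truncated record; real lines carry all eight feature categories.
line = '{"sha256": "0abb", "appeared": "2006-12", "label": 0, "general": {}, "strings": {}}'
metadata, features = split_record(line)
print(metadata["appeared"], sorted(features))
```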
features
Two kinds of features:
Calculated from raw bytes
Calculated from lief parsing
the PE file format
https://lief.quarkslab.com/
https://lief.quarkslab.com/doc/Intro.html
https://github.com/lief-project/LIEF
features
Raw features are calculated from
the bytes and the lief object
Vectorized features are calculated
from the raw features
features
• Byte Histogram (histogram)
A simple counting of how many times each byte occurs
• Byte Entropy Histogram (byteentropy)
Sliding window entropy calculation
Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
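A rough sketch of the two raw-byte calculations, using NumPy. The function names, the window size, and the step size are assumptions chosen for illustration. The full byte entropy histogram described by Saxe and Berlin goes further and bins each window's entropy together with its byte values, which is not shown here.

```python
import numpy as np

def byte_histogram(data: bytes) -> np.ndarray:
    """Count how many times each of the 256 byte values occurs."""
    return np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)

def window_entropies(data: bytes, window: int = 2048, step: int = 1024) -> list:
    """Shannon entropy (in bits) of each sliding window over the bytes."""
    arr = np.frombuffer(data, dtype=np.uint8)
    entropies = []
    for start in range(0, max(len(arr) - window, 0) + 1, step):
        counts = np.bincount(arr[start:start + window], minlength=256)
        probs = counts[counts > 0] / counts.sum()
        entropies.append(float(-(probs * np.log2(probs)).sum()))
    return entropies

# Toy input: every byte value repeated evenly, so the entropy is maximal.
uniform = bytes(range(256)) * 8
print(byte_histogram(uniform)[:4], window_entropies(uniform))
```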
features
• Section Information (section)
Entry section and a list of all sections with name, size, entropy, and other information
given for each
features
• Import Information (imports)
Each library imported from along with imported function names
• Export Information (exports)
Exported function names
features
• String Information (strings)
Number of strings, average length, character histogram, number of strings that
match various patterns like URLs, MZ header, or registry keys
features
• General Information (general)
Number of imports, exports, symbols and whether the file has relocations,
resources, or a signature
features
• Header Information (header)
Details about the machine the file was compiled on, plus versions of linkers, images,
and operating system, etc.
vectorization
After downloading the dataset, feature vectorization is a necessary
step before model training
The ember codebase defines how each feature is hashed into a
vector using scikit-learn tools (the FeatureHasher class)
Feature vectorizing took 20 hours on my 2015 MacBook Pro i7
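To make the hashing step concrete, here is a small sketch with scikit-learn's `FeatureHasher`. The raw import dictionary, the `library:function` token format, and the choice of 256 output dimensions are assumptions for illustration. The ember codebase defines its own layout and sizes for each feature category.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical raw import feature: library name -> imported function names.
raw_imports = {
    "KERNEL32.dll": ["CreateFileA", "ReadFile"],
    "USER32.dll": ["MessageBoxA"],
}

# Flatten to "library:function" tokens, then hash into a fixed-length vector.
tokens = [f"{lib.lower()}:{func}" for lib, funcs in raw_imports.items() for func in funcs]
hasher = FeatureHasher(n_features=256, input_type="string")
vector = hasher.transform([tokens]).toarray()[0]
print(vector.shape)
```

Hashing keeps the vector length fixed no matter how many distinct libraries or functions appear across the dataset, at the cost of occasional collisions.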
model
Gradient Boosted Decision Tree model trained with
LightGBM on labeled samples
Model training took 3 hours on my 2015 MacBook
Pro i7
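import lightgbm as lgb
from ember import read_vectorized_features  # helper from the ember repo

# data_dir: path to the extracted and vectorized ember dataset
X_train, y_train = read_vectorized_features(data_dir, subset="train")
train_rows = (y_train != -1)  # keep labeled rows only (unlabeled rows are -1)
lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)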
model
Ember Model Performance:
ROC AUC: 0.99911
Threshold: 0.871
False Positive Rate: 0.099%
False Negative Rate: 7.009%
Detection Rate: 92.991%
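The rates above are tied to the chosen threshold, and the detection rate is the complement of the false negative rate (92.991% = 100% - 7.009%). A minimal sketch of that arithmetic follows. The function name and the scores are hypothetical and are not real model output.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) for binary labels.

    A sample is flagged malicious when its score is >= threshold.
    """
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return fp / negatives, fn / positives

# Hypothetical scores (label 0 = benign, 1 = malicious).
scores = [0.10, 0.95, 0.40, 0.90, 0.99, 0.30]
labels = [0, 0, 0, 1, 1, 1]
fpr, fnr = rates_at_threshold(scores, labels, threshold=0.871)
detection_rate = 1.0 - fnr
print(fpr, fnr, detection_rate)
```

Raising the threshold trades false positives for false negatives, so the two rates are only meaningful when reported together with the threshold.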
disclaimer
This model is NOT MalwareScore
MalwareScore:
is better optimized
has better features
performs better
is constantly updated with new data
is the best option for protecting your endpoints (in my totally biased opinion)
code
https://github.com/endgameinc/ember
The ember repo makes
it easy to:
• Vectorize features
• Train the model
• Make predictions on
new PE files
notebook
The Jupyter notebook will
reproduce the graphics from
this talk from the extracted
dataset
suggestions
To beat the benchmark model performance:
Use feature selection techniques to eliminate misleading features
Do feature engineering to find better features
Optimize LightGBM model parameters with grid search
Incorporate information from unlabeled samples into training
suggestions
To further research in the field of ML for static malware
detection:
Quantify model performance degradation through time
Build and compare the performance of featureless neural network
based models (need independent access to samples)
An adversarial network could create or modify PE files to bypass
ember model classification
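As one way to approach the first suggestion, detection rate can be grouped by the month a sample appeared. This is a sketch under assumptions: the `score` field is a hypothetical per-sample model output added to the dataset's metadata keys, and the records are made up.

```python
from collections import defaultdict

def detection_rate_by_month(records, threshold):
    """Map 'appeared' month -> fraction of malicious samples scored >= threshold."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        if rec["label"] != 1:
            continue  # only malicious samples count toward detection rate
        totals[rec["appeared"]] += 1
        hits[rec["appeared"]] += rec["score"] >= threshold
    return {month: hits[month] / totals[month] for month in sorted(totals)}

# Hypothetical records: dataset metadata plus an assumed per-sample "score".
records = [
    {"appeared": "2017-11", "label": 1, "score": 0.99},
    {"appeared": "2017-11", "label": 1, "score": 0.95},
    {"appeared": "2017-12", "label": 1, "score": 0.97},
    {"appeared": "2017-12", "label": 1, "score": 0.20},
    {"appeared": "2017-12", "label": 0, "score": 0.05},
]
by_month = detection_rate_by_month(records, threshold=0.871)
print(by_month)
```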
demo time!
ember
Highlight: “Evidently, despite increased model size and computational
burden, featureless deep learning models have yet to eclipse the
performance of models that leverage domain knowledge via parsed
features.”
Read the paper:
https://arxiv.org/abs/1804.04637
ember
Download the data:
https://pubdata.endgame.com/ember/ember_dataset.tar.bz2
Download the code:
https://github.com/endgameinc/ember
THANK YOU!
Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum

Ember

  • 1. ember an open source malware classifier and dataset
  • 3. whoami Hyrum Anderson Technical Director of Data Science @drhyrum
  • 4. Open datasets push ML research forward source: https://twitter.com/benhamner/status/938123380074610688 Datasets cited in NIPS papers over time
  • 5. One example: MNIST MNIST: http://yann.lecun.com/exdb/mnist/ Database of 70k (60k/10k training/test split) images of handwritten digits “MNIST is the new unit test” –Ian Goodfellow Even when the dataset can no longer effectively measure performance improvements, it’s still useful as a sanity check.
  • 6. Another example: CIFAR 10/100 CIFAR-10: Database of 60k (50k/10k training/test split) images of 10 different classes CIFAR-100: 60k images of 100 different classes CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
  • 7. Security lacks these datasets 2014 Corporate Blog 2015 RSA FloorTalk
  • 8. Reasons security lacks these datasets Personally identifiable information Communicating vulnerabilities to attackers Intellectual property
  • 10. DGA Detection Domain generation algorithms create large numbers of domain names to serve as rendezvous for C&C servers. Datasets available: Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/ DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
  • 11. Network Intrusion Detection Unsupervised learning problem looking for anomalous network events. (To me, this turns into an alert ordering problem) Datasets available: DARPA Datasets: https://www.ll.mit.edu//ideval/data/1998data.html https://www.ll.mit.edu//ideval/data/1999data.html KDD Cup 1999: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html OLD!!!!
  • 12. Static Classification of Malware Basically the antivirus problem solved with machine learning. Datasets available: Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/ VirusShare [Malicious Only]: https://virusshare.com/ Microsoft Malware Challenge [Malicious Only. Headers Stripped]: https://www.kaggle.com/c/malware-classification
  • 13. Static Classification of Malware Benign and malicious samples can be distributed in a feature space (using attributes like file size and number of imports). The goal is to classify samples that we haven’t seen yet.
  • 14. Static Classification of Malware A YARA rule can divide these two classes. But a simple rule won’t be generalizable.
  • 15. Static Classification of Malware A machine learning model can define a better boundary that makes more accurate predictions. There are so many options for machine learning algorithms. How do we know which one is best?
  • 16. Endgame Malware BEnchmark for Research “MNIST for malware” ember
  • 17. “I know... But, if I tried to avoid the name of every Javascript framework, there wouldn’t be any names left.”
  • 18. Endgame Malware BEnchmark for Research An open source collection of 1.1 million PE File sha256 hashes that were scanned by VirusTotal sometime in 2017. The dataset includes metadata, derived features from the PE files, a model trained on those features, and accompanying code. It does NOT include the files themselves. ember
  • 19. data The dataset is divided into a 900k training set and a 200k testing set. The training set includes 300k each of benign, malicious, and unlabeled samples.
  • 20. data Training set data appears chronologically prior to the test data. Date metadata allows: • Chronological cross validation • Quantifying model performance degradation over time
  • 21. data 7 JSON line files containing extracted features
        [proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2
        -rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2
        [proth@proth-mbp data]$ cd ember
        [proth@proth-mbp ember]$ ls -lh
        total 9.2G
        -rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl
        -rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl
        -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl
        -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl
        -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl
        -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl
        -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
  • 22. data First three keys of each line are metadata
        [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
        {
          "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
          "appeared": "2006-12",
          "label": 0,
  • 23. data The rest of the keys are feature categories
        [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys"
        [
          "byteentropy",
          "exports",
          "general",
          "header",
          "histogram",
          "imports",
          "section",
          "strings"
        ]
  • 24. features Two kinds of features: • Calculated from raw bytes • Calculated from lief parsing the PE file format https://lief.quarkslab.com/ https://lief.quarkslab.com/doc/Intro.html https://github.com/lief-project/LIEF
  • 25. features Raw features are calculated from the bytes and the lief object. Vectorized features are calculated from the raw features.
  • 26. features • Byte Histogram (histogram) A simple counting of how many times each byte occurs • Byte Entropy Histogram (byteentropy) Sliding window entropy calculation Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
  • 27. features • Section Information (section) Entry section and a list of all sections with name, size, entropy, and other information given for each
  • 28. features • Import Information (imports) Each library imported from along with imported function names • Export Information (exports) Exported function names
  • 29. features • String Information (strings) Number of strings, average length, character histogram, number of strings that match various patterns like URLs, MZ header, or registry keys
  • 30. features • General Information (general) Number of imports, exports, symbols and whether the file has relocations, resources, or a signature
  • 31. features • Header Information (header) Details about the machine the file was compiled on. Versions of linkers, images, and operating system, etc.
  • 32. vectorization After downloading the dataset, feature vectorization is a necessary step before model training. The ember codebase defines how each feature is hashed into a vector using scikit-learn tools (the FeatureHasher class). Feature vectorizing took 20 hours on my 2015 MacBook Pro i7.
  • 33. model Gradient Boosted Decision Tree model trained with LightGBM on labeled samples. Model training took 3 hours on my 2015 MacBook Pro i7.
        import lightgbm as lgb
        from ember import read_vectorized_features
        # data_dir: path to the extracted, vectorized dataset
        X_train, y_train = read_vectorized_features(data_dir, subset="train")
        train_rows = (y_train != -1)  # keep only labeled samples
        lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
        lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)
  • 34. model Ember Model Performance: ROC AUC: 0.9991123269999999 Threshold: 0.871 False Positive Rate: 0.099% False Negative Rate: 7.009% Detection Rate: 92.991%
  • 35. disclaimer This model is NOT MalwareScore. MalwareScore: • is better optimized • has better features • performs better • is constantly updated with new data • is the best option for protecting your endpoints (in my totally biased opinion)
  • 36. code https://github.com/endgameinc/ember The ember repo makes it easy to: • Vectorize features • Train the model • Make predictions on new PE files
  • 37. notebook The Jupyter notebook will reproduce the graphics from this talk from the extracted dataset
  • 38. suggestions To beat the benchmark model performance: • Use feature selection techniques to eliminate misleading features • Do feature engineering to find better features • Optimize LightGBM model parameters with grid search • Incorporate information from unlabeled samples into training
  • 39. suggestions To further research in the field of ML for static malware detection: • Quantify model performance degradation through time • Build and compare the performance of featureless neural network based models (need independent access to samples) • An adversarial network could create or modify PE files to bypass ember model classification
  • 41. ember Highlight: “Evidently, despite increased model size and computational burden, featureless deep learning models have yet to eclipse the performance of models that leverage domain knowledge via parsed features.” Read the paper: https://arxiv.org/abs/1804.04637
  • 43. ember Download the data: https://pubdata.endgame.com/ember/ember_dataset.tar.bz2 Download the code: https://github.com/endgameinc/ember THANK YOU! Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum
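
Slide 20 says the date metadata allows chronological cross validation. A minimal sketch of one such split follows, assuming records shaped like the JSON line on slide 22 (keys `sha256`, `appeared`, `label`). The helper name `chronological_split`, the cutoff month, and the sample records are made up for illustration and are not part of the ember codebase.

```python
import json


def chronological_split(jsonl_lines, cutoff):
    """Split records into (earlier, later) by their "appeared" month.

    "appeared" is a "YYYY-MM" string, so plain string comparison
    orders the records chronologically.
    """
    earlier, later = [], []
    for line in jsonl_lines:
        record = json.loads(line)
        if record["appeared"] < cutoff:
            earlier.append(record)
        else:
            later.append(record)
    return earlier, later


# Hypothetical records: metadata keys only, with made-up hashes.
# Label -1 is the value the slide 33 code filters out before training.
sample_lines = [
    json.dumps({"sha256": "aa", "appeared": "2006-12", "label": 0}),
    json.dumps({"sha256": "bb", "appeared": "2017-03", "label": 1}),
    json.dumps({"sha256": "cc", "appeared": "2017-11", "label": -1}),
    json.dumps({"sha256": "dd", "appeared": "2017-12", "label": 1}),
]

train, holdout = chronological_split(sample_lines, cutoff="2017-11")
print([r["sha256"] for r in train])
print([r["sha256"] for r in holdout])
```

Repeating the split with successive cutoff months would give a sequence of train/holdout pairs, which is one way to measure how a model degrades on newer samples.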
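
Slide 26 describes the byte histogram as a simple count of how many times each byte occurs. The sketch below shows that count plus a whole-buffer Shannon entropy, using only the standard library. It is a simplification: the byte entropy histogram on the slide uses a sliding window, which this example does not implement. The function names are hypothetical and not taken from the ember repo.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Return a 256-element list: how many times each byte value occurs."""
    counts = Counter(data)  # iterating over bytes yields ints 0..255
    return [counts.get(value, 0) for value in range(256)]


def shannon_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


sample = b"MZ\x90\x00\x03\x00\x00\x00"  # made-up bytes, not a real PE file
hist = byte_histogram(sample)
print(hist[0x00], hist[0x4D], sum(hist))
print(shannon_entropy(sample))
```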
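
Slide 34 reports a false positive rate, false negative rate, and detection rate at a score threshold. The sketch below shows how those three numbers relate, using made-up scores and labels rather than ember results. `rates_at_threshold` is a hypothetical helper, and flagging a sample when its score is at or above the threshold is an assumed convention. Detection rate is one minus the false negative rate, which matches the slide: 92.991% + 7.009% = 100%.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate, detection_rate).

    labels: 1 = malicious, 0 = benign. A sample is flagged as malicious
    when its score is >= threshold (assumed convention).
    """
    pairs = list(zip(scores, labels))
    false_pos = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    false_neg = sum(1 for s, y in pairs if y == 1 and s < threshold)
    n_benign = sum(1 for _, y in pairs if y == 0)
    n_malicious = sum(1 for _, y in pairs if y == 1)
    fpr = false_pos / n_benign
    fnr = false_neg / n_malicious
    return fpr, fnr, 1.0 - fnr


# Made-up model scores: four benign samples, then four malicious ones.
scores = [0.05, 0.20, 0.90, 0.40, 0.95, 0.99, 0.60, 0.88]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

fpr, fnr, detection = rates_at_threshold(scores, labels, threshold=0.871)
print(fpr, fnr, detection)
```

Raising the threshold trades false positives for false negatives, so a reported detection rate is only meaningful alongside the false positive rate it was measured at.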