Ember

ember
an open source
malware classifier and
dataset

whoami
Phil Roth
Data Scientist
@mrphilroth
proth@endgame.com
Learned ML at IceCube
Applying it at Endgame

whoami
Hyrum Anderson
Technical Director of Data Science
@drhyrum

Open datasets push ML research
forward
source: https://twitter.com/benhamner/status/938123380074610688
Datasets cited in NIPS papers over time

One example: MNIST
MNIST: http://yann.lecun.com/exdb/mnist/
Database of 70k (60k/10k
training/test split) images of
handwritten digits
“MNIST is the new unit test” –Ian
Goodfellow
Even when the dataset can no
longer effectively measure
performance improvements, it’s
still useful as a sanity check.

Another example: CIFAR 10/100
CIFAR-10:
Database of 60k (50k/10k training/test
split) images of 10 different classes
CIFAR-100:
60k images of 100 different classes
CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html

Security lacks these datasets
2014 Corporate Blog
2015 RSA FloorTalk

Reasons security lacks these
datasets
Personally identifiable information
Communicating vulnerabilities to attackers
Intellectual property

Existing Security Datasets
Mike Sconzo’s http://www.secrepo.com/

DGA Detection
Domain generation algorithms create large numbers of domain names to serve as
rendezvous for C&C servers.
Datasets available:
Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/
DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
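To illustrate why DGA output is hard to block with a static list, here is a minimal sketch of a toy generator. It is not modeled on any real malware family: the seed string, the SHA-256 hash, the 12-character label, and the `.com` suffix are all assumptions made for this example.

```python
import hashlib
from datetime import date

def toy_dga(seed, day, count=5):
    """Derive `count` pseudo-random domains from a shared seed and a date."""
    domains = []
    for i in range(count):
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex characters to the letters a-p to form a label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + ".com")
    return domains

# Hypothetical seed; malware and its operator would share this value.
domains = toy_dga("example-seed", date(2018, 4, 1))
```

Because both sides can compute the same list for a given day, the operator only needs to register one of the names, while a defender has to anticipate all of them.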

Network Intrusion Detection
Unsupervised learning problem looking for anomalous network events. (To me, this
turns into an alert ordering problem)
Datasets available:
DARPA Datasets:
https://www.ll.mit.edu//ideval/data/1998data.html
https://www.ll.mit.edu//ideval/data/1999data.html
KDD Cup 1999:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
OLD!!!!

Static Classification of Malware
Basically the antivirus problem solved with machine learning.
Datasets available:
Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/
VirusShare [Malicious Only]: https://virusshare.com/
Microsoft Malware Challenge [Malicious Only. Headers Stripped]:
https://www.kaggle.com/c/malware-classification

Benign and malicious samples can
be distributed in a feature space
(using attributes like file size and
number of imports)
Goal is to predict samples that we
haven’t seen yet

A YARA rule can divide these two
classes. But a simple rule won’t be
generalizable.

A machine learning model can
define a better boundary that
makes more accurate predictions
There are so many options for
machine learning algorithms. How
do we know which one is best?

Endgame Malware BEnchmark for Research
“MNIST for malware”
ember

“I know... But, if I tried to avoid
the name of every Javascript
framework, there wouldn’t be
any names left.”

Endgame Malware BEnchmark for Research
An open source collection of 1.1 million PE File sha256 hashes that were
scanned by VirusTotal sometime in 2017.
The dataset includes metadata, derived features from the PE files, a model
trained on those features, and accompanying code.
It does NOT include the files themselves.
ember

The dataset is divided into a 900k training set and a
200k testing set
The training set includes 300k each of benign, malicious, and
unlabeled samples
data

Training set data appears
chronologically prior to the test data
Date metadata allows:
• Chronological cross validation
• Quantifying model performance
degradation over time
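One way to use the date metadata for chronological cross validation is sketched below. It assumes each record carries an `appeared` value in `YYYY-MM` form; the function name `chronological_folds` and the block-splitting scheme are hypothetical choices for illustration, not part of the ember code.

```python
from collections import defaultdict

def chronological_folds(records, n_folds=3):
    """Yield (train, validation) lists where validation is always later than train."""
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec["appeared"]].append(rec)
    months = sorted(by_month)  # "YYYY-MM" strings sort chronologically
    # Split the ordered months into n_folds + 1 contiguous blocks.
    block = max(1, len(months) // (n_folds + 1))
    for k in range(1, n_folds + 1):
        train_months = months[: k * block]
        valid_months = months[k * block : (k + 1) * block]
        if not valid_months:
            break
        train = [r for m in train_months for r in by_month[m]]
        valid = [r for m in valid_months for r in by_month[m]]
        yield train, valid

# Synthetic records for illustration only; real lines carry many more keys.
records = [{"sha256": f"{i:064x}", "appeared": f"2017-{m:02d}", "label": i % 2}
           for i, m in enumerate([1, 1, 2, 3, 4, 5, 6, 7, 8])]
folds = list(chronological_folds(records, n_folds=3))
```

Scoring a model on each validation block in turn also gives a rough picture of how performance changes as the validation data moves further from the training period.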
[Figure: timeline with the train period followed by the test period]
data

7 JSON Lines files containing extracted features
data
[proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2
-rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2
[proth@proth-mbp data]$ cd ember
[proth@proth-mbp ember]$ ls -lh
total 9.2G
-rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl
-rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl

The first three keys of each line are metadata
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
{
"sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
"appeared": "2006-12",
"label": 0,

The rest of the keys are feature categories
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256,
.appeared, .label)" | jq "keys"
[
"byteentropy",
"exports",
"general",
"header",
"histogram",
"imports",
"section",
"strings"
]
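The same separation of metadata from feature categories can be done in Python. This is a minimal sketch: the helper name `split_record` is hypothetical, and the sample line is shortened and made up, with empty placeholders standing in for real feature values.

```python
import json

METADATA_KEYS = ("sha256", "appeared", "label")

def split_record(line):
    """Parse one JSON line into (metadata, raw_features) dictionaries."""
    record = json.loads(line)
    metadata = {k: record[k] for k in METADATA_KEYS}
    features = {k: v for k, v in record.items() if k not in METADATA_KEYS}
    return metadata, features

# A shortened, made-up line: real records hold all eight feature categories.
sample_line = json.dumps({
    "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
    "appeared": "2006-12",
    "label": 0,
    "general": {},
    "strings": {},
})
metadata, features = split_record(sample_line)
```

Since each file is several hundred megabytes or more, iterating over the open file object one line at a time avoids loading a whole file into memory.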

features
Two kinds of features:
Calculated from raw bytes
Calculated from LIEF parsing of
the PE file format
https://lief.quarkslab.com/
https://lief.quarkslab.com/doc/Intro.html
https://github.com/lief-project/LIEF

features
Raw features are calculated from
the bytes and the LIEF object
Vectorized features are calculated
from the raw features

features
• Byte Histogram (histogram)
A simple counting of how many times each byte occurs
• Byte Entropy Histogram (byteentropy)
Sliding window entropy calculation
Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
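The two byte-level calculations can be sketched as follows. This is a simplified illustration, not the ember implementation: the window and step sizes are illustrative defaults, and the final step from the cited paper, which bins entropy and byte values jointly into a two-dimensional histogram, is omitted.

```python
import math
from collections import Counter

def byte_histogram(data):
    """Count how many times each of the 256 byte values occurs."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]

def shannon_entropy(chunk):
    """Shannon entropy of a byte string, in bits per byte (0 to 8)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in Counter(chunk).values())

def window_entropies(data, window=2048, step=1024):
    """Entropy of each sliding window; window and step sizes are illustrative."""
    if len(data) <= window:
        return [shannon_entropy(data)]
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)]

# Synthetic input: a high-entropy region followed by a run of zero bytes.
data = bytes(range(256)) * 16 + b"\x00" * 4096
hist = byte_histogram(data)
entropies = window_entropies(data)
```

On this synthetic input the first window is maximally varied and the last is constant, which is the kind of contrast (packed or encrypted regions versus padding) the entropy feature is meant to expose.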

features
• Section Information (section)
Entry section and a list of all sections with name, size, entropy, and other information
given for each

features
• Import Information (imports)
Each library imported from along with imported function names
• Export Information (exports)
Exported function names

features
• String Information (strings)
Number of strings, average length, character histogram, number of strings that
match various patterns like URLs, MZ header, or registry keys
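A rough sketch of this kind of string summary is shown below. The minimum string length of five, the regular expressions, and the output key names are assumptions for illustration; the character histogram is left out for brevity.

```python
import re

# Runs of at least five printable ASCII characters; the minimum length is an assumption.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{5,}")
URL_PATTERN = re.compile(rb"https?://", re.IGNORECASE)
REGISTRY_PATTERN = re.compile(rb"HKEY_")
MZ_PATTERN = re.compile(rb"MZ")

def string_features(data):
    """Summarize printable strings and simple pattern counts in raw bytes."""
    strings = PRINTABLE_RUN.findall(data)
    lengths = [len(s) for s in strings]
    return {
        "numstrings": len(strings),
        "avlength": sum(lengths) / len(lengths) if lengths else 0.0,
        "urls": len(URL_PATTERN.findall(data)),
        "registry": len(REGISTRY_PATTERN.findall(data)),
        "MZ": len(MZ_PATTERN.findall(data)),
    }

# Made-up bytes containing one URL, one registry key, and one MZ marker.
blob = b"MZ\x90\x00\x03http://example.com/a\x00\x01HKEY_LOCAL_MACHINE\\Software\x00ab\x00"
feats = string_features(blob)
```

Only counts and averages are kept, not the strings themselves, which fits the goal of sharing features without sharing file contents.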

features
• General Information (general)
Number of imports, exports, symbols and whether the file has relocations,
resources, or a signature

features
• Header Information (header)
Details about the machine the file was compiled on, plus versions of linkers, images,
and operating system, etc…

vectorization
After downloading the dataset, feature vectorization is a necessary
step before model training
The ember codebase defines how each feature is hashed into a
vector using scikit-learn tools (the FeatureHasher class)
Feature vectorizing took 20 hours on my 2015 MacBook Pro i7
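The idea behind feature hashing can be sketched without any dependencies. This is a stand-in for illustration, not the ember or scikit-learn code: scikit-learn's FeatureHasher uses MurmurHash3, while this sketch substitutes SHA-256 from the standard library, and the `library:function` token format and vector size of 16 are assumptions.

```python
import hashlib

def hash_features(tokens, n_features=16):
    """Map a list of string tokens to a fixed-length vector via the hashing trick."""
    vector = [0.0] * n_features
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "little")
        index = value % n_features
        # Use one hash bit as a sign so collisions tend to cancel rather than pile up.
        sign = 1.0 if (value >> 63) & 1 == 0 else -1.0
        vector[index] += sign
    return vector

# Hypothetical import tokens in a "library:function" format.
imports = ["kernel32.dll:CreateFileA", "kernel32.dll:WriteFile", "user32.dll:MessageBoxA"]
vec = hash_features(imports)
```

The benefit is that variable-length lists such as imported function names become a vector of fixed size, at the cost of occasional collisions between unrelated tokens.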

model
Gradient Boosted Decision Tree model trained with
LightGBM on labeled samples
Model training took 3 hours on my 2015 MacBook
Pro i7
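import lightgbm as lgb
from ember import read_vectorized_features

# data_dir: path to the directory holding the vectorized ember dataset
X_train, y_train = read_vectorized_features(data_dir, subset="train")
# Keep only labeled rows; unlabeled samples carry the label -1
train_rows = (y_train != -1)
lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
# "application" is a LightGBM alias for the "objective" parameter
lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)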

model
Ember Model Performance:
ROC AUC: 0.9991123269999999
Threshold: 0.871
False Positive Rate: 0.099%
False Negative Rate: 7.009%
Detection Rate: 92.991%
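The rates above relate to each other in a simple way: the detection rate is the complement of the false negative rate (100% - 7.009% = 92.991%). A minimal sketch of computing these rates at a fixed threshold follows. The scores and labels are synthetic, and treating a score at or above the threshold as malicious is an assumption.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false positive rate, false negative rate) for a score threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return fp / negatives, fn / positives

# Synthetic scores for illustration; these are not ember results.
scores = [0.05, 0.20, 0.90, 0.95, 0.60, 0.99, 0.10, 0.88]
labels = [0,    0,    0,    1,    1,    1,    0,    1]
fpr, fnr = rates_at_threshold(scores, labels, threshold=0.871)
detection_rate = 1.0 - fnr
```

Raising the threshold trades a lower false positive rate for a higher false negative rate, which is why the threshold is reported alongside both numbers.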

disclaimer
This model is NOT MalwareScore
MalwareScore:
is better optimized
has better features
performs better
is constantly updated with new data
is the best option for protecting your endpoints (in my totally biased opinion)

code
https://github.com/endgameinc/ember
The ember repo makes
it easy to:
• Vectorize features
• Train the model
• Make predictions on
new PE files

notebook
The Jupyter notebook will
reproduce the graphics from
this talk from the extracted
dataset

suggestions
To beat the benchmark model performance:
Use feature selection techniques to eliminate misleading features
Do feature engineering to find better features
Optimize LightGBM model parameters with grid search
Incorporate information from unlabeled samples into training

suggestions
To further research in the field of ML for static malware
detection:
Quantify model performance degradation through time
Build and compare the performance of featureless neural network
based models (need independent access to samples)
An adversarial network could create or modify PE files to bypass
ember model classification

ember
Highlight: “Evidently, despite increased model size and computational
burden, featureless deep learning models have yet to eclipse the
performance of models that leverage domain knowledge via parsed
features.”
Read the paper:
https://arxiv.org/abs/1804.04637

ember
Download the data:
https://pubdata.endgame.com/ember/ember_dataset.tar.bz2
Download the code:
https://github.com/endgameinc/ember
THANK YOU!
Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum
