Battista Biggio @ AISec 2014 - Poisoning Behavioral Malware Clustering

Pattern Recognition and Applications Lab
Poisoning Behavioral Malware Clustering
University of Cagliari, Italy
Department of Electrical and Electronic Engineering
Battista Biggio (1), Konrad Rieck (2), Davide Ariu (1), Christian Wressnegger (2), Igino Corona (1), Giorgio Giacinto (1), and Fabio Roli (1)
(1) University of Cagliari (IT)
(2) University of Goettingen (GE)
AISec 2014, Scottsdale, Arizona, US, Nov. 7, 2014

Threats and Attacks in Computer Security
• Huge number of devices, services and apps on the Internet
– Vulnerabilities in code, services, apps, etc.
• Attacks through malicious software (malware)
– Botnets, spam, identity theft / stolen credit card numbers
• Manual analysis and crafting of signatures are costly
– Need for automated / assisted detection (and rule generation)
– Machine learning-based defenses (data clustering)
http://pralab.diee.unica.it
[Figure: Evasion: malware families / variants (+65% new malware variants from 2012 to 2013; Mobile Adware & Malw. Analysis, Symantec, 2014). Detection: antivirus systems, rule-based systems]

Data Clustering for Computer Security
• Goal: clustering of malware families to identify common
characteristics and design suitable countermeasures
• e.g., antivirus rules / signatures
[Figure: malware clustering pipeline. Malware samples -> data collection (honeypots) -> feature extraction (e.g., executed instructions, system calls, etc.) into feature vectors (x1, x2, ..., xd) -> clustering of malware families (e.g., similar program behavior) -> data analysis / countermeasure design (e.g., signature generation; for each cluster: if … then … else …)]

Is Data Clustering Secure?
• Attackers can poison input data to subvert malware clustering
[Figure: the same pipeline under attack. Malware samples designed to subvert clustering -> data collection (honeypots) -> feature extraction (e.g., executed instructions, system calls, etc.) into feature vectors (x1, x2, ..., xd) -> clustering of malware families (e.g., similar program behavior) … is significantly compromised -> data analysis / countermeasure design (e.g., signature generation; for each cluster: if … then … else …) … becomes useless (too many false alarms, low detection rate)]
(1) B. Biggio et al. Is data clustering in adversarial settings secure? In AISec 2013
(2) B. Biggio et al. Poisoning complete-linkage hierarchical clustering. In S+SSPR 2014

Is Data Clustering Secure?
• Our previous work (1,2):
– Framework for security evaluation of clustering algorithms
– Formalization of poisoning attacks (optimization) against
single- and complete-linkage hierarchical clustering
• In this work we focus on a realistic application example: poisoning attacks against a behavioral malware clustering approach (3)
– Malheur: http://www.mlsec.org/malheur/
(1) B. Biggio et al. Is data clustering in adversarial settings secure? In AISec 2013
(2) B. Biggio et al. Poisoning complete-linkage hierarchical clustering. In S+SSPR 2014
(3) K. Rieck et al. Automatic analysis of malware behavior using machine learning. JCS 2011

Poisoning Attacks
• Goal: to maximally compromise the clustering output on D
• Capability: adding m attack samples
• Knowledge: perfect / worst-case attack
• Attack strategy: $\max_{A} \, d_c(Y, Y'(A)), \quad A = \{a_i\}_{i=1}^{m}$
– $d_c$: distance between the clustering in the absence of attack and that under attack
– $Y = f(D)$: clustering on untainted data $D$
– $Y' = f_D(D \cup A)$: clustering obtained after adding the attack samples $A$
[Figure: two scatter plots: clustering on untainted data D; clustering with attack samples A]

Poisoning Attacks
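Attack strategy: $\max_{A} \, d_c(Y, Y'(A)), \quad A = \{a_i\}_{i=1}^{m}$
Distance between clusterings: $d_c(Y, Y') = \left\| YY^\top - Y'Y'^\top \right\|_F$
For a given clustering (rows correspond to Sample 1, …, Sample 5):
$$
Y = \begin{bmatrix}
1 & 0 & 0 \\
0 & 0 & 1 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0
\end{bmatrix},
\qquad
YY^\top = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$
This distance counts how many pairs of samples have been clustered
together in one clustering and not in the other, and vice-versa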
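The distance d_c can be computed directly from cluster labels, since entry (i, j) of YY^T is 1 exactly when samples i and j share a cluster. The following is a minimal sketch in plain Python; the function names and the "attacked" labeling are illustrative assumptions, not taken from the slides. Note that the squared Frobenius norm equals the number of ordered sample pairs on which the two clusterings disagree.

```python
import math

def co_association(labels):
    """Return YY^T as nested lists: entry (i, j) is 1 if samples i and j share a cluster."""
    return [[1 if li == lj else 0 for lj in labels] for li in labels]

def clustering_distance(labels_a, labels_b):
    """Frobenius norm of the difference between the two co-association matrices."""
    co_a = co_association(labels_a)
    co_b = co_association(labels_b)
    squared = sum(
        (x - y) ** 2
        for row_a, row_b in zip(co_a, co_b)
        for x, y in zip(row_a, row_b)
    )
    return math.sqrt(squared)

# Clustering from the slide: samples 1 and 4 together, samples 2 and 3 together, sample 5 alone.
y = [0, 2, 2, 0, 1]
# Hypothetical clustering under attack: sample 5 is absorbed into the cluster of samples 1 and 4.
y_attacked = [0, 2, 2, 0, 0]

print(co_association(y))
print(clustering_distance(y, y_attacked))  # 4 ordered pairs differ, so the norm is 2.0
```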

Single-Linkage Hierarchical Clustering
• Bottom-up agglomerative clustering
– each point is initially considered as a cluster
– closest clusters are iteratively merged
• Linkage criterion to define distance between clusters
– single-linkage criterion: $\mathrm{dist}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)$
• Clustering output is a hierarchy of clusterings
– Criterion needed to select a given clustering (e.g., number of clusters)
– Cutoff threshold on the maximum intra-cluster distance
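The agglomerative procedure can be illustrated with a deliberately naive sketch in plain Python. It assumes Euclidean distance and assumes that merging stops once the closest pair of clusters is farther apart than the cutoff; the function names are illustrative and do not come from Malheur or any library.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, cutoff):
    """Naive bottom-up single-linkage clustering with a distance cutoff.

    Every point starts as its own cluster; the two closest clusters
    (minimum distance over all cross-cluster point pairs) are merged
    until the closest pair is farther apart than `cutoff`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cutoff:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for idx in members:
            labels[idx] = k
    return labels

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0)]
print(single_linkage(points, cutoff=0.3))
```

This version recomputes all pairwise distances at every step, so it is only suitable for toy inputs; it is meant to show the merge rule, not to be efficient.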

Poisoning Single-Linkage Clustering
• Attack strategy: $\max_{A} \, d_c(Y, Y'(A)), \quad A = \{a_i\}_{i=1}^{m}$
• Heuristic-based solutions
– Greedy approach: adding one attack sample at a time
– Bridge-based heuristics: local maxima are found in between the
closest points of adjacent clusters

• Underlying idea: bridging the closest clusters
– Given K clusters, K-1 candidate attack points
[Figure: candidate attack points]

1. Bridge (Best): evaluates Y’(a) for each candidate attack,
retaining the best one
– Clustering is run for each candidate attack point
2. Bridge (Hard): estimates Y’(a) assuming that each candidate will
split the corresponding cluster, potentially merging it with a
fragment of the closest cluster
– It does not require running clustering to find the best attack point
3. Bridge (Soft): estimates Y’(a) as Bridge (Hard), but using a soft
probabilistic estimate instead of 0/1 sample-to-cluster
assignments
– It does not require running clustering to find the best attack point
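One greedy step of the Bridge (Best) variant can be sketched as follows. This is a toy illustration under several assumptions that are not in the slides: points live in a Euclidean space, each candidate is the midpoint of the closest pair of points on one of the K-1 links that single-linkage would use to merge the K clusters, and the cutoff is held fixed. All function names are hypothetical.

```python
import math
from itertools import combinations

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage_labels(points, cutoff):
    """Single-linkage clusters at a fixed cutoff, computed as connected
    components of the graph that links points no farther apart than the cutoff."""
    labels = list(range(len(points)))
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= cutoff and labels[i] != labels[j]:
            old, new = labels[j], labels[i]
            labels = [new if l == old else l for l in labels]
    return labels

def clustering_distance(la, lb):
    """d_c: square root of the number of ordered pairs clustered differently."""
    n = len(la)
    diff = sum((la[i] == la[j]) != (lb[i] == lb[j]) for i in range(n) for j in range(n))
    return math.sqrt(diff)

def bridge_candidates(points, labels):
    """Midpoints of the K-1 shortest links that would connect the K clusters."""
    links = {}
    for i, j in combinations(range(len(points)), 2):
        if labels[i] != labels[j]:
            key = tuple(sorted((labels[i], labels[j])))
            d = dist(points[i], points[j])
            if key not in links or d < links[key][0]:
                links[key] = (d, i, j)
    group = {l: l for l in set(labels)}
    candidates = []
    for (ca, cb), (_, i, j) in sorted(links.items(), key=lambda kv: kv[1][0]):
        if group[ca] != group[cb]:
            old, new = group[cb], group[ca]
            group = {k: (new if v == old else v) for k, v in group.items()}
            candidates.append(tuple((x + y) / 2 for x, y in zip(points[i], points[j])))
    return candidates

def bridge_best(points, cutoff):
    """Re-run clustering for every candidate and keep the one maximizing d_c."""
    base = single_linkage_labels(points, cutoff)
    best = None
    for cand in bridge_candidates(points, base):
        attacked = single_linkage_labels(points + [cand], cutoff)[: len(points)]
        score = clustering_distance(base, attacked)
        if best is None or score > best[0]:
            best = (score, cand)
    return best

data = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 0.0)]
print(bridge_candidates(data, single_linkage_labels(data, 0.5)))
print(bridge_best(data, cutoff=0.5))
```

In this toy data the midpoint between the two nearby clusters falls within the cutoff of both, so a single attack point fuses them; when the gap is wider, the greedy approach would need several points added one at a time. Bridge (Hard) and Bridge (Soft) replace the re-clustering inside `bridge_best` with an estimate of Y'(a), which is not reproduced here.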

• The attack compromises the initial clustering by forming
heterogeneous clusters
[Figure: two scatter plots: clustering on untainted data; clustering after adding 20 attack samples]

Malheur
Behavioral Malware Clustering
• Malware executed in a sandbox (e.g., virtual machine)
– Monitoring of program behavior (instructions, system calls, etc.)
• Embedding of malware behavior in feature space
– Each feature denotes presence / absence of a given instruction
– Each vector is normalized to unit Euclidean norm
• Clustering using single-linkage (or other linkage variants)
[Figure: Malheur processing chain. Sandbox report (Filesystem: copy file 'a' to 'b', open file 'foo.txt'; Network: ping host '10.1.2.3', listen on port '31337'; Registry: set key 'reboot' to '1') -> MIST representation (level 1), where each line is an instruction (opcode) followed by its arguments -> feature space]
(1) K. Rieck et al. Automatic analysis of malware behavior using machine learning. JCS 2011
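The embedding step described on this slide (presence / absence features, unit Euclidean norm) can be sketched in a few lines. This is a simplified illustration, not Malheur's implementation; the instruction strings and the vocabulary below are made-up placeholders.

```python
import math

def embed(report, vocabulary):
    """Map a behavior report to a binary presence vector over `vocabulary`,
    then scale it to unit Euclidean norm."""
    present = set(report)
    vec = [1.0 if instruction in present else 0.0 for instruction in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

# Hypothetical instruction vocabulary and report (placeholders, not real MIST output).
vocabulary = ["copy_file", "open_file", "ping_host", "listen_port", "set_key"]
report = ["copy_file", "open_file", "copy_file"]

x = embed(report, vocabulary)
print(x)  # two non-zero entries, each equal to 1/sqrt(2)
```

Because every vector has unit norm, adding an instruction to a sample lowers the weight of all of its existing features, which matters for the poisoning constraints discussed next.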

Poisoning Malheur
• Poisoning single-linkage hierarchical clustering
• Problem: how to create bridge points in this feature space?
– Binary-valued vectors normalized to unit Euclidean norm
• Additional constraint on the manipulation of malware samples
– Malware should be modified without affecting malicious functionality
– Adding instructions after malware program execution
– Feature values can be only incremented
[Figure: example with x1 = (1 1 0 0 0) and x2 = (0 0 1 1 1); plot of d(x, x1) and d(x, x2) against the number of added features (0 to 3); the bridge point x lies between x1 and x2]
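The example vectors x1 = (1 1 0 0 0) and x2 = (0 0 1 1 1) can be used to work through the constraint numerically. The sketch below assumes that the bridge point starts as a copy of x1 and gains the features of x2 one at a time, and that d is the Euclidean distance between unit-normalized vectors; both are assumptions about the figure, and the helper names are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def distance(a, b):
    """Euclidean distance between the unit-normalized versions of a and b."""
    a, b = normalize(a), normalize(b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def add_features(base, target, k):
    """Copy `base` and switch on the first k features that `target` has and `base` lacks.
    Features are only ever added, never removed."""
    out = list(base)
    missing = [i for i, (b, t) in enumerate(zip(base, target)) if t and not b]
    for i in missing[:k]:
        out[i] = 1
    return out

x1 = [1, 1, 0, 0, 0]
x2 = [0, 0, 1, 1, 1]
for k in range(4):
    x = add_features(x1, x2, k)
    print(k, x, round(distance(x, x1), 3), round(distance(x, x2), 3))
```

Under these assumptions d(x, x1) grows and d(x, x2) shrinks as features are added, but d(x, x2) never reaches zero because the features of x1 cannot be removed.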

Experimental Setup and Datasets
• Setup
– Data split into two portions of equal size T and S
– T used for extracting instructions and setting the cutoff threshold
– S used for performance evaluation
– F-measure: agreement between clusters and malware families
• Malheur data
– 3131 malware samples collected in 2009 (publicly available)
– 85 instructions / features (on average)
– Cutoff distance (max. F-measure on T): 0.49 (on average)
• Recent Malware data
– 657 malware samples from most prominent families in 2013
– 78 instructions / features (on average)
– Cutoff distance (max. F-measure on T): 0.63 (on average)
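The slides do not spell out the F-measure. The sketch below assumes one common definition for clusterings, where precision credits each cluster with its dominant family, recall credits each family with its dominant cluster, and the F-measure is their harmonic mean; whether this matches the exact measure used in the experiments is an assumption.

```python
from collections import Counter

def clustering_f_measure(cluster_labels, family_labels):
    """Harmonic mean of cluster precision and family recall (assumed definition)."""
    n = len(cluster_labels)
    counts = Counter(zip(cluster_labels, family_labels))
    best_in_cluster = {}
    best_in_family = {}
    for (cluster, family), count in counts.items():
        best_in_cluster[cluster] = max(best_in_cluster.get(cluster, 0), count)
        best_in_family[family] = max(best_in_family.get(family, 0), count)
    precision = sum(best_in_cluster.values()) / n
    recall = sum(best_in_family.values()) / n
    return 2 * precision * recall / (precision + recall)

families = ["a", "a", "b", "b"]
print(clustering_f_measure([0, 0, 1, 1], families))  # clusters match families: 1.0
print(clustering_f_measure([0, 0, 0, 0], families))  # both families fused into one cluster
```

With this definition, fusing two equally sized families into one cluster keeps recall at 1 but halves precision, which is how bridging attacks would show up as a lower F-measure.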

Experimental Results (Malheur data)
• Attack strategies
– Bridge (Best/Hard/Soft), Random, Random (Best), F-measure (Best)
• Results for Malheur data
– Random-based attacks are not effective (high-dimensional space)
– Bridging is effective / clusters are fused together (cutoff threshold is fixed)
– F-measure decreases while maximizing distance between clusterings
[Figure: two plots for the Malheur data, objective function and F-measure against the fraction of poisoning attacks (0% to 20%), with one curve each for Random, Random (Best), Bridge (Best), Bridge (Soft), Bridge (Hard), and F-measure (Best)]

Experimental Results (Recent Malware data)
• Attack strategies
– Bridge (Best/Hard/Soft), Random, Random (Best), F-measure (Best)
• Results for Recent Malware data
– Random-based attacks are not effective (high-dimensional space)
– Bridging is effective / clusters are fused together (cutoff threshold is fixed)
– F-measure decreases while maximizing distance between clusterings
[Figure: two plots for the Recent Malware data, objective function and F-measure against the fraction of poisoning attacks (0% to 20%), with one curve each for Random, Random (Best), Bridge (Best), Bridge (Soft), Bridge (Hard), and F-measure (Best)]

Conclusions and Future Work
• Poisoning attacks can subvert behavioral malware clustering
• Future work
– Extensions to other clustering algorithms, common attack strategy
• e.g., black-box optimization with suitable heuristics
– Attacks with limited knowledge of the data / clustering algorithm
[Figure: Secure clustering algorithms / Attacks against clustering]

Thanks for your attention!
Any questions?
