1. Poisoning Complete-Linkage Hierarchical Clustering
Battista Biggio1, Samuel Rota Bulò2, Ignazio Pillai1, Michele Mura1, Eyasu Zemene Mequanint3, Marcello Pelillo3, and Fabio Roli1
(1) Università di Cagliari (IT); (2) FBK-irst, Trento (IT); (3) Università Ca’ Foscari di Venezia (IT)
Pattern Recognition and Applications Lab, Department of Electrical and Electronic Engineering, University of Cagliari, Italy
S+SSPR 2014, Joensuu, Finland, 20-22 August 2014
2. Threats and Attacks in Computer Security
• Growing number of devices,
services and applications
connected to the Internet
• Vulnerabilities and attacks
through malicious software (malware)
– Examples: Android market,
malware applications
• Identity theft
• Stolen credentials / credit card numbers
http://pralab.diee.unica.it
3. Threats and Attacks in Computer Security
• Need for (automated) detection (and rule generation)
– machine learning-based defenses (data clustering)
– Evasion: malware families / variants (+65% new malware variants from 2012 to 2013; Mobile Adware and Malware Analysis, Symantec, 2014)
– Detection: antivirus systems, rule-based systems
4. Data Clustering for Computer Security
• Goal: clustering of malware families to identify common
characteristics and design suitable countermeasures
• e.g., antivirus rules / signatures
[Figure: analysis pipeline — data collection (honeypots) → feature extraction into a vector (x1, x2, ..., xd) (e.g., URL length, num. of parameters, etc.) → clustering of malware families (e.g., similar HTTP requests) → data analysis / countermeasure design (e.g., signature generation, if … then … else rules)]
• Example: a suspicious HTTP request to a web server:
http://www.vulnerablehotel.com/components/com_hbssearch/longDesc.php?h_id=1&id=-2%20union%20select%20concat%28username,0x3a,password%29%20from%20jos_users--
5. Is Data Clustering Secure?
• Attackers can poison input data to subvert malware clustering
[Figure: the same pipeline, now poisoned — the attacker submits well-crafted HTTP requests (http://www.vulnerablehotel.com/…) to subvert clustering: the clustering of malware families is significantly compromised, and the resulting countermeasure (signature generation) becomes useless (too many false alarms, low detection rate)]
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial
settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
6. Is Data Clustering Secure?
• Earlier work (1,2): qualitative definition of attacks
– Samples can be added to merge (and/or split) existing clusters
– Samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)
(1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.
(2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int’l Conf.
Intelligence and Security Informatics, pp.185–187, 2008.
[Figure: clustering on untainted data]
7. Is Data Clustering Secure?
• Our previous work (1):
– Framework for security evaluation of clustering algorithms
– Formalization of poisoning and obfuscation attacks (optimization)
– Case study on single-linkage hierarchical clustering
• Although hierarchical clustering is widely used for malware clustering (2,3), it is significantly vulnerable to well-crafted attacks!
• In this work we focus on poisoning attacks against complete-linkage hierarchical clustering
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial
settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
(2) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of http-based
malware. Computer Networks, 57(2):487-500, 2013
(3) K. Rieck, P. Trinius, C. Willems, and T. Holz. Automatic analysis of malware behavior using
machine learning. J. Comput. Secur., 19(4):639-668, 2011.
8. Complete-Linkage Hierarchical Clustering
• Bottom-up agglomerative clustering
– each point is initially considered as a cluster
– closest clusters are iteratively merged
• Linkage criterion to define the distance between clusters
– complete-linkage criterion: dist(C_i, C_j) = max_{a ∈ C_i, b ∈ C_j} d(a, b)
• Clustering output is a hierarchy of clusterings
– A criterion is needed to select a given clustering from the hierarchy (e.g., the number of clusters)
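The procedure above can be sketched with SciPy's agglomerative clustering; the data and the choice of k here are purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# two well-separated groups of 2-D points (illustrative data)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [4.9, 5.3]])

# bottom-up agglomerative clustering with the complete-linkage criterion:
# dist(Ci, Cj) = max over a in Ci, b in Cj of d(a, b)
Z = linkage(pdist(X), method='complete')

# select a given clustering from the hierarchy (here: k = 2 clusters)
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```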
9. Poisoning Attacks
• Goal: to maximally compromise the clustering output on D
• Capability: adding m attack samples
• Knowledge: perfect / worst-case attack
• Attack strategy: maximize the distance between the clustering obtained in the absence of attack and that obtained under attack:
max_A d_c(Y, Y'(A)),  A = {a_i}_{i=1}^m
where Y = f(D) is the clustering on the untainted data, and Y' = f_D(D ∪ A) is the clustering of the tainted data D ∪ A, restricted to the samples in D
[Figure: clustering on untainted data D, with the attack samples A highlighted]
10. Poisoning Attacks
• Attack strategy (as before): max_A d_c(Y, Y'(A)),  A = {a_i}_{i=1}^m
• Distance between clusterings: d_c(Y, Y') = || Y Y^T − Y' Y'^T ||_F
where Y is the sample-to-cluster assignment matrix of a given clustering (one row per sample, one column per cluster), e.g., for 5 samples in 3 clusters:
Y = [1 0 0; 0 0 1; 0 0 1; 1 0 0; 0 1 0]  (rows: Sample 1, …, Sample 5)
Y Y^T = [1 0 0 1 0; 0 1 1 0 0; 0 1 1 0 0; 1 0 0 1 0; 0 0 0 0 1]
• How to choose a given clustering from the hierarchy?
– The clustering algorithm chooses the number of clusters that minimizes the attacker's objective
– This gives us a lower bound on the worst-case attack's impact!
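The distance d_c can be computed directly from the assignment matrices; a minimal sketch using the 5-sample example above:

```python
import numpy as np

# sample-to-cluster assignment matrix: 5 samples, 3 clusters
# (samples 1 and 4 share a cluster, samples 2 and 3 share another)
Y = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

def clustering_distance(Y1, Y2):
    """d_c(Y, Y') = || Y Y^T - Y' Y'^T ||_F (Frobenius norm).
    Y Y^T is the co-clustering matrix: entry (i, j) is 1 iff samples i
    and j share a cluster, so d_c is invariant to cluster relabeling."""
    return np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, ord='fro')

# permuting the cluster labels (columns) leaves d_c unchanged
Y_perm = Y[:, [2, 0, 1]]
print(clustering_distance(Y, Y_perm))  # 0.0

# moving sample 5 into the first cluster flips four co-clustering entries
Y_alt = Y.copy()
Y_alt[4] = [1, 0, 0]
print(clustering_distance(Y, Y_alt))   # 2.0
```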
11. Poisoning Complete-Linkage Clustering
• Attack strategy: max_A d_c(Y, Y'(A)),  A = {a_i}_{i=1}^m
• Heuristic-based solutions
– Greedy approach: adding one attack sample at a time
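The greedy loop can be sketched as follows. This implements the Random (Best) baseline from the experiments (try random candidates, keep the one maximizing the objective), with a fixed number of clusters k for simplicity rather than minimizing over the hierarchy as described above; data and parameters are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def labels_to_Y(labels):
    """One-hot sample-to-cluster assignment matrix."""
    return (labels[:, None] == np.unique(labels)[None, :]).astype(float)

def dc(Y1, Y2):
    return np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, ord='fro')

def cluster(X, k):
    Z = linkage(pdist(X), method='complete')
    return fcluster(Z, t=k, criterion='maxclust')

# untainted data: two separated clusters, k = 2
D = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(4, 0.3, (20, 2))])
k, n = 2, len(D)
Y = labels_to_Y(cluster(D, k))

# greedy loop: add one attack sample at a time, each time keeping the
# random candidate that maximizes d_c restricted to the original samples
A = np.empty((0, 2))
for _ in range(5):
    best, best_obj = None, -1.0
    for _ in range(50):  # Random (Best): draw and score random candidates
        a = rng.uniform(D.min(0), D.max(0))
        Xp = np.vstack([D, A, a[None, :]])
        Yp = labels_to_Y(cluster(Xp, k))[:n]  # restrict to original data D
        obj = dc(Y, Yp)
        if obj > best_obj:
            best, best_obj = a, obj
    A = np.vstack([A, best[None, :]])
    print(f"added attack sample, objective = {best_obj:.2f}")
```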
12. Poisoning Complete-Linkage Clustering
• Local maxima are found at the clusters’ boundaries (wide regions)
[Figure: the attacker's objective d_c(Y, Y'(a)) plotted over the feature space (x1, x2)]
13. Poisoning Complete-Linkage Clustering
• Underlying idea: to increase intra-cluster distance (extend attack)
• For each cluster, consider two candidate attack points
[Figure: candidate attack points for each cluster]
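A minimal sketch of this candidate-generation idea (an illustration under our own assumptions, not the authors' exact construction): for each cluster, place two candidate attack points just beyond the pair of samples realizing the cluster's diameter, so that adding either one increases the intra-cluster (complete-linkage) distance:

```python
import numpy as np

def candidate_attack_points(clusters, step=0.5):
    """For each cluster, return two candidate attack points extending the
    cluster's diameter outward (step is a hypothetical tuning parameter)."""
    candidates = []
    for pts in clusters:  # pts: (n_i, d) array of samples in one cluster
        # find the pair of points realizing the cluster diameter
        diff = pts[:, None, :] - pts[None, :, :]
        dists = np.linalg.norm(diff, axis=-1)
        i, j = np.unravel_index(np.argmax(dists), dists.shape)
        direction = pts[i] - pts[j]
        norm = np.linalg.norm(direction)
        if norm == 0:
            continue
        direction /= norm
        # push each endpoint slightly further out along the diameter
        candidates.append(pts[i] + step * direction)
        candidates.append(pts[j] - step * direction)
    return candidates

cands = candidate_attack_points([np.array([[0.0, 0.0],
                                           [1.0, 0.0],
                                           [0.5, 0.2]])])
print(cands)
```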
16. Poisoning Complete-Linkage Clustering
1. Extend (Best): evaluates Y’(a) for each candidate attack,
retaining the best one
– Clustering is run for each candidate attack point, twice per cluster
2. Extend (Hard): estimates Y’(a) assuming that each candidate will
split the corresponding cluster, potentially merging it with a
fragment of the closest cluster
– It does not require running clustering to find the best attack point
3. Extend (Soft): estimates Y’(a) as Extend (Hard), but using a soft
probabilistic estimate instead of 0/1 sample-to-cluster
assignments
– It does not require running clustering to find the best attack point
17. Poisoning Complete-Linkage Clustering
• The attack compromises the initial clustering by forming
heterogeneous clusters
[Figure: clustering on untainted data (left) vs. clustering after adding 10 attack samples (right)]
18. Experimental Setup
• Banana: artificial data, 80 samples, 2 features, k=4 initial clusters
• Malware: real data (1), 1,000 samples, 6 features, k≈9 initial clusters
(estimated from data minimizing the Davies-Bouldin Index)
– Features:
1. number of GET requests
2. number of POST requests
3. average URL length
4. average number of URL parameters
5. average amount of data sent by POST requests
6. average response length
• MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters, each corresponding to a different digit
(1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral
clustering of http-based malware. Computer Networks, 57(2):487-500, 2013.
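The model-selection step mentioned above (estimating k by minimizing the Davies-Bouldin index) can be sketched as follows, on synthetic stand-in data rather than the actual six malware features: cut the complete-linkage hierarchy at each candidate k and keep the k with the lowest index.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# synthetic stand-in data: three well-separated 2-D groups
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0, 3, 6)])

Z = linkage(pdist(X), method='complete')
scores = {}
for k in range(2, 10):
    labels = fcluster(Z, t=k, criterion='maxclust')
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)
print(best_k)  # the number of clusters minimizing the index
```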
19. Experimental Results
• Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)
– Banana:
• Extend (Best) very close to Optimal (Grid Search)
• Random (Best) competitive with Extend (Hard / Soft)
[Figure: objective function vs. fraction of samples controlled by the attacker (0% to 11.1%, i.e., up to 10 attack samples) on Banana, for Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), and Optimal (Grid Search)]
20. Experimental Results
• Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)
[Figure: objective function vs. fraction of samples controlled by the attacker on Malware (0% to 5%) and Digits (0.0% to 1.0%), for Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), and Optimal]
– Malware:
• Extend attacks and Random (Best) perform rather well
– MNIST Handwritten Digits:
• Random (Best) not effective
– high-dimensional feature space
• Extend (Soft) outperforms Extend (Best / Hard)
21. Conclusions and Future Work
• Framework for security evaluation of clustering algorithms
• Poisoning attack vs. complete-linkage hierarchical clustering
– Even random-based attacks can be effective!
• Future work
– Extensions to other clustering algorithms, common attack strategy
• e.g., black-box optimization with suitable heuristics
– Attacks with limited knowledge of the input data
[Figure: arms race — attacks against clustering vs. secure clustering algorithms]
24. Is Data Clustering Secure?
• Our previous work (1):
– Framework for security evaluation of clustering algorithms
1. Formal definition of potential attacks
2. Empirical evaluation of their impact
• Adversary’s model
– Goal (security violation)
– Knowledge of the attacked system
– Capability of manipulating the input data
– Attack strategy (optimization problem)
• Inspired from previous work on adversarial machine learning
– Barreno et al., Can machine learning be secure?, ASIACCS 2006
– Huang et al., Adversarial machine learning, AISec 2011
– Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans.
Knowledge and Data Eng., 2013
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial
settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
25. Adversary’s Goal
• Security violation
– Integrity: hiding clusters / malicious
activities without compromising normal
system operation
• e.g., creating fringe clusters → obfuscation attack
– Availability: compromising normal
system operation by maximally
altering the clustering output
• e.g., merging existing clusters → poisoning attack
[Figure: integrity / availability / privacy]
– Privacy: gaining confidential information about system users by
reverse-engineering the clustering process
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial
settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
26. Adversary’s Knowledge
• Perfect knowledge
– full knowledge of the input data, the feature representation (x1, x2, ..., xd), the clustering algorithm (e.g., k-means), and the algorithm parameters (e.g., initialization)
– yields an upper bound on the performance degradation under attack
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial
settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
27. Adversary’s Capability
• Attacker’s capability is bounded:
– maximum number of samples that can be added to the input data
• e.g., the attacker may only control a small fraction of malware samples
collected by a honeypot
– maximum amount of modifications
(application-specific constraints in feature space)
• e.g., malware samples should preserve their malicious functionality
(elements cannot be removed → features can only be incremented)
[Figure: a sample x moved to x' within a feasible domain]
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial
settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
28. Formalizing the Optimal Attack Strategy
• Optimal attack strategy:
max_A E_{θ∼μ}[g(A; θ)]  s.t.  A ∈ Ω
– g(A; θ): attacker's goal
– θ ∼ μ: knowledge of the attacked system (data, features, …), encoded as a parameter vector θ with distribution μ
– Ω: capability of manipulating the input data
• Perfect knowledge: E_{θ∼μ}[g(A; θ)] = g(A; θ0)
= g A;θ0 ( )