When Cyber Security Meets Machine Learning
1. When Cyber Security Meets Machine Learning
Lior Rokach
Information Systems Eng., Ben-Gurion University of the Negev
College of Information Sciences and Technology, Penn State University
2. About Me
Prof. Lior Rokach
Department of Information Systems Engineering
Faculty of Engineering Sciences
Head of the Machine Learning Lab
Ben-Gurion University of the Negev
Email: liorrk@bgu.ac.il
PhD (2004) from Tel Aviv University
3. Why Cyber Security?
• Evolving Domain – Endless Game
• Plenty of Data
• Practical Contribution
• Strong support of the stakeholders
– Communications
– Collaborations
– Grants
7/30/2012 Ben-Gurion University
4. Cyber Security
Cyber security is defined as the intersection of
• computer security
• network security
• information security
Notable incidents: GhostNet (2008), Aurora (2009), Stuxnet (2010), NASDAQ (2010), Sony (2011), Lockheed Martin (2011)
5. Cyber Security
• Is a domain problem, not a domain solution,
thus, it seeks solutions from other areas.
• Traditionally, security problems were addressed with
mathematical models, e.g.
– Secrecy – using cryptography
6. Modern Cyber Security
• Deals with abstract threats which cannot be
solved only by using mathematical models:
– Malware detection.
– Intrusion detection.
– Data leakage, etc.
• Need for other research methods
8. The concept of learning in a ML system
• Learning = Improving with experience at some
task
– Improve over task T,
– With respect to performance measure, P
– Based on experience, E.
9. Motivating Example
Learning to Filter Spam
Example: Spam Filtering
Spam - an email that the user does not want to
receive and has not asked to receive
T: Identify Spam Emails
P: % of spam emails that were filtered
% of ham (non-spam) emails that were
incorrectly filtered-out
E: a database of emails that were labelled
by users/experts
11. The Learning Process in our Example
[Diagram: an email server supplies labelled examples for model learning and model testing. Example input attributes: number of recipients, size of message, number of attachments, number of "re:"s in the subject line.]
12. Data Set
Input attributes: Number of new Recipients, Email Length (K), Country (IP), Customer Type. Target attribute: Email Type.

Number of new Recipients  Email Length (K)  Country (IP)  Customer Type  Email Type
0                         2                 Germany       Gold           Ham
1                         4                 Germany       Silver         Ham
5                         2                 Nigeria       Bronze         Spam
2                         4                 Russia        Bronze         Spam
3                         4                 Germany       Bronze         Ham
0                         1                 USA           Silver         Ham
4                         2                 USA           Silver         Spam
13. Information security and machine
learning: Taxonomy
Problem Domain : Information Security – the problems
we need to solve:
Malware detection
Intrusion detection
SPAM mitigation
Etc.
Solution Domain: Machine-Learning – from which
solutions are drawn.
Artificial neural networks
Decision Trees etc.
14. ISML Taxonomy: Computer Security Using Machine Learning
[Diagram: the taxonomy spans a security domain and a machine-learning domain.]
Security domain:
– Threat type: viruses, worms, spam, identity theft, D.O.S. attacks, buffer overflow, SQL injection, system intrusion
– Damage type: information leakage, denial of service, information loss, loss of confidentiality
– Threat domain: network components, web applications, e-mail, messaging, end point, internet service providers
– Protective security system: intrusion prevention/detection system, firewall/VPN, end-point antivirus, network antivirus, anti-spam filter, misuse/signature-based systems
Machine-learning domain:
– Raw data type: executable, text file, portable executable, document, IP packet, XML file, network traffic, time series, packet header, device
– Extracted features: n-grams, function-based features, string signature, document frequency, OpCode n-grams, XML features
– Feature selection: gain ratio, Fisher score, hierarchical feature selection
– Analysis type: static, dynamic, signature-based
– Learning algorithm type: supervised, unsupervised, semi-supervised, positive-examples-only learning
16. Malware
• Malware, short for malicious software, is
software designed to disrupt computer
operation, gather sensitive information, or
gain unauthorized access to a computer
system
17. Static vs. Dynamic Analysis
• Static – Analyze the program (code) –
– leverage structural information (e.g. sequence of
bytes)
– attempts to detect malware before the program
under inspection executes
• Dynamic – Analyze the running process –
– leverage runtime information (e.g. network
usage)
– attempts to detect malicious behavior during
program execution or after program execution.
19. Analogy of Malcode Detection to Text Categorization
• Classifying malicious code can be treated as analogous to
text categorization:
– Texts ↔ Malicious code (files)
– Words ↔ Code expressions
• Weighting functions, such as tf or tf-idf, can then
be used.
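Following this analogy, a malware "document" can be tokenized into overlapping byte n-grams and counted like words. A minimal sketch (the function name and the hex rendering of each n-gram are my own choices, not from the talk):

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 5) -> Counter:
    """Tokenize a binary into overlapping byte n-grams (rendered as hex
    strings) and count them -- the analogue of word counts in a text."""
    grams = (data[i:i + n].hex() for i in range(len(data) - n + 1))
    return Counter(grams)
```

These counts play the role of term frequencies that tf / tf-idf weighting can then operate on.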
20. tf-idf Weighting
• Best known weighting scheme in information retrieval:
w(t,d) = log(1 + tf(t,d)) × log10(N / df(t))
• The TF (term frequency) tft,d of term t in document d
is defined as the number of times that t occurs in d.
• The IDF (inverse document frequency) of term t: log10 of N
divided by the number of documents that contain t
• The TF factor increases with the number of occurrences within a
document
• The IDF factor increases with the rarity of the term in the collection
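The weighting above translates directly into code; a small sketch (the base of the TF logarithm is not specified on the slide, so natural log is assumed here):

```python
import math

def tfidf(tf: int, df: int, n_docs: int) -> float:
    """w(t,d) = log(1 + tf(t,d)) * log10(N / df(t)).
    tf: occurrences of term t in document d; df: number of documents
    containing t; n_docs: N, the total number of documents."""
    return math.log(1 + tf) * math.log10(n_docs / df)
```

Note that the weight is 0 when the term is absent from the document (tf = 0) or appears in every document (df = N).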
21. Dataset
• We acquired the malicious files from the VX Heaven website –
7,688 malicious files for the Windows OS,
including executable and DLL (Dynamic Link Library) files.
• The benign set contained 22,735 files.
23. Classification Algorithms
• In order to create rules from the raw data gathered
and presented on the previous slide, different
Classification Algorithms were examined
– Artificial Neural Networks (ANN)
– Decision Trees (DT)
– Naive Bayes (NB)
– Support Vector Machines (SVM)
– Boosted Decision Trees (BDT)
– Boosted Naive Bayes (BNB)
24. Steps
• Determine the best conditions:
– The best term representation (TF / TF-IDF)
– The best n-gram size (3 / 4 / 5 / 6)
– The best top-selection (50 / 100 / 200 / 300)
– The best feature selection method (DF / FS / GR)
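Of the three feature-selection criteria, DF (document frequency) is the simplest to sketch; GR (gain ratio) and FS (Fisher score) would rank the same candidate n-grams by a different statistic. A hypothetical helper, not from the talk:

```python
from collections import Counter

def top_k_by_df(docs, k):
    """Rank terms by document frequency (number of documents containing
    the term, counted once per document) and keep the top k."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so each term counts once per document
    return [term for term, _ in df.most_common(k)]
```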
25. Performance Measures
• True Positive Rate (TPR) – the fraction
of positive instances classified
correctly.
• False Positive Rate (FPR) – the fraction
of negative instances misclassified.
• Total Accuracy – the fraction of all
instances classified correctly.
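All three measures follow directly from the confusion-matrix counts; a small sketch (variable names are mine):

```python
def rates(tp: int, fp: int, tn: int, fn: int):
    """TPR, FPR and total accuracy from confusion-matrix counts."""
    tpr = tp / (tp + fn)                   # positives classified correctly
    fpr = fp / (fp + tn)                   # negatives misclassified
    acc = (tp + tn) / (tp + fp + tn + fn)  # all instances classified correctly
    return tpr, fpr, acc
```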
26. Preliminary Results
• Mean accuracies quite similar.
• Best performance: top 5500.
• Best representation: TF
• Best n-gram: 5-gram.
27. Classifiers
• Under the best conditions presented above, the classifiers
that achieved the highest accuracies, with the lowest false
positive rates, were the Boosted Decision Tree and the
Artificial Neural Network:

Classifier  Accuracy  FP     FN
ANN         0.941     0.033  0.134
DT          0.943     0.039  0.099
NB          0.697     0.382  0.069
BDT         0.949     0.040  0.110
BNB         0.697     0.382  0.069
SVM-lin     0.921     0.033  0.214
SVM-poly    0.852     0.014  0.544
SVM-rbf     0.939     0.029  0.154
28. Portable Executable (PE)
• Features extracted from certain parts of Win32 PE binaries (EXE
or DLL).
• PE Header that describes physical structure of a PE binary (e.g.,
creation/modification time, machine type, file size)
• Import Section: which DLLs were imported and which functions from
which imported DLLs were used
• Exports Section: which functions were exported (if the file being examined
is a DLL)
• Resource Directory: resources used by a given file (e.g., dialogs, cursors)
• Version Information (e.g., internal and external name of a file, version
number)
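The PE header fields above can be read with nothing more than the documented file layout: the DOS header starts with "MZ" and stores the offset of the "PE\0\0" signature at 0x3C, and the COFF file header (machine type, section count, creation timestamp) follows the signature. A simplified sketch, not a full parser:

```python
import struct

def parse_pe_header(data: bytes) -> dict:
    """Read a few PE header fields from raw file bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not a DOS/PE file")
    # The offset of the PE signature is stored at 0x3C in the DOS header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The COFF file header follows the 4-byte signature:
    # Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4), ...
    machine, n_sections, timestamp = struct.unpack_from("<HHI", data, e_lfanew + 4)
    return {"machine": machine, "sections": n_sections, "timestamp": timestamp}
```

Import/export tables and resources live deeper in the optional header's data directories; a library such as pefile is the practical choice for those.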
30. Imbalanced Classification Tasks
• Data set is Imbalanced, if
the classes are unequally
distributed
• Class of interest (minority
class) is often much
smaller or rarer
• But the cost of an error on the
minority class can be far higher
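One common countermeasure is to weight errors by inverse class frequency, so mistakes on the rare class cost more. A minimal sketch (normalizing the weights to average 1 is one common convention, not something from the talk):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: weight(c) = n / (k * count(c)),
    where n = total instances and k = number of classes. Rarer classes
    get proportionally larger weights; the weights average to 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

With 9 "ham" instances and 1 "spam" instance, a spam error weighs 5.0 against ham's ~0.56, so a cost-sensitive learner can no longer win by always predicting the majority class.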
31. The Mal-ID Method
• Common Libraries
• Anti-Forensic means to avoid their detection
• Chronological evolution of malware – most viruses
are variants of previous malware.
Mal-ID: Automatic Malware Detection Using
Common Segment Analysis and Meta-Features,
Journal of Machine Learning Research 1 (2012) 1-48
32. Andromaly
• Lightweight Host-based Intrusion
Detection System for Android-
based devices
33. The “Andromaly”
• A lightweight Host-based Intrusion Detection
System for Android-based devices
• Providing real-time monitoring, collection,
preprocessing and analysis of various system
metrics
• Open framework – possible to apply different types
of detection techniques
• Threat assessments (TAs) are weighted and
smoothed to avoid instantaneous false alarms
• An alert is matched against a set of
automatic/manual countermeasures
34. The "Andromaly" architecture
[Diagram: a Graphical User Interface sits on top. Application-level service components include an Alert Manager, Loggers, Scheduling Manager, SQLite storage, Config Manager, Operation Mode Manager, Alert Handler, Threat Weighting Unit, and Processors (Rule-based Classifier, Anomaly Detector, KBTA). Feature Extractors collect application-level and operating-system metrics (processor, memory, keyboard, network, hardware, power) through a communication layer above the application framework and Linux kernel.]
38. Evaluation
Preparation of the data-sets
• The applications were installed on 25 Android G1 devices
(each device has one user only)
• Each user activated each application
• In the background the Android agent was running and
logging data (feature vectors) on the SD-card (88 features
every 2 seconds)
• The feature vectors were added to our data-set and labeled
with the device id, application name and class (game/tool)
39. Abnormal state detection
Goals:
• Identify the most informative features to monitor
• Evaluate various detection methods and algorithms
• Understand the feasibility of running these methods as detection units on Android devices
Setup:
• Detection algorithms: K-Means, Histograms, Logistic Regression, Decision Tree, Bayesian Net, Naïve Bayes
• Feature selection: InfoGain, Chi Square, Fisher Score
• Top best features: 10, 20, 50
(d) Experiment IV:
• Recorded 90 features while activating malicious (4) and tools/games (4) applications
• Differentiate between applications which are not included in the training set, when training and testing are performed on different devices
Step I: differentiating games (23K) and tools (20K) using 25 devices – Rotation Forest / Fisher Score / Top 10 – Accuracy 87.4% (TPR 0.794, FPR 0.126)
Step II: detecting Android malware (15K) using 25 devices – Logistic Regression / Fisher Score / Top 20 – Accuracy 75.3% (TPR 0.797, FPR 0.303)
42. Data Leakage Prevention
A data leakage prevention solution is a system designed to detect
potential data breach incidents in a timely manner and prevent them by
monitoring data while in use (endpoint actions), in motion (network
traffic), and at rest (data storage)
43. Honeytokens
• Honeytokens - faked digital data (e.g., a
credit card number, a database entry or
bogus login credentials) planted into a
genuine system resource (e.g., databases,
files and emails).
• Example:
– Insert a honey-table: a table with a "sweet" name
likely to attract a malicious user (e.g.
"CREDIT_CARDS")
– These tables are not used by any legitimate
application, so any access to them indicates
suspicious activity
44. HoneyGen
• Challenge: a good honeytoken is an artificial
data item that is hard to distinguish from real
tokens
• HoneyGen: an Automated Honeytokens
Generator [Berkovitch, 2011]
– Proposes a generic method for honeytoken
generation that, given any database, generates
high-quality honeytokens
45. HoneyGen System
• Rule mining: extrapolates rules that describe the "real" data structure,
attributes, constraints and logic (identity, reference, cardinality, value-set,
attribute dependency)
• Honeytoken generation: produces candidate tokens from the mined rules
• Likelihood rating: sorts honeytokens by similarity to real tokens in the
input database, according to the commonness of their combination of values
[Diagram: INPUT: real tokens DB → PROCESS: rules mining → rules → honeytokens generation → likelihood rating → OUTPUT: honeytokens DB with likelihood scores]
47. Motivation
• Identity theft is one of the most common crimes
in North America. There are close to 10
million victims of identity theft each year.
• The Federal Trade Commission (FTC)
estimates that identity theft costs
companies approximately $50 billion per
year, in addition to $5 billion in costs
to consumers.
• Today, user authentication is
based on username & password, which can
be stolen physically, via phishing sites or Trojans,
or simply given away.
48. Current Authentication Mechanisms – Costly and Often Unavailable

• Password – authentication by predefined user name and password. Disadvantages: hard to remember many passwords; a password may be copied, cracked or stolen.
• Token – based on an object (e.g., magnetic card, RFID tag). Disadvantages: can be lost or stolen; expensive to deploy and maintain for the consumer market.
• Biometric (Palm/finger) – biometrics based on a palm signature. Disadvantages: expensive; limited availability.
• Biometric (Voice) – biometrics based on vocal patterns. Disadvantages: accuracy limited by illness or background noise.
• Biometric (Iris) – biometrics based on structural and color patterns of the human iris. Disadvantages: expensive; accuracy problems for diabetes patients.
• Secure ID – ID numbers which are constantly generated (e.g., by RSA SecurID). Disadvantages: can be lost or stolen.

Users are already hassled by current security mechanisms and
reluctant to accept new ones.
52. Various actions for learning purposes
[Diagram: mouse actions decomposed into atomic events:]
• Mouse Move
• Point and Left Click = Mouse Move + Left Click
• Point and Double Click = Mouse Move + Double Click
• Point and Right Click = Mouse Move + Right Click
• Drag and Drop (DD) = Mouse Down + Mouse Move + Mouse Up
53. Evaluation Measures
• False Acceptance Rate (FAR) –the ratio
between the number of attacks that were
erroneously labeled as authentic interactions
and the total number of attacks.
• False Rejection Rate (FRR) –the ratio between
the number of legitimate interactions that
were erroneously labeled as attacks and the
total number of legitimate interactions.
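Both rates are simple ratios over the two interaction populations; a sketch with labels encoded as 1 = attack, 0 = legitimate (my encoding, not the paper's):

```python
def far_frr(y_true, y_pred):
    """FAR: attacks accepted as authentic / all attacks.
    FRR: legitimate interactions rejected as attacks / all legitimate ones."""
    pairs = list(zip(y_true, y_pred))
    attacks = [p for t, p in pairs if t == 1]
    legit = [p for t, p in pairs if t == 0]
    far = sum(1 for p in attacks if p == 0) / len(attacks)
    frr = sum(1 for p in legit if p == 1) / len(legit)
    return far, frr
```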
54. Evaluation
• Fixed text (password) / continuous verification
• The session length (number of actions):
– more actions → better performance
• Keyboard is better than mouse
Session Size FAR FRR EER AUC
1/4 Session 4.33% 3.17% 3.75% 0.0308
1/2 Session 2.59% 2.86% 2.72% 0.0234
Full Session 1.48% 1.59% 1.53% 0.0144
• ~ 3 % FAR, FRR after 10 actions
• Clint Feher, Yuval Elovici, Robert Moskovitch, Lior Rokach, Alon Schclar, “User
Identity Verification via Mouse Dynamics”, Information Sciences Volume 201, 15
October 2012, Pages 19–36.
56. Motivation
• Huge databases exist in society today
– Medical data
– Consumer purchase data
– Census data
– Communication and media-related data
– Data gathered by government agencies
• Can this data be utilized?
– For medical research
– For improving customer service
– For homeland security
• The Problem: The huge amount of data
available means that it is possible to learn a lot
of information about individuals
57. Privacy Challenge (Sweeney, 1998)
[Diagram: medical data (Disease, Birth Date, Zip, Sex) can be linked with a public record (Name, Birth Date, Zip, Sex) through the shared attributes.]
87% of the population in the USA can be uniquely identified by zip, sex and DoB
58. Quasi-identifier
• The minimal set of attributes in a
table that can be joined with external
information to re-identify individual
records
59. k-Anonymity
Let R(A1,...,An) be a relation and QI be the quasi-identifier
associated with it. R is said to satisfy k-anonymity if and only if
every distinct value of QI has at least k occurrences in R.
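The definition translates directly into a check over QI value combinations; a small sketch with rows represented as dicts (a representation chosen for illustration):

```python
from collections import Counter

def satisfies_k_anonymity(rows, qi_columns, k):
    """True iff every distinct combination of quasi-identifier values
    occurs at least k times in the relation."""
    groups = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(count >= k for count in groups.values())
```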
60. Generalization and Suppression
Generalization:
replacement of a value by a less specific (more general)
value using a domain generalization relationship.
Suppression:
remove the value.
Zip-code hierarchy:
Z0 = {53715, 53710, 53706, 53703} → Z1 = {5371*, 5370*} → Z2 = {537**}
Sex hierarchy:
S0 = {Male, Female} → S1 = {Person}
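For zip codes, each step up the hierarchy masks one more trailing digit; suppression is the extreme case where every digit is masked. A hypothetical helper:

```python
def generalize_zip(zipcode: str, level: int) -> str:
    """Climb `level` steps up the zip-code generalization hierarchy by
    replacing the last `level` digits with '*' (level 0 = original value)."""
    if level == 0:
        return zipcode
    return zipcode[:-level] + "*" * level
```

For example, generalize_zip("53715", 1) gives "5371*" (Z1) and level 2 gives "537**" (Z2); level 5 suppresses the value entirely.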
61. Privacy-preserving data mining
(PPDM)
• Goal: Create accurate data mining models from
anonymous data.
• Performing anonymization while ignoring the data
mining task results in a loss of data quality
• Data owners must balance the desire to share
useful data against the need to protect private
information within the data – a trade-off.
62. k-Anonymity Classification Tree
Using Suppression
• Induce a classification tree with existing algorithm
(like C4.5)
• Walk over the tree and iteratively prune the rule
in bottom-up manner until we reach k-anonymity
The order of attributes in the path (from root to
the leaf) already denotes the importance (from
high to low) for predicting the class.
63. Example QI = {Marital Status, Education, Occupation, Sex}
K=100
Marital Status = Married
| Education = High School: <=50K. (200)
| Education = Some college
| | Occupation = Handlers-cleaners: <=50K. (89)
| | Occupation = Exec managerial: >50K (120)
Complying nodes – child leaves whose frequency is at least the k-anonymity
threshold
Non-complying nodes – child leaves whose frequency is below the k-anonymity
threshold
Compensation – use complying nodes to drive the anonymization process by
compensating part of their records in favor of non-complying records using
suppression
64. Example QI = {Marital Status, Education, Occupation, Sex}
K=100
Marital Status = Married
| Education = High School: <=50K. (200)
| Education = Some college
| | Occupation = Handlers-cleaners: <=50K. (89)
| | Occupation = Exec managerial: >50K (120)

Resulting anonymized records:
Marital Status  Education     Occupation       Sex  Class  Count
Married         High School   *                *    <=50K  200
Married         Some college  *                *    <=50K  89
Married         Some college  *                *    <=50K  11 (compensated from the Exec managerial node)
Married         Some college  Exec managerial  *    >50K   120 - 11 = 109
65. Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, Efficient
Multidimensional Suppression for K-Anonymity, IEEE Transactions on Knowledge
and Data Engineering, 22(3): 334-347 (2010).
68. Machine Learning and Security
• Many current and emerging computer and network
security challenges can be solved only by using machine
learning techniques:
– Information leakage
– Data misuse
– Anomaly detection
– etc…
• It is very important to understand how to employ machine
learning techniques in an effective way.
• In particular:
– carefully construct training corpora,
– effective feature extraction,
– effective feature selection, and
– valid evaluations on representative corpora.
A taxonomy of information security combined with machine learning. An example project could focus on investigating a family of viruses that cause information loss in web applications. A network antivirus is the protective measure described in the work, where the raw data processed by such a system are executable files. The proposed solution uses string signatures. The features for string selection are chosen using Fisher score. The signature is found by static code analysis. A classification model is trained using one of the supervised learning algorithms.