When Cyber Security Meets Machine Learning
1. When Cyber Security Meets Machine Learning
Lior Rokach
Information Systems Eng., Ben-Gurion University of the Negev
College of Information Sciences and Technology, Penn State University
2. About Me
Prof. Lior Rokach
Department of Information Systems Engineering
Faculty of Engineering Sciences
Head of the Machine Learning Lab
Ben-Gurion University of the Negev
Email: liorrk@bgu.ac.il
PhD (2004) from Tel Aviv University
3. Why Cyber Security?
• Evolving Domain – Endless Game
• Plenty of Data
• Practical Contribution
• Strong support of the stakeholders
– Communications
– Collaborations
– Grants
7/30/2012 Ben-Gurion University
4. Cyber Security
Cyber security is defined as the intersection of
• computer security
• network security
• information security
Notable incidents: GhostNet (2008), Aurora (2009), Stuxnet (2010), NASDAQ (2010), Sony (2011), Lockheed Martin (2011)
5. Cyber Security
• Is a domain problem, not a domain solution,
thus, it seeks solutions from other areas.
• Traditionally, security problems were addressed with
mathematical models, e.g.
– Secrecy – using cryptography
6. Modern Cyber Security
• Deals with abstract threats which cannot be
solved only by using mathematical models:
– Malware detection.
– Intrusion detection.
– Data leakage, etc.
• Need for other research methods
8. The concept of learning in a ML system
• Learning = Improving with experience at some
task
– Improve over task T,
– With respect to performance measure, P
– Based on experience, E.
9. Motivating Example
Learning to Filter Spam
Example: Spam Filtering
Spam - an email that the user does not want to
receive and has not asked to receive
T: Identify Spam Emails
P: % of spam emails that were filtered
% of ham (non-spam) emails that were
incorrectly filtered-out
E: a database of emails that were labelled
by users/experts
11. The Learning Process in our Example
[Diagram: an email server supplies labelled examples for model learning and model testing. Example input attributes: number of recipients, size of message, number of attachments, number of "re:"s in the subject line.]
12. Data Set
Input attributes: Number of new Recipients, Email Length (K), Country (IP), Customer Type. Target attribute: Email Type.

Number of new Recipients  Email Length (K)  Country (IP)  Customer Type  Email Type
0                         2                 Germany       Gold           Ham
1                         4                 Germany       Silver         Ham
5                         2                 Nigeria       Bronze         Spam
2                         4                 Russia        Bronze         Spam
3                         4                 Germany       Bronze         Ham
0                         1                 USA           Silver         Ham
4                         2                 USA           Silver         Spam
13. Information security and machine
learning: Taxonomy
Problem Domain : Information Security – the problems
we need to solve:
Malware detection
Intrusion detection
SPAM mitigation
Etc.
Solution Domain: Machine-Learning – from which
solutions are drawn.
Artificial neural networks
Decision Trees etc.
14. ISML Taxonomy: Computer Security Using Machine Learning
[Diagram: the taxonomy spans a security domain and a machine-learning domain.]
Security domain:
– Threat type: viruses, worms, spam, identity theft, D.O.S. attacks, buffer overflow, SQL injection, system intrusion
– Damage type: information leakage, denial of service, information loss, loss of confidentiality
– Threat domain: network components, web applications, e-mail, messaging, end point, internet service providers
– Protective security system: intrusion prevention/detection system, firewall/VPN, end-point antivirus, network antivirus, anti-spam filter, misuse/signature-based systems
Machine-learning domain:
– Raw data type: executable, text file, portable executable, document, IP packet, XML file, network traffic, time series, packet header, device
– Extracted features: n-grams, function-based features, string signature, document frequency, OpCode n-grams, XML features
– Feature selection: gain ratio, Fisher score, hierarchical feature selection
– Analysis type: static, dynamic, signature-based
– Learning algorithm type: supervised, unsupervised, semi-supervised, positive-examples-only learning
16. Malware
• Malware, short for malicious software, is
software designed to disrupt computer
operation, gather sensitive information, or
gain unauthorized access to a computer
system
17. Static vs. Dynamic Analysis
• Static – Analyze the program (code) –
– leverage structural information (e.g. sequence of
bytes)
– attempts to detect malware before the program
under inspection executes
• Dynamic – Analyze the running process –
– leverage runtime information (e.g. network
usage)
– attempts to detect malicious behavior during
program execution or after program execution.
19. Analogy of Malcode Detection to Text Categorization
• Classifying malicious code can be treated as analogous to
text categorization:
– Texts ↔ Malicious code (files)
– Words ↔ Code expressions
• Weighting functions, such as tf or tf-idf, can then
be used.
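Following this analogy, a malware "document" can be tokenized into overlapping byte n-grams and counted like words. A minimal sketch (the function name and the hex rendering of each n-gram are my own choices, not from the talk):

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 5) -> Counter:
    """Tokenize a binary into overlapping byte n-grams (rendered as hex
    strings) and count them -- the analogue of word counts in a text."""
    grams = (data[i:i + n].hex() for i in range(len(data) - n + 1))
    return Counter(grams)
```

These counts play the role of term frequencies that tf / tf-idf weighting can then operate on.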
20. tf-idf Weighting
• Best known weighting scheme in information retrieval:
w(t,d) = log(1 + tf(t,d)) × log10(N / df(t))
• The TF (term frequency) tft,d of term t in document d
is defined as the number of times that t occurs in d.
• The IDF (inverse document frequency) of term t: log10 of N
divided by the number of documents that contain t
• The TF factor increases with the number of occurrences within a
document
• The IDF factor increases with the rarity of the term in the collection
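The weighting above translates directly into code; a small sketch (the base of the TF logarithm is not specified on the slide, so natural log is assumed here):

```python
import math

def tfidf(tf: int, df: int, n_docs: int) -> float:
    """w(t,d) = log(1 + tf(t,d)) * log10(N / df(t)).
    tf: occurrences of term t in document d; df: number of documents
    containing t; n_docs: N, the total number of documents."""
    return math.log(1 + tf) * math.log10(n_docs / df)
```

Note that the weight is 0 when the term is absent from the document (tf = 0) or appears in every document (df = N).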
21. Dataset
• We acquired the malicious files from the VX Heaven website –
7,688 malicious files for the Windows OS,
including executable and DLL (Dynamic Link Library) files.
• The benign set contained 22,735 files.
23. Classification Algorithms
• In order to create rules from the raw data gathered
and presented on the previous slide, different
Classification Algorithms were examined
– Artificial Neural Networks (ANN)
– Decision Trees (DT)
– Naive Bayes (NB)
– Support Vector Machines (SVM)
– Boosted Decision Trees (BDT)
– Boosted Naive Bayes (BNB)
24. Steps
• Determine the best conditions:
– The best term representation (TF / TF-IDF)
– The best n-gram size (3 / 4 / 5 / 6)
– The best top-selection (50 / 100 / 200 / 300)
– The best feature selection method (DF / FS / GR)
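Of the three feature-selection criteria, DF (document frequency) is the simplest to sketch; GR (gain ratio) and FS (Fisher score) would rank the same candidate n-grams by a different statistic. A hypothetical helper, not from the talk:

```python
from collections import Counter

def top_k_by_df(docs, k):
    """Rank terms by document frequency (number of documents containing
    the term, counted once per document) and keep the top k."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so each term counts once per document
    return [term for term, _ in df.most_common(k)]
```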
25. Performance Measures
• True Positive Rate (TPR) – the fraction
of positive instances classified
correctly.
• False Positive Rate (FPR) – the fraction
of negative instances misclassified.
• Total Accuracy – the fraction of all
instances classified correctly.
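All three measures follow directly from the confusion-matrix counts; a small sketch (variable names are mine):

```python
def rates(tp: int, fp: int, tn: int, fn: int):
    """TPR, FPR and total accuracy from confusion-matrix counts."""
    tpr = tp / (tp + fn)                   # positives classified correctly
    fpr = fp / (fp + tn)                   # negatives misclassified
    acc = (tp + tn) / (tp + fp + tn + fn)  # all instances classified correctly
    return tpr, fpr, acc
```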
26. Preliminary Results
• Mean accuracies quite similar.
• Best performance: top 5500.
• Best representation: TF
• Best n-gram: 5-gram.
27. Classifiers
• Under the best conditions presented above, the classifiers
that achieved the highest accuracies, with the lowest false
positive rates, were the Boosted Decision Tree and the
Artificial Neural Network:

Classifier  Accuracy  FP     FN
ANN         0.941     0.033  0.134
DT          0.943     0.039  0.099
NB          0.697     0.382  0.069
BDT         0.949     0.040  0.110
BNB         0.697     0.382  0.069
SVM-lin     0.921     0.033  0.214
SVM-poly    0.852     0.014  0.544
SVM-rbf     0.939     0.029  0.154
28. Portable Executable (PE)
• Features extracted from certain parts of Win32 PE binaries (EXE
or DLL).
• PE Header that describes physical structure of a PE binary (e.g.,
creation/modification time, machine type, file size)
• Import Section: which DLLs were imported and which functions from
which imported DLLs were used
• Exports Section: which functions were exported (if the file being examined
is a DLL)
• Resource Directory: resources used by a given file (e.g., dialogs, cursors)
• Version Information (e.g., internal and external name of a file, version
number)
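The PE header fields above can be read with nothing more than the documented file layout: the DOS header starts with "MZ" and stores the offset of the "PE\0\0" signature at 0x3C, and the COFF file header (machine type, section count, creation timestamp) follows the signature. A simplified sketch, not a full parser:

```python
import struct

def parse_pe_header(data: bytes) -> dict:
    """Read a few PE header fields from raw file bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not a DOS/PE file")
    # The offset of the PE signature is stored at 0x3C in the DOS header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The COFF file header follows the 4-byte signature:
    # Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4), ...
    machine, n_sections, timestamp = struct.unpack_from("<HHI", data, e_lfanew + 4)
    return {"machine": machine, "sections": n_sections, "timestamp": timestamp}
```

Import/export tables and resources live deeper in the optional header's data directories; a library such as pefile is the practical choice for those.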
30. Imbalanced Classification Tasks
• Data set is Imbalanced, if
the classes are unequally
distributed
• Class of interest (minority
class) is often much
smaller or rarer
• But the cost of an error on the
minority class can be far higher
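One common countermeasure is to weight errors by inverse class frequency, so mistakes on the rare class cost more. A minimal sketch (normalizing the weights to average 1 is one common convention, not something from the talk):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: weight(c) = n / (k * count(c)),
    where n = total instances and k = number of classes. Rarer classes
    get proportionally larger weights; the weights average to 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

With 9 "ham" instances and 1 "spam" instance, a spam error weighs 5.0 against ham's ~0.56, so a cost-sensitive learner can no longer win by always predicting the majority class.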
31. The Mal-ID Method
• Common Libraries
• Anti-Forensic means to avoid their detection
• Chronological evolution of malware – most viruses
are variants of previous malware.
Mal-ID: Automatic Malware Detection Using
Common Segment Analysis and Meta-Features,
Journal of Machine Learning Research 1 (2012) 1-48
32. Andromaly
• Lightweight Host-based Intrusion
Detection System for Android-
based devices
33. The “Andromaly”
• A lightweight Host-based Intrusion Detection
System for Android-based devices
• Providing real-time monitoring, collection,
preprocessing and analysis of various system
metrics
• Open framework – possible to apply different types
of detection techniques
• Threat assessments (TAs) are weighted and
smoothed to avoid instantaneous false alarms
• An alert is matched against a set of
automatic/manual countermeasures
34. The "Andromaly" architecture
[Diagram: a Graphical User Interface sits on top. Application-level service components include an Alert Manager, Loggers, Scheduling Manager, SQLite storage, Config Manager, Operation Mode Manager, Alert Handler, Threat Weighting Unit, and Processors (Rule-based Classifier, Anomaly Detector, KBTA). Feature Extractors collect application-level and operating-system metrics (processor, memory, keyboard, network, hardware, power) through a communication layer above the application framework and Linux kernel.]
38. Evaluation
Preparation of the data-sets
• The applications were installed on 25 Android G1 devices
(each device has one user only)
• Each user activated each application
• In the background the Android agent was running and
logging data (feature vectors) on the SD-card (88 features
every 2 seconds)
• The feature vectors were added to our data-set and labeled
with the device id, application name and class (game/tool)
39. Abnormal state detection
Goals:
• Identify the most informative features to monitor
• Evaluate various detection methods and algorithms
• Understand the feasibility of running these methods as detection units on Android devices
Setup:
• Detection algorithms: K-Means, Histograms, Logistic Regression, Decision Tree, Bayesian Net, Naïve Bayes
• Feature selection: InfoGain, Chi Square, Fisher Score
• Top best features: 10, 20, 50
(d) Experiment IV:
• Recorded 90 features while activating malicious (4) and tools/games (4) applications
• Differentiate between applications which are not included in the training set, when training and testing are performed on different devices
Step I: differentiating games (23K) and tools (20K) using 25 devices – Rotation Forest / Fisher Score / Top 10 – Accuracy 87.4% (TPR 0.794, FPR 0.126)
Step II: detecting Android malware (15K) using 25 devices – Logistic Regression / Fisher Score / Top 20 – Accuracy 75.3% (TPR 0.797, FPR 0.303)
42. Data Leakage Prevention
A data leakage prevention solution is a system designed to detect
potential data breach incidents in a timely manner and prevent them by
monitoring data while in use (endpoint actions), in motion (network
traffic), and at rest (data storage)
43. Honeytokens
• Honeytokens - faked digital data (e.g., a
credit card number, a database entry or
bogus login credentials) planted into a
genuine system resource (e.g., databases,
files and emails).
• Example:
– Insert a honey-table: a table with a "sweet" name
likely to attract a malicious user (e.g.
"CREDIT_CARDS")
– These tables are not used by any legitimate
application, so any access to them indicates
suspicious activity
44. HoneyGen
• Challenge: a good honeytoken is an artificial
data item that is hard to distinguish from real
tokens
• HoneyGen: an Automated Honeytokens
Generator [Berkovitch, 2011]
– Proposes a generic method for honeytoken
generation that, given any database, generates
high-quality honeytokens
45. HoneyGen System
• Rule mining: extrapolates rules that describe the "real" data structure,
attributes, constraints and logic (identity, reference, cardinality, value-set,
attribute dependency)
• Honeytoken generation: produces candidate tokens from the mined rules
• Likelihood rating: sorts honeytokens by similarity to real tokens in the
input database, according to the commonness of their combination of values
[Diagram: INPUT: real tokens DB → PROCESS: rules mining → rules → honeytokens generation → likelihood rating → OUTPUT: honeytokens DB with likelihood scores]
47. Motivation
• Identity theft is one of the most common crimes
in North America. There are close to 10
million victims of identity theft each year.
• The Federal Trade Commission (FTC)
estimates that identity theft costs
companies approximately $50 billion per
year, in addition to $5 billion in costs
to consumers.
• Today, user authentication is
based on username & password, which can
be stolen physically, via phishing sites or Trojans,
or simply given away.
48. Current Authentication Mechanisms – Costly and Often Unavailable

• Password – authentication by predefined user name and password. Disadvantages: hard to remember many passwords; a password may be copied, cracked or stolen.
• Token – based on an object (e.g., magnetic card, RFID tag). Disadvantages: can be lost or stolen; expensive to deploy and maintain for the consumer market.
• Biometric (Palm/finger) – biometrics based on a palm signature. Disadvantages: expensive; limited availability.
• Biometric (Voice) – biometrics based on vocal patterns. Disadvantages: accuracy limited by illness or background noise.
• Biometric (Iris) – biometrics based on structural and color patterns of the human iris. Disadvantages: expensive; accuracy problems for diabetes patients.
• Secure ID – ID numbers which are constantly generated (e.g., by RSA SecurID). Disadvantages: can be lost or stolen.

Users are already hassled by current security mechanisms and
reluctant to accept new ones.
52. Various actions for learning purposes
[Diagram: mouse actions decomposed into atomic events:]
• Mouse Move
• Point and Left Click = Mouse Move + Left Click
• Point and Double Click = Mouse Move + Double Click
• Point and Right Click = Mouse Move + Right Click
• Drag and Drop (DD) = Mouse Down + Mouse Move + Mouse Up
53. Evaluation Measures
• False Acceptance Rate (FAR) –the ratio
between the number of attacks that were
erroneously labeled as authentic interactions
and the total number of attacks.
• False Rejection Rate (FRR) –the ratio between
the number of legitimate interactions that
were erroneously labeled as attacks and the
total number of legitimate interactions.
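Both rates are simple ratios over the two interaction populations; a sketch with labels encoded as 1 = attack, 0 = legitimate (my encoding, not the paper's):

```python
def far_frr(y_true, y_pred):
    """FAR: attacks accepted as authentic / all attacks.
    FRR: legitimate interactions rejected as attacks / all legitimate ones."""
    pairs = list(zip(y_true, y_pred))
    attacks = [p for t, p in pairs if t == 1]
    legit = [p for t, p in pairs if t == 0]
    far = sum(1 for p in attacks if p == 0) / len(attacks)
    frr = sum(1 for p in legit if p == 1) / len(legit)
    return far, frr
```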
54. Evaluation
• Fixed text (password) / continuous verification
• The session length (number of actions):
– more actions → better performance
• Keyboard is better than mouse
Session Size FAR FRR EER AUC
1/4 Session 4.33% 3.17% 3.75% 0.0308
1/2 Session 2.59% 2.86% 2.72% 0.0234
Full Session 1.48% 1.59% 1.53% 0.0144
• ~ 3 % FAR, FRR after 10 actions
• Clint Feher, Yuval Elovici, Robert Moskovitch, Lior Rokach, Alon Schclar, “User
Identity Verification via Mouse Dynamics”, Information Sciences Volume 201, 15
October 2012, Pages 19–36.
56. Motivation
• Huge databases exist in society today
– Medical data
– Consumer purchase data
– Census data
– Communication and media-related data
– Data gathered by government agencies
• Can this data be utilized?
– For medical research
– For improving customer service
– For homeland security
• The Problem: The huge amount of data
available means that it is possible to learn a lot
of information about individuals
57. Privacy Challenge (Sweeney, 1998)
[Diagram: medical data (Disease, Birth Date, Zip, Sex) can be linked with a public record (Name, Birth Date, Zip, Sex) through the shared attributes.]
87% of the population in the USA can be uniquely identified by zip, sex and DoB
58. Quasi-identifier
• The minimal set of attributes in a
table that can be joined with external
information to re-identify individual
records
59. k-Anonymity
Let R(A1,...,An) be a relation and QI be the quasi-identifier
associated with it. R is said to satisfy k-anonymity if and only if
every distinct value of QI has at least k occurrences in R.
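The definition translates directly into a check over QI value combinations; a small sketch with rows represented as dicts (a representation chosen for illustration):

```python
from collections import Counter

def satisfies_k_anonymity(rows, qi_columns, k):
    """True iff every distinct combination of quasi-identifier values
    occurs at least k times in the relation."""
    groups = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(count >= k for count in groups.values())
```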
60. Generalization and Suppression
Generalization:
replacement of a value by a less specific (more general)
value using a domain generalization relationship.
Suppression:
remove the value.
Zip-code hierarchy:
Z0 = {53715, 53710, 53706, 53703} → Z1 = {5371*, 5370*} → Z2 = {537**}
Sex hierarchy:
S0 = {Male, Female} → S1 = {Person}
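For zip codes, each step up the hierarchy masks one more trailing digit; suppression is the extreme case where every digit is masked. A hypothetical helper:

```python
def generalize_zip(zipcode: str, level: int) -> str:
    """Climb `level` steps up the zip-code generalization hierarchy by
    replacing the last `level` digits with '*' (level 0 = original value)."""
    if level == 0:
        return zipcode
    return zipcode[:-level] + "*" * level
```

For example, generalize_zip("53715", 1) gives "5371*" (Z1) and level 2 gives "537**" (Z2); level 5 suppresses the value entirely.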
61. Privacy-preserving data mining
(PPDM)
• Goal: Create accurate data mining models from
anonymous data.
• Performing anonymization while ignoring the data
mining task results in a loss of data quality
• Data owners must balance the desire to share
useful data against the need to protect private
information within the data – a trade-off.
62. k-Anonymity Classification Tree
Using Suppression
• Induce a classification tree with existing algorithm
(like C4.5)
• Walk over the tree and iteratively prune the rule
in bottom-up manner until we reach k-anonymity
The order of attributes in the path (from root to
the leaf) already denotes the importance (from
high to low) for predicting the class.
63. Example QI = {Marital Status, Education, Occupation, Sex}
K=100
Marital Status = Married
| Education = High School: <=50K. (200)
| Education = Some college
| | Occupation = Handlers-cleaners: <=50K. (89)
| | Occupation = Exec managerial: >50K (120)
Complying nodes – child leaves whose frequency is at least the k-anonymity
threshold
Non-complying nodes – child leaves whose frequency is below the k-anonymity
threshold
Compensation – use complying nodes to drive the anonymization process by
compensating part of their records in favor of non-complying records using
suppression
64. Example QI = {Marital Status, Education, Occupation, Sex}
K=100
Marital Status = Married
| Education = High School: <=50K. (200)
| Education = Some college
| | Occupation = Handlers-cleaners: <=50K. (89)
| | Occupation = Exec managerial: >50K (120)

Resulting anonymized records:
Marital Status  Education     Occupation       Sex  Class  Count
Married         High School   *                *    <=50K  200
Married         Some college  *                *    <=50K  89
Married         Some college  *                *    <=50K  11 (compensated from the Exec managerial node)
Married         Some college  Exec managerial  *    >50K   120 - 11 = 109
65. Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, Efficient
Multidimensional Suppression for K-Anonymity, IEEE Transactions on Knowledge
and Data Engineering, 22(3): 334-347 (2010).
68. Machine Learning and Security
• Many current and emerging computer and network
security challenges can be solved only by using machine
learning techniques:
– Information leakage
– Data misuse
– Anomaly detection
– etc…
• It is very important to understand how to employ machine
learning techniques in an effective way.
• In particular:
– carefully construct training corpora,
– effective feature extraction,
– effective feature selection, and
– valid evaluations on representative corpora.
A taxonomy of information security combined with machine learning. An example project could focus on investigating a family of viruses that cause information loss in web applications. A network antivirus is the protective measure described in the work, where the raw data processed by such a system are executable files. The proposed solution uses string signatures. The features for string selection are chosen using Fisher score. The signature is found by static code analysis. A classification model is trained using one of the supervised learning algorithms.