When
         Cyber Security
            Meets
        Machine Learning




                         Lior Rokach
Information Systems Eng., Ben-Gurion University of the Negev
College of Information Sciences and Technology, Penn State University
About Me
Prof. Lior Rokach
Department of Information Systems Engineering
Faculty of Engineering Sciences
Head of the Machine Learning Lab
Ben-Gurion University of the Negev

Email: liorrk@bgu.ac.il

PhD (2004) from Tel Aviv University
Why Cyber Security?
•   Evolving Domain – Endless Game
•   Plenty of Data
•   Practical Contribution
•   Strong support of the stakeholders
      – Communications
      – Collaborations
      – Grants


7/30/2012                Ben-Gurion University
Cyber Security
           Cyber security is defined as the intersection of
           • computer security
           • network security
           • information security




           Notable incidents: GhostNet (2008), Aurora (2009), Stuxnet (2010),
           NASDAQ (2010), Sony (2011), Lockheed Martin (2011)
Cyber Security
• Is a problem domain, not a solution domain;
  thus, it seeks solutions from other areas.

• Traditionally, security problems were addressed with
  mathematical models, e.g.
      – Secrecy – using cryptography




Modern Cyber Security
• Deals with abstract threats which cannot be
  solved only by using mathematical models:
      – Malware detection.
      – Intrusion detection.
      – Data leakage, etc.


• Need for other research methods


The concept of learning in a ML system
• Learning = Improving with experience at some
  task
  – Improve over task T,
  – With respect to performance measure, P
  – Based on experience, E.




Motivating Example
                Learning to Filter Spam
            Example: Spam Filtering
            Spam - an email that the user does not want to
            receive and has not asked to receive

               T: Identify Spam Emails

               P: % of spam emails that were filtered
                  % of ham (non-spam) emails that were
                  incorrectly filtered-out

               E: a database of emails that were labelled
               by users/experts
The Learning Process

              [Figure: Model Learning → Model Testing]
The Learning Process in our Example

[Figure: Email Server → extracted attributes → Model Learning → Model Testing]

Example attributes extracted from the email server:
     – Number of recipients
     – Size of message
     – Number of attachments
     – Number of "re's" in the subject line
     – …
Data Set

Input attributes: Number of New Recipients, Email Length (K), Country (IP), Customer Type
Target attribute: Email Type
Each row is one instance.

            Number of New   Email         Country (IP)   Customer   Email Type
            Recipients      Length (K)                   Type
                 0              2         Germany        Gold       Ham
                 1              4         Germany        Silver     Ham
                 5              2         Nigeria        Bronze     Spam
                 2              4         Russia         Bronze     Spam
                 3              4         Germany        Bronze     Ham
                 0              1         USA            Silver     Ham
                 4              2         USA            Silver     Spam
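
To make the "model learning" step concrete, here is a minimal sketch that learns a one-attribute rule from the seven instances in the data set above. The OneR-style learner, the column names and the `one_rule` helper are illustrative assumptions and are not taken from the slides; only the two categorical attributes are considered, because treating the numeric ones as categories would simply memorize the rows.

```python
from collections import Counter, defaultdict

# The seven instances from the Data Set slide: (attributes, email type).
COLUMNS = ["new_recipients", "length_k", "country", "customer_type"]
DATA = [
    ((0, 2, "Germany", "Gold"), "Ham"),
    ((1, 4, "Germany", "Silver"), "Ham"),
    ((5, 2, "Nigeria", "Bronze"), "Spam"),
    ((2, 4, "Russia", "Bronze"), "Spam"),
    ((3, 4, "Germany", "Bronze"), "Ham"),
    ((0, 1, "USA", "Silver"), "Ham"),
    ((4, 2, "USA", "Silver"), "Spam"),
]

def one_rule(data, column):
    """Map each value of one attribute to its majority class.

    Returns the rule and the number of training instances it misclassifies.
    """
    by_value = defaultdict(Counter)
    for attrs, label in data:
        by_value[attrs[column]][label] += 1
    rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    errors = sum(rule[attrs[column]] != label for attrs, label in data)
    return rule, errors

# Compare the two categorical attributes and keep the one with fewer errors.
categorical = [COLUMNS.index("country"), COLUMNS.index("customer_type")]
best_column = min(categorical, key=lambda c: one_rule(DATA, c)[1])
best_rule, best_errors = one_rule(DATA, best_column)
```

On this tiny data set the country attribute wins with a single training error (the two USA instances have different labels). A real evaluation would measure errors on held-out instances rather than on the training data.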
Information security and machine
        learning: Taxonomy
               Problem Domain : Information Security – the problems
                we need to solve:
                     Malware detection
                     Intrusion detection
                     SPAM mitigation
                     Etc.


               Solution Domain: Machine-Learning – from which
                solutions are drawn.
                   Artificial neural networks
                   Decision Trees etc.

ISML Taxonomy: Computer Security Using Machine Learning

Security domain
  – Threat type: Viruses, Worms, Spam, D.O.S attacks, Buffer Overflow, SQL Injection, Misuse, System Intrusion
  – Damage type: Information Leakage, Denial of Service, Information Loss, Personality theft, Loss of confidentiality
  – Threat domain: Network components, Web applications, End Point Computer, Messaging, Internet Service providers
  – Protective security system: Intrusion Prevention System, Intrusion Detection System, Firewall/VPN, End Point Antivirus, Network Antivirus, Signature Based Filter Device, Anti-Spam Systems

Machine Learning domain
  – Raw data type: Executable, Text File, E-Mail, IP-Packet, XML File
  – Extracted features: n-grams, Portable Executable, Function Based, String Signature, Network traffic, Time series, OpCode n-grams, XML features, Packet header
  – Feature selection method: Gain Ratio, Fisher Score, Document Frequency, Hierarchical feature selection
  – Analysis type: Static, Dynamic, Sequence
  – Learning algorithm type: Supervised, Unsupervised, Semi-supervised, Positive Examples Only Learning
Detection of Unknown Malicious
              Code
Malware
• Malware, short for malicious software, is
  software designed to disrupt computer
  operation, gather sensitive information, or
  gain unauthorized access to a computer
  system




Static vs. Dynamic Analysis
• Static – Analyze the program (code) –
      – leverage structural information (e.g. sequence of
        bytes)
      – attempts to detect malware before the program
        under inspection executes
• Dynamic – Analyze the running process –
      – leverage runtime information (e.g. network
        usage)
      – attempts to detect malicious behavior during
        program execution or after program execution.
Static Analysis Using
             Machine Learning




Analogy of Malcode Detection to
       Text Categorization
• Classifying malicious code can be treated as analogous to
  text categorization.

• Texts → Malicious Code (Files)

• Words → Code expressions

• Then weighting functions, such as tf or tf-idf, can
  be used.



             tf-idf weighting
• Best known weighting scheme in information retrieval

  $w_{t,d} = \underbrace{\log(1 + \mathrm{tf}_{t,d})}_{\text{TF}} \times \underbrace{\log_{10}(N / \mathrm{df}_t)}_{\text{IDF}}$

• The TF (term frequency) $\mathrm{tf}_{t,d}$ of term t in document d
  is defined as the number of times that t occurs in d.
• The IDF (inverse document frequency) is based on the inverse of $\mathrm{df}_t$, the
  number of documents that contain t; N is the total number of documents.
• The weight increases with the number of occurrences within a
  document
• The weight increases with the rarity of the term in the collection
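
The weighting above can be sketched in a few lines of Python. This is an illustration only: the function name `tf_idf` and the toy documents are assumptions, and base 10 is assumed for the TF logarithm because the slide leaves that base unspecified.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute w(t, d) = log10(1 + tf) * log10(N / df) for tokenized documents.

    docs: list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({
            term: math.log10(1 + count) * math.log10(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy "documents"; in the malcode setting each token would be a code expression.
docs = [["mov", "push", "mov"], ["push", "call"]]
weights = tf_idf(docs)
```

A term that occurs in every document (here `push`) gets weight 0, because log10(N / df) is 0 when df equals N; a term that is frequent in one document and rare in the collection gets the highest weight.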
Dataset
• We acquired the malicious files from the VX Heaven website:
  7,688 malicious files for the Windows OS.
• The files include executables and DLL (Dynamic-Link Library) files.
• The benign set contained 22,735 files.




Data preparations
       •    Creating Vocabularies (TF Vector)

                                         N-Grams          Vocabulary Size
                                         3-gram           16,777,216
                                         4-gram           1,084,793,035
                                         5-gram           1,575,804,954
                                         6-gram           1,936,342,220
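
A minimal sketch of how a term-frequency vector of n-grams might be built for one file. It assumes the n-grams are overlapping byte sequences, which is consistent with the 3-gram vocabulary size above (256^3 = 16,777,216) but is not stated explicitly on the slide; the function name `byte_ngrams` and the sample bytes are illustrative only.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte sequence in data (its TF vector)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Twelve illustrative bytes, not a real executable.
sample = bytes.fromhex("4d5a90000300000004000000")
tf = byte_ngrams(sample, 3)

# Upper bound on the number of distinct byte 3-grams.
max_3gram_vocabulary = 256 ** 3
```

The vocabulary for a corpus is the union of the n-grams seen across all files, which is why feature selection is needed before any classifier can be trained on it.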




Classification Algorithms
      • In order to create rules from the raw data gathered
        and presented on the previous slide, different
        Classification Algorithms were examined

            –   Artificial Neural Networks (ANN)
            –   Decision Trees (DT)
            –   Naive Bayes (NB)
            –   Support Vector Machines (SVM)
            –   Boosted Decision Trees (BDT)
            –   Boosted Naive Bayes (BNB)




Steps
       • Determine the best conditions:
            – The best term representation (TF / TFIDF)
            – The best N-gram (3 / 4 / 5 / 6)
            – The best top-selection (50 / 100 / 200 / 300)
            – The best feature selection method (DF / FS / GR, i.e.
              Document Frequency / Fisher Score / Gain Ratio)



Performance Measures
  • True Positive Rate (TPR) - The fraction
    of positive instances that are classified
    correctly.
  • False Positive Rate (FPR) - The fraction
    of negative instances that are misclassified
    as positive.
  • Total Accuracy - The fraction of all
    instances that are classified correctly.
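
As a quick illustration, the three measures can be computed from true and predicted labels as follows. The function name `confusion_rates`, the label strings and the toy data are assumptions made for this sketch; "malicious" is treated as the positive class.

```python
def confusion_rates(y_true, y_pred, positive="malicious"):
    """Return TPR, FPR and total accuracy for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # malicious file correctly flagged
            else:
                fn += 1  # malicious file missed
        else:
            if pred == positive:
                fp += 1  # benign file wrongly flagged
            else:
                tn += 1  # benign file correctly passed
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

y_true = ["malicious", "malicious", "benign", "benign", "benign"]
y_pred = ["malicious", "benign", "benign", "benign", "malicious"]
rates = confusion_rates(y_true, y_pred)
```

In this toy case one of two malicious files is caught (TPR 0.5), one of three benign files is flagged (FPR about 0.33), and three of five instances are correct (accuracy 0.6).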


Preliminary Results
• Mean accuracies
  quite similar.
• Best performance:
  top 5500.
• Best
  representation: TF
• Best N-gram:
  5-gram.


Classifiers
• Under the best conditions presented above,
  the classifiers that achieved the highest
  accuracies, with lowest false positive rates,
  are:
   – Boosted Decision Tree
   – Artificial Neural Network

   Classifier   Accuracy    FP      FN
   ANN           0.941     0.033   0.134
   DT            0.943     0.039   0.099
   NB            0.697     0.382   0.069
   BDT           0.949     0.040   0.110
   BNB           0.697     0.382   0.069
   SVM-lin       0.921     0.033   0.214
   SVM-poly      0.852     0.014   0.544
   SVM-rbf       0.939     0.029   0.154
Portable Executable (PE)
• Features are extracted from certain parts of Win32 PE binaries (EXE
  or DLL).
• PE Header that describes physical structure of a PE binary (e.g.,
  creation/modification time, machine type, file size)
• Import Section: which DLLs were imported and which functions from
  which imported DLLs were used
• Exports Section: which functions were exported (if the file being examined
  is a DLL)
• Resource Directory: resources used by a given file (e.g., dialogs, cursors)
• Version Information (e.g., internal and external name of a file, version
  number)




n-Grams vs. PE Features




Imbalanced Classification Tasks

        • A data set is imbalanced if
          the classes are unequally
          distributed
        • The class of interest (minority
          class) is often much
          smaller or rarer
        • But the cost of an error on the
          minority class can be
          much higher



The Mal-ID Method
            • Common Libraries
            • Anti-Forensic means to avoid their detection
            • Chronological evolution of malware – Most viruses
              are variants of previous malware.




            Mal-ID: Automatic Malware Detection Using
            Common Segment Analysis and Meta-Features,
            Journal of Machine Learning Research 1 (2012) 1-48
Andromaly

            • Lightweight Host-based Intrusion
              Detection System for Android-
              based devices


The “Andromaly”
            • A lightweight Host-based Intrusion Detection
              System for Android-based devices
            • Provides real-time monitoring, collection,
              preprocessing and analysis of various system
              metrics
            • Open framework – possible to apply different types
              of detection techniques
            • Threat assessments (TAs) are weighted and
              smoothed to avoid instantaneous false alarms
            • An alert is matched against a set of
              automatic/manual countermeasures
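
The weighting and smoothing of threat assessments could look roughly like the sketch below. The slides do not say which weighting or smoothing scheme Andromaly uses, so the weighted average, the exponential smoothing, the parameter values and all names here are assumptions chosen only to show why smoothing suppresses instantaneous false alarms.

```python
def smoothed_threat(scores_per_step, weights, alpha=0.3, threshold=0.5):
    """Combine per-detector threat assessments and smooth them over time.

    scores_per_step: list of {detector: assessment in [0, 1]}, one per time step
                     (every detector is assumed to report at every step).
    weights: {detector: relative weight} (hypothetical values).
    Returns a list of (smoothed_score, alert_raised) tuples.
    """
    total_weight = sum(weights.values())
    smoothed = 0.0
    results = []
    for scores in scores_per_step:
        # Weighted average of the individual threat assessments.
        combined = sum(weights[d] * s for d, s in scores.items()) / total_weight
        # Exponential smoothing: one spike only moves the score part of the way.
        smoothed = alpha * combined + (1 - alpha) * smoothed
        results.append((smoothed, smoothed >= threshold))
    return results

weights = {"classifier": 2.0, "anomaly": 1.0}
high = {"classifier": 1.0, "anomaly": 1.0}
low = {"classifier": 0.0, "anomaly": 0.0}
spike = smoothed_threat([high, low, low], weights)
sustained = smoothed_threat([high, high, high], weights)
```

With these assumed settings a single high assessment never crosses the threshold, while a sustained one raises an alert from the second time step on.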

The “Andromaly” architecture
[Figure: Andromaly architecture. Components shown:]
  – Graphical User Interface
  – Agent Service: Processor Manager, Operation Mode Manager, Loggers, Config Manager, Alert Handler, Feature Manager, SQLite
  – Alert Manager and Threat Weighting Unit
  – Communication layer
  – Processors: Rule-based, Classifier, Anomaly Detector, KBTA
  – Feature Extractors: Application Level, Operating System, Scheduling, Memory, Keyboard, Network, Hardware, Power
  – Underlying platform: Application Framework, Linux Kernel




Few screenshots…




Collected features
Collected Features (88)
  • Touch screen: Avg_Touch_Pressure, Avg_Touch_Area
  • Keyboard: Avg_Key_Flight_Time, Del_Key_Use_Rate, Avg_Trans_To_U, Avg_Trans_L_To_R, Avg_Trans_R_To_L, Avg_Key_Dwell_Time, Keyboard_Opening, Keyboard_Closing
  • Scheduler: Yield_Calls, Schedule_Calls, Schedule_Idle, Running_Jiffies, Waiting_Jiffies
  • CPU Load: CPU_Usage, Load_Avg_1_min, Load_Avg_5_mins, Load_Avg_15_mins, Runnable_Entities, Total_Entities
  • Messaging: Outgoing_SMS, Incoming_SMS, Outgoing_Non_CL_SMS
  • Power: Charging_Enabled, Battery_Voltage, Battery_Current, Battery_Temp, Battery_Level_Change, Battery_Level
  • Memory: Garbage_Collections, Free_Pages, Inactive_Pages, Active_Pages, Anonymous_Pages, Mapped_Pages, File_Pages, Dirty_Pages, Writeback_Pages, DMA_Allocations, Page_Frees, Page_Activations, Page_Deactivations, Minor_Page_Faults
  • Application: Package_Changing, Package_Restarting, Package_Addition, Package_Removal, Package_Restart, UID_Removal
  • Calls: Incoming_Calls, Outgoing_Calls, Missed_Calls, Outgoing_Non_CL_Calls
  • Operating System: Running_Processes, Context_Switches, Processes_Created, Orientation_Changing
  • Network: Local_TX_Packets, Local_TX_Bytes, Local_RX_Packets, Local_RX_Bytes, WiFi_TX_Packets, WiFi_TX_Bytes, WiFi_RX_Packets, WiFi_RX_Bytes
  • Hardware: Camera, USB_State
  • Binder: BC_Transaction, BC_Reply, BC_Acquire, BC_Release, Binder_Active_Nodes, Binder_Total_Nodes, Binder_Ref_Active, Binder_Ref_Total, Binder_Death_Active, Binder_Death_Total, Binder_Transaction_Active, Binder_Transaction_Total, Binder_Trns_Complete_Active, Binder_Trns_Complete_Total
  • Leds: Button_Backlight, Keyboard_Backlight, LCD_Backlight, Blue_Led, Green_Led, Red_Led
Evaluation
            Preparation of the data-sets
• The applications were installed on 25 Android G1 devices
  (each device has one user only)

• Each user activated each application

• In the background the Android agent was running and
  logging data (feature vectors) on the SD card (88 features
  every 2 seconds)

• The feature vectors were added to our data-set and labeled
  with the device id, application name and class (game/tool)


Abnormal state detection
    • Identify the most informative features to monitor
    • Evaluate various detection methods and algorithms
    • Understand the feasibility of running these methods as detection units on
      Android devices

    • Detection algorithms: K-Means, Histograms, Logistic Regression, Decision Tree,
      Bayesian Net, Naïve Bayes
    • Feature selection: InfoGain, Chi Square, Fisher Score
    • Top best features: 10, 20, 50

(d) Experiment IV
    [Figure: pipeline of Feature Extraction, Feature Selection, Training and Testing;
     Malicious (4) and Tools/Games (4) applications; train on Device 1, test on Device 2]
    • Recorded 90 features while activating applications
    • Differentiate between applications which are not included in the training
      set when training and testing are performed on different devices

Step I: Differentiating games (23K) and tools (20K) using 25 devices
    Logistic Regression / Fisher Score / Top 20
    Accuracy 75.3% (TPR 0.797, FPR 0.303)

Step II: Detecting Android malware (15K) using 25 devices
    Rotation Forest / Fisher Score / Top 10
    Accuracy 87.4% (TPR 0.794, FPR 0.126)
Data Leakage Prevention




Data Leakage




Data Leakage Prevention
            A data leakage prevention solution is a system designed to detect
            potential data breach incidents in a timely manner and prevent them by
            monitoring data while in use (endpoint actions), in motion (network
            traffic), and at rest (data storage)




Honeytokens


• Honeytokens - fake digital data (e.g., a
  credit card number, a database entry or
  bogus login credentials) planted into a
  genuine system resource (e.g., databases,
  files and emails).
• Example:
  – Insert a honey-table: a table with a "sweet" name
    that is likely to attract a malicious user (e.g.
    "CREDIT_CARDS")
  – These tables are not being used by any
HoneyGen

• Challenge: A good honeytoken is an artificial
  data item that is hard to distinguish from
  real tokens

• HoneyGen: an Automated Honeytokens
  Generator [Berkovitch, 2011]
  – Proposed a generic method for honeytokens
    generation that given any database will be able to
    generate high quality honeytokens

HoneyGen System
• Rule mining: extrapolates rules that describe the "real" data structure,
   attributes, constraints and logic (identity, reference, cardinality, value-set,
   attribute dependency)
• Honeytoken generation
• Likelihood rating: sort honeytokens by similarity to real tokens in the input
   database, according to the commonness of their combination of values

[Figure: INPUT: real tokens DB → PROCESS: Rules Mining → rules → Honeytokens
 Generation → honeytokens → Likelihood Rating → OUTPUT: honeytokens DB and
 likelihood scores]
Activity Based Verification




Motivation
• Identity theft is one of the most common crimes
  in North America. There are close to 10
  million victims of identity theft each year.
• The Federal Trade Commission (FTC)
  estimates that the cost of identity theft to
  companies is approximately $50 billion per
  year, in addition to $5 billion in costs
  to consumers.
• These days user authentication is
  based on a username and password, which can
  be stolen physically, by phishing sites or Trojans,
  or simply given away.
Current Authentication Mechanisms – Costly
               and often Unavailable.

Current authentication mechanisms and their disadvantages:
• Password: authentication by a predefined user name and password
    – Hard to remember many passwords
    – Password may be copied, cracked or stolen
• Token: based on an object (e.g., magnetic card, RFID tag)
    – Can be lost or stolen
    – Expensive to deploy and maintain for the consumer market
• Biometric (Palm/finger): biometrics based on a palm signature
    – Expensive
    – Limited availability
• Biometric (Voice): biometrics based on vocal patterns
    – Accuracy limited with illness or background noise
• Biometric (Iris): biometrics based on structural and color patterns
  of the human iris
    – Expensive
    – Accuracy problems for diabetes patients
• Secure ID: ID numbers which are constantly generated (e.g., by RSA Secure ID)
    – Can be lost or stolen

Users are already hassled by current security mechanisms and
 reluctant to accept new ones.


Activity Based Verification

                                 Solution




Keyboard Dynamics Features




Example text: "Hello world"

1. Extract the di-graphs and their corresponding temporal features:
       H-e       200
       e-l       150
       l-l       130
       l-o       240
       o-space   300
       space-w   180
       w-o       230
       o-r       310
       r-l       220
       l-d       250

2. Sort the di-graphs based on their temporal features:
       l-l       130
       e-l       150
       space-w   180
       H-e       200
       r-l       220
       w-o       230
       l-o       240
       l-d       250
       o-space   300
       o-r       310

3. Group similar di-graphs into clusters:
   Group 5 similar di-graphs to one cluster
       l-l, e-l, space-w, H-e, r-l      176
       w-o, l-o, l-d, o-space, o-r      266
   Group 2 similar di-graphs to one cluster
       l-l, e-l          140
       space-w, H-e      190
       r-l, w-o          225
       l-o, l-d          245
       o-space, o-r      305
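
The "Hello world" di-graph example above can be reproduced with a short sketch. The cluster values on the slide match the mean of each group's members (for instance (130 + 150 + 180 + 200 + 220) / 5 = 176), so the mean is used here; the function name `cluster_digraphs` is illustrative and the time unit is left unspecified, as on the slide.

```python
def cluster_digraphs(timings, group_size):
    """Sort di-graphs by temporal feature and average consecutive groups.

    timings: {digraph: temporal feature value}
    Returns a list of (member di-graphs, mean value) tuples.
    """
    ordered = sorted(timings.items(), key=lambda item: item[1])
    clusters = []
    for start in range(0, len(ordered), group_size):
        chunk = ordered[start:start + group_size]
        names = [name for name, _ in chunk]
        mean = sum(value for _, value in chunk) / len(chunk)
        clusters.append((names, mean))
    return clusters

# Di-graph timings from the "Hello world" example.
timings = {
    "H-e": 200, "e-l": 150, "l-l": 130, "l-o": 240, "o-space": 300,
    "space-w": 180, "w-o": 230, "o-r": 310, "r-l": 220, "l-d": 250,
}
clusters_of_5 = cluster_digraphs(timings, 5)
clusters_of_2 = cluster_digraphs(timings, 2)
```

Grouping trades detail for robustness: larger clusters give fewer, more stable features per user, while smaller clusters keep more of the individual typing rhythm.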




Mouse Trajectories




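[Figure: classifier hierarchy of mouse actions]
Various actions for learning purposes
• Basic events: Mouse Move, Mouse Down, Mouse Up, Left Click, Right Click, Double Click
• Point and Left Click = Mouse Move + Left Click
• Point and Right Click = Mouse Move + Right Click
• Point and Double Click = Mouse Move + Double Click
• Drag and Drop = Mouse Down + Mouse Move + Mouse Up
• Point and DD = Mouse Move + Mouse Down + Mouse Move + Mouse Up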
Evaluation Measures
• False Acceptance Rate (FAR): the ratio
  between the number of attacks that were
  erroneously labeled as authentic interactions
  and the total number of attacks.
• False Rejection Rate (FRR): the ratio between
  the number of legitimate interactions that
  were erroneously labeled as attacks and the
  total number of legitimate interactions.
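
The two definitions above translate directly into code. The following is a minimal sketch, assuming each interaction is described by two booleans: whether it really was an attack, and whether the verifier accepted it as authentic. The function name `far_frr` and the sample data are hypothetical.

```python
def far_frr(is_attack, accepted):
    """Return (FAR, FRR) for parallel lists of ground truth and verifier decisions."""
    attack_decisions = [acc for atk, acc in zip(is_attack, accepted) if atk]
    legit_decisions = [acc for atk, acc in zip(is_attack, accepted) if not atk]
    # FAR: attacks erroneously accepted / all attacks
    far = sum(attack_decisions) / len(attack_decisions)
    # FRR: legitimate interactions erroneously rejected / all legitimate interactions
    frr = sum(not acc for acc in legit_decisions) / len(legit_decisions)
    return far, frr


# Hypothetical sample: 4 attacks (1 accepted), 5 legitimate sessions (1 rejected).
is_attack = [True, True, True, True, False, False, False, False, False]
accepted = [True, False, False, False, True, True, True, True, False]
print(far_frr(is_attack, accepted))
```

The sketch assumes at least one attack and one legitimate interaction are present; otherwise the corresponding rate is undefined.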

Evaluation
        • Fixed Text (password) / Continuous Verification
        • The Session Length (number of actions)
             – more actions → better performance
        • Keyboard is better than mouse

            Session Size   FAR     FRR     EER     AUC
            1/4 Session    4.33%   3.17%   3.75%   0.0308
            1/2 Session    2.59%   2.86%   2.72%   0.0234
            Full Session   1.48%   1.59%   1.53%   0.0144

        • ~3% FAR and FRR after 10 actions
        •   Clint Feher, Yuval Elovici, Robert Moskovitch, Lior Rokach, Alon Schclar, “User
            Identity Verification via Mouse Dynamics”, Information Sciences Volume 201, 15
            October 2012, Pages 19–36.


Privacy Preserving Data Mining




Motivation
• Huge databases exist in society today
   –   Medical data
   –   Consumer purchase data
   –   Census data
   –   Communication and media-related data
   –   Data gathered by government agencies
• Can this data be utilized?
   – For medical research
   – For improving customer service
   – For homeland security
• The Problem: The huge amount of data
  available means that it is possible to learn a lot
  of information about individuals
Privacy Challenge (Sweeney, 1998)


[Figure: linked attributes: Disease, Birth Date, Zip, Sex, Name]

87% of the population in the USA can be uniquely identified by ZIP code, sex and date of birth
Quasi-identifier
• The minimal set of attributes in a
  table that can be joined with external
  information to re-identify individual
  records
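
To make the re-identification risk concrete, here is a small sketch of a linking attack: a released table without names is joined with an external table on the quasi-identifier. All records, names, and the function `link` are hypothetical and serve only as an illustration.

```python
QI = ("zip", "sex", "birth_date")  # quasi-identifier attributes

# Hypothetical "de-identified" release: names removed, QI attributes kept.
medical = [
    {"zip": "53715", "sex": "F", "birth_date": "1965-07-02", "disease": "asthma"},
    {"zip": "53703", "sex": "M", "birth_date": "1971-01-15", "disease": "flu"},
]

# Hypothetical external source (e.g. a public list) that includes names.
external = [
    {"name": "Alice Example", "zip": "53715", "sex": "F", "birth_date": "1965-07-02"},
    {"name": "Bob Example", "zip": "53706", "sex": "M", "birth_date": "1980-03-09"},
]


def link(released, known, qi):
    """Return (name, disease) pairs for released rows matching exactly one known person."""
    index = {}
    for person in known:
        index.setdefault(tuple(person[a] for a in qi), []).append(person)
    matches = []
    for row in released:
        candidates = index.get(tuple(row[a] for a in qi), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], row["disease"]))
    return matches


print(link(medical, external, QI))
```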
k-Anonymity
Let R(A1,...,An) be a relation and QI be the quasi-identifier
associated with it. R is said to satisfy k-anonymity if and only if
every distinct value of QI has at least k occurrences in R.
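
The definition can be checked mechanically by counting how often each combination of quasi-identifier values occurs. A minimal sketch, assuming the relation is given as a list of dictionaries; the function name `is_k_anonymous` and the sample rows are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """True if every distinct QI value combination occurs at least k times in rows."""
    counts = Counter(tuple(row[attr] for attr in qi) for row in rows)
    return all(count >= k for count in counts.values())


# Hypothetical relation with QI = (zip, sex).
rows = [
    {"zip": "5371*", "sex": "F", "disease": "asthma"},
    {"zip": "5371*", "sex": "F", "disease": "flu"},
    {"zip": "5370*", "sex": "M", "disease": "flu"},
    {"zip": "5370*", "sex": "M", "disease": "diabetes"},
]
print(is_k_anonymous(rows, ("zip", "sex"), 2))
print(is_k_anonymous(rows, ("zip", "sex"), 3))
```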
Generalization and Suppression
• Generalization: replacement of a value by a less specific (more general)
  value, using a domain generalization relationship.
• Suppression: removal of the value.

Domain generalization hierarchy for ZIP codes:

   Z2 = {537**}
   Z1 = {5371*, 5370*}
   Z0 = {53715, 53710, 53706, 53703}

                       537**
             5371*               5370*
        53715     53710     53706     53703

Domain generalization hierarchy for Sex:

   S1 = {Person}
   S0 = {Male, Female}
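
The ZIP code hierarchy above amounts to masking trailing digits. The sketch below shows one way to express this, assuming a level-i generalization replaces the last i characters with `*`. The names `generalize_zip` and `suppress` are hypothetical.

```python
def generalize_zip(zip_code, level):
    """Replace the last `level` characters of a ZIP code with '*' (level 0 = unchanged)."""
    if level == 0:
        return zip_code
    return zip_code[:-level] + "*" * level


def suppress(_value):
    """Suppression removes the value entirely; '*' is used here as the placeholder."""
    return "*"


Z0 = ["53715", "53710", "53706", "53703"]
Z1 = sorted({generalize_zip(z, 1) for z in Z0})
Z2 = sorted({generalize_zip(z, 2) for z in Z0})
print(Z1, Z2)
```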
Privacy-preserving data mining (PPDM)
• Goal: Create accurate data mining models from
  anonymous data.
• Performing anonymization while ignoring the data
  mining task results in a loss of data quality.
• Data owners must balance the desire to share
  useful data against the need to protect private
  information within the data (a trade-off).
k-Anonymity Classification Tree Using Suppression
        • Induce a classification tree with an existing algorithm
           (like C4.5)
        • Walk over the tree and iteratively prune the rules
           in a bottom-up manner until k-anonymity is reached
             – The order of attributes in the path (from the root to
               the leaf) already denotes their importance (from
               high to low) for predicting the class.

Example                QI = {Marital Status, Education, Occupation, Sex}
                       K=100

       Marital Status = Married
       | Education = High School: <=50K. (200)
       | Education = Some college
       | | Occupation = Handlers-cleaners: <=50K. (89)
       | | Occupation = Exec managerial: >50K (120)




  • Complying nodes: child leaves whose frequency is bigger than the k-anonymity
    threshold
  • Non-complying nodes: child leaves whose frequency is lower than the k-anonymity
    threshold
  • Compensation: use complying nodes to drive the anonymization process by
    giving up part of their records in favor of non-complying records, using
    suppression
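
The arithmetic behind compensation can be sketched as follows, using the leaf counts from the tree above (89 and 120 records, K=100). This is a simplified illustration and not the published algorithm: it pools the non-complying sibling leaves (their last attribute is suppressed) and borrows just enough records from complying siblings to reach k. The function name `compensate` is hypothetical.

```python
def compensate(leaves, k):
    """Pool non-complying sibling leaves and borrow surplus records from complying ones.

    leaves: dict mapping leaf label -> record count.
    Returns (remaining complying leaves, size of the pooled/suppressed group).
    """
    kept = {label: count for label, count in leaves.items() if count >= k}
    pooled = sum(count for count in leaves.values() if count < k)
    if pooled == 0:
        return kept, 0  # every leaf already complies
    for label in kept:
        if pooled >= k:
            break
        # A complying leaf may give away only what keeps it at or above k.
        give = min(kept[label] - k, k - pooled)
        kept[label] -= give
        pooled += give
    return kept, pooled


siblings = {"Handlers-cleaners": 89, "Exec managerial": 120}
print(compensate(siblings, 100))
```

If the pooled group is still smaller than k after borrowing, this sketch leaves it as is; in that case the rule would have to be pruned further up the tree.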
Example                         QI = {Marital Status, Education, Occupation, Sex}
                                K=100

               Marital Status = Married
               | Education = High School: <=50K. (200)
               | Education = Some college
               | | Occupation = Handlers-cleaners: <=50K. (89)
               | | Occupation = Exec managerial: >50K (120)

Resulting anonymized table (each row stands for the stated number of identical records):

   Records       Marital Status   Education      Occupation        Sex   Class
   200           Married          High School    *                 *     <=50K
   89            Married          Some college   *                 *     <=50K
   11            Married          Some college   *                 *     <=50K
   120-11=109    Married          Some college   Exec managerial   *     >50K

Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, Efficient
        Multidimensional Suppression for K-Anonymity, IEEE Transactions on Knowledge
        and Data Engineering, 22(3): 334-347 (2010).
Results – Bottom Line
[Figure: Accuracy vs. Anonymity Level. x-axis: k-Anonymity Level (0 to 1000); y-axis: Classification Accuracy (78 to 87)]
Conclusions




Machine Learning and Security
• Many current and emerging computer and network
  security challenges can be solved only by using machine
  learning techniques:
   –   Information leakage
   –   Data misuse
   –   Anomaly detection
   –   etc.
• It is very important to understand how to employ machine
  learning techniques in an effective way.
• In particular:
   –   Careful construction of training corpora
   –   Effective feature extraction
   –   Effective feature selection
   –   Valid evaluations on representative corpora
Thank You
             Lior Rokach
liorrk@bgu.ac.il

When Cyber Security Meets Machine Learning

  • 1. When Cyber Security Meets Machine Learning Lior Rokach Information Systems Eng., Ben-Gurion University of the Negev College of Information Sciences and Technology, Penn State University
  • 2. About Me Prof. Lior Rokach Department of Information Systems Engineering Faculty of Engineering Sciences Head of the Machine Learning Lab Ben-Gurion University of the Negev Email: liorrk@bgu.ac.il PhD (2004) from Tel Aviv University
  • 3. Why Cyber Security? • Evolving Domain – Endless Game • Plenty of Data • Practical Contribution • Strong support of the stakeholders – Communications – Collaborations – Grants 7/30/2012 Ben-Gurion University
  • 4. Cyber Security Cyber security is defined as the intersection of • computer security • network security • information security 2011 2008 2009 2010 2010 2011 SONY LOCKHEED GHOSTNET AURORA STUXNET NASDAQ 4 MARTIN
  • 5. Cyber Security • Is a domain problem, not a domain solution, thus, it seeks solutions from other areas. • Traditionally, Security problems were aided by Mathematical model. e.g. – Secrecy – using cryptography 7/30/2012 Ben-Gurion University
  • 6. Modern Cyber Security • Deals with abstract threats which cannot be solved only by using mathematical models: – Malware detection. – Intrusion detection. – Data leakage, etc. • Need for other research methods 7/30/2012 Ben-Gurion University
  • 7. 7/30/2012 Ben-Gurion University
  • 8. The concept of learning in a ML system • Learning = Improving with experience at some task – Improve over task T, – With respect to performance measure, P – Based on experience, E. 8
  • 9. Motivating Example Learning to Filter Spam Example: Spam Filtering Spam - an email that the user does not want to receive and has not asked to receive T: Identify Spam Emails P: % of spam emails that were filtered % of ham (non-spam) emails that were incorrectly filtered-out E: a database of emails that were labelled by users/experts 7/30/2012 Ben-Gurion University
  • 10. The Learning Process Model Learning Model Testing
  • 11. The Learning Process in our Example: Model Learning, Model Testing. Features: number of recipients, size of message, number of attachments, number of "re's" in the subject line, email server, …
  • 12. Data Set (rows are instances; the first four columns are input attributes, Email Type is the target attribute)
      Number of new Recipients | Email Length (K) | Country (IP) | Customer Type | Email Type
      0 | 2 | Germany | Gold   | Ham
      1 | 4 | Germany | Silver | Ham
      5 | 2 | Nigeria | Bronze | Spam
      2 | 4 | Russia  | Bronze | Spam
      3 | 4 | Germany | Bronze | Ham
      0 | 1 | USA     | Silver | Ham
      4 | 2 | USA     | Silver | Spam
  • 13. Information security and machine learning: Taxonomy • Problem Domain: Information Security – the problems we need to solve: – Malware detection – Intrusion detection – Spam mitigation – Etc. • Solution Domain: Machine Learning – from which solutions are drawn: – Artificial neural networks – Decision trees, etc.
  • 14. ISML Taxonomy: Computer Security Using Machine Learning [Taxonomy diagram. Security domain dimensions: Threat Type, Damage Type, Threat Domain, Protective Security System. Machine Learning domain dimensions: Raw Data Type, Extracted Features, Feature Selection Method, Analysis Type, Learning Algorithm Type. Node labels in the diagram: Viruses, Worms, Spam, D.O.S attacks, Buffer Overflow, SQL Injection, System Intrusion, Misuse, Information Leakage, Information Loss, Denial of Service, Personality theft, Loss of confidentiality, Network components, Web applications, End Point Computer, E-Mail, Messaging, Internet Service providers, Intrusion Prevention System, Intrusion Detection System, Firewall/VPN, End Point Antivirus, Network Antivirus, Filter Device, Anti-Spam Systems, Executable, Portable Executable, Text File, Document, XML File, IP-Packet, Network traffic, n-grams, OpCode n-grams, Function Based Sequence, String Signature, XML features, Packet header, Time series, Gain Ratio, Fisher Score, Document Frequency, Hierarchical feature selection, Static, Dynamic, Signature Based, Supervised, Unsupervised, Semi-supervised, Positive Examples Only Learning]
  • 15. Detection of Unknown Malicious Code
  • 16. Malware • Malware, short for malicious software, is software designed to disrupt computer operation, gather sensitive information, or gain unauthorized access to a computer system
  • 17. Static vs. Dynamic Analysis • Static – analyze the program (code): – leverages structural information (e.g. sequence of bytes) – attempts to detect malware before the program under inspection executes • Dynamic – analyze the running process: – leverages runtime information (e.g. network usage) – attempts to detect malicious behavior during or after program execution
  • 18. Static Analysis Using Machine Learning
  • 19. Analogy of Malcode Detection to Text Categorization • Classifying malicious code can be treated as analogous to text categorization. • Texts ↔ Malicious code (files) • Words ↔ Code expressions • Weighting functions such as tf or tf-idf can then be used.
  • 20. Sec. 6.2.2 tf-idf weighting • Best known weighting scheme in information retrieval: $w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$, where the first factor is the TF part and the second factor is the IDF part • The TF (term frequency) $\mathrm{tf}_{t,d}$ of term t in document d is defined as the number of times that t occurs in d. • The IDF (inverse document frequency) is based on the inverse of $\mathrm{df}_t$, the number of documents that contain t; N is the total number of documents • The weight increases with the number of occurrences within a document • The weight increases with the rarity of the term in the collection
  • 21. Dataset • We acquired the malicious files from the VX Heaven website: 7,688 malicious files for the Windows OS. • Including executable and DLL (Dynamic-Link Library) files • The benign set contained 22,735 files.
  • 22. Data preparations • Creating Vocabularies (TF Vector)
      N-Gram | Vocabulary Size
      3-gram | 16,777,216
      4-gram | 1,084,793,035
      5-gram | 1,575,804,954
      6-gram | 1,936,342,220
  • 23. Classification Algorithms • In order to create rules from the raw data gathered and presented on the previous slide, different classification algorithms were examined: – Artificial Neural Networks (ANN) – Decision Trees (DT) – Naive Bayes (NB) – Support Vector Machines (SVM) – Boosted Decision Trees (BDT) – Boosted Naive Bayes (BNB)
  • 24. Steps • Determine the best conditions: – The best term representation (TF / TFIDF) – The best N-gram (3 / 4 / 5 / 6) – The best top-selection (50 / 100 / 200 / 300) and the best feature selection method (DF / FS / GR, i.e. document frequency, Fisher score, gain ratio)
  • 25. Performance Measures • True Positive Rate (TPR) - the proportion of positive instances that are classified correctly. • False Positive Rate (FPR) - the proportion of negative instances that are misclassified. • Total Accuracy - the proportion of all instances that are classified correctly.
  • 26. Preliminary Results • Mean accuracies quite similar. • Best performance: top 5500. • Best representation: TF • Best N-gram: 5-gram.
  • 27. Classifiers • Under the best conditions presented above, the classifiers that achieved the highest accuracies, with the lowest false positive rates, are: – Boosted Decision Tree – Artificial Neural Network
      Classifier | Accuracy | FP    | FN
      ANN        | 0.941    | 0.033 | 0.134
      DT         | 0.943    | 0.039 | 0.099
      NB         | 0.697    | 0.382 | 0.069
      BDT        | 0.949    | 0.040 | 0.110
      BNB        | 0.697    | 0.382 | 0.069
      SVM-lin    | 0.921    | 0.033 | 0.214
      SVM-poly   | 0.852    | 0.014 | 0.544
      SVM-rbf    | 0.939    | 0.029 | 0.154
  • 28. Portable Executable (PE) • Features are extracted from certain parts of Win32 PE binaries (EXE or DLL). • PE Header: describes the physical structure of a PE binary (e.g., creation/modification time, machine type, file size) • Import Section: which DLLs were imported and which functions from the imported DLLs were used • Exports Section: which functions were exported (if the file being examined is a DLL) • Resource Directory: resources used by a given file (e.g., dialogs, cursors) • Version Information (e.g., internal and external name of a file, version number)
  • 29. n-Grams vs. PE Features
  • 30. Imbalanced Classification Tasks • A data set is imbalanced if the classes are unequally distributed • The class of interest (minority class) is often much smaller or rarer • But the cost of an error on the minority class can be much higher
  • 31. The Mal-ID Method • Common Libraries • Anti-forensic means to avoid detection • Chronological evolution of malware – Most viruses are variants of previous malware. Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features, Journal of Machine Learning Research 1 (2012) 1-48
  • 32. Andromaly • Lightweight Host-based Intrusion Detection System for Android-based devices
  • 33. The “Andromaly” • A lightweight Host-based Intrusion Detection System for Android-based devices • Provides real-time monitoring, collection, preprocessing and analysis of various system metrics • Open framework – possible to apply different types of detection techniques • Threat assessments (TAs) are weighted and smoothed to avoid instantaneous false alarms • An alert is matched against a set of automatic/manual countermeasures
  • 34. The “Andromaly” architecture [Architecture diagram. Layers: Application Level, Application Framework, Linux Kernel. Components: Graphical User Interface, Agent Service, Feature Extractors (Operating System, Scheduling, Processor, Memory, Keyboard, Network, Hardware, Power), Feature Manager, Processors (Rule-based, Classifier, Anomaly Detector, KBTA), Threat Weighting Unit, Alert Manager, Alert Handler, Loggers, SQLite, Config Manager, Operation Mode Manager, Communication layer]
  • 35. Few screenshots…
  • 36. Few screenshots…
  • 37. Collected Features (88)
      Touch screen: Avg_Touch_Pressure, Avg_Touch_Area
      Keyboard: Avg_Key_Flight_Time, Del_Key_Use_Rate, Avg_Trans_To_U, Avg_Trans_L_To_R, Avg_Trans_R_To_L, Avg_Key_Dwell_Time, Keyboard_Opening, Keyboard_Closing
      Scheduler: Yield_Calls, Schedule_Calls, Schedule_Idle, Running_Jiffies, Waiting_Jiffies
      CPU Load: CPU_Usage, Load_Avg_1_min, Load_Avg_5_mins, Load_Avg_15_mins, Runnable_Entities, Total_Entities
      Messaging: Outgoing_SMS, Incoming_SMS, Outgoing_Non_CL_SMS
      Power: Charging_Enabled, Battery_Voltage, Battery_Current, Battery_Temp, Battery_Level_Change, Battery_Level
      Memory: Garbage_Collections, Free_Pages, Inactive_Pages, Active_Pages, Anonymous_Pages, Mapped_Pages, File_Pages, Dirty_Pages, Writeback_Pages, DMA_Allocations, Page_Frees, Page_Activations, Page_Deactivations, Minor_Page_Faults
      Application: Package_Changing, Package_Restarting, Package_Addition, Package_Removal, Package_Restart, UID_Removal
      Calls: Incoming_Calls, Outgoing_Calls, Missed_Calls, Outgoing_Non_CL_Calls
      Operating System: Running_Processes, Context_Switches, Processes_Created, Orientation_Changing
      Network: Local_TX_Packets, Local_TX_Bytes, Local_RX_Packets, Local_RX_Bytes, WiFi_TX_Packets, WiFi_TX_Bytes, WiFi_RX_Packets, WiFi_RX_Bytes
      Hardware: Camera, USB_State
      Binder: BC_Transaction, BC_Reply, BC_Acquire, BC_Release, Binder_Active_Nodes, Binder_Total_Nodes, Binder_Ref_Active, Binder_Ref_Total, Binder_Death_Active, Binder_Death_Total, Binder_Transaction_Active, Binder_Transaction_Total, Binder_Trns_Complete_Active, Binder_Trns_Complete_Total
      Leds: Button_Backlight, Keyboard_Backlight, LCD_Backlight, Blue_Led, Green_Led, Red_Led
  • 38. Evaluation: Preparation of the data-sets • The applications were installed on 25 Android G1 devices (each device has one user only) • Each user activated each application • In the background, the Android agent was running and logging data (feature vectors) on the SD card (88 features every 2 seconds) • The feature vectors were added to our data-set and labeled with the device id, application name and class (game/tool)
  • 39. Abnormal state detection • Identify the most informative features to monitor • Evaluating various detection methods and algorithms • Understanding the feasibility of running these methods as detection units on Android devices • Detection algorithms: K-Means, Histograms, Logistic Regression, Decision Tree, Bayesian Net, Naïve Bayes • Feature selection: InfoGain, Chi Square, Fisher Score • Top best features: 10, 20, 50. (d) Experiment IV [Diagram: feature extraction and feature selection; Malicious (4) and Tools/Games (4) applications split into train and test; training on Device 1, testing on Device 2] • Recorded 90 features while activating applications • Differentiate between applications which are not included in the training set, when training and testing are performed on different devices. Step I: Differentiating games (23K) and tools (20K) using 25 devices: Rotation Forest / Fisher Score / Top 10, accuracy 87.4% (TPR 0.794, FPR 0.126). Step II: Detecting Android malware (15K) using 25 devices: Logistic Regression / Fisher Score / Top 20, accuracy 75.3% (TPR 0.797, FPR 0.303).
  • 40. Data Leakage Prevention
  • 42. Data Leakage Prevention A data leakage prevention solution is a system designed to detect potential data breach incidents in a timely manner and prevent them by monitoring data while in use (endpoint actions), in motion (network traffic), and at rest (data storage)
  • 43. Honeytokens • Honeytokens - fake digital data (e.g., a credit card number, a database entry or bogus login credentials) planted into a genuine system resource (e.g., databases, files and emails). • Example: – Insert a honey-table: a table with a "sweet" name that can attract a malicious user (e.g. "CREDIT_CARDS") – These tables are not being used by any
  • 44. HoneyGen • Challenge: a good honeytoken is an artificial data item that is hard to distinguish from real tokens • HoneyGen: an Automated Honeytokens Generator [Berkovitch, 2011] – Proposed a generic method for honeytoken generation that, given any database, is able to generate high-quality honeytokens
  • 45. HoneyGen System • Rule mining: extrapolates rules that describe the "real" data structure, attributes, constraints and logic (identity, reference, cardinality, value-set, attribute dependency) • Honeytoken generation • Likelihood rating: sort honeytokens by similarity to real tokens in the input database, according to the commonness of their combination of values. [Diagram. INPUT: DB of real tokens. PROCESS: Rules Mining (produces rules), Honeytokens Generation (produces honeytokens), Likelihood Rating. OUTPUT: honeytokens with likelihood scores]
  • 46. Activity Based Verification
  • 47. Motivation • Identity theft is one of the most common crimes in North America. There are close to 10 million victims of identity theft each year. • The Federal Trade Commission (FTC) estimates that the cost of identity theft to companies is approximately $50 billion per year, in addition to $5 billion in costs to consumers. • These days, user authentication is based on a username and password, which can be stolen physically, by phishing sites or Trojans, or simply given away.
  • 48. Current Authentication Mechanisms – Costly and often Unavailable.
      Mechanism | Description | Disadvantages
      Password | Authentication by predefined user name and password | Hard to remember many passwords; password may be copied, cracked or stolen
      Token | Based on an object (i.e., magnetic card, RFID tag) | Can be lost or stolen; expensive to deploy and maintain for consumer market
      Biometric (Palm/finger) | Biometrics based on a palm signature | Expensive; limited availability
      Biometric (Voice) | Biometrics based on vocal patterns | Accuracy limited with illness or background noise
      Biometric (Iris) | Biometrics based on structural and color patterns of the human iris | Expensive; accuracy problems for diabetes patients
      Secure ID | ID numbers which are constantly generated (i.e., by RSA Secure ID) | Can be lost or stolen
      Users are already hassled by current security mechanisms and reluctant to accept new ones. 22.01.2010
  • 49. Activity Based Verification Solution
  • 50. Keyboard Dynamics Features (example text: "Hello world")
      1) Extract the di-graphs and their corresponding temporal features: H-e 200, e-l 150, l-l 130, l-o 240, o-space 300, space-w 180, w-o 230, o-r 310, r-l 220, l-d 250
      2) Sort the di-graphs based on their temporal features: l-l 130, e-l 150, space-w 180, H-e 200, r-l 220, w-o 230, l-o 240, l-d 250, o-space 300, o-r 310
      3) Group similar di-graphs into clusters, each represented by its mean value:
         Group 5 similar di-graphs to one cluster: {l-l, e-l, space-w, H-e, r-l} 176; {w-o, l-o, l-d, o-space, o-r} 266
         Group 2 similar di-graphs to one cluster: {l-l, e-l} 140; {space-w, H-e} 190; {r-l, w-o} 225; {l-o, l-d} 245; {o-space, o-r} 305
  • 51. Mouse Trajectories Classifier
  • 52. Various actions for learning purposes [Diagram. Actions: Mouse Move, Left Click, Right Click, Double Click, Drag and Drop (DD), Point and Left Click, Point and Right Click, Point and Double Click. Underlying events: Mouse Move, Mouse Down, Mouse Up]
  • 53. Evaluation Measures • False Acceptance Rate (FAR) – the ratio between the number of attacks that were erroneously labeled as authentic interactions and the total number of attacks. • False Rejection Rate (FRR) – the ratio between the number of legitimate interactions that were erroneously labeled as attacks and the total number of legitimate interactions.
  • 54. Evaluation • Fixed text (password) / continuous verification • The session length (number of actions): more actions → better performance • Keyboard is better than mouse
      Session Size | FAR   | FRR   | EER   | AUC
      1/4 Session  | 4.33% | 3.17% | 3.75% | 0.0308
      1/2 Session  | 2.59% | 2.86% | 2.72% | 0.0234
      Full Session | 1.48% | 1.59% | 1.53% | 0.0144
      • ~3% FAR, FRR after 10 actions • Clint Feher, Yuval Elovici, Robert Moskovitch, Lior Rokach, Alon Schclar, “User Identity Verification via Mouse Dynamics”, Information Sciences Volume 201, 15 October 2012, Pages 19–36.
  • 55. Privacy Preserving Data Mining
  • 56. Motivation • Huge databases exist in society today – Medical data – Consumer purchase data – Census data – Communication and media-related data – Data gathered by government agencies • Can this data be utilized? – For medical research – For improving customer service – For homeland security • The Problem: The huge amount of data available means that it is possible to learn a lot of information about individuals
  • 57. Privacy Challenge (Sweeney, 1998) [Diagram attributes: Disease, Birth Date, Zip, Sex, Name] 87% of the population in the USA can be uniquely identified by zip code, sex and date of birth
  • 58. Quasi-identifier • The minimal set of attributes in a table that can be joined with external information to re-identify individual records
  • 59. k-Anonymity Let R(A1,...,An) be a relation and QI be the quasi-identifier associated with it. R is said to satisfy k-anonymity if and only if every distinct value of QI has at least k occurrences in R.
  • 60. Generalization and Suppression. Generalization: replacement of a value by a less specific (more general) value using a domain generalization relationship. Suppression: removal of the value. Zip code hierarchy: Z0 = {53715, 53710, 53706, 53703}, Z1 = {5371*, 5370*}, Z2 = {537**}. Sex hierarchy: S0 = {Male, Female}, S1 = {Person}.
  • 61. Privacy-preserving data mining (PPDM) • Goal: Create accurate data mining models from anonymous data. • Performing anonymization while ignoring the data mining task results in a loss of data quality • Data owners must balance the desire to share useful data and the need to protect private information within the data. Trade-Off
  • 62. k-Anonymity Classification Tree Using Suppression • Induce a classification tree with an existing algorithm (like C4.5) • Walk over the tree and iteratively prune the rules in a bottom-up manner until k-anonymity is reached • The order of attributes in the path (from the root to the leaf) already denotes their importance (from high to low) for predicting the class.
  • 63. Example QI = {Marital Status, Education, Occupation, Sex}, k = 100
      Marital Status = Married
      | Education = High School: <=50K (200)
      | Education = Some college
      | | Occupation = Handlers-cleaners: <=50K (89)
      | | Occupation = Exec managerial: >50K (120)
      Complying nodes – child leaves whose frequency is greater than the k-anonymity threshold. Non-complying nodes – child leaves whose frequency is lower than the k-anonymity threshold. Compensation – use complying nodes to drive the anonymization process by giving up part of their records, using suppression, in favor of non-complying nodes.
  • 64. Example QI = {Marital Status, Education, Occupation, Sex}, k = 100 (same tree as on the previous slide). Resulting anonymized table:
      Marital Status | Education    | Occupation      | Sex | Class | Records
      Married        | High School  | *               | *   | <=50K | 200
      Married        | Some college | *               | *   | <=50K | 89
      Married        | Some college | *               | *   | <=50K | 11
      Married        | Some college | Exec managerial | *   | >50K  | 120-11=109
  • 65. Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, Efficient Multidimensional Suppression for K-Anonymity, IEEE Transactions on Knowledge and Data Engineering, 22(3): 334-347 (2010).
  • 66. Results – Bottom Line [Plot: Accuracy vs. Anonymity Level. x-axis: k-Anonymity Level (0 to 1000); y-axis: Classification Accuracy (78 to 87)]
  • 67. Conclusions
  • 68. Machine Learning and Security • Many current and emerging computer and network security challenges can be solved only by using machine learning techniques: – Information leakage – Data misuse – Anomaly detection – etc… • It is very important to understand how to employ machine learning techniques in an effective way. • In particular: – carefully construct training corpora, – Effective feature extraction – Effective feature selection, and – Valid evaluations on representative corpora.
  • 69. Thank You Lior Rokach liorrk@bgu.ac.il
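
As a minimal sketch of the k-anonymity definition on slide 59 and the generalization hierarchies on slide 60, the following Python checks whether a table is k-anonymous with respect to a quasi-identifier. The function names, the record layout, and the sample disease values are illustrative assumptions and are not taken from the talk; only the zip codes and the Male/Female/Person hierarchy come from the slides.

```python
from collections import Counter

def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a zip code with '*'."""
    if level <= 0:
        return zip_code
    return zip_code[:-level] + "*" * level

def is_k_anonymous(records, quasi_identifier, k):
    """True if every distinct quasi-identifier value occurs at least k times."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(n >= k for n in counts.values())

def generalize(records, zip_level, sex_to_person):
    """Return a generalized copy of the table (the input is left unchanged)."""
    out = []
    for r in records:
        g = dict(r)
        g["zip"] = generalize_zip(r["zip"], zip_level)
        if sex_to_person:
            g["sex"] = "Person"  # S0 = {Male, Female} -> S1 = {Person}
        out.append(g)
    return out

# Hypothetical toy table; the zip codes are the Z0 values from slide 60.
RECORDS = [
    {"zip": "53715", "sex": "Male", "disease": "flu"},
    {"zip": "53710", "sex": "Female", "disease": "asthma"},
    {"zip": "53706", "sex": "Male", "disease": "flu"},
    {"zip": "53703", "sex": "Female", "disease": "diabetes"},
]
QI = ("zip", "sex")

if __name__ == "__main__":
    for zip_level, person in [(0, False), (1, False), (1, True), (2, True)]:
        table = generalize(RECORDS, zip_level, person)
        print(zip_level, person, is_k_anonymous(table, QI, 2))
```

Generalizing only the zip code by one level is not enough for this toy table, because sex still splits each zip group; this is the kind of trade-off between anonymity and data detail that slide 61 describes.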

Editor's Notes

  1. A taxonomy of information security combined with machine learning. An example study could focus on investigating a family of viruses that cause information loss in web applications. A network antivirus is the protective measure described in the study, and the raw data that such a system processes are executable files. The proposed solution uses string signatures. The features for choosing the string are selected using the Fisher score. The signature is found through static code analysis. A classification model is trained using one of the supervised learning algorithms.
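
The Fisher score feature selection mentioned in this note (and on slides 14, 24 and 39) could be sketched roughly as follows. This is a hedged illustration only: the talk does not state which Fisher score formula was used, so the sketch assumes one common two-class variant, |mean_pos - mean_neg| / (std_pos + std_neg), and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev

def fisher_scores(rows, labels):
    """Score each feature column for a two-class problem (labels are 0 or 1).

    Assumed variant: |mean_pos - mean_neg| / (std_pos + std_neg).
    """
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    scores = []
    for j in range(len(rows[0])):
        p = [r[j] for r in pos]
        n = [r[j] for r in neg]
        gap = abs(mean(p) - mean(n))
        spread = pstdev(p) + pstdev(n)
        if spread > 0:
            scores.append(gap / spread)
        else:
            # Constant within both classes: perfect separator or useless.
            scores.append(float("inf") if gap > 0 else 0.0)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring features, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
ROWS = [[5, 1], [6, 0], [5, 1], [1, 0], [0, 1], [1, 1]]
LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    s = fisher_scores(ROWS, LABELS)
    print(s, top_k(s, 1))
```

Keeping only the top-ranked features before training a supervised classifier corresponds to the "top-selection" step on slide 24.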