© Copyright 2015 Pivotal. All rights reserved.
Data Science Driven Malware Detection
Malicious Domain Association
Anirudh Kondaveeti, PhD
Principal Data Scientist
Project Goal
• Goal: Find domains that have time- and user-based co-occurrence relationships to aid the detection of coordinated network attacks.
• Example: Domain A is a watering hole. It redirects users to an exploit kit at Domain B within a short time window.
– B is relatively unknown: Visiting B is a low frequency (support) event.
– B is almost always redirected from A: The conditional probability (confidence) of an initial visit to A is high given that B is visited later on.
Figure: User visits watering hole domain A -> watering hole domain A redirects to domain B -> domain B hosts exploit kit -> user machine compromised.
Data Sources & Preprocessing
• Historical Proxy Logs
– Information about “who is accessing which website at what time”
– Approx. 3 months of data with billions of connection records
• Local Domain White List
– List of non-malicious websites
• Preprocessing
– Host name normalization (anirudh.facebook.com -> facebook.com)
– Filter invalid host names (e.g. www.facebook,ca)
– Identify “unpopular” domains (e.g. www.francelegal.com)
– User-specific sessionization
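The host name normalization and filtering steps listed above could be sketched roughly as follows. This is only an illustration, not the deck's implementation: the function names are hypothetical, and keeping just the last two labels is a simplifying assumption that mishandles multi-part suffixes such as co.uk (a production version would more likely consult a public suffix list).

```python
import re

# One DNS label: letters, digits and hyphens, no leading or trailing hyphen, at most 63 chars.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")


def is_valid_host(host):
    """Hypothetical validity check: at least two labels, each shaped like a DNS label."""
    labels = host.strip().lower().rstrip(".").split(".")
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)


def normalize_host(host):
    """Collapse a host name to its last two labels (assumption: no multi-part suffixes)."""
    labels = host.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


# The two host names used as examples on the slide.
hosts = ["anirudh.facebook.com", "www.facebook,ca"]
normalized = [normalize_host(h) for h in hosts if is_valid_host(h)]
print(normalized)
```

The comma in www.facebook,ca fails the label check, so only the first host survives and is reduced to its registered domain.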
User-Specific Sessionization
• Each user’s proxy logs are sessionized so that two consecutive connections in the same session occur within a user-specified time window (e.g. 60s).
• Sequential patterns are derived from sessionized data.
Connection Time      Domain               Session ID
2015-07-03 12:41:08  googlevideo.com      1
2015-07-03 12:41:09  twitter.com          1
2015-07-03 12:41:12  youtube.com          1
2015-07-03 12:41:14  doubleclick.net      1
2015-07-03 12:41:15  google.com           1
2015-07-03 12:41:15  googleanalytics.com  1
2015-07-03 12:41:28  youtube.com          1
2015-07-03 12:59:23  facebook.com         2
2015-07-03 12:59:24  yahoo.com            2
Connections at 12:41:28 and 12:59:23 are more than 60s apart, so a new session starts.
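The sessionization rule described above (a gap larger than the time window starts a new session) can be illustrated with a short sketch. The deck implements this in SQL on Greenplum; the Python below is only an assumed equivalent for a single user's log, and the function name `sessionize` is hypothetical.

```python
from datetime import datetime, timedelta


def sessionize(events, gap_seconds=60):
    """Assign session IDs to one user's (timestamp, domain) events.

    A new session starts whenever the gap to the previous connection
    exceeds gap_seconds.
    """
    gap = timedelta(seconds=gap_seconds)
    session_id = 0
    previous = None
    result = []
    for ts, domain in sorted(events):
        if previous is None or ts - previous > gap:
            session_id += 1
        result.append((ts, domain, session_id))
        previous = ts
    return result


# A subset of the rows shown on the slide.
events = [
    (datetime(2015, 7, 3, 12, 41, 8), "googlevideo.com"),
    (datetime(2015, 7, 3, 12, 41, 9), "twitter.com"),
    (datetime(2015, 7, 3, 12, 41, 28), "youtube.com"),
    (datetime(2015, 7, 3, 12, 59, 23), "facebook.com"),
    (datetime(2015, 7, 3, 12, 59, 24), "yahoo.com"),
]
sessions = sessionize(events)
for ts, domain, sid in sessions:
    print(ts, domain, sid)
```

In a database setting the same comparison against the previous row would typically be expressed with a window function over each user's rows ordered by time.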
Modeling Approaches
• Sequential Pattern Mining
– Find time-ordered co-occurrence relationships between multiple domains.
– Output low frequency, high confidence sequences of domains: [{Domain1},{Domain2, Domain3},…] => [DomainN].
• Graph Mining
– Build a “social network” graph between domains by creating edges between pairs of domains that are associated with high confidence.
– Use graph-based algorithms to find fully and partially connected subgraphs.
• The two approaches can be used in conjunction to complement each other.
Modeling Framework Design Considerations
• Operational feasibility
– Incremental data processing and modeling on incoming new data, e.g. on a weekly basis, to distribute workload over time.
– Results are updated to incorporate new model outputs.
• Computational tractability
– Implement most of the modeling frameworks in plain SQL, and design efficient window functions to achieve better runtime performance.
– Explicit PL/R routine parallelization to leverage the Massively Parallel Processing architecture of the Greenplum database.
An Incremental Modeling Framework
Initial Run: Initial Proxy Logs & Domain Whitelist -> Preprocessed Proxy Logs -> Model Execution -> Model-Specific Results
• Preprocessing: host normalization & validation, data filtering, sessionization
• Model Execution: Sequential Pattern Mining, Graph Mining
Update: New Proxy Logs & (Possibly) Updated Domain Whitelist -> Preprocessed New Proxy Logs -> Model Update -> Updated Model-Specific Results
• Preprocessing: host normalization & validation, data filtering, sessionization
• Model Update: Sequential Pattern Mining, Graph Mining
Modeling Approaches
Sequential Pattern Mining
Model Execution: Sequential Pattern Mining
Step 1: Create time-ordered domain sequences from sessionized data.
Step 2: Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains.
Step 3: Find high confidence, low support sequential patterns of targeted domains in parallel.
Sequence Creation
• Each sequence contains the domains visited in one session by the same user.
• Domains are ordered by connection time.
• Sequences for the example table below:
– Sequence 1: [{googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com}, {googleanalytics.com}]
– Sequence 2: [{facebook.com}, {yahoo.com}]
Connection Time      Domain               Session ID
2015-01-06 14:41:08  googlevideo.com      1
2015-01-06 14:41:09  twitter.com          1
2015-01-06 14:41:12  youtube.com          1
2015-01-06 14:41:14  doubleclick.net      1
2015-01-06 14:41:15  google.com           1
2015-01-06 14:41:15  googleanalytics.com  1
2015-01-06 14:59:23  facebook.com         2
2015-01-06 14:59:24  yahoo.com            2
Sequence Statistics
Example setting: 10 sequences from a single user.
• sup: Support of a pattern P is the fraction of sequences in which the pattern occurs
– sup({a,e}) = 2/10
• conf: Confidence of a rule X => Y is the proportion of sequences (transactions) containing X that also contain Y
– conf({a => e}) = sup({a,e})/sup({a}) = 2/5
• #users: Number of distinct users for which a pattern P occurs
– #users({a}) = 1
• sup and #users follow the monotone property, i.e. for
– {a,e} ⊇ {a}
– sup({a,e}) ≤ sup({a})
– #users({a,e}) ≤ #users({a})
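To make the three statistics concrete, here is a small sketch that reproduces the slide's numbers (sup({a,e}) = 2/10, conf = 2/5, #users({a}) = 1). The ten sequences are made up for illustration, since the original example figure is not available, and treating a pattern as an ordered, not necessarily contiguous, subsequence is an assumption about how pattern occurrence is defined.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered (not necessarily contiguous) subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(sequences, pattern):
    """Fraction of sequences in which the pattern occurs."""
    return sum(contains(s, pattern) for _, s in sequences) / len(sequences)


def confidence(sequences, lhs, rhs):
    """Among sequences containing lhs, the share that also contain lhs followed by rhs."""
    n_lhs = sum(contains(s, lhs) for _, s in sequences)
    n_rule = sum(contains(s, lhs + rhs) for _, s in sequences)
    return n_rule / n_lhs


def num_users(sequences, pattern):
    """Number of distinct users with at least one sequence containing the pattern."""
    return len({user for user, s in sequences if contains(s, pattern)})


# Hypothetical data: 10 sequences, all from the same user "u1".
raw = [["a", "b", "e"], ["a", "e"], ["a", "b"], ["a", "c"], ["a", "d"],
       ["b", "c"], ["c", "d"], ["e"], ["b", "e"], ["d"]]
sequences = [("u1", s) for s in raw]

print(support(sequences, ["a", "e"]))
print(confidence(sequences, ["a"], ["e"]))
print(num_users(sequences, ["a"]))
```

Because every sequence containing a followed by e also contains a, the support and user count of the longer pattern can never exceed those of the shorter one, which is the property stated above.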
Sequential Pattern Mining (SPM) in Parallel
• Developed a scalable algorithm in Greenplum database (GPDB) to identify low support, high confidence patterns that occur in at least a minimum number of user sequences.
• High confidence patterns relating to a given set of domains are obtained in parallel: i.e., SPM runs independently on different subsets of sequences for different domains.
Pseudo code:
SELECT a_targeted_domain,
       sequential_pattern_mining(min_support, min_confidence, min_num_users)
FROM input_table
Workflow:
Step 1: Find domain A with small support (or known bad domain).
Step 2: Subset sequences from data containing A.
Step 3: Find sequential patterns of A with high confidence.
Repeat for all A in parallel on separate GPDB nodes.
Relative Confidence to Adjust Ranking of Patterns
• For each domain of interest, SPM is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence.
• Recall: confidence(X=>Y) := support(<X,Y>)/support(X) = |<X,Y>|/|X|. Here |X|, the number of sequences in the subset that contain the left hand side pattern, may not reflect the popularity of X in the full dataset.
• We define relative confidence as: relative_confidence(X=>Y) := |<X,Y>|/|X|_fullset, where |X|_fullset is the number of sequences in the full dataset that contain the left hand side pattern.
• Relative confidence favors patterns whose left hand side contains less popular domains (compare the Rel Conf column in the table below).
Domain            Pattern                                                  Supp   Conf  Rel Conf
revenueindia.net  <{google.com},{facebook.com}> => <{revenueindia.net}>    0.079  0.75  0.0001
revenueindia.net  <{google.com},{fileshare.com}> => <{revenueindia.net}>   0.071  0.75  0.067
revenueindia.net  <{fileshare.com},{redworm.com}> => <{revenueindia.net}>  0.030  1.00  0.51
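The difference between the two measures comes down to which denominator is used. The sketch below uses made-up counts (they are not taken from the project data) chosen so that two rules share the same subset confidence but differ sharply in relative confidence; the function names are hypothetical.

```python
def confidence(n_rule, n_lhs_subset):
    """|<X,Y>| / |X|, with |X| counted only in the subset of sequences containing the target domain."""
    return n_rule / n_lhs_subset


def relative_confidence(n_rule, n_lhs_fullset):
    """|<X,Y>| / |X|_fullset, with |X| counted in the full dataset."""
    return n_rule / n_lhs_fullset


# Hypothetical counts: both rules occur 3 times and their left hand side
# appears 4 times within the subset, so the ordinary confidence is identical.
popular_lhs = {"n_rule": 3, "n_lhs_subset": 4, "n_lhs_fullset": 30000}
unpopular_lhs = {"n_rule": 3, "n_lhs_subset": 4, "n_lhs_fullset": 45}

for name, c in [("popular LHS", popular_lhs), ("unpopular LHS", unpopular_lhs)]:
    conf = confidence(c["n_rule"], c["n_lhs_subset"])
    rel = relative_confidence(c["n_rule"], c["n_lhs_fullset"])
    print(f"{name}: conf={conf:.2f} rel_conf={rel:.4f}")
```

A left hand side made of very popular domains appears in a huge number of sequences overall, so its relative confidence collapses even though its subset confidence looks high.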
Model Update: Sequential Pattern Mining
• The model update module for sequential pattern mining follows a workflow similar to that of its model execution module.
• The one additional step is to merge the new results obtained from the incoming new data with the existing set of patterns, including updating rule quality metrics: support, confidence, etc.
Step 1: Create time-ordered domain sequences from new sessionized data.
Step 2: Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains.
Step 3: Find high confidence, low support sequential patterns of targeted domains in parallel.
Step 4: Merge new results with the existing set of patterns.
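The deck does not say how the merge is carried out. One plausible approach, sketched here purely as an assumption, is to keep raw counts per rule rather than ratios, add the counts from the new batch, and recompute support and confidence afterwards. All names and numbers below are hypothetical.

```python
def merge_rule_counts(existing, new):
    """Sum per-rule counts from two batches; each dict maps rule -> (n_rule, n_lhs)."""
    merged = dict(existing)
    for rule, (n_rule, n_lhs) in new.items():
        old_rule, old_lhs = merged.get(rule, (0, 0))
        merged[rule] = (old_rule + n_rule, old_lhs + n_lhs)
    return merged


def rule_metrics(counts, n_sequences):
    """Recompute support and confidence from merged counts."""
    return {
        rule: {"support": n_rule / n_sequences, "confidence": n_rule / n_lhs}
        for rule, (n_rule, n_lhs) in counts.items()
    }


# Hypothetical batches of 10 sequences each.
existing = {("a", "e"): (2, 5)}
new = {("a", "e"): (1, 5), ("b", "c"): (3, 4)}

merged = merge_rule_counts(existing, new)
metrics = rule_metrics(merged, n_sequences=20)
print(metrics)
```

This only gives exact results if left hand side counts are tracked in every batch for every rule; a rule first seen in the new batch has no left hand side count from earlier data, so its confidence would be overstated unless those counts are backfilled.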
Modeling Approaches
Graph Mining
Model Execution: Graph Mining
Step 1: Construct “baskets” of domains (co-occurrence domains) by running a sliding window of a certain time interval through the data.
Step 2: Find high confidence, low support pairwise association rules of the form Domain 1 => Domain 2.
Step 3: Create a social network of domains.
Step 4: Find partially and fully connected sub-graphs.
Construction of “Baskets”
• Domains visited by a user in a certain time window form a “basket”, analogous to items purchased in a single transaction in market basket analysis.
• The time interval for the sliding window (60s window used in the implementation) can be tuned.
• A basket contains the distinct domains in a sliding window. Example from the table below:
Basket 1 = {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}
Connection Time      Domain
2015-01-06 14:41:00  googlevideo.com
2015-01-06 14:41:09  twitter.com
2015-01-06 14:41:12  youtube.com
2015-01-06 14:41:14  doubleclick.net
2015-01-06 14:42:00  google.com
2015-01-06 14:42:05  googleanalytics.com
2015-01-06 14:42:08  pivotal.io
2015-01-06 14:59:23  facebook.com
2015-01-06 14:59:24  yahoo.com
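A rough sketch of basket construction for one user follows. The exact windowing rule is not spelled out in the deck, so this assumes that each connection opens a window and that the window includes connections up to and including 60s later; that assumption reproduces Basket 1 from the slide. The function name `build_baskets` is hypothetical.

```python
from datetime import datetime, timedelta


def build_baskets(events, window_seconds=60):
    """For each connection, collect the distinct domains seen within the following window."""
    events = sorted(events)
    window = timedelta(seconds=window_seconds)
    baskets = []
    for i, (start, _) in enumerate(events):
        basket = set()
        for ts, domain in events[i:]:
            if ts - start > window:
                break  # events are sorted, so nothing later can fall in the window
            basket.add(domain)
        baskets.append(basket)
    return baskets


# Rows from the slide's example table.
events = [
    (datetime(2015, 1, 6, 14, 41, 0), "googlevideo.com"),
    (datetime(2015, 1, 6, 14, 41, 9), "twitter.com"),
    (datetime(2015, 1, 6, 14, 41, 12), "youtube.com"),
    (datetime(2015, 1, 6, 14, 41, 14), "doubleclick.net"),
    (datetime(2015, 1, 6, 14, 42, 0), "google.com"),
    (datetime(2015, 1, 6, 14, 42, 5), "googleanalytics.com"),
    (datetime(2015, 1, 6, 14, 42, 8), "pivotal.io"),
    (datetime(2015, 1, 6, 14, 59, 23), "facebook.com"),
    (datetime(2015, 1, 6, 14, 59, 24), "yahoo.com"),
]
baskets = build_baskets(events)
print(sorted(baskets[0]))
```

The nested loop is quadratic in the worst case; on billions of records the same idea would more likely be expressed as a range join or window function in the database.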
Pairwise Association Rule Mining
• Given domain-to-basket assignments, pairwise association rule mining mainly involves evaluation of:
– Co-occurrence frequency: the number of times two domains fall in a common basket.
– Conditional probability: probability of seeing domain 2 given domain 1 is present.
• Pairwise rule mining is implemented in plain SQL in a scalable fashion.
Domain A    Domain B            # {A,B}  # A  # B  P(A|B)    P(B|A)    # A to B  # B to A  # AB Same Time  Max(# User Names/M)  # Date  Min Date    Max Date
pivotal.io  montecarlo.com      10       560  10   1.000000  0.017857  9         0         1               1                    1       2015-02-26  2015-02-26
pivotal.io  bigbangtheory.com   25       560  26   0.961538  0.044643  21        4         0               2                    1       2015-02-23  2015-02-23
pivotal.io  sciencefiction.com  78       560  97   0.804124  0.139286  61        15        2               4                    8       2015-01-23  2015-02-17
High confidence (>0.5) associations involving multiple users over several days are generally more interesting.
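The two quantities named above can be computed from baskets with simple counting. The deck implements this in plain SQL; the Python below is only an assumed equivalent on a toy input, with hypothetical names. The relationship it encodes, P(A|B) = #{A,B} / #B and P(B|A) = #{A,B} / #A, matches the table (for example 10/10 = 1.000000 and 10/560 = 0.017857 in the first row).

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(baskets):
    """Count single and pairwise basket occurrences and derive conditional probabilities."""
    single = Counter()
    pair = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        single.update(items)
        pair.update(combinations(items, 2))  # each unordered pair once per basket
    rules = {}
    for (a, b), n_ab in pair.items():
        rules[(a, b)] = {
            "n_ab": n_ab,
            "n_a": single[a],
            "n_b": single[b],
            "p_a_given_b": n_ab / single[b],
            "p_b_given_a": n_ab / single[a],
        }
    return rules


# Toy baskets with made-up domain names.
toy_baskets = [{"x.com", "y.com"}, {"x.com", "y.com"}, {"x.com"}, {"x.com", "z.com"}]
rules = pairwise_rules(toy_baskets)
for pair_key, stats in sorted(rules.items()):
    print(pair_key, stats)
```

The directional and per-user columns in the table (# A to B, # User Names, dates) would need the timestamps and user IDs carried along with each basket, which this sketch leaves out.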
Exploring Interactions between Domains
• To explore the interactions between domains, we build an undirected correlation graph using the discovered pairwise domain association rules.
• Each node in the graph is a domain. An edge connects two domains if their co-occurrence confidence is higher than a threshold (e.g. 0.2).
• The example figure shows the tightly connected “social network” of a particular domain.
• Partially and fully connected networks indicate possible watering hole or botnet attacks.
• Question: How to quantify the connectivity of a network?
Figure: example domain graph with nodes abc.com, xyz.com, hga.com and hebf.com. Each node denotes a domain; the weight of an edge denotes the confidence (edge weights shown: 0.25, 0.37, 0.71, 0.52, 0.1, 0.6, 0.1).
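Building the graph from pairwise rules might look like the sketch below. The deck does not state how the two directional probabilities are combined into a single undirected edge weight, so taking the larger of P(A|B) and P(B|A) is an assumption here, as is the function name. Two of the input rows reuse values from the pairwise rule table; the third is made up to show a pair that falls below the threshold.

```python
from collections import defaultdict


def build_graph(rules, threshold=0.2):
    """Return an undirected weighted graph as {node: {neighbor: weight}}."""
    graph = defaultdict(dict)
    for a, b, p_a_given_b, p_b_given_a in rules:
        weight = max(p_a_given_b, p_b_given_a)  # assumption: use the stronger direction
        if weight > threshold:
            graph[a][b] = weight
            graph[b][a] = weight
    return dict(graph)


rules = [
    ("pivotal.io", "montecarlo.com", 1.000000, 0.017857),
    ("pivotal.io", "sciencefiction.com", 0.804124, 0.139286),
    ("pivotal.io", "example.org", 0.05, 0.01),  # hypothetical low confidence pair
]
graph = build_graph(rules)
print(graph)
```

Storing each edge in both directions keeps neighbor lookups simple for the graph algorithms that follow.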
OddBall Metrics for Graph Anomaly Detection
• We take the OddBall approach* to quantify the connectivity of each domain’s network:
– Identify each domain’s one-step neighborhood (also called ego-net).
– Extract two graph features from the ego-net:
▪ N: Number of neighbors
▪ E: Number of edges in the ego-net
• The number of neighbors and the number of edges follow a power law: E ∝ N^α, 1 ≤ α ≤ 2
• Use log(E)/log(N) to approximate the slope. log(E)/log(N) > 1 indicates some degree of connectivity among neighbors.
• The higher the ratio, the higher the degree of connectivity (given the same number of neighbors). Generally an OddBall ratio of >1.5 is more interesting.
• One can additionally compute the clique percentage, the ratio between E and the number of edges needed to form a clique, E/[(N^2+N)/2], to measure network connectivity.
* OddBall: Spotting Anomalies in Weighted Graphs, Leman Akoglu et al., PAKDD, Hyderabad, India, June 2010.
Picture source: ICDM’12 tutorial on graph anomaly detection
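The ego-net features and the two ratios can be computed directly from an adjacency structure. The sketch below is an illustration with hypothetical names, not the project's code; it assumes an undirected graph stored as a dict from node to its set of neighbors. For a domain whose 4 neighbors are all connected to each other, E = 10, log(10)/log(4) ≈ 1.66 and the clique percentage is 10/[(16+4)/2] = 100%.

```python
import math


def oddball_features(graph, node):
    """Return (N, E, log(E)/log(N), clique fraction) for the ego-net of node."""
    neighbors = set(graph.get(node, ()))
    ego = neighbors | {node}
    n = len(neighbors)
    # Count each undirected edge inside the ego-net once (u < v avoids double counting).
    e = sum(1 for u in ego for v in graph.get(u, ()) if v in ego and u < v)
    if n < 2 or e == 0:
        ratio = float("nan")  # log(N) is zero or undefined
    else:
        ratio = math.log(e) / math.log(n)
    clique_fraction = e / ((n * n + n) / 2) if n > 0 else float("nan")
    return n, e, ratio, clique_fraction


# Fully connected toy graph on five made-up nodes, and a star with the same center.
nodes = ["a", "b", "c", "d", "e"]
clique = {u: {v for v in nodes if v != u} for u in nodes}
star = {"a": {"b", "c", "d", "e"}, "b": {"a"}, "c": {"a"}, "d": {"a"}, "e": {"a"}}

print(oddball_features(clique, "a"))
print(oddball_features(star, "a"))
```

The star has the same number of neighbors but only 4 edges, so its ratio is 1.0 and its clique fraction 0.4, which shows how the metrics separate tightly knit neighborhoods from loose ones.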
Sample Domains with Highly Connected Networks
Domain  # Neighbors  Neighbors                                                        # Edges  log(E)/log(N)  Clique Percent  # User Names
a.com   4            {b.com, c.com, d.com, e.com}                                     10       1.66           100%            6
s.com   7            {a.com, b.com, c.com, d.com, e.com, f.com}                       27       1.69           96%             9
r.com   9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}  43       1.71           96%             7
abc.ru  9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}  42       1.70           93%             11
The highlighted domain (a.com, clique percent 100%) has a fully connected network, a clique!
Figure: network of a.com with neighbors b.com, c.com, d.com and e.com.
Detecting Isolated Clusters
• Given the domain correlation graph, one can also identify isolated groups of domains that only interact with domains in the same group, but not others (a botnet-like structure).
• This can be formulated as the task of finding connected components (CCs) in a graph.
• The example below shows that malicious sites tend to exist in small CCs.
Figure: Sample connected component with nodes qre.com, jekc.com, fbc.com, abc.com, ghk.com and bcd.com; one node is marked as a known malicious site.
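Finding connected components is a standard graph traversal. The sketch below uses breadth-first search over an adjacency dict; it is a generic illustration rather than the project's implementation, and the edges in the toy graph are made up (the deck does not list the edges of its sample component).

```python
from collections import deque


def connected_components(graph):
    """Return a list of node sets, one per connected component of an undirected graph."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical graph: one small isolated group and one separate pair.
toy_graph = {
    "qre.com": {"jekc.com", "fbc.com"},
    "jekc.com": {"qre.com"},
    "fbc.com": {"qre.com", "abc.com"},
    "abc.com": {"fbc.com"},
    "m.example": {"n.example"},
    "n.example": {"m.example"},
}
components = connected_components(toy_graph)
small = [c for c in components if len(c) <= 5]
print(components)
```

Filtering components by size, as in the last step, is one simple way to surface the small isolated groups the slide refers to.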
Operationalization and Outlook
Operationalization Vision
• Run Algorithms
– Owned by Data Engineer/Data Scientist
– Incrementally (e.g. weekly) update models using new batches of data, e.g. as a cron job
• Inspect Anomalies
– Owned by security team
– Ideally model outputs provided via interactive web dashboards
• Evaluate Model Outputs
– Feedback on model performance from security team
– Opportunities for refinement and ideas for new models
• Refine Algorithms
– Owned by Data Scientist
• Load New Data
– Owned by Data Engineer
More Related Content

What's hot

Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaCloudera, Inc.
 
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...Kai Wähner
 
Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Gabriel Moreira
 
Hortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your dataHortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your dataScott Clinton
 
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthLessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthHostedbyConfluent
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissanceCloudera, Inc.
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Sri Ambati
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Impetus Technologies
 
H2O World - Clustering & Feature Extraction on Text - Seth Redmore
H2O World - Clustering & Feature Extraction on Text - Seth RedmoreH2O World - Clustering & Feature Extraction on Text - Seth Redmore
H2O World - Clustering & Feature Extraction on Text - Seth RedmoreSri Ambati
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Open Source Data Management for Industry 4.0
Open Source Data Management for Industry 4.0Open Source Data Management for Industry 4.0
Open Source Data Management for Industry 4.0DataWorks Summit
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyNati Shalom
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachSoftServe
 
Threat Detection and Response at Scale with Dominique Brezinski
Threat Detection and Response at Scale with Dominique BrezinskiThreat Detection and Response at Scale with Dominique Brezinski
Threat Detection and Response at Scale with Dominique BrezinskiDatabricks
 
Perspectives on Ethical Big Data Governance
Perspectives on Ethical Big Data GovernancePerspectives on Ethical Big Data Governance
Perspectives on Ethical Big Data GovernanceCloudera, Inc.
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonSri Ambati
 
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
LaGatta and de Garrigues - Splunk for Data Science - .conf2014LaGatta and de Garrigues - Splunk for Data Science - .conf2014
LaGatta and de Garrigues - Splunk for Data Science - .conf2014Tom LaGatta
 

What's hot (20)

Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
 
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
 
Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Python for Data Science - TDC 2015
Python for Data Science - TDC 2015
 
Hortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your dataHortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your data
 
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthLessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity Renaissance
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
 
H2O World - Clustering & Feature Extraction on Text - Seth Redmore
H2O World - Clustering & Feature Extraction on Text - Seth RedmoreH2O World - Clustering & Feature Extraction on Text - Seth Redmore
H2O World - Clustering & Feature Extraction on Text - Seth Redmore
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Open Source Data Management for Industry 4.0
Open Source Data Management for Industry 4.0Open Source Data Management for Industry 4.0
Open Source Data Management for Industry 4.0
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case Study
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
 
Threat Detection and Response at Scale with Dominique Brezinski
Threat Detection and Response at Scale with Dominique BrezinskiThreat Detection and Response at Scale with Dominique Brezinski
Threat Detection and Response at Scale with Dominique Brezinski
 
Perspectives on Ethical Big Data Governance
Perspectives on Ethical Big Data GovernancePerspectives on Ethical Big Data Governance
Perspectives on Ethical Big Data Governance
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
 
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
LaGatta and de Garrigues - Splunk for Data Science - .conf2014LaGatta and de Garrigues - Splunk for Data Science - .conf2014
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
 

Viewers also liked

[FAST CAMPUS] 1강 data science overview
[FAST CAMPUS] 1강 data science overview [FAST CAMPUS] 1강 data science overview
[FAST CAMPUS] 1강 data science overview chanyoonkim
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsSri Ambati
 
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...VMware Tanzu
 
Pivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
Pivotal Digital Transformation Forum: Becoming a Data Driven EnterprisePivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
Pivotal Digital Transformation Forum: Becoming a Data Driven EnterpriseVMware Tanzu
 
저성장 시대 데이터 경제만이 살길이다
저성장 시대 데이터 경제만이 살길이다저성장 시대 데이터 경제만이 살길이다
저성장 시대 데이터 경제만이 살길이다eungjin cho
 
Pivotal Digital Transformation Forum: Data Science
Pivotal Digital Transformation Forum: Data Science Pivotal Digital Transformation Forum: Data Science
Pivotal Digital Transformation Forum: Data Science VMware Tanzu
 
Data Science - Part XIV - Genetic Algorithms
Data Science - Part XIV - Genetic AlgorithmsData Science - Part XIV - Genetic Algorithms
Data Science - Part XIV - Genetic AlgorithmsDerek Kane
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
Data Science - Part X - Time Series Forecasting
Data Science - Part X - Time Series ForecastingData Science - Part X - Time Series Forecasting
Data Science - Part X - Time Series ForecastingDerek Kane
 
Data Science - Part XIII - Hidden Markov Models
Data Science - Part XIII - Hidden Markov ModelsData Science - Part XIII - Hidden Markov Models
Data Science - Part XIII - Hidden Markov ModelsDerek Kane
 
Data Science - Part XVII - Deep Learning & Image Processing
Data Science - Part XVII - Deep Learning & Image ProcessingData Science - Part XVII - Deep Learning & Image Processing
Data Science - Part XVII - Deep Learning & Image ProcessingDerek Kane
 
To Serve and Protect: Making Sense of Hadoop Security
To Serve and Protect: Making Sense of Hadoop Security To Serve and Protect: Making Sense of Hadoop Security
To Serve and Protect: Making Sense of Hadoop Security Inside Analysis
 
MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...
MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...
MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...APNIC
 
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...CA API Management
 
Data Security and Privacy by Contract: Hacking Us All Into Business Associate...
Data Security and Privacy by Contract: Hacking Us All Into Business Associate...Data Security and Privacy by Contract: Hacking Us All Into Business Associate...
Data Security and Privacy by Contract: Hacking Us All Into Business Associate...Shawn Tuma
 
State of Application Security Vol. 4
State of Application Security Vol. 4State of Application Security Vol. 4
State of Application Security Vol. 4IBM Security
 
frog IoT Big Design IoT World Congress 2015
frog IoT Big Design IoT World Congress 2015frog IoT Big Design IoT World Congress 2015
frog IoT Big Design IoT World Congress 2015Patrick Kalaher
 
IoT and BD Introduction
IoT and BD IntroductionIoT and BD Introduction
IoT and BD IntroductionWayne Sun
 

Viewers also liked (20)

[FAST CAMPUS] 1강 data science overview
[FAST CAMPUS] 1강 data science overview [FAST CAMPUS] 1강 data science overview
[FAST CAMPUS] 1강 data science overview
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
 
Pivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
Pivotal Digital Transformation Forum: Becoming a Data Driven EnterprisePivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
Pivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
 
저성장 시대 데이터 경제만이 살길이다
저성장 시대 데이터 경제만이 살길이다저성장 시대 데이터 경제만이 살길이다
저성장 시대 데이터 경제만이 살길이다
 
Pivotal Digital Transformation Forum: Data Science
Pivotal Digital Transformation Forum: Data Science Pivotal Digital Transformation Forum: Data Science
Pivotal Digital Transformation Forum: Data Science
 
What Is the Future of Data Sharing?
What Is the Future of Data Sharing?What Is the Future of Data Sharing?
What Is the Future of Data Sharing?
 
Data Science - Part XIV - Genetic Algorithms
Data Science - Part XIV - Genetic AlgorithmsData Science - Part XIV - Genetic Algorithms
Data Science - Part XIV - Genetic Algorithms
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Data Science - Part X - Time Series Forecasting
Data Science - Part X - Time Series ForecastingData Science - Part X - Time Series Forecasting
Data Science - Part X - Time Series Forecasting
 
Data Science - Part XIII - Hidden Markov Models
Data Science - Part XIII - Hidden Markov ModelsData Science - Part XIII - Hidden Markov Models
Data Science - Part XIII - Hidden Markov Models
 
Data Science - Part XVII - Deep Learning & Image Processing
Data Science - Part XVII - Deep Learning & Image ProcessingData Science - Part XVII - Deep Learning & Image Processing
Data Science - Part XVII - Deep Learning & Image Processing
 
To Serve and Protect: Making Sense of Hadoop Security
To Serve and Protect: Making Sense of Hadoop Security To Serve and Protect: Making Sense of Hadoop Security
To Serve and Protect: Making Sense of Hadoop Security
 
MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...
MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...
MATATABI: Cyber Threat Analysis and Defense Platform using Huge Amount of Dat...
 
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
Balancing Mobile UX & Security: An API Management Perspective Presentation fr...
 
Data Security and Privacy by Contract: Hacking Us All Into Business Associate...
Data Security and Privacy by Contract: Hacking Us All Into Business Associate...Data Security and Privacy by Contract: Hacking Us All Into Business Associate...
Data Security and Privacy by Contract: Hacking Us All Into Business Associate...
 
State of Application Security Vol. 4
State of Application Security Vol. 4State of Application Security Vol. 4
State of Application Security Vol. 4
 
Senzations’15: Secure Internet of Things
Senzations’15: Secure Internet of ThingsSenzations’15: Secure Internet of Things
Senzations’15: Secure Internet of Things
 
frog IoT Big Design IoT World Congress 2015
frog IoT Big Design IoT World Congress 2015frog IoT Big Design IoT World Congress 2015
frog IoT Big Design IoT World Congress 2015
 
IoT and BD Introduction
IoT and BD IntroductionIoT and BD Introduction
IoT and BD Introduction
 

Data Science Driven Malware Detection

  • 1. Data Science Driven Malware Detection: Malicious Domain Association. Anirudh Kondaveeti, PhD, Principal Data Scientist. © Copyright 2015 Pivotal. All rights reserved.
  • 2. Project Goal
      – Goal: Find domains that have time and user based co-occurrence relationships to aid the detection of coordinated network attacks.
      – Example: Domain A is a watering hole. It redirects users to an exploit kit at Domain B within a short time window.
          – B is relatively unknown: Visiting B is a low frequency (support) event.
          – B is almost always redirected from A: The conditional probability (confidence) of an initial visit to A is high given B is visited later on.
      – Attack flow (diagram): User visits watering hole domain A -> Watering hole domain A redirects to domain B -> Domain B hosts exploit kit -> User machine compromised
  • 3. Data Sources & Preprocessing
      – Historical Proxy Logs
          – Information about “who is accessing which website at what time”
          – Approx. 3 months of data with billions of connection records
      – Local Domain White List
          – List of non-malicious websites
      – Preprocessing
          – Host Name Normalization (anirudh.facebook.com -> facebook.com)
          – Filter Invalid Host Names (e.g. www.facebook,ca)
          – Identify “unpopular” domains (e.g. www.francelegal.com)
          – User-Specific Sessionization
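The first two preprocessing steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: the function names and the validity pattern are assumptions, and keeping only the last two labels is a naive rule that would mis-handle multi-part suffixes such as `co.uk` (a real implementation would likely consult a public suffix list).

```python
import re

# Hypothetical validity rule: dot-separated labels of letters, digits and
# hyphens, ending in an alphabetic top-level label of at least two characters.
_LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
_HOST_RE = re.compile(rf"^(?:{_LABEL}\.)+[a-z]{{2,}}$")


def is_valid_host(host: str) -> bool:
    """Return True if the host name matches the assumed validity pattern."""
    return bool(_HOST_RE.match(host.strip().lower()))


def normalize_host(host: str) -> str:
    """Naive normalization: keep only the last two labels of the host name."""
    labels = host.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])
```

With the slide's examples, `normalize_host("anirudh.facebook.com")` gives `facebook.com`, and `is_valid_host("www.facebook,ca")` is false because of the comma.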
  • 4. User-Specific Sessionization
      – Each user’s proxy logs are sessionized so that two consecutive connections in the same session occur within a user-specified time window (e.g. 60s).
      – Sequential patterns are derived from sessionized data.
      Connection Time | Domain | Session ID
      2015-07-03 12:41:08 | googlevideo.com | 1
      2015-07-03 12:41:09 | twitter.com | 1
      2015-07-03 12:41:12 | youtube.com | 1
      2015-07-03 12:41:14 | doubleclick.net | 1
      2015-07-03 12:41:15 | google.com | 1
      2015-07-03 12:41:15 | googleanalytics.com | 1
      2015-07-03 12:41:28 | youtube.com | 1
      (>60s apart, start a new session)
      2015-07-03 12:59:23 | facebook.com | 2
      2015-07-03 12:59:24 | yahoo.com | 2
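The sessionization rule on this slide can be sketched in a few lines. This is an illustrative Python version for a single user's log, assuming timestamps in the format shown on the slide; the function name `sessionize` and its parameters are hypothetical, and the talk itself describes a SQL implementation in Greenplum.

```python
from datetime import datetime, timedelta


def sessionize(records, window_seconds=60):
    """Assign session IDs to one user's (timestamp_string, domain) records.

    A new session starts whenever the gap to the previous connection
    exceeds window_seconds.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    parsed = sorted(
        ((datetime.strptime(ts, fmt), dom) for ts, dom in records),
        key=lambda r: r[0],
    )
    window = timedelta(seconds=window_seconds)
    out, session_id, prev = [], 0, None
    for ts, dom in parsed:
        if prev is None or ts - prev > window:
            session_id += 1  # first record, or gap larger than the window
        out.append((ts, dom, session_id))
        prev = ts
    return out
```

Applied to the nine connections in the slide's table, the first seven fall into session 1 and the last two into session 2.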
  • 5. Modeling Approaches
      – Sequential Pattern Mining
          – Find time-ordered co-occurrence relationships between multiple domains.
          – Output low frequency, high confidence sequences of domains: [{Domain1},{Domain2, Domain3},…] => [DomainN].
      – Graph Mining
          – Build a “social network” graph between domains by creating edges between pairs of domains that are associated with high confidence.
          – Use graph-based algorithms to find fully and partially connected subgraphs.
      – The two approaches can be used in conjunction to complement each other.
  • 6. Modeling Framework Design Considerations
      – Operational feasibility
          – Incremental data processing and modeling on incoming new data, e.g. on a weekly basis, to distribute workload over time.
          – Results are updated to incorporate new model outputs.
      – Computational tractability
          – Implement most of the modeling frameworks in plain SQL, and design efficient window functions to achieve better runtime performance.
          – Explicit PL/R routine parallelization to leverage the Massively Parallel Processing architecture of the Greenplum database.
  • 7. An Incremental Modeling Framework
      – Initial Run: Initial Proxy Logs & Domain Whitelist -> Preprocessed Proxy Logs (host normalization & validation, data filtering, sessionization) -> Model Execution (Sequential Pattern Mining, Graph Mining) -> Model-Specific Results
      – Update: New Proxy Logs & (Possibly) Updated Domain Whitelist -> Preprocessed New Proxy Logs (host normalization & validation, data filtering, sessionization) -> Model Update (Sequential Pattern Mining, Graph Mining) -> Updated Model-Specific Results
  • 8. Modeling Approaches: Sequential Pattern Mining
  • 9. Model Execution: Sequential Pattern Mining
      – Step 1: Create time-ordered domain sequences from sessionized data.
      – Step 2: Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains.
      – Step 3: Find high confidence, low support sequential patterns of targeted domains in parallel.
  • 10. Sequence Creation
      – Each sequence contains domains in a session by the same user.
      – Domains are ordered by connection time.
      – Sequences for the example table below:
          – Sequence 1: [{googlevideo.com}, {twitter.com}, {youtube.com}, {doubleclick.net}, {google.com}, {googleanalytics.com}]
          – Sequence 2: [{facebook.com}, {yahoo.com}]
      Connection Time | Domain | Session ID
      2015-01-06 14:41:08 | googlevideo.com | 1
      2015-01-06 14:41:09 | twitter.com | 1
      2015-01-06 14:41:12 | youtube.com | 1
      2015-01-06 14:41:14 | doubleclick.net | 1
      2015-01-06 14:41:15 | google.com | 1
      2015-01-06 14:41:15 | googleanalytics.com | 1
      2015-01-06 14:59:23 | facebook.com | 2
      2015-01-06 14:59:24 | yahoo.com | 2
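Turning sessionized rows into sequences amounts to a group-by on the session ID. The sketch below assumes rows of the form (timestamp string, domain, session ID) for one user and, following the slide's example, puts each connection into its own single-domain itemset; `build_sequences` is a hypothetical name.

```python
from itertools import groupby


def build_sequences(rows):
    """Group one user's (timestamp, domain, session_id) rows into sequences.

    Returns {session_id: [ {domain}, {domain}, ... ]} with itemsets ordered
    by connection time. ISO-style timestamp strings sort chronologically.
    """
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))  # stable for equal times
    return {
        sid: [{dom} for _, dom, _ in group]
        for sid, group in groupby(ordered, key=lambda r: r[2])
    }
```

For the table on this slide, session 1 yields a six-element sequence and session 2 yields `[{facebook.com}, {yahoo.com}]`.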
  • 11. Sequence Statistics
      – sup: Support of a pattern P is the fraction of sequences in which the pattern occurs.
          – sup({a,e}) = 2/10
      – conf: Confidence of a rule X => Y is the proportion of sequences containing X that also contain Y.
          – conf({a => e}) = sup({a,e})/sup({a}) = 2/5
      – #users: Number of distinct users for which a pattern P occurs.
          – #users({a}) = 1
      – sup and #users follow the monotone property, i.e. for {a,e} ⊇ {a}:
          – sup({a,e}) ≤ sup({a})
          – #users({a,e}) ≤ #users({a})
      (Example figure: 10 sequences from a single user.)
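The three statistics can be computed directly from a list of sequences. The sketch below is illustrative only: it treats each sequence as a flat list of items and a pattern as an ordered (not necessarily contiguous) subsequence, and the ten toy sequences are invented to reproduce the slide's ratios, since the slide's own example data is not part of this transcript.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(sequences, pattern):
    """Fraction of (user, sequence) pairs whose sequence contains pattern."""
    return sum(contains(s, pattern) for _, s in sequences) / len(sequences)


def confidence(sequences, lhs, rhs):
    """conf(lhs => rhs) = sup(lhs followed by rhs) / sup(lhs); sup(lhs) must be > 0."""
    return support(sequences, lhs + rhs) / support(sequences, lhs)


def num_users(sequences, pattern):
    """Number of distinct users with at least one sequence containing pattern."""
    return len({user for user, s in sequences if contains(s, pattern)})


# Invented toy data: 10 sequences from a single hypothetical user "u1".
toy = [("u1", s) for s in [
    ["a", "b", "e"], ["a", "e"], ["a", "c"], ["a", "d"], ["a"],
    ["b", "c"], ["c", "d"], ["e", "b"], ["b"], ["d", "e"],
]]
```

On this toy data, `a` occurs in 5 of 10 sequences and `a` followed by `e` in 2 of 10, so the support of the pair is 2/10 and the confidence of a => e is 2/5, matching the ratios quoted on the slide.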
  • 12. Sequential Pattern Mining (SPM) in Parallel
      – Developed a scalable algorithm in Greenplum database (GPDB) to identify low support, high confidence patterns that occur in at least a minimum number of user sequences.
      – High confidence patterns relating to a given set of domains are obtained in parallel: i.e., SPM runs independently on different subsets of sequences for different domains.
          SELECT a_targeted_domain,
                 sequential_pattern_mining(min_support, min_confidence, min_num_users)
          FROM input_table
      – Pseudo code:
          – Find domain A with small support (or known bad domain)
          – Subset sequences from data containing A
          – Find sequential patterns of A with high confidence
          – Repeat for all A in parallel on separate GPDB nodes
  • 13. Relative Confidence to Adjust Ranking of Patterns
      – For each domain of interest, SPM is run only on the subset of sequences containing that domain. This may cause some sequential patterns to have artificially high confidence.
      – Recall: confidence(X => Y) := support(<X,Y>) / support(X) = |<X,Y>| / |X|. Here |X|, the number of sequences in the subset that contain the left hand side pattern, may not reflect the popularity of X in the full dataset.
      – We define relative confidence as: relative_confidence(X => Y) := |<X,Y>| / |X_i|_fullset, where |X_i|_fullset is the number of sequences in the full dataset that contain the left hand side pattern.
      – Relative confidence favors the pattern whose left hand side contains less popular domains (see the example table below).
      Table: Relative confidence favors unpopular left hand side patterns
      Domain | Pattern | Supp | Conf | Rel Conf
      revenueindia.net | <{google.com},{facebook.com}> => <{revenueindia.net}> | 0.079 | 0.75 | 0.0001
      revenueindia.net | <{google.com},{fileshare.com}> => <{revenueindia.net}> | 0.071 | 0.75 | 0.067
      revenueindia.net | <{fileshare.com},{redworm.com}> => <{revenueindia.net}> | 0.030 | 1.00 | 0.51
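The difference between the two measures is only the denominator, which a small worked example makes concrete. The counts below are hypothetical (the slide reports ratios, not raw counts) and the function names are illustrative.

```python
def confidence(n_xy: int, n_x_subset: int) -> float:
    """|<X,Y>| / |X|, with |X| counted in the subset containing the target domain."""
    return n_xy / n_x_subset


def relative_confidence(n_xy: int, n_x_fullset: int) -> float:
    """|<X,Y>| / |X|_fullset, with |X| counted over the full dataset."""
    return n_xy / n_x_fullset


# Hypothetical counts: a popular left hand side that appears in 30,000 sequences
# overall, but only 4 times inside the subset for the target domain.
n_xy, n_x_subset, n_x_fullset = 3, 4, 30_000
conf = confidence(n_xy, n_x_subset)               # 3 / 4
rel_conf = relative_confidence(n_xy, n_x_fullset)  # 3 / 30,000
```

Here the ordinary confidence is 0.75 while the relative confidence is 0.0001, which shows how a popular left hand side is pushed down the ranking even when its in-subset confidence looks high.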
  • 14. Model Update: Sequential Pattern Mining
      – The model update module for sequential pattern mining follows a workflow similar to that of the model execution module.
      – The one additional step is to merge the new results obtained from the incoming new data with the existing set of patterns, including updating rule quality metrics: support, confidence, etc.
      – Step 1: Create time-ordered domain sequences from new sessionized data.
      – Step 2: Given a list of targeted domains (e.g. rare domains), select the subset of sequences containing those domains.
      – Step 3: Find high confidence, low support sequential patterns of targeted domains in parallel.
      – Step 4: Merge new results with the existing set of patterns.
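One way to make the merge step cheap is to store raw counts per pattern rather than ratios, so that metrics can be recomputed after adding a new batch. The sketch below is an assumption about how such a merge could look, not the talk's implementation; the record layout (`n_xy`, `n_x`, `users`) and the function names are hypothetical. It also assumes each batch reports left hand side counts for every pattern it tracks.

```python
def merge_pattern_stats(existing, new):
    """Merge two {pattern: {"n_xy": int, "n_x": int, "users": set}} mappings.

    Counts are additive across batches; distinct users are merged as sets
    so that a user seen in both batches is counted once.
    """
    merged = {
        p: {"n_xy": s["n_xy"], "n_x": s["n_x"], "users": set(s["users"])}
        for p, s in existing.items()
    }
    for p, s in new.items():
        cur = merged.setdefault(p, {"n_xy": 0, "n_x": 0, "users": set()})
        cur["n_xy"] += s["n_xy"]
        cur["n_x"] += s["n_x"]
        cur["users"] |= s["users"]
    return merged


def pattern_confidence(stat):
    """Recompute confidence from merged counts."""
    return stat["n_xy"] / stat["n_x"]
```

Keeping user sets rather than user counts matters because #users is a distinct count and cannot be summed across batches.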
  • 15. Modeling Approaches: Graph Mining
  • 16. Model Execution: Graph Mining
      – Step 1: Construct “baskets” of domains (co-occurrence domains) by running a sliding window of a certain time interval through the data.
      – Step 2: Find high confidence, low support pairwise association rules of the form Domain 1 => Domain 2.
      – Step 3: Create a social network of domains.
      – Step 4: Find partially and fully connected sub-graphs.
  • 17. Construction of “Baskets”
    – Domains visited by a user within a certain time window form a “basket”, analogous to the items purchased in a single transaction in market basket analysis.
    – The time interval of the sliding window (60s in the implementation) can be tuned.
    – A basket contains the distinct domains in a sliding window. In the example below, Basket 1 = {googlevideo.com, twitter.com, youtube.com, doubleclick.net, google.com}.
    Connection Time      Domain
    2015-01-06 14:41:00  googlevideo.com
    2015-01-06 14:41:09  twitter.com
    2015-01-06 14:41:12  youtube.com
    2015-01-06 14:41:14  doubleclick.net
    2015-01-06 14:42:00  google.com
    2015-01-06 14:42:05  googleanalytics.com
    2015-01-06 14:42:08  pivotal.io
    2015-01-06 14:59:23  facebook.com
    2015-01-06 14:59:24  yahoo.com
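Basket construction for a single user can be sketched as follows. This is one possible reading of the sliding window, stated as an assumption: a window is anchored at every connection and includes all connections up to and including 60 seconds later. The function name `build_baskets` is hypothetical.

```python
from datetime import datetime, timedelta


def build_baskets(events, window_seconds=60):
    """Return one basket (set of distinct domains) per connection.

    events: iterable of (timestamp, domain) pairs for a single user.
    Each basket holds the domains seen within window_seconds of its anchor.
    """
    events = sorted(events)
    window = timedelta(seconds=window_seconds)
    baskets = []
    for i, (start, _) in enumerate(events):
        basket = set()
        for ts, domain in events[i:]:
            if ts - start > window:
                break  # events are sorted, so later ones are outside the window too
            basket.add(domain)
        baskets.append(basket)
    return baskets


# The connection log from the slide.
rows = [
    ("2015-01-06 14:41:00", "googlevideo.com"),
    ("2015-01-06 14:41:09", "twitter.com"),
    ("2015-01-06 14:41:12", "youtube.com"),
    ("2015-01-06 14:41:14", "doubleclick.net"),
    ("2015-01-06 14:42:00", "google.com"),
    ("2015-01-06 14:42:05", "googleanalytics.com"),
    ("2015-01-06 14:42:08", "pivotal.io"),
    ("2015-01-06 14:59:23", "facebook.com"),
    ("2015-01-06 14:59:24", "yahoo.com"),
]
events = [(datetime.fromisoformat(ts), domain) for ts, domain in rows]
baskets = build_baskets(events)
print(sorted(baskets[0]))
```

Under this reading, the first basket reproduces Basket 1 from the slide, because google.com at 14:42:00 falls exactly on the 60-second boundary and googleanalytics.com at 14:42:05 falls outside it.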
  • 18. Pairwise Association Rule Mining
    – Given domain-to-basket assignments, pairwise association rule mining mainly involves evaluating:
      ▪ Co-occurrence frequency: the number of times two domains fall in a common basket.
      ▪ Conditional probability: the probability of seeing domain 2 given that domain 1 is present.
    – Pairwise rule mining is implemented in plain SQL in a scalable fashion.
    Domain A    Domain B            # {A,B}  # A  # B  P(A|B)    P(B|A)    # A to B  # B to A  # AB Same Time  Max(# User Names/ M)  # Date  Min Date    Max Date
    pivotal.io  montecarlo.com      10       560  10   1.000000  0.017857  9         0         1               1                     1       2015-02-26  2015-02-26
    pivotal.io  bigbangtheory.com   25       560  26   0.961538  0.044643  21        4         0               2                     1       2015-02-23  2015-02-23
    pivotal.io  sciencefiction.com  78       560  97   0.804124  0.139286  61        15        2               4                     8       2015-01-23  2015-02-17
    High-confidence (>0.5) associations involving multiple users over several days (e.g. the rules highlighted on the slide) are generally more interesting.
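The deck states that pairwise rule mining is implemented in SQL. The sketch below is only a small in-memory Python illustration of the same two quantities, co-occurrence count and conditional probability, under the assumption that baskets are available as sets of domains. The function name `pairwise_rules` and the `.example` domains are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(baskets):
    """Compute co-occurrence counts and conditional probabilities per domain pair."""
    domain_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # sorted so each pair has one canonical order
        domain_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = {}
    for (a, b), n_ab in pair_count.items():
        rules[(a, b)] = {
            "n_ab": n_ab,
            "n_a": domain_count[a],
            "n_b": domain_count[b],
            "p_a_given_b": n_ab / domain_count[b],
            "p_b_given_a": n_ab / domain_count[a],
        }
    return rules


# Hypothetical baskets.
toy_baskets = [
    {"a.example", "b.example"},
    {"a.example"},
    {"a.example", "b.example"},
    {"a.example", "c.example"},
]
rules = pairwise_rules(toy_baskets)
for pair, stats in sorted(rules.items()):
    print(pair, stats)
```

The same arithmetic reproduces the first table row on the slide: with # {A,B} = 10, # A = 560 and # B = 10, P(A|B) = 10/10 and P(B|A) = 10/560.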
  • 19. Exploring Interactions between Domains
    – To explore the interactions between domains, we build an undirected correlation graph from the discovered pairwise domain association rules.
    – Each node in the graph is a domain. An edge connects two domains if their co-occurrence confidence is higher than a threshold (e.g. 0.2).
    – The example figure shows the tightly connected “social network” of a particular domain.
    – Partially and fully connected networks indicate possible watering-hole or botnet attacks.
    – Question: how can the connectivity of a network be quantified?
    [Figure: example domain graph with nodes abc.com, xyz.com, hga.com and hebf.com. Each node denotes a domain; each edge weight denotes the confidence.]
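Building the thresholded graph can be sketched as below. The slide does not specify which of the two conditional probabilities serves as the edge weight of the undirected graph, so this sketch simply assumes one confidence value per unordered pair is already available. The function name `build_graph` and the example weights are hypothetical.

```python
from collections import defaultdict


def build_graph(edge_confidence, threshold=0.2):
    """Build an undirected adjacency map from {(domain_a, domain_b): confidence}.

    Only pairs whose confidence exceeds the threshold become edges.
    """
    graph = defaultdict(set)
    for (a, b), confidence in edge_confidence.items():
        if a != b and confidence > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Hypothetical confidences for three domain pairs.
edge_confidence = {
    ("a.example", "b.example"): 0.25,
    ("b.example", "c.example"): 0.10,  # below the threshold, so no edge
    ("a.example", "c.example"): 0.60,
}
graph = build_graph(edge_confidence)
for domain in sorted(graph):
    print(domain, sorted(graph[domain]))
```

Storing the graph as a symmetric adjacency map keeps neighborhood lookups cheap, which is what the ego-net and connected-component analyses rely on.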
  • 20. OddBall Metrics for Graph Anomaly Detection
    – We take the OddBall approach* to quantify the connectivity of each domain’s network:
      ▪ Identify each domain’s one-step neighborhood (also called its ego-net).
      ▪ Extract two graph features from the ego-net: N, the number of neighbors, and E, the number of edges in the ego-net.
    – The number of neighbors and the number of edges follow a power law: $E \propto N^{\alpha}$, $1 \le \alpha \le 2$.
    – Use log(E)/log(N) to approximate the slope. A star, in which no two neighbors are connected, has E = N and a ratio of 1, so log(E)/log(N) > 1 indicates some degree of connectivity among neighbors.
    – The higher the ratio, the higher the degree of connectivity (for the same number of neighbors). Generally, an OddBall ratio > 1.5 is more interesting.
    – One can additionally compute the clique percentage, the ratio between E and the number of edges needed for the domain and its N neighbors to form a clique, $E / [(N^2 + N)/2]$, to measure network connectivity.
    * OddBall: Spotting Anomalies in Weighted Graphs, Leman Akoglu et al., PAKDD, Hyderabad, India, June 2010.
    Picture source: ICDM’12 tutorial on graph anomaly detection.
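The two ego-net features and the derived ratios can be computed directly from an adjacency map. The following is a minimal sketch, assuming an undirected graph stored as a symmetric `{node: set_of_neighbors}` dict with string node names; `egonet_metrics` is a hypothetical helper name, and the ratio is returned as `None` when it is undefined (fewer than two neighbors).

```python
import math


def egonet_metrics(graph, node):
    """Return (N, E, log(E)/log(N), clique fraction) for the ego-net of node."""
    neighbors = set(graph.get(node, ()))
    members = neighbors | {node}
    n = len(neighbors)
    # Count every undirected edge inside the ego-net exactly once (u < v).
    e = sum(
        1
        for u in members
        for v in graph.get(u, ())
        if v in members and u < v
    )
    ratio = math.log(e) / math.log(n) if n > 1 and e > 0 else None
    clique_fraction = e / ((n * n + n) / 2) if n > 0 else None
    return n, e, ratio, clique_fraction


# A 5-node clique (ego plus 4 neighbors) and a star with 4 leaves.
clique_nodes = ["a", "b", "c", "d", "e"]
clique = {u: {v for v in clique_nodes if v != u} for u in clique_nodes}
star = {"hub": {"w", "x", "y", "z"}, "w": {"hub"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}

clique_metrics = egonet_metrics(clique, "a")
star_metrics = egonet_metrics(star, "hub")
print(clique_metrics)
print(star_metrics)
```

The clique case corresponds to a domain with 4 neighbors and 10 edges, giving a ratio of about 1.66 and a clique percentage of 100%, while the star gives a ratio of exactly 1.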
  • 21. Sample Domains with Highly Connected Networks
    The highlighted domain (a.com) has a fully connected network, a clique!
    Domain  # Neighbors  Neighbors                                                               # Edges  log(E)/log(N)  Clique Percent  # User Names
    a.com   4            {b.com, c.com, d.com, e.com}                                            10       1.66           100%            6
    s.com   7            {a.com, b.com, c.com, d.com, e.com, f.com}                              27       1.69           96%             9
    r.com   9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}         43       1.71           96%             7
    abc.ru  9            {a.com, b.com, c.com, d.com, e.com, f.com, g.com, h.com, i.com}         42       1.70           93%             11
    [Figure: fully connected network of a.com, b.com, c.com, d.com and e.com.]
  • 22. Detecting Isolated Clusters
    – Given the domain correlation graph, one can also identify isolated groups of domains that interact only with domains in the same group (a botnet-like structure).
    – This can be formulated as the task of finding connected components (CCs) in a graph.
    – The example below shows that malicious sites tend to exist in small CCs.
    [Figure: sample connected component containing qre.com, jekc.com, fbc.com, abc.com, ghk.com and bcd.com, with a known malicious site marked.]
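Finding connected components is a standard graph traversal. The sketch below uses breadth-first search over a symmetric adjacency map; it is an illustration rather than the deck's implementation, and the function name `connected_components` and the `.example` domains are hypothetical.

```python
from collections import deque


def connected_components(graph):
    """Return the connected components of an undirected graph as a list of sets.

    graph: {node: set_of_neighbors}, assumed symmetric.
    """
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical graph: one larger group and one small isolated pair.
domain_graph = {
    "a.example": {"b.example", "c.example"},
    "b.example": {"a.example", "c.example"},
    "c.example": {"a.example", "b.example", "d.example"},
    "d.example": {"c.example"},
    "x.example": {"y.example"},
    "y.example": {"x.example"},
}
components = sorted(connected_components(domain_graph), key=len)
print(components)
```

Sorting the components by size puts the small isolated groups first, which is where the slide suggests malicious sites tend to appear.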
  • 23. Operationalization and Outlook
  • 24. Operationalization Vision
    – Run Algorithms: owned by Data Engineer/Data Scientist. Incrementally (e.g. weekly) update models using new batches of data, e.g. as a cron job.
    – Inspect Anomalies: owned by the security team. Ideally, model outputs are provided via interactive web dashboards.
    – Evaluate Model Outputs: feedback on model performance from the security team; opportunities for refinement and ideas for new models.
    – Refine Algorithms: owned by Data Scientist.
    – Load New Data: owned by Data Engineer.
  • 25. BUILT FOR THE SPEED OF BUSINESS