Valencian Summer School in Machine Learning
3rd edition
September 14-15, 2017
BigML, Inc 2
Association Discovery
Finding Meaningful Correlations
Poul Petersen
CIO, BigML, Inc
Association Discovery
• An unsupervised learning technique
• No labels necessary
• Useful for data discovery
• Finds "significant" correlations/associations/relations
• Shopping cart: Coffee and sugar
• Medical: High plasma glucose and diabetes
• Expresses them as "if then rules"
• If "antecedent" then "consequent"
• Significance measures
• BigML: “Magnum Opus” from Geoff Webb
Clusters
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
similar
Anomaly Detection
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
anomaly
Association Discovery
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
Rules:
Antecedent                          Consequent
{customer = Bob, account = 3421}    zip = 46140
{class = gas}                       amount < 100
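To make the two rules concrete, the sketch below copies the seven rows from the slide and counts, for each rule, how many rows match the antecedent and how many of those also match the consequent. The helper name `count_rule` and the pairing of each antecedent with its consequent are assumptions made for this illustration.

```python
# Rows copied from the transaction table on the slide.
COLUMNS = ["date", "customer", "account", "auth", "class", "zip", "amount"]
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thr", "Sally", 6788, "sign", "food", 26339, 51),
]
TRANSACTIONS = [dict(zip(COLUMNS, row)) for row in ROWS]


def count_rule(rows, antecedent, consequent):
    """Return (rows matching the antecedent, rows matching both sides)."""
    matching = [row for row in rows if antecedent(row)]
    both = [row for row in matching if consequent(row)]
    return len(matching), len(both)


# Rule 1: {customer = Bob, account = 3421} -> zip = 46140
bob_rule = count_rule(
    TRANSACTIONS,
    lambda row: row["customer"] == "Bob" and row["account"] == 3421,
    lambda row: row["zip"] == 46140,
)
# Rule 2: {class = gas} -> amount < 100
gas_rule = count_rule(
    TRANSACTIONS,
    lambda row: row["class"] == "gas",
    lambda row: row["amount"] < 100,
)
print(bob_rule, gas_rule)
```

The first rule holds for three of Bob's four rows; the second holds for both gas purchases.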
Use Cases
• Market Basket Analysis: Items that go together
• Data Discovery: how do instances relate?
• Behaviors that occur together
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• gene expression associated with outcomes
• Medical risk factors
What is interesting?
• Infrequent patterns can be strong, but are they interesting?
• Vodka and caviar
• Storms and high water sales
• Frequent patterns can be strong, but are they interesting?
• Coffee and milk
• High plasma glucose and diabetes
• “Frequency” isn’t the answer…
• Depends on the data and domain
• We need better metrics to define what is interesting
Association Metrics
Coverage
Percentage of instances which match antecedent “A”
[Figure: Venn diagram of A and C within Instances]
Association Metrics
Support
Percentage of instances which match antecedent “A” and consequent “C”
[Figure: Venn diagram of A and C within Instances]
Association Metrics
Confidence
Percentage of instances in the antecedent which also contain the consequent.
Confidence = Support / Coverage
[Figure: Venn diagram of A and C within Instances]
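The three metrics so far can be sketched in a few lines. The example assumes each instance has already been reduced to a pair of flags (matches A, matches C). The flags below encode the rule {class = gas} → amount < 100 on the seven-row table from the earlier slides. The function names are chosen for this illustration only.

```python
from fractions import Fraction

# One (matches_antecedent, matches_consequent) pair per row of the table:
# antecedent is class = gas, consequent is amount < 100.
FLAGS = [
    (False, False),  # Mon Bob clothes 135
    (False, False),  # Tue Bob food 401
    (False, False),  # Tue Alice food 234
    (True, True),    # Wed Sally gas 94
    (False, False),  # Wed Bob tech 2459
    (True, True),    # Wed Bob gas 83
    (False, True),   # Thr Sally food 51
]


def coverage(flags):
    """Fraction of instances that match the antecedent A."""
    return Fraction(sum(a for a, _ in flags), len(flags))


def support(flags):
    """Fraction of instances that match both A and C."""
    return Fraction(sum(a and c for a, c in flags), len(flags))


def confidence(flags):
    """Fraction of the instances matching A that also match C."""
    covered = coverage(flags)
    if covered == 0:
        raise ValueError("antecedent matches no instances")
    return support(flags) / covered


print(coverage(FLAGS), support(FLAGS), confidence(FLAGS))
```

Exact fractions are used so the values can be compared without rounding noise.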
Association Metrics
Confidence
• 0%: A never implies C
• Between 0% and 100%: A sometimes implies C
• 100%: A always implies C
[Figure: Venn diagrams of A and C within Instances, from no overlap (0%) to A entirely inside C (100%)]
Association Metrics
Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / ( p(A) * p(C) ) = Confidence / p(C)
Problem: if p(C) is "small" then lift may be large.
[Figure: Independent vs. Observed overlap of A and C]
Association Metrics
• Lift < 1: Negative Correlation
• Lift = 1: No Correlation
• Lift > 1: Positive Correlation
[Figure: Independent vs. Observed overlap of A and C for each case]
Association Metrics
Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
[Figure: Independent vs. Observed overlap of A and C]
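Lift and leverage both compare the observed support with the support expected under independence, p(A) * p(C): lift as a ratio, leverage as a difference. A minimal sketch follows, reusing the (matches A, matches C) flag representation for the rule {class = gas} → amount < 100 on the seven-row table. The function names are assumptions for this example.

```python
from fractions import Fraction

# (matches_antecedent, matches_consequent) per row:
# antecedent is class = gas, consequent is amount < 100.
FLAGS = [
    (False, False), (False, False), (False, False), (True, True),
    (False, False), (True, True), (False, True),
]


def probabilities(flags):
    """Return p(A), p(C) and the observed support p(A and C)."""
    n = len(flags)
    p_a = Fraction(sum(a for a, _ in flags), n)
    p_c = Fraction(sum(c for _, c in flags), n)
    observed = Fraction(sum(a and c for a, c in flags), n)
    return p_a, p_c, observed


def lift(flags):
    """Observed support divided by the support expected under independence."""
    p_a, p_c, observed = probabilities(flags)
    return observed / (p_a * p_c)


def leverage(flags):
    """Observed support minus the support expected under independence."""
    p_a, p_c, observed = probabilities(flags)
    return observed - p_a * p_c


print(lift(FLAGS), leverage(FLAGS))
```

Here p(A) = 2/7, p(C) = 3/7 and the support is 2/7, so lift is 7/3 (greater than 1) and leverage is 8/49 (greater than 0). Both indicate a positive correlation.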
Association Metrics
• Leverage < 0: Negative Correlation
• Leverage = 0: No Correlation
• Leverage > 0: Positive Correlation
[Figure: Independent vs. Observed overlap of A and C for each case; axis starts at -1…]
Magnum Opus
• Select measure of interest: Leverage, Lift, etc.
• System finds the top-k associations on that measure within
constraints
• Must be statistically significant interaction between
antecedent and consequent
• Every item in the antecedent must increase the strength
of association
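The idea of "top-k associations on a chosen measure" can be illustrated with a brute-force sketch. This is not the Magnum Opus search. It enumerates only single-item antecedents and consequents, ranks them by leverage, and applies neither the statistical significance test nor the check that every antecedent item strengthens the association. The baskets and function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical shopping baskets, invented for this illustration.
BASKETS = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"coffee", "sugar", "bread"},
    {"bread", "milk"},
    {"bread"},
    {"milk", "bread"},
]


def leverage(baskets, antecedent, consequent):
    """Support minus the support expected if the two items were independent."""
    n = len(baskets)
    p_a = Fraction(sum(antecedent in b for b in baskets), n)
    p_c = Fraction(sum(consequent in b for b in baskets), n)
    observed = Fraction(
        sum(antecedent in b and consequent in b for b in baskets), n
    )
    return observed - p_a * p_c


def top_k_rules(baskets, k):
    """Score every ordered item pair and keep the k with highest leverage."""
    items = sorted(set().union(*baskets))
    scored = [
        (a, c, leverage(baskets, a, c)) for a, c in permutations(items, 2)
    ]
    # Stable sort: ties keep their enumeration order.
    scored.sort(key=lambda rule: rule[2], reverse=True)
    return scored[:k]


top_rules = top_k_rules(BASKETS, 2)
for antecedent, consequent, score in top_rules:
    print(f"if {antecedent} then {consequent}: leverage {score}")
```

Exhaustive enumeration like this grows quickly with the number of items and the antecedent size, which is one reason a dedicated search with constraints is used in practice.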
Basic AD Configuration
1. Search Strategy: Support/Coverage/Confidence/Lift/Leverage
2. Max Number of Associations: 1 to 500 (default 100)
3. Max Items in Antecedent: 1 to 10 (default 4)
4. Complement Items: True / False
• False: Coffee and…
• True: Not Coffee and…
5. Missing Items: True / False
• False: Loan Description contains "Ferrari" and…
• True: Loan Description is missing and…
Data Types
numeric: 1, 2.0, 3, -5.4
categorical: true, yes, red, mammal
date-time: 2013-09-25 10:02
    YEAR (YYYY-MM-DD): 2013
    MONTH (YYYY-MM-DD): September
    DAY-OF-MONTH (YYYY-MM-DD): 25
    DAY-OF-WEEK (M-T-W-T-F-S-D): Wednesday
    HOUR (HH:MM:SS): 10
    MINUTE (HH:MM:SS): 02
text: Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
    “great” appears 2 times
    “afraid” appears 1 time
    “born” appears 1 time
    “some” appears 2 times
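The date-time expansion can be sketched with the Python standard library. The field names mirror the slide. The function name, the input format string, and the fixed English name lists (used to avoid locale-dependent output) are assumptions for this example, not a description of how BigML parses dates.

```python
from datetime import datetime

DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def expand_datetime(text):
    """Split a 'YYYY-MM-DD HH:MM' string into separate derived fields."""
    moment = datetime.strptime(text, "%Y-%m-%d %H:%M")
    return {
        "year": moment.year,
        "month": MONTH_NAMES[moment.month - 1],
        "day-of-month": moment.day,
        "day-of-week": DAY_NAMES[moment.weekday()],  # Monday is 0
        "hour": moment.hour,
        "minute": moment.minute,
    }


fields = expand_datetime("2013-09-25 10:02")
print(fields)
```

Each derived field can then be treated as its own numeric or categorical feature.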
Items Type
items: coffee, sugar, milk, honey, dish soap, bread
• Canonical example: shopping cart contents
• Single feature describing a list of items
• Each item separated by a comma (default)
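A minimal sketch of reading an items field: split the single feature on the separator and trim whitespace. The function name and the choice to drop empty entries are assumptions for this example.

```python
def parse_items(field, separator=","):
    """Split one items field into a list of item names."""
    return [item.strip() for item in field.split(separator) if item.strip()]


cart = parse_items("coffee, sugar, milk, honey, dish soap, bread")
print(cart)
```

Note that "dish soap" stays a single item because only the comma separates items.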
Use Cases
GOAL: Discover “interesting” rules about what store items
are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at
checkout
Association Demo #1
Use Cases
GOAL: Find general rules that indicate diabetes.
• Dataset of diagnostic measurements of 768
patients.
• Each patient labelled True/False for diabetes.
Association Demo #2
Medical Risks
Decision Tree
If plasma glucose > 155
and bmi > 29.32
and diabetes pedigree > 0.32
and insulin <= 629
and age <= 44
then diabetes = TRUE
Association Rule
If plasma glucose > 146
then diabetes often TRUE
Advanced AD Config
1. Measures: Set minimum criteria for AD measures
2. Minimum Significance: lower values reduce spurious rules
3. Consequent: Restrict rules to specific consequent criteria
4. Discretization: How numeric values are handled
   1. Pretty: rounds off discretized values: 20 instead of 20.234
   2. Size: the number of ranges (default 5)
   3. Type: equal population / width
   4. Trim: Removes percentage of values from the tails
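The difference between equal-width and equal-population discretization shows up clearly on skewed data. The sketch below is an illustration under stated assumptions. The sample values are invented, the "pretty" step is plain rounding, and none of the functions reproduce BigML's actual discretization rules.

```python
from bisect import bisect_right


def equal_width_edges(values, size):
    """Interior boundaries splitting [min, max] into `size` equal-width ranges."""
    low, high = min(values), max(values)
    step = (high - low) / size
    return [low + step * i for i in range(1, size)]


def equal_population_edges(values, size):
    """Interior boundaries chosen so each range holds about as many values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // size] for i in range(1, size)]


def bin_counts(values, edges):
    """Count how many values fall into each range defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    return counts


# Hypothetical skewed sample: nine small values and one large outlier.
VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

width_edges = equal_width_edges(VALUES, 5)
pretty_edges = [round(edge) for edge in width_edges]  # e.g. 21 instead of 20.8
population_edges = equal_population_edges(VALUES, 5)

width_counts = bin_counts(VALUES, width_edges)
population_counts = bin_counts(VALUES, population_edges)
print(pretty_edges, width_counts)
print(population_edges, population_counts)
```

With equal width, the outlier stretches the ranges so nine of the ten values land in the first one. Equal population puts two values in every range. Trimming the tails before discretizing is another way to limit the effect of such outliers.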
Association Demo #3
Summary
• Association Discovery Purpose
• Unsupervised technique for discovering interesting
associations
• Outputs antecedent/consequent rules
• Metrics: Support / Coverage / Confidence / Lift / Leverage
• Items type:
• Configuration:
• Search strategy / Minimum Measures
• Complementary rules / Missing
• Consequent Filtering
• Discretization
• Additional Uses
• Understanding clusters and anomaly detectors
Topic Modeling
BigML, Inc 2Topic Modeling - September 2017
Topic Modeling
• Method for discovering
structure in "unstructured"
text.
• Based on LDA,
introduced by David Blei,
Andrew Ng, and Michael
I. Jordan in 2003.
• Now "BigML Easy"
BigML Resources
Resources: SOURCE, DATASET, CORRELATION, STATISTICAL TEST, MODEL, ENSEMBLE, LOGISTIC REGRESSION, EVALUATION, ANOMALY DETECTOR, ASSOCIATION DISCOVERY, CLUSTER, TOPIC MODEL, SINGLE/BATCH PREDICTION, SCRIPT, LIBRARY, EXECUTION
Categories: Data Exploration, Supervised Learning, Unsupervised Learning, Automation, Scoring
Unsupervised Learning
• Learn from instances
• Each instance has features
• There is no label
Clustering: Find similar instances
Anomaly Detection: Find unusual instances
Association Discovery: Find feature rules
Topic Model
• Unsupervised algorithm
• Learns only from text fields
• Finds hidden topics that model the text
Questions:
• How is this different from the Text Analysis that BigML already offers?
• What does it output and how do we use it?
• Unsupervised… model?
Text Analysis
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
great: appears 4 times
Bag of Words
Text Analysis
Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
token  …  great  afraid  born  achieve  …
count  …  4      1       1     1        …
Model
• The token “great” occurs more than 3 times
• The token “afraid” occurs no more than once
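A bag of words for the quote can be built in a few lines. To reach the count of 4 for “great”, "greatness" has to be folded into "great". The suffix rule below is a crude stand-in invented for this example, not the stemmer BigML uses.

```python
import re
from collections import Counter

QUOTE = (
    "Be not afraid of greatness: some are born great, some achieve "
    "greatness, and some have greatness thrust upon 'em."
)


def crude_stem(token):
    """Hypothetical stemming rule for this example: drop a trailing 'ness'."""
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token


def bag_of_words(text):
    """Lowercase the text, split it into tokens, and count stemmed tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(crude_stem(token) for token in tokens)


counts = bag_of_words(QUOTE)
# The two conditions shown on the slide, evaluated on this document.
rule_matches = counts["great"] > 3 and counts["afraid"] <= 1
print(counts["great"], counts["afraid"], rule_matches)
```

Each token count becomes one feature, which is why a text field can expand into thousands of columns.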
Hodor!
Text Analysis Demo #1
Text Analysis vs Topic Models
Text
• Creates thousands of hidden token counts
• Token counts are independently uninteresting
• No semantic importance
• No measure of co-occurrence
Topic Model
• Creates tens of topics that model the text
• Topics are independently interesting
• Semantic meaning extracted
• Support for bigrams
Generative Modeling
• Decision trees are discriminative models
• Aggressively model the classification boundary
• Parsimonious: Don’t consider anything you don’t have to
• Topic Models are generative models
• Come up with a theory of how the data is generated
• Tweak the theory to fit your data
Generating Documents
• "Machine" that generates a random word with equal probability with each pull.
• Pull random number of times to generate a document.
• All documents can be generated, but most are nonsense.
word        probability
shoe        ϵ
asteroid    ϵ
flashlight  ϵ
pizza       ϵ
…           ϵ
Generated documents:
• shoe asteroid flashlight pizza…
• plate giraffe purple jump…
• Be not afraid of greatness: some are born great, some achieve greatness…
[Figure: word machine containing cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…]
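The uniform "machine" can be sketched directly. The vocabulary below is the word list shown in the figure. The function name, the maximum document length, and the seed are assumptions for this example.

```python
import random

# Words shown inside the machine on the slide.
VOCABULARY = [
    "cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
    "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight",
]


def generate_document(rng, vocabulary, max_length=10):
    """Pull the machine a random number of times; every word is equally likely."""
    length = rng.randint(1, max_length)
    return [rng.choice(vocabulary) for _ in range(length)]


rng = random.Random(0)  # fixed seed so repeated runs give the same document
document = generate_document(rng, VOCABULARY)
print(" ".join(document))
```

Any sequence of vocabulary words can come out of this machine, but a meaningful sentence is extremely unlikely.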
Topic Model
Intuition:
• Written documents have meaning - one way to describe meaning is to assign a topic.
• For our random machine, the topic can be thought of as increasing the probability of certain words.
Topic: travel
word      probability
travel    23,55 %
airplane  2,33 %
mars      0,003 %
mantle    ϵ
…         ϵ
Generated document: airplane passport pizza …
Topic: space
word      probability
space     38,94 %
airplane  ϵ
mars      13,43 %
mantle    0,05 %
…         ϵ
Generated document: mars quasar lightyear soda
[Figure: the same word machine (cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…) shown for each topic]
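A topic can be sketched as the same machine with unequal word weights. The weights below are the percentages shown for the travel topic. The tiny value standing in for ϵ is an assumption, and the words the slide elides are simply left out, so the weights act as relative weights rather than a full distribution.

```python
import random
from collections import Counter

# Relative word weights for the travel topic (percentages from the slide);
# 0.0001 is a hypothetical stand-in for the unspecified epsilon.
TRAVEL_TOPIC = {
    "travel": 23.55,
    "airplane": 2.33,
    "mars": 0.003,
    "mantle": 0.0001,
}


def generate_from_topic(rng, topic, length):
    """Draw `length` words, each chosen in proportion to its topic weight."""
    words = list(topic)
    weights = [topic[word] for word in words]
    return rng.choices(words, weights=weights, k=length)


rng = random.Random(0)
document = generate_from_topic(rng, TRAVEL_TOPIC, 1000)
frequencies = Counter(document)
print(frequencies.most_common())
```

Words that the topic favors dominate the generated text, which is what makes a topic's documents look like they are "about" something.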
Topic Model
• Each text field in a row is concatenated into a document
• The documents are analyzed to generate "k" related topics
• Each topic is represented by a distribution of term probabilities
Topic: "1"
word      probability
travel    23,55 %
airplane  2,33 %
mars      0,003 %
mantle    ϵ
…         ϵ
Topic: "2"
word      probability
space     38,94 %
airplane  ϵ
mars      13,43 %
mantle    0,05 %
…         ϵ
…
Topic: "k"
word       probability
shoe       12,12 %
coffee     3,39 %
telephone  13,43 %
paper      4,11 %
…          ϵ
Documents:
• airplane passport pizza …
• plate giraffe purple jump…
[Figure: the same word machine (cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…) shown for each topic]
Topic Model Demo #1
Uses
• As a preprocessor for other techniques
• Bootstrapping categories for classification
• Recommendation
• Discovery in large, heterogeneous text datasets
Topic Distribution
Intuition:
• Any given document is likely a mixture of the modeled topics…
• This can be represented as a distribution of topic probabilities
Document: Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?
Topic: travel (11%)
word      probability
travel    23,55 %
airplane  2,33 %
mars      0,003 %
mantle    ϵ
…         ϵ
Topic: space (89%)
word      probability
space     38,94 %
airplane  ϵ
mars      13,43 %
mantle    0,05 %
…         ϵ
[Figure: the same word machine (cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…) shown for each topic]
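Under the mixture view, the chance of seeing a word in this document is the topic-weighted sum of its per-topic probabilities, which is the standard LDA generative assumption. The sketch uses the numbers on the slide (11% travel, 89% space). It treats every ϵ entry as zero, which is an assumption, and the function name is hypothetical.

```python
# Word probabilities per topic, converted from the slide's percentages.
# Entries shown as epsilon on the slide are omitted and treated as 0.
TOPIC_WORD = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003},
    "space": {"space": 0.3894, "mars": 0.1343, "mantle": 0.0005},
}

# Topic distribution of the example document.
DOC_TOPICS = {"travel": 0.11, "space": 0.89}


def word_probability(word, doc_topics, topic_word):
    """Sum over topics of p(topic | document) * p(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


for word in ("space", "mars", "travel"):
    print(word, round(word_probability(word, DOC_TOPICS, TOPIC_WORD), 6))
```

Because the document is mostly "space", the word "mars" gets nearly all of its probability from that topic even though it also appears, with a tiny weight, in "travel".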
Topic Model Demo #2
Clustering?
[Diagram] Clustering: Unlabelled Data → Batch Centroid → Centroid Label
[Diagram] Topic Model: Unlabelled Data (Text Fields) → Batch Topic Distribution → topic 1 prob, topic 3 prob, …, topic k prob
Topic Model Demo #3
Some Tips
• Setting k
• Much like k-means, the best value is data specific
• Too few will agglomerate unrelated topics, too many will
partition highly related topics
• I tend to find the latter more annoying than the former
• Tuning the Model
• Remove common, useless terms
• Set term limit higher, use bigrams
VSSML17 L4. Association Discovery and Latent Dirichlet Allocation

 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 

VSSML17 L4. Association Discovery and Latent Dirichlet Allocation

  • 1. Valencian Summer School in Machine Learning 3rd edition September 14-15, 2017
  • 2. BigML, Inc 2 Association Discovery Finding Meaningful Correlations Poul Petersen CIO, BigML, Inc
  • 3. BigML, Inc 3Association Discovery Association Discovery • An unsupervised learning technique • No labels necessary • Useful for data discovery • Finds "significant" correlations/associations/relations • Shopping cart: Coffee and sugar • Medical: High plasma glucose and diabetes • Expresses them as "if then rules" • If "antecedent" then "consequent" • Significance measures • BigML: “Magnum Opus” from Geoff Webb
  • 4. BigML, Inc 4Association Discovery Clusters date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  • 5. BigML, Inc 5Association Discovery Clusters date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 similar
  • 6. BigML, Inc 6Association Discovery Anomaly Detection date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  • 7. BigML, Inc 7Association Discovery Anomaly Detection date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 anomaly
  • 8. BigML, Inc 8Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  • 9. BigML, Inc 9Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 {customer = Bob, account = 3421}
  • 10. BigML, Inc 10Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140{customer = Bob, account = 3421}
  • 11. BigML, Inc 11Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140{customer = Bob, account = 3421} {class = gas}
  • 12. BigML, Inc 12Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140 amount < 100 {customer = Bob, account = 3421} {class = gas}
  • 13. BigML, Inc 13Association Discovery Association Discovery date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 zip = 46140 amount < 100 Rules: Antecedent Consequent {customer = Bob, account = 3421} {class = gas}
  • 14. BigML, Inc 14Association Discovery Use Cases • Market Basket Analysis: Items that go together • Data Discovery: how do instances relate? • Behaviors that occur together • Web usage patterns • Intrusion detection • Fraud detection • Bioinformatics • gene expression associated with outcomes • Medical risk factors
  • 15. BigML, Inc 15Association Discovery What is interesting? • Infrequent patterns can be strong, but are they interesting? • Vodka and caviar • Storms and high water sales • Frequent patterns can be strong, but are they interesting? • Coffee and milk • High plasma glucose and diabetes • “Frequency” isn’t the answer… • Depends on the data and domain • We need better metrics to define what is interesting
  • 16. BigML, Inc 16Association Discovery Association Metrics Coverage: Percentage of instances which match antecedent “A”
  • 17. BigML, Inc 17Association Discovery Association Metrics Support: Percentage of instances which match antecedent “A” and consequent “C”
  • 18. BigML, Inc 18Association Discovery Association Metrics Confidence: Percentage of instances in the antecedent which also contain the consequent. Confidence = Support / Coverage
  • 19. BigML, Inc 19Association Discovery Association Metrics Confidence, on a scale from 0% to 100% • 0%: A never implies C • In between: A sometimes implies C • 100%: A always implies C
  • 20. BigML, Inc 20Association Discovery Association Metrics Lift: Ratio of observed support to support if A and C were statistically independent. Lift = Support / [ p(A) * p(C) ] = Confidence / p(C) Problem: if p(C) is "small" then… lift may be large.
  • 21. BigML, Inc 21Association Discovery Association Metrics Lift (observed vs. independent overlap of A and C) • Lift < 1: Negative Correlation • Lift = 1: No Correlation • Lift > 1: Positive Correlation
  • 22. BigML, Inc 22Association Discovery Association Metrics Leverage: Difference of observed support and support if A and C were statistically independent. Leverage = Support - [ p(A) * p(C) ]
  • 23. BigML, Inc 23Association Discovery Association Metrics Leverage (observed vs. independent overlap of A and C) • Leverage < 0: Negative Correlation • Leverage = 0: No Correlation • Leverage > 0: Positive Correlation • Scale: -1…
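As a rough illustration of the five metrics defined on the slides above, the following Python sketch computes them for the two example rules from the transaction table ({customer = Bob, account = 3421} → zip = 46140, and {class = gas} → amount < 100). The names `RECORDS` and `metrics` are made up for this example and are not part of BigML's API; exact fractions are used to avoid rounding noise, and the helper assumes the antecedent and consequent each match at least one row.

```python
from fractions import Fraction

# Transaction table from the slides.
FIELDS = ("date", "customer", "account", "auth", "class", "zip", "amount")
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thr", "Sally", 6788, "sign", "food", 26339, 51),
]
RECORDS = [dict(zip(FIELDS, row)) for row in ROWS]


def metrics(records, antecedent, consequent):
    """Compute the slide metrics for the rule antecedent -> consequent.

    `antecedent` and `consequent` are predicates over a single record.
    """
    n = len(records)
    n_a = sum(1 for r in records if antecedent(r))
    n_c = sum(1 for r in records if consequent(r))
    n_ac = sum(1 for r in records if antecedent(r) and consequent(r))
    coverage = Fraction(n_a, n)   # p(A)
    support = Fraction(n_ac, n)   # p(A and C)
    p_c = Fraction(n_c, n)        # p(C)
    return {
        "coverage": coverage,
        "support": support,
        "confidence": support / coverage,
        "lift": support / (coverage * p_c),
        "leverage": support - coverage * p_c,
    }


bob_zip = metrics(
    RECORDS,
    lambda r: r["customer"] == "Bob" and r["account"] == 3421,
    lambda r: r["zip"] == 46140,
)
gas_cheap = metrics(
    RECORDS,
    lambda r: r["class"] == "gas",
    lambda r: r["amount"] < 100,
)

for name, result in (("Bob -> zip 46140", bob_zip), ("gas -> amount < 100", gas_cheap)):
    print(name, {k: str(v) for k, v in result.items()})
```

On this seven-row table the first rule has coverage 4/7, support 3/7, confidence 3/4, lift 7/4 and leverage 9/49; the second has coverage 2/7, support 2/7, confidence 1, lift 7/3 and leverage 8/49. Both show lift above 1 and leverage above 0, i.e. positive correlation in the slides' terms.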
  • 24. BigML, Inc 24Association Discovery Magnum Opus • Select measure of interest: Leverage, Lift, etc. • System finds the top-k associations on that measure within constraints • Must be statistically significant interaction between antecedent and consequent • Every item in the antecedent must increase the strength of association
  • 25. BigML, Inc 25Association Discovery Basic AD Configuration 1. Search Strategy: Support/Coverage/Confidence/Lift/Leverage 2. Max Number of Associations: 1 to 500 (default 100) 3. Max Items in Antecedent: 1 to 10 (default 4) 4. Complement Items: True / False • False: Coffee and… • True: Not Coffee and… 5. Missing Items: True / False • False: Loan Description contains "Ferrari" and… • True: Loan Description is missing and…
  • 26. BigML, Inc 26Association Discovery Data Types • numeric: 1, 2.0, 3, -5.4 • categorical: true, yes, red, mammal (e.g. A, B, C) • date-time: 2013-09-25 10:02, expanded into YEAR (2013), MONTH (September), DAY-OF-MONTH (25), DAY-OF-WEEK (Wednesday), HOUR (10), MINUTE (02) • text: "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em." with token counts: “great” appears 2 times, “afraid” appears 1 time, “born” appears 1 time, “some” appears 2 times
  • 27. BigML, Inc 27Association Discovery Items Type • Example items field: coffee, sugar, milk, honey, dish soap, bread • Canonical example: shopping cart contents • Single feature describing a list of items • Each item separated by a comma (default)
  • 28. BigML, Inc 28Association Discovery Use Cases GOAL: Discover “interesting” rules about what store items are typically purchased together. • Dataset of 9,834 grocery cart transactions • Each row is a list of all items in a cart at checkout
  • 29. BigML, Inc 29Association Discovery Association Demo #1
  • 30. BigML, Inc 30Association Discovery Use Cases GOAL: Find general rules that indicate diabetes. • Dataset of diagnostic measurements of 768 patients. • Each patient labelled True/False for diabetes.
  • 31. BigML, Inc 31Association Discovery Association Demo #2
  • 32. BigML, Inc 32Association Discovery Medical Risks Decision Tree If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE Association Rule If plasma glucose > 146 then diabetes often TRUE
  • 33. BigML, Inc 33Association Discovery Advanced AD Config 1. Measures: Set a minimum criteria for AD measures 2. Minimum Significance: lower values reduce spurious rules 3. Consequent: Restrict rules to a specific consequent criteria 4. Discretization: How numeric values are handled 1. Pretty: rounds off discretized values: 20 instead of 20.234 2. Size: the number of ranges (default 5) 3. Type: equal population / width 4. Trim: Removes percentage of values from the tails
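To make the two discretization types on the slide above more concrete, here is a minimal Python sketch of equal-width versus equal-population ranges, applied to the `amount` column of the earlier transaction table. The function names and the exact cut-point rule are assumptions for illustration only; BigML's actual discretization (including the "pretty" and "trim" options) may differ in detail.

```python
def equal_width_edges(values, size=5):
    """Cut points that split [min, max] into `size` ranges of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / size
    return [lo + step * i for i in range(1, size)]


def equal_population_edges(values, size=5):
    """Cut points chosen so each range holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // size] for i in range(1, size)]


def bin_index(value, edges):
    """Index of the range a value falls into (ranges are left-closed)."""
    return sum(1 for edge in edges if value >= edge)


# The `amount` column from the transaction table on the earlier slides.
amounts = [135, 401, 234, 94, 2459, 83, 51]

width_bins = [bin_index(a, equal_width_edges(amounts)) for a in amounts]
population_bins = [bin_index(a, equal_population_edges(amounts)) for a in amounts]

print("equal width:     ", width_bins)
print("equal population:", population_bins)
```

With equal width, the single large purchase (2459) stretches the scale so that every other amount lands in the first range; equal population spreads the same values across all five ranges, which is usually more useful when forming rules such as amount < 100.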
  • 34. BigML, Inc 34Association Discovery Association Demo #3
  • 35. BigML, Inc 35Association Discovery Summary • Association Discovery Purpose • Unsupervised technique for discovering interesting associations • Outputs antecedent/consequent rules • Metrics: Support / Coverage / Confidence / Lift / Leverage • Items type: • Configuration: • Search strategy / Minimum Measures • Complementary rules / Missing • Consequent Filtering • Discretization • Additional Uses • Understanding clusters and anomaly detectors
  • 37. BigML, Inc 2Topic Modeling - September 2017 Topic Modeling • Method for discovering structure in "unstructured" text. • Based on LDA, introduced by David Blei, Andrew Ng, and Michael I. Jordan in 2003. • Now "BigML Easy"
  • 38. BigML, Inc 3Topic Modeling - September 2017 BigML Resources SOURCE DATASET CORRELATION STATISTICAL TEST MODEL ENSEMBLE LOGISTIC REGRESSION EVALUATION ANOMALY DETECTOR ASSOCIATION DISCOVERY SINGLE/BATCH PREDICTION SCRIPT LIBRARY EXECUTION Data Exploration Supervised Learning Unsupervised Learning Automation CLUSTER Scoring TOPIC MODEL
  • 39. BigML, Inc 4Topic Modeling - September 2017 Unsupervised Learning Features Instances • Learn from instances • Each instance has features • There is no label Clustering Find similar instances Anomaly Detection Find unusual instances Association Discovery Find feature rules
  • 40. BigML, Inc 5Topic Modeling - September 2017 Topic Model Text Fields • Unsupervised algorithm • Learns only from text fields • Finds hidden topics that model the text Questions: • How is this different from the Text Analysis that BigML already offers? • What does it output and how do we use it? • Unsupervised… model?
  • 41. BigML, Inc 6Topic Modeling - September 2017 Text Analysis Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em. great: appears 4 times Bag of Words
  • 42. BigML, Inc 7Topic Modeling - September 2017 Text Analysis Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em. Token counts: … great 4, afraid 1, born 1, achieve 1 … Model: • The token “great” occurs more than 3 times • The token “afraid” occurs no more than once
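The bag-of-words counts on the two Text Analysis slides above can be reproduced with a short Python sketch. Counting "great" four times requires folding "greatness" into "great"; the `crude_stem` helper below is a hypothetical stand-in for a real stemmer and only strips a trailing "ness". It is not how BigML tokenizes text.

```python
import re
from collections import Counter

TEXT = (
    "Be not afraid of greatness: some are born great, some achieve "
    "greatness, and some have greatness thrust upon 'em."
)


def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: only strips a trailing "ness".
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens)


counts = bag_of_words(TEXT)
for token in ("great", "afraid", "born", "achieve"):
    print(token, counts[token])

# The two example model conditions from the slide.
print(counts["great"] > 3, counts["afraid"] <= 1)
```

Each count becomes one numeric feature per token, which is why text analysis produces thousands of columns that a model can only test with conditions like the two shown.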
  • 43. BigML, Inc 8Topic Modeling - September 2017 Hodor!
  • 45. BigML, Inc 10Topic Modeling - September 2017 Text Analysis vs Topic Models • Text: Creates thousands of hidden token counts; Token counts are independently uninteresting; No semantic importance; No measure of co-occurrence • Topic Model: Creates tens of topics that model the text; Topics are independently interesting; Semantic meaning extracted; Support for bigrams
  • 46. BigML, Inc 11Topic Modeling - September 2017 Generative Modeling • Decision trees are discriminative models • Aggressively model the classification boundary • Parsimonious: Don’t consider anything you don’t have to • Topic Models are generative models • Come up with a theory of how the data is generated • Tweak the theory to fit your data
  • 47. BigML, Inc 12Topic Modeling - September 2017 Generating Documents cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight… shoe asteroid flashlight pizza… plate giraffe purple jump… Be not afraid of greatness: some are born great, some achieve greatness… • "Machine" that generates a random word with equal probability with each pull. • Pull random number of times to generate a document. • All documents can be generated, but most are nonsense. word probability shoe ϵ asteroid ϵ flashlight ϵ pizza ϵ … ϵ
  • 48. BigML, Inc 13Topic Modeling - September 2017 Topic Model Intuition: • Written documents have meaning - one way to describe meaning is to assign a topic. • For our random machine, the topic can be thought of as increasing the probability of certain words. Topic: travel (generates e.g. airplane passport pizza …) word probability: travel 23,55 %, airplane 2,33 %, mars 0,003 %, mantle ϵ, … ϵ Topic: space (generates e.g. mars quasar lightyear soda) word probability: space 38,94 %, airplane ϵ, mars 13,43 %, mantle 0,05 %, … ϵ
  • 49. BigML, Inc 14Topic Modeling - September 2017 Topic Model • Each text field in a row is concatenated into a document • The documents are analyzed to generate "k" related topics • Each topic is represented by a distribution of term probabilities Topic "1" word probability: travel 23,55 %, airplane 2,33 %, mars 0,003 %, mantle ϵ, … ϵ Topic "2" word probability: space 38,94 %, airplane ϵ, mars 13,43 %, mantle 0,05 %, … ϵ … Topic "k" word probability: shoe 12,12 %, coffee 3,39 %, telephone 13,43 %, paper 4,11 %, … ϵ Example documents: airplane passport pizza …; plate giraffe purple jump…
  • 51. BigML, Inc 16Topic Modeling - September 2017 Uses • As a preprocessor for other techniques • Bootstrapping categories for classification • Recommendation • Discovery in large, heterogeneous text datasets
  • 52. BigML, Inc 17Topic Modeling - September 2017 Topic Distribution Intuition: • Any given document is likely a mixture of the modeled topics… • This can be represented as a distribution of topic probabilities Example document: "Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?" Topic: travel (11%) word probability: travel 23,55 %, airplane 2,33 %, mars 0,003 %, mantle ϵ, … ϵ Topic: space (89%) word probability: space 38,94 %, airplane ϵ, mars 13,43 %, mantle 0,05 %, … ϵ
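The "random machine" intuition from the slides above, combined with the idea of a per-document topic distribution, can be sketched in a few lines of Python. Each word is generated by first drawing a topic from the document's topic mix (here the 11% travel / 89% space split from the slide) and then drawing a word from that topic's term distribution. The toy vocabulary and its probabilities in `TOPICS` are hypothetical, chosen only so that each topic sums to 1; this shows the generative story, not how a topic model is fitted.

```python
import random

# Hypothetical, normalised term probabilities per topic (toy vocabulary).
TOPICS = {
    "travel": {"travel": 0.5, "airplane": 0.3, "passport": 0.2},
    "space": {"space": 0.5, "mars": 0.3, "quasar": 0.2},
}


def generate_document(topic_mix, n_words, rng):
    """Generate words by drawing a topic per word, then a term from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        terms = TOPICS[topic]
        word = rng.choices(list(terms), weights=list(terms.values()), k=1)[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so repeated runs give the same document
doc = generate_document({"travel": 0.11, "space": 0.89}, 1000, rng)

# Share of generated words that come from the "space" vocabulary.
space_share = sum(word in TOPICS["space"] for word in doc) / len(doc)
print(doc[:10])
print(round(space_share, 2))
```

Because the two toy vocabularies do not overlap, the share of space words in a long generated document should land close to the 89% weight in the topic mix. Fitting a topic model runs this story in reverse: given the documents, it estimates the term distributions and each document's topic mix.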
  • 54. BigML, Inc 19Topic Modeling - September 2017 Clustering? • Clustering: Unlabelled Data → Batch Centroid → Centroid Label • Topic Model: Unlabelled Data (Text Fields) → Batch Topic Distribution → topic 1 prob, topic 3 prob, …, topic k prob
  • 56. BigML, Inc 21Topic Modeling - September 2017 Some Tips • Setting k • Much like k-means, the best value is data specific • Too few will agglomerate unrelated topics, too many will partition highly related topics • I tend to find the latter more annoying than the former • Tuning the Model • Remove common, useless terms • Set term limit higher, use bigrams