The document discusses topic modeling, an unsupervised machine learning technique for discovering hidden topics that model unstructured text. Topic modeling is based on LDA and finds topics that are independently interesting, extracting semantic meaning. It differs from traditional text analysis by creating tens of topics rather than thousands of token counts, and considers co-occurrence of words to model topics rather than analyzing words individually.
3. BigML, Inc 3Association Discovery
Association Discovery
• An unsupervised learning technique
• No labels necessary
• Useful for data discovery
• Finds "significant" correlations/associations/relations
• Shopping cart: Coffee and sugar
• Medical: High plasma glucose and diabetes
• Expresses them as "if then rules"
• If "antecedent" then "consequent"
• Significance measures
• BigML: “Magnum Opus” from Geoff Webb
4. BigML, Inc 4Association Discovery
Clusters
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
similar
6. BigML, Inc 6Association Discovery
Anomaly Detection
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
anomaly
13. BigML, Inc 13Association Discovery
Association Discovery
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
Rules:
Antecedent                          Consequent
{customer = Bob, account = 3421}    zip = 46140
{class = gas}                       amount < 100
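The rules on this slide can be checked against the seven example transactions. The sketch below reads the slide as pairing {customer = Bob, account = 3421} with zip = 46140 and {class = gas} with amount < 100, which is an assumption about the original layout. The `rule_counts` helper is hypothetical and only counts matching rows; it is not how BigML searches for rules.

```python
# Example transactions copied from the slide.
FIELDS = ("date", "customer", "account", "auth", "class", "zip", "amount")
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thr", "Sally", 6788, "sign", "food", 26339, 51),
]
TRANSACTIONS = [dict(zip(FIELDS, row)) for row in ROWS]


def rule_counts(antecedent, consequent):
    """Return (rows matching the antecedent, rows matching antecedent and consequent)."""
    covered = [t for t in TRANSACTIONS if antecedent(t)]
    both = [t for t in covered if consequent(t)]
    return len(covered), len(both)


# Assumed pairing: {customer = Bob, account = 3421} -> zip = 46140
bob_rule = rule_counts(
    lambda t: t["customer"] == "Bob" and t["account"] == 3421,
    lambda t: t["zip"] == 46140,
)
# Assumed pairing: {class = gas} -> amount < 100
gas_rule = rule_counts(
    lambda t: t["class"] == "gas",
    lambda t: t["amount"] < 100,
)

print(bob_rule)  # (4, 3)
print(gas_rule)  # (2, 2)
```

Three of Bob's four transactions are in zip 46140, and both gas purchases are under 100, so neither rule is contradicted by more than one row.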
14. BigML, Inc 14Association Discovery
Use Cases
• Market Basket Analysis: Items that go together
• Data Discovery: how do instances relate?
• Behaviors that occur together
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• gene expression associated with outcomes
• Medical risk factors
15. BigML, Inc 15Association Discovery
What is interesting?
• Infrequent patterns can be strong, but are they interesting?
• Vodka and caviar
• Storms and high water sales
• Frequent patterns can be strong, but are they interesting?
• Coffee and milk
• High plasma glucose and diabetes
• “Frequency” isn’t the answer…
• Depends on the data and domain
• We need better metrics to define what is interesting
16. BigML, Inc 16Association Discovery
Association Metrics
Coverage
Percentage of instances which match antecedent “A”
[Diagram: Instances, with regions A and C]
17. BigML, Inc 17Association Discovery
Association Metrics
Support
Percentage of instances which match antecedent “A” and consequent “C”
[Diagram: Instances, with regions A and C]
18. BigML, Inc 18Association Discovery
Association Metrics
Confidence
Percentage of instances in the antecedent which also contain the consequent.
Confidence = Support / Coverage
[Diagram: Instances, with regions A and C]
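Coverage, support, and confidence follow directly from counting instances. The sketch below is a minimal illustration on ten made-up shopping carts; the cart contents and function names are assumptions for the example, not data from the deck.

```python
# Ten hypothetical shopping carts (illustrative data only).
CARTS = [
    {"coffee", "sugar"},
    {"coffee", "sugar", "milk"},
    {"coffee"},
    {"coffee", "milk"},
    {"sugar"},
    {"milk"},
    {"bread"},
    {"bread", "milk"},
    {"coffee", "sugar", "bread"},
    {"tea"},
]


def coverage(carts, antecedent):
    """Fraction of instances that match the antecedent A."""
    return sum(1 for c in carts if antecedent in c) / len(carts)


def support(carts, antecedent, consequent):
    """Fraction of instances that match both A and C."""
    return sum(1 for c in carts if antecedent in c and consequent in c) / len(carts)


def confidence(carts, antecedent, consequent):
    """Fraction of instances matching A that also match C (support / coverage)."""
    matching_a = [c for c in carts if antecedent in c]
    if not matching_a:
        return 0.0
    return sum(1 for c in matching_a if consequent in c) / len(matching_a)


cov = coverage(CARTS, "coffee")             # 5 of 10 carts contain coffee
sup = support(CARTS, "coffee", "sugar")     # 3 of 10 carts contain both
conf = confidence(CARTS, "coffee", "sugar") # 3 of the 5 coffee carts contain sugar
print(cov, sup, conf)
```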
19. BigML, Inc 19Association Discovery
Association Metrics
Confidence ranges from 0% to 100%:
• 0%: A never implies C
• In between: A sometimes implies C
• 100%: A always implies C
[Diagrams: Instances with regions A and C, one per case]
20. BigML, Inc 20Association Discovery
Association Metrics
Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
[Diagrams: A and C as Independent vs. as Observed]
Problem: if p(C) is "small" then… lift may be large.
21. BigML, Inc 21Association Discovery
Association Metrics
• Lift < 1: Negative Correlation
• Lift = 1: No Correlation
• Lift > 1: Positive Correlation
[Diagrams: A and C as Independent vs. as Observed, one pair per case]
22. BigML, Inc 22Association Discovery
Association Metrics
Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
[Diagrams: A and C as Independent vs. as Observed]
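Lift and leverage compare the same two quantities, one as a ratio and one as a difference, which is why they can disagree on rare items. The sketch below uses made-up probabilities (labelled as such in the code) to show the "small p(C)" problem: a rare pair gets a very large lift but a tiny leverage.

```python
def lift(p_a, p_c, support):
    """Observed support divided by the support expected under independence."""
    return support / (p_a * p_c)


def leverage(p_a, p_c, support):
    """Observed support minus the support expected under independence."""
    return support - p_a * p_c


# Hypothetical probabilities, chosen only to illustrate the contrast.
common = {"p_a": 0.5, "p_c": 0.4, "support": 0.3}      # frequent pair
rare = {"p_a": 0.002, "p_c": 0.001, "support": 0.001}  # rare pair, small p(C)

common_lift, common_leverage = lift(**common), leverage(**common)
rare_lift, rare_leverage = lift(**rare), leverage(**rare)

print(f"common: lift={common_lift:.2f} leverage={common_leverage:.6f}")
print(f"rare:   lift={rare_lift:.2f} leverage={rare_leverage:.6f}")
```

With these numbers the common pair has lift 1.5 and leverage 0.1, while the rare pair has lift 500 and leverage just under 0.001. Ranking by lift would put the rare pair far ahead, and ranking by leverage would do the opposite.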
23. BigML, Inc 23Association Discovery
Association Metrics
• Leverage < 0: Negative Correlation
• Leverage = 0: No Correlation
• Leverage > 0: Positive Correlation
[Diagrams: A and C as Independent vs. as Observed, one pair per case; scale starts at -1…]
24. BigML, Inc 24Association Discovery
Magnum Opus
• Select measure of interest: Leverage, Lift, etc.
• System finds the top-k associations on that measure within constraints
• Must be statistically significant interaction between antecedent and consequent
• Every item in the antecedent must increase the strength of association
25. BigML, Inc 25Association Discovery
Basic AD Configuration
1. Search Strategy: Support/Coverage/Confidence/Lift/Leverage
2. Max Number of Associations: 1 to 500 (default 100)
3. Max Items in Antecedent: 1 to 10 (default 4)
4. Complement Items: True / False
• False: Coffee and…
• True: Not Coffee and…
5. Missing Items: True / False
• False: Loan Description contains "Ferrari" and…
• True: Loan Description is missing and…
26. BigML, Inc 26Association Discovery
Data Types
• numeric: 1, 2.0, 3, -5.4
• categorical: true, yes, red, mammal
• date-time: 2013-09-25 10:02
• text: Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
A date-time field expands into:
field          taken from       example
YEAR           YYYY-MM-DD       2013
MONTH          YYYY-MM-DD       September
DAY-OF-MONTH   YYYY-MM-DD       25
DAY-OF-WEEK    M-T-W-T-F-S-D    Wednesday
HOUR           HH:MM:SS         10
MINUTE         HH:MM:SS         02
A text field expands into token counts:
“great” appears 2 times
“afraid” appears 1 time
“born” appears 1 time
“some” appears 2 times
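The date-time expansion on this slide can be sketched with the standard library. The `expand_datetime` helper and its output keys are illustrative assumptions that mirror the slide's field names; BigML's own parsing is not shown in the deck.

```python
from datetime import datetime

# Explicit English names so the result does not depend on the system locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def expand_datetime(value, fmt="%Y-%m-%d %H:%M"):
    """Split one date-time string into the component fields listed on the slide."""
    ts = datetime.strptime(value, fmt)
    return {
        "YEAR": ts.year,
        "MONTH": MONTHS[ts.month - 1],
        "DAY-OF-MONTH": ts.day,
        "DAY-OF-WEEK": DAYS[ts.weekday()],  # weekday(): Monday is 0
        "HOUR": ts.hour,
        "MINUTE": ts.minute,
    }


parts = expand_datetime("2013-09-25 10:02")
print(parts)
```

Each derived field can then be treated as its own numeric or categorical feature, so a rule can refer to "DAY-OF-WEEK = Wednesday" rather than to the raw timestamp.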
27. BigML, Inc 27Association Discovery
Items Type
• items: coffee, sugar, milk, honey, dish soap, bread
• Canonical example: shopping cart contents
• Single feature describing a list of items
• Each item separated by a comma (default)
28. BigML, Inc 28Association Discovery
Use Cases
GOAL: Discover “interesting” rules about what store items are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout
30. BigML, Inc 30Association Discovery
Use Cases
GOAL: Find general rules that indicate diabetes.
• Dataset of diagnostic measurements of 768 patients.
• Each patient labelled True/False for diabetes.
32. BigML, Inc 32Association Discovery
Medical Risks
Decision Tree
If plasma glucose > 155
and bmi > 29.32
and diabetes pedigree > 0.32
and insulin <= 629
and age <= 44
then diabetes = TRUE
Association Rule
If plasma glucose > 146
then diabetes often TRUE
33. BigML, Inc 33Association Discovery
Advanced AD Config
1. Measures: Set minimum criteria for AD measures
2. Minimum Significance: lower values reduce spurious rules
3. Consequent: Restrict rules to specific consequent criteria
4. Discretization: How numeric values are handled
• Pretty: rounds off discretized values: 20 instead of 20.234
• Size: the number of ranges (default 5)
• Type: equal population / width
• Trim: Removes percentage of values from the tails
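The difference between the two discretization types is easiest to see on skewed data. The sketch below bins the seven `amount` values from the earlier example table into 3 ranges (the slide's default is 5; 3 is used here only because the sample is tiny). The edge-selection rules are simple assumptions for illustration, and the "pretty" and "trim" options are not modelled.

```python
from bisect import bisect_right

# The "amount" column from the example transactions.
AMOUNTS = [135, 401, 234, 94, 2459, 83, 51]


def equal_width_edges(values, size):
    """Interior bin edges that split [min, max] into `size` equally wide ranges."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / size
    return [lo + i * step for i in range(1, size)]


def equal_population_edges(values, size):
    """Interior bin edges chosen so each range holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // size] for i in range(1, size)]


def bin_counts(values, edges):
    """Count how many values fall in each range defined by the interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts


width_counts = bin_counts(AMOUNTS, equal_width_edges(AMOUNTS, 3))
population_counts = bin_counts(AMOUNTS, equal_population_edges(AMOUNTS, 3))
print(width_counts)       # [6, 0, 1]
print(population_counts)  # [2, 2, 3]
```

With equal width, the single 2459 purchase stretches the ranges so far that six of the seven values share one bin. Equal population keeps the bins balanced, which is also the kind of distortion that trimming the tails is meant to reduce.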
37. BigML, Inc 2Topic Modeling - September 2017
Topic Modeling
• Method for discovering structure in "unstructured" text.
• Based on LDA (Latent Dirichlet Allocation), introduced by David Blei, Andrew Ng, and Michael I. Jordan in 2003.
• Now "BigML Easy"
38. BigML, Inc 3Topic Modeling - September 2017
BigML Resources
• Data Exploration: SOURCE, DATASET, CORRELATION, STATISTICAL TEST
• Supervised Learning: MODEL, ENSEMBLE, LOGISTIC REGRESSION, EVALUATION
• Unsupervised Learning: CLUSTER, ANOMALY DETECTOR, ASSOCIATION DISCOVERY, TOPIC MODEL
• Scoring: SINGLE/BATCH PREDICTION
• Automation: SCRIPT, LIBRARY, EXECUTION
39. BigML, Inc 4Topic Modeling - September 2017
Unsupervised Learning
• Learn from instances
• Each instance has features
• There is no label
[Diagram: grid of Instances by Features]
• Clustering: Find similar instances
• Anomaly Detection: Find unusual instances
• Association Discovery: Find feature rules
40. BigML, Inc 5Topic Modeling - September 2017
Topic Model
[Diagram label: Text Fields]
• Unsupervised algorithm
• Learns only from text fields
• Finds hidden topics that model the text
Questions:
• How is this different from the Text Analysis that BigML already offers?
• What does it output and how do we use it?
• Unsupervised… model?
41. BigML, Inc 6Topic Modeling - September 2017
Text Analysis
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
great: appears 4 times
Bag of Words
42. BigML, Inc 7Topic Modeling - September 2017
Text Analysis
… great afraid born achieve … …
… 4 1 1 1 … …
… … … … … … …
Be not afraid of greatness:
some are born great, some achieve
greatness, and some have greatness
thrust upon ‘em.
Model
The token “great”
occurs more than 3 times
The token “afraid”
occurs no more than once
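The bag-of-words counts on this slide can be reproduced with a few lines of Python. The `crude_stem` function below is a deliberately naive stand-in that only strips a trailing "ness"; it is an assumption made so that "greatness" counts toward "great", and it is not the stemmer BigML uses.

```python
import re
from collections import Counter

TEXT = ("Be not afraid of greatness: some are born great, some achieve "
        "greatness, and some have greatness thrust upon 'em.")


def crude_stem(token):
    """Naive stemming for this example only: drop a trailing 'ness'."""
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token


def bag_of_words(text):
    """Lowercase, split into alphabetic tokens, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens)


counts = bag_of_words(TEXT)
print(counts["great"], counts["afraid"], counts["born"], counts["achieve"])  # 4 1 1 1

# The model's splits are then simple tests on individual token counts.
great_more_than_3 = counts["great"] > 3
afraid_at_most_once = counts["afraid"] <= 1
```

Each token count becomes one column, and a model can only test those columns one at a time, which is the limitation that topic models address.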
45. BigML, Inc 10Topic Modeling - September 2017
Text Analysis vs Topic Models
Text Analysis:
• Creates thousands of hidden token counts
• Token counts are independently uninteresting
• No semantic importance
• No measure of co-occurrence
Topic Model:
• Creates tens of topics that model the text
• Topics are independently interesting
• Semantic meaning extracted
• Support for bigrams
46. BigML, Inc 11Topic Modeling - September 2017
Generative Modeling
• Decision trees are discriminative models
• Aggressively model the classification boundary
• Parsimonious: Don’t consider anything you don’t have to
• Topic Models are generative models
• Come up with a theory of how the data is generated
• Tweak the theory to fit your data
47. BigML, Inc 12Topic Modeling - September 2017
Generating Documents
• "Machine" that generates a random word with equal probability with each pull.
• Pull random number of times to generate a document.
• All documents can be generated, but most are nonsense.
Words in the machine: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
Example generated documents:
• shoe asteroid flashlight pizza…
• plate giraffe purple jump…
• Be not afraid of greatness: some are born great, some achieve greatness…
word        probability
shoe        ϵ
asteroid    ϵ
flashlight  ϵ
pizza       ϵ
…           ϵ
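The random word "machine" is a uniform sampler over a vocabulary. The sketch below is one way to write it; the vocabulary is limited to the words visible on the slide, and the maximum document length of 8 is an arbitrary assumption.

```python
import random

# Words shown on the slide; the real vocabulary would be far larger.
VOCAB = ["cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
         "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight"]


def generate_document(rng, max_len=8):
    """Pull the machine a random number of times; every word is equally likely."""
    length = rng.randint(1, max_len)
    return [rng.choice(VOCAB) for _ in range(length)]


rng = random.Random(0)  # seeded so repeated runs give the same documents
documents = [generate_document(rng) for _ in range(3)]
for doc in documents:
    print(" ".join(doc))
```

Any sequence of vocabulary words has a nonzero chance of appearing, but nothing in a uniform sampler favors sequences that mean anything.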
48. BigML, Inc 13Topic Modeling - September 2017
Topic Model
Intuition:
• Written documents have meaning: one way to describe meaning is to assign a topic.
• For our random machine, the topic can be thought of as increasing the probability of certain words.
Topic: travel (example document: airplane passport pizza …)
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: space (example document: mars quasar lightyear soda)
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
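A topic, in this intuition, is the same machine with unequal word probabilities. The sketch below samples from the two example topics using the probabilities shown on the slide. Two things are assumptions: `EPSILON = 1e-6` stands in for ϵ, and because the slide's tables are truncated the values are treated as relative weights (`random.choices` normalizes them) rather than a complete distribution.

```python
import random
from collections import Counter

EPSILON = 1e-6  # assumed stand-in for the slide's "ϵ"

# Word weights from the slide, as fractions (23.55% -> 0.2355, and so on).
TOPICS = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003, "mantle": EPSILON},
    "space": {"space": 0.3894, "airplane": EPSILON, "mars": 0.1343, "mantle": 0.0005},
}


def sample_words(topic, n, rng):
    """Draw n words from one topic; weights are relative, not a full distribution."""
    words = list(TOPICS[topic])
    weights = [TOPICS[topic][w] for w in words]
    return rng.choices(words, weights=weights, k=n)


rng = random.Random(0)
travel_counts = Counter(sample_words("travel", 1000, rng))
space_counts = Counter(sample_words("space", 1000, rng))
print(travel_counts.most_common(2))
print(space_counts.most_common(2))
```

The word "mars" shows up often under the space topic and almost never under travel, even though both topics draw from the same vocabulary.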
49. BigML, Inc 14Topic Modeling - September 2017
Topic Model
• Each text field in a row is concatenated into a document
• The documents are analyzed to generate "k" related topics
• Each topic is represented by a distribution of term probabilities
Example documents:
• airplane passport pizza …
• plate giraffe purple jump…
Topic: "1"
word       probability
travel     23.55%
airplane   2.33%
mars       0.003%
mantle     ϵ
…          ϵ
Topic: "2"
word       probability
space      38.94%
airplane   ϵ
mars       13.43%
mantle     0.05%
…          ϵ
…
Topic: "k"
word       probability
shoe       12.12%
coffee     3.39%
telephone  13.43%
paper      4.11%
…          ϵ
51. BigML, Inc 16Topic Modeling - September 2017
Uses
• As a preprocessor for other techniques
• Bootstrapping categories for classification
• Recommendation
• Discovery in large, heterogeneous text datasets
52. BigML, Inc 17Topic Modeling - September 2017
Topic Distribution
Intuition:
• Any given document is likely a mixture of the modeled topics…
• This can be represented as a distribution of topic probabilities
Example document: Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?
Topic: travel (11% of this document)
word      probability
travel    23.55%
airplane  2.33%
mars      0.003%
mantle    ϵ
…         ϵ
Topic: space (89% of this document)
word      probability
space     38.94%
airplane  ϵ
mars      13.43%
mantle    0.05%
…         ϵ
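One way to read the 11% / 89% split is as mixture weights: the probability of a word in this document is the weighted sum of its probability under each topic. The sketch below works that arithmetic through for the slide's numbers; `EPSILON = 1e-6` is an assumed stand-in for ϵ, and the function name is illustrative.

```python
EPSILON = 1e-6  # assumed stand-in for the slide's "ϵ"

# Per-topic word probabilities from the slide, as fractions.
TOPIC_WORDS = {
    "travel": {"travel": 0.2355, "airplane": 0.0233, "mars": 0.00003},
    "space": {"space": 0.3894, "mars": 0.1343, "mantle": 0.0005},
}

# Topic distribution of the example document about travelling to Mars.
DOC_TOPICS = {"travel": 0.11, "space": 0.89}


def word_probability(word):
    """Mixture probability: sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * TOPIC_WORDS[topic].get(word, EPSILON)
        for topic, weight in DOC_TOPICS.items()
    )


p_mars = word_probability("mars")      # 0.11 * 0.00003 + 0.89 * 0.1343
p_travel = word_probability("travel")  # 0.11 * 0.2355 + 0.89 * EPSILON
print(f"mars={p_mars:.7f} travel={p_travel:.7f}")
```

The resulting topic probabilities (here 0.11 and 0.89) are what a batch topic distribution attaches to each row, and they can be used as ordinary numeric features.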
54. BigML, Inc 19Topic Modeling - September 2017
Clustering?
• Clustering: Unlabelled Data → Clustering → Batch Centroid → Centroid Label
• Topic Model: Unlabelled Data (Text Fields) → Topic Model → Batch Topic Distribution → topic 1 prob, topic 3 prob, … topic k prob
56. BigML, Inc 21Topic Modeling - September 2017
Some Tips
• Setting k
• Much like k-means, the best value is data specific
• Too few will agglomerate unrelated topics, too many will partition highly related topics
• I tend to find the latter more annoying than the former
• Tuning the Model
• Remove common, useless terms
• Set term limit higher, use bigrams
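The two tuning tips can be sketched as a preprocessing step: drop common terms, then add bigrams to the term list. The stopword set below is a tiny hand-picked assumption, and forming bigrams after stopword removal (so that pairs can span a removed word) is a simplification, not a description of BigML's tokenizer.

```python
import re

# Hypothetical stopword list for this example only.
STOPWORDS = {"be", "not", "of", "some", "are", "and", "have", "upon", "em"}


def terms_with_bigrams(text, stopwords=STOPWORDS):
    """Return the remaining single terms followed by adjacent-pair bigrams."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


terms = terms_with_bigrams("some are born great, some achieve greatness")
print(terms)
# ['born', 'great', 'achieve', 'greatness', 'born great', 'great achieve', 'achieve greatness']
```

Adding bigrams grows the number of candidate terms, which is one reason to raise the term limit at the same time.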