Pattern Mining: Extracting Value from Log Data

Presented by Srikrishna Sridhar

  1. 1. Pattern Mining: Getting the most out of your log data. Krishna Sridhar Staff Data Scientist, Dato Inc. krishna_srd
  2. 2. • Background - Machine Learning (ML) Research. - Ph.D Numerical Optimization @Wisconsin • Now - Build ML tools for data-scientists & developers @Dato. - Help deploy ML algorithms. @krishna_srd, @DatoInc About Me!
  3. 3. 45+ and growing fast! About Us!
  4. 4. + = Questions? • (Now) We are monitoring the chat window. • (Later) Email me srikris@dato.com. Webinars
  5. 5. About you?
  6. 6. Creating a model pipeline Ingest Transform Model Deploy Unstructured Data exploration data modeling Data Science Workflow Ingest Transform Model Deploy
  7. 7. GraphLab Create Train Model Pipeline Deploy Models Serve Requests (REST API) Monitor Services Get Live Feedback Update Pipelines Prototype & Develop Model Pipelines Update Live Experiment Deploy New Pipeline Dato Predictive Services Dato’s Products Dato Distributed We can help!
  8. 8. Log Journey Lots of data Insights Profits
  9. 9. Log Mining: Pattern Mining
  10. 10. Logs are everywhere!
  11. 11. Machine Learning in Logs Source: Mining Your Logs - Gaining Insight Through Visualization
  12. 12. Coffee shop Coffee Shops Menu
  13. 13. Receipts Coffee Shops Menu
  14. 14. Coffee Store Logs
  15. 15. Frequent Pattern Mining What sets of items were bought together?
  16. 16. Real Applications
  17. 17. Real Applications
  18. 18. Real Applications
  19. 19. Log Mining: Rule Mining
  20. 20. Can we recommend items? Rule Mining
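One way to read "rule mining" is as a post-processing step over frequent patterns. The sketch below is a minimal illustration, not the method shown in the talk: the function name `rules_from_patterns` is hypothetical, and scoring a rule by confidence (support of the whole pattern divided by support of the antecedent) is a standard choice assumed here, since the slide does not name a measure. The supports are those of the six-transaction example used later in the deck.

```python
from itertools import combinations

def rules_from_patterns(supports, min_confidence):
    """Turn frequent patterns into rules (antecedent -> consequent) scored by confidence."""
    rules = []
    for pattern, count in supports.items():
        # Split each pattern into a non-empty antecedent and a non-empty consequent.
        for r in range(1, len(pattern)):
            for antecedent in combinations(sorted(pattern), r):
                base = supports.get(frozenset(antecedent))
                if base is None:
                    continue  # antecedent was not mined as a frequent pattern
                confidence = count / base
                if confidence >= min_confidence:
                    consequent = tuple(sorted(pattern - set(antecedent)))
                    rules.append((antecedent, consequent, confidence))
    return sorted(rules, key=lambda rule: -rule[2])

# Supports of the frequent patterns (M = 4) from the pattern-growth example.
supports = {
    frozenset("B"): 4, frozenset("C"): 4, frozenset("D"): 6,
    frozenset("BD"): 4, frozenset("CD"): 4,
}
rules = rules_from_patterns(supports, min_confidence=0.9)
```

Under these assumptions a receipt containing B (or C) would lead to recommending D, while D alone is not a confident predictor of B or C.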
  21. 21. Real Applications
  22. 22. Log Mining: Feature Extraction
  23. 23. Feature Extraction 0 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0 Receipt Space Features in Menu Space ML
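A minimal sketch of one plausible reading of this slide: each receipt becomes a 0/1 vector with one entry per mined pattern, which can then be fed to a downstream ML model. The function name `pattern_features`, the item letters, and the choice of patterns are all illustrative assumptions, not taken from the demo.

```python
def pattern_features(receipt, patterns):
    """One 0/1 feature per pattern: 1 if the receipt contains every item in the pattern."""
    return [1 if pattern <= receipt else 0 for pattern in patterns]

# Hypothetical frequent patterns over placeholder items B, C, D.
patterns = [frozenset("B"), frozenset("C"), frozenset("D"),
            frozenset("BD"), frozenset("CD")]
receipts = [{"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"}]

features = [pattern_features(receipt, patterns) for receipt in receipts]
```

Each row of `features` has the same length regardless of how many items the receipt contains, which is what makes it usable as model input.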
  24. 24. 3 Useful Data Mining Tasks: Pattern Mining, Rule Mining, Feature Extraction
  25. 25. Demo
  26. 26. ML is not a black-box. Transparency Learning is also about understanding. Interpretability Whatever can go wrong, will go wrong. Diagnosis Moving on
  27. 27. Pattern Mining Explained
  28. 28. Formulating Pattern Mining N distinct items → 2^N itemsets
  29. 29. Formulating Pattern Mining Find the top K most frequent sets of length at least L that occur at least M times.
  30. 30. Formulating Pattern Mining Find the top K most frequent sets of length at least L that occur at least M times. - max_patterns - min_length - min_support
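The formulation can be made concrete with a brute-force sketch that enumerates every itemset, which is exactly the 2^N blow-up the later algorithms avoid. The function name `top_k_patterns` is hypothetical; its parameters borrow the names `max_patterns` (K), `min_length` (L), and `min_support` (M) from the slide, but this is not the GraphLab Create implementation. The database is the six-receipt example from the Principle slides.

```python
from itertools import combinations

def top_k_patterns(db, max_patterns, min_length, min_support):
    """Brute force: count every candidate itemset, then keep the K most frequent."""
    items = sorted(set().union(*db))
    found = []
    for size in range(max(min_length, 1), len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions that contain the whole candidate.
            support = sum(1 for t in db if set(candidate) <= t)
            if support >= min_support:
                found.append((support, candidate))
    # Highest support first; ties are broken by itemset only to get a stable order.
    found.sort(key=lambda sc: (-sc[0], sc[1]))
    return found[:max_patterns]

db = [{"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
      {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"}]
result = top_k_patterns(db, max_patterns=3, min_length=2, min_support=4)
```

When several patterns tie on support, which of them make the top K is arbitrary; the sort key above just makes the choice deterministic.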
  31. 31. Pattern Mining N distinct items → 2^N itemsets
  32. 32. Pattern Mining: Principles
  33. 33. Principle 1: What is frequent? A pattern is frequent if it occurs at least M times. {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D} {C, D}: 5 is frequent M = 4 {A, D}: 3 is not frequent
  34. 34. Principle 1: What is frequent? A pattern is frequent if it occurs at least M times. {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D} {C, D}: 5 is frequent M = 4 {A, D}: 3 is not frequent min_support
  35. 35. Principle 2: Apriori principle A pattern is frequent only if every subset is frequent. {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D} {B, C, D}: 4 is frequent, therefore {C, D}: 5 is frequent. {A}: 3 is not frequent, therefore {A, D}: 3 is not frequent. M = 4
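Both principles can be checked directly on the six receipts shown on these slides. This is a small sketch; the helper name `support` is hypothetical, and M = 4 is the threshold from the slide.

```python
from itertools import combinations

db = [{"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
      {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"}]
M = 4

def support(pattern, db):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in db if set(pattern) <= t)

# Principle 1: frequent means support >= M.
cd_support = support({"C", "D"}, db)
ad_support = support({"A", "D"}, db)

# Principle 2: every subset of a pattern occurs at least as often as the pattern,
# so a frequent pattern has only frequent subsets ...
bcd_support = support({"B", "C", "D"}, db)
subsets_at_least_as_frequent = all(
    support(sub, db) >= bcd_support
    for r in (1, 2)
    for sub in combinations("BCD", r)
)
# ... and an infrequent item rules out every pattern that contains it.
a_support = support({"A"}, db)
```

Counting by hand gives the same numbers: {C, D} occurs in five receipts, {B, C, D} in four, and {A} and {A, D} in three each.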
  36. 36. Two Main Algorithms • Candidate Generation - Apriori - Eclat • Pattern Growth - FP-Growth - TopK FP-Growth [GLC 1.6]
  37. 37. Lots of Generalizations Source: http://www.philippe-fournier-viger.com/spmf/
  38. 38. Candidate Generation Two phases 1. Candidate generation. 2. Candidate filtering. Exploit Apriori Principle!
  39. 39. Candidate Generation {AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ? {A} : ? {B} : ? {C} : ? {D} : ? { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D}
  40. 40. Candidate Generation {AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ? {A} : ? {B} : ? {C} : ? {D} : ? { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D}
  41. 41. Candidate Generation {AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ? {A} : 3 {B} : 4 {C} : 5 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D}
  42. 42. Candidate Generation {AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ? {A} : 3 {B} : 4 {C} : 5 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D}
  43. 43. Candidate Generation {AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ? {A} : 3 {B} : 4 {C} : 5 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D}
  44. 44. Candidate Generation {AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5 {A} : 3 {B} : 4 {C} : 5 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D}
  45. 45. Candidate Generation {AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5 {A} : 3 {B} : 4 {C} : 5 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {B, C, D} {A, C, D} {A, B, C, D} {A, D} {B, C, D} {B, C, D}
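The level-by-level walk through the lattice above can be sketched as a compact Apriori loop. This is a minimal illustration in plain Python under the slide's assumptions (M = 4, same six transactions); the function name `apriori` and its return format are choices made here, not an API from the talk.

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise candidate generation and filtering; returns {itemset: support}."""
    def count(candidates):
        # One pass over the data per level.
        return {c: sum(1 for t in db if c <= t) for c in candidates}

    singles = {frozenset([item]) for item in set().union(*db)}
    level = {c: s for c, s in count(singles).items() if s >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Phase 1, candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori principle: drop any candidate with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Phase 2, candidate filtering: count the survivors and keep the frequent ones.
        level = {c: s for c, s in count(candidates).items() if s >= min_support}
        frequent.update(level)
        k += 1
    return frequent

db = [{"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
      {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"}]
frequent = apriori(db, min_support=4)
```

Because {A} has support 3, no candidate containing A is ever generated, which is why {AB}, {AC}, and {AD} never need to be counted.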
  46. 46. Pattern Growth Two phases 1. Candidate filtering. 2. Conditional database construction. Avoid full scans over the data & large candidate sets!
  47. 47. Pattern Growth - Depth First {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D} {AB} : 1 {AC} : 2 {AD} : 3 {BD} : 4 {CD} : 4 {A} : 3 {B} : 4 {C} : 4 {D} : 6 { } : 6 {ABC} : 0 {ABD} : 1 {ACD} : 2 {BCD} : 2 {BC} : 2
  48. 48. Pattern Growth - Preprocessing {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D} {A} : 3 {B} : 4 {C} : 4 {D} : 6 { } : 6
  49. 49. Pattern Growth - Depth First {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D} {AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ? {A} : ? {B} : ? {C} : ? {D} : ? { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {BC} : ?
  50. 50. Pattern Growth - Depth First {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D} {AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ? {A} : 3 {B} : 4 {C} : 4 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {BC} : ?
  51. 51. Pattern Growth - Depth First {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D} {AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ? {A} : 3 {B} : 4 {C} : 4 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {BC} : ?
  52. 52. Pattern Growth - Depth First {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D} {AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ? {A} : 3 {B} : 4 {C} : 4 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {BC} : 2
  53. 53. Pattern Growth {B} : 4 { } : 6 Call: Growth(db = DB{}, item = B, freq = {B,C,D}) DB{} {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D}
  54. 54. Pattern Growth {B} : 4 { } : 6 Conditional Database Construction DB{} DB{B} {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D} {C, D} {D} {C, D} {D}
  55. 55. Pattern Growth {B} : 4 { } : 6 Candidate Filtering DB{B} {C, D} {D} {C, D} {D} {D} : 4 {C} : 2 DB{} {B, C, D} {A, C, D} {B, D} {A, C, D} {B, C, D} {A, B, D} DB{B} Add {BD} as frequent
  56. 56. Pattern Growth - Depth First {C, D} {D} {C, D} {D} {AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ? {A} : 3 {B} : 4 {C} : 4 {D} : 6 { } : 6 {ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ? {BC} : 2
  57. 57. Pattern Growth Recurse: Growth(db = DB{B}, item = D, freq = {D}) DB{B} {C, D} {D} {C, D} {D} {B} : 4 { } : 6 {BD} : 4 DB{BD}
  58. 58. Pattern Growth - Depth First {AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ? {A} : 3 {B} : 4 {C} : 4 {D} : 6 { } : 6 {ABC} : ? {ABD} : X {ACD} : ? {BCD} : X {BC} : 2
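The depth-first walk above can be sketched as a short recursive function that alternates the two phases: filter candidates in the current database, then build a conditional database for each surviving item and recurse. This is a simplified illustration that uses plain lists rather than an FP-tree; the name `growth` echoes the `Growth(db, item, freq)` call on the slides, but the signature here is an assumption.

```python
from collections import Counter

def growth(db, prefix, min_support, out):
    """Mine all frequent patterns that extend `prefix` from its conditional database."""
    # Candidate filtering: one pass over the (conditional) database.
    counts = Counter(item for t in db for item in t)
    frequent = sorted(i for i, c in counts.items() if c >= min_support)
    for item in frequent:
        pattern = prefix + (item,)
        out[pattern] = counts[item]
        # Conditional database construction: keep the transactions that contain
        # `item`, restricted to frequent items that come after it in the ordering.
        later = [i for i in frequent if i > item]
        conditional = [[i for i in later if i in t] for t in db if item in t]
        growth(conditional, pattern, min_support, out)
    return out

db = [{"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
      {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"}]
patterns = growth(db, (), 4, {})
```

For item B the conditional database comes out as [C, D], [D], [C, D], [D], matching DB{B} on the slides; only D is frequent there, so {BD} is added and {BC}, {BCD} are never explored further.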
  59. 59. Compare & Contrast • Candidate Generation + Better than brute force + Filters candidate sets - Multiple passes over the data • Pattern Growth + Fewer passes over the data + Space efficient.
  60. 60. Compare & Contrast • Candidate Generation + Better than brute force + Filters candidate sets - Multiple passes over the data • Pattern Growth + Fewer passes over the data + Space efficient. Better choice
  61. 61. FP-Tree Compression Figures From Florian Verhein’s Slides on FP-Growth
  62. 62. FP-Growth Algorithm Figures From Florian Verhein’s Slides on FP-Growth Two phases 1. Candidate filtering. 2. Conditional database construction.
  63. 63. TopK FP-Growth Algorithm Similar to FP-Growth 1. Dynamically raise min_support. 2. Estimates of min_support greatly help.
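A minimal sketch of what "dynamically raise min_support" could look like: keep the best K patterns in a min-heap and, once it is full, require new patterns to beat the weakest one kept. This is an interpretation for illustration only, built on list-based conditional databases rather than an FP-tree, and is not the GraphLab Create implementation; the name `top_k_growth` and its signature are hypothetical. It assumes `k >= 1`.

```python
import heapq
from collections import Counter

def top_k_growth(db, k, min_support=1):
    """Pattern growth that keeps only the K most frequent patterns (assumes k >= 1)."""
    heap = []  # min-heap of (support, pattern); heap[0] is the weakest kept pattern

    def threshold():
        # Until K patterns are known, the caller's bound applies; afterwards a
        # pattern must beat the weakest kept one, so the bar rises as we go.
        if len(heap) < k:
            return min_support
        return max(min_support, heap[0][0] + 1)

    def grow(cond_db, prefix):
        counts = Counter(item for t in cond_db for item in t)
        for item in sorted(counts):
            support = counts[item]
            if support < threshold():
                continue  # its extensions cannot be more frequent, so skip them too
            pattern = prefix + (item,)
            if len(heap) < k:
                heapq.heappush(heap, (support, pattern))
            else:
                heapq.heapreplace(heap, (support, pattern))
            # Conditional database: transactions with `item`, later items only.
            projected = [[i for i in t if i > item] for t in cond_db if item in t]
            grow(projected, pattern)

    grow(db, ())
    return sorted(heap, key=lambda sp: (-sp[0], sp[1]))

db = [{"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
      {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"}]
top5 = top_k_growth(db, k=5)
```

Passing a good initial `min_support` prunes low-support branches before the heap fills, which is one way to read the remark that estimates of min_support help.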
  64. 64. Performance on Website Logs • 1.5m events • 84k sessions • 3k unique ids
  65. 65. Future Work
  66. 66. Distributed FP-Growth Partition database on item-ids. Database
  67. 67. Bags + Sequences × 2 Itemset: {Item} Bags: {Item: quantity} Sequences : (item)
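The three representations on this slide differ only in what they keep from the raw events. A small sketch with a hypothetical coffee-shop session (the item names are placeholders):

```python
from collections import Counter

events = ["latte", "muffin", "latte"]  # hypothetical session, in order of occurrence

as_itemset = frozenset(events)   # {item}: which items occurred
as_bag = Counter(events)         # {item: quantity}: how many of each
as_sequence = tuple(events)      # (item, ...): order preserved
```

Itemsets discard both quantity and order, bags keep quantity, and sequences keep order, so each generalization needs its own notion of when a pattern "occurs" in a transaction.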
  68. 68. Model built, now what?
  69. 69. Creating a model pipeline Ingest Transform Model Deploy Unstructured Data exploration data modeling Data Science Workflow Ingest Transform Model Deploy
  70. 70. Demo
  71. 71. Summary Log Data Mining ≠ Rocket Science • FP-Growth for finding frequent patterns. • Find rules from patterns to make predictions. • Extract features for useful ML in pattern space.
  72. 72. SELECT questions FROM audience WHERE difficulty == “Easy” Thanks!
  73. 73. Extra Slides