Salesforce Einstein is the artificial intelligence layer that delivers predictions and recommendations based on the customer’s unique business processes and data. Einstein Journey Insights is one of the key products offered by Salesforce DMP. It helps marketers and publishers use AI to analyze billions of touchpoints across consumer journeys and discover the optimal paths to conversion, including insights about which channels, messages, and events perform best.
To understand how consumers engage with website articles, advertising campaigns, social events, and products, and how that engagement ultimately leads to a conversion, analysts need to identify the key events among thousands of events per user. Frequent pattern mining is the key technique for solving such problems. We have all heard the beer-and-diapers story about mining consumer buying habits. At Salesforce DMP, however, we see over 3.5 billion unique users globally each month across site, media, mobile app, transactional, and offline traffic sources. That is more than Facebook, Wikipedia and Twitter combined. The sheer volume and the heterogeneous nature of the events and their metadata offer unique opportunities to analyze the complete consumer journey.
However, they also make it especially challenging to interpret the results, or even to run the frequent pattern algorithm cost-effectively. In this talk, we share our experience of running large-scale frequent pattern mining operations using Apache Spark in our Einstein Journey Insights product. We will examine the practicality of the frequent pattern technique, and show how Spark helps us address the scaling problem, deal with diverse metadata, and generate interpretable and actionable insights.
13. Theory Meets Reality
Large Scale Frequent Pattern Mining with Apache Spark in the Real World
Kexin Xie, Architect of Marketing Cloud Einstein
kexin.xie@salesforce.com, @realstraw
Wanderley Liu, Senior Data Science Engineer
wanderley.liu@salesforce.com
14. Marketing Cloud Einstein Journey Insights
Track the entire consumer journey
Gather online and offline interactions to stitch together a complete view of the consumer
Discover the optimal path to conversion
Use AI to analyze all journey permutations and automatically recommend the best channels, offers and sequences that lead to conversion
Learn how customers are actually interacting with your brand
GA
18. User Items
u-1 a, b
u-2 b, c, d
u-3 a, c, d, e
u-4 a, d, e
u-5 a, b, c
u-6 a, b, c, d
u-7 a
u-8 a, b, c
u-9 a, b, d
u-10 b, c, e
23. item support
a 8
b 7
c 6
d 5
e 3
item support
a, b 5
a, c 4
a, d 4
a, e 2
... ...
Min Support = 4
item support
a 8
b 7
c 6
d 5
e 3
item support
a, b 5
a, c 4
a, d 4
a, e 2
... ...
User Items
u-1 a, b
u-2 b, c, d
u-3 a, c, d, e
u-4 a, d, e
u-5 a, b, c
u-6 a, b, c, d
u-7 a
u-8 a, b, c
u-9 a, b, d
u-10 b, c, e
L1 Patterns
L2 Patterns
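The support tables above can be reproduced with a few lines of plain Scala. This is a minimal sketch on the toy data set from the slides, using brute-force enumeration rather than any particular mining algorithm. The object and method names (SupportCounting, support, frequentPatterns) are made up for illustration and do not come from the talk, and the threshold is assumed to be inclusive (support ≥ min support).

```scala
object SupportCounting {
  // Toy data set from the slides: user -> items
  val transactions: Map[String, Set[String]] = Map(
    "u-1" -> Set("a", "b"),
    "u-2" -> Set("b", "c", "d"),
    "u-3" -> Set("a", "c", "d", "e"),
    "u-4" -> Set("a", "d", "e"),
    "u-5" -> Set("a", "b", "c"),
    "u-6" -> Set("a", "b", "c", "d"),
    "u-7" -> Set("a"),
    "u-8" -> Set("a", "b", "c"),
    "u-9" -> Set("a", "b", "d"),
    "u-10" -> Set("b", "c", "e")
  )

  // Support of a pattern: number of users whose items contain every item of the pattern
  def support(pattern: Set[String]): Int =
    transactions.values.count(items => pattern.subsetOf(items))

  // All patterns of the given size that reach the minimum support.
  // Assumption: the threshold is inclusive (support >= minSupport).
  def frequentPatterns(size: Int, minSupport: Int): Map[Set[String], Int] = {
    val allItems = transactions.values.flatten.toSet
    allItems
      .subsets(size)
      .map(p => p -> support(p))
      .filter { case (_, s) => s >= minSupport }
      .toMap
  }

  val l1: Map[Set[String], Int] = frequentPatterns(1, minSupport = 4)
  val l2: Map[Set[String], Int] = frequentPatterns(2, minSupport = 4)
}
```

With a minimum support of 4, l1 keeps a, b, c and d, and l2 keeps (a, b), (a, c), (a, d) and (b, c). Enumerating every subset is only workable on a toy example, which is the scaling problem the rest of the deck addresses.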
26. Min Support = 4
item support
a 8
b 7
c 6
d 5
e 3
item support
a, b ?
a, c ?
a, d ?
a, e ?
... ...
Min Support = 6
item support
a 8
b 7
c 6
d 5
e 3
item support
a, b ?
a, c ?
a, d ?
a, e ?
... ...
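One reading of this slide is that the minimum support decides which pairs need to be counted at all: a pair can only be frequent if both of its items are frequent on their own. The sketch below illustrates that pruning step with the item supports from the slide. The names (CandidatePairs, candidatePairs) are hypothetical, and the inclusive threshold is an assumption.

```scala
object CandidatePairs {
  // Item supports from the slide's first table
  val itemSupport: Map[String, Int] =
    Map("a" -> 8, "b" -> 7, "c" -> 6, "d" -> 5, "e" -> 3)

  // Candidate pairs are built only from items that are frequent on their own.
  // Assumption: the threshold is inclusive (support >= minSupport).
  def candidatePairs(minSupport: Int): List[(String, String)] = {
    val frequentItems = itemSupport
      .collect { case (item, s) if s >= minSupport => item }
      .toList
      .sorted
    for {
      x <- frequentItems
      y <- frequentItems
      if x < y // emit each unordered pair once
    } yield (x, y)
  }
}
```

Raising the minimum support from 4 to 6 shrinks the candidate list from six pairs to three, so fewer supports have to be counted in the next pass.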
48. item support
a 8
b 7
c 6
d 5
e 3
root
  a: 8
    b: 5
      c: 3
        d: 1
      d: 1
    c: 1
      d: 1
        e: 1
    d: 1
      e: 1
  b: 2
    c: 2
      d: 1
      e: 1
User Items
u-1 a, b
u-2 b, c, d
u-3 a, c, d, e
u-4 a, d, e
u-5 a, b, c
u-6 a, b, c, d
u-7 a
u-8 a, b, c
u-9 a, b, d
u-10 b, c, e
Header Table
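The tree on this slide can be rebuilt from the user table by inserting each row in header-table order (most frequent item first), so that rows with a common prefix share nodes. The following is a minimal single-machine sketch of that construction. The class and method names (Node, build, countAt) are illustrative only and are not taken from the talk or from Spark.

```scala
import scala.collection.mutable

object FpTreeSketch {
  // A node keeps its item, a count, and its children keyed by item
  final class Node(val item: String) {
    var count: Int = 0
    val children: mutable.Map[String, Node] = mutable.LinkedHashMap.empty[String, Node]
  }

  // Toy data set from the slides (rows u-1 to u-10)
  val transactions: Seq[Seq[String]] = Seq(
    Seq("a", "b"), Seq("b", "c", "d"), Seq("a", "c", "d", "e"), Seq("a", "d", "e"),
    Seq("a", "b", "c"), Seq("a", "b", "c", "d"), Seq("a"), Seq("a", "b", "c"),
    Seq("a", "b", "d"), Seq("b", "c", "e")
  )

  // Header table order: items by descending support, ties broken by name
  val headerOrder: Seq[String] =
    transactions.flatten
      .groupBy(identity)
      .toSeq
      .sortBy { case (item, occurrences) => (-occurrences.size, item) }
      .map(_._1)

  def build(rows: Seq[Seq[String]]): Node = {
    val root = new Node("root")
    val rank = headerOrder.zipWithIndex.toMap
    for (row <- rows) {
      var node = root
      // Insert items in header-table order so that shared prefixes share nodes
      for (item <- row.sortBy(rank)) {
        node = node.children.getOrElseUpdate(item, new Node(item))
        node.count += 1
      }
    }
    root
  }

  // Count stored at the end of a path from the root, or 0 if the path does not exist
  def countAt(root: Node, path: String*): Int =
    path
      .foldLeft(Option(root))((node, item) => node.flatMap(_.children.get(item)))
      .map(_.count)
      .getOrElse(0)
}
```

For example, countAt(tree, "a", "b") on the tree built from the toy rows returns the count of the b node below a, which corresponds to the "b: 5" node on the slide.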
64. Header Table
item support
a 8
b 7
c 6
d 5
e 3
Items kept per group (prefix in brackets, then the group item)
a
[a], b
[a, b], c
[a, b, c], d
[a, b, c, d], e
user items
u-1 a, b
u-2 b, c, d
u-3 a, c, d, e
u-4 a, d, e
u-5 a, b, c
u-6 a, b, c, d
u-7 a
u-8 a, b, c
u-9 a, b, d
u-10 b, c, e
Group b (items in parentheses fall outside the group's prefix)
u-1 a, b
u-2 b, (c, d)
u-5 a, b, (c)
u-6 a, b, (c, d)
u-8 a, b, (c)
u-9 a, b, (d)
u-10 b, (c, e)
Group c
u-2 b, c, (d)
u-3 a, c, (d, e)
u-5 a, b, c
u-6 a, b, c, (d)
u-8 a, b, c
u-10 b, c, (e)
Group d
u-2 b, c, d
u-3 a, c, d, (e)
u-4 a, d, (e)
u-6 a, b, c, d
u-9 a, b, d
Group e
u-3 a, c, d, e
u-4 a, d, e
u-10 b, c, e
65. Number of rows
Number of items
u-1 a, b
u-2 b, (c, d)
u-5 a, b, (c)
u-6 a, b, (c, d)
u-8 a, b, (c)
u-9 a, b, (d)
u-10 b, (c, e)
u-2 b, c, (d)
u-3 a, c, (d, e)
u-5 a, b, c
u-6 a, b, c, (d)
u-8 a, b, c
u-10 b, c, (e)
u-2 b, c, d
u-3 a, c, d, (e)
u-4 a, d, (e)
u-6 a, b, c, d
u-9 a, b, d
u-3 a, c, d, e
u-4 a, d, e
u-10 b, c, e
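The per-item row groups shown above can be derived mechanically: order each row by the header table, then emit one copy per item, cut off right after that item. The sketch below does this with plain Scala collections. The names (GroupProjection, project, groups) are hypothetical, and on Spark the final grouping step would be a shuffle rather than a local groupBy.

```scala
object GroupProjection {
  // Header table order from the slides (descending support)
  val headerTable: Seq[String] = Seq("a", "b", "c", "d", "e")

  val rows: Seq[(String, Set[String])] = Seq(
    "u-1" -> Set("a", "b"),
    "u-2" -> Set("b", "c", "d"),
    "u-3" -> Set("a", "c", "d", "e"),
    "u-4" -> Set("a", "d", "e"),
    "u-5" -> Set("a", "b", "c"),
    "u-6" -> Set("a", "b", "c", "d"),
    "u-7" -> Set("a"),
    "u-8" -> Set("a", "b", "c"),
    "u-9" -> Set("a", "b", "d"),
    "u-10" -> Set("b", "c", "e")
  )

  // For each item in a row, emit (item, row prefix up to and including that item).
  // Items that come later in header-table order are dropped for that group.
  def project(items: Set[String]): Seq[(String, Seq[String])] = {
    val ordered = headerTable.filter(items.contains)
    ordered.indices.map(i => ordered(i) -> ordered.take(i + 1))
  }

  // Group the projected rows by their group item
  val groups: Map[String, Seq[(String, Seq[String])]] =
    rows
      .flatMap { case (user, items) =>
        project(items).map { case (k, prefix) => (k, (user, prefix)) }
      }
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.map(_._2) }
}
```

Each group can then be mined independently of the others, because every row it needs has already been copied into it.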
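67. Build FP-tree header table
    val headerTable = data
      .flatMap(_.items.map(_ -> 1L))   // one (item, 1) pair per occurrence
      .reduceByKey(_ + _)              // support per item
      .filter(isFrequent)
      .collect
      .sorted                          // assumes an Ordering that puts higher-support items first
Distribute rows to executors
Build FP-Trees on each node and mine for patterns
Collect patterns
    data
      .flatMap(filterDataBasedHeaderTable(headerTable))  // (group item, projected row) pairs
      .groupByKey                                        // distribute rows to executors
      .flatMap { case (k, rows) =>
        mineForPatternsFor(k, rows)                      // build the FP-tree for k and mine it
      }
      .collect // If necessary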
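To make the data flow of the Spark snippet concrete, here is a local sketch that mirrors the same steps on the toy data with plain Scala collections. The helper names are reused from the slide, but their bodies are guesses: the slide does not show them, groupBy stands in for reduceByKey and groupByKey, and mineForPatternsFor is a brute-force stand-in rather than an FP-tree miner. The Row type and the minimum support of 4 are also assumptions.

```scala
object LocalPipelineSketch {
  final case class Row(user: String, items: Set[String])

  val data: Seq[Row] = Seq(
    Row("u-1", Set("a", "b")), Row("u-2", Set("b", "c", "d")),
    Row("u-3", Set("a", "c", "d", "e")), Row("u-4", Set("a", "d", "e")),
    Row("u-5", Set("a", "b", "c")), Row("u-6", Set("a", "b", "c", "d")),
    Row("u-7", Set("a")), Row("u-8", Set("a", "b", "c")),
    Row("u-9", Set("a", "b", "d")), Row("u-10", Set("b", "c", "e"))
  )

  val minSupport = 4L // assumed threshold, inclusive
  def isFrequent(entry: (String, Long)): Boolean = entry._2 >= minSupport

  // Header table: frequent items ordered by descending support
  val headerTable: Seq[(String, Long)] = data
    .flatMap(_.items.map(_ -> 1L))
    .groupBy(_._1)
    .map { case (item, ones) => item -> ones.map(_._2).sum } // stands in for reduceByKey
    .filter(isFrequent)
    .toSeq
    .sortBy { case (item, support) => (-support, item) }

  // Guess at the slide's helper: keep frequent items only, emit one prefix per item
  def filterDataBasedHeaderTable(header: Seq[(String, Long)])(row: Row): Seq[(String, Seq[String])] = {
    val ordered = header.map(_._1).filter(row.items.contains)
    ordered.indices.map(i => ordered(i) -> ordered.take(i + 1))
  }

  // Brute-force stand-in for FP-tree mining: every returned pattern contains k
  def mineForPatternsFor(k: String, rows: Seq[Seq[String]]): Seq[(Set[String], Long)] =
    rows
      .flatMap(prefix => prefix.init.toSet.subsets().map(_ + k))
      .groupBy(identity)
      .map { case (pattern, hits) => pattern -> hits.size.toLong }
      .filter { case (_, support) => support >= minSupport }
      .toSeq

  val patterns: Map[Set[String], Long] = data
    .flatMap(row => filterDataBasedHeaderTable(headerTable)(row))
    .groupBy(_._1) // stands in for groupByKey
    .toSeq
    .flatMap { case (k, rows) => mineForPatternsFor(k, rows.map(_._2)) }
    .toMap
}
```

Because each pattern is counted only in the group of its last item in header-table order, no pattern is counted twice across groups, which is what lets the groups be mined in parallel.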
80. Pattern Frequency Test
CONDITION 1: Pattern Support ≥ Pattern Min Support
Pattern min support is the lowest category minsup among all items in the pattern
CONDITION 2 - Apriori Principle (Recursive)
If a pattern is frequent, all sub-patterns must be frequent
83. Condition 1: Pattern support ≥ lowest minsup among all items in the pattern
Pattern Frequency Test
Item Cat Minsup Condition 1
A Common 100k
B Common 100k
C Rare 1k
Pattern Support Minsup Condition 1
A B 80k 100k
A C 4k 1k
B C 3k 1k
A B C 2k 1k
84. Condition 2 - Apriori Principle
Pattern Frequency Test
Item Cat Minsup Condition 1
A Common 100k
B Common 100k
C Rare 1k
Pattern Support Minsup Condition 2
A B 80k 100k
A C 4k 1k
B C 3k 1k
A B C 2k 1k
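The two conditions can be checked on the example table with a short sketch. It uses the supports and category minsups from the slide. Everything else is an assumption made for illustration: the object and method names, the inclusive comparison, and the treatment of single items as frequent, since the slide gives no supports for them.

```scala
object PatternFrequencyTest {
  // Item categories and per-category minimum supports (values from the slide)
  val category: Map[String, String] =
    Map("A" -> "Common", "B" -> "Common", "C" -> "Rare")
  val categoryMinsup: Map[String, Long] =
    Map("Common" -> 100000L, "Rare" -> 1000L)

  // Pattern supports from the slide
  val support: Map[Set[String], Long] = Map(
    Set("A", "B") -> 80000L,
    Set("A", "C") -> 4000L,
    Set("B", "C") -> 3000L,
    Set("A", "B", "C") -> 2000L
  )

  // Pattern min support: lowest category minsup among the items of the pattern
  def patternMinsup(pattern: Set[String]): Long =
    pattern.map(item => categoryMinsup(category(item))).min

  // Condition 1: pattern support >= pattern min support
  def condition1(pattern: Set[String]): Boolean =
    support.get(pattern).exists(_ >= patternMinsup(pattern))

  // Assumption: single items count as frequent (no supports are given for them)
  def isFrequent(pattern: Set[String]): Boolean =
    pattern.size == 1 || (condition1(pattern) && condition2(pattern))

  // Condition 2 (recursive): every proper, non-empty sub-pattern must be frequent
  def condition2(pattern: Set[String]): Boolean =
    pattern
      .subsets()
      .filter(s => s.nonEmpty && s.size < pattern.size)
      .forall(isFrequent)
}
```

Under these assumptions, A B fails condition 1 (80k is below 100k), while A B C passes condition 1 against the Rare minsup of 1k but fails condition 2 because its sub-pattern A B is not frequent. That is why a separate second check is needed once minimum supports differ by category.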
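88. CONDITION 1
    // Per-category minimum supports are broadcast and applied while mining
    val catMinsupMap = sc.broadcast(computeCatMinSup(data))
    val fpTreeResults = data
      .flatMap(filterDataBasedHeaderTable(headerTable))
      .groupByKey
      .flatMap { case (k, rows) =>
        mineForPatternsFor(k, rows, catMinsupMap.value)
      }
CONDITION 2
    // Keep a pattern only if every non-empty sub-pattern is also in the result
    // (assumes patterns are Sets; collected into a Set for constant-time lookup)
    val patternsMap = sc.broadcast(fpTreeResults.keys.collect.toSet)
    fpTreeResults
      .filter { case (pattern, support) =>
        pattern.subsets.filter(_.nonEmpty).forall(patternsMap.value.contains)
      }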
89. Not the end of the story ...
https://w-dog.net/wallpaper/nature-night-star-tree-trees-stars-background-wallpaper-widescreen-full-screen-hd-wallpapers-fullscreen/id/308950/
90. Low Level Optimization
• Handled case where array length > Integer.MAX_VALUE
Result Set Compaction
• Remove redundant and noisy result sets
• Very efficient compaction - 95% without loss of information
Result Set Ranking
• Score patterns with multiple criteria
Items with Feature Set
• Not only which combinations work best, but what makes them work best
• Well received feature, direct feedback on strategy