Identity Graph at Scale: Transforming Billions of Page Views to Unique Identity Profiles in Publishing
1. MEREDITH + CLIENT NAME | 1MEREDITH + CLIENT NAME | 1MEREDITH + GRAPHCONNECT 2020 | 1
Who we are:
We are Meredith Corporation, a publicly held media and marketing services company founded upon serving our customers and committed to building value for our shareholders.
We are on the pulse of pop culture, entertainment, food, fashion and lifestyle, news, business and finance, and sports.
Meredith
Brands
Our Brands are in nearly every grocery store, gas station and dentist
office across the U.S.
Digital Presence
Our multi-channel digital approach to media provides touchpoints
through varied devices from mobile, desktop, console and OTT.
Analytics
Our focus aims to provide our consumers with top content relevant
to their daily lives while providing directed audiences for advertising
that contribute directly to dollars spent by our consumers.
Programmatic Targeting
Our unparalleled database delivers custom content at the point-of-decision, leveraging first-party data and unique distribution resources to engage our audience.
R & D
Our Research and Development projects focus on cutting-edge technology to allow Meredith to stand alone as the premier publishing and content platform in the U.S.
BACKGROUND
Meredith Digital’s Mission
Serve well-defined audiences, deliver the messages of national and local advertisers, and extend our brand franchises and expertise to related markets. The new paradigm of publishing is personalization.
• ENTERTAINMENT + STYLE: 453M
• FOOD: 531M
• PARENTING: 18M
• HOME + LIFESTYLE: 148M
• TRAVEL + LUXURY: 64M
• HEALTH + WELLNESS: 31M
Source: comScore Multiplatform, December 2018
OUR TOUCHPOINTS
We connect with her across multiple touch points throughout her day
Channels: Voice, Audio, Mobile, Print, Video
Daily schedule:
• 5:30 AM – Wake up…
• 7:30 AM – Drop kids at school
• 5:00 PM – Pick up kids
• 8:00 PM – Kids to bed
• 10:30 PM – Lights Out
Consumer Action / Media Moment:
• Get some exercise / Morning yoga Flow from Shape (5:45 AM)
• Daily news / “Play news from People”
• School lunch supplies are low! / “Add fruit to shopping list”
• Social Media / Check Instagram & Facebook
• Daily commute / Parents Podcast
• What’s for Dinner? / QR scan “Asian Salmon Bowls” in Real Simple
• Self care / Daily meditation from Health
• Stock up / Place Shipt order
• Daily commute / EW’s Game of Thrones Podcast
• Make dinner / “Cook Salmon Bowls”
• Cleanup emergency! / “How to remove soy sauce?”
• Me time / IGTV Locals from T&L
Other times marked on the timeline: 6:05 AM, 6:15 AM, 6:45 AM, 7:45 AM, 11:15 AM, 12:20 PM, 3:15 PM, 5:30 PM, 6:00 PM, 7:30 PM, 8:30 PM
Measuring the Mutable
The fight against online tracking and analytics
Cookies are constantly changing.
Firewalls, anti-virus software, and diligent digital users all contribute to cookie loss.
Cross-device challenges.
Typical users interact with our brands across many devices, but cookies are confined to a single device.
Browsers.
Safari, Chrome, and Firefox all have new security standards to inhibit third-party cookies.
Intelligent Tracking Prevention 2.3
Models Made of Sand
Audience Propensity on Unstable Cookies
Models cost money.
Even best-in-class audience segmentation models suffer from cookie loss.
Activation is paramount.
Propensity models are only as good as their activation.
Advertising in the dark.
Cookie loss leads to fewer click-throughs.
Creating a Unified view of a Digital User
Confluence of Data
First–Party Data
Third–Party Data
• Various data stream providers and touchpoints
• A true Digital footprint requires all the sources
• No one stream has all of the information
• Cookie Recovery through Connections
• Creating Profiles with Longevity
• More touchpoints = Better Models
Spotting Snake Oil Sellers
Investing in data you can trust
• Identity Resolution Vendors are a dime a dozen but can cost a lot more
• How can you validate vendors with so much anonymous traffic?
• Graphs + First-Party data give the power to validate
• Look for linkages with too many locations, repetitive timestamps, and multiple emails to discredit faulty connections
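The last check above can be sketched in a few lines of Python. This is only an illustration: the record keys (`vendor_id`, `location`, `email`, `timestamp`) and all three thresholds are hypothetical and would need tuning against real first-party data.

```python
from collections import defaultdict

# Hypothetical thresholds; tune them against your own first-party data.
MAX_LOCATIONS = 5
MAX_EMAILS = 3
MAX_REPEATED_TIMESTAMP = 10


def flag_suspicious_ids(linkages):
    """Return {vendor_id: [reasons]} for vendor IDs whose linkages look implausible.

    `linkages` is an iterable of dicts with the (assumed) keys
    'vendor_id', 'location', 'email', and 'timestamp'.
    """
    locations = defaultdict(set)
    emails = defaultdict(set)
    timestamp_counts = defaultdict(lambda: defaultdict(int))

    for link in linkages:
        vid = link["vendor_id"]
        locations[vid].add(link["location"])
        if link.get("email"):
            emails[vid].add(link["email"])
        timestamp_counts[vid][link["timestamp"]] += 1

    flagged = {}
    for vid in locations:
        reasons = []
        if len(locations[vid]) > MAX_LOCATIONS:
            reasons.append("too many locations")
        if len(emails[vid]) > MAX_EMAILS:
            reasons.append("multiple emails")
        if max(timestamp_counts[vid].values()) > MAX_REPEATED_TIMESTAMP:
            reasons.append("repetitive timestamps")
        if reasons:
            flagged[vid] = reasons
    return flagged
```

A vendor whose IDs are flagged far more often than your own first-party linkages is a candidate for closer review.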
THE SOLUTION: IDENTITY GRAPH
A Timeline of Development
From Proof of Concept to Production
Data Size:
3 months of data from first-party-only sources – 100s of millions of cookies, 0.5 TB
RESULT:
• Determined Graph Model
• Rudimentary Import Process
• Using Pattern Matching – Cypher
Next Steps:
• Scale to 1 year
• Import/Export Process
• Include third-party data
RESULT:
• Discovered APOC and Graph Algos
• UnionFind Algorithm Bug
• APOC Parallel Import Procedures
• Seeding UnionFind Workaround
Data Size:
20+ months of data from first- and third-party sources – 4.4 TB
Custom Java Import Procedure
RESULT:
• UnionFind Algorithm with Seeding
• Custom Java Import/Export Procedure
Building on Foundations
Proof of Concept
THE SOLUTION
• Graph Model Development
• Importing Data with Neo4j-Admin Import
• MATCH (u:User)-[]->(m)<-[]-(u2:User)
WHERE u.uid = 'abc123' AND u <> u2
RETURN u, u2
Probability:
• 26 letters + 10 digits = 36 possible characters
• 36*36*36…*36 = 36^32 possible IDs
• Chance any two people get the same ID is 1/(6.3340287e+49) ≈ 0
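The arithmetic can be checked directly. The sketch below assumes IDs are drawn uniformly at random from the 36^32 space, which the slide does not state. It also adds a birthday-bound estimate for a collision anywhere in a large population, using the 346M cookie count reported in the outcomes slide as an illustrative size.

```python
import math

ALPHABET_SIZE = 26 + 10   # letters + digits, as on the slide
ID_LENGTH = 32            # implied by 36^32

id_space = ALPHABET_SIZE ** ID_LENGTH   # exact integer
pair_collision = 1 / id_space           # chance two given IDs match


def any_collision_probability(n_ids, space):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2 * space))."""
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


print(f"ID space: {id_space:.7e}")
print(f"Pairwise collision chance: {pair_collision:.3e}")
print(f"Any collision among 346M IDs: {any_collision_probability(346_000_000, id_space):.3e}")
```

Under these assumptions, even the chance of a collision anywhere among hundreds of millions of IDs stays far below 1e-30.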
Building on Foundations
Proof of Concept
THE SOLUTION
• Graph Model Development
⁃ Different relationships for cookie observation, URL visits, IP/Device type visits
• Import with Neo4j Admin import
⁃ Data from AWS Redshift using the UNLOAD command
⁃ CSV using | delimiter
• Basic Pattern Matching
⁃ MATCH (u:User)-[]-(m)-[]-(u2:User)-[]-(m2)-[]-(u3:User)
WHERE u <> u2 AND u <> u3 AND u2 <> u3
RETURN u, m, u2, m2, u3
LIMIT 100
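For the import step above, a minimal sketch of what pipe-delimited files for Neo4j Admin import might look like. The header fields (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`) follow the neo4j-admin import CSV header format. The `User`, `Cookie2`, and `OBSERVED_WITH` names are borrowed from the import query later in the deck. The function itself and its file layout are illustrative assumptions, not the pipeline actually used.

```python
import csv
import io


def write_import_files(pairs, users_out, cookies_out, rels_out, delimiter="|"):
    """Write node and relationship files with neo4j-admin import headers.

    `pairs` is an iterable of (user_uid, cookie_id) tuples, e.g. rows
    unloaded from a warehouse table. Labels and types are illustrative.
    """
    users = csv.writer(users_out, delimiter=delimiter, lineterminator="\n")
    cookies = csv.writer(cookies_out, delimiter=delimiter, lineterminator="\n")
    rels = csv.writer(rels_out, delimiter=delimiter, lineterminator="\n")

    users.writerow(["uid:ID(User)", ":LABEL"])
    cookies.writerow(["cookie2:ID(Cookie2)", ":LABEL"])
    rels.writerow([":START_ID(User)", ":END_ID(Cookie2)", ":TYPE"])

    seen_users, seen_cookies, seen_pairs = set(), set(), set()
    for uid, cookie in pairs:
        if uid not in seen_users:
            seen_users.add(uid)
            users.writerow([uid, "User"])
        if cookie not in seen_cookies:
            seen_cookies.add(cookie)
            cookies.writerow([cookie, "Cookie2"])
        if (uid, cookie) not in seen_pairs:  # one relationship per pair
            seen_pairs.add((uid, cookie))
            rels.writerow([uid, cookie, "OBSERVED_WITH"])


# Usage with in-memory buffers; real runs would write to files and then call
# neo4j-admin import with --nodes, --relationships and --delimiter
# (exact flag syntax varies by Neo4j version).
users_buf, cookies_buf, rels_buf = io.StringIO(), io.StringIO(), io.StringIO()
write_import_files([("a", "x"), ("b", "x"), ("a", "x")], users_buf, cookies_buf, rels_buf)
print(rels_buf.getvalue())
```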
Problems:
• Graph was static – CSV import was too slow
• Other streams of cookie data
• IP gave conflicting connections
• URL solved recommendation, not identity
Building on Foundations
Proof of Concept
THE SOLUTION
Next Steps:
• Scale to 1+ year
• Improve Import Procedure
• Develop Daily Import/Export Procedure
• Include other streams of cookie data
• Prevent multiple relationships between cookies
Moving Toward Production
Learning From Mistakes
• Scaling to 6 months of data with 1 stream – 2 TB database
• Optimizing Neo4j Admin Import
• GraphConnect 2018
• Using APOC Periodic Iterate for Import and Export Procedures
• Found that some Identity Partners showed Hyper Connections
Moving Toward Production
Learning From Mistakes
• GraphConnect 2018
⁃ Learn about APOC & Algos
• Pattern matching is slow
⁃ APOC Subgraph Procedure
• Cypher is not parallelized
⁃ APOC Periodic Iterate is your friend
• Utilizing Graph Algorithms
Moving Toward Production
Learning From Mistakes
Union Find
• Calculate and enumerate all disjoint subgraphs within a graph
• For every maximal connected subgraph in a database, provide a unique integer to represent that subgraph
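As a rough illustration of what Union Find computes, here is a small in-memory version in Python. It is a textbook disjoint-set structure, not the Neo4j graph algorithm implementation, and the node names in the usage are made up.

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]


def component_ids(edges):
    """Map every node to an integer identifying its connected component."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    ids = {}
    result = {}
    for node in list(uf.parent):
        root = uf.find(node)
        result[node] = ids.setdefault(root, len(ids))
    return result


# Two users sharing a cookie land in one component; a third user does not.
print(component_ids([("u1", "c1"), ("u2", "c1"), ("u3", "c2")]))
```

Each component integer then plays the role of a profile ID: every cookie in the same component is treated as the same person.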
Moving Toward Production
Learning From Mistakes
Problems
• Trouble scaling to more than 2 billion – the “huge” parameter was not working
• No seeding available – every subgraph ID was shuffled on each run
Moving Toward Production
Learning From Mistakes
Solutions
• Only use the data you need – trim the fat
• Use a dummy property and run APOC to check when the seed changed
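One possible way to keep subgraph IDs stable between runs is sketched below in Python. It is not the workaround described above, only an illustration of the idea behind seeding: reuse yesterday's ID where one exists and report only the nodes whose ID changed. It assumes edges are only ever added, so components merge but never split. The function name and arguments are hypothetical.

```python
def assign_stable_ids(components, previous_ids, next_id):
    """Give each component a stable ID, reusing seeds from a previous run.

    components:   iterable of sets of node keys (today's connected components)
    previous_ids: dict node -> component ID from the previous run (the seeds)
    next_id:      first unused integer ID
    Returns (node -> ID dict, set of nodes whose ID is new or changed,
    next unused ID).
    """
    assigned = {}
    changed = set()
    for members in components:
        seeds = {previous_ids[n] for n in members if n in previous_ids}
        if seeds:
            cid = min(seeds)   # merged components keep the smallest old ID
        else:
            cid = next_id      # brand-new component gets a fresh ID
            next_id += 1
        for node in members:
            assigned[node] = cid
            if previous_ids.get(node) != cid:
                changed.add(node)
    return assigned, changed, next_id


previous = {"u1": 0, "c1": 0, "u2": 1, "c2": 1}
today = [{"u1", "c1", "u2", "c2"}, {"u3", "c3"}]
assigned, changed, next_id = assign_stable_ids(today, previous, next_id=2)
print(sorted(changed))
```

Only the nodes in `changed` would need to be written back, which keeps daily writes proportional to what actually moved.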
Moving Toward Production
Learning From Mistakes
Parallel Imports
// The driving statement streams rows over JDBC; the action statement runs per row in parallel batches.
CALL apoc.periodic.iterate(
'CALL apoc.load.jdbc($credentials, "select distinct cookie1, cookie2, min(timestamp) as timestamp from cookie_table where cookie2 is not null group by cookie1, cookie2") YIELD row',
'WITH row, datetime(replace(toString(row.timestamp), " ", "T")) AS timestamp
MERGE (u:User {uid: trim(row.cookie1)}) SET u:IsNew, u.last_obs = timestamp
FOREACH (n IN (CASE WHEN u.first_obs IS NULL THEN [1] ELSE [] END) | SET u.first_obs = timestamp)
WITH row, u, timestamp
OPTIONAL MATCH (u)-[:OBSERVED_WITH]->(x) WITH u, row, timestamp, collect(DISTINCT x) AS seen
OPTIONAL MATCH (u)-[:OBSERVED_BAD]->(y) WITH u, row, timestamp, seen, collect(DISTINCT y) AS seen_bad
MERGE (c:Cookie2 {cookie2: trim(row.cookie2)}) SET c.last_obs = timestamp, c:IsNew
FOREACH (n IN (CASE WHEN c.first_obs IS NULL THEN [1] ELSE [] END) | SET c.first_obs = timestamp)
FOREACH (n IN (CASE WHEN NOT c IN seen AND NOT c IN seen_bad THEN [1] ELSE [] END) | CREATE (u)-[r1:OBSERVED_WITH]->(c) SET r1.first_obs = timestamp)',
{batchSize: 100, iterateList: false, parallel: true, params: {credentials: $credentials}});
Moving Toward Production
Learning From Mistakes
Parallel Imports – Code Breakdown
CALL apoc.periodic.iterate('DRIVING STATEMENT', 'ACTION STATEMENT',
{batchSize:100, iterateList:false, parallel:true, params:{credentials:$credentials}});
DRIVING STATEMENT = CALL apoc.load.jdbc($credentials, "SQL STATEMENT") YIELD row
ACTION STATEMENT = lots of MERGEs and conditional lookups to prevent creating multiple relationships
Note: Pass parameters through the params entry of the config map
Issues – If the same nodes are being written/merged across multiple threads, that batch will fail. Mitigation attempted by changing the number of threads and the batch size.
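To get a feel for the multi-thread issue, the Python sketch below counts which node keys would show up in more than one batch for a given batch size. It is only a diagnostic illustration on (cookie1, cookie2) pairs. The function names are hypothetical, and sorting rows by the shared key is an assumption about what might reduce overlap, not something the deck prescribes.

```python
from collections import Counter
from itertools import islice


def batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def contended_keys(rows, batch_size):
    """Return node keys that appear in more than one batch.

    rows are (cookie1, cookie2) pairs; a key shared by two batches is a
    node that two parallel workers could try to MERGE at the same time.
    """
    batches_per_key = Counter()
    for chunk in batches(rows, batch_size):
        keys = set()
        for cookie1, cookie2 in chunk:
            keys.add(("User", cookie1))
            keys.add(("Cookie2", cookie2))
        batches_per_key.update(keys)  # count each key once per batch
    return {key for key, n in batches_per_key.items() if n > 1}


rows = [("a", "x"), ("b", "y"), ("c", "x"), ("d", "z")]
print(contended_keys(rows, 2))                              # cookie "x" spans two batches
print(contended_keys(sorted(rows, key=lambda r: r[1]), 2))  # grouped by cookie2
```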
Moving Toward Production
Learning From Mistakes
APOC Subgraph All
Previously, pattern matching meant expensive lookups. This query reaches only 6 hops; imagine 10 hops out:
MATCH (u:User) WHERE u.uid = '1234'
WITH u MATCH (u)-[]-(m)-[]-(u2:User)-[]-(m2)-[]-(u3:User)-[]-(m3)-[]-(u4:User)
WHERE u <> u2 AND u <> u3 AND u <> u4 AND u2 <> u3 AND u2 <> u4 AND u3 <> u4
RETURN u, m, u2, m2, u3, m3, u4
LIMIT 100
APOC is better, faster, and easier:
MATCH (user:User) WHERE user.uid = '1234'
CALL apoc.path.subgraphAll(user, {maxLevel:10, filterStartNode:true, labelFilter:'>User'}) YIELD nodes
UNWIND nodes AS n RETURN n
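A rough Python analogue of the bounded expansion shows why it scales better than the fixed pattern. Each node is visited once, instead of every path being enumerated. This is a plain breadth-first search over an in-memory adjacency dict, an approximation of the idea rather than APOC's implementation, and the graph in the usage is made up.

```python
from collections import deque


def users_within(graph, labels, start, max_level):
    """Breadth-first expansion from start, up to max_level hops.

    graph:  dict node -> iterable of neighbouring nodes (undirected adjacency)
    labels: dict node -> label string, e.g. "User" or "Cookie2"
    Returns the set of reachable User nodes other than start.
    """
    visited = {start}
    queue = deque([(start, 0)])
    found = set()
    while queue:
        node, depth = queue.popleft()
        if depth == max_level:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour in visited:
                continue
            visited.add(neighbour)
            if labels.get(neighbour) == "User":
                found.add(neighbour)
            queue.append((neighbour, depth + 1))
    return found


graph = {
    "u1": ["c1"], "c1": ["u1", "u2"],
    "u2": ["c1", "c2"], "c2": ["u2", "u3"], "u3": ["c2"],
}
labels = {"u1": "User", "u2": "User", "u3": "User", "c1": "Cookie2", "c2": "Cookie2"}
print(users_within(graph, labels, "u1", 10))
```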
Moving Toward Production
Learning From Mistakes
Next Steps:
• Scale to 20+ Months
• Implement UnionFind + Seeding as a single Algo
• Daily importing and exporting
• Optimize Heap usage
Creating a Unified view of a Digital User
Reaching Production Scale
• Initial Runtime 28+ Hours for daily imports
• Optimized UnionFind – Only write on Changes
• Rewrote Preprocessing steps into Custom Java Procedures
• Dropped runtime down to 14 hrs
• Improved Heap usage
• 20+ Months of data – 4+ TB database
• Custom Java Procedure Import/Exports
• UnionFind with Seeding
• Custom Java Procedure preprocessing
• Variable Heap and Page Cache
Problem:
• Constantly fighting growing Heap demand
• 280 GB Heap -> 300 GB -> 330 GB
• More Heap means less Page Cache
Solution: Variable Heap and Page Cache
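The slide does not say how the heap and page cache were varied. One reading is that Neo4j is restarted with a different memory split for each phase of the daily job. The Python sketch below renders such settings under that assumption. The 512 GB machine size, the 16 GB OS reserve, and the 64 GB import heap are hypothetical, while the 330 GB heap comes from the slide. The setting names are those used by Neo4j 3.x/4.x and were renamed in later versions.

```python
def memory_settings(total_gb, heap_gb, os_reserve_gb=16):
    """Render neo4j.conf memory lines for one phase of the daily job.

    Whatever is not given to the heap or reserved for the OS goes to
    the page cache.
    """
    page_cache_gb = total_gb - heap_gb - os_reserve_gb
    if page_cache_gb <= 0:
        raise ValueError("heap plus OS reserve exceeds total memory")
    return [
        f"dbms.memory.heap.initial_size={heap_gb}g",
        f"dbms.memory.heap.max_size={heap_gb}g",
        f"dbms.memory.pagecache.size={page_cache_gb}g",
    ]


# Hypothetical phases: import favours page cache, Union Find favours heap.
TOTAL_GB = 512
for phase, heap_gb in {"import": 64, "union_find": 330}.items():
    print(f"# {phase}")
    print("\n".join(memory_settings(TOTAL_GB, heap_gb)))
```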
• Dropped runtime down to 12 hrs
• Improved Heap usage
• 14.4 Billion Nodes
• 67.6 Billion Properties
• 20.6 Billion Relationships
• 20 Months of data
OUTCOMES
Illuminating The Anonymous
Measuring
Understanding Customers over time
Improved Targeting for more relevant content and advertising campaigns.
• 346M Cookies to 163M Profiles
• 25% of Traffic has a Profile
• 241.6 Days on average per Profile
• From 3.9 Visits Average to 23.8 Visits Average
• 612% Increase in Visits per Profile
OUTCOMES
Salient Takeaways To Scale
Learning from others’ experiences
• Identify what data matters: evaluate what data is needed to answer the question
• APOC and Algos are your friend: APOC Periodic Iterate and Graph Algorithms use multiple cores
• Simplify your problem: explore different graph models and determine which is the simplest
• Custom Java Procedures scale: custom Java procedures can empower your project
• Neo4j Community and Engineers: when issues arise, seek help from professionals and active community members
Thank You
Contact: Benjamin.Squire@Meredith.com
LinkedIn: linkedin.com/in/benjamin-squire/