Moving to ScyllaDB - A
Graph of Billions scale
Saurabh Verma, Principal Engineer
K S Sathish, VP Engineering
Presenters
K S Sathish, VP Engineering
Sathish heads engineering at Zeotap. Bangalore, India
Engineering strategy and technical architecture.
17+ years of experience
Has been building big data stacks for various verticals for the past 8 years
Saurabh Verma, Principal Engineer
Saurabh is a Principal Engineer at Zeotap.
Leads Data engineering team for Identity product suite
Architecture, design and engineering delivery of the Identity product.
Spent the last 6 years building big data systems.
■ Identity and Data platform - People Based data
■ Enables Brands to better understand their customers - 360º View
■ World’s Largest Independent People Graph
■ Fully Privacy/GDPR compliant
■ 80+ Data partners
■ Catering to Ad-Tech and MarTech
ZEOTAP
Identity Resolution
Use Cases
Identity Resolution
● Singular View of all Identities of a
Person
● Multiple Identity sources
● Different Identifiers
○ Web Cookies
○ Mobile
○ Partner Platform
○ CRM
Linkages between these identifiers
are more important than the
individual Identifiers
Identity Use cases
■ Match Test - Reference IDs JOIN with ID universe
■ Export - IDs retrieved based on Match and pushed out
■ Reporting
■ Compliance - Opt Out - Disconnect
■ 3rd party extension
■ Identity Quality
■ Short SLAs for Freshness of Data - meaning quick ingestion and
retrieval
Data Access
Old Implementation
[Diagram components: Partner 1, Partner 2, ... Partner n; Processing; Curated, Denormalized Data on S3; Processing; Client ID sets; Match Test; Exports; Reports (Redshift, Athena)]
Identity Tech - Reqs
■ Workload
● High Read and High Write - Ingestion and Retrieval can happen simultaneously
■ Write
● Ingestion - Streaming and Batch
● Deletion - Streaming and Batch
● Above 50K writes per second to meet SLAs
■ Housekeep
● TTL - based on conditions
Identity Tech- Reqs Cont...
■ Read
● Lookup Matching IDs
● Retrieve Linked IDs
● Retrieve Linked IDs based on conditions
■ ID Type - Android ID, website cookie
■ Property - Recency, quality, country
● Count
● Depth filters
Time to Change
[Diagram components: Partner 1, Partner 2, ... Partner n; Processing; ID Graph??; Processing; Client ID sets; Match Test; Exports; Reports]
Introducing GraphDB
Why Native Graph
Native Graph Database (JanusGraph)
■ Low latency neighbourhood traversal (OLTP) - Lookup & Retrieve
● Graph traversal modeled as iterative low-latency lookups in the Scylla K,V store
● Runtime proportional to the client data set & overlap percentage
■ Lower Data Ingestion SLAs
● Ingestion modeled as UPSERT operations
● Aligned with Streaming & Differential data ingestions
● Economically lower footprint to run in production
■ Linkages are first-class citizens
● Linkages have properties and traversals can leverage these properties
● On the fly path computation
■ Analytics Stats on the Graph, Clustering (OLAP)
● Bulk export and massive parallel processing available with GraphComputer integration with Spark, Hadoop, Giraph
And… Concise solutions to the right problems
■ Find the path between 2 user IDs
SQL
-- depth = 1
(select id1, idtype1, id2, idtype2 from idmvp
where id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and idtype1 = 'id_mid_13'
and id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and idtype2 = 'id_mid_4')
union
-- depth = 2: each hop joins the end of one link to the start of the next
(select t1.id1, t1.idtype1, t2.id2, t2.idtype2 from idmvp t1, idmvp t2
where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
and t1.id2 = t2.id1 and t1.idtype2 = t2.idtype1
and t2.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t2.idtype2 = 'id_mid_4')
union
-- depth = 3
(select t1.id1, t1.idtype1, t3.id2, t3.idtype2 from idmvp t1, idmvp t2, idmvp t3
where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
and t1.id2 = t2.id1 and t1.idtype2 = t2.idtype1
and t2.id2 = t3.id1 and t2.idtype2 = t3.idtype1
and t3.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t3.idtype2 = 'id_mid_4')
Gremlin Query
g.V()
.has('id','75d630a9-2d34-433e-b05f-2031a0342e42').has('type','id_mid_13')
.repeat(both().simplePath().timeLimit(40000))
.until(has('id','5c557df3-df47-4603-64bc-5a9a63f22245')
.has('type','id_mid_4'))
.limit(10).path()
.by('id')
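To make the traversal idea concrete outside a graph database, here is a minimal Python sketch of the same question: find up to N simple paths (no vertex repeated) between two IDs within a depth limit. The adjacency dictionary, the function name `find_paths` and the toy IDs are illustrative assumptions, not part of the talk; JanusGraph executes the Gremlin query above its own way.

```python
from collections import deque

def find_paths(adjacency, source, target, max_depth=3, limit=10):
    """Breadth-first search for simple paths from source to target.

    adjacency: dict mapping an ID to the set of IDs it is linked to
    (links are treated as undirected, like both() in the Gremlin query).
    """
    paths = []
    queue = deque([[source]])
    while queue and len(paths) < limit:
        path = queue.popleft()
        last = path[-1]
        if last == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached, do not expand further
        for neighbour in sorted(adjacency.get(last, ())):
            if neighbour not in path:  # simplePath(): never revisit a vertex
                queue.append(path + [neighbour])
    return paths

# Toy undirected graph (hypothetical IDs).
adjacency = {}
for u, v in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

print(find_paths(adjacency, "a", "d"))
```

Unlike the SQL version, the depth is a parameter rather than one more self-join per level.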
POCs and Findings
POC Hardware
                       Janus On Scylla   Aerospike        OrientDB         DGraph
Server Configuration   3 x i3.2xLarge    3 x i3.2xLarge   3 x i3.2xLarge   3 x r4.16xLarge
Client Configuration   3 x c5.18xLarge
Replication Factor     1
Store Benchmarking - 3B IDs, 1B edges
                                          JanusGraph with ScyllaDB   Aerospike   OrientDB   DGraph
Storage Model                             LPG                        Custom      LPG        RDF
Cost of ETL before Ingestion              Lower                      Lower       Lower      Higher
Production Setup Running Cost             Lower                      Higher      -          -
Production Setup Operational Management   Higher                     Lower       -          -
(based on our experience with AS in prod)
Criteria marked ✓ / ❌ per store:
■ Sharded, Distributed
■ Native Graph DB
■ Node / Edge Schema Change without downtime?
■ Benchmark dataset load completed?
■ Acceptable Query Performance? (two stores marked "-")
Marks as they appear on the slide (column alignment lost):
✓ ✓ ✓
✓✓✓
✓✓✓ ✓
✓ ✓
✓ ✓
❌
❌
❌ ❌
The Data Model
ID Graph Data Model
Vertices (label: id)
type      idtype         id              os          country   dpid      ip
online    adid_sha1      c3b2a1ed        'android'   'ESP'     {1}       [1.2.3.4]
online    adid           a711a4de        'android'   'ITA'     {2,3,4}
online    googlecookie   01e0ffa7        'android'   'ESP'     {1,2}
online    adid           412ce1f0        'android'   'ITA'     {2,4}     [1.2.3.4]
offline   email          abc@gmail.com   'ios'       'ESP'     {2,4}
Edges (linkedTo)
{dp1: t1, dp2: t2, quality: 0.30, linkType: 1}
{dp1: t1, dp2: t2, dp3: t3, dp4: t4, quality: 0.55, linkType: 3}
{dp1: t1, quality: 0.25, linkType: 3, linkSource: ip}
{dp2: t2, dp3: t3, dp4: t4, quality: 0.71, linkType: 9}
Expressiveness of Model
Same ID graph as on the previous slide, annotated with what the model can express:
■ Quality Filtered Links
■ ID Attribute Filtering
■ Recency Filtered Links
■ Extensible Data Model
■ Transitive Links
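A small Python sketch of how edge and vertex properties make these filters possible. It is illustrative only: the class and function names are made up, the per-partner timestamps are plain integers (the slide shows only placeholders t1..t4), and which vertices each edge connects is an assumption because the diagram did not survive.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Vertex:
    idtype: str
    id: str
    country: str

@dataclass
class Edge:
    src: Vertex
    dst: Vertex
    quality: float
    link_type: int
    seen: dict = field(default_factory=dict)  # data partner -> last-seen timestamp

def linked_ids(edges, start, min_quality=0.0, not_before=0, idtype=None, country=None):
    """Return IDs directly linked to start that pass the edge and vertex filters."""
    result = []
    for edge in edges:
        if start not in (edge.src, edge.dst):
            continue
        other = edge.dst if edge.src == start else edge.src
        if edge.quality < min_quality:                       # quality filtered links
            continue
        if max(edge.seen.values(), default=0) < not_before:  # recency filtered links
            continue
        if idtype is not None and other.idtype != idtype:    # ID attribute filtering
            continue
        if country is not None and other.country != country:
            continue
        result.append(other)
    return result

a = Vertex("adid_sha1", "c3b2a1ed", "ESP")
b = Vertex("adid", "a711a4de", "ITA")
c = Vertex("googlecookie", "01e0ffa7", "ESP")
edges = [
    Edge(a, b, quality=0.30, link_type=1, seen={"dp1": 100, "dp2": 200}),
    Edge(a, c, quality=0.55, link_type=3, seen={"dp1": 100, "dp2": 200, "dp3": 300, "dp4": 400}),
]
print([v.id for v in linked_ids(edges, a, min_quality=0.5)])
```

Adding a new property to `Edge` or `Vertex` does not change the traversal code, which is the "extensible data model" point in miniature.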
Streaming Ingestion
Streaming Ingestion
■ Workload
● 300 - 400 million data points per day
● Dedupe & Enrich
● Merge
● Final snapshot
■ Batch Process
● Spark Join
● Merge runtime - 4 to 6 hours
● Redshift load time - 2 to 3 hours
● Painful Failures
[Pipeline diagram: Stream & Batch, Dedup, Enrich, S3, Merge, Redshift]
Streaming Ingestion
■ And...
● Time - 2 to 3 hours
● Join Vs Lookup
● All Stream
● Failures - down by 83%
[Pipeline diagram: Stream & Batch, Dedup, Enrich, Streaming Graph Ingester, Vertex, Edge, KV Store]
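The "Join vs Lookup" change can be pictured as per-event upserts instead of a periodic batch merge. The sketch below is a hedged illustration in Python using an in-memory dictionary as a stand-in for the KV store; the event layout and the function names are assumptions, not the actual Streaming Graph Ingester.

```python
def upsert_vertex(store, key, partner):
    """Create the vertex if it is absent, otherwise merge the data partner into it."""
    vertex = store["vertices"].setdefault(key, {"dpid": set()})
    vertex["dpid"].add(partner)

def upsert_edge(store, a, b, partner, ts):
    """Create or update the undirected edge a-b, keeping the latest timestamp per partner."""
    edge = store["edges"].setdefault(tuple(sorted((a, b))), {})
    edge[partner] = max(ts, edge.get(partner, ts))

def ingest(store, events):
    """Each event is (id_a, id_b, data_partner, timestamp); no join against a snapshot is needed."""
    for a, b, partner, ts in events:
        upsert_vertex(store, a, partner)
        upsert_vertex(store, b, partner)
        upsert_edge(store, a, b, partner, ts)

store = {"vertices": {}, "edges": {}}
events = [
    (("adid", "a711a4de"), ("email", "abc@gmail.com"), "dp2", 10),
    (("email", "abc@gmail.com"), ("adid", "a711a4de"), "dp2", 25),  # same link, newer sighting
    (("adid", "a711a4de"), ("email", "abc@gmail.com"), "dp4", 7),
]
ingest(store, events)
print(len(store["vertices"]), len(store["edges"]))
```

Because every write is idempotent with respect to the key, replaying a failed micro-batch does not create duplicates.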
Findings
■ Consider Splitting Vertex Load from Edge Load
● Write behaviour is different
● Achieve overall better QPS
■ Benchmark Vertex load speed against CPU utilization
● Observed 5K TPS per server core
■ Consider Client Side Caching - Edge Load
● One lookup and One write with many duplicate IDs - Too many disk hits (Thrashing)
● 100% write - 4.8K TPS per core
● LeveledCompactionStrategy performed better than
SizeTieredCompactionStrategy
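A minimal sketch of the client-side caching idea for the edge load, assuming an LRU policy: repeated IDs are resolved from memory so only the first sighting costs a store lookup. The class name, the capacity and the `lookup` callable are hypothetical stand-ins, not the talk's implementation.

```python
from collections import OrderedDict

class VertexIdCache:
    """Small LRU cache mapping an external ID to its resolved vertex reference."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry

def resolve(key, cache, lookup, stats):
    """Return the vertex reference for key, hitting the store only on a cache miss."""
    cached = cache.get(key)
    if cached is not None:
        stats["hits"] += 1
        return cached
    stats["misses"] += 1
    value = lookup(key)  # stand-in for the read against the graph store
    cache.put(key, value)
    return value

cache = VertexIdCache(capacity=2)
stats = {"hits": 0, "misses": 0}
for key in ["a", "b", "a", "a", "c", "b"]:
    resolve(key, cache, lambda k: "vertex-" + k, stats)
print(stats)
```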
Traversal
Findings
■ Be Wary of Supernodes
● Supernodes with > 600 connected vertices cause a drastic QPS drop
● 40K QPS to 2K
■ Multi-Level Traversal - Depth limiting
● QPS decreases with depth, though not linearly
● Depth of 5 - 40K QPS to 12K
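One way to act on both findings is to cap the traversal depth and refuse to expand vertices whose degree exceeds a threshold. The Python sketch below assumes an adjacency dictionary and reuses the 600 figure from the slide as the default cut-off; treating it as a degree threshold, and the function name, are assumptions for illustration.

```python
def neighbourhood(adjacency, start, max_depth=2, supernode_degree=600):
    """Collect IDs reachable within max_depth hops, without expanding supernodes.

    Returns (reachable_ids, skipped_supernodes).
    """
    seen = {start}
    frontier = [start]
    skipped = []
    for _ in range(max_depth):
        next_frontier = []
        for vertex in frontier:
            neighbours = adjacency.get(vertex, ())
            if len(neighbours) > supernode_degree:
                skipped.append(vertex)  # reachable, but too expensive to expand
                continue
            for neighbour in neighbours:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    seen.discard(start)
    return seen, skipped

adjacency = {
    "a": ["b", "hub"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "hub": ["a"] + ["x%d" % i for i in range(700)],
}
reachable, skipped = neighbourhood(adjacency, "a", max_depth=2)
print(sorted(reachable), skipped)
```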
Findings
■ Play with Compaction strategies
● For our queries LeveledCompactionStrategy increased QPS by 2.5X
● With LeveledCompactionStrategy - concurrent clients better handled
● QPS stabilized at 30K
Know Your Query And Data
■ Segments are country based - filter based on Countries
■ Vertex Metadata not huge
Fetching individual properties from the Vertex
gremlin> g.V().has('id','1').has('type','email')
.values('id', 'type', 'USA').profile()
Step                                                                          Traversers   Time
JanusGraphStep _condition=(id=1 AND type = email)                             1            0.987
JanusGraphPropertiesStep _condition=((type[id] OR type[type] OR type[USA]))   4            1.337
Total                                                                                      2.325 s
Fetching entire property map during traversal
gremlin> g.V().has('id','1').has('type','email')
.valueMap().profile()
Step                                                                          Traversers   Time
JanusGraphStep _condition=(id=1 AND type = email)                             1            0.902
PropertyMapStep(value)                                                        1            0.175
Total                                                                                      1.077 s
Difference between the two: ~200%
Graph Analysis
ID Graph Quality
■ How trustworthy is our ID graph?
● What happens if the match rate is ridiculously high?
● Cluster of 63 million IDs
■ Connectivity analysis - heuristics
● Density
● Depth
● Clustering
● Distance
■ Can we arrive at Quality Score for edges?
Scoring V1
■ AD scoring - Edge Agreement (A) / Disagreement (D)
■ Recency Scoring - Augment A & D with Recency
■ Calculate Composite Score
■ Adjust composite score with IDs metadata
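The slides name the scoring steps but not the formulas, so the following Python sketch is purely illustrative. Every formula in it is an assumption: the agreement ratio, the exponential recency decay with a 30-day half-life, and the multiplicative rarity adjustment are placeholders chosen to show how the four steps could compose.

```python
def ad_score(agreements, disagreements):
    """Assumed AD score: share of observations that agree with the edge."""
    total = agreements + disagreements
    return agreements / total if total else 0.0

def recency_weight(age_days, half_life_days=30.0):
    """Assumed recency weight: an observation loses half its weight every half-life."""
    return 0.5 ** (age_days / half_life_days)

def composite_score(observations, half_life_days=30.0):
    """observations: list of (agrees, age_days). Recency-weighted A and D feed the AD ratio."""
    agree = sum(recency_weight(age, half_life_days) for ok, age in observations if ok)
    disagree = sum(recency_weight(age, half_life_days) for ok, age in observations if not ok)
    return ad_score(agree, disagree)

def final_score(composite, rarity):
    """Assumed adjustment: scale by an ID-metadata factor in [0, 1] such as event rarity."""
    return composite * rarity

observations = [(True, 0), (False, 30)]  # a fresh agreement and a month-old disagreement
score = final_score(composite_score(observations), rarity=0.9)
print(round(score, 3))
```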
Scoring - AD
Scoring V1
[Pipeline diagram: AD Score, Recency Score, Composite Score, Adjust (Event Rarity), Final Score]
Scoring - Representation
OLTP & OLAP Export
■ Interaction with JanusGraph backed by ScyllaDB
● For each input ID find the connected IDs in the ID Graph based on filters
● Modeled as Depth First Search implemented in Gremlin in Apache Spark
● Property and depth filtering done at the application layer
● The overlapping ID output is stored on deep storage, e.g. AWS S3
■ Across-Graph Traversals
● Separate compliance requirements per 3rd party Graph vendor
● Probabilistic vs Deterministic Graph vendors
● Each Graph Vendor represented as a separate keyspace in ScyllaDB
● The application layer enables runtime chaining and ordering for Across-Graph
traversals
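A rough sketch of runtime chaining across vendor graphs, assuming each graph (one keyspace per vendor) can be queried as "ID in, linked IDs out". The dictionary-backed graphs, the vendor names and the `allowed` set used to model compliance restrictions are all hypothetical.

```python
def chained_lookup(graphs, order, input_ids, allowed):
    """Traverse vendor graphs in the given order; IDs found in one graph seed the next.

    graphs:  dict of graph name -> {id: set of linked ids}
    order:   graph names in the order they should be consulted
    allowed: graph names this request is permitted to use
    """
    found = set(input_ids)
    for name in order:
        if name not in allowed:
            continue  # compliance: skip graphs this request may not touch
        graph = graphs[name]
        linked = set()
        for known_id in found:
            linked |= graph.get(known_id, set())
        found |= linked
    return found - set(input_ids)

graphs = {
    "deterministic": {"e1": {"c1"}},
    "probabilistic": {"c1": {"m1"}, "e1": {"m2"}},
}
both = {"deterministic", "probabilistic"}
print(sorted(chained_lookup(graphs, ["deterministic", "probabilistic"], ["e1"], both)))
print(sorted(chained_lookup(graphs, ["probabilistic", "deterministic"], ["e1"], both)))
```

The two calls return different results, which is why the ordering has to be a runtime decision in the application layer.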
OLTP Export - ID Overlap Finder Workflow
■ Export Native Graph DB to Deep Storage
■ Apache Spark based ID Graph Quality Scoring
OLAP Export - Storage & Analytics
[Diagram components: OLTP ID Graph; Periodic Backup; ScyllaDB SSTables on AWS S3; Periodic Refresh; OLAP ID Graph; SparkOLAP Export to AWS S3 (GryoOutputFormat); Native Graph on AWS S3; Periodic Static Reports; ID Graph Quality Data Science Pipeline; ID Graph Quality Score Update]
Prod Setup
Prod Setup
■ V1 release in Nov 2018
■ In production on AWS i3.4xLarge instances
■ These are 16 core, 122 GB RAM instances
■ ScyllaDB Version 3.0.6 provisioned via AWS Scylla AMIs
■ Using Scylla Grafana Dashboards for Production Metrics
■ Using LeveledCompactionStrategy in production
■ Stats (To be updated before final deck)
Take away
■ 2 primary Workflows
● ID overlap finder
● ID retriever
Consideration: on a 2-node Scylla cluster, peak client connections are around 3,000
ID overlap finder ~4X the numbers of ID retriever
Run Together
● Races and SLA degradation!
● High Failure Rates
Whatever The Tool...
Introduce - Prioritization & Throttling
Priority with Aging - Match Test gets priority but nothing starves
Throttle - Limit concurrent Jobs
And…
■ SLA from p95 of 10 hours to 2 hours
■ Job failure rate from 20% to 2% per day
All Higher Level Constructs in Control Plane
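Priority with aging plus throttling can be sketched as a small control-plane scheduler. This Python example is a toy under stated assumptions: the class name, the tick-based aging rule (waiting jobs gain one priority point per tick) and the numeric priorities are invented for illustration.

```python
import itertools

class AgingScheduler:
    """Starts the highest effective-priority jobs, never more than max_concurrent at once."""

    def __init__(self, max_concurrent, aging_per_tick=1):
        self.max_concurrent = max_concurrent
        self.aging_per_tick = aging_per_tick
        self.waiting = []
        self.running = set()
        self._seq = itertools.count()  # submission order, used to break ties

    def submit(self, name, priority):
        self.waiting.append({"name": name, "priority": priority,
                             "waited": 0, "seq": next(self._seq)})

    def finish(self, name):
        self.running.discard(name)

    def tick(self):
        """Start jobs while the throttle allows, then age everything still waiting."""
        started = []
        self.waiting.sort(key=lambda job: (
            -(job["priority"] + job["waited"] * self.aging_per_tick), job["seq"]))
        while self.waiting and len(self.running) < self.max_concurrent:
            job = self.waiting.pop(0)
            self.running.add(job["name"])
            started.append(job["name"])
        for job in self.waiting:
            job["waited"] += 1  # aging: waiting jobs slowly gain priority
        return started

scheduler = AgingScheduler(max_concurrent=1)
scheduler.submit("export-1", priority=1)
scheduler.submit("match-1", priority=3)
order = scheduler.tick()
for next_match in ["match-2", "match-3"]:
    scheduler.finish(order[-1])
    scheduler.submit(next_match, priority=3)
    order += scheduler.tick()
print(order)
```

Match Test jobs go first, but the export job's effective priority grows each tick, so it eventually runs even while new Match Test jobs keep arriving.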
Good Architecture is a Must!
Thank you Stay in touch
Any questions?
Sathish K S
sathish.ks@gmail
Not on Twitter!
Saurabh Verma
saurabhdec1988@gmail
@saurabhdec1988

CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 

Zeotap: Moving to ScyllaDB - A Graph of Billions Scale

  • 1. Moving to ScyllaDB - A Graph of Billions scale Saurabh Verma, Principal Engineer K S Sathish, VP Engineering
  • 2. Presenters ■ K S Sathish, VP Engineering: Sathish heads engineering at Zeotap, Bangalore, India. Engineering strategy and technical architecture. 17+ years of experience. Has been building big data stacks for various verticals for the past 8 years. ■ Saurabh Verma, Principal Engineer: Saurabh is a Principal Engineer at Zeotap. Leads the data engineering team for the Identity product suite. Architecture, design and engineering delivery of the Identity product. Spent the last 6 years building big data systems.
  • 3. ZEOTAP ■ Identity and Data platform - People Based data ■ Enables Brands to better understand their customers - 360º View ■ World’s Largest Independent People Graph ■ Full Privacy/GDPR compliant ■ 80+ Data partners ■ Catering to Ad-Tech and MarTech
  • 5. Identity Resolution ● Singular View of all Identities of a Person ● Multiple Identity sources ● Different Identifiers ○ Web Cookies ○ Mobile ○ Partner Platform ○ CRM Linkages between these identifiers are more important than the individual Identifiers
  • 6. Identity Use cases ■ Match Test - Reference IDs JOIN with ID universe ■ Export - IDs retrieved based on Match and pushed out ■ Reporting ■ Compliance - Opt Out - Disconnect ■ 3rd party extension ■ Identity Quality ■ Short SLAs for Freshness of Data - meaning quick ingestion and retrieval
  • 7. Data Access Old Implementation Reports Redshift Athena Partner 1 Partner 2 Partner n Processing Curated Denormalized Data S3 Processing Client ID sets Match Test Exports
  • 8. Identity Tech - Reqs ■ Workload ● High Read and High Write - Ingestion and Retrieval can happen simultaneously ■ Write ● Ingestion - Streaming and Batch ● Deletion - Streaming and Batch ● Above 50K writes per second to meet SLAs ■ Housekeep ● TTL - based on conditions
  • 9. Identity Tech- Reqs Cont... ■ Read ● Lookup Matching IDs ● Retrieve Linked IDs ● Retrieve Linked IDs based on conditions ■ ID Type - Android ID, website cookie ■ Property - Recency, quality, country ● Count ● Depth filters
  • 10. Time to Change Reports Processing Client ID sets Match Test Exports ID Graph?? Partner 1 Partner 2 Partner n Processing
  • 12. Why Native Graph - Native Graph Database (JanusGraph) ■ Low-latency neighbourhood traversal (OLTP) - Lookup & Retrieve ● Graph traversal modeled as iterative low-latency lookups in the Scylla K,V store ● Runtime proportional to the client data set & overlap percentage ■ Lower data ingestion SLAs ● Ingestion modeled as UPSERT operations ● Aligned with streaming & differential data ingestion ● Economically lower footprint to run in production ■ Linkages are first-class citizens ● Linkages have properties and traversals can leverage these properties ● On-the-fly path computation ■ Analytics - stats on the graph, clustering (OLAP) ● Bulk export and massively parallel processing available with GraphComputer integration with Spark, Hadoop, Giraph
  • 13. And… Concise solutions to the right problems ■ Find the path between 2 user IDs: SQL vs Gremlin query ● SQL: (select * from idmvp where id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and idtype1 = 'id_mid_13' and id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and idtype2 = 'id_mid_4') // depth = 1 union (select * from idmvp t1, idmvp t2 where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13' and t2.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t2.idtype2 = 'id_mid_4') // depth = 2 union (select * from idmvp t1, idmvp t2, idmvp t3 where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13' and t3.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t3.idtype2 = 'id_mid_4') // depth = 3 ● Gremlin: g.V().has('id','75d630a9-2d34-433e-b05f-2031a0342e42').has('type', 'id_mid_13').repeat(both().simplePath().timeLimit(40000)).until(has('id','5c557df3-df47-4603-64bc-5a9a63f22245').has('type','id_mid_4')).limit(10).path().by('id')
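To make the traversal on slide 13 concrete, here is a minimal Python sketch of what a `repeat(both().simplePath()) ... until(...) ... limit(n).path()` style query computes: simple paths (no repeated vertex) between two IDs over undirected links. It is an illustration only, not the JanusGraph implementation. The function name `find_paths`, the `max_depth` cut-off (used here in place of the slide's `timeLimit`) and the toy graph are all assumptions.

```python
from collections import deque


def find_paths(graph, start, goal, max_depth=3, limit=10):
    """Breadth-first search for simple paths from start to goal.

    graph maps a vertex to the set of vertices it is linked to (undirected).
    max_depth bounds the number of hops; limit bounds the number of paths.
    """
    paths = []
    queue = deque([[start]])
    while queue and len(paths) < limit:
        path = queue.popleft()
        node = path[-1]
        # Like repeat().until(), the goal test applies after at least one hop.
        if node == goal and len(path) > 1:
            paths.append(path)
            continue
        if len(path) - 1 >= max_depth:
            continue
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in path:  # simplePath(): never revisit a vertex
                queue.append(path + [neighbour])
    return paths


# Hypothetical toy graph: links are undirected, so both directions are stored.
graph = {}
for u, v in [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

print(find_paths(graph, "a", "c"))
```

Unlike the SQL version, the depth is a parameter rather than one extra self-join per hop.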
  • 15. POC Hardware ■ Server Configuration: Janus On Scylla 3 x i3.2xLarge | Aerospike 3 x i3.2xLarge | OrientDB 3 x i3.2xLarge | DGraph 3 x r4.16xLarge ■ Client Configuration: 3 x c5.18xLarge ■ Replication Factor 1
  • 16. Store Benchmarking - 3B IDs, 1B edges ■ Stores compared: JanusGraph with ScyllaDB | Aerospike | OrientDB | DGraph ■ Storage Model: LPG | Custom | LPG | RDF ■ Cost of ETL before Ingestion: Lower | Lower | Lower | Higher ■ Production Setup Running Cost: Lower | Higher | - | - ■ Production Setup Operational Management (based on our experience with AS in prod): Higher | Lower | - | - ■ Check-mark rows: Sharded, Distributed; Native Graph DB; Node / Edge Schema Change without downtime?; Benchmark dataset load completed?; Acceptable Query Performance? - - ■ Marks as extracted: ✓ ✓ ✓ ✓✓✓ ✓✓✓ ✓ ✓ ✓ ✓ ✓ ❌ ❌ ❌ ❌
  • 18. ID Graph Data Model ■ Vertex: label: id, type: online, idtype: adid_sha1, id: c3b2a1ed, os: 'android', country: 'ESP', dpid: {1}, ip: [1.2.3.4] ■ Edge: linkedTo: {dp1: t1, dp2: t2, quality: 0.30, linkType: 1} ■ Edge: linkedTo: {dp1: t1, dp2: t2, dp3: t3, dp4: t4, quality: 0.55, linkType: 3} ■ Vertex: label: id, type: online, idtype: adid, id: a711a4de, os: 'android', country: 'ITA', dpid: {2,3,4} ■ Vertex: label: id, type: online, idtype: googlecookie, id: 01e0ffa7, os: 'android', country: 'ESP', dpid: {1,2} ■ Vertex: label: id, type: online, idtype: adid, id: 412ce1f0, os: 'android', country: 'ITA', dpid: {2,4}, ip: [1.2.3.4] ■ Vertex: label: id, type: offline, idtype: email, id: abc@gmail.com, os: 'ios', country: 'ESP', dpid: {2,4} ■ Edge: linkedTo: {dp1: t1, quality: 0.25, linkType: 3, linkSource: ip} ■ Edge: linkedTo: {dp2: t2, dp3: t3, dp4: t4, quality: 0.71, linkType: 9}
  • 19. Expressiveness of Model (same ID graph as slide 18, annotated) ■ Quality Filtered Links ■ ID Attribute Filtering ■ Recency Filtered Links ■ Extensible Data Model ■ Transitive Links
  • 21. Streaming Ingestion ■ Workload ● 300 - 400 million data points per day ● Dedupe & Enrich ● Merge ● Final snapshot ■ Batch Process ● Spark Join ● Merge runtime - 4 to 6 hours ● Redshift load time - 2 to 3 hours ● Painful Failures ■ Pipeline (diagram): Stream & Batch, Dedup, Enrich, S3, Merge, Redshift
  • 22. Streaming Ingestion ■ And... ● Time - 2 to 3 hours ● Join Vs Lookup ● All Stream ● Failures - down by 83% ■ Pipeline (diagram): Stream & Batch, Dedup, Enrich, Streaming Graph Ingester (Vertex), Streaming Graph Ingester (Edge), KV Store
  • 23. Findings ■ Consider Splitting Vertex Load from Edge Load ● Write behaviour is different ● Achieve overall better QPS ■ Benchmark Vertex load speed against CPU utilization ● Observed 5K TPS per server core ■ Consider Client Side Caching - Edge Load ● One lookup and One write with many duplicate IDs - Too many disk hits (Thrashing) ● 100% write - 4.8K TPS per core ● LeveledCompactionStrategy performed better than SizeTieredCompactionStrategy
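The client-side caching finding on slide 23 can be sketched as follows: during edge load each edge needs its endpoint vertices looked up, and duplicate IDs otherwise hit the store again and again. This is a minimal LRU sketch under stated assumptions, not Zeotap's ingester; `VertexIdCache`, `fake_store_lookup` and the sample edge batch are hypothetical names and data.

```python
from collections import OrderedDict


class VertexIdCache:
    """Small LRU cache placed in front of a vertex lookup function."""

    def __init__(self, lookup, capacity=100_000):
        self._lookup = lookup        # callable that queries the store
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._lookup(key)           # only cache misses reach the store
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value


store_reads = []


def fake_store_lookup(key):
    """Stand-in for a vertex lookup against the graph store (assumption)."""
    store_reads.append(key)
    return f"vertex:{key[0]}:{key[1]}"


# Hypothetical edge batch with many repeated endpoint IDs.
edge_batch = [
    (("adid", "a1"), ("cookie", "c1")),
    (("adid", "a1"), ("email", "e1")),
    (("adid", "a1"), ("cookie", "c1")),
]

cache = VertexIdCache(fake_store_lookup, capacity=1000)
for src, dst in edge_batch:
    cache.get(src)
    cache.get(dst)
print(cache.hits, cache.misses, len(store_reads))
```

With this batch, half of the six endpoint lookups are served from the cache and never reach the store.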
  • 25. Findings ■ Be Wary of Supernodes ● Supernodes > 600 vertices: drastic QPS drop ● From 40K QPS to 2K ■ Multi-Level Traversal - Depth limiting ● QPS decreases, though not linearly ● Depth of 5 - from 40K QPS to 12K
  • 26. Findings ■ Play with Compaction strategies ● For our queries LeveledCompactionStrategy increased QPS by 2.5X ● With LeveledCompactionStrategy, concurrent clients are handled better ● QPS stabilized at 30K
  • 27. Know Your Query And Data ■ Segments are country based - filter based on countries ■ Vertex metadata not huge ■ Fetching individual properties from the vertex: gremlin> g.V().has('id','1').has('type','email').values('id', 'type', 'USA').profile() ● Step | Traversers | Time ● JanusGraphStep _condition=(id=1 AND type = email) | 1 | 0.987 ● JanusGraphPropertiesStep _condition=((type[id] OR type[type] OR type[USA])) | 4 | 1.337 ● Total: 2.325 s ■ Fetching entire property map during traversal: gremlin> g.V().has('id','1').has('type','email').valueMap().profile() ● Step | Traversers | Time ● JanusGraphStep _condition=(id=1 AND type = email) | 1 | 0.902 ● PropertyMapStep (value) | 1 | 0.175 ● Total: 1.077 s ■ ~200%
  • 29. ID Graph Quality ■ How trustworthy is our ID graph? ● What happens if the match rate is ridiculously high? ● Cluster of 63 million IDs ■ Connectivity analysis - heuristics ● Density ● Depth ● Clustering ● Distance ■ Can we arrive at a Quality Score for edges?
  • 30. Scoring V1 ■ AD scoring - Edge Agreement (A) / Disagreement (D) ■ Recency Scoring - Augment A & D with Recency ■ Calculate Composite Score ■ Adjust composite score with IDs metadata
  • 34. OLTP & OLAP Export
  • 35. OLTP Export - ID Overlap Finder Workflow ■ Interaction with JanusGraph backed by ScyllaDB ● For each input ID find the connected IDs in the ID Graph based on filters ● Modeled as Depth First Search implemented in Gremlin in Apache Spark ● Property and depth filtering done at the application layer ● The overlapping ID output is stored on deep storage, e.g. AWS S3 ■ Across-Graph Traversals ● Separate compliance requirements per 3rd party Graph vendor ● Probabilistic vs Deterministic Graph vendors ● Each Graph Vendor represented as a separate keyspace in ScyllaDB ● The application layer enables runtime chaining and ordering for Across-Graph traversals
  • 36. OLAP Export - Storage & Analytics ■ Export Native Graph DB to Deep Storage ■ Apache Spark based ID Graph Quality Scoring ■ Diagram labels: OLTP ID Graph; Periodic Backup; ScyllaDB SSTables on AWS S3; OLAP ID Graph; Periodic Refresh; SparkOLAP Export to AWS S3; GryoOutputFormat; Native Graph on AWS S3; Periodic Static Reports; ID Graph Quality Data Science Pipeline; ID Graph Quality Score Update
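The overlap-finder step on slide 35 (depth-first search with property and depth filtering at the application layer) might look roughly like the sketch below. It runs over an in-memory edge list instead of JanusGraph. The vertex IDs and quality values are borrowed from the data model slide, but which vertices each edge connects is an assumption, as are the names `linked_ids`, `neighbours`, `edge_ok` and `vertex_ok`.

```python
VERTICES = {
    "c3b2a1ed": {"idtype": "adid_sha1", "country": "ESP"},
    "a711a4de": {"idtype": "adid", "country": "ITA"},
    "01e0ffa7": {"idtype": "googlecookie", "country": "ESP"},
    "412ce1f0": {"idtype": "adid", "country": "ITA"},
    "abc@gmail.com": {"idtype": "email", "country": "ESP"},
}

# Hypothetical wiring: the slide does not say which IDs each edge links.
EDGES = [
    ("c3b2a1ed", "a711a4de", {"quality": 0.30}),
    ("c3b2a1ed", "01e0ffa7", {"quality": 0.55}),
    ("01e0ffa7", "abc@gmail.com", {"quality": 0.71}),
    ("a711a4de", "412ce1f0", {"quality": 0.25}),
]


def neighbours(node):
    """Yield (linked id, edge properties, vertex properties) for one ID."""
    for a, b, props in EDGES:
        if a == node:
            yield b, props, VERTICES[b]
        elif b == node:
            yield a, props, VERTICES[a]


def linked_ids(neighbours, start, max_depth, edge_ok=None, vertex_ok=None):
    """Depth-limited DFS; edge_ok prunes links, vertex_ok filters the output."""
    edge_ok = edge_ok or (lambda props: True)
    vertex_ok = vertex_ok or (lambda props: True)
    best_depth = {start: 0}   # shallowest depth at which each ID was reached
    matched = {}              # dict used as an insertion-ordered set
    stack = [(start, 0)]
    while stack:
        node, depth = stack.pop()
        if depth >= max_depth:
            continue
        for target, edge_props, vertex_props in neighbours(node):
            if not edge_ok(edge_props):
                continue
            next_depth = depth + 1
            # Revisit an ID only when it is reached by a shorter route.
            if best_depth.get(target, max_depth + 1) <= next_depth:
                continue
            best_depth[target] = next_depth
            if vertex_ok(vertex_props):
                matched[target] = True
            stack.append((target, next_depth))
    return list(matched)


print(linked_ids(
    neighbours, "c3b2a1ed", max_depth=2,
    edge_ok=lambda e: e["quality"] >= 0.5,
    vertex_ok=lambda v: v["country"] == "ESP",
))
```

One design choice to note: a vertex that fails `vertex_ok` is left out of the output but is still traversed through, while an edge that fails `edge_ok` is not followed at all. Whether the production workflow behaves this way is not stated in the slides.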
  • 38. Prod Setup ■ V1 release in Nov 2018 ■ In production on AWS i3.4xLarge instances ■ These are 16 core, 122 GB RAM instances ■ ScyllaDB Version 3.0.6 provisioned via AWS Scylla AMIs ■ Using Scylla Grafana Dashboards for Production Metrics ■ Using LeveledCompactionStrategy in production ■ Stats (To be updated before final deck)
  • 40. Whatever The Tool... ■ 2 primary workflows ● ID overlap finder ● ID retriever ■ Consideration: on a 2-node Scylla cluster, peak client connections are around 3,000 ■ ID overlap finder ~4X the numbers of ID retriever ■ Run together ● Races and SLA degradation! ● High failure rates
  • 41. Introduce - Prioritization & Throttling ■ Priority with Aging - Match Test gets priority but nothing starves ■ Throttle - limit concurrent jobs ■ And… ● SLA from p95 of 10 hours to 2 hours ● Job failure rate from 20% to 2% per day ■ All higher-level constructs in the control plane ■ Good Architecture is a Must!
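Priority with aging plus a concurrency throttle, as named on slide 41, can be sketched as a small scheduler. This is a minimal sketch under stated assumptions, not the actual control plane. The class `JobScheduler`, the linear aging formula (base priority plus `aging_rate` times waiting time) and the job names are all hypothetical.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: float      # higher runs first
    submitted_at: float  # submission time, in arbitrary time units
    seq: int             # submission order, used to break ties


class JobScheduler:
    def __init__(self, max_concurrent, aging_rate=1.0):
        self.max_concurrent = max_concurrent  # throttle: cap on running jobs
        self.aging_rate = aging_rate          # priority gained per unit of waiting
        self.pending = []
        self.running = set()
        self._seq = itertools.count()

    def submit(self, name, priority, now):
        self.pending.append(Job(name, priority, now, next(self._seq)))

    def _effective(self, job, now):
        # Aging: the longer a job waits, the higher its effective priority.
        return job.priority + self.aging_rate * (now - job.submitted_at)

    def next_job(self, now):
        """Start and return the best pending job, or None if throttled or idle."""
        if len(self.running) >= self.max_concurrent or not self.pending:
            return None
        job = max(self.pending, key=lambda j: (self._effective(j, now), -j.seq))
        self.pending.remove(job)
        self.running.add(job.name)
        return job.name

    def finish(self, name):
        self.running.discard(name)


scheduler = JobScheduler(max_concurrent=1, aging_rate=1.0)
scheduler.submit("export-1", priority=1, now=0)
scheduler.submit("match-test-1", priority=5, now=0)
first = scheduler.next_job(now=0)      # the match test wins on base priority
blocked = scheduler.next_job(now=0)    # throttled: one job is already running
scheduler.finish(first)
scheduler.submit("match-test-2", priority=5, now=10)
second = scheduler.next_job(now=10)    # the export has aged past the new match test
print(first, blocked, second)
```

The aging term is what keeps low-priority exports from starving when match tests keep arriving, and `max_concurrent` is the throttle that limits how many jobs reach the cluster at once.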
  • 42. Thank you Stay in touch Any questions? Sathish K S sathish.ks@gmail Not on Twitter! Saurabh Verma saurabhdec1988@gmail @saurabhdec1988