Cross Region Data Replication Design Considerations
Itai Friendinger itai@forter.com
Our financial institutions remain strong, and the American
economy will be open for business as well.
2/40
Decision as a Service Example
[Diagram: TX, Fraud Decision (100ms), TX Decision]
if (isFraud(tx.address, tx.payment)) {
    return DECLINE;
} else {
    return APPROVE;
}
3/40
Decision as a Service Example
[Diagram: Change Account Address and Change Account Payment events reach the Event Processor (1000ms), which writes partial updates to the Unified People Store; Fraud Decision (100ms) reads the store for each TX and returns the TX Decision]
4/40
Design "It'll be fine"
● No Cross Region Replication
[Diagram, single region: TX, Fraud Decision, TX Decision; raw event, Event Processor, People Store]
5/40
Design "Leave it to me"
● Cron Sync every 3 hours
● Replication != Reconciliation
● Replication != Backup
[Diagram, two regions, each with TX, Fraud Decision, TX Decision and raw event, Event Processor, People Store; Cron Sync between the two People Stores]
6/40
Design "Yalla, fine"
● Read-Only RDS Replica
● Proxying data into a single Data Center
● Requires quarterly failover drills
● Cannot withstand a real disaster for long
[Diagram, two regions. Primary: TX, Fraud Decision, TX Decision; raw event, Event Processor, People Store. Secondary: TX, Fraud Decision, Decision; raw event, Event Forwarder, People Store. Links between regions: RDS Replication, Forwarding]
7/40
Design "Two for the price of one"
● CloudEndure DRaaS
● Point In Time Recovery
● Requires quarterly failover drills
● For existing apps (Enterprises)
[Diagram: one active region with TX, Fraud Decision, TX Decision and raw event, Event Processor, People Store; Block Device Replication to a second People Store]
8/40
Design "Just you wait"
● Google Cloud Spanner Is Here, Geo-Distributed Transactions Are Coming
● For green-field apps (Startups)
[Diagram, two regions, each with TX, Fraud Decision, TX Decision and raw event, Event Processor, People Store; Transactions span the two People Stores]
9/40
Design "Trust me"
● Out-of-the-box, real-time, bi-directional, data-center-aware replication
● Write conflict resolution
[Diagram, two regions, each with TX, Fraud Decision, TX Decision and raw event, Event Processor, People Store; 2-way Replication between the two People Stores]
10/40
Design "Her sister"
● Replication of Raw Events
● State Divergence
[Diagram, two regions, each with TX, Fraud Decision, TX Decision and raw event, Event Processor, People Store; 2-way Replication of raw events between the regions]
11/40
Read Consistency Guarantees
Loosely based on "Replicated Data Consistency Explained Through Baseball" by Doug Terry
● Strong ⇒ 2:2
○ See all previous writes
● Read own Writes
○ See all writes performed by the reader
● Monotonic ⇒ 2:1
○ See all writes from the beginning until N seconds ago
● Eventual ⇒ 1:2
○ See the writes in a different order (some still missing)

time | partial update | state
15m  | Hapoel = 1     | 1:0
32m  | Maccabi = 1    | 1:1
89m  | Hapoel = 2     | 2:1
91m  | Maccabi = 2    | 2:2
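The scoreboard can be replayed in a few lines. This is a minimal sketch, assuming a hypothetical `ScoreReplay` helper that is not part of the talk: a strong read applies all four writes, the slide's "Monotonic" read applies an ordered but stale prefix, and an eventual read applies a subset with one write still missing.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class ScoreReplay {
    // One partial update: set a single team's goal count.
    static final class Write {
        final String team;
        final int goals;
        Write(String team, int goals) { this.team = team; this.goals = goals; }
    }

    // The four partial updates from the table, in commit order.
    static final List<Write> LOG = Arrays.asList(
        new Write("Hapoel", 1),   // 15m
        new Write("Maccabi", 1),  // 32m
        new Write("Hapoel", 2),   // 89m
        new Write("Maccabi", 2)); // 91m

    // Applies only the writes at the given log positions, renders "Hapoel:Maccabi".
    static String read(int... visible) {
        int hapoel = 0;
        int maccabi = 0;
        for (int i : visible) {
            Write w = LOG.get(i);
            if (w.team.equals("Hapoel")) {
                hapoel = w.goals;
            } else {
                maccabi = w.goals;
            }
        }
        return hapoel + ":" + maccabi;
    }

    public static void main(String[] args) {
        System.out.println("strong:   " + read(0, 1, 2, 3)); // 2:2
        System.out.println("prefix:   " + read(0, 1, 2));    // 2:1
        System.out.println("eventual: " + read(0, 1, 3));    // 1:2
    }
}
```

Note that 1:2 is a score that never appeared on the real scoreboard, which is what makes eventual reads hard to reason about.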
14/40
Hello Couchbase
read-mutate-write of entire state
Client reaches cluster’s primary node
Conflict Prevention CAS
Optimizations: subdocument API
[Diagram: one cluster with nodes in us-west-2b and us-west-2c; Event Processor (read/mutate/write) and TX Decision (read) each see Strong consistency]
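The read-mutate-write cycle with CAS conflict prevention can be sketched with an in-memory stand-in. The `CasStore` class below is an assumption for illustration and is not the Couchbase SDK: a write succeeds only if the document version the writer read is still current, otherwise the writer re-reads and retries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// In-memory stand-in for a document store with CAS; not a real client library.
final class CasStore {
    static final class Versioned {
        final long cas;
        final Map<String, String> doc;
        Versioned(long cas, Map<String, String> doc) { this.cas = cas; this.doc = doc; }
    }

    private final Map<String, Versioned> data = new HashMap<>();

    // Missing documents are reported as an empty document with CAS 0.
    synchronized Versioned get(String key) {
        Versioned v = data.get(key);
        return v != null ? v : new Versioned(0L, new HashMap<String, String>());
    }

    // Writes only if the caller still holds the current CAS value.
    synchronized boolean replace(String key, long expectedCas, Map<String, String> doc) {
        long current = data.containsKey(key) ? data.get(key).cas : 0L;
        if (current != expectedCas) {
            return false; // someone else wrote in between
        }
        data.put(key, new Versioned(current + 1, new HashMap<>(doc)));
        return true;
    }

    // Read-mutate-write of the entire document, retried on CAS mismatch.
    void update(String key, UnaryOperator<Map<String, String>> mutation) {
        while (true) {
            Versioned seen = get(key);
            Map<String, String> next = mutation.apply(new HashMap<>(seen.doc));
            if (replace(key, seen.cas, next)) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        CasStore store = new CasStore();
        store.update("person1", doc -> { doc.put("address", "1 Main St"); return doc; });
        store.update("person1", doc -> { doc.put("payment", "visa"); return doc; });
        Versioned v = store.get("person1");
        System.out.println(v.cas + " " + v.doc);
        // A writer holding the stale CAS value 1 is rejected instead of overwriting.
        System.out.println(store.replace("person1", 1L, new HashMap<String, String>()));
    }
}
```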
16/40
Hello Couchbase
XDCR replicates entire state between clusters
Optimizations: dedup by key, metadata first
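[Diagram: XDCR links the us-west-2 cluster (nodes us-west-2b, us-west-2c) to the us-east-1 cluster (nodes us-east-1c, us-east-1b); Event Processor (read/mutate/write) and TX Decision (read) in us-west-2 see Strong consistency, TX Decision (read) in us-east-1 sees Monotonic]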
17/40
Couchbase Last Write Wins
Conflict Resolution - LWW erases losing side
Remember: NTP, no “sudo date”
Document Version = 48bit timestamp (Conflict Resolution) + 16bit CAS (Conflict Prevention)
[Diagram, Design "Trust me": XDCR links clusters in us-west-2 (nodes us-west-2b, us-west-2c) and us-east-1 (nodes us-east-1c, us-east-1b); each region runs an Event Processor (read/mutate/write) with read-own-writes and a TX Decision (read) with Monotonic reads]
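Why last-write-wins "erases the losing side", and why clock discipline matters, shows up in a small sketch. The `LwwDocument` class and its tie-break on the document body are assumptions made for illustration, not Couchbase's actual resolution code.

```java
// Whole-document last-write-wins; a sketch, not a real replication engine.
final class LwwDocument {
    final long timestamp; // assumed to come from NTP-synced wall clocks
    final String body;

    LwwDocument(long timestamp, String body) {
        this.timestamp = timestamp;
        this.body = body;
    }

    // Keeps the document with the higher timestamp; the other one is discarded whole.
    static LwwDocument resolve(LwwDocument a, LwwDocument b) {
        if (a.timestamp != b.timestamp) {
            return a.timestamp > b.timestamp ? a : b;
        }
        // Assumed deterministic tie-break so both clusters pick the same winner.
        return a.body.compareTo(b.body) >= 0 ? a : b;
    }

    public static void main(String[] args) {
        LwwDocument west = new LwwDocument(1000L, "{address: 1 Main St}");
        LwwDocument east = new LwwDocument(1001L, "{payment: visa}");
        // The address change from us-west-2 is lost entirely.
        System.out.println(resolve(west, east).body);

        // A node whose clock was set far ahead wins every later conflict.
        LwwDocument skewed = new LwwDocument(9_999_999L, "{address: stale}");
        System.out.println(resolve(skewed, east).body);
    }
}
```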
19/40
Hello Cassandra
Client reaches closest node, blocks until LOCAL_QUORUM
No Conflict Prevention ⇒ Use partial updates or inserts
[Diagram: nodes in us-west-2a, us-west-2b, us-west-2c and us-east-1a, us-east-1b, us-east-1c; Event Processor (partial update) writes in us-west-2; TX Decision (read) in each region; consistency: Strong (?)]
21/40
Cassandra Last Write Wins per Column
Two clients update payment and address of the same person with exactly the same client timestamps.
Possible outcome: one client's payment update wins while the other client's address update wins.
[Diagram, Design "Trust me": nodes in us-west-2a/b/c and us-east-1a/b/c; an Event Processor (partial update) and a TX Decision (read) in each region; consistency: (?) (?)]
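The mixed outcome follows directly from resolving each column on its own. Below is a minimal sketch, assuming a hypothetical `ColumnLww` class and an assumed tie-break in which the lexically greater value wins on equal timestamps; it is not Cassandra's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Per-column last-write-wins merge; a sketch under stated assumptions.
final class ColumnLww {
    static final class Cell {
        final long ts;
        final String value;
        Cell(long ts, String value) { this.ts = ts; this.value = value; }
    }

    // Higher timestamp wins; on a tie the greater value wins (assumed rule).
    static Cell pick(Cell a, Cell b) {
        if (a.ts != b.ts) {
            return a.ts > b.ts ? a : b;
        }
        return a.value.compareTo(b.value) >= 0 ? a : b;
    }

    // Resolves every column independently of the others.
    static Map<String, Cell> merge(Map<String, Cell> x, Map<String, Cell> y) {
        Map<String, Cell> out = new HashMap<>(x);
        for (Map.Entry<String, Cell> e : y.entrySet()) {
            out.merge(e.getKey(), e.getValue(), ColumnLww::pick);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Cell> client1 = new HashMap<>();
        client1.put("payment", new Cell(100L, "visa"));
        client1.put("address", new Cell(100L, "1 Main St"));

        Map<String, Cell> client2 = new HashMap<>();
        client2.put("payment", new Cell(100L, "amex"));
        client2.put("address", new Cell(100L, "9 Oak Ave"));

        Map<String, Cell> row = merge(client1, client2);
        // payment comes from client1, address from client2:
        // a combination that neither client ever wrote.
        System.out.println(row.get("payment").value + " / " + row.get("address").value);
    }
}
```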
23/40
Cassandra Multi Value per Column
Update different columns of the same person
Conflict resolution in TX Decision (on read)
[Diagram, Design "Trust me": one Event Processor (partial update) writes payment1, address1 and the other writes payment2, address2; nodes in us-west-2a/b/c and us-east-1a/b/c; TX Decision (read) in each region; consistency: (?) (?)]
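One way to picture the multi-value approach: each writer owns its own columns, so nothing is overwritten on write, and the reader chooses among the surviving versions. The `MultiValueRow` class and its read-time policy (newest version wins as a whole, ties go to the highest writer id) are assumptions for illustration; the slide does not specify the policy.

```java
import java.util.TreeMap;

// Multi-value row: one column group per writer, resolved at read time.
final class MultiValueRow {
    static final class Version {
        final long ts;
        final String payment;
        final String address;
        Version(long ts, String payment, String address) {
            this.ts = ts;
            this.payment = payment;
            this.address = address;
        }
    }

    // Keyed by writer id, so concurrent writers never touch the same columns.
    private final TreeMap<String, Version> byWriter = new TreeMap<>();

    void write(String writerId, Version v) {
        byWriter.put(writerId, v);
    }

    // Hypothetical read-time policy: the newest version wins as a whole.
    // Writers are visited in ascending id order, so ">=" gives ties to the highest id.
    Version resolve() {
        Version best = null;
        for (Version v : byWriter.values()) {
            if (best == null || v.ts >= best.ts) {
                best = v;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MultiValueRow row = new MultiValueRow();
        row.write("us-west-2", new Version(100L, "visa", "1 Main St"));
        row.write("us-east-1", new Version(100L, "amex", "9 Oak Ave"));
        Version v = row.resolve();
        // Payment and address always come from the same writer.
        System.out.println(v.payment + " / " + v.address);
    }
}
```

Unlike per-column last-write-wins, the reader here never sees a payment from one writer paired with an address from another.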
25/40
Kafka
Conflict resolution in Event Processor
Will both regions converge into the same state?
[Diagram, Design "His brother": in each of us-west-2 and us-east-1, an Event Source (insert) feeds Kafka, mirror(s) copy the inserts between the regional Kafka clusters, an Event Processor writes to versioned S3, and TX Decision (read) reads it; consistency: (?) (?)]
27
Converging events into state
● Duplicate events
○ Idempotent: compare-and-set(x, 2, 5)
○ De-duplication: 2 + 3 + 3 = 5 (the repeated +3 is dropped)
○ Rollback
● Unordered events
○ Commutative: 2 + 3 = 3 + 2
○ Reordering window (requires state)
● Bulk/Parallel event processing
○ Associative: (2 + 3) + 4 = 2 + (3 + 4)
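The arithmetic on this slide maps onto a tiny event fold. A minimal sketch, assuming a hypothetical `Converge` class in which every event carries a unique id: duplicates are skipped by id, and because the operation is addition, arrival order does not change the result.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Folding events into state with de-duplication; illustrative only.
final class Converge {
    static final class Event {
        final String id;
        final int delta;
        Event(String id, int delta) { this.id = id; this.delta = delta; }
    }

    // Sums the deltas, skipping any event id that was already applied.
    static int apply(int initial, List<Event> events) {
        Set<String> seen = new HashSet<>();
        int state = initial;
        for (Event e : events) {
            if (seen.add(e.id)) {
                state += e.delta; // addition is commutative and associative
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Event plus3 = new Event("e1", 3);
        Event plus4 = new Event("e2", 4);
        // Duplicate delivery of e1: 2 + 3 + 3 still yields 5.
        System.out.println(apply(2, Arrays.asList(plus3, plus3)));
        // Either arrival order yields 9.
        System.out.println(apply(2, Arrays.asList(plus3, plus4)));
        System.out.println(apply(2, Arrays.asList(plus4, plus3)));
    }
}
```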
29/40
Kafka Streams API - zooming in
[Diagram, Design "His brother": two-region Kafka topology with Event Source (insert), Kafka, mirror(s), Event Processor, versioned S3 and TX Decision (read) in us-west-2 and us-east-1; consistency: (?) (?)]
Kafka Streams API
[Diagram, Design "Trust me": Event Source (insert), Kafka MirrorMaker (?), topics kstream1 and kstream2, Kafka Streams API, ktable, Kafka S3 Connector, S3]
builder.stream("kstream1", "kstream2")
    .filter(predicate)
    .transform(processor)
    .to("ktable");
30/40
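Kafka Processor API and Local Store
[Diagram, Design "Trust me": Event Source (insert), Kafka MirrorMaker (?), topics kstream1 and kstream2, Kafka Streams API, ktable, Kafka S3 Connector, S3]
Map<String, Object> process(String key, Map<String, Object> event) {
    Map<String, Object> state = kvStore.get(key);
    if (state == null) {
        state = new HashMap<>(); // first event seen for this key
    }
    state.putAll(event); // not commutative (order matters)
    kvStore.put(key, state);
    return state;
}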
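The "order matters" comment on `putAll` is easy to demonstrate. In this sketch the `PutAllOrder` class and the sample addresses are made up for illustration: two regions receive the same two events in opposite order, as can happen with mirrored topics, and end up with different state.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Shows that overwrite-style merging depends on arrival order.
final class PutAllOrder {
    // Replays events into an empty state using putAll, as in the processor.
    static Map<String, String> replay(List<Map<String, String>> events) {
        Map<String, String> state = new HashMap<>();
        for (Map<String, String> event : events) {
            state.putAll(event); // later events overwrite earlier fields
        }
        return state;
    }

    public static void main(String[] args) {
        Map<String, String> e1 = Collections.singletonMap("address", "1 Main St");
        Map<String, String> e2 = Collections.singletonMap("address", "9 Oak Ave");

        Map<String, String> west = replay(Arrays.asList(e1, e2));
        Map<String, String> east = replay(Arrays.asList(e2, e1));

        System.out.println(west); // {address=9 Oak Ave}
        System.out.println(east); // {address=1 Main St}
        System.out.println(west.equals(east)); // false: the regions diverged
    }
}
```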
32/40
CRDT Graph Model
Conflict-free Replicated Data Type
Idempotent, Commutative, Associative
● Insert Only Graph
● Address / Payment / Person Objects
G-Set: Growing Set CRDT
Conflict resolution method: merge sets

us-west-2 event | us-east-1 state
{A,B}           | {A,B}
{A,C}           | {A,B,C}
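A G-Set fits in a few lines. A minimal sketch, assuming a hypothetical `GSet` class: the only mutation is add, and merge is set union, which is idempotent, commutative and associative.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Grow-only set CRDT; illustrative sketch.
final class GSet {
    private final Set<String> elements = new TreeSet<>();

    GSet add(String... items) {
        elements.addAll(Arrays.asList(items));
        return this;
    }

    // Merge is plain set union and returns a new set.
    GSet merge(GSet other) {
        GSet out = new GSet();
        out.elements.addAll(elements);
        out.elements.addAll(other.elements);
        return out;
    }

    Set<String> value() {
        return new TreeSet<>(elements);
    }

    public static void main(String[] args) {
        GSet state = new GSet().add("A", "B"); // us-east-1 state
        GSet event = new GSet().add("A", "C"); // us-west-2 event
        System.out.println(state.merge(event).value()); // [A, B, C]
        System.out.println(event.merge(state).value()); // same result in either order
    }
}
```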
2P-Set: Two Phase Set CRDT
Comprised of two G-Sets (added and tombstone)
Always grows
Garbage Collection algorithms exist.

us-west-2 event        | us-east-1 state
add: {A,B} rmv: {A}    | add: {A,B} rmv: {A}
add: {A,C} rmv: {B,D}  | add: {A,B,C} rmv: {A,B,D}
add: {D}               | add: {A,B,C,D} rmv: {A,B,D}
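The three events from the 2P-Set slides can be replayed with a short sketch. The `TwoPhaseSet` class is an assumption for illustration: both inner sets only grow, merge unions each of them, and the visible value is the added set minus the tombstones.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Two-phase set CRDT: an added set plus a tombstone set; illustrative sketch.
final class TwoPhaseSet {
    final Set<String> added = new TreeSet<>();
    final Set<String> removed = new TreeSet<>(); // tombstones never shrink

    static TwoPhaseSet of(Collection<String> add, Collection<String> rmv) {
        TwoPhaseSet s = new TwoPhaseSet();
        s.added.addAll(add);
        s.removed.addAll(rmv);
        return s;
    }

    // Unions both inner sets and returns a new 2P-Set.
    TwoPhaseSet merge(TwoPhaseSet other) {
        TwoPhaseSet out = of(added, removed);
        out.added.addAll(other.added);
        out.removed.addAll(other.removed);
        return out;
    }

    // Visible elements: added but never removed.
    Set<String> value() {
        Set<String> visible = new TreeSet<>(added);
        visible.removeAll(removed);
        return visible;
    }

    public static void main(String[] args) {
        TwoPhaseSet state = of(Arrays.asList("A", "B"), Arrays.asList("A"))
            .merge(of(Arrays.asList("A", "C"), Arrays.asList("B", "D")))
            .merge(of(Arrays.asList("D"), Collections.<String>emptyList()));
        System.out.println("add: " + state.added);   // [A, B, C, D]
        System.out.println("rmv: " + state.removed); // [A, B, D]
        System.out.println("value: " + state.value()); // [C]
    }
}
```

D stays invisible even though its add arrives last: once an element is tombstoned it cannot be re-added, which is the price of the two-phase design.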
2P2P-Graph CRDT
2P-Set for vertices, 2P-Set for edges
Resolution method: remove wins

us-west-2 event | us-east-1 state
add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {} | add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
add_v: {} rmv_v: {A} add_e: {} rmv_e: {} | add_v: {A,B,C} rmv_v: {A} add_e: {AB,AC,BC} rmv_e: {AB,AC}
add_v: {D} rmv_v: {} add_e: {AD} rmv_e: {} | add_v: {A,B,C,D} rmv_v: {A} add_e: {AB,AC,BC,AD} rmv_e: {AB,AC,AD}
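The "remove wins" rule can be sketched as a merge that unions all four sets and then tombstones every edge touching a removed vertex. The `TwoPTwoPGraph` class is an assumption for illustration, and it relies on the slides' convention of single-letter vertex names so that an edge such as "AD" can be split into its endpoints.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// 2P2P-Graph CRDT: one 2P-Set for vertices, one for edges; illustrative sketch.
final class TwoPTwoPGraph {
    final Set<String> addV = new TreeSet<>();
    final Set<String> rmvV = new TreeSet<>();
    final Set<String> addE = new TreeSet<>();
    final Set<String> rmvE = new TreeSet<>();

    private static List<String> items(String csv) {
        return csv.isEmpty() ? Collections.<String>emptyList() : Arrays.asList(csv.split(","));
    }

    // Builds an event or state from comma-separated lists, e.g. "A,B,C".
    static TwoPTwoPGraph of(String addV, String rmvV, String addE, String rmvE) {
        TwoPTwoPGraph g = new TwoPTwoPGraph();
        g.addV.addAll(items(addV));
        g.rmvV.addAll(items(rmvV));
        g.addE.addAll(items(addE));
        g.rmvE.addAll(items(rmvE));
        return g;
    }

    // Unions all four sets, then applies "remove wins" to dangling edges.
    TwoPTwoPGraph merge(TwoPTwoPGraph o) {
        TwoPTwoPGraph out = new TwoPTwoPGraph();
        out.addV.addAll(addV); out.addV.addAll(o.addV);
        out.rmvV.addAll(rmvV); out.rmvV.addAll(o.rmvV);
        out.addE.addAll(addE); out.addE.addAll(o.addE);
        out.rmvE.addAll(rmvE); out.rmvE.addAll(o.rmvE);
        for (String edge : out.addE) {
            // Assumes two single-letter endpoints per edge, as on the slides.
            if (out.rmvV.contains(edge.substring(0, 1)) || out.rmvV.contains(edge.substring(1, 2))) {
                out.rmvE.add(edge);
            }
        }
        return out;
    }

    Set<String> vertices() { Set<String> v = new TreeSet<>(addV); v.removeAll(rmvV); return v; }
    Set<String> edges() { Set<String> e = new TreeSet<>(addE); e.removeAll(rmvE); return e; }

    public static void main(String[] args) {
        TwoPTwoPGraph state = of("A,B,C", "", "AB,AC,BC", "")
            .merge(of("", "A", "", ""))     // remove vertex A
            .merge(of("D", "", "AD", ""));  // late edge to the removed vertex
        System.out.println("rmv_e: " + state.rmvE);          // [AB, AC, AD]
        System.out.println("vertices: " + state.vertices()); // [B, C, D]
        System.out.println("edges: " + state.edges());       // [BC]
    }
}
```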
Sometimes the state won't converge easily
● Missing events (broken links)
○ integrity checks
○ repair
● Rerunning bulk events after downtime
○ Clocks: Event vs. Ingestion vs. Processor vs. Logical
○ Enrichment: IP address reputation changes daily
37/40
Background Reconciliator
Reconciliation: Compare hash (Merkle) trees
Compensation: Merge CRDT states
[Diagram: client2 (read) in us-west-2a reads versioned S3 in us-west-2; client1 (read) in us-east-1b reads versioned S3 in us-east-1; the Background Reconciliator sits between the two S3 stores]
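The two steps, comparing hash trees and then merging CRDT states, can be sketched with a two-level tree. Everything here is an assumption for illustration: the `Reconciliator` class, the fixed bucket count, and the choice of grow-only sets as the per-key CRDT. Equal roots mean nothing needs repair; otherwise only the buckets whose leaf hashes differ are merged.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Two-level hash tree over a keyed store of grow-only sets; illustrative sketch.
final class Reconciliator {
    static final int BUCKETS = 4;

    static int bucketOf(String key) {
        return Math.floorMod(key.hashCode(), BUCKETS);
    }

    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Leaf hashes: one digest per bucket over its sorted key=value entries.
    static List<byte[]> leaves(TreeMap<String, TreeSet<String>> store) {
        StringBuilder[] parts = new StringBuilder[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            parts[i] = new StringBuilder();
        }
        for (Map.Entry<String, TreeSet<String>> e : store.entrySet()) {
            parts[bucketOf(e.getKey())].append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        List<byte[]> out = new ArrayList<>();
        for (StringBuilder p : parts) {
            out.add(sha256(p.toString()));
        }
        return out;
    }

    // Root hash: digest over the concatenated leaf hashes.
    static byte[] root(List<byte[]> leaves) {
        StringBuilder sb = new StringBuilder();
        for (byte[] leaf : leaves) {
            sb.append(Arrays.toString(leaf));
        }
        return sha256(sb.toString());
    }

    // Copies one bucket's entries into the other store using set union.
    static void mergeBucket(TreeMap<String, TreeSet<String>> from,
                            TreeMap<String, TreeSet<String>> to, int bucket) {
        for (Map.Entry<String, TreeSet<String>> e : from.entrySet()) {
            if (bucketOf(e.getKey()) == bucket) {
                to.computeIfAbsent(e.getKey(), k -> new TreeSet<String>()).addAll(e.getValue());
            }
        }
    }

    // Compares the trees top-down; returns how many buckets needed repair.
    static int reconcile(TreeMap<String, TreeSet<String>> a, TreeMap<String, TreeSet<String>> b) {
        List<byte[]> la = leaves(a);
        List<byte[]> lb = leaves(b);
        if (Arrays.equals(root(la), root(lb))) {
            return 0; // stores already agree
        }
        int repaired = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (!Arrays.equals(la.get(i), lb.get(i))) {
                repaired++;
                mergeBucket(a, b, i);
                mergeBucket(b, a, i);
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        TreeMap<String, TreeSet<String>> west = new TreeMap<>();
        west.put("person1", new TreeSet<>(Arrays.asList("addr1")));
        TreeMap<String, TreeSet<String>> east = new TreeMap<>();
        east.put("person1", new TreeSet<>(Arrays.asList("addr1", "addr2")));
        east.put("person2", new TreeSet<>(Arrays.asList("pay1")));

        System.out.println("buckets repaired: " + reconcile(west, east));
        System.out.println("converged: " + west.equals(east));
        System.out.println("second pass repairs: " + reconcile(west, east));
    }
}
```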
38/40
Takeaways
● Define business need for cross region
Availability, Latency, Residency, Analytics
● Know your NoSQL
Couchbase != Cassandra != Kafka
● Ask about CRDTs
LWW-Register, MV-Register, 2P-Sets, 2P2P-Graphs
● Use Reconciliation
● Dedicated Fiber and Atomic clocks ARE COMING
40/40
“The Internet was designed to be an academic medium.
It was not designed to handle this level of transactions”
Fred Matteson @ schwab.com 1999
Advanced Topics
● The real world is more like a slaughterhouse than a pharmacy
● Multi Data Center Topologies
○ Star (SPOF, simple)
○ Ring (TLV ←→ Eilat ←→ Jerusalem ←→ TLV)
○ Mesh (resilient, complex)
● Data Residency
○ Separate PII from data
○ Peek at other data centers ad-hoc


Reversim 2017 cross region data replication design considerations
