The Hows and Whys
of a Distributed SQL Database
Alex Robinson / Member of the Technical Staff
alex@cockroachlabs.com / @alexwritescode
Agenda
• Why a distributed SQL database?
• How does a distributed SQL database work?
○ Data distribution
○ Data replication
○ Transactions
Why a distributed SQL database?
A brief history of databases
1960s: The first databases
• Records stored with pointers to each other in a hierarchy or network
• Queries had to process one tuple at a time, traversing the structure
• No independence between logical and physical schemas
○ To optimize for different queries, you had to dump and reload all your data
• Required a lot of programmer effort to use
1970s: SQL / RDBMS
• Relational model was designed in contrast to prior hierarchical DBs
○ Expose simple data structures and high-level language
○ Queries are independent of the underlying physical storage
• Very developer friendly
• Designed to run on a single machine
1980s-1990s: Continued growth of SQL / RDBMS
• SQL databases matured, took over the industry
• Object-oriented databases were tried, but didn’t gain much traction
○ Didn’t support complex queries well
○ No standard API across systems
Early 2000s: Custom Sharding
• Rapidly growing scale necessitated new solutions
• Companies added custom middleware to connect single-node DBs
• No cross-shard transactions, operational nightmares
[diagram: Applications → Middleware → sharded MySQL instances]
2004+: NoSQL
• Further growth and operational pain necessitated new systems
• Big change in priorities - scale and availability above all else
• Sacrificed a lot to get there
○ No relational model (custom APIs instead of SQL)
○ No (or very limited) transactions and indexes
○ Manual joins
2010s: “NewSQL” / Distributed SQL
Why another type of database?
Database Limitations
Existing database solutions place an undue burden on application
developers and operators:
● Scale (SQL)
● Fault tolerance (SQL)
● Limited transactions (NoSQL)
● Limited indexes (NoSQL)
● Consistency issues (NoSQL)
Strong Consistency and Transactions
“We believe it is better to have application
programmers deal with performance
problems due to overuse of transactions as
bottlenecks arise, rather than always coding
around the lack of transactions”
-Google’s Spanner paper
2010s: “NewSQL” / Distributed SQL
● Distributed databases that provide full SQL semantics
● Attempt to combine the best of both worlds
○ Fully-featured SQL
○ Horizontally scalable and highly available
● Entirely new systems, tough to tack onto existing databases
2010s: “NewSQL” / Distributed SQL
● Burst onto the scene with Google’s Spanner system (published in 2012)
● CockroachDB started open source development in 2014
So we know why, but how?
And how does it differ from SQL and NoSQL databases?
What we’re going to cover
1. How data is distributed across machines
2. How remote copies of data are replicated and kept in sync
3. How transactions work
Plus how all of this differs from conventional SQL and NoSQL systems
Disclaimer
Data Distribution
Data Distribution in SQL
• Option 1: All data on one server
○ With optional secondary replicas/backups that also store all the data
• Option 2: Manually shard data across separate database instances
○ All data for each shard is still just on one server (plus any secondary replicas)
○ All data distribution choices are being made by the human DB admin
Data Distribution in NoSQL / NewSQL
• Core assumption: entire dataset won’t fit on one machine
• Given that, there are two key questions to answer:
○ How do you divide it up?
○ How do you locate any particular piece of data?
• Two primary approaches: Hashing or Order-Preserving
Option one: Hashing
● Pick one or more hash functions, divide up data by hashing each key
● Deterministically map hashed keys to servers
● Pro: easy to locate data by key
● Con: awkward, inefficient range scans
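The hashing approach can be sketched in a few lines (the server names and the simple modulo placement are illustrative; production systems typically use consistent hashing to limit data movement when membership changes):

```python
import hashlib

# Toy cluster; names and the simple modulo placement are illustrative.
SERVERS = ["node1", "node2", "node3"]

def server_for_key(key: str) -> str:
    """Deterministically map a hashed key to a server."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# Locating a single key is easy and needs no index lookup:
assert server_for_key("cherry") in SERVERS
# But a scan of "cherry" <= k <= "mango" must contact every server,
# because hashing destroys the key ordering.
```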
Option two: Order-Preserving
● Divide sorted key space up into ranges of approximately equal size
● Distribute the resulting key ranges across the servers
● Pros: efficient scans, easy splitting
● Con: requires additional indexing
Option two: Order-Preserving
Each shard (“range”) contains a contiguous segment of the key space
Ø-lem: apricot, banana, blueberry, cherry, grape
lem-pea: lemon, lime, mango, melon, orange
pea-∞: peach, pear, pineapple, raspberry, strawberry
Option two: Order-Preserving
We need an indexing structure to locate a range
range index: Ø-lem | lem-pea | pea-∞
Ø-lem: apricot, banana, blueberry, cherry, grape
lem-pea: lemon, lime, mango, melon, orange
pea-∞: peach, pear, pineapple, raspberry, strawberry
Option two: Order-Preserving
range index: Ø-lem | lem-pea | pea-∞
Ø-lem: apricot, banana, blueberry, cherry, grape
lem-pea: lemon, lime, mango, melon, orange
pea-∞: peach, pear, pineapple, raspberry, strawberry
Scans (fruits >= “cherry” AND <= “mango”) are efficient
Option two: Order-Preserving
When a range grows too large, it splits (pea-∞ becomes pea-str and str-∞):
range index: Ø-lem | lem-pea | pea-str | str-∞
Ø-lem: apricot, banana, blueberry, cherry, grape
lem-pea: lemon, lime, mango, melon, orange
pea-str: peach, pear, pineapple, raspberry
str-∞: strawberry, tamarillo, tamarind
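The range index above can be sketched with a sorted list of range start keys and binary search (the boundaries mirror the fruit example; everything else is illustrative):

```python
import bisect

# Range index: a sorted list of range start keys (Ø = "" here).
# Boundaries mirror the fruit example above; the rest is illustrative.
range_starts = ["", "lem", "pea", "str"]

def range_for_key(key: str) -> str:
    """Find the range whose [start, next_start) interval contains key."""
    return range_starts[bisect.bisect_right(range_starts, key) - 1]

def ranges_for_scan(start: str, end: str) -> list:
    """A scan touches only the contiguous ranges covering [start, end]."""
    lo = bisect.bisect_right(range_starts, start) - 1
    hi = bisect.bisect_right(range_starts, end) - 1
    return range_starts[lo:hi + 1]

# "cherry" falls in the Ø-lem range; the example scan touches two ranges.
assert range_for_key("cherry") == ""
assert ranges_for_scan("cherry", "mango") == ["", "lem"]
```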
Data Distribution: Placement
Each range is replicated to three or more nodes:
Node 1: Range 1, Range 2, Range 3
Node 2: Range 1, Range 2, Range 3
Node 3: Range 1, Range 2, Range 3
Data Distribution: Rebalancing
Adding a new (empty) node:
Node 1: Range 1, Range 2, Range 3
Node 2: Range 1, Range 2, Range 3
Node 3: Range 1, Range 2, Range 3
Node 4: (empty)
Data Distribution: Rebalancing
A new replica is allocated on the new node, and data is copied.
Data Distribution: Rebalancing
The new replica is made live, replacing one of the existing replicas.
Data Distribution: Rebalancing
The old (inactive) replica is deleted.
Data Distribution: Rebalancing
The process continues until the nodes are balanced.
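The rebalancing loop above can be sketched as follows (the data structures and balance threshold are illustrative, not CockroachDB’s actual allocator):

```python
# Illustrative structures, not CockroachDB's actual allocator: `nodes`
# maps a node name to the set of ranges it holds a replica of.
def rebalance(nodes: dict) -> None:
    """Move replicas one at a time from the fullest to the emptiest node."""
    while True:
        fullest = max(nodes, key=lambda n: len(nodes[n]))
        emptiest = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
            return                        # nodes are balanced
        candidates = nodes[fullest] - nodes[emptiest]
        if not candidates:                # emptiest already holds them all
            return
        r = sorted(candidates)[0]
        nodes[emptiest].add(r)            # allocate a new replica, copy data
        nodes[fullest].remove(r)          # make it live, delete the old one

cluster = {
    "node1": {"r1", "r2", "r3"},
    "node2": {"r1", "r2", "r3"},
    "node3": {"r1", "r2", "r3"},
    "node4": set(),                       # newly added, empty node
}
rebalance(cluster)
# Every range still has 3 replicas, now spread across 4 nodes.
assert all(sum(1 for rs in cluster.values() if r in rs) == 3
           for r in ("r1", "r2", "r3"))
```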
Data Distribution: Recovery
Losing a node causes recovery of its replicas.
Data Distribution: Recovery
A new replica gets created on an existing node to replace each lost replica.
Data Replication
Keeping Remote Copies in Sync
Keeping Copies in Sync in SQL Databases
• Option one: cold backups
○ No real expectation of backups being fully up-to-date
• Option two: primary-secondary replication
○ All writes (and all reads that care about consistency) go to the “Primary” instance
○ All writes get pushed from the Primary to any Secondary instances
Primary/Secondary Replication
Put “cherry” goes to the Primary, which forwards it to the Secondary.
Primary: apricot, banana, blueberry, cherry, grape
Secondary: apricot, banana, blueberry, cherry, grape
In theory, replicas contain identical copies of data: Voila!
Primary/Secondary Replication
• In practice, you have to choose
○ Asynchronous replication
■ Secondaries lag behind primary
■ Failover is likely to lose recent writes
○ Synchronous replication
■ All writes get delayed waiting for secondary acknowledgment
■ What do you do when your secondary fails?
• Failover is really hard to get right
○ How do clients know which instance is the primary at any given time?
Keeping Copies in Sync in NoSQL Databases
• Most NoSQL systems are eventually consistent
• Very wide space of solutions, many different possible behaviors
○ Last write wins
○ Last write wins with vector clocks
○ Leader elections
○ Quorum reads and writes
○ Conflict-free replicated data types (CRDTs)
Keeping Copies in Sync in NewSQL Databases
• Use a distributed consensus protocol from the CS literature
○ Available protocols: Paxos, Raft, Zab, Viewstamped Replication, …
• At a high level:
○ Replicate data to N (often N=3 or N=5) nodes
○ Commit happens when a quorum have written the data
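The commit rule above can be sketched as a toy model (not real Raft: no log, terms, or leader election; replicas are plain dicts):

```python
# A toy model of the quorum-commit rule (not real Raft: no log, terms,
# or leader election). Replicas are plain dicts; `up` marks liveness.
def replicate(key: str, value: str, replicas: list, up: list) -> bool:
    """Write to every reachable replica; commit iff a majority wrote it."""
    acks = 0
    for replica, alive in zip(replicas, up):
        if alive:
            replica[key] = value
            acks += 1
    return acks >= len(replicas) // 2 + 1  # majority quorum

nodes = [{}, {}, {}]
assert replicate("fruit", "cherry", nodes, [True, True, False])     # 2 of 3
assert not replicate("veg", "carrot", nodes, [True, False, False])  # 1 of 3
```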
Distributed Consensus: Raft made simple
Put “cherry” arrives at the Leader, which writes it to its own copy.
Leader: apricot, banana, blueberry, cherry, grape
Follower: apricot, banana, blueberry, grape
Follower: apricot, banana, blueberry, grape

Distributed Consensus: Raft made simple
The Leader forwards Put “cherry” to both Followers.

Distributed Consensus: Raft made simple
The write is committed once 2 out of 3 nodes have written the data.
Leader: apricot, banana, blueberry, cherry, grape
Follower: apricot, banana, blueberry, cherry, grape
Follower: apricot, banana, blueberry, grape

Distributed Consensus: Raft made simple
The Followers acknowledge (Ack) the write back to the Leader.

Distributed Consensus: Raft made simple
What happens during failover?
Distributed Consensus: Raft made simple
The Leader accepts Put “cherry”, but fails before the write reaches a quorum.
Leader: apricot, banana, blueberry, cherry, grape
Follower: apricot, banana, blueberry, grape
Follower: apricot, banana, blueberry, grape

Distributed Consensus: Raft made simple
A new Leader is elected; the unreplicated write survives only on one node.
Leader: apricot, banana, blueberry, grape
Follower: apricot, banana, blueberry, cherry, grape
Follower: apricot, banana, blueberry, grape

Distributed Consensus: Raft made simple
On failover, only data written to a quorum is considered present.

Distributed Consensus: Raft made simple
Read “cherry” → Key not found
Consensus in NewSQL Databases
● Run one consensus group per range of data
● Many practical complications: membership changes, range splits, upgrades, scaling the number of ranges
Transactions
How can they be made to work in a distributed system?
What is an ACID Transaction?
• Atomic - entire transaction either fully commits or fully aborts
• Consistent - moves from one valid DB state to another
• Isolated - concurrent transactions don’t interfere with each other
○ “Serializable” is needed for true isolation
• Durable - committed transactions won’t disappear
Transactions in SQL Databases
• Atomicity is bootstrapped off a lower-level atomic primitive: log writes
○ All mutations applied as part of a transaction get tagged with a transaction ID
○ “Commit” log entry marks the transaction as committed, making mutations visible
• Isolation is typically provided in one of two ways
○ Read/Write Locks
○ Multi-Version Concurrency Control (MVCC)
Transactions in SQL Databases: Read/Write Locks
• Pretty simple at a high level:
○ When you read a row, take a read lock on it
○ When you write a row, take a write lock on it
○ When you commit, release all locks
• Requires deadlock detection
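The locking scheme above can be sketched as a toy strict two-phase-locking store (read locks are simplified to exclusive locks, and deadlock detection is omitted):

```python
import threading

# A toy strict two-phase-locking store: locks are taken as rows are
# touched and released only at commit. Read locks are simplified to
# exclusive locks, and deadlock detection is omitted.
class Transaction:
    def __init__(self, db: dict, locks: dict):
        self.db, self.locks, self.held = db, locks, []

    def _lock(self, row: str) -> None:
        lock = self.locks.setdefault(row, threading.Lock())
        if lock not in self.held:
            lock.acquire()          # blocks if another txn holds the row
            self.held.append(lock)

    def read(self, row: str):
        self._lock(row)             # take a lock when reading a row
        return self.db.get(row)

    def write(self, row: str, value) -> None:
        self._lock(row)             # take a lock when writing a row
        self.db[row] = value

    def commit(self) -> None:
        for lock in self.held:      # release all locks at commit
            lock.release()
        self.held.clear()

db, locks = {"a": 1}, {}
txn = Transaction(db, locks)
txn.write("a", txn.read("a") + 1)   # read-modify-write under locks
txn.commit()
assert db["a"] == 2
```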
Transactions in SQL Databases: MVCC
• Store a timestamp with each row
• When updating a row, don’t overwrite the old value
• When reading, return the most recent value from before the read
○ Allows access to historical data
○ Allows long-running read-only transactions to not block new writes
• Need to detect conflicts and abort conflicting transactions
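A minimal MVCC store following the bullets above (conflict detection and garbage collection of old versions are omitted):

```python
import bisect

# Each row keeps a sorted list of (timestamp, value) versions; writes
# append rather than overwrite, and reads return the newest version at
# or before the read timestamp. Conflict detection is omitted.
class MVCCStore:
    def __init__(self):
        self.versions = {}

    def put(self, key: str, value: str, ts: int) -> None:
        bisect.insort(self.versions.setdefault(key, []), (ts, value))

    def get(self, key: str, ts: int):
        for version_ts, value in reversed(self.versions.get(key, [])):
            if version_ts <= ts:
                return value
        return None                      # no version existed yet

store = MVCCStore()
store.put("fruit", "apple", ts=10)
store.put("fruit", "cherry", ts=20)
assert store.get("fruit", ts=15) == "apple"   # historical read
assert store.get("fruit", ts=25) == "cherry"  # latest version
assert store.get("fruit", ts=5) is None       # before the first write
```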
Transactions in NoSQL Databases
• Many NoSQL systems don’t offer transactions at all
• This is one of the biggest tradeoffs that was made to improve scalability
• Those that do typically limit transactions to a single key
○ Single-key transactions don’t require coordinating across shards
○ Can be implemented as just supporting a compare-and-swap operation
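Such a compare-and-swap primitive can be sketched as follows (a toy single-shard store; the class and method names are illustrative):

```python
import threading

# A toy single-shard store exposing compare-and-swap: the only
# "transaction" is an atomic conditional update of one key.
class KVStore:
    def __init__(self):
        self.data, self.mutex = {}, threading.Lock()

    def compare_and_swap(self, key, expected, new) -> bool:
        with self.mutex:   # atomic within one shard; no cross-shard work
            if self.data.get(key) != expected:
                return False
            self.data[key] = new
            return True

kv = KVStore()
assert kv.compare_and_swap("fruit", None, "apple")        # create
assert kv.compare_and_swap("fruit", "apple", "cherry")    # update
assert not kv.compare_and_swap("fruit", "apple", "pear")  # stale expected
assert kv.data["fruit"] == "cherry"
```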
Transactions in NewSQL Databases
• Support traditional ACID semantics
• Atomicity is bootstrapped off distributed consensus
○ A “Commit” write to consensus system marks the transaction as committed
• Isolation can be handled similarly to single-node SQL databases
○ But the added latency and distributed state makes things tougher
Distributed Transactions (CockroachDB)
1. Begin Txn 1
2. Put “cherry”
Data: apricot, cherry (txn 1), grape
Txn-1 Record: Status = PENDING

Distributed Transactions (CockroachDB)
1. Begin Txn 1
2. Put “cherry”
3. Put “mango”
Data: apricot, cherry (txn 1), grape, lemon, mango (txn 1), orange
Txn-1 Record: Status = PENDING

Distributed Transactions (CockroachDB)
1. Begin Txn 1
2. Put “cherry”
3. Put “mango”
4. Commit Txn 1
Data: apricot, cherry (txn 1), grape, lemon, mango (txn 1), orange
Txn-1 Record: Status = COMMITTED

Distributed Transactions (CockroachDB)
1. Begin Txn 1
2. Put “cherry”
3. Put “mango”
4. Commit Txn 1
5. Clean up intents
Data: apricot, cherry, grape, lemon, mango, orange
Txn-1 Record: Status = COMMITTED

Distributed Transactions (CockroachDB)
1. Begin Txn 1
2. Put “cherry”
3. Put “mango”
4. Commit Txn 1
5. Clean up intents
6. Remove Txn 1
Data: apricot, cherry, grape, lemon, mango, orange
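The happy path above can be sketched as follows (the structures are illustrative, not CockroachDB’s actual encoding):

```python
# Illustrative structures, not CockroachDB's actual encoding: `data`
# maps key -> (value, owning txn or None); `txn_records` hold statuses.
class TxnStore:
    def __init__(self):
        self.data = {}
        self.txn_records = {}

    def begin(self, txn_id: str) -> None:
        self.txn_records[txn_id] = "PENDING"      # 1. Begin

    def put(self, txn_id: str, key: str, value: str) -> None:
        self.data[key] = (value, txn_id)          # 2-3. write an intent

    def commit(self, txn_id: str) -> None:
        self.txn_records[txn_id] = "COMMITTED"    # 4. the atomic commit point

    def cleanup(self, txn_id: str) -> None:
        for key, (value, owner) in list(self.data.items()):
            if owner == txn_id:
                self.data[key] = (value, None)    # 5. clean up intents
        del self.txn_records[txn_id]              # 6. remove the txn record

store = TxnStore()
store.begin("txn1")
store.put("txn1", "cherry", "ripe")
store.put("txn1", "mango", "ripe")  # may land on a different range
store.commit("txn1")
store.cleanup("txn1")
assert store.data["cherry"] == ("ripe", None)
assert "txn1" not in store.txn_records
```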
Distributed Transactions
● That’s the happy case
● What about conflicting transactions?
Distributed Transactions (write conflict)
Txn 1: 1. Begin Txn 1; 2. Put “cherry”
Data: apricot, cherry (txn 1), grape
Txn-1 Record: Status = PENDING

Distributed Transactions (write conflict)
Txn 2: 1. Begin Txn 2; 2. Put “cherry”
Data: apricot, cherry (txn 1), grape
Txn-1 Record: Status = PENDING
Txn-2 Record: Status = PENDING

Distributed Transactions (write conflict)
Txn 2: 2. Put “cherry” → a. Check Txn 1 record
Data: apricot, cherry (txn 1), grape
Txn-1 Record: Status = PENDING
Txn-2 Record: Status = PENDING

Distributed Transactions (write conflict)
Txn 2: 2. Put “cherry” → b. Push Txn 1
Data: apricot, cherry (txn 1), grape
Txn-1 Record: Status = ABORTED
Txn-2 Record: Status = PENDING

Distributed Transactions (write conflict)
Txn 2: 2. Put “cherry” → c. Update intent
Data: apricot, cherry (txn 2), grape
Txn-1 Record: Status = ABORTED
Txn-2 Record: Status = PENDING

Distributed Transactions (write conflict)
Txn 2: 3. Commit Txn 2
Data: apricot, cherry (txn 2), grape
Txn-1 Record: Status = ABORTED
Txn-2 Record: Status = COMMITTED
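The conflict path can be sketched as follows (illustrative state; the real system also handles committed owners, transaction priorities, and retries rather than always aborting the pending transaction):

```python
# Illustrative state: `intents` maps key -> (value, owning txn);
# `txn_records` hold each transaction's status.
intents = {"cherry": ("v1", "txn1")}
txn_records = {"txn1": "PENDING", "txn2": "PENDING"}

def put(txn_id: str, key: str, value: str) -> None:
    existing = intents.get(key)
    if existing is not None and existing[1] != txn_id:
        owner = existing[1]
        if txn_records[owner] == "PENDING":    # a. check the owner's record
            txn_records[owner] = "ABORTED"     # b. push (abort) the owner
    intents[key] = (value, txn_id)             # c. take over the intent

put("txn2", "cherry", "v2")
txn_records["txn2"] = "COMMITTED"              # 3. commit Txn 2

assert txn_records["txn1"] == "ABORTED"
assert intents["cherry"] == ("v2", "txn2")
```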
Distributed Transactions
● Transaction atomicity is bootstrapped on top of Raft atomicity
● This description omitted MVCC and other types of conflicts
○ More details on the Cockroach Labs blog [1] or the Spanner paper [2]
[1] https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/
[2] https://research.google.com/archive/spanner.html
Summary
Summary
● Databases are best understood in the context in which they were built
● How does a NewSQL database work in comparison to previous DBs?
○ Distribute data to support large data sets and fault-tolerance
○ Replicate consistently with consensus protocols like Raft
○ Distributed transactions using consensus as the foundation
Thank You!
CockroachLabs.com/blog
github.com/cockroachdb/cockroach
alex@cockroachlabs.com / @alexwritescode

A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 

The Hows and Whys of a Distributed SQL Database - Strange Loop 2017

  • 1. The Hows and Whys of a Distributed SQL Database Alex Robinson / Member of the Technical Staff alex@cockroachlabs.com / @alexwritescode
  • 2. Agenda • Why a distributed SQL database? • How does a distributed SQL database work? ○ Data distribution ○ Data replication ○ Transactions
  • 3. Why a distributed SQL database? A brief history of databases
  • 4. 1960s: The first databases • Records stored with pointers to each other in a hierarchy or network • Queries had to process one tuple at a time, traversing the structure • No independence between logical and physical schemas ○ To optimize for different queries, you had to dump and reload all your data • Required a lot of programmer effort to use
  • 5. 1970s: SQL / RDBMS • Relational model was designed in contrast to prior hierarchical DBs ○ Expose simple data structures and high-level language ○ Queries are independent of the underlying physical storage • Very developer friendly • Designed to run on a single machine
  • 6. 1980s-1990s: Continued growth of SQL / RDBMS • SQL databases matured, took over the industry • Object-oriented databases tried, but didn’t gain much traction ○ Didn’t support complex queries well ○ No standard API across systems
  • 7. Early 2000s: Custom Sharding • Rapidly growing scale necessitated new solutions • Companies added custom middleware to connect single-node DBs • No cross-shard transactions, operational nightmares (diagram: applications talk to sharding middleware, which routes requests to separate MySQL instances)
  • 8. 2004+: NoSQL • Further growth and operational pain necessitated new systems • Big change in priorities: scale and availability above all else • Sacrificed a lot to get there ○ No relational model (custom APIs instead of SQL) ○ No (or very limited) transactions and indexes ○ Manual joins
  • 9. 2010s: “NewSQL” / Distributed SQL
  • 10. Why another type of database?
  • 11. Database Limitations Existing database solutions place an undue burden on application developers and operators: ● Scale (SQL) ● Fault tolerance (SQL) ● Limited transactions (NoSQL) ● Limited indexes (NoSQL) ● Consistency issues (NoSQL)
  • 12. Strong Consistency and Transactions “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions” -Google’s Spanner paper
  • 13. 2010s: “NewSQL” / Distributed SQL ● Distributed databases that provide full SQL semantics ● Attempt to combine the best of both worlds ○ Fully-featured SQL ○ Horizontally scalable and highly available ● Entirely new systems, tough to tack onto existing databases
  • 14. 2010s: “NewSQL” / Distributed SQL ● Burst onto the scene with Google’s Spanner system (published in 2012) ● CockroachDB started open source development in 2014
  • 15. So we know why, but how? And what’s different from SQL and NoSQL databases?
  • 16. What we’re going to cover 1. How data is distributed across machines 2. How remote copies of data are replicated and kept in sync 3. How transactions work Plus how all of this differs from conventional SQL and NoSQL systems
  • 19. Data Distribution in SQL • Option 1: All data on one server ○ With optional secondary replicas/backups that also store all the data • Option 2: Manually shard data across separate database instances ○ All data for each shard is still just on one server (plus any secondary replicas) ○ All data distribution choices are being made by the human DB admin
  • 20. Data Distribution in NoSQL / NewSQL • Core assumption: entire dataset won’t fit on one machine • Given that, there are two key questions to answer: ○ How do you divide it up? ○ How do you locate any particular piece of data? • Two primary approaches: Hashing or Order-Preserving
  • 21. Option one: Hashing ● Pick one or more hash functions, divide up data by hashing each key ● Deterministically map hashed keys to servers ● Pros: easy to locate data by key ● Con: awkward, inefficient range scans
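The hashed-distribution scheme above can be sketched in a few lines of Python. The server names and the choice of MD5 are illustrative, not any particular system's scheme:

```python
import hashlib

SERVERS = ["node1", "node2", "node3"]  # illustrative cluster

def server_for(key: str) -> str:
    # Hash the key, then deterministically map the hash to a server.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

# Point lookups are easy: any client can compute the owner locally.
# But keys adjacent in sort order ("pear", "pecan") hash to unrelated
# servers, so a range scan must contact every server.
```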
  • 22. Option two: Order-Preserving ● Divide sorted key space up into ranges of approximately equal size ● Distribute the resulting key ranges across the servers ● Pros: efficient scans, easy splitting ● Con: requires additional indexing
  • 23. Option two: Order-Preserving Each shard (“range”) contains a contiguous segment of the key space (diagram: keys split into ranges Ø-lem, lem-pea, and pea-∞, e.g. “cherry” in Ø-lem, “lemon” in lem-pea, “strawberry” in pea-∞)
  • 24. Option two: Order-Preserving We need an indexing structure to locate a range (diagram: a sorted range index maps the split keys to the ranges Ø-lem, lem-pea, and pea-∞)
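The range index can be sketched as a sorted list of split keys plus a binary search. The split points mirror the slide's fruit example; this is a simplification, not any real system's index structure:

```python
import bisect

SPLIT_KEYS = ["lem", "pea"]                  # sorted split points
RANGE_NAMES = ["Ø-lem", "lem-pea", "pea-∞"]  # one more range than splits

def range_for(key: str) -> str:
    # bisect_right counts how many split points sort <= key,
    # which is exactly the index of the range owning the key.
    return RANGE_NAMES[bisect.bisect_right(SPLIT_KEYS, key)]
```

Because the index is sorted, a range scan simply walks consecutive ranges, which is the efficiency win over hashing.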
  • 27. Data Distribution: Placement Each range is replicated to three or more nodes (diagram: ranges 1–3 spread across nodes 1–3)
  • 28. Data Distribution: Rebalancing Adding a new (empty) node (diagram: node 4 joins the cluster holding no replicas)
  • 29. Data Distribution: Rebalancing A new replica is allocated on the new node and data is copied to it
  • 30. Data Distribution: Rebalancing The new replica is made live, replacing another
  • 31. Data Distribution: Rebalancing The old (inactive) replica is deleted
  • 32. Data Distribution: Rebalancing The process continues until the nodes are balanced
  • 33. Data Distribution: Recovery Losing a node triggers recovery of its replicas
  • 34. Data Distribution: Recovery A new replica gets created on an existing node to replace the lost replica
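The rebalancing and recovery walkthrough above boils down to one placement decision: pick a node that doesn't already hold a replica of the range and has the most spare capacity. A minimal sketch, using replica counts as the only balance signal (an assumption; real systems also weigh disk usage, load, and locality):

```python
def choose_rebalance_target(replica_counts, current_holders):
    """Pick the least-loaded node that doesn't already hold a replica.

    replica_counts: dict of node -> number of replicas it holds.
    current_holders: set of nodes already holding this range.
    """
    candidates = {n: c for n, c in replica_counts.items()
                  if n not in current_holders}
    if not candidates:
        return None  # nowhere legal to put another replica
    return min(candidates, key=candidates.get)
```

Excluding current holders is what keeps replicas on distinct nodes, so losing one machine can only remove one copy of any range.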
  • 36. Keeping Copies in Sync in SQL Databases • Option one: cold backups ○ No real expectation of backups being fully up-to-date • Option two: primary-secondary replication ○ All writes (and all reads that care about consistency) go to the “Primary” instance ○ All writes get pushed from the Primary to any Secondary instances
  • 37. Primary/Secondary Replication In theory, every write is applied to both the primary and the secondary, so the replicas contain identical copies of the data. Voila! (diagram: a Put “cherry” applied to both Primary and Secondary)
  • 38. Primary/Secondary Replication • In practice, you have to choose ○ Asynchronous replication ■ Secondaries lag behind primary ■ Failover is likely to lose recent writes ○ Synchronous replication ■ All writes get delayed waiting for secondary acknowledgment ■ What do you do when your secondary fails? • Failover is really hard to get right ○ How do clients know which instance is the primary at any given time?
  • 39. Keeping Copies in Sync in NoSQL Databases • Most NoSQL systems are eventually consistent • Very wide space of solutions, many different possible behaviors ○ Last write wins ○ Last write wins with vector clocks ○ Leader elections ○ Quorum reads and writes ○ Conflict-free replicated data types (CRDTs)
  • 40. Keeping Copies in Sync in NewSQL Databases • Use a distributed consensus protocol from the CS literature ○ Available protocols: Paxos, Raft, Zab, Viewstamped Replication, … • At a high level: ○ Replicate data to N (often N=3 or N=5) nodes ○ Commit happens when a quorum have written the data
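The quorum rule above is just integer arithmetic: with N replicas, a write is durable once a majority (⌊N/2⌋ + 1) have written it, because any two majorities overlap in at least one replica. A sketch:

```python
def quorum(n_replicas: int) -> int:
    # Majority size: any two quorums share at least one replica, so a
    # committed write survives into any future leader's quorum.
    return n_replicas // 2 + 1

def is_committed(acks: int, n_replicas: int) -> bool:
    return acks >= quorum(n_replicas)
```

With N=3 a write commits after 2 acknowledgments and the system tolerates one failed node; with N=5 it commits after 3 and tolerates two.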
  • 43. Distributed Consensus: Raft made simple A write is committed when 2 out of 3 nodes have written the data (diagram: the leader applies Put “cherry” and forwards it to its followers; one follower has applied it, the other is still catching up)
  • 44. Distributed Consensus: Raft made simple (diagram: acknowledgments flow back to the leader once a quorum has written “cherry”)
  • 45. Distributed Consensus: Raft made simple What happens during failover?
  • 46. Distributed Consensus: Raft made simple (diagram: a Put “cherry” reaches the leader, but neither follower has applied it yet)
  • 47. Distributed Consensus: Raft made simple (diagram: the leader fails before replicating “cherry” to a quorum)
  • 48. Distributed Consensus: Raft made simple (diagram: a follower that never saw “cherry” becomes the new leader and serves a Read “cherry”)
  • 49. Distributed Consensus: Raft made simple On failover, only data written to a quorum is considered present
  • 50. Distributed Consensus: Raft made simple (diagram: the read returns “Key not found”, since “cherry” never reached a quorum)
  • 51. Consensus in NewSQL Databases ● Run one consensus group per range of data ● Many practical complications: member change, range splits, upgrades, scaling number of ranges
  • 52. Transactions How can they be made to work in a distributed system?
  • 53. What is an ACID Transaction? • Atomic - entire transaction either fully commits or fully aborts • Consistent - moves from one valid DB state to another • Isolated - concurrent transactions don’t interfere with each other ○ “Serializable” is needed for true isolation • Durable - committed transactions won’t disappear
  • 54. Transactions in SQL Databases • Atomicity is bootstrapped off a lower-level atomic primitive: log writes ○ All mutations applied as part of a transaction get tagged with a transaction ID ○ “Commit” log entry marks the transaction as committed, making mutations visible • Isolation is typically provided in one of two ways ○ Read/Write Locks ○ Multi-Version Concurrency Control (MVCC)
  • 55. Transactions in SQL Databases: Read/Write Locks • Pretty simple at a high level: ○ When you read a row, take a read lock on it ○ When you write a row, take a write lock on it ○ When you commit, release all locks • Requires deadlock detection
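A minimal sketch of the read/write-lock scheme above. For simplicity, conflicting requests are refused outright here; a real lock manager would block the requester and run the deadlock detection the slide mentions:

```python
# Toy two-phase locking: each row has a set of reader txns and at
# most one writer txn. Readers share; writers are exclusive.
class LockTable:
    def __init__(self):
        self.readers = {}  # row -> set of txn ids holding read locks
        self.writer = {}   # row -> txn id holding the write lock

    def read_lock(self, row, txn):
        w = self.writer.get(row)
        if w is not None and w != txn:
            return False  # another txn is writing this row
        self.readers.setdefault(row, set()).add(txn)
        return True

    def write_lock(self, row, txn):
        w = self.writer.get(row)
        other_readers = self.readers.get(row, set()) - {txn}
        if (w is not None and w != txn) or other_readers:
            return False  # conflicts with another writer or readers
        self.writer[row] = txn
        return True

    def release_all(self, txn):
        # On commit/abort, release every lock the txn holds.
        for row in list(self.readers):
            self.readers[row].discard(txn)
        for row, w in list(self.writer.items()):
            if w == txn:
                del self.writer[row]
```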
  • 56. Transactions in SQL Databases: MVCC • Store a timestamp with each row • When updating a row, don’t overwrite the old value • When reading, return the most recent value from before the read ○ Allows access to historical data ○ Allows long-running read-only transactions to not block new writes • Need to detect conflicts and abort conflicting transactions
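The MVCC scheme above can be sketched as an append-only version list per row, with reads filtered by timestamp. Integer timestamps and the omission of conflict detection are simplifications:

```python
# Toy MVCC store: writes never overwrite; reads pick the newest
# version at or before the reader's timestamp.
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (ts, value), append-only

    def put(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def get(self, key, read_ts):
        best = None
        for ts, value in self.versions.get(key, []):
            if ts <= read_ts and (best is None or ts > best[0]):
                best = (ts, value)
        return best[1] if best is not None else None
```

Because old versions are kept, a long-running read at an old timestamp sees a stable snapshot and never blocks newer writes.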
  • 57. Transactions in NoSQL Databases • Many NoSQL systems don’t offer transactions at all • This is one of the biggest tradeoffs that was made to improve scalability • Those that do typically limit transactions to a single key ○ Single-key transactions don’t require coordinating across shards ○ Can be implemented as just supporting a compare-and-swap operation
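The compare-and-swap primitive mentioned above is small enough to sketch directly. In a real store this check-and-set would execute atomically on the single shard owning the key, which is why it needs no cross-shard coordination:

```python
def compare_and_swap(store: dict, key, expected, new):
    """Set key to `new` only if it currently equals `expected`."""
    if store.get(key) == expected:
        store[key] = new
        return True
    return False  # someone else changed the value first; retry
```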
  • 58. Transactions in NewSQL Databases • Support traditional ACID semantics • Atomicity is bootstrapped off distributed consensus ○ A “Commit” write to consensus system marks the transaction as committed • Isolation can be handled similarly to single-node SQL databases ○ But the added latency and distributed state makes things tougher
  • 59. Distributed Transactions (CockroachDB) 1. Begin Txn 1; 2. Put “cherry” (a provisional “cherry” intent references txn 1; the Txn-1 record is PENDING)
  • 60. Distributed Transactions (CockroachDB) 3. Put “mango” (a second intent, possibly on another range, also references txn 1)
  • 61. Distributed Transactions (CockroachDB) 4. Commit Txn 1 (the Txn-1 record flips to COMMITTED; this single write atomically commits every intent)
  • 62. Distributed Transactions (CockroachDB) 5. Clean up intents (the provisional values are rewritten as ordinary committed values)
  • 63. Distributed Transactions (CockroachDB) 6. Remove Txn 1 (the transaction record is deleted once no intents reference it)
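Steps 1–6 above can be sketched as a toy write-intent store: intents are provisional values tagged with a transaction ID, and flipping the transaction record to COMMITTED is the single atomic commit point. This sketch ignores MVCC timestamps, reads, and the underlying consensus, all of which the real protocol involves:

```python
# Toy write-intent protocol: commit is one status flip, after which
# cleanup turns each intent into an ordinary committed value.
class TxnStore:
    def __init__(self):
        self.data = {}        # key -> committed value
        self.intents = {}     # key -> (txn_id, provisional value)
        self.txn_status = {}  # txn_id -> "PENDING"/"COMMITTED"/"ABORTED"

    def begin(self, txn_id):
        self.txn_status[txn_id] = "PENDING"

    def put(self, txn_id, key, value):
        self.intents[key] = (txn_id, value)

    def commit(self, txn_id):
        # The atomic commit point: every intent referencing this txn
        # becomes logically committed at once.
        self.txn_status[txn_id] = "COMMITTED"

    def cleanup(self, txn_id):
        # Resolve intents so readers no longer have to consult the
        # transaction record, then remove the record itself.
        for key, (owner, value) in list(self.intents.items()):
            if owner == txn_id:
                if self.txn_status[txn_id] == "COMMITTED":
                    self.data[key] = value
                del self.intents[key]
        del self.txn_status[txn_id]
```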
  • 64. Distributed Transactions ● That’s the happy case ● What about conflicting transactions?
  • 65. Distributed Transactions (write conflict) Txn 1: 1. Begin Txn 1; 2. Put “cherry” (an intent on “cherry” references txn 1; the Txn-1 record is PENDING)
  • 66. Distributed Transactions (write conflict) Txn 2: 1. Begin Txn 2; 2. Put “cherry” (the Txn-2 record is PENDING; txn 2 finds txn 1’s intent on “cherry”)
  • 67. Distributed Transactions (write conflict) Txn 2: 2a. Check the Txn-1 record (it is still PENDING)
  • 68. Distributed Transactions (write conflict) Txn 2: 2b. Push Txn 1 (the Txn-1 record is flipped to ABORTED)
  • 69. Distributed Transactions (write conflict) Txn 2: 2c. Update the intent on “cherry” to reference txn 2
  • 70. Distributed Transactions (write conflict) Txn 2: 3. Commit Txn 2 (the Txn-2 record flips to COMMITTED)
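The push in step 2b can be sketched as follows. This assumes the conflicting transaction is still PENDING; if its record were already COMMITTED, the intent would instead be resolved to a committed value before being replaced:

```python
def write_with_push(txn_status, intents, key, txn_id):
    """Write an intent on `key` for `txn_id`, pushing any pending owner.

    txn_status: dict of txn id -> "PENDING"/"COMMITTED"/"ABORTED".
    intents: dict of key -> owning txn id.
    """
    owner = intents.get(key)
    if owner is not None and owner != txn_id:
        # Abort ("push") the conflicting pending transaction so its
        # intent can safely be replaced with our own.
        if txn_status.get(owner) == "PENDING":
            txn_status[owner] = "ABORTED"
    intents[key] = txn_id
```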
  • 71. Distributed Transactions ● Transaction atomicity is bootstrapped on top of Raft atomicity ● This description omitted MVCC and other types of conflicts ○ More details on the Cockroach Labs blog1 or Spanner paper2 (1 https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/ 2 https://research.google.com/archive/spanner.html)
  • 73. Summary ● Databases are best understood in the context in which they were built ● How does a NewSQL database work in comparison to previous DBs? ○ Distribute data to support large data sets and fault-tolerance ○ Replicate consistently with consensus protocols like Raft ○ Distributed transactions using consensus as the foundation