If you are only familiar with relational databases, Cassandra can be confusing. It is designed to shard, and it guarantees consistency in an interesting (and frustrating) way.
2. € whoami
● Federico Razzoli
● Freelance consultant
● Writing SQL since MySQL 2.23
hello@federico-razzoli.com
● I love open source, sharing, collaboration, win-win, etc
● I love MariaDB, MySQL, Postgres, etc
○ Even Db2, somehow
3. What Cassandra is for
● Get data by a Partitioning Key, ordered by a Clustering Key (together they form the primary key)
SELECT * FROM conf
WHERE year = 2019
ORDER BY attendees
● Simple aggregations
SELECT COUNT(*) FROM conf WHERE year = 2019
● No JOINs
● No filtering by other columns (by default)
● Different queries on same data? Multiple tables with different primary keys
● Data is eventually mostly consistent
4. CREATE TABLE
-- a database by any other name would contain as sweet tables
CREATE KEYSPACE stats
    WITH REPLICATION = {
        'class': 'NetworkTopologyStrategy',
        -- london dc has 2 copies of each row on different nodes,
        -- paris dc has 3
        'datacenter_london': 2,
        'datacenter_paris': 3
    };
CREATE TABLE stats.conf (
    name TEXT,
    year SMALLINT,
    attendees SMALLINT,
    -- rows with same year are in the same nodes,
    -- ordered by attendees
    PRIMARY KEY (year, attendees)
) WITH CLUSTERING ORDER BY (attendees DESC);
5. Cassandra Cluster
● Clients can read from / write to any node
● The contacted node acts as a coordinator
● The coordinator computes a hash of the Partitioning Key, which determines which node(s) own the rows
● The coordinator sends data to, or requests data from, the closest owner(s)
● Replication is asynchronous
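The routing step on this slide can be sketched as a toy token ring. This is only an illustration under loud assumptions: MD5 stands in for Cassandra's default Murmur3 partitioner, the four-node ring and the names `token` and `owners` are hypothetical, and replicas are simply the next nodes clockwise, ignoring the datacenter-aware placement that NetworkTopologyStrategy performs.

```python
import hashlib
from bisect import bisect_left

def token(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 is a stand-in, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def owners(partition_key, ring, replication_factor):
    """Return the nodes owning a key: walk the ring clockwise from the key's token."""
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token; wrap around past the last node.
    start = bisect_left(tokens, token(partition_key)) % len(ring)
    result = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in result:
            result.append(node)
        if len(result) == replication_factor:
            break
    return result

# Hypothetical 4-node ring: each node owns the token range ending at its token.
RING_SIZE = 2 ** 64
ring = sorted((i * RING_SIZE // 4, f"node{i}") for i in range(4))

# Any coordinator computes the same owners for year = 2019.
replicas = owners("2019", ring, replication_factor=2)
print(replicas)
```

Because the hash is deterministic, every node reaches the same answer, which is why a client can contact any of them.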
6. Consistency levels
● Consistency vs Speed
● Read / Write
● Default consistency level is ONE:
○ You write to a node and don’t wait for the change to be replicated
○ You read from one node, possibly getting stale data
● TWO, THREE, QUORUM
○ Writes and reads must be acknowledged by 2, 3, or a majority of the replicas
● With multiple datacenters, any of these levels can be very expensive
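As a rough sketch of the arithmetic behind these levels, the snippet below counts how many replicas must answer for the example keyspace (2 copies in London, 3 in Paris). The helper names `quorum` and `overlaps` are hypothetical. It also checks the usual rule of thumb that a read is guaranteed to see the latest acknowledged write only when read replicas plus write replicas exceed the total number of copies.

```python
def quorum(replication_factor: int) -> int:
    """A majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

# Replication settings from the stats keyspace.
replication = {"datacenter_london": 2, "datacenter_paris": 3}
total_rf = sum(replication.values())  # 5 copies of each row overall

# Number of replicas that must answer at each level.
REQUIRED = {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": quorum(total_rf)}

def overlaps(read_level: str, write_level: str) -> bool:
    """True if any set of read replicas must share a replica with any set of written ones."""
    return REQUIRED[read_level] + REQUIRED[write_level] > total_rf

for level, acks in REQUIRED.items():
    print(f"{level}: {acks} of {total_rf} replicas")

print(overlaps("ONE", "ONE"))        # False: a read may hit a replica the write has not reached
print(overlaps("QUORUM", "QUORUM"))  # True: 3 + 3 > 5
```

With 5 copies spread over two datacenters, a QUORUM of 3 always involves at least one replica outside London, which is where the cross-datacenter cost comes from.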
7. Consistency levels: local and paranoid
● Faster:
○ LOCAL_ONE
○ LOCAL_QUORUM
○ If the connection between datacenters breaks, data in the other datacenters will be stale
● More paranoid:
○ EACH_QUORUM
○ ALL
○ Avoid inconsistencies in any dc
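To compare these levels, the sketch below lists how many replicas in each datacenter must answer, again using the example keyspace. The function `required_acks` and the choice of London as the coordinator's datacenter are assumptions made for illustration.

```python
def quorum(replication_factor: int) -> int:
    """A majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def required_acks(level, replication, local_dc):
    """Map each datacenter to the number of its replicas that must answer."""
    if level == "LOCAL_ONE":
        return {local_dc: 1}
    if level == "LOCAL_QUORUM":
        return {local_dc: quorum(replication[local_dc])}
    if level == "EACH_QUORUM":
        return {dc: quorum(rf) for dc, rf in replication.items()}
    if level == "ALL":
        return dict(replication)
    raise ValueError(f"level not covered by this sketch: {level}")

replication = {"datacenter_london": 2, "datacenter_paris": 3}

for level in ("LOCAL_ONE", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"):
    print(level, required_acks(level, replication, local_dc="datacenter_london"))
```

The local levels never wait for Paris when the coordinator is in London; EACH_QUORUM and ALL always do.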
8. Cost of strictness
● Stricter consistency levels mean:
○ More communications between nodes, higher latency
○ Queries may fail if nodes crash or connections are lost
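The failure side of this trade-off can be sketched as a simple check: a query succeeds only if enough replicas are reachable in every datacenter the level involves. The per-datacenter requirements below are worked out by hand for the example keyspace with a coordinator in London, and `can_satisfy` is a hypothetical helper, not a driver API.

```python
def can_satisfy(required, live):
    """True if every involved datacenter has at least the required live replicas."""
    return all(live.get(dc, 0) >= needed for dc, needed in required.items())

# Replicas that must answer, per datacenter (2 copies in London, 3 in Paris).
local_quorum = {"datacenter_london": 2}
each_quorum = {"datacenter_london": 2, "datacenter_paris": 2}
all_replicas = {"datacenter_london": 2, "datacenter_paris": 3}

# Scenario 1: the link to Paris is down, London is healthy.
link_down = {"datacenter_london": 2, "datacenter_paris": 0}
print(can_satisfy(local_quorum, link_down))  # True
print(can_satisfy(each_quorum, link_down))   # False

# Scenario 2: a single Paris node has crashed.
one_crash = {"datacenter_london": 2, "datacenter_paris": 2}
print(can_satisfy(each_quorum, one_crash))   # True
print(can_satisfy(all_replicas, one_crash))  # False
```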
9. Interesting inconsistencies
● 2 UPDATEs on the same row/column?
○ The latest wins
○ Who decides which is latest? Sometimes, network/node latency :)
● Rows are DELETEd on one node while another node is down?
○ That node may not delete those rows
● You DELETE a row and immediately after I UPDATE it on the same node?
○ Row may not be DELETEd
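A minimal sketch of the "latest wins" rule, assuming each write carries a timestamp and a DELETE is stored as a marker (a tombstone) with its own timestamp. The `Cell` class and `merge` function are hypothetical names; the timestamps are invented, and tie-breaking between equal timestamps is deliberately not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None marks a tombstone, i.e. a recorded DELETE
    timestamp: int        # write timestamp in microseconds

def merge(a: Cell, b: Cell) -> Cell:
    """Last write wins: keep the cell with the higher timestamp.

    On equal timestamps this sketch simply keeps `a`.
    """
    return b if b.timestamp > a.timestamp else a

# Two UPDATEs: the second one arrives later, but its timestamp is lower
# (for example because of clock skew), so it loses.
first = Cell("120 attendees", timestamp=1_000_010)
second = Cell("150 attendees", timestamp=1_000_005)
print(merge(first, second).value)  # 120 attendees

# A DELETE followed by an UPDATE with a higher timestamp: the data survives.
tombstone = Cell(None, timestamp=2_000_000)
update = Cell("200 attendees", timestamp=2_000_001)
print(merge(tombstone, update).value)  # 200 attendees
```

The order in which a replica receives the two operations does not matter here, only their timestamps, which is how a value can end up deleted or not deleted against the intuition of whoever issued the last statement.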
One second left, no time to explain :D