Developer Data Modeling Mistakes: From Postgres to NoSQL
Introduction to Cassandra
1. Cassandra History
● Created at Facebook
● Open-sourced since 2008
● Current version = 3.3
● Column-oriented ☞ distributed table
NoSQL database Cluster Layer
● Amazon DynamoDB paper
● Masterless architecture
Data-store layer
● Google Big Table paper
● Columns/columns family
2. But what is Cassandra ?
● Fast Distributed Database/Table
● High Availability
● Linear Scalability
● Predictable Performance
● No SPOF
● Multi-DC
● Commodity Hardware
● Easy to manage operationally
3. How is Cassandra different ?
● No master / slave / replica sets
● No config servers, zookeeper
● Data is partitioned around ring
● Data is replicated to RF=N servers
● All nodes hold data and can answer
queries (both reads & writes)
● Location of data on ring is determined by
partition key
4. ● Cassandra chooses Availability & Partition
Tolerance over Consistency
● Queries have tunable consistency level
● ALL, QUORUM, ONE
● Hinted Handoff to deal with failed nodes
CAP Theorem
6. Cassandra Writes
● Writes are written to any node in the cluster
(coordinator)
● Writes are written to commit log, then to
memtable
● Every write includes a timestamp
● Memtable flushed to disk periodically (sstable)
● Deletes are actually a special write case, called
a “tombstone”
● Partition is spread across multiple SSTables
● Same column can be in multiple SSTables
● Merged through compaction, only latest
timestamp is kept
7. Cassandra Reads
● Any node may be queried, it acts as
coordinator
● On each node, data is pulled from
Memtable + SSTables and merged
● Each SSTable has a BloomFilter
8. Cassandra is not...
● An In-Memory database
● A Key-Value storage (it has schema)
● A magical unicorn database that farts rainbow
10. CQL vs SQL
● no Sequences
○ Natural Keys - email etc…
○ Surrogate Keys - UUID
● no Order By - during select
11. Clustering Key
// Comments for a given video
CREATE TABLE comments_by_video (
videoid uuid,
comment_time timestamp,
userid uuid,
comment text,
PRIMARY KEY (videoid, comment_time)
) WITH CLUSTERING ORDER BY (comment_time DESC);
SELECT * FROM comments_by_video
videoid commendid userid comment
Video1 Dec 13, 11:00 user1 comment3
Video1 Dec 12, 15:00 user2 comment2
Video1 Dec 12, 12:00 user1 comment1
... ... ... ...
more comments
more columns
...Dec 13, 11:00
user1, comment3
Dec 12, 15:00
user2, comment2
Dec 12, 12:00
user1, comment3
Video1 Up to 2 Billion
New Data
12. Example: KillrVideo.com
CREATE TABLE user_videos (
userid uuid,
added_date timestamp,
videoid uuid,
name text,
priview_image_location text,
PRIMARY KEY (userid, added_date, video_id)
) WITH CLUSTERING ORDER BY (added_date DESC, video_id ASC);
// Index for tag keyword
CREATE TABLE videos_by_tag (
tag text,
videoid uuid,
added_date timestamp,
name text,
priview_image_location text,
tagged_date timestamp
PRIMARY KEY (tag, video_id)
);
// Index for tag keyword
CREATE TABLE tags_by_letter (
first_letter text,
tag text,
PRIMARY KEY (first_letter, tag)
);
13. Secondary Indexes
Country Continent
Germany Europe
China Asia
France Europe
NODE 1
Germany: Europe
NODE 2
China: Asia
NODE 2
France: Europe
SELECT * FROM … WHERE Continent = ‘Europe’
SELECT * FROM … WHERE Country = ‘France’
● Continent is Secondary
Index.