1. A highly scalable, eventually consistent, distributed,
structured key-value store.
Wednesday, December 16, 2009
2. Why
• Scaling existing Relational
Databases is hard.
• Sharding is one solution, but
makes your RDBMS unusable.
• Operational Nightmare.
3. The Bigdata Age
• Scale horizontally, just add more
servers
• Cluster growth. Load balance
automatically
• Flexible schemas
• Key-Oriented Queries
• High Availability, 24 x 7 x 365
4. Cassandra Design
• High availability.
• Eventual consistency.
• Incremental scalability.
• Optimistic Replication.
• Low total cost of ownership.
• Tunable tradeoffs between consistency &
latency.
• Minimal administration.
5. General Data Models
Key-Value, Key-Columns, Column Families
[Figure: row "com.cnn.www" of a table that stores Web pages. The "contents:" column holds three versions of "<html>..." (timestamps t3, t5, t6); the column "anchor:cnnsi.com" holds "CNN" (t9) and the column "anchor:my.look.ca" holds "CNN.com" (t8).]
Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1.
Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation.
Source: Bigtable: A Distributed Storage System for Structured Data - Google Inc
6. General Data Models
Key-Value, Key-Columns, Column Families
[Figure: the same Bigtable excerpt (Figure 1 and the "Column Families" paragraph), with the term "Column Family" highlighted.]
A Column Family corresponds roughly to a “Locality Group” in Google’s Bigtable terminology (in Bigtable, a locality group bundles one or more column families).
Source: Bigtable: A Distributed Storage System for Structured Data - Google Inc
8. Row-based Storage
• Pros: Good locality of access (on disk and in cache) of different columns. Read/write of a single row is a single IO operation.
• Cons: If you want to scan only one column, you still read all data.
Source: Design Patterns for Distributed Non-Relational Databases - Todd Lipcon, Cloudera
9. Columnar Storage
• Pros: Data for a given column is stored sequentially. Scanning a single column (e.g. aggregate queries) is fast.
• Cons: Reading a single row may seek once per column.
Source: Design Patterns for Distributed Non-Relational Databases - Todd Lipcon, Cloudera
10. Columnar Storage with Locality Groups
(Columnar storage with Column Families)
• Columns are organized into families (“locality groups”).
• Benefits of row-based layout within a group.
• Benefits of column-based layout across groups: you don’t have to read groups you don’t care about.
Source: Design Patterns for Distributed Non-Relational Databases - Todd Lipcon, Cloudera
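The trade-off between the three layouts can be made concrete by counting how many cells a scan has to touch. The following is a minimal sketch, not taken from the slides: the table contents, the family names "profile" and "stats", and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical table: "name" and "email" form one family ("profile"),
# "clicks" forms another ("stats").
rows = [
    {"name": "ann", "email": "ann@example.org", "clicks": 10},
    {"name": "bob", "email": "bob@example.org", "clicks": 25},
    {"name": "cid", "email": "cid@example.org", "clicks": 7},
]
families = {"profile": ["name", "email"], "stats": ["clicks"]}


def row_layout(rows):
    """One block per row: every column of a row is stored together."""
    return [dict(r) for r in rows]


def columnar_layout(rows):
    """One block per column: values of a column are stored sequentially."""
    return {col: [r[col] for r in rows] for col in rows[0]}


def family_layout(rows, families):
    """One list of blocks per family: row-based layout inside each group."""
    return {
        fam: [{c: r[c] for c in cols} for r in rows]
        for fam, cols in families.items()
    }


def cells_read_to_sum_clicks(layout_name):
    """Count the cells that must be read to aggregate the 'clicks' column."""
    if layout_name == "row":
        return sum(len(block) for block in row_layout(rows))
    if layout_name == "columnar":
        return len(columnar_layout(rows)["clicks"])
    if layout_name == "family":
        stats_blocks = family_layout(rows, families)["stats"]
        return sum(len(block) for block in stats_blocks)
    raise ValueError(layout_name)
```

In this toy table the row layout reads every cell to aggregate one column, while the columnar and family layouts read only the cells of the column or group that is needed.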
11. Log Structured Merge Trees
Convert random writes to sequential writes.
• Writes go to a commit log and in-memory storage (Memtable)
• The Memtable is occasionally flushed to disk (SSTable)
• The SSTables are periodically compacted into one.
Source: The log-structured merge-tree (LSM-tree) - P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil
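The three steps above (write to commit log and Memtable, flush to an SSTable, compact SSTables) can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not Cassandra's implementation: the class name `TinyLSM`, the `memtable_limit` parameter and the use of Python lists in place of on-disk files are all hypothetical.

```python
class TinyLSM:
    """Toy log-structured merge store; lists stand in for files on disk."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log (would be a file on disk)
        self.memtable = {}     # in-memory storage
        self.sstables = []     # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Every write is a sequential append plus an in-memory update.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the Memtable out as one sorted, immutable SSTable.
        if not self.memtable:
            return
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []   # flushed data no longer needs the log

    def compact(self):
        # Merge all SSTables into one; tables are oldest first, so newer
        # values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []

    def read(self, key):
        # Check the Memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

Note that a read may have to look at several SSTables until compaction merges them, which is the cost paid for turning random writes into sequential ones.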
12. Write Operations
[Diagram: Write and Read paths. The Memtable sits in RAM; the Commit Log and three SSTables sit on DISK.]
13. Read Operations
[Diagram: Write and Read paths. The Memtable sits in RAM; the Commit Log and three SSTables sit on DISK.]
14. Read Operations
[Diagram: Write and Read paths. The Memtable sits in RAM; the Commit Log and three SSTables sit on DISK. A Bloom Filter is added to the read path.]
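The slide only names the Bloom Filter on the read path. A minimal sketch follows, assuming the filter is consulted to decide whether an SSTable might contain a key before that SSTable is read from disk. The class name, the bit-array size and the SHA-256 based hashing are illustrative assumptions and not Cassandra's actual implementation.

```python
import hashlib


class BloomFilter:
    """Set-membership summary: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

A reader that keeps one such filter per SSTable can skip every table whose filter answers False, and only pays a disk read when the answer is True.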
15. Flush Memtable
[Diagram: Write and Read paths. The Memtable sits in RAM; the Commit Log and three SSTables sit on DISK.]
16. Flush Memtable
[Diagram: RAM and DISK. SSTable 1, SSTable 2, SSTable 3 and SSTable 4 sit on DISK.]
18. Compaction
[Diagram: RAM and DISK. A single merged SSTable 1' sits on DISK.]
19. Write Operations
[Diagram: Write and Read paths. The Memtable sits in RAM; the Commit Log, the compacted SSTable 1' and further SSTables sit on DISK.]
20. WRITE PROPERTIES
• No locks in the critical path
• Sequential disk access
• Behaves like a write-back cache
• Append support without read ahead
• Atomicity guarantee for a key
• “Always Writable”: accepts writes during
failure scenarios
21. CAP Theorem
• CONSISTENCY:
How and whether a system is left in a consistent state after an
operation.
• AVAILABILITY:
The system is designed so that it is ensured to remain operational over
some period of time.
• PARTITION-TOLERANCE:
The ability of a system to continue to operate in the presence of
network partitions.
22. Eventual Consistency
• As t → ∞, readers will see writes.
• In a steady state, the system is
guaranteed to eventually return the
last written value.
• Examples: DNS or MySQL slave
replication.
27. Read Repair
[Diagram: the Client sends a Query to the Cassandra Cluster and receives a Result. The closest replica, Replica A, returns the Result; a Digest Query goes to Replica B and Replica C, and each returns a Digest Response.]
28. Read Repair
[Diagram: the Client sends a Query to the Cassandra Cluster and receives a Result. The closest replica, Replica A, returns the Result; a Digest Query goes to Replica B and Replica C, and each returns a Digest Response. Read repair if digests differ.]
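The flow in the two diagrams can be sketched as follows. This is a simplified model under stated assumptions: replicas are plain dictionaries mapping a key to a (value, timestamp) pair, the first replica in the list plays the role of the closest replica, the newest timestamp wins during repair, and the function names are hypothetical.

```python
import hashlib


def digest(cell):
    """Short fingerprint of a replica's cell, compared instead of the data."""
    return hashlib.sha256(repr(cell).encode()).hexdigest()


def read_with_repair(key, replicas):
    """Return the closest replica's cell and repair replicas that disagree.

    replicas: list of dicts mapping key -> (value, timestamp);
    replicas[0] is treated as the closest replica (an assumption).
    """
    result = replicas[0].get(key)           # full data from closest replica
    digests = {digest(r.get(key)) for r in replicas}
    if len(digests) > 1:                    # read repair if digests differ
        versions = [r[key] for r in replicas if key in r]
        newest = max(versions, key=lambda cell: cell[1])
        for r in replicas:
            r[key] = newest                 # push newest version everywhere
    return result
```

In this sketch the caller may still receive a stale value from the closest replica, but every replica holds the newest version afterwards, which matches the eventually consistent behaviour described earlier in the deck.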
29. Cluster Membership
• Gossip protocol is used for cluster membership.
• Super lightweight with mathematically provable
properties.
• State disseminated in O(log N) rounds where
N is the number of nodes in the cluster.
• Every T seconds each member increments its
heartbeat counter and selects one other
member to send its list to.
• A member that receives a list merges it with
its own list.
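The heartbeat-and-merge loop described above can be simulated in a few lines. This is a minimal sketch assuming synchronous rounds in place of the "every T seconds" timer and a uniformly random choice of peer; the class and function names, the `seed` parameter and the `max_rounds` safety cap are hypothetical.

```python
import random


class Member:
    def __init__(self, name):
        self.name = name
        self.heartbeats = {name: 0}   # member name -> highest heartbeat seen

    def tick(self, peers, rng):
        # Increment own heartbeat counter, then send the list to one peer.
        self.heartbeats[self.name] += 1
        if peers:
            rng.choice(peers).merge(self.heartbeats)

    def merge(self, incoming):
        # Keep the highest heartbeat known for every member.
        for name, beat in incoming.items():
            if beat > self.heartbeats.get(name, -1):
                self.heartbeats[name] = beat


def rounds_until_everyone_known(n, seed=0, max_rounds=1000):
    """Run gossip rounds until every member has heard of all n members."""
    rng = random.Random(seed)
    members = [Member(f"node{i}") for i in range(n)]
    rounds = 0
    while rounds < max_rounds and any(len(m.heartbeats) < n for m in members):
        for m in members:
            m.tick([p for p in members if p is not m], rng)
        rounds += 1
    return rounds
```

Calling `rounds_until_everyone_known` for growing values of `n` gives a feel for how slowly the number of rounds grows with cluster size.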
37. DATA MODEL: Columns
COLUMN FAMILY: Users
Each column is a (Name, Value, Timestamp) triple.

KEY      Columns (each value carries its own #timestamp)
pablete  name = Pablo, lastname = Delgado, likes = Sugar
antonio  name = Antonio, lastname = Garrote
mauro    name = Mauro, lastname = Pompilio, age = 25, language = es, updated_at = 2009/05/03
38. DATA MODEL: Columns
COLUMN FAMILY: Users
[Same Users table, annotated: the row keys are ordered ("ordered keys") and the column keys within each row are ordered ("ordered column keys").]
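The Users column family can be modelled as a nested mapping from row key to column name to a (value, timestamp) pair. The sketch below is an illustration only: the function names are hypothetical, plain string sorting stands in for the "ordered" annotations, and the rule that the highest timestamp wins on conflicting writes is an assumption that the slides do not state.

```python
import time

# row key -> {column name -> (value, timestamp)}
users = {}


def insert(cf, key, column, value, timestamp=None):
    """Store a column; a write with an older timestamp is ignored."""
    ts = time.time() if timestamp is None else timestamp
    current = cf.setdefault(key, {}).get(column)
    if current is None or ts >= current[1]:
        cf[key][column] = (value, ts)


def get_row(cf, key):
    """Return the (name, value, timestamp) triples of a row, ordered by name."""
    row = cf.get(key, {})
    return [(name, *row[name]) for name in sorted(row)]


def row_keys(cf):
    """Return the row keys in order."""
    return sorted(cf)
```

Rows do not need to share the same set of columns, which is how the three example rows can carry three, two and five columns respectively.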
39. DATA MODEL: SuperColumns
COLUMN FAMILY: Tags
[Diagram: row KEY pablete with the supercolumn names beach and mountain.]
40. DATA MODEL: SuperColumns
COLUMN FAMILY: Tags
[Diagram: row KEY pablete with the supercolumn names beach and mountain, annotated "ordered supercolumn keys" and "ordered keys".]
41. DATA MODEL: SuperColumns
COLUMN FAMILY: Tags
[Diagram: row KEY pablete with the supercolumns beach and mountain; each supercolumn holds columns made of a Name and a Title. Name/Title pairs shown: 9876 san-diego, 843 barcelona, 777 cadaques, 1234 barcelona, 654 niza, 555 sicilia, 78 trapani, 888 andorra.]
42. DATA MODEL: SuperColumns
COLUMN FAMILY: Tags
[Same Tags diagram, annotated: the row keys are ordered ("ordered keys"), the supercolumn keys are ordered ("ordered supercolumn keys"), and the column keys inside each supercolumn are ordered ("ordered column keys").]
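A supercolumn adds one more level of nesting: row key, then supercolumn name, then column name. The sketch below is a hypothetical illustration of that shape. Which columns belong to beach and which to mountain cannot be read off the slide, so the assignment used here is an assumption, as are the function names and the use of plain string ordering.

```python
# row key -> {supercolumn name -> {column name -> title}}
# The split of columns between "beach" and "mountain" is assumed.
tags = {
    "pablete": {
        "beach": {"9876": "san-diego", "843": "barcelona", "777": "cadaques"},
        "mountain": {"888": "andorra"},
    }
}


def super_column_names(cf, key):
    """Return the supercolumn names of a row, in order."""
    return sorted(cf.get(key, {}))


def get_super_column(cf, key, super_column):
    """Return the (name, title) pairs of one supercolumn, ordered by name."""
    columns = cf.get(key, {}).get(super_column, {})
    return [(name, columns[name]) for name in sorted(columns)]
```

Because the names here are strings, "9876" sorts after "843"; a store that compares names as numbers would order them differently, so the comparator is part of the data model.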