1. Introduction to NoSQL and Apache Cassandra
Patricio Echagüe
patricioe@gmail.com
@patricioe
2. About me
Present:
Relateiq (Data Processing and Scalability)
Hector committer
Past:
DataStax (The Cassandra Company)
Cassandra/Hadoop distribution (former Brisk)
Cassandra FS
CQL connection pool
Cassandra contributions
6. What is “NoSQL”?
Systems able to store and retrieve large
quantities of data with little or no
information about the relationships
between the data items.
They generally lack a SQL-like
language for data manipulation, and
their schema is more relaxed than that of
a traditional RDBMS.
Full ACID is often not guaranteed.
7. Brewer's CAP theorem
Consistency: all replicas agree on the
same value
Availability: always get an answer from
a replica
Partition Tolerance: the system works
even if replicas can't talk
You can have 2 of these
11. What's eventual consistency?
It is a promise that eventually, in the
absence of new writes, all replicas that
are responsible for a data item will
agree on the same version
12. How eventual is eventual?
Write to 1 replica and Read from 1 replica of a total
of 3
13. How eventual is eventual?
Write to 2 replicas and Read from 2 replicas of a total
of 3
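The two cases above (1 of 3 versus 2 of 3) differ in whether the replicas contacted by a read must overlap the replicas contacted by the latest write. A minimal sketch of that arithmetic, assuming the usual rule that a read sees the latest write when R + W > N; the class and method names are hypothetical and only illustrate the rule:

```java
final class ReplicaOverlap {
    /**
     * Returns true when every set of r replicas read must share at least one
     * replica with every set of w replicas written, i.e. when r + w > n.
     */
    static boolean readOverlapsWrite(int n, int w, int r) {
        if (n < 1 || w < 1 || r < 1 || w > n || r > n) {
            throw new IllegalArgumentException("need 1 <= w, r <= n");
        }
        return r + w > n;
    }

    /** Smallest majority of n replicas. */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(readOverlapsWrite(3, 1, 1)); // false: a read may miss the write
        System.out.println(readOverlapsWrite(3, 2, 2)); // true: read and write sets overlap
    }
}
```

With 3 replicas, writing to 1 and reading from 1 can return a stale value until the replicas converge, while writing to 2 and reading from 2 always touches at least one up-to-date replica.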
14. Why is it good?
Because, by contacting fewer
replicas, read and write operations
complete more quickly, lowering
latency.
15. Cassandra is a distributed, fault-tolerant,
scalable, column-oriented data store
with tunable consistency.
21. Data Model
“4-Dimensional Hash Table”
A Keyspace contains a collection of Column Families
(Controls replication)
A Column Family contains Rows
A Row has a key, and each row has columns
(No need to define the columns beforehand)
Each column has a name and a value and a
timestamp
(TTL is optional)
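The “4-Dimensional Hash Table” picture can be sketched as nested maps: keyspace, column family, row key, column name. This is only an in-memory illustration of the shape of the model, not how Cassandra stores data; the class, field, and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class DataModelSketch {
    /** One column: a value plus a write timestamp; the TTL is optional (null = none). */
    static final class Column {
        final String value;
        final long timestamp;
        final Integer ttlSeconds;

        Column(String value, long timestamp, Integer ttlSeconds) {
            this.value = value;
            this.timestamp = timestamp;
            this.ttlSeconds = ttlSeconds;
        }
    }

    // keyspace -> column family -> row key -> column name -> column
    private final Map<String, Map<String, Map<String, SortedMap<String, Column>>>> keyspaces =
            new HashMap<>();

    /** Stores a column; rows and columns are created on first use, no schema needed. */
    void put(String keyspace, String columnFamily, String rowKey,
             String columnName, String value, long timestamp) {
        keyspaces.computeIfAbsent(keyspace, k -> new HashMap<>())
                 .computeIfAbsent(columnFamily, k -> new HashMap<>())
                 .computeIfAbsent(rowKey, k -> new TreeMap<>())
                 .put(columnName, new Column(value, timestamp, null));
    }

    /** Returns the column, or null if any of the four levels is missing. */
    Column get(String keyspace, String columnFamily, String rowKey, String columnName) {
        Map<String, Map<String, SortedMap<String, Column>>> families = keyspaces.get(keyspace);
        if (families == null) return null;
        Map<String, SortedMap<String, Column>> rows = families.get(columnFamily);
        if (rows == null) return null;
        SortedMap<String, Column> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(columnName);
    }

    public static void main(String[] args) {
        DataModelSketch model = new DataModelSketch();
        model.put("demo", "Users", "pato", "companyName", "Relateiq", 1L);
        System.out.println(model.get("demo", "Users", "pato", "companyName").value);
    }
}
```

The innermost map is sorted because columns within a row are kept in column-name order; two rows of the same column family are free to hold different columns.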
22. Data Model – (RDBMS)
Keyspace (Schema)
Column Family(CF) (table)
Row (row)
Column (column*) → may not be present in all
rows
23. Data Model – Column Family
Static Column Family
- Model my object data
Dynamic Column Family
- Precalculated / Prematerialized query results
Nothing stopping you from mixing them!
25. Data Model – Dynamic CF
stats for a specific date
26. Data Model – Dynamic CF
Timeline of tweets by a user
Timeline of tweets by all of the people a user is
following
List of comments sorted by score
List of friends grouped by state
Metrics for a time bucket
31. ...
Now we need to know which nodes the key was written
to so we can read it later
32. ...
The Initial Token specifies the upper value of the key
range each node is responsible for
Node #1: keys <= 'd'
Node #2: keys <= 'k' (e f g h i j k)
Node #3: keys <= 'p'
Node #4: keys <= 'u'
Node #5: keys <= 'z'
Key space: a b c d e f g h i j k l m n ... z
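The range lookup in this ring can be sketched with a sorted map keyed by each node's initial token: the owner of a key is the node with the smallest token greater than or equal to the key, wrapping around to the first node. This is a minimal sketch that compares raw keys as strings (no partitioner); the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class TokenRingSketch {
    // initial token (upper bound of the node's key range) -> node name
    private final TreeMap<String, String> ring = new TreeMap<>();

    void addNode(String initialToken, String node) {
        ring.put(initialToken, node);
    }

    /** Finds the node whose range contains the key, wrapping past the last token. */
    String ownerOf(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<String, String> entry = ring.ceilingEntry(key);
        if (entry == null) {
            entry = ring.firstEntry(); // key sorts after every token: wrap around
        }
        return entry.getValue();
    }

    /** Builds the five-node ring from the slide. */
    static TokenRingSketch slideRing() {
        TokenRingSketch r = new TokenRingSketch();
        r.addNode("d", "#1");
        r.addNode("k", "#2");
        r.addNode("p", "#3");
        r.addNode("u", "#4");
        r.addNode("z", "#5");
        return r;
    }

    public static void main(String[] args) {
        System.out.println(slideRing().ownerOf("foo")); // #2, since 'd' < "foo" <= 'k'
    }
}
```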
35. ...
Gossip is the protocol Cassandra uses to exchange
information with the other nodes in the cluster (a.k.a. the Ring)
For example, which nodes own the key “foo”
[Diagram: a client sends Read 'foo' to the ring of nodes #1 (<= 'd'), #2 (<= 'k'), #3 (<= 'p'), #4 (<= 'u') and #5 (<= 'z'); 'foo' falls in the range e to k owned by node #2]
37. ...
A Partitioner is used to transform the key.
“foo1” and “foo2” may end up on different nodes
The most commonly used is the Random Partitioner
“foo1” → md5(“foo1”) → “A99A0B....”
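The key-to-token step can be sketched with the JDK's MD5 implementation. This is an illustration of the idea only, not Cassandra's exact token computation: the unsigned interpretation of the digest and the equal-sized node ranges in `nodeIndexFor` are assumptions, and the class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5TokenSketch {
    /** Hashes the key with MD5 and reads the 16-byte digest as an unsigned integer. */
    static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /**
     * Maps a key to one of nodeCount equal slices of the 128-bit token space
     * (assumption: nodes own equal-sized ranges). Returns 0 .. nodeCount - 1.
     */
    static int nodeIndexFor(String key, int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        return token(key).multiply(BigInteger.valueOf(nodeCount)).shiftRight(128).intValue();
    }

    public static void main(String[] args) {
        for (String key : new String[] {"foo1", "foo2"}) {
            System.out.println(key + " -> " + token(key).toString(16)
                    + " -> node index " + nodeIndexFor(key, 5));
        }
    }
}
```

Because the hash scatters similar keys across the token space, neighbouring keys such as “foo1” and “foo2” are not expected to sit next to each other on the ring.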
38. ...
[Diagram: ring of nodes #1 to #5 under the Random Partitioner; 'foo1' is stored on node #1 and 'foo2' on a different node]
50. ...
But what if node #3 is down?
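[Diagram: the client writes 'foo1' through node #5 to the ring of nodes #1 to #5; nodes #1 and #2 store 'foo1', node #3 is down, and a hint is kept for it]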
51. ...
But what if node #3 is down?
The coordinator node stores a hint and replays
that mutation when the down node comes back up.
This is known as Hinted Handoff
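Hinted Handoff as described here can be sketched in a few lines: a write aimed at a down replica is kept as a hint and replayed when that replica returns. This is a single-process toy under simplifying assumptions (one coordinator, hints held in memory, no expiry); all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HintedHandoffSketch {
    private final Map<String, Map<String, String>> dataByNode = new HashMap<>();
    private final Set<String> downNodes = new HashSet<>();
    // target node -> mutations ({key, value}) waiting to be replayed
    private final Map<String, List<String[]>> hintsByNode = new HashMap<>();

    HintedHandoffSketch(String... nodes) {
        for (String node : nodes) {
            dataByNode.put(node, new HashMap<>());
        }
    }

    void markDown(String node) {
        downNodes.add(node);
    }

    /** Coordinator write: live replicas get the value, down replicas get a hint. */
    void write(List<String> replicas, String key, String value) {
        for (String node : replicas) {
            if (downNodes.contains(node)) {
                hintsByNode.computeIfAbsent(node, n -> new ArrayList<>())
                           .add(new String[] {key, value});
            } else {
                dataByNode.get(node).put(key, value);
            }
        }
    }

    /** The node is back: replay its pending hints, then discard them. */
    void markUp(String node) {
        downNodes.remove(node);
        List<String[]> pending = hintsByNode.remove(node);
        if (pending != null) {
            for (String[] mutation : pending) {
                dataByNode.get(node).put(mutation[0], mutation[1]);
            }
        }
    }

    String valueOn(String node, String key) {
        return dataByNode.get(node).get(key);
    }

    int pendingHints(String node) {
        List<String[]> pending = hintsByNode.get(node);
        return pending == null ? 0 : pending.size();
    }

    public static void main(String[] args) {
        HintedHandoffSketch cluster = new HintedHandoffSketch("#1", "#2", "#3");
        cluster.markDown("#3");
        cluster.write(List.of("#1", "#2", "#3"), "foo1", "value");
        System.out.println(cluster.pendingHints("#3")); // 1
        cluster.markUp("#3");
        System.out.println(cluster.valueOn("#3", "foo1")); // value
    }
}
```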
52. ...
Node #5 will replay the hint to node #3 when it comes
back online
[Diagram: node #5 replays the hinted 'foo1' mutation to node #3, which now holds 'foo1' like nodes #1 and #2]
53. ...
And if node #5 dies before sending the hints to node
#3?
[Diagram: node #5, which holds the hint, is lost; nodes #1 and #2 hold 'foo1' and node #3 does not]
54. ...
If using Quorum, node #4 will request 'foo' from all
the replicas
[Diagram: node #4 coordinates the read; nodes #1 and #2 hold 'foo1' and node #3 holds an empty value '']
55. ...
If the results received do not match, a Read Repair
process is performed in the background
[Diagram: same ring; nodes #1 and #2 hold 'foo1' and node #3 holds an empty value '']
56. ...
The missing or stale value is then pushed to
the out-of-date node, #3 in this case
[Diagram: the coordinator sees 'foo' != '' and pushes 'foo' to node #3]
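The repair step can be sketched as: compare the copies returned by the replicas, pick the one with the newest timestamp, and overwrite any replica that is missing it or holds an older one. This sketch runs the repair inline rather than in the background and uses the column timestamp to decide which copy wins; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ReadRepairSketch {
    /** A value together with the timestamp it was written with. */
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Takes each replica's copy of one column (null = replica has nothing),
     * returns the newest copy and pushes it to every replica that is behind.
     */
    static Versioned readAndRepair(Map<String, Versioned> copiesByNode) {
        Versioned newest = null;
        for (Versioned copy : copiesByNode.values()) {
            if (copy != null && (newest == null || copy.timestamp > newest.timestamp)) {
                newest = copy;
            }
        }
        if (newest != null) {
            for (Map.Entry<String, Versioned> entry : copiesByNode.entrySet()) {
                Versioned copy = entry.getValue();
                if (copy == null || copy.timestamp < newest.timestamp) {
                    entry.setValue(newest); // repair the out-of-date replica
                }
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        Map<String, Versioned> copies = new HashMap<>();
        copies.put("#1", new Versioned("foo", 10L));
        copies.put("#2", new Versioned("foo", 10L));
        copies.put("#3", null); // node #3 missed the write
        System.out.println(readAndRepair(copies).value); // foo
        System.out.println(copies.get("#3").value);      // foo
    }
}
```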
57. ...
The last feature for achieving consistency is the
Anti-Entropy Service (AES)
It should be run periodically as part of cluster
maintenance, or after a node has been down
78. Storage – Compaction
Size-tiered compaction
Performance is not guaranteed
A row may be spread across many SSTables
Wastes space
Good for write-heavy workloads
and for rows that are written once
Compaction can temporarily need up to 100% more space than the SSTables occupy
79. Storage – Compaction
Leveled compaction
SSTables are grouped into levels
No overlapping SSTables within a level
Each level is ten times as large as the previous one
90% of reads are satisfied with 1 SSTable
Twice as much I/O as size-tiered compaction
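The “ten times as large” rule means level capacity grows geometrically. A small sketch of that arithmetic, where the capacity of the first level is a hypothetical input (the slide does not give one) and the names are illustrative:

```java
final class LeveledSizesSketch {
    /**
     * Capacity of a level when each level holds ten times as much as the one
     * before it. Level 0 is the first level; units are whatever the base uses.
     */
    static long levelCapacity(long firstLevelCapacity, int level) {
        if (firstLevelCapacity <= 0 || level < 0) {
            throw new IllegalArgumentException("capacity must be > 0 and level >= 0");
        }
        long capacity = firstLevelCapacity;
        for (int i = 0; i < level; i++) {
            capacity = Math.multiplyExact(capacity, 10L); // fails loudly on overflow
        }
        return capacity;
    }

    public static void main(String[] args) {
        long base = 10L; // hypothetical first-level capacity
        for (int level = 0; level < 4; level++) {
            System.out.println("level " + level + ": " + levelCapacity(base, level));
        }
    }
}
```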
87. Using a Client
- Hector
http://hector-client.org
- Astyanax
https://github.com/Netflix/astyanax
- Pelops
https://github.com/s7/scale7-pelops
88. Using a Client → Hector
- Most popular Java client
- In use at very large installations
- A number of tools and utilities built on top
- Very active community
- MIT Licensed
89. Features
- High Level API
- Failover behavior
- High-performance connection pool
- JMX counters for management
- Discoverability of new nodes
- Automatic retry of downed hosts
- Suspension of nodes after several timeouts
- Load Balancing: Configurable and extensible
- Locking (Beta)
93. ColumnFamilyTemplate
Familiar, type-safe approach
- based on template-method design pattern
- generic: ColumnFamilyTemplate<K,N>
(K is the key type, N the column name type)
ColumnFamilyTemplate template =
new ThriftColumnFamilyTemplate(keyspaceName,
columnFamilyName,
StringSerializer.get(),
StringSerializer.get());
*** (no generics for clarity)
94. ColumnFamilyTemplate
new ThriftColumnFamilyTemplate(
keyspaceName,
columnFamilyName,
StringSerializer.get(), // Key Format
StringSerializer.get()); // Column Name Format
- Cassandra calls the column name format a “comparator”
- Remember: it defines column order in the on-disk format
96. ColumnFamilyTemplate
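Inserting data with ColumnFamilyUpdater
// Start an updater on the row with key "pato"
ColumnFamilyUpdater updater = template.createUpdater("pato");
updater.setString("companyName", "Relateiq");
// Switch the same updater to a second row key
updater.addKey("sabina");
updater.setString("companyName", "Globant");
// Send the accumulated changes
template.update(updater);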
97. ColumnFamilyTemplate
Deleting Data with ColumnFamilyTemplate
template.deleteColumn("zznate", "notNeededStuff");
template.deleteColumn("zznate", "somethingElse");
template.deleteColumn("patricioe", "aDifferentColumnName");
...
template.deleteRow("someuser");
template.executeBatch();
98. Integrating with existing patterns
Hector Object Mapper -> Apache Gora
https://github.com/hector-client/hector/tree/master/object-mapper
Hector JPA*:
https://github.com/riptano/hector-jpa
Spring IOC
CQL: JDBC Driver and Pool in 1.0!
JdbcTemplate FTW!
101. Summary
- Take advantage of strengths
- idempotence and asynchronicity are your friends
- If it's not in the API, you are probably doing it wrong
- Seek death is still possible if you model incorrectly
- Try Denormalizing (append-only model ?)
104. Additional Resources
DataStax Documentation: http://www.datastax.com/docs
Apache Cassandra project wiki: http://wiki.apache.org/cassandra/
“The Dynamo Paper”
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
P. Helland. Building on Quicksand
http://arxiv.org/pdf/0909.1788
P. Helland. Life Beyond Distributed Transactions
http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
S. Anand. “Netflix's Transition to High-Availability Storage Systems”
http://media.amazonwebservices.com/Netflix_Transition_to_a_Key_v3.pdf
“The Megastore Paper”
http://research.google.com/pubs/archive/36971.pdf