Understanding NoSQL Databases and Cassandra

Overview
● Why NoSQL?
● Classification
● CAP Theorem
● BASE vs ACID
● Cassandra in Action
● Summary

Why NoSQL?
● original intention: modern web-scale DBs
○ amount of data drastically increased
○ data in the web is less structured
● higher requirements regarding performance
● some problems are easier to solve without the relational approach
● scaling out on commodity hardware is usually much cheaper than scaling up

Typical Characteristics
● non-relational
● horizontally scalable
● flexible schema
● easy replication support
● simple API
● eventually consistent -> BASE principle

Classification
source: http://blog.octo.com/wp-content/uploads/2012/07/QuadrantNoSQL.png

Classification
source: http://www.sics.se/~amir/files/download/dic/NoSQL%20Databases.pdf

Key/Value Stores
● data model: collection of key/value pairs
● keys and values can be complex compounds
● many are based on Amazon’s Dynamo paper
● designed to handle massive load

Key/Value Stores
● no complex query filters
● all joins must be in the code
● easy to distribute across cluster
● very predictable performance -> O(1)
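A minimal sketch of the two points above, assuming a plain Python dict as a stand-in for the store: lookups by key take constant time on average, and any "join" has to be done by the application. The class name KeyValueStore, the key layout ("user:1", "order:10") and the helper orders_for_user are hypothetical and chosen only for illustration.

```python
# Hypothetical in-memory key/value store; a dict stands in for the cluster.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # Hash lookup: constant time on average, independent of store size.
        return self._data.get(key, default)


store = KeyValueStore()
store.put("user:1", {"name": "Alice", "order_ids": ["order:10", "order:11"]})
store.put("order:10", {"item": "book", "price": 12})
store.put("order:11", {"item": "pen", "price": 3})


def orders_for_user(store, user_key):
    """Application-side 'join': follow the order keys stored in the user value."""
    user = store.get(user_key)
    if user is None:
        return []
    return [store.get(order_key) for order_key in user["order_ids"]]


alice_orders = orders_for_user(store, "user:1")
total = sum(order["price"] for order in alice_orders)
```

The store itself never sees the relationship between users and orders; it only answers get and put by key, which is what makes it easy to distribute.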

Wide Column Stores
● Tables are similar to RDBMS, but semi-structured
● based on Google’s BigTable
● Rows can have arbitrary columns

Wide Column Stores -> BigTable
● <RowKey, ColumnKey, Timestamp> triple as key for lookups, inserts, deletes
● ColumnKey uses syntax family:qualifier
● arbitrary columns on a row-by-row basis
● does not support a relational model
○ no table-wide integrity constraints
○ no multi-row transactions
source: http://research.google.com/archive/bigtable.html
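The lookup key described above can be sketched as a nested map from row key to column key to timestamped versions. This is an illustrative toy, not BigTable's storage format; the class SparseTable and the sample row and column names are assumptions.

```python
from collections import defaultdict


class SparseTable:
    """Toy table addressed by <RowKey, ColumnKey, Timestamp>."""

    def __init__(self):
        # row_key -> column_key ("family:qualifier") -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column_key, timestamp, value):
        self._rows[row_key][column_key].append((timestamp, value))

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row_key, {}).get(column_key, [])
        candidates = [
            (ts, value)
            for ts, value in versions
            if timestamp is None or ts <= timestamp
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


table = SparseTable()
# Each row may carry its own set of columns.
table.put("com.example", "anchor:home", 1, "Home")
table.put("com.example", "anchor:home", 5, "Start")
table.put("com.other", "contents:html", 2, "<html>")

latest = table.get("com.example", "anchor:home")
older = table.get("com.example", "anchor:home", timestamp=3)
absent = table.get("com.other", "anchor:home")
```

Rows only store the columns they actually have, so "com.other" has no anchor:home column at all rather than a NULL.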

Document Stores
● inspired by Lotus Notes
● central concept of a Document
● Documents encapsulate/encode data in some format/encoding
● Encodings:
○ XML, YAML, JSON, BSON, PDF

Document Stores
source: http://www.mongodb.org/

Graph Databases
● based on Graph Theory -> G = (V, E)
● designed for data that is well represented in a graph
○ social networks, public transport links, network topologies, road maps
● nodes, edges, properties are used to represent and store data
● graph relationships are queryable
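As a rough illustration of G = (V, E) with queryable relationships, the sketch below stores vertices with properties and edges as adjacency sets, then answers a path query with a breadth-first search. The station names and the function shortest_path are made up for the example; a real graph database would expose its own query language instead.

```python
from collections import deque

# V: vertices with properties (hypothetical public transport stops)
vertices = {
    "A": {"name": "Central Station"},
    "B": {"name": "Market Square"},
    "C": {"name": "University"},
    "D": {"name": "Harbour"},
    "E": {"name": "Airport"},
}

# E: directed edges, stored as adjacency sets
edges = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D"},
    "D": {"E"},
    "E": set(),
}


def shortest_path(edges, start, goal):
    """Breadth-first search; returns a path with the fewest hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # sorted() only makes the traversal order deterministic
        for neighbour in sorted(edges.get(node, ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


route = shortest_path(edges, "A", "E")
route_names = [vertices[v]["name"] for v in route]
```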

Graph Databases
source: http://www.neo4j.org/

Graph Databases
source: http://en.wikipedia.org/wiki/Graph_database

CAP Theorem
source: http://blog.nahurst.com/visual-guide-to-nosql-systems

ACID
● Atomicity
○ all-or-nothing approach
● Consistency
○ DB will be in a consistent state before & after a transaction
● Isolation
○ transaction will behave as if it’s the only operation being performed upon the DB
● Durability
○ once a transaction is committed, it is durably preserved
● CA systems (in CAP terms) are typically ACID systems

BASE
● an application that works basically all the time, does not have to be consistent all the time, but will be in some known state eventually
● Basically Available
○ achieved by using a highly distributed approach
● Soft State
○ state of the system is always “soft” due to eventual consistency
● Eventual Consistency
○ at some point in the future, the data will be consistent
○ no guarantees are made about when this will occur

BASE vs ACID
source: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

Cassandra
● initially created by Facebook for Inbox Search
● distributed, horizontally scalable database
● high availability
● very flexible data model
○ data might be structured, semi-structured, unstructured
● commercial support through DataStax

Cassandra - Design
● all nodes are equally important
● no Single-Point-of-Failure
● no central controller
● no master/slave relationships
● every node knows how to route requests and where the data lives
source: http://cassandra.apache.org/

Scales Linearly
source: http://www.datastax.com

Uses Consistent Hashing
● Murmur3Partitioner generates the hash (token) for each partition key
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeHashing_c.html

Uses Consistent Hashing
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeHashing_c.html
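A minimal sketch of a consistent-hashing ring, assuming one token per node and using MD5 from the standard library as a stand-in for Murmur3 (Python has no built-in Murmur3). The class HashRing and the node names are hypothetical. Each key is owned by the first node whose token is greater than or equal to the key's token, wrapping around at the end of the ring.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hashing ring with a single token per node."""

    def __init__(self, nodes):
        self._ring = sorted((self._token(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(key):
        # Assumption: MD5 replaces Murmur3 here; only the ring logic matters.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # First node token >= key token; modulo wraps past the last node.
        index = bisect.bisect_left(self._tokens, self._token(key))
        return self._ring[index % len(self._ring)][1]


keys = [f"row-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changed after node-d joined the ring.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

The point of the ring is visible in moved: every key that changes owner ends up on the new node, and all other keys stay where they were, so adding a node does not reshuffle the whole cluster.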

Writes are very fast
● All writes are sequential
● no reading & seeking before a write
● Each of the N nodes performs the following upon receiving the RowMutation message:
○ Append the write to the commit log
○ Update the in-memory Memtable data structure
○ Write is done!
● If the Memtable gets full, it is flushed to disk (SSTable)
source: http://www.roman10.net/how-apache-cassandra-write-works/
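The write path above can be sketched in a few lines. This is a simplified single-node model, not Cassandra's implementation: the class StorageNode, the memtable_limit parameter and the use of Python lists in place of files on disk are all assumptions made for the example.

```python
class StorageNode:
    """Toy model of the write path: commit log, Memtable, SSTables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only, stands in for the on-disk log
        self.memtable = {}     # in-memory, holds the latest value per key
        self.sstables = []     # immutable, sorted snapshots ("on disk")
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Sequential append; nothing is read or looked up first.
        self.commit_log.append((key, value))
        # 2. Update the in-memory structure. The write is done at this point.
        self.memtable[key] = value
        # 3. A full Memtable is flushed to a new SSTable.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Simplification: flushed data no longer needs the log for recovery.
        self.commit_log.clear()

    def read(self, key):
        # Newest data first: Memtable, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


node = StorageNode(memtable_limit=2)
node.write("k1", "a")
node.write("k2", "b")   # Memtable reaches its limit and is flushed
node.write("k1", "c")   # newer value lives in the Memtable only
```

Note that the overwrite of k1 never touches the existing SSTable; the older value simply becomes shadowed by the newer one, which is why a write needs no seek.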

Write Requests
● Client requests can go to any node in the cluster because all nodes are peers
● write consistency level is configurable
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsWrite.html

Write Requests
● Cassandra chooses one Coordinator per remote data center to handle requests to replicas
● the coordinator only needs to forward the write request (WR) to one node in each remote data center
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsWrite.html

Read Requests
● Two different types of Read Requests
○ direct read request (RR)
○ background read repair request (RRR)
● number of replicas contacted by a RR is determined by Consistency Level
● RRR are sent to any additional nodes that did not get a direct RR
● RRR ensure consistency
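A minimal sketch of the two request types, assuming each replica is a dict that maps a key to a (timestamp, value) pair and that the newest timestamp wins. The function read_with_repair is hypothetical, and the background repair is run inline and compares all replicas, which is simpler than what a real cluster does.

```python
def read_with_repair(replicas, key, consistency_level):
    """Answer from the directly read replicas, then repair stale replicas."""
    # Direct read requests: only `consistency_level` replicas are contacted.
    direct = [r[key] for r in replicas[:consistency_level] if key in r]
    result = max(direct, key=lambda cell: cell[0])[1] if direct else None

    # Read repair (inline here for simplicity): find the newest version
    # across all replicas and push it to every replica that is out of date.
    versions = [r[key] for r in replicas if key in r]
    if versions:
        latest = max(versions, key=lambda cell: cell[0])
        for replica in replicas:
            if replica.get(key) != latest:
                replica[key] = latest
    return result


replicas = [
    {"user:1": (1, "old@example.com")},   # stale
    {"user:1": (2, "new@example.com")},   # received the latest write
    {"user:1": (1, "old@example.com")},   # stale
]

first = read_with_repair(replicas, "user:1", consistency_level=1)
second = read_with_repair(replicas, "user:1", consistency_level=1)
```

With a consistency level of one, the first read may still return the stale value, but the repair it triggers brings all replicas in line, so the second read sees the new value.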

Read Requests
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsRead_c.html

Read Requests
● 2 of the 3 replicas for the given row must respond to fulfill the read request
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsRead_c.html
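The "2 of 3" figure corresponds to a quorum: more than half of the replicas. A small sketch of the arithmetic, with the helper names quorum and overlaps chosen for this example only:

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def overlaps(read_replicas, write_replicas, replication_factor):
    """True if every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)                         # 2 of the 3 replicas
quorum_both = overlaps(q, q, rf)       # quorum reads + quorum writes
one_each = overlaps(1, 1, rf)          # a single replica for reads and writes
```

When reads and writes both use a quorum, at least one replica that answers a read has also seen the latest write, so the read returns current data. With a single replica on each side that overlap is not guaranteed, and consistency is only eventual.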

Read Requests
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsRead_c.html

CQL
● very similar to SQL
● does not support JOINs or subqueries
● no referential integrity
● no cascading operations
● We denormalize the data because joins are not performant in a distributed system
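Denormalization can be sketched as keeping one table per query and writing every row to each of them. The example below uses Python dicts in place of Cassandra tables; the table names videos_by_id and videos_by_user and the function add_video are hypothetical.

```python
# One "table" per query pattern, each keyed by what that query looks up.
videos_by_id = {}
videos_by_user = {}


def add_video(video_id, user, title):
    """Write the same row to every table that has to answer a query about it."""
    row = {"video_id": video_id, "user": user, "title": title}
    videos_by_id[video_id] = row
    videos_by_user.setdefault(user, []).append(row)


add_video("v1", "alice", "Intro to NoSQL")
add_video("v2", "alice", "Cassandra Basics")
add_video("v3", "bob", "Graph Databases")

# Both reads are single key lookups; no join between tables is needed.
alice_titles = [row["title"] for row in videos_by_user["alice"]]
v3_owner = videos_by_id["v3"]["user"]
```

The cost is duplicated data and extra work on every write, which fits a system where writes are cheap and cross-node joins are expensive.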

CQL - Collections
● CQL introduced collection types for columns
○ list
○ map
○ set
● Add new collections to the previous example

Cassandra vs MySQL (50GB)
● MySQL
○ writes avg: ~300ms
○ reads avg: ~350ms
● Cassandra
○ writes avg: ~0.12ms
○ reads avg: ~15ms
source: http://www.odbms.org/wp-content/uploads/2013/11/cassandra.pdf

Summary
● elastic scaling (scaling out instead of up)
● huge amounts of data can be handled while maintaining high throughput rates
● requires fewer DBAs and management resources
○ automatic repairs/data distribution
○ simpler data models
● better economics
○ cost per GB is much lower than for RDBMS due to clusters of commodity hardware
○ more data can be handled for less money
● flexible data models
○ very relaxed or even non-existent data model restrictions
○ changes to data model are much cheaper

Summary
● might not be mature enough for enterprises
● compatibility issues regarding standards
○ each DB has its own API
○ not easy to switch to another NoSQL DB
● search support is not the same as in RDBMS
● easier to find experienced RDBMS experts than NoSQL experts

Which DB for which purpose?
● NoSQL is an alternative
○ addresses certain limitations of the relational DB world
● depends on characteristics of data
○ if data is well structured -> relational DB might be better
○ if data is very complex -> might be difficult to map it to the relational model
● depends on volatility of the data model
○ what if schema changes daily?
● relational DBs still have their strengths
○ relational model / transactions / query language
○ should be used when multi-row transactions and strict consistency are required
