This document provides an overview of NoSQL databases, including why they were created, common characteristics, and classifications. It discusses key concepts like the CAP theorem, BASE vs ACID properties, and gives examples like Cassandra. Cassandra is a distributed, horizontally scalable database designed for high availability. It uses consistent hashing to distribute data and is very fast for writes. The document concludes with tradeoffs between SQL and NoSQL databases and when each may be preferable.
2. Overview
● Why NoSQL?
● Classification
● CAP Theorem
● BASE vs ACID
● Cassandra in Action
● Summary
4. Why NoSQL?
● original intention: modern web-scale DBs
○ amount of data drastically increased
○ data in the web is less structured
● higher requirements regarding performance
● some problems are easier to solve without the relational approach
● scaling out & running on commodity HW is much cheaper than scaling up
10. Key/Value Stores
● data model: collection of key/value pairs
● keys and values can be complex compounds
● based on Amazon’s Dynamo Paper
● designed to handle massive load
11. Key/Value Stores
● no complex query filters
● all joins must be in the code
● easy to distribute across cluster
● very predictable performance -> O(1)
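The two slides above can be illustrated with a minimal sketch that uses plain Python dictionaries as a stand-in for a key/value store. The key names (`user:1`, `city:7`) and the helper `get_user_with_city` are hypothetical and only serve to show that each access is a single hash lookup and that a "join" has to be written in application code.

```python
# Hypothetical in-memory stand-ins for two key/value collections.
users = {"user:1": {"name": "Ada", "city_id": "city:7"}}
cities = {"city:7": {"name": "Zurich"}}


def get_user_with_city(user_key):
    """Application-side 'join': two independent key lookups."""
    user = users[user_key]          # average-case O(1) hash lookup
    city = cities[user["city_id"]]  # second lookup replaces a SQL JOIN
    return {"name": user["name"], "city": city["name"]}
```

Because every access goes through exactly one key, the data can be split across machines by key, which is what makes this model easy to distribute.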
12. Wide Column Stores
● Tables are similar to RDBMS, but semi-structured
● based on Google’s BigTable
● Rows can have arbitrary columns
13. Wide Column Stores -> BigTable
● <RowKey, ColumnKey, Timestamp> triple as key for lookups, inserts, deletes
● ColumnKey uses syntax family:qualifier
● arbitrary columns on a row-by-row basis
● does not support a relational model
○ no table-wide integrity constraints
○ no multi-row transactions
source: http://research.google.com/archive/bigtable.html
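As a rough illustration of the `<RowKey, ColumnKey, Timestamp>` addressing described above, the following sketch stores cells in a dictionary keyed by that triple. The class `TinyTable` and its method names are assumptions made for this example; they are not BigTable's API, and a real implementation keeps rows sorted and distributed.

```python
class TinyTable:
    """Toy cell store keyed by (row_key, column_key, timestamp)."""

    def __init__(self):
        self._cells = {}

    def insert(self, row_key, column_key, timestamp, value):
        # column_key follows the family:qualifier syntax, e.g. "contents:html"
        self._cells[(row_key, column_key, timestamp)] = value

    def lookup(self, row_key, column_key, timestamp=None):
        if timestamp is not None:
            return self._cells.get((row_key, column_key, timestamp))
        # Without a timestamp, return the most recent version of the cell.
        versions = [
            (ts, value)
            for (row, col, ts), value in self._cells.items()
            if row == row_key and col == column_key
        ]
        return max(versions, key=lambda pair: pair[0])[1] if versions else None

    def delete(self, row_key, column_key, timestamp):
        self._cells.pop((row_key, column_key, timestamp), None)
```

Each row can carry a different set of column keys, which is the "arbitrary columns on a row-by-row basis" property; nothing in this structure enforces a table-wide constraint.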
14. Document Stores
● inspired by Lotus Notes
● central concept of a Document
● Documents encapsulate/encode data in some format/encoding
● Encodings:
○ XML, YAML, JSON, BSON, PDF
17. Graph Databases
● based on Graph Theory -> G = (V, E)
● designed for data that is well represented in a graph
○ social networks, public transport links, network topologies, road maps
● nodes, edges, properties are used to represent and store data
● graph relationships are queryable
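To make G = (V, E) concrete, here is a small sketch that represents a graph as a vertex set and an undirected edge set, and answers one relationship query (the number of hops between two vertices) with breadth-first search. The vertex names and the function `shortest_hops` are invented for this example; a graph database would expose such queries through its own query language.

```python
from collections import deque

# G = (V, E): a vertex set and a set of undirected edges.
vertices = {"A", "B", "C", "D", "E"}
edges = {("A", "B"), ("B", "C"), ("C", "D")}


def neighbours(vertex):
    """All vertices that share an edge with the given vertex."""
    linked = set()
    for left, right in edges:
        if left == vertex:
            linked.add(right)
        elif right == vertex:
            linked.add(left)
    return linked


def shortest_hops(start, goal):
    """Breadth-first search; returns the hop count or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, hops = queue.popleft()
        if vertex == goal:
            return hops
        for nxt in neighbours(vertex) - seen:
            seen.add(nxt)
            queue.append((nxt, hops + 1))
    return None
```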
23. ACID
● Atomicity
○ all-or-nothing approach
● Consistency
○ DB will be in a consistent state before & after a transaction
● Isolation
○ transaction will behave as if it’s the only operation being performed upon the DB
● Durability
○ once a transaction is committed, it is durably preserved
● CA-Systems are ACID-Systems
24. BASE
● an application that works basically all the time, does not have to be consistent all the time, but will be in some known state eventually
● Basically Available
○ achieved by using a highly distributed approach
● Soft State
○ state of the system is always “soft” due to eventual consistency
● Eventual Consistency (in German: schlussendliche Konsistenz)
○ at some point in the future, the data will be consistent
○ no guarantees are made about when this will occur
25. BASE vs ACID
source: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
27. Cassandra
● initially created by Facebook for Inbox Search
● distributed, horizontally scalable database
● high availability
● very flexible data model
○ data might be structured, semi-structured, unstructured
● commercial support through DataStax
28. Cassandra - Design
● all nodes are equally important
● no Single-Point-of-Failure
● no central controller
● no master/slave relationships
● every node knows how to route requests and where the data lives
source: http://cassandra.apache.org/
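The summary at the top of this document notes that Cassandra uses consistent hashing to distribute data, which is one way every node can work out where a key lives without a central controller. The sketch below is a simplified ring under stated assumptions: it hashes with MD5, uses one token per node, and picks replicas by walking clockwise. The class `Ring`, the function `token`, and the node names are hypothetical; Cassandra's own partitioners and virtual nodes are not modelled here.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string onto the ring (assumption: MD5 as the hash function)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring with one token per node."""

    def __init__(self, nodes):
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [tok for tok, _ in self._ring]

    def replicas(self, key, n=1):
        """Return the n nodes found clockwise from the key's position."""
        size = len(self._ring)
        start = bisect.bisect_right(self._tokens, token(key)) % size
        count = min(n, size)
        return [self._ring[(start + i) % size][1] for i in range(count)]
```

Since every node can build the same ring from the cluster membership, any node can compute the replica list for a key locally and forward the request.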
32. Writes are very fast
● All writes are sequential
● no reading & seeking before a write
● Each of the N nodes will perform the following upon receiving the RowMutation message:
○ Append write to the commit log
○ Update in-memory Memtable data structure
○ Write is done!
● If Memtable gets full, it’s flushed to disk (SSTable)
source: http://www.roman10.net/how-apache-cassandra-write-works/
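The write steps on this slide can be sketched as follows. This is a toy model, not Cassandra's storage engine: the class `Node`, the `memtable_limit` parameter, and the in-memory lists standing in for the commit log and SSTables are all assumptions made for illustration.

```python
class Node:
    """Toy model of the write path: commit log, Memtable, SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only, stands in for the on-disk log
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # flushed tables, oldest first, never modified
        self.memtable_limit = memtable_limit  # hypothetical "full" threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. sequential append
        self.memtable[key] = value            # 2. update the Memtable
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                     # Memtable is full

    def _flush(self):
        # Write the Memtable out as a sorted table and start a new one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            if key in sstable:
                return sstable[key]
        return None
```

Note that `write` never reads existing data before storing the new value, which mirrors the "no reading & seeking before a write" point above.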
33. Write Requests
● Client requests can go to any node in the cluster because all nodes are peers
● write consistency level is configurable
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsWrite.html
34. Write Requests
● Cassandra chooses one Coordinator per remote data center to handle requests to replicas
● coordinator only needs to forward the write request to one node in each remote data center
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsWrite.html
35. Read Requests
● Two different types of Read Requests
○ direct read request (RR)
○ background read repair request (RRR)
● number of replicas contacted by a RR is determined by Consistency Level
● RRR are sent to any additional nodes that did not get a direct RR
● RRR ensure consistency
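The interplay between direct reads and read repair can be sketched roughly as below. The model is heavily simplified and rests on assumptions: each replica is a dictionary mapping a key to a `(timestamp, value)` pair, the first `consistency_level` replicas receive the direct read, and the repair runs synchronously instead of in the background. The function name `read_with_repair` is hypothetical.

```python
def read_with_repair(replicas, key, consistency_level):
    """Answer from the direct reads, then bring stale replicas up to date."""
    missing = (0, None)  # placeholder for a replica that lacks the key

    # Direct read requests: only `consistency_level` replicas are asked,
    # and the newest version among their responses is returned.
    direct = replicas[:consistency_level]
    answer = max((r.get(key, missing) for r in direct), key=lambda cell: cell[0])

    # Read repair: compare all replicas and push the newest version
    # to every replica that holds an older one (or none at all).
    newest = max((r.get(key, missing) for r in replicas), key=lambda cell: cell[0])
    for replica in replicas:
        if replica.get(key, missing)[0] < newest[0]:
            replica[key] = newest

    return answer[1]
```

With a low consistency level the first read may still return a stale value, but after the repair a later read sees the newest one, which is the eventual consistency described on the BASE slide.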
39. CQL
● very similar to SQL
● does not support JOINS / Subqueries
● no referential integrity
● no cascading operations
We denormalize the data because joins are not performant in a distributed system
44. Cassandra vs MySQL (50GB)
● MySQL
○ writes avg: ~300ms
○ reads avg: ~350ms
● Cassandra
○ writes avg: ~0.12ms
○ reads avg: ~15ms
source: http://www.odbms.org/wp-content/uploads/2013/11/cassandra.pdf
46. Summary
● elastic scaling (scaling out instead of up)
● huge amounts of data can be handled while maintaining high throughput rates
● require fewer DBAs and management resources
○ automatic repairs/data distribution
○ simpler data models
● better economics
○ cost per GB is much lower than for RDBMS due to clusters of commodity HW
○ we handle more data with less money
● flexible data models
○ very relaxed or even non-existent data model restrictions
○ changes to data model are much cheaper
47. Summary
● might not be mature enough for enterprises
● compatibility issues regarding standards
○ each DB has its own API
○ not easy to switch to another NoSQL DB
● search support is not the same as in RDBMS
● easier to find experienced RDBMS experts than NoSQL experts
48. Which DB for which purpose?
● NoSQL is an alternative
○ addresses certain limitations of the relational DB world
● depends on characteristics of data
○ if data is well structured -> relational DB might be better
○ if data is very complex -> might be difficult to map it to the relational model
● depends on volatility of the data model
○ what if schema changes daily?
● relational DBs still have their pluses
○ relational model / transactions / query language
○ should be used when multi-row transactions and strict consistency are required