Apache Cassandra
1. Introduction to NoSQL systems, Extensible Record Stores and Amazon’s Dynamo
+ Google Bigtable
2. What Cassandra is and how it is compared with other similar systems
3. What applications are better supported - examples, case studies
4. Technical Description, architecture, internals
5. How is it used and installed, requirements and in what platforms does it run on
6. Demo
7. References
Contents
Background
NoSQL, Extensible Record Stores, Cassandra’s Parents
1.
NoSQL or Not-Only-SQL systems: Next Generation Databases. The movement started in 2009
with the goal of creating modern, web-scale databases. More than 225 NoSQL systems
currently exist.
In general, they share the following features:
• Schema-free databases
• Easy replication support
• Simple API
• Distributed
• Open Source
NoSQL Systems
• BASE (instead of ACID)
• Huge amount of data
• Horizontally scalable
• Motivated by Google’s Big Table.
• Basic Data Model: Rows and Columns
• Basic Scalability Model: Rows and Columns are split across nodes.
• Rows: split across nodes through sharding on the primary key.
• Columns: distributed over multiple nodes by using ‘column groups’.
• Other systems that use this technology: Hypertable, HBase.
Extensible Record Stores (or Wide Column Stores)
What is it?
A highly available and scalable storage system used by Amazon to store and retrieve user shopping
carts and to support other core services. It pioneered the idea of eventual consistency. Key-Value Store.
How does it work?
Allows read and write operations to continue even during network partitions and resolves update
conflicts using different conflict resolution mechanisms.
Sacrifices consistency for availability.
Allows customization to meet desired preference.
Consistent Hashing, Vector Clocks (not in Cassandra), Gossip Protocol, Hinted Handoff, Read Repair
Cassandra’s Parents - Amazon Dynamo
Cassandra’s Parents - Google Bigtable
What is it?
A high performance data storage system built on Google File System and other Google
technologies.
How does it work?
Provides both structure and data distribution but relies on a distributed file system for
durability.
Richer data model than Dynamo. One key, many values. Fast sequential access.
Columnar, SSTable Storage, Append-only, Memtable, Compaction
What features does Cassandra use from Google’s BigTable?
1. Column Families
2. Memtables
3. SSTables
What features does Cassandra use from Amazon Dynamo?
1. Consistent hashing
2. Partitioning
3. Replication
Cassandra’s Parents
Cassandra and Parents
Description and Comparisons
What Cassandra is and how it is compared with other similar systems
2.
Avinash Lakshman
• Inventor, Apache Cassandra
• Co-inventor, Amazon Dynamo
Prashant Malik
• Inventor, Apache Cassandra
• Technical Leader, Facebook
What is Cassandra?
Definition
• A distributed NoSQL database system for managing large
amounts of structured data across many commodity servers,
while providing highly available service and no single point of
failure.
Timeline with activities
• July 2008
Facebook released Cassandra as an open-source project
• March 2009
Cassandra became an Apache Incubator project
• 17th February 2010
Cassandra graduated to a top-level project
• 2012
University of Toronto researchers studying NoSQL systems concluded that “In
terms of scalability, there is a clear winner throughout our experiments”
• 2010-2015
New releases of Cassandra
Strengths
• Linear scale performance
The ability to add nodes without failures leads to predictable increases in
performance
• Supports multiple languages
Python, C#/.NET, C++, Ruby, Java, Go, and many more…
• Operational and developmental simplicity
There are no complex software tiers to be managed, so administration duties are
greatly simplified.
• Ability to deploy across data centres
Cassandra can be deployed across multiple, geographically dispersed data centres
• Cloud availability
Installations in cloud environments
• Peer to peer architecture
Cassandra follows a peer-to-peer architecture, instead of master-slave architecture
• Flexible data model
Supports modern data types with fast writes and reads
• Fault tolerance
Nodes that fail can easily be restored or replaced
• High Performance
Cassandra has demonstrated brilliant performance under large sets of data
Strengths (1)
• ColumnFamily Store
Cassandra stores columns based on the column names, leading to very quick slicing
• Tunable consistency
Support for strong or eventual data consistency across a widely distributed cluster
• Schema-free/Schema-less
In Cassandra, columns can be created at your will within the rows. Cassandra data
model is also famously known as a schema-optional data model
• AP-CAP
Cassandra is typically classified as an AP system, meaning that availability and
partition tolerance are generally considered to be more important than consistency in
Cassandra
Strengths (2)
CAP and Cassandra
Variable number of columns per row
Weaknesses
Use cases where it is better to avoid Cassandra
• If there are too many joins required to retrieve the data
• To store configuration data
• During compaction, things slow down and throughput degrades
• Basic things like aggregation operators are not supported
• Range queries on partition key are not supported
• If there are transactional data which require 100% consistency
• Cassandra can update and delete data, but it is not designed for update- or delete-heavy workloads
Business Insider
“The basic problem Cassandra solved is that when you have a lot of
data sitting on a lot of servers, as Facebook does, you end up with a
house of cards. A single server going down can collapse the whole
stack.”
Cassandra compared to other
NoSQL Systems
Read & Write latency for workload Read/Write
Throughput for workload Read/Write &
Read/Scan/Write
Insert-mostly Workload
Mixed Operational & Analytical Workload
Read-Modify-Write Workload
Balanced Read/Write Mix
Read-mostly Workload
Load Process
VLDB Benchmark (RWS)
Differences between Cassandra and RDBMS
RDBMS | Cassandra
relational database | keyspace
B-trees | log-structured merge-trees
rows which do not include a particular column value → NULL (in that position) | for each row, only the columns with a value are stored
supports ACID transactions | only supports AID
Supported Applications -
Customers - Case Studies
3.
What kind of applications are supported by Cassandra
More than 80% of customers fit into one of the following categories:
I. Product Catalog/Playlist
II. Recommendation/Personalization Engine
III. Sensor Data/Internet of Things
IV. Messaging (and generally time-series data)
V. Fraud Detection
In other words, applications that need to...
• store and handle time-series data (most common use case)
• store and handle large volumes of data
• scale predictably
• be continuously available
• protect their data
Datastax
• A software company that develops and provides support for a commercial
edition of Cassandra.
• Massively scalable NoSQL platform able to run online applications for
innovative and data-intensive companies (e.g. Netflix).
• Faster to deploy and less expensive to maintain than other database
platforms.
• Powered by Cassandra and contains only selected releases of it, chosen
by its expert staff.
Datastax (1)
• Supports businesses that need a progressive data management.
• Can serve as a real-time datastore for online production.
• Delivers a unique, smart data platform, suitable for the cloud.
Customers
• Over 3,000 companies around the world use (or have used)
Apache Cassandra in production.
• Most famous:
Cassandra Summit
• Organized by DataStax for 7 consecutive years (in both US and Europe).
• New product releases are announced.
• Customers describe their usage of Cassandra
Key Terms
• Cluster
• Distributed Location
• Node
CASE STUDIES
Category: Messaging
Facebook Inbox Search - Requirements
“The system was required to handle a very high write throughput,
billions of writes per day, and also scale with the number of users”
“Since users are served from data centres that are geographically
distributed, being able to replicate data across data centres was key
to keep search latencies down”
• Lakshman, Malik
Facebook Inbox Search
The reason why Cassandra was initially built.
Facebook maintains a per user index of all messages that have
been exchanged between the senders and the recipients of the
message.
Two kinds of search features were enabled in 2008:
I. term search
II. interactions - given a person’s name, returns all the messages
that have been sent to or received from that person
Facebook Inbox Search (1)
How did they do that?
The schema consists of two column families. Exploits the “time sorting”
feature of Cassandra.
For the term search:
• UserID → key
• Words that make up the message → super columns
• Columns within the super column → individual message identifiers
(MessageID) of the messages that contain the word.
Facebook Inbox Search (2)
For the interactions:
• UserID → key
• RecipientsID’s → super columns
• Columns within the super columns → MessageID’s
• Cassandra provides certain hooks for intelligent caching of data
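The term-search layout above can be pictured as nested maps. The following is only a sketch in plain Python, not Cassandra code: the names `term_index`, `index_message` and `term_search` are hypothetical, and splitting messages on whitespace is an assumption made for illustration.

```python
from collections import defaultdict

# user id (row key) -> word (super column) -> {message id (column): timestamp}
term_index = defaultdict(lambda: defaultdict(dict))


def index_message(user_id, message_id, text, timestamp):
    """Add one message to the user's per-word index."""
    for word in set(text.lower().split()):
        term_index[user_id][word][message_id] = timestamp


def term_search(user_id, word):
    """Return the ids of the user's messages containing the word, newest first."""
    hits = term_index.get(user_id, {}).get(word.lower(), {})
    return sorted(hits, key=hits.get, reverse=True)


index_message("u1", "m1", "Lunch tomorrow", timestamp=10)
index_message("u1", "m2", "lunch today", timestamp=20)
```

Sorting the hits by timestamp mirrors the "time sorting" feature that the Inbox Search schema relies on. The interactions index would follow the same shape, with recipient IDs in place of words.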
Inbox Search Schema
Facebook Inbox Search (3)
In 2008:
• system was storing 50+ TB of data
• on a 150 node cluster
• spread out between east and west coast data centres
Performance:
Facebook abandoned Cassandra for the Inbox in late 2010
Cassandra has been deployed as the backend storage system for
multiple services within Facebook.
Categories: Fraud Detection and Time-series data
Instagram Fraud Detection
Instagram initially used Redis to store audit information for security
and site integrity purposes (e.g. fighting spam, finding abusive
users).
But…
• data size was growing too quickly
• high write and low read rate
• keeping the data in memory was too costly
So… Cassandra
Instagram Fraud Detection (1)
• Started with 3 nodes and very soon they had grown to a 12 node
cluster.
• No need to store very large instances in memory → put everything on
disks.
“Implementing Cassandra cut our costs to the point where we were paying
around a quarter of what we were paying before. Not only that, but it also
freed us to just throw data at the cluster because it was much more scalable
and we could add nodes whenever needed.”
- Rick Branson, Software Engineer at Instagram
Instagram “Inbox”
Newsfeed or inbox part of Instagram: a feed of all the activity that would be
associated with a given user’s account.
Previously in Redis, with the same (memory) limitations as in the Fraud Detection
case.
Instagram’s Cassandra Cluster:
• 12 nodes on EC2 (AWS)
• 1.2 TB of data stored
• 20,000 writes/sec.
• 15,000 reads/sec.
Category: Sensors and IoT
i2O Water
Description: i2O Water helps utility companies operate more
efficiently through the use of IoT aiming at solving the water crisis.
Challenges:
• Massive volumes of time-series data (>1.5 TB and growing)
• Need for search and analysis of high velocity, streaming data in
near real-time
• SQL Server (previously used) has limitations on scalability and
performance
i2O Water (1)
Solution: After evaluating many common NoSQL technologies, they chose Cassandra.
Why?
I. performance (50,000-60,000 writes/sec and 20,000-40,000 reads/sec, instead of 0.5 writes/sec and 5
reads/sec with SQL Server)
II. easy to maintain
III. easy to upgrade
IV. ability to handle structured and unstructured real-time streaming data
V. continuous availability and reliability
VI. operationally simple to manage
i2O Water (2)
Results:
I. 235 million litres of water saved per day
II. successfully handling massive volumes of data from 15,000
devices without latency or downtime
III. fault tolerance even during upgrades (99.9% availability)
Category: Product Catalogs and Playlists
Spotify
Description: Spotify delivers streaming music in real time to over 40
million active users (the number is growing), without interruption.
Challenges:
• PostgreSQL (previously used) and generally RDBMSs cannot
deliver 100% availability
• limited scalability across data centers
• difficult to analyze massive volumes of data
Spotify (1)
Solution: Cassandra.
Why?
I. high availability (due to masterless architecture)
II. stores data for the entire product catalog and key customer
experience capabilities
III. multi data centre application and no single point of failure
IV. integration with Apache Spark for real time processing and
analytics
Spotify (2)
Results:
I. 40,000 requests/sec handled successfully and on time
II. >500 nodes across 4,000 servers in 4 data centres
III. >1.5 billion playlists created by 40 million active users and managed in
real time
Spotify - Data Centres (2 in the US - 2 in Europe)
Category: Recommendation/Personalization Engine
Netflix
Description: Netflix is the world’s leading internet television network with
more than 48 million users in 40 countries.
Challenges:
• Oracle database (was used until 2010) was approaching its limits
on traffic and capacity
• single centre → single point of failure
• system downtime every two weeks for schema changes
• need for reliability and flexibility for international expansion
Netflix (1)
Solution: Cassandra (on the Cloud, AWS) was the clear winner of
the extensive evaluation of NoSQL DB options. (Later on, Netflix
migrated to DataStax Enterprise for security and production.)
Why?
I. persistent datastore, 100% uptime and cost-effective scalability
II. ability to create a cluster in any region in 10 minutes
III. expert support
Netflix (2)
Results:
I. throughput of >10 million transactions/sec
II. processes >2.1 billion reads and 4.3 billion writes per day
III. delivers >76,000 genre types and captures every detail of
customers’ habits for tailoring the customer experience
Category: Product Catalogs and Playlists
Coursera
Description: Coursera is an education platform which partners with
top universities and organizations worldwide, to offer courses online
for anyone to take, for free.
Challenges:
• MySQL (previously used for class interaction) was insufficient:
• unstable performance,
• unexpected downtime,
• limitation in introducing new features
Coursera (1)
Solution: After evaluating emerging database technologies, it chose
Cassandra (DataStax).
Why?
• 100% application uptime needed (customers from all over the
world)
• Scalability (enabling storage of growing user data)
Coursera (2)
Results:
I. 3 nodes on AWS in the US East region and plans to expand to multiple data
centers across different regions
II. 24x7 availability to the users
III. Helps innovation
IV. Reduced time to market on new features
“High availability with reliable performance is a big win for us. With Datastax Enterprise,
our customers around the world are able to take any course, anytime through our on-
demand model.”
• Daniel Chia, Software Engineer at Coursera
Coursera (3)
Coursera (4)
Coursera (5)
Coursera (6)
Coursera (7)
Category: Messaging
The Weather Channel
Description: The Weather Channel delivers breaking news to
countless viewers and users from web, desktop and mobile
applications.
Challenges:
• Customer experience in the center of attention (continuous
availability, global and diverse users)
• New capabilities including statistics from unstructured data,
CGS for customer engagement etc.
The Weather Channel (1)
Solution: Cassandra
Why?
I. linear scalability
II. 100% uptime
III. supports almost all possible types of content (e.g. observations,
forecasts, marine data, ads)
The Weather Channel (2)
Results:
I. billions of requests per month are processed, with no fear of downtime
II. node count grew from 3 to 36 on AWS in 1 year, across 3
data centers (US East, US West and Western Europe).
III. capability for new offerings (e.g. social weather)
The Weather Channel (3)
Technical Description,
Architecture, Internals
4.
Key Terms – Data Structures
• Commit Log
• Memtable
• Sorted String Table (SST)
• Bloom Filter
• Index File
Key Terms
• Gossip protocol: helps each node learn about the topology of the cluster
(communication and detection of faulty nodes).
• Snitch: indicates which node is closest to the current location.
Log-Structured Merge-Tree (LSM-Tree)
What is it?
• A disk-based data structure designed to provide low-cost indexing for a file
experiencing a high rate of record inserts.
• A simple LSM-Tree comprises two tree-like structures: C0 (in memory)
and C1 (on disk).
• Maintains key-value pairs.
• In Cassandra, each value represents a row.
Used in:
• BigTable, HBase, MongoDB, SQLite, RocksDB, InfluxDB
Data Model
• Each Row → Identified by a Unique Key (Primary Key)
• Keyspace → Outermost container for data (one or more column families)
• Column Family → Contains Supercolumns or Columns (but not both)
• Column → Basic data structures with: key, value, timestamp
• Supercolumn → Special column, stores a map of sub-columns. Columns that
you are likely to query together should be placed in the same column family.
• Columns could be of variable number per key. For instance, key K1 could
have 1024 columns/supercolumns while K2 could have 64
columns/supercolumns
Data Model (1)
• Partition key: The first column declared in the primary key. Determines which node stores the
data.
• Clustering Columns: The remaining fields of the primary key, which determine the ordering of the
data on disk.
• Any column within a column family is accessed using the convention: column_family: column
• For Supercolumns: column_family: super_column: column
• Values → Addressed by the triple (row-key, column-key, timestamp)
• The system allows columns to be sorted either by time or by name.
• Time sorting: exploited by applications such as FB Inbox Search where the results are always
displayed in time sorted order.
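The addressing scheme above can be mimicked with nested dictionaries. This is a sketch only: the `Users` column family, its rows and the helper names are hypothetical, and "sorted by time" is assumed here to mean newest first.

```python
# keyspace -> column family -> row key -> column name -> (value, timestamp)
keyspace = {
    "Users": {
        "K1": {"name": ("Alice", 110), "email": ("alice@example.com", 100)},
        "K2": {"name": ("Bob", 95)},  # rows may hold different columns
    }
}


def get_column(column_family, row_key, column_name):
    """Return (value, timestamp) for one column, or None if the row lacks it."""
    return keyspace[column_family].get(row_key, {}).get(column_name)


def column_names(column_family, row_key, order="name"):
    """List a row's column names sorted by name or by timestamp (newest first)."""
    row = keyspace[column_family].get(row_key, {})
    if order == "name":
        return sorted(row)
    return sorted(row, key=lambda name: row[name][1], reverse=True)
```

Row K2 stores no `email` entry at all, which is the "only the columns with a value are stored" behaviour, as opposed to a NULL placeholder in a relational row.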
Data Model (2)
Data Model (3)
Relational Schema vs Cassandra
SYSTEM ARCHITECTURE
Introduction
The architecture of a storage system that needs to operate in a production setting is complex.
We will focus on the core distributed systems techniques used in Cassandra:
I. Partitioning
II. Replication
III. Membership
IV. Failure Handling
V. Scaling
All these modules work in synchrony to handle read/write requests.
Partitioning
Offers the ability to scale incrementally.
How?
• Dynamically partition the data over the set of nodes in the cluster.
• Consistent hashing (order preserving hash function).
• Output range: a ring.
• Each node: is assigned a random value which determines its place
on the ring.
Partitioning (1)
• Each data item: is assigned to a node by hashing its key to yield
its position on the ring and then walking the ring clockwise to
find the first node with a position larger than the item’s position.
• Each node becomes responsible for the region in the ring
between it and its predecessor node on the ring.
• Departure or arrival of a node only affects the immediate
neighbours.
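The ring walk described above can be sketched in a few lines. Assumptions for illustration: MD5 stands in for the hash function, node positions are derived from node names rather than chosen at random (so the example is repeatable), and the `Ring` class is hypothetical, not part of Cassandra.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string onto the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node gets one position on the ring, here derived from its name.
        self._entries = sorted((ring_position(n), n) for n in nodes)
        self._positions = [p for p, _ in self._entries]

    def owner(self, key: str) -> str:
        """Walk clockwise to the first node whose position is larger than the key's."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._entries[idx % len(self._entries)][1]  # wrap around the ring


keys = [f"user:{i}" for i in range(1000)]
small = Ring(["node-a", "node-b", "node-c"])
before = {k: small.owner(k) for k in keys}

# Adding a node only takes keys away from its immediate neighbour.
large = Ring(["node-a", "node-b", "node-c", "node-d"])
after = {k: large.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` ends up on `node-d`; keys owned by the other nodes stay where they were, which is the "only affects the immediate neighbours" property.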
Partitioning (2)
Challenges:
I. Random positioning of each node leads to non-uniform data and load
distribution.
II. The basic algorithm is oblivious to the heterogeneity in the performance
of nodes.
Addressed by:
Analysing load information on the ring and having lightly loaded nodes move
on the ring to alleviate heavily loaded ones.
Partitioning (3)
Node: Storage layer within a server
Before:
● 1 server/machine (machine: physical server or EC2 instance-AWS)
● 1 node/server (server: an installation of Cassandra)
Now:
● 256 vnodes/server (virtual nodes)
Vnodes or Virtual Nodes or Tokens:
Define the section of the ring (token ranges) the node will become responsible for
Why?
Much easier and faster in case of a node failure
Virtual Nodes (version >=1.2)
Replication
Used to achieve high availability and durability.
How?
• Replication factor: determines how many copies of your data exist.
• Each data item: is replicated at N hosts (N=replication factor).
• Coordinator node: in charge of the replication of the data items that fall within its range.
• Consistency level: specifies how many replicas must acknowledge a read or write before it
succeeds, and so how up-to-date and synchronized a row is across its replicas, e.g. quorum → floor(replication_factor / 2) + 1.
• Various replication policies: Rack Unaware, Rack Aware and Datacentre Aware.
• Each row is replicated across multiple datacentres which are connected through high
speed network links.
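A short sketch of the arithmetic behind replication factor and quorum. The function names are hypothetical; `replicas_for` assumes the rack-unaware policy, in which the coordinator's N-1 successors on the ring hold the other copies, and the R + W > RF rule is the usual condition for a read to overlap the latest write.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """A read is guaranteed to see the latest write when R + W > RF."""
    return read_replicas + write_replicas > replication_factor


def replicas_for(ring_nodes, coordinator_index, replication_factor):
    """Rack-unaware placement: the coordinator plus its next N-1 successors."""
    count = min(replication_factor, len(ring_nodes))
    return [ring_nodes[(coordinator_index + i) % len(ring_nodes)]
            for i in range(count)]
```

With a replication factor of 3, quorum reads and quorum writes each touch 2 replicas, so they always share at least one replica. Reading and writing at a single replica does not give that guarantee.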
Replication - Rack Unaware
Replication - Zookeeper
• In the original design described by Lakshman and Malik, Cassandra elects a leader amongst its nodes using Zookeeper.
• All nodes on joining the cluster contact the leader, who tells them which ranges
they are replicas for.
• Leader tries to maintain the invariant that no node is responsible for more than N-1
ranges in the ring.
• Metadata about the ranges a node is responsible for is 1) cached locally at each node
and 2) stored in a fault-tolerant manner inside Zookeeper.
• This way, a node that crashes and comes back knows what ranges it was
responsible for.
Replication - Zookeeper (1)
Membership
Based on Scuttlebutt, a very efficient anti-entropy Gossip-based
mechanism.
Benefits:
I. Efficient CPU utilization.
II. Efficient utilization of the Gossip Channel.
Gossip: a P2P communication protocol to discover and share location
and state information about the other nodes in a Cassandra cluster.
Gossip information is also persisted locally by each node to use
immediately when a node restarts.
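How gossip spreads cluster state can be shown with a toy simulation. This is a simplified sketch, not the Scuttlebutt protocol itself: each node's view maps an endpoint to a hypothetical `(version, status)` pair, the higher version wins on merge, and every node talks to one random peer per round.

```python
import random


def merge(local, remote):
    """Copy into local every endpoint entry that remote knows with a higher version."""
    for endpoint, (version, status) in remote.items():
        if endpoint not in local or local[endpoint][0] < version:
            local[endpoint] = (version, status)


def gossip_round(views, rng):
    """Every node exchanges its view with one randomly chosen peer, in both directions."""
    names = sorted(views)
    for name in names:
        peer = rng.choice([other for other in names if other != name])
        merge(views[peer], views[name])
        merge(views[name], views[peer])


def converged(views):
    """True once every node holds an identical view of the cluster."""
    all_views = list(views.values())
    return all(view == all_views[0] for view in all_views)


def run_gossip(views, seed=0, max_rounds=100):
    """Gossip until all views agree; returns the number of rounds used."""
    rng = random.Random(seed)
    rounds = 0
    while not converged(views) and rounds < max_rounds:
        gossip_round(views, rng)
        rounds += 1
    return rounds
```

Starting from views in which each node knows only itself, a handful of rounds is normally enough for every node to learn about every other node, without any central coordinator.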
Gossip
Gossip (1)
Gossip (2)
Gossip (3)
Membership - Failure Detection
Every node can locally determine if any other node in the system is up or down.
Used to avoid attempts to communicate with unreachable nodes.
How?
• Uses the Φ Accrual Failure Detector, which emits a value Φ representing a suspicion level
for each monitored node.
• With Φ=1, likelihood of mistake: 10%
• With Φ=2, likelihood of mistake: 1%
• and so on…
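The percentages above follow from the relation likelihood of mistake = 10^(-Φ). The sketch below also derives Φ from the time since the last heartbeat, assuming that gaps between heartbeats are exponentially distributed; both function names and that distributional choice are assumptions made for illustration.

```python
import math


def mistake_probability(phi: float) -> float:
    """Chance that a node suspected at level phi is actually alive: 10 ** -phi."""
    return 10.0 ** -phi


def phi(seconds_since_last_heartbeat: float, mean_interval: float) -> float:
    """Suspicion level under exponentially distributed heartbeat gaps.

    P(next heartbeat arrives later than t) = exp(-t / mean), and phi = -log10(P).
    """
    return seconds_since_last_heartbeat / (mean_interval * math.log(10))
```

The longer a node stays silent relative to its usual heartbeat interval, the higher Φ climbs. An application picks a threshold for Φ instead of a fixed timeout.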
Bootstrapping (adding a new node in the cluster)
Process of getting data from other nodes in the ring for a new node that starts
for the first time.
How?
• When the new node enters the cluster, it chooses a random token for its
position in the ring.
• It also reads its configuration file which contains the seeds (initial contact
points) of the cluster.
• Token information is then gossiped around the cluster enabling any node
to route a request for a key to the correct node.
Bootstrapping (adding a new node in the cluster) (1)
In Facebook’s environment…
• Node outages are often transient but may last for extended intervals.
• Failures can be of various forms such as disk failures, bad CPU, etc.
• A node failure rarely signifies a permanent departure and therefore should not result in re-
balancing of the partition assignment.
• Manual error could result in the unintentional startup of new nodes.
• To that effect, every message contains the cluster name of each Cassandra instance.
• An admin uses a command-line tool or a browser to connect to a Cassandra node and issue a
membership change to join or leave a cluster.
Scaling the Cluster
Adding a new node on the system in order to alleviate another heavily
loaded node.
How?
• Gets assigned a token.
• Splits the responsibility range of the other node.
• Data are streamed between the nodes using kernel-kernel copy
techniques.
• Data are transferred at approximately the rate of 40 MB/sec.
Local Persistence
• Cassandra relies on the local file system for data persistence.
• The data is represented on disk using a format that lends itself to
efficient data retrieval.
Implementation Details
The Cassandra process on a single machine primarily consists of:
I. A partitioning module,
II. The cluster membership and failure detection module,
III. The storage engine module.
Each of these modules has been implemented from the ground up using Java.
Module II is built on top of a network layer which uses non-blocking I/O.
Application-related messages for replication and request routing rely on TCP.
Implementation Details (1)
The request routing modules are implemented using a certain state machine.
When a read/write request arrives at any node in the cluster the state
machine…
I. Identifies the node(s) that own the data for the key
II. Routes the requests to the nodes and waits for the responses to arrive
III. If the replies do not arrive within a configured timeout value, fails the
request
IV. Figures out the latest response based on a timestamp
V. Schedules a repair of the data at any replica if they do not have the latest
piece of data.
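Steps IV and V of the state machine can be sketched as one function. This is an illustration only: `resolve_read` is a hypothetical name, replica responses are modelled as `(value, timestamp)` pairs, and routing and timeout handling are left out.

```python
def resolve_read(responses):
    """Pick the newest response and list the replicas that need repair.

    responses maps replica name -> (value, timestamp), or None when the
    replica holds no data for the key.
    """
    present = {r: vt for r, vt in responses.items() if vt is not None}
    if not present:
        return None, []
    latest = max(present.values(), key=lambda vt: vt[1])
    # Any replica that did not return the latest version gets a repair write.
    stale = sorted(r for r, vt in responses.items() if vt != latest)
    return latest[0], stale
```

For example, if replica `a` returns `("v2", 20)`, replica `b` returns `("v1", 10)` and replica `c` has nothing, the caller receives `"v2"` and a repair is scheduled for `b` and `c`.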
No coordination at all?
“We have learnt that having some amount of coordination is essential
to making the implementation of some distributed features tractable”
-Lakshman & Malik
• Integration with Zookeeper → can be used for various tasks in
large scale distributed systems.
WRITE/READ REQUESTS
Write Request Flow
Write Request Flow (1)
Inside the Node (1)
Inside the Node (2)
Inside the Node (3)
Inside the Node (4)
Write Request Flow in short
In case of a Node Failure...
I. The coordinator stores a hint locally, with a specified time to live
II. When the failed node is available again, the hinted write is sent to it
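A minimal sketch of this hinted-handoff behaviour, assuming the hint lives on the node that coordinates the write. The `Replica` and `Coordinator` classes, and the use of plain numbers for time, are hypothetical simplifications.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Coordinator:
    def __init__(self, hint_ttl_seconds):
        self.hint_ttl_seconds = hint_ttl_seconds
        self.hints = []  # (replica name, key, value, expiry time)

    def write(self, replica, key, value, now):
        """Write directly if the replica is up, otherwise keep a hint for it."""
        if replica.up:
            replica.data[key] = value
        else:
            self.hints.append((replica.name, key, value, now + self.hint_ttl_seconds))

    def replay_hints(self, replica, now):
        """Deliver unexpired hints to a replica that is reachable again."""
        remaining = []
        for name, key, value, expires_at in self.hints:
            if name != replica.name:
                remaining.append((name, key, value, expires_at))
            elif now < expires_at:
                replica.data[key] = value
            # Expired hints for this replica are simply dropped.
        self.hints = remaining
```

A hint that outlives its time to live is discarded, so a replica that stays down for too long has to be brought up to date by other means, such as read repair.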
How is a Memtable flushed on the disk?
• A background thread keeps checking the size of all the
Memtables while the clients keep writing on the cluster
• If one of the following conditions is met, a new Memtable is created
and the previous one is marked for flushing:
i. node’s global memory thresholds have been reached,
ii. commit log is full,
iii. a table level interval has been reached
How is a Memtable flushed on the disk? (1)
• Another thread (or multiple threads) flushes all the marked Memtables to
disk.
• The commit log segments corresponding to the entries of the flushed
Memtable are marked for recycling.
• A bloom filter and an index file are created.
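The interplay of commit log, Memtable and SSTables can be sketched with a toy in-memory store. Assumptions: `ToyStore` is hypothetical, the flush trigger is a simple entry count instead of the memory, commit-log and interval thresholds listed above, flushing happens inline rather than on a background thread, and bloom filters and index files are omitted.

```python
class ToyStore:
    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # in-memory: key -> (timestamp, value)
        self.sstables = []    # immutable sorted runs, oldest first

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # log first, for durability
        self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the Memtable into a sorted, immutable SSTable."""
        if not self.memtable:
            return
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # log entries for flushed data can be recycled

    def read(self, key):
        """Check the Memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key][1]
        for table in reversed(self.sstables):
            for k, (_, value) in table:
                if k == key:
                    return value
        return None
```

Writes never modify an existing SSTable. A newer value for the same key simply lands in the Memtable or in a later SSTable and shadows the older one on read.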
Compaction
As the number of SSTables grows, Cassandra
automatically merges multiple SSTables, using the algorithm
specified in the compaction strategy.
• Optimizes read requests.
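The core of a merge can be sketched as follows. This is not one of Cassandra's actual compaction strategies: `compact` is a hypothetical helper, and each SSTable is assumed to be a list of `(key, (timestamp, value))` pairs sorted by key.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest (timestamp, value) per key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table:
            if key not in merged or merged[key][0] < timestamp:
                merged[key] = (timestamp, value)
    return sorted(merged.items())
```

After the merge a read has one table to consult instead of several, and superseded versions of a key no longer take up space.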
Compaction
Read Request Flow
Read Request Flow (1)
Installation, Usage, Requirements,
Platforms
5.
Client Interfaces and Language Support
• CQL (Cassandra Query Language) and Thrift
• Internal API: StorageProxy API available to JVM-based clients
(internal use, highly specialized use-cases)
• Spark
• Hadoop (Map/Reduce jobs)
• Client Libraries for: Python, Java, .Net, Ruby, PHP, Perl, C++ etc.
APIs
The Cassandra API consists of the following three simple methods:
• insert(table, key, rowMutation)
• get(table, key, columnName)
• delete(table, key, columnName)
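The shape of these three calls can be imitated with a small in-memory stand-in. This is not the real client API: the module-level `tables` dictionary is hypothetical, `rowMutation` is modelled as a plain dictionary of column updates, and `columnName` is treated as a single column.

```python
tables = {}  # table name -> row key -> {column name: value}


def insert(table, key, row_mutation):
    """Apply a set of column updates to one row, creating the row if needed."""
    tables.setdefault(table, {}).setdefault(key, {}).update(row_mutation)


def get(table, key, column_name):
    """Return one column's value, or None if the row or column is absent."""
    return tables.get(table, {}).get(key, {}).get(column_name)


def delete(table, key, column_name):
    """Remove one column from a row; deleting a missing column is a no-op."""
    tables.get(table, {}).get(key, {}).pop(column_name, None)
```

Every operation is addressed by table and row key, which is why access by key is cheap while joins and ad hoc aggregation are not part of the interface.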
Thrift to CQL - An ongoing transition from 2012
Where the legacy Thrift API exposes the internal storage structure
of Cassandra pretty much directly, CQL provides a thin abstraction
layer over this internal structure.
CQL
• The primary language for communicating with the Cassandra
database.
• Most basic way to interact with Cassandra is using the CQL shell,
cqlsh.
• Very similar syntax with SQL.
• Does not support creation of supercolumns.
Cassandra as a Cloud Database
Meets all the requirements of a Cloud
Database:
• Transparent elasticity
• Transparent scalability
• High availability
• Security
• Easy data distribution
• Data redundancy
• Support all data formats
• Low cost
• Simple manageability
Integration with other tools
BI Tools:
• MS Excel
• Pentaho
• Tableau
• Jaspersoft
• Talend
Monitoring Cassandra
• Integration with Ganglia (distributed performance tool).
• Several system-level metrics have been exposed to Ganglia.
• Helps in understanding the system’s behavior in production
conditions.
Installation on Windows
Requirements:
• The latest version of Java 8
• The latest version of Python 2.7
• Download the Software (DataStax Community Edition for
Apache Cassandra™ v3.0.9 installer)
Steps
Creating a Virtual Machine
Requirements:
• Install Ubuntu Server 12.04 LTS 64-bit OS (or any Linux system
with Linux kernel 2.6.x or later)
• Update the OS
• sudo apt-get update (Internet Connection is mandatory)
Installing Virtual Machine
• Download and install Virtual Box
• Steps:
Installation on Mac
Requirements:
• The latest version of Java 8
• The latest version of Python 2.7
• Download the software(DataStax Community Edition for Apache
Cassandra™ v3.0.9 Tarball)
Steps
• Open a terminal window
• Navigate to the Cassandra folder (via the cd command) and then to the bin directory, e.g.
robinsmac:dev robin$ cd dsc-cassandra-1.2.2/bin
• Start the CQL shell in the terminal, e.g. robinsmac:bin robin$ ./cqlsh
• Your terminal window should look like this
Installation on Ubuntu
Requirements:
• The latest version of Java 8
• The latest version of Python 2.7
• Download the software(DataStax Community Edition for Apache
Cassandra™ v3.0.9 Tarball)
Steps
• Start Cassandra in foreground mode
• cd home/virtualmachine_name/cassandra/apache-cassandra-2.0.14-bin
• bin/cassandra -f
• Test Cassandra
• bin/cqlsh
Steps (1)
• Untar Cassandra
• cd cassandra
• tar -xvf apache-cassandra-2.0.14-bin.tar
• Create necessary directories and change ownership
• sudo mkdir /var/lib/cassandra
• sudo mkdir /var/log/cassandra
• sudo chown -R $USER:$GROUP /var/lib/cassandra
• sudo chown -R $USER:$GROUP /var/log/cassandra
Ubuntu commands
Terminal commands
Result
Demo
6.
OpsCenter
OpsCenter (1)
OpsCenter (2)
References
7.
Main Reference
References
1. A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):
35-40, 2010
2. Cassandra.apache.org. (2016). Apache Cassandra. [online] Available at: http://cassandra.apache.org/
3. Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), p.12.
4. Cockcroft, A. (2011). Benchmarking Cassandra Scalability on AWS - Over a million writes per second. [online]
Techblog.netflix.com. Available at: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-
on.html
5. Cs.uwaterloo.ca. (2016). [online] Available at:
https://cs.uwaterloo.ca/~tozsu/courses/CS848/W15/presentations/Cassandra.pdf
6. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A. and Gruber,
R. (2008). Bigtable. ACM Transactions on Computer Systems, 26(2), pp.1-26.
7. DataStax. (2016). Case Studies. [online] Available at: http://www.datastax.com/resources/casestudies
References (1)
8. Docs.datastax.com. (2016). About hinted handoff writes. [online] Available at:
https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_about_hh_c.html
9. DataStax. (2016). Customers. [online] Available at: http://www.datastax.com/customers
10. Docs.datastax.com. (2016). Introduction to Cassandra Query Language. [online] Available at:
https://docs.datastax.com/en/cql/3.1/cql/cql_intro_c.html
11. DataStax. (2014). What on earth are people using Cassandra for anyway?. [online] Available at:
http://www.datastax.com/2014/06/what-are-people-using-cassandra-for
12. DataStax. (2012). A thrift to CQL3 upgrade guide. [online] Available at:
http://www.datastax.com/dev/blog/thrift-to-cql3
13. DataStax. (2012). Virtual nodes in Cassandra 1.2. [online] Available at:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
14. DataStax. (2012). Schema in Cassandra 1.1. [online] Available at: http://www.datastax.com/dev/blog/schema-
in-cassandra-1-1
References (2)
15. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S.,
Vosshall, P. and Vogels, W. (2007). Dynamo. ACM SIGOPS Operating Systems Review, 41(6), p.205.
16. Docs.datastax.com. (2016). Architecture in brief. [online] Available at:
https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureIntro_c.html
17. Docs.datastax.com. (2016). How data is distributed across a cluster (using virtual nodes). [online] Available
at:
http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
18. Docs.datastax.com. (2016). Internode communications (gossip). [online] Available at:
https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureGossipAbout_c.html
19. D0.awsstatic.com. (2016). [online] Available at:
https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
20. Edlich, P. (2016). NOSQL Databases. [online] Nosql-database.org. Available at: http://nosql-database.org/
References (3)
21. Edu.dmst.aueb.gr. (2016). E-Learning Portal of the Department of Management Science & Technology: Site login. [online] Available at:
https://edu.dmst.aueb.gr/pluginfile.php/3614/mod_resource/content/0/BigDataSystems.pdf
22. En.wikipedia.org. (2016). Apache Cassandra. [online] Available at:
https://en.wikipedia.org/wiki/Apache_Cassandra
23. En.wikipedia.org. (2016). DataStax. [online] Available at: https://en.wikipedia.org/wiki/DataStax
24. En.wikipedia.org. (2016). Log-structured merge-tree. [online] Available at:
https://en.wikipedia.org/wiki/Log-structured_merge-tree
25. Exponential.io. (2016). Cassandra terminology - Exponential.io. [online] Available at:
http://exponential.io/blog/2015/01/08/cassandra-terminology/
References (4)
26. Facebook.com. (2016). Cassandra – A structured storage system on a P2P Network. [online] Available at:
https://www.facebook.com/notes/facebook-engineering/cassandra-a-structured-storage-system-on-a-p2p-network/24413138919/
27. O'Neil, P. and O'Neil, E. (2016). The Log-Structured Merge-Tree (LSM-Tree). [online] Citeseerx.ist.psu.edu.
Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.2782
28. YouTube. (2016). Getting Started with Cassandra CQL on a Mac. [online] Available at:
https://www.youtube.com/watch?v=9zQc959w6Ho
29. YouTube. (2016). Installing Apache Cassandra In Windows. [online] Available at:
https://www.youtube.com/watch?v=fspXzjwfii0
30. YouTube. (2016). Part 1 - Apache Cassandra Installation From Scratch - Ubuntu. [online] Available at:
https://www.youtube.com/watch?v=ToztU48UxYE
References (5)
31. Weinberger, M. (2016). The Facebook engineer who taught its data how to dance is solving a new
complicated problem. [online] Business Insider. Available at:
http://www.businessinsider.com/hedvig-avinash-lakshman-facebook-cassandra-data-storage-2015-3
32. Wiki.apache.org. (2016). FrontPage - Cassandra Wiki. [online] Available at:
https://wiki.apache.org/cassandra/
33. www.tutorialspoint.com. (2016). Cassandra Introduction. [online] Available at:
https://www.tutorialspoint.com/cassandra/cassandra_introduction.htm

Presentation of Apache Cassandra

  • 2. Contents
      1. Introduction to NoSQL systems, Extensible Record Stores, and Amazon’s Dynamo + Google Bigtable
      2. What Cassandra is and how it compares with other similar systems
      3. Which applications are better supported: examples, case studies
      4. Technical description, architecture, internals
      5. How it is used and installed, its requirements, and which platforms it runs on
      6. Demo
      7. References
  • 3. 1. Background: NoSQL, Extensible Record Stores, Cassandra’s Parents
  • 4. NoSQL Systems
      NoSQL, or Not-Only-SQL, systems are next-generation databases. The movement started in 2009 with the goal of creating modern, web-scale databases. Currently, there are more than 225 NoSQL systems. In general, they share the following features:
      • Schema-free databases
      • Easy replication support
      • Simple API
      • Distributed
      • Open source
      • BASE (instead of ACID)
      • Huge amounts of data
      • Horizontally scalable
  • 5. Extensible Record Stores (or Wide Column Stores)
      • Motivated by Google’s Bigtable.
      • Basic data model: rows and columns.
      • Basic scalability model: rows and columns are split across nodes.
      • Rows: split across nodes through sharding on the primary key.
      • Columns: distributed over multiple nodes by using ‘column groups’.
      • Other systems that use this technology: Hypertable, HBase.
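
The row and column split described on this slide can be illustrated with a minimal Python sketch. It is not how any particular store implements sharding: the node names, the column group layout, and the helper names `node_for_row` and `split_row` are all assumptions made for illustration, and plain hash-modulo placement is used only because it is the simplest scheme to read.

```python
import hashlib

# Hypothetical cluster members (names are illustrative only).
NODES = ["node-a", "node-b", "node-c"]

# Hypothetical column groups: columns that are stored together.
COLUMN_GROUPS = {
    "profile": ["name", "email"],
    "activity": ["last_login", "clicks"],
}


def node_for_row(primary_key: str) -> str:
    """Shard rows: hash the primary key and map it onto one node."""
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]


def split_row(row: dict) -> dict:
    """Split one logical row into one fragment per column group."""
    return {
        group: {col: row[col] for col in cols if col in row}
        for group, cols in COLUMN_GROUPS.items()
    }


row = {"name": "Ada", "email": "ada@example.com", "clicks": 7}
print(node_for_row("user:42"))  # always the same node for the same key
print(split_row(row))
```

Each fragment returned by `split_row` could then be placed on a different node, which is the sense in which columns are "distributed by column groups".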
  • 6. What is it? A highly-available and scalable storage system used by Amazon to store and retrieve user shopping charts and other core services. It pioneered the idea of eventual consistency. Key-Value Store. How it works? Allows read and write operations to continue even during network partitions and resolves update conflicts using different conflict resolution mechanisms. Sacrifices consistency for availability. Allows customization to meet desired preference. Consistent Hashing, Vector Clocks (not in Cassandra), Gossip Protocol, Hinted Handoff, Read Repair Cassandra’s Parents - Amazon Dynamo
  • 7. Cassandra’s Parents - Google Bigtable What is it? A high performance data storage system built on Google File System and other Google technologies. How it works? Provides both structure and data distribution but relies on a distributed file system for durability. Richer data model from Dynamo. One key, many values. Fast sequential access. Columnar, SSTable Storage, Append-only, Memtable, Compaction
  • 8. What features does Cassandra use from Google’s BigTable? 1. Column Families 2. Memtables 3. SSTables What features does Cassandra use from Amazon Dynamo? 1. Consistent hashing 2. Partitioning 3. Replication Cassandra’s Parents
  • 10. Description and Comparisons What Cassandra is and how it is compared with other similar systems 2.
  • 11. Avinash Lakshman • Inventor, Apache Cassandra • Co-inventor, Amazon Dynamo
  • 12. Prashant Malik • Inventor, Apache Cassandra • Technical Leader, Facebook
  • 15. Definition • A distributed NoSQL database system for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure.
  • 16. Timeline with activities • July 2008 Facebook released Cassandra as an open-source project • March 2009 Cassandra became an Apache Incubator project • 17th February 2010 Cassandra graduated to a top-level project • 2012 University of Toronto researchers studying NoSQL systems concluded that “In terms of scalability, there is a clear winner throughout our experiments” • 2010-2015 New releases of Cassandra
  • 17. Strengths • Linear scale performance The ability to add nodes without failures leads to predictable increases in performance • Supports multiple languages Python, C#/.NET, C++, Ruby, Java, Go, and many more… • Operational and developmental simplicity There are no complex software tiers to be managed, so administration duties are greatly simplified. • Ability to deploy across data centres Cassandra can be deployed across multiple, geographically dispersed data centres
  • 18. • Cloud availability Installations in cloud environments • Peer to peer architecture Cassandra follows a peer-to-peer architecture, instead of master-slave architecture • Flexible data model Supports modern data types with fast writes and reads • Fault tolerance Nodes that fail can easily be restored or replaced • High Performance Cassandra has demonstrated brilliant performance under large sets of data Strengths (1)
  • 19. • ColumnFamily Store Cassandra stores columns based on the column names, leading to very quick slicing • Tunable consistency Support for strong or eventual data consistency across a widely distributed cluster • Schema-free/Schema-less In Cassandra, columns can be created at your will within the rows. Cassandra data model is also famously known as a schema-optional data model • AP-CAP Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered to be more important than consistency in Cassandra Strengths (2)
  • 21. Variable number of columns per row
  • 22. Weaknesses Use cases where it is better to avoid Cassandra: • If too many joins are required to retrieve the data • To store configuration data • During compaction, things slow down and throughput degrades • Basic things like aggregation operators are not supported • Range queries on the partition key are not supported • If there is transactional data that requires 100% consistency • Cassandra can update and delete data but it is not designed to do so
  • 23. Business Insider “The basic problem Cassandra solved is that when you have a lot of data sitting on a lot of servers, as Facebook does, you end up with a house of cards. A single server going down can collapse the whole stack.”
  • 24. Cassandra compared to other NoSQL Systems
  • 25. Read & Write latency for workload Read/Write
  • 26. Throughput for workload Read/Write & Read/Scan/Write
  • 28. Mixed Operational & Analytical Workload
  • 34. Differences between Cassandra and RDBMS (RDBMS → Cassandra) • Relational database → keyspace • B-trees → log-structured merge-trees • Rows that lack a value for a particular column store NULL in that position → for each row, only the columns with a value are stored • Supports ACID transactions → only supports AID (atomicity, isolation, durability)
  • 37. What kind of applications are supported by Cassandra >80% of the clients fit into one of the next categories: I. Product Catalog/Playlist II. Recommendation/Personalization Engine III. Sensor Data/Internet of Things IV. Messaging (and generally time-series data) V. Fraud Detection
  • 38. In other words, applications that need to... • store and handle time-series data (most common use case) • store and handle large volumes of data • scale predictably • be continuously available • protect their data
  • 39. Datastax • A software company that develops and provides support for a commercial edition of Cassandra. • Massively scalable NoSQL platform able to run online applications for innovative and data-intensive companies (e.g. Netflix). • Faster to deploy and less expensive to maintain than other database platforms. • Powered by Cassandra and contains only selected releases of it, chosen by its expert staff.
  • 40. Datastax (1) • Supports businesses that need a progressive data management. • Can serve as a real-time datastore for online production. • Delivers a unique, smart data platform, suitable for the cloud.
  • 41. Customers • Over 3.000 companies around the world use (or have used) Apache Cassandra in production. • Most famous:
  • 42. Cassandra Summit • Organized by DataStax for 7 consecutive years (in both US and Europe). • New product releases are announced. • Customers describe their usage of Cassandra
  • 43. Key Terms • Cluster • Distributed Location • Node
  • 46. Facebook Inbox Search - Requirements “The system was required to handle a very high write throughput, billions of writes per day, and also scale with the number of users” “Since users are served from data centres that are geographically distributed, being able to replicate data across data centres was key to keep search latencies down” • Lakshman, Malik
  • 47. Facebook Inbox Search The reason why Cassandra was initially built. Facebook maintains a per-user index of all messages that have been exchanged between the senders and the recipients of the message. Two kinds of search features were enabled in 2008: I. term search II. interactions - given a person’s name, returns all the messages that the user has sent to or received from that person
  • 48. Facebook Inbox Search (1) How did they do that? The schema consists of two column families. Exploits the “time sorting” feature of Cassandra. For the term search: • UserID → key • Words that make up the message → super columns • Columns within the super column → individual message identifiers (MessageID) of the messages that contain the word.
  • 49. Facebook Inbox Search (2) For the interactions: • UserID → key • RecipientsID’s → super columns • Columns within the super columns → MessageID’s • Cassandra provides certain hooks for intelligent caching of data
  • 51. Facebook Inbox Search (3) In 2008: • system was storing 50+ TB of data • on a 150 node cluster • spread out between east and west coast data centres Performance:
  • 52. Facebook abandoned Cassandra for the Inbox in late 2010. Cassandra has been deployed as the backend storage system for multiple services within Facebook.
  • 53. Categories: Fraud Detection and Time-series data
  • 54. Instagram Fraud Detection Instagram initially used Redis for auditing information related to security and site integrity (e.g. fighting spam, finding abusive users). But… • data size was growing too quickly • high write and low read rate • keeping the data in memory was too costly So… Cassandra
  • 55. Instagram Fraud Detection (1) • Started with 3 nodes and very soon grew to a 12 node cluster. • No need to store very large instances in memory → put everything on disks. “Implementing Cassandra cut our costs to the point where we were paying around a quarter of what we were paying before. Not only that, but it also freed us to just throw data at the cluster because it was much more scalable and we could add nodes whenever needed.” - Rick Branson, Software Engineer at Instagram
  • 56. Instagram “Inbox” Newsfeed or inbox part of Instagram: a feed of all the activity that would be associated with a given user’s account. Previously in Redis, with the same (memory) limitations as in the Fraud Detection case. Instagram’s Cassandra Cluster: • 12 nodes on EC2 (AWS) • 1.2 TB of data stored • 20.000 writes/sec. • 15.000 reads/sec.
  • 58. i2O Water Description: i2O Water helps utility companies operate more efficiently through the use of IoT aiming at solving the water crisis. Challenges: • Massive volumes of time-series data (>1.5 TB and growing) • Need for search and analysis of high velocity, streaming data in near real-time • SQL Server (previously used) has limitations on scalability and performance
  • 59. i2O Water (1) Solution: After evaluating many common NoSQL technologies, they chose Cassandra. Why? I. performance (50-60.000 writes and 20-40.000 reads/sec instead of 0.5 writes/sec and 5 reads/sec with SQL Server) II. easy to maintain III. easy to upgrade IV. ability to handle structured and unstructured real-time streaming data V. continuous availability and reliability VI. operationally simple to manage
  • 60. i2O Water (2) Results: I. 235 m. lt. of water saved per day II. successfully handling massive volumes of data from 15.000 devices without latency or downtime III. fault tolerance even during upgrades (99.9% availability)
  • 61. Category: Product Catalogs and Playlists
  • 62. Spotify Description: Spotify delivers streaming music in real time to over 40 million active users (the number is growing), without interruption. Challenges: • postgreSQL (previously used) and generally RDBMSs cannot deliver 100% availability • limited scalability across data centers • difficult to analyze massive volumes of data
  • 63. Spotify (1) Solution: Cassandra. Why? I. high availability (due to masterless architecture) II. stores data for the entire product catalog and key customer experience capabilities III. multi data centre application and no single point of failure IV. integration with Apache Spark for real time processing and analytics
  • 64. Spotify (2) Results: I. 40.000 requests/sec. handled successfully and on-time II. >500 nodes across 4.000 servers in 4 data centres III. >1.5 bn playlists created from 40m active users and managed in real time
  • 65. Data Centre Data Centre Data Centre Data Centre Spotify - Data Centres (2 in the US - 2 in Europe)
  • 67. Netflix Description: Netflix is the world’s leading internet television network with more than 48 million users in 40 countries. Challenges: • Oracle database (was used until 2010) was approaching its limits on traffic and capacity • single centre → single point of failure • system downtime every two weeks for schema changes • need for reliability and flexibility for international expansion
  • 68. Netflix (1) Solution: Cassandra (on the Cloud, AWS) was the clear winner of the extensive evaluation of NoSQL DB options. (Later on, Netflix migrated to DataStax Enterprise for security and production.) Why? I. persistent datastore, 100% uptime and cost-effective scalability II. ability to create a cluster in any region in 10’ III. expert support
  • 69. Netflix (2) Results: I. throughput of >10 m. transactions/sec. II. process of >2.1 bn. reads and 4.3 bn. writes/day III. delivers >76.000 genre types and captures every detail of customers’ habits for tailoring the customer experience
  • 70. Category: Product Catalogs and Playlists
  • 71. Coursera Description: Coursera is an education platform which partners with top universities and organizations worldwide, to offer courses online for anyone to take, for free. Challenges: • MySQL (previously used for class interaction) was insufficient: • unstable performance, • unexpected downtime, • limitation in introducing new features
  • 72. Coursera (1) Solution: After evaluating emerging database technologies, it chose Cassandra (DataStax). Why? • 100% application uptime needed (customers from all over the world) • Scalability (enabling storage of growing user data)
  • 73. Coursera (2) Results: I. 3 nodes on AWS in the US East region and plans to expand to multiple data centers across different regions II. 24x7 availability to the users III. Helps innovation IV. Reduced time to market on new features “High availability with reliable performance is a big win for us. With Datastax Enterprise, our customers around the world are able to take any course, anytime through our on- demand model.” • Daniel Chia, Software Engineer at Coursera
  • 80. The Weather Channel Description: The Weather Channel delivers breaking news to countless viewers and users from web, desktop and mobile applications. Challenges: • Customer experience in the center of attention (continuous availability, global and diverse users) • New capabilities including statistics from unstructured data, CGS for customer engagement etc.
  • 81. The Weather Channel (1) Solution: Cassandra Why? I. linear scalability II. 100% uptime III. supports almost all possible types of content (e.g. observations, forecasts, marine data, ads)
  • 82. The Weather Channel (2) Results: I. billions requests/month are processed - no fear of downtime II. node count was grown from 3 to 36 in AWS in 1 year across 3 data centers (US East and West and Western Europe). III. capability for new offerings (e.g. social weather)
  • 85. Key Terms – Data Structures • Commit Log • Memtable • Sorted String Table (SST) • Bloom Filter • Index File
  • 86. Key Terms • Gossip protocol: helps each node learn about the topology of the cluster (communication and detection of faulty nodes). • Snitch: indicates which node is closest to the current location.
  • 87. Log-Structured Merge-Tree (LSM-Tree) What is it? • A disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts. • A simple LSM-Tree comprises two tree-like structures: C0 (in memory) and C1 (on disk). • Maintains key-value pairs. • In Cassandra, each value represents a row. Used in: • BigTable, HBase, MongoDB, SQLite, RocksDB, InfluxDB
  • 88. Data Model • Each Row → Identified by a Unique Key (Primary Key) • Keyspace → Outermost container for data (one or more column families) • Column Family → Contains Supercolumns or Columns (but not both) • Column → Basic data structures with: key, value, timestamp • Supercolumn → Special column, stores a map of sub-columns. Columns that you are likely to query together should be placed in the same column family. • Columns could be of variable number per key. For instance, key K1 could have 1024 columns/supercolumns while K2 could have 64 columns/supercolumns
  • 89. Data Model (1) • Partition key: The first column declared in the primary key. Determines which node stores the data. • Clustering Columns: The remaining fields of the primary key, which determine the ordering of the data in the disk. • Any column within a column family is accessed using the convention: column_family: column • For Supercolumns: column_family: super_column: column • Values → Addressed by the triple (row-key, column-key, timestamp) • The system allows columns to be sorted either by time or by name. • Time sorting: exploited by applications such as FB Inbox Search where the results are always displayed in time sorted order.
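
The data model described on the two slides above can be sketched with plain Python structures. This is only an illustration: the names `Column`, `inbox` and `columns_sorted` are hypothetical, and "newest first" is an assumed reading of "time sorted order".

```python
from typing import NamedTuple

class Column(NamedTuple):
    """Basic unit of data: a name (key), a value, and a write timestamp."""
    name: str
    value: str
    timestamp: int

# A column family maps row key -> {column name -> Column}.
# Rows may hold a different number of columns each.
inbox = {
    "user1": {
        "msg3": Column("msg3", "hello", 30),
        "msg1": Column("msg1", "hi", 10),
        "msg2": Column("msg2", "hey", 20),
    },
    "user2": {"msg9": Column("msg9", "yo", 5)},
}

def columns_sorted(family, row_key, by="name"):
    """Return a row's columns sorted by name, or by timestamp (newest first)."""
    cols = family[row_key].values()
    if by == "time":
        return sorted(cols, key=lambda c: c.timestamp, reverse=True)
    return sorted(cols, key=lambda c: c.name)
```

Calling `columns_sorted(inbox, "user1", by="time")` returns the messages with the most recent timestamp first, which is the ordering an inbox-style application would display.
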
  • 92. Relational Schema vs Cassandra
  • 94. Introduction The architecture of a storage system that needs to operate in a production setting is complex. We will focus on the core distributed systems techniques used in Cassandra: I. Partitioning II. Replication III. Membership IV. Failure Handling V. Scaling All these modules work in synchrony to handle read/write requests.
  • 95. Partitioning Offers the ability to scale incrementally. How? • Dynamically partition the data over the set of nodes in the cluster. • Consistent hashing (order preserving hash function). • Output range: a ring. • Each node: is assigned a random value which determines its place on the ring.
  • 96. Partitioning (1) • Each data item: is assigned to a node by hashing its key to yield its position on the ring and then walking the ring clockwise to find the first node with a position larger than the item’s position. • Each node becomes responsible for the region in the ring between it and its predecessor node on the ring. • Departure or arrival of a node only affects the immediate neighbours.
  • 97. Partitioning (2) Challenges: I. Random positioning of each node leads to non-uniform data and load distribution. II. The basic algorithm is oblivious to the heterogeneity in the performance of nodes. Addressed by: Analysing load information on the ring and having lightly loaded nodes move on the ring to alleviate heavily loaded ones.
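
The ring lookup described in the partitioning slides can be sketched as follows. This is a minimal illustration, not Cassandra's implementation: the `Ring` class, the MD5-based `ring_position` helper and the 2**32 ring size are assumptions, and MD5 is not an order-preserving hash like the one the slides mention.

```python
import bisect
import hashlib

def ring_position(key: str, ring_size: int = 2**32) -> int:
    """Hash a key to a position on the ring (MD5 is an illustrative choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % ring_size

class Ring:
    def __init__(self, node_tokens: dict):
        # node_tokens maps node name -> token (the node's position on the ring)
        self._tokens = sorted((tok, node) for node, tok in node_tokens.items())

    def owner(self, position: int) -> str:
        """Walk clockwise to the first node whose token is larger than position."""
        tokens = [tok for tok, _ in self._tokens]
        idx = bisect.bisect_right(tokens, position)
        return self._tokens[idx % len(self._tokens)][1]  # wrap around the ring
```

With nodes at tokens 100, 200 and 300, adding a node at 250 only changes ownership of positions between 200 and 250, which previously belonged to the node at 300. This is the "only affects the immediate neighbours" property.
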
  • 99. Node: Storage layer within a server Before: ● 1 server/machine (machine: physical server or EC2 instance-AWS) ● 1 node/server (server: an installation of Cassandra) Now: ● 256 vnodes/server (virtual nodes) Vnodes or Virtual Nodes or Tokens: Define the section of the ring (token ranges) the node will become responsible for Why? Much easier and faster in case of a node failure
  • 101. Replication Used to achieve high availability and durability. How? • Replication factor: determines how many copies of your data exist. • Each data item: is replicated at N hosts (N=replication factor). • Coordinator node: in charge of the replication of the data items that fall within its range. • Consistency level: determines how many replicas must respond to a read or write, and therefore how up-to-date and synchronized a row is across its replicas, e.g. quorum → floor(replication_factor / 2) + 1. • Various replication policies: Rack Unaware, Rack Aware and Datacentre Aware. • Each row is replicated across multiple datacentres which are connected through high speed network links.
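
The quorum formula on the replication slide is easy to check with a few lines of Python. The helper names `quorum` and `reads_see_latest_write` are hypothetical; the second function encodes the widely used rule that a read is guaranteed to overlap the latest write when the number of read replicas plus write replicas exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor
```

For a replication factor of 3, a quorum is 2 replicas, so quorum reads combined with quorum writes always overlap in at least one replica, whereas reading and writing at a single replica each does not give that guarantee.
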
  • 102. Replication - Rack Unaware
  • 104. Replication - Zookeeper • Cassandra elects a leader amongst its nodes using Zookeeper. • On joining the cluster, every node contacts the leader, who tells it which ranges it is a replica for. • The leader tries to maintain the invariant that no node is responsible for more than N-1 ranges in the ring. • Metadata about the ranges a node is responsible for is 1) cached locally at each node and 2) stored in a fault-tolerant manner inside Zookeeper. • This way, a node that crashes and comes back knows what ranges it was responsible for.
  • 106. Membership Based on Scuttle-butt, a very efficient anti-entropy Gossip based mechanism. Benefits: I. Efficient CPU utilization. II. Efficient utilization of the Gossip Channel. Gossip: a P2P communication protocol to discover and share location and state information about the other nodes in a Cassandra cluster. Gossip information is also persisted locally by each node to use immediately when a node restarts.
  • 107. Gossip
  • 111. Membership - Failure Detection Every node can locally determine if any other node in the system is up or down. Used to avoid attempts to communicate with unreachable nodes. How? • Uses the Φ Accrual Failure Detector, which emits a value that represents a suspicion level for each monitored node • With Φ=1, likelihood of mistake: 10% • With Φ=2, likelihood of mistake: 1% • and so on…
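
The relationship between Φ and the likelihood of a mistake listed above follows a power of ten, which the sketch below makes explicit. The function names are hypothetical, and the `phi` helper assumes heartbeat inter-arrival times follow an exponential distribution; that is a modelling assumption for illustration, not a statement about how any particular Cassandra release computes Φ.

```python
import math

def mistake_likelihood(phi: float) -> float:
    """Chance that declaring the node down at this suspicion level is wrong."""
    return 10.0 ** (-phi)

def phi(time_since_last_heartbeat: float, mean_interval: float) -> float:
    """Suspicion level, assuming exponentially distributed heartbeat intervals.

    P(next heartbeat arrives later than t) = exp(-t / mean), and
    phi = -log10(P), which simplifies to t / (mean * ln 10).
    """
    return time_since_last_heartbeat / (mean_interval * math.log(10))
```

Under this assumption the suspicion level grows linearly with the silence of a node, so a threshold on Φ translates directly into a tolerated number of missed heartbeat intervals.
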
  • 112. Bootstrapping (adding a new node in the cluster) Process of getting data from other nodes in the ring for a new node that starts for the first time. How? • When the new node enters the cluster, it chooses a random token for its position in the ring. • It also reads its configuration file which contains the seeds (initial contact points) of the cluster. • Token information is then gossiped around the cluster enabling any node to route a request for a key to the correct node.
  • 113. Bootstrapping (adding a new node in the cluster) (1) In Facebook’s environment… • Node outages are often transient but may last for extended intervals. • Failures can be of various forms such as disk failures, bad CPU, etc. • A node failure rarely signifies a permanent departure and therefore should not result in re- balancing of the partition assignment. • Manual error could result in the unintentional startup of new nodes. • To that effect, every message contains the cluster name of each Cassandra instance. • An admin uses a cmd tool or a browser to connect to a Cassandra node and issue a membership change to join or leave a cluster.
  • 114. Scaling the Cluster Adding a new node on the system in order to alleviate another heavily loaded node. How? • Gets assigned a token. • Splits the responsibility range of the other node. • Data are streamed between the nodes using kernel-kernel copy techniques. • Data are transferred at approximately the rate of 40 MB/sec.
  • 115. Local Persistence • Cassandra relies on the local file system for data persistence. • The data is represented on disk using a format that lends itself to efficient data retrieval.
  • 116. Implementation Details The Cassandra process on a single machine primarily consists of: I. A partitioning module, II. The cluster membership and failure detection module, III. The storage engine module. Each of these modules has been implemented from the ground up using Java. Module II is built on top of a network layer which uses non-blocking I/O. Application-related messages for replication and request routing rely on TCP.
  • 117. Implementation Details (1) The request routing modules are implemented using a state machine. When a read/write request arrives at any node in the cluster, the state machine… I. Identifies the node(s) that own the data for the key II. Routes the request to those nodes and waits for the responses to arrive III. Fails the request if the replies do not arrive within a configured timeout IV. Figures out the latest response based on a timestamp V. Schedules a repair of the data at any replica that does not have the latest piece of data.
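
Steps III to V of the request routing state machine can be sketched as a single function. This is a simplified illustration under stated assumptions: `Reply` and `resolve_read` are hypothetical names, the replies are assumed to have been collected already, and a missing reply stands in for a timeout.

```python
from typing import NamedTuple, Optional

class Reply(NamedTuple):
    node: str
    value: Optional[str]
    timestamp: int

def resolve_read(replies, required):
    """Return (latest reply, nodes needing repair); fail if too few replies arrived."""
    if len(replies) < required:
        # Step III: not enough replicas answered within the timeout.
        raise TimeoutError(f"only {len(replies)} of {required} replies arrived")
    # Step IV: the reply with the highest timestamp wins.
    latest = max(replies, key=lambda r: r.timestamp)
    # Step V: any replica that returned older data is scheduled for repair.
    stale = [r.node for r in replies if r.timestamp < latest.timestamp]
    return latest, stale
```

The caller would return `latest.value` to the client and push that value to each node in the returned list in the background.
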
  • 118. No coordination at all? “We have learnt that having some amount of coordination is essential to making the implementation of some distributed features tractable” -Lakshman & Malik • Integration with Zookeeper → can be used for various tasks in large scale distributed systems.
  • 126. Write Request Flow in short
  • 127. In case of a Node Failure... I. A locally stored hint with a specified time to live II. When the nodes are available again, the write operation is sent
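
The two steps above (store a hint with a time to live, replay it when the node returns) can be sketched like this. `HintStore` and its methods are hypothetical names for illustration, and the explicit `now` parameter exists only to make the behaviour easy to follow without waiting on a real clock.

```python
import time

class HintStore:
    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self.hints = []  # list of (target node, mutation, stored_at)

    def store(self, target, mutation, now=None):
        """Keep a write locally because its target replica is down."""
        stored_at = time.time() if now is None else now
        self.hints.append((target, mutation, stored_at))

    def replay(self, target, now=None):
        """Return unexpired mutations for a recovered node and drop its hints."""
        now = time.time() if now is None else now
        due = [m for t, m, at in self.hints
               if t == target and now - at <= self.ttl_seconds]
        self.hints = [h for h in self.hints if h[0] != target]
        return due
```

Hints older than the time to live are discarded rather than replayed, so a node that stays down for a long time has to be brought up to date by other means.
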
  • 128. How is a Memtable flushed to disk? • A background thread keeps checking the size of all the Memtables while the clients keep writing to the cluster • If one of the following conditions is met, a new Memtable is created and the previous one is marked for flushing: i. the node’s global memory thresholds have been reached, ii. the commit log is full, iii. a table-level interval has been reached
  • 129. How is a Memtable flushed to disk? (1) • Another thread (or multiple threads) flushes all the marked Memtables to disk. • The commit log segments corresponding to the entries of the flushed Memtable are marked for recycling. • A bloom filter and an index file are created.
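
The mark-then-flush behaviour described on the two slides above can be sketched in a few lines. This is a toy model: the `Table` class is hypothetical, a simple entry count stands in for the real flush conditions (memory thresholds, commit log size, table-level interval), and "disk" is just a list of immutable sorted tuples. The commit log, bloom filter and index file are left out.

```python
class Table:
    def __init__(self, flush_threshold: int):
        self.flush_threshold = flush_threshold  # max entries per Memtable (assumed limit)
        self.memtable = {}
        self.pending_flush = []   # Memtables marked for flushing
        self.sstables = []        # immutable, key-sorted tuples of (key, value)

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Swap in a fresh Memtable and mark the full one for flushing.
            self.pending_flush.append(self.memtable)
            self.memtable = {}

    def flush(self):
        """Write every marked Memtable out as a sorted, immutable SSTable."""
        while self.pending_flush:
            full = self.pending_flush.pop(0)
            self.sstables.append(tuple(sorted(full.items())))
```

Writes never block on the flush in this model: the full Memtable is set aside and a new one accepts writes immediately, which mirrors the separation between the checking thread and the flushing thread on the slides.
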
  • 130. Compaction When the number of SSTables grows, Cassandra automatically merges multiple SSTables, based on the algorithm specified in the compaction strategy. • Optimizes read requests.
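
The core idea of compaction, merging several SSTables and keeping only the newest version of each key, can be sketched as below. The `compact` function is a hypothetical illustration and does not correspond to any specific Cassandra compaction strategy; it also ignores deletions.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest version of every key.

    Each SSTable is a list of (key, value, timestamp) tuples.
    """
    newest = {}
    for table in sstables:
        for key, value, ts in table:
            if key not in newest or ts > newest[key][1]:
                newest[key] = (value, ts)
    # The merged SSTable is again sorted by key.
    return [(key, value, ts) for key, (value, ts) in sorted(newest.items())]
```

After the merge a read has one table to consult instead of several, which is why compaction helps read requests.
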
  • 135. Client Interfaces and Language Support • CQL (Cassandra Query Language) and Thrift • Internal API: StorageProxy API available to JVM-based clients (internal use, highly specialized use-cases) • Spark • Hadoop (Map/Reduce jobs) • Client Libraries for: Python, Java, .Net, Ruby, PHP, Perl, C++ etc.
  • 136. APIs The Cassandra API consists of the following three simple methods: • insert(table, key, rowMutation) • get(table, key, columnName) • delete(table, key, columnName)
  • 137. Thrift to CQL - An ongoing transition from 2012 Where the legacy Thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL provides a thin abstraction layer over this internal structure.
  • 138. CQL • The primary language for communicating with the Cassandra database. • The most basic way to interact with Cassandra is the CQL shell, cqlsh. • Syntax very similar to SQL. • Does not support creation of supercolumns.
  • 139. Cassandra as a Cloud Database Meets all the requirements of a Cloud Database: • Transparent elasticity • Transparent scalability • High availability • Security • Easy data distribution • Data redundancy • Support all data formats • Low cost • Simple manageability
  • 140. Integration with other tools BI Tools: • MS Excel • Pentaho • Tableau • Jaspersoft • Talend
  • 141. Monitoring Cassandra • Integration with Ganglia (distributed performance monitoring tool). • Several system-level metrics have been exposed to Ganglia. • Helps in understanding the system’s behavior in production conditions.
  • 142. Installation on Windows Requirements: • The latest version of Java 8 • The latest version of Python 2.7 • Download the Software (DataStax Community Edition for Apache Cassandra™ v3.0.9 installer)
  • 143. Steps
  • 144. Creating a Virtual Machine Requirements: • Install Ubuntu Server 12.04 LTS 64 bit OS (any linux system with linux kernel -2.6.x or later) • Update the OS • sudo apt-get update (Internet Connection is mandatory)
  • 145. Installing Virtual Machine • Download and install Virtual Box • Steps:
  • 146. Installation on Mac Requirements: • The latest version of Java 8 • The latest version of Python 2.7 • Download the software(DataStax Community Edition for Apache Cassandra™ v3.0.9 Tarball)
  • 147. Steps • Open a terminal window • Navigate to the Cassandra folder (via cd command) and then to bin directory e.g. robinsmac:dev robin$ cd dsc-cassandra-1.2.2/bin • Start Cassandra on terminal e.g. robinsmac:bin robin$ ./cqlsh . • Your terminal window should look like this
  • 148. Installation on Ubuntu Requirements: • The latest version of Java 8 • The latest version of Python 2.7 • Download the software(DataStax Community Edition for Apache Cassandra™ v3.0.9 Tarball)
  • 149. Steps • Start Cassandra in foreground mode • cd home/virtualmachine_name/cassandra/apache-cassandra-2.0.14-bin • bin/cassandra -f • Test Cassandra • bin/cqlsh
  • 150. Steps (1) • Untar Cassandra • cd Cassandra • tar -xvf apache-cassandra-2.0.14-bin.tar • Create necessary directories and change ownership • sudo mkdir /var/lib/cassandra • sudo mkdir /var/log/cassandra • sudo chown -R $USER:$GROUP /var/lib/cassandra • sudo chown -R $USER:$GROUP /var/log/cassandra
  • 161. References 1. A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2): 35-40, 2010 2. Cassandra.apache.org. (2016). Apache Cassandra. [online] Available at: http://cassandra.apache.org/ 3. Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), p.12. 4. Cockcroft, A. (2011). Benchmarking Cassandra Scalability on AWS - Over a million writes per second. [online] Techblog.netflix.com. Available at: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability- on.html 5. Cs.uwaterloo.ca. (2016). [online] Available at: https://cs.uwaterloo.ca/~tozsu/courses/CS848/W15/presentations/Cassandra.pdf 6. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A. and Gruber, R. (2008). Bigtable. ACM Transactions on Computer Systems, 26(2), pp.1-26. 7. DataStax. (2016). Case Studies. [online] Available at: http://www.datastax.com/resources/casestudies
  • 162. References (1) 8. Docs.datastax.com. (2016). About hinted handoff writes. [online] Available at: https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_about_hh_c.html 9. DataStax. (2016). Customers. [online] Available at: http://www.datastax.com/customers 10. Docs.datastax.com. (2016). Introduction to Cassandra Query Language. [online] Available at: https://docs.datastax.com/en/cql/3.1/cql/cql_intro_c.html 11. DataStax. (2014). What on earth are people using Cassandra for anyway?. [online] Available at: http://www.datastax.com/2014/06/what-are-people-using-cassandra-for 12. DataStax. (2012). A thrift to CQL3 upgrade guide. [online] Available at: http://www.datastax.com/dev/blog/thrift-to-cql3 13. DataStax. (2012). Virtual nodes in Cassandra 1.2. [online] Available at: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2 14. DataStax. (2012). Schema in Cassandra 1.1. [online] Available at: http://www.datastax.com/dev/blog/schema- in-cassandra-1-1
  • 163. References (2) 15. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P. and Vogels, W. (2007). Dynamo. ACM SIGOPS Operating Systems Review, 41(6), p.205. 16. Docs.datastax.com. (2016). Architecture in brief. [online] Available at: https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureIntro_c.html 17. Docs.datastax.com. (2016). How data is distributed across a cluster (using virtual nodes). [online] Available at: http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html 18. Docs.datastax.com. (2016). Internode communications (gossip). [online] Available at: https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureGossipAbout_c.html 19. D0.awsstatic.com. (2016). [online] Available at: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf 20. Edlich, P. (2016). NOSQL Databases. [online] Nosql-database.org. Available at: http://nosql-database.org/
  • 164. References (3) 21. Edu.dmst.aueb.gr. (2016). Πύλη Τηλεκπαίδευσης Τμήματος Διοικητικής Επιστήμης & Τεχνολογίας: Είσοδος στο δικτυακό τόπο. [online] Available at: https://edu.dmst.aueb.gr/pluginfile.php/3614/mod_resource/content/0/BigDataSystems.pdf 22. En.wikipedia.org. (2016). Apache Cassandra. [online] Available at: https://en.wikipedia.org/wiki/Apache_Cassandra 23. En.wikipedia.org. (2016). DataStax. [online] Available at: https://en.wikipedia.org/wiki/DataStax 24. En.wikipedia.org. (2016). Log-structured merge-tree. [online] Available at: https://en.wikipedia.org/wiki/Log- structured_merge-tree 25. Exponential.io. (2016). Cassandra terminology - Exponential.io . [online] Available at: http://exponential.io/blog/2015/01/08/cassandra-terminology/
  • 165. References (4) 26. Facebook.com. (2016). Cassandra – A structured storage system on a P2P Network. [online] Available at: https://www.facebook.com/notes/facebook-engineering/cassandra-a-structured-storage-system-on-a-p2p- network/24413138919/ 27. O&#039, P. and Neil, E. (2016). The Log-Structured Merge-Tree (LSM-Tree). [online] Citeseerx.ist.psu.edu. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.2782 28. YouTube. (2016). Getting Started with Cassandra CQL on a Mac. [online] Available at: https://www.youtube.com/watch?v=9zQc959w6Ho 29. YouTube. (2016). Installing Apache Cassandra In Windows. [online] Available at: https://www.youtube.com/watch?v=fspXzjwfii0 30. YouTube. (2016). Part 1 - Apache Cassandra Installation From Scratch - Ubuntu. [online] Available at: https://www.youtube.com/watch?v=ToztU48UxYE
  • 166. References (5) 31. Weinberger, M. (2016). The Facebook engineer who taught its data how to dance is solving a new complicated problem. [online] Business Insider. Available at: http://www.businessinsider.com/hedvig-avinash-lakshman-facebook-cassandra-data-storage-2015-3 32. Wiki.apache.org. (2016). FrontPage - Cassandra Wiki. [online] Available at: https://wiki.apache.org/cassandra/ 33. www.tutorialspoint.com. (2016). Cassandra Introduction. [online] Available at: https://www.tutorialspoint.com/cassandra/cassandra_introduction.htm

Editor's Notes

  1. Source: http://nosql-database.org/. BASE: Basically Available, Soft state, Eventual consistency. ACID: Atomicity, Consistency, Isolation, Durability.
  2. Source: NoSQL paper, http://nosql-database.org/
  3. Sources: Dynamo paper, Cassandra paper, NoSQL paper. Fetched data is not guaranteed to be up to date, but updates are guaranteed to propagate to all nodes eventually.
  4. https://en.wikipedia.org/wiki/Apache_Cassandra
  5. www.businessinsider.com/hedvig-avinash-lakshman-facebook-cassandra-data-storage-2015-3
  6. ACID (Atomicity, Consistency, Isolation, Durability)
  7. Stats provided by authors using Facebook data
  8. Source: cloud_cassandra paper
  9. Adjusted from: https://cs.uwaterloo.ca/~tozsu/courses/CS848/W15/presentations/Cassandra.pdf
  10. Source: Cassandra paper
  11. http://www.datastax.com/dev/blog/facebooks-instagram-making-the-switch-to-cassandra-from-redis-a-75-insta-savings
  12. Taken from: http://www.slideshare.net/planetcassandra/courseras-adoption-of-cassandra
  13. Taken from: http://www.slideshare.net/planetcassandra/courseras-adoption-of-cassandra
  14. Taken from: http://www.slideshare.net/planetcassandra/courseras-adoption-of-cassandra
  15. Taken from: http://www.slideshare.net/planetcassandra/courseras-adoption-of-cassandra
  16. Taken from: http://www.slideshare.net/planetcassandra/courseras-adoption-of-cassandra
  17. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.2782
  18. Sources: cassandra paper, slides (eduportal), fb post: https://www.facebook.com/notes/facebook-engineering/cassandra-a-structured-storage-system-on-a-p2p-network/24413138919/
  19. Sources: cassandra paper, slides (eduportal), fb post: https://www.facebook.com/notes/facebook-engineering/cassandra-a-structured-storage-system-on-a-p2p-network/24413138919/
  20. Figure taken from the slides of Eben Hewitt (author of O'Reilly's Cassandra book).
  21. Source: https://cs.uwaterloo.ca/~tozsu/courses/CS848/W15/presentations/Cassandra.pdf
  22. Source: http://www.datastax.com/dev/blog/schema-in-cassandra-1-1
  23. Source: cassandra paper
  24. Source: cassandra paper
  25. Source: cassandra paper
  26. Source: cassandra paper
  27. Figure taken from Avinash Lakshman and Prashant Malik slides
  28. Source: http://cassandra.apache.org/doc/latest/operating/topo_changes.html#bootstrap
  29. Source: http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
  30. Source: cassandra paper
  31. Figure taken from Avinash Lakshman and Prashant Malik slides
  32. Source: cassandra paper
  33. Source: cassandra paper
  34. Figure taken from Avinash Lakshman and Prashant Malik slides
  35. Sources: Cassandra paper; https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureGossipAbout_c.html; https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureIntro_c.html
  36. Figure taken from Avinash Lakshman and Prashant Malik slides
  37. Figure taken from Avinash Lakshman and Prashant Malik slides
  38. Figure taken from Avinash Lakshman and Prashant Malik slides
  39. Figure taken from Avinash Lakshman and Prashant Malik slides
  40. Source: cassandra paper
  41. Source: cassandra paper
  42. Source: cassandra paper
  43. Source: cassandra paper
  44. Source: cassandra paper
  45. Source: cassandra paper
  46. Source: cassandra paper
  47. Source: cassandra paper
  48. Source: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
  49. Figure taken from Avinash Lakshman and Prashant Malik slides
  50. Taken from a Cassandra video (screenshots)
  51. Taken from a Cassandra video (screenshots)
  52. Taken from a Cassandra video (screenshots)
  53. Taken from a Cassandra video (screenshots)
  54. Source: cassandra paper
  55. Source: cassandra paper
  56. Source: cassandra paper
  57. Source: AWS whitepaper
  58. Figure taken from Avinash Lakshman and Prashant Malik slides
  59. Figure taken from Avinash Lakshman and Prashant Malik slides
  60. Source: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
  61. Source: Cassandra wiki
  62. Source: Cassandra paper
  63. Source: http://www.datastax.com/dev/blog/thrift-to-cql3
  64. https://docs.datastax.com/en/cql/3.1/cql/cql_intro_c.html
  65. Source: Cloud_Cassandra paper
  66. Source: datastax
  67. Source: Cassandra paper
  68. Photo created by the authors
  69. Photo created by the authors
  70. Photo created by the authors