SlideShare a Scribd company logo
1 of 43
BigData As A Platform
Cassandra and current trends

Matthew F. Dennis // @mdennis
8th Advanced Computing Conference
September 12, 2012
Seoul, South Korea
Why Does BigData Matter To Business?
Effective Use of BigData Leads To Success
And Businesses Have Noticed …
     (according to indeed.com job trends)
Requirements Of A BigData Platform


●   Performance


●   Scalability


●   Availability
Measuring Performance


●   throughput (in ops/sec)



●   latency (for a single request)
Uncommon Cassandra Performance Features

●   No Locking

●   Log Structured Storage Engine

●   Highly Parallel

●   Immutable On Disk Structures

●   On Disk Compression

●   No “Master” Nodes
Cassandra Throughput
Cassandra Latency
                                           (in microseconds)




http://www.slideshare.net/adrianco/cassandra-performance-on-aws
Cassandra + SSD


But can you get such low latency and high
throughput for random reads from disk?


With Cassandra + SSDs, YES!
(SSD latency is usually only ~100 us)
A Random Note About C* and SSDs

     ●   Cassandra can use cheap consumer grade MLC
         SSDs (~$1.00 USD / GB)
     ●   no in-place updates results in far fewer erase
         cycles on the drive which results in the drive
         lasting longer
     ●   Compared to nearly all other databases,
         consumer SSDs last ~10x longer on C*
     ●   to put it another way MLC drives last about as
         long with C* as enterprise SLC drives last with
         most other databases

http://www.anandtech.com/show/5518/a-look-at-enterprise-performance-of-intel-ssds
Why Not On Rotational Disks Too?


●   rotational disks require ~8ms per seek
●   note that this is a HW limitation, an
    absolute upper limit (for that HW)
●   no system can do better than the seek
    time when randomly retrieving data from
    disk (and most do far worse)
What About Writes/Updates?

●   all write I/O in Cassandra is sequential
●   no global write lock
●   no Btrees
●   compare to MySQL, BerkeleyDB, MongoDB,
    Oracle, et cetera which either lock (sometimes
    with a global lock) and/or generate random
    writes for updates
●   locking is not the only way to handle
    concurrency !!!
Larger Than Memory Datasets

●   write performance degrades only
    marginally as the dataset outgrows
    memory; essentially no change in latency
    or throughput

●   read performance degrades gracefully and
    is relative to the percent of data in memory
Most Systems Do Not Behave That Way
Performance Compared to HBase

●   10x better read throughput
●   8x better write throughput
●   8x better read latency
●   10x better write latency
    (even when HBase was running without durability)


    University of Toronto, Canada
    Middleware Systems Research Group, et al
    38th International Conference On Very Large Data Bases
    http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
And We’re Not Even Finished Yet ...


●   native CQL transport layer
●   unified off-heap row and key caches
●   vnodes
●   no read-before-write on secondary indexes
●   native JBOD support
●   straight up profiler driven optimizations
Measuring Scalability


If performance is measured in throughput
and latency, then scalability is the stability
of latency as throughput increases (or the
stability of latency and throughput as “load”
increases); essentially scalability is how
well a system handles growth
Linear Scalability

  If for all values of X, Y, Z, C and N:

latency=X                   latency=X
throughput=Y                throughput=Y
“load”=Z          implies   “load”=CZ
nodes=N                     nodes=CN

 Then: the system is perfectly linearly
       scalable with respect to “load”
Linear Scalability

   If for all values of X, Y, Z, C and N:

latency=X                    latency=X
throughput=Y                 throughput=CY
“load”=Z           implies   “load”=Z
nodes=N                      nodes=CN

Then: the system is perfectly linearly
      scalable with respect to throughput
Cassandra Scalability


“In terms of scalability, there is a clear winner throughout our
experiments. Cassandra achieves the highest throughput for the
maximum number of nodes in all experiments with [nearly linear]
increasing throughput from 1 to 12 nodes.”


University of Toronto, Canada
Middleware Systems Research Group, et al
38th International Conference On Very Large Data Bases
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Cassandra Scalability

                reads: 25%
                scans: 25%
                writes: 50%




University of Toronto, Canada
Middleware Systems Research Group, et al
38th International Conference On Very Large Data Bases
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Cassandra Scalability
                      (client operations per second at RF=3)




http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Requirements Of A BigData Platform

    Performance


    Scalability


●Availability – performance and scalability
are easy if you ignore availability and/or
assume failures never happen
Measuring Availability




availability is measured by the amount of
downtime a system has over a given
period of time
Common Causes of Downtime
             In Large Scale Systems

●   Component Failure (disk)

●   Machine Failure (NIC, cpu, power supply)

●   Rack Failure (router, switch, UPS, AC)

●   Site Failure (power grid, natural disaster, war, coup)
The Common Theme In The Solutions?


            Replication
Legacy Replication




can be HW or SW, replicated or not
  this is not necessarily a SPOF
Legacy Replication




          SPOF                SPOF   SPOF   SPOF




the master/leader is a SPOF
Thoughts On Availability
           (and legacy replication)
“High availability implies that a single fault will not
bring down your system. Not ‘we’ll recover quickly.’”
                                  -- Ben Coverston, DataStax




“The biggest problem with failover is that you’re
almost never using it until it really hurts. Its like
backups that you never test.”
                                   -- Rick Branson, Instagram
Cassandra Replication


cloud provider 0         cloud provider 1




 data center 0             data center 1
More Thoughts On Availability
  (and legacy replication)
Continuous Global Availability ...




… requires more than being able to
recover from faults, it requires
being able to tolerate the faults
without downtime in the first place
if you care about
     continuous global availability
then you must serve reads and writes from
    multiple geographical locations


         there is no alternative
Cassandra As A BigData Platform



Performance


Scalability


Availability
And now back to our trends ...
Public Cassandra Users Early 2011
Public Cassandra Users Mid 2012
DataStax Enterprise
DataStax Enterprise
Q?
Matthew F. Dennis // @mdennis
http://slideshare.net/mattdennis
Thank You!
 Matthew F. Dennis // @mdennis
 http://slideshare.net/mattdennis
A Quick Side Note ...

               Cassandra Replication Follows
                   The Dynamo Model *


      http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html




                      Read It!



* Cassandra is not a strict reimplementation of dynamo

More Related Content

What's hot

Cassandra Community Webinar | Data Model on Fire
Cassandra Community Webinar | Data Model on FireCassandra Community Webinar | Data Model on Fire
Cassandra Community Webinar | Data Model on FireDataStax
 
PostgreSQL worst practices, version PGConf.US 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version PGConf.US 2017 by Ilya KosmodemianskyPostgreSQL worst practices, version PGConf.US 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version PGConf.US 2017 by Ilya KosmodemianskyPostgreSQL-Consulting
 
Building Antifragile Applications with Apache Cassandra
Building Antifragile Applications with Apache CassandraBuilding Antifragile Applications with Apache Cassandra
Building Antifragile Applications with Apache CassandraPatrick McFadin
 
Redis on NVMe SSD - Zvika Guz, Samsung
 Redis on NVMe SSD - Zvika Guz, Samsung Redis on NVMe SSD - Zvika Guz, Samsung
Redis on NVMe SSD - Zvika Guz, SamsungRedis Labs
 
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...ScyllaDB
 
NewSQL - The Future of Databases?
NewSQL - The Future of Databases?NewSQL - The Future of Databases?
NewSQL - The Future of Databases?Elvis Saravia
 
Hybrid Storage Pools (Now with the benefit of hindsight!)
Hybrid Storage Pools (Now with the benefit of hindsight!)Hybrid Storage Pools (Now with the benefit of hindsight!)
Hybrid Storage Pools (Now with the benefit of hindsight!)ahl0003
 
Why does my choice of storage matter with cassandra?
Why does my choice of storage matter with cassandra?Why does my choice of storage matter with cassandra?
Why does my choice of storage matter with cassandra?Johnny Miller
 
CaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value StoreCaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value StoreTilmann Rabl
 
Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State DrivesRick Branson
 
Logical replication with pglogical
Logical replication with pglogicalLogical replication with pglogical
Logical replication with pglogicalUmair Shahid
 
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...DataStax Academy
 
Scylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDSScylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDSScyllaDB
 
PostgreSQL Scaling And Failover
PostgreSQL Scaling And FailoverPostgreSQL Scaling And Failover
PostgreSQL Scaling And FailoverJohn Paulett
 
Propelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsPropelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsSingleStore
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...JAXLondon2014
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonUri Cohen
 
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...ScyllaDB
 
Ceph as storage for CloudStack
Ceph as storage for CloudStack Ceph as storage for CloudStack
Ceph as storage for CloudStack Ceph Community
 

What's hot (20)

Cassandra at scale
Cassandra at scaleCassandra at scale
Cassandra at scale
 
Cassandra Community Webinar | Data Model on Fire
Cassandra Community Webinar | Data Model on FireCassandra Community Webinar | Data Model on Fire
Cassandra Community Webinar | Data Model on Fire
 
PostgreSQL worst practices, version PGConf.US 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version PGConf.US 2017 by Ilya KosmodemianskyPostgreSQL worst practices, version PGConf.US 2017 by Ilya Kosmodemiansky
PostgreSQL worst practices, version PGConf.US 2017 by Ilya Kosmodemiansky
 
Building Antifragile Applications with Apache Cassandra
Building Antifragile Applications with Apache CassandraBuilding Antifragile Applications with Apache Cassandra
Building Antifragile Applications with Apache Cassandra
 
Redis on NVMe SSD - Zvika Guz, Samsung
 Redis on NVMe SSD - Zvika Guz, Samsung Redis on NVMe SSD - Zvika Guz, Samsung
Redis on NVMe SSD - Zvika Guz, Samsung
 
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...
Scylla Summit 2018: Rebuilding the Ceph Distributed Storage Solution with Sea...
 
NewSQL - The Future of Databases?
NewSQL - The Future of Databases?NewSQL - The Future of Databases?
NewSQL - The Future of Databases?
 
Hybrid Storage Pools (Now with the benefit of hindsight!)
Hybrid Storage Pools (Now with the benefit of hindsight!)Hybrid Storage Pools (Now with the benefit of hindsight!)
Hybrid Storage Pools (Now with the benefit of hindsight!)
 
Why does my choice of storage matter with cassandra?
Why does my choice of storage matter with cassandra?Why does my choice of storage matter with cassandra?
Why does my choice of storage matter with cassandra?
 
CaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value StoreCaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value Store
 
Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
 
Logical replication with pglogical
Logical replication with pglogicalLogical replication with pglogical
Logical replication with pglogical
 
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...
San Francisco Cassadnra Meetup - March 2014: I/O Performance tuning on AWS fo...
 
Scylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDSScylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDS
 
PostgreSQL Scaling And Failover
PostgreSQL Scaling And FailoverPostgreSQL Scaling And Failover
PostgreSQL Scaling And Failover
 
Propelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsPropelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive Analytics
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
 
Ceph as storage for CloudStack
Ceph as storage for CloudStack Ceph as storage for CloudStack
Ceph as storage for CloudStack
 

Viewers also liked

durability, durability, durability
durability, durability, durabilitydurability, durability, durability
durability, durability, durabilityMatthew Dennis
 
DZone Cassandra Data Modeling Webinar
DZone Cassandra Data Modeling WebinarDZone Cassandra Data Modeling Webinar
DZone Cassandra Data Modeling WebinarMatthew Dennis
 
Cassandra, Modeling and Availability at AMUG
Cassandra, Modeling and Availability at AMUGCassandra, Modeling and Availability at AMUG
Cassandra, Modeling and Availability at AMUGMatthew Dennis
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingMatthew Dennis
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big DataMatthew Dennis
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data ModelingMatthew Dennis
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccsrisatish ambati
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13Dave Gardner
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13Dave Gardner
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Dave Gardner
 
Validation driven change
Validation driven changeValidation driven change
Validation driven changeMichael Goetz
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraDave Gardner
 
Cabs, Cassandra, and Hailo
Cabs, Cassandra, and HailoCabs, Cassandra, and Hailo
Cabs, Cassandra, and HailoDave Gardner
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsDave Gardner
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Modelebenhewitt
 
Learning Cassandra
Learning CassandraLearning Cassandra
Learning CassandraDave Gardner
 
Generic VNF configuration management and orchestration
Generic VNF configuration management and orchestrationGeneric VNF configuration management and orchestration
Generic VNF configuration management and orchestration🦾 Adam Israel
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systemsDave Gardner
 

Viewers also liked (19)

durability, durability, durability
durability, durability, durabilitydurability, durability, durability
durability, durability, durability
 
DZone Cassandra Data Modeling Webinar
DZone Cassandra Data Modeling WebinarDZone Cassandra Data Modeling Webinar
DZone Cassandra Data Modeling Webinar
 
Cassandra, Modeling and Availability at AMUG
Cassandra, Modeling and Availability at AMUGCassandra, Modeling and Availability at AMUG
Cassandra, Modeling and Availability at AMUG
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data Modeling
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svcc
 
Cassandra and Hybrid Cloud - Introducing Mache
Cassandra and Hybrid Cloud - Introducing MacheCassandra and Hybrid Cloud - Introducing Mache
Cassandra and Hybrid Cloud - Introducing Mache
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)
 
Validation driven change
Validation driven changeValidation driven change
Validation driven change
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache Cassandra
 
Cabs, Cassandra, and Hailo
Cabs, Cassandra, and HailoCabs, Cassandra, and Hailo
Cabs, Cassandra, and Hailo
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Model
 
Learning Cassandra
Learning CassandraLearning Cassandra
Learning Cassandra
 
Generic VNF configuration management and orchestration
Generic VNF configuration management and orchestrationGeneric VNF configuration management and orchestration
Generic VNF configuration management and orchestration
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systems
 

Similar to BigData as a Platform: Cassandra and Current Trends

Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databasesjbellis
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operationniallmilton
 
Scaling Your Database In The Cloud
Scaling Your Database In The CloudScaling Your Database In The Cloud
Scaling Your Database In The CloudCory Isaacson
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...DataStax Academy
 
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack Clustrix
 
Cassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsCassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsPanagiotis Papadopoulos
 
Compare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL ServerCompare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL ServerAlexDepo
 
CodeFutures - Scaling Your Database in the Cloud
CodeFutures - Scaling Your Database in the CloudCodeFutures - Scaling Your Database in the Cloud
CodeFutures - Scaling Your Database in the CloudRightScale
 
Cassandra Operations at Netflix
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflixgreggulrich
 
Pythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterPythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterDataStax Academy
 
How netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudHow netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudVinay Kumar Chella
 
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...Matt Stubbs
 
Why new hardware may not make Oracle databases faster
Why new hardware may not make Oracle databases fasterWhy new hardware may not make Oracle databases faster
Why new hardware may not make Oracle databases fasterSolarWinds
 
NoSQL Database
NoSQL DatabaseNoSQL Database
NoSQL DatabaseSteve Min
 
Storage, San And Business Continuity Overview
Storage, San And Business Continuity OverviewStorage, San And Business Continuity Overview
Storage, San And Business Continuity OverviewAlan McSweeney
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at PollfishPollfish
 

Similar to BigData as a Platform: Cassandra and Current Trends (20)

Clustering van IT-componenten
Clustering van IT-componentenClustering van IT-componenten
Clustering van IT-componenten
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operation
 
Scaling Your Database In The Cloud
Scaling Your Database In The CloudScaling Your Database In The Cloud
Scaling Your Database In The Cloud
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
 
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
 
Cassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsCassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and Limitations
 
Compare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL ServerCompare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL Server
 
CodeFutures - Scaling Your Database in the Cloud
CodeFutures - Scaling Your Database in the CloudCodeFutures - Scaling Your Database in the Cloud
CodeFutures - Scaling Your Database in the Cloud
 
Cassandra
CassandraCassandra
Cassandra
 
Cassandra Operations at Netflix
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflix
 
Pythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterPythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra Cluster
 
How netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudHow netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloud
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
 
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
Big Data LDN 2016: Kick Start your Big Data project with Hyperconverged Infra...
 
Why new hardware may not make Oracle databases faster
Why new hardware may not make Oracle databases fasterWhy new hardware may not make Oracle databases faster
Why new hardware may not make Oracle databases faster
 
NoSQL Database
NoSQL DatabaseNoSQL Database
NoSQL Database
 
Storage, San And Business Continuity Overview
Storage, San And Business Continuity OverviewStorage, San And Business Continuity Overview
Storage, San And Business Continuity Overview
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

BigData as a Platform: Cassandra and Current Trends

  • 1. BigData As A Platform Cassandra and current trends Matthew F. Dennis // @mdennis 8th Advanced Computing Conference September 12, 2012 Seoul, South Korea
  • 2. Why Does BigData Matter To Business?
  • 3. Effective Use of BigData Leads To Success
  • 4. And Businesses Have Noticed … (according to indeed.com job trends)
  • 5. Requirements Of A BigData Platform ● Performance ● Scalability ● Availability
  • 6. Measuring Performance ● throughput (in ops/sec) ● latency (for a single request)
  • 7. Uncommon Cassandra Performance Features ● No Locking ● Log Structured Storage Engine ● Highly Parallel ● Immutable On Disk Structures ● On Disk Compression ● No “Master” Nodes
  • 9. Cassandra Latency (in microseconds) http://www.slideshare.net/adrianco/cassandra-performance-on-aws
  • 10. Cassandra + SSD But can you get such low latency and high throughput for random reads from disk? With Cassandra + SSDs, YES! (SSD latency is usually only ~100 us)
  • 11. A Random Note About C* and SSDs ● Cassandra can use cheap consumer grade MLC SSDs (~$1.00 USD / GB) ● no in-place updates results in far fewer erase cycles on the drive which results in the drive lasting longer ● Compared to nearly all other databases, consumer SSDs last ~10x longer on C* ● to put it another way MLC drives last about as long with C* as enterprise SLC drives last with most other databases http://www.anandtech.com/show/5518/a-look-at-enterprise-performance-of-intel-ssds
  • 12. Why Not On Rotational Disks Too? ● rotational disks require ~8ms per seek ● note that this is a HW limitation, an absolute upper limit (for that HW) ● no system can do better than the seek time when randomly retrieving data from disk (and most do far worse)
  • 13. What About Writes/Updates? ● all write I/O in Cassandra is sequential ● no global write lock ● no Btrees ● compare to MySQL, BerkeleyDB, MongoDB, Oracle, et cetera which either lock (sometimes with a global lock) and/or generate random writes for updates ● locking is not the only way to handle concurrency !!!
  • 14. Larger Than Memory Datasets ● write performance degrades only marginally as the dataset outgrows memory; essentially no change in latency or throughput ● read performance degrades gracefully and is relative to the percent of data in memory
  • 15. Most Systems Do Not Behave That Way
  • 16. Performance Compared to HBase ● 10x better read throughput ● 8x better write throughput ● 8x better read latency ● 10x better write latency (even when HBase was running without durability) University of Toronto, Canada Middleware Systems Research Group, et al 38th International Conference On Very Large Data Bases http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  • 17. And We’re Not Even Finished Yet ... ● native CQL transport layer ● unified off-heap row and key caches ● vnodes ● no read-before-write on secondary indexes ● native JBOD support ● straight up profiler driven optimizations
  • 18. Measuring Scalability If performance is measured in throughput and latency, then scalability is the stability of latency as throughput increases (or the stability of latency and throughput as “load” increases); essentially scalability is how well a system handles growth
  • 19. Linear Scalability If for all values of X, Y, Z, C and N: latency=X latency=X throughput=Y throughput=Y “load”=Z implies “load”=CZ nodes=N nodes=CN Then: the system is perfectly linearly scalable with respect to “load”
  • 20. Linear Scalability If for all values of X, Y, Z, C and N: latency=X latency=X throughput=Y throughput=CY “load”=Z implies “load”=Z nodes=N nodes=CN Then: the system is perfectly linearly scalable with respect to throughput
  • 21. Cassandra Scalability “In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with [nearly linear] increasing throughput from 1 to 12 nodes.” University of Toronto, Canada Middleware Systems Research Group, et al 38th International Conference On Very Large Data Bases http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  • 22. Cassandra Scalability reads: 25% scans: 25% writes: 50% University of Toronto, Canada Middleware Systems Research Group, et al 38th International Conference On Very Large Data Bases http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  • 23. Cassandra Scalability (client operations per second at RF=3) http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
  • 24. Requirements Of A BigData Platform Performance Scalability ●Availability – performance and scalability are easy if you ignore availability and/or assume failures never happen
  • 25. Measuring Availability availability is measured by the amount of downtime a system has over a given period of time
  • 26. Common Causes of Downtime In Large Scale Systems ● Component Failure (disk) ● Machine Failure (NIC, cpu, power supply) ● Rack Failure (router, switch, UPS, AC) ● Site Failure (power grid, natural disaster, war, coup)
  • 27. The Common Theme In The Solutions? Replication
  • 28. Legacy Replication can be HW or SW, replicated or not this is not necessarily a SPOF
  • 29. Legacy Replication SPOF SPOF SPOF SPOF the master/leader is a SPOF
  • 30. Thoughts On Availability (and legacy replication) “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston, DataStax “The biggest problem with failover is that you’re almost never using it until it really hurts. Its like backups that you never test.” -- Rick Branson, Instagram
  • 31. Cassandra Replication cloud provider 0 cloud provider 1 data center 0 data center 1
  • 32. More Thoughts On Availability (and legacy replication)
  • 33. Continuous Global Availability ... … requires more than being able to recover from faults, it requires being able to tolerate the faults without downtime in the first place
  • 34. if you care about continuous global availability then you must serve reads and writes from multiple geographical locations there is no alternative
  • 35. Cassandra As A BigData Platform Performance Scalability Availability
  • 36. And now back to our trends ...
  • 41. Q? Matthew F. Dennis // @mdennis http://slideshare.net/mattdennis
  • 42. Thank You! Matthew F. Dennis // @mdennis http://slideshare.net/mattdennis
  • 43. A Quick Side Note ... Cassandra Replication Follows The Dynamo Model * http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Read It! * Cassandra is not a strict reimplementation of dynamo