Spark with Cassandra by Christopher Batey

•

8 likes•3,514 views

Spark Summit

Data & Analytics

@chbatey
Christopher Batey - @chbatey
Freelance Software engineer/Devops/Architect
Loves:
Breaking software
Distributed systems
Hates:
Fragile software
Untested code :(
Introduction

Overview
Cassandra architecture
Modelling time series - weather data for many
stations
What can be done with pure C*
When to introduce Spark

What do we use Spark for?
Batch processing
Machine Learning
Ad-hoc querying of large datasets
Streaming processing

What do we use Cassandra
for?
Operational Database
OLTP

@chbatey
Master slave
Master
Async replication
Slave

@chbatey
Consistent hashing
jim age: 36 car: ford gender: M
carol age: 37 car: bmw gender: F
johnny age: 12 gender: M
suzy age: 10 gender: F
Partition Key Hash value
jim 350
carol 998
johnny 50
suzy 600
Partition Key

999
49
0
50
A
B
C
D
249750
749
250
B
CD
A

Example
Node Start range End range Primary
key
Hash value
A 0 249 johnny 50
B 250 499 jim 350
C 500 749 suzy 600
D 750 999 carol 998

@chbatey
Fault tolerance
Replicate each price of data on multiple nodes
Keep replicas on different racks
Datacenter aware

DC2
client
RF3 RF3
C
C
WRITE
CL = 1 We have
replication!
DC1

Storing weather data
CREATE TABLE raw_weather_data (
weather_station text,
year int,
month int,
day int,
hour int,
temp double,
PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour
DESC);

@chbatey
Primary key relationship
PRIMARY KEY ((weatherstation_id),year,month,day,hour)

Data Locality
weatherstation_id=‘10010:99999’ ?
1000 Node Cluster
You are here!

Primary key relationship
PRIMARY KEY ((weatherstation_id),year,month,day,hour)
Partition Key Clustering Columns
10010:99999

2005:12:1:8:temp 2005:12:1:7:temp
-5.6
PRIMARY KEY ((weatherstation_id),year,month,day,hour)
Partition Key Clustering Columns
10010:99999
-5.1
2005:12:1:10:temp
-5.3
2005:12:1:9:temp
-4.9
Primary key relationship

I have a question!!
What happens if I want to
do an adhoc query??

I’ve stored the data
partitioned by weather id…
… now I want a report for
all stations

I’ve stored the raw weather
data…
… now I want rollups/
aggregates

Deployment
- Spark worker on each of the
Cassandra nodes
- Partitions made up of
LOCAL cassandra data
S C
S C
S C
S C

Cassandra Data is Distributed By Token Range
0
500
Node 1
Node 2
Node 3
Node 4

Cassandra Data is Distributed By Token Range
0
500
Node 1
Node 2
Node 3
Node 4
Without vnodes

Cassandra Data is Distributed By Token Range
0
500
Node 1
Node 2
Node 3
Node 4
With vnodes

Each Spark partition is made up of
token ranges that live on the same
node

Each Spark partition is made up of
Cassandra partitions that are on the
same node

(count: 24, mean: 14.428150, stdev: 7.092196, max: 28.034969, min: 0.675863)
Partition key
=
Single node

(count: 11242, mean: 8.921956, stdev: 7.428311, max: 29.997986, min: -2.200000)
No partition key
=
Every node

daily_aggregate_precip
CREATE TABLE daily_aggregate_precip (
weather_station text,
year int,
month int,
day int,
precipitation counter,
PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
SELECT precipitation FROM daily_aggregate_precip
WHERE weather_station='010010:99999'
AND year=2005 AND month=12 AND day>=1 AND day <= 7;

725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0

Building an aggregate
CREATE TABLE daily_aggregate_precip (
weather_station text,
year int,
month int,
day int,
precipitation counter,
PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
CQL Counter

Want more Spark/C*
goodness?
@helenaedelson

Conclusion
Cassandra = OLTP database for the large scale
Spark can be used to do complex queries in a
partition
Or analytical queries for an entire table
Spark streaming to keep tables up to date

Thanks for listening
Questions later? @chbatey

What's hot

Time series with Apache Cassandra - Long versionPatrick McFadin

Analytics with Cassandra & SparkMatthias Niehoff

Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Summit

SMACK Stack 1.1Joe Stein

Time Series Processing with Apache SparkJosef Adersberger

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayMatthias Niehoff

Apache cassandra & apache spark for time series dataPatrick McFadin

Kafka Lambda architecture with mirroringAnant Rustagi

Feeding Cassandra with Spark-Streaming and KafkaDataStax Academy

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson

Spark Streaming with CassandraJacek Lewandowski

OLAP with Cassandra and SparkEvan Chan

Leveraging the Power of Solr with SparkQAware GmbH

Adding Complex Data to Spark Stack by Tug GrallSpark Summit

Druid meetup 4th_sql_on_druidYousun Jeong

Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer

DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax Academy

The How and Why of Fast Data Analytics with Apache SparkLegacy Typesafe (now Lightbend)

Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...DataStax

Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteSpark Summit

What's hot (20)

Time series with Apache Cassandra - Long version

Analytics with Cassandra & Spark

Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...

SMACK Stack 1.1

Time Series Processing with Apache Spark

Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

Apache cassandra & apache spark for time series data

Kafka Lambda architecture with mirroring

Feeding Cassandra with Spark-Streaming and Kafka

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...

Spark Streaming with Cassandra

OLAP with Cassandra and Spark

Leveraging the Power of Solr with Spark

Adding Complex Data to Spark Stack by Tug Grall

Druid meetup 4th_sql_on_druid

Spark Cassandra Connector: Past, Present, and Future

DataStax and Esri: Geotemporal IoT Search and Analytics

The How and Why of Fast Data Analytics with Apache Spark

Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra...

Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette

Similar to Spark with Cassandra by Christopher Batey

Fast track to getting started with DSE Max @ INGDuyhai Doan

PresentationDimitris Stripelis

Apache cassandra and spark. you got the the lighter, let's start the firePatrick McFadin

Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan

TechEvent Apache CassandraTrivadis

Data Science Lab Meetup: Cassandra and SparkChristopher Batey

1 Dundee - Cassandra 101Christopher Batey

Cassandra, web scale no sql data platformMarko Švaljek

Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...DataStax Academy

Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB

A Cassandra + Solr + Spark Love Triangle Using DataStax EnterprisePatrick McFadin

Spark & Cassandra - DevFest CórdobaJose Mº Muñoz

Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScyllaDB

Cassandra Deep Diver & Data ModelingBrian Enochson

Nike Tech Talk: Double Down on Apache Cassandra and SparkPatrick McFadin

Cassandra and Sparknickmbailey

Lambda at Weather Scale - Cassandra Summit 2015Robbie Strickland

Cassandra + Spark (You’ve got the lighter, let’s start a fire)Robert Stupp

Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur

Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab

Similar to Spark with Cassandra by Christopher Batey (20)

Fast track to getting started with DSE Max @ ING

Presentation

Apache cassandra and spark. you got the the lighter, let's start the fire

Cassandra and Spark, closing the gap between no sql and analytics codemotio...

TechEvent Apache Cassandra

Data Science Lab Meetup: Cassandra and Spark

1 Dundee - Cassandra 101

Cassandra, web scale no sql data platform

Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...

Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla

A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise

Spark & Cassandra - DevFest Córdoba

Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All

Cassandra Deep Diver & Data Modeling

Nike Tech Talk: Double Down on Apache Cassandra and Spark

Cassandra and Spark

Lambda at Weather Scale - Cassandra Summit 2015

Cassandra + Spark (You’ve got the lighter, let’s start a fire)

Unlocking Your Hadoop Data with Apache Spark and CDH5

Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...

Recently uploaded

VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor

Invezz.com - Grow your wealth with trading signalsInvezz1

Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten

Edukaciniai dropshipping via API with DroFxolyaivanovalion

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71

定制英国白金汉大学毕业证（UCB毕业证书）成绩单原版一比一ffjhghh

RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor

Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann

Ravak dropshipping via API with DroFx.pptxolyaivanovalion

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H

BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692

Industrialised data - the key to AI success.pdfLars Albertsson

Carero dropshipping via API with DroFx.pptxolyaivanovalion

Midocean dropshipping via API with DroFxolyaivanovalion

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal

100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure

Recently uploaded (20)

VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...

Invezz.com - Grow your wealth with trading signals

Log Analysis using OSSEC sasoasasasas.pptx

Edukaciniai dropshipping via API with DroFx

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha

定制英国白金汉大学毕业证（UCB毕业证书）成绩单原版一比一

RA-11058_IRR-COMPRESS Do 198 series of 1998

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai

Generative AI on Enterprise Cloud with NiFi and Milvus

Ravak dropshipping via API with DroFx.pptx

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf

BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx

Industrialised data - the key to AI success.pdf

Carero dropshipping via API with DroFx.pptx

Midocean dropshipping via API with DroFx

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure

100-Concepts-of-AI by Anupama Kate .pptx

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...

Spark with Cassandra by Christopher Batey

1. Spark with Cassandra @chbatey

2. @chbatey Christopher Batey - @chbatey Freelance Software engineer/Devops/Architect Loves: Breaking software Distributed systems Hates: Fragile software Untested code :( Introduction

3. Audience

4. Assumption is…

5. Overview Cassandra architecture Modelling time series - weather data for many stations What can be done with pure C* When to introduce Spark

6. What do we use Spark for? Batch processing Machine Learning Ad-hoc querying of large datasets Streaming processing

7. What do we use Cassandra for? Operational Database OLTP

8. Casandra overview

9. @chbatey Master slave Master Async replication Slave

10. @chbatey Sharding

11. @chbatey The other way

12. @chbatey Consistent hashing jim age: 36 car: ford gender: M carol age: 37 car: bmw gender: F johnny age: 12 gender: M suzy age: 10 gender: F Partition Key Hash value jim 350 carol 998 johnny 50 suzy 600 Partition Key

13. 999 49 0 50 A B C D 249750 749 250 B CD A

14. Example Node Start range End range Primary key Hash value A 0 249 johnny 50 B 250 499 jim 350 C 500 749 suzy 600 D 750 999 carol 998

15. @chbatey Fault tolerance Replicate each price of data on multiple nodes Keep replicas on different racks Datacenter aware

16. DC2 client RF3 RF3 C C WRITE CL = 1 We have replication! DC1

17. Storing weather data CREATE TABLE raw_weather_data ( weather_station text, year int, month int, day int, hour int, temp double, PRIMARY KEY ((weather_station), year, month, day, hour) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

18. @chbatey Primary key relationship PRIMARY KEY ((weatherstation_id),year,month,day,hour)

19. Data Locality weatherstation_id=‘10010:99999’ ? 1000 Node Cluster You are here!

20. Primary key relationship PRIMARY KEY ((weatherstation_id),year,month,day,hour) Partition Key Clustering Columns 10010:99999

21. 2005:12:1:8:temp 2005:12:1:7:temp -5.6 PRIMARY KEY ((weatherstation_id),year,month,day,hour) Partition Key Clustering Columns 10010:99999 -5.1 2005:12:1:10:temp -5.3 2005:12:1:9:temp -4.9 Primary key relationship

22. I have a question!! What happens if I want to do an adhoc query??

23. I’ve stored the data partitioned by weather id… … now I want a report for all stations

24. I’ve stored the raw weather data… … now I want rollups/ aggregates

25. Analytics Workload Isolation

26. Deployment - Spark worker on each of the Cassandra nodes - Partitions made up of LOCAL cassandra data S C S C S C S C

27. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4

28. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4 Without vnodes

29. Cassandra Data is Distributed By Token Range 0 500 Node 1 Node 2 Node 3 Node 4 With vnodes

30. Cassandra RDD

31. Each Spark partition is made up of token ranges that live on the same node

32. Each Spark partition is made up of Cassandra partitions that are on the same node

33. Storing weather data CREATE TABLE raw_weather_data ( weather_station text, year int, month int, day int, hour int, temp double, PRIMARY KEY ((weather_station), year, month, day, hour) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

34. (count: 24, mean: 14.428150, stdev: 7.092196, max: 28.034969, min: 0.675863) Partition key = Single node

35. (count: 11242, mean: 8.921956, stdev: 7.428311, max: 29.997986, min: -2.200000) No partition key = Every node

36. Not quick enough?

37. daily_aggregate_precip CREATE TABLE daily_aggregate_precip ( weather_station text, year int, month int, day int, precipitation counter, PRIMARY KEY ((weather_station), year, month, day) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC); SELECT precipitation FROM daily_aggregate_precip WHERE weather_station='010010:99999' AND year=2005 AND month=12 AND day>=1 AND day <= 7;

38. Weather station info

39. 725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0

40. Creating a Stream

41. Saving the raw data

42. Building an aggregate CREATE TABLE daily_aggregate_precip ( weather_station text, year int, month int, day int, precipitation counter, PRIMARY KEY ((weather_station), year, month, day) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC); CQL Counter

43. Want more Spark/C* goodness? @helenaedelson

44. Conclusion Cassandra = OLTP database for the large scale Spark can be used to do complex queries in a partition Or analytical queries for an entire table Spark streaming to keep tables up to date

45. Thanks for listening Questions later? @chbatey

Spark with Cassandra by Christopher Batey

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Spark with Cassandra by Christopher Batey

Similar to Spark with Cassandra by Christopher Batey (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Spark with Cassandra by Christopher Batey