Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this, or any similar analysis, you need to work with large sequences of measurements over time. And what better way to do that than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and store the results in Apache Cassandra.
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
1. Analyzing Time-Series Data with
Apache Spark and Cassandra
Andrew Psaltis
HDF / IoT Product Solution Architect
@itmdata
StampedeCon 2016
2. If you ever wanted to…
Build models over measurements coming in every second from sensors across the world?
Dig into intra-day trading prices of millions of financial instruments?
Compare hourly page view statistics across every page on Wikipedia?
3. You need to do it over a large sequence of measurements over time.
10. Cassandra architecture is
Shared nothing [1]
Masterless peer-to-peer
Shard-free
Based on Amazon Dynamo and Google BigTable
[1] 1986 paper “The Case for Shared Nothing” - http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
17. Tokens
Consistent hash between -2^63 and 2^63 - 1
Each node owns a range of those values
The token is the beginning of that range, up to the next node’s token value
Virtual nodes break these ranges down further
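The ownership rule above can be sketched in plain Python. This is a toy ring with made-up node names and small token values, not Cassandra's actual implementation; real tokens come from hashing over the full -2^63 .. 2^63 - 1 range.

```python
import bisect

# Hypothetical 4-node ring: each node is assigned a token and owns the
# range from its token up to (but not including) the next node's token.
RING = {
    "node-a": -6_000,
    "node-b": -1_000,
    "node-c": 2_000,
    "node-d": 7_000,
}

def owner_of(token: int, ring: dict) -> str:
    """Return the node whose range contains `token`.

    The owner is the node with the largest token <= the key's token;
    tokens below the smallest node token wrap around to the node with
    the highest token (its range wraps past the end of the ring).
    """
    nodes = sorted(ring.items(), key=lambda kv: kv[1])
    tokens = [t for _, t in nodes]
    i = bisect.bisect_right(tokens, token) - 1
    return nodes[i][0]  # i == -1 wraps to the highest-token node
```

Virtual nodes would simply add many more (node, token) entries per physical node, cutting these ranges into smaller pieces.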
27. What is Spark used for?
A fast, general-purpose engine for large-scale data processing
Provides a framework that supports in-memory cluster computing
Designed for iterative computations and interactive data mining
28. Resilient Distributed Dataset (RDD)
• Created through transformations (map, filter, …) on data or on other RDDs
• Immutable
• Partitioned
• Reusable
29. RDD Partitioning
The number of RDD partitions controls how many parallel tasks can be run against the data stored in the RDD
Hint: in general, make it at least as large as the number of CPU cores in your cluster
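Why the partition count caps parallelism can be shown with a toy sketch in plain Python (not Spark): each partition becomes one task, so four partitions mean at most four tasks, no matter how many cores the cluster has.

```python
def partition(data, num_partitions):
    """Round-robin the elements into `num_partitions` buckets."""
    parts = [[] for _ in range(num_partitions)]
    for i, x in enumerate(data):
        parts[i % num_partitions].append(x)
    return parts

def run_tasks(parts, fn):
    """One 'task' per partition; each task reduces its own partition."""
    return [sum(fn(x) for x in part) for part in parts]

data = list(range(10))
parts = partition(data, 4)          # 4 partitions -> at most 4 parallel tasks
results = run_tasks(parts, lambda x: x * x)
# The combined result is the same regardless of how the data is partitioned.
```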
30. RDD Operations
Transformations - similar to the Scala collections API
• Produce new RDDs
• filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
Actions
• Require materialization of the records to generate a value
• collect: Array[T], count, fold, reduce, …
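The transformation/action split above hinges on laziness: transformations only describe the computation, while an action forces it to run. A minimal sketch of that behavior in plain Python (a toy class, not the Spark API):

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations, eager actions."""

    def __init__(self, source):
        self._source = source          # an iterable; not yet materialized

    # --- transformations: return a new ToyRDD, do no work yet ---
    def map(self, fn):
        return ToyRDD(fn(x) for x in self._source)

    def filter(self, pred):
        return ToyRDD(x for x in self._source if pred(x))

    # --- actions: materialize the records and produce a value ---
    def collect(self):
        return list(self._source)

    def count(self):
        return sum(1 for _ in self._source)

rdd = ToyRDD(range(6)).map(lambda x: x * 10).filter(lambda x: x >= 20)
# Nothing has executed yet; calling `collect()` triggers the whole chain.
```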
31. Data Locality
Spark asks an RDD for a list of its partitions (splits)
Each split consists of one or more token ranges
For every partition:
• Spark gets a list of preferred nodes to process on from the RDD
• Spark creates a task and sends it to one of the nodes for execution
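The locality handshake above can be sketched in plain Python (all names here are hypothetical, not the connector's real internals): each partition knows which token ranges it covers, the ring knows which nodes hold replicas of each range, so the scheduler can prefer a node that already has the data.

```python
TOKEN_RANGE_OWNERS = {          # token-range id -> nodes holding a replica
    "range-0": ["10.0.0.1", "10.0.0.2"],
    "range-1": ["10.0.0.2", "10.0.0.3"],
}
PARTITIONS = {                  # partition id -> token ranges it covers
    0: ["range-0"],
    1: ["range-0", "range-1"],
}

def preferred_locations(partition_id):
    """Nodes holding a replica of every token range in the partition."""
    ranges = PARTITIONS[partition_id]
    common = set(TOKEN_RANGE_OWNERS[ranges[0]])
    for r in ranges[1:]:
        common &= set(TOKEN_RANGE_OWNERS[r])
    return sorted(common)
```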
32. What is Spark Streaming?
• Provides efficient, fault-tolerant stateful stream processing
• Provides a simple API for implementing complex algorithms
• Integrates with Spark’s batch and interactive processing
• Integrates with other Spark extensions
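Spark Streaming discretizes a continuous stream into micro-batches, each processed like a small RDD. This toy sketch in plain Python groups timestamped events into fixed-width batch intervals the same way:

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batch intervals.

    Returns a dict: batch start time -> list of values in that interval.
    """
    batches = {}
    for ts, value in events:
        start = (ts // batch_interval) * batch_interval
        batches.setdefault(start, []).append(value)
    return batches

events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (3.4, "d")]
# With a 1-second interval, "b" and "c" land in the same micro-batch.
```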
39. Spark on Cassandra
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration
40. Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs +
Spark DStreams
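The connector's type conversion and object mapping are implemented in Scala; purely to illustrate the idea, here is a rough plain-Python analogue (all names hypothetical) that converts raw rows into typed records, coercing each column to its declared field type:

```python
from dataclasses import dataclass, fields

@dataclass
class Measurement:
    sensor_id: str
    epoch_ms: int
    value: float

def map_row(row: dict, cls):
    """Build a `cls` instance from a row dict, coercing each column
    to the type declared on the corresponding dataclass field."""
    kwargs = {f.name: f.type(row[f.name]) for f in fields(cls)}
    return cls(**kwargs)

# A "row" as it might arrive: everything stringly typed.
row = {"sensor_id": "s-1", "epoch_ms": "1464739200000", "value": "21.5"}
m = map_row(row, Measurement)
```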
What is Time-Series Data?
Time-series data consists of sequences of measurements, each occurring at a point in time.
A variety of terms are used to describe time-series data, and many of them apply to conflicting or overlapping concepts.
In the interest of clarity, in spark-ts, we stick to a particular vocabulary:
A time series is a sequence of floating-point values, each linked to a timestamp.
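Following that vocabulary, a minimal plain-Python representation (illustrative only) pairs each floating-point value with a regularly spaced timestamp:

```python
from datetime import datetime, timedelta

def make_series(start: datetime, step: timedelta, values):
    """Pair each float value with a regularly spaced timestamp."""
    return [(start + i * step, float(v)) for i, v in enumerate(values)]

# Three hourly observations starting 2016-05-01 00:00.
series = make_series(datetime(2016, 5, 1), timedelta(hours=1), [1, 2, 3])
```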
Since Spark Streaming is really just creating discrete RDDs (DStreams), we will see how we have the opportunity to combine streaming with the rest of the stack.