Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this, or any similar analysis, you need to work with large sequences of measurements over time. And what better way to do that than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and store the results in Apache Cassandra.
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
1. Analyzing Time-Series Data with
Apache Spark and Cassandra
Andrew Psaltis
HDF / IoT Product Solution Architect
@itmdata
StampedeCon 2016
2. If you ever wanted to…
Build models over measurements coming in every second from sensors across the world?
Dig into intra-day trading prices of millions of financial instruments?
Compare hourly page view statistics across every page on Wikipedia?
3. You need to do it over a large sequence of measurements over time.
10. Cassandra architecture is
Shared nothing [1]
Masterless peer-to-peer
Shard-free
Based on Amazon Dynamo and Google BigTable
[1] 1986 paper “The Case for Shared Nothing” - http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
17. Tokens
Consistent hash between -2^63 and 2^63 - 1
Each node owns a range of those values
The token is the beginning of that range, up to the next node’s token value
Virtual nodes break these ranges down further
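The ownership rule above can be sketched in plain Python. This is a toy ring with made-up node names and small token values, not Cassandra's actual implementation; real tokens come from hashing over the full -2^63 .. 2^63 - 1 range.

```python
import bisect

# Hypothetical 4-node ring: each node is assigned a token and owns the
# range from its token up to (but not including) the next node's token.
RING = {
    "node-a": -6_000,
    "node-b": -1_000,
    "node-c": 2_000,
    "node-d": 7_000,
}

def owner_of(token: int, ring: dict) -> str:
    """Return the node whose range contains `token`.

    The owner is the node with the largest token <= the key's token;
    tokens below the smallest node token wrap around to the node with
    the highest token (its range wraps past the end of the ring).
    """
    nodes = sorted(ring.items(), key=lambda kv: kv[1])
    tokens = [t for _, t in nodes]
    i = bisect.bisect_right(tokens, token) - 1
    return nodes[i][0]  # i == -1 wraps to the highest-token node
```

Virtual nodes would simply add many more (node, token) entries per physical node, cutting these ranges into smaller pieces.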
27. What is Spark used for?
A fast, general-purpose engine for large-scale data processing
Provides a framework that supports in-memory cluster computing
Designed for iterative computations and interactive data mining
28. Resilient Distributed Dataset (RDD)
• Created through transformations (map, filter, …) on data or on other RDDs
• Immutable
• Partitioned
• Reusable
29. RDD Partitioning
The number of RDD partitions controls how many parallel tasks can be run against the data stored in the RDD
Hint: in general, make it at least as large as the number of CPU cores in your cluster
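Why the partition count caps parallelism can be shown with a toy sketch in plain Python (not Spark): each partition becomes one task, so four partitions mean at most four tasks, no matter how many cores the cluster has.

```python
def partition(data, num_partitions):
    """Round-robin the elements into `num_partitions` buckets."""
    parts = [[] for _ in range(num_partitions)]
    for i, x in enumerate(data):
        parts[i % num_partitions].append(x)
    return parts

def run_tasks(parts, fn):
    """One 'task' per partition; each task reduces its own partition."""
    return [sum(fn(x) for x in part) for part in parts]

data = list(range(10))
parts = partition(data, 4)          # 4 partitions -> at most 4 parallel tasks
results = run_tasks(parts, lambda x: x * x)
# The combined result is the same regardless of how the data is partitioned.
```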
30. RDD Operations
Transformations - similar to the Scala collections API
• Produce new RDDs
• filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
Actions
• Require materialization of the records to generate a value
• collect: Array[T], count, fold, reduce, …
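The transformation/action split above hinges on laziness: transformations only describe the computation, while an action forces it to run. A minimal sketch of that behavior in plain Python (a toy class, not the Spark API):

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations, eager actions."""

    def __init__(self, source):
        self._source = source          # an iterable; not yet materialized

    # --- transformations: return a new ToyRDD, do no work yet ---
    def map(self, fn):
        return ToyRDD(fn(x) for x in self._source)

    def filter(self, pred):
        return ToyRDD(x for x in self._source if pred(x))

    # --- actions: materialize the records and produce a value ---
    def collect(self):
        return list(self._source)

    def count(self):
        return sum(1 for _ in self._source)

rdd = ToyRDD(range(6)).map(lambda x: x * 10).filter(lambda x: x >= 20)
# Nothing has executed yet; calling `collect()` triggers the whole chain.
```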
31. Data Locality
Spark asks an RDD for a list of its partitions (splits)
Each split consists of one or more token ranges
For every partition:
• Spark gets a list of preferred nodes to process on from the RDD
• Spark creates a task and sends it to one of the nodes for execution
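The locality handshake above can be sketched in plain Python (all names here are hypothetical, not the connector's real internals): each partition knows which token ranges it covers, the ring knows which nodes hold replicas of each range, so the scheduler can prefer a node that already has the data.

```python
TOKEN_RANGE_OWNERS = {          # token-range id -> nodes holding a replica
    "range-0": ["10.0.0.1", "10.0.0.2"],
    "range-1": ["10.0.0.2", "10.0.0.3"],
}
PARTITIONS = {                  # partition id -> token ranges it covers
    0: ["range-0"],
    1: ["range-0", "range-1"],
}

def preferred_locations(partition_id):
    """Nodes holding a replica of every token range in the partition."""
    ranges = PARTITIONS[partition_id]
    common = set(TOKEN_RANGE_OWNERS[ranges[0]])
    for r in ranges[1:]:
        common &= set(TOKEN_RANGE_OWNERS[r])
    return sorted(common)
```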
32. What is Spark Streaming?
• Provides efficient, fault-tolerant stateful stream processing
• Provides a simple API for implementing complex algorithms
• Integrates with Spark’s batch and interactive processing
• Integrates with other Spark extensions
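Spark Streaming discretizes a continuous stream into micro-batches, each processed like a small RDD. This toy sketch in plain Python groups timestamped events into fixed-width batch intervals the same way:

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batch intervals.

    Returns a dict: batch start time -> list of values in that interval.
    """
    batches = {}
    for ts, value in events:
        start = (ts // batch_interval) * batch_interval
        batches.setdefault(start, []).append(value)
    return batches

events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (3.4, "d")]
# With a 1-second interval, "b" and "c" land in the same micro-batch.
```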
39. Spark on Cassandra
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration
40. Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs +
Spark DStreams
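The connector's type conversion and object mapping are implemented in Scala; purely to illustrate the idea, here is a rough plain-Python analogue (all names hypothetical) that converts raw rows into typed records, coercing each column to its declared field type:

```python
from dataclasses import dataclass, fields

@dataclass
class Measurement:
    sensor_id: str
    epoch_ms: int
    value: float

def map_row(row: dict, cls):
    """Build a `cls` instance from a row dict, coercing each column
    to the type declared on the corresponding dataclass field."""
    kwargs = {f.name: f.type(row[f.name]) for f in fields(cls)}
    return cls(**kwargs)

# A "row" as it might arrive: everything stringly typed.
row = {"sensor_id": "s-1", "epoch_ms": "1464739200000", "value": "21.5"}
m = map_row(row, Measurement)
```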
What is Time-Series Data?
Time-series data consists of sequences of measurements, each occurring at a point in time.
A variety of terms are used to describe time-series data, and many of them apply to conflicting or overlapping concepts.
In the interest of clarity, in spark-ts, we stick to a particular vocabulary:
A time series is a sequence of floating-point values, each linked to a timestamp.
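Following that vocabulary, a minimal plain-Python representation (illustrative only) pairs each floating-point value with a regularly spaced timestamp:

```python
from datetime import datetime, timedelta

def make_series(start: datetime, step: timedelta, values):
    """Pair each float value with a regularly spaced timestamp."""
    return [(start + i * step, float(v)) for i, v in enumerate(values)]

# Three hourly observations starting 2016-05-01 00:00.
series = make_series(datetime(2016, 5, 1), timedelta(hours=1), [1, 2, 3])
```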
Since Spark Streaming is really just creating discrete RDDs (DStreams), we will see how we have the opportunity to combine streaming with the rest of the stack.