Learn how to achieve high-throughput processing for real-time applications using the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka), with a focus on Apache Kafka and Spark Streaming. Customers today expect immediate response times in their daily interactions, whether they are searching for a product, using a service, placing an order, or asking a question. Making your customers wait results in lost sales and increased churn, so developing a real-time application that can provide the instantaneous response times your customers expect is critical for competitive advantage.
Webinar recording: https://youtu.be/cgKeoR9pJIQ
Please check out http://www.datastax.com/resources/web... for our huge selection of free on-demand webinars!
In this webinar, you will:
✓ Learn how to estimate the size of your data pipeline
✓ Learn its impact on the overall architecture of your application
✓ Understand the five most important considerations when developing a data pipeline
Data pipelines consist of many parts:
Kafka to organize
Akka and Spark to process
Cassandra to store
Mesos to manage everything
Current Issues with Data Pipelines / Problems customers are facing
Requires scaling the data pipeline for high throughput of data
The Italian Job: an elaborate gold heist using Mini Coopers as the getaway cars. The crew has to figure out how to distribute the load of gold bricks between the Minis.
DStream: continuous sequence of micro batches
More complex processing models are possible with less effort
Streaming computations as a series of deterministic batch computations on small time intervals
Ingests and processes data using complex algorithms, which are expressed with high-level functions such as map, reduce, join, and window.
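As a rough illustration of that micro-batch model (plain Python, not the Spark API; the function names here are invented for this sketch), a stream can be chopped into fixed-length windows and the same deterministic batch computation, here a word count, run on each window:

```python
from collections import Counter

def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into consecutive fixed-length
    windows, mimicking how a DStream is a continuous sequence of small
    deterministic batches."""
    if not events:
        return []
    start = events[0][0]
    batches, current = [], []
    for ts, payload in events:
        # Close out windows until this event falls inside the current one.
        while ts >= start + batch_interval:
            batches.append(current)
            current = []
            start += batch_interval
        current.append(payload)
    batches.append(current)
    return batches

def word_count(batch):
    # A pure function per batch: replaying the stream yields identical results.
    return Counter(word for line in batch for word in line.split())

events = [(0.2, "kafka spark"), (1.1, "spark"), (5.3, "cassandra")]
results = [word_count(b) for b in micro_batches(events, batch_interval=5.0)]
# results[0] covers the first 5-second window, results[1] the next one.
```

Because each batch is processed by the same deterministic function, recovery and replay are straightforward: re-running a window always produces the same output.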
A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system is able to keep up with the data rate, you can check the value of the end-to-end delay experienced by each processed batch (either look for “Total delay” in Spark driver log4j logs, or use the StreamingListener interface). If the delay is maintained to be comparable to the batch size, then the system is stable.
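The stability rule above can be seen with simple arithmetic: scheduling delay accumulates whenever a batch takes longer to process than the interval at which new batches arrive. A back-of-envelope Python sketch (hypothetical, assuming a fixed per-batch processing time):

```python
def total_delays(batch_interval, processing_time, num_batches):
    """End-to-end delay per batch when batches arrive every
    batch_interval seconds and each takes processing_time to finish."""
    delays = []
    free_at = 0.0                        # when the system is next free
    for i in range(num_batches):
        arrival = i * batch_interval     # batch i is ready at this time
        start = max(arrival, free_at)    # it may have to wait its turn
        free_at = start + processing_time
        delays.append(free_at - arrival)  # the "Total delay" for batch i
    return delays

# Stable: 5s batches that take 4s to process; delay stays flat at 4s.
stable = total_delays(batch_interval=5.0, processing_time=4.0, num_batches=10)
# Unstable: batches take 6s; delay grows by 1s per batch without bound.
unstable = total_delays(batch_interval=5.0, processing_time=6.0, num_batches=10)
```

A flat delay comparable to the batch interval means the system keeps up; a steadily growing delay means batches are queuing faster than they are processed.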
A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log.
The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time.
For example, if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space.
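To make the offset and retention mechanics concrete, here is a toy Python model (an illustration only, not the Kafka implementation): each partition is an append-only sequence, the offset is simply a message's sequential position, and retention discards old messages without renumbering the survivors:

```python
class PartitionLog:
    """Toy model of one Kafka partition: an ordered, immutable,
    append-only sequence where each message gets a sequential offset."""

    def __init__(self):
        self.base_offset = 0   # offset of the oldest retained message
        self.messages = []

    def append(self, msg):
        offset = self.base_offset + len(self.messages)
        self.messages.append(msg)
        return offset

    def read(self, offset):
        if offset < self.base_offset:
            raise KeyError("message discarded by retention")
        return self.messages[offset - self.base_offset]

    def expire(self, keep_last_n):
        """Crude stand-in for time-based retention: drop everything but
        the newest keep_last_n messages. Offsets never change."""
        drop = max(0, len(self.messages) - keep_last_n)
        self.base_offset += drop
        self.messages = self.messages[drop:]

log = PartitionLog()
offsets = [log.append(m) for m in ("a", "b", "c", "d")]  # offsets 0..3
log.expire(keep_last_n=2)    # "a" and "b" are discarded
# log.read(3) still returns "d"; log.read(0) now raises KeyError
```

Because offsets are stable, a consumer's position survives retention: it only needs to remember a single integer per partition.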
=========================
A nice feature of the Direct API is the straightforward mapping it gives you:
Kafka topics map to Cassandra tables.
Kafka partitions map one-to-one to Spark partitions.
1 Kafka/Spark partition, maxRatePerPartition = 100k, batchInterval = 5s: each batch can contain up to 500,000 messages per partition (100,000 msgs/sec × 5 sec).
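Plugging those numbers in shows how maxRatePerPartition bounds the size of each micro-batch (a back-of-envelope sketch, assuming the rate limit is fully used):

```python
partitions = 1
max_rate_per_partition = 100_000   # messages/sec, per partition
batch_interval = 5                 # seconds

# Upper bound on messages pulled into one micro-batch across all partitions:
messages_per_batch = partitions * max_rate_per_partition * batch_interval
print(messages_per_batch)  # 500000
```

Doubling the number of Kafka partitions (with the same per-partition rate cap) doubles this bound, which is why partition count is the main parallelism lever.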
To increase the level of parallelism for receiving data, increase the number of Kafka partitions (more Kafka partitions means more Spark partitions).
Setting the right batch interval: stay away from sub-second batch intervals unless your processing time can meet that constraint.
How this fits into your existing architecture
There are lots of applications that Spark makes much easier, but it really breaks down into small questions and big questions.