High Throughput for Real-Time Applications with
SMACK, Kafka and Spark Streaming
Ryan Knight – Solution Engineer, DataStax
@knight_cloud
The SMACK stack: Spark, Mesos, Akka, Cassandra, Kafka
[Diagram: Akka, Kafka, Spark and Cassandra clusters running on Mesos; labels: Organize, Process, Store]
Data Pipelines
Move from Proactive to Predictive Analytics
• Real-time analytics of streaming data
• Common use cases – fraud detection, login analysis, web traffic analysis, marketing data
• High quality data pipeline = High quality data science
• Difficult to deal with the scale and volume of data flowing through enterprises today
© 2015 DataStax, All Rights Reserved.
Spark Streaming – Predictive Analytics at Scale
• Kafka + Spark Streaming – Ideal tools for handling massive volumes of data
• Built to scale – easy to parallelize and distribute
• Resilient and Fault Tolerant – Ensure data is not lost
How do we Scale for Load and Traffic?
Spark Streaming Micro Batches
5 Keys to Scaling Spark Streaming w/ Kafka
1 Use Event Sourcing / Append Only Data Model
2 Avoid Network Shuffles
3 Tune Spark Streaming Processing Time
4 Use Kafka Direct API
5 Size Spark Streaming Batch Sizes
Data Modeling using Event Sourcing
• Append-Only Logging
• Database of Facts
• Snapshots or Roll-Ups
• Why Delete Data any more?
• Replay Events
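The bullets above can be illustrated with a minimal sketch in plain Scala. The names `RatingAdded` and `EventLog` are hypothetical and not taken from the talk; the log here is an in-memory stand-in for whatever append-only store is used. Events are only ever appended, and roll-ups are derived by replaying them.

```scala
object EventSourcingSketch {
  // Hypothetical event: an immutable fact that is never updated or deleted.
  final case class RatingAdded(userId: Int, movieId: Int, rating: Double, timestampMs: Long)

  // Append-only log; every operation returns a new log and leaves old events intact.
  final case class EventLog(events: Vector[RatingAdded] = Vector.empty) {
    def append(event: RatingAdded): EventLog = copy(events = events :+ event)

    // Replay: rebuild the log as it looked at a given point in time.
    def replayUntil(timestampMs: Long): EventLog =
      EventLog(events.filter(_.timestampMs <= timestampMs))

    // Roll-up: a derived view computed from the facts, not stored in place of them.
    def averageByMovie: Map[Int, Double] =
      events.groupBy(_.movieId).map { case (movieId, es) =>
        movieId -> es.map(_.rating).sum / es.size
      }
  }
}
```

Because nothing is overwritten, a roll-up that turns out to be wrong can be recomputed from the same events rather than repaired in place.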
Avoid Network Shuffles
• Common use case to join streaming data with lookup tables
• Broadcast Joins
• joinWithCassandraTable
• Use DataFrames to leverage the Catalyst optimizer
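The idea behind a broadcast join can be sketched without Spark. In this plain-Scala model (all names are hypothetical), each partition of the stream is joined against its own read-only copy of a small lookup table, so no records move between partitions. In Spark the lookup copy would typically be shipped to executors with `SparkContext.broadcast`; that wiring is omitted here.

```scala
object BroadcastJoinSketch {
  // Hypothetical streaming record.
  final case class Rating(movieId: Int, score: Double)

  // Join one partition against a local copy of the lookup table.
  // Records with no match in the lookup table are dropped (inner join).
  def joinPartition(partition: Seq[Rating], titles: Map[Int, String]): Seq[(String, Double)] =
    partition.flatMap(r => titles.get(r.movieId).map(title => (title, r.score)))

  // Every partition is processed independently; the partition layout is unchanged,
  // which is what "no shuffle" means in this model.
  def joinAll(partitions: Seq[Seq[Rating]], titles: Map[Int, String]): Seq[Seq[(String, Double)]] =
    partitions.map(joinPartition(_, titles))
}
```

This only pays off when the lookup table is small enough to copy to every worker; for larger tables the slide's `joinWithCassandraTable` option is the alternative it names.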
Tuning Spark Streaming
• Keep Processing Time < Batch Duration
• Otherwise Total Delay grows unbounded and eventually causes Out Of Memory errors
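A rough model of why the delay grows without bound, assuming every batch takes the same processing time (a simplification; real batches vary). The function name and parameters are illustrative only. Each batch can start only after the previous one finishes, so any excess of processing time over the batch interval carries forward.

```scala
object DelaySketch {
  // Scheduling delay (ms) seen by each of the first `batches` batches,
  // assuming a constant processing time per batch.
  def schedulingDelaysMs(batchIntervalMs: Long, processingTimeMs: Long, batches: Int): Vector[Long] =
    Vector.iterate(0L, batches) { previousDelay =>
      // The backlog grows by (processing time - batch interval), and never drops below zero.
      math.max(0L, previousDelay + processingTimeMs - batchIntervalMs)
    }
}
```

With a 5 s interval and 6 s of processing, the delay climbs by one second per batch; with 4 s of processing it stays at zero.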
Batch Interval Gone Wrong
• Scheduling Delay of 41 Minutes!
Setting the Right Batch Interval
100 Kafka/Spark partitions, maxRatePerPartition = 100k, batchInterval = 5s
• Processing Time is consistently below our Batch Interval
• A good approach is to test with a conservative batch interval (e.g. 5-10 seconds) and a low data rate
• If the Total Delay is consistently under the Batch Interval, then the system is stable
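The settings quoted on this slide imply an upper bound on how many records one batch can contain. The sketch below works that out in plain Scala; it assumes `maxRatePerPartition` is a per-second, per-partition limit, as with Spark's `spark.streaming.kafka.maxRatePerPartition` setting, and the function name is illustrative.

```scala
object BatchCapSketch {
  // Upper bound on records in a single batch:
  // partitions * (records per second per partition) * (seconds per batch).
  def maxRecordsPerBatch(partitions: Int, maxRatePerPartition: Long, batchIntervalSec: Long): Long =
    partitions.toLong * maxRatePerPartition * batchIntervalSec
}
```

For the slide's settings (100 partitions, 100k, 5 s) the cap is 50 million records per batch, which is the load the processing time has to stay ahead of.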
Kafka High-level Review
Anatomy of a Topic
Partition 0: 0 1 2 3 4 5 6 7
Partition 1: 0 1 2 3 4 5 6 7 8 9
Partition 2: 0 1 2 3 4 5 6
(Offsets run from old to new; writes are appended at the new end.)
Advantages of Kafka Direct API
• Number of partitions per Kafka Topic = Degree of parallelism
• Simplifies Parallelism
• Efficiency – single copy of data on read
• Easier to work with
• Resiliency without copying data
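One way to picture the direct approach: for each batch, every Kafka partition contributes one offset range, and each range becomes one unit of parallel work. The plain-Scala model below is a sketch of that bookkeeping, not Spark's implementation; `OffsetRange` here is a hypothetical local type, `untilOffset` is treated as exclusive, and the per-partition cap stands in for a rate limit.

```scala
object OffsetRangeSketch {
  // Hypothetical model of the slice of one Kafka partition read in one batch.
  final case class OffsetRange(partition: Int, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  // consumed: next offset to read per partition (missing means start at 0).
  // latest:   next offset to be written per partition.
  // Returns one range per partition, capped at maxPerPartition records.
  def nextBatch(consumed: Map[Int, Long], latest: Map[Int, Long], maxPerPartition: Long): Seq[OffsetRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, end) =>
      val from = consumed.getOrElse(partition, 0L)
      OffsetRange(partition, from, math.min(end, from + maxPerPartition))
    }
}
```

Because the data is re-readable from Kafka by offset, a failed batch can be recomputed from the same ranges, which is the sense of "resiliency without copying data" on the slide.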
Reduce Processing Time by Increasing Parallelism
[Charts compared: 1 Kafka/Spark partition vs. 100 Kafka/Spark partitions, both with maxRatePerPartition = 100k, batchInterval = 5s]
Sizing Data Pipeline
• Look at the data flow for the entire pipeline
• Benchmarking is key!
• Calculate number of messages a single Spark Streaming server can handle
• Calculate number of messages flowing into Kafka
Sizing Spark Streaming
• Number of CPU Cores is the max number of Parallel Tasks
• RDD (Spark Data Type) is internally divided into Partitions based on data set size
• A data transformation on one partition is a task
• Each CPU core can process one task at a time
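Taken together, these points mean a batch with more partitions than cores runs in several rounds of tasks. A small sketch of that arithmetic, assuming one task per core at a time and tasks of similar length (the function name is illustrative):

```scala
object TaskWavesSketch {
  // Number of sequential rounds ("waves") of tasks needed to process all partitions
  // when at most `totalCores` tasks can run at once. This is a ceiling division.
  def waves(partitions: Int, totalCores: Int): Int = {
    require(partitions >= 0 && totalCores > 0, "partitions must be >= 0 and totalCores > 0")
    (partitions + totalCores - 1) / totalCores
  }
}
```

For example, 100 partitions on 16 cores need 7 waves, while 100 partitions on 100 cores need only one, so adding partitions beyond the available cores does not shorten a batch further.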
Java Monitoring of Kafka with JConsole
Formula for Sizing Spark Streaming
Total Servers = (# of Kafka Messages) / (# of Messages Streaming Server can Process)
Example: 100K / 20K = Minimum of 5 Servers
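The formula can be written as a ceiling division so that a fractional result rounds up to a whole server. This is a sketch; the function name is illustrative, and both rates are assumed to be measured over the same time window (for instance, messages per second from a benchmark).

```scala
object ServerSizingSketch {
  // Minimum number of streaming servers needed to keep up with the incoming rate.
  def minServers(incomingMessages: Long, messagesPerServer: Long): Long = {
    require(incomingMessages >= 0 && messagesPerServer > 0, "rates must be non-negative and per-server rate positive")
    (incomingMessages + messagesPerServer - 1) / messagesPerServer
  }
}
```

`ServerSizingSketch.minServers(100000L, 20000L)` gives 5, matching the slide's example; this is a floor for capacity, with no headroom for spikes or failures.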
Example Architecture
Spark at Scale
[Architecture diagram; labels: Web Service, Legacy Systems, DataStax Enterprise Platform]
https://github.com/retroryan/sparkatscale
Akka Feeder – Simulates Messages
val feederTick = context.system.scheduler.schedule(Duration.Zero, tickInterval, self, SendNextLine)
……
case SendNextLine =>
  val record = new ProducerRecord[String, String](feederExtension.kafkaTopic, key, nxtRating.toString)
  val future = feederExtension.producer.send(record, new Callback { ….
Spark Streaming – Reading the Messages
val rawRatingsStream = KafkaUtils.createDirectStream …..
……
ratingsStream.foreachRDD {
  (message: RDD[Rating], batchTime: Time) => {
    // convert each RDD from the batch into a Ratings DataFrame
    val ratingDF = message.toDF()
    // save the DataFrame to Cassandra
    // Note: Cassandra has been initialized through dse spark-submit, so we don't have to
    // explicitly set the connection
    ratingDF.write.format("org.apache.spark.sql.cassandra")
      .mode(SaveMode.Append)
      .options(Map("keyspace" -> "movie_db", "table" -> "rating_by_movie"))
      .save()
  }
}
Coming Soon!
• June 1: Building Data Pipelines with SMACK: Storage Strategy using Cassandra and DSE
• July 6: Building Data Pipelines with SMACK: Analyzing Data with Spark
• For the latest schedule of webinars, check out our Webinars page: http://www.datastax.com/resources/webinars
Get your SMACK on!
Thank you!
Follow me on Twitter: @knight_cloud

Webinar: How to Achieve High Throughput for Real-Time Applications with SMACK, Apache Kafka and Spark Streaming

  • 1. High Throughput for Real-Time Applications with SMACK, Kafka and Spark Streaming Ryan Knight – Solution Engineer, DataStax @knight_cloud
  • 5. Move from Proactive to Predictive Analytics • Real time analytics of streaming data • Common use cases – fraud detection, login analysis, web traffic analysis, marketing data • High quality data pipeline = High quality data science • Difficult to deal with the scale and volume of data flowing through enterprises today © 2015 DataStax, All Rights Reserved. 5
  • 6. Spark Streaming – Predictive Analytics at Scale • Kafka + Spark Streaming – Ideal tools for handling massive volumes of data • Built to scale – easy to parallelize and distribute • Resilient and Fault Tolerant – Ensure data is not lost © 2015 DataStax, All Rights Reserved. 6
  • 7. How do we Scale for Load and Traffic? © 2015 DataStax, All Rights Reserved. 7
  • 8. Spark Streaming Micro Batches © 2015 DataStax, All Rights Reserved. 8
  • 9. 1 Use Event Sourcing / Append Only Data Model 2 Avoid Network Shuffles 3 Tune Spark Streaming Processing Time 4 Use Kafka Direct API 5 Size Spark Streaming Batch Sizes 9© 2015 DataStax, All Rights Reserved. 5 Keys to Scaling Spark Streaming w/ Kafka
  • 10. 1 Use Event Sourcing / Append Only Data Model 2 Avoid Network Shuffles 3 Tune Spark Streaming Processing Time 4 Use Kafka Direct API 5 Size Spark Streaming Batch Sizes 10© 2015 DataStax, All Rights Reserved. 5 Keys to Scaling Spark Streaming w/ Kafka
  • 11. Data Modeling using Event Sourcing • Append-Only Logging • Database of Facts • Snapshots or Roll-Ups • Why Delete Data any more? • Replay Events © 2015 DataStax, All Rights Reserved. 11
  • 12. 1 Use Event Sourcing / Append Only Data Model 2 Avoid Network Shuffles 3 Tune Spark Streaming Processing Time 4 Use Kafka Direct API 5 Size Spark Streaming Batch Sizes 12© 2015 DataStax, All Rights Reserved. 5 Keys to Scaling Spark Streaming w/ Kafka
  • 13. • Common use case to join streaming data with lookup tables • Broadcast Joins • joinWithCassandraTable • Use Data Frames to leverage catalyst optimizer © 2015 DataStax, All Rights Reserved. 13 Avoid Network Shuffles
  • 14. 1 Use Event Sourcing / Append Only Data Model 2 Avoid Network Shuffles 3 Tune Spark Streaming Processing Time 4 Use Kafka Direct API 5 Size Spark Streaming Batch Sizes 14© 2015 DataStax, All Rights Reserved. 5 Keys to Scaling Spark Streaming w/ Kafka
  • 15. Tuning Spark Streaming • Processing Time < Batch Duration • Total Delay Grows Unbounded = Out Of Memory Errors © 2015 DataStax, All Rights Reserved. 15
  • 16. Batch Interval Gone Wrong © 2015 DataStax, All Rights Reserved. 16 • Scheduling Delay of 41 Minutes!
  • 17. Setting the Right Batch Interval © 2015 DataStax, All Rights Reserved. 17 100 kafka/spark partition, maxRatePerPartition = 100k, batchInterval = 5s • Processing Time is consistently below our Batch Interval Time • Good approach is to test with a conservative batch interval (e.g. 5-10 seconds) and a low data rate • If the Total Delay is constantly under the Batch Interval, then the system is stable
  • 18. 1 Use Event Sourcing / Append Only Data Model 2 Avoid Network Shuffles 3 Tune Spark Streaming Processing Time 4 Use Kafka Direct API 5 Size Spark Streaming Batch Sizes 18© 2015 DataStax, All Rights Reserved. 5 Keys to Scaling Spark Streaming w/ Kafka
  • 19. Kafka High-level Review 1 9 Anatomy of a Topic Writes 0 6541 2 3Partition 2 0 6 9541 2 3 7 8Partition 1 0 6 7541 2 3Partition 0 Old New Data offsets
  • 20. Advantage of Kafka Direct API © 2015 DataStax, All Rights Reserved. 20 20
  • 21. Advantages Kafka Direct API • Number of partitions per Kafka Topic = Degree of parallelism • Simplifies Parallelism • Efficiency – single copy of data on read • Easier to work with • Resiliency without copying data © 2015 DataStax, All Rights Reserved. 21
  • 23. Reduce Processing Time by Increasing Parallelism [Figures: two runs compared] • 1 Kafka/Spark partition, maxRatePerPartition = 100k, batchInterval = 5s • 100 Kafka/Spark partitions, maxRatePerPartition = 100k, batchInterval = 5s
  • 24. Sizing Data Pipeline • Look at the data flow for the entire pipeline • Benchmarking is key! • Calculate the number of messages a single Spark Streaming server can handle • Calculate the number of messages flowing into Kafka
  • 25. Sizing Spark Streaming • Number of CPU cores is the max number of parallel tasks • An RDD (Spark data type) is internally divided into partitions based on data set size • A data transformation on one partition is a task • Each CPU core can process one task at a time
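Because one core runs one task at a time and each partition is one task, the number of partitions relative to cores determines how many rounds of tasks a stage needs. A minimal sketch of that arithmetic follows, with the hypothetical names `TaskWavesSketch` and `waves`, and assuming all tasks take roughly the same time.

```scala
// How many rounds ("waves") of tasks are needed when every core runs
// one task at a time. Assumes tasks are of roughly equal duration.
object TaskWavesSketch {
  def waves(partitions: Int, cores: Int): Int = {
    require(cores > 0, "need at least one core")
    (partitions + cores - 1) / cores // integer ceiling of partitions / cores
  }
}
```

For example, 100 partitions on 16 cores need 7 waves, whereas 100 partitions on 100 cores finish in a single wave.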
  • 26. Java Monitoring of Kafka with JConsole
  • 27. Formula for Sizing Spark Streaming • Total Servers = (# of Kafka Messages) / (# of Messages a Streaming Server can Process) • Example: 100K / 20K = Minimum of 5 Servers
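The sizing formula can be written as a one-line function. In this sketch, rounding up when the division is not exact is an assumption added here, as is the requirement that both figures are measured over the same time window; `SizingSketch` and `minServers` are hypothetical names.

```scala
// Minimum number of streaming servers for a given incoming message rate.
// Both arguments must cover the same time window (an assumption of this sketch).
object SizingSketch {
  def minServers(kafkaMessages: Long, messagesPerServer: Long): Long = {
    require(messagesPerServer > 0, "per-server capacity must be positive")
    (kafkaMessages + messagesPerServer - 1) / messagesPerServer // round up
  }
}
```

`SizingSketch.minServers(100000L, 20000L)` reproduces the slide's answer of 5, and 110,000 messages at the same per-server capacity would need 6. This is a floor for planning; benchmarking and headroom for traffic spikes still apply.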
  • 29. Spark at Scale [Diagram labels: Web Service, Legacy Systems, DataStax Enterprise Platform] https://github.com/retroryan/sparkatscale
  • 30. Akka Feeder – Simulates Messages
        val feederTick = context.system.scheduler.schedule(Duration.Zero, tickInterval, self, SendNextLine)
        ……
        case SendNextLine =>
          val record = new ProducerRecord[String, String](feederExtension.kafkaTopic, key, nxtRating.toString)
          val future = feederExtension.producer.send(record, new Callback { ….
  • 31. Spark Streaming – Reading the Messages
        val rawRatingsStream = KafkaUtils.createDirectStream …..
        ……
        ratingsStream.foreachRDD {
          (message: RDD[Rating], batchTime: Time) => {
            // convert each RDD from the batch into a Ratings DataFrame
            val ratingDF = message.toDF()
            // save the DataFrame to Cassandra
            // Note: Cassandra has been initialized through dse spark-submit,
            // so we don't have to explicitly set the connection
            ratingDF.write.format("org.apache.spark.sql.cassandra")
              .mode(SaveMode.Append)
              .options(Map("keyspace" -> "movie_db", "table" -> "rating_by_movie"))
              .save()
          }
        }
  • 32. Coming Soon! • June 1: Building Data Pipelines with SMACK: Storage Strategy using Cassandra and DSE • July 6: Building Data Pipelines with SMACK: Analyzing Data with Spark • For the latest schedule of webinars, check out our Webinars page: http://www.datastax.com/resources/webinars
  • 33. Get your SMACK on! Thank you! Follow me on Twitter: @knight_cloud

Editor's Notes

  1. In this webinar, you will: ✓ Learn how to estimate the size of your data pipeline ✓ Learn its impact on the overall architecture of your application ✓ Understand the five most important considerations when developing a data pipeline
  2. Data pipelines consist of many parts: Kafka to organize, Akka and Spark to process, Cassandra to store, and Mesos to manage everything.
  3. Current Issues with Data Pipelines / Problems customers are facing
  4. Requires scaling the data pipeline for high throughput of data
  5. Italian Job - Elaborate gold heist using Minis to get away. They have to figure out how to distribute the load of gold bricks between the minis.
  6. DStream: a continuous sequence of micro-batches. More complex processing models are possible with less effort. Streaming computations run as a series of deterministic batch computations on small time intervals. Spark Streaming ingests and processes data using complex algorithms, which are expressed in high-level functions.
  11. A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system is able to keep up with the data rate, you can check the value of the end-to-end delay experienced by each processed batch (either look for “Total delay” in the Spark driver log4j logs, or use the StreamingListener interface). If the delay stays comparable to the batch size, then the system is stable.
  13. A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of messages that is continually appended to (a commit log). The messages in the partitions are each assigned a sequential id number called the offset, which uniquely identifies each message within the partition. The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the log retention is set to two days, then a message is available for consumption for two days after it is published, after which it is discarded to free up space. A nice feature of the Direct API: Kafka topics map to Cassandra tables, and Kafka partitions map to Spark partitions.
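The partitioned-log description above can be sketched as a tiny in-memory model. This is a toy written for illustration, not Kafka's implementation: `TopicLogSketch`, `Partition`, `Topic`, `publish` and `read` are hypothetical names, choosing the partition by key hash is a stand-in for a real partitioner, and retention is left out entirely.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy in-memory model of a topic as a set of append-only partition logs.
// Not Kafka's implementation; retention and replication are omitted.
object TopicLogSketch {
  final class Partition {
    private val log = ArrayBuffer.empty[String]

    // Appends at the "new" end and returns the offset assigned to the message.
    def append(message: String): Long = {
      log += message
      (log.size - 1).toLong
    }

    // Reading never removes anything: consumed messages stay in the log.
    def read(offset: Long): Option[String] =
      if (offset >= 0 && offset < log.size) Some(log(offset.toInt)) else None
  }

  final class Topic(numPartitions: Int) {
    private val partitions = Vector.fill(numPartitions)(new Partition)

    // Assumption: pick the partition from the key's hash (a simple stand-in).
    def publish(key: String, message: String): (Int, Long) = {
      val p = java.lang.Math.floorMod(key.hashCode, numPartitions)
      (p, partitions(p).append(message))
    }

    def read(partition: Int, offset: Long): Option[String] =
      partitions(partition).read(offset)
  }
}
```

Offsets are unique only within a partition, so a message is identified by the pair (partition, offset). That pairing is what lets a consumer re-read a precise range of messages after a failure.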
  14. Test setup: 1 Kafka/Spark partition, maxRatePerPartition = 100k, batchInterval = 5s. To increase the level of parallelism for receiving data, increase the number of partitions (Kafka -> Spark). When setting the batch interval, stay away from sub-second batch intervals unless your processing time can meet that constraint. A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system is able to keep up with the data rate, you can check the value of the end-to-end delay experienced by each processed batch (either look for “Total delay” in the Spark driver log4j logs, or use the StreamingListener interface). If the delay stays comparable to the batch size, then the system is stable.
  15. How this fits into your existing architecture
  16. Spark makes lots of applications much easier, but it really breaks down into small questions and big questions.