Deep Learning and Streaming in Apache Spark 2.x
Matei Zaharia
@matei_zaharia
Welcome to Spark Summit Europe
Our largest European summit yet
102 talks
1200 attendees
11 tracks
What’s New in Spark?
Cost-based optimizer (Spark 2.2)
Python and R improvements
• PyPI & CRAN packages (Spark 2.2)
• Python ML plugins (Spark 2.3)
• Vectorized Pandas UDFs (Spark 2.3)
Kubernetes support (targeting 2.3)
[Chart: run time in seconds, Spark 2.2 vs. Vectorized UDFs]
[Chart: run time in seconds for queries Q1 and Q2, Pandas vs. Spark]
Spark: The Definitive Guide
To be released this winter
Free preview chapters and code on Databricks website: dbricks.co/spark-guide
Two Fast-Growing Workloads
Both are important but complex with current tools
We think we can simplify both with Apache Spark!
Streaming & Deep Learning
Why are Streaming and DL Hard?
Similar to early big data tools!
Tremendous potential, but very hard to use at first:
• Low-level APIs (MapReduce)
• Separate systems for each task (SQL, ETL, ML, etc.)
Spark’s Approach
1) Composable, high-level APIs
• Build apps from components
2) Unified engine
• Run complete, end-to-end apps
SQL, Streaming, ML, Graph, …
Expanding Spark to New Areas
1) Structured Streaming
2) Deep Learning
Structured Streaming
Streaming today requires separate APIs & systems
Structured Streaming is a high-level, end-to-end API
• Simple interface: run any DataFrame or SQL code incrementally
• Complete apps: combine with batch & interactive queries
• End-to-end reliability: exactly-once processing
Became GA in Apache Spark 2.2
Structured Streaming Use Cases
Monitor quality of live video streaming
Anomaly detection on millions of WiFi hotspots
100s of customer apps in production on Databricks
Largest apps process tens of trillions of records per month
Real-time game analytics at scale
Example: Benchmark
Kafka Streams
// Keep only "view" events and project the fields needed downstream.
KStream<String, ProjectedEvent> filtered = kEvents.filter((key, value) -> {
    return value.event_type.equals("view");
}).mapValues((value) -> {
    return new ProjectedEvent(value.ad_id, value.event_time);
});

// Load the campaigns table and deserialize each JSON value.
KTable<String, String> kCampaigns = builder.table("campaigns", "cmp-state");
KTable<String, CampaignAd> deserCampaigns = kCampaigns.mapValues((value) -> {
    Map<String, String> campMap = Json.parser.readValue(value);
    return new CampaignAd(campMap.get("ad_id"), campMap.get("campaign_id"));
});

// Join events with campaigns, keeping only the campaign ID.
KStream<String, String> joined =
    filtered.join(deserCampaigns, (value1, value2) -> {
        return value2.campaign_id;
    },
    Serdes.String(), Serdes.serdeFrom(new ProjectedEventSerializer(),
        new ProjectedEventDeserializer()));

// Re-key by campaign ID and count per 10-second (10000 ms) window.
KStream<String, String> keyedData = joined.selectKey((key, value) -> value);
KTable<Windowed<String>, Long> counts = keyedData.groupByKey()
    .count(TimeWindows.of(10000), "time-windows");
DataFrames
events
.where("event_type = 'view'")
.join(table("campaigns"), "ad_id")
.groupBy(
window('event_time, "10 seconds"),
'campaign_id)
.count()
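To make the semantics of the DataFrame query concrete, here is a minimal plain-Python sketch of the same filter, join, and 10-second tumbling-window count. It is not Spark code: the function name, the sample records, and the use of integer seconds for event_time are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 10  # matches the "10 seconds" window in the query


def count_views_per_campaign(events, campaigns):
    """Count 'view' events per (window start, campaign_id).

    events: iterable of dicts with event_type, ad_id, event_time (seconds).
    campaigns: dict mapping ad_id -> campaign_id (stands in for the table).
    """
    counts = Counter()
    for event in events:
        if event["event_type"] != "view":            # where event_type = 'view'
            continue
        campaign_id = campaigns.get(event["ad_id"])  # inner join on ad_id
        if campaign_id is None:
            continue                                 # unmatched rows are dropped
        # Tumbling window: round the event time down to a multiple of 10 s.
        window_start = (event["event_time"] // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, campaign_id)] += 1     # groupBy(...).count()
    return dict(counts)


# Hypothetical sample data.
events = [
    {"event_type": "view", "ad_id": "a1", "event_time": 3},
    {"event_type": "click", "ad_id": "a1", "event_time": 4},
    {"event_type": "view", "ad_id": "a2", "event_time": 9},
    {"event_type": "view", "ad_id": "a1", "event_time": 12},
    {"event_type": "view", "ad_id": "zz", "event_time": 13},
]
campaigns = {"a1": "c1", "a2": "c1"}
result = count_views_per_campaign(events, campaigns)
```

With this sample, two views fall in the window starting at 0 and one in the window starting at 10; the click and the event with an unknown ad_id are dropped.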
Batch Plan: Scan Files -> Aggregate -> Write to Sink
Incremental Plan: Scan New Files -> Stateful Agg. -> Update Sink
(automatic transformation from the batch plan to the incremental plan)
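The idea behind the batch-to-incremental transformation can be sketched in plain Python: a batch aggregate rescans all input, while a stateful aggregate keeps running state and folds in only the new records, arriving at the same answer. This is an illustration of the concept under assumed names (batch_count, IncrementalCount, the sample files), not how Spark implements it.

```python
from collections import Counter


def batch_count(all_records):
    """Batch plan: scan every record, aggregate once."""
    return Counter(record["key"] for record in all_records)


class IncrementalCount:
    """Incremental plan: keep aggregation state, process only new input."""

    def __init__(self):
        self.state = Counter()  # state carried across triggers

    def process_new(self, new_records):
        # Fold the newly arrived records into the existing state.
        self.state.update(record["key"] for record in new_records)
        # The returned snapshot is what would be used to update the sink.
        return dict(self.state)


# Hypothetical input arriving as two files over time.
file1 = [{"key": "a"}, {"key": "b"}]
file2 = [{"key": "a"}]

incremental = IncrementalCount()
incremental.process_new(file1)
latest = incremental.process_new(file2)

# The incremental result matches a batch run over all data seen so far.
batch_result = dict(batch_count(file1 + file2))
```

The user-facing logic (count by key) is the same in both cases; only the execution strategy differs.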
Performance: Benchmark
System throughput (millions of records/s):
• Kafka Streams: 0.7 (700K)
• Apache Flink: 15
• Structured Streaming: 65
4x lower cost
4x fewer nodes
Structured Streaming reuses the Spark SQL Optimizer and Tungsten Engine.
https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark
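As a quick check on the chart's numbers, the throughput ratios can be worked out directly. The figures below are the ones shown on the slide; reading the "4x" claims as the ratio against Apache Flink is an assumption, since the slide does not say which baseline they refer to.

```python
# Throughput figures from the slide, in records per second.
throughput = {
    "Kafka Streams": 700_000,
    "Apache Flink": 15_000_000,
    "Structured Streaming": 65_000_000,
}

# How many times faster Structured Streaming is than each system.
baseline = throughput["Structured Streaming"]
ratios = {name: baseline / value for name, value in throughput.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

On these numbers Structured Streaming is roughly 4.3 times faster than Apache Flink and roughly 93 times faster than Kafka Streams.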
What About Latency?
Continuous processing mode to run without microbatches
• <1 ms latency (same as per-record streaming systems)
• No changes to user code
• Proposal in SPARK-20928
Key idea: same API can target both streaming & batch
Find out more in today’s deep dive
Expanding Spark to New Areas
1) Structured Streaming
2) Deep Learning
Deep Learning has Huge Potential
Unprecedented ability to work with unstructured data such as images and text
But Deep Learning is Hard to Use
Current APIs (TensorFlow, Keras, etc.) are low-level
• Build a computation graph from scratch
Scale-out requires manual parallelization
Hard to use models in larger applications
Very similar to early big data APIs
Deep Learning on Spark
Image support in MLlib: SPARK-21866 (Spark 2.3)
DL framework integrations: TensorFlowOnSpark, MMLSpark, Intel BigDL
Higher-level APIs: Deep Learning Pipelines
New in TensorFlowOnSpark
Library to run distributed TF on Spark clusters & data
• Built at Yahoo!, where it powers photos, videos & more
Yahoo! and Databricks collaborated to add:
• ML pipeline APIs
• Support for non-YARN and AWS clusters
github.com/yahoo/TensorFlowOnSpark
Talk tomorrow at 17:00
Deep Learning Pipelines
Low-level DL frameworks are powerful, but common use cases should be much simpler to build
Goal: Enable an order of magnitude more users to build production apps using deep learning
Deep Learning Pipelines
Key idea: High-level API built on ML Pipelines model
• Common use cases are just a few lines of code
• All operators automatically scale over Spark
• Expose models in batch, streaming & SQL apps
Uses existing DL engines (TensorFlow, Keras, etc.)
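The ML Pipelines model mentioned above rests on two roles: a Transformer maps a dataset to a dataset, and an Estimator is fit on data to produce a Transformer. The plain-Python sketch below shows that pattern only. Every class here is hypothetical, and the "model" is a toy threshold standing in for a deep learning model; it is not the Spark MLlib or Deep Learning Pipelines API.

```python
class Transformer:
    """Maps a list of row dicts to a new list of row dicts."""

    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from rows and returns a fitted Transformer."""

    def fit(self, rows):
        raise NotImplementedError


class Featurizer(Transformer):
    # Toy featurizer: uses the payload length as the only feature.
    def transform(self, rows):
        return [dict(row, features=[len(row["image"])]) for row in rows]


class ThresholdModel(Transformer):
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            dict(row, label="large" if row["features"][0] > self.threshold else "small")
            for row in rows
        ]


class ThresholdEstimator(Estimator):
    # Toy "training": the mean feature value becomes the decision threshold.
    def fit(self, rows):
        mean = sum(row["features"][0] for row in rows) / len(rows)
        return ThresholdModel(mean)


class PipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Chains stages; fitting it fits each Estimator stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # replace the estimator by its model
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


# Hypothetical data: strings stand in for image payloads.
data = [{"image": "xx"}, {"image": "xxxxxx"}]
model = Pipeline([Featurizer(), ThresholdEstimator()]).fit(data)
labels = [row["label"] for row in model.transform(data)]
```

Because the fitted pipeline is itself a Transformer, it can be applied wherever a single transform step is expected, which is what makes the components composable.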
Example: Using Existing Model
from sparkdl import DeepImagePredictor  # Deep Learning Pipelines package

predictor = DeepImagePredictor(inputCol="image",
                               outputCol="labels",
                               modelName="InceptionV3")
predictions_df = predictor.transform(image_df)

SELECT image, my_predictor(image) AS labels
FROM uploaded_images
Example: Model Search
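est = KerasImageFileEstimator()
grid = ParamGridBuilder() \
    .addGrid(est.modelFile, ["InceptionV3", "ResNet50"]) \
    .addGrid(est.kerasParams, [{'batch': 32}, {'batch': 64}]) \
    .build()
# eval is an evaluator defined elsewhere; CrossValidator takes keyword arguments
CrossValidator(estimator=est, evaluator=eval, estimatorParamMaps=grid).fit(image_df)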
[Diagram: Spark Driver with four candidate configurations: InceptionV3 with batch size 32, ResNet50 with batch size 32, InceptionV3 with batch size 64, ResNet50 with batch size 64]
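The four candidates in the model search come from taking every combination of the two parameter lists. A minimal plain-Python sketch of that grid expansion follows; build_grid is a hypothetical helper written for illustration and is not ParamGridBuilder itself.

```python
from itertools import product


def build_grid(param_options):
    """Expand {name: [values]} into a list of one dict per combination."""
    names = list(param_options)
    value_lists = [param_options[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Same two axes as the slide: model architecture and batch size.
grid = build_grid({
    "modelFile": ["InceptionV3", "ResNet50"],
    "batch": [32, 64],
})

for candidate in grid:
    print(candidate)
```

Two models times two batch sizes gives four independent candidates, each of which can be trained and evaluated separately from the others.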
Deep Learning Pipelines Demo
Sue Ann Hong

More Related Content

What's hot

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkDatabricks
 
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and moreTypesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and moreLegacy Typesafe (now Lightbend)
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Making Scala Faster: 3 Expert Tips For Busy Development Teams
Making Scala Faster: 3 Expert Tips For Busy Development TeamsMaking Scala Faster: 3 Expert Tips For Busy Development Teams
Making Scala Faster: 3 Expert Tips For Busy Development TeamsLightbend
 
Spark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis GkoufasSpark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis GkoufasSpark Summit
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Lightbend
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
 
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021StreamNative
 
Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google Cloud
Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google CloudPakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google Cloud
Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google CloudLightbend
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsXiao Li
 
Spark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn WhittickSpark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn WhittickSpark Summit
 
Akka at Enterprise Scale: Performance Tuning Distributed Applications
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsAkka at Enterprise Scale: Performance Tuning Distributed Applications
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsLightbend
 
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan VolzArchiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan VolzDatabricks
 
Do's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionDo's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionjglobal
 

What's hot (20)

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and moreTypesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Making Scala Faster: 3 Expert Tips For Busy Development Teams
Making Scala Faster: 3 Expert Tips For Busy Development TeamsMaking Scala Faster: 3 Expert Tips For Busy Development Teams
Making Scala Faster: 3 Expert Tips For Busy Development Teams
 
Spark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis GkoufasSpark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis Gkoufas
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
 
Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google Cloud
Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google CloudPakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google Cloud
Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google Cloud
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 
Spark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn WhittickSpark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn Whittick
 
Akka at Enterprise Scale: Performance Tuning Distributed Applications
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsAkka at Enterprise Scale: Performance Tuning Distributed Applications
Akka at Enterprise Scale: Performance Tuning Distributed Applications
 
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan VolzArchiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz
 
Do's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionDo's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in production
 

Similar to Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia

Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Databricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyondXiao Li
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...Databricks
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams APIconfluent
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 

Similar to Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia (20)

Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyond
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 

More from Jen Aman

Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkJen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on MesosJen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics Jen Aman
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkJen Aman
 

More from Jen Aman (20)

Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Time-Evolving Graph Processing On Commodity Clusters

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia

  • 1. Deep Learning and Streaming in Apache Spark 2.x Matei Zaharia @matei_zaharia
  • 2. Welcome to Spark Summit Europe. Our largest European summit yet: 102 talks, 1200 attendees, 11 tracks.
  • 3. What’s New in Spark?
      • Cost-based optimizer (Spark 2.2)
      • Python and R improvements: PyPI & CRAN packages (Spark 2.2), Python ML plugins (Spark 2.3), vectorized Pandas UDFs (Spark 2.3)
      • Kubernetes support (targeting 2.3)
      [Figure: two bar charts of run time in seconds. One compares Spark 2.2 with vectorized UDFs; the other compares Pandas with Spark on queries Q1 and Q2.]
  • 4. Spark: The Definitive Guide. To be released this winter. Free preview chapters and code on the Databricks website: dbricks.co/spark-guide
  • 5. Two Fast-Growing Workloads: Streaming & Deep Learning. Both are important but complex with current tools. We think we can simplify both with Apache Spark!
  • 6. Why are Streaming and DL Hard? Similar to early big data tools! Tremendous potential, but very hard to use at first:
      • Low-level APIs (MapReduce)
      • Separate systems for each task (SQL, ETL, ML, etc.)
  • 7. Spark’s Approach
      1) Composable, high-level APIs: build apps from components
      2) Unified engine: run complete, end-to-end apps
      [Diagram: SQL, Streaming, ML, Graph, …]
  • 8. Expanding Spark to New Areas: (1) Structured Streaming, (2) Deep Learning
  • 9. Structured Streaming. Streaming today requires separate APIs & systems. Structured Streaming is a high-level, end-to-end API:
      • Simple interface: run any DataFrame or SQL code incrementally
      • Complete apps: combine with batch & interactive queries
      • End-to-end reliability: exactly-once processing
      Became GA in Apache Spark 2.2.
  • 10. Structured Streaming Use Cases. 100s of customer apps in production on Databricks. Largest apps process tens of trillions of records per month.
      • Monitor quality of live video streaming
      • Anomaly detection on millions of WiFi hotspots
      • Real-time game analytics at scale
  • 11. Example: Benchmark

      Kafka Streams:

          KTable<String, String> kCampaigns = builder.table("campaigns", "cmp-state");
          KTable<String, CampaignAd> deserCampaigns = kCampaigns.mapValues((value) -> {
              Map<String, String> campMap = Json.parser.readValue(value);
              return new CampaignAd(campMap.get("ad_id"), campMap.get("campaign_id"));
          });
          // Keep only "view" events and project the fields needed downstream.
          KStream<String, ProjectedEvent> filtered = kEvents.filter((key, value) -> {
              return value.event_type.equals("view");
          }).mapValues((value) -> {
              return new ProjectedEvent(value.ad_id, value.event_time);
          });
          // Join events with campaigns to obtain each event's campaign_id.
          KStream<String, String> joined = filtered.join(deserCampaigns, (value1, value2) -> {
              return value2.campaign_id;
          }, Serdes.String(), Serdes.serdeFrom(new ProjectedEventSerializer(),
                                               new ProjectedEventDeserializer()));
          // Re-key by campaign_id and count per 10-second window.
          KStream<String, String> keyedData = joined.selectKey((key, value) -> value);
          KTable<Windowed<String>, Long> counts = keyedData.groupByKey()
              .count(TimeWindows.of(10000), "time-windows");

      DataFrames:

          events
            .where("event_type = 'view'")
            .join(table("campaigns"), "ad_id")
            .groupBy(window('event_time, "10 seconds"), 'campaign_id)
            .count()

      Batch Plan: Scan Files → Aggregate → Write to Sink
      Incremental Plan: Scan New Files → Stateful Agg. → Update Sink
      The batch plan is converted to the incremental plan by automatic transformation.
  • 12. Performance: Benchmark. System throughput (records/s): Kafka Streams 700K, Apache Flink 15M, Structured Streaming 65M. 4x lower cost, 4x fewer nodes. Structured Streaming reuses the Spark SQL Optimizer and Tungsten Engine. Benchmark source: https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark
  • 13. What About Latency? Continuous processing mode to run without microbatches:
      • <1 ms latency (same as per-record streaming systems)
      • No changes to user code
      • Proposal in SPARK-20928
      Key idea: same API can target both streaming & batch. Find out more in today’s deep dive.
  • 14. Expanding Spark to New Areas: (1) Structured Streaming, (2) Deep Learning
  • 15. Deep Learning has Huge Potential. Unprecedented ability to work with unstructured data such as images and text.
  • 16. But Deep Learning is Hard to Use
      • Current APIs (TensorFlow, Keras, etc.) are low-level: build a computation graph from scratch
      • Scale-out requires manual parallelization
      • Hard to use models in larger applications
      Very similar to early big data APIs.
  • 17. Deep Learning on Spark
      • Image support in MLlib: SPARK-21866 (Spark 2.3)
      • DL framework integrations: TensorFlowOnSpark, MMLSpark, Intel BigDL
      • Higher-level APIs: Deep Learning Pipelines
  • 18. New in TensorFlowOnSpark. Library to run distributed TF on Spark clusters & data. Built at Yahoo!, where it powers photos, videos & more. Yahoo! and Databricks collaborated to add:
      • ML pipeline APIs
      • Support for non-YARN and AWS clusters
      github.com/yahoo/TensorFlowOnSpark (talk tomorrow at 17:00)
  • 19. Deep Learning Pipelines. Low-level DL frameworks are powerful, but common use cases should be much simpler to build. Goal: enable an order of magnitude more users to build production apps using deep learning.
  • 20. Deep Learning Pipelines. Key idea: high-level API built on the ML Pipelines model:
      • Common use cases are just a few lines of code
      • All operators automatically scale over Spark
      • Expose models in batch, streaming & SQL apps
      Uses existing DL engines (TensorFlow, Keras, etc.)
  • 21. Example: Using Existing Model

      Python:

          predictor = DeepImagePredictor(inputCol="image", outputCol="labels",
                                         modelName="InceptionV3")
          predictions_df = predictor.transform(image_df)

      SQL:

          SELECT image, my_predictor(image) AS labels
          FROM uploaded_images
  • 22. Example: Model Search

          est = KerasImageFileEstimator()
          # Parentheses let the chained builder calls span several lines.
          grid = (ParamGridBuilder()
                  .addGrid(est.modelFile, ["InceptionV3", "ResNet50"])
                  .addGrid(est.kerasParams, [{'batch': 32}, {'batch': 64}])
                  .build())
          CrossValidator(est, eval, grid).fit(image_df)

      [Diagram: Spark Driver with four model configurations: InceptionV3 batch size 32, ResNet50 batch size 32, InceptionV3 batch size 64, ResNet50 batch size 64]
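
The model search on slide 22 amounts to expanding a parameter grid into independent training configurations. The sketch below shows only that expansion, using the Python standard library. It makes no claim about how Deep Learning Pipelines or MLlib implement it: `build_grid` is a hypothetical helper, and only the parameter names and values are taken from the slide.

```python
from itertools import product


def build_grid(param_options):
    """Expand {name: [values]} into one dict per combination.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    names = list(param_options)
    value_lists = [param_options[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Parameter names and values as they appear on the slide.
grid = build_grid({
    "modelFile": ["InceptionV3", "ResNet50"],
    "kerasParams": [{"batch": 32}, {"batch": 64}],
})

for config in grid:
    print(config["modelFile"], "batch size", config["kerasParams"]["batch"])
```

Two models and two batch sizes give four configurations. None depends on another, which is why they can be trained in parallel.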

Editor's Notes

  1. Make this more about how easy it is.
  2. Comparable latency to Flink
  3. We’ve been experimenting with this at DB and we’re excited to contribute it back.
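
Slide 11 contrasts a batch plan (scan files, aggregate, write) with an incremental plan (scan new files, stateful aggregation, update sink). The toy sketch below illustrates why the two can give the same answer for a windowed count. It is plain Python and not Spark's implementation: the event tuples, `window_start`, `batch_counts`, and `IncrementalCounts` are all hypothetical names chosen for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 10  # matches the 10-second window in the slide's query


def window_start(event_time):
    """Return the start of the tumbling window containing event_time."""
    return event_time - event_time % WINDOW_SECONDS


def batch_counts(events):
    """Batch plan: scan every event once and aggregate."""
    counts = Counter()
    for campaign_id, event_time in events:
        counts[(window_start(event_time), campaign_id)] += 1
    return counts


class IncrementalCounts:
    """Incremental plan: keep aggregation state and fold in only new events."""

    def __init__(self):
        self.state = Counter()

    def update(self, new_events):
        for campaign_id, event_time in new_events:
            self.state[(window_start(event_time), campaign_id)] += 1
        return self.state


# Hypothetical (campaign_id, event_time in seconds) view events.
events = [("c1", 1), ("c2", 3), ("c1", 9), ("c1", 12), ("c2", 15), ("c2", 27)]

incremental = IncrementalCounts()
for start in range(0, len(events), 2):  # deliver events two at a time
    incremental.update(events[start:start + 2])

# Both plans agree once the same events have been processed.
assert incremental.state == batch_counts(events)
```

The sketch ignores late data, fault tolerance, and state cleanup, all of which a real streaming engine has to handle.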