SlideShare a Scribd company logo
1 of 25
Download to read offline
Apache Spark 3:
The (possible) future!
Holden:
● My name is Holden Karau
● Prefered pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
What will be covered?
● Constraints on predicting the future in open source
● The current state of Spark
● Some exciting new things likely/possibly in Spark 3
● A guide on how to build your own crystal ball to double check my crystal ball
● A look through JIRA
● A plea for you to help us with code reviews
● Q & A
Predicting the future in OSS is hard
● This represents my views as someone who works on Spark
● What people end up deciding to work on / review may not match
● Conesus decision making can sometimes be unpredictable
● While we don't have a crystal ball, we do have JIRA
Hisashi
Some key themes for Spark 3
● Deep Learning
○ VC dollars
● Kubernetes
○ If all the cool kids replaced their scheduler, would you?
○ Also see above
● Removing deprecated APIs (yay?)
● … Scala upgrade?
Andy
Blackledge
Deep Learning: Does this slide have cat?
● New scheduler to support deep learning
● New data types to support deep learning
● Better interchange to support deep learning
● Actual deep learning algorithms…. Where are they?
Quinn Dombrowski
New "Gang" Scheduler
● Announced at Spark Summit back in 2.3 -
https://www.datanami.com/2018/06/05/project-hydrogen-u
nites-apache-spark-with-dl-frameworks/
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./core/src
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./*/src
hkarau@hkarau-glaptop:~/repos/spark$
Lisa Larsson
New "Gang" Barrier Scheduler
● https://issues.apache.org/jira/browse/SPARK-24374
● https://docs.google.com/document/d/1JR6lWcgAI53lCUxy
4qvQSv8w1jXZrbS7DC9AjYlwqJE/edit#
hkarau@hkarau-glaptop:~/repos/spark$ grep -ri Barrier ./core/src
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:class
RDDBarrierSuite extends SparkFunSuite with SharedSparkContext {
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala: test("create an
RDDBarrier") {
./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:
Lisa Larsson
What does this fix?
● Allows scheduling all of the DL job together
● Allows regular scheduling otherwise
● Handles failures (e.g. single executor failure == retry all)
Lisa Larsson
Spark Deep Learning Pipelines
● ML pipelines being extended to better support image data
● Some work external e.g.
https://github.com/databricks/spark-deep-learning
Alternatives:
● TensorflowOnSpark
● MMLSpark, etc.
Smokey Combs
Growth of use of Arrow (maybe)?
Logos trademarks of their respective projects
Juha Kettunen
Kubernetes
● A new cluster manager, used for more than "just" big data
● Spark "supports" but lacks difficult to use
● Active work by people at many companies (yay!)
Lisa Zins
What do we need to do next?
● Better* dynamic scaling
● Easier uploading for user code & dependencies
● Better auth integration
● Better documentation (ugh client mode)
● Better job resource requirement tagging
● Better shell scripts for packaging dependencies
○ It's easier than YARN but that's not saying a lot
○ Asking a junior Data Scientist to build a docker
container doesn't always go so well
Hisashi
Photo by: squidish
Building your own crystal ball:
● Join us on th dev@ list -
http://spark.apache.org/community.html
● Triage Issues:
○ https://issues.apache.org/jira/projects/SPARK/issues/?
filter=allopenissues
● Review code!
○ http://spark-prs.appspot.com
○ https://www.youtube.com/user/holdenkarau
Lee Jordan
Spark JIRA
● Let's go look at issues on the Spark JIRA together!
● Don't all rush at once….
Jean Georges Perrin
#SparkInAction
Who likes doing code reviews?
● We have over 400* of them!
● Don't all rush at once….
Want to get involved?
● Join us on th dev@ list -
http://spark.apache.org/community.html
● Triage Issues:
○ https://issues.apache.org/jira/projects/SPARK/issues/?
filter=allopenissues
● Please help review code!
○ http://spark-prs.appspot.com
○ https://www.youtube.com/user/holdenkarau
Hisashi
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Spark in Action
High Performance SparkLearning PySpark
High Performance Spark!
You can buy it today! On the internet!
Nothing on Spark 3 because it doesn't exist yet
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
Sign up for the mailing list @
http://www.distributedcomputing4kids.com
And some upcoming talks:
● March
○ Dataworks Barcelona
○ Strata San Francisco
● May
○ KiwiCoda Mania
● June
○ "Secret" (for another week or so)
● July
○ OSCON Portland
○ Skills Matter in London
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
It’s performance review season, so help a friend out and
fill out this survey with your talk feedback
http://bit.ly/holdenTalkFeedback

More Related Content

Similar to Apache Spark 3: The (possible) future

Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Holden Karau
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache SparkHolden Karau
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Holden Karau
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...confluent
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Holden Karau
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Holden Karau
 
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Holden Karau
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018Holden Karau
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Holden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...Holden Karau
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Holden Karau
 
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018Holden Karau
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Knoldus Inc.
 
Contributing to Apache Airflow | Journey to becoming Airflow's leading contri...
Contributing to Apache Airflow | Journey to becoming Airflow's leading contri...Contributing to Apache Airflow | Journey to becoming Airflow's leading contri...
Contributing to Apache Airflow | Journey to becoming Airflow's leading contri...Kaxil Naik
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Holden Karau
 
Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K...
 Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K... Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K...
Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K...Databricks
 
Exem flamingo meetup-7th-sparkr
Exem flamingo meetup-7th-sparkrExem flamingo meetup-7th-sparkr
Exem flamingo meetup-7th-sparkr남 남종환
 

Similar to Apache Spark 3: The (possible) future (20)

Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)
 
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
 
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
Contributing to Apache Airflow | Journey to becoming Airflow's leading contri...
Contributing to Apache Airflow | Journey to becoming Airflow's leading contri...Contributing to Apache Airflow | Journey to becoming Airflow's leading contri...
Contributing to Apache Airflow | Journey to becoming Airflow's leading contri...
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?
 
Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K...
 Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K... Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K...
Using Spark ML on Spark Errors – What Do the Clusters Tell Us? with Holden K...
 
Exem flamingo meetup-7th-sparkr
Exem flamingo meetup-7th-sparkrExem flamingo meetup-7th-sparkr
Exem flamingo meetup-7th-sparkr
 

More from Lightbend

IoT 'Megaservices' - High Throughput Microservices with Akka
IoT 'Megaservices' - High Throughput Microservices with AkkaIoT 'Megaservices' - High Throughput Microservices with Akka
IoT 'Megaservices' - High Throughput Microservices with AkkaLightbend
 
How Akka Cluster Works: Actors Living in a Cluster
How Akka Cluster Works: Actors Living in a ClusterHow Akka Cluster Works: Actors Living in a Cluster
How Akka Cluster Works: Actors Living in a ClusterLightbend
 
The Reactive Principles: Eight Tenets For Building Cloud Native Applications
The Reactive Principles: Eight Tenets For Building Cloud Native ApplicationsThe Reactive Principles: Eight Tenets For Building Cloud Native Applications
The Reactive Principles: Eight Tenets For Building Cloud Native ApplicationsLightbend
 
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesPutting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesLightbend
 
Akka at Enterprise Scale: Performance Tuning Distributed Applications
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsAkka at Enterprise Scale: Performance Tuning Distributed Applications
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsLightbend
 
Digital Transformation with Kubernetes, Containers, and Microservices
Digital Transformation with Kubernetes, Containers, and MicroservicesDigital Transformation with Kubernetes, Containers, and Microservices
Digital Transformation with Kubernetes, Containers, and MicroservicesLightbend
 
Detecting Real-Time Financial Fraud with Cloudflow on Kubernetes
Detecting Real-Time Financial Fraud with Cloudflow on KubernetesDetecting Real-Time Financial Fraud with Cloudflow on Kubernetes
Detecting Real-Time Financial Fraud with Cloudflow on KubernetesLightbend
 
Cloudstate - Towards Stateful Serverless
Cloudstate - Towards Stateful ServerlessCloudstate - Towards Stateful Serverless
Cloudstate - Towards Stateful ServerlessLightbend
 
Digital Transformation from Monoliths to Microservices to Serverless and Beyond
Digital Transformation from Monoliths to Microservices to Serverless and BeyondDigital Transformation from Monoliths to Microservices to Serverless and Beyond
Digital Transformation from Monoliths to Microservices to Serverless and BeyondLightbend
 
Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6
Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6
Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6Lightbend
 
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...Lightbend
 
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...Lightbend
 
Microservices, Kubernetes, and Application Modernization Done Right
Microservices, Kubernetes, and Application Modernization Done RightMicroservices, Kubernetes, and Application Modernization Done Right
Microservices, Kubernetes, and Application Modernization Done RightLightbend
 
Full Stack Reactive In Practice
Full Stack Reactive In PracticeFull Stack Reactive In Practice
Full Stack Reactive In PracticeLightbend
 
Akka and Kubernetes: A Symbiotic Love Story
Akka and Kubernetes: A Symbiotic Love StoryAkka and Kubernetes: A Symbiotic Love Story
Akka and Kubernetes: A Symbiotic Love StoryLightbend
 
Scala 3 Is Coming: Martin Odersky Shares What To Know
Scala 3 Is Coming: Martin Odersky Shares What To KnowScala 3 Is Coming: Martin Odersky Shares What To Know
Scala 3 Is Coming: Martin Odersky Shares What To KnowLightbend
 
Migrating From Java EE To Cloud-Native Reactive Systems
Migrating From Java EE To Cloud-Native Reactive SystemsMigrating From Java EE To Cloud-Native Reactive Systems
Migrating From Java EE To Cloud-Native Reactive SystemsLightbend
 
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsRunning Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsLightbend
 
Designing Events-First Microservices For A Cloud Native World
Designing Events-First Microservices For A Cloud Native WorldDesigning Events-First Microservices For A Cloud Native World
Designing Events-First Microservices For A Cloud Native WorldLightbend
 
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For ScalaScala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For ScalaLightbend
 

More from Lightbend (20)

IoT 'Megaservices' - High Throughput Microservices with Akka
IoT 'Megaservices' - High Throughput Microservices with AkkaIoT 'Megaservices' - High Throughput Microservices with Akka
IoT 'Megaservices' - High Throughput Microservices with Akka
 
How Akka Cluster Works: Actors Living in a Cluster
How Akka Cluster Works: Actors Living in a ClusterHow Akka Cluster Works: Actors Living in a Cluster
How Akka Cluster Works: Actors Living in a Cluster
 
The Reactive Principles: Eight Tenets For Building Cloud Native Applications
The Reactive Principles: Eight Tenets For Building Cloud Native ApplicationsThe Reactive Principles: Eight Tenets For Building Cloud Native Applications
The Reactive Principles: Eight Tenets For Building Cloud Native Applications
 
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesPutting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
 
Akka at Enterprise Scale: Performance Tuning Distributed Applications
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsAkka at Enterprise Scale: Performance Tuning Distributed Applications
Akka at Enterprise Scale: Performance Tuning Distributed Applications
 
Digital Transformation with Kubernetes, Containers, and Microservices
Digital Transformation with Kubernetes, Containers, and MicroservicesDigital Transformation with Kubernetes, Containers, and Microservices
Digital Transformation with Kubernetes, Containers, and Microservices
 
Detecting Real-Time Financial Fraud with Cloudflow on Kubernetes
Detecting Real-Time Financial Fraud with Cloudflow on KubernetesDetecting Real-Time Financial Fraud with Cloudflow on Kubernetes
Detecting Real-Time Financial Fraud with Cloudflow on Kubernetes
 
Cloudstate - Towards Stateful Serverless
Cloudstate - Towards Stateful ServerlessCloudstate - Towards Stateful Serverless
Cloudstate - Towards Stateful Serverless
 
Digital Transformation from Monoliths to Microservices to Serverless and Beyond
Digital Transformation from Monoliths to Microservices to Serverless and BeyondDigital Transformation from Monoliths to Microservices to Serverless and Beyond
Digital Transformation from Monoliths to Microservices to Serverless and Beyond
 
Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6
Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6
Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6
 
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...
 
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
 
Microservices, Kubernetes, and Application Modernization Done Right
Microservices, Kubernetes, and Application Modernization Done RightMicroservices, Kubernetes, and Application Modernization Done Right
Microservices, Kubernetes, and Application Modernization Done Right
 
Full Stack Reactive In Practice
Full Stack Reactive In PracticeFull Stack Reactive In Practice
Full Stack Reactive In Practice
 
Akka and Kubernetes: A Symbiotic Love Story
Akka and Kubernetes: A Symbiotic Love StoryAkka and Kubernetes: A Symbiotic Love Story
Akka and Kubernetes: A Symbiotic Love Story
 
Scala 3 Is Coming: Martin Odersky Shares What To Know
Scala 3 Is Coming: Martin Odersky Shares What To KnowScala 3 Is Coming: Martin Odersky Shares What To Know
Scala 3 Is Coming: Martin Odersky Shares What To Know
 
Migrating From Java EE To Cloud-Native Reactive Systems
Migrating From Java EE To Cloud-Native Reactive SystemsMigrating From Java EE To Cloud-Native Reactive Systems
Migrating From Java EE To Cloud-Native Reactive Systems
 
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsRunning Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
 
Designing Events-First Microservices For A Cloud Native World
Designing Events-First Microservices For A Cloud Native WorldDesigning Events-First Microservices For A Cloud Native World
Designing Events-First Microservices For A Cloud Native World
 
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For ScalaScala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
 

Recently uploaded

What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 

Apache Spark 3: The (possible) future

  • 1.
  • 2. Apache Spark 3: The (possible) future!
  • 3. Holden: ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, Beam contributor ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos
  • 4.
  • 5. What will be covered? ● Constraints on predicting the future in open source ● The current state of Spark ● Some exciting new things likely/possibly in Spark 3 ● A guide on how to build your own crystal ball to double check my crystal ball ● A look through JIRA ● A plea for you to help us with code reviews ● Q & A
  • 6. Predicting the future in OSS is hard ● This represents my views as someone who works on Spark ● What people end up deciding to work on / review may not match ● Conesus decision making can sometimes be unpredictable ● While we don't have a crystal ball, we do have JIRA Hisashi
  • 7. Some key themes for Spark 3 ● Deep Learning ○ VC dollars ● Kubernetes ○ If all the cool kids replaced their scheduler, would you? ○ Also see above ● Removing deprecated APIs (yay?) ● … Scala upgrade? Andy Blackledge
  • 8. Deep Learning: Does this slide have cat? ● New scheduler to support deep learning ● New data types to support deep learning ● Better interchange to support deep learning ● Actual deep learning algorithms…. Where are they? Quinn Dombrowski
  • 9. New "Gang" Scheduler ● Announced at Spark Summit back in 2.3 - https://www.datanami.com/2018/06/05/project-hydrogen-u nites-apache-spark-with-dl-frameworks/ hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./core/src hkarau@hkarau-glaptop:~/repos/spark$ grep -ri gang ./*/src hkarau@hkarau-glaptop:~/repos/spark$ Lisa Larsson
  • 10. New "Gang" Barrier Scheduler ● https://issues.apache.org/jira/browse/SPARK-24374 ● https://docs.google.com/document/d/1JR6lWcgAI53lCUxy 4qvQSv8w1jXZrbS7DC9AjYlwqJE/edit# hkarau@hkarau-glaptop:~/repos/spark$ grep -ri Barrier ./core/src ./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala:class RDDBarrierSuite extends SparkFunSuite with SharedSparkContext { ./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala: test("create an RDDBarrier") { ./core/src/test/scala/org/apache/spark/rdd/RDDBarrierSuite.scala: Lisa Larsson
  • 11. What does this fix? ● Allows scheduling all of the DL job together ● Allows regular scheduling otherwise ● Handles failures (e.g. single executor failure == retry all) Lisa Larsson
  • 12. Spark Deep Learning Pipelines ● ML pipelines being extended to better support image data ● Some work external e.g. https://github.com/databricks/spark-deep-learning Alternatives: ● TensorflowOnSpark ● MMLSpark, etc. Smokey Combs
  • 13. Growth of use of Arrow (maybe)? Logos trademarks of their respective projects Juha Kettunen
  • 14. Kubernetes ● A new cluster manager, used for more than "just" big data ● Spark "supports" but lacks difficult to use ● Active work by people at many companies (yay!) Lisa Zins
  • 15. What do we need to do next? ● Better* dynamic scaling ● Easier uploading for user code & dependencies ● Better auth integration ● Better documentation (ugh client mode) ● Better job resource requirement tagging ● Better shell scripts for packaging dependencies ○ It's easier than YARN but that's not saying a lot ○ Asking a junior Data Scientist to build a docker container doesn't always go so well Hisashi
  • 17. Building your own crystal ball: ● Join us on th dev@ list - http://spark.apache.org/community.html ● Triage Issues: ○ https://issues.apache.org/jira/projects/SPARK/issues/? filter=allopenissues ● Review code! ○ http://spark-prs.appspot.com ○ https://www.youtube.com/user/holdenkarau Lee Jordan
  • 18. Spark JIRA ● Let's go look at issues on the Spark JIRA together! ● Don't all rush at once…. Jean Georges Perrin #SparkInAction
  • 19. Who likes doing code reviews? ● We have over 400* of them! ● Don't all rush at once….
  • 20. Want to get involved? ● Join us on th dev@ list - http://spark.apache.org/community.html ● Triage Issues: ○ https://issues.apache.org/jira/projects/SPARK/issues/? filter=allopenissues ● Please help review code! ○ http://spark-prs.appspot.com ○ https://www.youtube.com/user/holdenkarau Hisashi
  • 21. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action High Performance SparkLearning PySpark
  • 22. High Performance Spark! You can buy it today! On the internet! Nothing on Spark 3 because it doesn't exist yet Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  • 23. Sign up for the mailing list @ http://www.distributedcomputing4kids.com
  • 24. And some upcoming talks: ● March ○ Dataworks Barcelona ○ Strata San Francisco ● May ○ KiwiCoda Mania ● June ○ "Secret" (for another week or so) ● July ○ OSCON Portland ○ Skills Matter in London
  • 25. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark . Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF It’s performance review season, so help a friend out and fill out this survey with your talk feedback http://bit.ly/holdenTalkFeedback