Big Data applications are increasingly being run on Kubernetes. Data scientists commonly use Python-based workflows, with tools like PySpark and Jupyter for wrangling large amounts of data. Over the past year the Kubernetes community has been actively investing in tools and support for frameworks such as Apache Spark, Jupyter and Apache Airflow. Attendees will learn how these tools can be used together to build a scalable self-service platform for data science on Kubernetes, as well as the benefits that Kubernetes can provide over traditional options.
2. Some links (slides & recordings will be at):
http://bit.ly/2RU0XcG
CatLoversShow
3. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
5. What is going to be covered:
● What is Kubernetes
● How it’s different from YARN and other similar systems we use Spark on
● How “simple” it is to switch cluster managers
○ Plus the not so simple (where’s my HDFS and auto-scaling?)
● WiFi permitting, a PySpark on K8s demo (everyone loves wordcount!)
● A brief detour into Kubeflow
● Future work and directions
Andrew
6. Kubernetes
“New” open-source cluster manager.
- github.com/kubernetes/kubernetes
Runs programs in Linux containers.
1600+ contributors and 60,000+ commits.
7. Kubernetes
[Diagram: several containers, each packaging its own app and libs, running on one shared kernel]
8. More isolation is good
Kubernetes provides each program with:
● a lightweight virtual file system -- Docker image
○ an independent set of S/W packages
● a virtual network interface
○ a unique virtual IP address
○ an entire range of ports
Aleksei I
9. Other isolation layers
● Separate process ID space
● Max memory limit
● CPU share throttling
● Mountable volumes
○ Config files -- ConfigMaps
○ Credentials -- Secrets
○ Local storage -- EmptyDir, HostPath
○ Network storage -- PersistentVolumes
Jarek Reiner
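As a rough illustration of the isolation layers listed on the last two slides, here is a minimal sketch of a Pod manifest built as a Python dict. The field names follow the Kubernetes Pod API; the pod name, image, mount paths, and the ConfigMap and Secret names are placeholders made up for this example, not anything from the talk.

```python
import json


def build_pod_manifest(name, image):
    """Return a Pod manifest (as a dict) touching several isolation layers."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "main",
                # The image is the container's lightweight virtual file system.
                "image": image,
                "resources": {
                    # CPU share used for scheduling and throttling.
                    "requests": {"cpu": "500m"},
                    # Hard caps, including the max memory limit.
                    "limits": {"memory": "2Gi", "cpu": "1"},
                },
                "volumeMounts": [
                    {"name": "app-config", "mountPath": "/etc/app"},
                    {"name": "app-creds", "mountPath": "/etc/creds",
                     "readOnly": True},
                    {"name": "scratch", "mountPath": "/tmp/scratch"},
                ],
            }],
            "volumes": [
                # Config files come from a ConfigMap (hypothetical name).
                {"name": "app-config", "configMap": {"name": "app-config"}},
                # Credentials come from a Secret (hypothetical name).
                {"name": "app-creds", "secret": {"secretName": "app-creds"}},
                # Local scratch storage that lives as long as the Pod.
                {"name": "scratch", "emptyDir": {}},
            ],
        },
    }


# Placeholder pod name and image, for illustration only.
manifest = build_pod_manifest(
    "isolation-demo", "example.com/my-team/pyspark-app:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.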
10. Dependencies
● Spark alone isn’t enough
● Think: spaCy, scikit-learn, TensorFlow, etc.
● YARN: Shared conda env, but supporting different versions is hard
Fuzzy Gerdes
11. Kubernetes architecture
[Diagram: nodes A and B (196.0.0.5, 196.0.0.6) hosting Pods 1, 2 and 3, each Pod with its own virtual IP (10.0.0.2, 10.0.0.3, 10.0.1.2)]
Pod, a unit of scheduling and isolation.
● runs a user program in a primary container
● holds isolation layers like a virtual IP in an infra container
Robbt
12. Big Data on Kubernetes
Since Spark 2.3, the community has been working on a few
important new features that make Spark on Kubernetes more
usable and ready for a broader spectrum of use cases:
● non-JVM binding support and memory customization
● client-mode support for running interactive apps
● Kerberos support
● large framework refactors: removing the init-container; scheduler
The Last Cookie
13. Spark on Kubernetes
[Diagram: Spark Core talks to the Kubernetes Scheduler Backend, which asks the Kubernetes Cluster for new executors and to remove executors]
configuration
• Resource Requests
• Authnz
• Communication with K8s
babbagecabbage
14. Spark on Kubernetes
[Diagram: two clients each submit a job to nodes A and B (196.0.0.5, 196.0.0.6). Job 1 runs a Driver Pod and Executor Pods 1 and 2 (10.0.0.2, 10.0.0.3, 10.0.1.2). Job 2 runs its own Driver Pod and Executor Pods 1 and 2 (10.0.0.4, 10.0.0.5, 10.0.1.3).]
15. How to change to running on Kubernetes?
In theory “just”:
--master yarn to --master k8s://[...]
In practice:
● Build a container with your dependencies
● Possibly change your storage (HDFS to S3 or GCS)
● Change your cluster manager
● Re-do your tuning work
Hisashi
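To make the "in theory" step concrete, here is a small sketch that assembles the `spark-submit` arguments for both cluster managers. The `--master`, `--deploy-mode` and `--conf` flags and the two configuration keys are standard Spark options; the API server address, container image and application paths are placeholders assumed for this example.

```python
import shlex


def submit_command(master, app, deploy_mode="cluster", conf=None):
    """Build a spark-submit argument list for the given cluster manager."""
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Sort the keys so the generated command is stable between runs.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# On YARN the cluster already knows where the Python environment lives.
yarn_cmd = submit_command("yarn", "wordcount.py")

# On Kubernetes the master is the API server URL, and the dependencies
# travel inside a container image (hypothetical address and image name).
k8s_cmd = submit_command(
    "k8s://https://k8s-apiserver.example.com:6443",
    # local:// points at a file already baked into the image.
    "local:///opt/spark/work-dir/wordcount.py",
    conf={
        "spark.kubernetes.container.image":
            "example.com/my-team/pyspark-app:latest",
        "spark.executor.instances": "2",
    },
)

print(shlex.join(yarn_cmd))
print(shlex.join(k8s_cmd))
```

The flag change itself is one line; the extra `--conf` entries hint at the "in practice" work, since the image has to be built and pushed before this command can succeed.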
16. Demo: Everyone loves wordcount!
It’s big data which means we have to do WordCount
Recorded demo - https://youtu.be/jaIU2VCTv88
Hisashi
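For readers who have not seen it, a PySpark word count looks roughly like the sketch below. It is not the code from the recorded demo; it uses the standard RDD methods `textFile`, `flatMap`, `map` and `reduceByKey`, and the function names are chosen for this example.

```python
from operator import add


def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def word_counts(sc, path):
    """Return an RDD of (word, count) pairs.

    `sc` is expected to be a pyspark SparkContext; nothing runs on the
    cluster until an action such as take() or collect() is called.
    """
    return (sc.textFile(path)
              .flatMap(tokenize)
              .map(lambda word: (word, 1))
              .reduceByKey(add))
```

With a running session this could be called as `word_counts(spark.sparkContext, path).take(10)`, where `path` is whatever input your storage layer exposes. The same function works unchanged on YARN or Kubernetes, because the cluster manager is chosen at submit time, not in the job code.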
17. Demo #2: Wordcount in client mode on K8s
Recorded demo - https://youtu.be/s2aU81Zyq9E
Luxus M
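Client mode needs a little more configuration than cluster mode, because the executor pods must be able to connect back to a driver that runs outside of them. The sketch below collects the relevant settings in a dict. The keys are standard Spark configuration properties; the API server address, image, driver host name and the port number 7078 are arbitrary values assumed for this example.

```python
def client_mode_conf(api_server, image, driver_host, driver_port=7078):
    """Return Spark settings for a client-mode driver on Kubernetes."""
    return {
        "spark.master": f"k8s://{api_server}",
        "spark.submit.deployMode": "client",
        "spark.kubernetes.container.image": image,
        # Executors connect back to the driver at this address, so it
        # must be routable from inside the cluster.
        "spark.driver.host": driver_host,
        # Pin the port so it can be exposed, e.g. by a headless service.
        "spark.driver.port": str(driver_port),
    }


# Hypothetical values for illustration only.
conf = client_mode_conf(
    "https://k8s-apiserver.example.com:6443",
    "example.com/my-team/pyspark-app:latest",
    "notebook-driver.default.svc.cluster.local",
)
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`, which is also how a notebook would typically start its session.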
18. Demo #3: Wordcount in a notebook on K8s
Everyone loves notebooks, except ops, qa and your very
stressed out data engineers.
Recorded demo - https://youtu.be/eMj0Pv1-Nfo
Tim (Timothy) Pearce
19. What do we need to do next?
● Support dynamic scaling
● Storage?
● Better auth integration
● Better documentation (ugh client mode)
Hisashi
20. Dynamic Scaling:
● Need a separate shuffle service
● We could do smart scale down maybe -
https://github.com/apache/spark/pull/19045
Jennifer C.
21. Related talks & blog posts
● Running custom Spark on GKE and Azure -
https://www.oreilly.com/ideas/how-to-run-a-custom-version-of-spark-on-hosted-kubernetes
● Deploying Spark on Kubernetes -
http://spark.apache.org/docs/latest/running-on-kubernetes.html
● Getting PySpark 2.4 working on GKE recorded livestream -
https://www.youtube.com/watch?v=3j9D7B6PE60
Interested in OSS (especially Spark)?
● Check out my Twitch & Youtube for livestreams - http://twitch.tv/holdenkarau
& https://www.youtube.com/user/holdenkarau
Becky Lai
22. Books:
● Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
23. High Performance Spark!
Available today, not a lot on testing and almost nothing on
validation, but that should not stop you from buying several
copies (if you have an expense account).
Cats love it!
Amazon sells it: http://bit.ly/hkHighPerfSpark :D
24. Sign up for the mailing list @
http://www.distributedcomputing4kids.com
25. And some upcoming talks:
● November
○ Saturday - Scale By The Bay - San Francisco
● December
○ ScalaX - London
● January
○ Data Day Texas
● February
○ TBD
● March
○ Strata San Francisco
26. Cat wave photo by Quinn Dombrowski
k thnx bye! (or questions…)
If you want to fill out the survey:
http://bit.ly/holdenTestingSpark
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
I’ll be around - I have a light-up jacket, but you can message me on Twitter too.