Chicago Cloud Conference 2019: Native Spark Executors on Kubernetes
Native Spark Executors
on Kubernetes
Diving into the Data Lake
Grace Chang
Mariano Gonzalez
Chicago Cloud Conference 2019
bit.ly/spark-k8s-code
The Presenters
Who are we?
Grace Chang
Senior Big Data Engineer at Glassdoor
Grace is an engineer with years of experience
ingesting, transforming, and storing data. Before
that, she spent her time building machine learning
models as a data scientist.
Mariano Gonzalez
Principal Data Engineer at Jellyvision
Mariano is an engineer with more than 15 years of
experience with the JVM. He enjoys working with
and exploring a variety of big data technologies. He
is an avid and prolific open-source contributor.
Most importantly, we are just two people trying to learn about and
share big data technologies and approaches.
Agenda
What are we going to talk about?
โ— Clarification of Assumptions
โ— Sharing of Motivations
โ— Discussion on Data Lakes
โ— Challenges Description
โ— Inspiration Explanation
โ— Description of Solution
โ— Demo of Solution
The Goal
Let’s start from the beginning: What are we trying to achieve?
Data Storage
Different Types
and Formats of
Data
Data Pipelines
User
Ingest, process, and surface large amounts of data in an accessible way.
Our Motivation
Why are we talking about this?
We have a complicated relationship with infrastructure.
We observed the strengths and weaknesses of each of
the implementations.
We Have Tried Three Different Spark Infrastructure
Implementations
We tried out the popular solutions and observed the
pros and cons of the technologies used.
We Have Tried Both Managed and
Unmanaged Solutions
We searched for new, elegant ways to set up Spark
infrastructure on a data lake.
Data Lake Introduction
Where did the term come from?
The concept of a data lake has been
around for a while.
The term “data lake” was first
introduced in 2010 by James Dixon,
CTO at Pentaho.
“If you think of a data mart as a store of
bottled water – cleansed and packaged
and structured for easy consumption –
the data lake is a large body of water in a
more natural state. The contents of the
data lake stream in from a source to fill
the lake, and various users of the lake
can come to examine, dive in, or take
samples.”
James Dixon, CTO of Pentaho
Characteristics of a Data Lake
What makes up a data lake?
Consolidation
Centralizing data brings a number of benefits: the data is easier to govern and manage, and
heterogeneous data sets are easier to discover non-disruptively.
Agility
Extending the architecture to different workloads is not difficult.
Collect and Store All Data at Any Scale
Cloud object storage services provide virtually unlimited space at very low cost.
Locate, Curate, and Secure Data
Having a centralized data lake makes it easier to keep track of what data you have, who has access
to it, what type of data you are storing, and what it’s being used for.
Data Lake vs Data Warehouse
What are the differences?
Data Stored in Its Native Format
● Data can be loaded and accessed faster because it does not need to go
through an initial transformation process.
● In traditional relational databases, data must be processed and
manipulated before it is stored.
Flexibility When Accessing Data
● Data scientists and engineers can access data faster than would be possible in
a traditional BI architecture.
A data lake is not a direct replacement for a data warehouse; they are complementary
technologies that serve different use cases with some overlap.
Data Lake vs Data Warehouse cont.
What are the differences?
Schema on Read
โ— Traditional data warehouses employ Schema-on-Write.
โ—‹ This requires an upfront data modeling exercise to define the schema for the
data.
โ— Schema-on-Read allows the schema to be developed and tailored on a case-by-
case basis.
โ—‹ The schema is developed and projected on the data sets required for a
particular use case.
โ—‹ This means that the data required is processed as needed.
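As a rough illustration of Schema-on-Read, the sketch below keeps raw JSON lines untouched and projects a different schema onto them for each use case. It is plain Python standing in for a real query engine; the sample records, the `read_with_schema` helper, and both schemas are hypothetical and not taken from the talk.

```python
import json

# Raw records land in the lake untouched; fields vary between records.
RAW_LINES = [
    '{"user": "ana", "text": "hello", "retweets": "3", "lang": "en"}',
    '{"user": "bo", "text": "hola", "lang": "es"}',
]


def read_with_schema(lines, schema):
    """Project a schema onto raw JSON lines at read time.

    `schema` maps a field name to a (caster, default) pair. Fields outside
    the schema are ignored and missing fields fall back to the default.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            row[field] = cast(record[field]) if field in record else default
        rows.append(row)
    return rows


# Two use cases read the same raw data with different schemas.
engagement_schema = {"user": (str, ""), "retweets": (int, 0)}
language_schema = {"lang": (str, "unknown")}

engagement_rows = read_with_schema(RAW_LINES, engagement_schema)
language_rows = read_with_schema(RAW_LINES, language_schema)
```

Nothing is modeled before the data is stored; each reader decides which fields it needs and how to type them.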
Avoiding the Data Swamp
How do we navigate challenges?
Although a data lake is a great
solution to manage data in a modern
data-driven environment, it is not
without its significant challenges.
“We see customers creating big data
graveyards, dumping everything into
the data lake and hoping to do
something with it down the road. But
then they just lose track of what’s there.
The main challenge is not creating
a data lake but taking advantage of the
opportunities it presents.”
Sean Martin, CTO of Cambridge Semantics
Data Lake Governance Challenges
What should we be aware of?
Data Lineage: Where did the data come from and what has happened to it?
Data Quality: Is the data accurate and fit for its purposes?
Data Security: Is the data protected from unauthorized access?
Data Catalog: What data do you have and where is it stored?
The Main Challenges
How do we balance and address these?
1. Operational Complexity of Maintaining a Data Lake Architecture
2. Static Architecture Resulting in Unextendable Pipelines
3. User Complexity in Accessing Large and Fragmented Data
Operational Complexity of Maintaining Data Lake Architecture
Some Cluster Management Solutions: YARN and Spark Standalone Clusters
Some Challenges:
โ— Cluster resizing difficulties
โ— Forced to scale compute and storage at the same time
โ— Often required vendor-specific bundle
โ— Lack of dynamic resource allocation
โ— Security setup labyrinth (AD, Kerberos, Centrify)
โ— (spark standalone) Difficulty maintaining a cluster with different runtimes
โ— (YARN) Exorbitant Amount of Different Configuration (massive XML files)
Static Architecture Resulting in Unextendable Pipelines
It is difficult to evolve an architecture
initially designed for one type of data
ingestion
○ e.g., adding a streaming
component to a batch
architecture is an involved process
User Complexity in Accessing Large and Fragmented Data
โ— Friction between users (data scientists,
analysts, etc) and convenient data access
patterns
โ— Data from different sources can be hard
to merge together
โ—‹ Often one set data type (e.g. can only
ingest avro)
Inspiration
How did we get to our solution?
Medium Article on Data Infrastructure at Airbnb (2016)
Inspiration
How did we get to our solution?
Databricks Delta Lake (2019)
Solution
How did we get to our solution?
A Central Place Where Engineers and Data Scientists Can Collaborate and Run Diverse
Workflows
โ— K8s to do cluster management/scaling
โ— Spark to do data transformations
โ— Managed Services
โ— Bronze/Silver/Gold pipeline organization
Solution: Storage
How did we get to our solution?
Most data lake implementations use cloud object storage as the underlying storage
technology. It is recommended for the following reasons:
Durable: You can typically expect to see eleven 9’s of durability.
Scalable: Object storage is particularly well suited to storing vast amounts of
unstructured data, and storage capacity is virtually unlimited.
Affordable: You can store data for approximately $0.01 per GB/Month.
Secure: Granular access down to the object level.
Integrated: Most processing technologies support object storage as both a source
and a data sink.
Can it be a managed service? YES!
Solution: Compute
How did we get to our solution?
K8s has positioned itself as the de facto container orchestrator, and Spark
continues to be the tool of choice for running big data workloads. So why
don’t we run Spark on K8s?
Extensible: No extra infrastructure to run your data workloads
Upgradable: Running Spark executors on K8s means we no longer
need to upgrade the cluster; all runtimes are available right from
the beginning via Docker images (Scala, Python, R)
Can it be a managed service? YES!
Why Use Kubernetes as a Cluster Manager?
Are you using data analytics pipelines which are already
containerized?
If a typical pipeline requires a broker, Spark, a database, and some visualization,
and everything runs in containers, it makes sense to also run Spark in
containers and use K8s to manage your entire pipeline.
Resource sharing is better optimized
Instead of running your pipeline on dedicated hardware, it is more efficient
to run it on a Kubernetes cluster, which shares resources better
because not all components in a pipeline are running all the time.
Why Use Kubernetes as a Cluster Manager?
Leveraging Kubernetes Ecosystem
Spark workloads can make direct use of Kubernetes clusters for multi-tenancy
and sharing through Namespaces and Quotas, as well as administrative features such as
Pluggable Authorization and Logging.
It requires no changes or new installations on the cluster; simply create a
container image and set up the right RBAC roles for your Spark application and
you’re all set.
So How Does This Actually Work?
spark-submit can be directly used to submit a Spark
application to a Kubernetes cluster. The submission
mechanism works as follows:
โ— Spark creates a spark driver running within a
Kubernetes pod.
โ— The driver creates executors which are also running
within Kubernetes pods and connects to them, and
executes application code.
โ— When the application completes, the executor pods
terminate and are cleaned up, but the driver pod
persists logs and remains in โ€œcompletedโ€ state in the
Kubernetes API until itโ€™s eventually garbage collected
or manually cleaned up.
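To make the submission step concrete, here is a minimal sketch that assembles a cluster-mode spark-submit command for a Kubernetes master. The flags and configuration keys come from Spark's Kubernetes documentation; the API server URL, image name, application path, job name, and service account are placeholder assumptions, not values from the talk.

```python
import shlex


def build_spark_submit(api_server, image, app_path, name,
                       executors=2, service_account="spark"):
    """Return the argv list for a cluster-mode spark-submit on Kubernetes."""
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to use Kubernetes as cluster manager.
        "--master", f"k8s://{api_server}",
        # Cluster mode runs the driver itself inside a pod.
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # Service account whose RBAC role lets the driver create executor pods.
        "--conf",
        "spark.kubernetes.authenticate.driver.serviceAccountName="
        f"{service_account}",
        app_path,
    ]


# All values below are hypothetical placeholders.
command = build_spark_submit(
    api_server="https://k8s.example.com:6443",
    image="example/spark-py:latest",
    app_path="local:///opt/spark/app/spark_consumer.py",
    name="bronze-consumer",
)
print(shlex.join(command))
```

The printed command can be pasted into a shell on a machine that has a Spark distribution and access to the cluster; the `local://` scheme points at a file already baked into the container image.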
Advantages of the Solution
Decoupling of compute and storage, which
means they can scale independently!
Advantages of the Solution: Compute
โ— Kubernetes engine is available as managed service in all major cloud providers (AWS, GCP,
Azure)
โ— Multiple data science and engineering workloads:
โ—‹ Streaming / Real Time: CPU intensive
โ—‹ Batch / Analytic: Storage intensive
โ— Minimize operational burden with managed service while maximizing flexibility by taking
advantage of Kubernetes
โ—‹ Dynamic and lightweight scaling
โ— Since we are deploying Docker images there is no need to โ€œpatchโ€ the cluster
โ—‹ Developers are responsible for updating runtimes and libraries:
โ–  Scala/Java/Python/R dependencies
Advantages of the Solution: Storage
โ— Storage Object as a managed service!
โ— Usage of S3 allows for stable and unified source of data
โ— Option to ingest data in different formats type because:
โ—‹ Structure
โ—‹ Semistructured
โ—‹ Un-structured
โ— Realtime and historical data available data without much changes
โ—‹ No rollup processes to another storage technology
Our Example Implementation
1. Stream Data From Twitter
2. Save Data To Data Lake
3. Transform Saved Data in Data Lake
4. Aggregate Data in Data Lake
5. Visualize Data in Data Lake
Architecture Diagram
Architecting Data Lake Storage
A common misconception and potential mistake is that data lakes are one giant,
centralized bucket where everything needs to land. That is not true!
Combining ideas from Airbnb and Databricks, we obtain the following architecture:
Bronze Bucket: where raw data lands (Avro, JSON, CSV, etc.)
Silver Bucket: columnar data (Parquet, ORC)
Gold Bucket: columnar data (aggregated)
Finally, time for some code!
bit.ly/spark-k8s-code
Bronze Pipeline
Producer (twitter_producer.py): Pod in Kubernetes
Tweets are streamed into the producer.
TCP: Producer sends data to the Spark job.
Consumer (spark_consumer.py): Spark Streaming Job in Kubernetes
S3 Bucket (Bronze): JSON file is appended.
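The bronze flow can be approximated in a few lines of standard-library Python. This is only a sketch of the data movement, not the code behind bit.ly/spark-k8s-code: `socket.socketpair()` stands in for the TCP connection between the two pods, a local temporary file stands in for the bronze S3 bucket, and the `produce` and `consume` helpers and the sample tweets are hypothetical.

```python
import json
import os
import socket
import tempfile


def produce(conn, tweets):
    """Send each tweet as one newline-delimited JSON document, then close."""
    with conn:
        for tweet in tweets:
            conn.sendall((json.dumps(tweet) + "\n").encode("utf-8"))


def consume(conn, bronze_path):
    """Read JSON lines from the socket and append them to the bronze file."""
    written = 0
    with conn, conn.makefile("r", encoding="utf-8") as stream, \
            open(bronze_path, "a", encoding="utf-8") as bronze:
        for line in stream:
            json.loads(line)  # reject anything that is not valid JSON
            bronze.write(line)
            written += 1
    return written


# A connected socket pair replaces the producer-to-consumer TCP link.
producer_end, consumer_end = socket.socketpair()
bronze_file = os.path.join(tempfile.mkdtemp(), "tweets.json")

produce(producer_end, [{"id": 1, "text": "spark on k8s"},
                       {"id": 2, "text": "data lake"}])
count = consume(consumer_end, bronze_file)
```

The bronze file is opened in append mode, so repeated runs keep adding raw records without rewriting what is already there.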
Silver Pipeline
S3 Bucket (Bronze): JSON file.
Spark Streaming Job in Kubernetes: Transformer pulls data from the bronze bucket,
converts it to Parquet, and saves it to the silver bucket.
S3 Bucket (Silver): Parquet file is written.
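The core idea of the silver step is the move from row-oriented records to a column-oriented layout. The sketch below shows that pivot in plain Python; it does not write real Parquet, and the sample records and the `to_columns` helper are hypothetical.

```python
import json

# Row-oriented records as they might sit in the bronze bucket.
BRONZE_LINES = [
    '{"id": 1, "lang": "en", "retweets": 3}',
    '{"id": 2, "lang": "es"}',
    '{"id": 3, "lang": "en", "retweets": 1}',
]


def to_columns(json_lines, fields):
    """Pivot row-oriented JSON records into one list per column.

    Missing values become None so every column has the same length.
    """
    columns = {field: [] for field in fields}
    for line in json_lines:
        record = json.loads(line)
        for field in fields:
            columns[field].append(record.get(field))
    return columns


silver = to_columns(BRONZE_LINES, ["id", "lang", "retweets"])
```

Storing values column by column is what lets formats such as Parquet and ORC read only the fields a query touches.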
Gold Pipeline
S3 Bucket (Silver): Parquet file.
Spark Streaming Job in Kubernetes: Aggregator pulls data from the silver bucket,
aggregates it, then saves it to the gold bucket.
S3 Bucket (Gold): Parquet file containing aggregated data is written.
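As an illustration of the kind of aggregation a gold step might perform, the sketch below counts tweets and sums retweets per language over columnar data. The talk does not say which aggregates its demo computes, so the metric, the sample columns, and the `aggregate_by_language` helper are all assumptions.

```python
from collections import defaultdict

# Columnar input, shaped like the output of a silver-style transformation.
SILVER = {
    "lang": ["en", "es", "en"],
    "retweets": [3, None, 1],
}


def aggregate_by_language(columns):
    """Return tweet count and total retweets per language."""
    totals = defaultdict(lambda: {"tweets": 0, "retweets": 0})
    for lang, retweets in zip(columns["lang"], columns["retweets"]):
        totals[lang]["tweets"] += 1
        totals[lang]["retweets"] += retweets or 0  # treat missing as zero
    return dict(totals)


gold = aggregate_by_language(SILVER)
```

The result is small and query-ready, which is what makes the gold bucket a good source for notebooks and dashboards.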
Visualization
S3 Bucket (Gold): Parquet file of aggregated data.
Zeppelin in Kubernetes: Allows the user to query data.
User: Queries data.
Future Work
What about all the management, monitoring, and scheduling for Spark jobs?
● State
● Metrics
● Retries
● Logs
Thankfully, Google announced the Spark Kubernetes operator, which will take care of all that (and
more).
Spark K8s operator Roadmap: GoogleCloudPlatform/spark-on-k8s-operator/issues/338
Questions?
Links
https://kubernetes.io/docs/home/
https://spark.apache.org/docs/latest/
https://zeppelin.apache.org/docs/latest/
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
https://databricks.com/product/databricks-delta
https://docs.aws.amazon.com/s3/index.html
https://docs.aws.amazon.com/eks/index.html
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
38
Chicago Cloud Conference 2019: Native Spark Executors on Kubernetes

More Related Content

What's hot

No sql now2011_review_of_adhoc_architectures
No sql now2011_review_of_adhoc_architecturesNo sql now2011_review_of_adhoc_architectures
No sql now2011_review_of_adhoc_architectures
Nicholas Goodman
ย 
Life is a Stream of Events
Life is a Stream of Events Life is a Stream of Events
Life is a Stream of Events
confluent
ย 
Data Democratization at Nubank
 Data Democratization at Nubank Data Democratization at Nubank
Data Democratization at Nubank
Databricks
ย 

What's hot (20)

Postgres Vision 2018: Data as the New Oil
Postgres Vision 2018: Data as the New OilPostgres Vision 2018: Data as the New Oil
Postgres Vision 2018: Data as the New Oil
ย 
PgConf 2018 - Postgres in a World of DevOps
PgConf 2018 - Postgres in a World of DevOpsPgConf 2018 - Postgres in a World of DevOps
PgConf 2018 - Postgres in a World of DevOps
ย 
No sql now2011_review_of_adhoc_architectures
No sql now2011_review_of_adhoc_architecturesNo sql now2011_review_of_adhoc_architectures
No sql now2011_review_of_adhoc_architectures
ย 
Postgres Vision 2018: AI Needs IA
Postgres Vision 2018: AI Needs IAPostgres Vision 2018: AI Needs IA
Postgres Vision 2018: AI Needs IA
ย 
Airbyte - Seed deck
Airbyte  - Seed deckAirbyte  - Seed deck
Airbyte - Seed deck
ย 
Connect Faster with SnapLogic at Workday Rising
Connect Faster with SnapLogic at Workday RisingConnect Faster with SnapLogic at Workday Rising
Connect Faster with SnapLogic at Workday Rising
ย 
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
ย 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
ย 
Life is a Stream of Events
Life is a Stream of Events Life is a Stream of Events
Life is a Stream of Events
ย 
Postgres Vision 2018: The Changing Role of the DBA in the Cloud
Postgres Vision 2018: The Changing Role of the DBA in the CloudPostgres Vision 2018: The Changing Role of the DBA in the Cloud
Postgres Vision 2018: The Changing Role of the DBA in the Cloud
ย 
Postgres Vision 2018: Your Migration Path - Rabobank and a New DBaaS
Postgres Vision 2018: Your Migration Path - Rabobank and a New DBaaS  Postgres Vision 2018: Your Migration Path - Rabobank and a New DBaaS
Postgres Vision 2018: Your Migration Path - Rabobank and a New DBaaS
ย 
Worldwide Hybrid Cloud Computing Market โ€“ Drivers, Opportunities, Trends, and...
Worldwide Hybrid Cloud Computing Market โ€“ Drivers, Opportunities, Trends, and...Worldwide Hybrid Cloud Computing Market โ€“ Drivers, Opportunities, Trends, and...
Worldwide Hybrid Cloud Computing Market โ€“ Drivers, Opportunities, Trends, and...
ย 
Digital Shift in Insurance: How is the Industry Responding with the Influx of...
Digital Shift in Insurance: How is the Industry Responding with the Influx of...Digital Shift in Insurance: How is the Industry Responding with the Influx of...
Digital Shift in Insurance: How is the Industry Responding with the Influx of...
ย 
Data Democratization at Nubank
 Data Democratization at Nubank Data Democratization at Nubank
Data Democratization at Nubank
ย 
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming AnalyticsR, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
ย 
Building a Consistent Hybrid Cloud Semantic Model In Denodo
Building a Consistent Hybrid Cloud Semantic Model In DenodoBuilding a Consistent Hybrid Cloud Semantic Model In Denodo
Building a Consistent Hybrid Cloud Semantic Model In Denodo
ย 
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics PlatformDriven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
ย 
Future of data
Future of dataFuture of data
Future of data
ย 
Enterprise 360 - Graphs at the Center of a Data Fabric
Enterprise 360 - Graphs at the Center of a Data FabricEnterprise 360 - Graphs at the Center of a Data Fabric
Enterprise 360 - Graphs at the Center of a Data Fabric
ย 
Hadoop for Humans: Introducing SnapReduce 2.0
Hadoop for Humans: Introducing SnapReduce 2.0Hadoop for Humans: Introducing SnapReduce 2.0
Hadoop for Humans: Introducing SnapReduce 2.0
ย 

Similar to Native Spark Executors on Kubernetes: Diving into the Data Lake - Chicago Cloud Conference 2019

Chug building a data lake in azure with spark and databricks
Chug   building a data lake in azure with spark and databricksChug   building a data lake in azure with spark and databricks
Chug building a data lake in azure with spark and databricks
Brandon Berlinrut
ย 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
ย 

Similar to Native Spark Executors on Kubernetes: Diving into the Data Lake - Chicago Cloud Conference 2019 (20)

Chug building a data lake in azure with spark and databricks
Chug   building a data lake in azure with spark and databricksChug   building a data lake in azure with spark and databricks
Chug building a data lake in azure with spark and databricks
ย 
Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes
ย 
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
ย 
Boston Data Engineering: Kedro Python Framework for Data Science: Overview an...
Boston Data Engineering: Kedro Python Framework for Data Science: Overview an...Boston Data Engineering: Kedro Python Framework for Data Science: Overview an...
Boston Data Engineering: Kedro Python Framework for Data Science: Overview an...
ย 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
ย 
Zenko @Cloud Native Foundation London Meetup March 6th 2018
Zenko @Cloud Native Foundation London Meetup March 6th 2018Zenko @Cloud Native Foundation London Meetup March 6th 2018
Zenko @Cloud Native Foundation London Meetup March 6th 2018
ย 
Cloud and Data Analytics Architecture: Data Everywhere for Everyone
Cloud and Data Analytics Architecture: Data Everywhere for EveryoneCloud and Data Analytics Architecture: Data Everywhere for Everyone
Cloud and Data Analytics Architecture: Data Everywhere for Everyone
ย 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
ย 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
ย 
Microservices Patterns with GoldenGate
Microservices Patterns with GoldenGateMicroservices Patterns with GoldenGate
Microservices Patterns with GoldenGate
ย 
Bahrain ch9 introduction to docker 5th birthday
Bahrain ch9 introduction to docker 5th birthday Bahrain ch9 introduction to docker 5th birthday
Bahrain ch9 introduction to docker 5th birthday
ย 
2017 Hackathon Scality & 42 School
2017 Hackathon Scality & 42 School2017 Hackathon Scality & 42 School
2017 Hackathon Scality & 42 School
ย 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data Virtualization
ย 
Cloud Customer Architecture for Big Data and Analytics
Cloud Customer Architecture for Big Data and AnalyticsCloud Customer Architecture for Big Data and Analytics
Cloud Customer Architecture for Big Data and Analytics
ย 
2022-Devnexus-StatefulMicroservices.pptx.pdf
2022-Devnexus-StatefulMicroservices.pptx.pdf2022-Devnexus-StatefulMicroservices.pptx.pdf
2022-Devnexus-StatefulMicroservices.pptx.pdf
ย 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
ย 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
ย 
Containers and Kubernetes without limits
Containers and Kubernetes without limitsContainers and Kubernetes without limits
Containers and Kubernetes without limits
ย 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SRE
ย 
TOWARDS Hybrid OpenStack Clouds in the Real World
TOWARDS Hybrid OpenStack Clouds in the Real WorldTOWARDS Hybrid OpenStack Clouds in the Real World
TOWARDS Hybrid OpenStack Clouds in the Real World
ย 

Recently uploaded

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
ย 
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
anilsa9823
ย 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
ย 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
bodapatigopi8531
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online โ˜‚๏ธ
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online  โ˜‚๏ธCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online  โ˜‚๏ธ
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online โ˜‚๏ธ
anilsa9823
ย 

Recently uploaded (20)

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
ย 
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...
ย 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
ย 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
ย 
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
ย 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
ย 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
ย 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
ย 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
ย 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...

Native Spark Executors on Kubernetes: Diving into the Data Lake - Chicago Cloud Conference 2019

  • 1. 1 Chicago Cloud Conference 2019: Native Spark Executors on Kubernetes Native Spark Executors on Kubernetes Diving into the Data Lake Grace Chang Mariano Gonzalez Chicago Cloud Conference 2019 bit.ly/spark-k8s-code
  • 2. The Presenters: Who are we? Grace Chang, Senior Big Data Engineer at Glassdoor: Grace is an engineer with years of experience ingesting, transforming, and storing data. Before that, she spent her time building machine learning models as a data scientist. Mariano Gonzalez, Principal Data Engineer at Jellyvision: Mariano is an engineer with more than 15 years of experience with the JVM. He enjoys working with and exploring a variety of big data technologies. He is an avid and prolific open-source contributor. Most importantly, we are just two people trying to learn about and share big data technologies and approaches.
  • 3. Agenda: What are we going to talk about? ● Clarification of Assumptions ● Sharing of Motivations ● Discussion on Data Lakes ● Challenges Description ● Inspiration Explanation ● Description of Solution ● Demo of Solution
  • 4. The Goal: Let’s start from the beginning: What are we trying to achieve? Ingest, process, and surface large amounts of data in an accessible way. [Diagram labels: Different Types and Formats of Data, Data Pipelines, Data Storage, User]
  • 5. Our Motivation: Why are we talking about this? We have a complicated relationship with infrastructure. We Have Tried Three Different Spark Infrastructure Implementations: We observed the strengths and weaknesses of each of the implementations. We Have Tried Both Managed and Unmanaged Solutions: We tried out the popular solutions and observed the pros and cons of the technologies used. We searched for new, elegant ways to set up Spark infrastructure on a data lake.
  • 6. Data Lake Introduction: Where did the term come from? The concept of a data lake has been around for a while. The term “data lake” was first introduced in 2010 by James Dixon, CTO at Pentaho: “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
  • 7. Characteristics of a Data Lake: What makes up a data lake? Consolidation: Centralizing data brings a number of benefits: the data is easier to govern and manage, and heterogeneous data sets are easier to discover non-disruptively. Agility: Extending the architecture to different workloads is not difficult. Collect and Store All Data at Any Scale: Cloud object storage services provide virtually unlimited space at very low cost. Locate, Curate, and Secure Data: Having a centralized data lake makes it easier to keep track of what data you have, who has access to it, what type of data you are storing, and what it’s being used for.
  • 8. Data Lake vs Data Warehouse: What are the differences? A data lake is not a direct replacement for a data warehouse; they are supplemental technologies that serve different use cases with some overlap. Data Stored in Its Native Format: ● Data can be loaded faster and accessed quicker since it does not need to go through an initial transformation process. ● For traditional relational databases, data would need to be processed and manipulated before being stored. Flexibility When Accessing Data: ● Data scientists and engineers can access data faster than would be possible in a traditional BI architecture.
  • 9. Data Lake vs Data Warehouse cont.: What are the differences? Schema on Read ● Traditional data warehouses employ Schema-on-Write. ○ This requires an upfront data modeling exercise to define the schema for the data. ● Schema-on-Read allows the schema to be developed and tailored on a case-by-case basis. ○ The schema is developed and projected on the data sets required for a particular use case. ○ This means that the data required is processed as needed.
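
The schema-on-read idea on this slide can be illustrated with a minimal sketch in plain Python. It is not the presenters' code and does not use Spark; the event fields, the two schemas, and the `read_with_schema` helper are all hypothetical names chosen for illustration. The point is only that the raw records are stored untouched and each use case projects its own schema when it reads them.

```python
import json

# Raw events are stored exactly as they arrived (no upfront modeling).
RAW_EVENTS = [
    '{"user": "ada", "text": "hello", "retweets": "3", "lang": "en"}',
    '{"user": "bob", "text": "hola", "lang": "es"}',
]

# Two hypothetical use cases, each with its own schema applied at read time.
# A schema maps a field name to (converter, default value).
ENGAGEMENT_SCHEMA = {"user": (str, ""), "retweets": (int, 0)}
LANGUAGE_SCHEMA = {"lang": (str, "unknown")}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and keep only the fields the schema asks for."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            # Missing fields fall back to the default; present ones are cast.
            row[field] = convert(record.get(field, default))
        rows.append(row)
    return rows


engagement_rows = read_with_schema(RAW_EVENTS, ENGAGEMENT_SCHEMA)
language_rows = read_with_schema(RAW_EVENTS, LANGUAGE_SCHEMA)
```

Both reads work off the same stored bytes; adding a third use case would mean adding a schema, not reloading the data.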
  • 10. Avoiding the Data Swamp: How do we navigate challenges? Although a data lake is a great solution to manage data in a modern data-driven environment, it is not without its significant challenges. “We see customers creating big data graveyards, dumping everything into the data lake and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake but taking advantage of the opportunities it presents.” Sean Martin, CTO of Cambridge Semantics
  • 11. Data Lake Governance Challenges: What should we be aware of? Data Lineage: Where did the data come from and what has happened to it? Data Quality: Is the data accurate and fit for its purposes? Data Security: Is the data protected from unauthorized access? Data Catalog: What data do you have and where is it stored?
  • 12. The Main Challenges: How do we balance and address these? 1. Operational Complexity of Maintaining a Data Lake Architecture 2. Static Architecture Resulting in Unextendable Pipelines 3. User Complexity in Accessing Large and Fragmented Data
  • 13. Operational Complexity of Maintaining Data Lake Architecture. Some Cluster Management Solutions: YARN and Spark Standalone Clusters. Some Challenges: ● Cluster resizing difficulties ● Forced to scale compute and storage at the same time ● Often requires a vendor-specific bundle ● Lack of dynamic resource allocation ● Security setup labyrinth (AD, Kerberos, Centrify) ● (Spark Standalone) Difficulty maintaining a cluster with different runtimes ● (YARN) Exorbitant amount of configuration (massive XML files)
  • 14. Static Architecture Resulting in Unextendable Pipelines: It is difficult to evolve an architecture initially designed for one type of data ingestion. ○ e.g. Adding a streaming component to a batch architecture is an involved change.
  • 15. User Complexity in Accessing Large and Fragmented Data ● Friction between users (data scientists, analysts, etc.) and convenient data access patterns ● Data from different sources can be hard to merge together ○ Often limited to one fixed data format (e.g. can only ingest Avro)
  • 16. Inspiration: How did we get to our solution? Medium Article on Data Infrastructure at Airbnb (2016)
  • 17. Inspiration: How did we get to our solution? Databricks Delta Lake (2019)
  • 18. Solution: How did we get to our solution? A Central Place Where Engineers and Data Scientists Can Collaborate and Run Diverse Workflows ● K8s to do cluster management/scaling ● Spark to do data transformations ● Managed Services ● Bronze/Silver/Gold pipeline organization
  • 19. Solution: Storage. How did we get to our solution? Most data lake implementations use cloud object storage as the underlying storage technology. It is recommended for the following reasons: Durable: You can typically expect to see eleven 9’s of durability. Scalable: Object storage is particularly well suited to storing vast amounts of unstructured data, and storage capacity is virtually unlimited. Affordable: You can store data for approximately $0.01 per GB/Month. Secure: Granular access down to the object level. Integrated: Most processing technologies support object storage as both a source and a data sink. Can it be a managed service? YES!
  • 20. Solution: Compute. How did we get to our solution? K8s has positioned itself as the de facto container orchestrator, and Spark continues to be the tool of choice for running big data workloads. So why don’t we run Spark on K8s? Extensible: No extra infrastructure to run your data workloads. Upgradable: Running Spark executors on K8s means we no longer need to upgrade the cluster; all runtimes (Scala, Python, R) are available right from the beginning via Docker images. Can it be a managed service? YES!
  • 21. Why Use Kubernetes as a Cluster Manager? Are you using data analytical pipelines which are already containerized? If a typical pipeline requires a broker, Spark, a database, and some visualization, and everything runs on containers, it makes sense to also run Spark on containers and use K8s to manage your entire pipeline. Resource sharing is better optimized: Instead of running your pipeline on dedicated hardware, it is more efficient to run it on a Kubernetes cluster. Resources are shared better because not all components in a pipeline are running all the time.
  • 22. Why Use Kubernetes as a Cluster Manager? Leveraging the Kubernetes Ecosystem: Spark workloads can make direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging. It requires no changes or new installations on the cluster; simply create a container image and set up the right RBAC roles for your Spark application and you’re all set.
  • 23. So How Does This Actually Work? spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows: ● Spark creates a Spark driver running within a Kubernetes pod. ● The driver creates executors, which also run within Kubernetes pods, connects to them, and executes application code. ● When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
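
The submission step above can be sketched as a small Python helper that assembles a `spark-submit` argument list. This is an illustrative sketch, not the presenters' script: the helper name `build_spark_submit`, the API server address, the image name, and the application path are all placeholder assumptions. The flags and configuration keys themselves (`--master k8s://...`, `--deploy-mode cluster`, `spark.executor.instances`, `spark.kubernetes.namespace`, `spark.kubernetes.container.image`, `spark.kubernetes.authenticate.driver.serviceAccountName`) are the ones documented for Spark on Kubernetes.

```python
def build_spark_submit(api_server, image, app_path, name="example-app",
                       executors=2, namespace="default", service_account=None):
    """Assemble a spark-submit argv for cluster mode on Kubernetes.

    All argument values are supplied by the caller; nothing here is
    specific to the deck's demo cluster.
    """
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # Kubernetes API server URL
        "--deploy-mode", "cluster",          # the driver runs in a pod too
        "--name", name,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
    ]
    if service_account:
        # The driver needs RBAC permission to create executor pods.
        cmd += ["--conf",
                "spark.kubernetes.authenticate.driver.serviceAccountName="
                f"{service_account}"]
    cmd.append(app_path)
    return cmd


# Placeholder values for illustration only.
example_cmd = build_spark_submit(
    api_server="https://k8s.example.invalid:6443",
    image="example/spark-py:latest",
    app_path="local:///opt/app/job.py",
    service_account="spark",
)
```

The resulting list could be handed to `subprocess.run(example_cmd, check=True)` on a machine with a Spark distribution installed; the `local://` scheme indicates that the application file is already inside the container image.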
  • 24. Advantages of the Solution: Decoupling of compute and storage, which means they can scale independently!
  • 25. Advantages of the Solution: Compute ● Kubernetes is available as a managed service on all major cloud providers (AWS, GCP, Azure) ● Multiple data science and engineering workloads: ○ Streaming / Real Time: CPU intensive ○ Batch / Analytic: Storage intensive ● Minimize operational burden with a managed service while maximizing flexibility by taking advantage of Kubernetes ○ Dynamic and lightweight scaling ● Since we are deploying Docker images, there is no need to “patch” the cluster ○ Developers are responsible for updating runtimes and libraries: ■ Scala/Java/Python/R dependencies
  • 26. Advantages of the Solution: Storage ● Object storage as a managed service! ● Usage of S3 allows for a stable and unified source of data ● Option to ingest data in different formats: ○ Structured ○ Semi-structured ○ Unstructured ● Real-time and historical data are both available without many changes ○ No rollup processes to another storage technology
  • 27. Our Example Implementation: Stream Data From Twitter → Save Data To Data Lake → Transform Saved Data in Data Lake → Aggregate Data in Data Lake → Visualize Data in Data Lake
  • 28. Architecture Diagram
  • 29. Architecting a Data Lake Storage: A common misconception and potential mistake is that data lakes are one giant, centralized bucket where everything needs to land. That is not true! Combining the Airbnb and Databricks ideas, we obtain the following architecture: Bronze Bucket: raw data lands (avro, json, csv, etc.) Silver Bucket: columnar data (parquet, orc) Gold Bucket: columnar data (aggregated)
  • 30. Finally, time for some code! bit.ly/spark-k8s-code
  • 31. Bronze Pipeline: Tweets are streamed into the Producer (twitter_producer.py, a pod in Kubernetes). The Producer sends data over TCP to the Consumer (spark_consumer.py, a Spark Streaming job in Kubernetes). S3 Bucket (Bronze): JSON file is appended.
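
The producer-to-consumer hop above can be sketched with the standard library alone. This is not the code in twitter_producer.py or spark_consumer.py; it assumes newline-delimited JSON as the wire format (the slide only says "TCP"), uses a local socket pair in place of two pods, and the function names and sample tweets are hypothetical.

```python
import json
import socket


def send_tweets(conn, tweets):
    """Producer side: write each tweet as one JSON document per line."""
    for tweet in tweets:
        conn.sendall((json.dumps(tweet) + "\n").encode("utf-8"))


def receive_tweets(conn):
    """Consumer side: read newline-delimited JSON until the peer closes."""
    with conn.makefile("r", encoding="utf-8") as stream:
        return [json.loads(line) for line in stream if line.strip()]


# A connected socket pair stands in for the TCP link between the two pods.
producer_end, consumer_end = socket.socketpair()
sample = [{"id": 1, "text": "spark on k8s"}, {"id": 2, "text": "data lake"}]

send_tweets(producer_end, sample)
producer_end.close()          # closing signals end-of-stream to the reader
received = receive_tweets(consumer_end)
consumer_end.close()
```

One record per line keeps the framing trivial: the reader never has to guess where one tweet ends and the next begins, and each line can be appended to the bronze JSON file as-is.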
  • 32. Silver Pipeline: S3 Bucket (Bronze): JSON file. Spark Streaming Job in Kubernetes: the Transformer pulls data from the bronze bucket, converts it to parquet, and saves it to the silver bucket. S3 Bucket (Silver): Parquet file is written.
  • 33. Gold Pipeline: S3 Bucket (Silver): Parquet file. Spark Streaming Job in Kubernetes: the Aggregator pulls data from the silver bucket, aggregates it, then saves it to the gold bucket. S3 Bucket (Gold): Parquet file containing aggregated data is written.
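
The silver and gold steps can be mimicked in a few lines of plain Python to show the shape of the data at each tier. This is a stand-in under loud assumptions: it uses in-memory dictionaries rather than Spark, S3, or Parquet, and the record fields, the helper names `to_columns` and `aggregate_by`, and the choice to count tweets per language are all hypothetical.

```python
from collections import Counter


def to_columns(rows, fields):
    """Silver step: pivot row records into a column-oriented layout."""
    return {field: [row.get(field) for row in rows] for field in fields}


def aggregate_by(columns, key):
    """Gold step: count records per distinct value of one column."""
    return dict(Counter(columns[key]))


# Bronze: raw row-oriented records, as they landed.
bronze_rows = [
    {"user": "ada", "lang": "en", "text": "hello"},
    {"user": "bob", "lang": "es", "text": "hola"},
    {"user": "cy", "lang": "en", "text": "hi"},
]

silver = to_columns(bronze_rows, ["user", "lang"])   # columnar, selected fields
gold = aggregate_by(silver, "lang")                  # aggregated view
```

Each tier is derived only from the one before it, so the gold view can be rebuilt at any time from silver, and silver from bronze.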
  • 34. Visualization: S3 Bucket (Gold): Parquet file of aggregated data. Zeppelin in Kubernetes: allows the user to query data. User: queries data.
  • 35. Future Work: What about all the management / monitoring / scheduling for Spark jobs? ● State ● Metrics ● Retries ● Logs. Thankfully, Google announced the Spark Kubernetes operator, which will take care of all that (and more). Spark K8s operator roadmap: GoogleCloudPlatform/spark-on-k8s-operator/issues/338
  • 36. Questions?
  • 37. Links: https://kubernetes.io/docs/home/ https://spark.apache.org/docs/latest/ https://zeppelin.apache.org/docs/latest/ https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c https://databricks.com/product/databricks-delta https://docs.aws.amazon.com/s3/index.html https://docs.aws.amazon.com/eks/index.html https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
  • 38. 38 Chicago Cloud Conference 2019: Native Spark Executors on Kubernetes