Everybody wants to do big data on a data lake! However, implementing one and maintaining the infrastructure necessary to explore it, such as Spark, has historically been a challenging endeavor. Kubernetes is the tool of choice for cloud orchestration, and Spark continues to be the de facto framework for most data wrangling tasks. We've previously tried different data lake architectures, and suffered through the pain that Hadoop carries with it. Finally, we decided to bring the best of the cloud and big data worlds together, and walk you through a session on how to set up an endless data lake powered by native Spark executors on Kubernetes.
Native Spark Executors on Kubernetes: Diving into the Data Lake - Chicago Cloud Conference 2019
Chicago Cloud Conference 2019: Native Spark Executors on Kubernetes
Native Spark Executors
on Kubernetes
Diving into the Data Lake
Grace Chang
Mariano Gonzalez
Chicago Cloud Conference 2019
bit.ly/spark-k8s-code
The Presenters
Who are we?
Grace is an engineer with years of experience
ingesting, transforming, and storing data. Before
that, she spent her time building machine learning
models as a data scientist.
Senior Big Data Engineer at Glassdoor
Grace Chang
Mariano is an engineer with more than 15 years of
experience with the JVM. He enjoys working with
and exploring a variety of big data technologies. He
is an avid and prolific open-source contributor.
Principal Data Engineer at Jellyvision
Mariano Gonzalez
Most importantly, we are just two people trying to learn about and
share big data technologies and approaches.
Agenda
What are we going to talk about?
● Clarification of Assumptions
● Sharing of Motivations
● Discussion on Data Lakes
● Challenges Description
● Inspiration Explanation
● Description of Solution
● Demo of Solution
The Goal
Let's start from the beginning: What are we trying to achieve?
Different Types and Formats of Data → Data Pipelines → Data Storage → User
Ingest, process, and surface large amounts of data in an accessible way.
Our Motivation
Why are we talking about this?
We have a complicated relationship with infrastructure.
We observed the strengths and weaknesses of each of the implementations.
We Have Tried Three Different Spark Infrastructure
Implementations
We tried out the popular solutions and observed the
pros and cons of the technologies used.
We Have Tried Both Managed and
Unmanaged Solutions
We searched for new, elegant ways to set up Spark infrastructure on a data lake.
Data Lake Introduction
Where did the term come from?
The concept of a data lake has been
around for a while.
The term "data lake" was first introduced in 2010 by James Dixon, CTO at Pentaho.
"If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
James Dixon, CTO of Pentaho
Characteristics of a Data Lake
What makes up a data lake?
Centralizing data brings a number of benefits, including being easier to govern and manage, as well as making it easier to discover heterogeneous data sets non-disruptively.
Consolidation
Extending the architecture to different workloads is not difficult.
Agility
Cloud object storage services provide virtually unlimited space at very low cost.
Collect and Store All Data at Any Scale
Having a centralized data lake makes it easier to keep track of what data you have, who has access to it, what type of data you are storing, and what it's being used for.
Locate, Curate, and Secure Data
Data Lake vs Data Warehouse
What are the differences?
Data Stored in Its Native Format
Flexibility When Accessing Data
A data lake is not a direct replacement for a data warehouse; they are complementary technologies that serve different use cases with some overlap.
● Data can be loaded faster and accessed more quickly since it does not need to go through an initial transformation process.
● For traditional relational databases, data would need to be processed and manipulated before being stored.
● Data scientists and engineers can access data faster than would be possible in a traditional BI architecture.
Data Lake vs Data Warehouse cont.
What are the differences?
Schema on Read
● Traditional data warehouses employ Schema-on-Write.
● This requires an upfront data modeling exercise to define the schema for the data.
● Schema-on-Read allows the schema to be developed and tailored on a case-by-case basis.
● The schema is developed and projected onto the data sets required for a particular use case.
● This means that the data required is processed as needed.
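Schema-on-Read can be sketched in a few lines of plain Python (a hypothetical illustration, not code from our repo): raw JSON lands untouched, and each use case projects only the fields it needs, with defaults, at read time.

```python
import json

# Raw records land in the lake as-is: no upfront modeling (Schema-on-Write
# would have rejected or transformed the second record for its missing field).
raw_records = [
    '{"user": "alice", "text": "spark on k8s!", "lang": "en", "retweets": 3}',
    '{"user": "bob", "text": "hola", "lang": "es"}',
]

def read_with_schema(records, schema):
    """Project only the fields a use case needs, applying defaults at read time."""
    return [
        {field: json.loads(line).get(field, default)
         for field, default in schema.items()}
        for line in records
    ]

# Two use cases project different schemas over the same raw data.
engagement_schema = {"user": None, "retweets": 0}
language_schema = {"user": None, "lang": "und"}

engagement = read_with_schema(raw_records, engagement_schema)
languages = read_with_schema(raw_records, language_schema)
print(engagement)  # bob's missing "retweets" defaults to 0 at read time
```

Neither schema existed when the data was written; each was projected onto the same raw records only when a use case needed it.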
Avoiding the Data Swamp
How do we navigate challenges?
Although a data lake is a great
solution to manage data in a modern
data-driven environment, it is not
without its significant challenges.
"We see customers creating big data graveyards, dumping everything into the data lake and hoping to do something with it down the road. But then they just lose track of what's there. The main challenge is not creating a data lake but taking advantage of the opportunities it presents."
Sean Martin, CTO of Cambridge Semantics
Data Lake Governance Challenges
What should we be aware of?
● Data Lineage: Where did the data come from and what has happened to it?
● Data Quality: Is the data accurate and fit for its purposes?
● Data Security: Is the data protected from unauthorized access?
● Data Catalog: What data do you have and where is it stored?
The Main Challenges
How do we balance and address these?
1. Operational Complexity of Maintaining a Data Lake Architecture
2. Static Architecture Resulting in Unextendable Pipelines
3. User Complexity in Accessing Large and Fragmented Data
Operational Complexity of Maintaining Data Lake Architecture
Some Cluster Management Solutions: YARN and Spark Standalone Clusters
Some Challenges:
● Cluster resizing difficulties
● Forced to scale compute and storage at the same time
● Often required a vendor-specific bundle
● Lack of dynamic resource allocation
● Security setup labyrinth (AD, Kerberos, Centrify)
● (Spark Standalone) Difficulty maintaining a cluster with different runtimes
● (YARN) Exorbitant amount of configuration (massive XML files)
Static Architecture Resulting in Unextendable Pipelines
Difficult to evolve an architecture
initially designed for one type of data
ingestion
● e.g. Adding a streaming architecture component to a batch architecture is involved
User Complexity in Accessing Large and Fragmented Data
● Friction between users (data scientists, analysts, etc.) and convenient data access patterns
● Data from different sources can be hard to merge together
● Often limited to one data type (e.g. can only ingest Avro)
Inspiration
How did we get to our solution?
Medium article on Data Infrastructure at Airbnb (2016)
Databricks Delta Lake (2019)
Solution
How did we get to our solution?
A Central Place Where Engineers and Data Scientists Can Collaborate and Run Diverse
Workflows
● K8s to do cluster management/scaling
● Spark to do data transformations
● Managed services
● Bronze/Silver/Gold pipeline organization
Solution: Storage
How did we get to our solution?
Most data lake implementations use cloud object storage as the underlying storage
technology. It is recommended for the following reasons:
Durable: You can typically expect eleven 9s of durability.
Scalable: Object storage is particularly well suited to storing vast amounts of
unstructured data, and storage capacity is virtually unlimited.
Affordable: You can store data for approximately $0.01 per GB/Month.
Secure: Granular access down to the object level.
Integrated: Most processing technologies support object storage as both a source
and a data sink.
Can it be a managed service? YES!
Solution: Compute
How did we get to our solution?
K8s has positioned itself as the de facto container orchestrator, and Spark
continues to be the tool of choice for running big data workloads. So why
don't we run Spark on K8s?
Extensible: No extra infrastructure needed to run your data workloads
Upgradable: Running Spark executors on K8s means there is no longer a need
to upgrade the cluster; all runtimes (Scala, Python, R) are available right
from the beginning via Docker images
Can it be a managed service? YES!
Why Use Kubernetes as a Cluster Manager?
Are you using data analytical pipelines which are already
containerized?
If a typical pipeline requires a broker, Spark, a database, and some
visualization, and everything runs in containers, it makes sense to also run
Spark in containers and use K8s to manage your entire pipeline.
Resource sharing is better optimized
Instead of running your pipeline on dedicated hardware, it is more efficient
to run it on a Kubernetes cluster: resource sharing improves because not all
components of a pipeline are running all the time.
Why Use Kubernetes as a Cluster Manager? cont.
Leveraging Kubernetes Ecosystem
Spark workloads can make direct use of Kubernetes clusters for multi-tenancy
and sharing through Namespaces and Quotas, as well as administrative features
such as Pluggable Authorization and Logging.
It requires no changes or new installations on the cluster; simply create a
container image and set up the right RBAC roles for your Spark application and
you're all set.
So How Does This Actually Work?
spark-submit can be directly used to submit a Spark
application to a Kubernetes cluster. The submission
mechanism works as follows:
● Spark creates a Spark driver running within a Kubernetes pod.
● The driver creates executors, which also run within Kubernetes pods, connects to them, and executes application code.
● When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in "completed" state in the Kubernetes API until it's eventually garbage collected or manually cleaned up.
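The flow above boils down to a couple of commands: grant the driver pod RBAC permission to create executor pods, then submit. The following is a sketch based on the Spark 2.4 "Running on Kubernetes" documentation; the service account name, image, API server endpoint, and jar path are placeholders, not values from our demo repo.

```shell
# Grant the driver pod permission to create executor pods (RBAC).
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit --serviceaccount=default:spark

# Submit the application: the driver starts as a pod, then spawns executors.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar
```

After completion, `kubectl logs` on the driver pod shows the application output until the pod is garbage collected or deleted.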
Advantages of the Solution
Decoupling of compute and storage, which
means they can scale independently!
Advantages of the Solution: Compute
● Kubernetes engine is available as a managed service in all major cloud providers (AWS, GCP, Azure)
● Multiple data science and engineering workloads:
  ● Streaming / Real Time: CPU intensive
  ● Batch / Analytic: Storage intensive
● Minimize operational burden with a managed service while maximizing flexibility by taking advantage of Kubernetes
● Dynamic and lightweight scaling
● Since we are deploying Docker images, there is no need to "patch" the cluster
● Developers are responsible for updating runtimes and libraries:
  ● Scala/Java/Python/R dependencies
Advantages of the Solution: Storage
● Object storage as a managed service!
● Usage of S3 allows for a stable and unified source of data
● Option to ingest data in different format types:
  ● Structured
  ● Semi-structured
  ● Unstructured
● Real-time and historical data available without many changes
● No rollup processes to another storage technology
Our Example Implementation
Stream Data From Twitter → Save Data To Data Lake → Transform Saved Data in Data Lake → Aggregate Data in Data Lake → Visualize Data in Data Lake
Architecting a Data Lake Storage
A common misconception and potential mistake is that data lakes are one giant,
centralized bucket where everything needs to land. That is not true!
Combining Airbnb's and Databricks' ideas, we obtain the following architecture:
Bronze Bucket: raw data lands (Avro, JSON, CSV, etc.)
Silver Bucket: columnar data (Parquet, ORC)
Gold Bucket: columnar data (aggregated)
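One way to make the three-stage layout concrete is a small path-building helper that also enforces which formats belong in which bucket. This is a hypothetical sketch; the bucket names and prefixes below are illustrative, not the ones from our demo.

```python
# Hypothetical bucket names and prefixes; the demo uses its own S3 buckets.
STAGES = {
    "bronze": {"formats": ("avro", "json", "csv"), "prefix": "s3://lake-bronze/raw"},
    "silver": {"formats": ("parquet", "orc"), "prefix": "s3://lake-silver/columnar"},
    "gold":   {"formats": ("parquet",), "prefix": "s3://lake-gold/aggregated"},
}

def lake_path(stage, dataset, fmt):
    """Build an object key, rejecting formats that don't belong in a stage."""
    cfg = STAGES[stage]
    if fmt not in cfg["formats"]:
        raise ValueError(f"{fmt} not allowed in the {stage} bucket")
    return f"{cfg['prefix']}/{dataset}.{fmt}"

print(lake_path("silver", "tweets/2019-09-18", "parquet"))
# s3://lake-silver/columnar/tweets/2019-09-18.parquet
```

Centralizing the stage-to-format rules in one place keeps raw formats out of the silver and gold buckets, which is exactly the discipline that avoids the "data swamp" discussed earlier.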
Finally, time for some code!
bit.ly/spark-k8s-code
Bronze Pipeline
1. Tweets are streamed into the producer (twitter_producer.py, a pod in Kubernetes).
2. The producer sends the data over TCP to the consumer (spark_consumer.py, a Spark streaming job in Kubernetes).
3. A JSON file is appended to the S3 bucket (Bronze).
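A minimal stand-in for the producer side of this pipeline (hypothetical code, not the actual twitter_producer.py) streams newline-delimited JSON over TCP, which is the shape a Spark socket stream expects:

```python
import json
import socket
import threading

def serve_tweets(tweets, host="127.0.0.1"):
    """Stream tweets as newline-delimited JSON over TCP; returns the bound port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))          # port 0: let the OS pick a free port
    srv.listen(1)

    def handle():
        conn, _ = srv.accept()   # wait for the single consumer
        with conn:
            for tweet in tweets:
                conn.sendall((json.dumps(tweet) + "\n").encode())
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return srv.getsockname()[1]

# A consumer (the Spark socket stream, in the real pipeline) connects
# and reads one JSON document per line.
port = serve_tweets([{"user": "alice", "text": "spark on k8s"}])
with socket.create_connection(("127.0.0.1", port)) as client:
    line = client.makefile().readline()
print(json.loads(line)["user"])  # alice
```

The newline framing is what lets the consumer split the stream back into individual tweets before appending them to the bronze bucket.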
Silver Pipeline
1. The transformer (a Spark streaming job in Kubernetes) pulls the JSON file from the S3 bucket (Bronze).
2. It converts the data to Parquet and saves it to the silver bucket.
3. A Parquet file is written to the S3 bucket (Silver).
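Conceptually, the silver step pivots row-oriented JSON into a columnar layout. The sketch below fakes that with a plain dict-of-lists so it stays dependency-free; the real job would use Spark's Parquet writer:

```python
import json

def to_columnar(json_lines):
    """Pivot row-oriented JSON lines into a column-oriented dict of lists."""
    rows = [json.loads(line) for line in json_lines]
    fields = sorted({field for row in rows for field in row})
    return {field: [row.get(field) for row in rows] for field in fields}

bronze = ['{"user": "alice", "lang": "en"}', '{"user": "bob", "lang": "es"}']
silver = to_columnar(bronze)
print(silver["user"])  # ['alice', 'bob'] -- one list ("column") per field
```

Storing one contiguous list per field is why columnar formats like Parquet make the gold-stage aggregations cheap: a query touches only the columns it needs.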
Gold Pipeline
1. The aggregator (a Spark streaming job in Kubernetes) pulls the Parquet file from the S3 bucket (Silver).
2. It aggregates the data and saves the result to the gold bucket.
3. A Parquet file containing the aggregated data is written to the S3 bucket (Gold).
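The gold step is a plain group-and-count over one column of the silver data; a dependency-free sketch (the real job would use Spark aggregations):

```python
from collections import Counter

def aggregate(columns, key):
    """Group-and-count over one column of the silver (columnar) data."""
    return Counter(columns[key])

# Columnar silver data, as produced by the previous stage.
silver = {"lang": ["en", "es", "en"], "user": ["alice", "bob", "carol"]}
gold = aggregate(silver, "lang")
print(dict(gold))  # {'en': 2, 'es': 1}
```

Only the pre-aggregated result lands in the gold bucket, so the visualization layer never has to scan the raw or silver data.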
Visualization
1. Zeppelin in Kubernetes reads the Parquet file of aggregated data from the S3 bucket (Gold).
2. The user queries the data through Zeppelin.
Future Work
What about all the management / monitoring / scheduling for Spark jobs?
● State
● Metrics
● Retries
● Logs
Thankfully Google announced the Spark Kubernetes Operator, which will take care of all that (and more).
Spark K8s operator Roadmap: GoogleCloudPlatform/spark-on-k8s-operator/issues/338
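With the operator, a job becomes a declarative Kubernetes resource instead of a spark-submit invocation. The sketch below follows the operator's v1beta2 SparkApplication API; the image and file paths are placeholders, and the values are illustrative, not from our demo.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: <your-spark-image>              # placeholder
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar
  sparkVersion: "2.4.3"
  restartPolicy:
    type: OnFailure                      # the operator handles retries,
    onFailureRetries: 3                  # one of the gaps listed above
    onFailureRetryInterval: 10
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: "512m"
```

Because the job is now a Kubernetes object, its state, logs, and retries are visible through ordinary kubectl tooling.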