SlideShare a Scribd company logo
1 of 30
Download to read offline
Achieving Data Locality
Using Spark + Alluxio
with and without K8s
Bin Fan, Jiacheng Liu
2019/12/17 Office Hour
Questions at https://alluxio.io/slack
Speakers
• Bin Fan
○ Founding engineer & Alluxio PMC @ Alluxio
○ Email: binfan@alluxio.com
• Jiacheng Liu
○ Core maintainer @ Alluxio
○ Email: jiacheng@alluxio.com
For Q&A, join https://alluxio.io/slack
Agenda
• Alluxio Overview
• Spark + Alluxio Data Locality without K8s
• K8s Overview
• Spark + Alluxio Data Locality with K8s
• Limitations and Future Work
Alluxio Overview
The Alluxio Story
Originated as Tachyon project, at UC Berkley AMPLab by
Ph.D. student Haoyuan (H.Y.) Li - now Alluxio CTO2013
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
20192018
2019
Top 10 Big Data
2019
Top 10 Cloud Software
Fast-growing Open Source Community
4000+ Github Stars1000+ Contributors
Join the community on Slack
(FAQ for this office hour)
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Java File API HDFS Interface S3 Interface REST APIPOSIX Interface
HDFS Driver GCS Driver S3 Driver Azure Driver
Why Alluxio in K8s?
Elastic Data for Elastic Compute with Tight Locality
● Improve Data locality
○ Big data analytics or ML
○ Cache data close to compute
● Enable high-speed data sharing across jobs
○ A closer staging storage layer for
compute jobs
● Provide Unification of persistent storage
○ Same data abstraction across different
storage
Host
Alluxio
Spark
AlluxioAlluxio
Spark
Alluxio
SparkSpark
or
Read more at https://www.alluxio.io/blog/kubernetes-alluxio-and-the-disaggregated-analytics-stack/
Data Challenges
● Data Copy
○ Multiple Copies of Data
● A Stateful Application on K8s can be hard
○ Data Migration and Rebalancing on elasticity
● Changing applications for a new storage system can be
hard
○ Lack of familiar API
○ Tuning can be challenging
Spark + Alluxio Data Locality
Without K8s
Spark Workflow (Without Alluxio)
Cluster Manager
(i.e. YARN/Mesos)
Application
Spark
Context s3://data/
Worker Node
1) run Spark job
Worker Node
Spark
Executor
3) launch executors
and launch tasks
4) access data and compute
2) allocate executors
Takeaway: Remote data, no locality
Step 2: Schedule compute to data cache location
Cluster Manager
(i.e. YARN/Mesos)
Application
Spark
Context
Alluxio
Client
Alluxio
Masters
HostA: 196.0.0.7
Alluxio
Worker
HostB: 196.0.0.8
Alluxio
Worker
2.2) allocate on [HostA]
block1
block2
2.1) where is
block1?
2.1) block1 is
served at
[HostA]
● Alluxio client implements HDFS compatible API with
block location info
● Alluxio masters keep track and serve a list of worker
nodes that currently have the cache.
Step 4: Find local Alluxio Worker and Efficient Data
Exchange
s3://data/
Spark Executor
Alluxio
WorkerAlluxio
Client
HostB: 196.0.0.8
Alluxio
Worker
HostA: 196.0.0.7
block1
Alluxio
Masters
block1?
[HostA]
Efficient I/O via local fs
(e.g., /mnt/ramdisk/) or
local domain socket
(/opt/domain)
● Spark Executor finds local Alluxio Worker by
hostname comparison
● Spark Executor talks to local Alluxio Worker
using either short-circuit access (via local
FS) or local domain socket
Recap: Spark Architecture with Alluxio
Cluster Manager
(i.e. YARN/Mesos)
Application
Spark
Context s3://data/
Alluxio
Client
Alluxio
Masters
Worker Node
Spark
Executor
Alluxio
Worker
Alluxio
Client
Worker Node
Alluxio
Worker
1) run spark job
2.2) allocate executors
4) access Alluxio for data
and compute
2.1) talk to Alluxio for
where the data cache is
Step 2: Help Spark schedule compute to data cache
Step 4: Enable Spark to access local Alluxio cache
3) launch executors
and launch tasks
Kubernetes Overview
Kubernetes (k8s) is...
“an open-source container-orchestration system for automating
application deployment, scaling, and management.”
Container Orchestration Platform
● Platform agnostic cluster management
● Service discovery and load balancing
● Storage orchestration
● Horizontal scaling
● Self-monitoring and self-healing
● Automated rollouts and rollbacks
K8s Terms
● Node
○ A VM or physical machine
● Container
○ Docker Image = Lightweight OS and application execution environment
○ Container = Image once running on Docker Engine
● Pod
○ Schedulable unit of one or more containers running together
○ Containers share storage and network etc
● Controller
○ Controls the desired state such as copies of a Pod
● DaemonSet
○ A Controller that ensures each Node has only one such Pod
● Volume
○ A directory accessible to the Containers in a Pod
● Persistent Volume
○ A storage resource with lifecycle independent of Pods
Spark + Alluxio Data Locality
In K8s environment
Spark on K8s architecture
● Spark 2.3 added native K8s support
● Spark application runs in K8s as a custom K8s Controller
● spark-submit talks to K8s API Server to create Driver
● Spark Driver creates Executor Pods
● When the application completes, the Executor Pods
terminate and are cleaned up
● The Driver Pod persists logs and remains in
“COMPLETED” state
Deployment model: Co-location
● Co-locate Spark Executor Pod with Alluxio Worker Pod
● Lifecycle
○ Spark Executors are ephemeral
○ Alluxio Worker serves cache across all Spark jobs
● Deployment order:
○ Deploy Alluxio cluster first (masters+workers)
○ 1 Alluxio Worker on each Node, by DaemonSet
○ Spark-submit to launch Spark Driver + Executors
Worker Node
Spark Executor
Alluxio
WorkerAlluxio
Client
Worker Node
Alluxio
Worker PodAlluxio
Client
Spark Executor
Pod
Co-location
No K8s
Co-location
K8s
Challenge 1: Spark Driver fails to allocate Executors correctly
Application
Spark
Context
Alluxio
Client
allocate on [AlluxioWorkerPodA]?
block1
block2
block1? [AlluxioWorkerPodA]
HostA: 196.0.0.7
AlluxioWorkerPodA:
10.0.1.1
HostB: 196.0.0.8
AlluxioWorkerPodB:
10.0.1.2
Alluxio
Master
Pods
Some Node
Spark
Driver
Pod
Some Node
Problem: Alluxio Worker Pods have different addresses from Nodes
Current Solution: Launch Alluxio Workers with hostNetwork
Application
Spark
Context
Alluxio
Client
allocate on [HostA]
block1
block2
block1? [HostA]
HostA: 196.0.0.7
HostA: 196.0.0.7
HostB: 196.0.0.8
HostB: 196.0.0.8
Alluxio
Master
Pods
Some Node
Spark
Driver
Pod
Some Node
hostNetwork: true
hostNetwork: true
Network
adapter
Network
adapter
Challenge 2: Spark Executor failed to identify if Alluxio
Pod is host-local
Problem:
● Spark Executor Pod has different hostName
○ Hostname HostA != SparkExecPod1
○ Local Alluxio Worker not identified!
HostA: 196.0.0.7
Network
adapter
block1
Alluxio
Client
SparkExecPod1:
10.0.1,1
Alluxio
Master
block1? [HostA]
Challenge 3: Spark Executor failed to find domain socket
Problem:
● Pods don’t share the File System
● Domain socket /opt/domain is in Alluxio Worker Pod
HostA: 196.0.0.7
Network
adapter
block1
Alluxio
Client
SparkExecPod1:
10.0.1,1
Alluxio
Master
block1? [HostA]
/opt/domain/
Current Solution: Use UUID to identify Alluxio Worker
and share a hostPath Volume between Pods
HostA: 196.0.0.7
block1
Alluxio
Client
SparkExecPod1:
10.0.1,1
Alluxio
Master
block1? [HostA:
UUIDABC]
/opt/domain/UUIDABC
UUID: UUIDABC
● Each Alluxio Worker has a UUID
● Share domain socket by a hostPath Volume
● Alluxio Client finds local worker’s domain socket by
finding file matching worker UUID.
● Worker domain socket path
/opt/domain/d -> /opt/domain/UUIDABC
Enabled by
alluxio.worker.data.server.domain.socket.address
● Mount hostPath Volume to Spark Executor
Enabled by Spark 2.4 (PR)
Recap: Spark + Alluxio Architecture on K8s
Application
Spark
Context
Alluxio
Client
Spark
Executor
Pod
Alluxio
Worker
Pod
Alluxio
Client
Alluxio
Worker
Pod
Worker Node
Worker Node
Spark
Driver
Pod
Alluxio
Master
Pods
Some Node
Some Node
1) run spark job
2.1) talk to Alluxio for
where the data is
s3://data/
2.2) create Driver
4) access Alluxio for data
and compute
Recap:
Step 2: Help Spark schedule compute to data location
Step 4: Enable Spark to access data locally
3) launch executors
and launch tasks
Limitations and future work
● Many enterprise environments have strict control on hostNetwork and
hostPath.
● Alluxio workers need hostNetwork
○ Plan: workarounds like supporting container network translation
● The domain socket file requires a hostPath volume
○ Plan: workarounds like using Local Persistent Volumes
● Feedback/collaboration are welcome!
Additional Resources
Spark Native K8s Support
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://kubernetes.io/blog/2018/03/apache-spark-23-with-native-kubernetes/
Spark + Alluxio Advanced Data Locality
https://www.alluxio.io/blog/top-10-tips-for-making-the-spark-alluxio-stack-blazing-fast/
Running Alluxio in K8s
https://docs.alluxio.io/os/user/stable/en/deploy/Running-Alluxio-On-Kubernetes.html
Running Spark with Alluxio in K8s
https://docs.alluxio.io/os/user/stable/en/compute/Spark-On-Kubernetes.html
How Alluxio Helps Analytics Stack in Kubernetes
https://www.alluxio.io/blog/kubernetes-alluxio-and-the-disaggregated-analytics-stack/
zhuanlan.zhihu.com/alluxio
www.alluxio.com
info@alluxio.com
twitter.com/alluxio
linkedIn.com/alluxio
Thank you
binfan@alluxio.com
jiacheng@alluxio.com
https://alluxio.io/slack

More Related Content

What's hot

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RACThe Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RACMarkus Michalewicz
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Oracle RAC - New Generation
Oracle RAC - New GenerationOracle RAC - New Generation
Oracle RAC - New GenerationAnil Nair
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesYoshinori Matsunobu
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL AdministrationEDB
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsAlexander Korotkov
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata MigrationsTanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata MigrationsTanel Poder
 
EDB Postgres DBA Best Practices
EDB Postgres DBA Best PracticesEDB Postgres DBA Best Practices
EDB Postgres DBA Best PracticesEDB
 
HA, Scalability, DR & MAA in Oracle Database 21c - Overview
HA, Scalability, DR & MAA in Oracle Database 21c - OverviewHA, Scalability, DR & MAA in Oracle Database 21c - Overview
HA, Scalability, DR & MAA in Oracle Database 21c - OverviewMarkus Michalewicz
 
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniZalando Technology
 
How to Use EXAchk Effectively to Manage Exadata Environments
How to Use EXAchk Effectively to Manage Exadata EnvironmentsHow to Use EXAchk Effectively to Manage Exadata Environments
How to Use EXAchk Effectively to Manage Exadata EnvironmentsSandesh Rao
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 

What's hot (20)

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RACThe Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Oracle RAC - New Generation
Oracle RAC - New GenerationOracle RAC - New Generation
Oracle RAC - New Generation
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata MigrationsTanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
 
EDB Postgres DBA Best Practices
EDB Postgres DBA Best PracticesEDB Postgres DBA Best Practices
EDB Postgres DBA Best Practices
 
HA, Scalability, DR & MAA in Oracle Database 21c - Overview
HA, Scalability, DR & MAA in Oracle Database 21c - OverviewHA, Scalability, DR & MAA in Oracle Database 21c - Overview
HA, Scalability, DR & MAA in Oracle Database 21c - Overview
 
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando Patroni
 
How to Use EXAchk Effectively to Manage Exadata Environments
How to Use EXAchk Effectively to Manage Exadata EnvironmentsHow to Use EXAchk Effectively to Manage Exadata Environments
How to Use EXAchk Effectively to Manage Exadata Environments
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 

Similar to Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes...
CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes...CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes...
CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes...Alluxio, Inc.
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioAlluxio, Inc.
 
Running Spark & Alluxio in Kubernetes
Running Spark & Alluxio in KubernetesRunning Spark & Alluxio in Kubernetes
Running Spark & Alluxio in KubernetesAlluxio, Inc.
 
Accelerating Spark with Kubernetes
Accelerating Spark with KubernetesAccelerating Spark with Kubernetes
Accelerating Spark with KubernetesAlluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio, Inc.
 
Pydata 2020 containers meetup
Pydata  2020 containers meetup Pydata  2020 containers meetup
Pydata 2020 containers meetup Walid Shaari
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudAlluxio, Inc.
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio, Inc.
 
Enabling Ultra-fast Presto in the Cloud with Alluxio
Enabling Ultra-fast Presto in the Cloud with AlluxioEnabling Ultra-fast Presto in the Cloud with Alluxio
Enabling Ultra-fast Presto in the Cloud with AlluxioAlluxio, Inc.
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On DemandBogdan Kyryliuk
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIAlluxio, Inc.
 
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...Laure Vergeron
 
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017Alluxio, Inc.
 
Spark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene PangSpark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene PangSpark Summit
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Alluxio, Inc.
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
 
Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
Building a Cloud Native Stack with EMR Spark, Alluxio, and S3Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
Building a Cloud Native Stack with EMR Spark, Alluxio, and S3Alluxio, Inc.
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
What's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetWhat's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetMark Voelker
 

Similar to Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio (20)

CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes...
CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes...CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes...
CNCF Member Webinar: Improving Data Locality for Analytics Jobs on Kubernetes...
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
 
Running Spark & Alluxio in Kubernetes
Running Spark & Alluxio in KubernetesRunning Spark & Alluxio in Kubernetes
Running Spark & Alluxio in Kubernetes
 
Accelerating Spark with Kubernetes
Accelerating Spark with KubernetesAccelerating Spark with Kubernetes
Accelerating Spark with Kubernetes
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Pydata 2020 containers meetup
Pydata  2020 containers meetup Pydata  2020 containers meetup
Pydata 2020 containers meetup
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any Cloud
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
 
Enabling Ultra-fast Presto in the Cloud with Alluxio
Enabling Ultra-fast Presto in the Cloud with AlluxioEnabling Ultra-fast Presto in the Cloud with Alluxio
Enabling Ultra-fast Presto in the Cloud with Alluxio
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On Demand
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AI
 
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
 
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
 
Spark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene PangSpark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene Pang
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
 
Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
Building a Cloud Native Stack with EMR Spark, Alluxio, and S3Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
What's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetWhat's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with Puppet
 

More from Alluxio, Inc.

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioAlluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionAlluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeAlluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudAlluxio, Inc.
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderAlluxio, Inc.
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionAlluxio, Inc.
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...Alluxio, Inc.
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAlluxio, Inc.
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...Alluxio, Inc.
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...Alluxio, Inc.
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAlluxio, Inc.
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAlluxio, Inc.
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio, Inc.
 

More from Alluxio, Inc. (20)

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
 

Recently uploaded

Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 

Recently uploaded (20)

Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 

Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

  • 1. Achieving Data Locality Using Spark + Alluxio with and without K8s Bin Fan, Jiacheng Liu 2019/12/17 Office Hour Questions at https://alluxio.io/slack
  • 2. Speakers • Bin Fan ○ Founding engineer & Alluxio PMC @ Alluxio ○ Email: binfan@alluxio.com • Jiacheng Liu ○ Core maintainer @ Alluxio ○ Email: jiacheng@alluxio.com For Q&A, join https://alluxio.io/slack
  • 3. Agenda • Alluxio Overview • Spark + Alluxio Data Locality without K8s • K8s Overview • Spark + Alluxio Data Locality with K8s • Limitations and Future Work
  • 5. The Alluxio Story Originated as Tachyon project, at UC Berkley AMPLab by Ph.D. student Haoyuan (H.Y.) Li - now Alluxio CTO2013 2015 Open Source project established & company to commercialize Alluxio founded Goal: Orchestrate Data at Memory Speed for the Cloud for data driven apps such as Big Data Analytics, ML and AI. 20192018 2019 Top 10 Big Data 2019 Top 10 Cloud Software
  • 6. Fast-growing Open Source Community 4000+ Github Stars1000+ Contributors Join the community on Slack (FAQ for this office hour) alluxio.io/slack Apache 2.0 Licensed Contribute to source code github.com/alluxio/alluxio
  • 7. Alluxio is Open-Source Data Orchestration Data Orchestration for the Cloud Java File API HDFS Interface S3 Interface REST APIPOSIX Interface HDFS Driver GCS Driver S3 Driver Azure Driver
  • 8. Why Alluxio in K8s? Elastic Data for Elastic Compute with Tight Locality ● Improve Data locality ○ Big data analytics or ML ○ Cache data close to compute ● Enable high-speed data sharing across jobs ○ A closer staging storage layer for compute jobs ● Provide Unification of persistent storage ○ Same data abstraction across different storage Host Alluxio Spark AlluxioAlluxio Spark Alluxio SparkSpark or Read more at https://www.alluxio.io/blog/kubernetes-alluxio-and-the-disaggregated-analytics-stack/
  • 9. Data Challenges ● Data Copy ○ Multiple Copies of Data ● A Stateful Application on K8s can be hard ○ Data Migration and Rebalancing on elasticity ● Changing applications for a new storage system can be hard ○ Lack of familiar API ○ Tuning can be challenging
  • 10. Spark + Alluxio Data Locality Without K8s
  • 11. Spark Workflow (Without Alluxio) Cluster Manager (i.e. YARN/Mesos) Application Spark Context s3://data/ Worker Node 1) run Spark job Worker Node Spark Executor 3) launch executors and launch tasks 4) access data and compute 2) allocate executors Takeaway: Remote data, no locality
  • 12. Step 2: Schedule compute to data cache location Cluster Manager (i.e. YARN/Mesos) Application Spark Context Alluxio Client Alluxio Masters HostA: 196.0.0.7 Alluxio Worker HostB: 196.0.0.8 Alluxio Worker 2.2) allocate on [HostA] block1 block2 2.1) where is block1? 2.1) block1 is served at [HostA] ● Alluxio client implements HDFS compatible API with block location info ● Alluxio masters keep track and serve a list of worker nodes that currently have the cache.
  • 13. Step 4: Find local Alluxio Worker and Efficient Data Exchange s3://data/ Spark Executor Alluxio WorkerAlluxio Client HostB: 196.0.0.8 Alluxio Worker HostA: 196.0.0.7 block1 Alluxio Masters block1? [HostA] Efficient I/O via local fs (e.g., /mnt/ramdisk/) or local domain socket (/opt/domain) ● Spark Executor finds local Alluxio Worker by hostname comparison ● Spark Executor talks to local Alluxio Worker using either short-circuit access (via local FS) or local domain socket
  • 14. Recap: Spark Architecture with Alluxio Cluster Manager (i.e. YARN/Mesos) Application Spark Context s3://data/ Alluxio Client Alluxio Masters Worker Node Spark Executor Alluxio Worker Alluxio Client Worker Node Alluxio Worker 1) run spark job 2.2) allocate executors 4) access Alluxio for data and compute 2.1) talk to Alluxio for where the data cache is Step 2: Help Spark schedule compute to data cache Step 4: Enable Spark to access local Alluxio cache 3) launch executors and launch tasks
  • 16. Kubernetes (k8s) is... “an open-source container-orchestration system for automating application deployment, scaling, and management.”
  • 17. Container Orchestration Platform ● Platform agnostic cluster management ● Service discovery and load balancing ● Storage orchestration ● Horizontal scaling ● Self-monitoring and self-healing ● Automated rollouts and rollbacks
  • 18. K8s Terms ● Node ○ A VM or physical machine ● Container ○ Docker Image = Lightweight OS and application execution environment ○ Container = Image once running on Docker Engine ● Pod ○ Schedulable unit of one or more containers running together ○ Containers share storage and network etc ● Controller ○ Controls the desired state such as copies of a Pod ● DaemonSet ○ A Controller that ensures each Node has only one such Pod ● Volume ○ A directory accessible to the Containers in a Pod ● Persistent Volume ○ A storage resource with lifecycle independent of Pods
  • 19. Spark + Alluxio Data Locality In K8s environment
  • 20. Spark on K8s architecture ● Spark 2.3 added native K8s support ● Spark application runs in K8s as a custom K8s Controller ● spark-submit talks to K8s API Server to create Driver ● Spark Driver creates Executor Pods ● When the application completes, the Executor Pods terminate and are cleaned up ● The Driver Pod persists logs and remains in “COMPLETED” state
  • 21. Deployment model: Co-location ● Co-locate Spark Executor Pod with Alluxio Worker Pod ● Lifecycle ○ Spark Executors are ephemeral ○ Alluxio Worker serves cache across all Spark jobs ● Deployment order: ○ Deploy Alluxio cluster first (masters+workers) ○ 1 Alluxio Worker on each Node, by DaemonSet ○ Spark-submit to launch Spark Driver + Executors Worker Node Spark Executor Alluxio WorkerAlluxio Client Worker Node Alluxio Worker PodAlluxio Client Spark Executor Pod Co-location No K8s Co-location K8s
  • 22. Challenge 1: Spark Driver fails to allocate Executors correctly Application Spark Context Alluxio Client allocate on [AlluxioWorkerPodA]? block1 block2 block1? [AlluxioWorkerPodA] HostA: 196.0.0.7 AlluxioWorkerPodA: 10.0.1.1 HostB: 196.0.0.8 AlluxioWorkerPodB: 10.0.1.2 Alluxio Master Pods Some Node Spark Driver Pod Some Node Problem: Alluxio Worker Pods have different addresses from Nodes
  • 23. Current Solution: Launch Alluxio Workers with hostNetwork Application Spark Context Alluxio Client allocate on [HostA] block1 block2 block1? [HostA] HostA: 196.0.0.7 HostA: 196.0.0.7 HostB: 196.0.0.8 HostB: 196.0.0.8 Alluxio Master Pods Some Node Spark Driver Pod Some Node hostNetwork: true hostNetwork: true Network adapter Network adapter
  • 24. Challenge 2: Spark Executor failed to identify if Alluxio Pod is host-local Problem: ● Spark Executor Pod has different hostName ○ Hostname HostA != SparkExecPod1 ○ Local Alluxio Worker not identified! HostA: 196.0.0.7 Network adapter block1 Alluxio Client SparkExecPod1: 10.0.1,1 Alluxio Master block1? [HostA]
  • 25. Challenge 3: Spark Executor failed to find domain socket Problem: ● Pods don’t share the File System ● Domain socket /opt/domain is in Alluxio Worker Pod HostA: 196.0.0.7 Network adapter block1 Alluxio Client SparkExecPod1: 10.0.1,1 Alluxio Master block1? [HostA] /opt/domain/
  • 26. Current Solution: Use UUID to identify Alluxio Worker and share a hostPath Volume between Pods HostA: 196.0.0.7 block1 Alluxio Client SparkExecPod1: 10.0.1,1 Alluxio Master block1? [HostA: UUIDABC] /opt/domain/UUIDABC UUID: UUIDABC ● Each Alluxio Worker has a UUID ● Share domain socket by a hostPath Volume ● Alluxio Client finds local worker’s domain socket by finding file matching worker UUID. ● Worker domain socket path /opt/domain/d -> /opt/domain/UUIDABC Enabled by alluxio.worker.data.server.domain.socket.address ● Mount hostPath Volume to Spark Executor Enabled by Spark 2.4 (PR)
  • 27. Recap: Spark + Alluxio Architecture on K8s Application Spark Context Alluxio Client Spark Executor Pod Alluxio Worker Pod Alluxio Client Alluxio Worker Pod Worker Node Worker Node Spark Driver Pod Alluxio Master Pods Some Node Some Node 1) run spark job 2.1) talk to Alluxio for where the data is s3://data/ 2.2) create Driver 4) access Alluxio for data and compute Recap: Step 2: Help Spark schedule compute to data location Step 4: Enable Spark to access data locally 3) launch executors and launch tasks
  • 28. Limitations and future work ● Many enterprise environments have strict control on hostNetwork and hostPath. ● Alluxio workers need hostNetwork ○ Plan: workarounds like supporting container network translation ● The domain socket file requires a hostPath volume ○ Plan: workarounds like using Local Persistent Volumes ● Feedback/collaboration are welcome!
  • 29. Additional Resources Spark Native K8s Support https://spark.apache.org/docs/latest/running-on-kubernetes.html https://kubernetes.io/blog/2018/03/apache-spark-23-with-native-kubernetes/ Spark + Alluxio Advanced Data Locality https://www.alluxio.io/blog/top-10-tips-for-making-the-spark-alluxio-stack-blazing-fast/ Running Alluxio in K8s https://docs.alluxio.io/os/user/stable/en/deploy/Running-Alluxio-On-Kubernetes.html Running Spark with Alluxio in K8s https://docs.alluxio.io/os/user/stable/en/compute/Spark-On-Kubernetes.html How Alluxio Helps Analytics Stack in Kubernetes https://www.alluxio.io/blog/kubernetes-alluxio-and-the-disaggregated-analytics-stack/