Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
- Chunxu Tang (Research Scientist, @Alluxio)
In this session, cloud optimization specialists Chunxu and Siyuan break down the challenges and present a fresh architecture designed to optimize I/O across the data pipeline, ensuring GPUs function at peak performance. The integrated solution of PyTorch/Ray + Alluxio + S3 offers a promising way forward, and the speakers delve deep into its practical applications. Attendees will not only gain theoretical insights but also see hands-on instructions and demonstrations of deploying this architecture in Kubernetes, tailored for TensorFlow/PyTorch/Ray workloads in the public cloud.
5. Hybrid/Multi-Cloud ML Platforms
[Architecture diagram: an online ML platform (serving cluster) in DC/Cloud A exchanges models and training data with an offline training platform (training cluster) in DC/Cloud B.]
6. Challenges
● I/O Bottlenecks
● Performance
○ Significant latency for remote data retrieval
○ Repeated data retrieval
● Cost
○ High expenses incurred from remote storage access
○ Underutilization of GPU resources
7. Benefits of Data Locality
● Performance Gain
○ Faster access to your data compared to remote storage
○ Less time spent on data-intensive applications
● Cost Saving
○ Fewer API calls to cloud storage (data & metadata)
○ Higher utilization of GPU
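The cost-saving point above can be illustrated with a toy sketch. The class names (`RemoteStore`, `CachedStore`) are hypothetical stand-ins, not Alluxio APIs: a local cache absorbs repeated reads, so each object costs only one remote API call no matter how many epochs re-read it.

```python
class RemoteStore:
    """Stands in for cloud object storage (e.g. S3)."""
    def __init__(self, objects):
        self.objects = objects
        self.api_calls = 0  # count of billable GET requests

    def get(self, key):
        self.api_calls += 1
        return self.objects[key]


class CachedStore:
    """Stands in for a local cache layer in front of remote storage."""
    def __init__(self, remote):
        self.remote = remote
        self.cache = {}

    def get(self, key):
        if key not in self.cache:          # cache miss: fetch once from remote
            self.cache[key] = self.remote.get(key)
        return self.cache[key]             # cache hit: no remote API call


remote = RemoteStore({f"img{i}": b"..." for i in range(100)})
store = CachedStore(remote)

# Three "epochs" over the same 100-object dataset.
for epoch in range(3):
    for i in range(100):
        store.get(f"img{i}")

print(remote.api_calls)  # 100 remote calls instead of 300
```

Without the cache, three epochs would issue 300 remote GETs; with it, only the first epoch touches remote storage, which is the "fewer API calls" saving above.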
8. Solutions
● Read from remote storage (no locality)
○ Pros: Easy to set up
○ Cons: Performance and cost issues due to I/O bottlenecks
● Copy data to local storage before training
○ Pros: Data is local; easy to set up
○ Cons: Hard to manage; limited cache space
● Local cache layer (S3FS-FUSE, Alluxio-FUSE)
○ Pros: Data is local; convenient interface
○ Cons: Hard to manage; limited cache space
● Distributed data access layer
○ Pros: Data is local or adjacent; central data management; scalable cache space
○ Cons: Hard to build
9. Unified Data Access for ML Platforms
[Architecture diagram: the same hybrid/multi-cloud setup as slide 5, with Alluxio deployed as a unified data access layer in front of both the serving cluster (DC/Cloud A) and the training cluster (DC/Cloud B), keeping models and training data close to each platform.]
11. Integration with PyTorch Training (Alluxio)
[Architecture diagram: a training node runs PyTorch alongside an Alluxio client. The cache client gets task info from PyTorch, gets cluster info from the service registry, finds the target cache worker(s) using an affinity block-location policy with client-side load balancing, executes the task on the cache cluster, falls back to an under-storage task on a cache miss, and sends the result back to PyTorch.]
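The "find worker(s)" step, an affinity block-location policy with client-side load balancing, can be sketched with a minimal consistent-hashing ring. This is a common technique for such policies; the code below is an illustrative assumption, not Alluxio's actual implementation:

```python
import hashlib
from bisect import bisect

def _hash(s: str) -> int:
    # Stable hash (Python's built-in hash() is salted per process)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring mapping block IDs to cache workers.
    Every client can compute the same mapping locally, which gives
    client-side load balancing with cache affinity (no central lookup)."""
    def __init__(self, workers, vnodes=100):
        # Each worker gets `vnodes` virtual positions to smooth the spread.
        self.ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def worker_for(self, block_id: str) -> str:
        # First ring position at or after the block's hash, wrapping around.
        idx = bisect(self.keys, _hash(block_id)) % len(self.ring)
        return self.ring[idx][1]

workers = ["worker-1", "worker-2", "worker-3"]
ring = HashRing(workers)

# Same block -> same worker (cache affinity); blocks spread across workers.
assert ring.worker_for("block-42") == ring.worker_for("block-42")
print({w: sum(ring.worker_for(f"block-{i}") == w for i in range(1000))
       for w in workers})
```

Because every client computes the same mapping from the same cluster membership (fetched once from the service registry), repeated reads of a block land on the worker that already cached it.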
13. Training Directly from Storage (S3-FUSE)
- Over 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
[Chart: GPU utilization improvement]
14. Training with Alluxio-FUSE
- Reduced the DataLoader share of total time from 82% to 1% (82x)
- Increased the GPU utilization rate from 17% to 93% (5x)
[Chart: GPU utilization improvement]
16. Ray is Designed for Distributed Cloud Training
● Ray uses a distributed scheduler to dispatch training jobs to available workers (CPUs/GPUs)
● Enables seamless horizontal scaling of training jobs across multiple nodes
● Provides a streaming data abstraction for parallel and distributed preprocessing in ML training
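The streaming data abstraction above can be pictured with a plain-Python sketch, stdlib only. Ray's real `ray.data` API pipelines blocks across a cluster, which this single-process toy does not attempt; it only shows the idea of preprocessing blocks in parallel while the consumer streams through them:

```python
from concurrent.futures import ThreadPoolExecutor

def read_blocks(n_blocks):
    """Pretend to stream dataset blocks from storage."""
    for i in range(n_blocks):
        yield list(range(i * 4, i * 4 + 4))  # one block of 4 records

def preprocess(block):
    """CPU preprocessing applied to each block (e.g. decode/augment)."""
    return [x * x for x in block]

# Blocks are preprocessed in parallel while the consumer iterates,
# mirroring how a streaming dataloader overlaps I/O, CPU, and GPU work.
with ThreadPoolExecutor(max_workers=4) as pool:
    for block in pool.map(preprocess, read_blocks(8)):
        pass  # a training step would consume `block` here
```

`pool.map` preserves block order while overlapping the preprocessing of later blocks with the consumption of earlier ones, which is the core of the streaming model.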
17. Performance & Cost Issues Reported by the Ray Community
● You might load the entire dataset again for every epoch
● You cannot automatically cache the hottest data across multiple training jobs
● You might suffer a cold start every time
18. Alluxio - Ray Integration
[Architecture diagram: the Ray dataloader performs PyArrow dataset loading through the fsspec Alluxio implementation (alluxiofs), backed by the Alluxio Python client. Alluxio workers, each fronted by a REST API server, register themselves in etcd; the client gets worker addresses from etcd and reads data from the workers.]
19. Alluxiofs - fsspec with Ray Usage

# Import fsspec & the alluxiofs fsspec implementation
import fsspec
from alluxiofs import AlluxioFileSystem

# Register alluxiofs in place of s3fs, so the original S3 URL works unchanged
fsspec.register_implementation("s3", AlluxioFileSystem, clobber=True)

# Create the Alluxio filesystem
alluxio = fsspec.filesystem("s3", etcd_host=args.etcd_host)

# Ray reads data from Alluxio using the original S3 URL
ds = ray.data.read_images("s3://ai-ref-arch/imagenet-full/train",
                          filesystem=alluxio)

See more in: https://github.com/fsspec/alluxiofs
20. Alluxio+Ray Benchmark – Small Files
● Dataset
○ 130 GB ImageNet dataset
● Process Settings
○ 4 training workers
○ 9 reading processes
● Active Object Store Memory
○ 400-500 MiB
21. Alluxio+Ray Benchmark – Large Parquet Files
● Dataset
○ 200 MiB files, adding up to 60 GiB
● Process Settings
○ 28 training workers
○ 28 reading processes
● Active Object Store Memory
○ 20-30 GiB