Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
- Chunxu Tang (Research Scientist, @Alluxio)
In this session, cloud optimization specialists Chunxu and Siyuan break down the challenges and present a fresh architecture designed to optimize I/O across the data pipeline, ensuring GPUs function at peak performance. The integrated solution of PyTorch/Ray + Alluxio + S3 offers a promising way forward, and the speakers delve deep into its practical applications. Attendees will not only gain theoretical insights but also see hands-on instructions and demonstrations of deploying this architecture in Kubernetes, tailored for TensorFlow/PyTorch/Ray workloads in the public cloud.
5. Hybrid/Multi-Cloud ML Platforms
[Architecture diagram: an online ML platform (serving cluster) in DC/Cloud A exchanges models and training data with an offline training platform (training cluster) in DC/Cloud B.]
6. Challenges
● I/O Bottlenecks
● Performance
○ Significant latency for remote data retrieval
○ Repeated data retrieval
● Cost
○ High expenses incurred from remote storage access
○ Underutilization of GPU resources
7. Benefits of Data Locality
● Performance Gain
○ Faster access to your data compared to remote storage
○ Less time spent on data-intensive applications
● Cost Saving
○ Fewer API calls to cloud storage (data & metadata)
○ Higher utilization of GPU
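The cost-saving point above can be illustrated with a toy sketch. The class names (`RemoteStore`, `CachedStore`) are hypothetical stand-ins, not Alluxio APIs: a local cache absorbs repeated reads, so each object costs only one remote API call no matter how many epochs re-read it.

```python
class RemoteStore:
    """Stands in for cloud object storage (e.g. S3)."""
    def __init__(self, objects):
        self.objects = objects
        self.api_calls = 0  # count of billable GET requests

    def get(self, key):
        self.api_calls += 1
        return self.objects[key]


class CachedStore:
    """Stands in for a local cache layer in front of remote storage."""
    def __init__(self, remote):
        self.remote = remote
        self.cache = {}

    def get(self, key):
        if key not in self.cache:          # cache miss: fetch once from remote
            self.cache[key] = self.remote.get(key)
        return self.cache[key]             # cache hit: no remote API call


remote = RemoteStore({f"img{i}": b"..." for i in range(100)})
store = CachedStore(remote)

# Three "epochs" over the same 100-object dataset.
for epoch in range(3):
    for i in range(100):
        store.get(f"img{i}")

print(remote.api_calls)  # 100 remote calls instead of 300
```

Without the cache, three epochs would issue 300 remote GETs; with it, only the first epoch touches remote storage, which is the "fewer API calls" saving above.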
8. Solutions
● Read from remote storage (no locality)
○ Pros: Easy to set up
○ Cons: Performance and cost issues due to I/O bottlenecks
● Copy data to local storage before training
○ Pros: Data is local; easy to set up
○ Cons: Hard to manage; limited cache space
● Local cache layer (S3FS-FUSE, Alluxio-FUSE)
○ Pros: Data is local; convenient interface
○ Cons: Hard to manage; limited cache space
● Distributed data access layer
○ Pros: Data is local or adjacent; central data management; scalable cache space
○ Cons: Hard to build
9. Unified Data Access for ML Platforms
[Architecture diagram: the same hybrid/multi-cloud setup as slide 5, with Alluxio deployed as a unified data access layer in front of both the serving cluster (DC/Cloud A) and the training cluster (DC/Cloud B), keeping models and training data close to each platform.]
11. Integration with PyTorch Training (Alluxio)
[Architecture diagram: a training node runs PyTorch alongside an Alluxio client. The cache client gets task info from PyTorch, gets cluster info from the service registry, finds the target cache worker(s) using an affinity block-location policy with client-side load balancing, executes the task on the cache cluster, falls back to an under-storage task on a cache miss, and sends the result back to PyTorch.]
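The "find worker(s)" step, an affinity block-location policy with client-side load balancing, can be sketched with a minimal consistent-hashing ring. This is a common technique for such policies; the code below is an illustrative assumption, not Alluxio's actual implementation:

```python
import hashlib
from bisect import bisect

def _hash(s: str) -> int:
    # Stable hash (Python's built-in hash() is salted per process)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring mapping block IDs to cache workers.
    Every client can compute the same mapping locally, which gives
    client-side load balancing with cache affinity (no central lookup)."""
    def __init__(self, workers, vnodes=100):
        # Each worker gets `vnodes` virtual positions to smooth the spread.
        self.ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def worker_for(self, block_id: str) -> str:
        # First ring position at or after the block's hash, wrapping around.
        idx = bisect(self.keys, _hash(block_id)) % len(self.ring)
        return self.ring[idx][1]

workers = ["worker-1", "worker-2", "worker-3"]
ring = HashRing(workers)

# Same block -> same worker (cache affinity); blocks spread across workers.
assert ring.worker_for("block-42") == ring.worker_for("block-42")
print({w: sum(ring.worker_for(f"block-{i}") == w for i in range(1000))
       for w in workers})
```

Because every client computes the same mapping from the same cluster membership (fetched once from the service registry), repeated reads of a block land on the worker that already cached it.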
13. Training Directly from Storage (S3-FUSE)
- Over 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
[Chart: GPU utilization improvement]
14. Training with Alluxio-FUSE
- Reduced the DataLoader share of total time from 82% to 1% (82x)
- Increased the GPU utilization rate from 17% to 93% (5x)
[Chart: GPU utilization improvement]
16. Ray is Designed for Distributed Cloud Training
● Ray uses a distributed scheduler to dispatch training jobs to available workers (CPUs/GPUs)
● Enables seamless horizontal scaling of training jobs across multiple nodes
● Provides a streaming data abstraction for parallel and distributed preprocessing in ML training
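The streaming data abstraction above can be pictured with a plain-Python sketch, stdlib only. Ray's real `ray.data` API pipelines blocks across a cluster, which this single-process toy does not attempt; it only shows the idea of preprocessing blocks in parallel while the consumer streams through them:

```python
from concurrent.futures import ThreadPoolExecutor

def read_blocks(n_blocks):
    """Pretend to stream dataset blocks from storage."""
    for i in range(n_blocks):
        yield list(range(i * 4, i * 4 + 4))  # one block of 4 records

def preprocess(block):
    """CPU preprocessing applied to each block (e.g. decode/augment)."""
    return [x * x for x in block]

# Blocks are preprocessed in parallel while the consumer iterates,
# mirroring how a streaming dataloader overlaps I/O, CPU, and GPU work.
with ThreadPoolExecutor(max_workers=4) as pool:
    for block in pool.map(preprocess, read_blocks(8)):
        pass  # a training step would consume `block` here
```

`pool.map` preserves block order while overlapping the preprocessing of later blocks with the consumption of earlier ones, which is the core of the streaming model.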
17. Performance & Cost Issues Reported by the Ray Community
● You might load the entire dataset again for every epoch
● You cannot automatically cache the hottest data across multiple training jobs
● You might suffer a cold start every time
18. Alluxio - Ray Integration
[Architecture diagram: the Ray dataloader performs PyArrow dataset loading through the fsspec Alluxio implementation (alluxiofs), backed by the Alluxio Python client. Alluxio workers, each fronted by a REST API server, register themselves in etcd; the client gets worker addresses from etcd and reads data from the workers.]
19. Alluxiofs - fsspec with Ray Usage

# Import fsspec & the alluxiofs fsspec implementation
import fsspec
from alluxiofs import AlluxioFileSystem

# Register alluxiofs in place of s3fs, so the original S3 URL works unchanged
fsspec.register_implementation("s3", AlluxioFileSystem, clobber=True)

# Create the Alluxio filesystem
alluxio = fsspec.filesystem("s3", etcd_host=args.etcd_host)

# Ray reads data from Alluxio using the original S3 URL
ds = ray.data.read_images("s3://ai-ref-arch/imagenet-full/train",
                          filesystem=alluxio)

See more in: https://github.com/fsspec/alluxiofs
20. Alluxio+Ray Benchmark – Small Files
● Dataset
○ 130 GB ImageNet dataset
● Process Settings
○ 4 training workers
○ 9 reading processes
● Active Object Store Memory
○ 400-500 MiB
21. Alluxio+Ray Benchmark – Large Parquet Files
● Dataset
○ 200 MiB files, adding up to 60 GiB
● Process Settings
○ 28 training workers
○ 28 reading processes
● Active Object Store Memory
○ 20-30 GiB