3. Table of Contents
• Characteristics of the offline inference job
• Challenges of running a large-scale inference job
• Architecture & optimization
• Performance comparison with and without Alluxio
• Future work
4. Characteristics of the offline inference job
• Scale
• Each job has more than 400 tasks
• Each task reads a different dataset and generates its own output (no interaction between tasks)
• Each task reads about 2~3 GB of data and writes 7~8 GB (total input ~1 TB, total output ~3.5 TB)
• Each task may take 2~4 hours to finish
• Data access pattern
• Input data is read only once, sequentially
• Output is written while the job is running
• Infra
• Storage: Azure Blob
• AI platform: OpenPAI (microsoft/pai: Resource scheduling and cluster management for AI, github.com)
• Scheduler: HiveD (microsoft/hivedscheduler: Kubernetes Scheduler for Deep Learning, github.com)
5. Challenges
• The total ingress & egress data volume is large, which easily causes IO failures
• Tools like blobfuse download all input before tasks run and upload all output after the job finishes, which causes IOPS spikes and hits Azure Storage limits
• IO stalls take significant time: GPUs sit idle while data is uploaded/downloaded (wasted time and money!)
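The staging problem above suggests streaming instead: process input in fixed-size chunks as it arrives, so the task never needs the whole dataset staged locally and IO stays smooth instead of spiking at start and end. A minimal sketch (the `infer` callback and chunk size are illustrative, not part of the actual system):

```python
def process_streaming(path, infer, chunk_size=8 * 1024 * 1024):
    """Read the input in fixed-size chunks and process each chunk as it
    is read, so the task never stages the full dataset locally and IO
    requests are spread over the job's lifetime."""
    results = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            results.append(infer(chunk))  # process this chunk before reading the next
    return results
```

A caching layer such as Alluxio makes this pattern practical against remote blob storage, because the reads hit local cache instead of issuing one remote request per chunk.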
6. Production environment
• About 200 Azure Low Priority VMs, each with 4 GPUs (worker nodes can be preempted at any time!)
• Alluxio 2.3.0
• Kubernetes 1.15.x
• Running in production for more than 6 months
7. Architecture with Alluxio
[Diagram: training/inference jobs read and write through a policy-managed data caching/prefetching system (Alluxio), which loads, caches, moves, replicates, and evicts data against the backing data storage (Azure Blob Store, Cosmos Stream, HDFS). The job scheduler (OpenPAI) triggers data loads.]
9. Optimization: CSI-based deployment
• Separate read/write mount options
• Enable metadata cache for the input data folder (the model is shared across tasks)
• Disable metadata cache and set the write type to THROUGH for the output folder
• Each pod uses a different mount point (each job has its own mount point)
• Each job can mount a different path (enables access control)
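The read/write split above can be expressed as two sets of Alluxio client properties, one per mount. A sketch assuming Alluxio 2.x property-name conventions; treat the names as illustrative and verify them against your version's configuration reference before use:

```python
# Per-mount Alluxio client properties for the input (read) and output
# (write) paths. Property names follow Alluxio 2.x conventions and should
# be checked against the version actually deployed.
INPUT_MOUNT_OPTS = {
    # The model/dataset is read-only and shared across tasks, so caching
    # metadata is safe and saves round-trips to the master.
    "alluxio.user.metadata.cache.enabled": "true",
}
OUTPUT_MOUNT_OPTS = {
    # Output files change constantly; stale cached metadata would be harmful.
    "alluxio.user.metadata.cache.enabled": "false",
    # THROUGH persists every write straight to the under store (Azure Blob),
    # so a preempted worker cannot lose completed output.
    "alluxio.user.file.writetype.default": "THROUGH",
}
```

These dictionaries map one-to-one onto the mount options the CSI driver passes to each volume, which is what lets two pods see the same data with different caching behavior.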
10. Optimization: FUSE client improvements
• Flush enhancement: avoid data loss after the job finishes (important for inference jobs!)
• PR: Implement fuse flush function by Binyang2014, Alluxio/alluxio #13103 (github.com)
• Release enhancement: the FUSE release call is async, so a file may not be closed even after we call close(). That can leave the file in an incomplete state.
• PR: Wait file closed before unmount fuse by Binyang2014, Alluxio/alluxio #13114 (github.com)
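The flush issue can also be mitigated from the application side. A minimal sketch in generic Python IO (not Alluxio-specific): make persistence explicit rather than relying on close() alone, since on a FUSE mount whose flush/release handling is weak or asynchronous, close() may return before data reaches the backing store.

```python
import os

def write_output_durably(path, data):
    """Write output and force it toward the backing store before returning.

    On a FUSE mount with a no-op flush or an async release, close() alone
    may leave buffered data behind; flushing and fsyncing states the
    durability intent explicitly.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the filesystem (here, FUSE) to persist the data
```

This is belt-and-suspenders next to the FUSE-side fixes above: the PRs make flush and release behave correctly, and the application states clearly when durability matters.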
11. Prefetch (ongoing)
[Diagram: an Alluxio master with workers co-located on GPU nodes 1-4. While training jobs run on already-scheduled nodes, OpenPAI submits another job; using that job's data paths and the nodes it will schedule, blocks are loaded and prefetched onto the corresponding workers before the job starts.]
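A simple way to approximate this prefetch from the job side is to stream each input file once before tasks start, pulling its blocks into the cache on the local worker. A sketch with an illustrative chunk size (a production setup would instead drive Alluxio's own load mechanisms from the scheduler):

```python
def prefetch(paths, chunk_size=4 * 1024 * 1024):
    """Sequentially stream each input file once so its blocks land in the
    local cache before the job's tasks begin. Returns total bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)  # discard data; the point is warming the cache
    return total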
12. Benefits
• Streams input/output, smoothing IO requests
• Handles read retries automatically, decreasing the failure rate
• Speeds up inference jobs: with less IO stall, performance improves by around 18%
[Screenshot: inference job without Alluxio takes 1 h 57 min, with visible periods of low GPU usage; with Alluxio it takes 1 h 34 min]
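The automatic read retry amounts to a backoff loop around the read path. A sketch with illustrative names and parameters (the Alluxio client performs equivalent retries internally; this is not its API):

```python
import time

def read_with_retry(path, attempts=3, backoff=1.0):
    """Retry transient IO errors with exponential backoff.

    Retrying inside the read path is what turns a transient storage or
    network hiccup into a short stall instead of a failed task.
    """
    for i in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError:
            if i == attempts - 1:
                raise          # out of attempts: surface the error
            time.sleep(backoff * 2 ** i)  # wait longer before each retry
```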
13. Future work
• Add write retry to decrease failures caused by worker nodes going down
• Adopt Alluxio for training jobs. Training jobs have a special data access pattern (each epoch reads exactly the same data once) and are more performance sensitive.
14. References
• OpenPAI: microsoft/pai: Resource scheduling and cluster management for AI (github.com)
• HiveD: microsoft/hivedscheduler: Kubernetes Scheduler for Deep Learning (github.com)
• Alluxio-CSI: Alluxio/alluxio-csi (github.com)