3. Table of Contents
• Characteristics of the offline inference job
• Challenges of running a large-scale inference job
• Architecture & optimization
• Performance comparison with and without Alluxio
• Future work
4. Characteristics of the offline inference job
• Scale
• Each job has more than 400 tasks
• Each task reads a different dataset and generates its own output (no interaction between tasks)
• Each task reads about 2~3 GB of data and writes 7~8 GB (total input ~1 TB, total output ~3.5 TB)
• Each task may take 2~4 hours to finish
• Data access pattern
• Input data is read only once, sequentially
• Output is written while the job is running
• Infra
• Storage: Azure Blob
• AI platform: OpenPAI (microsoft/pai: Resource scheduling and cluster management for AI, github.com)
• Scheduler: HiveD (microsoft/hivedscheduler: Kubernetes Scheduler for Deep Learning, github.com)
5. Challenges
• The total ingress & egress data volume is large, which easily causes IO failures
• Tools like blobfuse download all input before tasks run and upload all output after the job finishes, which causes IOPS spikes and hits Azure Storage limits
• IO stalls take significant time: GPUs sit idle while data is uploaded/downloaded (wasted time and money!)
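The staging problem above suggests streaming instead: process input in fixed-size chunks as it arrives, so the task never needs the whole dataset staged locally and IO stays smooth instead of spiking at start and end. A minimal sketch (the `infer` callback and chunk size are illustrative, not part of the actual system):

```python
def process_streaming(path, infer, chunk_size=8 * 1024 * 1024):
    """Read the input in fixed-size chunks and process each chunk as it
    is read, so the task never stages the full dataset locally and IO
    requests are spread over the job's lifetime."""
    results = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            results.append(infer(chunk))  # process this chunk before reading the next
    return results
```

A caching layer such as Alluxio makes this pattern practical against remote blob storage, because the reads hit local cache instead of issuing one remote request per chunk.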
6. Production environment
• About 200 Azure Low Priority VMs, each with 4 GPUs (worker nodes can be preempted at any time!)
• Alluxio 2.3.0
• Kubernetes 1.15.x
• Running in production for more than 6 months
7. Architecture with Alluxio
[Diagram: training/inference jobs read and write through a policy-managed data caching/prefetching system (Alluxio), which loads, caches, moves, replicates, and evicts data against the backing data storage (Azure Blob Store, Cosmos Stream, HDFS). The job scheduler (OpenPAI) triggers data loads.]
9. Optimization: CSI-based deployment
• Separate read/write mount options
• Enable metadata cache for the input data folder (the model is shared across tasks)
• Disable metadata cache and set the write type to THROUGH for the output folder
• Each pod uses a different mount point (each job has its own mount point)
• Each job can mount a different path (enables access control)
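The read/write split above can be expressed as two sets of Alluxio client properties, one per mount. A sketch assuming Alluxio 2.x property-name conventions; treat the names as illustrative and verify them against your version's configuration reference before use:

```python
# Per-mount Alluxio client properties for the input (read) and output
# (write) paths. Property names follow Alluxio 2.x conventions and should
# be checked against the version actually deployed.
INPUT_MOUNT_OPTS = {
    # The model/dataset is read-only and shared across tasks, so caching
    # metadata is safe and saves round-trips to the master.
    "alluxio.user.metadata.cache.enabled": "true",
}
OUTPUT_MOUNT_OPTS = {
    # Output files change constantly; stale cached metadata would be harmful.
    "alluxio.user.metadata.cache.enabled": "false",
    # THROUGH persists every write straight to the under store (Azure Blob),
    # so a preempted worker cannot lose completed output.
    "alluxio.user.file.writetype.default": "THROUGH",
}
```

These dictionaries map one-to-one onto the mount options the CSI driver passes to each volume, which is what lets two pods see the same data with different caching behavior.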
10. Optimization: FUSE client improvements
• Flush enhancement: avoid data loss after the job finishes (important for inference jobs!)
• PR: Implement fuse flush function by Binyang2014, Alluxio/alluxio #13103 (github.com)
• Release enhancement: the FUSE release call is async, so a file may not be closed even after we call close(). That can leave the file in an incomplete state.
• PR: Wait file closed before unmount fuse by Binyang2014, Alluxio/alluxio #13114 (github.com)
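The flush issue can also be mitigated from the application side. A minimal sketch in generic Python IO (not Alluxio-specific): make persistence explicit rather than relying on close() alone, since on a FUSE mount whose flush/release handling is weak or asynchronous, close() may return before data reaches the backing store.

```python
import os

def write_output_durably(path, data):
    """Write output and force it toward the backing store before returning.

    On a FUSE mount with a no-op flush or an async release, close() alone
    may leave buffered data behind; flushing and fsyncing states the
    durability intent explicitly.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the filesystem (here, FUSE) to persist the data
```

This is belt-and-suspenders next to the FUSE-side fixes above: the PRs make flush and release behave correctly, and the application states clearly when durability matters.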
11. Prefetch (ongoing)
[Diagram: an Alluxio master with workers co-located on GPU nodes 1-4. While training jobs run on already-scheduled nodes, OpenPAI submits another job; using that job's data paths and the nodes it will schedule, blocks are loaded and prefetched onto the corresponding workers before the job starts.]
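A simple way to approximate this prefetch from the job side is to stream each input file once before tasks start, pulling its blocks into the cache on the local worker. A sketch with an illustrative chunk size (a production setup would instead drive Alluxio's own load mechanisms from the scheduler):

```python
def prefetch(paths, chunk_size=4 * 1024 * 1024):
    """Sequentially stream each input file once so its blocks land in the
    local cache before the job's tasks begin. Returns total bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)  # discard data; the point is warming the cache
    return total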
12. Benefits
• Streams input/output, smoothing IO requests
• Handles read retries automatically, decreasing the failure rate
• Speeds up inference jobs: with less IO stall, performance improves by around 18%
[Screenshot: inference job without Alluxio takes 1 h 57 min, with visible periods of low GPU usage; with Alluxio it takes 1 h 34 min]
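The automatic read retry amounts to a backoff loop around the read path. A sketch with illustrative names and parameters (the Alluxio client performs equivalent retries internally; this is not its API):

```python
import time

def read_with_retry(path, attempts=3, backoff=1.0):
    """Retry transient IO errors with exponential backoff.

    Retrying inside the read path is what turns a transient storage or
    network hiccup into a short stall instead of a failed task.
    """
    for i in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError:
            if i == attempts - 1:
                raise          # out of attempts: surface the error
            time.sleep(backoff * 2 ** i)  # wait longer before each retry
```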
13. Future work
• Add write retry to decrease failures caused by worker nodes going down
• Adopt Alluxio for training jobs. Training jobs have a special data access pattern (each epoch reads exactly the same data once) and are more performance sensitive.
14. References
• OpenPAI: microsoft/pai: Resource scheduling and cluster management for AI (github.com)
• HiveD: microsoft/hivedscheduler: Kubernetes Scheduler for Deep Learning (github.com)
• Alluxio-CSI: Alluxio/alluxio-csi (github.com)