Troubleshooting ML Training on a Large-Scale GPU-Based K8S Cluster
Open Infrastructure & Cloud Native Days Korea 2019
19 July 2019 (Fri)
Kyunam Cho (조규남)
mystous@{naver, gmail}.com
mystous@kyunam.com:~$ who am i
• Principal Software Engineer / Software Architect @ Samsung Electronics
• 22 years and counting since first understanding C pointers…
• Working
Private/Public Cloud Solution and Application – VM & Container
Possibility of HPC application on Cloud infrastructure by container cluster
(The 22nd IEEE International Conference on Computational Science and Engineering, 2019)
Time-efficient simulations of tight-binding electronic structures with Intel Xeon Phi™ many-core processors (Computer Physics Communications, vol. 209, 2016)
Parallelization of a Poisson equation solver using Intel Xeon Phi
(Korea Information Processing Society, 2015 Fall Conference)
Excellence Award, Korea Supercomputing Programming Contest (2015)
Previous Presentation
https://developer.ibm.com/kr/devday2018/
https://www.slideshare.net/ssuser3e70ba/deep-learning-100-high-performance-computing-for-ai
Today
Introduction
What is Machine Learning Platform and Why
Machine Learning Platform Era
• The rise of the Machine Learning Platform
1) Laptop, 2) High Performance Computing [HPC], 3) Machine Learning Platform
Personal PC → (+Performance) → HPC → (+Convenience) → Platform
Photo by frank mckenna on Unsplash
Mark by Vladyslav Severyn from the Noun Project
Why is a platform needed?
• Too many pain points
End-to-end management
: Many versions of data sets, unmanaged hyperparameters, and uncontrolled trained models
Configuration
: Too many ML frameworks, version dependencies, and a huge number of ML architecture versions
Utilization
: Dedicated resources, siloed management
Image from https://medium.com/@tomaszdudek/but-what-is-this-machine-learning-engineer-actually-doing-18464d5c699
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing
Machine Learning Platforms
• Machine Learning Platforms: a "Warring States" era (춘추전국시대) of many rival contenders
How to build
• On-Premise? On Public Cloud?
versus by Hea Poh Lin from the Noun Project
Some Platforms
Machine Learning Platforms
Best Practice
• Uber {michelangelo}
Images from https://eng.uber.com/michelangelo/ copyright to Uber
Best Practice
• Airbnb {Bighead}
Slide clip from https://www.slideshare.net/databricks/bighead-airbnbs-endtoend-machine-learning-platform-with-krishna-puttaswamy-and-andrew-hoh copyright to Airbnb
Best Practice
• {Singularity}
Slide and Image from http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/pdf/GMKurtzer_Singularity_Keynote_Tuesday_02072017.pdf#43 copyright to Gregory M. Kurtzer <gmk@lbl.gov>
Machine Learning Platform
Basic components of Machine Learning Platform
Basic Sequence
• Machine Learning Basic Flow
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon
Collect
Cleansing & Labeling
Model Selection
Training
Evaluation
Parameter Tuning
Prediction
Machine Learning Platform Coverage
AI Engineer
Overall Software Stack
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
Revisited from A. Reed, Daniel & Dongarra, Jack. (2015). Exascale Computing and Big Data. Communications of the ACM. 58. 56-68. 10.1145/2699414.
[Stack diagram, layers: Cluster Hardware, System Software, Middleware & Management, HPC & AI Technology, Platform]
[Stack diagram, elements: Infiniband + Ethernet, SAN + Local Node Storage, Linux OS variant, GPGPU or Accelerators, Parallel Framework, Numerical Libraries, System Tool, Development Language, Training Algorithm, ML Framework, Hadoop]
Platform Components: { Environment, Workflow, Model, Quota, Resource, Log, Metering, … } Management
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb
Based on HPC Technology
• Low latency and high throughput, not traffic handling
Enterprise Solution: mass traffic handling
HPC: large-scale problem solving
With Container and WAS
• Easy deployment + isolated environment + convenience
HPC – Large Scale Problem Solving
Performance overhead on Kubernetes
[CPU Intensive Application] [Infiniband Comparison] [GPGPU Intensive Application]
K. Cho, H. Lee, K. Bang, and S. Kim, “Possibility of HPC application on Cloud infrastructure by container cluster,” in The 22nd IEEE International Conference on Computational Science and Engineering (IEEE CSE 2019), 2019.
Business Logic Layer
• Deliverable Management Layer
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon, ProSymbols
Cleansing & Labeling
Model Selection
Training
Evaluation
Parameter Tuning
Machine Learning Platform Coverage
Basic Architecture
• Kubernetes-based Machine Learning Platform
[Diagram: Storage and Management Servers alongside a Kubernetes Cluster of Servers with GPGPU, running Management Modules, Training Jobs, and Preprocessing]
Basic Architecture
• Considerations before you start
[Diagram: the Kubernetes-based platform architecture (Storage, Management Servers, Servers with GPGPU, Kubernetes Cluster), annotated with the items below]
.yaml templating
RDMA-SRIOV plug-in
NVIDIA-peer-memory package
Preprocessing before a training task runs
Docker insecure registry
Docker memory limit unlocked
Persistent Volume mount
Multi-tenant management
Unified timezone
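Several of the considerations above (.yaml templating, the persistent volume mount, per-tenant separation, a unified timezone) can meet in a single job template. The following is a minimal sketch, not the platform's actual template: the field names follow the Kubernetes `batch/v1` Job schema and the `nvidia.com/gpu` resource name used by the NVIDIA device plugin, while the function name, the `trainer` container name, the `/data` mount path, and the `TZ` value are hypothetical choices made for illustration.

```python
import copy
import json

# Hypothetical base template. Keys follow the Kubernetes batch/v1 Job schema;
# the concrete values (container name, mount path, timezone) are assumptions.
JOB_TEMPLATE = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "", "namespace": ""},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    "image": "",
                    # One timezone for every job keeps logs comparable.
                    "env": [{"name": "TZ", "value": "Asia/Seoul"}],
                    "resources": {"limits": {"nvidia.com/gpu": 0}},
                    "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                }],
                "volumes": [{
                    "name": "data",
                    "persistentVolumeClaim": {"claimName": ""},
                }],
            }
        }
    },
}


def render_training_job(name, tenant, image, gpus, claim):
    """Fill the template for one training job; the template itself is not modified."""
    job = copy.deepcopy(JOB_TEMPLATE)
    job["metadata"]["name"] = name
    job["metadata"]["namespace"] = tenant  # one namespace per tenant (assumption)
    pod = job["spec"]["template"]["spec"]
    pod["containers"][0]["image"] = image
    pod["containers"][0]["resources"]["limits"]["nvidia.com/gpu"] = gpus
    pod["volumes"][0]["persistentVolumeClaim"]["claimName"] = claim
    return job


if __name__ == "__main__":
    manifest = render_training_job(
        "bert-pretrain", "team-a", "registry.local/tf:latest", 4, "team-a-data")
    print(json.dumps(manifest, indent=2))
```

Because JSON is a subset of YAML, the printed manifest can be saved to a file and applied like any other manifest.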
Troubleshooting
Problems that you can meet
Basic Architecture
• Kubernetes-based Machine Learning Platform
[Diagram: the platform architecture (Storage, Management Servers, Servers with GPGPU, Kubernetes Cluster), annotated with the issues below]
Storage issue
Insufficient data feeding speed
Multi-GPU handling
Data locality
CNI overhead
Unsuitable pod scheduler
Dynamic pod configuration and operation
Direct calls to the kubectl command
No vGPU
Communication overhead between servers
Resource management
Container root privilege
Troubleshooting
• Storage Issue
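- 1) Diverse environments are built and stored as Docker images
→ Storage use grows with every combination of ML frameworks
→ Storage use surges when each user gets a personalized environment
- 2) Large files may end up stored inside Docker containers
→ Processing large data sets, as with BERT, triggers k8s resource eviction
Solution
1) a. Provide Docker images on demand and manage the cache with a dirty flag
b. Garbage-collect user custom images and establish a policy for it
2) a. Notice only
b. Planned: mount a persistent volume at the user directory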
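Solution 1) can be pictured as a small cache-eviction policy. The sketch below is an assumption-heavy illustration, not the talk's implementation: it reads "dirty flag" as "the image has local changes that are not yet pushed to the registry", and it evicts the least recently used clean images until the cache fits. The `CachedImage` type and `select_evictions` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CachedImage:
    name: str
    size_gb: float
    last_used: float      # timestamp of the last job that used the image
    dirty: bool = False   # assumed meaning: local changes not yet pushed


def select_evictions(images, capacity_gb):
    """Return names of images to delete so the cache fits within capacity_gb.

    Least recently used images go first. Dirty images are never selected,
    so the result may still exceed the capacity if too much of the cache is dirty.
    """
    used = sum(img.size_gb for img in images)
    evict = []
    for img in sorted(images, key=lambda i: i.last_used):
        if used <= capacity_gb:
            break
        if img.dirty:
            continue  # must be pushed to the registry before it can be dropped
        evict.append(img.name)
        used -= img.size_gb
    return evict
```

A real policy would also need per-user quotas for custom images, which is the "establish a policy" part of item b.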
Troubleshooting
• Data Feeding Bottleneck Issue
- Data feeding speed becomes a problem as the number of GPGPUs used for training increases
Graph from https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
[ Data required by number of GPGPUs ]
Solution
a. Provide a CPU-intensive pipeline
→ Resource issues and multi-node use
b. Enable hardware vendor solutions
e.g. NVIDIA DALI 1), Intel DAAL 2) / MKL 3)
1) https://github.com/NVIDIA/DALI
2) https://software.intel.com/en-us/intel-daal
3) https://software.intel.com/en-us/mkl
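The idea behind solution a. is to prepare the next batches on CPU workers while the GPGPUs are busy with the current one. Here is a minimal standard-library sketch of that overlap. It is not how DALI or any vendor library works, and `prefetching_loader` and its parameters are hypothetical names; for CPU-bound Python transforms a process pool would likely be the better fit, since threads share the interpreter lock.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetching_loader(samples, preprocess, workers=4, depth=8):
    """Yield preprocess(sample) in input order, keeping up to `depth` items in flight.

    While the consumer (the training step) handles one item, worker threads
    are already preparing the following ones.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()
        for sample in samples:
            pending.append(pool.submit(preprocess, sample))
            if len(pending) >= depth:
                yield pending.popleft().result()
        # Drain whatever is still in flight once the input is exhausted.
        while pending:
            yield pending.popleft().result()
```

A training loop would simply iterate over `prefetching_loader(file_paths, decode_and_augment)`, where `decode_and_augment` stands in for whatever per-sample work the job needs.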
Troubleshooting
• Resource Management & Unsuitable Pod Scheduler Issue
- 1) GPGPU machine resources become fragmented
→ Kubernetes resource affinity spreads compute across nodes, which makes multi-GPU scheduling difficult
- 2) Abusive users
→ Resources are hogged while utilization stays low
Solution
1) a. Develop and apply a custom Kubernetes scheduler; adjust resource affinity
b. Offer a range of resource packings, e.g. 16 GPGPUs = 1*16, 2*8, 4*4, 8*2
2) a. Fair-share scheduling and quota consumption
b. Planned: preemption scheduler for GPGPUs
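The packing list in 1) b. can be generated and matched against free capacity. The sketch below assumes that "1*16, 2*8, 4*4, 8*2" means nodes × GPGPUs per node, and it prefers the densest packing that fits, choosing the nodes with the least spare capacity to limit fragmentation. It only illustrates the idea and is not the custom scheduler from the talk; `packings` and `place` are hypothetical names.

```python
def packings(total_gpus, max_per_node=16, min_per_node=2):
    """List (nodes, gpus_per_node) pairs whose product is total_gpus, densest first."""
    return [
        (total_gpus // per_node, per_node)
        for per_node in range(max_per_node, min_per_node - 1, -1)
        if total_gpus % per_node == 0
    ]


def place(free, total_gpus, max_per_node=16, min_per_node=2):
    """Pick nodes for a job.

    `free` maps node name to its number of idle GPGPUs. Returns a mapping of
    node name to GPGPUs taken from it, or None when no packing fits.
    """
    for nodes, per_node in packings(total_gpus, max_per_node, min_per_node):
        # Best fit: among nodes that can host `per_node`, take the fullest ones.
        fits = sorted(
            (n for n, idle in free.items() if idle >= per_node),
            key=lambda n: (free[n], n),
        )
        if len(fits) >= nodes:
            return {n: per_node for n in fits[:nodes]}
    return None
```

With `free = {"n1": 8, "n2": 8, "n3": 4}`, a 16-GPGPU job cannot use the 1*16 packing and falls back to 2*8 on `n1` and `n2`.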
Troubleshooting
• Data Locality and Copy Issue
- Data copy overhead between storage, the GPGPU server, and the GPGPUs
Solution
1) Enable a cache between storage and the GPGPU servers
→ The solution differs per hardware vendor
2) In-memory DB, SR-IOV, GPUDirect RDMA 1), etc.
3) GPUDirect 1), GPGPU memory alignment
1) https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
[Diagram: Storage, and a Server with GPGPU containing CPUs and GPGPUs; the copy paths are labelled 1), 2), 3) to match the solutions above]
Troubleshooting
• Multi GPGPU & Server Communication & Dynamic Pod Issue
- 1) Providing multi-GPGPU on demand
→ An individual cluster is provided per request, rather than multi-GPGPU inside a single cluster
→ Support for the various ML frameworks that support multi-GPGPU: Horovod, CNTK, mxnet, Caffe-MPI, etc.
- 2) Performance issue
→ Scalability degrades because of network overhead
Solution
1) a. Configure a separate subnet per cluster
b. Define how a cluster is configured for each ML framework and establish an ML framework plug-in structure
2) a. Hardware optimization and a GPGPU locality-aware topology
b. Provide a range of peer-to-peer communication APIs
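Solution 1) a. amounts to carving one address range per on-demand training cluster out of a larger pool. A minimal sketch with Python's standard `ipaddress` module follows; the pool CIDR, the prefix length, and the cluster names are made-up values, and how the subnets are then wired into the CNI is outside this sketch.

```python
import ipaddress


def allocate_subnets(pool_cidr, cluster_names, prefix_len):
    """Assign each training cluster its own non-overlapping subnet from the pool."""
    pool = ipaddress.ip_network(pool_cidr)
    candidates = pool.subnets(new_prefix=prefix_len)  # iterator over equal-sized subnets
    allocation = {}
    for name in cluster_names:
        try:
            allocation[name] = next(candidates)
        except StopIteration:
            raise ValueError(f"address pool {pool_cidr} exhausted at {name}") from None
    return allocation


if __name__ == "__main__":
    # Hypothetical pool and job names, for illustration only.
    for name, net in allocate_subnets(
            "10.244.0.0/16", ["horovod-job-1", "mxnet-job-2"], 24).items():
        print(name, net)
```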
Troubleshooting
• Multi GPGPU & Server Communication
- GPUDirect RDMA
Revisited from https://developer.nvidia.com/gpudirect
Mellanox. Accelerating High Performance Computing with GPUDirect RDMA. GTC 2013
Image Source from http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-gpudirect-rdma.pdf
Troubleshooting
• Multi GPGPU & Server Communication
- Single Root Complex
Images from Microway homepage https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/octoputer-8-gpu-with-dual-root-tesla-v100/
https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/
Troubleshooting
• Direct kubectl Command Calls & Root Privilege Issue
- The k8s API and the kubectl command differ; command-line permissions are an issue
- Container root privilege must be removed
Solution
a. Wrapper layer over the k8s API and kubectl
b. Grant user-level privilege to Docker containers
- Register the Docker insecure registry
- Lift the Docker memory limit
- Store and manage user secrets in the container
• https://blog.paranoidsoftware.com/dirty-cow-cve-2016-5195-docker-container-escape/
• https://0x0d.im/archives/docker-security.html
• https://www.slideshare.net/BorgHan/hacking-docker-the-easy-way
• https://github.com/dirtycow/dirtycow.github.io/wiki/VulnerabilityDetails
• https://dirtycow.ninja/
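One way to picture the wrapper layer in solution a. is a function that builds the kubectl command line on the user's behalf, pinning the tenant's namespace and refusing anything outside an allow-list. This is only a sketch under assumptions: the allow-list and the `build_kubectl_argv` name are hypothetical, and prefix matching on flags is a coarse filter, not a complete security boundary. The blocked flags themselves (`-n`/`--namespace`, `-A`/`--all-namespaces`, `--kubeconfig`, `--as`) are standard kubectl options.

```python
# Hypothetical allow-list; a real platform would derive it from its own policy.
ALLOWED_VERBS = frozenset({"get", "describe", "logs", "apply", "delete"})

# Flags that would let a user leave their namespace or switch identity.
BLOCKED_PREFIXES = ("-n", "-A", "--namespace", "--all-namespaces",
                    "--kubeconfig", "--as")


def build_kubectl_argv(tenant_namespace, verb, *args):
    """Return an argv list for kubectl, scoped to the tenant's namespace."""
    if verb not in ALLOWED_VERBS:
        raise PermissionError(f"verb not allowed: {verb}")
    for arg in args:
        if arg.startswith(BLOCKED_PREFIXES):
            raise PermissionError(f"flag not allowed: {arg}")
    # An argv list (no shell string) avoids shell injection when it is executed.
    return ["kubectl", "--namespace", tenant_namespace, verb, *args]
```

The returned list is meant to be passed to something like `subprocess.run(argv, check=True)` by the platform's backend, so that users never hold cluster credentials or a shell on a management node.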
Troubleshooting
• CNI Overhead Issue
- The CNIs that can be supported depend on the hardware, and the network layers differ
- Performance issues caused by virtualization; containers on the same server communicate over RPC
Table from https://chrislovecnm.com/kubernetes/cni/choosing-a-cni-provider/
Graph from ZENG, Hao, et al. Measurement and evaluation for docker container networking. In: 2017 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). IEEE, 2017. p. 105-108.
[ CNI comparison table ] [ Performance comparison of selected CNIs ]
Troubleshooting
• CNI Overhead Issue
- Objective comparison figures are scarce, and those that exist focus on Flannel and Calico
- Hardware configurations and virtualization layers differ from case to case
Graph from https://community.cisco.com/t5/jive-developer-archive-blogs/docker-overlay-network-performance-comparison-intel-driver/ba-p/3664582
Troubleshooting
• CNI Overhead Issue
Graph from K. Cho, H. Lee, K. Bang, and S. Kim, “Possibility of HPC application on Cloud infrastructure by container cluster,” in The 22nd IEEE International Conference on Computational Science and Engineering (IEEE CSE 2019), 2019.
Solution
a. Consider your own hardware requirements and hardware architecture
b. Measure performance yourself; treat the network as a constraint and consider workarounds
c. Offer a range of resource packings
e.g. 16 GPGPUs = 1*16, 2*8, 4*4, 8*2
Troubleshooting
• vGPU Issue
- Hardware vendor dependence, VM only (NVIDIA GRID vGPU)
Data sheet from NVIDIA official document https://images.nvidia.com/content/pdf/grid/data-sheet/tesla-gpu-linecard-virtualization-us-nvidia-669786-r7.pdf
Troubleshooting
• vGPU Issue
- Hardware vendor dependence, VM only (NVIDIA GRID vGPU)
Image from NVIDIA official document https://docs.nvidia.com/grid/4.3/grid-vgpu-user-guide/index.html
[ vGPU Overall Architecture ]
Solution
[Diagram labels: Kubernetes Cluster, Servers with GPGPU, OpenStack Cluster, VMs with vGPGPU, Training Jobs]
Suggestion
So what can we do?
Suggestion
• Know your stage
All icons from the noun project (http://thenounprojecct.com) - Daouna Jeong, ruliani, Vectors Market
Beginner or individual: local environment (cf. DIGITS, Anaconda)
Professional: cloud environment (cf. ML Studio, SageMaker)
Expert and product: customized (cf. MLflow, Kubeflow)
Question?
Do you remain Curious?
41

More Related Content

Recently uploaded

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 

Recently uploaded (20)

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 

Featured

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 

Featured (20)

Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 

대규모 GPU 기반 K8S Cluster를 활용한 ML Training Troubleshooting

  • 1. 0 대규모 GPU 기반 K8S Cluster를 활용한 ML Training Troubleshooting Open Infrastructure & Cloud Native Days Korea 2019 19.July.2019(Fri) 조규남 mystous@{naver, gmail}.com
  • 2. 1 mystous@kyunam.com:~$ who am i • Principal Software Engineer / Software Architect @ Samsung Electronics • C언어 pointer 이해 한지 22년째… • Working Private/Public Cloud Solution and Application – VM & Container Possibility of HPC application on Cloud infrastructure by container cluster (The 22nd IEEE International Conference on Computational Science and Engineering, 2019) Time-efficient simulations of tight-binding electronic structures with Intel Xeon PhiTM many-core processors (Computer Physics Communications 209권, 2016) 인텔 제온 파이를 활용한 푸아송 방정식 풀이의 병렬화 (한국정보처리학회 2015년 추계학술발표대회) 한국 슈퍼컴퓨팅 프로그래밍 경진대회 우수상 (2015)
  • 5. Introduction: what is a Machine Learning Platform, and why?
  • 6. Machine Learning Platform Era. The rise of the Machine Learning Platform: 1) Laptop, 2) High Performance Computing [HPC], 3) Machine Learning Platform. [Figure: Personal PC → HPC (+Performance) → Platform (+Convenience)] Photo by frank mckenna on Unsplash; mark by Vladyslav Severyn from the Noun Project.
  • 7. Why is a platform needed? Too many pain points:
      - End-to-end management: many versions of data sets, unmanaged hyperparameters, and uncontrolled trained models
      - Configuration: too many ML frameworks, version dependencies, and a huge number of ML architecture versions
      - Utilization: dedicated resources, silo management
      Image from https://medium.com/@tomaszdudek/but-what-is-this-machine-learning-engineer-actually-doing-18464d5c699. *1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb. Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
  • 8. Machine Learning Platforms: a "Warring States" era of many competing Machine Learning Platforms
  • 9. How to build: on-premise versus public cloud? "versus" icon by Hea Poh Lin from the Noun Project.
  • 11. Best Practice: Uber {michelangelo}. Images from https://eng.uber.com/michelangelo/ (copyright Uber).
  • 12. Best Practice: Airbnb {Bighead}. Slide clip from https://www.slideshare.net/databricks/bighead-airbnbs-endtoend-machine-learning-platform-with-krishna-puttaswamy-and-andrew-hoh (copyright Airbnb).
  • 13. Best Practice: {Singularity}. Slide and image from http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/pdf/GMKurtzer_Singularity_Keynote_Tuesday_02072017.pdf#43 (copyright Gregory M. Kurtzer <gmk@lbl.gov>).
  • 14. Machine Learning Platform: basic components of a Machine Learning Platform
  • 15. Basic Sequence: Machine Learning basic flow. Collect → Cleansing & Labeling → Model Selection → Training → Evaluation → Parameter Tuning → Prediction. Diagram labels: Machine Learning Platform Coverage, AI Engineer. All icons from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon.
  • 16. Overall Software Stack. Revisited from Reed, Daniel A. & Dongarra, Jack. (2015). Exascale Computing and Big Data. Communications of the ACM. 58. 56-68. 10.1145/2699414.
      - Layers shown: Cluster Hardware; System Software; HPC & AI Technology; Middleware & Management
      - Components shown: Infiniband + Ethernet; SAN + Local Node Storage; Linux OS variant; GPGPU or Accelerators; Parallel Framework; Numerical Libraries; System Tool; Development Language; Training Algorithm; ML Framework; Hadoop
      - Platform Components { Environment, Workflow, Model, Quota, Resource, Log, Metering, … } Management
      *1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb. Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing.
  • 17. Based on HPC Technology: the target is low latency and high throughput, not traffic handling. Enterprise solution: mass traffic handling. HPC: large-scale problem solving.
  • 18. With Container and WAS: easy deployment + isolated environment + convenience, added on top of HPC (large-scale problem solving).
  • 20. Performance overhead on Kubernetes. [Figure panels: CPU Intensive Application; Infiniband Comparison; GPGPU Intensive Application] K. Cho, H. Lee, K. Bang, and S. Kim, “Possibility of HPC application on Cloud infrastructure by container cluster,” in The 22nd IEEE International Conference on Computational Science and Engineering (IEEE CSE 2019), 2019.
  • 21. Business Logic Layer: Deliverable Management Layer. Machine Learning Platform Coverage: Cleansing & Labeling → Model Selection → Training → Evaluation → Parameter Tuning. All icons from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon.
  • 22. Basic Architecture: Kubernetes-based Machine Learning Platform. [Diagram: Storage, Management Servers, and Servers with GPGPU form a Kubernetes Cluster that runs Management Modules, Training Jobs, and Preprocessing jobs]
  • 23. Basic Architecture: preliminary considerations. [Same diagram: Storage, Management Servers, and Servers with GPGPU in a Kubernetes Cluster running Management Modules, Training Jobs, and Preprocessing jobs] Items to settle in advance:
      - .yaml templating
      - RDMA-SRIOV plug-in
      - NVIDIA-peer-memory package
      - Preprocessing before a training task runs
      - Docker insecure registry
      - Docker unlock memory limit
      - Persistent Volume mount
      - Multi-tenant management
      - Unified timezone
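The ".yaml templating" item can be illustrated with a minimal sketch. It assumes the NVIDIA device plugin is installed so that GPUs are requested as the extended resource `nvidia.com/gpu`; the function name, job name, labels, and registry address are hypothetical and not taken from the slides. The manifest is emitted as JSON, which is also valid YAML.

```python
import json


def render_training_pod(job_name, image, gpus, command):
    """Fill a minimal Pod manifest template for one training job."""
    if gpus < 1:
        raise ValueError("a training pod needs at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": job_name, "labels": {"app": "ml-training"}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "command": list(command),
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical values, for illustration only.
manifest = render_training_pod(
    "resnet-train-001",
    "registry.example.local/tf:latest",
    4,
    ["python", "train.py"],
)
print(json.dumps(manifest, indent=2))
```

Keeping the template in one function means platform-wide settings (volume mounts, timezone, registry) can be added in a single place rather than in every user's manifest.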
  • 25. Basic Architecture: Kubernetes-based Machine Learning Platform. [Same diagram, annotated with the issues encountered]
      - Storage issue
      - Insufficient data feeding speed
      - Multi-GPU handling
      - Data locality
      - CNI overhead
      - Unsuitable pod scheduler
      - Dynamic pod configuration and operation
      - Direct calls to kubectl commands
      - No vGPU
      - Inter-server communication overhead
      - Resource management
      - Container root privilege
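  • 26. Troubleshooting: Storage Issue
      - 1) Many different environments are built and stored as Docker images
           → storage grows with the number of ML framework combinations
           → storage surges when each user gets a personalized environment
      - 2) Large files may end up stored inside Docker containers
           → when large data is processed and used, as with BERT, k8s resource eviction occurs
      Solution
      - 1) a. Provide Docker images on demand; manage the cache with a dirty flag
           b. Garbage-collect user custom images and establish a policy for them
      - 2) a. For now, notify users only
           b. Planned: mount a persistent volume to the user directory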
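One possible reading of "on-demand images with dirty-flag cache management" is sketched below. The slides do not define the dirty flag; this sketch assumes it marks an image with local changes that have not been pushed yet, so such an image must not be evicted. The class and image names are hypothetical.

```python
from collections import OrderedDict


class ImageCache:
    """LRU cache of image sizes (GB) on one node; dirty images are never evicted."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self._images = OrderedDict()  # name -> {"size": float, "dirty": bool}

    def used_gb(self):
        return sum(entry["size"] for entry in self._images.values())

    def mark_dirty(self, name, dirty=True):
        self._images[name]["dirty"] = dirty

    def pull(self, name, size_gb):
        """Fetch an image on demand; return the names evicted to make room."""
        if name in self._images:
            self._images.move_to_end(name)  # mark as most recently used
            return []
        free = self.capacity_gb - self.used_gb()
        victims = []
        for candidate, entry in self._images.items():  # oldest first
            if free >= size_gb:
                break
            if not entry["dirty"]:  # assumption: dirty images must be kept
                victims.append(candidate)
                free += entry["size"]
        if free < size_gb:
            raise RuntimeError("not enough space: remaining images are dirty")
        for victim in victims:
            del self._images[victim]
        self._images[name] = {"size": size_gb, "dirty": False}
        return victims


cache = ImageCache(capacity_gb=10)
cache.pull("tf-env", 4)
cache.pull("torch-env", 4)
cache.mark_dirty("tf-env")
evicted = cache.pull("mxnet-env", 4)
print(evicted)
```

The eviction list is computed before anything is deleted, so a failed pull leaves the cache unchanged.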
  • 27. Troubleshooting: Data Feeding Bottleneck Issue
      - Data feeding speed becomes a problem as the number of GPGPUs used for training grows
      [Figure: amount of data required versus number of GPGPUs] Graph from https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
      Solution
      - a. Provide a CPU-intensive pipeline → raises a resource issue and requires multi-node use
      - b. Enable hardware vendor solutions, e.g. NVIDIA DALI 1), Intel DAAL 2) / MKL 3)
      1) https://github.com/NVIDIA/DALI 2) https://software.intel.com/en-us/intel-daal 3) https://software.intel.com/en-us/mkl
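Why feeding becomes the bottleneck can be seen with simple arithmetic: the input bandwidth needed to keep every GPGPU busy grows linearly with the GPGPU count. The sketch below assumes each GPGPU consumes samples at a fixed rate; the per-GPU rate and sample size are made-up illustrative numbers, not measurements from the talk.

```python
def required_feed_rate_mb_s(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate input bandwidth (MB/s) needed to keep every GPU busy."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e6


# Hypothetical workload: 500 samples/s per GPU, about 110 KB per sample.
for n in (1, 8, 64):
    rate = required_feed_rate_mb_s(n, samples_per_sec_per_gpu=500, bytes_per_sample=110_000)
    print(f"{n:>3} GPGPUs -> {rate:,.0f} MB/s")
```

Comparing this figure with what storage and the CPU preprocessing pipeline can actually deliver shows at which GPGPU count the input side stops keeping up.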
  • 28. Troubleshooting: Resource Management & Unsuitable Pod Scheduler Issue
      - 1) GPGPU machine resources become fragmented
           → Kubernetes resource affinity spreads computation across nodes, which makes multi-GPU scheduling difficult
      - 2) Abusive users
           → resources are grabbed early and then left at low utilization
      Solution
      - 1) a. Develop and apply a custom Kubernetes scheduler; adjust resource affinity
           b. Offer a range of resource packings, e.g. 16 GPGPU = 1*16, 2*8, 4*4, 8*2
      - 2) a. Fair-share scheduling and quota consumption
           b. Planned: preemption scheduler for GPGPU
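The fair-share and quota idea in solution 2a can be sketched as follows. This is only an illustration under stated assumptions: usage and quota are tracked in GPU-hours per user, and the next job goes to the pending user who has consumed the smallest fraction of their quota. The function and user names are hypothetical, and this is not the scheduler described in the talk.

```python
def pick_next_user(pending, usage_gpu_hours, quota_gpu_hours):
    """Return the pending user with the lowest consumed share of quota, or None."""
    eligible = [
        user for user in pending
        if usage_gpu_hours.get(user, 0.0) < quota_gpu_hours[user]
    ]
    if not eligible:
        return None  # everyone waiting has exhausted their quota
    # Ties are broken by user name so the result is deterministic.
    return min(
        eligible,
        key=lambda user: (usage_gpu_hours.get(user, 0.0) / quota_gpu_hours[user], user),
    )


usage = {"alice": 80.0, "bob": 10.0, "dave": 50.0}
quota = {"alice": 100.0, "bob": 100.0, "carol": 50.0, "dave": 50.0}
print(pick_next_user(["alice", "bob", "carol", "dave"], usage, quota))
```

A user who has used up the quota is skipped entirely, which is one simple way to stop early resource hoarding from starving other tenants.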
  • 29. Troubleshooting: Data Locality and Copy Issue
      - Data copy overhead between Storage, GPGPU Server, and GPGPU
      Solution
      - 1) Enable a cache between Storage and the GPGPU Server → solutions differ by hardware vendor
      - 2) In-memory DB, SR-IOV, GPUDirect RDMA 1), etc.
      - 3) GPUDirect 1), GPGPU memory alignment
      1) https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
      [Diagram: Storage, Server with GPGPU, GPGPUs, CPUs; the data paths are labeled 1), 2), 3) to match the solutions]
  • 30. Troubleshooting: Multi GPGPU & Server Communication & Dynamic Pod Issue
      - 1) On-demand multi-GPGPU provisioning
           → provide an individual cluster per job rather than multi-GPGPU inside a single cluster
           → support the various ML frameworks that handle multi-GPGPU: Horovod, CNTK, mxnet, Caffe-MPI, etc.
      - 2) Performance issue
           → scalability degrades because of network overhead
      Solution
      - 1) a. Configure a separate subnet per cluster
           b. Define a cluster configuration method per ML framework and an ML framework plug-in structure
      - 2) a. Hardware optimization; provide a GPGPU-locality-aware topology
           b. Provide a range of peer-to-peer communication APIs
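Solution 1a, a separate subnet per cluster, can be sketched with Python's standard `ipaddress` module. The address pool, prefix length, and cluster names below are assumptions chosen for illustration; the slides do not say how subnets were actually allocated.

```python
import ipaddress


def allocate_cluster_subnets(pool_cidr, cluster_names, prefix_len=24):
    """Assign each on-demand training cluster its own subnet from one pool."""
    pool = ipaddress.ip_network(pool_cidr)
    subnets = pool.subnets(new_prefix=prefix_len)  # generator of equal-size subnets
    allocation = {}
    for name in cluster_names:
        try:
            allocation[name] = next(subnets)
        except StopIteration:
            raise ValueError("address pool exhausted") from None
    return allocation


# Hypothetical pool and cluster names.
alloc = allocate_cluster_subnets("10.244.0.0/16", ["horovod-job-1", "mxnet-job-2"])
for cluster, subnet in alloc.items():
    print(cluster, subnet)
```

Because the subnets come from a single pool without overlap, traffic from one dynamically created cluster can be separated from another by network policy or routing.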
  • 31. Troubleshooting: Multi GPGPU & Server Communication: GPUDirect RDMA. Revisited from https://developer.nvidia.com/gpudirect. Mellanox, "Accelerating High Performance Computing with GPUDirect RDMA," GTC 2013. Image source: http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-gpudirect-rdma.pdf
  • 32. Troubleshooting: Multi GPGPU & Server Communication: Single Root Complex. Images from the Microway homepage: https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/octoputer-8-gpu-with-dual-root-tesla-v100/ https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/
  • 33. Troubleshooting: Direct kubectl Calls & Root Privilege Issue
      - The k8s API and kubectl commands differ; command-line permissions are an issue
      - Container root privilege needs to be removed
      Solution
      - a. A wrapper layer over the k8s API and kubectl
      - b. Grant user-level privileges to Docker containers
           - Register the Docker insecure registry
           - Lift the Docker memory limit
           - Store and manage user secrets in containers
      References:
      • https://blog.paranoidsoftware.com/dirty-cow-cve-2016-5195-docker-container-escape/
      • https://0x0d.im/archives/docker-security.html
      • https://www.slideshare.net/BorgHan/hacking-docker-the-easy-way
      • https://github.com/dirtycow/dirtycow.github.io/wiki/VulnerabilityDetails
      • https://dirtycow.ninja/
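A wrapper layer in front of kubectl might look like the sketch below. It only builds the argument list and does not run anything. The allowlist, the one-namespace-per-user rule, and the function name are assumptions made for illustration, not the design described in the talk.

```python
ALLOWED_VERBS = {"get", "describe"}  # hypothetical allowlist of read-only verbs


def build_kubectl_command(user_namespace, verb, resource, name=None):
    """Return an argv list for kubectl, restricted to allowed verbs and one namespace."""
    if verb not in ALLOWED_VERBS:
        raise PermissionError(f"verb '{verb}' is not permitted through the wrapper")
    argv = ["kubectl", "--namespace", user_namespace, verb, resource]
    if name is not None:
        argv.append(name)
    return argv


print(build_kubectl_command("team-a", "get", "pods"))
try:
    build_kubectl_command("team-a", "delete", "pods", "trainer-0")
except PermissionError as err:
    print("rejected:", err)
```

Passing the list to `subprocess.run` without a shell avoids shell injection, and pinning the namespace on the server side keeps one tenant from inspecting another tenant's jobs.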
  • 34. Troubleshooting: CNI Overhead Issue
      - The CNIs that can be supported depend on the hardware, and they operate at different network layers
      - Virtualization causes performance issues; containers on the same server communicate over RPC
      [CNI comparison table] Table from https://chrislovecnm.com/kubernetes/cni/choosing-a-cni-provider/
      [Performance comparison of selected CNIs] Graph from ZENG, Hao, et al. Measurement and evaluation for docker container networking. In: 2017 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). IEEE, 2017. p. 105-108.
  • 35. Troubleshooting: CNI Overhead Issue
      - Objective comparison figures are almost nonexistent; what exists mostly covers Flannel and Calico
      - Hardware configurations and virtualization layers differ from site to site
      Graph from https://community.cisco.com/t5/jive-developer-archive-blogs/docker-overlay-network-performance-comparison-intel-driver/ba-p/3664582
  • 36. Troubleshooting: CNI Overhead Issue. Graph from K. Cho, H. Lee, K. Bang, and S. Kim, “Possibility of HPC application on Cloud infrastructure by container cluster,” in The 22nd IEEE International Conference on Computational Science and Engineering (IEEE CSE 2019), 2019.
      Solution
      - a. Take your own hardware requirements and hardware architecture into account
      - b. Measure performance yourself; treat the network as a constraint and look for workarounds
      - c. Offer a range of resource packings, e.g. 16 GPGPU = 1*16, 2*8, 4*4, 8*2
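The resource packing example (16 GPGPU = 1*16, 2*8, 4*4, 8*2) can be enumerated mechanically. The sketch below assumes each "a*b" term means a nodes with b GPGPUs each, which the slide does not state explicitly. It also returns the 16*1 packing, which the slide does not list; the optional per-node cap is a hypothetical parameter.

```python
def gpu_packings(total_gpus, max_gpus_per_node=None):
    """List (nodes, gpus_per_node) pairs whose product equals total_gpus."""
    packings = []
    for nodes in range(1, total_gpus + 1):
        if total_gpus % nodes:
            continue  # this node count does not divide the request evenly
        per_node = total_gpus // nodes
        if max_gpus_per_node is None or per_node <= max_gpus_per_node:
            packings.append((nodes, per_node))
    return packings


for nodes, per_node in gpu_packings(16):
    print(f"{nodes}*{per_node}")
```

Fewer nodes with more GPGPUs each keeps traffic inside a server, while more nodes with fewer GPGPUs each is easier to place on a fragmented cluster but pushes more traffic onto the network.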
  • 37. Troubleshooting: vGPU Issue
      - Hardware vendor dependence; VM only (NVIDIA Grid vGPU)
      Data sheet from the official NVIDIA document: https://images.nvidia.com/content/pdf/grid/data-sheet/tesla-gpu-linecard-virtualization-us-nvidia-669786-r7.pdf
  • 38. Troubleshooting: vGPU Issue
      - Hardware vendor dependence; VM only (NVIDIA Grid vGPU)
      [vGPU overall architecture] Image from the official NVIDIA document: https://docs.nvidia.com/grid/4.3/grid-vgpu-user-guide/index.html
      Solution
      [Diagram: Servers with GPGPU in a Kubernetes Cluster; Servers with GPGPU in an OpenStack Cluster hosting VMs with vGPGPU; Training Jobs]
  • 40. Suggestion: watch your stage
      - Beginner or individual: local environment, cf. DIGITS, Anaconda
      - Professional: cloud environment, cf. ML Studio, Sagemaker
      - Expert and product: customized, cf. mlflow, kubeflow
      All icons from the noun project (http://thenounprojecct.com) - Daouna Jeong, ruliani, Vectors Market.