ML Training Troubleshooting on a Large-Scale GPU-Based K8S Cluster
1. 0
ML Training Troubleshooting on a
Large-Scale GPU-Based K8S Cluster
Open Infrastructure & Cloud Native Days Korea 2019
19 July 2019 (Fri)
Kyunam Cho (조규남)
mystous@{naver, gmail}.com
2. 1
mystous@kyunam.com:~$ who am i
• Principal Software Engineer / Software Architect @ Samsung Electronics
• 22 years and counting since I first understood pointers in C…
• Working
Private/Public Cloud Solution and Application – VM & Container
Possibility of HPC application on Cloud infrastructure by container cluster
(The 22nd IEEE International Conference on Computational Science and Engineering, 2019)
Time-efficient simulations of tight-binding electronic structures with Intel
Xeon Phi™ many-core processors (Computer Physics Communications, vol. 209, 2016)
Parallelizing a Poisson equation solver with Intel Xeon Phi
(Korea Information Processing Society, 2015 Fall Conference)
Excellence Award, Korea Supercomputing Programming Contest (2015)
6. 5
Machine Learning Platform Era
• The Rise of the Machine Learning Platform
1) Laptop, 2) High Performance Computing [HPC], 3) Machine Learning Platform
Photo by frank mckenna on Unsplash
Personal PC HPC Platform
Mark by Vladyslav Severyn from the Noun Project
+Performance +Convenience
7. 6
Why Is a Platform Needed?
• Too many pain points
End to End Management
: Many versions of data sets, unmanaged hyperparameters,
and uncontrolled trained models
Configuration
: Too many ML frameworks, version dependencies,
and a huge number of ML architecture versions
Utilization
: Dedicated Resource, Silo Management
Image from https://medium.com/@tomaszdudek/but-what-is-this-machine-learning-engineer-actually-doing-18464d5c699
*1
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb
*2
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
11. 10
Best Practice
• Uber {michelangelo}
Images from https://eng.uber.com/michelangelo/ copyright to Uber
12. 11
Best Practice
• Airbnb {Bighead}
Slide clip from https://www.slideshare.net/databricks/bighead-airbnbs-endtoend-machine-learning-platform-with-krishna-puttaswamy-and-andrew-hoh copyright to Airbnb
13. 12
Best Practice
• {Singularity}
Slide and Image from http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/pdf/GMKurtzer_Singularity_Keynote_Tuesday_02072017.pdf#43 copyright to Gregory M. Kurtzer <gmk@lbl.gov>
15. 14
Basic Sequence
• Machine Learning Basic Flow
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon
Collect
Cleansing & Labeling
Model Selection
Training
Evaluation
Parameter Tuning
Prediction
Machine Learning Platform Coverage
AI Engineer
16. 15
Overall Software Stack
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
Revisited from A. Reed, Daniel & Dongarra, Jack. (2015). Exascale Computing and Big Data. Communications of the ACM. 58. 56-68. 10.1145/2699414.
Cluster Hardware
System Software
HPC & AI Technology
Middleware & Management
Infiniband + Ethernet
SAN + Local Node Storage
Linux OS variant
GPGPU or Accelerators
Parallel Framework
Numerical Libraries
System Tool
Development Language
Training Algorithm
ML Framework
Hadoop
*1
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb
Platform
Components { Environment, Workflow, Model, Quota, Resource, Log, Metering, … } Management
*2
17. 16
Based on HPC Technology
• Low latency and high throughput, not traffic volume
Enterprise Solution – Mass Traffic Handling
HPC – Large Scale Problem Solving
18. 17
With Container and WAS
• Easy deployment + Isolated environment + Convenience
HPC – Large Scale Problem Solving
+
20. 19
Performance overhead on Kubernetes
[CPU Intensive Application] [Infiniband Comparison] [GPGPU Intensive Application]
K. Cho, H. Lee, K. Bang, and S. Kim, “Possibility of HPC application on Cloud infrastructure by container cluster,” in The 22nd IEEE International Conference on Computational Science and Engineering (IEEE CSE 2019), 2019.
21. 20
Business Logic Layer
• Deliverable Management Layer
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon, ProSymbols
Cleansing & Labeling
Model Selection
Training
Evaluation
Parameter Tuning
Machine Learning Platform Coverage
22. 21
Basic Architecture
• Kubernetes-based Machine Learning Platform
Storage
Management Servers
Servers with GPGPU
Kubernetes Cluster
Management Modules
Training Job
Preprocessing
23. 22
Basic Architecture
• Prerequisites to consider
- .yaml templating
- RDMA-SRIOV plug-in
- NVIDIA-peer-memory package
- Preprocessing before running a training task
- Docker insecure registry
- Docker: unlock the memory limit
- Persistent Volume mount
- Multi-tenant management
- Unified timezone
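The ".yaml" templating prerequisite above can be illustrated with a minimal sketch. It assumes the template is a Kubernetes Job manifest and fills it with Python's `string.Template`; the function name `render_training_job`, the placeholder names, and the example image path are hypothetical and do not come from the talk. `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin.

```python
from string import Template

# Hypothetical Job manifest template; $-placeholders are filled per training request.
JOB_TEMPLATE = Template("""\
apiVersion: batch/v1
kind: Job
metadata:
  name: $job_name
  namespace: $namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: $image
        resources:
          limits:
            nvidia.com/gpu: $gpus
""")


def render_training_job(job_name, namespace, image, gpus):
    """Render the manifest text for one training job."""
    if gpus < 1:
        raise ValueError("a training job needs at least one GPU")
    # substitute() raises KeyError if a placeholder is left unfilled.
    return JOB_TEMPLATE.substitute(
        job_name=job_name, namespace=namespace, image=image, gpus=gpus
    )


if __name__ == "__main__":
    print(render_training_job("bert-pretrain", "team-a",
                              "registry.example.com/ml/tf:latest", 4))
```

Keeping the manifest as a template means users supply only a few values, and the platform controls everything else in the spec.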
25. 24
Basic Architecture
• Kubernetes-based Machine Learning Platform
Storage Issue
Insufficient data feeding speed
Multi-GPU handling
Data Locality
CNI Overhead
Unsuitable Pod scheduler
Dynamic Pod configuration and operation
Direct calls to kubectl commands
No vGPU
Inter-server communication overhead
Resource Management
Container Root Privilege
26. 25
Troubleshooting
• Storage Issue
- 1) Diverse environments are built and stored as Docker images
→ Storage use grows with each combination of ML frameworks
→ Storage use surges when each user gets a personalized environment
- 2) Large files may be stored inside Docker containers
→ Processing large data sets, as with BERT, triggers k8s resource eviction
Solution
1) a. Provide Docker images on demand; manage the cache with a dirty flag
b. Garbage-collect user custom images and establish a policy
2) a. Notice only
b. Planned: mount a persistent volume to the user directory
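Solution 1) above can be sketched as a small eviction policy. The talk does not say what the dirty flag means or which policy was used. This sketch assumes that a dirty image has local changes not yet pushed to the registry and so must not be evicted, and that clean images are evicted least recently used first. `CachedImage` and `select_evictions` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CachedImage:
    name: str
    size_gb: float
    last_used: float     # timestamp of the last job that used the image
    dirty: bool = False  # assumption: True means unpushed local changes


def select_evictions(images, capacity_gb):
    """Return names of clean images to delete, oldest first, until the cache fits."""
    used = sum(img.size_gb for img in images)
    evict = []
    for img in sorted(images, key=lambda i: i.last_used):
        if used <= capacity_gb:
            break
        if img.dirty:
            continue  # never drop an image that cannot be pulled again
        evict.append(img.name)
        used -= img.size_gb
    return evict


if __name__ == "__main__":
    cache = [
        CachedImage("tf-1.13", 8, last_used=100),
        CachedImage("user-a-custom", 12, last_used=50, dirty=True),
        CachedImage("pytorch-1.1", 6, last_used=200),
        CachedImage("mxnet", 5, last_used=10),
    ]
    print(select_evictions(cache, capacity_gb=20))
```

If the dirty images alone exceed the capacity, the function returns every clean image and the cache stays over quota, which is where a user-facing policy has to take over.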
27. 26
Troubleshooting
• Data Feeding Bottleneck Issue
- Data feeding speed becomes a bottleneck as the number of GPGPUs used for training grows
Graph from https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
[ Data required by number of GPGPUs ]
Solution
a. Provide a CPU-intensive pipeline
→ Resource issues and multi-node use
b. Enable hardware vendor solutions
e.g. NVIDIA DALI 1), Intel DAAL 2)/MKL 3), etc.
1) https://github.com/NVIDIA/DALI
2) https://software.intel.com/en-us/intel-daal
3) https://software.intel.com/en-us/mkl
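A back-of-the-envelope calculation shows why feeding becomes the bottleneck and why a CPU-intensive pipeline raises resource issues. The sketch below assumes training throughput scales linearly with the number of GPGPUs. The function names and all example figures are hypothetical, not measurements from the talk.

```python
import math


def required_feed_rate(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate input rate needed to keep every GPU busy (linear-scaling assumption)."""
    samples_per_sec = num_gpus * samples_per_sec_per_gpu
    return samples_per_sec, samples_per_sec * bytes_per_sample


def cpu_workers_needed(num_gpus, samples_per_sec_per_gpu, samples_per_sec_per_worker):
    """CPU preprocessing workers needed to sustain that rate."""
    return math.ceil(num_gpus * samples_per_sec_per_gpu / samples_per_sec_per_worker)


if __name__ == "__main__":
    # Hypothetical figures: 1000 samples/s per GPU, 100 kB per sample,
    # 250 samples/s decoded and augmented per CPU worker.
    for gpus in (1, 8, 16):
        samples, nbytes = required_feed_rate(gpus, 1000, 100_000)
        workers = cpu_workers_needed(gpus, 1000, 250)
        print(gpus, samples, nbytes, workers)
```

Under these assumed figures the worker count grows in step with the GPGPU count, so the preprocessing soon needs more CPU cores than one GPU server offers. That is the point at which multi-node pipelines or vendor libraries such as DALI become attractive.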
28. 27
Troubleshooting
• Resource Management & Unsuitable Pod Scheduler Issue
- 1) Fragmentation of GPGPU machine resources
→ Kubernetes resource affinity spreads computation across nodes, which makes multi-GPU scheduling difficult
- 2) Abusing users
→ Resource hoarding and low utilization
Solution
1) a. Develop and apply a custom Kubernetes scheduler; adjust resource affinity
b. Offer various resource packings, e.g. 16 GPGPU = 1*16, 2*8, 4*4, 8*2
2) a. Fair share scheduling and quota consumption
b. Planned: preemption scheduler for GPGPU
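The resource packing example above (16 GPGPU = 1*16, 2*8, 4*4, 8*2) can be enumerated and used for placement. This is a minimal sketch, not the custom scheduler from the talk. It reads each packing as nodes * GPGPUs per node, tries the densest packing first, and picks the nodes with the least free capacity that still fit. `packings` and `place` are hypothetical names, and the lower bound of 2 GPGPUs per node is an assumption chosen to match the slide's list.

```python
def packings(total_gpus, max_gpus_per_node, min_gpus_per_node=1):
    """All (nodes, gpus_per_node) splits of total_gpus, densest first."""
    return [
        (total_gpus // per_node, per_node)
        for per_node in range(max_gpus_per_node, min_gpus_per_node - 1, -1)
        if total_gpus % per_node == 0
    ]


def place(free_gpus, total_gpus, max_gpus_per_node, min_gpus_per_node=1):
    """Return {node: gpus} for the densest packing that fits, or None."""
    for nodes_needed, per_node in packings(total_gpus, max_gpus_per_node,
                                           min_gpus_per_node):
        # Best fit: prefer nodes with the least free capacity that still fit,
        # so large free blocks stay available for later jobs.
        candidates = sorted(
            (node for node, free in free_gpus.items() if free >= per_node),
            key=lambda node: (free_gpus[node], node),
        )
        if len(candidates) >= nodes_needed:
            return {node: per_node for node in candidates[:nodes_needed]}
    return None


if __name__ == "__main__":
    print(packings(16, 16, min_gpus_per_node=2))
    free = {"node-a": 6, "node-b": 4, "node-c": 4, "node-d": 2}
    print(place(free, total_gpus=8, max_gpus_per_node=8))
```

In the usage example no single node has 8 free GPGPUs, so the request falls back to 2*4 and lands on the two nodes that are exactly full afterwards, leaving the 6-GPGPU block on `node-a` intact.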
29. 28
Troubleshooting
• Data Locality and Copy Issue
- Data copy overhead between Storage, GPGPU Server, and GPGPU
Solution
1) Enable a cache between Storage and GPGPU Server
→ Solutions differ by hardware vendor
2) In-memory DB, SR-IOV, GPUDirect RDMA 1), etc.
3) GPUDirect 1), GPGPU memory alignment
1) https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
[ Diagram: Storage and Server with GPGPU (CPUs, GPGPUs), annotated with 1), 2), 3) ]
30. 29
Troubleshooting
• Multi GPGPU & Server Communication & Dynamic Pod Issue
- 1) On-demand multi-GPGPU provisioning
→ Provide separate clusters rather than multi-GPGPU within a single cluster
→ Support a variety of multi-GPGPU ML frameworks: Horovod, CNTK, mxnet, Caffe-MPI, etc.
- 2) Performance issue
→ Scalability degrades due to network overhead
Solution
1) a. Configure a separate subnet per cluster
b. Establish per-framework cluster configuration methods and an ML framework plug-in structure
2) a. Provide hardware optimization and a GPGPU-locality-aware topology
b. Provide various peer-to-peer communication APIs
31. 30
Troubleshooting
• Multi GPGPU & Server Communication
- GPUDirect RDMA
Revisited from https://developer.nvidia.com/gpudirect
Mellanox. Accelerating High Performance Computing with GPUDirect RDMA. GTC 2013
Image Source from http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-gpudirect-rdma.pdf
32. 31
Troubleshooting
• Multi GPGPU & Server Communication
- Single Root Complex
Images from Microway homepage https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/octoputer-8-gpu-with-dual-root-tesla-v100/
https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/
33. 32
Troubleshooting
• Direct kubectl Command Calls & Root Privilege Issue
- The k8s API and kubectl commands differ; command-line permissions are a concern
- Root privilege must be removed from containers
Solution
a. k8s API and kubectl wrapper layer
b. Grant user-level privileges to Docker containers
- Register the Docker insecure registry
- Remove the Docker memory limit
- Store and manage user secrets in containers
• https://blog.paranoidsoftware.com/dirty-cow-cve-2016-5195-docker-container-escape/
• https://0x0d.im/archives/docker-security.html
• https://www.slideshare.net/BorgHan/hacking-docker-the-easy-way
• https://github.com/dirtycow/dirtycow.github.io/wiki/VulnerabilityDetails
• https://dirtycow.ninja/
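The wrapper layer in solution a. can be sketched as a function that builds a restricted kubectl argument list instead of giving users a raw command line. The talk does not describe how the wrapper was built. The allow-lists, the function name `build_kubectl_argv`, and the simplified name check are assumptions made for illustration.

```python
import re

# Assumption: the platform exposes only a few read-mostly operations.
ALLOWED_VERBS = {"get", "describe", "delete"}
ALLOWED_RESOURCES = {"pods", "jobs"}

# Simplified check (lowercase letters, digits, hyphens); real Kubernetes
# object names can be longer DNS subdomains.
_NAME = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def build_kubectl_argv(namespace, verb, resource, name=None):
    """Build a kubectl argument list pinned to the tenant's namespace."""
    if verb not in ALLOWED_VERBS:
        raise PermissionError(f"verb not allowed: {verb}")
    if resource not in ALLOWED_RESOURCES:
        raise PermissionError(f"resource not allowed: {resource}")
    for label in (namespace, name):
        if label is not None and not _NAME.fullmatch(label):
            raise ValueError(f"invalid name: {label!r}")
    argv = ["kubectl", verb, resource]
    if name is not None:
        argv.append(name)
    # The namespace comes from the platform's tenant record, not from user input.
    argv += ["--namespace", namespace]
    return argv


if __name__ == "__main__":
    print(build_kubectl_argv("team-a", "get", "pods"))
    print(build_kubectl_argv("team-a", "delete", "jobs", "bert-pretrain"))
```

The returned list is meant to be passed to `subprocess.run` without a shell, so user-supplied strings are never interpreted as shell syntax.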
34. 33
Troubleshooting
• CNI Overhead Issue
- Supported CNIs differ by hardware, and network layers differ
- Performance issues from virtualization; RPC communication is used even between containers on the same server
Table from https://chrislovecnm.com/kubernetes/cni/choosing-a-cni-provider/
Graph from ZENG, Hao, et al. Measurement and evaluation for docker container networking. In: 2017 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). IEEE, 2017. p. 105-108.
[ CNI comparison table ] [ Performance comparison of selected CNIs ]
35. 34
Troubleshooting
• CNI Overhead Issue
- Almost no objective comparison figures exist; most cover only Flannel and Calico
- Hardware configurations and virtualization layers all differ
Graph from https://community.cisco.com/t5/jive-developer-archive-blogs/docker-overlay-network-performance-comparison-intel-driver/ba-p/3664582
36. 35
Troubleshooting
• CNI Overhead Issue
Graph from K. Cho, H. Lee, K. Bang, and S. Kim, “Possibility of HPC application on Cloud infrastructure by container cluster,” in The 22nd IEEE International Conference on Computational Science and Engineering (IEEE CSE 2019), 2019.
Solution
a. Consider your own hardware requirements and hardware architecture
b. Measure performance yourself; treat the network as a constraint and consider workarounds
c. Offer various resource packings
e.g. 16 GPGPU = 1*16, 2*8, 4*4, 8*2
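The advice in item b. to run your own performance measurements can start from something as small as a TCP round-trip test. The sketch below uses only the Python standard library and is not the benchmark used in the cited paper. The function names are hypothetical. In a real comparison the echo server would run in one Pod and the client in another, once per CNI under test, and dedicated tools such as iperf3 or the OSU micro-benchmarks would give more rigorous numbers.

```python
import socket
import statistics
import threading
import time


def start_echo_server(host="127.0.0.1", port=0):
    """Start a one-connection TCP echo server in a background thread."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        with conn, srv:
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                conn.sendall(data)

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()  # (host, port) actually bound


def measure_rtt(addr, payload_size=64, rounds=200):
    """Send a payload and wait for its echo, `rounds` times; times in microseconds."""
    payload = b"x" * payload_size
    samples = []
    with socket.create_connection(addr) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            start = time.perf_counter()
            sock.sendall(payload)
            remaining = payload_size
            while remaining:
                chunk = sock.recv(remaining)
                if not chunk:
                    raise ConnectionError("peer closed the connection")
                remaining -= len(chunk)
            samples.append((time.perf_counter() - start) * 1e6)
    samples.sort()
    return {
        "rounds": rounds,
        "median_us": statistics.median(samples),
        "p99_us": samples[int(0.99 * (len(samples) - 1))],  # approximate
    }


if __name__ == "__main__":
    address = start_echo_server()  # loopback only; replace with a Pod IP
    print(measure_rtt(address))
```

Running the same client against the same server image over each network option keeps the comparison tied to your own hardware and virtualization layer.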
37. 36
Troubleshooting
• vGPU Issue
- Hardware vendor dependence, VM only (NVIDIA GRID vGPU)
Data sheet from NVIDIA official document https://images.nvidia.com/content/pdf/grid/data-sheet/tesla-gpu-linecard-virtualization-us-nvidia-669786-r7.pdf
38. 37
Troubleshooting
• vGPU Issue
- Hardware vendor dependence, VM only (NVIDIA GRID vGPU)
Image from NVIDIA official document https://docs.nvidia.com/grid/4.3/grid-vgpu-user-guide/index.html
[ vGPU Overall Architecture ]
Solution
[ Diagram: Kubernetes Cluster on Servers with GPGPU; OpenStack Cluster on Servers with GPGPU, with VMs with vGPGPU; Training Jobs ]
40. 39
Suggestion
• Watch your stage
All icon from the noun project (http://thenounprojecct.com) - Daouna Jeong, ruliani, Vectors Market
Beginner or Individual
Local Environment
cf. DIGITS, Anaconda
Professional
Cloud Environment
cf. ML Studio, Sagemaker
Expert and Product
Customized
cf. mlflow, kubeflow