
AllReduce for distributed learning I/O Extended Seoul


This deck reviews the new announcements on Distributed Learning at Google I/O 2019, surveys the technologies we can look into for faster training, and shares their architecture in detail.


AllReduce for distributed learning I/O Extended Seoul

  1. AllReduce for distributed learning - SungMin Han, Gopher
  2. Agenda ● A short impression of I/O 2019 ● Distributed learning ● AllReduce ● Cloud TPU Pods
  3. Speaker: SungMin Han, Clova Research Engineer, Gopher, @pignose
  4. A short impression of I/O 2019
  5. Google I/O 2019 schedule (05.07 - 09): 26 attended sessions, 1 happy hour, 8 sandboxes, 4 community meetups, 6 snacks, 8 Uber rides
  6. Key Announcements: TensorFlow 2.0, Fairness Learning, ML Kit, AI Hub, Federated Learning, TPU v3, Cloud TPU Pods, TensorFlow on Swift, TensorFlow Lite for IoT Devices, TensorFlow Agent, TensorFlow Extended (TFX), TensorFlow.js, Google Coral, Firebase Prediction, Edge TPU
  7. Key Announcements (grouped under TensorFlow): same list as above
  8. Key Announcements (grouped under TPU / Device): same list as above
  9. Key Announcements (grouped under ML Kit): same list as above
  10. In this session, we will talk about Distributed Learning (picked from the same list of announcements)
  11. Distributed Learning
  12. SGD with a single GPU (previous learning environment): the CPU feeds GPU 1, which runs the forward pass (FP) and backward pass (BP) to get the loss, averages over the batch (AVG), and applies the weight update (WP) with ∆w
  13. Simple version of SGD (previous learning environment): model -> loss -> gradient on GPU 1, then average (AVG) and update the weights with ∆w
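The loop on slides 12-13 maps directly onto a TensorFlow 2.0 training step. The sketch below is illustrative only; the toy model, optimizer, and loss function are assumptions, not taken from the talk.

```python
import tensorflow as tf

# Toy model, optimizer, and loss; stand-ins just to make the loop concrete.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)   # forward pass (FP)
        loss = loss_fn(y, predictions)          # loss, averaged over the batch (AVG)
    grads = tape.gradient(loss, model.trainable_variables)           # backward pass (BP) -> ∆w
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # weight update (WP)
    return loss
```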
  14. The problem (previous learning environment) ● Learning time depends on the batch size and the GPU model ● The model update process runs on only a single GPU ● A high-spec GPU machine is too expensive ● A single GPU has practical limitations ● There is no way to support scalability
  15. SGD with multiple GPUs (previous learning environment): each of GPU 1-3 computes model -> loss -> gradient, the CPU gathers ∆w1, ∆w2, ∆w3, aggregates them (AVG), and updates the weights with ∆w
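Slide 15 boils down to gather, average, update. A minimal NumPy sketch of that aggregation step; shapes and values are made up for illustration.

```python
import numpy as np

# Toy stand-ins: per-GPU gradients for the same weight vector.
num_gpus = 3
weights = np.zeros(4)                                          # shared weights held on the CPU
per_gpu_grads = [np.random.randn(4) for _ in range(num_gpus)]  # ∆w1, ∆w2, ∆w3

# Gather: collect every GPU's gradient on the CPU.
gathered = np.stack(per_gpu_grads)

# Aggregate (AVG): average the gradients across GPUs.
avg_grad = gathered.mean(axis=0)

# Update: apply one SGD step with the averaged gradient.
learning_rate = 0.01
weights -= learning_rate * avg_grad
```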
  16. The issues we can find (previous learning environment) ● Data transmission between GPU memory and the CPU is slow ● There is a GPU stickiness issue (*GPU balancing issue) ● This solution only works on a single bare-metal server (node)
  17. To avoid the problem, we need to find a better way
  18. imbalance = bottleneck
  19. The definition of Distribution: increase efficiency by dividing the problem into smaller parts (diagram: a problem split across workers, whose results combine into the answer)
  20. Three ways of distribution: Parallel, Concurrent, Parallel + Concurrent. To build a distributed environment, we should understand the difference between these three categories of distributed solutions
  21. Well-known distributed solutions ● DistBelief (Google Brain's 1st distributed environment for Deep Learning) ● Horovod (Uber's distributed TensorFlow environment) ● AllReduce (today's topic!) ● Federated Learning (announced by Google in 2018) ● CollectiveAllReduce (TensorFlow's tf.contrib.distribute.CollectiveAllReduce)
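The last entry on slide 21 is a TF 1.x contrib symbol; in TensorFlow 2.0 the corresponding entry point is, to the best of my knowledge, tf.distribute.experimental.MultiWorkerMirroredStrategy, which performs collective all-reduce across workers. A hedged sketch (cluster membership via the TF_CONFIG environment variable is assumed and not shown, and the toy model is an illustration):

```python
import tensorflow as tf

# MultiWorkerMirroredStrategy runs collective all-reduce across workers.
# Each worker reads its cluster role from the TF_CONFIG environment variable.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here are mirrored; gradients are all-reduced automatically.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(dataset) would then run synchronous data-parallel training.
```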
  22. The basic theory: Downpour SGD. Workers on GPU 1-4 send their gradients ∆w1-∆w4 to a parameter server (PS1), which broadcasts the updated ∆w back to every worker
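A minimal single-process sketch of the parameter-server pattern on slide 22: workers pull the current weights, compute a gradient, and push it to the PS, which applies the update. All names and the random "gradients" are illustrative, not part of any real Downpour implementation.

```python
import numpy as np

class ParameterServer:
    """Minimal stand-in for PS1 in the Downpour SGD picture."""
    def __init__(self, dim, learning_rate=0.01):
        self.weights = np.zeros(dim)
        self.lr = learning_rate

    def push(self, grad):
        # Workers push gradients; the PS applies them as they arrive.
        self.weights -= self.lr * grad

    def pull(self):
        # Workers pull (the PS broadcasts) the latest weights before the next step.
        return self.weights.copy()

ps = ParameterServer(dim=4)
for step in range(3):
    for worker in range(4):              # GPU 1..4
        local_w = ps.pull()              # receive the broadcast ∆w
        grad = np.random.randn(4)        # stand-in for this worker's ∆w_i
        ps.push(grad)                    # send the gradient back to PS1
```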
  23. Use case of Uber (Horovod) https://eng.uber.com/horovod/
  24. Parameter Server scenario https://eng.uber.com/horovod/ (a simple setup with few parameter servers becomes a bottleneck; sharding across many parameter servers is complex and adds overhead)
  25. Use case of Uber (Horovod): Ring AllReduce https://eng.uber.com/horovod/ http://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf
  26. Horovod architecture: TensorFlow + Baidu Ring-AllReduce + NVIDIA NCCL2 + Open MPI https://eng.uber.com/horovod/ http://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf
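A minimal sketch of how the pieces on slide 26 are used in practice, based on Horovod's public Keras API; the toy model and learning-rate scaling choice are assumptions. hvd.init() starts one process per GPU, and hvd.DistributedOptimizer wraps the optimizer so gradients are averaged with ring all-reduce over NCCL2 / MPI.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU; typically launched with `horovodrun -np 4 python train.py`

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])

# Common Horovod recipe: scale the learning rate by the number of workers,
# then wrap the optimizer so gradients are averaged with ring all-reduce.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Make sure every worker starts from the same initial weights (rank 0's).
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# model.fit(dataset, callbacks=callbacks) then trains data-parallel across GPUs.
```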
  27. Use case of Uber (Horovod) https://eng.uber.com/horovod/
  28. Federated Learning https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
  29-30. Federated Learning https://arxiv.org/abs/1902.01046 - Towards Federated Learning at Scale: System Design
  31. Secure Aggregation https://eprint.iacr.org/2017/281.pdf
  32. Federated Learning architecture: TensorFlow + actor programming (message passing) + FL Server + Secure Aggregation https://eng.uber.com/horovod/ http://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf
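The FL slides describe devices training locally and an FL server aggregating their updates. Below is a minimal Federated Averaging (FedAvg) sketch in NumPy; the local update is faked with random gradients, the client data sizes are invented, and secure aggregation is omitted entirely.

```python
import numpy as np

def local_update(global_w, local_data_size, lr=0.1):
    """Stand-in for on-device training: returns locally updated weights."""
    fake_grad = np.random.randn(*global_w.shape)     # placeholder for real gradients
    return global_w - lr * fake_grad, local_data_size

def federated_averaging(global_w, clients):
    """One FL round: selected clients train locally, the server averages the results,
    weighting each client by how much data it holds (secure aggregation omitted)."""
    updates, sizes = zip(*(local_update(global_w, n) for n in clients))
    weights = np.array(sizes, dtype=float) / sum(sizes)
    return sum(w * u for w, u in zip(weights, updates))

global_w = np.zeros(4)
for round_id in range(5):
    selected_clients = [120, 80, 200]                # per-client example counts (invented)
    global_w = federated_averaging(global_w, selected_clients)
```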
  33. AllReduce
  34. What is AllReduce: in the Downpour / parameter-server setup, each worker sends its own ∆w1-∆w4 to the PS and receives the combined ∆w back
  35-36. What is AllReduce: with AllReduce there is no central server; the workers exchange their values δ1-δ4 with each other until every worker holds all of them
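The property slides 34-36 illustrate is the all-reduce contract: every worker ends up with the same reduction of all the δ values. A deliberately naive sketch (sum in one place, copy back) to pin down the semantics before the ring version:

```python
import numpy as np

def naive_allreduce(per_worker_deltas):
    """All-reduce semantics: every worker ends up with the same reduced value.
    This naive version just sums everything in one place and copies it back."""
    reduced = np.sum(per_worker_deltas, axis=0)          # reduce: δ1 + δ2 + δ3 + δ4
    return [reduced.copy() for _ in per_worker_deltas]   # broadcast to every worker

deltas = [np.random.randn(4) for _ in range(4)]          # δ1..δ4, one per worker
results = naive_allreduce(deltas)
assert all(np.allclose(r, results[0]) for r in results)  # every worker sees the same value
```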
  37. With Hamiltonian circuit
  38-42. AllReduce Strategy (same reference across five slides) https://preferredresearch.jp/2018/07/10/technologies-behind-distributed-deep-learning-allreduce/
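The reference on slides 38-42 describes the ring algorithm: scatter-reduce followed by all-gather, so each worker transfers only about 2(n-1)/n of the data regardless of the number of workers. A single-process simulation of that schedule (worker "sends" are just array copies here, and the toy data is invented):

```python
import numpy as np

def ring_allreduce(per_worker_data):
    """Simulated ring all-reduce: scatter-reduce, then all-gather.
    per_worker_data: one equal-length 1-D array per worker; returns per-worker
    results, each equal to the element-wise sum across all workers."""
    n = len(per_worker_data)
    # Each worker splits its data into n chunks; chunk c travels around the ring.
    chunks = [list(np.array_split(d.astype(float), n)) for d in per_worker_data]

    # Phase 1: scatter-reduce. After n-1 steps, worker i holds the complete
    # sum of chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                         # chunk worker i sends now
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: all-gather. The completed chunks circulate once more so that
    # every worker ends up with every fully reduced chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n                     # completed chunk to pass on
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(worker_chunks) for worker_chunks in chunks]

deltas = [np.arange(8, dtype=float) + w for w in range(4)]   # toy per-worker data
out = ring_allreduce(deltas)
assert all(np.allclose(o, sum(deltas)) for o in out)
```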
  43. Cloud TPU Pods
  44. The world scale: 180 TFLOPS, TPU v2
  45. The world scale: 100 PetaFLOPS, TPU v3
  46. TPU v3 architecture (H/W) https://cloud.google.com/tpu/docs/system-architecture
  47. TPU v3 architecture (S/W) https://cloud.google.com/tpu/docs/system-architecture
  48-50. TPU v3 architecture https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning?hl=ko
  51. TPU Pods overview: 2-D AllReduce https://arxiv.org/pdf/1811.06992.pdf
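For completeness, a hedged sketch of how a TensorFlow 2.x program targets a Cloud TPU or a Pod slice through tf.distribute, where the cross-replica gradient aggregation runs as all-reduce over the TPU interconnect described above. "my-tpu-name" and the toy model are placeholders, and the exact symbol locations (experimental vs. stable) depend on the TensorFlow version.

```python
import tensorflow as tf

# "my-tpu-name" is a placeholder for an actual Cloud TPU / TPU Pod slice name.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-name")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# TPUStrategy replicates the model across TPU cores; gradient aggregation runs
# as all-reduce over the TPU interconnect.
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# model.fit(dataset) then trains synchronously across all cores in the slice.
```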
  52. Summary ● The TPU's interconnect design gives high-speed communication between units ● TPU v3 and TPU Pods basically follow AllReduce (1-D ring AllReduce, 2-D AllReduce) ● TPU Pods are not generally available yet (alpha as of '19 06 30)
  53. tan 𝑞! (thank you)
