ML at Facebook: An Infrastructure View
Yangqing Jia
Director, Facebook AI Infra
(* This is not Facebook AI Infra)
The Machine Learning Moore’s Law?
[Chart: number of citations per year, 2001 to 2017]
Machine Learning Execution Flow
Data Features Training Evaluation Inference
Offline Online
Machine Learning Execution Flow
Data Features Training Eval Inference Predictions Model
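The stages named on these two slides can be sketched as a toy pipeline. This is only an illustration of the offline/online split, not Facebook's implementation: the data, the threshold "model", and every function name below are hypothetical.

```python
# Hypothetical sketch of the stages named on the slide; toy data, toy model.
from statistics import mean


def load_data():
    # Data: (raw value, label) pairs, invented for illustration.
    return [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]


def extract_features(raw):
    # Features: here just a rescaling of the raw value.
    return raw / 10.0


def train(examples):
    # Training: learn a threshold halfway between the two class means.
    neg = [x for x, y in examples if y == 0]
    pos = [x for x, y in examples if y == 1]
    return {"threshold": (mean(neg) + mean(pos)) / 2}


def predict(model, features):
    # Inference: apply the trained model to one feature value.
    return int(features > model["threshold"])


def evaluate(model, examples):
    # Evaluation: fraction of examples classified correctly.
    hits = sum(predict(model, x) == y for x, y in examples)
    return hits / len(examples)


# Offline: data -> features -> training -> evaluation, producing a model.
examples = [(extract_features(raw), label) for raw, label in load_data()]
model = train(examples)
accuracy = evaluate(model, examples)

# Online: a new request reuses the same feature code, then runs inference.
prediction = predict(model, extract_features(7.5))
```

The point of the sketch is that the online path shares the feature code and the trained model with the offline path, which is why the slides treat the whole flow as one infrastructure problem.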
It’s an infrastructure challenge
Data, Offline Training, Online Inference
Storage Challenges!
Network Challenges!
Compute Challenges!
Let’s Answer Some Pressing Questions
• Does Facebook leverage machine learning?
• Does Facebook design hardware?
• Does Facebook design hardware for machine learning?
• What platforms and frameworks exist; can the community use them?
• What assumptions break when supporting 2B people?
Does Facebook Use Machine Learning?
News Feed Ads Search
Language Translation
Sigma Facer Lumos
Speech Recognition
Content Understanding
Classification & Ranking
Services
What ML Models Do We Leverage?
SVM (Support Vector Machines): Facer
GBDT (Gradient-Boosted Decision Trees): Sigma
MLP (Multi-Layer Perceptron): News Feed, Ads, Search, Sigma
CNN (Convolutional Neural Nets): Facer, Lumos
RNN (Recurrent Neural Nets): Language Translation, Content Understanding, Speech Rec
How Often Do We Train Models?
minutes hours days months
How Long Does Training Take?
seconds minutes hours days
How Much Compute Does Inference Consume?
100x 10x 1x
Does Facebook Design Hardware?
• Yes! Since 2010! All designs released through Open Compute!
• Facebook Server Design Philosophy
• Identify a small number of major services with unique resource requirements
• Design servers for those major services
One major server design versus A, B, C, D
Customized, dedicated hardware
Global shared pool
Does Facebook Design Hardware?
Yosemite/Twin Lakes: For the web tier and other “stateless services”
Open Compute “Sleds” are 2U x 3 Across in an Open Compute Rack
Server Card
Chassis
4-way
Shared NIC
SSD Boot
Rack
Does Facebook Design Hardware?
Tioga Pass: For compute or memory-intensive workloads
Bryce Canyon: For storage-heavy workloads
Does Facebook Design Hardware for AI/ML?
• HP SL270s (2013): learning serviceability, thermal, perf, reliability, cluster mgmt.
• Big Sur (M40) -> Big Basin (P100) -> Big Basin Volta (V100)
Big Sur
Integrated Compute
8 Nvidia M40 GPUs
Big Basin
JBOG Design (CPU headnode)
8 Nvidia P100 / V100 GPUs
Putting it Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Facebook AI Frameworks
Infra Efficiency for Production
• Stability
• Scale & Speed
• Data Integration
• Relatively Fixed
Developer Efficiency for Research
• Flexible
• Fast Iteration
• Highly Debuggable
• Less Robust
OpenGL ES
NNPACK
Metal™/MPSCNN
Qualcomm Snapdragon NPE
CUDA/cuDNN
Deep Learning Frameworks
Framework backends
Vendor and numeric libraries
Apple CoreML, Nvidia TensorRT, Intel/Nervana ngraph, Qualcomm SNPE, …
O(n^2) pairs
Shared model and operator representation
Open Neural Network Exchange
Framework backends
Vendor and numeric libraries
Apple CoreML, Nvidia TensorRT, Intel/Nervana ngraph, Qualcomm SNPE, …
From O(n^2) to O(n) pairs
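The counting argument behind this slide can be made concrete with a small calculation. The sketch below assumes that each framework-to-backend pairing needs its own converter when there is no shared format, and that with a shared representation each framework needs one exporter and each backend one importer. The function names are hypothetical.

```python
def direct_converters(num_frameworks, num_backends):
    # Without a shared format: one converter per (framework, backend) pair.
    return num_frameworks * num_backends


def shared_format_converters(num_frameworks, num_backends):
    # With a shared format: one exporter per framework, one importer per backend.
    return num_frameworks + num_backends


# With n frameworks and n backends, the counts are n * n versus 2 * n.
for n in (2, 4, 8):
    print(n, direct_converters(n, n), shared_format_converters(n, n))
```

At n = 8, that is 64 pairwise converters against 16 exporters and importers, which is the quadratic-to-linear reduction the slide refers to.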
Putting it Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Data APIs Caffe2, PyTorch, ONNX C2 Predictor
Facebook AI Ecosystem
Frameworks: Core ML Software
Caffe2 / PyTorch / ONNX
Platforms: Workflow Management, Deployment
FB Learner
Infrastructure: Servers, Storage, Network Strategy
Open Compute Project
FB Learner Platform
FB Learner Feature Store
FB Learner Flow
FB Learner Predictor
• AI Workflow
• Model Management and Deployment
FBLearner in ML
Data Features Training Eval Inference Predictions Model
FBLearner Flow: CPU+GPU
FBLearner Predictor: CPU
FBLearner Feature Store: CPU
Putting it All Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Feature Store FBLearner Flow FBL Predictor
Data APIs Caffe2, PyTorch, ONNX C2 Predictor
What changes when you scale to over 2 Billion People
Scaling Challenges / Opportunities
Lots of Data Lots of Compute
Scaling Opportunity: Free Compute!
Santa Clara, California
Ashburn, Virginia
Prineville, Oregon
Forest City, North Carolina
Lulea, Sweden
Altoona, Iowa
Fort Worth, Texas
Clonee, Ireland
Los Lunas, New Mexico
Odense, Denmark
New Albany, Ohio
Papillion, Nebraska
Key Takeaways
Facebook AI
Lots of Data
Wide variety of models
Full stack challenges
Global scale
Kim Hazelwood Sarah Bird David Brooks Soumith Chintala Utku Diril Dmytro Dzhulgakov
Mohamed Fawzy Bill Jia Yangqing Jia Aditya Kalro James Law
Kevin Lee Jason Lu Pieter Noordhuis Misha Smelyanskiy Liang Xiong Xiaodong Wang
Facebook ML Infrastructure - 2018 slides
