AI Platform at Scale
Designing a scalable platform for AI
Henry Saputra
Motivation for an AI Platform
● AI == ML in the context of this presentation
● Developing AI applications can easily incur technical debt
● Traditional software development assumes predictability over the software's lifetime
● Bring your own software and hardware
● Explainability and correctness are hard to quantify
● Data access and management are different from traditional software
● Compute and scale of workloads are different from traditional software
AI and ML code is only a small fraction ...
Reference: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
AI Platforms in the Wild
● eBay Krylov
● Facebook FBLearner Flow
● Uber Michelangelo
● Google TFX
● Salesforce Einstein Platform
● Amazon SageMaker
Problems to Solve?
● Reduce plumbing work by data scientists
● Dependency on data pipeline, compute infrastructure, and networking
● Large variance of quality, metrics, and measurement of success
● Research vs Applied
● Online vs Offline
● Collaboration requires a different approach - e.g. machine learning models are not directly reusable
● Undeclared consumers
Goals of an AI Platform
● Provide a system where data scientists can build reliable, secure, easy-to-use, reproducible, and
automated AI model training and scoring/inference at scale
● Take a platform approach: a unified infrastructure to run AI and ML jobs, so they no
longer run on a data scientist's own computer
● Standardize on tools and pipelines to simplify AI and ML jobs, from training to model deployment
● AI and ML algorithms should be implemented once and be shareable
● Enable parallelism and distributed jobs to accelerate and scale
● Support exploration of metrics about past experiments
● Secure and easy to use
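To make the goal of exploring metrics from past experiments concrete, here is a minimal sketch. It assumes a hypothetical in-memory `ExperimentLog`; the class names, run names, parameters, and metric values are all made up for illustration and are not part of Krylov or any other platform named in this deck. A real platform would persist runs and expose them through its UI and APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One training run: what was run, with which settings, and how it scored."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory store of past runs (illustration only)."""

    def __init__(self):
        self._runs = []

    def record(self, name, params, metrics):
        # Copy the dicts so later changes by the caller do not alter history.
        run = Experiment(name, dict(params), dict(metrics))
        self._runs.append(run)
        return run

    def best(self, metric, higher_is_better=True):
        """Return the recorded run with the best value for `metric`."""
        scored = [r for r in self._runs if metric in r.metrics]
        if not scored:
            raise ValueError(f"no run recorded metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(scored, key=lambda r: r.metrics[metric])


# Made-up runs, only to show the calls.
log = ExperimentLog()
log.record("baseline", {"lr": 0.1}, {"auc": 0.81})
log.record("wider-net", {"lr": 0.01}, {"auc": 0.86})
log.record("no-features", {"lr": 0.1}, {"auc": 0.74})
best_run = log.best("auc")
```

Storing the parameters next to the metrics is what lets a past result be reproduced later rather than only looked at.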
Common Architecture and Components
● Access to Data - Data analysis, Feature store, Data Lake, Data Format
● ML Workflow or Pipeline - DAG, Orchestration vs Choreography
● Domain Specific Language (DSL)
● Computing Platform and Infrastructure - Cloud vs In-house
○ “Tall” instances, GPU accelerated
○ Distributed computing framework
○ Fast network for data ingest
○ Data locality to compute resources
○ Containers and Microservices
● Model and experiment lifecycle management
● Model deployment and serving flow - batch and real-time
● Metrics and monitoring - dashboards, reports, logs
● APIs - UI, CLI, program bindings/SDK, RESTful, RPC
● Supported ML libraries
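The "ML Workflow or Pipeline - DAG" component above can be sketched with Python's standard-library `graphlib`. This is a sketch under stated assumptions, not how any of the platforms in this deck implement pipelines: the step names are hypothetical and the steps are no-op placeholders. A single runner that decides the order and calls each step is one simple form of orchestration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical step names).
pipeline = {
    "load_data": set(),
    "validate_data": {"load_data"},
    "build_features": {"load_data"},
    "train_model": {"build_features", "validate_data"},
    "evaluate": {"train_model"},
    "deploy": {"evaluate"},
}


def run_pipeline(dag, steps):
    """Run every step once, only after all of its dependencies have run."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        steps[name]()  # a real step would read inputs and write outputs
        executed.append(name)
    return executed


# Placeholder callables standing in for real work.
steps = {name: (lambda: None) for name in pipeline}
order = run_pipeline(pipeline, steps)
```

`validate_data` and `build_features` do not depend on each other, so a scheduler is free to run them in either order or in parallel; only the dependency edges are guaranteed.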
AI Platform at eBay - Krylov Project
Challenges of Deploying AI Platform at Scale
● Defining the “right” architecture
● Open source - build vs buy? AI platforms are still at an early stage
● Extensibility and scale - horizontal vs vertical
● Secure environment for data access and compute
● Standards and common tooling for ML development - reduce complexity
● Sharing and re-use of algorithms and models
● Reduce tech debt in a fast-moving field
● Tech refresh of hardware - Cloud vs In-house
Future Looking ...
● AutoML
● Online training/learning and edge device updates
● Distributed Deep Learning for training - model vs data parallelism
● Graph as machine learning
● Improved computing infrastructure hardware - GPU, TPU
● Faster networks
● Next generation of storage for ML use cases
● Better support for AI applications - update and retrain models from devices
● Support for newer AI computing paradigms at scale - generative models, reinforcement learning
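As an illustration of the data-parallel side of "model vs data parallelism" above: each worker holds a full copy of the model, computes a gradient on its own shard of the batch, and the gradients are then combined. The sketch below is a toy under stated assumptions: a one-parameter model `y = w * x`, made-up data points, and workers simulated sequentially in a single process rather than on separate devices. Model parallelism, which splits the model itself across devices, is not shown.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_workers):
    """Split the batch across workers, then combine their gradients."""
    shards = [batch[i::num_workers] for i in range(num_workers)]
    shards = [s for s in shards if s]  # drop empty shards
    # Weight each shard's gradient by its size so the combined value
    # equals the gradient over the whole batch.
    return sum(gradient(w, s) * len(s) for s in shards) / len(batch)


# Made-up (x, y) pairs, only to show the calls.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
full = gradient(0.5, data)
parallel = data_parallel_gradient(0.5, data, num_workers=2)
```

Up to floating-point rounding, `parallel` matches `full`, which is why data parallelism can spread a batch over many workers without changing the update that gets applied.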
Who do we need in an AI Platform Team?
● Engineers and scientists
● Product Management
● Runtime support and infrastructure
eBay Krylov High Level Architecture
eBay Krylov Cluster Deployment
eBay Krylov Cluster Infrastructure
eBay Krylov Dashboard
Thank You
