Designing Scalable AI Platforms


This presentation discusses how to design scalable platforms for artificial intelligence (AI) and machine learning (ML). It outlines several challenges in developing AI applications, including technical debt, unpredictability, and data and compute needs that differ from those of traditional software. It then reviews existing commercial AI platforms and the common components of an AI platform: data access, ML workflows, computing infrastructure, model management, and APIs. The rest of the document uses eBay's Krylov project as an example AI platform and covers its architecture, the challenges of deploying platforms at scale, and the skill sets needed on a platform team.


AI Platform at Scale
Designing a scalable platform for AI
Henry Saputra

Motivation for an AI Platform
● AI == ML in the context of this presentation
● Developing AI applications can easily incur technical debt
● Traditional software development assumes predictability over the software's lifetime
● Bring your own software and hardware
● Explainability and correctness are hard to quantify
● Data access and management differ from traditional software
● Compute needs and scale of workloads differ from traditional software

AI and ML code is only a small fraction ... Reference: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

AI Platforms in the Wild
● eBay Krylov
● Facebook FBLearner Flow
● Uber Michelangelo
● Google TFX
● Salesforce Einstein Platform
● Amazon SageMaker

Problems to Solve?
● Reduce plumbing work by data scientists
● Dependency on data pipeline, compute infrastructure, and networking
● Large variance in quality, metrics, and measures of success
● Research vs Applied
● Online vs Offline
● Collaboration requires a different approach - e.g., machine learning models are not directly reusable
● Undeclared consumers

Goals of an AI Platform
● Provide a system where data scientists can build reliable, secure, easy, reproducible, and automated AI model training and scoring/inference at scale
● Take a platform approach with unified infrastructure for running AI and ML jobs, so that they no longer run on data scientists' own computers
● Standardize tools and pipelines to simplify AI and ML jobs, from training to model deployment
● AI and ML algorithms should be implemented once and be shareable
● Enable parallelism and distributed jobs to accelerate and scale
● Support exploration of metrics from past experiments
● Secure and easy to use
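
As a rough illustration of the "exploration of metrics about past experiments" goal, here is a minimal in-memory sketch. The names ExperimentLog, Run, record, and best are hypothetical and are not taken from Krylov or any other platform listed here; a real platform would persist runs in a shared store.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One recorded training run (hypothetical structure)."""
    run_id: int
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Keeps runs in memory so past experiments can be compared."""

    def __init__(self):
        self._runs = []

    def record(self, params, metrics):
        # Copy the dicts so later mutation by the caller does not alter history.
        run = Run(len(self._runs) + 1, dict(params), dict(metrics))
        self._runs.append(run)
        return run

    def best(self, metric, higher_is_better=True):
        # Only runs that actually reported the metric are candidates.
        candidates = [r for r in self._runs if metric in r.metrics]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


log = ExperimentLog()
log.record({"lr": 0.1}, {"accuracy": 0.81})
log.record({"lr": 0.01}, {"accuracy": 0.88})
log.record({"lr": 0.001}, {"accuracy": 0.84})
best = log.best("accuracy")
print(best.run_id, best.params)
```

Storing parameters next to metrics is what makes a run reproducible later: the best run can be looked up and re-launched with the same settings.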

Common Architecture and Components
● Access to Data - Data analysis, Feature store, Data Lake, Data Format
● ML Workflow or Pipeline - DAG, Orchestration vs Choreography
● Domain Specific Language (DSL)
● Computing Platform and Infrastructure - Cloud vs In-house
○ “Tall” instances, GPU accelerated
○ Distributed computing framework
○ Fast network for data ingest
○ Data locality to compute resources
○ Containers and Microservices
● Model and experiment lifecycle management
● Model deployment and serving flow - batch and real-time
● Metrics and monitoring - dashboards, reports, logs
● APIs - UI, CLI, Program bindings/ SDK, RESTful, RPC
● Supported ML libraries
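
To make the "ML Workflow or Pipeline - DAG" component concrete, here is a small sketch of an orchestrated pipeline in which a central runner executes each step once its dependencies have finished. The step names and the run_pipeline helper are assumptions made for illustration; they do not describe how Krylov or any listed platform defines workflows. It uses graphlib from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps):
    """Run every step after its dependencies; return the order and the results."""
    # deps maps each step name to the list of steps it depends on.
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        # Outputs of upstream steps are passed in, in the order listed in deps.
        inputs = [results[d] for d in deps.get(name, ())]
        results[name] = steps[name](*inputs)
    return order, results


# Toy steps: the "model" is simply the mean of the features.
steps = {
    "load": lambda: [1.0, 2.0, 3.0, 4.0],
    "features": lambda rows: [x * 2 for x in rows],
    "train": lambda feats: sum(feats) / len(feats),
    "evaluate": lambda model, feats: max(abs(x - model) for x in feats),
}
deps = {
    "load": [],
    "features": ["load"],
    "train": ["features"],
    "evaluate": ["train", "features"],
}

order, results = run_pipeline(steps, deps)
print(order)
print(results["train"], results["evaluate"])
```

This is the orchestration style: one runner owns the graph and decides what runs next. In a choreography style, each step would instead react to events emitted by the others, with no central runner.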

Challenges of Deploying AI Platform at Scale
● Defining the “right” architecture
● Open source - build vs buy? Early stage for AI platforms
● Extensibility and scale - horizontal vs vertical
● Secure environment for data access and compute
● Standards and common tooling for ML development - reduce complexity
● Sharing and re-use of algorithms and models
● Reduce tech debt - fast moving
● Tech refresh of hardware - Cloud vs In-house

Future Looking ...
● AutoML
● Online training/ learning and edge devices update
● Distributed Deep Learning for training - model vs data parallelism
● Graph as machine learning
● Improvements in computing infrastructure hardware - GPU, TPU
● Faster network
● Next generation of storage for ML use cases
● Better support for AI applications - update and retrain models from devices
● Support for newer AI computing paradigms at scale - generative models, reinforcement learning
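
The "model vs data parallelism" bullet can be illustrated with a toy data-parallel training step. This is a sketch under simplifying assumptions: a one-parameter linear model, equal-sized shards, and a plain average standing in for the all-reduce that a real distributed framework would perform across workers. The function names are hypothetical.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    """One step: every worker holds the full model and sees only its own shard."""
    grads = [gradient(w, s) for s in shards]  # would run on separate workers
    avg = sum(grads) / len(grads)             # stands in for an all-reduce
    return w - lr * avg


# Data generated by y = 2x, split into two equal shards.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 6))
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so the result matches single-machine training. Model parallelism would instead split the model's parameters across workers, which is needed when the model itself does not fit on one device.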

Who do we need on an AI Platform Team?
● Engineers and scientists
● Product Management
● Runtime support and infrastructure


8. AI Platform at eBay - Krylov Project


12. eBay Krylov High Level Architecture

13. eBay Krylov Cluster Deployment

14. eBay Krylov Cluster Infrastructure

15. eBay Krylov Dashboard

16. Thank You