The document discusses designing scalable platforms for artificial intelligence (AI) and machine learning (ML). It outlines several challenges in developing AI applications, including technical debt, unpredictability, and data and compute needs that differ from those of traditional software. It then reviews existing commercial AI platforms and the common components of AI platforms, including data access, ML workflows, computing infrastructure, model management, and APIs. The rest of the document focuses on eBay's Krylov project as an example AI platform, outlining its architecture, the challenges of deploying platforms at scale, and the skill sets needed on the platform team.
1. AI Platform at Scale
Designing a scalable platform for AI
Henry Saputra
2. Motivation for an AI Platform
● AI == ML in the context of this presentation
● Developing AI applications can easily incur technical debt
● Traditional software development assumes predictable behavior over the software's lifetime
● Bring your own software and hardware
● Explainability and correctness are hard to quantify
● Data access and management is different from traditional software
● Compute and scale of workloads is different from traditional software
3. AI and ML code is only a small fraction ...
Reference: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
4. AI Platform in the wild
● eBay Krylov
● Facebook FBLearner Flow
● Uber Michelangelo
● Google TFX
● Salesforce Einstein Platform
● Amazon SageMaker
5. Problems to Solve?
● Reduce plumbing work by data scientists
● Dependency on data pipeline, compute infrastructure, and networking
● Large variance of quality, metrics, and measurement of success
● Research vs Applied
● Online vs Offline
● Collaboration requires a different approach - e.g., machine learning models are not directly reusable
● Undeclared consumers
6. Goals of an AI Platform
● Provide a system where data scientists can build reliable, secure, easy-to-use, reproducible, and automated AI model training and scoring/inference at scale
● Take a platform approach: a unified infrastructure to run AI and ML jobs, which no longer run on data scientists' own computers
● Standardize on tools and pipelines to simplify AI and ML jobs, from training to model deployment
● AI and ML algorithms should be implemented once and shareable
● Enable parallelism and distributed jobs to accelerate and scale
● Support exploration of metrics about past experiments
● Secure and Easy to use
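The goal of exploring metrics from past experiments can be illustrated with a minimal in-memory sketch. Everything here is hypothetical and for illustration only: the names `Run`, `ExperimentLog`, `record`, and `best` are not taken from Krylov or any other platform, and a real system would persist runs in a shared store rather than in a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: its hyperparameters and the metrics it produced."""
    run_id: int
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory record of past runs (not a real platform API)."""

    def __init__(self):
        self._runs = []

    def record(self, params, metrics):
        # Copy the inputs so later changes by the caller do not alter history.
        run = Run(len(self._runs) + 1, dict(params), dict(metrics))
        self._runs.append(run)
        return run

    def best(self, metric, higher_is_better=True):
        # Only consider runs that actually reported the requested metric.
        candidates = [r for r in self._runs if metric in r.metrics]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


log = ExperimentLog()
log.record({"lr": 0.1}, {"accuracy": 0.81})
log.record({"lr": 0.01}, {"accuracy": 0.88})
best_run = log.best("accuracy")
```

Keeping parameters and metrics together per run is what lets a later query recover which configuration produced a given result.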
7. Common Architecture and Components
● Access to Data - Data analysis, Feature store, Data Lake, Data Format
● ML Workflow or Pipeline - DAG, Orchestration vs Choreography
● Domain Specific Language (DSL)
● Computing Platform and Infrastructure - Cloud vs In-house
○ “Tall” instances, GPU accelerated
○ Distributed computing framework
○ Fast network for data ingest
○ Data locality to compute resources
○ Containers and Microservices
● Model and experiment lifecycle management
● Model deployment and serving flow - Batch and Realtime
● Metrics and monitoring - dashboards, reports, logs
● APIs - UI, CLI, Program bindings/ SDK, RESTful, RPC
● Supported ML libraries
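The "ML Workflow or Pipeline - DAG" component above can be sketched as a set of steps plus a dependency map, executed in topological order. This is a minimal sketch, assuming Python 3.9+ for the standard-library `graphlib` module; the step names (`load_data`, `build_features`, `train`, `evaluate`) and the toy "model" are hypothetical and only show the control flow, not how any particular platform implements it.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps):
    """Run each step after all of its dependencies.

    steps: maps a step name to a function taking {dependency_name: result}.
    deps:  maps a step name to the set of step names it depends on.
    Returns the execution order and the result of every step.
    """
    results = {}
    order = []
    # static_order() yields every node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: results[d] for d in deps.get(name, ())}
        results[name] = steps[name](inputs)
        order.append(name)
    return order, results


# Hypothetical steps: the "model" is just the mean of the features.
steps = {
    "load_data": lambda inputs: [1.0, 2.0, 3.0, 4.0],
    "build_features": lambda inputs: [x * 2 for x in inputs["load_data"]],
    "train": lambda inputs: sum(inputs["build_features"]) / len(inputs["build_features"]),
    "evaluate": lambda inputs: abs(inputs["train"] - 5.0),
}
deps = {
    "build_features": {"load_data"},
    "train": {"build_features"},
    "evaluate": {"train", "build_features"},
}

order, results = run_pipeline(steps, deps)
```

Here a single driver decides what runs next, which corresponds to the orchestration style; in a choreography style each step would instead react to events emitted by the others.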
9. Challenges of Deploying AI Platform at Scale
● Defining the “right” architecture
● Open source - build vs buy? Early stage for AI Platform
● Extensibility and scale - horizontal vs vertical
● Secure environment for data access and compute
● Standards and common tooling for ML development - reduce complexity
● Sharing and re-use of algorithms and models
● Reduce tech debt in a fast-moving field
● Tech refresh of hardware - Cloud vs In-house
10. Future Looking ...
● AutoML
● Online training/learning and edge device updates
● Distributed Deep Learning for training - model vs data parallelism
● Graph as machine learning
● Improvements in computing infrastructure hardware - GPU, TPU
● Faster network
● Next generation of storage for ML use cases
● Better support for AI applications - update and retrain models from devices
● Support for newer AI computing paradigms at scale - generative models, reinforcement learning
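The "model vs data parallelism" item above can be made concrete for the data-parallel case: each worker computes a gradient on its own shard of the batch, and the shard gradients are averaged. The sketch below is a single-process simulation under stated assumptions: a 1-D linear model y = w * x with mean squared error, equal-size shards, and hypothetical function names. It illustrates the arithmetic only, not a distributed implementation.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D linear model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Simulate data parallelism: one gradient per shard, then average.

    Assumes len(xs) is divisible by num_workers so all shards are equal-size;
    with unequal shards the average would need to be weighted by shard size.
    """
    shard = len(xs) // num_workers
    grads = []
    for k in range(num_workers):
        lo, hi = k * shard, (k + 1) * shard
        # Each "worker" sees only its own slice of the batch.
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # Averaging plays the role of the all-reduce step in a real cluster.
    return sum(grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
sharded = data_parallel_grad(0.5, xs, ys, num_workers=2)
```

With equal-size shards the averaged gradient matches the full-batch gradient, which is why data parallelism can scale the batch across workers without changing the update. Model parallelism instead splits the model's parameters across workers and is not shown here.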
11. Who do we need in AI Platform Team?
● Engineers and scientists
● Product Management
● Runtime support and infrastructure