SlideShare a Scribd company logo
1 of 49
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Machine Learning
using Kubeflow
Arun Gupta, @arungupta
Principal Open Source Technologist
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://dilbert.com/strip/2013-02-02
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Machine Learning 101
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Broadest and deepest set of capabilities
T H E AW S M L S TAC K
Broadest and deepest set of capabilities
FRAMEWORKS INTERFACES INFRASTRUCTURE
ML Frameworks + Infrastructure
Deep Learning
AMIs & Containers
GPUs &
CPUs
Elastic
Inference
Inferentia FPGA
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Broadest and deepest set of capabilities
T H E AW S M L S TAC K
Amazon EKS
Auto ScalingOptimized GPU AMI Deep Learning Container FSx CSI Plugin
Containerized ML
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Broadest and deepest set of capabilities
T H E AW S M L S TAC K
ML Services
Amazon SageMaker
Ground Truth
data labelling
ML
Marketplace
SageMaker
Neo
Built-in
algorithms
SageMaker
Notebooks
SageMaker
Experiments
Model
tuning
SageMaker
Autopilot
Model
hosting
SageMaker
Model Monitor
SageMakerStudioIDE
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AI Services Broadest and deepest set of capabilities
T H E AW S M L S TAC K
VISION SPEECH TEXT SEARCH CHATBOTS PERSONALIZATION FORECASTING FRAUD DEVELOPMENT CONTACT CENTERS
Amazon
Rekognition
+Custom Labels
Amazon
Polly
Amazon
Transcribe
+Medical
Amazon
Comprehend
+Medical
Amazon
Translate
Amazon
Lex
Amazon
Personalize
Amazon
Forecast
Amazon
Fraud Detector
Amazon
CodeGuru
Amazon
Textract
Amazon
Kendra
Amazon
Connect
with Contact Lens
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AI Services Broadest and deepest set of capabilities
T H E AW S M L S TAC K
VISION SPEECH TEXT SEARCH CHATBOTS PERSONALIZATION FORECASTING FRAUD DEVELOPMENT CONTACT CENTERS
Amazon
Rekognition
+Custom Labels
Amazon
Polly
Amazon
Transcribe
+Medical
Amazon
Comprehend
+Medical
Amazon
Translate
Amazon
Lex
Amazon
Personalize
Amazon
Forecast
Amazon
Fraud Detector
Amazon
CodeGuru
Amazon
Textract
Amazon
Kendra
Amazon
Connect
with Contact Lens
FRAMEWORKS INTERFACES INFRASTRUCTURE
ML Frameworks + Infrastructure
Deep Learning
AMIs & Containers
GPUs &
CPUs
Elastic
Inference
Inferentia FPGA
ML Services
Amazon SageMaker
Ground Truth
data labelling
ML
Marketplace
SageMaker
Neo
Built-in
algorithms
SageMaker
Notebooks
SageMaker
Experiments
Model
tuning
SageMaker
Autopilot
Model
hosting
SageMaker
Model Monitor
SageMakerStudioIDE
Amazon EKS
Auto ScalingOptimized GPU AMI Deep Learning Container FSx CSI Plugin
Containerized ML
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
M A C H I N E L E A R N I N G
S T O R A G E
Amazon Redshift
+ Redshift Spectrum
Amazon
QuickSight
Amazon EMR
Hadoop, Spark, Presto,
Pig, Hive…19 total
Amazon
Athena
Amazon
Kinesis
Amazon
Elasticsearch
Service
AWS Glue
A N A L Y T I C S
Amazon S3
Standard-IA
Amazon S3
Standard
Amazon S3
One Zone-IA
Amazon
Glacier
Amazon S3
Intelligent-
Tiering
N E W
Amazon
EBS
Amazon S3
Glacier Deep
Archive
N E W
Storage and Analytics for Machine Learning
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Why Machine Learning on Kubernetes?
Composability Portability Scalability
O N - P R E M I S E S C L O U D
http://www.shutterstock.com/gallery-635827p1.html
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Machine Learning on K8s: Without KubeFlow
@aronchik
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Machine Learning on K8s: With KubeFlow
@aronchik
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Kubeflow
Containerized machine learning platform
Makes it easy to develop, deploy, and manage portable,
scalable end-to-end ML workflows on k8s
“Toolkit” – loosely coupled tools and blueprints for ML
End to End ML workflow – ML code is only a small component
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What’s in
KubeFlow?
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon EKS: run Kubernetes in cloud
Managed Kubernetes control plane, attach data plane
Native upstream Kubernetes experience
Platform for enterprises to run production-grade workloads
Integrates with additional AWS services
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Getting started with Amazon EKS
eksctl CLI—create Amazon EKS clusters (eksctl.io)
Creates all resources needed for the cluster
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon EKS-Optimized GPU AMI
Built on top of the standard Amazon EKS-Optimized AMI
Includes packages to support Amazon P2/P3/G3/G4
instances
• NVIDIA drivers
• nvidia-docker2 package
• nvidia-container-runtime (as default runtime)
GPU Clock Optimization
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cluster Autoscaler Improvements
Add GPU support
autoscaler#1584 GPU autoscaling supported for AWS
autoscaler#1589 GPU scale down performance optimization
Prevent CA from removing a node with ML training job
running
Annotate job ”cluster-autoscaler.kubernetes.io/safe-to-evict”: “false”
Recommended to create GPU node group per AZ
Improve network communication performance
Prevent ASG rebalancing
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Deep Learning Containers
KEY FEATURES
Customizable
container images
Support for TensorFlow,
Apache MXNet
Single and multi-node
training and inference
Pre-packaged Docker
container images
fully configured
and validated
Best performance
and scalability
without tuning
Works with Amazon EKS,
Amazon ECS,
and Amazon EC2
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Kubeflow on Desktop
MiniKF: Local Kubeflow deployment using VirtualBox and
Vagrant
• Minikube -> Kubernetes
• MiniKF -> Kubeflow (includes minikube)
Runs on macOS, Linux, and Windows
Does not require k8s-specific knowledge
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Kubeflow on Cloud
Major cloud providers supported
Choices on Amazon Web Services
• Self-managed k8s on EC2: Kops, CloudFormation, Terraform
• Amazon EKS
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Getting Started with Kubeflow
on Amazon EKS
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jupyter Notebook
Create and share documents that contain live code,
equations, visualizations, and narrative text
• UI to manage notebooks
• Integrate with RBAC/IAM
• Ingress / Service Mesh
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jupyter Notebook
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fairing
Python SDK to build, train and deploy ML models
• Easily package ML
training jobs
• Train ML models from
notebook to k8s
• Streamline the model
development process
Setup KubeflowFairing for training and prediction
https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/02_Fairing/02_06_fairing_e2e.ipynb
Train an XGBoost model remotely on Kubeflow
Deploy the trained model to Kubeflowfor prediction
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Katib – Hyperparameter Tuning
Hyperparameter are parameters external to the model to control the
training, e.g. learning rate, batch size, epochs
Tuning finds a set of hyperparameters that optimizes an objective
function, e.g. Find the optimal batch size and learning rate to maximize
prediction accuracy
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Hyperparameter Tuning is Hard
More hyperparameters -> exponential space growth
Tuning by hands is inefficient and error-prone
Need to track metrics across multiple jobs
Managing resources and infrastructure for lot of jobs is hard
Variety of frameworks and algorithms to support
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Katib – Hyperparameter Tuning
trialName Validation-accuracy accuracy --lr --num-layers --optimizer
random-experiment-rfwwbnsd 0.974920 0.984844 0.013831565266960293 4 sgd
random-experiment-vxgwlgqq 0.113854 0.116646 0.024225789898529138 4 ftrl
random-experiment-wclrwlcq 0.979697 0.998437 0.021916171239020756 4 sgd
random-experiment-7lsc4pwb 0.113854 0.115312 0.024163810384272653 5 ftrl
random-experiment-86vv9vgv 0.963475 0.971562 0.02943228249244735 3 adam
random-experiment-jh884cxz 0.981091 0.999219 0.022372025623908262 2 sgd
random-experiment-sgtwhrgz 0.980693 0.997969 0.016641686851083654 4 sgd
random-experiment-c6vvz6dv 0.980792 0.998906 0.0264125850165842 3 sgd
random-experiment-vqs2xmfj 0.113854 0.105313 0.026629394628228185 4 ftrl
random-experiment-bv8lsh2m 0.980195 0.999375 0.021769570793012488 2 sgd
random-experiment-7vbnqc7z 0.113854 0.102188 0.025079750575740783 4 ftrl
random-experiment-kwj9drmg 0.979498 0.995469 0.014985919312945063 4 sgd
Hyperparameters
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Trial template
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
KFServing: Model serving and management
Provides a Kubernetes CRD for serving ML models on arbitrary frameworks.
Encapsulates the complexity of autoscaling, networking and server configuration to bring features
like scale to zero, transformations, and canary rollouts to your deployments
Enables a simple, pluggable, and complete story for your production ML inference server by providing
prediction, pre-processing, post-processing and explainability.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
KFServing Custom Resource
S3 secret
attached to
Service
Account
Trained
model
https://github.com/kubeflow/kfserving/blob/master/docs/samples/s3/tensorflow_s3.yaml
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pluggable Interface
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
spec:
default:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris"
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "InferenceService"
metadata:
name: "flowers-sample"
spec:
default:
tensorflow:
storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "KFService"
metadata:
name: "pytorch-cifar10"
spec:
default:
pytorch:
storageUri: "gs://kfserving-samples/models/pytorch/cifar10"
modelClassName: "Net"
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
KFServing Interface – Scikit Learn
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "KFService"
metadata:
name: "sklearn-iris"
spec:
default:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris"
serviceAccount: inferencing-robot
minReplicas: 3
maxReplicas: 10
resources:
requests:
cpu: 2
gpu: 1
memory: 10Gi
canaryTrafficPercent: 25
canary:
sklearn:
storageUri: "gs://kfserving-samples/models/sklearn/iris-v2"
serviceAccount: inferencing-robot
minReplicas: 3
maxReplicas: 10
resources:
requests:
cpu: 2
gpu: 1
memory: 10Gi
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Distributed Training
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Best Practices for Optimizing Distributed Deep
Learning Performance on Amazon EKS
https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pipelines – Machine Learning Job Orchestrator
Compose, deploy, and manage
end-to-end ML workflows
End-to-end orchestration
Easy, rapid, and reliable
experimentation
Easy re-use
Built using Pipelines SDK
kfp.compiler,
kfp.components, kfp.Client
Uses Argo under the hood to
orchestrate resources
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Creating Kubeflow Pipeline Components
@dsl.pipeline(
name='Sample Trainer',
description=’’
)
def sample_train_pipeline(... ):
create_cluster_op = CreateClusterOp('create-cluster', ...)
analyze_op = AnalyzeOp('analyze', ...)
transform_op = TransformOp('transform', ...)
train_op = TrainerOp('train', ...)
predict_op = PredictOp('predict', ...)
confusion_matrix_op = ConfusionMatrixOp('confusion-matrix', ...)
roc_op = RocOp('roc', ...)
kfp.compiler.Compiler().compile(sample_train_pipeline , 'my-pipeline.zip’)
Pipeline component
Pipeline decorator
Pipeline function
Compile pipeline
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Creating Kubeflow Pipeline Components
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Metadata – Model Tracking
• Metadata schema to track artifacts related to
execution contexts
• Metadata API for storing and retrieving
metadata
• Client libraries for end-users to interact with
the Metadata service from their Notebooks or
Pipelines code.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Making Kubeflow a first class citizen on AWS
• Centralized and unified Kubernetes cluster logs in Amazon CloudWatch
• External traffic and authentication management with ALB Ingress Controller
• TLS and authentication with AWS Certificate Manager and AWS Cognito
• In-built FSx CSI driver w/S3 data repository integration to optimize training performance
• Elastic File System integration for common data sharing in JupyterHub
• Easier and customizable Kubeflow installation with kfctl and Kustomize support
• Kubeflow Pipeline integration with AWS Services – Amazon EMR, Athena, SageMaker
• Add ECR integration to Kubeflow Fairing
• Jupyter Notebook images with AWS CLI installed and ECR support
• Auto detect GPU worker nodes and install NVIDIA device plugin
https://www.kubeflow.org/docs/aws/
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Kubeflow Roadmap
Kubeflow v1.0 - Theme: Enterprise Readiness
E2E examples and increased docs on Kubeflow site
Upstream testing for Kubeflow on AWS
Support DIY K8S on AWS
IAM Roles for Service Accounts integration with Jupyter
notebooks
Support for managed contributors
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Feature store - Feast
• Discoverability and reuse of features
• Standardization of features
• Access to features for training and serving
• Consistency between training and serving
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fully managed
infrastructure in Amazon SageMake
Introducing Amazon SageMaker Operators for Kubernetes
Kubernetes customers can now train, tune, & deploy models in
Amazon SageMaker
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Under the hood – Amazon SageMaker and
Kubernetes
Kubectl apply
YAML
Key Features
• Amazon SageMaker
Operators for training,
tuning, inference
• Natively interact with
Amazon SageMaker jobs
using K8s tools (e.g., get
pods, describe)
• Stream and view logs from
Amazon SageMaker in K8s
• Helm Charts to assist with
setup and spec creation
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://github.com/aws/amazon-
sagemaker-operator-for-k8s
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
References
Workshop: eksworkshop.com/advanced/420_kubeflow/
Jupyter notebooks: github.com/aws-samples/eks-kubeflow-
workshop/
Optimizing Machine Learning performance:
aws.amazon.com/blogs/opensource/optimizing-distributed-
deep-learning-performance-amazon-eks/
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Please fill up the session rating in the feedback form
and collect your goodie at the end of the day
THANK YOU

More Related Content

What's hot

눈으로 보는 AWS 기반 인공지능 서비스 아키텍처 활용 데모::OliverKlein::AWS Summit Seoul 2018
눈으로 보는 AWS 기반 인공지능 서비스 아키텍처 활용 데모::OliverKlein::AWS Summit Seoul 2018눈으로 보는 AWS 기반 인공지능 서비스 아키텍처 활용 데모::OliverKlein::AWS Summit Seoul 2018
눈으로 보는 AWS 기반 인공지능 서비스 아키텍처 활용 데모::OliverKlein::AWS Summit Seoul 2018Amazon Web Services Korea
 
建構全球跨區域 x Active-Active架構的無伺服器化後台服務
建構全球跨區域  x Active-Active架構的無伺服器化後台服務建構全球跨區域  x Active-Active架構的無伺服器化後台服務
建構全球跨區域 x Active-Active架構的無伺服器化後台服務Amazon Web Services
 
Amplify로 Neptune 그래프 DB 기반 모바일 앱 만들기 :: 김현민 - AWS Community Day 2019
Amplify로 Neptune 그래프 DB 기반 모바일 앱 만들기 :: 김현민 - AWS Community Day 2019Amplify로 Neptune 그래프 DB 기반 모바일 앱 만들기 :: 김현민 - AWS Community Day 2019
Amplify로 Neptune 그래프 DB 기반 모바일 앱 만들기 :: 김현민 - AWS Community Day 2019AWSKRUG - AWS한국사용자모임
 
善用 GraphQL 與 AWS AppSync 讓您的 Progressive Web App (PWA) 加速進化 (Level 200)
善用  GraphQL 與 AWS AppSync 讓您的  Progressive Web App (PWA) 加速進化 (Level 200)善用  GraphQL 與 AWS AppSync 讓您的  Progressive Web App (PWA) 加速進化 (Level 200)
善用 GraphQL 與 AWS AppSync 讓您的 Progressive Web App (PWA) 加速進化 (Level 200)Amazon Web Services
 
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS Summit
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS SummitWork with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS Summit
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS SummitAmazon Web Services
 
Continuous Delivery Best Practices
Continuous Delivery Best PracticesContinuous Delivery Best Practices
Continuous Delivery Best PracticesAmazon Web Services
 
Making Hybrid Work for You: Getting into the Cloud Fast (GPSTEC308) - AWS re:...
Making Hybrid Work for You: Getting into the Cloud Fast (GPSTEC308) - AWS re:...Making Hybrid Work for You: Getting into the Cloud Fast (GPSTEC308) - AWS re:...
Making Hybrid Work for You: Getting into the Cloud Fast (GPSTEC308) - AWS re:...Amazon Web Services
 
Building Deep Learning Applications with TensorFlow and Amazon SageMaker
Building Deep Learning Applications with TensorFlow and Amazon SageMakerBuilding Deep Learning Applications with TensorFlow and Amazon SageMaker
Building Deep Learning Applications with TensorFlow and Amazon SageMakerAmazon Web Services
 
Amazon SageMaker Build, Train and Deploy Your ML Models
Amazon SageMaker Build, Train and Deploy Your ML ModelsAmazon SageMaker Build, Train and Deploy Your ML Models
Amazon SageMaker Build, Train and Deploy Your ML ModelsAWS Riyadh User Group
 
Introduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelIntroduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelAmazon Web Services
 
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot Instances...
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot Instances...Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot Instances...
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot Instances...Amazon Web Services
 
自己紹介とStoryblok紹介(5分 ver)
自己紹介とStoryblok紹介(5分 ver)自己紹介とStoryblok紹介(5分 ver)
自己紹介とStoryblok紹介(5分 ver)Arisa Fukuzaki
 
成本節約之道:加速設計週期 x 大規模運行高效能運算 (HPC) 工作負載 (Level: 300)
成本節約之道:加速設計週期 x 大規模運行高效能運算 (HPC) 工作負載 (Level: 300)成本節約之道:加速設計週期 x 大規模運行高效能運算 (HPC) 工作負載 (Level: 300)
成本節約之道:加速設計週期 x 大規模運行高效能運算 (HPC) 工作負載 (Level: 300)Amazon Web Services
 
Amazon AI/ML Overview
Amazon AI/ML OverviewAmazon AI/ML Overview
Amazon AI/ML OverviewBESPIN GLOBAL
 
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Amazon Web Services
 
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...Amazon Web Services
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...Amazon Web Services
 
Deep Dive Amazon SageMaker
Deep Dive Amazon SageMakerDeep Dive Amazon SageMaker
Deep Dive Amazon SageMakerCobus Bernard
 
AWS의 새로운 언어, 음성, 텍스트 처리 인공 지능 서비스, Amazon SageMaker::Sunil Mallya::AWS Summit...
AWS의 새로운 언어, 음성, 텍스트 처리 인공 지능 서비스, Amazon SageMaker::Sunil Mallya::AWS Summit...AWS의 새로운 언어, 음성, 텍스트 처리 인공 지능 서비스, Amazon SageMaker::Sunil Mallya::AWS Summit...
AWS의 새로운 언어, 음성, 텍스트 처리 인공 지능 서비스, Amazon SageMaker::Sunil Mallya::AWS Summit...Amazon Web Services Korea
 

What's hot (20)

눈으로 보는 AWS 기반 인공지능 서비스 아키텍처 활용 데모::OliverKlein::AWS Summit Seoul 2018
눈으로 보는 AWS 기반 인공지능 서비스 아키텍처 활용 데모::OliverKlein::AWS Summit Seoul 2018눈으로 보는 AWS 기반 인공지능 서비스 아키텍처 활용 데모::OliverKlein::AWS Summit Seoul 2018
눈으로 보는 AWS 기반 인공지능 서비스 아키텍처 활용 데모::OliverKlein::AWS Summit Seoul 2018
 
建構全球跨區域 x Active-Active架構的無伺服器化後台服務
建構全球跨區域  x Active-Active架構的無伺服器化後台服務建構全球跨區域  x Active-Active架構的無伺服器化後台服務
建構全球跨區域 x Active-Active架構的無伺服器化後台服務
 
Amplify로 Neptune 그래프 DB 기반 모바일 앱 만들기 :: 김현민 - AWS Community Day 2019
Amplify로 Neptune 그래프 DB 기반 모바일 앱 만들기 :: 김현민 - AWS Community Day 2019Amplify로 Neptune 그래프 DB 기반 모바일 앱 만들기 :: 김현민 - AWS Community Day 2019
Amplify로 Neptune 그래프 DB 기반 모바일 앱 만들기 :: 김현민 - AWS Community Day 2019
 
善用 GraphQL 與 AWS AppSync 讓您的 Progressive Web App (PWA) 加速進化 (Level 200)
善用  GraphQL 與 AWS AppSync 讓您的  Progressive Web App (PWA) 加速進化 (Level 200)善用  GraphQL 與 AWS AppSync 讓您的  Progressive Web App (PWA) 加速進化 (Level 200)
善用 GraphQL 與 AWS AppSync 讓您的 Progressive Web App (PWA) 加速進化 (Level 200)
 
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS Summit
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS SummitWork with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS Summit
Work with Machine Learning in Amazon SageMaker - BDA203 - Toronto AWS Summit
 
Continuous Delivery Best Practices
Continuous Delivery Best PracticesContinuous Delivery Best Practices
Continuous Delivery Best Practices
 
Making Hybrid Work for You: Getting into the Cloud Fast (GPSTEC308) - AWS re:...
Making Hybrid Work for You: Getting into the Cloud Fast (GPSTEC308) - AWS re:...Making Hybrid Work for You: Getting into the Cloud Fast (GPSTEC308) - AWS re:...
Making Hybrid Work for You: Getting into the Cloud Fast (GPSTEC308) - AWS re:...
 
Building Deep Learning Applications with TensorFlow and Amazon SageMaker
Building Deep Learning Applications with TensorFlow and Amazon SageMakerBuilding Deep Learning Applications with TensorFlow and Amazon SageMaker
Building Deep Learning Applications with TensorFlow and Amazon SageMaker
 
Amazon SageMaker Build, Train and Deploy Your ML Models
Amazon SageMaker Build, Train and Deploy Your ML ModelsAmazon SageMaker Build, Train and Deploy Your ML Models
Amazon SageMaker Build, Train and Deploy Your ML Models
 
Introduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelIntroduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day Israel
 
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot Instances...
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot Instances...Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot Instances...
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot Instances...
 
自己紹介とStoryblok紹介(5分 ver)
自己紹介とStoryblok紹介(5分 ver)自己紹介とStoryblok紹介(5分 ver)
自己紹介とStoryblok紹介(5分 ver)
 
成本節約之道:加速設計週期 x 大規模運行高效能運算 (HPC) 工作負載 (Level: 300)
成本節約之道:加速設計週期 x 大規模運行高效能運算 (HPC) 工作負載 (Level: 300)成本節約之道:加速設計週期 x 大規模運行高效能運算 (HPC) 工作負載 (Level: 300)
成本節約之道:加速設計週期 x 大規模運行高效能運算 (HPC) 工作負載 (Level: 300)
 
Amazon AI/ML Overview
Amazon AI/ML OverviewAmazon AI/ML Overview
Amazon AI/ML Overview
 
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
 
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
 
Deep Dive Amazon SageMaker
Deep Dive Amazon SageMakerDeep Dive Amazon SageMaker
Deep Dive Amazon SageMaker
 
AWS의 새로운 언어, 음성, 텍스트 처리 인공 지능 서비스, Amazon SageMaker::Sunil Mallya::AWS Summit...
AWS의 새로운 언어, 음성, 텍스트 처리 인공 지능 서비스, Amazon SageMaker::Sunil Mallya::AWS Summit...AWS의 새로운 언어, 음성, 텍스트 처리 인공 지능 서비스, Amazon SageMaker::Sunil Mallya::AWS Summit...
AWS의 새로운 언어, 음성, 텍스트 처리 인공 지능 서비스, Amazon SageMaker::Sunil Mallya::AWS Summit...
 
Using Graph Databases
Using Graph DatabasesUsing Graph Databases
Using Graph Databases
 

Similar to Machine Learning using Kubernetes - AI Conclave 2019

MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetApache MXNet
 
Machine Learning with Kubernetes- AWS Container Day 2019 Barcelona
Machine Learning with Kubernetes- AWS Container Day 2019 BarcelonaMachine Learning with Kubernetes- AWS Container Day 2019 Barcelona
Machine Learning with Kubernetes- AWS Container Day 2019 BarcelonaAmazon Web Services
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Julien SIMON
 
Machine learning using Kubernetes
Machine learning using KubernetesMachine learning using Kubernetes
Machine learning using KubernetesArun Gupta
 
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019Amazon Web Services
 
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019AWS Summits
 
Building a Recommender System Using Amazon SageMaker's Factorization Machine ...
Building a Recommender System Using Amazon SageMaker's Factorization Machine ...Building a Recommender System Using Amazon SageMaker's Factorization Machine ...
Building a Recommender System Using Amazon SageMaker's Factorization Machine ...Amazon Web Services
 
Optimize your machine learning workloads on AWS (March 2019)
Optimize your machine learning workloads on AWS (March 2019)Optimize your machine learning workloads on AWS (March 2019)
Optimize your machine learning workloads on AWS (March 2019)Julien SIMON
 
[NEW LAUNCH] Introducing AWS Deep Learning Containers
[NEW LAUNCH] Introducing AWS Deep Learning Containers[NEW LAUNCH] Introducing AWS Deep Learning Containers
[NEW LAUNCH] Introducing AWS Deep Learning ContainersAmazon Web Services
 
Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sage...
Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sage...Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sage...
Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sage...Amazon Web Services
 
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS SummitDeploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS SummitAmazon Web Services
 
Modern-Application-Design-with-Amazon-ECS
Modern-Application-Design-with-Amazon-ECSModern-Application-Design-with-Amazon-ECS
Modern-Application-Design-with-Amazon-ECSAmazon Web Services
 
利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統Amazon Web Services
 
[금융사를 위한 AWS Generative AI Day 2023] 7_다양한 AI 워크로드를 위한 최적의 ...
[금융사를 위한 AWS Generative AI Day 2023] 7_다양한 AI 워크로드를 위한 최적의 ...[금융사를 위한 AWS Generative AI Day 2023] 7_다양한 AI 워크로드를 위한 최적의 ...
[금융사를 위한 AWS Generative AI Day 2023] 7_다양한 AI 워크로드를 위한 최적의 ...AWS Korea 금융산업팀
 
機器學習技術在工業應用上的最佳實務
機器學習技術在工業應用上的最佳實務機器學習技術在工業應用上的最佳實務
機器學習技術在工業應用上的最佳實務Amazon Web Services
 
AWS Summit London 2019 - Containers on AWS
AWS Summit London 2019 - Containers on AWSAWS Summit London 2019 - Containers on AWS
AWS Summit London 2019 - Containers on AWSMassimo Ferre'
 
Setting up custom machine learning environments on AWS - AIM309 - New York AW...
Setting up custom machine learning environments on AWS - AIM309 - New York AW...Setting up custom machine learning environments on AWS - AIM309 - New York AW...
Setting up custom machine learning environments on AWS - AIM309 - New York AW...Amazon Web Services
 
Amazon EC2 Strategie per l'ottimizzazione dei costi
Amazon EC2 Strategie per l'ottimizzazione dei costiAmazon EC2 Strategie per l'ottimizzazione dei costi
Amazon EC2 Strategie per l'ottimizzazione dei costiAmazon Web Services
 
Supercharge Your Organisation With Machine Learning on AWS - AWS Summit Sydney
Supercharge Your Organisation With Machine Learning on AWS - AWS Summit SydneySupercharge Your Organisation With Machine Learning on AWS - AWS Summit Sydney
Supercharge Your Organisation With Machine Learning on AWS - AWS Summit SydneyAmazon Web Services
 

Similar to Machine Learning using Kubernetes - AI Conclave 2019 (20)

MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNet
 
Machine Learning with Kubernetes- AWS Container Day 2019 Barcelona
Machine Learning with Kubernetes- AWS Container Day 2019 BarcelonaMachine Learning with Kubernetes- AWS Container Day 2019 Barcelona
Machine Learning with Kubernetes- AWS Container Day 2019 Barcelona
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)
 
Machine learning using Kubernetes
Machine learning using KubernetesMachine learning using Kubernetes
Machine learning using Kubernetes
 
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
 
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
 
Core services
Core servicesCore services
Core services
 
Building a Recommender System Using Amazon SageMaker's Factorization Machine ...
Building a Recommender System Using Amazon SageMaker's Factorization Machine ...Building a Recommender System Using Amazon SageMaker's Factorization Machine ...
Building a Recommender System Using Amazon SageMaker's Factorization Machine ...
 
Optimize your machine learning workloads on AWS (March 2019)
Optimize your machine learning workloads on AWS (March 2019)Optimize your machine learning workloads on AWS (March 2019)
Optimize your machine learning workloads on AWS (March 2019)
 
[NEW LAUNCH] Introducing AWS Deep Learning Containers
[NEW LAUNCH] Introducing AWS Deep Learning Containers[NEW LAUNCH] Introducing AWS Deep Learning Containers
[NEW LAUNCH] Introducing AWS Deep Learning Containers
 
Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sage...
Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sage...Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sage...
Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sage...
 
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS SummitDeploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
 
Modern-Application-Design-with-Amazon-ECS
Modern-Application-Design-with-Amazon-ECSModern-Application-Design-with-Amazon-ECS
Modern-Application-Design-with-Amazon-ECS
 
利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統
 
[금융사를 위한 AWS Generative AI Day 2023] 7_다양한 AI 워크로드를 위한 최적의 ...
[금융사를 위한 AWS Generative AI Day 2023] 7_다양한 AI 워크로드를 위한 최적의 ...[금융사를 위한 AWS Generative AI Day 2023] 7_다양한 AI 워크로드를 위한 최적의 ...
[금융사를 위한 AWS Generative AI Day 2023] 7_다양한 AI 워크로드를 위한 최적의 ...
 
機器學習技術在工業應用上的最佳實務
機器學習技術在工業應用上的最佳實務機器學習技術在工業應用上的最佳實務
機器學習技術在工業應用上的最佳實務
 
AWS Summit London 2019 - Containers on AWS
AWS Summit London 2019 - Containers on AWSAWS Summit London 2019 - Containers on AWS
AWS Summit London 2019 - Containers on AWS
 
Setting up custom machine learning environments on AWS - AIM309 - New York AW...
Setting up custom machine learning environments on AWS - AIM309 - New York AW...Setting up custom machine learning environments on AWS - AIM309 - New York AW...
Setting up custom machine learning environments on AWS - AIM309 - New York AW...
 
Amazon EC2 Strategie per l'ottimizzazione dei costi
Amazon EC2 Strategie per l'ottimizzazione dei costiAmazon EC2 Strategie per l'ottimizzazione dei costi
Amazon EC2 Strategie per l'ottimizzazione dei costi
 
Supercharge Your Organisation With Machine Learning on AWS - AWS Summit Sydney
Supercharge Your Organisation With Machine Learning on AWS - AWS Summit SydneySupercharge Your Organisation With Machine Learning on AWS - AWS Summit Sydney
Supercharge Your Organisation With Machine Learning on AWS - AWS Summit Sydney
 

More from Arun Gupta

5 Skills To Force Multiply Technical Talents.pdf
5 Skills To Force Multiply Technical Talents.pdf5 Skills To Force Multiply Technical Talents.pdf
5 Skills To Force Multiply Technical Talents.pdfArun Gupta
 
Machine Learning using Kubeflow and Kubernetes
Machine Learning using Kubeflow and KubernetesMachine Learning using Kubeflow and Kubernetes
Machine Learning using Kubeflow and KubernetesArun Gupta
 
Secure and Fast microVM for Serverless Computing using Firecracker
Secure and Fast microVM for Serverless Computing using FirecrackerSecure and Fast microVM for Serverless Computing using Firecracker
Secure and Fast microVM for Serverless Computing using FirecrackerArun Gupta
 
Building Java in the Open - j.Day at OSCON 2019
Building Java in the Open - j.Day at OSCON 2019Building Java in the Open - j.Day at OSCON 2019
Building Java in the Open - j.Day at OSCON 2019Arun Gupta
 
Why Amazon Cares about Open Source
Why Amazon Cares about Open SourceWhy Amazon Cares about Open Source
Why Amazon Cares about Open SourceArun Gupta
 
Chaos Engineering with Kubernetes
Chaos Engineering with KubernetesChaos Engineering with Kubernetes
Chaos Engineering with KubernetesArun Gupta
 
How to be a mentor to bring more girls to STEAM
How to be a mentor to bring more girls to STEAMHow to be a mentor to bring more girls to STEAM
How to be a mentor to bring more girls to STEAMArun Gupta
 
Java in a World of Containers - DockerCon 2018
Java in a World of Containers - DockerCon 2018Java in a World of Containers - DockerCon 2018
Java in a World of Containers - DockerCon 2018Arun Gupta
 
Introduction to Amazon EKS - KubeCon 2018
Introduction to Amazon EKS - KubeCon 2018Introduction to Amazon EKS - KubeCon 2018
Introduction to Amazon EKS - KubeCon 2018Arun Gupta
 
Mastering Kubernetes on AWS - Tel Aviv Summit
Mastering Kubernetes on AWS - Tel Aviv SummitMastering Kubernetes on AWS - Tel Aviv Summit
Mastering Kubernetes on AWS - Tel Aviv SummitArun Gupta
 
Top 10 Technology Trends Changing Developer's Landscape
Top 10 Technology Trends Changing Developer's LandscapeTop 10 Technology Trends Changing Developer's Landscape
Top 10 Technology Trends Changing Developer's LandscapeArun Gupta
 
Container Landscape in 2017
Container Landscape in 2017Container Landscape in 2017
Container Landscape in 2017Arun Gupta
 
Java EE and NoSQL using JBoss EAP 7 and OpenShift
Java EE and NoSQL using JBoss EAP 7 and OpenShiftJava EE and NoSQL using JBoss EAP 7 and OpenShift
Java EE and NoSQL using JBoss EAP 7 and OpenShiftArun Gupta
 
Docker, Kubernetes, and Mesos recipes for Java developers
Docker, Kubernetes, and Mesos recipes for Java developersDocker, Kubernetes, and Mesos recipes for Java developers
Docker, Kubernetes, and Mesos recipes for Java developersArun Gupta
 
Thanks Managers!
Thanks Managers!Thanks Managers!
Thanks Managers!Arun Gupta
 
Migrate your traditional VM-based Clusters to Containers
Migrate your traditional VM-based Clusters to ContainersMigrate your traditional VM-based Clusters to Containers
Migrate your traditional VM-based Clusters to ContainersArun Gupta
 
NoSQL - Vital Open Source Ingredient for Modern Success
NoSQL - Vital Open Source Ingredient for Modern SuccessNoSQL - Vital Open Source Ingredient for Modern Success
NoSQL - Vital Open Source Ingredient for Modern SuccessArun Gupta
 
Package your Java EE Application using Docker and Kubernetes
Package your Java EE Application using Docker and KubernetesPackage your Java EE Application using Docker and Kubernetes
Package your Java EE Application using Docker and KubernetesArun Gupta
 
Nuts and Bolts of WebSocket Devoxx 2014
Nuts and Bolts of WebSocket Devoxx 2014Nuts and Bolts of WebSocket Devoxx 2014
Nuts and Bolts of WebSocket Devoxx 2014Arun Gupta
 
How to run your first marathon ? JavaOne 2014 Ignite
How to run your first marathon ? JavaOne 2014 IgniteHow to run your first marathon ? JavaOne 2014 Ignite
How to run your first marathon ? JavaOne 2014 IgniteArun Gupta
 

More from Arun Gupta (20)

5 Skills To Force Multiply Technical Talents.pdf
5 Skills To Force Multiply Technical Talents.pdf5 Skills To Force Multiply Technical Talents.pdf
5 Skills To Force Multiply Technical Talents.pdf
 
Machine Learning using Kubeflow and Kubernetes
Machine Learning using Kubeflow and KubernetesMachine Learning using Kubeflow and Kubernetes
Machine Learning using Kubeflow and Kubernetes
 
Secure and Fast microVM for Serverless Computing using Firecracker
Secure and Fast microVM for Serverless Computing using FirecrackerSecure and Fast microVM for Serverless Computing using Firecracker
Secure and Fast microVM for Serverless Computing using Firecracker
 
Building Java in the Open - j.Day at OSCON 2019
Building Java in the Open - j.Day at OSCON 2019Building Java in the Open - j.Day at OSCON 2019
Building Java in the Open - j.Day at OSCON 2019
 
Why Amazon Cares about Open Source
Why Amazon Cares about Open SourceWhy Amazon Cares about Open Source
Why Amazon Cares about Open Source
 
Chaos Engineering with Kubernetes
Chaos Engineering with KubernetesChaos Engineering with Kubernetes
Chaos Engineering with Kubernetes
 
How to be a mentor to bring more girls to STEAM
How to be a mentor to bring more girls to STEAMHow to be a mentor to bring more girls to STEAM
How to be a mentor to bring more girls to STEAM
 
Java in a World of Containers - DockerCon 2018
Java in a World of Containers - DockerCon 2018Java in a World of Containers - DockerCon 2018
Java in a World of Containers - DockerCon 2018
 
Introduction to Amazon EKS - KubeCon 2018
Introduction to Amazon EKS - KubeCon 2018Introduction to Amazon EKS - KubeCon 2018
Introduction to Amazon EKS - KubeCon 2018
 
Mastering Kubernetes on AWS - Tel Aviv Summit
Mastering Kubernetes on AWS - Tel Aviv SummitMastering Kubernetes on AWS - Tel Aviv Summit
Mastering Kubernetes on AWS - Tel Aviv Summit
 
Top 10 Technology Trends Changing Developer's Landscape
Top 10 Technology Trends Changing Developer's LandscapeTop 10 Technology Trends Changing Developer's Landscape
Top 10 Technology Trends Changing Developer's Landscape
 
Container Landscape in 2017
Container Landscape in 2017Container Landscape in 2017
Container Landscape in 2017
 
Java EE and NoSQL using JBoss EAP 7 and OpenShift
Java EE and NoSQL using JBoss EAP 7 and OpenShiftJava EE and NoSQL using JBoss EAP 7 and OpenShift
Java EE and NoSQL using JBoss EAP 7 and OpenShift
 
Docker, Kubernetes, and Mesos recipes for Java developers
Docker, Kubernetes, and Mesos recipes for Java developersDocker, Kubernetes, and Mesos recipes for Java developers
Docker, Kubernetes, and Mesos recipes for Java developers
 
Thanks Managers!
Thanks Managers!Thanks Managers!
Thanks Managers!
 
Migrate your traditional VM-based Clusters to Containers
Migrate your traditional VM-based Clusters to ContainersMigrate your traditional VM-based Clusters to Containers
Migrate your traditional VM-based Clusters to Containers
 
NoSQL - Vital Open Source Ingredient for Modern Success
NoSQL - Vital Open Source Ingredient for Modern SuccessNoSQL - Vital Open Source Ingredient for Modern Success
NoSQL - Vital Open Source Ingredient for Modern Success
 
Package your Java EE Application using Docker and Kubernetes
Package your Java EE Application using Docker and KubernetesPackage your Java EE Application using Docker and Kubernetes
Package your Java EE Application using Docker and Kubernetes
 
Nuts and Bolts of WebSocket Devoxx 2014
Nuts and Bolts of WebSocket Devoxx 2014Nuts and Bolts of WebSocket Devoxx 2014
Nuts and Bolts of WebSocket Devoxx 2014
 
How to run your first marathon ? JavaOne 2014 Ignite
How to run your first marathon ? JavaOne 2014 IgniteHow to run your first marathon ? JavaOne 2014 Ignite
How to run your first marathon ? JavaOne 2014 Ignite
 

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 

Machine Learning using Kubernetes - AI Conclave 2019

  • 1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning using Kubeflow Arun Gupta, @arungupta Principal Open Source Technologist
  • 2. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://dilbert.com/strip/2013-02-02
  • 3. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning 101
  • 4. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Broadest and deepest set of capabilities T H E AW S M L S TAC K Broadest and deepest set of capabilities FRAMEWORKS INTERFACES INFRASTRUCTURE ML Frameworks + Infrastructure Deep Learning AMIs & Containers GPUs & CPUs Elastic Inference Inferentia FPGA
  • 5. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Broadest and deepest set of capabilities T H E AW S M L S TAC K Amazon EKS Auto ScalingOptimized GPU AMI Deep Learning Container FSx CSI Plugin Containerized ML
  • 6. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Broadest and deepest set of capabilities T H E AW S M L S TAC K ML Services Amazon SageMaker Ground Truth data labelling ML Marketplace SageMaker Neo Built-in algorithms SageMaker Notebooks SageMaker Experiments Model tuning SageMaker Autopilot Model hosting SageMaker Model Monitor SageMakerStudioIDE
  • 7. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AI Services Broadest and deepest set of capabilities T H E AW S M L S TAC K VISION SPEECH TEXT SEARCH CHATBOTS PERSONALIZATION FORECASTING FRAUD DEVELOPMENT CONTACT CENTERS Amazon Rekognition +Custom Labels Amazon Polly Amazon Transcribe +Medical Amazon Comprehend +Medical Amazon Translate Amazon Lex Amazon Personalize Amazon Forecast Amazon Fraud Detector Amazon CodeGuru Amazon Textract Amazon Kendra Amazon Connect with Contact Lens
  • 8. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AI Services Broadest and deepest set of capabilities T H E AW S M L S TAC K VISION SPEECH TEXT SEARCH CHATBOTS PERSONALIZATION FORECASTING FRAUD DEVELOPMENT CONTACT CENTERS Amazon Rekognition +Custom Labels Amazon Polly Amazon Transcribe +Medical Amazon Comprehend +Medical Amazon Translate Amazon Lex Amazon Personalize Amazon Forecast Amazon Fraud Detector Amazon CodeGuru Amazon Textract Amazon Kendra Amazon Connect with Contact Lens FRAMEWORKS INTERFACES INFRASTRUCTURE ML Frameworks + Infrastructure Deep Learning AMIs & Containers GPUs & CPUs Elastic Inference Inferentia FPGA ML Services Amazon SageMaker Ground Truth data labelling ML Marketplace SageMaker Neo Built-in algorithms SageMaker Notebooks SageMaker Experiments Model tuning SageMaker Autopilot Model hosting SageMaker Model Monitor SageMakerStudioIDE Amazon EKS Auto ScalingOptimized GPU AMI Deep Learning Container FSx CSI Plugin Containerized ML
  • 9. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. M A C H I N E L E A R N I N G S T O R A G E Amazon Redshift + Redshift Spectrum Amazon QuickSight Amazon EMR Hadoop, Spark, Presto, Pig, Hive…19 total Amazon Athena Amazon Kinesis Amazon Elasticsearch Service AWS Glue A N A L Y T I C S Amazon S3 Standard-IA Amazon S3 Standard Amazon S3 One Zone-IA Amazon Glacier Amazon S3 Intelligent- Tiering N E W Amazon EBS Amazon S3 Glacier Deep Archive N E W Storage and Analytics for Machine Learning
  • 10. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Why Machine Learning on Kubernetes? Composability Portability Scalability O N - P R E M I S E S C L O U D http://www.shutterstock.com/gallery-635827p1.html
  • 11. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning on K8s: Without KubeFlow @aronchik
  • 12. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning on K8s: With KubeFlow @aronchik
  • 13. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Kubeflow Containerized machine learning platform Makes it easy to develop, deploy, and manage portable, scalable end-to-end ML workflows on k8s “Toolkit” – loosely coupled tools and blueprints for ML End to End ML workflow – ML code is only a small component https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
  • 14. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s in KubeFlow?
  • 15. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EKS: run Kubernetes in cloud Managed Kubernetes control plane, attach data plane Native upstream Kubernetes experience Platform for enterprises to run production-grade workloads Integrates with additional AWS services
  • 16. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting started with Amazon EKS eksctl CLI—create Amazon EKS clusters (eksctl.io) Creates all resources needed for the cluster
  • 17. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EKS-Optimized GPU AMI Built on top of the standard Amazon EKS-Optimized AMI Includes packages to support Amazon P2/P3/G3/G4 instances • NVIDIA drivers • nvidia-docker2 package • nvidia-container-runtime (as default runtime) GPU Clock Optimization
  • 18. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Cluster Autoscaler Improvements Add GPU support autoscaler#1584 GPU autoscaling supported for AWS autoscaler#1589 GPU scale down performance optimization Prevent CA from removing a node with ML training job running Annotate job ”cluster-autoscaler.kubernetes.io/safe-to-evict”: “false” Recommended to create GPU node group per AZ Improve network communication performance Prevent ASG rebalancing
  • 19. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Deep Learning Containers KEY FEATURES Customizable container images Support for TensorFlow, Apache MXNet Single and multi-node training and inference Pre-packaged Docker container images fully configured and validated Best performance and scalability without tuning Works with Amazon EKS, Amazon ECS, and Amazon EC2
  • 20. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow on Desktop MiniKF: Local Kubeflow deployment using VirtualBox and Vagrant • Minikube -> Kubernetes • MiniKF -> Kubeflow (includes minikube) Runs on macOS, Linux, and Windows Does not require k8s-specific knowledge
  • 21. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow on Cloud Major cloud providers supported Choices on Amazon Web Services • Self-managed k8s on EC2: Kops, CloudFormation, Terraform • Amazon EKS
  • 22. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting Started with Kubeflow on Amazon EKS
  • 23. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jupyter Notebook Create and share documents that contain live code, equations, visualizations, and narrative text • UI to manage notebooks • Integrate with RBAC/IAM • Ingress / Service Mesh
  • 24. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jupyter Notebook
  • 25. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fairing Python SDK to build, train and deploy ML models • Easily package ML training jobs • Train ML models from notebook to k8s • Streamline the model development process Setup KubeflowFairing for training and prediction https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/02_Fairing/02_06_fairing_e2e.ipynb Train an XGBoost model remotely on Kubeflow Deploy the trained model to Kubeflowfor prediction
  • 26. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Katib – Hyperparameter Tuning Hyperparameter are parameters external to the model to control the training, e.g. learning rate, batch size, epochs Tuning finds a set of hyperparameters that optimizes an objective function, e.g. Find the optimal batch size and learning rate to maximize prediction accuracy
  • 27. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Hyperparameter Tuning is Hard More hyperparameters -> exponential space growth Tuning by hands is inefficient and error-prone Need to track metrics across multiple jobs Managing resources and infrastructure for lot of jobs is hard Variety of frameworks and algorithms to support
  • 28. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Katib – Hyperparameter Tuning trialName Validation-accuracy accuracy --lr --num-layers --optimizer random-experiment-rfwwbnsd 0.974920 0.984844 0.013831565266960293 4 sgd random-experiment-vxgwlgqq 0.113854 0.116646 0.024225789898529138 4 ftrl random-experiment-wclrwlcq 0.979697 0.998437 0.021916171239020756 4 sgd random-experiment-7lsc4pwb 0.113854 0.115312 0.024163810384272653 5 ftrl random-experiment-86vv9vgv 0.963475 0.971562 0.02943228249244735 3 adam random-experiment-jh884cxz 0.981091 0.999219 0.022372025623908262 2 sgd random-experiment-sgtwhrgz 0.980693 0.997969 0.016641686851083654 4 sgd random-experiment-c6vvz6dv 0.980792 0.998906 0.0264125850165842 3 sgd random-experiment-vqs2xmfj 0.113854 0.105313 0.026629394628228185 4 ftrl random-experiment-bv8lsh2m 0.980195 0.999375 0.021769570793012488 2 sgd random-experiment-7vbnqc7z 0.113854 0.102188 0.025079750575740783 4 ftrl random-experiment-kwj9drmg 0.979498 0.995469 0.014985919312945063 4 sgd Hyperparameters
  • 29. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Trial template
  • 30. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 31. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. KFServing: Model serving and management Provides a Kubernetes CRD for serving ML models on arbitrary frameworks. Encapsulates the complexity of autoscaling, networking and server configuration to bring features like scale to zero, transformations, and canary rollouts to your deployments Enables a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability.
  • 32. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. KFServing Custom Resource S3 secret attached to Service Account Trained model https://github.com/kubeflow/kfserving/blob/master/docs/samples/s3/tensorflow_s3.yaml
  • 33. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pluggable Interface apiVersion: "serving.kubeflow.org/v1alpha1" kind: "InferenceService" metadata: name: "sklearn-iris" spec: default: sklearn: storageUri: "gs://kfserving-samples/models/sklearn/iris" apiVersion: "serving.kubeflow.org/v1alpha1" kind: "InferenceService" metadata: name: "flowers-sample" spec: default: tensorflow: storageUri: "gs://kfserving-samples/models/tensorflow/flowers" apiVersion: "serving.kubeflow.org/v1alpha1" kind: "KFService" metadata: name: "pytorch-cifar10" spec: default: pytorch: storageUri: "gs://kfserving-samples/models/pytorch/cifar10" modelClassName: "Net"
  • 34. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. KFServing Interface – Scikit Learn apiVersion: "serving.kubeflow.org/v1alpha1" kind: "KFService" metadata: name: "sklearn-iris" spec: default: sklearn: storageUri: "gs://kfserving-samples/models/sklearn/iris" serviceAccount: inferencing-robot minReplicas: 3 maxReplicas: 10 resources: requests: cpu: 2 gpu: 1 memory: 10Gi canaryTrafficPercent: 25 canary: sklearn: storageUri: "gs://kfserving-samples/models/sklearn/iris-v2" serviceAccount: inferencing-robot minReplicas: 3 maxReplicas: 10 resources: requests: cpu: 2 gpu: 1 memory: 10Gi
  • 35. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Distributed Training
  • 36. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices for Optimizing Distributed Deep Learning Performance on Amazon EKS https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
  • 37. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pipelines – Machine Learning Job Orchestrator Compose, deploy, and manage end-to-end ML workflows End-to-end orchestration Easy, rapid, and reliable experimentation Easy re-use Built using Pipelines SDK kfp.compiler, kfp.components, kfp.Client Uses Argo under the hood to orchestrate resources
  • 38. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Creating Kubeflow Pipeline Components @dsl.pipeline( name='Sample Trainer', description=’’ ) def sample_train_pipeline(... ): create_cluster_op = CreateClusterOp('create-cluster', ...) analyze_op = AnalyzeOp('analyze', ...) transform_op = TransformOp('transform', ...) train_op = TrainerOp('train', ...) predict_op = PredictOp('predict', ...) confusion_matrix_op = ConfusionMatrixOp('confusion-matrix', ...) roc_op = RocOp('roc', ...) kfp.compiler.Compiler().compile(sample_train_pipeline , 'my-pipeline.zip’) Pipeline component Pipeline decorator Pipeline function Compile pipeline
  • 39. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Creating Kubeflow Pipeline Components
  • 40. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Metadata – Model Tracking • Metadata schema to track artifacts related to execution contexts • Metadata API for storing and retrieving metadata • Client libraries for end-users to interact with the Metadata service from their Notebooks or Pipelines code.
  • 41. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Making Kubeflow a first class citizen on AWS • Centralized and unified Kubernetes cluster logs in Amazon CloudWatch • External traffic and authentication management with ALB Ingress Controller • TLS and authentication with AWS Certificate Manager and AWS Cognito • In-built FSx CSI driver w/S3 data repository integration to optimize training performance • Elastic File System integration for common data sharing in JupyterHub • Easier and customizable Kubeflow installation with kfctl and Kustomize support • Kubeflow Pipeline integration with AWS Services – Amazon EMR, Athena, SageMaker • Add ECR integration to Kubeflow Fairing • Jupyter Notebook images with AWS CLI installed and ECR support • Auto detect GPU worker nodes and install NVIDIA device plugin https://www.kubeflow.org/docs/aws/
  • 42. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 43. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Kubeflow Roadmap Kubeflow v1.0 - Theme: Enterprise Readiness E2E examples and increased docs on Kubeflow site Upstream testing for Kubeflow on AWS Support DIY K8S on AWS IAM Roles for Service Accounts integration with Jupyter notebooks Support for managed contributors
  • 44. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Feature store - Feast • Discoverability and reuse of features • Standardization of features • Access to features for training and serving • Consistency between training and serving
  • 45. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fully managed infrastructure in Amazon SageMake Introducing Amazon SageMaker Operators for Kubernetes Kubernetes customers can now train, tune, & deploy models in Amazon SageMaker
  • 46. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Under the hood – Amazon SageMaker and Kubernetes Kubectl apply YAML Key Features • Amazon SageMaker Operators for training, tuning, inference • Natively interact with Amazon SageMaker jobs using K8s tools (e.g., get pods, describe) • Stream and view logs from Amazon SageMaker in K8s • Helm Charts to assist with setup and spec creation
  • 47. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://github.com/aws/amazon- sagemaker-operator-for-k8s
  • 48. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. References Workshop: eksworkshop.com/advanced/420_kubeflow/ Jupyter notebooks: github.com/aws-samples/eks-kubeflow- workshop/ Optimizing Machine Learning performance: aws.amazon.com/blogs/opensource/optimizing-distributed- deep-learning-performance-amazon-eks/
  • 49. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Please fill up the session rating in the feedback form and collect your goodie at the end of the day THANK YOU

Editor's Notes

  1. Kubernetes provides isolation, auto-scaling, load balancing, flexibility and GPU support. These features are critical to run computationally, data intensive and hard to parallelize machine learning models. Declarative syntax of Kubernetes deployment descriptors make it easy for non-operationally focused engineers to easily train machine learning models on Kubernetes. This talk will explain why and how Amazon EKS, our managed service is the right Kubernetes platform for building your Machine Learning solutions.
  2. When we learn how to ride a bike, we’re using the data provided by a friend, or a sibling, or a parent to train our mind to ride the bike. Machine Learning is applying those concepts, but machine-to-machine.
  3. Machine Learning can be realized in a variety of ways. Let’s see how it looks in a DIY fashion. There is a training phase. You start with a training data that will be used to create a model. The blue cloud in the middle is your code that reads the training data and creates a model. Once the model is generated, then test data is fed to the model to find out accuracy of the model. Algorithm chosen in your application also defines how long it takes to generate the model. You keep repeating this cycle until a model with reasonable accuracy is obtained. After training is done, there is an inference phase. In this phase, input data, typically real-world data, is fed to the the generated model and predictions are made. The ultimate goal of ML is to ensure how good are the predictions? So if your ML model is to identify a hand-written number and if it is presented a hand-written number, can it be accurately identified? If not, then you go back to training and then infer again.
  4. Within AWS we see the stack as having three layers:   The bottom layer of the stack is for expert machine learning practitioners who work at the framework level and are comfortable building, training, tuning, and deploying machine learning models. This is the foundation for all of the innovation we drive at every other layer of the stack. There are GPU instances like P3 and P3dn where the vast majority of deep learning and machine learning is done in the cloud. All the common frameworks such as TensorFlow, PyTorch, Caffe2, and Apache MXNet are supported. We will always make sure that all the frameworks you care about are supported equally well, so you have the right tool for the right job.
  5. While we’re seeing a lot of activity at that bottom layer (infrastructure and frameworks), the reality is that there just aren't that many expert machine learning practitioners in the world. That’s why we built and launched Amazon SageMaker, a managed ML service in the middle tier, which makes it much easier for every day developers and data scientists to get up and running with machine learning…
  6. Moving on, the top level of the stack is what people often call artificial intelligence (AI), because it closely mimics human cognition. And our services here are for customers that don’t want to deal with models and training. Customers can easily build these capabilities into new and existing applications to reduce costs, increase speed, and improve customer satisfaction and insight. We offer multiple pre-trained AI services covering vision, speech, language, chatbots, forecasting and recommendations. The key here is that developers with no prior machine learning experience can easily build sophisticated AI driven applications, like an AI driven contact center or live media subtitling.
  7. Moving on, the top level of the stack is what people often call artificial intelligence (AI), because it closely mimics human cognition. And our services here are for customers that don’t want to deal with models and training. Customers can easily build these capabilities into new and existing applications to reduce costs, increase speed, and improve customer satisfaction and insight. We offer multiple pre-trained AI services covering vision, speech, language, chatbots, forecasting and recommendations. The key here is that developers with no prior machine learning experience can easily build sophisticated AI driven applications, like an AI driven contact center or live media subtitling.
  8. To summarize, the AWS AI and ML stack has three layers. Each layer addressing different audiences: ML Frameworks & Infrastructure: For expert machine learning practitioners who work at the framework level. ML Services: For every day developers and data scientists we built and launched Amazon SageMaker. AI Services: Developers with no prior machine learning experience can easily build sophisticated AI driven applications
  9. Deep storage and analytics capabilities are needed for a comprehensive ML solution. Storage systems should be able to support high throughput and low latency, with best security around that data. You need a deep collection of real-time analytics. AWS offers all of that. We’ll cover one part of this later in this talk as well.
  10. Why is Kubernetes well suited for Machine Learning? There are three reasons: Composability, Portability, Scalability ML is about data ingestion, data analysis, data transformation, data validation, building a model, model validation, training at scale, inference and much more. Each of these phase ends up being a microservice. And turns out Kubernetes provides a great platform for composing these microservices together. It allows multiple Data Scientists to chose a solution that works for them. It also enables separation of duties between Ops and Data scientists. Using containers and Kubernetes as a base layer allows you to use open source frameworks (ex: kubeflow), and use it to train & develop models on k8s. This allows you to easily migrate your solution from from laptop, to on-prem to the cloud. We’ll talk about kubeflow later in this preso. Kubernetes allows you to scale the applications. It not only provides support for more nodes, but more GPUs and we’ll talk about performance optimizations on Amazon EKS for near linear scalability later in this talk. More disk/network, low latency filesystems which again we’ll talk later. Also, you need to run the experiments multiple times by tweaking parameters a little bit every time. So you need to be able to scale your infrastructure to support these needs. AWS cloud meet your needs well for that.
  11. Lets see how we can leverage these containers on K8s. As a Data Scientist, you just want to do Data Science, run your ML models on EKS. But you need to become an expert in containers, packaging, persistent volumes, scaling, GPUs, drivers, DevOps and much more. Ever Data Scientist has a slightly different view on what the different tools are for modeling, UX, framework, storage and and multiple other items that are needed for ML.
  12. KubeFlow makes it easy for everyone to develop, deploy and manage portable, distributed Machine Learning on k8s. Anywhere you are running K8s, you should be able to run kubeflow. Once Amazon EKS cluster is up and running,
  13. Kubeflow was introduced about 2 years ago at KubeCon. It provides a containerized machine learning platform. The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes: simple, portable and scalable. Kubeflow has evolved to become a toolkit of loosely couple toolkits for machine learning. Their goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.  Lets see how we can leverage these containers on K8s. As a Data Scientist, you just want to do Data Science, run your ML models on EKS. But you need to become an expert in containers, packaging, persistent volumes, scaling, GPUs, drivers, DevOps and much more. Every Data Scientist has a slightly different view on what the different tools are for modeling, UX, framework, storage and and multiple other items that are needed for ML. Kubeflow provides a unified layer on top of k8s to run ML workloads. --- Unify competing interests of data scientists and devops Data scientists want to write algorithms, experiment fast, get access to data DevOps care about security, reliability, and cost
  14. Containerized Machine Learning platform JupyterHub for collaborative & interactive training A TensorFlow Training Controller A TensorFlow Serving Deployment, SeldonCore for complex inference and non TF models Pipelines, powered by Argo, for workflows Experiments allow to run different configuration of pipelines Metadata is the information about executions (runs), models, datasets, and other artifacts Wiring to make it work on any k8s anywhere
  15. We released an EKS-optimized GPU AMI. This is the basic building block that allows you to create GPU-powered Amazon EKS cluster. It is built on top of standard Amazon EKS-optimized AMI. Includes the usual NVIDIA drivers, package, and runtime to provide support for GPU instances in AWS cloud. NVIDIA driver uses an autoboost feature, which varies the GPU clock speeds. By disabling the autoboost feature and setting the GPU clock speeds to their maximum frequency, you can consistently achieve the maximum performance with your GPU instances. 
  16. Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster for burstable workloads. This means that there will be times when pods fail to run due to insufficient resources. At other times, nodes are underutilized for an extended period of time and so the pods there can be placed on other nodes and that node can be reclaimed. Most of the work for GPUs is now being put into cluster autoscaler. Lets talk about some of that. A couple of pull requests that we made are highlighted here. The first PR was enabling support for GPU to multiple cloud providers which allowed it to be solved in an upstream compliant way. It basically added label and type to the GPU nodes. GPU nodes are not sensitive to CPU and memory but use a different set of metrics for CA to scale down the nodes. This second PR adds support for that. safe-to-evict is a standard k8s annotation. It can be specified on a pod and then CA would not remove that node during scale down. This is particularly relevant for ML workloads as you’d not like the pod to be terminated that has completed hours of training, but still got work to do. The recommendation is to create GPU node groups per AZ. This has a couple of benefits – first it helps with network communication and low data transfer costs in a distributed training job. Secondly, it avoids ASG rebalancing across multiple AZs. --- Escalator is designed for large batch or job based workloads that cannot be force-drained and moved when the cluster needs to scale down - Escalator will ensure pods have been completed on nodes before terminating them. It is also optimized for scaling up the cluster as fast as possible to ensure pods are not left in a pending state.
  17. 1/ AWS Deep Learning Containers provides pre-packaged Docker container images that are fully configured and validated, so customers no longer have to spend time building and testing the images. 2/ Since we’ve already optimized these docker images for AWS, customers can get the best performance and scalability right away – no tuning required. 3/ AWS Deep Learning Containers are built to work with Amazon EKS, Amazon ECS, and Amazon EC2 to give developers the flexibility and choice. 4/ Customers can also customize these container images to include their own tools and packages for a high degree of control over features of their environment such as monitoring, compliance, and scaling.
  18. Jupyter notebook provides an easy on-ramp to build, deploy and train ML models. You can create notebooks that contain live code and provide interactive output in a wide variety of formats such as HTML, images, video, and custom MIME types. Jupyter supports over 40 programming languages, including Python, R, Julia, and Scala. Jupyter notebook in Kubeflow doesn't have that many language kernels yet. However, users can definitely customize on their own. Kubeflow is 0.7 today. One of the new features introduced in 0.6 was multi-user isolation of user-created resources. This feature allow multiple users to operate on a shared Kubeflow deployment without stepping on each others’ jobs and resources. The isolation mechanisms also prevent accidental deletion/modification of resources of other users in the deployment. An administrator needs to deploy Kubeflow and configure the authentication. service for the deployment. A user can log into the system and will by default be accessing their primary profile. A profile is a collection of Kubernetes resources along with a Kubernetes namespace of the same name.
  19. By using Kubeflow Fairing and adding a few lines of code, you can run your ML training job locally or in the cloud, directly from Python code or a Jupyter notebook. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint. Kubeflow Fairing packages your Jupyter notebook, Python function, or Python file as a Docker image, then deploys and runs the training job on Kubeflow. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint. - Easily package ML training jobs: Enable ML practitioners to easily package their ML model training code, and their code’s dependencies, as a Docker image. - Easily train ML models in a hybrid cloud environment: Provide a high-level API for training ML models to make it easy to run training jobs in the cloud, without needing to understand the underlying infrastructure. - Streamline the process of deploying a trained model: Make it easy for ML practitioners to deploy trained ML models to a hybrid cloud environment.
  20. Hyperparameter are the parameters that are specified for an algorithm before the learning/training begins. So, lets say a data scientist has chosen an algorithm, that will then define what kind of hyperparameters can be specified. For example, learning rate, batch size, number of epochs, maximum depth allowed for the decision tree, number of trees in random forest, number of neurons in neural network layer, how many layers in my neural network?. Once the hyperparameters are chosen, then multiple training runs are conducted with different values of hyperparameters. A model is generated after each run and evaluated for optimality such as time taken to complete the training, error rate, and accuracy. There are methods like grid search, random search, and Bayesian optimization to define the spectrum of values of hyperparameters. Katib means secretary or scribe in Arabic. As Vizier stands for a high official or a prime minister in Arabic, this project Katib is named in the honor of Vizier. Extensible Framework agnostic: TensorFlow, PyTorch, MXNet, … Customizable algorithm backend Experiment: “optimization loop” for some specific problem Suggestion: a proposed solution to the problem Trial: one iteration of the loop Job: evaluate a trial and calculate objective value
  21. HPO is hard because more hyperparameters means exponential space growth. Also tuning by hand is not efficient. May be you want to anchor on one particular value, and then vary others. For example, batch size and then vary learning rate to find optimal values. You want to be able to track metrics across different jobs. You can choose different frameworks like TensorFlow, PyTorch or MXNet. Algorithm could be random search, grid search, or bayesian optimization.
  22. Hyperparameter are the parameters that are specified for an algorithm before the learning/training begins. So, lets say a data scientist has chosen an algorithm, that will then define what kind of hyperparameters can be specified. For example, learning rate, batch size, number of epochs, maximum depth allowed for the decision tree, number of trees in random forest, number of neurons in neural network layer, how many layers in my neural network?. Once the hyperparameters are chosen, then multiple training runs are conducted with different values of hyperparameters. A model is generated after each run and evaluated for optimality such as time taken to complete the training, error rate, and accuracy. There are methods like grid search, random search, and Bayesian optimization to define the spectrum of values of hyperparameters. HPO is hard because more hyperparameters means exponential space growth. Also tuning by hand is error Katib means secretary or scribe in Arabic. As Vizier stands for a high official or a prime minister in Arabic, this project Katib is named in the honor of Vizier. Extensible Framework agnostic: TensorFlow, PyTorch, MXNet, … Customizable algorithm backend Experiment: “optimization loop” for some specific problem Suggestion: a proposed solution to the problem Trial: one iteration of the loop Job: evaluate a trial and calculate objective value
  23. The lines go from right to left indicating the three parameters that are used for each run of the experiment. The left side shows validation-accuracy and accuracy for the training output, they both use different datasets. The left most columns match the required objective specified in the Experiment. Now, we can choose the best/optimized value of validation accuracy and accuracy, identify the corresponding parameters and use them for training.
  24. One of the most important aspects of Building Cloud Native is knowing your responsibilities and where ever possible leveraging the awesome landscape of Cloud Native technologies. KFServing has chosen Istio and Knative to solve core serverless, networking, and revision management problems, and KFServing decorate those layers with ML specific opinions. It builds on top of Kubernetes, so you have full control up and down the stack for whatever you need. Because the whole stack is open, there is a privilege and responsibility to upstream functionality. If ML customers have networking or serverless requirements, the KFServing community will deliver them, but not in our code. We will move down the stack as far as we can, and contribute the code where it serves its widest purpose. This makes the entire ecosystem stronger, and avoids reinventing the wheel over and over. KFServing’s mission to build an ML serving platform that is simple yet powerful. It focuses on data scientists, which means it build concepts that make sense in their domain. It aims to solve production model serving use cases by providing performant, high abstraction interfaces for common ML frameworks like Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX. Low Bar High Ceiling More recent version more extensible with concept of transformers – essentially a wrapper function you can call before inferencing to get more information, for example take a user id and get location, and past purchase history
  25. We started with our interface. This was one of the most important thing to get right. We needed to find the lowest floor possible for Data Scientists. Unlike other serving platforms, one of our key decisions was to choose a non-container interface. Most ML frameworks support model serialization to a file, so we picked the most common frameworks and built or found a set of out of the box servers to support them. These servers would simply load the serialized model and start an http endpoint. That meant no custom servers, no containers, no readiness checks, no bespoke code, all the complexity around server management was simple handled -- and it just worked. In about 8 lines, you could describe all of the infrastructure you needed to get your model up and running. 
  26. Keep in mind that principle of Low floor with a High Ceiling. We’d found a pretty low floor, but where can our users go from there. They might want to manage Replica Limits, Service Accounts, or even specify resource requests like GPUs. Everything extends cleanly and consistently using Cloud Native Terminology that users might be familiar with or might encounter in the future.  In the first half of this spec, our users workload scales between 3 and 10 replicas, with 2 cores and a gpu per replica. They’ve also specified a service account that grants permission. All of these features should feel very familiar if you’ve ever deployed with Kubernetes, and in fact, that’s what we’re passing through to kubernetes under the hood. One of our principles was a single resource semantic. We wanted to see if we take all of the infrastructure related to a single model, and contain it within a single resource. One good example of this is the second chunk of this spec which is our canary specification. This allows users to specify a second serving configuration, and experiment on it with a percentage of their traffic. In other systems, this would mean a ton of extra resources to wire up routing configurations and two full stacks, but in KFServing you can simply copy your default spec, tweak it, and rename it canary. In our example, the only difference between default and canary is the pointer to the storageUri.  Without a single resource semantic, a data scientist would need to know whether that all of their configurations were applied together and at the same time. If one of the resources failed to apply, they might get stuck in limbo. Because we’ve simplified this structure into a single resource, those risks don’t even need to enter their mind.
  27. Training large models on vast amounts of data can drastically improve model performance. But consider a deep network with millions of parameters, how do we achieve this without waiting for days, or even multiple weeks? That’s where distributed training comes. It allows us train and serve a model on multiple physical machines. This can be achieved using model parallelism and data parallelism. When a big model can not fit into a single node's memory, model parallel training can be employed to handle the big model. Data parallelism allows to distribute the data between different tasks. "Data Parallelism" is the most common training configuration, it involves multiple tasks in a worker job training the same model on different mini-batches of data, updating shared parameters hosted in one or more tasks in a ps (parameter server) job. All tasks typically run on different machines or containers.  Distributed training in Kubeflow is provided using TFJob Use a different view -> Ps parameter. If you have fast link with NCCL, distributed training performance is better with sync parameter server instead of async method
  28. We wrote a blog post that discussed best practices to optimize machine learning training performance on Amazon EKS to improve the throughput and minimize training times. We used Kubeflow and FSx for CSI driver. ResNet-50 (kind of network) with ImageNet database. We trained using mixed precision on 20 P3.16xlarge instances (160 V100 GPUs) with a batch size of 256 per GPU (aggregate batch size of ~41k). To achieve better scaling efficiency, we used Horovod and TensorFlow, We observed near-linear scaling, between 90-100% scaling efficiency up to 160 GPUs and 98k images per second.
  29. Enable and simplify the orchestration of end to end ML pipelines. ML workflow include all of the components that make up the steps in the workflow and how the components interact with each other. It makes it easy to try numerous ideas and techniques, and manage your various trials/experiments. Also enable to re-use components and pipelines to quickly cobble together end to end solutions, without having to re-build each time. kfp.compiler includes classes and methods for building Docker container images for your pipeline components. kfp.components includes classes and methods for interacting with pipeline components. kfp.Client contains the Python client libraries and allows to run a pipeline and create an experiment
  30. You can create pipeline directly using YAML file, or create pipeline using SDK Several different ways to run a pipeline Directly from the UI Invoke it from the SDK Setting up a schedule Pipeline run metadata is stored in Kubeflow DB
  31. Write your application code, my-app-code.py. For example, write code to transform data or train a model. Create a Docker container image that packages your program (my-app-code.py) and upload the container image to a registry. To build a container image based on a given Dockerfile, you can use the Docker command-line interface or the kfp.compiler.build_docker_image method from the Kubeflow Pipelines SDK. Write a component function using the Kubeflow Pipelines DSL to define your pipeline’s interactions with the component’s Docker container. Your component function must return a kfp.dsl.ContainerOp. Write a pipeline function using the Kubeflow Pipelines DSL to define the pipeline and include all the pipeline components. Use the kfp.dsl.pipeline decorator to build a pipeline from your pipeline function. Compile the pipeline to generate a compressed YAML definition of the pipeline. The Kubeflow Pipelines service converts the static configuration into a set of Kubernetes resources for execution. Use the Kubeflow Pipelines SDK to run the pipeline.
  32. Your ML workflows generate a lot of metadata. This is information such as executions (runs), models, datasets, and other artifacts. Artifacts are the files and objects that form the inputs and outputs of the components in your ML workflow. As you start trying out multiple combinations, you need a way to manage all of this metadata together. That’s exactly the purpose of Metadata project. It tracks and manages the metadata that the workflows produce. Metadata components comes pre-installed in 0.7. This is currently an Alpha version and development team is interested in your feedback.
  33. Manage EKS cluster provision with eksctl and provide flexibility to start different flavor of GPU nodes. Manage external traffic with AWS ALB Ingress Controller. Traffic will go through ALB Ingress controller to Istio-Gateway and then forward to ambassador inside cluster. Leverage Amazon FSx CSI driver to manage Lustre file system which is optimized for compute-intensive workloads, such as high-performance computing and machine learning. It can scale to hundreds of GBps of throughput and millions of IOPS. Centralized and unified Kubernetes cluster logs in CloudWatch which helps debugging and troubleshooting. Enable TLS and Authentication with AWS Certificate Manager and AWS Cognito Enable Private Access for your Kubernetes cluster's API server endpoint Automatically detect GPU instance and install Nvidia Device Plugin
  34. As Kubeflow continues to evolve, you can be assured it will continue to work well on AWS
  35. If you are modeling a taxi service then Driver might be an entity and daily count trips might be a feature. Other interesting features might be the distance between the driver and a destination, or the time of day. A combination of multiple features are used as inputs for a machine learning model. As your ML workloads scale, features play an important role for both training and serving. Typical challenges: Features not being reused: Features representing the same business concepts are being redeveloped many times, when existing work from other teams could have been reused. Feature definitions vary: Teams define features differently and there is no easy access to the documentation of a feature. Hard to serve up to date features: Combining streaming and batch derived features, and making them available for serving, requires expertise that not all teams have. Ingesting and serving features derived from streaming data often requires specialised infrastrastructure. As such, teams are deterred from making use of real time data. Inconsistency between training and serving: Training requires access to historical data, whereas models that serve predictions need the latest values. Inconsistencies arise when data is siloed into many independent systems requiring separate tooling. Feast is an open source feature store that can be integrated with Kubeflow to address the feature storage needs. Feast solutions: Discoverability and reuse of features: A centralized feature store allows organizations to build up a foundation of features that can be reused across projects. Teams are then able to utilize features developed by other teams, and as more features are added to the store it becomes easier and cheaper to build models. Access to features for training: Feast allows users to easily access historical feature data. This allows users to produce datasets of features for use in training models. ML practitioners can then focus more on modelling and less on feature engineering. Access to features in serving: Feature data is also available to models in production through a feature serving API. The serving API has been designed to provide low latency access to the latest feature values. Consistency between training and serving: Feast provides consistency by managing and unifying the ingestion of data from batch and streaming sources, using Apache Beam, into both the feature warehouse and feature serving stores. Users can query features in the warehouse and the serving API using the same set of feature identifiers. Standardization of features: Teams are able to capture documentation, metadata and metrics about features. This allows teams to communicate clearly about features, test features data, and determine if a feature is useful for a particular model.
  36. Amazon SageMaker is investing in delivering the best experience for Machine Learning (ML) on Kubernetes by creating Kubernetes operators and Kubeflow Pipeline Components for SageMaker services. Kubernetes users will be able to use managed ML services for training, model tuning, and inference without leaving their Kubernetes environments and pipelines and without learning SageMaker APIs. Miran makes DevOps easier giving customers the benefits of a managed service for ML in Kubernetes, without migrating their workload from Kubernetes and without learning SageMaker APIs. Customers lower training costs by not paying for idle GPU instances. With SageMaker operators and pipeline components for training and model tuning, GPU resources are fully managed by SageMaker and utilized only for the duration of a job. Customers with 30% or higher idle GPU resources on their local self-managed resources for training would see a reduction in total cost by using SageMaker operators and pipeline components. Customers can create hybrid pipelines with SageMaker pipeline components for Kubeflow that can seamlessly execute jobs on AWS, on-premise resources, and other cloud providers.