© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Train and deploy your Machine Learning workloads with
AWS container services
Julien Simon
Global Evangelist, AI & Machine Learning, AWS
@julsimon
Floor 28, Tel Aviv, July 8th, 2019
Your first Machine Learning project
• You’ve trained a model on a local machine, using a popular open source library.
• You’ve measured the model’s accuracy, and things look good. Now you’d like to
deploy it to check its actual behaviour, to run A/B tests, etc.
• You’ve embedded the model in your business application.
• You’ve deployed everything to a single Ubuntu virtual machine in the cloud.
• Everything works, you’re serving predictions, life is good! (see the sketch below)
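A minimal sketch of what that single-instance setup might look like, assuming a Keras model saved as model.h5 and a Flask endpoint (hypothetical choices, not the actual code from the talk):

# Hypothetical single-instance prediction service: a Keras model behind Flask.
# Run with: python predict.py (listens on port 8080)
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
model = tf.keras.models.load_model('model.h5')  # the model trained on your local machine

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON payload such as {"instances": [[...feature values...]]}
    payload = request.get_json()
    x = np.array(payload['instances'])
    scores = model.predict(x)
    return jsonify({'predictions': scores.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)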
Score card (Single EC2 instance)
Infrastructure effort: C’mon, it’s just one instance
ML setup effort: pip install tensorflow
CI/CD integration: Not needed
Build models: DIY
Train models: python train.py
Deploy models (at scale): python predict.py
Scale/HA inference: Not needed
Optimize costs: Not needed
Security: Not needed
Scaling alert!
• More customers, more team members, more models, woohoo!
• Scalability, high availability & security are now a thing
• Scaling up is a losing proposition. You need to scale out
• Only automation can save you: IaC, CI/CD and all that good DevOps stuff
• What are your options?
Option 1: virtual machines
• Definitely possible, but:
• Why? Seriously, I want to know.
• Operational and financial issues await if you don’t automate extensively
• Training
• Build on-demand clusters with CloudFormation, Terraform, etc.
• Distributed training is a pain to set up
• Prediction
• Automate deployment with CI/CD
• Scale with Auto Scaling, Load Balancers, etc.
• Spot, spot, spot (see the launch sketch after this list)
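As a rough illustration of the DIY route, here is a hedged boto3 sketch that launches a Spot GPU instance and runs a training script from user data. The AMI ID, key pair, bucket and script names are placeholders, not values from the talk:

# Hypothetical sketch: launch a Spot GPU instance and run a training script at boot.
# AMI ID, key pair and S3 paths are placeholders; substitute your own.
import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

user_data = """#!/bin/bash
source activate tensorflow_p36             # Conda environment on the Deep Learning AMI
aws s3 cp s3://my-bucket/train.py .        # placeholder script location
python train.py
aws s3 cp model.h5 s3://my-bucket/models/  # save the trained model
shutdown -h now                            # stop paying once training is done
"""

response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',       # placeholder: a Deep Learning AMI ID
    InstanceType='p3.2xlarge',
    MinCount=1,
    MaxCount=1,
    KeyName='my-key-pair',
    UserData=user_data,
    InstanceMarketOptions={'MarketType': 'spot'},   # "Spot, spot, spot"
)
print(response['Instances'][0]['InstanceId'])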
AWS Deep Learning AMIs
Optimized environments on Amazon Linux or Ubuntu
Conda AMI
For developers who want pre-installed pip packages of DL frameworks in separate virtual environments.
Base AMI
For developers who want a clean slate to set up private DL engine repositories or custom builds of DL engines.
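Either AMI can be located programmatically instead of hard-coding an AMI ID; a hedged sketch (the name filter is an assumption, adjust it to the flavour you want):

# Hypothetical sketch: find the most recent Deep Learning AMI (Ubuntu) in a region.
import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

images = ec2.describe_images(
    Owners=['amazon'],
    Filters=[{'Name': 'name', 'Values': ['Deep Learning AMI (Ubuntu)*']}],
)['Images']

# Pick the most recently created image
latest = max(images, key=lambda image: image['CreationDate'])
print(latest['ImageId'], latest['Name'])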
Score card (More EC2 instances)
Infrastructure effort: Lots
ML setup effort: Some (DL AMI)
CI/CD integration: No change
Build models: DIY
Train models: DIY
Deploy models: DIY (model servers)
Scale/HA inference: DIY (Auto Scaling, LB)
Optimize costs: DIY (Spot, automation)
Security: DIY (IAM, VPC, KMS)
Option 2: Docker clusters
• This makes a lot of sense if you’re already deploying apps to Docker
• No change to the dev experience: same workflows, same CI/CD, etc.
• Deploy prediction services on the same infrastructure as business apps.
• Amazon ECS and Amazon EKS
• Lots of flexibility: mixed instance types (including GPUs), placement constraints, etc.
• Both come with AWS-maintained AMIs that will save you time
• One cluster or many clusters?
• Build on-demand development and test clusters with CloudFormation, Terraform, etc.
• Many customers find that running a large single production cluster works better
• Still instance-based and not fully managed
• Not a hands-off operation: services / deployments, service discovery, etc. are nice, but you still have work to do (see the sketch after this list)
• And yes, this matters even if « someone else is taking care of clusters »
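To make the "same workflows, same CI/CD" point concrete, here is a hedged boto3 sketch that registers a task definition for a prediction container and runs it as an ECS service. The cluster name, image URI and port are placeholders:

# Hypothetical sketch: deploy a prediction container as an ECS service.
# Cluster name, image URI and port are placeholders.
import boto3

ecs = boto3.client('ecs', region_name='eu-west-1')

task_def = ecs.register_task_definition(
    family='prediction-service',
    requiresCompatibilities=['EC2'],
    containerDefinitions=[{
        'name': 'model-server',
        'image': '123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-model-server:latest',
        'cpu': 1024,        # 1 vCPU
        'memory': 2048,     # hard memory limit, in MiB
        'portMappings': [{'containerPort': 8080, 'hostPort': 8080}],
        'essential': True,
    }],
)

ecs.create_service(
    cluster='my-production-cluster',
    serviceName='prediction-service',
    taskDefinition=task_def['taskDefinition']['taskDefinitionArn'],
    desiredCount=2,         # ECS keeps two copies of the task running
    launchType='EC2',
)

The same images, task definitions and CI/CD pipeline you already use for business services apply here, which is the whole appeal of this option.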
https://gitlab.com/juliensimon/dlcontainers/tree/master/ecs
https://gitlab.com/juliensimon/dlcontainers/tree/master/eks
Score card (EC2 | ECS / EKS)
Infrastructure effort: Lots | Some (Docker tools)
ML setup effort: Some (DL AMI) | Some (DL containers)
CI/CD integration: No change | No change
Build models: DIY | DIY
Train models (at scale): DIY | DIY (Docker tools)
Deploy models (at scale): DIY (model servers) | DIY (Docker tools)
Scale/HA inference: DIY (Auto Scaling, LB) | DIY (Services, pods, etc.)
Optimize costs: DIY (Spot, RIs, automation) | DIY (Spot, RIs, automation)
Security: DIY (IAM, VPC, KMS) | DIY (IAM, VPC, KMS)
What about AWS Fargate?
• AWS Fargate lets you run containers without having to manage servers
• Per-container configuration
• 0.25 vCPU: 0.5 GB, 1 GB, or 2 GB
• 0.5 vCPU: 1 GB to 4 GB, in 1 GB increments
• 1 vCPU: 2 GB to 8 GB, in 1 GB increments
• 2 vCPU: 4 GB to 16 GB, in 1 GB increments
• 4 vCPU: 8 GB to 30 GB, in 1 GB increments
• $0.04048 / vCPU / hour, $0.004445 / GB / hour
• No GPU available
• AWS Deep Learning Containers not officially supported (see the run_task sketch after this list)
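At these rates, a 1 vCPU / 2 GB inference container costs about $0.049 per hour, roughly $36 a month if it runs around the clock. Launching one could look like the hedged sketch below; the cluster, task definition, subnet and security group IDs are placeholders:

# Hypothetical sketch: run a prediction container on Fargate (no instances to manage).
# Cluster, task definition, subnet and security group IDs are placeholders.
import boto3

ecs = boto3.client('ecs', region_name='eu-west-1')

ecs.run_task(
    cluster='my-fargate-cluster',
    taskDefinition='prediction-service',   # must declare FARGATE compatibility, cpu=1024, memory=2048
    launchType='FARGATE',
    count=1,
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-0123456789abcdef0'],
            'securityGroups': ['sg-0123456789abcdef0'],
            'assignPublicIp': 'ENABLED',
        }
    },
)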
https://gitlab.com/juliensimon/dlnotebooks/blob/master/keras/04-fashion-mnist-sagemaker-advanced/Fashion%20MNIST-SageMaker-Fargate.ipynb
Score card (EC2 | ECS / EKS | Fargate for ECS)
Infrastructure effort: Lots | Some (Docker tools) | None
ML setup effort: Some (DL AMI) | Some (DL containers) | Some (DIY containers)
CI/CD integration: No change | No change | No change
Build models: DIY | DIY | DIY
Train models (at scale): DIY | DIY (Docker tools) | Not compelling
Deploy models (at scale): DIY (model servers) | DIY (Docker tools) | DIY (Docker tools)
Scale/HA inference: DIY (Auto Scaling, LB) | DIY (Services, pods, etc.) | DIY (Auto Scaling, LB)
Optimize costs: DIY (Spot, RIs, automation) | DIY (Spot, RIs, automation) | DIY (automation)
Security: DIY (IAM, VPC, KMS) | DIY (IAM, VPC, KMS) | DIY (IAM, VPC, KMS)
Option 3: go fully managed with Amazon SageMaker
Model options on Amazon SageMaker
Training code:
Built-in Algorithms (17)
• Factorization Machines, Linear Learner, Principal Component Analysis, K-Means Clustering, XGBoost, and more
• No ML coding required
• No infrastructure work required
• Distributed training
• Pipe mode
Built-in Frameworks
• Bring your own code: script mode
• Open source containers
• No infrastructure work required
• Distributed training
• Pipe mode
Bring Your Own Container
• Full control, run anything! (R, C++, etc.)
• No infrastructure work required
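The next two slides show the built-in framework (script mode) path. For comparison, a built-in algorithm needs no training code at all; here is a hedged sketch using the 2019-era v1 SageMaker Python SDK (the same generation as the code on the next slides), with bucket names as placeholders:

# Hypothetical sketch: train the built-in XGBoost algorithm, no training script required.
# Bucket names and hyperparameters are placeholders.
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in algorithms ship as containers; look up the XGBoost image for this region
container = get_image_uri(session.boto_region_name, 'xgboost', '0.90-1')

xgb = sagemaker.estimator.Estimator(container,
                                    role=role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m5.xlarge',
                                    output_path='s3://my-bucket/output',
                                    sagemaker_session=session)
xgb.set_hyperparameters(objective='binary:logistic', num_round=100)

xgb.fit({'train': sagemaker.s3_input('s3://my-bucket/train/', content_type='text/csv')})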
Training and deploying
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
                          role=role,
                          train_instance_count=1,
                          train_instance_type='ml.c5.2xlarge',
                          framework_version='1.12',
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={
                              'epochs': 10,
                              'learning-rate': 0.01})

tf_estimator.fit(data)

# HTTPS endpoint backed by a single instance
tf_endpoint = tf_estimator.deploy(initial_instance_count=1, instance_type='ml.t3.xlarge')

tf_endpoint.predict(…)
Training and deploying, at any scale
tf_estimator = TensorFlow(entry_point='my_crazy_cnn.py',
                          role=role,
                          train_instance_count=8,
                          train_instance_type='ml.p3.16xlarge',  # Total of 64 GPUs
                          framework_version='1.12',
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={
                              'epochs': 200,
                              'learning-rate': 0.01})

tf_estimator.fit(data)

# HTTPS endpoint backed by 16 multi-AZ, load-balanced instances
tf_endpoint = tf_estimator.deploy(initial_instance_count=16, instance_type='ml.p3.2xlarge')

tf_endpoint.predict(…)
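The score card below credits SageMaker with built-in scaling for inference; in practice that is a couple of Application Auto Scaling calls against the endpoint. A hedged sketch, with the endpoint and variant names as placeholders:

# Hypothetical sketch: add target-tracking auto scaling to a SageMaker endpoint.
# Endpoint and variant names are placeholders.
import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/my-tf-endpoint/variant/AllTraffic'

autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,
    MaxCapacity=16,
)

autoscaling.put_scaling_policy(
    PolicyName='keep-invocations-per-instance-in-check',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1000.0,   # invocations per instance, per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
    },
)

When you are done experimenting, tf_endpoint.delete_endpoint() tears the endpoint down and stops the billing.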
Score card (EC2 | ECS / EKS | Fargate for ECS | SageMaker)
Infrastructure effort: Lots | Some (Docker tools) | None | None
ML setup effort: Some (DL AMI) | Some (DL containers) | Some (DIY containers) | Minimal
CI/CD integration: No change | No change | No change | Some (SDK, Step Functions)
Build models: DIY | DIY | DIY | 17 built-in algorithms
Train models (at scale): DIY | DIY (Docker tools) | Not compelling | 2 LOCs
Deploy models (at scale): DIY (model servers) | DIY (Docker tools) | DIY (Docker tools) | 1 LOC
Scale/HA inference: DIY (Auto Scaling, LB) | DIY (Services, pods, etc.) | DIY (Auto Scaling, LB) | Built-in
Optimize costs: DIY (Spot, RIs, automation) | DIY (Spot, RIs, automation) | DIY (automation) | On-demand training, Auto Scaling for inference
Security: DIY (IAM, VPC, KMS) | DIY (IAM, VPC, KMS) | DIY (IAM, VPC, KMS) | API parameters
Personal opinion: Small scale only, unless… | Reasonable choice if you’re… | Same as ECS/EKS, just less infra work; inference only IMHO | Learn it in a few hours, forget about servers, focus 100% on ML, enjoy
Conclusion
• Whatever works for you at this time is fine
• Don’t over-engineer, and don’t « plan for the future »
• Fight « we’ve always done it this way », NIH, and Hype Driven Development
• Optimize for current business conditions, pay attention to TCO
• Models and data matter, not infrastructure
• When conditions change, move fast: smash and rebuild
• ... which is what cloud is all about!
• « 100% of our time spent on ML » shall be the whole of the Law
• Mix and match
• Train on SageMaker, deploy on ECS/EKS… or vice versa
• Write your own story!
Getting started
http://aws.amazon.com/free
https://aws.ai
https://aws.amazon.com/machine-learning/amis/
https://aws.amazon.com/machine-learning/containers/
https://aws.amazon.com/sagemaker
https://github.com/aws/sagemaker-python-sdk
https://github.com/awslabs/amazon-sagemaker-examples
https://medium.com/@julsimon
https://gitlab.com/juliensimon/dlcontainers DL AMI / container demos
https://gitlab.com/juliensimon/dlnotebooks SageMaker notebooks
Thank you!
Julien Simon
Global Evangelist, AI & Machine Learning, AWS
@julsimon