SlideShare a Scribd company logo
1 of 26
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Scale Machine Learning from zero to
millions of users
Julien Simon
GlobalTechnical Evangelist, AI & Machine Learning, AWS
@julsimon
Rationale
How to train ML models and deploy them in production, from
humble beginnings to world domination
Try to take reasonable and justified steps
Longer, more opinionated version: https://medium.com/@julsimon/scaling-machine-
learning-from-0-to-millions-of-users-part-1-a2d36a5e849
And so itbegins
• You’ve trained a model on a local machine, using a popular open source library.
• You’ve measured the model’s accuracy, and things look good.
• Now you’d like to deploy it to check its actual behaviour, to run A/B tests, etc.
• You’ve embedded the model in your business application.
• You’ve deployed everything to a single virtual machine in the cloud.
• Everything works, you’re serving predictions, life is good!
Score card
Single EC2 instance
Infrastructure effort C’mon, it’s just one instance
ML setup effort pip install tensorflow
CI/CD integration Not needed
Build models DIY
Train models python train.py
Deploy models (at scale) python predict.py
Scale/HA inference Not needed
Optimize costs Not needed
Security Not needed
A fewinstancesand models later…
• Life is not that good
• Too much manual work
• Time-consuming and error-prone
• Dependency hell
• No cost optimization
• Monolithic architecture
• Deployment hell
• Multiple apps can’t share the same model
• Apps and models scale differently
Use AWS-maintained tools
• Deep Learning Amazon Machine Images
• Deep Learning containers
Dockerize
Create a prediction service
• Model servers
• Bespoke API (Flask?)
AWS Deep LearningAMIs andContainers
Optimized environments on Amazon Linux or Ubuntu
Conda AMI
For developers who want pre-
installed pip packages of DL
frameworks in separate virtual
environments.
Base AMI
For developers who want a clean
slate to set up private DL engine
repositories or custom builds of DL
engines.
Containers
For developers who want pre-
installed containers for DL
frameworks (TensorFlow, PyTorch,
Apache MXNet)
Scaling alert!
• More customers, more team members, more models, woohoo!
• Scalability, high availability & security are now a thing
• Scaling up is a losing proposition.You need to scale out
• Only automation can save you:
IaC, CI/CD and all that good DevOps stuff
• What are your options?
Option 1:virtualmachines
• Definitely possible, but:
• Why? Seriously, I want to know.
• Operational and financial issues await if you don’t automate extensively
• Training
• Build on-demand clusters with CloudFormation,Terraform, etc.
• Distributed training is a pain to set up
• Prediction
• Automate deployement with CI/CD
• Scale with Auto Scaling, Load Balancers, etc.
• Spot, spot, spot
Score card
More EC2 instances
Infrastructure effort Lots
ML setup effort Some (DLAMI)
CI/CD integration No change
Build models DIY
Train models DIY
Deploy models DIY (model servers)
Scale/HA inference DIY (Auto Scaling, LB)
Optimize costs DIY (Spot, automation)
Security DIY (IAM,VPC, KMS)
Option 2:Docker clusters
• This makes a lot of sense if you’re already deploying apps to Docker
• No change to the dev experience: same workflows, same CI/CD, etc.
• Deploy prediction services on the same infrastructure as business apps.
• Amazon ECS and Amazon EKS
• Lots of flexibility: mixed instance types (including GPUs), placement constraints, etc.
• Both come with AWS-maintainedAMIs that will save you time
• One cluster or many clusters ?
• Build on-demand development and test clusters with CloudFormation,Terraform, etc.
• Many customers find that running a large single production cluster works better
• Still instance-based and not fully-managed
• Not a hands-off operation: services / pods, service discovery, etc. are nice but you still have work to do
• And yes, this matters even if « someone else is taking care of clusters »
Score card
EC2 ECS / EKS
Infrastructure effort Lots Some (Docker tools)
ML setup effort Some (DLAMI) Some (DL containers)
CI/CD integration No change No change
Build models DIY DIY
Train models (at scale) DIY DIY (Docker tools)
Deploy models (at scale) DIY (model servers) DIY (Docker tools)
Scale/HA inference DIY (Auto Scaling, LB) DIY (Services, pods, etc.)
Optimize costs DIY (Spot, RIs, automation) DIY (Spot, RIs, automation)
Security DIY (IAM,VPC, KMS) DIY (IAM,VPC, KMS)
Option 3:go fullymanaged withAmazonSageMaker
1
2
3
Model options on Amazon SageMaker
Training code
Factorization Machines
Linear Learner
Principal Component Analysis
K-Means Clustering
Image classification
And more
Built-in Algorithms (17)
No ML coding required
No infrastructure work required
Distributed training
Pipe mode
BringYour Own Container
Full control, run anything!
R, C++, etc.
No infrastructure work required
Built-in Frameworks
Bring your own code: script mode
Open source containers
No infrastructure work required
Distributed training
Pipe mode
TheAmazonSageMakerAPI
• Python SDK orchestrating all Amazon SageMaker activity
• High-level objects for algorithm selection, training, deploying,
automatic model tuning, etc.
https://github.com/aws/sagemaker-python-sdk
• Spark SDK (Python & Scala)
https://github.com/aws/sagemaker-spark/tree/master/sagemaker-spark-sdk
• AWS SDK
• For scripting and automation
• CLI : ‘aws sagemaker’
• Language SDKs: boto3, etc.
Training and deploying
tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
role=role,
train_instance_count=1,
train_instance_type='ml.c5.2xlarge’,
framework_version='1.12',
py_version='py3',
script_mode=True,
hyperparameters={
'epochs': 10,
'learning-rate': 0.01})
tf_estimator.fit(data)
# HTTPS endpoint backed by a single instance
tf_endpoint = tf_estimator.deploy(initial_instance_count=1, instance_type=ml.t3.xlarge)
tf_endpoint.predict(…)
Training and deploying, atany scale
tf_estimator = TensorFlow(entry_point=’my_crazy_cnn.py',
role=role,
train_instance_count=8,
train_instance_type='ml.p3.16xlarge', # Total of 64 GPUs
framework_version='1.12',
py_version='py3',
script_mode=True,
hyperparameters={
'epochs': 200,
'learning-rate': 0.01})
tf_estimator.fit(data)
# HTTPS endpoint backed by 16 multi-AZ load-balanced instances
tf_endpoint = tf_estimator.deploy(initial_instance_count=16, instance_type=ml.p3.2xlarge)
tf_endpoint.predict(…)
Score card
EC2 ECS / EKS SageMaker
Infrastructure effort Maximal Some (Docker tools) None
ML setup effort Some (DLAMI) Some (DL containers) Minimal
CI/CD integration No change No change Some (SDK, Step Functions)
Build models DIY DIY 17 built-in algorithms
Train models (at scale) DIY DIY (Docker tools) SDK: 2 LOCs
Deploy models (at scale) DIY (model servers) DIY (Docker tools) SDK: 1 LOCs
Kubernetes support
Scale/HA inference DIY (Auto Scaling, LB) DIY (Services, pods, etc.) Built-in
Optimize costs DIY (Spot, RIs, automation) DIY (Spot, RIs, automation) On-demand/Spot training,
Auto Scaling for inference
Security DIY (IAM,VPC, KMS) DIY (IAM,VPC, KMS) API parameters
Score card
Flamewarin3,2,1…
EC2 ECS / EKS SageMaker
Infrastructure effort Maximal Some (Docker tools) None
ML setup effort Some (DLAMI) Some (DL containers) Minimal
CI/CD integration No change No change Some (SDK, Step Functions)
Build models DIY DIY 17 built-in algorithms
Train models (at scale) DIY DIY (Docker tools) 2 LOCs
Deploy models (at scale) DIY (model servers) DIY (Docker tools) SDK: 1 LOCs
Kubernetes support
Scale/HA inference DIY (Auto Scaling, LB) DIY (Services, pods, etc.) Built-in
Optimize costs DIY (Spot, RIs, automation) DIY (Spot, RIs, automation) Spot training,
Auto Scaling for inference
Security DIY (IAM,VPC, KMS) DIY (IAM,VPC, KMS) API parameters
Personal opinion Small scale only, unless you have
strong DevOps skills and enjoy
exercising them.
Reasonable choice if you’re a Docker
Docker shop, and if you’re able and
and willing to integrate with the
Docker/OSS ecosystem. If not, I’d
think twice: Docker isn’t an ML
platform.
Learn it in a few hours, forget
about servers, focus 100% on
ML, enjoy goodies like pipe
mode, distributed training, HPO,
HPO, inference pipelines and
more.
Conclusion
• Whatever works for you at this time is fine
• Don’t over-engineer, and don’t « plan for the future »
• Optimize for current business conditions, pay attention toTCO
• Models and data matter, not infrastructure
• When conditions change, move fast: smash and rebuild
• ... which is what cloud is all about!
• « 100% of our time spent on ML » shall be the whole of the Law
• Mix and match if it makes sense
• Train on SageMaker, deploy on ECS/EKS… or vice versa
• Write your own story!
Getting started
https://aws.amazon.com/machine-learning/amis/
https://aws.amazon.com/machine-learning/containers/
https://aws.amazon.com/sagemaker
https://github.com/aws/sagemaker-python-sdk
https://github.com/awslabs/amazon-sagemaker-examples
https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_operators_for_kubernetes.html
https://medium.com/@julsimon
https://youtube.com/juliensimonfr
https://gitlab.com/juliensimon/dlcontainers DL AMI / container demos
https://gitlab.com/juliensimon/dlnotebooks SageMaker notebooks
Thank you!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Julien Simon
GlobalTechnical Evangelist, AI & Machine Learning, AWS
@julsimon

More Related Content

What's hot

Scaling Machine Learning from zero to millions of users (May 2019)
Scaling Machine Learning from zero to millions of users (May 2019)Scaling Machine Learning from zero to millions of users (May 2019)
Scaling Machine Learning from zero to millions of users (May 2019)Julien SIMON
 
Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)Julien SIMON
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)Julien SIMON
 
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)Julien SIMON
 
Build, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfBuild, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfAmazon Web Services
 
Become a Machine Learning developer with AWS services (May 2019)
Become a Machine Learning developer with AWS services (May 2019)Become a Machine Learning developer with AWS services (May 2019)
Become a Machine Learning developer with AWS services (May 2019)Julien SIMON
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)Julien SIMON
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseAmazon Web Services
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketAmazon Web Services
 
Machine Learning: From Notebook to Production with Amazon Sagemaker (April 2018)
Machine Learning: From Notebook to Production with Amazon Sagemaker (April 2018)Machine Learning: From Notebook to Production with Amazon Sagemaker (April 2018)
Machine Learning: From Notebook to Production with Amazon Sagemaker (April 2018)Julien SIMON
 
Machine Learning: From Notebook to Production with Amazon Sagemaker
Machine Learning: From Notebook to Production with Amazon SagemakerMachine Learning: From Notebook to Production with Amazon Sagemaker
Machine Learning: From Notebook to Production with Amazon SagemakerAmazon Web Services
 
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAmazon Web Services
 
End-to-End Machine Learning with Amazon SageMaker
End-to-End Machine Learning with Amazon SageMakerEnd-to-End Machine Learning with Amazon SageMaker
End-to-End Machine Learning with Amazon SageMakerSungmin Kim
 
Demystifying Amazon Sagemaker (ACD Kochi)
Demystifying Amazon Sagemaker (ACD Kochi)Demystifying Amazon Sagemaker (ACD Kochi)
Demystifying Amazon Sagemaker (ACD Kochi)AWS User Group Pune
 
Optimize your Machine Learning workloads (April 2019)
Optimize your Machine Learning workloads (April 2019)Optimize your Machine Learning workloads (April 2019)
Optimize your Machine Learning workloads (April 2019)Julien SIMON
 
AWS ML and SparkML on EMR to Build Recommendation Engine
AWS ML and SparkML on EMR to Build Recommendation Engine AWS ML and SparkML on EMR to Build Recommendation Engine
AWS ML and SparkML on EMR to Build Recommendation Engine Amazon Web Services
 
AWS re:Invent 2016: Transforming Industrial Processes with Deep Learning (MAC...
AWS re:Invent 2016: Transforming Industrial Processes with Deep Learning (MAC...AWS re:Invent 2016: Transforming Industrial Processes with Deep Learning (MAC...
AWS re:Invent 2016: Transforming Industrial Processes with Deep Learning (MAC...Amazon Web Services
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotAmazon Web Services
 
Build Text Analytics Solutions with Amazon Comprehend and Amazon Translate
Build Text Analytics Solutions with Amazon Comprehend and Amazon TranslateBuild Text Analytics Solutions with Amazon Comprehend and Amazon Translate
Build Text Analytics Solutions with Amazon Comprehend and Amazon TranslateAmazon Web Services
 

What's hot (20)

Scaling Machine Learning from zero to millions of users (May 2019)
Scaling Machine Learning from zero to millions of users (May 2019)Scaling Machine Learning from zero to millions of users (May 2019)
Scaling Machine Learning from zero to millions of users (May 2019)
 
Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)
 
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
Build, train and deploy Machine Learning models on Amazon SageMaker (May 2019)
 
Build, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfBuild, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdf
 
Become a Machine Learning developer with AWS services (May 2019)
Become a Machine Learning developer with AWS services (May 2019)Become a Machine Learning developer with AWS services (May 2019)
Become a Machine Learning developer with AWS services (May 2019)
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your Enterprise
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
 
Machine Learning: From Notebook to Production with Amazon Sagemaker (April 2018)
Machine Learning: From Notebook to Production with Amazon Sagemaker (April 2018)Machine Learning: From Notebook to Production with Amazon Sagemaker (April 2018)
Machine Learning: From Notebook to Production with Amazon Sagemaker (April 2018)
 
Machine Learning: From Notebook to Production with Amazon Sagemaker
Machine Learning: From Notebook to Production with Amazon SagemakerMachine Learning: From Notebook to Production with Amazon Sagemaker
Machine Learning: From Notebook to Production with Amazon Sagemaker
 
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
 
Machine Learning on AWS
Machine Learning on AWSMachine Learning on AWS
Machine Learning on AWS
 
End-to-End Machine Learning with Amazon SageMaker
End-to-End Machine Learning with Amazon SageMakerEnd-to-End Machine Learning with Amazon SageMaker
End-to-End Machine Learning with Amazon SageMaker
 
Demystifying Amazon Sagemaker (ACD Kochi)
Demystifying Amazon Sagemaker (ACD Kochi)Demystifying Amazon Sagemaker (ACD Kochi)
Demystifying Amazon Sagemaker (ACD Kochi)
 
Optimize your Machine Learning workloads (April 2019)
Optimize your Machine Learning workloads (April 2019)Optimize your Machine Learning workloads (April 2019)
Optimize your Machine Learning workloads (April 2019)
 
AWS ML and SparkML on EMR to Build Recommendation Engine
AWS ML and SparkML on EMR to Build Recommendation Engine AWS ML and SparkML on EMR to Build Recommendation Engine
AWS ML and SparkML on EMR to Build Recommendation Engine
 
AWS re:Invent 2016: Transforming Industrial Processes with Deep Learning (MAC...
AWS re:Invent 2016: Transforming Industrial Processes with Deep Learning (MAC...AWS re:Invent 2016: Transforming Industrial Processes with Deep Learning (MAC...
AWS re:Invent 2016: Transforming Industrial Processes with Deep Learning (MAC...
 
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker AutopilotCostruisci modelli di Machine Learning con Amazon SageMaker Autopilot
Costruisci modelli di Machine Learning con Amazon SageMaker Autopilot
 
Build Text Analytics Solutions with Amazon Comprehend and Amazon Translate
Build Text Analytics Solutions with Amazon Comprehend and Amazon TranslateBuild Text Analytics Solutions with Amazon Comprehend and Amazon Translate
Build Text Analytics Solutions with Amazon Comprehend and Amazon Translate
 

Similar to Scale Machine Learning from zero to millions of users (April 2020)

"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019Provectus
 
Julien Simon "Scaling ML from 0 to millions of users"
Julien Simon "Scaling ML from 0 to millions of users"Julien Simon "Scaling ML from 0 to millions of users"
Julien Simon "Scaling ML from 0 to millions of users"Fwdays
 
Canada DevOps Summit 2020 Presentation Nov_03_2020
Canada DevOps Summit 2020 Presentation Nov_03_2020Canada DevOps Summit 2020 Presentation Nov_03_2020
Canada DevOps Summit 2020 Presentation Nov_03_2020Varun Manik
 
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...Amazon Web Services
 
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...Rustem Feyzkhanov
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated MLMark Tabladillo
 
AWS Serverless patterns & best-practices in AWS
AWS Serverless  patterns & best-practices in AWSAWS Serverless  patterns & best-practices in AWS
AWS Serverless patterns & best-practices in AWSDima Pasko
 
Serverless at Lifestage
Serverless at LifestageServerless at Lifestage
Serverless at LifestageBATbern
 
RightScale Webinar: Operationalize Your Enterprise AWS Usage Through an IT Ve...
RightScale Webinar: Operationalize Your Enterprise AWS Usage Through an IT Ve...RightScale Webinar: Operationalize Your Enterprise AWS Usage Through an IT Ve...
RightScale Webinar: Operationalize Your Enterprise AWS Usage Through an IT Ve...RightScale
 
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015Amazon Web Services Korea
 
AWS 201 Webinar Series - Rightsizing and Cost Optimizing your Deployment
AWS 201 Webinar Series - Rightsizing and Cost Optimizing your DeploymentAWS 201 Webinar Series - Rightsizing and Cost Optimizing your Deployment
AWS 201 Webinar Series - Rightsizing and Cost Optimizing your DeploymentAmazon Web Services
 
OpenSourceIndia-Suman.pptx
OpenSourceIndia-Suman.pptxOpenSourceIndia-Suman.pptx
OpenSourceIndia-Suman.pptxSuman Debnath
 
Amazon ECS at Coursera: A unified execution framework while defending against...
Amazon ECS at Coursera: A unified execution framework while defending against...Amazon ECS at Coursera: A unified execution framework while defending against...
Amazon ECS at Coursera: A unified execution framework while defending against...Brennan Saeta
 
(CMP406) Amazon ECS at Coursera: A general-purpose microservice
(CMP406) Amazon ECS at Coursera: A general-purpose microservice(CMP406) Amazon ECS at Coursera: A general-purpose microservice
(CMP406) Amazon ECS at Coursera: A general-purpose microserviceAmazon Web Services
 
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...Julien SIMON
 
AWS Summit 2013 | India - Running High Churn Development & Test Environments,...
AWS Summit 2013 | India - Running High Churn Development & Test Environments,...AWS Summit 2013 | India - Running High Churn Development & Test Environments,...
AWS Summit 2013 | India - Running High Churn Development & Test Environments,...Amazon Web Services
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Amazon Web Services
 
AWS DeepLens Workshop: Building Computer Vision Applications - BDA201 - Atlan...
AWS DeepLens Workshop: Building Computer Vision Applications - BDA201 - Atlan...AWS DeepLens Workshop: Building Computer Vision Applications - BDA201 - Atlan...
AWS DeepLens Workshop: Building Computer Vision Applications - BDA201 - Atlan...Amazon Web Services
 
ServerTemplate Deep Dive
ServerTemplate Deep DiveServerTemplate Deep Dive
ServerTemplate Deep DiveRightScale
 

Similar to Scale Machine Learning from zero to millions of users (April 2020) (20)

"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019
 
Julien Simon "Scaling ML from 0 to millions of users"
Julien Simon "Scaling ML from 0 to millions of users"Julien Simon "Scaling ML from 0 to millions of users"
Julien Simon "Scaling ML from 0 to millions of users"
 
Canada DevOps Summit 2020 Presentation Nov_03_2020
Canada DevOps Summit 2020 Presentation Nov_03_2020Canada DevOps Summit 2020 Presentation Nov_03_2020
Canada DevOps Summit 2020 Presentation Nov_03_2020
 
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
 
Ml ops on AWS
Ml ops on AWSMl ops on AWS
Ml ops on AWS
 
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated ML
 
AWS Serverless patterns & best-practices in AWS
AWS Serverless  patterns & best-practices in AWSAWS Serverless  patterns & best-practices in AWS
AWS Serverless patterns & best-practices in AWS
 
Serverless at Lifestage
Serverless at LifestageServerless at Lifestage
Serverless at Lifestage
 
RightScale Webinar: Operationalize Your Enterprise AWS Usage Through an IT Ve...
RightScale Webinar: Operationalize Your Enterprise AWS Usage Through an IT Ve...RightScale Webinar: Operationalize Your Enterprise AWS Usage Through an IT Ve...
RightScale Webinar: Operationalize Your Enterprise AWS Usage Through an IT Ve...
 
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
 
AWS 201 Webinar Series - Rightsizing and Cost Optimizing your Deployment
AWS 201 Webinar Series - Rightsizing and Cost Optimizing your DeploymentAWS 201 Webinar Series - Rightsizing and Cost Optimizing your Deployment
AWS 201 Webinar Series - Rightsizing and Cost Optimizing your Deployment
 
OpenSourceIndia-Suman.pptx
OpenSourceIndia-Suman.pptxOpenSourceIndia-Suman.pptx
OpenSourceIndia-Suman.pptx
 
Amazon ECS at Coursera: A unified execution framework while defending against...
Amazon ECS at Coursera: A unified execution framework while defending against...Amazon ECS at Coursera: A unified execution framework while defending against...
Amazon ECS at Coursera: A unified execution framework while defending against...
 
(CMP406) Amazon ECS at Coursera: A general-purpose microservice
(CMP406) Amazon ECS at Coursera: A general-purpose microservice(CMP406) Amazon ECS at Coursera: A general-purpose microservice
(CMP406) Amazon ECS at Coursera: A general-purpose microservice
 
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
 
AWS Summit 2013 | India - Running High Churn Development & Test Environments,...
AWS Summit 2013 | India - Running High Churn Development & Test Environments,...AWS Summit 2013 | India - Running High Churn Development & Test Environments,...
AWS Summit 2013 | India - Running High Churn Development & Test Environments,...
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS
 
AWS DeepLens Workshop: Building Computer Vision Applications - BDA201 - Atlan...
AWS DeepLens Workshop: Building Computer Vision Applications - BDA201 - Atlan...AWS DeepLens Workshop: Building Computer Vision Applications - BDA201 - Atlan...
AWS DeepLens Workshop: Building Computer Vision Applications - BDA201 - Atlan...
 
ServerTemplate Deep Dive
ServerTemplate Deep DiveServerTemplate Deep Dive
ServerTemplate Deep Dive
 

More from Julien SIMON

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceJulien SIMON
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersJulien SIMON
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with TransformersJulien SIMON
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Julien SIMON
 
Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)Julien SIMON
 
An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)Julien SIMON
 
Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)Julien SIMON
 
Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)Julien SIMON
 
Become a Machine Learning developer with AWS (Avril 2019)
Become a Machine Learning developer with AWS (Avril 2019)Become a Machine Learning developer with AWS (Avril 2019)
Become a Machine Learning developer with AWS (Avril 2019)Julien SIMON
 
Solve complex business problems with Amazon Personalize and Amazon Forecast (...
Solve complex business problems with Amazon Personalize and Amazon Forecast (...Solve complex business problems with Amazon Personalize and Amazon Forecast (...
Solve complex business problems with Amazon Personalize and Amazon Forecast (...Julien SIMON
 
Build, Train and Deploy Machine Learning Models at Scale (April 2019)
Build, Train and Deploy Machine Learning Models at Scale (April 2019)Build, Train and Deploy Machine Learning Models at Scale (April 2019)
Build, Train and Deploy Machine Learning Models at Scale (April 2019)Julien SIMON
 
Build Machine Learning Models with Amazon SageMaker (April 2019)
Build Machine Learning Models with Amazon SageMaker (April 2019)Build Machine Learning Models with Amazon SageMaker (April 2019)
Build Machine Learning Models with Amazon SageMaker (April 2019)Julien SIMON
 
Deep Learning with Tensorflow and Apache MXNet on AWS (April 2019)
Deep Learning with Tensorflow and Apache MXNet on AWS (April 2019)Deep Learning with Tensorflow and Apache MXNet on AWS (April 2019)
Deep Learning with Tensorflow and Apache MXNet on AWS (April 2019)Julien SIMON
 
Optimize your machine learning workloads on AWS (March 2019)
Optimize your machine learning workloads on AWS (March 2019)Optimize your machine learning workloads on AWS (March 2019)
Optimize your machine learning workloads on AWS (March 2019)Julien SIMON
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Julien SIMON
 

More from Julien SIMON (15)

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)
 
Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)
 
An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)
 
Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)
 
Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)
 
Become a Machine Learning developer with AWS (Avril 2019)
Become a Machine Learning developer with AWS (Avril 2019)Become a Machine Learning developer with AWS (Avril 2019)
Become a Machine Learning developer with AWS (Avril 2019)
 
Solve complex business problems with Amazon Personalize and Amazon Forecast (...
Solve complex business problems with Amazon Personalize and Amazon Forecast (...Solve complex business problems with Amazon Personalize and Amazon Forecast (...
Solve complex business problems with Amazon Personalize and Amazon Forecast (...
 
Build, Train and Deploy Machine Learning Models at Scale (April 2019)
Build, Train and Deploy Machine Learning Models at Scale (April 2019)Build, Train and Deploy Machine Learning Models at Scale (April 2019)
Build, Train and Deploy Machine Learning Models at Scale (April 2019)
 
Build Machine Learning Models with Amazon SageMaker (April 2019)
Build Machine Learning Models with Amazon SageMaker (April 2019)Build Machine Learning Models with Amazon SageMaker (April 2019)
Build Machine Learning Models with Amazon SageMaker (April 2019)
 
Deep Learning with Tensorflow and Apache MXNet on AWS (April 2019)
Deep Learning with Tensorflow and Apache MXNet on AWS (April 2019)Deep Learning with Tensorflow and Apache MXNet on AWS (April 2019)
Deep Learning with Tensorflow and Apache MXNet on AWS (April 2019)
 
Optimize your machine learning workloads on AWS (March 2019)
Optimize your machine learning workloads on AWS (March 2019)Optimize your machine learning workloads on AWS (March 2019)
Optimize your machine learning workloads on AWS (March 2019)
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Scale Machine Learning from zero to millions of users (April 2020)

  • 1. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved. Scale Machine Learning from zero to millions of users Julien Simon GlobalTechnical Evangelist, AI & Machine Learning, AWS @julsimon
  • 2. Rationale How to train ML models and deploy them in production, from humble beginnings to world domination Try to take reasonable and justified steps Longer, more opinionated version: https://medium.com/@julsimon/scaling-machine- learning-from-0-to-millions-of-users-part-1-a2d36a5e849
  • 3.
  • 4. And so itbegins • You’ve trained a model on a local machine, using a popular open source library. • You’ve measured the model’s accuracy, and things look good. • Now you’d like to deploy it to check its actual behaviour, to run A/B tests, etc. • You’ve embedded the model in your business application. • You’ve deployed everything to a single virtual machine in the cloud. • Everything works, you’re serving predictions, life is good!
  • 5. Score card Single EC2 instance Infrastructure effort C’mon, it’s just one instance ML setup effort pip install tensorflow CI/CD integration Not needed Build models DIY Train models python train.py Deploy models (at scale) python predict.py Scale/HA inference Not needed Optimize costs Not needed Security Not needed
  • 6.
  • 7. A fewinstancesand models later… • Life is not that good • Too much manual work • Time-consuming and error-prone • Dependency hell • No cost optimization • Monolithic architecture • Deployment hell • Multiple apps can’t share the same model • Apps and models scale differently Use AWS-maintained tools • Deep Learning Amazon Machine Images • Deep Learning containers Dockerize Create a prediction service • Model servers • Bespoke API (Flask?)
  • 8. AWS Deep LearningAMIs andContainers Optimized environments on Amazon Linux or Ubuntu Conda AMI For developers who want pre- installed pip packages of DL frameworks in separate virtual environments. Base AMI For developers who want a clean slate to set up private DL engine repositories or custom builds of DL engines. Containers For developers who want pre- installed containers for DL frameworks (TensorFlow, PyTorch, Apache MXNet)
  • 9.
  • 10.
  • 11. Scaling alert! • More customers, more team members, more models, woohoo! • Scalability, high availability & security are now a thing • Scaling up is a losing proposition.You need to scale out • Only automation can save you: IaC, CI/CD and all that good DevOps stuff • What are your options?
  • 12. Option 1:virtualmachines • Definitely possible, but: • Why? Seriously, I want to know. • Operational and financial issues await if you don’t automate extensively • Training • Build on-demand clusters with CloudFormation,Terraform, etc. • Distributed training is a pain to set up • Prediction • Automate deployement with CI/CD • Scale with Auto Scaling, Load Balancers, etc. • Spot, spot, spot
  • 13. Score card More EC2 instances Infrastructure effort Lots ML setup effort Some (DLAMI) CI/CD integration No change Build models DIY Train models DIY Deploy models DIY (model servers) Scale/HA inference DIY (Auto Scaling, LB) Optimize costs DIY (Spot, automation) Security DIY (IAM,VPC, KMS)
  • 14. Option 2:Docker clusters • This makes a lot of sense if you’re already deploying apps to Docker • No change to the dev experience: same workflows, same CI/CD, etc. • Deploy prediction services on the same infrastructure as business apps. • Amazon ECS and Amazon EKS • Lots of flexibility: mixed instance types (including GPUs), placement constraints, etc. • Both come with AWS-maintainedAMIs that will save you time • One cluster or many clusters ? • Build on-demand development and test clusters with CloudFormation,Terraform, etc. • Many customers find that running a large single production cluster works better • Still instance-based and not fully-managed • Not a hands-off operation: services / pods, service discovery, etc. are nice but you still have work to do • And yes, this matters even if « someone else is taking care of clusters »
  • 15.
  • 16. Score card EC2 ECS / EKS Infrastructure effort Lots Some (Docker tools) ML setup effort Some (DLAMI) Some (DL containers) CI/CD integration No change No change Build models DIY DIY Train models (at scale) DIY DIY (Docker tools) Deploy models (at scale) DIY (model servers) DIY (Docker tools) Scale/HA inference DIY (Auto Scaling, LB) DIY (Services, pods, etc.) Optimize costs DIY (Spot, RIs, automation) DIY (Spot, RIs, automation) Security DIY (IAM,VPC, KMS) DIY (IAM,VPC, KMS)
  • 17. Option 3:go fullymanaged withAmazonSageMaker 1 2 3
  • 18. Model options on Amazon SageMaker Training code Factorization Machines Linear Learner Principal Component Analysis K-Means Clustering Image classification And more Built-in Algorithms (17) No ML coding required No infrastructure work required Distributed training Pipe mode BringYour Own Container Full control, run anything! R, C++, etc. No infrastructure work required Built-in Frameworks Bring your own code: script mode Open source containers No infrastructure work required Distributed training Pipe mode
  • 19. TheAmazonSageMakerAPI • Python SDK orchestrating all Amazon SageMaker activity • High-level objects for algorithm selection, training, deploying, automatic model tuning, etc. https://github.com/aws/sagemaker-python-sdk • Spark SDK (Python & Scala) https://github.com/aws/sagemaker-spark/tree/master/sagemaker-spark-sdk • AWS SDK • For scripting and automation • CLI : ‘aws sagemaker’ • Language SDKs: boto3, etc.
  • 20. Training and deploying tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py', role=role, train_instance_count=1, train_instance_type='ml.c5.2xlarge’, framework_version='1.12', py_version='py3', script_mode=True, hyperparameters={ 'epochs': 10, 'learning-rate': 0.01}) tf_estimator.fit(data) # HTTPS endpoint backed by a single instance tf_endpoint = tf_estimator.deploy(initial_instance_count=1, instance_type=ml.t3.xlarge) tf_endpoint.predict(…)
  • 21. Training and deploying, atany scale tf_estimator = TensorFlow(entry_point=’my_crazy_cnn.py', role=role, train_instance_count=8, train_instance_type='ml.p3.16xlarge', # Total of 64 GPUs framework_version='1.12', py_version='py3', script_mode=True, hyperparameters={ 'epochs': 200, 'learning-rate': 0.01}) tf_estimator.fit(data) # HTTPS endpoint backed by 16 multi-AZ load-balanced instances tf_endpoint = tf_estimator.deploy(initial_instance_count=16, instance_type=ml.p3.2xlarge) tf_endpoint.predict(…)
  • 22. Score card EC2 ECS / EKS SageMaker Infrastructure effort Maximal Some (Docker tools) None ML setup effort Some (DLAMI) Some (DL containers) Minimal CI/CD integration No change No change Some (SDK, Step Functions) Build models DIY DIY 17 built-in algorithms Train models (at scale) DIY DIY (Docker tools) SDK: 2 LOCs Deploy models (at scale) DIY (model servers) DIY (Docker tools) SDK: 1 LOCs Kubernetes support Scale/HA inference DIY (Auto Scaling, LB) DIY (Services, pods, etc.) Built-in Optimize costs DIY (Spot, RIs, automation) DIY (Spot, RIs, automation) On-demand/Spot training, Auto Scaling for inference Security DIY (IAM,VPC, KMS) DIY (IAM,VPC, KMS) API parameters
  • 23. Score card Flamewarin3,2,1… EC2 ECS / EKS SageMaker Infrastructure effort Maximal Some (Docker tools) None ML setup effort Some (DLAMI) Some (DL containers) Minimal CI/CD integration No change No change Some (SDK, Step Functions) Build models DIY DIY 17 built-in algorithms Train models (at scale) DIY DIY (Docker tools) 2 LOCs Deploy models (at scale) DIY (model servers) DIY (Docker tools) SDK: 1 LOCs Kubernetes support Scale/HA inference DIY (Auto Scaling, LB) DIY (Services, pods, etc.) Built-in Optimize costs DIY (Spot, RIs, automation) DIY (Spot, RIs, automation) Spot training, Auto Scaling for inference Security DIY (IAM,VPC, KMS) DIY (IAM,VPC, KMS) API parameters Personal opinion Small scale only, unless you have strong DevOps skills and enjoy exercising them. Reasonable choice if you’re a Docker Docker shop, and if you’re able and and willing to integrate with the Docker/OSS ecosystem. If not, I’d think twice: Docker isn’t an ML platform. Learn it in a few hours, forget about servers, focus 100% on ML, enjoy goodies like pipe mode, distributed training, HPO, HPO, inference pipelines and more.
  • 24. Conclusion • Whatever works for you at this time is fine • Don’t over-engineer, and don’t « plan for the future » • Optimize for current business conditions, pay attention toTCO • Models and data matter, not infrastructure • When conditions change, move fast: smash and rebuild • ... which is what cloud is all about! • « 100% of our time spent on ML » shall be the whole of the Law • Mix and match if it makes sense • Train on SageMaker, deploy on ECS/EKS… or vice versa • Write your own story!
  • 26. Thank you! © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Julien Simon GlobalTechnical Evangelist, AI & Machine Learning, AWS @julsimon

Editor's Notes

  1. AI Services: AI Services are intentionally easy to use. They can be accessed via a simple API call. We’ve pulled the best and most targeted capabilities into ready-made services--for example image recognition or transcription. The focus here is really on enabling any developer—no ML skills required—to be able to develop AI applications using one of our services. These API services, used in conjunction, create compelling solutions that really target business problems and use cases. Customers can build these capabilities into their new and existing applications to reduce costs, increase speed,  improve customer satisfaction and insight, and build ‘modern’ intelligent applications What is your use case? What are the capabilities you might need? There’s an AI Service, or a pairing of services that will address the need. AI Services descriptions for color: Amazon Rekognition: Rekognition makes it easy to add image and video analysis to your applications. You just provide an image or video to the Rekognition API, and the service can identify the objects, people, text, scenes, and activities, as well as detect any inappropriate content. Amazon Rekognition also provides highly accurate facial analysis and facial recognition on images and video that you provide. You can detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases. Rekognition is a simple and easy to use API that can quickly analyze any image or video file stored in Amazon S3. Amazon Rekognition is always learning from new data, and we are continually adding new labels and facial recognition features to the service. More info: https://aws.amazon.com/rekognition/ Amazon Polly: Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Polly is a text to speech service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice. With dozens of lifelike voices across a variety of languages, you can select the ideal voice and build speech-enabled applications that work in many different countries. More info: https://aws.amazon.com/polly/ Amazon Transcribe: Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications. Using the Amazon Transcribe API, you can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech. Amazon Transcribe can be used for lots of common applications, including the transcription of customer service calls and generating subtitles on audio and video content. The service can transcribe audio files stored in common formats, like WAV and MP3, with time stamps for every word so that you can easily locate the audio in the original source by searching for the text. Amazon Transcribe is continually learning and improving to keep pace with the evolution of language. More info: https://aws.amazon.com/transcribe/ Amazon Translate: Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms. Amazon Translate allows you to localize content - such as websites and applications - for international users, and to easily translate large volumes of text efficiently. More info: https://aws.amazon.com/translate/ Amazon Comprehend: Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. The service identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic. Using these APIs, you can analyze text and apply the results in a wide range of applications including voice of customer analysis, intelligent document search, and content personalization for web applications.  More info: https://aws.amazon.com/comprehend Amazon Lex: Amazon Lex is a service for building conversational interfaces into any application using voice and text. Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable you to build applications with highly engaging user experiences and lifelike conversational interactions. With Amazon Lex, the same deep learning technologies that power Amazon Alexa are now available to any developer, enabling you to quickly and easily build sophisticated, natural language, conversational bots  More info: https://aws.amazon.com/lex    
  2. AI Services: AI Services are intentionally easy to use. They can be accessed via a simple API call. We’ve pulled the best and most targeted capabilities into ready-made services--for example image recognition or transcription. The focus here is really on enabling any developer—no ML skills required—to be able to develop AI applications using one of our services. These API services, used in conjunction, create compelling solutions that really target business problems and use cases. Customers can build these capabilities into their new and existing applications to reduce costs, increase speed,  improve customer satisfaction and insight, and build ‘modern’ intelligent applications What is your use case? What are the capabilities you might need? There’s an AI Service, or a pairing of services that will address the need. AI Services descriptions for color: Amazon Rekognition: Rekognition makes it easy to add image and video analysis to your applications. You just provide an image or video to the Rekognition API, and the service can identify the objects, people, text, scenes, and activities, as well as detect any inappropriate content. Amazon Rekognition also provides highly accurate facial analysis and facial recognition on images and video that you provide. You can detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases. Rekognition is a simple and easy to use API that can quickly analyze any image or video file stored in Amazon S3. Amazon Rekognition is always learning from new data, and we are continually adding new labels and facial recognition features to the service. More info: https://aws.amazon.com/rekognition/ Amazon Polly: Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Polly is a text to speech service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice. With dozens of lifelike voices across a variety of languages, you can select the ideal voice and build speech-enabled applications that work in many different countries. More info: https://aws.amazon.com/polly/ Amazon Transcribe: Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications. Using the Amazon Transcribe API, you can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech. Amazon Transcribe can be used for lots of common applications, including the transcription of customer service calls and generating subtitles on audio and video content. The service can transcribe audio files stored in common formats, like WAV and MP3, with time stamps for every word so that you can easily locate the audio in the original source by searching for the text. Amazon Transcribe is continually learning and improving to keep pace with the evolution of language. More info: https://aws.amazon.com/transcribe/ Amazon Translate: Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms. Amazon Translate allows you to localize content - such as websites and applications - for international users, and to easily translate large volumes of text efficiently. More info: https://aws.amazon.com/translate/ Amazon Comprehend: Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. The service identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic. Using these APIs, you can analyze text and apply the results in a wide range of applications including voice of customer analysis, intelligent document search, and content personalization for web applications.  More info: https://aws.amazon.com/comprehend Amazon Lex: Amazon Lex is a service for building conversational interfaces into any application using voice and text. Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable you to build applications with highly engaging user experiences and lifelike conversational interactions. With Amazon Lex, the same deep learning technologies that power Amazon Alexa are now available to any developer, enabling you to quickly and easily build sophisticated, natural language, conversational bots  More info: https://aws.amazon.com/lex