2. Rationale
How to train ML models and deploy them in production, from
humble beginnings to world domination
Try to take reasonable and justified steps
Longer, more opinionated version: https://medium.com/@julsimon/scaling-machine-
learning-from-0-to-millions-of-users-part-1-a2d36a5e849
3.
4. And so itbegins
• You’ve trained a model on a local machine, using a popular open source library.
• You’ve measured the model’s accuracy, and things look good.
• Now you’d like to deploy it to check its actual behaviour, to run A/B tests, etc.
• You’ve embedded the model in your business application.
• You’ve deployed everything to a single virtual machine in the cloud.
• Everything works, you’re serving predictions, life is good!
5. Score card
Single EC2 instance
Infrastructure effort C’mon, it’s just one instance
ML setup effort pip install tensorflow
CI/CD integration Not needed
Build models DIY
Train models python train.py
Deploy models (at scale) python predict.py
Scale/HA inference Not needed
Optimize costs Not needed
Security Not needed
6.
7. A fewinstancesand models later…
• Life is not that good
• Too much manual work
• Time-consuming and error-prone
• Dependency hell
• No cost optimization
• Monolithic architecture
• Deployment hell
• Multiple apps can’t share the same model
• Apps and models scale differently
Use AWS-maintained tools
• Deep Learning Amazon Machine Images
• Deep Learning containers
Dockerize
Create a prediction service
• Model servers
• Bespoke API (Flask?)
8. AWS Deep LearningAMIs andContainers
Optimized environments on Amazon Linux or Ubuntu
Conda AMI
For developers who want pre-
installed pip packages of DL
frameworks in separate virtual
environments.
Base AMI
For developers who want a clean
slate to set up private DL engine
repositories or custom builds of DL
engines.
Containers
For developers who want pre-
installed containers for DL
frameworks (TensorFlow, PyTorch,
Apache MXNet)
9.
10.
11. Scaling alert!
• More customers, more team members, more models, woohoo!
• Scalability, high availability & security are now a thing
• Scaling up is a losing proposition.You need to scale out
• Only automation can save you:
IaC, CI/CD and all that good DevOps stuff
• What are your options?
12. Option 1:virtualmachines
• Definitely possible, but:
• Why? Seriously, I want to know.
• Operational and financial issues await if you don’t automate extensively
• Training
• Build on-demand clusters with CloudFormation,Terraform, etc.
• Distributed training is a pain to set up
• Prediction
• Automate deployement with CI/CD
• Scale with Auto Scaling, Load Balancers, etc.
• Spot, spot, spot
14. Option 2:Docker clusters
• This makes a lot of sense if you’re already deploying apps to Docker
• No change to the dev experience: same workflows, same CI/CD, etc.
• Deploy prediction services on the same infrastructure as business apps.
• Amazon ECS and Amazon EKS
• Lots of flexibility: mixed instance types (including GPUs), placement constraints, etc.
• Both come with AWS-maintainedAMIs that will save you time
• One cluster or many clusters ?
• Build on-demand development and test clusters with CloudFormation,Terraform, etc.
• Many customers find that running a large single production cluster works better
• Still instance-based and not fully-managed
• Not a hands-off operation: services / pods, service discovery, etc. are nice but you still have work to do
• And yes, this matters even if « someone else is taking care of clusters »
18. Model options on Amazon SageMaker
Training code
Factorization Machines
Linear Learner
Principal Component Analysis
K-Means Clustering
Image classification
And more
Built-in Algorithms (17)
No ML coding required
No infrastructure work required
Distributed training
Pipe mode
BringYour Own Container
Full control, run anything!
R, C++, etc.
No infrastructure work required
Built-in Frameworks
Bring your own code: script mode
Open source containers
No infrastructure work required
Distributed training
Pipe mode
19. TheAmazonSageMakerAPI
• Python SDK orchestrating all Amazon SageMaker activity
• High-level objects for algorithm selection, training, deploying,
automatic model tuning, etc.
https://github.com/aws/sagemaker-python-sdk
• Spark SDK (Python & Scala)
https://github.com/aws/sagemaker-spark/tree/master/sagemaker-spark-sdk
• AWS SDK
• For scripting and automation
• CLI : ‘aws sagemaker’
• Language SDKs: boto3, etc.
20. Training and deploying
tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
role=role,
train_instance_count=1,
train_instance_type='ml.c5.2xlarge’,
framework_version='1.12',
py_version='py3',
script_mode=True,
hyperparameters={
'epochs': 10,
'learning-rate': 0.01})
tf_estimator.fit(data)
# HTTPS endpoint backed by a single instance
tf_endpoint = tf_estimator.deploy(initial_instance_count=1, instance_type=ml.t3.xlarge)
tf_endpoint.predict(…)
21. Training and deploying, atany scale
tf_estimator = TensorFlow(entry_point=’my_crazy_cnn.py',
role=role,
train_instance_count=8,
train_instance_type='ml.p3.16xlarge', # Total of 64 GPUs
framework_version='1.12',
py_version='py3',
script_mode=True,
hyperparameters={
'epochs': 200,
'learning-rate': 0.01})
tf_estimator.fit(data)
# HTTPS endpoint backed by 16 multi-AZ load-balanced instances
tf_endpoint = tf_estimator.deploy(initial_instance_count=16, instance_type=ml.p3.2xlarge)
tf_endpoint.predict(…)
22. Score card
EC2 ECS / EKS SageMaker
Infrastructure effort Maximal Some (Docker tools) None
ML setup effort Some (DLAMI) Some (DL containers) Minimal
CI/CD integration No change No change Some (SDK, Step Functions)
Build models DIY DIY 17 built-in algorithms
Train models (at scale) DIY DIY (Docker tools) SDK: 2 LOCs
Deploy models (at scale) DIY (model servers) DIY (Docker tools) SDK: 1 LOCs
Kubernetes support
Scale/HA inference DIY (Auto Scaling, LB) DIY (Services, pods, etc.) Built-in
Optimize costs DIY (Spot, RIs, automation) DIY (Spot, RIs, automation) On-demand/Spot training,
Auto Scaling for inference
Security DIY (IAM,VPC, KMS) DIY (IAM,VPC, KMS) API parameters
23. Score card
Flamewarin3,2,1…
EC2 ECS / EKS SageMaker
Infrastructure effort Maximal Some (Docker tools) None
ML setup effort Some (DLAMI) Some (DL containers) Minimal
CI/CD integration No change No change Some (SDK, Step Functions)
Build models DIY DIY 17 built-in algorithms
Train models (at scale) DIY DIY (Docker tools) 2 LOCs
Deploy models (at scale) DIY (model servers) DIY (Docker tools) SDK: 1 LOCs
Kubernetes support
Scale/HA inference DIY (Auto Scaling, LB) DIY (Services, pods, etc.) Built-in
Optimize costs DIY (Spot, RIs, automation) DIY (Spot, RIs, automation) Spot training,
Auto Scaling for inference
Security DIY (IAM,VPC, KMS) DIY (IAM,VPC, KMS) API parameters
Personal opinion Small scale only, unless you have
strong DevOps skills and enjoy
exercising them.
Reasonable choice if you’re a Docker
Docker shop, and if you’re able and
and willing to integrate with the
Docker/OSS ecosystem. If not, I’d
think twice: Docker isn’t an ML
platform.
Learn it in a few hours, forget
about servers, focus 100% on
ML, enjoy goodies like pipe
mode, distributed training, HPO,
HPO, inference pipelines and
more.
24. Conclusion
• Whatever works for you at this time is fine
• Don’t over-engineer, and don’t « plan for the future »
• Optimize for current business conditions, pay attention toTCO
• Models and data matter, not infrastructure
• When conditions change, move fast: smash and rebuild
• ... which is what cloud is all about!
• « 100% of our time spent on ML » shall be the whole of the Law
• Mix and match if it makes sense
• Train on SageMaker, deploy on ECS/EKS… or vice versa
• Write your own story!
AI Services:
AI Services are intentionally easy to use. They can be accessed via a simple API call.
We’ve pulled the best and most targeted capabilities into ready-made services--for example image recognition or transcription.
The focus here is really on enabling any developer—no ML skills required—to be able to develop AI applications using one of our services.
These API services, used in conjunction, create compelling solutions that really target business problems and use cases.
Customers can build these capabilities into their new and existing applications to reduce costs, increase speed, improve customer satisfaction and insight, and build ‘modern’ intelligent applications
What is your use case? What are the capabilities you might need? There’s an AI Service, or a pairing of services that will address the need.
AI Services descriptions for color:
Amazon Rekognition:
Rekognition makes it easy to add image and video analysis to your applications. You just provide an image or video to the Rekognition API, and the service can identify the objects, people, text, scenes, and activities, as well as detect any inappropriate content.
Amazon Rekognition also provides highly accurate facial analysis and facial recognition on images and video that you provide. You can detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases.Rekognition is a simple and easy to use API that can quickly analyze any image or video file stored in Amazon S3. Amazon Rekognition is always learning from new data, and we are continually adding new labels and facial recognition features to the service.
More info: https://aws.amazon.com/rekognition/
Amazon Polly:
Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products.
Polly is a text to speech service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice.
With dozens of lifelike voices across a variety of languages, you can select the ideal voice and build speech-enabled applications that work in many different countries.
More info: https://aws.amazon.com/polly/
Amazon Transcribe:
Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications.
Using the Amazon Transcribe API, you can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech.
Amazon Transcribe can be used for lots of common applications, including the transcription of customer service calls and generating subtitles on audio and video content.
The service can transcribe audio files stored in common formats, like WAV and MP3, with time stamps for every word so that you can easily locate the audio in the original source by searching for the text. Amazon Transcribe is continually learning and improving to keep pace with the evolution of language.
More info: https://aws.amazon.com/transcribe/
Amazon Translate:
Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation.
Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms.
Amazon Translate allows you to localize content - such as websites and applications - for international users, and to easily translate large volumes of text efficiently.
More info: https://aws.amazon.com/translate/
Amazon Comprehend:
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text.
The service identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
Using these APIs, you can analyze text and apply the results in a wide range of applications including voice of customer analysis, intelligent document search, and content personalization for web applications.
More info: https://aws.amazon.com/comprehend
Amazon Lex:
Amazon Lex is a service for building conversational interfaces into any application using voice and text.
Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable you to build applications with highly engaging user experiences and lifelike conversational interactions.
With Amazon Lex, the same deep learning technologies that power Amazon Alexa are now available to any developer, enabling you to quickly and easily build sophisticated, natural language, conversational bots
More info: https://aws.amazon.com/lex
AI Services:
AI Services are intentionally easy to use. They can be accessed via a simple API call.
We’ve pulled the best and most targeted capabilities into ready-made services--for example image recognition or transcription.
The focus here is really on enabling any developer—no ML skills required—to be able to develop AI applications using one of our services.
These API services, used in conjunction, create compelling solutions that really target business problems and use cases.
Customers can build these capabilities into their new and existing applications to reduce costs, increase speed, improve customer satisfaction and insight, and build ‘modern’ intelligent applications
What is your use case? What are the capabilities you might need? There’s an AI Service, or a pairing of services that will address the need.
AI Services descriptions for color:
Amazon Rekognition:
Rekognition makes it easy to add image and video analysis to your applications. You just provide an image or video to the Rekognition API, and the service can identify the objects, people, text, scenes, and activities, as well as detect any inappropriate content.
Amazon Rekognition also provides highly accurate facial analysis and facial recognition on images and video that you provide. You can detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases.Rekognition is a simple and easy to use API that can quickly analyze any image or video file stored in Amazon S3. Amazon Rekognition is always learning from new data, and we are continually adding new labels and facial recognition features to the service.
More info: https://aws.amazon.com/rekognition/
Amazon Polly:
Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products.
Polly is a text to speech service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice.
With dozens of lifelike voices across a variety of languages, you can select the ideal voice and build speech-enabled applications that work in many different countries.
More info: https://aws.amazon.com/polly/
Amazon Transcribe:
Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications.
Using the Amazon Transcribe API, you can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech.
Amazon Transcribe can be used for lots of common applications, including the transcription of customer service calls and generating subtitles on audio and video content.
The service can transcribe audio files stored in common formats, like WAV and MP3, with time stamps for every word so that you can easily locate the audio in the original source by searching for the text. Amazon Transcribe is continually learning and improving to keep pace with the evolution of language.
More info: https://aws.amazon.com/transcribe/
Amazon Translate:
Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation.
Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms.
Amazon Translate allows you to localize content - such as websites and applications - for international users, and to easily translate large volumes of text efficiently.
More info: https://aws.amazon.com/translate/
Amazon Comprehend:
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text.
The service identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
Using these APIs, you can analyze text and apply the results in a wide range of applications including voice of customer analysis, intelligent document search, and content personalization for web applications.
More info: https://aws.amazon.com/comprehend
Amazon Lex:
Amazon Lex is a service for building conversational interfaces into any application using voice and text.
Amazon Lex provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable you to build applications with highly engaging user experiences and lifelike conversational interactions.
With Amazon Lex, the same deep learning technologies that power Amazon Alexa are now available to any developer, enabling you to quickly and easily build sophisticated, natural language, conversational bots
More info: https://aws.amazon.com/lex