2. Jayesh Bapu Ahire
➢ Organizer, Twilio India Community, AWS UG Pune, Elasticsearch UG Pune, Alexa UG Nashik
➢ Research Assistant, Stanford AI Lab
➢ Research Associate, Tsinghua AI Lab & ETH Research
➢ Author, Blogger, Speaker, Student, Poet
7. Data Preprocessing → Select Algo & Framework → Train & Tune Model → Integrate & Deploy
8. Machine Learning in Cloud
● The cloud's pay-per-use model makes it easy for enterprises to experiment, scale, and go into production.
● Intelligent capabilities are accessible without requiring advanced skills in AI.
● You don't need deep knowledge of AI or machine learning theory, or a team of data scientists.
9. AI & ML capabilities of AWS
● ML Frameworks + Infrastructure: frameworks, interfaces, and infrastructure
● ML Services: Amazon SageMaker (Build + Train + Deploy)
● AI Services: Personalize, Forecast, Rekognition, Comprehend, Textract, Polly, Lex, Translate, Transcribe
11. Reduce Complexity
● Fully managed
● Quick Test
● Pre-optimized Algorithms
● Bring Your Own Algorithm
● Distributed Training
12. Build / Train / Deploy
Build
● Collect & prepare training data: data labelling & pre-built notebooks for common problems
● Choose & optimize your ML algorithm: built-in, high-performance algorithms and hundreds of ready-to-use algorithms in AWS Marketplace
Train
● Set up & manage environments for training: one-click training using Amazon EC2 On-Demand or Spot instances
● Train & tune model: train once, run anywhere & model optimization
Deploy
● Deploy model in production: one-click deployment
● Scale & manage the production environment: fully managed with auto-scaling for 75% less cost
13. Machine Learning end-to-end pipeline using Amazon SageMaker
01 Build
1. Pre-built algorithms & notebooks
2. Data labeling: Ground Truth
3. AWS Marketplace for ML
02 Train
1. One-click model training and tuning
2. SageMaker Neo
3. SageMaker RL
03 Deploy
1. One-click deployment and hosting
19. Amazon SageMaker: Open Source Containers
● Customize them
● Run them locally for development and testing
● Run them on SageMaker for training and prediction at scale
https://github.com/aws/sagemaker-tensorflow-containers
https://github.com/aws/sagemaker-mxnet-containers
20. Amazon SageMaker: Bring Your Own Container
● Prepare the training code in a Docker container
● Upload the container image to Amazon Elastic Container Registry (ECR)
● Upload the training dataset to Amazon S3/FSx/EFS
● Invoke the CreateTrainingJob API to execute a SageMaker training job
The SageMaker training job pulls the container image from Amazon ECR, reads the training data from the data source, configures the training job with hyperparameter inputs, trains a model, and saves the model to model_dir so that it can be deployed for inference later.
https://github.com/aws/sagemaker-container-support
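A minimal sketch of this flow with the SageMaker Python SDK (v1 parameter names, as used elsewhere in this deck), assuming the container image has already been pushed to ECR and the dataset uploaded to S3. The image URI, IAM role, and bucket below are hypothetical placeholders.

import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # hypothetical ECR image
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # hypothetical IAM role
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",   # model artifacts are saved here
    hyperparameters={"epochs": "10"},      # passed into the container
    sagemaker_session=sagemaker.Session(),
)

# fit() invokes the CreateTrainingJob API under the hood
estimator.fit({"training": "s3://my-bucket/train"})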
21. Distributed Training At Scale on Amazon SageMaker
● Training on Amazon SageMaker can automatically distribute processing across a number of nodes, including P3 instances
● You can choose from two data distribution types for training ML models:
○ FullyReplicated: passes every file in the input to every machine
○ ShardedByS3Key: separates and distributes the files in the input across the training nodes
● Overall, sharding can run faster, but it depends on the algorithm (see the sketch below)
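A minimal sketch of choosing a data distribution type with the SageMaker Python SDK (v1 naming). The image, role, and bucket are hypothetical placeholders.

from sagemaker.estimator import Estimator
from sagemaker.session import s3_input

estimator = Estimator(
    image_name="my-algo-image:latest",     # hypothetical training image
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    train_instance_count=4,                # 4 nodes to distribute across
    train_instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/output",
)

# ShardedByS3Key splits the S3 objects across the 4 nodes;
# the default, FullyReplicated, copies every object to every node.
train_channel = s3_input("s3://my-bucket/train", distribution="ShardedByS3Key")

estimator.fit({"train": train_channel})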
22. Amazon SageMaker: Local Mode Training
Enabling experimentation speed
● Train with local notebooks
● Train on notebook instances
● Iterate faster on a small sample of the dataset locally; no waiting for a new training cluster to be built each time
● Emulate CPU (single and multi-instance) and GPU (single instance) in local mode
● Go distributed with a single line of code
23. Automatic Model Tuning on Amazon SageMaker
Hyperparameter Optimizer
● Amazon SageMaker automatic model tuning predicts the hyperparameter values that are most likely to improve model fit.
● Automatic model tuning can be used with the Amazon SageMaker
○ Built-in algorithms,
○ Pre-built deep learning frameworks, and
○ Bring-your-own-algorithm containers
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning
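A minimal sketch of automatic model tuning with the SageMaker Python SDK (v1 naming), here tuning the built-in Linear Learner. The ranges, job counts, bucket, and role are illustrative assumptions, not recommendations.

import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

container = get_image_uri(boto3.Session().region_name, "linear-learner")

estimator = Estimator(
    image_name=container,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # hypothetical
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",
)
estimator.set_hyperparameters(feature_dim=784, predictor_type="multiclass_classifier", num_classes=10)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:objective_loss",  # emitted by the built-in algorithm
    objective_type="Minimize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.0001, 0.1),
        "mini_batch_size": IntegerParameter(100, 1000),
    },
    max_jobs=20,           # total training jobs the tuner may launch
    max_parallel_jobs=2,   # jobs run concurrently
)

tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})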
24. Amazon SageMaker: Accelerating ML Training
Faster start times and training job execution time
● Two modes: File Mode and Pipe Mode
○ input_mode parameter in sagemaker.estimator.Estimator
● File Mode: S3 data source or file system data source
○ When using S3 as the data source, the training data set is downloaded to EBS volumes
○ Use a file system data source (Amazon EFS or Amazon FSx for Lustre) for faster training startup and execution time
○ Different data formats supported: CSV, protobuf, JSON, libsvm (check the algo docs!)
● Pipe Mode streams the data set to training instances
○ This allows you to process large data sets, and training starts faster
○ Dataset must be in recordio-encoded protobuf or CSV format
25. Amazon SageMaker: Fully-Managed Spot Training
Reduce training costs at scale
● Use Managed Spot Training on SageMaker to reduce training costs by up to 90%
● Managed Spot Training is available in all training configurations:
○ All instance types supported by Amazon SageMaker
○ All models: built-in algorithms, built-in frameworks, and custom models
○ All configurations: single instance training, distributed training, and
automatic model tuning.
● Setting it up is extremely simple:
○ If you're using the console, just switch the feature on.
○ If you're working with the Amazon SageMaker SDK, just set train_use_spot_instances to True in the Estimator constructor (see the sketch below).
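A minimal sketch of Managed Spot Training with the v1 parameter names the slide mentions (later SDK versions renamed these to use_spot_instances / max_run / max_wait). The image, role, and bucket are hypothetical placeholders.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name="my-algo-image:latest",   # hypothetical training image
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/output",
    train_use_spot_instances=True,   # request Spot capacity
    train_max_run=3600,              # max training time, in seconds
    train_max_wait=7200,             # max wait for Spot capacity (must be >= train_max_run)
)

estimator.fit({"train": "s3://my-bucket/train"})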
26. Amazon SageMaker: Secure Machine Learning
● No retention of customer data
● SageMaker provides encryption in transit
● Encryption at rest everywhere
● Compute isolation: instances allocated for computation are never shared with others
● Network isolation: all compute instances run inside private service-managed VPCs
● Secure, fully managed infrastructure: Amazon SageMaker takes care of patching and keeping instances up to date
● Notebook security: Jupyter notebooks can be operated without internet access and bound to secure customer VPCs
27. Amazon SageMaker Training: Getting Started
To train a model in Amazon SageMaker, you will need the following:
● A dataset. Here we will use the MNIST (Modified National Institute of Standards and Technology database) dataset. This dataset provides a training set of 50,000 example images of handwritten single-digit numbers, a validation set of 10,000 images, and a test set of 10,000 images.
● An algorithm. Here we will use the Linear Learner algorithm provided by Amazon SageMaker.
● An Amazon Simple Storage Service (Amazon S3) bucket to store the training data and the model artifacts.
● An Amazon SageMaker notebook instance to prepare and process data and to train and deploy a machine learning model.
● A Jupyter notebook to use with the notebook instance.
● For model training, deployment, and validation, I will use the high-level Amazon SageMaker Python SDK.
28. Amazon SageMaker Training: Getting Started
● Create the S3 bucket.
● Create an Amazon SageMaker notebook instance by going here: https://console.aws.amazon.com/sagemaker/
● Choose Notebook instances, then choose Create notebook instance.
● On the Create notebook instance page, provide the notebook instance name and choose ml.t2.medium for the instance type (the least expensive instance). For IAM role, choose Create a new role, then choose Create role.
● Choose Create notebook instance.
In a few minutes, Amazon SageMaker launches an ML compute instance and attaches an ML storage volume to it. The notebook instance has a preconfigured Jupyter notebook server and a set of Anaconda libraries.
29. How To Train a Model With Amazon SageMaker
To train a model in Amazon SageMaker, you create a training job. The training job includes the following information (see the sketch below):
● The URL of the Amazon Simple Storage Service (Amazon S3) bucket, or the file system ID of the file system, where you've stored the training data.
● The compute resources that you want Amazon SageMaker to use for model training. Compute resources are ML compute instances that are managed by Amazon SageMaker.
● The URL of the S3 bucket where you want to store the output of the job.
● The Amazon Elastic Container Registry path where the training code is stored.
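A minimal sketch of supplying those four pieces of information through the low-level AWS SDK for Python (Boto 3). All names, ARNs, and S3 paths are hypothetical placeholders.

import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="my-training-job",
    # Where the training code (container image) is stored: an ECR path
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
    # Where the training data is stored
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    # Where the output of the job should be stored
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output"},
    # The compute resources SageMaker should use
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)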
30. Linear Learner with MNIST dataset example
● Provide the S3 bucket and prefix that you want to use for training and model artifacts. This should be within the same region as the notebook instance, training, and hosting.
● Provide the IAM role ARN used to give training and hosting access to your data.
● Download the MNIST dataset.
● The Amazon SageMaker implementation of Linear Learner takes recordio-wrapped protobuf, whereas the data we have is a pickled NumPy array on disk.
● This data conversion will be handled by the Amazon SageMaker Python SDK, imported as sagemaker (see the sketch below).
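A minimal sketch of the conversion step, using the helper that ships with the SageMaker Python SDK (v1). The bucket and prefix are hypothetical, and train_set is assumed to be the unpickled MNIST tuple of (features, labels) NumPy arrays.

import io
import boto3
import sagemaker.amazon.common as smac

features = train_set[0].astype("float32")   # 50,000 x 784 pixel vectors
labels = train_set[1].astype("float32")     # digit labels 0-9

# Serialize the arrays to recordio-wrapped protobuf in memory
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

# Upload the converted dataset to the S3 training location
bucket, prefix = "my-bucket", "sagemaker/linear-mnist"
boto3.resource("s3").Bucket(bucket).Object(f"{prefix}/train/recordio-pb-data").upload_fileobj(buf)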
31. Train the model
Create and Run a Training Job with the Amazon SageMaker Python SDK
● To train a model in Amazon SageMaker, you can use
○ the Amazon SageMaker Python SDK, or
○ the AWS SDK for Python (Boto 3), or
○ the AWS console
● For this exercise, I will use the notebook instance and the Python SDK.
● The Amazon SageMaker Python SDK includes the sagemaker.estimator.Estimator estimator, which can be used with any algorithm.
● To run a model training job, import the Amazon SageMaker Python SDK and get the Linear Learner container, as in the sketch below.
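A minimal sketch of the training job for this Linear Learner example (v1 naming). The role is hypothetical, and the bucket/prefix match the conversion sketch above.

import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator

sess = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical IAM role
bucket, prefix = "my-bucket", "sagemaker/linear-mnist"

# Look up the Linear Learner container image for the current region
container = get_image_uri(boto3.Session().region_name, "linear-learner")

linear = Estimator(
    image_name=container,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",
    sagemaker_session=sess,
)
linear.set_hyperparameters(
    feature_dim=784,                       # 28x28 pixel images
    predictor_type="multiclass_classifier",
    num_classes=10,
    mini_batch_size=200,
)

# Launches the managed training job against the converted dataset
linear.fit({"train": f"s3://{bucket}/{prefix}/train"})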