This document discusses machine learning with Kubeflow, a containerized machine learning platform that makes it easy to develop, deploy, and manage portable, scalable, end-to-end ML workflows on Kubernetes. It covers Kubeflow components such as Jupyter notebooks, Fairing for packaging ML jobs, Katib for hyperparameter tuning, KFServing for model serving, Pipelines for orchestrating workflows, and Metadata for tracking artifacts. It also provides guidance on deploying Kubeflow on Amazon EKS and optimizing distributed deep learning performance on EKS.
Kubernetes provides isolation, auto-scaling, load balancing, flexibility, and GPU support. These features are critical for running machine learning models that are computationally and data intensive and hard to parallelize. The declarative syntax of Kubernetes deployment descriptors makes it easy for engineers who are not operationally focused to train machine learning models on Kubernetes. This talk explains why and how Amazon EKS, our managed Kubernetes service, is the right platform for building your machine learning solutions.
When we learn how to ride a bike, we’re using the data provided by a friend, or a sibling, or a parent to train our mind to ride the bike. Machine Learning is applying those concepts, but machine-to-machine.
Machine Learning can be realized in a variety of ways. Let’s see how it looks in a DIY fashion.
There is a training phase.
You start with training data that will be used to create a model. The blue cloud in the middle is your code, which reads the training data and creates a model. Once the model is generated, test data is fed to it to measure the model's accuracy. The algorithm chosen in your application also determines how long it takes to generate the model. You keep repeating this cycle until you obtain a model with reasonable accuracy.
After training is done, there is an inference phase.
In this phase, input data, typically real-world data, is fed to the generated model and predictions are made. The ultimate goal of ML is good predictions: if your ML model is meant to identify hand-written numbers and it is presented with one, can it identify it accurately?
If not, then you go back to training and then infer again.
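The train-evaluate-infer cycle above can be sketched in a few lines of plain Python. This is a toy stand-in, not a real learner: the "model" is just a threshold, and the training data, thresholds, and accuracy target are all made up for illustration.

```python
import random

# Toy "model": learn a threshold that separates small from large numbers.
# Labels are defined by whether the value is >= 5.
random.seed(0)
train_data = [(x, x >= 5) for x in [random.uniform(0, 10) for _ in range(200)]]
test_data = [(x, x >= 5) for x in [random.uniform(0, 10) for _ in range(100)]]

def train(data, threshold):
    # The "model" here is just the threshold itself; real training
    # would fit the model's parameters to the training data.
    return threshold

def accuracy(model, data):
    correct = sum(1 for x, label in data if (x >= model) == label)
    return correct / len(data)

# Training phase: repeat the cycle until accuracy is reasonable.
model, best = None, 0.0
for threshold in [2.0, 3.5, 5.0, 6.5]:
    candidate = train(train_data, threshold)
    acc = accuracy(candidate, test_data)
    if acc > best:
        model, best = candidate, acc

# Inference phase: feed real-world input to the generated model.
prediction = 7.3 >= model  # predicts whether 7.3 is "large"
```

If the measured accuracy were too low, you would loop back to the training phase with a different algorithm or parameters, exactly as the slide describes.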
Within AWS we see the stack as having three layers:
The bottom layer of the stack is for expert machine learning practitioners who work at the framework level and are comfortable building, training, tuning, and deploying machine learning models. This is the foundation for all of the innovation we drive at every other layer of the stack.
The vast majority of deep learning and machine learning in the cloud runs on GPU instances such as P3 and P3dn.
All the common frameworks such as TensorFlow, PyTorch, Caffe2, and Apache MXNet are supported. We will always make sure that all the frameworks you care about are supported equally well, so you have the right tool for the right job.
While we’re seeing a lot of activity at that bottom layer (infrastructure and frameworks), the reality is that there just aren't that many expert machine learning practitioners in the world.
That’s why we built and launched Amazon SageMaker, a managed ML service in the middle tier, which makes it much easier for everyday developers and data scientists to get up and running with machine learning.
Moving on, the top level of the stack is what people often call artificial intelligence (AI), because it closely mimics human cognition. And our services here are for customers that don’t want to deal with models and training.
Customers can easily build these capabilities into new and existing applications to reduce costs, increase speed, and improve customer satisfaction and insight.
We offer multiple pre-trained AI services covering vision, speech, language, chatbots, forecasting and recommendations. The key here is that developers with no prior machine learning experience can easily build sophisticated AI driven applications, like an AI driven contact center or live media subtitling.
To summarize, the AWS AI and ML stack has three layers, each addressing a different audience:
ML Frameworks & Infrastructure: For expert machine learning practitioners who work at the framework level.
ML Services: For everyday developers and data scientists, we built and launched Amazon SageMaker.
AI Services: For developers with no prior machine learning experience who want to easily build sophisticated AI-driven applications.
Deep storage and analytics capabilities are needed for a comprehensive ML solution. Storage systems should support high throughput and low latency, with strong security around the data. You also need a deep collection of real-time analytics services. AWS offers all of that, and we’ll cover one part of this later in the talk.
Why is Kubernetes well suited for machine learning? There are three reasons: composability, portability, and scalability.
ML is about data ingestion, data analysis, data transformation, data validation, building a model, model validation, training at scale, inference, and much more. Each of these phases ends up being a microservice, and it turns out Kubernetes provides a great platform for composing these microservices together. It allows multiple data scientists to choose a solution that works for them, and it enables separation of duties between ops and data scientists.
Using containers and Kubernetes as a base layer allows you to use open source frameworks (for example, Kubeflow) to train and develop models on k8s. This lets you easily migrate your solution from laptop, to on-prem, to the cloud. We’ll talk about Kubeflow later in this preso.
Kubernetes allows you to scale your applications. It supports not only more nodes but also more GPUs, and we’ll talk later about performance optimizations on Amazon EKS for near-linear scalability, as well as more disk and network capacity and low-latency filesystems. You also need to run experiments many times, tweaking parameters a little each time, so your infrastructure must scale to support those needs. The AWS cloud meets those needs well.
Let’s see how we can leverage these containers on k8s.
As a data scientist, you just want to do data science and run your ML models on EKS. But first you need to become an expert in containers, packaging, persistent volumes, scaling, GPUs, drivers, DevOps, and much more.
Every data scientist has a slightly different view of the tools to use for modeling, UX, frameworks, storage, and the many other items needed for ML.
KubeFlow makes it easy for everyone to develop, deploy and manage portable, distributed Machine Learning on k8s.
Anywhere you are running k8s, you should be able to run Kubeflow. Once an Amazon EKS cluster is up and running, you can deploy Kubeflow on it.
Kubeflow was introduced about 2 years ago at KubeCon. It provides a containerized machine learning platform.
The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.
Kubeflow has evolved into a toolkit of loosely coupled tools for machine learning. Its goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
Kubeflow provides a unified layer on top of k8s to run ML workloads.
---
Unify competing interests of data scientists and devops
Data scientists want to write algorithms, experiment fast, get access to data
DevOps care about security, reliability, and cost
Containerized Machine Learning platform
JupyterHub for collaborative & interactive training
A TensorFlow Training Controller
A TensorFlow Serving deployment, plus Seldon Core for complex inference and non-TF models
Pipelines, powered by Argo, for workflows
Experiments allow you to run different configurations of pipelines
Metadata is the information about executions (runs), models, datasets, and other artifacts
Wiring to make it work on any k8s anywhere
We released an EKS-optimized GPU AMI. This is the basic building block that allows you to create GPU-powered Amazon EKS cluster. It is built on top of standard Amazon EKS-optimized AMI.
It includes the usual NVIDIA drivers, packages, and runtime to support GPU instances in the AWS cloud.
NVIDIA driver uses an autoboost feature, which varies the GPU clock speeds. By disabling the autoboost feature and setting the GPU clock speeds to their maximum frequency, you can consistently achieve the maximum performance with your GPU instances.
Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster for burstable workloads. There will be times when pods fail to run due to insufficient resources; at other times, nodes are underutilized for an extended period, so their pods can be placed on other nodes and the underutilized nodes reclaimed. Much of the GPU-related work is now going into Cluster Autoscaler. Let’s talk about some of that.
A couple of pull requests that we made are highlighted here. The first PR enabled GPU support for multiple cloud providers, which allowed the problem to be solved in an upstream-compliant way; it basically added a label and type to GPU nodes.
GPU nodes are not sensitive to CPU and memory, so CA needs a different set of metrics to decide when to scale those nodes down. The second PR adds support for that.
safe-to-evict is a standard k8s annotation. When it is set to false on a pod, CA will not remove that pod’s node during scale-down. This is particularly relevant for ML workloads: you would not want to terminate a pod that has completed hours of training but still has work to do.
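A pod manifest carrying this annotation might look like the following, written here as a Python dict for illustration. The annotation key is the standard Cluster Autoscaler one; the pod name, image, and resource shape are hypothetical.

```python
# Pod manifest (as a Python dict) telling Cluster Autoscaler that this pod
# is NOT safe to evict, so its node is kept during scale-down while a
# long-running training job is in progress.
training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "resnet50-training-worker",  # hypothetical name
        "annotations": {
            # "false" = do not evict this pod when considering scale-down.
            "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
        },
    },
    "spec": {
        "containers": [
            {
                "name": "trainer",
                "image": "my-registry/trainer:latest",  # hypothetical image
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ]
    },
}
```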
The recommendation is to create GPU node groups per AZ. This has a couple of benefits: first, it helps with network communication and keeps data transfer costs low in a distributed training job; second, it avoids ASG rebalancing across multiple AZs.
---
Escalator is designed for large batch or job based workloads that cannot be force-drained and moved when the cluster needs to scale down - Escalator will ensure pods have been completed on nodes before terminating them. It is also optimized for scaling up the cluster as fast as possible to ensure pods are not left in a pending state.
1/ AWS Deep Learning Containers provides pre-packaged Docker container images that are fully configured and validated, so customers no longer have to spend time building and testing the images.
2/ Since we’ve already optimized these docker images for AWS, customers can get the best performance and scalability right away – no tuning required.
3/ AWS Deep Learning Containers are built to work with Amazon EKS, Amazon ECS, and Amazon EC2 to give developers the flexibility and choice.
4/ Customers can also customize these container images to include their own tools and packages for a high degree of control over features of their environment such as monitoring, compliance, and scaling.
Jupyter notebook provides an easy on-ramp to build, deploy and train ML models.
You can create notebooks that contain live code and provide interactive output in a wide variety of formats such as HTML, images, video, and custom MIME types.
Jupyter supports over 40 programming languages, including Python, R, Julia, and Scala. Jupyter notebook in Kubeflow doesn't have that many language kernels yet. However, users can definitely customize on their own.
Kubeflow is at 0.7 today. One of the new features introduced in 0.6 was multi-user isolation of user-created resources. This feature allows multiple users to operate on a shared Kubeflow deployment without stepping on each others’ jobs and resources. The isolation mechanisms also prevent accidental deletion or modification of other users’ resources in the deployment.
An administrator needs to deploy Kubeflow and configure the authentication service for the deployment. A user can log into the system and will, by default, access their primary profile. A profile is a collection of Kubernetes resources along with a Kubernetes namespace of the same name.
By using Kubeflow Fairing and adding a few lines of code, you can run your ML training job locally or in the cloud, directly from Python code or a Jupyter notebook. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint.
Kubeflow Fairing packages your Jupyter notebook, Python function, or Python file as a Docker image, then deploys and runs the training job on Kubeflow. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint.
- Easily package ML training jobs: Enable ML practitioners to easily package their ML model training code, and their code’s dependencies, as a Docker image.
- Easily train ML models in a hybrid cloud environment: Provide a high-level API for training ML models to make it easy to run training jobs in the cloud, without needing to understand the underlying infrastructure.
- Streamline the process of deploying a trained model: Make it easy for ML practitioners to deploy trained ML models to a hybrid cloud environment.
Hyperparameters are the parameters that are specified for an algorithm before learning/training begins. Say a data scientist has chosen an algorithm; that choice then defines which hyperparameters can be specified. Examples include the learning rate, batch size, number of epochs, maximum depth allowed for a decision tree, number of trees in a random forest, number of neurons in a neural network layer, and the number of layers in the network.
Once the hyperparameters are chosen, then multiple training runs are conducted with different values of hyperparameters. A model is generated after each run and evaluated for optimality such as time taken to complete the training, error rate, and accuracy. There are methods like grid search, random search, and Bayesian optimization to define the spectrum of values of hyperparameters.
Katib means secretary or scribe in Arabic. As Vizier stands for a high official or a prime minister in Arabic, this project Katib is named in the honor of Vizier.
Extensible
Framework agnostic: TensorFlow, PyTorch, MXNet, …
Customizable algorithm backend
Experiment: “optimization loop” for some specific problem
Suggestion: a proposed solution to the problem
Trial: one iteration of the loop
Job: evaluate a trial and calculate objective value
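The four concepts above can be sketched as a plain-Python loop. This is a conceptual stand-in, not Katib’s API: the hyperparameter ranges, the random-search suggestion strategy, and the objective function are all made up for illustration (a real Job would train a model and report a real metric).

```python
import random

# Sketch of Katib's vocabulary: an Experiment runs an optimization loop;
# each iteration asks for a Suggestion (a proposed hyperparameter set),
# runs a Trial whose Job evaluates it, and records the objective value.
random.seed(42)

def suggestion():
    # Propose a solution: random search over two hyperparameters.
    return {
        "learning_rate": random.uniform(0.001, 0.1),
        "batch_size": random.choice([32, 64, 128, 256]),
    }

def job(params):
    # Evaluate one trial and return the objective value. This toy
    # objective just scores closeness to a made-up optimum.
    return (-abs(params["learning_rate"] - 0.01)
            - abs(params["batch_size"] - 128) / 1000)

def experiment(max_trials=20):
    best_params, best_objective = None, float("-inf")
    for _ in range(max_trials):      # the optimization loop
        params = suggestion()        # Suggestion
        objective = job(params)      # Trial -> Job -> objective value
        if objective > best_objective:
            best_params, best_objective = params, objective
    return best_params, best_objective

best_params, best_objective = experiment()
```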
HPO is hard because each additional hyperparameter grows the search space exponentially, and tuning by hand is not efficient.
Maybe you want to anchor on one particular value and then vary the others; for example, fix the batch size and vary the learning rate to find optimal values. You also want to be able to track metrics across different jobs.
You can choose different frameworks such as TensorFlow, PyTorch, or MXNet. The algorithm could be random search, grid search, or Bayesian optimization.
The lines go from right to left, indicating the three parameters used for each run of the experiment. The left side shows validation accuracy and accuracy for the training output; they use different datasets. The leftmost columns match the objective specified in the Experiment.
Now we can choose the best (optimized) values of validation accuracy and accuracy, identify the corresponding parameters, and use them for training.
One of the most important aspects of building cloud native is knowing your responsibilities and, wherever possible, leveraging the awesome landscape of cloud native technologies. KFServing has chosen Istio and Knative to solve the core serverless, networking, and revision management problems, and it decorates those layers with ML-specific opinions. It builds on top of Kubernetes, so you have full control up and down the stack for whatever you need.
Because the whole stack is open, there is a privilege and responsibility to upstream functionality. If ML customers have networking or serverless requirements, the KFServing community will deliver them, but not in our code. We will move down the stack as far as we can, and contribute the code where it serves its widest purpose. This makes the entire ecosystem stronger, and avoids reinventing the wheel over and over.
KFServing’s mission is to build an ML serving platform that is simple yet powerful. It focuses on data scientists, which means it builds concepts that make sense in their domain.
It aims to solve production model serving use cases by providing performant, high abstraction interfaces for common ML frameworks like Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX.
Low Bar High Ceiling
More recent versions are more extensible with the concept of transformers: essentially a wrapper function you can call before inference to get more information, for example, taking a user ID and fetching the location and past purchase history.
We started with our interface. This was one of the most important things to get right. We needed to find the lowest floor possible for data scientists. Unlike other serving platforms, one of our key decisions was to choose a non-container interface. Most ML frameworks support model serialization to a file, so we picked the most common frameworks and built or found a set of out-of-the-box servers to support them. These servers simply load the serialized model and start an HTTP endpoint. That meant no custom servers, no containers, no readiness checks, no bespoke code; all the complexity around server management was simply handled, and it just worked. In about eight lines, you could describe all of the infrastructure you needed to get your model up and running.
Keep in mind the principle of a low floor with a high ceiling. We’d found a pretty low floor, but where can our users go from there? They might want to manage replica limits, service accounts, or even specify resource requests like GPUs. Everything extends cleanly and consistently using cloud native terminology that users might be familiar with or might encounter in the future.
In the first half of this spec, our user’s workload scales between 3 and 10 replicas, with 2 cores and a GPU per replica. They’ve also specified a service account that grants permissions. All of these features should feel very familiar if you’ve ever deployed with Kubernetes, and in fact that’s what we’re passing through to Kubernetes under the hood.
One of our principles was a single-resource semantic: we wanted to see if we could take all of the infrastructure related to a single model and contain it within a single resource. A good example is the second chunk of this spec, the canary specification. It allows users to specify a second serving configuration and experiment on it with a percentage of their traffic. In other systems, this would mean a ton of extra resources to wire up routing configurations and two full stacks, but in KFServing you can simply copy your default spec, tweak it, and rename it canary. In our example, the only difference between default and canary is the storageUri it points to.
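As a sketch of that single resource, here is an InferenceService expressed as a Python dict for illustration. The field names follow the KFServing v1alpha2 schema of that era; the resource name, bucket paths, and traffic split are hypothetical.

```python
# KFServing InferenceService (v1alpha2-style schema) as a Python dict.
# One resource holds both the default and the canary configurations;
# the only difference here is the storageUri each one points to.
inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "my-model"},  # hypothetical name
    "spec": {
        "default": {
            "predictor": {
                # hypothetical model location
                "tensorflow": {"storageUri": "s3://my-bucket/model/v1"}
            }
        },
        "canary": {
            "predictor": {
                # tweaked copy of the default spec
                "tensorflow": {"storageUri": "s3://my-bucket/model/v2"}
            }
        },
        "canaryTrafficPercent": 10,  # 10% of traffic goes to the canary
    },
}
```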
Without a single-resource semantic, a data scientist would need to verify that all of their configurations were applied together and at the same time; if one of the resources failed to apply, they might get stuck in limbo. Because we’ve simplified this structure into a single resource, those risks don’t even need to enter their mind.
Training large models on vast amounts of data can drastically improve model performance. But consider a deep network with millions of parameters: how do we train it without waiting for days, or even multiple weeks? That’s where distributed training comes in. It allows us to train and serve a model across multiple physical machines.
This can be achieved using model parallelism and data parallelism. When a big model cannot fit into a single node’s memory, model-parallel training can be employed to handle it. Data parallelism distributes the data between different tasks.
"Data Parallelism" is the most common training configuration, it involves multiple tasks in a worker job training the same model on different mini-batches of data, updating shared parameters hosted in one or more tasks in a ps (parameter server) job. All tasks typically run on different machines or containers.
Distributed training in Kubeflow is provided using TFJob
Use a different view -> Ps parameter.
If you have a fast link with NCCL, distributed training performs better with the synchronous parameter server method than with the asynchronous method.
We wrote a blog post discussing best practices for optimizing machine learning training performance on Amazon EKS to improve throughput and minimize training times. We used Kubeflow and the FSx for Lustre CSI driver, training ResNet-50 (a deep convolutional network) on the ImageNet dataset.
We trained using mixed precision on 20 P3.16xlarge instances (160 V100 GPUs) with a batch size of 256 per GPU (an aggregate batch size of ~41k). To achieve better scaling efficiency, we used Horovod with TensorFlow. We observed near-linear scaling, with 90-100% scaling efficiency up to 160 GPUs and 98k images per second.
Enable and simplify the orchestration of end-to-end ML pipelines. An ML workflow includes all of the components that make up the steps in the workflow and how those components interact with each other.
It makes it easy to try numerous ideas and techniques, and manage your various trials/experiments.
It also enables you to reuse components and pipelines to quickly assemble end-to-end solutions without having to rebuild each time.
kfp.compiler includes classes and methods for building Docker container images for your pipeline components.
kfp.components includes classes and methods for interacting with pipeline components.
kfp.Client contains the Python client libraries and allows you to run a pipeline and create an experiment.
You can create a pipeline directly from a YAML file, or create it using the SDK.
Several different ways to run a pipeline
Directly from the UI
Invoke it from the SDK
Setting up a schedule
Pipeline run metadata is stored in Kubeflow DB
Write your application code, my-app-code.py. For example, write code to transform data or train a model.
Create a Docker container image that packages your program (my-app-code.py) and upload the container image to a registry. To build a container image based on a given Dockerfile, you can use the Docker command-line interface or the kfp.compiler.build_docker_image method from the Kubeflow Pipelines SDK.
Write a component function using the Kubeflow Pipelines DSL to define your pipeline’s interactions with the component’s Docker container. Your component function must return a kfp.dsl.ContainerOp.
Write a pipeline function using the Kubeflow Pipelines DSL to define the pipeline and include all the pipeline components. Use the kfp.dsl.pipeline decorator to build a pipeline from your pipeline function.
Compile the pipeline to generate a compressed YAML definition of the pipeline. The Kubeflow Pipelines service converts the static configuration into a set of Kubernetes resources for execution.
Use the Kubeflow Pipelines SDK to run the pipeline.
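The structure those steps produce — components whose outputs feed later components — can be sketched without any dependencies. This is a plain-Python stand-in, not the kfp DSL: a real pipeline would wrap each step as a kfp.dsl.ContainerOp and decorate the pipeline function with kfp.dsl.pipeline, and the data and component logic below are invented for illustration.

```python
# Each function stands in for one containerized pipeline component;
# the pipeline function wires them together into a workflow.

def transform_data(raw):
    # Component 1: normalize raw values into [0, 1].
    peak = max(raw)
    return [x / peak for x in raw]

def train_model(dataset):
    # Component 2: "train" a model on the transformed data
    # (here, the model is just the mean of the dataset).
    return sum(dataset) / len(dataset)

def evaluate(model, dataset):
    # Component 3: score the model (mean squared deviation).
    return sum((x - model) ** 2 for x in dataset) / len(dataset)

def pipeline(raw):
    # The pipeline definition: outputs of earlier steps become
    # inputs of later steps, forming the workflow graph.
    dataset = transform_data(raw)
    model = train_model(dataset)
    return evaluate(model, dataset)

score = pipeline([2.0, 4.0, 6.0, 8.0])
```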
Your ML workflows generate a lot of metadata. This is information such as executions (runs), models, datasets, and other artifacts. Artifacts are the files and objects that form the inputs and outputs of the components in your ML workflow.
As you start trying out multiple combinations, you need a way to manage all of this metadata together. That’s exactly the purpose of the Metadata project: it tracks and manages the metadata that workflows produce.
The Metadata component comes pre-installed in 0.7. It is currently an Alpha version, and the development team is interested in your feedback.
Manage EKS cluster provisioning with eksctl, with the flexibility to start different flavors of GPU nodes.
Manage external traffic with the AWS ALB Ingress Controller. Traffic goes through the ALB Ingress Controller to the Istio gateway and is then forwarded to Ambassador inside the cluster.
Leverage the Amazon FSx CSI driver to manage a Lustre file system, which is optimized for compute-intensive workloads such as high-performance computing and machine learning. It can scale to hundreds of GB/s of throughput and millions of IOPS.
Centralize and unify Kubernetes cluster logs in CloudWatch, which helps with debugging and troubleshooting.
Enable TLS and Authentication with AWS Certificate Manager and AWS Cognito
Enable Private Access for your Kubernetes cluster's API server endpoint
Automatically detect GPU instances and install the NVIDIA device plugin
As Kubeflow continues to evolve, you can be assured it will continue to work well on AWS
If you are modeling a taxi service, then Driver might be an entity and daily trip count might be a feature. Other interesting features might be the distance between the driver and a destination, or the time of day. A combination of multiple features is used as input for a machine learning model.
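Concretely, that entity-plus-features shape might look like this. The feature names and values are illustrative, not from any real feature store schema.

```python
# The taxi example: a Driver entity with its features, combined into
# the input vector a model consumes.
driver_features = {
    "driver_id": "d-1001",               # the entity key (hypothetical)
    "daily_trip_count": 27,              # feature
    "distance_to_destination_km": 4.2,   # feature
    "hour_of_day": 18,                   # feature
}

# A model consumes a fixed, ordered combination of features.
feature_order = ["daily_trip_count", "distance_to_destination_km", "hour_of_day"]
model_input = [driver_features[name] for name in feature_order]
```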
As your ML workloads scale, features play an important role for both training and serving.
Typical challenges:
Features not being reused: Features representing the same business concepts are being redeveloped many times, when existing work from other teams could have been reused.
Feature definitions vary: Teams define features differently and there is no easy access to the documentation of a feature.
Hard to serve up-to-date features: Combining streaming and batch-derived features, and making them available for serving, requires expertise that not all teams have. Ingesting and serving features derived from streaming data often requires specialized infrastructure. As such, teams are deterred from making use of real-time data.
Inconsistency between training and serving: Training requires access to historical data, whereas models that serve predictions need the latest values. Inconsistencies arise when data is siloed into many independent systems requiring separate tooling.
Feast is an open source feature store that can be integrated with Kubeflow to address the feature storage needs.
Feast solutions:
Discoverability and reuse of features: A centralized feature store allows organizations to build up a foundation of features that can be reused across projects. Teams are then able to utilize features developed by other teams, and as more features are added to the store it becomes easier and cheaper to build models.
Access to features for training: Feast allows users to easily access historical feature data. This allows users to produce datasets of features for use in training models. ML practitioners can then focus more on modelling and less on feature engineering.
Access to features in serving: Feature data is also available to models in production through a feature serving API. The serving API has been designed to provide low latency access to the latest feature values.
Consistency between training and serving: Feast provides consistency by managing and unifying the ingestion of data from batch and streaming sources, using Apache Beam, into both the feature warehouse and feature serving stores. Users can query features in the warehouse and the serving API using the same set of feature identifiers.
Standardization of features: Teams are able to capture documentation, metadata and metrics about features. This allows teams to communicate clearly about features, test features data, and determine if a feature is useful for a particular model.
Amazon SageMaker is investing in delivering the best experience for Machine Learning (ML) on Kubernetes by creating Kubernetes operators and Kubeflow Pipeline Components for SageMaker services. Kubernetes users will be able to use managed ML services for training, model tuning, and inference without leaving their Kubernetes environments and pipelines and without learning SageMaker APIs.
These operators and components make DevOps easier, giving customers the benefits of a managed service for ML in Kubernetes without migrating their workloads from Kubernetes and without learning SageMaker APIs.
Customers lower training costs by not paying for idle GPU instances. With SageMaker operators and pipeline components for training and model tuning, GPU resources are fully managed by SageMaker and utilized only for the duration of a job. Customers with 30% or higher idle GPU resources on their local self-managed resources for training would see a reduction in total cost by using SageMaker operators and pipeline components.
Customers can create hybrid pipelines with SageMaker pipeline components for Kubeflow that can seamlessly execute jobs on AWS, on-premise resources, and other cloud providers.