Slides from my talk at the Data Innovations Summit on MXNet Model Server.
Apache MXNet Model Server (MMS) is a flexible and easy-to-use tool for serving deep learning models exported from MXNet or the Open Neural Network Exchange (ONNX).
Hi everyone! My name is Adrian Hornsby, I’m a technical evangelist at AWS, and one of my focus areas is AI, especially Deep Learning. Today I’m going to talk about model serving. It’s a super interesting domain within Deep Learning, and I hope you will enjoy learning more about it. If you want to chat more, I’ll be here after the talk, so feel free to drop by!
With a show of hands – How many of you know what Deep Learning is? How many have ever implemented a neural network? How many have deployed one to production? OK – so we have fair knowledge of DL. In this talk we will not dive into the details of DNNs, since that is not the topic of this talk, nor do we have the time. But I will briefly discuss them to set the right context.
So Deep Learning is a field within Machine Learning, which is itself a field within AI. AI is the set of techniques that enable computers to mimic, and surpass, human intelligence. ML is a subset of AI, and is the set of mostly statistical techniques that enable computers to improve with experience – hence “learning”. DL is a subset of ML, a technique inspired by the human brain – or neurons to be more exact – that uses interconnected artificial neurons to learn from samples.
So at the base of Deep Learning we have the Neural Network. Let’s briefly see what these networks look like. A neural network in its simplest form is composed of layers, each consisting of a set of neurons, that are interconnected across layers with weighted edges. The term “deep learning” was coined because these networks have many hidden layers, which makes them “deep”. The network takes the input vector, matrix, or more generally tensor, and feeds every element of the input into a unit in the input layer. From there the computation cascades across the units and layers, until we get an output in the output layer. Neural networks are non-linear functions, and can learn non-linear features, because the activation function in each neuron is non-linear. They enable learning features in a hierarchical way, with each layer learning features that build on the features learned in the previous layer. And very importantly: it is a scalable architecture that can be made more complex, with more learning capability, by enlarging the network and/or modifying the operators in the neurons. And it is typically very heavy computationally. Modern networks such as ResNet-152, which has 152 layers, require roughly 11 GFLOPs (billions of floating-point operations) for a single forward pass.
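To make the "computation cascades across layers" idea concrete, here is a minimal sketch of a forward pass in plain Python. The toy network shape, the weights and the helper names (`dense`, `forward`) are all made up for illustration; real frameworks such as MXNet do the same thing with optimized tensor operations.

```python
def relu(x):
    """Non-linear activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron computes a weighted sum of all
    inputs, adds its bias, and applies the activation function."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the input through every layer in order and return the final output."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations


# Toy network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 linear output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 2.0]], [0.5], lambda x: x),
]

output = forward([2.0, 1.0], layers)
print(output)  # [4.5]
```

Serving a model means running exactly this kind of forward pass, just with millions of weights instead of six.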
Beyond the growing usage of DL in applications and devices around us, there is another interesting aspect to deep learning, and that is how well it does compared to the dominant species on this planet: us!
One of the first areas where Deep Learning was able to demonstrate state-of-the-art results was the domain of Machine Vision. A classical problem in that domain is Object Classification: given an image, identify the most prominent object in that image out of a set of pre-defined classes. A DNN presented in 2012 by Alex Krizhevsky was able to leapfrog the best-known algorithm to date by over 30%. That was really a major leap, and every year since then the best algorithms for Object Classification, and for many other Vision tasks, have been based on Deep Learning, with results that keep on getting better. A research paper by Geirhos from 2017 shows that DNNs already outperform humans in Object Classification – a task we humans have been programmed to specialize in by evolution. The paper also shows that human vision actually performs better when noise is introduced – it may make you feel better, it worked for me.
AlexNet paper: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Humans vs DNNs paper: https://arxiv.org/pdf/1706.06969.pdf
The PredNet is a deep convolutional recurrent neural network inspired by the principles of predictive coding from the neuroscience literature
A bit about why Deep Learning is a big deal
You can see Deep Learning applied in more and more domains, with a growing impact on our lives. If you look at the breadth of AI applied within Amazon alone, you can see DL in the Retail Website within personalization and recommendations, you can see it optimizing Amazon’s logistics, you probably noticed the boom in voice-enabled personal assistants, and you may have heard that Amazon drones also rely on deep learning, just as other autonomous vehicle tech relies on it. And of course the list goes on.
OK, so hopefully by now you are convinced that Deep Learning is awesome, and the next thing you want to do is use it in your production system.
So, how do you actually use a deep learning model in your production environment? Let’s start with the outcome we’re trying to achieve. In fact, it is pretty straightforward, and is not very different from deploying any other service. We have a trained model that we want to use for inference. We have a bunch of clients: mobile, desktop, IoT, cloud – or any combination of those. We want to have a server of sorts, hosting the trained model and exposing an inference API which, when called, runs a feed-forward pass through the network, doing the deep learning “magic” Naveen explained earlier.
That’s a very simple schema of a model serving setup.
As we saw in the previous slide, in many ways, serving deep learning models is similar to other, more traditional, serving frameworks out there, such as Apache Tomcat. And indeed in many ways, Model Serving is undifferentiated heavy lifting (UHL). That is a term we use and focus on a lot at AWS. It covers all of the aspects that are necessary to get the job done, but that do not differentiate the business or help it win against the competition. Setting up servers, networks, etc. is all UHL.
Let’s quickly go over the main concerns a Model Serving system needs to address:
- Performance – this concern is about providing a scalable architecture that is able to meet the target TPS, makes efficient use of the available compute resources, and strikes the right balance between throughput and latency. It is especially important for Deep Learning, since the computational load of running a single inference is typically significant. As a reference, a model such as ResNet-152 requires billions of FLOPs for a single forward pass.
- Availability – to keep your application working properly all the time, you want to minimize downtime, and avoid going offline when load is high or when you are busy deploying a new model.
- Networking – making your model consumable means you need to expose a network endpoint that clients can call to get predictions. This endpoint needs to support standard interfaces such as HTTP, error codes, security and more.
- Monitoring – having any service in production means you need the ability to look into your operational metrics in near-real time: things like resource utilization on the host, inference latencies, requests and errors.
- Model Decoupling – when you are serving models, you want the server to be able to use trained models without knowing anything about their inner workings. The model may be identifying cats in images, or doing sentiment analysis. No change should be needed on the server beyond deploying a different model.
- Cross Framework – there are many different Neural Network frameworks: MXNet, TensorFlow, PyTorch, Caffe, and more. “Same Same, But Different” - all similar, but different in style and implementation details. We want a model server that just works, regardless of the framework used to build and train the model.
- Cross Platform – similar to how there are many frameworks, there are also many platforms you can run your server on, from the OS (Linux, Windows) to the actual compute processor, which can be a CPU, a GPU or a TPU.
And beyond all of that, there is one overarching meta-concern: Ease of Use – all of the concerns just mentioned need to be addressed in a way that is easy to use, quick to learn, and just works!
Are there systems that handle that for us? The answer is: yes! Deep Learning serving is pretty nascent, but there are a few systems - let’s go over a few:
- TF Serving was open sourced in Feb 2016, and went 1.0 in Aug 2017. It is designed to serve TF models over gRPC, and is used extensively within Google.
- Clipper is an ongoing project by RISELab at UC Berkeley. It was open sourced in 2017 and is currently at v0.2. It is a machine learning serving system with support for various backend engines, including Caffe, TF and recently also MXNet.
- MXNet Model Server, or MMS for short, is actively developed by my team and was open sourced in Dec 2017. It is built on top of Apache MXNet, which is AWS’s DL framework of choice. MMS is almost at v0.2, and in this talk we will dive deeper into how it’s designed and some of the exciting engineering challenges we have in front of us as we keep developing the system.
Now that you have seen MMS in action on a simple use case, I’d like to dive into some technical details on how MMS is engineered and used.
I'll start with the Model Archive.
Now let's talk about MMS' network interface.
Let's see how MMS uses containers.
Metrics.
And lastly, I'd like to chat about how we're leveraging ONNX to achieve cross platform support.
To decouple the actual model from the serving framework, we designed the “Model Archive”. A Model Archive is a file that encapsulates all of the model-specific logic. It is the one-and-only resource MMS needs in order to set up serving for the model. In many ways, it is similar to Java’s JAR file – and indeed we took a similar implementation approach.
Let’s take a look at what is needed to generate a model archive:
- A trained neural network.
- A signature file defining input and output types and shapes, which tells MMS what endpoints to set up, and how to transform the inputs and outputs.
- Optionally, custom code, which allows users to add feature extraction logic, or any other init/pre/post processing logic they may want to build into the model.
- Optionally, whatever additional (aux) files the model will need at runtime. Class labels are an example use case for aux files.
Users use the MMS export CLI to package up all of these assets into a Model Archive package, which is then used by MMS to initialize and serve requests as we’ve seen earlier. This decoupling enables a clean separation of responsibilities between model creation and model serving:
1. The ML Engineer or Data Scientist builds and trains the model, writes feature extraction code, and then packages it all up into the archive.
2. The Software Engineer or DevOps Engineer sets up MMS on a prod cluster, and configures MMS to point to the archive, either on the local FS or at a remote URL.
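The JAR analogy can be sketched in a few lines of Python: bundle a signature and the model assets into one zip file, then read the signature back on the serving side. This is only an illustration of the idea, not the MMS export CLI. The file names, the signature fields and the helper functions below are assumptions made for the example, so check the MMS repo for the real format.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def build_archive(archive_path, signature, files):
    """Bundle a signature and model assets into a single zip file (JAR-style)."""
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("signature.json", json.dumps(signature, indent=2))
        for name, payload in files.items():
            archive.writestr(name, payload)


def load_signature(archive_path):
    """What a server would do first: read the signature to learn input/output shapes."""
    with zipfile.ZipFile(archive_path) as archive:
        return json.loads(archive.read("signature.json"))


# Hypothetical signature: field names are illustrative, not the official schema.
signature = {
    "input_type": "image/jpeg",
    "inputs": [{"data_name": "data", "data_shape": [1, 3, 224, 224]}],
    "output_type": "application/json",
    "outputs": [{"data_name": "softmax", "data_shape": [1, 1000]}],
}

# Placeholder assets standing in for the trained network, aux files and custom code.
files = {
    "model-symbol.json": b"{}",
    "model-0000.params": b"\x00",
    "synset.txt": b"cat\ndog\n",
    "custom_service.py": b"# pre/post-processing hooks would go here\n",
}

with tempfile.TemporaryDirectory() as workdir:
    archive_path = Path(workdir) / "toy.model"
    build_archive(archive_path, signature, files)
    with zipfile.ZipFile(archive_path) as archive:
        names = sorted(archive.namelist())
    restored = load_signature(archive_path)

print(names)
print(restored["inputs"][0]["data_shape"])
```

The point of the single file is the hand-off: the person who trained the model ships one artifact, and the person running the servers never needs to open it.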
Let’s quickly jump to the console to see how this looks (DEMO)
- Show a pre-prepared folder with model, signature, code and aux files
- Open the signature and show
- Open the code and show
- Show how the export utility is used
One of our major design decisions when planning MMS was to focus on ease of use, while not introducing any one-way doors that would prevent improvements in the future. With that in mind, we decided to:
- Expose REST-like endpoints over HTTP - arguably the easiest kind of endpoint to integrate with, and quite different from TFS's approach, which supports only gRPC for performance reasons. All of these endpoints are automatically generated based on the model archive's signature.json.
- Use JSON as the default encoding format for endpoints - to make it easy for clients to integrate with.
- Provide out-of-the-box support for handling binary inputs such as JPEG. With this support, clients can include a JPEG image as part of the request payload, and MMS will automatically translate it into an input tensor and resize it so it fits the model’s expected input tensor.
- Support the OpenAPI specification - this enables hooking up tooling to automate tasks, such as auto-generating client code across many popular programming languages.
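From the client side, calling such an endpoint is just an HTTP POST with the image in the payload. Here is a hedged sketch using only the Python standard library. The host, port, model name, form field name and the `/<model>/predict` path are all assumptions for illustration; the api-description endpoint of a running server is the source of truth for the actual routes.

```python
import urllib.request
import uuid


def build_predict_request(base_url, model_name, field_name, image_bytes):
    """Build (but do not send) a multipart/form-data POST carrying a JPEG image."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field_name}"; filename="image.jpg"\r\n'.encode(),
        b"Content-Type: image/jpeg\r\n\r\n",
        image_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/{model_name}/predict",  # assumed route, see lead-in
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Four bytes standing in for a real JPEG file read from disk.
fake_jpeg = b"\xff\xd8\xff\xd9"
request = build_predict_request("http://127.0.0.1:8080", "squeezenet", "data", fake_jpeg)
print(request.get_method(), request.full_url)

# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

Because the interface is plain HTTP, the same request can be made from cURL, a browser, or a client generated from the OpenAPI description.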
Let’s see how this looks – Demo 3 - Curl the api-description endpoint and go over the response
Anyone who has ever owned a service in production knows how critical it is to have a reliable and extensive set of operational metrics. You want them reported at a relatively short interval, say every minute or so, and you want them to carry the operational data that lets the service owner know the important stuff, like errors, traffic, latencies, etc.
We took care to design MMS with built-in Ops Metrics reporting, so MMS supports out of the box: (1) Requests (2) Latencies (3) Resources. We report all metrics across model and hostname dimensions, so users can set up their monitoring and alarming across an entire cluster, or across a specific model, etc. And MMS integrates directly with AWS CloudWatch, so users can use CW’s console and integrations to have full visibility and control over their prod setup.
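To show why reporting across model and hostname dimensions matters, here is a small sketch of the idea. It is not how MMS or CloudWatch implement metrics; the `MetricStore` class, the model names and the host names are hypothetical.

```python
from collections import defaultdict


class MetricStore:
    """Keeps latency samples keyed by (model, host) so they can be sliced either way."""

    def __init__(self):
        self._latencies = defaultdict(list)

    def record(self, model, host, latency_ms):
        """Store one request's latency under its model and host dimensions."""
        self._latencies[(model, host)].append(latency_ms)

    def summary(self, model=None, host=None):
        """Aggregate over every sample matching the given dimensions (None = all)."""
        samples = [
            value
            for (m, h), values in self._latencies.items()
            if (model is None or m == model) and (host is None or h == host)
            for value in values
        ]
        if not samples:
            return {"requests": 0, "avg_latency_ms": None}
        return {"requests": len(samples), "avg_latency_ms": sum(samples) / len(samples)}


store = MetricStore()
store.record("resnet", "host-a", 40.0)
store.record("resnet", "host-b", 60.0)
store.record("squeezenet", "host-a", 10.0)

print(store.summary(model="resnet"))  # one model across the whole cluster
print(store.summary(host="host-a"))   # every model on one host
print(store.summary()["requests"])    # the entire cluster
```

With both dimensions available, the same data answers "is this model slow everywhere?" and "is this host unhealthy?".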
As I demoed, you can easily run MMS on your Mac. While this will work well for prototyping or testing, it is not a scalable setup for high-load production traffic. For production deployments we recommend using containers: they are lightweight, provide isolation and have wide platform support. The MMS repo includes Docker images that are pre-configured with the required software components and configuration for optimal execution. Users can use these images with their container orchestration tool of choice, and there are plenty of good options out there such as ECS, Docker and Kubernetes. Users can build the pre-configured image MMS provides, push it to a registry, and then orchestrate it with a platform such as ECS. ECS manages the cluster, including scaling, load balancing, networking, instrumenting and more. The MMS image itself includes an NGINX reverse proxy, integrated with MMS.
To learn more about MMS container setup, visit the GitHub repo, where we have details and instructions. We’re also planning to publish a blog post about this specific use case soon!
One of the Model Serving concerns we talked about earlier was Cross Framework, and indeed there are many awesome DL frameworks to choose from. In an ideal world, you will build your DL model with whatever framework you fancy, and then just deploy it – and it will just work. Think about how the JVM works – as long as the language compiles to bytecode, it will run regardless of the language you used! A good model server will enable the same flexibility.
Another concern we talked about was cross platform support. Intel, Nvidia, Apple’s CoreML and more: many different and important runtime platforms.
The problem here, as you may have observed, is that to support all of them in a naïve way we need on the order of N^2 translations/conversions, one for every framework-platform pair, which is pretty hard to build and maintain. This is where ONNX comes into play. ONNX is an initiative driven by AWS, Facebook and Microsoft, with the goal of defining an open format for representing neural networks and their operators. You can check it out on onnx.ai. Support includes quite a few frameworks and platforms – and the list is gradually expanding!
Model Server will introduce ONNX support in the coming release that is going out next week. With ONNX support, users will be able to package up models built with frameworks that support ONNX, such as PyTorch, Caffe2 and CNTK. In the future, we may also leverage ONNX to help MMS run on more platforms, as they add support for ONNX.
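The arithmetic behind that N^2 claim is easy to check. The sketch below counts converters with and without a shared exchange format; the framework names come from this talk, while the platform names are placeholders, and nothing here claims that any particular converter actually exists.

```python
def direct_converters(frameworks, platforms):
    """Naive approach: one dedicated converter per (framework, platform) pair."""
    return [(f, p) for f in frameworks for p in platforms]


def hub_converters(frameworks, platforms, hub="ONNX"):
    """Shared-format approach: each framework exports to the hub once,
    and each platform imports from the hub once."""
    exporters = [(f, hub) for f in frameworks]
    importers = [(hub, p) for p in platforms]
    return exporters + importers


frameworks = ["MXNet", "PyTorch", "Caffe2", "CNTK"]
platforms = ["platform-a", "platform-b", "platform-c", "platform-d"]  # placeholders

print(len(direct_converters(frameworks, platforms)))  # 16, grows like N * N
print(len(hub_converters(frameworks, platforms)))     # 8, grows like N + N
```

Adding a fifth framework costs four new converters in the naive setup, but only one exporter when everything goes through the shared format.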
OK, without further ado, let’s see how MMS looks in practice. We’ll start with the basic use case: installing MMS, loading a model, serving it, and doing prediction. Ready?
Demo:
- Install MMS
- Show the model zoo, copy a model link
- Download an image
- Use cURL to do inference
- Examine the results
Thank you for listening, I hope you learned about deep learning systems and serving, and had a good time. MXNet and Model Server are open source - feel free to try them out and file issues. We’re also hiring aggressively, so if you have talented friends who want to be part of the DL revolution - feel free to refer them and talk to us! Thank you!