1. Key takeaways
• There is a major shift in web and mobile application architecture from traditional monolithic designs to a modern microservices architecture based on containers. Kubernetes has been very successful at managing those containers and running them in distributed computing environments.
• Enabling Big Data and Machine Learning on Kubernetes allows IT organizations to standardize on the same Kubernetes infrastructure. This will propel adoption and reduce costs.
• Kubeflow is an open source framework dedicated to making it easy to use the machine learning tools of your choice and deploy your ML applications at scale on Kubernetes. Kubeflow is becoming an industry standard as well.
• Both Kubernetes and Kubeflow enable IT organizations to focus more effort on applications rather than infrastructure.
2. What is Docker?
The Docker logo is a whale / boat hybrid loaded with shipping containers. The analogy comes from freight transport, where goods are shipped in standardized containers.
Docker is an open source technology, released in 2013, for developing and deploying applications in containers that package an application’s code, libraries, configurations and software dependencies into container images.
A container is a runnable instance of an image. Container images can be pulled from a registry (such as Docker Hub hub.docker.com, Azure Container Registry, …) and deployed anywhere the container runtime is installed: your laptop, servers on-premises, or in the cloud.
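As a concrete illustration, a container image is typically described by a Dockerfile. The sketch below is a minimal, hypothetical example for a small Python application; the file names `app.py` and `requirements.txt` are assumed placeholders, not from any real project:

```dockerfile
# Start from an official base image that provides the language runtime.
FROM python:3.6-slim

# Copy the dependency list and install the libraries the app needs.
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the application code into the image.
COPY app.py .

# Define the command the container runs when started from this image.
CMD ["python", "app.py"]
```

Building with `docker build -t my-app .` produces an image that can be pushed to a registry and then run anywhere with `docker run my-app`.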
Some of the advantages that Docker offers:
• Identical environments: deploy and run the same way in development, testing and production; an application that works in one environment will work in another.
• Isolated environments for your individual applications.
• Version control: instead of “patching”, new functionality is added to a microservice by replacing existing containers with ones that incorporate the new functionality.
• Portability: easily move workloads between different versions of Linux, for example.
• Developer productivity.
• Application agility: how quickly you can evolve an application.
• Operational efficiencies: containerized applications are easier to deploy.
• Scale out (not up): simply start more containers.
3. What is Kubernetes?
• The Kubernetes logo is literally a boat’s steering wheel.
• It should be an admiral’s hat because, as we will see, Kubernetes helps you manage a fleet of Docker ‘boats’, not just one!
• Kubernetes (numeronym: K8s) is an open source platform for automating deployment, scaling and management of containerized applications, both in the cloud and on premises.
• It was initially released by Google in 2014 and is now managed by the Cloud Native Computing Foundation (CNCF).
• Kubernetes has already been adopted by the largest public cloud vendors and technology providers.
• Some of the companies providing Kubernetes managed services: Google Cloud Platform (GCP) – GKE; Microsoft Azure – AKS; Amazon AWS – EKS; Oracle – OKE; IBM Cloud Container Service; Red Hat – OpenShift; Pivotal – PKS; Alibaba Cloud Container Service for Kubernetes, …
• Kubernetes is being embraced by ever more software vendors and enterprises.
• A Kubernetes cluster consists of at least one master node, which manages the cluster, and multiple worker nodes, where containerized applications run inside Pods.
• A Pod is a logical grouping of one or more containers. Pods enable multiple containers to run together on a host machine and share resources such as storage, networking, and container runtime information.
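A two-container Pod sharing a volume can be sketched as the following manifest; the Pod name, images, and mount paths are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: two-container-pod          # hypothetical name
spec:
  volumes:
    - name: shared-data            # volume shared by both containers
      emptyDir: {}
  containers:
    - name: web
      image: nginx                 # serves the file written by the sidecar
      volumeMounts:
        - name: shared-data
          mountPath: /usr/share/nginx/html
    - name: sidecar
      image: busybox               # writes into the shared volume
      command: ["sh", "-c", "echo hello > /data/index.html && sleep 3600"]
      volumeMounts:
        - name: shared-data
          mountPath: /data
```

Applying this with `kubectl apply -f pod.yaml` schedules both containers onto the same node, where they share the volume and the Pod’s network namespace.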
Some of the advantages that Kubernetes offers:
• Kubernetes makes containers manageable.
• Portability between cloud and on-premises environments: Kubernetes’ cloud-agnostic design lets containerized applications run on any platform without changes to the application code.
• Kubernetes provides two types of auto-scaling:
  • pod auto-scaling, where more pods are automatically created in a cluster based on scaling rules, and
  • cluster auto-scaling, where more nodes are added to a cluster based on flexible rules.
• Monitoring: rather than relying on ad hoc monitoring approaches, system monitoring is built into Kubernetes and supports a wide range of features: replicas, rolling updates, auto-scaling, etc.
• Better cluster resource utilization.
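Pod auto-scaling, for instance, can be declared with a HorizontalPodAutoscaler object; the deployment name and thresholds below are illustrative assumptions:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                        # hypothetical name
spec:
  scaleTargetRef:                      # the workload being scaled
    apiVersion: apps/v1
    kind: Deployment
    name: web                          # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70   # add pods when average CPU exceeds 70%
```

The same rule can be created imperatively with `kubectl autoscale deployment web --min=2 --max=10 --cpu-percent=70`.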
4. Why Big Data on Kubernetes?
Big Data on Kubernetes is now a reality thanks to:
• the Big Data Special Interest Group in the Kubernetes community and the many companies collaborating on the related effort;
• newer Kubernetes features such as StatefulSets, custom schedulers, custom resources, custom controllers, the Container Storage Interface, …;
• more persistent storage options for running stateful applications on Kubernetes, depending on data type, such as object storage, file systems, software-defined storage, …;
• more and more Big Data / fast data tools running on Kubernetes, such as Apache Spark, Apache Kafka, Apache Flink, Apache Cassandra, Apache ZooKeeper, …
Example: Apache Spark on Kubernetes
• Video: Submitting Spark jobs using the Kubernetes scheduler on AKS. March 16, 2018. https://www.youtube.com/watch?v=T7pAZplLiCk
• Article: Running Apache Spark jobs on AKS. March 15, 2018. https://docs.microsoft.com/en-us/azure/aks/spark-job
• Blog: Apache Spark 2.3 with Native Kubernetes Support. March 15, 2018
• Docs: Running Spark on Kubernetes. https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html
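Since Spark 2.3, a job can be submitted directly to a Kubernetes cluster with spark-submit, following the pattern in the docs above. In this sketch, the API server address and the container image are placeholders you would substitute for your own cluster:

```shell
# Submit the bundled SparkPi example to a Kubernetes cluster (Spark 2.3+).
# <k8s-apiserver-host> and <spark-image> are placeholders.
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

Kubernetes then launches the driver and executors as pods in the cluster, which is what makes the advantages discussed next possible.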
There are many ways to run Big Data applications such as Spark, for example:
• standalone mode using dedicated resources;
• a YARN cluster co-resident with Hadoop;
• a Mesos cluster alongside other Mesos applications.
So, why would you run Big Data applications on Kubernetes? In addition to all the advantages that Kubernetes offers, the following are particularly relevant to Big Data applications:
• a single container orchestrator for all your applications;
• increased server utilization;
• isolation between workloads;
• reduction in operational overhead;
• language-agnostic distributed computing clusters.
• A single container orchestrator for all your applications: Kubernetes can manage a broad range of workloads, so there is no need to deal with YARN/HDFS for data processing and a separate container orchestrator for your other applications. This solves the problem of Big Data applications running in silos in their own clusters.
• Increased server utilization: for example, share nodes between Spark and other applications by having a streaming application feed a streaming Spark pipeline, or an nginx pod serve web traffic, without the need to statically partition nodes.
• Isolation between workloads: for example, Kubernetes allows you to safely co-schedule batch workloads like Spark on the same nodes as latency-sensitive servers.
• Reduction in operational overhead: for example, static clusters require greater operational know-how for common tasks with Kafka, such as applying broker configuration updates, upgrading to a new version, and adding or decommissioning brokers. By running Kafka on Kubernetes, you can reduce the overhead of a number of common operational tasks with standard cluster resource manager features.
• Containers and Kubernetes make great language-agnostic distributed computing clusters.
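To give a flavor of what Kafka on Kubernetes looks like, here is a heavily simplified StatefulSet sketch. The name, placeholder image, and replica count are illustrative assumptions; a production deployment would also configure ZooKeeper access, broker settings, and storage classes:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka                     # hypothetical name
spec:
  serviceName: kafka-headless     # gives each broker a stable network identity
  replicas: 3                     # add or decommission brokers by changing this
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: <kafka-image>    # placeholder broker image
          ports:
            - containerPort: 9092
  volumeClaimTemplates:           # one persistent volume per broker
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Scaling, rolling upgrades, and per-broker persistent storage then come from standard StatefulSet behavior rather than bespoke cluster tooling.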
5. Why Machine Learning on Kubernetes?
Machine Learning on Kubernetes is now a reality thanks to:
• developments in Kubernetes such as stateful applications, extension points, …;
• hardware acceleration for Kubernetes from NVIDIA (GPU), Google (TPU: Tensor Processing Unit), …;
• Machine Learning tools running on Kubernetes, such as Kubeflow, Paddle, Seldon, RiseML, Anaconda, H2O, …;
• the emergence of Kubeflow, an open source framework dedicated to making it easy to use the machine learning tools of your choice and deploy your ML applications in distributed mode on Kubernetes (Kubeflow is becoming the industry standard as well);
• services such as the one from Microsoft to train and serve TensorFlow models at scale with Kubernetes and Kubeflow on Azure Kubernetes Service (AKS).
• You’ve created a machine learning model using a tool of your choice such as TensorFlow, PyTorch, or scikit-learn… Now what?
  • How can you ensure that the model is deployed to production and can scale as needed on incoming data?
  • How can you seamlessly migrate a model from your local laptop / virtual machine to your cloud platform of choice?
• Kubeflow includes:
  • the JupyterHub platform for creating and managing Jupyter notebook servers used by data science and research groups;
  • a TensorFlow custom resource (TFJob) for managing compute resources at a specific cluster size;
  • a TensorFlow Serving container to house the machine learning application.
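With the TensorFlow custom resource, a distributed training run is declared as an ordinary Kubernetes object. A minimal sketch follows; the API version reflects the early TFJob releases, and the job name, placeholder image, and replica counts are illustrative assumptions:

```yaml
apiVersion: kubeflow.org/v1alpha2     # early TFJob API version
kind: TFJob
metadata:
  name: mnist-train                   # hypothetical job name
spec:
  tfReplicaSpecs:
    PS:                               # parameter servers hold shared model state
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: <training-image> # placeholder image with the training code
    Worker:                           # workers compute gradients in parallel
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: <training-image>
```

Kubeflow’s controller creates and wires up the parameter-server and worker pods, so scaling the training job is a matter of changing the replica counts.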
• Distributed training instead of sequential training: a huge time saver for large training runs.
• Enabling Machine Learning at large scale.
• A mix of GPU and CPU nodes to serve as both a training and a serving platform.
• IT can better support data science and machine learning applications with Kubernetes as the common orchestration layer for all (containerized) applications.
• The ability for IT to create self-service environments for data scientists and other data users.
• A single scheduling solution for multiple environments, on premises or in multiple clouds.
• Better resource utilization through centralized scheduling of data science and other containerized applications.
6. How to get started?
• Learn from some free tutorials in your browser!
  • Docker & containers: https://www.katacoda.com/courses/docker
  • Kubernetes: https://www.katacoda.com/courses/kubernetes
  • Kubeflow: https://www.katacoda.com/kubeflow
• Watch some demos
  • Sentiment Analysis using Kubernetes and Kubeflow, Google, May 31st 2018: https://www.youtube.com/watch?v=-ZlIuQXyD1A
  • OSS Unboxing – Kubeflow, Lachlan Evenson, Microsoft, May 11th 2018: https://www.youtube.com/watch?v=uL_pqP_HgcY
• Do some labs
  • Labs for Training and Serving TensorFlow Models with Kubernetes and Kubeflow on Azure Container Service (AKS): https://github.com/Azure/kubeflow-labs
  • Introduction to Kubeflow on Google Kubernetes Engine (GKE): https://codelabs.developers.google.com/codelabs/kubeflow-introduction/index.html?index=..%2F..%2Fio2018#0