Deploying deep learning models
Platform agnostic approach for production with Docker + Kubernetes
About this presentation
This was originally created to explain the basics of
code deployment both in academia and startup
environments.
ACADEMIA Especially in academia, outside computer science
departments, it is typical that the code developed is very unstructured,
without much thought given to reproducibility or legibility.
STARTUPS In smaller startups it is beneficial for all the team members
to understand at least something about the various aspects involved in
building a tech product.
Based on personal experience, these sorts of knowledge gaps between a
clinician/biologist and the technical person, in both technology and
biology, can make teamwork painfully slow.
General “Data Science” Architecture
A typical data scientist or researcher may use the
“tech stack” on the left.
In academia, the code is typically not deployed anywhere, e.g. as
a custom app installed on the smartphone of a subject in a clinical
trial. The researcher simply gathers the data with some 3rd-party
software and then writes some research-grade code to analyze it.
In industry, the data science team can consist of multiple roles, and
it becomes essential for the organization to have a smooth operation
between different roles. In other words, the research and models
produced by the data scientist can be put into production quickly without
major re-writing of the code.
http://101.datascience.community/2016/11/28/data-scientists-data-eng
ineers-software-engineers-the-difference-according-to-linkedin/
https://www.slideshare.net/continuumio/journey-to-open-data-science
General “BIG DATA/ML” Architecture
For example, a model developed in TensorFlow might look like this when deployed as a product (e.g.
as an app for your phone that tells whether your image contains a cat or a dog).
DOCKER
https://www.docker.com/what-docker
e.g. ASUS ESC8000 G3 local server
'Lock-in'-less cloud service
For inference, i.e. process customer queries via API
EXACTLY THE SAME MODEL, both locally at the office and in the cloud
https://www.docker.com/survey-2016
DOCKER Deep learning?
Unfortunately that is wrong for deep learning applications. For any serious deep learning application, you need NVIDIA
graphics cards, otherwise it could take months to train your models. NVIDIA requires both the host driver and the docker
image's driver to be exactly the same. If the version is off by a minor number, you will not be able to use the NVIDIA card;
it will refuse to run. I don't know how much of the binary code changes between minor versions, but I would rather have the
card try to run instructions and get a segmentation fault than die because of a version mismatch.
We build our docker images based off the NVIDIA card and driver along with the software needed. We essentially have the
same docker image for each driver version. To help manage this, we have a test platform that makes sure all of
our code runs on all the different docker images.
This issue is mostly in NVIDIA's court; they can modify their drivers to work across different versions. I'm not
sure if there is anything that Docker can do on their side. I think it's something they should figure out though; the
combination of Docker and deep learning could help a lot more people get started faster, but right now it's
an empty promise.
http://www.somatic.io/blog/docker-and-deep-learning-a-bad-match
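The "one image per driver version" approach above can at least be automated. Below is a minimal illustrative sketch (not from the quoted post): a Python helper that reads the host NVIDIA driver version via nvidia-smi and picks a matching pre-built image. The registry name, tags and test command are placeholders; in practice the container would be launched through nvidia-docker to see the GPU.

import subprocess

def host_driver_version():
    # Query the host NVIDIA driver version, e.g. "375.66".
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"])
    return out.decode().strip().splitlines()[0]

# Hypothetical mapping from host driver version to a pre-built image tag.
DRIVER_IMAGES = {
    "375.66": "registry.example.com/dl-base:cuda8-driver375.66",
    "367.57": "registry.example.com/dl-base:cuda8-driver367.57",
}

if __name__ == "__main__":
    version = host_driver_version()
    image = DRIVER_IMAGES.get(version)
    if image is None:
        raise SystemExit("No pre-built image matches host driver " + version)
    # Mirrors the "test platform" idea from the quote above: run the test suite
    # inside the matching image (launch via nvidia-docker for GPU access).
    print("docker run --rm " + image + " python -m pytest tests/")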
The biggest impact on data science right now is not coming from a new
algorithm or statistical method. It’s coming from Docker containers.
Containers solve a bunch of tough problems simultaneously: they make it easy to
use libraries with complicated setups; they make your output reproducible; they
make it easier to share your work; and they can take the pain out of the Python data
science stack.
The wonderful triad of Docker: “Isolation! Portability!
Repeatability!” There are numerous use cases where Docker might just be what
you need, be it Data Analytics, Machine Learning or AI
DOCKERize everything as microservices
.pwc.com/us/en/technology-forecast/2014
http://www.slideshare.net/RichardHarvey7/micro-services-and-containers
(ARC401) Cloud First: New Architecture for New Infrastructure
Amazon Web Services, slideshare.net/AmazonWebServices
Why Microservices?
Why run microservices using Docker and Kubernetes?
Posted by: Seth Lakowske
Published: 2016-04-25
http://sethlakowske.com/articles/why-run-docker-containers-and-kubernetes/
Benefits of microservices
1) Code can be broken out into smaller microservices that are easier to learn, release and update.
2) Individual microservices can be written using the best tools for the job.
3) Releasing a new service doesn't require synchronization across a whole company.
4) New technology stacks have lower risk since the service is relatively small.
5) Developers can run containers locally, rebuilding and verifying after each commit on a system that mirrors production.
6) Both Docker and Kubernetes are open source and free to use.
7) Access to Docker Hub leverages the work of the open source community.
8) Service isolation without the heavyweight VM. Adding a service to a server does not affect other services on the server.
9) Services can be more easily run on a large cluster of nodes, making them more reliable.
10) Some clients will only host in private and not on public clouds.
11) Lends itself to immutable infrastructure, so services are reloadable without missing state when a server goes down.
12) Immutable containers improve security since data can only be mutated in specified volumes; rootkits often can't be installed even if the system is penetrated.
13) Increasing support for new hardware, like the GPU in a container, means even GPGPU tasks like deep learning can be containerized.
14) There is a cost for running microservices - the build and runtime become more complex. This is part of the price to pay, and if you've made the right decision in your context, then the benefits will exceed the costs.
Costs of microservices
• Managing multiple services tends to be more costly.
• New ways for networks and servers to fail.
Conclusion
In the right circumstances, the benefits of microservices outweigh the extra cost of management.
events.linuxfoundation.org, Frank Zhao
https://www.nextplatform.com/2016/09/13/will-containers-total-package-hpc/
Docker vs AWS Lambda ”In General”
AWS Lambda will win - sort of..... From a programming model
and a cost model, AWS Lambda is the future - despite some of the
tooling limitations. Docker in my opinion is an evolutionary step of
"virtualization" that we've been seeing for the last 10 years. AWS
Lambda is a step-function. In fact, I personally think it is innovations
like Amazon Elastic Beanstalk and CloudFormation that have pushed
the demand for solutions like Docker. In the near future, I predict that
open source will catch up and provide an AWS Lambda
experience on top of Docker containers. Iron.io is open source
and appears to be going down this path.
Florian Walker, Product Manager at Fujitsu
The future is now :) Funktion, part of Fabric8, aims to
provide a Lambda experience on top of Kubernetes ->
https://github.com/fabric8io/funktion
Jason Daniels, CTO - Fujitsu Hybrid Cloud EMEIA
Project Kratos...
https://www.iron.io/introducing-aws-lambda-support/
https://www.quora.com/Are-there-any-alternatives-to-Amazon-Lambda
Funktion is an open source event driven lambda style programming
model on top of Kubernetes. A funktion is a regular function in any
programming language bound to a trigger deployed into
Kubernetes. Then Kubernetes takes care of the rest (scaling, high
availability, load balancing, logging and metrics etc.).
Funktion supports hundreds of different trigger endpoint URLs including
most network protocols, transports, databases, messaging systems,
social networks, cloud services and SaaS offerings. In a sense funktion is
a serverless approach to event driven microservices, as you focus on
just writing funktions and Kubernetes takes care of the rest. It's not that
there are no servers; it's more that you as the funktion developer don't have
to worry about managing them.
Announcing Project Kratos I’m happy to
announce that Project Kratos is now available
in beta. Iron.io is rolling out a set of tools that
allow you to convert AWS Lambda functions
into Docker images. Now, you can import
existing Lambda functions and run them via
any container orchestration system. You can
also create new Lambda functions and
quickly package them up in a container to run
on other platforms. All three of the AWS
runtimes are supported – Node.js, Python and
Java.
Docker Issues: Size
Docker containers quickly grow in size as they need to contain everything required for deployment
http://blog.xebia.com/create-the-smallest-possible-docker-container/
https://www.ctl.io/developers/blog/post/optimizing-docker-images/
“Docker images can get really big. Many are over 1 GB in size. How do they get so big? Do they really
need to be this big? Can we make them smaller without sacrificing functionality?
Here at CenturyLink we've spent a lot of time recently building different Docker images. As we began
experimenting with image creation, one of the things we discovered was that our custom images
were ballooning in size pretty quickly (it wasn't uncommon to end up with images that weighed in at
1 GB or more). Now, it's not too big a deal to have a couple gigs worth of images sitting on your local
system, but it becomes a bit of a pain as soon as you start pushing/pulling these images across the
network on a regular basis.“
https://blog.replicated.com/2016/02/05/refactoring-a-dockerfile-for-image-size/
“There’s been a welcome focus in the Docker community recently around image
size. Smaller image sizes are being championed by Docker and by the community.
When many images clock in at multi-100 MB and ship with a large Ubuntu base, it’s
greatly needed.”
https://ypereirareis.github.io/blog/2016/02/15/docker-image-size-optimization/
https://github.com/microscaling/imagelayers-graph
ImageLayers.io is a project maintained by Microscaling Systems since September 2016. The project
was developed by the team at CenturyLink Labs. This utility provides a browser-based visualization
of user-specified Docker images and their layers. This visualization provides key information on the
composition of a Docker image and any commonalities between them. ImageLayers.io allows
Docker users to easily discover best practices for image construction, and aids in determining which
images are most appropriate for their specific use cases.
Deploying in Kubernetes: Please see deployment/README.md
What is lambda architecture anyway?
https://www.oreilly.com/ideas/questioning-the-lambda-architecture
The Lambda Architecture is an approach to building stream processing applications
on top of MapReduce and Storm or similar systems. This has proven to be a
surprisingly popular idea, with a dedicated website and an upcoming book.
The way this works is that an immutable sequence of records is captured and fed into a batch
system and a stream processing system in parallel. You implement your transformation
logic twice, once in the batch system and once in the stream processing system. You stitch
together the results from both systems at query time to produce a complete answer. There
are a lot of variations on this.
The Lambda Architecture is aimed at applications built around complex asynchronous
transformations that need to run with low latency (say, a few seconds to a few hours). A
good example would be a news recommendation system that needs to crawl various
news sources, process and normalize all the input, and then index, rank, and store it for
serving.
I like that the Lambda Architecture emphasizes retaining the input data unchanged. I
think the discipline of modeling data transformation as a series of materialized stages from an
original input has a lot of merit. I also like that this architecture highlights the problem of
reprocessing data (processing input data over again to re-derive output).
The problem with the Lambda Architecture is that maintaining code that needs to produce
the same result in two complex distributed systems is exactly as painful as it seems like it
would be. I don’t think this problem is fixable. Ultimately, even if you can avoid coding your
application twice, the operational burden of running and debugging two systems is
going to be very high. And any new abstraction can only provide the features supported by
the intersection of the two systems. Worse, committing to this new uber-framework walls off
the rich ecosystem of tools and languages that makes Hadoop so powerful (Hive, Pig,
Crunch, Cascading, Oozie, etc.).
Kappa Architecture is a simplification of Lambda Architecture. A
Kappa Architecture system is like a Lambda Architecture system with the batch
processing system removed. To replace batch processing, data is simply fed
through the streaming system quickly.
Kappa Architecture revolutionizes database migrations and reorganizations: just
delete your serving layer database and populate a new copy from the canonical
store! Since there is no batch processing layer, only one set of code needs to
be maintained.
kappa-architecture.com
CHALLENGING THE LAMBDA
ARCHITECTURE: BUILDING APPS
FOR FAST DATA WITH VOLTDB V5.0
dataconomy.com
VoltDB is an ideal alternative to the Lambda Architecture’s speed layer. It offers horizontal scaling
and high per-machine throughput. It can easily ingest and process millions of tuples per second
with redundancy, while using fewer resources than alternative solutions. VoltDB requires an
order of magnitude fewer nodes to achieve the scale and speed of the Lambda speed layer. As a
benefit, substantially smaller clusters are cheaper to build and run, and easier to manage.
Managing containers
Enter Kubernetes (from Google)
DOCKER Management → enter Kubernetes
https://www.youtube.com/watch?v=PivpCKEiQOQ
www.computerweekly.com/feature/Demystifying-Kubernete
Once every five years, the IT industry witnesses a major technology
shift. In the past two decades, we have seen the server paradigm evolve
into web-based architecture that matured to service orientation before
finally moving to the cloud. Today it is containers.
Docker is much more than just the tools and API. It created a vibrant
ecosystem that started to contribute to a variety of tools to manage the
lifecycle of containers. 
One of the first tools that Google decided to make open source is
called Kubernetes, which means “pilot” or “helmsman” in Greek.
Kubernetes works in conjunction with Docker. While Docker provides the
lifecycle management of containers, Kubernetes takes it to the next
level by providing orchestration and managing clusters of
containers.
Traditionally, platform as a service (PaaS) offerings such as Azure,
App Engine, Cloud Foundry, OpenShift, Heroku and Engine Yard
exposed the capability of running the code by abstracting the
infrastructure.
Kubernetes and Docker deliver the promise of PaaS through a
simplified mechanism. Once the system administrators 
configure and deploy Kubernetes on a specific infrastructure,
developers can start pushing the code into the clusters. This hides the
complexity of dealing with the command line tools, APIs and
dashboards of specific IaaS providers. 
Containers at scale
As has been demonstrated, it is relatively easy to launchtensofthousandsofcontainers
 on a single host. But how do you deploy thousands of containers? How do you
manage and keep track of them? How do you manage and recover from failure. While
these things sometimes might look easy, there are some hard problems to tackle. Let us
walkthroughwhatitmakesitsodifficult.
With a single command the Docker environment is set up and you can docker run until
you drop. But what if you have to run Docker containers across two hosts? How about
50 hosts? Or how about 10,000 hosts? Now, you may ask why one would wanttodo
 this. Therearesomegoodreasons why:
nextplatform.com/2016/03/22
https://www.nextplatform.com/2015/09/29/why-containers-at-scale-is-hard/
nextplatform.com/2016/03/03
Two founders of the Kubernetes project at Google, Craig McLuckie and Joe Beda, today
announced their new company, Heptio. The company has raised $8.5 million in a series A
investment round led by Accel, with participation from Madrona Venture Group.
Open source Kubernetes is a widely deployed technology for container orchestration. Now,
Heptio will bring a commercial version of the software to enterprises.
www.sdxcentral.com
Kubernetes
Kubernetes is an open-source system for automating
deployment, scaling, and management of containerized
applications.
It groups containers that make up an application into logical units for easy
management and discovery. Kubernetes builds upon 
15 years of experience of running production workloads at Google, combined with
best-of-breed ideas and practices from the community.
KubeWeekly — aggregating all interesting weekly news about Kubernetes
in the form of a newsletter. Manage a cluster of Linux containers as a single
system to accelerate Dev and simplify Ops.
https://kubeweekly.com/
http://nshani.blogspot.co.uk/2016/02/getting-started-with-kubernetes.html
https://www.youtube.com/watch?v=21hXNReWsUU
http://cloud9.nebula.fi/app.html
Kubernetes concepts
http://www.slideshare.net/arungupta1/package-your-java-ee-application-using-docker-and-kubernetes
linkedin.com/pulse
http://www.slideshare.net/jawnsy/kubernetes-my-bff
Inference can be very resource intensive. Our server executes the following
TensorFlow graph to process every classification request it receives. The
Inception-v3 model has over 27 million parameters and runs 5.7 billion floating
point operations per inference.
Fortunately, this is where Kubernetes can help us. Kubernetes distributes
inference request processing across a cluster using its External Load Balancer.
Each pod in the cluster contains a TensorFlow Serving Docker image with the
TensorFlow Serving-based gRPC server and a trained Inception-v3 model.
The model is represented as a set of files describing the shape of the
TensorFlow graph, model weights, assets, and so on.
Since everything is neatly packaged together, we can dynamically scale the
number of replicated pods using the Kubernetes Replication Controller to keep
up with the service demands.
blog.kubernetes.io/2016/03
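From the client side, a request against such a service boils down to a short gRPC call. The following is a hedged sketch in the style of the stock TensorFlow Serving inception_client example (circa the 2016-era beta gRPC APIs); the host, port, model name and signature name are assumptions about the deployment, not values from the post above.

import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

def classify(image_bytes, host="inception-service", port=9000):
    # Connect to the gRPC endpoint exposed by the Kubernetes load balancer.
    channel = implementations.insecure_channel(host, port)
    stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "inception"            # assumed model name
    request.model_spec.signature_name = "predict_images"
    request.inputs["images"].CopyFrom(
        tf.contrib.util.make_tensor_proto(image_bytes, shape=[1]))
    return stub.Predict(request, 10.0)               # 10 second timeout

if __name__ == "__main__":
    with open("cat.jpg", "rb") as f:
        print(classify(f.read()))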
Alternatives
medium.com/@mustwin
Bare Metal
Most schedulers, with the notable exception of Cloud Foundry, can be
installed on "bare metal" or physical machines inside your datacenter. This
can save you big on hypervisor licensing fees.
Volume Mounts
Volume mounts allow you to persist data across container
deployments. This is a key differentiator depending on your applications'
needs. Mesos is the leader here, and Kubernetes is slowly catching up.
https://news.ycombinator.com/item?id=10438273
https://www.oreilly.com/ideas/swarm-v-fleet-v-kubernetes-v-mesos
Conclusion
There are clearly a lot of choices for orchestrating, clustering, and managing
containers. That being said, the choices are generally well differentiated. In terms of
orchestration, we can say the following:
Swarm has the advantage (and disadvantage) of using the standard Docker interface.
Whilst this makes it very simple to use Swarm and to integrate it into existing
workflows, it may also make it more difficult to support the more complex scheduling
that may be defined in custom interfaces.
Fleet is a low-level and fairly simple orchestration layer that can be used as a base for
running higher level orchestration tools, such as Kubernetes or custom systems.
Kubernetes is an opinionated orchestration tool that comes with service discovery and
replication baked-in. It may require some re-designing of existing applications, but used
correctly will result in a fault-tolerant and scalable system.
Mesos is a low-level, battle-hardened scheduler that supports several frameworks for
container orchestration including Marathon, Kubernetes, and Swarm. At the time of
writing, Kubernetes and Mesos are more developed and stable than Swarm. In terms
of scale, only Mesos has been proven to support large-scale systems of hundreds or
thousands of nodes. However, when looking at small clusters of, say, less than a dozen
nodes, Mesos may be an overly complex solution.
Kubernetes Still on top?
https://news.ycombinator.com/item?id=12462261
After all, Kubernetes is a mere two years old (as a public open source
project), whereas Apache Mesos has clocked seven years in the market.
Docker Swarm is younger than Kubernetes, and it comes with the
backing of the center of the container universe, Docker Inc. Yet the
orchestration rivals pale in comparison to Kubernetes' community,
which -- now under management by the
Cloud Native Computing Foundation -- is exceptionally large and
diverse.
• Kubernetes is one of the top projects on GitHub: in the top 0.01
percent in stars and No. 1 in terms of activity.
• While documentation is subpar, Kubernetes has a significant Slack
and Stack Overflow community that steps in to answer questions
and foster collaboration, with growth that dwarfs that of its rivals.
• More professionals list Kubernetes in their LinkedIn profile than
any other comparable offering by a wide margin.
• Perhaps most glaring, data from OpenHub shows Apache Mesos
dwindling since its initial release and Docker Swarm starting to
slow. In terms of raw community contributions, Kubernetes is
exploding, with 1,000-plus contributors and 34,000 commits --
more than four times those of nearest rival Mesos.
http://www.infoworld.com/article/3118345/cloud-computing/why-kubernetes-is-winning-the-container-war.html
https://github.com/kubernetes/kubernetes
I would argue that general-purpose clusters like those managed by Google
Kubernetes are better for hosting Internet businesses depending on artificial
intelligence technologies than special-purpose clusters like NVIDIA DGX-1.
Consider the case that an experiment model training job is using all the 100 GPUs in the cluster. A
production job gets started and asks for 50 GPUs. If we use MPI, we'd have to kill the experiment job
to release enough resources to run the production job. This tends to make the owner of the
experiment job get the impression that he is doing "second-class" work.
Kubernetes is smarter than MPI as it can kill, or preempt, only 50 workers of the experiment job,
allowing both jobs to run at the same time. With Kubernetes, people have to build their programs
into Docker images that run as Docker containers. Each container has its own filesystem and
network port space. When a job runs as a container, it removes only files in its own directory. This is to
some extent like defining C++ classes in namespaces, which helps us avoid class name
conflicts.
An Example A typical Kubernetes cluster running an automatic speech recognition (ASR) business
might be running the following jobs:
1) The speech service, with as many instances as needed to serve many simultaneous user requests.
2) The Kafka system, where each channel collects a certain log stream of the speech service.
3) Kafka channels are followed by Storm jobs for online data processing. For example, a Storm job joins the
utterance log stream and transcription stream.
4) The joined result, namely the session log stream, is fed to an ASR model trainer that updates the model.
5) This trainer notifies the ASR server when it writes updated models into Ceph.
6) Researchers might change the training algorithm, and run some experiment training jobs, which serve testing
ASR service jobs.
The famous 'classical big data' on Spark
Apache Spark has emerged as the de facto framework for big data
analytics with its advanced in-memory programming model and upper-level
libraries for scalable machine learning, graph analysis, streaming and
structured data processing. It is a general-purpose cluster computing
framework with language-integrated APIs in Scala, Java, Python and R. As
a rapidly evolving open source project, with an increasing number of
contributors from both academia and industry, it is difficult for researchers
to comprehend the full body of development and research behind Apache
Spark, especially those who are beginners in this area.
In this paper, we present a technical review on big data analytics using
Apache Spark. This review focuses on the key components, abstractions
and features of Apache Spark. More specifically, it shows what Apache
Spark has for designing and implementing big data algorithms and
pipelines for machine learning, graph analysis and stream processing. In
addition, we highlight some research and development directions on
Apache Spark for big data analytics. http://dx.doi.org/10.1007/s41060-016-0027-9
In addition to the research highlights we presented in the
previous sections, there are other research works which have
been done using Apache Spark as a core engine for solving
data problems in machine learning and data mining [5,36],
graph processing [16], genomic analysis [60,65], time series
data [71], smart grid data [73], spatial data processing [87],
scientific computations of satellite data [67], large-scale
biological sequence alignment [97] and data discretization
[68]. There are also some recent works on using Apache
Spark for deep learning [46,64]. CaffeOnSpark is an open
source project [60] from Yahoo [61] for distributed deep
learning on big data with Apache Spark.
“BIG Data” Frameworks
Apache Spark for example
Tensorflow + Apache Spark
https://www.youtube.com/watch?v=PFK6gsnlV5E
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/
You might be wondering: what’s Apache Spark’s use here when most
high-performance deep learning implementations are single-node
only? To answer this question, we walk through two use cases and
explain how you can use Spark and a cluster of machines to improve
deep learning pipelines with TensorFlow:
Hyperparameter Tuning: use Spark to find the best
set of hyperparameters for neural network training,
leading to 10X reduction in training time and 34%
lower error rate.
Deploying models at scale: use Spark to apply a
trained neural network model on a large amount of
data.
How does using Spark improve the
accuracy? The accuracy with the
default set of hyperparameters is
99.2%. Our best result with
hyperparameter tuning has a
99.47% accuracy on the test set,
which is a 34% reduction of the
test error. Distributing the
computations scaled linearly with
the number of nodes added to the
cluster: using a 13-node cluster, we
were able to train 13 models in
parallel, which translates into a 7x
speedup compared to training the
models one at a time on one
machine.
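The hyperparameter-tuning use case above boils down to a simple pattern: parallelize the parameter grid so that each Spark task trains and evaluates one single-node model. A minimal hedged sketch of that pattern (not the Databricks code itself); train_and_evaluate() is a placeholder for the real single-node TensorFlow training step.

from itertools import product

learning_rates = [0.001, 0.01, 0.1]
hidden_units = [64, 128, 256]
param_grid = list(product(learning_rates, hidden_units))

def train_and_evaluate(params):
    lr, units = params
    # Placeholder: a real task would build and train a single-node TensorFlow
    # model here and return its validation error.
    dummy_error = 1.0 / (units * lr)
    return params, dummy_error

# sc is the SparkContext that pyspark provides; one task per hyperparameter combo.
results = (sc.parallelize(param_grid, len(param_grid))
             .map(train_and_evaluate)
             .collect())
best_params, best_error = min(results, key=lambda r: r[1])
print(best_params, best_error)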
The goal of this workshop is to build an end-to-end, streaming data analytics and recommendations pipeline on your local machine using Docker and the latest streaming analytics
tools. First, we create a data pipeline to interactively analyze, approximate, and visualize streaming data using modern tools such as Apache Spark, Kafka, Zeppelin, iPython, and
ElasticSearch.
http://advancedspark.com/:
Dask as an alternative to Apache Spark #1
https://youtu.be/1kkFZ4P-XHg
continuum.io/blog/developer-blog/high-performance-hadoop-anaconda-and-dask-your-cluster
Matthew Rocklin's Blog
dask, the original project
dask.distributed, the distributed memory scheduler powering the cluster computing
dask.bag, the user API we've used in this post.
Amazon EC2 with Dask configured with Jupyter Notebooks, and Anaconda.
https://github.com/dask/dask-ec2
Dask as an alternative to Apache Spark #2
http://dask.pydata.org/en/latest/spark.html
Spark is mature and all-inclusive. If you want a single project that does everything and you’re
already on Big Data hardware then Spark is a safe bet, especially if your use cases are typical
ETL + SQL and you’re already using Scala.
Dask is lighter weight and is easier to integrate into existing code and hardware. If
your problems vary beyond typical ETL + SQL and you want to add flexible parallelism to
existing solutions then Dask may be a good fit, especially if you are already using Python
and associated libraries like NumPy and Pandas.
If you are looking to manage a terabyte or less of tabular CSV or JSON data then you
should forget both Spark and Dask and use Postgres or MongoDB.
https://news.ycombinator.com/item?id=10062076
Dask seems to be aimed at parallelism of only certain operations (some parts of NumPy and
Pandas) on larger-than-memory data on a single machine. Spark is a general purpose
computing engine that can work across a cluster of machines and has many libraries
optimized for distributed computing (machine learning, graph, etc.).
The advantages of Dask seem to be that it is a drop-in replacement for NumPy and Pandas.
Granted, given the prevalence of those two libraries that isn't a small advantage.
https://www.quora.com/Is-https-github-com-blaze-dask-an-alternative-to-Spark
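To make the "drop-in replacement" point concrete, a tiny hedged sketch; the file path and column names are hypothetical, and larger-than-memory CSVs are handled the same way.

import dask.dataframe as dd

# Looks like Pandas, but the work is split into many partitions under the hood.
df = dd.read_csv("data/events-2017-*.csv")
per_user = df.groupby("user_id").amount.mean()
print(per_user.compute())   # .compute() triggers the (parallel) execution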
GPU Computing with Apache Spark and Python
by Continuum Analytics, slideshare.net
TensorFlow Basics
Weights Persistence. Save and restore a model.
Fine-Tuning. Fine-tune a pre-trained model on a new task.
Using HDF5. Use HDF5 to handle large datasets.
Using DASK. Use DASK to handle large datasets.
Kubernetes + Dask
Running on Kubernetes on Google Container Engine
This small repo gives an example Kubernetes
configuration for running dask.distributed on Google
Container Engine.
Dask Cluster Deployments
http://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments
This work is supported by Continuum Analytics and the
XDATA Program as part of the Blaze Project.
All code in this post is experimental. It should not be relied
upon. For people looking to deploy dask.distributed on a
cluster, please refer instead to the documentation.
Dask is deployed today on the following systems in the wild:
• SGE
• SLURM,
• Torque
• Condor
• LSF
• Mesos
• Marathon
• Kubernetes
• SSHandcustomscripts
…there may be more. This is what I know of first-hand.
These systems provide users access to cluster resources
and ensure that many distributed services / users play nicely
together. They’re essential for any modern cluster
deployment.
For example, both Olivier Grisel (INRIA, scikit-learn) and
Tim O'Donnell (Mount Sinai, Hammer lab) publish
instructions on how to deploy Dask.distributed on
Kubernetes.
• Olivier's repository
• Tim's repository
SciPy Tutorial Setup On Kubernetes
written by Benjamin Zaitlen on 2016-09-30
http://quasiben.github.io/blog/2016/9/30/scipy-setup/
Our goal was to give students access to a preconfigured
cluster with zero entry requirements: push a button, get a
cluster with all tools installed. To accomplish this we need
a handful of Docker images:
• Web application: button and info
• Jupyter notebook
• proxy app (more on this later)
• cluster technologies: Spark, Dask, IPython Parallel
And a handful of Kubernetes concepts:
• Pods: collection of containers (similar to docker-compose)
• namespaces: named and isolated clusters
• replication controller: a scalable Pod
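From the user's side, once such a cluster is up, using it is a few lines of Python. A minimal hedged sketch: connect a dask.distributed Client to the scheduler Service and submit work as usual. The Service name "dask-scheduler", the port and the file path are assumptions about the deployment (the files must live on storage visible to the workers).

from dask import bag as db
from dask.distributed import Client

client = Client("dask-scheduler:8786")   # the Client becomes the default scheduler

lengths = db.read_text("/shared/logs/*.json").map(len)
print(lengths.mean().compute())          # executed on the Kubernetes-hosted workers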
Managing Data
Storage layer
That is code. What about data then?
Using the different software above, an application can be deployed, scaled
easily and accessed from the outside world in a few seconds. But what about
the data? Structured content would probably be stored in a distributed
database, like MongoDB, for example. Unstructured content is traditionally
stored in either a local file system, a NAS share or in Object Storage. A local
file system doesn't work as a container can be deployed on any node in the
cluster.
On the other hand, Object Storage can be used by any application from any
container, is highly available due to the use of load balancers, doesn't require
any provisioning and accelerates the development cycle of the applications.
Why? Because a developer doesn't have to think about the way data should
be stored, manage a directory structure, and so on.
The Amazon S3 endpoint used to upload and download pictures is displayed
on the bottom left corner and shows that ViPR is used to store the data.
The fact that the picture is uploaded directly to the Object Storage platform
means that the web application is not in the data path. This allows the
application to scale without deploying hundreds of instances. This web
application can also be used to display all the pictures stored in the
corresponding Amazon S3 bucket.
The url displayed below each picture shows that the picture is downloaded
directly from the Object Storage platform, which again means that the web
application is not in the data path. This is another reason why Object
Storage is the de facto standard for web scale applications.
recorditblog.com
http://www.slideshare.net/kubecon/kubecon-eu-2016-kubernetes-storage-101
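One common way to keep the web application out of the data path, as described above, is to hand the browser a pre-signed URL so it talks to the S3-compatible endpoint directly. A hedged boto3 sketch; the endpoint, bucket and key are placeholders.

import boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

# The browser PUTs the picture straight to this URL; the web app never sees the bytes.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "pictures", "Key": "cat-or-dog-001.jpg"},
    ExpiresIn=3600,   # valid for one hour
)
print(upload_url)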
Persistent Volumes Walkthrough
The purpose of this guide is to help you become familiar with Kubernetes Persistent Volumes.
By the end of the guide, we’ll have nginx serving content from your persistent volume.
You can view all the files for this example in the docs repo here.
This guide assumes knowledge of Kubernetes fundamentals and that you have a cluster up and
running.
See Persistent Storage design document for more information.
http://kubernetes.io/docs/user-guide/persistent-volumes/walkthrough/
Data Lakes vs data warehouses #1
“A data lake is a storage repository that holds a vast amount of raw data in its
native format, including structured, semi-structured, and unstructured data. The
data structure and requirements are not defined until the data is needed.”
The table below helps flesh out this definition. It also highlights a few of the key
differences between a data warehouse and a data lake. This is, by no means, an
exhaustive list, but it does get us past this “been there, done that” mentality:
Data. A data warehouse only stores data that has been modeled/structured, while a data lake is no respecter of data. It stores it all—structured, semi-structured, and unstructured. [See my
big data is not new graphic. The data warehouse can only store the orange data, while the data lake can store all the orange and blue data.]
Processing. Before we can load data into a data warehouse, we first need to give it some shape and structure—i.e., we need to model it. That’s called schema-on-write. With a data lake, you just load
in the raw data, as-is, and then when you’re ready to use the data, that’s when you give it shape and structure. That’s called schema-on-read. Two very different approaches.
Storage. One of the primary features of big data technologies like Hadoop is that the cost of storing data is relatively low as compared to the data warehouse. There are two key reasons for this: First,
Hadoop is open source software, so the licensing and community support is free. And second, Hadoop is designed to be installed on low-cost commodity hardware.
Agility. A data warehouse is a highly-structured repository, by definition. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied
to it. A data lake, on the other hand, lacks the structure of a data warehouse—which gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and
apps on-the-fly.
Security. Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of a data lake) are relatively new. Thus, the ability to secure data in a data
warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the big data industry. It’s not a
question of if, but when.
Users. For a long time, the rally cry has been BI and analytics for everyone! We’ve built the data warehouse and invited “everyone” to come, but have they come? On average, 20-25% of them have. Is it
the same cry for the data lake? Will we build the data lake and invite everyone to come? Not if you're smart. Trust me, a data lake, at this point in its maturity, is best suited for the data scientists.
www.kdnuggets.com/2015/09
http://www.smartdatacollective.com/all/13556
Data Lakes Medical Examples
Setting Up the Data Lake
http://www.slideshare.net/CasertaConcepts/setting-up-the-data-lake-55319460
searchhealthit.techtarget.com
Unlike most relational databases' linear representation and analysis of
data, Franz's semantic graph database technology lets
users graphically see data elements and their relationships.
Montefiore also recently started another program using the data lake to do cardio-
genetic predictive analytics to determine the degrees of possibility of patients having
sudden cardiac death based on their genetic background.
Deploying Deep learning models
Small-scale inference
Data science pipeline development and deployment
https://www.continuum.io/blog/developer-blog/productionizing-and-deploying-data-science-projects
Note! Academia has been notoriously slow
to adopt new technologies in academic
research, and you should not be afraid of
using “business analyst” tools in
academia.
Dockerizing analysis research
pipelines for end-to-end
reproducibility would be an
excellent way of reducing “sloppy”
science.
Peer reviewers (or even the
journals automatically) could
potentially feed “standardized”
datasets via REST APIs at
various points of the pipeline to
ensure that the methodologies
used constitute “good science”.
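A purely hypothetical illustration of that idea: a reviewer (or the journal's bot) POSTs a reference dataset to a containerized pipeline endpoint and inspects the returned metrics. The URL, field names and response format are made up.

import requests

with open("standard_dataset_v1.csv", "rb") as f:
    response = requests.post(
        "http://analysis-container:5000/analyze",   # hypothetical pipeline endpoint
        files={"dataset": f},
        timeout=300,
    )

metrics = response.json()
print("Reported accuracy:", metrics.get("accuracy"))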
Minimal R&D ready for production
Deep Learning with Keras on Google Compute Engine
by Cole Murray. Software Engineer. Mobile & Full-Stack Engineering. Machine Learning.
https://medium.com/google-cloud/keras-inception-v3-on-google-compute-engine-a54918b0058
“Keras Inception V3 image classification model Prediction with deployment on Compute Engine
w/ Docker & Google Container Registry (like Docker Hub), using Flask Front-end Webserver.”
When you have developed and trained your model, the
prediction part in development becomes rather easy.
A Flask web server allows fast creation of a front-end
which you can use, for example, for a quick demo of your
minimum viable product (MVP), a progress report for
your boss, or an interactive front-end for showing scientific
results to your PI (principal investigator).
Keras, Tensorflow, Docker, Flask, Gunicorn, Nginx,
Supervisor tech stack in deployment
PREDICT (Inference)
e.g. “Whether the image is of a cat or a dog?”
Front-End
e.g. “How to upload a photo from your phone,
and predict what it contains?”
Note! If you are using the TensorFlow (TF) backend, you can
use a TF computation graph (i.e. a trained model)
trained “natively” in TensorFlow with the Keras
deployment, for example when your data scientist(s)
keep re-training the model; you can then just
replace the computation graph (.ckpt, checkpoint) with
the new weights.
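A minimal sketch in the spirit of the stack above (Keras + Flask only, without the Gunicorn/Nginx/Supervisor layers); the route, the file field name and the use of the stock ImageNet weights are assumptions, not the article's code.

import numpy as np
from flask import Flask, request, jsonify
from keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions
from keras.preprocessing import image

app = Flask(__name__)
model = InceptionV3(weights="imagenet")   # or swap in your own re-trained weights

@app.route("/predict", methods=["POST"])
def predict():
    # Read the uploaded photo and resize it to the 299x299 input Inception V3 expects.
    img = image.load_img(request.files["file"], target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = decode_predictions(model.predict(x), top=3)[0]
    return jsonify([{"label": label, "score": float(score)}
                    for (_, label, score) in preds])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)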
TensorFlow Serving Tensorflow for production
How to deploy Machine Learning models with TensorFlow. Part 1—
make your model ready for serving.
https://medium.com/towards-data-science/how-to-deploy-machine-learning-models-with-tensorflow-part-1-make-your-model-ready-for-serving-776a14ec3198
By Vitaly Bezgachev
Streaming Machine learning Google
How to use Google Cloud Dataflow with TensorFlow for batch
predictive analysis
Justin Kestelyn, Google Cloud Platform, April 18, 2017
https://cloud.google.com/blog/big-data/2017/04/how-to-use-google-cloud-dataflow-with-tensorflow-for-batch-predictive-analysis
This article shows you how to use Cloud Dataflow to run
batch processing for machine learning predictions. The article
uses the machine learning model trained with TensorFlow.
The trained model is exported into a Google Cloud Storage
bucket before batch processing starts. The model is
dynamically restored on the worker nodes of prediction jobs.
This approach enables you to make predictions against a
large dataset, stored in a Cloud Storage bucket or Google
BigQuery tables, in a scalable manner, because Cloud
Dataflow automatically distributes the prediction tasks to
multiple worker nodes.
Cloud Dataflow is a unified programming model and a
managed service for developing and executing a wide range
of data processing patterns including ETL (Extract, Transform,
Load), batch computation, and continuous computation.
Cloud Dataflow frees you from operational tasks such as
resource management and performance optimization.
Using Cloud Storage as a data source.
Using BigQuery as a data source.
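A condensed, hedged sketch of that pattern with the Apache Beam Python SDK (the programming model Cloud Dataflow runs): each worker restores the exported model and then scores its records. The GCS paths and the tensor names are placeholders, not the article's code.

import apache_beam as beam

class PredictDoFn(beam.DoFn):
    def start_bundle(self):
        # Restore the trained TensorFlow model lazily on the worker.
        import tensorflow as tf
        self._session = tf.Session()
        saver = tf.train.import_meta_graph("gs://my-bucket/model/export.meta")
        saver.restore(self._session, "gs://my-bucket/model/export")

    def process(self, element):
        # element is assumed to be a list of floats; feed/fetch names are assumptions.
        scores = self._session.run("scores:0", feed_dict={"input:0": [element]})
        yield {"features": element, "scores": scores[0].tolist()}

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | "Parse" >> beam.Map(lambda line: [float(v) for v in line.split(",")])
     | "Predict" >> beam.ParDo(PredictDoFn())
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/predictions"))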
Automated Kubernetes GPU Deployment #1 with TensorFlow
How to automate deep learning training with Kubernetes GPU-cluster
https://github.com/Langhalsdino/Kubernetes-GPU-Guide
By Frederic Tausch
Why did I write this guide?
I have worked as an intern for the startup understand.ai and
noticed the hassle of first designing a machine learning
algorithm locally and then bringing it to the cloud for
training with different parameters and datasets.
The second part, bringing it to the cloud for extensive
training, always takes longer than expected, is frustrating and
usually involves a lot of pitfalls.
For this reason I decided to work on this problem and make
the second part effortless, simple and quick. The result of
this work is this handy guide, which describes how everyone
Automated Kubernetes GPU Deployment #2 with TensorFlow
GPUs & Kubernetes for Deep Learning
By Samuel Cozannet
Deploy Kubernetes with GPUs
● Deploy k8s on AWS in a development mode (no HA,
colocating etcd, the control plane and PKI)
● Deploy 2x nodes with GPUs (p2.xlarge and p2.8xlarge
instances)
● Deploy 3x nodes with CPU only (m4.xlarge)
● Validate GPU availability
Add EFS storage to the cluster
● Programmatically add an EFS File System and Mount
Points into your nodes
● Verify that it works by adding a PV / PVC in k8s.
Automating Tensorflow deployment
● Data ingest code & package
● Training code & package
● Evaluation code & package
● Serving code & package
● Deployment process
So what is a Deep Learning pipeline exactly? Well, my definition is a 4 step pipeline, with a
potential retro-action loop, that consists of :
Data Ingest: This step is the accumulation and pre-processing of the data that will be used to
train our model(s). This step is maybe the least fun, but it is one of the most important: your model
will only be as good as the data you train it on.
Training: this is the part where you play God and create the Intelligence. From your data, you will
create a model that you hope will be representative of the reality and capable of describing it (and,
why not, even generating it).
Evaluation + Monitoring: Unless you can prove your model is good, it is worth just about
nothing. The evaluation phase aims at measuring the distance between your model and reality.
This can then be fed back to a human being to adjust parameters. In advanced setups, this can
essentially be your test in CI/CD of the model, and help auto tune the model even without human
intervention
Serving: there is a good chance you will want to update your model from time to time. If you want
to recognize threats on the network for example, it is clear that you will have to learn new malware
signatures and patterns of behavior, or you can close your business immediately. Serving is this
phase where you expose the model for consumption, and make sure your customers always enjoy
the latest model.
Jupyter notebooks vs IDE development
What is the practical difference?
Jupyter notebooks What are they all about
If you are a beginner in data science (and with Python) you might have noticed that there
are a lot of tutorials implemented as Jupyter notebooks (which used to be referred to as IPython notebooks).
The notebooks allow easy embedding of formatted text and images alongside executable code, which
makes them very useful to provide with scientific papers and walkthrough example code.
Idea of a notebook
Both in academia (e.g. a non-tech savvy old-school PI vs. a new generation
data wizard) and in industry, one may want to communicate
concisely the rationale for the computation along with the results.
Compare this to “Excel data science”, where a person manually
manipulates the data, destroying the end-to-end processing pipeline and
making reproducible research impossible.
In Jupyter notebook, one could include both data preparation and
data analysis parts, and the grad student could for example do the
dirty work and ask for insight from more senior researchers.
In theory, the collaborators would quickly realize what the notebook is
about, allowing quick playing around that would be easily usable by
the poor grad student as well, eliminating the Excel →
Matlab/R/Python → Excel circus probably experienced in some
research labs.
The downside is that the notebooks are not ‘plug’n’play’ executables; they
can depend on a lot of standard libraries, or, even worse for a non-
tech savvy PI, on custom libraries written by the grad student.
Jupyter notebook Features
Sep 10, 2016 • Alex Rogozhnikov
http://arogozhnikov.github.io/2016/09/10/jupyter-features.html
Jupyter notebooks Examples of capabilities
A gallery of interesting Jupyter Notebooks
David Haberthür
Colour science computations with colour, a Python
package implementing a comprehensive number of colour
theory transformations and algorithms supported by a
dedicated collection of IPython Notebooks. More colour
science related IPython Notebooks are available on
colour-science.org.
Data-driven journalism
The Need for Openness in Data Journalism, by
Brian Keegan.
St. Louis County Segregation Analysis ,
analysis for the article
The Ferguson Area Is Even More Segregated Than
You Probably Guessed
by Jeremy Singer-Vine.
Reproducible academic publications
This section contains academic papers that have been published in the peer-reviewed literature or
pre-print sites such as the ArXiv that include one or more notebooks that enable (even if only
partially) readers to reproduce the results of the publication.
4) The Paper of the Future by Alyssa Goodman et al. (Authorea Preprint, 2017). This article
explains and shows with demonstrations how scholarly "papers" can morph into long-lasting
rich records of scientific discourse, enriched with deep data and code linkages, interactive
figures, audio, video, and commenting. It includes an interactive d3.js visualization and has an
astronomical data figure with an IPython Notebook "behind" it.
A 5-minute video demonstration of this paper is available at this YouTube link.
“Traditional” notebook example
5 Better “Science Storytelling”
As we stated at the outset, communicating results by way of what cognitive scientists refer to
as "storytelling" has the deepest, most long-lasting, impact on a reader, viewer, or listener.
Until recently, journal articles only contained words, numbers, and pictures, but today we can
enhance journal articles' storytelling potential with audio, video, and enhanced figures that
offer interactivity and context. We consider each of these opportunities in turn, below.
Screen shot of the first 3D PDF published in Nature in
2009 (Goodman 2009). A video demonstration of how
users can interact with the figure is on YouTube, here.
Open the PDF here in Adobe Acrobat to interact.
Jupyter notebooks how do they compare to IDEs
“Real-life” development is typically a lot easier with a
proper IDE (Integrated Development Environment) than with notebooks.
https://www.slant.co/versus/1240/15716/~pycharm_vs_jupyter
You can use Jupyter notebooks also with
PyCharm
https://www.jetbrains.com/help/pycharm/using-ipython-jupyter-notebook-with-pycharm.html
Compare to the RStudio IDE and its notebooks
IDE PyCharm
PyCharm is the most commonly used IDE, and could be a safe bet to start with.
datacamp.com
By Paulo Vasconcellos
Top 5 Python IDEs For Data Science
https://www.kaggle.com/general/4308
Jun 28, 2017
http://noeticforce.com/best-python-ide-for-programmers-windows-and-mac
Rodeo IDE has the feel of RStudio if you are coming from R
https://www.r-bloggers.com/rstudio-clone-for-python-rodeo/
By Erik Marsja
http://www.marsja.se/rstudio-like-python-ides-rodeo-spyder/
Deploying Deep learning models
Large-scale inference (+ training) with streaming
Streaming Machine learning ”Heavier” tech stacks
End to End Streaming ML Recommendation Pipeline Spark 2 0, Kafka, TensorFlow Workshop
by Chris Fregly on 10/12/2016.
https://www.youtube.com/watch?v=UmCB9ycz55Q | https://github.com/fluxcapacitor/pipeline/wiki
“STREAMING”
i.e. continuous stream of data
for example from credit card
transactions, Uber cars, Internet
of Things devices such as
electricity meters or next-
generation vital monitoring at
hospitals.
Streaming Machine learning Confluent #1
R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics
Kai Wähner, Technology Evangelist at Confluent, Published on Mar 24, 2017
https://www.slideshare.net/KaiWaehner/r-spark-tensorflow-spark-applied-to-streaming-analytics
Big Data vs. Fast Data SQL Server, Oracle, MySQL, Teradata, Netezza, JDBC/ODBC, Hadoop, SFDC, PostgreSQL, RapidMiner,
TIBCO Spotfire
Streaming Machine learning Confluent #2a
Machine Learning and Deep Learning Applied to Real Time with
Apache Kafka Streams
Kai Wähner, Technology Evangelist at Confluent, Published on May 23, 2017
https://www.slideshare.net/KaiWaehner/apache-kafka-streams-machine-learning-deep-learning
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Kafka Cluster
Deployed anywhere:
Docker, Kubernetes,
Mesos, Java App, ...
Kafka Cluster
Streaming Machine learning Confluent #2b
Continuously train and improve the model
with every new event
How to improve models?
1) Manual Update
2) Automated Batch
3) Real Time
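The slides above use Kafka Streams (Java); as a rough Python-flavoured equivalent of the "score every event as it arrives" idea, here is a hedged kafka-python sketch. The topic names, broker address and the score() stand-in are placeholders.

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("transactions", bootstrap_servers="kafka:9092",
                         value_deserializer=lambda m: json.loads(m.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda m: json.dumps(m).encode("utf-8"))

def score(event):
    # Placeholder for the deployed model, e.g. a loaded Keras/TensorFlow graph.
    return {"fraud_probability": 0.01}

for message in consumer:
    event = message.value
    producer.send("scored-transactions", dict(event, **score(event)))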
Streaming Machine learning Databricks #1
https://www.slideshare.net/databricks/integrating-deep-learning-libraries-with-apache-spark
https://github.com/databricks/spark-deep-learning
One way to productionize a model is to deploy it as a
Spark SQL User Defined Function, which allows anyone who
knows SQL to use it. Deep Learning Pipelines provides
mechanisms to take a deep learning model and register a Spark
SQL User Defined Function (UDF).
The resulting UDF takes a column (formatted as an image struct
"SpImage") and produces the output of the given Keras model
(e.g. for Inception V3, it produces a real-valued score vector over
the ImageNet object categories).
Streaming Machine learning Databricks #2
Getting the best results in deep learning requires
experimenting with different values for training
parameters, an important step called hyperparameter
tuning. Since Deep Learning Pipelines enables
exposing deep learning training as a step in Spark’s
machine learning pipelines, users can rely on the
hyperparameter tuning infrastructure already built into
Spark.
A Vision for Making Deep Learning Simple
From Machine Learning Practitioners to Business Analysts
by Sue Ann Hong, Tim Hunter and Reynold Xin. Posted in ENGINEERING BLOG, June 6, 2017
SIMPLIFY MACHINE LEARNING WITH APACHE SPARK
Read the White Paper
Download our Machine Learning Starter Kit
Deploying Models in SQL
Once a data scientist builds the desired model, Deep Learning Pipelines makes
it simple to expose it as a function in SQL, so anyone in their organization can
use it – data engineers, data scientists, business analysts, anybody.
sparkdl.registerKerasUDF("img_classify",
"/mymodels/dogmodel.h5")
Next, any user in the organization can apply prediction in SQL:
SELECT image, img_classify(image) label FROM images
WHERE contains(label, “Chihuahua”)
Similar functionality is also available in the DataFrame programmatic API
across all supported languages (Python, Scala, Java, R). Similar to scalable
prediction, this feature works in both batch and structured streaming.
Conclusion
In this blog post, we introduced Deep Learning Pipelines, a new
library that makes deep learning drastically easier to use and scale.
While this is just the beginning, we believe Deep Learning Pipelines
has the potential to accomplish what Spark did to big data: make the
deep learning “superpower” approachable for everybody.
Future posts in the series will cover the various tools in the library in
more detail: image manipulation at scale, transfer learning,
prediction at scale, and making deep learning available in SQL.
To learn more about the library, check out the Databricks notebook
as well as the github repository. We encourage you to give us
feedback. Or even better, be a contributor and help bring the power
of scalable deep learning to everyone.
Streaming Machine learning StreamAnalytix
Impetus Technologies Unveils New, TensorFlow-Based Deep
Learning Feature on Apache Spark for StreamAnalytix
LOS GATOS, Calif., June 15, 2017 /PRNewswire/
https://streamanalytix.com/industryBuzz/PR_15_06_2017
https://streamanalytix.com/architecture
The combination of streaming analytics and deep learning enables a new
breed of applications and machine capabilities in industrial IoT, voice
analytics and anomaly detection.
Streaming Machine learning linagora
Making Image Classification Simple With Spark Deep Learning
Zied Sellami, Jun 28 2017
https://medium.com/linagora-engineering/making-image-classification-simple-with-spark-deep-learning-f654a8b876b8
Apache Spark is an open-source cluster-computing
framework. Apache Spark is a fast, in-memory data
processing engine with expressive development APIs to
allow data workers to efficiently execute streaming,
machine learning or SQL workloads that require fast
iterative access to datasets.
Conclusion
Apache Spark is a very powerful platform with elegant and
expressive APIs to allow Big Data processing.
We tried with success Spark Deep Learning, an API that
combines Apache Spark and Tensorflow to train and deploy
an image classifier. It is extremely easy (less than 30 lines
of code). Our next objective is to test whether we can deploy a facial
recognition model with this API.
While this support is only available in Python, we hope that
the integration will be done very soon for other programming
languages, especially Scala.
Run pyspark with spark-deep-learning library
spark-deep-learning library comes from Databricks and leverages Spark for its two
strongest facets:
In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep
learning in very few lines of code.
It uses Spark’s powerful distributed engine to scale out deep learning on massive
datasets.
The library run on pyspark. Let’s start pyspark:
export SPARK_HOME=PATH/TO/spark-2.1.1-bin-hadoop2.7
export JAVA_OPTS="-Xmx9G -XX:MaxPermSize=2G -XX:+UseCompressedOops -XX:MaxMetaspaceSize=512m"
$SPARK_HOME/bin/pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 --driver-memory 5g
Below is a console output example.
Deep Learning Pipelines on Apache Spark enables fast transfer learning with a
Featurizer (that transforms images to numeric features).
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer

# Use a pre-trained InceptionV3 network to turn each image into a feature vector,
# then train a plain logistic regression on top of those features.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3,
                        labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)  # train_df: a Spark DataFrame of labelled images
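As a hedged follow-up (not shown in the article), the fitted pipeline model then scores a held-out DataFrame of images (test_df is assumed) like any other Spark ML model:

predictions = p_model.transform(test_df)
predictions.select("prediction", "probability").show()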
Industry
Use cases
Kubernetes in deep learning #1
https://openai.com/blog/infrastructure-for-deep-learning/
Deep learning is an empirical
science, and the quality of a
group's infrastructure is a
multiplier on progress.
Fortunately, today's open-source
ecosystem makes it possible for
anyone to build great deep learning
infrastructure.
In this post, we'll share how deep
learning research usually
proceeds, describe the
infrastructure choices we've made to
support it, and open-source
kubernetes-ec2-autoscaler, a batch-
optimized scaling manager for
Kubernetes. We hope you find this
post useful in building your own
deep learning infrastructure.
Once the model shows sufficient promise, you'll scale it up to larger datasets and more GPUs. This requires
long jobs that consume many cycles and last for multiple days. You'll need careful experiment management,
and to be extremely thoughtful about your chosen range of hyperparameters.
Like much of the deep learning community, we use Python 2.7. We generally use Anaconda, which has
convenient packaging for otherwise difficult packages such as OpenCV and performance optimizations for
some scientific libraries. We also run our own physical servers, primarily running Titan X GPUs. We expect to
have a hybrid cloud for the long haul: it's valuable to experiment with different GPUs, interconnects, and other
techniques which may become important for the future of deep learning.
Scalable infrastructure often ends up making the simple cases harder. We put equal effort into our
infrastructure for small- and large-scale jobs, and we're actively solidifying our toolkit for making distributed
use-cases as accessible as local ones.
Kubernetes requires each job to be a Docker container, which gives us dependency isolation and
code snapshotting. However, building a new Docker container can add precious extra seconds to a
researcher's iteration cycle, so we also provide tooling to transparently ship code from a researcher's laptop
into a standard image. We expose Kubernetes's flannel network directly to researchers' laptops, allowing
users seamless network access to their running jobs. This is especially useful for accessing monitoring
services such as TensorBoard. (Our initial approach — which is cleaner from a strict isolation perspective —
required people to create a Kubernetes Service for each port they wanted to expose, but we found that it
added too much friction.)
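This is not OpenAI's internal tooling, but a minimal sketch of the general pattern: submitting a containerized training run as a Kubernetes Job with the official Python client. The image name, namespace, command and resource values below are placeholders, not taken from the original post.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="trainer",
    image="registry.example.com/research/trainer:latest",  # placeholder image
    command=["python", "train.py", "--epochs", "50"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="resnet-train"),
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container])
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="research", body=job)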
We're releasing kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. It runs as a
normal Pod on Kubernetes and requires only that your worker nodes are in Auto Scaling groups.
Our infrastructure aims to maximize the productivity of deep learning
researchers, allowing them to focus on the science. We're building tools to
further improve our infrastructure and workflow, and will share these in upcoming
weeks and months. We welcome help to make this go even faster!
Kubernetes in deep learning #2
https://news.ycombinator.com/item?id=12391505
Introducing FBLearner Flow: Facebook's AI backbone
Jeffrey Dunn, https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
Automated Kubernetes CPU Deployment Baidu Case
Baidu's deep learning framework adopts Kubernetes
Google's orchestration framework will provide smart resource allocation and cluster management to PaddlePaddle
http://www.infoworld.com/article/3167608/artificial-intelligence/baidus-deep-learning-framework-adopts-kubernetes.html
http://blog.kubernetes.io/2017/02/run-deep-learning-with-paddlepaddle-on-kubernetes.html
Why Run PaddlePaddle on Kubernetes
PaddlePaddle is designed to be slim and independent of the computing
infrastructure. Users can run it on top of Hadoop, Spark, Mesos,
Kubernetes and others. We have a strong interest in Kubernetes
because of its flexibility, efficiency and rich features.
A successful deep learning project includes both the research and
the data processing pipeline. There are many parameters to be tuned.
A lot of engineers work on the different parts of the project
simultaneously.
To ensure the project is easy to manage and utilize hardware resource
efficiently, we want to run all parts of the project on the same
infrastructure platform.
The platform should provide:
● fault-tolerance. It should abstract each stage of the pipeline as a
service, which consists of many processes that provide high
throughput and robustness through redundancy.
● auto-scaling. During the daytime there are usually many active users,
so the platform should scale out online services; during the night, the
platform should free resources for deep learning experiments.
● job packing and isolation. It should be able to assign a
PaddlePaddle trainer process requiring the GPU, a web backend
service requiring large memory, and a CephFS process requiring disk
IOs to the same node to fully utilize its hardware.
What we want is a platform which runs the deep learning system, the
Web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed
queue service (e.g., Kafka), the log joiner and other data processors
written using Storm, Spark, and Hadoop MapReduce on the same
cluster.
We want to run all jobs -- online and offline, production and experiments
-- on the same cluster, so we can fully utilize it, since different kinds of
jobs require different hardware resources.
Docker customers
Published on Aug 11, 2016
In this video Ajay Dankar, Senior Director Product Management at
PayPal discusses why they selected Docker and Docker Trusted
Registry to help them containerize their legacy apps to more
efficiently utilize their infrastructure and secure workloads.
--
Docker is an open platform for developers and system administrators to build, ship
and run distributed applications. With Docker, IT organizations shrink application
delivery from months to minutes, frictionlessly move workloads between data
centers and the cloud and can achieve up to 20X greater efficiency in their use of
computing resources. Inspired by an active community and by transparent, open
source innovation, Docker containers have been downloaded more than 700
million times and Docker is used by millions of developers across thousands of the
world’s most innovative organizations, including eBay, Baidu, the BBC, Goldman
Sachs, Groupon, ING, Yelp, and Spotify. Docker’s rapid adoption has catalyzed an
active ecosystem, resulting in more than 180,000 “Dockerized” applications, over
40 Docker-related startups and integration partnerships with AWS, Cloud Foundry,
Google, IBM, Microsoft, OpenStack, Rackspace, Red Hat and VMware.
https://www.youtube.com/watch?v=wf4Jg-9gv9Q
Business data into value
8 ways to turn data into value with Apache Spark machine learning
OCTOBER 18, 2016 by Alex Liu Chief Data Scientist, Analytics Services, IBM
http://www.ibmbigdatahub.com/blog/8-ways-turn-data-value-apache-spark-machine-learning
1. Obtain a holistic view of business
In today's competitive world, many corporations work hard to gain a holistic, 360-degree
view of their customers, for the key benefits outlined by data analytics expert Abhishek Joshi.
In many cases, a holistic view was not obtained, partly because of the lack of capability to
organize huge amounts of data and then analyze them. But Apache Spark's ability to compute
quickly while using data frames to organize huge amounts of data can help researchers quickly
develop analytical models that provide a holistic view of the business, adding value to related
business operations. To realize this value, however, an analytical process, from data cleaning to
modeling, must still be completed.
4. Avoid customer churn by rethinking churn modeling
Losing customers means losing revenue. Not surprisingly, then, companies strive to detect
potential customer churn through predictive modeling, allowing them to implement
interventions aimed at retaining customers. This might sound easy, but it can actually be very
complicated: Customers leave for reasons that are as divergent as the customers themselves
are, and products and services can play an important, but hidden, role in all this. What’s more,
merely building models to predict churn for different customer segments—and with regard to
different products and services—isn’t enough; we must also design interventions, then
select the intervention judged most likely to prevent a particular customer from departing. Yet
even doing this requires the use of analytics to evaluate the results achieved—and,
eventually, to select interventions from an analytical standpoint. Amid this morass of choices,
Apache Spark’s distributed computing capabilities can help solve previously baffling problems.
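As an illustration only (not from the IBM article), a per-segment churn model in Spark ML might look like the sketch below; churn_df, its column names and the customer_id field are assumed, not taken from any real dataset.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# churn_df is assumed to contain numeric behaviour features and a 0/1 'churned' label
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_calls", "days_since_login"],
    outputCol="features")
rf = RandomForestClassifier(labelCol="churned", featuresCol="features", numTrees=100)

churn_model = Pipeline(stages=[assembler, rf]).fit(churn_df)
scored = churn_model.transform(churn_df)   # adds 'probability' and 'prediction' columns
scored.select("customer_id", "probability").show(5)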
5. Develop meaningful purchase recommendations
Recommendations for purchases of products and services can be very powerful when made
appropriately, and they have become expected features of e-commerce platforms, with
many customers relying on recommendations to guide their purchases. Yet developing
recommendations at all means developing recommendations for each customer—or, at the very
least, for small segments of customers. Apache Spark can make this possible by offering the
distributed computing and streaming analytics capabilities that have become invaluable tools
for this purpose.
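A hedged sketch of the collaborative-filtering route in Spark MLlib is shown below; ratings_df, test_ratings_df and their column names are assumed, and recommendForAllUsers plus coldStartStrategy require Spark 2.2 or later.

from pyspark.ml.recommendation import ALS

# ratings_df is assumed to have integer user_id / item_id columns and a numeric rating
als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating",
          rank=10, regParam=0.1, coldStartStrategy="drop")
als_model = als.fit(ratings_df)

# Predicted ratings for held-out user/item pairs, and top-5 item recommendations per user
predictions = als_model.transform(test_ratings_df)
top5_per_user = als_model.recommendForAllUsers(5)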
ebaytechblog: Spark is helping eBay create value from its data, and so the future is bright
for Spark at eBay. In the meantime, we will continue to see adoption of Spark increase at
eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product
announcements, industry chatter, and Spark’s own strengths and capabilities.
http://dx.doi.org/10.1186/s40165-015-0014-6
https://thinkbiganalytics.com/big_data_solutions/data-science/
Open data and Apache Spark
----- Jump to Topic -----
00:00:06 - Workshop Intro & Environment Setup
00:13:06 - Brief Intro to Spark
00:17:32 - Analysis Overview: SF Fire Department Calls for Service
00:23:22 - Analysis with PySpark DataFrames API
00:29:32 - Doing Date/Time Analysis
00:47:53 - Memory, Caching and Writing to Parquet
01:00:40 - SQL Queries
01:21:11 - Convert a Spark DataFrame to a Pandas DataFrame
----- Q&A -----
01:24:43 - Spark DataFrames vs. SQL: Pros and Cons?
01:26:57 - Workflow for Chaining Databricks Notebooks into a Pipeline?
01:30:27 - Is Spark 2.0 Ready to Use in Production?
https://www.youtube.com/watch?v=iiJq8fvSMPg
Internet of things (IoT)
To prove that our IoT platform is really independent of the application environment, we took one IoT
gateway (Raspberry Pi 2) from the city project and put it into the Austin Convention Center during the
OpenStack Summit, together with an IQRF-based mesh network connecting sensors that measure
humidity, temperature and CO2 levels. This demonstrates that the IoT gateway can manage or
collect data from any technology, such as IQRF, Bluetooth, GPIO, or any other communication
standard supported on Linux-based platforms.
We deployed 20 sensors and 20 routers on 3 conference floors with a single active IoT gateway
receiving data from the entire IQRF mesh network and relaying it to a dedicated time-series database,
in this case Graphite. The collector is an MQTT-Java bridge running inside a Docker container managed
by Kubernetes.
The following screenshot shows real-time CO2 values from different rooms on 2
floors. The historical graph shows values from Monday; you can easily recognize
when the main keynote session started and when the lunch period was.
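The collector in this setup was a Java bridge; purely as an illustrative sketch (with assumed broker host, topic layout and metric names), the same MQTT-to-Graphite relay pattern could be written in a few lines of Python:

import socket
import time
import paho.mqtt.client as mqtt

GRAPHITE_HOST, GRAPHITE_PORT = "graphite.example.local", 2003  # Graphite plaintext protocol port

def on_message(client, userdata, msg):
    # Topics are assumed to look like sensors/<room>/<quantity>; payload is the numeric value
    metric = "iot." + msg.topic.replace("/", ".")
    line = "%s %s %d\n" % (metric, msg.payload.decode(), int(time.time()))
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(line.encode())

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker.example.local", 1883)
client.subscribe("sensors/#")
client.loop_forever()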
Healthcare
https://www.youtube.com/watch?v=ePp54ofRqRs
https://www.healthdirect.gov.au/
How open source container tech can impact healthcare
At Red Hat, we believe that creating open source platforms allows the tech community
to develop the best software possible. We recently launched a series of films
highlighting the open source movement’s impact on healthcare, including
initiatives that promote open patient data and provide 3D-printed prosthetics.
Health is a great context to start exploring OpenShift’s open source capabilities. We
designed OpenShift to allow developers to take full advantage of containers (Docker)
and orchestration (Kubernetes), without having to learn the internals of how to build
containers from scratch or understand sys admin enough to deploy production-quality
apps that can scale on demand.
OpenShift makes using containers and orchestration accessible by letting you focus
on code instead of writing Dockerfiles and running Docker builds all day. With the
integrated Source-to-Image open source project, the platform automatically creates
containers while requiring only the URL for your source code repository.
openshift.devpost.com
Improving Container Security: Docker and More
After 6 months and 15 successful beta deployments, Twistlock is announcing the general
availability of our container security suite. Twistlock came out of stealth in May 2015. Since then,
we have been working diligently with a select group of beta customers to validate the
value of our offerings. This diverse group of 15 beta testers, including Wix, AppsFlyer,
and HolidayCheck, spans financial services, hospitality, healthcare, Internet services,
and government. These customers confirmed that we are hitting the sweet spot of
their most pressing container security needs -- a majority of them already deployed
our product into their production environments, protecting live services and customer
data.
The logical resource boundaries established in Docker containers are almost as secure as
those established by the Linux operating system or by a virtual machine, according to a
report by Gartner analyst Joerg Fritsch. However, Docker and Linux containers in general fall
short when it comes to container management and administration, Fritsch said in his report,
"Security properties of containers managed by Docker."
Academia
Use cases
Neuroscience & bioinformatics
http://dx.doi.org/10.1016/j.conb.2015.04.002
Most large-scale analytics, whether in industry or neuroscience, involve common patterns. Raw data
are massive in size. Often, they are processed so as to extract signals of interest, which are then used for
statistical analysis, exploration, and visualization. But raw data can be analyzed or visualized directly (top
arrow). And the results of each successive step informs how to perform the earlier ones (feedback loops).
Icons below highlight some of the technologies, discussed in this essay, that are core to the modern large-
scale analysis workflow.
“Cloud deployment also makes it easier to build tools that run identically for all users,
especially with virtual machine platforms like Docker. However, cloud deployment for
neuroscience does require transferring data to cloud storage, which may become a bottleneck.
Deploying on academic clusters requires at least some support from cluster administrators but
keeps the data closer to the computation. … There is also rapidly growing interest in the ‘‘data
analysis notebook’’. These notebooks – the Jupyter notebook being a particularly popular
example – combine executable code blocks, notes, and graphics in an interactive document that
runs in a web browser, and provides a seamless front-end to a computer, or a large
cluster of computers if running against a framework like Spark. Notebooks are a particularly
appealing way to disseminate information; a recent neuroimaging paper, for example, provided all
of its analyses in a version-controlled repository hosted on GitHub with Jupyter notebooks that
generate all the figures in the paper [45]—a clear model for the future of reproducible
science.”
https://www.docker.com/customers/docker-helps-varian-medical-systems-battle-cancer
https://dx.doi.org/10.12688/f1000research.7536.1
http://dx.doi.org/10.1371/journal.pone.0152686
http://dx.doi.org/10.1186/s13742-015-0087-0
http://homolog.us/blogs/blog/2015/09/22/is-docker-for-suckers/
Neuroscience streaming data with Spark
http://dx.doi.org/10.1016/j.conb.2015.04.002
Streaming analysis of two-photon calcium
imaging*
* With these levels of analysis in mind, we probe the cortical
circuits underlying active tactile decision making. Whisker-
based haptic tasks for head-fixed mice developed in our lab
provide outstanding stimulus control and facilitate
applications of powerful biophysical methods, such as
whole cell recordings and two-photon microscopy.
in collaboration with Karel Svoboda and Nicholas Sofroniew.
https://www.janelia.org/lab/svoboda-lab
https://youtu.be/uUQTSPvD1mc?t=17m
Real-time visualization during the experiment, rather than running
the experiment with no good idea of what is going on until the post-
experiment offline analysis.
Real-time feedback on the experimental parameters, or to the brain
itself, for example via optogenetic stimulation.
Reproducible SCIENCE
http://dx.doi.org/10.1038/nj7622-703a
http://dx.doi.org/10.1038/533452a
http://t-redactyl.io/blog/2016/10/a-crash-course-in-reproducible-research-in-python.html
http://conference.scipy.org/proceedings/scipy2016/pdfs/christian_oxvig.pdf
The use of 'custom MATLAB scripts'
Reproducible SCIENCE with docker
ANACONDA AND DOCKER
BETTER TOGETHER FOR REPRODUCIBLE DATA SCIENCE
Monday, June 20, 2016, continuum.io/blog
Anaconda integrates with many different providers and platforms to give
you access to the data science libraries you love on the services you use,
including Amazon Web Services, Microsoft Azure, and Cloudera CDH. Today
we’re excited to announce our new partnership with Docker.
As part of the announcements at DockerCon this week, Anaconda images
will be featured in the new Docker Store, including Anaconda and
Miniconda images based on Python 2 and Python 3. These freely available
Anaconda images for Docker are now verified, will be featured in the Docker
Store when it launches, are being regularly scanned for security
vulnerabilities and are available from the
ContinuumIO organization on Docker Hub.
Anaconda and Docker are a great combination to empower
your development, testing and deployment workflows with
Open Data Science tools, including Python and R. Our users often ask
whether they should be using Anaconda or Docker for data science
development and deployment workflows. We suggest using both - they’re
better together!
Reproducible SCIENCE Between Jupyter and Docker
Jupyter/JupyterLab does not really come as
'plug'n'play' and you still have to have all the
dependencies resolved
Build your own conda packages, and deploy
continuum.io/blog/developer-blog/whats-old-and-new-conda-build
Anaconda
Enterprise
Notebooks
continuum.io
Deploying deep learning models with Docker and Kubernetes

  • 1. Deploying deep learning models Platform agnostic approach for production with docker+Kubernetes
  • 2. About this presentation This was originally created to explain the basics of code deployment both in academia and startup environments. ACADEMIA Especially in academia outside computer science departments it is typical that the code developed is very unstructured without much thought on reproducibility or legibility. STARTUPS In smaller startups it is beneficial for all the team members to understand at least something on all the various aspects involved in building a tech product. Based on personal experience, these sort of knowledge gaps between a clinician/biologist and the technical person both in technology and biology can make the teamwork painfully slow.
  • 3. General “Data Science” Architecture Typical data scientist or researcher may use the “tech stack” on the left. In academia, the code is typically not deployed anywhere, e.g. installing a custom app on the smartphone of a subject in a clinical trial. The researcher simply gather the data with some 3rd party software and then writes some research-grade code to analyze it. In industry, the data science team can consist of multiple roles, and it becomes essential for the organization to have a smooth operation between different roles. In other words, the research and models done by the data scientist can be put in production quickly without major re-writing of the code. http://101.datascience.community/2016/11/28/data-scientists-data-eng ineers-software-engineers-the-difference-according-to-linkedin/ https://www.slideshare.net/continuumio/journey-to-open-data-science
  • 4. General “BIG DATA/ML” Architecture For example the model developed in TensorFlow might look like this when deployed as product (e.g. as an app for your phone to tell whether your image contains a cat or a dog).
  • 6. DOCKER Deep learning? Unfortunately that is wrong for deep learning applications. For any serious deep learning application, you need NVIDIA graphics cards, otherwise it could take months to train your models. NVIDIA requires both the host driver and the docker image's driver to be exactly the same. If the version is off by a minor number, you will not be able to use the NVIDIA card, it will refuse to run. I don't know how much of the binary code changes between minor versions, but I would rather have the card try to run instructions and get a segmentation fault then die because of a version mismatch. We build our docker images based off the NVIDIA card and driver along with the software needed. We essentially have the same docker image for each driver version. To help stay manage this, we have a test platform that makes sure all of our code runs on all the different docker images. This issue is mostly in NVIDIA's court, they can modify their drivers to be able to work across different versions. I'm not sure if there is anything that Docker can do on their side. I think its something they should figure out though, the combination of docker and deep learning could help a lot more people get started faster, but right now its an empty promise. http://www.somatic.io/blog/docker-and-deep-learning-a-bad-match The biggest impact on data science right now is not coming from a new algorithm or statistical method. It’s coming from Docker containers. Containers solve a bunch of tough problems simultaneously: they make it easy to use libraries with complicated setups; they make your output reproducible; they make it easier to share your work; and they can take the pain out of the Python data science stack. The wonderful triad of Docker : “Isolation! Portability! Repeatability!” There are numerous use cases where Docker might just be what you need, be it Data Analytics, Machine Learning or AI
  • 7. DOCKERize everything as microservices .pwc.com/us/en/technology-forecast/2014 http://www.slideshare.net/RichardHarvey7/micro-services-and-containers (ARC401) Cloud First: New Architecture for New Infrastructure Amazon Web Services, slideshare.net/AmazonWebServices
  • 8. Why Microservices? Whyrunmicroservicesusing DockerandKubernetes? Posted by: Seth Lakowske Published: 2016-04-25 http://sethlakowske.com/articles/why-run-docker-containers-and-kubernetes/ Benefitsof microservices 1) Codecanbebrokenoutintosmaller microservicesthatareeasiertolearn,release andupdate. 2) Individualmicroservicescanbewrittenusingthebesttoolsfor thejob. 3) Releasing anewservicedoesn'trequiresynchronizationacrossawholecompany. 4) Newtechnologystackshavelowerrisksincetheserviceisrelativelysmall. 5) Developerscanruncontainerslocally,rebuildingandverifyingaftereachcommitona systemthatmirrorsproduction. 6) BothDocker andKubernetesareopensourceandfreetouse. 7) AccesstoDockerhubleveragesthework oftheopensourcecommunity. 8) ServiceisolationwithouttheheavyweightVM.Addingaservicetoaserver doesnot affectother servicesontheserver. 9) Servicescanbemoreeasilyrunonalargeclusterofnodesmakingitmorereliable. 10) Someclientswillonlyhostinprivateandnotonpublicclouds. 11) Lendsitselftoimmutableinfrastructure,soservicesarereloadablewithoutmissing statewhenaserver goesdown. 12) Immutablecontainersimprovesecuritysincedatacanonlybemutatedinspecified volumes,rootkitsoftencan'tbeinstalledevenifthesystemispenetrated. 13) Increasingsupportfornewhardware,liketheGPUinacontainer meansevengpgpu taskslikedeeplearningcanbecontainerized. 14) Thereisacostforrunningmicroservices-thebuildandruntimebecomesmore complex.Thisispartofthepricetopayandifyou'vemadetherightdecisioninyour context,thenbenefitswillexceedthecosts. Costsof microservices • Managingmultipleservicestendstobemorecostly. • Newwaysfor network andserverstofail. Conclusion Intherightcircumstances,thebenefitsofmicroservicesoutweightheextracostof management. events.linuxfoundation.org,Frank Zhao https://www.nextplatform.com/2016/09/13/will-containers-total-package-hpc/
  • 9. Dockervs AWS Lambda ”In General” AWS Lambda will win - sort of.....  From a programming model and a cost model, AWS Lambda is the future - despite so of the tooling limitations. Docker in my opinion is an evolutionary step of "virtualization" that we've been seeing for the last 10 years. AWS Lambda is a step-function. In fact, I personally think it is innovations like Amazon Elastic Beanstalk and CloudFormation that has pushed the demand solutions like Docker.  In the near future, I predict that open source will catch up and provide an AWS Lambda experience on top of Docker containers. Iron.io is opensource andappearstobe goingdownthispath. FlorianWalker ProductManageratFujitsu Thefutureisnow:)Funktion,partofFabric8,aimsto provideaLamdaexperienceon-topofKubernetes-> https://github.com/fabric8io/funktion JasonDaniels CTO-FujitsuHybridCloudEMEIA projectKratos.. https://www.iron.io/introducing-aws-lambda-support/ https://www.quora.com/Are-there-any-alternatives-to-Amazon-Lambda Funktion is an open source event driven lambda style programming model on top of Kubernetes. A funktion is a regular function in any programming language bound to a trigger deployed into Kubernetes. Then Kubernetes takes care of the rest (scaling, high availability, loadbalancing, loggingandmetricsetc). Funktion supports hundreds of different triggerendpoint URLs including most network protocols, transports, databases, messaging systems, social networks, cloud services and SaaS offerings. In a sense funktion is a serverless approach to event driven microservices as you focus on just writing funktions and Kubernetes takes care of the rest. Its not that there's no servers; its more that you as the funktion developer don't have toworryabout managingthem. Announcing ProjectKratos I’m happy to announce that Project Kratos is now available in beta. Iron.io is rolling out a set of tools that allow you to convert AWS Lambda functions into Docker images. Now, you can import existing Lambda functions and run them via any container orchestration system. You can also create new Lambda functions and quickly package them up in a container to run on other platforms. All three of the AWS runtimes are supported – Node.js, Python and Java.
  • 10. Docker Issues Size Dockercontainers quickly grow in size as theyneedto contain everythingrequired fordeployment http://blog.xebia.com/create-the-smallest-possible-docker-container/ https://www.ctl.io/developers/blog/post/optimizing-docker-images/ “Docker imagescan get reallybig. Manyare over 1Gin size. How dotheyget sobig?Dotheyreally need tobe thisbig? Canwemake them smaller without sacrificingfunctionality? Here atCenturyLinkwe've spent alot oftime recentlybuildingdifferent docker images. As we began experimentingwith image creationoneof the thingswe discovered wasthat our custom images were ballooningin size prettyquickly(it wasn't uncommontoend up with imagesthat weighed-in at 1GB or more). Now, it'snot toobigadeal tohaveacouple gigsworth ofimages sittingon your local system, but it becomesabit ofpain assoon asyoustart pushing/pullingthese imagesacrossthe networkon aregular basis. “ https://blog.replicated.com/2016/02/05/refactoring-a-dockerfile-for-image-size/ “There’s been a welcomefocus in the Docker community recently around image size. Smaller image sizes are being championed by Docker and by the community. When many images clock in at multi-100 MB and ship with a large ubuntu base, it’s greatlyneeded.” https://ypereirareis.github.io/blog/2016/02/15/docker-image-size-optimization/ https://github.com/microscaling/imagelayers-graph ImageLayers.io is a project maintained by Microscaling Systems since September 2016. The project was developed by the team at CenturyLinkLabs. This utility provides a browser-based visualization of user-specified Docker Images and their layers. This visualization provides key information on the composition of a Docker Image and any commonalitiesbetween them. ImageLayers.io allows Docker users to easily discover best practices for image construction, and aid in determining which imagesare most appropriate for their specificuse cases. Deploying inKubernetes Please seedeployment/README.md
  • 11. What is lambda architecture anyway? https://www.oreilly.com/ideas/questioning-the-lambda-architecture The Lambda Architecture is an approach to building stream processing applications on top of MapReduce andStorm or similar systems. This has proven to be a surprisinglypopular idea,withadedicated website andan upcomingbook.  Theway thisworksisthatan immutablesequenceofrecordsiscaptured and fedintoa batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer. There arealot ofvariationson this. The Lambda Architecture is aimed at applications built around complex asynchronous transformations that need to run with low latency (say, a few seconds to a few hours). A good example would be a news recommendation system that needs to crawl various news sources, process and normalize all the input, and then index, rank, and store it for serving. I like that the Lambda Architecture emphasizes retaining the input data unchanged. I think the discipline of modeling data transformation as a series of materialized stages from an original input has a lot of merit. I also like that this architecture highlights the problem of reprocessingdata (processinginput dataoveragain tore-deriveoutput). The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be. I don’t think this problem is fixable. Ultimately, even if you can avoid coding your application twice, the operational burden of running and debugging two systems is going to be very high. And any new abstraction can only provide the features supported by the intersection of the two systems. Worse, committing to this new uber-framework walls off the rich ecosystem of tools and languages that makes Hadoop so powerful (Hive, Pig, Crunch,Cascading,Oozie,etc). Kappa Architecture is a simplification of Lambda Architecture. A Kappa Architecture system is like a Lambda Architecture system with the batch processing system removed. To replace batch processing, data is simply fed throughthestreamingsystemquickly. Kappa Architecture revolutionizes database migrations and reorganizations: just delete your serving layer database and populate a new copy from the canonical store! Since there is no batch processing layer, only one set of code needs to bemaintained. kappa-architecture.com CHALLENGING THE LAMBDA ARCHITECTURE: BUILDING APPS FOR FAST DATA WITH VOLTDB V5.0 dataconomy.com VoltDB is an ideal alternativeto the LambdaArchitecture’s speed layer. It offers horizontal scaling and high per-machine throughput. It can easily ingest and process millions of tuples per second with redundancy, while using fewer resources than alternative solutions. VoltDB requires an orderofmagnitude fewer nodesto achievethescale andspeed of the Lambdaspeed layer. Asa benefit,substantiallysmallerclustersarecheapertobuildandrun,and easiertomanage.
  • 13. DOCKER Management: enter → Kubernetes https://www.youtube.com/watch?v=PivpCKEiQOQ www.computerweekly.com/feature/Demystifying-Kubernete Once every five years, the IT industry witnesses a major technology shift. In the past two decades, we have seen the server paradigm evolve into web-based architecture that matured to service orientation before finally moving to the cloud. Today it is containers. Docker is much more than just the tools and API. It created a vibrant ecosystem that started to contribute a variety of tools to manage the lifecycle of containers. One of the first tools that Google decided to make open source is called Kubernetes, which means “pilot” or “helmsman” in Greek. Kubernetes works in conjunction with Docker. While Docker provides the lifecycle management of containers, Kubernetes takes it to the next level by providing orchestration and managing clusters of containers. Traditionally, platform as a service (PaaS) offerings such as Azure, App Engine, Cloud Foundry, OpenShift, Heroku and Engine Yard exposed the capability of running the code by abstracting the infrastructure. Kubernetes and Docker deliver the promise of PaaS through a simplified mechanism. Once the system administrators configure and deploy Kubernetes on a specific infrastructure, developers can start pushing the code into the clusters. This hides the complexity of dealing with the command line tools, APIs and dashboards of specific IaaS providers.
  • 14. Containers at scale As has been demonstrated, it is relatively easy to launch tens of thousands of containers on a single host. But how do you deploy thousands of containers? How do you manage and keep track of them? How do you manage and recover from failure? While these things sometimes might look easy, there are some hard problems to tackle. Let us walk through what makes it so difficult. With a single command the Docker environment is set up and you can docker run until you drop. But what if you have to run Docker containers across two hosts? How about 50 hosts? Or how about 10,000 hosts? Now, you may ask why one would want to do this. There are some good reasons why: nextplatform.com/2016/03/22 https://www.nextplatform.com/2015/09/29/why-containers-at-scale-is-hard/ nextplatform.com/2016/03/03 Two founders of the Kubernetes project at Google, Craig McLuckie and Joe Beda, today announced their new company, Heptio. The company has raised $8.5 million in a series A investment round led by Accel, with participation from Madrona Venture Group. Open source Kubernetes is a widely deployed technology for container orchestration. Now, Heptio will bring a commercial version of the software to enterprises. www.sdxcentral.com
  • 15. Kubernetes Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery. Kubernetes builds upon  15 years of experience of running production workloads at Google, combined with best-of-breed ideas and practices from the community. KubeWeekly — aggregating all interesting weekly news about Kubernetes in the form of a newsletter. Manage a cluster of Linux containers as a single system to accelerate Dev and simplify Ops. https://kubeweekly.com/ http://nshani.blogspot.co.uk/2016/02/getting-started-with-kubernetes.html https://www.youtube.com/watch?v=21hXNReWsUU http://cloud9.nebula.fi/app.html
  • 16. Kubernetes concepts http://www.slideshare.net/arungupta1/package-your-java-ee-application-using-docker-and-kubernetes linkedin.com/pulse http://www.slideshare.net/jawnsy/kubernetes-my-bff Inference can be very resource intensive. Our server executes the following TensorFlow graph to process every classification request it receives. The Inception-v3 model has over 27 million parameters and runs 5.7 billion floating point operations per inference. Fortunately, this is where Kubernetes can help us. Kubernetes distributes inference request processing across a cluster using its External Load Balancer. Each pod in the cluster contains a TensorFlow Serving Docker image with the TensorFlow Serving-based gRPC server and a trained Inception-v3 model. The model is represented as a set of files describing the shape of the TensorFlow graph, model weights, assets, and so on. Since everything is neatly packaged together, we can dynamically scale the number of replicated pods using the Kubernetes Replication Controller to keep up with the service demands. blog.kubernetes.io/2016/03
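For orientation, the sketch below shows roughly what a client call into such a setup can look like. It is an illustration under stated assumptions (the Service name "inception-service:9000", model name "inception", input tensor "images" and output "classes" are all placeholders), not the code from the blog post.

```python
# Hedged sketch of a gRPC client talking to a TensorFlow Serving pod that sits
# behind a Kubernetes Service; all names and ports below are assumptions.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("inception-service:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "inception"
with open("cat.jpg", "rb") as f:
    # the Inception serving example style: send encoded image bytes as a string tensor
    request.inputs["images"].CopyFrom(tf.make_tensor_proto(f.read(), shape=[1]))

response = stub.Predict(request, 10.0)   # 10 s deadline; any replica may answer
print(response.outputs["classes"])
```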
  • 17. Alternatives medium.com/@mustwin Bare Metal: Most schedulers, with the notable exception of Cloud Foundry, can be installed on “bare metal” or physical machines inside your datacenter. This can save you big on hypervisor licensing fees. Volume Mounts: Volume mounts allow you to persist data across container deployments. This is a key differentiator depending on your applications’ needs. Mesos is the leader here, and Kubernetes is slowly catching up. https://news.ycombinator.com/item?id=10438273 https://www.oreilly.com/ideas/swarm-v-fleet-v-kubernetes-v-mesos Conclusion: There are clearly a lot of choices for orchestrating, clustering, and managing containers. That being said, the choices are generally well differentiated. In terms of orchestration, we can say the following: Swarm has the advantage (and disadvantage) of using the standard Docker interface. Whilst this makes it very simple to use Swarm and to integrate it into existing workflows, it may also make it more difficult to support the more complex scheduling that may be defined in custom interfaces. Fleet is a low-level and fairly simple orchestration layer that can be used as a base for running higher level orchestration tools, such as Kubernetes or custom systems. Kubernetes is an opinionated orchestration tool that comes with service discovery and replication baked in. It may require some re-designing of existing applications, but used correctly will result in a fault-tolerant and scalable system. Mesos is a low-level, battle-hardened scheduler that supports several frameworks for container orchestration, including Marathon, Kubernetes, and Swarm. At the time of writing, Kubernetes and Mesos are more developed and stable than Swarm. In terms of scale, only Mesos has been proven to support large-scale systems of hundreds or thousands of nodes. However, when looking at small clusters of, say, less than a dozen nodes, Mesos may be an overly complex solution.
  • 18. Kubernetes Still on top? https://news.ycombinator.com/item?id=12462261 After all, Kubernetes is a mere two years old (as a public open source project), whereas Apache Mesos has clocked seven years in market. Docker Swarm is younger than Kubernetes, and it comes with the backing of the center of the container universe, Docker Inc. Yet the orchestration rivals pale in comparison to Kubernetes' community, which -- now under management by the Cloud Native Computing Foundation -- is exceptionally large and diverse. • Kubernetes is one of the top projects on GitHub: in the top 0.01 percent in stars and No. 1 in terms of activity. • While documentation is subpar, Kubernetes has a significant Slack and Stack Overflow community that steps in to answer questions and foster collaboration, with growth that dwarfs that of its rivals. • More professionals list Kubernetes in their LinkedIn profile than any other comparable offering by a wide margin. • Perhaps most glaring, data from OpenHub shows Apache Mesos dwindling since its initial release and Docker Swarm starting to slow. In terms of raw community contributions, Kubernetes is exploding, with 1,000-plus contributors and 34,000 commits -- more than four times those of nearest rival Mesos. http://www.infoworld.com/article/3118345/cloud-computing/why-kubernetes-is-winning-the-container-war.html https://github.com/kubernetes/kubernetes I would argue that general-purpose clusters like those managed by Google's Kubernetes are better for hosting Internet businesses that depend on artificial intelligence technologies than special-purpose clusters like NVIDIA DGX-1. Consider the case where an experimental model training job is using all 100 GPUs in the cluster. A production job gets started and asks for 50 GPUs. If we used MPI, we would have to kill the experiment job to release enough resources to run the production job. This tends to give the owner of the experiment job the impression that he is doing "second-class" work. Kubernetes is smarter than MPI, as it can kill, or preempt, only 50 workers of the experiment job, allowing both jobs to run at the same time. With Kubernetes, people have to build their programs into Docker images that run as Docker containers. Each container has its own filesystem and network port space. When a job runs as a container, it removes only files in its own directory. This is to some extent like defining C++ classes in namespaces, which helps us remove class name conflicts. An example: a typical Kubernetes cluster running an automatic speech recognition (ASR) business might be running the following jobs: 1) The speech service, with as many instances as needed to serve many simultaneous user requests. 2) The Kafka system, where each channel collects a certain log stream of the speech service. 3) Kafka channels are followed by Storm jobs for online data processing. For example, a Storm job joins the utterance log stream and the transcription stream. 4) The joined result, namely the session log stream, is fed to an ASR model trainer that updates the model. 5) This trainer notifies the ASR server when it writes updated models into Ceph. 6) Researchers might change the training algorithm and run some experimental training jobs, which serve testing ASR service jobs.
  • 19. The famous 'classical big data' on Spark Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics. http://dx.doi.org/10.1007/s41060-016-0027-9 In addition to the research highlights we presented in the previous sections, there are other research works which have been done using Apache Spark as a core engine for solving data problems in machine learning and data mining [5,36], graph processing [16], genomic analysis [60,65], time series data [71], smart grid data [73], spatial data processing [87], scientific computations of satellite data [67], large-scale biological sequence alignment [97] and data discretization [68]. There are also some recent works on using Apache Spark for deep learning [46,64]. CaffeOnSpark is an open source project [60] from Yahoo [61] for distributed deep learning on big data with Apache Spark.
  • 20. “BIG Data” Frameworks: Apache Spark, for example
  • 21. Tensorflow + Apache Spark https://www.youtube.com/watch?v=PFK6gsnlV5E https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/ You might be wondering: what’s Apache Spark’s use here when most high-performance deep learning implementations are single-node only? To answer this question, we walk through two use cases and explain how you can use Spark and a cluster of machines to improve deep learning pipelines with TensorFlow: Hyperparameter Tuning: use Spark to find the best set of hyperparameters for neural network training, leading to 10X reduction in training time and 34% lower error rate. Deploying models at scale: use Spark to apply a trained neural network model on a large amount of data. How does using Spark improve the accuracy? The accuracy with the default set of hyperparameters is 99.2%. Our best result with hyperparameter tuning has a 99.47% accuracy on the test set, which is a 34% reduction of the test error. Distributing the computations scaled linearly with the number of nodes added to the cluster: using a 13-node cluster, we were able to train 13 models in parallel, which translates into a 7x speedup compared to training the models one at a time on one machine. The goal of this workshop is to build an end-to-end, streaming data analytics and recommendations pipeline on your local machine using Docker and the latest streaming analytics tools. First, we create a data pipeline to interactively analyze, approximate, and visualize streaming data using modern tools such as Apache Spark, Kafka, Zeppelin, iPython, and ElasticSearch. http://advancedspark.com/
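The hyperparameter-tuning use case boils down to mapping a single-node training function over a grid of parameter combinations so that each Spark worker trains one model. A minimal, hedged sketch of that pattern follows; train_mnist() is a placeholder for your own single-machine TensorFlow/Keras training routine, not the code from the talk.

```python
# Distribute a hyperparameter grid search with plain Spark: one model per task.
import itertools
from pyspark import SparkContext

def train_mnist(learning_rate, dropout):
    # placeholder: build, train and evaluate a small model on one worker
    return {"lr": learning_rate, "dropout": dropout, "accuracy": 0.0}

sc = SparkContext(appName="hyperparameter-search")
grid = list(itertools.product([0.001, 0.01, 0.1], [0.3, 0.5]))

results = (sc.parallelize(grid, len(grid))     # one partition per combination
             .map(lambda params: train_mnist(*params))
             .collect())

print(max(results, key=lambda r: r["accuracy"]))
```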
  • 22. Dask as an alternative to apache spark #1 https://youtu.be/1kkFZ4P-XHg continuum.io/blog/developer-blog/high-performance-hadoop-anaconda-and-dask-your-cluster Matthew Rocklin's Blog: dask, the original project; dask.distributed, the distributed memory scheduler powering the cluster computing; dask.bag, the user API we’ve used in this post. Amazon EC2 with Dask configured with Jupyter Notebooks and Anaconda. https://github.com/dask/dask-ec2
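As a rough illustration of the dask.bag + dask.distributed combination mentioned above, see the sketch below; the scheduler address and the "logs/*.json" path are assumptions, and the same code runs on a laptop if you drop the Client address.

```python
# Minimal dask.bag sketch running against a dask.distributed scheduler.
import json
import dask.bag as db
from dask.distributed import Client

client = Client("127.0.0.1:8786")          # connect to an already-running scheduler

records = (db.read_text("logs/*.json")     # lazily read many JSON-lines files
             .map(json.loads)
             .filter(lambda r: r.get("status") == 200))

print(records.count().compute())           # execution happens on the cluster
```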
  • 23. Dask as an alternative to apache spark #2 http://dask.pydata.org/en/latest/spark.html Spark is mature and all-inclusive. If you want a single project that does everything and you’re already on Big Data hardware, then Spark is a safe bet, especially if your use cases are typical ETL + SQL and you’re already using Scala. Dask is lighter weight and is easier to integrate into existing code and hardware. If your problems vary beyond typical ETL + SQL and you want to add flexible parallelism to existing solutions, then dask may be a good fit, especially if you are already using Python and associated libraries like NumPy and Pandas. If you are looking to manage a terabyte or less of tabular CSV or JSON data, then you should forget both Spark and Dask and use Postgres or MongoDB. https://news.ycombinator.com/item?id=10062076 Dask seems to be aimed at parallelism of only certain operations (some parts of NumPy and Pandas) on larger-than-memory data on a single machine. Spark is a general purpose computing engine that can work across a cluster of machines and has many libraries optimized for distributed computing (machine learning, graph, etc.). The advantage of Dask seems to be that it is a drop-in replacement for NumPy and Pandas. Granted, given the prevalence of those two libraries, that isn't a small advantage. https://www.quora.com/Is-https-github-com-blaze-dask-an-alternative-to-Spark GPU Computing with Apache Spark and Python by Continuum Analytics, slideshare.net TensorFlow Basics: Weights Persistence (save and restore a model); Fine-Tuning (fine-tune a pre-trained model on a new task); Using HDF5 (use HDF5 to handle large datasets); Using DASK (use DASK to handle large datasets).
  • 24. Kubernetes + Dask Running on Kubernetes on Google Container Engine: this small repo gives an example Kubernetes configuration for running dask.distributed on Google Container Engine. Dask Cluster Deployments http://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project. All code in this post is experimental. It should not be relied upon. For people looking to deploy dask.distributed on a cluster, please refer instead to the documentation. Dask is deployed today on the following systems in the wild: • SGE • SLURM • Torque • Condor • LSF • Mesos • Marathon • Kubernetes • SSH and custom scripts … there may be more. This is what I know of first-hand. These systems provide users access to cluster resources and ensure that many distributed services / users play nicely together. They’re essential for any modern cluster deployment. For example, both Olivier Grisel (INRIA, scikit-learn) and Tim O’Donnell (Mount Sinai, Hammer lab) publish instructions on how to deploy Dask.distributed on Kubernetes. • Olivier’s repository • Tim’s repository SciPy Tutorial Setup On Kubernetes, written by Benjamin Zaitlen on 2016-09-30 http://quasiben.github.io/blog/2016/9/30/scipy-setup/ Our goal was to give students access to a preconfigured cluster with zero entry requirements: push a button, get a cluster with all tools installed. To accomplish this we need a handful of docker images: • Web application: button and info • Jupyter notebook • proxy app (more on this later) • cluster technologies: Spark, Dask, IPython Parallel And a handful of Kubernetes concepts: • Pods: collection of containers (similar to docker-compose) • namespaces: named and isolated clusters • replication controller: a scalable Pod.
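Once a dask.distributed scheduler is running as a Kubernetes Service, using it from Python is just a matter of pointing a Client at the Service's DNS name. A hedged sketch follows; the name "dask-scheduler:8786" and the preprocess function are assumptions, not taken from the repositories above.

```python
# Submit work to a dask.distributed scheduler exposed as a Kubernetes Service.
from dask.distributed import Client

client = Client("dask-scheduler:8786")      # Service DNS name inside the cluster

def preprocess(path):
    return len(path)                        # placeholder for real feature extraction

futures = client.map(preprocess, ["img_%03d.png" % i for i in range(1000)])
total = client.submit(sum, futures)         # reduce on the cluster; futures are resolved
print(total.result())
```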
  • 26. That is code What about data then? Using the different software above, an application can be deployed, scaled easily and accessed from the outside world in few seconds. But, what about the data? Structured content would probably be stored in a distributed database, like MongoDB, for example Unstructured content is traditionally stored in either a local file system, a NAS share or in Object Storage. A local file system doesn’t work as a container can be deployed on any node in the cluster. On the other side, Object Storage can be used by any application from any container, is highly available due to the use of load balancers, doesn’t require any provisioning and accelerate the development cycle of the applications. Why ? Because a developer doesn’t have to think about the way data should be stored, to manage a directory structure, and so on. The Amazon S3 endpoint used to upload and download pictures is displayed on the bottom left corner and shows that ViPR is used to store the data. The fact that the picture is uploaded directly to the Object Storage platform means that the web application is not in the data path. This allows the application to scale without deploying hundreds of instances. This web application can also be used to display all the pictures stored in the corresponding Amazon S3 bucket. The url displayed below each picture shows that the picture is downloaded directly from the Object Storage platform, which again means that the web application is not in the data path. This is another reason why Object Storage is the de facto standard for web scale applications. recorditblog.com http://www.slideshare.net/kubecon/kubecon-eu-2016-kubernetes-storage-101 Persistent Volumes Walkthrough The purpose of this guide is to help you become familiar with Kubernetes Persistent Volumes. By the end of the guide, we’ll have nginx serving content from your persistent volume. You can view all the files for this example in the docs repo here. This guide assumes knowledge of Kubernetes fundamentals and that you have a cluster up and running. See Persistent Storage design document for more information. http://kubernetes.io/docs/user-guide/persistent-volumes/walkthrough/
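As a concrete illustration of "the web application is not in the data path", the sketch below uploads straight to an S3-compatible endpoint and hands the browser a pre-signed URL for downloads. The endpoint, bucket and key names are placeholders.

```python
# Upload to an S3-compatible object store and generate a pre-signed download URL,
# so neither upload nor download traffic passes through the web application.
import boto3

s3 = boto3.client("s3",
                  endpoint_url="https://object-store.example.com",   # omit for AWS S3
                  aws_access_key_id="ACCESS_KEY",
                  aws_secret_access_key="SECRET_KEY")

s3.upload_file("cat.jpg", "pictures", "uploads/cat.jpg")

url = s3.generate_presigned_url("get_object",
                                Params={"Bucket": "pictures", "Key": "uploads/cat.jpg"},
                                ExpiresIn=3600)
print(url)    # short-lived link served directly by the object storage platform
```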
  • 27. Data Lakes vs data warehouses #1 “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” The table below helps flesh out this definition. It also highlights a few of the key differences between a data warehouse and a data lake. This is, by no means, an exhaustive list, but it does get us past this “been there, done that” mentality: Data. A data warehouse only stores data that has been modeled/structured, while a data lake is no respecter of data. It stores it all—structured, semi-structured, and unstructured. [See my big data is not new graphic. The data warehouse can only store the orange data, while the data lake can store all the orange and blue data.] Processing. Before we can load data into a data warehouse, we first need to give it some shape and structure—i.e., we need to model it. That’s called schema-on-write. With a data lake, you just load in the raw data, as-is, and then when you’re ready to use the data, that’s when you give it shape and structure. That’s called schema-on-read. Two very different approaches. Storage. One of the primary features of big data technologies like Hadoop is that the cost of storing data is relatively low as compared to the data warehouse. There are two key reasons for this: First, Hadoop is open source software, so the licensing and community support is free. And second, Hadoop is designed to be installed on low-cost commodity hardware. Agility. A data warehouse is a highly-structured repository, by definition. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied to it. A data lake, on the other hand, lacks the structure of a data warehouse—which gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and apps on-the-fly. Security. Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of a data lake) are relatively new. Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the big data industry. It’s not a question of if, but when. Users. For a long time, the rally cry has been BI and analytics for everyone! We’ve built the data warehouse and invited “everyone” to come, but have they come? On average, 20-25% of them have. Is it the same cry for the data lake? Will we build the data lake and invite everyone to come? Not if you’re smart. Trust me, a data lake, at this point in its maturity, is best suited for the data scientists. www.kdnuggets.com/2015/09 http://www.smartdatacollective.com/all/13556
  • 28. Data Lakes Medical Examples Setting Up the Data Lake http://www.slideshare.net/CasertaConcepts/setting-up-the-data-lake-55319460 searchhealthit.techtarget.com Unlike most relational databases' linear representation and analysis of data, Franz's semantic graph database technology employs an approach with which users can graphically see data elements and their relationships. Montefiore also recently started another program using the data lake to do cardio-genetic predictive analytics to determine the degrees of possibility of patients having sudden cardiac death based on their genetic background.
  • 29. Deploying Deep learning models Small-scale inference
  • 30. Data science pipeline development and deployment https://www.continuum.io/blog/developer-blog/productionizing-and-deploying-data-science-projects Note! Academia has been notoriously slow to adopt new technologies in academic research, and you should not be afraid of using “business analyst” tooling in academia. Dockerizing analysis research pipelines for end-to-end reproducibility would be an excellent way of reducing “sloppy” science. Peer reviewers (or even the journals automatically) could potentially feed “standardized” datasets via REST APIs at various points of the pipeline to ensure that the methodologies used are “good science”.
  • 31. Minimal r&D ready for production Deep Learning with Keras on Google Compute Engine by Cole Murray, Software Engineer (Mobile & Full-Stack Engineering, Machine Learning). https://medium.com/google-cloud/keras-inception-v3-on-google-compute-engine-a54918b0058 “Keras Inception V3 image classification model: prediction with deployment on Compute Engine w/ Docker & Google Container Registry (like Docker Hub), using a Flask front-end webserver.” When you have developed and trained your model, the prediction part in development becomes rather easy. The Flask web server allows fast creation of a front-end which you can use, for example, for a quick demo of your minimum viable product (MVP), a progress report for your boss, or an interactive front-end for showing scientific results to your PI (principal investigator). Keras, Tensorflow, Docker, Flask, Gunicorn, Nginx, Supervisor tech stack in deployment. PREDICT (Inference), e.g. “Is the image of a cat or a dog?” Front-End, e.g. “How to upload a photo from your phone, and predict what it contains?” Note! If you are using the TensorFlow (TF) backend, you can use a TF computation graph (i.e. a trained model) trained “natively” on TensorFlow with the Keras deployment; for example, when your data scientist(s) keep re-training the model, you can simply replace the computation graph (.ckpt, checkpoint) with the new weights.
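A minimal, hedged sketch of such a Flask prediction endpoint is shown below; the route name and form field are assumptions, not the article's exact code.

```python
# Tiny Flask front-end that classifies an uploaded photo with Keras Inception V3.
import numpy as np
from flask import Flask, request, jsonify
from keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions
from keras.preprocessing import image

app = Flask(__name__)
model = InceptionV3(weights="imagenet")   # load once at startup, not per request

@app.route("/predict", methods=["POST"])
def predict():
    f = request.files["file"]             # the uploaded photo
    f.save("/tmp/upload.jpg")
    img = image.load_img("/tmp/upload.jpg", target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = decode_predictions(model.predict(x), top=3)[0]
    return jsonify(predictions=[{"label": label, "score": float(score)}
                                for (_, label, score) in preds])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)    # in production, run behind Gunicorn/Nginx
```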
  • 32. TensorFlow Serving Tensorflow for production How to deploy Machine Learning models with TensorFlow. Part 1— make your model ready for serving. https://medium.com/towards-data-science/how-to-deploy-machine-learning-models-with-tensorflow-part-1-make-your-model-ready-for-serving-776a14ec3198 By Vitaly Bezgachev
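A minimal sketch of the "make your model ready for serving" step for a TF 1.x graph: export it in the SavedModel format that TensorFlow Serving loads. The tensor names, the toy inference op and the export path are assumptions for illustration.

```python
# Export a (toy) TF 1.x graph as a SavedModel under a numeric version directory;
# TensorFlow Serving watches the base path and serves the highest version it finds.
import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 299, 299, 3], name="images")
weights = tf.Variable(tf.ones([1]), name="toy_weight")
scores = tf.reduce_mean(images * weights, axis=[1, 2, 3], name="scores")  # stand-in op

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.saved_model.simple_save(sess,
                               "/models/inception/1",
                               inputs={"images": images},
                               outputs={"scores": scores})
```

The exported directory can then be pointed to by the TensorFlow Serving model server, for example via its --model_base_path and --model_name flags.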
  • 33. Streaming Machine learning Google How to use Google Cloud Dataflow with TensorFlow for batch predictive analysis Justin Kestelyn, Google Cloud Platform, April 18, 2017 https://cloud.google.com/blog/big-data/2017/04/how-to-use-google-cloud-dataflow-with-tensorflow-for-batch-predictive-analysis This article shows you how to use Cloud Dataflow to run batch processing for machine learning predictions. The article uses the machine learning model trained with TensorFlow. The trained model is exported into a Google Cloud Storage bucket before batch processing starts. The model is dynamically restored on the worker nodes of prediction jobs. This approach enables you to make predictions against a large dataset, stored in a Cloud Storage bucket or Google BigQuery tables, in a scalable manner, because Cloud Dataflow automatically distributes the prediction tasks to multiple worker nodes. Cloud Dataflow is a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL (Extract, Transform, Load), batch computation, and continuous computation. Cloud Dataflow frees you from operational tasks such as resource management and performance optimization. Using Cloud Storage as a data source. Using BigQuery as a data source.
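The shape of such a batch-prediction pipeline, sketched with the Apache Beam Python SDK that Cloud Dataflow executes: bucket paths, the project name and predict_fn are placeholders, and the article itself restores the TensorFlow model inside the prediction step on each worker.

```python
# Hedged sketch of a batch-prediction pipeline for Cloud Dataflow (Apache Beam).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def predict_fn(line):
    # placeholder: parse the record, run model.predict(...), return the result
    return line

options = PipelineOptions(runner="DataflowRunner",
                          project="my-gcp-project",
                          temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (p
     | "Read"    >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | "Predict" >> beam.Map(predict_fn)
     | "Write"   >> beam.io.WriteToText("gs://my-bucket/output/predictions"))
```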
  • 34. Automated Kubernetes GPU Deployment #1 with tensorflow How to automate deep learning training with a Kubernetes GPU cluster https://github.com/Langhalsdino/Kubernetes-GPU-Guide By Frederic Tausch Why did I write this guide? I have worked as an intern for the startup understand.ai and noticed the hassle of first designing a machine learning algorithm locally and then bringing it to the cloud for training with different parameters and datasets. The second part, bringing it to the cloud for extensive training, always takes longer than expected, is frustrating and usually involves a lot of pitfalls. For this reason I decided to work on this problem and make the second part effortless, simple and quick. The result of this work is this handy guide, that describes how everyone
  • 35. Automated Kubernetes GPU Deployment #2 with tensorflow GPUs & Kubernetes for Deep Learning By Samuel Cozannet Deploy Kubernetes with GPUs ● Deploy k8s on AWS in a development mode (no HA, colocating etcd, the control plane and PKI) ● Deploy 2x nodes with GPUs (p2.xlarge and p2.8xlarge instances) ● Deploy 3x nodes with CPU only (m4.xlarge) ● Validate GPU availability Add EFS storage to the cluster ● Programmatically add an EFS File System and Mount Points into your nodes ● Verify that it works by adding a PV / PVC in k8s. Automating Tensorflow deployment ● Data ingest code & package ● Training code & package ● Evaluation code & package ● Serving code & package ● Deployment process So what is a Deep Learning pipeline exactly? Well, my definition is a 4 step pipeline, with a potential retro-action loop, that consists of : Data Ingest: This step is the accumulation and pre processing of the data that will be used to train our model(s). This step is maybe the less fun, but it is one of the most important. Your model will be as good as the data you train it on Training: this is the part where you play God and create the Intelligence. From your data, you will create a model that you hope will be representative of the reality and capable of describing it (and even why not generate it) Evaluation + Monitoring: Unless you can prove your model is good, it is worth just about nothing. The evaluation phase aims at measuring the distance between your model and reality. This can then be fed back to a human being to adjust parameters. In advanced setups, this can essentially be your test in CI/CD of the model, and help auto tune the model even without human intervention Serving: there is a good chance you will want to update your model from time to time. If you want to recognize threats on the network for example, it is clear that you will have to learn new malware signatures and patterns of behavior, or you can close your business immediately. Serving is this phase where you expose the model for consumption, and make sure your customers always enjoy the latest model.
  • 36. Jupyter notebooks vs IDE development What is the practical difference?
  • 37. Jupyter notebooks What are they all about If you are a beginner in data science (and with Python) you might have noticed that there are a lot of tutorials implemented as Jupyter notebooks (formerly referred to as IPython notebooks). The notebooks allow easy embedding of formatted text and images alongside executable code, which makes them very useful to provide with scientific papers and walkthrough example code. Idea of a notebook: Both in academia (e.g. a non-tech-savvy old-school PI vs. a new generation data wizard) and in industry, one may want to communicate concisely the rationale for the computation together with the results. Compare this to “Excel data science”, where a person manually manipulates the data, destroying the end-to-end processing pipeline and making reproducible research impossible. In a Jupyter notebook, one could include both the data preparation and data analysis parts, and the grad student could for example do the dirty work and ask for insight from more senior researchers. In theory, the collaborators would quickly see what the notebook is about, allowing quick playing around that would be easily usable by the poor grad student as well, eliminating the Excel → Matlab/R/Python → Excel circus probably experienced in some research labs. The downside is that notebooks are not ‘plug’n’play’ executables: they can depend on a lot of standard libraries, and custom libraries written by the grad student are even worse for a non-tech-savvy PI.
  • 38. Jupyter notebook Features Sep 10, 2016 • Alex Rogozhnikov http://arogozhnikov.github.io/2016/09/10/jupyter-features.html
  • 39. Jupyter notebooks Examples of capabilities A gallery of interesting Jupyter Notebooks David Haberthür Colour science computations with colour, a Python package implementing a comprehensive number of colour theory transformations and algorithms supported by a dedicated collection of IPython Notebooks. More colour science related IPython Notebooks are available on colour-science.org. Data-driven journalism The Need for Openness in Data Journalism, by Brian Keegan. St. Louis County Segregation Analysis , analysis for the article The Ferguson Area Is Even More Segregated Than You Probably Guessed by Jeremy Singer-Vine. Reproducible academic publications This section contains academic papers that have been published in the peer-reviewed literature or pre-print sites such as the ArXiv that include one or more notebooks that enable (even if only partially) readers to reproduce the results of the publication. 4) The Paper of the Future by Alyssa Goodman et al. (Authorea Preprint, 2017). This article explains and shows with demonstrations how scholarly "papers" can morph into long-lasting rich records of scientific discourse, enriched with deep data and code linkages, interactive figures, audio, video, and commenting. It includes an interactive d3.js visualization and has an astronomical data figure with an IPYthon Notebook "behind" it. A 5-minute video demonstration of this paper is available at this YouTube link. “Traditional” notebook example 5 Better “Science Storytelling” As we stated at the outset, communicating results by way of what cognitive scientists refer to as "storytelling" has the deepest, most long-lasting, impact on a reader, viewer, or listener. Until recently, journal articles only contained words, numbers, and pictures, but today we can enhance journal articles' storytelling potential with audio, video, and enhanced figures that offer interactivity and context. We consider each of these opportunities in turn, below. Screen shot of the first 3D PDF published in Nature in 2009 (Goodman 2009). A video demonstration of how users can interact with the figure is on YouTube, here. Open the PDF here in Adobe Acrobat to interact.
  • 40. Jupyter notebooks how do they compare to IDEs “Real-life” development is typically a lot easier with a proper IDE (Integrated Development Environment). https://www.slant.co/versus/1240/15716/~pycharm_vs_jupyter You can also use Jupyter notebooks with PyCharm https://www.jetbrains.com/help/pycharm/using-ipython-jupyter-notebook-with-pycharm.html Compare to the Rstudio IDE and its notebooks
  • 41. IDE PyCharm PyCharm is the most commonly used IDE, and could be a safe bet to start with. datacamp.com By Paulo Vasconcellos Top 5 Python IDEs For Data Science https://www.kaggle.com/general/4308 Jun 28, 2017 http://noeticforce.com/best-python-ide-for-programmers-windows-and-mac The Rodeo IDE has the feel of RStudio if you are coming from R https://www.r-bloggers.com/rstudio-clone-for-python-rodeo/ By Erik Marsja http://www.marsja.se/rstudio-like-python-ides-rodeo-spyder/
  • 42. Deploying Deep learning models Large-scale inference (+Training) with streaming
  • 43. Streaming Machine learning ”Heavier” tech stacks End to End Streaming ML Recommendation Pipeline: Spark 2.0, Kafka, TensorFlow. Workshop by Chris Fregly on 10/12/2016. https://www.youtube.com/watch?v=UmCB9ycz55Q | https://github.com/fluxcapacitor/pipeline/wiki “STREAMING” i.e. a continuous stream of data, for example from credit card transactions, Uber cars, or Internet of Things devices such as electricity meters or next-generation vital-sign monitoring at hospitals.
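A minimal, hedged sketch of that pattern with Spark Structured Streaming reading from Kafka and scoring each record with a stand-in UDF; the topic and broker names are placeholders, and the Kafka source additionally requires the spark-sql-kafka package on the classpath.

```python
# Score a Kafka stream of transactions with Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("streaming-scoring").getOrCreate()

stream = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "transactions")
               .load()
               .selectExpr("CAST(value AS STRING) AS value"))

score = udf(lambda v: float(len(v)), DoubleType())   # placeholder for a real model
scored = stream.withColumn("score", score("value"))

query = scored.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```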
  • 44. Streaming Machine learning Confluent #1 R, Spark, Tensorflow, H20.ai Applied to Streaming Analytics Kai Wähner, Technology Evangelist at Confluent, Published on Mar 24, 2017 https://www.slideshare.net/KaiWaehner/r-spark-tensorflow-spark-applied-to-streaming-analytics Big Data vs. Fast Data SQL Server, Oracle, MySQL, Teradata, Netezza, JDBC/ODBC, Hadoop, SFDC, PostgreSQL, RapidMiner, TIBCO Spotfire
  • 45. Streaming Machine learning Confluent #2a Machine Learning and Deep Learning Applied to Real Time with Apache Kafka Streams Kai Wähner, Technology Evangelist at Confluent, Published on May 23, 2017 https://www.slideshare.net/KaiWaehner/apache-kafka-streams-machine-learning-deep-learning https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf Kafka Cluster: deployed anywhere (Docker, Kubernetes, Mesos, Java App, ...)
  • 46. Streaming Machine learning Confluent #2b Continuously train and improve the model with every new event How to improve models? 1) Manual Update 2) Automated Batch 3) Real Time
  • 47. Streaming Machine learning Databricks #1 https://www.slideshare.net/databricks/integrating-deep-learning-libraries-with-apache-spark https://github.com/databricks/spark-deep-learning One way to productionize a model is to deploy it as a Spark SQL User Defined Function, which allows anyone who knows SQL to use it. Deep Learning Pipelines provides mechanisms to take a deep learning model and register a Spark SQL User Defined Function (UDF). The resulting UDF takes a column (formatted as an image struct "SpImage") and produces the output of the given Keras model (e.g. for Inception V3, it produces a real valued score vector over the ImageNet object categories).
  • 48. Streaming Machine learning Databricks #2 Getting the best results in deep learning requires experimenting with different values for training parameters, an important step called hyperparameter tuning. Since Deep Learning Pipelines enables exposing deep learning training as a step in Spark’s machine learning pipelines, users can rely on the hyperparameter tuning infrastructure already built into Spark. A Vision for Making Deep Learning Simple: From Machine Learning Practitioners to Business Analysts, by Sue Ann Hong, Tim Hunter and Reynold Xin. Posted in ENGINEERING BLOG, June 6, 2017. Deploying Models in SQL: Once a data scientist builds the desired model, Deep Learning Pipelines makes it simple to expose it as a function in SQL, so anyone in their organization can use it – data engineers, data scientists, business analysts, anybody. sparkdl.registerKerasUDF("img_classify", "/mymodels/dogmodel.h5") Next, any user in the organization can apply prediction in SQL: SELECT image, img_classify(image) label FROM images WHERE contains(label, "Chihuahua") Similar functionality is also available in the DataFrame programmatic API across all supported languages (Python, Scala, Java, R). Similar to scalable prediction, this feature works in both batch and structured streaming. Conclusion: In this blog post, we introduced Deep Learning Pipelines, a new library that makes deep learning drastically easier to use and scale. While this is just the beginning, we believe Deep Learning Pipelines has the potential to accomplish what Spark did to big data: make the deep learning “superpower” approachable for everybody. Future posts in the series will cover the various tools in the library in more detail: image manipulation at scale, transfer learning, prediction at scale, and making deep learning available in SQL. To learn more about the library, check out the Databricks notebook as well as the github repository. We encourage you to give us feedback. Or even better, be a contributor and help bring the power of scalable deep learning to everyone.
  • 49. Streaming Machine learning StreamAnalytix Impetus Technologies Unveils New, TensorFlow-Based Deep Learning Feature on Apache Spark for StreamAnalytix LOS GATOS, Calif., June 15, 2017 /PRNewswire/ https://streamanalytix.com/industryBuzz/PR_15_06_2017 https://streamanalytix.com/architecture The combination of streaming analytics and deep learning enables a new breed of applications and machine capabilities in industrial IoT, voice analytics and anomaly detection.
  • 50. Streaming Machine learning linagora Making Image Classification Simple With Spark Deep Learning Zied Sellami, Jun 28 2017 https://medium.com/linagora-engineering/making-image-classification-simple-with-spark-deep-learning-f654a8b876b8 Apache Spark is an open-source cluster-computing framework. Apache Spark is a fast, in-memory data processing engine with expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. Conclusion: Apache Spark is a very powerful platform with elegant and expressive APIs to allow Big Data processing. We tried with success Spark Deep Learning, an API that combines Apache Spark and Tensorflow to train and deploy an image classifier. It is extremely easy (less than 30 lines of code). Our next objective is to test if we can deploy a facial recognition model with this API. While this support is only available in Python, we hope that integration will be done very soon in other programming languages, especially Scala. Run pyspark with the spark-deep-learning library: the spark-deep-learning library comes from Databricks and leverages Spark for its two strongest facets: In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep learning in very few lines of code. It uses Spark’s powerful distributed engine to scale out deep learning on massive datasets. The library runs on pyspark. Let’s start pyspark: export SPARK_HOME=PATH/TO/spark-2.1.1-bin-hadoop2.7 export set JAVA_OPTS="-Xmx9G -XX:MaxPermSize=2G -XX:+UseCompressedOops -XX:MaxMetaspaceSize=512m" $SPARK_HOME/bin/pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 --driver-memory 5g Below is a console output example. Deep Learning Pipelines on Apache Spark enables fast transfer learning with a Featurizer (that transforms images to numeric features). from pyspark.ml.classification import LogisticRegression from pyspark.ml import Pipeline from sparkdl import DeepImageFeaturizer featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3") lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label") p = Pipeline(stages=[featurizer, lr]) p_model = p.fit(train_df)
  • 52. Kubernetes in deep learning #1 https://openai.com/blog/infrastructure-for-deep-learning/ Deep learning is an empirical science, and the quality of a group's infrastructure is a multiplier on progress. Fortunately, today's open-source ecosystem makes it possible for anyone to build great deep learning infrastructure. In this post, we'll share how deep learning research usually proceeds, describe the infrastructure choices we've made to support it, and open-source kubernetes-ec2-autoscaler, a batch- optimized scaling manager for Kubernetes. We hope you find this post useful in building your own deep learning infrastructure. Once the model shows sufficient promise, you'll scale it up to larger datasets and more GPUs. This requires long jobs that consume many cycles and last for multiple days. You'll need careful experiment management, and to be extremely thoughtful about your chosen range of hyperparameters. Like much of the deep learning community, we use Python 2.7. We generally use Anaconda, which has convenient packaging for otherwise difficult packages such as OpenCV and performance optimizations for some scientific libraries. We also run our own physical servers, primarily running Titan X GPUs. We expect to have a hybrid cloud for the long haul: it's valuable to experiment with different GPUs, interconnects, and other techniques which may become important for the future of deep learning. Scalable infrastructure often ends up making the simple cases harder. We put equal effort into our infrastructure for small- and large-scale jobs, and we're actively solidifying our toolkit for making distributed use-cases as accessible as local ones. Kubernetes requires each job to be a Docker container, which gives us dependency isolation and code snapshotting. However, building a new Docker container can add precious extra seconds to a researcher's iteration cycle, so we also provide tooling to transparently ship code from a researcher's laptop into a standard image. We expose Kubernetes's flannel network directly to researchers' laptops, allowing users seamless network access to their running jobs. This is especially useful for accessing monitoring services such as TensorBoard. (Our initial approach — which is cleaner from a strict isolation perspective — required people to create a Kubernetes Service for each port they wanted to expose, but we found that it added too much friction.) We're releasing kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. It runs as a normal Pod on Kubernetes and requires only that your worker nodes are in Auto Scaling groups. Our infrastructure aims to maximize the productivity of deep learning researchers, allowing them to focus on the science. We're building tools to further improve our infrastructure and workflow, and will share these in upcoming weeks and months. We welcome help to make this go even faster!
  • 53. Kubernetes in deep learning #2 https://news.ycombinator.com/item?id=12391505 May 9 Introducing FBLearner Flow: Facebook's AI backbone Jeffrey Dunn, https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
  • 54. Automated Kubernetes CPU Deployment Baidu Case Baidu's deep learning framework adopts Kubernetes Google's orchestration framework will provide smart resource allocation and cluster management to PaddlePaddle http://www.infoworld.com/article/3167608/artificial-intelligence/baidus-deep-learning-framework-adopts-kubernetes.html http://blog.kubernetes.io/2017/02/run-deep-learning-with-paddlepaddle-on-kubernetes.html Why Run PaddlePaddle on Kubernetes PaddlePaddle is designed to be slim and independent of computing infrastructure. Users can run it on top of Hadoop, Spark, Mesos, Kubernetes and others.. We have a strong interest with Kubernetes because of its flexibility, efficiency and rich features. A successful deep learning project includes both the research and the data processing pipeline. There are many parameters to be tuned. A lot of engineers work on the different parts of the project simultaneously. To ensure the project is easy to manage and utilize hardware resource efficiently, we want to run all parts of the project on the same infrastructure platform. The platform should provide: ● fault-tolerance. It should abstract each stage of the pipeline as a service, which consists of many processes that provide high throughput and robustness through redundancy. ● auto-scaling. In the daytime, there are usually many active users, the platform should scale out online services. While during nights, the platform should free some resources for deep learning experiments. ● job packing and isolation. It should be able to assign a PaddlePaddle trainer process requiring the GPU, a web backend service requiring large memory, and a CephFS process requiring disk IOs to the same node to fully utilize its hardware. What we want is a platform which runs the deep learning system, the Web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed queue service (e.g., Kafka), the log joiner and other data processors written using Storm, Spark, and Hadoop MapReduce on the same cluster. We want to run all jobs -- online and offline, production and experiments -- on the same cluster, so we could make full utilization of the cluster, as different kinds of jobs require different hardware resource.
  • 55. Docker customers Published on Aug 11, 2016 In this video Ajay Dankar, Senior Director Product Management at PayPal discusses why they selected Docker and Docker Trusted Registry to help them containerize their legacy apps to more efficiently utilize their infrastructure and secure workloads. -- Docker is an open platform for developers and system administrators to build, ship and run distributed applications. With Docker, IT organizations shrink application delivery from months to minutes, frictionlessly move workloads between data centers and the cloud and can achieve up to 20X greater efficiency in their use of computing resources. Inspired by an active community and by transparent, open source innovation, Docker containers have been downloaded more than 700 million times and Docker is used by millions of developers across thousands of the world’s most innovative organizations, including eBay, Baidu, the BBC, Goldman Sachs, Groupon, ING, Yelp, and Spotify. Docker’s rapid adoption has catalyzed an active ecosystem, resulting in more than 180,000 “Dockerized” applications, over 40 Docker-related startups and integration partnerships with AWS, Cloud Foundry, Google, IBM, Microsoft, OpenStack, Rackspace, Red Hat and VMware. https://www.youtube.com/watch?v=wf4Jg-9gv9Q
  • 56. Business data into value 8 ways to turn data into value with Apache Spark machine learning OCTOBER 18, 2016 by Alex Liu Chief Data Scientist, Analytics Services, IBM http://www.ibmbigdatahub.com/blog/8-ways-turn-data-value-apache-spark-machine-learning 1. Obtain a holistic view of business In today's competitive world, many corporations work hard to gain a holistic view or a 360 degree view of customers, for many of the key benefits as outlined by data analytics expert Mr. Abhishek Joshi. In many cases, a holistic view was not obtained, partially due to the lack of capabilities to organize huge amount of data and then to analyze them. But Apache Spark’s ability to compute quickly while using data frames to organize huge amounts of data can help researchers quickly develop analytical models that provide a holistic view of the business, adding value to related business operations. To realize this value, however, an analytical process, from data cleaning to modeling, must still be completed. 4. Avoid customer churn by rethinking churn modeling Losing customers means losing revenue. Not surprisingly, then, companies strive to detect potential customer churn through predictive modeling, allowing them to implement interventions aimed at retaining customers. This might sound easy, but it can actually be very complicated: Customers leave for reasons that are as divergent as the customers themselves are, and products and services can play an important, but hidden, role in all this. What’s more, merely building models to predict churn for different customer segments—and with regard to different products and services—isn’t enough; we must also design interventions, then select the intervention judged most likely to prevent a particular customer from departing. Yet even doing this requires the use of analytics to evaluate the results achieved—and, eventually, to select interventions from an analytical standpoint. Amid this morass of choices, Apache Spark’s distributed computing capabilities can help solve previously baffling problems. 5. Develop meaningful purchase recommendations Recommendations for purchases of products and services can be very powerful when made appropriately, and they have become expected features of e-commerce platforms, with many customers relying on recommendations to guide their purchases. Yet developing recommendations at all means developing recommendations for each customer—or, at the very least, for small segments of customers. Apache Spark can make this possible by offering the distributed computing and streaming analytics capabilities that have become invaluable tools for this purpose. ebaytechblog: Spark is helping eBay create value from its data, and so the future is bright for Spark at eBay. In the meantime, we will continue to see adoption of Spark increase at eBay. This adoption will be driven by chats in the hall, newsletter blurbs, product announcements, industry chatter, and Spark’s own strengths and capabilities. http://dx.doi.org/10.1186/s40165-015-0014-6 https://thinkbiganalytics.com/big_data_solutions/data-science/
  • 57. Open data and Apache Spark -----Jump to Topic----- 00:00:06 - Workshop Intro & Environment Setup 00:13:06 - Brief Intro to Spark 00:17:32 - Analysis Overview: SF Fire Department Calls for Service 00:23:22 - Analysis with PySpark DataFrames API 00:29:32 - Doing Date/Time Analysis 00:47:53 - Memory, Caching and Writing to Parquet 01:00:40 - SQL Queries 01:21:11 - Convert a Spark DataFrame to a Pandas DataFrame -----Q&A----- 01:24:43 - Spark DataFrames vs. SQL: Pros and Cons? 01:26:57 - Workflow for Chaining Databricks notebooks into a Pipeline? 01:30:27 - Is Spark 2.0 ready to use in production? https://www.youtube.com/watch?v=iiJq8fvSMPg
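In the spirit of that workshop, here is a tiny hedged sketch of the DataFrame-plus-SQL workflow it walks through; the file path and column names are assumptions.

```python
# Load an open-data CSV into a Spark DataFrame, cache it, and query it two ways.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fire-calls").getOrCreate()

df = spark.read.csv("fire_department_calls.csv", header=True, inferSchema=True)
df.cache()                                         # keep it in memory for repeated queries

df.groupBy("CallType").count().orderBy("count", ascending=False).show(10)

df.createOrReplaceTempView("calls")
spark.sql("SELECT CallType, COUNT(*) AS n FROM calls "
          "GROUP BY CallType ORDER BY n DESC").show(10)
```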
  • 58. Internet of things (IoT) To prove that our IoT platform is really independent of the application environment, we took one IoT gateway (RaspberryPi 2) from the city project and put it into the Austin Convention Center during the OpenStack Summit, together with an IQRF-based mesh network connecting sensors that measure humidity, temperature and CO2 levels. This demonstrates that the IoT gateway can manage or collect data from any technology like IQRF, Bluetooth, GPIO, and any other communication standard supported on Linux-based platforms. We deployed 20 sensors and 20 routers on 3 conference floors with a single active IoT gateway receiving data from the entire IQRF mesh network and relaying it to a dedicated time-series database, in this case Graphite. The collector is an MQTT-Java bridge running inside a docker container managed by Kubernetes. The following screenshot shows real-time CO2 values from different rooms on 2 floors. The historical graph shows values from Monday. You can easily recognize when the main keynote session started and when the lunch period was.
  • 59. Healthcare https://www.youtube.com/watch?v=ePp54ofRqRs https://www.healthdirect.gov.au/ How open source container tech can impact healthcare At Red Hat, we believe that creating open source platforms allows the tech community to develop the best software possible. We recently launched a series of films highlighting the open source movement’s impact on healthcare, including initiatives that promote open patient data and provide 3D-printed prosthetics. Health is a great context to start exploring OpenShift’s open source capabilities. We designed OpenShift to allow developers to take full advantage of containers (Docker) and orchestration (Kubernetes), without having to learn the internals of how to build containers from scratch or understand sys admin enough to deploy production-quality apps that can scale on demand. OpenShift makes using containers and orchestration accessible by letting you focus on code instead of writing Dockerfiles and running Docker builds all day. With the integrated Source-to-Image open source project, the platform automatically creates containers while requiring only the URL for your source code repository. openshift.devpost.com Improving Container Security: Docker and More After 6 months and 15 successful beta deployments, Twistlock is announcing the general availability of our container security suite. Twistlock came out of stealth in May 2015. Since then, we have been working diligently with a select group of beta customers to validate the value of our offerings. This diverse group of 15 beta testers, including Wix,AppsFlyer, and HolidayCheck, spans financial services, hospitality, healthcare, Internet services, and government. These customers confirmed that we are hitting the sweet spot of their most pressing container security needs -- a majority of them already deployed our product into their production environments, protecting live services and customer data. The logical resource boundaries established in Docker containers are almost as secure as those established by the Linux operating system or by a virtual machine, according to a report by Gartner analyst Joerg Fritsch. However, Docker and Linux containers in general fall short when it comes to container management and administration, Fritsch said in his report, " Security properties of containers managed by Docker."
  • 61. Neuroscience & bioinformatics http://dx.doi.org/10.1016/j.conb.2015.04.002 Most large-scale analytics, whether in industry or neuroscience, involve common patterns. Raw data are massive in size. Often, they are processed so as to extract signals of interest, which are then used for statistical analysis, exploration, and visualization. But raw data can be analyzed or visualized directly (top arrow). And the results of each successive step informs how to perform the earlier ones (feedback loops). Icons below highlight some of the technologies, discussed in this essay, that are core to the modern large- scale analysis workflow. “Cloud deployment also makes it easier to build tools that run identically for all users, especially with virtual machine platforms like Docker. However, cloud deployment for neuroscience does require transferring data to cloud storage, which may become a bottleneck. Deploying on academic clusters requires at least some support from cluster administrators but keeps the data closer to the computation. … There is also rapidly growing interest in the ‘‘data analysis notebook’’. These notebooks – the Jupyter notebook being a particularly popular example – combine executable code blocks, notes, and graphics in an interactive document that runs in a web browser, and provides a seamless front-end to a computer, or a large cluster of computers if running against a framework like Spark. Notebooks are a particularly appealing way to disseminate information; a recent neuroimaging paper, for example, provided all of its analyses in a version-controlled repository hosted on GitHub with Jupyter notebooks that generate all the figures in the paper [45]—a clear model for the future of reproducible science https://www.docker.com/customers/docker-helps-varian-medical-systems-battle-cancer https://dx.doi.org/10.12688/f1000research.7536.1 http://dx.doi.org/10.1371/journal.pone.0152686 http://dx.doi.org/10.1186/s13742-015-0087-0 http://homolog.us/blogs/blog/2015/09/22/is-docker-for-suckers/
  • 62. Neuroscience streaming data with Spark http://dx.doi.org/10.1016/j.conb.2015.04.002 Streaming analysis of two-photon calcium imaging* * With these levels of analysis in mind, we probe the cortical circuits underlying active tactile decision making. Whisker-based haptic tasks for head-fixed mice developed in our lab provide outstanding stimulus control and facilitate applications of powerful biophysical methods, such as whole cell recordings and two-photon microscopy. In collaboration with Karel Svoboda and Nicholas Sofroniew. https://www.janelia.org/lab/svoboda-lab https://youtu.be/uUQTSPvD1mc?t=17m Real-time visualization during the experiment, rather than running the experiment with no good idea of what is going on until post-experiment offline analysis. Real-time feedback on the experimental parameters, or for the brain itself via optogenetic stimulation, for example.
  • 64. Reproducible SCIENCE with docker ANACONDA AND DOCKER BETTER TOGETHER FOR REPRODUCIBLE DATA SCIENCE Monday, June 20, 2016, continuum.io/blog Anaconda integrates with many different providers and platforms to give you access to the data science libraries you love on the services you use, including Amazon Web Services, Microsoft Azure, and Cloudera CDH. Today we’re excited to announce our new partnership with Docker. As part of the announcements at DockerCon this week, Anaconda images will be featured in the new Docker Store, including Anaconda and Miniconda images based on Python 2 and Python 3. These freely available Anaconda images for Docker are now verified, will be featured in the Docker Store when it launches, are being regularly scanned for security vulnerabilities and are available from the ContinuumIO organization on Docker Hub. Anaconda and Docker are a great combination to empower your development, testing and deployment workflows with Open Data Science tools, including Python and R. Our users often ask whether they should be using Anaconda or Docker for data science development and deployment workflows. We suggest using both - they’re better together!
  • 65. Reproducible SCIENCE Between Jupyter and Docker Jupyter/JupyterLab does not really come as 'plug'n'play' and you still have to have all the dependencies resolved. Build your own conda packages, and deploy: continuum.io/blog/developer-blog/whats-old-and-new-conda-build Anaconda Enterprise Notebooks continuum.io