2. Agenda
• Problems with current workflow
• Interactive exploration to enterprise API
• Data Science Platforms
• My recommendation
3. About me @geoHeil
• Data Scientist at T-Mobile Austria
• Business Informatics at Vienna University of Technology
• Built predictive startup (predictr.eu)
• Data science projects at university
7. ML modes: similarity of environments?
Exploration
• Flexibility
• Easy to use
• Reusability
Production
• Performance
• Scalability
• Monitoring
• API
Interaction between both modes is required to improve the business process.
10. Prototype problem at current project
Easy move to the JVM?
• Consultant: R
• Me: Python
• Production: JVM (native C dependencies)
11. Stackup
Problems
• Move to production means redevelopment from scratch
• Enterprise operations handle JVM only
Solutions
• Notebooks as API
• Redevelop from scratch
12. Prototype problem at current project
Easy move to the JVM?
• Consultant: R
• Me: Python
• Production: JVM (native C dependencies)
13. Data exchange possibilities (API)
• Pickle – Python only
• Hadoop file formats (Avro / Parquet)
• Thrift, Protobuf
• Message queue
• REST
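A minimal sketch of the first and last options, with made-up field names: pickle is compact but only Python can read it back, while JSON is what a JVM service on the other end of a REST call can consume.

```python
import json
import pickle

# Hypothetical model output we want to hand from Python to a JVM service.
scores = {"customer_id": 42, "churn_probability": 0.17}

# Pickle: convenient, but only Python can deserialize it.
blob = pickle.dumps(scores)

# JSON: readable from Java, Scala, R, ... -- suitable as a REST payload.
payload = json.dumps(scores)

print(payload)
```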
14. Stackup
Problems
• Move to production means redevelopment from scratch
• Enterprise operations handle JVM only
Solutions
• Notebooks as API
• Use analytics via an API
15. "Big data starts at 20 GB. I want to use a fancy Hadoop cluster." – "We can buy a server with 6 TB of RAM."
16. 3 types of big data
1. Fits in memory (6 TB of RAM …)
2. Raw data too large for memory, but aggregated data works well
3. Too big => the ML needs to be distributed as well
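Type 2 is the interesting middle case: aggregate the raw data in one streaming pass so only the small aggregate has to fit in memory. A toy sketch with a hypothetical event stream:

```python
from collections import Counter

# Type 2 big data: the raw stream is too large to hold at once,
# so we aggregate in a single streaming pass with O(1) state per key.
def aggregate(events):
    counts = Counter()
    for user, amount in events:   # one event at a time
        counts[user] += amount
    return counts                 # small aggregate: fits easily in memory

# Stand-in for a large event source (in reality: a file, Kafka, Spark, ...)
events = [("a", 10), ("b", 5), ("a", 3)]
agg = aggregate(iter(events))
print(dict(agg))  # {'a': 13, 'b': 5}
```

The aggregate can then be handed to Python/R notebooks for the actual modelling.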
17. Stackup
Problems
• Move to production means redevelopment from scratch
• Enterprise operations handle JVM only
• Inflexible big data tools
Solutions
• Notebooks as API
• Use analytics via an API
• Your data is not "really big" and still fits in memory
19. Stackup
Problems
• Move to production means redevelopment from scratch
• Enterprise operations handle JVM only
• Inflexible big data tools
• Security not taken care of
Solutions
• Notebooks as API
• Use analytics via an API
• Your data is not "really big" and still fits in memory -> keep using Python / R / notebooks
• Kerberized Hadoop cluster :(
21. Small data & R prototype
Separation of concerns.
22. Startup data science – predicting cash flows
• Custom backend (JVM)
• Data science via an API (OpenCPU / R)
• Partly in the backend (Renjin)
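OpenCPU exposes R functions over HTTP, so the JVM backend only needs to issue REST calls. A hedged sketch of how such a request could be built from Python (host, package, and function names are placeholders, not the actual project's):

```python
from urllib import parse, request

# Placeholder OpenCPU endpoint -- replace with your deployment.
OPENCPU = "http://opencpu.example.com/ocpu"

def opencpu_call(package, function, **params):
    # OpenCPU pattern: POST /ocpu/library/{package}/R/{function}/json
    url = f"{OPENCPU}/library/{package}/R/{function}/json"
    data = parse.urlencode(params).encode()
    return request.Request(url, data=data, method="POST")

# Hypothetical R package "cashflows" with a prediction function.
req = opencpu_call("cashflows", "predict_flows", horizon=12)
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return the R function's result as JSON.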
23. Other possibilities
• JNI (Java Native Interface) :(
• JNA (Java Native Access)
• Rkafka (we did not have a message queue in the infrastructure)
• Custom service (REST call) to a JNA-enabled server (too costly)
29. Project facts
• We were using an MS SQL backup (600 GB)
• Spark + Parquet compressed it to 3 GB
• No cluster during development of the project, only laptops + a 60 GB RAM server
• Most of the time was spent in garbage collection (15 seconds on a real cluster, 17 minutes on a laptop)
30. Data science stack
• Type 2 big data (aggregation allows for local in-memory processing in Python/R)
• Spark as a (REST) API (spark-jobserver style):
POST jobserver:port/jars/myjob (upload the job jar)
POST jobserver:port/contexts/context_for_myapp (create a context)
POST "paramKey = paramValue" jobserver:port/jobs?appName=myjob&classPath=path.to.main&context=context_for_myapp (run the job)
• Aggregated data fed to R via REST API
33. Cloud solutions
• Notebook as API: Databricks workflows / Domino Data Lab
• Google, Microsoft, Amazon
• Several data science platform startups: BigML, Dataiku, ...
(+) cluster deployment with one click
(+) some integrate notebooks well
(-) control over data?
41. Seldon architecture
• K8s for high availability
• Hot model deployments
• A/B testing
• Holdout group
• Containerized microservices conforming to Seldon's REST API
• Overall very good, but: outdated Python 2.x, and Kubernetes is mandatory
43. Wish list
• Flexibility to experiment (notebooks) on big enough hardware
• Make these easily available as an API in a pre-production environment to gain quick business feedback
• A/B testing, holdout group, containers
• More of a "developer" mindset (testing, CI, security) for data scientists
45. Write a JVM-based custom backend which operations and existing developers can maintain. Apparently this is a better fit than a turnkey platform solution.
49. PMML – Openscoring
• Based on PMML (Predictive Model Markup Language)
(+) stay in the Java/XML world (enterprise operations are happy)
(+) quick predictions
(+) mature
(-) not all models are suitable for PMML / some algorithms not implemented
(-) XML
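Openscoring serves deployed PMML models over REST; an evaluation call sends the model's input fields as JSON. A hedged sketch of building such a request body (the field names are hypothetical and must match your PMML model's schema):

```python
import json

# Openscoring-style evaluation request: a request id plus the model's
# input fields as "arguments". Field names here are placeholders.
def evaluation_request(request_id, **arguments):
    return json.dumps({"id": request_id, "arguments": arguments})

body = evaluation_request("req-1", Sepal_Length=5.1, Sepal_Width=3.5)
print(body)
# A real call would POST this to the deployed model's endpoint,
# e.g. http://host:8080/openscoring/model/Iris
```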
55. How do tools stack up regarding security?
https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
56. Python (what I learned later on)
• Can easily be deployed on its own (if ops can handle this)
• Py4J / PySpark / spylon?
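"Deployed on its own" can be as small as a single-process scoring endpoint. A minimal sketch using only the standard library, with a toy decision rule standing in for a fitted model:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Stand-in for a fitted estimator's .predict() -- hypothetical rule.
def model_predict(x):
    return 1 if x > 0.5 else 0

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": model_predict(features["x"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ScoreHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Call the service the way a JVM backend would: plain HTTP + JSON.
url = f"http://127.0.0.1:{server.server_port}/score"
req = request.Request(url, data=json.dumps({"x": 0.9}).encode(),
                      headers={"Content-Type": "application/json"})
response = json.loads(request.urlopen(req).read())
print(response)  # {'prediction': 1}
server.shutdown()
```

In production one would of course use a proper WSGI server; the point is that the contract with ops is just HTTP and JSON.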
57. Science in Python, production in Java – spylon, video
• Bring code via a custom UDF to the data in PySpark
• Model = fitted scikit-learn model
• Requires the model to be parallelizable
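The pattern behind a PySpark UDF is simply: ship a serializable fitted model to each partition and apply it there with no shared state. A plain-Python sketch of that pattern, with a toy threshold model standing in for a fitted scikit-learn object and plain lists standing in for Spark partitions:

```python
# Toy stand-in for a fitted model (imagine sklearn's .fit() produced it).
model = {"threshold": 3}

def predict_partition(rows, model):
    # What a UDF / mapPartitions closure would do on each executor:
    # stateless, per-row application of the broadcast model.
    return [1 if r > model["threshold"] else 0 for r in rows]

# Simulated partitions; Spark would distribute these across executors.
partitions = [[1, 5], [2, 8], [4]]
predictions = [p for part in partitions for p in predict_partition(part, model)]
print(predictions)  # [0, 1, 0, 1, 1]
```

This is why the model must be parallelizable: each partition is scored independently, so any model that needs global state during prediction does not fit the pattern.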
58. Others
• Jupyter notebook to REST API (IBM interactive dashboard, http://blog.ibmjstart.net/2016/01/28/jupyter-notebooks-as-restful-microservices/)
• Apache Toree (interactive Spark as notebook)
Editor's Notes
Hi, Georg. Talk about how to not have a smart prototype script rot in the corner. First talk ;)
Questions: Who has played with machine learning and is familiar with R / Python? Who is using big data technology in production? Who is driving business decisions with ML?