2. Agenda
• Problems with current workflow
• Interactive exploration to enterprise API
• Data Science Platforms
• My recommendation
3. About me @geoHeil
• Data Scientist at T-Mobile Austria
• Business Informatics at Vienna University of Technology
• Built predictive startup (predictr.eu)
• Data science projects at university
7. ML modes: similarity of environments?
Exploration
• Flexibility
• Easy to use
• Reusability
Production
• Performance
• Scalability
• Monitoring
• API
Interaction between both modes is required to improve the business process.
10. Prototype problem at current project
Easy move to the JVM?
• Consultant: R
• Me: Python
• Production: JVM (native C dependencies)
11. Stackup
Problems
• Move to production means redevelopment from scratch
• Enterprise operations handle JVM only
Solutions
• Notebooks as API
• Redevelop from scratch
12. Prototype problem at current project
Easy move to the JVM?
• Consultant: R
• Me: Python
• Production: JVM (native C dependencies)
13. Data exchange possibilities (API)
• Pickle – Python only
• Hadoop file formats (Avro / Parquet)
• Thrift, Protobuf
• Message queue
• REST
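A minimal sketch of the first and last options, with made-up field names: pickle is compact but only Python can read it back, while JSON is what a JVM service on the other end of a REST call can consume.

```python
import json
import pickle

# Hypothetical model output we want to hand from Python to a JVM service.
scores = {"customer_id": 42, "churn_probability": 0.17}

# Pickle: convenient, but only Python can deserialize it.
blob = pickle.dumps(scores)

# JSON: readable from Java, Scala, R, ... -- suitable as a REST payload.
payload = json.dumps(scores)

print(payload)
```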
14. Stackup
Problems
• Move to production means redevelopment from scratch
• Enterprise operations handle JVM only
Solutions
• Notebooks as API
• Use analytics via an API
15. "Big data starts at 20 GB. I want to use a fancy Hadoop cluster." – "We can buy a server with 6 TB of RAM."
16. 3 types of big data
1. Fits in memory (6 TB of RAM …)
2. Raw data too large for memory, but aggregated data works well
3. Too big => the ML needs to be distributed as well
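Type 2 is the interesting middle case: aggregate the raw data in one streaming pass so only the small aggregate has to fit in memory. A toy sketch with a hypothetical event stream:

```python
from collections import Counter

# Type 2 big data: the raw stream is too large to hold at once,
# so we aggregate in a single streaming pass with O(1) state per key.
def aggregate(events):
    counts = Counter()
    for user, amount in events:   # one event at a time
        counts[user] += amount
    return counts                 # small aggregate: fits easily in memory

# Stand-in for a large event source (in reality: a file, Kafka, Spark, ...)
events = [("a", 10), ("b", 5), ("a", 3)]
agg = aggregate(iter(events))
print(dict(agg))  # {'a': 13, 'b': 5}
```

The aggregate can then be handed to Python/R notebooks for the actual modelling.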
17. Stackup
Problems
• Move to production means redevelopment from scratch
• Enterprise operations handle JVM only
• Inflexible big data tools
Solutions
• Notebooks as API
• Use analytics via an API
• Your data is not "really big" and still fits in memory
19. Stackup
Problems
• Move to production means redevelopment from scratch
• Enterprise operations handle JVM only
• Inflexible big data tools
• Security not taken care of
Solutions
• Notebooks as API
• Use analytics via an API
• Your data is not "really big" and still fits in memory -> keep using Python / R / notebooks
• Kerberized Hadoop cluster :(
21. Small data & R prototype
Separation of concerns.
22. Startup data science – predicting cash flows
• Custom backend (JVM)
• Data science via an API (OpenCPU / R)
• Partly in the backend (Renjin)
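OpenCPU exposes R functions over HTTP, so the JVM backend only needs to issue REST calls. A hedged sketch of how such a request could be built from Python (host, package, and function names are placeholders, not the actual project's):

```python
from urllib import parse, request

# Placeholder OpenCPU endpoint -- replace with your deployment.
OPENCPU = "http://opencpu.example.com/ocpu"

def opencpu_call(package, function, **params):
    # OpenCPU pattern: POST /ocpu/library/{package}/R/{function}/json
    url = f"{OPENCPU}/library/{package}/R/{function}/json"
    data = parse.urlencode(params).encode()
    return request.Request(url, data=data, method="POST")

# Hypothetical R package "cashflows" with a prediction function.
req = opencpu_call("cashflows", "predict_flows", horizon=12)
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return the R function's result as JSON.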
23. Other possibilities
• JNI (Java Native Interface) :(
• JNA (Java Native Access)
• Rkafka (we did not have a message queue in the infrastructure)
• Custom service (REST call) to a JNA-enabled server (too costly)
29. Project facts
• We were using an MS SQL backup (600 GB)
• Spark + Parquet compressed it to 3 GB
• No cluster during development of the project, only laptops + a 60 GB RAM server
• Most of the time was spent in garbage collection (15 seconds on a real cluster, 17 minutes on a laptop)
30. Data science stack
• Type 2 big data (aggregation allows for local in-memory processing in Python/R)
• Spark as a (REST) API (spark-jobserver style):
POST jobserver:port/jars/myjob (upload the job jar)
POST jobserver:port/contexts/context_for_myapp (create a context)
POST "paramKey = paramValue" jobserver:port/jobs?appName=myjob&classPath=path.to.main&context=context_for_myapp (run the job)
• Aggregated data fed to R via REST API
33. Cloud solutions
• Notebook as API: Databricks workflows / Domino Data Lab
• Google, Microsoft, Amazon
• Several data science platform startups: BigML, Dataiku, ...
(+) cluster deployment with one click
(+) some integrate notebooks well
(-) control over data?
41. Seldon architecture
• K8s for high availability
• Hot model deployments
• A/B testing
• Holdout group
• Containerized microservices conforming to Seldon's REST API
• Overall very good, but: outdated Python 2.x, and Kubernetes is mandatory
43. Wish list
• Flexibility to experiment (notebooks) on big enough hardware
• Make these easily available as an API in a pre-production environment to gain quick business feedback
• A/B testing, holdout group, containers
• More of a "developer" mindset (testing, CI, security) for data scientists
45. Write a JVM-based custom backend which operations and existing developers can maintain. Apparently this is a better fit than a turnkey platform solution.
49. PMML – Openscoring
• Based on PMML (Predictive Model Markup Language)
(+) stay in the Java/XML world (enterprise operations are happy)
(+) quick predictions
(+) mature
(-) not all models are suitable for PMML / some algorithms not implemented
(-) XML
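Openscoring serves deployed PMML models over REST; an evaluation call sends the model's input fields as JSON. A hedged sketch of building such a request body (the field names are hypothetical and must match your PMML model's schema):

```python
import json

# Openscoring-style evaluation request: a request id plus the model's
# input fields as "arguments". Field names here are placeholders.
def evaluation_request(request_id, **arguments):
    return json.dumps({"id": request_id, "arguments": arguments})

body = evaluation_request("req-1", Sepal_Length=5.1, Sepal_Width=3.5)
print(body)
# A real call would POST this to the deployed model's endpoint,
# e.g. http://host:8080/openscoring/model/Iris
```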
55. How do tools stack up regarding security?
https://www.youtube.com/watch?v=t63SF2UkD0A&feature=youtu.be
56. Python (what I learned later on)
• Can easily be deployed on its own (if ops can handle this)
• Py4J / PySpark / spylon?
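"Deployed on its own" can be as small as a single-process scoring endpoint. A minimal sketch using only the standard library, with a toy decision rule standing in for a fitted model:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Stand-in for a fitted estimator's .predict() -- hypothetical rule.
def model_predict(x):
    return 1 if x > 0.5 else 0

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": model_predict(features["x"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ScoreHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Call the service the way a JVM backend would: plain HTTP + JSON.
url = f"http://127.0.0.1:{server.server_port}/score"
req = request.Request(url, data=json.dumps({"x": 0.9}).encode(),
                      headers={"Content-Type": "application/json"})
response = json.loads(request.urlopen(req).read())
print(response)  # {'prediction': 1}
server.shutdown()
```

In production one would of course use a proper WSGI server; the point is that the contract with ops is just HTTP and JSON.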
57. Science in Python, production in Java – spylon, video
• Bring code via a custom UDF to the data in PySpark
• Model = fitted scikit-learn model
• Requires the model to be parallelizable
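The pattern behind a PySpark UDF is simply: ship a serializable fitted model to each partition and apply it there with no shared state. A plain-Python sketch of that pattern, with a toy threshold model standing in for a fitted scikit-learn object and plain lists standing in for Spark partitions:

```python
# Toy stand-in for a fitted model (imagine sklearn's .fit() produced it).
model = {"threshold": 3}

def predict_partition(rows, model):
    # What a UDF / mapPartitions closure would do on each executor:
    # stateless, per-row application of the broadcast model.
    return [1 if r > model["threshold"] else 0 for r in rows]

# Simulated partitions; Spark would distribute these across executors.
partitions = [[1, 5], [2, 8], [4]]
predictions = [p for part in partitions for p in predict_partition(part, model)]
print(predictions)  # [0, 1, 0, 1, 1]
```

This is why the model must be parallelizable: each partition is scored independently, so any model that needs global state during prediction does not fit the pattern.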
58. Others
• Jupyter notebook to REST API (IBM interactive dashboard, http://blog.ibmjstart.net/2016/01/28/jupyter-notebooks-as-restful-microservices/)
• Apache Toree (interactive Spark as notebook)
Editor's Notes
Hi, Georg. Talk about how to not have a smart prototype script rot in the corner. First talk ;)
Questions: Who has played with machine learning and is familiar with R / Python? Who is using big data technology in production? Who is driving business decisions with ML?