Data Science Lifecycle with Apache Zeppelin and Spark

2015 Spark Summit Amsterdam
Moon, moon@nflabs.com
NFLabs, www.nflabs.com

Data Science: process
https://en.wikipedia.org/wiki/Data_analysis

Data Science: people
Engineer Data Scientist
DevOps Business
http://aarondavis.design/

Hadoop Landscape
Cloudera-ML
ML-base
MRQL
Shark
?

Project Timeline
12.2012 Commercial product for data analysis
10.2013 Open sourced a single feature
08.2014 Started getting adoption
12.2014 ASF Incubation
http://zeppelin.incubator.apache.org

Apache Incubation Proposal
11.2014

Acceptance by Incubator
23.12.2014

Current Status
1 Release
71 Contributors worldwide
766 Stars on GH
300/900 emails at users/dev @i.a.o

Interpreter
http://zeppelin.incubator.apache.org/docs/development/writingzeppelininterpreter.html

Writing an Interpreter
public abstract void open();
public abstract void close();
public abstract InterpreterResult interpret(String st, InterpreterContext context);
public abstract void cancel(InterpreterContext context);
public abstract int getProgress(InterpreterContext context);
public abstract List<String> completion(String buf, int cursor);
public abstract FormType getFormType();
public Scheduler getScheduler();
Must have
Good to have
Advanced
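A minimal sketch of the open / interpret / close lifecycle listed above. To compile on its own it uses hypothetical stand-in types (`SketchInterpreter`, `SketchResult`, `EchoInterpreter`) rather than the real Zeppelin classes, and it leaves out the `InterpreterContext` argument that the real `interpret` takes.

```java
import java.util.Locale;

// Stand-in types for illustration only; these are NOT the Zeppelin classes.
final class SketchResult {
    enum Code { SUCCESS, ERROR }

    final Code code;
    final String message;

    SketchResult(Code code, String message) {
        this.code = code;
        this.message = message;
    }
}

// Mirrors only three of the abstract methods from the slide.
abstract class SketchInterpreter {
    public abstract void open();
    public abstract void close();
    public abstract SketchResult interpret(String st);
}

// Hypothetical interpreter that upper-cases the paragraph text it receives.
final class EchoInterpreter extends SketchInterpreter {
    private boolean opened = false;

    @Override
    public void open() {
        opened = true; // a real interpreter would acquire its backend here
    }

    @Override
    public void close() {
        opened = false; // and release it here
    }

    @Override
    public SketchResult interpret(String st) {
        if (!opened) {
            return new SketchResult(SketchResult.Code.ERROR, "interpreter is not open");
        }
        return new SketchResult(SketchResult.Code.SUCCESS, st.toUpperCase(Locale.ROOT));
    }
}

final class EchoInterpreterDemo {
    public static void main(String[] args) {
        EchoInterpreter interpreter = new EchoInterpreter();
        interpreter.open();
        System.out.println(interpreter.interpret("hello zeppelin").message);
        interpreter.close();
    }
}
```

The remaining methods on the slide (cancel, progress, completion, form type, scheduler) would be added to the same class in the same way.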

Display System
Zeppelin Server
Spark Interpreter Other Interpreter
Zeppelin webapp
Websocket, REST
Text Html Table Angular

Display System
Select display system through output
Built-in scheduler
Built-in scheduler runs your notebook with a cron expression.
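The slide does not show a cron expression. The sketch below assumes Quartz-style expressions (seconds, minutes, hours, day of month, month, day of week, optional year) and only labels the fields; it does not validate or schedule anything. `CronFields` and `label` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CronFields {
    // Field order assumed here follows the Quartz cron format.
    private static final String[] NAMES = {
        "seconds", "minutes", "hours", "dayOfMonth", "month", "dayOfWeek", "year"
    };

    // Labels each whitespace-separated field; field values are not validated.
    static Map<String, String> label(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException("expected 6 or 7 fields, got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // "0 0/5 * * * ?" reads as: at second 0, every 5 minutes, every hour and day.
        System.out.println(label("0 0/5 * * * ?"));
    }
}
```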

Flexible layout

Zeppelin & Friends
Z-Manager
ZeppelinHub
…
Collaboration/Sharing
Packaging & Deployment Zeppelin + Full stack on a cloud
Packages Backend Integration

Deployment
https://github.com/hortonworks-gallery/ambari-zeppelin-service

AWS EMR
/aws.amazon.com/blogs/aws/amazon-emr-release-4-1-0-spark-1-5-0-hue-3-7-1-hdfs-encryption-presto-oozie-zeppelin-improved-resizing

An Engineer
engineer by http://aarondavis.design/

A Team

An Organization

That’s too many!

What is the problem?
Too much:
Install
Configure
Cluster resources

Solution?
We have containers
+
reverse proxy

Z Manager
http://github.com/NFLabs/z-manager
Apache 2.0 License
Containerized deployment per user
Reverse proxy
Single binary
Simple web application
Z Manager
SGA to ASF coming *

Z Manager
Auto-update
Linux box
Go + React :)
Z Manager process

ZeppelinHub
https://www.zeppelinhub.com
Sharing notebooks with access control

Zeppelin
http://aarondavis.design/
Shares Notebook
Provides multi-tenant environment
z-manager
ZeppelinHub

Before
Cloudera-ML
ML-base
MRQL
Shark
?

After
Cloudera-ML
ML-base
MRQL
Shark

People do similar work with different data
New visualization
Model & Algorithm
Data process pipeline

Package and distribute work
New visualization
Model & Algorithm
Data process pipeline
Pkg
Repo

Helium
https://s.apache.org/helium
Platform for data analytics applications on top of Apache Zeppelin

Helium Application
= View + Algorithm
Zeppelin-provided Resources

Resources
Data
Computing
Any Java object
- Result of last execution
- JDBC connection (from JDBC Interpreter)*
- SparkContext (from SparkInterpreter)
- Flink environment (from FlinkInterpreter)*
- Provided by user created Interpreter
- Provided by user created Helium application

Application Examples
Data
- ex) get git commit log data
https://github.com/Leemoonsoo/zeppelin-gitcommitdata
Computing
- ex) run CPU usage monitoring code across a Spark cluster, using SparkContext
https://github.com/Leemoonsoo/zeppelin-sparkmon
Visualization
- ex) display result data as a wordcloud
https://github.com/Leemoonsoo/zeppelin-wordcloud

How it works
Zeppelin Server
Web browser
View
Interpreter Process
Algorithm
Resource pool
Resource pool
Resource pools are connected
“Algorithm runs where resource exists”
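One way to picture connected resource pools is the sketch below. It is a hypothetical model, not the Zeppelin implementation: each process owns a pool, pools know their neighbours, and a lookup reports which process holds a resource so the algorithm can be sent there. `SketchResourcePool` and its methods are illustrative names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of one process's resource pool.
final class SketchResourcePool {
    final String processName;
    private final Map<String, Object> resources = new HashMap<>();
    private final List<SketchResourcePool> connected = new ArrayList<>();

    SketchResourcePool(String processName) {
        this.processName = processName;
    }

    void put(String name, Object resource) {
        resources.put(name, resource);
    }

    // Connects two pools in both directions.
    void connect(SketchResourcePool other) {
        connected.add(other);
        other.connected.add(this);
    }

    // Returns the pool holding the named resource: this pool first, then directly connected ones.
    Optional<SketchResourcePool> locate(String name) {
        if (resources.containsKey(name)) {
            return Optional.of(this);
        }
        for (SketchResourcePool pool : connected) {
            if (pool.resources.containsKey(name)) {
                return Optional.of(pool);
            }
        }
        return Optional.empty();
    }
}

final class ResourcePoolDemo {
    public static void main(String[] args) {
        SketchResourcePool server = new SketchResourcePool("zeppelin-server");
        SketchResourcePool spark = new SketchResourcePool("spark-interpreter");
        server.connect(spark);
        // Placeholder object standing in for a real SparkContext.
        spark.put("org.apache.spark.SparkContext", new Object());

        String where = server.locate("org.apache.spark.SparkContext")
            .map(pool -> pool.processName)
            .orElse("nowhere");
        System.out.println("run algorithm in: " + where);
    }
}
```

The point of the design is that the resource itself never moves; only the decision about where to run the algorithm does.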

API
class YourApplication extends org.apache.zeppelin.helium.Application {
    @Override
    public void run(ApplicationArgument arg, InterpreterContext context) {
        // ...
    }
}
Easy API
Just extend helium.Application

Application Spec
{
  mavenArtifact : "groupId:artifactId:version",
  className : "your.helium.application.Class",
  icon : "fa fa-cloud",
  name : "My app name",
  description : "some description",
  consume : [
    "org.apache.spark.SparkContext"
  ]
}
Simple
Writing a spec file allows Zeppelin to load the application
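The slides do not say how Zeppelin uses the `consume` list. The sketch below assumes the simplest reading: an application can be offered only when every class name it consumes is available as a resource. `SketchAppSpec` and `canRunWith` are hypothetical names, and only three of the spec fields are modelled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory form of some of the spec fields shown above.
final class SketchAppSpec {
    final String mavenArtifact;
    final String className;
    final List<String> consume;

    SketchAppSpec(String mavenArtifact, String className, List<String> consume) {
        this.mavenArtifact = mavenArtifact;
        this.className = className;
        this.consume = consume;
    }

    // Assumption: the app is runnable when every consumed class name is available.
    boolean canRunWith(Set<String> availableResourceClasses) {
        return availableResourceClasses.containsAll(consume);
    }
}

final class AppSpecDemo {
    public static void main(String[] args) {
        SketchAppSpec spec = new SketchAppSpec(
            "groupId:artifactId:version",
            "your.helium.application.Class",
            Collections.singletonList("org.apache.spark.SparkContext"));

        Set<String> available = new HashSet<>(
            Arrays.asList("org.apache.spark.SparkContext"));
        System.out.println("offer application: " + spec.canRunWith(available));
    }
}
```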

Deploy
Handy
Public Repository
Private Repository
Packaged as a JAR and distributed through Maven
Downloaded on the fly and run when the user selects it

Thank you
Q & A
Moon
moon@nflabs.com
NFLabs
www.nflabs.com
http://zeppelin.incubator.apache.org/
