Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin + Livy:
Bringing Multi Tenancy
to Interactive Data Analysis
Rohit Choudhary & Jeff Zhang
June 28, 2016

What’s Apache Zeppelin?
A web-based notebook that enables interactive data analytics. You can make beautiful, data-driven, interactive, and collaborative documents with SQL, Scala, and more.

Interactive Analysis 1.0 (Spark-shell)

Interactive Analysis 2.0 (Zeppelin)
Spark Interpreter

Interactive Analysis 3.0 (Zeppelin + Livy)
Livy Interpreter

Open Source Activity

Quick Stats: Zeppelin
• Zeppelin graduated in May 2016 and is now a TLP (top-level project)
• Incubated at the Apache Software Foundation since Dec 2014
• 9 committers, 120+ contributors, and a growing list
• 1000+ JIRAs filed
• 900 PRs via the community
• Zeppelin just got a new friend: “R”

Recent Updates
• Multi-tenancy with Livy
• Generic JDBC interpreter
– Hive, Phoenix, Redshift
– Postgres, MySQL
– Several others
• Notebook authentication and authorization
• UI automation through Selenium
• Security for other interpreters (on its way)

Usage Patterns & Feedback
• Cluster monitoring, memory analysis
• Telecom data usage, concert attendees' travel patterns

Upcoming
• GA with HDP 2.5 & Ambari 2.4.0, ETA: end of July

Architecture & Usage

Zeppelin Architecture
Current Interpreter Support
• HDFS
• PySpark, SparkR, Spark
• Hive, Phoenix, SQL
• Shell
• …

Zeppelin Features
Collate/Load Data: Collate and load data from existing data sources, or load from external CSVs, e.g. Eureka, Smartsense
Visualize: Robust visualization mechanism to visualize data and enable insights
Collaborate: Notebook-based collaboration and notebook export; tagging of notebook-generated data is to be added soon

Popular Usage Scenarios
Customized Dashboards: Intended for building customized dashboards for Big Data clusters
Security Analytics: Understanding the nature of data coming from multiple sources and analyzing its effects
Bio-sciences: Medical research companies are interested in using this for their research

Bringing Multi-tenancy to Zeppelin

Multi-Tenancy: Motivation
Objectives
• Supporting workloads of multiple customers
• Supporting multiple LOBs (lines of business) on a single data system
• Supporting fine-grained audits
Requirements
• Inability to provision capacity for multiple user groups
• Inability to audit user actions, as all jobs are run via the ‘zeppelin’ proxy user
• Inability to share state/data with other users

Zeppelin Livy Interaction
[Diagram: Security Across Zeppelin-Livy-Spark. Components: LDAP, Zeppelin (Shiro), Ispark Group Interpreter, Livy, Spark, Yarn. Connection labels: Livy APIs; SPNego: Kerberos; Kerberos.]

Deep dive on Livy

What is Livy
Livy is an open source REST interface for interacting with Spark from anywhere.
[Diagram: the Livy Client talks to the Livy Server over HTTP; the Livy Server talks over HTTP (RPC) to a Spark Interactive Session and a Spark Batch Session, each with its own SparkContext.]

Why we need Livy with Zeppelin
• Reduce the load on the client machine
• Make job submission and monitoring easy
• Customize job scheduling

Interactive Session – Create Session
Request
curl -X POST --data '{"kind": "spark"}' -H "Content-Type: application/json" localhost:8998/sessions
Response
{"state":"starting","proxyUser":"null","id":1,"kind":"spark","log":[]}
[Diagram: steps 1 to 4 between the Livy Client, the Livy Server, and the Spark Interactive Session (SparkContext).]
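
The same session-creation request can be issued from a script. The following is a minimal Python sketch, assuming a Livy server at localhost:8998 as in the curl example; the function names are illustrative and not part of any Livy client library.

```python
import json
import urllib.request

# Assumed address, copied from the curl example on this slide.
LIVY_URL = "http://localhost:8998"


def build_create_session_request(kind="spark", base_url=LIVY_URL):
    """Build (but do not send) the POST /sessions request."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="spark", base_url=LIVY_URL):
    """Send the request and return the decoded JSON response.

    Requires a reachable Livy server; the response is expected to look like
    the one on the slide (fields such as "id", "state", and "kind").
    """
    request = build_create_session_request(kind, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling create_session() against a running server returns the session description; the "id" field is what later requests use to address the session.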

Interactive Session – Execute Code
Request
curl http://localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sc.parallelize(0 to 100).sum()"}'
Response
{"id":0,"state":"running","output":null}
[Diagram: steps 1 to 4 between the Livy Client, the Livy Server, and the Spark Interactive Session (SparkContext).]
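
Because the response to a submitted statement reports a "running" state with no output yet, a client typically polls until the statement finishes. The sketch below is one hedged way to do that in Python. It assumes the same localhost:8998 server, assumes that a statement can be read back with a GET on its own URL under the session, and assumes "waiting" and "running" are the in-progress state names; the helper names are hypothetical.

```python
import json
import time
import urllib.request

# Assumed address, copied from the curl example on this slide.
LIVY_URL = "http://localhost:8998"

# Assumption: states that mean the statement has not finished yet.
PENDING_STATES = ("waiting", "running")


def build_statement_request(session_id, code, base_url=LIVY_URL):
    """Build (but do not send) the POST that submits code to a session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def http_get_json(url):
    """Fetch a URL and decode its JSON body (needs a reachable server)."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_statement(session_id, statement_id, fetch=http_get_json,
                       base_url=LIVY_URL, interval=1.0, max_polls=60):
    """Poll a statement until it leaves the pending states, then return it."""
    url = f"{base_url}/sessions/{session_id}/statements/{statement_id}"
    for _ in range(max_polls):
        statement = fetch(url)
        if statement.get("state") not in PENDING_STATES:
            return statement
        time.sleep(interval)
    raise TimeoutError(f"statement {statement_id} did not finish in time")
```

The fetch parameter is injected so the polling loop can be exercised without a server, for example by passing a function that returns canned responses.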

SparkContext Sharing
[Diagram: Client 1, Client 2, and Client 3 connect through the Livy Server to Session-1 and Session-2, so more than one client can use the same session. SparkSession-1 and SparkSession-2 each hold their own SparkContext.]

Livy Security
[Diagram: the Client reaches the Livy Server via SPNEGO; the Livy Server reaches the SparkSession via a shared secret, using impersonation.]
• Only authorized users can launch a Spark session or submit code
• Each user can access only their own session
• Only the Livy server can securely submit jobs to a Spark session

SPNEGO
Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO), often pronounced “spen-go”, is a GSSAPI “pseudo mechanism” used by client-server software to negotiate the choice of security technology.
[Diagram: exchange between the Client (Kerberos TGT) and the Livy Server (SPNEGO enabled)]
1. Client: HTTP GET http://site/a.html
2. Server: Error 401 Unauthorized
3. Client: HTTP GET request with Authorization: Negotiate
4. HTTP GET request
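
The challenge-and-retry pattern in this exchange can be sketched in Python as follows. This is an illustration of the client-side control flow only: the send and get_token callables are hypothetical stand-ins for an HTTP client and a Kerberos/GSSAPI library, and the sketch assumes the server's 401 response carries a WWW-Authenticate: Negotiate header and that the token is sent base64-encoded.

```python
import base64


def negotiate_header(token: bytes) -> dict:
    """Format an opaque token as an 'Authorization: Negotiate' header."""
    encoded = base64.b64encode(token).decode("ascii")
    return {"Authorization": "Negotiate " + encoded}


def fetch_with_spnego(send, get_token, url):
    """Request a URL, retrying once with a Negotiate token on a 401 challenge.

    send(url, headers) -> (status, response_headers, body)   # hypothetical
    get_token() -> bytes                                      # hypothetical
    """
    # First attempt carries no credentials.
    status, headers, body = send(url, {})
    challenge = headers.get("WWW-Authenticate", "")
    if status == 401 and challenge.startswith("Negotiate"):
        # The server asked for negotiation: retry with the token attached.
        status, headers, body = send(url, negotiate_header(get_token()))
    return status, body
```

In a real deployment the token would be derived from the user's Kerberos TGT by a GSSAPI implementation; here it is treated as opaque bytes.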

Impersonation
[Diagram: Alice and Bob, each holding a Kerberos TGT, authenticate to the Livy Server (super user: livy) via SPNEGO. The Livy Server talks to a separate Spark Session for each of them via a shared secret.]
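
On the REST side, impersonation shows up as the proxyUser field that also appears in the session response earlier in the deck. A small Python sketch of building per-user session payloads, assuming the create-session request accepts a proxyUser field alongside kind (the function name and the user names are illustrative):

```python
import json


def session_payload(kind, proxy_user=None):
    """Return the JSON body for creating a session, optionally impersonating a user."""
    payload = {"kind": kind}
    if proxy_user is not None:
        # Ask the server (running as its own super user) to act as this user.
        payload["proxyUser"] = proxy_user
    return json.dumps(payload)


# One payload per authenticated end user.
alice_body = session_payload("spark", proxy_user="alice")
bob_body = session_payload("spark", proxy_user="bob")
```

With this shape, jobs are attributed to the individual user rather than to a single shared account, which is what makes per-user audits possible.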

Shared Secret
1. The Livy Server generates a secret key
2. The Livy Server passes the secret key to the Spark session when launching it
3. Both sides use the secret key to communicate with each other
[Diagram: the Livy Server and the Spark Session share the secret.]
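
The three steps can be illustrated with a generic shared-secret scheme. The Python sketch below uses an HMAC over each message; this is an assumption chosen for illustration, not a description of Livy's actual wire protocol, and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def generate_secret(num_bytes=32):
    """Step 1: the server generates a random secret key."""
    return secrets.token_bytes(num_bytes)


def sign(secret, message):
    """Compute an authentication tag for a message using the shared secret."""
    return hmac.new(secret, message, hashlib.sha256).digest()


def verify(secret, message, tag):
    """Check a tag in constant time; True only if the sender knew the secret."""
    return hmac.compare_digest(sign(secret, message), tag)


# Step 2: the secret is handed to the session at launch (simulated by sharing
# the same variable here).
secret = generate_secret()

# Step 3: each side authenticates what it sends to the other.
message = b"submit job"
tag = sign(secret, message)
accepted = verify(secret, message, tag)
```

A party that does not hold the secret cannot produce a valid tag, which is the property that lets only the server submit work to the session.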

Multi Tenant: Zeppelin Demo

Zeppelin Direction
• Workspaces and Collaboration
• Customizable Visualization
– Helium
– Custom, data-type-based visualization (Geolocation/Maps)
• Enterprise Readiness
– Bring security to all interpreters
– Performance improvements
• Collaboration
• Data Lineage
Q & A

Thank You
