Data Science with Spark & Zeppelin


Vinay Shukla

Hadoop Summit, San Jose. Introducing Data science with Spark & Zeppelin


Data Science with Spark & Zeppelin
Ofer Mendelevitch
Vinay Shukla
Moon Soo Lee

© Hortonworks Inc. 2014
Data Science with IPython
Ofer Mendelevitch

© Hortonworks Inc. 2015
The Data Science Workflow…
Workflow stages, as labeled in the diagram:
• Plan: What is the question I'm answering? What data will I need?
• Acquire the data
• Clean Data: Analyze data quality, Reformat, Impute, etc
• Analyze data, Visualize
• Create model
• Evaluate results
• Create features
• Create report
• Deploy in Production
• Publish & Share
Diagram markers: Start here, End here
Activity labels in the diagram: Script, Visualize, Script
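
The "Clean Data" stage of the workflow names data-quality analysis and imputation. As a minimal sketch of what those two steps can look like, the snippet below measures the share of missing values in a column and fills the gaps with the mean of the observed values. The function names and the sample `ages` column are hypothetical, chosen only for illustration; mean imputation is one common choice, not a method the slides prescribe.

```python
from statistics import mean


def missing_ratio(values):
    """Return the fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute_mean(values):
    """Replace missing entries with the mean of the observed ones.

    Raises statistics.StatisticsError if every entry is missing.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical sample column with two missing entries.
ages = [30, None, 20, 40, None]
ratio = missing_ratio(ages)
cleaned = impute_mean(ages)
print(ratio, cleaned)
```

In a notebook, a check like `missing_ratio` would typically run first to decide whether imputation is reasonable at all for a given column.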

Introducing Apache Zeppelin
Lee Moon Soo, Vinay Shukla

Apache Zeppelin
• A web-based notebook for interactive analytics
• Deeply integrated with Spark and Hadoop
• Supports multiple language backends
• Currently an Apache Incubator project

Use cases for Zeppelin
• Data exploration & discovery
• Visualization - tables, graphs, charts
• Interactive snippet-at-a-time experience
• Collaboration and publishing
“Modern Data Science Studio”
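
As a rough illustration of the table visualization use case: Zeppelin's display system renders paragraph output that begins with `%table` as a table, with tabs separating columns and newlines separating rows. The helper below is a hypothetical sketch that builds such a string from plain Python data; the function name and sample rows are assumptions for illustration.

```python
def to_zeppelin_table(header, rows):
    """Build a %table string: tab-separated columns, one row per line."""
    lines = ["\t".join(str(cell) for cell in header)]
    lines += ["\t".join(str(cell) for cell in row) for row in rows]
    return "%table " + "\n".join(lines)


# Hypothetical sample data.
output = to_zeppelin_table(
    ["language", "paragraphs"],
    [["scala", 12], ["python", 7]],
)
print(output)
```

Printing this string from a notebook paragraph is what would let the front end offer table and chart views over the same data.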

DEMO I
A day in the life of a data scientist with Zeppelin

Apache Spark Integration
• Supports Scala, PySpark and Spark SQL
• SparkContext injected automatically
• Supports third-party dependencies
• Spark-on-YARN and Spark standalone modes
• Full Spark interpreter configuration
• Multiple Spark interpreter profiles
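
To make "SparkContext injected automatically" concrete, here is a minimal sketch of the general idea: the notebook runs each paragraph in a namespace where a variable such as `sc` is already bound, so user code never has to construct it. This is an illustration in plain Python, not Zeppelin's implementation; `StubContext` and `run_paragraph` are hypothetical names, and the stub stands in for a real SparkContext.

```python
class StubContext:
    """Stand-in for a real SparkContext; it only carries an app name."""

    def __init__(self, app_name):
        self.app_name = app_name


def run_paragraph(code, injected):
    """Execute paragraph code with pre-bound variables; return the namespace."""
    namespace = dict(injected)  # copy so the caller's dict is not mutated
    exec(code, namespace)
    return namespace


# The paragraph uses `sc` without ever creating it.
ns = run_paragraph("name = sc.app_name.upper()", {"sc": StubContext("zeppelin")})
print(ns["name"])
```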

Support for multiple back-ends
• Scala, Python, Spark SQL
• Hive, Tajo, Ignite, MySQL, …
• Apache Flink
• Markdown, shell
Driven by the community - thank you!
How is this so easy to do?

Zeppelin Interpreter Architecture
An interpreter is the connector between Zeppelin and a backend data processing system.
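[Diagram: ZeppelinServer communicates over Thrift with an InterpreterGroup, which runs in a separate JVM process and contains several Interpreters. The Spark group contains the Spark, PySpark, SparkSQL and Dep interpreters. They share a single SparkDriver connected to the Spark cluster, and Dep loads libraries from a Maven repository.]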
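
The grouping described on this slide can be sketched in a few lines: several interpreters belong to one group, and every interpreter in the group submits work through the same driver object. This is a simplified model in plain Python under stated assumptions, not Zeppelin's code; `SharedDriver`, `Interpreter` and `InterpreterGroup` here are hypothetical classes, and the separate JVM process and Thrift transport are left out.

```python
class SharedDriver:
    """Stand-in for the single SparkDriver shared by a group."""

    def __init__(self):
        self.jobs = []

    def submit(self, source, code):
        """Record a job and return its 1-based id."""
        self.jobs.append((source, code))
        return len(self.jobs)


class Interpreter:
    """One language front end that forwards code to the shared driver."""

    def __init__(self, name, driver):
        self.name = name
        self.driver = driver

    def interpret(self, code):
        job_id = self.driver.submit(self.name, code)
        return f"{self.name} ran job {job_id}"


class InterpreterGroup:
    """Creates one driver and hands it to every interpreter in the group."""

    def __init__(self, names):
        self.driver = SharedDriver()
        self.interpreters = {n: Interpreter(n, self.driver) for n in names}


spark_group = InterpreterGroup(["spark", "pyspark", "sql", "dep"])
first = spark_group.interpreters["spark"].interpret("val x = 1")
second = spark_group.interpreters["sql"].interpret("select 1")
print(first, "|", second)
```

Because the driver is shared, state created through one interpreter in the group is visible to jobs submitted through the others, which is the point of the "Share single SparkDriver" label.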

Notebook - Interpreter Selection
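[Diagram: within a notebook, the Spark interpreter group offers the spark, pyspark, sql and dep interpreters. They share a single SparkDriver connected to the Spark cluster, and dep loads libraries from a Maven repository.]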
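
In a Zeppelin notebook, a paragraph selects its interpreter with a leading directive such as `%pyspark` or `%sql`. The sketch below shows one way such a directive could be split from the paragraph body. It is a hypothetical parser written for illustration, and the fallback to `spark` when no directive is present is an assumption, not a statement about Zeppelin's actual default.

```python
def select_interpreter(paragraph, default="spark"):
    """Split a paragraph into (interpreter name, code)."""
    text = paragraph.lstrip()
    if not text.startswith("%"):
        return default, paragraph
    # The first whitespace-delimited token after "%" names the interpreter;
    # everything after it is the code to run.
    parts = text[1:].split(None, 1)
    name = parts[0] if parts else default
    code = parts[1] if len(parts) > 1 else ""
    return name, code


print(select_interpreter("%pyspark\nprint(sc.version)"))
print(select_interpreter("%sql select 1"))
print(select_interpreter("val x = 1"))
```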

Join the community
• Try out Apache Zeppelin today
• https://zeppelin.incubator.apache.org/
• Join us on the community discussions
• Help define how we shape the roadmap and features
• Let's get this party started!

© Hortonworks Inc. 2011 – 2015. All Rights Reserved
[Diagram: platform stack showing Cloud of your choice, Storage, YARN: Data Operating System, Resource Management, Governance, Security, Operations]
Questions?
Thank you






9. DEMO II Apache Spark using Zeppelin


13. DEMO III Interpreter Deep Dive
