Spark Summit EU talk by Sudeep Das and Aish Faenton

•

2 likes•938 views

Spark Summit

Vegas, the Missing MatPlotLib for Spark

Data & Analytics

@NetflixResearch
@aishfenton @datamusing
The missing MatPlotLib
for Scala/Spark

MACHINE LEARNING
SYSTEMS CAN GET QUITE
COMPLICATED
Real life workflow

STATISTICAL
VISUALIZATIONS CAN BE
PAINFULConsider a researcher at Netflix who who has raw data in a spark dataframe with columns:
show_id, num_of_views, country, timestamp, video_age
The researcher wants to make the following plot:
Plot the five most popular titles,
according to total number of views,
in the last 5 hours,
as bar charts faceted by country,
where the bars are color coded by video_age.
Sorting
Aggregating after filtering
by timestamp
Grouping data by a
categorical value (Country)
Mapping a quantitative
column to a color
One could perform all
these operations on the
DF first and then create
one bar plot per country
via a loop.
Painful indeed!

DECLARATI
VE
STATISTICA
L
VISUALIZATIO
N
GRAMMAR
IN SCALA
You tell is WHAT should be done with the data, and it knows
HOW to do it!
Operations such as filtering, aggregation, faceting are built into the
visualization, rather than putting the burden on the user to massage the
data into shape.
Complex visualizations can be built with a few high level abstractions:
DATA
TRANS
-
FORM
S
SCALE
S
GUIDE
S
MARK
S
cf : Altair Talk by Brian Granger in PyData 2016 https://youtu.be/v5mrwq7yJc4

Anatomy of a plot
X/Y channel
Shape
Channel
Size Channel
Color
Channel

5. Built-in spark support
Vegas
.withDataFrame(myDataFrame)
.encodeX(“population”)
.encodeY(“age”)
Mapped
Columns
Pass In
DF.

6. Visual statistics
● Advanced Binning
● Sorting
● Scaling
● Custom Transforms
● Time Series
● Aggregation
● Filtering
● Math functions(log, etc)
● Missing data support
● Descriptive Statistics

VEGA
D3JS
VEGA - LITE
VEGA
S
SCALA DSL EMITS TYPE-CHECKED
VEGA-LITE JSON
VEGA-LITE CONVERTS
INTERNALLY TO VEGA JSON SPEC
VEGA TRANSLATES JSON TO D3JS
CODE THAT CAN BE VERY VERBOSE
A SCALA DSL FOR VEGA-
LITE

Spark Summit EU talk by Sudeep Das and Aish Faenton

Viewers also liked

Spark Summit EU talk by Oscar CastanedaSpark Summit

DataEngConf: Apache Spark in Financial Modeling at BlackRock Hakka Labs

Spark Summit EU talk by Javier AguedesSpark Summit

Spark Summit EU talk by Josef HabdankSpark Summit

Spark Summit EU talk by Zoltan ZvaraSpark Summit

Spark Summit EU talk by Simon WhitearSpark Summit

Spark Summit EU talk by Heiko KorndorfSpark Summit

Spark Summit EU talk by Jorg SchadSpark Summit

ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...Spark Summit

Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit

From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...Spark Summit

Spark Summit EU talk by Luc BourlierSpark Summit

Extreme-scale Ad-Tech using Spark and Databricks at MediaMathSpark Summit

Spark Summit EU talk by Reza KarimiSpark Summit

Spark Summit EU talk by Dean WamplerSpark Summit

Spark Summit EU talk by Nick PentreathSpark Summit

Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Spark Summit

Spark Summit EU talk by Sital KediaSpark Summit

Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit

The Potential of GPU-driven High Performance Data Analytics in SparkSpark Summit

Viewers also liked (20)

Spark Summit EU talk by Oscar Castaneda

DataEngConf: Apache Spark in Financial Modeling at BlackRock

Spark Summit EU talk by Javier Aguedes

Spark Summit EU talk by Josef Habdank

Spark Summit EU talk by Zoltan Zvara

Spark Summit EU talk by Simon Whitear

Spark Summit EU talk by Heiko Korndorf

Spark Summit EU talk by Jorg Schad

ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...

Spark Summit EU talk by Erwin Datema and Roeland van Ham

From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...

Spark Summit EU talk by Luc Bourlier

Extreme-scale Ad-Tech using Spark and Databricks at MediaMath

Spark Summit EU talk by Reza Karimi

Spark Summit EU talk by Dean Wampler

Spark Summit EU talk by Nick Pentreath

Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...

Spark Summit EU talk by Sital Kedia

Spark Summit EU talk by Ruben Pulido and Behar Veliqi

The Potential of GPU-driven High Performance Data Analytics in Spark

Similar to Spark Summit EU talk by Sudeep Das and Aish Faenton

VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit

Unlock the power of spatial analysis using CARTO and python [CARTOframes]CARTO

Scaling PyData Up and OutTravis Oliphant

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit

Object Detection & Machine Learning PaperJoseph Mogannam

Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...Chris Fregly

Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Chris Fregly

An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementDatabricks

A Picture is Worth a Thousand WordsJohn Park

Big data apache spark + scalaJuantomás García Molina

Report: EDA of TV shows & movies available on NetflixAnkitBirla5

Portland Splunk User Group May 2020 Amanda Richardson

The Power of Machine Learning and GraphsFranz Inc. - AllegroGraph

Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere

Extend Your Toolbox with RSzymon Skorupinski

Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Databricks

Analytics.pptxhari168183

Spot Sigsinfoblog

Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Jeff Magnusson

Introduction to Bigdata and HADOOP vinoth kumar

Similar to Spark Summit EU talk by Sudeep Das and Aish Faenton (20)

VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...

Unlock the power of spatial analysis using CARTO and python [CARTOframes]

Scaling PyData Up and Out

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Object Detection & Machine Learning Paper

Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...

Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...

An AI-Powered Chatbot to Simplify Apache Spark Performance Management

A Picture is Worth a Thousand Words

Big data apache spark + scala

Report: EDA of TV shows & movies available on Netflix

Portland Splunk User Group May 2020

The Power of Machine Learning and Graphs

Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...

Extend Your Toolbox with R

Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...

Analytics.pptx

Spot Sigs

Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)

Introduction to Bigdata and HADOOP

Recently uploaded

Vision, Mission, Goals and Objectives ppt..pptxellehsormae

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics

RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter

GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics

Semantic Shed - Squashing and Squeezing.pptxMike Bennett

Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7

Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics

Learn How Data Science Changes Our WorldEduminds Learning

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann

ASML's Taxonomy Adventure by Daniel Cantervoginip

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics

Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2

Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16

Real-Time AI Streaming - AI Max PrincetonTimothy Spann

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Universitat Politècnica de Catalunya

Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss

Recently uploaded (20)

Vision, Mission, Goals and Objectives ppt..pptx

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...

RABBIT: A CLI tool for identifying bots based on their GitHub events.

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...

GA4 Without Cookies [Measure Camp AMS]

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT

Semantic Shed - Squashing and Squeezing.pptx

Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf

Learn How Data Science Changes Our World

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024

ASML's Taxonomy Adventure by Daniel Canter

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...

Identifying Appropriate Test Statistics Involving Population Mean

Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh

Real-Time AI Streaming - AI Max Princeton

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)

Heart Disease Classification Report: A Data Analysis Project

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改

Spark Summit EU talk by Sudeep Das and Aish Faenton

2. @NetflixResearch @aishfenton @datamusing The missing MatPlotLib for Scala/Spark

5. Jan 6th, 2016

7. MACHINE LEARNING SYSTEMS CAN GET QUITE COMPLICATED Real life workflow

9. STATISTICAL VISUALIZATIONS CAN BE PAINFULConsider a researcher at Netflix who who has raw data in a spark dataframe with columns: show_id, num_of_views, country, timestamp, video_age The researcher wants to make the following plot: Plot the five most popular titles, according to total number of views, in the last 5 hours, as bar charts faceted by country, where the bars are color coded by video_age. Sorting Aggregating after filtering by timestamp Grouping data by a categorical value (Country) Mapping a quantitative column to a color One could perform all these operations on the DF first and then create one bar plot per country via a loop. Painful indeed!

10. DECLARATI VE STATISTICA L VISUALIZATIO N GRAMMAR IN SCALA You tell is WHAT should be done with the data, and it knows HOW to do it! Operations such as filtering, aggregation, faceting are built into the visualization, rather than putting the burden on the user to massage the data into shape. Complex visualizations can be built with a few high level abstractions: DATA TRANS - FORM S SCALE S GUIDE S MARK S cf : Altair Talk by Brian Granger in PyData 2016 https://youtu.be/v5mrwq7yJc4

11. Anatomy of a plot X/Y channel Shape Channel Size Channel Color Channel

12. Features…

13. 1. Supports most plot types

14. 2. Trellis plots

15. 3. Layers Layer 1. Layer 2. Layer 3.

16. 4. Notebook and Consoles

17. 5. Built-in spark support Vegas .withDataFrame(myDataFrame) .encodeX(“population”) .encodeY(“age”) Mapped Columns Pass In DF.

18. 6. Visual statistics ● Advanced Binning ● Sorting ● Scaling ● Custom Transforms ● Time Series ● Aggregation ● Filtering ● Math functions(log, etc) ● Missing data support ● Descriptive Statistics

19. How It Works !

20. VEGA D3JS VEGA - LITE VEGA S SCALA DSL EMITS TYPE-CHECKED VEGA-LITE JSON VEGA-LITE CONVERTS INTERNALLY TO VEGA JSON SPEC VEGA TRANSLATES JSON TO D3JS CODE THAT CAN BE VERY VERBOSE A SCALA DSL FOR VEGA- LITE

Spark Summit EU talk by Sudeep Das and Aish Faenton

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Spark Summit EU talk by Sudeep Das and Aish Faenton

Similar to Spark Summit EU talk by Sudeep Das and Aish Faenton (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Spark Summit EU talk by Sudeep Das and Aish Faenton