Exploratory Data Analysis
in Spark with Jupyter
https://github.com/phatak-dev/Statistical-Data-Exploration-Using-Spark-2.0
● Madhukara Phatak
● Team Lead at Tellius
● Work in Hadoop, Spark, ML
and Scala
● www.madhukaraphatak.com
Agenda
● Introduction to EDA
● EDA on Big Data
● EDA with Notebooks
● Five Point Summary
● Pyspark and EDA
● Histograms
● Outlier detection
● Correlation
Introduction to EDA
What’s EDA
● Exploratory data analysis (EDA) is an approach to
analyzing data sets to summarize their main
characteristics, often with visual methods
● Uses statistical methods to analyse different aspects of
the data
● Puts a lot of emphasis on visualisation
● Some of the EDA techniques are
○ Histograms
○ Correlations etc.
Why EDA?
● EDA helps data scientists understand the distribution
of the data before it is fed to downstream
algorithms
● EDA also helps to understand the correlation between
the different variables gathered during data collection
● Visualising the data also helps us to see the different
patterns in the data, which can inform the later parts of
the analysis
● The interactivity of EDA helps in exploring various
assumptions
EDA in the Hadoop Era
EDA in the Hadoop Era
● Typical EDA is an interactive and highly
experimental process
● The first generation Hadoop systems were mostly built
for batch processing and did not offer many tools for
interactivity
● So data scientists typically took a sample of the
data and ran EDA using traditional tools like R / Python
etc.
Limitations of Sample EDA
● Running EDA on a sample requires sampling
techniques that produce data representative of the
distribution of the full data
● That is hard to achieve for multi-dimensional data, which
is most real-world data
● Samples sometimes cause issues for skewed distributions
Ex : Payment type in NYC taxi data
● So though sampling works in most cases, it is not
the most accurate approach
EDA in the Spark Era
Interactive Analysis in Spark
● Spark has been built for interactive data analysis from day one
● Below are some of the features that make it good for interactive
analysis
○ Interactive spark-shell
○ Local mode for low latency
○ Caching for speed-up
○ DataFrame abstraction to support structured data
analysis
○ Support for Python
EDA on Notebooks
● Spark shell is good for one-liners
● It’s not a great interface for writing long interactive
queries
● It also doesn’t support good visualisation options,
which are important for EDA
● So notebook systems are an alternative to the spark shell
which keep the interactivity of the shell and add other
advanced features
● So a notebook interface is good for EDA
Jupyter Notebook
Introduction to Notebook System
● A notebook is an interactive web interface primarily used
for exploratory programming
● Notebooks are spiritual successors to the interactive shells
found in languages like Python, Scala etc.
● Notebook systems typically support multiple language
backends using kernels or interpreters
● An interpreter is the language runtime which is responsible for
the actual interpretation of code
● Ex : IPython, Zeppelin, Jupyter
Introduction to Jupyter
● Jupyter is one of the notebook systems which evolved
from the IPython shell and notebook system
● Primarily built for Python based analysis
● Now supports multiple languages like Python, R, Scala
etc.
● Also has good support for big data frameworks like
Spark, Flink
● http://jupyter.org/
Five Point Summary
Five Point Summary
● The five-number summary is one of the basic data
exploration techniques; it shows how the values of
dataset columns are distributed.
● It calculates the values below for a column
○ Min - Minimum value of the column
○ First Quartile - The value below which 25% of the data falls
○ Median - Middle value
○ Third Quartile - The value below which 75% of the data falls
○ Max - Maximum value of the column
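
As a small illustration of the five values listed above, the sketch below computes them for an in-memory Python list using only the standard library. The function name `five_number_summary` and the sample data are made up for illustration. Quartile conventions differ between tools, so other libraries may report slightly different Q1 and Q3 values.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the whole population and
    # interpolates linearly between neighbouring points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


ages = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample column
print(five_number_summary(ages))
```

This only works when the whole column fits in memory, which is exactly the limitation the Spark based approach is meant to avoid.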
Five Point Summary in Spark
● In Spark, we can use the describe method on a DataFrame to
get this summary for a given column
● In our example, we'll be using life expectancy data and
generating the five point summary
● Ex : SummaryExample.scala
● From the results we can observe that
○ The quartiles and the median are missing
○ Spark gives stddev, which is not part of the original
definition
Approximate Quantiles
● Quantiles are costly to calculate on large data as they
require sorting and result in skewed computation
● So by default Spark skips them in the describe function
● Spark 2.1 introduced a new method, approxQuantile,
on the stat functions of the DataFrame
● This allows us to calculate the different quantiles
in reasonable time, with a threshold for accuracy
● Ex : SummaryExample.scala
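
The talk's own example is SummaryExample.scala, which is not reproduced here. The following is only a PySpark sketch of the same idea. It assumes a DataFrame with a numeric column named `life_exp` (a hypothetical name), and it asks `approxQuantile` for the probabilities 0.0 and 1.0 to obtain the minimum and maximum alongside the quartiles. The helper names and the `relative_error` default are illustrative choices, not taken from the talk.

```python
def five_point_summary(df, column, relative_error=0.01):
    """Approximate min, Q1, median, Q3 and max of one numeric column.

    `df` is expected to be a pyspark.sql.DataFrame. A smaller
    relative_error is more accurate but more expensive; 0.0 asks
    Spark for exact quantiles.
    """
    probabilities = [0.0, 0.25, 0.5, 0.75, 1.0]
    values = df.approxQuantile(column, probabilities, relative_error)
    return dict(zip(["min", "q1", "median", "q3", "max"], values))


def run_example():
    """Needs a local PySpark installation; not called automatically."""
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("five-point-summary")
             .getOrCreate())
    # Hypothetical stand-in for the life expectancy data.
    rows = [(float(v),) for v in range(50, 91)]
    df = spark.createDataFrame(rows, ["life_exp"])
    df.describe("life_exp").show()  # count, mean, stddev, min, max
    print(five_point_summary(df, "life_exp"))
    spark.stop()
```

Calling `run_example()` in a notebook cell shows the `describe` output followed by the approximate five point summary.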
Visualizing Five Point Summary
● In the earlier examples, we calculated the five point
summary
● By just looking at the numbers, it’s difficult to
understand how the data is distributed
● It’s always good to visualize the numbers to
understand the distribution
● A box plot is a good way to visualize these numbers
● But how do we visualize in Scala?
Scala and Visualisation Libraries
● Scala is often the language of choice for developing Spark
applications
● Scala gives rich language primitives to build robust,
scalable systems
● But when it comes to EDA, ecosystem support for
visualization and other tools is not great in Scala
● Even though there are efforts like plot.ly or Vegas, they
are not as mature as pyplot or similar libraries
● So Scala may not be a great language of choice for EDA
EDA and PySpark
PySpark
● PySpark is a Python interface to the Spark APIs
● With the DataFrame and Dataset APIs, performance is on par
with the Scala equivalent
● One of the advantages of PySpark over Scala is its
seamless ability to convert between Spark and pandas
DataFrames
● Converting to pandas lets us use the myriad Python
ecosystem tools for visualization
● But what about the memory limitations of pandas?
EDA with PySpark
● If we use a pandas DataFrame directly for EDA, we will be
limited by data size
● So the trick is to calculate all the values using the Spark
APIs and then convert only the result to pandas
● Then use visualization libraries like pyplot, seaborn etc. to
visualize the results in Jupyter
● This combination of PySpark and Python libraries enables us
to do interactive, high quality EDA on Spark
PySpark Boxplot
● In our example, we will first calculate the five point summary
using PySpark code
● Then convert the result to a pandas DataFrame to extract
the values
● Render the box plot using the matplotlib.pyplot library
● One of the challenges is that we need to draw using
precomputed results rather than the actual data itself
● This requires understanding the lower level API
● Ex : EDA on Life Expectancy Data
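
The notebook from the talk is not reproduced here. As a rough sketch of what drawing from precomputed results can look like, matplotlib's `Axes.bxp` accepts a list of dictionaries of box plot statistics instead of raw data. The summary dictionary keys (`min`, `q1`, `median`, `q3`, `max`), both helper names, and the choice to draw the whiskers at the minimum and maximum are assumptions made for this illustration.

```python
def to_bxp_stats(summary, label):
    """Map a five point summary dict onto the keys Axes.bxp expects."""
    return {
        "label": label,
        "whislo": summary["min"],   # whiskers at min/max: a choice made here
        "q1": summary["q1"],
        "med": summary["median"],
        "q3": summary["q3"],
        "whishi": summary["max"],
        "fliers": [],               # raw outlier points are not available
    }


def draw_box_plot(summary, label):
    """Needs matplotlib; not called automatically."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # bxp draws from precomputed statistics rather than raw values.
    ax.bxp([to_bxp_stats(summary, label)], showfliers=False)
    ax.set_ylabel(label)
    return fig
```

In a Jupyter cell, something like `draw_box_plot({"min": 50, "q1": 60, "median": 70, "q3": 80, "max": 90}, "life_exp")` should render the plot inline, with only five numbers ever leaving Spark.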
Outlier Detection
Outlier Detection using IQR
● One of the use cases for the five point summary is
finding outliers in data
● The idea is that values which lie significantly outside the
interquartile range (IQR) are typically flagged as outliers
● IQR = Q3 - Q1
● One common rule flags values that fall
outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR
● Ex : OutliersWithIQR.scala
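
The 1.5*IQR rule can be worked through in a few lines of plain Python. This is an illustrative sketch rather than the contents of OutliersWithIQR.scala, and the function names and sample values are invented for the example.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Keep only the values that fall strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# With Q1 = 10 and Q3 = 20 the IQR is 10, so the fences are -5 and 35.
print(iqr_fences(10, 20))
print(find_outliers([1, 12, 40, -6, 35], 10, 20))
```

On a Spark DataFrame the same two bounds could be passed to a `filter` on the column, so that only the outlying rows are brought back to the driver.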
Histogram
Histogram
● A histogram is an accurate representation of the
distribution of numerical data
● It is a kind of bar graph
● To construct a histogram, the first step is to "bin" the
range of values—that is, divide the entire range of
values into a series of intervals—and then count how
many values fall into each interval
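
The binning step described above can be sketched in plain Python. The version below assumes evenly spaced buckets between the minimum and maximum, with the maximum value counted in the last bucket. The function name and sample data are illustrative only.

```python
def bin_values(values, num_buckets):
    """Split [min, max] into equal-width buckets and count the values.

    Returns (boundaries, counts), where boundaries has one more entry
    than counts.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # Every value is identical, so a single bucket holds them all.
        return [lo, hi], [len(values)]
    width = (hi - lo) / num_buckets
    boundaries = [lo + i * width for i in range(num_buckets)] + [hi]
    counts = [0] * num_buckets
    for v in values:
        # The maximum would land one past the end, so clamp it.
        index = min(int((v - lo) / width), num_buckets - 1)
        counts[index] += 1
    return boundaries, counts


print(bin_values([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))
```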
Histogram API
● DataFrame doesn’t have a direct histogram method, but
RDD has one on DoubleRDD
● The histogram API takes the number of buckets and returns two
things
○ Bucket boundaries (the start value of each bucket, plus the end of the last one)
○ Number of elements in each bucket
● We can use the pyplot bar chart API to draw the histogram
using these results
● Ex : EDA on Life Expectancy Data
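
As a hedged sketch of how the pieces might fit together in PySpark, the code below calls `histogram` on an RDD of numbers and turns the returned boundaries and counts into arguments for a pyplot bar chart. The helper names are hypothetical, and the column is assumed to be numeric with no nulls.

```python
def bar_chart_args(boundaries, counts):
    """Turn (boundaries, counts) into left edges, widths and heights."""
    lefts = list(boundaries[:-1])
    widths = [hi - lo for lo, hi in zip(boundaries[:-1], boundaries[1:])]
    return lefts, widths, list(counts)


def draw_histogram(df, column, num_buckets=10):
    """Needs PySpark and matplotlib; not called automatically."""
    import matplotlib.pyplot as plt

    # Pull the single column out as an RDD of plain numbers.
    values = df.select(column).rdd.map(lambda row: row[0])
    boundaries, counts = values.histogram(num_buckets)
    lefts, widths, heights = bar_chart_args(boundaries, counts)

    fig, ax = plt.subplots()
    # align="edge" makes each bar start at its bucket's left boundary.
    ax.bar(lefts, heights, width=widths, align="edge")
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    return fig
```

Only the bucket boundaries and counts are collected to the driver, so the chart can be drawn regardless of how large the underlying column is.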
