Amit Kapoor
@amitkaps
Visualising
Big Data
Visualise a Million
Data Points
x <- rnorm(1000000, mean=0, sd=2)
y <- rnorm(1000000, mean=0, sd=2)
xy <- data.frame(x,y)
Same order as the
Number of Pixels
on my MacBook Air
1440 x 900
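To make the overplotting concrete, here is a minimal sketch that counts how many distinct pixels a million simulated points would touch. The plotting area (plot_w, plot_h) and the helper to_pixel are hypothetical values and names chosen for illustration, not taken from the slides.

```r
# Hypothetical plotting area in pixels (an assumption, not the slide's screen).
plot_w <- 1000
plot_h <- 800

set.seed(1)
x <- rnorm(1e6, mean = 0, sd = 2)
y <- rnorm(1e6, mean = 0, sd = 2)

# Map each value to a pixel index between 1 and n across the data range.
to_pixel <- function(v, n) {
  idx <- floor((v - min(v)) / (max(v) - min(v)) * n) + 1
  pmin(idx, n)  # the maximum value would otherwise land in pixel n + 1
}
px <- to_pixel(x, plot_w)
py <- to_pixel(y, plot_h)

# Count the distinct pixels that receive at least one point.
pixel_id <- (py - 1) * plot_w + px
used <- length(unique(pixel_id))

share_used <- used / (plot_w * plot_h)        # fraction of the area drawn on
points_per_used_pixel <- length(x) / used     # average overplotting depth
```

With points concentrated near the centre, many points share the same pixel, so a plain scatter plot hides the density that binning is meant to reveal.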
Data
Data Sample
Sampling can be
effective (when unusual
values are overweighted)
Requires multiple
plots or careful
tuning of parameters
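One way to read "overweighting unusual values" is sketched below: points far from the centre get a higher sampling probability. The score (squared distance from the mean) and the weight formula are assumptions made for illustration. They are exactly the kind of tuning parameter the slide warns about.

```r
set.seed(42)
n <- 1e5
xy <- data.frame(x = rnorm(n, mean = 0, sd = 2),
                 y = rnorm(n, mean = 0, sd = 2))

# Assumed "unusualness" score: squared distance from the centre of the data.
score <- (xy$x - mean(xy$x))^2 + (xy$y - mean(xy$y))^2

# Sampling weight grows with the score; the + 1 keeps central points eligible.
w <- score + 1

n_sample <- 2000
idx_weighted <- sample(n, size = n_sample, prob = w)  # overweights outliers
idx_uniform  <- sample(n, size = n_sample)            # plain random sample

# Count how many far-out points (beyond 3 sd on either axis) each sample keeps.
is_far <- abs(xy$x) > 6 | abs(xy$y) > 6
far_weighted <- sum(is_far[idx_weighted])
far_uniform  <- sum(is_far[idx_uniform])
```

The weighted sample keeps more of the extreme points, but it no longer reflects the true density, which is why several plots or careful tuning are needed.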
Data Sample
Model
Models are great as
they scale nicely.
But visualisation is
still required, as
“I don’t know what I
don’t know.”
Data Sample
Model
Binning
Binning can solve a
lot of these
challenges
“Bin - Summarize -
Smooth: A framework
for visualising big data” -
Hadley Wickham (2013)
“Visualising big data
is the process of creating
generalized histograms”
Approach
BIN : fixed-width bins, bin index = floor((x - origin) / width)
SUMMARIZE : summary stats = count, mean, stdev
SMOOTH : smoothing e.g. kernel mean, regression
VISUALISE : visualise using standard plots
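The four steps can be sketched in base R for a single variable. This is an illustrative sketch only, not the bigvis implementation: the bin width, the bandwidth h and the Gaussian kernel are all assumed values.

```r
set.seed(1)
x <- rnorm(1e6, mean = 0, sd = 2)

# BIN: fixed-width bins, index = floor((x - origin) / width).
origin <- min(x)
width  <- 0.1                               # assumed bin width
bin_id <- floor((x - origin) / width) + 1   # + 1 because R indexes from 1
n_bins <- max(bin_id)

# SUMMARIZE: count per bin (tabulate returns 0 for empty bins).
count  <- tabulate(bin_id, nbins = n_bins)
centre <- origin + (seq_len(n_bins) - 0.5) * width

# SMOOTH: kernel-weighted mean of the counts, using a Gaussian kernel.
h <- 0.3  # bandwidth in data units (assumed value)
smooth_count <- vapply(centre, function(c0) {
  k <- dnorm(centre, mean = c0, sd = h)
  sum(k * count) / sum(k)
}, numeric(1))

# VISUALISE: a standard plot of a few hundred bins, not a million points.
plot(centre, count, type = "h", col = "grey60")
lines(centre, smooth_count, col = "orange", lwd = 2)
```

After binning, every later step works on the bin summaries, so its cost depends on the number of bins and not on the number of raw observations.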
Bigvis Package in R
Aim: To plot 100 million points in under 5 seconds.
Approach:
- Plotting using standard R libraries
- Processing done in (fast) compiled C++ code, using
Rcpp package
- Outlier removal in big data
- Smoothing to highlight trends & suppress noise
Diamonds dataset
library(ggplot2)  # provides ggplot() and the diamonds data
ggplot(diamonds) + aes(carat, price) +
geom_point(alpha = 0.2, colour = "orange")
50k observations e.g. price, carat of diamonds
Condense (bin + summarise)
library(bigvis)
library(ggplot2)
Nbin <- 20
BinData <- with(diamonds, condense(
bin(carat, find_width(carat, Nbin)),
bin(price, find_width(price, Nbin))))
Plotting the Condense
p <- ggplot(BinData) + aes(carat,
price, fill=.count) + geom_tile()
20 bins per axis, summarised by count
Both Points & Condensed
q <- p + geom_point(data = diamonds,
aes(fill = NULL), alpha = 0.2, colour
= "orange")
20 bins per axis, summarised by count, with the raw points overlaid
Movies dataset
ggplot(movies) + aes(length, rating) +
geom_point(alpha = 0.2, colour = "orange")
130k observations e.g. length, rating of movies on IMDB
Let us see the outliers
  title                                            length  rating
1 Matrjoschka                                        5700     8.5
2 The Cure for Insomnia                              5220     5.9
3 The Longest Most Meaningless Movie in the World    2880     7.3
4 The Hazards of Helen                               1428     6.6
5 ****                                               1100     6.9
Condense (bin + summarise)
library(bigvis)
library(ggplot2)
Nbin <- 1e4
BinData <- with(movies, condense(
bin(length, find_width(length, Nbin)),
bin(rating, find_width(rating, Nbin))))
Condensed Plot
p <- ggplot(BinData) + aes(length,
rating, fill=.count) + geom_tile()
10,000 bins per axis, summarised by count
Remove Outliers
p %+% peel(BinData)
10,000 bins per axis, summarised by count, with the outlying 1% peeled off
Smoothing
smoothBinData <- smooth(peel(BinData), h = c(20, 1))
autoplot(smoothBinData)
20 bins, summarised by count, with the outlying 1% peeled off, then smoothed
Big Data Visualisation
● Approach: Bin - Summarize - Smooth - Visualise
● “Interactively” plot nearly 100 million data points in memory for EDA in R
● Can be extended to in-database processing, e.g. for binning
● Can be parallelised, e.g. summarising by count or mean
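The last two points rest on bin counts being additive: chunks can be binned separately and the results summed. A minimal sketch under assumed values (the bin grid, the four chunks and the helper bin_counts are hypothetical):

```r
set.seed(1)
x <- rnorm(1e6, mean = 0, sd = 2)

# Shared bin definition: every chunk must use the same origin and width.
origin <- -12
width  <- 0.5
n_bins <- 48  # covers [-12, 12); values outside are clamped into the end bins

bin_counts <- function(v) {
  id <- floor((v - origin) / width) + 1
  id <- pmin(pmax(id, 1), n_bins)
  tabulate(id, nbins = n_bins)
}

# Split the data into 4 chunks, as separate workers or database shards might.
chunks <- split(x, rep(1:4, length.out = length(x)))

# Each chunk is summarised independently; lapply could be swapped for
# parallel::mclapply, or for a GROUP BY query run inside the database.
partial <- lapply(chunks, bin_counts)

# Counts are additive, so the partial results combine by simple summation.
combined <- Reduce(`+`, partial)
```

A mean combines the same way if each chunk returns a per-bin sum and count, and the division happens after the partial results are added.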
Amit Kapoor
@amitkaps
amitkaps.com
narrativeviz.com
Data
Visual
Story
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 

Visualising Big Data

  • 2. Visualise Million Data Points: x <- rnorm(1000000, mean=0, sd=2); y <- rnorm(1000000, mean=0, sd=2); xy <- data.frame(x,y). Same order as the number of pixels on my MacBook Air (1400 x 900). Data.
  • 3. Data, Sample: Sampling can be effective (with overweighting of unusual values), but requires multiple plots or careful tuning of parameters.
  • 4. Data, Sample, Model: Models are great as they scale nicely. But visualisation is still required, as “I don’t know what I don’t know.”
  • 5. Data, Sample, Model, Binning: Binning can solve a lot of these challenges. “Bin - Summarize - Smooth: A framework for visualising big data” - Hadley Wickham (2013)
  • 6.
  • 7. “Visualising big data is the process of creating generalized histograms”
  • 8. Approach: BIN: fixed size bins = (x-origin)/width; SUMMARIZE: summary stats = count, mean, stdev; SMOOTH: smoothing e.g. kernel mean, regression; VISUALISE: visualise using standard plots
  • 9. Bigvis Package in R. Aim: To plot 100 million points in under 5 seconds. Approach: - Plotting using standard R libraries - Processing done in (fast) compiled C++ code, using the Rcpp package - Outlier removal in big data - Smoothing to highlight trends & suppress noise
  • 10. Diamonds dataset: ggplot(diamonds) + aes(carat, price) + geom_point(alpha = 0.2, colour = "orange"). 50k observations, e.g. price, carat of diamonds
  • 11. Condense (bin + summarise): library(bigvis); library(ggplot2); Nbin <- 20; BinData <- with(diamonds, condense(bin(carat, find_width(carat, Nbin)), bin(price, find_width(price, Nbin))))
  • 12. Plotting the Condense: p <- ggplot(BinData) + aes(carat, price, fill=.count) + geom_tile(). Create bins = 20 and summarise using count
  • 13. Both Points & Condensed: q <- p + geom_point(data = diamonds, aes(fill = NULL), alpha = 0.2, colour = "orange"). Create bins = 20, summarise using count & add base data
  • 14. Movies dataset: ggplot(movies) + aes(length, rating) + geom_point(alpha = 0.2, colour = "orange"). 130k observations, e.g. length, rating of movies on IMDB
  • 15. Let us see the outliers (title, length, rating): 1. Matrjoschka, 5700, 8.5; 2. The Cure for Insomnia, 5220, 5.9; 3. The Longest Most Meaningless Movie in the World, 2880, 7.3; 4. The Hazards of Helen, 1428, 6.6; 5. ****, 1100, 6.9
  • 16. Condense (bin + summarise): library(bigvis); library(ggplot2); Nbin <- 1e4; BinData <- with(movies, condense(bin(length, find_width(length, Nbin)), bin(rating, find_width(rating, Nbin))))
  • 17. Condensed Plot: p <- ggplot(BinData) + aes(length, rating, fill=.count) + geom_tile(). Create bins = 10000 and summarise using count
  • 18. Remove Outliers: p %+% peel(BinData). Create bins = 10000, summarise count & peel 1% outliers
  • 19. Smoothing: smoothBinData <- smooth(peel(BinData), h = c(20, 1)); autoplot(smoothBinData). Create bins = 20, summarise count, peel 1% outliers & smooth
  • 20. Big Data Visualisation ● Approach: Bin - Summarize - Smooth - Visualise ● “Interactively” plot nearly 100 million data points in-memory for EDA in R ● Can be extended to in-database, e.g. for binning ● Can be parallelised, e.g. summarise on count, mean
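
The bin and summarise steps from the approach slide can be sketched in base R, without bigvis, to show what the condensed data looks like. This is only an illustrative sketch under stated assumptions: bin_index() and condense_counts() are hypothetical helper names (not bigvis functions), the bin origin is assumed to be the minimum of each variable, and a sample of 100,000 points is used instead of the million in the slides to keep it quick.

```r
# Illustrative sketch only: a base R stand-in for the bin + summarise steps.
# bin_index() and condense_counts() are hypothetical helpers, not bigvis functions.
set.seed(42)
n <- 1e5  # smaller than the one million points in the slides
x <- rnorm(n, mean = 0, sd = 2)
y <- rnorm(n, mean = 0, sd = 2)

# BIN: fixed-width bin index, floor((v - origin) / width), with origin = min(v)
# (assumed). The maximum value is clamped into the last bin.
bin_index <- function(v, nbin) {
  origin <- min(v)
  width <- diff(range(v)) / nbin
  pmin(floor((v - origin) / width), nbin - 1)
}

# SUMMARIZE: count the points that fall in each (x, y) bin
condense_counts <- function(x, y, nbin = 20) {
  bx <- bin_index(x, nbin)
  by <- bin_index(y, nbin)
  counts <- as.data.frame(table(bx = bx, by = by),
                          responseName = "count",
                          stringsAsFactors = FALSE)
  counts <- counts[counts$count > 0, ]  # keep non-empty bins only
  counts$bx <- as.integer(counts$bx)
  counts$by <- as.integer(counts$by)
  counts
}

binned <- condense_counts(x, y, nbin = 20)
head(binned)
```

The result has at most 20 x 20 = 400 rows however many points go in, which is why the plotting step stays cheap. A data frame like this could then be passed to ggplot2 with geom_tile() and fill = count, in the same way the slides plot the output of condense().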