SlideShare a Scribd company logo
1 of 38
Download to read offline
Common Limitations with Big Data
in the Field
8/7/2014
Sean J. Taylor
Joint Statistical Meetings
Jsm big-data
Jsm big-data
Source: “big data” from Google
Jsm big-data
Mo’ Data,
Mo’ Problems
Why are the data so big?
Jsm big-data
data
“To consult the
statistician after an
experiment is finished is
often merely to ask him
to conduct a post-
mortem examination.
He can perhaps say what
the experiment died of.”
— Ronald Fisher
Data Alchemy?
Common Big Data Limitations:
!
1. Measuring the wrong thing
2. Biased samples
3. Dependence
4. Uncommon support
1. Measuring the wrong thing
The data you have a lot of doesn’t
measure what you want.
Jsm big-data
Jsm big-data
Common Pattern
▪ High volume of of cheap, easy to measure “surrogate” 

(e.g. steps, clicks)
▪ Surrogate is correlated with true measurement of interest
(e.g. overall health, purchase intention)
▪ key question: sign and magnitude of “interpretation bias”
▪ obvious strategy: design better measurement
▪ opportunity: joint, causal modeling of high-resolution
surrogate measurement with low-resolution “true”
measurement.
2. Biased samples
Size doesn’t guarantee your data is
representative of the population.
Jsm big-data
Bias?
World Cup Arrivals by Country
Jsm big-data
Is Liberal
Has
Daughter
Is Liberal
on
Facebook
Has
Daughter
on
Facebook
What we want
What we get
Common Pattern
▪ Your sample conditions on your own convenience, user
activity level, and/or technological sophistication of
participants.
▪ You’d like to extrapolate your statistic to a more general
population of interest.
▪ key question: (again) sign and magnitude of bias
▪ obvious strategy: weighting or regression
▪ opportunity: understand effect of selection on latent
variables on bias
3. Dependence
Effective size of data is small due
to repeated measurements of
units.
Bakshy et al. (2012)
“Social Influence in Social Advertising”
Person D Y
Evan 1 1
Evan 0 0
Ashley 0 1
Ashley 1 1
Ashley 0 1
Greg 1 0
Leena 1 0
Leena 1 1
Ema 0 0
Seamus 1 1
Person Ad D Y
Evan Sprite 1 1
Evan Coke 0 0
Ashley Pepsi 0 1
Ashley Pepsi 1 1
Ashley Coke 0 1
Greg Coke 1 0
Leena Pepsi 1 0
Leena Coke 1 1
Ema Sprite 0 0
Seamus Sprite 1 1
One-way Dependency Two-way Dependency
Yij = Dij + ↵i + j + ✏ijYij = Dij + ↵i + ✏ij
Bakshy and Eckles (2013)
“Uncertainty in Online Experiments with Dependent Data”
Common Pattern
▪ Your sample is large because it includes many repeated
measurements of the most active units.
▪ An inferential procedure that assumes independence will
be anti-conservative.
▪ key question: how to efficiently produce confidence
intervals with the right coverage.
▪ obvious strategy: multi-way bootstrap (Owen and Eckles)
▪ opportunity: scalable inference for crossed random effects
models
4. Uncommon support
Your data are sparse in part of the
distribution you care about.
Jsm big-data
1.0
1.5
2.0
2.5
3.0
−2 −1 0 1 2 3 4 5
Time Relative to Kickoff (hours)
Positive/NegativeEmotionWords
outcome
loss
win
“The Emotional Highs and Lows of the NFL Season”
Facebook Data Science Blog
Which check-ins are useful?
▪ We can only study people who post enough text before
and after their check-ins to yield reliable measurement.
What is the control check-in?
▪ Ideally we’d do a within-subjects design. If so we’re stuck
with only other check-ins (with enough status updates)
from the same people.
▪ National Park visit is likely a vacation, so we need to find
other vacation check-ins. Similar day, time, and distance.



Only a small fraction of the “treatment” cases yield
reliable measurement plus control cases.
Common Pattern
▪ Your sample is large but observations satisfying some set
of qualifying criteria (usually matching) are rare.
▪ Even with strong assumptions, hard to measure precise
differences due to lack of similar cases.
▪ key question: how to use as many of the observations as
possible without introducing bias.
▪ obvious strategy: randomized experiments
▪ opportunity: more efficient causal inference techniques
that work at scale (high dimensional regression/matching)
>
Conclusion 1:
Making your own quality data is
better than being a data alchemist.
Conclusion 2:
Great research opportunities in
addressing limitations of
“convenience” big data.
!
(or how I stopped worrying and
learned to be a data alchemist)
Jsm big-data
Sustaining versus Disruptive Innovations
Sustaining innovations
▪ improve product performance
Disruptive innovations
▪ Result in worse performance (short-term)
▪ Eventually surpass sustaining technologies in satisfying
market demand with lower costs
▪ cheaper, simpler, smaller, and frequently more
convenient to use
Innovator’s Dilemma
sjt@fb.com
http://seanjtaylor.com

More Related Content

What's hot

Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningIntroduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningNik Spirin
 
Applied Machine Learning for the IoT - Data Science Pop-up Seattle
Applied Machine Learning for the IoT - Data Science Pop-up SeattleApplied Machine Learning for the IoT - Data Science Pop-up Seattle
Applied Machine Learning for the IoT - Data Science Pop-up SeattleDomino Data Lab
 
Correctness in Data Science - Data Science Pop-up Seattle
Correctness in Data Science - Data Science Pop-up SeattleCorrectness in Data Science - Data Science Pop-up Seattle
Correctness in Data Science - Data Science Pop-up SeattleDomino Data Lab
 
IE_expressyourself_EssayH
IE_expressyourself_EssayHIE_expressyourself_EssayH
IE_expressyourself_EssayHjk6653284
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big DataRevolution Analytics
 
DoWhy Python library for causal inference: An End-to-End tool
DoWhy Python library for causal inference: An End-to-End toolDoWhy Python library for causal inference: An End-to-End tool
DoWhy Python library for causal inference: An End-to-End toolAmit Sharma
 
Web science - How is it different?
Web science - How is it different?Web science - How is it different?
Web science - How is it different?Daniel Tunkelang
 
Trends on Pinterest
Trends on PinterestTrends on Pinterest
Trends on PinterestJune Andrews
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Matthew Lease
 
Max Shron, Thinking with Data at the NYC Data Science Meetup
Max Shron, Thinking with Data at the NYC Data Science MeetupMax Shron, Thinking with Data at the NYC Data Science Meetup
Max Shron, Thinking with Data at the NYC Data Science Meetupmortardata
 
2018 02 converged it
2018 02 converged it2018 02 converged it
2018 02 converged itChris Dwan
 
My Three Ex’s: A Data Science Approach for Applied Machine Learning
My Three Ex’s: A Data Science Approach for Applied Machine LearningMy Three Ex’s: A Data Science Approach for Applied Machine Learning
My Three Ex’s: A Data Science Approach for Applied Machine LearningDaniel Tunkelang
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace Mohamadreza Mohtat
 
The Top 4 Ways On How Neuro-Marketing Influences The Online Dating Arena
The Top 4 Ways On How Neuro-Marketing Influences The Online Dating ArenaThe Top 4 Ways On How Neuro-Marketing Influences The Online Dating Arena
The Top 4 Ways On How Neuro-Marketing Influences The Online Dating ArenaStoic Advantage, LLC.
 
What is the story with agile data keynote agile 2018 (Magennis)
What is the story with agile data keynote   agile 2018 (Magennis)What is the story with agile data keynote   agile 2018 (Magennis)
What is the story with agile data keynote agile 2018 (Magennis)Troy Magennis
 
Conversion Hotel 2018 Keynote: Lizzie Eardley
Conversion Hotel 2018 Keynote: Lizzie EardleyConversion Hotel 2018 Keynote: Lizzie Eardley
Conversion Hotel 2018 Keynote: Lizzie EardleyWebanalisten .nl
 
Uncertainty Quantification in Complex Physical Systems. (An Inroduction)
Uncertainty Quantification in Complex Physical Systems. (An Inroduction)Uncertainty Quantification in Complex Physical Systems. (An Inroduction)
Uncertainty Quantification in Complex Physical Systems. (An Inroduction)Ogechi Onuoha
 
Predicting the Future With Microsoft Bing
Predicting the Future With Microsoft BingPredicting the Future With Microsoft Bing
Predicting the Future With Microsoft BingCybera Inc.
 

What's hot (20)

Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningIntroduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
 
Applied Machine Learning for the IoT - Data Science Pop-up Seattle
Applied Machine Learning for the IoT - Data Science Pop-up SeattleApplied Machine Learning for the IoT - Data Science Pop-up Seattle
Applied Machine Learning for the IoT - Data Science Pop-up Seattle
 
Correctness in Data Science - Data Science Pop-up Seattle
Correctness in Data Science - Data Science Pop-up SeattleCorrectness in Data Science - Data Science Pop-up Seattle
Correctness in Data Science - Data Science Pop-up Seattle
 
IE_expressyourself_EssayH
IE_expressyourself_EssayHIE_expressyourself_EssayH
IE_expressyourself_EssayH
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big Data
 
DoWhy Python library for causal inference: An End-to-End tool
DoWhy Python library for causal inference: An End-to-End toolDoWhy Python library for causal inference: An End-to-End tool
DoWhy Python library for causal inference: An End-to-End tool
 
Web science - How is it different?
Web science - How is it different?Web science - How is it different?
Web science - How is it different?
 
Trends on Pinterest
Trends on PinterestTrends on Pinterest
Trends on Pinterest
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
 
Max Shron, Thinking with Data at the NYC Data Science Meetup
Max Shron, Thinking with Data at the NYC Data Science MeetupMax Shron, Thinking with Data at the NYC Data Science Meetup
Max Shron, Thinking with Data at the NYC Data Science Meetup
 
2018 02 converged it
2018 02 converged it2018 02 converged it
2018 02 converged it
 
Probabilistic Reasoner
Probabilistic ReasonerProbabilistic Reasoner
Probabilistic Reasoner
 
My Three Ex’s: A Data Science Approach for Applied Machine Learning
My Three Ex’s: A Data Science Approach for Applied Machine LearningMy Three Ex’s: A Data Science Approach for Applied Machine Learning
My Three Ex’s: A Data Science Approach for Applied Machine Learning
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace
 
Math in data
Math in dataMath in data
Math in data
 
The Top 4 Ways On How Neuro-Marketing Influences The Online Dating Arena
The Top 4 Ways On How Neuro-Marketing Influences The Online Dating ArenaThe Top 4 Ways On How Neuro-Marketing Influences The Online Dating Arena
The Top 4 Ways On How Neuro-Marketing Influences The Online Dating Arena
 
What is the story with agile data keynote agile 2018 (Magennis)
What is the story with agile data keynote   agile 2018 (Magennis)What is the story with agile data keynote   agile 2018 (Magennis)
What is the story with agile data keynote agile 2018 (Magennis)
 
Conversion Hotel 2018 Keynote: Lizzie Eardley
Conversion Hotel 2018 Keynote: Lizzie EardleyConversion Hotel 2018 Keynote: Lizzie Eardley
Conversion Hotel 2018 Keynote: Lizzie Eardley
 
Uncertainty Quantification in Complex Physical Systems. (An Inroduction)
Uncertainty Quantification in Complex Physical Systems. (An Inroduction)Uncertainty Quantification in Complex Physical Systems. (An Inroduction)
Uncertainty Quantification in Complex Physical Systems. (An Inroduction)
 
Predicting the Future With Microsoft Bing
Predicting the Future With Microsoft BingPredicting the Future With Microsoft Bing
Predicting the Future With Microsoft Bing
 

Viewers also liked

Automatic Forecasting at Scale
Automatic Forecasting at ScaleAutomatic Forecasting at Scale
Automatic Forecasting at ScaleSean Taylor
 
Implementing and analyzing online experiments
Implementing and analyzing online experimentsImplementing and analyzing online experiments
Implementing and analyzing online experimentsSean Taylor
 
The Safety And Sanity Of Entrepreneurship
The Safety And Sanity Of EntrepreneurshipThe Safety And Sanity Of Entrepreneurship
The Safety And Sanity Of EntrepreneurshipTaylorPearson.me
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG DataPrasant Misra
 
Big Data Analytics: Challenge or Opportunity?
Big Data Analytics: Challenge or Opportunity?Big Data Analytics: Challenge or Opportunity?
Big Data Analytics: Challenge or Opportunity?NUS-ISS
 
Emerging Big Data & Analytics Trends with Hadoop
Emerging Big Data & Analytics Trends with Hadoop Emerging Big Data & Analytics Trends with Hadoop
Emerging Big Data & Analytics Trends with Hadoop InnoTech
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Zekeriya Besiroglu
 
Big Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBig Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBoston Consulting Group
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?Amazon Web Services
 
Introduction to Amazon Web Services
Introduction to Amazon Web ServicesIntroduction to Amazon Web Services
Introduction to Amazon Web ServicesAmazon Web Services
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 

Viewers also liked (19)

Automatic Forecasting at Scale
Automatic Forecasting at ScaleAutomatic Forecasting at Scale
Automatic Forecasting at Scale
 
Implementing and analyzing online experiments
Implementing and analyzing online experimentsImplementing and analyzing online experiments
Implementing and analyzing online experiments
 
The Safety And Sanity Of Entrepreneurship
The Safety And Sanity Of EntrepreneurshipThe Safety And Sanity Of Entrepreneurship
The Safety And Sanity Of Entrepreneurship
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
 
Big Data Analytics: Challenge or Opportunity?
Big Data Analytics: Challenge or Opportunity?Big Data Analytics: Challenge or Opportunity?
Big Data Analytics: Challenge or Opportunity?
 
Emerging Big Data & Analytics Trends with Hadoop
Emerging Big Data & Analytics Trends with Hadoop Emerging Big Data & Analytics Trends with Hadoop
Emerging Big Data & Analytics Trends with Hadoop
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
 
Big Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBig Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data sets
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?
 
RSlovakia #1 meetup
RSlovakia #1 meetupRSlovakia #1 meetup
RSlovakia #1 meetup
 
Econometrics 2017-graduate-3
Econometrics 2017-graduate-3Econometrics 2017-graduate-3
Econometrics 2017-graduate-3
 
Introduction to Amazon Web Services
Introduction to Amazon Web ServicesIntroduction to Amazon Web Services
Introduction to Amazon Web Services
 
Big Data
Big DataBig Data
Big Data
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 

Similar to Jsm big-data

sience 2.0 : an illustration of good research practices in a real study
sience 2.0 : an illustration of good research practices in a real studysience 2.0 : an illustration of good research practices in a real study
sience 2.0 : an illustration of good research practices in a real studywolf vanpaemel
 
Data science and good questions eric kostello
Data science and good questions eric kostelloData science and good questions eric kostello
Data science and good questions eric kostelloData Con LA
 
321423152 e-0016087606-session39134-201012122352 (1)
321423152 e-0016087606-session39134-201012122352 (1)321423152 e-0016087606-session39134-201012122352 (1)
321423152 e-0016087606-session39134-201012122352 (1)Iin Angriyani
 
Designing Indicators
Designing IndicatorsDesigning Indicators
Designing Indicatorsclearsateam
 
The Art and Science of Survey Research
The Art and Science of Survey ResearchThe Art and Science of Survey Research
The Art and Science of Survey ResearchSiobhan O'Dwyer
 
The Social Side of Behavioural Economics
The Social Side of Behavioural EconomicsThe Social Side of Behavioural Economics
The Social Side of Behavioural EconomicsDavid Perrott
 
School customer service presentation
School customer service presentationSchool customer service presentation
School customer service presentationsteve muzzy
 
Capturing Social and Clinical Knowledge for personalised care
Capturing Social and Clinical Knowledge for personalised careCapturing Social and Clinical Knowledge for personalised care
Capturing Social and Clinical Knowledge for personalised careVanessa Lopez
 
Audit and stat for medical professionals
Audit and stat for medical professionalsAudit and stat for medical professionals
Audit and stat for medical professionalsNadir Mehmood
 
How to Become a Data Science Company instead of a company with Data Scientist...
How to Become a Data Science Company instead of a company with Data Scientist...How to Become a Data Science Company instead of a company with Data Scientist...
How to Become a Data Science Company instead of a company with Data Scientist...Ruth Kearney
 
Book Summary : Everybody Lies
Book Summary : Everybody LiesBook Summary : Everybody Lies
Book Summary : Everybody LiesRahul Rishi
 
Data 101 Workshop
Data 101 WorkshopData 101 Workshop
Data 101 WorkshopWiLS
 
You Want Me to Measure What?
You Want Me to Measure What?You Want Me to Measure What?
You Want Me to Measure What?Dave Hogue
 
Developing core common outcomes for tropical peatland research and management
Developing core common outcomes for tropical peatland research and managementDeveloping core common outcomes for tropical peatland research and management
Developing core common outcomes for tropical peatland research and managementMark Reed
 
Big Data Privacy - Society Issues + Big Data
Big Data Privacy - Society Issues + Big DataBig Data Privacy - Society Issues + Big Data
Big Data Privacy - Society Issues + Big DataSylvia Ogweng
 
Data is love data viz best practices
Data is love   data viz best practicesData is love   data viz best practices
Data is love data viz best practicesGregory Nelson
 

Similar to Jsm big-data (20)

sience 2.0 : an illustration of good research practices in a real study
sience 2.0 : an illustration of good research practices in a real studysience 2.0 : an illustration of good research practices in a real study
sience 2.0 : an illustration of good research practices in a real study
 
Lecture 5
Lecture 5Lecture 5
Lecture 5
 
Lecture 5
Lecture 5Lecture 5
Lecture 5
 
Data science and good questions eric kostello
Data science and good questions eric kostelloData science and good questions eric kostello
Data science and good questions eric kostello
 
321423152 e-0016087606-session39134-201012122352 (1)
321423152 e-0016087606-session39134-201012122352 (1)321423152 e-0016087606-session39134-201012122352 (1)
321423152 e-0016087606-session39134-201012122352 (1)
 
Designing Indicators
Designing IndicatorsDesigning Indicators
Designing Indicators
 
The Art and Science of Survey Research
The Art and Science of Survey ResearchThe Art and Science of Survey Research
The Art and Science of Survey Research
 
The Social Side of Behavioural Economics
The Social Side of Behavioural EconomicsThe Social Side of Behavioural Economics
The Social Side of Behavioural Economics
 
School customer service presentation
School customer service presentationSchool customer service presentation
School customer service presentation
 
تحليل البيانات وتفسير المعطيات
تحليل البيانات وتفسير المعطياتتحليل البيانات وتفسير المعطيات
تحليل البيانات وتفسير المعطيات
 
Capturing Social and Clinical Knowledge for personalised care
Capturing Social and Clinical Knowledge for personalised careCapturing Social and Clinical Knowledge for personalised care
Capturing Social and Clinical Knowledge for personalised care
 
Audit and stat for medical professionals
Audit and stat for medical professionalsAudit and stat for medical professionals
Audit and stat for medical professionals
 
How to Become a Data Science Company instead of a company with Data Scientist...
How to Become a Data Science Company instead of a company with Data Scientist...How to Become a Data Science Company instead of a company with Data Scientist...
How to Become a Data Science Company instead of a company with Data Scientist...
 
Book Summary : Everybody Lies
Book Summary : Everybody LiesBook Summary : Everybody Lies
Book Summary : Everybody Lies
 
intro_big_data.pptx
intro_big_data.pptxintro_big_data.pptx
intro_big_data.pptx
 
Data 101 Workshop
Data 101 WorkshopData 101 Workshop
Data 101 Workshop
 
You Want Me to Measure What?
You Want Me to Measure What?You Want Me to Measure What?
You Want Me to Measure What?
 
Developing core common outcomes for tropical peatland research and management
Developing core common outcomes for tropical peatland research and managementDeveloping core common outcomes for tropical peatland research and management
Developing core common outcomes for tropical peatland research and management
 
Big Data Privacy - Society Issues + Big Data
Big Data Privacy - Society Issues + Big DataBig Data Privacy - Society Issues + Big Data
Big Data Privacy - Society Issues + Big Data
 
Data is love data viz best practices
Data is love   data viz best practicesData is love   data viz best practices
Data is love data viz best practices
 

Recently uploaded

Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 

Recently uploaded (17)

Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 

Jsm big-data

  • 1. Common Limitations with Big Data in the Field 8/7/2014 Sean J. Taylor Joint Statistical Meetings
  • 7. Why are the data so big?
  • 10. “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post- mortem examination. He can perhaps say what the experiment died of.” — Ronald Fisher
  • 12. Common Big Data Limitations: ! 1. Measuring the wrong thing 2. Biased samples 3. Dependence 4. Uncommon support
  • 13. 1. Measuring the wrong thing The data you have a lot of doesn’t measure what you want.
  • 16. Common Pattern ▪ High volume of of cheap, easy to measure “surrogate” 
 (e.g. steps, clicks) ▪ Surrogate is correlated with true measurement of interest (e.g. overall health, purchase intention) ▪ key question: sign and magnitude of “interpretation bias” ▪ obvious strategy: design better measurement ▪ opportunity: joint, causal modeling of high-resolution surrogate measurement with low-resolution “true” measurement.
  • 17. 2. Biased samples Size doesn’t guarantee your data is representative of the population.
  • 22. Common Pattern ▪ Your sample conditions on your own convenience, user activity level, and/or technological sophistication of participants. ▪ You’d like to extrapolate your statistic to a more general population of interest. ▪ key question: (again) sign and magnitude of bias ▪ obvious strategy: weighting or regression ▪ opportunity: understand effect of selection on latent variables on bias
  • 23. 3. Dependence Effective size of data is small due to repeated measurements of units.
  • 24. Bakshy et al. (2012) “Social Influence in Social Advertising”
  • 25. Person D Y Evan 1 1 Evan 0 0 Ashley 0 1 Ashley 1 1 Ashley 0 1 Greg 1 0 Leena 1 0 Leena 1 1 Ema 0 0 Seamus 1 1 Person Ad D Y Evan Sprite 1 1 Evan Coke 0 0 Ashley Pepsi 0 1 Ashley Pepsi 1 1 Ashley Coke 0 1 Greg Coke 1 0 Leena Pepsi 1 0 Leena Coke 1 1 Ema Sprite 0 0 Seamus Sprite 1 1 One-way Dependency Two-way Dependency Yij = Dij + ↵i + j + ✏ijYij = Dij + ↵i + ✏ij
  • 26. Bakshy and Eckles (2013) “Uncertainty in Online Experiments with Dependent Data”
  • 27. Common Pattern ▪ Your sample is large because it includes many repeated measurements of the most active units. ▪ An inferential procedure that assumes independence will be anti-conservative. ▪ key question: how to efficiently produce confidence intervals with the right coverage. ▪ obvious strategy: multi-way bootstrap (Owen and Eckles) ▪ opportunity: scalable inference for crossed random effects models
  • 28. 4. Uncommon support Your data are sparse in part of the distribution you care about.
  • 30. 1.0 1.5 2.0 2.5 3.0 −2 −1 0 1 2 3 4 5 Time Relative to Kickoff (hours) Positive/NegativeEmotionWords outcome loss win “The Emotional Highs and Lows of the NFL Season” Facebook Data Science Blog
  • 31. Which check-ins are useful? ▪ We can only study people who post enough text before and after their check-ins to yield reliable measurement. What is the control check-in? ▪ Ideally we’d do a within-subjects design. If so we’re stuck with only other check-ins (with enough status updates) from the same people. ▪ National Park visit is likely a vacation, so we need to find other vacation check-ins. Similar day, time, and distance.
 
 Only a small fraction of the “treatment” cases yield reliable measurement plus control cases.
  • 32. Common Pattern ▪ Your sample is large but observations satisfying some set of qualifying criteria (usually matching) are rare. ▪ Even with strong assumptions, hard to measure precise differences due to lack of similar cases. ▪ key question: how to use as many of the observations as possible without introducing bias. ▪ obvious strategy: randomized experiments ▪ opportunity: more efficient causal inference techniques that work at scale (high dimensional regression/matching)
  • 33. > Conclusion 1: Making your own quality data is better than being a data alchemist.
  • 34. Conclusion 2: Great research opportunities in addressing limitations of “convenience” big data. ! (or how I stopped worrying and learned to be a data alchemist)
  • 36. Sustaining versus Disruptive Innovations Sustaining innovations ▪ improve product performance Disruptive innovations ▪ Result in worse performance (short-term) ▪ Eventually surpass sustaining technologies in satisfying market demand with lower costs ▪ cheaper, simpler, smaller, and frequently more convenient to use