SampleClean: Bringing Data Cleaning into the BDAS Stack
Sanjay Krishnan and Daniel Haas
In collaboration with: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg

Who publishes more?

Microsoft Academic Search

Paper Id   Affiliation
16         Computer Science Division--University of California Berkeley CA
101        University of California at Berkeley
102        Department of Physics Stanford University California
116        Lawrence Berkeley National Labs <ref>California</ref>

Microsoft Academic Search
University of California at Berkeley
Computer Science Division
University of California at Berkeley
Department of Physics Stanford University California

• Data cleaning in BDAS.
  – Problem 1. Scale
  – Problem 2. Latency
• Sampling to cope with scale.
• Asynchrony to cope with latency.

Enter SampleClean

Now it’s your turn!
Be the crowd and help us decide

Dirty Data is Ubiquitous
Example: Missing, incomplete, inconsistent data

Data Cleaning is Hard
• Time consuming
• Costly
• Domain-specific

A New Data Cleaning Architecture
[Diagram labels: Data, Data Cleaning, Analytics]

Can it Scale?
People are slow and expensive
[Figure labels: Crowd, Machine Learning, Regex; axis: Time]

Insight 1: Asynchrony Hides Latency

Insight 2: Sampling Hides Scale
[Figure labels: Query Error, Data Error, Time; systems shown: BlinkDB, SampleClean]

SampleClean Data Flow
[Diagram labels: Dirty Data, Sampling, Dirty Sample, Data Cleaning, Asynchrony, Clean Sample, Query]

The SampleClean Architecture
[Diagram labels: Data Cleaning Library; Approximate Query Processing; Asynchronous Pipelines; Dirty Sample; Clean Sample. User actions: Declare Cleaning Operations; Issue Queries, Get Results]

Approximate Query Processing
• Estimate early results and bound with error bars
[Figure: Query Error vs. Time]

SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013

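As a rough illustration of "estimate early results and bound with error bars", the sketch below estimates an average from a uniform random sample and attaches an approximate 95% confidence interval. The helper name `approx_mean`, the synthetic data, and the normal-approximation interval (z = 1.96) are assumptions made for this example; they are not the SampleClean or BlinkDB API.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, rng, z=1.96):
    """Estimate mean(values) from a uniform sample; return (estimate, half_width).

    Hypothetical helper for illustration. half_width is a normal-approximation
    confidence half-width (z=1.96 corresponds to roughly 95%).
    """
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


rng = random.Random(0)
# Synthetic "table": 100,000 numeric values (made-up data).
data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

small = approx_mean(data, 100, rng)      # fast, wide error bar
large = approx_mean(data, 10_000, rng)   # slower, narrower error bar
print(f"n=100:   {small[0]:.2f} +/- {small[1]:.2f}")
print(f"n=10000: {large[0]:.2f} +/- {large[1]:.2f}")
```

The error bar shrinks roughly with the square root of the sample size, which is why a small sample can already give a usable early answer.
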
Crowds and Machines Work Together
• Extensible library of data cleaning tools
• Tools are:
  – Automated
  – Human-powered
  – Hybrid
[Figure labels: Crowd, Machine Learning, Regex; axis: Time]

Active Learning and Crowds
• Choose informative training points

Not informative:
  Are these the same?
  Stanford Department of IEOR
  UC Berkeley Stats
  ( ) Yes   ( ) No

Informative:
  Are these the same?
  Department of Mathematics Stanford University
  University of California Berkeley Department of Mathematics
  ( ) Yes   ( ) No

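One way to picture "choose informative training points" is uncertainty sampling: ask the crowd about the pairs the current model is least sure of. The sketch below is a simplification under stated assumptions. It uses token-set Jaccard similarity as a stand-in for a trained model's match probability, and the names `jaccard` and `most_informative` are hypothetical. A real active learner would use a classifier's confidence instead.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def most_informative(pairs, k=1):
    """Return the k pairs whose score is closest to 0.5 (the most uncertain)."""
    return sorted(pairs, key=lambda p: abs(jaccard(*p) - 0.5))[:k]


# The two candidate questions shown on the slide.
pairs = [
    ("Stanford Department of IEOR", "UC Berkeley Stats"),
    ("Department of Mathematics Stanford University",
     "University of California Berkeley Department of Mathematics"),
]

for a, b in pairs:
    print(f"{jaccard(a, b):.2f}  {a!r} vs {b!r}")

to_ask = most_informative(pairs, k=1)
print("send to crowd:", to_ask[0])
```

Under this stand-in score, the first pair shares no tokens, so it is an easy "No" and teaches the model little. The second pair overlaps heavily yet names different universities, so a human label there is worth more.
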
Putting it all together: Asynchronous Pipelines
• Users group data cleaning operations into pipelines

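A minimal sketch of what grouping cleaning operations into a pipeline could look like. The `Pipeline` class and the three cleaning functions are hypothetical names invented for this example, not the SampleClean programming interface, and the alias table is a toy. The input rows are the affiliation strings from the Microsoft Academic Search slide.

```python
import re


def strip_markup(s):
    """Drop tag-like residue such as <ref> and </ref>, keeping the inner text."""
    return re.sub(r"</?\w+>", " ", s)


def normalize_whitespace(s):
    """Treat '--' as a separator and collapse runs of whitespace."""
    return " ".join(s.replace("--", " ").split())


def canonicalize(s):
    """Map a known spelling variant to one canonical name (toy lookup table)."""
    aliases = {
        "University of California Berkeley": "University of California at Berkeley",
    }
    for variant, canonical in aliases.items():
        s = s.replace(variant, canonical)
    return s


class Pipeline:
    """Applies a fixed sequence of cleaning operations to every record."""

    def __init__(self, *ops):
        self.ops = ops

    def run(self, records):
        for op in self.ops:
            records = [op(r) for r in records]
        return records


dirty_sample = [
    "Computer Science Division--University of California Berkeley CA",
    "University of California at Berkeley",
    "Lawrence Berkeley National Labs <ref>California</ref>",
]

pipeline = Pipeline(strip_markup, normalize_whitespace, canonicalize)
clean_sample = pipeline.run(dirty_sample)
for row in clean_sample:
    print(row)
```

Declaring the operations up front, rather than calling them ad hoc, is what would let a system reorder, batch, or run them in the background.
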
Great, Now What?
• Prototype implementation complete!
• Significant research challenges remain:
  – Crowd worker performance and quality
  – Pipeline semantics and optimization
  – Programming model and interface
• Open source release targeted for next year

Summary
• Data cleaning is slow, costly, and domain-specific
• SampleClean brings data cleaning into the BDAS stack
• SampleClean uses asynchrony to hide latency, and sampling to hide scale
• SampleClean combines algorithms, machines, and people, all in one system

Asynchrony in Spark
• The Spark abstraction: blocking BSP (bulk synchronous parallel)
• So how do we achieve asynchrony?
  – Multithreaded master
  – Intermediate results materialized in Hive
  – Standalone Finagle HTTP server for crowd work

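The general pattern behind these bullets (a background worker publishes cleaned rows while queries read whatever has been materialized so far) can be sketched with plain Python threads. This is an analogy only: it does not use Spark, Hive, or Finagle, and the `CleanSample` class and `clean_in_background` function are hypothetical names for this example.

```python
import threading


class CleanSample:
    """Shared store of cleaned rows (stand-in for materialized intermediate results)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = []

    def append(self, row):
        with self._lock:
            self._rows.append(row)

    def count_where(self, predicate):
        """COUNT-style query over the rows cleaned so far: (matches, rows_cleaned)."""
        with self._lock:
            return sum(1 for r in self._rows if predicate(r)), len(self._rows)


def clean_in_background(dirty_rows, clean_fn, store):
    """Start a worker thread that cleans rows one by one and publishes each result."""
    def work():
        for row in dirty_rows:
            store.append(clean_fn(row))

    worker = threading.Thread(target=work)
    worker.start()
    return worker


dirty = ["UC Berkeley", "Stanford University", "uc berkeley ", "Univ. of California Berkeley"]
store = CleanSample()
worker = clean_in_background(dirty, lambda s: s.strip().lower(), store)

# The analyst can query at any time; the answer covers the rows cleaned so far.
early = store.count_where(lambda r: "berkeley" in r)
worker.join()
final = store.count_where(lambda r: "berkeley" in r)
print("early (matches, rows cleaned):", early)
print("final (matches, rows cleaned):", final)
```

The early answer depends on thread timing and may cover anywhere from zero to all four rows; only the final answer is deterministic.
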

Editor's Notes

  1. Start with Berkeley vs. Stanford, not the dataset
  2. Talk more about the dataset/problem/query before jumping into the sources of error
  3. Do *not* say ‘algorithms only go so far’!
  4. …and can’t be ignored! Analytics on dirty data can lead to incorrect decision-making.
  5. Asynchrony: We allow data cleaning to proceed in the background while analysts make use of the already cleaned data. Approximation: approximate results are often sufficient, especially for early data analysis tasks such as exploratory data analysis. We leverage sampling / machine learning to provide approximations quickly, then improve our answers as more of the data is cleaned.
  6. Asynchrony: We allow data cleaning to proceed in the background while analysts make use of the already cleaned data. You saw this in the demo just now: the dashboard issued queries in real time as the data updated.
  7. Approximation: approximate results are often sufficient, especially for early data analysis tasks such as exploratory data analysis. We leverage sampling / machine learning to provide approximations quickly, then improve our answers as more of the data is cleaned.
  8. Asynchrony: We allow data cleaning to proceed in the background while analysts make use of the already cleaned data. Approximation: approximate results are often sufficient, especially for early data analysis tasks such as exploratory data analysis. We leverage sampling / machine learning to provide approximations quickly, then improve our answers as more of the data is cleaned.
  9. Asynchrony: We allow data cleaning to proceed in the background while analysts make use of the already cleaned data. Approximation: approximate results are often sufficient, especially for early data analysis tasks such as exploratory data analysis. We leverage sampling / machine learning to provide approximations quickly, then improve our answers as more of the data is cleaned.
  10. Imagine a scenario where you have a large, dirty dataset, and cleaning all of it would cost you a lot of time and money. With our system, you don’t have to clean the entire dataset. You can clean just a small sample, and our system will use the results of that cleaning to understand the data error and return a better query result. Even better, our system can bound the query result and tell you the range in which it would fall if you cleaned the entire dataset. For more details about this sampling feature, see our latest SIGMOD paper. We follow the BlinkDB path and support only aggregate queries. We can extend this approach to more complex queries using the non-parametric bootstrap and diagnostics. In addition, we extend the BlinkDB approach to handle data error as well as query error.
  11. So in order to require as little work from humans as possible, we use humans to train models that can extrapolate human work to the rest of our data. In particular, we use a technique called active learning, where we have humans clean the most informative bits of data so we can train a better model faster.
  12. Point out that we have an extensible, general-purpose active learning library built on MLlib that can talk to multiple crowds.
  13. Talk about executing on a sample. Talk about arguments to the pipeline.
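
The idea in note 10 (clean only a sample, use it to measure the data error, then correct and bound the query result) can be sketched as below. This is one plausible reading under stated assumptions, not necessarily the estimator from the SIGMOD paper. The function `corrected_mean`, the synthetic error model, and the normal-approximation bound are all invented for illustration.

```python
import math
import random
import statistics


def corrected_mean(dirty, clean_fn, sample_size, rng, z=1.96):
    """Correct the dirty mean using errors measured on a cleaned sample.

    Returns (estimate, half_width). clean_fn stands in for an expensive
    cleaning step (e.g. crowd work), so it is applied to the sample only.
    """
    idx = rng.sample(range(len(dirty)), sample_size)
    diffs = [dirty[i] - clean_fn(dirty[i]) for i in idx]  # per-row data error
    estimate = statistics.fmean(dirty) - statistics.fmean(diffs)
    half_width = z * statistics.stdev(diffs) / math.sqrt(sample_size)
    return estimate, half_width


rng = random.Random(1)
truth = [rng.uniform(0, 100) for _ in range(50_000)]
# Synthetic error: about 20% of rows were recorded 1000 too high.
dirty = [v + 1000 if rng.random() < 0.2 else v for v in truth]


def clean(v):
    """Toy cleaner: undo the known +1000 shift."""
    return v - 1000 if v >= 1000 else v


est, hw = corrected_mean(dirty, clean, 2_000, rng)
print(f"dirty mean:     {statistics.fmean(dirty):.1f}")
print(f"corrected mean: {est:.1f} +/- {hw:.1f}")
print(f"true mean:      {statistics.fmean(truth):.1f}")
```

Only 2,000 of the 50,000 rows pass through the cleaner, yet the corrected answer lands far closer to the true mean than the dirty one, with a bound on the remaining uncertainty.
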