Data Quality: principles, approaches, and best practices

Presentation given at Big Data Toronto, 2019

Published in: Data & Analytics

  1. 1. Data Quality: Principles, Approaches, and Best Practices Carl Anderson carl.anderson@weightwatchers.com WW – the new Weight Watchers
  2. 2. 1 in 3 business leaders frequently make decisions with data they don’t trust. Bad data costs the economy $100s BN / year. [IBM] [TDWI]
  3. 3. About Me: Data Science, Business Intelligence, Engineering, Data Strategy
  4. 4. Big data: ● Food ● Activity ● Exercises ● Challenges ● Social network ● Workshops ● Personal Coaches ● CRM ● Fulfillment ● Meal kits ● Supermarket foods ● E-commerce ● Cruises ...for 56 years
  5. 5. 2017: fill lake with data; provide analysts access. 2019: upstream control and governance.
  6. 6. In General, What Can Go Wrong? Pipeline stages: Data Entry, Transformation 1, Transformation 2. Failure modes: ● Inaccurate (GIGO) ● Missing ● Defaults ● Dropped records ● Truncation ● Encoding changes ● Data type change ● Shape change ● Dupes ● Stale ● 3rd party ● Disagree
  7. 7. Facets of Data Quality: ● Accurate: data entry issues, stale data, default dates... ● Coherent: referential integrity, connect the dots ● Complete: missing data, duplicates ● Consistent: same values across systems, e.g. same address ● Defined: data dictionaries, business glossary, provenance, schema ● Timely: latency ● Trust: analysts willing to use data (NPS)
  8. 8. Data Quality Scorecard: ● Accurate: % records quarantined, % records in range, % records matching ● Coherent: % records missing entity ID, % records missing foreign key ● Complete: % records dupes, % records missing, % records complete, % fields complete ● Consistent: % records consistent ● Defined: % tables defined, % fields defined, % dimensions defined, % measures defined ● Timely: mean time to arrival, 95th percentile time to arrival ● Volume: number of records ● Trust: NPS. “If you can't measure it, you can't improve it” - Peter Drucker
  9. 9. Facet: Accuracy. Source teams then: Publish Schema ● Data did not always match schema ● Hard to trust ● Hard to automate ● No accountability. Source teams now (WIP): Publish Schema ● Adhere to Schema ● Field Ranges. Data team superpowers: 1. Auto consumption 2. Auto checks 3. Quarantine 4. Reporting
  10. 10. Facet: Accuracy. Scorecard, Accurate: % records quarantined, % records in range, % records matching. Source teams then: Publish Schema ● Data did not always match schema ● Hard to trust ● Hard to automate ● No accountability. Source teams now (WIP): Publish Schema ● Adhere to Schema ● Field Ranges. Data team superpowers: 1. Auto consumption 2. Auto checks 3. Quarantine 4. Reporting
  11. 11. Facet: Defined ● Table-level data dictionaries ● Business-level data dictionary (Business Glossary) https://medium.com/@leapingllamas
  12. 12. Facet: Defined. Flow from master: ● Data catalog is master for table-level definitions and business glossary ● Mapping table from master to BI tool: here, Looker dimensions and measures ● Tool compares master to BI tool, updates/injects definitions, and creates a pull request ● Pull request is manually reviewed and merged ● Master definitions appear to users
  13. 13. Facet: Defined. Flow from master: ● Data catalog is master for table-level definitions and business glossary ● Mapping table from master to BI tool: here, Looker dimensions and measures ● Tool compares master to BI tool, updates/injects definitions, and creates a pull request ● Pull request is manually reviewed and merged ● Master definitions appear to users. Open sourcing: https://github.com/ww-tech/lookml-tools
  14. 14. Facet: Defined. Style Guide ● LookML linter ● Open sourcing: https://github.com/ww-tech/lookml-tools
  15. 15. Facet: Defined ● Defined: % tables defined, % fields defined ● + LookML updater and LookML linter ● Defined: % dimensions defined, % measures defined
  16. 16. Facet: Trust. Easy to lose trust. Hard to regain! In April 2019, we surveyed data-related NPS with analysts, data scientists, and some decision makers and execs. We asked: ● NPS data: would you recommend our data to a friend? ● NPS infrastructure: would you recommend our infrastructure (Looker, BigQuery etc) to a friend? ● NPS support: would you recommend CIE’s support to a friend? We will resurvey at the end of 2019. Trust: NPS
  17. 17. Data Quality Scorecard: ● 1 Accurate: % records quarantined, % records in range, % records matching ● 2 Coherent: % records missing entity ID, % records missing foreign key ● 3 Complete: % records dupes, % records missing, % records complete, % fields complete ● 4 Consistent: % records consistent ● 5 Defined: % tables defined, % fields defined, % dimensions defined, % measures defined ● 6 Timely: mean time to arrival, 95th percentile time to arrival ● 7 Volume: number of records ● 8 Trust: NPS. Sources: Reference Data, Server logs, Metadata, Schema, Data catalog + lookml-tools, Survey. “If you can't measure it, you can't improve it” - Peter Drucker
  18. 18. Integrate into normal workflows: our engineers work in Slack, so let them do data quality work there too
  19. 19. Integrate into team culture: Agile BI engineering team ● BI engineering teams set aside 10% of time for explicit data quality work ● Expect DQ dashboards for all new sources ● Weekly data quality meetings ● Now proactive, rather than reactive or retrospective
  20. 20. Data Quality is a Shared Responsibility (diagram labels): ● Adhere to Schema ● Value Ranges ● Automated consumption ● Automated checks ● DQ Dashboards ● Subscribe / Report ● Data dictionaries ● Data dictionaries + glossary ● Data Catalog ● Single Source of Truth ● docs ● schema ● Monitor / investigate
  21. 21. What Questions Do You Have For Me? Carl Anderson ● carl.anderson@weightwatchers.com ● @leapingllamas ● https://medium.com/ww-tech-blog. We are hiring: BI engineers, engineers, and data scientists for our Toronto office (a few blocks away). Find our booth in the recruiting hall.
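
The Accuracy facet (slides 9 and 10) describes source teams adhering to a published schema with field ranges, and the data team running automatic checks, quarantining bad records, and reporting on them. The slides do not show how this is implemented, so the following is only a minimal sketch of the idea. The `FieldSpec` class, the example schema, and the field names `member_id` and `weight_kg` are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical published-schema entry: a type plus an optional numeric range."""
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Illustrative schema only; field names and ranges are invented for this sketch.
SCHEMA = {
    "member_id": FieldSpec(str),
    "weight_kg": FieldSpec(float, min_value=20.0, max_value=400.0),
}


def violations(record: dict, schema: dict) -> list:
    """Return human-readable problems; an empty list means the record passes."""
    problems = []
    for name, spec in schema.items():
        value = record.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not isinstance(value, spec.dtype):
            problems.append(f"{name}: expected {spec.dtype.__name__}")
        elif spec.min_value is not None and value < spec.min_value:
            problems.append(f"{name}: below {spec.min_value}")
        elif spec.max_value is not None and value > spec.max_value:
            problems.append(f"{name}: above {spec.max_value}")
    # Fields that the schema does not declare also count as schema drift.
    for name in sorted(set(record) - set(schema)):
        problems.append(f"{name}: not in schema")
    return problems


def partition(records: list, schema: dict) -> tuple:
    """Split records into (accepted, quarantined); quarantined keeps the reasons."""
    accepted, quarantined = [], []
    for record in records:
        problems = violations(record, schema)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined


def pct_quarantined(accepted: list, quarantined: list) -> float:
    """Scorecard metric '% records quarantined' for one batch."""
    total = len(accepted) + len(quarantined)
    return 100.0 * len(quarantined) / total if total else 0.0
```

Keeping the reasons alongside each quarantined record is one way to support the "Reporting" and "Investigate" steps: the same list can feed a dashboard or a message back to the source team.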

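Several scorecard entries from slides 8 and 17 are simple ratios or summary statistics, and the sketch below shows one plausible way to compute a few of them: the Complete metrics, the two Timely metrics, and the Trust NPS. It is an illustration under stated assumptions, not the team's implementation. The function names and record layout are hypothetical, the 95th percentile uses the nearest-rank method, and the NPS uses the standard definition (percent of 9-10 scores minus percent of 0-6 scores), which the slides do not spell out.

```python
import math
from statistics import mean


def pct(part: int, whole: int) -> float:
    """Percentage helper that returns 0.0 for an empty denominator."""
    return 100.0 * part / whole if whole else 0.0


def completeness(records: list, fields: list, key: str) -> dict:
    """Complete facet: duplicates by key, fully populated records, populated fields."""
    seen, dupes = set(), 0
    for record in records:
        k = record.get(key)
        if k in seen:
            dupes += 1  # every repeat after the first occurrence counts as a dupe
        else:
            seen.add(k)
    complete = sum(all(r.get(f) is not None for f in fields) for r in records)
    filled = sum(r.get(f) is not None for r in records for f in fields)
    return {
        "pct_records_dupes": pct(dupes, len(records)),
        "pct_records_complete": pct(complete, len(records)),
        "pct_fields_complete": pct(filled, len(records) * len(fields)),
    }


def percentile_nearest_rank(values: list, p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def timeliness(arrival_delays: list) -> dict:
    """Timely facet: mean and 95th percentile time to arrival (same unit as input)."""
    return {
        "mean_time_to_arrival": mean(arrival_delays),
        "p95_time_to_arrival": percentile_nearest_rank(arrival_delays, 95),
    }


def nps(scores: list) -> float:
    """Standard Net Promoter Score on 0-10 answers: % promoters minus % detractors."""
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return pct(promoters, len(scores)) - pct(detractors, len(scores))
```

Reporting the 95th percentile next to the mean matters because a handful of very late loads can hide behind a healthy-looking average.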