SlideShare a Scribd company logo
1 of 28
Download to read offline
@joe_Caserta#DGIQ2015
The Fundamentals of Data Quality
Understanding, Planning and Achieving
Data Quality in Your Organization
Joe Caserta
@joe_Caserta#DGIQ2015
Launched Big Data practice
Co-author, with Ralph Kimball, The
Data Warehouse ETL Toolkit
Data Analysis, Data Warehousing and
Business Intelligence since 1996
Began consulting database programing
and data modeling 25+ years hands-on experience
building database solutions
Founded Caserta Concepts in NYC
Web log analytics solution published in
Intelligent Enterprise
Launched Data Science, Data
Interaction and Cloud practices Laser focus on extending Data
Analytics with Big Data solutions
1986
2004
1996
2009
2001
2013
2012
2014
Dedicated to Data Governance
Techniques on Big Data (Innovation)
Top 20 Big Data
Consulting - CIO Review
Top 20 Most Powerful
Big Data consulting firms
Launched Big Data Warehousing
(BDW) Meetup NYC: 2,000+ Members
2015 Awarded for getting data out
of SAP for data analytics
Established best practices for big data
ecosystem implementations
Joe Caserta Timeline
@joe_Caserta#DGIQ2015
Data Quality
• Foremost reason for data warehouse failure is lack of data
accuracy
Accurate data means:
Correct
Unambiguous
Consistent
Complete
• Every Data Management system needs a data quality sub-system to
some degree
@joe_Caserta#DGIQ2015
The Data Quality Pipeline
Extract Clean Conform Deliver
Extracted
data staged
to disk
Clean data
staged to
disk
Conformed
data staged
to disk
Cleansed
data ready
for delivery
Operations: Scheduling, Error Handling, Data Quality Assurance
• Extract. The raw data coming from source systems
• Clean. Data quality processing involves many discrete steps, including checking for valid
values, ensuring consistency, removing duplicates, and enforcement of complex business
rules
• Conform. Required whenever two or more data sources are merged in the data
warehouse.
• Deliver. The final step is physically structuring the data into a set of dimensional models
@joe_Caserta#DGIQ2015
• To trust your information a robust set of tools for continuous
monitoring is needed
• Accuracy and completeness of data must be ensured
• Any piece of information in the data ecosystem must have
monitoring:
• Basic Stats: source to target counts
• Error Events: did we trap any errors during processing
• Business Checks: is the metric “within expectations”, How
does it compare with an abridged alternate calculation.
Data Quality Monitoring
@joe_Caserta#DGIQ2015
• Every data element has a System-of-Record
• The System-of-record is the originating source of data
• Data may be copied, moved, manipulated, transformed, altered, cleansed,
or made corrupt throughout the enterprise
• If you don’t use the system-of-record data quality will be nearly impossible.
• The further downstream you go from the originating data source, you
increase the risk of corrupt data.
• Barring rare exceptions, maintain the practice of sourcing data only from the
system-of-record.
Determine the System of Record
@joe_Caserta#DGIQ2015
Cleaning Data from Multiple Sources
Merge lists
on multiple
attributes
Department 1
Customer List
Department 3
Customer List
Department 3
Customer List
Revised Master
Customer List
Retrieve/Ass
ign New
Master
Customer
Key
Remove
Duplicates
• Identify the source systems
• Understand the source
systems
• Create record matching logic
• Establish survivorship rules
• Establish non-key attribute
business rules
• Assign Surrogate Keys
• Load conformed dimension
@joe_Caserta#DGIQ2015
Be
Corrective
Be Fast
Be
Transparen
t
Be Thorough
Data Quality Priorities
• Be Thorough
• Be Fast
• Be Corrective
• Be Transparent
@joe_Caserta#DGIQ2015
Completeness Versus Speed
Data Quality
SpeedtoValue
Fast
Slow
Transparent Corrective
@joe_Caserta#DGIQ2015
Corrective Versus Transparent
• Corrective
– Hides operational
deficiencies
– ETL complex algorithms
– DW differs from OLTP
– Slows ETL Processes
• Transparent
– Highlight Issues
– Fast Delivery
– DW matches OLTP
– Forces source system
cleanup
@joe_Caserta#DGIQ2015
Data Quality Issues Policy
• Category A Issues must be addressed at the data source
• Category B Issues should be addressed at the data source even if there
might be creative ways of deducing or recreating the derelict information
• Category C Issues, for a host of reasons, are best addressed in the data-
quality ETL rather than at the source
• Category D Data-quality issues can only be pragmatically resolved in the
ETL system
@joe_Caserta#DGIQ2015
Data Quality Issues Bell Curve
MUST be addressed
at the SOURCE
BEST addressed
at the SOURCE
BEST
Addressed
In ETL
MUST be
Addressed
In ETL
Category A Category B Category C Category D
Political DMZ
ETL Focus is here
Universe of Known Data Quality Issues
@joe_Caserta#DGIQ2015
Types of Data Quality Enforcement
• Column Property Enforcement
• Structure Enforcement
• Data Enforcement
• Value Enforcement
@joe_Caserta#DGIQ2015
Column Property Enforcement
• Null values in required columns
• Numeric values that fall outside of range
• Columns whose lengths are unexpected
• Columns that contain data outside of allowed values
• Adherence to a required pattern
@joe_Caserta#DGIQ2015
Structure Enforcement
• Consistent Data Types
• Functional Dependencies
• Referential Integrity
• Hierarchical Relationships
• Domain Sensibility
@joe_Caserta#DGIQ2015
Data and Value Enforcement
• Business Rules
• Missing Data Values
• Incorrect Data Values
• Embedded Meanings in Data Values
• Domain Redundancy
@joe_Caserta#DGIQ2015
Data Quality Failure Options
• 1. Pass the record with no errors
• 2. Pass the record, flag offending column values
• 3. Reject the record
• 4. Stop the ETL job stream
• 5. Fix on the Fly
@joe_Caserta#DGIQ2015
Assessing Data Quality – It’s not as easy as it looks
Data Quality Violation Action
1. Incoming Employee has a termination date earlier than their hire date
2. Compensation fact has currency that does not exist in the currency
dimension
3. End Date is not a valid date
4. Bill Amount is 13,562,583.67 when bills usually don't exceed 1.3 million
5. The source for the region dimension contains a city 'New Yourk'
6. More than 90% of the prices are NULL while loading the Products
dimension
7. The customer key is not available during the sales detail fact table load
8. Column is not found while attempting to extract the status of an employee
9. A product with existing facts has been deleted from the source system
10. The description is empty for a new product in the Product dimension
@joe_Caserta#DGIQ2015
Tracking Data Quality Failures
• Error Event Star- Schema
– Enables trend analysis of errors and exceptions
• Audit Dimension
– Captures specific quality context of individual fact table
records
• Refer to The Data Warehouse ETL Toolkit pp.126-129 for
more information on tracking data quality errors
@joe_Caserta#DGIQ2015
Error Event Table Schema
• Each error instance of each data
quality check is captured
• Implemented as sub-system of
ETL
• Each fact stored unique identifier
of the defective source system
record
@joe_Caserta#DGIQ2015
Audit Dimension
• Fact table contains a foreign key to
audit key
• Dummy (OK) row for records with
no defects
• Audit dimensions can be unique to
each fact table
• Error Event Fact can be used to fill in
the measures of the audit dimension
@joe_Caserta#DGIQ2015
Data Quality Strategy
• 1. Perform Data Profiling
• 2. Document Data Defects
• 3. Determine Data Defect Responsibility
• 4. Define Data Quality Rules
• 5. Obtain Sign-off for Correction Logic
• 6. Integrate rules with Logical Data Mapping
@joe_Caserta#DGIQ2015
Enrollments
Claims
Finance
ETL
Horizontally Scalable Environment - Optimized for Analytics
NoSQL
Databases
ETL
Spark MapReduce Pig/Hive
N1 N2 N4N3 N5
Hadoop Distributed File System (HDFS)
Traditional
EDW
Others…
The Evolution of the Enterprise Data Hub
Big Data Lake
ETL
@joe_Caserta#DGIQ2015
What’s Old is New Again
 Before Data Warehousing DG/DQ
 Users trying to produce reports from raw source data
 No Data Conformance
 No Master Data Management
 No Data Quality processes
 No Trust: Two analysts were almost guaranteed to come up
with two different sets of numbers!
 Before Data Lake DG/DQ
 We can put “anything” in Hadoop
 We can analyze anything
 We’re scientists, we don’t need IT, we make the rules
 Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance
will create a mess
 Rule #2: Information harvested from an ungoverned systems will take us back to the old days:
No Trust = Not Actionable
@joe_Caserta#DGIQ2015
Big
Data
Warehouse
Data Science Workspace
Data Lake – Integrated Sandbox
Landing Area – Source Data in “Full Fidelity”
The Enterprise Data Pyramid
Metadata  Catalog
ILM  who has access,
how long do we
“manage it”
Raw machine data
collection, collect
everything
Data is ready to be turned into
information: organized, well
defined, complete.
Agile business insight through data-
munging, machine learning, blending
with external data, development of
to-be BDW facts
Metadata  Catalog
ILM  who has access, how long do we
“manage it”
Data Quality and Monitoring 
Monitoring of completeness of data
Metadata  Catalog
ILM  who has access, how long do we “manage it”
Data Quality and Monitoring  Monitoring of
completeness of data
 ETL cleans, conforms, consolidates, enriches each tier
 Only top tier of the pyramid is fully governed
Fully Data Governed ( trusted)
User community arbitrary queries and
reporting
ETL/DQ
ETL/DQ
ETL/DQ
ETL
@joe_Caserta#DGIQ2015
Recommended Reading
Ralph Kimball
The Data Warehouse Lifecycle
Toolkit, 2nd Edition
Jack E Olson
Data Quality, the Accuracy
Dimension
Ralph Kimball, Joe Caserta
The Data Warehouse ETL
Toolkit
@joe_Caserta#DGIQ2015
Formal DW & ETL Training in NYC, 2015
Join us for one or both training courses combining two unique
workshops from international data warehousing veterans.
Workshops:
Sept 21-22 (2 days), Agile Data Warehousing with Lawrence Corr
Sept 23-24 (2 days), ETL Architecture and Design with Joe Caserta
SAVE $300 BY REGISTERING BEFORE JUNE 30TH!
Thanks! We look forward to seeing you there.
@joe_Caserta#DGIQ2015
Thank You / Q&A
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta

More Related Content

What's hot

Reconciling your Enterprise Data Warehouse to Source Systems
Reconciling your Enterprise Data Warehouse to Source SystemsReconciling your Enterprise Data Warehouse to Source Systems
Reconciling your Enterprise Data Warehouse to Source SystemsMethod360
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratchdmurph4
 
( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slidesNicolas Sarramagna
 
Data Quality Testing Generic (http://www.geektester.blogspot.com/)
Data Quality Testing Generic (http://www.geektester.blogspot.com/)Data Quality Testing Generic (http://www.geektester.blogspot.com/)
Data Quality Testing Generic (http://www.geektester.blogspot.com/)raj.kamal13
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data LakeCaserta
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 
3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive Analytics3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive AnalyticsNandita Nityanandam
 
Data Quality in Data Warehouse and Business Intelligence Environments - Disc...
Data Quality in  Data Warehouse and Business Intelligence Environments - Disc...Data Quality in  Data Warehouse and Business Intelligence Environments - Disc...
Data Quality in Data Warehouse and Business Intelligence Environments - Disc...Alan D. Duncan
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report Tom Donoghue
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dmsumit621
 
Data quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeData quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeStefan Kühn
 
An introduction to data warehousing
An introduction to data warehousingAn introduction to data warehousing
An introduction to data warehousingShahed Khalili
 
Chapter 13 data warehousing
Chapter 13   data warehousingChapter 13   data warehousing
Chapter 13 data warehousingsumit621
 
Data warehousing and business intelligence project report
Data warehousing and business intelligence project reportData warehousing and business intelligence project report
Data warehousing and business intelligence project reportsonalighai
 
A Step-by-Step Guide to Metadata Management
A Step-by-Step Guide to Metadata ManagementA Step-by-Step Guide to Metadata Management
A Step-by-Step Guide to Metadata ManagementSaachiShankar
 
Тестирование данных с помощью Data Quality Services (MS SQL 12)
Тестирование данных с помощью Data Quality Services (MS SQL 12)Тестирование данных с помощью Data Quality Services (MS SQL 12)
Тестирование данных с помощью Data Quality Services (MS SQL 12)SQALab
 

What's hot (20)

Reconciling your Enterprise Data Warehouse to Source Systems
Reconciling your Enterprise Data Warehouse to Source SystemsReconciling your Enterprise Data Warehouse to Source Systems
Reconciling your Enterprise Data Warehouse to Source Systems
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratch
 
Data Quality Presentation
Data Quality PresentationData Quality Presentation
Data Quality Presentation
 
( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides
 
Data Quality Testing Generic (http://www.geektester.blogspot.com/)
Data Quality Testing Generic (http://www.geektester.blogspot.com/)Data Quality Testing Generic (http://www.geektester.blogspot.com/)
Data Quality Testing Generic (http://www.geektester.blogspot.com/)
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive Analytics3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive Analytics
 
Data Quality in Data Warehouse and Business Intelligence Environments - Disc...
Data Quality in  Data Warehouse and Business Intelligence Environments - Disc...Data Quality in  Data Warehouse and Business Intelligence Environments - Disc...
Data Quality in Data Warehouse and Business Intelligence Environments - Disc...
 
Data mining
Data miningData mining
Data mining
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
 
Data quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeData quality - The True Big Data Challenge
Data quality - The True Big Data Challenge
 
An introduction to data warehousing
An introduction to data warehousingAn introduction to data warehousing
An introduction to data warehousing
 
Chapter 13 data warehousing
Chapter 13   data warehousingChapter 13   data warehousing
Chapter 13 data warehousing
 
Data warehousing and business intelligence project report
Data warehousing and business intelligence project reportData warehousing and business intelligence project report
Data warehousing and business intelligence project report
 
Data analytics
Data analyticsData analytics
Data analytics
 
Part1
Part1Part1
Part1
 
A Step-by-Step Guide to Metadata Management
A Step-by-Step Guide to Metadata ManagementA Step-by-Step Guide to Metadata Management
A Step-by-Step Guide to Metadata Management
 
Тестирование данных с помощью Data Quality Services (MS SQL 12)
Тестирование данных с помощью Data Quality Services (MS SQL 12)Тестирование данных с помощью Data Quality Services (MS SQL 12)
Тестирование данных с помощью Data Quality Services (MS SQL 12)
 

Viewers also liked

Build_Buy_StreamAnalytix_WhitePaper
Build_Buy_StreamAnalytix_WhitePaperBuild_Buy_StreamAnalytix_WhitePaper
Build_Buy_StreamAnalytix_WhitePaperJane Roberts
 
1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...IBM
 
7+1 hiba, amit Te is elkövet(het)sz
7+1 hiba, amit Te is elkövet(het)sz7+1 hiba, amit Te is elkövet(het)sz
7+1 hiba, amit Te is elkövet(het)szCzímer Zoltán
 
Sumo Logic Quickstart - Nv 2016
Sumo Logic Quickstart - Nv 2016Sumo Logic Quickstart - Nv 2016
Sumo Logic Quickstart - Nv 2016Sumo Logic
 
Modernizing the Legacy - How Dish is Adapting its SOA Services for a Cloud Fi...
Modernizing the Legacy - How Dish is Adapting its SOA Services for a Cloud Fi...Modernizing the Legacy - How Dish is Adapting its SOA Services for a Cloud Fi...
Modernizing the Legacy - How Dish is Adapting its SOA Services for a Cloud Fi...VMware Tanzu
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraCaserta
 
NUON Rens Weijers
NUON Rens WeijersNUON Rens Weijers
NUON Rens WeijersBigDataExpo
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataTrieu Nguyen
 
從系統思考看 DevOps:以 microservices 為例 (DevOps: a system dynamics perspective)
從系統思考看 DevOps:以 microservices 為例 (DevOps: a system dynamics perspective)從系統思考看 DevOps:以 microservices 為例 (DevOps: a system dynamics perspective)
從系統思考看 DevOps:以 microservices 為例 (DevOps: a system dynamics perspective)William Yeh
 
Giovanni Lanzani GoDataDriven
Giovanni Lanzani GoDataDrivenGiovanni Lanzani GoDataDriven
Giovanni Lanzani GoDataDrivenBigDataExpo
 
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...EMC
 
Delivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea UrsanerDelivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea UrsanerData Con LA
 
DFW meetup Cognitive services - parashar - feb 22
DFW meetup Cognitive services -  parashar - feb 22DFW meetup Cognitive services -  parashar - feb 22
DFW meetup Cognitive services - parashar - feb 22Parashar Shah
 
Praktiline pilvekonverents - IT haldust hõlbustavad uuendused
Praktiline pilvekonverents - IT haldust hõlbustavad uuendusedPraktiline pilvekonverents - IT haldust hõlbustavad uuendused
Praktiline pilvekonverents - IT haldust hõlbustavad uuendusedPrimend
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendCaserta
 
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceHow Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceAmazon Web Services
 
Cyberbullying in the Middle Years
Cyberbullying in the Middle YearsCyberbullying in the Middle Years
Cyberbullying in the Middle Yearselketeaches
 

Viewers also liked (20)

Build_Buy_StreamAnalytix_WhitePaper
Build_Buy_StreamAnalytix_WhitePaperBuild_Buy_StreamAnalytix_WhitePaper
Build_Buy_StreamAnalytix_WhitePaper
 
1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...1524 how ibm's big data solution can help you gain insight into your data cen...
1524 how ibm's big data solution can help you gain insight into your data cen...
 
Cloud developer evolution
Cloud developer evolutionCloud developer evolution
Cloud developer evolution
 
7+1 hiba, amit Te is elkövet(het)sz
7+1 hiba, amit Te is elkövet(het)sz7+1 hiba, amit Te is elkövet(het)sz
7+1 hiba, amit Te is elkövet(het)sz
 
Sumo Logic Quickstart - Nv 2016
Sumo Logic Quickstart - Nv 2016Sumo Logic Quickstart - Nv 2016
Sumo Logic Quickstart - Nv 2016
 
Modernizing the Legacy - How Dish is Adapting its SOA Services for a Cloud Fi...
Modernizing the Legacy - How Dish is Adapting its SOA Services for a Cloud Fi...Modernizing the Legacy - How Dish is Adapting its SOA Services for a Cloud Fi...
Modernizing the Legacy - How Dish is Adapting its SOA Services for a Cloud Fi...
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
 
Azure Key Vault
Azure Key VaultAzure Key Vault
Azure Key Vault
 
NUON Rens Weijers
NUON Rens WeijersNUON Rens Weijers
NUON Rens Weijers
 
Pesla
PeslaPesla
Pesla
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
從系統思考看 DevOps:以 microservices 為例 (DevOps: a system dynamics perspective)
從系統思考看 DevOps:以 microservices 為例 (DevOps: a system dynamics perspective)從系統思考看 DevOps:以 microservices 為例 (DevOps: a system dynamics perspective)
從系統思考看 DevOps:以 microservices 為例 (DevOps: a system dynamics perspective)
 
Giovanni Lanzani GoDataDriven
Giovanni Lanzani GoDataDrivenGiovanni Lanzani GoDataDriven
Giovanni Lanzani GoDataDriven
 
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
Disruptive Data Science - How Data Science and Big Data are Transforming Busi...
 
Delivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea UrsanerDelivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea Ursaner
 
DFW meetup Cognitive services - parashar - feb 22
DFW meetup Cognitive services -  parashar - feb 22DFW meetup Cognitive services -  parashar - feb 22
DFW meetup Cognitive services - parashar - feb 22
 
Praktiline pilvekonverents - IT haldust hõlbustavad uuendused
Praktiline pilvekonverents - IT haldust hõlbustavad uuendusedPraktiline pilvekonverents - IT haldust hõlbustavad uuendused
Praktiline pilvekonverents - IT haldust hõlbustavad uuendused
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
 
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with DynatraceHow Verizon Innovates Through AI-Driven DevOps with Dynatrace
How Verizon Innovates Through AI-Driven DevOps with Dynatrace
 
Cyberbullying in the Middle Years
Cyberbullying in the Middle YearsCyberbullying in the Middle Years
Cyberbullying in the Middle Years
 

Similar to DGIQ 2015 The Fundamentals of Data Quality

What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It? Caserta
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation Caserta
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
When the business needs intelligence (15Oct2014)
When the business needs intelligence   (15Oct2014)When the business needs intelligence   (15Oct2014)
When the business needs intelligence (15Oct2014)Dipti Patil
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseCaserta
 
Big Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data LakeBig Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data LakeCaserta
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupCaserta
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentCaserta
 
Getting Data Quality Right
Getting Data Quality RightGetting Data Quality Right
Getting Data Quality RightDATAVERSITY
 
You Need a Data Catalog. Do You Know Why?
 You Need a Data Catalog. Do You Know Why? You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?Precisely
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Objective Benchmarking for Improved Analytics Health and Effectiveness
Objective Benchmarking for Improved Analytics Health and EffectivenessObjective Benchmarking for Improved Analytics Health and Effectiveness
Objective Benchmarking for Improved Analytics Health and EffectivenessPersonifyMarketing
 
Akili Data Integration using PPDM
Akili Data Integration using PPDMAkili Data Integration using PPDM
Akili Data Integration using PPDMrnaramore
 
Automate data warehouse etl testing and migration testing the agile way
Automate data warehouse etl testing and migration testing the agile wayAutomate data warehouse etl testing and migration testing the agile way
Automate data warehouse etl testing and migration testing the agile wayTorana, Inc.
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?RTTS
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?Precisely
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 

Similar to DGIQ 2015 The Fundamentals of Data Quality (20)

What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
When the business needs intelligence (15Oct2014)
When the business needs intelligence   (15Oct2014)When the business needs intelligence   (15Oct2014)
When the business needs intelligence (15Oct2014)
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
Big Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data LakeBig Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data Lake
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing Meetup
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business Environment
 
Getting Data Quality Right
Getting Data Quality RightGetting Data Quality Right
Getting Data Quality Right
 
You Need a Data Catalog. Do You Know Why?
 You Need a Data Catalog. Do You Know Why? You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Objective Benchmarking for Improved Analytics Health and Effectiveness
Objective Benchmarking for Improved Analytics Health and EffectivenessObjective Benchmarking for Improved Analytics Health and Effectiveness
Objective Benchmarking for Improved Analytics Health and Effectiveness
 
Akili Data Integration using PPDM
Akili Data Integration using PPDMAkili Data Integration using PPDM
Akili Data Integration using PPDM
 
Automate data warehouse etl testing and migration testing the agile way
Automate data warehouse etl testing and migration testing the agile wayAutomate data warehouse etl testing and migration testing the agile way
Automate data warehouse etl testing and migration testing the agile way
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 

More from Caserta

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingCaserta
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Caserta
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Caserta
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017Caserta
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Caserta
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteCaserta
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for EveryoneCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsCaserta
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 

More from Caserta (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 

Recently uploaded

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 

Recently uploaded (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 

DGIQ 2015 The Fundamentals of Data Quality

  • 1. @joe_Caserta#DGIQ2015 The Fundamentals of Data Quality Understanding, Planning and Achieving Data Quality in Your Organization Joe Caserta
  • 2. @joe_Caserta#DGIQ2015 Launched Big Data practice Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit Data Analysis, Data Warehousing and Business Intelligence since 1996 Began consulting database programing and data modeling 25+ years hands-on experience building database solutions Founded Caserta Concepts in NYC Web log analytics solution published in Intelligent Enterprise Launched Data Science, Data Interaction and Cloud practices Laser focus on extending Data Analytics with Big Data solutions 1986 2004 1996 2009 2001 2013 2012 2014 Dedicated to Data Governance Techniques on Big Data (Innovation) Top 20 Big Data Consulting - CIO Review Top 20 Most Powerful Big Data consulting firms Launched Big Data Warehousing (BDW) Meetup NYC: 2,000+ Members 2015 Awarded for getting data out of SAP for data analytics Established best practices for big data ecosystem implementations Joe Caserta Timeline
  • 3. @joe_Caserta#DGIQ2015 Data Quality • Foremost reason for data warehouse failure is lack of data accuracy Accurate data means: Correct Unambiguous Consistent Complete • Every Data Management system needs a data quality sub-system to some degree
  • 4. @joe_Caserta#DGIQ2015 The Data Quality Pipeline Extract Clean Conform Deliver Extracted data staged to disk Clean data staged to disk Conformed data staged to disk Cleansed data ready for delivery Operations: Scheduling, Error Handling, Data Quality Assurance • Extract. The raw data coming from source systems • Clean. Data quality processing involves many discrete steps, including checking for valid values, ensuring consistency, removing duplicates, and enforcement of complex business rules • Conform. Required whenever two or more data sources are merged in the data warehouse. • Deliver. The final step is physically structuring the data into a set of dimensional models
  • 5. @joe_Caserta#DGIQ2015 • To trust your information a robust set of tools for continuous monitoring is needed • Accuracy and completeness of data must be ensured • Any piece of information in the data ecosystem must have monitoring: • Basic Stats: source to target counts • Error Events: did we trap any errors during processing • Business Checks: is the metric “within expectations”, How does it compare with an abridged alternate calculation. Data Quality Monitoring
  • 6. @joe_Caserta#DGIQ2015 • Every data element has a System-of-Record • The System-of-record is the originating source of data • Data may be copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise • If you don’t use the system-of-record data quality will be nearly impossible. • The further downstream you go from the originating data source, you increase the risk of corrupt data. • Barring rare exceptions, maintain the practice of sourcing data only from the system-of-record. Determine the System of Record
  • 7. @joe_Caserta#DGIQ2015 Cleaning Data from Multiple Sources Merge lists on multiple attributes Department 1 Customer List Department 3 Customer List Department 3 Customer List Revised Master Customer List Retrieve/Ass ign New Master Customer Key Remove Duplicates • Identify the source systems • Understand the source systems • Create record matching logic • Establish survivorship rules • Establish non-key attribute business rules • Assign Surrogate Keys • Load conformed dimension
  • 8. @joe_Caserta#DGIQ2015 Be Corrective Be Fast Be Transparen t Be Thorough Data Quality Priorities • Be Thorough • Be Fast • Be Corrective • Be Transparent
  • 9. @joe_Caserta#DGIQ2015 Completeness Versus Speed Data Quality SpeedtoValue Fast Slow Transparent Corrective
  • 10. @joe_Caserta#DGIQ2015 Corrective Versus Transparent • Corrective – Hides operational deficiencies – ETL complex algorithms – DW differs from OLTP – Slows ETL Processes • Transparent – Highlight Issues – Fast Delivery – DW matches OLTP – Forces source system cleanup
  • 11. @joe_Caserta#DGIQ2015 Data Quality Issues Policy • Category A Issues must be addressed at the data source • Category B Issues should be addressed at the data source even if there might be creative ways of deducing or recreating the derelict information • Category C Issues, for a host of reasons, are best addressed in the data- quality ETL rather than at the source • Category D Data-quality issues can only be pragmatically resolved in the ETL system
  • 12. @joe_Caserta#DGIQ2015 Data Quality Issues Bell Curve MUST be addressed at the SOURCE BEST addressed at the SOURCE BEST Addressed In ETL MUST be Addressed In ETL Category A Category B Category C Category D Political DMZ ETL Focus is here Universe of Known Data Quality Issues
  • 13. @joe_Caserta#DGIQ2015 Types of Data Quality Enforcement • Column Property Enforcement • Structure Enforcement • Data Enforcement • Value Enforcement
  • 14. @joe_Caserta#DGIQ2015 Column Property Enforcement • Null values in required columns • Numeric values that fall outside of range • Columns whose lengths are unexpected • Columns that contain data outside of allowed values • Adherence to a required pattern
  • 15. @joe_Caserta#DGIQ2015 Structure Enforcement • Consistent Data Types • Functional Dependencies • Referential Integrity • Hierarchical Relationships • Domain Sensibility
  • 16. @joe_Caserta#DGIQ2015 Data and Value Enforcement • Business Rules • Missing Data Values • Incorrect Data Values • Embedded Meanings in Data Values • Domain Redundancy
  • 17. @joe_Caserta#DGIQ2015 Data Quality Failure Options • 1. Pass the record with no errors • 2. Pass the record, flag offending column values • 3. Reject the record • 4. Stop the ETL job stream • 5. Fix on the Fly
  • 18. @joe_Caserta#DGIQ2015 Assessing Data Quality – It’s not as easy as it looks Data Quality Violation Action 1. Incoming Employee has a termination date earlier than their hire date 2. Compensation fact has currency that does not exist in the currency dimension 3. End Date is not a valid date 4. Bill Amount is 13,562,583.67 when bills usually don't exceed 1.3 million 5. The source for the region dimension contains a city 'New Yourk' 6. More than 90% of the prices are NULL while loading the Products dimension 7. The customer key is not available during the sales detail fact table load 8. Column is not found while attempting to extract the status of an employee 9. A product with existing facts has been deleted from the source system 10. The description is empty for a new product in the Product dimension
  • 19. @joe_Caserta#DGIQ2015 Tracking Data Quality Failures • Error Event Star- Schema – Enables trend analysis of errors and exceptions • Audit Dimension – Captures specific quality context of individual fact table records • Refer to The Data Warehouse ETL Toolkit pp.126-129 for more information on tracking data quality errors
  • 20. @joe_Caserta#DGIQ2015 Error Event Table Schema • Each error instance of each data quality check is captured • Implemented as sub-system of ETL • Each fact stored unique identifier of the defective source system record
  • 21. @joe_Caserta#DGIQ2015 Audit Dimension • Fact table contains a foreign key to audit key • Dummy (OK) row for records with no defects • Audit dimensions can be unique to each fact table • Error Event Fact can be used to fill in the measures of the audit dimension
  • 22. @joe_Caserta#DGIQ2015 Data Quality Strategy • 1. Perform Data Profiling • 2. Document Data Defects • 3. Determine Data Defect Responsibility • 4. Define Data Quality Rules • 5. Obtain Sign-off for Correction Logic • 6. Integrate rules with Logical Data Mapping
  • 23. @joe_Caserta#DGIQ2015 Enrollments Claims Finance ETL Horizontally Scalable Environment - Optimized for Analytics NoSQL Databases ETL Spark MapReduce Pig/Hive N1 N2 N4N3 N5 Hadoop Distributed File System (HDFS) Traditional EDW Others… The Evolution of the Enterprise Data Hub Big Data Lake ETL
  • 24. @joe_Caserta#DGIQ2015 What’s Old is New Again  Before Data Warehousing DG/DQ  Users trying to produce reports from raw source data  No Data Conformance  No Master Data Management  No Data Quality processes  No Trust: Two analysts were almost guaranteed to come up with two different sets of numbers!  Before Data Lake DG/DQ  We can put “anything” in Hadoop  We can analyze anything  We’re scientists, we don’t need IT, we make the rules  Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess  Rule #2: Information harvested from an ungoverned systems will take us back to the old days: No Trust = Not Actionable
  • 25. @joe_Caserta#DGIQ2015 Big Data Warehouse Data Science Workspace Data Lake – Integrated Sandbox Landing Area – Source Data in “Full Fidelity” The Enterprise Data Pyramid Metadata  Catalog ILM  who has access, how long do we “manage it” Raw machine data collection, collect everything Data is ready to be turned into information: organized, well defined, complete. Agile business insight through data- munging, machine learning, blending with external data, development of to-be BDW facts Metadata  Catalog ILM  who has access, how long do we “manage it” Data Quality and Monitoring  Monitoring of completeness of data Metadata  Catalog ILM  who has access, how long do we “manage it” Data Quality and Monitoring  Monitoring of completeness of data  ETL cleans, conforms, consolidates, enriches each tier  Only top tier of the pyramid is fully governed Fully Data Governed ( trusted) User community arbitrary queries and reporting ETL/DQ ETL/DQ ETL/DQ ETL
  • 26. @joe_Caserta#DGIQ2015 Recommended Reading Ralph Kimball The Data Warehouse Lifecycle Toolkit, 2nd Edition Jack E Olson Data Quality, the Accuracy Dimension Ralph Kimball, Joe Caserta The Data Warehouse ETL Toolkit
  • 27. @joe_Caserta#DGIQ2015 Formal DW & ETL Training in NYC, 2015 Join us for one or both training courses combining two unique workshops from international data warehousing veterans. Workshops: Sept 21-22 (2 days), Agile Data Warehousing with Lawrence Corr Sept 23-24 (2 days), ETL Architecture and Design with Joe Caserta SAVE $300 BY REGISTERING BEFORE JUNE 30TH! Thanks! We look forward to seeing you there.
  • 28. @joe_Caserta#DGIQ2015 Thank You / Q&A Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta