www.scling.com
Engineering data quality
Øredev, 2019-11-08
Lars Albertsson (@lalleal)
Scling
Data value requires data quality
Hey, the CRM pipeline is down!
We really need the data.
But the data is completely
bogus, and we need to work
with the provider to fix it.
…?
But we use it to feed our
analytics, and need data now!
Scope
Data engineering perspective on data quality
● Context - big data environments
● Origins of good or bad data
● Quality assessment
● Quality assurance
Big data - a collaboration paradigm
Stream storage
Data lake
Data
democratised
Data pipelines
Data lake
More data - decreased friction
Data lake
Stream storage
Scling - data-value-as-a-service
Data lake
Stream storage
● Extract value from your data
● Data platform + custom data pipelines
● Imitate data leaders:
○ Quick idea-to-production
○ Operational efficiency
Our marketing strategy:
● Promiscuously share knowledge
○ On slides devoid of glossy polish :-)
Data platform overview
Data lake
Batch
processing
Cold
store
Dataset
Pipeline
Service
Service
Online
services
Offline
data platform
Job
Workflow
orchestration
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
The truth is out there
I love working with
data, because data
is true.
Truth mutated
We put the new model out for A/B testing, and it looks great!
Great. What fraction of test users showed a KPI improvement?
100%!
Hmm...
Wait, it seems ads were disabled for the test group...
Not the whole truth
Our steel customers are
affected by cracks, causing
corrosion. Can you look at
our defect reports, and help
us predict issues?
Sure, hang on.
We have found a strong
signal: The customer id...
Something but the truth
Huh, why do we have a sharp increase in invalid_media_type?
I’ll have a look.
It seems that we have a new
media type “bullshit”...
Hearsay
Manufacturing line
disruptions are expensive.
Can you look at our sensor
data, and help us predict?
Sure, hang on.
This looks like an
early indicator!
Wait, is this interpolated?
Events vs current state
● join(event, snapshot) → always time mismatch
● Usually acceptable
○ In one direction
DB’DB
join?
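The time mismatch can be illustrated with a minimal Python sketch. The record shapes, field names (`user_id`, `taken_at`, `country_by_user`) and the sample data are hypothetical and not from the talk. An event is joined with a snapshot taken later, so the joined value reflects the state at snapshot time, not at event time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    user_id: str
    time: datetime


@dataclass(frozen=True)
class Snapshot:
    taken_at: datetime
    country_by_user: dict  # user_id -> country at snapshot time


def join_events_with_snapshot(events, snapshot):
    """Join each event with the user state found in the snapshot.

    Returns (event, country, staleness) triples, where staleness is the
    distance between snapshot time and event time.
    """
    joined = []
    for event in events:
        country = snapshot.country_by_user.get(event.user_id)
        staleness = snapshot.taken_at - event.time
        joined.append((event, country, staleness))
    return joined


# Hypothetical data: the user changed country between event and snapshot,
# so the join reports the newer value.
events = [Event("u1", datetime(2019, 11, 8, 9, 0))]
snapshot = Snapshot(datetime(2019, 11, 9, 0, 0), {"u1": "NO"})
joined = join_events_with_snapshot(events, snapshot)
```

Exposing the staleness next to the joined value lets a consumer decide whether the mismatch is acceptable for its use case.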
Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow
Ensuring timeliness
● First rule of distributed systems: Avoid distributed systems.
● Keep things simple.
● Master workflow orchestration.
Other than that, very large topic...
Design for testability
● Output = function(input, code)
● No dependency on external services
● Avoid non-deterministic factors
DB Service
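A minimal Python sketch of a job designed this way. The function `clean_users`, its fields, and the sample records are hypothetical. The processing date is passed in as an argument instead of read from the system clock, and no external service is called, so the same input always gives the same output.

```python
from datetime import date


def clean_users(users, processing_date):
    """Pure transformation: the output depends only on the arguments."""
    cleaned = []
    for user in users:
        country = user.get("country")
        if not country:
            continue  # drop records without a country
        cleaned.append({
            "id": user["id"],
            "country": country.upper(),
            "processed": processing_date.isoformat(),
        })
    return cleaned


users = [{"id": 1, "country": "se"}, {"id": 2, "country": None}]
# Two runs with identical arguments give identical results.
first = clean_users(users, date(2019, 11, 8))
second = clean_users(users, date(2019, 11, 8))
```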
Potential test scopes
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Choose stable interfaces
Each scope has a cost
[Diagram: a pipeline of App, Service, Stream, and Job components]
Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service
Scopes to avoid
● Unit/Component
○ Few stable interfaces
○ Avoid mocks, dependency injection rituals
● Full system, including client
○ Client automation fragile
“Focus on functional system tests, complement
with smaller where you cannot get coverage.”
- Henrik Kniberg
Testing single batch job
Job
Standard ScalaTest harness
1. Generate input, f(), in file://test_input/
2. Run the job in local mode
3. Verify output, p(), in file://test_output/
Runs well in CI / from IDE
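The three steps can be sketched in plain Python with a temporary directory. This is an illustration and not the ScalaTest harness from the talk. The job `run_job`, the file name `users.json`, and the sample records are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_job(input_dir: Path, output_dir: Path) -> None:
    """Hypothetical batch job: upper-case the country field of each user."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(input_dir / "users.json") as src, \
            open(output_dir / "users.json", "w") as dst:
        for line in src:
            user = json.loads(line)
            user["country"] = user["country"].upper()
            dst.write(json.dumps(user) + "\n")


def test_clean_users() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        input_dir = Path(tmp) / "test_input"
        output_dir = Path(tmp) / "test_output"
        # 1. Generate input
        input_dir.mkdir()
        users = [{"id": 1, "country": "se"}, {"id": 2, "country": "NO"}]
        with open(input_dir / "users.json", "w") as f:
            for user in users:
                f.write(json.dumps(user) + "\n")
        # 2. Run the job locally, on local files
        run_job(input_dir, output_dir)
        # 3. Verify output
        with open(output_dir / "users.json") as f:
            output = [json.loads(line) for line in f]
    assert [u["country"] for u in output] == ["SE", "NO"]
    return output


result = test_clean_users()
```

Because the test only touches the job's input and output files, the job internals can be refactored without changing the test.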
class CleanUserTest extends FlatSpec {
  // Invariants that must hold for every job invocation
  def validateInvariants(
      input: Seq[User],
      output: Seq[User],
      counters: Map[String, Int]): Unit = {
    // Record invariants
    output.foreach(recordInvariant)
    // Dataset invariants
    assert(input.size === output.size)
    assert(input.size >= counters("upper-cased"))
  }

  // Invariant that must hold for every record
  def recordInvariant(u: User): Unit =
    assert(u.country.size === 2)

  def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
    // Same as before
    ...
    validateInvariants(input, output, counters)
    (output, counters)
  }

  // Test case is the same
}
Invariants
● Some things are true
○ For every record
○ For every job invocation
● Not necessarily in production
○ Reuse invariant predicates as quality probes
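A minimal Python sketch of reusing one invariant predicate in both settings. The predicate `country_is_valid`, the counter names, and the sample records are hypothetical. In a test, a violation fails the test. In production, the same predicate only bumps a counter.

```python
from collections import Counter


def country_is_valid(user) -> bool:
    """Record invariant: country is a two-letter code."""
    country = user.get("country")
    return isinstance(country, str) and len(country) == 2


def assert_invariants(users) -> None:
    """Test-time use: any violation fails the test."""
    for user in users:
        assert country_is_valid(user), f"invalid country in {user}"


def probe_quality(users) -> Counter:
    """Production use: count violations instead of failing the job."""
    counters = Counter()
    for user in users:
        counters["users-total"] += 1
        if not country_is_valid(user):
            counters["country-invalid"] += 1
    return counters


# Test data satisfies the invariant; production data might not.
assert_invariants([{"country": "SE"}, {"country": "NO"}])
production_users = [{"country": "SE"}, {"country": "Sweden"}, {"country": None}]
metrics = probe_quality(production_users)
```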
Measuring correctness: counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)

val orders = read(orderPath)
val users = read(userPath)

// User-defined counter for orders that reference an unknown user
val orderNoUserCounter = longAccumulator("order-no-user")

val joined: C[(Order, Option[User])] = orders
  .groupBy(_.userId)
  .leftJoin(users.groupBy(_.id))
  .values

val orderWithUser: C[(Order, User)] = joined
  .flatMap {
    case (order, Some(user)) => Some((order, user))
    case (order, None) =>
      // Odd code path: bump the counter and drop the record
      orderNoUserCounter.add(1)
      None
  }
SQL: Nope
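The same counter pattern, sketched in plain Python without a cluster framework. The function `join_orders_with_users`, the dictionary field names, and the sample data are hypothetical. The counter is bumped on the odd code path, where an order has no matching user.

```python
from collections import Counter


def join_orders_with_users(orders, users, counters):
    """Inner join of orders with users, counting orders that have no user."""
    users_by_id = {user["id"]: user for user in users}
    joined = []
    for order in orders:
        user = users_by_id.get(order["user_id"])
        if user is None:
            # Odd code path: bump the counter and drop the record
            counters["order-no-user"] += 1
            continue
        joined.append((order, user))
    return joined


counters = Counter()
orders = [{"item": "a", "user_id": 1}, {"item": "b", "user_id": 99}]
users = [{"id": 1, "country": "SE"}]
order_with_user = join_orders_with_users(orders, users, counters)
```

The counter value can then be exported to a metrics database and alerted on, like any other metric.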
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
Hadoop / Spark counters DB
Standard graphing tools
Standard
alerting
service
Measuring correctness: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
● Dedicated quality assessment pipelines
DB
Quality assessment job
Quality metadataset (tiny)
Standard graphing tools
Standard
alerting
service
Conditional consumption
● Conditional consumption
○ Express in workflow orchestration
○ Read metrics DB, quality dataset
○ Producer can recommend, not decide
● Insufficient quality?
○ Wait for bug fix
○ Use older/newer input dataset
Recommendation
metrics
Report
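A minimal sketch of a consumer-side decision, assuming the quality metadataset holds one record per input dataset. The field names (`date`, `valid_fraction`), the threshold, and the sample values are hypothetical. The consumer picks the newest dataset that meets its own quality threshold, and falls back to an older one otherwise.

```python
def choose_input(candidates, min_valid_fraction):
    """Return the newest dataset that is good enough, or None to wait."""
    # ISO dates sort correctly as strings
    newest_first = sorted(candidates, key=lambda d: d["date"], reverse=True)
    for dataset in newest_first:
        if dataset["valid_fraction"] >= min_valid_fraction:
            return dataset
    return None  # insufficient quality everywhere: wait for a bug fix


quality_metadata = [
    {"date": "2019-11-06", "valid_fraction": 0.99},
    {"date": "2019-11-07", "valid_fraction": 0.98},
    {"date": "2019-11-08", "valid_fraction": 0.61},
]
# The newest dataset is below the threshold, so an older one is chosen.
chosen = choose_input(quality_metadata, min_valid_fraction=0.95)
```

The threshold is an argument of the consumer, which matches the point that the producer can recommend but not decide.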
The unknown unknowns
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Revert to old egress dataset if necessary
Measure interactions
DB
Standard
alerting
service
Stream Job
Fuzzy products
● Data-driven applications often have fuzzy logic
○ No clear right output
○ Quality == total experience for all users
● Testing for productivity must be binary
● Cause → effect connection must be strong
○ Cause = code change
○ Effect = quality degradation
Code ∆?
Converting fuzzy to binary
● Break out binary behaviour
Simple input
Output sane?
Invariants?
Clear cut
scenario
Clear cut
result?
Golden scenario suite
● Scenarios that must never fail
● May include real-world data
"Stockholm"
Geo data
Weighted quality sum
● Sum of test case results should not regress
● Individual regressions are acceptable
● Example: Map searches, done from a US IP address with English language browser setting
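| Input        | Output              | Verdict (0-1) | Weight | Weighted verdict |
|--------------|---------------------|---------------|--------|------------------|
| Springfield  | Springfield, MA     | 1             | 2      | 2                |
| Hartfield    | Hartford, CT        | 0             | 1      | 0                |
| Philadelphia | Philadelphia, Egypt | 0.2           | 5      | 1                |
| Boston       | Boston, UK          | 0.4           | 4      | 1.6              |
| Betlehem     | Betlehem, Israel    | 0.8           | 1      | 0.8              |

Sum = 5.4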
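The weighted sum from the table can be computed with a short Python sketch. The inputs, verdicts, and weights are taken from the slide. The function names and the regression check with its tolerance are assumptions for illustration.

```python
# (input, output, verdict in 0..1, weight), as in the table
cases = [
    ("Springfield", "Springfield, MA", 1.0, 2),
    ("Hartfield", "Hartford, CT", 0.0, 1),
    ("Philadelphia", "Philadelphia, Egypt", 0.2, 5),
    ("Boston", "Boston, UK", 0.4, 4),
    ("Betlehem", "Betlehem, Israel", 0.8, 1),
]


def weighted_quality_sum(cases):
    """Sum of verdict * weight over all test cases."""
    return sum(verdict * weight for _, _, verdict, weight in cases)


def has_regressed(baseline_sum, candidate_sum, tolerance=1e-9):
    """Binary test verdict: only the total must not drop."""
    return candidate_sum < baseline_sum - tolerance


total = weighted_quality_sum(cases)  # approximately 5.4, as on the slide
```

A code change that lowers one verdict but raises another still passes, as long as the total does not go down.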
Testing with real world / production data
● Data is volatile
○ Separate code change from test data change
○ Take snapshots to use for test
● Beware of privacy issues
∆?
Code ∆!
Input ∆?
Data completeness
● Static workflow DAGs ensure dataset completeness
● Dataset completeness != data completeness
● Collected events might be delayed
● Event creation to collection delay is unbounded
○ Consider offline phones
val orderLateCounter = longAccumulator("order-event-late")
val hourPaths = config.order.split(",")
val order = hourPaths
  .map(spark.read.json(_))
  .reduce((a, b) => a.union(b))
val orderThisHour = order
  .map({ cl =>
    // Count the events that came after the delay window
    if (cl.eventTime.hour + config.delayHours < config.hour) {
      orderLateCounter.add(1)
    }
    cl
  })
  .filter(cl => cl.eventTime.hour == config.hour)

from datetime import timedelta

from luigi import DateHourParameter, IntParameter
from luigi.contrib.hdfs import HdfsTarget
from luigi.contrib.spark import SparkSubmitTask


# Order is the upstream task that produces hourly order dumps (defined elsewhere).
class OrderShuffle(SparkSubmitTask):
    hour = DateHourParameter()
    delay_hours = IntParameter()
    app = 'orderpipeline.jar'
    entry_class = 'com.example.shop.OrderJob'

    def requires(self):
        # Note: This delays processing by delay_hours hours.
        # Reads the hour itself plus the following delay_hours hours.
        return [Order(hour=hour) for hour in
                [self.hour + timedelta(hours=h) for h in
                 range(self.delay_hours + 1)]]

    def output(self):
        return HdfsTarget("/prod/red/order/v1/"
                          f"delay={self.delay_hours}/"
                          f"{self.hour:%Y/%m/%d/%H}/")

    def app_options(self):
        # Command line options must be strings
        return ["--hour", f"{self.hour:%Y-%m-%dT%H}",
                "--delay-hours", str(self.delay_hours),
                "--order",
                ",".join([i.path for i in self.input()]),
                "--output", self.output().path]
Incompleteness recovery
SQL: Separate job for measuring window leakage.
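The late-event counter can be sketched in plain Python, without Spark. The function `orders_for_hour`, the `event_time` field, and the sample events are hypothetical. Events are assumed to be read from the collection hours inside the delay window, as in the job above. An event is counted as late if it belongs to an hour so old that the window for that hour had already closed when the event arrived.

```python
from collections import Counter
from datetime import datetime, timedelta


def orders_for_hour(events, hour, delay_hours, counters):
    """Select events created in `hour`; count events outside the delay window."""
    selected = []
    for event in events:
        event_hour = event["event_time"].replace(
            minute=0, second=0, microsecond=0)
        if event_hour + timedelta(hours=delay_hours) < hour:
            # Arrived after the window for its own hour had closed
            counters["order-event-late"] += 1
        if event_hour == hour:
            selected.append(event)
    return selected


counters = Counter()
hour = datetime(2019, 11, 8, 10)
events = [
    {"event_time": datetime(2019, 11, 8, 10, 15)},  # belongs to this hour
    {"event_time": datetime(2019, 11, 8, 10, 59)},  # belongs to this hour
    {"event_time": datetime(2019, 11, 8, 9, 10)},   # older, but within window
    {"event_time": datetime(2019, 11, 8, 7, 30)},   # older than the window
]
selected = orders_for_hour(events, hour, delay_hours=2, counters=counters)
```

Tracking this counter over time gives a measure of how much data a given delay window leaks.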
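from datetime import datetime, time
from luigi import DateHourParameter, DateParameter, WrapperTask
from luigi.contrib import mysqldb as mysql
from luigi.contrib.spark import SparkSubmitTask

class OrderShuffleAll(WrapperTask):
    hour = DateHourParameter()

    def requires(self):
        return [OrderShuffle(hour=self.hour, delay_hours=d)
                for d in [0, 4, 12]]

class OrderDashboard(mysql.CopyToTable):
    hour = DateHourParameter()

    def requires(self):
        # Fast but possibly incomplete data
        return OrderShuffle(hour=self.hour, delay_hours=0)

class FinancialReport(SparkSubmitTask):
    date = DateParameter()

    def requires(self):
        # Slow but more complete data
        return [OrderShuffle(
                    hour=datetime.combine(self.date, time(hour=h)),
                    delay_hours=12)
                for h in range(24)]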
Fast data, complete data
Delay: 0
Delay: 4
Delay: 12
Things to plan for early
Data quality
Input validation
Software supply chain
Multi-cloud
Performance
Cloud native
UX
User feedback
Web security
Foobarility
Scalability
Testability
Accessibility
Bidi languages
i18n
Machine learning bias
Mobile browsers
Get the MVP out?
Does anyone care about data quality?
No.
No graph / Any graph / Valuable graph
No model / Any ML model / Valuable model
Great! Meh.
1999: Does anyone care about code quality?
No.
7 years
Code quality 1999
Behold our
great code!
We think some QA and test
automation would be great.
Nah, boring. We
don’t have time.
Just put it in
production for us.
Code quality 2019
We have invented DevOps
and continuous delivery.
Test automation is key!
That sounds familiar...
Data quality 2019
Behold our
great model!
We think some data quality
assessment and automation
would be great.
Nah, why? We don’t have time.
Just put it in production for us.
Data quality 2029
We have invented MLOps
and continuous modelling.
Quality feedback
automation is key!
That sounds familiar...
Changing culture bottom-up
Repeating the success?
Resources, credits
Presentations, articles on related subjects:
● https://www.scling.com/reading-list
● https://www.scling.com/presentations
Useful tools:
● https://github.com/awslabs/deequ
● https://github.com/great-expectations/great_expectations
● https://github.com/spotify/ratatool
Thank you,
● Irene Gonzálvez, Spotify
https://youtu.be/U63TmQPS9Z8
● Anders Holst, RISE
Tech has massive impact on society
Product?
Supplier?
Employer?
Make an active
choice whether to
have an impact!
Cloud?
Laptop sticker
Vintage data visualisations, by Karin Lind.
● Charles Minard: Napoleon’s Russian campaign of 1812. Drawn 1869.
● Matthew F Maury: Wind and Current Chart of the North Atlantic. Drawn 1852.
● Florence Nightingale: Causes of Mortality in the Army of the East. Crimean
war, 1854-1856. Drawn 1858.
○ Blue = disease, red = wounds, black = battle + other.
● Harold Craft: Radio Observations of the Pulse Profiles and Dispersion
Measures of Twelve Pulsars, 1970
○ Joy Division:
Unknown Pleasures, 1979
○ “Joy plot” → “ridge plot”
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 

Recently uploaded (20)

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 

Engineering data quality

  • 1. www.scling.com Engineering data quality Øredev, 2019-11-08 Lars Albertsson (@lalleal) Scling 1
  • 2. Data value requires data quality: Hey, the CRM pipeline is down! We really need the data. But the data is completely bogus, and we need to work with the provider to fix it. …? But we use it to feed our analytics, and need data now!
  • 3. Scope: Data engineering perspective on data quality ● Context - big data environments ● Origins of good or bad data ● Quality assessment ● Quality assurance
  • 4. Big data - a collaboration paradigm (diagram: Stream storage; Data lake; Data democratised)
  • 6. More data - decreased friction (diagram: Data lake; Stream storage)
  • 7. Scling - data-value-as-a-service ● Extract value from your data ● Data platform + custom data pipelines ● Imitate data leaders: ○ Quick idea-to-production ○ Operational efficiency Our marketing strategy: ● Promiscuously share knowledge ○ On slides devoid of glossy polish :-) (diagram: Data lake; Stream storage)
  • 8. Data platform overview (diagram: Data lake; Batch processing; Cold store; Dataset; Pipeline; Service; Service; Online services; Offline data platform; Job; Workflow orchestration)
  • 9. Data quality dimensions ● Timeliness ○ E.g. the customer engagement report was produced at the expected time ● Correctness ○ The numbers in the reports were calculated correctly ● Completeness ○ The report includes information on all customers, using all information from the whole time period ● Consistency ○ The customer summaries are all based on the same time period
  • 10. The truth is out there: I love working with data, because data is true.
  • 11. Truth mutated: We put the new model out for A/B testing, and it looks great! Great. What fraction of test users showed a KPI improvement? 100%! Hmm.. Wait, it seems ads were disabled for the test group....
  • 12. Not the whole truth: Our steel customers are affected by cracks, causing corrosion. Can you look at our defect reports, and help us predict issues? Sure, hang on. We have found a strong signal: The customer id...
  • 13. Something but the truth: Huh, why do we have a sharp increase in invalid_media_type? I’ll have a look. It seems that we have a new media type “bullshit”...
  • 14. Hearsay: Manufacturing line disruptions are expensive. Can you look at our sensor data, and help us predict? Sure, hang on.
  • 15. Hearsay: Manufacturing line disruptions are expensive. Can you look at our sensor data, and help us predict? Sure, hang on. This looks like an early indicator! Wait, is this interpolated?
  • 16. Events vs current state ● join(event, snapshot) → always time mismatch ● Usually acceptable ○ In one direction (diagram: DB’; DB; join?)
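The time mismatch in join(event, snapshot) can be illustrated with a small sketch. The names here (Event, user_snapshot, snapshot_taken_at) are hypothetical and not from the talk. It assumes a snapshot dumped after the events happened, so each joined attribute reflects the state at dump time rather than at event time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    user_id: str
    time: datetime
    action: str


# Hypothetical user table snapshot, dumped at midnight after the events.
snapshot_taken_at = datetime(2019, 11, 9, 0, 0)
user_snapshot = {"u1": {"country": "SE"}, "u2": {"country": "DK"}}

events = [
    Event("u1", datetime(2019, 11, 8, 9, 0), "play"),
    Event("u2", datetime(2019, 11, 8, 23, 30), "play"),
]


def join_events_with_snapshot(events, snapshot, taken_at):
    """Join each event with snapshot state and record how stale the match is."""
    joined = []
    for event in events:
        user = snapshot.get(event.user_id)
        # The snapshot describes the state at taken_at, not at event.time.
        mismatch = taken_at - event.time
        joined.append((event, user, mismatch))
    return joined


joined = join_events_with_snapshot(events, user_snapshot, snapshot_taken_at)
for event, user, mismatch in joined:
    print(event.user_id, user["country"], "snapshot taken", mismatch, "after event")
```

Recording the mismatch alongside the join makes it possible to check that the snapshot is always newer (or always older) than the events, which is the "one direction" case the slide calls acceptable.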
  • 17. Monitoring timeliness, examples ● Datamon - Spotify internal ● Twitter Ambrose (dead?) ● Airflow
  • 18. Ensuring timeliness ● First rule of distributed systems: Avoid distributed systems. ● Keep things simple. ● Master workflow orchestration. Other than that, very large topic...
  • 19. Design for testability ● Output = function(input, code) ● No dependency on external services ● Avoid non-deterministic factors (diagram: q; DB; Service)
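One way to read "Output = function(input, code)" is that anything non-deterministic, such as the clock or a random source, is passed in as an argument. The sketch below is only an illustration under that assumption; clean_users and its fields are hypothetical names.

```python
import random
from datetime import date


def clean_users(users, run_date, seed):
    """Pure job logic: the output depends only on the arguments."""
    rng = random.Random(seed)  # Seeded, so the sample is reproducible.
    cleaned = [
        {**u, "country": u["country"].upper(), "processed": run_date.isoformat()}
        for u in users
    ]
    # Pick one record for manual inspection, deterministically for a given seed.
    sample = rng.sample(cleaned, k=min(1, len(cleaned)))
    return cleaned, sample


users = [{"id": 1, "country": "se"}, {"id": 2, "country": "dk"}]
first = clean_users(users, date(2019, 11, 8), seed=42)
second = clean_users(users, date(2019, 11, 8), seed=42)
print(first == second)
```

Because the run date and seed are parameters rather than calls to the system clock or a global random generator, rerunning the job on the same input gives the same output.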
  • 20. Potential test scopes ● Unit/component ● Single job ● Multiple jobs ● Pipeline, including service ● Full system, including client Choose stable interfaces. Each scope has a cost. (diagram: Job; Service; App; Stream; Stream; Job; Stream; Job)
  • 21. Recommended scopes ● Single job ● Multiple jobs ● Pipeline, including service (diagram: Job; Service; App; Stream; Stream; Job; Stream; Job)
  • 22. Scopes to avoid ● Unit/Component ○ Few stable interfaces ○ Avoid mocks, dependency injection rituals ● Full system, including client ○ Client automation fragile “Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg (diagram: Job; Service; App; Stream; Stream; Job; Stream; Job)
  • 23. Testing single batch job: 1. Generate input 2. Run in local mode 3. Verify output. Runs well in CI / from IDE. (diagram: Job; Standard Scalatest harness; file://test_input/; file://test_output/; f(); p())
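The three steps (generate input, run in local mode, verify output) can be sketched with ordinary files in a temporary directory. This is a plain Python stand-in for the Scalatest harness on the slide, and clean_user_job is a hypothetical job used only for illustration.

```python
import json
import tempfile
from pathlib import Path


def clean_user_job(input_path: Path, output_path: Path) -> None:
    """Hypothetical batch job: upper-case country codes, one JSON record per line."""
    with input_path.open() as src, output_path.open("w") as dst:
        for line in src:
            user = json.loads(line)
            user["country"] = user["country"].upper()
            dst.write(json.dumps(user) + "\n")


def run_test() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        input_path = Path(tmp) / "test_input.json"
        output_path = Path(tmp) / "test_output.json"
        # 1. Generate input
        input_path.write_text(json.dumps({"id": 1, "country": "se"}) + "\n")
        # 2. Run the job locally, against local files only
        clean_user_job(input_path, output_path)
        # 3. Verify output
        output = [json.loads(line) for line in output_path.read_text().splitlines()]
        assert output == [{"id": 1, "country": "SE"}]
        return output


result = run_test()
```

The job only sees file paths, so the same code can run against test files in CI and against production storage without changes.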
  • 24. Invariants ● Some things are true ○ For every record ○ For every job invocation ● Not necessarily in production ○ Reuse invariant predicates as quality probes. Code on slide:
      import org.scalatest.{FlatSpec, Matchers}

      class CleanUserTest extends FlatSpec with Matchers {
        def validateInvariants(
            input: Seq[User],
            output: Seq[User],
            counters: Map[String, Int]) = {
          // Record invariants
          output.foreach(recordInvariant)
          // Dataset invariants
          assert(input.size === output.size)
          input.size should be >= counters("upper-cased")
        }

        def recordInvariant(u: User) =
          assert(u.country.size === 2)

        def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
          // Same as before
          ...
          validateInvariants(input, output, counters)
          (output, counters)
        }

        // Test case is the same
      }
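Reusing an invariant predicate as a quality probe might look like the following sketch. It assumes that production code counts violations instead of failing, while tests assert the same predicate. The probe helper and the counter name are hypothetical.

```python
from collections import Counter


def record_invariant(user: dict) -> bool:
    """Record-level invariant: country is a two-letter code."""
    return len(user["country"]) == 2


def probe(records, predicate, counters: Counter, name: str):
    """Production-style probe: count violations instead of failing the job."""
    for record in records:
        if not predicate(record):
            counters[name] += 1
        yield record


counters = Counter()
output = list(probe(
    [{"id": 1, "country": "SE"}, {"id": 2, "country": "Sweden"}],
    record_invariant, counters, "invalid-country"))

# In a test with controlled input, the same predicate would be asserted instead:
# assert all(record_invariant(u) for u in output)
print(counters)
```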
  • 25. Measuring correctness: counters ● User-defined ● Technical from framework ○ Execution time ○ Memory consumption ○ Data volumes ○ ... SQL: Nope. Code on slide:
      case class Order(item: ItemId, userId: UserId)
      case class User(id: UserId, country: String)

      val orders = read(orderPath)
      val users = read(userPath)
      val orderNoUserCounter = longAccumulator("order-no-user")

      val joined: C[(Order, Option[User])] = orders
        .groupBy(_.userId)
        .leftJoin(users.groupBy(_.id))
        .values

      val orderWithUser: C[(Order, User)] = joined
        .flatMap {
          case (order, Some(user)) => Some((order, user))
          case (order, None) =>
            // Odd code path: count orders that have no matching user
            orderNoUserCounter.add(1)
            None
        }
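The same left-join-with-counter idea can be sketched without a processing framework. A plain Counter stands in for the accumulator here, and the record shapes are hypothetical.

```python
from collections import Counter

orders = [{"item": "i1", "user_id": "u1"}, {"item": "i2", "user_id": "u9"}]
users = {"u1": {"id": "u1", "country": "SE"}}

counters = Counter()


def join_orders_with_users(orders, users, counters):
    """Left join orders with users; drop and count orders without a user."""
    for order in orders:
        user = users.get(order["user_id"])
        if user is None:
            # Odd code path: bump a counter instead of failing silently.
            counters["order-no-user"] += 1
            continue
        yield order, user


order_with_user = list(join_orders_with_users(orders, users, counters))
print(len(order_with_user), dict(counters))
```

A sudden rise in such a counter between runs is a cheap signal that an upstream dataset is incomplete or that keys have changed format.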
  • 26. Measuring correctness: counters ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics (diagram: Hadoop / Spark counters; DB; Standard graphing tools; Standard alerting service)
  • 27. Measuring correctness: pipelines ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics ● Dedicated quality assessment pipelines (diagram: DB; Quality assessment job; Quality metadataset (tiny); Standard graphing tools; Standard alerting service)
  • 28. Conditional consumption ● Conditional consumption ○ Express in workflow orchestration ○ Read metrics DB, quality dataset ○ Producer can recommend, not decide ● Insufficient quality? ○ Wait for bug fix ○ Use older/newer input dataset (diagram: Recommendation metrics; Report)
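Conditional consumption could be expressed roughly as below: the consumer reads quality metadata and decides for itself which input to use. The dataset paths, the error_rate metric, and the threshold are assumptions made for this sketch.

```python
def choose_input(candidates, quality, max_error_rate):
    """Return the newest dataset whose measured error rate is acceptable, else None."""
    for dataset in sorted(candidates, reverse=True):
        metrics = quality.get(dataset)
        if metrics is not None and metrics["error_rate"] <= max_error_rate:
            return dataset
    return None  # Nothing is good enough: wait for a bug fix.


# Hypothetical quality metadataset, keyed by dataset path.
quality = {
    "recommendations/2019-11-06": {"error_rate": 0.01},
    "recommendations/2019-11-07": {"error_rate": 0.02},
    "recommendations/2019-11-08": {"error_rate": 0.40},
}

# The consumer sets the threshold; the producer only publishes the metrics.
chosen = choose_input(quality.keys(), quality, max_error_rate=0.05)
print(chosen)
```

Here the newest dataset is skipped because its error rate exceeds the consumer's threshold, and the previous day's dataset is used instead.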
  • 29. The unknown unknowns ● Measure user behaviour ○ E.g. session length, engagement, funnel ● Revert to old egress dataset if necessary (diagram: Measure interactions; DB; Standard alerting service; Stream; Job)
  • 30. Fuzzy products ● Data-driven applications often have fuzzy logic ○ No clear right output ○ Quality == total experience for all users ● Testing for productivity must be binary ● Cause → effect connection must be strong ○ Cause = code change ○ Effect = quality degradation (diagram: Code ∆?)
  • 31. Converting fuzzy to binary ● Break out binary behaviour (diagram: Simple input; Output sane? Invariants?; Clear cut scenario; Clear cut result?)
  • 32. Golden scenario suite ● Scenarios that must never fail ● May include real-world data (diagram: "Stockholm"; Geo data)
  • 33. Weighted quality sum ● Sum of test case results should not regress ● Individual regressions are acceptable ● Example: Map searches, done from a US IP address with English language browser setting
      Input         Output               Verdict (0-1)  Weight  Weighted verdict
      Springfield   Springfield, MA      1              2       2
      Hartfield     Hartford, CT         0              1       0
      Philadelphia  Philadelphia, Egypt  0.2            5       1
      Boston        Boston, UK           0.4            4       1.6
      Betlehem      Betlehem, Israel     0.8            1       0.8
      Sum = 5.4
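The weighted sum from the table can be computed as in this sketch. The verdicts and weights are taken from the slide. The regression check against a stored baseline is an assumption about how such a suite might be wired into a test.

```python
# (input, output, verdict in [0, 1], weight), as on the slide.
cases = [
    ("Springfield", "Springfield, MA", 1.0, 2),
    ("Hartfield", "Hartford, CT", 0.0, 1),
    ("Philadelphia", "Philadelphia, Egypt", 0.2, 5),
    ("Boston", "Boston, UK", 0.4, 4),
    ("Betlehem", "Betlehem, Israel", 0.8, 1),
]


def weighted_quality(cases):
    """Sum of verdict * weight over all test cases."""
    return sum(verdict * weight for _, _, verdict, weight in cases)


baseline = 5.4  # Hypothetical score recorded from the previous accepted build.
score = weighted_quality(cases)

# Binary outcome for CI: individual cases may get worse, the sum may not.
# A small tolerance avoids false alarms from floating-point rounding.
regressed = score < baseline - 1e-9
print(round(score, 2), regressed)
```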
  • 34. Testing with real world / production data ● Data is volatile ○ Separate code change from test data change ○ Take snapshots to use for test ● Beware of privacy issues (diagram: ∆?; Code ∆!; Input ∆?)
  • 35. Data completeness ● Static workflow DAGs ensure dataset completeness ● Dataset completeness != data completeness ● Collected events might be delayed ● Event creation to collection delay is unbounded ○ Consider offline phones
  • 36. Incompleteness recovery. SQL: Separate job for measuring window leakage. Code on slide (Spark job):
      val orderLateCounter = longAccumulator("order-event-late")
      val hourPaths = config.order.split(",")
      val order = hourPaths
        .map(spark.read.json(_))
        .reduce((a, b) => a.union(b))
      val orderThisHour = order
        .map({ cl =>
          // Count the events that came after the delay window
          if (cl.eventTime.hour + config.delayHours < config.hour) {
            orderLateCounter.add(1)
          }
          cl
        })
        .filter(cl => cl.eventTime.hour == config.hour)
    Code on slide (Luigi task):
      from datetime import timedelta
      from luigi import DateHourParameter, IntParameter
      from luigi.contrib.hdfs import HdfsTarget
      from luigi.contrib.spark import SparkSubmitTask

      class OrderShuffle(SparkSubmitTask):
          hour = DateHourParameter()
          delay_hours = IntParameter()
          jar = 'orderpipeline.jar'
          entry_class = 'com.example.shop.OrderJob'

          def requires(self):
              # Note: This delays processing by delay_hours hours.
              # Reads the target hour plus the delay_hours hours that follow it.
              return [Order(hour=hour) for hour in
                      [self.hour + timedelta(hours=h)
                       for h in range(self.delay_hours + 1)]]

          def output(self):
              return HdfsTarget("/prod/red/order/v1/"
                                f"delay={self.delay_hours}/"
                                f"{self.hour:%Y/%m/%d/%H}/")

          def app_options(self):
              return [
                  "--hour", self.hour,
                  "--delay-hours", self.delay_hours,
                  "--order", ",".join([i.path for i in self.input()]),
                  "--output", self.output().path]
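The delay-window idea can be tried out on in-memory data. In this sketch, partitions maps each arrival hour to the event times collected during that hour. The function name, the data, and the choice to read the target hour plus delay_hours following hours are assumptions, not the talk's implementation.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_for_hour(partitions, target_hour, delay_hours, counters):
    """Collect events whose event time falls in target_hour, reading the arrival
    partitions target_hour .. target_hour + delay_hours. Events older than the
    delay window are counted as late."""
    result = []
    for d in range(delay_hours + 1):
        for event_time in partitions.get(target_hour + timedelta(hours=d), []):
            event_hour = event_time.replace(minute=0, second=0, microsecond=0)
            if event_hour + timedelta(hours=delay_hours) < target_hour:
                counters["order-event-late"] += 1
            if event_hour == target_hour:
                result.append(event_time)
    return result


h10 = datetime(2019, 11, 8, 10)
h11 = datetime(2019, 11, 8, 11)
partitions = {
    h10: [datetime(2019, 11, 8, 10, 5), datetime(2019, 11, 8, 7, 50)],
    h11: [datetime(2019, 11, 8, 10, 59), datetime(2019, 11, 8, 11, 10)],
}

fast_counters, complete_counters = Counter(), Counter()
fast = events_for_hour(partitions, h10, 0, fast_counters)          # low latency
complete = events_for_hour(partitions, h10, 1, complete_counters)  # more complete
print(len(fast), len(complete), dict(complete_counters))
```

With no delay, the event that arrived one hour late is missed. With a one-hour delay it is included, and the late counter shows how many events still fell outside the window.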
  • 37. Fast data, complete data (diagram: Delay: 0; Delay: 4; Delay: 12). Code on slide:
      class OrderShuffleAll(WrapperTask):
          hour = DateHourParameter()

          def requires(self):
              return [OrderShuffle(hour=self.hour, delay_hours=d)
                      for d in [0, 4, 12]]

      class OrderDashboard(mysql.CopyToTable):
          hour = DateHourParameter()

          def requires(self):
              # Dashboards favour low latency: no delay
              return OrderShuffle(hour=self.hour, delay_hours=0)

      class FinancialReport(SparkSubmitTask):
          date = DateParameter()

          def requires(self):
              # Financial reports favour completeness: 12 hour delay
              return [OrderShuffle(
                          hour=datetime.combine(self.date, time(hour=h)),
                          delay_hours=12)
                      for h in range(24)]
  • 38. Things to plan for early: Data quality
  • 39. Things to plan for early: Data quality; Input validation; Software supply chain; Multi-cloud; Performance; Cloud native; UX; User feedback; Web security; Foobarility; Scalability; Testability; Accessibility; Bidi languages; i18n; Machine learning bias; Mobile browsers. Get the MVP out?
  • 40. Does anyone care about data quality? No. (diagram: No graph; Any graph; Valuable graph; No model; Any ML model; Valuable model; Great!; Meh.)
  • 41. 1999: Does anyone care about code quality? No. (diagram: 7 years)
  • 42. Code quality 1999: Behold our great code! We think some QA and test automation would be great. Nah, boring. We don’t have time. Just put it in production for us.
  • 43. Code quality 2019: We have invented DevOps and continuous delivery. Test automation is key! That sounds familiar...
  • 44. Data quality 2019: Behold our great model! We think some data quality assessment and automation would be great. Nah, why? We don’t have time. Just put it in production for us.
  • 45. Data quality 2029: We have invented MLOps and continuous modelling. Quality feedback automation is key! That sounds familiar...
  • 48. Resources, credits. Presentations, articles on related subjects: ● https://www.scling.com/reading-list ● https://www.scling.com/presentations Useful tools: ● https://github.com/awslabs/deequ ● https://github.com/great-expectations/great_expectations ● https://github.com/spotify/ratatool Thank you, ● Irene Gonzálvez, Spotify https://youtu.be/U63TmQPS9Z8 ● Anders Holst, RISE
  • 49. Tech has massive impact on society: Product? Supplier? Employer? Cloud? Make an active choice whether to have an impact!
  • 50. Laptop sticker: Vintage data visualisations, by Karin Lind. ● Charles Minard: Napoleon’s Russian campaign of 1812. Drawn 1869. ● Matthew F Maury: Wind and Current Chart of the North Atlantic. Drawn 1852. ● Florence Nightingale: Causes of Mortality in the Army of the East. Crimean war, 1854-1856. Drawn 1858. ○ Blue = disease, red = wounds, black = battle + other. ● Harold Craft: Radio Observations of the Pulse Profiles and Dispersion Measures of Twelve Pulsars, 1970 ○ Joy Division: Unknown Pleasures, 1979 ○ “Joy plot” → “ridge plot”