Engineering data quality

www.scling.com
Engineering data quality
Øredev, 2019-11-08
Lars Albertsson (@lalleal)
Scling
1

www.scling.com
Data value requires data quality
2
Hey, the CRM pipeline is down!
We really need the data.
But the data is completely
bogus, and we need to work
with the provider to fix it.
…?
But we use it to feed our
analytics, and need data now!

www.scling.com
Scope
3
Data engineering perspective on data quality
● Context - big data environments
● Origins of good or bad data
● Quality assessment
● Quality assurance

www.scling.com
Big data - a collaboration paradigm
4
Stream storage
Data lake
Data
democratised

www.scling.com
Data pipelines
5
Data lake

www.scling.com
More data - decreased friction
6
Data lake
Stream storage

www.scling.com
Scling - data-value-as-a-service
7
Data lake
Stream storage
● Extract value from your data
● Data platform + custom data pipelines
● Imitate data leaders:
○ Quick idea-to-production
○ Operational efficiency
Our marketing strategy:
● Promiscuously share knowledge
○ On slides devoid of glossy polish :-)

www.scling.com
Data platform overview
8
Data lake
Batch
processing
Cold
store
Dataset
Pipeline
Service
Service
Online
services
Offline
data platform
Job
Workflow
orchestration

www.scling.com
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
9

www.scling.com
The truth is out there
10
I love working with
data, because data
is true.

www.scling.com
Truth mutated
11
We put the new model
out for A/B testing, and
it looks great! Great. What fraction of
test users showed a
KPI improvement?
100%!
Hmm..
Wait, it seems ads
were disabled for the
test group....

www.scling.com
Not the whole truth
12
Our steel customers are
affected by cracks, causing
corrosion. Can you look at
our defect reports, and help
us predict issues?
Sure, hang on.
We have found a strong
signal: The customer id...

www.scling.com
Something but the truth
13
Huh, why do have a sharp increase
in invalid_media_type ?
I’ll have a look.
It seems that we have a new
media type “bullshit”...

www.scling.com
Hearsay
14
Manufacturing line
disruptions are expensive.
Can you look at our sensor
data, and help us predict?
Sure, hang on.

www.scling.com
Hearsay
15
Manufacturing line
disruptions are expensive.
Can you look at our sensor
data, and help us predict?
Sure, hang on.
This looks like an
early indicator!
Wait, is this interpolated?

www.scling.com
Events vs current state
● join(event, snapshot) → always time mismatch
● Usually acceptable
○ In one direction
16
DB’DB
join?

www.scling.com
Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow
17

www.scling.com
Ensuring timeliness
● First rule of distributed systems: Avoid distributed systems.
● Keep things simple.
● Master workflow orchestration.
Other than that, very large topic...
18

www.scling.com
19
Design for testability
● Output = function(input, code)
● No dependency on external services
● Avoid non-deterministic factors
q
DB Service

www.scling.com
20
Potential test scopes
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Choose stable interfaces
Each scope has a cost
Job
Service
App
Stream
Stream
Job
Stream
Job

www.scling.com
21
Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service
Job
Service
App
Stream
Stream
Job
Stream
Job

www.scling.com
22
Scopes to avoid
● Unit/Component
○ Few stable interfaces
○ Avoid mocks, dependency injection rituals
● Full system, including client
○ Client automation fragile
“Focus on functional system tests, complement
with smaller where you cannot get coverage.”
- Henrik Kniberg
Job
Service
App
Stream
Stream
Job
Stream
Job

www.scling.com
Testing single batch job
23
Job
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run in local mode 3. Verify output
f() p()
Runs well in
CI / from IDE

www.scling.com
class CleanUserTest extends FlatSpec {
def validateInvariants(
input: Seq[User],
output: Seq[User],
counters: Map[String, Int]) = {
output.foreach(recordInvariant)
// Dataset invariants
assert(input.size === output.size)
assert(input.size should be >= counters["upper-cased"])
}
def recordInvariant(u: User) =
assert(u.country.size === 2)
def runJob(input: Seq[User]): Seq[User] = {
// Same as before
...
validateInvariants(input, output, counters)
(output, counters)
}
// Test case is the same
}
24
Invariants
● Some things are true
○ For every record
○ For every job invocation
● Not necessarily in production
○ Reuse invariant predicates as quality
probes

www.scling.com
Measuring correctness: counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
25
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)
val orders = read(orderPath)
val users = read(userPath)
val orderNoUserCounter = longAccumulator("order-no-user")
val joined: C[(Order, Option[User])] = orders
.groupBy(_.userId)
.leftJoin(users.groupBy(_.id))
.values
val orderWithUser: C[(Order, User)] = joined
.flatMap( orderUser match
case (order, Some(user)) => Some((order, user))
case (order, None) => {
orderNoUserCounter.add(1)
None
})
SQL: Nope

www.scling.com
26
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
Hadoop / Spark counters DB
Standard graphing tools
Standard
alerting
service

www.scling.com
27
Measuring correctness: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
● Dedicated quality assessment pipelines
DB
Quality assessment job
Quality metadataset (tiny)
Standard graphing tools
Standard
alerting
service

www.scling.com
28
Conditional consumption
● Conditional consumption
○ Express in workflow orchestration
○ Read metrics DB, quality dataset
○ Producer can recommend, not decide
● Insufficient quality?
○ Wait for bug fix
○ Use older/newer input dataset
Recommendation
metrics
Report

www.scling.com
29
The unknown unknowns
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Revert to old egress dataset if necessary
Measure interactions
DB
Standard
alerting
service
Stream Job

www.scling.com
Fuzzy products
● Data-driven applications often have fuzzy logic
○ No clear right output
○ Quality == total experience for all users
● Testing for productivity must be binary
● Cause → effect connection must be strong
○ Cause = code change
○ Effect = quality degradation
30
Code ∆?

www.scling.com
Converting fuzzy to binary
● Break out binary behaviour
31
Simple input
Output sane?
Invariants?
Clear cut
scenario
Clear cut
result?

www.scling.com
Golden scenario suite
● Scenarios that must never fail
● May include real-world data
32
"Stockholm"
Geo data

www.scling.com
Weighted quality sum
● Sum of test case results should not regress
● Individual regressions are acceptable
● Example: Map searches, done from a US IP address with English language browser setting
Sum = 5.4
33
Input Output Verdict (0-1) Weight Weighted verdict
Springfield Springfield, MA 1 2 2
Hartfield Hartford, CT 0 1 0
Philadelphia Philadelphia, Egypt 0.2 5 1
Boston Boston, UK 0.4 4 1.6
Betlehem Betlehem, Israel 0.8 1 0.8

www.scling.com
Testing with real world / production data
● Data is volatile
○ Separate code change from test data change
○ Take snapshots to use for test
● Beware of privacy issues
34
∆?
Code ∆!
Input ∆?

www.scling.com
Data completeness
● Static workflow DAGs ensure dataset completeness
● Dataset completeness != data completeness
● Collected events might be delayed
● Event creation to collection delay is unbounded
○ Consider offline phones
35

www.scling.com
val orderLateCounter = longAccumulator("order-event-late")
val hourPaths = conf.order.split(",")
val order = hourPaths
.map(spark.read.json(_))
.reduce(a, b => a.union(b))
val orderThisHour = order
.map({ cl =>
# Count the events that came after the delay window
if (cl.eventTime.hour + config.delayHours <
config.hour) {
orderLateCounter.add(1)
}
order
})
.filter(cl => cl.eventTime.hour == config.hour)
class OrderShuffle(SparkSubmitTask):
hour = DateHourParameter()
delay_hours = IntParameter()
jar = 'orderpipeline.jar'
entry_class = 'com.example.shop.OrderJob'
def requires(self):
# Note: This delays processing by three hours.
return [Order(hour=hour) for hour in
[self.hour + timedelta(hour=h) for h in
range(self.delay_hours)]]
def output(self):
return HdfsTarget("/prod/red/order/v1/"
f"delay={self.delay}/"
f"{self.hour:%Y/%m/%d/%H}/")
def app_options(self):
return [ "--hour", self.hour,
"--delay-hours", self.delay_hours,
"--order",
",".join([i.path for i in self.input()]),
"--output", self.output().path]
Incompleteness recovery
36
SQL: Separate job for measuring window leakage.

www.scling.com
class OrderShuffleAll(WrapperTask):
def requires(self):
return [OrderShuffle(hour=self.hour, delay_hour=d)
for d in [0, 4, 12]]
class OrderDashboard(mysql.CopyToTable):
def requires(self):
return OrderShuffle(hour=self.hour, delay_hour=0)
class FinancialReport(SparkSubmitTask):
date = DateParameter()
def requires(self):
return [OrderShuffle(
hour=datetime.combine(self.date, time(hour=h)),
delay_hour=12)
for h in range(24)]
Fast data, complete data
37
Delay: 0
Delay: 4
Delay: 12

www.scling.com
Things to plan for early
38
Data quality

www.scling.com
Things to plan for early
39
Data quality
Input
validation
Software
supply
chain
Multi-cloud
Perfor-
mance
Cloud
native
UX
User
feedback
Web
security
Foobarility
Scalability
Testability
Accessi-
bility
Bidi
languages
i18n
Machine
learning
bias
Get the MVP out?
Mobile
browsers

www.scling.com
No.
Valuable
graph
Any graph
Does anyone care about data quality?
40
No graph
Great! Meh.
Valuable
model
Any ML
model
No model

www.scling.com
1999: Does anyone care about code quality?
No.
41
7 years

www.scling.com
Code quality 1999
42
Behold our
great code!
We think some QA and test
automation would be great.
Nah, boring. We
don’t have time.
Just put it in
production for us.

www.scling.com
Code quality 2019
43
We have invented DevOps
and continuous delivery.
Test automation is key!
That sounds familiar...

www.scling.com
Data quality 2019
44
Behold our
great model!
We think some data quality
assessment and automation
would be great.
Nah, why? We don’t have time.
Just put it in production for us.

www.scling.com
Data quality 2029
45
We have invented MLOps
and continuous modelling.
Quality feedback
automation is key!
That sounds familiar...

www.scling.com
Changing culture bottom-up
46

www.scling.com
Repeating the success?
47

www.scling.com
Resources, credits
Presentations, articles on related subjects:
● https://www.scling.com/reading-list
● https://www.scling.com/presentations
Useful tools:
● https://github.com/awslabs/deequ
● https://github.com/great-expectations/great_expectations
● https://github.com/spotify/ratatool
48
Thank you,
● Irene Gonzálvez, Spotify
https://youtu.be/U63TmQPS9Z8
● Anders Holst, RISE

www.scling.com
Tech has massive impact on society
49
Product?
Supplier?
Employer?
Make an active
choice whether to
have an impact!
Cloud?

www.scling.com
Laptop sticker
Vintage data visualisations, by Karin Lind.
● Charles Minard: Napoleon’s Russian campaign of 1812. Drawn 1869.
● Matthew F Maury: Wind and Current Chart of the North Atlantic. Drawn 1852.
● Florence Nightingale: Causes of Mortality in the Army of the East. Crimean
war, 1854-1856. Drawn 1858.
○ Blue = disease, red = wounds, black = battle + other.
● Harold Craft: Radio Observations of the Pulse Profiles and Dispersion
Measures of Twelve Pulsars, 1970
○ Joy Division:
Unknown Pleasures, 1979
○ “Joy plot” → “ridge plot”
50

Engineering data quality

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Engineering data quality

Similar to Engineering data quality (20)

More from Lars Albertsson

More from Lars Albertsson (16)

Recently uploaded

Recently uploaded (20)

Engineering data quality