1. Validating
Big Data & ML Pipelines
With Apache Spark & Airflow & Friends:
knowing when you crash
Melinda Seckington
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC
● (One of) the same speakers as the last talk :p
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
3.
4. What is going to be covered:
● What validation is & why you should do it for your data pipelines
● How to make simple validation rules & our current limitations
● ML Validation - Guessing if our black box is “correct”
● Cute & scary pictures
○ I promise at least one cat
○ And at least one picture of my scooter club
Andrew
5. Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Possibly familiar with Scala - if you're new, WELCOME!
● Possibly familiar with Spark, BEAM, or a similar system (but also ok if
not)
● Want to make better software
○ (or models, or w/e)
● Or just want to make software good enough to not have to keep your resume
up to date
● Open to the idea that pipeline validation can be explained with a scooter club
that is definitely not a gang.
6. Tests are not perfect: See Motorcycles/Scooters/...
● They are not property checking
● It’s just multiple choice
● You don’t even need one to ride a scoot!
7. So why should you validate?
● tl;dr - Your tests probably aren’t perfect
● You want to know when you're aboard the failboat
● Our code will most likely fail at some point
○ Sometimes data sources fail in new & exciting ways (see “Call me Maybe”)
○ That jerk on that other floor changed the meaning of a field :(
○ Our tests won’t catch all of the corner cases that the real world finds
● We should try and minimize the impact
○ Avoid making potentially embarrassing recommendations
○ Save having to be woken up at 3am to do a roll-back
○ Specifying a few simple invariants isn’t all that hard
○ Repeating Holden’s mistakes is still not fun
8. So why should you validate?
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
9. What happens when we don’t
This talk is being recorded so we’ll leave it at:
● Go home after an accident rather than checking on bones
Or with computers:
● Breaking a feature that cost a few million dollars
● Every search result was a coffee shop
● Rabbit (“bunny”) versus rabbit (“queue”) versus rabbit (“health”)
● VA, BoA, etc.
itsbruce
10. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
11. Let's focus on validation some more:
*Can be used during integration tests to further validate integration results
12.
13. So how do we validate our jobs?
● The idea is, at some point, you made software which worked.
○ If you didn't, you probably want to run it a few times and manually validate it
● Maybe you manually tested and sampled your results
● Hopefully you did a lot of other checks too
● But we can’t do that every time; our pipelines are no longer write-once,
run-once - they are often write-once, run-forever, debug-forever.
14. How many people have something like this?
val data = ...
val parsed = data.flatMap(x =>
  try {
    Some(parse(x))
  } catch {
    case _: Exception => None // Whatever, it's JSON
  }
)
Lilithis
15. But we need some data...
val data = ...
data.cache()
val validData = data.filter(isValid)
val badData = data.filter(x => !isValid(x))
if (validData.count() < badData.count()) {
  // Ruh Roh! Special business error handling goes here
}
...
Pager photo by Vitachao CC-SA 3
16. Well that’s less fun :(
● Our optimizer can’t just magically chain everything together anymore
● My flatMap.map.map is fnur :(
● Now I’m blocking on a thing in the driver
Sn.Ho
17. Counters* to the rescue**!
● Both BEAM & Spark have their own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
○ Visible in the UI; can also register a listener from the spark-validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting
option
● We can _pretend_ we still have nice functional code
*Counters are your friends, but the kind of friends who steal your lunch money
** In a similar way to how regular expressions can solve problems….
Miguel Olaya
18. So what does that look like?
val parsed = data.flatMap(x =>
  try {
    val result = parse(x)
    happyCounter.add(1)
    Some(result)
  } catch {
    case _: Exception =>
      sadCounter.add(1)
      None // Whatever, it's JSON
  }
)
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here
Pager photo by Vitachao CC-SA 3
Phoebe Baker
19. Ok but what about those *s
● Beam counters are implementation dependent
● Spark counters aren’t great for data properties
● etc.
Miguel Olaya
20. + You need to understand your domain, like bubbles
21. General Rules for making Validation rules
● According to a sad survey most people check execution time & record count
● spark-validator is still in early stages but an interesting proof of concept
○ I’m going to rewrite it over the holidays as a two-stage job (one to collect metrics in your main
application and a second to validate).
○ I was probably a bit sleep deprived when I wrote it because looking at it… idk
● Sometimes your rules will misfire and you’ll need to manually approve a job
● Remember those property tests? Could be Validation rules
● Historical data
● Domain specific solutions
Photo by:
Paul Schadler
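The "historical data" rule above can be sketched in plain Python: compare this run's counters to previous runs and flag anything outside a tolerance band. The metric names and the 25% tolerance are illustrative assumptions, not from spark-validator.

```python
# Sketch of a historical-data validation rule. Metric names and the
# 25% tolerance band are made up for illustration.

def within_band(current, history, tolerance=0.25):
    """True if `current` is within +/- tolerance of the historical mean."""
    if not history:
        return True  # no history yet: accept the run and start collecting
    mean = sum(history) / len(history)
    return abs(current - mean) <= tolerance * mean

def validate_run(metrics, historical_metrics, tolerance=0.25):
    """Return the names of metrics (record counts, runtimes, ...) that look off."""
    return [name for name, value in metrics.items()
            if not within_band(value, historical_metrics.get(name, []), tolerance)]
```

A run with ~100 records validates cleanly against a history of [95, 102, 98]; a run with 10 records gets flagged.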
22. Input Schema Validation
● Handling the “wrong” type of cat
● Many many different approaches
○ filter/flatMap stages
○ Working in Scala/Java? .as[T]
○ Manually specify your schema after doing inference the first time :p
● Unless you’re working on mnist.csv there is a good chance your validation is
going to be fuzzy (reject some records, accept others)
● How do we know if we’ve rejected too much?
Bradley Gordon
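One way to picture fuzzy schema validation (this is an illustrative Python sketch, not Spark's `.as[T]`): check each record against an expected schema and refuse to continue if too large a fraction gets rejected. The field names and the 10% threshold are assumptions.

```python
# Sketch of fuzzy input schema validation with a rejection-rate guard.
# EXPECTED_SCHEMA and the 10% threshold are made-up examples.

EXPECTED_SCHEMA = {"user_id": int, "query": str}

def is_valid(record, expected=EXPECTED_SCHEMA):
    """Reject records with missing or wrongly-typed fields."""
    return all(isinstance(record.get(field), t) for field, t in expected.items())

def filter_with_guard(records, max_reject_frac=0.1):
    """Keep valid records, but fail loudly if we rejected too many."""
    ok = [r for r in records if is_valid(r)]
    if records and (len(records) - len(ok)) / len(records) > max_reject_frac:
        raise ValueError("rejected too many records - do not use results")
    return ok
```

This answers the slide's question directly: "too much" is whatever fraction you decided up front.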
23. As a relative rule:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map { x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation
24. % of data change
● Not just invalid records: if a field’s value changes everywhere it could still be
“valid” but have a different meaning
○ Remember that example about almost recommending illegal content?
● Join and see number of rows different on each side
● Expensive operation, but worth it if your data changes slowly / at a constant-ish rate
○ Sometimes done as a separate parallel job
● Can also be used on output if applicable
○ You do have a table/file/as applicable to roll back to right?
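The join-and-count idea above can be sketched with plain dicts keyed by id (on Spark this would be a join between the old and new tables): measure what fraction of values changed, appeared, or vanished. The 5% threshold is an illustrative default, not a recommendation.

```python
# Sketch of the "% of data change" check: compare old and new snapshots
# by key. On Spark this would be a join; dicts keep the idea visible.

def fraction_changed(old, new):
    """Fraction of keys whose value changed, appeared, or disappeared."""
    keys = set(old) | set(new)
    if not keys:
        return 0.0
    return sum(1 for k in keys if old.get(k) != new.get(k)) / len(keys)

def check_change_rate(old, new, max_frac=0.05):
    """max_frac is a made-up default; tune it to how fast your data drifts."""
    return fraction_changed(old, new) <= max_frac
```

Note this catches the "still valid but different meaning" case: every record can pass schema checks while the change fraction explodes.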
25. Validation rules can be a separate stage(s)
● Sometimes data validation runs in parallel in a separate process
● Combined with counters/metrics from your job
● Can then be checked by a separate job that looks at the results and
decides if the pipeline should continue
26. TFDV: Magic*
● Counters, schema inference, anomaly detection, oh my!
# Compute statistics over a new set of data
new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
# Compare how new data conforms to the schema
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies inline
tfdv.display_anomalies(anomalies)
Details: https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0
27. Not just data changes: Software too
● Things change! Yay! Often for the better.
○ Especially with handling edge cases like NA fields
○ Don’t assume the results stay the same - do a side-by-side run + diff
● Have an ML model?
○ Welcome to new params - or old params with different default values.
○ We’ll talk more about that later
● Excellent PyData London talk about how this can impact
ML models
○ Done with sklearn; shows vast differences in CV results from changing only
the version number
Francesco
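The "side-by-side run + diff" idea can be sketched as follows: run the old and new versions on the same input and list the ids whose outputs differ beyond a tolerance. The id -> score dicts stand in for what would be two result tables in practice.

```python
# Sketch of a side-by-side diff between two pipeline/model versions.
# Inputs are id -> numeric score; in practice, two result tables.

def diff_runs(old_out, new_out, tol=1e-6):
    differing = []
    for i in sorted(set(old_out) | set(new_out)):
        if i not in old_out or i not in new_out:
            differing.append(i)  # record appeared or disappeared
        elif abs(old_out[i] - new_out[i]) > tol:
            differing.append(i)  # score changed meaningfully
    return differing
```

An empty diff is what lets you upgrade a library version with some confidence; a large diff is the alert.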
28. Putting it together with Airflow
important_business_logic = SparkSubmitOperator(
    task_id='important_business_logic', …)
data_validation_task = SparkSubmitOperator(  # Can use a bash operator for TFDV
    task_id='data_validation', …)
t1 = S3KeySensor(  # Or GCSKeySensor
    task_id='validation_check', bucket_key='...', dag=dag)
import publish_graph
# Set upstreams and have party
29. Some ending notes
● Your validation rules don’t have to be perfect
○ But they should be good enough they alert infrequently
● You should have a way for the human operator to
override.
● Just like tests, try and make your validation rules
specific and actionable
○ “# of input rows changed” is not a great message - “table XYZ grew
unexpectedly by Y%” is actionable
● While you can use (some of) your tests as a basis for
your rules, your rules need tests too
○ e.g. add junk records/pure noise and see if it rejects
James Petts
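The last bullet (test your rules by feeding them junk) can be sketched like this. `looks_like_timestamp` is a made-up example rule, standing in for whatever validation rule you actually use.

```python
# Sketch of testing a validation rule with pure noise.
# `looks_like_timestamp` is a hypothetical rule: unix seconds
# between 2000-01-01 and roughly the year 2100.

def looks_like_timestamp(value):
    return isinstance(value, (int, float)) and 946684800 < value < 4102444800

def rejection_rate(rule, records):
    """Fraction of records the rule rejects."""
    return sum(1 for x in records if not rule(x)) / len(records)

junk = ["garbage", None, -5, 10 ** 15, {"a": 1}, [], 0.0]
```

If the rejection rate on junk records isn't near 1.0, the rule isn't actually guarding anything.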
30. Related Links:
● https://github.com/holdenk/data-pipeline-validator
● Testing Spark Best Practices (Spark Summit 2014)
● https://www.tensorflow.org/tfx/data_validation/get_started
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base
● Testing strategy for Apache Spark jobs
● The BEAM programming guide
Interested in OSS (especially Spark)?
● Check out my Twitch & Youtube for livestreams - http://twitch.tv/holdenkarau
& https://www.youtube.com/user/holdenkarau
Becky Lai
31. Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Spark in Action
High Performance Spark
Learning PySpark
32. High Performance Spark!
Available today, not a lot on testing and almost nothing on
validation, but that should not stop you from buying several
copies (if you have an expense account).
Cats love it!
Amazon sells it: http://bit.ly/hkHighPerfSpark :D
33. Sign up for the mailing list @
http://www.distributedcomputing4kids.com
34. And some upcoming talks:
● March
○ Strata San Francisco
● April
○ Strata London
● May
○ Code Mania
35. Sparkling Pink Panda Scooter group photo by Kenzi
k thnx bye! (or questions…)
If you want to fill out a survey:
http://bit.ly/holdenTestingSpark
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
I'll be in the hallway or you
can email me:
holden@pigscanfly.ca