3. @holdenkarau
Holden:
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC & Committer
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau &
https://www.youtube.com/user/holdenkarau
● Past Spark Talk Videos http://bit.ly/holdenSparkVideos
● Direct Talk feedback: http://bit.ly/holdenTalkFeedback
● Working on a book on Kubeflow (ML + Kubernetes):
http://www.introductiontomlwithkubeflow.com/
5. @holdenkarau
What is going to be covered:
● What validation is & why you should do it for your data pipelines
● How to make simple validation rules & our current limitations
● ML Validation - Guessing if our black box is “correct”
● Cute & scary pictures
○ I promise at least one cat
○ And at least one picture of my scooter club
Andrew
6. @holdenkarau
Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Possibly familiar with Spark; if you're new, WELCOME!
● Want to make better software
○ (or models, or w/e)
● Or just want to make software good enough to not have to keep your resume
up to date
● Open to the idea that pipeline validation can be explained with a scooter club
that is definitely not a gang.
8. @holdenkarau
Tests are not perfect: See Motorcycles/Scooters/...
● Are not property checking
● It’s just multiple choice
● You don’t even need one to ride a scoot!
9. @holdenkarau
Why don’t we validate?
● We already tested our code
○ Riiiight?
● What could go wrong?
Also extra hard in distributed systems
● Distributed metrics are hard
● not much built in (not very consistent)
● not always deterministic
● Complicated production systems
10. @holdenkarau
So why should you validate?
● tl;dr - Your tests probably aren’t perfect
● You want to know when you're aboard the failboat
● Our code will most likely fail at some point
○ Sometimes data sources fail in new & exciting ways (see “Call me Maybe”)
○ That jerk on that other floor changed the meaning of a field :(
○ Our tests won’t catch all of the corner cases that the real world finds
● We should try and minimize the impact
○ Avoid making potentially embarrassing recommendations
○ Save having to be woken up at 3am to do a roll-back
○ Specifying a few simple invariants isn’t all that hard
○ Repeating Holden’s mistakes is still not fun
11. @holdenkarau
So why should you validate
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
12. @holdenkarau
So why should you validate
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
13. @holdenkarau
What happens when we don’t
This talk is being recorded so no company or rider names:
● Go home after an accident rather than checking on bones
Or with computers:
● Breaking a feature that cost a few million dollars
● Every search result was a coffee shop
● Rabbit (“bunny”) versus rabbit (“queue”) versus rabbit (“health”)
● VA, BoA, etc.
itsbruce
17. @holdenkarau
So how do we validate our jobs?
● The idea is, at some point, you made software which worked.
○ If you didn't, you probably want to run it a few times and manually validate it
● Maybe you manually tested and sampled your results
● Hopefully you did a lot of other checks too
● But we can't do that every time; our pipelines are no longer write-once,
run-once. They are often write-once, run-forever, and debug-forever.
18. @holdenkarau
How many people have something like this?
val data = ...
val parsed = data.flatMap(x =>
  try {
    Some(parse(x))
  } catch {
    case _: Exception => None // Whatever, it's JSON
  }
)
Lilithis
19. @holdenkarau
But we need some data...
val data = ...
data.cache()
val validData = data.filter(isValid)
val badData = data.filter(!isValid(_))
if (validData.count() < badData.count()) {
  // Ruh Roh! Special business error handling goes here
}
...
Pager photo by Vitachao CC-SA 3
20. @holdenkarau
Well that’s less fun :(
● Our optimizer can’t just magically chain everything together anymore
● My flatMap.map.map is fnur :(
● Now I’m blocking on a thing in the driver
Sn.Ho
21. @holdenkarau
Counters* to the rescue**!
● Both BEAM & Spark have their own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
○ Visible in the UI; can also register a listener from the spark-validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting
option
● We can _pretend_ we still have nice functional code
*Counters are your friends, but the kind of friends who steal your lunch money
** In a similar way to how regular expressions can solve problems….
Miguel Olaya
22. @holdenkarau
So what does that look like?
val parsed = data.flatMap { x =>
  try {
    val result = parse(x)
    happyCounter.add(1)
    Some(result)
  } catch {
    case _: Exception =>
      sadCounter.add(1)
      None // Whatever, it's JSON
  }
}
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here
Pager photo by Vitachao CC-SA 3
Phoebe Baker
23. @holdenkarau
Ok but what about those *s
● Beam counters are implementation dependent
● Spark counters aren’t great for data properties
● etc.
Miguel Olaya
24. @holdenkarau
General Rules for making Validation rules
● According to a sad survey most people check execution time & record count
● spark-validator is still in early stages but interesting proof of concept
○ I was probably a bit sleep deprived when I wrote it because looking at it… idk
○ I have a rewrite which is going through our open source releasing process. Maybe it will be
released! Not a guarantee.
● Sometimes your rules will misfire and you'll need to manually approve a job
● Remember those property tests? Could be Validation rules
● Historical data
● Domain specific solutions
● Do you have property tests?
○ You should! Check out spark-testing-base
○ But you can use your property tests as a basis for validation rules as well
Photo by:
Paul Schadler
25. @holdenkarau
Input Schema Validation
● Handling the “wrong” type of cat
● Many many different approaches
○ filter/flatMap stages
○ Working in Scala/Java? .as[T]
○ Manually specify your schema after doing inference the first time :p
● Unless you're working on mnist.csv there is a good chance your validation is
going to be fuzzy (reject some records, accept others)
● How do we know if we’ve rejected too much?
Bradley Gordon
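The slides don't prescribe a cutoff for "rejected too much", but the idea can be sketched in a few lines of plain Python. Everything here is illustrative: the function name and the 5% `max_rate` threshold are hypothetical, and in practice you'd tune the threshold from historical runs.

```python
def validate_rejection_rate(total, rejected, max_rate=0.05):
    """True when the share of rejected records is within max_rate
    (a hypothetical 5% threshold - tune it from your job's history)."""
    if total == 0:
        return False  # an empty input is itself worth alerting on
    return (rejected / total) <= max_rate

# 10 bad records out of 1000 is within the budget; 200 is not
ok = validate_rejection_rate(1000, 10)
bad = validate_rejection_rate(1000, 200)
```

Rather than failing the job outright, the boolean could feed a separate validation stage that decides whether the pipeline continues.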
27. @holdenkarau
So using names & logging & accs could be:
rejectedCount = sc.accumulator(0)

def loggedDivZero(x):
    import logging
    try:
        return [x / 0]
    except Exception as e:
        rejectedCount.add(1)
        logging.warning("Error found " + repr(e))
        return []

transform1 = data.flatMap(loggedDivZero)
transform2 = transform1.map(add1)
transform2.count()
print("Rejected " + str(rejectedCount.value))
28. @holdenkarau
% of data change
● Not just invalid records, if a field’s value changes everywhere it could still be
“valid” but have a different meaning
○ Remember that example about almost recommending illegal content?
● Join and see number of rows different on each side
● Expensive operation, but OK if your data changes slowly / at a constant-ish rate
○ Sometimes done as a separate parallel job
● Can also be used on output if applicable
○ You do have a table/file/as applicable to roll back to right?
29. @holdenkarau
Validation rules can be a separate stage(s)
● Sometimes data validation runs in parallel in a separate process
● Combined with counters/metrics from your job
● Can then be compared by a separate job that looks at the results and
decides if the pipeline should continue
30. @holdenkarau
TFDV: Magic*
● Counters, schema inference, anomaly detection, oh my!
# Compute statistics over a new set of data
new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
# Compare how new data conforms to the schema
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies inline
tfdv.display_anomalies(anomalies)
Details:
https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0
31. @holdenkarau
TFDV: Magic*
● Not exactly in Spark (works with the direct runner)
● Buuut we have the right tools to do the same computation in Spark
Cats by
moonwhiskers
32. @holdenkarau
What can we learn from TFDV:
● Auto Schema Generation & Comparison
○ Spark SQL yay!
● We can compute summary statistics of your inputs & outputs
○ Spark SQL yay!
● If they change a lot "something" is on fire
● Anomaly detection: a few different spark libraries & talks here
○ Can help show you what might have gone wrong
Tim Walker
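One way to sketch the "summary statistics changed a lot" idea in plain Python. The z-score-style rule below is hypothetical and much cruder than TFDV's actual anomaly detection; it just flags when a new batch's mean lands far outside the old distribution.

```python
from statistics import mean, stdev

def mean_drifted(old_values, new_values, z_threshold=3.0):
    """Flag when the new batch's mean sits more than z_threshold
    old-standard-deviations away from the old mean (hypothetical rule)."""
    old_m, old_s = mean(old_values), stdev(old_values)
    new_m = mean(new_values)
    if old_s == 0:
        return new_m != old_m
    return abs(new_m - old_m) / old_s > z_threshold
```

In Spark you'd compute the same statistics with Spark SQL aggregates over the real tables instead of Python lists.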
33. @holdenkarau
Not just data changes: Software too
● Things change! Yay! Often for the better.
○ Especially with handling edge cases like NA fields
○ If you don't expect the results to change: do a side-by-side run + diff
● Excellent PyData London talk about how this can impact
ML models
○ Done with sklearn; shows vast differences in CV results from only changing
the version number
Francesco
34. @holdenkarau
Onto ML (or Beyond ETL :p)
● Some of the same principles work (yay!)
○ Schemas, invalid records, etc.
● Some new things to check
○ CV performance, Feature normalization ranges
● Some things don’t really work
○ Output size probably isn’t that great a metric anymore
○ Eyeballing the results for override is a lot harder
contraption
35. @holdenkarau
Extra considerations for ML jobs:
● Harder to look at output size and say if it's good
● We can look at the cross-validation performance
● Fixed test set performance
● Number of iterations / convergence rate
● Number of features selected / number of features changed in selection
● (If applicable) delta in model weights or delta in hyper params
Hsu Luke
36. @holdenkarau
Traditional theory (Models)
● Human decides it's time to “update their models”
● Human goes through a model update run-book
● Human does other work while their “big-data” job runs
● Human deploys X% new models
● Looks at graphs
● Presses deploy
Andrew
37. @holdenkarau
Traditional practice (Models)
● Human is cornered by stakeholders and forced to update models
● Spends a few hours trying to remember where the guide is
● Gives up and kind of wings it
● Comes back to a trained model
● Human deploys X% models
● Human reads reddit/hacker news/etc.
● Presses deploy
Bruno Caimi
38. @holdenkarau
New possible practice (sometimes)
● Computer kicks off job (probably at an hour boundary because *shrug*) to
update model
● Workflow tool notices new model is available
● Computer deploys X% models
● Software looks at monitoring graphs, uses statistical test to see if it’s bad
● Robot rolls it back & pager goes off
● Human Presses overrides and deploys anyways
Henrique Pinto
39. @holdenkarau
Updating your model
● The real world changes
● Online learning (streaming) is super cool, but hard to version
○ Common kappa-like arch and then revert to checkpoint
○ Slowly degrading models, oh my!
● Iterative batches: automatically train on new data, deploy model, and A/B test
● But A/B testing isn’t enough -- bad data can result in wrong or even illegal
results
40. @holdenkarau
Cross-validation
because saving a test set is effort
● Trains on X% of the data and tests on Y%
○ Multiple times switching the samples
● org.apache.spark.ml.tuning has the tools for auto-fitting using CV
○ If you're going to use this for auto-tuning please please save a test set
○ Otherwise your models will look awesome and perform like a Ford Pinto (or whatever a crappy
car is here. Maybe a Renault Reliant?)
Jonathan Kotta
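The "please please save a test set" point can be sketched in plain Python: carve the test rows off before any tuning happens, so cross-validation never sees them. Function name and fractions are illustrative, not from spark.ml.

```python
import random

def train_and_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle once, carve off an untouched test set, tune on the rest.
    The held-out slice is only scored once, after tuning finishes."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    cut = int(len(rows) * test_frac)
    return rows[cut:], rows[:cut]  # (pool for CV/tuning, held-out test set)
```

Only the first element goes to your tuner; the second is scored exactly once, so the final number isn't inflated by the tuning loop.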
41. @holdenkarau
False sense of security:
● A/B test please even if CV says amazing
● Rank based things can have training bias with previous orders
○ Non-displayed options: unlikely to be chosen
○ Sometimes can find previous formulaic corrections
○ Sometimes we can “experimentally” determine
● Other times we just hope it’s better than nothing
● Try and make sure your ML isn’t evil or re-encoding human biases but
stronger
42. @holdenkarau
Some ending notes
● Your validation rules don’t have to be perfect
○ But they should be good enough they alert infrequently
● You should have a way for the human operator to override.
● Just like tests, try and make your validation rules specific and actionable
○ “# of input rows changed” is not a great message; “table XYZ grew unexpectedly by Y%” is
● While you can use (some of) your tests as a basis for your rules, your rules
need tests too
○ e.g. add junk records/pure noise and see if it rejects
James Petts
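The "add junk records and see if it rejects" idea can be sketched as a test in plain Python. `is_valid_record` below is a hypothetical rule standing in for yours; the point is that the rule itself gets a test that feeds it pure noise.

```python
import random
import string

def is_valid_record(rec):
    """Hypothetical rule: a record must be a dict with a non-empty 'user'."""
    return isinstance(rec, dict) and bool(rec.get("user"))

def junk_records(n, seed=0):
    """Pure-noise inputs for testing the rule, not the pipeline."""
    rng = random.Random(seed)
    return ["".join(rng.choices(string.printable, k=20)) for _ in range(n)]

# The validation rule's own test: noise must be rejected outright.
all_rejected = not any(is_valid_record(r) for r in junk_records(100))
```

If noise slips through, the rule would silently pass garbage in production too, which is exactly the failure mode this slide warns about.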
43. @holdenkarau
Related Links:
● https://github.com/holdenk/data-pipeline-validator
● Testing Spark Best Practices (Spark Summit 2014)
● https://www.tensorflow.org/tfx/data_validation/get_started
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base
● Testing strategy for Apache Spark jobs
● The BEAM programming guide
Interested in OSS (especially Spark)?
● Check out my Twitch & Youtube for livestreams - http://twitch.tv/holdenkarau
& https://www.youtube.com/user/holdenkarau
Becky Lai
45. @holdenkarau
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
46. @holdenkarau
High Performance Spark!
Available today, not a lot on testing and almost nothing on validation, but that
should not stop you from buying several copies (if you have an expense
account).
Cats love it!
Amazon sells it: http://bit.ly/hkHighPerfSpark :D
49. @holdenkarau
Want to turn your failing code into "art"?
https://haute.codes/
It doesn't use Spark*
*yet
50. @holdenkarau
And some upcoming talks:
● April
○ Strata London
● May
○ KiwiCoda Mania
○ KubeCon Barcelona
● June
○ Scala Days EU
○ Berlin Buzzwords
51. @holdenkarau
Sparkling Pink Panda Scooter group photo by Kenzi
k thnx bye! (or questions…)
If you want to fill out a survey:
http://bit.ly/holdenTestingSpark
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
I'll be in the hallway and
back tomorrow or you can
email me:
holden@pigscanfly.ca
52. @holdenkarau
Property generating libs: QuickCheck / ScalaCheck
● QuickCheck (haskell) generates tests data under a set of constraints
● The Scala version is ScalaCheck - supported by the two unit testing libraries for
Spark
● sscheck (ScalaCheck for Spark)
○ Awesome people*, supports generating DStreams too!
● spark-testing-base
○ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
*I assume
tara hunt
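spark-testing-base and sscheck do this properly with real generators; a toy plain-Python sketch of the generate-pathological-inputs idea follows. Everything here is illustrative: `gen_pathological_strings` and `safe_parse` are made-up names, and the property under test is simply "the parser never raises".

```python
import random

def gen_pathological_strings(n, seed=0):
    """Generate nasty inputs in the spirit of ScalaCheck: empty strings
    (the empty-partition analogue), oversized records, and raw junk."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        kind = rng.randrange(3)
        if kind == 0:
            cases.append("")
        elif kind == 1:
            cases.append("x" * rng.randrange(1, 10_000))
        else:
            cases.append(bytes(rng.randrange(256)
                               for _ in range(20)).decode("latin-1"))
    return cases

def safe_parse(record):
    """Toy parser guard: bad records become [], never an exception."""
    try:
        return [int(record)]
    except (ValueError, OverflowError):
        return []
```

Running `safe_parse` over a few hundred generated cases checks the property without hand-writing each corner case; the same generators can later double as validation-rule fodder.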