Testing & Validation for Distributed Systems with Apache Spark & BEAM
Now mostly “works”*
Melinda Seckington
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC (as of December!) :)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
What is going to be covered:
● A bit about why you should test & validate your distributed programs
● “Normal” unit testing in Spark (then BEAM)
● Testing at scale(ish) in Spark (then BEAM)
● Validation - how to make simple validation rules & our current limitations
● Cute & scary pictures
○ I promise at least one panda and one cat
○ If you come to my talk tomorrow you may see some duplicate cat photos, I’m sorry.
Andrew
Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Possibly Familiar with one of Scala, Java, or Python
● Possibly Familiar with one of Spark, BEAM, or a similar system (but also ok if
not)
● Want to make better software
○ (or models, or w/e)
● Or just want to make software good enough to not have to keep your resume
up to date
So why should you test?
● Makes you a better person
● Avoid making your users angry
● Save $s
○ AWS (sorry I mean Google Cloud Whatever) is expensive
● Waiting for our jobs to fail is a pretty long dev cycle
● Repeating Holden’s mistakes is not fun (see miscategorized items)
● Honestly you came to the testing track so you probably already care
So why should you validate?
● You want to know when you're aboard the failboat
● Halt deployment, roll-back
● Our code will most likely fail
○ Sometimes data sources fail in new & exciting ways (see “Call me Maybe”)
○ That jerk on that other floor changed the meaning of a field :(
○ Our tests won’t catch all of the corner cases that the real world finds
● We should try and minimize the impact
○ Avoid making potentially embarrassing recommendations
○ Save having to be woken up at 3am to do a roll-back
○ Specifying a few simple invariants isn’t all that hard
○ Repeating Holden’s mistakes is still not fun
So why should you test & validate:
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
So why should you test & validate - cont
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
Why don’t we test?
● It’s hard
○ Faking data, setting up integration tests
● Our tests can get too slow
○ Packaging and building scala is already sad
● It takes a lot of time
○ and people always want everything done yesterday
○ or I just want to go home see my partner
○ Etc.
● Distributed systems are particularly hard
Why don’t we test? (continued)
Why don’t we validate?
● We already tested our code
● What could go wrong?
Also extra hard in distributed systems
● Distributed metrics are hard
● not much built in (not very consistent)
● not always deterministic
● Complicated production systems
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
A simple unit test with spark-testing-base
class SampleRDDTest extends FunSuite with SharedSparkContext {
test("really simple transformation") {
val input = List("hi", "hi holden", "bye")
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
}
}
A simple unit test with BEAM (no libs!)
PCollection<KV<String, Long>> filteredWords = p.apply(...)
List<KV<String, Long>> expectedResults = Arrays.asList(
KV.of("Flourish", 3L),
KV.of("stomach", 1L));
PAssert.that(filteredWords).containsInAnyOrder(expectedResults);
p.run().waitUntilFinish();
Where do those run?
● By default they run on your local host with a “local mode”
● Spark’s local mode attempts to simulate a “real” cluster
○ Attempts but it is not perfect
● BEAM’s local mode is a “DirectRunner”
○ This is super fast
○ But think of it as more like a mock than a test env
● You can point either to a “local” cluster
○ Feeling fancy? Use docker
○ Feeling not-so-fancy? Run worker and master on localhost…
○ Note: with BEAM different runners have different levels of support so choose the one matching
production
Andréia Bohner
So: setup.sh (or travis.yml or…)
pushd spark-2.2.0-bin-hadoop2.7
./sbin/start-master.sh -h localhost
./sbin/start-slave.sh spark://localhost:7077
export SPARK_MASTER="spark://localhost:7077"
Andréia Bohner
Ok but what about problems @ scale
● Maybe our program works fine on our local sized input
● Our actual workload is probably larger than 3kb
● We need to test partitioning logic, scaling, reading, writing, serializing
But how do we test workloads too large for a single machine?
(We can’t just use parallelize and collect) and we need to run on something
more like a cluster
Qfamily
Spark: Distributed “set” operations to the rescue
● Pretty close - already built into Spark
● Doesn’t do so well with floating points :(
○ damn floating points keep showing up everywhere :p
● Doesn’t really handle duplicates very well (see the sketch after this list)
○ {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations…
● Or use RDDComparisons in your favourite** testing library
○ will cogroup / zip as appropriate to compare the results. Asserts when people steal your coffee
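To make the idea concrete, here is a minimal sketch of the set-style comparison (not the spark-testing-base API); as noted above it ignores duplicate counts:
import org.apache.spark.rdd.RDD
// Rough sketch: treat both RDDs as sets and check mutual containment with subtract.
// Duplicates are ignored - exactly the limitation called out above.
def equalAsSets[T](expected: RDD[T], result: RDD[T]): Boolean = {
  expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()
}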
Matti Mattila
Betsy Weber
BEAM: CoGroupByKey + DoFn + PAssert
● CoGroup dataset with the dataset for the expected result
● Filter out records where everything matches
● PAssert that the result is empty
Matti Mattila
Betsy Weber
But where do we get the data for those tests?
● Most people generate data by hand
● If you have production data you can sample, you are lucky!
○ If possible you can try and save it in the same format
● If our data is a bunch of Vectors or Doubles, Spark’s got tools :)
● Coming up with good test data can take a long time
● Important to test different distributions, input files, empty partitions etc.
Lori Rielly
Property generating libs: QuickCheck / ScalaCheck
● QuickCheck (haskell) generates test data under a set of constraints
● Scala version is ScalaCheck - supported by the two unit testing libraries for
Spark
● Sscheck (scala check for spark)
○ Awesome people*, supports generating DStreams too!
● spark-testing-base
○ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
*I assume
PROtara hunt
With spark-testing-base
test("map should not change number of elements") {
forAll(RDDGenerator.genRDD[String](sc)){
rdd => rdd.map(_.length).count() == rdd.count()
}
}
With spark-testing-base & a million entries
test("map should not change number of elements") {
implicit val generatorDrivenConfig =
PropertyCheckConfig(minSize = 0, maxSize = 1000000)
val property = forAll(RDDGenerator.genRDD[String](sc)){
rdd => rdd.map(_.length).count() == rdd.count()
}
check(property)
}
But that can get a bit slow for all of our tests
● Not all of your tests should need a cluster (or even a cluster simulator)
● If you are ok with not using lambdas everywhere you can factor out that logic
and test normally
● Or if you want to keep those lambdas - or verify the transformation logic
without the overhead of running a local distributed system - you can try a
library like kontextfrei
○ Don’t rely on this alone (but can work well with something like scalacheck)
e.g. Lambdas aren’t always your friend
● Lambdas can make finding the error more challenging
● I love lambda x, y: x / y as much as the next human but
when y is zero :(
● A small bit of refactoring for your tests never hurt*
● Difficult to put logs inside of them
*A blatant lie, but…. it hurts less often than it helps
Zoli Juhasz
“Business logic” only w/o - refactor for testing
def difficultTokenizeRDD(input: RDD[String]) = {
  input.flatMap(line => line.split(" "))
}
=>
def tokenizeRDD(input: RDD[String]) = {
  input.flatMap(tokenize)
}
protected[tokenize] def tokenize(line: String) = {
  line.split(" ")
}
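Once the per-line logic is factored out, it can be tested without a SparkContext at all. A tiny sketch, assuming the same ScalaTest FunSuite style as the earlier examples:
test("tokenize splits on spaces") {
  assert(tokenize("hi holden").toList === List("hi", "holden"))
}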
Stanley
Wood
…. And Spark Datasets
● We can do the same as we did for RDDs (.rdd)
● Inside of Spark, validation looks like:
def checkAnswer(df: Dataset[T], expectedAnswer: T*)
● Sadly it’s not in a published package & local only
● instead we expose:
def equalDatasets(expected: Dataset[U], result: Dataset[U]) {
def approxEqualDatasets(e: Dataset[U], r: Dataset[U], tol: Double) {
Houser Wolf
This is what it looks like:
test("dataframe should be equal to its self") {
val sqlCtx = sqlContext
import sqlCtx.implicits._// Yah I know this is ugly
val input = sc.parallelize(inputList).toDF
equalDataFrames(input, input)
}
wenliang chen
Or with a generator based on Schema*:
test("assert rows' types like schema type") {
val schema = StructType(List(StructField("name", StringType)))
val rowGen: Gen[Row] =
DataframeGenerator.getRowGenerator(schema)
val property =
forAll(rowGen) {
row => row.get(0).isInstanceOf[String]
}
check(property)
}
*For simple schemas, complex types in future versions
thelittleone417
Which has “built-in” large support :)
Testing streaming….
Photo by Steve Jurvetson
Testing streaming the happy panda way
● Creating test data is hard
● Collecting the data locally is ugly
● Figuring out when your test is “done” is tricky
Let’s abstract all that away into testOperation
A simple (non-scalable) stream test:
test("really simple transformation") {
val input = List(List("hi"), List("hi holden"), List("bye"))
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
testOperation[String, String](input, tokenize _, expected, useSet = true)
}
Photo by An eye
for my mind
And putting the two together: Structured Streaming
Structured data, in real time, new API!
What could go wrong?
Simplest test, it sort of works! (1/2)
// Input data source
val input = MemoryStream[Int]
// Your business logic goes here. Much business very logic
val transformed = businessLogic(input.toDF())
val query = transformed.writeStream
.format("memory")
.outputMode("append")
.queryName("memStream")
.start()
stanze
Simplest test, it sort of works! (2/2)
// Add some data
input.addData(1, 2, 3)
query.processAllAvailable()
val resultRows = spark.table("memStream").as[T]
// Now make it into a data frame and check as previously.
Robert Lorenz
ooor...
def businessLogic(input: Dataset[Int]): Dataset[String] = {
// Business logic goes here
}
testSimpleStreamEndState(spark, input, expected, "append",
businessLogic)
Robert Lorenz
Photo by allison
On to validation*
*Can be used during integration tests to further validate integration results
So how do we validate our jobs?
● Both BEAM & Spark has it own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w. execution time, etc.
○ In UI can also register a listener from spark validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting
option
● We can write rules for whether the values are as expected
○ Simple rules (X > J)
■ The number of records should be greater than 0
○ Historic rules (X > Avg(last(10, J))) - rough sketch below
■ We need to keep track of our previous values - but this can be great for debugging & performance investigation too.
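A rough sketch of what the simple and historic rules above could look like as plain functions (not the spark-validator API; how you persist counter values from earlier runs is up to you):
// Simple rule: X > J
def simpleRule(value: Long, minimum: Long): Boolean = value > minimum
// Historic rule: X > Avg(last(10, J)); passes on the first run when there is no history.
def historicRule(value: Long, previousValues: Seq[Long], window: Int = 10): Boolean = {
  val recent = previousValues.takeRight(window)
  recent.isEmpty || value > (recent.sum.toDouble / recent.size)
}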
Photo by:
Paul Schadler
Validation
● For now, checking file sizes & execution time seems like the most common best practice (from survey)
● spark-validator is still in early stages and not ready for production use but
interesting proof of concept
● Doesn’t need to be done in your Spark job (can be done in your scripting language of choice with whatever job control system you are using) - rough sketch below
● Sometimes your rules will misfire and you’ll need to manually approve a job - that is ok!
● Remember those property tests? Could be great Validation rules!
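For the common “check the output size outside the job” pattern mentioned above, a rough sketch (the path and the 1 MB threshold are made-up examples; assumes the output lands on a Hadoop-compatible filesystem):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
// Fail the pipeline if the output directory is missing or suspiciously small.
def outputLooksSane(pathStr: String, minBytes: Long = 1L << 20): Boolean = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(new Configuration())
  fs.exists(path) && fs.getContentSummary(path).getLength >= minBytes
}
// e.g. outputLooksSane("/data/myjob/output/2018-02-03")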
Photo by:
Paul Schadler
So using names & logging & accs could be:
data = sc.parallelize(range(10))
rejectedCount = sc.accumulator(0)
def loggedDivZero(x):
    import logging
    try:
        return [x / 0]
    except Exception as e:
        rejectedCount.add(1)
        logging.warning("Error found " + repr(e))
        return []
transform1 = data.flatMap(loggedDivZero)
transform2 = transform1.map(add1)
transform2.count()
print("Reject " + str(rejectedCount.value))
Using a Spark accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map{ x => if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation
Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
List[ValidationRule](
new AbsoluteSparkCounterValidationRule("recordsRead", Some(30),
Some(1000)))
)
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
//Business logic goes here
assert(v.validate(5) === true)
}
Photo by Dvortygirl
Counters in BEAM: (1 of 2)
private final Counter matchedWords =
Metrics.counter(FilterTextFn.class, "matchedWords");
private final Counter unmatchedWords =
Metrics.counter(FilterTextFn.class, "unmatchedWords");
// Your special business logic goes here (aka shell out to Fortran or Cobol)
Luke Jones
Counters in BEAM: (2 of 2)
Long matchedWordsValue = metrics.metrics().queryMetrics(
new MetricsFilter.Builder()
.addNameFilter("matchedWords")).counters().next().committed();
Long unmatchedWordsValue = metrics.metrics().queryMetrics(
new MetricsFilter.Builder()
.addNameFilter("unmatchedWords")).counters().next().committed();
assertThat("unmatchWords less than matched words",
unmatchedWordsValue,
lessThan(matchedWordsValue));
Luke Jones
Related talks & blog posts
● Testing Spark Best Practices (Spark Summit 2014)
● Every Day I’m Shuffling (Strata 2015) & slides
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base
● Testing strategy for Apache Spark jobs
● The BEAM programming guide
Becky Lai
Related packages
● spark-testing-base: https://github.com/holdenk/spark-testing-base
● sscheck: https://github.com/juanrh/sscheck
● spark-validator: https://github.com/holdenk/spark-validator *Proof of
concept, but updated*
● spark-perf - https://github.com/databricks/spark-perf
● spark-integration-tests - https://github.com/databricks/spark-integration-tests
● scalacheck - https://www.scalacheck.org/
Becky Lai
And including spark-testing-base up to spark 2.2
sbt:
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test"
Maven:
<dependency>
<groupId>com.holdenkarau</groupId>
<artifactId>spark-testing-base_2.11</artifactId>
<version>${spark.version}_0.8.0</version>
<scope>test</scope>
</dependency>
Vladimir Pustovit
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
And some upcoming talks:
● FOSDEM HPC room tomorrow @ 4pm
○ Maybe office hours after that if there is interest?
● JFokus
● Strata San Jose
● Strata London
● QCon Brasil
● QCon AI SF
● Know of interesting conferences/webinar things that
should be on my radar? Let me know!
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you want to fill out survey:
http://bit.ly/holdenTestingSpark
I will use the updated results & give the talk again the next time Spark adds a major feature.
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
Other options for generating data:
● mapPartitions + Random + custom code (rough sketch after the RandomRDDs example below)
● RandomRDDs in mllib
○ Uniform, Normal, Poisson, Exponential, Gamma, logNormal & Vector versions
○ Different type: implement the RandomDataGenerator interface
● Random
RandomRDDs
val zipRDD = RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows).map(_.toInt.toString)
val valuesRDD = RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols).repartition(zipRDD.partitions.size)
val keyRDD = sc.parallelize(1L.to(rows), zipRDD.getNumPartitions)
keyRDD.zipPartitions(zipRDD, valuesRDD){
(i1, i2, i3) =>
new Iterator[(Long, String, Vector)] {
...
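And a rough sketch of the “mapPartitions + Random + custom code” option from the list above (made-up sizes; seeded per partition so runs are reproducible):
import scala.util.Random
// 10 partitions of random 10-character strings, seeded by partition index.
val randomStrings = sc.parallelize(1 to 1000, numSlices = 10).mapPartitionsWithIndex {
  (index, elements) =>
    val rand = new Random(index)
    elements.map(_ => rand.alphanumeric.take(10).mkString)
}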
Testing libraries:
● Spark unit testing
○ spark-testing-base - https://github.com/holdenk/spark-testing-base
○ sscheck - https://github.com/juanrh/sscheck
● Simplified unit testing (“business logic only”)
○ kontextfrei - https://github.com/dwestheide/kontextfrei *
● Integration testing
○ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
● Performance
○ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
● Spark job validation
○ spark-validator - https://github.com/holdenk/spark-validator *
Photo by Mike Mozart
*Early stage or work-in progress, or proof of concept
Let’s talk about local mode
● It’s way better than you would expect*
● It does its best to try and catch serialization errors
● It’s still not the same as running on a “real” cluster
● Especially since, if we only needed local mode, parallelize and collect might be
fine
Photo by: Bev Sykes
Options beyond local mode:
● Just point at your existing cluster (set master)
● Start one with your shell scripts & change the master
○ Really easy way to plug into existing integration testing
● spark-docker - hack in our own tests
● YarnMiniCluster
○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
○ In Spark Testing Base extend SharedMiniCluster
■ Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+)
Photo by Richard Masoner
Integration testing - docker is awesome
● Spark-docker, kafka-docker, etc.
○ Not always super up to date sadly - if you are on the last stable release, A-OK; if you build from master - sad pandas
● Or check out Juju Charms (from Canonical) - https://jujucharms.com/
○ Makes it easy to deploy a bunch of docker containers together, configured in a reasonable way.
Setting up integration on Yarn/Mesos
● So lucky!
● You can write your tests in the same way as before - just read from your test
data sources
● Missing a data source?
○ Can you sample it or fake it using the techniques from before?
○ If so - do that and save the result to your integration environment
○ If not… well good luck
● Need streaming integration?
○ You will probably need a second Spark (or other) job to generate the test data
“Business logic” only test w/kontextfrei
import com.danielwestheide.kontextfrei.DCollectionOps
trait UsersByPopularityProperties[DColl[_]] extends
BaseSpec[DColl] {
import DCollectionOps.Imports._
property("Each user appears only once") {
forAll { starredEvents: List[RepoStarred] =>
val result =
logic.usersByPopularity(unit(starredEvents)).collect().toList
result.distinct mustEqual result
}
}
… (continued in example/src/test/scala/com/danielwestheide/kontextfrei/example/)
Generating Data with Spark
import org.apache.spark.mllib.random.RandomRDDs
...
RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows)
RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols)
More Related Content

What's hot

Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!Holden Karau
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Holden Karau
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...Holden Karau
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Holden Karau
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Holden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache SparkHolden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Holden Karau
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopHolden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Validating big data jobs - Spark AI Summit EU
Validating big data jobs  - Spark AI Summit EUValidating big data jobs  - Spark AI Summit EU
Validating big data jobs - Spark AI Summit EUHolden Karau
 
Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016Holden Karau
 

What's hot (20)

Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
Streaming ML on Spark: Deprecated, experimental and internal ap is galore!
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Validating big data jobs - Spark AI Summit EU
Validating big data jobs  - Spark AI Summit EUValidating big data jobs  - Spark AI Summit EU
Validating big data jobs - Spark AI Summit EU
 
Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016
 

Similar to Testing and validating distributed systems with Apache Spark and Apache Beam (fosdem 2018) (1)

Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Holden Karau
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauSpark Summit
 
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)Holden Karau
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...Databricks
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015Holden Karau
 
Validating big data pipelines - Scala eXchange 2018
Validating big data pipelines -  Scala eXchange 2018Validating big data pipelines -  Scala eXchange 2018
Validating big data pipelines - Scala eXchange 2018Holden Karau
 
Validating Big Data Pipelines - Big Data Spain 2018
Validating Big Data Pipelines - Big Data Spain 2018Validating Big Data Pipelines - Big Data Spain 2018
Validating Big Data Pipelines - Big Data Spain 2018Holden Karau
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Holden Karau
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Holden Karau
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
Validating big data pipelines - FOSDEM 2019
Validating big data pipelines -  FOSDEM 2019Validating big data pipelines -  FOSDEM 2019
Validating big data pipelines - FOSDEM 2019Holden Karau
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...Holden Karau
 
Validating spark ml jobs stopping failures before production on Apache Spark ...
Validating spark ml jobs stopping failures before production on Apache Spark ...Validating spark ml jobs stopping failures before production on Apache Spark ...
Validating spark ml jobs stopping failures before production on Apache Spark ...Holden Karau
 
Developer Tests - Things to Know
Developer Tests - Things to KnowDeveloper Tests - Things to Know
Developer Tests - Things to KnowVaidas Pilkauskas
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Developer Tests - Things to Know (Vilnius JUG)
Developer Tests - Things to Know (Vilnius JUG)Developer Tests - Things to Know (Vilnius JUG)
Developer Tests - Things to Know (Vilnius JUG)vilniusjug
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsDataWorks Summit
 

Similar to Testing and validating distributed systems with Apache Spark and Apache Beam (fosdem 2018) (1) (20)

Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden Karau
 
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
 
Validating big data pipelines - Scala eXchange 2018
Validating big data pipelines -  Scala eXchange 2018Validating big data pipelines -  Scala eXchange 2018
Validating big data pipelines - Scala eXchange 2018
 
Validating Big Data Pipelines - Big Data Spain 2018
Validating Big Data Pipelines - Big Data Spain 2018Validating Big Data Pipelines - Big Data Spain 2018
Validating Big Data Pipelines - Big Data Spain 2018
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Validating big data pipelines - FOSDEM 2019
Validating big data pipelines -  FOSDEM 2019Validating big data pipelines -  FOSDEM 2019
Validating big data pipelines - FOSDEM 2019
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...
 
Validating spark ml jobs stopping failures before production on Apache Spark ...
Validating spark ml jobs stopping failures before production on Apache Spark ...Validating spark ml jobs stopping failures before production on Apache Spark ...
Validating spark ml jobs stopping failures before production on Apache Spark ...
 
Developer Tests - Things to Know
Developer Tests - Things to KnowDeveloper Tests - Things to Know
Developer Tests - Things to Know
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Developer Tests - Things to Know (Vilnius JUG)
Developer Tests - Things to Know (Vilnius JUG)Developer Tests - Things to Know (Vilnius JUG)
Developer Tests - Things to Know (Vilnius JUG)
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
 

Recently uploaded

Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Lucknow
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 

Recently uploaded (20)

Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 

Testing and validating distributed systems with Apache Spark and Apache Beam (fosdem 2018) (1)

  • 1. Testing & Validation for Distributed Systems with Apache Spark & BEAM Now mostly “works”* Melinda Seckington
  • 2. Holden: ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC (as of December!) :) ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Spark Videos http://bit.ly/holdenSparkVideos
  • 3.
  • 4. What is going to be covered: ● A bit about why you should test & validate your distributed programs ● “Normal” unit testing in Spark (then BEAM) ● Testing at scale(ish) in Spark (then BEAM) ● Validation - how to make simple validation rules & our current limitations ● Cute & scary pictures ○ I promise at least one panda and one cat ○ If you come to my talk tomorrow you may see some duplicate cat photos, I’m sorry. Andrew
  • 5. Who I think you wonderful humans are? ● Nice* people ● Like silly pictures ● Possibly Familiar with one of Scala, Java, or Python ● Possibly Familiar with one of Spark, BEAM, or a similar system (but also ok if not) ● Want to make better software ○ (or models, or w/e) ● Or just want to make software good enough to not have to keep your resume up to date
  • 6. So why should you test? ● Makes you a better person ● Avoid making your users angry ● Save $s ○ AWS (sorry I mean Google Cloud Whatever) is expensive ● Waiting for our jobs to fail is a pretty long dev cycle ● Repeating Holden’s mistakes is not fun (see misscategorized items) ● Honestly you came to the testing track so you probably already care
  • 7. So why should you validate? ● You want to know when you're aboard the failboat ● Halt deployment, roll-back ● Our code will most likely fail ○ Sometimes data sources fail in new & exciting ways (see “Call me Maybe”) ○ That jerk on that other floor changed the meaning of a field :( ○ Our tests won’t catch all of the corner cases that the real world finds ● We should try and minimize the impact ○ Avoid making potentially embarrassing recommendations ○ Save having to be woken up at 3am to do a roll-back ○ Specifying a few simple invariants isn’t all that hard ○ Repeating Holden’s mistakes is still not fun
  • 8. So why should you test & validate: Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
  • 9. So why should you test & validate - cont Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
  • 10. Why don’t we test? ● It’s hard ○ Faking data, setting up integration tests ● Our tests can get too slow ○ Packaging and building scala is already sad ● It takes a lot of time ○ and people always want everything done yesterday ○ or I just want to go home see my partner ○ Etc. ● Distributed systems is particularly hard
  • 11. Why don’t we test? (continued)
  • 12. Why don’t we validate? ● We already tested our code ● What could go wrong? Also extra hard in distributed systems ● Distributed metrics are hard ● not much built in (not very consistent) ● not always deterministic ● Complicated production systems
  • 13. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
  • 14. A simple unit test with spark-testing-base class SampleRDDTest extends FunSuite with SharedSparkContext { test("really simple transformation") { val input = List("hi", "hi holden", "bye") val expected = List(List("hi"), List("hi", "holden"), List("bye")) assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected) } }
  • 15. A simple unit test with BEAM (no libs!) PCollection<KV<String, Long>> filteredWords = p.apply(...) List<KV<String, Long>> expectedResults = Arrays.asList( KV.of("Flourish", 3L), KV.of("stomach", 1L)); PAssert.that(filteredWords).containsInAnyOrder(expectedResults); p.run().waitUntilFinish();
  • 16. Where do those run? ● By default your local host with a “local mode” ● Spark’s local mode attempts to simulate a “real” cluster ○ Attempts but it is not perfect ● BEAM’s local mode is a “DirectRunner” ○ This is super fast ○ But think of it as more like a mock than a test env ● You can point either to a “local” cluster ○ Feeling fancy? Use docker ○ Feeling not-so-fancy? Run worker and master on localhost… ○ Note: with BEAM different runners have different levels of support so choose the one matching production Andréia Bohner
  • 17. So: setup.sh (or travis.yml or…) pushd spark-2.2.0-bin-hadoop2.7 ./sbin/start-master.sh -h localhost ./sbin/start-slave.sh spark://localhost:7077 export SPARK_MASTER="spark://localhost:7077" Andréia Bohner
  • 18. Ok but what about problems @ scale ● Maybe our program works fine on our local sized input ● Our actual workload is probably larger than 3kb ● We need to test partitioning logic, scaling, reading writing, serializing But how do we test workloads too large for a single machine? (We can’t just use parallelize and collect) and we need to run on something more like a cluster Qfamily
  • 19. Spark: Distributed “set” operations to the rescue ● Pretty close - already built into Spark ● Doesn’t do so well with floating points :( ○ damn floating points keep showing up everywhere :p ● Doesn’t really handle duplicates very well ○ {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations… ● Or use RDDComparisions in your favourite** testing library ○ will cogroup / zip as appropriate to compare the results. Asserts when people steal your coffee Matti Mattila Betsy Weber
  • 20. BEAM: CoGroupByKey + DoFn + PAssert ● CoGroup dataset with the dataset for the expected result ● Filter out records where everything matches ● PAssert that the result is empty Matti Mattila Betsy Weber
  • 21. But where do we get the data for those tests? ● Most people generate data by hand ● If you have production data you can sample you are lucky! ○ If possible you can try and save in the same format ● If our data is a bunch of Vectors or Doubles Spark’s got tools :) ● Coming up with good test data can take a long time ● Important to test different distributions, input files, empty partitions etc. Lori Rielly
  • 22. Property generating libs: QuickCheck / ScalaCheck ● QuickCheck (haskell) generates tests data under a set of constraints ● Scala version is ScalaCheck - supported by the two unit testing libraries for Spark ● Sscheck (scala check for spark) ○ Awesome people*, supports generating DStreams too! ● spark-testing-base ○ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs *I assume PROtara hunt
  • 23. With spark-testing-base test("map should not change number of elements") { forAll(RDDGenerator.genRDD[String](sc)){ rdd => rdd.map(_.length).count() == rdd.count() } }
  • 24. With spark-testing-base & a million entries test("map should not change number of elements") { implicit val generatorDrivenConfig = PropertyCheckConfig(minSize = 0, maxSize = 1000000) val property = forAll(RDDGenerator.genRDD[String](sc)){ rdd => rdd.map(_.length).count() == rdd.count() } check(property) }
  • 25. But that can get a bit slow for all of our tests ● Not all of your tests should need a cluster (or even a cluster simulator) ● If you are ok with not using lambdas everywhere you can factor out that logic and test normally ● Or if you want to keep those lambdas - or verify the transformations logic without the overhead of running a local distributed systems you can try a library like kontextfrei ○ Don’t rely on this alone (but can work well with something like scalacheck)
  • 26. e.g. Lambdas aren’t always your friend ● Lambda’s can make finding the error more challenging ● I love lambda x, y: x / y as much as the next human but when y is zero :( ● A small bit of refactoring for your tests never hurt* ● Difficult to put logs inside of them *A blatant lie, but…. it hurts less often than it helps Zoli Juhasz
  • 27. “Business logic” only w/o - refactor for testing def difficultTokenizeRDD(input: RDD[String]) = { input.flatMap(line =>line.split(" ")) } => def tokenizeRDD(input: RDD[String]) = { input.flatMap(tokenize) } protected[tokenize] def tokenize(line: String) = { line.split(" ") } Stanley Wood
  • 28. …. And Spark Datasets ● We can do the same as we did for RDD’s (.rdd) ● Inside of Spark validation looks like: def checkAnswer(df: Dataset[T], expectedAnswer: T*) ● Sadly it’s not in a published package & local only ● instead we expose: def equalDatasets(expected: Dataset[U], result: Dataset[U]) { def approxEqualDatasets(e: Dataset[U], r: Dataset[U], tol: Double) { Houser Wolf
  • 29. This is what it looks like: test("dataframe should be equal to its self") { val sqlCtx = sqlContext import sqlCtx.implicits._// Yah I know this is ugly val input = sc.parallelize(inputList).toDF equalDataFrames(input, input) } wenliang chen
  • 30. Or with a generator based on Schema*: test("assert rows' types like schema type") { val schema = StructType(List(StructField("name", StringType))) val rowGen: Gen[Row] = DataframeGenerator.getRowGenerator(schema) val property = forAll(rowGen) { row => row.get(0).isInstanceOf[String] } check(property) } *For simple schemas, complex types in future versions thelittleone417
• 31. Which has “built-in” support for large datasets :)
  • 33. Testing streaming the happy panda way ● Creating test data is hard ● Collecting the data locally is ugly ● figuring out when your test is “done” Let’s abstract all that away into testOperation
• 34. A simple (non-scalable) stream test:
test("really simple transformation") {
  val input = List(List("hi"), List("hi holden"), List("bye"))
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  testOperation[String, String](input, tokenize _, expected, useSet = true)
}
Photo by An eye for my mind
  • 35. And putting the two together: Structured Streaming Structured data, in real time, new api! What could go wrong?
• 36. Simplest test, it sort of works! (1/2)
// Input data source
val input = MemoryStream[Int]
// Your business logic goes here. Much business very logic
val transformed = businessLogic(input.toDF())
val query = transformed.writeStream
  .format("memory")
  .outputMode("append")
  .queryName("memStream")
  .start()
stanze
• 37. Simplest test, it sort of works! (2/2)
// Add some data
input.addData(1, 2, 3)
query.processAllAvailable()
val resultRows = spark.table("memStream").as[T]
// Now make it into a data frame and check as previously.
Robert Lorenz
• 38. ooor...
def businessLogic(input: Dataset[Int]): Dataset[String] = {
  // Business logic goes here
}
testSimpleStreamEndState(spark, input, expected, "append", businessLogic)
Robert Lorenz
  • 40. On to validation* *Can be used during integration tests to further validate integration results
• 41. So how do we validate our jobs? ● Both BEAM & Spark have their own counters ○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc. ○ Visible in the UI; you can also register a listener (see the spark-validator project) ● We can add counters for things we care about ○ invalid records, users with no recommendations, etc. ○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● We can write rules for whether the values are as expected ○ Simple rules (X > J) ■ The number of records should be greater than 0 ○ Historic rules (X > Avg(last(10, J))) ■ We need to keep track of our previous values - but this can be great for debugging & performance investigation too. Photo by: Paul Schadler
• 42. Validation ● For now checking file sizes & execution time seem like the most common best practice (from survey) ● spark-validator is still in early stages and not ready for production use, but it is an interesting proof of concept ● Doesn’t need to be done in your Spark job (can be done in your scripting language of choice with whatever job control system you are using) ● Sometimes your rules will misfire and you’ll need to manually approve a job - that is ok! ● Remember those property tests? Could be great Validation rules! Photo by: Paul Schadler
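A historic rule doesn’t need a framework; here is a minimal sketch of the “X > Avg(last(10, J))” idea using a plain text file of previous counts (the path, file format and 0.5 threshold are assumptions for illustration, not spark-validator’s API):
// A minimal sketch of a historic validation rule: compare this run's record count
// against the average of the last 10 accepted runs, stored one count per line.
// The path, format and 0.5 threshold are assumptions for illustration.
import java.io.{File, FileWriter}
import scala.io.Source

def recordCountLooksSane(currentCount: Long,
                         historyPath: String = "recordsRead-history.txt"): Boolean = {
  val history = if (new File(historyPath).exists()) {
    val src = Source.fromFile(historyPath)
    try src.getLines().map(_.toLong).toList finally src.close()
  } else Nil
  val recent = history.takeRight(10)
  val ok = recent.isEmpty || currentCount > 0.5 * (recent.sum.toDouble / recent.size)
  if (ok) {
    // Only record runs we accepted, so a bad run doesn't drag down the baseline.
    val writer = new FileWriter(historyPath, true)
    try writer.write(currentCount + "\n") finally writer.close()
  }
  ok
}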
• 43. So using names & logging & accs could be:
data = sc.parallelize(range(10))
rejectedCount = sc.accumulator(0)
def loggedDivZero(x):
    import logging
    try:
        return [x / 0]
    except Exception as e:
        rejectedCount.add(1)
        logging.warning("Error found " + repr(e))
        return []
transform1 = data.flatMap(loggedDivZero)
transform2 = transform1.map(add1)
transform2.count()
print("Reject " + str(rejectedCount.value))
• 44. Using a Spark accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map{ x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
P.S.: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation
• 45. Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000)))
)
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Business logic goes here
assert(v.validate(5) === true)
Photo by Dvortygirl
• 46. Counters in BEAM: (1 of 2)
private final Counter matchedWords =
  Metrics.counter(FilterTextFn.class, "matchedWords");
private final Counter unmatchedWords =
  Metrics.counter(FilterTextFn.class, "unmatchedWords");
// Your special business logic goes here (aka shell out to Fortran or Cobol)
Luke Jones
• 47. Counters in BEAM: (2 of 2)
Long matchedWordsValue = metrics.metrics().queryMetrics(
  new MetricsFilter.Builder()
    .addNameFilter("matchedWords")).counters().next().committed();
Long unmatchedWordsValue = metrics.metrics().queryMetrics(
  new MetricsFilter.Builder()
    .addNameFilter("unmatchedWords")).counters().next().committed();
assertThat("unmatchWords less than matched words",
  unmatchedWordsValue, lessThan(matchedWordsValue));
Luke Jones
  • 48. Related talks & blog posts ● Testing Spark Best Practices (Spark Summit 2014) ● Every Day I’m Shuffling (Strata 2015) & slides ● Spark and Spark Streaming Unit Testing ● Making Spark Unit Testing With Spark Testing Base ● Testing strategy for Apache Spark jobs ● The BEAM programming guide Becky Lai
  • 49. Related packages ● spark-testing-base: https://github.com/holdenk/spark-testing-base ● sscheck: https://github.com/juanrh/sscheck ● spark-validator: https://github.com/holdenk/spark-validator *Proof of concept, but updated* ● spark-perf - https://github.com/databricks/spark-perf ● spark-integration-tests - https://github.com/databricks/spark-integration-tests ● scalacheck - https://www.scalacheck.org/ Becky Lai
• 50. And including spark-testing-base up to spark 2.2
sbt:
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test"
Maven:
<dependency>
  <groupId>com.holdenkarau</groupId>
  <artifactId>spark-testing-base_2.11</artifactId>
  <version>${spark.version}_0.8.0</version>
  <scope>test</scope>
</dependency>
Vladimir Pustovit
• 51. Learning Spark ● Fast Data Processing with Spark (Out of Date) ● Fast Data Processing with Spark (2nd edition) ● Advanced Analytics with Spark ● Spark in Action ● High Performance Spark ● Learning PySpark
  • 52. High Performance Spark! Available today! You can buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee. http://bit.ly/hkHighPerfSpark
  • 53. And some upcoming talks: ● FOSDEM HPC room tomorrow @ 4pm ○ Maybe office hours after that if there is interest? ● JFokus ● Strata San Jose ● Strata London ● QCon Brasil ● QCon AI SF ● Know of interesting conferences/webinar things that should be on my radar? Let me know!
• 54. Cat wave photo by Quinn Dombrowski k thnx bye! If you want to fill out the survey: http://bit.ly/holdenTestingSpark I will use the updated results & give the talk again the next time Spark adds a major feature. Give feedback on this presentation http://bit.ly/holdenTalkFeedback
• 55. Other options for generating data: ● mapPartitions + Random + custom code (sketched below) ● RandomRDDs in mllib ○ Uniform, Normal, Poisson, Exponential, Gamma, logNormal & Vector versions ○ Different types: implement the RandomDataGenerator interface ● Random
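The mapPartitions + Random option can look something like this sketch (the (id, zip, score) record shape and the sizes are made up):
// A minimal sketch of the mapPartitions + Random approach: one Random per partition,
// seeded with the partition index so runs are repeatable, plus whatever custom
// record-building code you need. The (id, zip, score) shape and sizes are made up.
import scala.util.Random

val rows = 10000
val numPartitions = 10
val fakeData = sc.parallelize(1 to rows, numPartitions).mapPartitionsWithIndex {
  (partitionIndex, ids) =>
    val rand = new Random(partitionIndex) // deterministic per partition
    ids.map { id =>
      (id.toLong, rand.nextInt(99999).toString, rand.nextGaussian())
    }
}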
• 56. RandomRDDs
val zipRDD = RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows)
  .map(_.toInt.toString)
val valuesRDD = RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols)
  .repartition(zipRDD.partitions.size)
val keyRDD = sc.parallelize(1L.to(rows), zipRDD.getNumPartitions)
keyRDD.zipPartitions(zipRDD, valuesRDD){ (i1, i2, i3) =>
  new Iterator[(Long, String, Vector)] {
    ...
• 57. Testing libraries: ● Spark unit testing ○ spark-testing-base - https://github.com/holdenk/spark-testing-base ○ sscheck - https://github.com/juanrh/sscheck ● Simplified unit testing (“business logic only”) ○ kontextfrei - https://github.com/dwestheide/kontextfrei * ● Integration testing ○ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests ● Performance ○ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf ● Spark job validation ○ spark-validator - https://github.com/holdenk/spark-validator * Photo by Mike Mozart *Early stage, work in progress, or proof of concept
• 58. Let’s talk about local mode ● It’s way better than you would expect* ● It does its best to try and catch serialization errors ● It’s still not the same as running on a “real” cluster ● Especially since, if local mode were enough, parallelize and collect might be fine Photo by: Bev Sykes
  • 59. Options beyond local mode: ● Just point at your existing cluster (set master) ● Start one with your shell scripts & change the master ○ Really easy way to plug into existing integration testing ● spark-docker - hack in our own tests ● YarnMiniCluster ○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/ BaseYarnClusterSuite.scala ○ In Spark Testing Base extend SharedMiniCluster ■ Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+) Photo by Richard Masoner
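A minimal sketch of the SharedMiniCluster route, assuming spark-testing-base’s trait lives at com.holdenkarau.spark.testing and exposes sc the same way SharedSparkContext does (and that your Spark version already includes SPARK-10812):
// A minimal sketch, assuming SharedMiniCluster is on the test classpath and provides `sc`.
import com.holdenkarau.spark.testing.SharedMiniCluster
import org.scalatest.FunSuite

class MiniClusterWordLengthTest extends FunSuite with SharedMiniCluster {
  test("word lengths survive a trip through the mini cluster") {
    val input = sc.parallelize(List("hi", "panda", "bye"))
    assert(input.map(_.length).collect().toList === List(2, 5, 3))
  }
}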
• 60. Integration testing - docker is awesome ● Spark-docker, kafka-docker, etc. ○ Not always super up to date sadly - if you are on the last stable release you’re A-OK, if you build from master - sad pandas ● Or checkout JuJu Charms (from Canonical) - https://jujucharms.com/ ○ Makes it easy to deploy a bunch of docker containers together & configured in a reasonable way.
• 61. Setting up integration on Yarn/Mesos ● So lucky! ● You can write your tests in the same way as before - just read from your test data sources ● Missing a data source? ○ Can you sample it or fake it using the techniques from before? ○ If so - do that and save the result to your integration environment ○ If not… well good luck ● Need streaming integration? ○ You will probably need a second Spark (or other) job to generate the test data
• 62. “Business logic” only test w/kontextfrei
import com.danielwestheide.kontextfrei.DCollectionOps

trait UsersByPopularityProperties[DColl[_]] extends BaseSpec[DColl] {
  import DCollectionOps.Imports._
  property("Each user appears only once") {
    forAll { starredEvents: List[RepoStarred] =>
      val result = logic.usersByPopularity(unit(starredEvents)).collect().toList
      result.distinct mustEqual result
    }
  }
  …
(continued in example/src/test/scala/com/danielwestheide/kontextfrei/example/)
• 63. Generating Data with Spark
import org.apache.spark.mllib.random.RandomRDDs
...
RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows)
RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols)