Beyond Parallelize & Collect
(Effective testing of Spark Programs)
Now mostly “works”*
*See developer for details. Does not imply warranty. :p
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Software Engineer
● Currently at IBM, previously at Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of Learning Spark & Fast Data Processing with Spark
● @holdenkarau
● Slide share
● Linkedin
● Spark Videos
What is going to be covered:
● What I think I might know about you
● A bit about why you should test your programs
● Using parallelize & collect for unit testing (quick skim)
● Comparing datasets too large to fit in memory
● Considerations for Streaming & SQL (DataFrames & Datasets)
● Cute & scary pictures
○ I promise at least one panda and one cat
● “Future Work”
○ Integration testing lives here for now (sorry)
○ Some of this future work might even get done!
Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Familiar with Apache Spark
○ If not, buy one of my books or watch Paco’s awesome video
● Familiar with one of Scala, Java, or Python
○ If you know R well I’d love to chat though
● Want to make better software
○ (or models, or w/e)
So why should you test?
● Makes you a better person
● Save $s
○ May help you avoid losing your employer all of their money
■ Or “users” if we were in the bay
○ AWS is expensive
● Waiting for our jobs to fail is a pretty long dev cycle
● This is really just to guilt trip you & give you flashbacks to your QA internships
So why should you test - continued
Results from: Testing with Spark survey
Why don’t we test?
● It’s hard
○ Faking data, setting up integration tests, urgh w/e
● Our tests can get too slow
● It takes a lot of time
○ and people always want everything done yesterday
○ or I just want to go home and see my partner
○ etc.
Cat photo from
An artisanal Spark unit test
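@transient private var _sc: SparkContext = _
def sc: SparkContext = _sc // accessor used by the tests
override def beforeAll() {
  _sc = new SparkContext("local[4]", "test") // master plus a placeholder app name
  super.beforeAll()
}
override def afterAll() {
  if (sc != null)
    sc.stop()
  System.clearProperty("spark.driver.port") // rebind issue
  _sc = null
  super.afterAll()
}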
Photo by morinesque
And on to the actual test...
test("really simple transformation") {
  val input = List("hi", "hi holden", "bye")
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  assert(tokenize(sc.parallelize(input)).collect().toList === expected)
}
def tokenize(f: RDD[String]) = {
  f.map(_.split(" ").toList)
}
Photo by morinesque
Wait, where were the batteries?
Photo by Jim Bauer
Let’s get batteries!
● Spark unit testing
○ spark-testing-base
○ sscheck
● Integration testing
○ spark-integration-tests (Spark internals)
● Performance
○ spark-perf (also for Spark internals)
● Spark job validation
○ spark-validator
Photo by Mike Mozart
A simple unit test re-visited (Scala)
class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}
Ok but what about problems @ scale
● Maybe our program works fine on our local sized input
● If we are using Spark our actual workload is probably huge
● How do we test workloads too large for a single machine?
○ we can’t just use parallelize and collect
Distributed “set” operations to the rescue*
● Pretty close - already built into Spark
● Doesn’t do so well with floating points :(
○ damn floating points keep showing up everywhere :p
● Doesn’t really handle duplicates very well
○ {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”}, but set operations treat them as equal
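The two caveats above are easy to reproduce. The sketch below uses plain Scala collections as a stand-in for RDDs so that it runs without Spark; `SetComparisonPitfalls` and `sameAsSets` are hypothetical names, not part of Spark or spark-testing-base.

```scala
object SetComparisonPitfalls {
  // Set-style comparison: only asks whether both sides contain
  // the same distinct values, ignoring how often each one occurs.
  def sameAsSets[T](expected: Seq[T], result: Seq[T]): Boolean =
    expected.toSet == result.toSet
}
```

With this helper, `sameAsSets(Seq("coffee", "coffee", "panda"), Seq("panda", "coffee"))` reports a match even though the inputs differ, while comparing `Seq(0.1 + 0.2)` with `Seq(0.3)` reports a mismatch because the two doubles are not exactly equal.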
Matti Mattila
Or use RDDComparisions:
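def compareWithOrderSamePartitioner[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  // pair up elements position by position and return the first pair that differs
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}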
Matti Mattila
Or use RDDComparisions:
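def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  // count how many times each value occurs on each side
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  // keep values that are missing on one side or whose counts differ
  expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) =>
    i1.isEmpty || i2.isEmpty || i1.head != i2.head}.take(1).headOption.
    map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}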
Matti Mattila
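To see what the count-based comparison reports, here is a rough local analogue written against plain Scala collections instead of RDDs. It is a sketch for reasoning about the logic only; `LocalCompare` and `countOccurrences` are hypothetical names, and the distributed version relies on `reduceByKey` and `cogroup` rather than in-memory maps.

```scala
object LocalCompare {
  // Returns the first value whose occurrence counts differ as
  // (value, expectedCount, resultCount), or None when both sides
  // hold the same values the same number of times.
  def compare[T](expected: Seq[T], result: Seq[T]): Option[(T, Int, Int)] = {
    val expectedCounts = countOccurrences(expected)
    val resultCounts = countOccurrences(result)
    (expectedCounts.keySet ++ resultCounts.keySet).iterator
      .map(v => (v, expectedCounts.getOrElse(v, 0), resultCounts.getOrElse(v, 0)))
      .find { case (_, e, r) => e != r }
  }

  // Local stand-in for map(x => (x, 1)).reduceByKey(_ + _)
  private def countOccurrences[T](xs: Seq[T]): Map[T, Int] =
    xs.groupBy(identity).map { case (k, v) => (k, v.size) }
}
```

Unlike a set comparison, this flags the coffee example: the value "coffee" comes back with an expected count of 2 and a result count of 1. Ordering is ignored, so two inputs that differ only in order compare as equal.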
But where do we get the data for those tests?
● If you have production data you can sample, you are lucky!
○ If possible, try to save it in the same format
● If our data is a bunch of Vectors or Doubles, Spark’s got tools :)
● Coming up with good test data can take a long time
Lori Rielly
QuickCheck / ScalaCheck
● QuickCheck generates test data under a set of constraints
● Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
● sscheck
○ Awesome people*, supports generating DStreams too!
● spark-testing-base
○ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
*I assume
tara hunt
With spark-testing-base
test("map should not change number of elements") {
  forAll(RDDGenerator.genRDD[String](sc)) { rdd => rdd.map(_.length).count() == rdd.count() }
}
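For readers new to property-based testing, the idea behind the test above can be sketched without Spark or ScalaCheck. The hand-rolled version below generates random inputs (including empty ones) and looks for a counterexample to the property; all names in it (`MapSizeProperty`, `genStrings`, `firstCounterexample`) are hypothetical, and real generators such as ScalaCheck's also shrink failing inputs, which this sketch does not attempt.

```scala
import scala.util.Random

object MapSizeProperty {
  // Generates a random list of short strings; the size can be 0,
  // so the empty-input case is exercised as well.
  def genStrings(rng: Random, maxSize: Int): List[String] =
    List.fill(rng.nextInt(maxSize + 1))(rng.alphanumeric.take(rng.nextInt(8)).mkString)

  // The property under test: mapping must not change the number of elements.
  def mapKeepsSize(xs: List[String]): Boolean =
    xs.map(_.length).size == xs.size

  // Checks the property against `trials` generated inputs and
  // returns the first input that violates it, if any.
  def firstCounterexample(trials: Int, seed: Long): Option[List[String]] = {
    val rng = new Random(seed)
    Iterator.continually(genStrings(rng, maxSize = 20))
      .take(trials)
      .find(xs => !mapKeepsSize(xs))
  }
}
```

Fixing the seed keeps a failing run reproducible, which matters once a generated input uncovers a real bug.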
Testing streaming….
Photo by Steve Jurvetson
// Setup our Stream:
class TestInputStream[T: ClassTag](@transient var sc: SparkContext,
  ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int)
  extends FriendlyInputDStream[T](ssc_) {
  def start() {}
  def stop() {}
  def compute(validTime: Time): Option[RDD[T]] = {
    logInfo("Computing RDD for time " + validTime)
    val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
    val selectedInput = if (index < input.size) input(index) else Seq[T]()
    // lets us test cases where RDDs are not created
    if (selectedInput == null) {
      return None
    }
    val rdd = sc.makeRDD(selectedInput, numPartitions)
    logInfo("Created RDD " + rdd.id + " with " + selectedInput)
    Some(rdd)
  }
}
Artisanal Stream Testing Code
trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll with Logging
  with SharedSparkContext {
  // Name of the framework for Spark context
  def framework: String = this.getClass.getSimpleName
  // Master for Spark context
  def master: String = "local[4]"
  // Batch duration
  def batchDuration: Duration = Seconds(1)
  // Directory where the checkpoint data will be saved
  lazy val checkpointDir = {
    val dir = Utils.createTempDir()
    logDebug(s"checkpointDir: $dir")
    dir.toString
  }
  // Default after function for any streaming test suite. Override this
  // if you want to add your stuff to "after" (i.e., don't call after { } )
  override def afterAll() {
    System.clearProperty("spark.streaming.clock")
    super.afterAll()
  }
and continued….
  /**
   * Create an input stream for the provided input sequence. This is done using
   * TestInputStream as queueStreams are not checkpointable.
   */
  def createTestInputStream[T: ClassTag](sc: SparkContext, ssc_ : TestStreamingContext,
    input: Seq[Seq[T]]): TestInputStream[T] = {
    new TestInputStream(sc, ssc_, input, numInputPartitions)
  }
  // Default before function for any streaming test suite. Override this
  // if you want to add your stuff to "before" (i.e., don't call before { } )
  override def beforeAll() {
    if (useManualClock) {
      logInfo("Using manual clock")
      // We can specify our own clock
      conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.TestManualClock")
    } else {
      logInfo("Using real clock")
      conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock")
    }
    super.beforeAll()
  }
  /**
   * Run a block of code with the given StreamingContext and automatically
   * stop the context when the block completes or when an exception is thrown.
   */
  def withOutputAndStreamingContext[R](outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
    (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
    val outputStream = outputStreamSSC._1
    val ssc = outputStreamSSC._2
    try {
      block(outputStream, ssc)
    } finally {
      try {
        ssc.stop(stopSparkContext = false)
      } catch {
        case e: Exception =>
          logError("Error stopping StreamingContext", e)
      }
    }
  }
} // end of StreamingSuiteBase
and now for the clock
/*
 * Allows us access to a manual clock. Note that the manual clock changed between
 * 1.1.1 and 1.3
 */
class TestManualClock(var time: Long) extends Clock {
  def this() = this(0L)
  def getTime(): Long = getTimeMillis() // Compat
  def currentTime(): Long = getTimeMillis() // Compat
  def getTimeMillis(): Long =
    synchronized {
      time
    }
  def setTime(timeToSet: Long): Unit =
    synchronized {
      time = timeToSet
      notifyAll()
    }
  def advance(timeToAdd: Long): Unit =
    synchronized {
      time += timeToAdd
      notifyAll()
    }
  def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat
  /**
   * @param targetTime block until the clock time is set or advanced to at least this time
   * @return current time reported by the clock when waiting finishes
   */
  def waitTillTime(targetTime: Long): Long =
    synchronized {
      while (time < targetTime) {
        wait(100)
      }
      getTimeMillis()
    }
}
Testing streaming the happy panda way
● Creating test data is hard
○ ssc.queueStream works - unless you need checkpoints (1.4.1+)
● Collecting the data locally is hard
○ foreachRDD & a var
● Figuring out when your test is “done” is hard
Let’s abstract all that away into testOperation
We can hide all of that:
test("really simple transformation") {
  val input = List(List("hi"), List("hi holden"), List("bye"))
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  testOperation[String, String](input, tokenize _, expected, useSet = true)
}
Photo by An eye for my mind
What about DataFrames?
● We can do the same as we did for RDDs (.rdd)
● Inside of Spark, validation looks like:
def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row])
● Sadly it’s not in a published package & local only
● Instead we expose:
def equalDataFrames(expected: DataFrame, result: DataFrame)
def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double)
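The `tol` parameter is what makes the approximate comparison usable with floating point columns. The sketch below shows one plausible reading of tolerance-based row equality using plain Scala sequences as a stand-in for rows. It is an illustration under that assumption, not the spark-testing-base implementation, and `ApproxRowEquality` and its methods are hypothetical names.

```scala
object ApproxRowEquality {
  // A row is modelled as a plain Seq[Any]; this is a stand-in for Spark's Row type.
  type Row = Seq[Any]

  // Two cells match if both are Doubles within `tol` of each other,
  // or, for any other type, if they are exactly equal.
  def cellsMatch(a: Any, b: Any, tol: Double): Boolean = (a, b) match {
    case (x: Double, y: Double) => math.abs(x - y) <= tol
    case _ => a == b
  }

  def rowsMatch(expected: Row, result: Row, tol: Double): Boolean =
    expected.length == result.length &&
      expected.zip(result).forall { case (a, b) => cellsMatch(a, b, tol) }

  // Compares rows position by position, so both sides must be in the same order.
  def approxEqual(expected: Seq[Row], result: Seq[Row], tol: Double): Boolean =
    expected.length == result.length &&
      expected.zip(result).forall { case (e, r) => rowsMatch(e, r, tol) }
}
```

With a small tolerance such as `1e-9`, a row holding `0.1 + 0.2` matches a row holding `0.3`, which an exact comparison would reject.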
…. and Datasets
● We can do the same as we did for RDDs (.rdd)
● Inside of Spark, validation looks like:
def checkAnswer(df: Dataset[T], expectedAnswer: T*)
● Sadly it’s not in a published package & local only
● Instead we expose:
def equalDatasets(expected: Dataset[U], result: Dataset[V])
def approxEqualDatasets(e: Dataset[U], r: Dataset[V], tol: Double)
This is what it looks like:
test("dataframe should be equal to its self") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._ // Yah I know this is ugly
  val input = sc.parallelize(inputList).toDF
  equalDataFrames(input, input)
}
*This may or may not be easier.
Which has “built-in” large support :)
Photo by allison
Let’s talk about local mode
● It’s way better than you would expect*
● It does its best to try and catch serialization errors
● It’s still not the same as running on a “real” cluster
● Especially since if we were just local mode, parallelize and collect might be fine
Photo by: Bev Sykes
Options beyond local mode:
● Just point at your existing cluster (set master)
● Start one with your shell scripts & change the master
○ Really easy way to plug into existing integration testing
● spark-docker - hack in our own tests
● YarnMiniCluster
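○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala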
○ In Spark Testing Base extend SharedMiniCluster
■ Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+)
Photo by Richard Masoner
Validation
● Validation can be really useful for catching errors before deploying a model
○ Our tests can’t catch everything
● For now, checking file sizes & execution time seems like the most common best practice (from survey)
● Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option
● spark-validator is still in early stages and not ready for production use, but it is an interesting proof of concept
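As a rough illustration of the file size and execution time checks mentioned above, the sketch below compares a run's metrics against the average of earlier runs and flags anything that moved by more than a chosen fraction. Everything here is hypothetical (`JobValidation`, `RunMetrics`, the relative-change rule); it is not how spark-validator works, only one simple way such a rule could look.

```scala
object JobValidation {
  // Metrics recorded for one run of a job; both fields are illustrative.
  final case class RunMetrics(outputBytes: Long, runtimeSeconds: Double)

  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // True when `value` is within `maxRelativeChange` of the historical mean.
  private def withinTolerance(value: Double, history: Seq[Double], maxRelativeChange: Double): Boolean = {
    val avg = mean(history)
    math.abs(value - avg) <= maxRelativeChange * avg
  }

  // Returns a list of human-readable problems; an empty list means the run looks normal.
  def validate(current: RunMetrics, history: Seq[RunMetrics], maxRelativeChange: Double): List[String] =
    if (history.isEmpty) Nil // nothing to compare against yet
    else List(
      ("output size", current.outputBytes.toDouble, history.map(_.outputBytes.toDouble)),
      ("runtime", current.runtimeSeconds, history.map(_.runtimeSeconds))
    ).collect {
      case (name, value, past) if !withinTolerance(value, past, maxRelativeChange) =>
        s"$name $value is outside the expected range"
    }
}
```

A job that suddenly writes a tiny output file, for example, would produce one problem entry for output size even if its runtime looks normal.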
Photo by: Paul Schadler
Related talks & blog posts
● Testing Spark Best Practices (Spark Summit 2014)
● Every Day I’m Shuffling (Strata 2015) & slides
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base
Related packages
● spark-testing-base
● sscheck
● spark-validator *ALPHA*
● spark-perf
● spark-integration-tests
● scalacheck
And including spark-testing-base:
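sbt:
"com.holdenkarau" %% "spark-testing-base" % "1.5.2_0.3.1"
maven:
<dependency>
  <groupId>com.holdenkarau</groupId>
  <artifactId>spark-testing-base</artifactId>
  <version>${spark.version}_0.3.1</version>
  <scope>test</scope>
</dependency>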
“Future Work”
● Better ScalaCheck integration (ala sscheck)
● Testing details in my next Spark book
● Whatever* you all want
○ Testing with Spark survey
Semi-likely:
● Integration testing (for now see @cfriegly’s Spark + Docker setup)
Pretty unlikely:
● Integrating into Apache Spark ( SPARK-12433 )
*That I feel like doing, or you feel like making a pull request for.
Photo by bullet101
k thnx bye!
If you want to fill out survey: http:
Will use updated results in Strata Presentation & tweet eventually at @holdenkarau

Beyond Parallelize and Collect by Holden Karau

  • 1. Beyond Parallelize & Collect (Effective testing of Spark Programs) Now mostly “works”* *See developer for details. Does not imply warranty. :p
  • 2. Who am I? ● My name is Holden Karau ● Preferred pronouns are she/her ● I’m a Software Engineer ● currently IBM and previously Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & Fast Data Processing with Spark ● @holdenkarau ● Slide share ● Linkedin ● Spark Videos
  • 3. What is going to be covered: ● What I think I might know about you ● A bit about why you should test your programs ● Using parallelize & collect for unit testing (quick skim) ● Comparing datasets too large to fit in memory ● Considerations for Streaming & SQL (DataFrames & Datasets) ● Cute & scary pictures ○ I promise at least one panda and one cat ● “Future Work” ○ Integration testing lives here for now (sorry) ○ Some of this future work might even get done!
  • 4. Who I think you wonderful humans are? ● Nice* people ● Like silly pictures ● Familiar with Apache Spark ○ If not, buy one of my books or watch Paco’s awesome video ● Familiar with one of Scala, Java, or Python ○ If you know R well I’d love to chat though ● Want to make better software ○ (or models, or w/e)
  • 5. So why should you test? ● Makes you a better person ● Save $s ○ May help you avoid losing your employer all of their money ■ Or “users” if we were in the bay ○ AWS is expensive ● Waiting for our jobs to fail is a pretty long dev cycle ● This is really just to guilt trip you & give you flashbacks to your QA internships
  • 6. So why should you test - continued Results from: Testing with Spark survey
  • 7. So why should you test - continued Results from: Testing with Spark survey
  • 8. Why don’t we test? ● It’s hard ○ Faking data, setting up integration tests, urgh w/e ● Our tests can get too slow ● It takes a lot of time ○ and people always want everything done yesterday ○ or I just want to go home see my partner ○ etc.
  • 9. Cat photo from
  • 10. An artisanal Spark unit test @transient private var _sc: SparkContext = _ override def beforeAll() { _sc = new SparkContext("local[4]") super.beforeAll() } override def afterAll() { if (sc != null) sc.stop() System.clearProperty("spark.driver.port") // rebind issue _sc = null super.afterAll() } Photo by morinesque
  • 11. And on to the actual test... test("really simple transformation") { val input = List("hi", "hi holden", "bye") val expected = List(List("hi"), List("hi", "holden"), List("bye")) assert(tokenize(sc.parallelize(input)).collect().toList === expected) } def tokenize(f: RDD[String]) = { f.map(_.split(" ").toList) } Photo by morinesque
  • 12. Wait, where were the batteries? Photo by Jim Bauer
  • 13. Let’s get batteries! ● Spark unit testing ○ spark-testing-base - ○ sscheck - ● Integration testing ○ spark-integration-tests (Spark internals) - ● Performance ○ spark-perf (also for Spark internals) - ● Spark job validation ○ spark-validator - Photo by Mike Mozart
  • 14. A simple unit test re-visited (Scala) class SampleRDDTest extends FunSuite with SharedSparkContext { test("really simple transformation") { val input = List("hi", "hi holden", "bye") val expected = List(List("hi"), List("hi", "holden"), List("bye")) assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected) } }
  • 15. Ok but what about problems @ scale ● Maybe our program works fine on our local sized input ● If we are using Spark our actual workload is probably huge ● How do we test workloads too large for a single machine? ○ we can’t just use parallelize and collect Qfamily
  • 16. Distributed “set” operations to the rescue* ● Pretty close - already built into Spark ● Doesn’t do so well with floating points :( ○ damn floating points keep showing up everywhere :p ● Doesn’t really handle duplicates very well ○ {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations... Matti Mattila
  • 17. Or use RDDComparisions: def compareWithOrderSamePartitioner[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, T)] = { expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption } Matti Mattila
  • 18. Or use RDDComparisions: def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = { val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _) val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _) expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) => i1.isEmpty || i2.isEmpty || i1.head != i2.head}.take(1).headOption.map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))} } Matti Mattila
  • 19. But where do we get the data for those tests? ● If you have production data you can sample you are lucky! ○ If possible you can try and save in the same format ● If our data is a bunch of Vectors or Doubles Spark’s got tools :) ● Coming up with good test data can take a long time Lori Rielly
  • 20. QuickCheck / ScalaCheck ● QuickCheck generates test data under a set of constraints ● Scala version is ScalaCheck - supported by the two unit testing libraries for Spark ● sscheck ○ Awesome people*, supports generating DStreams too! ● spark-testing-base ○ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs *I assume tara hunt
  • 21. With spark-testing-base test("map should not change number of elements") { forAll(RDDGenerator.genRDD[String](sc)){ rdd => rdd.map(_.length).count() == rdd.count() } }
  • 23. // Setup our Stream: class TestInputStream[T: ClassTag](@transient var sc: SparkContext, ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int) extends FriendlyInputDStream[T](ssc_) { def start() {} def stop() {} def compute(validTime: Time): Option[RDD[T]] = { logInfo("Computing RDD for time " + validTime) val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt val selectedInput = if (index < input.size) input(index) else Seq[T]() // lets us test cases where RDDs are not created if (selectedInput == null) { return None } val rdd = sc.makeRDD(selectedInput, numPartitions) logInfo("Created RDD " + rdd.id + " with " + selectedInput) Some(rdd) } } Artisanal Stream Testing Code trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll with Logging with SharedSparkContext { // Name of the framework for Spark context def framework: String = this.getClass.getSimpleName // Master for Spark context def master: String = "local[4]" // Batch duration def batchDuration: Duration = Seconds(1) // Directory where the checkpoint data will be saved lazy val checkpointDir = { val dir = Utils.createTempDir() logDebug(s"checkpointDir: $dir") dir.toString } // Default after function for any streaming test suite. Override this // if you want to add your stuff to "after" (i.e., don't call after { } ) override def afterAll() { System.clearProperty("spark.streaming.clock") super.afterAll() } Photo by Steve Jurvetson
  • 24. and continued…. /** * Create an input stream for the provided input sequence. This is done using * TestInputStream as queueStream's are not checkpointable. */ def createTestInputStream[T: ClassTag](sc: SparkContext, ssc_ : TestStreamingContext, input: Seq[Seq[T]]): TestInputStream[T] = { new TestInputStream(sc, ssc_, input, numInputPartitions) } // Default before function for any streaming test suite. Override this // if you want to add your stuff to "before" (i.e., don't call before { } ) override def beforeAll() { if (useManualClock) { logInfo("Using manual clock") conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.TestManualClock") // We can specify our own clock } else { logInfo("Using real clock") conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock") } super.beforeAll() } /** * Run a block of code with the given StreamingContext and automatically * stop the context when the block completes or when an exception is thrown. */ def withOutputAndStreamingContext[R](outputStreamSSC: (TestOutputStream[R], TestStreamingContext)) (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = { val outputStream = outputStreamSSC._1 val ssc = outputStreamSSC._2 try { block(outputStream, ssc) } finally { try { ssc.stop(stopSparkContext = false) } catch { case e: Exception => logError("Error stopping StreamingContext", e) } } } }
  • 25. and now for the clock /* * Allows us access to a manual clock. Note that the manual clock changed between 1.1.1 and 1.3 */ class TestManualClock(var time: Long) extends Clock { def this() = this(0L) def getTime(): Long = getTimeMillis() // Compat def currentTime(): Long = getTimeMillis() // Compat def getTimeMillis(): Long = synchronized { time } def setTime(timeToSet: Long): Unit = synchronized { time = timeToSet notifyAll() } def advance(timeToAdd: Long): Unit = synchronized { time += timeToAdd notifyAll() } def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat /** * @param targetTime block until the clock time is set or advanced to at least this time * @return current time reported by the clock when waiting finishes */ def waitTillTime(targetTime: Long): Long = synchronized { while (time < targetTime) { wait(100) } getTimeMillis() } }
  • 26. Testing streaming the happy panda way ● Creating test data is hard ○ ssc.queueStream works - unless you need checkpoints (1.4.1+) ● Collecting the data locally is hard ○ foreachRDD & a var ● figuring out when your test is “done” Let’s abstract all that away into testOperation
  • 27. We can hide all of that: test("really simple transformation") { val input = List(List("hi"), List("hi holden"), List("bye")) val expected = List(List("hi"), List("hi", "holden"), List("bye")) testOperation[String, String](input, tokenize _, expected, useSet = true) } Photo by An eye for my mind
  • 28. What about DataFrames? ● We can do the same as we did for RDD’s (.rdd) ● Inside of Spark validation looks like: def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row]) ● Sadly it’s not in a published package & local only ● instead we expose: def equalDataFrames(expected: DataFrame, result: DataFrame) { def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double) {
  • 29. …. and Datasets ● We can do the same as we did for RDD’s (.rdd) ● Inside of Spark validation looks like: def checkAnswer(df: Dataset[T], expectedAnswer: T*) ● Sadly it’s not in a published package & local only ● instead we expose: def equalDatasets(expected: Dataset[U], result: Dataset[V]) { def approxEqualDatasets(e: Dataset[U], r: Dataset[V], tol: Double) {
  • 30. This is what it looks like: test("dataframe should be equal to its self") { val sqlCtx = sqlContext import sqlCtx.implicits._// Yah I know this is ugly val input = sc.parallelize(inputList).toDF equalDataFrames(input, input) } *This may or may not be easier.
  • 31. Which has “built-in” large support :)
  • 33. Let’s talk about local mode ● It’s way better than you would expect* ● It does its best to try and catch serialization errors ● It’s still not the same as running on a “real” cluster ● Especially since if we were just local mode, parallelize and collect might be fine Photo by: Bev Sykes
  • 34. Options beyond local mode: ● Just point at your existing cluster (set master) ● Start one with your shell scripts & change the master ○ Really easy way to plug into existing integration testing ● spark-docker - hack in our own tests ● YarnMiniCluster ○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ○ In Spark Testing Base extend SharedMiniCluster ■ Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+) Photo by Richard Masoner
  • 35. Validation ● Validation can be really useful for catching errors before deploying a model ○ Our tests can’t catch everything ● For now checking file sizes & execution time seem like the most common best practice (from survey) ● Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● spark-validator is still in early stages and not ready for production use but interesting proof of concept Photo by: Paul Schadler
  • 36. Related talks & blog posts ● Testing Spark Best Practices (Spark Summit 2014) ● Every Day I’m Shuffling (Strata 2015) & slides ● Spark and Spark Streaming Unit Testing ● Making Spark Unit Testing With Spark Testing Base
  • 37. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark
  • 38. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Coming soon: Spark in Action Coming soon: High Performance Spark
  • 39. And the next book….. Still being written - signup to be notified when it is available: ● ●
  • 40. Related packages ● spark-testing-base: ● sscheck: ● spark-validator: *ALPHA* ● spark-perf - ● spark-integration-tests - ● scalacheck -
  • 41. And including spark-testing-base: sbt: "com.holdenkarau" %% "spark-testing-base" % "1.5.2_0.3.1" maven: <dependency> <groupId>com.holdenkarau</groupId> <artifactId>spark-testing-base</artifactId> <version>${spark.version}_0.3.1</version> <scope>test</scope> </dependency>
  • 42. “Future Work” ● Better ScalaCheck integration (ala sscheck) ● Testing details in my next Spark book ● Whatever* you all want ○ Testing with Spark survey: Semi-likely: ● integration testing (for now see @cfriegly’s Spark + Docker setup): ○ Pretty unlikely: ● Integrating into Apache Spark ( SPARK-12433 ) *That I feel like doing, or you feel like making a pull request for. Photo by bullet101
  • 43. Cat wave photo by Quinn Dombrowski k thnx bye! If you want to fill out survey: http: // Will use update results in Strata Presentation & tweet eventually at @holdenkarau