Effective testing for spark programs Strata NY 2015

Effective Testing for Spark
Programs
or avoiding “I didn’t think that could happen”
Now
mostly
“works”*
*See developer for details. Does not imply warranty. :p

Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Software Engineer
● currently Alpine and previously Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Spark Videos http://bit.ly/holdenSparkVideos

What is going to be covered:
● What I think I might know about you
● A bit about why you should test your programs
● Doing traditional unit testing for Spark programs
○ Along with special considerations for SQL/Hive & Streaming
● Using counters & other job acceptance tests w/ Spark
● Cute & scary pictures
○ I promise at least one panda and one cat
● “Future Work”
○ Some of this future work might even get done!

Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Familiar with Apache Spark
○ If not, buy one of my books or watch Paco’s awesome video
● Familiar with one of Scala, Java, or Python
○ If you know R well I’d love to chat though
● Want to make better software
○ (or models, or w/e)

So why should you test?
● Makes you a better person
● Save $s
○ May help you avoid losing your employer all of their money
■ Or “users” if we were in the bay
○ AWS is expensive
● Waiting for our jobs to fail is a pretty long dev cycle
● This is really just to guilt trip you & give you flashbacks to your QA internships

Why don’t we test?
● It’s hard
○ Faking data, setting up integration tests, urgh w/e
● Our tests can get too slow
● It takes a lot of time
○ and people always want everything done yesterday
○ or I just want to go home and see my partner
○ etc.

Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455

An artisanal Spark unit test
@transient private var _sc: SparkContext = _
def sc: SparkContext = _sc
override def beforeAll() {
  // master and app name; "test" is just a placeholder name
  _sc = new SparkContext("local[4]", "test")
  super.beforeAll()
}
override def afterAll() {
  if (_sc != null) {
    _sc.stop()
  }
  System.clearProperty("spark.driver.port") // rebind issue
  _sc = null
  super.afterAll()
}
Photo by morinesque

And on to the actual test...
test("really simple transformation") {
  val input = List("hi", "hi holden", "bye")
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  assert(tokenize(sc.parallelize(input)).collect().toList === expected)
}
def tokenize(f: RDD[String]) = {
  f.map(_.split(" ").toList)
}
Photo by morinesque

Wait, where were the batteries?
Photo by Jim Bauer

Let’s get batteries!
● Spark unit testing
○ spark-testing-base - https://github.com/holdenk/spark-testing-base
○ sscheck - https://github.com/juanrh/sscheck
● Integration testing
○ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
● Performance
○ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
● Spark job validation
○ spark-validator - https://github.com/holdenk/spark-validator
Photo by Mike Mozart

A simple unit test re-visited (Scala)
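class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}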

A simple unit test (Java)
public class SampleJavaRDDTest
    extends SharedJavaSparkContext implements Serializable {
  @Test public void verifyMapTest() {
    List<Integer> input = Arrays.asList(1, 2);
    JavaRDD<Integer> result = jsc().parallelize(input).map(
      new Function<Integer, Integer>() {
        public Integer call(Integer x) { return x * x; }
      });
    assertEquals(input.size(), result.count());
  }
}

A simple unit test (Python)
class SimpleTest(SparkTestingBaseTestCase):
    """A simple test."""
    def test_basic(self):
        """Test a simple collect."""
        input = ["hello world"]
        rdd = self.sc.parallelize(input)
        result = rdd.collect()
        assert result == input

Making fake data
● If you have production data you can sample, you are lucky!
○ If possible, try to save it in the same format
● sc.parallelize is pretty good for small tests
○ Note that we can specify the number of partitions
● Coming up with good test data can take a long time
Photo by Lori Rielly
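A minimal sketch of small, repeatable fake input, in plain Python so it runs without a cluster. The helpers `make_fake_lines` and `split_into_partitions` and the word list are made up for illustration. With Spark you would hand the list to sc.parallelize along with a partition count instead of splitting it yourself.

```python
import random


def make_fake_lines(n, seed=42):
    """Build n fake text lines; a fixed seed keeps test runs repeatable."""
    rng = random.Random(seed)
    words = ["hi", "holden", "bye", "panda", ""]
    # Each line has between 0 and 3 words, so empty lines are covered too.
    return [" ".join(rng.choice(words) for _ in range(rng.randint(0, 3)))
            for _ in range(n)]


def split_into_partitions(items, num_partitions):
    """Deal items round-robin into num_partitions lists.

    Hypothetical helper for illustration only; this is not how Spark
    assigns elements to partitions.
    """
    parts = [[] for _ in range(num_partitions)]
    for i, item in enumerate(items):
        parts[i % num_partitions].append(item)
    return parts


lines = make_fake_lines(10)
partitions = split_into_partitions(lines, 4)
```

Asking for more partitions than elements leaves some partitions empty, which is a cheap way to exercise that corner case.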

QuickCheck / ScalaCheck
● QuickCheck generates test data under a set of constraints
● Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
● sscheck
○ Awesome people*, supports generating DStreams too!
● spark-testing-base
○ Also Awesome people*, generates more pathological RDDs
*I assume
Photo by tara hunt

With sscheck
def forallRDDGenOfNtoM = {
  val (minWords, maxWords) = (50, 100)
  Prop.forAll(RDDGen.ofNtoM(minWords, maxWords, arbitrary[String])) { rdd : RDD[String] =>
    rdd.map(_.length()).sum must be_>=(0.0)
  }
}

With spark-testing-base
test("map should not change number of elements") {
  forAll(RDDGenerator.genRDD[String](sc)) {
    rdd => rdd.map(_.length).count() == rdd.count()
  }
}
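The idea behind both libraries can be sketched without Spark or ScalaCheck: generate lots of inputs, run the property on each, and report the first input that breaks it. This is a hand-rolled stand-in in plain Python, not either library's API. `random_strings` and `for_all` are hypothetical names.

```python
import random
import string


def random_strings(rng, max_len=20):
    """Generate a list of random strings; the list or its items may be empty."""
    size = rng.randint(0, max_len)
    return ["".join(rng.choice(string.ascii_letters + " ")
                    for _ in range(rng.randint(0, 8)))
            for _ in range(size)]


def for_all(generator, prop, runs=100, seed=0):
    """Run prop on `runs` generated inputs; return the first failing input or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        value = generator(rng)
        if not prop(value):
            return value
    return None


# Property: mapping each string to its length keeps the element count.
counterexample = for_all(random_strings,
                         lambda xs: len([len(x) for x in xs]) == len(xs))
```

A `None` result means no generated input broke the property. Real property-testing libraries also shrink failing inputs to a minimal case, which this sketch skips.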

Testing streaming….
Photo by Steve Jurvetson

// Setup our Stream:
class TestInputStream[T: ClassTag](@transient var sc: SparkContext,
    ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int)
  extends FriendlyInputDStream[T](ssc_) {
  def start() {}
  def stop() {}
  def compute(validTime: Time): Option[RDD[T]] = {
    logInfo("Computing RDD for time " + validTime)
    val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
    val selectedInput = if (index < input.size) input(index) else Seq[T]()
    // lets us test cases where RDDs are not created
    if (selectedInput == null) {
      return None
    }
    val rdd = sc.makeRDD(selectedInput, numPartitions)
    logInfo("Created RDD " + rdd.id + " with " + selectedInput)
    Some(rdd)
  }
}
Artisanal Stream Testing Code
trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll with Logging
  with SharedSparkContext {
  // Name of the framework for Spark context
  def framework: String = this.getClass.getSimpleName
  // Master for Spark context
  def master: String = "local[4]"
  // Batch duration
  def batchDuration: Duration = Seconds(1)
  // Directory where the checkpoint data will be saved
  lazy val checkpointDir = {
    val dir = Utils.createTempDir()
    logDebug(s"checkpointDir: $dir")
    dir.toString
  }
  // Default after function for any streaming test suite. Override this
  // if you want to add your stuff to "after" (i.e., don't call after { } )
  override def afterAll() {
    System.clearProperty("spark.streaming.clock")
    super.afterAll()
  }
Photo by Steve Jurvetson

and continued….
  /**
   * Create an input stream for the provided input sequence. This is done using
   * TestInputStream as queueStream's are not checkpointable.
   */
  def createTestInputStream[T: ClassTag](sc: SparkContext, ssc_ : TestStreamingContext,
      input: Seq[Seq[T]]): TestInputStream[T] = {
    new TestInputStream(sc, ssc_, input, numInputPartitions)
  }
  // Default before function for any streaming test suite. Override this
  // if you want to add your stuff to "before" (i.e., don't call before { } )
  override def beforeAll() {
    if (useManualClock) {
      logInfo("Using manual clock")
      // We can specify our own clock
      conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.TestManualClock")
    } else {
      logInfo("Using real clock")
      conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock")
    }
    super.beforeAll()
  }
  /**
   * Run a block of code with the given StreamingContext and automatically
   * stop the context when the block completes or when an exception is thrown.
   */
  def withOutputAndStreamingContext[R](
      outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
      (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
    val outputStream = outputStreamSSC._1
    val ssc = outputStreamSSC._2
    try {
      block(outputStream, ssc)
    } finally {
      try {
        ssc.stop(stopSparkContext = false)
      } catch {
        case e: Exception =>
          logError("Error stopping StreamingContext", e)
      }
    }
  }
}

and now for the clock
/*
 * Allows us access to a manual clock. Note that the manual clock changed between
 * 1.1.1 and 1.3
 */
class TestManualClock(var time: Long) extends Clock {
  def this() = this(0L)
  def getTime(): Long = getTimeMillis() // Compat
  def currentTime(): Long = getTimeMillis() // Compat
  def getTimeMillis(): Long =
    synchronized {
      time
    }
  def setTime(timeToSet: Long): Unit =
    synchronized {
      time = timeToSet
      notifyAll()
    }
  def advance(timeToAdd: Long): Unit =
    synchronized {
      time += timeToAdd
      notifyAll()
    }
  def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat
  /**
   * @param targetTime block until the clock time is set or advanced to at least this time
   * @return current time reported by the clock when waiting finishes
   */
  def waitTillTime(targetTime: Long): Long =
    synchronized {
      while (time < targetTime) {
        wait(100)
      }
      getTimeMillis()
    }
}
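The same manual-clock idea, sketched in plain Python to show why it makes streaming tests deterministic: the test, not the wall clock, decides when time moves. The class and method names here are illustrative, and the timeout on `wait_till_time` is an addition of this sketch (the Scala version above waits until the target is reached).

```python
import threading


class ManualClock:
    """A clock whose time only changes when the test advances it."""

    def __init__(self, time_ms=0):
        self._time_ms = time_ms
        self._cond = threading.Condition()

    def get_time_millis(self):
        with self._cond:
            return self._time_ms

    def advance(self, delta_ms):
        """Move time forward and wake up anything waiting on the clock."""
        with self._cond:
            self._time_ms += delta_ms
            self._cond.notify_all()

    def wait_till_time(self, target_ms, timeout_s=5.0):
        """Block until the clock reaches target_ms or the timeout expires.

        Returns the clock time when waiting finishes.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._time_ms >= target_ms,
                                timeout=timeout_s)
            return self._time_ms


clock = ManualClock()
seen = []
waiter = threading.Thread(target=lambda: seen.append(clock.wait_till_time(2000)))
waiter.start()
clock.advance(1000)  # not enough yet, the waiter keeps blocking
clock.advance(1000)  # reaches 2000, the waiter wakes up
waiter.join()
```

After the second `advance` the waiting thread returns, so `seen` holds the time it observed.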

Testing streaming the happy panda way
● Creating test data is hard
○ ssc.queueStream works - unless you need checkpoints (1.4.1+)
● Collecting the data locally is hard
○ foreachRDD & a var
● Figuring out when your test is “done” is hard
Let’s abstract all that away into testOperation

We can hide all of that:
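test("really simple transformation") {
  val input = List(List("hi"), List("hi holden"), List("bye"))
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  // tokenize here is the streaming version: it takes and returns a DStream[String]
  testOperation[String, String](input, tokenize _, expected, useSet = true)
}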
Photo by An eye for my mind
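Roughly what a testOperation-style helper hides, sketched in plain Python with lists standing in for batches: feed the input batch by batch, collect each output, then compare against the expected batches. This is a simplified stand-in, not spark-testing-base's implementation. `run_operation` and `check_operation` are hypothetical names, and treating `use_set` as "ignore order within a batch" is my reading of the flag.

```python
def tokenize(batch):
    """Per-batch stand-in for the streaming tokenize: split lines into words."""
    return [word for line in batch for word in line.split(" ")]


def run_operation(batches, operation):
    """Apply operation to each input batch in order and collect the outputs."""
    return [operation(batch) for batch in batches]


def check_operation(batches, operation, expected, use_set=False):
    """Compare outputs batch by batch; use_set ignores order within a batch."""
    output = run_operation(batches, operation)
    if len(output) != len(expected):
        return False
    if use_set:
        return all(set(o) == set(e) for o, e in zip(output, expected))
    return output == expected


batches = [["hi"], ["hi holden"], ["bye"]]
expected = [["hi"], ["hi", "holden"], ["bye"]]
ok = check_operation(batches, tokenize, expected, use_set=True)
```

The real helper also has to drive the clock and wait for every batch to arrive, which is the "when is my test done" part this sketch skips.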

What about DataFrames?
● We can do the same as we did for RDDs
● Inside of Spark, validation looks like:
def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row])
● Sadly it’s not in a published package :(
def equalDataFrames(expected: DataFrame, result: DataFrame)
def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double)

We can make it easier!*
test("dataframe should be equal to itself") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._ // Yah I know this is ugly
  val input = sc.parallelize(inputList).toDF
  equalDataFrames(input, input)
}
*This may or may not be easier.
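What an approximate comparison with a tolerance could look like, sketched in plain Python with rows as tuples. This is an illustration of the idea only, not the spark-testing-base implementation. `approx_equal_rows` is a hypothetical name, and comparing rows in order is an assumption of the sketch.

```python
def approx_equal_rows(expected, result, tol):
    """Compare two lists of rows; floats may differ by at most tol."""
    if len(expected) != len(result):
        return False
    for e_row, r_row in zip(expected, result):
        if len(e_row) != len(r_row):
            return False
        for e, r in zip(e_row, r_row):
            if isinstance(e, float) and isinstance(r, float):
                # Written as "not <=" so that NaN never counts as close.
                if not abs(e - r) <= tol:
                    return False
            elif e != r:
                return False
    return True


expected_rows = [(1, "panda", 0.5), (2, "cat", 1.25)]
result_rows = [(1, "panda", 0.5004), (2, "cat", 1.25)]
close_enough = approx_equal_rows(expected_rows, result_rows, tol=0.001)
too_strict = approx_equal_rows(expected_rows, result_rows, tol=0.0001)
```

Non-float columns still have to match exactly, so a changed string or id fails the comparison no matter how generous the tolerance is.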

Let’s talk about local mode
● It’s way better than you would expect*
● It does its best to try and catch serialization errors
● It’s still not the same as running on a “real” cluster
Photo by: Bev Sykes

Running on a real* cluster
● Start one with your shell scripts & change the master
○ Really easy way to plug into existing integration testing
● spark-docker - hack in our own tests
● YarnMiniCluster
○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
○ In Spark Testing Base extend SharedMiniCluster
■ Not recommended until after SPARK-10812 is merged (maybe 1.6?)
Photo by Richard Masoner

Why should we validate our jobs?
● Our code will most likely fail
○ Sometimes data sources fail in new & exciting ways (see Mongo)
○ That jerk on that other floor changed the meaning of a field :(
○ Our tests won’t catch all of the corner cases that the real world finds
● We should try and minimize the impact
○ Avoid making potentially embarrassing recommendations
○ Save having to be woken up at 3am to do a roll-back
○ Specifying a few simple invariants isn’t that hard
Photo of GSM intercept by Matt E

So how do we validate our jobs?
● Spark has its own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
● We can write rules for whether the values are as expected
○ Simple rules (X > J)
○ Historic rules (X > Avg(J))
Photo by: Paul Schadler
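The two kinds of rules, sketched in plain Python over counter values pulled from a finished job. These are illustrative only and do not mirror spark-validator's rule classes. `absolute_rule`, `average_rule`, the counter names, and the numbers are all made up.

```python
def absolute_rule(value, minimum=None, maximum=None):
    """Simple rule: the counter must sit inside fixed bounds."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


def average_rule(value, history, max_relative_diff):
    """Historic rule: the counter must stay close to the mean of past runs."""
    if not history:
        return True  # first run, nothing to compare against yet
    avg = sum(history) / len(history)
    if avg == 0:
        return value == 0
    return abs(value - avg) / abs(avg) <= max_relative_diff


# Counters from this run plus invalid-record counts from earlier runs.
counters = {"recordsRead": 950, "invalidRecords": 12}
invalid_history = [10, 11, 9]

job_ok = (absolute_rule(counters["recordsRead"], minimum=30, maximum=1000)
          and average_rule(counters["invalidRecords"], invalid_history, 0.25))
```

Here 12 invalid records is 20% above the historic mean of 10, so the job passes with a 25% allowance and would fail with a 10% one.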

Simple historic validation
val vc = new ValidationConf(jobHistoryPath, "1", true,
  List[ValidationRule](new AvgRule("acc", 0.001, Some(200))))
val v = Validation(sc, vc)
// Some job logic
// Register an accumulator (optional)
val acc = sc.accumulator(0)
v.registerAccumulator(acc, "acc")
// More Job logic goes here
if (v.validate(jobId)) {
  // Success logic goes here
} else sadness()
Photo by Dvortygirl

With a Spark internal counter...
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000)))
)
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Do work here....
assert(v.validate(5) === true)
Photo by Dvortygirl

Related talks & blog posts
● Testing Spark Best Practices (Spark Summit 2014)
● Every Day I’m Shuffling (Strata 2015) & slides
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base

Related packages
● spark-testing-base: https://github.com/holdenk/spark-testing-base
● sscheck: https://github.com/juanrh/sscheck
● spark-validator: https://github.com/holdenk/spark-validator *ALPHA*
● spark-perf - https://github.com/databricks/spark-perf
● spark-integration-tests - https://github.com/databricks/spark-integration-tests
● scalacheck - https://www.scalacheck.org/

“Future Work”
● Integrating into Apache Spark
○ Using their style rules to simplify future transition
● Better ScalaCheck integration (with the help of the sscheck people)
● Some reasonable prefab rules for Job validation
● Testing details in my next Spark book
● Whatever* you all want
○ Testing with Spark survey: http://bit.ly/holdenTestingSpark
Semi-likely:
● integration testing
*That I feel like doing, or you feel like making a pull request for.
Photo by bullet101

Cat wave photo by Quinn Dombrowski
k thnx bye!
If you want to fill out the survey: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
