Types of Software Tests
Verifies the smallest testable parts of an
Verifies methods and/or the smallest testable unit.
Verify the interactions and connectivity between the modules of the
Validates the complete and fully integrated
Evaluate the end-to-end
Most of your tests should be unit tests!
● Use expected production throughput to establish how estimated volumes of transactions and users affect transaction times.
● Test the application’s performance when it is subjected to concurrency issues under production load and volume.
● Subject the application to unrealistically high numbers of users accessing the system at the same time in order to find the system's breaking point.
● Run an extended period of testing at predicted business volumes to determine whether system performance degrades during continuous usage.
● How much data are you getting into your data pipeline?
● How much data is coming out of your pipeline?
● Does the schema look right?
● Are the data types correct?
● Valid values?
○ Distinct values?
○ Correct distribution of values? Ex: an unexpectedly large number of null values
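A few of the questions above (rows in versus rows out, distinct values, share of nulls) can be turned into small checks. The following is a minimal sketch in plain Scala, with no Spark involved; the `Event` record, the `country` field, and the sample rows are hypothetical and only serve as an illustration.

```scala
// Hypothetical record type; Option models a nullable column.
final case class Event(userId: String, country: Option[String])

object PipelineChecks {
  /** Fraction of rows whose country is missing (0.0 for empty input). */
  def nullRate(rows: Seq[Event]): Double =
    if (rows.isEmpty) 0.0
    else rows.count(_.country.isEmpty).toDouble / rows.size

  /** Distinct non-null country values, to compare against an allowed set. */
  def distinctCountries(rows: Seq[Event]): Set[String] =
    rows.flatMap(_.country).toSet
}

object PipelineChecksExample {
  // Hand-made sample input: four rows, one with a missing country.
  val input: Seq[Event] = Seq(
    Event("u1", Some("US")),
    Event("u2", None),
    Event("u3", Some("DE")),
    Event("u4", Some("US"))
  )

  // Stand-in for the pipeline: drop rows that have no country.
  val output: Seq[Event] = input.filter(_.country.isDefined)

  def main(args: Array[String]): Unit = {
    println(s"rows in=${input.size} out=${output.size}")
    println(s"null rate in input=${PipelineChecks.nullRate(input)}")
    println(s"distinct countries=${PipelineChecks.distinctCountries(output)}")
  }
}
```

The same ideas carry over to a real pipeline by computing the counts and rates on the actual input and output instead of the sample rows.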
Things That Go Wrong
○ People often disagree on what each type of test is.
○ Unreasonable metrics like code coverage.
○ Focus on manual testing instead of automated CI pipelines
● Unit testing
○ Tests that only check for the absence of errors, not functionality.
○ Testing the wrong thing - a symptom of this is mocking everything
● Integration and system tests
○ Can be brittle - prone to breaking and in need of constant updates
○ Can be difficult to debug - where is the issue?
● Performance tests
○ Tests aren’t repeatable - the size and shape of your data matters!
○ Results should be comparable over time
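To make the difference between "no errors" and "correct behavior" concrete, here is a small sketch. `NameCleaner` and its input are hypothetical; the point is only the contrast between the two checks.

```scala
object NameCleaner {
  /** Trims, lower-cases, and collapses runs of whitespace into one space. */
  def clean(raw: String): String =
    raw.trim.toLowerCase.replaceAll("\\s+", " ")
}

object NameCleanerChecks {
  // Weak check: succeeds whenever clean() does not throw,
  // even if the returned value is wrong.
  def runsWithoutError(): Boolean =
    scala.util.Try(NameCleaner.clean("  Ada   Lovelace ")).isSuccess

  // Stronger check: pins down the expected output for a known input.
  def producesExpectedOutput(): Boolean =
    NameCleaner.clean("  Ada   Lovelace ") == "ada lovelace"
}
```

If `clean` were changed to return its input untouched, the weak check would still pass while the stronger one would fail.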
Advice for Testing Spark Apps
Why: Error Signal Collapse
1. Prevent bugs from getting into production
2. Allow developers to make changes more confidently/quickly
● Align with your team on what tests should look like
● Testing Spark Apps requires your full attention
○ Often many dependencies on other data stores (ex: HDFS, Hive, HBase, databases)
○ Test your logic, not Spark or the dependencies
○ Use pull requests (PR) to catch missing or bad tests
● Start small
○ Bottom (of the pyramid) up - unit tests first
○ Focus on tests that provide the most value
● First priority: run unit tests and build app in a CI pipeline, automatically on PR
● Bug in production? Reproduce it in a unit test first.
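One way to read "test your logic, not Spark" is to keep the per-record logic in a plain function that needs no cluster or data store. The sketch below assumes a hypothetical `Reading` record and a "sensor,value" line format; the empty-value case stands in for a bug first seen in production and reproduced as a test.

```scala
// Hypothetical record produced by the job's parsing step.
final case class Reading(sensor: String, value: Double)

object ReadingParser {
  /** Parses "sensor,value"; returns None for malformed lines instead of throwing. */
  def parse(line: String): Option[Reading] =
    line.split(",", -1).map(_.trim) match {
      case Array(sensor, value) if sensor.nonEmpty =>
        scala.util.Try(value.toDouble).toOption.map(v => Reading(sensor, v))
      case _ => None
    }
}

object ReadingParserChecks {
  def main(args: Array[String]): Unit = {
    // Happy path: a well-formed line.
    assert(ReadingParser.parse("s1, 2.5") == Some(Reading("s1", 2.5)))
    // Regression case: a line with an empty value must not crash the job.
    assert(ReadingParser.parse("s1,") == None)
    // A line without a separator is rejected as well.
    assert(ReadingParser.parse("no-comma") == None)
  }
}
```

Inside the job, a function like this could be applied to each input line, so the Spark part shrinks to reading and writing while the logic stays testable in milliseconds.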
Spark App Testing Stack (MVP)
● Project Management
○ Maven (for those of us coming from Java it is the familiar tool)
○ Alternatives: sbt
○ Version 3.0.1, choose the same version as your cluster
● Unit testing
○ ScalaTest, ScalaMock
○ Alternatives: JUnit, TestNG
● CI Pipeline
○ Alternatives: GitHub Actions, AWS CodeBuild
Spark App Testing Stack (expanded)
● Integration testing
○ ScalaTest + Testcontainers
○ Alternatives: ScalaMock
● System testing
○ Alternatives: Java projects
● Performance testing
○ Re-use system test scripts… just with a lot more data
○ Some way to save results (can just be logs!)
○ spark-testing-base - base classes to set up/tear down a local Spark context
○ Testcontainers - use Docker containers inside ScalaTest
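Saving performance results "as just logs" can be as simple as one structured line per run. This is a rough sketch, not a benchmarking framework; the `PerfLog` helper, the label, and the log format are assumptions made for illustration. Recording the row count next to the timing keeps runs comparable, since the size of the data matters.

```scala
object PerfLog {
  /** Runs `body` once and returns its result with the elapsed wall-clock milliseconds. */
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  /** One line per run, so results can be grepped and compared over time. */
  def logLine(label: String, rows: Long, millis: Long): String =
    s"perf label=$label rows=$rows millis=$millis"
}

object PerfLogExample {
  def main(args: Array[String]): Unit = {
    val rows = 1000000L
    // Stand-in workload; a real run would call the system test script here.
    val (_, millis) = PerfLog.timed((1L to rows).sum)
    println(PerfLog.logLine("sum-range", rows, millis))
  }
}
```

A single timed run is noisy, so repeating the run with the same data size and keeping every log line gives a more useful history.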
Stack Overflow is a question and answer site for professional and enthusiast programmers.
It's built and run by you as part of the Stack Exchange network of Q&A sites. With your help,
we're working together to build a library of detailed answers to every question about
● Gain reputation by asking and answering questions
● Stack Overflow spawned a network of other Q&A sites
○ Ex: Server Fault, Super User, Ask Ubuntu, Math, English, Arqade
How do you find the answers to your questions? Usually by typing things into Google and
ending up at Stack Overflow...
● How do you ask a good question?
○ Write a title that summarizes the specific problem
○ Introduce the problem before you post any code
○ Help others reproduce the problem
○ Respond to feedback
Asking for an opinion
Cultivate your Spark Skills (and reputation!)
● Spark (data?) questions are HARD to ask
○ What is your input?
○ What code do you have?
○ What is the expected output?
○ What the heck are you trying to do?
● Use your local development environment to help answer questions!
● Once you’ve found a question you think you can answer, create a unit test
Tip #1: Use parallelize to create test data
Tip #2: Use printSchema and show to check
Iterate quickly by re-running your test!
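The two tips can be combined in a few lines. The sketch below assumes `spark-sql` is on the classpath and uses a local master; the column names and sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParallelizeExample {
  /** Tip #1: build a tiny DataFrame by hand instead of reading real data. */
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.sparkContext
      .parallelize(Seq(("alice", 34), ("bob", 45)))
      .toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    // Local session; no cluster is needed for this kind of experiment.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("parallelize-example")
      .getOrCreate()
    import spark.implicits._

    val df = sample(spark)

    // Tip #2: inspect the column names and types, then the rows themselves.
    df.printSchema()
    df.show()

    // Apply the logic under discussion and look at the result again.
    df.filter($"age" > 40).show()

    spark.stop()
  }
}
```

Because the input is built in code, anyone answering or asking a question can paste the snippet and get the same starting data.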
Good questions are like unit tests
○ Get at core functionality
○ Small, self-contained units
○ Easy for someone else to understand
○ Helps others!
Get out there, write more tests, and
give back to your community.
Example Spark project with unit tests https://github.com/kitmenke/spark-hello-world
Test Pyramid https://martinfowler.com/articles/practical-test-pyramid.html
STIL IDEA Meetup Talk Ideas