May 2021 Spark Testing ... or how to farm reputation on StackOverflow

Slides from the St. Louis Big Data IDEA meetup. Kit Menke presented on how to test your Apache Spark applications. May 2021 meetup notes


  1. 1. Testing Your Apache Spark Apps ... or How to Farm Reputation on Stack Overflow. STL Big Data - Innovation, Data Engineering, Analytics Group. May 5, 2021
  2. 2. About Me: Kit Menke is the newest organizer of the STL Big Data IDEA meetup and the Practice Director for Data Engineering at 1904labs. We’re hiring!
  3. 3. Agenda ● Testing Theory ● Testing is Hard ● Why Test? ● An Example Spark App Testing Setup ● Stack Overflow
  4. 4. Testing Theory
  5. 5. Types of Software Tests ● Unit: verifies the smallest testable parts of an application. Purpose: verify methods and/or the smallest testable unit. ● Integration: verifies the interactions and connectivity between the modules of the application. Purpose: ensure different components work together. ● System: validates the complete and fully integrated software product. Purpose: evaluate the end-to-end system. ● Regression Tests
  6. 6. Testing Pyramid ● Layers, top to bottom: System Tests, Integration Tests, Unit Tests ● Isolation: isolated at the bottom, fully integrated at the top ● Speed: tests run faster at the bottom, slower at the top ● Most of your tests should be unit tests!
  7. 7. Performance Tests ● Volume tests: use expected production throughput to establish how estimated volumes of transactions and users affect transaction times. ● Rendezvous tests: test the application’s performance while it is subjected to concurrency issues under production load and volume. ● Stress tests: subject the application to unrealistically high numbers of users accessing the system at the same time in order to find the system's breaking point. ● Soak tests: run for an extended period at predicted business volumes in order to determine whether system performance degrades during continuous usage.
  8. 8. Data Validation ● Basic checks ○ How much data are you getting into your data pipeline? ○ How much data is coming out of your pipeline? ○ Does the schema look right? ● Detailed checks ○ Are the data types correct? ○ Valid values? Check distinct values, ranges, and the distribution of values (ex: a lot of null values)
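The basic and detailed checks above could be sketched in Spark roughly as follows. This is a minimal illustration, not code from the talk: the `id`/`state` columns, the expected schema, and the filter standing in for the pipeline are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DataValidationSketch {
  // Hypothetical expected schema for the pipeline output.
  val expectedSchema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("state", StringType, nullable = true)
  ))

  /** Number of null values per column, returned as column name -> count. */
  def nullCounts(df: DataFrame): Map[String, Long] = {
    val aggs = df.columns.map(c => count(when(col(c).isNull, 1)).as(c))
    val row = df.agg(aggs.head, aggs.tail: _*).head()
    df.columns.map(c => c -> row.getAs[Long](c)).toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("validation-sketch").getOrCreate()
    import spark.implicits._

    val input = Seq((1, "MO"), (2, null), (3, "IL")).toDF("id", "state")
    val output = input.filter(col("state").isNotNull) // stand-in for the real pipeline

    // Basic checks: rows in, rows out, and column names/types.
    println(s"rows in = ${input.count()}, rows out = ${output.count()}")
    val actual = output.schema.map(f => (f.name, f.dataType))
    val expected = expectedSchema.map(f => (f.name, f.dataType))
    assert(actual == expected, "unexpected schema")

    // Detailed checks: null counts per column and the distinct values of a column.
    println(nullCounts(input))
    output.select("state").distinct().show()

    spark.stop()
  }
}
```

The schema comparison here looks only at names and data types and ignores nullability, which is a design choice for the sketch rather than a recommendation.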
  9. 9. Testing is Hard
  10. 10. Things That Go Wrong ● General ○ People often disagree on what each type of test is. ○ Unreasonable metrics like code coverage. ○ Focus on manual testing instead of automated CI pipelines. ● Unit testing ○ Tests that only check for the absence of errors, not functionality. ○ Testing the wrong thing - a symptom of this is mocking everything. ● Integration and system tests ○ Can be brittle - prone to breaking and requiring constant updates. ○ Can be difficult to debug - where is the issue? ● Performance tests ○ Tests aren’t repeatable - the size and shape of your data matters! ○ Results should be comparable over time.
  11. 11. Discussion: why test?
  12. 12. Why: Error Signal Collapse ● Static Analyses ● Unit Tests ● Integration Tests ● System Tests ● Performance Tests ● Other Tests ● Mutation Testing
  13. 13. Why test? 1. Prevent bugs from getting into production 2. Allow developers to make changes more confidently and quickly
  14. 14. Advice for Testing Spark Apps ● Align with your team on what tests should look like ● Testing Spark apps requires your full attention ○ Often many dependencies on other data stores (ex: HDFS, Hive, HBase, databases) ○ Test your logic, not Spark or the dependencies ○ Use pull requests (PRs) to catch missing or bad tests in review ● Start small ○ Bottom (of the pyramid) up - unit tests first ○ Focus on tests that provide the most value ● First priority: run unit tests and build the app in a CI pipeline, automatically on every PR ● Bug in production? Reproduce it in a unit test first.
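One way to "test your logic, not Spark" is to keep row-level logic in a plain function and test it without a SparkSession. A minimal sketch, assuming ScalaTest 3.1+ (for `AnyFunSuite`); `normalizeState` and its rules are hypothetical, not from the demo project:

```scala
import org.scalatest.funsuite.AnyFunSuite

object StateLogic {
  /** Trims and upper-cases a raw state code; rejects anything that is not two letters. */
  def normalizeState(raw: String): Option[String] =
    Option(raw).map(_.trim.toUpperCase).filter(_.matches("[A-Z]{2}"))
}

class StateLogicSuite extends AnyFunSuite {
  import StateLogic.normalizeState

  test("trims and upper-cases valid codes") {
    assert(normalizeState(" mo ") == Some("MO"))
  }

  // A bug seen in production (say, a null state) would get its own test case here first.
  test("rejects nulls and malformed values") {
    assert(normalizeState(null).isEmpty)
    assert(normalizeState("Missouri").isEmpty)
  }
}
```

Because the function has no Spark dependency, these tests run in milliseconds; in the job itself it could be wrapped in a UDF or called from a `map`.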
  15. 15. Spark App Testing
  16. 16. Spark App Testing Stack (MVP) ● Project Management ○ Maven (the familiar tool for those of us coming from Java) ○ Alternatives: sbt ● Spark ○ Version 3.0.1; choose the same version as your cluster ● Unit testing ○ ScalaTest, ScalaMock ○ Alternatives: JUnit, TestNG ● CI Pipeline ○ Jenkins ○ Alternatives: GitHub Actions, AWS CodeBuild
  17. 17. Spark App Testing Stack (expanded) ● Integration testing ○ ScalaTest + Testcontainers ○ Alternatives: ScalaMock ● System testing ○ Scripts ○ Alternatives: Java projects ● Performance testing ○ Re-use system test scripts… just with a lot more data ○ Some way to save results (can just be logs!) ● Helpers ○ spark-testing-base - base classes to set up and tear down a local Spark context ○ Testcontainers - use Docker containers inside ScalaTest
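For a sense of what the set up/tear down helpers take care of, here is a hand-rolled sketch of a ScalaTest suite that shares one local SparkSession. It deliberately does not use the spark-testing-base API; the `adultsOnly` transformation and the config values are assumptions for illustration, and ScalaTest 3.1+ is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class AdultsFilterSuite extends AnyFunSuite with BeforeAndAfterAll {
  // One local SparkSession shared by every test in the suite.
  private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("unit-tests")
      .config("spark.ui.enabled", "false")          // no web UI needed in tests
      .config("spark.sql.shuffle.partitions", "1")  // keep small jobs fast
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    try spark.stop() finally super.afterAll()
  }

  // Hypothetical transformation under test.
  private def adultsOnly(people: DataFrame): DataFrame =
    people.filter(col("age") >= 18)

  test("adultsOnly keeps rows with age >= 18") {
    val s = spark // implicits must be imported from a stable identifier (a val)
    import s.implicits._

    val people = Seq(("Ann", 34), ("Bob", 12)).toDF("name", "age")
    val names = adultsOnly(people).select("name").as[String].collect().toSeq

    assert(names == Seq("Ann"))
  }
}
```

Starting the session once per suite rather than once per test keeps the unit tests near the fast end of the pyramid.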
  18. 18. Demo Example Project
  19. 19. And now for something completely different...
  20. 20. Stack Overflow ● Stack Overflow is a question and answer site for professional and enthusiast programmers. It's built and run by you as part of the Stack Exchange network of Q&A sites. With your help, we're working together to build a library of detailed answers to every question about programming. ● Gain reputation by asking and answering questions ● Stack Overflow spawned a network of other Q&A sites ○ Ex: Server Fault, Super User, Ask Ubuntu, Math, English, Arqade
  21. 21. Stack Overflow ● How do you find the answers to your questions? Usually by typing things into Google and ending up at Stack Overflow... ● How do you ask a good question? ○ Write a title that summarizes the specific problem ○ Introduce the problem before you post any code ○ Help others reproduce the problem ○ Respond to feedback ● Examples of poor questions: unclear, asking for an opinion, too large
  22. 22. Data Questions ● Spark (data?) questions are HARD to ask ○ What is your input? ○ What code do you have? ○ What is the expected output? ○ What the heck are you trying to do?
  23. 23. Cultivate your Spark Skills (and reputation!) ● Use your local development environment to help answer questions! ● Once you’ve found a question you think you can answer, create a unit test ● Tip #1: Use parallelize to create test data ● Tip #2: Use printSchema and show to check your work ● Iterate quickly by re-running your test!
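The two tips might look like this when working through a question locally. This is a sketch under assumptions: the `state`/`amount` sample data and the aggregation stand in for whatever the question is actually asking.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object AnswerScratchpad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("so-answer").getOrCreate()
    import spark.implicits._

    // Tip #1: recreate the question's sample input with parallelize.
    val sales = spark.sparkContext
      .parallelize(Seq(("MO", 10), ("MO", 5), ("IL", 7)))
      .toDF("state", "amount")

    // The candidate answer: total amount per state.
    val totals = sales.groupBy("state").agg(sum("amount").as("total"))

    // Tip #2: check column names and types, then look at the rows.
    totals.printSchema()
    totals.show()

    // Once the output looks right, turn the eyeball check into an assertion.
    val result = totals.as[(String, Long)].collect().toMap
    assert(result == Map("MO" -> 15L, "IL" -> 7L))

    spark.stop()
  }
}
```

The same body drops into a ScalaTest `test(...)` block, so re-running the test becomes the iteration loop.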
  24. 24. Good questions are like unit tests ○ Get at core functionality ○ Small, self-contained units ○ Easy for someone else to understand ○ Helps others! Get out there, write more tests, and give back to your community.
  25. 25. Links ● Example Spark project with unit tests ● ScalaTest ● spark-testing-base ● Testcontainers ● Test Pyramid ● Monitoring ● STL IDEA Meetup Talk Ideas