SlideShare a Scribd company logo
1 of 25
Download to read offline
Testing Your Apache Spark Apps
... or How to Farm Reputation on stack overflow
STL Big Data - Innovation, Data Engineering, Analytics Group
May 5, 2021
About Me
Kit Menke is the newest organizer of the STL Big
Data IDEA meetup and the Practice Director for
Data Engineering at 1904labs.
We’re hiring!
https://1904labs.com/your-careers/
Insert
Image
● Testing Theory
● Testing is Hard
● Why Test?
● An Example Spark App Testing Setup
● Stack Overflow
Agenda
Testing Theory
Types of Software Tests
Verifies the smallest
testable parts of an
application.
Purpose
Verifies methods and/or
the smallest testable unit.
Unit
Verify the interactions and
connectivity between the
modules of the
application.
Purpose
Ensure different
components work
together.
Integration
Validates the complete
and fully integrated
software product.
Purpose
Evaluate the end-to-end
system.
System
Regression Tests
Testing Pyramid
System
Tests
Integration Tests
Unit Tests
Isolated
Isolation
Faster
Speed
Tests
Run
Slower
Fully integrated
Most of your tests should be unit tests!
Volume
● Utilize expected production throughputs to establish the impact on transaction times of
estimated volumes of transactions and users.
Rendezvous tests
● Test the application’s performance while subjected to concurrency issues under production
load and volume.
Stress Tests
● Subject the application to unrealistically high volumes of users accessing the system at the
same time in order to determine a system breaking point.
Soak Tests
● An extended period of testing at predicted business volumes in order to determine if system
performance degrades during a period of continuous usage.
Performance Tests
Basic checks
● How much data are you getting into your data pipeline?
● How much data is coming out of your pipeline?
● Does the schema look right?
Detailed checks
● Are the data types correct?
● Valid values?
○ Distinct values?
○ Ranges?
○ Correct distribution of values? Ex: a lot of null values
Data Validation
Testing is Hard
● General
○ People often disagree on what each type of test is.
○ Unreasonable metrics like code coverage.
○ Focus on manual testing instead of automated CI pipelines
● Unit testing
○ Testing that only check for the absence of errors, not functionality.
○ Testing the wrong thing - symptom of this is mocking everything
● Integration and system tests
○ Can be brittle - prone to breaking and require constant updates
○ Can be difficult to debug - where is the issue?
● Performance tests
○ Tests aren’t repeatable - the size and shape of your data matters!
○ Results should be comparable over time
Things That Go Wrong
Discussion: why test?
Why: Error Signal Collapse
Static
Analyses
Unit
Tests
Integration
Tests
System
Tests
Performance
Tests
Other
Tests
Mutation
Testing
1. Prevent bugs from getting into production
2. Allow developers to make changes more confidently/quickly
Why test?
● Align with your team on what tests should look like
● Testing Spark Apps requires your full attention
○ Often many dependencies on other data stores (ex: hdfs, hive, hbase, databases)
○ Test your logic, not spark or the dependencies
○ Use pull requests (PR) to review the lack of tests or bad tests
● Start small
○ Bottom (of the pyramid) up - unit tests first
○ Focus on tests that provide the most value
● First priority: run unit tests and build app in a CI pipeline, automatically on PR
● Bug in production? Reproduce it in a unit test first.
Advice for Testing Spark Apps
Spark App Testing
● Project Management
○ Maven (for those of us coming from Java it is the familiar tool)
○ Alternatives: sbt
● Spark
○ Version 3.0.1, choose the same version as your cluster
● Unit testing
○ Scalatest, Scalamock
○ Alternatives: JUnit, TestNG
● CI Pipeline
○ Jenkins
○ Alternatives: Github actions, AWS Codebuild
Spark App Testing Stack (MVP)
● Integration testing
○ Scalatest + Testcontainers
○ Alternatives: scalamock
● System testing
○ Scripts
○ Alternatives: java projects
● Performance testing
○ Re-use system test scripts… just with a lot more data
○ Some way to save results (can just be logs!)
● Helpers
○ Spark-testing-base - Base classes to setup/tear down local spark context
○ Test-containers - use docker containers inside scalatest
Spark App Testing Stack (expanded)
Demo Example Project
And now for something completely
different...
Stack Overflow is a question and answer site for professional and enthusiast programmers.
It's built and run by you as part of the Stack Exchange network of Q&A sites. With your help,
we're working together to build a library of detailed answers to every question about
programming.
https://stackoverflow.com/tour
● Gain reputation by asking and answering questions
● Stack overflow spawned a network of other Q&A sites
○ Ex: Server Fault, Super User, Ask Ubuntu, Math, English, Arqade
Stack Overflow
How do you find the answers to your questions? Usually by typing things into Google and
end up at Stack overflow...
● How do you ask a good question?
○ Write a title that summarizes the specific problem
○ Introduce the problem before you post any code
○ Help others reproduce the problem
○ Respond to feedback
Stack Overflow
Unclear
Asking for an opinion
Too large
● Spark (data?) questions are HARD to ask
○ What is your input?
○ What code do you have?
○ What is the expected output?
○ What the heck are you trying to do?
Data Questions
● Use your local development environment to help answer questions!
● Once you’ve found a question you think you can answer, create a unit test
Cultivate your Spark Skills (and reputation!)
Tip #1: Use parallelize to create test data
Tip #2: Use printSchema and show to check
your work
Iterate quickly by re-running your test!
○ Get at core functionality
○ Small, self-contained units
○ Easy to for someone else to understand
○ Helps others!
Good questions are like unit tests
Get out there, write more tests, and
give back to your community.
Example Spark project with unit tests https://github.com/kitmenke/spark-hello-world
Scalatest https://www.scalatest.org/
spark-testing-base https://github.com/holdenk/spark-testing-base
Testcontainers https://www.testcontainers.org/
Test Pyramid https://martinfowler.com/articles/practical-test-pyramid.html
Monitoring https://www.ibm.com/garage/method/practices/manage/golden-signals
STIL IDEA Meetup Talk Ideas
https://docs.google.com/document/d/1x19Kh7OATI1zbCzrvomIf7ffG5OqbBGwt8MFYLahjgE/
edit
Links

More Related Content

What's hot

WebCamp: Developer Day: Оптимизация Lift Framework для работы с большими пото...
WebCamp: Developer Day: Оптимизация Lift Framework для работы с большими пото...WebCamp: Developer Day: Оптимизация Lift Framework для работы с большими пото...
WebCamp: Developer Day: Оптимизация Lift Framework для работы с большими пото...GeeksLab Odessa
 
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi... Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...Databricks
 
10 Things Learned Releasing Databricks Enterprise Wide
10 Things Learned Releasing Databricks Enterprise Wide10 Things Learned Releasing Databricks Enterprise Wide
10 Things Learned Releasing Databricks Enterprise WideDatabricks
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowDatabricks
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowDatabricks
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
Hybrid Apache Spark Architecture with YARN and Kubernetes
Hybrid Apache Spark Architecture with YARN and KubernetesHybrid Apache Spark Architecture with YARN and Kubernetes
Hybrid Apache Spark Architecture with YARN and KubernetesDatabricks
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowDatabricks
 
Is This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the PeopleIs This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the PeopleDatabricks
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentDatabricks
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceDatabricks
 
Anomaly Detection at Scale!
Anomaly Detection at Scale!Anomaly Detection at Scale!
Anomaly Detection at Scale!Databricks
 
Spark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis GkoufasSpark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis GkoufasSpark Summit
 
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...Databricks
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...Databricks
 
Zeppelin at Twitter
Zeppelin at TwitterZeppelin at Twitter
Zeppelin at TwitterPrasad Wagle
 

What's hot (20)

WebCamp: Developer Day: Оптимизация Lift Framework для работы с большими пото...
WebCamp: Developer Day: Оптимизация Lift Framework для работы с большими пото...WebCamp: Developer Day: Оптимизация Lift Framework для работы с большими пото...
WebCamp: Developer Day: Оптимизация Lift Framework для работы с большими пото...
 
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi... Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...
 
10 Things Learned Releasing Databricks Enterprise Wide
10 Things Learned Releasing Databricks Enterprise Wide10 Things Learned Releasing Databricks Enterprise Wide
10 Things Learned Releasing Databricks Enterprise Wide
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflow
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Hybrid Apache Spark Architecture with YARN and Kubernetes
Hybrid Apache Spark Architecture with YARN and KubernetesHybrid Apache Spark Architecture with YARN and Kubernetes
Hybrid Apache Spark Architecture with YARN and Kubernetes
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflow
 
Is This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the PeopleIs This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the People
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
 
Anomaly Detection at Scale!
Anomaly Detection at Scale!Anomaly Detection at Scale!
Anomaly Detection at Scale!
 
Spark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis GkoufasSpark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis Gkoufas
 
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
 
Dagster @ R&S MNT
Dagster @ R&S MNTDagster @ R&S MNT
Dagster @ R&S MNT
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
 
Zeppelin at Twitter
Zeppelin at TwitterZeppelin at Twitter
Zeppelin at Twitter
 

Similar to May 2021 Spark Testing ... or how to farm reputation on StackOverflow

Demise of test scripts rise of test ideas
Demise of test scripts rise of test ideasDemise of test scripts rise of test ideas
Demise of test scripts rise of test ideasRichard Robinson
 
Unit testing (Exploring the other side as a tester)
Unit testing (Exploring the other side as a tester)Unit testing (Exploring the other side as a tester)
Unit testing (Exploring the other side as a tester)Abhijeet Vaikar
 
Unit testing
Unit testingUnit testing
Unit testingPiXeL16
 
Driving application development through behavior driven development
Driving application development through behavior driven developmentDriving application development through behavior driven development
Driving application development through behavior driven developmentEinar Ingebrigtsen
 
An Introduction to Unit Testing
An Introduction to Unit TestingAn Introduction to Unit Testing
An Introduction to Unit TestingSahar Nofal
 
Unit Testing and TDD 2017
Unit Testing and TDD 2017Unit Testing and TDD 2017
Unit Testing and TDD 2017Xavi Hidalgo
 
Lessons Learned When Automating
Lessons Learned When AutomatingLessons Learned When Automating
Lessons Learned When AutomatingAlan Richardson
 
Software Testing Basic Concepts
Software Testing Basic ConceptsSoftware Testing Basic Concepts
Software Testing Basic Conceptswesovi
 
Start with passing tests (tdd for bugs) v0.5 (22 sep 2016)
Start with passing tests (tdd for bugs) v0.5 (22 sep 2016)Start with passing tests (tdd for bugs) v0.5 (22 sep 2016)
Start with passing tests (tdd for bugs) v0.5 (22 sep 2016)Dinis Cruz
 
assertYourself - Breaking the Theories and Assumptions of Unit Testing in Flex
assertYourself - Breaking the Theories and Assumptions of Unit Testing in FlexassertYourself - Breaking the Theories and Assumptions of Unit Testing in Flex
assertYourself - Breaking the Theories and Assumptions of Unit Testing in Flexmichael.labriola
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven DevelopmentZendCon
 
How to establish ways of working that allows shifting-left of the automation ...
How to establish ways of working that allows shifting-left of the automation ...How to establish ways of working that allows shifting-left of the automation ...
How to establish ways of working that allows shifting-left of the automation ...Max Barrass
 
Testing for Logic App Solutions | Integration Monday
Testing for Logic App Solutions | Integration MondayTesting for Logic App Solutions | Integration Monday
Testing for Logic App Solutions | Integration MondayBizTalk360
 

Similar to May 2021 Spark Testing ... or how to farm reputation on StackOverflow (20)

UPC Plone Testing Talk
UPC Plone Testing TalkUPC Plone Testing Talk
UPC Plone Testing Talk
 
Demise of test scripts rise of test ideas
Demise of test scripts rise of test ideasDemise of test scripts rise of test ideas
Demise of test scripts rise of test ideas
 
Unit testing (Exploring the other side as a tester)
Unit testing (Exploring the other side as a tester)Unit testing (Exploring the other side as a tester)
Unit testing (Exploring the other side as a tester)
 
Unit testing
Unit testingUnit testing
Unit testing
 
Driving application development through behavior driven development
Driving application development through behavior driven developmentDriving application development through behavior driven development
Driving application development through behavior driven development
 
An Introduction to Unit Testing
An Introduction to Unit TestingAn Introduction to Unit Testing
An Introduction to Unit Testing
 
Unit Testing and TDD 2017
Unit Testing and TDD 2017Unit Testing and TDD 2017
Unit Testing and TDD 2017
 
Lessons Learned When Automating
Lessons Learned When AutomatingLessons Learned When Automating
Lessons Learned When Automating
 
Software Testing Basic Concepts
Software Testing Basic ConceptsSoftware Testing Basic Concepts
Software Testing Basic Concepts
 
Usable Software Design
Usable Software DesignUsable Software Design
Usable Software Design
 
Start with passing tests (tdd for bugs) v0.5 (22 sep 2016)
Start with passing tests (tdd for bugs) v0.5 (22 sep 2016)Start with passing tests (tdd for bugs) v0.5 (22 sep 2016)
Start with passing tests (tdd for bugs) v0.5 (22 sep 2016)
 
Integration testing - A&BP CC
Integration testing - A&BP CCIntegration testing - A&BP CC
Integration testing - A&BP CC
 
assertYourself - Breaking the Theories and Assumptions of Unit Testing in Flex
assertYourself - Breaking the Theories and Assumptions of Unit Testing in FlexassertYourself - Breaking the Theories and Assumptions of Unit Testing in Flex
assertYourself - Breaking the Theories and Assumptions of Unit Testing in Flex
 
Software testing
Software testingSoftware testing
Software testing
 
Unit Testing
Unit TestingUnit Testing
Unit Testing
 
Automated testing
Automated testingAutomated testing
Automated testing
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven Development
 
How to establish ways of working that allows shifting-left of the automation ...
How to establish ways of working that allows shifting-left of the automation ...How to establish ways of working that allows shifting-left of the automation ...
How to establish ways of working that allows shifting-left of the automation ...
 
Testing for Logic App Solutions | Integration Monday
Testing for Logic App Solutions | Integration MondayTesting for Logic App Solutions | Integration Monday
Testing for Logic App Solutions | Integration Monday
 
Why Unit Testingl
Why Unit TestinglWhy Unit Testingl
Why Unit Testingl
 

More from Adam Doyle

Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAdam Doyle
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop DevelopmentAdam Doyle
 
The new big data
The new big dataThe new big data
The new big dataAdam Doyle
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020Adam Doyle
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAAdam Doyle
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackAdam Doyle
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does dataAdam Doyle
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsAdam Doyle
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingAdam Doyle
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019Adam Doyle
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleAdam Doyle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user groupAdam Doyle
 
Cloudera - Docker on hadoop
Cloudera - Docker on hadoopCloudera - Docker on hadoop
Cloudera - Docker on hadoopAdam Doyle
 

More from Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
 
Cloudera - Docker on hadoop
Cloudera - Docker on hadoopCloudera - Docker on hadoop
Cloudera - Docker on hadoop
 

Recently uploaded

From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 

Recently uploaded (20)

From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 

May 2021 Spark Testing ... or how to farm reputation on StackOverflow

  • 1. Testing Your Apache Spark Apps ... or How to Farm Reputation on stack overflow STL Big Data - Innovation, Data Engineering, Analytics Group May 5, 2021
  • 2. About Me Kit Menke is the newest organizer of the STL Big Data IDEA meetup and the Practice Director for Data Engineering at 1904labs. We’re hiring! https://1904labs.com/your-careers/ Insert Image
  • 3. ● Testing Theory ● Testing is Hard ● Why Test? ● An Example Spark App Testing Setup ● Stack Overflow Agenda
  • 5. Types of Software Tests Verifies the smallest testable parts of an application. Purpose Verifies methods and/or the smallest testable unit. Unit Verify the interactions and connectivity between the modules of the application. Purpose Ensure different components work together. Integration Validates the complete and fully integrated software product. Purpose Evaluate the end-to-end system. System Regression Tests
  • 6. Testing Pyramid System Tests Integration Tests Unit Tests Isolated Isolation Faster Speed Tests Run Slower Fully integrated Most of your tests should be unit tests!
  • 7. Volume ● Utilize expected production throughputs to establish the impact on transaction times of estimated volumes of transactions and users. Rendezvous tests ● Test the application’s performance while subjected to concurrency issues under production load and volume. Stress Tests ● Subject the application to unrealistically high volumes of users accessing the system at the same time in order to determine a system breaking point. Soak Tests ● An extended period of testing at predicted business volumes in order to determine if system performance degrades during a period of continuous usage. Performance Tests
  • 8. Basic checks ● How much data are you getting into your data pipeline? ● How much data is coming out of your pipeline? ● Does the schema look right? Detailed checks ● Are the data types correct? ● Valid values? ○ Distinct values? ○ Ranges? ○ Correct distribution of values? Ex: a lot of null values Data Validation
  • 10. ● General ○ People often disagree on what each type of test is. ○ Unreasonable metrics like code coverage. ○ Focus on manual testing instead of automated CI pipelines ● Unit testing ○ Testing that only check for the absence of errors, not functionality. ○ Testing the wrong thing - symptom of this is mocking everything ● Integration and system tests ○ Can be brittle - prone to breaking and require constant updates ○ Can be difficult to debug - where is the issue? ● Performance tests ○ Tests aren’t repeatable - the size and shape of your data matters! ○ Results should be comparable over time Things That Go Wrong
  • 12. Why: Error Signal Collapse Static Analyses Unit Tests Integration Tests System Tests Performance Tests Other Tests Mutation Testing
  • 13. 1. Prevent bugs from getting into production 2. Allow developers to make changes more confidently/quickly Why test?
  • 14. ● Align with your team on what tests should look like ● Testing Spark Apps requires your full attention ○ Often many dependencies on other data stores (ex: hdfs, hive, hbase, databases) ○ Test your logic, not spark or the dependencies ○ Use pull requests (PR) to review the lack of tests or bad tests ● Start small ○ Bottom (of the pyramid) up - unit tests first ○ Focus on tests that provide the most value ● First priority: run unit tests and build app in a CI pipeline, automatically on PR ● Bug in production? Reproduce it in a unit test first. Advice for Testing Spark Apps
  • 16. ● Project Management ○ Maven (for those of us coming from Java it is the familiar tool) ○ Alternatives: sbt ● Spark ○ Version 3.0.1, choose the same version as your cluster ● Unit testing ○ Scalatest, Scalamock ○ Alternatives: JUnit, TestNG ● CI Pipeline ○ Jenkins ○ Alternatives: Github actions, AWS Codebuild Spark App Testing Stack (MVP)
  • 17. ● Integration testing ○ Scalatest + Testcontainers ○ Alternatives: scalamock ● System testing ○ Scripts ○ Alternatives: java projects ● Performance testing ○ Re-use system test scripts… just with a lot more data ○ Some way to save results (can just be logs!) ● Helpers ○ Spark-testing-base - Base classes to setup/tear down local spark context ○ Test-containers - use docker containers inside scalatest Spark App Testing Stack (expanded)
  • 19. And now for something completely different...
  • 20. Stack Overflow is a question and answer site for professional and enthusiast programmers. It's built and run by you as part of the Stack Exchange network of Q&A sites. With your help, we're working together to build a library of detailed answers to every question about programming. https://stackoverflow.com/tour ● Gain reputation by asking and answering questions ● Stack overflow spawned a network of other Q&A sites ○ Ex: Server Fault, Super User, Ask Ubuntu, Math, English, Arqade Stack Overflow
  • 21. How do you find the answers to your questions? Usually by typing things into Google and end up at Stack overflow... ● How do you ask a good question? ○ Write a title that summarizes the specific problem ○ Introduce the problem before you post any code ○ Help others reproduce the problem ○ Respond to feedback Stack Overflow Unclear Asking for an opinion Too large
  • 22. ● Spark (data?) questions are HARD to ask ○ What is your input? ○ What code do you have? ○ What is the expected output? ○ What the heck are you trying to do? Data Questions
  • 23. ● Use your local development environment to help answer questions! ● Once you’ve found a question you think you can answer, create a unit test Cultivate your Spark Skills (and reputation!) Tip #1: Use parallelize to create test data Tip #2: Use printSchema and show to check your work Iterate quickly by re-running your test!
  • 24. ○ Get at core functionality ○ Small, self-contained units ○ Easy to for someone else to understand ○ Helps others! Good questions are like unit tests Get out there, write more tests, and give back to your community.
  • 25. Example Spark project with unit tests https://github.com/kitmenke/spark-hello-world Scalatest https://www.scalatest.org/ spark-testing-base https://github.com/holdenk/spark-testing-base Testcontainers https://www.testcontainers.org/ Test Pyramid https://martinfowler.com/articles/practical-test-pyramid.html Monitoring https://www.ibm.com/garage/method/practices/manage/golden-signals STIL IDEA Meetup Talk Ideas https://docs.google.com/document/d/1x19Kh7OATI1zbCzrvomIf7ffG5OqbBGwt8MFYLahjgE/ edit Links