Albert Franzi (@FranziCros), Alpha Health
Modular Apache Spark: Transform Your Code into Pieces
2
Slides available at:
http://bit.ly/SparkAI2019-afranzi
About me
3
2013 · 2014 · 2015 · 2017 · 2018 · 2019
Modular Apache Spark: Transform Your Code into Pieces
4
5
We will learn:
- How we reduced test execution time by skipping unaffected tests → fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased test coverage in our Spark code by using the spark-testing-base library provided by Holden Karau
6
We will learn:
- How we reduced test execution time by skipping unaffected tests → fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased test coverage in our Spark code by using the spark-testing-base library provided by Holden Karau
7
Did you ever play with duplicated code across your Spark jobs?
@ The Shining 1980
Have you ever experienced the joy of code
reviewing a never-ending stream of Spark chaos?
8
Please! Never, ever play with duplicated code!
@ The Shining 1980
Fragment the Spark Job
9
Spark Readers · Transformation · Joins · Aliases · Formatters · Spark Writers · Task Context · ...
Readers / Writers
10
Spark Readers
Spark Writers
● Enforce schemas
● Use schemas to read only the fields you are going to use
● Provide a Reader per Dataset & attach its sources to it
● Share schemas & sources between Readers & Writers
● GDPR compliant by design
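For illustration, a minimal sketch of what the userBehaviourSchema placeholder (the ??? on the next slides) could expand to; the field names are borrowed from the transformation slides later in this deck:

import org.apache.spark.sql.types._

// Hypothetical pruned schema: declare only the fields the job actually uses.
// Reading Parquet with an explicit schema prunes all other columns at the
// source, which is also how personal fields can be excluded by design (GDPR).
val userBehaviourSchema: StructType = StructType(Seq(
  StructField("event_type", StringType, nullable = false),
  StructField("object_type", StringType, nullable = false),
  StructField("object", StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("address", StringType)
    )))
  )))
))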
val userBehaviourSchema: StructType = ???
val userBehaviourPath =
Path("s3://<bucket>/user_behaviour/year=2018/month=10/day=03/hour=12/gen=27/")
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withPath(userBehaviourPath)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
GDPR
Readers
11
Spark Readers
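The Writer side is not shown in the deck; a minimal sketch, assuming a hypothetical WriterBuilder that mirrors the ReaderBuilder API and reuses the same schema and path objects:

// Hypothetical WriterBuilder mirroring ReaderBuilder; sharing the same
// schema object keeps Readers & Writers consistent with each other.
val userBehaviourOutputPath = Path("s3://<bucket>/user_behaviour_clean/") // hypothetical
val userBehaviourWriter = WriterBuilder(PARQUET)
  .withSchema(userBehaviourSchema)
  .withPath(userBehaviourOutputPath)
  .buildWriter()

userBehaviourWriter.write(df)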
Readers
12
Spark Readers
val userBehaviourSchema: StructType = ???
// Path structure - s3://<bucket>/user_behaviour/[year]/[month]/[day]/[hour]/[gen]/
val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/")
val startDate: ZonedDateTime = ZonedDateTime.now(ZoneOffset.UTC)
val halfDay: Duration = Duration.ofHours(12)
val userBehaviourPaths: Seq[Path] = PathBuilder
.latestGenHourlyPaths(userBehaviourBasePath, startDate, halfDay)
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withPath(userBehaviourPaths: _*)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
Readers
13
val userBehaviourSchema: StructType = ???
val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/")
val startDate: ZonedDateTime = ZonedDateTime.now(ZoneOffset.UTC)
val halfDay: Duration = Duration.ofHours(12)
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withHourlyPathBuilder(userBehaviourBasePath, startDate, halfDay)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
Spark Readers
Readers
14
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
Spark Readers
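A minimal sketch of what such a dataset-specific reader object could look like, assuming the ReaderBuilder API from the previous slides:

object UserBehaviourReader {
  // Schema and base path live in one place, next to the dataset they describe.
  private val userBehaviourSchema: StructType = ???
  private val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/")

  def read(startDate: ZonedDateTime, duration: Duration): DataFrame =
    ReaderBuilder(PARQUET)
      .withSchema(userBehaviourSchema)
      .withHourlyPathBuilder(userBehaviourBasePath, startDate, duration)
      .buildReader()
      .read()
}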
Transforms
15
Transformation
def transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U]
Dataset[T] ⇒ magic ⇒ Dataset[U]
Transforms
16
Transformation
def withGreeting(df: DataFrame): DataFrame = {
df.withColumn("greeting", lit("hello world"))
}
def extractFromJson(colName: String,
outputColName: String,
jsonSchema: StructType)(df: DataFrame): DataFrame = {
df.withColumn(outputColName, from_json(col(colName), jsonSchema))
}
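Because extractFromJson takes its configuration first and the DataFrame last (curried), both functions plug straight into Spark's built-in Dataset.transform; a usage sketch, where the column names and jsonSchema value are assumptions:

val enrichedDF = df
  .transform(withGreeting)
  .transform(extractFromJson("raw_event", "event", jsonSchema)) // hypothetical columns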
Transforms
17
Transformation
def onlyClassifiedAds(df: DataFrame): DataFrame = {
df.filter(col("event_type") === "View")
.filter(col("object_type") === "ClassifiedAd")
}
def dropDuplicates(df: DataFrame): DataFrame = {
df.dropDuplicates()
}
def cleanedCity(df: DataFrame): DataFrame = {
df.withColumn("city", getCityUdf(col("object.location.address")))
}
val cleanupTransformations: Seq[DataFrame => DataFrame] = Seq(
dropDuplicates,
cleanedCity,
onlyClassifiedAds
)
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
val classifiedAdsDF = df.transforms(cleanupTransformations: _*)
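getCityUdf is not defined in the deck; a hypothetical sketch, assuming the address arrives as a free-text string ending with the city name:

import org.apache.spark.sql.functions.udf

// Hypothetical city extractor: take the last comma-separated token of the address.
val getCityUdf = udf { address: String =>
  Option(address).map(_.split(",").last.trim).orNull
}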
Transforms
18
Transformation
val cleanupTransformations: Seq[DataFrame => DataFrame] = Seq(
dropDuplicates,
cleanedCity,
onlyClassifiedAds
)
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
val classifiedAdsDF = df.transforms(cleanupTransformations: _*)
“As a data consumer, I only need to pick up which transformations I would like to apply,
instead of coding them from scratch.”
“It’s like cooking: engineers provide manufactured ingredients (transformations) and
Data Scientists use the required ones for a successful recipe.”
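The transforms varargs helper used above is not part of Spark's Dataset API; a minimal sketch of how it could be provided, assuming an implicit extension class:

import org.apache.spark.sql.DataFrame

// Hypothetical extension: apply a sequence of transformations in order,
// delegating each step to Spark's built-in Dataset.transform.
implicit class TransformOps(df: DataFrame) {
  def transforms(ts: (DataFrame => DataFrame)*): DataFrame =
    ts.foldLeft(df)((acc, t) => acc.transform(t))
}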
Transforms - Links of Interest
“DataFrame transformations can be defined with arguments so they don’t make
assumptions about the schema of the underlying DataFrame.” - by Matthew Powers.
bit.ly/Spark-ChainingTransformations
bit.ly/Spark-SchemaIndependentTransformations
19
Transformation
github.com/MrPowers/spark-daria
“Spark helper methods to maximize developer productivity.”
20
We will learn:
- How we reduced test execution time by skipping unaffected tests → fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased test coverage in our Spark code by using the spark-testing-base library provided by Holden Karau
21
Did you put untested Spark jobs into production?
“Mars Climate Orbiter destroyed
because of a Metric System Mixup (1999)”
Testing with Holden Karau
22
github.com/holdenk/spark-testing-base
“Base classes to use when writing tests with Spark.”
● Share Spark Context between tests
● Provide methods to make tests easier
○ Fixture Readers
○ Json to DF converters
○ Extra validators
github.com/MrPowers/spark-fast-tests
“An alternative to spark-testing-base to run tests in parallel without restarting Spark Session
after each test file.”
Testing SharedSparkContext
23
package com.holdenkarau.spark.testing
import java.util.Date
import org.apache.spark._
import org.scalatest.{BeforeAndAfterAll, Suite}
/**
* Shares a local `SparkContext` between all tests in a suite
* and closes it at the end. You can share between suites by enabling
* reuseContextIfPossible.
*/
trait SharedSparkContext extends BeforeAndAfterAll with SparkContextProvider {
self: Suite =>
...
protected implicit def reuseContextIfPossible: Boolean = false
...
}
Testing
24
package com.alpha.data.test
trait SparkSuite extends DataFrameSuiteBase {
self: Suite =>
override def reuseContextIfPossible: Boolean = true
protected def createDF(data: Seq[Row], schema: StructType): DataFrame = {
spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
}
protected def jsonFixtureToDF(fileName: String, schema: Option[StructType] = None): DataFrame = {
val fixtureContent = readFixtureContent(fileName)
val fixtureJson = fixtureContentToJson(fixtureContent)
jsonToDF(fixtureJson, schema)
}
protected def checkSchemas(inputSchema: StructType, expectedSchema: StructType): Unit = {
assert(inputSchema.fields.sortBy(_.name).deep == expectedSchema.fields.sortBy(_.name).deep)
}
...
}
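A usage sketch of the trait above; the suite, rows, and schema here are hypothetical:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.scalatest.FunSuite

class CleanupSuite extends FunSuite with SparkSuite {
  test("dropDuplicates removes repeated rows") {
    val schema = StructType(Seq(StructField("city", StringType)))
    val df = createDF(Seq(Row("Barcelona"), Row("Barcelona")), schema)
    assert(df.dropDuplicates().count() == 1)
    checkSchemas(df.schema, schema) // schemas compared field by field
  }
}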
Testing
25
Spark Readers · Transformation · Spark Writers · Task Context
Testing each piece independently makes it easier to test everything together.
Test
26
We will learn:
- How we reduced test execution time by skipping unaffected tests → fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased test coverage in our Spark code by using the spark-testing-base library provided by Holden Karau
27
Are tests taking too long to execute?
Junit4Git by Raquel Pau
28
github.com/rpau/junit4git
“JUnit Extensions for Test Impact Analysis.”
“This is a JUnit extension that ignores those tests that are not related with your last
changes in your Git repository.”
Junit4Git - Gradle conf
29
configurations {
agent
}
dependencies {
testCompile("org.walkmod:scalatest4git_2.11:${version}")
agent "org.walkmod:junit4git-agent:${version}"
}
test.doFirst {
jvmArgs "-javaagent:${configurations.agent.singleFile}"
}
@RunWith(classOf[ScalaGitRunner])
Junit4Git - Travis with Git notes
30
before_install:
- echo -e "machine github.com\nlogin $GITHUB_TOKEN" >> ~/.netrc
after_script:
- git push origin refs/notes/tests:refs/notes/tests
.travis.yml
Junit4Git - Runners
31
import org.scalatest.junit.JUnitRunner
@RunWith(classOf[JUnitRunner])
abstract class UnitSpec extends FunSuite
import org.scalatest.junit.ScalaGitRunner
@RunWith(classOf[ScalaGitRunner])
abstract class IntegrationSpec extends FunSuite
Tests to run always
Tests to run on code changes
Junit4Git
32
Sequence (Runner → Listener → Agent → Test Method):
● testStarted: the Listener starts the report and the Agent begins instrumenting loaded classes (instrumentClass).
● Each class the test method loads (loadClass) is added to the report (addTestClass).
● testFinished: the Listener closes the report.
Junit4Git - Notes
33
afranzi:~/Projects/data-core$ git notes show
[
{
"test": "com.alpha.data.core.ContextSuite",
"method": "Context Classes with CommonBucketConf should contain a commonBucket Field",
"classes": [
"com.alpha.data.core.Context",
"com.alpha.data.core.ContextSuite$$anonfun$1$$anon$1",
"com.alpha.data.core.Context$",
"com.alpha.data.core.CommonBucketConf$class",
"com.alpha.data.core.Context$$anonfun$getConfigParamWithLogging$1"
]
}
...
]
com.alpha.data.core.ContextSuite > Context Classes with CommonBucketConf should contain a commonBucket Field SKIPPED
com.alpha.data.core.ContextSuite > Context Classes with BucketConf should contain a bucket Field SKIPPED
Junit4Git - Links of Interest
34
bit.ly/SlidesJunit4Git
Real Impact Testing Analysis For JVM
bit.ly/TestImpactAnalysis
The Rise of Test Impact Analysis
Summary
35
Use Transforms as pills
Modularize your code
Don’t duplicate code between Spark Jobs
Don’t be afraid of testing everything
Share everything you build as a library, so others can reuse your code.
36