WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Oliver Țupran
From HelloWorld to
Configurable and Reusable
Apache Spark Applications
In Scala
#UnifiedDataAnalytics #SparkAISummit
whoami
3
#UnifiedDataAnalytics #SparkAISummit
Oliver Țupran
Software Engineer
Aviation, Banking, Telecom...
Scala Enthusiast
Apache Spark Enthusiast
olivertupran
tupol
@olivertupran
Intro
Audience
● Professionals starting with Scala and Apache Spark
● Basic Scala knowledge is required
● Basic Apache Spark knowledge is required
4
Intro
#UnifiedDataAnalytics #SparkAISummit
Agenda
● Hello, World!
● Problems
● Solutions
● Summary
5
Intro
#UnifiedDataAnalytics #SparkAISummit
Hello, World!
6
Hello, World!
./bin/spark-shell
scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
scala> linesWithSpark.count() // How many lines contain "Spark"?
res3: Long = 15
Source spark.apache.org/docs/latest/quick-start.html
#UnifiedDataAnalytics #SparkAISummit
Hello, World!
7
Hello, World!
Source spark.apache.org/docs/latest/quick-start.html
object SimpleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
spark.stop()
}
}
#UnifiedDataAnalytics #SparkAISummit
Problems
8
Problems
● Configuration mixed with the application logic
● IO can be much more complex than it looks
● Hard to test
#UnifiedDataAnalytics #SparkAISummit
Solutions
9
Solutions
● Clean separation of the business logic
● Spark session out of the box
● Configuration and validation support
● Encourage and facilitate testing
#UnifiedDataAnalytics #SparkAISummit
tupol/spark-utils
Business Logic Separation
10
Solutions
/**
* @tparam Context The type of the application context class.
* @tparam Result The output type of the run function.
*/
trait SparkRunnable[Context, Result] {
/**
* @param context context instance containing all the application specific configuration
* @param spark active spark session
* @return An instance of type Result
*/
def run(implicit spark: SparkSession, context: Context): Result
}
Source github.com/tupol/spark-utils
#UnifiedDataAnalytics #SparkAISummit
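To make the contract concrete, here is a minimal, hypothetical implementation of SparkRunnable; LineCountContext and LineCounter are illustrative names only and are not part of spark-utils. The point is that run() takes everything it needs from the context and returns a value instead of producing side effects, which keeps the logic easy to test.
import org.apache.spark.sql.{Dataset, SparkSession}

case class LineCountContext(lines: Dataset[String], token: String)

object LineCounter extends SparkRunnable[LineCountContext, Long] {
  // Count the lines that contain the configured token; no IO, no printing.
  override def run(implicit spark: SparkSession, context: LineCountContext): Long =
    context.lines.filter(_.contains(context.token)).count()
}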
Stand-Alone App Blueprint
11
Solutions
trait SparkApp[Context, Result] extends SparkRunnable[Context, Result] with Logging {
def appName: String = . . .
private def applicationConfiguration(implicit spark: SparkSession, args: Array[String]):
com.typesafe.config.Config = . . .
def createSparkSession(runnerName: String): SparkSession =
. . .
def createContext(config: com.typesafe.config.Config): Context
def main(implicit args: Array[String]): Unit = {
// Create a SparkSession, initialize a Typesafe Config instance,
// validate and initialize the application context,
// execute the run() function, close the SparkSession and
// return the result or throw an Exception
. . .
}
}
Source github.com/tupol/spark-utils
#UnifiedDataAnalytics #SparkAISummit
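The main() body elided above is described by its comment; roughly, the flow looks like the sketch below. This is only an illustration of that comment, not the actual spark-utils implementation, which handles logging and error reporting in more detail.
def main(implicit args: Array[String]): Unit = {
  implicit val spark: SparkSession = createSparkSession(appName) // 1. create the Spark session
  try {
    val config  = applicationConfiguration                       // 2. build the Typesafe Config
    val context = createContext(config)                          // 3. validate and build the context
    run(spark, context)                                          // 4. execute the business logic
  } finally {
    spark.stop()                                                 // 5. always close the session; exceptions propagate
  }
}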
Back to SimpleApp
12
object SimpleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
spark.stop()
}
}
Solutions
Source spark.apache.org/docs/latest/quick-start.html
#UnifiedDataAnalytics #SparkAISummit
SimpleApp as SparkApp
13
Source github.com/tupol/spark-utils-demos/
object SimpleApp extends SparkApp[Unit, Unit]{
override def createContext(config: Config): Unit = Unit
override def run(implicit spark: SparkSession, context: Unit): Unit = {
val logFile = "YOUR_SPARK_HOME/README.md"
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
}
}
Solutions
#UnifiedDataAnalytics #SparkAISummit
SimpleApp as SparkApp
14
object SimpleApp extends SparkApp[Unit, Unit] {
  override def createContext(config: Config): Unit = Unit
  override def run(implicit spark: SparkSession, context: Unit): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val logData = spark.read.textFile(logFile).cache()
    val (numAs, numBs) = appLogic(logData)
    println(s"Lines with a: $numAs, Lines with b: $numBs")
  }
  def appLogic(data: Dataset[String]): (Long, Long) = {
    val numAs = data.filter(line => line.contains("a")).count()
    val numBs = data.filter(line => line.contains("b")).count()
    (numAs, numBs)
  }
}
Solutions
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
SimpleApp as SparkApp
15
Solutions
case class SimpleAppContext(input: FileSourceConfiguration, filterA: String, filterB: String)

object SimpleApp extends SparkApp[SimpleAppContext, Unit] {
  override def createContext(config: Config): SimpleAppContext = ???
  override def run(implicit spark: SparkSession, context: SimpleAppContext): Unit = {
    val logData = spark.source(context.input).read.as[String].cache
    val (numAs, numBs) = appLogic(logData, context)
    println(s"Lines with a: $numAs, Lines with b: $numBs")
  }
  def appLogic(data: Dataset[String], context: SimpleAppContext): (Long, Long) = {
    val numAs = data.filter(line => line.contains(context.filterA)).count()
    val numBs = data.filter(line => line.contains(context.filterB)).count()
    (numAs, numBs)
  }
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Why Typesafe Config?
16
Solutions
● supports files in three formats: Java properties, JSON, and a human-friendly JSON superset
● merges multiple files across all formats
● can load from files, URLs or classpath
● users can override the config with Java system properties, java -Dmyapp.foo.bar=10
● supports configuring an app, with its framework and libraries, all from a single file such as
application.conf
● extracts typed properties
● JSON superset features:
○ comments
○ includes
○ substitutions ("foo" : ${bar}, "foo" : Hello ${who})
○ properties-like notation (a.b=c)
○ less noisy, more lenient syntax
○ substitute environment variables (logdir=${HOME}/logs)
○ lists
Source github.com/lightbend/config
#UnifiedDataAnalytics #SparkAISummit
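A quick, self-contained illustration of a few of the features listed above, using the plain Typesafe Config API; the keys and values here are made up for the example.
import com.typesafe.config.ConfigFactory
import scala.collection.JavaConverters._

val config = ConfigFactory.parseString(
  """
    |# comments are allowed
    |who = World
    |greeting = Hello ${who}
    |myapp.threshold = 0.75
    |myapp.labels = [a, b, c]
  """.stripMargin).resolve() // resolve() expands the ${...} substitutions

val greeting  = config.getString("greeting")                 // "Hello World"
val threshold = config.getDouble("myapp.threshold")          // 0.75, already typed
val labels    = config.getStringList("myapp.labels").asScala // Seq("a", "b", "c")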
Application Configuration File
17
Solutions
HOCON
SimpleApp {
input {
format: text
path: SPARK_HOME/README.md
}
filterA: A
filterB: B
}
Java Properties
SimpleApp.input.format=text
SimpleApp.input.path=SPARK_HOME/README.md
SimpleApp.filterA=A
SimpleApp.filterB=B
#UnifiedDataAnalytics #SparkAISummit
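Reading that file with Typesafe Config is straightforward; the sketch below assumes the HOCON file (or the equivalent properties file) is on the classpath as application.conf. Both forms produce the same configuration tree.
import com.typesafe.config.ConfigFactory

val appConfig = ConfigFactory.load().getConfig("SimpleApp")
val inputPath = appConfig.getString("input.path") // SPARK_HOME/README.md
val filterA   = appConfig.getString("filterA")    // "A"
// A system property such as -DSimpleApp.filterA=X would override the file value.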
Configuration and Validation
18
Solutions
case class SimpleAppContext(input: FileSourceConfiguration, filterA: String, filterB: String)

object SimpleAppContext extends Configurator[SimpleAppContext] {
  import org.tupol.utils.config._
  override def validationNel(config: com.typesafe.config.Config):
      scalaz.ValidationNel[Throwable, SimpleAppContext] = {
    config.extract[FileSourceConfiguration]("input")
      .ensure(new IllegalArgumentException(
        "Only 'text' format files are supported").toNel)(_.format == FormatType.Text) |@|
    config.extract[String]("filterA") |@|
    config.extract[String]("filterB") apply
    SimpleAppContext.apply
  }
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Configurator Framework?
19
Solutions
● DSL for easy definition of the context
○ config.extract[Double]("parameter.path")
○ |@| operator to compose the extracted parameters
○ apply to build the configuration case class
● Type-based configuration parameter extraction
○ extract[Double]("parameter.path")
○ extract[Option[Seq[Double]]]("parameter.path")
○ extract[Map[Int, String]]("parameter.path")
○ extract[Either[Int, String]]("parameter.path")
● Implicit Configurators can be used as extractors in the DSL
○ config.extract[SimpleAppContext]("configuration.path")
● The ValidationNel contains either a list of exceptions or the application context (see the sketch below)
#UnifiedDataAnalytics #SparkAISummit
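Putting the DSL together, a hypothetical context using the parameter types listed above could be configured as follows; ScoringContext and its fields are made up for illustration, and the pattern mirrors the SimpleAppContext configurator shown earlier.
import com.typesafe.config.Config
import org.tupol.utils.config._
import scalaz.ValidationNel
import scalaz.syntax.applicative._

case class ScoringContext(threshold: Double, weights: Option[Seq[Double]], labels: Map[Int, String])

object ScoringContext extends Configurator[ScoringContext] {
  override def validationNel(config: Config): ValidationNel[Throwable, ScoringContext] =
    config.extract[Double]("threshold") |@|
    config.extract[Option[Seq[Double]]]("weights") |@|
    config.extract[Map[Int, String]]("labels") apply
    ScoringContext.apply
}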
SimpleApp as SparkApp
20
Solutions
case class SimpleAppContext(input: FileSourceConfiguration, filterA: String, filterB: String)

object SimpleApp extends SparkApp[SimpleAppContext, (Long, Long)] {
  override def createContext(config: Config): SimpleAppContext = SimpleAppContext(config).get
  override def run(implicit spark: SparkSession, context: SimpleAppContext): (Long, Long) = {
    val logData = spark.source(context.input).read.as[String].cache
    val (numAs, numBs) = appLogic(logData, context)
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    (numAs, numBs)
  }
  def appLogic(data: Dataset[String], context: SimpleAppContext): (Long, Long) = {
    . . .
  }
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Data Sources and Data Sinks
21
Solutions
Common: format, URI (connection URL, file path…), schema
CSV: sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode
JSON: primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZeros, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, dateFormat, timestampFormat
XML: rowTag, samplingRatio, excludeAttribute, treatEmptyValuesAsNulls, mode, columnNameOfCorruptRecord, attributePrefix, valueTag, charset, ignoreSurroundingSpaces
JDBC: table, columnName, lowerBound, upperBound, numPartitions, connectionProperties
Source spark.apache.org/docs/latest/
#UnifiedDataAnalytics #SparkAISummit
Data Sources and Data Sinks
22
Solutions
import org.tupol.spark.implicits._
import org.tupol.spark.io._
import spark.implicits._
. . .
val input = config.extract[FileSourceConfiguration]("input").get
val lines = spark.source(input).read.as[String]
// org.tupol.spark.io.FileDataSource(input).read
// spark.read.format(...).option(...).option(...).schema(...).load()
val output = config.extract[FileSinkConfiguration]("output").get
lines.sink(output).write
// org.tupol.spark.io.FileDataSink(output).write(lines)
// lines.write.format(...).option(...).option(...).partitionBy(...).mode(...)
#UnifiedDataAnalytics #SparkAISummit
Data Sources and Data Sinks
23
Solutions
● Very concise and intuitive DSL
● Support for multiple formats: text, csv, json, xml, avro, parquet, orc, jdbc, ...
● Specify a schema on read
● The schema is passed as its full JSON form, as serialised by StructType (see the sketch below)
● Specify the partitioning and bucketing for writing the data
● Structured streaming support
● Delta Lake support
● . . .
#UnifiedDataAnalytics #SparkAISummit
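For reference, the schema-as-JSON and partitioned-write bullets correspond to plain Spark calls like the sketch below; the paths and column name are placeholders, and this shows roughly what the spark-utils DSL configures for you, not the DSL itself.
import org.apache.spark.sql.types.{DataType, StructType}

// A schema serialised once via schema.json can be restored from its JSON form:
val schemaJson = """{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}"""
val schema     = DataType.fromJson(schemaJson).asInstanceOf[StructType]

val df = spark.read.format("csv").option("header", "true").schema(schema).load("path/to/input")
df.write.format("parquet").partitionBy("id").mode("overwrite").save("path/to/output")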
Test! Test! Test!
24
A World of Opportunities
class SimpleAppSpec extends FunSuite with Matchers with SharedSparkSession {
. . .
val DummyInput = FileSourceConfiguration("no path", TextSourceConfiguration())
val DummyContext = SimpleAppContext(input = DummyInput, filterA = "", filterB = "")
test("appLogic should return 0 counts of a and b for an empty DataFrame") {
val testData = spark.emptyDataset[String]
val result = SimpleApp.appLogic(testData, DummyContext)
result shouldBe (0, 0)
}
. . .
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Test! Test! Test!
25
Solutions
class SimpleAppSpec extends FunSuite with Matchers with SharedSparkSession {
. . .
test("run should return (1, 2) as count of a and b for the given data") {
val inputSource = FileSourceConfiguration("src/test/resources/input-test-01",
TextSourceConfiguration())
val context = SimpleAppContext(input = inputSource, filterA = "a", filterB = "b")
val result = SimpleApp.run(spark, context)
result shouldBe (1, 2)
}
. . .
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Format Converter
26
Solutions
case class MyAppContext(input: FormatAwareDataSourceConfiguration,
output: FormatAwareDataSinkConfiguration)
object MyAppContext extends Configurator[MyAppContext] {
import scalaz.ValidationNel
import scalaz.syntax.applicative._
def validationNel(config: Config): ValidationNel[Throwable, MyAppContext] = {
config.extract[FormatAwareDataSourceConfiguration]("input") |@|
config.extract[FormatAwareDataSinkConfiguration]("output") apply
MyAppContext.apply
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Format Converter
27
Solutions
object MyApp extends SparkApp[MyAppContext, DataFrame] {
override def createContext(config: Config): MyAppContext = MyAppContext(config).get
override def run(implicit spark: SparkSession, context: MyAppContext): DataFrame = {
val data = spark.source(context.input).read
data.sink(context.output).write
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Beyond Format Converter
28
Solutions
object MyApp extends SparkApp[MyAppContext, DataFrame] {
override def createContext(config: Config): MyAppContext = MyAppContext(config).get
override def run(implicit spark: SparkSession, context: MyAppContext): DataFrame = {
val inputData = spark.source(context.input).read
val outputData = transform(inputData)
outputData.sink(context.output).write
}
def transform(data: DataFrame)(implicit spark: SparkSession, context: MyAppContext) = {
data // Transformation logic here
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Summary
29
Summary
● Write Apache Spark applications with minimal ceremony
○ batch
○ structured streaming
● IO and general application configuration support
● Facilitates testing
● Increases productivity
tupol/spark-utils spark-tools spark-utils-demos spark-apps.seed.g8
#UnifiedDataAnalytics #SparkAISummit
What’s Next?
30
What’s Next?
● Support for more source types
● Improvements to the configuration framework
● Feedback is welcome!
● Help is welcome!
#UnifiedDataAnalytics #SparkAISummit
olivertupran
tupol
@olivertupran
References
Presentation https://tinyurl.com/yxuneqcs
spark-utils https://github.com/tupol/spark-utils
spark-utils-demos https://github.com/tupol/spark-utils-demos
spark-apps.seed.g8 https://github.com/tupol/spark-apps.seed.g8
spark-tools https://github.com/tupol/spark-tools
Lightbend Config https://github.com/lightbend/config
Giter8 http://www.foundweekends.org/giter8/
Apache Spark http://spark.apache.org/
ScalaZ https://github.com/scalaz/scalaz
Scala https://scala-lang.org
31
References
#UnifiedDataAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT