SlideShare a Scribd company logo
1 of 32
Download to read offline
Correctness and Performance
of Apache Spark SQL
Spark + AI Summit, London
October 4, 2018
1
2
NICOLAS POGGI
Databricks, Performance Engineer
• Spark benchmarking
Barcelona Supercomputing - Microsoft Research Centre
• Lead researcher ALOJA project
• New architectures for Big Data
BarcelonaTech (UPC), PhD in Computer Architecture
• Autonomic resource manager for the cloud
• Web customer modeling
About us
BOGDAN GHIT
Databricks, Software Engineer
• SQL performance optimizations
IBM T.J. Watson, Research Intern
• Bid advisor for cloud spot markets
Delft University of Technology, PhD in Computer Science
• Resource management in datacenters
• Performance of Spark, Hadoop
Databricks ecosystem
3
ToolsDevelopers
DBR Cluster Manager
Infrastructure Customers
DBR 5.0
DBR 4.3-LTS
DBR 4.3
Feb’18
Beta
Full Support
Marked for deprecation
Deprecated
Databricks runtime (DBR) releases
Our goal is to make releases automatic and frequent
Jun’18 Oct’18 Feb’19 Jun’19 Oct’19 Feb’20
Spark 2.4
Spark 2.3
Spark 2.3
Apache Spark contributions
5
Hundreds of commits monthly to the Apache Spark project
Numberofcommits
At this pace of development, mistakes are bound to happen
Where do these contributions go?
6
Scope of the testing
Developers put a significant engineering effort in testing
Query
Input data
Configuration
Over 200 built-in functions
Yet another brick in the wall
Unit testing is not enough to guarantee correctness and performance
Unit testing
Integration
E2E
Micro
Benchmarks
Plan
stability
Fuzz
testing
Macro
benchmarks
Stress
testing
Customer
workloads
Failure
testing
8
Continuous Integration pipeline
New artifacts Metrics
- Correctness
- Performance
Test
Alerts
- Merge
- Build
Dev
- Rules
- Policies
Analyze
9
Classification and alerting
- Impact
- Scope
- Correlation
- Confirm?
Failure
Regression
- Minimize
- Drill-down
- Profile
- Compare
- Validate
Events Re-test Alert
Classify Root-cause
Correctness
Performance
10
Failure
Regression
Events
Re-test
Alert
Classify Root-cause
Correctness
Performance
Correctness
Random query generation
11
Query profile
Model
translator
Spark
Query
Postgres
Query
vs
vs
...
...
DDL and datagen
12
...
...
BigIntBoolean
Timestamp
Decimal
FloatInteger
SmallInt
String
Choose a data type
Random number of rows
Random number of columns
Random number of tables
Random partition columns
Recursive query model
13
SQL Query
WITH
FROMUNION
SELECT
Functions
Constant
GROUP BY
ORDER BY
Table
Column
Alias
Query
Clause
Expression
JOIN
WHERE
Probabilistic query profile
Independent weights
• Optional query clauses
Inter-dependent weights
• Join types
• Select functions
ORDER BY
UNION
GROUP BY WHERE
10%
10%
50%
10%
Coalesce flattening (1/4)
SELECT COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3) AS int_col,
IF(NULL, VARIANCE(COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)),
COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_1,
STDDEV(t2.double_col_2) AS float_col,
COALESCE(MIN((t1.smallint_col_3) - (COALESCE(t2.smallint_col_3, t1.smallint_col_3,
t2.smallint_col_3))), COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3),
COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_2
FROM table_4 t1
INNER JOIN table_4 t2 ON (t2.timestamp_col_7) = (t1.timestamp_col_7)
WHERE (t1.smallint_col_3) IN (CAST('0.04' AS DECIMAL(10,10)), t1.smallint_col_3)
GROUP BY COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)
Small dataset with 2 tables of 5x5 size
Within 10 randomly generated queries
Error: Operation is in ERROR_STATE
Coalesce flattening (2/3)
Aggregate
Project
Join
FILTERSCAN foo
SCAN bar
foo.id IN
(CAST(‘0.04’ AS DECIMAL(10, 10)), foo.id)
foo.ts = bar.ts
COALESCE(COALESCE(foo.id, foo.val), 88)
GROUP BY COALESCE(foo.id, foo.val)
Coalesce flattening (3/4)
Aggregate
Project
Join
FILTERSCAN t1
SCAN t2
foo.id IN
(CAST(‘0.04’ AS DECIMAL(10, 10)), foo.id)
foo.ts = bar.ts
COALESCE(COALESCE(foo.id, foo.val), 88)
COALESCE(foo.id, foo.val)
Coalesce flattening (4/4)
Aggregate
Project
SCAN foo
Minimized query:
SELECT
COALESCE(COALESCE(foo.id, foo.val), 88)
FROM foo
GROUP BY
COALESCE(foo.id, foo.val)
Analyzing the error
● The optimizer flattens the nested coalesce calls
● The SELECT clause doesn’t contain the GROUP BY expression
● Possibly a problem with any GROUP BY expression that can be optimized
Lead function (1/3)
SELECT (t1.decimal0803_col_3) / (t1.decimal0803_col_3) AS decimal_col,
CAST(696 AS STRING) AS char_col, t1.decimal0803_col_3,
(COALESCE(CAST('0.02' AS DECIMAL(10,10)),
CAST('0.47' AS DECIMAL(10,10)),
CAST('-0.53' AS DECIMAL(10,10)))) +
(LEAD(-65, 4) OVER (ORDER BY (t1.decimal0803_col_3) / (t1.decimal0803_col_3),
CAST(696 AS STRING))) AS decimal_col_1,
CAST(-349 AS STRING) AS char_col_1
FROM table_16 t1
WHERE (943) > (889)
Error: Column 4 in row 10 does not match:
[1.0, 696, -871.81, <<-64.98>>, -349] SPARK row
[1.0, 696, -871.81, <<None>>, -349] POSTGRESQL row
Lead function (2/3)
Project
FILTER
SCAN foo
WHERE expr
COALESCE(expr) + LEAD(-65, 4) OVER ORDER BY expr
Lead function (3/3)
Project
FILTER WHERE expr
COALESCE(expr) + LEAD(-65, 4) OVER ORDER BY expr
Analyzing the error
● Using constant input values breaks the behaviour of the LEAD function
● SC-16633: https://github.com/apache/spark/pull/14284
SCAN foo
22
Performance
22
Failure
Regression
Events
Re-test
Alert
Classify Root-cause
Correctness
Performance
Benchmarking tools
•We use spark-sql-perf public library for
TPC workloads
• Provides datagen and import scripts
• local, cluster, S3
• Dashboards for analyzing results
•The Spark micro benchmarks
•And the async-profiler
• to produce flamegraphs
23
https://github.com/databricks/spark-sql-perf
Source:
http://www.brendangregg.com/flamegraphs.html
CPU Flame Graph
15%
------------- journey
Per query drill-down: 67
First, scope and validate
• in 2.4-master (dev) compared
• to 2.3 in DBR 4.3 (prod)
Query 67: 18% regression From 320s to 390s
Q67 executor profile for Spark 2.4-master
Side-by-side 2.3 vs 2.4: find the differences
Spark 2.3 Spark 2.4
Framegraph diff zoom Red slower White new
unsafe/Platform.copyMemory()
unsafe/BytesToBytesMap.safeLookup
New: hash/Murmur3_x86_32.hashUTF8String()
Murmur3_x86_32.hashUnsafeBytesBlock()
Look for hints:
- Mem mgmt
- Hashing
- unsafe
Root-causing
Results:
• Spark 2.3: hashUnsafeBytes() -> 40µs
• Spark 2.4 hashUnsafeBytesBlock() -> 140µs
• also slower UTF8String.getBytes()
Microbenchmark for UTF8String
GIT BISECT
1.)
2.)
3.)
It is a journey to get a release out
DBR and Spark testing and performance are a continuous effort
• Over a month effort to bring performance to improving
TPC-DS 2.4-master vs. 2.3 at SF 1000
15%
5%
< 0%
Conclusion
Spark in production is not just the framework
Unit and integration testing are not enough
We need Spark specific tools to automate the process
to ensure both correctness and performance
Thanks!
Correctness and Performance of Apache Spark SQL
October 2018
32
Test AnalyzeDev

More Related Content

What's hot

Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Databricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 
MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkJen Aman
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a clusterGal Marder
 
Introduction to ScalaZ
Introduction to ScalaZIntroduction to ScalaZ
Introduction to ScalaZKnoldus Inc.
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark StreamingKnoldus Inc.
 
JVM languages "flame wars"
JVM languages "flame wars"JVM languages "flame wars"
JVM languages "flame wars"Gal Marder
 
Developing Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaDeveloping Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaLightbend
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...Flink Forward
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks
 
Robust Operations of Kafka Streams
Robust Operations of Kafka StreamsRobust Operations of Kafka Streams
Robust Operations of Kafka Streamsconfluent
 
ROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowDatabricks
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudScott Miao
 
Revitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsRevitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsLightbend
 

What's hot (20)

Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using Spark
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a cluster
 
Introduction to ScalaZ
Introduction to ScalaZIntroduction to ScalaZ
Introduction to ScalaZ
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
JVM languages "flame wars"
JVM languages "flame wars"JVM languages "flame wars"
JVM languages "flame wars"
 
Developing Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaDeveloping Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For Scala
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
Robust Operations of Kafka Streams
Robust Operations of Kafka StreamsRobust Operations of Kafka Streams
Robust Operations of Kafka Streams
 
ROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlow
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
Revitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsRevitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive Streams
 

Similar to Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas Poggi

Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesDataWorks Summit
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineDatabricks
 
Big Data-Driven Applications with Cassandra and Spark
Big Data-Driven Applications  with Cassandra and SparkBig Data-Driven Applications  with Cassandra and Spark
Big Data-Driven Applications with Cassandra and SparkArtem Chebotko
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksEDB
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Explaining the Postgres Query Optimizer (Bruce Momjian)
Explaining the Postgres Query Optimizer (Bruce Momjian)Explaining the Postgres Query Optimizer (Bruce Momjian)
Explaining the Postgres Query Optimizer (Bruce Momjian)Ontico
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Christian Peel
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 
Reducing Redundancies in Multi-Revision Code Analysis
Reducing Redundancies in Multi-Revision Code AnalysisReducing Redundancies in Multi-Revision Code Analysis
Reducing Redundancies in Multi-Revision Code AnalysisSebastiano Panichella
 
Optimizing Tcl Bytecode
Optimizing Tcl BytecodeOptimizing Tcl Bytecode
Optimizing Tcl BytecodeDonal Fellows
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Databricks
 
Code instrumentation
Code instrumentationCode instrumentation
Code instrumentationBryan Reinero
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer OverviewOlav Sandstå
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Grow and Shrink - Dynamically Extending the Ruby VM Stack
Grow and Shrink - Dynamically Extending the Ruby VM StackGrow and Shrink - Dynamically Extending the Ruby VM Stack
Grow and Shrink - Dynamically Extending the Ruby VM StackKeitaSugiyama1
 

Similar to Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas Poggi (20)

Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL Releases
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL Engine
 
Big Data-Driven Applications with Cassandra and Spark
Big Data-Driven Applications  with Cassandra and SparkBig Data-Driven Applications  with Cassandra and Spark
Big Data-Driven Applications with Cassandra and Spark
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer Works
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Explaining the Postgres Query Optimizer (Bruce Momjian)
Explaining the Postgres Query Optimizer (Bruce Momjian)Explaining the Postgres Query Optimizer (Bruce Momjian)
Explaining the Postgres Query Optimizer (Bruce Momjian)
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Reducing Redundancies in Multi-Revision Code Analysis
Reducing Redundancies in Multi-Revision Code AnalysisReducing Redundancies in Multi-Revision Code Analysis
Reducing Redundancies in Multi-Revision Code Analysis
 
Optimizing Tcl Bytecode
Optimizing Tcl BytecodeOptimizing Tcl Bytecode
Optimizing Tcl Bytecode
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
Code instrumentation
Code instrumentationCode instrumentation
Code instrumentation
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Grow and Shrink - Dynamically Extending the Ruby VM Stack
Grow and Shrink - Dynamically Extending the Ruby VM StackGrow and Shrink - Dynamically Extending the Ruby VM Stack
Grow and Shrink - Dynamically Extending the Ruby VM Stack
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Recently uploaded (20)

꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 

Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas Poggi

  • 1. Correctness and Performance of Apache Spark SQL Spark + AI Summit, London October 4, 2018 1
  • 2. 2 NICOLAS POGGI Databricks, Performance Engineer • Spark benchmarking Barcelona Supercomputing - Microsoft Research Centre • Lead researcher ALOJA project • New architectures for Big Data BarcelonaTech (UPC), PhD in Computer Architecture • Autonomic resource manager for the cloud • Web customer modeling About us BOGDAN GHIT Databricks, Software Engineer • SQL performance optimizations IBM T.J. Watson, Research Intern • Bid advisor for cloud spot markets Delft University of Technology, PhD in Computer Science • Resource management in datacenters • Performance of Spark, Hadoop
  • 3. Databricks ecosystem 3 ToolsDevelopers DBR Cluster Manager Infrastructure Customers
  • 4. DBR 5.0 DBR 4.3-LTS DBR 4.3 Feb’18 Beta Full Support Marked for deprecation Deprecated Databricks runtime (DBR) releases Our goal is to make releases automatic and frequent Jun’18 Oct’18 Feb’19 Jun’19 Oct’19 Feb’20 Spark 2.4 Spark 2.3 Spark 2.3
  • 5. Apache Spark contributions 5 Hundreds of commits monthly to the Apache Spark project Numberofcommits At this pace of development, mistakes are bound to happen
  • 6. Where do these contributions go? 6 Scope of the testing Developers put a significant engineering effort in testing Query Input data Configuration Over 200 built-in functions
  • 7. Yet another brick in the wall Unit testing is not enough to guarantee correctness and performance Unit testing Integration E2E Micro Benchmarks Plan stability Fuzz testing Macro benchmarks Stress testing Customer workloads Failure testing
  • 8. 8 Continuous Integration pipeline New artifacts Metrics - Correctness - Performance Test Alerts - Merge - Build Dev - Rules - Policies Analyze
  • 9. 9 Classification and alerting - Impact - Scope - Correlation - Confirm? Failure Regression - Minimize - Drill-down - Profile - Compare - Validate Events Re-test Alert Classify Root-cause Correctness Performance
  • 11. Random query generation 11 Query profile Model translator Spark Query Postgres Query vs vs
  • 12. ... ... DDL and datagen 12 ... ... BigIntBoolean Timestamp Decimal FloatInteger SmallInt String Choose a data type Random number of rows Random number of columns Random number of tables Random partition columns
  • 13. Recursive query model 13 SQL Query WITH FROMUNION SELECT Functions Constant GROUP BY ORDER BY Table Column Alias Query Clause Expression JOIN WHERE
  • 14. Probabilistic query profile Independent weights • Optional query clauses Inter-dependent weights • Join types • Select functions ORDER BY UNION GROUP BY WHERE 10% 10% 50% 10%
  • 15. Coalesce flattening (1/4) SELECT COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3) AS int_col, IF(NULL, VARIANCE(COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)), COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_1, STDDEV(t2.double_col_2) AS float_col, COALESCE(MIN((t1.smallint_col_3) - (COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3))), COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3), COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_2 FROM table_4 t1 INNER JOIN table_4 t2 ON (t2.timestamp_col_7) = (t1.timestamp_col_7) WHERE (t1.smallint_col_3) IN (CAST('0.04' AS DECIMAL(10,10)), t1.smallint_col_3) GROUP BY COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3) Small dataset with 2 tables of 5x5 size Within 10 randomly generated queries Error: Operation is in ERROR_STATE
  • 16. Coalesce flattening (2/3) Aggregate Project Join FILTERSCAN foo SCAN bar foo.id IN (CAST(‘0.04’ AS DECIMAL(10, 10)), foo.id) foo.ts = bar.ts COALESCE(COALESCE(foo.id, foo.val), 88) GROUP BY COALESCE(foo.id, foo.val)
  • 17. Coalesce flattening (3/4) Aggregate Project Join FILTERSCAN t1 SCAN t2 foo.id IN (CAST(‘0.04’ AS DECIMAL(10, 10)), foo.id) foo.ts = bar.ts COALESCE(COALESCE(foo.id, foo.val), 88) COALESCE(foo.id, foo.val)
  • 18. Coalesce flattening (4/4) Aggregate Project SCAN foo Minimized query: SELECT COALESCE(COALESCE(foo.id, foo.val), 88) FROM foo GROUP BY COALESCE(foo.id, foo.val) Analyzing the error ● The optimizer flattens the nested coalesce calls ● The SELECT clause doesn’t contain the GROUP BY expression ● Possibly a problem with any GROUP BY expression that can be optimized
  • 19. Lead function (1/3) SELECT (t1.decimal0803_col_3) / (t1.decimal0803_col_3) AS decimal_col, CAST(696 AS STRING) AS char_col, t1.decimal0803_col_3, (COALESCE(CAST('0.02' AS DECIMAL(10,10)), CAST('0.47' AS DECIMAL(10,10)), CAST('-0.53' AS DECIMAL(10,10)))) + (LEAD(-65, 4) OVER (ORDER BY (t1.decimal0803_col_3) / (t1.decimal0803_col_3), CAST(696 AS STRING))) AS decimal_col_1, CAST(-349 AS STRING) AS char_col_1 FROM table_16 t1 WHERE (943) > (889) Error: Column 4 in row 10 does not match: [1.0, 696, -871.81, <<-64.98>>, -349] SPARK row [1.0, 696, -871.81, <<None>>, -349] POSTGRESQL row
  • 20. Lead function (2/3) Project FILTER SCAN foo WHERE expr COALESCE(expr) + LEAD(-65, 4) OVER ORDER BY expr
  • 21. Lead function (3/3) Project FILTER WHERE expr COALESCE(expr) + LEAD(-65, 4) OVER ORDER BY expr Analyzing the error ● Using constant input values breaks the behaviour of the LEAD function ● SC-16633: https://github.com/apache/spark/pull/14284 SCAN foo
  • 23. Benchmarking tools •We use spark-sql-perf public library for TPC workloads • Provides datagen and import scripts • local, cluster, S3 • Dashboards for analyzing results •The Spark micro benchmarks •And the async-profiler • to produce flamegraphs 23 https://github.com/databricks/spark-sql-perf Source: http://www.brendangregg.com/flamegraphs.html CPU Flame Graph
  • 25. Per query drill-down: 67 First, scope and validate • in 2.4-master (dev) compared • to 2.3 in DBR 4.3 (prod) Query 67: 18% regression From 320s to 390s
  • 26. Q67 executor profile for Spark 2.4-master
  • 27. Side-by-side 2.3 vs 2.4: find the differences Spark 2.3 Spark 2.4
  • 28. Framegraph diff zoom Red slower White new unsafe/Platform.copyMemory() unsafe/BytesToBytesMap.safeLookup New: hash/Murmur3_x86_32.hashUTF8String() Murmur3_x86_32.hashUnsafeBytesBlock() Look for hints: - Mem mgmt - Hashing - unsafe
  • 29. Root-causing Results: • Spark 2.3: hashUnsafeBytes() -> 40µs • Spark 2.4 hashUnsafeBytesBlock() -> 140µs • also slower UTF8String.getBytes() Microbenchmark for UTF8String GIT BISECT 1.) 2.) 3.)
  • 30. It is a journey to get a release out DBR and Spark testing and performance are a continuous effort • Over a month effort to bring performance to improving TPC-DS 2.4-master vs. 2.3 at SF 1000 15% 5% < 0%
  • 31. Conclusion Spark in production is not just the framework Unit and integration testing are not enough We need Spark specific tools to automate the process to ensure both correctness and performance
  • 32. Thanks! Correctness and Performance of Apache Spark SQL October 2018 32 Test AnalyzeDev