Technical Deep Dive:
How to Think Vectorized
Alex Behm
Tech Lead, Photon
Agenda
▪ Introduction: Delta Engine, vectorization, micro-benchmarks
▪ Expressions: compute kernels, adaptivity, lazy filters
▪ Aggregation: hash tables, mixed row/columnar kernels
▪ End-to-End Performance
Hardware Changes since 2015
          2010            2015             2020
Storage   50 MB/s (HDD)   500 MB/s (SSD)   16 GB/s (NVMe)   10X
Network   1 Gbps          10 Gbps          100 Gbps         10X
CPU       ~3 GHz          ~3 GHz           ~3 GHz           ☹
CPUs continue to be the bottleneck.
How do we achieve next level performance?
Workload Trends
Businesses are moving faster, so organizations spend less time on data
modeling, which leads to worse performance:
▪ Most columns don’t have “NOT NULL” defined
▪ Strings are convenient, and many date columns are stored as strings
▪ Raw → Bronze → Silver → Gold: from nothing to pristine schema/quality
Can we get both agility and performance?
Delta Engine
[Diagram: Delta Engine components (Query Optimizer, Photon Execution Engine, Caching) and APIs (SQL, Spark DataFrame, Koalas)]
Photon
New execution engine for Delta Engine to accelerate Spark SQL
Built from scratch in C++, for performance:
▪ Vectorization: data-level and instruction-level parallelism
▪ Optimize for modern structured and semi-structured workloads
Vectorization
● Decompose query into compute kernels that process vectors of data
● Typically: Columnar in-memory format
● Cache and CPU friendly: simple predictable loops, many data items, SIMD
● Adaptive: Batch-level specialization, e.g., NULLs or no NULLs
● Modular: Can optimize individual kernels as needed
Sounds great! But… what does it really mean? How does it work? Is it worth it?
This talk: I will teach you how to think vectorized!
Microbenchmarks
Do not necessarily reflect speedups on end-to-end queries
Vectorization: Basic Building Blocks
Let’s build a simple engine from scratch.
1. Expression evaluation and adaptivity
2. Filters and laziness
3. Hash tables and mixed column/row operations
Expressions
Running Example
SELECT SUM(c3) FROM t
WHERE c1 + c2 < 10
GROUP BY g1, g2
Plan: Scan → Filter (c1 + c2 < 10) → Aggregate (SUM(c3))
We’re not covering this part
Operators pass batches of columnar data
Expression Evaluation
[Expression tree: c1 and c2 feed +; the result of + and the literal 10 feed <; the result of < is Out]
Kernels! Each operator in the tree (+, <) is evaluated by a compute kernel.
Expression Evaluation
void PlusKernel(const int64_t* left, const int64_t* right,
                int32_t num_rows, int64_t* output) {
  // Simple, predictable loop over the whole batch.
  for (int32_t i = 0; i < num_rows; ++i) {
    output[i] = left[i] + right[i];
  }
}
🤔 What about NULLs?
Expression Evaluation
void PlusKernel(const int64_t* left, const bool* left_nulls,
                const int64_t* right, const bool* right_nulls,
                int32_t num_rows,
                int64_t* output, bool* output_nulls) {
  for (int32_t i = 0; i < num_rows; ++i) {
    // The result is NULL if either input is NULL.
    bool is_null = left_nulls[i] || right_nulls[i];
    if (!is_null) output[i] = left[i] + right[i];
    output_nulls[i] = is_null;
  }
}
> 30% slower with NULL checks
Expression Evaluation: Runtime Adaptivity
But what if my data rarely has NULLs?
void PlusKernelNoNulls(...);
void PlusKernel(...);
void PlusEval(Column left, Column right, Column output) {
  // Pick the kernel once per batch, not once per row.
  if (!left.has_nulls() && !right.has_nulls()) {
    PlusKernelNoNulls(left.data(), right.data(), output.data());
  } else {
    PlusKernel(left.data(), left.nulls(), …);
  }
}
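A self-contained sketch of this batch-level dispatch, for readers who want to run it. The `Column` struct, its `has_nulls` flag, the `uint8_t` null bytes, and the boolean return value are assumptions made for illustration; the slides do not show Photon's actual `Column` class or full kernel signatures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical column type for this sketch (not Photon's Column).
struct Column {
  std::vector<int64_t> data;
  std::vector<uint8_t> nulls;  // 1 = NULL; same length as data
  bool has_nulls = false;      // batch-level metadata set by the producer
};

// Fast path: no NULL bookkeeping at all.
void PlusKernelNoNulls(const int64_t* left, const int64_t* right,
                       int32_t num_rows, int64_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] + right[i];
}

// General path: the result is NULL if either input is NULL.
void PlusKernel(const int64_t* left, const uint8_t* left_nulls,
                const int64_t* right, const uint8_t* right_nulls,
                int32_t num_rows, int64_t* output, uint8_t* output_nulls) {
  for (int32_t i = 0; i < num_rows; ++i) {
    const uint8_t is_null = left_nulls[i] | right_nulls[i];
    if (!is_null) output[i] = left[i] + right[i];
    output_nulls[i] = is_null;
  }
}

// Chooses a kernel once per batch. Returns true when the fast path ran
// (returned only to make the sketch easy to check).
bool PlusEval(const Column& left, const Column& right, Column* output) {
  assert(left.data.size() == right.data.size());
  const int32_t num_rows = static_cast<int32_t>(left.data.size());
  output->data.assign(num_rows, 0);
  output->nulls.assign(num_rows, 0);
  if (!left.has_nulls && !right.has_nulls) {
    PlusKernelNoNulls(left.data.data(), right.data.data(), num_rows,
                      output->data.data());
    output->has_nulls = false;
    return true;
  }
  assert(left.nulls.size() == left.data.size());
  assert(right.nulls.size() == right.data.size());
  PlusKernel(left.data.data(), left.nulls.data(),
             right.data.data(), right.nulls.data(), num_rows,
             output->data.data(), output->nulls.data());
  output->has_nulls = true;  // conservative: some input was flagged
  return false;
}
```

The branch on `has_nulls` runs once per batch, so its cost is amortized over every row in the batch, while the inner loops stay branch-light.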
Expression Evaluation
For the comparison (< 10):
● Similar kernel approach
● Can optimize for literals, ~25% faster
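One way the literal optimization could look, as a sketch. The kernel names and signatures below are hypothetical; the slides only state that a literal-specialized comparison was roughly 25% faster in a micro-benchmark. The idea shown is that a literal operand is passed by value, so the loop reads one input array instead of two.

```cpp
#include <cassert>
#include <cstdint>

// General form: both operands are columns.
void LessThanKernel(const int64_t* left, const int64_t* right,
                    int32_t num_rows, bool* output) {
  assert(num_rows >= 0);
  for (int32_t i = 0; i < num_rows; ++i) {
    output[i] = left[i] < right[i];
  }
}

// Specialized form: the right operand is a literal, so there is no second
// array to read and nothing to materialize for the constant.
void LessThanLiteralKernel(const int64_t* left, int64_t literal,
                           int32_t num_rows, bool* output) {
  assert(num_rows >= 0);
  for (int32_t i = 0; i < num_rows; ++i) {
    output[i] = left[i] < literal;
  }
}
```

Both kernels produce the same result when the right-hand column holds the literal in every row; the specialized one simply does less memory traffic.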
Filters
[Expression tree: c1 + c2 < 10 → Out ???]
● What exactly is the output?
● What should we do with our input column batch?
Filters: Lazy Representation as Active Rows
Scan → {c1, c2, c3, g1, g2} → Filter (c1 + c2 < 10) → {c1, c2, c3, g1, g2} → Aggregate (SUM(c3))

Column Batch (row 0 is at the bottom, as implied by the active rows):
row  c1  c2  c3  g1  g2
4    5   7   1   7   4
3    4   2   1   7   5
2    3   3   1   7   3
1    2   8   1   7   4
0    1   5   1   7   5

Active Rows after c1 + c2 < 10: 3, 2, 0
Filters: Lazy Representation as Active Rows
void PlusNoNullsSomeActiveKernel(
    const int64_t* left, const int64_t* right,
    const int32_t* active_rows, int32_t num_rows,
    int64_t* output) {
  // num_rows counts the entries in active_rows.
  for (int32_t i = 0; i < num_rows; ++i) {
    int32_t active_idx = active_rows[i];
    // Only active positions are read and written.
    output[active_idx] = left[active_idx] + right[active_idx];
  }
}
Active rows concept must be supported throughout the engine
● Adds complexity, code
● Will come in handy for advanced operations like aggregation/join
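The slides show a kernel that consumes active rows but not one that produces them. A minimal sketch of the producing side might look like this; the function names and the branch-free append are assumptions for illustration, not Photon's code.

```cpp
#include <cassert>
#include <cstdint>

// Writes the indices of rows with values[i] < literal into active_rows and
// returns how many were written. active_rows must have room for num_rows
// entries. The input batch itself is left untouched (the "lazy" part).
int32_t FilterLessThanLiteral(const int64_t* values, int64_t literal,
                              int32_t num_rows, int32_t* active_rows) {
  assert(num_rows >= 0);
  int32_t num_active = 0;
  for (int32_t i = 0; i < num_rows; ++i) {
    // Branch-free append: always store the index, advance only on a match.
    active_rows[num_active] = i;
    num_active += values[i] < literal ? 1 : 0;
  }
  return num_active;
}

// A downstream consumer: sums a column over the active rows only.
int64_t SumActive(const int64_t* values, const int32_t* active_rows,
                  int32_t num_active) {
  assert(num_active >= 0);
  int64_t sum = 0;
  for (int32_t i = 0; i < num_active; ++i) {
    sum += values[active_rows[i]];
  }
  return sum;
}
```

Because the filter only emits a list of positions, no column is copied or compacted; every later kernel needs a variant that iterates over that list, which is the added complexity the slide mentions.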
Aggregation
Hash Aggregation
Basic Algorithm
1. Hash and find bucket
2. If bucket empty, initialize entry with keys and aggregation buffers
3. Compare keys and follow probing strategy to resolve collisions
4. Update aggregation buffers according to aggregation function and input
Hash Table: {g1, g2, SUM}
Think vectorized!
● Columnar, batch-oriented
● Type specialized
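As a baseline for comparison, the four steps of the basic algorithm can be sketched one row at a time. Everything here is an assumption for illustration: the `Entry` layout, the fixed power-of-two capacity with no resizing, the hash mixing function, and linear probing (the slides only say "probing strategy").

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One hash table entry for GROUP BY g1, g2 with SUM(c3).
struct Entry {
  bool used = false;
  int64_t g1 = 0;
  int64_t g2 = 0;
  int64_t sum = 0;
};

struct HashTable {
  std::vector<Entry> entries;
  // Capacity must be a power of two and must never fill up completely,
  // because this sketch does not resize.
  explicit HashTable(size_t capacity) : entries(capacity) {
    assert(capacity > 0 && (capacity & (capacity - 1)) == 0);
  }
};

// Arbitrary mixing function chosen for the sketch.
inline uint64_t HashKeys(int64_t g1, int64_t g2) {
  uint64_t h = static_cast<uint64_t>(g1) * 0x9E3779B97F4A7C15ULL;
  h ^= static_cast<uint64_t>(g2) * 0xC2B2AE3D27D4EB4FULL;
  return h ^ (h >> 29);
}

void AggregateRow(HashTable* ht, int64_t g1, int64_t g2, int64_t c3) {
  const size_t mask = ht->entries.size() - 1;
  size_t bucket = HashKeys(g1, g2) & mask;  // 1. hash and find bucket
  for (;;) {
    Entry& e = ht->entries[bucket];
    if (!e.used) {                          // 2. empty: initialize entry
      e.used = true;
      e.g1 = g1;
      e.g2 = g2;
      e.sum = 0;
    }
    if (e.g1 == g1 && e.g2 == g2) {         // 3. compare keys
      e.sum += c3;                          // 4. update aggregation buffer
      return;
    }
    bucket = (bucket + 1) & mask;           // collision: try the next bucket
  }
}

// Returns the entry for (g1, g2), or nullptr if the group is absent.
const Entry* Find(const HashTable& ht, int64_t g1, int64_t g2) {
  const size_t mask = ht.entries.size() - 1;
  size_t bucket = HashKeys(g1, g2) & mask;
  while (ht.entries[bucket].used) {
    const Entry& e = ht.entries[bucket];
    if (e.g1 == g1 && e.g2 == g2) return &e;
    bucket = (bucket + 1) & mask;
  }
  return nullptr;
}
```

Every row pays for a hash, a data-dependent branch, and a random memory access in one interleaved loop body, which is the pattern the columnar, batch-oriented approach tries to break apart.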
Microbenchmarks
Do not necessarily reflect speedups on end-to-end queries
SELECT col1, SUM(col2)
FROM t
GROUP BY col1
Hash Aggregation
Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3}

Column Batch with per-row hashes:
c3  g1  g2  hash
1   7   4   h2
1   7   5   h1
1   7   3   h1
1   7   4   h2
1   7   5   h1
Hash Aggregation
Column Batch → hashes → buckets
Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3}
Hash Aggregation
Column Batch → hashes → buckets
Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3}
● Compare keys
● Create an active rows list for non-matches (collisions)
Collision: the row with keys (7, 3) hashes to h1, the bucket that holds {7, 5, 10}
Hash Aggregation
Column Batch → hashes → buckets
Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 3, 0}, {7, 4, 3}
● Advance buckets for all collisions and compare keys
● Repeat until match or empty bucket
Hash Aggregation
Hash Table {g1, g2, SUM}: {7, 5, 12}, {7, 3, 1}, {7, 4, 5}
● Update the aggregation state for each aggregate
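The batch-oriented steps walked through on these slides can be sketched as three kernels. This is an illustration under stated assumptions, not Photon's implementation: buckets are stored as table indices rather than pointers, the table has a fixed power-of-two size with free slots, the hash function is arbitrary, and collisions are resolved by linear probing.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Entry {
  bool used = false;
  int64_t g1 = 0;
  int64_t g2 = 0;
  int64_t sum = 0;
};

// Hash every row's keys and map each hash to a starting bucket index.
void HashKernel(const int64_t* g1, const int64_t* g2, int32_t num_rows,
                uint64_t mask, uint64_t* buckets) {
  for (int32_t i = 0; i < num_rows; ++i) {
    uint64_t h = static_cast<uint64_t>(g1[i]) * 0x9E3779B97F4A7C15ULL;
    h ^= static_cast<uint64_t>(g2[i]) * 0xC2B2AE3D27D4EB4FULL;
    buckets[i] = (h ^ (h >> 29)) & mask;
  }
}

// Resolve every row to its final bucket. Rows whose bucket holds different
// keys are collected as active rows, advanced by one bucket, and retried
// until they find a match or an empty bucket.
void ProbeKernel(std::vector<Entry>* table, const int64_t* g1,
                 const int64_t* g2, int32_t num_rows, uint64_t* buckets) {
  assert(num_rows >= 0);
  assert(!table->empty() && (table->size() & (table->size() - 1)) == 0);
  const uint64_t mask = table->size() - 1;
  std::vector<int32_t> active(num_rows);
  std::vector<int32_t> collisions;
  for (int32_t i = 0; i < num_rows; ++i) active[i] = i;
  while (!active.empty()) {
    collisions.clear();
    for (int32_t row : active) {
      Entry& e = (*table)[buckets[row]];
      if (!e.used) {  // empty bucket: claim it for this row's keys
        e.used = true;
        e.g1 = g1[row];
        e.g2 = g2[row];
      }
      if (e.g1 != g1[row] || e.g2 != g2[row]) {  // collision
        buckets[row] = (buckets[row] + 1) & mask;
        collisions.push_back(row);
      }
    }
    active.swap(collisions);
  }
}

// Update SUM for every row through its resolved bucket.
void SumUpdateKernel(std::vector<Entry>* table, const int64_t* input,
                     const uint64_t* buckets, int32_t num_rows) {
  for (int32_t i = 0; i < num_rows; ++i) {
    (*table)[buckets[i]].sum += input[i];
  }
}
```

Each kernel is a simple loop over one concern (hashing, key comparison, aggregation update), and the collision list plays the same role as the active rows produced by a filter.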
Mixed Column/Row Kernel Example
void AggKernel(AggFn* fn,
               int64_t* input,
               int8_t** buckets,
               int64_t buffer_offset,
               int32_t num_rows) {
  for (int32_t i = 0; i < num_rows; ++i) {
    // Memory access into large array. Good to have a tight loop.
    int8_t* bucket = buckets[i];
    // Make sure this gets inlined.
    fn->update(input[i], bucket + buffer_offset);
  }
}
buckets: a “column” whose values are sprayed across rows in the hash table
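A runnable variant of this kernel, as a sketch. The `SumFn` type, the 24-byte row layout (g1, g2, SUM), and the templated `AggKernelT` are hypothetical; the slides do not show the real `AggFn` interface. Templating on the function type is one common way to make sure `update` can be inlined.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical SUM function matching the update(value, buffer) call shape.
struct SumFn {
  inline void update(int64_t value, int8_t* buffer) const {
    // Rows are raw bytes, so go through memcpy instead of casting pointers.
    int64_t sum;
    std::memcpy(&sum, buffer, sizeof(sum));
    sum += value;
    std::memcpy(buffer, &sum, sizeof(sum));
  }
};

// input is columnar; buckets[i] points at the hash table row for input row i;
// buffer_offset is where the aggregation buffer sits inside each row.
template <typename Fn>
void AggKernelT(const Fn& fn, const int64_t* input, int8_t* const* buckets,
                int64_t buffer_offset, int32_t num_rows) {
  assert(num_rows >= 0);
  for (int32_t i = 0; i < num_rows; ++i) {
    fn.update(input[i], buckets[i] + buffer_offset);
  }
}
```

With a row layout of two 8-byte keys followed by an 8-byte sum, `buffer_offset` would be 16; the input column is read sequentially while the writes land wherever each row's bucket lives.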
End-to-End Performance
Why go to the trouble?
TPC-DS 30TB, Queries/Hour (higher is better): 110 vs. 32, a 3.3x speedup
[Chart labels: 23 columns, mixed types; 1 column]
Real-World Queries
▪ Several preview customers from different industries
▪ Need to have a suitable workload with sufficient Photon feature coverage
▪ Typical experience: 2-3x speedup end-to-end
▪ Mileage varies, best speedup: From 80 → 5 minutes!
Recap
▪ Vectorization: Decompose query into simple loops over vectors of data
▪ Batch-level adaptivity, e.g., NULLs vs no-NULLs
▪ Lazy filter evaluation with active rows → useful concept
▪ Mixed column/row operations for accessing hash tables
Photon Technical Deep Dive: How to Think Vectorized

  • 1. Technical Deep Dive: How to Think Vectorized Alex Behm Tech Lead, Photon
  • 2. Agenda Introduction Delta Engine, vectorization, micro-benchmarks Expressions Compute kernels, adaptivity, lazy filters Aggregation Hash tables, mixed row/columnar kernels End-to-End Performance
  • 3. Hardware Changes since 2015 2010 2015 2020 Storage 50 MB/s (HDD) 500 MB/s (SSD) 16 GB/s (NVMe) 10X Network 1 Gbps 10 Gbps 100 Gbps 10X CPU ~3 GHz ~3 GHz ~3 GHz ☹ CPUs continue to be the bottleneck. How do we achieve next level performance?
  • 4. Workload Trends Businesses are moving faster, and as a result organizations spend less time in data modeling, leading to worse performance: ▪ Most columns don’t have “NOT NULL” defined ▪ Strings are convenient, and many date columns are stored as strings ▪ Raw → Bronze → Silver → Gold: from nothing to pristine schema/quality Can we get both agility and performance?
  • 6. Photon New execution engine for Delta Engine to accelerate Spark SQL Built from scratch in C++, for performance: ▪ Vectorization: data-level and instruction-level parallelism ▪ Optimize for modern structured and semi-structured workloads
  • 7. Vectorization ● Decompose query into compute kernels that process vectors of data ● Typically: Columnar in-memory format ● Cache and CPU friendly: simple predictable loops, many data items, SIMD ● Adaptive: Batch-level specialization, e.g., NULLs or no NULLs ● Modular: Can optimize individual kernels as needed Sounds great! But… what does it really mean? How does it work? Is it worth it? This talk: I will teach you how to think vectorized!
  • 8. Microbenchmarks Does not necessarily reflect speedups on end-to-end queries
  • 9. Let’s build a simple engine from scratch. 1. Expression evaluation and adaptivity 2. Filters and laziness 3. Hash tables and mixed column/row operations Vectorization: Basic Building Blocks
  • 11. Running Example SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2 Scan Filter c1 + c2 < 10 Aggregate SUM(c3) We’re not covering this part Operators pass batches of columnar data
  • 12. Expression Evaluation c1 c2 + < 10 Out SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  • 13. Expression Evaluation c1 c2 + < 10 Out Kernels! SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  • 14. Expression Evaluation void PlusKernel(const int64_t* left, const int64_t* right int32_t num_rows, int64_t* output) { for (int32_t i = 0; i < num_rows; ++i) { output[i] = left[i] + right[i] } } SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  • 15. Expression Evaluation void PlusKernel(const int64_t* left, const int64_t* right int32_t num_rows, int64_t* output) { for (int32_t i = 0; i < num_rows; ++i) { output[i] = left[i] + right[i] } } 🤔 What about NULLs? SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  • 16. Expression Evaluation void PlusKernel(const int64_t* left, const bool* left_nulls, const int64_t* right, const bool* right_nulls, int32_t num_rows, int64_t* output, bool* output_nulls) { for (int32_t i = 0; i < num_rows; ++i) { bool is_null = left_nulls[i] || right_nulls[i]; if (!is_null) output[i] = left[i] + right[i]; output_nulls[i] = is_null; } } SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  • 17. Expression Evaluation void PlusKernel(const int64_t* left, const bool* left_nulls, const int64_t* right, const bool* right_nulls, int32_t num_rows, int64_t* output, bool* output_nulls) { for (int32_t i = 0; i < num_rows; ++i) { bool is_null = left_nulls[i] || right_nulls[i]; if (!is_null) output[i] = left[i] + right[i]; output_nulls[i] = is_null; } } > 30% slower with NULL checks SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
  • 18. Expression Evaluation: Runtime Adaptivity void PlusKernelNoNulls(...); void PlusKernel(...); void PlusEval(Column left, Column right, Column output) { if (!left.has_nulls() && !right.has_nulls()) { PlusKernelNoNulls(left.data(), right.data(), output.data()); } else { PlusKernel(left.data(), left.nulls(), …); } } But what if my data rarely has NULLs?
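The batch-level dispatch on this slide can be sketched as a small self-contained program. The `Column` struct, its `has_nulls` flag, and the use of `std::vector` and `uint8_t` NULL indicators are assumptions made for illustration; the slides do not show the engine's actual column type.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical column type for illustration only (not Photon's actual class).
// nulls[i] != 0 marks row i as NULL; nulls always has the same length as data.
struct Column {
  std::vector<int64_t> data;
  std::vector<uint8_t> nulls;
  bool has_nulls = false;  // batch-level flag, assumed to be set by the producer
};

// Fast path: no NULL handling inside the loop.
void PlusKernelNoNulls(const int64_t* left, const int64_t* right,
                       int32_t num_rows, int64_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] + right[i];
}

// General path: propagates NULLs row by row.
void PlusKernel(const int64_t* left, const uint8_t* left_nulls,
                const int64_t* right, const uint8_t* right_nulls,
                int32_t num_rows, int64_t* output, uint8_t* output_nulls) {
  for (int32_t i = 0; i < num_rows; ++i) {
    const bool is_null = left_nulls[i] || right_nulls[i];
    if (!is_null) output[i] = left[i] + right[i];
    output_nulls[i] = is_null;
  }
}

// Picks a kernel once per batch, so the per-row loop stays branch-light.
void PlusEval(const Column& left, const Column& right, Column* output) {
  assert(left.data.size() == right.data.size());
  assert(left.nulls.size() == left.data.size());
  assert(right.nulls.size() == right.data.size());
  const int32_t num_rows = static_cast<int32_t>(left.data.size());
  output->data.assign(num_rows, 0);
  output->nulls.assign(num_rows, 0);
  output->has_nulls = left.has_nulls || right.has_nulls;
  if (!output->has_nulls) {
    PlusKernelNoNulls(left.data.data(), right.data.data(), num_rows,
                      output->data.data());
  } else {
    PlusKernel(left.data.data(), left.nulls.data(), right.data.data(),
               right.nulls.data(), num_rows, output->data.data(),
               output->nulls.data());
  }
}
```

In this sketch the cost of the check is paid once per batch rather than once per row, which is the point of batch-level adaptivity: data that rarely contains NULLs takes the fast path for most batches.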
  • 19. Expression Evaluation Expression tree: (c1 + c2) < 10 → Out ● Similar kernel approach ● Can optimize for literals, ~25% faster SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2
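A rough sketch of what "similar kernel approach" and "optimize for literals" could look like for the `<` comparison. The kernel names, the `uint8_t` result column, and the wrapper functions are hypothetical; the ~25% figure is the slide's measurement, not something this sketch demonstrates.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// General form: both operands are columns.
void LessThanKernel(const int64_t* left, const int64_t* right,
                    int32_t num_rows, uint8_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] < right[i];
}

// Literal specialization: the right operand is a single value, so the loop
// reads one input array instead of two.
void LessThanLiteralKernel(const int64_t* left, int64_t literal,
                           int32_t num_rows, uint8_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] < literal;
}

// Convenience wrappers (hypothetical) that allocate the result column.
std::vector<uint8_t> LessThan(const std::vector<int64_t>& left,
                              const std::vector<int64_t>& right) {
  assert(left.size() == right.size());
  std::vector<uint8_t> out(left.size());
  LessThanKernel(left.data(), right.data(),
                 static_cast<int32_t>(left.size()), out.data());
  return out;
}

std::vector<uint8_t> LessThanLiteral(const std::vector<int64_t>& left,
                                     int64_t literal) {
  std::vector<uint8_t> out(left.size());
  LessThanLiteralKernel(left.data(), literal,
                        static_cast<int32_t>(left.size()), out.data());
  return out;
}
```

Without the specialization, the literal 10 would first have to be expanded into a full column so that the general kernel can consume it.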
  • 20. Filters SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2 Expression tree: (c1 + c2) < 10 → Out ??? ● What exactly is the output? ● What should we do with our input column batch?
  • 21. Filters: Lazy Representation as Active Rows SELECT SUM(c3) FROM t WHERE c1 + c2 < 10 GROUP BY g1, g2 Plan: Scan → Filter c1 + c2 < 10 → Aggregate SUM(c3), with batches of {c1, c2, c3, g1, g2} passed between operators. Column Batch (values in the order drawn on the slide): c1 = 5 4 3 2 1; c2 = 7 2 3 8 5; c3 = 1 1 1 1 1; g1 = 7 7 7 7 7; g2 = 4 5 3 4 5. Active Rows for c1 + c2 < 10: 3 2 0 (consistent with the last value drawn being row 0)
  • 22. Filters: Lazy Representation as Active Rows void PlusNoNullsSomeActiveKernel( const int64_t* left, const int64_t* right, const int32_t* active_rows, int32_t num_rows, int64_t* output) { for (int32_t i = 0; i < num_rows; ++i) { int32_t active_idx = active_rows[i]; output[active_idx] = left[active_idx] + right[active_idx]; } } Active rows concept must be supported throughout the engine ● Adds complexity and code ● Will come in handy for advanced operations like aggregation/join
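The slide shows a kernel that consumes active rows; a filter also has to produce them. Below is a minimal sketch of one way to do that, assuming the filter compacts the list of surviving row indices in place and leaves the column data untouched. `FilterLessThanLiteral` and `FilterSumLessThan` are hypothetical names, and NULL handling is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Evaluates values[row] < literal only for the rows listed in active_rows and
// compacts the survivors in place. Returns the new number of active rows.
int32_t FilterLessThanLiteral(const int64_t* values, int64_t literal,
                              int32_t* active_rows, int32_t num_active) {
  int32_t num_out = 0;
  for (int32_t i = 0; i < num_active; ++i) {
    const int32_t row = active_rows[i];
    // Always write the row, but only advance the output position on a match.
    // num_out never exceeds i, so unread entries are not overwritten.
    active_rows[num_out] = row;
    num_out += values[row] < literal;
  }
  return num_out;
}

// Runs c1 + c2 < literal over a batch that starts with all rows active.
std::vector<int32_t> FilterSumLessThan(const std::vector<int64_t>& c1,
                                       const std::vector<int64_t>& c2,
                                       int64_t literal) {
  assert(c1.size() == c2.size());
  const int32_t num_rows = static_cast<int32_t>(c1.size());
  std::vector<int64_t> sum(num_rows);
  for (int32_t i = 0; i < num_rows; ++i) sum[i] = c1[i] + c2[i];

  std::vector<int32_t> active_rows(num_rows);
  std::iota(active_rows.begin(), active_rows.end(), 0);  // 0, 1, ..., n-1
  const int32_t num_active = FilterLessThanLiteral(
      sum.data(), literal, active_rows.data(), num_rows);
  active_rows.resize(num_active);
  return active_rows;
}
```

With `c1 = {1, 2, 3, 4, 5}` and `c2 = {5, 8, 3, 2, 7}` (the previous slide's batch, read with the last drawn value as row 0, which is an interpretation of the figure), the sums are 6, 10, 6, 6, 12, so rows 0, 2 and 3 stay active. The columns themselves are never copied or rewritten, which is what makes the representation lazy.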
  • 24. Hash Aggregation Basic Algorithm 1. Hash and find bucket 2. If bucket empty, initialize entry with keys and aggregation buffers 3. Compare keys and follow probing strategy to resolve collisions 4. Update aggregation buffers according to aggregation function and input Hash Table {g1, g2, SUM}
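For reference, the four steps of the basic algorithm can be sketched row-at-a-time as follows. This is a baseline under stated assumptions: open addressing with linear probing, a fixed power-of-two capacity with no resizing, `int64_t` keys, and an arbitrary illustrative hash. `Entry`, `HashKeys`, and `SumHashTable` are hypothetical names, not taken from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One hash table entry: grouping keys plus the aggregation buffer for SUM.
struct Entry {
  bool used = false;
  int64_t g1 = 0;
  int64_t g2 = 0;
  int64_t sum = 0;
};

// Arbitrary illustrative hash; any reasonable mixing function would do.
inline uint64_t HashKeys(int64_t g1, int64_t g2) {
  uint64_t h = static_cast<uint64_t>(g1) * 0x9E3779B97F4A7C15ULL;
  h ^= static_cast<uint64_t>(g2) + 0x9E3779B97F4A7C15ULL + (h << 6) + (h >> 2);
  return h;
}

class SumHashTable {
 public:
  // Capacity must be a power of two so that masking can replace modulo.
  explicit SumHashTable(size_t capacity) : entries_(capacity) {
    assert(capacity > 0 && (capacity & (capacity - 1)) == 0);
  }

  void Add(int64_t g1, int64_t g2, int64_t value) {
    const size_t mask = entries_.size() - 1;
    size_t idx = HashKeys(g1, g2) & mask;  // 1. hash and find bucket
    for (size_t probes = 0; probes < entries_.size(); ++probes) {
      Entry& e = entries_[idx];
      if (!e.used) {  // 2. empty bucket: initialize keys and buffer
        e.used = true;
        e.g1 = g1;
        e.g2 = g2;
        e.sum = 0;
      }
      if (e.g1 == g1 && e.g2 == g2) {  // 3. compare keys
        e.sum += value;                // 4. update aggregation buffer
        return;
      }
      idx = (idx + 1) & mask;  // 3. collision: linear probing
    }
    assert(false && "table full; a real implementation would resize");
  }

  // Returns the entry for the given keys, or nullptr if the group is absent.
  const Entry* Find(int64_t g1, int64_t g2) const {
    const size_t mask = entries_.size() - 1;
    size_t idx = HashKeys(g1, g2) & mask;
    for (size_t probes = 0; probes < entries_.size(); ++probes) {
      const Entry& e = entries_[idx];
      if (!e.used) return nullptr;
      if (e.g1 == g1 && e.g2 == g2) return &e;
      idx = (idx + 1) & mask;
    }
    return nullptr;
  }

 private:
  std::vector<Entry> entries_;
};
```

Every row goes through hashing, probing, key comparison, and the update inside a single loop body, which mixes unpredictable branches with random memory accesses.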
  • 25. Hash Aggregation Think vectorized! ● Columnar, batch-oriented ● Type specialized Basic Algorithm 1. Hash and find bucket 2. If bucket empty, initialize entry with keys and aggregation buffers 3. Compare keys and follow probing strategy to resolve collisions 4. Update aggregation buffers according to aggregation function and input
  • 26. Microbenchmarks: Do not necessarily reflect speedups on end-to-end queries SELECT col1, SUM(col2) FROM t GROUP BY col1
  • 27. Hash Aggregation Column Batch: c3 = 1 1 1 1 1; g1 = 7 7 7 7 7; g2 = 4 5 3 4 5. Hashes: h2 h1 h1 h2 h1. Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3}
  • 28. Hash Aggregation Column Batch: c3 = 1 1 1 1 1; g1 = 7 7 7 7 7; g2 = 4 5 3 4 5. Hashes: h2 h1 h1 h2 h1, each mapped to a bucket. Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3}
  • 29. Hash Aggregation Column Batch: c3 = 1 1 1 1 1; g1 = 7 7 7 7 7; g2 = 4 5 3 4 5. Hashes: h2 h1 h1 h2 h1, each mapped to a bucket. Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3} ● Compare keys ● Create an active rows list for non-matches (collisions) Collision: the row with g1 = 7, g2 = 3 hashes to h1, the bucket that holds {7, 5, 10}
  • 30. Hash Aggregation Column Batch: c3 = 1 1 1 1 1; g1 = 7 7 7 7 7; g2 = 4 5 3 4 5. Hashes: h2 h1 h1 h2 h1, each mapped to a bucket. Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 3, 0}, {7, 4, 3} ● Advance buckets for all collisions and compare keys ● Repeat until match or empty bucket
  • 31. Hash Aggregation Column Batch: c3 = 1 1 1 1 1; g1 = 7 7 7 7 7; g2 = 4 5 3 4 5. Hashes: h2 h1 h1 h2 h1, each mapped to a bucket. Hash Table {g1, g2, SUM}: {7, 5, 12}, {7, 3, 1}, {7, 4, 5} ● Update the aggregation state for each aggregate
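The walkthrough on the last few slides can be approximated by the batch-at-a-time sketch below: one simple loop per step, with an active rows list that carries the collisions from one probing round to the next. Linear probing, a fixed power-of-two table with spare capacity, `int64_t` keys, and all names (`Entry`, `HashKeys`, `AggregateBatch`, `FindEntry`) are assumptions for illustration; the talk does not specify the probing strategy or table layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct Entry {
  bool used = false;
  int64_t g1 = 0, g2 = 0;  // grouping keys
  int64_t sum = 0;         // aggregation buffer for SUM
};

// Arbitrary illustrative hash.
inline uint64_t HashKeys(int64_t g1, int64_t g2) {
  uint64_t h = static_cast<uint64_t>(g1) * 0x9E3779B97F4A7C15ULL;
  h ^= static_cast<uint64_t>(g2) + 0x9E3779B97F4A7C15ULL + (h << 6) + (h >> 2);
  return h;
}

// SUM(c3) GROUP BY g1, g2 for one batch. The table size must be a power of
// two and large enough to hold all groups (this sketch never resizes).
void AggregateBatch(const std::vector<int64_t>& g1,
                    const std::vector<int64_t>& g2,
                    const std::vector<int64_t>& c3,
                    std::vector<Entry>* table) {
  assert(g2.size() == g1.size() && c3.size() == g1.size());
  assert(!table->empty() && (table->size() & (table->size() - 1)) == 0);
  const int32_t num_rows = static_cast<int32_t>(g1.size());
  const size_t mask = table->size() - 1;

  // Step 1: hash every row and compute its initial bucket.
  std::vector<size_t> buckets(num_rows);
  for (int32_t i = 0; i < num_rows; ++i) buckets[i] = HashKeys(g1[i], g2[i]) & mask;

  std::vector<int32_t> active(num_rows);
  std::iota(active.begin(), active.end(), 0);  // all rows start active
  int32_t num_active = num_rows;
  size_t rounds = 0;
  while (num_active > 0) {
    ++rounds;
    assert(rounds <= table->size());  // otherwise the table is full
    // Step 2: initialize empty buckets with the row's keys.
    for (int32_t i = 0; i < num_active; ++i) {
      const int32_t row = active[i];
      Entry& e = (*table)[buckets[row]];
      if (!e.used) { e.used = true; e.g1 = g1[row]; e.g2 = g2[row]; e.sum = 0; }
    }
    // Step 3: compare keys; non-matches (collisions) become the active rows.
    int32_t num_collisions = 0;
    for (int32_t i = 0; i < num_active; ++i) {
      const int32_t row = active[i];
      const Entry& e = (*table)[buckets[row]];
      active[num_collisions] = row;
      num_collisions += (e.g1 != g1[row]) | (e.g2 != g2[row]);
    }
    // Step 4: advance the bucket of every collision, then repeat.
    for (int32_t i = 0; i < num_collisions; ++i) {
      const int32_t row = active[i];
      buckets[row] = (buckets[row] + 1) & mask;
    }
    num_active = num_collisions;
  }
  // Step 5: every row now has its final bucket; update the aggregation state.
  for (int32_t i = 0; i < num_rows; ++i) (*table)[buckets[i]].sum += c3[i];
}

// Linear scan used only to inspect results in this sketch.
const Entry* FindEntry(const std::vector<Entry>& table, int64_t g1, int64_t g2) {
  for (const Entry& e : table) {
    if (e.used && e.g1 == g1 && e.g2 == g2) return &e;
  }
  return nullptr;
}
```

Starting from a table that already holds {7, 5, 10} and {7, 4, 3}, aggregating the slide's batch (c3 all 1, g1 all 7, g2 = 4 5 3 4 5) yields {7, 5, 12}, {7, 3, 1} and {7, 4, 5}, matching the final slide of the walkthrough. Which rows actually collide depends on the hash function, so the intermediate rounds may differ from the figure.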
  • 32. Mixed Column/Row Kernel Example void AggKernel(AggFn* fn, int64_t* input, int8_t** buckets, int64_t buffer_offset, int32_t num_rows) { for (int32_t i = 0; i < num_rows; ++i) { // Memory access into large array. Good to have a tight loop. int8_t* bucket = buckets[i]; // Make sure this gets inlined. fn->update(input[i], bucket + buffer_offset); } } A “column” whose values are sprayed across rows in the hash table
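One way to make the slide's kernel concrete is sketched below. It assumes a row layout of {g1, g2, SUM} with 8 bytes per field, and it passes the aggregate function as a template parameter so the compiler can inline `update`; the slide only says the call should be inlined, not how. `SumFn`, `kRowSize`, `kSumOffset`, and `ReadSum` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed row layout inside the hash table: {g1, g2, SUM}, 8 bytes each.
constexpr int64_t kRowSize = 24;
constexpr int64_t kSumOffset = 16;

// Hypothetical SUM aggregate operating on a buffer inside a hash table row.
struct SumFn {
  static void update(int64_t value, int8_t* buffer) {
    int64_t sum;
    std::memcpy(&sum, buffer, sizeof(sum));  // avoids alignment/aliasing issues
    sum += value;
    std::memcpy(buffer, &sum, sizeof(sum));
  }
};

// Columnar input, row-wise output: input[i] is added to the aggregation
// buffer of the row that buckets[i] points to.
template <typename AggFn>
void AggKernel(const int64_t* input, int8_t* const* buckets,
               int64_t buffer_offset, int32_t num_rows) {
  for (int32_t i = 0; i < num_rows; ++i) {
    int8_t* bucket = buckets[i];  // random access into the hash table
    assert(bucket != nullptr);
    AggFn::update(input[i], bucket + buffer_offset);  // inlinable static call
  }
}

// Reads the SUM buffer of a row; used to inspect results.
int64_t ReadSum(const int8_t* row, int64_t buffer_offset) {
  int64_t sum;
  std::memcpy(&sum, row + buffer_offset, sizeof(sum));
  return sum;
}
```

For example, with two zero-initialized rows and `buckets` pointing at rows 0, 1, 0, 1, calling `AggKernel<SumFn>` on the input {1, 2, 3, 4} leaves sums of 4 and 6 in the two rows. A virtual call through `AggFn*`, as on the slide, would work too, but whether it gets inlined is then up to the compiler's devirtualization.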
  • 34. Why go to the trouble? TPC-DS 30TB, Queries/Hour (higher is better): 110 vs. 32, 3.3x speedup
  • 36. Real-World Queries ▪ Several preview customers from different industries ▪ Need to have a suitable workload with sufficient Photon feature coverage ▪ Typical experience: 2-3x speedup end-to-end ▪ Mileage varies, best speedup: From 80 → 5 minutes!
  • 37. Recap ▪ Vectorization: Decompose query into simple loops over vectors of data ▪ Batch-level adaptivity, e.g., NULLs vs no-NULLs ▪ Lazy filter evaluation with active rows → useful concept ▪ Mixed column/row operations for accessing hash tables