Ioana Delaney, Jia Li
Spark Technology Center, IBM
Extending Spark SQL Data Sources
APIs with Join Push Down
#EUdev7
About the speakers
• Ioana Delaney
– Spark Technology Center, IBM
– DB2 Optimizer developer working in the areas of query semantics, query rewrite, and
query optimization
– Worked on various releases of DB2 LUW and DB2 with BLU Acceleration
– Apache Spark SQL Contributor
• Jia Li
– Spark Technology Center, IBM
– Apache Spark SQL Contributor
– Worked on various releases of IBM BigInsights and IBM Optim Query Workload
Tuner
IBM Spark Technology Center
• Founded in 2015
• Location: 505 Howard St., San Francisco
• Web: http://spark.tc
• Twitter: @apachespark_tc
• Mission:
– Contribute intellectual and technical capital to
the Apache Spark community.
– Make the core technology enterprise and
cloud-ready.
– Build data science skills to drive intelligence
into business applications
http://bigdatauniversity.com
Motivation
• Spark applications often directly query external data sources such as relational
databases or files
• Spark provides Data Sources APIs for accessing structured data through Spark SQL
• Supports optimizations such as Filter push down and Column pruning, i.e., a subset of the
query functionality that can be pushed down to some data sources
• This work extends the Data Sources APIs with join push down
• Conceptually, join push down is nothing more than an extension of Selection and Projection
push down
• Significantly improves query performance by reducing the amount of data transfer
and exploiting the capabilities of the data sources such as index access
Data Sources APIs
• Mechanism for accessing external data sources
• Built-in libraries (Hive, Parquet, Json, JDBC, etc.) and external, third-party libraries
through its spark-packages
• Users can read/save data through generic load()/save() functions, or through relational SQL
tables, i.e. DataFrames with persistent metadata and names
• Developers can build new libraries for various data sources; to create a new data
source, one needs to specify the schema and how to retrieve the data
• Tightly integrated with Spark’s Catalyst Optimizer by allowing optimizations to be
pushed down to the data source
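For example, a minimal Scala sketch of reading through the generic load() path with the built-in JDBC library; the connection URL and credentials are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// Read TPCDS.STORE_SALES through the built-in JDBC data source.
// URL, user, and password are placeholders, not real endpoints.
val storeSales = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://dbhost:50000/TPCDS")
  .option("dbtable", "TPCDS.STORE_SALES")
  .option("user", "user")
  .option("password", "password")
  .load()

// Column pruning and filter push down apply to the resulting DataFrame
// through the existing Data Sources APIs.
storeSales.select("ss_store_sk", "ss_net_profit").filter("ss_net_profit > 0").show()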
Use Case 1: Enterprise-wide data pipeline using Spark
• As data gets distributed from different data
sources throughout an organization, there is
a need to create a pipeline to connect these
sources.
• The Spark framework extracts the records from
the backing store, caches the DataFrame
results, and then performs joins with the
updates coming from the stream
• e.g. Spark at Bloomberg – Summit 2016
[Diagram: a real-time stream of updates U1, U2, U3, … (Avro de-serialized) is joined with a DataFrame cached from the backing data store.]
Use Case 2: SQL Acceleration with Spark
• e.g. RDBMS MPP cluster nodes are overlaid
with local Spark executor processes
• The data frames in Spark are derived from
the existing data partitions of the database
cluster.
• The co-location of executors with the
database engine processes minimizes the
latency of accessing the data
• IBM’s dashDB and other vendors provide
SQL acceleration through co-location or
other Spark integrated architectures
Use Case 3: ETL from legacy data sources
• Common pattern in Big Data
space is offloading relational
databases and files into
Hadoop and other data sources
[Diagram: an Extract, Transform, Load flow moving data from JDBC data sources and files into Hadoop and Cassandra.]
Spark as a Unified Analytics Platform for Data Federation
• Ingests data from disparate data sources and performs fast in-memory
analytics on their combined data
• Supports a wide variety of structured and unstructured data sources
• Applies analytics everywhere without the need for data centralization
Challenges: Network speed
• e.g. large tables in DBMS and selective join in Spark
• Spark reads the entire table, applies some predicates at the data source, and
executes the join locally
• Instead, push the join execution to the data source
– Reduce the amount of data transfer
– The query runs 40x faster!
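Join push down is not yet part of Spark's Data Sources APIs, but its effect can be approximated today by handing the join to the database as a subquery in the JDBC dbtable option. A hedged Scala sketch, with a placeholder URL and the SparkSession named spark from the earlier sketch:

val url = "jdbc:db2://dbhost:50000/TPCDS"  // hypothetical connection URL

// Without push down: both tables are read into Spark and joined locally.
val sales = spark.read.format("jdbc")
  .option("url", url).option("dbtable", "TPCDS.STORE_SALES").load()
val dates = spark.read.format("jdbc")
  .option("url", url).option("dbtable", "TPCDS.DATE_DIM").load()
val localJoin = sales.join(dates, sales("ss_sold_date_sk") === dates("d_date_sk"))

// Approximating push down: the DBMS executes the join and ships back
// only the (much smaller) result.
val pushedJoin = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable",
    """(select ss_store_sk, ss_net_profit
      |   from store_sales, date_dim
      |  where d_date_sk = ss_sold_date_sk and d_year = 2001) t""".stripMargin)
  .load()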
Challenges: Exploit data source capabilities
• e.g. DBMS has indexes
• Join execution in Spark ~ O(|L|), i.e. proportional to the size of the large table L
• Join execution in DBMS ~ O(|S|), i.e. one index lookup into L per row of the small table S
• Instead, push the join execution to the data source
– Efficient join execution using index access
– Reduce the data transfer
– The query runs 100x faster!
Join push down to the data source
• Alternative execution strategy in Spark
• Conceptually, nothing more than an extension of Selection and Projection push down
• May reduce the amount of data transfer
• Exposes the capabilities of the underlying data source, such as index access, to Spark
• Small API changes with significant performance improvement
Push down based on cost vs. heuristics
• Acceptable performance is the most significant concern about data federation
• Query Optimizer determines the best execution of a SQL query e.g. implementation
of relational operators, order of execution, etc.
• In a federated environment, the optimizer must also decide whether the different
operations should be done by the federated server, e.g. Spark, or by the data source
– Needs knowledge of what each data source can do (e.g. file system vs. RDBMS), and how
much it costs (e.g. statistics from data source, network speed, etc.)
• Spark’s Catalyst Optimizer uses a combination of heuristics and cost model
• Cost model is an evolving feature in Spark
• Until a federated cost model is fully implemented, use safe heuristics
Minimize data transfer
[Table diagrams: a star schema with fact table store_sales referencing dimensions date_dim and store (N:1); the expanding-join variant adds store_returns in an M:N relationship.]

Filtering joins (e.g. Star-joins):

select s_store_id, s_store_name, sum(ss_net_profit) as store_sales_profit
from store_sales, date_dim, store
where d_moy = 4 and d_year = 2001 and
  d_date_sk = ss_sold_date_sk and
  s_store_sk = ss_store_sk
group by s_store_id, s_store_name
order by s_store_id, s_store_name
limit 100

Expanding joins:

select s_store_id, s_store_name, sum(ss_net_profit) as store_sales_profit,
  sum(ss_net_loss) as store_loss
from store_sales, date_dim, store, store_returns
where d_moy = 4 and d_year = 2001 and
  d_date_sk = ss_sold_date_sk and
  s_store_sk = ss_store_sk and
  ss_ticket_number = sr_ticket_number and
  ss_item_sk = sr_item_sk
group by s_store_id, s_store_name
order by s_store_id, s_store_name
limit 100
Star-schema joins
• Joins among tables in a star-schema relationship
• Star-schema is the simplest form of a data warehouse schema
• Star-schema model consists of a fact table referencing a number of dimension tables
• Fact and dimension tables are in a primary key – foreign key relationship.
• Star joins act as filters applied to the fact table
• Catalyst recognizes star-schema joins
• We use this information to detect and push down star joins to the data source
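Catalyst's star-schema detection mentioned above is exposed as an optimizer flag in Spark 2.2. A minimal sketch of enabling it, assuming a SparkSession named spark (the exact interaction between the two flags may vary by Spark version):

// Cost-based optimization and star-schema detection flags (Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.starSchemaDetection", "true")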
Join re-ordering based on data source
• Maximize the amount of functionality that can be pushed down to the data source
• Extends Catalyst’s join enumeration rules to re-order joins based on data source
• Important alternative execution plan for global federated optimization
select s_store_id, s_store_name, sum(ss_net_profit) as store_sales_profit
from store_sales, customer, customer_address, date_dim, store
where d_moy = 4 and d_year = 2001 and
  d_date_sk = ss_sold_date_sk and
  s_store_sk = ss_store_sk
group by s_store_id, s_store_name
order by s_store_id, s_store_name
limit 100
Managing parallelism: Single vs. multiple partitions join
• Partitions are the basic unit of parallelism in Spark
• JDBC options to specify data partitioning: partitionColumn, lowerBound, upperBound,
and numPartitions
• Spark splits the table read across numPartitions tasks, each covering a stride of
(upperBound - lowerBound) / numPartitions values of the partition column
CREATE TABLE STORE_SALES
USING org.apache.spark.sql.jdbc
OPTIONS (. . .
  dbtable 'TPCDS.STORE_SALES',
  partitionColumn 'ss_sold_date_sk',
  lowerBound '2415022',
  upperBound '2488020',
  numPartitions '10'
. . . )
select d_moy, sum(ss_net_profit) as store_sales_profit
from store_sales, date_dim
where d_year = 2001 and
  d_date_sk = ss_sold_date_sk
group by d_moy
order by d_moy
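The same partitioned read expressed through the DataFrame API; a minimal Scala sketch with a hypothetical connection URL:

// Spark issues numPartitions parallel queries, each covering a stride of
// roughly (2488020 - 2415022) / 10 ≈ 7300 values of ss_sold_date_sk.
val partitionedSales = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://dbhost:50000/TPCDS")  // placeholder URL
  .option("dbtable", "TPCDS.STORE_SALES")
  .option("partitionColumn", "ss_sold_date_sk")
  .option("lowerBound", "2415022")
  .option("upperBound", "2488020")
  .option("numPartitions", "10")
  .load()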
How to partition when joins are pushed down?
1) Push down partitioned, co-located joins to
data source
– A co-located join occurs locally on the
database partition on which the data
resides.
– Partitioning in Spark aligns with the data
source partitioning
– Hard to achieve for multi-way joins
– Cost based decision
How to partition when joins are pushed down?
2) Perform partitioning in Spark, i.e. choose a join method that favors parallelism
– Broadcast Hash Join is the preferred method when one of the tables is small
– If the large table comes from a single partition JDBC connection, the execution is serialized
on that single partition
– In such cases, Shuffle Hash Join and Sort Merge Join may outperform Broadcast Hash Join
[Plan diagrams: Broadcast Hash Join reading the large table through a single JDBC partition vs. Shuffle Hash Join vs. Sort Merge Join.]
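One way to steer Spark away from Broadcast Hash Join in this situation is to lower or disable the broadcast threshold; a minimal sketch using a standard Spark SQL setting:

// Disable automatic broadcast joins so Spark falls back to Shuffle Hash
// Join or Sort Merge Join and preserves parallelism on the large side.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Spark 2.x also exposes spark.sql.join.preferSortMergeJoin to choose
// between Shuffle Hash Join and Sort Merge Join.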
1 TB TPC-DS Performance Results
• Proxy of a real data warehouse
• Retail product supplier e.g. retail sales, web, catalog, etc.
• Ad-hoc, reporting, and data mining type of queries
• Mix of two data sources: IBM DB2/JDBC and Parquet
Cluster: 4-node cluster, each node having:
12 × 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM
Number of cores: 48
Apache Hadoop 2.7.3, Apache Spark 2.2 main (August, 2017)
Database info:
Schema: TPCDS
Scale factor: 1TB total space
Mix of Parquet and DB2/JDBC data sources
DB2 DPF info: 4-node cluster, each node having:
10 × 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM
Number of cores: 48
TPC-DS Query   spark-2.2 (mins)   spark-2.2-jpd (mins)   Speedup
Q8                   32                    4                8x
Q13                 121                    5               25x
Q15                   4                    2                2x
Q17                  77                    7               11x
Q19                  42                    4               11x
Q25                 153                    7               21x
Q29                  81                    7               11x
Q42                  31                    3               10x
Q45                  14                    3                4x
Q46                  61                    5               12x
Q48                 155                    5               31x
Q52                  31                    5                6x
Q55                  31                    4                8x
Q68                  69                    4               17x
Q74                  47                   23                2x
Q79                  63                    4               15x
Q85                  22                    2               11x
Q89                  55                    4               14x
Data Sources APIs for reading the data
• BaseRelation: The abstraction of a collection of tuples read from the data
source. It provides the schema of the data.
• TableScan: Reads all the data in the data source.
• PrunedScan: Eliminates unneeded columns at the data source.
• PrunedFilteredScan: Applies predicates at the data source.
• PushDownScan: New API that applies complex operations, such as joins, at
the data source.
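To ground the list above, a minimal Scala sketch of a relation implementing the existing PrunedFilteredScan interface; the relation and its in-memory data are hypothetical stand-ins, but the trait and method signatures are Spark's:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

class ToyRelation(val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
  // Schema of the tuples this relation produces.
  override def schema: StructType =
    StructType(StructField("id", IntegerType) :: StructField("name", StringType) :: Nil)

  // Catalyst hands over the pruned column list and the push-down-able filters.
  // A real implementation would translate both into the source's query language;
  // this toy ignores the filters (Spark re-applies them on its side anyway).
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val data = Seq(Row(1, "a"), Row(2, "b"))  // stand-in for a remote read
    sqlContext.sparkContext.parallelize(data.map(prune(_, requiredColumns)))
  }

  private def prune(row: Row, cols: Array[String]): Row =
    Row.fromSeq(cols.map(c => row.get(schema.fieldIndex(c))).toSeq)
}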
PushDownScan APIs
• Trait for a BaseRelation
• Used with data sources that support complex functionality such as joins
• Extends PrunedFilteredScan with DataSourceCapabilities
• Methods that need to be overridden:
– def buildScan(columns: Array[String],
filters: Array[Filter],
tables: Seq[BaseRelation]):RDD[Row]
– def getDataSource(): String
• DataSourceCapabilities trait to model data source characteristics e.g. type of joins
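Since PushDownScan is the proposal of this talk rather than a shipped Spark interface, the following is only an illustrative Scala sketch: the trait restates the signatures from this slide, and the relation and its helper methods are hypothetical.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{BaseRelation, Filter}

// The proposed trait, restated from the slide (not part of Apache Spark).
trait PushDownScan {
  def buildScan(columns: Array[String], filters: Array[Filter],
                tables: Seq[BaseRelation]): RDD[Row]
  def getDataSource(): String
}

// Hypothetical JDBC-backed relation: compose one SQL statement that joins
// this relation with the other pushed-down relations, applies the filters,
// and projects the columns, then run it over a single JDBC connection.
abstract class JdbcJoinRelation extends BaseRelation with PushDownScan {
  override def buildScan(columns: Array[String], filters: Array[Filter],
                         tables: Seq[BaseRelation]): RDD[Row] =
    executeOverJdbc(composeJoinQuery(columns, filters, tables))

  // Identifier Catalyst could use to match relations from the same source.
  override def getDataSource(): String = "jdbc"

  protected def composeJoinQuery(columns: Array[String], filters: Array[Filter],
                                 tables: Seq[BaseRelation]): String  // hypothetical helper
  protected def executeOverJdbc(sql: String): RDD[Row]               // hypothetical helper
}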
Future work: Cost Model for Data Source APIs
• Transform Catalyst into a global optimizer
• Global optimizer generates an optimal execution plan across all data sources
• Determines where an operation should be evaluated based on:
1. The cost to execute the operation.
2. The cost to transfer data between Spark and the data sources
• Key factors that affect global optimization:
– Remote table statistics (e.g. number of rows, number of distinct values in each
column, etc.)
– Data source characteristics (e.g. CPU speed, I/O rate, network speed, etc.)
• Extend Data Source APIs with data source characteristics
• Retrieve/compute data source table statistics
• Integrate data source cost model into Catalyst
Thank You.
Ioana Delaney ursu@us.ibm.com
Jia Li jiali@us.ibm.com
Visit http://spark.tc
