Extending Spark for Qbeast's SQL Data Source
with Paola Pardo and Cesare Cugnasco
Barcelona Spark Meetup, 24th of October 2019
From research to industry
At first, it was Extract, Transform, Load (ETL)
Hybrid Transactional Analytical Processing
Then the Lambda architecture tried to reduce latency
Hybrid Transactional Analytical Processing
Figure: relative bandwidth of system components in the Titan supercomputer at the Oak Ridge Leadership Class Facility. Source: Bauer, Andrew C., et al. "In situ methods, infrastructures, and applications on high performance computing platforms."
Big Data HTAP: general design
● Fast consistent layer: consistent and transactional (to various degrees). Storage: memory, local storage.
● Cheap/throughput layer: weak consistency, high latency, immutable files. Storage: non-POSIX distributed file systems, object stores.
● Query execution: on-demand resources with decoupled storage/CPU. Temporary storage: local disk, object stores.
Data ingestion lands in the fast layer; periodical flushes move data to the cheap layer.
Examples: Google's Procella, Snowflake
Big Data HTAP: min-max pruning, zone maps, Bloom filters...
Diagram: a metadata server keeps, for each file in primary key partitions A and B, per-file metadata (min/max values, a Bloom filter, a range such as March, May or June), so queries can skip files whose ranges cannot match.
Image credit: Nemo Jantzen, Lucky Me, 2015. Photography, acrylic, and glass spheres on wooden canvas.
960 KB vs. 7 KB
Storage tiers by data priority (high to low): RAM, persistent memory, local disk, object storage, cold storage.
QDB: file layout
Original data is organised into an OutlookTree: metadata and a buffer in fast storage, the data itself in columnar format in slower storage.
Hybrid columnar row
Row data spread across Optane, local disk and S3. The columnar-to-row mapping is based on the fact that the random priority = DHT token.
Interactive Big Data Visualization
● Overview
○ Catalyst Optimizer
○ APIs
○ Spark-Cassandra
● Extensions
○ Sampling Pushdown
○ Multidimensional Filter Pushdown
● Future work
Outline
Overview
● Catalyst Optimizer
● DataSource APIs
○ Key Concepts
○ Examples
● Spark-Cassandra-Connector
○ CassandraSourceRelation
Catalyst Optimizer
User Query
SELECT sum(v)
FROM
  (SELECT t1.id, t1.value + 1 + 2 AS v
   FROM t1 JOIN t2
   WHERE t1.id == t2.id AND t2.id > 50)
● Expressions
○ New value computed on input values
● Attributes
○ Column of a data collection
○ Dataset, Data Operation
Unresolved Plan
SELECT sum(v)
FROM
  (SELECT t1.id, t1.value + 1 + 2 AS v
   FROM t1 JOIN t2
   WHERE t1.id == t2.id AND t2.id > 50)
Plan tree: AGG over PROJECT over FILTER over JOIN(UnresolvedRelation t1, UnresolvedRelation t2)
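Each stage of this pipeline can be inspected from a Spark shell. A minimal sketch, assuming t1 and t2 are registered as tables (the subquery alias "sub" is only there to satisfy the parser):

val q = spark.sql("""
  SELECT sum(v) FROM
    (SELECT t1.id, t1.value + 1 + 2 AS v
     FROM t1 JOIN t2
     WHERE t1.id == t2.id AND t2.id > 50) AS sub
""")
q.explain(extended = true) // prints the parsed, analyzed, optimized logical and physical plans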
Analysis
JOIN(UnresolvedRelation t1, UnresolvedRelation t2) is resolved, using catalog metadata, into JOIN(MyCustomRelation t1, MyCustomRelation t2)
● Tree
○ Abstraction of the user's program
○ Node objects
● Rules
○ Transform the tree
○ Logical optimization
○ Heuristics
Logical Plan
SELECT t1.value+1+2 AS v
Expression tree: ADD(ADD(t1.value, Literal(1)), Literal(2))
Optimized Logical Plan
ADD(ADD(t1.value, Literal(1)), Literal(2)) becomes ADD(t1.value, Literal(3))
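The same folding is visible through the DataFrame API; a quick sketch, again assuming a table t1 with an integer column value:

import org.apache.spark.sql.functions.col

// ConstantFolding rewrites ((value + 1) + 2) into (value + 3); the folded
// literal shows up in the optimized logical plan of the explain output.
spark.table("t1")
  .select((col("value") + 1 + 2).as("v"))
  .explain(true)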
Physical Planning
● Strategies
○ Set of transformations
○ E.g. selects the best Join execution (cost-based)
● Rule executor
○ Ensure requirements
○ Apply physical optimization
● Key part to integrate data sources
○ How to read/write from/to storage
○ Statistics
○ Physical planning
● Hadoop, Hive
● Presto and Cassandra connectors
DataSource API
DataSource API
trait RelationProvider {
  def createRelation(sqlContext: SQLContext,
                     parameters: Map[String, String]): BaseRelation
}

abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
  def unhandledFilters(filters: Array[Filter]): Array[Filter]
  def sizeInBytes: Long
  def needConversion: Boolean
}

trait TableScan {
  def buildScan(): RDD[Row]
}
org.apache.spark.sql.sources.interfaces
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Creates a relation with an undefined schema (null)
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    createRelation(sqlContext, parameters, null)
  }

  // Gets the schema of the table and produces a MyCustomRelation
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String],
                              schema: StructType): BaseRelation = {
    // "location" is an illustrative option name (the slide leaves it as a placeholder)
    new MyCustomRelation(parameters("location"), schema)(sqlContext)
  }
}
DataSource API
class MyCustomRelation(location: String,
                       userSchema: StructType)
                      (@transient val sqlContext: SQLContext)
  extends BaseRelation
  with Serializable {

  override def schema: StructType = {
    // implementation which returns a StructType
    // (or a sequence of StructFields)
  }
}
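BaseRelation alone only describes the data; mixing in TableScan (or one of its pruned/filtered variants) makes the relation scannable. A minimal sketch, with an empty RDD standing in for data that a real source would read from location:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

class MyScannableRelation(location: String, userSchema: StructType)
    (@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan with Serializable {

  override def schema: StructType = userSchema

  // Placeholder: a real implementation would read the files under `location`
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}

Spark then resolves the source from the format name, e.g. spark.read.format("com.example.mysource").option("location", "/some/path").load(), where the package and option names are illustrative.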
● Limited extension
● Lack of info about partitioning
● Lack of columnar and streaming support
DataSource API
trait LimitedScan {
  def buildScan(limit: Int): RDD[Row]
}

trait PrunedLimitedScan {
  def buildScan(requiredColumns: Array[String], limit: Int): RDD[Row]
}

trait PrunedFilteredLimitedScan {
  def buildScan(requiredColumns: Array[String],
                filters: Array[Filter], limit: Int): RDD[Row]
}
org.apache.spark.sql.sources.interfaces
● Written in Java, available since Spark 2.3
● ReadSupport or WriteSupport
● Own partitioner
● Mix in some Support interfaces
DataSource v2 API

DataSourceV2
  with ReadSupport
  with ReadSupportWithSchema
DataSourceReader
  with SupportsPushDownFilters
  with SupportsPushDownRequiredColumns
  ...
InputPartition
InputPartitionReader
DataSource v2 API
public interface ReadSupport extends DataSourceV2 {
  DataSourceReader createReader(DataSourceOptions options);
}

public interface DataSourceReader {
  StructType readSchema();
  List<InputPartition<Row>> planInputPartitions();
}

public interface SupportsPushDownRequiredColumns extends DataSourceReader {
  void pruneColumns(StructType requiredSchema);
}

public interface InputPartition<T> {
  InputPartitionReader<T> createPartitionReader();
}

public interface InputPartitionReader<T> extends Closeable {
  boolean next();
  T get();
}
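The reader side is a plain iterator over a partition. A minimal Scala sketch following the interfaces above, with the rows pre-materialised in memory purely for illustration:

import org.apache.spark.sql.sources.v2.reader.{InputPartition, InputPartitionReader}

// One input partition whose reader walks an in-memory sequence.
class SeqInputPartition[T](data: Seq[T]) extends InputPartition[T] {
  override def createPartitionReader(): InputPartitionReader[T] =
    new InputPartitionReader[T] {
      private val it = data.iterator
      private var current: Option[T] = None
      override def next(): Boolean = {
        val has = it.hasNext
        if (has) current = Some(it.next())
        has
      }
      override def get(): T = current.get
      override def close(): Unit = ()
    }
}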
● DataStax open source
● RDDs, DataFrames and CQL
Spark-Cassandra-Connector
CassandraSourceRelation
PrunedFilteredScan, InsertableRelation
BaseRelation:
● schema
● sizeInBytes
● unhandledFilters
private[cassandra] class CassandraSourceRelation(
    tableRef: TableRef,
    userSpecifiedSchema: Option[StructType],
    filterPushdown: Boolean,
    confirmTruncate: Boolean,
    tableSizeInBytes: Option[Long],
    connector: CassandraConnector,
    readConf: ReadConf,
    writeConf: WriteConf,
    sparkConf: SparkConf,
    override val sqlContext: SQLContext)
  extends BaseRelation
  with InsertableRelation
  with PrunedFilteredScan
  with Logging
org.apache.spark.sql.cassandra.CassandraConn...
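Reading through the connector instantiates this relation; a typical usage sketch:

// The short format name resolves to the connector's DefaultSource, which
// builds the CassandraSourceRelation shown above.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "keyspace", "table" -> "table"))
  .load()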
CassandraSourceRelation
PrunedFilteredScan:
● Column Pruning
○ Discard columns
● Filter Pushdown
○ Discard rows
● DataSource API
● Pushdown restrictions
○ Filtering only one column
○ No custom index support
Limitations
Extensions
● Scenario
● Sampling Pushdown
○ Sample Operator
○ Changes
● Multidimensional Filter Pushdown
○ Filter Pushdown
○ Changes
Scenario
A Qbeast indexed table and query examples:

CREATE TABLE keyspace.table (
  id double PRIMARY KEY,
  x double,
  y double,
  z double
);

CREATE CUSTOM INDEX IF NOT EXISTS table_idx
ON keyspace.table (x, y, z)

SELECT * FROM keyspace.table
WHERE x >= 0.1826763 AND x < 0.5555
  AND y >= 1.9 AND y < 2.863653
  AND z >= 0.1 AND z < 10.78645

SELECT * FROM keyspace.table
WHERE expr(table_idx, 'precision=0.1')
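The same range query expressed through the DataFrame API, which is the form the pushdown extensions below intercept:

import org.apache.spark.sql.functions.col

val hits = df.filter(
  col("x") >= 0.1826763 && col("x") < 0.5555 &&
  col("y") >= 1.9       && col("y") < 2.863653 &&
  col("z") >= 0.1       && col("z") < 10.78645)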
Scenario
The same table and queries, annotated with the extension each one exercises:
● the multidimensional range query on x, y and z → FILTER PUSHDOWN
● the expr(table_idx, 'precision=0.1') query → SAMPLING PUSHDOWN
Sample Operator on Spark
● Sample
○ lower/upper bound
○ with/without replacement
○ seed

SELECT * FROM keyspace.table TABLESAMPLE (5 ROWS)
SELECT * FROM keyspace.table TABLESAMPLE (10 PERCENT)
df.sample(...)
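In the DataFrame API the operator comes from sample, whose arguments map directly onto Catalyst's Sample(lowerBound, upperBound, withReplacement, seed, child) node:

// fraction becomes the (0.0, fraction) bound range of the Sample node.
val sampled = spark.table("keyspace.table")
  .sample(withReplacement = false, fraction = 0.10, seed = 42L)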
Sampling Pushdown
Catalyst Optimizer → DataSource API → CassandraSourceRelation
● CassandraSourceRelation: Filter Pushdown, Column Pruning... and Sampling with Qbeast?
● DataSource API: Filter Pushdown, Column Pruning... and Sampling Pushdown?
Sampling Pushdown
● New interfaces for the Scan
● New method to detect the sampling operator and the data source
DataSource API: SampledScan, SampledFilteredScan, PrunedSampledScan, PrunedSampledFilteredScan
@InterfaceStability.Stable
trait SampledScan {
  def buildScan(sample: Sample): RDD[Row]
}

@InterfaceStability.Stable
trait SampledFilteredScan {
  def buildScan(filters: Array[Filter], sample: Sample): RDD[Row]
}

@InterfaceStability.Stable
trait PrunedSampledScan {
  def buildScan(requiredColumns: Array[String], sample: Sample): RDD[Row]
}
Sampling Pushdown
@InterfaceStability.Stable
trait PrunedSampledFilteredScan {
  def pushSampling(sample: Sample): Boolean
  def buildScan(requiredColumns: Array[String],
                filters: Array[Filter],
                sample: Sample): RDD[Row]
}
org.apache.spark.sql.sources.interfaces
case s @ Sample(_, _, _, _, physical_op @ PhysicalOperation(p, f, l: LogicalRelation)) =>
  l.relation match {
    case scan: PrunedSampledFilteredScan if scan.pushSampling(s) =>
      pruneFilterProject(
        l, p, f,
        (a, f) => toCatalystRDD(l, a,
          scan.buildScan(a.map(_.name).toArray, f, s))) :: Nil
    case _ => Nil
  }
Sampling Pushdown
org.apache.spark.sql.execution.datasources.DataSourceStrategy
Sampling Pushdown
Processing the pushdown:
1. User-level option to push down sampling
2. Detection of the Sample operator
3. Analysis
4. Write the CQL expression to query the index
5. Let Qbeast handle it again!
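A sketch of step 1 from the user's side; the option key "sampling.pushdown" is a hypothetical name, not the connector's actual setting:

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "keyspace" -> "keyspace",
    "table"    -> "table",
    "sampling.pushdown" -> "true"))  // hypothetical switch for step 1
  .load()

df.sample(withReplacement = false, fraction = 0.1, seed = 0L) // detected and pushed to Qbeast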
Sampling Pushdown
private[cassandra] class CassandraSourceRelation(
    // ...other fields as before...
    sampling: Boolean,
    override val sqlContext: SQLContext)
  extends BaseRelation
  with InsertableRelation
  with PrunedFilteredScan
  with PrunedSampledFilteredScan
  with Logging

override def pushSampling(sample: Sample): Boolean = {
  // check if the table is indexed and the user wants to push down the operator
}

override def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter],
    sample: Sample): RDD[Row] = {
  // construct the index CQL code and push it through the scanning
}
org.apache.spark.sql.cassandra.CassandraConn...
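A sketch of what the first override might contain, where isQbeastIndexed is a hypothetical helper and the Sample fields are those introduced earlier:

// Hypothetical sketch: accept the pushdown only when the user enabled the
// option and the table carries a Qbeast index; in this sketch, sampling
// with replacement is left to Spark.
override def pushSampling(sample: Sample): Boolean =
  sampling && isQbeastIndexed(tableRef) && !sample.withReplacement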
Sampling Pushdown
SELECT * FROM keyspace.table TABLESAMPLE (5 PERCENT)
With the pushdown this becomes a simple lookup via Sample(0.0, 0.05, false, 983653) instead of a full table scan.
Filter Pushdown
Multidimensional Pruning
Catalyst Optimizer → DataSource API → CassandraSourceRelation
● CassandraSourceRelation: Filter Pushdown, Column Pruning, Sampling with Qbeast... and multidimensional pushdown?
● DataSource API: Filter Pushdown, Column Pruning, Sampling Pushdown
Multidimensional Pruning
Processing the pushdown:
1. Detect the index
2. Analyze the predicate
3. Push down the filters to Cassandra
4. Let Qbeast handle it!
private val qbeast = table.qbeastColumns.map(_.columnName)

/** Returns the set of predicates that contains double ranges
  * for the Qbeast index */
val qbeastPredicatesToPushdown: Set[Predicate] = {
  val doubleRange = rangePredicatesByName.filter(p =>
    p._2.exists(Predicates.isLessThanPredicate) &&
    p._2.exists(Predicates.isGreaterThanOrEqualPredicate))
  if (qbeast.toSet subsetOf doubleRange.keySet) {
    val eqQbeast = qbeast.flatMap(rangePredicatesByName)
    eqQbeast.toSet
  } else
    Set.empty
}
Multidimensional Pruning
val predicatesToPushDown: Set[Predicate] =
  partitionKeyPredicatesToPushDown ++
  clusteringColumnPredicatesToPushDown ++
  indexedColumnPredicatesToPushDown ++
  qbeastPredicatesToPushdown
org.apache.spark.sql.cassandra.BasicCassandraPredicateToPushdown
Multidimensional Pushdown
SELECT * FROM keyspace.table
WHERE x >= 0.1826763 AND x < 0.5555
  AND y >= 1.9 AND y < 2.863653
  AND z >= 0.1 AND z < 10.78645
With the pushdown: PrunedFilteredScan plus a residual FILTER(isNotNull).
Without: FILTER(x, y, z, isNotNull) over a full table scan.
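The effect can be checked with explain(): with the pushdown in place, the x/y/z predicates should no longer appear in Spark's Filter node, leaving only the residual IsNotNull checks:

hits.explain()
// with pushdown:    scan via PrunedFilteredScan, Spark-side filter reduced
//                   to the isNotNull checks
// without pushdown: full table scan with the x, y, z range filter evaluated
//                   in Spark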
Example (demo screenshots)
Future Work
● Dimensional Aware
● Join Strategy
● Storage
Dimensional Aware
● Useful for data locality strategies
● Physical planning
Join Strategy in Spark
● Shuffle-Hash-Join
● Broadcast-Join
● Sort-Merge-Join
Join on Qbeast
● Dimensional-aware data partitioning
● Speculative optimization on sampling
Integration with Arrow
● Save Qbeast data in Arrow
● Static column with file information
● Make analytics faster
● Spark support since 2.3
Future Work
● Dimensional Aware
● Join Strategy
● Storage
● DataSource V2

DataSourceV2
● New Java class: SupportsPushDownSampling
● New method to detect the sampling operator and the data source
package org.apache.spark.sql.sources.v2.reader;

@InterfaceStability.Evolving
public interface SupportsPushDownSampling extends DataSourceReader {
  boolean pushSampling(Sample sample);
}
DataSourceV2
case s @ Sample(_, _, _, _, l @ PhysicalOperation(p, f, e: DataSourceV2Relation)) =>
  // implementation of pruning and filter pushdown
  ProjectExec(p, withFilter) :: Nil
case _ => Nil