Slides of the Barcelona Spark meetup of the 24th of October 2019. The recording is available at https://www.youtube.com/watch?v=eCoCcBH4hIU.
Abstract
One of the key strengths of Spark is its flexibility: it integrates with dozens of different storage systems and file formats. However, reading from a CSV file is not the same as reading from a SQL database or from an exotic stratified-sampled multidimensional database. And finding the right balance between modularity and flexibility is not easy!
In this presentation, we will talk about the evolution of Spark's DataSource API and how it integrates with the SQL optimizer, highlighting how we can run much faster queries with logical and physical plans that better integrate with the storage. Moving from theory to practice, we will then discuss how we extended Spark's internals and built a new source integration that allows the push-down of both sampling and multidimensional filtering.
About the speakers:
Paola Pardo is a Computer Engineer from Barcelona. She graduated in Computer Engineering this past summer at the Technical University of Catalonia with a thesis focused on data-storage push-down optimization based on Apache Spark. She is currently working at the Barcelona Supercomputing Center and at its spin-off Qbeast, developing a Qbeast-Spark connector.
Cesare Cugnasco holds a PhD in Computer Architecture and is a researcher at the Barcelona Supercomputing Center. His research focuses on NoSQL databases, distributed computing and high-performance storage. He invented and patented a new database architecture for Big Data, and he is building a spin-off for its commercialization.
3. At first it was Extract, Transform, Load (ETL)
Hybrid Transactional Analytical Processing
4. Then the Lambda architecture tried to reduce latency
5. Figure: a plot of the relative bandwidth of system components in the Titan supercomputer at the Oak Ridge Leadership Computing Facility. Source: Bauer, Andrew C., et al., "In situ methods, infrastructures, and applications on high performance computing platforms."
6. Big Data HTAP: general design
● Fast consistent layer: consistent and transactional (to various degrees); storage: memory, local storage
● Cheap/throughput layer: weak consistency, high latency, immutable files; storage: non-POSIX distributed file systems, object stores
● Query execution: on-demand resources with decoupled storage/CPU; temporary storage: local disk, object stores
● Data ingestion lands in the fast consistent layer; periodical flushes move the data to the cheap/throughput layer
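As a rough sketch of this design (hypothetical types, not actual Qbeast or Spark code), a read could merge the two layers by preferring the freshest version of each key:

case class Record(key: String, value: String, version: Long)

// fastLayer: the consistent memory/local-storage tier
// cheapLayer: records already flushed to immutable files
def read(key: String,
         fastLayer: Map[String, Record],
         cheapLayer: Seq[Record]): Option[Record] = {
  val fresh   = fastLayer.get(key).toSeq
  val flushed = cheapLayer.filter(_.key == key)
  (fresh ++ flushed).sortBy(-_.version).headOption
}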
8. Big Data HTAP: min-max pruning, zone maps, bloom filters..
A metadata server keeps, for each primary-key partition (A, B, ...), per-partition metadata: min/max values, a Bloom filter, and a range (e.g. March-June, May-June), so queries can skip partitions that cannot match.
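To make the pruning concrete, here is a minimal sketch (hypothetical types, not the actual metadata-server API) of min/max pruning for a one-dimensional range predicate:

case class PartitionMeta(id: String, min: Double, max: Double)

// keep only partitions whose [min, max] can overlap [lo, hi)
def prune(parts: Seq[PartitionMeta], lo: Double, hi: Double): Seq[PartitionMeta] =
  parts.filter(p => p.max >= lo && p.min < hi)

// a query WHERE x >= 10 AND x < 20 skips partition "A" entirely
val toScan = prune(
  Seq(PartitionMeta("A", 0.0, 9.5), PartitionMeta("B", 8.0, 15.0)),
  lo = 10.0, hi = 20.0) // -> Seq(PartitionMeta("B", 8.0, 15.0))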
25. User Query
SELECT sum(v)
FROM (
  SELECT t1.id, t1.value + 1 + 2 AS v
  FROM t1 JOIN t2
  WHERE t1.id = t2.id AND t2.id > 50
)
● Expressions
○ New value computed on input values
● Attributes
○ Column of a dataset (e.g. t1.id) or generated by a data operation (e.g. v)
28. ● Tree
○ Abstraction of a user's program
○ Node objects
● Rules
○ Transform the tree
○ Logical optimization
○ Heuristics
Logical Plan
SELECT t1.value+1+2 AS v
Add
├─ Attribute(t1.value)
└─ Add
   ├─ Literal(1)
   └─ Literal(2)
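A rule in this spirit can be written with Catalyst's tree transformations. Here is a minimal sketch (assuming Spark 2.4's Catalyst API; FoldIntegerAdditions is our own toy name, mirroring the built-in ConstantFolding rule) that rewrites Add(Literal(1), Literal(2)) into Literal(3):

import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object FoldIntegerAdditions extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan transformAllExpressions {
      // pattern-match on the expression tree: two integer literals under Add
      case Add(Literal(a: Int, _), Literal(b: Int, _)) => Literal(a + b)
    }
}

// it can be injected into the optimizer with:
// spark.experimental.extraOptimizations ++= Seq(FoldIntegerAdditions)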
33. ● Key part to integrate data sources
○ How to read/write from/to storage
○ Statistics
○ Physical planning
● Hadoop, Hive
● Presto and Cassandra connectors
DataSource API
35. DataSource API
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // creates a relation with an undefined schema (null):
  // the relation itself will provide it
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    createRelation(sqlContext, parameters, null)
  }

  // gets the schema of the table and produces a MyCustomRelation
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String],
                              schema: StructType): BaseRelation = {
    // the slide elided the location argument; a "path" option is a plausible choice
    new MyCustomRelation(parameters("path"), schema)(sqlContext)
  }
}

class MyCustomRelation(location: String,
                       userSchema: StructType)
                      (@transient val sqlContext: SQLContext)
  extends BaseRelation with Serializable {

  override def schema: StructType = {
    // implementation that returns a StructType
    // (or builds one from a sequence of StructFields)
    ???
  }
}
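As a usage sketch (the package name com.example.datasource and the path are hypothetical), Spark resolves the format name to the package's DefaultSource and passes the options map to createRelation as parameters:

val df = spark.read
  .format("com.example.datasource")  // package containing DefaultSource
  .option("path", "/tmp/data")       // surfaced above as parameters("path")
  .load()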
36. ● Limited extensibility: every new operation (e.g. limit pushdown) would need a new trait for each combination of the existing ones
● Lack of info about partitioning
● Lack of columnar and streaming support
DataSource API

trait LimitedScan {
  def buildScan(limit: Int): RDD[Row]
}

trait PrunedLimitedScan {
  def buildScan(requiredColumns: Array[String], limit: Int): RDD[Row]
}

trait PrunedFilteredLimitedScan {
  def buildScan(requiredColumns: Array[String],
                filters: Array[Filter], limit: Int): RDD[Row]
}

org.apache.spark.sql.sources.interfaces
37. ● Written in Java, since Spark 2.3
● ReadSupport or WriteSupport
● Own partitioner
● Mix-in some Support interfaces
DataSourceV2 API

DataSourceV2
  with ReadSupport
  with ReadSupportWithSchema

DataSourceReader
  with SupportsPushDownFilters
  with SupportsPushDownRequiredColumns
  ....

InputPartition
InputPartitionReader
38. DataSourceV2 API
public interface ReadSupport extends DataSourceV2 {
  DataSourceReader createReader(DataSourceOptions options);
}

public interface DataSourceReader {
  StructType readSchema();
  List<InputPartition<Row>> planInputPartitions();
}

public interface SupportsPushDownRequiredColumns extends DataSourceReader {
  void pruneColumns(StructType requiredSchema);
}

public interface InputPartition<T> {
  InputPartitionReader<T> createPartitionReader();
}

public interface InputPartitionReader<T> extends Closeable {
  boolean next();
  T get();
}
44. Scenario
A Qbeast indexed table and query examples:

CREATE TABLE keyspace.table (
  id double PRIMARY KEY,
  x double,
  y double,
  z double
);

CREATE CUSTOM INDEX IF NOT EXISTS table_idx
ON keyspace.table (x, y, z);

SELECT * FROM keyspace.table
WHERE x >= 0.1826763 AND x < 0.5555
  AND y >= 1.9 AND y < 2.863653
  AND z >= 0.1 AND z < 10.78645;

SELECT * FROM keyspace.table
WHERE expr(table_idx, 'precision=0.1');
45. Scenario (same as slide 44, annotated): the multidimensional range query is served by FILTER PUSHDOWN, while the expr(table_idx, 'precision=0.1') query is served by SAMPLING PUSHDOWN.
46. Sample Operator in Spark
● Sample
○ lower/upper bound
○ with/without replacement
○ seed

SELECT * FROM keyspace.table TABLESAMPLE (5 ROWS)
SELECT * FROM keyspace.table TABLESAMPLE (10 PERCENT)
df.sample(...)
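As a usage sketch (assuming a SparkSession named spark), the DataFrame equivalent of TABLESAMPLE (10 PERCENT); sample() builds the Sample logical operator with lowerBound = 0.0 and upperBound = fraction:

// seed fixed for reproducibility; any Long works
val sampled = spark.table("keyspace.table")
  .sample(withReplacement = false, fraction = 0.10, seed = 42L)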
50. Sampling Pushdown

case s @ Sample(_, _, _, _, PhysicalOperation(p, f, l: LogicalRelation)) =>
  l.relation match {
    case scan: PrunedFilteredSampledScan if scan.pushSampling(s) =>
      pruneFilterProject(
        l,
        p,
        f,
        (a, f) => toCatalystRDD(l, a,
          scan.buildScan(a.map(_.name).toArray, f, s))) :: Nil
    case _ => Nil
  }

org.apache.spark.sql.execution.datasources.DataSourceStrategy
51. Sampling Pushdown
Processing the pushdown:
1. User-level option to push down sampling
2. Detection of the Sample operator
3. Analysis
4. Write the CQL expression to query the index
5. Let Qbeast handle it again!
52. Sampling Pushdown

private[cassandra] class CassandraSourceRelation(
    // other stuff
    sampling: Boolean,
    override val sqlContext: SQLContext)
  extends BaseRelation
  with InsertableRelation
  with PrunedFilteredScan
  with PrunedFilteredSampledScan
  with Logging {

  override def pushSampling(sample: Sample): Boolean = {
    // check if the table is indexed and the user wants to
    // push down the operator
    ???
  }

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter],
                         sample: Sample): RDD[Row] = {
    // construct the index CQL code and push it through the scanning
    ???
  }
}

org.apache.spark.sql.cassandra.CassandraConn...
53. Sampling Pushdown
SELECT * FROM keyspace.table TABLESAMPLE (5 PERCENT)
With pushdown: a simple lookup on the index.
Without pushdown: Sample(0.0, 0.05, false, 983653) over a full table scan.
56. Multidimensional Pruning
Processing the pushdown:
1. Detect the index
2. Analyze the predicate
3. Push down the filters to Cassandra
4. Let Qbeast handle it!
57. Multidimensional Pruning

private val qbeast = table.qbeastColumns.map(_.columnName)

/** Returns the set of predicates that contains double ranges
  * for the qBeast index */
val qbeastPredicatesToPushdown: Set[Predicate] = {
  val doubleRange = rangePredicatesByName.filter(p =>
    p._2.exists(Predicates.isLessThanPredicate) &&
    p._2.exists(Predicates.isGreaterThanOrEqualPredicate))
  if (qbeast.toSet subsetOf doubleRange.keySet) {
    val eqQbeast = qbeast.flatMap(rangePredicatesByName)
    eqQbeast.toSet
  } else {
    Set.empty
  }
}

val predicatesToPushDown: Set[Predicate] =
  partitionKeyPredicatesToPushDown ++
  clusteringColumnPredicatesToPushDown ++
  indexedColumnPredicatesToPushDown ++
  qbeastPredicatesToPushdown

org.apache.spark.sql.cassandra.BasicCassandraPredicateToPushdown
58. Multidimensional Pushdown
SELECT * FROM keyspace.table
WHERE x >= 0.1826763 AND x < 0.5555
  AND y >= 1.9 AND y < 2.863653
  AND z >= 0.1 AND z < 10.78645
With pushdown: FILTER(isNotNull) over a PrunedFilteredScan (the range predicates are evaluated by the index).
Without pushdown: FILTER(x, y, z, isNotNull) over a full table scan.