1. Introduction to Data Source V2 API
Next Generation Datasource API for Spark 2.0
https://github.com/phatak-dev/spark2.0-examples
2. ● Madhukara Phatak
● Technical Lead at Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
3. Agenda
● Structured Data Processing
● Shortcomings of Datasource API
● Goals of Datasource V2
● Anatomy of Datasource V2
● In-Memory Datasource
● MySQL Datasource
6. Spark SQL Components
● Data source API
Universal API for loading/saving structured data
● DataFrame API
Higher-level representation for structured data
● SQL interpreter and optimizer
Express data transformations in SQL
● SQL service
Hive thrift server
7. Data source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC, Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Introduced in Spark 1.3 along with DataFrame
● Third parties already supporting
○ CSV
○ MongoDB
○ Cassandra (in the works)
etc.
8. Datasource V1 API
● Introduced in Spark 1.3
● Spark has evolved quite a bit after 1.3
○ Custom memory management
○ Dataset abstraction
○ Structured Streaming
● The Datasource API has not evolved along with new versions of Spark
9. Difficulty in evolving the DataSource API
● The data source is the lowest-level abstraction in structured processing; it talks to the data sources directly
● Data sources are often written by third-party vendors to connect to different sources
● They are not updated as often as Spark's own code, which makes changing the API challenging
● So the data source API remained the same while the rest of Spark changed quite a bit
11. Dependency on High Level API
● Data sources are the lowest-level abstraction in the stack
● The Data Source V1 API depended on high-level, user-facing abstractions like SQLContext and DataFrame (sketch below)
def createRelation(sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation
● As Spark evolved, these abstractions got deprecated and replaced by better ones
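A minimal sketch of a V1 entry point makes the coupling concrete; the class name is hypothetical, while the signature is the V1 RelationProvider API:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical V1 source: the lowest-level entry point is handed a
// SQLContext, a high-level user-facing abstraction
class LegacySource extends RelationProvider {
  override def createRelation(ctx: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      override val schema: StructType =
        StructType(Array(StructField("value", StringType)))
    }
}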
12. Dependency on High Level API
● But as the data source API was not changed, it remained stuck with the old abstractions
● Having low-level abstractions depend upon high-level abstractions is not a good idea
13. Lack of Support for Columnar Read
def buildScan(): RDD[Row]
● From the above API, it's apparent that it reads the data in row format
● But many analytics sources today are columnar in nature
● So if the underlying columnar source is read into row format, it loses all the performance benefits
14. Lack of Partition and Sorting Info
● Many data sources distribute the data over multiple machines using partitioning
● The Datasource V1 API had no way to share this partition locality information with the Spark engine
● This resulted in random reads from the Spark engine, with a lot of network traffic
● Spark's built-in sources solved this issue using internal APIs
● But all third-party connectors suffered from it
15. No Transaction Support in Write
● The V1 API only supported a generic write interface, primarily meant for write-once sources like HDFS
● The interface did not have any transactional support, which is needed for sophisticated sources like databases (see the V2 sketch below for contrast)
● For example, when data is written partially to a database and the job aborts, the rows already written are not cleaned up
● It's not an issue in HDFS because it tracks successful writes using the _SUCCESS marker file
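For contrast, a minimal sketch of the transactional hooks the V2 write path (Spark 2.3) introduces; the class name is hypothetical and the JDBC staging details are assumed, not shown:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}

// Hypothetical transactional writer: Spark calls commit() only when all
// partitions succeed, and abort() on failure, so partial rows can be cleaned up
class TransactionalDatabaseWriter extends DataSourceWriter {
  // Would create per-partition writers that stage rows (e.g. via JDBC)
  override def createWriterFactory(): DataWriterFactory[Row] = ???

  // All partitions succeeded: make the staged rows visible
  override def commit(messages: Array[WriterCommitMessage]): Unit = ()

  // The job failed: roll back the partially written rows
  override def abort(messages: Array[WriterCommitMessage]): Unit = ()
}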
16. Limited Extendability
● The Data Source V1 API only supports filter pushdown and column pruning
● But many smart sources, i.e. data sources with their own processing power, offer more capabilities than that
● These sources can do sorting and aggregation at the source level itself
● Currently the data source API doesn't have a good mechanism to push more Catalyst expressions down to the underlying source
18. V2 API
● Datasource V2 is a new API introduced in Spark 2.3 to address the shortcomings of the V1 API
● The V2 API mimics the simplicity of the Hadoop input/output layers while still keeping all the powerful features of V1
● Currently it's in beta; it will become GA in a future release
● The V1 API will be deprecated
● No user-facing code changes are needed to use V2 data sources (example below)
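To illustrate the last point, reading from a V2 source uses the familiar reader API; "com.example.SimpleDataSource" below is a placeholder for any V2 implementation class:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("v2-read-example")
  .master("local[*]")
  .getOrCreate()

// Same spark.read.format(...).load() call as with V1 sources
val df = spark.read
  .format("com.example.SimpleDataSource")
  .load()
df.show()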
19. Goals
● No dependency on high-level abstractions like SparkContext, DataFrame, etc.
● Java friendly
● Support for filter pushdown and pruning
● Support for partition locality
● Support for both columnar and row-based reading
● Transaction support in the write interface
● Get rid of internal APIs like HadoopFsRelation
21. Java API
● Being Java friendly is one of the goals of the V2 API
● So all the base interfaces of the V2 API are defined as Java interfaces
● This makes interoperability with Java much easier compared to the V1 API
● It is a little painful in Scala to deal with the Java models
● The next slides cover all the different interfaces in the read path
23. DataSourceReader Interface
● Entry point to the data source
● Obtained from the ReadSupport mix-in's createReader method
● Has two methods
○ def readSchema(): StructType
○ def createDataReaderFactories: List[DataReaderFactory]
● Responsible for schema inference and creating data reader factories (sketch below)
● What might the list indicate here? (Hint: one factory per partition)
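A minimal sketch of the interface as it looks in Spark 2.3; class names are hypothetical, and the real example is SimpleDataSource.scala in the linked repo:

import java.util.{Arrays => JArrays, List => JList}
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical reader: infers a fixed one-column schema and returns a
// single factory, i.e. a single partition, for the resulting DataFrame.
// SimpleDataReaderFactory is sketched on the next slide.
class SimpleReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Array(StructField("value", StringType)))

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    JArrays.asList[DataReaderFactory[Row]](new SimpleDataReaderFactory())
}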
24. DataReaderFactory Interface
● Follows Java's factory design pattern to create the actual data reader
● Factories are created on the master and serialized to the workers, so they can carry common setup, like JDBC connection properties, across the data readers
● It has one method to create a data reader (sketch below)
def createDataReader: DataReader
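A matching factory sketch (names hypothetical); the factory instance is what Spark serializes and ships to the executors:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

// Hypothetical factory: constructed on the driver, serialized to an
// executor, where createDataReader() builds the actual reader
class SimpleDataReaderFactory extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new SimpleDataReader()
}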
25. DataReader Interface
● Interface responsible for the actual reading of data
● Runs on worker nodes, created there by the (serializable) factory
● Methods exposed are
def next : Boolean
def get : T
● Looks very similar to the Iterator interface (sketch below)
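A sketch of the reader side, following the next/get contract over an assumed in-memory array:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.DataReader

// Hypothetical reader over sample in-memory values; next() advances the
// cursor and get() returns the current row, just like an iterator
class SimpleDataReader extends DataReader[Row] {
  private val values = Array("one", "two", "three") // assumed sample data
  private var index = -1

  override def next(): Boolean = { index += 1; index < values.length }
  override def get(): Row = Row(values(index))
  override def close(): Unit = () // nothing to release for in-memory data
}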
26. Observations from the API
● The API doesn't use any high-level APIs like SparkContext, DataFrame, etc.
● The return type of DataReader is T, which suggests that it can support both row and columnar reads
● The reader interface is quite simple and can be used for a wide variety of sources
● No more fiddling with RDDs in the data source layer
28. Simple In-Memory Data Source
● An in-memory source which reads the data from an in-memory array
● Implements all the interfaces discussed earlier
● Has a single partition
● Shows how all the different APIs come together to build a full-fledged source (entry point sketched below)
● Ex: SimpleDataSource.scala
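The one piece not shown in the earlier sketches is the entry-point class; a minimal version, modeled on the repo's SimpleDataSource.scala with hypothetical names, looks like this:

import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.DataSourceReader

// Hypothetical entry point: Spark instantiates this class (named via
// spark.read.format(...)) and asks it for a DataSourceReader;
// SimpleReader is the reader sketched on slide 23
class SimpleDataSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new SimpleReader()
}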
29. Multiple Partitions
● In this example, we extend our simple in-memory data source to have multiple partitions
● In the code, we will have multiple data reader factories, compared to the single one in the earlier example
● In the data reader code, we track the partition using its start and end (sketch below)
● Mimics the HDFS InputFormat
● Ex: SimpleMultiDataSource.scala
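A sketch of the multi-partition variant: each factory covers one [start, end) range, so the resulting DataFrame has one partition per factory (class names illustrative; see SimpleMultiDataSource.scala for the real code):

import java.util.{Arrays => JArrays, List => JList}
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Two factories => two partitions, mimicking HDFS input splits
class SimpleMultiReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Array(StructField("value", IntegerType)))

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    JArrays.asList[DataReaderFactory[Row]](
      new RangeReaderFactory(0, 5), new RangeReaderFactory(5, 10))
}

// Each factory carries the start/end of its partition to an executor
class RangeReaderFactory(start: Int, end: Int) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new RangeReader(start, end)
}

class RangeReader(start: Int, end: Int) extends DataReader[Row] {
  private var current = start - 1
  override def next(): Boolean = { current += 1; current < end }
  override def get(): Row = Row(current)
  override def close(): Unit = ()
}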
31. Filter Pushdown with MySQL Datasource
● Filter pushdown allows Spark to push SQL filters down to the data source
● Smart sources, like relational databases, can evaluate filters efficiently at the source itself
● In this example, filter pushdown is implemented for a MySQL source (sketch below)
● This source uses the JDBC interface to communicate with MySQL
● Ex: SimpleMysqlDataSource.scala
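A minimal sketch of the pushdown hook, using Spark 2.3's SupportsPushDownFilters mix-in; only equality filters are pushed here, and the MySQL/JDBC plumbing is elided (see SimpleMysqlDataSource.scala for the full version):

import java.util.{Collections, List => JList}
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical MySQL-backed reader with filter pushdown
class MysqlReader extends DataSourceReader with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // Keep the filters this source can evaluate (here: only EqualTo) and
  // return the rest, which Spark will re-apply after the scan
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition(_.isInstanceOf[EqualTo])
    pushed = supported
    unsupported
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType =
    StructType(Array(StructField("user", StringType))) // assumed column

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
    // A real implementation would turn `pushed` into a SQL WHERE clause
    // and hand it to a JDBC-backed factory; elided in this sketch
    Collections.emptyList[DataReaderFactory[Row]]()
  }
}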