2. ● Madhukara Phatak
● Director of Engineering, Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
3. Agenda
● Introduction to Data Source V2
● Shortcomings of Datasource Write API
● Anatomy of Datasource V2 Write API
● Per Partition Transaction
● Source Level Transaction
● Partition Affinity
6. Data source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC, Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Introduced in Spark 1.3 along with DataFrame
● Third parties already supporting
○ CSV
○ MongoDB
○ Cassandra, etc.
7. Shortcomings of V1 API
● Introduced in 1.3 but has not evolved compared to other parts of Spark
● Dependency on high-level APIs like DataFrame, SparkContext, etc.
● Lack of support for columnar reads
● Lack of partition awareness
● No transaction support in the write API
● Lack of extensibility
9. V2 API
● Datasource V2 is a new API introduced in Spark 2.3 to address the shortcomings of the V1 API
● The V2 API mimics the simplicity of the Hadoop input/output layers while keeping all the powerful features of V1
● Currently it's in beta; it will become GA in a future release
● The V1 API will then be deprecated
● No user-facing code change is needed to use V2 data sources
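For example, assuming an existing SparkSession named spark, a V2 source is used through the same format-based API as V1; the package name "com.example.mysql" below is purely illustrative:

// user-facing API is unchanged; only the format string points at a V2 source
val df = spark.read.format("json").load("input.json")
df.write.format("com.example.mysql").save()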
11. No Transaction Support in Write
● The V1 API only supported a generic write interface, primarily meant for write-once sources like HDFS
● The interface had no transactional support, which is needed for sophisticated sources like databases
● For example, when data is partially written to a database and the job aborts, those rows are not cleaned up
● This is not an issue in HDFS, because a _SUCCESS marker file is written only on successful completion, so partial output can be detected
14. WriteSupport Interface
● Entry point to the data source
● Has one method:
def createWriter(jobId: String, schema: StructType, mode: SaveMode, options: DataSourceOptions): Optional[DataSourceWriter]
● SaveMode and schema are the same as in the V1 API
● Returns Optional so that read-only sources can return empty
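A minimal sketch of implementing this entry point, based on the Spark 2.3 interfaces; the class names and the "url" option are illustrative:

import java.util.Optional

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
import org.apache.spark.sql.types.StructType

class SimpleMysqlDataSource extends DataSourceV2 with WriteSupport {
  override def createWriter(
      jobId: String,
      schema: StructType,
      mode: SaveMode,
      options: DataSourceOptions): Optional[DataSourceWriter] = {
    // a read-only source would return Optional.empty() here
    val url = options.get("url").orElse("jdbc:mysql://localhost/test")
    Optional.of(new MysqlDataSourceWriter(url))
  }
}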
15. DataSourceWriter Interface
● Entry point to the writer
● Has three methods:
○ def createWriterFactory(): DataWriterFactory[Row]
○ def commit(messages: Array[WriterCommitMessage])
○ def abort(messages: Array[WriterCommitMessage])
● Responsible for creating the writer factory
● WriterCommitMessage is the interface used for communication between the writers and the driver
● Transactional support is visible throughout the API
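Continuing the sketch above, a hypothetical driver-side writer might look like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}

class MysqlDataSourceWriter(url: String) extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[Row] =
    new MysqlDataWriterFactory(url)

  // runs on the driver once every partition has committed
  override def commit(messages: Array[WriterCommitMessage]): Unit = {
    // e.g. make the output visible atomically or log the per-partition messages
  }

  // runs on the driver when the job fails; undo whatever partitions wrote
  override def abort(messages: Array[WriterCommitMessage]): Unit = {
    // e.g. delete the rows recorded in the commit messages
  }
}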
16. DataWriterFactory Interface
● Follows the Java factory design pattern to create the actual data writers
● This is the code that creates writers which uniquely identify the different partitions
● It has one method to create a data writer:
def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row]
● attemptNumber is used when tasks are retried
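A sketch of the factory for the hypothetical Mysql source; note that the factory is serialized and shipped to the executors, so it should only hold serializable state:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory}

class MysqlDataWriterFactory(url: String) extends DataWriterFactory[Row] {
  // one writer per (partition, attempt) pair
  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
    new MysqlDataWriter(url, partitionId, attemptNumber)
}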
17. DataWriter Interface
● Interface responsible for the actual writing of data
● Runs on the worker nodes
● Methods exposed:
○ def write(record: Row)
○ def commit(): WriterCommitMessage
○ def abort()
● Looks very similar to the Hadoop write interfaces
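A toy writer illustrating the lifecycle; it buffers rows and only makes them visible at commit (PartitionResult is an illustrative commit message, not part of the API):

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

case class PartitionResult(partitionId: Int, rowCount: Long) extends WriterCommitMessage

class ConsoleDataWriter(partitionId: Int) extends DataWriter[Row] {
  private val buffer = new ArrayBuffer[Row]()

  override def write(record: Row): Unit = buffer += record

  override def commit(): WriterCommitMessage = {
    buffer.foreach(println)        // the actual "write" happens only at commit
    PartitionResult(partitionId, buffer.size.toLong)
  }

  override def abort(): Unit = buffer.clear() // nothing was made visible, just drop
}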
18. Observations from API
● The API doesn't use any high-level APIs like SparkContext, DataFrame, etc.
● Transaction support throughout the API
● The write interface is quite simple, so it can be used for a wide variety of sources
● No more fiddling with RDDs in the data source layer
20. Mysql Source
● The Mysql source is responsible for writing data using the JDBC API
● Implements all the interfaces discussed earlier
● Has a single partition
● Shows how all the different APIs come together to build a full-fledged source
● Ex: SimpleMysqlWriter.scala
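A hedged sketch of what such a JDBC-backed writer might look like; the table user(name varchar) and the single string column are assumptions for illustration:

import java.sql.DriverManager

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

class MysqlDataWriter(url: String, partitionId: Int, attemptNumber: Int)
    extends DataWriter[Row] {

  private val connection = DriverManager.getConnection(url)
  private val statement  = connection.prepareStatement("insert into user values (?)")

  override def write(record: Row): Unit = {
    statement.setString(1, record.getString(0)) // assumes one string column
    statement.executeUpdate()                   // each row is visible immediately
  }

  override def commit(): WriterCommitMessage = {
    connection.close()
    new WriterCommitMessage {}
  }

  // without a transaction, rows already inserted stay behind on abort
  override def abort(): Unit = connection.close()
}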
22. Distributed Writes
● Distributed writing is hard
● There are many reasons a write can fail
○ The connection is dropped
○ Error while writing the data for a partition
○ Duplicate data because of task retries
● Many of these issues crop up very frequently in Spark applications
● Ex: MysqlTransactionExample.scala
23. Transactional Support
● The Datasource V2 API has good support for transactions
● Transactions can be implemented at
○ Partition level
○ Job level
● This transaction support helps handle errors caused by partially written data, as sketched below
● Ex: MysqlWithTransaction
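A sketch of partition-level transactions, reusing the assumed user table from before: the writer opens one JDBC transaction per partition, commits it in commit() and rolls it back in abort():

import java.sql.DriverManager

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

class MysqlTransactionalDataWriter(url: String) extends DataWriter[Row] {

  private val connection = DriverManager.getConnection(url)
  connection.setAutoCommit(false) // open a transaction for this partition

  private val statement = connection.prepareStatement("insert into user values (?)")

  override def write(record: Row): Unit = {
    statement.setString(1, record.getString(0))
    statement.executeUpdate()     // held inside the open transaction
  }

  override def commit(): WriterCommitMessage = {
    connection.commit()           // all rows become visible atomically
    connection.close()
    new WriterCommitMessage {}
  }

  override def abort(): Unit = {
    connection.rollback()         // partial writes for this partition disappear
    connection.close()
  }
}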
25. Partition Locations
● Many data sources today provide native support for partitioning
● These partitions can be distributed over a cluster of machines
● Making Spark aware of the partitioning scheme makes reading much faster
● Works best for co-located data sources
26. Preferred Locations
● DataReaderFactory exposes the preferredLocations API to send the partitioning information to Spark
● This API returns the host names of the machines where the partition data is available
● Spark uses this only as a hint; it may not honor it
● If we return a hostname that Spark cannot recognise, it is simply ignored
● Spark stores this information in the RDD it builds over the source
● Ex: SimpleDataSourceWithPartitionAffinity.scala
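A sketch of a reader factory advertising partition affinity; the reader class is hypothetical and not shown:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

class PartitionAwareReaderFactory(hosts: Array[String]) extends DataReaderFactory[Row] {

  // scheduling hint: prefer running this partition's task on these hosts;
  // hostnames Spark does not know about are silently ignored
  override def preferredLocations(): Array[String] = hosts

  override def createDataReader(): DataReader[Row] =
    new SimpleDataReader() // hypothetical reader implementation
}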