Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries that connect Spark to data stored in various sources. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples that demonstrate how to build a data source library.
Data Source API in Spark
1. Data Source API in Spark
Yin Huai
3/25/2015 - Bay Area Spark Meetup
2. About Me
Spark SQL developer @databricks
One of the main developers of Data Source API
Used to work on Hive a lot (Hive Committer)
3. Spark: A Unified Platform
[Diagram: Spark Core Engine at the base, with Spark SQL (DataFrame), Spark Streaming (streaming), MLlib (machine learning), GraphX (graph computation), and SparkR (R on Spark) layered on top; some components are marked Alpha/Pre-alpha]
4. DataFrames in Spark
Distributed collection of data grouped into named columns
(i.e. RDD with schema)
Domain-specific functions designed for common tasks
• Metadata
• Sampling
• Relational data processing: project, filter, aggregation, join, ...
• UDFs
Available in Python, Scala, Java, and R (via SparkR)
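A hedged sketch of what these operations look like in the Scala DataFrame API (Spark 1.3). The path and the columns (name, age, city) are hypothetical and only for illustration:

import org.apache.spark.sql.functions.udf

val df = sqlContext.load("/data/people.json", "json")  // hypothetical dataset
df.printSchema()                                        // metadata: inspect the schema
df.sample(false, 0.1)                                   // sampling: 10% sample without replacement
df.select("name", "city").filter("age > 21")            // relational: project and filter
df.groupBy("city").count()                              // relational: aggregation
val strLen = udf((s: String) => s.length)               // UDF
df.select(strLen(df("name")))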
6. Datasets Stored in Various Formats/Systems
[Diagram: the same Spark stack (Spark Core Engine, Spark SQL/DataFrame, Spark Streaming, MLlib, GraphX, SparkR), surrounded by datasets stored in various formats/systems: { JSON }, JDBC, and more…]
7. Loading and Saving Data is Not Easy
Convert/parse raw data
• e.g. parse text records, parse JSON records, deserialize data stored in binary
Data format transformation
• e.g. convert your Java objects to Avro records/JSON records/Parquet records/HBase rows/…
Applications often end up with inflexible input/output logic
8. Data Sources API
[Diagram: the same Spark stack, now with the Data Source API layer sitting between Spark SQL/DataFrame and the external formats/systems ({ JSON }, JDBC, and more…)]
9. Data Source Libraries
Users can use libraries based on the Data Source API to read/write DataFrames from/to a variety of formats/systems.
[Diagram: formats/systems grouped into built-in support (e.g. { JSON }, JDBC) and external libraries, and more…]
10. Goals of Data Source API
Developers: build libraries for various data sources
• No need to get your code merged into the Spark codebase
• Share your library with others through Spark Packages
Users: easy loading/saving of DataFrames
Efficient data access powered by the Spark SQL query optimizer
• Have interfaces that allow optimizations to be pushed down to the data source
e.g. avoid reading unnecessary data for a query (see the sketch below)
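For instance, column pruning can be pushed into the source through a scan interface that receives only the required columns. A hedged sketch of that interface, roughly as it appears in the org.apache.spark.sql.sources package in Spark 1.3 (check the Scala Doc for the exact definition):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// A relation mixing in this trait is told which columns the query needs,
// so it can skip reading the rest.
trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}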
13. Demo 1: Summary
sqlContext.load: loads an existing dataset as a DataFrame
• Data source name: what source we are loading from
• Options: parameters for a specific data source, e.g. path of data
• Schema: if a data source accepts a user-specified schema, you can apply one
dataframe.save: saves the contents of the DataFrame to a source
• Data source name: what source we are saving to
• Save mode: what we should do when data already exists
• Options: parameters for a specific data source, e.g. path of data
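A hedged sketch of these two calls in Scala (Spark 1.3); the paths are hypothetical:

import org.apache.spark.sql.SaveMode

// Load: data source name plus options (here just a path); a user-specified schema can also be passed.
val df = sqlContext.load("/data/people.json", "json")
// Save: data source name, a save mode for pre-existing data, and options such as the output path.
df.save("/data/people.parquet", "parquet", SaveMode.Overwrite)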
14. 14
Share data with other Spark applications/users?
Table: DataFrame with persisted metadata + name
15. Metadata Persistence
Configure data source once:
• Data source name
• Options
You give the DataFrame representing this dataset a name and we persist metadata in the Hive Metastore
Anyone can retrieve the dataset by its name
• In SQL or with DataFrame API
16. Data Source Tables in Hive Metastore
Metadata of data source tables is stored in its own representation in the Hive Metastore
• Not limited by the metastore's internal restrictions (e.g. data types)
• Data source tables are not Hive tables
(note: you can always read/write Hive tables with Spark SQL)
Two table types:
• Managed tables: Users do not specify the location of the data. DROP TABLE will delete the data.
• External tables: Tables with user-specified locations. DROP TABLE will NOT delete the data.
17. createExternalTable and saveAsTable
sqlContext.createExternalTable
• sqlContext.load + metadata persistence + name
dataframe.saveAsTable
• dataframe.save + metadata persistence + name
Use sqlContext.table(name) to retrieve the DataFrame
Or, access the DataFrame by its name in SQL queries
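A hedged Scala sketch of these calls (Spark 1.3); the table names and paths are hypothetical, and df is assumed to be an existing DataFrame:

// External table: load + metadata persistence + name; DROP TABLE will NOT delete the data.
sqlContext.createExternalTable("people_json", "/data/people.json", "json")
// Managed table: save + metadata persistence + name (no location given, so it is managed).
df.saveAsTable("people_parquet", "parquet")
// Retrieve either one by name, from the DataFrame API or from SQL.
val people = sqlContext.table("people_json")
sqlContext.sql("SELECT name FROM people_parquet")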
19. 19
Performance of data access?
Efficient data access powered by the Spark SQL query optimizer [1]
[1] The data source needs to support optimizations by implementing the corresponding interfaces
20. 20
events = sqlCtx.load("/data/events", "parquet")
training_data = events
  .where("city = 'New York' and year = 2015")
  .select("timestamp").collect()
[Diagram: three ways to scan the events dataset (partitioned by year, 2011-2015) for this query:
• No pruning: read all columns of 5 years' data (Expensive!!!)
• Column pruning: read only the needed columns (city, year, timestamp) (Better)
• Column pruning + partition pruning [1]: read only the needed columns and records (Much better)]
[1] Supported for Parquet and Hive; more support coming in Spark 1.4
22. Build A Data Source Library
Implement three interfaces for reading data from a data source:
• BaseRelation: the abstraction of a DataFrame loaded from a data source. It provides the schema of the data.
• RelationProvider: handles the user's options and creates a BaseRelation.
• TableScan (a BaseRelation for read): reads the data from the data source and constructs rows.
For the write path and for supporting optimizations on data access, take a look at our Scala Doc/Java Doc (a condensed sketch of the read interfaces follows below).
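A condensed, hedged sketch of these three interfaces, roughly as they appear in org.apache.spark.sql.sources in Spark 1.3 (see the Scala Doc for the exact definitions; the IntegerRelation demo later in this deck shows a complete implementation):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.StructType

// The abstraction of a DataFrame loaded from a data source; it provides the schema.
abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
}
// Handles the user's options and creates a BaseRelation.
trait RelationProvider {
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}
// Reads the data from the source and constructs rows.
trait TableScan {
  def buildScan(): RDD[Row]
}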
24. Starting From Here
More about Data Source API:
Data Source Section in Spark SQL programming guide
More about how to build a Data Source Library:
Take a look at Spark Avro
Want to share your data source library:
Submit to Spark Packages
27. Notes about Demo Code
The code is based on Spark 1.3.0.
Demos were done in Databricks Cloud
To try the demo code with your own Spark 1.3.0 deployment, just replace display(…) with .show() for showing results.
e.g. Replace
display(peopleJson.select("name"))
with
peopleJson.select("name").show()
29. Load a JSON dataset as a DataFrame.
Demo1_Scala

val json = sc.textFile("/home/yin/meetup/people.json")
json.collect().foreach(println)

json: org.apache.spark.rdd.RDD[String] = /home/yin/meetup/people.json MapPartitionsRDD[206] at textFile at <console>:29
{"name":"Cheng"}
{"name":"Michael"}
{"location":{"state":"California"},"name":"Reynold"}
{"location":{"city":"San Francisco","state":"California"},"name":"Yin"}

val peopleJson = sqlContext.load("/home/yin/meetup/people.json", "json")
peopleJson.printSchema()

root
 |-- location: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- name: string (nullable = true)
peopleJson: org.apache.spark.sql.DataFrame = [location: struct<city:string,state:string>, name: string]

display(peopleJson.select("name", "location.state"))

name      state
Cheng     null
Michael   null
Reynold   California
Yin       California

display(
  peopleJson
    .filter("location.city = 'San Francisco' and location.state = 'California'")
    .select("name"))

name
Yin

Save peopleJson to Parquet.

peopleJson.save("/home/yin/meetup/people.parquet", "parquet")

Save peopleJson to Avro.

peopleJson.save("/home/yin/meetup/people.avro", "com.databricks.spark.avro")

Save peopleJson to CSV.

peopleJson
  .select("name", "location.city", "location.state")
  .save("/home/yin/meetup/people.csv", "com.databricks.spark.csv")
30. 30
Save people.avro to Parquet.

val peopleAvro = sqlContext.load("/home/yin/meetup/people.avro", "com.databricks.spark.avro")
display(peopleAvro)

location                                       name
null                                           Cheng
null                                           Michael
{"city":null,"state":"California"}             Reynold
{"city":"San Francisco","state":"California"}  Yin

peopleAvro.save("/home/yin/meetup/people.parquet", "parquet")

java.lang.RuntimeException: path /home/yin/meetup/people.parquet already exists.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:110)
  at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
  at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
  at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1083)

A save mode is needed to control the behavior of save when data already exists.

import org.apache.spark.sql.SaveMode

The default save mode is ErrorIfExists.

peopleAvro.save("/home/yin/meetup/people.parquet", "parquet", SaveMode.ErrorIfExists)

java.lang.RuntimeException: path /home/yin/meetup/people.parquet already exists.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:110)
  at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
  at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
  at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1094)

Let's overwrite the existing people.parquet (use SaveMode.Overwrite).

peopleAvro.save("/home/yin/meetup/people.parquet", "parquet", SaveMode.Overwrite)

SaveMode.Append is for appending data (from a single user).

peopleAvro.save("/home/yin/meetup/people.parquet", "parquet", SaveMode.Append)
val peopleParquet = sqlContext.load("/home/yin/meetup/people.parquet", "parquet")
display(peopleParquet)

location                                       name
null                                           Cheng
null                                           Michael
31. 31
{"city":null,"state":"California"}             Reynold
{"city":"San Francisco","state":"California"}  Yin
null                                           Cheng
null                                           Michael
{"city":null,"state":"California"}             Reynold
{"city":"San Francisco","state":"California"}  Yin

For load, we can infer the schema from JSON, Parquet, and Avro.

You can also apply a schema to the data.

import org.apache.spark.sql.types._

val schema = StructType(StructField("name", StringType) ::
  StructField("city", StringType) :: Nil)
val options = Map("path" -> "/home/yin/meetup/people.csv")
val peopleJsonWithSchema =
  sqlContext.load("com.databricks.spark.csv", schema, options)

schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(city,StringType,true))
options: scala.collection.immutable.Map[String,String] = Map(path -> /home/yin/meetup/people.csv)
peopleJsonWithSchema: org.apache.spark.sql.DataFrame = [name: string, city: string]

peopleJsonWithSchema.printSchema()

root
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)

display(peopleJsonWithSchema)

name      city
Cheng     null
Michael   null
Reynold   null
Yin       San Francisco
33. 33
Demo2_Python

Create a table from an existing dataset with sqlContext.createExternalTable.

sqlContext.createExternalTable(
  tableName="people_json_table",
  path="/home/yin/meetup/people.json",
  source="json")

Out[7]: DataFrame[location: struct<city:string,state:string>, name: string]

display(sqlContext.table("people_json_table"))

location                                       name
null                                           Cheng
null                                           Michael
{"city":null,"state":"California"}             Reynold
{"city":"San Francisco","state":"California"}  Yin

%sql SELECT name, location.city FROM people_json_table

name      city
Cheng     null
Michael   null
Reynold   null
Yin       San Francisco

You can also provide a schema to createExternalTable (if your data source supports a user-specified schema).

Save a DataFrame as a table.

people_json = sqlContext.load(path="/home/yin/meetup/people.json", source="json")
people_json.saveAsTable(tableName="people_parquet_table", source="parquet")

display(sqlContext.table("people_parquet_table").select("name"))

name
Cheng
Michael
Reynold
Yin

Save mode can also be used with saveAsTable.

people_json.saveAsTable(tableName="people_parquet_table", source="parquet", mode="append")

display(sqlContext.table("people_parquet_table").select("name"))

name
Cheng
Michael
Reynold
34. 34
Yin
Cheng
Michael
Reynold
Yin
36. 36
Demo3_Scala

Usually, you want to import the following ...

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

Write your own BaseRelation and RelationProvider.

IntegerRelation: a relation that generates integer numbers for the range defined by [from, to].

case class IntegerRelation(from: Int, to: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // This relation has a single column "integer_num".
  override def schema =
    StructType(StructField("integer_num", IntegerType, nullable = false) :: Nil)

  override def buildScan() =
    sqlContext.sparkContext.parallelize(from to to).map(Row(_))
}

IntegerRelationProvider: handles the user's parameters (from and to) and creates an IntegerRelation.

class IntegerRelationProvider extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    IntegerRelation(parameters("from").toInt, parameters("to").toInt)(sqlContext)
  }
}

Use sqlContext.load to get a DataFrame for IntegerRelation. The range of integer numbers is [1, 10].

val options = Map("from" -> "1", "to" -> "10")
val df = sqlContext.load("com.databricks.sources.number.IntegerRelationProvider", options)

options: scala.collection.immutable.Map[String,String] = Map(from -> 1, to -> 10)
df: org.apache.spark.sql.DataFrame = [integer_num: int]

display(df)

integer_num
1
2
3
4
5
6
37. 37
7
8

display(df.select($"integer_num" * 100))

(integer_num * 100)
100
200
300
400
500
600
700
800
900
If the RelationProvider's class name is DefaultSource, users only need to provide the package name (com.databricks.sources.number instead of com.databricks.sources.number.IntegerRelationProvider).
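For example (hypothetical rename of IntegerRelationProvider to com.databricks.sources.number.DefaultSource), the load call above could then shorten to:

val df = sqlContext.load("com.databricks.sources.number", Map("from" -> "1", "to" -> "10"))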