Data Source API in Spark
Yin Huai
3/25/2015 - Bay Area Spark Meetup
About Me
Spark SQL developer @databricks
One of the main developers of Data Source API
Used to work on Hive a lot (Hive Committer)
2
Spark: A Unified Platform
3
[Stack diagram: Spark Core Engine at the base, with Spark SQL / DataFrame, Spark Streaming, MLlib (Machine Learning), GraphX (Graph Computation), and SparkR (R on Spark) on top; Alpha/Pre-alpha components are marked]
DataFrames in Spark
Distributed collection of data grouped into named columns
(i.e. RDD with schema)
Domain-specific functions designed for common tasks
•  Metadata
•  Sampling
•  Relational data processing: project, filter, aggregation, join, ...
•  UDFs
Available in Python, Scala, Java, and R (via SparkR)
4
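To make the list above concrete, here is a small, hypothetical Spark 1.3-style Scala snippet (not from the original slides; the dataset path and column names are made up) exercising a few of these DataFrame functions:

// Hypothetical example: load a JSON dataset and use a few DataFrame functions.
val df = sqlContext.load("/data/people.json", "json")  // schema inferred from the JSON records
df.printSchema()                                       // metadata: inspect the inferred schema
df.filter("age >= 18")                                 // relational filter
  .select("name", "age")                               // projection
  .groupBy("age").count()                              // aggregation
  .show()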
5
Every Spark application starts with loading data and ends with saving data
Datasets Stored in Various Formats/Systems
6
[Diagram: the Spark stack (Spark Core Engine; Spark SQL / DataFrame, Spark Streaming, MLlib, GraphX, SparkR) reading from and writing to datasets stored in various formats/systems: { JSON }, JDBC, and more…]
Loading and Saving Data is Not Easy
Convert/parse raw data
•  e.g. parse text records, parse JSON records, deserialize
data stored in binary
Data format transformation
•  e.g. convert your Java objects to Avro records/JSON
records/Parquet records/HBase rows/…
Applications often end up with inflexible input/output logic
7
Data Sources API
8
[Diagram: the Data Source API layer sits between the Spark stack (Spark Core Engine; Spark SQL / DataFrame, Spark Streaming, MLlib, GraphX, SparkR) and external formats/systems such as { JSON }, JDBC, and more…]
Data Source Libraries
Users can use libraries based on Data Source API to read/write
DataFrames from/to a variety of formats/systems.
9
[Logos of Built-In Support and External Libraries: { JSON }, JDBC, and more…]
Goals of Data Source API
Developers: build libraries for various data sources
•  No need to get your code merged in Spark codebase
•  Share your library with others through Spark Packages
Users: easy loading/saving DataFrames
Efficient data access powered by Spark SQL query optimizer
•  Have interfaces allowing optimizations to be pushed down
to data source
e.g. Avoid reading unnecessary data for a query
10
11
Data Source API:
Easy loading/saving data
12
Demo 1:
Loading/saving data in Spark
(Generic load/save functions)
(Please see page 26 for code)
Demo 1: Summary
sqlContext.load: loads an existing dataset as a DataFrame
•  Data source name: what source we are loading from
•  Options: parameters for a specific data source, e.g. path of data
•  Schema: if a data source accepts a user-specified schema, you can apply one
dataframe.save: saves the contents of the DataFrame to a source
•  Data source name: what source we are saving to
•  Save mode: what we should do when data already exists
•  Options: parameters for a specific data source, e.g. path of data
13
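For reference, a minimal sketch of these generic load/save calls (Spark 1.3 API; the paths mirror the demo code on page 26 and are only examples):

import org.apache.spark.sql.SaveMode

// Load: data source name is "json"; the path is the source-specific option.
val peopleJson = sqlContext.load("/home/yin/meetup/people.json", "json")

// Save: data source name is "parquet"; the save mode decides what happens
// when /home/yin/meetup/people.parquet already exists.
peopleJson.save("/home/yin/meetup/people.parquet", "parquet", SaveMode.Overwrite)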
14
Share data with other Spark
applications/users?
Table: DataFrame with persisted metadata + name
Metadata Persistence
Configure data source once:
•  Data source name
•  Options
You give the DataFrame representing this dataset a name and
we persist metadata in the Hive Metastore
Anyone can retrieve the dataset by its name
•  In SQL or with DataFrame API
15
Data Source Tables in Hive Metastore
Metadata of data source tables is stored in its own representation
in the Hive Metastore
•  Not limited by metastore’s internal restrictions (e.g. data types)
•  Data source tables are not Hive tables
(note: you can always read/write Hive tables with Spark SQL)
Two table types:
•  Managed tables: Users do not specify the location of the data. DROP TABLE will delete the data.
•  External tables: Tables with user-specified locations. DROP TABLE will NOT delete the data.
16
createExternalTable and saveAsTable
sqlContext.createExternalTable
•  sqlContext.load + metadata persistence + name
dataframe.saveAsTable
•  dataframe.save + metadata persistence + name
Use sqlContext.table(name) to retrieve the DataFrame
Or, access the DataFrame by its name in SQL queries
17
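A minimal Scala sketch of the two calls above (table names and paths are only examples; Demo 2 walks through the same flow in Python):

import org.apache.spark.sql.SaveMode

// sqlContext.load + metadata persistence + name (external table).
sqlContext.createExternalTable("people_json_table", "/home/yin/meetup/people.json", "json")

// dataframe.save + metadata persistence + name (managed table).
val peopleJson = sqlContext.load("/home/yin/meetup/people.json", "json")
peopleJson.saveAsTable("people_parquet_table", "parquet", SaveMode.ErrorIfExists)

// Retrieve the dataset later by name, from the DataFrame API or from SQL.
val people = sqlContext.table("people_parquet_table")
sqlContext.sql("SELECT name FROM people_json_table")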
18
Demo 2:
createExternalTable
and saveAsTable
(Please see page 26 for code)
19
Performance of data access?
Efficient data access powered by
Spark SQL query optimizer1
1The data source needs to support optimizations by implementing corresponding interfaces
20
events = sqlCtx.load("/data/events", "parquet")
training_data =
  events
    .where("city = 'New York' and year = 2015")
    .select("timestamp").collect()
[Figure: three scans of the events table, partitioned by year (2011–2015)]
•  No pruning: all columns of 5 years' data (Expensive!!!)
•  Column pruning: only the needed columns (city, year, timestamp) (Better)
•  Partitioning pruning1: only the needed columns and records (Much better)
1Supported for Parquet and Hive, more support coming in Spark 1.4
21
Build A Data Source Library
Build A Data Source Library
Implementing three interfaces for reading data from a data
source
•  BaseRelation: The abstraction of a DataFrame loaded from
a data source. It provides schema of the data.
•  RelationProvider: Handle users’ options and create a
BaseRelation
•  TableScan (BaseRelation for read): Read the data from the
data source and construct rows
For the write path and for supporting optimizations on data access,
take a look at our Scala Doc/Java Doc
22
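Beyond TableScan, the org.apache.spark.sql.sources package in Spark 1.3 also exposes scan interfaces that let Spark push optimizations into the source, e.g. PrunedScan for column pruning. A rough sketch of a column-pruning relation follows; the in-memory (key, value) dataset is purely hypothetical:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types._

// Hypothetical relation over in-memory (key, value) pairs.
case class KeyValueRelation(data: Seq[(Int, String)])(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedScan {

  override def schema = StructType(
    StructField("key", IntegerType, nullable = false) ::
    StructField("value", StringType, nullable = true) :: Nil)

  // Spark passes in only the columns the query needs, so only those are materialized.
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val extract = requiredColumns.map {
      case "key"   => (kv: (Int, String)) => kv._1
      case "value" => (kv: (Int, String)) => kv._2
    }
    sqlContext.sparkContext.parallelize(data).map(kv => Row(extract.map(_(kv)): _*))
  }
}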
23
Demo 3:
Build A Data Source Library
(Please see page 26 for code)
Starting From Here
More about Data Source API:
Data Source Section in Spark SQL programming guide
More about how to build a Data Source Library:
Take a look at Spark Avro
Want to share your data source library:
Submit to Spark Packages
24
Thank you!
26
The following slides contain the code of the demos.
Notes about Demo Code
The code is based on Spark 1.3.0.
Demos were done in Databricks Cloud
To try the demo code with your Spark 1.3.0 deployment, just replace display(…) with .show() for showing results.
e.g. Replace
display(peopleJson.select("name"))
with
peopleJson.select("name").show()
27
28
Demo 1:
Loading/saving data in Spark
(Generic load/save functions)
Load a JSON dataset as a DataFrame.
Command took 0.11s -- by yin at 3/25/2015, 7:13:41 PM on yin-meetup-demo
> 
json: org.apache.spark.rdd.RDD[String] = /home/yin/meetup/people.json MapPartitionsRDD[206] at textFile at <console>:29
Command took 0.77s -- by yin at 3/25/2015, 7:13:52 PM on yin-meetup-demo
> 
{"name":"Cheng"}
{"name":"Michael"}
{"location":{"state":"California"},"name":"Reynold"}
{"location":{"city":"San Francisco","state":"California"},"name":"Yin"}
Command took 0.60s -- by yin at 3/25/2015, 7:14:41 PM on yin-meetup-demo
>  val peopleJson =
sqlContext.load("/home/yin/meetup/people.json", "json")
peopleJson.printSchema()
root
|-- location: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
peopleJson: org.apache.spark.sql.DataFrame = [location: struct<city:string,state:string>, name: string]
Command took 0.70s -- by yin at 3/25/2015, 7:15:10 PM on yin-meetup-demo
>  display(peopleJson.select("name", "location.state"))
name state
Cheng null
Michael null
Reynold California
Yin California
val json = sc.textFile("/home/yin/meetup/people.json")
json.collect().foreach(println)
Demo1_Scala
29
Command took 0.49s -- by yin at 3/25/2015, 7:15:28 PM on yin-meetup-demo
>  display(
  peopleJson
    .filter("location.city = 'San Francisco' and location.state = 'California'")
    .select("name"))
name
Yin
> 
Save peopleJson to Parquet.
Command took 3.27s -- by yin at 3/25/2015, 7:15:49 PM on yin-meetup-demo
> 
> 
Save peopleJson to Avro.
Command took 0.52s -- by yin at 3/25/2015, 7:15:57 PM on yin-meetup-demo
>  peopleJson.save("/home/yin/meetup/people.avro",
"com.databricks.spark.avro")
> 
Save peopleJson to CSV
Command took 0.89s -- by yin at 3/25/2015, 7:16:24 PM on yin-meetup-demo
> 
> 
peopleJson.save("/home/yin/meetup/people.parquet",
"parquet")
peopleJson
.select("name", "location.city", "location.state")
.save("/home/yin/meetup/people.csv",
"com.databricks.spark.csv")
30
Save people.avro to Parquet.
Command took 1.21s -- by yin at 3/25/2015, 7:16:43 PM on yin-meetup-demo
>  val peopleAvro =
sqlContext.load("/home/yin/meetup/people.avro",
"com.databricks.spark.avro")
display(peopleAvro)
location name
null Cheng
null Michael
{"city":null,"state":"California"} Reynold
{"city":"San Francisco","state":"California"} Yin
java.lang.RuntimeException: path /home/yin/meetup/people.parquet already exists.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:110)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1083)
Command took 0.19s -- by yin at 3/25/2015, 7:17:01 PM on yin-meetup-demo
>  peopleAvro.save("/home/yin/meetup/people.parquet",
"parquet")
> 
Save mode is used to control the behavior of save when data already exists.
Command took 0.09s -- by yin at 3/25/2015, 7:17:33 PM on yin-meetup-demo
>  import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode
> 
The default save mode is ErrorIfExists.
java.lang.RuntimeException: path /home/yin/meetup/people.parquet already exists.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:110)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1094)
Command took 0.20s -- by yin at 3/25/2015, 7:17:40 PM on yin-meetup-demo
>  peopleAvro.save("/home/yin/meetup/people.parquet",
"parquet", SaveMode.ErrorIfExists)
> 
Let's overwrite the existing people.parquet (use
SaveMode.Overwrite).
Command took 2.82s -- by yin at 3/25/2015, 7:17:50 PM on yin-meetup-demo
> 
> 
SaveMode.Append is for appending data (from a single
user).
>  peopleAvro.save("/home/yin/meetup/people.parquet",
"parquet", SaveMode.Append)
val peopleParquet =
sqlContext.load("/home/yin/meetup/people.parquet",
"parquet")
display(peopleParquet)
location name
null Cheng
null Michael
peopleAvro.save("/home/yin/meetup/people.parquet",
"parquet", SaveMode.Overwrite)
31
Command took 3.54s -- by yin at 3/25/2015, 7:18:09 PM on yin-meetup-demo
{"city":null,"state":"California"} Reynold
{"city":"San Francisco","state":"California"} Yin
null Cheng
null Michael
{"city":null,"state":"California"} Reynold
{"city":"San Francisco","state":"California"} Yin
> 
For load, we can infer the schema from
JSON, Parquet, and Avro.
> 
You can also apply a schema to the
data.
Command took 0.09s -- by yin at 3/25/2015, 7:18:55 PM on yin-meetup-demo
>  import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
Command took 0.30s -- by yin at 3/25/2015, 7:19:36 PM on yin-meetup-demo
>  val schema = StructType(StructField("name", StringType) ::
StructField("city", StringType) :: Nil)
val options = Map("path" -> "/home/yin/meetup/people.csv")
val peopleJsonWithSchema =
sqlContext.load("com.databricks.spark.csv", schema,
options)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(city,StringType,true))
options: scala.collection.immutable.Map[String,String] = Map(path -> /home/yin/meetup/people.csv)
peopleJsonWithSchema: org.apache.spark.sql.DataFrame = [name: string, city: string]
>  peopleJsonWithSchema.printSchema()
Command took 0.11s -- by yin at 3/25/2015, 7:19:39 PM on yin-meetup-demo
root
|-- name: string (nullable = true)
|-- city: string (nullable = true)
Command took 0.78s -- by yin at 3/25/2015, 7:19:46 PM on yin-meetup-demo
>  display(peopleJsonWithSchema)
name city
Cheng null
Michael null
Reynold null
Yin San Francisco
32
Demo 2:
createExternalTable
and saveAsTable
33
Create a table from an existing dataset
with sqlContext.createExternalTable
Command took 0.93s -- by yin at 3/25/2015, 7:25:39 PM on yin-meetup-demo
> 
Out[7]: DataFrame[location: struct<city:string,state:string>, name: string]
Command took 0.50s -- by yin at 3/25/2015, 7:25:49 PM on yin-meetup-demo
> 
location name
null Cheng
null Michael
{"city":null,"state":"California"} Reynold
{"city":"San Francisco","state":"California"} Yin
Command took 0.43s -- by yin at 3/25/2015, 7:25:58 PM on yin-meetup-demo
> 
name city
Cheng null
Michael null
Reynold null
Yin San Francisco
> 
You can also provide a schema to createExternalTable (if
your data source supports user-specified schemas)
sqlContext.createExternalTable(
tableName="people_json_table",
path="/home/yin/meetup/people.json",
source="json")
display(sqlContext.table("people_json_table"))
%sql SELECT name, location.city FROM people_json_table
Demo2_Python
> 
Save a DataFrame as a Table
Command took 4.83s -- by yin at 3/25/2015, 7:26:57 PM on yin-meetup-demo
>  people_json =
sqlContext.load(path="/home/yin/meetup/people.json",
source="json")
people_json.saveAsTable(tableName="people_parquet_table",
source="parquet")
Command took 0.74s -- by yin at 3/25/2015, 7:27:10 PM on yin-meetup-demo
>  display(sqlContext.table("people_parquet_table").select("name"))
name
Cheng
Michael
Reynold
Yin
> 
Save mode can also be used with
saveAsTable
Command took 3.53s -- by yin at 3/25/2015, 7:27:42 PM on yin-meetup-demo
>  people_json.saveAsTable(tableName="people_parquet_table",
source="parquet", mode="append")
>  display(sqlContext.table("people_parquet_table").select("name"))
name
Cheng
Michael
Reynold
34
Command took 0.82s -- by yin at 3/25/2015, 7:27:48 PM on yin-meetup-demo
Yin
Cheng
Michael
Reynold
Yin
35
Demo 3:
Build A Data Source Library
36
Usually, you want to import the
following ...
> 
> 
Write your own BaseRelation and
RelationProvider
IntegerRelation: A relation to generate integer
numbers for the range defined by [from, to].
> 
> 
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._
case class IntegerRelation(from: Int, to: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {
  // This relation has a single column "integer_num".
  override def schema =
    StructType(StructField("integer_num", IntegerType, nullable = false) :: Nil)
  override def buildScan() =
    sqlContext.sparkContext.parallelize(from to to).map(Row(_))
}
Demo3_Scala
IntegerRelationProvider: Handle the user's parameters (from and to) and create an IntegerRelation.
>  class IntegerRelationProvider extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    IntegerRelation(parameters("from").toInt, parameters("to").toInt)(sqlContext)
  }
}
> 
Use sqlContext.load to get a DataFrame for
IntegerRelation. The range of integer numbers is
[1, 10].
Command took 0.18s -- by yin at 3/25/2015, 7:35:00 PM on yin-meetup-demo
> 
options: scala.collection.immutable.Map[String,String] = Map(from -> 1, to -> 10)
df: org.apache.spark.sql.DataFrame = [integer_num: int]
>  display(df)
integer_num
1
2
3
4
5
6
val options = Map("from" -> "1", "to" -> "10")
val df = sqlContext.load("com.databricks.sources.number.IntegerRelationProvider", options)
37
Command took 0.19s -- by yin at 3/25/2015, 7:35:09 PM on yin-meetup-demo
7
8
Command took 0.21s -- by yin at 3/25/2015, 7:35:24 PM on yin-meetup-demo
>  display(df.select($"integer_num" * 100))
(integer_num * 100)
100
200
300
400
500
600
700
800
900
> 
If the RelationProvider's class name is DefaultSource, users only need to provide the package name (com.databricks.sources.number instead of com.databricks.sources.number.IntegerRelationProvider).
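For illustration, a minimal sketch of that convention, reusing the IntegerRelation from Demo 3 (the package name stays hypothetical):

package com.databricks.sources.number

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

// Because the provider is named DefaultSource, users can refer to the data
// source by its package name alone.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    IntegerRelation(parameters("from").toInt, parameters("to").toInt)(sqlContext)
  }
}

With this in place, sqlContext.load("com.databricks.sources.number", options) resolves the provider automatically.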

More Related Content

What's hot

Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesDatabricks
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 

What's hot (20)

Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 

Similar to Data Source API in Spark

Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkMartin Toshev
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R PackagesCraig Warman
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 
ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesJayesh Thakrar
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)Daniele Dell'Aglio
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Databricks
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaSpark Summit
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksData Con LA
 

Similar to Data Source API in Spark (20)

Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data Sources
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Informatica slides
Informatica slidesInformatica slides
Informatica slides
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
buds n tech IT solutions
buds n  tech IT                solutionsbuds n  tech IT                solutions
buds n tech IT solutionsmonugehlot87
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 

Recently uploaded (20)

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
buds n tech IT solutions
buds n  tech IT                solutionsbuds n  tech IT                solutions
buds n tech IT solutions
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 

Data Source API in Spark

  • 1. Data Source API in Spark Yin Huai 3/25/2015 - Bay Area Spark Meetup
  • 2. About Me Spark SQL developer @databricks One of the main developers of Data Source API Used to work on Hive a lot (Hive Committer) 2
  • 3. Spark: A Unified Platform 3 Spark Core Engine DataFrame Spark Streaming Streaming MLlib Machine Learning Graphx Graph Computation Spark R R on Spark Spark SQL Alpha/Pre-alpha
  • 4. DataFrames in Spark Distributed collection of data grouped into named columns (i.e. RDD with schema) Domain-specific functions designed for common tasks •  Metadata •  Sampling •  Relational data processing: project, filter, aggregation, join, ... •  UDFs Available in Python, Scala, Java, and R (via SparkR) 4
  • 5. 5 Every Spark application starts with loading data and ends with saving data  
  • 6. Datasets Stored in Various Formats/ Systems 6 Spark Core Engine Alpha/Pre-alpha { JSON } JDBC and more… DataFrame Spark Streaming Streaming MLlib Machine Learning Graphx Graph Computation Spark R R on Spark Spark SQL
  • 7. Loading and Saving Data is Not Easy Convert/parse raw data •  e.g. parse text records, parse JSON records, deserialize data stored in binary Data format transformation •  e.g. convert your Java objects to Avro records/JSON records/Parquet records/HBase rows/… Applications often end up with in-flexible input/output logic 7
  • 8. Data Sources API 8 Data Source API Spark Core Engine Alpha/Pre-alpha { JSON } JDBC and more… DataFrame Spark Streaming Streaming MLlib Machine Learning Graphx Graph Computation Spark R R on Spark Spark SQL
  • 9. Data Source Libraries Users can use libraries based on Data Source API to read/write DataFrames from/to a variety of formats/systems. 9 { JSON } Built-In Support External Libraries JDBC and more…
  • 10. Goals of Data Source API Developers: build libraries for various data sources •  No need to get your code merged in Spark codebase •  Share your library with others through Spark Packages Users: easy loading/saving DataFrames Efficient data access powered by Spark SQL query optimizer •  Have interfaces allowing optimizations to be pushed down to data source e.g. Avoid reading unnecessary data for a query 10
  • 11. 11 Data Source API: Easy loading/saving data  
  • 12. 12 Demo 1: Loading/saving data in Spark (Generic load/save functions) (Please see page 26 for code)  
  • 13. Demo 1: Summary sqlContext.load: loads an existing dataset as a DataFrame •  Data source name: what source we are loading from •  Options: parameters for a specific data source, e.g. path of data •  Schema: if a data source accepts a user-specific schema, you can apply one dataframe.save: saves the contents of the DataFrame to a source •  Data source name: what source we are saving to •  Save mode: what we should do when data already exists •  Options: parameters for a specific data source, e.g. path of data 13
  • 14. 14 Share data with other Spark applications/users? Table: DataFrame with persisted metadata + name
  • 15. Metadata Persistence Configure data source once: •  Data source name •  Options You give the DataFrame representing this dataset a name and we persist metadata in the Hive Metastore Anyone can retrieve the dataset by its name •  In SQL or with DataFrame API 15
  • 16. Data Source Tables in Hive Metastore Metadata of data source tables are stored in its own representations in Hive Metastore •  Not limited by metastore’s internal restrictions (e.g. data types) •  Data source tables are not Hive tables (note: you can always read/write Hive tables with Spark SQL) Two table types: •  Managed tables: Users do not specify the location of the data. DROP  TABLE  will delete the data. •  External tables: Tables’ with user-specified locations. DROP   TABLE  will NOT delete the data. 16
  • 17. createExternalTable and saveAsTable sqlContext.createExternalTable •  sqlContext.load + metadata persistence + name dataframe.saveAsTable •  dataframe.save + metadata persistence + name Use sqlContext.table(name) to retrieve the DataFrame Or, access the DataFrame by its name in SQL queries 17
  • 19. 19 Performance of data access? Efficient data access powered by Spark SQL query optimizer1 1The data source needs to support optimizations by implementing corresponding interfaces
  • 20. 20 events  =  sqlCtx.load("/data/events",  "parquet")   training_data  =      events          .where("city  =  'New  York'  and  year  =  2015")          .select("timestamp").collect()     events (many columns) 2011 2012 2013 2014 2015 All columns of 5 years’ data (Expensive!!!) events (city, year, timestamp) 2011 2012 2013 2014 2015 Needed columns (Better) events (city, year, timestamp) 2011 2012 2013 2014 2015 Needed columns and records (Much better) Column pruning Partitioning pruning1 1Supported for Parquet and Hive, more support coming in Spark 1.4
  • 21. 21 Build A Data Source Library
  • 22. Build A Data Source Library Implementing three interfaces for reading data from a data source •  BaseRelation: The abstraction of a DataFrame loaded from a data source. It provides schema of the data. •  RelationProvider: Handle users’ options and create a BaseRelation •  TableScan (BaseRelation for read): Read the data from the data source and construct rows For write path and supporting optimizations on data access, take a look at our Scala Doc/Java Doc 22
  • 23. 23 Demo 3: Build A Data Source Library (Please see page 26 for code)  
  • 24. Starting From Here More about the Data Source API: the Data Sources section in the Spark SQL programming guide More about how to build a data source library: take a look at Spark Avro Want to share your data source library? Submit it to Spark Packages 24
  • 26. 26 The following slides contain the code for the demos.
  • 27. Notes about Demo Code The code is based on Spark 1.3.0. Demos were done in Databricks Cloud. To try the demo code with your own Spark 1.3.0 deployment, just replace display(…) with .show() to show results, e.g. replace display(peopleJson.select("name")) with peopleJson.select("name").show() 27
  • 28. 28 Demo 1: Loading/saving data in Spark (Generic load/save functions)
  • 29. Load a JSON dataset as a DataFrame. Command took 0.11s -- by yin at 3/25/2015, 7:13:41 PM on yin-meetup-demo > json: org.apache.spark.rdd.RDD[String] = /home/yin/meetup/people.json MapPartitionsRDD[206] at textFile at <console>:29 Command took 0.77s -- by yin at 3/25/2015, 7:13:52 PM on yin-meetup-demo > {"name":"Cheng"} {"name":"Michael"} {"location":{"state":"California"},"name":"Reynold"} {"location":{"city":"San Francisco","state":"California"},"name":"Yin"} Command took 0.60s -- by yin at 3/25/2015, 7:14:41 PM on yin-meetup-demo > val peopleJson = sqlContext.load("/home/yin/meetup/people.json", "json") peopleJson.printSchema() root |-- location: struct (nullable = true) | |-- city: string (nullable = true) | |-- state: string (nullable = true) |-- name: string (nullable = true) peopleJson: org.apache.spark.sql.DataFrame = [location: struct<city:string,state:string>, name: string] Command took 0.70s -- by yin at 3/25/2015, 7:15:10 PM on yin-meetup-demo > display(peopleJson.select("name", "location.state")) name state Cheng null Michael null Reynold California Yin California val json = sc.textFile("/home/yin/meetup/people.json") json.collect().foreach(println) Demo1_Scala 29 Command took 0.49s -- by yin at 3/25/2015, 7:15:28 PM on yin-meetup-demo > display( peopleJson .filter("location.city = 'San Francisco' and location.state = 'California'") .select("name")) name Yin > Save peopleJson to Parquet. Command took 3.27s -- by yin at 3/25/2015, 7:15:49 PM on yin-meetup-demo > > Save peopleJson to Avro. Command took 0.52s -- by yin at 3/25/2015, 7:15:57 PM on yin-meetup-demo > peopleJson.save("/home/yin/meetup/people.avro", "com.databricks.spark.avro") > Save peopleJson to CSV. Command took 0.89s -- by yin at 3/25/2015, 7:16:24 PM on yin-meetup-demo > > peopleJson.save("/home/yin/meetup/people.parquet", "parquet") peopleJson .select("name", "location.city", "location.state") .save("/home/yin/meetup/people.csv", "com.databricks.spark.csv")
  • 30. 30 Save people.avro to Parquet. Command took 1.21s -- by yin at 3/25/2015, 7:16:43 PM on yin-meetup-demo > val peopleAvro = sqlContext.load("/home/yin/meetup/people.avro", "com.databricks.spark.avro") display(peopleAvro) location name null Cheng null Michael {"city":null,"state":"California"} Reynold {"city":"San Francisco","state":"California"} Yin java.lang.RuntimeException: path /home/yin/meetup/people.parquet already exists. at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:110) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308) at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123) at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1083) Command took 0.19s -- by yin at 3/25/2015, 7:17:01 PM on yin-meetup-demo > peopleAvro.save("/home/yin/meetup/people.parquet", "parquet") > Save mode is used to control the behavior of save when data already exists. Command took 0.09s -- by yin at 3/25/2015, 7:17:33 PM on yin-meetup-demo > import org.apache.spark.sql.SaveMode import org.apache.spark.sql.SaveMode > The default save mode is ErrorIfExists. java.lang.RuntimeException: path /home/yin/meetup/people.parquet already exists. at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:110) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308) at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123) at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1094) Command took 0.20s -- by yin at 3/25/2015, 7:17:40 PM on yin-meetup-demo > peopleAvro.save("/home/yin/meetup/people.parquet", "parquet", SaveMode.ErrorIfExists) > Let's overwrite the existing people.parquet (use SaveMode.Overwrite). Command took 2.82s -- by yin at 3/25/2015, 7:17:50 PM on yin-meetup-demo > > SaveMode.Append is for appending data (from a single user). > peopleAvro.save("/home/yin/meetup/people.parquet", "parquet", SaveMode.Append) val peopleParquet = sqlContext.load("/home/yin/meetup/people.parquet", "parquet") display(peopleParquet) location name null Cheng null Michael peopleAvro.save("/home/yin/meetup/people.parquet", "parquet", SaveMode.Overwrite)
  • 31. 31 Command took 3.54s -- by yin at 3/25/2015, 7:18:09 PM on yin-meetup-demo {"city":null,"state":"California"} Reynold {"city":"San Francisco","state":"California"} Yin null Cheng null Michael {"city":null,"state":"California"} Reynold {"city":"San Francisco","state":"California"} Yin > For load, we can infer the schema from JSON, Parquet, and Avro. > You can also apply a schema to the data. Command took 0.09s -- by yin at 3/25/2015, 7:18:55 PM on yin-meetup-demo > import org.apache.spark.sql.types._ import org.apache.spark.sql.types._ Command took 0.30s -- by yin at 3/25/2015, 7:19:36 PM on yin-meetup-demo > val schema = StructType(StructField("name", StringType) :: StructField("city", StringType) :: Nil) val options = Map("path" -> "/home/yin/meetup/people.csv") val peopleJsonWithSchema = sqlContext.load("com.databricks.spark.csv", schema, options) schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(city,StringType,true)) options: scala.collection.immutable.Map[String,String] = Map(path -> /home/yin/meetup/people.csv) peopleJsonWithSchema: org.apache.spark.sql.DataFrame = [name: string, city: string] > peopleJsonWithSchema.printSchema() Command took 0.11s -- by yin at 3/25/2015, 7:19:39 PM on yin-meetup-demo root |-- name: string (nullable = true) |-- city: string (nullable = true) Command took 0.78s -- by yin at 3/25/2015, 7:19:46 PM on yin-meetup-demo > display(peopleJsonWithSchema) name city Cheng null Michael null Reynold null Yin San Francisco
  • 33. 33 Create a table from an existing dataset with sqlContext.createExternalTable Command took 0.93s -- by yin at 3/25/2015, 7:25:39 PM on yin-meetup-demo > Out[7]: DataFrame[location: struct<city:string,state:string>, name: string] Command took 0.50s -- by yin at 3/25/2015, 7:25:49 PM on yin-meetup-demo > location name null Cheng null Michael {"city":null,"state":"California"} Reynold {"city":"San Francisco","state":"California"} Yin Command took 0.43s -- by yin at 3/25/2015, 7:25:58 PM on yin-meetup-demo > name city Cheng null Michael null Reynold null Yin San Francisco > You can also provide a schema to createExternalTable (if your data source supports a user-specified schema) sqlContext.createExternalTable( tableName="people_json_table", path="/home/yin/meetup/people.json", source="json") display(sqlContext.table("people_json_table")) %sql SELECT name, location.city FROM people_json_table Demo2_Python > Save a DataFrame as a Table Command took 4.83s -- by yin at 3/25/2015, 7:26:57 PM on yin-meetup-demo > people_json = sqlContext.load(path="/home/yin/meetup/people.json", source="json") people_json.saveAsTable(tableName="people_parquet_table", source="parquet") Command took 0.74s -- by yin at 3/25/2015, 7:27:10 PM on yin-meetup-demo > display(sqlContext.table("people_parquet_table").select("name")) name Cheng Michael Reynold Yin > Save mode can also be used with saveAsTable Command took 3.53s -- by yin at 3/25/2015, 7:27:42 PM on yin-meetup-demo > people_json.saveAsTable(tableName="people_parquet_table", source="parquet", mode="append") > display(sqlContext.table("people_parquet_table").select("name")) name Cheng Michael Reynold
  • 34. 34 Command took 0.82s -- by yin at 3/25/2015, 7:27:48 PM on yin-meetup-demo Yin Cheng Michael Reynold Yin
  • 35. 35 Demo 3: Build A Data Source Library
  • 36. 36 Usually, you want to import the following ... > > Write your own BaseRelation and RelationProvider IntegerRelation: A relation to generate integer numbers for the range defined by [from, to]. > > import org.apache.spark.sql.SQLContext import org.apache.spark.sql.Row import org.apache.spark.sql.sources._ import org.apache.spark.sql.types._ case class IntegerRelation(from: Int, to: Int)(@transient val sqlContext: SQLContext) extends BaseRelation with TableScan { // This relation has a single column "integer_num". override def schema = StructType(StructField("integer_num", IntegerType, nullable = false) :: Nil) override def buildScan() = sqlContext.sparkContext.parallelize(from to to).map(Row(_)) } Demo3_Scala IntegerRelationProvider: Handles the user's parameters (from and to) and creates an IntegerRelation. > class IntegerRelationProvider extends RelationProvider { override def createRelation( sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = { IntegerRelation(parameters("from").toInt, parameters("to").toInt)(sqlContext) } } > Use sqlContext.load to get a DataFrame for IntegerRelation. The range of integer numbers is [1, 10]. Command took 0.18s -- by yin at 3/25/2015, 7:35:00 PM on yin-meetup-demo > options: scala.collection.immutable.Map[String,String] = Map(from -> 1, to -> 10) df: org.apache.spark.sql.DataFrame = [integer_num: int] > display(df) integer_num 1 2 3 4 5 6 val options = Map("from"->"1", "to"->"10") val df = sqlContext.load("com.databricks.sources.number.IntegerRelationProvider", options)
  • 37. 37 Command took 0.19s -- by yin at 3/25/2015, 7:35:09 PM on yin-meetup-demo 7 8 Command took 0.21s -- by yin at 3/25/2015, 7:35:24 PM on yin-meetup-demo >  display(df.select($"integer_num" * 100)) (integer_num * 100) 100 200 300 400 500 600 700 800 900 >  If the RelationProvider's class name is DefaultSource, users only need to provide the package name (com.databricks.sources.number instead of com.databricks.sources.number.IntegerRelationProvider)
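To illustrate the DefaultSource convention from the last point above, a minimal sketch reusing the com.databricks.sources.number package and the IntegerRelationProvider from Demo 3 (subclassing the existing provider is just one way to wire it up):

  package com.databricks.sources.number

  // Naming the provider class DefaultSource lets users refer to the data
  // source by its package name alone.
  class DefaultSource extends IntegerRelationProvider

  // Loading now only needs the package name:
  // val df = sqlContext.load("com.databricks.sources.number", Map("from" -> "1", "to" -> "10"))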