SlideShare a Scribd company logo
1 of 33
Download to read offline
Jean Georges Perrin, @jgperrin, Oplo
Extending Spark’s
Ingestion: Build Your Own
Java Data Source
#EUdev6
2#EUdev6
JGP
@jgperrin
Jean Georges Perrin
Chapel Hill, NC, USA
I 🏗SW
#Knowledge =
𝑓 ( ∑ (#SmallData, #BigData), #DataScience)
& #Software
#IBMChampion x9 #KeepLearning
@ http://jgp.net
2#EUdev6
I💚#
A Little Help
Hello Ruby,
Nathaniel,
Jack,
and PN!
3@jgperrin - #EUdev6
And What About You?
• Who is a Java programmer?
• Who swears by Scala?
• Who has tried to build a custom data source?
• In Java?
• Who has succeeded?
4@jgperrin - #EUdev6
My Problem…
5@jgperrin - #EUdev6
CSV
JSON
RDBMS REST
Other
Solution #1
6@jgperrin - #EUdev6
• Look in the Spark Packages library
• https://spark-packages.org/
• 48 data sources listed (some dups)
• Some with Source Code
package
net.jgp.labs.spark.datasources.l100_
photo_datasource;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import
org.apache.spark.sql.SparkSession;
public class
PhotoMetadataIngestionApp {
public static void main(String[]
args) {
PhotoMetadataIngestionApp
app = new
PhotoMetadataIngestionApp();
app.start();
}
private boolean start() {
SparkSession spark =
SparkSession.builder()
.appName("EXIF to
Dataset")
.master("local[*]").
getOrCreate();
String importDirectory =
"/Users/jgp/Pictures/All
Photos/2010-2019/2016";
Solution #2
7@jgperrin - #EUdev6
• Write it yourself
• In Java
What is ingestion anyway?
• Loading
– CSV, JSON, RDBMS…
– Basically any data
• Getting a Dataframe (Dataset<Row>) with your
data
8@jgperrin - #EUdev6
9@jgperrin - #EUdev6
Our lab
Import some of
the metadata
of our photos
and start
analysis…
Code
Dataset<Row> df = spark.read()
.format("exif")
.option("recursive", "true")
.option("limit", "80000")
.option("extensions", "jpg,jpeg")
.load(importDirectory);
#EUdev6
Spark Session
Short name for
your data source
Options
Where to start
/jgperrin/net.jgp.labs.spark.datasources
Short name
• Optional: can specify a full class name
• An alias to a class name
• Needs to be registered
• Not recommended during development phase
11@jgperrin - #EUdev6
Project
12#EUdev6
Information for
registration of
the data source
Data source’s
short name
Data source
The app
Relation
Utilities
An existing
library you can
embed/link to
/jgperrin
/net.jgp.labs.spark.datasources
Library code
• A library you already use, you developed or
acquired
• No need for anything special
13@jgperrin - #EUdev6
/jgperrin
/net.jgp.labs.spark.datasources
Scala 2 Java
conversion
(sorry!) – All the
options
Data Source Code
14@jgperrin - #EUdev6
Provides a
relation
Needed by
Spark
Needs to be
passed to the
relation
public class ExifDirectoryDataSource implements RelationProvider {
@Override
public BaseRelation createRelation(
SQLContext sqlContext,
Map<String, String> params) {
java.util.Map<String, String> javaMap =
mapAsJavaMapConverter(params).asJava();
ExifDirectoryRelation br = new ExifDirectoryRelation();
br.setSqlContext(sqlContext);
...
br.setPhotoLister(photoLister);
return br;
}
}
The relation to
be exploited
Relation
• Plumbing between
– Spark
– Your existing library
• Mission
– Returns the schema as a StructType
– Returns the data as a RDD<Row>
15@jgperrin - #EUdev6
/jgperrin
/net.jgp.labs.spark.datasources
TableScan is the
key, other more
specialized Scan
available
Relation
16@jgperrin - #EUdev6
The schema:
may/will be
called first
The data…
SQL Context
public class ExifDirectoryRelation extends BaseRelation
implements Serializable, TableScan {
private static final long serialVersionUID = 4598175080399877334L;
@Override
public RDD<Row> buildScan() {
...
return rowRDD.rdd();
}
@Override
public StructType schema() {
...
return schema.getSparkSchema();
}
@Override
public SQLContext sqlContext() {
return this.sqlContext;
}
...
}
/jgperrin
/net.jgp.labs.spark.datasources
A utility function
that introspect a
Java bean and
turn it into a
”Super” schema,
which contains
the required
StructType for
Spark
Relation – Schema
17@jgperrin - #EUdev6
@Override
public StructType schema() {
if (schema == null) {
schema =
SparkBeanUtils.getSchemaFromBean(PhotoMetadata.class);
}
return schema.getSparkSchema();
}
/jgperrin
/net.jgp.labs.spark.datasources
Collect the data
Relation – Data
18@jgperrin - #EUdev6
@Override
public RDD<Row> buildScan() {
schema();
List<PhotoMetadata> table = collectData();
JavaSparkContext sparkContext = new JavaSparkContext(sqlContext.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(table)
.map(photo -> SparkBeanUtils.getRowFromBean(schema, photo));
return rowRDD.rdd();
}
private List<PhotoMetadata> collectData() {
List<File> photosToProcess = this.photoLister.getFiles();
List<PhotoMetadata> list = new ArrayList<PhotoMetadata>();
PhotoMetadata photo;
for (File photoToProcess : photosToProcess) {
photo = ExifUtils.processFromFilename(photoToProcess.getAbsolutePath());
list.add(photo);
}
return list;
}
Scans the files,
extract EXIF
information: the
interface to your
library…
Creates the RDD
by parallelizing
the list of photos
/jgperrin
/net.jgp.labs.spark.datasources
Application
19@jgperrin - #EUdev6
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class PhotoMetadataIngestionApp {
public static void main(String[] args) {
PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp();
app.start();
}
private boolean start() {
SparkSession spark = SparkSession.builder()
.appName("EXIF to Dataset").master("local[*]").getOrCreate();
Dataset<Row> df = spark.read()
.format("exif")
.option("recursive", "true")
.option("limit", "80000")
.option("extensions", "jpg,jpeg")
.load("/Users/jgp/Pictures");
df = df
.filter(df.col("GeoX").isNotNull())
.filter(df.col("GeoZ").notEqual("NaN"))
.orderBy(df.col("GeoZ").desc());
System.out.println("I have imported " + df.count() + " photos.");
df.printSchema();
df.show(5);
return true;
}
}
Normal imports,
no reference to
our data source
Local mode
Classic read
Standard
dataframe API:
getting my
”highest” photos!
Standard output
mechanism
Output
+--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+
| Name| Size|Extension| MimeType| Directory| FileCreationDate| FileLastAccessDate|FileLastModifiedDate| Filename| GeoY| Date| GeoX| GeoZ|Height|Width|
+--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+
| IMG_0177.JPG|1129296| JPG|image/jpeg|/Users/jgp/Pictur...|2015-04-14 08:14:42|2017-10-24 11:20:31| 2015-04-14 08:14:42|/Users/jgp/Pictur...| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743| 1936| 2592|
| IMG_0176.JPG|1206265| JPG|image/jpeg|/Users/jgp/Pictur...|2015-04-14 09:14:36|2017-10-24 11:20:31| 2015-04-14 09:14:36|/Users/jgp/Pictur...| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743| 1936| 2592|
| IMG_1202.JPG|1799552| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 17:52:15|2017-10-24 11:20:28| 2015-05-05 17:52:15|/Users/jgp/Pictur...| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197| 2448| 3264|
| IMG_1203.JPG|2412908| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 18:05:25|2017-10-24 11:20:28| 2015-05-05 18:05:25|/Users/jgp/Pictur...| -98.4719|2015-05-05 16:05:25|29.526403| 9128.197| 2448| 3264|
| IMG_1204.JPG|2261133| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 15:13:25|2017-10-24 11:20:29| 2015-05-05 15:13:25|/Users/jgp/Pictur...| -98.47248|2015-05-05 16:13:25|29.526756| 9128.197| 2448| 3264|
+--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+
only showing top 5 rows
20@jgperrin - #EUdev6
+--------------------+-------+-----------+-------------------+---------+----------+
| Name| Size| GeoY| Date| GeoX| GeoZ|
+--------------------+-------+-----------+-------------------+---------+----------+
| IMG_0177.JPG|1129296| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743|
| IMG_0176.JPG|1206265| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743|
| IMG_1202.JPG|1799552| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197|
| IMG_1203.JPG|2412908| -98.4719|2015-05-05 16:05:25|29.526403| 9128.197|
| IMG_1204.JPG|2261133| -98.47248|2015-05-05 16:13:25|29.526756| 9128.197|
+--------------------+-------+-----------+-------------------+---------+----------+
only showing top 5 rows
dsfsd
21#EUdev6
IMG_0176.JPG
Rijsttafel in Rotterdam, NL
2.6m (9ft) above sea level
dsfsd
22#EUdev6
IMG_0176.JPG
Hoover Dam & Lake Mead, AZ
9314.7m (30560ft) above sea level
@jgperrin - #EUdev6
A Little Extra
23
Data
Metadata
(Schema)
Schema, Beans, and Annotations
24@jgperrin - #EUdev6
Schema
Bean
StructType
SparkBeanUtils.
getSchemaFromBean()
SparkBeanUtils.
getRowFromBean()
Schema.
getSparkSchema()
RowColumn
Column
Column
Column
Can be augmented via
the @SparkColumn
annotation
Mass production
• Easily import from any Java Bean
• Conversion done by utility functions
• Schema is a superset of StructType
25#EUdev6@jgperrin - #EUdev6
Conclusion
• Reuse the Bean to schema and Bean to data in
your project
• Building custom data source to REST server or non
standard format is easy
• No need for a pricey conversion to CSV or JSON
• There is always a solution in Java
• On you: check for parallelism, optimization, extend
schema (order of columns)
26@jgperrin - #EUdev6
Going Further
• Check out the code (fork/like)
https://github.com/jgperrin/net.jgp.labs.spark.dataso
urces
• Follow me @jgperrin
• Watch for my Java + Spark book, coming out soon!
• (If you come in the RTP area in NC, USA, come for
a Spark meet-up and let’s have a drink!)
27#EUdev6
Go raibh maith agaibh.
Jean Georges “JGP” Perrin
@jgperrin
Don’t forget
to rate
Backup Slides & Other Resources
30Discover CLEGO @ http://bit.ly/spark-clego
More Reading & Resources
• My blog: http://jgp.net
• This code on GitHub:
https://github.com/jgperrin/net.jgp.labs.spark.dat
asources
• Java Code on GitHub:
https://github.com/jgperrin/net.jgp.labs.spark
/jgperrin/net.jgp.labs.spark.datasources
Photo & Other Credits
• JGP @ Oplo, Problem, Solution 1&2, Relation,
Beans: Pexels
• Lab, Hoover Dam, Clego: © Jean Georges
Perrin
• Rijsttafel: © Nathaniel Perrin
Abstract
EXTENDING APACHE SPARK'S INGESTION: BUILDING YOUR OWN JAVA DATA SOURCE
By Jean Georges Perrin (@jgperrin, Oplo)
Apache Spark is a wonderful platform for running your analytics jobs. It has great ingestion
features from CSV, Hive, JDBC, etc. however, you may have your own data sources or formats
you want to use. Your solution could be to convert your data in a CSV or JSON file and then
ask Spark to do ingest it through its built-in tools. However, for enhanced performance, we will
explore the way to build a data source, in Java, to extend Spark’s ingestion capabilities. We will
first understand how Spark works for ingestion, then walk through the development of this data
source plug-in.
Targeted audience: Software and data engineers who need to expand Spark’s ingestion
capability.
Key takeaways:
• Requirements, needs & architecture – 15%.
• Build the required tool set in Java – 85%.
Session hashtag: #EUdev6

More Related Content

What's hot

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorDatabricks
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopDataWorks Summit
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache PinotAltinity Ltd
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introductionchrislusf
 

What's hot (20)

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Spark
SparkSpark
Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Elk
Elk Elk
Elk
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 

Viewers also liked

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Strata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkStrata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkAdam Gibson
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...Spark Summit
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Spark Summit
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiBuilding Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiSpark Summit
 
A Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanA Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanDatabricks
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 

Viewers also liked (11)

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Strata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkStrata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on Spark
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiBuilding Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
 
A Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanA Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen Fan
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 

Similar to Extending Spark's Ingestion: Build Your Own Java Data Source with Jean Georges Perrin

A Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance PredictionA Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance PredictionRakebul Hasan
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Adding To the Leaf Pile
Adding To the Leaf PileAdding To the Leaf Pile
Adding To the Leaf Pilestuporglue
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JJosh Patterson
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev PROIDEA
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterDatabricks
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large GraphsNishant Gandhi
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkDatabricks
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 

Similar to Extending Spark's Ingestion: Build Your Own Java Data Source with Jean Georges Perrin (20)

A Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance PredictionA Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance Prediction
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
Adding To the Leaf Pile
Adding To the Leaf PileAdding To the Leaf Pile
Adding To the Leaf Pile
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Spark Worshop
Spark WorshopSpark Worshop
Spark Worshop
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large Graphs
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache Spark
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 

Recently uploaded (20)

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Extending Spark's Ingestion: Build Your Own Java Data Source with Jean Georges Perrin

  • 1. Jean Georges Perrin, @jgperrin, Oplo Extending Spark’s Ingestion: Build Your Own Java Data Source #EUdev6
  • 2. 2#EUdev6 JGP @jgperrin Jean Georges Perrin Chapel Hill, NC, USA I 🏗SW #Knowledge = 𝑓 ( ∑ (#SmallData, #BigData), #DataScience) & #Software #IBMChampion x9 #KeepLearning @ http://jgp.net 2#EUdev6 I💚#
  • 3. A Little Help Hello Ruby, Nathaniel, Jack, and PN! 3@jgperrin - #EUdev6
  • 4. And What About You? • Who is a Java programmer? • Who swears by Scala? • Who has tried to build a custom data source? • In Java? • Who has succeeded? 4@jgperrin - #EUdev6
  • 5. My Problem… 5@jgperrin - #EUdev6 CSV JSON RDBMS REST Other
  • 6. Solution #1 6@jgperrin - #EUdev6 • Look in the Spark Packages library • https://spark-packages.org/ • 48 data sources listed (some dups) • Some with Source Code
  • 7. package net.jgp.labs.spark.datasources.l100_ photo_datasource; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; public class PhotoMetadataIngestionApp { public static void main(String[] args) { PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp(); app.start(); } private boolean start() { SparkSession spark = SparkSession.builder() .appName("EXIF to Dataset") .master("local[*]"). getOrCreate(); String importDirectory = "/Users/jgp/Pictures/All Photos/2010-2019/2016"; Solution #2 7@jgperrin - #EUdev6 • Write it yourself • In Java
  • 8. What is ingestion anyway? • Loading – CSV, JSON, RDBMS… – Basically any data • Getting a Dataframe (Dataset<Row>) with your data 8@jgperrin - #EUdev6
  • 9. 9@jgperrin - #EUdev6 Our lab Import some of the metadata of our photos and start analysis…
  • 10. Code Dataset<Row> df = spark.read() .format("exif") .option("recursive", "true") .option("limit", "80000") .option("extensions", "jpg,jpeg") .load(importDirectory); #EUdev6 Spark Session Short name for your data source Options Where to start /jgperrin/net.jgp.labs.spark.datasources
  • 11. Short name • Optional: can specify a full class name • An alias to a class name • Needs to be registered • Not recommended during development phase 11@jgperrin - #EUdev6
  • 12. Project 12#EUdev6 Information for registration of the data source Data source’s short name Data source The app Relation Utilities An existing library you can embed/link to /jgperrin /net.jgp.labs.spark.datasources
  • 13. Library code • A library you already use, you developed or acquired • No need for anything special 13@jgperrin - #EUdev6
  • 14. /jgperrin /net.jgp.labs.spark.datasources Scala 2 Java conversion (sorry!) – All the options Data Source Code 14@jgperrin - #EUdev6 Provides a relation Needed by Spark Needs to be passed to the relation public class ExifDirectoryDataSource implements RelationProvider { @Override public BaseRelation createRelation( SQLContext sqlContext, Map<String, String> params) { java.util.Map<String, String> javaMap = mapAsJavaMapConverter(params).asJava(); ExifDirectoryRelation br = new ExifDirectoryRelation(); br.setSqlContext(sqlContext); ... br.setPhotoLister(photoLister); return br; } } The relation to be exploited
  • 15. Relation • Plumbing between – Spark – Your existing library • Mission – Returns the schema as a StructType – Returns the data as a RDD<Row> 15@jgperrin - #EUdev6
  • 16. /jgperrin /net.jgp.labs.spark.datasources TableScan is the key, other more specialized Scan available Relation 16@jgperrin - #EUdev6 The schema: may/will be called first The data… SQL Context public class ExifDirectoryRelation extends BaseRelation implements Serializable, TableScan { private static final long serialVersionUID = 4598175080399877334L; @Override public RDD<Row> buildScan() { ... return rowRDD.rdd(); } @Override public StructType schema() { ... return schema.getSparkSchema(); } @Override public SQLContext sqlContext() { return this.sqlContext; } ... }
  • 17. /jgperrin /net.jgp.labs.spark.datasources A utility function that introspect a Java bean and turn it into a ”Super” schema, which contains the required StructType for Spark Relation – Schema 17@jgperrin - #EUdev6 @Override public StructType schema() { if (schema == null) { schema = SparkBeanUtils.getSchemaFromBean(PhotoMetadata.class); } return schema.getSparkSchema(); }
  • 18. /jgperrin /net.jgp.labs.spark.datasources Collect the data Relation – Data 18@jgperrin - #EUdev6 @Override public RDD<Row> buildScan() { schema(); List<PhotoMetadata> table = collectData(); JavaSparkContext sparkContext = new JavaSparkContext(sqlContext.sparkContext()); JavaRDD<Row> rowRDD = sparkContext.parallelize(table) .map(photo -> SparkBeanUtils.getRowFromBean(schema, photo)); return rowRDD.rdd(); } private List<PhotoMetadata> collectData() { List<File> photosToProcess = this.photoLister.getFiles(); List<PhotoMetadata> list = new ArrayList<PhotoMetadata>(); PhotoMetadata photo; for (File photoToProcess : photosToProcess) { photo = ExifUtils.processFromFilename(photoToProcess.getAbsolutePath()); list.add(photo); } return list; } Scans the files, extract EXIF information: the interface to your library… Creates the RDD by parallelizing the list of photos
  • 19. /jgperrin /net.jgp.labs.spark.datasources Application 19@jgperrin - #EUdev6 import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; public class PhotoMetadataIngestionApp { public static void main(String[] args) { PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp(); app.start(); } private boolean start() { SparkSession spark = SparkSession.builder() .appName("EXIF to Dataset").master("local[*]").getOrCreate(); Dataset<Row> df = spark.read() .format("exif") .option("recursive", "true") .option("limit", "80000") .option("extensions", "jpg,jpeg") .load("/Users/jgp/Pictures"); df = df .filter(df.col("GeoX").isNotNull()) .filter(df.col("GeoZ").notEqual("NaN")) .orderBy(df.col("GeoZ").desc()); System.out.println("I have imported " + df.count() + " photos."); df.printSchema(); df.show(5); return true; } } Normal imports, no reference to our data source Local mode Classic read Standard dataframe API: getting my ”highest” photos! Standard output mechanism
  • 20. Output +--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+ | Name| Size|Extension| MimeType| Directory| FileCreationDate| FileLastAccessDate|FileLastModifiedDate| Filename| GeoY| Date| GeoX| GeoZ|Height|Width| +--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+ | IMG_0177.JPG|1129296| JPG|image/jpeg|/Users/jgp/Pictur...|2015-04-14 08:14:42|2017-10-24 11:20:31| 2015-04-14 08:14:42|/Users/jgp/Pictur...| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743| 1936| 2592| | IMG_0176.JPG|1206265| JPG|image/jpeg|/Users/jgp/Pictur...|2015-04-14 09:14:36|2017-10-24 11:20:31| 2015-04-14 09:14:36|/Users/jgp/Pictur...| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743| 1936| 2592| | IMG_1202.JPG|1799552| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 17:52:15|2017-10-24 11:20:28| 2015-05-05 17:52:15|/Users/jgp/Pictur...| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197| 2448| 3264| | IMG_1203.JPG|2412908| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 18:05:25|2017-10-24 11:20:28| 2015-05-05 18:05:25|/Users/jgp/Pictur...| -98.4719|2015-05-05 16:05:25|29.526403| 9128.197| 2448| 3264| | IMG_1204.JPG|2261133| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 15:13:25|2017-10-24 11:20:29| 2015-05-05 15:13:25|/Users/jgp/Pictur...| -98.47248|2015-05-05 16:13:25|29.526756| 9128.197| 2448| 3264| +--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+ only showing top 5 rows 20@jgperrin - #EUdev6 +--------------------+-------+-----------+-------------------+---------+----------+ | Name| Size| GeoY| Date| GeoX| GeoZ| +--------------------+-------+-----------+-------------------+---------+----------+ | IMG_0177.JPG|1129296| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743| | IMG_0176.JPG|1206265| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743| | IMG_1202.JPG|1799552| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197| | IMG_1203.JPG|2412908| -98.4719|2015-05-05 16:05:25|29.526403| 9128.197| | IMG_1204.JPG|2261133| -98.47248|2015-05-05 16:13:25|29.526756| 9128.197| +--------------------+-------+-----------+-------------------+---------+----------+ only showing top 5 rows
  • 22. dsfsd 22#EUdev6 IMG_0176.JPG Hoover Dam & Lake Mead, AZ 9314.7m (30560ft) above sea level
  • 23. @jgperrin - #EUdev6 A Little Extra 23 Data Metadata (Schema)
  • 24. Schema, Beans, and Annotations 24@jgperrin - #EUdev6 Schema Bean StructType SparkBeanUtils. getSchemaFromBean() SparkBeanUtils. getRowFromBean() Schema. getSparkSchema() RowColumn Column Column Column Can be augmented via the @SparkColumn annotation
  • 25. Mass production • Easily import from any Java Bean • Conversion done by utility functions • Schema is a superset of StructType 25#EUdev6@jgperrin - #EUdev6
  • 26. Conclusion • Reuse the Bean to schema and Bean to data in your project • Building custom data source to REST server or non standard format is easy • No need for a pricey conversion to CSV or JSON • There is always a solution in Java • On you: check for parallelism, optimization, extend schema (order of columns) 26@jgperrin - #EUdev6
  • 27. Going Further • Check out the code (fork/like) https://github.com/jgperrin/net.jgp.labs.spark.dataso urces • Follow me @jgperrin • Watch for my Java + Spark book, coming out soon! • (If you come in the RTP area in NC, USA, come for a Spark meet-up and let’s have a drink!) 27#EUdev6
  • 28. Go raibh maith agaibh. Jean Georges “JGP” Perrin @jgperrin Don’t forget to rate
  • 29. Backup Slides & Other Resources
  • 30. 30Discover CLEGO @ http://bit.ly/spark-clego
  • 31. More Reading & Resources • My blog: http://jgp.net • This code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark.dat asources • Java Code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark /jgperrin/net.jgp.labs.spark.datasources
  • 32. Photo & Other Credits • JGP @ Oplo, Problem, Solution 1&2, Relation, Beans: Pexels • Lab, Hoover Dam, Clego: © Jean Georges Perrin • Rijsttafel: © Nathaniel Perrin
  • 33. Abstract EXTENDING APACHE SPARK'S INGESTION: BUILDING YOUR OWN JAVA DATA SOURCE By Jean Georges Perrin (@jgperrin, Oplo) Apache Spark is a wonderful platform for running your analytics jobs. It has great ingestion features from CSV, Hive, JDBC, etc. however, you may have your own data sources or formats you want to use. Your solution could be to convert your data in a CSV or JSON file and then ask Spark to do ingest it through its built-in tools. However, for enhanced performance, we will explore the way to build a data source, in Java, to extend Spark’s ingestion capabilities. We will first understand how Spark works for ingestion, then walk through the development of this data source plug-in. Targeted audience: Software and data engineers who need to expand Spark’s ingestion capability. Key takeaways: • Requirements, needs & architecture – 15%. • Build the required tool set in Java – 85%. Session hashtag: #EUdev6