SQOOP on SPARK
for Data Ingestion
Veena Basavaraj & Vinoth Chandar
@Uber
Works currently @Uber, focused on building
a real-time pipeline for ingestion to Hadoop
for batch and stream processing.
Previously @LinkedIn, lead on Voldemort, and
@Oracle, focused on log-based replication, HPC,
and stream processing.
Works currently @Uber on streaming systems.
Prior to this, worked @Cloudera on ingestion
for Hadoop and @LinkedIn on frontend and
service infra.
Agenda
• Data Ingestion Today
• Introduction to Apache Sqoop 2
• Sqoop Jobs on Apache Spark
• Insights & Next Steps
In the beginning…
Data Ingestion Tool
• Primary need
• Transferring data from SQL to Hadoop
• Sqoop solved it well!
Ingestion needs Evolved…
Data Ingestion Pipeline
• Ingestion pipeline can now have
• Non-SQL-like data sources
• Messaging Systems as data sources
• Multi-stage pipeline
Sqoop 2
• Generic Data Transfer Service
• FROM - egress data out from ANY source
• TO - ingress data into ANY source
• Pluggable Data Sources
• Server-Client Design
Sqoop 2
• CONNECTOR
• JOB
Connector
• Connectors represent Data Sources
Connector
• Data Source properties represented via Configs
• LINK config to connect to the data source
• JOB config to read/write data from the data source
Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
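The Partition/Extract/Load split listed above can be sketched in plain Java. The interfaces and classes below (`Partitioner`, `Extractor`, `Loader`, `RangeSource`, `ListSink`, `ConnectorDemo`) are hypothetical names chosen for illustration and are not the actual Sqoop 2 connector classes, whose signatures differ. The sketch assumes an in-memory source of integer ids and only shows how an engine can drive a connector through the three roles without knowing its concrete type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical connector roles; the real Sqoop 2 classes have different signatures.
interface Partitioner<P> { List<P> partition(int maxPartitions); }
interface Extractor<P, R> { List<R> extract(P partition); }
interface Loader<R> { void load(List<R> records); }

// Toy FROM side: ids 0..totalRows-1, split into contiguous ranges {lo, hi}.
final class RangeSource implements Partitioner<int[]>, Extractor<int[], String> {
    private final int totalRows;
    RangeSource(int totalRows) { this.totalRows = totalRows; }

    // (P) Assumes maxPartitions > 0; uses ceiling division for the chunk size.
    @Override
    public List<int[]> partition(int maxPartitions) {
        List<int[]> parts = new ArrayList<>();
        int chunk = Math.max(1, (totalRows + maxPartitions - 1) / maxPartitions);
        for (int lo = 0; lo < totalRows; lo += chunk) {
            parts.add(new int[] {lo, Math.min(lo + chunk, totalRows)});
        }
        return parts;
    }

    // (E) Read the records that belong to one partition.
    @Override
    public List<String> extract(int[] p) {
        List<String> rows = new ArrayList<>();
        for (int id = p[0]; id < p[1]; id++) {
            rows.add("row-" + id);
        }
        return rows;
    }
}

// Toy TO side: (L) collects everything it is asked to load.
final class ListSink implements Loader<String> {
    final List<String> written = new ArrayList<>();
    @Override
    public void load(List<String> records) { written.addAll(records); }
}

final class ConnectorDemo {
    // Engine code only sees the three roles, never the concrete classes.
    static <P, R> void transfer(Partitioner<P> p, Extractor<P, R> e, Loader<R> l, int n) {
        for (P part : p.partition(n)) {
            l.load(e.extract(part));
        }
    }
}
```

Because `transfer` depends only on the interfaces, swapping the source or the sink needs no change to the engine code, which is the point of a pluggable connector API.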
Sqoop Job
• Creating a Job
• Job Submission
• Job Execution
Let's talk about a
MySQL to Kafka
example
Create Job
• Create LINKs
• Populate FROM link Config and create FROM LINK
• Populate TO link Config and create TO LINK
Create MySQL link
Create Kafka link
Create Job
• Create JOB associating FROM and TO LINKs
• Populate the FROM and TO Job Config
• Populate Driver Config, such as parallelism for extract and load
numExtractors
numLoaders
Add MySQL From Config
Add Kafka To Config
Create Job API
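public static void createJob(String[] jobconfigs) {
  // Parse the command-line job configs
  CommandLine cArgs = parseArgs(createOptions(), jobconfigs);
  // FROM link: the JDBC (MySQL) source
  MLink fromLink = createFromLink("jdbc-connector", jobconfigs);
  // TO link: the Kafka target
  MLink toLink = createToLink("kafka-connector", jobconfigs);
  // Associate both links in a single Sqoop job
  MJob sqoopJob = createJob(fromLink, toLink, jobconfigs);
}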
Job Submit
• Sqoop uses the MR engine to transfer data between the FROM
and TO data sources
• A Hadoop Configuration object is used to pass the FROM/TO
and Driver Configs to the MR engine
• Submits the job via the MR client and tracks job status and
stats such as counters
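The second bullet above describes passing the FROM, TO, and Driver configs through a single configuration object. A minimal sketch of that idea follows, using a plain `Map` in place of the Hadoop `Configuration` object. The key prefixes (`from`, `to`, `driver`) and the class name `JobConfigFlattener` are assumptions made for illustration, not the keys Sqoop actually uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class JobConfigFlattener {
    // Copy every entry of one config section under a section prefix.
    static void put(Map<String, String> flat, String prefix, Map<String, String> section) {
        for (Map.Entry<String, String> e : section.entrySet()) {
            flat.put(prefix + "." + e.getKey(), e.getValue());
        }
    }

    // Merge the three sections into one flat map that can travel with the job.
    static Map<String, String> flatten(Map<String, String> from,
                                       Map<String, String> to,
                                       Map<String, String> driver) {
        Map<String, String> flat = new LinkedHashMap<>();
        put(flat, "from", from);
        put(flat, "to", to);
        put(flat, "driver", driver);
        return flat;
    }

    // On the task side, recover one section by stripping its prefix.
    static Map<String, String> section(Map<String, String> flat, String prefix) {
        Map<String, String> out = new LinkedHashMap<>();
        String p = prefix + ".";
        for (Map.Entry<String, String> e : flat.entrySet()) {
            if (e.getKey().startsWith(p)) {
                out.put(e.getKey().substring(p.length()), e.getValue());
            }
        }
        return out;
    }
}
```

Prefixing keeps the FROM and TO sections from colliding when both connectors use the same property name.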
Remember the Connector API!
• Partition (P), Extract (E), and Load (L) APIs; no (T) Transform yet
Job Execution
• InputFormat/Splits for Partitioning
• Invokes FROM Partition API
• Mappers for Extraction
• Invokes FROM Extract API
• Reducers for Loading
• Invokes TO Load API
• OutputFormat for Commits/Aborts
So What’s the Scoop?
It turns out…
• Sqoop 2 supports a pluggable execution engine
• Why not replace MR with Spark for parallelism?
• Why not extend the Connector APIs to support
simple (T) transformations along with (EL)?
Why Apache Spark ?
• Why not? Data pipelines can be expressed as Spark jobs
• Speed is a feature! Faster than MapReduce
• Growing community embracing Apache Spark
• Low effort: less than a few weeks to build a POC
• EL to ETL: nifty transformations can be easily added
Let's talk about the Sqoop on Spark
implementation!
Spark Sqoop Job
• Creating a Job
• Job Submission
• Job Execution
Create Sqoop Spark Job
• Create a SparkContext from the relevant configs
• Instantiate a SqoopSparkJob and invoke SqoopSparkJob.init(..),
which wraps both Sqoop and Spark initialization
• As before, create a Sqoop job with the createJob API
• Invoke SqoopSparkJob.execute(conf, context)
public class SqoopJDBCHDFSJobDriver {
  public static void main(String[] args) {
    final SqoopSparkJob sparkJob = new SqoopSparkJob();
    CommandLine cArgs = SqoopSparkJob.parseArgs(createOptions(), args);
    // Initialize Sqoop and Spark, then create the Spark context
    SparkConf conf = sparkJob.init(cArgs);
    JavaSparkContext context = new JavaSparkContext(conf);
    // Create the FROM (JDBC) and TO (HDFS) links
    MLink fromLink = getJDBCLink();
    MLink toLink = getHDFSLink();
    // Create the Sqoop job, as before, and hand it to the Spark job
    MJob sqoopJob = createJob(fromLink, toLink);
    sparkJob.setJob(sqoopJob);
    // Run the transfer on Spark
    sparkJob.execute(conf, context);
  }
}
Spark Job Submission
• We explored a few options!
• Invoke Spark in-process within the Sqoop server to
execute the job
• Use the Remote Spark Context, as used by Hive on Spark, to
submit
• Use the Sqoop job as a driver for the spark-submit command
Spark Job Submission
• Build an "uber.jar" with the driver and all the Sqoop
dependencies
• Submit the driver program to YARN, either programmatically
using the Spark YARN client or directly via the command line
• bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver
--master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/
--jdbcString jdbc://myhost:3306/test --u uber --p hadoop --outputDir
hdfs://path/to/output --numE 4 --numL 4
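The arguments that follow the jar path in the command above are passed to the driver's `main` method as `--name value` pairs. The driver on the slides parses them with `parseArgs(createOptions(), args)`; the sketch below is a simplified, hypothetical stand-in written in plain Java (the class `DriverArgs` and its methods are not part of the project), shown only to illustrate how options such as `numE` and `numL` could reach the job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DriverArgs {
    // Parse "--name value" pairs into a map; rejects a flag without a value.
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("--") || i + 1 >= args.length) {
                throw new IllegalArgumentException("expected --name value at position " + i);
            }
            opts.put(args[i].substring(2), args[i + 1]);
        }
        return opts;
    }

    // Read an integer option such as numE or numL, with a default if absent.
    static int intOpt(Map<String, String> opts, String name, int dflt) {
        String v = opts.get(name);
        return v == null ? dflt : Integer.parseInt(v);
    }
}
```

For example, `DriverArgs.parse(new String[] {"--numE", "4", "--numL", "4"})` yields a map from which `intOpt(opts, "numE", 1)` reads the extract parallelism.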
Spark Job Execution
• 3 main stages
• Obtain containers for parallel execution by simply
converting the job's partitions to an RDD
• The Partition API determines parallelism; a Map stage uses the
Extract API to read records
• Another Map stage uses the Load API to write records
Spark Job Execution
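SqoopSparkJob.execute(…){
  // Stage 1: turn the job's partitions into an RDD, one slice per partition
  List<Partition> sp = getPartitions(request, numMappers);
  JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());
  // Stage 2: map each partition through the FROM connector's Extract API
  JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
      partitionRDD.map(new SqoopExtractFunction(request));
  // Stage 3: map each extracted batch through the TO connector's Load API;
  // collect() is the action that triggers execution
  extractRDD.map(new SqoopLoadFunction(request)).collect();
}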
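The same partition, extract, load flow can be imitated without a cluster. The sketch below uses `java.util.stream` as a stand-in for the RDD operations, so it is not Spark code, and `extractRange` and the `List::size` loader are hypothetical placeholders for the connector's Extract and Load APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ThreeStageSketch {
    // partitions -> map(extract) -> map(load) -> collect, mirroring the flow above.
    static <P, R> List<Integer> execute(List<P> partitions,
                                        Function<P, List<R>> extract,
                                        Function<List<R>, Integer> load) {
        return partitions.parallelStream()
                .map(extract)
                .map(load)
                .collect(Collectors.toList());
    }

    // Hypothetical extract: a partition is an id range {lo, hi}, records are strings.
    static List<String> extractRange(int[] p) {
        List<String> rows = new ArrayList<>();
        for (int id = p[0]; id < p[1]; id++) {
            rows.add("row-" + id);
        }
        return rows;
    }

    // Example run: three partitions; returns the record count loaded per partition.
    static List<Integer> demo() {
        List<int[]> parts = Arrays.asList(new int[] {0, 3}, new int[] {3, 6}, new int[] {6, 7});
        Function<int[], List<String>> extract = ThreeStageSketch::extractRange;
        Function<List<String>, Integer> load = List::size; // stand-in for a real write
        return execute(parts, extract, load);
    }
}
```

As in the Spark version, nothing runs until the terminal operation (`collect`) is invoked, and the number of partitions in the input list bounds the parallelism of both map stages.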
Spark Job Execution
• We chose to have 2 map stages for a reason
• Load parallelism can differ from Extract
parallelism; for instance, we may need to restrict the
TO side based on the number of Kafka partitions on the topic
• We can repartition before we invoke the Load stage
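To make the repartitioning step concrete, here is a minimal plain-Java sketch that redistributes records from any number of extract partitions into exactly `numLoaders` buckets. The class `RepartitionSketch` and the round-robin strategy are assumptions for illustration; in Spark the equivalent step would presumably be a call such as `repartition(numLoaders)` on the extract RDD before mapping the Load function.

```java
import java.util.ArrayList;
import java.util.List;

final class RepartitionSketch {
    // Redistribute records from the extract partitions into numLoaders buckets,
    // assigning records round-robin so bucket sizes differ by at most one.
    // Assumes numLoaders > 0.
    static <R> List<List<R>> repartition(List<List<R>> extracted, int numLoaders) {
        List<List<R>> buckets = new ArrayList<>();
        for (int i = 0; i < numLoaders; i++) {
            buckets.add(new ArrayList<>());
        }
        int next = 0;
        for (List<R> part : extracted) {
            for (R record : part) {
                buckets.get(next).add(record);
                next = (next + 1) % numLoaders;
            }
        }
        return buckets;
    }
}
```

With three extract partitions and `numLoaders = 2`, the Load stage would run over two buckets, matching, say, a Kafka topic that has two partitions.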
Micro Benchmark: MySQL to HDFS
• [Chart] Table w/ 300K records, numExtractors = numLoaders
• [Chart] Table w/ 2.8M records, numExtractors = numLoaders
• Chart annotation: good partitioning!!
What was Easy?
• Reusing existing Connectors; NO changes to the Connector API
required.
• Built-in support for Standalone and Cluster mode for quick end-to-end
testing and faster iteration
• Scheduling Spark Sqoop jobs via Oozie
What was not Easy?
• No clean Spark job submit API that provides job statistics; we used the
YARN UI for job status and health.
• We had to convert a bunch of Sqoop core classes, such as the IDF
(internal representation for records transferred), to be serializable
• Managing Hadoop and Spark dependencies together, and the resulting
CNF (class not found) errors, caused some pain
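The serialization point above comes from the fact that Spark's default serializer relies on Java serialization, so objects shipped between stages must implement `java.io.Serializable`. The sketch below shows a quick in-memory round-trip check for that property. `ToyRecord` is a hypothetical stand-in, not the actual Sqoop IDF class, and `SerializationCheck` is an illustrative helper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical record holder; not the actual Sqoop IntermediateDataFormat.
final class ToyRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    final String csvText;
    ToyRecord(String csvText) { this.csvText = csvText; }
}

final class SerializationCheck {
    // Serialize and deserialize a value in memory; fails if any field is not serializable.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("value did not survive serialization", e);
        }
    }
}
```

A check like this in a unit test surfaces a non-serializable field early, before the failure shows up inside a running Spark job.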
Next Steps!
• Explore alternative ways for Spark Sqoop Job Submission
• Expose Spark job stats such as accumulators in the submission
history
• Proposed Connector Filter API (cleaning, data masking)
• We want to work with the Sqoop community to merge this back if it's
useful
• https://github.com/vybs/sqoop-on-spark
Questions!
• Apache Sqoop Project - sqoop.apache.org
• Apache Spark Project - spark.apache.org
• Thanks to the Folks @Cloudera and @Uber !!!
• You can reach us @vybs, @byte_array

A Call to Action for Generative AI in 2024Results
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

SQOOP on SPARK for Faster Data Ingestion

  • 1. SQOOP on SPARK for Data Ingestion Veena Basavaraj & Vinoth Chandar @Uber
  • 2. Works currently @Uber, focused on building a real-time pipeline for ingestion to Hadoop for batch and stream processing. Previously @LinkedIn as lead on Voldemort and @Oracle, focused on log-based replication, HPC and stream processing. • Works currently @Uber on streaming systems. Prior to this, worked @Cloudera on ingestion for Hadoop and @LinkedIn on front-end and service infra.
  • 3. Agenda • Data Ingestion Today • Introduction to Apache Sqoop 2 • Sqoop Jobs on Apache Spark • Insights & Next Steps
  • 5. Data Ingestion Tool • Primary need • Transferring data from SQL to HADOOP • SQOOP solved it well!
  • 7. Data Ingestion Pipeline • Ingestion pipeline can now have • Non SQL like data sources • Messaging Systems as data sources • Multi-stage pipeline
  • 8. Sqoop 2 • Generic Data Transfer Service • FROM - egress data out from ANY source • TO - ingress data into ANY source • Pluggable Data Sources • Server-Client Design
  • 11. Connector • Data Source properties represented via Configs • LINK config to connect to data source • JOB config to read/write data from the data source
  • 13. Connector API • Pluggable Connector API implemented by Connectors • Partition(P) API for parallelism • (E) Extract API to egress data • (L) Load API to ingress data • No (T) Transform yet !
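The Partition / Extract / Load split on this slide can be illustrated with a small, self-contained sketch. The interfaces and class names below (`Partitioner`, `Extractor`, `Loader`, `RangePartitioner`, and so on) are hypothetical simplifications invented for illustration, not the actual Sqoop 2 connector interfaces, and the "data source" is just generated strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the P / E / L roles of a connector.
// These are NOT the real Sqoop 2 interfaces.
public class ConnectorSketch {

    /** Half-open range [start, end) of row ids handled by one extractor. */
    static final class RangePartition {
        final int start;
        final int end;
        RangePartition(int start, int end) { this.start = start; this.end = end; }
    }

    interface Partitioner { List<RangePartition> partition(int totalRows, int maxPartitions); }
    interface Extractor   { List<String> extract(RangePartition p); }
    interface Loader      { void load(List<String> records); }

    /** (P) Splits the row ids into at most maxPartitions contiguous ranges. */
    static final class RangePartitioner implements Partitioner {
        @Override
        public List<RangePartition> partition(int totalRows, int maxPartitions) {
            // maxPartitions is assumed to be positive.
            List<RangePartition> parts = new ArrayList<>();
            int size = (totalRows + maxPartitions - 1) / maxPartitions; // ceiling division
            for (int s = 0; s < totalRows; s += size) {
                parts.add(new RangePartition(s, Math.min(s + size, totalRows)));
            }
            return parts;
        }
    }

    /** (E) Pretends to read rows from the FROM side; here it only generates strings. */
    static final class FakeRowExtractor implements Extractor {
        @Override
        public List<String> extract(RangePartition p) {
            List<String> rows = new ArrayList<>();
            for (int id = p.start; id < p.end; id++) {
                rows.add("row-" + id);
            }
            return rows;
        }
    }

    /** (L) Pretends to write rows to the TO side; here it appends to a list. */
    static final class InMemoryLoader implements Loader {
        final List<String> sink = new ArrayList<>();
        @Override
        public void load(List<String> records) { sink.addAll(records); }
    }

    /** Runs partition, then extract, then load, sequentially for clarity. */
    static List<String> transfer(int totalRows, int maxPartitions) {
        Partitioner partitioner = new RangePartitioner();
        Extractor extractor = new FakeRowExtractor();
        InMemoryLoader loader = new InMemoryLoader();
        for (RangePartition part : partitioner.partition(totalRows, maxPartitions)) {
            loader.load(extractor.extract(part));
        }
        return loader.sink;
    }

    public static void main(String[] args) {
        System.out.println(transfer(10, 4));
    }
}
```

An execution engine is free to run the extract and load calls for different partitions in parallel; the sequential loop here only shows how the three roles fit together.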
  • 14. Sqoop Job • Creating a Job • Job Submission • Job Execution
  • 15. Let's talk about a MYSQL to KAFKA example
  • 16. Create Job • Create LINKs • Populate FROM link Config and create FROM LINK • Populate TO link Config and create TO LINK
  • 17. Create Job • Create LINKs • Populate FROM link Config and create FROM LINK • Populate TO link Config and create TO LINK Create MySQL link Create Kafka link
  • 18. Create Job • Create JOB associating FROM and TO LINKS • Populate the FROM and TO Job Config • Populate Driver Config such as parallelism for extract and load numExtractors numLoaders
  • 19. Create Job • Create JOB associating FROM and TO LINKS • Populate the FROM and TO Job Config • Populate Driver Config such as parallelism for extract and load Add MySQL From Config Add kafka To Config numExtractors numLoaders
  • 20. Create Job API
      public static void createJob(String[] jobconfigs) {
          CommandLine cArgs = parseArgs(createOptions(), jobconfigs);
          MLink fromLink = createFromLink("jdbc-connector", jobconfigs);
          MLink toLink = createToLink("kafka-connector", jobconfigs);
          MJob sqoopJob = createJob(fromLink, toLink, jobconfigs);
      }
  • 21. Job Submit • Sqoop uses MR engine to transfer data between FROM and TO data sources • Hadoop Configuration Object is used to pass FROM/ TO and Driver Configs to the MR engine • Submits the Job via MR-client and tracks job status and stats such as counters
  • 22. Connector API • Pluggable Connector API implemented by Connectors • Partition(P) API for parallelism • (E) Extract API to egress data • (L) Load API to ingress data • No (T) Transform yet ! Remember!
  • 23. Job Execution • InputFormat/Splits for Partitioning • Invokes FROM Partition API • Mappers for Extraction • Invokes FROM Extract API • Reducers for Loading • Invokes TO Load API • OutputFormat for Commits/ Aborts
  • 24. So What’s the Scoop?
  • 26. It turns out… • Sqoop 2 supports pluggable Execution Engine • Why not replace MR with Spark for parallelism? • Why not extend the Connector APIs to support simple (T) transformations along with (EL) ?
  • 27. Why Apache Spark? • Why not? Data pipelines can be expressed as Spark jobs • Speed is a feature! Faster than MapReduce • Growing community embracing Apache Spark • Low effort: less than a few weeks to build a POC • EL to ETL -> nifty transformations can be easily added
  • 28. Let's talk about the SQOOP on SPARK implementation!
  • 29. Spark Sqoop Job • Creating a Job • Job Submission • Job Execution
  • 30. Create Sqoop Spark Job • Create a SparkContext from the relevant configs • Instantiate a SqoopSparkJob and invoke SqoopSparkJob.init(..) that wraps both Sqoop and Spark initialization • As before Create a Sqoop Job with createJob API • Invoke SqoopSparkJob.execute(conf, context)
  • 31. Create Sqoop Spark Job
      public class SqoopJDBCHDFSJobDriver {
          public static void main(String[] args) {
              final SqoopSparkJob sparkJob = new SqoopSparkJob();
              CommandLine cArgs = SqoopSparkJob.parseArgs(createOptions(), args);
              SparkConf conf = sparkJob.init(cArgs);
              JavaSparkContext context = new JavaSparkContext(conf);
              MLink fromLink = getJDBCLink();
              MLink toLink = getHDFSLink();
              MJob sqoopJob = createJob(fromLink, toLink);
              sparkJob.setJob(sqoopJob);
              sparkJob.execute(conf, context);
          }
      }
  • 32. Spark Job Submission • We explored a few options! • Invoke Spark in-process within the Sqoop Server to execute the job • Use the Remote Spark Context used by Hive on Spark to submit • Sqoop Job as a driver for the Spark submit command
  • 33. Spark Job Submission • Build an “uber.jar” with the driver and all the Sqoop dependencies • Submit the driver program to YARN, either programmatically using the Spark YARN client or directly via the command line •
      bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver --master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/ --jdbcString jdbc://myhost:3306/test --u uber --p hadoop --outputDir hdfs://path/to/output --numE 4 --numL 4
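For the programmatic route, one simple option is to assemble the same `spark-submit` command line in Java and hand it to a process launcher. The sketch below only builds and prints the argument list; the class name `SubmitCommandSketch`, the helper `buildCommand`, and the `/path/to/spark` location are assumptions made for illustration and are not part of the Sqoop on Spark code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles a spark-submit command line.
// It does not launch anything by itself.
public class SubmitCommandSketch {

    /** Builds: <sparkHome>/bin/spark-submit --class <driver> --master yarn <jar> <appArgs...> */
    static List<String> buildCommand(String sparkHome, String driverClass,
                                     String uberJar, String... appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sparkHome + "/bin/spark-submit");
        cmd.add("--class");
        cmd.add(driverClass);
        cmd.add("--master");
        cmd.add("yarn");
        cmd.add(uberJar);
        // Everything after the jar is passed through to the driver's main(String[]).
        cmd.addAll(Arrays.asList(appArgs));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(
                "/path/to/spark", // assumed install location
                "org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver",
                "/path/to/uber.jar",
                "--numE", "4", "--numL", "4");
        // Printing only. Passing the list to new ProcessBuilder(cmd) and calling
        // start() would actually launch spark-submit as a child process.
        System.out.println(String.join(" ", cmd));
    }
}
```

Keeping the command as a list of separate arguments, rather than one concatenated string, avoids quoting problems when paths or JDBC strings contain special characters.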
  • 34. Spark Job Execution • 3 main stages • Obtain containers for parallel execution by simply converting job’s partitions to an RDD • Partition API determines parallelism, Map stage uses Extract API to read records • Another Map stage uses Load API to write records
  • 35. Spark Job Execution
      SqoopSparkJob.execute(…) {
          List<Partition> sp = getPartitions(request, numMappers);
          JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());
          JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
              partitionRDD.map(new SqoopExtractFunction(request));
          extractRDD.map(new SqoopLoadFunction(request)).collect();
      }
  • 36. Spark Job Execution • We chose to have 2 map stages for a reason • Load parallelism can be different from Extract parallelism; for instance, we may need to restrict the TO side based on the number of Kafka partitions on the topic • We can repartition before we invoke the Load stage
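The idea of extracting with one degree of parallelism and loading with another can be imitated without a Spark cluster. The following is a minimal sketch in plain Java, where parallel streams stand in for RDD map stages and a round-robin regrouping stands in for the repartition step. All names (`TwoStageSketch`, `partitions`, `extract`, `repartition`, `load`) are hypothetical; in Spark itself the regrouping would more likely be a call such as `JavaRDD.repartition(numLoaders)` between the two map stages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Plain-Java imitation of: partitions -> extract (map) -> repartition -> load (map).
// Hypothetical names; this is not the Sqoop on Spark implementation.
public class TwoStageSketch {

    /** Stage 1: split row ids [0, totalRows) into at most numExtractors ranges. */
    static List<int[]> partitions(int totalRows, int numExtractors) {
        List<int[]> parts = new ArrayList<>();
        int size = (totalRows + numExtractors - 1) / numExtractors; // ceiling division
        for (int s = 0; s < totalRows; s += size) {
            parts.add(new int[] {s, Math.min(s + size, totalRows)});
        }
        return parts;
    }

    /** Stage 2: "extract" every range in parallel, one list of records per range. */
    static List<List<String>> extract(List<int[]> parts) {
        return parts.parallelStream()
                .map(r -> IntStream.range(r[0], r[1])
                        .mapToObj(id -> "row-" + id)
                        .collect(Collectors.toList()))
                .collect(Collectors.toList());
    }

    /** Regroup the extracted records into numLoaders buckets, round-robin. */
    static List<List<String>> repartition(List<List<String>> extracted, int numLoaders) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numLoaders; i++) {
            buckets.add(new ArrayList<>());
        }
        int next = 0;
        for (List<String> part : extracted) {
            for (String rec : part) {
                buckets.get(next).add(rec);
                next = (next + 1) % numLoaders;
            }
        }
        return buckets;
    }

    /** Stage 3: "load" every bucket in parallel; returns records written per loader. */
    static List<Integer> load(List<List<String>> buckets) {
        return buckets.parallelStream().map(List::size).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 4 extractors feeding 2 loaders, e.g. a topic with only 2 partitions.
        List<List<String>> extracted = extract(partitions(10, 4));
        System.out.println(load(repartition(extracted, 2)));
    }
}
```

Because extraction and loading are separate stages, the number of loaders can be chosen from the constraints of the TO side alone, independently of how the FROM side was partitioned.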
  • 37. Micro Benchmark: MySQL to HDFS • Table w/ 300K records, numExtractors = numLoaders
  • 38. Micro Benchmark: MySQL to HDFS • Table w/ 2.8M records, numExtractors = numLoaders • Good partitioning!!
  • 39. What was Easy? • Reusing existing Connectors; NO changes to the Connector API required • Inbuilt support for Standalone and Cluster mode for quick end-to-end testing and faster iteration • Scheduling Spark Sqoop jobs via Oozie
  • 40. What was not Easy? • No clean Spark job-submit API that provides job statistics; we used the YARN UI for job status and health • We had to convert a bunch of Sqoop core classes, such as IDF (the internal representation for records transferred), to be serializable • Managing Hadoop and Spark dependencies together, and the resulting ClassNotFound (CNF) errors, caused some pain
  • 41. Next Steps! • Explore alternative ways for Spark Sqoop Job Submission • Expose Spark job stats such as accumulators in the submission history • Proposed Connector Filter API (cleaning, data masking) • We want to work with the Sqoop community to merge this back if it's useful • https://github.com/vybs/sqoop-on-spark
  • 42. Questions! • Apache Sqoop Project - sqoop.apache.org • Apache Spark Project - spark.apache.org • Thanks to the Folks @Cloudera and @Uber !!! • You can reach us @vybs, @byte_array