Effective Sqoop 
Alex Silva 
Principal Software Engineer 
alex.silva@rackspace.com
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Formatting Arguments 
The default delimiters are: comma (,) for fields, newline (\n) for
records, no quote character, and no escape character.
Formatting Argument       What is it for?
enclosed-by               The field enclosing character.
escaped-by                The escape character.
fields-terminated-by      The field separator character.
lines-terminated-by       The end-of-line character.
mysql-delimiters          MySQL's default delimiters: fields (,), lines (\n),
                          escaped-by (\), optionally-enclosed-by (').
optionally-enclosed-by    The field enclosing character, applied only when needed.
Copyright 2014 Rackspace
ID  LABEL             STATUS
1   Critical, test.   ACTIVE
3   By "agent-nd01"   DISABLED

$ sqoop import …
1,Critical, test.,ACTIVE
3,By "agent-nd01",DISABLED

$ sqoop import
--fields-terminated-by ,
--escaped-by \
--enclosed-by '"' ...
"1","Critical, test.","ACTIVE"
"3","By \"agent-nd01\"","DISABLED"

$ sqoop import
--fields-terminated-by ,
--escaped-by \
--optionally-enclosed-by '"' ...
1,"Critical, test.",ACTIVE
3,"By \"agent-nd01\"",DISABLED

Sometimes the problem
doesn't show up until much later…
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Taming the Elephant 
• Sqoop delegates all processing to Hadoop: 
• Each mapper transfers a slice of the table. 
• The parameter --num-mappers (defaults to 4) 
tells Sqoop how many mappers to use to slice the 
data. 
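For illustration only (connection string, table, and credentials are hypothetical), the mapper count is set like this:
$ sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table orders \
  --num-mappers 8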
Copyright 2014 Rackspace
How Many Mappers? 
• The optimal number depends on a few variables: 
• The database type. 
• How does it handle parallelism internally? 
• The server hardware and infrastructure. 
• Overall impact to other requests. 
Copyright 2014 Rackspace
Gotchas! 
• More mappers can lead to faster jobs, but only up 
to a saturation point. This varies per table, job 
parameters, time of day and server availability. 
• Too many mappers will increase the load on the 
database: people will notice! 
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Connectors 
• Two types of connectors: common (JDBC) and
direct (vendor-specific batch tools).
Common Connectors 
MySQL 
PostgreSQL 
Oracle 
SQL Server 
DB2 
Generic 
Direct Connectors 
MySQL 
PostgreSQL 
Oracle 
Teradata 
And others 
Copyright 2014 Rackspace
Direct Connectors 
• Performance! 
• --direct parameter. 
• Utilities need to be available on all task nodes. 
• Escape characters, type mapping, column and row 
delimiters may not be supported. 
• Binary formats don’t work. 
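As a rough sketch (assuming a MySQL source with mysqldump installed on every task node; connection details are hypothetical):
$ sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table orders \
  --direct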
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Splitting Data 
• By default, the primary key is used. 
• Prior to starting the transfer, Sqoop will retrieve the 
min/max values for this column. 
• Change the split column with the --split-by parameter:
• Required for tables with no index column or with a
multi-column key.
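For example (column name hypothetical), the split column is overridden like this:
$ sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table orders \
  --split-by customer_id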
Copyright 2014 Rackspace
Boundary Queries 
What if your split-by column is skewed, the table is
not indexed, or the boundary values can be retrieved from
another table?
Use a boundary query to create the splits. 
select min(<split-by>), max(<split-by>) from <table name> 
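A hedged sketch (table and column are hypothetical) of supplying that query directly:
$ sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table orders \
  --split-by id \
  --boundary-query 'SELECT min(id), max(id) FROM orders'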
Copyright 2014 Rackspace
Splitting Free-Form Queries 
• By default, Sqoop will use the entire query as a subquery to 
calculate max/min: INEFFECTIVE! 
• Solution: use a --boundary-query. 
• Good choices: 
• Store boundary values in a separate table. 
• Good for incremental imports. (--last-value) 
• Run query prior to Sqoop and save its output in a 
temporary table. 
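One possible shape, assuming the min/max values were precomputed into a hypothetical one-row table beforehand:
$ sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --query 'SELECT o.id, o.total FROM orders o WHERE $CONDITIONS' \
  --split-by id \
  --boundary-query 'SELECT min_id, max_id FROM order_import_bounds' \
  --target-dir /data/orders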
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Options Files 
• Reuse arguments that do not change between invocations.
• Pass the file on the command line via the --options-file
argument.
• Composition: more than one option file is allowed. 
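A hedged example; the file name and its contents are illustrative (one option or value per line):
$ cat import-common.options
import
--connect
jdbc:mysql://db.example.com/sales
--username
sqoop
$ sqoop --options-file import-common.options --table orders -P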
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
File Formats 
• Text (default): 
• Non-binary data types. 
• Simple and human-readable. 
• Platform independent. 
• Binary (AVRO and sequence files): 
• Precise representation with efficient storage. 
• Good for text that contains separator characters. 
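For instance (paths hypothetical), switching an import to Avro is one flag:
$ sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table orders \
  --as-avrodatafile \
  --target-dir /data/orders_avro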
Copyright 2014 Rackspace
Environment 
• Mostly a combination of text and Avro files. 
• Why Avro? 
• Compact, splittable binary encoding. 
• Supports versioning and is language agnostic. 
• Also used as a container for smaller files. 
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Exports 
• Experiment with batching multiple insert statements 
together: 
• --batch parameter 
• sqoop.export.records.per.statement 
(100) property. 
• sqoop.export.statements.per.transaction 
(100) property. 
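A rough sketch combining the three knobs (values are starting points to experiment with, not recommendations; connection details hypothetical):
$ sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table order_summaries \
  --export-dir /data/order_summaries \
  --batch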
Copyright 2014 Rackspace
Batch Exports 
• The --batch parameter uses the JDBC batch API. 
(addBatch/executeBatch) 
• However… 
• Implementation can vary among drivers. 
• Some drivers actually perform worse in batch 
mode! (serialization and internal caches) 
Copyright 2014 Rackspace
Batch Exports 
• The sqoop.export.records.per.statement 
property will aggregate multiple rows inside one 
single insert statement. 
• However… 
• Not supported by all databases (though most support it). 
• Be aware that most databases have limits on the 
maximum query size. 
Copyright 2014 Rackspace
Batch Exports 
• The sqoop.export.statements.per.transaction 
property controls how many insert statements are 
issued per transaction. 
• However… 
• Exact behavior depends on database. 
• Be aware of table-level write locks. 
Copyright 2014 Rackspace
Which is better? 
• No silver bullet that applies to all use cases. 
• Start by enabling batch mode. 
• Find out the maximum query size for your 
database. 
• Set the number of rows per statement to roughly 
that value. 
• Go from there. 
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Staging Tables Are Our Friends 
• All data is written to the staging table first. 
• Data is copied to the final destination only if all tasks 
succeed: all-or-nothing semantics. 
• The structure must match exactly: columns and types. 
• The staging table must exist beforehand and must be empty 
(or pass the --clear-staging-table parameter). 
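An illustrative command (staging table name hypothetical); the staging table mirrors the target's schema:
$ sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table order_summaries \
  --staging-table order_summaries_stage \
  --clear-staging-table \
  --export-dir /data/order_summaries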
Copyright 2014 Rackspace
Ten Best Practices 
1. It pays to use formatting arguments. 
2. With the power of parallelism comes great responsibility! 
3. Use direct connectors for fast prototyping and performance. 
4. Use a boundary query for better performance. 
5. Do not use the same table for import and export. 
6. Use an options file for reusability. 
7. Use the proper file format for your needs. 
8. Prefer batch mode when exporting. 
9. Use a staging table. 
10.Aggregate data in Hive. 
Copyright 2014 Rackspace
Hive 
• --hive-import parameter. 
• BONUS: If the table doesn't exist, Sqoop will create it for 
you! 
• Override default type mappings with --map-column-hive. 
• Data is first loaded into HDFS and then moved into 
Hive. 
• The default behavior is to append (use --hive-overwrite to replace). 
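A minimal sketch (table name and column mapping are hypothetical):
$ sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table orders \
  --hive-import \
  --hive-overwrite \
  --map-column-hive id=STRING,total=DECIMAL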
Copyright 2014 Rackspace
Hive partitions 
• Two parameters: 
• --hive-partition-key 
• --hive-partition-value 
• Current Limitations: 
• One level of partitioning only. 
• The partition value has to be an actual value and not 
a column name. 
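For example (partition key and value are hypothetical), loading one day into a single partition:
$ sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop -P \
  --table orders \
  --hive-import \
  --hive-partition-key ingest_date \
  --hive-partition-value 2014-06-01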
Copyright 2014 Rackspace
Hive and AVRO 
• Currently not compatible!! 
• Workaround is to create an EXTERNAL table. 
CREATE EXTERNAL TABLE cs_atom_events 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 
LOCATION '/user/cloud-analytics/snapshot/atom_events/cloud-servers' 
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/cloud-analytics/avro/cs_cff_atom.avsc'); 
Copyright 2014 Rackspace
Data Pipeline 
Copyright 2014 Rackspace
Call to Action 
www.rackspace.com/cloud/big-data 
(On-Metal Free Trial) 
• Try it out! 
• Deploy a CBD Cluster, connect to your RDBMS. 
• Extract value from your data!
Thank you! 
Alex Silva 
alex.silva@rackspace.com 
Copyright 2014 Rackspace


Editor's Notes

  1. While importing to CSV file format is easy for testing, it can cause trouble down the road when the text stored in the database uses special characters. Importing to a binary format such as Avro will avoid these issues and can make further processing in Hadoop faster. But it doesn't eliminate the problem down the road.
  2. While increasing the number of mappers, there is a point at which you will fully saturate your database. Increasing the number of mappers beyond this point won’t lead to faster job completion; in fact, it will have the opposite effect as your database server spends more time doing context switching rather than serving data.
  3. The parameter --num-mappers is really a hint. There is no optimal number of mappers that works for all scenarios. Experiment! Start small and ramp up. Always talk to the database owner first!
  4. Sqoop's direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns. Use JDBC-based imports for these columns; do not supply the --direct argument to the import tool.
  5. mysqldump and mysqlimport will be used for retrieving data from the database server or moving data back. In the case of PostgreSQL, Sqoop will take advantage of the pg_dump utility to import data.
  6. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
  7. select min(col), max(col) from ($YOUR_QUERY). Free-form queries can be powerful; however: the query will be run concurrently by each mapper for the different slices of data: problematic! No metadata is available to Sqoop. --split-by is a required parameter, and the query needs a $CONDITIONS placeholder.
  8. When importing data to HDFS, it is important that you ensure access to a consistent snapshot of the source data. Map tasks reading from a database in parallel are running in separate processes. Thus, they cannot share a single database transaction. The best way to do this is to ensure that any processes that update existing rows of a table are disabled during the import.
  9. Each line identifies an option, in the same order it would appear on the command line. --password-file: set the password file permissions to 400 so no one else can open the file and fetch the password.
  10. Text files cannot hold binary fields (VARBINARY) and distinguishing between null values and String-based fields containing the value "null" can be problematic . Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive. Sequence files: value= NullWritable; key=generated class.
  11. If you’re working with records imported to SequenceFiles, it is inevitable that you’ll need to use the generated classes
  12. sqoop export \
      -Dsqoop.export.records.per.statement=50 \
      -Dsqoop.export.statements.per.transaction=50 \
      --connect jdbc:mysql://mysql.example.com/maas \
      --username maas --password maas \
      --query 'SELECT alarm.id, account.id FROM alarms JOIN accounts USING(account_id) WHERE $CONDITIONS' \
      --split-by id \
      --boundary-query "select min(id), max(id) from alarm_ids"
  13. --map-column-hive id=STRING,price=DECIMAL. --hive-drop-import-delims: drops \n, \r, and \01 from string fields when importing to Hive. --hive-delims-replacement: replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
  14. --hive-partition-key: the column name. Must be of type STRING.
  15. Not compatible with sequence files either.