A fast-paced, in-depth, "no frills" talk about how to use Sqoop effectively as part of your data flow and ingestion pipeline. We will cover topics such as delimiters in text files, Hadoop, MapReduce execution and map tasks with Sqoop, parallelism, boundary queries and splitting data, connectors, the different file formats available in Sqoop, batch exports, Hive, Hive exports and HiveQL.
2. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
3. Formatting Arguments
The default delimiters are: comma (,) for fields, newline (\n) for records, no quote character, and no escape character.
Formatting Argument        What is it for?
enclosed-by                The field-enclosing character, applied to every field.
escaped-by                 The escape character.
fields-terminated-by       The field separator character.
lines-terminated-by        The end-of-line character.
mysql-delimiters           Use MySQL's default delimiters: fields (,), lines (\n), escaped-by (\), optionally-enclosed-by (').
optionally-enclosed-by     The field-enclosing character, applied only to fields that contain delimiter characters.
4. ID  LABEL            STATUS
   1   Critical, test.  ACTIVE
   3   By "agent-nd01"  DISABLED
$ sqoop import …
1,Critical, test.,ACTIVE
3,By "agent-nd01",DISABLED
$ sqoop import
--fields-terminated-by ,
--escaped-by \\
--enclosed-by '"' ...
"1","Critical, test.","ACTIVE"
"3","By \"agent-nd01\"","DISABLED"
$ sqoop import
--fields-terminated-by ,
--escaped-by \\
--optionally-enclosed-by '"' ...
1,"Critical, test.",ACTIVE
3,"By \"agent-nd01\"",DISABLED
Sometimes the problem doesn't show up until much later…
5. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
6. Taming the Elephant
• Sqoop delegates all processing to Hadoop:
• Each mapper transfers a slice of the table.
• The parameter --num-mappers (defaults to 4)
tells Sqoop how many mappers to use to slice the
data.
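For illustration, a minimal sketch (the connection string, table name and password file path are hypothetical) that splits the import across 8 mappers:
$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table orders \
    --num-mappers 8    # default is 4; raise gradually and watch the database load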
7. How Many Mappers?
• The optimal number depends on a few variables:
• The database type.
• How does it handle parallelism internally?
• The server hardware and infrastructure.
• Overall impact to other requests.
8. Gotchas!
• More mappers can lead to faster jobs, but only up
to a saturation point. This varies per table, job
parameters, time of day and server availability.
• Too many mappers will increase the load on the
database: people will notice!
9. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
10. Connectors
• Two types of connectors: common (JDBC) and direct (vendor-specific batch tools).
• Common connectors: MySQL, PostgreSQL, Oracle, SQL Server, DB2 and Generic JDBC.
• Direct connectors: MySQL, PostgreSQL, Oracle, Teradata and others.
11. Direct Connectors
• Performance!
• --direct parameter.
• Utilities need to be available on all task nodes.
• Escape characters, type mapping, column and row
delimiters may not be supported.
• Binary formats don’t work.
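A sketch of enabling direct mode, assuming a hypothetical MySQL source and that the mysqldump utility is installed on every task node:
$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table orders \
    --direct    # delegates the transfer to mysqldump instead of JDBC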
12. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
13. Splitting Data
• By default, the primary key is used.
• Prior to starting the transfer, Sqoop will retrieve the
min/max values for this column.
• Change the split column with the --split-by parameter:
• Required for tables with no index column or with a multi-column key.
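A minimal sketch (table and column names are hypothetical) choosing the split column explicitly:
$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table order_events \
    --split-by event_id    # Sqoop computes min/max of this column to build the splits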
14. Boundary Queries
What if your split-by column is skewed, your table is not indexed, or the boundary values can be retrieved from another table?
Use a boundary query to create the splits.
select min(<split-by>), max(<split-by>) from <table name>
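A sketch overriding the default min/max query with boundaries kept in a separate, hypothetical table:
$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table orders \
    --split-by id \
    --boundary-query "SELECT min_id, max_id FROM order_boundaries"    # must return exactly two values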
15. Splitting Free-form Queries
• By default, Sqoop will use the entire query as a subquery to
calculate max/min: INEFFECTIVE!
• Solution: use a --boundary-query.
• Good choices:
• Store boundary values in a separate table.
• Good for incremental imports. (--last-value)
• Run query prior to Sqoop and save its output in a
temporary table.
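A sketch of a free-form import (connection string, query and table names are hypothetical); note the mandatory $CONDITIONS placeholder, --split-by and --target-dir:
$ sqoop import \
    --connect jdbc:mysql://db.example.com/maas \
    --username sqoop_user --password-file /user/sqoop/.password \
    --query 'SELECT a.id, a.label FROM alarms a WHERE $CONDITIONS' \
    --split-by a.id \
    --boundary-query "SELECT min_id, max_id FROM alarm_boundaries" \
    --target-dir /data/alarms    # boundaries come from a pre-computed table instead of the subquery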
16. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
17. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
18. Options Files
• Reuse arguments that do not change between invocations.
• Pass it on the command line via the --options-file argument.
• Composition: more than one option file is allowed.
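A minimal sketch (connection details are hypothetical): one option or value per line, with # comment lines allowed.
# import.opts
import
--connect
jdbc:mysql://db.example.com/sales
--username
sqoop_user
--password-file
/user/sqoop/.password

$ sqoop --options-file import.opts --table orders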
19. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
20. File Formats
• Text (default):
• Non-binary data types.
• Simple and human-readable.
• Platform independent.
• Binary (Avro and SequenceFiles):
• Precise representation with efficient storage.
• Good for text containing separators.
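A sketch importing the same hypothetical table as Avro instead of the default text format (use --as-sequencefile for SequenceFiles):
$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table orders \
    --as-avrodatafile    # binary encoding, so no delimiter or escaping issues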
21. Environment
• Mostly a combination of text and Avro files.
• Why Avro?
• Compact, splittable binary encoding.
• Supports versioning and is language agnostic.
• Also used as a container for smaller files.
22. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
24. Batch Exports
• The --batch parameter uses the JDBC batch API.
(addBatch/executeBatch)
• However…
• Implementation can vary among drivers.
• Some drivers actually perform worse in batch
mode! (serialization and internal caches)
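A sketch of a batch-mode export (table name and export directory are hypothetical):
$ sqoop export \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table order_totals \
    --export-dir /data/order_totals \
    --batch    # uses the JDBC addBatch/executeBatch API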
25. Batch Exports
• The sqoop.export.records.per.statement
property will aggregate multiple rows inside one
single insert statement.
• However…
• Not supported by all databases (though most support it).
• Be aware that most dbs have limits on the
maximum query size.
26. Batch Exports
• The sqoop.export.statements.per.transaction property controls how many insert statements are issued per transaction.
• However…
• Exact behavior depends on database.
• Be aware of table-level write locks.
27. Which is better?
• No silver bullet that applies to all use cases.
• Start by enabling batch mode.
• Find out the maximum query size for your database.
• Set the number of rows per statement to roughly that value.
• Go from there.
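A sketch combining the two properties per the advice above (the values, table and paths are hypothetical; size rows per statement against your database's maximum query size):
$ sqoop export \
    -Dsqoop.export.records.per.statement=100 \
    -Dsqoop.export.statements.per.transaction=100 \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table order_totals \
    --export-dir /data/order_totals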
28. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
29. Staging Tables Are Our Friends
• All data is written to staging table first.
• Data is copied to the final destination iff all tasks
succeed: all-or-nothing semantics.
• Structure must match exactly: columns and types.
• The staging table must already exist and must be empty (or pass the --clear-staging-table parameter).
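A sketch of an export through a staging table (the staging table name is hypothetical and its structure must match the target exactly):
$ sqoop export \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table order_totals \
    --staging-table order_totals_stage \
    --clear-staging-table \
    --export-dir /data/order_totals    # data reaches order_totals only if every task succeeds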
30. Ten Best Practices
1. It pays to use formatting arguments.
2. With the power of parallelism comes great responsibility!
3. Use direct connectors for fast prototyping and performance.
4. Use a boundary query for better performance.
5. Do not use the same table for import and export.
6. Use an options file for reusability.
7. Use the proper file format for your needs.
8. Prefer batch mode when exporting.
9. Use a staging table.
10. Aggregate data in Hive.
31. Hive
• --hive-import parameter.
• BONUS: if the table doesn't exist, Sqoop will create it for you!
• Override default type mappings with --map-column-hive.
• Data is first loaded into HDFS and then loaded into Hive.
• Default behavior is append (use --hive-overwrite to replace the data).
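A sketch of a Hive import with a type-mapping override (connection details, table and columns are hypothetical):
$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table orders \
    --hive-import \
    --hive-overwrite \
    --map-column-hive id=STRING,price=DECIMAL    # override the default Hive types for these columns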
32. Hive partitions
• Two parameters:
• --hive-partition-key
• --hive-partition-value
• Current Limitations:
• One level of partitioning only.
• The partition value has to be an actual value and not
a column name.
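A sketch loading into a single static Hive partition (key and value are hypothetical; the partition key column must be of type STRING):
$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user --password-file /user/sqoop/.password \
    --table orders \
    --hive-import \
    --hive-partition-key ingest_date \
    --hive-partition-value '2014-06-01'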
33. Hive and AVRO
• Currently not compatible!!
• Workaround is to create an EXTERNAL table.
CREATE EXTERNAL TABLE cs_atom_events
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/cloud-analytics/snapshot/atom_events/cloud-servers'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/cloud-analytics/avro/cs_cff_atom.avsc');
35. Call to Action
www.rackspace.com/cloud/big-data
(On-Metal Free Trial)
• Try it out!
• Deploy a CBD Cluster, connect to your RDBMS.
• Extract value from your data!
36. Thank you!
Alex Silva
alex.silva@rackspace.com
Copyright 2014 Rackspace
Editor's Notes
While importing to CSV file format is easy for testing, it can cause trouble down the road when the text stored in the database uses special characters. Importing to a binary format such as Avro will avoid these issues and can make further processing in Hadoop faster, but it doesn't eliminate the problem entirely.
While increasing the number of mappers, there is a point at which you will fully saturate your database. Increasing the number of mappers beyond this point won’t lead to faster job completion; in fact, it will have the opposite effect as your database server spends more time doing context switching rather than serving data.
The parameter --num-mappers is really a hint.
There is no optimal number of mappers that works for all scenarios.
Experiment! Start small and ramp up.
Always talk to the database owner first!
Sqoop's direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns. Use JDBC-based imports for these columns; do not supply the --direct argument to the import tool.
mysqldump and mysqlimport will be used for retrieving data from the database server or moving data back. In the case of PostgreSQL, Sqoop will take advantage of the pg_dump utility to import data.
If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
select min(col), max(col) from ($YOUR_QUERY).
Free-form queries can be powerful, however:
The query is run concurrently, once per slice of data: problematic!
No metadata is available to Sqoop.
--split-by is a required parameter and the query needs a $CONDITIONS placeholder.
When importing data to HDFS, it is important that you ensure access to a consistent snapshot of the source data. Map tasks reading from a database in parallel are running in separate processes. Thus, they cannot share a single database transaction. The best way to do this is to ensure that any processes that update existing rows of a table are disabled during the import.
Each line identifies an option, in the same order as it would appear on the command line.
--password-file: set the password file permissions to 400, so no one else can open the file and fetch the password.
Text files cannot hold binary fields (VARBINARY), and distinguishing between null values and String-based fields containing the value "null" can be problematic.
Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.
Sequence files: value= NullWritable; key=generated class.
If you're working with records imported to SequenceFiles, it is inevitable that you'll need to use the generated classes.
sqoop export \
-Dsqoop.export.records.per.statement=50 \
-Dsqoop.export.statements.per.transaction=50 \
--connect jdbc:mysql://mysql.example.com/maas \
--username maas --password maas \
--query 'SELECT alarm.id, account.id FROM alarms JOIN accounts USING(account_id) WHERE $CONDITIONS' \
--split-by id \
--boundary-query "select min(id), max(id) from alarm_ids"
--map-column-hive id=STRING,price=DECIMAL
--hive-drop-import-delims Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
--hive-partition-key: the column name. Must be of type STRING.