© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon S3 Best Practices and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
Senior Cloud Support Engineer, Amazon Web Services Japan
2019.03.14

Who I am...
Noritaka Sekiyama
Senior Cloud Support Engineer
- Engineer in AWS Support
- Specialty: Big Data (EMR, Glue, Athena, …)
- SME of AWS Glue
- Apache Spark lover ;)
@moomindani

About today’s session
Question
• Are you already using S3 with Hadoop/Spark?
• Will you start using Hadoop/Spark on S3 in the future?
• Are you just interested in using cloud storage with Hadoop/Spark?

Agenda
Relationship between Hadoop/Spark and S3
Difference between HDFS and S3, and use cases
Detailed behavior of S3 from the viewpoint of Hadoop/Spark
Well-known pitfalls and tunings
Service updates on AWS/S3 related to Hadoop/Spark
Recent activities in the Hadoop/Spark community related to S3

Relationship between Hadoop/Spark and S3

Data operations in Hadoop/Spark
Hadoop/Spark processes large data sets and writes output to a destination.
Input and output data can be located on various file systems such as HDFS.
Hadoop/Spark accesses these file systems via the Hadoop FileSystem API.
[Diagram: App → FileSystem API → FileSystem]

Hadoop/Spark and file systems
Hadoop FileSystem API
• Interface for operating on Hadoop file systems
⎼ open: Open an input stream
⎼ create: Create files
⎼ append: Append to files
⎼ getFileBlockLocations: Get block locations
⎼ rename: Rename files
⎼ mkdir: Create directories
⎼ listFiles: List files
⎼ delete: Delete files
• Various file systems such as HDFS can be used by plugging in different implementations of the FileSystem API
⎼ LocalFileSystem, S3AFileSystem, EmrFileSystem
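
As a quick illustration, the same shell commands work against any FileSystem implementation just by switching the URL scheme. A minimal sketch (the bucket and paths below are hypothetical):
$ hadoop fs -ls hdfs:///user/hadoop/input/         # HDFS (DistributedFileSystem)
$ hadoop fs -ls s3a://my-bucket/input/             # Amazon S3 via S3AFileSystem
$ hadoop fs -put local.csv s3a://my-bucket/input/  # upload through the same API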

HDFS: Hadoop Distributed File System
HDFS is
• A distributed file system that enables high-throughput data access
• Optimized for large files (100 MB+, GB, TB)
⎼ Not good for lots of small files
• Composed of a NameNode, which manages metadata, and DataNodes, which manage/store data blocks
• In Hadoop 3.x, many features (e.g. Erasure Coding, Router-based federation, Tiered storage, Ozone) are actively developed
How to access
• Hadoop FileSystem API
• $ hadoop fs ...
• HDFS Web UI

Amazon S3
Amazon S3 is
• An object storage service that achieves high scalability, availability, durability, security, and performance
• Pricing is mainly based on data size and requests
• Maximum size of a single object: 5 TB
• Objects have unique keys under a bucket
⎼ There are no directories in S3, although the S3 console emulates directories
⎼ S3 is not a file system
How to access
• REST API (GET, SELECT, PUT, POST, COPY, LIST, DELETE, CANCEL, ...)
• AWS CLI
• AWS SDK
• S3 Console
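
For reference, the same REST operations look like this through the AWS CLI (bucket and key names are hypothetical):
$ aws s3 ls s3://my-bucket/data/                                 # LIST objects under a prefix
$ aws s3 cp data.csv s3://my-bucket/data/                        # PUT an object
$ aws s3api head-object --bucket my-bucket --key data/data.csv   # HEAD object metadata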

S3 implementations of the Hadoop FileSystem API
Enable handling S3 like HDFS from Hadoop/Spark
History of S3 FileSystem API implementations
• S3: S3FileSystem
• S3N: NativeS3FileSystem
• S3A: S3AFileSystem
• EMRFS: EmrFileSystem
[Diagram: App on a cluster accessing both HDFS and S3]

S3: S3FileSystem
HADOOP-574: want FileSystem implementation for Amazon S3
Developed in 2006 to use S3 as a file system
• Object data on S3 = block data (≠ file data)
• Blocks are stored on S3 directly
• Data can be read/written only through S3FileSystem
URL prefix: s3://
Deprecated in 2016
https://issues.apache.org/jira/browse/HADOOP-574

S3N: NativeS3FileSystem
HADOOP-931: Make writes to S3FileSystem world visible only on completion
Developed in 2008 to solve issues in S3FileSystem
• Object data on S3 = file data (≠ block data)
• Empty directories are represented by an empty object “xyz_$folder$“
• Limited to files that do not exceed 5 GB
Uses JetS3t to access S3 (does not use the AWS SDK)
URL prefix: s3n://
https://issues.apache.org/jira/browse/HADOOP-931

S3A: S3AFileSystem
HADOOP-10400: Incorporate new S3A FileSystem implementation
Developed in 2014 to solve issues in NativeS3FileSystem
• Supports parallel copy and rename
• Compatible with the S3 console for empty directories (“xyz_$folder$“ -> ”xyz/”)
• Supports IAM role authentication
• Supports 5 GB+ files and multipart uploads
• Supports S3 server-side encryption
• Improved recovery from errors
Uses the AWS SDK for Java to access S3
URL prefix: s3a://
Amazon EMR does not support S3A officially
https://issues.apache.org/jira/browse/HADOOP-10400
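
A minimal S3A setup sketch for a Spark job on EC2 (the credentials provider choice, job file, and bucket below are assumptions; adjust to your environment):
$ spark-submit \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
    --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 \
    my_job.py s3a://my-bucket/input/ s3a://my-bucket/output/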

EMRFS: EmrFileSystem
FileSystem implementation of Amazon EMR (available only on EMR)
Developed by Amazon (optimized for the S3 specification)
• Supports IAM role authentication
• Supports 5 GB+ files and multipart uploads
• Supports both S3 server-side and client-side encryption
• Supports the EMRFS S3-optimized committer
• Supports pushdown with S3 SELECT
Uses the AWS SDK for Java to access S3
URL prefix: s3:// (or s3n://)
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html

How to choose between S3A and EMRFS
S3A
• On-premises
• Other clouds
• Hadoop/Spark on EC2
EMRFS
• Amazon EMR
[Diagram: apps on a generic cluster reaching S3 via S3A; apps on an EMR cluster reaching S3 via EMRFS]

Hadoop/Spark and AWS
Choices on AWS
• EMR: Covers most use cases
• Hadoop/Spark on EC2: Good for specific use cases
⎼ Multi-master (coming soon in EMR)
https://www.slideshare.net/AmazonWebServices/a-deep-dive-into-whats-new-with-amazon-emr-ant340r1-aws-reinvent-2018/64
⎼ Needs a combination of applications/versions that is not supported in EMR
⎼ Needs a specific distribution of Hadoop/Spark
• Glue: Can be used as serverless Spark

Difference between HDFS and S3, and use cases

Common features of HDFS and S3
Both are accessible via the Hadoop FileSystem API
The file system can be switched based on the URL prefix (“hdfs://”, “s3://”, “s3a://”)

Workloads and data where HDFS is better
Extremely high I/O performance
Frequent data access
Temporary data
High consistency
• When you cannot accept the S3 consistency model, even with EMRFS consistent view or S3Guard
Fixed cost for both storage and I/O is preferred
Use cases where data locality works well (network bandwidth between nodes < 1G)
Physical location of data needs to be controlled

Workloads and data where S3 is better (1/2)
Extremely high durability and availability
• Durability: 99.999999999%
• Availability: 99.99%
Cold data stored for the long term
https://aws.amazon.com/s3/storage-classes/
Lower cost per data size is preferred
• An external blog post reported less than 1/5 of the cost of HDFS
Data size is huge and incrementally increasing

Workloads and data where S3 is better (2/2)
You want to separate storage from the computing cluster
• Data on S3 remains after terminating clusters
Multiple clusters and applications share the same file system
• Multiple Hadoop/Spark clusters
• EMR, Glue, Athena, Redshift Spectrum, Hadoop/Spark on EC2, etc.
Centralized security is preferred (including components other than Hadoop)
• IAM, S3 bucket policy, VPC Endpoint, Glue Data Catalog, etc.
• Will be improved by AWS Lake Formation
Note: S3 cannot be used as the default file system (fs.defaultFS)
https://aws.amazon.com/premiumsupport/knowledge-center/configure-emr-s3-hadoop-storage/

Detailed behavior of S3 from the viewpoint of Hadoop/Spark

S3 (EMRFS/S3A): Read/Write
[Diagram: a cluster with a master node (NameNode, cluster driver, Spark driver) and slave nodes (DataNodes, cluster workers, Spark executors, local disks); the Spark client submits jobs, and executors read from and write to S3 through the S3 client and the S3 API endpoint]

S3 (EMRFS/S3A): Splits
A split is
• A chunk of the target data, generated so that Hadoop/Spark can process the data in parallel
• Data in a splittable format (e.g. bzip2) is split based on a pre-defined size
Well-known issues
• Increased overhead due to lots of splits from lots of small files
• Out of memory due to a large unsplittable file
Default split size
• HDFS: 128 MB (recommendation: HDFS block size = split size)
• S3 EMRFS: 64 MB (fs.s3.block.size)
• S3 S3A: 32 MB (fs.s3a.block.size)
Requests for unsplittable files
• S3 GetObject API, specifying the content range in the Range parameter
• Status code: 206 (Partial Content)
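
As an illustration, the S3A split size can be raised so fewer, larger splits are generated, and a single split corresponds to a ranged GET (the value, bucket, and key below are hypothetical):
$ spark-submit --conf spark.hadoop.fs.s3a.block.size=134217728 my_job.py   # 128 MB splits on S3A
$ aws s3api get-object --bucket my-bucket --key data/part-00000.csv \
    --range bytes=0-134217727 /tmp/split-0   # returns HTTP 206 (Partial Content)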

Well-known pitfalls and tunings - S3 consistency model

S3 consistency model
PUTs of new objects: read-after-write consistency
• A consistent result is retrieved when you get an object just after putting it.
HEAD/GET of non-existing objects: eventual consistency
• If you make a HEAD or GET request to a key name (to find out if the object exists) before creating the object, S3 provides eventual consistency for read-after-write.
PUTs/DELETEs of existing objects: eventual consistency
• A process replaces an existing object and immediately attempts to read it. Until the change is fully propagated, S3 might return the prior data.
• A process deletes an existing object and immediately attempts to read it. Until the deletion is fully propagated, S3 might return the deleted data.
LISTs of objects: eventual consistency
• A process writes a new object to S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
• A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, S3 might list the deleted object.
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

Impact on Hadoop/Spark of S3 consistency
Example: an ETL data pipeline with multiple steps
• Step 1: Transforming and converting input data
• Step 2: Statistical processing of the converted data
Expected issue
• Step 2 may get an object list that is missing some of the data written in Step 1
Workloads where impact is expected
• Immediate, incremental, consistent processing that consists of multiple steps

Mitigating the consistency impact in Hadoop/Spark
Write into HDFS, then write into S3
• Write data into HDFS as intermediate storage, then move the data from HDFS to S3.
• DistCp or S3DistCp can be used to move data from HDFS to S3.
• Cons: overhead of moving data, an additional intermediate process, and delay in reflecting the latest data
[Diagram: App writes input/output to HDFS in the cluster; data is backed up to and restored from S3]
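
On EMR, the copy step is typically a single command at the end of a job. A sketch (the paths are hypothetical):
$ s3-dist-cp --src hdfs:///output/step1/ --dest s3://my-bucket/output/step1/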

Mitigating the consistency impact in Hadoop/Spark
S3Guard (S3A), EMRFS consistent view (EMRFS)
• Mechanisms to check S3 consistency (especially LIST consistency)
• Use DynamoDB to manage S3 object metadata
• Provide the latest view by comparing the results returned from S3 and DynamoDB
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
[Diagram: App in the cluster does object PUT/GET against S3 and metadata PUT/GET against DynamoDB]
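
Both mechanisms are opt-in. A sketch of enabling each (the table name, bucket, and release label are hypothetical):
# S3Guard: create the DynamoDB metadata store for a bucket
$ hadoop s3guard init -meta dynamodb://my-s3guard-table s3a://my-bucket
# EMRFS consistent view: enable it at cluster creation
$ aws emr create-cluster --release-label emr-5.21.0 --emrfs Consistent=true ...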

Common troubles and workarounds
Updating data on S3 from outside the cluster
• There can be differences between the metadata in DynamoDB and the data on S3 if you write data without going through S3A or EMRFS
→ Limit basic operations to inside the cluster
→ Sync the metadata when updating S3 data from outside the cluster (see the sketch below)
EMRFS CLI: $ emrfs ...
S3Guard CLI: $ hadoop s3guard ...
[Diagram: a client PUTs objects directly to S3, bypassing the metadata PUT/GET that the in-cluster app performs against DynamoDB]
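
For example, the metadata for a prefix can be re-synced after an out-of-band write (the bucket and path are hypothetical):
$ emrfs sync s3://my-bucket/data/               # EMRFS consistent view
$ hadoop s3guard import s3a://my-bucket/data/   # S3Guard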

Common troubles and workarounds
DynamoDB I/O throttling
• Getting/updating metadata fails if there is not enough capacity in the DynamoDB table.
→ Provision enough capacity, or use on-demand mode instead
→ Retry I/O to mitigate the impact
S3A: fs.s3a.s3guard.ddb.max.retries, fs.s3a.s3guard.ddb.throttle.retry.interval, ...
→ Notify when there is an inconsistency
EMRFS: fs.s3.consistent.notification.CloudWatch, etc.
[Diagram: App in the cluster does object PUT/GET against S3 and metadata PUT/GET against DynamoDB]
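
A sketch of tuning the S3Guard retry behavior for a Spark job (the values and job file are hypothetical; the property names are the ones listed above):
$ spark-submit \
    --conf spark.hadoop.fs.s3a.s3guard.ddb.max.retries=12 \
    --conf spark.hadoop.fs.s3a.s3guard.ddb.throttle.retry.interval=200ms \
    my_job.py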

Well-known pitfalls and tunings - S3 multipart uploads

Hadoop/Spark and S3 multipart uploads
Multipart uploads are used when Hadoop/Spark uploads large data to S3
• Both S3A and EMRFS support S3 multipart uploads.
• The size threshold can be set via parameters
EMRFS: fs.s3n.multipart.uploads.split.size, etc.
S3A: fs.s3a.multipart.threshold, etc.
On EMR: multipart uploads are always used when the EMRFS S3-optimized committer is used
On OSS Hadoop/Spark: multipart uploads are always used when the S3A committer is used
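
For example, the S3A threshold could be lowered so that smaller files are also uploaded in parts (the value and job file are hypothetical):
$ spark-submit --conf spark.hadoop.fs.s3a.multipart.threshold=134217728 my_job.py   # 128 MB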

Steps in multipart uploads
Multipart upload initiation
• When you send a request to initiate a multipart upload, S3 returns a response with an upload ID, which is a unique identifier for your multipart upload.
Parts upload
• When uploading a part, you must specify a part number and the upload ID.
• Only after you either complete or abort a multipart upload will S3 free up the parts storage and stop charging you for it.
Multipart upload completion (or abort)
• When you complete a multipart upload, S3 creates an object by concatenating the parts in ascending order based on the part number.
• If any object metadata was provided in the initiate multipart upload request, S3 associates that metadata with the object.
• After a successful complete request, the parts no longer exist.
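
The same three steps are visible through the AWS CLI (the bucket, key, and file names are hypothetical; the upload ID comes from the first call):
$ aws s3api create-multipart-upload --bucket my-bucket --key big/file.bin    # returns an UploadId
$ aws s3api upload-part --bucket my-bucket --key big/file.bin \
    --part-number 1 --upload-id "<UploadId>" --body part1.bin                # repeat per part
$ aws s3api complete-multipart-upload --bucket my-bucket --key big/file.bin \
    --upload-id "<UploadId>" --multipart-upload file://parts.json            # parts.json lists part numbers and ETags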

Common troubles and workarounds
Remaining multipart uploads
• Parts might remain when jobs are aborted or clusters are terminated unexpectedly.
• The S3 console does not show remaining parts of uploads that were never completed
→ Delete remaining parts periodically with an S3 lifecycle rule (see the sketch below)
→ Configure the multipart-related parameters
EMRFS: fs.s3.multipart.clean.enabled, etc.
S3A: fs.s3a.multipart.purge, etc.
→ You can check whether there are remaining parts via the AWS CLI
$ aws s3api list-multipart-uploads --bucket bucket-name
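
A sketch of the lifecycle-rule workaround, aborting incomplete uploads after 7 days (the bucket name and rule ID are hypothetical):
$ aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
    --lifecycle-configuration '{"Rules": [{"ID": "abort-mpu", "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}}]}'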

Well-known pitfalls and tunings - S3 request performance

S3 performance and throttling
Request performance (per prefix)
• 3,500 PUT/POST/DELETE requests per second
• 5,500 GET requests per second
• ”HTTP 503 Slowdown” might be returned when these rates are exceeded
https://www.slideshare.net/AmazonWebServices/best-practices-for-amazon-s3-and-amazon-glacier-stg203r2-aws-reinvent-2018/50
Performance improves if a single prefix such as
s3://MyBucket/customers/dt=yyyy-mm-dd/0000001.csv
is split into multiple prefixes:
s3://MyBucket/customers/US/dt=yyyy-mm-dd/0000001.csv
s3://MyBucket/customers/CA/dt=yyyy-mm-dd/0000002.csv
When it is difficult to split prefixes for your use case
• e.g. queries over multiple prefixes (querying with ‘*’ instead of specifying ‘US’/’CA’)
→ Reach out to AWS Support to get proactive support

Tuning S3 requests
S3 connections
• Configure the number of connections to adjust the concurrency of S3 requests
EMRFS: fs.s3.maxConnections, etc.
S3A: fs.s3a.connection.maximum, etc.
S3 request retries
• Configure the retry behavior to handle request throttling
EMRFS: fs.s3.retryPeriodSeconds (EMR 5.14 or later), fs.s3.maxRetries (EMR 5.12 or later), etc.
S3A: fs.s3a.retry.throttle.limit, fs.s3a.retry.throttle.interval, etc.
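
A sketch for each flavor, using the properties listed above (the values and job file are hypothetical; pick the set that matches your FileSystem):
# S3A (OSS Hadoop/Spark)
$ spark-submit --conf spark.hadoop.fs.s3a.connection.maximum=200 \
    --conf spark.hadoop.fs.s3a.retry.throttle.limit=20 my_job.py
# EMRFS (Amazon EMR)
$ spark-submit --conf spark.hadoop.fs.s3.maxConnections=200 \
    --conf spark.hadoop.fs.s3.maxRetries=15 my_job.py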

Well-known pitfalls and tunings - Others

Hive write performance tuning
Common troubles
• Performance degrades when writing data into Hive tables on S3
⎼ Lack of parallelism in write I/O
⎼ Writing not only output data, but also intermediate data, to S3
Workarounds
• Parallelism
⎼ hive.mv.files.threads
• Intermediate data
⎼ hive.blobstore.use.blobstore.as.scratchdir = false
⎼ There is a reported example that achieves 10 times faster performance.
https://issues.apache.org/jira/browse/HIVE-14269
https://issues.apache.org/jira/browse/HIVE-14270
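
A sketch of applying both workarounds to a single Hive job (the thread count and script name are hypothetical):
$ hive --hiveconf hive.mv.files.threads=15 \
    --hiveconf hive.blobstore.use.blobstore.as.scratchdir=false \
    -f my_query.sql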

Tuning S3A Fast Upload
Common troubles
• Slow file upload via S3A
• Consuming too much disk space or memory when uploading data
Workarounds
• Tune the S3A Fast Upload related parameters
⎼ fs.s3a.fast.upload.buffer: disk, array, bytebuffer
⎼ fs.s3a.fast.upload.active.blocks
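
For example, buffering in off-heap memory with a bounded number of in-flight blocks (the values and job file are hypothetical):
$ spark-submit --conf spark.hadoop.fs.s3a.fast.upload.buffer=bytebuffer \
    --conf spark.hadoop.fs.s3a.fast.upload.active.blocks=4 my_job.py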

Service updates on AWS/S3 related to Hadoop/Spark

2018.7: S3 request performance improvement
Previous request performance
• 100 PUT/LIST/DELETE requests per second
• 300 GET requests per second
Current request performance (per prefix)
• 3,500 PUT/POST/DELETE requests per second
• 5,500 GET requests per second

2018.9: S3 SELECT supports Parquet format
S3 SELECT is
• A feature that enables querying only the required data from an object
• Queries are supported from the API and the S3 console
• Can retrieve up to 40 MB of records from a source file of up to 128 MB
Supported formats
• CSV
• JSON
• Parquet <- New!
https://aws.amazon.com/jp/about-aws/whats-new/2018/09/amazon-s3-announces-
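
A sketch of calling S3 SELECT directly through the AWS CLI (the bucket, key, and query are hypothetical):
$ aws s3api select-object-content \
    --bucket my-bucket --key data/customers.csv \
    --expression "SELECT s.name FROM S3Object s WHERE s.country = 'US'" \
    --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
    --output-serialization '{"CSV": {}}' \
    output.csv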

2018.10: EMRFS supports pushdown by S3 SELECT
EMRFS supports pushdown using S3 SELECT queries
• Expected outcome: performance improvement, faster data transfer
• Supported applications: Hive, Spark, Presto
• How to use: configure per application
Note: EMRFS does not automatically decide whether to use S3 SELECT based on the workload.
• Guidelines to determine whether your application is a candidate for S3 Select:
⎼ Your query filters out more than half of the original data set.
⎼ Your network connection between S3 and the EMR cluster has good transfer speed and available bandwidth.
⎼ Your query filter predicates use columns that have a data type supported by both S3 Select and the application (Hive/Spark/Presto).
→ Benchmark to verify whether S3 SELECT is better for your workloads
Hive: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-s3select.html
Spark: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html
Presto: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-s3select.html
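
As one per-application example, for Hive the EMR release guide enables S3 Select pushdown via a session property; a sketch (the script name is hypothetical):
$ hive --hiveconf s3select.filter=true -f my_query.sql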

2018.11: EMRFS S3-optimized committer
The EMRFS S3-optimized committer is
• An output committer introduced in EMR 5.19.0 (the default in 5.20.0 or later)
• Used when you use Spark SQL / DataFrames / Datasets to write Parquet files
• Based on S3 multipart uploads
Pros
• Improves performance by avoiding S3 LIST/RENAME during the job/task commit phase
• Improves the correctness of jobs with failed tasks by avoiding S3 consistency impact during the job/task commit phase
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/
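
On EMR 5.19.0, where the committer exists but is not yet the default, it can be switched on per job (the job file is hypothetical):
$ spark-submit \
    --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true \
    my_job.py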

2018.11: EMRFS S3-optimized committer
Difference between FileOutputCommitter and the EMRFS S3-optimized committer
• FileOutputCommitterV1: 2-phase RENAME
⎼ RENAME to commit individual task output
⎼ RENAME to commit the whole job output from completed/succeeded tasks
• FileOutputCommitterV2: 1-phase RENAME
⎼ RENAME to commit files to the final destination
⎼ Note: intermediate data would be visible before the job completes
(Both versions use RENAME operations to write data into an intermediate location.)
• EMRFS S3-optimized committer
⎼ Avoids RENAME by taking advantage of S3 multipart uploads
The reason to focus on RENAME
• HDFS RENAME: metadata-only operation. Fast.
• S3 RENAME: N times data copy and deletion. Slow.
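
For comparison, the classic FileOutputCommitter algorithm is also selectable per job (V2 below; the job file is hypothetical):
$ spark-submit \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
    my_job.py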

2018.11: EMRFS S3-optimized committer
Performance comparison
• EMR 5.19.0 (master m5d.2xlarge / core m5d.2xlarge x 8)
• Input data: 15 GB (100 Parquet files)
• Benchmark query:
INSERT OVERWRITE DIRECTORY ‘s3://${bucket}/perf-test/${trial_id}’
USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});
[Charts: write times with EMRFS consistent view disabled vs. enabled]

2018.11: DynamoDB On-Demand
Provisioned
• Configure provisioned capacity for read/write I/O
On-demand
• No need to configure capacity (auto-scales based on the workload)
• EMRFS consistent view and S3Guard can take advantage of this.
https://aws.amazon.com/jp/blogs/news/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/

Recent activities in the Hadoop/Spark community related to S3

JIRA – S3A
HADOOP-16132: Support multipart download in S3AFileSystem
• Improve download performance, referring to the AWS CLI implementation
https://issues.apache.org/jira/browse/HADOOP-16132
HADOOP-15364: Add support for S3 Select to S3A
• S3A supports S3 SELECT
https://issues.apache.org/jira/browse/HADOOP-15364

JIRA – S3Guard
HADOOP-15999: S3Guard: Better support for out-of-band operations
• Improve handling of files updated from outside S3Guard
https://issues.apache.org/jira/browse/HADOOP-15999
HADOOP-15837: DynamoDB table Update can fail S3A FS init
• Improve S3Guard initialization when DynamoDB Auto Scaling is enabled
https://issues.apache.org/jira/browse/HADOOP-15837
HADOOP-15619: Über-JIRA: S3Guard Phase IV: Hadoop 3.3 features
• Parent JIRA for S3Guard work in Hadoop 3.3 (Hadoop 3.0, 3.1, and 3.2 each have their own specific JIRA)
https://issues.apache.org/jira/browse/HADOOP-15619
HADOOP-15426: Make S3guard client resilient to DDB throttle events and network failures
• Improve S3Guard CLI behavior when there is throttling in DynamoDB
https://issues.apache.org/jira/browse/HADOOP-15426
HADOOP-15349: S3Guard DDB retryBackoff to be more informative on limits exceeded
• Improve S3Guard behavior when there is throttling in DynamoDB

JIRA – Others
HADOOP-15281: Distcp to add no-rename copy option
• DistCp adds a new option that avoids RENAME (mainly for S3)
https://issues.apache.org/jira/browse/HADOOP-15281
HIVE-20517: Creation of staging directory and Move operation is taking time in S3
• Change Hive behavior to write data into the final destination, avoiding RENAME operations
https://issues.apache.org/jira/browse/HIVE-20517

SPARK-21514: Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also
Issue
• Spark writes intermediate files and RENAMEs them when writing data
• Even intermediate data is written into S3, which causes slow performance
• Related to HIVE-14270
Approach
• Divide the locations (HDFS for intermediate files, S3 for the final destination)
• Expected outcome: performance improvement, S3 cost reduction
Current status
• My implementation is 2 times faster, but still in the testing phase
https://issues.apache.org/jira/browse/SPARK-21514

Conclusion

Conclusion
Relationship between Hadoop/Spark and S3
Difference between HDFS and S3, and use cases
Detailed behavior of S3 from the viewpoint of Hadoop/Spark
Well-known pitfalls and tunings
Service updates on AWS/S3 related to Hadoop/Spark
Recent activities in the Hadoop/Spark community related to S3

Appendix

Major use cases of Hadoop/Spark on AWS
Transient clusters
• Batch jobs
• One-time data conversion
• Machine learning
• ETL into other DWHs or data lakes
Persistent clusters
• Ad-hoc jobs
• Streaming processing
• Continuous data conversion
• Notebooks
• Experiments

Useful information for troubleshooting HDFS
Resource/daemon logs
• NameNode logs
• DataNode logs
• HDFS block reports
Request logs
• HDFS audit logs
Metrics
• Hadoop Metrics v2

Useful information for troubleshooting S3
Request logs
• S3 access logs
⎼ Logs are written when you configure the S3 bucket in advance.
• CloudTrail
⎼ Management events: records control-plane operations
⎼ Data events: records data-plane operations (needs to be configured)
Metrics
• CloudWatch S3 metrics
⎼ Storage metrics
– Two types of metrics: bucket size and number of objects
– Updated once a day
⎼ Request metrics (need to be configured)
– 16 types of metrics, including request counts (GET, PUT, HEAD, …) and 4XX/5XX errors
– Updated once a minute

More Related Content

What's hot

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 

What's hot (20)

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 

Similar to AWS S3 Best Practices and Tuning for Hadoop/Spark in the Cloud

Hybrid and Edge Architectures.pdf
Hybrid and Edge Architectures.pdfHybrid and Edge Architectures.pdf
Hybrid and Edge Architectures.pdfAmazon Web Services
 
Building Hybrid Cloud Storage Architectures with AWS
Building Hybrid Cloud Storage Architectures with AWSBuilding Hybrid Cloud Storage Architectures with AWS
Building Hybrid Cloud Storage Architectures with AWSAmazon Web Services
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
 
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...Amazon Web Services
 
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...Amazon Web Services
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...Amazon Web Services LATAM
 
From raw data to business insights. A modern data lake
From raw data to business insights. A modern data lakeFrom raw data to business insights. A modern data lake
From raw data to business insights. A modern data lakejavier ramirez
 
Building Hybrid Cloud Storage Architectures with AWS @scale
Building Hybrid Cloud Storage Architectures with AWS @scaleBuilding Hybrid Cloud Storage Architectures with AWS @scale
Building Hybrid Cloud Storage Architectures with AWS @scaleAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
AWS Data Transfer Services Deep Dive
AWS Data Transfer Services Deep Dive AWS Data Transfer Services Deep Dive
AWS Data Transfer Services Deep Dive Amazon Web Services
 
Best practices for migrating big data workloads to Amazon EMR - ADB204 - Chic...
Best practices for migrating big data workloads to Amazon EMR - ADB204 - Chic...Best practices for migrating big data workloads to Amazon EMR - ADB204 - Chic...
Best practices for migrating big data workloads to Amazon EMR - ADB204 - Chic...Amazon Web Services
 
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...Amazon Web Services
 
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...Amazon Web Services
 
Soluzioni per la migrazione e gestione dei dati in Amazon Web Services
Soluzioni per la migrazione e gestione dei dati in Amazon Web ServicesSoluzioni per la migrazione e gestione dei dati in Amazon Web Services
Soluzioni per la migrazione e gestione dei dati in Amazon Web ServicesAmazon Web Services
 
Migrating_Large_Scale_Data_Sets_to_the_Cloud
Migrating_Large_Scale_Data_Sets_to_the_CloudMigrating_Large_Scale_Data_Sets_to_the_Cloud
Migrating_Large_Scale_Data_Sets_to_the_CloudAmazon Web Services
 
How To Deploy Your File Workloads Quickly & Easily with AWS
How To Deploy Your File Workloads Quickly & Easily with AWSHow To Deploy Your File Workloads Quickly & Easily with AWS
How To Deploy Your File Workloads Quickly & Easily with AWSAmazon Web Services
 
AWS re:Invent 2018: Deep Dive: Hybrid Cloud Storage Arch. w/Storage Gateway, ...
AWS re:Invent 2018: Deep Dive: Hybrid Cloud Storage Arch. w/Storage Gateway, ...AWS re:Invent 2018: Deep Dive: Hybrid Cloud Storage Arch. w/Storage Gateway, ...
AWS re:Invent 2018: Deep Dive: Hybrid Cloud Storage Arch. w/Storage Gateway, ...Amazon Web Services
 
“Lift and shift” storage for business-critical applications - STG203 - New Yo...
“Lift and shift” storage for business-critical applications - STG203 - New Yo...“Lift and shift” storage for business-critical applications - STG203 - New Yo...
“Lift and shift” storage for business-critical applications - STG203 - New Yo...Amazon Web Services
 

Similar to AWS S3 Best Practices and Tuning for Hadoop/Spark in the Cloud (20)

Hybrid and Edge Architectures
Hybrid and Edge ArchitecturesHybrid and Edge Architectures
Hybrid and Edge Architectures
 
Hybrid and Edge Architectures.pdf
Hybrid and Edge Architectures.pdfHybrid and Edge Architectures.pdf
Hybrid and Edge Architectures.pdf
 
Building Hybrid Cloud Storage Architectures with AWS
Building Hybrid Cloud Storage Architectures with AWSBuilding Hybrid Cloud Storage Architectures with AWS
Building Hybrid Cloud Storage Architectures with AWS
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
 
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
 
From raw data to business insights. A modern data lake
From raw data to business insights. A modern data lakeFrom raw data to business insights. A modern data lake
From raw data to business insights. A modern data lake
 
Building Hybrid Cloud Storage Architectures with AWS @scale
Building Hybrid Cloud Storage Architectures with AWS @scaleBuilding Hybrid Cloud Storage Architectures with AWS @scale
Building Hybrid Cloud Storage Architectures with AWS @scale
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
AWS Data Transfer Services Deep Dive
AWS Data Transfer Services Deep Dive AWS Data Transfer Services Deep Dive
AWS Data Transfer Services Deep Dive
 
Best practices for migrating big data workloads to Amazon EMR - ADB204 - Chic...
Best practices for migrating big data workloads to Amazon EMR - ADB204 - Chic...Best practices for migrating big data workloads to Amazon EMR - ADB204 - Chic...
Best practices for migrating big data workloads to Amazon EMR - ADB204 - Chic...
 
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
 
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
One Data Lake, Many Uses: Enabling Multi-Tenant Analytics with Amazon EMR (AN...
 
Soluzioni per la migrazione e gestione dei dati in Amazon Web Services
Soluzioni per la migrazione e gestione dei dati in Amazon Web ServicesSoluzioni per la migrazione e gestione dei dati in Amazon Web Services
Soluzioni per la migrazione e gestione dei dati in Amazon Web Services
 
Migrating_Large_Scale_Data_Sets_to_the_Cloud
Migrating_Large_Scale_Data_Sets_to_the_CloudMigrating_Large_Scale_Data_Sets_to_the_Cloud
Migrating_Large_Scale_Data_Sets_to_the_Cloud
 
How To Deploy Your File Workloads Quickly & Easily with AWS
How To Deploy Your File Workloads Quickly & Easily with AWSHow To Deploy Your File Workloads Quickly & Easily with AWS
How To Deploy Your File Workloads Quickly & Easily with AWS
 
AWS re:Invent 2018: Deep Dive: Hybrid Cloud Storage Arch. w/Storage Gateway, ...
AWS re:Invent 2018: Deep Dive: Hybrid Cloud Storage Arch. w/Storage Gateway, ...AWS re:Invent 2018: Deep Dive: Hybrid Cloud Storage Arch. w/Storage Gateway, ...
AWS re:Invent 2018: Deep Dive: Hybrid Cloud Storage Arch. w/Storage Gateway, ...
 
“Lift and shift” storage for business-critical applications - STG203 - New Yo...
“Lift and shift” storage for business-critical applications - STG203 - New Yo...“Lift and shift” storage for business-critical applications - STG203 - New Yo...
“Lift and shift” storage for business-critical applications - STG203 - New Yo...
 

More from Noritaka Sekiyama

5分ではじめるApache Spark on AWS
5分ではじめるApache Spark on AWS5分ではじめるApache Spark on AWS
5分ではじめるApache Spark on AWSNoritaka Sekiyama
 
VPC Reachability Analyzer 使って人生が変わった話
VPC Reachability Analyzer 使って人生が変わった話VPC Reachability Analyzer 使って人生が変わった話
VPC Reachability Analyzer 使って人生が変わった話Noritaka Sekiyama
 
AWS で Presto を徹底的に使いこなすワザ
AWS で Presto を徹底的に使いこなすワザAWS で Presto を徹底的に使いこなすワザ
AWS で Presto を徹底的に使いこなすワザNoritaka Sekiyama
 
Sparkにプルリク投げてみた
Sparkにプルリク投げてみたSparkにプルリク投げてみた
Sparkにプルリク投げてみたNoritaka Sekiyama
 
Modernizing Big Data Workload Using Amazon EMR & AWS Glue
Modernizing Big Data Workload Using Amazon EMR & AWS GlueModernizing Big Data Workload Using Amazon EMR & AWS Glue
Modernizing Big Data Workload Using Amazon EMR & AWS GlueNoritaka Sekiyama
 
Effective Data Lakes - ユースケースとデザインパターン
Effective Data Lakes - ユースケースとデザインパターンEffective Data Lakes - ユースケースとデザインパターン
Effective Data Lakes - ユースケースとデザインパターンNoritaka Sekiyama
 
S3 整合性モデルと Hadoop/Spark の話
S3 整合性モデルと Hadoop/Spark の話S3 整合性モデルと Hadoop/Spark の話
S3 整合性モデルと Hadoop/Spark の話Noritaka Sekiyama
 
Introduction to New CloudWatch Agent
Introduction to New CloudWatch AgentIntroduction to New CloudWatch Agent
Introduction to New CloudWatch AgentNoritaka Sekiyama
 
Security Operations and Automation on AWS
Security Operations and Automation on AWSSecurity Operations and Automation on AWS
Security Operations and Automation on AWSNoritaka Sekiyama
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Noritaka Sekiyama
 
運用視点でのAWSサポート利用Tips
運用視点でのAWSサポート利用Tips運用視点でのAWSサポート利用Tips
運用視点でのAWSサポート利用TipsNoritaka Sekiyama
 
基礎から学ぶ? EC2マルチキャスト
基礎から学ぶ? EC2マルチキャスト基礎から学ぶ? EC2マルチキャスト
基礎から学ぶ? EC2マルチキャストNoritaka Sekiyama
 
Floodlightってぶっちゃけどうなの?
Floodlightってぶっちゃけどうなの?Floodlightってぶっちゃけどうなの?
Floodlightってぶっちゃけどうなの?Noritaka Sekiyama
 

More from Noritaka Sekiyama (14)

5分ではじめるApache Spark on AWS
5分ではじめるApache Spark on AWS5分ではじめるApache Spark on AWS
5分ではじめるApache Spark on AWS
 
VPC Reachability Analyzer 使って人生が変わった話
VPC Reachability Analyzer 使って人生が変わった話VPC Reachability Analyzer 使って人生が変わった話
VPC Reachability Analyzer 使って人生が変わった話
 
AWS で Presto を徹底的に使いこなすワザ
AWS で Presto を徹底的に使いこなすワザAWS で Presto を徹底的に使いこなすワザ
AWS で Presto を徹底的に使いこなすワザ
 
Sparkにプルリク投げてみた
Sparkにプルリク投げてみたSparkにプルリク投げてみた
Sparkにプルリク投げてみた
 
Modernizing Big Data Workload Using Amazon EMR & AWS Glue
Modernizing Big Data Workload Using Amazon EMR & AWS GlueModernizing Big Data Workload Using Amazon EMR & AWS Glue
Modernizing Big Data Workload Using Amazon EMR & AWS Glue
 
Running Apache Spark on AWS
Running Apache Spark on AWSRunning Apache Spark on AWS
Running Apache Spark on AWS
 
Effective Data Lakes - ユースケースとデザインパターン
Effective Data Lakes - ユースケースとデザインパターンEffective Data Lakes - ユースケースとデザインパターン
Effective Data Lakes - ユースケースとデザインパターン
 
S3 整合性モデルと Hadoop/Spark の話
S3 整合性モデルと Hadoop/Spark の話S3 整合性モデルと Hadoop/Spark の話
S3 整合性モデルと Hadoop/Spark の話
 
Introduction to New CloudWatch Agent
Introduction to New CloudWatch AgentIntroduction to New CloudWatch Agent
Introduction to New CloudWatch Agent
 
Security Operations and Automation on AWS
Security Operations and Automation on AWSSecurity Operations and Automation on AWS
Security Operations and Automation on AWS
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
 
運用視点でのAWSサポート利用Tips
運用視点でのAWSサポート利用Tips運用視点でのAWSサポート利用Tips
運用視点でのAWSサポート利用Tips
 
基礎から学ぶ? EC2マルチキャスト
基礎から学ぶ? EC2マルチキャスト基礎から学ぶ? EC2マルチキャスト
基礎から学ぶ? EC2マルチキャスト
 
Floodlightってぶっちゃけどうなの?
Floodlightってぶっちゃけどうなの?Floodlightってぶっちゃけどうなの?
Floodlightってぶっちゃけどうなの?
 

Recently uploaded

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
S3: S3FileSystem
HADOOP-574: want FileSystem implementation for Amazon S3
Developed in 2006 to use S3 as a file system
• An object on S3 = a block (≠ a file)
• Blocks are stored directly on S3
• Data can be read/written only through S3FileSystem
URL prefix: s3://
Deprecated in 2016
https://issues.apache.org/jira/browse/HADOOP-574
S3N: NativeS3FileSystem
HADOOP-931: Make writes to S3FileSystem world visible only on completion
Developed in 2008 to solve issues in S3FileSystem
• An object on S3 = a file (≠ a block)
• Empty directories are represented by an empty object "xyz_$folder$"
• Limited to files that do not exceed 5GB
Uses JetS3t to access S3 (not the AWS SDK)
URL prefix: s3n://
https://issues.apache.org/jira/browse/HADOOP-931
S3A: S3AFileSystem
HADOOP-10400: Incorporate new S3A FileSystem implementation
Developed in 2014 to solve issues in NativeS3FileSystem
• Supports parallel copy and rename
• Compatible with the S3 console for empty directories ("xyz_$folder$" -> "xyz/")
• Supports IAM role authentication
• Supports 5GB+ files and multipart uploads
• Supports S3 server-side encryption
• Improved recovery from errors
Uses the AWS SDK for Java to access S3
URL prefix: s3a://
Note: Amazon EMR does not officially support S3A
https://issues.apache.org/jira/browse/HADOOP-10400
EMRFS: EmrFileSystem
FileSystem implementation for Amazon EMR (available only on EMR)
Developed by Amazon and optimized for S3's characteristics
• Supports IAM role authentication
• Supports 5GB+ files and multipart uploads
• Supports both S3 server-side and client-side encryption
• Supports the EMRFS S3-optimized committer
• Supports pushdown with S3 Select
Uses the AWS SDK for Java to access S3
URL prefix: s3:// (or s3n://)
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
How to choose S3A or EMRFS
S3A
• On-premises
• Other clouds
• Hadoop/Spark on EC2
EMRFS
• Amazon EMR
Hadoop/Spark and AWS
Choice at AWS
• EMR: covers most use-cases
• Hadoop/Spark on EC2: good for specific use-cases
⎼ Multi-master (coming soon in EMR) https://www.slideshare.net/AmazonWebServices/a-deep-dive-into-whats-new-with-amazon-emr-ant340r1-aws-reinvent-2018/64
⎼ Needs a combination of applications/versions not supported in EMR
⎼ Needs a specific distribution of Hadoop/Spark
• Glue: can be used as serverless Spark
Difference between HDFS and S3, and use-case
Common features in both HDFS and S3
• Both are accessible via the Hadoop FileSystem API
• The target file system can be switched based on the URL prefix ("hdfs://", "s3://", "s3a://")
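Because only the URL prefix changes, the same shell commands work against either store. A minimal illustration (the bucket name my-bucket and the paths are placeholders):

$ hadoop fs -ls hdfs:///user/hadoop/input      # served by the HDFS NameNode/DataNodes
$ hadoop fs -ls s3a://my-bucket/input          # served by S3 via S3AFileSystem
$ hadoop fs -cp hdfs:///user/hadoop/input/part-00000 s3a://my-bucket/input/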
Workloads and data where HDFS is better
• Extremely high I/O performance
• Frequently accessed data
• Temporary data
• Strong consistency requirements
⎼ When the S3 consistency model cannot be accepted even with EMRFS consistent view or S3Guard
• A fixed cost for both storage and I/O is preferred
• Use-cases where data locality works well (network bandwidth between nodes < 1G)
• The physical location of data needs to be controlled
Workloads and data where S3 is better (1/2)
• Extremely high durability and availability
⎼ Durability: 99.999999999%
⎼ Availability: 99.99%
• Cold data stored for long-term use https://aws.amazon.com/s3/storage-classes/
• Lower cost per data size is preferred
⎼ An external blog post reports it at less than 1/5 the cost of HDFS
• Data size is huge and growing incrementally
Workloads and data where S3 is better (2/2)
• Storage should be separated from the computing cluster
⎼ Data on S3 remains after clusters are terminated
• Multiple clusters and applications share the same file system
⎼ Multiple Hadoop/Spark clusters
⎼ EMR, Glue, Athena, Redshift Spectrum, Hadoop/Spark on EC2, etc.
• Centralized security is preferred (including components other than Hadoop)
⎼ IAM, S3 bucket policy, VPC Endpoint, Glue Data Catalog, etc.
⎼ Will be improved by AWS Lake Formation
Note: S3 cannot be used as the default file system (fs.defaultFS)
https://aws.amazon.com/premiumsupport/knowledge-center/configure-emr-s3-hadoop-storage/
Detailed behavior of S3 from the viewpoint of Hadoop/Spark
Read / Write paths to S3 (EMRFS/S3A)
[Architecture diagram: the Spark driver on the master node and Spark executors on the slave nodes issue reads and writes through the S3 client (EMRFS/S3A) to the S3 API endpoint, while HDFS reads and writes go through the NameNode to the DataNodes and their local disks.]
S3 (EMRFS/S3A): Split
A split is
• A chunk of input data, produced by dividing the target data so that Hadoop/Spark can process it in parallel
• Data in a splittable format (e.g. bzip2) is split based on a pre-defined size
Well-known issues
• Increased overhead due to many splits from many small files
• Out of memory due to large unsplittable files
Default split size
• HDFS: 128MB (recommendation: HDFS block size = split size)
• S3 EMRFS: 64MB (fs.s3.block.size)
• S3 S3A: 32MB (fs.s3a.block.size)
How a split is read from S3
• An S3 GetObject API call specifying the byte range in the Range parameter
• Status code: 206 (Partial Content)
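The equivalent ranged GET can be reproduced with the AWS CLI; here the first 32MB (the S3A default split size) of an object is fetched and S3 answers with 206 Partial Content (bucket and key are placeholders):

$ aws s3api get-object --bucket my-bucket --key input/part-00000 \
    --range bytes=0-33554431 split-0.dat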
Well-known pitfalls and tunings - S3 consistency model
S3 consistency model
PUTs of new objects: read-after-write consistency
• A consistent result is returned when you get an object just after putting it.
HEAD/GET of non-existent objects: eventual consistency
• If you make a HEAD or GET request to the key name (to find out whether the object exists) before creating the object, S3 provides eventual consistency for read-after-write.
PUTs/DELETEs of existing objects: eventual consistency
• A process replaces an existing object and immediately attempts to read it. Until the change is fully propagated, S3 might return the prior data.
• A process deletes an existing object and immediately attempts to read it. Until the deletion is fully propagated, S3 might return the deleted data.
LIST of objects: eventual consistency
• A process writes a new object to S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
• A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, S3 might list the deleted object.
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
Impact of S3 consistency on Hadoop/Spark
Example: an ETL data pipeline with multiple steps
• Step 1: Transform and convert the input data
• Step 2: Statistical processing of the converted data
Expected issue
• Step 2 may get an object listing that is missing some of the data written in Step 1
Workloads where impact is expected
• Immediate, incremental, consistent processing that consists of multiple steps
Mitigating the consistency impact in Hadoop/Spark
Write into HDFS, then write into S3
• Write data into HDFS as intermediate storage, then move the data from HDFS to S3 (see the sketch below)
• DistCp or S3DistCp can be used to move data from HDFS to S3
• Cons: overhead of moving data, an extra intermediate process, and delay until the latest data is reflected
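A minimal sketch of this pattern on EMR using S3DistCp (the job JAR, paths, and bucket name are placeholders):

$ # 1. The job writes its output to HDFS first
$ hadoop jar my-job.jar --output hdfs:///tmp/job-output
$ # 2. The finished output is moved to S3 in a single pass
$ s3-dist-cp --src hdfs:///tmp/job-output --dest s3://my-bucket/job-output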
Mitigating the consistency impact in Hadoop/Spark
S3Guard (S3A), EMRFS consistent view (EMRFS)
• A mechanism to check S3 consistency (especially LIST consistency)
• Uses DynamoDB to manage S3 object metadata
• Provides the latest view by comparing the results returned from S3 and DynamoDB
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
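For S3A, S3Guard is switched on by pointing the metadata store at DynamoDB. A sketch using per-command properties (the table and bucket names are placeholders; the same keys can also live in core-site.xml):

$ hadoop fs \
    -D fs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore \
    -D fs.s3a.s3guard.ddb.table=my-s3guard-table \
    -D fs.s3a.s3guard.ddb.table.create=true \
    -ls s3a://my-bucket/data/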
Common troubles and workarounds
Updating data on S3 from outside the cluster
• Metadata in DynamoDB and data on S3 can diverge if data is written without going through S3A or EMRFS
→ Limit basic operations to inside the cluster
→ Sync metadata when updating S3 data from outside the cluster (see the example below)
EMRFS CLI: $ emrfs ...
S3Guard CLI: $ hadoop s3guard ...
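For example, re-syncing metadata after an out-of-band write might look like the following (the prefixes are placeholders):

$ # EMRFS consistent view: re-sync metadata for a prefix
$ emrfs sync s3://my-bucket/data/
$ # S3Guard: import the current S3 state into the metadata store
$ hadoop s3guard import s3a://my-bucket/data/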
Common troubles and workarounds
DynamoDB I/O throttling
• Getting/updating metadata fails if the DynamoDB table does not have enough capacity
→ Provision enough capacity, or use on-demand mode instead
→ Retry I/O to mitigate the impact
S3A: fs.s3a.s3guard.ddb.max.retries, fs.s3a.s3guard.ddb.throttle.retry.interval, ...
→ Notify when there is inconsistency
EMRFS: fs.s3.consistent.notification.CloudWatch, etc.
Well-known pitfalls and tunings - S3 multipart uploads
Hadoop/Spark and S3 multipart uploads
Multipart uploads are used when Hadoop/Spark uploads large data to S3
• Both S3A and EMRFS support S3 multipart uploads
• The size threshold can be set via parameters
EMRFS: fs.s3n.multipart.uploads.split.size, etc.
S3A: fs.s3a.multipart.threshold, etc.
• On EMR: multipart uploads are always used when the EMRFS S3-optimized committer is used
• On OSS Hadoop/Spark: multipart uploads are always used when the S3A committer is used
Steps in multipart uploads
Multipart upload initiation
• When you send a request to initiate a multipart upload, S3 returns a response with an upload ID, a unique identifier for your multipart upload.
Parts upload
• When uploading a part, you must specify a part number and the upload ID.
• Only after you either complete or abort a multipart upload will S3 free the parts storage and stop charging you for it.
Multipart upload completion (or abort)
• When you complete a multipart upload, S3 creates an object by concatenating the parts in ascending order of part number.
• If any object metadata was provided in the initiate request, S3 associates that metadata with the object.
• After a successful complete request, the parts no longer exist.
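The same three steps can be traced by hand with the AWS CLI (bucket, key, and file names are placeholders; the UploadId and ETag values come from the preceding responses):

$ aws s3api create-multipart-upload --bucket my-bucket --key big/output.bin
$ aws s3api upload-part --bucket my-bucket --key big/output.bin \
    --part-number 1 --upload-id "<UploadId>" --body part1.bin
$ aws s3api complete-multipart-upload --bucket my-bucket --key big/output.bin \
    --upload-id "<UploadId>" \
    --multipart-upload '{"Parts":[{"PartNumber":1,"ETag":"<ETag>"}]}'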
Common troubles and workarounds
Leftover multipart uploads
• Parts may remain when jobs are aborted or clusters are terminated unexpectedly
• The S3 console does not show leftover parts of uploads that were never completed
→ Delete leftover parts periodically with an S3 lifecycle rule (see the sketch below)
→ Configure the multipart-related parameters
EMRFS: fs.s3.multipart.clean.enabled, etc.
S3A: fs.s3a.multipart.purge, etc.
→ You can check whether leftover parts exist via the AWS CLI:
$ aws s3api list-multipart-uploads --bucket bucket-name
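A lifecycle rule that aborts incomplete multipart uploads might look like this sketch (the rule ID and the 7-day window are arbitrary choices, not recommendations):

$ aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "abort-stale-multipart-uploads",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
      }]
    }'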
Well-known pitfalls and tunings - S3 request performance
S3 performance and throttling
Request performance (per prefix)
• 3,500 PUT/POST/DELETE requests per second
• 5,500 GET requests per second
• "HTTP 503 Slowdown" may be returned when these rates are exceeded
https://www.slideshare.net/AmazonWebServices/best-practices-for-amazon-s3-and-amazon-glacier-stg203r2-aws-reinvent-2018/50
Performance improves if keys are spread across multiple prefixes, e.g. from
s3://MyBucket/customers/dt=yyyy-mm-dd/0000001.csv
to
s3://MyBucket/customers/US/dt=yyyy-mm-dd/0000001.csv
s3://MyBucket/customers/CA/dt=yyyy-mm-dd/0000002.csv
If it is difficult to split prefixes for your use-case
• e.g. queries over multiple prefixes (querying with '*' instead of specifying 'US'/'CA')
→ Reach out to AWS Support for proactive support
Tuning S3 requests
S3 connections
• Configure the number of connections to adjust the concurrency of S3 requests
EMRFS: fs.s3.maxConnections, etc.
S3A: fs.s3a.connection.maximum, etc.
S3 request retries
• Configure the request retry behavior to handle request throttling (see the sketch below)
EMRFS: fs.s3.retryPeriodSeconds (EMR 5.14 or later), fs.s3.maxRetries (EMR 5.12 or later), etc.
S3A: fs.s3a.retry.throttle.limit, fs.s3a.retry.throttle.interval, etc.
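As one way to apply these on S3A, a Spark job could raise the connection pool and retry settings at submit time (the values are illustrative, not recommendations):

$ spark-submit \
    --conf spark.hadoop.fs.s3a.connection.maximum=200 \
    --conf spark.hadoop.fs.s3a.retry.throttle.limit=20 \
    --conf spark.hadoop.fs.s3a.retry.throttle.interval=1000ms \
    my_job.py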
Well-known pitfalls and tunings - Others
Hive write performance tuning
Common troubles
• Performance degrades when writing data into Hive tables on S3
⎼ Lack of parallelism in write I/O
⎼ Not only the output data but also intermediate data is written to S3
Workarounds (see the sketch below)
• Parallelism
⎼ hive.mv.files.threads
• Intermediate data
⎼ hive.blobstore.use.blobstore.as.scratchdir = false
⎼ One reported example achieves 10x faster performance
https://issues.apache.org/jira/browse/HIVE-14269
https://issues.apache.org/jira/browse/HIVE-14270
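Applied in a session, the two workarounds might look like this sketch (the table names are placeholders; 15 is only an illustrative thread count):

$ hive -e "
    SET hive.mv.files.threads=15;
    SET hive.blobstore.use.blobstore.as.scratchdir=false;
    INSERT OVERWRITE TABLE my_s3_table SELECT * FROM staging_table;"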
Tuning S3A Fast Upload
Common troubles
• Slow file uploads via S3A
• Too much disk space or memory consumed while uploading data
Workarounds
• Tune the S3A Fast Upload parameters
⎼ fs.s3a.fast.upload.buffer: disk, array, or bytebuffer
⎼ fs.s3a.fast.upload.active.blocks
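For example, buffering blocks on local disk with a bounded number of in-flight blocks could be set per job (illustrative values; on older Hadoop versions fs.s3a.fast.upload=true must also be enabled):

$ spark-submit \
    --conf spark.hadoop.fs.s3a.fast.upload.buffer=disk \
    --conf spark.hadoop.fs.s3a.fast.upload.active.blocks=4 \
    my_job.py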
Service updates on AWS/S3 related to Hadoop/Spark
2018.7: S3 request performance improvement
Previous request performance
• 100 PUT/LIST/DELETE requests per second
• 300 GET requests per second
Current request performance (per prefix)
• 3,500 PUT/POST/DELETE requests per second
• 5,500 GET requests per second
2018.9: S3 Select supports Parquet format
S3 Select is
• A feature to query only the required data from an object
• Queries are supported from the API and the S3 console
• Can retrieve up to a 40MB record from a source file of up to 128MB
Supported formats
• CSV
• JSON
• Parquet <- New!
https://aws.amazon.com/jp/about-aws/whats-new/2018/09/amazon-s3-announces-
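Queried directly from the CLI, an S3 Select call against a CSV object might look like this sketch (bucket, key, and the query itself are placeholders):

$ aws s3api select-object-content \
    --bucket my-bucket --key data/customers.csv \
    --expression "SELECT s.name FROM S3Object s WHERE s.country = 'US'" \
    --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
    --output-serialization '{"CSV": {}}' \
    output.csv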
2018.10: EMRFS supports pushdown with S3 Select
EMRFS supports pushdown by using S3 Select queries
• Expected outcome: performance improvement, faster data transfer
• Supported applications: Hive, Spark, Presto
• How to use: configure per application (see the example below)
Note: EMRFS does not automatically decide whether to use S3 Select based on the workload.
• Guidelines to determine whether your application is a candidate for S3 Select:
⎼ Your query filters out more than half of the original data set.
⎼ Your network connection between S3 and the EMR cluster has good transfer speed and available bandwidth.
⎼ Your query filter predicates use columns with data types supported by both S3 Select and the application (Hive/Spark/Presto).
→ Benchmark to make sure S3 Select is actually better for your workload
Hive: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-s3select.html
Spark: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html
Presto: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-s3select.html
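As one example, the EMR release guide describes enabling pushdown for Hive via a session property; a sketch (the table and predicate are placeholders):

$ hive -e "
    SET s3select.filter=true;
    SELECT COUNT(*) FROM my_csv_table WHERE country = 'US';"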
2018.11: EMRFS S3-optimized committer
The EMRFS S3-optimized committer is
• An output committer introduced in EMR 5.19.0 or later (default in 5.20.0 or later)
• Used when Spark SQL / DataFrames / Datasets write Parquet files
• Based on S3 multipart uploads
Pros
• Improves performance by avoiding S3 LIST/RENAME during the job/task commit phase
• Improves the correctness of jobs with failed tasks by avoiding the S3 consistency impact during the job/task commit phase
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/
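On EMR 5.19.0, where the committer is not yet the default, it can be switched on per job with the property documented in the release guide (from EMR 5.20.0 it is enabled by default):

$ spark-submit \
    --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true \
    my_job.py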
2018.11: EMRFS S3-optimized committer
Difference between FileOutputCommitter and the EMRFS S3-optimized committer
• FileOutputCommitter v1: 2-phase RENAME
⎼ RENAME to commit each individual task's output
⎼ RENAME to commit the whole job's output from completed/succeeded tasks
• FileOutputCommitter v2: 1-phase RENAME
⎼ RENAME to commit files to the final destination
⎼ Note: intermediate data becomes visible before the job completes
(Both versions use RENAME operations to write data into an intermediate location.)
• EMRFS S3-optimized committer
⎼ Avoids RENAME by taking advantage of S3 multipart uploads
Why focus on RENAME?
• HDFS RENAME: a metadata-only operation. Fast.
• S3 RENAME: N data copies and deletions. Slow.
2018.11: EMRFS S3-optimized committer
Performance comparison
• EMR 5.19.0 (master: m5d.2xlarge / core: m5d.2xlarge x 8 nodes)
• Input data: 15 GB (100 Parquet files)
• Measured with EMRFS consistent view disabled and enabled
INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}'
USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});
2018.11: DynamoDB On-Demand
Provisioned
• Configure provisioned capacity for read/write I/O
On-demand
• No need to configure capacity (auto-scales based on the workload)
• EMRFS consistent view and S3Guard can take advantage of this
https://aws.amazon.com/jp/blogs/news/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/
Recent activities in Hadoop/Spark community related to S3
JIRA – S3A
HADOOP-16132: Support multipart download in S3AFileSystem
• Improve download performance by following the AWS CLI implementation
https://issues.apache.org/jira/browse/HADOOP-16132
HADOOP-15364: Add support for S3 Select to S3A
• S3A supports S3 Select
https://issues.apache.org/jira/browse/HADOOP-15364
JIRA – S3Guard
HADOOP-15999: S3Guard: Better support for out-of-band operations
• Improve handling of files updated from outside S3Guard
https://issues.apache.org/jira/browse/HADOOP-15999
HADOOP-15837: DynamoDB table update can fail S3A FS init
• Improve S3Guard initialization when DynamoDB Auto Scaling is enabled
https://issues.apache.org/jira/browse/HADOOP-15837
HADOOP-15619: Über-JIRA: S3Guard Phase IV: Hadoop 3.3 features
• Parent JIRA for Hadoop 3.3 S3Guard work (Hadoop 3.0, 3.1, and 3.2 each have their own JIRA)
https://issues.apache.org/jira/browse/HADOOP-15619
HADOOP-15426: Make S3Guard client resilient to DDB throttle events and network failures
• Improve S3Guard CLI behavior under DynamoDB throttling
https://issues.apache.org/jira/browse/HADOOP-15426
HADOOP-15349: S3Guard DDB retryBackoff to be more informative on limits exceeded
• Improve S3Guard behavior under DynamoDB throttling
JIRA – Others
HADOOP-15281: DistCp to add no-rename copy option
• DistCp adds a new option that avoids RENAME (mainly for S3)
https://issues.apache.org/jira/browse/HADOOP-15281
HIVE-20517: Creation of staging directory and Move operation is taking time in S3
• Change Hive to write data into the final destination to avoid RENAME operations
https://issues.apache.org/jira/browse/HIVE-20517
SPARK-21514: Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also
Issue
• Spark writes intermediate files and RENAMEs them when writing data
• Even intermediate data is written to S3, which causes slow performance
• Related to HIVE-14270
Approach
• Divide locations (HDFS for intermediate files, S3 for the final destination)
• Expected outcome: performance improvement, S3 cost reduction
Current status
• My implementation is 2x faster, but still in the testing phase
https://issues.apache.org/jira/browse/SPARK-21514
Conclusion
Conclusion
Relationship between Hadoop/Spark and S3
Difference between HDFS and S3, and use-case
Detailed behavior of S3 from the viewpoint of Hadoop/Spark
Well-known pitfalls and tunings
Service updates on AWS/S3 related to Hadoop/Spark
Recent activities in Hadoop/Spark community related to S3
Appendix
Major use-cases of Hadoop/Spark on AWS
Transient clusters
• Batch jobs
• One-time data conversion
• Machine learning
• ETL into another DWH or data lake
Persistent clusters
• Ad-hoc jobs
• Streaming processing
• Continuous data conversion
• Notebooks
• Experiments
Useful information for troubleshooting HDFS
Resource/daemon logs
• NameNode logs
• DataNode logs
• HDFS block reports
Request logs
• HDFS audit logs
Metrics
• Hadoop Metrics v2
Useful information for troubleshooting S3
Request logs
• S3 access logs
⎼ Written once logging is configured on the S3 bucket in advance
• CloudTrail
⎼ Management events: record control-plane operations
⎼ Data events: record data-plane operations (must be configured)
Metrics
• CloudWatch S3 metrics
⎼ Storage metrics
– Two metrics: bucket size and the number of objects
– Updated once a day
⎼ Request metrics (must be configured)
– 16 metrics including request counts (GET, PUT, HEAD, ...) and 4XX/5XX errors
– Updated once a minute
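For reference, the opt-in request metrics can be enabled for a whole bucket with a single call (the bucket name and configuration ID are placeholders):

$ aws s3api put-bucket-metrics-configuration --bucket my-bucket \
    --id EntireBucket \
    --metrics-configuration '{"Id": "EntireBucket"}'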