Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch four years ago, our customers have run more than 5.5 million Hadoop clusters. In this talk, we introduce Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long-lived and transient clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically, introduce ways you can fine-tune your cluster, and share best practices for keeping your Amazon EMR cluster cost efficient.
10. Amazon EMR Introduction
• Launch clusters of any size in a matter of minutes
• Use a variety of instance sizes to match your workload
11. Amazon EMR Introduction
• Don’t get stuck with hardware
• Don’t deal with capacity planning
• Run multiple clusters with different sizes, specs, and node types
16. Pattern #1: Transient Clusters
• Cluster lives for the duration of the job
• Shut down the cluster when the job is done
• Data persists on Amazon S3
• Input & output data on Amazon S3
17. Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
   • Cluster goes away when the job is done
3. Practice cloud architecture
   • Pay for what you use
   • Data processing as a workflow
18. When to use Transient cluster?
If (Data Load Time + Processing Time) * Number of Jobs < 24 hours:
   Use transient clusters
Else:
   Use alive clusters
19. When to use Transient cluster?
(20 min data load + 1 hour processing time) * 10 jobs = ~13 hours < 24 hours → use transient clusters
20. Alive Clusters
• Very similar to traditional Hadoop deployments
• Cluster stays around after the job is done
• Data persistence models:
   – Amazon S3
   – Amazon S3 copied to HDFS
   – HDFS, with Amazon S3 as backup
21. Alive Clusters
• Always keep data safe on Amazon S3, even if you’re using HDFS for primary storage
• Get in the habit of shutting down your cluster and starting a new one, once a week or month
   – Design your data processing workflow to account for failure
• You can use workflow management tools such as AWS Data Pipeline
22. Benefits of Alive Clusters
• Ability to share data between multiple jobs
[Diagram: transient clusters exchange data through Amazon S3; a long-running cluster shares data through its HDFS]
23. Benefit of Alive Clusters
• Cost effective for repetitive jobs
[Diagram: the same long-running cluster is reused for jobs arriving throughout the day]
24. When to use Alive cluster?
If (Data Load Time + Processing Time) * Number of Jobs > 24 hours:
   Use alive clusters
Else:
   Use transient clusters
25. When to use Alive cluster?
(20 min data load + 1 hour processing time) * 20 jobs = ~27 hours > 24 hours → use alive clusters
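The transient-vs-alive decision rule on slides 18 and 24, with the worked examples from slides 19 and 25, can be sketched as a small helper; the function name and hour-based units are my own:

```python
def choose_cluster_type(load_hours, processing_hours, jobs_per_day):
    """Heuristic from the slides: if total daily cluster utilization
    stays under 24 hours, a transient cluster is cheaper; otherwise
    keep an alive (long-running) cluster."""
    total_hours = (load_hours + processing_hours) * jobs_per_day
    return "transient" if total_hours < 24 else "alive"

# Slide 19: (20 min load + 1 hr processing) * 10 jobs = ~13 hours
print(choose_cluster_type(20 / 60, 1.0, 10))  # transient
# Slide 25: the same job profile at 20 jobs = ~27 hours
print(choose_cluster_type(20 / 60, 1.0, 20))  # alive
```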
27. Core Nodes
[Diagram: Amazon EMR cluster with a master instance group and a core instance group]
• Core nodes run TaskTrackers (compute)
• Core nodes run DataNodes (HDFS)
28. Core Nodes
[Diagram: Amazon EMR cluster with master and core instance groups]
• You can add core nodes to a running cluster
29. Core Nodes
[Diagram: core instance group growing]
• Adding core nodes gives you more HDFS space and more CPU/memory
30. Core Nodes
[Diagram: Amazon EMR cluster with master and core instance groups]
• You can’t remove core nodes, because they hold HDFS data
31. Amazon EMR Task Nodes
[Diagram: Amazon EMR cluster with master, core, and task instance groups]
• Task nodes run TaskTrackers (compute)
• No HDFS on task nodes; they read from core-node HDFS
32. Amazon EMR Task Nodes
[Diagram: task instance group growing]
• You can add task nodes to a running cluster
33. Amazon EMR Task Nodes
[Diagram: task instance group]
• Adding task nodes gives you more CPU power and more memory
34. Amazon EMR Task Nodes
[Diagram: task instance group shrinking]
• You can remove task nodes from a running cluster
36. Tasknode Use-Case #1
• Speed up job processing using the Spot market
• Run task nodes on the Spot market
• Get a discount on the hourly price
• Nodes can come and go without interruption to your cluster
37. Tasknode Use-Case #2
• When you need extra horsepower for a short amount of time
• Example: you need to pull a large amount of data from Amazon S3
42. Amazon S3 as HDFS
• Use Amazon S3 as your permanent data store
• Use HDFS for temporary storage of data between jobs
• No additional step to copy data to HDFS
[Diagram: Amazon EMR cluster (core and task instance groups) reading from and writing to Amazon S3, with HDFS holding intermediate data]
43. Benefits: Amazon S3 as HDFS
• Ability to shut down your cluster (a HUGE benefit!)
• Use Amazon S3 as your durable storage: 11 9s of durability
44. Benefits: Amazon S3 as HDFS
• No need to scale HDFS
   – Capacity
   – Replication for durability
• Amazon S3 scales with your data, both in IOPS and data storage
45. Benefits: Amazon S3 as HDFS
• Ability to share data between multiple clusters
• Ability to share data between multiple clusters
   – Hard to do with HDFS
[Diagram: two EMR clusters reading the same Amazon S3 bucket]
46. Benefits: Amazon S3 as HDFS
• Take advantage of Amazon S3 features
   – Amazon S3 server-side encryption
   – Amazon S3 lifecycle policies
   – Amazon S3 versioning to protect against corruption
• Build elastic clusters
   – Add nodes to read from Amazon S3
   – Remove nodes, with data safe on Amazon S3
47. What About Data Locality?
• Run your job in the same region as your Amazon S3 bucket
• Amazon EMR nodes have high-speed connectivity to Amazon S3
• If your job is CPU/memory-bound, data locality doesn’t make a difference
48. Anti-Pattern: Amazon S3 as HDFS
• Iterative workloads
– If you’re processing the same dataset more than once
• Disk I/O intensive workloads
60. Amazon EMR Elastic Cluster (a)
3. Get an HTTP Amazon SNS notification to a simple app deployed on AWS Elastic Beanstalk
61. Amazon EMR Elastic Cluster (a)
4. Your app calls the Amazon EMR API to add nodes to your cluster
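A minimal sketch of the resize call in step 4, assuming the app uses boto3 (an assumption; the talk predates it). The cluster ID and instance group ID are placeholders you would obtain from the `ListInstanceGroups` API:

```python
def task_group_resize_request(cluster_id, instance_group_id, new_count):
    """Build the ModifyInstanceGroups request body that grows (or
    shrinks) a task instance group to `new_count` nodes."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [{
            "InstanceGroupId": instance_group_id,
            "InstanceCount": new_count,
        }],
    }

def add_task_nodes(cluster_id, instance_group_id, new_count):
    # boto3's EMR client exposes the ModifyInstanceGroups API directly.
    import boto3
    emr = boto3.client("emr")
    emr.modify_instance_groups(
        **task_group_resize_request(cluster_id, instance_group_id, new_count))
```

Because task nodes hold no HDFS data, the same call can later shrink the group back down without endangering the cluster.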
62. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
63. Amazon EMR Nodes and Size
• Use m1.small, m1.large, or c1.medium for functional testing
• Use m1.xlarge and larger nodes for production workloads
64. Amazon EMR Nodes and Size
• Use cc2 for memory- and CPU-intensive jobs
• Use cc2/c1.xlarge for CPU-intensive jobs
• Use hs1 instances for HDFS workloads
65. Amazon EMR Nodes and Size
• Use hi1 and hs1 instances for disk-I/O-intensive workloads
• cc2 instances are more cost effective than m2.4xlarge
• Prefer a smaller cluster of larger nodes to a larger cluster of smaller nodes
74. Introduction to Hadoop Splits
• Data mappers > cluster mapper capacity = mappers wait for capacity (queue) = processing delay
75. Introduction to Hadoop Splits
• More nodes = reduced queue size = faster processing
76. Calculating the Number of Splits for Your Job
Uncompressed files: Hadoop splits a single file into multiple splits.
Example: a 128 MB file = 2 splits with a 64 MB split size
77. Calculating the Number of Splits for Your Job
Compressed files:
1. Splittable compression (e.g., BZIP2): same logic as uncompressed files
Example: 64 MB BZIP2 = 1 split, 128 MB BZIP2 = 2 splits (64 MB split size)
78. Calculating the Number of Splits for Your Job
Compressed files:
2. Unsplittable compression (e.g., GZIP): the entire file is a single split.
Example: each 128 MB GZ file = 1 split
79. Calculating the Number of Splits for Your Job
If data files have unsplittable compression:
# of splits = number of files
Example: 10 GZ files = 10 mappers
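The split-counting rules above can be sketched as follows (file sizes in MB; the codec names and tuple shape are illustrative):

```python
import math

def count_splits(files, split_size_mb=64):
    """files: iterable of (size_mb, codec) pairs, where codec is None
    (uncompressed), a splittable codec such as "bzip2", or an
    unsplittable one such as "gzip"."""
    splits = 0
    for size_mb, codec in files:
        if codec == "gzip":
            # Unsplittable compression: the whole file is one split.
            splits += 1
        else:
            # Uncompressed or splittable: split on the block boundary.
            splits += math.ceil(size_mb / split_size_mb)
    return splits

print(count_splits([(128, None)]))         # 2 (slide 76)
print(count_splits([(128, "gzip")] * 10))  # 10 (slide 79)
```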
82. Cluster Sizing Calculation
2. Pick an instance type and note the number of mappers it can run in parallel
m1.xlarge = 8 mappers in parallel
83. Cluster Sizing Calculation
3. Pick some sample data files to run a test workload. The number of sample files should match the per-instance mapper capacity from step #2.
84. Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.
85. Cluster Sizing Calculation
Estimated Number of Nodes =
(Total Mappers × Time to Process Sample Files) / (Instance Mapper Capacity × Desired Processing Time)
86. Example: Cluster Sizing Calculation
1. Estimate the number of mappers your job requires: 150
2. Pick an instance type and note the number of mappers it can run in parallel: m1.xlarge, with 8-mapper capacity per instance
87. Example: Cluster Sizing Calculation
3. Pick some sample data files to run a test workload, matching the count to step #2: 8 files selected for our sample test
88. Example: Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.
3 min to process 8 files
89. Cluster Sizing Calculation
Estimated number of nodes =
(Total Mappers for Your Job × Time to Process Sample Files) / (Per-Instance Mapper Capacity × Desired Processing Time)
= (150 × 3 min) / (8 × 5 min) ≈ 11 m1.xlarge
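The sizing formula above is a direct transcription into code; the function name is my own:

```python
def estimated_nodes(total_mappers, sample_minutes,
                    mappers_per_instance, desired_minutes):
    """Slide 85's formula: scale the single-node sample run up to the
    whole job, then divide by the desired wall-clock processing time."""
    return (total_mappers * sample_minutes
            / (mappers_per_instance * desired_minutes))

# 150 mappers, 3 min sample run, m1.xlarge (8 mappers), 5 min target:
print(round(estimated_nodes(150, 3, 8, 5)))  # 11 m1.xlarge, as on slide 89
```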
90. File Size Best Practices
• Avoid small files at all costs (anything smaller than 100 MB)
• Each mapper is a single JVM
• CPU time is required to spawn JVMs/mappers
91. File Size Best Practices
Mappers take ~2 sec to spawn up and be ready for processing.
10 TB in 100 MB files = 100,000 mappers × 2 sec = ~55 hours of mapper CPU setup time
92. File Size Best Practices
Mappers take ~2 sec to spawn up and be ready for processing.
10 TB in 1000 MB files = 10,000 mappers × 2 sec = ~5.5 hours of mapper CPU setup time
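The setup-time arithmetic on the last two slides can be reproduced directly (using the decimal TB→MB conversion the slides assume, one mapper per file):

```python
def mapper_setup_hours(total_tb, file_mb, spawn_sec=2.0):
    """Aggregate JVM spawn cost: one mapper per (unsplittable) file,
    each JVM taking roughly `spawn_sec` seconds to start."""
    n_mappers = total_tb * 1_000_000 / file_mb   # decimal TB -> MB
    return n_mappers * spawn_sec / 3600

print(mapper_setup_hours(10, 100))    # ~55.6 hours of pure setup
print(mapper_setup_hours(10, 1000))   # ~5.6 hours
```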
93. File Size on Amazon S3: Best Practices
• What’s the best Amazon S3 file size for Hadoop? About 1-2 GB
• Why?
94. File Size on Amazon S3: Best Practices
• The life of a mapper should not be less than 60 sec
• A single mapper can get 10-15 MB/s of throughput to Amazon S3
60 sec × 15 MB/s ≈ 1 GB
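The sizing rule above works out as (function name and defaults are my own):

```python
def target_file_mb(min_mapper_sec=60, s3_mb_per_sec=15):
    """A mapper should live at least 60 s, and a single mapper pulls
    10-15 MB/s from Amazon S3, so aim for roughly 1 GB per file."""
    return min_mapper_sec * s3_mb_per_sec

print(target_file_mb())  # 900 MB, i.e. roughly 1 GB
```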
96. Dealing with Small Files
• Use S3DistCp to combine smaller files together
• S3DistCp takes a groupBy pattern and a target size to combine smaller input files into larger ones
97. Dealing with Small Files
Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
--targetSize,128'
98. Compressions
• Compress as much as you can
• Compress Amazon S3 input data files
   – Reduces cost
   – Speeds up Amazon S3 → mapper data transfer time
99. Compressions
• Always compress data files on Amazon S3
   – Reduces storage cost
   – Reduces bandwidth between Amazon S3 and Amazon EMR
   – Speeds up your job
101. Compressions
• Compression types:
   – Some are fast BUT offer less space reduction
   – Some are space efficient BUT slower
   – Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
102. Compressions
• If you are time sensitive, faster compression is a better choice
• If you have a large amount of data, use space-efficient compression
• If you don’t care, pick GZIP
103. Change Compression Type
• You may decide to change compression type
• Use S3DistCp to change the compression type of your files
• Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--outputCodec,lzo'
104. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
105. Architecting for cost
• AWS pricing models:
   – On-Demand: pay-as-you-go model
   – Spot: marketplace; bid for instances and get a discount
   – Reserved Instances: upfront payment (for 1 or 3 years) for a reduction in the overall monthly payment
109. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
110. Adv. Optimizations (Stage 1)
• The best optimization is to structure your data (i.e., smart data partitioning)
• Efficient data structuring = less data processed by Hadoop = faster jobs
111. Adv. Optimizations (Stage 1)
• Hadoop is a batch processing framework
• Data processing time: an hour to days
• Not a great use case for shorter jobs
• Other frameworks may be a better fit:
   – Twitter Storm
   – Spark
   – Amazon Redshift, etc.
112. Adv. Optimizations (Stage 1)
• The Amazon EMR team has done a great deal of optimization already
• For smaller clusters, Amazon EMR configuration optimization won’t buy you much
   – Remember: you’re paying for the full hourly cost of an instance
120. Network IO
• The most important metric to watch if you’re using Amazon S3 for storage
• Goal: drive as much network I/O as possible from a single instance
121. Network IO
• Larger instances can drive > 600 Mbps
• Cluster Compute instances can drive 1-2 Gbps
• Optimize to get more out of your instance throughput
   – Add more mappers?
122. Network IO
• If you’re using Amazon S3 with Amazon EMR, monitor Ganglia and watch network throughput
• Your goal is to maximize NIC throughput by having enough mappers per node
123. Network IO, Example
Low network utilization: increase the number of mappers, if possible, to drive more traffic
124. CPU
• Watch the CPU utilization of your clusters
• If > 50% idle, increase the number of mappers/reducers per instance
   – Reduces the number of nodes, which reduces cost
127. Disk IO
• Limit the amount of disk I/O
• Can increase mapper/reducer memory
• Compress data anywhere you can
• Monitor your cluster and pay attention to the HDFS bytes-written metric
• One place to pay attention to is mapper/reducer disk spill
128. Disk IO
• Each mapper has an in-memory buffer
[Diagram: mapper with its in-memory buffer]
129. Disk IO
• When the memory buffer gets full, data spills to disk
[Diagram: mapper memory buffer spilling to disk]
131. Disk IO
• If you see mappers/reducers spilling excessively to disk, increase the buffer memory per mapper
• Spill is excessive when the ratio of “MAPPER_SPILLED_RECORDS” to “MAPPER_OUTPUT_RECORDS” is more than 1
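The spill check can be sketched as a function over the job counters (the dict shape here is illustrative, not the actual Hadoop counter API):

```python
def excessive_spill(counters):
    """Flag excessive spill: spilled records outnumbering map output
    records means each record hit disk more than once on average."""
    return (counters["MAPPER_SPILLED_RECORDS"]
            / counters["MAPPER_OUTPUT_RECORDS"]) > 1

print(excessive_spill({"MAPPER_SPILLED_RECORDS": 3_000_000,
                       "MAPPER_OUTPUT_RECORDS": 1_200_000}))  # True
```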
133. Disk IO
• Increase mapper buffer memory by increasing “io.sort.mb”:
<property><name>io.sort.mb</name><value>200</value></property>
• The same logic applies to reducers
134. Disk IO
• Monitor disk IO using Ganglia
• Look out for disk IO wait