Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch four years ago, our customers have run more than 5.5 million Hadoop clusters. In this talk, we introduce Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long-lived and transient clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically, introduce ways you can fine-tune your cluster, and share best practices for keeping your Amazon EMR cluster cost efficient.
10. Amazon EMR Introduction
• Launch clusters of any size in a matter of minutes
• Use a variety of instance sizes to match your workload
11. Amazon EMR Introduction
• Don’t get stuck with hardware
• Don’t deal with capacity planning
• Run multiple clusters with different sizes, specs, and node types
16. Pattern #1: Transient Clusters
• Cluster lives for the duration of the job
• Shut down the cluster when the job is done
• Data persists on Amazon S3
• Input & output data on Amazon S3
17. Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
   • Cluster goes away when the job is done
3. Practice cloud architecture
   • Pay for what you use
   • Data processing as a workflow
18. When to use Transient cluster?
If (Data Load Time + Processing Time) * Number of Jobs < 24 hours:
   Use transient clusters
Else:
   Use alive clusters
19. When to use Transient cluster?
(20 min data load + 1 hour processing time) * 10 jobs = ~13 hours < 24 hours → use transient clusters
20. Alive Clusters
• Very similar to traditional Hadoop deployments
• Cluster stays around after the job is done
• Data persistence models:
   – Amazon S3
   – Amazon S3 copied to HDFS
   – HDFS, with Amazon S3 as backup
21. Alive Clusters
• Always keep data safe on Amazon S3, even if you’re using HDFS for primary storage
• Get in the habit of shutting down your cluster and starting a new one, once a week or month
   – Design your data processing workflow to account for failure
• You can use workflow management tools such as AWS Data Pipeline
22. Benefits of Alive Clusters
• Ability to share data between multiple jobs
[Diagram: transient clusters exchange data through Amazon S3; a long-running cluster shares data through its HDFS]
23. Benefit of Alive Clusters
• Cost effective for repetitive jobs
[Diagram: the same long-running cluster is reused for jobs arriving throughout the day]
24. When to use Alive cluster?
If (Data Load Time + Processing Time) * Number of Jobs > 24 hours:
   Use alive clusters
Else:
   Use transient clusters
25. When to use Alive cluster?
(20 min data load + 1 hour processing time) * 20 jobs = ~27 hours > 24 hours → use alive clusters
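The transient-vs-alive decision rule on slides 18 and 24, with the worked examples from slides 19 and 25, can be sketched as a small helper; the function name and hour-based units are my own:

```python
def choose_cluster_type(load_hours, processing_hours, jobs_per_day):
    """Heuristic from the slides: if total daily cluster utilization
    stays under 24 hours, a transient cluster is cheaper; otherwise
    keep an alive (long-running) cluster."""
    total_hours = (load_hours + processing_hours) * jobs_per_day
    return "transient" if total_hours < 24 else "alive"

# Slide 19: (20 min load + 1 hr processing) * 10 jobs = ~13 hours
print(choose_cluster_type(20 / 60, 1.0, 10))  # transient
# Slide 25: the same job profile at 20 jobs = ~27 hours
print(choose_cluster_type(20 / 60, 1.0, 20))  # alive
```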
27. Core Nodes
[Diagram: Amazon EMR cluster with a master instance group and a core instance group]
• Core nodes run TaskTrackers (compute)
• Core nodes run DataNodes (HDFS)
28. Core Nodes
[Diagram: Amazon EMR cluster with master and core instance groups]
• You can add core nodes to a running cluster
29. Core Nodes
[Diagram: core instance group growing]
• Adding core nodes gives you more HDFS space and more CPU/memory
30. Core Nodes
[Diagram: Amazon EMR cluster with master and core instance groups]
• You can’t remove core nodes, because they hold HDFS data
31. Amazon EMR Task Nodes
[Diagram: Amazon EMR cluster with master, core, and task instance groups]
• Task nodes run TaskTrackers (compute)
• No HDFS on task nodes; they read from core-node HDFS
32. Amazon EMR Task Nodes
[Diagram: task instance group growing]
• You can add task nodes to a running cluster
33. Amazon EMR Task Nodes
[Diagram: task instance group]
• Adding task nodes gives you more CPU power and more memory
34. Amazon EMR Task Nodes
[Diagram: task instance group shrinking]
• You can remove task nodes from a running cluster
36. Tasknode Use-Case #1
• Speed up job processing using the Spot market
• Run task nodes on the Spot market
• Get a discount on the hourly price
• Nodes can come and go without interruption to your cluster
37. Tasknode Use-Case #2
• When you need extra horsepower for a short amount of time
• Example: you need to pull a large amount of data from Amazon S3
42. Amazon S3 as HDFS
• Use Amazon S3 as your permanent data store
• Use HDFS for temporary storage of data between jobs
• No additional step to copy data to HDFS
[Diagram: Amazon EMR cluster (core and task instance groups) reading from and writing to Amazon S3, with HDFS holding intermediate data]
43. Benefits: Amazon S3 as HDFS
• Ability to shut down your cluster (a HUGE benefit!)
• Use Amazon S3 as your durable storage: 11 9s of durability
44. Benefits: Amazon S3 as HDFS
• No need to scale HDFS
   – Capacity
   – Replication for durability
• Amazon S3 scales with your data, both in IOPS and data storage
45. Benefits: Amazon S3 as HDFS
• Ability to share data between multiple clusters
• Ability to share data between multiple clusters
   – Hard to do with HDFS
[Diagram: two EMR clusters reading the same Amazon S3 bucket]
46. Benefits: Amazon S3 as HDFS
• Take advantage of Amazon S3 features
   – Amazon S3 server-side encryption
   – Amazon S3 lifecycle policies
   – Amazon S3 versioning to protect against corruption
• Build elastic clusters
   – Add nodes to read from Amazon S3
   – Remove nodes, with data safe on Amazon S3
47. What About Data Locality?
• Run your job in the same region as your Amazon S3 bucket
• Amazon EMR nodes have high-speed connectivity to Amazon S3
• If your job is CPU/memory-bound, data locality doesn’t make a difference
48. Anti-Pattern: Amazon S3 as HDFS
• Iterative workloads
– If you’re processing the same dataset more than once
• Disk I/O intensive workloads
60. Amazon EMR Elastic Cluster (a)
3. Get an HTTP Amazon SNS notification to a simple app deployed on AWS Elastic Beanstalk
61. Amazon EMR Elastic Cluster (a)
4. Your app calls the Amazon EMR API to add nodes to your cluster
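A minimal sketch of the resize call in step 4, assuming the app uses boto3 (an assumption; the talk predates it). The cluster ID and instance group ID are placeholders you would obtain from the `ListInstanceGroups` API:

```python
def task_group_resize_request(cluster_id, instance_group_id, new_count):
    """Build the ModifyInstanceGroups request body that grows (or
    shrinks) a task instance group to `new_count` nodes."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [{
            "InstanceGroupId": instance_group_id,
            "InstanceCount": new_count,
        }],
    }

def add_task_nodes(cluster_id, instance_group_id, new_count):
    # boto3's EMR client exposes the ModifyInstanceGroups API directly.
    import boto3
    emr = boto3.client("emr")
    emr.modify_instance_groups(
        **task_group_resize_request(cluster_id, instance_group_id, new_count))
```

Because task nodes hold no HDFS data, the same call can later shrink the group back down without endangering the cluster.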
62. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
63. Amazon EMR Nodes and Size
• Use m1.small, m1.large, or c1.medium for functional testing
• Use m1.xlarge and larger nodes for production workloads
64. Amazon EMR Nodes and Size
• Use cc2 for memory- and CPU-intensive jobs
• Use cc2/c1.xlarge for CPU-intensive jobs
• Use hs1 instances for HDFS workloads
65. Amazon EMR Nodes and Size
• Use hi1 and hs1 instances for disk-I/O-intensive workloads
• cc2 instances are more cost effective than m2.4xlarge
• Prefer a smaller cluster of larger nodes to a larger cluster of smaller nodes
74. Introduction to Hadoop Splits
• Data mappers > cluster mapper capacity = mappers wait for capacity (queue) = processing delay
75. Introduction to Hadoop Splits
• More nodes = reduced queue size = faster processing
76. Calculating the Number of Splits for Your Job
Uncompressed files: Hadoop splits a single file into multiple splits.
Example: a 128 MB file = 2 splits with a 64 MB split size
77. Calculating the Number of Splits for Your Job
Compressed files:
1. Splittable compression (e.g., BZIP2): same logic as uncompressed files
Example: 64 MB BZIP2 = 1 split, 128 MB BZIP2 = 2 splits (64 MB split size)
78. Calculating the Number of Splits for Your Job
Compressed files:
2. Unsplittable compression (e.g., GZIP): the entire file is a single split.
Example: each 128 MB GZ file = 1 split
79. Calculating the Number of Splits for Your Job
If data files have unsplittable compression:
# of splits = number of files
Example: 10 GZ files = 10 mappers
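The split-counting rules above can be sketched as follows (file sizes in MB; the codec names and tuple shape are illustrative):

```python
import math

def count_splits(files, split_size_mb=64):
    """files: iterable of (size_mb, codec) pairs, where codec is None
    (uncompressed), a splittable codec such as "bzip2", or an
    unsplittable one such as "gzip"."""
    splits = 0
    for size_mb, codec in files:
        if codec == "gzip":
            # Unsplittable compression: the whole file is one split.
            splits += 1
        else:
            # Uncompressed or splittable: split on the block boundary.
            splits += math.ceil(size_mb / split_size_mb)
    return splits

print(count_splits([(128, None)]))         # 2 (slide 76)
print(count_splits([(128, "gzip")] * 10))  # 10 (slide 79)
```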
82. Cluster Sizing Calculation
2. Pick an instance type and note the number of mappers it can run in parallel
m1.xlarge = 8 mappers in parallel
83. Cluster Sizing Calculation
3. Pick some sample data files to run a test workload. The number of sample files should match the per-instance mapper capacity from step #2.
84. Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.
85. Cluster Sizing Calculation
Estimated Number of Nodes =
(Total Mappers × Time to Process Sample Files) / (Instance Mapper Capacity × Desired Processing Time)
86. Example: Cluster Sizing Calculation
1. Estimate the number of mappers your job requires: 150
2. Pick an instance type and note the number of mappers it can run in parallel: m1.xlarge, with 8-mapper capacity per instance
87. Example: Cluster Sizing Calculation
3. Pick some sample data files to run a test workload, matching the count to step #2: 8 files selected for our sample test
88. Example: Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.
3 min to process 8 files
89. Cluster Sizing Calculation
Estimated number of nodes =
(Total Mappers for Your Job × Time to Process Sample Files) / (Per-Instance Mapper Capacity × Desired Processing Time)
= (150 × 3 min) / (8 × 5 min) ≈ 11 m1.xlarge
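The sizing formula above is a direct transcription into code; the function name is my own:

```python
def estimated_nodes(total_mappers, sample_minutes,
                    mappers_per_instance, desired_minutes):
    """Slide 85's formula: scale the single-node sample run up to the
    whole job, then divide by the desired wall-clock processing time."""
    return (total_mappers * sample_minutes
            / (mappers_per_instance * desired_minutes))

# 150 mappers, 3 min sample run, m1.xlarge (8 mappers), 5 min target:
print(round(estimated_nodes(150, 3, 8, 5)))  # 11 m1.xlarge, as on slide 89
```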
90. File Size Best Practices
• Avoid small files at all costs (anything smaller than 100 MB)
• Each mapper is a single JVM
• CPU time is required to spawn JVMs/mappers
91. File Size Best Practices
Mappers take ~2 sec to spawn up and be ready for processing.
10 TB in 100 MB files = 100,000 mappers × 2 sec = ~55 hours of mapper CPU setup time
92. File Size Best Practices
Mappers take ~2 sec to spawn up and be ready for processing.
10 TB in 1000 MB files = 10,000 mappers × 2 sec = ~5.5 hours of mapper CPU setup time
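The setup-time arithmetic on the last two slides can be reproduced directly (using the decimal TB→MB conversion the slides assume, one mapper per file):

```python
def mapper_setup_hours(total_tb, file_mb, spawn_sec=2.0):
    """Aggregate JVM spawn cost: one mapper per (unsplittable) file,
    each JVM taking roughly `spawn_sec` seconds to start."""
    n_mappers = total_tb * 1_000_000 / file_mb   # decimal TB -> MB
    return n_mappers * spawn_sec / 3600

print(mapper_setup_hours(10, 100))    # ~55.6 hours of pure setup
print(mapper_setup_hours(10, 1000))   # ~5.6 hours
```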
93. File Size on Amazon S3: Best Practices
• What’s the best Amazon S3 file size for Hadoop? About 1-2 GB
• Why?
94. File Size on Amazon S3: Best Practices
• The life of a mapper should not be less than 60 sec
• A single mapper can get 10-15 MB/s of throughput to Amazon S3
60 sec × 15 MB/s ≈ 1 GB
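The sizing rule above works out as (function name and defaults are my own):

```python
def target_file_mb(min_mapper_sec=60, s3_mb_per_sec=15):
    """A mapper should live at least 60 s, and a single mapper pulls
    10-15 MB/s from Amazon S3, so aim for roughly 1 GB per file."""
    return min_mapper_sec * s3_mb_per_sec

print(target_file_mb())  # 900 MB, i.e. roughly 1 GB
```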
96. Dealing with Small Files
• Use S3DistCp to combine smaller files together
• S3DistCp takes a groupBy pattern and a target size to combine smaller input files into larger ones
97. Dealing with Small Files
Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
--targetSize,128'
98. Compressions
• Compress as much as you can
• Compress Amazon S3 input data files
   – Reduces cost
   – Speeds up Amazon S3 → mapper data transfer time
99. Compressions
• Always compress data files on Amazon S3
   – Reduces storage cost
   – Reduces bandwidth between Amazon S3 and Amazon EMR
   – Speeds up your job
101. Compressions
• Compression types:
   – Some are fast BUT offer less space reduction
   – Some are space efficient BUT slower
   – Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
102. Compressions
• If you are time sensitive, faster compression is a better choice
• If you have a large amount of data, use space-efficient compression
• If you don’t care, pick GZIP
103. Change Compression Type
• You may decide to change compression type
• Use S3DistCp to change the compression type of your files
• Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--outputCodec,lzo'
104. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
105. Architecting for cost
• AWS pricing models:
   – On-Demand: pay-as-you-go model
   – Spot: marketplace; bid for instances and get a discount
   – Reserved Instances: upfront payment (for 1 or 3 years) for a reduction in the overall monthly payment
109. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
110. Adv. Optimizations (Stage 1)
• The best optimization is to structure your data (i.e., smart data partitioning)
• Efficient data structuring = less data processed by Hadoop = faster jobs
111. Adv. Optimizations (Stage 1)
• Hadoop is a batch processing framework
• Data processing time: an hour to days
• Not a great use case for shorter jobs
• Other frameworks may be a better fit:
   – Twitter Storm
   – Spark
   – Amazon Redshift, etc.
112. Adv. Optimizations (Stage 1)
• The Amazon EMR team has done a great deal of optimization already
• For smaller clusters, Amazon EMR configuration optimization won’t buy you much
   – Remember: you’re paying for the full hourly cost of an instance
120. Network IO
• The most important metric to watch if you’re using Amazon S3 for storage
• Goal: drive as much network I/O as possible from a single instance
121. Network IO
• Larger instances can drive > 600 Mbps
• Cluster Compute instances can drive 1-2 Gbps
• Optimize to get more out of your instance throughput
   – Add more mappers?
122. Network IO
• If you’re using Amazon S3 with Amazon EMR, monitor Ganglia and watch network throughput
• Your goal is to maximize NIC throughput by having enough mappers per node
123. Network IO, Example
Low network utilization: increase the number of mappers, if possible, to drive more traffic
124. CPU
• Watch the CPU utilization of your clusters
• If > 50% idle, increase the number of mappers/reducers per instance
   – Reduces the number of nodes, which reduces cost
127. Disk IO
• Limit the amount of disk I/O
• Can increase mapper/reducer memory
• Compress data anywhere you can
• Monitor your cluster and pay attention to the HDFS bytes-written metric
• One place to pay attention to is mapper/reducer disk spill
128. Disk IO
• Each mapper has an in-memory buffer
[Diagram: mapper with its in-memory buffer]
129. Disk IO
• When the memory buffer gets full, data spills to disk
[Diagram: mapper memory buffer spilling to disk]
131. Disk IO
• If you see mappers/reducers spilling excessively to disk, increase the buffer memory per mapper
• Spill is excessive when the ratio of “MAPPER_SPILLED_RECORDS” to “MAPPER_OUTPUT_RECORDS” is more than 1
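The spill check can be sketched as a function over the job counters (the dict shape here is illustrative, not the actual Hadoop counter API):

```python
def excessive_spill(counters):
    """Flag excessive spill: spilled records outnumbering map output
    records means each record hit disk more than once on average."""
    return (counters["MAPPER_SPILLED_RECORDS"]
            / counters["MAPPER_OUTPUT_RECORDS"]) > 1

print(excessive_spill({"MAPPER_SPILLED_RECORDS": 3_000_000,
                       "MAPPER_OUTPUT_RECORDS": 1_200_000}))  # True
```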
133. Disk IO
• Increase mapper buffer memory by increasing “io.sort.mb”:
<property><name>io.sort.mb</name><value>200</value></property>
• The same logic applies to reducers
134. Disk IO
• Monitor disk IO using Ganglia
• Look out for disk IO wait