Amazon Elastic MapReduce:
Deep Dive and Best Practices
Parviz Deyhim
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
What is EMR?
• Hadoop-as-a-service
• Map-Reduce engine
• Integrated with tools
• Massively parallel
• Integrated with AWS services
• Cost-effective AWS wrapper
[Diagrams: Amazon EMR cluster built on HDFS; data management with Amazon S3 and Amazon DynamoDB; analytics languages on top; integration extended to Amazon RDS, Amazon Redshift, and AWS Data Pipeline]
Amazon EMR Introduction
• Launch clusters of any size in a matter of
minutes
• Use a variety of instance sizes that match
your workload
Amazon EMR Introduction
• Don’t get stuck with hardware
• Don’t deal with capacity planning
• Run multiple clusters with different sizes, specs,
and node types
Amazon EMR Introduction
• Integration with the Spot market
• 70-80% discount
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Amazon EMR Design Patterns
Pattern #1: Transient vs. Alive Clusters
Pattern #2: Core Nodes and Task Nodes
Pattern #3: Amazon S3 as HDFS
Pattern #4: Amazon S3 & HDFS
Pattern #5: Elastic Clusters
Pattern #1: Transient vs. Alive Clusters
Pattern #1: Transient Clusters
• Cluster lives for the duration of the job
• Shut down the cluster when the job is done

• Data persists on Amazon S3
• Input & output data on Amazon S3
Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
   • Cluster goes away when the job is done
3. Practice cloud architecture
   • Pay for what you use
   • Data processing as a workflow
When to use Transient cluster?
If (Data Load Time + Processing Time) * Number Of Jobs < 24 hours
    Use Transient Clusters
Else
    Use Alive Clusters
When to use Transient cluster?

(20 min data load + 1 hour processing time) * 10 jobs ≈ 13.3 hours < 24 hours → use transient clusters
Alive Clusters
• Very similar to traditional Hadoop deployments
• Cluster stays around after the job is done
• Data persistence model:
  – Amazon S3
  – Amazon S3 copied to HDFS
  – HDFS, with Amazon S3 as backup
Alive Clusters
• Always keep data safe on Amazon S3 even if
you’re using HDFS for primary storage
• Get in the habit of shutting down your cluster and
starting a new one, once a week or month
  – Design your data processing workflow to account for failure
• You can use workflow management tools such as
AWS Data Pipeline
Benefits of Alive Clusters
• Ability to share data between multiple jobs
[Diagram: transient clusters share data through Amazon S3; a long-running cluster shares data in HDFS]
Benefit of Alive Clusters
• Cost effective for repetitive jobs
[Diagram: jobs arriving throughout the day reuse the same running cluster]
When to use Alive cluster?
If (Data Load Time + Processing Time) * Number Of Jobs > 24 hours
    Use Alive Clusters
Else
    Use Transient Clusters
When to use Alive cluster?

(20 min data load + 1 hour processing time) * 20 jobs ≈ 26.7 hours > 24 hours → use alive clusters
Pattern #2: Core & Task Nodes
Core Nodes
• Run TaskTrackers (compute)
• Run DataNodes (HDFS)
[Diagram: Amazon EMR cluster with a master instance group and a core instance group]
Core Nodes
• Can add core nodes: more HDFS space, more CPU/memory
[Diagram: a new core node with HDFS joins the core instance group]
Core Nodes
• Can't remove core nodes, because of HDFS
[Diagram: removing a core node would lose HDFS blocks]
Amazon EMR Task Nodes
• Run TaskTrackers
• No HDFS; reads from core-node HDFS
[Diagram: task instance group alongside the master and core instance groups]
Amazon EMR Task Nodes
• Can add task nodes: more CPU power, more memory
• You can remove task nodes at any time
[Diagrams: task instance group growing and shrinking]
Tasknode Use-Case #1
• Speed up job processing using the Spot market
• Run task nodes on the Spot market
• Get a discount on the hourly price
• Nodes can come and go without
interruption to your cluster
Tasknode Use-Case #2
• When you need extra horse power
for a short amount of time
• Example: need to pull a large amount
of data from Amazon S3
Example:
[Diagram: two HS1 core nodes with 48 TB HDFS each, plus Amazon S3]
• Add Spot task nodes (m1.xlarge) to load data from Amazon S3
• Remove them after the data load from Amazon S3 completes
Pattern #3: Amazon S3 as HDFS
Amazon S3 as HDFS
• Use Amazon S3 as your permanent data store
• HDFS for temporary storage of data between jobs
• No additional step to copy data to HDFS
[Diagram: Amazon EMR cluster (core and task instance groups) backed by Amazon S3]
Benefits: Amazon S3 as HDFS
• Ability to shut down your cluster: HUGE benefit!
• Use Amazon S3 as your durable storage: 11 9s of durability
Benefits: Amazon S3 as HDFS
• No need to scale HDFS
• Capacity
• Replication for durability

• Amazon S3 scales with your data
• Both in IOPS and data storage
Benefits: Amazon S3 as HDFS
• Ability to share data between multiple clusters
  – Hard to do with HDFS
[Diagram: two EMR clusters sharing data through Amazon S3]
Benefits: Amazon S3 as HDFS
• Take advantage of Amazon S3 features
  – Amazon S3 ServerSideEncryption
  – Amazon S3 LifeCyclePolicy
  – Amazon S3 versioning to protect against corruption
• Build elastic clusters
  – Add nodes to read from Amazon S3
  – Remove nodes with data safe on Amazon S3
What About Data Locality?
• Run your job in the same region as your
Amazon S3 bucket
• Amazon EMR nodes have high-speed
connectivity to Amazon S3
• If your job is CPU/memory-bound, data
locality doesn't make a difference
Anti-Pattern: Amazon S3 as HDFS
• Iterative workloads
– If you’re processing the same dataset more than once

• Disk I/O intensive workloads
Pattern #4: Amazon S3 & HDFS
Amazon S3 & HDFS
1. Data persists on Amazon S3
2. Launch Amazon EMR and copy data to HDFS with S3DistCp
3. Start processing data on HDFS
[Diagram: S3DistCp copying data from Amazon S3 into HDFS]
Benefits: Amazon S3 & HDFS
• Better pattern for I/O-intensive workloads
• The Amazon S3 benefits discussed previously apply:
  – Durability
  – Scalability
  – Cost
  – Features: lifecycle policy, security
Pattern #5: Elastic Clusters
Amazon EMR Elastic Cluster (manual)
1. Start a cluster with a certain number of nodes
2. Monitor your cluster with Amazon CloudWatch metrics:
   • Map Tasks Running
   • Map Tasks Remaining
   • Cluster Idle?
   • Avg. Jobs Failed
3. Increase the number of nodes as you need more
capacity by manually calling the API
Amazon EMR Elastic Cluster (automatic)
1. Start your cluster with a certain number of nodes
2. Monitor cluster capacity with Amazon CloudWatch metrics:
   • Map Tasks Running
   • Map Tasks Remaining
   • Cluster Idle?
   • Avg. Jobs Failed
3. Get an HTTP Amazon SNS notification to a simple
app deployed on Elastic Beanstalk
4. Your app calls the API to add nodes to your cluster
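The resize decision in step 4 can be sketched as a pure helper. This is an illustrative sketch only: the function, its parameters, and the scaling cap are hypothetical, not an EMR API. The actual resize call would go through EMR's ModifyInstanceGroups API.

```python
import math

def task_nodes_to_add(map_tasks_remaining, mappers_per_node, max_new_nodes=20):
    """Suggest how many task nodes to add so the mapper backlog
    (CloudWatch's 'Map Tasks Remaining') clears in roughly one wave.
    mappers_per_node is the instance's parallel mapper capacity,
    e.g. 8 for m1.xlarge. Capped to avoid runaway scaling."""
    if map_tasks_remaining <= 0:
        return 0
    return min(max_new_nodes, math.ceil(map_tasks_remaining / mappers_per_node))

print(task_nodes_to_add(150, 8))  # 19 nodes to absorb 150 queued mappers
```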
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Amazon EMR Nodes and Size
• Use m1.small, m1.large, c1.medium for
functional testing

• Use m1.xlarge and larger nodes for
production workloads
Amazon EMR Nodes and Size
• Use CC2 for memory- and CPU-intensive
jobs

• Use CC2/c1.xlarge for CPU-intensive
jobs

• HS1 instances for HDFS workloads
Amazon EMR Nodes and Size
• HI1 and HS1 instances for disk I/O-intensive workloads
• CC2 instances are more cost effective
than m2.4xlarge
• Prefer a smaller cluster of larger nodes to a
larger cluster of smaller nodes
Holy Grail Question
How many nodes do I need?
• Depends on how much data you have

• And how fast you'd like your data to be
processed
Introduction to Hadoop Splits
Before we can plan Amazon EMR capacity,
we need to understand how Hadoop
handles splits internally
Introduction to Hadoop Splits
• Data gets broken up into splits (64 MB or 128 MB)
[Diagram: 128 MB of data divided into splits]
Introduction to Hadoop Splits
• Splits get packaged into mappers
[Diagram: data → splits → mappers]
Introduction to Hadoop Splits
• Mappers get assigned to nodes for processing
[Diagram: mappers distributed across instances]
Introduction to Hadoop Splits
• More data = More splits = More mappers
Introduction to Hadoop Splits
• More data = more splits = more mappers
[Diagram: mappers queuing for capacity]
Introduction to Hadoop Splits
• Data's mappers > cluster mapper capacity =
mappers wait for capacity = processing delay
Introduction to Hadoop Splits
• More nodes = reduced queue size = faster
processing
Calculating the Number of Splits for Your Job
Uncompressed files: Hadoop splits a single file into
multiple splits.
Example: 128 MB = 2 splits, based on a 64 MB split size
Calculating the Number of Splits for Your Job
Compressed files:
1. Splittable compressions: same logic as uncompressed files
Example: a 128 MB BZIP file = two 64 MB splits
Calculating the Number of Splits for Your Job
Compressed files:
2. Unsplittable compressions: the entire file is a
single split.
Example: a 128 MB GZ file = 1 split
Calculating the Number of Splits for Your Job
If data files have unsplittable compression:
# of splits = number of files
Example: 10 GZ files = 10 mappers
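The split-counting rules above can be sketched in a few lines. This is a simplified model of Hadoop's behavior for estimation purposes; the function and its input shape are made up for this example.

```python
import math

def estimate_mappers(files, split_size_mb=64):
    """files: iterable of (size_mb, splittable) pairs.
    Unsplittable compression (e.g. GZIP): the whole file is one split.
    Splittable (e.g. BZIP2) or uncompressed: ceil(size / split size)."""
    total = 0
    for size_mb, splittable in files:
        total += math.ceil(size_mb / split_size_mb) if splittable else 1
    return total

print(estimate_mappers([(128, False)] * 10))  # 10 GZ files -> 10 mappers
print(estimate_mappers([(128, True)]))        # one 128 MB BZIP -> 2 mappers
```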
Cluster Sizing Calculation
Just tell me how many nodes I need for my job!!
Cluster Sizing Calculation

1. Estimate the number of mappers your job
requires.
Cluster Sizing Calculation

2. Pick an instance and note the number of
mappers it can run in parallel
m1.xlarge = 8 mappers in parallel
Cluster Sizing Calculation

3. We need to pick some sample data files to run
a test workload. The number of sample files
should be the same number from step #2.
Cluster Sizing Calculation

4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.
Cluster Sizing Calculation
Estimated Number of Nodes =
(Total Mappers × Time To Process Sample Files) /
(Instance Mapper Capacity × Desired Processing Time)
Example: Cluster Sizing Calculation
1. Estimate the number of mappers your job
requires

150
2. Pick an instance and note down the number of
mappers it can run in parallel

m1.xlarge with 8 mapper capacity
per instance
Example: Cluster Sizing Calculation
3. We need to pick some sample data files to run a
test workload. The number of sample files should
be the same number from step #2.

8 files selected for our sample test
Example: Cluster Sizing Calculation

4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.

3 min to process 8 files
Cluster Sizing Calculation
Estimated number of nodes =
(Total Mappers For Your Job × Time To Process Sample Files) /
(Per-Instance Mapper Capacity × Desired Processing Time)

(150 × 3 min) / (8 × 5 min) ≈ 11 m1.xlarge
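The sizing formula is easy to encode directly. A minimal sketch of the deck's calculation (function name is illustrative):

```python
def estimate_nodes(total_mappers, sample_minutes, mappers_per_node, desired_minutes):
    """Deck's sizing formula:
    (total mappers * time to process sample files) /
    (per-instance mapper capacity * desired processing time)."""
    return (total_mappers * sample_minutes) / (mappers_per_node * desired_minutes)

# 150 mappers, 3 min per 8-file sample, m1.xlarge (8 mappers), 5 min target
print(estimate_nodes(150, 3, 8, 5))  # 11.25 -> ~11 m1.xlarge nodes
```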
File Size Best Practices
• Avoid small files at all costs
  – Anything smaller than 100 MB

• Each mapper is a single JVM
• CPU time is required to spawn
JVMs/mappers
File Size Best Practices
Mappers take 2 sec to spawn up and be ready
for processing

10 TB in 100 MB files = 100,000 mappers × 2 sec =
a total of ~55 hours of mapper CPU setup time
File Size Best Practices
Mappers take 2 sec to spawn up and be ready
for processing

10 TB in 1,000 MB files = 10,000 mappers × 2 sec =
a total of ~5.5 hours of mapper CPU setup time
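The JVM setup overhead above is simple arithmetic. A sketch, assuming one mapper per (unsplittable) file and decimal units (1 TB = 1,000,000 MB) as the slides do; the function name is illustrative:

```python
def mapper_setup_hours(total_tb, file_mb, spawn_sec=2.0):
    """Total JVM spawn overhead across all mappers, assuming one
    mapper per file."""
    n_mappers = total_tb * 1_000_000 / file_mb
    return n_mappers * spawn_sec / 3600

print(round(mapper_setup_hours(10, 100), 1))   # 100,000 mappers -> ~55.6 h
print(round(mapper_setup_hours(10, 1000), 1))  # 10,000 mappers  -> ~5.6 h
```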
File Size on Amazon S3: Best Practices

• What’s the best Amazon S3 file size for
Hadoop?
About 1-2GB
• Why?
File Size on Amazon S3: Best Practices
• The life of a mapper should not be less than 60 sec

• A single mapper can get 10-15 MB/s of throughput
to Amazon S3

60 sec × 15 MB/s ≈ 1 GB
Holy Grail Question
What if I have small file issues?
Dealing with Small Files
• Use S3DistCP to combine smaller files together

• S3DistCP takes a pattern and target file to
combine smaller input files to larger ones
Dealing with Small Files
Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
--targetSize,128'
Compressions
• Compress as much as you can
• Compress Amazon S3 input data files

– Reduces cost
– Speed up Amazon S3->mapper data transfer
time
Compressions
• Always compress data files on Amazon S3
  – Reduces storage cost
  – Reduces bandwidth between Amazon S3
    and Amazon EMR
  – Speeds up your job
Compressions
• Compress mapper and reducer output
  – Reduces disk IO
Compressions
• Compression types:
  – Some are fast but offer less space reduction
  – Some are space efficient but slower
  – Some are splittable and some are not
Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
Compressions
• If you are time sensitive, faster compressions
are a better choice

• If you have a large amount of data, use space-efficient
compressions

• If you don't care, pick GZIP
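The codec trade-off can be captured with the table's numbers. A sketch for illustration: the dictionary and chooser function are made up for this example, and the figures are the approximate single-threaded benchmarks from the table, not guarantees.

```python
# Figures from the compression table (approximate benchmarks):
# name: (space_remaining_pct, encode_mbps, decode_mbps)
CODECS = {
    "gzip":   (13,  21, 118),
    "lzo":    (20, 135, 410),
    "snappy": (22, 172, 409),
}

def pick_codec(priority):
    """'speed' -> fastest encoder; 'space' -> smallest output."""
    if priority == "speed":
        return max(CODECS, key=lambda c: CODECS[c][1])
    return min(CODECS, key=lambda c: CODECS[c][0])

print(pick_codec("speed"))  # snappy
print(pick_codec("space"))  # gzip
```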
Change Compression Type
• You may decide to change compression type
• Use S3DistCp to change the compression type of
your files
• Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--outputCodec,lzo'
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Architecting for cost
• AWS pricing models:
  – On-demand: pay-as-you-go model
  – Spot: marketplace; bid for instances
    and get a discount
  – Reserved Instances: upfront payment
    (for 1 or 3 years) for a reduction in the
    overall monthly payment
Reserved Instances use-case
For alive and long-running clusters
Reserved Instances use-case
For ad-hoc and unpredictable workloads,
use medium utilization
Reserved Instances use-case
For unpredictable workloads, use Spot or
on-demand pricing
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Adv. Optimizations (Stage 1)
• The best optimization is to structure your data
(i.e., smart data partitioning)

• Efficient data structuring = limiting the amount of
data Hadoop processes = faster jobs
Adv. Optimizations (Stage 1)
• Hadoop is a batch processing framework
• Data processing time = an hour to days
• Not a great use case for shorter jobs
• Other frameworks may be a better fit:
  – Twitter Storm
  – Spark
  – Amazon Redshift, etc.
Adv. Optimizations (Stage 1)
• Amazon EMR team has done a great deal of
optimization already

• For smaller clusters, Amazon EMR configuration
optimization won’t buy you much
– Remember you’re paying for the full hour
cost of an instance
Adv. Optimizations (Stage 1)

Best Optimization??
Adv. Optimizations (Stage 1)

Add more nodes
Adv. Optimizations (Stage 2)
• Monitor your cluster using Ganglia

• Amazon EMR has a Ganglia bootstrap
action
Adv. Optimizations (Stage 2)
• Monitor and look for bottlenecks
– Memory
– CPU
– Disk IO
– Network IO
Adv. Optimizations
[Flow: Run Job → Find Bottlenecks (Ganglia: CPU, disk, memory) → Address Bottleneck (fine tune / change algorithm)]
Network IO
• Most important metric to watch for if using
Amazon S3 for storage

• Goal: Drive as much network IO as possible
from a single instance
Network IO
• Larger instances can drive > 600 Mbps
• Cluster Compute instances can drive 1-2 Gbps

• Optimize to get more out of your instance
throughput
  – Add more mappers?
Network IO
• If you’re using Amazon S3 with Amazon
EMR, monitor Ganglia and watch network
throughput.

• Your goal is to maximize your NIC
throughput by having enough mappers
per node
Network IO, Example
Low network utilization
Increase number of mappers if possible to drive
more traffic
CPU
• Watch for CPU utilization of your clusters

• If >50% idle, increase # of mapper/reducer
per instance
– Reduces the number of nodes and reduces
cost
Example: Adv. Optimizations (Stage 2)
What potential optimization
do you see in this graph?
[Ganglia CPU graph]
40% CPU idle: maybe add more mappers?
Disk IO
• Limit the amount of disk IO
• Can increase mapper/reducer memory
• Compress data anywhere you can
• Monitor the cluster and pay attention to HDFS
bytes-written metrics
• One place to pay attention to is mapper/reducer
disk spill
Disk IO
• The mapper has an in-memory buffer
[Diagram: mapper with memory buffer]
Disk IO
• When memory gets full, data spills to disk
[Diagram: mapper memory buffer spilling to disk]
Disk IO
• If you see mappers/reducers excessively spilling to disk,
increase buffer memory per mapper

• Spill is excessive when the ratio of
"MAPPER_SPILLED_RECORDS" to
"MAPPER_OUTPUT_RECORDS" is more than 1
Disk IO
Example:

MAPPER_SPILLED_RECORDS = 221200123
MAPPER_OUTPUT_RECORDS = 101200123
Ratio ≈ 2.2 > 1, so spill is excessive
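The spill check is a one-line ratio test. A sketch of the deck's rule (function name is illustrative; the counters come from the Hadoop job counters named above):

```python
def excessive_spill(spilled_records, output_records):
    """Deck's rule: spill is excessive when spilled/output exceeds 1,
    i.e. records hit disk more than once on average."""
    return spilled_records / output_records > 1

ratio = 221200123 / 101200123
print(round(ratio, 2), excessive_spill(221200123, 101200123))  # 2.19 True
```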
Disk IO
• Increase mapper buffer memory by increasing
"io.sort.mb":

<property><name>io.sort.mb</name><value>200</value></property>

• The same logic applies to reducers
Disk IO
• Monitor disk IO using Ganglia

• Look out for disk IO wait
Remember!
[Flow: Run Job → Find Bottlenecks (Ganglia: CPU, disk, memory) → Address Bottleneck (fine tune / change algorithm)]
Please give us your feedback on this presentation (BDT404).
As a thank you, we will select prize winners daily for completed surveys!

AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
 Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
EMR Training
EMR TrainingEMR Training
EMR Training
 
Optimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsOptimizing Storage for Big Data Workloads
Optimizing Storage for Big Data Workloads
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
AWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explainedAWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explained
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 

Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

  • 1. Amazon Elastic MapReduce: Deep Dive and Best Practices Parviz Deyhim November 13, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
  • 3. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
  • 4. What is EMR? • Hadoop-as-a-service • MapReduce engine • Massively parallel • Integrated with tools and AWS services • Cost-effective AWS wrapper
  • 7. Data management Analytics languages HDFS Amazon EMR Amazon S3 Amazon DynamoDB
  • 8. Data management Analytics languages HDFS Amazon EMR Amazon S3 Amazon DynamoDB Amazon RDS
  • 9. Data management Analytics languages HDFS Amazon EMR Amazon Redshift Amazon RDS AWS Data Pipeline Amazon S3 Amazon DynamoDB
  • 10. Amazon EMR Introduction • Launch clusters of any size in a matter of minutes • Use a variety of instance sizes to match your workload
  • 11. Amazon EMR Introduction • Don’t get stuck with hardware • Don’t deal with capacity planning • Run multiple clusters with different sizes, specs and node types
  • 12. Amazon EMR Introduction • Integration with Spot market • 70-80% discount
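As a sketch of what "launch clusters in minutes" looks like in practice, the function below builds a minimal RunJobFlow request body of the shape accepted by boto3's EMR client. The cluster name, instance types, and counts are illustrative, and the actual API call is left commented out:

```python
def job_flow_request(name, master_type, core_type, core_count):
    """Build a minimal RunJobFlow request body (the shape accepted by
    boto3's EMR client; names and sizes here are illustrative)."""
    return {
        "Name": name,
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": master_type,
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                {"InstanceRole": "CORE", "InstanceType": core_type,
                 "InstanceCount": core_count, "Market": "ON_DEMAND"},
            ],
            # Transient-cluster pattern: shut down once all steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }

req = job_flow_request("nightly-etl", "m1.xlarge", "m1.xlarge", 10)
# boto3.client("emr").run_job_flow(**req)  # not executed here
```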
  • 13. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
  • 14. Amazon EMR Design Patterns Pattern #1: Transient vs. Alive Clusters Pattern #2: Core Nodes and Task Nodes Pattern #3: Amazon S3 as HDFS Pattern #4: Amazon S3 & HDFS Pattern #5: Elastic Clusters
  • 15. Pattern #1: Transient vs. Alive Clusters
  • 16. Pattern #1: Transient Clusters • Cluster lives for the duration of the job • Shut down the cluster when the job is done • Data persists on Amazon S3 • Input & output data on Amazon S3
  • 17. Benefits of Transient Clusters 1. Control your cost 2. Minimum maintenance • Cluster goes away when job is done 3. Practice cloud architecture • Pay for what you use • Data processing as a workflow
  • 18. When to use Transient cluster? If (Data Load Time + Processing Time) * Number Of Jobs < 24 hours Use Transient Clusters Else Use Alive Clusters
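The rule of thumb on this slide can be sketched in a few lines of Python (the function name is mine; the 24-hour threshold comes straight from the slide):

```python
def choose_cluster(load_hours, processing_hours, jobs_per_day):
    """Slide's rule of thumb: if one day's total runtime fits within
    24 hours, a transient cluster is the cheaper choice."""
    total = (load_hours + processing_hours) * jobs_per_day
    return "transient" if total < 24 else "alive"

# Slide examples: 20 min data load + 1 hour processing per job
print(choose_cluster(20 / 60, 1.0, 10))  # ~13.3 h < 24 h -> transient
print(choose_cluster(20 / 60, 1.0, 20))  # ~26.7 h > 24 h -> alive
```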
  • 19. When to use Transient cluster? (20 min data load + 1 hour processing time) * 10 jobs ≈ 13.3 hours < 24 hours = Use Transient Clusters
  • 20. Alive Clusters • Very similar to traditional Hadoop deployments • Cluster stays around after the job is done • Data persistence model: • Amazon S3 • Amazon S3 Copy To HDFS • HDFS and Amazon S3 as backup
  • 21. Alive Clusters • Always keep data safe on Amazon S3 even if you’re using HDFS for primary storage • Get in the habit of shutting down your cluster and start a new one, once a week or month • Design your data processing workflow to account for failure • You can use workflow managements such as AWS Data Pipeline
  • 22. Benefits of Alive Clusters • Ability to share data between multiple jobs • Transient clusters share data via Amazon S3; long-running clusters share it via HDFS
  • 23. Benefit of Alive Clusters • Cost-effective for repetitive jobs
  • 24. When to use Alive cluster? If (Data Load Time + Processing Time) * Number Of Jobs > 24 hours Use Alive Clusters Else Use Transient Clusters
  • 25. When to use Alive cluster? (20 min data load + 1 hour processing time) * 20 jobs ≈ 26.7 hours > 24 hours = Use Alive Clusters
  • 26. Pattern #2: Core & Task nodes
  • 27. Core Nodes Amazon EMR cluster • Master instance group • Core instance group: runs TaskTrackers (compute) and DataNodes (HDFS)
  • 28. Core Nodes Amazon EMR cluster • Can add core nodes
  • 29. Core Nodes Amazon EMR cluster • Adding core nodes gives more HDFS space and more CPU/memory
  • 30. Core Nodes Amazon EMR cluster • Can't remove core nodes because of HDFS
  • 31. Amazon EMR Task Nodes Amazon EMR cluster • Run TaskTrackers • No HDFS; read from the core nodes' HDFS
  • 32. Amazon EMR Task Nodes Amazon EMR cluster • Can add task nodes
  • 33. Amazon EMR Task Nodes Amazon EMR cluster • More CPU power • More memory
  • 34. Amazon EMR Task Nodes Amazon EMR cluster • You can remove task nodes
  • 35. Amazon EMR Task Nodes Amazon EMR cluster • You can remove task nodes
  • 36. Tasknode Use-Case #1 • Speed up job processing using Spot market • Run task nodes on Spot market • Get discount on hourly price • Nodes can come and go without interruption to your cluster
  • 37. Tasknode Use-Case #2 • When you need extra horse power for a short amount of time • Example: Need to pull large amount of data from Amazon S3
  • 39. Add Spot task nodes to load data from Amazon S3 • Example: m1.xlarge Spot task nodes feeding HS1 core nodes (48 TB HDFS)
  • 40. Example: remove the Spot task nodes after the data load from Amazon S3 completes
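A Spot task instance group like the one in this example can be described with a small payload of the shape used by EMR's AddInstanceGroups call; the instance type, count, bid price, and the placeholder job-flow id below are all illustrative:

```python
def spot_task_group(instance_type, count, bid_price):
    """Instance-group definition for Spot task nodes (the shape used by
    EMR's AddInstanceGroups API; values here are illustrative)."""
    return {
        "InstanceRole": "TASK",   # task nodes hold no HDFS, so safe to remove
        "Market": "SPOT",
        "BidPrice": bid_price,    # dollars/hour, as a string
        "InstanceType": instance_type,
        "InstanceCount": count,
    }

group = spot_task_group("m1.xlarge", 4, "0.10")
# boto3.client("emr").add_instance_groups(
#     JobFlowId="j-XXXXXXXXXXXXX", InstanceGroups=[group])  # not executed here
```

Because the group is marked `TASK`, it can later be shrunk to zero without touching HDFS, which is exactly the add-then-remove pattern on slides 39-40.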
  • 41. Pattern #3: Amazon S3 as HDFS
  • 42. Amazon S3 as HDFS Amazon EMR cluster • Use Amazon S3 as your permanent data store • HDFS for temporary storage of data between jobs • No additional step to copy data to HDFS
  • 43. Benefits: Amazon S3 as HDFS • Ability to shut down your cluster HUGE Benefit!! • Use Amazon S3 as your durable storage 11 9s of durability
  • 44. Benefits: Amazon S3 as HDFS • No need to scale HDFS • Capacity • Replication for durability • Amazon S3 scales with your data • Both in IOPs and data storage
  • 45. Benefits: Amazon S3 as HDFS • Ability to share data between multiple clusters • Hard to do with HDFS
  • 46. Benefits: Amazon S3 as HDFS • Take advantage of Amazon S3 features • Amazon S3 ServerSideEncryption • Amazon S3 LifeCyclePolicy • Amazon S3 versioning to protect against corruption • Build elastic clusters • Add nodes to read from Amazon S3 • Remove nodes with data safe on Amazon S3
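The lifecycle-policy feature mentioned here is just a rule attached to the bucket; the sketch below builds one rule of the shape used by S3's lifecycle configuration API (the rule id, prefix, and day counts are illustrative):

```python
def lifecycle_rule(prefix, glacier_after_days, expire_after_days):
    """S3 lifecycle rule: transition old job output to Glacier, then
    expire it (shape used by put-bucket-lifecycle-configuration;
    values are illustrative)."""
    return {
        "ID": "archive-job-output",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": glacier_after_days,
                         "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }

rule = lifecycle_rule("output/", 30, 365)
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration={"Rules": [rule]})  # not executed
```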
  • 47. What About Data Locality? • Run your job in the same region as your Amazon S3 bucket • Amazon EMR nodes have high-speed connectivity to Amazon S3 • If your job is CPU/memory-bound, data locality doesn't make a difference
  • 48. Anti-Pattern: Amazon S3 as HDFS • Iterative workloads – if you're processing the same dataset more than once • Disk I/O-intensive workloads
  • 49. Pattern #4: Amazon S3 & HDFS
  • 50. Amazon S3 & HDFS 1. Data persists on Amazon S3
  • 51. Amazon S3 & HDFS 2. Launch Amazon EMR and copy data to HDFS with S3DistCp
  • 52. Amazon S3 & HDFS 3. Start processing data on HDFS
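On recent EMR releases, the S3DistCp copy in step 2 can be submitted as a cluster step via `command-runner.jar`; a sketch (step name and paths are illustrative):

```python
def s3distcp_step(src, dest):
    """EMR step definition that runs S3DistCp to copy data from
    Amazon S3 into HDFS (command-runner.jar invocation, as on
    recent EMR releases; paths are illustrative)."""
    return {
        "Name": "Copy input to HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", src,
                     "--dest", dest],
        },
    }

step = s3distcp_step("s3://my-bucket/input/", "hdfs:///input/")
# boto3.client("emr").add_job_flow_steps(
#     JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])  # not executed here
```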
  • 53. Benefits: Amazon S3 & HDFS • Better pattern for I/O-intensive workloads • Amazon S3 benefits discussed previously apply • Durability • Scalability • Cost • Features: lifecycle policy, security
  • 55. Amazon EMR Elastic Cluster (m) 1. Start cluster with certain number of nodes
  • 56. Amazon EMR Elastic Cluster (m) 2. Monitor your cluster with Amazon CloudWatch metrics • Map Tasks Running • Map Tasks Remaining • Cluster Idle? • Avg. Jobs Failed
  • 57. Amazon EMR Elastic Cluster (m) 3. Increase the number of nodes as you need more capacity by manually calling the API
  • 58. Amazon EMR Elastic Cluster (a) 1. Start your cluster with certain number of nodes
  • 59. Amazon EMR Elastic Cluster (a) 2. Monitor cluster capacity with Amazon CloudWatch metrics • Map Tasks Running • Map Tasks Remaining • Cluster Idle? • Avg. Jobs Failed
  • 60. Amazon EMR Elastic Cluster (a) 3. Get HTTP Amazon SNS notification to a simple app deployed on Elastic Beanstalk
  • 61. Amazon EMR Elastic Cluster (a) 4. Your app calls the API to add nodes to your cluster API
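The scale-up app described in steps 1-4 boils down to a small piece of decision logic plus one EMR API call. Below is a minimal sketch of that logic; the function name, the 8-mappers-per-node assumption, and the `max_nodes` cap are illustrative, not the talk's code, and the resize call shown in the comment uses boto3's `modify_instance_groups`.

```python
def desired_task_nodes(current_nodes, tasks_remaining, tasks_running,
                       mappers_per_node=8, max_nodes=50):
    """Grow the task group when queued map tasks exceed spare mapper capacity.

    mappers_per_node=8 assumes m1.xlarge-class nodes; max_nodes is a
    safety cap so a burst of work can't scale the cluster without bound.
    """
    spare = current_nodes * mappers_per_node - tasks_running
    if tasks_remaining <= spare:
        return current_nodes  # enough capacity already; do nothing
    extra = -(-(tasks_remaining - spare) // mappers_per_node)  # ceil division
    return min(current_nodes + extra, max_nodes)

# 10 nodes, 500 map tasks queued, 80 running -> scale up to the cap
print(desired_task_nodes(10, 500, 80))  # 50

# The resize itself would be a single EMR API call, e.g. with boto3
# (instance group ID below is a placeholder):
# import boto3
# emr = boto3.client("emr")
# emr.modify_instance_groups(InstanceGroups=[
#     {"InstanceGroupId": "ig-XXXXXXXX", "InstanceCount": new_count}])
```

The CloudWatch metrics listed above (Map Tasks Running, Map Tasks Remaining) map directly onto the `tasks_running` and `tasks_remaining` inputs.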
  • 62. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Amazon Controlling Cost with EMR Advanced Optimizations
• 63. Amazon EMR Nodes and Size • Use m1.small, m1.large, c1.medium for functional testing • Use m1.xlarge and larger nodes for production workloads
• 64. Amazon EMR Nodes and Size • Use CC2 for memory- and CPU-intensive jobs • Use CC2/c1.xlarge for CPU-intensive jobs • hs1 instances for HDFS workloads
• 65. Amazon EMR Nodes and Size • hi1 and hs1 instances for disk I/O-intensive workloads • CC2 instances are more cost effective than m2.4xlarge • Prefer a smaller cluster of larger nodes to a larger cluster of smaller nodes
  • 66. Holy Grail Question How many nodes do I need?
• 67. Introduction to Hadoop Splits • Depends on how much data you have • And how fast you’d like your data to be processed
  • 68. Introduction to Hadoop Splits Before we understand Amazon EMR capacity planning, we need to understand Hadoop’s inner working of splits
• 69. Introduction to Hadoop Splits • Data gets broken up into splits (64 MB or 128 MB) Data 128MB Splits
  • 70. Introduction to Hadoop Splits • Splits get packaged into mappers Data Splits Mappers
• 71. Introduction to Hadoop Splits • Mappers get assigned to nodes for processing Mappers Instances
  • 72. Introduction to Hadoop Splits • More data = More splits = More mappers
  • 73. Introduction to Hadoop Splits • More data = More splits = More mappers Queue
• 74. Introduction to Hadoop Splits • More mappers than cluster mapper capacity = mappers wait for capacity = processing delay Queue
  • 75. Introduction to Hadoop Splits • More nodes = reduced queue size = faster processing Queue
  • 76. Calculating the Number of Splits for Your Job Uncompressed files: Hadoop splits a single file to multiple splits. Example: 128MB = 2 splits based on 64MB split size
  • 77. Calculating the Number of Splits for Your Job Compressed files: 1. Splittable compressions: same logic as uncompressed files 64MB BZIP 128MB BZIP
  • 78. Calculating the Number of Splits for Your Job Compressed files: 2. Unsplittable compressions: the entire file is a single split. 128MB GZ 128MB GZ
  • 79. Calculating the Number of Splits for Your Job Number of splits If data files have unsplittable compression # of splits = number of files Example: 10 GZ files = 10 mappers
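The split-counting rules on the last few slides can be sketched as a small calculation. This is a simplified sketch: real Hadoop split logic also considers HDFS block boundaries and minimum/maximum split-size settings, and the function name is illustrative.

```python
import math

def count_splits(file_sizes_mb, split_size_mb=64, splittable=True):
    """Estimate the number of map splits for a set of input files.

    Unsplittable compression (e.g., GZIP): one split per file.
    Uncompressed files or splittable compression (e.g., BZIP2):
    each file is divided into split_size_mb chunks.
    """
    if not splittable:
        return len(file_sizes_mb)
    return sum(math.ceil(size / split_size_mb) for size in file_sizes_mb)

# A 128 MB uncompressed file with a 64 MB split size -> 2 splits
print(count_splits([128]))  # 2

# Ten GZIP files -> 10 splits (and therefore 10 mappers), regardless of size
print(count_splits([128] * 10, splittable=False))  # 10
```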
  • 80. Cluster Sizing Calculation Just tell me how many nodes I need for my job!!
  • 81. Cluster Sizing Calculation 1. Estimate the number of mappers your job requires.
  • 82. Cluster Sizing Calculation 2. Pick an instance and note down the number of mappers it can run in parallel M1.xlarge = 8 mappers in parallel
  • 83. Cluster Sizing Calculation 3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2.
  • 84. Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
• 85. Cluster Sizing Calculation Estimated number of nodes = (Total Mappers × Time To Process Sample Files) / (Instance Mapper Capacity × Desired Processing Time)
  • 86. Example: Cluster Sizing Calculation 1. Estimate the number of mappers your job requires 150 2. Pick an instance and note down the number of mappers it can run in parallel m1.xlarge with 8 mapper capacity per instance
  • 87. Example: Cluster Sizing Calculation 3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2. 8 files selected for our sample test
  • 88. Example: Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files. 3 min to process 8 files
• 89. Cluster Sizing Calculation Estimated number of nodes = (Total Mappers For Your Job × Time To Process Sample Files) / (Per-Instance Mapper Capacity × Desired Processing Time) = (150 × 3 min) / (8 × 5 min) ≈ 11 m1.xlarge
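The sizing formula above can be written as a quick helper. A sketch only: the function name is illustrative, and in practice you would round up and add headroom rather than round to the nearest node.

```python
def estimate_nodes(total_mappers, sample_time_min,
                   mappers_per_instance, desired_time_min):
    """Estimated nodes = (total mappers * time to process sample files)
    / (per-instance mapper capacity * desired processing time)."""
    return (total_mappers * sample_time_min) / (mappers_per_instance * desired_time_min)

# 150 mappers, 3 min to process the 8 sample files on one m1.xlarge
# (8 mappers in parallel), desired job time of 5 min:
nodes = estimate_nodes(150, 3, 8, 5)
print(round(nodes))  # ~11 m1.xlarge instances, matching the slide
```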
  • 90. File Size Best Practices • Avoid small files at all costs • Anything smaller than 100MB • Each mapper is a single JVM • CPU time required to spawn JVMs/mappers
• 91. File Size Best Practices Mappers take 2 sec to spawn up and be ready for processing 10 TB in 100 MB files = 100,000 mappers * 2 sec = ~55 hours of mapper CPU setup time
• 92. File Size Best Practices Mappers take 2 sec to spawn up and be ready for processing 10 TB in 1,000 MB files = 10,000 mappers * 2 sec = ~5.5 hours of mapper CPU setup time
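The overhead arithmetic on the last two slides checks out directly. A sketch, assuming the slides' figure of 2 seconds of JVM spawn cost per mapper and one mapper per file:

```python
def mapper_setup_hours(total_bytes, file_size_bytes, spawn_sec=2):
    """Total JVM spawn overhead when each input file becomes one mapper."""
    mappers = total_bytes // file_size_bytes
    return mappers * spawn_sec / 3600

TB = 10**12
MB = 10**6

# 10 TB in 100 MB files: 100,000 mappers -> ~55 hours of setup time
print(mapper_setup_hours(10 * TB, 100 * MB))

# 10 TB in 1,000 MB files: 10,000 mappers -> ~5.5 hours of setup time
print(mapper_setup_hours(10 * TB, 1000 * MB))
```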
  • 93. File Size on Amazon S3: Best Practices • What’s the best Amazon S3 file size for Hadoop? About 1-2GB • Why?
• 94. File Size on Amazon S3: Best Practices • Life of mapper should not be less than 60 sec • A single mapper can get 10-15 MB/s of throughput to Amazon S3 60 sec * 15 MB/s ≈ 1 GB
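The 1-2 GB rule of thumb falls out of those two numbers. A sketch, using the slide's figures of a 60-second minimum mapper lifetime and ~15 MB/s of per-mapper S3 throughput:

```python
def target_s3_file_size_mb(min_mapper_lifetime_sec=60, s3_throughput_mb_s=15):
    """Smallest file size that keeps one mapper busy for its minimum lifetime."""
    return min_mapper_lifetime_sec * s3_throughput_mb_s

print(target_s3_file_size_mb())  # 900 MB, i.e., roughly 1 GB
```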
  • 95. Holy Grail Question What if I have small file issues?
  • 96. Dealing with Small Files • Use S3DistCP to combine smaller files together • S3DistCP takes a pattern and target file to combine smaller input files to larger ones
• 97. Dealing with Small Files Example: ./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf, --dest,hdfs:///local, --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*, --targetSize,128'
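The --groupBy pattern works by capturing a regex group that becomes the grouping key: files whose captured group is equal are concatenated into one output file. A quick way to sanity-check the pattern before running S3DistCp (the file names below are illustrative CloudFront-style log names; the XABCD12345678 distribution ID is from the slide's example, and the dot after it is escaped here for a strict match):

```python
import re

# Same capture group as the S3DistCp --groupBy argument: the date-hour
# string becomes the key that small files are combined under.
pattern = re.compile(r".*XABCD12345678\.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*")

files = [
    "XABCD12345678.2013-11-13-08.abc123.gz",
    "XABCD12345678.2013-11-13-08.def456.gz",
    "XABCD12345678.2013-11-13-09.ghi789.gz",
]
groups = {}
for name in files:
    m = pattern.match(name)
    if m:
        groups.setdefault(m.group(1), []).append(name)

# Two groups: the two hour-08 files would be merged; hour 09 stays alone
print(sorted(groups))
```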
  • 98. Compressions • Compress as much as you can • Compress Amazon S3 input data files – Reduces cost – Speed up Amazon S3->mapper data transfer time
  • 99. Compressions • Always Compress Data Files On Amazon S3 • Reduces Storage Cost • Reduces Bandwidth Between Amazon S3 and Amazon EMR • Speeds Up Your Job
  • 100. Compressions • Compress Mappers and Reducer Output • Reduces Disk IO
• 101. Compressions • Compression types: – Some are fast but offer less space reduction – Some are space efficient but slower – Some are splittable and some are not
Algorithm  % Space Remaining  Encoding Speed  Decoding Speed
GZIP       13%                21 MB/s         118 MB/s
LZO        20%                135 MB/s        410 MB/s
Snappy     22%                172 MB/s        409 MB/s
  • 102. Compressions • If You Are Time Sensitive, Faster Compressions Are A Better Choice • If You Have Large Amount Of Data, Use Space Efficient Compressions • If You Don’t Care, Pick GZIP
• 103. Change Compression Type • You may decide to change compression type • Use S3DistCP to change the compression type of your files • Example: ./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf, --dest,hdfs:///local, --outputCodec,lzo'
  • 104. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Amazon Controlling Cost with EMR Advanced Optimizations
• 105. Architecting for cost • AWS pricing models: – On-demand: pay-as-you-go model – Spot: marketplace; bid for instances and get a discount – Reserved Instance: upfront payment (for 1 or 3 years) for a reduction in overall monthly payment
  • 106. Reserved Instances use-case For alive and long-running clusters
  • 107. Reserved Instances use-case For ad-hoc and unpredictable workloads, use medium utilization
  • 108. Reserved Instances use-case For unpredictable workloads, use Spot or on-demand pricing
  • 109. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Amazon Controlling Cost with EMR Advanced Optimizations
• 110. Adv. Optimizations (Stage 1) • The best optimization is to structure your data (i.e., smart data partitioning) • Efficient data structuring = less data processed by Hadoop = faster jobs
  • 111. Adv. Optimizations (Stage 1) • Hadoop is a batch processing framework • Data processing time = an hour to days • Not a great use-case for shorter jobs • Other frameworks may be a better fit – Twitter Storm – Spark – Amazon Redshift, etc.
  • 112. Adv. Optimizations (Stage 1) • Amazon EMR team has done a great deal of optimization already • For smaller clusters, Amazon EMR configuration optimization won’t buy you much – Remember you’re paying for the full hour cost of an instance
  • 113. Adv. Optimizations (Stage 1) Best Optimization??
  • 114. Adv. Optimizations (Stage 1) Add more nodes
  • 115. Adv. Optimizations (Stage 2) • Monitor your cluster using Ganglia • Amazon EMR has Ganglia bootstrap action
  • 116. Adv. Optimizations (Stage 2) • Monitor and look for bottlenecks – Memory – CPU – Disk IO – Network IO
  • 120. Network IO • Most important metric to watch for if using Amazon S3 for storage • Goal: Drive as much network IO as possible from a single instance
• 121. Network IO • Larger instances can drive > 600 Mbps • Cluster Compute instances can drive 1-2 Gbps • Optimize to get more out of your instance throughput – Add more mappers?
  • 122. Network IO • If you’re using Amazon S3 with Amazon EMR, monitor Ganglia and watch network throughput. • Your goal is to maximize your NIC throughput by having enough mappers per node
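One way to reason about "enough mappers per node" is to divide the NIC throughput by the per-mapper S3 throughput. A rough sketch: the ~15 MB/s per-mapper figure and the 600 Mbps / 1-2 Gbps instance numbers come from the earlier slides, and real throughput varies.

```python
import math

def mappers_to_saturate_nic(nic_mbps, per_mapper_mb_s=15):
    """Mappers needed to fill the NIC, at ~15 MB/s of S3 throughput each."""
    per_mapper_mbps = per_mapper_mb_s * 8  # convert MB/s to Mbps
    return math.ceil(nic_mbps / per_mapper_mbps)

# Larger instance, >600 Mbps -> about 5 concurrent mappers per node
print(mappers_to_saturate_nic(600))

# Cluster Compute instance at ~2 Gbps -> about 17 concurrent mappers
print(mappers_to_saturate_nic(2000))
```

If Ganglia shows network utilization well below these ceilings, raising the per-node mapper count is the first knob to try.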
  • 123. Network IO, Example Low network utilization Increase number of mappers if possible to drive more traffic
  • 124. CPU • Watch for CPU utilization of your clusters • If >50% idle, increase # of mapper/reducer per instance – Reduces the number of nodes and reduces cost
  • 125. Example Adv. Optimizations (Stage 2) What potential optimization do you see in this graph?
  • 126. Example Adv. Optimizations (Stage 2) 40% CPU idle. Maybe add more mappers?
• 127. Disk IO • Limit the amount of disk IO • Can increase mapper/reducer memory • Compress data anywhere you can • Monitor cluster and pay attention to HDFS bytes written metrics • One place to pay attention to is mapper/reducer disk spill
  • 128. Disk IO • Mapper has in memory buffer mapper mapper memory buffer
  • 129. Disk IO • When memory gets full, data spills to disk mapper mapper memory buffer data spills to disk
  • 131. Disk IO • If you see mapper/reducer excessive spill to disk, increase buffer memory per mapper • Excessive spill when ratio of “MAPPER_SPILLED_RECORDS” and “MAPPER_OUTPUT_RECORDS” is more than 1
  • 132. Disk IO Example: MAPPER_SPILLED_RECORDS = 221200123 MAPPER_OUTPUT_RECORDS = 101200123
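Using the counter values from the slide, the spill check is a one-line ratio. A sketch: the counter names mirror the job's Hadoop counter output, and the helper function name is illustrative.

```python
def excessive_spill(spilled_records, output_records):
    """Spill ratio > 1 means records hit disk more than once on their way out."""
    return spilled_records / output_records > 1

MAPPER_SPILLED_RECORDS = 221200123
MAPPER_OUTPUT_RECORDS = 101200123

ratio = MAPPER_SPILLED_RECORDS / MAPPER_OUTPUT_RECORDS
print(round(ratio, 2))  # ~2.19: well above 1, so increase io.sort.mb
print(excessive_spill(MAPPER_SPILLED_RECORDS, MAPPER_OUTPUT_RECORDS))  # True
```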
• 133. Disk IO • Increase mapper buffer memory by increasing “io.sort.mb” <property><name>io.sort.mb</name><value>200</value></property> • Same logic applies to reducers
  • 134. Disk IO • Monitor disk IO using Ganglia • Look out for disk IO wait
  • 137. Please give us your feedback on this presentation BDT404 As a thank you, we will select prize winners daily for completed surveys!