© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Amazon Elastic MapReduce:
Deep Dive and Best Practices
Ian Meyers, AWS (meyersi@)
John Telford, Channel 4 (jtelford@)
April 30, 2014
Outline
Introduction to Amazon EMR
Architecting EMR for Cost
Amazon EMR Design Patterns
Amazon EMR Best Practices
What is EMR?
•  Map-Reduce Engine
•  Vibrant Ecosystem
•  Hadoop-as-a-Service
•  Massively Parallel
•  Cost Effective AWS Wrapper
•  Integrated with AWS services
[Diagram: Amazon EMR and HDFS, with analytics languages and data management tools, integrated with Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift and AWS Data Pipeline]
Amazon EMR Introduction
•  Launch clusters of any size in a matter of minutes
•  Use a variety of different instance sizes that match your workload
Amazon EMR Introduction
•  Don’t get stuck with hardware
•  Don’t deal with capacity planning
•  Run multiple clusters with different sizes, specs and node types
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practices (400)
Outline
Introduction to Amazon EMR
Architecting EMR for Cost
Amazon EMR Design Patterns
Amazon EMR Best Practices
Architecting for cost
•  EC2/EMR pricing models:
–  On-demand: pay-as-you-go model.
–  Spot: marketplace. Bid for instances and get a discount.
–  Reserved Instance: upfront payment (for 1 or 3 years) in return for a reduction in the overall monthly payment.
Architecting for cost
•  On-demand
–  Research & Development, Data Science
•  Spot
–  Restartable Tasks
–  Embarrassingly Parallel Workloads
•  Reserved Instance
–  Well Understood, Frequent and Predictable Workloads
EMR Architecture for Optimal Cost
Use Heavy Utilisation RIs for alive and long-running clusters
Use Medium Utilisation RIs for ad-hoc and unpredictable workloads
EMR Architecture for Optimal Cost
Supplement with Spot for unpredictable workloads or a "Turbo Boost"
Outline
Introduction to Amazon EMR
Architecting EMR for Cost
Amazon EMR Design Patterns
Amazon EMR Best Practices
Amazon EMR Design Patterns
Pattern #1: Transient vs. Alive Clusters
Pattern #2: Core Nodes and Task Nodes
Pattern #3: Amazon S3 & HDFS
Pattern #1: Transient vs. Alive Clusters
Pattern #1: Transient Clusters
•  Cluster lives for the duration of the job
•  Shut down the cluster when the job is done
•  Data persists on Amazon S3
•  Input & output data on Amazon S3
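A transient cluster can be sketched as a request for the `run_job_flow` call in boto3 (the AWS SDK for Python, which the talk does not mention). This is a minimal sketch, not the presenters' setup: the function name, release label, instance types, role names and the Hive step are all illustrative assumptions, and the request is only built here, not sent to AWS.

```python
def transient_cluster_request(name, input_uri, output_uri, script_uri):
    """Build keyword arguments for boto3's emr_client.run_job_flow().

    Every value below is an illustrative placeholder, not a setting from the talk.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: any current EMR release
        "Applications": [{"Name": "Hive"}],
        "LogUri": output_uri.rstrip("/") + "/logs/",
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption
            "SlaveInstanceType": "m5.xlarge",   # assumption
            "InstanceCount": 3,
            # False means the cluster terminates once the last step finishes,
            # which is what makes it transient.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [{
            "Name": "process",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Input and output both live on Amazon S3, so nothing is lost
                # when the cluster goes away.
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", script_uri,
                         "-d", f"INPUT={input_uri}",
                         "-d", f"OUTPUT={output_uri}"],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_cluster_request(
    "nightly-job", "s3://example-in/", "s3://example-out/", "s3://example-code/job.hql")
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`.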
Benefits of Transient Clusters
1.  Control your cost
2.  Minimum maintenance
•  Cluster goes away when job is done
3.  Practice cloud architecture
•  Pay for what you use
•  Data processing as a workflow
Alive Clusters
•  Very similar to traditional Hadoop deployments
•  Cluster stays around after the job is done
•  Data persistence model:
–  Amazon S3
–  Amazon S3, copied to HDFS
–  HDFS, with Amazon S3 as backup
Alive Clusters
•  Always keep data safe on Amazon S3, even if you’re using HDFS for primary storage
•  Get in the habit of shutting down your cluster and starting a new one once a week or month
•  Design your data processing workflow to account for failure
•  You can use a workflow manager such as AWS Data Pipeline
Pattern #2: Core & Task nodes
Core Nodes
•  Run TaskTrackers (compute)
•  Run DataNode (HDFS)
•  Can add core nodes: more HDFS space, more CPU/memory
•  Can’t remove core nodes because of HDFS
Amazon EMR Task Nodes
•  Run TaskTrackers
•  No HDFS
•  Read from core node HDFS
•  Can add task nodes: more CPU power, more memory
•  You can remove task nodes when processing is completed
[Diagram: Amazon EMR cluster with a master instance group, a core instance group (HDFS) and a task instance group]
Task Node Use-Cases
•  Speed up job processing using the Spot market
–  Run task nodes on the Spot market
•  Get a discount on the hourly price
–  Nodes can come and go without interruption to your cluster
•  When you need extra horsepower for a short amount of time
–  Example: you need to pull a large amount of data from Amazon S3
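The core/task split can be sketched as an `InstanceGroups` list for boto3's `run_job_flow`, with master and core nodes on-demand and task nodes on Spot. This is a hedged illustration: the function name, instance types, counts and bid price are assumptions rather than values from the talk.

```python
def instance_groups(core_count, task_count, task_bid_usd):
    """Return an InstanceGroups list for boto3's run_job_flow (illustrative values)."""
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        # Core nodes hold HDFS, so keep them on on-demand capacity.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": core_count},
        # Task nodes hold no HDFS data, so losing them to the Spot market
        # costs compute capacity but no data.
        {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": f"{task_bid_usd:.3f}",  # the API expects a string
         "InstanceType": "m5.xlarge", "InstanceCount": task_count},
    ]


groups = instance_groups(core_count=2, task_count=4, task_bid_usd=0.10)
for group in groups:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

The list would go under `Instances["InstanceGroups"]` in the request; the task group can later be resized without touching HDFS.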
Pattern #3: Amazon S3 & HDFS
Option 1: Amazon S3 as HDFS
•  Use Amazon S3 as your permanent data store
•  HDFS for temporary storage of data between jobs
•  No additional step to copy data to HDFS
[Diagram: Amazon EMR cluster (core instance group with HDFS, task instance group) and Amazon S3]
Benefits: Amazon S3 as HDFS
•  Ability to shut down your cluster
HUGE Benefit!!
•  Use Amazon S3 as your durable storage
11 9s of durability
Benefits: Amazon S3 as HDFS
•  No need to scale HDFS
•  Capacity
•  Replication for durability
•  Amazon S3 scales with your data
•  Both in IOPS and data storage
Benefits: Amazon S3 as HDFS
•  Ability to share data between multiple clusters
•  Hard to do with HDFS
[Diagram: two EMR clusters sharing data on Amazon S3]
Benefits: Amazon S3 as HDFS
•  Take advantage of Amazon S3 features
•  Amazon S3 Server Side Encryption
•  Amazon S3 Lifecycle Policies
•  Amazon S3 versioning to protect against corruption
•  Build elastic clusters
•  Add nodes to read from Amazon S3
•  Remove nodes with data safe on Amazon S3
What About Data Locality?
•  Run your job in the same region as your Amazon S3 bucket
•  Amazon EMR nodes have high-speed connectivity to Amazon S3
•  If your job is CPU/memory-bound, locality doesn’t make a huge difference
Anti-Pattern: Amazon S3 as HDFS
•  Iterative workloads
–  If you’re processing the same dataset more than once
•  Disk I/O intensive workloads
Option 2: Optimise for Latency with HDFS
1.  Data persisted on Amazon S3
2.  Launch Amazon EMR and copy data to HDFS with S3DistCp
3.  Start processing data on HDFS
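The copy from Amazon S3 to HDFS can be sketched as an EMR step definition. This assumes a current EMR release, where S3DistCp is invoked as `s3-dist-cp` through `command-runner.jar` (the talk itself used the older `elastic-mapreduce` CLI); the function name, bucket and paths are hypothetical.

```python
def s3distcp_step(src, dest):
    """Build an EMR step that copies data with S3DistCp (illustrative sketch)."""
    return {
        "Name": "Copy input from S3 to HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Hypothetical bucket and HDFS path.
step = s3distcp_step("s3://example-bucket/input/", "hdfs:///input/")
print(" ".join(step["HadoopJarStep"]["Args"]))
```

Placing this step first in the `Steps` list means later steps can read from `hdfs:///input/` while Amazon S3 stays the system of record.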
Benefits: HDFS instead of S3
•  Better pattern for I/O-intensive workloads
•  Amazon S3 as system of record
•  Durability
•  Scalability
•  Cost
•  Features: lifecycle policy, security
Outline
Introduction to Amazon EMR
Architecting EMR for Cost
Amazon EMR Design Patterns
Amazon EMR Best Practices
Amazon EMR Nodes and Size
•  Use m1 and c1 family for functional testing
•  Use m3 and c3 xlarge and larger nodes for
production workloads
•  Use cc2/c3 for memory and CPU intensive jobs
•  hs1, hi1, i2 instances for HDFS workloads
•  Prefer a smaller cluster of larger nodes
Holy Grail Question
How many nodes do I need?
Cluster Sizing Calculation
1.  Estimate the number of mappers your job requires.
Cluster Sizing Calculation
2.  Pick an instance and note down the number of mappers it can run in parallel
m1.xlarge = 8 mappers in parallel
Resource capability per instance type:
EC2 Instance Type          Mappers  Reducers
m1.small                   2        1
m1.large                   3        1
m1.xlarge                  8        3
m2.xlarge                  3        1
m2.2xlarge                 6        2
m2.4xlarge                 14       4
m3.xlarge                  6        1
m3.2xlarge                 12       3
cc2.8xlarge                24       6
c3.4xlarge                 24       6
hi1.4xlarge                24       6
hs1.8xlarge                24       6
cr1.8xlarge & c3.8xlarge   48       12
Cluster Sizing Calculation
3.  Pick some sample data files to run a test workload. The number of sample files should equal the number of parallel mappers from step #2.
Cluster Sizing Calculation
4.  Run an Amazon EMR cluster with a single core node and process your sample files from step #3. Note down the time taken to process your sample files.
Cluster Sizing Calculation
Estimated Number Of Nodes =
(Total Mappers * Time To Process Sample Files) / (Instance Mapper Capacity * Desired Processing Time)
Example: Cluster Sizing Calculation
1.  Estimate the number of mappers your job requires: 150
2.  Pick an instance and note down the number of mappers it can run in parallel: m1.xlarge with 8 mapper capacity per instance
3.  Pick some sample data files to run a test workload. The number of sample files should equal the mapper capacity from step #2: 8 files selected for our sample test
4.  Run an Amazon EMR cluster with a single core node and process your sample files from step #3. Note down the time taken: 3 min to process 8 files
Cluster Sizing Calculation
Estimated number of nodes =
(Total Mappers For Your Job * Time To Process Sample Files) / (Per Instance Mapper Capacity * Desired Processing Time)
= (150 * 3 min) / (8 * 5 min), with a desired processing time of 5 min
= 11.25, i.e. about 11 m1.xlarge
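The sizing formula is simple enough to sketch as a small function. The function and parameter names are made up for illustration; the inputs are the figures from the worked example.

```python
import math


def estimate_nodes(total_mappers, sample_minutes, mappers_per_instance, desired_minutes):
    """Estimated nodes = (total mappers * sample time) / (mapper capacity * desired time)."""
    return (total_mappers * sample_minutes) / (mappers_per_instance * desired_minutes)


estimate = estimate_nodes(total_mappers=150, sample_minutes=3,
                          mappers_per_instance=8, desired_minutes=5)
print(estimate)             # 11.25
print(math.ceil(estimate))  # 12
```

The raw estimate is 11.25. The slide quotes 11 nodes; rounding up to 12 is the more conservative choice if the 5-minute target must be met.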
File Best Practices
•  Avoid small files at all costs (smaller than 100MB)
•  Use Compression
Holy Grail Question
What if I have small-file issues?
Dealing with Small Files
•  Use S3DistCp to combine smaller files together
•  S3DistCp takes a pattern and a target file size to combine smaller input files into larger ones
./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128'
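To see what the `--groupBy` pattern does, here is a rough Python imitation of the grouping: files whose names match the pattern and share the same value for the parenthesised group are candidates to be combined into one output file. The file names are invented for illustration, and this only mimics the grouping; it does not call S3DistCp.

```python
import re
from collections import defaultdict

# The --groupBy expression from the command above.
GROUP_BY = r".*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*"


def group_files(names, pattern=GROUP_BY):
    """Group file names by the value captured by the pattern's first group."""
    groups = defaultdict(list)
    for name in names:
        match = re.fullmatch(pattern, name)
        if match:  # names that do not match the pattern are left out
            groups[match.group(1)].append(name)
    return dict(groups)


# Hypothetical log file names.
names = [
    "cf/XABCD12345678.2012-02-23-01.HLUS3JKx.gz",
    "cf/XABCD12345678.2012-02-23-01.I9CNAZrg.gz",
    "cf/XABCD12345678.2012-02-23-02.YRRwERSA.gz",
    "cf/README.txt",
]
for key, members in group_files(names).items():
    print(key, len(members))
```

Here the two files for hour `2012-02-23-01` fall into one group and the `02` file into another, while `README.txt` is skipped.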
Compression
•  Always Compress Data Files On Amazon S3
•  Reduces Bandwidth Between Amazon S3 and Amazon EMR
•  Speeds Up Your Job
•  Compress Mapper and Reducer Output
•  Compression Types:
–  Some are fast BUT offer less space reduction
–  Some are space efficient BUT slower
–  Some are splittable and some are not
Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21MB/s           118MB/s
LZO         20%                 135MB/s          410MB/s
Snappy      22%                 172MB/s          409MB/s
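The speed versus space trade-off can be measured on your own data. The sketch below uses only codecs from the Python standard library (zlib, which implements the DEFLATE algorithm behind GZIP, plus bz2 and lzma) as stand-ins, because LZO and Snappy need third-party packages. The sample data is made up and very repetitive, so the numbers will not match the table.

```python
import bz2
import lzma
import time
import zlib


def measure(name, compress, decompress, data):
    """Return % space remaining and encode speed for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    if decompress(packed) != data:
        raise ValueError(f"{name} did not round-trip")
    return {
        "codec": name,
        "pct_remaining": 100.0 * len(packed) / len(data),
        "encode_mb_s": len(data) / 1e6 / elapsed if elapsed > 0 else float("inf"),
    }


# Made-up, highly repetitive log lines; real data will compress differently.
sample = b"2014-04-30\tGET\t/programmes/example\t200\n" * 20000

results = [
    measure("zlib level 1", lambda d: zlib.compress(d, 1), zlib.decompress, sample),
    measure("zlib level 9", lambda d: zlib.compress(d, 9), zlib.decompress, sample),
    measure("bz2", bz2.compress, bz2.decompress, sample),
    measure("lzma", lzma.compress, lzma.decompress, sample),
]
for r in results:
    print(f"{r['codec']:<13} {r['pct_remaining']:6.2f}% remaining {r['encode_mb_s']:8.1f} MB/s")
```

Running this on a representative slice of real input gives a better basis for choosing a codec than any generic benchmark.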
In Summary
•  Practice Cloud Architecture with Transient Clusters
•  Utilise Task Nodes on Spot for increased performance and lower cost
•  Utilise S3 as the system of record for durability
bit.ly/1n0hRSr
John Telford
Enterprise Architect
Channel 4
@jtelford1
johntelforduk
EMR at C4
1.  Who we are.
2.  What we’re doing with EMR.
3.  Lessons learnt.
Channel 4
•  State owned, public service broadcaster.
•  Self-funded, mostly by selling advertising (no TV licence fee money!)
•  Turnover £1B.
•  800 employees.
•  Programmes supplied by 250 independent production companies.
12 Years A Slave
C4 Virtuous Circle
Ad Revenue (£s) = Impacts x Rate
Brilliant Programmes → Oodles of Viewers → Massive Ad Revenue → Gigantic Programme Budget → Brilliant Programmes
C4 Viewer Insight Database
•  Clickstream & Ad Server behavioral data.
•  10M registered viewers.
•  Viewer Panel / Survey & 3rd Party Data.
•  Programme metadata.
•  60 Tbytes of S3 storage.
Google “Channel 4 viewer promise”
Expect to pre-process your data
We want our Data Scientists to enjoy a User Friendly, High Performance system, containing High Quality Data.
[Diagram: data pipeline on AWS S3 storage, driven by Hive HQL queries]
Acquire → Ingest → Decorate → Embellish → Derive
•  Raw Data → Raw DD (smoke test)
•  Decorated DD (row by row): drop columns, cleanse data, add flags, lookup values
•  Embellish DD (multirow, multipass): dwell, last visit hit
•  Derived DD: segmentations, last activity, summary tables
•  Analytical outputs
Data profiling
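-- Count the rows that fail each basic validity check.
SELECT
SUM (IF (visit_num REGEXP '^[0-9]+$', 0, 1)), -- non-numeric visit numbers
SUM (IF (ip REGEXP '^[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+$', 0, 1)), -- malformed IPs (dots escaped)
SUM (IF (page_url <> '', 0, 1)), -- empty page URLs
COUNT (DISTINCT service) -- number of distinct services
FROM raw_clickstream;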
Big Data requires Big Data Profiling.
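One detail worth checking when profiling with regular expressions: an unescaped `.` matches any character, so an IP address check needs `\.` to be strict. The sketch below applies the same count-the-failures idea in Python to a few invented sample values; the names and data are illustrative only.

```python
import re

# Unescaped dots match any character; escaped dots match only a literal ".".
LOOSE_IP = re.compile(r"^[0-9]+.[0-9]+.[0-9]+.[0-9]+$")
STRICT_IP = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$")


def count_failures(values, pattern):
    """Count values that do not match the pattern, like SUM(IF(col REGEXP p, 0, 1))."""
    return sum(0 if pattern.search(value) else 1 for value in values)


# Made-up sample column values.
ips = ["10.0.0.1", "192.168.1.20", "1a2b3c4", "not-an-ip", ""]

print(count_failures(ips, LOOSE_IP))   # 2: "1a2b3c4" slips through
print(count_failures(ips, STRICT_IP))  # 3
```

The loose pattern accepts `1a2b3c4` as an IP address, so it under-reports bad rows by one on this sample.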
Partitioning
CREATE EXTERNAL TABLE web_log (
hit_time_gmt BIGINT,
cookie STRING
-- and many more columns.
) PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://bucket/';
ALTER TABLE web_log ADD PARTITION (month='2010-06') LOCATION '2010-06';
ALTER TABLE web_log ADD PARTITION (month='2010-07') LOCATION '2010-07';
-- etc.
Help EMR go direct to the data it needs.
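Writing one `ALTER TABLE` statement per month by hand gets tedious, so the statements can be generated. A minimal sketch, assuming monthly partitions named `YYYY-MM` as in the slide; the helper names are hypothetical.

```python
def month_range(start, end):
    """Yield 'YYYY-MM' strings from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    while (year, month) <= (end_year, end_month):
        yield f"{year:04d}-{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def add_partition_statements(table, start, end):
    """Build one ALTER TABLE ... ADD PARTITION statement per month."""
    return [
        f"ALTER TABLE {table} ADD PARTITION (month='{m}') LOCATION '{m}';"
        for m in month_range(start, end)
    ]


for statement in add_partition_statements("web_log", "2010-06", "2010-08"):
    print(statement)
```

The output can be saved as an HQL script and run before the queries that depend on the partitions.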
Connecting data
[Diagram: Old approach: slaves connect to 1 RDS instance. New approach: each slave is paired with its own Redis.]
Handling large amounts of data
•  AWS Import/Export.
–  Consumer grade USB drives… sent by courier.
•  AWS Direct Connect.
–  Dedicated network connection from your premises to AWS.
–  We have not completed our implementation.
•  Glacier.
Choosing instances for EMR
Source: https://aws.amazon.com/ec2/pricing/
Some instance types omitted from diagram for clarity.
Exchange rate, $1 = £0.61.
Social engineering
•  Make the Data Scientists aware of EMR costs.
•  We give them visibility of clusters running, who started
them, idle time, etc.
Thanks!
Youtube: “Channel 4 Paralympics Meet the Superhumans”
AWS Partner Trail
Win a Kindle Fire
•  10 in total
•  Get a code from our
sponsors
Please rate
this session
using the AWS
Summits App
and help us build
better events
#AWSSummit
@AWScloud @AWS_UKI
bit.ly/1n0hRSr

More Related Content

What's hot

Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...Amazon Web Services
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
AWS Cloud Kata | Bangkok - Getting to Scale on AWS
AWS Cloud Kata | Bangkok - Getting to Scale on AWSAWS Cloud Kata | Bangkok - Getting to Scale on AWS
AWS Cloud Kata | Bangkok - Getting to Scale on AWSAmazon Web Services
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRIsrael AWS User Group
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...Amazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Amazon Web Services
 
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Amazon Web Services
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAmazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech TalksDeep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech TalksAmazon Web Services
 
Best Practices running SQL Server on AWS
Best Practices running SQL Server on AWSBest Practices running SQL Server on AWS
Best Practices running SQL Server on AWSAmazon Web Services
 
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)Amazon Web Services
 
AWS Cost Optimisation - November 2018
AWS Cost Optimisation - November 2018AWS Cost Optimisation - November 2018
AWS Cost Optimisation - November 2018James Bromberger
 
Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRrICh morrow
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
PASS 17 SQL Server on AWS Best Practices
PASS 17 SQL Server on AWS Best PracticesPASS 17 SQL Server on AWS Best Practices
PASS 17 SQL Server on AWS Best PracticesAmazon Web Services
 

What's hot (20)

Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
AWS Cloud Kata | Bangkok - Getting to Scale on AWS
AWS Cloud Kata | Bangkok - Getting to Scale on AWSAWS Cloud Kata | Bangkok - Getting to Scale on AWS
AWS Cloud Kata | Bangkok - Getting to Scale on AWS
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
AWS Storage Tiers for Enterprise Workloads - Best Practices (STG301) | AWS re...
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech TalksDeep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
 
Best Practices running SQL Server on AWS
Best Practices running SQL Server on AWSBest Practices running SQL Server on AWS
Best Practices running SQL Server on AWS
 
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
 
AWS Cost Optimisation - November 2018
AWS Cost Optimisation - November 2018AWS Cost Optimisation - November 2018
AWS Cost Optimisation - November 2018
 
Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMR
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
PASS 17 SQL Server on AWS Best Practices
PASS 17 SQL Server on AWS Best PracticesPASS 17 SQL Server on AWS Best Practices
PASS 17 SQL Server on AWS Best Practices
 

Viewers also liked

AWS Summit Milan - Applicazioni Enterprise con AWS
AWS Summit Milan - Applicazioni Enterprise con AWSAWS Summit Milan - Applicazioni Enterprise con AWS
AWS Summit Milan - Applicazioni Enterprise con AWSAmazon Web Services
 
McGraw-Hill Education: Global Migration in Less than 2 Years (ENT211) | AWS r...
McGraw-Hill Education: Global Migration in Less than 2 Years (ENT211) | AWS r...McGraw-Hill Education: Global Migration in Less than 2 Years (ENT211) | AWS r...
McGraw-Hill Education: Global Migration in Less than 2 Years (ENT211) | AWS r...Amazon Web Services
 
How to Host and Manage Enterprise Customers on AWS (ARC213) | AWS re:Invent 2013
How to Host and Manage Enterprise Customers on AWS (ARC213) | AWS re:Invent 2013How to Host and Manage Enterprise Customers on AWS (ARC213) | AWS re:Invent 2013
How to Host and Manage Enterprise Customers on AWS (ARC213) | AWS re:Invent 2013Amazon Web Services
 
Large Scale Data Analysis with AWS
Large Scale Data Analysis with AWSLarge Scale Data Analysis with AWS
Large Scale Data Analysis with AWSAmazon Web Services
 
How Intuit Leveraged AWS OpsWorks as the Engine of Our PaaS (DMG305) | AWS re...
How Intuit Leveraged AWS OpsWorks as the Engine of Our PaaS (DMG305) | AWS re...How Intuit Leveraged AWS OpsWorks as the Engine of Our PaaS (DMG305) | AWS re...
How Intuit Leveraged AWS OpsWorks as the Engine of Our PaaS (DMG305) | AWS re...Amazon Web Services
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features Amazon Web Services
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersAmazon Web Services
 
Maximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceMaximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceAmazon Web Services
 
AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven...
AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven...AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven...
AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven...Amazon Web Services
 
AWS Summit London 2014 | Introduction to Amazon EC2 (100)
AWS Summit London 2014 | Introduction to Amazon EC2 (100)AWS Summit London 2014 | Introduction to Amazon EC2 (100)
AWS Summit London 2014 | Introduction to Amazon EC2 (100)Amazon Web Services
 
Media Content Ingest, Storage, and Archiving with AWS (MED301) | AWS re:Inven...
Media Content Ingest, Storage, and Archiving with AWS (MED301) | AWS re:Inven...Media Content Ingest, Storage, and Archiving with AWS (MED301) | AWS re:Inven...
Media Content Ingest, Storage, and Archiving with AWS (MED301) | AWS re:Inven...Amazon Web Services
 
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)Amazon Web Services
 
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)Amazon Web Services
 
White Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known CustomersWhite Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known CustomersGigya
 
Amazon WorkSpaces: Desktop Computing in the Cloud (ENT104) | AWS re:Invent 2013
Amazon WorkSpaces: Desktop Computing in the Cloud (ENT104) | AWS re:Invent 2013Amazon WorkSpaces: Desktop Computing in the Cloud (ENT104) | AWS re:Invent 2013
Amazon WorkSpaces: Desktop Computing in the Cloud (ENT104) | AWS re:Invent 2013Amazon Web Services
 
AWS Tips for LAUNCHing Your Infrastructure in the Cloud
AWS Tips for LAUNCHing Your Infrastructure in the CloudAWS Tips for LAUNCHing Your Infrastructure in the Cloud
AWS Tips for LAUNCHing Your Infrastructure in the CloudAmazon Web Services
 

Viewers also liked (17)

AWS Summit Milan - Applicazioni Enterprise con AWS
AWS Summit Milan - Applicazioni Enterprise con AWSAWS Summit Milan - Applicazioni Enterprise con AWS
AWS Summit Milan - Applicazioni Enterprise con AWS
 
McGraw-Hill Education: Global Migration in Less than 2 Years (ENT211) | AWS r...
McGraw-Hill Education: Global Migration in Less than 2 Years (ENT211) | AWS r...McGraw-Hill Education: Global Migration in Less than 2 Years (ENT211) | AWS r...
McGraw-Hill Education: Global Migration in Less than 2 Years (ENT211) | AWS r...
 
How to Host and Manage Enterprise Customers on AWS (ARC213) | AWS re:Invent 2013
How to Host and Manage Enterprise Customers on AWS (ARC213) | AWS re:Invent 2013How to Host and Manage Enterprise Customers on AWS (ARC213) | AWS re:Invent 2013
How to Host and Manage Enterprise Customers on AWS (ARC213) | AWS re:Invent 2013
 
Large Scale Data Analysis with AWS
Large Scale Data Analysis with AWSLarge Scale Data Analysis with AWS
Large Scale Data Analysis with AWS
 
IP Expo - What is AWS?
IP Expo - What is AWS?IP Expo - What is AWS?
IP Expo - What is AWS?
 
How Intuit Leveraged AWS OpsWorks as the Engine of Our PaaS (DMG305) | AWS re...
How Intuit Leveraged AWS OpsWorks as the Engine of Our PaaS (DMG305) | AWS re...How Intuit Leveraged AWS OpsWorks as the Engine of Our PaaS (DMG305) | AWS re...
How Intuit Leveraged AWS OpsWorks as the Engine of Our PaaS (DMG305) | AWS re...
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million UsersScaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
Maximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceMaximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk Performance
 
AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven...
AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven...AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven...
AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven...
 
AWS Summit London 2014 | Introduction to Amazon EC2 (100)
AWS Summit London 2014 | Introduction to Amazon EC2 (100)AWS Summit London 2014 | Introduction to Amazon EC2 (100)
AWS Summit London 2014 | Introduction to Amazon EC2 (100)
 
Media Content Ingest, Storage, and Archiving with AWS (MED301) | AWS re:Inven...
Media Content Ingest, Storage, and Archiving with AWS (MED301) | AWS re:Inven...Media Content Ingest, Storage, and Archiving with AWS (MED301) | AWS re:Inven...
Media Content Ingest, Storage, and Archiving with AWS (MED301) | AWS re:Inven...
 
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)
 
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
AWS Summit London 2014 | Optimising TCO for the AWS Cloud (100)
 
White Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known CustomersWhite Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known Customers
 
Amazon WorkSpaces: Desktop Computing in the Cloud (ENT104) | AWS re:Invent 2013
Amazon WorkSpaces: Desktop Computing in the Cloud (ENT104) | AWS re:Invent 2013Amazon WorkSpaces: Desktop Computing in the Cloud (ENT104) | AWS re:Invent 2013
Amazon WorkSpaces: Desktop Computing in the Cloud (ENT104) | AWS re:Invent 2013
 
AWS Tips for LAUNCHing Your Infrastructure in the Cloud
AWS Tips for LAUNCHing Your Infrastructure in the CloudAWS Tips for LAUNCHing Your Infrastructure in the Cloud
AWS Tips for LAUNCHing Your Infrastructure in the Cloud
 

AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practices (400)

  • 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Amazon Elastic MapReduce: Deep Dive and Best Practices Ian Meyers, AWS (meyersi@) John Telford, Channel 4 (jtelford@) April 30, 2014
  • 2. Outline Introduction to Amazon EMR Architecting EMR for Cost Amazon EMR Design Patterns Amazon EMR Best Practices
  • 3. What is EMR? Map-Reduce Engine • Vibrant Ecosystem • Hadoop-as-a-Service • Massively Parallel • Cost Effective AWS Wrapper • Integrated to AWS services
  • 5. Amazon EMR with HDFS, Amazon S3 and Amazon DynamoDB
  • 6. Amazon EMR with HDFS, Amazon S3 and Amazon DynamoDB • Analytics languages • Data management
  • 7. Amazon EMR with HDFS, Amazon S3, Amazon DynamoDB and Amazon RDS • Analytics languages • Data management
  • 8. Amazon EMR with HDFS, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift and AWS Data Pipeline • Analytics languages • Data management
  • 9. Amazon EMR Introduction •  Launch clusters of any size in a matter of minutes •  Use a variety of instance sizes to match your workload
  • 10. Amazon EMR Introduction •  Don’t get stuck with hardware •  Don’t deal with capacity planning •  Run multiple clusters with different sizes, specs and node types
  • 12. Outline Introduction to Amazon EMR Architecting EMR for Cost Amazon EMR Design Patterns Amazon EMR Best Practices
  • 13. Architecting for cost •  EC2/EMR pricing models: –  On-demand: Pay-as-you-go model. –  Spot: Marketplace. Bid for instances and get a discount –  Reserved Instance: upfront payment (for a 1- or 3-year term) in return for a lower overall monthly payment
  • 14. Architecting for cost •  On-demand –  Research & Development, Data Science •  Spot –  Restartable Tasks –  Embarrassingly Parallel Workloads •  Reserved Instance –  Well Understood, Frequent and Predictable Workloads
  • 15. EMR Architecture for Optimal Cost •  Use Heavy Utilisation RIs for alive and long-running clusters
  • 16. EMR Architecture for Optimal Cost •  Use Medium Utilisation RIs for ad-hoc and unpredictable workloads
  • 17. EMR Architecture for Optimal Cost •  Supplement with Spot for unpredictable workloads or Turbo Boost
  • 18. Outline Introduction to Amazon EMR Architecting EMR for Cost Amazon EMR Design Patterns Amazon EMR Best Practices
  • 19. Amazon EMR Design Patterns Pattern #1: Transient vs. Alive Clusters Pattern #2: Core Nodes and Task Nodes Pattern #3: Amazon S3 & HDFS
  • 20. Pattern #1: Transient vs. Alive Clusters
  • 21. Pattern #1: Transient Clusters •  Cluster lives for the duration of the job •  Shut down the cluster when the job is done •  Data persist on Amazon S3 •  Input & Output Data on Amazon S3
  • 22. Benefits of Transient Clusters 1.  Control your cost 2.  Minimum maintenance •  Cluster goes away when job is done 3.  Practice cloud architecture •  Pay for what you use •  Data processing as a workflow
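A transient cluster can be sketched as a request in which the cluster is told not to stay alive once its steps finish. The snippet below only builds the keyword arguments that could be passed to the EMR `RunJobFlow` API (for example via boto3's `run_job_flow`); it does not call AWS. The function name, instance types, counts and S3 URIs are illustrative assumptions, and a real request also needs settings such as the release or AMI version and IAM roles, which are left out here.

```python
def transient_cluster_request(name, jar_uri, input_uri, output_uri):
    """Build (but do not send) keyword arguments for an EMR RunJobFlow request."""
    return {
        "Name": name,
        "Instances": {
            "InstanceGroups": [
                # Instance types and counts are placeholders, not recommendations.
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m3.xlarge", "InstanceCount": 2},
            ],
            # False: the cluster shuts down when the last step completes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            # Input and output both live on Amazon S3, so nothing is lost
            # when the cluster goes away.
            "HadoopJarStep": {"Jar": jar_uri, "Args": [input_uri, output_uri]},
        }],
    }


request = transient_cluster_request(
    "nightly-job",
    "s3://example-bucket/jobs/job.jar",   # hypothetical locations
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
)
```

Keeping input and output on Amazon S3 is what makes the shutdown safe: the cluster holds no data that outlives the job.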
  • 23. Alive Clusters •  Very similar to traditional Hadoop deployments •  Cluster stays around after the job is done •  Data persistence model: •  Amazon S3 •  Amazon S3 Copy To HDFS •  HDFS and Amazon S3 as backup
  • 24. Alive Clusters •  Always keep data safe on Amazon S3 even if you’re using HDFS for primary storage •  Get in the habit of shutting down your cluster and starting a new one, once a week or month •  Design your data processing workflow to account for failure •  You can use workflow management tools such as AWS Data Pipeline
  • 25. Pattern #2: Core & Task nodes
  • 26. Core Nodes •  Run TaskTrackers (Compute) •  Run DataNode (HDFS) [Diagram: Amazon EMR cluster with a master instance group and a core instance group holding HDFS]
  • 27. Core Nodes •  Can add core nodes •  More HDFS space •  More CPU/memory [Diagram: core instance group grows, adding HDFS]
  • 28. Core Nodes •  Can’t remove core nodes because of HDFS [Diagram: Amazon EMR cluster with master and core instance groups]
  • 29. Amazon EMR Task Nodes •  Run TaskTrackers •  No HDFS •  Reads from core node HDFS [Diagram: task instance group added alongside the master and core instance groups]
  • 30. Amazon EMR Task Nodes •  Can add task nodes [Diagram: task, master and core instance groups]
  • 31. Amazon EMR Task Nodes •  More CPU power •  More memory [Diagram: task, master and core instance groups]
  • 32. Amazon EMR Task Nodes •  You can remove task nodes when processing is completed [Diagram: task, master and core instance groups]
  • 34. Task Node Use-Cases •  Speed up job processing using Spot market –  Run task nodes on Spot market •  Get discount on hourly price –  Nodes can come and go without interruption to your cluster •  When you need extra horsepower for a short amount of time –  Example: Need to pull large amount of data from Amazon S3
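Running task nodes on the Spot market, as described above, can be sketched as an instance group definition with the `TASK` role and the `SPOT` market. The dictionary follows the shape of an EMR instance group configuration as accepted by the `AddInstanceGroups` API (for example boto3's `add_instance_groups`); nothing is sent to AWS here. The function name, instance type, count and bid price are illustrative assumptions.

```python
def spot_task_group(instance_type, count, bid_price_usd):
    """Describe a task instance group that bids on the Spot market."""
    return {
        "Name": "spot-task-nodes",
        # TASK nodes run compute only and hold no HDFS data, so losing them
        # to a Spot price change does not lose data.
        "InstanceRole": "TASK",
        "Market": "SPOT",
        # The API expects the bid as a string in USD per instance hour.
        "BidPrice": f"{bid_price_usd:.2f}",
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


# Hypothetical values: ten m1.xlarge task nodes with a 10 cent bid.
group = spot_task_group("m1.xlarge", 10, 0.10)
```

Because the group can be added to a running cluster and removed again afterwards, it fits the "extra horsepower for a short amount of time" use case.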
  • 35. Pattern #3: Amazon S3 & HDFS
  • 36. Option 1: Amazon S3 as HDFS •  Use Amazon S3 as your permanent data store •  Use HDFS for temporary data storage between jobs •  No additional step to copy data to HDFS [Diagram: Amazon EMR cluster with core instance group (HDFS) and task instance group, reading from Amazon S3]
  • 37. Benefits: Amazon S3 as HDFS •  Ability to shut down your cluster (HUGE benefit!!) •  Use Amazon S3 as your durable storage (11 9s of durability)
  • 38. Benefits: Amazon S3 as HDFS •  No need to scale HDFS •  Capacity •  Replication for durability •  Amazon S3 scales with your data •  Both in IOPS and data storage
  • 39. Benefits: Amazon S3 as HDFS •  Ability to share data between multiple clusters •  Hard to do with HDFS [Diagram: two EMR clusters sharing one Amazon S3 store]
  • 40. Benefits: Amazon S3 as HDFS •  Take advantage of Amazon S3 features •  Amazon S3 Server Side Encryption •  Amazon S3 Lifecycle Policies •  Amazon S3 versioning to protect against corruption •  Build elastic clusters •  Add nodes to read from Amazon S3 •  Remove nodes with data safe on Amazon S3
  • 41. What About Data Locality? •  Run your job in the same region as your Amazon S3 bucket •  Amazon EMR nodes have high-speed connectivity to Amazon S3 •  If your job is CPU/memory-bound, locality doesn’t make a huge difference
  • 42. Anti-Pattern: Amazon S3 as HDFS •  Iterative workloads –  If you’re processing the same dataset more than once •  Disk I/O intensive workloads
  • 43. Option 2: Optimise for Latency with HDFS 1.  Data persisted on Amazon S3
  • 44. Option 2: Optimise for Latency with HDFS 2.  Launch Amazon EMR and copy data to HDFS with S3DistCp
  • 45. Option 2: Optimise for Latency with HDFS 3.  Start processing data on HDFS
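The copy in step 2 can be sketched as an EMR step that invokes S3DistCp. The snippet below only builds the step definition; it does not submit it. It assumes a newer EMR release in which S3DistCp is run as `s3-dist-cp` through `command-runner.jar`, which may differ from the release used at the time of this talk. The function name and the paths are hypothetical.

```python
def s3distcp_step(src_s3_uri, dest_hdfs_path):
    """Build an EMR step definition that copies data from S3 into HDFS."""
    return {
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # Assumption: command-runner.jar is available on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src_s3_uri, "--dest", dest_hdfs_path],
        },
    }


# Hypothetical bucket and HDFS directory.
step = s3distcp_step("s3://example-bucket/input/", "hdfs:///data/input/")
```

Later steps would then read from the HDFS path, while Amazon S3 remains the system of record.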
  • 46. Benefits: HDFS instead of S3 •  Better pattern for I/O-intensive workloads •  Amazon S3 as system of record •  Durability •  Scalability •  Cost •  Features: lifecycle policy, security
  • 47. Outline Introduction to Amazon EMR Architecting EMR for Cost Amazon EMR Design Patterns Amazon EMR Best Practices
  • 48. Amazon EMR Nodes and Size •  Use m1 and c1 family for functional testing •  Use m3 and c3 xlarge and larger nodes for production workloads •  Use cc2/c3 for memory- and CPU-intensive jobs •  Use hs1, hi1, i2 instances for HDFS workloads •  Prefer a smaller cluster of larger nodes
  • 49. Holy Grail Question How many nodes do I need?
  • 50. Cluster Sizing Calculation 1.  Estimate the number of mappers your job requires.
  • 51. Cluster Sizing Calculation 2.  Pick an instance and note down the number of mappers it can run in parallel. m1.xlarge = 8 mappers in parallel
  • 52. Resource Capability / Instance Type
      EC2 Instance Type          Mappers  Reducers
      m1.small                   2        1
      m1.large                   3        1
      m1.xlarge                  8        3
      m2.xlarge                  3        1
      m2.2xlarge                 6        2
      m2.4xlarge                 14       4
      m3.xlarge                  6        1
      m3.2xlarge                 12       3
      cc2.8xlarge                24       6
      c3.4xlarge                 24       6
      hi1.4xlarge                24       6
      hs1.8xlarge                24       6
      cr1.8xlarge & c3.8xlarge   48       12
  • 53. Cluster Sizing Calculation 3.  We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2.
  • 54. Cluster Sizing Calculation 4.  Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
  • 55. Cluster Sizing Calculation Estimated Number Of Nodes = (Total Mappers * Time To Process Sample Files) / (Instance Mapper Capacity * Desired Processing Time)
  • 56. Example: Cluster Sizing Calculation 1.  Estimate the number of mappers your job requires 150 2.  Pick an instance and note down the number of mappers it can run in parallel m1.xlarge with 8 mapper capacity per instance
  • 57. Example: Cluster Sizing Calculation 3.  We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2. 8 files selected for our sample test
  • 58. Example: Cluster Sizing Calculation 4.  Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files. 3 min to process 8 files
  • 59. Cluster Sizing Calculation Estimated number of nodes = (Total Mappers For Your Job * Time To Process Sample Files) / (Per Instance Mapper Capacity * Desired Processing Time) = (150 * 3 min) / (8 * 5 min) = 11.25, i.e. about 11 m1.xlarge
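
The sizing formula can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name and the choice to show both rounding directions are assumptions, since the slide simply reports 11 nodes.

```python
import math


def estimate_nodes(total_mappers, sample_minutes, mappers_per_instance, desired_minutes):
    """Apply the sizing formula from the slides; returns a fractional node count."""
    return (total_mappers * sample_minutes) / (mappers_per_instance * desired_minutes)


# Figures from the worked example: 150 mappers, a 3 min sample run,
# m1.xlarge with 8 parallel mappers, and a 5 min desired processing time.
raw_estimate = estimate_nodes(150, 3, 8, 5)
rounded_down = math.floor(raw_estimate)  # matches the figure quoted on the slide
rounded_up = math.ceil(raw_estimate)     # a more conservative choice (assumption)

print(raw_estimate, rounded_down, rounded_up)
```

Rounding up leaves a little headroom if the sample files turn out not to be representative of the full dataset.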
  • 60. File Best Practices •  Avoid small files at all costs (smaller than 100MB) •  Use Compression
  • 61. Holy Grail Question What if I have small file issues?
  • 62. Dealing with Small Files •  Use S3DistCp to combine smaller files together •  S3DistCp takes a pattern and target file to combine smaller input files to larger ones ./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf, --dest,hdfs:///local, --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*, --targetSize,128,
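
To see what the `--groupBy` pattern does, the sketch below applies the same regular expression (with the slide's line-wrap space removed) to a handful of hypothetical object keys. Files whose capture group has the same value end up in the same group, which is how S3DistCp decides what to concatenate. The key names and the helper `group_keys` are made up for illustration, and skipping non-matching keys is an assumption about the tool's behaviour rather than something stated on the slide.

```python
import re
from collections import defaultdict

# Pattern taken from the slide's --groupBy argument.
GROUP_BY = re.compile(r".*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*")


def group_keys(keys, pattern=GROUP_BY):
    """Group keys by the value of the pattern's first capture group."""
    groups = defaultdict(list)
    for key in keys:
        match = pattern.match(key)
        if match is None:
            continue  # assumption: keys that do not match are left out
        groups[match.group(1)].append(key)
    return dict(groups)


# Hypothetical object keys, invented for this example.
sample_keys = [
    "cf/XABCD12345678.2014-04-30-01.aaaa.gz",
    "cf/XABCD12345678.2014-04-30-01.bbbb.gz",
    "cf/XABCD12345678.2014-04-30-02.cccc.gz",
    "cf/readme.txt",
]

grouped = group_keys(sample_keys)
for group, members in grouped.items():
    print(group, len(members))
```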
  • 63. Compression •  Always Compress Data Files On Amazon S3 •  Reduces Bandwidth Between Amazon S3 and Amazon EMR •  Speeds Up Your Job •  Compress Mappers and Reducer Output
  • 64. Compression •  Compression Types: –  Some are fast BUT offer less space reduction –  Some are space efficient BUT slower –  Some are splittable and some are not
      Algorithm  % Space Remaining  Encoding Speed  Decoding Speed
      GZIP       13%                21MB/s          118MB/s
      LZO        20%                135MB/s         410MB/s
      Snappy     22%                172MB/s         409MB/s
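
The speed versus size trade-off can be measured on your own data before choosing a codec. The sketch below uses only Python's standard `gzip` module and compares a fast setting with a space-efficient one; it does not cover LZO or Snappy, which need third-party libraries, and the sample payload is a made-up, highly repetitive log line, so the numbers it prints say nothing about real clickstream data.

```python
import gzip
import time

# Hypothetical sample payload: one tab-separated log line repeated many times.
raw = b"2014-04-30\t/home\t200\tGET\n" * 50_000


def measure(level):
    """Compress `raw` at the given gzip level; return (compressed bytes, seconds)."""
    start = time.perf_counter()
    blob = gzip.compress(raw, compresslevel=level)
    return blob, time.perf_counter() - start


fast, fast_seconds = measure(1)    # favours encoding speed
small, small_seconds = measure(9)  # favours space reduction

for label, blob, seconds in (("level 1", fast, fast_seconds), ("level 9", small, small_seconds)):
    remaining = 100 * len(blob) / len(raw)
    print(f"{label}: {remaining:.2f}% space remaining, {seconds:.4f}s to encode")
```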
  • 65. In Summary •  Practice Cloud Architecture with Transient Clusters •  Utilise Task Nodes on Spot for Increased Performance and Lower Cost •  Utilise S3 as the system of record for durability bit.ly/1n0hRSr
  • 66. John Telford Enterprise Architect Channel 4 @jtelford1 johntelforduk EMR at C4 1.  Who we are. 2.  What we’re doing with EMR. 3.  Lessons learnt.
  • 67. Channel 4 •  State owned, public service broadcaster. •  Self-funded mostly by selling advertising (no TV license fee money!) •  Turnover £1B. •  800 employees. •  Programmes supplied by 250 independent production companies.
  • 68. 12 Years A Slave
  • 69. C4 Virtuous Circle Ad Revenue (£s) = Impacts x Rate. Brilliant Programmes → Oodles of Viewers → Massive Ad Revenue → Gigantic Programme Budget
  • 70. C4 Viewer Insight Database •  Clickstream & Ad Server behavioral data. •  10M registered viewers. •  Viewer Panel / Survey & 3rd Party Data. •  Programme metadata. •  60 Tbytes of S3 storage. Google “Channel 4 viewer promise”
  • 71. Expect to pre-process your data We want our Data Scientists to enjoy a User Friendly, High Performance system, containing High Quality Data. [Diagram: pipeline stages Acquire → Ingest → Decorate → Embellish → Derive, over AWS storage (S3) and Hive HQL queries. Raw Data → Raw DD (smoke test). Decorate, row by row: drop columns, cleanse data, add flags, lookup values → Decorated DD. Embellish, multirow and multipass: dwell, last visit hit → Embellish DD. Derive: segmentations, last activity, summary tables → Derived DD → Analytical Outputs]
  • 72. Data profiling SELECT SUM (IF (visit_num REGEXP '^[0-9]+$', 0, 1)), SUM (IF (ip REGEXP '^[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+$', 0, 1)), SUM (IF (page_url <> '', 0, 1)), COUNT (DISTINCT service) FROM raw_clickstream; Big Data requires Big Data Profiling.
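
The same profiling checks can be prototyped on a small sample before running the Hive query over the full table. This is a sketch only: the column names come from the query on the slide, while the sample rows and the `profile` helper are invented for illustration.

```python
import re

DIGITS = re.compile(r"^[0-9]+$")
IPV4_SHAPE = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$")


def profile(rows):
    """Count suspicious values per column, mirroring the query on the slide."""
    return {
        "bad_visit_num": sum(0 if DIGITS.match(r["visit_num"]) else 1 for r in rows),
        "bad_ip": sum(0 if IPV4_SHAPE.match(r["ip"]) else 1 for r in rows),
        "empty_page_url": sum(0 if r["page_url"] != "" else 1 for r in rows),
        "distinct_services": len({r["service"] for r in rows}),
    }


# Hypothetical rows standing in for a sample of raw_clickstream.
sample_rows = [
    {"visit_num": "1", "ip": "10.0.0.1", "page_url": "/home", "service": "web"},
    {"visit_num": "x2", "ip": "10.0.0", "page_url": "", "service": "web"},
    {"visit_num": "3", "ip": "192.168.1.20", "page_url": "/news", "service": "mobile"},
]

report = profile(sample_rows)
print(report)
```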
  • 73. Partitioning CREATE EXTERNAL TABLE web_log ( hit_time_gmt BIGINT, cookie STRING -- and many more columns. ) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://bucket/'; ALTER TABLE web_log ADD PARTITION (month='2010-06') LOCATION '2010-06'; ALTER TABLE web_log ADD PARTITION (month='2010-07') LOCATION '2010-07'; -- etc. Help EMR go direct to the data it needs.
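
Writing one `ALTER TABLE` statement per month by hand gets tedious, so the statements can be generated. The sketch below emits statements in the same shape as those on the slide; the helper names `month_range` and `add_partition_statements` are hypothetical, and it assumes each partition lives in a location named after its month, as in the slide's example.

```python
def month_range(start, end):
    """Yield 'YYYY-MM' strings from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    while (year, month) <= (end_year, end_month):
        yield f"{year:04d}-{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def add_partition_statements(table, start, end):
    """Build one ALTER TABLE ... ADD PARTITION statement per month."""
    return [
        f"ALTER TABLE {table} ADD PARTITION (month='{m}') LOCATION '{m}';"
        for m in month_range(start, end)
    ]


statements = add_partition_statements("web_log", "2010-06", "2010-08")
for statement in statements:
    print(statement)
```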
  • 74. Connecting data [Diagram: Old approach: Slaves connect to 1 RDS Instance. New approach: each Slave is paired with its own Redis]
  • 75. Handling large amounts of data •  AWS Import/Export. –  Consumer grade USB drives… sent by courier. •  AWS Direct Connect. –  Dedicated network connection from your premises to AWS. –  We have not completed our implementation. •  Glacier.
  • 76. Choosing instances for EMR Source: https://aws.amazon.com/ec2/pricing/ Some instance types omitted from diagram for clarity. Exchange rate, $1 = £0.61.
  • 77. Social engineering •  Make the Data Scientists aware of EMR costs. •  We give them visibility of clusters running, who started them, idle time, etc.
  • 78. John Telford Enterprise Architect Channel 4 @jtelford1 johntelforduk Thanks! Youtube: “Channel 4 Paralympics Meet the Superhumans”