SlideShare a Scribd company logo
1 of 46
Download to read offline
Deep Dive – Amazon Elastic MapReduce
Ian Meyers, Solution Architect – Amazon Web Services
Guest Speakers: Ian McDonald & James Aley - Swiftkey
(ian@swiftkey.com, james.aley@swiftkey.com)
Agenda
Amazon Elastic MapReduce (EMR)
Leverage Amazon Simple Storage Service (S3) with Amazon
EMR File System (EMRFS)
Design patterns and optimizations
Space Ape Games
Amazon Elastic MapReduce
Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time monitoring
Secure
Manage firewalls
Flexible
Control the cluster
Easy to deploy
AWS Management Console Command Line
Or use the Amazon EMR API with your favorite SDK.
Easy to monitor and debug
Integrated with Amazon CloudWatch
Monitor Cluster, Node, IO, Hadoop 1 & 2 Processes
Monitor Debug
Amazon S3 and HDFS Browser
Query Editor
Job Browser
Try different configurations to find the
optimal cost/performance balance.
CPU
c3 family
cc2.8xlarge
d2 family
Memory
m2 family
r3 family
Disk/IO
d2 family
i2 family
General
m1 family
m3 family
Choose your instance types
ETL ML Spark HDFS
Easy to add and remove compute
capacity on your cluster.
Match compute
demands with
cluster sizing.
Resizable clusters
Spot Instances
for task nodes
Up to 90%
off Amazon EC2
on-demand
pricing
On-demand for
core nodes
Standard
Amazon EC2
pricing for
on-demand
capacity
Easy to use Spot Instances
Meet SLA at predictable cost Exceed SLA at lower cost
Use bootstrap actions to install applications…
https://github.com/awslabs/emr-bootstrap-actions
…or to configure Hadoop
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-
hadoop
--keyword-config-file (Merge values in new config to existing)
--keyword-key-value (Override values provided)
Configuration File
Name
Configuration File
Keyword
File Name
Shortcut
Key-Value Pair
Shortcut
core-site.xml	
   core C c
hdfs-site.xml	
   hdfs H h
mapred-site.xml	
   mapred M m
yarn-site.xml	
   yarn Y y
  Read data directly into Hive,
Apache Pig, and Hadoop
Streaming and Cascading from
Amazon Kinesis streams
  No intermediate data
persistence required
  Simple way to introduce real-time sources into
batch-oriented systems
  Multi-application support and automatic
checkpointing
Amazon EMR Integration with Amazon Kinesis
Leverage Amazon S3
Amazon S3 as your persistent data store
Amazon S3
Designed for 99.999999999% durability
Separate compute and storage
Resize and shut down Amazon EMR clusters
with no data loss
Point multiple Amazon EMR clusters at same
data in Amazon S3
EMRFS makes it easier to leverage Amazon S3
Better performance and error handling options
Transparent to applications – just read/write to “s3://”
Consistent view
For consistent list and read-after-write for new puts
Support for Amazon S3 server-side and client-side encryption
Faster listing of large prefixes via EMRFS metadata
EMRFS support for Amazon S3 client-side encryption
Amazon S3
AmazonS3encryptionclients
EMRFSenabledfor

AmazonS3client-sideencryption
Key vendor (AWS KMS or your custom key vendor)
(client-side encrypted objects)
Amazon S3
 EMRFS metadata 

in Amazon DynamoDB
List and read-after-write consistency

Faster list operations
Number of
objects
Without Consistent
View
With Consistent View
1,000,000 147.72 29.70
100,000 12.70 3.69
Fast listing of Amazon S3 objects using
EMRFS metadata
*Tested using a single node cluster with a m3.xlarge instance.
Optimize to leverage HDFS
Iterative workloads
If you’re processing the same dataset more than once
Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to
copy to HDFS for processing.
Amazon EMR – Design Patterns
Amazon EMR example #1: Batch processing
GBs of logs pushed
to Amazon S3 hourly
Daily Amazon EMR
cluster using Hive to
process data
Input and output
stored in Amazon S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Amazon EMR example #2: HBase Server Cluster
Data pushed to
Amazon S3
Daily Amazon EMR cluster
Extract, Transform, and Load
(ETL) data into database
24/7 Amazon EMR cluster
running HBase holds last 2
years’ worth of data
Front-end service uses
HBase cluster to power
dashboard with high
concurrency
Amazon EMR example #3: Interactive query
TBs of logs sent daily
Logs stored in
Amazon S3
Amazon EMR cluster using Presto for ad hoc
analysis of entire log set
Interactive query using Presto on multi-petabyte warehouse
http://techblog.netflix.com/2014/10/using-presto-in-our-big-
data-platform.html
Optimizations for Storage
File formats
Row oriented
Text files
Sequence files
Writable object
Avro data files
Described by schema
Columnar format
Object Record Columnar (ORC)
Parquet
Logical Table
Row oriented
Column oriented
Choosing the right file format
Processing and query tools
Hive, Pig, Impala, Presto, Spark
Evolution of schema
Avro for schema and Orc/Parquet for storage
File format “splittability”
Avoid JSON/XML files with newlines - default Split is n
Compression
Block, File or Internal
File sizes
Avoid small files
Avoid anything smaller than 100 MB
Each process is a single Java Virtual machine (JVM)
CPU time is required to spawn JVMs
Fewer files, matching closely to block size
Fewer calls to Amazon S3
Fewer network/HDFS requests
Dealing with small files
You *can* reduce HDFS block size (e.g., 1 MB [default is 128
MB])
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/
configure-hadoop --args “-m,dfs.block.size=1048576”
Instead use S3DistCp to combine small files together
S3DistCp takes a pattern and target path to combine smaller input
files into larger ones
Supply a target size and compression codec
Compression
Always compress data files on Amazon S3
Reduces network traffic between Amazon S3 and
Amazon EMR
Speeds up your job
Compress Mapper and Reducer output
Amazon EMR compresses internode traffic with LZO on
Hadoop 1, and Snappy on Hadoop 2.
Choosing the right compression
Time sensitive: faster compressions are a better choice (Snappy)
Large amount of data: use space-efficient compressions (Gzip)
Combined workload: use LZO.
Algorithm Splittable? Compression Ratio
Compress +
Decompress Speed
Gzip (DEFLATE) No High Medium
bzip2 Yes Very high Slow
LZO Yes Low Fast
Snappy No Low Very fast
Cost-saving tips
Use Amazon S3 as your persistent data store (only pay for
compute when you need it!).
Use Amazon EC2 Spot Instances (especially with task nodes) to
save 80 percent or more on the Amazon EC2 cost.
Use Amazon EC2 Reserved Instances if you have steady
workloads.
Create CloudWatch alerts to notify you if a cluster is
underutilized so that you can shut it down (e.g. Mappers running
== 0 for more than N hours).
Cost-saving tips
Contact your account manager about custom pricing options, if
you are spending more than $10K per month on Amazon EMR.
©2015,  Amazon  Web  Services,  Inc.  or  its  affiliates.  All  rights  reserved
Next Generation Analytics
Ian McDonald, Director of IT
SwiftKey
Twitter: @imcdnzl
What is SwiftKey?
Architecture
Data Capture Architecture
ETL Architecture
Cascalog
•  Cascalog is an open source Clojure library
implemented using Cascading
•  Using instead of things like Hive / Pig
•  Write a few lines of Clojure and end up with an
EMR job
Parquet (Apache project)
•  Developed by Cloudera and Twitter
•  Efficient compression and encoding on Hadoop /
EMR
•  Use for storing and processing our data
Things we’ve learnt
Lessons
•  Get on top of serialisation
•  Don’t just stick with JSON / Gzip
•  Many small files in S3 painful – rebuild to fewer
bigger files
•  Use spot instances for EMR (except Master
node)
•  Experiment with different instance types to find
best speed / cost effectiveness
What’s Next
Apache Spark
•  Easier / faster than Hadoop or database queries
•  Processes in RAM
•  Directly against S3 data
•  Available on EMR
•  Not necessarily great for big joins
LONDON

More Related Content

What's hot

Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftAmazon Web Services
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSAmazon Web Services
 
Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWSBuilding a Modern Data Platform on AWS
Building a Modern Data Platform on AWSAmazon Web Services
 
Deep Dive on Amazon RDS (Relational Database Service)
Deep Dive on Amazon RDS (Relational Database Service)Deep Dive on Amazon RDS (Relational Database Service)
Deep Dive on Amazon RDS (Relational Database Service)Amazon Web Services
 
Introduction to Amazon Elasticsearch Service
Introduction to  Amazon Elasticsearch ServiceIntroduction to  Amazon Elasticsearch Service
Introduction to Amazon Elasticsearch ServiceAmazon Web Services
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017Amazon Web Services
 
Encryption and Key Management in AWS
Encryption and Key Management in AWSEncryption and Key Management in AWS
Encryption and Key Management in AWSAmazon Web Services
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Web Services
 
Amazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage OverviewAmazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage OverviewAmazon Web Services
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceAmazon Web Services
 

What's hot (20)

Introduction to Amazon EC2
Introduction to Amazon EC2Introduction to Amazon EC2
Introduction to Amazon EC2
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
AWS RDS
AWS RDSAWS RDS
AWS RDS
 
AWS Lambda
AWS LambdaAWS Lambda
AWS Lambda
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Deep dive into AWS IAM
Deep dive into AWS IAMDeep dive into AWS IAM
Deep dive into AWS IAM
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
 
Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWSBuilding a Modern Data Platform on AWS
Building a Modern Data Platform on AWS
 
Deep Dive on Amazon RDS (Relational Database Service)
Deep Dive on Amazon RDS (Relational Database Service)Deep Dive on Amazon RDS (Relational Database Service)
Deep Dive on Amazon RDS (Relational Database Service)
 
Introduction to Amazon Elasticsearch Service
Introduction to  Amazon Elasticsearch ServiceIntroduction to  Amazon Elasticsearch Service
Introduction to Amazon Elasticsearch Service
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 
Encryption and Key Management in AWS
Encryption and Key Management in AWSEncryption and Key Management in AWS
Encryption and Key Management in AWS
 
Introduction to Amazon Aurora
Introduction to Amazon AuroraIntroduction to Amazon Aurora
Introduction to Amazon Aurora
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)
 
Amazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage OverviewAmazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage Overview
 
Cost Optimisation on AWS
Cost Optimisation on AWSCost Optimisation on AWS
Cost Optimisation on AWS
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database Service
 

Viewers also liked

Using Virtual Private Cloud (vpc)
Using Virtual Private Cloud (vpc)Using Virtual Private Cloud (vpc)
Using Virtual Private Cloud (vpc)Amazon Web Services
 
Deep Dive - Amazon Virtual Private Cloud (VPC)
Deep Dive - Amazon Virtual Private Cloud (VPC)Deep Dive - Amazon Virtual Private Cloud (VPC)
Deep Dive - Amazon Virtual Private Cloud (VPC)Amazon Web Services
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
 
Amazon Virtual Private Cloud VPC Architecture AWS Web Services
Amazon Virtual Private Cloud VPC Architecture AWS Web ServicesAmazon Virtual Private Cloud VPC Architecture AWS Web Services
Amazon Virtual Private Cloud VPC Architecture AWS Web ServicesRobert Wilson
 
AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)
AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)
AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)Amazon Web Services
 

Viewers also liked (7)

Amazon Virtual Private Cloud
Amazon Virtual Private CloudAmazon Virtual Private Cloud
Amazon Virtual Private Cloud
 
Using Virtual Private Cloud (vpc)
Using Virtual Private Cloud (vpc)Using Virtual Private Cloud (vpc)
Using Virtual Private Cloud (vpc)
 
Deep Dive - Amazon Virtual Private Cloud (VPC)
Deep Dive - Amazon Virtual Private Cloud (VPC)Deep Dive - Amazon Virtual Private Cloud (VPC)
Deep Dive - Amazon Virtual Private Cloud (VPC)
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
Deep Dive: Amazon RDS
Deep Dive: Amazon RDSDeep Dive: Amazon RDS
Deep Dive: Amazon RDS
 
Amazon Virtual Private Cloud VPC Architecture AWS Web Services
Amazon Virtual Private Cloud VPC Architecture AWS Web ServicesAmazon Virtual Private Cloud VPC Architecture AWS Web Services
Amazon Virtual Private Cloud VPC Architecture AWS Web Services
 
AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)
AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)
AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305)
 

Similar to Deep Dive - Amazon Elastic MapReduce (EMR)

Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAmazon Web Services
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석Amazon Web Services Korea
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
Best Practices for Protecting Cloud Workloads - November 2016 Webinar Series
Best Practices for Protecting Cloud Workloads - November 2016 Webinar SeriesBest Practices for Protecting Cloud Workloads - November 2016 Webinar Series
Best Practices for Protecting Cloud Workloads - November 2016 Webinar SeriesAmazon Web Services
 
Data Analytics on AWS
Data Analytics on AWSData Analytics on AWS
Data Analytics on AWSDanilo Poccia
 
Architetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaArchitetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaAmazon Web Services
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 

Similar to Deep Dive - Amazon Elastic MapReduce (EMR) (20)

Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Best Practices for Protecting Cloud Workloads - November 2016 Webinar Series
Best Practices for Protecting Cloud Workloads - November 2016 Webinar SeriesBest Practices for Protecting Cloud Workloads - November 2016 Webinar Series
Best Practices for Protecting Cloud Workloads - November 2016 Webinar Series
 
Data Analytics on AWS
Data Analytics on AWSData Analytics on AWS
Data Analytics on AWS
 
Architetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaArchitetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS Lambda
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
AWS Architecting In The Cloud
AWS Architecting In The CloudAWS Architecting In The Cloud
AWS Architecting In The Cloud
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

Deep Dive - Amazon Elastic MapReduce (EMR)

  • 1. Deep Dive – Amazon Elastic MapReduce Ian Meyers, Solution Architect – Amazon Web Services Guest Speakers: Ian McDonald & James Aley - Swiftkey (ian@swiftkey.com, james.aley@swiftkey.com)
  • 2. Agenda Amazon Elastic MapReduce (EMR) Leverage Amazon Simple Storage Service (S3) with Amazon EMR File System (EMRFS) Design patterns and optimizations Space Ape Games
  • 4. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Manage firewalls Flexible Control the cluster
  • 5. Easy to deploy AWS Management Console Command Line Or use the Amazon EMR API with your favorite SDK.
  • 6. Easy to monitor and debug Integrated with Amazon CloudWatch Monitor Cluster, Node, IO, Hadoop 1 & 2 Processes Monitor Debug
  • 7. Amazon S3 and HDFS Browser
  • 10. Try different configurations to find the optimal cost/performance balance. CPU c3 family cc2.8xlarge d2 family Memory m2 family r3 family Disk/IO d2 family i2 family General m1 family m3 family Choose your instance types ETL ML Spark HDFS
  • 11. Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing. Resizable clusters
  • 12. Spot Instances for task nodes Up to 90% off Amazon EC2 on-demand pricing On-demand for core nodes Standard Amazon EC2 pricing for on-demand capacity Easy to use Spot Instances Meet SLA at predictable cost Exceed SLA at lower cost
  • 13. Use bootstrap actions to install applications… https://github.com/awslabs/emr-bootstrap-actions
  • 14. …or to configure Hadoop --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure- hadoop --keyword-config-file (Merge values in new config to existing) --keyword-key-value (Override values provided) Configuration File Name Configuration File Keyword File Name Shortcut Key-Value Pair Shortcut core-site.xml   core C c hdfs-site.xml   hdfs H h mapred-site.xml   mapred M m yarn-site.xml   yarn Y y
  • 15.   Read data directly into Hive, Apache Pig, and Hadoop Streaming and Cascading from Amazon Kinesis streams   No intermediate data persistence required   Simple way to introduce real-time sources into batch-oriented systems   Multi-application support and automatic checkpointing Amazon EMR Integration with Amazon Kinesis
  • 17. Amazon S3 as your persistent data store Amazon S3 Designed for 99.999999999% durability Separate compute and storage Resize and shut down Amazon EMR clusters with no data loss Point multiple Amazon EMR clusters at same data in Amazon S3
  • 18. EMRFS makes it easier to leverage Amazon S3 Better performance and error handling options Transparent to applications – just read/write to “s3://” Consistent view For consistent list and read-after-write for new puts Support for Amazon S3 server-side and client-side encryption Faster listing of large prefixes via EMRFS metadata
  • 19. EMRFS support for Amazon S3 client-side encryption Amazon S3 AmazonS3encryptionclients EMRFSenabledfor
 AmazonS3client-sideencryption Key vendor (AWS KMS or your custom key vendor) (client-side encrypted objects)
  • 20. Amazon S3 EMRFS metadata 
 in Amazon DynamoDB List and read-after-write consistency Faster list operations Number of objects Without Consistent View With Consistent View 1,000,000 147.72 29.70 100,000 12.70 3.69 Fast listing of Amazon S3 objects using EMRFS metadata *Tested using a single node cluster with a m3.xlarge instance.
  • 21. Optimize to leverage HDFS Iterative workloads If you’re processing the same dataset more than once Disk I/O intensive workloads Persist data on Amazon S3 and use S3DistCp to copy to HDFS for processing.
  • 22. Amazon EMR – Design Patterns
  • 23. Amazon EMR example #1: Batch processing GBs of logs pushed to Amazon S3 hourly Daily Amazon EMR cluster using Hive to process data Input and output stored in Amazon S3 250 Amazon EMR jobs per day, processing 30 TB of data http://aws.amazon.com/solutions/case-studies/yelp/
  • 24. Amazon EMR example #2: HBase Server Cluster Data pushed to Amazon S3 Daily Amazon EMR cluster Extract, Transform, and Load (ETL) data into database 24/7 Amazon EMR cluster running HBase holds last 2 years’ worth of data Front-end service uses HBase cluster to power dashboard with high concurrency
  • 25. Amazon EMR example #3: Interactive query TBs of logs sent daily Logs stored in Amazon S3 Amazon EMR cluster using Presto for ad hoc analysis of entire log set Interactive query using Presto on multi-petabyte warehouse http://techblog.netflix.com/2014/10/using-presto-in-our-big- data-platform.html
  • 27. File formats Row oriented Text files Sequence files Writable object Avro data files Described by schema Columnar format Object Record Columnar (ORC) Parquet Logical Table Row oriented Column oriented
  • 28. Choosing the right file format Processing and query tools Hive, Pig, Impala, Presto, Spark Evolution of schema Avro for schema and Orc/Parquet for storage File format “splittability” Avoid JSON/XML files with newlines - default Split is n Compression Block, File or Internal
  • 29. File sizes Avoid small files Avoid anything smaller than 100 MB Each process is a single Java Virtual machine (JVM) CPU time is required to spawn JVMs Fewer files, matching closely to block size Fewer calls to Amazon S3 Fewer network/HDFS requests
  • 30. Dealing with small files You *can* reduce HDFS block size (e.g., 1 MB [default is 128 MB]) --bootstrap-action s3://elasticmapreduce/bootstrap-actions/ configure-hadoop --args “-m,dfs.block.size=1048576” Instead use S3DistCp to combine small files together S3DistCp takes a pattern and target path to combine smaller input files into larger ones Supply a target size and compression codec
  • 31. Compression Always compress data files on Amazon S3 Reduces network traffic between Amazon S3 and Amazon EMR Speeds up your job Compress Mapper and Reducer output Amazon EMR compresses internode traffic with LZO on Hadoop 1, and Snappy on Hadoop 2.
  • 32. Choosing the right compression Time sensitive: faster compressions are a better choice (Snappy) Large amount of data: use space-efficient compressions (Gzip) Combined workload: use LZO. Algorithm Splittable? Compression Ratio Compress + Decompress Speed Gzip (DEFLATE) No High Medium bzip2 Yes Very high Slow LZO Yes Low Fast Snappy No Low Very fast
  • 33. Cost-saving tips Use Amazon S3 as your persistent data store (only pay for compute when you need it!). Use Amazon EC2 Spot Instances (especially with task nodes) to save 80 percent or more on the Amazon EC2 cost. Use Amazon EC2 Reserved Instances if you have steady workloads. Create CloudWatch alerts to notify you if a cluster is underutilized so that you can shut it down (e.g. Mappers running == 0 for more than N hours).
  • 34. Cost-saving tips Contact your account manager about custom pricing options, if you are spending more than $10K per month on Amazon EMR.
  • 35. ©2015,  Amazon  Web  Services,  Inc.  or  its  affiliates.  All  rights  reserved Next Generation Analytics Ian McDonald, Director of IT SwiftKey Twitter: @imcdnzl
  • 40. Cascalog •  Cascalog is an open source Clojure library implemented using Cascading •  Using instead of things like Hive / Pig •  Write a few lines of Clojure and end up with an EMR job
  • 41. Parquet (Apache project) •  Developed by Cloudera and Twitter •  Efficient compression and encoding on Hadoop / EMR •  Use for storing and processing our data
  • 43. Lessons •  Get on top of serialisation •  Don’t just stick with JSON / Gzip •  Many small files in S3 painful – rebuild to fewer bigger files •  Use spot instances for EMR (except Master node) •  Experiment with different instance types to find best speed / cost effectiveness
  • 45. Apache Spark •  Easier / faster than Hadoop or database queries •  Processes in RAM •  Directly against S3 data •  Available on EMR •  Not necessarily great for big joins