Join this advanced technical session on Amazon Elastic MapReduce (EMR) for an introduction to Amazon EMR design patterns, such as using Amazon S3 instead of HDFS and taking advantage of both long-running and short-lived clusters, as well as other Amazon EMR architectural patterns. Learn how to scale your cluster up or down dynamically and about ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
9. Amazon EMR Introduction
• Launch clusters of any size in a matter of minutes
• Use a variety of instance sizes that match your workload
10. Amazon EMR Introduction
• Don’t get stuck with hardware
• Don’t deal with capacity planning
• Run multiple clusters with different sizes, specs, and node types
13. Architecting for cost
• EC2/EMR pricing models:
– On-demand: pay-as-you-go model
– Spot: marketplace; bid for instances and get a discount
– Reserved Instance: upfront payment (for a 1- or 3-year term) for a reduction in the overall monthly payment
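To compare on-demand against a Reserved Instance, a quick break-even sketch helps. The prices below are illustrative placeholders, not current AWS rates; look up real figures on the EC2 pricing page.

```python
# Illustrative break-even between on-demand and a 1-year Reserved Instance.
# All prices are made-up placeholders, not real AWS rates.
ON_DEMAND_HOURLY = 0.35   # $/hour, pay as you go
RI_UPFRONT = 1000.0       # $ one-time payment for a 1-year term
RI_HOURLY = 0.12          # $/hour effective rate with the RI

def yearly_cost_on_demand(hours):
    return ON_DEMAND_HOURLY * hours

def yearly_cost_reserved(hours):
    return RI_UPFRONT + RI_HOURLY * hours

# Utilisation (hours/year) above which the RI is cheaper.
break_even_hours = RI_UPFRONT / (ON_DEMAND_HOURLY - RI_HOURLY)
print(f"RI pays off after ~{break_even_hours:.0f} hours/year")
```

With these placeholder numbers the RI wins once the instance runs roughly half the year, which is why RIs suit well-understood, predictable workloads.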
14. Architecting for cost
• On-demand
– Research & Development, Data Science
• Spot
– Restartable Tasks
– Embarrassingly Parallel Workloads
• Reserved Instance
– Well-Understood, Frequent, and Predictable Workloads
15. EMR Architecture for Optimal Cost
Use Heavy Utilisation RIs for alive and long-running clusters
16. EMR Architecture for Optimal Cost
Use Medium Utilisation RIs for ad-hoc and unpredictable workloads
17. EMR Architecture for Optimal Cost
Supplement with Spot for unpredictable workloads or a turbo boost
21. Pattern #1: Transient Clusters
• Cluster lives for the duration of the job
• Shut down the cluster when the job is done
• Input and output data persist on Amazon S3
22. Benefits of Transient Clusters
1. Control your cost
2. Minimal maintenance
• Cluster goes away when job is done
3. Practice cloud architecture
• Pay for what you use
• Data processing as a workflow
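As a sketch of what launching a transient cluster could look like with boto3 (the `run_job_flow` call is real, but the cluster name, log bucket, instance types, and counts below are hypothetical placeholders, and the API call itself is left commented out):

```python
# Sketch of a transient EMR cluster request via boto3's run_job_flow.
# Names, bucket, and sizes are hypothetical placeholders; the request is
# abbreviated and the actual call is commented out.
params = {
    "Name": "nightly-etl",                 # hypothetical job name
    "LogUri": "s3://my-log-bucket/emr/",   # hypothetical log bucket
    "Instances": {
        "MasterInstanceType": "m1.xlarge",
        "SlaveInstanceType": "m1.xlarge",
        "InstanceCount": 3,
        # False => transient: the cluster terminates once all steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
}
# import boto3
# boto3.client("emr").run_job_flow(**params)
```

The key line is `KeepJobFlowAliveWhenNoSteps: False`: the cluster shuts itself down when the job is done, so you stop paying the moment the work finishes.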
23. Alive Clusters
• Very similar to traditional Hadoop deployments
• Cluster stays around after the job is done
• Data persistence model:
• Amazon S3
• Amazon S3 copied to HDFS
• HDFS with Amazon S3 as backup
24. Alive Clusters
• Always keep data safe on Amazon S3, even if you’re using HDFS for primary storage
• Get in the habit of shutting down your cluster and starting a new one, once a week or month
• Design your data processing workflow to account for failure
• You can use workflow management tools such as AWS Data Pipeline
26. Core Nodes
[Diagram: an Amazon EMR cluster with a master instance group and a core instance group. Core nodes run TaskTrackers (compute) and DataNodes (HDFS).]
27. Core Nodes
[Diagram: core nodes can be added to the core instance group, giving more HDFS space and more CPU/memory.]
28. Core Nodes
[Diagram: core nodes can’t be removed, because they hold HDFS data.]
29. Amazon EMR Task Nodes
[Diagram: a task instance group alongside the master and core instance groups. Task nodes run TaskTrackers but no HDFS; they read from core-node HDFS.]
30. Amazon EMR Task Nodes
[Diagram: task nodes can be added to the task instance group.]
31. Amazon EMR Task Nodes
[Diagram: adding task nodes gives more CPU power and more memory.]
32. Amazon EMR Task Nodes
[Diagram: task nodes can be removed when processing is completed.]
34. Task Node Use-Cases
• Speed up job processing using the Spot market
– Run task nodes on the Spot market
• Get a discount on the hourly price
– Nodes can come and go without interrupting your cluster
• When you need extra horsepower for a short amount of time
– Example: need to pull a large amount of data from Amazon S3
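Growing or shrinking a task group is a single API call. A sketch using boto3's `modify_instance_groups` (the instance group ID below is a placeholder, and the call itself is commented out):

```python
# Sketch: resize an EMR task instance group via the ModifyInstanceGroups API.
# The instance group ID is a placeholder, not a real ID.
resize_request = {
    "InstanceGroups": [
        {
            "InstanceGroupId": "ig-TASKGROUP",  # placeholder task-group ID
            "InstanceCount": 10,                # grow to 10 task nodes
        }
    ]
}
# import boto3
# boto3.client("emr").modify_instance_groups(**resize_request)
```

Because task nodes hold no HDFS blocks, setting `InstanceCount` back down after the burst removes them without risking data.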
36. Option 1: Amazon S3 as HDFS
• Use Amazon S3 as your permanent data store
• Use HDFS for temporary storage of data between jobs
• No additional step to copy data to HDFS
[Diagram: an Amazon EMR cluster (core and task instance groups, each with local HDFS) reading from and writing to Amazon S3.]
37. Benefits: Amazon S3 as HDFS
• Ability to shut down your cluster (a HUGE benefit!)
• Use Amazon S3 as your durable storage (11 9s of durability)
38. Benefits: Amazon S3 as HDFS
• No need to scale HDFS
• Capacity
• Replication for durability
• Amazon S3 scales with your data
• Both in IOPS and data storage
39. Benefits: Amazon S3 as HDFS
• Ability to share data between multiple clusters
• Hard to do with HDFS
[Diagram: multiple EMR clusters reading the same data from one Amazon S3 bucket.]
40. Benefits: Amazon S3 as HDFS
• Take advantage of Amazon S3 features
• Amazon S3 Server Side Encryption
• Amazon S3 Lifecycle Policies
• Amazon S3 versioning to protect against corruption
• Build elastic clusters
• Add nodes to read from Amazon S3
• Remove nodes with data safe on Amazon S3
41. What About Data Locality?
• Run your job in the same region as your Amazon S3 bucket
• Amazon EMR nodes have high-speed connectivity to Amazon S3
• If your job is CPU/memory-bound, locality doesn’t make a huge difference
42. Anti-Pattern: Amazon S3 as HDFS
• Iterative workloads
– If you’re processing the same dataset more than once
• Disk I/O intensive workloads
44. Option 2: Optimise for Latency with HDFS
2. Launch Amazon EMR and copy data to HDFS with S3DistCp
45. Option 2: Optimise for Latency with HDFS
3. Start processing the data on HDFS
46. Benefits: HDFS instead of S3
• Better pattern for I/O-intensive workloads
• Amazon S3 as system of record
• Durability
• Scalability
• Cost
• Features: lifecycle policy, security
48. Amazon EMR Nodes and Size
• Use the m1 and c1 families for functional testing
• Use m3 and c3 xlarge or larger nodes for production workloads
• Use cc2/c3 for memory- and CPU-intensive jobs
• Use hs1, hi1, or i2 instances for HDFS workloads
• Prefer a smaller cluster of larger nodes
53. Cluster Sizing Calculation
3. Pick some sample data files to run a test workload. The number of sample files should match the per-instance mapper capacity from step #2.
54. Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note the amount of time taken to process your sample files.
55. Cluster Sizing Calculation
Estimated number of nodes =
(Total mappers * Time to process sample files) /
(Instance mapper capacity * Desired processing time)
56. Example: Cluster Sizing Calculation
1. Estimate the number of mappers your job requires
150 mappers
2. Pick an instance and note the number of mappers it can run in parallel
m1.xlarge, with a mapper capacity of 8 per instance
57. Example: Cluster Sizing Calculation
3. Pick some sample data files to run a test workload. The number of sample files should match the per-instance mapper capacity from step #2.
8 files selected for our sample test
58. Example: Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note the amount of time taken to process your sample files.
3 min to process 8 files
59. Cluster Sizing Calculation
Estimated number of nodes =
(Total mappers for your job * Time to process sample files) /
(Per-instance mapper capacity * Desired processing time)
= (150 * 3 min) / (8 * 5 min)
≈ 11 m1.xlarge nodes
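The worked example above can be checked in a few lines, using the numbers from the slides:

```python
import math

def estimate_nodes(total_mappers, sample_time_min,
                   mappers_per_instance, desired_time_min):
    """Cluster-sizing formula from the slides."""
    return (total_mappers * sample_time_min) / (
        mappers_per_instance * desired_time_min)

nodes = estimate_nodes(total_mappers=150, sample_time_min=3,
                       mappers_per_instance=8, desired_time_min=5)
print(nodes)             # 11.25
print(math.ceil(nodes))  # 12
```

The exact result is 11.25; the deck rounds to 11, while rounding up to 12 would guarantee the desired processing time is met.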
60. File Best Practices
• Avoid small files (smaller than 100 MB) at all costs
• Use compression
62. Dealing with Small Files
• Use S3DistCp to combine smaller files together
• S3DistCp takes a pattern and a target size, and combines smaller input files into larger ones

./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,
  --dest,hdfs:///local,
  --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
  --targetSize,128'
63. Compression
• Always Compress Data Files On Amazon S3
• Reduces Bandwidth Between Amazon S3 and Amazon EMR
• Speeds Up Your Job
• Compress Mapper and Reducer Output
64. Compression
• Compression types:
– Some are fast BUT offer less space reduction
– Some are space efficient BUT slower
– Some are splittable and some are not

Algorithm  % Space Remaining  Encoding Speed  Decoding Speed
GZIP       13%                21 MB/s         118 MB/s
LZO        20%                135 MB/s        410 MB/s
Snappy     22%                172 MB/s        409 MB/s
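Actual ratios depend heavily on your data, so it is worth measuring on a sample of your own files. A minimal sketch using Python's standard-library gzip (the sample text here is synthetic, standing in for a real log file):

```python
import gzip

# Synthetic, repetitive sample standing in for a real log file.
sample = b"2014-06-01 GET /index.html 200 1234\n" * 1000

compressed = gzip.compress(sample)
space_remaining = 100.0 * len(compressed) / len(sample)
print(f"gzip leaves {space_remaining:.1f}% of the original size")
```

Repetitive log data compresses extremely well; run the same measurement over representative input before choosing between GZIP, LZO, and Snappy.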
65. In Summary
• Practice cloud architecture with transient clusters
• Utilise task nodes on Spot for increased performance and lower cost
• Utilise S3 as the system of record for durability
bit.ly/1n0hRSr
67. Channel 4
• State-owned, public service broadcaster.
• Self-funded, mostly by selling advertising (no TV licence fee money!)
• Turnover £1B.
• 800 employees.
• Programmes supplied by 250 independent production companies.
69. C4 Virtuous Circle
Ad Revenue (£s) = Impacts x Rate
[Diagram: virtuous circle of Brilliant Programmes → Oodles of Viewers → Massive Ad Revenue → Gigantic Programme Budget.]
70. C4 Viewer Insight Database
• Clickstream & Ad Server behavioral data.
• 10M registered viewers.
• Viewer Panel / Survey & 3rd Party Data.
• Programme metadata.
• 60 TB of S3 storage.
Google “Channel 4 viewer promise”
71. Expect to pre-process your data
We want our Data Scientists to enjoy a User-Friendly, High-Performance system, containing High-Quality Data.
[Diagram: pipeline over AWS S3 storage, queried via Hive HQL. Acquire → Ingest (Raw DD: raw data, smoke test) → Decorate (Decorated DD: row-by-row drop columns, cleanse data, add flags, look up values) → Embellish (Embellish DD: multi-row, multi-pass dwell and last-visit hit) → Derive (Derived DD: segmentations, last activity, summary tables) → Analytical outputs.]
72. Data profiling
SELECT
  SUM (IF (visit_num REGEXP '^[0-9]+$', 0, 1)),
  SUM (IF (ip REGEXP '^[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+$', 0, 1)),
  SUM (IF (page_url <> '', 0, 1)),
  COUNT (DISTINCT service)
FROM raw_clickstream;
Big Data requires Big Data Profiling.
73. Partitioning
CREATE EXTERNAL TABLE web_log (
  hit_time_gmt BIGINT,
  cookie STRING
  -- and many more columns.
) PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://bucket/';
ALTER TABLE web_log ADD PARTITION (month='2010-06') LOCATION '2010-06';
ALTER TABLE web_log ADD PARTITION (month='2010-07') LOCATION '2010-07';
-- etc.
Help EMR go direct to the data it needs.
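Those ALTER TABLE statements are repetitive, one per month. A small generator (using the `web_log` table and month layout from the slide) avoids typing them by hand:

```python
# Generate ADD PARTITION statements, one per month, matching the
# web_log layout shown on the slide.
def partition_ddl(months):
    for m in months:
        yield (f"ALTER TABLE web_log ADD PARTITION (month='{m}') "
               f"LOCATION '{m}';")

months = [f"2010-{mm:02d}" for mm in range(6, 13)]  # 2010-06 .. 2010-12
for stmt in partition_ddl(months):
    print(stmt)
```

Pipe the output into the Hive CLI, or extend the range as new monthly data lands on S3.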
75. Handling large amounts of data
• AWS Import/Export.
– Consumer-grade USB drives… sent by courier.
• AWS Direct Connect.
– Dedicated network connection from your premises to AWS.
– We have not completed our implementation.
• Glacier.
76. Choosing instances for EMR
Source: https://aws.amazon.com/ec2/pricing/
Some instance types omitted from the diagram for clarity.
Exchange rate: $1 = £0.61.
77. Social engineering
• Make the Data Scientists aware of EMR costs.
• We give them visibility of running clusters, who started them, idle time, etc.