2. Agenda
• Two different companies – 2 stories
• Challenges with Big Data on premises
• Technical introduction to Amazon EMR
• Amazon EMR features and benefits
• Use case of AOL – moving a 2 PB on-prem Hadoop
cluster to the AWS cloud
• Short demos
4. • In 2007, The New York Times decided to create a digital
archive on the web – all articles from 1851-1922
• 11 million articles (4 TB of data) composed of:
• 405,000 large TIFF images
• 405,000 XML files
• 3.3 million SGML files
• Used Amazon EC2 and Hadoop to process the data
7. (Undisclosed international company) –
subsidiary in France
• In 2014, the company decided to run a POC on Big Data
analytics
• What was their first step?
Invested €7M in server purchases
8. “Want to increase innovation?
Lower the cost of failure.”
Joi Ito, Director of MIT Media Lab
9. How many big ticket
technology ideas can
your budget tolerate?
10. (Big) Data for Competitive Advantage
Customer segmentation
Marketing spend optimization
Financial modeling & forecasting
Ad targeting & real-time bidding
Clickstream analysis
Fraud detection
Security threat detection
11. Challenges with In-House Infrastructure
• Fixed Cost
• Slow Deployment Cycle
• Always On
• Self Serve
• Static: Not Scalable
• Outages Impact
• Production Upgrade
• Storage, Compute
13. Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR
distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize
14. Make it easy, secure, and
cost-effective to run
data-processing frameworks
on the AWS cloud
15. What Do I Need to Build a Cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
16. Choice of Multiple Instances
• CPU (c3 family, cc1.4xlarge, cc2.8xlarge): Machine Learning
• Memory (m2 family, r3 family): In-memory (Spark & Presto)
• Disk/IO (d2 family, i2 family): Large HDFS
• General (m1 family, m3 family): Batch Processing
24. You Are Up and Running!
Information about the software you are
running, logs and features
25. You Are Up and Running!
Infrastructure for this cluster
26. You Are Up and Running!
Security Groups and Roles
27. Use the CLI
aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK
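As one SDK option, the same request can be sketched with the Python SDK (boto3) and its `run_job_flow` call. This is a minimal sketch mirroring the CLI flags above; the cluster name, the region, and the default role names are assumptions, not values from the slides.

```python
def build_cluster_request(name="demo-cluster"):
    """Build run_job_flow arguments mirroring the CLI example (name is a placeholder)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
            ],
            # Keep the cluster alive after launch so steps can be added later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(region="eu-west-1"):
    """Submit the request; needs boto3 and AWS credentials (region is an assumption)."""
    import boto3  # imported here so the request can be built without the SDK installed

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request())
    return response["JobFlowId"]
```

Calling `launch_cluster()` would start a billable cluster, so the request builder is kept separate and can be inspected on its own first.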
41. Amazon S3 is Your Persistent Data Store
Designed for 11 9’s durability
$0.03 / GB / month in Ireland
Lifecycle policies
Versioning
Distributed by default
EMRFS, Amazon S3
42. The Amazon EMR File System (EMRFS)
• Allows you to leverage Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than
open source components
• Consistent view – consistency for read after write
• Support for encryption
• Fast listing of objects
43. Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'
44. Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION
's3://elasticmapreduce.samples/pig-apache/input/'
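The point of the two slides is that only the table location changes when moving from HDFS to Amazon S3. A small sketch that makes this visible by rendering the statement from one template and comparing the results; the template and the helper name `table_ddl` are illustrative, and the paths are the ones shown on the slides.

```python
# Table definition as shown on the slides, with the location left as a parameter.
DDL_TEMPLATE = """CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION '{location}'"""


def table_ddl(location):
    """Render the DDL for a given storage location (hypothetical helper)."""
    return DDL_TEMPLATE.format(location=location)


hdfs_ddl = table_ddl("samples/pig-apache/input/")
s3_ddl = table_ddl("s3://elasticmapreduce.samples/pig-apache/input/")

# Collect the lines that differ between the two statements.
changed = [
    (old, new)
    for old, new in zip(hdfs_ddl.splitlines(), s3_ddl.splitlines())
    if old != new
]
for old, new in changed:
    print("HDFS:", old)
    print("S3:  ", new)
```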
54. Spot Integration with Amazon EMR
• Can provision instances from the Spot market
• Impact of interruption
• Master node – Can lose the cluster
• Core node – Can lose intermediate data
• Task nodes – Jobs will restart on other nodes (application
dependent)
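Given these impacts, one common arrangement is to keep the master and core nodes on demand and request only task nodes from the Spot market. The sketch below builds such a layout in the `InstanceGroups` shape accepted by boto3's `run_job_flow`; the instance types, node counts, and bid price are assumptions chosen for illustration.

```python
def instance_groups(task_count=4, task_bid="0.50"):
    """Instance group layout with Spot limited to task nodes (values are illustrative)."""
    return [
        # Losing the master can lose the cluster, so keep it on demand.
        {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": 1},
        # Core nodes hold intermediate data in HDFS, so keep them on demand too.
        {"InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": 2},
        # Task nodes hold no HDFS data; an interruption only restarts their work.
        {"InstanceRole": "TASK", "Market": "SPOT", "BidPrice": task_bid,
         "InstanceType": "m3.xlarge", "InstanceCount": task_count},
    ]


groups = instance_groups()
spot_roles = [g["InstanceRole"] for g in groups if g["Market"] == "SPOT"]
print(spot_roles)
```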
55. Scale up with Spot Instances
10 node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
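The arithmetic above can be written as a small function and then reused for a scale-up scenario. In the sketch below, the baseline figures come from the slide; the Spot price of $0.50 per node-hour and the assumption that doubling the node count halves the runtime are hypothetical and only serve to illustrate the calculation.

```python
def cluster_cost(price_per_node_hour, nodes, hours):
    """Cost of a group of identical nodes billed per node-hour."""
    return price_per_node_hour * nodes * hours


# Baseline from the slide: 10 on-demand nodes at $1.0 per hour for 14 hours.
on_demand_only = cluster_cost(1.0, 10, 14)
print(on_demand_only)  # 140.0

# Hypothetical scale-up: add 10 Spot nodes at an assumed $0.50 per hour,
# and assume the job then finishes in half the time (7 hours).
SPOT_PRICE = 0.5
HOURS_WITH_SPOT = 7
with_spot = cluster_cost(1.0, 10, HOURS_WITH_SPOT) + cluster_cost(SPOT_PRICE, 10, HOURS_WITH_SPOT)
print(with_spot)  # 105.0
```

Under these assumptions the job would finish in half the time at three quarters of the cost; real Spot prices and job scaling vary.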
65. AOL Data Platforms Architecture 2014
(Architecture diagram, AOL: Source Systems, In-house Hadoop Cluster, Database, Reporting Tools, Users)
66. Data Stats & Insights
• Cluster Size: 2 PB
• In-House Cluster: 100 Nodes
• Raw Data/Day: 2-3 TB
• Data Retention: 13-24 Months
67. Challenges with In-House Infrastructure
• Fixed Cost
• Slow Deployment Cycle
• Always On
• Self Serve
• Static: Not Scalable
• Outages Impact
• Production Upgrade
• Storage, Compute
68. AOL Data Platforms Architecture 2015
(Architecture diagram with numbered steps 1-6: AOL Source Systems, AWS Direct Connect, Amazon S3, Amazon EMR Cluster, Watchdog, Amazon SNS, Amazon IAM, Database, Reporting Tools, Users)
69. EMR Design Options
• Transient vs. Persistent Cluster
• Amazon S3 vs. local HDFS
• Elastic Cluster vs. Static Cluster
• On-Demand vs. Reserved vs. Spot
• Core Nodes vs. Task Nodes
70. AWS vs. In-House Cost
Service Cost Comparison (chart)
• In-House: 4x / Month
• AWS: 1x / Month
Source: AOL & AWS Billing Tool
** In-House cluster includes Storage, Power and Network cost.