Hadoop Platform infrastructure cost evaluation
Agenda
• High level requirements
• Cloud architecture
• Major architecture components
  • Amazon AWS
  • Hadoop distributions
• Capacity planning
  • Amazon AWS – EMR
  • Hadoop distributions
  • On-premise hardware costs
• Gotchas
2
High Level Requirements
• Build an analytical & BI platform for web log analytics
• Ingest multiple data sources:
  • Log data
  • Internal user data
• Apply complex business rules:
  • Manage events, filter crawler-driven logs, apply industry- and domain-specific rules
• Populate/export to a BI tool for visualization
3
Non-Functional Requirements
• Today's baseline: ~42 TB per year (~3.5 TB raw data per month), 3-year store
• SLA: should process data every day (currently done once a month)
• Predefined processing via Hive; no exploratory analysis
• Everything in the cloud: store (HDFS), compute (M/R), analysis (BI tool)
4
Non-Functional Requirements [2]
• Seeding data in S3 (3 years' worth of data)
• Adding monthly net-new data only
• Speed not of primary importance
5
Data Estimates for Capacity Planning [2]
• Cleaned-up log data per year: 42 TB (3 years = 126 TB)
• Total disk space required should consider:
  • Compression (LZO, ~40%): reduces disk space required to ~25 TB *
  • Replication factor of 3: ~75 TB
  • 75% maximum disk utilization in Hadoop: ~100 TB
• Total disk capacity required for data nodes: ~100 TB/year (~8.5 TB/mo)

(*disclaimer: depends on codec and data input)

6
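The arithmetic on this slide can be checked with a short script. A minimal sketch, using the slide's inputs (42 TB/yr raw, ~40% LZO reduction, replication factor 3, 75% maximum disk utilization):

```python
def dn_capacity_tb(raw_tb, compression_ratio=0.6, replication=3, max_util=0.75):
    """Disk capacity needed on the data nodes for a given raw log volume."""
    compressed = raw_tb * compression_ratio   # LZO reduces raw data by ~40%
    replicated = compressed * replication     # HDFS keeps 3 copies of each block
    return replicated / max_util              # headroom: fill disks to at most 75%

print(round(dn_capacity_tb(42), 1))  # one year of logs -> 100.8 (TB)
```

Running the same function with `max_util=0.70` reproduces the "reduced logs" figures on the following slide.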
Data Estimates for Capacity Planning: reduced logs

Expected data volume | Log data volume (TB) | After compression (Gzip 40%) | Replication on 3 nodes (TB) | 70% disk utilization maximum (TB)
1 month              | 3.6                  | 2.16                         | 6.5                         | 9.2
1 year               | 42                   | 25                           | 75                          | 107
3 years              | 126                  | 75.6                         | 226                         | 322

• Total disk capacity required for DN: ~10 TB/month
7
Cloud Solution Architecture

[Diagram: webservers produce logs on the client side; metadata extraction feeds the transfer; inside Amazon AWS, S3 feeds a Hadoop cluster (HDFS, Hive tables) and a BI tool serves the user]
1. Copy data to S3
2. Export data to HDFS
3. Process in M/R (Hive tables)
4. Display in BI tool
5. Retain results into S3
8
Hadoop on AWS: EC2
• Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud.
• Manual setup of Hadoop on EC2
• Use EBS for storage capacity (HDFS)
• Storage on S3
9
Running Hadoop on AWS: EC2
• EC2 instance options:
  • Choose instance type
  • Choose instance type availability
  • Choose instance family
• Choose where the data resides:
  • S3 – high latency, but highly available
  • EBS
• Permanent storage?
• Snapshots to S3?
• Apache Whirr for setup
10
Amazon EC2 – Instance features
• Other choices:
  • EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1000 Mbps depending on the instance type used
  • Inter-region data transfer
  • Dedicated instances: run on single-tenant hardware dedicated to a single customer
  • Spot instances: name your price
11
Amazon Instance Families
• Amazon EC2 instances are grouped into six families: general purpose, memory optimized, compute optimized, storage optimized, micro, and GPU.
• General-purpose instances have memory-to-CPU ratios suitable for most general-purpose apps.
• Memory-optimized instances offer larger memory sizes for high-throughput applications.
• Compute-optimized instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
• Storage-optimized instances are optimized for very high random I/O performance, very high storage density, low storage cost, and high sequential I/O performance.
• Micro instances provide a small amount of CPU with the ability to burst to higher amounts for brief periods.
• GPU instances, for dynamic applications.
12
Amazon Instance Types – Availability
• On-Demand Instances – let you pay for compute capacity by the hour with no long-term commitments. This frees you from the costs and complexities of planning, purchasing, and maintaining hardware.
• Reserved Instances – give you the option to make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization) that let you balance the amount you pay upfront with your effective hourly price.
• Spot Instances – allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand; customers whose bids meet or exceed it gain access to the available Spot Instances. If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs.
13
Amazon EC2 – Storage

14
Amazon EC2 – Instance types
[Instance type table: data nodes, BI instances, and master nodes highlighted]
15
Systems Architecture – EC2

[Diagram: client logs flow into S3; inside AWS, a Hadoop cluster (NN, SN, DNs, edge node) runs HDFS on EBS drives; BI nodes sit alongside]
• Hadoop cluster is initiated when analytics is run
• Data is streamed from S3 to EBS volumes
• Results from analytics are stored to S3 once computed
• BI nodes are permanent
16
Hadoop on AWS: EC2
• Probably not the best choice:
  • EBS volumes make the solution costly
  • If instead using instance storage, the EC2 instance choices are either too small (a few GB) or too big (48 TB per instance)
  • Don't need the flexibility – just want to use Hive
17
Hadoop on AWS: EMR
• Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on EC2 and Amazon Simple Storage Service (S3).
18
Running Hadoop on AWS – EMR
• Elastic MapReduce
• For occasional jobs – ephemeral clusters
• Ease of use, but ~20% costlier
• Data stored in S3 – highly tuned for S3 storage
• Hive and Pig available
• Only pay for S3 + instance time while jobs are running
• Or: leave it always on
19
Hadoop on AWS – EMR
• EC2 instances with Amazon's own flavor of Hadoop
• Amazon's Apache Hadoop is version 1.0.3; you can also choose MapR M3 or M5 (0.20.205).
• You can run Hive (0.7.1 or 0.8.1), custom JARs, streaming, Pig, or HBase.
20
Systems Architecture – EMR

[Diagram: client logs flow into S3; inside AWS, an EMR Hadoop cluster (NN, SNN, DNs) builds HDFS from S3; BI instances sit alongside]
• Hadoop cluster created elastically
• Data is streamed from S3 to initialize the Hadoop cluster dynamically
• Results from analytics are stored to S3 once computed
• BI nodes are permanent
21
Amazon EMR – Instance types
[Instance type table: data nodes, BI instances, and master nodes highlighted]
22
AWS Calculator – EMR calculation
• Calculate and add:
  • S3 cost (seeded data)
  • Incremental S3 cost, per month
  • EC2 cost
  • EMR cost
  • In/out data transfer cost
  • Amazon support cost
  • Infrastructure support engineer cost
23
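The checklist above amounts to a simple sum. An illustrative sketch — every dollar figure below is a hypothetical placeholder, not an actual AWS rate:

```python
# Sum the cost components from the checklist for one month.
# All figures are made-up placeholders for illustration only.
monthly_costs = {
    "s3_seeded_data":   1300.0,   # storing the seeded history in S3
    "s3_incremental":    110.0,   # net-new data added this month
    "ec2_instances":    9500.0,   # cluster instance hours
    "emr_surcharge":    2400.0,   # EMR fee on top of EC2
    "data_transfer":     300.0,   # in/out transfer
    "aws_support":       500.0,   # premium support plan
    "infra_engineer":  12500.0,   # loaded cost of a support engineer
}

total = sum(monthly_costs.values())
print(f"Estimated monthly cost: ${total:,.0f}")  # -> Estimated monthly cost: $26,610
```

Keeping the components in a dict like this makes it easy to see which line items (here, the engineer and the EC2 instances) dominate the bill.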
AWS Calculator – EMR calculation
• EMR cost, assuming 24 hrs/day:
24
AWS Calculator – EMR calculation
• S3 cost over 3 years, assuming 24 hrs/day:
25
AWS Calculator – EMR calculation
• EC2 cost over 3 years, assuming 24 hrs/day:
26
Amazon EMR Pricing – Reduced log volume

Instance configuration: 10 instances – data nodes: m1.xlarge; NN: m2.2xlarge; BI: m2.2xlarge; load balancer: t1.micro; 1-year reserved; 10 EMR instances (subject to change depending on actual load)

Data volume                   | Price/year, 24 hrs/day   | Price/year, 8 hrs/day   | Price/year, 8 hrs/week
1 year, storing 42 TB on S3   | $14.1k/mo × 12 = $169.2k | $8.9k × 12 = $106k      | $6.6k × 12 = $79.2k
3 years, storing 126 TB on S3 | $19.5k × 36 mos = $684k  | $15.5k × 36 mos = $558k | $13.2k × 36 mos = $475k

27
Hadoop on AWS: trade-offs

Feature         | EC2                                                     | EMR
Ease of use     | Hard – IT Ops costs                                     | Easy; Hadoop clusters can be of any size; can have multiple clusters
Cost            | Cheaper                                                 | Costlier: pay for EC2 + EMR
Flexibility     | Better: access to the full Hadoop ecosystem stack       | On-demand Hadoop cluster: easy, Hadoop pre-installed, but with limited options
Portability     | Easier to move to dedicated hardware                    | –
Speed           | Faster                                                  | Lower performance: all data is streamed from S3 for each job
Maintainability | Can choose any vendor; can be updated to latest version | Debugging tricky: cluster terminated, no logs

28
EC2 Pricing Gotchas
• EMR with spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance.
• Use reserved instances to bring down cost drastically (~60%).
• Compression on S3?
• Need to account for a secondary NN?
• AWS's AMI task configuration makes it possible to estimate more accurately how many EMR nodes are needed.
29
EMR Technical Gotchas
• Transferring data between S3 and EMR clusters is very fast (and free), as long as your S3 bucket and Hadoop cluster are in the same Amazon region.
• EMR's S3 file system streams data directly to S3 instead of buffering to intermediate local files.
• EMR's S3 file system adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel.
• Store fewer, larger files instead of many smaller ones.
(Source: http://blog.mortardata.com/post/58920122308/s3-hadoop-performance)
30
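The "fewer, larger files" advice can be applied before upload by concatenating small log files into larger batches. A minimal sketch — the batch naming scheme and the 128 MB default target are illustrative choices, not part of any AWS API:

```python
import os

def batch_files(paths, out_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate many small files into fewer large ones of roughly target_bytes."""
    os.makedirs(out_dir, exist_ok=True)
    batches, current, size = [], [], 0
    for p in paths:
        current.append(p)
        size += os.path.getsize(p)
        if size >= target_bytes:       # close this batch, start a new one
            batches.append(current)
            current, size = [], 0
    if current:                        # leftover files form the final batch
        batches.append(current)

    out_paths = []
    for i, batch in enumerate(batches):
        out = os.path.join(out_dir, f"batch-{i:05d}.log")
        with open(out, "wb") as dst:
            for p in batch:
                with open(p, "rb") as src:
                    dst.write(src.read())
        out_paths.append(out)
    return out_paths
```

The resulting batch files, rather than the original small logs, are what would be copied to S3; each mapper then reads one large object instead of opening many small ones.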
In-house Hadoop Cluster

Data volume | Storage for data nodes | Instances               | Price, first year
126 TB      | 6 × 12 × 2 TB          | 10 data nodes, 3 master | $10.6k × 10 DN + $7.3k × 3 = $128k
BI          |                        | 4 nodes                 | $43k

Data node hardware: Dell PowerEdge R720 – E5-2640 2.50 GHz processor, 8 cores, 12M cache, Turbo; 64 GB memory, quad-ranked RDIMM for 2 processors, low volt; 12 × 2 TB 7.2K RPM SATA 3.5" hot-plug hard drives; Intel 82599 dual-port 10GbE mezzanine card

Total: $128k + vendor support ($50k) + full-time person ($150k) = $328k
31
Licensing and support costs
32
Hadoop Distributions
• Cloudera or Hortonworks
• Enterprise 24×7 production support – phone and support portal access (support datasheet attached)
• Minimum $50k
33
Amazon Support – EC2 & EMR

Business:
• Response time: 1 hour
• Access: phone, chat, and email 24/7
• Costs: greater of $100 or
  • 10% of monthly AWS usage for the first $0–$10K
  • 7% of monthly AWS usage from $10K–$80K
  • 5% of monthly AWS usage from $80K–$250K
  • 3% of monthly AWS usage from $250K+
  (about $800/yr)

Enterprise:
• Response time: 15 minutes
• Access: phone, chat, TAM, and email 24/7
• Costs: greater of $15,000 or
  • 10% of monthly AWS usage for the first $0–$150K
  • 7% of monthly AWS usage from $150K–$500K
  • 5% of monthly AWS usage from $500K–$1M
  • 3% of monthly AWS usage from $1M+

http://aws.amazon.com/premiumsupport/
34
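The tiered support fee above works like marginal tax brackets: each slice of monthly usage is charged at its own rate, subject to a minimum. A sketch of the Business plan calculation, with the bracket edges taken from the slide:

```python
# Marginal-tier calculation of the AWS Business support fee described above.
# Bracket edges and rates come from the slide; the $100 minimum applies
# when the tiered amount comes out lower.
BUSINESS_TIERS = [            # (upper bound of bracket, rate)
    (10_000, 0.10),
    (80_000, 0.07),
    (250_000, 0.05),
    (float("inf"), 0.03),
]

def business_support_fee(monthly_usage, tiers=BUSINESS_TIERS, minimum=100.0):
    fee, lower = 0.0, 0.0
    for upper, rate in tiers:
        if monthly_usage > lower:
            # charge only the slice of usage that falls inside this bracket
            fee += (min(monthly_usage, upper) - lower) * rate
        lower = upper
    return max(fee, minimum)

print(business_support_fee(14_100))  # fee on the ~$14.1k/mo EMR estimate (~$1.3k)
```

Applied to the $14.1k/month EMR estimate from the pricing slide, the first $10K is charged at 10% and the remaining $4.1K at 7%, so support adds roughly $1.3k per month.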
Thank You

35

Commit 2024 - Secret Management made easy
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Hadoop AWS infrastructure cost evaluation

Cloud Solution Architecture

Amazon AWS: S3 → HDFS → Hadoop M/R → Hive tables → BI tool; Client: webservers, logs, metadata extraction

1. Copy data to S3
2. Export data to HDFS
3. Process in M/R
4. Display in BI tool (user)
5. Retain results into S3

8
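The disk-sizing walk-through in the capacity-planning slides (compression, then replication, then a utilization ceiling) can be sketched as a small helper. The ratios are the deck's planning assumptions, not fixed constants; actual savings depend on the codec and the input data:

```python
def datanode_disk_tb(raw_tb, compression_ratio=0.6, replication=3, max_utilization=0.75):
    """Estimate raw DataNode disk needed to hold `raw_tb` of log data.

    compression_ratio: fraction of original size remaining after compression
                       (LZO at ~40% savings leaves ~0.6).
    replication:       HDFS replication factor (3 copies of every block).
    max_utilization:   keep HDFS disks at most this full, for headroom.
    """
    compressed = raw_tb * compression_ratio   # after LZO
    replicated = compressed * replication     # 3 copies in HDFS
    return replicated / max_utilization       # leave 25% headroom

# One year of cleaned-up logs (~42 TB) needs roughly 100 TB of DataNode disk:
print(round(datanode_disk_tb(42)))  # -> 101
```

With a 70% utilization ceiling instead (as in the reduced-logs table), the same 42 TB works out to ~108 TB, matching that table's ~107 TB figure.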
Hadoop on AWS: EC2

• Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud.
• Manual set-up of Hadoop on EC2
• Use EBS for storage capacity (HDFS)
• Storage on S3

9
Running Hadoop on AWS: EC2

• EC2 instance options
• Choose instance type
• Choose instance type availability
• Choose instance family
• Choose where the data resides:
• S3 – high latency, but highly available
• EBS
• Permanent storage?
• Snapshots to S3?
• Apache Whirr for set-up

10
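For the Apache Whirr option mentioned above, a minimal `whirr.properties` along these lines would launch a small Hadoop cluster on EC2 (illustrative values only — cluster name, node counts, and hardware ID are placeholders, and credentials are taken from the environment):

```properties
# Hypothetical recipe: 1 master (NameNode + JobTracker) and 3 workers
whirr.cluster-name=log-analytics
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.xlarge
```

Launched with `whirr launch-cluster --config whirr.properties`, and torn down again with `whirr destroy-cluster --config whirr.properties`.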
Amazon EC2 – Instance features

• Other choices:
• EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1,000 Mbps depending on the instance type used.
• Inter-region data transfer
• Dedicated instances: run on single-tenant hardware dedicated to a single customer.
• Spot instances: name your price

11
Amazon Instance Families

• Amazon EC2 instances are grouped into six families: general purpose, memory optimized, compute optimized, storage optimized, micro, and GPU.
• General-purpose instances have memory-to-CPU ratios suitable for most general-purpose apps.
• Memory-optimized instances offer larger memory sizes for high-throughput applications.
• Compute-optimized instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
• Storage-optimized instances are optimized for very high random I/O performance, or very high storage density, low storage cost, and high sequential I/O performance.
• Micro instances provide a small amount of CPU with the ability to burst to higher amounts for brief periods.
• GPU instances, for dynamic applications.

Data nodes

12
Amazon Instance Types – Availability

• On-Demand Instances – let you pay for compute capacity by the hour with no long-term commitments. This frees you from the costs and complexities of planning, purchasing, and maintaining hardware.
• Reserved Instances – give you the option to make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization) that let you balance the amount you pay upfront against your effective hourly price.
• Spot Instances – allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand; customers whose bids meet or exceed it gain access to the available Spot Instances. If you have flexibility in when your applications run, Spot Instances can significantly lower your Amazon EC2 costs.

13
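The spot-instance mechanics in the last bullet can be illustrated with a toy simulation — the hourly price history below is invented, and the model ignores partial-hour billing:

```python
def spot_hours_obtained(bid, hourly_spot_prices):
    """Count the hours an instance would run: it runs only while the bid
    meets or exceeds the current spot price; when outbid, it is terminated."""
    return sum(1 for price in hourly_spot_prices if bid >= price)

def spot_cost(bid, hourly_spot_prices):
    """You pay the market spot price (not your bid) for each hour you run."""
    return sum(p for p in hourly_spot_prices if bid >= p)

prices = [0.08, 0.09, 0.15, 0.07, 0.30, 0.08]  # made-up $/hour history
print(spot_hours_obtained(0.10, prices))   # -> 4  (out of 6 hours)
print(round(spot_cost(0.10, prices), 2))   # -> 0.32
```

This is why the deck recommends spot instances only when SLA timeliness is not of primary importance: the two hours where the price spiked above the bid are simply lost.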
Amazon EC2 – Storage

14
Amazon EC2 – Instance types

Data nodes | BI instances | Master nodes

15
Systems Architecture – EC2

AWS: Hadoop (NN, SN, DNs, EN) with HDFS on EBS drives; S3; BI nodes; Client: Logs

• Hadoop cluster is initiated when analytics is run
• Data is streamed from S3 to EBS volumes
• Results from analytics are stored to S3 once computed
• BI nodes are permanent

16
Hadoop on AWS: EC2

• Probably not the best choice:
• EBS volumes make the solution costly
• If instead using instance storage, the EC2 instance choices are either too small (a few GB) or too big (48 TB per instance)
• Don't need the flexibility – just want to use Hive

17
Hadoop on AWS: EMR

• Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on EC2 and Amazon Simple Storage Service (S3).

18
Running Hadoop on AWS – EMR

• Elastic MapReduce
• For occasional jobs – ephemeral clusters
• Ease of use, but 20% costlier
• Data stored in S3 – highly tuned for S3 storage
• Hive and Pig available
• Only pay for S3 + instance time while jobs are running
• Or: leave it always on

19
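The pay-only-while-running trade-off can be sketched numerically. The hourly rates below are placeholders, not actual AWS pricing; the point is that the EMR surcharge applies per instance-hour while S3 storage accrues either way:

```python
def emr_monthly_cost(instances, hours_per_month, ec2_rate, emr_rate, s3_monthly):
    """EMR billing sketch: (EC2 rate + EMR surcharge) per instance-hour,
    plus S3 storage, which is billed whether or not the cluster is up."""
    instance_hours = instances * hours_per_month
    return instance_hours * (ec2_rate + emr_rate) + s3_monthly

# Illustrative rates: $0.48/hr EC2, $0.12/hr EMR surcharge, $1,300/mo S3.
always_on = emr_monthly_cost(10, 24 * 30, 0.48, 0.12, 1300)
daily_job = emr_monthly_cost(10, 8 * 30, 0.48, 0.12, 1300)  # ephemeral, 8 hrs/day
print(round(always_on), round(daily_job))  # -> 5620 2740
```

The ephemeral pattern only pays off if jobs actually finish in the shorter window — the per-job S3 streaming overhead (see the trade-offs slide) eats into the saving.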
Hadoop on AWS – EMR

• EC2 instances with Amazon's own flavor of Hadoop
• Amazon's Apache Hadoop is version 1.0.3. You can also choose MapR M3 or M5 (0.20.205).
• You can run Hive (0.7.1 or 0.8.1), custom JARs, Streaming, Pig, or HBase.

20
Systems Architecture – EMR

AWS: Hadoop EMR (NN, SN, DNs) with HDFS from S3; S3; BI instances; Client: Logs

• Hadoop cluster created elastically
• Data is streamed from S3 to initiate the Hadoop cluster dynamically
• Results from analytics are stored to S3 once computed
• BI instances are permanent

21
Amazon EMR – Instance types

Data nodes | BI instances | Master nodes

22
AWS calculator – EMR calculation

• Calculate and add:
• S3 cost (seeded data)
• Incremental S3 cost, per month
• EC2 cost
• EMR cost
• In/out data transfer cost
• Amazon support cost
• Infrastructure support engineer cost

23
AWS calculator – EMR calculation

• Say, for 24 hrs/day, EMR cost:

24

AWS calculator – EMR calculation

• Say, for 24 hrs/day, 3-year S3:

25

AWS calculator – EMR calculation

• Say, for 24 hrs/day, 3-year EC2:

26
Amazon EMR Pricing – Reduced log volume

• Instance types: 10 EMR instances, 1-year reserved (subject to change depending on actual load) – data nodes: m1.xlarge; NN: m2.2xlarge; BI: m2.2xlarge; load balancer: t1.micro

• 1 year, storing 42 TB on S3:
• Running 24 hours/day: $14.1k/mo × 12 = $169.2k
• Running 8 hours/day: $8.9k × 12 = $106k
• Running 8 hours/week: $6.6k × 12 = $79.2k

• 3 years, storing 126 TB on S3:
• Running 24 hours/day: $19.5k × 36 mos = $684k
• Running 8 hours/day: $15.5k × 36 mos = $558k
• Running 8 hours/week: $13.2k × 36 mos = $475k

27
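The yearly figures in the pricing table are flat monthly run-rates multiplied out, so they are easy to sanity-check (amounts in $k, taken from the table):

```python
def scenario_total(monthly_k, months):
    """Total cost in $k for a flat monthly run-rate over `months` months."""
    return monthly_k * months

# 1-year scenarios, storing 42 TB on S3:
print(round(scenario_total(14.1, 12), 1))  # -> 169.2  (24 hrs/day)
print(round(scenario_total(8.9, 12), 1))   # -> 106.8  (8 hrs/day)
print(round(scenario_total(6.6, 12), 1))   # -> 79.2   (8 hrs/week)
```

A flat run-rate is itself an assumption: with ephemeral clusters the S3 component grows each month as new data lands, so real invoices drift upward within the year.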
Hadoop on AWS: trade-offs

• Ease of use – EC2: hard, IT Ops costs. EMR: easy; on-demand Hadoop clusters can be of any size, and you can have multiple clusters.
• Cost – EC2: cheaper. EMR: costlier; pay for EC2 + EMR.
• Flexibility – EC2: better, access to the full stack of the Hadoop ecosystem. EMR: Hadoop installed, but with limited options.
• Portability – EC2: easier to move to dedicated hardware.
• Speed – EC2: faster. EMR: lower performance; all data is streamed from S3 for each job.
• Maintainability – EC2: can choose any vendor; can be updated to the latest version. EMR: debugging is tricky; cluster terminated, no logs.

28
EC2 Pricing Gotchas

• EMR with spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance.
• Use reserved instances to bring down cost drastically (60%).
• Compression on S3?
• Need to account for a secondary NN?
• Estimate more accurately how many EMR nodes are needed with AWS's AMI task configuration

29
EMR Technical Gotchas

• Transferring data between S3 and EMR clusters is very fast (and free), as long as your S3 bucket and Hadoop cluster are in the same Amazon region.
• EMR's S3 file system streams data directly to S3 instead of buffering to intermediate local files.
• EMR's S3 file system adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel.
• Store fewer, larger files instead of many smaller ones.
• http://blog.mortardata.com/post/58920122308/s3-hadoop-performance

30
In-house Hadoop cluster

• Data volume: 126 TB (3 years)
• Storage for data nodes: 6 × 12x2TB
• Instances: 10 data nodes, 3 master
• Hardware (Dell PowerEdge R720): E5-2640 2.50 GHz processor, 8 cores, 12M cache, Turbo; 64 GB memory, quad-ranked RDIMM for 2 processors, low volt; 12 × 2TB 7.2K RPM SATA 3.5in hot-plug hard drives; Intel 82599 dual-port 10GbE mezzanine network card
• Price, first year: $10.6k × 6 DN + $7.3k × 3 = $128k; BI 4 nodes: $43k
• Plus vendor support ($50k) + full-time person ($150k) = $328k

31
Hadoop Distributions

• Cloudera or Hortonworks
• Enterprise 24x7 production support – phone and support portal access (support datasheet attached)
• Minimum $50k

33
Amazon Support – EC2 & EMR

• Business:
• Response time: 1 hour
• Access: phone, chat, and email, 24/7
• Costs: greater of $100, or 10% of monthly AWS usage for the first $0–$10K, 7% of monthly AWS usage from $10K–$80K, 5% from $80K–$250K, 3% from $250K+ (about $800/yr)

• Enterprise:
• Response time: 15 minutes
• Access: phone, chat, TAM, and email, 24/7
• Costs: greater of $15,000, or 10% of monthly AWS usage for the first $0–$150K, 7% of monthly AWS usage from $150K–$500K, 5% from $500K–$1M, 3% from $1M+

• http://aws.amazon.com/premiumsupport/

34
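The tiered support pricing works like tax brackets, which is easy to get wrong when budgeting. A sketch for the Business plan, assuming (as the slide's wording suggests) that each rate applies marginally to its band:

```python
def business_support_fee(monthly_usage):
    """AWS Business support sketch: the greater of $100 or a tiered
    percentage of monthly usage (10% to $10K, 7% to $80K, 5% to $250K,
    3% beyond), with each rate applied only to its own band."""
    bands = [(10_000, 0.10), (80_000, 0.07), (250_000, 0.05), (float("inf"), 0.03)]
    fee, lower = 0.0, 0.0
    for upper, rate in bands:
        if monthly_usage > lower:
            fee += (min(monthly_usage, upper) - lower) * rate
        lower = upper
    return max(100.0, fee)

print(round(business_support_fee(5_000)))   # -> 500
print(round(business_support_fee(20_000)))  # -> 1700  (10% of 10K + 7% of 10K)
```

At very small usage the $100 floor dominates; beyond that, the marginal rates mean the effective percentage falls as the AWS bill grows.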

Editor's Notes

  1. On demand: the most flexible option, but also the most expensive. With spot instances, you specify the maximum price you'll pay for an instance, and if there is capacity, you get that instance. If you're outbid, your instance could be terminated. This means that if you have large jobs that don't need to be completed by any specific time, you can use spot instances to complete the job when it's most economical.