Hadoop Platform infrastructure cost evaluation
Agenda
• High level requirements
• Cloud architecture
• Major architecture components
  • Amazon AWS
  • Hadoop distributions
• Capacity planning
  • Amazon AWS – EMR
  • Hadoop distributions
  • On-premise hardware costs
• Gotchas
2
High Level Requirements
• Build an analytical & BI platform for web log analytics
• Ingest multiple data sources:
  • Log data
  • Internal user data
• Apply complex business rules:
  • Manage events, filter crawler-driven logs, apply industry- and domain-specific rules
• Populate/export to a BI tool for visualization
3
Non-Functional Requirements
• Today's baseline: ~42 TB per year (~3.5 TB raw data per month), 3-year store
• SLA: should process data every day (currently done once a month)
• Predefined processing via Hive; no exploratory analysis
• Everything in the cloud: store (HDFS), compute (M/R), analysis (BI tool)
4
Non-Functional Requirements [2]
• Seeding data in S3 (3 years' worth of data)
• Adding monthly net-new data only
• Speed not of primary importance
5
Data Estimates for Capacity Planning [2]
• Cleaned-up log data per year: 42 TB (3 years = 126 TB)
• Total disk space required should consider:
  • Compression (LZO, ~40%): reduces disk space required to ~25 TB *
  • Replication factor of 3: ~75 TB
  • 75% maximum disk utilization in Hadoop: ~100 TB
• Total disk capacity required for data nodes: ~100 TB/year (~8.5 TB/mo)

(*disclaimer: depends on codec and data input)

6
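The arithmetic on this slide can be checked with a short script. A minimal sketch, using the slide's inputs (42 TB/yr raw, ~40% LZO reduction, replication factor 3, 75% maximum disk utilization):

```python
def dn_capacity_tb(raw_tb, compression_ratio=0.6, replication=3, max_util=0.75):
    """Disk capacity needed on the data nodes for a given raw log volume."""
    compressed = raw_tb * compression_ratio   # LZO reduces raw data by ~40%
    replicated = compressed * replication     # HDFS keeps 3 copies of each block
    return replicated / max_util              # headroom: fill disks to at most 75%

print(round(dn_capacity_tb(42), 1))  # one year of logs -> 100.8 (TB)
```

Running the same function with `max_util=0.70` reproduces the "reduced logs" figures on the following slide.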
Data Estimates for Capacity Planning: reduced logs

Expected data volume | Log data volume (TB) | After compression (Gzip 40%) | Replication on 3 nodes (TB) | 70% disk utilization maximum (TB)
1 month              | 3.6                  | 2.16                         | 6.5                         | 9.2
1 year               | 42                   | 25                           | 75                          | 107
3 years              | 126                  | 75.6                         | 226                         | 322

• Total disk capacity required for DN: ~10 TB/month
7
Cloud Solution Architecture

[Diagram: webservers produce logs on the client side; metadata extraction feeds the transfer; inside Amazon AWS, S3 feeds a Hadoop cluster (HDFS, Hive tables) and a BI tool serves the user]
1. Copy data to S3
2. Export data to HDFS
3. Process in M/R (Hive tables)
4. Display in BI tool
5. Retain results into S3
8
Hadoop on AWS: EC2
• Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud.
• Manual setup of Hadoop on EC2
• Use EBS for storage capacity (HDFS)
• Storage on S3
9
Running Hadoop on AWS: EC2
• EC2 instance options:
  • Choose instance type
  • Choose instance type availability
  • Choose instance family
• Choose where the data resides:
  • S3 – high latency, but highly available
  • EBS
• Permanent storage?
• Snapshots to S3?
• Apache Whirr for setup
10
Amazon EC2 – Instance features
• Other choices:
  • EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1000 Mbps depending on the instance type used
  • Inter-region data transfer
  • Dedicated instances: run on single-tenant hardware dedicated to a single customer
  • Spot instances: name your price
11
Amazon Instance Families
• Amazon EC2 instances are grouped into six families: general purpose, memory optimized, compute optimized, storage optimized, micro, and GPU.
• General-purpose instances have memory-to-CPU ratios suitable for most general-purpose apps.
• Memory-optimized instances offer larger memory sizes for high-throughput applications.
• Compute-optimized instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
• Storage-optimized instances are optimized for very high random I/O performance, very high storage density, low storage cost, and high sequential I/O performance.
• Micro instances provide a small amount of CPU with the ability to burst to higher amounts for brief periods.
• GPU instances, for dynamic applications.
12
Amazon Instance Types – Availability
• On-Demand Instances – let you pay for compute capacity by the hour with no long-term commitments. This frees you from the costs and complexities of planning, purchasing, and maintaining hardware.
• Reserved Instances – give you the option to make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization) that let you balance the amount you pay upfront with your effective hourly price.
• Spot Instances – allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand; customers whose bids meet or exceed it gain access to the available Spot Instances. If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs.
13
Amazon EC2 – Storage

14
Amazon EC2 – Instance types
[Instance type table: data nodes, BI instances, and master nodes highlighted]
15
Systems Architecture – EC2

[Diagram: client logs flow into S3; inside AWS, a Hadoop cluster (NN, SN, DNs, edge node) runs HDFS on EBS drives; BI nodes sit alongside]
• Hadoop cluster is initiated when analytics is run
• Data is streamed from S3 to EBS volumes
• Results from analytics are stored to S3 once computed
• BI nodes are permanent
16
Hadoop on AWS: EC2
• Probably not the best choice:
  • EBS volumes make the solution costly
  • If instead using instance storage, the EC2 instance choices are either too small (a few GB) or too big (48 TB per instance)
  • Don't need the flexibility – just want to use Hive
17
Hadoop on AWS: EMR
• Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on EC2 and Amazon Simple Storage Service (S3).
18
Running Hadoop on AWS – EMR
• Elastic MapReduce
• For occasional jobs – ephemeral clusters
• Ease of use, but ~20% costlier
• Data stored in S3 – highly tuned for S3 storage
• Hive and Pig available
• Only pay for S3 + instance time while jobs are running
• Or: leave it always on
19
Hadoop on AWS – EMR
• EC2 instances with Amazon's own flavor of Hadoop
• Amazon's Apache Hadoop is version 1.0.3; you can also choose MapR M3 or M5 (0.20.205).
• You can run Hive (0.7.1 or 0.8.1), custom JARs, streaming, Pig, or HBase.
20
Systems Architecture – EMR

[Diagram: client logs flow into S3; inside AWS, an EMR Hadoop cluster (NN, SNN, DNs) builds HDFS from S3; BI instances sit alongside]
• Hadoop cluster created elastically
• Data is streamed from S3 to initialize the Hadoop cluster dynamically
• Results from analytics are stored to S3 once computed
• BI nodes are permanent
21
Amazon EMR – Instance types
[Instance type table: data nodes, BI instances, and master nodes highlighted]
22
AWS Calculator – EMR calculation
• Calculate and add:
  • S3 cost (seeded data)
  • Incremental S3 cost, per month
  • EC2 cost
  • EMR cost
  • In/out data transfer cost
  • Amazon support cost
  • Infrastructure support engineer cost
23
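The checklist above amounts to a simple sum. An illustrative sketch — every dollar figure below is a hypothetical placeholder, not an actual AWS rate:

```python
# Sum the cost components from the checklist for one month.
# All figures are made-up placeholders for illustration only.
monthly_costs = {
    "s3_seeded_data":   1300.0,   # storing the seeded history in S3
    "s3_incremental":    110.0,   # net-new data added this month
    "ec2_instances":    9500.0,   # cluster instance hours
    "emr_surcharge":    2400.0,   # EMR fee on top of EC2
    "data_transfer":     300.0,   # in/out transfer
    "aws_support":       500.0,   # premium support plan
    "infra_engineer":  12500.0,   # loaded cost of a support engineer
}

total = sum(monthly_costs.values())
print(f"Estimated monthly cost: ${total:,.0f}")  # -> Estimated monthly cost: $26,610
```

Keeping the components in a dict like this makes it easy to see which line items (here, the engineer and the EC2 instances) dominate the bill.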
AWS Calculator – EMR calculation
• EMR cost, assuming 24 hrs/day:
24
AWS Calculator – EMR calculation
• S3 cost over 3 years, assuming 24 hrs/day:
25
AWS Calculator – EMR calculation
• EC2 cost over 3 years, assuming 24 hrs/day:
26
Amazon EMR Pricing – Reduced log volume

Instance configuration: 10 instances – data nodes: m1.xlarge; NN: m2.2xlarge; BI: m2.2xlarge; load balancer: t1.micro; 1-year reserved; 10 EMR instances (subject to change depending on actual load)

Data volume                   | Price/year, 24 hrs/day   | Price/year, 8 hrs/day   | Price/year, 8 hrs/week
1 year, storing 42 TB on S3   | $14.1k/mo × 12 = $169.2k | $8.9k × 12 = $106k      | $6.6k × 12 = $79.2k
3 years, storing 126 TB on S3 | $19.5k × 36 mos = $684k  | $15.5k × 36 mos = $558k | $13.2k × 36 mos = $475k

27
Hadoop on AWS: trade-offs

Feature         | EC2                                                     | EMR
Ease of use     | Hard – IT Ops costs                                     | Easy; Hadoop clusters can be of any size; can have multiple clusters
Cost            | Cheaper                                                 | Costlier: pay for EC2 + EMR
Flexibility     | Better: access to the full Hadoop ecosystem stack       | On-demand Hadoop cluster: easy, Hadoop pre-installed, but with limited options
Portability     | Easier to move to dedicated hardware                    | –
Speed           | Faster                                                  | Lower performance: all data is streamed from S3 for each job
Maintainability | Can choose any vendor; can be updated to latest version | Debugging tricky: cluster terminated, no logs

28
EC2 Pricing Gotchas
• EMR with spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance.
• Use reserved instances to bring down cost drastically (~60%).
• Compression on S3?
• Need to account for a secondary NN?
• AWS's AMI task configuration makes it possible to estimate more accurately how many EMR nodes are needed.
29
EMR Technical Gotchas
• Transferring data between S3 and EMR clusters is very fast (and free), as long as your S3 bucket and Hadoop cluster are in the same Amazon region.
• EMR's S3 file system streams data directly to S3 instead of buffering to intermediate local files.
• EMR's S3 file system adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel.
• Store fewer, larger files instead of many smaller ones.
(Source: http://blog.mortardata.com/post/58920122308/s3-hadoop-performance)
30
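The "fewer, larger files" advice can be applied before upload by concatenating small log files into larger batches. A minimal sketch — the batch naming scheme and the 128 MB default target are illustrative choices, not part of any AWS API:

```python
import os

def batch_files(paths, out_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate many small files into fewer large ones of roughly target_bytes."""
    os.makedirs(out_dir, exist_ok=True)
    batches, current, size = [], [], 0
    for p in paths:
        current.append(p)
        size += os.path.getsize(p)
        if size >= target_bytes:       # close this batch, start a new one
            batches.append(current)
            current, size = [], 0
    if current:                        # leftover files form the final batch
        batches.append(current)

    out_paths = []
    for i, batch in enumerate(batches):
        out = os.path.join(out_dir, f"batch-{i:05d}.log")
        with open(out, "wb") as dst:
            for p in batch:
                with open(p, "rb") as src:
                    dst.write(src.read())
        out_paths.append(out)
    return out_paths
```

The resulting batch files, rather than the original small logs, are what would be copied to S3; each mapper then reads one large object instead of opening many small ones.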
In-house Hadoop Cluster

Data volume | Storage for data nodes | Instances               | Price, first year
126 TB      | 6 × 12 × 2 TB          | 10 data nodes, 3 master | $10.6k × 10 DN + $7.3k × 3 = $128k
BI          |                        | 4 nodes                 | $43k

Data node hardware: Dell PowerEdge R720 – E5-2640 2.50 GHz processor, 8 cores, 12M cache, Turbo; 64 GB memory, quad-ranked RDIMM for 2 processors, low volt; 12 × 2 TB 7.2K RPM SATA 3.5" hot-plug hard drives; Intel 82599 dual-port 10GbE mezzanine card

Total: $128k + vendor support ($50k) + full-time person ($150k) = $328k
31
Licensing and support costs
32
Hadoop Distributions
• Cloudera or Hortonworks
• Enterprise 24×7 production support – phone and support portal access (support datasheet attached)
• Minimum $50k
33
Amazon Support – EC2 & EMR

Business:
• Response time: 1 hour
• Access: phone, chat, and email 24/7
• Costs: greater of $100 or
  • 10% of monthly AWS usage for the first $0–$10K
  • 7% of monthly AWS usage from $10K–$80K
  • 5% of monthly AWS usage from $80K–$250K
  • 3% of monthly AWS usage from $250K+
  (about $800/yr)

Enterprise:
• Response time: 15 minutes
• Access: phone, chat, TAM, and email 24/7
• Costs: greater of $15,000 or
  • 10% of monthly AWS usage for the first $0–$150K
  • 7% of monthly AWS usage from $150K–$500K
  • 5% of monthly AWS usage from $500K–$1M
  • 3% of monthly AWS usage from $1M+

http://aws.amazon.com/premiumsupport/
34
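The tiered support fee above works like marginal tax brackets: each slice of monthly usage is charged at its own rate, subject to a minimum. A sketch of the Business plan calculation, with the bracket edges taken from the slide:

```python
# Marginal-tier calculation of the AWS Business support fee described above.
# Bracket edges and rates come from the slide; the $100 minimum applies
# when the tiered amount comes out lower.
BUSINESS_TIERS = [            # (upper bound of bracket, rate)
    (10_000, 0.10),
    (80_000, 0.07),
    (250_000, 0.05),
    (float("inf"), 0.03),
]

def business_support_fee(monthly_usage, tiers=BUSINESS_TIERS, minimum=100.0):
    fee, lower = 0.0, 0.0
    for upper, rate in tiers:
        if monthly_usage > lower:
            # charge only the slice of usage that falls inside this bracket
            fee += (min(monthly_usage, upper) - lower) * rate
        lower = upper
    return max(fee, minimum)

print(business_support_fee(14_100))  # fee on the ~$14.1k/mo EMR estimate (~$1.3k)
```

Applied to the $14.1k/month EMR estimate from the pricing slide, the first $10K is charged at 10% and the remaining $4.1K at 7%, so support adds roughly $1.3k per month.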
Thank You

35

Commit 2024 - Secret Management made easy
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Hadoop AWS infrastructure cost evaluation

Cloud Solution Architecture

Amazon AWS: S3 → HDFS → Hadoop M/R → Hive tables → BI tool; Client: webservers, logs, metadata extraction

1. Copy data to S3
2. Export data to HDFS
3. Process in M/R
4. Display in BI tool (user)
5. Retain results into S3

8
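The disk-sizing walk-through in the capacity-planning slides (compression, then replication, then a utilization ceiling) can be sketched as a small helper. The ratios are the deck's planning assumptions, not fixed constants; actual savings depend on the codec and the input data:

```python
def datanode_disk_tb(raw_tb, compression_ratio=0.6, replication=3, max_utilization=0.75):
    """Estimate raw DataNode disk needed to hold `raw_tb` of log data.

    compression_ratio: fraction of original size remaining after compression
                       (LZO at ~40% savings leaves ~0.6).
    replication:       HDFS replication factor (3 copies of every block).
    max_utilization:   keep HDFS disks at most this full, for headroom.
    """
    compressed = raw_tb * compression_ratio   # after LZO
    replicated = compressed * replication     # 3 copies in HDFS
    return replicated / max_utilization       # leave 25% headroom

# One year of cleaned-up logs (~42 TB) needs roughly 100 TB of DataNode disk:
print(round(datanode_disk_tb(42)))  # -> 101
```

With a 70% utilization ceiling instead (as in the reduced-logs table), the same 42 TB works out to ~108 TB, matching that table's ~107 TB figure.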
Hadoop on AWS: EC2

• Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud.
• Manual set-up of Hadoop on EC2
• Use EBS for storage capacity (HDFS)
• Storage on S3

9
Running Hadoop on AWS: EC2

• EC2 instance options
• Choose instance type
• Choose instance type availability
• Choose instance family
• Choose where the data resides:
• S3 – high latency, but highly available
• EBS
• Permanent storage?
• Snapshots to S3?
• Apache Whirr for set-up

10
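For the Apache Whirr option mentioned above, a minimal `whirr.properties` along these lines would launch a small Hadoop cluster on EC2 (illustrative values only — cluster name, node counts, and hardware ID are placeholders, and credentials are taken from the environment):

```properties
# Hypothetical recipe: 1 master (NameNode + JobTracker) and 3 workers
whirr.cluster-name=log-analytics
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.xlarge
```

Launched with `whirr launch-cluster --config whirr.properties`, and torn down again with `whirr destroy-cluster --config whirr.properties`.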
Amazon EC2 – Instance features

• Other choices:
• EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1,000 Mbps depending on the instance type used.
• Inter-region data transfer
• Dedicated instances: run on single-tenant hardware dedicated to a single customer.
• Spot instances: name your price

11
Amazon Instance Families

• Amazon EC2 instances are grouped into six families: general purpose, memory optimized, compute optimized, storage optimized, micro, and GPU.
• General-purpose instances have memory-to-CPU ratios suitable for most general-purpose apps.
• Memory-optimized instances offer larger memory sizes for high-throughput applications.
• Compute-optimized instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
• Storage-optimized instances are optimized for very high random I/O performance, or very high storage density, low storage cost, and high sequential I/O performance.
• Micro instances provide a small amount of CPU with the ability to burst to higher amounts for brief periods.
• GPU instances, for dynamic applications.

Data nodes

12
Amazon Instance Types – Availability

• On-Demand Instances – let you pay for compute capacity by the hour with no long-term commitments. This frees you from the costs and complexities of planning, purchasing, and maintaining hardware.
• Reserved Instances – give you the option to make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization) that let you balance the amount you pay upfront against your effective hourly price.
• Spot Instances – allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand; customers whose bids meet or exceed it gain access to the available Spot Instances. If you have flexibility in when your applications run, Spot Instances can significantly lower your Amazon EC2 costs.

13
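The spot-instance mechanics in the last bullet can be illustrated with a toy simulation — the hourly price history below is invented, and the model ignores partial-hour billing:

```python
def spot_hours_obtained(bid, hourly_spot_prices):
    """Count the hours an instance would run: it runs only while the bid
    meets or exceeds the current spot price; when outbid, it is terminated."""
    return sum(1 for price in hourly_spot_prices if bid >= price)

def spot_cost(bid, hourly_spot_prices):
    """You pay the market spot price (not your bid) for each hour you run."""
    return sum(p for p in hourly_spot_prices if bid >= p)

prices = [0.08, 0.09, 0.15, 0.07, 0.30, 0.08]  # made-up $/hour history
print(spot_hours_obtained(0.10, prices))   # -> 4  (out of 6 hours)
print(round(spot_cost(0.10, prices), 2))   # -> 0.32
```

This is why the deck recommends spot instances only when SLA timeliness is not of primary importance: the two hours where the price spiked above the bid are simply lost.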
Amazon EC2 – Storage

14
Amazon EC2 – Instance types

Data nodes | BI instances | Master nodes

15
Systems Architecture – EC2

AWS: Hadoop (NN, SN, DNs, EN) with HDFS on EBS drives; S3; BI nodes; Client: Logs

• Hadoop cluster is initiated when analytics is run
• Data is streamed from S3 to EBS volumes
• Results from analytics are stored to S3 once computed
• BI nodes are permanent

16
Hadoop on AWS: EC2

• Probably not the best choice:
• EBS volumes make the solution costly
• If instead using instance storage, the EC2 instance choices are either too small (a few GB) or too big (48 TB per instance)
• Don't need the flexibility – just want to use Hive

17
Hadoop on AWS: EMR

• Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on EC2 and Amazon Simple Storage Service (S3).

18
Running Hadoop on AWS – EMR

• Elastic MapReduce
• For occasional jobs – ephemeral clusters
• Ease of use, but 20% costlier
• Data stored in S3 – highly tuned for S3 storage
• Hive and Pig available
• Only pay for S3 + instance time while jobs are running
• Or: leave it always on

19
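The pay-only-while-running trade-off can be sketched numerically. The hourly rates below are placeholders, not actual AWS pricing; the point is that the EMR surcharge applies per instance-hour while S3 storage accrues either way:

```python
def emr_monthly_cost(instances, hours_per_month, ec2_rate, emr_rate, s3_monthly):
    """EMR billing sketch: (EC2 rate + EMR surcharge) per instance-hour,
    plus S3 storage, which is billed whether or not the cluster is up."""
    instance_hours = instances * hours_per_month
    return instance_hours * (ec2_rate + emr_rate) + s3_monthly

# Illustrative rates: $0.48/hr EC2, $0.12/hr EMR surcharge, $1,300/mo S3.
always_on = emr_monthly_cost(10, 24 * 30, 0.48, 0.12, 1300)
daily_job = emr_monthly_cost(10, 8 * 30, 0.48, 0.12, 1300)  # ephemeral, 8 hrs/day
print(round(always_on), round(daily_job))  # -> 5620 2740
```

The ephemeral pattern only pays off if jobs actually finish in the shorter window — the per-job S3 streaming overhead (see the trade-offs slide) eats into the saving.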
Hadoop on AWS – EMR

• EC2 instances with Amazon's own flavor of Hadoop
• Amazon's Apache Hadoop is version 1.0.3. You can also choose MapR M3 or M5 (0.20.205).
• You can run Hive (0.7.1 or 0.8.1), custom JARs, Streaming, Pig, or HBase.

20
Systems Architecture – EMR

AWS: Hadoop EMR (NN, SN, DNs) with HDFS from S3; S3; BI instances; Client: Logs

• Hadoop cluster created elastically
• Data is streamed from S3 to initiate the Hadoop cluster dynamically
• Results from analytics are stored to S3 once computed
• BI instances are permanent

21
Amazon EMR – Instance types

Data nodes | BI instances | Master nodes

22
AWS calculator – EMR calculation

• Calculate and add:
• S3 cost (seeded data)
• Incremental S3 cost, per month
• EC2 cost
• EMR cost
• In/out data transfer cost
• Amazon support cost
• Infrastructure support engineer cost

23
AWS calculator – EMR calculation

• Say, for 24 hrs/day, EMR cost:

24

AWS calculator – EMR calculation

• Say, for 24 hrs/day, 3-year S3:

25

AWS calculator – EMR calculation

• Say, for 24 hrs/day, 3-year EC2:

26
Amazon EMR Pricing – Reduced log volume

• Instance types: 10 EMR instances, 1-year reserved (subject to change depending on actual load) – data nodes: m1.xlarge; NN: m2.2xlarge; BI: m2.2xlarge; load balancer: t1.micro

• 1 year, storing 42 TB on S3:
• Running 24 hours/day: $14.1k/mo × 12 = $169.2k
• Running 8 hours/day: $8.9k × 12 = $106k
• Running 8 hours/week: $6.6k × 12 = $79.2k

• 3 years, storing 126 TB on S3:
• Running 24 hours/day: $19.5k × 36 mos = $684k
• Running 8 hours/day: $15.5k × 36 mos = $558k
• Running 8 hours/week: $13.2k × 36 mos = $475k

27
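The yearly figures in the pricing table are flat monthly run-rates multiplied out, so they are easy to sanity-check (amounts in $k, taken from the table):

```python
def scenario_total(monthly_k, months):
    """Total cost in $k for a flat monthly run-rate over `months` months."""
    return monthly_k * months

# 1-year scenarios, storing 42 TB on S3:
print(round(scenario_total(14.1, 12), 1))  # -> 169.2  (24 hrs/day)
print(round(scenario_total(8.9, 12), 1))   # -> 106.8  (8 hrs/day)
print(round(scenario_total(6.6, 12), 1))   # -> 79.2   (8 hrs/week)
```

A flat run-rate is itself an assumption: with ephemeral clusters the S3 component grows each month as new data lands, so real invoices drift upward within the year.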
Hadoop on AWS: trade-offs

• Ease of use – EC2: hard, IT Ops costs. EMR: easy; on-demand Hadoop clusters can be of any size, and you can have multiple clusters.
• Cost – EC2: cheaper. EMR: costlier; pay for EC2 + EMR.
• Flexibility – EC2: better, access to the full stack of the Hadoop ecosystem. EMR: Hadoop installed, but with limited options.
• Portability – EC2: easier to move to dedicated hardware.
• Speed – EC2: faster. EMR: lower performance; all data is streamed from S3 for each job.
• Maintainability – EC2: can choose any vendor; can be updated to the latest version. EMR: debugging is tricky; cluster terminated, no logs.

28
EC2 Pricing Gotchas

• EMR with spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance.
• Use reserved instances to bring down cost drastically (60%).
• Compression on S3?
• Need to account for a secondary NN?
• Estimate more accurately how many EMR nodes are needed with AWS's AMI task configuration

29
EMR Technical Gotchas

• Transferring data between S3 and EMR clusters is very fast (and free), as long as your S3 bucket and Hadoop cluster are in the same Amazon region.
• EMR's S3 file system streams data directly to S3 instead of buffering to intermediate local files.
• EMR's S3 file system adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel.
• Store fewer, larger files instead of many smaller ones.
• http://blog.mortardata.com/post/58920122308/s3-hadoop-performance

30
In-house Hadoop cluster

• Data volume: 126 TB (3 years)
• Storage for data nodes: 6 × 12x2TB
• Instances: 10 data nodes, 3 master
• Hardware (Dell PowerEdge R720): E5-2640 2.50 GHz processor, 8 cores, 12M cache, Turbo; 64 GB memory, quad-ranked RDIMM for 2 processors, low volt; 12 × 2TB 7.2K RPM SATA 3.5in hot-plug hard drives; Intel 82599 dual-port 10GbE mezzanine network card
• Price, first year: $10.6k × 6 DN + $7.3k × 3 = $128k; BI 4 nodes: $43k
• Plus vendor support ($50k) + full-time person ($150k) = $328k

31
Hadoop Distributions

• Cloudera or Hortonworks
• Enterprise 24x7 production support – phone and support portal access (support datasheet attached)
• Minimum $50k

33
Amazon Support – EC2 & EMR

• Business:
• Response time: 1 hour
• Access: phone, chat, and email, 24/7
• Costs: greater of $100, or 10% of monthly AWS usage for the first $0–$10K, 7% of monthly AWS usage from $10K–$80K, 5% from $80K–$250K, 3% from $250K+ (about $800/yr)

• Enterprise:
• Response time: 15 minutes
• Access: phone, chat, TAM, and email, 24/7
• Costs: greater of $15,000, or 10% of monthly AWS usage for the first $0–$150K, 7% of monthly AWS usage from $150K–$500K, 5% from $500K–$1M, 3% from $1M+

• http://aws.amazon.com/premiumsupport/

34
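The tiered support pricing works like tax brackets, which is easy to get wrong when budgeting. A sketch for the Business plan, assuming (as the slide's wording suggests) that each rate applies marginally to its band:

```python
def business_support_fee(monthly_usage):
    """AWS Business support sketch: the greater of $100 or a tiered
    percentage of monthly usage (10% to $10K, 7% to $80K, 5% to $250K,
    3% beyond), with each rate applied only to its own band."""
    bands = [(10_000, 0.10), (80_000, 0.07), (250_000, 0.05), (float("inf"), 0.03)]
    fee, lower = 0.0, 0.0
    for upper, rate in bands:
        if monthly_usage > lower:
            fee += (min(monthly_usage, upper) - lower) * rate
        lower = upper
    return max(100.0, fee)

print(round(business_support_fee(5_000)))   # -> 500
print(round(business_support_fee(20_000)))  # -> 1700  (10% of 10K + 7% of 10K)
```

At very small usage the $100 floor dominates; beyond that, the marginal rates mean the effective percentage falls as the AWS bill grows.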

Editor's Notes

  1. On demand: the most flexible option, but also the most expensive. With spot instances, you specify the maximum price you'll pay for an instance, and if there is capacity, you get that instance. If you're outbid, your instance could be terminated. This means that if you have large jobs that don't need to be completed by any specific time, you can use spot instances to complete the job when it's most economical.