FlyData: Amazon Redshift BENCHMARK Series 01
Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
Comparisons of speed and cost efficiency
www.flydata.com
Amazon Redshift took 155 seconds to run our query on 1.2TB of data.
Hadoop + Hive took 1491 seconds to run our query on 1.2TB of data.
Amazon Redshift was about 10X faster.
Amazon Redshift cost $20 to run the query every 30 minutes.
Hadoop + Hive cost $210 to run the query every 30 minutes.
Amazon Redshift was about 10X more cost effective.
Amazon Redshift is a new data warehouse for big data in the cloud. Before Redshift, users had to turn to Hadoop to query terabytes of data.
We ran benchmarks comparing Redshift to Hadoop (Amazon Elastic MapReduce), both in AWS environments, specifically to show the differences for advertising agencies, whose workloads typically have:
• Between 100GB and ~50TB of data
• Frequent queries (more than once an hour)
• A short required turnaround time
Prerequisite - Data
TSV files, gzip compressed

Table        Size                      Columns
imp_log      1) 300GB / 300M records   date datetime, publisher_id integer, ad_campaign_id integer,
             2) 1.2TB / 1.2B records   bid_price real, country varchar(30), attr1-4 varchar(255)
click_log    1) 1.4GB / 1.5M records   date datetime, publisher_id integer, ad_campaign_id integer,
             2) 5.6GB / 6M records     country varchar(30), attr1-4 varchar(255)
ad_campaign  100MB / 100k records
publisher    10MB / 10k records
advertiser   10MB / 10k records

Dataset 1) covers 1 month; dataset 2) covers 4 months.
We use 5 tables and run a query that joins them and creates a report.
1. Query Speed
• Redshift takes 155 seconds to complete our query for 1.2TB
• Hadoop takes 1491 seconds to complete our query for 1.2TB
• Redshift is about 10 times faster than Hadoop for this query
Here, we are comparing Hadoop and Redshift servers of the same cost (Hadoop: c1.xlarge vs Redshift: dw.hs1.xlarge).
[Chart: query time in seconds. 300GB: Hadoop 672sec, Redshift 38sec. 1.2TB: Hadoop 1491sec, Redshift 155sec.]
* The query used can be referenced in our Appendix
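The "about 10 times faster" figure can be checked with a short calculation. The sketch below uses only the four timings shown on this slide; pairing 672sec and 38sec with the 300GB dataset is read off the result tables (11m 12s and the 38s average), and the names `TIMINGS`, `speedup`, and `SPEEDUPS` are illustrative, not part of the benchmark repository.

```python
# Timings in seconds, taken from the Query Speed slide.
TIMINGS = {
    "300GB": {"hadoop": 672, "redshift": 38},
    "1.2TB": {"hadoop": 1491, "redshift": 155},
}


def speedup(hadoop_seconds: float, redshift_seconds: float) -> float:
    """Return how many times faster Redshift finished than Hadoop."""
    return hadoop_seconds / redshift_seconds


# Speedup per dataset size.
SPEEDUPS = {
    size: speedup(t["hadoop"], t["redshift"]) for size, t in TIMINGS.items()
}

if __name__ == "__main__":
    for size, ratio in SPEEDUPS.items():
        print(f"{size}: Redshift is {ratio:.1f}x faster")
```

For the 1.2TB dataset the ratio comes out a little under 10, which is consistent with the slide's "about 10 times".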
2. Total Cost
• Redshift costs $20 per month to run queries every 30 minutes
• Hadoop costs $210 per month to run queries every 30 minutes
• Redshift is about 10 times cheaper than Hadoop to run this job
Here, we are comparing Hadoop and Redshift servers running the same query for the same duration of time.
* The query used can be referenced in our Appendix
Redshift Query Result

Data Size  Instance Type  Number of Instances  Trial  Processing Time  Average  Server Cost Per Day
300GB      dw.hs1.xlarge  1                    1      58s              38s      $20.40
                                               2      43s
                                               3      31s
                                               4      30s
                                               5      30s
1.2TB      dw.hs1.xlarge  1                    1      164s             155s     $20.40
                                               2      149s
                                               3      158s
                                               4      156s
                                               5      150s

* The query used can be referenced in our Appendix
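The averages in the table follow directly from the five trials. As a minimal sketch (the names `TRIALS`, `AVERAGES`, and `hourly_cost` are illustrative), the snippet below recomputes them and also derives the hourly rate implied by the $20.40 per-day server cost, assuming the cluster is billed for all 24 hours.

```python
from statistics import mean

# Per-trial processing times in seconds, copied from the Redshift table.
TRIALS = {
    "300GB": [58, 43, 31, 30, 30],
    "1.2TB": [164, 149, 158, 156, 150],
}

# Mean processing time per dataset size.
AVERAGES = {size: mean(times) for size, times in TRIALS.items()}


def hourly_cost(cost_per_day: float) -> float:
    """Hourly rate implied by a per-day cost, assuming 24 billed hours."""
    return cost_per_day / 24


if __name__ == "__main__":
    for size, avg in AVERAGES.items():
        print(f"{size}: mean of 5 trials = {avg:.1f}s")
    print(f"Implied hourly rate: ${hourly_cost(20.40):.2f}")
```

The exact means are 38.4s and 155.4s, which the slide reports as 38s and 155s.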
Hadoop Query Result

Data Size  Instance Type  Number of Instances  Processing Time  Server Cost Per Day
300GB      c1.xlarge      1                    1h 23m 2s        $0.80
           c1.medium      10                   37m 48s          $0.89
           c1.xlarge      10                   11m 12s          $1.06
1.2TB      m1.xlarge      1                    6h 43m 24s       $3.22
           c1.medium      4                    5h 14m 0s        $3.04
           c1.xlarge      10                   37m 7s           $3.58
           c1.xlarge      20                   24m 51s          $4.64

* The query used can be referenced in our Appendix
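The Hadoop times are given as hours, minutes, and seconds, while the speed comparison uses plain seconds. The following sketch converts between the two so the figures can be cross-checked; the helper name `to_seconds` and the `HADOOP_RUNS` list are illustrative and simply restate the table rows.

```python
import re

# Seconds represented by each unit suffix used in the table.
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(duration: str) -> int:
    """Convert a string such as '1h 23m 2s' or '24m 51s' to seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", duration)
    if not parts:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


# (data size, instance type, number of instances, processing time)
HADOOP_RUNS = [
    ("300GB", "c1.xlarge", 1, "1h 23m 2s"),
    ("300GB", "c1.medium", 10, "37m 48s"),
    ("300GB", "c1.xlarge", 10, "11m 12s"),
    ("1.2TB", "m1.xlarge", 1, "6h 43m 24s"),
    ("1.2TB", "c1.medium", 4, "5h 14m 0s"),
    ("1.2TB", "c1.xlarge", 10, "37m 7s"),
    ("1.2TB", "c1.xlarge", 20, "24m 51s"),
]

if __name__ == "__main__":
    for size, instance, count, duration in HADOOP_RUNS:
        print(f"{size} {instance} x{count}: {to_seconds(duration)}s")
```

The fastest run for each dataset size (11m 12s and 24m 51s) converts to 672 and 1491 seconds, the Hadoop values used in the speed comparison.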
Discussion
• Consider Redshift
– If your data is big (>TB) and you need to run your queries more than once an hour
– If you want to get quick results
• Consider Hadoop (EMR)
– If your data is very large (>PB)
– If your queries run only once a day, week, or month
– If you have already invested in Hadoop technology specialists
APPENDIX – Sample Query
select
    ac.ad_campaign_id as ad_campaign_id,
    adv.advertiser_id as advertiser_id,
    cs.spending as spending,
    ims.imp_total as imp_total,
    cs.click_total as click_total,
    -- count(*) yields integers, so engines with integer division
    -- (Redshift follows PostgreSQL here) truncate CTR and imp_total/1000
    cs.click_total/ims.imp_total as CTR,
    cs.spending/cs.click_total as CPC,
    cs.spending/(ims.imp_total/1000) as CPM
from
    ad_campaigns ac
join
    advertisers adv
    on (ac.advertiser_id = adv.advertiser_id)
join
    -- impressions per campaign
    (select
        il.ad_campaign_id,
        count(*) as imp_total
    from
        imp_logs il
    group by
        il.ad_campaign_id
    ) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
join
    -- spending and clicks per campaign
    (select
        cl.ad_campaign_id,
        sum(cl.bid_price) as spending,
        count(*) as click_total
    from
        click_logs cl
    group by
        cl.ad_campaign_id
    ) cs on (cs.ad_campaign_id = ac.ad_campaign_id);
The query generates a basic ad campaign performance report: impression and click counts, advertiser spending, CTR, CPC, and CPM.
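To see what the report looks like on a small scale, the query can be tried against a toy dataset. The sketch below is an assumption-laden illustration: it uses Python's built-in sqlite3 module rather than Redshift or Hive, a made-up schema containing only the columns the query touches, and invented rows (one campaign, 2000 impressions, 4 clicks at a bid price of 0.5). It also adds a float-cast CTR column next to the integer one, since SQLite, like PostgreSQL-derived engines, truncates integer division.

```python
import sqlite3

# The sample query, adapted for SQLite, with an extra float-cast CTR column.
REPORT_SQL = """
SELECT
    ac.ad_campaign_id,
    adv.advertiser_id,
    cs.spending,
    ims.imp_total,
    cs.click_total,
    cs.click_total / ims.imp_total               AS ctr_integer,
    CAST(cs.click_total AS REAL) / ims.imp_total AS ctr,
    cs.spending / cs.click_total                 AS cpc,
    cs.spending / (ims.imp_total / 1000.0)       AS cpm
FROM ad_campaigns ac
JOIN advertisers adv ON ac.advertiser_id = adv.advertiser_id
JOIN (SELECT il.ad_campaign_id, COUNT(*) AS imp_total
      FROM imp_logs il GROUP BY il.ad_campaign_id) ims
  ON ims.ad_campaign_id = ac.ad_campaign_id
JOIN (SELECT cl.ad_campaign_id, SUM(cl.bid_price) AS spending,
             COUNT(*) AS click_total
      FROM click_logs cl GROUP BY cl.ad_campaign_id) cs
  ON cs.ad_campaign_id = ac.ad_campaign_id
"""


def build_toy_database() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical, minimal tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE advertisers (advertiser_id INTEGER);
        CREATE TABLE ad_campaigns (ad_campaign_id INTEGER, advertiser_id INTEGER);
        CREATE TABLE imp_logs (ad_campaign_id INTEGER);
        CREATE TABLE click_logs (ad_campaign_id INTEGER, bid_price REAL);
        """
    )
    conn.execute("INSERT INTO advertisers VALUES (10)")
    conn.execute("INSERT INTO ad_campaigns VALUES (1, 10)")
    # 2000 impressions and 4 clicks for campaign 1 (invented numbers).
    conn.executemany("INSERT INTO imp_logs VALUES (?)", [(1,)] * 2000)
    conn.executemany("INSERT INTO click_logs VALUES (?, ?)", [(1, 0.5)] * 4)
    return conn


def run_report(conn: sqlite3.Connection) -> list:
    """Run the report query and return all rows."""
    return conn.execute(REPORT_SQL).fetchall()


if __name__ == "__main__":
    for row in run_report(build_toy_database()):
        print(row)
```

On this toy data the integer CTR column is 0 while the cast version is 0.002, which shows why a cast is worth considering when fractional rates matter.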
APPENDIX - Additional Comments
• Redshift is well suited to aggregate calculations such as sum, average, max, and min because it is a columnar database
• Importing large amounts of data takes a lot of time
– 17 hours for 1.2TB in our case
– Continuous importing is useful
• Redshift supports only delimited formats such as CSV and TSV
– JSON is not supported
• Redshift supports only primitive data types
– 11 types: INT, DOUBLE, BOOLEAN, VARCHAR, DATE, ...
(as of Feb. 17, 2013)
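The import time above implies a rough sustained load rate. The sketch below works it out under two stated assumptions: that 1.2TB means 1.2 × 10^12 bytes, and that the full 17 hours was spent loading. The function name `load_throughput_mb_per_s` is illustrative.

```python
def load_throughput_mb_per_s(terabytes: float, hours: float) -> float:
    """Average load rate in MB/s, using decimal units (1 TB = 1,000,000 MB)."""
    megabytes = terabytes * 1_000_000
    seconds = hours * 3600
    return megabytes / seconds


if __name__ == "__main__":
    rate = load_throughput_mb_per_s(1.2, 17)
    print(f"Average load rate: {rate:.1f} MB/s")
```

This works out to just under 20 MB/s on average, which helps explain why loading continuously in small increments is attractive compared with one large import.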
APPENDIX – Additional Information
• All resources for our benchmark are in our GitHub repository
– https://github.com/hapyrus/redshift-benchmark
– The dataset we use is open on S3, so you can reproduce the benchmark
About Us - FlyData
Formerly known as Hapyrus
• FlyData Enterprise
– Enables continuous loading to Amazon Redshift, with real-time data loading
– Automated ETL process with multiple supported data formats
– Auto scaling, data integrity, and high durability
– FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
Contact us at: info@flydata.com
We are an official data integration partner of Amazon Redshift
Check us out! -> http://flydata.com
sales@flydata.com
Toll Free: 1-855-427-9787
