SlideShare a Scribd company logo
1 of 20
Download to read offline
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
An overview of Amazon Athena!
and how it performs against Amazon Redshift 
Julien Simon, Principal Technical Evangelist, AWS
julsimon@amazon.fr 
@julsimon
Never trust the first image on Google!
“Amazon Athena is a professional wrestler” 8-|
On second thought, that’s quite relevant!
Amazon Athena is a professional data wrestler!
•  New service announced at re:Invent 2016
•  Run interactive SQL queries on S3 data
•  No need to load or aggregate data: ‘schema-on-read’
•  S3 data is never modified
•  Cross-region buckets are supported
•  No infrastructure to create, manage or scale
•  Availability: us-east-1, us-west-2
•  Pricing: $5 per Terabyte scanned 
•  Scanned data rounded off to the nearest 10MB
•  Stored data: normal S3 pricing applies

AWS re:Invent 2016: Introducing Athena (BDA303) https://www.youtube.com/watch?v=DxAuj_Ky5aw
Athena queries
•  Service based on Presto (already available in Amazon EMR)
•  Table creation: Apache Hive DDL
–  CREATE EXTERNAL_TABLE only
–  CREATE TABLE AS SELECT is not supported
•  ANSI SQL operators and functions: what Presto supports
•  Unsupported operations
–  User-defined functions (UDF or UDAFs)
–  Stored procedures
–  Any transaction found in Hive or Presto 
https://prestodb.io 
http://docs.aws.amazon.com/athena/latest/ug/known-limitations.html
Athena names
•  Data catalog: AwsDataCatalog
•  Default database: default
–  Add your own with Hive’s CREATE DATABASE
•  Table names:!
"AwsDataCatalog".db_name.table_name
Data formats supported by Athena
•  Unstructured
–  Apache logs, with customizable regular expression
•  Semi-structured
–  Comma-separated values (CSV)
–  Tab-separated values (TSV)
–  Text File with custom delimiters
–  JSON
•  Structured
–  Apache Parquet
–  Apache ORC (Optimized Row Columnar) 
•  Compression formats: Snappy, Zlib, GZIP (no LZO)
–  Less I/O à better performance and cost optimization
https://parquet.apache.org/
https://orc.apache.org/
Data partitioning
•  Partitioning reduces the amount of scanned data
–  Better performance
–  Cost optimization
•  Data may be already partitioned in S3
–  CREATE EXTERNAL TABLE table_name(…) PARTITIONED BY (...) 
–  MSCK REPAIR TABLE table_name
•  Data can also be partitioned at table creation time
–  CREATE EXTERNAL TABLE table_name(…)
–  ALTER TABLE table_name ADD PARTITION …
Running queries on Athena
•  AWS Console (quite cool, actually)
–  Wizard for schema definition and table creation
–  Saved queries
–  Query history
–  Multiple queries in parallel

•  JDBC driver
–  SQL Workbench/J 
–  Java application
http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
Setting up SQL Workbench/J
•  Download & install JDBC driver
•  Create a new connexion 

–  Driver: com.amazonaws.athena.jdbc.AthenaDriver
–  URL: jdbc:awsathena://athena.us-east-1.amazonaws.com:443/
–  Username: your AWS Access Key
–  Password: your AWS Secret Key
–  Add an extended property: s3_staging_dir
•  S3 bucket for output data, e.g. s3://jsimon-athena-output/
•  Make sure this S3 bucket is in the same region as Athena
•  You’re all set!
Using columnar formats for fun and profit
•  Apache Parquet
•  Apache ORC
•  Ditto: better performance & cost optimization
•  You can convert your data to a columnar format with an
Amazon EMR cluster
•  More information and tutorial at
http://docs.aws.amazon.com/athena/latest/ug/convert-to-
columnar.html
Athena vs Redshift
•  Redshift
–  Start 4-node cluster: dc1.large, 160GB SSD, low I/O, $0.25/hr
–  Start 4-node cluster: dc1.8xlarge, 2.56TB SSD, v.high I/O, $4.80/hr
–  Create table
–  Load data from S3 (COPY operation)
–  Run some queries
•  Athena
–  Create table
–  Run the same queries
Athena vs Redshift: start your engines!
•  Athena
–  Initialization : < 5s (table creation)
–  Cost: $0.0025 for a full scan (12GB + a few thousand S3 requests)
–  Unlimited storage in S3
•  Redshift (dc1.large)
–  Initialization: 6mn (create cluster) + 38mn (data load)
–  $1/hr ($0.36 with 3-yr, 100% upfront RIs)
–  Maximum storage: 640GB (about 2TB with compression)
•  Redshift (dc1.8xlarge)
–  Initialization: 6mn (create cluster) + 4mn (data load)
–  $19.20/hr ($6 with 3-yr, 100% upfront RIs)
–  Maximum storage: 10TB (about 30TB with compression)
Athena vs Redshift: data set!
Caveat: this isn’t a huge data set and it doesn’t have any joins!

•  1 table
•  1 billion lines of “e-commerce sales” (43GB)
•  CSV format, 10 columns
•  1000 files in S3, compressed to 12GB (bzip2)
Lastname, Firstname,Gender,State,Age,DayOfYear,Hour,Minutes,Items,Basket
YESTRAMSKI,KEELEY,F,Missouri,36,35,12,21,2,167
MAYOU,SCOTTIE,M,Arkansas,85,258,11,21,9,106
PFARR,SIDNEY,M,Indiana,59,146,22,21,3,163
RENZONI,ALLEN,M,Montana,31,227,13,49,10,106
CUMMINS,NICKY,M,Tennessee,50,362,1,33,1,115
THIMMESCH,BRIAN,M,Washington,29,302,20,41,2,95
Athena vs Redshift: table creation
CREATE EXTERNAL TABLE athenatest.sales (
lastname STRING,
firstname STRING,
gender STRING,
state STRING,
age INT,
day INT,
hour INT,
minutes INT,
items INT,
basket INT 
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://jsimon-redshift-demo-us/data/'
CREATE TABLE sales(
lastname VARCHAR(32) NOT NULL,
firstname VARCHAR(32) NOT NULL,
gender VARCHAR(1) NOT NULL,
state VARCHAR(32) NOT NULL,
age INT NOT NULL,
day INT NOT NULL,
hour INT NOT NULL,
minutes INT NOT NULL,
items INT NOT NULL,
basket INT NOT NULL)
DISTKEY(state)
COMPOUND SORTKEY (lastname,firstname);

COPY sales FROM 's3://jsimon-redshift-demo-us/data/'
REGION 'us-east-1' CREDENTIALS … DELIMITER ',' bzip2
COMPUPDATE ON;
Athena vs Redshift: SQL queries
; Q1: count sales (1 full scan)
SELECT count(*) FROM sales
; Q2: average basket per gender (1 full scan)
SELECT gender, avg(basket) FROM sales GROUP BY gender;
; Q3: 5-day intervals when women spend most
SELECT floor(day/5.00)*5, sum(basket) AS spend FROM sales WHERE gender='F' GROUP
BY floor(day/5.00)*5 ORDER BY spend DESC LIMIT 10;
; Q4: top 10 states where women spend most in December (1 full scan)
SELECT state, sum(basket) AS spend FROM sales WHERE gender='F' AND day>=334 GROUP
BY state ORDER BY spend DESC LIMIT 10;
; Q5: list the top 10000 female customers in the top 10 states (2 full scans)
SELECT lastname, firstname, spend FROM (
SELECT lastname, firstname, sum(basket) AS spend FROM sales WHERE gender='F'
AND state IN(
SELECT state FROM sales WHERE day>=334 GROUP BY state
ORDER BY sum(basket) DESC LIMIT 10
) AND day >=334 GROUP BY lastname,firstname
) WHERE spend >=500 ORDER BY spend DESC LIMIT 10000;
Identical queries
on both systems
Athena vs Redshift: SQL queries!
YMMV, standard disclaimer applies J
; Q1: count sales
Athena: 15-17s, Redshift: 2-3s, Redshift 8xl: <1s
; Q2: average basket per gender
Athena: 20-22s, Redshift: 15-17s, Redshift 8xl: 4-5s
; Q3: 5-day intervals when women spend most
Athena: 20-22s, Redshift: 20-22s, Redshift 8xl: 4-5s
; Q4: top 10 states where women spend most in December
Athena: 22-25s, Redshift: 10-12s, Redshift 8xl: 2-3s
(courtesy of the ‘state’ distribution key)
; Q5: list the top 10000 female customers in the top 10 states
Athena: 38-40s, Redshift: 34-36s, Redshift 8xl: 7-9s
So?
•  For this data set, Athena query performance is in the same ballpark as
a vanilla Redshift cluster of 4 dc1.large nodes
•  Athena saves you the long init time (cluster creation + data load)
•  And probably a lot of money as well 
–  Several orders of magnitude cheaper if you run a single query!
–  Similar to Lambda vs EC2

•  So… Athena looks great IMHO J
•  I can see it being used for much more than ad-hoc queries
•  Redshift still rules when you need the best performance possible
EMR, Redshift or Athena?
•  EMR
–  Scale-out data crunching
–  Custom code running complex transformations on unstructured data
–  Rich Apache Hadoop ecosystem, at the cost of complexity
•  Redshift
–  Petabyte-scale enterprise data warehouse
–  ETL, complex SQL queries and joins on long-lived, structured data
–  Many techniques for performance optimization
•  Athena
–  Answering questions in minutes, with zero infrastructure plumbing
–  Ad-hoc SQL queries, with probably a few or no joins
–  Emphasis on simplicity, not on raw performance
Athena in a nutshell
•  Run ad-hoc SQL queries on S3 data in minutes
•  No infrastructure
•  Multiple input formats supported
•  Slower than Redshift on 8xl nodes, but pretty fast!
•  A simpler, very cost-efficient alternative to EMR !
and Redshift for ad-hoc analysis
Thank you!!
	
Julien	Simon,	Principal	Technical	Evangelist,	AWS	
julsimon@amazon.fr	
@julsimon

More Related Content

What's hot

What's hot (20)

Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교
Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교
Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 
Amazon Redshift
Amazon Redshift Amazon Redshift
Amazon Redshift
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Lake,beyond the Data Warehouse
Data Lake,beyond the Data WarehouseData Lake,beyond the Data Warehouse
Data Lake,beyond the Data Warehouse
 
DBT ELT approach for Advanced Analytics.pptx
DBT ELT approach for Advanced Analytics.pptxDBT ELT approach for Advanced Analytics.pptx
DBT ELT approach for Advanced Analytics.pptx
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPBridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Data Warehouse - Incremental Migration to the Cloud
Data Warehouse - Incremental Migration to the CloudData Warehouse - Incremental Migration to the Cloud
Data Warehouse - Incremental Migration to the Cloud
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
 
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
 

Similar to An overview of Amazon Athena

AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
AWS Summit Tel Aviv - Enterprise Track - Data WarehouseAWS Summit Tel Aviv - Enterprise Track - Data Warehouse
AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Amazon Web Services
 

Similar to An overview of Amazon Athena (20)

Amazon Athena (April 2017)
Amazon Athena (April 2017)Amazon Athena (April 2017)
Amazon Athena (April 2017)
 
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
 
Amazon Athena (March 2017)
Amazon Athena (March 2017)Amazon Athena (March 2017)
Amazon Athena (March 2017)
 
Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customers
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Redshift Chartio Event Presentation
Redshift Chartio Event PresentationRedshift Chartio Event Presentation
Redshift Chartio Event Presentation
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
 
Redshift overview
Redshift overviewRedshift overview
Redshift overview
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon RedshiftPowering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data Warehouse
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Los Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep DiveLos Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep Dive
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaBig Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
 
AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
AWS Summit Tel Aviv - Enterprise Track - Data WarehouseAWS Summit Tel Aviv - Enterprise Track - Data Warehouse
AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 

More from Julien SIMON

More from Julien SIMON (20)

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)
 
Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)
 
Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)
 
An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)
 
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
 
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)
 
Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
 
Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)
 
Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)
 
Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)
 
Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

An overview of Amazon Athena

  • 1. ©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved An overview of Amazon Athena! and how it performs against Amazon Redshift Julien Simon, Principal Technical Evangelist, AWS julsimon@amazon.fr @julsimon
  • 2. Never trust the first image on Google! “Amazon Athena is a professional wrestler” 8-| On second thought, that’s quite relevant!
  • 3. Amazon Athena is a professional data wrestler! •  New service announced at re:Invent 2016 •  Run interactive SQL queries on S3 data •  No need to load or aggregate data: ‘schema-on-read’ •  S3 data is never modified •  Cross-region buckets are supported •  No infrastructure to create, manage or scale •  Availability: us-east-1, us-west-2 •  Pricing: $5 per Terabyte scanned •  Scanned data rounded off to the nearest 10MB •  Stored data: normal S3 pricing applies AWS re:Invent 2016: Introducing Athena (BDA303) https://www.youtube.com/watch?v=DxAuj_Ky5aw
  • 4. Athena queries •  Service based on Presto (already available in Amazon EMR) •  Table creation: Apache Hive DDL –  CREATE EXTERNAL_TABLE only –  CREATE TABLE AS SELECT is not supported •  ANSI SQL operators and functions: what Presto supports •  Unsupported operations –  User-defined functions (UDF or UDAFs) –  Stored procedures –  Any transaction found in Hive or Presto https://prestodb.io http://docs.aws.amazon.com/athena/latest/ug/known-limitations.html
  • 5. Athena names •  Data catalog: AwsDataCatalog •  Default database: default –  Add your own with Hive’s CREATE DATABASE •  Table names:! "AwsDataCatalog".db_name.table_name
  • 6. Data formats supported by Athena •  Unstructured –  Apache logs, with customizable regular expression •  Semi-structured –  Comma-separated values (CSV) –  Tab-separated values (TSV) –  Text File with custom delimiters –  JSON •  Structured –  Apache Parquet –  Apache ORC (Optimized Row Columnar) •  Compression formats: Snappy, Zlib, GZIP (no LZO) –  Less I/O à better performance and cost optimization https://parquet.apache.org/ https://orc.apache.org/
  • 7. Data partitioning •  Partitioning reduces the amount of scanned data –  Better performance –  Cost optimization •  Data may be already partitioned in S3 –  CREATE EXTERNAL TABLE table_name(…) PARTITIONED BY (...) –  MSCK REPAIR TABLE table_name •  Data can also be partitioned at table creation time –  CREATE EXTERNAL TABLE table_name(…) –  ALTER TABLE table_name ADD PARTITION …
  • 8. Running queries on Athena •  AWS Console (quite cool, actually) –  Wizard for schema definition and table creation –  Saved queries –  Query history –  Multiple queries in parallel •  JDBC driver –  SQL Workbench/J –  Java application http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
  • 9. Setting up SQL Workbench/J •  Download & install JDBC driver •  Create a new connexion –  Driver: com.amazonaws.athena.jdbc.AthenaDriver –  URL: jdbc:awsathena://athena.us-east-1.amazonaws.com:443/ –  Username: your AWS Access Key –  Password: your AWS Secret Key –  Add an extended property: s3_staging_dir •  S3 bucket for output data, e.g. s3://jsimon-athena-output/ •  Make sure this S3 bucket is in the same region as Athena •  You’re all set!
  • 10. Using columnar formats for fun and profit •  Apache Parquet •  Apache ORC •  Ditto: better performance & cost optimization •  You can convert your data to a columnar format with an Amazon EMR cluster •  More information and tutorial at http://docs.aws.amazon.com/athena/latest/ug/convert-to- columnar.html
  • 11. Athena vs Redshift •  Redshift –  Start 4-node cluster: dc1.large, 160GB SSD, low I/O, $0.25/hr –  Start 4-node cluster: dc1.8xlarge, 2.56TB SSD, v.high I/O, $4.80/hr –  Create table –  Load data from S3 (COPY operation) –  Run some queries •  Athena –  Create table –  Run the same queries
  • 12. Athena vs Redshift: start your engines! •  Athena –  Initialization : < 5s (table creation) –  Cost: $0.0025 for a full scan (12GB + a few thousand S3 requests) –  Unlimited storage in S3 •  Redshift (dc1.large) –  Initialization: 6mn (create cluster) + 38mn (data load) –  $1/hr ($0.36 with 3-yr, 100% upfront RIs) –  Maximum storage: 640GB (about 2TB with compression) •  Redshift (dc1.8xlarge) –  Initialization: 6mn (create cluster) + 4mn (data load) –  $19.20/hr ($6 with 3-yr, 100% upfront RIs) –  Maximum storage: 10TB (about 30TB with compression)
  • 13. Athena vs Redshift: data set! Caveat: this isn’t a huge data set and it doesn’t have any joins! •  1 table •  1 billion lines of “e-commerce sales” (43GB) •  CSV format, 10 columns •  1000 files in S3, compressed to 12GB (bzip2) Lastname, Firstname,Gender,State,Age,DayOfYear,Hour,Minutes,Items,Basket YESTRAMSKI,KEELEY,F,Missouri,36,35,12,21,2,167 MAYOU,SCOTTIE,M,Arkansas,85,258,11,21,9,106 PFARR,SIDNEY,M,Indiana,59,146,22,21,3,163 RENZONI,ALLEN,M,Montana,31,227,13,49,10,106 CUMMINS,NICKY,M,Tennessee,50,362,1,33,1,115 THIMMESCH,BRIAN,M,Washington,29,302,20,41,2,95
  • 14. Athena vs Redshift: table creation CREATE EXTERNAL TABLE athenatest.sales ( lastname STRING, firstname STRING, gender STRING, state STRING, age INT, day INT, hour INT, minutes INT, items INT, basket INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = ',', 'field.delim' = ',' ) LOCATION 's3://jsimon-redshift-demo-us/data/' CREATE TABLE sales( lastname VARCHAR(32) NOT NULL, firstname VARCHAR(32) NOT NULL, gender VARCHAR(1) NOT NULL, state VARCHAR(32) NOT NULL, age INT NOT NULL, day INT NOT NULL, hour INT NOT NULL, minutes INT NOT NULL, items INT NOT NULL, basket INT NOT NULL) DISTKEY(state) COMPOUND SORTKEY (lastname,firstname); COPY sales FROM 's3://jsimon-redshift-demo-us/data/' REGION 'us-east-1' CREDENTIALS … DELIMITER ',' bzip2 COMPUPDATE ON;
  • 15. Athena vs Redshift: SQL queries ; Q1: count sales (1 full scan) SELECT count(*) FROM sales ; Q2: average basket per gender (1 full scan) SELECT gender, avg(basket) FROM sales GROUP BY gender; ; Q3: 5-day intervals when women spend most SELECT floor(day/5.00)*5, sum(basket) AS spend FROM sales WHERE gender='F' GROUP BY floor(day/5.00)*5 ORDER BY spend DESC LIMIT 10; ; Q4: top 10 states where women spend most in December (1 full scan) SELECT state, sum(basket) AS spend FROM sales WHERE gender='F' AND day>=334 GROUP BY state ORDER BY spend DESC LIMIT 10; ; Q5: list the top 10000 female customers in the top 10 states (2 full scans) SELECT lastname, firstname, spend FROM ( SELECT lastname, firstname, sum(basket) AS spend FROM sales WHERE gender='F' AND state IN( SELECT state FROM sales WHERE day>=334 GROUP BY state ORDER BY sum(basket) DESC LIMIT 10 ) AND day >=334 GROUP BY lastname,firstname ) WHERE spend >=500 ORDER BY spend DESC LIMIT 10000; Identical queries on both systems
  • 16. Athena vs Redshift: SQL queries! YMMV, standard disclaimer applies J ; Q1: count sales Athena: 15-17s, Redshift: 2-3s, Redshift 8xl: <1s ; Q2: average basket per gender Athena: 20-22s, Redshift: 15-17s, Redshift 8xl: 4-5s ; Q3: 5-day intervals when women spend most Athena: 20-22s, Redshift: 20-22s, Redshift 8xl: 4-5s ; Q4: top 10 states where women spend most in December Athena: 22-25s, Redshift: 10-12s, Redshift 8xl: 2-3s (courtesy of the ‘state’ distribution key) ; Q5: list the top 10000 female customers in the top 10 states Athena: 38-40s, Redshift: 34-36s, Redshift 8xl: 7-9s
  • 17. So? •  For this data set, Athena query performance is in the same ballpark as a vanilla Redshift cluster of 4 dc1.large nodes •  Athena saves you the long init time (cluster creation + data load) •  And probably a lot of money as well –  Several orders of magnitude cheaper if you run a single query! –  Similar to Lambda vs EC2 •  So… Athena looks great IMHO J •  I can see it being used for much more than ad-hoc queries •  Redshift still rules when you need the best performance possible
  • 18. EMR, Redshift or Athena? •  EMR –  Scale-out data crunching –  Custom code running complex transformations on unstructured data –  Rich Apache Hadoop ecosystem, at the cost of complexity •  Redshift –  Petabyte-scale enterprise data warehouse –  ETL, complex SQL queries and joins on long-lived, structured data –  Many techniques for performance optimization •  Athena –  Answering questions in minutes, with zero infrastructure plumbing –  Ad-hoc SQL queries, with probably a few or no joins –  Emphasis on simplicity, not on raw performance
  • 19. Athena in a nutshell •  Run ad-hoc SQL queries on S3 data in minutes •  No infrastructure •  Multiple input formats supported •  Slower than Redshift on 8xl nodes, but pretty fast! •  A simpler, very cost-efficient alternative to EMR ! and Redshift for ad-hoc analysis