Querying Data Pipeline with
AWS Athena
Yaroslav Tkachenko, Senior Software Engineer
Our main titles
Game clients
“Pipes”
Data Pipeline
Game servers
(events, metrics, telemetry, etc.)
Archiving
Data warehouse
Analytics
External services
The Problem
Now: Stream Processing / Data Warehouse
● Not an option - want to query historical data
● Need raw data
● Don’t want to support complex infrastructure
● Retention is usually short
Amazon Athena is an interactive query
service that makes it easy to analyze
data in Amazon S3 using standard SQL.
Athena is serverless, so there is no
infrastructure to manage, and you pay
only for the queries that you run.
AWS Athena
AWS Athena
The Solution
• AWS S3 - persisting streaming data
• Schema definition - Apache Hive [1] DDL for describing schemas
• Query language - Presto [2] (ANSI-compatible) SQL for querying
[1] Apache Hive - distributed data warehouse software
[2] Presto - distributed SQL query engine
Building blocks
AWS Athena
AWS S3
• JSON
• CSV
• Avro
• Parquet
• ORC
• Apache Web Server logs
• Logstash Grok
• CloudTrail
Supported formats
AWS S3
• Snappy
• Zlib
• GZIP
• LZO
Supported compression
AWS S3
S3?
Non-AWS pipelines:
• Kafka -> Kafka Connect, Secor
• ActiveMQ, RabbitMQ, etc. -> Camel
• ??? -> Custom Consumer
Streaming data to S3
AWS S3
AWS pipelines:
• Lambda <- SQS
• Kinesis Stream -> Lambda -> Kinesis Firehose
Kinesis Pipeline
AWS S3
1. Kinesis Stream as an input
2. Lambda to forward to Firehose and transform (optional)
3. Kinesis Firehose as a buffer (flushed by size or time), handling compression and, optionally, another transformation (using Lambda)
Schema definition
Apache Hive Data Definition Language (DDL) is used for describing tables and databases:
Schema definition
ALTER DATABASE SET DBPROPERTIES
ALTER TABLE ADD PARTITION
ALTER TABLE DROP PARTITION
ALTER TABLE RENAME PARTITION
ALTER TABLE SET LOCATION
ALTER TABLE SET TBLPROPERTIES
CREATE DATABASE
CREATE TABLE
DESCRIBE TABLE
DROP DATABASE
DROP TABLE
MSCK REPAIR TABLE
SHOW COLUMNS
SHOW CREATE TABLE
SHOW DATABASES
SHOW PARTITIONS
SHOW TABLES
SHOW TBLPROPERTIES
VALUES
CREATE EXTERNAL TABLE table_name (
id STRING,
data STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://bucket-name/'
tblproperties ("parquet.compress"="SNAPPY")
Schema definition
TINYINT, SMALLINT, INT, BIGINT
BOOLEAN
DOUBLE, DECIMAL
STRING
BINARY
TIMESTAMP
DATE (not supported for Parquet)
VARCHAR
ARRAY, MAP, STRUCT
Schema definition
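To make the complex types concrete, here is a minimal, hypothetical table definition (the table name, columns and bucket are made up, not part of the pipeline above) combining ARRAY, MAP and STRUCT with the same JSON SerDe used later in this deck:

CREATE EXTERNAL TABLE sessions (
  user_id STRING,
  tags ARRAY<STRING>,                         -- repeated values
  properties MAP<STRING, STRING>,             -- free-form key/value pairs
  device STRUCT<os: STRING, version: STRING>  -- nested record
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sessions-bucket/'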
{
"metadata": {
"client_id": "21353253123",
"timestamp": 1497996200,
"category_id": "1"
},
"payload": "g3ng0943g93490gn3094"
}
Schema definition
CREATE EXTERNAL TABLE events (
metadata struct<client_id:string,
timestamp:timestamp,
category_id:string
>,
payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://events/'
Schema definition
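Nested fields from the struct above are then addressed with dot notation in queries - a minimal sketch (the filter value is only illustrative):

SELECT metadata.client_id, payload
FROM events
WHERE metadata.category_id = '1'
LIMIT 100;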
Query language
Presto SQL is used for querying data:
Query language
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] union_query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]
SELECT data FROM events WHERE headers.user_id = 123 ORDER BY headers.timestamp LIMIT 10;

SELECT os, COUNT(*) count FROM cloudfront_logs
WHERE date BETWEEN date '2014-07-05' AND date '2014-08-05'
GROUP BY os;

SELECT customer.c_name, lineitem.l_quantity, orders.o_totalprice
FROM lineitem, orders, customer
WHERE lineitem.l_orderkey = orders.o_orderkey
  AND customer.c_custkey = orders.o_custkey;
Query language
Best practices
Good performance → Low cost!
Why?
Best practices
• Find a good partitioning field, like a date, version, user, etc.
• Update Athena with the partitioning schema (use PARTITIONED BY in DDL) and metadata
• You can create partitions manually or let Athena handle them (the latter requires a specific S3 key structure)
• But there is no magic! You have to use partitioning fields in queries (like regular fields), otherwise no partition pruning is applied
Partitioning
Best practices
CREATE EXTERNAL TABLE events …
PARTITIONED BY (year string, month string, day string)
1) SELECT data FROM events WHERE event_id = '98632765';
2) SELECT data FROM events WHERE event_id = '98632765' AND year = '2017' AND month = '06' AND day = '21';
Partitioning
Best practices
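The "…" above is kept as on the slide; purely for illustration (the column names are assumptions), a complete partitioned definition could look roughly like this - note that partition columns must not be repeated in the main column list:

CREATE EXTERNAL TABLE events (
  event_id STRING,   -- hypothetical columns
  data STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://events/'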
• s3://events/2017/06/20/1.parquet
• s3://events/2017/06/20/2.parquet
• s3://events/2017/06/20/3.parquet
• s3://events/2017/06/21/1.parquet
• s3://events/2017/06/21/2.parquet
• s3://events/2017/06/21/3.parquet
• s3://events/2017/06/22/1.parquet
• …
Manual partitioning:
• ALTER TABLE events ADD PARTITION (date='2017-06-20') LOCATION 's3://events/2017/06/20/'
• ...
Partitioning
Best practices
• s3://events/date=2017-06-20/1.parquet
• s3://events/date=2017-06-20/2.parquet
• s3://events/date=2017-06-20/3.parquet
• s3://events/date=2017-06-21/1.parquet
• s3://events/date=2017-06-21/2.parquet
• s3://events/date=2017-06-21/3.parquet
• s3://events/date=2017-06-22/1.parquet
• …
Automatic partitioning:
• MSCK REPAIR TABLE events
Partitioning
Best practices
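As a quick sanity check (a sketch, assuming the date-partitioned events table above), you can list what Athena has registered and confirm that a query prunes to a single partition:

SHOW PARTITIONS events;
-- e.g. date=2017-06-20, date=2017-06-21, ...

SELECT data FROM events WHERE date = '2017-06-20' LIMIT 10;   -- scans only one partition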
• Use binary formats like Parquet!
• Don’t forget about compression
• Only include the columns that you need
• LIMIT is amazing!
• For more SQL optimizations, look at Presto best practices (example below)
• Avoid lots of small files (see “The dilemma” below):
Performance tips
Best practices
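A small example of the kind of Presto-level optimization meant above (a sketch, assuming a partitioned events table with the metadata struct; approx_distinct is a standard Presto function):

-- select only the needed columns, prune by partition, use an approximate distinct count
SELECT metadata.category_id AS category,
       approx_distinct(metadata.client_id) AS unique_clients
FROM events
WHERE year = '2017' AND month = '06'
GROUP BY 1;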
The dilemma
Three things compete: volume of data, number and size of files (buffering), and time to index.
For a given volume of data you want as few files as possible, each as large as possible, appearing in S3 as soon as possible. That’s really hard - you have to give up something.
Possible solutions?
• Don’t give up anything! Run two separate pipelines: one with long retention (bigger files) and one with short retention (smaller files, fast time to index). Cons? Double the storage.
• Give up on the number and size of files, but periodically merge small files in the background (see the sketch below). Cons? Lots of moving parts and slower queries against fresh data.
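One way to implement that background merge today (a sketch using Athena’s CREATE TABLE AS SELECT, which is not part of the pipeline described in this talk; the table name and target bucket are hypothetical) is to rewrite a day of small files into fewer, larger Parquet files:

CREATE TABLE events_2017_06_20_merged
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://events-merged/2017/06/20/'   -- hypothetical target location
) AS
SELECT * FROM events
WHERE year = '2017' AND month = '06' AND day = '20';

Once the merged location is registered as a partition (or exposed as a separate table), queries against older data touch far fewer, larger files.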
Demo
Amazon Product Reviews
http://jmcauley.ucsd.edu/data/amazon/
• AWS Athena is great, right?!
• Think about the file structure, formats, compression, etc.
• Streaming data to S3 is probably the hardest task
• Don’t forget to optimize - use partitioning, look at Presto SQL optimization tricks, etc.
• Good performance means low cost
Summary
Questions?
@sap1ens
https://www.demonware.net/careers