© 2015 Nasdaq, Inc. All rights reserved.
“Nasdaq” and the Nasdaq logo are the trademarks of Nasdaq, Inc. and its affiliates in the U.S. and other countries.
“Amazon” and the Amazon Web Services logo are the trademarks of Amazon Web Services, Inc. or its affiliates in the U.S. and other countries
Nate Sammons, Principal Architect, Nasdaq, Inc.
October 2015
BDT314
Running a Big Data and Analytics
Application on Amazon EMR and Amazon
Redshift with a Focus on Security
• Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands
• More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes
• Nasdaq technology is used to power more than 100 marketplaces in 50 countries
• Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds
• We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes & geographies
What to Expect from the Session
• Motivations for extending an Amazon Redshift
warehouse with Amazon EMR
• How our data ingest workflow operates
• How to query encrypted data in Amazon S3 using Presto
and other Hadoop-ecosystem tools
• How to manage schemas and data migrations
• Future direction for our data warehouse
Current State
Amazon Redshift as Nasdaq’s Main Data Warehouse
• Transitioned from an on-premises warehouse to
Amazon Redshift
• Over 1,000 tables migrated
• More data sources added as needed
• Nearly two years of data
• Average daily ingest of over 7B rows
Never Throw Anything Away
• 23 node ds2.8xlarge Amazon Redshift cluster
• 828 vCPU, 5.48 TB of RAM
• 368 TB of DB storage capacity, over 1 PB of local disk!
• 92 GB/sec aggregate disk I/O
• Resize once per quarter
• 2.7 trillion rows: 1.8T from sources, 900B derived
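The cluster totals above can be reproduced from per-node figures. A minimal sketch, assuming the published ds2.8xlarge node specification of 36 vCPUs, 244 GiB of RAM, and 16 TB of storage; those per-node values are not stated on the slide:

```python
# Per-node figures for ds2.8xlarge are assumptions taken from the published
# node specification; they do not appear in the slides.
NODES = 23
VCPU_PER_NODE = 36
RAM_GIB_PER_NODE = 244
STORAGE_TB_PER_NODE = 16


def cluster_totals(nodes: int = NODES) -> dict:
    """Return aggregate vCPU count, RAM (TiB), and DB storage (TB)."""
    return {
        "vcpu": nodes * VCPU_PER_NODE,
        "ram_tib": round(nodes * RAM_GIB_PER_NODE / 1024, 2),
        "storage_tb": nodes * STORAGE_TB_PER_NODE,
    }


if __name__ == "__main__":
    print(cluster_totals())
```

Under these assumptions the totals come out to 828 vCPUs, 5.48 TiB of RAM, and 368 TB of storage, matching the slide.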
Many Data Sources
• Internal DBs, CSV files, stream captures, etc.
• Data from all 7 exchanges operated by Nasdaq
• Orders, quotes, trade executions
• Market “tick” data
• Security master
• Membership
• All highly structured and consistent row-oriented data
Data Corollary to the Ideal Gas Law
Motivations for Extending to Amazon EMR and
Amazon S3
• Resizing a 300+ TB Amazon Redshift cluster isn’t
instantaneous
• Continuing to grow the cluster is expensive
• Paying for CPU and disk to support infrequently accessed
data doesn’t make sense
• Data will expand to fill any container
Extending Our Warehouse
Goals
• Build a secure, cost effective, long-term data store
• Provide a SQL interface to all data
• Support new MPP analytics workloads (Spark, ML, etc.)
• Cap the size of our Amazon Redshift cluster
• Manage storage and compute resources separately
High Level Overview
Amazon Redshift’s Continuing Role
• All data lands in Amazon Redshift first
• Amazon Redshift clients have strict SLAs on data availability
• Must ensure data loads are finished quickly
• Aggregations and transformations performed in SQL
• SQL is easy and we have a lot of SQL expertise
• Transformed data is then unloaded to Amazon S3 for conversion
Decouple Storage and Compute Resources
Scale each independently as needed, run multiple different
apps on top of a common storage system
Especially for old, infrequently accessed data, no need to
run compute 24/7 to support it; we can keep data “forever”
Access needs drop off dramatically over time
• Yesterday >> last month >> last quarter >> last year
Account Structure and Cost Allocations
• Separate AWS accounts for each client / department
• Departments can run as much or as little compute as
they need; use different query tools, experiments
• No competition for compute resources across clients
• Amazon S3 costs are shared, compute costs are passed
through to each department
Data Ingest Workflow
Data Ingest Overview
Nasdaq Workflow Engine
• MySQL-backed workflow engine developed in-house
• Orchestrates over 40K operations daily
• Flexible scheduling and dependency management
• Ops GUI for retrying failed steps, root cause analysis
• Moving to Amazon Aurora + Amazon EC2 in 2016
• Clustered operation using Amazon S3 as temp storage
space
Amazon Redshift Data Ingest Workflow
• Data is pulled from various sources
• Validate data, convert to CSVs + manifest
• Store compressed, encrypted data in Amazon S3 temp
space
• Load into Amazon Redshift using COPY SQL statements
• Further transformation performed using SQL
• UNLOAD transformed data back to Amazon S3
• Notifications to other systems using Amazon SQS
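The load and unload steps above can be sketched as SQL generated from Python. This is an illustration only, not the Nasdaq workflow engine: the function names are hypothetical, authorization through IAM_ROLE is an assumption, and the encryption options the real workflow would need are left out.

```python
import json


def build_manifest(urls):
    """Build a Redshift COPY manifest listing every file that must load."""
    return json.dumps(
        {"entries": [{"url": u, "mandatory": True} for u in urls]}, indent=2
    )


def copy_statement(table, manifest_url, iam_role):
    """COPY the gzip-compressed CSV files named in a manifest into a table."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' MANIFEST CSV GZIP;"
    )


def unload_statement(query, s3_prefix, iam_role):
    """UNLOAD a query result to S3; single quotes in the query are doubled."""
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' MANIFEST GZIP;"
    )


if __name__ == "__main__":
    # Bucket, table, and role names below are placeholders.
    print(build_manifest(["s3://example-temp/trades/part-0000.csv.gz"]))
    print(copy_statement("trades", "s3://example-temp/trades/manifest",
                         "arn:aws:iam::123456789012:role/example-load"))
```

Marking every manifest entry as mandatory makes the COPY fail if a file is missing, which suits a workflow that validates data before loading.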
Amazon EMR / Amazon S3 Data Ingest Workflow
• Automatically executed after Amazon Redshift loads and
transformations complete
• Uses Amazon Redshift schema metadata and manifest file
to drive conversions to Parquet
• Detects schema changes and bumps Hive schema version
• Alters schema in Hive Metastore to add new tables,
partitions as needed
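Schema change detection and the version bump can be sketched as a comparison of column lists. This is a minimal illustration with hypothetical function names; the slides do not describe how the comparison is actually implemented.

```python
def detect_schema_change(previous, current):
    """Compare two lists of (column_name, type) pairs.

    Returns the names of added, removed, and retyped columns.
    """
    prev, curr = dict(previous), dict(current)
    return {
        "added": [c for c in curr if c not in prev],
        "removed": [c for c in prev if c not in curr],
        "retyped": [c for c in curr if c in prev and curr[c] != prev[c]],
    }


def next_version(version, previous, current):
    """Bump the Hive schema version only when the schema actually changed."""
    diff = detect_schema_change(previous, current)
    return version + 1 if any(diff.values()) else version


if __name__ == "__main__":
    old = [("id", "bigint"), ("px", "double")]
    new = old + [("venue", "varchar")]
    print(detect_schema_change(old, new), next_version(3, old, new))
```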
Data Security and Encryption
VPC or Nothing
• Security is our #1 priority at all times
• All instances run in a VPC
• Locked down security groups, network ACLs, etc.
• Least-privilege IAM roles for each app and human
• See SEC302 – IAM Best Practices from Anders
• EC2 instance roles in Amazon EMR
• VPC endpoint for Amazon S3
• 10 Gbps private AWS Direct Connect circuits into AWS
Encryption Key Management
• On-premises SafeNet Luna HSM cluster for key storage
• Amazon Redshift is directly integrated with our HSMs
• Nasdaq KMS:
• Internally known as “Vinz Clortho”
• Roots encryption keys in the HSM cluster
• Allows us full control over where keys are stored, used
Transparent Encryption in Amazon S3 and EMRFS
Amazon S3 SDK EncryptionMaterialsProvider
interface:
• Adapter to retrieve keys from our KMS
• Used when reading or writing data in Amazon S3
• User metadata to encode encryption key tokens
Encryption Performance with Amazon S3
• Roughly 25% slower than unencrypted
• Seek within an encrypted object works:
• Critical for performance
• Handled automatically
• Seeks are relative to the unencrypted size
• Create a new HTTP request at an offset within the object
• Encryption offset work is handled in the AWS SDK itself
• Worst case, we must read two extra blocks of AES data
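The offset arithmetic behind a seek can be sketched as follows. This is not the AWS SDK's code. It assumes AES in CBC mode with no header in front of the ciphertext, where decrypting a block needs the preceding ciphertext block, so the request is widened to whole blocks plus one block on each side.

```python
AES_BLOCK = 16  # AES block size in bytes


def ciphertext_range(start, end):
    """Map an inclusive plaintext byte range to the inclusive ciphertext
    byte range to request over HTTP.

    Assumes AES-CBC: the block before the first wanted block serves as its
    IV, and one trailing block is read as well. The caller is expected to
    clamp the upper bound to the object's size.
    """
    first_block = start // AES_BLOCK
    last_block = end // AES_BLOCK
    lower = max(first_block - 1, 0) * AES_BLOCK
    upper = (last_block + 2) * AES_BLOCK - 1
    return lower, upper


if __name__ == "__main__":
    print(ciphertext_range(100, 199))
```

For the plaintext range 100 to 199, the blocks that hold the data span bytes 96 to 207; the request covers 80 to 223, which is two extra 16-byte blocks.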
Local disk encryption with Amazon EMR
• Bootstrap action to encrypt ephemeral disks
• Specifically to encrypt Presto’s local temp storage
• Standard Linux LUKS configuration
• Integrated with the Nasdaq KMS
• Retrieves key and mounts disks on startup using init.d
SELinux on Amazon EMR
• Bootstrap action to install SELinux packages
• Adds kernel command line arguments
• Rebuilds initrd image
• Reboots the node and re-labels the filesystem
• Increases cluster boot time
• Currently only working on Amazon EMR 3.8
• Working to refine SELinux policy files for Presto
Presto on Amazon EMR
What is Presto?
• https://prestodb.io
• Open-source MPP SQL query engine from Facebook
• Flexible data sources through Connector API
• JDBC, ODBC drivers
• Nice GUI from Airbnb: http://nerds.airbnb.com/airpal/
• Hive Connector:
• Table schemas defined in a Hive Metastore as external tables
• Data files stored in Amazon S3
Presto Overview
Running Presto on Amazon EMR
• Bootstrap action to download and install Java 8 & Presto
• Based on the Amazon EMR team’s Presto BA
• Adds support for custom encryption materials provider jars
• Configures Presto to use a remote Hive Metastore
• Currently using Amazon EMR 3.8, working towards 4.0
Data Encryption in Presto
• Presto doesn’t use EMRFS for access to Amazon S3
• We added support for Amazon S3
EncryptionMaterialsProvider to PrestoS3FileSystem.java
• Code available at github.com/nasdaq
• Working with Facebook to integrate these changes
Data Storage Formats
File Formats: Parquet vs. ORC
The two most widely used structured data file formats:
• Compressed, columnar record storage
• Structured, schema-validated data
• Supported by a variety of Hadoop-ecosystem apps
• Arbitrary user metadata encoded at the file level
ORC
Pros:
• DATE and TIMESTAMP type support in Hive, Presto
Cons:
• Rigid column ordering requirements
• Clunky Java API
• Unacceptable performance when encrypted in
Amazon S3
• 15-18x slower during our testing (!)
The Winner: Parquet
• Wide project support: Presto, Spark, Drill, etc.
• Actively developed project
• Adoption increasing
• Column referenced by name instead of position
• Set hive.parquet.use-column-names=true in Presto config
• Good performance when encrypted (~27% slower)
• Clean Java API
Parquet Schema Workarounds
DATE not supported in Hive or Presto
• Instead, convert DATEs to INTs
• 2015-10-08 becomes 20151008
• Timestamps become a BIGINT (64bit integer in Hive)
• For nanosecond resolution records, we use a DATE and
a separate nanos-since-midnight column
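The DATE workaround can be sketched as a pair of conversions. The function names are hypothetical, and the BIGINT encoding for timestamps is not shown because the slide does not state its unit.

```python
from datetime import date, time


def date_to_int(d):
    """Encode a date as a yyyymmdd integer, e.g. 2015-10-08 -> 20151008."""
    return d.year * 10000 + d.month * 100 + d.day


def int_to_date(n):
    """Decode a yyyymmdd integer back into a date."""
    return date(n // 10000, (n // 100) % 100, n % 100)


def nanos_since_midnight(t, extra_nanos=0):
    """Nanoseconds since midnight for a time of day.

    Python's time type stops at microseconds, so any sub-microsecond
    digits are passed separately in extra_nanos.
    """
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds * 1_000_000_000 + t.microsecond * 1000 + extra_nanos


if __name__ == "__main__":
    print(date_to_int(date(2015, 10, 8)))
    print(nanos_since_midnight(time(9, 30, 0, 1), extra_nanos=5))
```

An integer in yyyymmdd form sorts in date order and stays readable in query results, which is convenient for a partition column.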
Schema and Data Management
Hive Metastore
• Amazon EMR 4.0 cluster for the Metastore
• Easier for remote access from Presto
• Reachable through VPC peering with client accounts
• The “source of truth” for Hive schemas
• Metastore DB on Amazon RDS for MySQL
• Easy backups, encrypted storage
• Data ingest system creates/alters tables
• Alters tables to add new data partitions each day
• Detects newly changed schemas
Managing Versioned, Partitioned Tables in S3
• Store versions of a table in directories in Amazon S3:
s3://schema/table/version/date=YYYYMMDD/*.parquet
Works with “msck repair table” commands
• When a schema change is detected, increment the
version. New data is written to the new location, and alerts
are generated so that humans can review the changes.
• Data is migrated in Amazon S3 and old versions are
kept for now
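The directory layout above can be sketched as a path builder, together with the statement that registers one day's partition. The function names are hypothetical, the path pattern is taken literally from the slide, and the way version numbers are rendered in the path is an assumption.

```python
from datetime import date


def partition_prefix(schema, table, version, day):
    """Build the S3 prefix for one daily partition of a versioned table."""
    return f"s3://{schema}/{table}/{version}/date={day:%Y%m%d}/"


def add_partition_sql(table, day, location):
    """Hive statement that registers one daily partition at a location."""
    return (
        f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
        f"PARTITION (`date`={day:%Y%m%d}) LOCATION '{location}';"
    )


if __name__ == "__main__":
    day = date(2015, 10, 8)
    prefix = partition_prefix("equities", "trades", 2, day)
    print(prefix)
    print(add_partition_sql("trades", day, prefix))
```

Because the directory name follows the `date=YYYYMMDD` convention, the same layout can also be picked up by “msck repair table” without explicit ALTER statements.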
Logical vs. Physical Schemas
• Track a “logical” and “physical” schema for each table
• Logical is compared with Amazon Redshift to detect changes
• Physical schema used to produce Hive DDL for Presto
• Schema definitions stored in MySQL
• Version management and change detection
• Amazon S3 location for each table
• Tools to export these schemas as .sql files
• Hive schema and table create statements
• “msck repair table” scripts
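Producing Hive DDL from a physical schema can be sketched as below. This is an illustration with hypothetical function names, not the export tooling described above. It assumes an INT partition column named `date` and lower-cases and backtick-quotes column names.

```python
def hive_create_table(schema, table, columns, location, partition_col="date"):
    """Render CREATE EXTERNAL TABLE DDL for a Parquet table stored in S3.

    columns is a list of (name, hive_type) pairs from the physical schema.
    """
    cols = ",\n  ".join(f"`{name.lower()}` {htype}" for name, htype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{schema}`.`{table}` (\n"
        f"  {cols}\n"
        f")\n"
        f"PARTITIONED BY (`{partition_col}` INT)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


def msck_repair(schema, table):
    """Render the script that asks Hive to discover partitions in S3."""
    return f"USE `{schema}`;\nMSCK REPAIR TABLE `{table}`;"


if __name__ == "__main__":
    print(hive_create_table(
        "equities", "trades",
        [("TradeId", "BIGINT"), ("Px", "DOUBLE")],
        "s3://equities/trades/2/",
    ))
    print(msck_repair("equities", "trades"))
```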
File-level Metadata
We encode information in file-level metadata:
• Partition column definition
• Time zone in which the file was parsed
• Current & original schema name and version number
• Column data type adjustments (DATE -> INT, etc.)
Allows us to always recreate logical schema representations
from physical files, re-migrate files if a data migration step
had a bug, etc.
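How file-level metadata lets the logical schema be rebuilt can be sketched with plain dictionaries. The key names and the shape of the type-adjustment record are hypothetical; the slides list which facts are stored, not how they are encoded.

```python
import json


def build_file_metadata(partition_col, time_zone, schema_name, version,
                        original_version, type_adjustments):
    """Flatten bookkeeping fields into string key/value pairs, the form
    that file-level key-value metadata takes."""
    return {
        "partition_column": partition_col,
        "time_zone": time_zone,
        "schema_name": schema_name,
        "schema_version": str(version),
        "original_schema_version": str(original_version),
        "type_adjustments": json.dumps(type_adjustments, sort_keys=True),
    }


def logical_types(physical_columns, metadata):
    """Undo the recorded type adjustments to recover logical column types."""
    adjustments = json.loads(metadata["type_adjustments"])
    return [
        (name, adjustments.get(name, {}).get("logical", ptype))
        for name, ptype in physical_columns
    ]


if __name__ == "__main__":
    meta = build_file_metadata(
        "date", "America/New_York", "equities", 2, 1,
        {"trade_date": {"logical": "DATE", "physical": "INT"}},
    )
    print(logical_types([("trade_date", "INT"), ("px", "DOUBLE")], meta))
```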
Table Partitioning and Data Management
• Partition Hive tables by date
• We have mostly timeseries data and are on a daily cadence
• Partitioning helps query performance
• Use `backticks` when defining column names in SQL
• Column names must be lower case in Parquet
• Correct bad data in Amazon Redshift through SQL, then
UNLOAD partitions for encoding to Parquet
• Our tools and automation make it easy to replace
modified partitions of data in Hive tables
Working with data in S3 and Amazon Redshift
Custom tools developed to make life easier:
• Extract CSV data from various DBs, or UNLOAD from
Amazon Redshift in whole or in segments
• Encode CSVs as Parquet files using a Hive schema
• Write data into the correct directory structure in
Amazon S3
• Allows us to move data between Amazon Redshift and
Amazon S3 easily, and in bulk
Custom Parquet Data Migration Tools
• Read records from previous version of a table
• Reads from the old location in Amazon S3
• Write records using the current version of a table
• Writes to the new location in Amazon S3
• Most migrations are trivial:
• Add new column with some default value (or null)
• Rename columns
• More complicated migrations require Java code
• Track original and current version in file metadata
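The trivial migrations above (rename a column, add a column with a default) can be sketched on records held as dictionaries. The function names are hypothetical, and reading and writing the Parquet files themselves is out of scope here.

```python
def migrate_record(record, renames=None, defaults=None):
    """Apply a trivial migration to one record (column name -> value).

    renames maps old column names to new ones; defaults maps newly added
    columns to their default value (None stands for null).
    """
    renames = renames or {}
    defaults = defaults or {}
    migrated = {renames.get(col, col): val for col, val in record.items()}
    for col, default in defaults.items():
        migrated.setdefault(col, default)
    return migrated


def migrate(records, renames=None, defaults=None):
    """Migrate every record read from the previous version of a table."""
    return [migrate_record(r, renames, defaults) for r in records]


if __name__ == "__main__":
    old_rows = [{"px": 1.5, "qty": 10}]
    print(migrate(old_rows, renames={"px": "price"}, defaults={"venue": None}))
```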
Review & Future Enhancements
Review
• Motivations for extending an Amazon Redshift
warehouse with Amazon EMR
• How our data ingest system operates
• How to query encrypted data in Amazon S3 using Presto
and other Hadoop-ecosystem tools
• How to manage schemas and data migrations
Lessons Learned: TL;DR
• Manage storage and compute separately
• It’s OK to be paranoid about data loss!
• Amazon S3 encryption is easy and seek() works
• Parquet vs. ORC
• Partition and version your tables
• Manage logical and physical table schemas
• Data management tools & automation are important
Future Enhancements
• Archive original source data for SEC Rule 17a-4 compliance
(using Amazon Glacier Vault Lock)
• Decouple data retrieval and processing tasks
• Move ingest processing to Amazon EC2/Amazon ECS
• Move workflow engine DB to Amazon Aurora
• Leveraging other query frameworks: Spark, ML, etc.
• Near real-time streaming ingest
• More data sources
Related Sessions
Remember to complete
your evaluations!
Thank you!

Amazon Relational Database Service (Amazon RDS)
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)
Amazon Redshift 與 Amazon Redshift Spectrum 幫您建立現代化資料倉儲 (Level 300)
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Best of re:Invent
Best of re:InventBest of re:Invent
Best of re:Invent
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Running Big Data Apps on Amazon EMR and Redshift

  • 1. Running a Big Data and Analytics Application on Amazon EMR and Amazon Redshift with a Focus on Security (BDT314) • Nate Sammons, Principal Architect, Nasdaq, Inc. • October 2015 • © 2015 Nasdaq, Inc. All rights reserved. “Nasdaq” and the Nasdaq logo are the trademarks of Nasdaq, Inc. and its affiliates in the U.S. and other countries. “Amazon” and the Amazon Web Services logo are the trademarks of Amazon Web Services, Inc. or its affiliates in the U.S. and other countries.
  • 2. Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands • More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes • Nasdaq technology is used to power more than 100 marketplaces in 50 countries • Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds • We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes & geographies
  • 3. What to Expect from the Session • Motivations for extending an Amazon Redshift warehouse with Amazon EMR • How our data ingest workflow operates • How to query encrypted data in Amazon S3 using Presto and other Hadoop-ecosystem tools • How to manage schemas and data migrations • Future direction for our data warehouse
  • 5. Amazon Redshift as Nasdaq’s Main Data Warehouse • Transitioned from an on-premises warehouse to Amazon Redshift • Over 1,000 tables migrated • More data sources added as needed • Nearly two years of data • Average daily ingest of over 7B rows
  • 6. Never Throw Anything Away • 23 node ds2.8xlarge Amazon Redshift cluster • 828 vCPU, 5.48 TB of RAM • 368 TB of DB storage capacity, over 1PB of local disk! • 92 GB/sec aggregate disk I/O • Resize once per quarter • 2.7 trillion rows: 1.8T from sources, 900B derived
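The cluster totals on this slide are consistent with a fixed per-node size. As a quick sanity check, the sketch below divides the slide's totals by the 23 nodes and adds up the row counts. The per-node values it prints are derived from the slide's own numbers, not quoted from AWS documentation.

```python
# Per-node figures derived by dividing the slide's cluster totals by the node count.
NODES = 23
totals = {"vcpu": 828, "storage_tb": 368, "disk_io_gb_per_sec": 92}

per_node = {name: total / NODES for name, total in totals.items()}
print(per_node)

# Row counts quoted on the slide, expressed in billions of rows.
source_rows_b = 1800
derived_rows_b = 900
total_rows_b = source_rows_b + derived_rows_b
print(total_rows_b / 1000, "trillion rows")
```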
  • 7. Many Data Sources • Internal DBs, CSV files, stream captures, etc. • Data from all 7 exchanges operated by Nasdaq • Orders, quotes, trade executions • Market “tick” data • Security master • Membership • All highly structured and consistent row-oriented data
  • 8. Data Corollary to the Ideal Gas Law
  • 9. Motivations for Extending to Amazon EMR and Amazon S3 • Resizing a 300+ TB Amazon Redshift cluster isn’t instantaneous • Continuing to grow the cluster is expensive • Paying for CPU and disk to support infrequently accessed data doesn’t make sense • Data will expand to fill any container
  • 11. Goals • Build a secure, cost-effective, long-term data store • Provide a SQL interface to all data • Support new MPP analytics workloads (Spark, ML, etc.) • Cap the size of our Amazon Redshift cluster • Manage storage and compute resources separately
  • 13. Amazon Redshift’s Continuing Role • All data lands in Amazon Redshift first • Amazon Redshift clients have strict SLAs on data availability • Must ensure data loads are finished quickly • Aggregations and transformations performed in SQL • SQL is easy and we have a lot of SQL expertise • Transformed data is then unloaded to Amazon S3 for conversion
  • 14. Decouple Storage and Compute Resources • Scale each independently as needed; run multiple different apps on top of a common storage system • Especially for old, infrequently accessed data, there is no need to run compute 24/7 to support it; we can keep data “forever” • Access needs drop off dramatically over time: yesterday >> last month >> last quarter >> last year
  • 15. Account Structure and Cost Allocations • Separate AWS accounts for each client / department • Departments can run as much or as little compute as they need; use different query tools, experiments • No competition for compute resources across clients • Amazon S3 costs are shared, compute costs are passed through to each department
  • 18. Nasdaq Workflow Engine • MySQL-backed workflow engine developed in-house • Orchestrates over 40K operations daily • Flexible scheduling and dependency management • Ops GUI for retrying failed steps, root cause analysis • Moving to Amazon Aurora + Amazon EC2 in 2016 • Clustered operation using Amazon S3 as temp storage space
  • 19. Amazon Redshift Data Ingest Workflow • Data is pulled from various sources • Validate data, convert to CSVs + manifest • Store compressed, encrypted data in Amazon S3 temp space • Load into Amazon Redshift using COPY SQL statements • Further transformation performed using SQL • UNLOAD transformed data back to Amazon S3 • Notifications to other systems using Amazon SQS
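The manifest, COPY, and UNLOAD steps of this workflow can be sketched as a few string builders. This is a minimal illustration, not Nasdaq's tooling: the function names, table names, bucket paths, and IAM role are hypothetical, and the encryption options are left out because they depend on how keys are managed.

```python
import json


def build_manifest(urls):
    """Return a Redshift COPY manifest (JSON text) listing the given S3 objects."""
    entries = [{"url": url, "mandatory": True} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_sql(table, manifest_url, iam_role):
    """Build a COPY statement that loads gzip-compressed CSV files named in a manifest."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        "MANIFEST CSV GZIP;"
    )


def build_unload_sql(query, s3_prefix, iam_role):
    """Build an UNLOAD statement; single quotes inside the query are doubled."""
    escaped = query.replace("'", "''")
    return f"UNLOAD ('{escaped}') TO '{s3_prefix}' IAM_ROLE '{iam_role}' GZIP;"


if __name__ == "__main__":
    # All names below are placeholders for illustration only.
    role = "arn:aws:iam::123456789012:role/example-load-role"
    print(build_manifest(["s3://example-temp/trades/part-0000.csv.gz"]))
    print(build_copy_sql("staging.trades", "s3://example-temp/trades/manifest.json", role))
    print(build_unload_sql("select * from staging.trades where venue = 'X'",
                           "s3://example-temp/unload/trades_", role))
```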
  • 20. Amazon EMR / Amazon S3 Data Ingest Workflow • Automatically executed after Amazon Redshift loads and transformations complete • Uses Amazon Redshift schema metadata and manifest file to drive conversions to Parquet • Detects schema changes and bumps Hive schema version • Alters schema in Hive Metastore to add new tables, partitions as needed
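Detecting a schema change and bumping the version can be sketched as a comparison of two column lists. This is an illustrative sketch only: the function names and the (name, type) representation are assumptions, and the slides do not describe how the real system compares schemas.

```python
def detect_schema_change(known_columns, current_columns):
    """Compare two lists of (name, type) pairs and describe the differences."""
    known = dict(known_columns)
    current = dict(current_columns)
    added = [name for name in current if name not in known]
    removed = [name for name in known if name not in current]
    retyped = [name for name in current
               if name in known and known[name] != current[name]]
    return {"added": added, "removed": removed, "retyped": retyped}


def next_version(version, change):
    """Bump the schema version only when something actually changed."""
    return version + 1 if any(change.values()) else version


if __name__ == "__main__":
    old = [("symbol", "VARCHAR"), ("price", "DOUBLE")]
    new = [("symbol", "VARCHAR"), ("price", "DOUBLE"), ("venue", "VARCHAR")]
    change = detect_schema_change(old, new)
    print(change, next_version(1, change))
```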
  • 21. Data Security and Encryption
  • 22. VPC or Nothing • Security is our #1 priority at all times • All instances run in a VPC • Locked down security groups, network ACLs, etc. • Least-privilege IAM roles for each app and human • See SEC302 – IAM Best Practices from Anders • EC2 instance roles in Amazon EMR • VPC endpoint for Amazon S3 • 10 G private AWS Direct Connect circuits into AWS
  • 23. Encryption Key Management • On-premises SafeNet Luna HSM cluster for key storage • Amazon Redshift is directly integrated with our HSMs • Nasdaq KMS: internally known as “Vinz Clortho”; roots encryption keys in the HSM cluster; allows us full control over where keys are stored and used
  • 24. Transparent Encryption in Amazon S3 and EMRFS • Amazon S3 SDK EncryptionMaterialsProvider interface: • Adapter to retrieve keys from our KMS • Used when reading or writing data in Amazon S3 • User metadata to encode encryption key tokens
  • 25. Encryption Performance with Amazon S3 • Roughly 25% slower than unencrypted • Seek within an encrypted object works: • Critical for performance • Handled automatically • Seeks are relative to the unencrypted size • Create a new HTTP request at an offset within the object • Encryption offset work is handled in the AWS SDK itself • Worst case, we must read two extra blocks of AES data
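To make the seek behaviour concrete, the sketch below shows how a plaintext offset could be widened to cipher-block boundaries before issuing a ranged HTTP request. It assumes AES in CBC mode with 16-byte blocks, where the preceding ciphertext block acts as the IV; the AWS SDK's actual range adjustment may differ, and the function names are hypothetical.

```python
AES_BLOCK = 16  # AES block size in bytes


def ciphertext_range(offset, length):
    """Return (start, end) ciphertext byte offsets, end exclusive, covering the request."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first_block = offset // AES_BLOCK
    last_block = (offset + length - 1) // AES_BLOCK
    # In CBC mode the previous ciphertext block is the IV for the first wanted block.
    # Block 0 has no predecessor; its IV is stored with the object instead.
    start_block = max(first_block - 1, 0)
    return start_block * AES_BLOCK, (last_block + 1) * AES_BLOCK


def range_header(offset, length):
    """Format the widened range as an HTTP Range header value (inclusive end)."""
    start, end = ciphertext_range(offset, length)
    return f"bytes={start}-{end - 1}"


if __name__ == "__main__":
    print(ciphertext_range(100, 10))
    print(range_header(100, 10))
```

In this sketch the request grows by at most one leading block plus the partial blocks at either end, which is why the overhead stays small regardless of object size.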
  • 26. Local disk encryption with Amazon EMR • Bootstrap action to encrypt ephemeral disks • Specifically to encrypt Presto’s local temp storage • Standard Linux LUKS configuration • Integrated with the Nasdaq KMS • Retrieves key and mounts disks on startup using init.d
  • 27. SELinux on Amazon EMR • Bootstrap action to install SELinux packages • Adds kernel command line arguments • Rebuilds initrd image • Reboots the node and re-labels the filesystem • Increases cluster boot time • Currently only working on Amazon EMR 3.8 • Working to refine SELinux policy files for Presto
  • 29. What is Presto? • https://prestodb.io • Open Source MPP SQL database from Facebook • Flexible data sources through Connector API • JDBC, ODBC drivers • Nice GUI from Airbnb: http://nerds.airbnb.com/airpal/ • Hive Connector: • Table schemas defined in a Hive Metastore as external tables • Data files stored in Amazon S3
  • 31. Running Presto on Amazon EMR • Bootstrap action to download and install Java 8 & Presto • Based on the Amazon EMR team’s Presto bootstrap action • Adds support for custom encryption materials provider jars • Configures Presto to use a remote Hive Metastore • Currently using Amazon EMR 3.8, working towards 4.0
  • 32. Data Encryption in Presto • Presto doesn’t use EMRFS for access to Amazon S3 • We added support for Amazon S3 EncryptionMaterialsProvider to PrestoS3FileSystem.java • Code available at github.com/nasdaq • Working with Facebook to integrate these changes
  • 34. File Formats: Parquet vs. ORC • The two most widely used structured data file formats: • Compressed, columnar record storage • Structured, schema-validated data • Supported by a variety of Hadoop-ecosystem apps • Arbitrary user metadata encoded at the file level
  • 35. ORC • Pros: DATE and TIMESTAMP type support in Hive, Presto • Cons: rigid column ordering requirements; clunky Java API; unacceptable performance when encrypted in Amazon S3, 15-18x slower during our testing (!)
  • 36. The Winner: Parquet • Wide project support: Presto, Spark, Drill, etc. • Actively developed project • Adoption increasing • Column referenced by name instead of position • Set hive.parquet.use-column-names=true in Presto config • Good performance when encrypted (~27% slower) • Clean Java API
  • 37. Parquet Schema Workarounds • DATE is not supported in Hive or Presto • Instead, convert DATEs to INTs • 2015-10-08 becomes 20151008 • Timestamps become a BIGINT (64-bit integer in Hive) • For nanosecond-resolution records, we use a DATE and a separate nanos-since-midnight column
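The DATE-to-INT workaround and the nanos-since-midnight split can be sketched as follows. The function names are hypothetical; the BIGINT timestamp encoding is not shown because the slide does not state its unit.

```python
from datetime import date


def date_to_int(d):
    """Encode a date as a YYYYMMDD integer, e.g. 2015-10-08 -> 20151008."""
    return d.year * 10000 + d.month * 100 + d.day


def int_to_date(n):
    """Decode a YYYYMMDD integer back into a date."""
    return date(n // 10000, (n // 100) % 100, n % 100)


def split_nanos(d, hour, minute, second, nanos):
    """Return (date_int, nanos_since_midnight) for a nanosecond-resolution record."""
    seconds = (hour * 60 + minute) * 60 + second
    return date_to_int(d), seconds * 1_000_000_000 + nanos


if __name__ == "__main__":
    print(date_to_int(date(2015, 10, 8)))
    print(split_nanos(date(2015, 10, 8), 9, 30, 0, 5))
```

A side effect of the YYYYMMDD encoding is that integer order matches calendar order, so range filters on the INT column behave like date ranges.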
  • 38. Schema and Data Management
  • 39. Hive Metastore • Amazon EMR 4.0 cluster for the Metastore • Easier for remote access from Presto • Reachable through VPC peering with client accounts • The “source of truth” for Hive schemas • Metastore DB on Amazon RDS for MySQL • Easy backups, encrypted storage • Data ingest system creates/alters tables • Alters tables to add new data partitions each day • Detects newly changed schemas
  • 40. Managing Versioned, Partitioned Tables in S3 • Store versions of a table in directories in Amazon S3: s3://schema/table/version/date=YYYYMMDD/*.parquet • Works with “msck repair table” commands • When a schema change is detected, increment the version: new data is written to the new location, and alerts are generated so humans can determine what changed • Data is migrated in Amazon S3 and old versions are kept for now
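The directory layout above can be sketched as a small path builder. This follows the pattern on the slide literally; the function names, the example schema and table names, and the use of a plain integer for the version component are assumptions.

```python
from datetime import date


def partition_prefix(schema, table, version, day):
    """Build the S3 prefix for one daily partition of one table version."""
    return f"s3://{schema}/{table}/{version}/date={day:%Y%m%d}/"


def msck_repair_sql(schema, table):
    """Build the Hive statement that discovers partitions already written to storage."""
    return f"MSCK REPAIR TABLE {schema}.{table};"


if __name__ == "__main__":
    # "marketdata" and "trades" are placeholder names.
    print(partition_prefix("marketdata", "trades", 3, date(2015, 10, 8)))
    print(msck_repair_sql("marketdata", "trades"))
```

Because the partition directory uses the `date=YYYYMMDD` form, Hive can map each directory to a partition value without extra bookkeeping.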
  • 41. Logical vs. Physical Schemas • Track a “logical” and “physical” schema for each table • Logical is compared with Amazon Redshift to detect changes • Physical schema used to produce Hive DDL for Presto • Schema definitions stored in MySQL • Version management and change detection • Amazon S3 location for each table • Tools to export these schemas as .sql files • Hive schema and table create statements • “msck repair table” scripts
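One way to picture the logical/physical split is a mapping step followed by DDL generation. The sketch below is illustrative and not the deck's actual tooling: the type table covers only the DATE and TIMESTAMP adjustments mentioned in the deck, other types pass through unchanged, and all names are hypothetical.

```python
# Logical (warehouse) types that are stored differently in the physical Parquet files.
TYPE_ADJUSTMENTS = {"DATE": "INT", "TIMESTAMP": "BIGINT"}


def to_physical(logical_columns):
    """Lower-case column names and apply the type adjustments."""
    return [(name.lower(), TYPE_ADJUSTMENTS.get(col_type.upper(), col_type.upper()))
            for name, col_type in logical_columns]


def hive_ddl(schema, table, physical_columns, location, partition_column="date"):
    """Build a CREATE EXTERNAL TABLE statement for a date-partitioned Parquet table."""
    columns = ",\n".join(f"  `{name}` {col_type}"
                         for name, col_type in physical_columns
                         if name != partition_column)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{schema}`.`{table}` (\n{columns}\n)\n"
        f"PARTITIONED BY (`{partition_column}` INT)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    logical = [("Symbol", "varchar(16)"), ("SettleDate", "date"), ("Price", "double")]
    print(hive_ddl("marketdata", "trades", to_physical(logical),
                   "s3://marketdata/trades/3/"))
```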
  • 42. File-level Metadata We encode information in file-level metadata: • Partition column definition • Time zone in which the file was parsed • Current & original schema name and version number • Column data type adjustments (DATE -> INT, etc.) Allows us to always recreate logical schema representations from physical files, re-migrate files if a data migration step had a bug, etc.
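The idea of recovering a logical schema from file-level metadata can be sketched as follows. Parquet key-value metadata is a map of strings, so the sketch serializes the type adjustments as JSON. Every key name here ("partition.column", "type.adjustments", and so on) is hypothetical; the slides list what is recorded but not how it is named or encoded.

```python
import json


def build_file_metadata(partition_column, time_zone, schema_name, version,
                        original_schema_name, original_version, type_adjustments):
    """Collect the per-file facts into a string-to-string map (key names are hypothetical)."""
    return {
        "partition.column": partition_column,
        "parse.time_zone": time_zone,
        "schema.name": schema_name,
        "schema.version": str(version),
        "schema.original_name": original_schema_name,
        "schema.original_version": str(original_version),
        # Maps column name -> logical type that was adjusted, e.g. DATE stored as INT.
        "type.adjustments": json.dumps(type_adjustments, sort_keys=True),
    }


def recover_logical_types(physical_columns, metadata):
    """Undo recorded type adjustments to get back the logical column types."""
    adjustments = json.loads(metadata["type.adjustments"])
    return {name: adjustments.get(name, ptype) for name, ptype in physical_columns.items()}


meta = build_file_metadata("date", "America/New_York", "equities", 4,
                           "equities", 3, {"trade_date": "date"})
logical = recover_logical_types({"symbol": "string", "trade_date": "int"}, meta)
```

Keeping both the original and current version in each file is what makes it possible to tell which files a buggy migration step touched.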
  • 43. Table Partitioning and Data Management • Partition Hive tables by date • We have mostly time-series data and are on a daily cadence • Partitioning helps query performance • Use `backticks` when defining column names in SQL • Column names must be lower case in Parquet • Correct bad data in Amazon Redshift through SQL, then UNLOAD partitions for encoding to Parquet • Our tools and automation make it easy to replace modified partitions of data in Hive tables
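The backtick and lower-case rules above can be applied mechanically when generating Hive DDL. The sketch below emits a CREATE EXTERNAL TABLE statement and the matching "msck repair table" command. It assumes a partition column named "date" of type int (inferred from the date=YYYYMMDD directory names), and the helper names and example table are hypothetical.

```python
def hive_create_table(schema, table, columns, location,
                      partition_column="date", partition_type="int"):
    """Render Hive DDL with lower-cased, backtick-quoted column names."""
    cols = ",\n".join(f"  `{name.lower()}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{schema}`.`{table}` (\n{cols}\n)\n"
        f"PARTITIONED BY (`{partition_column}` {partition_type})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def msck_repair(schema, table):
    """Render the statement that asks Hive to discover partition directories."""
    return f"MSCK REPAIR TABLE `{schema}`.`{table}`"


# Hypothetical table: mixed-case source names are lower-cased in the DDL.
ddl = hive_create_table(
    "equities", "trades",
    [("Symbol", "string"), ("TradeDate", "int"), ("Nanos", "bigint")],
    "s3://equities/trades/3/",
)
```

Quoting every identifier also sidesteps clashes with reserved words, which matters for a partition column called date.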
  • 44. Working with data in S3 and Amazon Redshift Custom tools developed to make life easier: • Extract CSV data from various DBs, or UNLOAD from Amazon Redshift in whole or in segments • Encode CSVs as Parquet files using a Hive schema • Write data into the correct directory structure in Amazon S3 • Allows us to move data between Amazon Redshift and Amazon S3 easily, and in bulk
  • 45. Custom Parquet Data Migration Tools • Read records from previous version of a table • Reads from the old location in Amazon S3 • Write records using the current version of a table • Writes to the new location in Amazon S3 • Most migrations are trivial: • Add new column with some default value (or null) • Rename columns • More complicated migrations require Java code • Track original and current version in file metadata
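The two "trivial" migrations named on this slide, renaming columns and adding a column with a default, amount to a small per-record transform. The Python below sketches that logic on plain dictionaries. It is not the authors' tool (the slides indicate theirs is written in Java and reads and writes Parquet in Amazon S3), and the function name and sample record are hypothetical.

```python
def migrate_record(record, renames=None, defaults=None):
    """Apply column renames, then fill in new columns with their defaults."""
    renames = renames or {}
    defaults = defaults or {}
    # Rename columns that appear in the mapping; leave the rest unchanged.
    migrated = {renames.get(name, name): value for name, value in record.items()}
    # Add each new column only if the record does not already carry it.
    for column, default in defaults.items():
        migrated.setdefault(column, default)
    return migrated


# Hypothetical migration: rename "sym" to "symbol" and add a nullable "venue".
old = {"sym": "AAPL", "px": 110.5}
new = migrate_record(old, renames={"sym": "symbol"}, defaults={"venue": None})
```

The input record is not modified, which mirrors reading from the old location and writing a fresh copy to the new one.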
  • 46. Review & Future Enhancements
  • 47. Review • Motivations for extending an Amazon Redshift warehouse with Amazon EMR • How our data ingest system operates • How to query encrypted data in Amazon S3 using Presto and other Hadoop-ecosystem tools • How to manage schemas and data migrations
  • 48. Lessons Learned: TL;DR • Manage storage and compute separately • It’s OK to be paranoid about data loss! • Amazon S3 encryption is easy and seek() works • Parquet vs. ORC • Partition and version your tables • Manage logical and physical table schemas • Data management tools & automation are important
  • 49. Future Enhancements • Archive original source data for SEC Rule 17a-4 compliance (using Amazon Glacier Vault Lock) • Decouple data retrieval and processing tasks • Move ingest processing to Amazon EC2/Amazon ECS • Move workflow engine DB to Amazon Aurora • Leverage other query frameworks: Spark, ML, etc. • Near real-time streaming ingest • More data sources