Running Big Data Apps on Amazon EMR and Redshift

© 2015 Nasdaq, Inc. All rights reserved.
“Nasdaq” and the Nasdaq logo are the trademarks of Nasdaq, Inc. and its affiliates in the U.S. and other countries.
“Amazon” and the Amazon Web Services logo are the trademarks of Amazon Web Services, Inc. or its affiliates in the U.S. and other countries.
Nate Sammons, Principal Architect, Nasdaq, Inc.
October 2015
BDT314
Running a Big Data and Analytics
Application on Amazon EMR and Amazon
Redshift with a Focus on Security

• Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands
• More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes
• Nasdaq technology is used to power more than 100 marketplaces in 50 countries
• Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds
• We own and operate 26 markets, including 1 clearing house and 5 central securities depositories, across asset classes & geographies

What to Expect from the Session
• Motivations for extending an Amazon Redshift
warehouse with Amazon EMR
• How our data ingest workflow operates
• How to query encrypted data in Amazon S3 using Presto
and other Hadoop-ecosystem tools
• How to manage schemas and data migrations
• Future direction for our data warehouse

Amazon Redshift as Nasdaq’s Main Data Warehouse
• Transitioned from an on-premises warehouse to
Amazon Redshift
• Over 1,000 tables migrated
• More data sources added as needed
• Nearly two years of data
• Average daily ingest of over 7B rows

Never Throw Anything Away
• 23 node ds2.8xlarge Amazon Redshift cluster
• 828 vCPU, 5.48 TB of RAM
• 368 TB of DB storage capacity, over 1PB of local disk!
• 92 GB/sec aggregate disk I/O
• Resize once per quarter
• 2.7 trillion rows: 1.8T from sources, 900B derived

Many Data Sources
• Internal DBs, CSV files, stream captures, etc.
• Data from all 7 exchanges operated by Nasdaq
• Orders, quotes, trade executions
• Market “tick” data
• Security master
• Membership
• All highly structured and consistent row-oriented data

Data Corollary to the Ideal Gas Law

Motivations for Extending to Amazon EMR and
Amazon S3
• Resizing a 300+ TB Amazon Redshift cluster isn’t
instantaneous
• Continuing to grow the cluster is expensive
• Paying for CPU and disk to support infrequently accessed
data doesn’t make sense
• Data will expand to fill any container

Goals
• Build a secure, cost effective, long-term data store
• Provide a SQL interface to all data
• Support new MPP analytics workloads (Spark, ML, etc.)
• Cap the size of our Amazon Redshift cluster
• Manage storage and compute resources separately

Amazon Redshift’s Continuing Role
• All data lands in Amazon Redshift first
• Amazon Redshift clients have strict SLAs on data availability
• Must ensure data loads are finished quickly
• Aggregations and transformations performed in SQL
• SQL is easy and we have a lot of SQL expertise
• Transformed data is then unloaded to Amazon S3 for conversion

Decouple Storage and Compute Resources
• Scale each independently as needed; run multiple different
apps on top of a common storage system
• Especially for old, infrequently accessed data, there is no need to
run compute 24/7 to support it; we can keep data “forever”
• Access needs drop off dramatically over time
• Yesterday >> last month >> last quarter >> last year

Account Structure and Cost Allocations
• Separate AWS accounts for each client / department
• Departments can run as much or as little compute as
they need; use different query tools, experiments
• No competition for compute resources across clients
• Amazon S3 costs are shared, compute costs are passed
through to each department

Nasdaq Workflow Engine
• MySQL-backed workflow engine developed in-house
• Orchestrates over 40K operations daily
• Flexible scheduling and dependency management
• Ops GUI for retrying failed steps, root cause analysis
• Moving to Amazon Aurora + Amazon EC2 in 2016
• Clustered operation using Amazon S3 as temp storage
space

Amazon Redshift Data Ingest Workflow
• Data is pulled from various sources
• Validate data, convert to CSVs + manifest
• Store compressed, encrypted data in Amazon S3 temp
space
• Load into Amazon Redshift using COPY SQL statements
• Further transformation performed using SQL
• UNLOAD transformed data back to Amazon S3
• Notifications to other systems using Amazon SQS
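
As a rough illustration of the COPY and UNLOAD steps above, the sketch below builds the two statements as strings. The table name, S3 locations, and role ARN are hypothetical placeholders. The IAM_ROLE authorization clause is an assumption, and the encryption options are left out because the talk does not show the exact statements used.

```python
def build_copy_sql(table: str, manifest_url: str, iam_role: str) -> str:
    """Build a COPY that loads gzip-compressed CSV files listed in a manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role}'\n"  # assumption: role-based authorization
        "MANIFEST CSV GZIP;"
    )


def build_unload_sql(query: str, prefix_url: str, iam_role: str) -> str:
    """Build an UNLOAD that writes query results back to S3 with a manifest."""
    # Single quotes inside the query are doubled because UNLOAD takes
    # the query as a quoted string.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{prefix_url}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "MANIFEST GZIP;"
    )


if __name__ == "__main__":
    role = "arn:aws:iam::123456789012:role/example-redshift-role"  # hypothetical
    print(build_copy_sql("trades", "s3://example-temp/trades/manifest", role))
    print(build_unload_sql(
        "SELECT * FROM trades WHERE trade_date = '2015-10-08'",
        "s3://example-temp/unload/trades_",
        role,
    ))
```

Keeping statement construction in small functions like these makes it easy for a workflow step to log exactly what was executed when a load is retried.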

Amazon EMR / Amazon S3 Data Ingest Workflow
• Automatically executed after Amazon Redshift loads and
transformations complete
• Uses Amazon Redshift schema metadata and manifest file
to drive conversions to Parquet
• Detects schema changes and bumps Hive schema version
• Alters schema in Hive Metastore to add new tables,
partitions as needed
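
A minimal sketch of the change-detection step, assuming a schema can be represented as an ordered list of (column name, data type) pairs. The function names and the representation are hypothetical; the deck does not describe how the comparison is implemented.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, data type)


def diff_columns(stored: List[Column], current: List[Column]) -> Dict[str, List[str]]:
    """Summarize how the current column list differs from the stored one."""
    old, new = dict(stored), dict(current)
    return {
        "added": [name for name in new if name not in old],
        "removed": [name for name in old if name not in new],
        "retyped": [name for name in new if name in old and old[name] != new[name]],
    }


def next_schema_version(stored: List[Column], current: List[Column], version: int) -> int:
    """Bump the version when the column lists differ in any way, including order."""
    return version + 1 if stored != current else version


if __name__ == "__main__":
    stored = [("symbol", "varchar"), ("price", "bigint")]
    current = stored + [("venue", "varchar")]
    print(diff_columns(stored, current))
    print(next_schema_version(stored, current, 3))
```

The diff is the kind of summary that could accompany an alert, while the version bump decides where new data is written.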

VPC or Nothing
• Security is our #1 priority at all times
• All instances run in a VPC
• Locked down security groups, network ACLs, etc.
• Least-privilege IAM roles for each app and human
• See SEC302 – IAM Best Practices from Anders
• EC2 instance roles in Amazon EMR
• VPC endpoint for Amazon S3
• 10 Gbps private AWS Direct Connect circuits into AWS

Encryption Key Management
• On-premises SafeNet Luna HSM cluster for key storage
• Amazon Redshift is directly integrated with our HSMs
• Nasdaq KMS:
• Internally known as “Vinz Clortho”
• Roots encryption keys in the HSM cluster
• Allows us full control over where keys are stored, used

Transparent Encryption in Amazon S3 and EMRFS
Amazon S3 SDK EncryptionMaterialsProvider
interface:
• Adapter to retrieve keys from our KMS
• Used when reading or writing data in Amazon S3
• User metadata to encode encryption key tokens

Encryption Performance with Amazon S3
• Roughly 25% slower than unencrypted
• Seek within an encrypted object works:
• Critical for performance
• Handled automatically
• Seeks are relative to the unencrypted size
• Create a new HTTP request at an offset within the object
• Encryption offset work is handled in the AWS SDK itself
• Worst case, we must read two extra blocks of AES data
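
To illustrate why a seek costs only a couple of extra AES blocks, the sketch below maps a requested plaintext byte range to the ciphertext range that would have to be fetched. It assumes a CBC-style mode with 16-byte blocks, where the block before the first needed block serves as the IV. The deck does not name the cipher mode, so this is an assumption for illustration and not a description of the AWS SDK internals. Padding at the end of the object is ignored.

```python
AES_BLOCK = 16  # AES block size in bytes


def ciphertext_range(start: int, end: int) -> tuple:
    """Map a plaintext byte range [start, end) to the ciphertext range to fetch.

    Assumes CBC mode: decrypting a block requires the previous ciphertext block.
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    first_block = (start // AES_BLOCK) * AES_BLOCK
    fetch_start = max(first_block - AES_BLOCK, 0)  # one extra block in front
    fetch_end = -(-end // AES_BLOCK) * AES_BLOCK   # round up to a block boundary
    return fetch_start, fetch_end


def extra_bytes(start: int, end: int) -> int:
    """Bytes fetched beyond the requested range."""
    fetch_start, fetch_end = ciphertext_range(start, end)
    return (fetch_end - fetch_start) - (end - start)


if __name__ == "__main__":
    print(ciphertext_range(100, 200))
    print(extra_bytes(100, 200))
```

The overhead is a fixed number of bytes regardless of how large the requested range is, which is why ranged reads of columnar files remain practical on encrypted objects.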

Local Disk Encryption with Amazon EMR
• Bootstrap action to encrypt ephemeral disks
• Specifically to encrypt Presto’s local temp storage
• Standard Linux LUKS configuration
• Integrated with the Nasdaq KMS
• Retrieves key and mounts disks on startup using init.d

SELinux on Amazon EMR
• Bootstrap action to install SELinux packages
• Adds kernel command line arguments
• Rebuilds initrd image
• Reboots the node and re-labels the filesystem
• Increases cluster boot time
• Currently only working on Amazon EMR 3.8
• Working to refine SELinux policy files for Presto

What is Presto?
• https://prestodb.io
• Open Source MPP SQL database from Facebook
• Flexible data sources through Connector API
• JDBC, ODBC drivers
• Nice GUI from Airbnb: http://nerds.airbnb.com/airpal/
• Hive Connector:
• Table schemas defined in a Hive Metastore as external tables
• Data files stored in Amazon S3

Running Presto on Amazon EMR
• Bootstrap action to download and install Java 8 & Presto
• Based on the Amazon EMR team’s Presto BA
• Adds support for custom encryption materials provider jars
• Configures Presto to use a remote Hive Metastore
• Currently using Amazon EMR 3.8, working towards 4.0

Data Encryption in Presto
• Presto doesn’t use EMRFS for access to Amazon S3
• We added support for Amazon S3
EncryptionMaterialsProvider to PrestoS3FileSystem.java
• Code available at github.com/nasdaq
• Working with Facebook to integrate these changes

File Formats: Parquet vs. ORC
The two most widely used structured data file formats:
• Compressed, columnar record storage
• Structured, schema-validated data
• Supported by a variety of Hadoop-ecosystem apps
• Arbitrary user metadata encoded at the file level

ORC
Pros:
• DATE and TIMESTAMP type support in Hive, Presto
Cons:
• Rigid column ordering requirements
• Clunky Java API
• Unacceptable performance when encrypted in
Amazon S3
• 15-18x slower during our testing (!)

The Winner: Parquet
• Wide project support: Presto, Spark, Drill, etc.
• Actively developed project
• Adoption increasing
• Column referenced by name instead of position
• Set hive.parquet.use-column-names=true in Presto config
• Good performance when encrypted (~27% slower)
• Clean Java API

Parquet Schema Workarounds
DATE is not supported in Hive or Presto
• Instead, convert DATEs to INTs
• 2015-10-08 becomes 20151008
• Timestamps become a BIGINT (64-bit integer in Hive)
• For nanosecond resolution records, we use a DATE and
a separate nanos-since-midnight column
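
The conversions above can be sketched as follows. The function names are hypothetical, and the `extra_nanos` parameter is an assumption added here because Python's `datetime` only carries microsecond precision.

```python
from datetime import date, datetime


def date_to_int(d: date) -> int:
    """Encode a date as a YYYYMMDD integer, e.g. 2015-10-08 -> 20151008."""
    return d.year * 10000 + d.month * 100 + d.day


def split_timestamp(ts: datetime, extra_nanos: int = 0) -> tuple:
    """Split a timestamp into (YYYYMMDD int, nanoseconds since midnight).

    extra_nanos carries any sub-microsecond part that datetime cannot hold.
    """
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    nanos = seconds * 1_000_000_000 + ts.microsecond * 1_000 + extra_nanos
    return date_to_int(ts.date()), nanos


if __name__ == "__main__":
    print(date_to_int(date(2015, 10, 8)))
    print(split_timestamp(datetime(2015, 10, 8, 9, 30, 0, 1), extra_nanos=500))
```

Integer dates in this form still sort chronologically, so range predicates on the column behave as expected.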

Hive Metastore
• Amazon EMR 4.0 cluster for the Metastore
• Easier for remote access from Presto
• Reachable through VPC peering with client accounts
• The “source of truth” for Hive schemas
• Metastore DB on Amazon RDS for MySQL
• Easy backups, encrypted storage
• Data ingest system creates/alters tables
• Alters tables to add new data partitions each day
• Detects newly changed schemas

Managing Versioned, Partitioned Tables in S3
• Store versions of a table in directories in Amazon S3:
s3://schema/table/version/date=YYYYMMDD/*.parquet
• Works with “msck repair table” commands
• When a schema change is detected, increment the
version. New data is written to the new location, alerts
are generated for humans to determine changes.
• Data is migrated in Amazon S3 and old versions are
kept for now
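
A small sketch of how a partition prefix following the layout above might be built. It reads the path template literally; the function name and the "v2" version label are hypothetical, since the deck does not say how versions are named.

```python
from datetime import date


def partition_prefix(schema: str, table: str, version: str, day: date) -> str:
    """Build the S3 prefix for one daily partition of one table version."""
    return f"s3://{schema}/{table}/{version}/date={day:%Y%m%d}/"


if __name__ == "__main__":
    print(partition_prefix("schema", "orders", "v2", date(2015, 10, 8)))
```

Because the directory name has the form `date=YYYYMMDD`, Hive can discover new partitions from the directory structure alone.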

Logical vs. Physical Schemas
• Track a “logical” and “physical” schema for each table
• Logical is compared with Amazon Redshift to detect changes
• Physical schema used to produce Hive DDL for Presto
• Schema definitions stored in MySQL
• Version management and change detection
• Amazon S3 location for each table
• Tools to export these schemas as .sql files
• Hive schema and table create statements
• “msck repair table” scripts
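
As an illustration of the export tools listed above, the sketch below renders a physical schema as a Hive external table statement plus the matching repair command. The function names, the example columns, and the `date` INT partition column are assumptions based on the path layout and type workarounds described elsewhere in the deck.

```python
from typing import List, Tuple


def hive_create_table(schema: str, table: str,
                      columns: List[Tuple[str, str]], location: str) -> str:
    """Render CREATE EXTERNAL TABLE DDL for a date-partitioned Parquet table."""
    # Column names are lower-cased and wrapped in backticks.
    cols = ",\n".join(f"  `{name.lower()}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{schema}`.`{table}` (\n"
        f"{cols}\n"
        ")\n"
        "PARTITIONED BY (`date` INT)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


def hive_repair_table(schema: str, table: str) -> str:
    """Render the command that registers partitions found under the location."""
    return f"MSCK REPAIR TABLE `{schema}`.`{table}`;"


if __name__ == "__main__":
    print(hive_create_table(
        "market", "trades",
        [("Symbol", "STRING"), ("price", "BIGINT")],
        "s3://schema/trades/v2/",  # hypothetical location
    ))
    print(hive_repair_table("market", "trades"))
```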

File-level Metadata
We encode information in file-level metadata:
• Partition column definition
• Time zone in which the file was parsed
• Current & original schema name and version number
• Column data type adjustments (DATE -> INT, etc.)
Allows us to always recreate logical schema representations
from physical files, re-migrate files if a data migration step
had a bug, etc.

Table Partitioning and Data Management
• Partition Hive tables by date
• We have mostly time-series data and are on a daily cadence
• Partitioning helps query performance
• Use `backticks` when defining column names in SQL
• Column names must be lower case in Parquet
• Correct bad data in Amazon Redshift through SQL, then
UNLOAD partitions for encoding to Parquet
• Our tools and automation make it easy to replace
modified partitions of data in Hive tables

Working with Data in Amazon S3 and Amazon Redshift
Custom tools developed to make life easier:
• Extract CSV data from various DBs, or UNLOAD from
Amazon Redshift in whole or in segments
• Encode CSVs as Parquet files using a Hive schema
• Write data into the correct directory structure in
Amazon S3
• Allows us to move data between Amazon Redshift and
Amazon S3 easily, and in bulk

Custom Parquet Data Migration Tools
• Read records from previous version of a table
• Reads from the old location in Amazon S3
• Write records using the current version of a table
• Writes to the new location in Amazon S3
• Most migrations are trivial:
• Add new column with some default value (or null)
• Rename columns
• More complicated migrations require Java code
• Track original and current version in file metadata

Review
• Motivations for extending an Amazon Redshift
warehouse with Amazon EMR
• How our data ingest system operates
• How to query encrypted data in Amazon S3 using Presto
and other Hadoop-ecosystem tools
• How to manage schemas and data migrations

Lessons Learned: TL;DR
• Manage storage and compute separately
• It’s OK to be paranoid about data loss!
• Amazon S3 encryption is easy and seek() works
• Parquet vs. ORC
• Partition and version your tables
• Manage logical and physical table schemas
• Data management tools & automation are important

Future Enhancements
• Archive original source data for SEC Rule 17a-4 compliance
(using Amazon Glacier Vault Lock)
• Decouple data retrieval and processing tasks
• Move ingest processing to Amazon EC2/Amazon ECS
• Move workflow engine DB to Amazon Aurora
• Leveraging other query frameworks: Spark, ML, etc.
• Near real-time streaming ingest
• More data sources

Remember to complete
your evaluations!
