Building Data Lakes with AWS

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Building Data Lakes with AWS
John Mallory
Storage BD

Rethink how to become a data-driven business
• Business outcomes - start with the insights and actions you want to drive,
then work backwards to a streamlined design
• Experimentation - start small, test many ideas, keep the good ones and
scale those up, paying only for what you consume
• Agile and timely - deploy data processing infrastructure in minutes, not
months. take advantage of a rich platform of services to respond quickly to
changing business needs

Finding Value in Data is a Journey
Business Monitoring
Business Insights
New Business Opportunity
Business Optimization
Business Transformation
Evolving Tools and Infrastructure

Often Undertaken with Silos of Tools and Data
Hadoop
Spark
NoSQL
Storage
Arrays
Databases
Data
Warehouse
Structured Data
SQL
Raw Data
ETL
Advanced Analytics
ETL

Legacy Data Warehouses & RDBMS
• Complex to setup and manage
• Do not scale
• Takes months to add new
data sources
• Queries take too long
• Cost $MM upfront

This Leads to Friction & Pain
• Challenging to move data across silos
• Forced to keep multiple copies of data
• Complex data transformation & governance
• Users struggle to find data they need
• Slows innovation and evolution
• Expensive

Enter the Data Lake Architecture
Data Lake is a new and increasingly
popular architecture to store and
analyze massive volumes and
heterogeneous types of data.
Benefits of a Data Lake
• All Data in One Place
• Quick Ingest & Transformation
• Bring Functionality to the Data
• Schema on Read

Consideration 1 – S3 for the Data Lake

Consolidate Data / Separate Storage & Compute
• Amazon S3 as the data lake storage tier; not a single analytics
tool like Hadoop or a data warehouse
• Decoupled storage and compute is cheaper and more efficient to
operate
• Decoupled storage and compute allow us to evolve to clusterless
architectures (i.e. AWS Lambda, Amazon Athena, Redshift
Spectrum, AWS Glue, Amazon Macie)
• Do not build data silos in Hadoop or an EDW
• Gain the flexibility to use all the analytics tools in the ecosystem
around S3 & future proof the architecture

An AWS Data Lake Architecture
AWS Glue
ETL & Data Catalog
Serverless
Compute
AWS Lambda
Trigger-based Code Execution
Amazon Redshift Spectrum
Fast @ Exabyte scale
Amazon Athena
Interactive Query
Data
Processing
Amazon EMR
Managed Hadoop Applications
Amazon Redshift
Petabyte-scale Data Warehousing
Storage
Amazon S3
Exabyte-scale Object Storage
AWS Glue Data Catalog
Hive-compatible Metastore

• Nasdaq implements an S3 data lake + Redshift data warehouse
architecture
• Most recent two years of data is kept in the Redshift data
warehouse and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used to ad-hoc query data in S3
• Transitioned from an on-premises data warehouse to Amazon
Redshift & S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of legacy budget for the same data set
(~1100 tables)

Nasdaq uses Presto on Amazon EMR and Amazon Redshift
as a tiered data lake
Full Presentation: https://www.youtube.com/watch?v=LuHxnOQarXU

Designed for 11 9s
of durability
§ Multiple Encryption Options
§ Robust/Highly Flexible Access Controls
Durable Secure High performance
§ Multiple upload
§ Range GET
§ Scalable Throughput
§ Amazon EMR
§ Amazon Redshift/Spectrum
§ Amazon DynamoDB
§ Amazon Athena
§ Amazon Rekognition
§ Amazon Glue
IntegratedEasy to use
§ Simple REST API
§ AWS SDKs
§ Read-after-create consistency
§ Event notification
§ Lifecycle policies
§ Simple Management Tools
§ Hadoop compatibility
Scalable
§ Store as much as you need
§ Scale storage and compute
independently
§ Scale without limits
§ Affordable
Why Choose Amazon S3 for Data Lake?

Optimize Costs with Data Tiering
• Use HDFS for very frequently accessed
(hot) data
• Use Amazon S3 Standard for frequently
accessed data
• Use Amazon S3 Standard – IA for less
frequently accessed data
• Use Amazon Glacier for archiving cold data
• Use Amazon S3 Analytics for storage class
analysis
New

Encryption ComplianceSecurity
§ Identity and Access
Management (IAM) policies
§ Bucket policies
§ Access Control Lists (ACLs)
§ Private VPC endpoints to
Amazon S3
§ SSL endpoints
§ Server Side Encryption
(SSE-S3)
§ S3 Server Side Encryption
with provided keys (SSE-
C, SSE-KMS)
§ Client-side Encryption
§ Buckets access logs
§ Lifecycle Management
Policies
§ Access Control Lists (ACLs)
§ Versioning & MFA deletes
§ Certifications – HIPAA, PCI,
SOC 1/2/3 etc.
Implement the right security controls in S3

Manage your data
S3 object Tags
Manage storage based on object tags
• Classify your data
• Tag your objects with key-value pairs
• Write policies once based on the type of data
Discoverability Lifecycle PolicyAccess Control

Manage S3 Security
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*"
"Condition": {"StringEquals": {"S3:ResourceTag/HIPAA":"True"}}
}
]
}
Manage permissions with tags

Macie: A New Approach
Amazon Macie
Understand Your Data
Natural Language
Processing (NLP)
Understand Data Access
Machine Learning

Amazon Macie Uses Machine Learning
• Understand behavioral analytics to baseline normal behavior
• Train and develop contextualized alerts by understanding the
value of data being accessed
• Context for content

Business Critical Data in Amazon S3
• Static website content
• Source code
• SSL certificates, private keys
• iOS and Android app signing
keys
• Database backups
• OAuth and Cloud SAAS API
Keys

Consideration 2 – Ingest & Catalog

AWS Snowball & Snowmobile
• Accelerate PBs with AWS-provided
appliances
• 50, 80, 100 TB models
• 100PB Snowmobile
AWS Storage Gateway
• Instant hybrid cloud
• Up to 120 MB/s cloud upload rate
(4x improvement), and
Choose the Right Ingestion Methods
Amazon Kinesis Firehose
• Ingest device streams directly
into AWS data stores
AWS Direct Connect
• COLO to AWS
• Use native copy tools
Native/ISV Connectors
• Sqoop, Flume, DistCp
• Commvault, Veritas, etc
Amazon S3 Transfer
Acceleration
• Move data up to 300% faster
using AWS’s private network

Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Redshift
and Elasticsearch
• Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other
destinations without writing an application or managing infrastructure.
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data
destinations in as little as 60 secs using simple configurations.
• Seamless elasticity: Seamlessly scales to match data throughput w/o intervention
• Serverless ETL using AWS Lambda - Firehose can invoke your Lambda function to transform
incoming source data.
Capture and submit streaming
data
Analyze streaming data using your
favorite BI tools
Firehose loads streaming data continuously
into Amazon S3, Redshift and Elasticsearch

Catalog Your Data
S3
Put data in S3
Amazon
DynamoDB
Amazon Elasticsearch
Service
Extract metadata
with Lambda
Data Sources
Search
capabilities

Catalog with AWS Glue

Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by
tools like Hive, Presto, Spark etc.
We added a few extensions:
§ Search over metadata for data discovery
§ Connection info – JDBC URLs, credentials
§ Classification for identifying and parsing files
§ Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through Crawlers.

Data Catalog: Crawlers
Automatically discover new data, extracts schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync

AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized
list that is searchable

Consideration 3 – Optimizing Performance

Getting high Throughput Performance with S3
• S3 can scale to many thousands of requests per second
• Need a good key naming scheme
• Only at scale do you need to consider your key naming scheme
• What are Partitions?
• Why?
• Spread Keys Lexigraphically
• Goal of Partitioning is too spread the heat
• Prevent HotSpots

Distributing key names
• Add randomness to the beginning of the key name…
<my_bucket>/6213-2013_11_13.jpg
<my_bucket>/4653-2013_11_13.jpg
<my_bucket>/9873-2013_11_13.jpg
<my_bucket>/4657-2013_11_13.jpg
<my_bucket>/1256-2013_11_13.jpg
<my_bucket>/8345-2013_11_13.jpg
<my_bucket>/0321-2013_11_13.jpg
<my_bucket>/5654-2013_11_13.jpg
<my_bucket>/2345-2013_11_13.jpg
<my_bucket>/7567-2013_11_13.jpg
<my_bucket>/3455-2013_11_13.jpg
<my_bucket>/4313-2013_11_13.jpg
Partitions:
<my_bucket>/0
<my_bucket>/1
<my_bucket>/2
<my_bucket>/3
<my_bucket>/4
<my_bucket>/5
<my_bucket>/6
<my_bucket>/7
<my_bucket>/8
<my_bucket>/9

Data Recommendations for EMR and S3
Performance Best Practices:
• Reduce Number of S3 objects by aggregating small files
into larger ones (s3distcp – group-by option)
• Goal: Files >128MB
• Use EMRFS with Consistent View
• Parquet with Snappy compression is emerging as the best
compression algorithm
• Reverse partition scheme to HOUR, DAY, MONTH, YEAR

Use the Right Data Formats
• Pay by the amount of data scanned per query
• Use Compressed Columnar Formats
• Parquet
• ORC
• Easy to integrate with wide variety of tools
Dataset Size on Amazon S3 Query Run time Data Scanned Cost
Logs stored as Text files 1 TB 237 seconds 1.15TB $5.75
Logs stored in Apache
Parquet format*
130 GB 5.13 seconds 2.69 GB $0.013
Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper

Consideration 4 – Query in Place

S3
Data Catalog
AthenaEMR Redshift
Spectrum
Amazon ML / MXNet
RDS
QuickSight
Kinesis
Database
Migration
Service
Glue
Amazon Analytics End to End Architecture
IAM
Other
Sources

Explore Your Data Without ETL
• Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL

Athena is Serverless
• No Infrastructure or
administration
• Zero Spin up time
• Transparent upgrades

Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the best performance
and lowest cost
• No ETL required
• Stream data directly from Amazon S3

Familiar Technologies Under the Covers
• Used for SQL Queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
• Used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning

What About ETL?
Raw Data Assets Transformed Into Usable Ones

ETL is the most time-consuming part of analytics
ETL Data Warehousing Business Intelligence
80% of time
spent here
Amazon Redshift Amazon QuickSight

AWS Glue
Simple, flexible, cost-effective ETL
§ AWS Glue is a fully managed ETL (extract, transform, and load) service
§ Categorize your data, clean it, enrich it and move it reliably between
various data stores
§ Once catalogued, your data is immediately searchable and queryable across
your data silos
§ Simple and cost-effective
§ Serverless; runs on a fully managed, scale-out Spark environment

Build event-driven ETL pipelines

Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
Supports Standard ANSI SQL
High Performance
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
Fully Managed Petabyte-scale Data
Warehouse

Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale Elastic & highly available On-demand, pay-per-query
High concurrency: Multiple
clusters access same data
No ETL: Query data in-place
using open file formats
Full Amazon Redshift
SQL support
S3
SQL

Amazon EMR
Real-time Analytics
Amazon
Kinesis
KCL app
AWS Lambda
Spark
Streaming
Amazon
SNS
Amazon
ML
Notifications
Amazon
ElastiCache
(Redis)
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
Alerts
App state
Real-time prediction
KPI
process
store
Stream
Amazon Kinesis
Analytics
Amazon
S3
Log
Amazon
KinesisFan out

Case Study: Clickstream Analysis
Hearst Corporation monitors trending content for over 250 digital properties worldwide
and processes more than 30TB of data per day, using an architecture that includes
Amazon Kinesis and Spark running on Amazon EMR.
Store → Process | Analyze → Answers

Amazon Kinesis
Amazon
EMR
Amazon EMR
Amazon Redshift
Elasticsearch
Clickstream
Hearst Corporation monitors
trending content for over 250
digital properties worldwide
and processes more than 30TB
of data per day, using an
architecture that includes
Amazon Kinesis and Spark
running on Amazon EMR.

&
Batch
Analytics
Amazon S3
Amazon EMR
Hive
Pig
Spark
Amazon
ML
process
store
Consume
Amazon Redshift
Amazon EMR
Presto
Spark
Batch
Interactive
Batch prediction
Real-time prediction
Stream Amazon Kinesis
Firehose
Amazon Athena
Files
Amazon Kinesis
Analytics

Ingest/
Collect
Consume/
visualize
Store
Process/
analyze
Data
1 4
0 9
5
Amazon S3
Data lake
Amazon EMR
Amazon
Kinesis
Amazon RedShift
Answers &
Insights
Hot HomesUsers
Properties
Agents
User Profile
Recommendation
Hot Homes
Similar Homes
Agent Follow-up
Agent Scorecard
Marketing
A/B Testing
Real Time Data
…
Amazon
DynamoDB
BI / Reporting

Choose the Right Tools
Amazon Redshift, Spectrum
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
Amazon Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon Quicksight
Business Intelligence/Visualization
Amazon ElasticSearch Service
ElasticSearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition & Amazon Polly
Image Recognition & Text-to-Speech AI APIs
Amazon Lex
Voice or Text Chatbots

Amazon S3
Data Lake
Amazon Kinesis
Streams & Firehose
Hadoop / Spark
Streaming Analytics Tools
Amazon Redshift
Data Warehouse
Amazon DynamoDB
NoSQL Database
AWS Lambda
Spark Streaming
on EMR
Amazon
Elasticsearch Service
Relational Database
Amazon EMR
Amazon Aurora
Amazon Machine Learning
Predictive Analytics
Any Open Source Tool
of Choice on EC2
AWS Data Lake
You Don’t
Have to
Choose
Data Science Sandbox
Visualization /
Reporting
Apache Storm
on EMR
Apache Flink
on EMR
Amazon Kinesis
Analytics
Serving Tier
Clusterless SQL Query
Amazon Athena
DataSourcesTransactionalData
Amazon Glue
Clusterless ETL
Amazon ElastiCache
Redis

• Use S3 as the storage repository for your data lake,
instead of a Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more
efficient to operate
• Decoupled storage and compute allow us to evolve to
clusterless architectures like Athena
• Do not build data silos in Hadoop or the Enterprise DW
• Gain flexibility to use all the analytics tools in the ecosystem
around S3 & future proof the architecture
Evolve as Needed

Pop-up Loft

Building Data Lakes with AWS

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Building Data Lakes with AWS

Similar to Building Data Lakes with AWS (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Building Data Lakes with AWS