Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA305 - Chicago AWS Summit

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Ben Snively
Principal Solutions Architect, Data and Analytics; AI/ML
Amazon Web Services
BDA305-R
Build Data Lakes and Analytics on AWS:
Patterns & Best Practices

Visualization Variability
Big Data: Different forms of challenges
Volume Velocity Variety Veracity Value

https://www.promptcloud.com
https://john-popelaars.blogspot.com
https://ww.signiant.com
https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
Challenges are often driven by:
Data growth faster than ever
Data variety is increasing

AWS Data Lake helps address this
Quickly ingest and store any type of data
Insights and security, together …
Run the right tool for the right job without manually copying data around

Data Lakes from AWS
Analytics
Machine Learning
Real-time data movement
Traditional data movement
Data Lake on AWS
Ingestion
Intelligence
Storage
Catalog
Variety of ingestion tools
Decoupled analytics from storage/catalog

What data do I have?

Gartner:
“Through 2018, 80% of data lakes will not include effective
metadata management capabilities, making them inefficient.”
“Metadata Is the Fish Finder in Data Lake”
What data do I have?
Data Lake
on AWS
Storage | Archival Storage | Data Catalog

AWS Glue components
Data Catalog (Discover)
• Apache Hive Metastore compatible
• Integrated with AWS services
• Automatic crawl and discover data
Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and share
Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting

IAM Role
AWS Glue Crawler Databases
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
Built-in classifiers
MySQL
MariaDB
PostgreSQL
Aurora
Oracle
Amazon Redshift
Avro
Parquet
ORC
XML
JSON & JSONPaths
AWS CloudTrail
BSON
Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
Delimited (comma, pipe, tab, semicolon)
< ALWAYS GROWING…>
What can crawlers discover?
Create additional custom classifiers
Amazon DynamoDB
NoSQL Connection

But I have my own data formats …?
− There is a custom classifier for that …
Row-Based: GROK Classifier
A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
XML: XML Classifier
XML tag that defines a table row in the XML document.
JSON: JSON Classifier
JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators.
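The row-based idea (named regular expressions matched one line at a time) can be sketched in plain Python. This is only an illustration of the concept, not the AWS Glue grok engine: the `PATTERNS` table, the `%{NAME:field}` expansion helper, and the sample log format are all hypothetical.

```python
import re

# Hypothetical named sub-patterns, loosely mirroring how a grok pattern
# composes named regular expressions. Not the AWS Glue built-in library.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"\d+",
}


def compile_grok(template: str) -> "re.Pattern":
    """Expand %{NAME:field} placeholders into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{PATTERNS[name]})"

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, template))


def classify_lines(lines, pattern):
    """Match data one line at a time; lines that do not match are skipped."""
    rows = []
    for line in lines:
        m = pattern.fullmatch(line.strip())
        if m:
            rows.append(m.groupdict())
    return rows


# Hypothetical log format: "<client ip> <http method> <status code>"
LOG_PATTERN = compile_grok(r"%{IP:client} %{WORD:method} %{INT:status}")
SAMPLE = ["10.0.0.1 GET 200", "not a log line", "10.0.0.2 POST 500"]
ROWS = classify_lines(SAMPLE, LOG_PATTERN)
```

Each matched line becomes one table row keyed by the field names in the pattern, which is the sense in which a classifier turns a custom text format into a schema.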

Other ways of populating the catalog
Call the AWS Glue CreateTable API
Create a table manually with a DDL statement (in Amazon Athena or Amazon EMR)
Import from Apache Hive Metastore (Apache Hive Metastore → AWS Glue ETL → AWS Glue Data Catalog)
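As a rough sketch of the CreateTable route, the snippet below builds a `TableInput` structure for a Parquet dataset and hands it to boto3's `glue.create_table`. The database name, table name, columns, and S3 location are placeholders, and `register_table` assumes boto3 is installed and AWS credentials are configured.

```python
def build_table_input(name, location, columns, partition_keys=()):
    """Build a TableInput dict for the AWS Glue CreateTable API (Parquet data)."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def register_table(database, table_input):
    """Create the table in the Glue Data Catalog (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("glue").create_table(DatabaseName=database, TableInput=table_input)


# Placeholder names: replace with your own bucket, table, and columns.
TABLE_INPUT = build_table_input(
    name="clickstream",
    location="s3://example-data-lake/curated/clickstream/",
    columns=[("user_id", "string"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string")],
)
# register_table("example_db", TABLE_INPUT)
```

Once the table exists in the catalog, engines that read the catalog can see the same definition without a crawler run.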

How do I hydrate my Data Lake?

How do I drive value?
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS
Analytics
Machine learning
Real-time data movement
Traditional data movement

Ingest data based on the type of data
Open and comprehensive
• Data movement from on-premises datacenters
• Dedicated network connection
• Secure appliances
• Ruggedized shipping container
• Database migration
• Gateway that lets applications write to the cloud
• Data movement from real-time sources
• Connect devices to AWS
• Real-time data streams
• Real-time video streams
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Storage Gateway
AWS IoT Core
Data movement from real-time sources
Data movement from your datacenters
Amazon S3
Amazon Glacier
AWS Glue

Real-time data movement and Data Lakes on AWS
Amazon Kinesis Data Firehose
AWS Glue Data Catalog
Amazon S3 Data
Data Lake on AWS
Amazon Kinesis Data Streams
Data definition
Kinesis Agent
Apache Kafka
AWS SDK
LOG4J
Flume
Fluentd
AWS Mobile SDK
Kinesis Producer Library
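Of the producers listed, the AWS SDK route is the simplest to sketch. The example below builds the arguments for boto3's `kinesis.put_record` call; the stream name, event fields, and choice of partition key are placeholders, and `send` assumes boto3 and AWS credentials are available.

```python
import json


def build_put_record(stream_name, event, partition_field):
    """Build keyword arguments for the Kinesis Data Streams PutRecord call."""
    return {
        "StreamName": stream_name,
        # Kinesis carries opaque bytes; JSON is just one possible encoding.
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key land on the same shard.
        "PartitionKey": str(event[partition_field]),
    }


def send(request):
    """Send one record (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("kinesis").put_record(**request)


# Placeholder stream and event.
REQUEST = build_put_record(
    "example-clickstream",
    {"user_id": "a1", "page": "/home"},
    partition_field="user_id",
)
# send(REQUEST)
```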

Amazon S3
Amazon Glacier
AWS Glue
IMPORTANT: Ingest data in its raw form …
Open and comprehensive
• Store the data in its raw form:
• BEFORE
• Transforming
• Analyzing
• Manipulating
• Doing … anything … to it
CSV
ORC
Grok
Avro
Parquet
JSON
• This becomes your source of record you can always go back to …
• Lifecycle policies allow you to shift it to warm and cold storage.
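A lifecycle policy of this kind might look like the sketch below, which moves objects under a raw-data prefix to an infrequent-access class and later to Amazon Glacier. The prefix, rule ID, bucket name, and the 30/90-day thresholds are illustrative assumptions, not values from the talk; `apply_lifecycle` assumes boto3 and AWS credentials.

```python
def raw_zone_lifecycle(prefix="raw/", warm_after_days=30, cold_after_days=90):
    """Build an S3 lifecycle configuration that tiers raw data to warm, then cold."""
    if cold_after_days <= warm_after_days:
        raise ValueError("cold transition must come after the warm transition")
    return {
        "Rules": [
            {
                "ID": "tier-raw-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Attach the configuration to a bucket (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


LIFECYCLE = raw_zone_lifecycle()
# apply_lifecycle("example-data-lake", LIFECYCLE)
```

Because the rule only changes the storage class, the raw objects keep their keys and remain the source of record.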

Tiered storage to optimize price / performance
Lowest cost
• Tiered storage to optimize price/performance
• Amazon S3 Standard
• Amazon S3 Standard—Infrequent Access
• Amazon S3 One Zone—Infrequent Access
• Amazon Glacier
• Migrate between tiers based on lifecycle policies
• Store data at $0.023*/GB/month with Amazon S3
• Store data at $0.004*/GB/month with Amazon Glacier
* as of July, 2018
Amazon S3 Standard
Amazon S3 Standard Infrequent Access
Amazon S3 One Zone-IA
Amazon Glacier
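The effect of tiering can be estimated with simple arithmetic using the two per-GB prices quoted above (as of July 2018). The 100,000 GB dataset and the 80% cold share are made-up inputs, and the sketch ignores request, retrieval, and infrequent-access charges.

```python
# Prices quoted on the slide, as of July 2018 (USD per GB per month).
S3_STANDARD_USD_PER_GB_MONTH = 0.023
GLACIER_USD_PER_GB_MONTH = 0.004


def monthly_storage_cost(standard_gb, glacier_gb):
    """Storage-only monthly cost; ignores request and retrieval fees."""
    return (
        standard_gb * S3_STANDARD_USD_PER_GB_MONTH
        + glacier_gb * GLACIER_USD_PER_GB_MONTH
    )


# Hypothetical 100,000 GB data lake.
TOTAL_GB = 100_000
ALL_STANDARD = monthly_storage_cost(TOTAL_GB, 0)
# Assume 80% of the data is cold enough to live in Amazon Glacier.
TIERED = monthly_storage_cost(TOTAL_GB * 0.2, TOTAL_GB * 0.8)
```

With these assumed inputs the bill drops from roughly $2,300 to roughly $780 per month, which is the kind of difference lifecycle-driven tiering is meant to capture.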

Datasets in the Lake?
Raw datasets – immutable datasets that you can always go back to.
• Abstract out the complexities of how the data is stored through the catalog and SerDes
Optimizing Analytics and Machine Learning:
Curated datasets – query-optimized for consumption across a wide range of tools

Raw data stored in Data Lake:
Preparation:
Normalized
Partitioned
Compressed
Storage Optimized
Extract – Load – Transform
Preparing raw data for consumption
Data Lake
on AWS
Raw
Ingestion
Curated
DataSets
Data Catalog
ELT
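One common way to make curated data "partitioned" is a Hive-style key layout, where partition values are encoded in the object key so query engines can skip irrelevant prefixes. The sketch below only builds such keys; the `curated` zone name, the dataset name, and the year/month/day scheme are assumptions chosen for illustration.

```python
from datetime import date


def partitioned_key(dataset: str, day: date, filename: str, zone: str = "curated") -> str:
    """Build a Hive-style partitioned object key (year=/month=/day=)."""
    return (
        f"{zone}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Hypothetical dataset and file name.
KEY = partitioned_key("clickstream", date(2018, 7, 4), "part-0000.parquet")
# KEY == "curated/clickstream/year=2018/month=07/day=04/part-0000.parquet"
```

A query filtered on `year`, `month`, and `day` then only needs to read objects under the matching prefix.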

Which tool should I use to analyze my data?

How do I drive value?
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake
on AWS
Analytics
Machine Learning
Real-time data movement
Traditional data movement

Different tools for different users … solving different problems
Business
Reporting
Data Scientists
Data Engineer
IDE
Data
Catalog
Central
Storage
SageMaker
Machine Learning/Deep Learning

Amazon Athena – interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query instantly
• Zero setup cost; just point to Amazon S3 and start querying
Pay per query
• Pay only for queries run; save 30%–90% on per-query costs through compression
Open
• ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy
• Serverless: zero infrastructure, zero administration
• Integrated with Amazon QuickSight
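"Point to Amazon S3 and start querying" might look like the following sketch, which builds a request for boto3's `athena.start_query_execution`. The database, table, partition columns, and results bucket are placeholders, and `run_query` assumes boto3 and AWS credentials.

```python
def build_query_request(sql, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    response = boto3.client("athena").start_query_execution(**request)
    return response["QueryExecutionId"]


# Placeholder database, table, and bucket names.
REQUEST = build_query_request(
    sql="SELECT page, COUNT(*) AS hits FROM clickstream "
        "WHERE year = '2018' AND month = '07' GROUP BY page",
    database="example_db",
    output_location="s3://example-athena-results/",
)
# query_id = run_query(REQUEST)
```

Filtering on partition columns, as the placeholder query does, limits the data scanned, which matters because the slide describes billing as pay per query.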

Amazon EMR – big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Latest versions
• Updated with the latest open source frameworks within 30 days of release
Low cost
• Flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%-80%
Use Amazon S3 storage
• Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector
Easy
• Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning
Data Lake

Hadoop / Spark Analytics on AWS
YARN (Hadoop Resource Manager)
NoSQL
Machine learning
Real-time
Interactive
Script
Batch
Data Lake
on AWS
Amazon S3
Amazon EMR
Managed Hadoop / Spark
Object storage

Fitting this into the Common Data Catalog
Amazon S3
Interactive Spark cluster
Amazon EMR
Amazon EMR
EMRFS
HDFS
Transient ETL job
Source of Truth
EMRFS
HDFS
Describes the data
MySQL DB instance
Unified data view
AWS Glue
Data Catalog
Stores the data
…

Amazon Redshift – data warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at scale
• Columnar storage technology to improve I/O efficiency and scale query performance
Inexpensive
• As low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions; start at $0.25 per hour
Open file formats
• Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Secure
• Audit everything; encrypt data end-to-end; extensive certification and compliance

Data warehouse …
Amazon Redshift Data Warehouse
Relational data
Gigabytes to petabytes scale
Reporting and analysis
Schema defined prior to data load
AWS
Glue ETL
On Prem
Amazon QuickSight
Existing or new
BI tool
Redshift
COPY

A Data Lake is not an Enterprise Data Warehouse (EDW)
Data Lake | EDW
Complementary to EDW (not replacement) | Data lake can be source for EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data Science + Prediction/Advanced Analytics + BI use cases | BI use cases
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
Elastic storage and compute capacity – decoupled | Explicitly sized environments, compute and storage scaled in linearly
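The schema-on-read versus schema-on-write distinction can be illustrated with a small sketch. Raw JSON lines are stored untouched and a reader-supplied schema is applied only at query time, whereas the write-side function rejects anything that does not fit before storing it. The records, field names, and both helper functions are hypothetical.

```python
import json

# Raw records kept exactly as ingested; fields may vary between lines.
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2018-07-04T10:00:00"}',
    '{"user": "b2", "amount": "5", "extra": "ignored at read time"}',
]

# The reader declares the schema it cares about (field name -> type cast).
SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Schema on read: project and cast fields when the data is consumed."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append(
            {
                field: (cast(record[field]) if field in record else None)
                for field, cast in schema.items()
            }
        )
    return rows


def write_with_schema(record, schema):
    """Schema on write: reject records that do not match before storing them."""
    if set(record) != set(schema):
        raise ValueError(f"record fields {sorted(record)} do not match schema")
    return {field: cast(record[field]) for field, cast in schema.items()}


ROWS = read_with_schema(RAW_LINES, SCHEMA)
```

The second raw line would be rejected by `write_with_schema` because of its extra field, yet it is perfectly usable on the read side, which is why new content tends to land faster in a data lake.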

Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in Amazon S3 data lake
Amazon S3
Data Lake
Amazon
Redshift data
Amazon Redshift Spectrum
query engine
Exabyte Redshift SQL queries against Amazon S3
Join data across Redshift and Amazon S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned

Amazon Redshift Spectrum
Query your Data Lake
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon Redshift
Spectrum
Scale-out serverless compute
AWS Glue Data Catalog
COPY
commands
Hot data
Query directly
on Data Lake

Data Lakes extend the traditional data warehouse
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning

Machine Learning and Big Data

Big Data driving Machine Learning
Better Decisions
Object Storage
Databases
Data warehouse
Streaming analytics
BI
Hadoop
Spark/Presto
Elasticsearch
Better Products
Machine Learning
Deep Learning/AI
More Users
More Data
Click stream
User activity
Generated content
Purchases
Clicks
Likes
Sensor data

Agility in Machine Learning
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake
on AWS
Analytics
Machine Learning
Real-time data movement
On-premises data movement

Agility in Machine Learning – for all users
Application Services
• Designed for application developers
• Solution-oriented prebuilt models available via APIs
• Image analysis, text-to-speech, conversational UX
Platforms
• Designed for data scientists to address common needs
• Fully managed platform for model building
• Reduces the heavy lifting in model building & deployment
Frameworks
• Designed for data scientists to address advanced / emerging needs
• Provides maximum flexibility to develop on the leading AI frameworks
• Enables expert AI systems to be developed & deployed

Digital Globe – using ML to find the right data
Data Lake:
• 100 PB of data in cloud
• Optimize storage tiers
Solution:
• Optimize their Data Lake storage, cut costs in half

FINRA − Data is central to our mission
Reconstruct the market from trillions of events
• Data from broker-dealers and exchanges
• Equities, options, fixed income
• Build a graph of market order events
Analyze the data looking for financial fraud
• Insider trading, layering, cross-product manipulation, front running, & many more
• Looking for a needle in a haystack

FINRA − From data puddles to Data Lake
Database1
Storage
Query/Compute
Catalog
Database2
Storage
Query/Compute
Catalog
Database n
Storage
Query/Compute
Catalog
Storage
Query/
Compute
Catalog
EMR Spark, Lambda, EMR Presto, EMR HBase
herd Hive
metastore
FINRA in Data Center FINRA in AWS
Scales Silo
Amazon
S3

UDSP – Inventory – not just R
• R 3.2.5, Python (2.7.12 and 3.4.3)
• Packages
• R: 300+ Python: 100+
• Tools for building packages
• gcc, gfortran, make, java, maven, ant…
• IDEs
• Jupyter, RStudio Server
• Deep Learning
• CUDA, CuDNN (if GPU present)
• Theano, Caffe, Torch
• TensorFlow

Real-time
App state or materialized view
Interactive and batch
Data Lake
Amazon S3
Amazon Redshift
Amazon EMR
Presto
Hive
Pig
Spark
Amazon ElastiCache
Amazon DynamoDB
Amazon RDS
Amazon ES
AWS Lambda
Spark Streaming on Amazon EMR
Applications
Amazon Kinesis
KCL
Amazon AI
Amazon DynamoDB
Amazon RDS
Change data capture or export
Transactions
Stream
Files
Amazon Kinesis Analytics
Amazon Athena
Amazon Kinesis Firehose
Amazon ES

Core Tenets
• Data lakes and data warehouses complement each other
• Loose coupling, but highly performant
• Storage, analytics, metadata management, etc.
• Future-proof your analytics
• Choosing the best tool for the job
• Elasticity and multiple clusters for dedicated purposes
• Replace capacity planning with a consumption model
• Don’t forget metadata management

Use the right storage tier and data format
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data characteristics → Hot, warm, cold
Cost → Right cost

Submit session feedback
1. Tap the Schedule icon.
2. Select the session you attended.
3. Tap Session Evaluation to submit your feedback.
