This document discusses implementing a data lake on AWS to securely store, categorize, and analyze all types of data in a centralized repository. It describes key attributes of a data lake, such as decoupled storage and compute, rapid ingestion and transformation, and schema on read. It then outlines AWS services that can be used to build a data lake, including S3, Athena, EMR, Redshift, Glue, and Kinesis, and provides examples of streaming IoT data into a data lake and running queries and analytics on it.
3. A data lake is an architecture with a virtually limitless, centralized storage platform capable of categorizing, processing, analyzing, and consuming heterogeneous data sets.
Key data lake attributes:
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
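Schema on read means raw records land in the lake exactly as produced, and a schema is applied only at query time rather than at write time. A minimal sketch of the idea in Python (the field names and record layout are hypothetical, chosen to match the IoT example later in the deck):

```python
import json

# Raw records land in the lake untyped, exactly as produced by the source.
raw_records = [
    '{"deviceid": 1, "temp": "21.5", "devicets": "2018-01-01T00:00:00"}',
    '{"deviceid": 2, "temp": "19.0", "devicets": "2018-01-01T00:01:00"}',
]

def read_with_schema(record):
    """Apply types at read time instead of at write time."""
    row = json.loads(record)
    return {
        "deviceid": int(row["deviceid"]),
        "temp": float(row["temp"]),  # stored as a string, typed on read
        "devicets": row["devicets"],
    }

rows = [read_with_schema(r) for r in raw_records]
```

Because the schema lives in the reader, new fields or format changes in the source never block ingestion; they only affect the queries that care about them.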
Defining the AWS Data Lake
7. Optimize costs with data tiering
Storage tiers, hot to cold: Amazon S3 Standard → Amazon S3 Infrequent Access → Amazon Glacier
ü Use EMR/Hadoop with local HDFS for the hottest data sets
ü Store cooler data in S3 and Glacier to reduce costs
ü Use S3 Analytics to optimize the tiering strategy
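To see why tiering matters, the cost math can be sketched in a few lines. The per-GB prices below are illustrative placeholders, not current AWS pricing; the point is the shape of the calculation, not the numbers:

```python
# Illustrative (not current) per-GB-month prices, hot to cold.
PRICE_PER_GB = {
    "s3_standard": 0.023,
    "s3_infrequent_access": 0.0125,
    "glacier": 0.004,
}

def monthly_storage_cost(gb_by_tier):
    """Sum storage cost across tiers for one month."""
    return sum(PRICE_PER_GB[tier] * gb for tier, gb in gb_by_tier.items())

# A 100 TB lake: everything hot vs. 10% hot / 30% cool / 60% cold.
all_hot = monthly_storage_cost({"s3_standard": 100_000})
tiered = monthly_storage_cost({
    "s3_standard": 10_000,
    "s3_infrequent_access": 30_000,
    "glacier": 60_000,
})
```

With these placeholder prices, tiering the same 100 TB cuts the monthly bill to roughly a third; S3 Analytics helps decide which prefixes actually belong in which tier.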
8. What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
• Amazon Redshift: data warehouse
• Amazon EMR (Hadoop/Hive/Presto): batch processing
• Amazon Athena: clusterless SQL query
• AWS Glue: clusterless ETL
• BI & visualization
9. What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
Streaming and real-time analytics:
• AWS Lambda
• Amazon Elasticsearch Service
• Apache Storm on EMR
• Apache Flink on EMR
• Amazon Kinesis Analytics
• Spark Streaming on EMR
• Amazon ElastiCache
10. What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
AI and machine learning:
• Amazon Polly: life-like speech
• Amazon Lex: conversational engine
• Amazon Rekognition: image/video analysis
• Deep learning frameworks: MXNet, TensorFlow, Theano, Caffe, Torch
• Amazon Transcribe: automatic speech recognition
• Amazon Translate: language translation
• Amazon Comprehend: natural language processing
12. There are lots of ingestion tools
Data sources (transactions, web logs/cookies, ERP, connected devices) are ingested into Amazon S3, with S3 Transfer Acceleration available for long-distance uploads, and then processed and consumed downstream.
14. Amazon Kinesis Data Streams
Build your own data streaming applications
• Easy administration: create a new stream, then set the desired capacity and partitioning to match your data throughput rate and volume.
• Build real-time applications: perform custom record processing on streaming data using the Kinesis Client Library, Apache Spark/Storm, AWS Lambda, and more.
• Low cost: cost-efficient for workloads of any scale.
Example clickstream pipeline:
1. Send clickstream data to Kinesis Data Streams.
2. Kinesis Data Streams stores and exposes the clickstream data for processing.
3. A custom application built on the Kinesis Client Library makes real-time content recommendations.
4. Readers see personalized content suggestions.
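The "partitioning" mentioned above determines which shard each record lands on: Kinesis hashes the record's partition key with MD5 and routes it to the shard whose hash-key range contains the result. A sketch of that routing rule for evenly split shards (a simplification; real streams can have unevenly split or merged shard ranges):

```python
import hashlib

NUM_SHARDS = 4
MAX_HASH = 2 ** 128  # Kinesis hash keys cover the range [0, 2^128)

def shard_for(partition_key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard index the way Kinesis does:
    MD5 the key, then find which (evenly split) hash range contains it."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h // (MAX_HASH // num_shards)

# The same key always lands on the same shard, preserving per-key ordering.
same = shard_for("user-42") == shard_for("user-42")
```

This is why choosing a high-cardinality partition key (e.g. a user or device ID) matters: it spreads load across shards while keeping each key's records in order.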
15. Amazon Kinesis Data Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service
• Zero admin: capture and deliver streaming data into S3, Redshift, and other destinations without writing an application or managing infrastructure.
• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into S3 and other destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: scales seamlessly to match data throughput without operator intervention.
Example pipeline:
1. Capture and submit streaming data to Firehose.
2. Firehose loads the streaming data continuously into S3 and Redshift.
3. Analyze the streaming data using your favorite BI tools.
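The "as little as 60 seconds" above comes from Firehose's buffering model: records accumulate until either a size threshold or a time interval is hit, then the batch is delivered as one object. A toy model of that flush rule (thresholds here are arbitrary small values for illustration; real Firehose buffers by megabytes and seconds):

```python
class FirehoseBufferSketch:
    """Toy model of Firehose-style buffering: flush when either the size
    threshold or the time interval is reached, whichever comes first."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes, self.max_seconds = max_bytes, max_seconds
        self.buffer, self.size, self.opened_at = [], 0, None
        self.deliveries = []  # each flush becomes one delivered object

    def put(self, record, now):
        if self.opened_at is None:
            self.opened_at = now  # buffer window opens with its first record
        self.buffer.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            self._flush()

    def _flush(self):
        self.deliveries.append(b"".join(self.buffer))
        self.buffer, self.size, self.opened_at = [], 0, None

fh = FirehoseBufferSketch(max_bytes=10, max_seconds=60)
fh.put(b"aaaa", now=0)  # 4 bytes buffered
fh.put(b"bbbb", now=1)  # 8 bytes buffered
fh.put(b"cccc", now=2)  # 12 bytes >= 10, so the batch flushes
```

Larger buffers mean fewer, bigger S3 objects (cheaper, better for Athena/EMR scans); smaller buffers mean lower delivery latency. The interval threshold guarantees slow streams still deliver regularly.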
16. Variety of data processing tools
Data sources (transactions, web logs/cookies, ERP, connected devices) land in Amazon S3 (via S3 Transfer Acceleration) and are processed with:
• Amazon AI: ML/DL services
• Amazon Athena: interactive query
• Amazon EMR: managed Hadoop & Spark
• Amazon Redshift + Spectrum: petabyte-scale data warehousing
• Amazon Elasticsearch Service: real-time log analytics & search
17. And multiple ways to consume the data
The same S3-based lake, fed from transactions, web logs/cookies, ERP, and connected devices and processed by Amazon AI, Athena, EMR, Redshift + Spectrum, and Elasticsearch, is consumed through:
• Amazon QuickSight: fast, easy-to-use cloud BI
• Analytic notebooks: Jupyter, Zeppelin, HUE
• Amazon API Gateway: programmatic access
23. AWS Glue: Data Catalog Crawlers
Crawlers automatically build your Data Catalog and keep it in sync:
• Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless, so you only pay while a crawler runs
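"Hive-style partitions" are `name=value` path segments in the S3 key, which crawlers promote to partition columns so queries can prune by them. A sketch of the detection idea (not Glue's actual implementation):

```python
import re

# A Hive-style partition is a path segment of the form name=value.
PARTITION_RE = re.compile(r"([^/=]+)=([^/]+)")

def hive_partitions(s3_key):
    """Extract Hive-style partition columns from an S3 object key,
    the way a crawler detects a partitioned layout."""
    return dict(PARTITION_RE.findall(s3_key))

parts = hive_partitions("lake/sensors/year=2018/month=01/day=15/events.json")
```

With partitions registered in the catalog, a query filtering on `year` and `month` scans only the matching S3 prefixes instead of the whole table.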
24. A central metadata store for your lake
The same architecture: data sources flow into Amazon S3 (via S3 Transfer Acceleration), are processed by Amazon AI, Athena, EMR, Redshift + Spectrum, and Elasticsearch, and are consumed through QuickSight, analytic notebooks, and API Gateway. The AWS Glue Data Catalog, a Hive-compatible metastore, sits in the middle as the central metadata store for all of these services.
25. Write once, catalog once, read many times, ETL anywhere
Every processing and consumption service in the architecture (Athena, EMR, Redshift + Spectrum, Elasticsearch, QuickSight, notebooks, API Gateway) reads the same data in Amazon S3 through the shared AWS Glue Data Catalog: write the data once, catalog it once, and read or ETL it from any engine.
30. Let’s take an example
Sensor/IoT devices emit record-level data.
Business questions:
1. What is going on with a specific sensor?
2. Daily aggregations (device, inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies
Operational requirements:
1. Scale
2. High availability
3. Less management overhead
4. Pay only for what I need
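The "inefficiency" in the questions above compares a sensor's actual temperature to its set point. The deck does not define the metric, so the per-record formulation below is an assumption, written to match the `temp`, `settemp`, and `pct_inefficiency` columns that appear in the later Athena queries:

```python
def pct_inefficiency(temp, settemp):
    """Percent deviation of actual temperature from the set point
    (assumed formula; the deck does not define the metric)."""
    return abs(temp - settemp) / settemp * 100

# A device reading 23.0 against a 20.0 set point is 15% inefficient.
example = round(pct_inefficiency(temp=23.0, settemp=20.0), 1)
```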
31. Let’s push this data into Kinesis
Sensor data flows through Kinesis Firehose into Amazon S3, where Amazon Athena can query it.
33. Querying it in Amazon Athena
Either create a crawler to auto-generate the schema, or write a DDL on the Athena console (or via the API or a JDBC/ODBC driver). Then start querying the data.
38. Example Athena queries

“raw” table (raw event data):

Top 20 most active devices:
SELECT deviceid, COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC
LIMIT 20;

Events by device ID:
SELECT uuid, devicets, deviceid, temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;

“daily_avg_inefficiency” table (daily aggregation):

KPI, overall device daily inefficiency:
SELECT (SUM(daily_avg_inefficiency) / COUNT(*)) AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;

“results” table (event-level granularity):

Top 10 most inefficient devices:
SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets",
       col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC
LIMIT 10;
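The KPI query averages each device's `daily_avg_inefficiency` across devices, per date. The same aggregation can be sanity-checked locally in Python on toy rows (the sample values are invented for illustration):

```python
from collections import defaultdict

# Toy rows: (date, deviceid, daily_avg_inefficiency), invented for illustration.
rows = [
    ("2018-01-01", 1, 10.0),
    ("2018-01-01", 2, 20.0),
    ("2018-01-02", 1, 30.0),
]

def all_device_avg_inefficiency(rows):
    """Mirrors: SELECT SUM(daily_avg_inefficiency)/COUNT(*) ... GROUP BY date."""
    totals = defaultdict(lambda: [0.0, 0])  # date -> [sum, count]
    for date, _deviceid, ineff in rows:
        totals[date][0] += ineff
        totals[date][1] += 1
    return {date: s / n for date, (s, n) in totals.items()}

kpi = all_device_avg_inefficiency(rows)
```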
40. Characteristics
ü Scales to hundreds of thousands of data sources
ü Virtually infinite storage scalability
ü Real-time and batch processing layers
ü Interactive queries
ü Highly available and durable
ü Pay only for what you use
ü No servers to manage
41. Very easy to try with an existing template:
https://aws.amazon.com/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-lambda-architecture-without-servers/