This document discusses implementing a data lake on AWS to securely store, categorize, and analyze all types of data in a centralized repository. It describes key attributes of a data lake, such as decoupled storage and compute, rapid ingestion and transformation, and schema on read. It then outlines AWS services that can be used to build a data lake, including S3, Athena, EMR, Redshift, Glue, and Kinesis, and provides examples of streaming IoT data into a data lake and running queries and analytics on it.
3. A data lake is an architecture with a virtually limitless, centralized storage platform capable of categorizing, processing, analyzing, and consuming heterogeneous data sets.
Key data lake attributes:
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
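Schema on read means raw records land in the lake exactly as produced, and a schema is applied only at query time rather than at write time. A minimal sketch of the idea in Python (the field names and record layout are hypothetical, chosen to match the IoT example later in the deck):

```python
import json

# Raw records land in the lake untyped, exactly as produced by the source.
raw_records = [
    '{"deviceid": 1, "temp": "21.5", "devicets": "2018-01-01T00:00:00"}',
    '{"deviceid": 2, "temp": "19.0", "devicets": "2018-01-01T00:01:00"}',
]

def read_with_schema(record):
    """Apply types at read time instead of at write time."""
    row = json.loads(record)
    return {
        "deviceid": int(row["deviceid"]),
        "temp": float(row["temp"]),  # stored as a string, typed on read
        "devicets": row["devicets"],
    }

rows = [read_with_schema(r) for r in raw_records]
```

Because the schema lives in the reader, new fields or format changes in the source never block ingestion; they only affect the queries that care about them.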
Defining the AWS Data Lake
7. Optimize costs with data tiering
Storage tiers, hot to cold: Amazon S3 Standard → Amazon S3 Infrequent Access → Amazon Glacier
ü Use EMR/Hadoop with local HDFS for the hottest data sets
ü Store cooler data in S3 and Glacier to reduce costs
ü Use S3 Analytics to optimize the tiering strategy
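To see why tiering matters, the cost math can be sketched in a few lines. The per-GB prices below are illustrative placeholders, not current AWS pricing; the point is the shape of the calculation, not the numbers:

```python
# Illustrative (not current) per-GB-month prices, hot to cold.
PRICE_PER_GB = {
    "s3_standard": 0.023,
    "s3_infrequent_access": 0.0125,
    "glacier": 0.004,
}

def monthly_storage_cost(gb_by_tier):
    """Sum storage cost across tiers for one month."""
    return sum(PRICE_PER_GB[tier] * gb for tier, gb in gb_by_tier.items())

# A 100 TB lake: everything hot vs. 10% hot / 30% cool / 60% cold.
all_hot = monthly_storage_cost({"s3_standard": 100_000})
tiered = monthly_storage_cost({
    "s3_standard": 10_000,
    "s3_infrequent_access": 30_000,
    "glacier": 60_000,
})
```

With these placeholder prices, tiering the same 100 TB cuts the monthly bill to roughly a third; S3 Analytics helps decide which prefixes actually belong in which tier.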
8. What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
• Amazon Redshift: data warehouse
• Amazon EMR (Hadoop/Hive/Presto): batch processing
• Amazon Athena: clusterless SQL query
• AWS Glue: clusterless ETL
• BI & visualization
9. What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
Streaming and real-time analytics:
• AWS Lambda
• Amazon Elasticsearch Service
• Apache Storm on EMR
• Apache Flink on EMR
• Amazon Kinesis Analytics
• Spark Streaming on EMR
• Amazon ElastiCache
10. What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
AI and machine learning:
• Amazon Polly: life-like speech
• Amazon Lex: conversational engine
• Amazon Rekognition: image/video analysis
• Deep learning frameworks: MXNet, TensorFlow, Theano, Caffe, Torch
• Amazon Transcribe: automatic speech recognition
• Amazon Translate: language translation
• Amazon Comprehend: natural language processing
12. There are lots of ingestion tools
Data sources (transactions, web logs/cookies, ERP, connected devices) are ingested into Amazon S3, with S3 Transfer Acceleration available for long-distance uploads, and then processed and consumed downstream.
14. Amazon Kinesis Data Streams
Build your own data streaming applications
• Easy administration: create a new stream, then set the desired capacity and partitioning to match your data throughput rate and volume.
• Build real-time applications: perform custom record processing on streaming data using the Kinesis Client Library, Apache Spark/Storm, AWS Lambda, and more.
• Low cost: cost-efficient for workloads of any scale.
Example clickstream pipeline:
1. Send clickstream data to Kinesis Data Streams.
2. Kinesis Data Streams stores and exposes the clickstream data for processing.
3. A custom application built on the Kinesis Client Library makes real-time content recommendations.
4. Readers see personalized content suggestions.
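The "partitioning" mentioned above determines which shard each record lands on: Kinesis hashes the record's partition key with MD5 and routes it to the shard whose hash-key range contains the result. A sketch of that routing rule for evenly split shards (a simplification; real streams can have unevenly split or merged shard ranges):

```python
import hashlib

NUM_SHARDS = 4
MAX_HASH = 2 ** 128  # Kinesis hash keys cover the range [0, 2^128)

def shard_for(partition_key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard index the way Kinesis does:
    MD5 the key, then find which (evenly split) hash range contains it."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h // (MAX_HASH // num_shards)

# The same key always lands on the same shard, preserving per-key ordering.
same = shard_for("user-42") == shard_for("user-42")
```

This is why choosing a high-cardinality partition key (e.g. a user or device ID) matters: it spreads load across shards while keeping each key's records in order.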
15. Amazon Kinesis Data Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service
• Zero admin: capture and deliver streaming data into S3, Redshift, and other destinations without writing an application or managing infrastructure.
• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into S3 and other destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: scales seamlessly to match data throughput without operator intervention.
Example pipeline:
1. Capture and submit streaming data to Firehose.
2. Firehose loads the streaming data continuously into S3 and Redshift.
3. Analyze the streaming data using your favorite BI tools.
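The "as little as 60 seconds" above comes from Firehose's buffering model: records accumulate until either a size threshold or a time interval is hit, then the batch is delivered as one object. A toy model of that flush rule (thresholds here are arbitrary small values for illustration; real Firehose buffers by megabytes and seconds):

```python
class FirehoseBufferSketch:
    """Toy model of Firehose-style buffering: flush when either the size
    threshold or the time interval is reached, whichever comes first."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes, self.max_seconds = max_bytes, max_seconds
        self.buffer, self.size, self.opened_at = [], 0, None
        self.deliveries = []  # each flush becomes one delivered object

    def put(self, record, now):
        if self.opened_at is None:
            self.opened_at = now  # buffer window opens with its first record
        self.buffer.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            self._flush()

    def _flush(self):
        self.deliveries.append(b"".join(self.buffer))
        self.buffer, self.size, self.opened_at = [], 0, None

fh = FirehoseBufferSketch(max_bytes=10, max_seconds=60)
fh.put(b"aaaa", now=0)  # 4 bytes buffered
fh.put(b"bbbb", now=1)  # 8 bytes buffered
fh.put(b"cccc", now=2)  # 12 bytes >= 10, so the batch flushes
```

Larger buffers mean fewer, bigger S3 objects (cheaper, better for Athena/EMR scans); smaller buffers mean lower delivery latency. The interval threshold guarantees slow streams still deliver regularly.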
16. Variety of data processing tools
Data sources (transactions, web logs/cookies, ERP, connected devices) land in Amazon S3 (via S3 Transfer Acceleration) and are processed with:
• Amazon AI: ML/DL services
• Amazon Athena: interactive query
• Amazon EMR: managed Hadoop & Spark
• Amazon Redshift + Spectrum: petabyte-scale data warehousing
• Amazon Elasticsearch Service: real-time log analytics & search
17. And multiple ways to consume the data
The same S3-based lake, fed from transactions, web logs/cookies, ERP, and connected devices and processed by Amazon AI, Athena, EMR, Redshift + Spectrum, and Elasticsearch, is consumed through:
• Amazon QuickSight: fast, easy-to-use cloud BI
• Analytic notebooks: Jupyter, Zeppelin, HUE
• Amazon API Gateway: programmatic access
23. AWS Glue: Data Catalog Crawlers
Crawlers automatically build your Data Catalog and keep it in sync:
• Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless, so you only pay while a crawler runs
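"Hive-style partitions" are `name=value` path segments in the S3 key, which crawlers promote to partition columns so queries can prune by them. A sketch of the detection idea (not Glue's actual implementation):

```python
import re

# A Hive-style partition is a path segment of the form name=value.
PARTITION_RE = re.compile(r"([^/=]+)=([^/]+)")

def hive_partitions(s3_key):
    """Extract Hive-style partition columns from an S3 object key,
    the way a crawler detects a partitioned layout."""
    return dict(PARTITION_RE.findall(s3_key))

parts = hive_partitions("lake/sensors/year=2018/month=01/day=15/events.json")
```

With partitions registered in the catalog, a query filtering on `year` and `month` scans only the matching S3 prefixes instead of the whole table.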
24. A central metadata store for your lake
The same architecture: data sources flow into Amazon S3 (via S3 Transfer Acceleration), are processed by Amazon AI, Athena, EMR, Redshift + Spectrum, and Elasticsearch, and are consumed through QuickSight, analytic notebooks, and API Gateway. The AWS Glue Data Catalog, a Hive-compatible metastore, sits in the middle as the central metadata store for all of these services.
25. Write once, catalog once, read many times, ETL anywhere
Every processing and consumption service in the architecture (Athena, EMR, Redshift + Spectrum, Elasticsearch, QuickSight, notebooks, API Gateway) reads the same data in Amazon S3 through the shared AWS Glue Data Catalog: write the data once, catalog it once, and read or ETL it from any engine.
30. Let’s take an example
Sensor/IoT devices emit record-level data.
Business questions:
1. What is going on with a specific sensor?
2. Daily aggregations (device, inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies
Operational requirements:
1. Scale
2. High availability
3. Less management overhead
4. Pay only for what I need
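The "inefficiency" in the questions above compares a sensor's actual temperature to its set point. The deck does not define the metric, so the per-record formulation below is an assumption, written to match the `temp`, `settemp`, and `pct_inefficiency` columns that appear in the later Athena queries:

```python
def pct_inefficiency(temp, settemp):
    """Percent deviation of actual temperature from the set point
    (assumed formula; the deck does not define the metric)."""
    return abs(temp - settemp) / settemp * 100

# A device reading 23.0 against a 20.0 set point is 15% inefficient.
example = round(pct_inefficiency(temp=23.0, settemp=20.0), 1)
```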
31. Let’s push this data into Kinesis
Sensor data flows through Kinesis Firehose into Amazon S3, where Amazon Athena can query it.
33. Querying it in Amazon Athena
Either create a crawler to auto-generate the schema, or write a DDL on the Athena console (or via the API or a JDBC/ODBC driver). Then start querying the data.
38. Example Athena queries

“raw” table (raw event data):

Top 20 most active devices:
SELECT deviceid, COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC
LIMIT 20;

Events by device ID:
SELECT uuid, devicets, deviceid, temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;

“daily_avg_inefficiency” table (daily aggregation):

KPI, overall device daily inefficiency:
SELECT (SUM(daily_avg_inefficiency) / COUNT(*)) AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;

“results” table (event-level granularity):

Top 10 most inefficient devices:
SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets",
       col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC
LIMIT 10;
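The KPI query averages each device's `daily_avg_inefficiency` across devices, per date. The same aggregation can be sanity-checked locally in Python on toy rows (the sample values are invented for illustration):

```python
from collections import defaultdict

# Toy rows: (date, deviceid, daily_avg_inefficiency), invented for illustration.
rows = [
    ("2018-01-01", 1, 10.0),
    ("2018-01-01", 2, 20.0),
    ("2018-01-02", 1, 30.0),
]

def all_device_avg_inefficiency(rows):
    """Mirrors: SELECT SUM(daily_avg_inefficiency)/COUNT(*) ... GROUP BY date."""
    totals = defaultdict(lambda: [0.0, 0])  # date -> [sum, count]
    for date, _deviceid, ineff in rows:
        totals[date][0] += ineff
        totals[date][1] += 1
    return {date: s / n for date, (s, n) in totals.items()}

kpi = all_device_avg_inefficiency(rows)
```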
40. Characteristics
ü Scales to hundreds of thousands of data sources
ü Virtually infinite storage scalability
ü Real-time and batch processing layers
ü Interactive queries
ü Highly available and durable
ü Pay only for what you use
ü No servers to manage
41. Very easy to try with an existing template:
https://aws.amazon.com/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-lambda-architecture-without-servers/