© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Implementing a Data Lake - Securely store, categorize, and analyze all your data in one, centralized repository
ROHAN RAIZADA
Solutions Architect, Amazon Web Services
Data Drives Better Decision Making
Data lake leaders who were highly efficient in capturing a diversity of data and making it accessible to their organization in a timely fashion outperformed their peers by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
Defining the AWS Data Lake
A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets.
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
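To make the "schema on read" attribute concrete, here is a minimal sketch in plain Python. It assumes raw events are stored as JSON lines and that the reader, not the writer, decides which columns and types matter. The column names and sample records are hypothetical and are not taken from the deck.

```python
import json

# Raw events are stored exactly as they arrive; nothing is enforced at write time.
RAW_EVENTS = [
    '{"deviceid": "1", "temp": "71.5", "firmware": "a1"}',
    '{"deviceid": "2", "temp": "68"}',
]

# The schema is supplied by the reader at query time (hypothetical columns).
SCHEMA = {"deviceid": int, "temp": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines, keep only the schema's columns, and cast them."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append(
            {col: cast(record[col]) for col, cast in schema.items() if col in record}
        )
    return rows


rows = read_with_schema(RAW_EVENTS, SCHEMA)
```

A different consumer could read the same raw lines with a different schema, which is the point of deferring the schema to analysis time.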
Traditionally, Analytics Looked Like This
Diagram: OLTP, ERP, CRM, LOB → Data warehouse → Business intelligence
Relational data
TBs-PBs scale
Schema defined before data load
Operational reporting and on demand
Large initial capex + $10K–$50K / TB / Year
Data Lakes Extend the Traditional Approach
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
Diagram labels: OLTP, ERP, CRM, LOB; data warehouse; business intelligence; data lake; devices, web, sensors, social; catalog; machine learning; DW queries; big data processing; interactive; real-time
Data Lakes on AWS
Unmatched durability and availability at exabyte scale
Comprehensive security, compliance, and audit capabilities
Object-level controls
Usage and cost analysis insight into your data
Most ways to bring data in
Twice as many partner integrations
Data lake: Amazon S3, Amazon Glacier, AWS Glue
Surrounding services: Machine Learning, Analytics, Internet of Things
Ways to bring data in: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams
Optimize costs with data tiering
Hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Infrequent Access, Amazon Glacier
• Use EMR/Hadoop with local HDFS for hottest data sets
• Store cooler data in S3 and Glacier to reduce costs
• Use S3 Analytics to optimize tiering strategy
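One way to express the hot-to-cold tiering above is an S3 lifecycle rule. The sketch below builds the rule as a plain dictionary and passes it to boto3's `put_bucket_lifecycle_configuration`. The prefix, rule ID format, and the 30/90 day thresholds are assumptions chosen for illustration, not values from the deck.

```python
def tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Build one S3 lifecycle rule that moves objects to colder storage over time."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the infrequent-access one")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",  # hypothetical naming scheme
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


def apply_tiering(bucket, rules):
    """Apply the rules to a bucket. Needs boto3 and AWS credentials to run."""
    import boto3  # imported lazily so the rule builder works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules}
    )


rule = tiering_rule("raw-time-series/")
```

The thresholds are a starting guess; S3 Analytics output is what the slide suggests using to tune them.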
What can you do with a data lake?
Storage: Amazon S3, Amazon Glacier
• Amazon Redshift: data warehouse
• Amazon EMR: Hadoop/Hive/Presto batch processing
• Amazon Athena: clusterless SQL query
• AWS Glue: clusterless ETL
• BI & visualization
What can you do with a data lake?
Storage: Amazon S3, Amazon Glacier
Streaming and real-time analytics
• AWS Lambda
• Amazon Elasticsearch Service
• Apache Storm on EMR
• Apache Flink on EMR
• Amazon Kinesis Analytics
• Spark Streaming on EMR
• Amazon ElastiCache
What can you do with a data lake?
Storage: Amazon S3, Amazon Glacier
AI and machine learning
• Amazon Polly: life-like speech
• Amazon Lex: conversational engine
• Amazon Rekognition: image / video analysis
• Deep learning frameworks: MXNet, TensorFlow, Theano, Caffe, Torch
• Amazon Transcribe: automatic speech recognition
• Amazon Translate: language translation
• Amazon Comprehend: natural language processing
Simplified architectural view
Diagram: data sources (transactions, web logs / cookies, ERP, connected devices) → ingestion mechanism → Amazon S3 → process → consume
There are lots of ingestion tools
Diagram: data sources (transactions, web logs / cookies, ERP, connected devices) → ingestion tools (S3 Transfer Acceleration is the one labeled in text) → Amazon S3 → process → consume
Streaming with Amazon Kinesis
Easily collect, process, and analyze data and video streams in real time
• Kinesis Video Streams: capture, process, and store video streams
• Kinesis Data Streams: capture, process, and store data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL
Amazon Kinesis Data Streams
Build your own data streaming applications
• Easy administration: Create a new stream, set desired capacity and partitioning to match your data throughput rate and volume.
• Build real-time applications: Perform custom record processing on streaming data using Kinesis Client Library, Apache Spark/Storm, AWS Lambda, and more.
• Low cost: Cost-efficient for workloads of any scale.
Example: real-time content recommendations
1. Send clickstream data to Kinesis Streams
2. Kinesis Streams stores and exposes clickstream data for processing
3. Custom application built on Kinesis Client Library makes real-time content recommendations
4. Readers see personalized content suggestions
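The first step of the clickstream flow, sending an event to a stream, can be sketched with boto3's Kinesis `put_record` call. The stream name, event fields, and the choice of user ID as partition key are assumptions for illustration.

```python
import json


def build_click_record(stream_name, user_id, page):
    """Build the keyword arguments for one Kinesis Data Streams PutRecord call."""
    event = {"user_id": user_id, "page": page}  # hypothetical event shape
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        # Records sharing a partition key land on the same shard,
        # which keeps one user's clicks in order.
        "PartitionKey": str(user_id),
    }


def send_click(record):
    """Send the record. Needs boto3, AWS credentials, and an existing stream."""
    import boto3  # imported lazily so the builder can be used without boto3

    return boto3.client("kinesis").put_record(**record)


record = build_click_record("clickstream-demo", 42, "/articles/data-lakes")
```

A consumer built on the Kinesis Client Library would then read these records per shard; that part is outside this sketch.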
Amazon Kinesis Data Firehose
Load massive volumes of streaming data into Amazon S3 and Redshift
• Zero admin: Capture and deliver streaming data into S3, Redshift, and other destinations without writing an application or managing infrastructure.
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into S3 and other destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: Seamlessly scales to match data throughput without operator intervention.
1. Capture and submit streaming data to Firehose
2. Firehose loads streaming data continuously into S3 and Redshift (the diagram also labels Elasticsearch)
3. Analyze streaming data using your favorite BI tools
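The "batch, compress, and encrypt" settings map onto the S3 destination configuration of a Firehose delivery stream. The following is a sketch of that configuration as passed to boto3's `create_delivery_stream`. All ARNs and names are placeholders, and the 60 second floor simply mirrors the figure quoted on the slide.

```python
def s3_destination_config(bucket_arn, role_arn, kms_key_arn,
                          interval_seconds=60, size_mb=5):
    """Build an ExtendedS3DestinationConfiguration for a Firehose delivery stream."""
    if interval_seconds < 60:
        # The slide quotes 60 seconds as the shortest delivery interval.
        raise ValueError("interval_seconds must be at least 60")
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Batch: flush when either the size or the interval is reached.
        "BufferingHints": {"SizeInMBs": size_mb, "IntervalInSeconds": interval_seconds},
        # Compress objects before they are written to S3.
        "CompressionFormat": "GZIP",
        # Encrypt with a KMS key.
        "EncryptionConfiguration": {
            "KMSEncryptionConfig": {"AWSKMSKeyARN": kms_key_arn}
        },
    }


def create_stream(name, config):
    """Create the delivery stream. Needs boto3 and AWS credentials to run."""
    import boto3  # imported lazily so the config builder works without boto3

    boto3.client("firehose").create_delivery_stream(
        DeliveryStreamName=name,
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration=config,
    )


config = s3_destination_config(
    "arn:aws:s3:::example-bucket",                      # placeholder
    "arn:aws:iam::123456789012:role/example-firehose",  # placeholder
    "arn:aws:kms:us-east-1:123456789012:key/example",   # placeholder
)
```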
Variety of data processing tools
Diagram: data sources (transactions, web logs / cookies, ERP, connected devices) → S3 Transfer Acceleration → Amazon S3 → processing tools → consume
• Amazon AI: ML/DL services
• Amazon Athena: interactive query
• Amazon EMR: managed Hadoop & Spark
• Amazon Redshift + Spectrum: petabyte-scale data warehousing
• Amazon Elasticsearch: real-time log analytics & search
And multiple ways to consume the data
Same architecture diagram, with consumption tools added:
• Amazon QuickSight: fast, easy to use, cloud BI
• Analytic notebooks: Jupyter, Zeppelin, HUE
• Amazon API Gateway: programmatic access
Data preparation accounts for ~80% of the work.
Building training sets
Cleaning and organizing data (60%)
Collecting data sets (19%)
Mining data for patterns
Refining algorithms
Other
AWS Glue: ETL Service
Make ETL scripting and deployment easy
Automatically generates ETL code
Code is customizable with Python and Spark
Endpoints provided to edit, debug, & test code
Jobs are scheduled or event-based
Serverless
ETL when you need it
(Architecture diagram repeated: data sources, S3 Transfer Acceleration, Amazon S3, processing services, and consumption tools, as on the earlier architecture slides.)
Storing is not enough. Data needs to be discoverable.
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” Gartner
Data types shown: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured
AWS Glue: Data Catalog
Make data discoverable
Automatically discovers data and stores schema
Catalog makes data searchable and available for ETL
Catalog contains table and job definitions
Computes statistics to make queries efficient
Compliance
AWS Glue
Data Catalog
Discover data and
extract schema
AWS Glue Data Catalog: Crawlers
Helping catalog your data
Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless, so you only pay when the crawler runs
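"Hive-style partitions" are `name=value` folders in the object key. The helper below is an illustrative sketch of how such keys can be read. It is not the Glue crawler's actual algorithm, and the example key is hypothetical.

```python
def parse_hive_partitions(key):
    """Extract name=value partition segments from an object key, in path order."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


parts = parse_hive_partitions(
    "raw-time-series/year=2018/month=05/day=17/part-0000.json"
)
```

Query engines can use partition values like these to skip whole folders, which reduces the amount of data a query scans.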
A central metadata store for your lake
(Architecture diagram repeated: data sources, S3 Transfer Acceleration, Amazon S3, processing services, and consumption tools, with the following component added.)
AWS Glue Data Catalog
Hive-compatible Metastore
Write once, catalog once, read multiple, ETL Anywhere
(Architecture diagram repeated: data sources, S3 Transfer Acceleration, Amazon S3, processing services, consumption tools, and the AWS Glue Data Catalog as a Hive-compatible metastore.)
Amazon EMR: Big Data Processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
• Latest versions: Updated with the latest open source frameworks within 30 days of release.
• Low cost: Flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto scaling to reduce costs 50–80%.
• Use S3 storage: Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector.
• Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning.
Amazon Athena: Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
• Query instantly: Zero setup cost; just point to Amazon S3 and start querying.
• Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression.
• Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types.
• Easy: Serverless, with zero infrastructure and zero administration. Integrated with Amazon QuickSight.
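The pay-per-query claim is easy to check with arithmetic, because the charge scales with bytes scanned. The sketch below assumes a price of $5 per TB scanned, a binary terabyte, and a 4:1 compression ratio. All three are illustrative assumptions, not figures from the deck.

```python
TB = 1024 ** 4  # assumption: binary terabyte


def query_cost(bytes_scanned, price_per_tb=5.0):
    """Cost in dollars of one query, proportional to the data it scans."""
    return bytes_scanned / TB * price_per_tb


def savings_fraction(raw_bytes, compressed_bytes):
    """Fraction of the per-query cost saved by scanning compressed data."""
    return 1 - compressed_bytes / raw_bytes


raw = 1 * TB
compressed = raw // 4  # assumed 4:1 compression ratio

cost_raw = query_cost(raw)
cost_compressed = query_cost(compressed)
saved = savings_fraction(raw, compressed)
```

With these assumptions the saving is 75%, which sits inside the 30–90% range quoted on the slide.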
Amazon Redshift: Modern Data Warehouse
Fast, scalable, fully managed data warehouse at 1/10th the cost
Query data across your Amazon Redshift data warehouse and your Amazon S3 data lake with the Redshift Spectrum feature
Massively parallel, scales from gigabytes to exabytes
• Fast: Delivers fast results for short queries, complex queries, and mixed workloads.
• Cost effective: Start at $0.25 per hour; scale out for as low as $250–$333 per uncompressed terabyte per year.
• Data lake integration: Query open file formats in Amazon S3 and optimized data formats on direct-attached disks.
• Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance.
Data Lake Integration – Amazon Redshift Spectrum
Query across your Amazon Redshift data warehouse and your Amazon S3 data lake
Run Amazon Redshift SQL queries against Amazon S3
Scale compute and storage separately
Fast query performance
Unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
On demand, pay-per-query based on data scanned
Diagram labels: Amazon Redshift data; Redshift Spectrum query engine; Amazon S3 data lake
Let’s take an example
Sensor/IoT device → record-level data
Business Questions
1. What is going on with a specific sensor
2. Daily aggregations (device, inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies
Operations
1. Scale
2. High availability
3. Less management overhead
4. Pay for what I need
Let’s push this data into a Kinesis
Diagram: Kinesis Firehose → Amazon S3
Querying it in Amazon Athena
Diagram: Amazon S3 → Amazon Athena
Either create a crawler to auto-generate the schema, or write a DDL statement on the Athena console/API/JDBC/ODBC driver
Start querying data
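For the "write a DDL" option, a table definition can be built as a string and submitted through the Athena API. This is a sketch only: the column types, the JSON SerDe, the table name `raw_events`, and the S3 locations are assumptions, since the deck does not show the DDL. Only the column names come from the queries later in the deck.

```python
def raw_table_ddl(database, table, s3_location):
    """Build a CREATE EXTERNAL TABLE statement over JSON files in S3."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        "  uuid string,\n"
        "  deviceid int,\n"
        "  devicets string,\n"
        "  temp int,\n"
        "  settemp int,\n"
        "  pct_inefficiency double\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )


def run_in_athena(sql, database, output_location):
    """Submit the statement. Needs boto3 and AWS credentials to run."""
    import boto3  # imported lazily so the DDL builder works without boto3

    return boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )


ddl = raw_table_ddl(
    "awsblogsgluedemo", "raw_events", "s3://example-bucket/raw-time-series/"
)
```

The crawler option produces an equivalent table definition in the Glue Data Catalog without hand-written DDL.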
Query daily aggregates in Amazon Athena
Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; AWS Glue job; Amazon S3 “daily-average”; Amazon Athena
AWS Glue job
• Serverless, event-driven execution
• Data is written out to S3
• Output table is automatically created in Amazon Athena
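The aggregation that the Glue job performs can be illustrated in plain Python. A real Glue job would typically be a PySpark script; the version below is only a sketch of the logic, and it assumes ISO-formatted timestamps and a `pct_inefficiency` field on each record.

```python
from collections import defaultdict


def daily_average_inefficiency(records):
    """Average pct_inefficiency per (deviceid, date)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for r in records:
        key = (r["deviceid"], r["devicets"][:10])  # 'YYYY-MM-DD' prefix
        totals[key][0] += r["pct_inefficiency"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}


# Hypothetical sample records.
records = [
    {"deviceid": 1, "devicets": "2018-05-17T08:00:00", "pct_inefficiency": 10.0},
    {"deviceid": 1, "devicets": "2018-05-17T20:00:00", "pct_inefficiency": 20.0},
    {"deviceid": 2, "devicets": "2018-05-17T09:30:00", "pct_inefficiency": 4.0},
]
daily = daily_average_inefficiency(records)
```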
Kinesis Analytics for in-stream analytics
Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon S3 “daily-average”; Amazon Athena; Kinesis Analytics; Kinesis Firehose; Amazon S3 “results”
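The real-time question from the example (how many sensors are showing inefficiencies) is a windowed count over the stream. Kinesis Analytics expresses this in SQL; the sketch below shows the same tumbling-window idea in plain Python. The 60 second window, the 10% threshold, and the event tuples are assumptions for illustration.

```python
def inefficient_sensors_per_window(events, window_seconds=60, threshold=10.0):
    """Count distinct devices above the threshold in each tumbling window.

    events: iterable of (epoch_seconds, deviceid, pct_inefficiency) tuples.
    """
    windows = {}
    for ts, deviceid, pct in events:
        start = ts - ts % window_seconds  # start of the window this event falls in
        devices = windows.setdefault(start, set())
        if pct > threshold:
            devices.add(deviceid)
    return {start: len(devices) for start, devices in sorted(windows.items())}


events = [
    (0, 1, 12.0), (30, 2, 3.0), (45, 1, 15.0),  # window starting at 0
    (61, 2, 11.0), (119, 3, 25.0),              # window starting at 60
    (120, 1, 1.0),                              # window starting at 120
]
counts = inefficient_sensors_per_window(events)
```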
“daily-agg” table with daily aggregation
KPI - Overall device daily inefficiency
SELECT ( SUM(daily_avg_inefficiency)/COUNT(*) )
AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;
“result” table
Top 10 most inefficient devices - event-level granularity
SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS
"devicets", col3 AS "temp", col4 AS "settemp", col5 AS
"pct_inefficiency" FROM awsblogsgluedemo.results ORDER BY
pct_inefficiency DESC LIMIT 10;
“raw” table with raw data
Top 20 most active devices
SELECT
deviceid, COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC
LIMIT 20;
Events by Device ID
SELECT uuid, devicets, deviceid,
temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;
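The two queries against the “raw” table can be tried locally before running them in Athena. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Athena, with a few hypothetical rows. SQLite's SQL dialect differs from Athena's, so this only illustrates the query shapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "raw" (uuid TEXT, deviceid INTEGER, devicets TEXT, temp INTEGER)'
)
conn.executemany(
    'INSERT INTO "raw" VALUES (?, ?, ?, ?)',
    [
        ("a", 1, "2018-05-17 08:00:00", 70),
        ("b", 1, "2018-05-17 08:05:00", 72),
        ("c", 2, "2018-05-17 08:01:00", 65),
    ],
)

# Top 20 most active devices (LIMIT caps the result at 20 rows).
top_devices = conn.execute(
    'SELECT deviceid, COUNT(*) AS num_events FROM "raw" '
    "GROUP BY deviceid ORDER BY num_events DESC LIMIT 20"
).fetchall()

# Events for one device, newest first.
device_events = conn.execute(
    'SELECT uuid, devicets, deviceid, temp FROM "raw" '
    "WHERE deviceid = 1 ORDER BY devicets DESC"
).fetchall()
```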
Overall architecture
Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon S3 “daily-average”; Amazon Athena; Kinesis Analytics; Kinesis Firehose; Amazon S3 “results”
Characteristics
• Scale to hundreds of thousands of data sources
• Virtually infinite storage scalability
• Real-time and batch processing layers
• Interactive queries
• Highly available and durable
• Pay only for what you use
• No servers to manage
Very easy to try: existing template
https://aws.amazon.com/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-lambda-architecture-without-servers/
Thank you!
Appendix
Data Lakes on AWS
Data lake: Amazon S3, Amazon Glacier, AWS Glue
Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
Internet of Things (IoT): AWS IoT Core, AWS Greengrass, AWS IoT Analytics, Amazon FreeRTOS, AWS IoT 1-Click, AWS IoT Button, AWS IoT Device Management, AWS IoT Device Defender
Sysco is the leader in selling, marketing, & distributing
food.
Challenge:
Large volumes of data in multiple systems. Also, high
costs from maintaining on-premises EDW deployment.
Solution:
Migrated their on-premises solution to the cloud with
Amazon Redshift, Amazon S3, Amazon EMR, and Athena.
Sysco: Analytics on the Data Lake
Diagram labels: marketing data source; other source systems; ingest raw data from multiple sources; Amazon S3; data preparation; ETL process; Amazon EMR; transformed data; Amazon S3; Amazon Redshift; Redshift Spectrum; Athena
Sysco is the leader in selling, marketing, & distributing food.
Challenge: large volumes of data in multiple systems.
Consolidated data into a single Amazon S3 data lake.
Data scientists use Amazon EMR notebooks; Athena and Amazon Redshift Spectrum are used by business users for reporting.

More Related Content

What's hot

Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Amazon Web Services
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseSnowflake Computing
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Amazon Web Services
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceSnowflake Computing
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksAmazon Web Services
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure Antonios Chatzipavlis
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...Amazon Web Services
 
Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWSBuilding a Modern Data Platform on AWS
Building a Modern Data Platform on AWSAmazon Web Services
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 

What's hot (20)

Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
 
Modern Data Platform on AWS
Modern Data Platform on AWSModern Data Platform on AWS
Modern Data Platform on AWS
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a Service
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Amazon QuickSight
Amazon QuickSightAmazon QuickSight
Amazon QuickSight
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech Talks
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWSBuilding a Modern Data Platform on AWS
Building a Modern Data Platform on AWS
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 

Similar to Implementing a Data Lake

Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Amazon Web Services
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudAmazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfAmazon Web Services
 
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSAmazon Web Services
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...AWS Riyadh User Group
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Amazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCAmazon Web Services LATAM
 
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...Amazon Web Services
 
Preparing Data for the Lake: Data Analytics Week SF
Preparing Data for the Lake: Data Analytics Week SFPreparing Data for the Lake: Data Analytics Week SF
Preparing Data for the Lake: Data Analytics Week SFAmazon Web Services
 

Similar to Implementing a Data Lake (20)

Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the Cloud
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWS
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 



Implementing a Data Lake

  • 1. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. ROHAN RAIZADA Solutions Architect, Amazon Web Services Implementing a Data Lake - Securely store, categorize, and analyze all your data in one, centralized repository
  • 2. Data Drives Better Decision Making. Data lake leaders who were highly efficient in capturing a diversity of data and making it accessible to their organization in a timely fashion outperformed their peers by 9% in organic revenue growth.* Organic revenue growth: Leaders 24%, Followers 15%. *Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
  • 3. Defining the AWS Data Lake. A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets. Key data lake attributes: decoupled storage and compute; rapid ingest and transformation; secure multi-tenancy; query in place; schema on read.
  • 4. Traditionally, Analytics Looked Like This. Diagram: OLTP, ERP, CRM, LOB; Data warehouse; Business intelligence. Relational data; TBs-PBs scale; schema defined before data load; operational reporting and on demand; large initial capex + $10K–$50K / TB / year.
  • 5. Data Lakes Extend the Traditional Approach. Relational and non-relational data; TBs-EBs scale; schema defined during analysis; diverse analytical engines to gain insights; designed for low-cost storage and analytics. Diagram: OLTP, ERP, CRM, LOB; Devices, Web, Sensors, Social; Data warehouse; Data lake; Catalog; Business intelligence; Machine learning; DW queries; Big data processing; Interactive; Real-time.
  • 6. Data Lakes on AWS. Unmatched durability and availability at exabyte scale. Comprehensive security, compliance, and audit capabilities. Object-level controls. Usage and cost analysis insight into your data. Most ways to bring data in. Twice as many partner integrations. Diagram: Data lake (Amazon S3, Amazon Glacier, AWS Glue); Machine Learning; Analytics; Internet of Things; Snowball; Snowmobile; Kinesis Data Firehose; Kinesis Data Streams; Kinesis Video Streams.
  • 7. Optimize costs with data tiering. Tiers from hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Infrequent Access, Amazon Glacier. Use EMR/Hadoop with local HDFS for the hottest data sets. Store cooler data in S3 and Glacier to reduce costs. Use S3 Analytics to optimize the tiering strategy.
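The tiering on slide 7 can be expressed as an S3 lifecycle rule. The following is a minimal sketch, not taken from the deck: it only builds the configuration dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The helper name `build_tiering_rule`, the `raw/` prefix, and the 30- and 90-day thresholds are illustrative assumptions.

```python
def build_tiering_rule(prefix, days_to_ia=30, days_to_glacier=90):
    """Build a lifecycle configuration that tiers objects under `prefix`.

    The thresholds are hypothetical defaults; choose them from observed
    access patterns (for example, with S3 Analytics).
    """
    if not 0 < days_to_ia < days_to_glacier:
        raise ValueError("expected 0 < days_to_ia < days_to_glacier")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Warm data moves to S3 Infrequent Access first ...
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    # ... and cold data is archived to Glacier later.
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = build_tiering_rule("raw/")
# With boto3 installed and credentials configured, this dict could be passed as
# LifecycleConfiguration to s3.put_bucket_lifecycle_configuration for a bucket
# of your choice (the bucket itself is not part of this sketch).
```

Keeping the rule as plain data makes it easy to review and version before it is applied to a bucket.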
  • 8. What can you do with a data lake? Storage: Amazon S3, Amazon Glacier. Batch processing: Amazon Redshift (data warehouse), Amazon EMR (Hadoop/Hive/Presto), Amazon Athena (clusterless SQL query), AWS Glue (clusterless ETL), BI & visualization.
  • 9. What can you do with a data lake? Storage: Amazon S3, Amazon Glacier. Streaming and real-time analytics: AWS Lambda, Amazon Elasticsearch Service, Apache Storm on EMR, Apache Flink on EMR, Amazon Kinesis Analytics, Spark Streaming on EMR, Amazon ElastiCache.
  • 10. What can you do with a data lake? Storage: Amazon S3, Amazon Glacier. AI and machine learning: Amazon Polly (life-like speech), Amazon Lex (conversational engine), Amazon Rekognition (image / video analysis), Amazon Transcribe (automatic speech recognition), Amazon Translate (language translation), Amazon Comprehend (natural language processing), and deep learning frameworks (MXNet, TensorFlow, Theano, Caffe, Torch).
  • 11. Simplified architectural view. Data sources (transactions, web logs / cookies, ERP, connected devices); ingestion mechanism; Amazon S3; process; consume.
  • 12. There are lots of ingestion tools. Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3; process; consume.
  • 13. Streaming with Amazon Kinesis. Easily collect, process, and analyze data and video streams in real time. Kinesis Video Streams: capture, process, and store video streams. Kinesis Data Streams: capture, process, and store data streams. Kinesis Data Firehose: load data streams into AWS data stores. Kinesis Data Analytics: analyze data streams with SQL.
  • 14. Amazon Kinesis Data Streams. Build your own data streaming applications. Easy administration: create a new stream, and set the desired capacity and partitioning to match your data throughput rate and volume. Build real-time applications: perform custom record processing on streaming data using Kinesis Client Library, Apache Spark/Storm, AWS Lambda, and more. Low cost: cost-efficient for workloads of any scale. Example flow: (1) send clickstream data to Kinesis Streams; (2) Kinesis Streams stores and exposes clickstream data for processing; (3) a custom application built on Kinesis Client Library makes real-time content recommendations; (4) readers see personalized content suggestions.
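Slide 14 mentions setting partitioning to match throughput. Kinesis Data Streams maps each record's partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The sketch below illustrates the idea locally, assuming the key space is split evenly across shards; `shard_ranges` and `shard_for_key` are hypothetical helper names, not Kinesis API calls.

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # MD5 digests are 128-bit integers


def shard_ranges(shard_count):
    """Split the hash key space into contiguous, near-equal ranges (an assumption)."""
    step = HASH_KEY_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the ranges cover the whole space.
        end = HASH_KEY_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the range that contains the MD5 hash of the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise AssertionError("ranges must cover the whole hash key space")


ranges = shard_ranges(4)
# Records with the same partition key (for example, a device ID) always land on
# the same shard, which preserves per-device ordering.
shard = shard_for_key("device-42", ranges)
```

A skewed set of partition keys concentrates traffic on a few shards, so a high-cardinality key such as a device ID is usually the safer choice.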
  • 15. Amazon Kinesis Data Firehose. Load massive volumes of streaming data into Amazon S3 and Redshift. Zero admin: capture and deliver streaming data into S3, Redshift, and other destinations without writing an application or managing infrastructure. Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into S3 and other destinations in as little as 60 seconds using simple configurations. Seamless elasticity: scales to match data throughput without operator intervention. Flow: capture and submit streaming data to Firehose; Firehose loads streaming data continuously into S3 and Redshift (the diagram also labels Elasticsearch); analyze streaming data using your favorite BI tools.
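The batching behaviour described on slide 15 (deliver when enough data has accumulated or enough time has passed) can be sketched as a small buffer. This is an illustration of the idea only, not Firehose's implementation: the class name `BufferedDelivery`, the 5 MiB default, and the explicit `now` argument are assumptions, and only the 60-second interval comes from the slide.

```python
class BufferedDelivery:
    """Collect records and hand them to `deliver` as one batch.

    A batch is flushed when it reaches `max_bytes` or when `max_seconds`
    have passed since its first record. Time is checked only when a new
    record arrives, which keeps the sketch free of timers.
    """

    def __init__(self, deliver, max_bytes=5 * 1024 * 1024, max_seconds=60):
        self.deliver = deliver
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.size = 0
        self.opened_at = None

    def put(self, record, now):
        if not self.buffer:
            self.opened_at = now  # the first record opens a new batch
        self.buffer.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            self.flush()

    def flush(self):
        if self.buffer:
            self.deliver(b"".join(self.buffer))
            self.buffer = []
            self.size = 0


delivered = []
stream = BufferedDelivery(delivered.append, max_bytes=10, max_seconds=60)
stream.put(b"abcd", now=0.0)     # buffered: 4 bytes
stream.put(b"efghij", now=1.0)   # 10 bytes reached, so the batch is delivered
stream.put(b"x", now=2.0)        # opens a second batch
stream.put(b"y", now=62.0)       # 60 seconds elapsed, so the batch is delivered
```

Larger buffers produce fewer, bigger objects in S3, which tends to suit downstream query engines better than many small files.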
  • 16. Variety of data processing tools. Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consume.
  • 17. And multiple ways to consume the data. Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), Amazon API Gateway (programmatic access).
  • 18. Data preparation accounts for ~80% of the work. Cleaning and organizing data (60%); collecting data sets (19%); building training sets; mining data for patterns; refining algorithms; other.
  • 19. AWS Glue: ETL Service. Make ETL scripting and deployment easy. Automatically generates ETL code. Code is customizable with Python and Spark. Endpoints provided to edit, debug, & test code. Jobs are scheduled or event-based. Serverless.
  • 20. ETL when you need it. Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), API Gateway (programmatic access).
  • 21. Storing is not enough. Data needs to be discoverable. “Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” (Gartner) Data shown: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured.
  • 22. AWS Glue: Data Catalog. Make data discoverable. Automatically discovers data and stores schema. Catalog makes data searchable and available for ETL. Catalog contains table and job definitions. Computes statistics to make queries efficient. Diagram: Compliance; AWS Glue Data Catalog; discover data and extract schema.
  • 23. AWS Glue: Data Catalog Crawlers. Helping catalog your data. Crawlers automatically build your Data Catalog and keep it in sync. They automatically discover new data and extract schema definitions: detect schema changes and version tables; detect Hive-style partitions on Amazon S3. Built-in classifiers for popular types; custom classifiers using Grok expressions. Run ad hoc or on a schedule; serverless, so you pay only when a crawler runs.
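Hive-style partitions, which slide 23 says crawlers detect, encode column values in the object key as `name=value` path segments. A minimal sketch of how such a key can be read follows; `parse_hive_partitions` and the example key are hypothetical and are not the crawler's actual logic.

```python
def parse_hive_partitions(key):
    """Extract Hive-style partition columns from an S3 object key.

    Only directory segments of the form name=value count as partitions;
    the final segment is treated as the file name.
    """
    partitions = {}
    for segment in key.split("/")[:-1]:
        name, separator, value = segment.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


columns = parse_hive_partitions("raw/year=2018/month=05/day=14/part-0000.csv")
# columns == {"year": "2018", "month": "05", "day": "14"}
```

Query engines that understand this layout can skip whole prefixes when a query filters on a partition column.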
  • 24. A central metadata store for your lake. Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), API Gateway (programmatic access). Metadata: AWS Glue Data Catalog (Hive-compatible metastore).
  • 25. Write once, catalog once, read multiple, ETL anywhere. Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), API Gateway (programmatic access). Metadata: AWS Glue Data Catalog (Hive-compatible metastore).
  • 26. Amazon EMR: Big Data Processing. Analytics and ML at scale. 19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more. Enterprise-grade security. Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning. Latest versions: updated with the latest open source frameworks within 30 days of release. Low cost: flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs 50–80%. Use S3 storage: process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector.
  • 27. Amazon Athena: Interactive Analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Ability to run SQL queries on data archived in Amazon Glacier (coming soon). Query instantly: zero setup cost; just point to Amazon S3 and start querying. Pay per query: pay only for queries run; save 30–90% on per-query costs through compression. Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Easy: serverless, with zero infrastructure and zero administration; integrated with Amazon QuickSight.
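The savings from compression on slide 27 follow from Athena billing by the amount of data a query scans. The arithmetic can be sketched as below. The per-terabyte price is deliberately a parameter, and the value 5.0 in the example is an illustrative assumption rather than a figure from the deck; the sketch also ignores details such as minimum charges per query.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def query_cost(bytes_scanned, price_per_tb):
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / TIB * price_per_tb


def compression_saving(raw_bytes, compression_ratio, price_per_tb):
    """Fraction of per-query cost saved by scanning compressed data.

    `compression_ratio` is compressed size divided by raw size,
    so 0.25 means the files shrink to a quarter of their size.
    """
    before = query_cost(raw_bytes, price_per_tb)
    after = query_cost(raw_bytes * compression_ratio, price_per_tb)
    return 1 - after / before


# A full scan of 1 TiB at a hypothetical price of 5.0 per TiB.
full_scan = query_cost(TIB, price_per_tb=5.0)
saving = compression_saving(TIB, compression_ratio=0.25, price_per_tb=5.0)
```

The saving depends only on the compression ratio, not on the price, which is why the slide can quote it as a percentage range.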
  • 28. Amazon Redshift: Modern Data Warehouse. Fast, scalable, fully managed data warehouse at 1/10th the cost. Massively parallel; scales from gigabytes to exabytes. Query data across your Amazon Redshift data warehouse and your Amazon S3 data lake with the Redshift Spectrum feature. Fast: delivers fast results for short queries, complex queries, and mixed workloads. Cost effective: start at $0.25 per hour; scale out for as low as $250–$333 per uncompressed terabyte per year. Data lake integration: query open file formats in Amazon S3 and optimized data formats on direct-attached disks. Secure: audit everything; encrypt data end-to-end; extensive certification and compliance.
  • 29. Data Lake Integration – Amazon Redshift Spectrum. Query across your Amazon Redshift data warehouse and your Amazon S3 data lake. Run Amazon Redshift SQL queries against Amazon S3. Scale compute and storage separately. Fast query performance. Unlimited concurrency. CSV, ORC, Grok, Avro, & Parquet data formats. On demand, pay-per-query based on data scanned. Diagram: Amazon S3 data lake; Amazon Redshift data; Redshift Spectrum query engine.
  • 30. Let’s take an example: a sensor/IoT device producing record-level data. Business questions: 1. What is going on with a specific sensor? 2. Daily aggregations (device, inefficiencies, average temperature). 3. A real-time view of how many sensors are showing inefficiencies. Operations: 1. Scale. 2. High availability. 3. Less management overhead. 4. Pay only for what I need.
  • 31. Let’s push this data into Kinesis. Diagram: Kinesis Firehose, Amazon S3, Amazon Athena.
  • 32.
  • 33. Querying it in Amazon Athena. Either create a crawler to auto-generate the schema, or write a DDL statement through the Athena console, API, or JDBC/ODBC driver. Then start querying data.
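For the second option on slide 33, a hand-written DDL statement might look like the output of the sketch below. The database and table names come from the queries later in the deck; the column types, the comma delimiter, and the bucket path are assumptions made for illustration, and `build_external_table_ddl` is a hypothetical helper.

```python
def build_external_table_ddl(database, table, columns, location, delimiter=","):
    """Render a CREATE EXTERNAL TABLE statement for delimited text files in S3."""
    # Backticks keep names such as `raw` safe to use as identifiers in DDL.
    column_lines = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"{column_lines}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}';"
    )


ddl = build_external_table_ddl(
    database="awsblogsgluedemo",
    table="raw",
    columns=[  # assumed types; adjust them to the real data
        ("uuid", "string"),
        ("deviceid", "int"),
        ("devicets", "string"),
        ("temp", "int"),
        ("settemp", "int"),
    ],
    location="s3://my-hypothetical-bucket/raw/",  # placeholder path
)
print(ddl)
```

Because the table is external, dropping it removes only the metadata; the files in S3 stay where they are.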
  • 34. Query daily aggregates in Amazon Athena. Diagram: Kinesis Firehose; Amazon S3 (“raw-time-series”); Amazon S3 (“daily-average”); Amazon Athena.
  • 35. AWS Glue Job. Serverless, event-driven execution. Data is written out to S3. Output table is automatically created in Amazon Athena.
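The daily aggregation that the Glue job produces can be illustrated in plain Python. This is a sketch of the computation only, not the Glue job from the deck: it assumes each event already carries a `pct_inefficiency` value and an ISO 8601 `devicets` timestamp, and the function name reuses the `daily_avg_inefficiency` table name purely for readability.

```python
from collections import defaultdict
from datetime import datetime


def daily_avg_inefficiency(events):
    """Average pct_inefficiency per (deviceid, calendar day)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for event in events:
        day = datetime.fromisoformat(event["devicets"]).date().isoformat()
        bucket = totals[(event["deviceid"], day)]
        bucket[0] += event["pct_inefficiency"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


events = [
    {"deviceid": 1, "devicets": "2018-05-14 09:00:00", "pct_inefficiency": 10.0},
    {"deviceid": 1, "devicets": "2018-05-14 17:30:00", "pct_inefficiency": 20.0},
    {"deviceid": 1, "devicets": "2018-05-15 08:00:00", "pct_inefficiency": 40.0},
    {"deviceid": 2, "devicets": "2018-05-14 12:00:00", "pct_inefficiency": 5.0},
]
daily = daily_avg_inefficiency(events)
```

In a Spark-based job the same result would typically come from a group-by on device and date followed by an average.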
  • 36. Query daily aggregates in Amazon Athena. Diagram: Kinesis Firehose; Amazon S3 (“raw-time-series”); Amazon S3 (“daily-average”); Amazon Athena.
  • 37. Kinesis Analytics for in-stream analytics. Diagram: Kinesis Firehose; Amazon S3 (“raw-time-series”); Amazon S3 (“daily-average”); Kinesis Analytics; Kinesis Firehose; Amazon S3 (“results”); Amazon Athena.
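The real-time view asked for in the example (how many sensors are showing inefficiencies) is a windowed aggregation over the stream. A minimal tumbling-window sketch in plain Python follows. It is not Kinesis Analytics SQL, and the field name `ts` (epoch seconds), the 60-second window, and the threshold of 10.0 are illustrative assumptions.

```python
def inefficient_devices_per_window(events, window_seconds=60, threshold=10.0):
    """Count distinct devices above `threshold` in each tumbling window.

    Windows are aligned to multiples of `window_seconds` and do not overlap.
    """
    windows = {}
    for event in events:
        window_start = int(event["ts"] // window_seconds) * window_seconds
        devices = windows.setdefault(window_start, set())
        if event["pct_inefficiency"] > threshold:
            devices.add(event["deviceid"])
    return {start: len(devices) for start, devices in sorted(windows.items())}


stream = [
    {"ts": 5, "deviceid": 1, "pct_inefficiency": 20.0},
    {"ts": 30, "deviceid": 1, "pct_inefficiency": 25.0},  # same device, counted once
    {"ts": 40, "deviceid": 2, "pct_inefficiency": 5.0},   # below the threshold
    {"ts": 65, "deviceid": 2, "pct_inefficiency": 15.0},  # falls in the next window
]
counts = inefficient_devices_per_window(stream)
```

Windows that contain only efficient devices still appear with a count of zero, so a dashboard can tell "no problems" apart from "no data".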
  • 38. “raw” table with raw data
      Top 20 most active devices:
        SELECT deviceid, COUNT(*) AS num_events
        FROM awsblogsgluedemo."raw"
        GROUP BY deviceid
        ORDER BY num_events DESC
        LIMIT 20;
      Events by device ID:
        SELECT uuid, devicets, deviceid, temp
        FROM awsblogsgluedemo."raw"
        WHERE deviceid = 1
        ORDER BY devicets DESC;
    “daily-agg” table with daily aggregation
      KPI, overall device daily inefficiency:
        SELECT ( SUM(daily_avg_inefficiency)/COUNT(*) ) AS all_device_avg_inefficiency, date
        FROM awsblogsgluedemo.daily_avg_inefficiency
        GROUP BY date;
    “result” table
      Top 10 most inefficient devices, event-level granularity:
        SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets", col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
        FROM awsblogsgluedemo.results
        ORDER BY pct_inefficiency DESC
        LIMIT 10;
  • 39. Overall architecture. Diagram: Kinesis Firehose; Amazon S3 (“raw-time-series”); Amazon S3 (“daily-average”); Kinesis Analytics; Kinesis Firehose; Amazon S3 (“results”); Amazon Athena.
  • 40. Characteristics: scale to hundreds of thousands of data sources; virtually infinite storage scalability; real-time and batch processing layers; interactive queries; highly available and durable; pay only for what you use; no servers to manage.
  • 41. Very easy to try – existing template: https://aws.amazon.com/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-lambda-architecture-without-servers/
  • 42.
  • 43. Thank you!
  • 44. Appendix
  • 45. Data Lakes on AWS. Data lake: Amazon S3, Amazon Glacier, AWS Glue. Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Internet of Things (IoT): AWS IoT Core, AWS Greengrass, AWS IoT Analytics, Amazon FreeRTOS, AWS IoT 1-Click, AWS IoT Button, AWS IoT Device Management, AWS IoT Device Defender. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight.
  • 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sysco is the leader in selling, marketing, & distributing food. Challenge: Large volumes of data in multiple systems. Also, high costs from maintaining on-premises EDW deployment. Solution: Migrated their on-premises solution to the cloud with Amazon Redshift, Amazon S3, Amazon EMR, and Athena.
  • 47. Sysco: Analytics on the Data Lake. Sysco is the leader in selling, marketing, & distributing food. Challenge: large volumes of data in multiple systems. Consolidated data into a single Amazon S3 data lake. Data scientists use Amazon EMR notebooks; business users use Athena and Amazon Redshift Spectrum for reporting. Diagram: marketing data source and other source systems; ingest raw data from multiple sources into Amazon S3; data preparation and ETL process with Amazon EMR; transformed data in Amazon S3; Amazon Redshift, Redshift Spectrum, Athena.