Implementazione di una soluzione Data Lake.pdf

© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Giorgio Nobile
Solutions Architect, Amazon Web Services
Data Lake Solutions: Store, Catalog, and
Analyze All Your Data

Defining the AWS data lake
Data lake is an architecture with a virtually
limitless centralized storage platform capable of
categorization, processing, analysis, and
consumption of heterogeneous data sets
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
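
As a minimal sketch of the "schema on read" attribute: the raw records below are stored exactly as they arrived, and a schema is applied only when an analysis reads them. The sample records, the `SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names chosen for this illustration, not part of any AWS API.

```python
import json

# Raw events kept as they arrived; no schema was enforced at write time.
RAW_LINES = [
    '{"user": "u1", "amount": "19.90", "country": "IT"}',
    '{"user": "u2", "amount": "5", "extra_field": true}',
]

# Hypothetical schema chosen by one analysis: column -> (converter, default).
SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto `schema` at read time."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, (convert, default) in schema.items():
            value = record.get(column)
            # Missing fields fall back to the default; extra fields are ignored.
            row[column] = convert(value) if value is not None else default
        yield row


rows = list(read_with_schema(RAW_LINES, SCHEMA))
```

A different analysis could read the same raw lines with a different schema, which is the point of deferring the schema to read time.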

Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence Relational data
TBs-PBs scale
Schema defined prior to data load
Operational reporting and ad hoc
Large initial capex + $10K–$50K / TB / Year

Data Lakes Extend the Traditional Approach
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
OLTP ERP CRM LOB
Data Warehouse
Business
Intelligence
Data Lake
Devices Web Sensors Social
Catalog
Machine
Learning
DW Queries Big data
processing
Interactive Real-time

What can you do with a data lake?
Amazon
Glacier
Amazon
S3
Amazon Redshift
Data Warehouse
Amazon EMR
Clusterless SQL Query
Amazon Athena
Clusterless ETL
AWS Glue
BI & Visualization
Hadoop/Hive/Presto
Batch processing

Amazon
Glacier
Amazon
S3
Streaming and real-time analytics
AWS Lambda
Amazon
Elasticsearch
Service
Apache Storm
on EMR
Apache Flink
on EMR
Amazon Kinesis
Analytics
Spark Streaming
on EMR
Amazon
ElastiCache

Amazon
Glacier
Amazon
S3
AI and machine learning
Life-like speech
Amazon Polly
Amazon Lex
Conversational
engine
Amazon Rekognition
Image analysis
Deep learning
Frameworks
MXNet, TensorFlow,
Theano, Caffe, Torch

Data Lakes on AWS
Unmatched durability and availability at Exabyte scale
Comprehensive security, compliance, and audit capabilities
Object-level controls
Usage and cost analysis insight into your data
Most ways to bring data in
Twice as many partner integrations
DATA LAKE
Amazon S3
Amazon Glacier
AWS Glue
Machine Learning
Analytics
Internet of Things
Snowball
Snowmobile Kinesis
Data Firehose
Kinesis
Data Streams
Kinesis
Video Streams

Optimize costs with data tiering
Hot
Cold
Amazon
S3 standard
Amazon S3—
infrequent access
Amazon
Glacier
HDFS
• Use EMR/Hadoop with local HDFS for hottest data sets
• Store cooler data in S3 and Glacier to reduce costs
• Use S3 Analytics to optimize tiering strategy
S3 Analytics
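
The hot-to-cold tiering above could be expressed as an S3 lifecycle rule. The sketch below only builds the configuration dictionary. The `logs/` prefix, the 30-day and 90-day thresholds, and the bucket name in the comment are assumptions for illustration, and `build_tiering_rule` is a hypothetical helper.

```python
def build_tiering_rule(prefix, days_to_ia=30, days_to_glacier=90):
    """Return a lifecycle configuration that moves objects to colder tiers."""
    if days_to_glacier <= days_to_ia:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # S3 Standard -> S3 Standard-Infrequent Access
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    # S3 Standard-IA -> Amazon Glacier
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = build_tiering_rule("logs/")

# With boto3 installed and credentials configured, the rule could be applied
# along these lines (bucket name is a placeholder):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

The thresholds are a starting point only; the S3 Analytics reports mentioned above are meant to tell you which values fit your access patterns.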

Streaming with Amazon Kinesis
Easily collect, process, and analyze data and video streams in real time
• Kinesis Video Streams: capture, process, and store video streams
• Kinesis Data Streams: capture, process, and store data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL

Amazon Kinesis Data Streaming
Collect, process, and analyze data streams in real time
EMR/Spark
Custom code
on EC2
Amazon S3
Amazon
Redshift
Splunk
Ingest,
store data
streams
Kinesis Data
Streams
Kinesis Data
Analytics
Aggregate,
filter, enrich
data
Kinesis Data
Firehose
Egress
data
streams
AWS Lambda
Real time
Fully managed
Scalable
Secure
Cost effective
Amazon
Elasticsearch
Service
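
To illustrate how a stream scales across shards: Kinesis Data Streams takes the MD5 hash of each record's partition key, reads it as a 128-bit integer, and routes the record to the shard whose hash key range contains that value. The sketch below assumes the 128-bit hash space is split evenly across shards, which is not the case after resharding. `shard_for_key` is a hypothetical helper, not a Kinesis API.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard that would receive `partition_key`.

    Assumes `shard_count` shards with equal-sized, contiguous hash key ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Scale the 128-bit hash into the range [0, shard_count).
    return hash_key * shard_count // HASH_SPACE


# Records with the same partition key always land on the same shard,
# which preserves per-key ordering.
first = shard_for_key("user-42", 4)
second = shard_for_key("user-42", 4)
```

Choosing a high-cardinality partition key (for example a user or device ID) spreads records across shards more evenly than a key with only a few distinct values.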

Amazon Kinesis Data Firehose

Amazon Kinesis Data Analytics

Amazon Kinesis is a Foundational Service Used
Across Amazon
Amazon
CloudWatch
Logs
Amazon
S3
events
AWS
metering
Amazon.com
online catalog

Use Case 1: Clickstream Analytics
Example: Website content recommendations
Streams website
clickstreams for
analytics
Aggregates clickstreams
based on user sessions
and calculates site
metrics
Loads aggregated
metrics to Amazon
Redshift
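
The aggregation step in this use case can be sketched in plain Python: clicks are grouped into per-user sessions split by inactivity gaps, then summarized into site metrics that would be loaded into Amazon Redshift. The 30-minute timeout, the sample clicks, and the function names are assumptions for illustration. In the architecture above this logic would run in the streaming layer rather than in a batch script.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)

# (user, timestamp in seconds, page) tuples standing in for the clickstream.
CLICKS = [
    ("u1", 0, "/home"),
    ("u1", 120, "/products"),
    ("u2", 200, "/home"),
    ("u1", 4000, "/home"),
]


def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Group clicks into per-user sessions split by inactivity gaps."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_user[user].append((ts, page))

    sessions = []
    for user, events in by_user.items():
        current = [events[0]]
        for event in events[1:]:
            if event[0] - current[-1][0] > timeout:
                sessions.append((user, current))  # gap too long: close session
                current = [event]
            else:
                current.append(event)
        sessions.append((user, current))
    return sessions


def site_metrics(sessions):
    """Compute simple site-wide metrics from sessionized clicks."""
    page_views = sum(len(events) for _, events in sessions)
    return {
        "sessions": len(sessions),
        "page_views": page_views,
        "avg_pages_per_session": page_views / len(sessions),
    }


metrics = site_metrics(sessionize(CLICKS))
```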

Use Case 2: Real-time Analytics
Example: Analyze streaming social media data

Storing is not enough, data needs to be discoverable
“Dark data are the information
assets organizations collect,
process, and store during
regular business activities,
but generally fail to use for
other purposes (for example,
analytics, business relationships
and direct monetizing).”
Gartner
CRM ERP Data warehouse Mainframe data
Web Social Log files Machine data
Semi-structured
Unstructured

AWS Glue—data catalog
Make data discoverable
Automatically discovers data and stores schema
Catalog makes data searchable, and available for ETL
Catalog contains table and job definitions
Computes statistics to make queries efficient
Compliance
Glue
Data Catalog
Discover data and
extract schema
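
To make "discover data and extract schema" concrete, the sketch below infers column types from a small CSV sample. It is a toy illustration of the idea only. It is not how AWS Glue crawlers are implemented, and the sample data and helper names are hypothetical.

```python
import csv
import io

SAMPLE = "order_id,amount,country\n1,19.90,IT\n2,5,DE\n"


def infer_type(values):
    """Pick the narrowest of bigint, double, string that fits every value."""
    for name, cast in (("bigint", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue  # at least one value does not fit; try the next type
    return "string"


def infer_schema(text):
    """Return a column -> type mapping inferred from CSV text with a header."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {column: infer_type([row[column] for row in rows]) for column in rows[0]}


schema = infer_schema(SAMPLE)
```

A catalog entry built from such a mapping is what lets query engines treat files in S3 as tables.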

Data preparation accounts for ~80% of the work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other

AWS Glue—ETL service
Make ETL scripting and deployment easy
Automatically generates ETL code
Code is customizable with Python and Spark
Endpoints provided to edit, debug, test code
Jobs are scheduled or event-based
Serverless

Amazon EMR—Big Data Processing
Latest versions
Updated with the latest
open source frameworks
within 30 days of release
Low cost
Flexible per-second billing,
EC2 Spot, Reserved Instances,
and auto-scaling to reduce
costs 50–80%
Use S3 storage
Process data directly in
the S3 data lake securely
with high performance
using the EMRFS
connector
Easy
Launch fully managed
Hadoop & Spark in minutes;
no cluster setup, node
provisioning, cluster tuning
Data Lake
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security

Amazon Athena—Interactive Analysis
SQL
Query Instantly
Zero setup cost; just
point to S3 and start
querying
Pay per query
Pay only for queries run;
save 30–90% on per-query
costs through compression
Open
ANSI SQL interface,
JDBC/ODBC drivers, multiple
formats, compression types,
and complex joins and data
types
Easy
Serverless: zero
infrastructure, zero
administration
Integrated with QuickSight
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
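
Because Athena charges by the amount of data scanned, the savings from compression follow from simple arithmetic. The sketch below assumes a price of $5 per TB scanned and ignores per-query minimums and rounding. Check current pricing before relying on the numbers. The function names are hypothetical.

```python
TB = 1024 ** 4  # bytes per terabyte


def query_cost(bytes_scanned, price_per_tb=5.0):
    """Estimated cost in USD of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TB * price_per_tb


def compression_savings(raw_bytes, compression_ratio, price_per_tb=5.0):
    """Fraction of per-query cost saved when data shrinks by `compression_ratio`."""
    before = query_cost(raw_bytes, price_per_tb)
    after = query_cost(raw_bytes / compression_ratio, price_per_tb)
    return 1 - after / before


# A 3:1 compression ratio cuts the scanned bytes, and so the cost, by about 67%.
saving = compression_savings(raw_bytes=2 * TB, compression_ratio=3)
```

Under this model the saving depends only on the ratio (1 - 1/ratio), so a 10:1 ratio corresponds to the 90% end of the range quoted above.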

Amazon Redshift – Modern Data Warehousing
Fast, scalable, fully managed data warehouse at 1/10th the cost
Massively parallel, scales from gigabytes to exabytes
Queries data across your Redshift data warehouse and Amazon S3 data lake
Fast at scale
Columnar storage technology
to improve I/O efficiency and
scale query performance
Cost-effective
Start at $0.25 per hour; as
low as $250-$333 per
uncompressed terabyte
per year
Open file formats Secure
Audit everything; encrypt
data end-to-end; extensive
certification and compliance
Analyze optimized data
formats on direct-attached
disks, and all open file
formats in S3

Redshift Spectrum – Data Lake Analytics
Query across your Amazon Redshift data warehouse and your Amazon S3 data lake
Run Redshift SQL queries against Amazon S3
Scale compute and storage separately
Fast query performance
Unlimited concurrency
CSV, ORC, Grok, Avro & Parquet data formats
On demand, pay per query based on data scanned
S3 data lake
Redshift data
Redshift Spectrum
query engine

Amazon Redshift Cluster Architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, unload, backup, restore
• 2, 16 or 32 slices
Redshift Spectrum
• In-place queries of data on Amazon S3
• Ultra high scale, unlimited concurrency
• CSV, Grok, Avro, Parquet, and more
Redshift Cluster
JDBC/ODBC
...
1 2 3 4 N
Leader Node
Compute Nodes
Spectrum Fleet
Amazon S3
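
The division of labor between leader node and compute node slices can be sketched as a scatter-gather computation: rows are hashed to slices, each slice aggregates its own rows, and the leader merges the partial results. This is a toy model. CRC32 stands in for Redshift's internal distribution hash, and the row data and function names are hypothetical.

```python
import zlib


def slice_for(key, slice_count):
    """Map a distribution key to a slice using a stable hash (CRC32 stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % slice_count


def distribute(rows, slice_count):
    """Place each (key, amount) row on the slice chosen by its key."""
    slices = [[] for _ in range(slice_count)]
    for key, amount in rows:
        slices[slice_for(key, slice_count)].append((key, amount))
    return slices


def leader_sum(rows, slice_count):
    """Leader-style plan: every slice sums locally, then partials are merged."""
    slices = distribute(rows, slice_count)
    partial_sums = [sum(amount for _, amount in part) for part in slices]
    return sum(partial_sums)


ROWS = [("order-1", 10), ("order-2", 25), ("order-3", 7), ("order-4", 8)]
total = leader_sum(ROWS, slice_count=4)
```

In a real cluster the per-slice sums run in parallel on different nodes. The sequential list comprehension here only shows the shape of the plan.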

Sysco is the leader in selling, marketing, & distributing food
Challenge:
Large volumes of data in multiple systems. Also, high costs
from maintaining on-premises EDW deployment
Solution:
Migrated their on-premises solution to the cloud with
Redshift, S3, EMR, and Athena

Sysco—Analytics on the Data Lake
ETL
process
Redshift
Data
preparation
Ingest raw data
from multiple
sources
S3
Marketing
data source
Other
source
systems
Transformed
data
S3
Redshift
Spectrum
Athena
EMR
Consolidated data into a single S3 data lake
Data scientists use EMR notebooks, Athena & Amazon Redshift
Spectrum used by business users for reporting

Nasdaq Uses Amazon Redshift for Fast Queries
Migrate legacy on-premises warehouse to Redshift
4.8B rows inserted per trading day (orders, trades,
quotes)
Ingests data from multiple sources, validates it, and
stages it in Amazon S3
Redshift reads data out of S3 for fast queries
Presto on EMR and S3 used for analysis of massive
historical data set
Redshift
Flat files
Operational
Databases
S3
EMR
Data from all 7 exchanges operated by
Nasdaq (orders, quotes, trade executions)
SQL Clients

Thank you!
