SUMMIT BAHRAIN
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building a Modern Data Platform on AWS
Asif Abbasi
Specialist Solutions Architect (EMEA)
AWS
About me
Muhammad Asif Abbasi
Sr. Specialist Solutions Architect - Analytics
• 20 years of data & analytics background
• 1+ years at AWS
• Data Lakes, Hadoop, Visualization, AI & ML
• Working with AWS service teams
• Mostly talk about Data & Analytics across EMEA
Twitter: @masifabbasi
Agenda
Data Lakes on AWS
Data Ingestion
Ad hoc Analytics
AWS Glue, Lake Formation and Redshift
Q&A
Defining the AWS data lake
A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous datasets
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
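To make "schema on read" concrete, here is a minimal Python sketch. It assumes raw CSV lines land in storage untouched and each consumer applies its own schema when it reads them. The sample records, column names, and the `read_with_schema` helper are hypothetical illustrations, not part of any AWS API.

```python
import csv
import io

# Raw data is stored exactly as it arrived; no schema is enforced on write.
RAW_OBJECT = "10.0.0.1,2019-09-15,512\n10.0.0.2,2019-09-16,2048\n"

def read_with_schema(raw_text, schema):
    """Apply a list of (column name, cast function) pairs to raw CSV text at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows

# Two consumers read the same bytes with different schemas (both hypothetical).
network_schema = [("src_ip", str), ("day", str), ("bytes_sent", int)]
billing_schema = [("customer", str), ("period", str), ("usage", float)]

network_rows = read_with_schema(RAW_OBJECT, network_schema)
billing_rows = read_with_schema(RAW_OBJECT, billing_schema)
print(network_rows[0])
print(billing_rows[1])
```

The stored object never changes; only the interpretation does, which is what lets several tools query the same data in place.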
Data lakes help you scale cost effectively
Store exabytes of data
Stage from landing dock to transformed to curated
Make available in each
Load, transform, and catalog once
Make data available to many tools
Open formats and interfaces support innovation
[Diagram: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams, Amazon S3, Amazon Redshift, Amazon EMR, Athena, Amazon Kinesis, Amazon ES, AI services, Amazon QuickSight]
How it works: Data Lakes and analytics on AWS
[Diagram: data sources OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social; services S3, IAM, AWS KMS, Kinesis]
Build data lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats
Simplify security management
• Enforce encryption
• Define access policies
• Implement audit logging
Enable self-service and combined analytics
• Analysts discover all data available for analysis
from a single data catalog
• Use multiple analytics tools over the same data
[Diagram: Athena, Amazon Redshift, AI services, Amazon EMR, Amazon QuickSight, Data Catalog, Amazon S3]
Why Amazon S3 for the data lake?
High performance
Secure
Durable
Available
Easy to use
Scalable & affordable
Integrated
Amazon Kinesis—real time
Easily collect, process, and analyze video and data streams in real time
Kinesis Video Streams: Capture, process, and store video streams for analytics
Kinesis Data Streams: Build custom applications that analyze data streams
Kinesis Data Firehose: Load data streams into AWS data stores
Kinesis Data Analytics: Analyze data streams with SQL
Processing and querying in place
User-Defined Functions
• Bring your own functions & code
• Execute without provisioning servers
AWS Lambda Function
Fully Managed Process & Query
• Catalog, transform, & query data in Amazon S3
• No physical instances to manage
Amazon S3 Select and Amazon S3 Glacier Select
Select subset of data from an object based on a SQL expression
Motivation behind Amazon S3 Select
GET all the data from Amazon S3 objects and my application will filter the data that I need
Amazon Redshift Spectrum Example:
Customer: Run 50,000 queries
Amount of data fetched from Amazon S3: 6 PB
Amount of data used in Amazon Redshift: 650 TB
Data needed from Amazon S3: 10%
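The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch over the figures quoted above; it assumes decimal units (1 PB = 1000 TB), and the variable names are illustrative.

```python
# Figures quoted on the slide for the Amazon Redshift Spectrum example.
fetched_tb = 6 * 1000   # 6 PB fetched from Amazon S3, assuming 1 PB = 1000 TB
used_tb = 650           # 650 TB actually used in Amazon Redshift
queries = 50_000        # number of queries the customer ran

used_fraction = used_tb / fetched_tb
discarded_tb = fetched_tb - used_tb
avg_fetched_gb_per_query = fetched_tb * 1000 / queries

print(f"share of fetched data actually used: {used_fraction:.1%}")
print(f"fetched and then discarded: {discarded_tb} TB")
print(f"average data fetched per query: {avg_fetched_gb_per_query:.0f} GB")
```

Roughly nine tenths of what was transferred was filtered out only after it reached the application, which is the gap S3 Select aims to close by filtering inside the storage service.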
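Amazon S3 Select: Serverless MapReduce
Before: 200 seconds and 11.2 cents
    # s3_client is a boto3 S3 client; src_bucket, src_keys, and line_count are defined elsewhere
    # Download and process all keys
    for key in src_keys:
        response = s3_client.get_object(Bucket=src_bucket, Key=key)
        contents = response['Body'].read().decode('utf-8')
        for line in contents.split('\n')[:-1]:
            line_count += 1
            try:
                data = line.split(',')
                srcIp = data[0][:8]
                ….
After: 95 seconds and 2.8 cents
    # Select the IP address prefix and second column inside Amazon S3
    for key in src_keys:
        response = s3_client.select_object_content(
            Bucket=src_bucket, Key=key,
            ExpressionType='SQL',
            Expression="SELECT SUBSTRING(obj._1, 1, 8), obj._2 FROM s3object AS obj",
            InputSerialization={'CSV': {}},
            OutputSerialization={'CSV': {}})
        # The result arrives as an event stream; join the Records chunks before decoding
        chunks = [e['Records']['Payload'] for e in response['Payload'] if 'Records' in e]
        contents = b''.join(chunks).decode('utf-8')
        for line in contents.splitlines():
            line_count += 1
            try:
                ….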
2X faster at 1/5 of the cost
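The S3 Select call returns an event stream rather than a single body, which is easy to get wrong. The sketch below shows one way to build the request and collect the results. `build_select_params` and `collect_records` are hypothetical helper names, the bucket and key are placeholders, and the event list is a hand-written stand-in for `response['Payload']` so the example runs without AWS access. The query uses `SUBSTRING`, on the assumption that this is the spelling S3 Select's SQL dialect accepts.

```python
def build_select_params(bucket, key):
    """Keyword arguments for the boto3 S3 client's select_object_content call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": "SELECT SUBSTRING(obj._1, 1, 8), obj._2 FROM s3object AS obj",
        "InputSerialization": {"CSV": {"FileHeaderInfo": "NONE"}},
        "OutputSerialization": {"CSV": {}},
    }

def collect_records(event_stream):
    """Join the Records payloads of a select_object_content event stream.

    A single record can be split across events, so the byte chunks are
    concatenated before decoding and splitting into lines.
    """
    chunks = [event["Records"]["Payload"] for event in event_stream if "Records" in event]
    return b"".join(chunks).decode("utf-8").splitlines()

# Stand-in for response["Payload"]; with boto3 the real call would be
#   response = s3_client.select_object_content(**build_select_params(bucket, key))
fake_stream = [
    {"Records": {"Payload": b"10.0.0.1,GET\n10.0."}},
    {"Records": {"Payload": b"0.2,PUT\n"}},
    {"Stats": {"Details": {"BytesScanned": 1024, "BytesProcessed": 1024, "BytesReturned": 26}}},
    {"End": {}},
]

params = build_select_params("example-bucket", "logs/part-0000.csv")
lines = collect_records(fake_stream)
print(lines)
```

Joining the chunks first matters because the second record in the stand-in stream is deliberately split across two events.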
Amazon Athena—interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Supports multiple data formats – define schema on demand
Query instantly, pay per query, open, easy
Choosing the right data formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, JSON are easy, but not efficient
  • Compress & store/archive as raw input
• Columnar compressed formats are generally preferred
  • Parquet or ORC
  • Smaller storage footprint = lower cost
  • More efficient scan & query
• Row-oriented (Avro) good for full data scans
Key considerations are cost, performance, & support
Choosing the right data formats (cont.)
Pay by the amount of data scanned per query
Use compressed columnar formats
• Parquet
• ORC
Easy to integrate with wide variety of tools
Data set                              | Size on Amazon S3     | Query run time | Data scanned          | Cost
Logs stored as text files             | 1 TB                  | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format* | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
Savings                               | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper
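Because the query is billed by data scanned, the cost column can be recomputed from the scan column. The sketch below derives the per-TB rate from the text-file row instead of assuming a price, then applies it to the Parquet row. It assumes 1 TB = 1024 GB, and the variable names are illustrative. Note that the run times as transcribed here give a ratio of about 46x, not the 34x on the slide, so one of the run-time figures may differ from the original deck.

```python
GB_PER_TB = 1024  # assumption: binary units

# Values from the table above.
text_size_gb, text_runtime_s, text_scanned_tb, text_cost = 1 * GB_PER_TB, 237.0, 1.15, 5.75
parquet_size_gb, parquet_runtime_s, parquet_scanned_gb = 130.0, 5.13, 2.69

# Price per TB scanned implied by the text-file row.
rate_per_tb = text_cost / text_scanned_tb

# Apply the same rate to the Parquet scan volume.
parquet_cost = parquet_scanned_gb / GB_PER_TB * rate_per_tb

storage_saving = 1 - parquet_size_gb / text_size_gb
scan_saving = 1 - parquet_scanned_gb / (text_scanned_tb * GB_PER_TB)
cost_saving = 1 - parquet_cost / text_cost
speedup = text_runtime_s / parquet_runtime_s

print(f"implied rate: ${rate_per_tb:.2f} per TB scanned")
print(f"Parquet query cost: ${parquet_cost:.3f}")
print(f"storage saving: {storage_saving:.0%}, scan saving: {scan_saving:.1%}, cost saving: {cost_saving:.1%}")
print(f"run-time ratio from the table as printed: {speedup:.1f}x")
```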
Data prep is ~80% of data lake work
[Pie chart categories: Building training sets; Cleaning and organizing data; Collecting data sets; Mining data for patterns; Refining algorithms; Other]
AWS Glue—Serverless data catalog & ETL
Data Catalog: Discover data and extract schema
ETL job authoring: Auto-generates customizable ETL code in Python and Spark
• Automatically discovers data and stores schema
• Data searchable and available for ETL
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
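To give a feel for what "discover data and extract schema" involves, here is a small pure-Python sketch that samples a CSV file and guesses a type for each column. It illustrates the idea only and does not describe how AWS Glue crawlers are implemented. The `infer_type` and `infer_schema` helpers, the type names, and the sample data are all assumptions made for this example.

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest type name that fits every sampled value."""
    for type_name, cast in (("bigint", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"

def infer_schema(csv_text):
    """Return {column name: inferred type} for CSV text with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = list(zip(*reader))  # transpose rows into columns
    return {name: infer_type(column) for name, column in zip(header, columns)}

sample = "src_ip,bytes_sent,latency_ms\n10.0.0.1,512,3.5\n10.0.0.2,2048,12\n"
schema = infer_schema(sample)
print(schema)
```

A catalog would then store a result like this alongside the data location, so query engines and ETL jobs can look the table up instead of re-deriving its layout.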
AWS Lake Formation
Build, secure, and manage a data lake in days
Build a data lake in days, not months
Build and deploy a fully managed data lake with a few clicks
Enforce security policies across multiple services
Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
Combine different analytics approaches
Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
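As an illustration of defining access centrally, the sketch below builds the request for a table-level grant in the shape that boto3's Lake Formation client accepts for `grant_permissions`. The `build_grant` helper, the role ARN, and the database and table names are hypothetical, and the actual API call is left as a comment so the example runs without AWS credentials.

```python
def build_grant(principal_arn, database, table, permissions=("SELECT",)):
    """Build keyword arguments for a Lake Formation table-level grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }

# Hypothetical analyst role, database, and table.
grant = build_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",
    database="sales",
    table="orders",
)

# With boto3 installed and credentials configured, the grant would be applied with:
#   boto3.client("lakeformation").grant_permissions(**grant)
print(grant)
```

The point of the design is that the grant is recorded once against the catalog table, rather than separately in each analytics service that reads it.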
Traditionally, analytics looked like this…
Expensive: Large initial capex + $10K to $50K/TB/year
GBs-TBs scale [not designed for PB/EBs]
Relational data
90% of data was thrown away because of cost
OLTP ERP CRM LOB
Data warehouse
Business intelligence
Data lakes evolve the traditional approach
[Diagram: sources OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social; components Data warehouse, Business intelligence, Data lake, Catalog; workloads Machine learning, DW queries, Big data processing, Interactive, Real-time]
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
What does data warehouse modernization mean?
Easy to use: Don’t waste time on menial administrative tasks and maintenance
Extends to your Data Lake: Directly analyze data stored in your data lake in open formats
Any scale of data, workloads, and users: Dynamically scale up to guarantee performance even with unpredictable demands and data volumes
Faster time-to-insights: Consistently fast performance, even with thousands of concurrent queries and users
Amazon Redshift
Fast, simple, cost-effective data warehouse that can extend queries to your data lake
Fastest: Get faster time-to-insight for all types of analytics workloads; powered by machine learning, columnar storage, and MPP
Unlimited scale: Dynamically scale up to guarantee performance even with unpredictable analytical demands and data volumes
Extends your data lake: Analyze data in the Amazon S3 data lake in-place and in open formats, together with data loaded into Amazon Redshift’s high performance SSDs
1/10th the cost: Start at $0.25 per hour, save costs with automated administration tasks, and eliminate business impact due to downtime; as low as $1,000 per terabyte per year
Analyze data in open formats such as Parquet, ORC, and JSON, using SQL tools
Amazon Redshift architecture
Leader Node
• Simple SQL end point
• Stores metadata
• Optimizes query plan
• Coordinates query execution
Compute Nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
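The leader/compute split described above follows a scatter-gather pattern: the leader hands each compute slice its share of the work and merges the partial results. The sketch below imitates that with threads for an average. It is a conceptual illustration with hypothetical function names and made-up data, not Redshift's actual execution engine.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_node_partial(rows):
    """Each compute slice aggregates only the rows it stores locally."""
    return sum(rows), len(rows)

def leader_average(slices):
    """The leader fans the work out to the slices, then merges the partial results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(compute_node_partial, slices))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count

# Three slices holding different numbers of rows.
slices = [[10, 20], [30], [40, 50, 60]]
average = leader_average(slices)
print(average)
```

Each slice returns a sum and a count rather than its own average, because averaging the per-slice averages would give the wrong answer when slices hold different numbers of rows.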
Security is built-in
Network Isolation
End-to-end encryption
Integration with AWS Key Management Service (AWS KMS)
Select compliance certifications*
[Diagram labels: JDBC/ODBC; Customer VPC; Internal VPC; Leader Node; Compute Nodes; 10 GigE (HPC); Amazon S3]
Concurrency Scaling for bursts of user activity
Quickly scale to serve changing query workload
• Automatically creates more clusters on-demand
• Consistently fast performance even with thousands of concurrent queries
• No advance hydration required
[Diagram labels: Caching layer; Backup; Redshift Managed Amazon S3]
Amazon Redshift elastic resize (GA)
Scale up and down in minutes
• Adds additional nodes to Redshift cluster
• Distributes data across new configuration in minutes
• Minimal transition time
• Scale compute and storage on-demand
[Diagram labels: JDBC/ODBC; Amazon Redshift cluster; Leader Node; CN1, CN2, CN3, CN4; Backup; Redshift Managed Amazon S3]
Amazon Redshift intelligent administration
Automates data distribution in tables for improved performance and disk space utilization.
Provides intelligent recommendations for tuning based on continuous workload analysis.
[Diagram: ALL, EVEN, and KEY distribution styles (keyA, keyB, keyC, keyD) across Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4); Advise: recommended distribution key]
No more messing with distkeys.
Coming Soon!
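The three distribution styles in the diagram can be mimicked in a few lines of Python to show where rows end up. This is a conceptual sketch: the two-node, four-slice layout mirrors the slide, but the CRC32 hash is a stand-in for whatever hashing Redshift uses, the choice to show an ALL copy on each node's first slice is an assumption, and the `distribute` helper and sample rows are hypothetical.

```python
import zlib

NODES = {"Node 1": ["Slice 1", "Slice 2"], "Node 2": ["Slice 3", "Slice 4"]}
SLICES = [name for node_slices in NODES.values() for name in node_slices]

def distribute(rows, style, key=None):
    """Return {slice name: [rows]} for the ALL, EVEN, or KEY style."""
    placement = {name: [] for name in SLICES}
    for index, row in enumerate(rows):
        if style == "ALL":
            # A full copy of the table on every node (shown on each node's first slice).
            for node_slices in NODES.values():
                placement[node_slices[0]].append(row)
        elif style == "EVEN":
            # Round-robin across slices, regardless of row contents.
            placement[SLICES[index % len(SLICES)]].append(row)
        elif style == "KEY":
            # Rows with the same key value always land on the same slice.
            digest = zlib.crc32(str(row[key]).encode("utf-8"))
            placement[SLICES[digest % len(SLICES)]].append(row)
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return placement

keys = ["keyA", "keyB", "keyA", "keyC", "keyD", "keyA"]
rows = [{"id": number, "customer": value} for number, value in enumerate(keys)]

for style in ("ALL", "EVEN", "KEY"):
    counts = {name: len(held) for name, held in distribute(rows, style, key="customer").items()}
    print(style, counts)
```

KEY keeps matching rows together, which helps joins on that column but can skew slices when one value dominates (keyA here), while EVEN balances storage at the cost of moving rows at join time. That trade-off is what an automated recommendation has to weigh.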
Amazon Redshift intelligent maintenance
Auto Vacuum, Auto Analyze, Auto WLM concurrency setting
Maintenance processes like vacuum and analyze will automatically run in the background.
Amazon Redshift will automatically adjust the WLM concurrency setting to deliver optimal throughput.
Moving towards zero-maintenance.
Coming Soon!
Run stored procedures in Amazon Redshift
Bring your existing stored procedures and run them in Amazon Redshift.
Amazon Redshift will support stored procedures in PL/pgSQL format, enabling you to bring your existing stored procedures to Amazon Redshift.
Migrating to Amazon Redshift is even easier.
Coming Soon!
where the data is to efficiently run ETL, data validation, and custom business logic.
Thank you!
Asif Abbasi
Specialist Solutions Architect
AWS

Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building a Modern Data Platform on AWS

  • 1. Summit Bahrain
  • 2. Building a Modern Data Platform on AWS. Asif Abbasi, Specialist Solutions Architect (EMEA), AWS. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 3. About me: Muhammad Asif Abbasi, Sr. Specialist Solutions Architect - Analytics
      • 20 years of data & analytics background
      • 1+ years at AWS
      • Data Lakes, Hadoop, Visualization, AI & ML
      • Working with AWS service teams
      • Mostly talk about Data & Analytics across EMEA
      • Twitter: @masifabbasi
  • 4. Agenda
      • Data Lakes on AWS
      • Data Ingestion
      • Ad hoc Analytics
      • AWS Glue, Lake Formation and Redshift
      • Q&A
  • 5. Defining the AWS data lake. A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous datasets. Key data lake attributes:
      • Decoupled storage and compute
      • Rapid ingest and transformation
      • Secure multi-tenancy
      • Query in place
      • Schema on read
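"Schema on read" can be illustrated with a minimal sketch, assuming a small CSV object and two hypothetical consumer schemas (`ACCESS_LOG_SCHEMA` and `IP_ONLY_SCHEMA` are invented for illustration and do not come from the slides). The raw text is stored untouched, and each reader decides column names and types only when it reads the data.

```python
import csv
import io

# Raw data lands in the lake untouched; no schema is enforced at write time.
RAW_OBJECT = "10.0.0.1,200,512\n10.0.0.2,404,0\n"

# Hypothetical schemas chosen by two different consumers at read time.
ACCESS_LOG_SCHEMA = [("src_ip", str), ("status", int), ("bytes_sent", int)]
IP_ONLY_SCHEMA = [("src_ip", str)]


def read_with_schema(raw_text, schema):
    """Parse raw CSV text, naming and casting columns per the given schema.

    Columns beyond the end of the schema are ignored (zip stops early).
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


# The same stored bytes yield two different logical tables.
access_rows = read_with_schema(RAW_OBJECT, ACCESS_LOG_SCHEMA)
ip_rows = read_with_schema(RAW_OBJECT, IP_ONLY_SCHEMA)
```

Because the schema lives with the reader rather than the stored object, adding a new interpretation of the data does not require rewriting what is already in storage.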
  • 6. Data lakes help you scale cost effectively
      • Store exabytes of data
      • Stage from landing dock to transformed to curated
      • Make available in each
      • Load, transform, and catalog once
      • Make data available to many tools
      • Open formats and interfaces support innovation
      [Diagram: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams, Amazon S3, Amazon Redshift, Amazon EMR, Athena, Amazon Kinesis, Amazon ES, AI services, Amazon QuickSight]
  • 7. How it works: Data Lakes and analytics on AWS
      • Build data lakes quickly
          • Identify, crawl, and catalog sources
          • Ingest and clean data
          • Transform into optimal formats
      • Simplify security management
          • Enforce encryption
          • Define access policies
          • Implement audit logging
      • Enable self-service and combined analytics
          • Analysts discover all data available for analysis from a single data catalog
          • Use multiple analytics tools over the same data
      [Diagram: sources (OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social, Kinesis); S3, IAM, AWS KMS; Data Catalog on Amazon S3; Athena, Amazon Redshift, AI services, Amazon EMR, Amazon QuickSight]
  • 8. Why Amazon S3 for the data lake? High performance; Secure; Durable; Available; Easy to use; Scalable & affordable; Integrated
  • 9. Amazon Kinesis: real time. Easily collect, process, and analyze video and data streams in real time.
      • Kinesis Video Streams: capture, process, and store video streams for analytics
      • Kinesis Data Streams: build custom applications that analyze data streams
      • Kinesis Data Firehose: load data streams into AWS data stores
      • Kinesis Data Analytics: analyze data streams with SQL
  • 10. Processing and querying in place
      • Fully Managed Process & Query
          • Catalog, transform, & query data in Amazon S3
          • No physical instances to manage
      • User-Defined Functions (AWS Lambda Function)
          • Bring your own functions & code
          • Execute without provisioning servers
  • 11. Amazon S3 Select and Amazon S3 Glacier Select: select a subset of data from an object based on a SQL expression
  • 12. Motivation behind Amazon S3 Select. Without it: "GET all the data from Amazon S3 objects and my application will filter the data that I need." Amazon Redshift Spectrum example:
      • Customer: ran 50,000 queries
      • Amount of data fetched from S3: 6 PB
      • Amount of data used in Amazon Redshift: 650 TB
      • Data needed from Amazon S3: 10%
  • 13. Amazon S3 Select: Serverless MapReduce. 2X faster at 1/5 of the cost.
      Before: 200 seconds and 11.2 cents

          # Download and process all keys
          for key in src_keys:
              response = s3_client.get_object(Bucket=src_bucket, Key=key)
              contents = response['Body'].read()
              for line in contents.split('\n')[:-1]:
                  line_count += 1
                  try:
                      data = line.split(',')
                      srcIp = data[0][:8]
                      ….

      After: 95 seconds and 2.8 cents

          # Select IP Address and Keys
          for key in src_keys:
              response = s3_client.select_object_content(
                  Bucket=src_bucket, Key=key,
                  Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj")
              contents = response['Body'].read()
              for line in contents:
                  line_count += 1
                  try:
                      ….
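The slide's "after" snippet is abbreviated. A fuller sketch of the same S3 Select call is shown below, assuming the boto3 `select_object_content` request shape (SQL expression plus input and output serialization) and its event-stream response. `FakeS3Client`, the bucket name, and the key are hypothetical stand-ins so the example runs without AWS access; they are not part of the original talk.

```python
def select_ip_prefixes(s3_client, bucket, key):
    """Return CSV text with the first 8 characters of column 1, plus column 2."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    # The payload is an event stream; only "Records" events carry result bytes.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join before decoding so a multi-byte character split across chunks is safe.
    return b"".join(chunks).decode("utf-8")


class FakeS3Client:
    """Hypothetical stand-in for a boto3 S3 client, returning canned events."""

    def select_object_content(self, **kwargs):
        self.last_request = kwargs
        return {
            "Payload": [
                {"Records": {"Payload": b"10.0.0.1,GET\n"}},
                {"Stats": {"Details": {}}},
                {"End": {}},
            ]
        }


fake_client = FakeS3Client()
filtered_csv = select_ip_prefixes(fake_client, "example-bucket", "logs/part-0000.csv")
```

With real credentials, passing `boto3.client("s3")` in place of `FakeS3Client()` should issue the same request; the filtering then happens inside S3, so only the selected columns cross the network.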
  • 14. Amazon Athena: interactive analysis. Interactive query service to analyze data in Amazon S3 using standard SQL.
      • Query instantly: no infrastructure to set up or manage and no data to load
      • Pay per query
      • Open: supports multiple data formats; define schema on demand
      • Easy
  • 15. Choosing the right data formats. There is no such thing as the "best" data format.
      • All involve tradeoffs, depending on workload & tools
      • CSV, TSV, JSON are easy, but not efficient
          • Compress & store/archive as raw input
      • Columnar compressed formats are generally preferred
          • Parquet or ORC
          • Smaller storage footprint = lower cost
          • More efficient scan & query
      • Row oriented (AVRO) is good for full data scans
      Key considerations are cost, performance, & support.
  • 16. Choosing the right data formats (cont'd). Pay by the amount of data scanned per query. Use compressed columnar formats (Parquet, ORC), which are easy to integrate with a wide variety of tools.

      Data set                              | Size on Amazon S3     | Query run time | Data scanned          | Cost
      Logs stored as text files             | 1 TB                  | 237 seconds    | 1.15 TB               | $5.75
      Logs stored in Apache Parquet format* | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
      Savings                               | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper
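The cost column above can be reproduced with a short calculation. This is a sketch that assumes a flat rate of $5 per TB scanned, which is the rate implied by the table's first row (1.15 TB scanned for $5.75), and 1 TB = 1024 GB; the constant and function names are illustrative only.

```python
# Assumed rate, implied by the table: 1.15 TB scanned costs $5.75.
PRICE_PER_TB_SCANNED_USD = 5.0
GB_PER_TB = 1024


def scan_cost_usd(tb_scanned):
    """Cost of a query that scans the given number of terabytes."""
    return tb_scanned * PRICE_PER_TB_SCANNED_USD


text_cost = scan_cost_usd(1.15)                 # logs stored as text files
parquet_cost = scan_cost_usd(2.69 / GB_PER_TB)  # same logs stored as Parquet

# Fraction of the text-format cost avoided by scanning Parquet instead.
cost_reduction = 1 - parquet_cost / text_cost

print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.3f}, "
      f"reduction: {cost_reduction:.1%}")
```

Because billing follows bytes scanned rather than bytes stored, the saving comes from both compression and from reading only the columns a query touches.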
  • 17. Data prep is ~80% of data lake work. [Chart categories: Building training sets; Cleaning and organizing data; Collecting data sets; Mining data for patterns; Refining algorithms; Other]
  • 18. AWS Glue: Serverless data catalog & ETL
      • Data Catalog: discover data and extract schema
      • ETL job authoring: auto-generates customizable ETL code in Python and Spark
      • Automatically discovers data and stores schema
      • Data searchable and available for ETL
      • Generates customizable code
      • Schedules and runs your ETL jobs
      • Serverless
  • 19. AWS Lake Formation: build, secure, and manage a data lake in days
      • Build a data lake in days, not months: build and deploy a fully managed data lake with a few clicks
      • Enforce security policies across multiple services: centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
      • Combine different analytics approaches: empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
  • 20. Traditionally, analytics looked like this…
      • Relational data
      • GBs-TBs scale [not designed for PB/EBs]
      • Expensive: large initial capex + $10k-$50k/TB/year
      • 90% of data was thrown away because of cost
      [Diagram: OLTP, ERP, CRM, LOB → Data warehouse → Business intelligence]
  • 21. Data lakes evolve the traditional approach
      • Relational and non-relational data
      • TBs-EBs scale
      • Schema defined during analysis
      • Diverse analytical engines to gain insights
      • Designed for low-cost storage and analytics
      [Diagram: OLTP, ERP, CRM, LOB → Data warehouse → Business intelligence; Devices, Web, Sensors, Social → Data lake with Catalog → Machine learning, DW queries, Big data processing, Interactive, Real-time]
  • 22. What does data warehouse modernization mean?
      • Easy to use: don't waste time on menial administrative tasks and maintenance
      • Extends to your data lake: directly analyze data stored in your data lake in open formats
      • Any scale of data, workloads, and users: dynamically scale up to guarantee performance even with unpredictable demands and data volumes
      • Faster time-to-insights: consistently fast performance, even with thousands of concurrent queries and users
  • 23. Amazon Redshift: fast, simple, cost-effective data warehouse that can extend queries to your data lake
      • Fastest: get faster time-to-insight for all types of analytics workloads; powered by machine learning, columnar storage, and MPP
      • Unlimited scale: dynamically scale up to guarantee performance even with unpredictable analytical demands and data volumes
      • Extends your data lake: analyze data in the Amazon S3 data lake in-place and in open formats, together with data loaded into Amazon Redshift's high performance SSDs; analyze data in open formats such as Parquet, ORC, and JSON, using SQL tools
      • 1/10th the cost: start at $0.25 per hour, save costs with automated administration tasks, and eliminate business impact due to downtime; as low as $1,000 per terabyte per year
  • 24. Amazon Redshift architecture
      • Leader Node
          • Simple SQL end point (JDBC/ODBC)
          • Stores metadata
          • Optimizes query plan
          • Coordinates query execution
      • Compute Nodes (interconnected over 10 GigE (HPC))
          • Local columnar storage
          • Parallel/distributed execution of all queries, loads, backups, restores, resizes
      • Start at just $0.25/hour
          • DC1: SSD; scale from 160 GB to 326 TB
          • DS2: HDD; scale from 2 TB to 2 PB
      [Diagram: JDBC/ODBC clients → Leader Node → Compute Nodes; Ingestion, Backup, Restore]
  • 25. Security is built-in
      • Select compliance certifications*
      • Network isolation
      • End-to-end encryption
      • Integration with AWS Key Management Service (AWS KMS)
      [Diagram: Customer VPC (JDBC/ODBC) → Internal VPC with Leader Node and Compute Nodes over 10 GigE (HPC); Amazon S3]
  • 26. Concurrency Scaling for bursts of user activity
      • Automatically creates more clusters on-demand
      • Consistently fast performance even with thousands of concurrent queries
      • No advance hydration required
      • Quickly scale to serve changing query workload
      [Diagram: caching layer; backup to Redshift-managed Amazon S3]
  • 27. Amazon Redshift elastic resize (GA)
      • Adds additional nodes to the Redshift cluster
      • Distributes data across the new configuration in minutes
      • Minimal transition time
      • Scale compute and storage on-demand
      • Scale up and down in minutes
      [Diagram: JDBC/ODBC → Leader Node → compute nodes CN1, CN2, CN3, CN4 in an Amazon Redshift cluster; backup to Redshift-managed Amazon S3]
  • 28. Amazon Redshift intelligent administration (Coming Soon!). Automates data distribution in tables for improved performance and disk space utilization. Provides intelligent recommendations for tuning based on continuous workload analysis. No more messing with distkeys.
      [Diagram: ALL, EVEN, and KEY distribution of rows keyA, keyB, keyC, keyD across Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4); Advise: recommended distribution key]
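The three distribution styles named on this slide can be sketched as toy placement functions. This is an illustration only: the CRC32 hash is an assumption standing in for whatever hash Redshift uses internally, and the row layout and function names are hypothetical.

```python
import zlib


def distribute_even(rows, num_slices):
    """EVEN: spread rows round-robin across slices, ignoring their values."""
    slices = [[] for _ in range(num_slices)]
    for index, row in enumerate(rows):
        slices[index % num_slices].append(row)
    return slices


def distribute_key(rows, num_slices, key):
    """KEY: rows sharing a distribution-key value land on the same slice.

    CRC32 is an illustrative hash, not the one Redshift actually uses.
    """
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_slices
        slices[bucket].append(row)
    return slices


def distribute_all(rows, num_nodes):
    """ALL: every node holds a full copy of the table."""
    return [list(rows) for _ in range(num_nodes)]


rows = [{"k": "keyA"}, {"k": "keyB"}, {"k": "keyC"}, {"k": "keyD"}, {"k": "keyA"}]
even_slices = distribute_even(rows, num_slices=4)
key_slices = distribute_key(rows, num_slices=4, key="k")
all_nodes = distribute_all(rows, num_nodes=2)
```

KEY placement keeps matching join values on the same slice at the risk of skew, EVEN balances storage without co-locating anything, and ALL trades storage for join locality on small tables; choosing among them is the tuning decision the slide says will be automated.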
  • 29. Amazon Redshift intelligent maintenance (Coming Soon!). Moving towards zero-maintenance.
      • Auto Vacuum and Auto Analyze: maintenance processes like vacuum and analyze will automatically run in the background
      • Auto WLM concurrency setting: Amazon Redshift will automatically adjust the WLM concurrency setting to deliver optimal throughput
  • 30. Run stored procedures in Amazon Redshift (Coming Soon!). Bring your existing stored procedures and run them in Amazon Redshift. Amazon Redshift will support stored procedures in PL/pgSQL format, enabling you to bring your existing stored procedures to Amazon Redshift. Migrating to Amazon Redshift is even easier. [...] where the data is to efficiently run ETL, data validation, and custom business logic.
  • 31. Thank you! Asif Abbasi, Specialist Solutions Architect, AWS