Building a Modern Data Platform on AWS
- 1. S U M M I T B A H R A I N
- 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Building a Modern Data Platform
on AWS
Asif Abbasi
Specialist Solutions Architect (EMEA)
AWS
- 3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
About me
Muhammad Asif Abbasi
Sr. Specialist Solutions Architect - Analytics
• 20 years of data & analytics background
• 1+ years at AWS
• Data Lakes, Hadoop, Visualization, AI & ML
• Working with AWS service teams
• Mostly talk about Data & Analytics across EMEA
Twitter: @masifabbasi
- 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Agenda
Data Lakes on AWS
Data Ingestion
Ad hoc Analytics
AWS Glue, Lake Formation and Redshift
Q&A
- 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Defining the AWS data lake
A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous datasets.
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
- 6. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Data lakes help you scale cost effectively
• Store exabytes of data: stage it from landing dock to transformed to curated, and make it available at each stage
• Load, transform, and catalog once; make data available to many tools
• Open formats and interfaces support innovation
Services shown: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams, Amazon S3, Amazon Redshift, Amazon EMR, Athena, Amazon Kinesis, Amazon ES, AI services, Amazon QuickSight
- 7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
How it works: Data Lakes and analytics on AWS
Sources (OLTP, ERP, CRM, LOB, devices, web, sensors, social) are ingested through Kinesis into Amazon S3, secured with IAM and AWS KMS
Build data lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats
Simplify security management
• Enforce encryption
• Define access policies
• Implement audit logging
Enable self-service and combined analytics
• Analysts discover all data available for analysis
from a single data catalog
• Use multiple analytics tools over the same data
Analytics services: Athena, Amazon Redshift, AI services, Amazon EMR, and Amazon QuickSight, all working from a shared Data Catalog over Amazon S3
- 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Why Amazon S3 for the data lake?
• Durable
• Available
• Secure
• High performance
• Easy to use
• Scalable & affordable
• Integrated
- 9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon Kinesis—real time
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams: capture, process, and store video streams for analytics
• Kinesis Data Streams: build custom applications that analyze data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL
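As a concrete illustration of the ingestion side, here is a minimal sketch of a Python producer writing events into a Kinesis Data Stream with boto3; the stream name and record fields are hypothetical.

import json
import boto3

kinesis = boto3.client('kinesis')

# One telemetry event; the partition key groups records from the same device
event = {'device_id': 'sensor-42', 'temperature': 21.7}
kinesis.put_record(
    StreamName='iot-telemetry',
    Data=json.dumps(event).encode('utf-8'),
    PartitionKey=event['device_id'])

Kinesis Data Firehose could then deliver the same stream to Amazon S3 without any custom consumer code.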
- 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Processing and querying in place
User-defined functions (AWS Lambda)
• Bring your own functions & code
• Execute without provisioning servers
Fully managed process & query
• Catalog, transform, & query data in Amazon S3
• No physical instances to manage
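To make "bring your own code without provisioning servers" concrete, here is a minimal sketch of an AWS Lambda handler that reacts to new objects landing in S3 and writes a filtered copy back; the bucket names and the ERROR filter are hypothetical.

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # Invoked by an S3 "object created" event notification
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        # Keep only the lines we care about and stage them for analytics
        errors = [line for line in body.splitlines() if 'ERROR' in line]
        s3.put_object(
            Bucket='my-curated-bucket',
            Key='errors/' + key,
            Body='\n'.join(errors).encode('utf-8'))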
- 11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon S3 Select and Amazon S3 Glacier Select
Select a subset of data from an object based on a SQL expression
- 12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Motivation behind Amazon S3 Select
"GET all the data from Amazon S3 objects, and my application will filter down to the data I need."
Amazon Redshift Spectrum example:
Customer: ran 50,000 queries
Amount of data fetched from S3: 6 PB
Amount of data used in Amazon Redshift: 650 TB
Data needed from Amazon S3: 10%
- 13. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon S3 Select: Serverless MapReduce

Before: 200 seconds and 11.2 cents
# Download and process all keys
for key in src_keys:
    response = s3_client.get_object(Bucket=src_bucket, Key=key)
    contents = response['Body'].read().decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            data = line.split(',')
            srcIp = data[0][:8]
            ….

After: 95 seconds and 2.8 cents
# Select the IP address prefix and second column server-side with S3 Select
for key in src_keys:
    response = s3_client.select_object_content(
        Bucket=src_bucket, Key=key,
        ExpressionType='SQL',
        Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj",
        InputSerialization={'CSV': {}},
        OutputSerialization={'CSV': {}})
    # The response is an event stream; matching rows arrive in 'Records' events
    for event in response['Payload']:
        if 'Records' in event:
            for line in event['Records']['Payload'].decode('utf-8').splitlines():
                line_count += 1
                try:
                    ….

2X faster at 1/5 of the cost
- 14. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon Athena—interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Supports multiple data formats – define schema on demand
Query instantly • Pay per query • Open • Easy
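A minimal sketch of running an Athena query from Python with boto3; the database, table, and results bucket are hypothetical placeholders.

import time
import boto3

athena = boto3.client('athena')

run = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM weblogs GROUP BY status",
    QueryExecutionContext={'Database': 'my_datalake_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'})

query_id = run['QueryExecutionId']
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    # The first returned row carries the column labels
    for row in athena.get_query_results(
            QueryExecutionId=query_id)['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])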
- 15. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Choosing the right data formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, JSON are easy, but not efficient
• Compress & store/archive as raw input
• Columnar compressed are generally preferred
• Parquet or ORC
• Smaller storage footprint = lower cost
• More efficient scan & query
• Row-oriented formats (Avro) are good for full data scans
Key considerations are cost, performance, & support
- 16. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Choosing the right data formats (cont'd.)
Pay by the amount of data scanned per query
Use compressed columnar formats
• Parquet
• ORC
Easy to integrate with wide variety of tools
Data set                              | Size on Amazon S3      | Query run time | Data scanned          | Cost
Logs stored as text files             | 1 TB                   | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format* | 130 GB                 | 5.13 seconds   | 2.69 GB               | $0.013
Savings                               | 87% less with Parquet  | 34x faster     | 99% less data scanned | 99.7% cheaper
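One common way to produce the Parquet copy used in the comparison above is an Athena CTAS statement; this is a sketch under assumed table and bucket names.

import boto3

athena = boto3.client('athena')

# Rewrite the raw text logs as compressed, columnar Parquet
ctas = """
CREATE TABLE weblogs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-datalake/curated/weblogs_parquet/'
) AS
SELECT * FROM weblogs_text
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'my_datalake_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'})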
- 17. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Data prep is ~80% of data lake work
• Building training sets
• Cleaning and organizing data
• Collecting data sets
• Mining data for patterns
• Refining algorithms
• Other
- 18. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Glue—Serverless data catalog & ETL
Data Catalog: discover data and extract schema
ETL job authoring: auto-generates customizable ETL code in Python and Spark
• Automatically discovers data and stores schema
• Data searchable and available for ETL
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
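Below is a minimal sketch of the kind of ETL script Glue generates and runs: it reads a cataloged table and writes it back to S3 as Parquet. The database, table, and output path are hypothetical, and the awsglue imports are only available inside a Glue job environment.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the table a crawler registered in the Glue Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database='my_datalake_db', table_name='weblogs_text')

# Write it to the curated zone in a columnar format
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type='s3',
    connection_options={'path': 's3://my-datalake/curated/weblogs_parquet/'},
    format='parquet')

job.commit()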
- 19. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Lake Formation
Build, secure, and manage a data lake in days
• Build a data lake in days, not months: build and deploy a fully managed data lake with a few clicks
• Enforce security policies across multiple services: centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
• Combine different analytics approaches: empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
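As a sketch of centrally enforced access control, here is a Lake Formation permission grant from Python; the account, role, database, and table names are hypothetical.

import boto3

lakeformation = boto3.client('lakeformation')

# Allow an analyst role to SELECT from one cataloged table only
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier':
               'arn:aws:iam::111122223333:role/AnalystRole'},
    Resource={'Table': {'DatabaseName': 'my_datalake_db',
                        'Name': 'weblogs_parquet'}},
    Permissions=['SELECT'])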
- 20. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Traditionally, analytics looked like this…
Expensive: large initial capex plus $10K to $50K/TB/year
GBs-TBs scale [not designed for PB/EBs]
Relational data
90% of data was thrown away because of cost
OLTP ERP CRM LOB
Data warehouse
Business intelligence
- 21. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Data lakes evolve the traditional approach
OLTP ERP CRM LOB
Data warehouse
Business
intelligence
Data lake
Sources: devices, web, sensors, social
Catalog
Analytics: machine learning, DW queries, big data processing, interactive, real-time
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
- 22. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
What does data warehouse modernization mean?
• Easy to use: don't waste time on menial administrative tasks and maintenance
• Extends to your data lake: directly analyze data stored in your data lake in open formats
• Any scale of data, workloads, and users: dynamically scale up to guarantee performance even with unpredictable demands and data volumes
• Faster time-to-insights: consistently fast performance, even with thousands of concurrent queries and users
- 23. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon Redshift
Fast, simple, cost-effective data warehouse that can extend queries to your data lake
• Fastest: get faster time-to-insight for all types of analytics workloads, powered by machine learning, columnar storage, and MPP
• Unlimited scale: dynamically scale up to guarantee performance even with unpredictable analytical demands and data volumes
• Extends your data lake: analyze data in the Amazon S3 data lake in place and in open formats such as Parquet, ORC, and JSON, using SQL tools, together with data loaded into Amazon Redshift's high-performance SSDs
• 1/10th the cost: start at $0.25 per hour, save costs with automated administration tasks, and eliminate business impact due to downtime; as low as $1,000 per terabyte per year
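A sketch of "extends your data lake" in practice: registering the Glue Data Catalog as an external schema and querying S3 data with Redshift Spectrum. The SQL is standard Spectrum syntax; the cluster, IAM role, database, and table names are hypothetical, and the statements are submitted through the Redshift Data API.

import boto3

redshift_data = boto3.client('redshift-data')

external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS datalake
FROM DATA CATALOG DATABASE 'my_datalake_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole'
"""

query = """
SELECT w.status, COUNT(*) AS hits
FROM datalake.weblogs_parquet w
GROUP BY w.status
"""

for sql in (external_schema, query):
    redshift_data.execute_statement(
        ClusterIdentifier='my-redshift-cluster',
        Database='dev',
        DbUser='awsuser',
        Sql=sql)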
- 24. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon Redshift architecture
Leader Node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute Nodes
Local columnar storage
Parallel/distributed execution of all queries,
loads, backups, restores, resizes
Start at just $0.25/hour
DC1: SSD; scale from 160 GB to 326 TB
DS2: HDD; scale from 2 TB to 2 PB
Diagram: JDBC/ODBC clients connect to the leader node; compute nodes communicate over a 10 GigE (HPC) interconnect; ingestion, backup, and restore run against Amazon S3
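For the ingestion path shown above, here is a sketch of loading S3 data into local Redshift storage with COPY, again via the Redshift Data API; the table, bucket, and role are hypothetical.

import boto3

redshift_data = boto3.client('redshift-data')

copy_sql = """
COPY sales
FROM 's3://my-datalake/raw/sales/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
FORMAT AS PARQUET
"""

# Each compute node slice loads files in parallel from Amazon S3
redshift_data.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=copy_sql)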
- 25. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Security is built-in
Select compliance certifications*
Diagram: JDBC/ODBC access through the customer VPC to the leader node; compute nodes isolated in an internal VPC with a 10 GigE (HPC) interconnect; Amazon S3 for backup and restore
• Network isolation
• End-to-end encryption
• Integration with AWS Key Management Service (AWS KMS)
Concurrency Scaling for bursts of user activity
• Automatically creates more clusters on demand
• Consistently fast performance even with thousands of concurrent queries
• No advance hydration required
• Quickly scale to serve changing query workloads
Diagram: caching layer in front of clusters backed by Redshift managed Amazon S3 (backup)
- 27. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon Redshift elastic resize (GA)
• Adds additional nodes to the Redshift cluster
• Distributes data across the new configuration in minutes
• Minimal transition time
• Scale compute and storage on demand
• Scale up and down in minutes
Diagram: JDBC/ODBC to the leader node of an Amazon Redshift cluster with compute nodes CN1-CN4, backed by Redshift managed Amazon S3 (backup)
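A sketch of triggering an elastic resize from Python; the cluster identifier and target size are hypothetical.

import boto3

redshift = boto3.client('redshift')

# Classic=False requests an elastic resize, which redistributes data across
# the new configuration in minutes rather than hours
redshift.resize_cluster(
    ClusterIdentifier='my-redshift-cluster',
    NumberOfNodes=8,
    Classic=False)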
- 28. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon Redshift intelligent administration
• Automates data distribution in tables for improved performance and disk space utilization
• Provides intelligent recommendations for tuning based on continuous workload analysis
Diagram: distribution styles ALL, EVEN, and KEY spread data across node slices, with a recommended distribution key
No more messing with dist keys.
Coming Soon!
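For context on what this automation replaces, here is a sketch of declaring a distribution style and key by hand in standard Redshift DDL, submitted through the Redshift Data API; the table and cluster names are hypothetical.

import boto3

redshift_data = boto3.client('redshift-data')

ddl = """
CREATE TABLE sales_by_customer (
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12,2))
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date)
"""

redshift_data.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=ddl)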
- 29. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Amazon Redshift intelligent maintenance
• Auto Vacuum and Auto Analyze: maintenance processes like vacuum and analyze will automatically run in the background
• Auto WLM concurrency setting: Amazon Redshift will automatically adjust the WLM concurrency setting to deliver optimal throughput
Moving towards zero maintenance.
Coming Soon!
- 30. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Run stored procedures in Amazon Redshift
Bring your existing stored procedures and run them in Amazon Redshift, where the data is, to efficiently run ETL, data validation, and custom business logic.
Amazon Redshift will support stored procedures in PL/pgSQL format, enabling you to bring your existing stored procedures to Amazon Redshift.
Migrating to Amazon Redshift is even easier.
Coming Soon!
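A sketch of the kind of PL/pgSQL procedure this enables, created and called through the Redshift Data API; the procedure, tables, and cluster details are hypothetical.

import boto3

redshift_data = boto3.client('redshift-data')

create_proc = """
CREATE OR REPLACE PROCEDURE refresh_daily_sales(load_date DATE)
AS $$
BEGIN
    -- Re-load one day of aggregates: ETL and validation run next to the data
    DELETE FROM daily_sales WHERE sale_date = load_date;
    INSERT INTO daily_sales
    SELECT sale_date, SUM(amount)
    FROM sales
    WHERE sale_date = load_date
    GROUP BY sale_date;
END;
$$ LANGUAGE plpgsql
"""

for sql in (create_proc, "CALL refresh_daily_sales(CURRENT_DATE)"):
    redshift_data.execute_statement(
        ClusterIdentifier='my-redshift-cluster',
        Database='dev',
        DbUser='awsuser',
        Sql=sql)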
- 31. Thank you!
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Asif Abbasi
Specialist Solutions Architect
AWS
- 32. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.