Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon QuickSight - AWS Online Tech Talks

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ben Snively, Specialist Solutions Architect – Data and Analytics
October 12, 2017

Agenda
• What is Serverless?
• Enterprise Data Warehouse on AWS (Amazon Redshift)
• Serverless Queries from your Data Warehouse (Redshift Spectrum)
• Serverless Data Catalog (AWS Glue)
• Serverless ETL (AWS Glue)
• Serverless BI (Amazon QuickSight)
• Demonstration
• Wrap up

What is Serverless?
• Virtualized: You can easily provision servers and focus on the OS and above.
• Managed: You focus higher in the stack but still need to consider servers, how much CPU is needed, and how much RAM. AWS manages the service based on the customer configuration.
• Serverless: Build applications and services without thinking of servers. Don’t be concerned about provisioning, scaling, and maintaining servers for fault tolerance and availability. AWS does all of this for you.

Serverless characteristics
• No servers to provision or manage
• Scales with usage
• Never pay for idle resources
• Availability and fault tolerance built in

Amazon Redshift: Enterprise Data Warehouse
a lot faster, a lot simpler, a lot cheaper
• Managed Massively Parallel Petabyte Scale Data Warehouse
• Streaming Backup/Restore to S3
• Load data from S3, DynamoDB and EMR
• Extensive Security Features
• Online Scaling from 160 GB -> 2 PB

Selected Amazon Redshift customers

We innovate quickly
Well over 140 new features added since launch
Release every two weeks
Automatic patching
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE ciphers (4/22)
3 new regex features, Unload to single file, FedRAMP (5/6)
Rename Cluster (6/2)
Copy from multiple regions, percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)

Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast @ exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: Multiple clusters access same data
• No ETL: Query data in-place using open file formats
• Full Amazon Redshift SQL support
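As a rough illustration of querying data in place, the sketch below builds the DDL that maps a Data Catalog database to an external schema in Amazon Redshift, plus a query against an external table. The schema name `s3` and table `ext_table` follow the query shown later in the deck; the database name, account ID, and role name are placeholders assumed for the example.

```python
def external_schema_ddl(schema, catalog_database, iam_role_arn):
    """Return Redshift DDL that maps a Data Catalog database to an external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


# Placeholder values: the database, account ID, and role name are assumptions.
ddl = external_schema_ddl(
    "s3",
    "example_db",
    "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)

# Once the schema exists, external tables are queried like local ones.
query = "SELECT COUNT(*) FROM s3.ext_table;"

print(ddl)
print(query)
```

Both statements would be sent over the normal JDBC/ODBC connection to the cluster; nothing is loaded into Redshift storage first.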

Redshift Spectrum
[Architecture diagram. Labels: JDBC/ODBC; Customer VPC; Redshift Nodes; 10 GigE (HPC); Internal VPC; Spectrum Nodes 1 2 3 4 ... N; Amazon S3 (exabyte-scale object storage); AWS Glue Data Catalog; Apache Hive Metastore; Ingestion, Backup, Restore]
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against S3 data
• Efficient join processing within the Amazon Redshift cluster
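To make partition pruning concrete, here is a minimal toy model in plain Python. It is not how Redshift Spectrum is implemented; the `Partition` class, the bucket name, and the year/month layout are all assumptions made for the example. The idea it shows is that a filter on partition keys can be checked against catalog metadata alone, so files in non-matching partitions are never read.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """One catalog entry: partition key values plus the S3 prefix of its files."""
    values: dict
    location: str


def prune_partitions(partitions, predicate):
    """Keep only partitions whose key values satisfy the predicate."""
    return [p for p in partitions if predicate(p.values)]


# Hypothetical catalog listing for a table partitioned by year and month.
catalog = [
    Partition(
        {"year": 2017, "month": month},
        f"s3://example-bucket/sales/year=2017/month={month:02d}/",
    )
    for month in range(1, 13)
]

# Equivalent of: WHERE year = 2017 AND month >= 10
selected = prune_partitions(
    catalog, lambda v: v["year"] == 2017 and v["month"] >= 10
)

print(len(selected), "of", len(catalog), "partitions need to be scanned")
```

Only the S3 prefixes in `selected` would be handed to the scan layer, which is where the reduction in data processed comes from.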

Life of a query
[Diagram, repeated for each step. Labels: JDBC/ODBC; Amazon Redshift; Spectrum nodes 1 2 3 4 ... N; Amazon S3; Glue Data Catalog]
1. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. Query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum
3. Query plan is sent to all compute nodes
4. Compute nodes obtain partition info from Data Catalog; dynamically prune partitions
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
6. Amazon Redshift Spectrum nodes scan your S3 data
7. Amazon Redshift Spectrum projects, filters, and aggregates
8. Final aggregations and joins with local Amazon Redshift tables done in-cluster
9. Result is sent back to client
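The split between the Spectrum layer (scan, filter, project, partially aggregate) and the cluster (final aggregation) can be mimicked with a small pure-Python simulation. This is only a sketch of the idea, with made-up rows and function names; it says nothing about the actual execution engine.

```python
from collections import Counter


def spectrum_scan(rows, wanted_status):
    """Spectrum-layer work for one S3 object: filter, project, partially aggregate."""
    partial = Counter()
    for row in rows:
        if row["status"] == wanted_status:   # filter pushed down to the scan
            partial[row["region"]] += 1      # keep only the grouping column, count rows
    return partial


def final_aggregate(partials):
    """In-cluster work: merge the partial counts returned by the scan layer."""
    total = Counter()
    for partial in partials:
        total.update(partial)                # Counter.update adds counts together
    return dict(total)


# Two hypothetical S3 objects, each scanned independently and in parallel in practice.
s3_objects = [
    [{"region": "us", "status": "ok"}, {"region": "eu", "status": "ok"}],
    [{"region": "us", "status": "ok"}, {"region": "us", "status": "error"}],
]

partials = [spectrum_scan(rows, "ok") for rows in s3_objects]
result = final_aggregate(partials)
print(result)
```

Each scan returns a handful of counts rather than raw rows, so the cluster only merges small partial results before sending the answer to the client.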

AWS Glue
• Discover: Automatically discovers and categorizes your data to make it immediately searchable and queryable
• Develop: Generates code to clean, enrich, and reliably move data between data stores; you can also use your favorite tools to build ETL jobs
• Deploy: Runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources

AWS Glue: Components
Data Catalog
• Apache Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
• Runs jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring, and alerting
Job Authoring
• Auto-generates ETL code
• Built on open frameworks – Python and Spark
• Developer-centric – editing, debugging, sharing

AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.)
into a single categorized list that is searchable
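A crawler that populates the Data Catalog can also be defined programmatically. The sketch below only assembles the request parameters for the Glue CreateCrawler API (`Name`, `Role`, `DatabaseName`, `Targets`, `Schedule`); the crawler name, role ARN, database, bucket path, and schedule are hypothetical values chosen for the example.

```python
def crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,                                # IAM role the crawler assumes
        "DatabaseName": database,                        # catalog database for new tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},   # data to crawl
    }
    if schedule is not None:
        request["Schedule"] = schedule                   # cron expression
    return request


# All values below are placeholders, not taken from the talk.
request = crawler_request(
    name="example-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",
)
print(request)

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

Keeping the request as a plain dictionary makes it easy to review or version before anything is created in the account.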

Crawlers: Classifiers
[Diagram. Labels: IAM Role; Glue Crawler; Data Lakes; Data Warehouse; Databases; Amazon RDS; Amazon Redshift; Amazon S3; JDBC Connection; Object Connection]
Built-In Classifiers
• MySQL, MariaDB, PostgreSQL, Aurora, SQL Server / Oracle, Redshift
• Avro, Parquet, ORC, JSON & BSON
• Logs (Apache, Linux, MS, Ruby, Redis, and many others)
• Delimited (comma, pipe, tab, semicolon)
• Compressed Formats (ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional Custom Classifiers with Grok!
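For readers new to Grok, a pattern such as `%{WORD:method}` names a reusable expression and the field to extract. The toy expander below turns such references into named regex groups. The three base patterns are simplified stand-ins assumed for the example (the standard Grok definitions are stricter), and the log line is made up.

```python
import re

# Simplified stand-ins for a few standard Grok patterns (assumed for this sketch).
BASE_PATTERNS = {
    "WORD": r"\w+",
    "INT": r"[+-]?\d+",
    "NOTSPACE": r"\S+",
}


def grok_to_regex(grok):
    """Expand %{NAME:field} references into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{BASE_PATTERNS[name]})"

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, grok))


pattern = grok_to_regex(
    "%{NOTSPACE:host} %{WORD:method} %{NOTSPACE:path} %{INT:status}"
)
match = pattern.fullmatch("web-01 GET /index.html 200")
print(match.groupdict())
```

A custom classifier works along the same lines: if the pattern matches the sampled data, the named fields become the columns of the inferred schema.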

Job authoring in AWS Glue
You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue

Job authoring: Automatic code generation
1. Customize the mappings
2. Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code

Job authoring: Relationalize() transform
Semi-structured schema -> Relational schema
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
[Diagram: a semi-structured schema with fields A, B, C (X, Y), and array D [ ] becomes a relational schema with columns A, B, B, C.X, C.Y and a foreign key (FK), plus a table with PK, Offset, and Value columns]
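The effect of relationalizing a nested record can be sketched in plain Python. This is an illustration of the idea only, not the Glue transform or its API: nested structs become dotted columns in a root table, and each array moves to a child table keyed back to its parent row. The table and column names (`root`, `id`, `offset`, `value`) are assumptions made for the example.

```python
def relationalize(records, root="root"):
    """Flatten nested dicts into dotted columns and move lists into child tables."""
    tables = {root: []}
    for record_id, record in enumerate(records):
        row = {}
        _flatten(record, "", row, tables, root, record_id)
        tables[root].append(row)
    return tables


def _flatten(value, prefix, row, tables, root, record_id):
    for key, item in value.items():
        column = f"{prefix}{key}"
        if isinstance(item, dict):
            # Nested struct: recurse, producing columns such as "C.X".
            _flatten(item, column + ".", row, tables, root, record_id)
        elif isinstance(item, list):
            # Array: store a foreign key in the root row, elements in a child table.
            row[column] = record_id
            child = tables.setdefault(f"{root}_{column}", [])
            for offset, element in enumerate(item):
                child.append({"id": record_id, "offset": offset, "value": element})
        else:
            row[column] = item


tables = relationalize([{"A": 1, "C": {"X": 2, "Y": 3}, "D": [10, 20]}])
print(tables["root"])
print(tables["root_D"])
```

The resulting tables can be joined on the key column with ordinary SQL, which is what makes the relational form convenient to load into a warehouse.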

Serverless ETL to populate your warehouse

Amazon QuickSight is a Business Analytics Service that lets business users
quickly and easily visualize, explore, and share insights from their data.

Basic Concepts
[Diagram. Labels: Retail Data; Ops Data; Marketing Data; Relational Databases; Flat Files; More data sources coming soon!; Microsoft Active Directory; Local User Definition]

Deep Integration with AWS Data Sources
QuickSight is deeply integrated with AWS data sources like Redshift, RDS, S3, Athena and others, as well as third-party sources like Excel and Salesforce, and on-premises databases.
[Diagram. Labels: Amazon RDS, Aurora; Amazon Redshift; Amazon Athena; Amazon S3; Flat Files]

Super-fast Performance with SPICE

What did we cover…
• AWS Glue: Automatically discovers and categorizes your data to make it immediately searchable and queryable
• Amazon QuickSight: Business Analytics Service that lets business users quickly and easily visualize, explore, and share insights from their data
• Amazon Redshift Spectrum: Extends the analytic power beyond data stored in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake”
• AWS Glue: Runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources
