Learning Objectives:
- Understand how to build a serverless big data solution quickly and easily
- Learn how to discover and prepare all your data for analytics
- Learn how to query and visualize analytics on all your data to create actionable insights
2. Agenda
• What is Serverless?
• Enterprise Data Warehouse on AWS (Amazon Redshift)
• Serverless Queries from your Data Warehouse (Redshift Spectrum)
• Serverless Data Catalog (AWS Glue)
• Serverless ETL (AWS Glue)
• Serverless BI (Amazon QuickSight)
• Demonstration
• Wrap up
3. What is Serverless
• Virtualized: You can easily provision servers and focus on the OS and above.
• Managed: You focus higher in the stack but still need to consider servers, how much CPU is needed, and how much RAM. AWS manages based on the customer configuration.
• Serverless: Build applications and services without thinking of servers. Don’t be concerned about provisioning, scaling, and maintaining servers for fault tolerance and availability. AWS does all of this for you.
4. Serverless characteristics
• No servers to provision or manage
• Scales with usage
• Never pay for idle resources
• Availability and fault tolerance built in
5. Amazon Redshift: Enterprise Data Warehouse
• Managed Massively Parallel Petabyte Scale Data Warehouse
• Streaming Backup/Restore to S3
• Load data from S3, DynamoDB and EMR
• Extensive Security Features
• Online Scaling from 160 GB -> 2 PB
A lot faster, a lot simpler, a lot cheaper
7. We innovate quickly
Well over 140 new features added since launch
Release every two weeks
Automatic patching
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE ciphers (4/22)
3 new regex features, Unload to single file, FedRAMP (5/6)
Rename Cluster (6/2)
Copy from multiple regions, percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)
8. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast @ exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: Multiple clusters access same data
• No ETL: Query data in-place using open file formats
• Full Amazon Redshift SQL support
9. Redshift Spectrum
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against S3 data
• Efficient join processing within the Amazon Redshift cluster
[Architecture diagram. Labels: JDBC/ODBC; Customer VPC; Redshift Nodes; 10 GigE (HPC); Ingestion, Backup, Restore; Internal VPC; Spectrum Nodes 1 2 3 4 ... N; Amazon S3 (exabyte-scale object storage); AWS Glue Data Catalog; Apache Hive Metastore]
10. Life of a query
Step 1: Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
[Diagram. Labels: JDBC/ODBC; Amazon Redshift; nodes 1 2 3 4 ... N; Amazon S3 (exabyte-scale object storage); Glue Data Catalog; Apache Hive Metastore]
11. Life of a query
Step 2: The query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum.
12. Life of a query
Step 3: The query plan is sent to all compute nodes.
13. Life of a query
Step 4: Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
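Partition pruning can be pictured with a small sketch. The partition keys (`year`, `month`), the S3 paths, and the `prune_partitions` helper are hypothetical and only illustrate the idea; how Redshift Spectrum actually prunes is internal to the service.

```python
from typing import Callable, Dict, List

# Hypothetical catalog entries: each partition maps its key values to an S3 prefix.
Partition = Dict[str, str]

CATALOG_PARTITIONS: List[Partition] = [
    {"year": "2016", "month": "12", "location": "s3://example-bucket/sales/year=2016/month=12/"},
    {"year": "2017", "month": "01", "location": "s3://example-bucket/sales/year=2017/month=01/"},
    {"year": "2017", "month": "02", "location": "s3://example-bucket/sales/year=2017/month=02/"},
]


def prune_partitions(
    partitions: List[Partition],
    predicate: Callable[[Partition], bool],
) -> List[str]:
    """Return the S3 prefixes of the partitions whose keys satisfy the filter."""
    return [p["location"] for p in partitions if predicate(p)]


# Filter equivalent to: WHERE year = '2017'
to_scan = prune_partitions(CATALOG_PARTITIONS, lambda p: p["year"] == "2017")
print(to_scan)
```

Only the prefixes that survive the filter need to be scanned, which is why a partition layout that matches common query filters reduces the amount of data processed.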
14. Life of a query
Step 5: Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
15. Life of a query
Step 6: Amazon Redshift Spectrum nodes scan your S3 data.
16. Life of a query
Step 7: Amazon Redshift Spectrum projects, filters, and aggregates.
17. Life of a query
Step 8: Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
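The split between aggregation in the Spectrum layer and final aggregation in the cluster resembles a partial/final aggregation pattern. Below is a minimal sketch of that pattern for a `COUNT(*) ... GROUP BY` query; the row layout, the `min_amount` filter, and both function names are assumptions made for illustration, not the service's implementation.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, int]  # hypothetical layout: (region, amount)


def partial_count(rows: Iterable[Row], min_amount: int) -> Counter:
    """Per-node work in this sketch: filter rows, keep the group key, pre-aggregate."""
    counts: Counter = Counter()
    for region, amount in rows:
        if amount >= min_amount:  # filter
            counts[region] += 1   # project the key and count
    return counts


def final_count(partials: List[Counter]) -> Counter:
    """In-cluster work in this sketch: merge the partial counts per group."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Two slices of data, standing in for S3 objects scanned by different nodes.
slices = [
    [("us", 10), ("eu", 5), ("us", 1)],
    [("eu", 7), ("eu", 2), ("ap", 9)],
]
partials = [partial_count(s, min_amount=5) for s in slices]
result = final_count(partials)
print(dict(result))
```

Each slice returns only one small counter rather than its raw rows, so the final merge handles far less data than the scan did.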
18. Life of a query
Step 9: The result is sent back to the client.
20. AWS Glue
Discover: Automatically discovers and categorizes your data to make it immediately searchable and queryable
Develop: Generates code to clean, enrich, and reliably move data between data stores; you can also use your favorite tools to build ETL jobs
Deploy: Runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources
21. AWS Glue: Components
Data Catalog
Apache Hive Metastore compatible with enhanced functionality
Crawlers automatically extract metadata and create tables
Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
Runs jobs on a serverless Spark platform
Provides flexible scheduling
Handles dependency resolution, monitoring, and alerting
Job Authoring
Auto-generates ETL code
Built on open frameworks – Python and Spark
Developer-centric – editing, debugging, sharing
22. AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.)
into a single categorized list that is searchable
23. Crawlers: Classifiers
[Diagram. Labels: IAM Role; Glue Crawler; Databases (Amazon RDS) and Data Warehouse (Amazon Redshift), JDBC Connection; Data Lakes (Amazon S3), Object Connection]
Built-In Classifiers
MySQL
MariaDB
PostgreSQL
Aurora
SQL Server / Oracle
Redshift
Avro
Parquet
ORC
JSON & BSON
Logs (Apache, Linux, MS, Ruby, Redis, and many others)
Delimited (comma, pipe, tab, semicolon)
Compressed Formats (ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional Custom Classifiers with Grok!
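Grok patterns name reusable sub-patterns with `%{PATTERN:field}` tokens. The sketch below shows that idea in plain Python by translating a few such tokens into named regex groups. It is a simplified illustration, not Glue's classifier engine: the `BASE_PATTERNS` table, the `grok_to_regex` helper, and the sample log line are all assumptions.

```python
import re

# A few base patterns written by hand for this sketch; real Grok libraries ship many more.
BASE_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"[+-]?\d+",
    "URIPATH": r"/\S*",
}

GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(grok_pattern: str):
    """Translate %{PATTERN:field} tokens into named regex groups and compile the result."""

    def expand(match):
        pattern_name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{BASE_PATTERNS[pattern_name]})"

    return re.compile(GROK_TOKEN.sub(expand, grok_pattern))


# Hypothetical custom pattern for a simple access-log line.
LOG_PATTERN = "%{IP:client} %{WORD:method} %{URIPATH:path} %{INT:status}"
regex = grok_to_regex(LOG_PATTERN)

match = regex.fullmatch("10.0.0.7 GET /index.html 200")
record = match.groupdict() if match else {}
print(record)
```

The field names in the pattern become the column names of the parsed record, which is what lets a classifier turn free-form log lines into a table schema.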
25. Job authoring in AWS Glue
You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue
26. Job authoring: Automatic code generation
1. Customize the mappings
2. Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
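A column mapping of the kind customized in step 1 can be written as tuples of (source column, source type, target column, target type). The sketch below applies such a list to plain Python dictionaries to show what the mapping does. It is an illustration only, not the code Glue generates: the `apply_mapping` helper, the `CASTS` table, and the sample columns are assumptions, and it runs without Spark or the Glue libraries.

```python
from typing import Any, Callable, Dict, List, Tuple

# (source column, source type, target column, target type)
Mapping = Tuple[str, str, str, str]

# Hypothetical cast table covering only the types used in this sketch.
CASTS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "int": int,
    "double": float,
}


def apply_mapping(records: List[Dict[str, Any]], mappings: List[Mapping]) -> List[Dict[str, Any]]:
    """Rename and cast mapped columns; columns without a mapping are dropped."""
    output = []
    for record in records:
        row = {}
        for source, _source_type, target, target_type in mappings:
            if source in record:
                row[target] = CASTS[target_type](record[source])
        output.append(row)
    return output


mappings = [
    ("id", "string", "customer_id", "int"),
    ("amt", "string", "amount", "double"),
    ("name", "string", "customer_name", "string"),
]
records = [{"id": "42", "amt": "19.5", "name": "Ada", "unused": "x"}]
mapped = apply_mapping(records, mappings)
print(mapped)
```

Editing the mapping list changes the output schema without touching the rest of the job.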
27. Job authoring: Relationalize() transform
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
[Figure: Semi-structured schema (A, B, C with nested X and Y, array D [ ]) mapped to a Relational schema (columns A, B, C.X, C.Y with a foreign key FK, plus a table with primary key PK, Offset, and Value)]
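The flattening idea behind this transform can be sketched in plain Python: nested objects become dotted columns, and arrays move to a child table linked back by a key. This is an illustration only, not the Glue transform. The `relationalize` function, the child-table naming (`root_D`), and the `id`/`offset`/`value` column names are assumptions, and the sketch handles only arrays of scalar values.

```python
from typing import Any, Dict, List

Table = List[Dict[str, Any]]


def relationalize(records: List[Dict[str, Any]], root: str = "root") -> Dict[str, Table]:
    """Flatten nested records into a root table plus one child table per array column."""
    tables: Dict[str, Table] = {root: []}
    key_counter = 0

    def flatten(obj: Dict[str, Any], prefix: str, row: Dict[str, Any]) -> None:
        nonlocal key_counter
        for key, value in obj.items():
            column = prefix + key
            if isinstance(value, dict):
                # Nested object: recurse and emit dotted column names such as "C.X".
                flatten(value, column + ".", row)
            elif isinstance(value, list):
                # Array: store a foreign key in the row and move the items to a child table.
                key_counter += 1
                row[column] = key_counter
                child = tables.setdefault(f"{root}_{column}", [])
                for offset, item in enumerate(value):
                    child.append({"id": key_counter, "offset": offset, "value": item})
            else:
                row[column] = value

    for record in records:
        row: Dict[str, Any] = {}
        flatten(record, "", row)
        tables[root].append(row)
    return tables


tables = relationalize([{"A": 1, "B": "x", "C": {"X": 2, "Y": 3}, "D": [10, 20]}])
for name, rows in tables.items():
    print(name, rows)
```

After flattening, every table has only scalar columns, so it can be loaded into a relational store and joined on the key column with ordinary SQL.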
29. Amazon QuickSight is a Business Analytics Service that lets business users
quickly and easily visualize, explore, and share insights from their data.
30. Basic Concepts
Retail Data
Ops Data
Marketing Data
Relational Databases
Flat Files
More data sources coming soon!
Microsoft Active Directory
Local User Definition
31. Deep Integration with AWS Data Sources
QuickSight is deeply integrated with AWS data sources like Redshift, RDS, S3, Athena and others, as well as third-party sources like Excel and Salesforce, and on-premises databases.
• Amazon RDS, Aurora
• Amazon Redshift
• Amazon Athena
• Amazon S3
• Flat Files
35. What did we cover…
• AWS Glue Data Catalog: Automatically discovers and categorizes your data to make it immediately searchable and queryable
• Amazon QuickSight: Business Analytics Service that lets business users quickly and easily visualize, explore, and share insights from their data
• Amazon Redshift Spectrum: Extends the analytic power beyond data stored in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake”
• AWS Glue ETL: Runs your jobs on a serverless, fully managed, scale-out environment without needing to provision or manage compute resources