SlideShare a Scribd company logo
1 of 57
Download to read offline
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Serverless Analytics
Pipelines with AWS Glue
Mehul A. Shah
Senior Software Development Manager
AWS Glue
Arup Ray
Vice President, Engineering
Realtor.com
A N T 3 0 8
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Today’s Agenda
AWS Glue overview
What’s new: improvements and features
Realtor.com – customer success
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Fully-managed, serverless extract-transform-load (ETL) service
for developers, built by developers
1000s of customers and jobs
A year ago …
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A year ago …
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Select AWS Glue customers …
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
You helped us grow …
Sep Nov Jan Mar May Jul Sep
Job and Crawler Runs
0
2
4
6
8
10
12
Sep Nov Jan Mar May Jul Sep
# of Regions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
We’ve been busy…
Access policies for AWS
Glue Data Catalog
Amazon SageMaker
notebooks
Encryption
at rest
Canada central
region
Crawler combine
compatible schemas
ETL job
metrics
Amazon DynamoDB
integration
Job delay
notification
London
region
Seoul
region
Crawler merge
new columns
Mumbai
region
Support Apache
Spark 2.2.1
ETL job
timeout
Singapore
region
Sydney
region
Readers support
JSONPath expressions
New Job
Events Types
Support for
Scala scripts
Tokyo
region
Crawler CWE
notifications
XML support AWS CloudTrail
support
Amazon CloudFormation
templates
Crawler exclusion
patterns
Per-second
billing
DynamicFrame
Filter and Map
GDPR, HIPAA, and BAA
compliance
Ireland, Oregon,
Ohio region
Frankfurt
region
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
OrchestrationData Catalog Serverless Jobs
Automatic crawling
Apache Hive Metastore compatible
Integrated with AWS analytic services
Discover
Flexible scheduling
Monitoring and alerting
External integrations
Deploy
Apache Spark core
Python and Scala
Auto-generates ETL code
Develop
AWS Glue components
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Beyond ETL: data science and exploration
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
What you’re saying …
data lake
Amazon EMR Hive SparkSQL …”
Ram Kumar Regnaswamy, CTO, Beeswax
data lake Redshift
cost-effective
whole data infrastructure
Umang Rustagi, Co-founder and COO, FinAccel”
200% faster
at a fraction of the cost
Miki Hardisty, CTO, Jack-in-the-box
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Innovation areas
Scalability and performance
Crawlers and ETL jobs
Profiling and debugging
ETL metrics
Interoperability
Orchestration features
Announcing a new job type: Python shell
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
year
month …
day …
2018
11 12
2221
hour …
JSON
year
month …
day …
2018
11 12
2221
hour …
Parquet
transform
Example data organization in:
Amazon Simple Storage Service (Amazon S3)
Apache Hive-
style partitions
source Amazon S3 bucket target Amazon S3 bucket
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Crawler results …
Complex, nested fields
Groups files into
Apache Hive-style partitions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Crawler performance
7x improvement YoY
90M+ files per day
Millions of partitions
YMMV with partitions
size and complexity
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Read Dynamic Frame
from source
Data transformation +
data cleaning functions
Write Dynamic Frame
to sink
AWS Glue ETL scripts
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Review: Apache Spark and AWS Glue ETL
Apache Spark is a distributed data processing engine for complex analytics.
AWS Glue builds on the Apache Spark to offer ETL specific functionality.
Apache Spark Core: RDDs
Apache Spark
DataFrames
AWS Glue
DynamicFrames
SparkSQL AWS Glue ETL
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DataFrames
Core data structure for SparkSQL
Like structured tables
Need schema up-front
Each row has same structure
Suited for SQL-like analytics
DataFrames and DynamicFrames
DynamicFrames
Like DataFrames for ETL
Designed for processing semi-structured data,
e.g. JSON, Avro, Apache logs ...
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrame performance
Raw conversion speed
4x improvement YoY
1TB in under 1.5 hours
with 10 DPU
YMMV with data format
and script complexity
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DynamicFrame predicate pushdown
GitHub public timeline
0
20
40
60
80
100
120
0 3 6 9 12
months scanned
Time(mins)
pushDownPredicate =
"(year=='2017' and month=='04’)”)
Example code
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Innovation areas
Scalability and performance
Crawlers and ETL jobs
Profiling and debugging
ETL metrics
Interoperability
Orchestration features
Announcing a new job type: Python shell
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue execution model
Apache Spark and AWS Glue are data
parallel. Data is divided into partitions
(shards) that are processed concurrently.
Jobs are divided into stages
1 stage x 1 partition = 1 task
Driver schedules tasks on executors.
2 executors per DPU
Driver
Executors Overall throughput is limited by
the number of partitions (shards)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue job metrics
• Metrics can be enabled in the AWS Command Line Interface (AWS CLI)
and AWS SDK by passing --enable-metrics as a job parameter key.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Profile jobs using ETL metrics
Derived from the underlying Apache Spark metrics
Driver and per executor
aggregates and instantaneous
reports to Amazon CloudWatch metrics every 30 sec
Metrics: memory usage, bytes read and written,
cpu load, bytes shuffled, needed executors,
and more
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: profiling memory usage
overwhelms
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: AWS Glue small-file handling
Driver memory remains below 50%
for the entire duration of execution.
group
tasks
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: optimizing parallelism
Processing large, split-able bzip2 files.
With 10 DPU, metric maximum needed executors shows room for scaling.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: optimizing parallelism
With 15 DPU, active executors closely tracks maximum needed executors.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Innovation areas
Scalability and performance
Crawlers and ETL jobs
Profiling and debugging
ETL metrics
Interoperability
Orchestration features
Announcing new job type: Python shell
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Goal: compose jobs in DAG through event-based dependencies
Orchestration a year ago …
raw
optimize optimized
SLA
reporting
New
In-practice: time-based workflows
variable mins variable mins
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Orchestration building blocks
Crawlers Jobs TriggersEntities
Schedule ExternalEventsDependencies
Conditions TimeoutRetriesControl
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
New features
External
Conditions
Control
Publish crawler and job notifications into Amazon CloudWatch
Amazon CloudWatch events to control downstream workflows
‘ANY’ and ‘ALL’ operators in Trigger conditions
Additional job states ’failed’, ‘stopped’, or ‘timeout’
Configure job timeout
Job delay notifications
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example event-driven workflow
raw
optimize optimized
SLA
reporting
New
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Announcing a new job type: Python shell
A new cost-effective ETL primitive for small to
medium tasks
Python
shell 3rd party
service
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Glue Python shell specs
Python 2.7 environment with
boto3, awscli, numpy, scipy, pandas, scikit-learn, PyGreSQL, …
cold spin-up: < 20 sec, support for VPCs, no runtime limit
sizes: 1 DPU (includes 16GB), and 1/16 DPU (includes 1GB)
pricing: $0.44 per DPU-hour, 1-min minimum, per-second billing
Coming soon (December 2018)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Python shell collaborative filtering example
Amazon customer reviews dataset (s3://amazon-reviews-pds)
Video category
Compute low-rank approx of (Customer x Product) ratings using SVD
uses scipy sparse matrix and SVD library
Step Time (sec)
Redshift COPY 13
Extract ratings 5
Generate matrix 1552
SVD (k=1000) 2575
Total 4145
69 min
$0.60
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
And so much more …
Support for Amazon DynamoDB tables in crawlers and AWS Glue ETL
 Run SQL over DynamoDB tables
Encryption: platform-level, in-transit, and at rest (with AWS KMS keys)
Compliance: GDPR, HIPAA, BAA
connection_type = "dynamodb",
connection_options = {"dynamodb.input.tableName" : ”DDB_TABLE"}
latency > 12
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Related breakouts
Tuesday, November 27
ANT326 METRICS-DRIVEN PERFORMANCE TUNING FOR AWS GLUE ETL JOBS
8:30AM – 9:30AM | Venetian, Level 2, Veronese 2406
Tuesday, November 27
ANT309 BUILD AND GOVERN YOUR DATA LAKES WITH AWS GLUE
9:15AM – 10:15AM | Mirage, Mirage Events Center B
Wednesday, November 28
ANT342 WORKING WITH RELATIONAL DATABASES IN AWS GLUE ETL
2:30PM – 3:30PM | Mirage, Grand Ballroom B, Table 3
Tuesday, November 27
ANT313 SERVERLESS DATA PREP WITH AWS GLUE
7:00PM – 9:15PM | Venetian, Level 2, Venetian H
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.
AWS GLUE @
REALTOR.COM®
Arup Ray
Vice President, Engineering
November 26, 2018
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute.
To empower
people by making
all things home
simple, efficient
and enjoyable
OUR MISSION
43
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 44
REALTOR.COM®
MAKES MEANINGFUL CONNECTIONS
Engage
• 1.5 times more page views1
• 1.3 times longer visits1
Attract
• 9 of 10 consumer leads are actively
searching or ready to transact2
• #1 for finding an agent3
Connect
• 63 million unique visitors per month4
• 2 billion+ page views per month4
1 comScore Sept. 2018
2 Lead Quality Study, July 2017
3 2017 NAR Profile of Home Buyers and Sellers
4 FY Q418 Internal metrics
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 45
DATA ANALYTICS PLATFORM
Template based data transformation
File Drop
Data Archive
Raw Data
Processed
Data
Transactional
Business Data Access Data
Data Lake Curated Data Sets
PDT Template
BD Template
AD Template
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 46
• Generalize the ETL process
to handle generic data sets
and volumes
• Solve for patterns –
like deeply nested JSON
attributes, array within
JSON attributes
• Make the data easy to query
TEMPLATE DRIVEN TRANSFORMATION
•EMR Hive
requires managing the
cluster
•UDF based functionality
requires a lot of code to
maintain
SOLUTION
CHALLENGESPROBLEM
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 47
Raw Data
Processed
Data
Transactional
PDT Template
PDT TEMPLATE SOLUTION
Between RD and PDT layers in Data Analytics Platform
• Template code implemented in Python.
• User defined configuration as JSON
• Supported Raw Data formats – CSV with header, JSON
• These are the template configuration parameters
– Data location (RD S3 location PDT S3 location)
– Data structure of input – implicit or header location
– Input column names that require data type validation
– Relationalize
• Input multi-valued columns that requires to be relationalized
• Deeply nested columns that require flattening
– Duplicate detection of input rows (optional)
CSV
JSON
Parquet
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute.
PDT TEMPLATE
How AWS Glue performs batch data processing
Step 3
Amazon ECS
LGK
Service
Update LGK
Unlock Source & Targets with Lock API
Parse Configuration and fill in template
Lock Source & Targets with Lock API
• Retrieve data from input
partition
• Perform Data type validation
• Perform Flattening
• Relationalize - Explode
• Save in Parquet format
AWS Glue
Amazon S3
Update target metadata AWS Glue
Crawler
Step 1
Amazon ECS
AWS Data
Pipeline
App ID
Step 2
Amazon ECS
Config
Service
LGK
Service
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute.
PDT TEMPLATE
How AWS Glue performs batch data processing
AWS Glue
Python shell
LGK
Service
Update LGK
Unlock Source & Targets with Lock API
Parse Configuration and fill in template
Step 3
Lock Source & Targets with Lock API
• Retrieve data from input
partition
• Perform Data type validation
• Perform Flattening
• Relationalize - Explode
• Save in Parquet format
AWS Glue
Amazon S3
Update target metadata AWS Glue
Crawler
AWS Glue
Python shell
Step 1
Airflow
App ID
AWS Glue
Python shell
Step 2
Config
Service
LGK
Service
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 50
DATA PROFILING
Using Data Validation Framework
PDT Data in
Parquet
fashion
Descriptive
statistics of
columns
Data Profiling using
ML package in Glue
Amazon S3 AWS Glue Amazon S3
Amazon
Athena
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 51
AWS GLUE PYTHON SHELL
The way we see it
AWS Glue
Serverless Distributed
Compute Cluster
AWS Glue
Python shell
Serverless Edge Node for
triggering, light
transformations,
uncompress, tar extract,
Parquet conversion
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 52
We analyze consumer behavior
and expectation to deliver a
best-in-class experience for both
REALTORS®
and consumers
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 53
BEYOND PDT
Analytical Profile Services
User
Listing
Clickstream
Consumer
Analytical
Profile
(Batch)
Amazon
Athena
CTAS
Amazon S3
Amazon Athena
Amazon S3
Clickstream
AWS Glue
ETL
Amazon S3
AWS Glue
Consumer
Analytical
Profile
(Near Real Time)
AWS
Glue
ETL
AWS Glue
Amazon Dynamo DB
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 54
BENEFITS OF USING AWS GLUE
Performance for Dynamo DB write using AWS Glue
significantly better - about 10X
Speed of implementation – developer productivity with in-built
transforms
Less code to maintain - <10 lines versus 200+ for explosion
of array attributes
Operations team loves the server less aspect
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Mehul A. Shah
glue-pm@amazon.com
Arup Ray
https://www.linkedin.com/in/arupray
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute.
THANK YOU

More Related Content

What's hot

What's hot (20)

20190514 AWS Black Belt Online Seminar Amazon API Gateway
20190514 AWS Black Belt Online Seminar Amazon API Gateway 20190514 AWS Black Belt Online Seminar Amazon API Gateway
20190514 AWS Black Belt Online Seminar Amazon API Gateway
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
20190122 AWS Black Belt Online Seminar Amazon Redshift Update
20190122 AWS Black Belt Online Seminar Amazon Redshift Update20190122 AWS Black Belt Online Seminar Amazon Redshift Update
20190122 AWS Black Belt Online Seminar Amazon Redshift Update
 
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
 
Deep Dive on Amazon Aurora MySQL Performance Tuning (DAT429-R1) - AWS re:Inve...
Deep Dive on Amazon Aurora MySQL Performance Tuning (DAT429-R1) - AWS re:Inve...Deep Dive on Amazon Aurora MySQL Performance Tuning (DAT429-R1) - AWS re:Inve...
Deep Dive on Amazon Aurora MySQL Performance Tuning (DAT429-R1) - AWS re:Inve...
 
20191023 AWS Black Belt Online Seminar Amazon EMR
20191023 AWS Black Belt Online Seminar Amazon EMR20191023 AWS Black Belt Online Seminar Amazon EMR
20191023 AWS Black Belt Online Seminar Amazon EMR
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
NET304_Deep Dive into the New Network Load Balancer
NET304_Deep Dive into the New Network Load BalancerNET304_Deep Dive into the New Network Load Balancer
NET304_Deep Dive into the New Network Load Balancer
 
ElastiCache & Redis
ElastiCache & RedisElastiCache & Redis
ElastiCache & Redis
 
20190806 AWS Black Belt Online Seminar AWS Glue
20190806 AWS Black Belt Online Seminar AWS Glue20190806 AWS Black Belt Online Seminar AWS Glue
20190806 AWS Black Belt Online Seminar AWS Glue
 
AWS Black Belt Online Seminar 2017 Amazon ElastiCache
AWS Black Belt Online Seminar 2017 Amazon ElastiCacheAWS Black Belt Online Seminar 2017 Amazon ElastiCache
AWS Black Belt Online Seminar 2017 Amazon ElastiCache
 
20210216 AWS Black Belt Online Seminar AWS Database Migration Service
20210216 AWS Black Belt Online Seminar AWS Database Migration Service20210216 AWS Black Belt Online Seminar AWS Database Migration Service
20210216 AWS Black Belt Online Seminar AWS Database Migration Service
 
Masterclass - Redshift
Masterclass - RedshiftMasterclass - Redshift
Masterclass - Redshift
 
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
 
20190320 AWS Black Belt Online Seminar Amazon EBS
20190320 AWS Black Belt Online Seminar Amazon EBS20190320 AWS Black Belt Online Seminar Amazon EBS
20190320 AWS Black Belt Online Seminar Amazon EBS
 
20200422 AWS Black Belt Online Seminar Amazon Elastic Container Service (Amaz...
20200422 AWS Black Belt Online Seminar Amazon Elastic Container Service (Amaz...20200422 AWS Black Belt Online Seminar Amazon Elastic Container Service (Amaz...
20200422 AWS Black Belt Online Seminar Amazon Elastic Container Service (Amaz...
 
20180724 AWS Black Belt Online Seminar Amazon Elastic Container Service for K...
20180724 AWS Black Belt Online Seminar Amazon Elastic Container Service for K...20180724 AWS Black Belt Online Seminar Amazon Elastic Container Service for K...
20180724 AWS Black Belt Online Seminar Amazon Elastic Container Service for K...
 
20190723 AWS Black Belt Online Seminar AWS CloudHSM
20190723 AWS Black Belt Online Seminar AWS CloudHSM 20190723 AWS Black Belt Online Seminar AWS CloudHSM
20190723 AWS Black Belt Online Seminar AWS CloudHSM
 
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
 

Similar to Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Invent 2018

Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
Amazon Web Services
 

Similar to Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Invent 2018 (20)

Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
 
Track 1_Session 2_SAP on AWS - Running your critical workloads.pdf
Track 1_Session 2_SAP on AWS - Running your critical workloads.pdfTrack 1_Session 2_SAP on AWS - Running your critical workloads.pdf
Track 1_Session 2_SAP on AWS - Running your critical workloads.pdf
 
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitPerforming serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
 
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
 
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA308 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch Service
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
 
Easy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWSEasy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWS
 
AWS Compute Leadership Session: What’s New in Amazon EC2, Containers, and Ser...
AWS Compute Leadership Session: What’s New in Amazon EC2, Containers, and Ser...AWS Compute Leadership Session: What’s New in Amazon EC2, Containers, and Ser...
AWS Compute Leadership Session: What’s New in Amazon EC2, Containers, and Ser...
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
Run Production Workloads on Spot, Save up to 90%
Run Production Workloads on Spot, Save up to 90%Run Production Workloads on Spot, Save up to 90%
Run Production Workloads on Spot, Save up to 90%
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
 
Amazon AI/ML Overview
Amazon AI/ML OverviewAmazon AI/ML Overview
Amazon AI/ML Overview
 
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Scale Your SAP HANA In-Memory Database on Amazon EC2 High Memory Instances wi...
Scale Your SAP HANA In-Memory Database on Amazon EC2 High Memory Instances wi...Scale Your SAP HANA In-Memory Database on Amazon EC2 High Memory Instances wi...
Scale Your SAP HANA In-Memory Database on Amazon EC2 High Memory Instances wi...
 
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
 
Cheat your Way into the Cloud
Cheat your Way into the CloudCheat your Way into the Cloud
Cheat your Way into the Cloud
 
Architecting for Real-Time Insights with Amazon Kinesis (ANT310) - AWS re:Inv...
Architecting for Real-Time Insights with Amazon Kinesis (ANT310) - AWS re:Inv...Architecting for Real-Time Insights with Amazon Kinesis (ANT310) - AWS re:Inv...
Architecting for Real-Time Insights with Amazon Kinesis (ANT310) - AWS re:Inv...
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28
 

More from Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Building Serverless Analytics Pipelines with AWS Glue Mehul A. Shah Senior Software Development Manager AWS Glue Arup Ray Vice President, Engineering Realtor.com A N T 3 0 8
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Today’s Agenda AWS Glue overview What’s new: improvements and features Realtor.com – customer success
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fully-managed, serverless extract-transform-load (ETL) service for developers, built by developers 1000s of customers and jobs A year ago …
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. A year ago …
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Select AWS Glue customers …
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. You helped us grow … Sep Nov Jan Mar May Jul Sep Job and Crawler Runs 0 2 4 6 8 10 12 Sep Nov Jan Mar May Jul Sep # of Regions
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. We’ve been busy… Access policies for AWS Glue Data Catalog Amazon SageMaker notebooks Encryption at rest Canada central region Crawler combine compatible schemas ETL job metrics Amazon DynamoDB integration Job delay notification London region Seoul region Crawler merge new columns Mumbai region Support Apache Spark 2.2.1 ETL job timeout Singapore region Sydney region Readers support JSONPath expressions New Job Events Types Support for Scala scripts Tokyo region Crawler CWE notifications XML support AWS CloudTrail support Amazon CloudFormation templates Crawler exclusion patterns Per-second billing DynamicFrame Filter and Map GDPR, HIPAA, and BAA compliance Ireland, Oregon, Ohio region Frankfurt region
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. OrchestrationData Catalog Serverless Jobs Automatic crawling Apache Hive Metastore compatible Integrated with AWS analytic services Discover Flexible scheduling Monitoring and alerting External integrations Deploy Apache Spark core Python and Scala Auto-generates ETL code Develop AWS Glue components
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Beyond ETL: data science and exploration
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. What you’re saying … data lake Amazon EMR Hive SparkSQL …” Ram Kumar Regnaswamy, CTO, Beeswax data lake Redshift cost-effective whole data infrastructure Umang Rustagi, Co-founder and COO, FinAccel” 200% faster at a fraction of the cost Miki Hardisty, CTO, Jack-in-the-box
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Innovation areas Scalability and performance Crawlers and ETL jobs Profiling and debugging ETL metrics Interoperability Orchestration features Announcing a new job type: Python shell
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. year month … day … 2018 11 12 2221 hour … JSON year month … day … 2018 11 12 2221 hour … Parquet transform Example data organization in: Amazon Simple Storage Service (Amazon S3) Apache Hive- style partitions source Amazon S3 bucket target Amazon S3 bucket
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Crawler results … Complex, nested fields Groups files into Apache Hive-style partitions
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Crawler performance 7x improvement YoY 90M+ files per day Millions of partitions YMMV with partitions size and complexity
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Read Dynamic Frame from source Data transformation + data cleaning functions Write Dynamic Frame to sink AWS Glue ETL scripts
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Review: Apache Spark and AWS Glue ETL Apache Spark is a distributed data processing engine for complex analytics. AWS Glue builds on the Apache Spark to offer ETL specific functionality. Apache Spark Core: RDDs Apache Spark DataFrames AWS Glue DynamicFrames SparkSQL AWS Glue ETL
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DataFrames Core data structure for SparkSQL Like structured tables Need schema up-front Each row has same structure Suited for SQL-like analytics DataFrames and DynamicFrames DynamicFrames Like DataFrames for ETL Designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrame performance Raw conversion speed 4x improvement YoY 1TB in under 1.5 hours with 10 DPU YMMV with data format and script complexity
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DynamicFrame predicate pushdown GitHub public timeline 0 20 40 60 80 100 120 0 3 6 9 12 months scanned Time(mins) pushDownPredicate = "(year=='2017' and month=='04’)”) Example code
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Innovation areas Scalability and performance Crawlers and ETL jobs Profiling and debugging ETL metrics Interoperability Orchestration features Announcing a new job type: Python shell
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue execution model Apache Spark and AWS Glue are data parallel. Data is divided into partitions (shards) that are processed concurrently. Jobs are divided into stages 1 stage x 1 partition = 1 task Driver schedules tasks on executors. 2 executors per DPU Driver Executors Overall throughput is limited by the number of partitions (shards)
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue job metrics • Metrics can be enabled in the AWS Command Line Interface (AWS CLI) and AWS SDK by passing --enable-metrics as a job parameter key.
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Profile jobs using ETL metrics Derived from the underlying Apache Spark metrics Driver and per executor aggregates and instantaneous reports to Amazon CloudWatch metrics every 30 sec Metrics: memory usage, bytes read and written, cpu load, bytes shuffled, needed executors, and more
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: profiling memory usage overwhelms
  • 28. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: AWS Glue small-file handling Driver memory remains below 50% for the entire duration of execution. group tasks
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: optimizing parallelism Processing large, split-able bzip2 files. With 10 DPU, metric maximum needed executors shows room for scaling.
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example: optimizing parallelism With 15 DPU, active executors closely tracks maximum needed executors.
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Innovation areas Scalability and performance Crawlers and ETL jobs Profiling and debugging ETL metrics Interoperability Orchestration features Announcing new job type: Python shell
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Goal: compose jobs in DAG through event-based dependencies Orchestration a year ago … raw optimize optimized SLA reporting New In-practice: time-based workflows variable mins variable mins
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Orchestration building blocks Crawlers Jobs TriggersEntities Schedule ExternalEventsDependencies Conditions TimeoutRetriesControl
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. New features External Conditions Control Publish crawler and job notifications into Amazon CloudWatch Amazon CloudWatch events to control downstream workflows ‘ANY’ and ‘ALL’ operators in Trigger conditions Additional job states ’failed’, ‘stopped’, or ‘timeout’ Configure job timeout Job delay notifications
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Example event-driven workflow raw optimize optimized SLA reporting New
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Announcing a new job type: Python shell A new cost-effective ETL primitive for small to medium tasks Python shell 3rd party service
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Glue Python shell specs Python 2.7 environment with boto3, awscli, numpy, scipy, pandas, scikit-learn, PyGreSQL, … cold spin-up: < 20 sec, support for VPCs, no runtime limit sizes: 1 DPU (includes 16GB), and 1/16 DPU (includes 1GB) pricing: $0.44 per DPU-hour, 1-min minimum, per-second billing Coming soon (December 2018)
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Python shell collaborative filtering example Amazon customer reviews dataset (s3://amazon-reviews-pds) Video category Compute low-rank approx of (Customer x Product) ratings using SVD uses scipy sparse matrix and SVD library Step Time (sec) Redshift COPY 13 Extract ratings 5 Generate matrix 1552 SVD (k=1000) 2575 Total 4145 69 min $0.60
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. And so much more … Support for Amazon DynamoDB tables in crawlers and AWS Glue ETL  Run SQL over DynamoDB tables Encryption: platform-level, in-transit, and at rest (with AWS KMS keys) Compliance: GDPR, HIPAA, BAA connection_type = "dynamodb", connection_options = {"dynamodb.input.tableName" : ”DDB_TABLE"} latency > 12
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Related breakouts Tuesday, November 27 ANT326 METRICS-DRIVEN PERFORMANCE TUNING FOR AWS GLUE ETL JOBS 8:30AM – 9:30AM | Venetian, Level 2, Veronese 2406 Tuesday, November 27 ANT309 BUILD AND GOVERN YOUR DATA LAKES WITH AWS GLUE 9:15AM – 10:15AM | Mirage, Mirage Events Center B Wednesday, November 28 ANT342 WORKING WITH RELATIONAL DATABASES IN AWS GLUE ETL 2:30PM – 3:30PM | Mirage, Grand Ballroom B, Table 3 Tuesday, November 27 ANT313 SERVERLESS DATA PREP WITH AWS GLUE 7:00PM – 9:15PM | Venetian, Level 2, Venetian H
  • 41. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 42. © 2018 Move, Inc. All rights reserved. Do not copy or distribute. AWS GLUE @ REALTOR.COM® Arup Ray Vice President, Engineering November 26, 2018
  • 43. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. To empower people by making all things home simple, efficient and enjoyable OUR MISSION 43
  • 44. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 44 REALTOR.COM® MAKES MEANINGFUL CONNECTIONS Engage • 1.5 times more page views1 • 1.3 times longer visits1 Attract • 9 of 10 consumer leads are actively searching or ready to transact2 • #1 for finding an agent3 Connect • 63 million unique visitors per month4 • 2 billion+ page views per month4 1 comScore Sept. 2018 2 Lead Quality Study, July 2017 3 2017 NAR Profile of Home Buyers and Sellers 4 FY Q418 Internal metrics
  • 45. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 45 DATA ANALYTICS PLATFORM Template based data transformation File Drop Data Archive Raw Data Processed Data Transactional Business Data Access Data Data Lake Curated Data Sets PDT Template BD Template AD Template
  • 46. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 46 • Generalize the ETL process to handle generic data sets and volumes • Solve for patterns – like deeply nested JSON attributes, array within JSON attributes • Make the data easy to query TEMPLATE DRIVEN TRANSFORMATION •EMR Hive requires managing the cluster •UDF based functionality requires a lot of code to maintain SOLUTION CHALLENGESPROBLEM
  • 47. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 47 Raw Data Processed Data Transactional PDT Template PDT TEMPLATE SOLUTION Between RD and PDT layers in Data Analytics Platform • Template code implemented in Python. • User defined configuration as JSON • Supported Raw Data formats – CSV with header, JSON • These are the template configuration parameters – Data location (RD S3 location PDT S3 location) – Data structure of input – implicit or header location – Input column names that require data type validation – Relationalize • Input multi-valued columns that requires to be relationalized • Deeply nested columns that require flattening – Duplicate detection of input rows (optional) CSV JSON Parquet
  • 48. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. PDT TEMPLATE How AWS Glue performs batch data processing Step 3 Amazon ECS LGK Service Update LGK Unlock Source & Targets with Lock API Parse Configuration and fill in template Lock Source & Targets with Lock API • Retrieve data from input partition • Perform Data type validation • Perform Flattening • Relationalize - Explode • Save in Parquet format AWS Glue Amazon S3 Update target metadata AWS Glue Crawler Step 1 Amazon ECS AWS Data Pipeline App ID Step 2 Amazon ECS Config Service LGK Service
  • 49. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. PDT TEMPLATE How AWS Glue performs batch data processing AWS Glue Python shell LGK Service Update LGK Unlock Source & Targets with Lock API Parse Configuration and fill in template Step 3 Lock Source & Targets with Lock API • Retrieve data from input partition • Perform Data type validation • Perform Flattening • Relationalize - Explode • Save in Parquet format AWS Glue Amazon S3 Update target metadata AWS Glue Crawler AWS Glue Python shell Step 1 Airflow App ID AWS Glue Python shell Step 2 Config Service LGK Service
  • 50. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 50 DATA PROFILING Using Data Validation Framework PDT Data in Parquet fashion Descriptive statistics of columns Data Profiling using ML package in Glue Amazon S3 AWS Glue Amazon S3 Amazon Athena
  • 51. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 51 AWS GLUE PYTHON SHELL The way we see it AWS Glue Serverless Distributed Compute Cluster AWS Glue Python shell Serverless Edge Node for triggering, light transformations, uncompress, tar extract, Parquet conversion
  • 52. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 52 We analyze consumer behavior and expectation to deliver a best-in-class experience for both REALTORS® and consumers
  • 53. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 53 BEYOND PDT Analytical Profile Services User Listing Clickstream Consumer Analytical Profile (Batch) Amazon Athena CTAS Amazon S3 Amazon Athena Amazon S3 Clickstream AWS Glue ETL Amazon S3 AWS Glue Consumer Analytical Profile (Near Real Time) AWS Glue ETL AWS Glue Amazon Dynamo DB
  • 54. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. 54 BENEFITS OF USING AWS GLUE Performance for Dynamo DB write using AWS Glue significantly better - about 10X Speed of implementation – developer productivity with in-built transforms Less code to maintain - <10 lines versus 200+ for explosion of array attributes Operations team loves the server less aspect
  • 55. Thank you! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Mehul A. Shah glue-pm@amazon.com Arup Ray https://www.linkedin.com/in/arupray
  • 56. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 57. © 2018 Move, Inc. All rights reserved. Do not copy or distribute.© 2018 Move, Inc. All rights reserved. Do not copy or distribute. THANK YOU