Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics Practices (ANT202-R) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Modern Cloud Data Warehousing
ft. Equinox Fitness Clubs: Optimize
Your Analytics Practices
ANT202-R
Ryan Kelly
Data Architect
Equinox
Elliott Cordo
VP Data Analytics
Equinox
Lisa Perazzoli
Sr. Product Manager
Amazon Web Services

Raise your hand if you’re using
Amazon Redshift

Your requirements
are evolving
Data variety
and data
volumes
are increasing
rapidly
Integrate
Disparate
data sets
Democratized
access to data
in a governed way
Analytic needs are
evolving beyond
batch reports to
Real-time and
predictive
Incorporation of
Voice, image
recognition, and
IoT use cases
into applications

Traditionally, analytics used to look like this
Data Warehouse
LOB CRM ERP OLTP
Business Intelligence
Relational data
GBs-TBs scale
Schema defined prior to data load
Operational reporting and ad hoc
Large initial capex + $10k–$50k/TB/year

There is more data
than people think.
Data grows … every 5 years
Data platforms need to live for … years
and scale …

There are more
data types than
ever before.

There are more
ways to analyze data
than ever before
Hadoop, Elasticsearch, Presto, Spark
Years ago: 11, 8, 5, 4
Didn’t exist

What does data warehouse modernization mean?
Easy to use
Don’t waste time on menial administrative tasks and maintenance
Extends to your Data Lake
Directly analyze data stored in your data lake in open formats
Any scale of data, workloads, and users
Dynamically scale up to guarantee performance even with unpredictable demands and data volumes
Faster time-to-insights
Consistently fast performance, even with thousands of concurrent queries and users

Amazon Redshift
Fastest
Get faster time-to-insight for all types of analytics workloads; powered by machine learning, columnar storage and MPP
Unlimited scale
Dynamically scale up to guarantee performance even with unpredictable analytical demands and data volumes
Extends your Data Lake
Analyze data in the Amazon S3 Data Lake in-place and in open formats, together with data loaded into Redshift’s high performance SSDs
1/10th the cost
Start at $0.25 per hour, save costs with automated administration tasks and eliminate business impact due to downtime; as low as $1,000 per terabyte per year
Fast, simple, cost-effective data warehouse that can extend queries to your Data Lake
Analyze data in open formats such as Parquet, ORC, and JSON, using SQL tools

for their cloud
data warehouse
workloads than
anyone else
Amazon
Redshift

Selected Amazon Redshift Partners
Data Integration Business Intelligence Systems Integrators

Faster Time to Insights
Normalized queries per hour (QPH) as a % of Redshift 6 months ago
Assuming Redshift’s QPH 6 months ago = 100%
Higher is better
100%, 181%, 237%, 284%, 350%
>3x faster
Faster performance
New!

Redshift Query Editor
Query data
directly from
the AWS console
Results are instantly
visible within the console
No need to install
and setup an external
JDBC/ODBC client
Launched in October!

Redshift Advisor
>96% of
clusters
have tailored
feedback
Provides
automated
recommendations
to help optimize database
performance and
decrease operating costs
Actionable
WLM
COPY, storage,
and system
maintenance advice
for tuning based
on continuous
workload analysis
Intelligent
recommendations
Launched in July!

Amazon Redshift intelligent maintenance
Auto Vacuum
Auto Analyze
Auto WLM Concurrency Setting
Maintenance processes like
vacuum and analyze will
automatically run in the
background.
Redshift will automatically adjust the
WLM concurrency setting to deliver
optimal throughput.
Moving towards
zero-maintenance.
Coming Soon!

Amazon Redshift Elastic Resize (GA)
Adds
additional
nodes
to Redshift cluster
Run queries
faster
in busy periods
Minimal
transition time
Scale compute
and storage on-
demand
Scale up and down in minutes
New!
Redshift
Cluster
Compute nodes
Redshift Managed S3
JDBC/ODBC
Leader Node
CN1 CN2 CN3 CN4
Backup

Caching Layer
Concurrency Scaling for
bursts of user activity (Preview)
Automatically
creates more
clusters on-
demand
Consistently
fast
performance
even with
thousands of
concurrent queries
No advance
hydration
required
Free for >97%
of customers
for every 24 hours
that your main
cluster is in use, you
accrue a one-hour
credit for
Concurrency Scaling
New!
Backup
Amazon Redshift Managed S3
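The credit rule on this slide can be worked through as simple arithmetic. The sketch below assumes credits accrue linearly at one hour per 24 hours of main-cluster use; the function names are hypothetical, and real billing rules (caps, rounding, expiry) are not stated on the slide and are not modeled here.

```python
def accrued_credit_hours(main_cluster_hours: float) -> float:
    """One hour of Concurrency Scaling credit per 24 hours of main-cluster use."""
    return main_cluster_hours / 24.0


def billable_scaling_hours(main_cluster_hours: float, scaling_hours_used: float) -> float:
    """Concurrency Scaling hours left to pay for once accrued credits are applied."""
    return max(0.0, scaling_hours_used - accrued_credit_hours(main_cluster_hours))


if __name__ == "__main__":
    # A main cluster in use for a full 30-day month: 720 / 24 = 30 credit hours.
    print(accrued_credit_hours(30 * 24))
    # 12 hours of burst usage in that month is fully covered by credits.
    print(billable_scaling_hours(30 * 24, 12))
```

Under this reading, a cluster that bursts for less than about an hour a day never pays for Concurrency Scaling, which is consistent with the "free for >97% of customers" claim.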

Data warehouse
modernization
is also about the
transition to
data lakes
Data Warehouse Data Lake
OLTP ERP CRM LOB
Business
Intelligence
Devices Web Sensors Social
Machine
Learning
Data Catalog
DW
Queries
Big data
processing
Interactive
analysis
Real-time
insights

The power of data lakes
Most ways to bring data in
Terabyte – Exabyte scale
Security,
compliance, and audit capabilities
Run any analytics
on the same data without movement
Scale
storage and compute independently
Designed for low-cost
storage and analytics
Redshift
EMR Athena
AI Services
Elasticsearch Kinesis
Snowball
Kinesis
Video Streams
Kinesis
Data
Streams
Kinesis
Data Firehose
Snowmobile

Layers of a
Data Lake
INGEST
Security
S3
Analyze & infer
Redshift
EMR
Athena
AI Services
Elasticsearch
Service
Kinesis
Discover
AWS Glue
Snowball
Snowmobile
Kinesis Data Firehose
Kinesis Data Streams
Kinesis Data Streams
Database Migration Service
Ingest

Query the
same data
with the best
analytics tool
for the job
Data Lake
on AWS
Redshift EMR Athena Kinesis SageMaker
The importance
of open data
formats and
open APIs
Eliminates data
silos and tool lock-ins
Unified access
and governance
Platform decisions are long-lived.
Innovation in analytics is high.

Modernizing your data warehouse includes
Extending data
warehouse queries
into the data lake
Sizing the data
warehouse
independent of
the data lake
Support for open
data formats
Integration with a
variety of analytical
tools in the data lake
Scalability Unified access and
governance
A solution that will last
for the next 10+ years

Who we are
Equinox is a company with integrated luxury and lifestyle offerings
centered on movement, nutrition, and regeneration.
We operate more than 200 locations in every major city
across the country, in addition to London and Canada.

How hard
could it be?
People check
into the clubs?
Members lift weights and
put them down?
Building neighbors feel
shaking from heavy weights?

More than meets the eye
Many lines of business across
98 clubs & 200+ in total
Plus central supporting
functions
Digital
Products
CRM Marketing Creative
Development
/ Building
Finance Member’s
Services
Maintenance
Personal
training
Pilates Spa Group
Fitness
Membership/
Sales
Retail Food
Services

Digital Products
End user applications
Connections to Apple Health
Connected
Equipment
Pursuit (gamified cycling experience)
Cardio
Digital Assessment
Location Tracking
Connected Tech

The history of data
First there was “LIFE”…
This was Equinox’s
first data warehouse
and was created in 2008

The history of data
Rigorously
Kimball

LIFE was good…
Reporting was reliable
Analytics, sometimes self-serviced!
Customer Profiles
CRM
Email Marketing
Personalization

But sometimes it was bad…
Direct integration with applications, tight coupling
Difficult SDLC, testing cycle, release management
Functional debt
No place to put NEW data
Inflexibility for Data Science
Expensive commercial software
FML

Version 1.X
About 4 years ago
we purchased
Launched
several apps
running in beta
Very
expensive
Limitations with
integrations
Required
platform-specific
knowledge

Re-centering on our goals
Provide
business value
Build technology that
differentiates
Reduce cost and
go all-in on
cloud technology
Adopt modern
engineering principles
Make scalable components
Use ephemeral, stateless
resources
Use distributed databases
Less focus on individual servers

Doesn’t work for everything!
Just put everything in Hadoop or Amazon S3 data lake
You don’t need a data warehouse
Everything can just be late bind
The “new” school

Data warehouse vs. data lake
Data Lakes Data Warehouse
Reliable high SLA reporting
Developer and analyst friendly
Efficient for specific types of pipelines
Large immutable data sets
Semi-structured and
unstructured data sets

Project “Cosmo”
Two-week
proof-of-concept
Re-platformed one
Teradata app
It worked!
Amazon
Redshift
Amazon
S3

Bidding farewell
Au revoir
(Not for sale anymore)

“JARVIS” is born
Data Warehouse Data Lakes Data Services
From successful POC to new data platform JARVIS
Amazon
Redshift
Amazon
S3

JARVIS architecture
Data & Analytics Apps
Equinox Apps
Third Party Apps
Informatica
Maximilian
EMR
PT
App
Pursuit
Engage
Exact
Target
Adobe Social
MOSO
Fitness
Agg.
Amazon
Redshift

Amazon Redshift benefits
Cost effective
1/10th the cost of Teradata and SQL Server licensing and maintenance
Low barriers for developers & easy to maintain
Much less platform specific knowledge
Fast and performant
Data pipelines reduced from hours to minutes
DevOps friendly
API, automation, multi-cluster
Integration with other AWS services
and third party tools
Amazon
Redshift

Sailing on the data lake
High performance,
low cost, blob storage on S3
Functioning
analytic store (not a dumping ground)
Flexible,
late bind strategies where appropriate
Quick setup
for external tables
Easily implement
DR strategies

Data in
the lake
Clickstream data
PURSUIT cycling logs
Club management software logs
Data from software that
enhances our services

Big picture
Query the data
Immutable app log data - Adobe Analytics
Toolkit
Amazon
Redshift
Amazon
EMR
Amazon
Athena
Metadata
AWS
Glue
Storage Amazon
S3

Data ingestion
Adobe Analytics data feeds
Functionality built-in
Choose columns to receive
Specify AWS credentials and Amazon S3
information
Get files daily
Not so fast…
Multiple files are then sent to Amazon S3
including multiple data files, multiple lookup
files, and a manifest file describing
everything sent

Landing in the data lake
1. Throw all raw files into an S3 landing bucket
2. Use Amazon EMR to aggregate into a single file
3. Save new Parquet file to S3 data lake bucket
Extra credit!
Save clean data to a sub-folder named “dt=YYYY-MM-DD”
Partitioning data in separate folders allows for less data to be scanned
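The "dt=YYYY-MM-DD" folder convention can be sketched as a small helper that builds the destination prefix for each day's file. This is a minimal illustration, not the EMR job from the talk; the bucket and path names are hypothetical.

```python
from datetime import date


def partition_prefix(base_prefix: str, day: date) -> str:
    """Return the partition folder for one day, e.g. '.../dt=2017-11-20/'."""
    return f"{base_prefix.rstrip('/')}/dt={day.isoformat()}/"


if __name__ == "__main__":
    # Hypothetical data lake location; each day's Parquet file lands in its own folder.
    base = "s3://example-data-lake/clickstream/"
    print(partition_prefix(base, date(2017, 11, 20)))
```

Because the date is encoded in the folder name, a query that filters on dt only has to read the matching folders rather than the whole data set.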

Setting AWS Glue up, part 1
Cleaned data is now in Amazon S3
but it can’t be queried yet
Data must be described in AWS Glue
1. Create a database in Glue to label the data source
2. Create an external table in Glue interface

Setting AWS Glue up, part 2
Set up external table in AWS Glue interface:
1. Select Add Table manually
2. Point table data source to S3 folder location
3. Define schema
4. Define “dt” as an additional column for partition
External tables can also be created in Athena or Redshift:
run CREATE EXTERNAL TABLE
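For the CREATE EXTERNAL TABLE route, the statement might look like the one this sketch generates. It assumes Athena-style DDL over Parquet files; the database and table names are borrowed from the repair example later in the deck, while the column names and the S3 location are hypothetical.

```python
def create_external_table_ddl(db, table, columns, location, partition_col="dt"):
    """Build an Athena-style CREATE EXTERNAL TABLE statement for Parquet data.

    columns is a list of (name, type) pairs; the partition column is declared
    separately and must not also appear in the column list.
    """
    cols = ",\n".join(f"  {name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{table} (\n"
        f"{cols}\n"
        f")\n"
        f"PARTITIONED BY ({partition_col} string)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    ddl = create_external_table_ddl(
        db="cyclingops",
        table="cycling_logs",
        columns=[("rider_id", "string"), ("distance_km", "double")],  # hypothetical
        location="s3://example-data-lake/cycling_logs/",  # hypothetical
    )
    print(ddl)
```

The four Glue console steps map onto the statement directly: the table name, the LOCATION, the column list, and the PARTITIONED BY clause for "dt".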

Further describing the lake

The assembled pipeline
Adobe
Analytics
EMR Athena S3
Glue Data
Catalog
Redshift
Spectrum
S3

But wait… why ALTER TABLE
# Athena is our in-house Python interaction class, not a public library
a = Athena()
partitions = [
    {'dt': '2017-11-20', 'facility-code': '716'},
    {'dt': '2017-11-20', 'facility-code': '715'},
    {'dt': '2017-11-20', 'facility-code': '714'},
]
a.repair_table(db_name='cyclingops',
               table='cycling_logs', partitions=partitions)
Partitioned tables must
be told about new data
If it is not made aware then it cannot be queried
Alter table easily with
Glue crawler or Athena
We built an Athena interaction
class in Python for flexibility
On successful EMR job we use this class
to repair the table
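The internals of the in-house class are not shown in the deck. As a rough sketch of what a repair step could send to Athena, the function below turns the same partition list into an ALTER TABLE ... ADD PARTITION statement. The function name and S3 location are hypothetical, and backquoting the hyphenated partition name is an assumption about how such a name would need to be written.

```python
def add_partitions_sql(table, partitions, base_location):
    """Build one ALTER TABLE statement that registers several partitions.

    Each partition is a dict of {partition_column: value}; its S3 folder is
    assumed to follow the key=value/key=value/ layout under base_location.
    """
    base = base_location.rstrip("/")
    clauses = []
    for spec in partitions:
        keys = ", ".join(f"`{k}`='{v}'" for k, v in spec.items())
        path = "/".join(f"{k}={v}" for k, v in spec.items())
        clauses.append(f"PARTITION ({keys}) LOCATION '{base}/{path}/'")
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n  " + "\n  ".join(clauses)


if __name__ == "__main__":
    partitions = [
        {"dt": "2017-11-20", "facility-code": "716"},
        {"dt": "2017-11-20", "facility-code": "715"},
    ]
    print(add_partitions_sql("cyclingops.cycling_logs", partitions,
                             "s3://example-data-lake/cycling_logs/"))
```

IF NOT EXISTS makes the statement safe to re-run after every successful EMR job, which suits an automated pipeline.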

Processing on Redshift
Light
transformation
via ELT scripts
Happens inside of Redshift
Orchestrated by Maximilian
Big crunches and
semi-structured
data processing
Happen outside of Redshift
Help reserve query capacity

What do our models look like
Flattened
Redshift is columnar so
wide tables are A-OK!
Distributed joins can
be expensive
Rational and
conservative
use of dimensions
especially “Type 2”
Somewhat like
star schemas
Basically, get answer and put in table!

f_checkin
checkin_id
member_id
accounting_cd
terminal_name
contact_key
facility_key
checkin_type_desc
checkin_status
checkin_issue_reasons
checkin_date_key
checkin_time_key
checkin_ts_time
checkin_raw_count
checkin_unique_count
checkin_good_count
checkin_good_daily_counter
trial_checkin_count
etl_source_system_cd
etl_row_create_dts
etl_row_update_dts
etl_run_id
Sample data model
1. Don’t make dimensions
you don’t have to
2. No “junk” dimensions
3. No “mystery” or
flag dimensions
d_contact
contact_key
contact_id
….
d_facility
facility_key
facility_id
….

Fall (DIST)STYLE fashions
DISTSTYLE ALL
Each node receives complete table
Reduces disk usage on small-medium size tables
Preferred for table sizes up to 3M rows with slowly changing data
DISTSTYLE KEY
Each node receives portion of data via chosen key
Optimizes JOIN, INSERT INTO, GROUP BY performance
DISTSTYLE EVEN
Each node receives portion of data via round robin
Use if neither option above applies
ALL
keyA keyB keyC keyD
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
EVEN
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
KEY
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
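The three styles can be illustrated with a toy placement model for the 2-node, 4-slice layout in the diagram. This is only a sketch: the CRC32 hash stands in for Redshift's internal hash function, and storing the ALL copy on each node's first slice is an assumption of the model, not something the slide states.

```python
from collections import defaultdict
from zlib import crc32


def distribute(rows, style, key=None, num_nodes=2, slices_per_node=2):
    """Return {slice_number: [rows]} for a toy cluster (slices numbered from 0)."""
    num_slices = num_nodes * slices_per_node
    placement = defaultdict(list)
    if style == "ALL":
        # Every node gets a complete copy of the table.
        for node in range(num_nodes):
            placement[node * slices_per_node] = list(rows)
    elif style == "EVEN":
        # Round robin across all slices, ignoring row contents.
        for i, row in enumerate(rows):
            placement[i % num_slices].append(row)
    elif style == "KEY":
        # Rows with the same key value always land on the same slice.
        for row in rows:
            digest = crc32(str(row[key]).encode("utf-8"))
            placement[digest % num_slices].append(row)
    else:
        raise ValueError(f"unknown DISTSTYLE: {style}")
    return dict(placement)


if __name__ == "__main__":
    rows = [{"member_id": i % 3, "checkin_id": i} for i in range(8)]
    for style in ("ALL", "EVEN", "KEY"):
        placement = distribute(rows, style, key="member_id")
        print(style, {s: len(r) for s, r in sorted(placement.items())})
```

The KEY case shows why joins on the distribution key are cheap (matching rows are co-located) and also why a skewed key can leave some slices with far more rows than others.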

(DIST)STYLE decision time
Does the table participate in JOINs?
Does the table contain at least one
potential DISTKEY column?
Do query patterns utilize potential
DISTKEY columns in JOIN conditions?
Can you tolerate additional
storage overhead?
Do the query patterns tolerate
reduced parallelism?
Outcomes: DISTSTYLE EVEN, DISTSTYLE KEY, DISTSTYLE ALL
The decision is only between
two at a time
If one valid DISTKEY column
exists then KEY or ALL
If no valid DISTKEY column
exists then EVEN or ALL
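One plausible reading of these questions, combined with the two-at-a-time rule, can be written as a small function. The exact branch order of the original flowchart is not recoverable from the slide text, so the ordering below is an assumption, and the function and parameter names are hypothetical.

```python
def choose_diststyle(participates_in_joins: bool,
                     has_distkey_used_in_joins: bool,
                     tolerates_storage_overhead: bool,
                     tolerates_reduced_parallelism: bool) -> str:
    """Pick a DISTSTYLE from yes/no answers (one reading of the slide's questions)."""
    # Tables that are never joined gain nothing from co-location.
    if not participates_in_joins:
        return "EVEN"
    # A valid DISTKEY column that queries actually join on favors KEY.
    if has_distkey_used_in_joins:
        return "KEY"
    # Without a usable DISTKEY, ALL is only an option if its costs are acceptable.
    if tolerates_storage_overhead and tolerates_reduced_parallelism:
        return "ALL"
    return "EVEN"


if __name__ == "__main__":
    # A small dimension table joined on a column that is a poor distribution key.
    print(choose_diststyle(True, False, True, True))
    # A large fact table joined on member_id.
    print(choose_diststyle(True, True, False, False))
```

This sketch never returns ALL when a valid DISTKEY exists, although the slide allows "KEY or ALL" in that case; deciding between those two would need the table-size guidance from the previous slide.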

Optimizations with data lakes
1. Leverage self-described high-compression Parquet files
2. Easily perform delta queries and “what changed analysis” from unloaded snapshots of Redshift tables
3. Use partitions but do not over-partition
4. Lighten compute load on Redshift by using EMR or Athena
5. ELT from S3 to S3 using Spectrum and UNLOAD (make sure to compress!)
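The S3-to-S3 ELT step with a compressed UNLOAD can be sketched as a statement builder. It assumes the source is an external (Spectrum) table and uses the GZIP option for compression; the schema name, bucket, and IAM role ARN are hypothetical placeholders.

```python
def unload_sql(select_sql: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift UNLOAD statement that writes gzip-compressed files to S3.

    The query is passed to UNLOAD as a quoted string, so any single quotes
    inside it are doubled.
    """
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"GZIP"
    )


if __name__ == "__main__":
    print(unload_sql(
        # Hypothetical external schema mapped to the Glue Data Catalog.
        "SELECT * FROM spectrum_schema.cycling_logs WHERE dt = '2017-11-20'",
        "s3://example-data-lake/cycling_logs_clean/dt=2017-11-20/part_",
        "arn:aws:iam::123456789012:role/example-unload-role",
    ))
```

Writing the output under a dt= prefix keeps the unloaded data consistent with the partition layout used elsewhere in the lake.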

Supporting actors
Batchy
Batch & state,
DAG execution
HAMBOT
Data quality &
monitoring
Teletraan1,
Robopager
Ops monitoring
Rundeck
Scheduling
Jenkins
Deployments

How we do deployments
Jenkins
workflows
Spin up
ephemeral EMR clusters and Maximilian assets
Run
major transformations
Run
HAMBOT checks
Merge and deploy

V.I.N.CENT bot
Hero for our engineers
Allows ops interactions via Slack chat interface
Much easier for engineers than the console
Can start cluster in seconds
Reduces need for console access

Maximilian bot
Every hero needs a villain
Further ops interactions via Slack
But this is bot to bot communication
Seeks to destroy clusters twice a day
Humans intervene to fend off cluster destruction
Saves money on unused infrastructure

Things are good!
Re-platformed and productionalized
2 apps in 4 months
Finished re-platform in under a year
Dependability – very few operational issues
Faster time-to-benefit via automated regression
Huge cost savings over Teradata

Spreading the love
The new solution worked so well we built Blink
a new data platform too!
It only took 4 months to do the entire re-platforming

Lessons learned
Take advantage of S3/Redshift integration
Use an S3 first approach whenever possible
Develop an architecture that accommodates
change
One size doesn’t fit all – each tool serves a purpose
E.g. Sometimes it’s Redshift and other times it’s Redshift Spectrum!
Automate everything
Leverage automated tests & deployments to your
analytics environment

Cloud-forward strategy
Micro-service architecture
Gamification & metric driven programming
IoT
Connected cardio, beacons, wearables
Integrated single view of customer & advanced CRM
Machine learning
Recommendations, predictions, NLP, chatbots
Data platforming
Redshift, EMR/Spark, S3/Glue/Spectrum/Athena, Zeppelin Notebooks
We love innovation

Thank you!
Lisa Perazzoli
plisa@amazon.com
