© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Big Data Analytics Architectural
Patterns and Best Practices
ANT201
Ben Snively
Solutions Architect
Amazon Web Services
What to expect from the session
Big data challenges
Architectural principles
How to simplify big data processing
What technologies should you use?
Why?
How?
Reference architecture
Design patterns
Customer examples
Demonstrations
Types of big data analytics
• Batch/interactive
• Stream processing
• Machine learning
Delivery of big data services
• Virtualized
• Managed services
• Serverless/clusterless/containerized
Plethora of tools
Big data challenges
Why?
How?
What tools should I use?
Is there a reference architecture?
Architectural principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage managed and serverless services
• Scalable/elastic, available, reliable, secure, no/low admin
Use event-journal design patterns
• Immutable datasets (data lake), materialized views
Be cost-conscious
• Big data ≠ big cost
Enable your applications with machine learning (ML)
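The event-journal principle above can be illustrated with a minimal sketch. It assumes a hypothetical `PageView` event and an in-memory `EventJournal` standing in for the immutable dataset in the data lake; neither name comes from the talk. The point is only that the materialized view is derived from the journal and can be rebuilt at any time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)  # frozen: an event cannot be changed once created
class PageView:
    """Hypothetical event type used only for this illustration."""
    user_id: str
    url: str


class EventJournal:
    """Append-only store standing in for the immutable raw dataset."""

    def __init__(self) -> None:
        self._events: List[PageView] = []

    def append(self, event: PageView) -> None:
        self._events.append(event)

    def read_all(self) -> Tuple[PageView, ...]:
        # Return a tuple so callers cannot mutate the journal.
        return tuple(self._events)


def views_per_url(events: Iterable[PageView]) -> Dict[str, int]:
    """Materialized view: derived entirely from the journal."""
    counts: Dict[str, int] = defaultdict(int)
    for event in events:
        counts[event.url] += 1
    return dict(counts)


journal = EventJournal()
journal.append(PageView("u1", "/home"))
journal.append(PageView("u2", "/home"))
journal.append(PageView("u1", "/pricing"))
view = views_per_url(journal.read_all())
print(view)
```

Because the view holds no state of its own, a new question (for example, views per user) only needs a new function over the same journal.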
Simplify big data processing
Collect → Store → Process/analyze → Consume
• Time to answer (latency)
• Throughput
• Cost
What is the temperature of your data/analytics?
| | Hot data | Warm data | Cold data |
|---|---|---|---|
| Volume | MB–GB | GB–TB | PB–EB |
| Item size | B–KB | KB–MB | KB–TB |
| Latency | µs, ms | ms, sec | min, hrs |
| Durability | Low–high | High | Very high |
| Request rate | Very high | High | Low |
| Cost/GB | $$–$ | $–¢¢ | ¢ |
Collect
Type of data
• RECORDS (applications): mobile apps, web apps, data centers (AWS Direct Connect); transactions such as data structures and database records
• FILES (data transport & logging): migration (Snowball, import/export), logging (Amazon CloudWatch, AWS CloudTrail); files/objects such as log files and media files
• STREAMS (IoT): devices, sensors, IoT platforms (AWS IoT); events as data streams
Collect → Store, by type of data
• Transactions (RECORDS): NoSQL, in-memory, SQL
• Files (FILES): file/object store
• Events (STREAMS): stream storage
Stream storage
Apache Kafka
• High throughput distributed streaming platform
Amazon Kinesis Data Streams
• Managed stream storage
Amazon Kinesis Data Firehose
• Managed data delivery
Which stream/message storage should I use?
| | Amazon Kinesis Data Streams | Amazon Kinesis Data Firehose | Apache Kafka (on Amazon Elastic Compute Cloud [Amazon EC2]) | Amazon Simple Queue Service (Amazon SQS) (Standard) | Amazon SQS (FIFO) |
|---|---|---|---|---|---|
| AWS managed | Yes | Yes | No | Yes | Yes |
| Guaranteed ordering | Yes | No | Yes | No | Yes |
| Delivery (deduping) | At least once | At least once | At least/at most/exactly once | At least once | Exactly once |
| Data retention period | 7 days | N/A | Configurable | 14 days | 14 days |
| Availability | Three AZ | Three AZ | Configurable | Three AZ | Three AZ |
| Scale/throughput | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue |
| Multiple consumers | Yes | No | Yes | No | No |
| Row/object size | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB |
| Cost | Low | Low | Low (+admin) | Low-medium | Low-medium |
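One way to read the comparison above is as a filter over requirements. The sketch below is only an illustration: it transcribes three rows of the table (AWS managed, guaranteed ordering, multiple consumers) into a hypothetical `StreamStore` record and returns the options that satisfy the requested constraints. The names `StreamStore`, `STORES`, and `candidates` are invented for this example, and the values reflect the 2018 table rather than current service behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StreamStore:
    name: str
    aws_managed: bool
    guaranteed_ordering: bool
    multiple_consumers: bool


# Rows transcribed from the comparison table (2018 figures).
STORES = [
    StreamStore("Amazon Kinesis Data Streams", True, True, True),
    StreamStore("Amazon Kinesis Data Firehose", True, False, False),
    StreamStore("Apache Kafka (on Amazon EC2)", False, True, True),
    StreamStore("Amazon SQS (Standard)", True, False, False),
    StreamStore("Amazon SQS (FIFO)", True, True, False),
]


def candidates(need_managed: bool = False,
               need_ordering: bool = False,
               need_multiple_consumers: bool = False) -> List[str]:
    """Return the stores that satisfy every requested requirement."""
    return [
        store.name
        for store in STORES
        if (store.aws_managed or not need_managed)
        and (store.guaranteed_ordering or not need_ordering)
        and (store.multiple_consumers or not need_multiple_consumers)
    ]


print(candidates(need_managed=True, need_ordering=True,
                 need_multiple_consumers=True))
```

Cost, retention, and record size are left out of the filter on purpose; they depend on the workload and would need their own thresholds.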
File/object storage
Amazon Simple Storage Service (Amazon S3)
• Managed object storage service built to store and retrieve any amount of data
Use Amazon S3 as the storage for your data lake
• Natively supported by big data frameworks (Spark, Hive, Presto, and others)
• Decouple storage and compute
  • No need to run compute clusters for storage (unlike HDFS)
  • Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
  • Multiple & heterogeneous analysis clusters and services can use the same data
• Designed for 99.999999999% durability
• No need to pay for data replication within a region
• Secure – SSL, client/server-side encryption at rest
• Low cost
What about HDFS & data tiering?
• Use HDFS/local for working data sets (for example, iterative read on the same data sets)
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard–IA for less frequently accessed data
• Use Amazon Glacier for archiving cold data
• Use S3 analytics to optimize tiering strategy
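The tiering guidance above can be sketched as a small decision function. The thresholds (`ia_after_days=30`, `archive_after_days=365`) are assumptions chosen for illustration, not values from the talk; in practice they would come from observed access patterns, for example from S3 analytics. The `Tier` enum and `choose_tier` function are hypothetical names.

```python
from enum import Enum


class Tier(Enum):
    HDFS_LOCAL = "HDFS/local"
    S3_STANDARD = "Amazon S3 Standard"
    S3_STANDARD_IA = "Amazon S3 Standard-IA"
    GLACIER = "Amazon Glacier"


def choose_tier(days_since_last_access: int,
                is_working_set: bool = False,
                ia_after_days: int = 30,        # assumed threshold
                archive_after_days: int = 365   # assumed threshold
                ) -> Tier:
    """Map an access pattern to a storage tier, following the slide's order."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if is_working_set:
        # Data that is read iteratively stays close to compute.
        return Tier.HDFS_LOCAL
    if days_since_last_access >= archive_after_days:
        return Tier.GLACIER
    if days_since_last_access >= ia_after_days:
        return Tier.S3_STANDARD_IA
    return Tier.S3_STANDARD


print(choose_tier(3).value)
print(choose_tier(90).value)
print(choose_tier(800).value)
```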
Cache & database
Amazon ElastiCache
• Managed Memcached or Redis service
Amazon DynamoDB Accelerator (DAX)
• Managed in-memory cache for DynamoDB
Amazon Neptune
• Managed graph DB
Amazon DynamoDB
• Managed key value/document DB
Amazon Relational Database Service (Amazon RDS)
• Managed relational database service
Anti-pattern
Single database tier
Best practice: Use the right tool for the job
| Data tier | Relational | Key-value | Document | In-memory | Graph |
|---|---|---|---|---|---|
| Description | Referential integrity with strong consistency, transactions, and hardened scale | Low-latency, key-based queries with high throughput and fast data ingestion | Indexing and storing of documents with support for query on any property | Microsecond latency, key-based queries, specialized data structures | Creating and navigating relations between data easily and quickly |
| Queries | Complex query support via SQL | Simple query methods with filters | Simple query with filters, projections and aggregates | Simple query methods with filters | Easily express queries in terms of relations |
Which data store should I use?
Ask yourself some questions
What is the data structure?
How will the data be accessed?
What is the temperature of the data?
What will the solution cost?
What is the data structure?
| Data structure | What to use? |
|---|---|
| Fixed schema | SQL, NoSQL |
| Schema-free (JSON) | NoSQL, search |
| Key/value | In-memory, NoSQL |
| Graph | GraphDB |
How will the data be accessed?
| Access patterns | What to use? |
|---|---|
| Put/get (key, value) | In-memory, NoSQL |
| Simple relationships → 1:N, M:N | NoSQL |
| Multi-table joins, transaction, SQL | SQL |
| Faceting, search | Search |
| Graph traversal | GraphDB |
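The two lookups above (data structure and access pattern) can be combined mechanically: a store category is a candidate only if both questions point to it. The sketch below transcribes the two tables into dictionaries; the function name `suggest_stores` and the exact key spellings are assumptions made for this illustration.

```python
from typing import Dict, List

# Transcribed from the "data structure" table.
DATA_STRUCTURE_TO_STORE: Dict[str, List[str]] = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free (JSON)": ["NoSQL", "Search"],
    "key/value": ["In-memory", "NoSQL"],
    "graph": ["GraphDB"],
}

# Transcribed from the "access patterns" table.
ACCESS_PATTERN_TO_STORE: Dict[str, List[str]] = {
    "put/get (key, value)": ["In-memory", "NoSQL"],
    "simple relationships (1:N, M:N)": ["NoSQL"],
    "multi-table joins, transactions, SQL": ["SQL"],
    "faceting, search": ["Search"],
    "graph traversal": ["GraphDB"],
}


def suggest_stores(data_structure: str, access_pattern: str) -> List[str]:
    """Return store categories recommended by both tables."""
    by_structure = DATA_STRUCTURE_TO_STORE[data_structure]
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern]
    # Keep the order of the data-structure table for stable output.
    return [store for store in by_structure if store in by_access]


print(suggest_stores("schema-free (JSON)", "faceting, search"))
print(suggest_stores("key/value", "put/get (key, value)"))
```

An empty result signals a mismatch between the shape of the data and the way it is queried, which is often a hint that more than one data tier is needed.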
What is the temperature of the data?
[Figure: data stores placed along a hot data → warm data → cold data axis; the labels that survive are NoSQL, SQL, and Amazon Glacier. From hot to cold, request rate and cost/GB go from high to low, while latency and data volume go from low to high. A second axis shows structure, from low to high.]
Database characteristics
| | Amazon ElastiCache | Amazon DynamoDB + DAX | Amazon Aurora | Amazon RDS | Amazon Elasticsearch Service | Amazon Neptune | Amazon S3 + Amazon Glacier |
|---|---|---|---|---|---|---|---|
| Use cases | In-memory caching | K/V lookups, document store | OLTP, transactional | OLTP, transactional | Log analysis, reverse indexing | Graph | File store |
| Performance | Ultra high request rate, ultra low latency | Ultra high request rate, ultra low to low latency | Very high request rate, low latency | High request rate, low latency | Medium request rate, low latency | Medium request rate, low latency | High throughput |
| Shape | K/V | K/V and document | Relational | Relational | Documents | Node/edges | Files |
| Size | GB | TB, PB (no limits) | GB, mid TB | GB, low TB | GB, TB | GB, mid TB | GB, TB, PB, EB (no limits) |
| Cost/GB | $$ | ¢¢–$$ | ¢¢ | ¢¢ | ¢¢ | ¢¢ | ¢–¢4/10 |
| Availability | 2 AZ | 3 AZ | 3 AZ | 2 AZ | 1-2 AZ | 3 AZ | 3 AZ |
| VPC support | Inside VPC | VPC endpoint | Inside VPC | Inside VPC | Outside or inside VPC | Inside VPC | VPC endpoint |
Columns run from hot data (left) through warm data to cold data (right).
Process/analyze
Interactive & batch analytics
Amazon Elasticsearch Service
• Managed service for Elasticsearch
Amazon Redshift & Amazon Redshift Spectrum
• Managed data warehouse
• Spectrum enables querying S3
Amazon Athena
• Serverless interactive query service
Amazon EMR
• Managed Hadoop framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase, and others
Stream/real-time analytics
Spark Streaming on Amazon EMR
Amazon Kinesis Data Analytics
• Managed service for running SQL on streaming data
Amazon KCL
• Amazon Kinesis Client Library
AWS Lambda
• Run code serverless (without provisioning or managing servers)
• Services such as S3 can publish events to AWS Lambda
• AWS Lambda can poll events from Kinesis
Predictive analytics
• Developers
• Data scientists
• Deep learning experts
Which analytics should I use?
Batch
• Takes minutes to hours
• Example: Daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
• Takes seconds
• Example: Self-service dashboards
• Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream
• Takes milliseconds to seconds
• Example: Fraud alerts, one-minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Data Analytics, Amazon KCL, AWS Lambda, and others
Predictive
• Takes milliseconds (real-time) to minutes (batch)
• Example: Fraud detection, forecasting demand, speech recognition
• Amazon SageMaker, Amazon Polly, Amazon Rekognition, Amazon Transcribe, Amazon Translate, Amazon EMR (Spark ML), Amazon Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
Which stream processing technology should I use?
| | Amazon EMR (Spark Streaming) | KCL application | Amazon Kinesis Data Analytics | AWS Lambda |
|---|---|---|---|---|
| Managed service | Yes | No (EC2 + Auto Scaling) | Yes | Yes |
| Serverless | No | No | Yes | Yes |
| Scale/throughput | No limits / ~ nodes | No limits / ~ nodes | No limits / automatic | No limits / automatic |
| Availability | Single AZ | Multi-AZ | Multi-AZ | Multi-AZ |
| Programming languages | Java, Python, Scala | Java, others through MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python, .NET Core |
| Sliding window functions | Built-in | App needs to implement | Built-in | No |
| Reliability | KCL and Spark checkpoints | Managed by Amazon KCL | Managed by Amazon Kinesis Data Analytics | Managed by AWS Lambda |
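Where the comparison says the application needs to implement sliding windows itself, the core logic is small. The sketch below is a generic, in-memory illustration and does not use the Kinesis Client Library; `SlidingWindowCounter` is a hypothetical name, and it assumes events arrive with non-decreasing timestamps. A real consumer would also have to handle checkpointing and out-of-order records.

```python
from collections import deque
from typing import Deque


class SlidingWindowCounter:
    """Counts events in the half-open window (t - window_seconds, t]."""

    def __init__(self, window_seconds: float) -> None:
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def add(self, timestamp: float) -> int:
        """Record an event and return the count in the window ending at it."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Evict events at or before the cutoff; assumes ordered timestamps.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = SlidingWindowCounter(window_seconds=60)
counts = [counter.add(t) for t in (0, 10, 59, 60, 125)]
print(counts)
```

The deque keeps eviction cheap: each timestamp is appended once and removed once, so the cost per event is constant on average.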
Which analytics tool should I use?
| | Amazon Redshift | Amazon Redshift Spectrum | Amazon Athena | Amazon EMR (Presto, Spark, Hive) |
|---|---|---|---|---|
| Use case | Optimized for data warehousing | Query S3 data from Amazon Redshift | Interactive queries over S3 data | Presto: interactive query; Spark: general purpose; Hive: batch |
| Scale/throughput | ~ Nodes | ~ Nodes | Automatic | ~ Nodes |
| Managed service | Yes | Yes | Yes, serverless | Yes |
| Storage | Local storage | Amazon S3 | Amazon S3 | Amazon S3, HDFS |
| Optimization | Columnar storage, data compression, and zone maps | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | Framework dependent |
| Metadata | Amazon Redshift catalog | AWS Glue catalog | AWS Glue catalog | AWS Glue catalog or Hive Metastore |
| Auth/access controls | IAM, users, groups, and access controls | IAM, users, groups, and access controls | IAM | IAM, LDAP, & Kerberos |
| UDF support | Yes (scalar) | Yes (scalar) | No | Yes |
ELT/ETL: Preparing raw data for consumption
Raw data stored in data lake
ELT/ETL preparation
• Normalized
• Partitioned
• Compressed
• Storage optimized
Data lake on AWS
• Raw immutable datasets
• Curated datasets
• Data catalog
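Two of the preparation steps above, partitioning and compression, can be sketched with the standard library alone. This is an illustration under assumptions: the event shape (a `timestamp` field in epoch seconds), the `curated/events` prefix, and the `part-0000.json.gz` file name are all hypothetical, and a production job would more likely write a columnar format such as Parquet or ORC through a framework.

```python
import gzip
import json
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def partition_key(event: dict, prefix: str = "curated/events") -> str:
    """Build a Hive-style year=/month=/day= object key from the event time."""
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return f"{prefix}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/part-0000.json.gz"


def prepare(events: Iterable[dict]) -> Dict[str, bytes]:
    """Group raw events by partition and gzip each group as JSON lines."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        grouped[partition_key(event)].append(event)
    return {
        key: gzip.compress(
            "\n".join(json.dumps(row, sort_keys=True) for row in rows).encode("utf-8")
        )
        for key, rows in grouped.items()
    }


raw_events = [
    {"timestamp": 1543276800, "user": "u1"},  # 2018-11-27 00:00:00 UTC
    {"timestamp": 1543280400, "user": "u2"},  # 2018-11-27 01:00:00 UTC
    {"timestamp": 1543363200, "user": "u1"},  # 2018-11-28 00:00:00 UTC
]
curated = prepare(raw_events)
for key in sorted(curated):
    print(key)
```

The returned mapping of object key to compressed bytes could then be uploaded to the curated area of the data lake; the raw events stay untouched.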
Which ELT/ETL tool should I use?
| | AWS Glue ETL | AWS Data Pipeline | AWS Database Migration Service (AWS DMS) | Amazon EMR | Apache NiFi | Partner solution |
|---|---|---|---|---|---|---|
| Use case | Serverless ETL | Data workflow service | Migrate databases (and to/from data lake) | Custom-developed Hadoop/Spark ETL | Automate the flow of data between systems | Rich partner ecosystem for ETL |
| Scale/throughput | ~ DPUs | ~ Nodes through EMR cluster | EC2 instance type | ~ Nodes | Self-managed | Self-managed or through partner |
| Managed service | Clusterless | Managed | Managed EC2 on your behalf | Managed EC2 on your behalf | Self-managed on Amazon EMR or Marketplace | Self-managed or through partner |
| Data sources | S3, RDBMS, Amazon Redshift, DynamoDB | S3, JDBC, DynamoDB, custom | RDBMS, data warehouses, Amazon S3* (*limited) | Various Hadoop/Spark managed | Various through rich processor framework | Various |
| Skills needed | Wizard for simple mapping, code snippets for advanced ETL | Wizard and code snippets | Wizard and drag/drop | Hadoop/Spark coding | NiFi processors and some coding | Self-managed or through partner |
Consume
Collect → Store → Process/analyze (with ETL) → Consume
• Analysis & visualization: Amazon QuickSight
• Data science
• AI apps, apps
Consumers: SW engineers/business users, data scientists/data engineers
Putting it all together
Collect → Store → Process/analyze → Consume
Collect
• RECORDS (applications): mobile apps, web apps, data centers (AWS Direct Connect)
• FILES (data transport & logging): migration (Snowball, import/export), logging (Amazon CloudWatch, AWS CloudTrail)
• STREAMS (IoT): devices, sensors, IoT platforms (AWS IoT)
Store
• Stream (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose
• File: Amazon S3
• SQL, NoSQL, cache: Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon Neptune, Amazon ElastiCache, Amazon DAX
Process/analyze
• ETL
• Batch and interactive (slow to fast): Amazon EMR, Presto, Amazon Redshift & Spectrum, Amazon Athena, Amazon ES
• Stream (fast): KCL apps, AWS Lambda, Amazon Kinesis Analytics
• Predictive: Amazon ML; models: train/eval, deploy
Consume
• Analysis & visualization: Amazon QuickSight
• Data science
• AI apps, apps
Design patterns
[Figure: processing technology (fast to slow) plotted against data store (hot to cold). Data stores: Amazon Kinesis (streaming), Amazon DynamoDB/RDS, Amazon S3. Processing technologies: Spark Streaming, AWS Lambda, KCL apps, native apps, Amazon Redshift, Hive, Spark, Presto, Amazon Athena.]
Streaming analytics
Store
• Amazon Kinesis
Process
• Amazon EMR (Spark Streaming), KCL app, AWS Lambda, Amazon Kinesis Data Analytics
• Amazon AI: real-time prediction
Outputs
• App state or materialized view, KPI: Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES
• Alerts: Amazon SNS notifications
• Log: Amazon S3
• Fan out: Amazon Kinesis, downstream
Hearst’s serverless data pipeline
• cosmopolitan.com, caranddriver.com, sfchronicle.com, elle.com
• Ingestion proxy (Node.js)
• Serverless data pipeline
• Offline analysis and archive
• Real-time analysis
Yieldmo’s data ingestion and real-time analysis architecture
• Mobile devices (thousands)
• Event: { ad info, user info, phone info }
• Records: page view ID, ad ID, fully viewed? (Y | N)
• Data warehouse
Interactive & batch analytics
Store
• Amazon S3 (files), Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics
Process
• Batch: Amazon EMR (Hive, Pig, Spark)
• Interactive: Amazon Redshift, Amazon ES, Amazon Athena, Amazon EMR (Presto, Spark)
• Amazon AI: batch prediction, real-time prediction
Consume
FINRA: Migrating to AWS
• Petabytes of data generated on-premises, brought to AWS, and stored in S3
• Thousands of analytical queries performed on Amazon EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, AWS CloudTrail, and database auditing
• Flexible interactive queries, predefined queries, surveillance analytics
• Web applications; analysts; regulators
[Figure: combined pattern covering real-time and interactive & batch processing.
• Applications produce transactions (Amazon DynamoDB, Amazon RDS; change data capture or export), streams (Amazon Kinesis), and files
• Data lake: Amazon S3; Amazon Kinesis Data Firehose
• Real-time: KCL, AWS Lambda, Spark Streaming on Amazon EMR, Amazon Kinesis Data Analytics, Amazon AI
• App state or materialized view: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES
• Interactive & batch: Amazon Redshift, Amazon Athena, Amazon ES, Amazon EMR (Presto, Hive, Pig, Spark)]
What about metadata?
• AWS Glue catalog
  • Hive Metastore compliant
  • Crawlers - Detect new data, schema, partitions
  • Search - Metadata discovery
  • Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum compatible
• Hive Metastore (Presto, Spark, Hive, Pig)
  • Can be hosted on Amazon RDS
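To make the idea of detecting partitions concrete, here is a deliberately simplified sketch of reading Hive-style `name=value` segments out of object keys. It is not how AWS Glue crawlers are implemented and it calls no AWS API; the function names and the sample keys are hypothetical, chosen only to show what "new partitions" means for a catalog.

```python
from typing import Dict, Iterable, List, Set, Tuple

Partition = Tuple[Tuple[str, str], ...]


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract Hive-style `name=value` path segments from an object key."""
    partitions: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


def discover_partitions(keys: Iterable[str]) -> List[Partition]:
    """Return the distinct partitions found across a listing of keys."""
    seen: Set[Partition] = set()
    for key in keys:
        parts = tuple(parse_partitions(key).items())
        if parts:
            seen.add(parts)
    return sorted(seen)


listing = [
    "curated/events/year=2018/month=11/day=27/part-0000.json.gz",
    "curated/events/year=2018/month=11/day=27/part-0001.json.gz",
    "curated/events/year=2018/month=11/day=28/part-0000.json.gz",
    "curated/events/_SUCCESS",
]
for partition in discover_partitions(listing):
    print(dict(partition))
```

Comparing the discovered set with what the catalog already holds gives the partitions that still need to be registered.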
Summary
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed and serverless services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, data lake, materialized views
Be cost-conscious
• Big data ≠ big cost
Enable your applications with AI/ML
Thank you!
Ben Snively
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

More Related Content

What's hot

Introduction to AWS Lambda and Serverless Applications
Introduction to AWS Lambda and Serverless ApplicationsIntroduction to AWS Lambda and Serverless Applications
Introduction to AWS Lambda and Serverless ApplicationsAmazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 
20191120 AWS Black Belt Online Seminar Amazon Managed Streaming for Apache Ka...
20191120 AWS Black Belt Online Seminar Amazon Managed Streaming for Apache Ka...20191120 AWS Black Belt Online Seminar Amazon Managed Streaming for Apache Ka...
20191120 AWS Black Belt Online Seminar Amazon Managed Streaming for Apache Ka...Amazon Web Services Japan
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSAmazon Web Services
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017Amazon Web Services
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSAmazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksAmazon Web Services
 
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWSAmazon Web Services Korea
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Amazon Web Services
 
AWS Backup을 이용한 데이터베이스의 백업 자동화와 편리한 복구방법
AWS Backup을 이용한 데이터베이스의 백업 자동화와 편리한 복구방법AWS Backup을 이용한 데이터베이스의 백업 자동화와 편리한 복구방법
AWS Backup을 이용한 데이터베이스의 백업 자동화와 편리한 복구방법Amazon Web Services Korea
 
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018Amazon Web Services
 

What's hot (20)

Introduction to AWS Lambda and Serverless Applications
Introduction to AWS Lambda and Serverless ApplicationsIntroduction to AWS Lambda and Serverless Applications
Introduction to AWS Lambda and Serverless Applications
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Amazon Redshift
Amazon Redshift Amazon Redshift
Amazon Redshift
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
20191120 AWS Black Belt Online Seminar Amazon Managed Streaming for Apache Ka...
20191120 AWS Black Belt Online Seminar Amazon Managed Streaming for Apache Ka...20191120 AWS Black Belt Online Seminar Amazon Managed Streaming for Apache Ka...
20191120 AWS Black Belt Online Seminar Amazon Managed Streaming for Apache Ka...
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Amazon QuickSight
Amazon QuickSightAmazon QuickSight
Amazon QuickSight
 
How to Streamline DataOps on AWS
How to Streamline DataOps on AWSHow to Streamline DataOps on AWS
How to Streamline DataOps on AWS
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech Talks
 
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
 
AWS Backup을 이용한 데이터베이스의 백업 자동화와 편리한 복구방법
AWS Backup을 이용한 데이터베이스의 백업 자동화와 편리한 복구방법AWS Backup을 이용한 데이터베이스의 백업 자동화와 편리한 복구방법
AWS Backup을 이용한 데이터베이스의 백업 자동화와 편리한 복구방법
 
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
 
Intro to AWS: Database Services
Intro to AWS: Database ServicesIntro to AWS: Database Services
Intro to AWS: Database Services
 

Similar to AWS Big Data Analytics Architectural Patterns

Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCAmazon Web Services LATAM
 
Data Warehouses & Data Lakes: Data Analytics Week SF
Data Warehouses & Data Lakes: Data Analytics Week SFData Warehouses & Data Lakes: Data Analytics Week SF
Data Warehouses & Data Lakes: Data Analytics Week SFAmazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAmazon Web Services
 
Data Warehouses & Data Lakes: Data Analytics Week at the SF Loft
Data Warehouses & Data Lakes: Data Analytics Week at the SF LoftData Warehouses & Data Lakes: Data Analytics Week at the SF Loft
Data Warehouses & Data Lakes: Data Analytics Week at the SF LoftAmazon Web Services
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfAmazon Web Services
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Amazon Web Services
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Amazon Web Services
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAdir Sharabi
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Amazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Amazon Web Services
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Amazon Web Services
 
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...AWS Riyadh User Group
 




AWS Big Data Analytics Architectural Patterns

  • 1.
  • 2. Big Data Analytics Architectural Patterns and Best Practices (ANT201). Ben Snively, Solutions Architect, Amazon Web Services
  • 3. What to expect from the session: Big data challenges; Architectural principles; How to simplify big data processing; What technologies should you use? Why? How?; Reference architecture; Design patterns; Customer examples; Demonstrations
  • 4. Types of big data analytics: Batch/interactive; Stream processing; Machine learning
  • 5. Delivery of big data services: Virtualized; Managed services; Serverless/clusterless/containerized
  • 6. Plethora of tools
  • 7. Big data challenges: Why? How? What tools should I use? Is there a reference architecture?
  • 8. Architectural principles: Build decoupled systems • Data → Store → Process → Store → Analyze → Answers. Use the right tool for the job • Data structure, latency, throughput, access patterns. Leverage managed and serverless services • Scalable/elastic, available, reliable, secure, no/low admin. Use event-journal design patterns • Immutable datasets (data lake), materialized views. Be cost-conscious • Big data ≠ big cost. Machine learning (ML) enable your applications
  • 9. Simplify big data processing. Stages: Collect, Store, Process/Analyze, Consume. Considerations: Time to answer (latency), Throughput, Cost
  • 10. What is the temperature of your data/analytic?
      Attribute | Hot data | Warm data | Cold data
      Volume | MB–GB | GB–TB | PB–EB
      Item size | B–KB | KB–MB | KB–TB
      Latency | µs, ms | ms, sec | min, hrs
      Durability | Low–high | High | Very high
      Request rate | Very high | High | Low
      Cost/GB | $$-$ | $-¢¢ | ¢
  • 11. Collect
  • 12. Collect (diagram). Type of data: Applications (Mobile apps, Web apps, Data centers, AWS Direct Connect): RECORDS, Transactions (Data structures, Database records). Data transport & logging (Migration: Snowball, Import/export; Logging: Amazon CloudWatch, AWS CloudTrail): FILES, Files/objects (Log files, Media files). IoT (Devices, Sensors, IoT platforms, AWS IoT): STREAMS, Events (Data streams)
  • 13. Collect and Store (diagram). Collect: same sources as slide 12, grouped by type of data (Events, Files, Transactions). Store: NoSQL, In-memory, SQL, File/object store, Stream storage
  • 14. Stream storage (diagram: Collect, Store). Apache Kafka • High throughput distributed streaming platform. Amazon Kinesis Data Streams • Managed stream storage. Amazon Kinesis Data Firehose • Managed data delivery
  • 15. Which stream/message storage should I use?
      Attribute | Amazon Kinesis Data Streams | Amazon Kinesis Data Firehose | Apache Kafka (on Amazon Elastic Compute Cloud [Amazon EC2]) | Amazon Simple Queue Service (Amazon SQS) (Standard) | Amazon SQS (FIFO)
      AWS managed | Yes | Yes | No | Yes | Yes
      Guaranteed ordering | Yes | No | Yes | No | Yes
      Delivery (deduping) | At least once | At least once | At least/at most/exactly once | At least once | Exactly once
      Data retention period | 7 days | N/A | Configurable | 14 days | 14 days
      Availability | Three AZ | Three AZ | Configurable | Three AZ | Three AZ
      Scale/throughput | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue
      Multiple consumers | Yes | No | Yes | No | No
      Row/object size | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
      Cost | Low | Low | Low (+admin) | Low-medium | Low-medium
  • 16. File/object storage (diagram: Collect, Store). Amazon Simple Storage Service (Amazon S3) • Managed object storage service built to store and retrieve any amount of data
  • 17. Use Amazon S3 as the storage for your data lake • Natively supported by big data frameworks (Spark, Hive, Presto, and others) • Decouple storage and compute • No need to run compute clusters for storage (unlike HDFS) • Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances • Multiple & heterogeneous analysis clusters and services can use the same data • Designed for 99.999999999% durability • No need to pay for data replication within a region • Secure – SSL, client/server-side encryption at rest • Low cost
  • 18. What about HDFS & data tiering? • Use HDFS/local for working data sets (for example, iterative read on the same data sets) • Use Amazon S3 Standard for frequently accessed data • Use Amazon S3 Standard–IA for less frequently accessed data • Use Amazon Glacier for archiving cold data • Use S3 analytics to optimize tiering strategy
  • 19. Cache & database (diagram: Collect, Store). Amazon ElastiCache • Managed Memcached or Redis service. Amazon DynamoDB Accelerator (DAX) • Managed in-memory cache for DynamoDB. Amazon Neptune • Managed graph DB. Amazon DynamoDB • Managed key value/document DB. Amazon Relational Database Service (Amazon RDS) • Managed relational database service. Also shown in the diagram: Amazon Aurora
  • 20. Anti-pattern: Single database tier
  • 21. Best practice: Use the right tool for the job
      Data tier | Strengths | Query support
      Relational | Referential integrity with strong consistency, transactions, and hardened scale | Complex query support via SQL
      Key-value | Low-latency, key-based queries with high throughput and fast data ingestion | Simple query methods with filters
      Document | Indexing and storing of documents with support for query on any property | Simple query with filters, projections and aggregates
      In-memory | Microsecond latency, key-based queries, specialized data structures | Simple query methods with filters
      Graph | Creating and navigating relations between data easily and quickly | Easily express queries in terms of relations
  • 22. Which data store should I use? Ask yourself some questions: What is the data structure? How will the data be accessed? What is the temperature of the data? What will the solution cost?
  • 23. What is the data structure? How will the data be accessed?
      Access patterns | What to use?
      Put/get (key, value) | In-memory, NoSQL
      Simple relationships → 1:N, M:N | NoSQL
      Multi-table joins, transaction, SQL | SQL
      Faceting, search | Search
      Graph traversal | GraphDB

      Data structure | What to use?
      Fixed schema | SQL, NoSQL
      Schema-free (JSON) | NoSQL, search
      Key/value | In-memory, NoSQL
      Graph | GraphDB
  • 24. What is the temperature of the data? (diagram) From hot data through warm data to cold data: Request rate High to Low; Cost/GB High to Low; Latency Low to High; Data volume Low to High. Structure axis: Low to High. Stores labeled in the diagram: SQL, NoSQL, Amazon Glacier
  • 25. Database characteristics (ordered from hot data through warm data to cold data)
      Attribute | Amazon ElastiCache | Amazon DynamoDB + DAX | Amazon Aurora | Amazon RDS | Amazon Elasticsearch Service | Amazon Neptune | Amazon S3 + Amazon Glacier
      Use cases | In memory caching | K/V lookups, document store | OLTP, transactional | OLTP, transactional | Log analysis, reverse indexing | Graph | File store
      Performance | Ultra high request rate, ultra low latency | Ultra high request rate, ultra low to low latency | Very high request rate, low latency | High request rate, low latency | Medium request rate, low latency | Medium request rate, low latency | High throughput
      Shape | K/V | K/V and document | Relational | Relational | Documents | Node/edges | Files
      Size | GB | TB, PB (no limits) | GB, mid TB | GB, low TB | GB, TB | GB, mid TB | GB, TB, PB, EB (no limits)
      Cost / GB | $$ | ¢¢ - $$ | ¢¢ | ¢¢ | ¢¢ | ¢¢ | ¢- ¢4/10
      Availability | 2 AZ | 3 AZ | 3 AZ | 2 AZ | 1-2 AZ | 3 AZ | 3 AZ
      VPC support | Inside VPC | VPC endpoint | Inside VPC | Inside VPC | Outside or inside VPC | Inside VPC | VPC endpoint
  • 26. Process/analyze
  • 27. Interactive & batch analytics (Process/analyze). Amazon Elasticsearch Service (Amazon ES) • Managed service for Elasticsearch. Amazon Redshift & Amazon Redshift Spectrum • Managed data warehouse • Spectrum enables querying S3. Amazon Athena • Serverless interactive query service. Amazon EMR • Managed Hadoop framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase, and others
  • 28. Stream/real-time analytics (Process/analyze). Spark Streaming on Amazon EMR. Amazon Kinesis Data Analytics • Managed service for running SQL on streaming data. Amazon KCL • Amazon Kinesis Client Library. AWS Lambda • Run code serverless (without provisioning or managing servers) • Services such as S3 can publish events to AWS Lambda • AWS Lambda can poll events from a Kinesis stream
  • 29. Predictive analytics (Process/analyze). Audiences: Developers, Data scientists, Deep learning experts
  • 30. Which analytics should I use? (Process/analyze, ordered from slow to fast)
      Type | Time to answer | Example | Services
      Batch | Minutes to hours | Daily/weekly/monthly reports | Amazon EMR (MapReduce, Hive, Pig, Spark)
      Interactive | Seconds | Self-service dashboards | Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
      Stream | Milliseconds to seconds | Fraud alerts, one-minute metrics | Amazon EMR (Spark Streaming), Amazon Kinesis Data Analytics, Amazon KCL, AWS Lambda, and others
      Predictive | Milliseconds (real-time) to minutes (batch) | Fraud detection, forecasting demand, speech recognition | Amazon SageMaker, Amazon Polly, Amazon Rekognition, Amazon Transcribe, Amazon Translate, Amazon EMR (Spark ML), Amazon Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
  • 31. Which stream processing technology should I use?
      Attribute | Amazon EMR (Spark Streaming) | KCL application | Amazon Kinesis Data Analytics | AWS Lambda
      Managed service | Yes | No (EC2 + Auto Scaling) | Yes | Yes
      Serverless | No | No | Yes | Yes
      Scale / throughput | No limits / ~ nodes | No limits / ~ nodes | No limits / automatic | No limits / automatic
      Availability | Single AZ | Multi-AZ | Multi-AZ | Multi-AZ
      Programming languages | Java, Python, Scala | Java, others through MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python, .NET Core
      Sliding window functions | Built-in | App needs to implement | Built-in | No
      Reliability | KCL and Spark checkpoints | Managed by Amazon KCL | Managed by Amazon Kinesis Data Analytics | Managed by AWS Lambda
  • 32. Which analytics tool should I use?
      Attribute | Amazon Redshift | Amazon Redshift Spectrum | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
      Use case | Optimized for data warehousing | Query S3 data from Amazon Redshift | Interactive queries over S3 data | Presto: interactive query; Spark: general purpose; Hive: batch
      Scale/throughput | ~Nodes | ~Nodes | Automatic | ~Nodes
      Managed service | Yes | Yes | Yes, serverless | Yes
      Storage | Local storage | Amazon S3 | Amazon S3 | Amazon S3, HDFS
      Optimization | Columnar storage, data compression, and zone maps | Avro, Parquet, Text, Seq, RCFile, ORC, etc. | Avro, Parquet, Text, Seq, RCFile, ORC, etc. | Framework dependent
      Metadata | Amazon Redshift catalog | AWS Glue catalog | AWS Glue catalog | AWS Glue catalog or Hive Metastore
      Auth/access controls | IAM, users, groups, and access controls | IAM, users, groups, and access controls | IAM | IAM, LDAP, & Kerberos
      UDF support | Yes (scalar) | Yes (scalar) | No | Yes
  • 33. ELT/ETL: Preparing raw data for consumption. Raw data stored in data lake; Preparation (ELT/ETL): Normalized, Partitioned, Compressed, Storage optimized. Data lake on AWS: Raw immutable datasets, Curated datasets, Data catalog
  • 34.
  • 35. Which ELT/ETL tool should I use?
      Attribute | AWS Glue ETL | AWS Data Pipeline | AWS Database Migration Service (AWS DMS) | Amazon EMR | Apache NiFi | Partner solution
      Use case | Serverless ETL | Data workflow service | Migrate databases (and to/from data lake) | Custom-developed Hadoop/Spark ETL | Automate the flow of data between systems | Rich partner ecosystem for ETL
      Scale/throughput | ~DPUs | ~Nodes through EMR cluster | EC2 instance type | ~Nodes | Self-managed | Self-managed or through partner
      Managed service | Clusterless | Managed | Managed EC2 on your behalf | Managed EC2 on your behalf | Self-managed on Amazon EMR or Marketplace | Self-managed or through partner
      Data sources | S3, RDBMS, Amazon Redshift, DynamoDB | S3, JDBC, DynamoDB, custom | RDBMS, data warehouses, Amazon S3* (*limited) | Various Hadoop/Spark managed | Various through rich processor framework | Various
      Skills needed | Wizard for simple mapping, code snippets for advanced ETL | Wizard and code snippets | Wizard and drag/drop | Hadoop/Spark coding | NiFi processors and some coding | Self-managed or through partner
  • 36. Consume
  • 37. Consume (diagram: Collect, Store, ETL, Process/analyze, Consume). Analysis & visualization: Amazon QuickSight. Data science. AI apps, Apps. Consumers: SW engineers/business users, Data scientists/data engineers
  • 38. Putting it all together
  • 39. Reference architecture (diagram combining the stages above). Collect: Applications (Mobile apps, Web apps, Data centers, AWS Direct Connect), Data transport & logging (Migration: Snowball, Import/export; Logging: Amazon CloudWatch, AWS CloudTrail), IoT (Devices, Sensors, IoT platforms, AWS IoT); RECORDS, FILES, STREAMS. Store: Stream (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose), File (Amazon S3), Cache, NoSQL, and SQL (Amazon ElastiCache, Amazon DAX, Amazon DynamoDB, Amazon Neptune, Amazon RDS, Amazon Aurora). ETL. Process/analyze, from slow to fast: Batch and Interactive (Amazon EMR, Presto, Amazon Redshift & Spectrum, Amazon Athena, Amazon ES), Streaming (KCL apps, AWS Lambda, Amazon Kinesis Analytics), Predictive (Amazon ML; Model train/eval, Models, Deploy). Consume: Analysis & visualization (Amazon QuickSight), Data science, AI apps, Apps
  • 40. Design patterns
  • 41. Diagram: processing technology (fast to slow) plotted against data store (hot to cold). Data stores: Amazon Kinesis, Amazon DynamoDB/RDS, Amazon S3. Processing technology: Streaming (Spark Streaming, AWS Lambda, KCL apps), Native apps, Amazon Redshift, Amazon Athena, Hive, Spark, Presto
  • 42. Streaming analytics (diagram). Stream: Amazon Kinesis. Process: Amazon EMR (Spark Streaming), KCL app, AWS Lambda, Amazon Kinesis Data Analytics, Amazon AI (real-time prediction). Store: Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES (app state or materialized view; KPI), Amazon S3 (log). Downstream: Amazon SNS (notifications, alerts), Amazon Kinesis (fan out)
  • 43. Hearst’s serverless data pipeline (diagram). Sources: cosmopolitan.com, caranddriver.com, sfchronicle.com, elle.com. Ingestion proxy (Node.js); Serverless data pipeline; Offline analysis and archive; Real-time analysis
  • 44. Yieldmo’s data ingestion and real-time analysis architecture (diagram). Labels: Mobile devices (thousands); { ad info, user info, phone info }; Page view ID, Ad ID, Fully viewed? Y | N; Data warehouse
  • 45. Interactive & batch analytics (diagram). Store: Amazon S3 (files), Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics. Process, batch: Amazon EMR (Hive, Pig, Spark), Amazon AI (batch prediction). Process, interactive: Amazon Redshift, Amazon ES, Amazon EMR (Presto, Spark), Amazon Athena, Amazon AI (real-time prediction). Consume
  • 46. FINRA: Migrating to AWS • Petabytes of data generated on-premises, brought to AWS, and stored in S3 • Thousands of analytical queries performed on Amazon EMR and Amazon Redshift • Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, AWS CloudTrail, and database auditing. Diagram labels: Flexible interactive queries; Predefined queries; Surveillance analytics; Web applications; Analysts; regulators
  • 47. Diagram: data lake with interactive & batch and real-time paths. Inputs: Transactions (Amazon DynamoDB, Amazon RDS; change data capture or export), Stream (Amazon Kinesis), Files (Amazon Kinesis Data Firehose). Data lake: Amazon S3. Interactive & batch: Amazon Redshift, Amazon EMR (Presto, Hive, Pig, Spark), Amazon Athena, Amazon ES. Real-time: AWS Lambda, KCL, Spark Streaming on Amazon EMR, Amazon Kinesis Data Analytics, Amazon AI. App state or materialized view: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES. Applications
  • 48. What about metadata? (Data catalog) • AWS Glue catalog: Hive Metastore compliant; Crawlers - detect new data, schema, partitions; Search - metadata discovery; Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum compatible • Hive Metastore (Presto, Spark, Hive, Pig): Can be hosted on Amazon RDS
  • 49.
  • 50. Summary: Build decoupled systems • Data → Store → Process → Store → Analyze → Answers. Use the right tool for the job • Data structure, latency, throughput, access patterns. Leverage AWS managed and serverless services • Scalable/elastic, available, reliable, secure, no/low admin. Use log-centric design patterns • Immutable logs, data lake, materialized views. Be cost-conscious • Big data ≠ big cost. AI/ML enable your applications
  • 51. Thank you! Ben Snively
  • 52.
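
The two lookup tables on slide 23 (access patterns and data structure, each mapped to "What to use?") can be read as a small decision aid. The Python sketch below is only an illustration and is not part of the talk: the dictionaries transcribe the slide's rows, while the function name `candidate_stores` and the rule of intersecting the two answers are assumptions made for this example.

```python
from __future__ import annotations

# Rows transcribed from slide 23. The key spellings are normalized for
# this example; the store categories are the ones named on the slide.
ACCESS_PATTERN_TO_STORE = {
    "put/get (key, value)": {"In-memory", "NoSQL"},
    "simple relationships (1:N, M:N)": {"NoSQL"},
    "multi-table joins, transactions, SQL": {"SQL"},
    "faceting, search": {"Search"},
    "graph traversal": {"GraphDB"},
}

DATA_STRUCTURE_TO_STORE = {
    "fixed schema": {"SQL", "NoSQL"},
    "schema-free (JSON)": {"NoSQL", "Search"},
    "key/value": {"In-memory", "NoSQL"},
    "graph": {"GraphDB"},
}


def candidate_stores(access_pattern: str, data_structure: str) -> list[str]:
    """Return store categories that satisfy both questions.

    Hypothetical helper: intersecting the two lookups is an assumption
    of this sketch, not a rule stated in the slides.
    """
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern]
    by_structure = DATA_STRUCTURE_TO_STORE[data_structure]
    return sorted(by_access & by_structure)


if __name__ == "__main__":
    print(candidate_stores("put/get (key, value)", "key/value"))
    print(candidate_stores("multi-table joins, transactions, SQL", "fixed schema"))
```

An empty result (for example, graph traversal over a fixed-schema data set) means the two tables point to different store types, which is consistent with the deck's advice against a single database tier. The remaining questions from slide 22, data temperature and cost, are not modeled here.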