SlideShare a Scribd company logo
1 of 21
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
March 16, 2017
Fast Track to Your Data Lake on AWS
John Mallory, Business Development
Data has gravity
…easier to move processing to the data
4k/8k
Genomics
Seismic
Financial
Logs
IoT
Data has Business Value
Challenges with Legacy Data Architectures
• Can’t move data across silos
• Can’t afford to keep all of the data
• Can’t scale with dynamic data and real-time processing
• Can’t scale management of data
• Can’t find the people who know how to configure and
manage complex infrastructure
• Can’t afford the investments to keep refreshing
infrastructure and data centers
Enter Data Lake Architectures
Data Lake is a new and increasingly
popular architecture to store and analyze
massive volumes and heterogeneous
types of data.
Benefits of a Data Lake
• All Data in One Place
• Quick Ingest
• Storage vs Compute
• Schema on Read
1&2: Consolidate (Data) & Separate (Storage & Compute)
•S3 as the data lake storage tier; not a single analytics
tool like Hadoop or a data warehouse
•Decoupled storage and compute is cheaper and more
efficient to operate
•Decoupled storage and compute allow us to evolve to
clusterless architectures (i.e. Lambda, Athena & Glue)
•Do not build data silos in Hadoop or the EDW
•Gain flexibility to use all the analytics tools in the
ecosystem around S3 & future proof the architecture
Designed for 11 9s
of durability
• Multiple Encryption Options
• Robust/Highly Flexible Access Controls
Durable Secure High performance
 Multiple upload
 Range GET
 Scalable Throughput
 Store as much as you need
 Scale storage and compute
independently
 Scale without limits
 Affordable
Scalable
 Amazon EMR
 Amazon Redshift
 Amazon DynamoDB
 Amazon Athena
 Amazon Rekognition
 Amazon Glue
Integrated
 Simple REST API
 AWS SDKs
 Read-after-create consistency
 Event notification
 Lifecycle policies
 Simple Management Tools
 Hadoop compatibility
Easy to use
Why Choose Amazon S3 for data lake?
“For our market
surveillance systems, we
are looking at about 40%
[savings with AWS], but
the real benefits are the
business benefits: We
can do things that we
physically weren’t able to
do before, and that is
priceless.”
- Steve Randich, CIO
Case Study: Re-architecting Compliance
What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75
billion market events every day
• Store 5PB of historical data for analysis & training
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic
clusters (Hadoop, Hive, and HBase), Amazon EMR,
and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20m annually by using AWS
Encryption ComplianceSecurity
 Identity and Access
Management (IAM) policies
 Bucket policies
 Access Control Lists (ACLs)
 Private VPC endpoints to
Amazon S3
 SSL endpoints
 Server Side Encryption
(SSE-S3)
 S3 Server Side
Encryption with
provided keys (SSE-C,
SSE-KMS)
 Client-side Encryption
 Buckets access logs
 Lifecycle Management
Policies
 Access Control Lists
(ACLs)
 Versioning & MFA
deletes
 Certifications – HIPAA,
PCI, SOC 1/2/3 etc.
3: Implement the Right Security Controls
AWS Snowball & Snowmobile
• Accelerate PBs with AWS-provided
appliances
• 50, 80, 100 TB models
• 100PB Snowmobile
AWS Storage Gateway
• Instant hybrid cloud
• Up to 120 MB/s cloud upload rate
(4x improvement), and
4: Choose the Right Ingestion Methods
Amazon Kinesis Firehose
• Ingest device streams directly into
AWS data stores
AWS Direct Connect
• COLO to AWS
• Use native copy tools
Native/ISV Connectors
• Sqoop, Flume, DistCp
• Commvault, Veritas, etc
Amazon S3 Transfer Acceleration
• Move data up to 300% faster
using AWS’s private network
5: Catalog Your Data
S3
Put data in S3
Amazon
DynamoDB
Amazon
Elasticsearch Service
Metadata
What is in the data lake?
Documents the data lake
Summary statistics
Classification
Data
Sources
Search
capabilities
Glue Coming Mid-year
https://aws.amazon.com/answers/big-data/data-lake-solution/
Glue automates the undifferentiated heavy-lifting of ETL
 Cataloging data sources
 Identifying data formats and data types
 Generating Extract, Transform, Load code
 Executing ETL jobs; managing dependencies
 Handling errors
 Managing and scaling resources
Amazon Glue – in Preview
S3 Standard S3 Standard - Infrequent
Access
Amazon Glacier
Active data Archive dataInfrequently accessed data
Milliseconds Minutes to HoursMilliseconds
$0.021/GB/mo $0.004/GB/mo$0.0125/GB/mo
6: Keep More Data
7: Use Athena for Ad Hoc Data Exploration
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
Athena is Serverless
• No Infrastructure or
administration
• Zero Spin up time
• Transparent upgrades
Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
8: Use the Right Data Formats
• Pay by the amount of data scanned per query
• Use Compressed Columnar Formats
• Parquet
• ORC
• Easy to integrate with wide variety of tools
Dataset Size on Amazon S3 Query Run time Data Scanned Cost
Logs stored as Text
files
1 TB 237 seconds 1.15TB $5.75
Logs stored in
Apache Parquet
format*
130 GB 5.13 seconds 2.69 GB $0.013
Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
9: Choose the Right Tools
Amazon Redshift
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
Amazon Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon Quicksight
Business Intelligence/Visualization
Amazon ElasticSearch Service
ElasticSearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition & Amazon Polly
Image Recognition & Text-to-Speech AI APIs
Amazon Lex
Voice or Text Chatbots
A Sample Data Lake Pipeline
Ad-hoc access to data using Athena
Athena can query
aggregated datasets as well
Amazon S3
Data Lake
Amazon Kinesis
Streams & Firehose
Hadoop / Spark
Streaming Analytics Tools
Amazon Redshift
Data Warehouse
Amazon DynamoDB
NoSQL Database
AWS Lambda
Spark Streaming
on EMR
Amazon
Elasticsearch Service
Relational Database
Amazon EMR
Amazon Aurora
Amazon Machine Learning
Predictive Analytics
Any Open Source Tool
of Choice on EC2
AWS Data Lake
Analytic
Capabilities
Data Science Sandbox
Visualization /
Reporting
Apache Storm
on EMR
Apache Flink
on EMR
Amazon Kinesis
Analytics
Serving Tier
Clusterless SQL Query
Amazon Athena
DataSourcesTransactionalData
Amazon Glue
Clusterless ETL
Amazon ElastiCache
Redis
Use S3 as the storage repository for your data lake, instead
of a Hadoop cluster or data warehouse
Decoupled storage and compute is cheaper and more efficient
to operate
Decoupled storage and compute allow us to evolve to
clusterless architectures like Athena
Do not build data silos in Hadoop or the Enterprise DW
Gain flexibility to use all the analytics tools in the ecosystem
around S3 & future proof the architecture
10: Evolve as Needed

More Related Content

What's hot

(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...Amazon Web Services
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAmazon Web Services
 
Analysing All Your Streaming Data - Level 300
Analysing All Your Streaming Data - Level 300Analysing All Your Streaming Data - Level 300
Analysing All Your Streaming Data - Level 300Amazon Web Services
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Amazon Web Services
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWSAmazon Web Services
 
Visualizing Big Data Insights with Amazon QuickSight
Visualizing Big Data Insights with Amazon QuickSightVisualizing Big Data Insights with Amazon QuickSight
Visualizing Big Data Insights with Amazon QuickSightAmazon Web Services
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Amazon Web Services
 
(BDT402) Delivering Business Agility Using AWS
(BDT402) Delivering Business Agility Using AWS(BDT402) Delivering Business Agility Using AWS
(BDT402) Delivering Business Agility Using AWSAmazon Web Services
 
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...Amazon Web Services
 
Optimizing Storage for Big Data Analytics Workloads
Optimizing Storage for Big Data Analytics WorkloadsOptimizing Storage for Big Data Analytics Workloads
Optimizing Storage for Big Data Analytics WorkloadsAmazon Web Services
 
BDA309 Building Your Data Lake on AWS
BDA309 Building Your Data Lake on AWSBDA309 Building Your Data Lake on AWS
BDA309 Building Your Data Lake on AWSAmazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 

What's hot (20)

2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Analysing All Your Streaming Data - Level 300
Analysing All Your Streaming Data - Level 300Analysing All Your Streaming Data - Level 300
Analysing All Your Streaming Data - Level 300
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
 
Visualizing Big Data Insights with Amazon QuickSight
Visualizing Big Data Insights with Amazon QuickSightVisualizing Big Data Insights with Amazon QuickSight
Visualizing Big Data Insights with Amazon QuickSight
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
(BDT402) Delivering Business Agility Using AWS
(BDT402) Delivering Business Agility Using AWS(BDT402) Delivering Business Agility Using AWS
(BDT402) Delivering Business Agility Using AWS
 
Amazon Kinesis Data Streams
Amazon Kinesis Data StreamsAmazon Kinesis Data Streams
Amazon Kinesis Data Streams
 
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 
Optimizing Storage for Big Data Analytics Workloads
Optimizing Storage for Big Data Analytics WorkloadsOptimizing Storage for Big Data Analytics Workloads
Optimizing Storage for Big Data Analytics Workloads
 
BDA309 Building Your Data Lake on AWS
BDA309 Building Your Data Lake on AWSBDA309 Building Your Data Lake on AWS
BDA309 Building Your Data Lake on AWS
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 

Similar to Fast Track to Your Data Lake on AWS

Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSAmazon Web Services
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...Amazon Web Services
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Amazon Web Services
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...Amazon Web Services
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Amazon Web Services
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
 
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsAmazon Web Services
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoDatabase and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoAmazon Web Services
 
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...Amazon Web Services
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Amazon Web Services
 

Similar to Fast Track to Your Data Lake on AWS (20)

Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWS
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoDatabase and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
 
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lodhisaajjda
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxraffaeleoman
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Baileyhlharris
 
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...David Celestin
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...amilabibi1
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfMahamudul Hasan
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar TrainingKylaCullinane
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfSkillCertProExams
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoKayode Fayemi
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfSenaatti-kiinteistöt
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaKayode Fayemi
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalFabian de Rijk
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatmentnswingard
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIINhPhngng3
 

Recently uploaded (15)

lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 

Fast Track to Your Data Lake on AWS

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. March 16, 2017 Fast Track to Your Data Lake on AWS John Mallory, Business Development
  • 2. Data has gravity …easier to move processing to the data 4k/8k Genomics Seismic Financial Logs IoT
  • 4. Challenges with Legacy Data Architectures • Can’t move data across silos • Can’t afford to keep all of the data • Can’t scale with dynamic data and real-time processing • Can’t scale management of data • Can’t find the people who know how to configure and manage complex infrastructure • Can’t afford the investments to keep refreshing infrastructure and data centers
  • 5. Enter Data Lake Architectures Data Lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data. Benefits of a Data Lake • All Data in One Place • Quick Ingest • Storage vs Compute • Schema on Read
  • 6. 1&2: Consolidate (Data) & Separate (Storage & Compute) •S3 as the data lake storage tier; not a single analytics tool like Hadoop or a data warehouse •Decoupled storage and compute is cheaper and more efficient to operate •Decoupled storage and compute allow us to evolve to clusterless architectures (i.e. Lambda, Athena & Glue) •Do not build data silos in Hadoop or the EDW •Gain flexibility to use all the analytics tools in the ecosystem around S3 & future proof the architecture
  • 7. Designed for 11 9s of durability • Multiple Encryption Options • Robust/Highly Flexible Access Controls Durable Secure High performance  Multiple upload  Range GET  Scalable Throughput  Store as much as you need  Scale storage and compute independently  Scale without limits  Affordable Scalable  Amazon EMR  Amazon Redshift  Amazon DynamoDB  Amazon Athena  Amazon Rekognition  Amazon Glue Integrated  Simple REST API  AWS SDKs  Read-after-create consistency  Event notification  Lifecycle policies  Simple Management Tools  Hadoop compatibility Easy to use Why Choose Amazon S3 for data lake?
  • 8. “For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” - Steve Randich, CIO Case Study: Re-architecting Compliance What FINRA needed • Infrastructure for its market surveillance platform • Support of analysis and storage of approximately 75 billion market events every day • Store 5PB of historical data for analysis & training Why they chose AWS • Fulfillment of FINRA’s security requirements • Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3 Benefits realized • Increased agility, speed, and cost savings • Estimated savings of $10-20m annually by using AWS
  • 9. Encryption ComplianceSecurity  Identity and Access Management (IAM) policies  Bucket policies  Access Control Lists (ACLs)  Private VPC endpoints to Amazon S3  SSL endpoints  Server Side Encryption (SSE-S3)  S3 Server Side Encryption with provided keys (SSE-C, SSE-KMS)  Client-side Encryption  Buckets access logs  Lifecycle Management Policies  Access Control Lists (ACLs)  Versioning & MFA deletes  Certifications – HIPAA, PCI, SOC 1/2/3 etc. 3: Implement the Right Security Controls
  • 10. AWS Snowball & Snowmobile • Accelerate PBs with AWS-provided appliances • 50, 80, 100 TB models • 100PB Snowmobile AWS Storage Gateway • Instant hybrid cloud • Up to 120 MB/s cloud upload rate (4x improvement), and 4: Choose the Right Ingestion Methods Amazon Kinesis Firehose • Ingest device streams directly into AWS data stores AWS Direct Connect • COLO to AWS • Use native copy tools Native/ISV Connectors • Sqoop, Flume, DistCp • Commvault, Veritas, etc Amazon S3 Transfer Acceleration • Move data up to 300% faster using AWS’s private network
  • 11. 5: Catalog Your Data S3 Put data in S3 Amazon DynamoDB Amazon Elasticsearch Service Metadata What is in the data lake? Documents the data lake Summary statistics Classification Data Sources Search capabilities Glue Coming Mid-year https://aws.amazon.com/answers/big-data/data-lake-solution/
  • 12. Glue automates the undifferentiated heavy-lifting of ETL  Cataloging data sources  Identifying data formats and data types  Generating Extract, Transform, Load code  Executing ETL jobs; managing dependencies  Handling errors  Managing and scaling resources Amazon Glue – in Preview
  • 13. S3 Standard S3 Standard - Infrequent Access Amazon Glacier Active data Archive dataInfrequently accessed data Milliseconds Minutes to HoursMilliseconds $0.021/GB/mo $0.004/GB/mo$0.0125/GB/mo 6: Keep More Data
  • 14. 7: Use Athena for Ad Hoc Data Exploration Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using Standard SQL
  • 15. Athena is Serverless • No Infrastructure or administration • Zero Spin up time • Transparent upgrades
  • 16. Query Data Directly from Amazon S3 • No loading of data • Query data in its raw format • Athena supports multiple data formats • Text, CSV, TSV, JSON, weblogs, AWS service logs • Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data directly from Amazon S3
  • 17. 8: Use the Right Data Formats • Pay by the amount of data scanned per query • Use Compressed Columnar Formats • Parquet • ORC • Easy to integrate with wide variety of tools Dataset Size on Amazon S3 Query Run time Data Scanned Cost Logs stored as Text files 1 TB 237 seconds 1.15TB $5.75 Logs stored in Apache Parquet format* 130 GB 5.13 seconds 2.69 GB $0.013 Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
  • 18. 9: Choose the Right Tools Amazon Redshift Enterprise Data Warehouse Amazon EMR Hadoop/Spark Amazon Athena Clusterless SQL Amazon Glue Clusterless ETL Amazon Aurora Managed Relational Database Amazon Machine Learning Predictive Analytics Amazon Quicksight Business Intelligence/Visualization Amazon ElasticSearch Service ElasticSearch Amazon ElastiCache Redis In-memory Datastore Amazon DynamoDB Managed NoSQL Database Amazon Rekognition & Amazon Polly Image Recognition & Text-to-Speech AI APIs Amazon Lex Voice or Text Chatbots
  • 19. A Sample Data Lake Pipeline Ad-hoc access to data using Athena Athena can query aggregated datasets as well
  • 20. Amazon S3 Data Lake Amazon Kinesis Streams & Firehose Hadoop / Spark Streaming Analytics Tools Amazon Redshift Data Warehouse Amazon DynamoDB NoSQL Database AWS Lambda Spark Streaming on EMR Amazon Elasticsearch Service Relational Database Amazon EMR Amazon Aurora Amazon Machine Learning Predictive Analytics Any Open Source Tool of Choice on EC2 AWS Data Lake Analytic Capabilities Data Science Sandbox Visualization / Reporting Apache Storm on EMR Apache Flink on EMR Amazon Kinesis Analytics Serving Tier Clusterless SQL Query Amazon Athena DataSourcesTransactionalData Amazon Glue Clusterless ETL Amazon ElastiCache Redis
  • 21. Use S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse Decoupled storage and compute is cheaper and more efficient to operate Decoupled storage and compute allow us to evolve to clusterless architectures like Athena Do not build data silos in Hadoop or the Enterprise DW Gain flexibility to use all the analytics tools in the ecosystem around S3 & future proof the architecture 10: Evolve as Needed

Editor's Notes

  1. As content quality improves and the need to suppprt multiple ways of viewing it prolifirate, we are facing the challenge of content gravity. While it’s relatively easy to process the media, it’s becoming exceedingly difficult to move it around and store it. For example moving from HD to 4K and eventually 8K content may result in an increase of storage footprint on the order of 10x or more. Storage is not the only challenge here, as the contnent weighs more it’s more difficult to quickly and cost effectively transfer it to affiliates and partners in the supply chain. The conclusion is that you should strive to keep the data as close as possible to sufficient processing resources.
  2. The native features of S3 are exactly what you want from a Data Lake Replication across AZ’s for high availability and durability Massively parallel and scalable Storage scales independent of compute Low storage cost at < $0.025/GB This is nearly impossible to achieve with a fixed database cluster
  3. SUGGESTED TALKING POINTS: The Financial Industry Regulatory Authority (FINRA), one of the largest independent securities regulators in the U.S., was established to help watch and regulate financial trading practices. To respond to rapidly changing market dynamics, FINRA moved its market surveillance platform to AWS to analyze and store approximately 75 billion market events every day. FINRA selected AWS because it offered the right services while fulfilling the company’s security requirements. By using dynamic clusters (Hadoop, Hive, and HBase), and services such as Amazon EMR and Amazon S3, FINRA was able to create a flexible platform that could adapt to changing market dynamics. By using AWS, FINRA has been able to increase agility, speed and cost savings while allowing them to operate at scale. The company estimates it will save $10 to $20 million annually by using AWS.
  4. AWS has a broad set of capabilities that make security easy With all your data in S3 you have a variety of encryption options Client Side Server Side Encryption with KMS Keys You can extend encryption to a 3rd party provider We integrate with HSM as well IAM offers the ability to create users and roles for those users which can restrict access to only those capabilities you allow You can set S3 bucket policies for IAM users S3 has a private VPC endpoint so you don’t need to exit your VPC via a NAT gateway And you have native features such as setting Lifecycle policies for your S3 data as well as bucket access logs.  
  5. EBS: Raised max throughput to 320 MB/sec (PIOPS) and 160 MB/sec (GP2), plus larger & faster ssd volumes (raised max vol size from 1 TB to 16 TB) Snowball: Physical storage device by AWS to accelerate PB-scale data transfer with AWS-provided appliances Kinesis Firehose: Ingest data streams directly into AWS data stores (S3 and Redshift). You can use Amazon Kinesis to ingest data from hundreds of thousands of sensors processing hundreds of terabytes of data per hour. Zero administration: Capture and deliver streaming data into S3, Redshift, and other destinations without writing any applications or managing infrastructure. Direct-to-Data Store Integration Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 secs using simple configurations. Seamless Elasticity: Seamlessly scales to match data throughput without intervention.
  6. Show me all my customer data Search important – how to discover what is there, where it is,etc (Glue will replace later) Is this one step too far? (benefits of an AWS data lake slide, data governance what it is interms of index,catalog, and manage your data rather than nuts and bolts of data catalog. Use topic of data governance itself ElasticSearch is also used for querying the data lake itself - load processed data into Elasticsearch (integrated with Hadoop workflow in a data lake?) ask Bob Taylor about integrating index search element into Hadoop -
  7. Across the board, we provide 3 storage options with 3 different performance characteristics and price points. On the left, we have S3 Standard which is our high performance object storage for the internet, designed for very active, hot workloads. Data in S3 Standard is available in milliseconds and costs $0.03/GB/month (starting at). On the right hand side, we have Glacier, our cold storage service designed for long term archival and infrequently accessed data. Data in Glacier has a 3-5 hour access latency and Glacier costs $0.007/GB/month (starting at). Between the hot and cold options, we have a “warm” option – S3 infrequent access designed for data you plan to access maybe a few times a year or what we think of as “active archive”. S3-IA costs $0.0125/GB/mo (starting at). From an archiving perspective, customers typically use S3IA and Glacier together. Just a quick note terminology – S3 stores data in buckets and each piece of data is an object; Glacier stores data in vaults (equivalent of S3 buckets) and each piece of data is called an archive (similar to object). You will hear me use bucket/vault/object/archive later on.
  8. You simply put your Data in S3 and submit SQL against it
  9. For a datalake, Athena won’t be the only application reading the data. ORC and Parquet were chosen because they are open source and are available for use with other analytics tools. You can use a few lines of Pyspark code, running on Amazon EMR, to convert your files to Parquet for the best performance and cost When you create a table for Athena, you are essentially just creating metadata and, as you run queries, the schema is applied to the data. Data is streamed to Athena from S3, it is not copies and there is no ETL. This makes Athena ideal for customers using S3 as Data Lake extraction, transformation, and load No loading of data required. Query data where it lives.
  10. Quicksight can talk to Athena using a JDBC driver