Submit Search
Upload
Building Data Lakes with AWS
•
17 likes
•
3,125 views
Amazon Web Services
Follow
by John Mallory, Storage Business Development, AWS
Read less
Read more
Report
Share
Report
Share
1 of 55
Recommended
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Amazon Web Services
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Amazon Web Services
Introduction to AWS Glue
Introduction to AWS Glue
Amazon Web Services
Getting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCache
Amazon Web Services
AWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
The delta architecture
The delta architecture
Prakash Chockalingam
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
Amazon Web Services
Visualizing Big Data Insights with Amazon QuickSight
Visualizing Big Data Insights with Amazon QuickSight
Amazon Web Services
Recommended
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Amazon Web Services
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Amazon Web Services
Introduction to AWS Glue
Introduction to AWS Glue
Amazon Web Services
Getting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCache
Amazon Web Services
AWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
The delta architecture
The delta architecture
Prakash Chockalingam
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
Amazon Web Services
Visualizing Big Data Insights with Amazon QuickSight
Visualizing Big Data Insights with Amazon QuickSight
Amazon Web Services
Introduction to Amazon Redshift
Introduction to Amazon Redshift
Amazon Web Services
Visualization with Amazon QuickSight
Visualization with Amazon QuickSight
Amazon Web Services
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
Amazon Web Services
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
Amazon Web Services
Amazon ElastiCache and Redis
Amazon ElastiCache and Redis
Amazon Web Services
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Amazon Web Services
Building a Data Lake on AWS
Building a Data Lake on AWS
Gary Stafford
Introduction to Amazon Athena
Introduction to Amazon Athena
Amazon Web Services
Amazon QuickSight
Amazon QuickSight
Amazon Web Services
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
Amazon Web Services
Building a Data Lake on S3 for IoT Workloads
Building a Data Lake on S3 for IoT Workloads
Amazon Web Services
AWS RDS
AWS RDS
Mahesh Raj
Snowflake Architecture.pptx
Snowflake Architecture.pptx
chennakesava44
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration Service
Amazon Web Services
Introduction to Amazon Aurora
Introduction to Amazon Aurora
Amazon Web Services
Amazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
Amazon Web Services
A tour of Amazon Redshift
A tour of Amazon Redshift
Kel Graham
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
Amazon Web Services
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Amazon Web Services
Amazon Redshift
Amazon Redshift
Amazon Web Services
Securing Your Big Data on AWS
Securing Your Big Data on AWS
Amazon Web Services
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
Amazon Web Services
More Related Content
What's hot
Introduction to Amazon Redshift
Introduction to Amazon Redshift
Amazon Web Services
Visualization with Amazon QuickSight
Visualization with Amazon QuickSight
Amazon Web Services
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
Amazon Web Services
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
Amazon Web Services
Amazon ElastiCache and Redis
Amazon ElastiCache and Redis
Amazon Web Services
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Amazon Web Services
Building a Data Lake on AWS
Building a Data Lake on AWS
Gary Stafford
Introduction to Amazon Athena
Introduction to Amazon Athena
Amazon Web Services
Amazon QuickSight
Amazon QuickSight
Amazon Web Services
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
Amazon Web Services
Building a Data Lake on S3 for IoT Workloads
Building a Data Lake on S3 for IoT Workloads
Amazon Web Services
AWS RDS
AWS RDS
Mahesh Raj
Snowflake Architecture.pptx
Snowflake Architecture.pptx
chennakesava44
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration Service
Amazon Web Services
Introduction to Amazon Aurora
Introduction to Amazon Aurora
Amazon Web Services
Amazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
Amazon Web Services
A tour of Amazon Redshift
A tour of Amazon Redshift
Kel Graham
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
Amazon Web Services
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Amazon Web Services
Amazon Redshift
Amazon Redshift
Amazon Web Services
What's hot
(20)
Introduction to Amazon Redshift
Introduction to Amazon Redshift
Visualization with Amazon QuickSight
Visualization with Amazon QuickSight
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
Amazon ElastiCache and Redis
Amazon ElastiCache and Redis
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Building a Data Lake on AWS
Building a Data Lake on AWS
Introduction to Amazon Athena
Introduction to Amazon Athena
Amazon QuickSight
Amazon QuickSight
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
Building a Data Lake on S3 for IoT Workloads
Building a Data Lake on S3 for IoT Workloads
AWS RDS
AWS RDS
Snowflake Architecture.pptx
Snowflake Architecture.pptx
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration Service
Introduction to Amazon Aurora
Introduction to Amazon Aurora
Amazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
A tour of Amazon Redshift
A tour of Amazon Redshift
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Amazon Redshift
Amazon Redshift
Similar to Building Data Lakes with AWS
Securing Your Big Data on AWS
Securing Your Big Data on AWS
Amazon Web Services
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
Amazon Web Services
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
Amazon Web Services
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
Amazon Web Services
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data Oceans
Amazon Web Services
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
AWS Riyadh User Group
Amazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage Overview
Amazon Web Services
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWS
Amazon Web Services
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Amazon Web Services
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Amazon Web Services
21st Century Analytics with Zopa
21st Century Analytics with Zopa
Amazon Web Services
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
Amazon Web Services
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Amazon Web Services
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Amazon Web Services
Using AWS CloudTrail to Enhance Governance and Compliance of Amazon S3 - DEV3...
Using AWS CloudTrail to Enhance Governance and Compliance of Amazon S3 - DEV3...
Amazon Web Services
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Amazon Web Services
Construindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWS
Amazon Web Services LATAM
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
SasikumarPalanivel3
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
saidbilgen
Similar to Building Data Lakes with AWS
(20)
Securing Your Big Data on AWS
Securing Your Big Data on AWS
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data Oceans
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Cutting to the chase for Machine Learning Analytics Ecosystem & AWS Lake Form...
Amazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage Overview
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWS
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
21st Century Analytics with Zopa
21st Century Analytics with Zopa
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Using AWS CloudTrail to Enhance Governance and Compliance of Amazon S3 - DEV3...
Using AWS CloudTrail to Enhance Governance and Compliance of Amazon S3 - DEV3...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Construindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWS
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
More from Amazon Web Services
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
Open banking as a service
Open banking as a service
Amazon Web Services
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
Computer Vision con AWS
Computer Vision con AWS
Amazon Web Services
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
Tools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
How to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
Building a web application without servers
Building a web application without servers
Amazon Web Services
Fundraising Essentials
Fundraising Essentials
Amazon Web Services
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
More from Amazon Web Services
(20)
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Open banking as a service
Open banking as a service
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Computer Vision con AWS
Computer Vision con AWS
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Tools for building your MVP on AWS
Tools for building your MVP on AWS
How to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Building a web application without servers
Building a web application without servers
Fundraising Essentials
Fundraising Essentials
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Building Data Lakes with AWS
1.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Pop-up Loft Building Data
Lakes with AWS John Mallory Storage BD
2.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Rethink how to
become a data-driven business • Business outcomes - start with the insights and actions you want to drive, then work backwards to a streamlined design • Experimentation - start small, test many ideas, keep the good ones and scale those up, paying only for what you consume • Agile and timely - deploy data processing infrastructure in minutes, not months. take advantage of a rich platform of services to respond quickly to changing business needs
3.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Finding Value in
Data is a Journey Business Monitoring Business Insights New Business Opportunity Business Optimization Business Transformation Evolving Tools and Infrastructure
4.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Often Undertaken with
Silos of Tools and Data Hadoop Spark NoSQL Storage Arrays Databases Data Warehouse Structured Data SQL Raw Data ETL Advanced Analytics ETL
5.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Legacy Data Warehouses
& RDBMS • Complex to setup and manage • Do not scale • Takes months to add new data sources • Queries take too long • Cost $MM upfront
6.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved This Leads to
Friction & Pain • Challenging to move data across silos • Forced to keep multiple copies of data • Complex data transformation & governance • Users struggle to find data they need • Slows innovation and evolution • Expensive
7.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Enter the Data
Lake Architecture Data Lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data. Benefits of a Data Lake • All Data in One Place • Quick Ingest & Transformation • Bring Functionality to the Data • Schema on Read
8.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Consideration 1 –
S3 for the Data Lake
9.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Consolidate Data /
Separate Storage & Compute • Amazon S3 as the data lake storage tier; not a single analytics tool like Hadoop or a data warehouse • Decoupled storage and compute is cheaper and more efficient to operate • Decoupled storage and compute allow us to evolve to clusterless architectures (i.e. AWS Lambda, Amazon Athena, Redshift Spectrum, AWS Glue, Amazon Macie) • Do not build data silos in Hadoop or an EDW • Gain the flexibility to use all the analytics tools in the ecosystem around S3 & future proof the architecture
10.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved An AWS Data
Lake Architecture AWS Glue ETL & Data Catalog Serverless Compute AWS Lambda Trigger-based Code Execution Amazon Redshift Spectrum Fast @ Exabyte scale Amazon Athena Interactive Query Data Processing Amazon EMR Managed Hadoop Applications Amazon Redshift Petabyte-scale Data Warehousing Storage Amazon S3 Exabyte-scale Object Storage AWS Glue Data Catalog Hive-compatible Metastore
11.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved • Nasdaq implements
an S3 data lake + Redshift data warehouse architecture • Most recent two years of data is kept in the Redshift data warehouse and snapshotted into S3 for disaster recovery • Data between two and five years old is kept in S3 • Presto on EMR is used to ad-hoc query data in S3 • Transitioned from an on-premises data warehouse to Amazon Redshift & S3 data lake architecture • Over 1,000 tables migrated • Average daily ingest of over 7B rows • Migrated off legacy DW to AWS (start to finish) in 7 man-months • AWS costs were 43% of legacy budget for the same data set (~1100 tables)
12.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Nasdaq uses Presto
on Amazon EMR and Amazon Redshift as a tiered data lake Full Presentation: https://www.youtube.com/watch?v=LuHxnOQarXU
13.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Designed for 11 9s of durability § Multiple Encryption Options § Robust/Highly Flexible Access Controls Durable
Secure High performance § Multiple upload § Range GET § Scalable Throughput § Amazon EMR § Amazon Redshift/Spectrum § Amazon DynamoDB § Amazon Athena § Amazon Rekognition § Amazon Glue IntegratedEasy to use § Simple REST API § AWS SDKs § Read-after-create consistency § Event notification § Lifecycle policies § Simple Management Tools § Hadoop compatibility Scalable § Store as much as you need § Scale storage and compute independently § Scale without limits § Affordable Why Choose Amazon S3 for Data Lake?
14.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Optimize Costs with
Data Tiering • Use HDFS for very frequently accessed (hot) data • Use Amazon S3 Standard for frequently accessed data • Use Amazon S3 Standard – IA for less frequently accessed data • Use Amazon Glacier for archiving cold data • Use Amazon S3 Analytics for storage class analysis New
15.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Encryption ComplianceSecurity § Identity and Access Management (IAM) policies §
Bucket policies § Access Control Lists (ACLs) § Private VPC endpoints to Amazon S3 § SSL endpoints § Server Side Encryption (SSE-S3) § S3 Server Side Encryption with provided keys (SSE- C, SSE-KMS) § Client-side Encryption § Buckets access logs § Lifecycle Management Policies § Access Control Lists (ACLs) § Versioning & MFA deletes § Certifications – HIPAA, PCI, SOC 1/2/3 etc. Implement the right security controls in S3
16.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Manage your data S3
object Tags Manage storage based on object tags • Classify your data • Tag your objects with key-value pairs • Write policies once based on the type of data Discoverability Lifecycle PolicyAccess Control
17.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Manage S3 Security { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject" ], "Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*" "Condition": {"StringEquals": {"S3:ResourceTag/HIPAA":"True"}} } ] } Manage permissions with tags
18.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Macie: A New Approach Amazon Macie Understand Your Data Natural Language Processing (NLP) Understand Data Access Machine Learning
19.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Amazon Macie Uses Machine Learning • Understand behavioral analytics to baseline normal behavior • Train and develop contextualized alerts by understanding the value of data being accessed •
Context for content
20.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Business Critical Data in Amazon S3 • Static website content • Source code •
SSL certificates, private keys • iOS and Android app signing keys • Database backups • OAuth and Cloud SAAS API Keys
21.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved
22.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Consideration 2 –
Ingest & Catalog
23.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved AWS Snowball &
Snowmobile • Accelerate PBs with AWS-provided appliances • 50, 80, 100 TB models • 100PB Snowmobile AWS Storage Gateway • Instant hybrid cloud • Up to 120 MB/s cloud upload rate (4x improvement), and Choose the Right Ingestion Methods Amazon Kinesis Firehose • Ingest device streams directly into AWS data stores AWS Direct Connect • COLO to AWS • Use native copy tools Native/ISV Connectors • Sqoop, Flume, DistCp • Commvault, Veritas, etc Amazon S3 Transfer Acceleration • Move data up to 300% faster using AWS’s private network
24.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Amazon Kinesis Firehose Load
massive volumes of streaming data into Amazon S3, Redshift and Elasticsearch • Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure. • Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 secs using simple configurations. • Seamless elasticity: Seamlessly scales to match data throughput w/o intervention • Serverless ETL using AWS Lambda - Firehose can invoke your Lambda function to transform incoming source data. Capture and submit streaming data Analyze streaming data using your favorite BI tools Firehose loads streaming data continuously into Amazon S3, Redshift and Elasticsearch
25.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Catalog Your Data S3 Put data in S3 Amazon DynamoDB Amazon Elasticsearch Service Extract metadata with Lambda Data Sources Search capabilities
26.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Catalog with AWS
Glue
27.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Glue Data Catalog Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark etc. We added a few extensions: § Search
over metadata for data discovery § Connection info – JDBC URLs, credentials § Classification for identifying and parsing files § Versioning of table metadata as schemas evolve and other metadata are updated Populate using Hive DDL, bulk import, or automatically through Crawlers.
28.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Data Catalog: Crawlers Automatically
discover new data, extracts schema definitions • Detect schema changes and version tables • Detect Hive style partitions on Amazon S3 Built-in classifiers for popular types; custom classifiers using Grok expressions Run ad hoc or on a schedule; serverless – only pay when crawler runs Crawlers automatically build your Data Catalog and keep it in sync
29.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved AWS Glue Data
Catalog Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable
30.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Consideration 3 –
Optimizing Performance
31.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Getting high Throughput
Performance with S3 • S3 can scale to many thousands of requests per second • Need a good key naming scheme • Only at scale do you need to consider your key naming scheme • What are Partitions? • Why? • Spread Keys Lexigraphically • Goal of Partitioning is too spread the heat • Prevent HotSpots
32.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Distributing key names •
Add randomness to the beginning of the key name… <my_bucket>/6213-2013_11_13.jpg <my_bucket>/4653-2013_11_13.jpg <my_bucket>/9873-2013_11_13.jpg <my_bucket>/4657-2013_11_13.jpg <my_bucket>/1256-2013_11_13.jpg <my_bucket>/8345-2013_11_13.jpg <my_bucket>/0321-2013_11_13.jpg <my_bucket>/5654-2013_11_13.jpg <my_bucket>/2345-2013_11_13.jpg <my_bucket>/7567-2013_11_13.jpg <my_bucket>/3455-2013_11_13.jpg <my_bucket>/4313-2013_11_13.jpg Partitions: <my_bucket>/0 <my_bucket>/1 <my_bucket>/2 <my_bucket>/3 <my_bucket>/4 <my_bucket>/5 <my_bucket>/6 <my_bucket>/7 <my_bucket>/8 <my_bucket>/9
33.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Data Recommendations for
EMR and S3 Performance Best Practices: • Reduce Number of S3 objects by aggregating small files into larger ones (s3distcp – group-by option) • Goal: Files >128MB • Use EMRFS with Consistent View • Parquet with Snappy compression is emerging as the best compression algorithm • Reverse partition scheme to HOUR, DAY, MONTH, YEAR
34.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Use the Right
Data Formats • Pay by the amount of data scanned per query • Use Compressed Columnar Formats • Parquet • ORC • Easy to integrate with wide variety of tools Dataset Size on Amazon S3 Query Run time Data Scanned Cost Logs stored as Text files 1 TB 237 seconds 1.15TB $5.75 Logs stored in Apache Parquet format* 130 GB 5.13 seconds 2.69 GB $0.013 Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
35.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Consideration 4 –
Query in Place
36.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved S3 Data Catalog AthenaEMR Redshift Spectrum Amazon ML / MXNet RDS QuickSight Kinesis Database Migration Service Glue Amazon Analytics
End to End Architecture IAM Other Sources
37.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Explore Your Data
Without ETL • Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using Standard SQL
38.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Athena is Serverless •
No Infrastructure or administration • Zero Spin up time • Transparent upgrades
39.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Query Data Directly
from Amazon S3 • No loading of data • Query data in its raw format • Athena supports multiple data formats • Text, CSV, TSV, JSON, weblogs, AWS service logs • Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data directly from Amazon S3
40.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Familiar Technologies Under
the Covers • Used for SQL Queries • In-memory distributed query engine • ANSI-SQL compatible with extensions • Used for DDL functionality • Complex data types • Multitude of formats • Supports data partitioning
41.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved What About ETL? Raw
Data Assets Transformed Into Usable Ones
42.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved ETL is the
most time-consuming part of analytics ETL Data Warehousing Business Intelligence 80% of time spent here Amazon Redshift Amazon QuickSight
43.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved AWS Glue Simple, flexible,
cost-effective ETL § AWS Glue is a fully managed ETL (extract, transform, and load) service § Categorize your data, clean it, enrich it and move it reliably between various data stores § Once catalogued, your data is immediately searchable and queryable across your data silos § Simple and cost-effective § Serverless; runs on a fully managed, scale-out Spark environment
44.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Build event-driven ETL
pipelines
45.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Relational data warehouse Massively
parallel; Petabyte scale Fully managed Supports Standard ANSI SQL High Performance Amazon Redshift a lot faster a lot simpler a lot cheaper Fully Managed Petabyte-scale Data Warehouse
46.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Amazon Redshift Spectrum Run
SQL queries directly against data in S3 using thousands of nodes Fast @ exabyte scale Elastic & highly available On-demand, pay-per-query High concurrency: Multiple clusters access same data No ETL: Query data in-place using open file formats Full Amazon Redshift SQL support S3 SQL
47.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Amazon EMR Real-time Analytics Amazon Kinesis KCL app AWS Lambda Spark Streaming Amazon SNS Amazon ML Notifications Amazon ElastiCache (Redis) Amazon DynamoDB Amazon RDS Amazon ES Alerts App state Real-time prediction KPI process store Stream Amazon Kinesis Analytics Amazon S3 Log Amazon KinesisFan out
48.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Case Study: Clickstream
Analysis Hearst Corporation monitors trending content for over 250 digital properties worldwide and processes more than 30TB of data per day, using an architecture that includes Amazon Kinesis and Spark running on Amazon EMR. Store → Process | Analyze → Answers
49.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Amazon Kinesis Amazon EMR Amazon EMR Amazon Redshift Elasticsearch Clickstream Hearst Corporation monitors trending content for over 250 digital properties worldwide and processes more than 30TB of data per day, using an architecture that includes Amazon Kinesis and Spark running on Amazon EMR.
50.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved & Batch Analytics Amazon S3 Amazon EMR Hive Pig Spark Amazon ML process store Consume Amazon Redshift Amazon EMR Presto Spark Batch Interactive Batch prediction Real-time prediction Stream Amazon Kinesis Firehose Amazon Athena Files Amazon Kinesis Analytics
51.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Ingest/ Collect Consume/ visualize Store Process/ analyze Data 1 4 0 9 5 Amazon S3 Data lake Amazon EMR Amazon Kinesis Amazon RedShift Answers & Insights Hot HomesUsers Properties Agents User
Profile Recommendation Hot Homes Similar Homes Agent Follow-up Agent Scorecard Marketing A/B Testing Real Time Data … Amazon DynamoDB BI / Reporting
52.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Choose the Right
Tools Amazon Redshift, Spectrum Enterprise Data Warehouse Amazon EMR Hadoop/Spark Amazon Athena Clusterless SQL Amazon Glue Clusterless ETL Amazon Aurora Managed Relational Database Amazon Machine Learning Predictive Analytics Amazon Quicksight Business Intelligence/Visualization Amazon ElasticSearch Service ElasticSearch Amazon ElastiCache Redis In-memory Datastore Amazon DynamoDB Managed NoSQL Database Amazon Rekognition & Amazon Polly Image Recognition & Text-to-Speech AI APIs Amazon Lex Voice or Text Chatbots
53.
Amazon S3 Data Lake Amazon
Kinesis Streams & Firehose Hadoop / Spark Streaming Analytics Tools Amazon Redshift Data Warehouse Amazon DynamoDB NoSQL Database AWS Lambda Spark Streaming on EMR Amazon Elasticsearch Service Relational Database Amazon EMR Amazon Aurora Amazon Machine Learning Predictive Analytics Any Open Source Tool of Choice on EC2 AWS Data Lake You Don’t Have to Choose Data Science Sandbox Visualization / Reporting Apache Storm on EMR Apache Flink on EMR Amazon Kinesis Analytics Serving Tier Clusterless SQL Query Amazon Athena DataSourcesTransactionalData Amazon Glue Clusterless ETL Amazon ElastiCache Redis
54.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved • Use S3
as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse • Decoupled storage and compute is cheaper and more efficient to operate • Decoupled storage and compute allow us to evolve to clusterless architectures like Athena • Do not build data silos in Hadoop or the Enterprise DW • Gain flexibility to use all the analytics tools in the ecosystem around S3 & future proof the architecture Evolve as Needed
55.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Pop-up Loft