© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Bhartia, Solutions Architect
Susan Chan, Senior Product Manager - Amazon S3
August 2016
Building a Data Lake with Amazon S3
Evolution of “Data Lakes”
Evolution of big data architecture
[Diagram: Databases, Transactions; Extract, transform and load (ETL); Data warehouse]
Evolution of big data architecture
[Diagram: Databases, Files; Transactions, Logs; ETL; Data warehouse]
Evolution of big data architecture
[Diagram: Databases, Files, Streams; Transactions, Logs, Events; ETL; Hadoop and unnamed ("?") components; Data warehouse]
A growing ecosystem…
Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Amazon CloudSearch, Amazon Kinesis-enabled app, AWS Lambda, Amazon Machine Learning, Amazon SQS, Amazon ElastiCache, Amazon DynamoDB Streams
The Genesis of “Data Lakes”
[Diagram: Databases, Files, Streams; Transactions, Logs, Events; Data warehouse; Data Lake]
What really is a “Data Lake”?
Components of a Data Lake
• Collect & Store: a foundation of highly durable data storage and streaming of any type of data
• Catalogue & Search: a search index and workflow which enables data discovery
• Entitlements: a robust set of security controls – governance through technology, not policy
• API & UI: an API and user interface that expose these features to internal and external users
Storage
High durability
Stores raw data from input sources
Support for any type of data
Low cost
Data Lake – Hadoop (HDFS) as the Storage
[Diagram: Search, Access, Query, Process, Archive]
Data Lake – Amazon S3 as the storage
[Diagram: Transactions, Search, Access, Query, Process, Archive; Amazon RDS, Amazon DynamoDB, Amazon Elasticsearch Service, Amazon Glacier, Amazon S3, Amazon Redshift, Amazon Elastic MapReduce, Amazon Machine Learning, Amazon ElastiCache]
Catalogue & search
• Metadata lake
• Used for summary statistics and data classification management
• Simplified model for data discovery & governance
Catalogue & Search Architecture
Entitlements
• Encryption for data protection
• Authentication & Authorisation
• Access control & restrictions
Data Protection via Encryption
AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services
AWS server side encryption
• AWS managed key infrastructure
Entitlements – Access to Encryption Keys
[Diagram: Customer Master Key; Customer Data Keys (Ciphertext Key, Plaintext Key); IAM Temporary Credential; Security Token Service; MyData stored in S3 as an S3 Object with metadata Name: MyData, Key: Ciphertext Key]
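The key hierarchy on this slide (a master key protecting per-object data keys, with the ciphertext key stored next to the object) is commonly called envelope encryption. The sketch below illustrates only that flow. It assumes a toy XOR keystream built from SHA-256 as a stand-in for a real cipher, so it is NOT real cryptography, and the function names `encrypt_object` and `decrypt_object` are hypothetical rather than part of any AWS API.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with a SHA-256 counter keystream.

    Stand-in for a real cipher such as AES; do NOT use this for real data.
    Applying it twice with the same key returns the original bytes.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + (offset // 32).to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


def encrypt_object(master_key: bytes, name: str, plaintext: bytes) -> dict:
    """Encrypt data with a fresh data key, then wrap that key with the master key."""
    plaintext_key = secrets.token_bytes(32)  # per-object data key
    return {
        "Name": name,
        "Key": _keystream_xor(master_key, plaintext_key),  # ciphertext key
        "Body": _keystream_xor(plaintext_key, plaintext),  # encrypted data
    }


def decrypt_object(master_key: bytes, stored: dict) -> bytes:
    """Unwrap the data key with the master key, then decrypt the body."""
    plaintext_key = _keystream_xor(master_key, stored["Key"])
    return _keystream_xor(plaintext_key, stored["Body"])


master = secrets.token_bytes(32)
stored = encrypt_object(master, "MyData", b"hello data lake")
recovered = decrypt_object(master, stored)
```

Only a caller entitled to use the master key can recover the plaintext key, which is why access to the key, not just to the object, decides who can read the data.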
API & UI
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
API & UI Architecture
[Diagram: Users, API Gateway, UI - Elastic Beanstalk, AWS Lambda, Metadata Index, IAM, TVM - Elastic Beanstalk]
Putting It All Together
[Diagram: Amazon Kinesis, Amazon S3, Amazon Glacier, IAM, Encrypted Data, Security Token Service, AWS Lambda, Search Index, Metadata Index, API Gateway, Users, UI - Elastic Beanstalk, KMS, arranged in four layers: Collect & Store; Catalogue & Search; Entitlements & Access Controls; APIs & UI]
Amazon S3 - Foundation for your Data Lake
Why Amazon S3 for Data Lake?
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multipart upload
• Range GET
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon Elastic MapReduce
• Amazon Redshift
• Amazon DynamoDB
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event Notification
• Lifecycle policies
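Range GET lets a client fetch an object in independent byte ranges, which can then be downloaded in parallel. As a minimal sketch, the hypothetical helper below only computes the `Range` header values for a given object size and part size; issuing the requests is left out.

```python
from __future__ import annotations


def byte_ranges(object_size: int, part_size: int) -> list[str]:
    """Return HTTP Range header values that together cover the whole object."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


example = byte_ranges(10, 4)
```

Each value could be passed as the `Range` header of a GET request (for example, the `Range` parameter of `get_object` in boto3), and the parts concatenated in order.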
Why Amazon S3 for Data Lake?
• Natively supported by frameworks such as Spark, Hive, and Presto
• Can run transient Hadoop clusters
• Multiple clusters can use the same data
• Highly durable, available, and scalable
• Low cost: S3 Standard starts at $0.0275 per GB per month
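To make the storage price concrete, here is a rough back-of-the-envelope calculation. It assumes the flat $0.0275 per GB per month figure quoted on the slide and ignores request charges, data transfer, and any volume tiers, so treat the result as an estimate only.

```python
PRICE_PER_GB_MONTH = 0.0275  # USD, the "starts at" figure quoted on the slide


def monthly_storage_cost(stored_gb: float,
                         price_per_gb: float = PRICE_PER_GB_MONTH) -> float:
    """Estimated monthly storage cost in USD for a given volume in GB."""
    if stored_gb < 0:
        raise ValueError("stored_gb must be non-negative")
    return stored_gb * price_per_gb


one_tb = monthly_storage_cost(1024)        # 1 TB expressed as 1024 GB
fifty_tb = monthly_storage_cost(50 * 1024)
```

Under these assumptions 1 TB costs about $28.16 per month and 50 TB about $1,408 per month.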
Data Ingestion into Amazon S3
• AWS Direct Connect
• AWS Snowball
• ISV Connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway
Choice of storage classes on S3
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
Implement the right controls
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• SSL endpoints
Encryption
• Server Side Encryption (SSE-S3)
• Server Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side Encryption
Compliance
• Bucket access logs
• Lifecycle Management Policies
• Access Control Lists (ACLs)
• Versioning & MFA Delete
• Certifications – HIPAA, PCI, SOC 1/2/3 etc.
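As one illustration of how bucket policies, SSL endpoints, and server-side encryption can be combined, the sketch below builds a bucket policy document that denies non-TLS requests and denies uploads that do not request server-side encryption. The bucket name and the function name are hypothetical, and the policy is an example rather than a recommended baseline.

```python
import json


def encryption_and_tls_policy(bucket: str) -> dict:
    """Build a bucket policy that requires TLS and an SSE header on uploads."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that did not arrive over SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that omit the server-side encryption header.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
        ],
    }


# "example-data-lake" is a placeholder bucket name.
policy = encryption_and_tls_policy("example-data-lake")
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON string is the kind of document that can be attached to a bucket as its bucket policy.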
Use Case
“We use S3 as the ‘source of truth’ for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.”
Eva Tse, Director, Big Data Platform
Source: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
Tip #1: Use versioning
• Protects from accidental overwrites and deletes
• New version with every upload
• Easy retrieval of deleted objects and roll back to previous versions
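The behaviour described for versioning can be pictured with a small in-memory model. This is a toy sketch, not the S3 API: the class `VersionedBucket` and its methods are hypothetical and only mimic how uploads stack new versions and how a delete adds a marker instead of removing data.

```python
class VersionedBucket:
    """In-memory toy model of a versioning-enabled bucket (not the S3 API)."""

    DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}  # key -> list of versions, oldest first

    def put(self, key, body):
        # Every upload becomes a new current version; older ones are kept.
        self._versions.setdefault(key, []).append(body)

    def delete(self, key):
        # A plain delete removes nothing; it stacks a delete marker on top.
        self._versions.setdefault(key, []).append(self.DELETE_MARKER)

    def get(self, key):
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is self.DELETE_MARKER:
            raise KeyError(key)
        return versions[-1]

    def roll_back(self, key):
        # Drop the current version (or delete marker) to expose the previous one.
        self._versions[key].pop()


bucket = VersionedBucket()
bucket.put("report.csv", b"v1")
bucket.put("report.csv", b"v2")   # accidental overwrite
bucket.delete("report.csv")       # accidental delete
bucket.roll_back("report.csv")    # remove the delete marker
after_undelete = bucket.get("report.csv")
bucket.roll_back("report.csv")    # discard the overwrite
after_rollback = bucket.get("report.csv")
```

In S3 itself, the equivalent of `roll_back` is deleting a specific version (or delete marker) by its version ID.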
Tip #2: Use lifecycle policies
• Automatic tiering and cost controls
• Includes two possible actions:
  • Transition: moves objects to Standard - IA or Amazon Glacier based on the object age you specify
  • Expiration: deletes objects after a specified time
• Actions can be combined
• Set policies at the bucket or prefix level
• Set policies for current or noncurrent versions
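A lifecycle policy that combines both actions can be written as a plain configuration document. The sketch below builds one rule that transitions objects under a prefix to Standard - IA, then to Amazon Glacier, then expires them, and also expires noncurrent versions. The prefix, rule ID, and day counts are illustrative assumptions, and the helper name `tiering_rule` is hypothetical.

```python
def tiering_rule(rule_id: str, prefix: str, ia_days: int, glacier_days: int,
                 expire_days: int, noncurrent_days: int) -> dict:
    """Build one lifecycle rule combining transition and expiration actions."""
    if not 0 < ia_days < glacier_days < expire_days:
        raise ValueError("expected 0 < ia_days < glacier_days < expire_days")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # prefix-level rule
        "Status": "Enabled",
        # Transition: move current versions to colder storage as they age.
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
        # Expiration: delete current versions after expire_days.
        "Expiration": {"Days": expire_days},
        # Noncurrent versions get their own, separate expiration.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
    }


# Illustrative values only: logs/ tiered at 30 and 90 days, expired after a year.
lifecycle_configuration = {
    "Rules": [tiering_rule("logs-tiering", "logs/", 30, 90, 365, 30)]
}
```

A document of this shape is what boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument.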
Versioning + lifecycle policies
Expired object delete marker policy
• Deleting a versioned object makes a delete marker the current version of the object
• Removing expired object delete markers can improve list performance
• A lifecycle policy automatically removes the current-version delete marker when previous versions of the object no longer exist
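The rule for this policy is short. The sketch below shows a lifecycle rule that asks S3 to remove expired object delete markers across the whole bucket, plus a hypothetical helper that spells out what "expired" means here: the delete marker is the only version left. The rule ID and the helper's list representation are assumptions made for illustration.

```python
# Lifecycle rule: remove delete markers that have no remaining older versions.
# The Expiration element carries only the flag here, with no Days or Date.
EXPIRED_DELETE_MARKER_RULE = {
    "ID": "remove-expired-delete-markers",
    "Filter": {"Prefix": ""},  # empty prefix applies the rule to every key
    "Status": "Enabled",
    "Expiration": {"ExpiredObjectDeleteMarker": True},
}


def has_expired_delete_marker(versions: list) -> bool:
    """Check a key's version stack, given oldest first.

    Each entry is the string "object" or "delete-marker". The marker is
    expired only when it is the sole remaining version of the key.
    """
    return versions == ["delete-marker"]
```

A key whose stack is `["object", "delete-marker"]` is not affected until its older version has itself been removed, for example by a noncurrent version expiration.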
Enable policy with the console
[Console screenshot]
Incomplete multipart upload expiration policy (Best Practice)
• Partial uploads incur storage charges
• Set a lifecycle policy to automatically make incomplete multipart uploads expire after a predefined number of days
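As a sketch of this policy, the block below builds a lifecycle rule that aborts incomplete multipart uploads a given number of days after they were started. The seven-day value and the rule ID are illustrative assumptions. The hypothetical helper `uploads_past_deadline` only approximates which uploads such a rule would target; S3 applies lifecycle actions on its own schedule.

```python
from datetime import datetime, timedelta, timezone


def abort_incomplete_uploads_rule(days: int) -> dict:
    """Lifecycle rule that aborts multipart uploads left incomplete for `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ID": f"abort-incomplete-multipart-after-{days}d",
        "Filter": {"Prefix": ""},  # apply to the whole bucket
        "Status": "Enabled",
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }


def uploads_past_deadline(initiated: dict, days: int, now: datetime) -> list:
    """Return the IDs of uploads started at least `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return sorted(uid for uid, started in initiated.items() if started <= cutoff)


rule = abort_incomplete_uploads_rule(7)
now = datetime(2016, 8, 31, tzinfo=timezone.utc)
pending = {
    "upload-a": datetime(2016, 8, 1, tzinfo=timezone.utc),   # 30 days old
    "upload-b": datetime(2016, 8, 29, tzinfo=timezone.utc),  # 2 days old
}
stale = uploads_past_deadline(pending, 7, now)
```

With these sample dates only `upload-a` is past the seven-day deadline.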
Enable policy with the Management Console
Considerations for organizing your Data Lake
• Amazon S3 storage uses a flat keyspace
• Separate data by business unit, application, type, and time
• Natural data partitioning is very useful
• Paths should be self-documenting and intuitive
• Changing the prefix structure later is hard and costly
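One way to apply these considerations is to generate every object key from a single function, so the prefix structure is decided once and stays consistent. The sketch below assumes a layout of business unit, application, data type, then date partitions written in the `name=value` style that Hive-compatible tools recognise. The layout and all sample values are illustrative, not prescribed by the slides.

```python
from datetime import date


def object_key(business_unit: str, application: str, data_type: str,
               day: date, filename: str) -> str:
    """Build a self-documenting key: unit/application/type/date partitions/file."""
    return (
        f"{business_unit}/{application}/{data_type}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"
    )


key = object_key("marketing", "clickstream", "raw",
                 date(2016, 8, 5), "events.json.gz")
```

Because the keyspace is flat, the slashes are only a naming convention, but listing by prefix (for example `marketing/clickstream/raw/year=2016/`) then selects exactly one partition of the data.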
Best Practices for your Data Lake
• Always store a copy of raw input as the first rule of thumb
• Use automation with S3 Events to enable trigger-based workflows
• Use a format that supports your data, rather than forcing your data into a format
• Apply compression everywhere to reduce the network load
Thank you!

Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly Webinar Series
