Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly Webinar Series

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Bhartia, Solutions Architect
Susan Chan, Senior Product Manager - Amazon S3
August 2016
Building a Data Lake with Amazon S3

Evolution of big data architecture
[Diagram: Databases (Transactions) → Extract, transform and load (ETL) → Data warehouse]

[Diagram, continued: Databases (Transactions) and Files (Logs) → ETL → Data warehouse]

[Diagram, continued: sources are Databases (Transactions), Files (Logs), and Streams (Events); targets are the Data warehouse (via ETL) and "? Hadoop ?"]

A growing ecosystem…
[Diagram: AWS service icons: Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Amazon CloudSearch, Amazon Kinesis-enabled app, AWS Lambda, Amazon Machine Learning, Amazon SQS, Amazon ElastiCache, Amazon DynamoDB Streams]

The Genesis of “Data Lakes”
[Diagram: sources are Databases (Transactions), Files (Logs), and Streams (Events); targets are the Data warehouse and the Data Lake]

What really is a “Data Lake”?

Components of a Data Lake
• Collect & Store: A foundation of highly durable data storage and streaming of any type of data
• Catalogue & Search: A search index and workflow that enable data discovery
• Entitlements: A robust set of security controls – governance through technology, not policy
• API & UI: An API and user interface that expose these features to internal and external users

Storage
High durability
Stores raw data from input sources
Support for any type of data
Low cost

Data Lake – Hadoop (HDFS) as the Storage
[Diagram: functions around the HDFS storage layer: Search, Access, Query, Process, Archive]

Data Lake – Amazon S3 as the storage
[Diagram: Amazon S3 as the storage layer. Functions: Search, Access, Query, Process, Archive, Transactions. Services: Amazon RDS, Amazon DynamoDB, Amazon Elasticsearch Service, Amazon Glacier, Amazon Redshift, Amazon Elastic MapReduce, Amazon Machine Learning, Amazon ElastiCache]

Catalogue & search
Metadata lake
• Used for summary statistics and data classification management
• Simplified model for data discovery & governance

Catalogue & Search Architecture

Entitlements
Encryption for Data protection
Authentication & Authorisation
Access Control & restrictions

Data Protection via Encryption
AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services
AWS server side encryption
• AWS managed key infrastructure

Entitlements – Access to Encryption Keys
[Diagram: labels are Customer Master Key, Customer Data Keys, Ciphertext Key, Plaintext Key, IAM Temporary Credential, Security Token Service, MyData, S3, and an S3 Object with the fields "… Name: MyData, Key: Ciphertext Key …"]

API & UI
Exposes the data lake to customers
Programmatically query catalogue
Expose search API
Ensures that entitlements are respected

API & UI Architecture
[Diagram: Users, API Gateway, UI - Elastic Beanstalk, AWS Lambda, Metadata Index, IAM, TVM - Elastic Beanstalk]

[Diagram: overall data lake architecture. Layers: Collect & Store; Catalogue & Search; Entitlements & Access Controls; APIs & UI. Components: Amazon Kinesis, Amazon S3, Amazon Glacier, IAM, Encrypted Data, Security Token Service, AWS Lambda, Search Index, Metadata Index, API Gateway, Users, UI - Elastic Beanstalk, KMS]

Amazon S3 - Foundation for your Data Lake

Why Amazon S3 for Data Lake?
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multiple upload
• Range GET
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon Elastic MapReduce
• Amazon Redshift
• Amazon DynamoDB
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event Notification
• Lifecycle policies
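
Range GET lets a client fetch a single object as several byte ranges, which can then be requested in parallel. The following is a minimal sketch of how those ranges might be computed. The function name `byte_ranges` and the 8 MiB default part size are illustrative assumptions, not values from the webinar.

```python
def byte_ranges(object_size, part_size=8 * 1024 * 1024):
    """Return HTTP Range header values that together cover object_size bytes."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        # HTTP byte ranges are inclusive at both ends.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges
```

Each returned string can be sent as the `Range` header of a separate GET Object request, and the parts concatenated in order afterwards.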

Why Amazon S3 for Data Lake?
• Natively supported by frameworks like Spark, Hive, Presto, etc.
• Can run transient Hadoop clusters
• Multiple clusters can use the same data
• Highly durable, available, and scalable
• Low Cost: S3 Standard starts at $0.0275 per GB per month
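
To put the quoted price in context, here is a small back-of-the-envelope sketch. It assumes a flat rate equal to the $0.0275 per GB per month figure from the slide. Actual S3 pricing is tiered, varies by region, and changes over time, so treat the result as a rough estimate only. The function name is hypothetical.

```python
from decimal import Decimal

# Price quoted on the slide for S3 Standard; an assumption for this estimate.
PRICE_PER_GB_MONTH = Decimal("0.0275")

def monthly_storage_cost(gigabytes):
    """Estimate the monthly storage cost in USD at the flat slide price."""
    if gigabytes < 0:
        raise ValueError("gigabytes must be non-negative")
    # Convert through str() so float inputs do not carry binary rounding noise.
    return PRICE_PER_GB_MONTH * Decimal(str(gigabytes))
```

At that rate, 1,000 GB works out to $27.50 per month and 100,000 GB to $2,750 per month, before request and data transfer charges.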

Data Ingestion into Amazon S3
• AWS Direct Connect
• AWS Snowball
• ISV Connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway

Choice of storage classes on S3
• Standard: Active data
• Standard - Infrequent Access: Infrequently accessed data
• Amazon Glacier: Archive data
Implement the right controls
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• SSL endpoints
Encryption
• Server Side Encryption (SSE-S3)
• Server Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side Encryption
Compliance
• Bucket access logs
• Lifecycle Management Policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3 etc.
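
As an illustration of the server-side encryption options, the sketch below builds the extra request arguments that select SSE-S3 or SSE-KMS on an upload. The argument names `ServerSideEncryption` and `SSEKMSKeyId` are the ones used by the S3 PutObject API. The helper name `encryption_params` is hypothetical, and SSE-C is left out because it requires supplying the key material with every request.

```python
def encryption_params(mode, kms_key_id=None):
    """Build the extra PutObject arguments for a server-side encryption mode."""
    if mode == "SSE-S3":
        # S3-managed keys.
        return {"ServerSideEncryption": "AES256"}
    if mode == "SSE-KMS":
        params = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id is not None:
            # Without a key ID, S3 falls back to the account's default KMS key.
            params["SSEKMSKeyId"] = kms_key_id
        return params
    raise ValueError(f"unsupported mode: {mode}")
```

With an AWS SDK such as boto3, the returned dictionary could be unpacked into the upload call, for example `s3.put_object(Bucket=..., Key=..., Body=..., **encryption_params("SSE-S3"))`.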

Use Case
“We use S3 as the “source of truth” for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.”
Eva Tse, Director, Big Data Platform
Source: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

Tip #1: Use versioning
• Protects from accidental overwrites and deletes
• New version with every upload
• Easy retrieval of deleted objects and roll back to previous versions
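
The versioning behaviour described above can be pictured with a toy in-memory model. This is a conceptual sketch only, not the S3 API: the class `VersionedBucket` and its methods are hypothetical names, and real S3 identifies versions by version IDs rather than list positions.

```python
class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket (not the S3 API)."""

    DELETE_MARKER = object()  # sentinel that stands in for an S3 delete marker

    def __init__(self):
        self._versions = {}  # key -> list of versions, oldest first

    def put(self, key, data):
        # Every upload appends a new version instead of overwriting the old one.
        self._versions.setdefault(key, []).append(data)

    def delete(self, key):
        # A simple delete only adds a delete marker as the current version.
        self._versions.setdefault(key, []).append(self.DELETE_MARKER)

    def get(self, key):
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is self.DELETE_MARKER:
            raise KeyError(key)
        return versions[-1]

    def roll_back(self, key):
        # Removing the current version exposes the previous one again.
        versions = self._versions.get(key, [])
        if len(versions) < 2:
            raise KeyError(key)
        versions.pop()
        return self.get(key)
```

In this model, rolling back after an accidental delete removes the delete marker and makes the last uploaded data readable again, which mirrors how a deleted object is retrieved in a versioned bucket.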

Tip #2: Use lifecycle policies
• Automatic tiering and cost controls
• Includes two possible actions:
  • Transition: moves objects to Standard - IA or Amazon Glacier based on an object age you specify
  • Expiration: deletes objects after a specified time
• Actions can be combined
• Set policies at the bucket or prefix level
• Set policies for current version or non-current versions
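
A lifecycle policy that combines transition and expiration actions might look like the sketch below, which builds the rule as a plain dictionary in the shape the S3 lifecycle configuration API expects. The day counts, the `logs/` prefix, and the helper name `tiering_rule` are illustrative assumptions, not recommendations from the webinar.

```python
def tiering_rule(prefix, ia_days=30, glacier_days=90, expire_days=365,
                 noncurrent_expire_days=30):
    """Build one lifecycle rule that tiers, then expires, objects under a prefix."""
    if not ia_days < glacier_days < expire_days:
        raise ValueError("expected ia_days < glacier_days < expire_days")
    return {
        "ID": f"tier-and-expire-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},  # prefix-level policy; "" covers the bucket
        "Status": "Enabled",
        # Transition actions for the current version, by object age in days.
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
        # Expiration action for the current version.
        "Expiration": {"Days": expire_days},
        # Separate expiration for non-current versions in a versioned bucket.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_expire_days},
    }

lifecycle_configuration = {"Rules": [tiering_rule("logs/")]}
```

With boto3, a dictionary like `lifecycle_configuration` would be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.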

Versioning + lifecycle policies

Expired object delete marker policy
• Deleting a versioned object makes a delete marker the current version of the object
• Removing expired object delete markers can improve list performance
• Lifecycle policy automatically removes the current version delete marker when previous versions of the object no longer exist
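
As a sketch of such a policy, the rule below expires non-current versions after a number of days and then removes the delete markers left behind once no previous versions remain. The field names follow the S3 lifecycle configuration API. The 30-day default and the helper name are assumptions made for illustration.

```python
def delete_marker_cleanup_rule(noncurrent_days=30):
    """Build a rule that expires old versions, then their orphaned delete markers."""
    if noncurrent_days < 1:
        raise ValueError("noncurrent_days must be at least 1")
    return {
        "ID": "cleanup-expired-delete-markers",
        "Filter": {"Prefix": ""},  # empty prefix applies the rule to the whole bucket
        "Status": "Enabled",
        # Permanently remove versions once they have been non-current long enough.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
        # Remove a delete marker when it is the only version left for the object.
        "Expiration": {"ExpiredObjectDeleteMarker": True},
    }
```

The `ExpiredObjectDeleteMarker` setting is used on its own inside `Expiration` here, because it cannot be combined with a `Days` or `Date` value in the same rule.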

Enable policy with the console

Incomplete multipart upload expiration policy (Best Practice)
• Partial uploads do incur storage charges
• Set a lifecycle policy to automatically make incomplete multipart uploads expire after a predefined number of days
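
A rule of this kind could be expressed as in the sketch below. `AbortIncompleteMultipartUpload` and `DaysAfterInitiation` are the field names used by the S3 lifecycle configuration API. The 7-day default and the helper name are hypothetical choices for the example.

```python
def abort_incomplete_uploads_rule(days=7, prefix=""):
    """Build a rule that aborts multipart uploads left unfinished for `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ID": "abort-incomplete-multipart-uploads",
        "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
        "Status": "Enabled",
        # Parts of uploads not completed within this window are discarded,
        # so they stop accruing storage charges.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }
```

The rule can sit alongside transition and expiration rules in the same lifecycle configuration for a bucket.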

Enable policy with the Management Console

Considerations for organizing your Data Lake
• Amazon S3 storage uses a flat keyspace
• Separate data by business unit, application, type, and time
• Natural data partitioning is very useful
• Paths should be self-documenting and intuitive
• Changing the prefix structure in the future is hard and costly
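
One way to apply these considerations is to generate every object key from a single helper, so the layout stays consistent. The sketch below is an assumption-laden example: the component order, the `year=/month=/day=` naming (a convention borrowed from Hive-style partitioning), and the sample names are all illustrative rather than prescribed by the webinar.

```python
from datetime import date

def object_key(business_unit, application, data_type, day, filename):
    """Build a self-documenting S3 key partitioned by unit, application, type, and date."""
    parts = [business_unit, application, data_type]
    for part in parts + [filename]:
        # Components must be non-empty and must not introduce extra path levels.
        if not part or "/" in part:
            raise ValueError(f"invalid key component: {part!r}")
    date_path = f"year={day:%Y}/month={day:%m}/day={day:%d}"
    return "/".join(parts + [date_path, filename])
```

For instance, `object_key("marketing", "clickstream", "raw", date(2016, 8, 15), "events-0001.json.gz")` yields `marketing/clickstream/raw/year=2016/month=08/day=15/events-0001.json.gz`. Because S3 has a flat keyspace, the slashes are only part of the key name, but tools that list by prefix treat them like folders.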

Best Practices for your Data Lake
• Always store a copy of raw input as the first rule of thumb
• Use automation with S3 Events to enable trigger-based workflows
• Use a format that supports your data, rather than forcing your data into a format
• Apply compression everywhere to reduce the network load
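
A trigger-based workflow driven by S3 Events could start from a function like the sketch below, written in the shape of an AWS Lambda handler. It only extracts the bucket, key, and size of each newly created object from the event notification. What happens next (cataloguing, conversion, compression) is left open, and the function names are hypothetical.

```python
from urllib.parse import unquote_plus

def new_objects(event):
    """Yield (bucket, key, size) for each object-created record in an S3 event."""
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletes and other event types
        s3 = record["s3"]
        key = unquote_plus(s3["object"]["key"])  # keys arrive URL-encoded
        yield s3["bucket"]["name"], key, s3["object"].get("size", 0)

def handler(event, context):
    """Lambda-style entry point: list the new objects for a later processing step."""
    return [{"bucket": b, "key": k, "size": n} for b, k, n in new_objects(event)]
```

Keeping the parsing separate from the handler makes it easy to test the workflow locally with a hand-written sample event before wiring it to a bucket notification.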
