AWS March 2016 Webinar Series Building Your Data Lake on AWS

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Meyers, Principal Solution Architect, AWS
March 2016
Replay of BDT317
Building Your Data Lake on AWS

Benefits of the Enterprise Data Warehouse
Self documenting schema
Enforced data types
Ubiquitous and common security model
Simple tools to access, robust ecosystem
Transactionality

But customers have additional requirements…

The Rise of “Big Data”
[Diagram: enterprise data warehouse, Amazon EMR, Amazon S3]

[Diagram: one STORAGE layer shared by many independent COMPUTE clusters]

Benefits of Separation of Compute & Storage
All your data, without paying for unused cores
Independent cost attribution per dataset
Use the right tool for a job, at the right time
Increased durability without operations
Common model for data, without enforcing access method

Comparison of a Data Lake to an Enterprise Data Warehouse
Data lake (EMR, S3) | Enterprise data warehouse
Complementary to EDW (not replacement) | Data lake can be source for EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data science + prediction/advanced analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
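The "schema on read" row can be illustrated with a minimal sketch: raw records are stored as they arrived, and each reader projects them onto its own schema when it reads. The sample records, the SALES_SCHEMA mapping, and the read_with_schema helper are all hypothetical names chosen for this illustration.

```python
import json

# Raw events stored exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "ts": "2016-03-01T10:00:00"}',
    '{"user": "bob", "amount": "7", "channel": "mobile"}',
]

# A schema chosen by the reader, not the writer: field -> (cast, default).
SALES_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
    "channel": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the reader's schema."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, SALES_SCHEMA))
```

Fields the reader does not ask for (such as ts) are ignored rather than rejected, and a different consumer could read the same raw lines with a different schema.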

The New Problem
[Diagram: enterprise data warehouse ≠ EMR + S3]
• Which system has my data?
• How can I do machine learning against the DW?
• I built this in Hive, can we get it into the Finance reports?
• These sources are giving different results…
• But I implemented the algorithm in Anaconda…

Dive Into The Data Lake
[Diagram: enterprise data warehouse ≠ EMR + S3]

Dive Into The Data Lake
Data lake (EMR, S3)
• Ingest any data
• Data cleansing
• Data catalogue
• Trend analysis
• Machine learning
Enterprise data warehouse
• Structured analysis
• Common access tools
• Efficient aggregation
• Structured business rules
Flows between the two
• Load cleansed data
• Export computed aggregates

Components of a Data Lake
Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost
Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low latency analytics
Storage & Streams
Catalogue & Search
Entitlements
API & UI

Catalogue
• Metadata lake
• Used for summary statistics and data classification management
Search
• Simplified access model for data discovery

Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions

API & User Interface
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected

Storage
High durability
Stores raw data from input sources
Support for any type of data
Low cost

Simple Storage Service
Highly scalable object storage for the Internet
1 byte to 5 TB in size
Designed for 99.999999999% durability, 99.99% availability
Regional service, no single points of failure
Server side encryption

Storage Lifecycle Integration
S3 Standard → S3 Infrequent Access → Amazon Glacier
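As a rough sketch of the tiering above, the following builds a lifecycle configuration in the shape accepted by the S3 PutBucketLifecycleConfiguration API. The lifecycle_rules helper, the 30 and 90 day thresholds, the rule ID format, and the raw/clickstream/ prefix are assumptions made for illustration, not values from the presentation.

```python
def lifecycle_rules(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers objects under `prefix`.

    Objects move to S3 Infrequent Access first and to Glacier later.
    The day thresholds are illustrative defaults, not recommendations.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = lifecycle_rules("raw/clickstream/")

# With boto3 installed and credentials configured, the configuration could be
# applied to a (hypothetical) bucket like this:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="mydatalake", LifecycleConfiguration=config)
```

Keeping the rule per prefix means each dataset in the lake can age at its own pace.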

Data Storage Format
• Not all data formats are created equally
• Unstructured vs. semi-structured vs. structured
• Store a copy of raw input
• Data standardisation as a workflow following ingest
• Use a format that supports your data, rather than force your data into a format
• Consider how data will change over time
• Apply common compression

Consider Different Types of Data
Unstructured
• Store native file format (logs, dump files, whatever)
• Compress with a streaming codec (LZO, Snappy)
Semi-structured - JSON, XML files, etc.
• Consider evolution ability of the data schema (Avro)
• Store the schema for the data as a file attribute (metadata/tag)
Structured
• Lots of data is CSV!
• Columnar storage (Orc, Parquet)

Where to Store Data
Amazon S3 storage uses a flat keyspace
Separate data storage by business unit, application, type, and time
Natural data partitioning is very useful
Paths should be self documenting and intuitive
Changing prefix structure in future is hard/costly
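One way to keep paths self documenting is to build every key through a single helper, so the prefix structure is decided once. The sketch below is an assumption-laden example: the object_key function, the segment order, and the Hive-style year=/month=/day= partition names are illustrative choices, not a layout prescribed by the slides.

```python
from datetime import date


def object_key(business_unit, application, data_type, day, filename):
    """Build an S3 key partitioned by business unit, application, type, and time."""
    parts = [
        business_unit,
        application,
        data_type,
        f"year={day:%Y}",
        f"month={day:%m}",
        f"day={day:%d}",
        filename,
    ]
    for part in parts:
        # S3 has a flat keyspace; "/" is only a naming convention, so an
        # embedded slash would silently create an extra "directory" level.
        if not part or "/" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    return "/".join(parts)


key = object_key("finance", "billing", "invoices", date(2016, 3, 1), "part-0001.csv.gz")
```

Because changing the prefix structure later is costly, routing all writers through one function like this makes the layout a deliberate, reviewable decision.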

Resource Oriented Architecture
http://en.wikipedia.org/wiki/Resource-oriented_architecture
s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}
Metadata Services
• CRUD API
• Query API
• Analytics API
Systems of Reference
• Return URLs
• URLs as deeplinks to applications, file exchanges via S3 (RESTful file services), or manifests for Big Data Analytics / HPC
Integration Layer
• System to system via Amazon SNS/Amazon SQS
• System to user via mobile push
• Amazon Simple Workflow for high level system integration / orchestration

Streaming
Streaming ingest of feed data
Provides the ability to consume any dataset as a stream
Facilitates low latency analytics

Why Do Streams Matter?
• Latency between event & action
• Most BI systems target event to action latency of 1 hour
• Streaming analytics would expect event to action latency < 2 seconds
• Stream orientation simplifies architecture, but can increase operational complexity
• Increase in complexity needs to be justified by business value of reduced latency

Amazon Kinesis
Managed service for real time big data processing
Create streams to produce & consume data
Elastically add and remove shards for performance
Use the Amazon Kinesis Client Library (KCL) to process data
Integration with S3, Amazon Redshift, and DynamoDB
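To make the role of shards more concrete: Kinesis maps each record's partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The sketch below reproduces that mapping locally. The shard_ranges and shard_for helpers are hypothetical, and the even split of the hash space is an assumption (a real stream's ranges depend on how it was created and resharded).

```python
import hashlib

HASH_SPACE = 2 ** 128  # partition keys hash to 128-bit integers


def shard_ranges(shard_count):
    """Split the hash key space into contiguous, roughly equal ranges."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the space is fully covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise AssertionError("ranges must cover the whole hash space")


ranges = shard_ranges(4)
shard_index = shard_for("device-42", ranges)
```

Records with the same partition key always land on the same shard, which is why a skewed choice of key can leave some shards hot while others sit idle.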

Amazon Kinesis Architecture
[Diagram: data sources send records to an AWS endpoint; the stream is divided into shards (Shard 1, Shard 2, … Shard N) and spans three Availability Zones. Consuming applications: App.1 (archive/ingestion), App.2 (sliding window analysis), App.3 (data loading), App.4 (event processing systems). Destinations shown: S3, DynamoDB, Amazon Redshift.]

Streaming Storage Integration
[Diagram: analytics applications read & write file data in the Amazon S3 object store and read & write to streams in Amazon Kinesis. Streams are archived to the object store, and history is replayed from the object store.]

Catalogue & search
Metadata lake
Used for summary statistics and data classification management
Simplified model for data discovery & governance

Building a Data Catalogue
• Aggregated information about your storage & streaming layer
• Storage service for metadata: ownership, data lineage
• Data abstraction layer: customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service

Data Catalogue – Metadata Index
• Stores data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification, refresh schedule, object version information
• Amazon S3 events processed by Lambda function
• DynamoDB metadata tables store required attributes
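The metadata index described above could be maintained by a Lambda function along these lines. This is a minimal sketch, not the presenter's implementation: the handler reads the standard S3 event notification fields (bucket name, object key, size), while PREFIX_STATS is an in-memory stand-in for the DynamoDB table and prefix_of with its two-segment depth is a hypothetical grouping rule.

```python
from collections import defaultdict
from urllib.parse import unquote_plus

# In-memory stand-in for the DynamoDB metadata table (assumption for the sketch).
PREFIX_STATS = defaultdict(lambda: {"object_count": 0, "total_bytes": 0})


def prefix_of(key, depth=2):
    """Return the first `depth` path segments of a key as its catalogue prefix."""
    segments = key.split("/")[:-1]  # drop the object name itself
    return "/".join(segments[:depth]) + "/" if segments else ""


def handler(event, context=None):
    """Process S3 ObjectCreated notifications and update per-prefix statistics."""
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        stats = PREFIX_STATS[(bucket, prefix_of(key))]
        stats["object_count"] += 1
        stats["total_bytes"] += size
    return dict(PREFIX_STATS)
```

A production version would also need to handle overwrites and deletions, which this sketch counts naively or ignores.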

Amazon DynamoDB
Managed NoSQL
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with Amazon EMR & Hive

AWS Lambda
Fully-managed event processor
Node.js or Java, integrated AWS SDK
Natively compile & install any Node.js modules
Specify runtime RAM & timeout
Automatically scaled to support event volume
Events from Amazon S3, Amazon SNS, Amazon DynamoDB, Amazon Kinesis, & AWS Lambda
Integrated CloudWatch logging
Serverless Compute

Data Catalogue – Search
Ingestion and pre-processing
Text processing (normalization)
• Tokenization
• Downcasing
• Stemming
• Stopword removal
• Synonym addition
Indexing
Matching
Ranking and relevance
• TF-IDF
• Additional criteria (rating, user behavior, freshness, etc.)
[Diagram: sources (NoSQL, RDBMS, files, any source) feed a processor that builds the search index]
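The text processing and ranking steps listed above can be sketched in a few lines. This is a toy illustration only: the stopword list, the synonym table, and the suffix-stripping stem function are hypothetical simplifications, and a managed search service applies far more sophisticated analysis.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "for", "in", "to"}
SYNONYMS = {"dw": ["warehouse"]}  # hypothetical synonym table


def stem(token):
    """Very naive suffix stripping; real systems use e.g. Porter stemming."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyse(text):
    """Tokenise, downcase, drop stopwords, stem, and add synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for token in tokens:
        if token in STOPWORDS:
            continue
        terms.append(stem(token))
        terms.extend(SYNONYMS.get(token, []))
    return terms


def tf_idf(docs):
    """Score each term per document as term frequency * inverse document frequency."""
    analysed = {doc_id: Counter(analyse(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for counts in analysed.values() for term in counts)
    n = len(docs)
    return {
        doc_id: {
            term: (count / sum(counts.values())) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        }
        for doc_id, counts in analysed.items()
    }


scores = tf_idf({"a": "sales reports", "b": "sales logs"})
```

A term that appears in every document (here the stem of "sales") scores zero, which is the property that lets TF-IDF rank distinctive terms above common ones.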

Amazon CloudSearch & Elasticsearch: Features and Benefits
Easy to set up and operate
• AWS Management Console, SDK, CLI
Scalable
• Automatic scaling on data size and traffic
Reliable
• Automatic recovery of instances, multi-AZ, etc.
High performance
• Low latency and high throughput performance through in-memory caching
Fully managed
• No capacity guessing
Rich features
• Faceted search, suggestions, relevance ranking, geospatial search, multi-language support, etc.
Cost effective
• Pay as you go

Data Catalogue – Building Search Index
• Enable DynamoDB Update Stream for metadata index table
• Additional AWS Lambda function reads Update Stream and extracts index fields from S3 object
• Update to Amazon CloudSearch domain
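The middle step of this pipeline might look like the following sketch, which turns DynamoDB stream records into a CloudSearch document batch of add and delete operations. The function names and the object_key attribute are hypothetical, only string, number, and boolean attributes are handled, and the fields come from the stream image alone rather than from the S3 object itself. It assumes the stream is configured to include new images.

```python
import json


def plain(attribute):
    """Convert one DynamoDB-typed attribute, e.g. {"S": "x"} or {"N": "1"}."""
    (type_code, value), = attribute.items()
    if type_code == "N":
        return float(value) if "." in value else int(value)
    if type_code in ("S", "BOOL"):
        return value
    raise ValueError(f"unsupported attribute type in this sketch: {type_code}")


def to_search_batch(event, id_attribute="object_key"):
    """Translate stream records into CloudSearch 'add' and 'delete' operations."""
    batch = []
    for record in event.get("Records", []):
        doc_id = plain(record["dynamodb"]["Keys"][id_attribute])
        if record["eventName"] == "REMOVE":
            batch.append({"type": "delete", "id": doc_id})
        else:  # INSERT or MODIFY
            image = record["dynamodb"]["NewImage"]
            fields = {
                name: plain(attr)
                for name, attr in image.items()
                if name != id_attribute
            }
            batch.append({"type": "add", "id": doc_id, "fields": fields})
    return batch


def handler(event, context=None):
    """Lambda entry point: return the JSON batch that would be uploaded."""
    return json.dumps(to_search_batch(event))
```

The resulting JSON could then be sent to the search domain's document endpoint; that upload step is left out here because it depends on the domain configuration.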

Catalogue & Search Architecture

Entitlements
Encryption
Authentication
Authorisation
Chargeback
Quotas
Data masking
Regional restrictions

Identity & Access Management
• Manage users, groups, and roles
• Identity federation with Open ID
• Temporary credentials with Amazon Security Token Service (Amazon STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies

IAM Policy Language
JSON documents
Can include variables which extract information from the request context
Variable | Description
aws:CurrentTime | For date/time conditions
aws:EpochTime | The date in epoch or UNIX time, for use with date/time conditions
aws:TokenIssueTime | The date/time that temporary security credentials were issued, for use with date/time conditions
aws:principaltype | A value that indicates whether the principal is an account, user, federated, or assumed role
aws:SecureTransport | Boolean representing whether the request was sent using SSL
aws:SourceIp | The requester's IP address, for use with IP address conditions
aws:UserAgent | Information about the requester's client application, for use with string conditions
aws:userid | The unique ID for the current user
aws:username | The friendly name of the current user

IAM Policy Language
Example: Allow a user to access a private part of the data lake
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mydatalake"],
      "Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
    }
  ]
}
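To show what the ${aws:username} variable does in this policy, the sketch below substitutes it from a request context and applies wildcard matching. It is a deliberately simplified model for intuition only, not IAM's actual evaluation logic: the resolve, may_access_object, and may_list_prefix helpers are hypothetical, and Python's fnmatchcase only approximates IAM's StringLike and resource wildcards.

```python
from fnmatch import fnmatchcase

BUCKET_ARN = "arn:aws:s3:::mydatalake"


def resolve(pattern, context):
    """Substitute ${aws:...} policy variables from the request context."""
    for name, value in context.items():
        pattern = pattern.replace("${" + name + "}", value)
    return pattern


def may_access_object(key, context):
    """Mimic the Get/PutObject statement on mydatalake/${aws:username}/*."""
    resource = resolve(BUCKET_ARN + "/${aws:username}/*", context)
    return fnmatchcase(BUCKET_ARN + "/" + key, resource)


def may_list_prefix(prefix, context):
    """Mimic the ListBucket statement's StringLike condition on s3:prefix."""
    return fnmatchcase(prefix, resolve("${aws:username}/*", context))


alice = {"aws:username": "alice"}
own_object = may_access_object("alice/reports/q1.csv", alice)
other_object = may_access_object("bob/reports/q1.csv", alice)
```

The same policy document therefore gives every user a private area under their own name without needing one policy per user.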

IAM Federation
IAM allows federation with Active Directory (via SAML 2.0) and with OpenID Connect providers (Amazon, Facebook, Google)
AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS
[Diagram: IAM, users, AWS Directory Service AD Connector, Direct Connect, hardware VPN]

Extended user defined security

Entitlements Engine: Amazon STS Token Vending Machine
http://amzn.to/1FMPrTF
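A token vending machine of this kind could, as one possible approach, wrap the STS GetFederationToken call and attach a policy scoped to the requesting user. The sketch below assumes a boto3-style STS client is passed in; the scoped_policy and vend_credentials names, the bucket layout, and the one hour duration are illustrative assumptions rather than details of the linked implementation.

```python
import json


def scoped_policy(bucket, username):
    """Policy limiting the session to the user's own prefix in the data lake."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:ListBucket"],
                "Effect": "Allow",
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{username}/*"]}},
            },
            {
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Effect": "Allow",
                "Resource": [f"arn:aws:s3:::{bucket}/{username}/*"],
            },
        ],
    }


def vend_credentials(sts_client, bucket, username, duration_seconds=3600):
    """Request temporary credentials restricted by the scoped policy.

    `sts_client` is expected to behave like boto3.client("sts").
    """
    response = sts_client.get_federation_token(
        Name=username[:32],  # federated user names are limited to 32 characters
        Policy=json.dumps(scoped_policy(bucket, username)),
        DurationSeconds=duration_seconds,
    )
    return response["Credentials"]
```

The returned credentials can never exceed the permissions of the identity that called STS; the attached policy only narrows them further.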

Data Encryption
AWS CloudHSM
• Dedicated tenancy SafeNet Luna SA HSM device
• Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services
AWS server side encryption
• AWS managed key infrastructure

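Entitlements – Access to Encryption Keys
[Diagram: a customer master key protects customer data keys; each data key exists as a plaintext key and a ciphertext key. Access uses an IAM temporary credential from the Security Token Service. The S3 object MyData is stored together with its ciphertext key (Name: MyData, Key: Ciphertext Key).]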

Secure Data Flow
[Diagram: users call API Gateway; the TVM (Elastic Beanstalk) obtains temporary credentials from IAM and the Security Token Service; the metadata index is held in DynamoDB; encrypted data is stored in Amazon S3 under s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID}]

API & UI
Exposes the data lake to customers
Programmatically query catalogue
Expose search API
Ensures that entitlements are respected

Data Lake API & UI
Exposes the Metadata API, search, and Amazon S3 storage services to customers
Can be based on TVM/STS Temporary Access for many services, and a bespoke API for Metadata
Drive all UI operations from API?

Introducing Amazon API Gateway
Host multiple versions and stages of APIs
Create and distribute API keys to developers
Leverage AWS Signature Version 4 (SigV4) to authorize access to APIs
Throttle and monitor requests to protect the backend
Leverages AWS Lambda
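A catalogue query endpoint behind API Gateway might be backed by a Lambda function like the sketch below. It assumes the Lambda proxy integration event and response shapes (queryStringParameters in, statusCode/headers/body out); the CATALOGUE list is a hypothetical in-memory stand-in for the metadata index, and the classification filter is an invented example query.

```python
import json

# Hypothetical catalogue entries; a real function would query the metadata index.
CATALOGUE = [
    {"prefix": "finance/billing/", "classification": "confidential"},
    {"prefix": "marketing/clickstream/", "classification": "internal"},
]


def handler(event, context=None):
    """Return catalogue entries, optionally filtered by ?classification=..."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    classification = params.get("classification")
    matches = [
        entry
        for entry in CATALOGUE
        if classification is None or entry["classification"] == classification
    ]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"datasets": matches}),
    }
```

An entitlement check would sit in front of the filter in a fuller version, so that callers only ever see datasets they are allowed to discover.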

Additional Features
Managed cache to store API responses
Reduced latency and DDoS protection through Amazon CloudFront
SDK generation for iOS, Android, and JavaScript
Swagger support
Request / response data transformation and API mocking

An API Call Flow
[Diagram: mobile apps, websites, and services call over the Internet through Amazon CloudFront to API Gateway, which has an API Gateway cache and Amazon CloudWatch monitoring. Backends: AWS Lambda functions, endpoints on Amazon EC2, or any other publicly accessible endpoint.]

API & UI Architecture
[Diagram: users, UI (Elastic Beanstalk), API Gateway, AWS Lambda, metadata index, IAM, TVM (Elastic Beanstalk)]

A Data Lake Is…
• A foundation of highly durable data storage and streaming of any type of data
• A metadata index and workflow which helps us categorise and govern data stored in the data lake
• A search index and workflow which enables data discovery
• A robust set of security controls – governance through technology, not policy
• An API and user interface that expose these features to internal and external users

[Summary diagram]
Storage & Streams: Amazon Kinesis, Amazon S3, Amazon Glacier
Data Catalogue & Search: AWS Lambda, search index, metadata index
Entitlements: IAM, encrypted data, Security Token Service, TVM (Elastic Beanstalk), KMS
API & UI: users, API Gateway, UI (Elastic Beanstalk)

Thank you!
Ian Meyers, Principal Solution Architect
