© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Meyers, Principal Solution Architect, AWS
March 2016
Replay of BDT317
Building Your Data Lake on AWS
Benefits of the Enterprise Data Warehouse
Self documenting schema
Enforced data types
Ubiquitous and common security model
Simple tools to access, robust ecosystem
Transactionality
[Diagram: the enterprise data warehouse combines STORAGE and COMPUTE]
But customers have additional requirements…
The Rise of “Big Data”
[Diagram: the enterprise data warehouse beside Amazon S3 (STORAGE) serving multiple Amazon EMR clusters (COMPUTE)]
Benefits of Separation of Compute & Storage
All your data, without paying for unused cores
Independent cost attribution per dataset
Use the right tool for a job, at the right time
Increased durability without operations
Common model for data, without enforcing access method
Comparison of a Data Lake to an Enterprise Data Warehouse
Data lake (EMR, S3) | Enterprise DW
Complementary to EDW (not replacement) | Data lake can be source for EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data Science + Prediction/Advanced Analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
The New Problem
EMR, S3 ≠ Enterprise data warehouse
• "Which system has my data?"
• "How can I do machine learning against the DW?"
• "I built this in Hive, can we get it into the Finance reports?"
• "These sources are giving different results…"
• "But I implemented the algorithm in Anaconda…"
Dive Into The Data Lake
Data lake (EMR, S3)
• Ingest any data
• Data cleansing
• Data catalogue
• Trend analysis
• Machine learning
Enterprise data warehouse
• Structured analysis
• Common access tools
• Efficient aggregation
• Structured business rules
Data lake to data warehouse: load cleansed data
Data warehouse to data lake: export computed aggregates
Components of a Data Lake
Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost
Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low latency analytics
Storage & Streams
Catalogue & Search
Entitlements
API & UI
Components of a Data Lake
Catalogue
• Metadata lake
• Used for summary statistics and data classification management
Search
• Simplified access model for data discovery
Components of a Data Lake
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
Components of a Data Lake
API & User Interface
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
Storage
High durability
Stores raw data from input sources
Support for any type of data
Low cost
Simple Storage Service
Highly scalable object storage for the Internet
1 byte to 5 TB in size
Designed for 99.999999999% durability, 99.99% availability
Regional service, no single points of failure
Server side encryption
[Diagram: AWS service stack (AWS Global Infrastructure, Networking, Compute, Storage, Database, Analytics, App Services, Deployment & Administration)]
Storage Lifecycle Integration
S3 – Standard → S3 – Infrequent Access → Amazon Glacier
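The tiering above can be expressed as an S3 lifecycle rule. The sketch below only builds the rule document; the prefix `raw/logs/` and the 30 and 90 day thresholds are illustrative assumptions, not values from the talk. The dictionary follows the shape that the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration` expects.

```python
import json


def lifecycle_rule(prefix, days_to_ia=30, days_to_glacier=90):
    """Build one lifecycle rule that tiers objects under `prefix`.

    The thresholds are hypothetical defaults chosen for illustration.
    """
    return {
        "ID": "tier-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Standard -> Infrequent Access after `days_to_ia` days
            {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
            # Infrequent Access -> Glacier after `days_to_glacier` days
            {"Days": days_to_glacier, "StorageClass": "GLACIER"},
        ],
    }


config = {"Rules": [lifecycle_rule("raw/logs/")]}
print(json.dumps(config, indent=2))
```

Applying the rule to a bucket would still need an S3 client and credentials, which are left out here.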
Data Storage Format
• Not all data formats are created equally
• Unstructured vs. semi-structured vs. structured
• Store a copy of raw input
• Data standardisation as a workflow following ingest
• Use a format that supports your data, rather than force your data into a format
• Consider how data will change over time
• Apply common compression
Consider Different Types of Data
Unstructured
• Store native file format (logs, dump files, whatever)
• Compress with a streaming codec (LZO, Snappy)
Semi-structured - JSON, XML files, etc.
• Consider evolution ability of the data schema (Avro)
• Store the schema for the data as a file attribute (metadata/tag)
Structured
• Lots of data is CSV!
• Columnar storage (ORC, Parquet)
Where to Store Data
Amazon S3 storage uses a flat keyspace
Separate data storage by business unit, application, type, and
time
Natural data partitioning is very useful
Paths should be self documenting and intuitive
Changing prefix structure in future is hard/costly
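A minimal sketch of a self-documenting key layout, assuming one prefix level each for business unit, application and data type, followed by date partitions. The function name, the example values and the Hive-style `year=/month=/day=` naming are assumptions for illustration; the talk does not prescribe a specific layout.

```python
from datetime import date


def object_key(business_unit, application, data_type, day, filename):
    """Build an S3 key partitioned by business unit, application, type, and time."""
    parts = [
        business_unit,
        application,
        data_type,
        f"year={day:%Y}",
        f"month={day:%m}",
        f"day={day:%d}",
        filename,
    ]
    for part in parts:
        # Reject empty or nested segments so the prefix depth stays fixed.
        if not part or "/" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    return "/".join(parts)


print(object_key("finance", "billing", "invoices", date(2016, 3, 15), "part-0001.csv.gz"))
```

Generating every key through one function like this keeps the prefix structure consistent, which matters because changing it later is costly.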
Resource Oriented Architecture
http://en.wikipedia.org/wiki/Resource-oriented_architecture
s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}
Metadata Services
• CRUD API
• Query API
• Analytics API
Systems of Reference
• Return URLs
• URLs as deeplinks to applications, file exchanges via S3 (RESTful file services), or manifests for Big Data Analytics / HPC
Integration Layer
• System to system via Amazon SNS/Amazon SQS
• System to user via mobile push
• Amazon Simple Workflow for high level system integration / orchestration
Streaming
Streaming ingest of feed data
Provides the ability to consume any dataset as a stream
Facilitates low latency analytics
Why Do Streams Matter?
• Latency between event & action
• Most BI systems target event to action latency of 1 hour
• Streaming analytics would expect event to action latency < 2 seconds
• Stream orientation simplifies architecture, but can increase operational complexity
• Increase in complexity needs to be justified by business value of reduced latency
Amazon Kinesis
Managed service for real time big data processing
Create streams to produce & consume data
Elastically add and remove shards for performance
Use Amazon Kinesis Client Library to process data
Integration with S3, Amazon Redshift, and DynamoDB
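How records land on shards can be sketched as follows. Kinesis hashes each record's partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The helper names and the assumption that shards split the space evenly are illustrative; real streams report their ranges per shard.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key):
    """Map a partition key to an integer in the 128-bit hash space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count):
    """Assumption: the hash space is divided evenly between shards."""
    width = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * width
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * width - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose hash range contains the key."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash ranges do not cover the key")


ranges = even_shard_ranges(4)
print(shard_for("sensor-42", ranges))
```

Records with the same partition key always reach the same shard, so adding shards raises throughput only if keys are spread widely enough.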
Amazon Kinesis Architecture
[Diagram: data sources send records to an AWS endpoint; the stream consists of Shard 1, Shard 2, … Shard N, spread across three Availability Zones; consuming applications are App.1 (archive/ingestion), App.2 (sliding window analysis), App.3 (data loading), and App.4 (event processing systems); outputs go to S3, DynamoDB, and Amazon Redshift]
Streaming Storage Integration
[Diagram: analytics applications read & write file data in Amazon S3 (object store) and read & write to streams in Amazon Kinesis; streams are archived to the object store, and history is replayed from the object store]
Catalogue & search
Metadata lake
Used for summary statistics and data Classification management
Simplified model for data discovery & governance
Building a Data Catalogue
• Aggregated information about your storage & streaming layer
• Storage service for metadata: ownership, data lineage
• Data abstraction layer: customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service
Data Catalogue – Metadata Index
• Stores data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification, refresh schedule, object version information
• Amazon S3 events processed by Lambda function
• DynamoDB metadata tables store required attributes
http://amzn.to/1LSSbFp
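A minimal sketch of the indexing function described above, written in Python to match the other sketches here. It reads the standard S3 event notification fields (`Records[].s3.bucket.name`, `Records[].s3.object.key`, `Records[].s3.object.size`) and keeps totals per prefix. The in-memory `metadata_index` dictionary is a stand-in for the DynamoDB metadata table, and the choice of the first path segment as the prefix is an assumption.

```python
from collections import defaultdict
from urllib.parse import unquote_plus

# Stand-in for the DynamoDB metadata table (assumption: one item per
# bucket and top-level prefix, holding running totals).
metadata_index = defaultdict(lambda: {"object_count": 0, "total_bytes": 0})


def prefix_of(key, depth=1):
    """Return the first `depth` path segments of a key as its prefix."""
    parts = key.split("/")
    if len(parts) <= depth:
        return ""  # object sits above the chosen prefix depth
    return "/".join(parts[:depth]) + "/"


def handler(event, context=None):
    """Update prefix totals for every created object in an S3 event."""
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        entry = metadata_index[(bucket, prefix_of(key))]
        # Note: an overwrite of an existing key is counted again here.
        entry["object_count"] += 1
        entry["total_bytes"] += size
    return dict(metadata_index)
```

A production version would write the totals with atomic counter updates in the table rather than holding them in process memory.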
Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon ElastiCache
Managed NoSQL
Amazon DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with Amazon EMR & Hive
AWS Lambda
Fully-managed event processor
Node.js or Java, integrated AWS SDK
Natively compile & install any Node.js modules
Specify runtime RAM & timeout
Automatically scaled to support event volume
Events from Amazon S3, Amazon SNS, Amazon DynamoDB, Amazon Kinesis, & AWS Lambda
Integrated CloudWatch logging
Serverless Compute
Data Catalogue – Search
Ingestion and pre-processing
Text processing (normalization)
• Tokenization
• Downcasing
• Stemming
• Stopword removal
• Synonym addition
Indexing
Matching
Ranking and relevance
• TF-IDF
• Additional criteria (rating, user behavior, freshness, etc.)
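The text processing and ranking steps above can be sketched in a few lines. This covers tokenization, downcasing, stopword removal and TF-IDF ranking only; stemming and synonym addition are left out. The stopword list and function names are illustrative assumptions, and a managed search service does considerably more.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "is"}


def normalise(text):
    """Tokenize, downcase, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tf_idf(documents):
    """Return {doc_id: {term: score}} using tf * ln(N / document frequency)."""
    tokenised = {doc_id: normalise(text) for doc_id, text in documents.items()}
    doc_freq = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))
    n = len(tokenised)
    scores = {}
    for doc_id, tokens in tokenised.items():
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores[doc_id] = {
            term: (count / total) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def search(query, documents):
    """Rank documents by the summed TF-IDF score of the query terms."""
    scores = tf_idf(documents)
    terms = normalise(query)
    ranked = [
        (sum(doc_scores.get(t, 0.0) for t in terms), doc_id)
        for doc_id, doc_scores in scores.items()
    ]
    return [doc_id for score, doc_id in sorted(ranked, reverse=True) if score > 0]
```

Additional criteria such as rating or freshness would typically be blended into the score after this step.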
[Diagram: NoSQL, RDBMS, files, or any source → processor → search index]
Features and Benefits
Easy to set up and operate
• AWS Management Console, SDK, CLI
Scalable
• Automatic scaling on data size and traffic
Reliable
• Automatic recovery of instances, multi-AZ, etc.
High performance
• Low latency and high throughput performance through in-memory caching
Fully managed
• No capacity guessing
Rich features
• Faceted search, suggestions, relevance ranking, geospatial search, multi-language support, etc.
Cost effective
• Pay as you go
Amazon CloudSearch & Elasticsearch
Data Catalogue – Building Search Index
Enable DynamoDB Update Stream for metadata index table
Additional AWS Lambda function reads Update Stream and extracts index fields from S3 object
Update to Amazon CloudSearch domain
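A rough sketch of the middle step: turning DynamoDB stream records into a CloudSearch document batch. It relies on the stream record layout (`eventName`, `dynamodb.Keys`, `dynamodb.NewImage` with typed attributes) and the CloudSearch batch format of `add` and `delete` operations. The key attribute name `object_key` is an assumption, and this version indexes only fields already in the metadata table rather than reading the S3 object itself.

```python
def plain(attribute):
    """Convert one DynamoDB typed attribute, e.g. {"S": "x"}, to a plain value."""
    (type_tag, value), = attribute.items()
    if type_tag == "N":
        # Numbers arrive as strings in stream records.
        return float(value) if "." in value else int(value)
    return value


def to_search_batch(event, id_attribute="object_key"):
    """Build a CloudSearch document batch from DynamoDB stream records."""
    batch = []
    for record in event.get("Records", []):
        doc_id = plain(record["dynamodb"]["Keys"][id_attribute])
        if record["eventName"] == "REMOVE":
            batch.append({"type": "delete", "id": doc_id})
            continue
        image = record["dynamodb"].get("NewImage", {})
        fields = {
            name: plain(value)
            for name, value in image.items()
            if name != id_attribute
        }
        batch.append({"type": "add", "id": doc_id, "fields": fields})
    return batch
```

The resulting list would be serialised to JSON and uploaded to the search domain's document endpoint. Document IDs have character restrictions, so keys may need sanitising first.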
Catalogue & Search Architecture
Entitlements
Encryption
Authentication
Authorisation
Chargeback
Quotas
Data masking
Regional restrictions
Data Lake ≢ Open Access
Identity & Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with Amazon Security Token Service (Amazon STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
IAM Policy Language
JSON documents
Can include variables which extract information from the request context
Variable | Description
aws:CurrentTime | For date/time conditions
aws:EpochTime | The date in epoch or UNIX time, for use with date/time conditions
aws:TokenIssueTime | The date/time that temporary security credentials were issued, for use with date/time conditions
aws:principaltype | A value that indicates whether the principal is an account, user, federated, or assumed role
aws:SecureTransport | Boolean representing whether the request was sent using SSL
aws:SourceIp | The requester's IP address, for use with IP address conditions
aws:UserAgent | Information about the requester's client application, for use with string conditions
aws:userid | The unique ID for the current user
aws:username | The friendly name of the current user
IAM Policy Language
Example: Allow a user to access a private part of the data lake
{
"Version": "2012-10-17",
"Statement": [
{
"Action": ["s3:ListBucket"],
"Effect": "Allow",
"Resource": ["arn:aws:s3:::mydatalake"],
"Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
},
{
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Effect": "Allow",
"Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
}
]
}
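To show what the `${aws:username}` variable does in the second statement, here is a simplified sketch that expands the resource pattern for a given user and tests an object key against it. It is an illustration only, not IAM's evaluation logic: it ignores conditions, explicit denies, and the special `${*}`, `${?}` and `${$}` escapes. Function names are assumptions.

```python
import re

# Splits a pattern into policy variables, wildcards, and literal characters.
TOKEN = re.compile(r"\$\{([^}]+)\}|(\*)|(\?)|(.)", re.DOTALL)


def to_regex(pattern, context):
    """Compile a resource pattern, substituting variables as literal text."""
    out = []
    for variable, star, question, literal in TOKEN.findall(pattern):
        if variable:
            # Variable values are escaped so a '*' in a username stays literal.
            out.append(re.escape(context[variable]))
        elif star:
            out.append(".*")
        elif question:
            out.append(".")
        else:
            out.append(re.escape(literal))
    return re.compile("".join(out), re.DOTALL)


def object_allowed(username, action, key, bucket="mydatalake"):
    """Check the object statement of the example policy for one request."""
    if action not in ("s3:GetObject", "s3:PutObject"):
        return False
    pattern = "arn:aws:s3:::" + bucket + "/${aws:username}/*"
    resource = "arn:aws:s3:::" + bucket + "/" + key
    return to_regex(pattern, {"aws:username": username}).fullmatch(resource) is not None


print(object_allowed("alice", "s3:GetObject", "alice/reports/q1.csv"))
print(object_allowed("alice", "s3:GetObject", "bob/reports/q1.csv"))
```

One policy document therefore serves every user, with each confined to the prefix that matches their own name.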
IAM Federation
IAM allows federation to Active Directory and to OpenID providers (Amazon, Facebook, Google)
AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS
[Diagram: IAM, users, AWS Directory Service with AD Connector, connected over Direct Connect or hardware VPN]
Extended user defined security
Entitlements Engine: Amazon STS Token Vending Machine
http://amzn.to/1FMPrTF
Data Encryption
AWS CloudHSM
Dedicated Tenancy SafeNet Luna SA HSM Device
Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service
Automated key rotation & auditing
Integration with other AWS services
AWS server side encryption
AWS managed key infrastructure
Entitlements – Access to Encryption Keys
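[Diagram: an IAM temporary credential from the Security Token Service gives access to the customer master key; the master key protects customer data keys, each existing as a plaintext key and a ciphertext key; MyData is encrypted and stored in S3 as an object that carries the ciphertext key alongside it (Name: MyData, Key: Ciphertext Key)]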
Secure Data Flow
[Diagram: Users, API Gateway, IAM, TVM - Elastic Beanstalk, Security Token Service, temporary credentials, Metadata Index - DynamoDB, and encrypted data in Amazon S3 at s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID}]
API & UI
Exposes the data lake to customers
Programmatically query catalogue
Expose search API
Ensures that entitlements are respected
Data Lake API & UI
Exposes the Metadata API, search, and Amazon S3 storage services to customers
Can be based on TVM/STS Temporary Access for many services, and a bespoke API for Metadata
Drive all UI operations from API?
Amazon API Gateway
Introducing Amazon API Gateway
Host multiple versions and stages of APIs
Create and distribute API keys to developers
Leverage AWS Signature Version 4 (SigV4) to authorize access to APIs
Throttle and monitor requests to protect the backend
Leverages AWS Lambda
Additional Features
Managed cache to store API responses
Reduced latency and DDoS protection through Amazon CloudFront
SDK generation for iOS, Android, and JavaScript
Swagger support
Request / response data transformation and API mocking
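Request throttling of the kind listed above is commonly built on a token bucket. The sketch below shows the idea with a steady refill rate and a burst capacity. The class and parameter names are assumptions, time is passed in explicitly to keep the example deterministic, and this is not the service's own implementation.

```python
class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` tokens per second."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.updated = now

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=2, burst=3)
print([bucket.allow(0.0) for _ in range(4)])
```

A caller that is refused would normally receive a throttling error and retry with backoff, which protects the backend from sudden spikes.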
An API Call Flow
[Diagram: mobile apps, websites, and services call over the Internet through Amazon CloudFront to API Gateway; API Gateway uses the API Gateway cache and forwards to AWS Lambda functions, endpoints on Amazon EC2, or any other publicly accessible endpoint; Amazon CloudWatch provides monitoring]
API & UI Architecture
[Diagram: Users, API Gateway, UI - Elastic Beanstalk, AWS Lambda, Metadata Index, IAM, TVM - Elastic Beanstalk]
Putting It All Together
A Data Lake Is…
• A foundation of highly durable data storage and streaming of any type of data
• A metadata index and workflow which helps us categorise and govern data stored in the data lake
• A search index and workflow which enables data discovery
• A robust set of security controls – governance through technology, not policy
• An API and user interface that expose these features to internal and external users
Storage & Streams: Amazon Kinesis, Amazon S3, Amazon Glacier
Data Catalogue & Search: AWS Lambda, Search Index, Metadata Index
Entitlements: IAM, Encrypted Data, Security Token Service, TVM - Elastic Beanstalk, KMS
API & UI: API Gateway, Users, UI - Elastic Beanstalk
Thank you!
Ian Meyers, Principal Solution Architect

More Related Content

What's hot

Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSAmazon Web Services
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSAmazon Web Services
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Amazon Web Services
 
Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Amazon Web Services
 
Module 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSModule 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSLam Le
 
Qubole on AWS - White paper
Qubole on AWS - White paper Qubole on AWS - White paper
Qubole on AWS - White paper Vasu S
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?Amazon Web Services
 
Analysing All Your Streaming Data - Level 300
Analysing All Your Streaming Data - Level 300Analysing All Your Streaming Data - Level 300
Analysing All Your Streaming Data - Level 300Amazon Web Services
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 

What's hot (20)

Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28
 
Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200Building Your Data Lake on AWS - Level 200
Building Your Data Lake on AWS - Level 200
 
AWS Data Collection & Storage
AWS Data Collection & StorageAWS Data Collection & Storage
AWS Data Collection & Storage
 
Module 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWSModule 1 - CP Datalake on AWS
Module 1 - CP Datalake on AWS
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Qubole on AWS - White paper
Qubole on AWS - White paper Qubole on AWS - White paper
Qubole on AWS - White paper
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
 
Analysing All Your Streaming Data - Level 300
Analysing All Your Streaming Data - Level 300Analysing All Your Streaming Data - Level 300
Analysing All Your Streaming Data - Level 300
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Securing Your Big Data on AWS
Securing Your Big Data on AWSSecuring Your Big Data on AWS
Securing Your Big Data on AWS
 

Viewers also liked

AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceHortonworks
 
Data Lake Architektur: Von den Anforderungen zur Technologie
Data Lake Architektur: Von den Anforderungen zur TechnologieData Lake Architektur: Von den Anforderungen zur Technologie
Data Lake Architektur: Von den Anforderungen zur TechnologieJens Albrecht
 
Validation of services, data and metadata
Validation of services, data and metadataValidation of services, data and metadata
Validation of services, data and metadataLuis Bermudez
 
Akili Data Integration using PPDM
Akili Data Integration using PPDMAkili Data Integration using PPDM
Akili Data Integration using PPDMrnaramore
 
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)Martin Traverso
 
Amazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto MeetupAmazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto Meetupstevemcpherson
 
WITSML data processing with Kafka and Spark Streaming
WITSML data processing with Kafka and Spark StreamingWITSML data processing with Kafka and Spark Streaming
WITSML data processing with Kafka and Spark StreamingDmitry Kniazev
 
Challenges in Global Standardisation | EnergySys Hydrocarbon Allocation Forum
Challenges in Global Standardisation | EnergySys Hydrocarbon Allocation ForumChallenges in Global Standardisation | EnergySys Hydrocarbon Allocation Forum
Challenges in Global Standardisation | EnergySys Hydrocarbon Allocation ForumEnergySys Limited
 
GIS Technology and E&P in Petroleum Industry Context, Applications and Impact...
GIS Technology and E&P in Petroleum Industry Context, Applications and Impact...GIS Technology and E&P in Petroleum Industry Context, Applications and Impact...
GIS Technology and E&P in Petroleum Industry Context, Applications and Impact...Carlos Gabriel Asato
 
Data Modelling is NOT just for RDBMS's
Data Modelling is NOT just for RDBMS'sData Modelling is NOT just for RDBMS's
Data Modelling is NOT just for RDBMS'sChristopher Bradley
 
Data modelling where did it all go wrong?
Data modelling where did it all go wrong?Data modelling where did it all go wrong?
Data modelling where did it all go wrong?Christopher Bradley
 
WITSML to PPDM mapping project
WITSML to PPDM mapping projectWITSML to PPDM mapping project
WITSML to PPDM mapping projectETLSolutions
 
Information is at the heart of all architecture disciplines & why Conceptual ...
Information is at the heart of all architecture disciplines & why Conceptual ...Information is at the heart of all architecture disciplines & why Conceptual ...
Information is at the heart of all architecture disciplines & why Conceptual ...Christopher Bradley
 
Incorporating ERP metadata in your data models
Incorporating ERP metadata in your data modelsIncorporating ERP metadata in your data models
Incorporating ERP metadata in your data modelsChristopher Bradley
 
Simple workflow to populate PPDM tables from well files
Simple workflow to populate PPDM tables from well filesSimple workflow to populate PPDM tables from well files
Simple workflow to populate PPDM tables from well filesAndrew Zolnai
 
The role of Data Virtualisation in your EIM strategy
The role of Data Virtualisation in your EIM strategyThe role of Data Virtualisation in your EIM strategy
The role of Data Virtualisation in your EIM strategyChristopher Bradley
 
Prodml Production Reporting | Hydrocarbon Allocation Forum | 2014 09-30
Prodml Production Reporting | Hydrocarbon Allocation Forum | 2014 09-30Prodml Production Reporting | Hydrocarbon Allocation Forum | 2014 09-30
Prodml Production Reporting | Hydrocarbon Allocation Forum | 2014 09-30EnergySys Limited
 

Viewers also liked (20)

AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Data Lake Architektur: Von den Anforderungen zur Technologie
Data Lake Architektur: Von den Anforderungen zur TechnologieData Lake Architektur: Von den Anforderungen zur Technologie
Data Lake Architektur: Von den Anforderungen zur Technologie
 
Validation of services, data and metadata
Validation of services, data and metadataValidation of services, data and metadata
Validation of services, data and metadata
 
Akili Data Integration using PPDM
Akili Data Integration using PPDMAkili Data Integration using PPDM
Akili Data Integration using PPDM
 
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
 
Amazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto MeetupAmazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto Meetup
 
WITSML data processing with Kafka and Spark Streaming
WITSML data processing with Kafka and Spark StreamingWITSML data processing with Kafka and Spark Streaming
WITSML data processing with Kafka and Spark Streaming
 
Challenges in Global Standardisation | EnergySys Hydrocarbon Allocation Forum
Challenges in Global Standardisation | EnergySys Hydrocarbon Allocation ForumChallenges in Global Standardisation | EnergySys Hydrocarbon Allocation Forum
Challenges in Global Standardisation | EnergySys Hydrocarbon Allocation Forum
 
GIS Technology and E&P in Petroleum Industry Context, Applications and Impact...
GIS Technology and E&P in Petroleum Industry Context, Applications and Impact...GIS Technology and E&P in Petroleum Industry Context, Applications and Impact...
GIS Technology and E&P in Petroleum Industry Context, Applications and Impact...
 
Data Modelling is NOT just for RDBMS's
Data Modelling is NOT just for RDBMS'sData Modelling is NOT just for RDBMS's
Data Modelling is NOT just for RDBMS's
 
Data modelling where did it all go wrong?
Data modelling where did it all go wrong?Data modelling where did it all go wrong?
Data modelling where did it all go wrong?
 
WITSML to PPDM mapping project
WITSML to PPDM mapping projectWITSML to PPDM mapping project
WITSML to PPDM mapping project
 
Information is at the heart of all architecture disciplines & why Conceptual ...
Information is at the heart of all architecture disciplines & why Conceptual ...Information is at the heart of all architecture disciplines & why Conceptual ...
Information is at the heart of all architecture disciplines & why Conceptual ...
 
Incorporating ERP metadata in your data models
Incorporating ERP metadata in your data modelsIncorporating ERP metadata in your data models
Incorporating ERP metadata in your data models
 
Simple workflow to populate PPDM tables from well files
Simple workflow to populate PPDM tables from well filesSimple workflow to populate PPDM tables from well files
Simple workflow to populate PPDM tables from well files
 
The role of Data Virtualisation in your EIM strategy
The role of Data Virtualisation in your EIM strategyThe role of Data Virtualisation in your EIM strategy
The role of Data Virtualisation in your EIM strategy
 
WITSML
WITSMLWITSML
WITSML
 
Prodml Production Reporting | Hydrocarbon Allocation Forum | 2014 09-30
Prodml Production Reporting | Hydrocarbon Allocation Forum | 2014 09-30Prodml Production Reporting | Hydrocarbon Allocation Forum | 2014 09-30
Prodml Production Reporting | Hydrocarbon Allocation Forum | 2014 09-30
 

Similar to AWS March 2016 Webinar Series Building Your Data Lake on AWS

Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Amazon Web Services
 
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAmazon Web Services Korea
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Amazon Web Services
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS

  • 1. Building Your Data Lake on AWS (replay of BDT317). Ian Meyers, Principal Solution Architect, AWS. March 2016. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 3. Benefits of the Enterprise Data Warehouse: self-documenting schema; enforced data types; ubiquitous and common security model; simple tools to access, robust ecosystem; transactionality.
  • 5. But customers have additional requirements…
  • 6. The Rise of “Big Data” (diagram: enterprise data warehouse, Amazon EMR, Amazon S3)
  • 8. Benefits of Separation of Compute & Storage: all your data, without paying for unused cores; independent cost attribution per dataset; use the right tool for a job, at the right time; increased durability without operations; common model for data, without enforcing access method.
  • 9. Comparison of a Data Lake to an Enterprise Data Warehouse (data lake | EDW)
      • Complementary to EDW (not replacement) | Data lake can be source for EDW
      • Schema on read (no predefined schemas) | Schema on write (predefined schemas)
      • Structured/semi-structured/unstructured data | Structured data only
      • Fast ingestion of new data/content | Time consuming to introduce new content
      • Data Science + Prediction/Advanced Analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
      • Data at low level of detail/granularity | Data at summary/aggregated level of detail
      • Loosely defined SLAs | Tight SLAs (production schedules)
      • Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
      (diagram: Enterprise DW, EMR, S3)
  • 10. The New Problem: EMR + S3 ≠ Enterprise data warehouse. “Which system has my data?” “How can I do machine learning against the DW?” “I built this in Hive, can we get it into the Finance reports?” “These sources are giving different results…” “But I implemented the algorithm in Anaconda…”
  • 11. Dive Into The Data Lake (diagram: EMR + S3 ≠ Enterprise data warehouse)
  • 12. Dive Into The Data Lake. Data lake (EMR, S3): ingest any data; data cleansing; data catalogue; trend analysis; machine learning. Enterprise data warehouse: structured analysis; common access tools; efficient aggregation; structured business rules. Flows between the two: load cleansed data; export computed aggregates.
  • 13. Components of a Data Lake (Storage & Streams; Catalogue & Search; Entitlements; API & UI). Data Storage: • High durability • Stores raw data from input sources • Support for any type of data • Low cost. Streaming: • Streaming ingest of feed data • Provides the ability to consume any dataset as a stream • Facilitates low latency analytics
  • 14. Components of a Data Lake. Catalogue: • Metadata lake • Used for summary statistics and data classification management. Search: • Simplified access model for data discovery
  • 15. Components of a Data Lake. Entitlements system: • Encryption • Authentication • Authorisation • Chargeback • Quotas • Data masking • Regional restrictions
  • 16. Components of a Data Lake. API & User Interface: • Exposes the data lake to customers • Programmatically query catalogue • Expose search API • Ensures that entitlements are respected
  • 17. Storage: high durability; stores raw data from input sources; support for any type of data; low cost.
  • 18. Amazon Simple Storage Service (Amazon S3): highly scalable object storage for the Internet; objects from 1 byte to 5 TB in size; designed for 99.999999999% durability, 99.99% availability; regional service, no single points of failure; server-side encryption. (diagram of AWS service categories: Storage, Compute, Database, Analytics, Networking, App Services, Deployment & Administration, on AWS Global Infrastructure)
  • 19. Storage Lifecycle Integration: S3 – Standard, S3 – Infrequent Access, Amazon Glacier
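
The tiering on slide 19 could be expressed as an S3 lifecycle configuration. The sketch below only builds the configuration document as a Python dict, in the shape that boto3's `put_bucket_lifecycle_configuration` accepts; the function name, the prefix, and the 30/90-day thresholds are illustrative assumptions, not values from the talk.

```python
def build_lifecycle_configuration(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers objects under `prefix`
    from S3 Standard to Infrequent Access and then to Glacier.

    The day thresholds are assumptions chosen for illustration.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Infrequent Access transition")
    rule_id = "tier-down-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Hypothetical prefix; the dict would be passed as LifecycleConfiguration
# to put_bucket_lifecycle_configuration on a boto3 S3 client.
config = build_lifecycle_configuration("raw/clickstream/")
```

Scoping the rule to a prefix keeps raw and curated datasets on independent retention schedules.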
  • 20. Data Storage Format • Not all data formats are created equal • Unstructured vs. semi-structured vs. structured • Store a copy of raw input • Data standardisation as a workflow following ingest • Use a format that supports your data, rather than force your data into a format • Consider how data will change over time • Apply common compression
  • 21. Consider Different Types of Data. Unstructured: • Store native file format (logs, dump files, whatever) • Compress with a streaming codec (LZO, Snappy). Semi-structured (JSON, XML files, etc.): • Consider evolution ability of the data schema (Avro) • Store the schema for the data as a file attribute (metadata/tag). Structured: • Lots of data is CSV! • Columnar storage (ORC, Parquet)
  • 22. Where to Store Data: Amazon S3 storage uses a flat keyspace; separate data storage by business unit, application, type, and time; natural data partitioning is very useful; paths should be self-documenting and intuitive; changing prefix structure in future is hard/costly.
  • 23. Resource Oriented Architecture (http://en.wikipedia.org/wiki/Resource-oriented_architecture). Metadata Services: CRUD API; Query API; Analytics API. Systems of Reference return URLs: URLs as deeplinks to applications, file exchanges via S3 (RESTful file services) or manifests for Big Data Analytics / HPC. Integration Layer: system to system via Amazon SNS/Amazon SQS; system to user via mobile push; Amazon Simple Workflow for high level system integration / orchestration. URL template: s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}
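
A minimal sketch of how the URL template on slide 23 might be filled in. The function name and the example values are hypothetical, and it assumes the date segment is an ISO `YYYY-MM-DD` date and that each path segment should be percent-encoded.

```python
from datetime import date
from urllib.parse import quote


def build_resource_url(system, application, day, resource, resource_id,
                       entitlement_group=None):
    """Fill in s3://{system}/{application}/{date}/{resource}/{resourceID},
    optionally appending the #appliedSecurity/{group} fragment."""
    segments = [application, day.isoformat(), resource, resource_id]
    # Encode each segment so that a "/" inside a value cannot add a path level.
    path = "/".join(quote(str(segment), safe="") for segment in segments)
    url = f"s3://{system}/{path}"
    if entitlement_group is not None:
        url += "#appliedSecurity/" + quote(entitlement_group, safe="")
    return url


# Hypothetical values, for illustration only.
url = build_resource_url("trading", "risk-engine", date(2016, 3, 1),
                         "positions", "42", "emea-analysts")
```

Building keys through one function keeps the prefix structure consistent, which matters because the slides note that changing it later is hard and costly.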
  • 24. Streaming: streaming ingest of feed data; provides the ability to consume any dataset as a stream; facilitates low latency analytics.
  • 25. Why Do Streams Matter? • Latency between event & action • Most BI systems target event to action latency of 1 hour • Streaming analytics would expect event to action latency < 2 seconds • Stream orientation simplifies architecture, but can increase operational complexity • Increase in complexity needs to be justified by business value of reduced latency
  • 26. Amazon Kinesis: managed service for real time big data processing; create streams to produce & consume data; elastically add and remove shards for performance; use the Amazon Kinesis Client Library to process data; integration with S3, Amazon Redshift, and DynamoDB.
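
To illustrate why adding shards changes throughput, here is a rough sketch of how a record's partition key selects a shard. Kinesis maps partition keys to 128-bit integers with MD5 and assigns each shard a hash key range; the sketch assumes the ranges are split evenly across shards, and the function name is hypothetical.

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_index_for_key(partition_key, shard_count):
    """Return the index of the shard whose hash key range contains the
    MD5 hash of `partition_key`, assuming evenly split ranges."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_KEY_SPACE // shard_count
    # The last shard absorbs any remainder left by the integer division.
    return min(hash_value // range_size, shard_count - 1)
```

Records with the same partition key always land on the same shard, so a skewed key choice can leave one shard hot while others sit idle.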
  • 28. Streaming Storage Integration (diagram). Object store (Amazon S3): read & write file data. Amazon Kinesis: read & write to streams. Analytics applications use both. Between the two: archive stream; replay history.
  • 29. Catalogue & Search: metadata lake; used for summary statistics and data classification management; simplified model for data discovery & governance.
  • 30. Building a Data Catalogue • Aggregated information about your storage & streaming layer • Storage service for metadata (ownership, data lineage) • Data abstraction layer (customer data = collection of prefixes) • Enabling data discovery • API for use by entitlements service
  • 31. Data Catalogue – Metadata Index • Stores data about your Amazon S3 storage environment • Total size & count of objects by prefix, data classification, refresh schedule, object version information • Amazon S3 events processed by Lambda function • DynamoDB metadata tables store required attributes
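The aggregation step of the metadata index can be sketched as a pure function over an S3 event notification. This is a minimal illustration, not the speaker's implementation: the function names and the choice of a two-level prefix are assumptions, and the write to the DynamoDB metadata table is left out so the example runs without AWS access.

```python
from collections import defaultdict
from urllib.parse import unquote_plus


def prefix_of(key: str, depth: int = 2) -> str:
    """Return the first `depth` path components of a key (assumed grouping)."""
    parts = key.split("/")
    if len(parts) > depth:
        return "/".join(parts[:depth])
    return "/".join(parts[:-1])


def summarise(event: dict, depth: int = 2) -> dict:
    """Count objects and bytes per (bucket, prefix) for ObjectCreated records."""
    totals = defaultdict(lambda: {"object_count": 0, "total_bytes": 0})
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # deletions would need separate handling
        bucket = record["s3"]["bucket"]["name"]
        obj = record["s3"]["object"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(obj["key"])
        entry = totals[(bucket, prefix_of(key, depth))]
        entry["object_count"] += 1
        entry["total_bytes"] += obj.get("size", 0)
    return dict(totals)


sample_event = {"Records": [
    {"eventName": "ObjectCreated:Put",
     "s3": {"bucket": {"name": "mydatalake"},
            "object": {"key": "finance/ledger/2016-03-01/part+0000.csv",
                       "size": 1024}}},
    {"eventName": "ObjectCreated:Put",
     "s3": {"bucket": {"name": "mydatalake"},
            "object": {"key": "finance/ledger/2016-03-02/b.csv",
                       "size": 2048}}},
    {"eventName": "ObjectRemoved:Delete",
     "s3": {"bucket": {"name": "mydatalake"},
            "object": {"key": "finance/ledger/2016-03-01/old.csv"}}},
]}
summary = summarise(sample_event)
print(summary)
```

In a Lambda function, the returned totals would typically be applied to the metadata table as atomic counter increments rather than overwritten, since several invocations may touch the same prefix concurrently.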
  • 33. Amazon DynamoDB (Managed NoSQL): Provisioned-throughput NoSQL database. Fast, predictable, configurable performance. Fully distributed, fault-tolerant HA architecture. Integration with Amazon EMR and Hive. (Diagram: database services Amazon RDS, Amazon DynamoDB, Amazon Redshift, and Amazon ElastiCache within the AWS service stack.)
  • 34. AWS Lambda (Serverless Compute): Fully managed event processor. Node.js or Java, with integrated AWS SDK. Natively compile and install any Node.js modules. Specify runtime RAM and timeout. Automatically scaled to support event volume. Events from Amazon S3, Amazon SNS, Amazon DynamoDB, Amazon Kinesis, and AWS Lambda. Integrated CloudWatch logging. (Diagram: Compute within the AWS service stack.)
  • 35. Data Catalogue – Search: Ingestion and pre-processing from any source (NoSQL, RDBMS, files) through a processor into the search index. Text processing (normalization): tokenization, downcasing, stemming, stopword removal, synonym addition. Indexing. Matching. Ranking and relevance: TF-IDF, plus additional criteria (rating, user behavior, freshness, etc.).
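The text-processing and ranking steps on this slide can be sketched with the standard library alone. The example below covers tokenization, downcasing, stopword removal, and a basic TF-IDF score. Stemming and synonym addition are omitted, and the stopword list, document names, and function names are illustrative assumptions. A managed search service performs far more than this.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "and", "for", "in", "to"}  # tiny sample


def normalise(text: str) -> list:
    """Downcase, tokenize on alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tf_idf_scores(query: str, documents: dict) -> dict:
    """Score each document as the sum of tf * idf over the query terms."""
    tokenised = {doc_id: normalise(text) for doc_id, text in documents.items()}
    n_docs = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for doc_id, tokens in tokenised.items():
        counts = Counter(tokens)
        score = 0.0
        for term in normalise(query):
            if counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        scores[doc_id] = score
    return scores


catalogue = {
    "sales": "Daily sales transactions for the retail business unit",
    "clicks": "Clickstream events for the retail web site",
    "ledger": "Finance ledger transactions",
}
scores = tf_idf_scores("sales transactions", catalogue)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Terms that appear in every document get an IDF of zero under this formula, which is why rarer terms such as dataset names tend to dominate the ranking.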
  • 36. Features and Benefits (Amazon CloudSearch & Elasticsearch): Easy to set up and operate: AWS Management Console, SDK, CLI. Scalable: automatic scaling on data size and traffic. Reliable: automatic recovery of instances, multi-AZ, etc. High performance: low latency and high throughput through in-memory caching. Fully managed: no capacity guessing. Rich features: faceted search, suggestions, relevance ranking, geospatial search, multi-language support, etc. Cost effective: pay as you go.
  • 37. Data Catalogue – Building Search Index: Enable the DynamoDB Update Stream for the metadata index table. An additional AWS Lambda function reads the Update Stream and extracts index fields from the S3 object. Update to the Amazon CloudSearch domain.
  • 38. Catalogue & Search Architecture
  • 40. Data Lake ≢ Open Access
  • 41. Identity & Access Management • Manage users, groups, and roles • Identity federation with OpenID • Temporary credentials with AWS Security Token Service (AWS STS) • Stored policy templates • Powerful policy language • Amazon S3 bucket policies
  • 42. IAM Policy Language: JSON documents. Can include variables which extract information from the request context:
      • aws:CurrentTime: for date/time conditions
      • aws:EpochTime: the date in epoch or UNIX time, for use with date/time conditions
      • aws:TokenIssueTime: the date/time that temporary security credentials were issued, for use with date/time conditions
      • aws:principaltype: a value that indicates whether the principal is an account, user, federated, or assumed role
      • aws:SecureTransport: Boolean representing whether the request was sent using SSL
      • aws:SourceIp: the requester's IP address, for use with IP address conditions
      • aws:UserAgent: information about the requester's client application, for use with string conditions
      • aws:userid: the unique ID for the current user
      • aws:username: the friendly name of the current user
  • 43. IAM Policy Language. Example: allow a user to access a private part of the data lake:
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Action": ["s3:ListBucket"],
            "Effect": "Allow",
            "Resource": ["arn:aws:s3:::mydatalake"],
            "Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
          },
          {
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Effect": "Allow",
            "Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
          }
        ]
      }
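The effect of the `${aws:username}` variable in the second statement can be illustrated with a small local check. This is a deliberately simplified sketch and not the IAM evaluation engine: it looks only at the object-level Allow statement, ignores explicit denies, conditions, and other policies, and uses `fnmatchcase` as a stand-in for IAM wildcard matching (which differs in details such as bracket handling). The function names are hypothetical.

```python
from fnmatch import fnmatchcase

BUCKET_ARN = "arn:aws:s3:::mydatalake"
OBJECT_RESOURCE = "arn:aws:s3:::mydatalake/${aws:username}/*"
OBJECT_ACTIONS = {"s3:GetObject", "s3:PutObject"}


def resolve(template: str, username: str) -> str:
    """Substitute the policy variable with the caller's friendly name."""
    return template.replace("${aws:username}", username)


def object_access_allowed(username: str, action: str, key: str) -> bool:
    """Approximate whether the object-level statement allows this request."""
    if action not in OBJECT_ACTIONS:
        return False  # anything not explicitly allowed is denied
    pattern = resolve(OBJECT_RESOURCE, username)
    return fnmatchcase(f"{BUCKET_ARN}/{key}", pattern)


print(object_access_allowed("alice", "s3:GetObject", "alice/reports/q1.csv"))
print(object_access_allowed("alice", "s3:GetObject", "bob/reports/q1.csv"))
```

Because the variable is resolved per request, one policy document can give every user a private prefix without a policy per user.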
  • 44. IAM Federation: IAM allows federation to Active Directory and to OpenID providers (Amazon, Facebook, Google). AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS. (Diagram: IAM users, AWS Directory Service AD Connector, connectivity over Direct Connect or hardware VPN.)
  • 46. Entitlements Engine: Amazon STS Token Vending Machine http://amzn.to/1FMPrTF
  • 47. Data Encryption: AWS CloudHSM: dedicated-tenancy SafeNet Luna SA HSM device; Common Criteria EAL4+, NIST FIPS 140-2. AWS Key Management Service: automated key rotation and auditing; integration with other AWS services. AWS server-side encryption: AWS-managed key infrastructure.
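  • 48. Entitlements – Access to Encryption Keys (diagram): A Customer Master Key protects Customer Data Keys, each of which exists as a ciphertext key and a plaintext key. Access is granted through an IAM temporary credential from the Security Token Service. The S3 object for MyData carries the ciphertext key alongside the data (… Name: MyData, Key: Ciphertext Key …).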
  • 49. Secure Data Flow (diagram): Users, API Gateway, IAM, TVM (Elastic Beanstalk), Security Token Service, temporary credentials, Metadata Index (DynamoDB), and encrypted data in Amazon S3 at s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID}
  • 50. API & UI: Exposes the data lake to customers. Programmatically query the catalogue. Expose a search API. Ensures that entitlements are respected. (Diagram: Storage & Streams, Catalogue & Search, Entitlements, API & UI.)
  • 51. Data Lake API & UI: Exposes the Metadata API, search, and Amazon S3 storage services to customers. Can be based on TVM/STS temporary access for many services, and a bespoke API for metadata. Drive all UI operations from the API?
  • 53. Introducing Amazon API Gateway: Host multiple versions and stages of APIs. Create and distribute API keys to developers. Leverage AWS SigV4 to authorize access to APIs. Throttle and monitor requests to protect the backend. Leverages AWS Lambda.
  • 54. Additional Features: Managed cache to store API responses. Reduced latency and DDoS protection through Amazon CloudFront. SDK generation for iOS, Android, and JavaScript. Swagger support. Request/response data transformation and API mocking.
  • 55. An API Call Flow (diagram): Requests from the Internet (mobile apps, websites, services) pass through Amazon CloudFront to API Gateway, which uses the API Gateway cache and Amazon CloudWatch monitoring and routes to AWS Lambda functions, endpoints on Amazon EC2, or any other publicly accessible endpoint.
  • 56. API & UI Architecture (diagram): Users, UI (Elastic Beanstalk), API Gateway, AWS Lambda, Metadata Index, IAM, TVM (Elastic Beanstalk).
  • 57. Putting It All Together
  • 58. A Data Lake Is… • A foundation of highly durable data storage and streaming of any type of data • A metadata index and workflow which helps us categorise and govern data stored in the data lake • A search index and workflow which enables data discovery • A robust set of security controls – governance through technology, not policy • An API and user interface that expose these features to internal and external users
  • 59. (Diagram) Storage & Streams: Amazon Kinesis, Amazon S3, Amazon Glacier. Data Catalogue & Search: AWS Lambda, search index, metadata index. Entitlements: IAM, encrypted data, Security Token Service, TVM (Elastic Beanstalk), KMS. API & UI: users, API Gateway, UI (Elastic Beanstalk).
  • 60. Thank you! Ian Meyers, Principal Solution Architect