Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
5. Characteristics of a data lake
• Collect anything
• Dive in anywhere
• Flexible access
• Future proof
6. Challenges with legacy data architectures
• Can’t move data across silos
• Can’t handle semi-structured or unstructured data
• Can’t deal with dynamic data and real-time processing
• Complex ETL processes
• Difficulty applying consistent data governance across all data sets
7. Legacy data architectures exist as isolated data silos
• Hadoop cluster
• SQL database
• Data warehouse appliance
8. Navigating the Data Lake…
A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes and heterogeneous types of data in a centralized repository.
9. Data Lake – Centralized data, single source of truth
Store and analyze all of your data, from all of your sources, in one centralized location.
“Why is the data distributed in many locations? Where is the single source of truth?”
10. Data Lake – Quick ingest
Quickly ingest data without needing to force it into a predefined schema.
“How can I collect data quickly from various sources and store it efficiently?”
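The schema-on-read idea behind quick ingest can be sketched in a few lines of Python: raw records land exactly as produced, and a "schema" (here just a chosen set of fields) is applied only at read time. The record shapes and field names below are illustrative, not from the talk.

```python
import json

# Raw events are ingested as-is (schema-on-read): no schema is enforced
# at write time, so records of different shapes can land together.
raw_lines = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "click", "page": "/home"}',  # extra field
    '{"sensor": 42, "temp_c": 21.5}',                       # different shape
]

def read_with_view(lines, fields):
    """Apply a 'view' (a set of fields) at read time; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Same raw data, two different views:
user_view = list(read_with_view(raw_lines, ["user", "action"]))
sensor_view = list(read_with_view(raw_lines, ["sensor", "temp_c"]))
```

The same stored bytes serve both views; nothing was transformed at ingest time.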
12. Why AWS?
Implementing a Data Lake architecture requires a broad set of tools and technologies to serve an increasingly diverse set of applications and use cases. It boils down to picking the right tool for the right job… on a consumption-based model.
13. Example of AWS Services for a Data Lake
• Data ingestion – AWS Database Migration Service, AWS Snowball*, AWS Direct Connect, Amazon Kinesis Firehose
• Central storage – Amazon S3
• Catalog – AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service
• Processing and analytics – Amazon Kinesis, Amazon EMR, Amazon RDS, Amazon Athena, Amazon QuickSight, Amazon Redshift*
• Access and secure
16. Why Amazon S3 for Data Lake?
• Durable – designed for 11 9s of durability
• Available – designed for 99.99% availability
• High performance – multipart upload, Range GET
• Scalable – store as much as you need; scale storage and compute independently; no minimum usage commitments
• Integrated – Amazon EMR, Amazon Redshift, Amazon DynamoDB
• Easy to use – simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
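The multipart upload mentioned above splits a large object into parts that can be uploaded in parallel and retried independently. A minimal sketch of the part-boundary arithmetic, assuming illustrative sizes (the 5 MiB minimum for non-final parts is a real S3 constraint):

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3's 5 MiB minimum for all parts but the last

def part_ranges(object_size, part_size):
    """Yield (start, end_inclusive) byte ranges, one per upload part."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size below S3's 5 MiB minimum")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# A 17 MiB object with 8 MiB parts splits into 8 + 8 + 1 MiB.
parts = part_ranges(object_size=17 * 1024 * 1024, part_size=8 * 1024 * 1024)
```

The same byte-range arithmetic underlies Range GET, which reads a slice of an object rather than the whole thing.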
17. Implement the right cloud security controls
Security
• AWS Identity and Access Management (IAM) policies
• Bucket policies
• Access control lists (ACLs)
• Private VPC endpoints to Amazon S3
• SSL endpoints
Encryption
• Server-side encryption (SSE-S3)
• Amazon S3 server-side encryption with provided keys (SSE-C, SSE-KMS)
• Client-side encryption
Compliance
• Bucket access logs
• Lifecycle management policies
• ACLs
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3, etc.
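As a concrete instance of the "bucket policies" control above, here is a minimal S3 bucket policy that denies any request not sent over SSL. The policy document format is standard AWS IAM JSON; the bucket name `example-data-lake` is hypothetical.

```python
import json

# Deny all S3 actions on the (hypothetical) bucket when the request
# is not made over a secure (SSL/TLS) transport.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::example-data-lake",
            "arn:aws:s3:::example-data-lake/*",
        ],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
policy_json = json.dumps(policy, indent=2)
```

A policy like this would be attached to the bucket itself, complementing IAM policies attached to users and roles.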
18. Benefits of an AWS Data Lake
• Decouple storage and compute by using Amazon S3 object storage.
• Consolidate, centrally manage, and govern your data.
• Flexibility to use any and all tools in the ecosystem – the right tool for the job.
• Future-proof your architecture: as new use cases and new tools emerge, you can plug in the current best of breed.
24. Workload types running on the same cluster
• Large-scale ETL: Apache Spark, Apache Hive with Apache Tez, or Apache Hadoop MapReduce
• Interactive queries: Apache Impala, Spark SQL, Presto, Apache Phoenix
• Machine learning and data science: Spark ML, Apache Mahout
• NoSQL: Apache HBase
• Stream processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm
• Search: Elasticsearch, Apache Solr
• Job submission: client edge node, Apache Oozie
• Data warehouses like Pivotal Greenplum or Teradata
25. Why Amazon EMR?
• Low cost – pay an hourly rate
• Open-source variety – latest versions of software
• Managed – spend less time monitoring
• Secure – easy-to-manage options
• Flexible – customize the cluster
• Easy to use – launch a cluster in minutes
26. Many storage layers to choose from
Amazon EMR works with: Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon Elasticsearch Service
27. Decouple compute and storage by using Amazon S3 as your data layer
Amazon S3 is designed for 11 9s of durability and is massively scalable. Amazon EMR clusters keep intermediates in EC2 instance memory, on local disk, or in HDFS, while the durable copy of the data lives in Amazon S3.
30. Why use Athena?
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Encrypted
• Standards-compliant and open storage formats
• Built on powerful, community-supported OSS solutions
31. Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data transfer apply
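The pricing model above makes cost easy to estimate: you pay only for bytes scanned, so columnar formats and partitioning directly cut the bill. A small sketch (the scan sizes are made-up numbers for illustration):

```python
# Athena pricing from the slide: $5 per TB of data scanned;
# DDL and query submission are free.
PRICE_PER_TB = 5.00
TB = 1024 ** 4

def query_cost(bytes_scanned):
    """Dollar cost of a single query, given the bytes it scanned."""
    return round(bytes_scanned / TB * PRICE_PER_TB, 6)

full_scan = query_cost(2 * TB)        # scanning 2 TB of raw CSV
pruned_scan = query_cost(0.25 * TB)   # same question over partitioned,
                                      # columnar data (illustrative ratio)
```

The second query answers the same question for a fraction of the cost because less data is scanned, which is the practical argument for Parquet/ORC plus partitioning.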
32. Familiar Technologies Under the Covers
Presto – used for SQL queries: an in-memory distributed query engine, ANSI-SQL compatible with extensions.
Hive – used for DDL functionality: complex data types, a multitude of formats, and support for data partitioning.
33. Hive Metadata Definition
• Hive Data Definition Language
• Data Manipulation Language (INSERT, UPDATE)
• Create Table As
• User Defined Functions
• Hive compatible SerDe (serializer/deserializer)
• CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail
34. Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries & window functions
• Complex data types (arrays, structs, maps)
• Partitioning of data by any key (date, time, custom keys)
• Presto built-in functions
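Partitioning by a key is expressed on S3 as Hive-style `key=value` path segments under the table's prefix, which is what lets the engine skip whole partitions. A minimal sketch of that layout; the bucket and table names are hypothetical:

```python
# Hive-style partition layout: each partition key becomes a key=value
# path segment under the table's S3 prefix.

def partition_prefix(base, **keys):
    """Build the S3 prefix for one partition of a table."""
    segments = [f"{k}={v}" for k, v in keys.items()]
    return "/".join([base.rstrip("/")] + segments) + "/"

prefix = partition_prefix(
    "s3://example-bucket/events",   # hypothetical table location
    year=2017, month=6, day=14,
)
```

Objects written under that prefix belong to the `year=2017, month=6, day=14` partition, and a query filtering on those keys never lists the other prefixes.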
35. Amazon Redshift – relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
• A lot faster, a lot simpler, a lot cheaper
36. Amazon Redshift works with third-party analysis tools via JDBC/ODBC, and with Amazon QuickSight (new!).
37. Amazon Redshift has security built in
SSL to secure data in transit
Encryption to secure data at rest
• AES-256; hardware accelerated
• All blocks on disks and in Amazon S3 encrypted
• HSM support
No direct access to compute nodes
Audit logging, AWS CloudTrail, AWS KMS integration
Amazon VPC support
SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
Architecture: SQL clients and BI tools connect via JDBC/ODBC to a leader node inside an internal VPC (within the customer VPC), which dispatches work to compute nodes (e.g., 16 cores, 128 GB RAM, 16 TB disk each) over 10 GigE (HPC networking); ingestion, backup, and restore run against Amazon S3/Amazon DynamoDB.
38. Redshift Spectrum
Redshift Spectrum queries data directly in Amazon S3 (exabyte-scale object storage) through a data catalog (Apache Hive Metastore), using a fleet of Spectrum nodes alongside the Redshift cluster nodes.
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations, and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against Amazon S3 data
• Efficient join processing within the Amazon Redshift cluster
39. Translate use cases to the right tools
- Low-latency SQL -> Athena or Presto or Amazon Redshift
- Data warehouse/Reporting -> Spark or Hive or Glue or Amazon Redshift
- Management and monitoring -> EMR console or Ganglia metrics
- HDFS -> Amazon S3
- Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action)
- Query console -> Athena or Hue
- Security -> Ranger (CF template) or HiveServer2 or IAM roles
EMR stack: storage (S3 via EMRFS, HDFS); YARN for cluster resource management; execution frameworks for batch (MapReduce), interactive (Tez), and in-memory (Spark); applications including Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, and Presto; with Athena, Glue, and Amazon Redshift alongside the cluster for streaming and SQL workloads.
40. Amazon Elasticsearch Service
• Real-time distributed search and analytics engine
• Managed service using Elasticsearch and Kibana
• Fully managed – zero admin
• Highly available and reliable
• Tightly integrated with other AWS services
41. Amazon Elasticsearch Service leading use cases
Log analytics & operational monitoring
• Monitor the performance of applications, web servers, and hardware
• Easy-to-use yet powerful data visualization tools to detect issues in near real time
• Dig into logs in an intuitive, fine-grained way
• Kibana provides fast, easy visualization
Traditional search
• Application or website provides search capabilities over diverse documents
• Tasked with making this knowledge base searchable and accessible
• Key search features including text matching, faceting, filtering, fuzzy search, autocomplete, and highlighting
• Query API to support application search
43. Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS.
• Amazon Kinesis Streams – for technical developers: build your own custom applications that process or analyze streaming data.
• Amazon Kinesis Firehose – for all developers and data scientists: easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch Service.
• Amazon Kinesis Analytics – for all developers and data scientists: easily analyze data streams using standard SQL queries.
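The load-into-S3 behavior of a Firehose-style delivery stream comes down to buffering: records accumulate until either a size threshold or a count threshold is hit, then the batch is flushed to the destination. A minimal sketch; the thresholds below are illustrative, not the service's actual buffer limits:

```python
def batch_records(records, max_bytes=1024, max_count=3):
    """Group records into batches, flushing when a size or count limit is hit."""
    batches, current, size = [], [], 0
    for rec in records:
        # Flush the current batch before this record would exceed a limit.
        if current and (size + len(rec) > max_bytes or len(current) >= max_count):
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:           # flush whatever remains at end of stream
        batches.append(current)
    return batches

# 7 records of 100 bytes each, at most 3 per batch -> batches of 3, 3, 1.
batches = batch_records([b"a" * 100 for _ in range(7)])
```

The real service adds a time-based flush as well (deliver after N seconds even if the buffer is not full), which the same loop structure accommodates.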
46. Case study: Re-architecting compliance
“For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” – Steve Randich, CIO, FINRA
What FINRA needed
• Infrastructure for its market surveillance platform
• Support for analysis and storage of approximately 75 billion market events every day
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10–20M annually by using AWS
47. Fraud detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10–20M per year.
48. Building a Data Lake
An architectural approach that allows you to store massive amounts of “raw” data in a central location, where it is readily available to be categorized, processed, analyzed, and consumed by diverse groups.
• Catalog & search – access and search metadata: DynamoDB, Elasticsearch
• Access & user interface – give your users easy and secure access: API Gateway, Identity & Access Management, Cognito
• Processing & analytics – use predictive and prescriptive analytics to gain better understanding: QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS
• Central storage – secure, cost-effective storage in Amazon S3: S3
• Data ingestion – get your data into S3 quickly and securely: Snowball, Database Migration Service, Kinesis Firehose, Direct Connect
• Protect & secure – use entitlements to ensure data is secure and users’ identities are verified: Security Token Service, CloudWatch, CloudTrail, Key Management Service