Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
5. Characteristics of a data lake
• Collect anything
• Dive in anywhere
• Flexible access
• Future proof
6. Challenges with legacy data architectures
• Can’t move data across silos
• Can’t handle semi-structured or unstructured data
• Can’t deal with dynamic data and real-time processing
• Complex ETL processes
• Difficulty applying consistent data governance across all data sets
7. Legacy data architectures exist as isolated data silos
• Hadoop cluster
• SQL database
• Data warehouse appliance
8. Navigating the Data Lake…
A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes and heterogeneous types of data in a centralized repository.
9. Data Lake – Centralized data, single source of truth
Store and analyze all of your data, from all of your sources, in one centralized location.
“Why is the data distributed in many locations? Where is the single source of truth?”
10. Data Lake – Quick ingest
Quickly ingest data without needing to force it into a predefined schema.
“How can I collect data quickly from various sources and store it efficiently?”
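The schema-on-read idea behind quick ingest can be sketched in a few lines of Python: raw records land exactly as produced, and a "schema" (here just a chosen set of fields) is applied only at read time. The record shapes and field names below are illustrative, not from the talk.

```python
import json

# Raw events are ingested as-is (schema-on-read): no schema is enforced
# at write time, so records of different shapes can land together.
raw_lines = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "click", "page": "/home"}',  # extra field
    '{"sensor": 42, "temp_c": 21.5}',                       # different shape
]

def read_with_view(lines, fields):
    """Apply a 'view' (a set of fields) at read time; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Same raw data, two different views:
user_view = list(read_with_view(raw_lines, ["user", "action"]))
sensor_view = list(read_with_view(raw_lines, ["sensor", "temp_c"]))
```

The same stored bytes serve both views; nothing was transformed at ingest time.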
12. Why AWS?
Implementing a Data Lake architecture requires a broad set of tools and technologies to serve an increasingly diverse set of applications and use cases. It boils down to picking the right tool for the right job… on a consumption-based model.
13. Example of AWS Services for a Data Lake
• Data ingestion – AWS Database Migration Service, AWS Snowball*, AWS Direct Connect, Amazon Kinesis Firehose
• Central storage – Amazon S3
• Catalog – AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service
• Processing and analytics – Amazon Kinesis, Amazon EMR, Amazon RDS, Amazon Athena, Amazon QuickSight, Amazon Redshift*
• Access and secure
16. Why Amazon S3 for Data Lake?
• Durable – designed for 11 9s of durability
• Available – designed for 99.99% availability
• High performance – multipart upload, Range GET
• Scalable – store as much as you need; scale storage and compute independently; no minimum usage commitments
• Integrated – Amazon EMR, Amazon Redshift, Amazon DynamoDB
• Easy to use – simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
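The multipart upload mentioned above splits a large object into parts that can be uploaded in parallel and retried independently. A minimal sketch of the part-boundary arithmetic, assuming illustrative sizes (the 5 MiB minimum for non-final parts is a real S3 constraint):

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3's 5 MiB minimum for all parts but the last

def part_ranges(object_size, part_size):
    """Yield (start, end_inclusive) byte ranges, one per upload part."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size below S3's 5 MiB minimum")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# A 17 MiB object with 8 MiB parts splits into 8 + 8 + 1 MiB.
parts = part_ranges(object_size=17 * 1024 * 1024, part_size=8 * 1024 * 1024)
```

The same byte-range arithmetic underlies Range GET, which reads a slice of an object rather than the whole thing.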
17. Implement the right cloud security controls
Security
• AWS Identity and Access Management (IAM) policies
• Bucket policies
• Access control lists (ACLs)
• Private VPC endpoints to Amazon S3
• SSL endpoints
Encryption
• Server-side encryption (SSE-S3)
• Amazon S3 server-side encryption with provided keys (SSE-C, SSE-KMS)
• Client-side encryption
Compliance
• Bucket access logs
• Lifecycle management policies
• ACLs
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3, etc.
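As a concrete instance of the "bucket policies" control above, here is a minimal S3 bucket policy that denies any request not sent over SSL. The policy document format is standard AWS IAM JSON; the bucket name `example-data-lake` is hypothetical.

```python
import json

# Deny all S3 actions on the (hypothetical) bucket when the request
# is not made over a secure (SSL/TLS) transport.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::example-data-lake",
            "arn:aws:s3:::example-data-lake/*",
        ],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
policy_json = json.dumps(policy, indent=2)
```

A policy like this would be attached to the bucket itself, complementing IAM policies attached to users and roles.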
18. Benefits of an AWS Data Lake
• Decouple storage and compute by using Amazon S3 object storage.
• Consolidate, centrally manage, and govern your data.
• Flexibility to use any and all tools in the ecosystem – the right tool for the job.
• Future-proof your architecture: as new use cases and new tools emerge, you can plug in the current best of breed.
24. Workload types running on the same cluster
• Large-scale ETL: Apache Spark, Apache Hive with Apache Tez, or Apache Hadoop MapReduce
• Interactive queries: Apache Impala, Spark SQL, Presto, Apache Phoenix
• Machine learning and data science: Spark ML, Apache Mahout
• NoSQL: Apache HBase
• Stream processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm
• Search: Elasticsearch, Apache Solr
• Job submission: client edge node, Apache Oozie
• Data warehouses like Pivotal Greenplum or Teradata
25. Why Amazon EMR?
• Low cost – pay an hourly rate
• Open-source variety – latest versions of software
• Managed – spend less time monitoring
• Secure – easy-to-manage options
• Flexible – customize the cluster
• Easy to use – launch a cluster in minutes
26. Many storage layers to choose from
Amazon EMR works with: Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon Elasticsearch Service
27. Decouple compute and storage by using Amazon S3 as your data layer
Amazon S3 is designed for 11 9s of durability and is massively scalable. Amazon EMR clusters keep intermediates in EC2 instance memory, on local disk, or in HDFS, while the durable copy of the data lives in Amazon S3.
30. Why use Athena?
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Encrypted
• Standards-compliant and open storage formats
• Built on powerful, community-supported OSS solutions
31. Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data transfer apply
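The pricing model above makes cost easy to estimate: you pay only for bytes scanned, so columnar formats and partitioning directly cut the bill. A small sketch (the scan sizes are made-up numbers for illustration):

```python
# Athena pricing from the slide: $5 per TB of data scanned;
# DDL and query submission are free.
PRICE_PER_TB = 5.00
TB = 1024 ** 4

def query_cost(bytes_scanned):
    """Dollar cost of a single query, given the bytes it scanned."""
    return round(bytes_scanned / TB * PRICE_PER_TB, 6)

full_scan = query_cost(2 * TB)        # scanning 2 TB of raw CSV
pruned_scan = query_cost(0.25 * TB)   # same question over partitioned,
                                      # columnar data (illustrative ratio)
```

The second query answers the same question for a fraction of the cost because less data is scanned, which is the practical argument for Parquet/ORC plus partitioning.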
32. Familiar Technologies Under the Covers
Presto – used for SQL queries: an in-memory distributed query engine, ANSI-SQL compatible with extensions.
Hive – used for DDL functionality: complex data types, a multitude of formats, and support for data partitioning.
33. Hive Metadata Definition
• Hive Data Definition Language
• Data Manipulation Language (INSERT, UPDATE)
• Create Table As
• User Defined Functions
• Hive compatible SerDe (serializer/deserializer)
• CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail
34. Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries & window functions
• Complex data types (arrays, structs, maps)
• Partitioning of data by any key (date, time, custom keys)
• Presto built-in functions
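Partitioning by a key is expressed on S3 as Hive-style `key=value` path segments under the table's prefix, which is what lets the engine skip whole partitions. A minimal sketch of that layout; the bucket and table names are hypothetical:

```python
# Hive-style partition layout: each partition key becomes a key=value
# path segment under the table's S3 prefix.

def partition_prefix(base, **keys):
    """Build the S3 prefix for one partition of a table."""
    segments = [f"{k}={v}" for k, v in keys.items()]
    return "/".join([base.rstrip("/")] + segments) + "/"

prefix = partition_prefix(
    "s3://example-bucket/events",   # hypothetical table location
    year=2017, month=6, day=14,
)
```

Objects written under that prefix belong to the `year=2017, month=6, day=14` partition, and a query filtering on those keys never lists the other prefixes.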
35. Amazon Redshift – relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
• A lot faster, a lot simpler, a lot cheaper
36. Amazon Redshift works with third-party analysis tools via JDBC/ODBC, and with Amazon QuickSight (new!).
37. Amazon Redshift has security built in
SSL to secure data in transit
Encryption to secure data at rest
• AES-256; hardware accelerated
• All blocks on disks and in Amazon S3 encrypted
• HSM support
No direct access to compute nodes
Audit logging, AWS CloudTrail, AWS KMS integration
Amazon VPC support
SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
Architecture: SQL clients and BI tools connect via JDBC/ODBC to a leader node inside an internal VPC (within the customer VPC), which dispatches work to compute nodes (e.g., 16 cores, 128 GB RAM, 16 TB disk each) over 10 GigE (HPC networking); ingestion, backup, and restore run against Amazon S3/Amazon DynamoDB.
38. Redshift Spectrum
Redshift Spectrum queries data directly in Amazon S3 (exabyte-scale object storage) through a data catalog (Apache Hive Metastore), using a fleet of Spectrum nodes alongside the Redshift cluster nodes.
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations, and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against Amazon S3 data
• Efficient join processing within the Amazon Redshift cluster
39. Translate use cases to the right tools
- Low-latency SQL -> Athena or Presto or Amazon Redshift
- Data warehouse/Reporting -> Spark or Hive or Glue or Amazon Redshift
- Management and monitoring -> EMR console or Ganglia metrics
- HDFS -> Amazon S3
- Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action)
- Query console -> Athena or Hue
- Security -> Ranger (CF template) or HiveServer2 or IAM roles
EMR stack: storage (S3 via EMRFS, HDFS); YARN for cluster resource management; execution frameworks for batch (MapReduce), interactive (Tez), and in-memory (Spark); applications including Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, and Presto; with Athena, Glue, and Amazon Redshift alongside the cluster for streaming and SQL workloads.
40. Amazon Elasticsearch Service
• Real-time distributed search and analytics engine
• Managed service using Elasticsearch and Kibana
• Fully managed – zero admin
• Highly available and reliable
• Tightly integrated with other AWS services
41. Amazon Elasticsearch Service leading use cases
Log analytics & operational monitoring
• Monitor the performance of applications, web servers, and hardware
• Easy-to-use yet powerful data visualization tools to detect issues in near real time
• Dig into logs in an intuitive, fine-grained way
• Kibana provides fast, easy visualization
Traditional search
• Application or website provides search capabilities over diverse documents
• Tasked with making this knowledge base searchable and accessible
• Key search features including text matching, faceting, filtering, fuzzy search, autocomplete, and highlighting
• Query API to support application search
43. Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS.
• Amazon Kinesis Streams – for technical developers: build your own custom applications that process or analyze streaming data.
• Amazon Kinesis Firehose – for all developers and data scientists: easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch Service.
• Amazon Kinesis Analytics – for all developers and data scientists: easily analyze data streams using standard SQL queries.
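The load-into-S3 behavior of a Firehose-style delivery stream comes down to buffering: records accumulate until either a size threshold or a count threshold is hit, then the batch is flushed to the destination. A minimal sketch; the thresholds below are illustrative, not the service's actual buffer limits:

```python
def batch_records(records, max_bytes=1024, max_count=3):
    """Group records into batches, flushing when a size or count limit is hit."""
    batches, current, size = [], [], 0
    for rec in records:
        # Flush the current batch before this record would exceed a limit.
        if current and (size + len(rec) > max_bytes or len(current) >= max_count):
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:           # flush whatever remains at end of stream
        batches.append(current)
    return batches

# 7 records of 100 bytes each, at most 3 per batch -> batches of 3, 3, 1.
batches = batch_records([b"a" * 100 for _ in range(7)])
```

The real service adds a time-based flush as well (deliver after N seconds even if the buffer is not full), which the same loop structure accommodates.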
46. Case study: Re-architecting compliance
“For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” – Steve Randich, CIO, FINRA
What FINRA needed
• Infrastructure for its market surveillance platform
• Support for analysis and storage of approximately 75 billion market events every day
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10–20M annually by using AWS
47. Fraud detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10–20M per year.
48. Building a Data Lake
An architectural approach that allows you to store massive amounts of “raw” data in a central location, where it is readily available to be categorized, processed, analyzed, and consumed by diverse groups.
• Catalog & search – access and search metadata: DynamoDB, Elasticsearch
• Access & user interface – give your users easy and secure access: API Gateway, Identity & Access Management, Cognito
• Processing & analytics – use predictive and prescriptive analytics to gain better understanding: QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS
• Central storage – secure, cost-effective storage in Amazon S3: S3
• Data ingestion – get your data into S3 quickly and securely: Snowball, Database Migration Service, Kinesis Firehose, Direct Connect
• Protect & secure – use entitlements to ensure data is secure and users’ identities are verified: Security Token Service, CloudWatch, CloudTrail, Key Management Service