4. Challenges with Legacy Data Architectures
• Can’t move data across silos
• Can’t afford to keep all of the data
• Can’t scale with dynamic data and real-time processing
• Can’t scale management of the data
• Can’t find the people who know how to configure and manage complex infrastructure
• Can’t afford the investments to keep refreshing infrastructure and data centers
5. Enter Data Lake Architectures
A data lake is an increasingly popular architecture for storing and analyzing massive volumes of heterogeneous data.
Benefits of a Data Lake
• All Data in One Place
• Quick Ingest
• Storage vs. Compute
• Schema on Read
6. 1&2: Consolidate (Data) & Separate (Storage & Compute)
• Use S3 as the data lake storage tier, not a single analytics tool like Hadoop or a data warehouse (a minimal sketch follows this list)
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allows you to evolve to clusterless architectures (e.g. Lambda, Athena & Glue)
• Do not build data silos in Hadoop or the EDW
• Gain the flexibility to use all the analytics tools in the ecosystem around S3 and future-proof the architecture
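As a rough illustration of "S3 as the data lake storage tier," here is a minimal boto3 sketch that creates the bucket that acts as the lake's storage layer. The bucket name and region are placeholders, not part of the original deck.

```python
# Minimal sketch: create an S3 bucket to serve as the data lake storage tier.
# Bucket name and region are hypothetical -- adjust for your account.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
s3.create_bucket(
    Bucket="example-data-lake-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
```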
7. Why Choose Amazon S3 for a Data Lake?
Durable
• Designed for 11 9s of durability
Secure
• Multiple encryption options
• Robust, highly flexible access controls
High performance
• Multipart upload and Range GET (a boto3 sketch follows)
• Scalable throughput
Scalable
• Store as much as you need
• Scale storage and compute independently
• Scale without limits
Affordable
Integrated
• Amazon EMR, Amazon Redshift, Amazon DynamoDB, Amazon Athena, Amazon Rekognition, Amazon Glue
Easy to use
• Simple REST API and AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
• Simple management tools
• Hadoop compatibility
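To make the Range GET and multipart-upload points concrete, here is a minimal boto3 sketch; the bucket, key, and file names are placeholders.

```python
# Minimal sketch: read only the first kilobyte of a large object with a Range GET,
# then upload a large file using boto3's managed (multipart) transfer.
import boto3

s3 = boto3.client("s3")

# Range GET: fetch bytes 0-1023 without downloading the whole object
resp = s3.get_object(
    Bucket="example-data-lake-bucket",            # hypothetical bucket
    Key="raw/weblogs/2017/04/01/events.json",     # hypothetical key
    Range="bytes=0-1023",
)
first_kb = resp["Body"].read()

# Managed transfer: boto3 splits large uploads into multipart parts automatically
s3.upload_file("big-file.csv", "example-data-lake-bucket", "raw/big-file.csv")
```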
8. Case Study: Re-architecting Compliance
“For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.”
- Steve Randich, CIO, FINRA
What FINRA needed
• Infrastructure for its market surveillance platform
• Support for analysis and storage of approximately 75 billion market events every day
• Storage of 5 PB of historical data for analysis & training
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20 million annually by using AWS
9. 3: Implement the Right Security Controls
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Private VPC endpoints to Amazon S3
• SSL endpoints
Encryption
• Server-side encryption (SSE-S3)
• Server-side encryption with provided keys (SSE-C, SSE-KMS)
• Client-side encryption
Compliance
• Bucket access logs
• Lifecycle management policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3, etc.
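A minimal boto3 sketch of two of these controls – default SSE-KMS encryption and an SSL-only bucket policy. The bucket name and KMS key alias are placeholders.

```python
# Minimal sketch: default server-side encryption (SSE-KMS) plus a bucket policy
# that denies any request not made over SSL. Names are hypothetical.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-bucket"

# Default SSE-KMS encryption for all new objects in the bucket
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/example-data-lake-key",
            }
        }]
    },
)

# Require HTTPS for every request against the bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```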
10. 4: Choose the Right Ingestion Methods
AWS Snowball & Snowmobile
• Accelerate PB-scale transfers with AWS-provided appliances
• 50, 80, and 100 TB Snowball models
• 100 PB Snowmobile
AWS Storage Gateway
• Instant hybrid cloud
• Up to 120 MB/s cloud upload rate (a 4x improvement)
Amazon Kinesis Firehose
• Ingest device streams directly into AWS data stores
AWS Direct Connect
• Dedicated connection from your colocation facility to AWS
• Use native copy tools
Native/ISV Connectors
• Sqoop, Flume, DistCp
• Commvault, Veritas, etc.
Amazon S3 Transfer Acceleration
• Move data up to 300% faster using AWS’s private network
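For the Kinesis Firehose path, a minimal boto3 sketch of pushing a record into a delivery stream configured to land in S3; the stream name and event fields are placeholders.

```python
# Minimal sketch: send one record to a Kinesis Firehose delivery stream that
# delivers to S3. Stream name and event payload are hypothetical.
import json
import boto3

firehose = boto3.client("firehose")

event = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2017-04-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="example-ingest-stream",
    Record={"Data": json.dumps(event) + "\n"},   # newline-delimited JSON in S3
)
```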
11. 5: Catalog Your Data
What is in the data lake? A catalog documents the data lake: metadata, summary statistics, and classification for each dataset.
• Put data from your data sources into S3
• Store metadata about each object in Amazon DynamoDB
• Index it with Amazon Elasticsearch Service for search capabilities
• Amazon Glue (coming mid-year) will automate the catalog
https://aws.amazon.com/answers/big-data/data-lake-solution/
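A minimal sketch of the catalog pattern described above – recording an object's metadata in DynamoDB as it lands in S3. The table name and attributes are placeholders, not the actual schema of the AWS data lake solution.

```python
# Minimal sketch: when an object lands in S3, record its metadata in DynamoDB
# so the lake is searchable. Table and attribute names are hypothetical.
import boto3

dynamodb = boto3.resource("dynamodb")
catalog = dynamodb.Table("example-data-lake-catalog")

catalog.put_item(Item={
    "s3_key": "raw/clickstream/2017/04/01/part-0001.parquet",
    "bucket": "example-data-lake-bucket",
    "format": "parquet",
    "classification": "clickstream",
    "row_count_estimate": 1200000,
})
```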
12. Amazon Glue – in Preview
Glue automates the undifferentiated heavy lifting of ETL:
• Cataloging data sources
• Identifying data formats and data types
• Generating Extract, Transform, Load (ETL) code
• Executing ETL jobs and managing dependencies
• Handling errors
• Managing and scaling resources
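A minimal boto3 sketch of pointing a Glue crawler at an S3 prefix so it populates the catalog (Glue was still in preview when this deck was written); the crawler name, IAM role, and database are placeholders.

```python
# Minimal sketch: create and start a Glue crawler that catalogs an S3 prefix
# into the Glue Data Catalog. All names and the role ARN are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="example-datalake-crawler",
    Role="arn:aws:iam::123456789012:role/example-glue-role",
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake-bucket/raw/clickstream/"}]},
)
glue.start_crawler(Name="example-datalake-crawler")
```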
13. 6: Keep More Data
• S3 Standard – active data, millisecond access, $0.021/GB/mo
• S3 Standard - Infrequent Access – infrequently accessed data, millisecond access, $0.0125/GB/mo
• Amazon Glacier – archive data, minutes-to-hours access, $0.004/GB/mo
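Lifecycle policies can tier data between these storage classes automatically. A minimal boto3 sketch of a rule that moves data to Standard-IA after 30 days and to Glacier after 90 days; the bucket, prefix, and transition ages are placeholders.

```python
# Minimal sketch: lifecycle rule that tiers objects under a prefix from
# S3 Standard to Standard-IA (30 days) and then to Glacier (90 days).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake-bucket",            # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```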
14. 7: Use Athena for Ad Hoc Data Exploration
Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.
15. Athena is Serverless
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
16. Query Data Directly from Amazon S3
• No loading of data; query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No ETL required
• Data is streamed directly from Amazon S3; it is not copied
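A minimal boto3 sketch of submitting an ad hoc Athena query against data in S3; the database, table, and results bucket are placeholders.

```python
# Minimal sketch: submit an ad hoc SQL query to Athena against data in S3.
# Database, table, and results location are hypothetical.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString=(
        "SELECT status, COUNT(*) AS requests "
        "FROM weblogs GROUP BY status ORDER BY requests DESC"
    ),
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
# Poll get_query_execution / get_query_results with this ID to fetch results
print(resp["QueryExecutionId"])
```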
17. 8: Use the Right Data Formats
• Pay by the amount of data scanned per query
• Use compressed columnar formats
• Parquet
• ORC
• Easy to integrate with a wide variety of tools

Dataset                                | Size on Amazon S3     | Query run time | Data scanned          | Cost
Logs stored as text files              | 1 TB                  | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format*  | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
Savings                                | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper
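Per the speaker notes, a few lines of PySpark on Amazon EMR are enough to convert raw files to Parquet. A minimal sketch, assuming hypothetical S3 paths and year/month partition columns in the source data:

```python
# Minimal PySpark sketch (run on Amazon EMR): convert raw JSON logs in S3 to
# Parquet, partitioned by hypothetical year/month columns, for cheaper and
# faster Athena queries. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logs-to-parquet").getOrCreate()

logs = spark.read.json("s3://example-data-lake-bucket/raw/weblogs/")
(logs.write
     .mode("overwrite")
     .partitionBy("year", "month")
     .parquet("s3://example-data-lake-bucket/curated/weblogs/"))
```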
18. 9: Choose the Right Tools
• Amazon Redshift – Enterprise data warehouse
• Amazon EMR – Hadoop/Spark
• Amazon Athena – Clusterless SQL
• Amazon Glue – Clusterless ETL
• Amazon Aurora – Managed relational database
• Amazon Machine Learning – Predictive analytics
• Amazon QuickSight – Business intelligence/visualization
• Amazon Elasticsearch Service – Elasticsearch
• Amazon ElastiCache – Redis in-memory datastore
• Amazon DynamoDB – Managed NoSQL database
• Amazon Rekognition & Amazon Polly – Image recognition & text-to-speech AI APIs
• Amazon Lex – Voice or text chatbots
19. A Sample Data Lake Pipeline
Ad hoc access to data using Athena. Athena can query aggregated datasets as well.
20. AWS Data Lake Analytic Capabilities (architecture diagram)
• Data sources and transactional data are ingested via Amazon Kinesis Streams & Firehose into the Amazon S3 data lake
• Hadoop / Spark: Amazon EMR
• Streaming analytics tools: Spark Streaming on EMR, Apache Storm on EMR, Apache Flink on EMR, Amazon Kinesis Analytics, AWS Lambda
• Data warehouse: Amazon Redshift
• Clusterless SQL query: Amazon Athena
• Clusterless ETL: Amazon Glue
• Predictive analytics: Amazon Machine Learning
• Serving tier: Amazon DynamoDB (NoSQL database), Amazon Aurora (relational database), Amazon ElastiCache (Redis), Amazon Elasticsearch Service
• Data science sandbox, visualization / reporting, and any open source tool of choice on EC2
21. 10: Evolve as Needed
• Use S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allows you to evolve to clusterless architectures like Athena
• Do not build data silos in Hadoop or the enterprise data warehouse
• Gain the flexibility to use all the analytics tools in the ecosystem around S3 and future-proof the architecture
Editor's Notes
As content quality improves and the need to support multiple ways of viewing it proliferates, we are facing the challenge of content gravity.
While it’s relatively easy to process the media, it’s becoming exceedingly difficult to move it around and store it. For example, moving from HD to 4K and eventually 8K content may result in an increase in storage footprint on the order of 10x or more.
Storage is not the only challenge here: as the content weighs more, it’s more difficult to quickly and cost-effectively transfer it to affiliates and partners in the supply chain. The conclusion is that you should strive to keep the data as close as possible to sufficient processing resources.
The native features of S3 are exactly what you want from a Data Lake
Replication across AZ’s for high availability and durability
Massively parallel and scalable
Storage scales independent of compute
Low storage cost at < $0.025/GB
This is nearly impossible to achieve with a fixed database cluster
SUGGESTED TALKING POINTS:
The Financial Industry Regulatory Authority (FINRA), one of the largest independent securities regulators in the U.S., was established to help watch and regulate financial trading practices.
To respond to rapidly changing market dynamics, FINRA moved its market surveillance platform to AWS to analyze and store approximately 75 billion market events every day. FINRA selected AWS because it offered the right services while fulfilling the company’s security requirements. By using dynamic clusters (Hadoop, Hive, and HBase), and services such as Amazon EMR and Amazon S3, FINRA was able to create a flexible platform that could adapt to changing market dynamics.
By using AWS, FINRA has been able to increase agility, speed and cost savings while allowing them to operate at scale. The company estimates it will save $10 to $20 million annually by using AWS.
AWS has a broad set of capabilities that make security easy
With all your data in S3 you have a variety of encryption options
Client Side
Server Side
Encryption with KMS Keys
You can extend encryption to a 3rd party provider
We integrate with HSM as well
IAM offers the ability to create users and roles for those users which can restrict access to only those capabilities you allow
You can set S3 bucket policies for IAM users
S3 has a private VPC endpoint so you don’t need to exit your VPC via a NAT gateway
And you have native features such as setting Lifecycle policies for your S3 data as well as bucket access logs.
EBS: Raised max throughput to 320 MB/sec (PIOPS) and 160 MB/sec (GP2), plus larger & faster ssd volumes (raised max vol size from 1 TB to 16 TB)
Snowball: Physical storage device by AWS to accelerate PB-scale data transfer with AWS-provided appliances
Kinesis Firehose: Ingest data streams directly into AWS data stores (S3 and Redshift). You can use Amazon Kinesis to ingest data from hundreds of thousands of sensors processing hundreds of terabytes of data per hour.
Zero administration: Capture and deliver streaming data into S3, Redshift, and other destinations without writing any applications or managing infrastructure.
Direct-to-Data Store Integration Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 secs using simple configurations.
Seamless Elasticity: Seamlessly scales to match data throughput without intervention.
Show me all my customer data
Search is important – how to discover what is there, where it is, etc.
(Glue will replace later)
Is this one step too far?
(Benefits-of-an-AWS-data-lake slide: frame data governance in terms of how to index, catalog, and manage your data rather than the nuts and bolts of a data catalog.)
Use the topic of data governance itself.
Elasticsearch is also used for querying the data lake itself – load processed data into Elasticsearch (integrated with the Hadoop workflow in a data lake?). Ask Bob Taylor about integrating an index/search element into Hadoop.
Across the board, we provide 3 storage options with 3 different performance characteristics and price points. On the left, we have S3 Standard which is our high performance object storage for the internet, designed for very active, hot workloads. Data in S3 Standard is available in milliseconds and costs $0.03/GB/month (starting at). On the right hand side, we have Glacier, our cold storage service designed for long term archival and infrequently accessed data. Data in Glacier has a 3-5 hour access latency and Glacier costs $0.007/GB/month (starting at). Between the hot and cold options, we have a “warm” option – S3 infrequent access designed for data you plan to access maybe a few times a year or what we think of as “active archive”. S3-IA costs $0.0125/GB/mo (starting at). From an archiving perspective, customers typically use S3IA and Glacier together.
Just a quick note terminology – S3 stores data in buckets and each piece of data is an object; Glacier stores data in vaults (equivalent of S3 buckets) and each piece of data is called an archive (similar to object). You will hear me use bucket/vault/object/archive later on.
You simply put your data in S3 and submit SQL against it.
For a datalake, Athena won’t be the only application reading the data. ORC and Parquet were chosen because they are open source and are available for use with other analytics tools.
You can use a few lines of Pyspark code, running on Amazon EMR, to convert your files to Parquet for the best performance and cost
When you create a table for Athena, you are essentially just creating metadata and, as you run queries, the schema is applied to the data.
Data is streamed to Athena from S3; it is not copied and there is no ETL. This makes Athena ideal for customers using S3 as a data lake.
extraction, transformation, and load
No loading of data required. Query data where it lives.