SlideShare a Scribd company logo
1 of 49
Pop-up Loft
Using Data Lakes
John Mallory
johmallo@amazon.com
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
E x p o n e n t i a l g r o w t h i n d a t a
Reasons for building a data lake
Transactions
ERP
Sensor Data
Billing
Web logs
Social
Infrastructure logs
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
E x p o n e n t i a l g r o w t h i n d a t a
Reasons for building a data lake
Transactions
ERP
Sensor Data
Billing
Web logs
Social
Infrastructure logs
D i v e r s i f i e d c o n s u m e r s
Data Scientists
Business Analyst External Consumers
Applications
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
E x p o n e n t i a l g r o w t h i n d a t a
Reasons for building a data lake
Transactions
ERP
Sensor Data
Billing
Web logs
Social
Infrastructure logs
D i v e r s i f i e d c o n s u m e r s
Data Scientists
Business Analyst External Consumers
Applications
M u l t i p l e a c c e s s m e c h a n i s m s
API Access
BI Tools
Notebooks
Characteristics of a data lake
Future
Proof
Flexible
Access
Dive in
Anywhere
Collect
Anything
Challenges with legacy data architectures
• Can’t move data across silos
• Can’t handle semi-structured or unstructured data
• Can’t deal with dynamic data and real-time processing
• Complex ETL processes
• Difficulty applying consistent data governance across all
data sets
Legacy data architectures exist as isolated data silos
Hadoop
cluster
SQL
database
Data
warehouse
appliance
Navigating the Data Lake…
Data Lake is a new and increasingly
popular architecture to store and analyze
massive volumes and heterogenous
types of data in a centralized repository
Data Lake – Centralized data, single source of truth
Store and analyze all of your data,
from all of your sources, in one
centralized location
“Why is the data distributed in
many locations? Where is the
single source of truth?”
Data Lake – Quick ingest
Quickly ingest data
without needing to force it into a
predefined schema
“How can I collect data quickly
from various sources and store
it efficiently?”
Building a Data Lake on AWS
Why AWS?
Implementing a Data Lake architecture requires a broad
set of tools and technologies to serve an increasingly
diverse set of applications and use cases.
Boils down to:
Picking the right tool for the right job…on a consumption-
based model…
Central Storage
AWS Database
Migration Service
AWS Snowball*
AWS Direct
Connect
Amazon Kinesis Firehose
Data
Ingestion Catalog
Amazon
DynamoDB
Amazon
Elasticsearch Service
Amazon
Kinesis
Amazon
EMR
Amazon
RDS
Amazon
Athena
Amazon
QuickSight
Amazon
Redshift*
Processing
and Analytics
S3
Access and Secure
Example of AWS Services for Data Lake
Amazon
Elasticsearch Service
AWS Glue
Catalog' &'Search
Access%and%search%metadata
Access'&'User'Interface
Give%your%users%easy%and%secure%access
DynamoDB Elasticsearch API'Gateway Identity'&'Access'
Management
Cognito
QuickSight Amazon' AI EMR Redshift
Athena Kinesis RDS
Central'Storage
Secure,%cost5effective
Storage%in%Amazon%S3
S3
Snowball Database' Migration'
Service
Kinesis' Firehose Direct'Connect
Data'Ingestion
Get%your%data%into%S3
Quickly%and%securely
Protect'and'Secure
Use%entitlements% to%ensure%data% is%secure%and% users’% identities% are% verified
Processing' &'Analytics
Use%of%predictive%and%prescriptive%
analytics%to%gain%better%understanding
Security'Token'
Service
CloudWatch CloudTrail Key'Management'
Service
Data Lake reference architecture
Catalog' &'Search
Access%and%search%metadata
Access'&'User'Interface
Give%your%users%easy%and%secure%access
DynamoDB Elasticsearch API'Gateway Identity'&'Access'
Management
Cognito
QuickSight
Central'Storage
Secure,%cost5effective
Storage%in%Amazon%S3
S3 – Center of the Data Lake
Designed for 11 9s
of durability
Designed for
99.99% availability
Durable Available High performance
§ Multiple Upload
§ Range GET
§ Store as much as you need
§ Scale storage and compute
independently
§ No minimum usage commitments
Scalable
§ Amazon EMR
§ Amazon Redshift
§ Amazon DynamoDB
Integrated
§ Simple REST API
§ AWS SDKs
§ Read-after-create consistency
§ Event notification
§ Lifecycle policies
Easy to use
Why Amazon S3 for Data Lake?
Encryption ComplianceSecurity
§ AWS Identity and Access
Management (IAM) policies
§ Bucket policies
§ Access control lists (ACLs)
§ Private VPC endpoints to
Amazon S3
§ SSL endpoints
§ Server-Side Encryption
(SSE-S3)
§ Amazon S3 Server-
Side Encryption with
provided keys (SSE-C,
SSE-KMS)
§ Client-Side Encryption
§ Buckets access logs
§ Lifecycle management
policies
§ ACLs
§ Versioning & MFA
deletes
§ Certifications – HIPAA,
PCI, SOC 1/2/3, etc.
Implement the right cloud security controls
• Decouple storage and compute by making Amazon S3
object-based storage.
• Consolidate and centrally manage and govern your data.
• Flexibility to use any and all tools in the ecosystem. The right
tool for the job.
• Future proof your architecture. As new use cases and new
tools emerge, you can plug and play current best of breed.
Benefits of an AWS Data Lake
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue – Data Catalog
Unified metadata repository across relational databases, Amazon RDS, Amazon
Redshift, and Amazon S3 accessible via Amazon Athena, Amazon Redshift Spectrum,
Amazon EMR and API
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue – ETL
Automatically generated ETL code running on serverless Apache Spark with the power
and flexibility to bring data together.
Process and analyze…
Hadoop/HDFS clusters
Hive, Pig, Impala, Hbase, Spark, Presto
Easy to use, fully managed
On-demand, reserved instance, and
Spot pricing
Tight integration with Amazon S3,
DynamoDB, and Kinesis
Amazon
Elastic
MapReduce
Server%rack%1
(20%nodes)
Server%rack%2
(20%nodes)
Server%rack%N%
(20%nodes)
Core
On-premises Hadoop clusters
• A cluster of 1U machines
• Typically 12 Cores, 32/64 GB
RAM, and 6 - 8 TB of HDD ($3-4K)
• Networking switches and racks
• Open-source distribution of
Hadoop or a fixed licensing term
by commercial distributions
• Different node roles
• HDFS uses local disk and is sized
for 3x data replication
Workload types running on the same cluster
• Large Scale ETL: Apache Spark, Apache Hive with Apache Tez, or
Apache Hadoop MapReduce
• Interactive Queries: Apache Impala, Spark SQL, Presto, Apache
Phoenix
• Machine Learning and Data Science: Spark ML, Apache Mahout
• NoSQL: Apache HBase
• Stream Processing: Apache Kafka, Spark Streaming, Apache Flink,
Apache NiFi, Apache Storm
• Search: Elasticsearch, Apache Solr
• Job Submission: Client Edge Node, Apache Oozie
• Data warehouses like Pivotal Greenplum or Teradata
Why Amazon EMR?
Low Cost
Pay an hourly rate
Open-Source Variety
Latest versions of software
Managed
Spend less time monitoring
Secure
Easy-to-manage options
Flexible
Customize the cluster
Easy to Use
Launch a cluster in minutes
Many storage layers to choose from
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
Amazon Elasticsearch
Service
Decouple compute and storage by using
Amazon S3 as your data layer
HDFS
S3 is designed for 11
9’s of durability and is
massively scalable
EC2 Instance
Memory
Amazon S3
Amazon EMR
Amazon EMR
Intermediates
stored on local
disk or HDFS
Local
HBase on Amazon S3 for scalable NoSQL
Amazon
Athena
Interactive query service
• Query directly from Amazon S3
• Use ANSI SQL
• Serverless
• Multiple data formats
• Pay per query
Why use Athena?
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Encrypted
• Standard compliant and open storage formats
• Built on powerful community supported OSS solutions
Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data transfer apply
Familiar Technologies Under the Covers
Used for SQL Queries
In-memory distributed query engine
ANSI-SQL compatible with extensions
Used for DDL functionality
Complex data types
Multitude of formats
Supports data partitioning
Hive Metadata Definition
• Hive Data Definition Language
• Data Manipulation Language (INSERT, UPDATE)
• Create Table As
• User Defined Functions
• Hive compatible SerDe (serializer/deserializer)
• CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail
Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries &
window functions
• Complex data types (arrays,
structs, maps)
• Partitioning of data by any key
• date, time, custom keys
• Presto built-in functions
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
Amazon Redshift works with third-party analysis tools
JDBC/ODBC
Amazon Redshift
Amazon
QuickSight
New!
Amazon Redshift has security built in
SSL to secure data in transit
Encryption to secure data at rest
• AES-256; hardware accelerated
• All blocks on disks and in Amazon S3 encrypted
• HSM support
No direct access to compute nodes
Audit logging, AWS CloudTrail, AWS KMS integration
Amazon VPC support
SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
10 GigE
(HPC)
Ingestion
Backup
Restore
SQL Clients/BI Tools
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Amazon S3/Amazon DynamoDB
Customer VPC
Internal
VPC
JDBC/ODBC
Leader
Node
Compute
Node
Compute
Node
Compute
Node
Redshift Spectrum
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal
VPC
JDBC/ODBC
Leverages Amazon Redshift’s advanced cost-
based optimizer
Pushes down projections, filters, aggregations
and join reduction
Dynamic partition pruning to minimize data
processed
Automatic parallelization of query execution
against Amazon S3 data
Efficient join processing within the Amazon
Redshift cluster
Spectrum
Nodes
Redshift
Nodes
Translate use cases to the right tools
- Low-latency SQL -> Athena or Presto or Amazon Redshift
- Data warehouse/Reporting -> Spark or Hive or Glue or Amazon Redshift
- Management and monitoring -> EMR console or Ganglia metrics
- HDFS -> Amazon S3
- Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action)
- Query console -> Athena or Hue
- Security -> Ranger (CF template) or HiveServer2 or IAM roles
Storage
S3 (EMRFS), HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop
HBase/Phoenix
Presto
Athena
Streaming
Flink
Glue
Amazon Redshift
Real-time distributed search and analytics engine
Managed service using Elasticsearch and Kibana
Fully managed - zero admin
Highly available and reliable
Tightly integrated with other AWS services
Amazon
Elasticsearch
Service
Amazon Elasticsearch Service leading use cases
Log Analytics &
Operational Monitoring
• Monitor the performance of
applications, web servers, and
hardware
• Easy to use, yet powerful data
visualization tools to detect
issues in near real-time
• Dig into logs in an intuitive,
fine-grained way
• Kibana provides fast, easy
visualization
Traditional Search
• Application or website provides
search capabilities over diverse
documents
• Tasked with making this knowledge
base searchable and accessible
• Key search features including text
matching, faceting, filtering, fuzzy
search, autocomplete, and
highlighting
• Query API to support application
search
Real-time processing
High throughput; elastic
Easy to use
Amazon EMR, Amazon S3,
Amazon Redshift, DynamoDB
Integrations
Amazon
Kinesis
Amazon Kinesis
Streams
• For technical developers
• Build your own custom
applications that process
or analyze streaming
data
Amazon Kinesis
Firehose
• For all developers, data
scientists
• Easily load massive
volumes of streaming data
into S3, Amazon Redshift,
and Amazon
Elasticsearch Service
Amazon Kinesis
Analytics
• For all developers, data
scientists
• Easily analyze data
streams using standard
SQL queries
Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Frameworks &
Infrastructure
AWS Deep Learning AMI
GPU
(P3 Instances)
MobileCPU
IoT
(Greengrass)
Vision:
Rekognition Image
Rekognition Video
Speech:
Polly
Transcribe
Language:
Lex Translate
Comprehend
Apache
MXNet
PyTorchCognitive
Toolkit
Keras
Caffe2
& Caffe
TensorFlow Gluon
AWS ML Stack
Application
Services
Platform
Services
Amazon Machine
Learning
Mechanical
Turk
Spark &
EMR
Amazon
SageMaker
AWS
DeepLens
Proven customer success
The vast majority of big data use cases deployed in the cloud
today run on AWS.
“For our market
surveillance systems, we
are looking at about 40%
[savings with AWS], but
the real benefits are the
business benefits: We
can do things that we
physically weren’t able to
do before, and that is
priceless.”
- Steve Randich, CIO
Case study: Re-architecting compliance
What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75
billion market events every day
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic
clusters (Hadoop, Hive, and HBase), Amazon EMR,
and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20M annually by using AWS
Fraud detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion
trading events per day and securely store over 5 petabytes of data,
attaining savings of $10-20M per year.
Building a Data Lake
An architectural
approach that allows you
to store massive amounts
of “raw” data into a
central location
It’s readily available to be
categorized,
processed,
analyzed,
and consumed
by diverse groups
Catalog' &'Search
Access%and%search%metadata
Access'&'User'Interface
Give%your%users%easy%and%secure%access
DynamoDB Elasticsearch API'Gateway Identity'&'Access'
Management
Cognito
QuickSight Amazon' AI EMR Redshift
Athena Kinesis RDS
Central'Storage
Secure,%cost5effective
Storage%in%Amazon%S3
S3
Snowball Database' Migration'
Service
Kinesis' Firehose Direct'Connect
Data'Ingestion
Get%your%data%into%S3
Quickly%and%securely
Protect'and'Secure
Use%entitlements% to%ensure%data% is%secure%and% users’% identities% are% verified
Processing' &'Analytics
Use%of%predictive%and%prescriptive%
analytics%to%gain%better%understanding
Security'Token'
Service
CloudWatch CloudTrail Key'Management'
Service
Pop-up Loft
aws.amazon.com/activate
Everything and Anything Startups
Need to Get Started on AWS

More Related Content

What's hot

Data Design for Microservices - DevDay Austin 2017 Day 2
Data Design for Microservices - DevDay Austin 2017 Day 2Data Design for Microservices - DevDay Austin 2017 Day 2
Data Design for Microservices - DevDay Austin 2017 Day 2Amazon Web Services
 
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?Amazon Web Services
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSAmazon Web Services
 
Caching with DynamoDB and DAX - DevDay Austin 2017 Day 2
Caching with DynamoDB and DAX - DevDay Austin 2017 Day 2Caching with DynamoDB and DAX - DevDay Austin 2017 Day 2
Caching with DynamoDB and DAX - DevDay Austin 2017 Day 2Amazon Web Services
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
How Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsHow Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsAmazon Web Services
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaDatabricks
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data ArchitecturesLynn Langit
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkItai Yaffe
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsBob Pusateri
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenMS Cloud Summit
 
Building Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS CloudBuilding Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS CloudAmazon Web Services
 

What's hot (20)

Data Design for Microservices - DevDay Austin 2017 Day 2
Data Design for Microservices - DevDay Austin 2017 Day 2Data Design for Microservices - DevDay Austin 2017 Day 2
Data Design for Microservices - DevDay Austin 2017 Day 2
 
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
 
Lecture1
Lecture1Lecture1
Lecture1
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Digital Transformation with Microsoft Azure
Digital Transformation with Microsoft AzureDigital Transformation with Microsoft Azure
Digital Transformation with Microsoft Azure
 
Caching with DynamoDB and DAX - DevDay Austin 2017 Day 2
Caching with DynamoDB and DAX - DevDay Austin 2017 Day 2Caching with DynamoDB and DAX - DevDay Austin 2017 Day 2
Caching with DynamoDB and DAX - DevDay Austin 2017 Day 2
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
How Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsHow Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS Analytics
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
 
AWS & Database Analytics
AWS & Database AnalyticsAWS & Database Analytics
AWS & Database Analytics
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data Architectures
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
 
Building Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS CloudBuilding Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS Cloud
 

Similar to Using Data Lakes: Data Analytics Week SF

Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSAmazon Web Services
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS Amazon Web Services
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAmazon Web Services
 

Similar to Using Data Lakes: Data Analytics Week SF (20)

Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Using Data Lakes: Data Analytics Week SF

  • 1. Pop-up Loft Using Data Lakes John Mallory johmallo@amazon.com
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. E x p o n e n t i a l g r o w t h i n d a t a Reasons for building a data lake Transactions ERP Sensor Data Billing Web logs Social Infrastructure logs
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. E x p o n e n t i a l g r o w t h i n d a t a Reasons for building a data lake Transactions ERP Sensor Data Billing Web logs Social Infrastructure logs D i v e r s i f i e d c o n s u m e r s Data Scientists Business Analyst External Consumers Applications
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. E x p o n e n t i a l g r o w t h i n d a t a Reasons for building a data lake Transactions ERP Sensor Data Billing Web logs Social Infrastructure logs D i v e r s i f i e d c o n s u m e r s Data Scientists Business Analyst External Consumers Applications M u l t i p l e a c c e s s m e c h a n i s m s API Access BI Tools Notebooks
  • 5. Characteristics of a data lake Future Proof Flexible Access Dive in Anywhere Collect Anything
  • 6. Challenges with legacy data architectures • Can’t move data across silos • Can’t handle semi-structured or unstructured data • Can’t deal with dynamic data and real-time processing • Complex ETL processes • Difficulty applying consistent data governance across all data sets
  • 7. Legacy data architectures exist as isolated data silos Hadoop cluster SQL database Data warehouse appliance
  • 8. Navigating the Data Lake… Data Lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogenous types of data in a centralized repository
  • 9. Data Lake – Centralized data, single source of truth Store and analyze all of your data, from all of your sources, in one centralized location “Why is the data distributed in many locations? Where is the single source of truth?”
  • 10. Data Lake – Quick ingest Quickly ingest data without needing to force it into a predefined schema “How can I collect data quickly from various sources and store it efficiently?”
  • 11. Building a Data Lake on AWS
  • 12. Why AWS? Implementing a Data Lake architecture requires a broad set of tools and technologies to serve an increasingly diverse set of applications and use cases. Boils down to: Picking the right tool for the right job…on a consumption- based model…
  • 13. Central Storage AWS Database Migration Service AWS Snowball* AWS Direct Connect Amazon Kinesis Firehose Data Ingestion Catalog Amazon DynamoDB Amazon Elasticsearch Service Amazon Kinesis Amazon EMR Amazon RDS Amazon Athena Amazon QuickSight Amazon Redshift* Processing and Analytics S3 Access and Secure Example of AWS Services for Data Lake Amazon Elasticsearch Service AWS Glue
  • 14. Catalog' &'Search Access%and%search%metadata Access'&'User'Interface Give%your%users%easy%and%secure%access DynamoDB Elasticsearch API'Gateway Identity'&'Access' Management Cognito QuickSight Amazon' AI EMR Redshift Athena Kinesis RDS Central'Storage Secure,%cost5effective Storage%in%Amazon%S3 S3 Snowball Database' Migration' Service Kinesis' Firehose Direct'Connect Data'Ingestion Get%your%data%into%S3 Quickly%and%securely Protect'and'Secure Use%entitlements% to%ensure%data% is%secure%and% users’% identities% are% verified Processing' &'Analytics Use%of%predictive%and%prescriptive% analytics%to%gain%better%understanding Security'Token' Service CloudWatch CloudTrail Key'Management' Service Data Lake reference architecture Catalog' &'Search Access%and%search%metadata Access'&'User'Interface Give%your%users%easy%and%secure%access DynamoDB Elasticsearch API'Gateway Identity'&'Access' Management Cognito QuickSight Central'Storage Secure,%cost5effective Storage%in%Amazon%S3
  • 15. S3 – Center of the Data Lake
  • 16. Designed for 11 9s of durability Designed for 99.99% availability Durable Available High performance § Multiple Upload § Range GET § Store as much as you need § Scale storage and compute independently § No minimum usage commitments Scalable § Amazon EMR § Amazon Redshift § Amazon DynamoDB Integrated § Simple REST API § AWS SDKs § Read-after-create consistency § Event notification § Lifecycle policies Easy to use Why Amazon S3 for Data Lake?
  • 17. Encryption ComplianceSecurity § AWS Identity and Access Management (IAM) policies § Bucket policies § Access control lists (ACLs) § Private VPC endpoints to Amazon S3 § SSL endpoints § Server-Side Encryption (SSE-S3) § Amazon S3 Server- Side Encryption with provided keys (SSE-C, SSE-KMS) § Client-Side Encryption § Buckets access logs § Lifecycle management policies § ACLs § Versioning & MFA deletes § Certifications – HIPAA, PCI, SOC 1/2/3, etc. Implement the right cloud security controls
  • 18. • Decouple storage and compute by making Amazon S3 object-based storage. • Consolidate and centrally manage and govern your data. • Flexibility to use any and all tools in the ecosystem. The right tool for the job. • Future proof your architecture. As new use cases and new tools emerge, you can plug and play current best of breed. Benefits of an AWS Data Lake
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue – Data Catalog Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3 accessible via Amazon Athena, Amazon Redshift Spectrum, Amazon EMR and API
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue – ETL Automatically generated ETL code running on serverless Apache Spark with the power and flexibility to bring data together.
  • 22. Hadoop/HDFS clusters Hive, Pig, Impala, Hbase, Spark, Presto Easy to use, fully managed On-demand, reserved instance, and Spot pricing Tight integration with Amazon S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
  • 23. Server%rack%1 (20%nodes) Server%rack%2 (20%nodes) Server%rack%N% (20%nodes) Core On-premises Hadoop clusters • A cluster of 1U machines • Typically 12 Cores, 32/64 GB RAM, and 6 - 8 TB of HDD ($3-4K) • Networking switches and racks • Open-source distribution of Hadoop or a fixed licensing term by commercial distributions • Different node roles • HDFS uses local disk and is sized for 3x data replication
  • 24. Workload types running on the same cluster • Large Scale ETL: Apache Spark, Apache Hive with Apache Tez, or Apache Hadoop MapReduce • Interactive Queries: Apache Impala, Spark SQL, Presto, Apache Phoenix • Machine Learning and Data Science: Spark ML, Apache Mahout • NoSQL: Apache HBase • Stream Processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm • Search: Elasticsearch, Apache Solr • Job Submission: Client Edge Node, Apache Oozie • Data warehouses like Pivotal Greenplum or Teradata
  • 25. Why Amazon EMR? Low Cost Pay an hourly rate Open-Source Variety Latest versions of software Managed Spend less time monitoring Secure Easy-to-manage options Flexible Customize the cluster Easy to Use Launch a cluster in minutes
  • 26. Many storage layers to choose from Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift Amazon S3 Amazon EMR Amazon Elasticsearch Service
  • 27. Decouple compute and storage by using Amazon S3 as your data layer HDFS S3 is designed for 11 9’s of durability and is massively scalable EC2 Instance Memory Amazon S3 Amazon EMR Amazon EMR Intermediates stored on local disk or HDFS Local
  • 28. HBase on Amazon S3 for scalable NoSQL
  • 29. Amazon Athena Interactive query service • Query directly from Amazon S3 • Use ANSI SQL • Serverless • Multiple data formats • Pay per query
  • 30. Why use Athena? • Decouple storage from compute • Serverless – No infrastructure or resources to manage • Pay only for data scanned • Schema on read – Same data, many views • Encrypted • Standard compliant and open storage formats • Built on powerful community supported OSS solutions
  • 31. Simple Pricing • DDL operations – FREE • SQL operations – FREE • Query concurrency – FREE • Data scanned - $5 / TB • Standard S3 rates for storage, requests, and data transfer apply
  • 32. Familiar Technologies Under the Covers Used for SQL Queries In-memory distributed query engine ANSI-SQL compatible with extensions Used for DDL functionality Complex data types Multitude of formats Supports data partitioning
  • 33. Hive Metadata Definition • Hive Data Definition Language • Data Manipulation Language (INSERT, UPDATE) • Create Table As • User Defined Functions • Hive compatible SerDe (serializer/deserializer) • CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail
  • 34. Presto SQL • ANSI SQL compliant • Complex joins, nested queries & window functions • Complex data types (arrays, structs, maps) • Partitioning of data by any key • date, time, custom keys • Presto built-in functions
  • 35. Relational data warehouse Massively parallel; petabyte scale Fully managed HDD and SSD platforms $1,000/TB/year; starts at $0.25/hour Amazon Redshift a lot faster a lot simpler a lot cheaper
  • 36. Amazon Redshift works with third-party analysis tools JDBC/ODBC Amazon Redshift Amazon QuickSight New!
  • 37. Amazon Redshift has security built in SSL to secure data in transit Encryption to secure data at rest • AES-256; hardware accelerated • All blocks on disks and in Amazon S3 encrypted • HSM support No direct access to compute nodes Audit logging, AWS CloudTrail, AWS KMS integration Amazon VPC support SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA 10 GigE (HPC) Ingestion Backup Restore SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores Amazon S3/Amazon DynamoDB Customer VPC Internal VPC JDBC/ODBC Leader Node Compute Node Compute Node Compute Node
  • 38. Redshift Spectrum ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 10 GigE (HPC) Ingestion Backup Restore Customer VPC Internal VPC JDBC/ODBC Leverages Amazon Redshift’s advanced cost- based optimizer Pushes down projections, filters, aggregations and join reduction Dynamic partition pruning to minimize data processed Automatic parallelization of query execution against Amazon S3 data Efficient join processing within the Amazon Redshift cluster Spectrum Nodes Redshift Nodes
  • 39. Translate use cases to the right tools - Low-latency SQL -> Athena or Presto or Amazon Redshift - Data warehouse/Reporting -> Spark or Hive or Glue or Amazon Redshift - Management and monitoring -> EMR console or Ganglia metrics - HDFS -> Amazon S3 - Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action) - Query console -> Athena or Hue - Security -> Ranger (CF template) or HiveServer2 or IAM roles Storage S3 (EMRFS), HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop HBase/Phoenix Presto Athena Streaming Flink Glue Amazon Redshift
  • 40. Real-time distributed search and analytics engine Managed service using Elasticsearch and Kibana Fully managed - zero admin Highly available and reliable Tightly integrated with other AWS services Amazon Elasticsearch Service
  • 41. Amazon Elasticsearch Service leading use cases Log Analytics & Operational Monitoring • Monitor the performance of applications, web servers, and hardware • Easy to use, yet powerful data visualization tools to detect issues in near real-time • Dig into logs in an intuitive, fine-grained way • Kibana provides fast, easy visualization Traditional Search • Application or website provides search capabilities over diverse documents • Tasked with making this knowledge base searchable and accessible • Key search features including text matching, faceting, filtering, fuzzy search, autocomplete, and highlighting • Query API to support application search
  • 42. Real-time processing High throughput; elastic Easy to use Amazon EMR, Amazon S3, Amazon Redshift, DynamoDB Integrations Amazon Kinesis
  • 43. Amazon Kinesis Streams • For technical developers • Build your own custom applications that process or analyze streaming data Amazon Kinesis Firehose • For all developers, data scientists • Easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch Service Amazon Kinesis Analytics • For all developers, data scientists • Easily analyze data streams using standard SQL queries Amazon Kinesis: Streaming data made easy Services make it easy to capture, deliver, and process streams on AWS
  • 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Frameworks & Infrastructure AWS Deep Learning AMI GPU (P3 Instances) MobileCPU IoT (Greengrass) Vision: Rekognition Image Rekognition Video Speech: Polly Transcribe Language: Lex Translate Comprehend Apache MXNet PyTorchCognitive Toolkit Keras Caffe2 & Caffe TensorFlow Gluon AWS ML Stack Application Services Platform Services Amazon Machine Learning Mechanical Turk Spark & EMR Amazon SageMaker AWS DeepLens
  • 45. Proven customer success The vast majority of big data use cases deployed in the cloud today run on AWS.
  • 46. “For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” - Steve Randich, CIO Case study: Re-architecting compliance What FINRA needed • Infrastructure for its market surveillance platform • Support of analysis and storage of approximately 75 billion market events every day Why they chose AWS • Fulfillment of FINRA’s security requirements • Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3 Benefits realized • Increased agility, speed, and cost savings • Estimated savings of $10-20M annually by using AWS
  • 47. Fraud detection FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20M per year.
  • 48. Building a Data Lake An architectural approach that allows you to store massive amounts of “raw” data into a central location It’s readily available to be categorized, processed, analyzed, and consumed by diverse groups Catalog' &'Search Access%and%search%metadata Access'&'User'Interface Give%your%users%easy%and%secure%access DynamoDB Elasticsearch API'Gateway Identity'&'Access' Management Cognito QuickSight Amazon' AI EMR Redshift Athena Kinesis RDS Central'Storage Secure,%cost5effective Storage%in%Amazon%S3 S3 Snowball Database' Migration' Service Kinesis' Firehose Direct'Connect Data'Ingestion Get%your%data%into%S3 Quickly%and%securely Protect'and'Secure Use%entitlements% to%ensure%data% is%secure%and% users’% identities% are% verified Processing' &'Analytics Use%of%predictive%and%prescriptive% analytics%to%gain%better%understanding Security'Token' Service CloudWatch CloudTrail Key'Management' Service
  • 49. Pop-up Loft aws.amazon.com/activate Everything and Anything Startups Need to Get Started on AWS