© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Aditi Gupta, Solutions Architect, Amazon Web Services
Big Data@Scale
Key Takeaways
1. Why big data?
2. How to do big data processing on AWS?
3. Architectural patterns
4. Customer data lake success stories
Ever-Increasing Data
International Data Corporation (IDC) - Digital Universe
2016 – 16.1 zettabytes (ZB)
2025 – 163 zettabytes (ZB)
Volume
Velocity
Variety
1 zettabyte = 1,000 exabytes = 1 million PB = 1 billion TB
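The unit ladder and the two IDC figures can be checked with a few lines of arithmetic. This is a minimal sketch: the constant and function names are illustrative, and the yearly growth rate is derived here from the two quoted data points rather than taken from IDC.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

def convert(value, from_unit, to_unit):
    """Convert a quantity between two decimal storage units."""
    return value * from_unit / to_unit

# Figures quoted on the slide (IDC digital universe estimates).
size_2016_zb = 16.1
size_2025_zb = 163.0

growth_factor = size_2025_zb / size_2016_zb   # roughly tenfold
years = 2025 - 2016
cagr = growth_factor ** (1 / years) - 1       # implied compound yearly growth

print(f"1 ZB = {convert(1, ZB, EB):,.0f} EB"
      f" = {convert(1, ZB, PB):,.0f} PB"
      f" = {convert(1, ZB, TB):,.0f} TB")
print(f"Growth 2016 to 2025: {growth_factor:.1f}x, about {cagr:.0%} per year")
```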
Big Data Processing @ Scale
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
data → answers
COLLECT
Logging: Amazon CloudWatch, AWS CloudTrail
IoT: Devices, sensors & IoT solutions, AWS IoT Analytics
Applications: Mobile apps, web apps, enterprise apps
Getting Data into AWS
AWS Direct Connect
AWS Snowball
Amazon Kinesis Firehose
AWS Storage Gateway
COLLECT → STORE
data → answers
STORE
Database
Search: Amazon Elasticsearch Service
SQL: Amazon Redshift, Amazon RDS
NoSQL: Amazon DynamoDB
File/object storage: Amazon S3
Streaming data (IoT / application / device streams): Amazon Kinesis Firehose, Amazon Kinesis Streams, Apache Kafka, Amazon DynamoDB Streams
COLLECT → STORE → PROCESS/ANALYZE
data → answers
PROCESS / ANALYZE
Data Enrichment
Analyze: Batch, Interactive, Streaming
Extract, Transform, Load (ETL)
Data Lake
Amazon EMR, Amazon Kinesis, AWS Glue
Amazon EMR, Amazon Kinesis, Amazon QuickSight, Amazon Redshift*
Amazon ES, Amazon EMR, Amazon S3, Amazon Athena
Amazon EMR, AWS Glue, Amazon Redshift*
PROCESS / ANALYZE
Amazon EMR (Elastic MapReduce)
• Fully managed Hadoop cluster framework
• Supports big data frameworks such as Hive, Impala, Presto, Spark and more
• EMR File System (EMRFS) allows EMR clusters to efficiently and securely use Amazon S3 for storage at any scale
• Integrated with Amazon S3, RDS, Redshift & any JDBC-compliant data store
• On-demand and Spot pricing; pay as you go
PROCESS / ANALYZE
Amazon Redshift
• Fully managed relational data warehouse
• Massively parallel; petabyte scale
• Data compression substantially reduces I/O
• Columnar data storage designed for scale
• $1,000/TB/year; starts at $0.25/hour
A lot faster, a lot simpler, a lot cheaper
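To illustrate why a columnar layout combined with compression reduces I/O, here is a toy sketch. It is not Redshift's storage format or one of its actual encodings; the table contents and helper names are made up for illustration, and run-length encoding simply stands in for column compression in general.

```python
from itertools import groupby

# Toy table of (order_id, region, amount) rows; hypothetical data.
rows = [
    (1, "APAC", 120.0),
    (2, "APAC", 75.5),
    (3, "APAC", 30.0),
    (4, "EMEA", 99.9),
    (5, "EMEA", 10.0),
    (6, "AMER", 42.0),
]

def to_columns(rows, names):
    """Pivot row tuples into one list per column (a column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows, ["order_id", "region", "amount"])

# An aggregate such as SUM(amount) only has to read one column, not whole rows.
total_amount = sum(columns["amount"])

# A sorted, low-cardinality column compresses well: 6 values become 3 pairs.
encoded_region = run_length_encode(columns["region"])

print(total_amount)
print(encoded_region)
```

The same idea scales up: the fewer columns a query touches and the more repetitive each column is, the less data has to be read from disk.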
PROCESS / ANALYZE
Amazon Kinesis
Managed service for real-time big data processing
Kinesis Streams
• Create streams to produce & consume data
• Elastically add and remove shards for performance and scale
Kinesis Firehose
• Easily load massive amounts of streaming data into Amazon S3 and Amazon Redshift
Kinesis Analytics
• Easily analyze data streams using standard SQL queries
• Elastically scales to match data throughput
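Shards are the unit of scale in Kinesis Streams: each record's partition key is hashed with MD5 into a 128-bit integer, and the record goes to the shard whose hash key range contains that value. The sketch below models that routing locally, with no AWS calls. The even split of the hash space and the function names are assumptions for illustration.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 digests are 128-bit integers

def even_shard_ranges(shard_count):
    """Split the hash key space into contiguous, equally sized ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges

def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_key = int(digest, 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all shard ranges")

ranges = even_shard_ranges(4)
for device_id in ["sensor-1", "sensor-2", "sensor-3"]:
    print(device_id, "-> shard", shard_for(device_id, ranges))

# Adding shards changes the ranges, so some keys land on different shards.
wider_ranges = even_shard_ranges(8)
```

Because routing depends only on the key's hash, records with the same partition key always land on the same shard for a given set of ranges, which is what preserves per-key ordering.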
PROCESS / ANALYZE
Amazon Athena
• An interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL
• Serverless – no infrastructure or resources to manage at any scale
• Schema on read – same data, many views
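"Schema on read" means the raw files carry no schema; each reader applies one at query time. The sketch below shows the idea in plain Python, not in Athena itself: the sample data, the two schemas and the `read_with_schema` helper are all hypothetical.

```python
import csv
import io

# Raw data as it might sit in an object store: no schema attached.
RAW = """2018-03-01,web,42,19.99
2018-03-01,mobile,17,5.00
2018-03-02,web,8,120.50
"""

def read_with_schema(raw_text, schema):
    """Apply (column name, converter) pairs to each raw row at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, record)})
    return rows

# View 1: every field typed, for revenue analysis.
sales_schema = [("day", str), ("channel", str), ("units", int), ("revenue", float)]

# View 2: only the first two fields matter; the rest are ignored.
traffic_schema = [("day", str), ("channel", str)]

sales_view = read_with_schema(RAW, sales_schema)
traffic_view = read_with_schema(RAW, traffic_schema)

print(sum(row["revenue"] for row in sales_view))
print({row["channel"] for row in traffic_view})
```

Both views read the same bytes; neither required the data to be reloaded or rewritten.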
PROCESS / ANALYZE
AWS Glue
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
Managed Transform Engine
• Auto-generates ETL code
• Built on open frameworks – Python and Spark
Job Scheduler
• Runs jobs on a serverless Spark platform; massively scalable
Integrated with S3, RDS, EMR, Redshift, Athena & any JDBC-compliant data store
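As a rough picture of what "crawlers extract metadata and create tables" means, the toy sketch below infers a column-to-type mapping from sample records. It is not how AWS Glue crawlers are implemented; the type names, sample records and `crawl` helper are assumptions chosen for illustration.

```python
def infer_type(values):
    """Pick the narrowest type (int, then double, else string) that fits every value."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"

def crawl(records):
    """Infer a column-name -> type mapping from a sample of dict records."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}

# Hypothetical sample, as if parsed from two lines of a CSV file.
sample = [
    {"order_id": "1001", "price": "2.5", "country": "SG"},
    {"order_id": "1002", "price": "10", "country": "US"},
]

table_definition = crawl(sample)
print(table_definition)
```

A catalog entry built this way is what lets query engines treat a folder of files as a table.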
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
data → answers
CONSUME
API: Apps & Services
Analysis and Visualization: Amazon QuickSight
Notebooks
Putting It All Together
COLLECT
Applications: Mobile apps, web apps, enterprise apps
IoT: Devices, sensors & IoT solutions, AWS IoT Analytics
Logging: Amazon CloudWatch, AWS CloudTrail
STORE
Search: Amazon Elasticsearch Service
SQL: Amazon RDS, Amazon Redshift
NoSQL: Amazon DynamoDB
File: Amazon S3
Stream: Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
PROCESS/ANALYZE
Batch, Interactive, Stream, ML
ETL, Streaming, Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda, Amazon Redshift, Amazon Machine Learning, Presto, Amazon EMR, Amazon EC2
CONSUME
API: Apps & Services
Analysis & visualization: Amazon QuickSight
Notebooks
Architectural Patterns
Building Event-Driven Batch Analytics on AWS
Your data: On-premises data, web app data, Amazon RDS, other databases, streaming data
Staging data
Input validation/conversion layer (AWS Lambda), producing pre-processed data
Input tracking layer (AWS Lambda)
Aggregation job submission and monitoring layer (AWS Lambda)
State management store
Aggregation and load layer: Amazon EMR, Amazon Redshift
Identity and Access Management (IAM)
Monitoring and logging (CloudWatch)
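The layers in this pattern can be sketched as plain functions: validate each arriving file, record it in a state store, and submit an aggregation job once every input the job needs has arrived. This is a local model under stated assumptions. The job configuration, event fields and in-memory state store are hypothetical stand-ins for the Lambda functions and the managed state store in the diagram.

```python
REQUIRED_INPUTS = {
    # Hypothetical job configuration: job name -> input types it needs.
    "daily_sales_aggregate": {"orders", "refunds"},
}

state_store = {}      # (job, day) -> set of input types seen so far
submitted_jobs = []   # stands in for submitting a step to a cluster

def validate(event):
    """Input validation layer: reject events missing required fields."""
    return all(event.get(field) for field in ("input_type", "day", "key"))

def track(event):
    """Input tracking layer: record the arrival, return jobs that became ready."""
    ready = []
    for job, needed in REQUIRED_INPUTS.items():
        if event["input_type"] not in needed:
            continue
        seen = state_store.setdefault((job, event["day"]), set())
        already_complete = seen >= needed
        seen.add(event["input_type"])
        if seen >= needed and not already_complete:
            ready.append((job, event["day"]))
    return ready

def handle_file_arrival(event):
    """Entry point, playing the role of a function triggered per staged file."""
    if not validate(event):
        return []
    ready = track(event)
    submitted_jobs.extend(ready)   # job submission layer
    return ready

handle_file_arrival({"input_type": "orders", "day": "2018-03-01", "key": "staging/orders.csv"})
handle_file_arrival({"input_type": "refunds", "day": "2018-03-01", "key": "staging/refunds.csv"})
# A late duplicate must not trigger the job a second time.
handle_file_arrival({"input_type": "refunds", "day": "2018-03-01", "key": "staging/refunds-2.csv"})
print(submitted_jobs)
```

Keeping the "what has arrived" state outside the functions is what lets each invocation stay stateless and idempotent.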
Real-Time and Batch Analytics
Your data: On-premises data, web app data, Amazon RDS, other databases, streaming data
Raw data in: Kinesis Firehose
Speed layer: Kinesis Firehose, Kinesis Analytics, user device settings
Batch layer: Raw data, S3 buckets, Amazon EMR
Serving layer: Pre-processed views and filtered data in S3 buckets
Athena, Amazon QuickSight
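The split into batch, speed and serving layers can be sketched in a few functions: the batch layer recomputes a view from all raw data up to its last run, the speed layer covers only what arrived since, and the serving layer merges the two. This is a conceptual model, not the AWS services in the diagram; the event fields and the device-count metric are made up.

```python
from collections import Counter

def batch_view(raw_events, cutoff):
    """Batch layer: recompute totals from all raw events up to the cutoff."""
    totals = Counter()
    for event in raw_events:
        if event["ts"] <= cutoff:
            totals[event["device"]] += event["count"]
    return totals

def speed_view(recent_events, cutoff):
    """Speed layer: incremental totals for events newer than the last batch run."""
    totals = Counter()
    for event in recent_events:
        if event["ts"] > cutoff:
            totals[event["device"]] += event["count"]
    return totals

def serve(batch, speed):
    """Serving layer: answer queries from the merged batch and real-time views."""
    return batch + speed

events = [
    {"ts": 1, "device": "thermostat", "count": 3},
    {"ts": 2, "device": "camera", "count": 1},
    {"ts": 5, "device": "thermostat", "count": 2},
]
last_batch_run = 2

merged = serve(batch_view(events, last_batch_run),
               speed_view(events, last_batch_run))
print(dict(merged))
```

The cutoff keeps the two views disjoint, so no event is counted twice when they are merged.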
Extending Data Warehousing out to Exabytes
Amazon Redshift Spectrum extends data warehousing out to exabytes - no loading required
Query (JDBC/ODBC to Amazon Redshift):
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
1 2 3 4 ... N
Amazon S3: Exabyte-scale object storage
Data Catalog: Apache Hive Metastore
Data lake on Amazon S3 with AWS Glue
Your data: On-premises data, web app data, Amazon RDS, other databases, streaming data
AWS Glue ETL
Amazon QuickSight
Customer Data Lake Success Stories
Netflix Uses S3 to Back its Various Clusters
ETL, SLA, Production: 2200+ m1.xlarge
Ad hoc, Exploratory, Test: 2000+ m1.xlarge
Bonus clusters: 3 x 150 m2.4xlarge
12 am – 10 am: 250 m2.4xlarge
S3
Case Study: Re-architecting Compliance
“For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.”
- Steve Randich, CIO
What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75 billion market events every day
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20 million annually by using AWS
Fraud Detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
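To put the daily volume in perspective, the quoted peak of 75 billion events per day works out to roughly 868,000 events per second if spread evenly over 24 hours. A quick check, where the even spread is an assumption (real market traffic is bursty):

```python
EVENTS_PER_DAY = 75_000_000_000   # "up to 75 billion" from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if the daily peak volume arrived at a constant pace.
average_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

print(f"{average_events_per_second:,.0f} events/second on average")
```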
• Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands
• More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes
• Nasdaq technology is used to power more than 100 marketplaces in 50 countries
• Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds
• We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes & geographies
• Nasdaq implements an S3 data lake + Redshift data warehouse architecture
• The most recent two years of data are kept in the Redshift data warehouse and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used for ad hoc queries on data in S3
• Transitioned from an on-premises data warehouse to an Amazon Redshift & S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off the legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of the legacy budget for the same data set (~1,100 tables)
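The age-based split between the warehouse and the data lake can be expressed as a small routing rule. The sketch below is an illustration of that rule only: the function names and tier labels are hypothetical, and what happens to data older than five years is not stated on the slide, so it is flagged rather than guessed.

```python
from datetime import date

HOT_YEARS = 2    # newest data lives in the warehouse
COLD_YEARS = 5   # older data, up to this age, lives in the data lake

def years_between(earlier, later):
    """Whole years elapsed between two dates."""
    years = later.year - earlier.year
    if (later.month, later.day) < (earlier.month, earlier.day):
        years -= 1
    return years

def storage_tier(record_date, today):
    """Return which store a query for record_date should be sent to."""
    age = years_between(record_date, today)
    if age < HOT_YEARS:
        return "redshift"
    if age < COLD_YEARS:
        return "s3"            # queried ad hoc, e.g. with Presto on EMR
    return "out_of_range"      # the slide does not say where older data goes

today = date(2018, 4, 1)
print(storage_tier(date(2017, 6, 30), today))
print(storage_tier(date(2015, 1, 15), today))
```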
• Amgen began implementing an S3 data lake on AWS in 2014
• Has been running in production since early 2015
• Now able to integrate all data sets together in one analytics platform, e.g. sales data, marketing data, manufacturing line data, patient population data and FDA public datasets
• Rapid data experimentation
• Enables new use cases & data innovations not previously possible
• Leverages Amazon EMR and Amazon Redshift for the analytics layer around the lake
• Leverages RStudio and SAS for the data science layer on top of EMR and Redshift
• Uses EMR for the ETL layer
• EMR is 50% faster & 30% cheaper than the legacy ETL solution
• Amgen’s Amazon S3 data lake won the Best Practice Award at Bio-IT World 2016 for ‘Real World Data Platform & Analytics’
Best Practices Awards, Bio-IT World 2016: Winner
A CONCEPTUAL VIEW OF AMGEN’S DATA LAKE
DATA LAKE PLATFORM
Common Tools and Capabilities
• Innovative new tools and capabilities are reused across all functions (e.g. search, data processing & storage, visualization tools)
• Functions can manage their own data, while contributing to the common data layer
• Business applications are built to meet specific information needs, from simple data access and data visualization to complex statistical/predictive models
Manufacturing Data: Batch Genealogy Visualization, PD Self-Service Analytics, Instrument Bus Data Search
Real World Data: FDA Sentinel Analytics, Epi Programmer’s Workbench, Patient Population Analytics
Commercial Data: US Field Sales Reporting, Global Forecasting, US Marketing Analytics
Key Learnings
• Use Amazon S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allow us to evolve to clusterless architectures like Amazon Athena
• Do not build data silos in Hadoop or the enterprise DW
• Gain flexibility to use all the analytics tools in the ecosystem around Amazon S3 & future-proof the architecture
Please complete the session survey
Thank you!

More Related Content

What's hot

Building with Purpose - Built Databases: Match Your Workloads to the Right Da...
Building with Purpose - Built Databases: Match Your Workloads to the Right Da...Building with Purpose - Built Databases: Match Your Workloads to the Right Da...
Building with Purpose - Built Databases: Match Your Workloads to the Right Da...Amazon Web Services
 
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...Amazon Web Services
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsAmazon Web Services
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Amazon Web Services
 
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Amazon Web Services
 
How to Build HR Lakes on AWS to Unlock New Business Insights (DAT367) - AWS r...
How to Build HR Lakes on AWS to Unlock New Business Insights (DAT367) - AWS r...How to Build HR Lakes on AWS to Unlock New Business Insights (DAT367) - AWS r...
How to Build HR Lakes on AWS to Unlock New Business Insights (DAT367) - AWS r...Amazon Web Services
 
Query in Place with AWS (STG315-R1) - AWS re:Invent 2018
Query in Place with AWS (STG315-R1) - AWS re:Invent 2018Query in Place with AWS (STG315-R1) - AWS re:Invent 2018
Query in Place with AWS (STG315-R1) - AWS re:Invent 2018Amazon Web Services
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Amazon Web Services
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfAmazon Web Services
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfAmazon Web Services
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon RedshiftAmazon Web Services
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftAmazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 

What's hot (20)

Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Building with Purpose - Built Databases: Match Your Workloads to the Right Da...
Building with Purpose - Built Databases: Match Your Workloads to the Right Da...Building with Purpose - Built Databases: Match Your Workloads to the Right Da...
Building with Purpose - Built Databases: Match Your Workloads to the Right Da...
 
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS Analytics
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
 
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
How to Build HR Lakes on AWS to Unlock New Business Insights (DAT367) - AWS r...
How to Build HR Lakes on AWS to Unlock New Business Insights (DAT367) - AWS r...How to Build HR Lakes on AWS to Unlock New Business Insights (DAT367) - AWS r...
How to Build HR Lakes on AWS to Unlock New Business Insights (DAT367) - AWS r...
 
Query in Place with AWS (STG315-R1) - AWS re:Invent 2018
Query in Place with AWS (STG315-R1) - AWS re:Invent 2018Query in Place with AWS (STG315-R1) - AWS re:Invent 2018
Query in Place with AWS (STG315-R1) - AWS re:Invent 2018
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWS
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 

Similar to Big Data@Scale_AWSPSSummit_Singapore

Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018Amazon Web Services
 
SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
 SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right JobAmazon Web Services
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAdir Sharabi
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...Amazon Web Services
 
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftBDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftAmazon Web Services
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAmazon Web Services
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Amazon Web Services
 
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...Amazon Web Services
 
Using AWS Purpose-Built Databases to Modernize your Applications
Using AWS Purpose-Built Databases to Modernize your ApplicationsUsing AWS Purpose-Built Databases to Modernize your Applications
Using AWS Purpose-Built Databases to Modernize your ApplicationsAmazon Web Services
 
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA308 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceAmazon Web Services
 
Value of Data Beyond Analytics by Darin Briskman
 Value of Data Beyond Analytics by Darin Briskman Value of Data Beyond Analytics by Darin Briskman
Value of Data Beyond Analytics by Darin BriskmanSameer Kenkare
 
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Amazon Web Services
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCAmazon Web Services LATAM
 
Aws Tools for Alexa Skills
Aws Tools for Alexa SkillsAws Tools for Alexa Skills
Aws Tools for Alexa SkillsBoaz Ziniman
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWSAWS Germany
 
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018Amazon Web Services
 

Similar to Big Data@Scale_AWSPSSummit_Singapore (20)

Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
 
SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
 SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWS
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Data_Analytics_and_AI_ML
Data_Analytics_and_AI_MLData_Analytics_and_AI_ML
Data_Analytics_and_AI_ML
 
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
 
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftBDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
 
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
 
Using AWS Purpose-Built Databases to Modernize your Applications
Using AWS Purpose-Built Databases to Modernize your ApplicationsUsing AWS Purpose-Built Databases to Modernize your Applications
Using AWS Purpose-Built Databases to Modernize your Applications
 
AWS Database Services @ Scale
AWS Database Services @ ScaleAWS Database Services @ Scale
AWS Database Services @ Scale
 
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA308 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA308 Deep Dive: Log Analytics with Amazon Elasticsearch Service
 
Value of Data Beyond Analytics by Darin Briskman
 Value of Data Beyond Analytics by Darin Briskman Value of Data Beyond Analytics by Darin Briskman
Value of Data Beyond Analytics by Darin Briskman
 
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWS
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LC
 
Aws Tools for Alexa Skills
Aws Tools for Alexa SkillsAws Tools for Alexa Skills
Aws Tools for Alexa Skills
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWS
 
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Big Data@Scale_AWSPSSummit_Singapore

  • 1. Big Data@Scale. Aditi Gupta, Solutions Architect, Amazon Web Services. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 2. Key Takeaways: (1) Why big data? (2) How to do big data processing on AWS? (3) Architectural patterns. (4) Customer data lake success stories.
  • 3. Ever-increasing data. International Data Corporation (IDC), Digital Universe: 16.1 zettabytes (ZB) in 2016; 163 ZB in 2025. Volume, velocity, variety. 1 zettabyte = 1,000 exabytes = 1 million PB = 1 billion TB.
  • 4. Big Data Processing @ Scale: COLLECT, STORE, PROCESS/ANALYZE, CONSUME (from data to answers).
  • 5. COLLECT. Logging: Amazon CloudWatch, AWS CloudTrail. IoT: devices, sensors and IoT solutions, AWS IoT Analytics. Applications: mobile apps, web apps, enterprise apps.
  • 6. Getting Data into AWS: AWS Direct Connect, AWS Snowball, Amazon Kinesis Firehose, AWS Storage Gateway.
  • 7. COLLECT, STORE (from data to answers).
  • 8. STORE. Database: Amazon Elasticsearch Service (search), Amazon DynamoDB (NoSQL), Amazon Redshift and Amazon RDS (SQL). File/object storage: Amazon S3. Streaming data (IoT, application, and device streams): Amazon Kinesis Firehose, Amazon Kinesis Streams, Apache Kafka, Amazon DynamoDB Streams.
  • 9. COLLECT, STORE, PROCESS/ANALYZE (from data to answers).
  • 10. PROCESS / ANALYZE. Categories: data enrichment; analysis (batch, interactive, streaming); extract, transform, load (ETL); data lake. Services shown across the categories: Amazon EMR, Amazon Kinesis, AWS Glue, Amazon QuickSight, Amazon Redshift*, Amazon ES, Amazon S3, Amazon Athena.
  • 11. PROCESS / ANALYZE: Amazon EMR (Elastic MapReduce). Fully managed Hadoop cluster framework. Supports big data frameworks such as Hive, Impala, Presto, Spark, and more. EMR File System (EMRFS) allows EMR clusters to efficiently and securely use Amazon S3 for storage at any scale. Integrated with Amazon S3, RDS, Redshift, and any JDBC-compliant data store. On-demand and Spot pricing; pay as you go.
  • 12. PROCESS / ANALYZE: Amazon Redshift. Fully managed relational data warehouse. Massively parallel; petabyte scale. Data compression greatly reduces I/O. Columnar data storage designed for scale. $1,000/TB/year; starts at $0.25/hour. A lot faster, a lot simpler, a lot cheaper.
  • 13. PROCESS / ANALYZE: Amazon Kinesis. Managed service for real-time big data processing. Kinesis Streams: create streams to produce and consume data; elastically add and remove shards for performance and scale. Kinesis Firehose: easily load massive amounts of streaming data into S3 and Redshift. Kinesis Analytics: easily analyze data streams using standard SQL queries; elastically scales to match data throughput.
  • 14. PROCESS / ANALYZE: Amazon Athena. An interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. Serverless: no infrastructure or resources to manage at any scale. Schema on read: same data, many views.
  • 15. PROCESS / ANALYZE: AWS Glue. Data Catalog: Hive Metastore compatible with enhanced functionality; crawlers automatically extract metadata and create tables. Managed transform engine: auto-generates ETL code; built on open frameworks (Python and Spark). Job scheduler: runs jobs on a serverless Spark platform; massively scalable. Integrated with S3, RDS, EMR, Redshift, Athena, and any JDBC-compliant data store.
  • 16. COLLECT, STORE, PROCESS/ANALYZE, CONSUME (from data to answers).
  • 17. CONSUME: apps and services; API; Amazon QuickSight; analysis and visualization; notebooks.
  • 18. Putting It All Together
  • 19. CONSUME (analysis and visualization, notebooks, API): Amazon QuickSight, apps and services. PROCESS/ANALYZE (ETL; batch, interactive, stream, ML): Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda, Amazon Redshift, Amazon Machine Learning, Presto, Amazon EMR, Amazon EC2. COLLECT (logging, IoT, applications): mobile apps, web apps, enterprise apps, devices, sensors and IoT solutions, AWS IoT Analytics, Amazon CloudWatch, AWS CloudTrail. STORE (search, SQL, NoSQL, file, stream): Amazon Elasticsearch Service, Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon DynamoDB Streams, Amazon Redshift.
  • 20. Architectural Patterns
  • 21. Building Event-Driven Batch Analytics on AWS. Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data. Diagram components: staging data; input validation/conversion layer; pre-processed data; input tracking layer; Aggr job submission and monitoring layer; state management store; aggregation and load layer. Services shown: AWS Lambda, Identity and Access Management (IAM), CloudWatch (monitoring and logging), Amazon Redshift, Amazon EMR.
  • 22. Real-Time and Batch Analytics. Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data. Layers: speed layer, batch layer, serving layer. Diagram labels: raw data in Kinesis Firehose, raw data, filtered data, pre-processed views, user device settings. Services shown: Kinesis Firehose, Kinesis Analytics, S3 buckets, Amazon EMR, Athena, Amazon QuickSight.
  • 23. Extending Data Warehousing out to Exabytes. Amazon Redshift Spectrum extends data warehousing out to exabytes, with no loading required. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY… Diagram labels: JDBC/ODBC; Amazon Redshift; 1, 2, 3, 4, ..., N; Amazon S3 (exabyte-scale object storage); Data Catalog; Apache Hive Metastore.
  • 24. Data lake on Amazon S3 with AWS Glue. Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data. Services shown: AWS Glue ETL, Amazon QuickSight.
  • 25. Customer Data Lake Success Stories
  • 26. Netflix Uses S3 to Back its Various Clusters. Diagram labels: ETL, SLA, production; ad hoc, exploratory, test; 2200+ m1.xlarge; 2000+ m1.xlarge; bonus clusters, 3 x 150 m2.4xlarge; 12 am – 10 am, 250 m2.4xlarge; S3.
  • 27. Case Study: Re-architecting Compliance. “For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” (Steve Randich, CIO) What FINRA needed: • Infrastructure for its market surveillance platform • Support for analysis and storage of approximately 75 billion market events every day. Why they chose AWS: • Fulfillment of FINRA’s security requirements • Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3. Benefits realized: • Increased agility, speed, and cost savings • Estimated savings of $10-20 million annually by using AWS.
  • 28. Fraud Detection. FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
  • 29. Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands. More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes. Nasdaq technology is used to power more than 100 marketplaces in 50 countries. Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds. We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes and geographies.
  • 30. • Nasdaq implements an S3 data lake + Redshift data warehouse architecture • The most recent two years of data are kept in the Redshift data warehouse and snapshotted into S3 for disaster recovery • Data between two and five years old is kept in S3 • Presto on EMR is used for ad hoc queries on data in S3 • Transitioned from an on-premises data warehouse to an Amazon Redshift and S3 data lake architecture • Over 1,000 tables migrated • Average daily ingest of over 7B rows • Migrated off the legacy DW to AWS (start to finish) in 7 man-months • AWS costs were 43% of the legacy budget for the same data set (~1100 tables)
  • 31. • Amgen began implementing an S3 data lake on AWS in 2014 • It has been running in production since early 2015 • Now able to integrate all data sets together in one analytics platform, e.g. sales data, marketing data, manufacturing line data, patient population data, FDA public datasets • Rapid data experimentation • Enables new use cases and data innovations not previously possible • Leverages Amazon EMR and Amazon Redshift for the analytics layer around the lake • Leverages RStudio and SAS for the data science layer on top of EMR and Redshift • Uses EMR for the ETL layer • EMR is 50% faster and 30% cheaper than the legacy ETL solution • Amgen’s AWS S3 data lake won a Best Practices Award at Bio-IT World 2016 for ‘Real World Data Platform & Analytics’
  • 32. A Conceptual View of Amgen’s Data Lake. Data lake platform, common tools and capabilities: innovative new tools and capabilities are reused across all functions (e.g. search, data processing and storage, visualization tools). Functions can manage their own data while contributing to the common data layer. Business applications are built to meet specific information needs, from simple data access and data visualization to complex statistical/predictive models. Manufacturing data: Batch Genealogy Visualization, PD Self-Service Analytics, Instrument Bus Data Search. Real world data: FDA Sentinel Analytics, Epi Programmer’s Workbench, Patient Population Analytics. Commercial data: US Field Sales Reporting, Global Forecasting, US Marketing Analytics.
  • 33. Key Learnings: • Use Amazon S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse • Decoupled storage and compute is cheaper and more efficient to operate • Decoupled storage and compute allows us to evolve to clusterless architectures like Amazon Athena • Do not build data silos in Hadoop or the enterprise DW • Gain flexibility to use all the analytics tools in the ecosystem around Amazon S3 and future-proof the architecture
  • 34. Please complete the session survey
  • 35. Thank you!
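
The unit ladder and the IDC figures quoted on slide 3 can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the units are assumed to be decimal (SI) multiples as the slide implies, and the annual growth rate is derived here from the two IDC data points rather than stated in the deck.

```python
# Decimal (SI) storage units, as implied by slide 3:
# 1 ZB = 1,000 EB = 1 million PB = 1 billion TB.
TB_PER_PB = 1000
PB_PER_EB = 1000
EB_PER_ZB = 1000

# IDC "Digital Universe" figures quoted on slide 3.
IDC_2016_ZB = 16.1
IDC_2025_ZB = 163.0


def zb_to_tb(zettabytes):
    """Convert zettabytes to terabytes using decimal multiples."""
    return zettabytes * EB_PER_ZB * PB_PER_EB * TB_PER_PB


def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by a start value, an end value,
    and the number of years between them (a derived figure, not from the deck)."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    print(f"1 ZB = {zb_to_tb(1):,} TB")
    rate = annual_growth_rate(IDC_2016_ZB, IDC_2025_ZB, 2025 - 2016)
    print(f"Implied annual growth, 2016 to 2025: {rate:.1%}")
```

Under these assumptions, growing from 16.1 ZB to 163 ZB over nine years corresponds to compound growth of a little under 30% per year.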