Big Data@Scale_AWSPSSummit_Singapore
- 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Aditi Gupta, Solutions Architect, Amazon Web Services
Big Data@Scale
- 2.
Key Takeaways
1. Why big data?
2. How to do big data processing on AWS?
3. Architectural patterns
4. Customer data lake success stories
- 3.
Ever-Increasing Data
International Data Corporation (IDC): Digital Universe
2016 – 16.1 zettabytes (ZB)
2025 – 163 zettabytes (ZB)
Volume
Velocity
Variety
1 zettabyte = 1,000 exabytes = 1 million PB = 1 billion TB
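The unit ladder and the IDC figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units; the compound annual growth rate it prints is derived here from the two quoted figures and is not a number taken from the slide.

```python
# Decimal (SI) storage units, in bytes.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

# Unit ladder from the slide.
print(ZB // EB, "exabytes per zettabyte")
print(ZB // PB, "petabytes per zettabyte")
print(ZB // TB, "terabytes per zettabyte")

# IDC figures quoted on the slide.
zb_2016, zb_2025 = 16.1, 163.0
years = 2025 - 2016

growth_factor = zb_2025 / zb_2016
# Implied compound annual growth rate (derived, not an IDC figure).
cagr = growth_factor ** (1 / years) - 1
print(f"{growth_factor:.1f}x over {years} years, about {cagr:.0%} per year")
```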
- 4.
Big Data Processing @ Scale
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
data → answers
- 5.
COLLECT
Logging: Amazon CloudWatch, AWS CloudTrail
IoT: Devices, Sensors & IoT solutions, AWS IoT Analytics
Applications: Mobile apps, Web apps, Enterprise apps
- 6.
Getting Data into AWS
AWS Direct Connect
AWS Snowball
Amazon Kinesis Firehose
AWS Storage Gateway
- 7.
COLLECT → STORE
data → answers
- 8.
STORE
Search: Amazon Elasticsearch Service
NoSQL: Amazon DynamoDB
SQL: Amazon Redshift, Amazon RDS
File/Object Storage: Amazon S3
Streaming data (IoT / applications / device streams): Amazon Kinesis Firehose, Amazon Kinesis Streams, Apache Kafka, Amazon DynamoDB Streams
- 9.
COLLECT → STORE → PROCESS/ANALYZE
data → answers
- 10.
PROCESS / ANALYZE
Data Enrichment
Analyze: Batch, Interactive, Streaming
Extract, Transform, Load (ETL)
Data Lake
Amazon EMR, Amazon Kinesis, AWS Glue
Amazon EMR, Amazon Kinesis, Amazon QuickSight, Amazon Redshift*
Amazon ES, Amazon EMR, Amazon S3, Amazon Athena
Amazon EMR, AWS Glue, Amazon Redshift*
- 11.
PROCESS / ANALYZE
Amazon EMR (Elastic MapReduce)
Fully managed Hadoop cluster framework
Supports big data frameworks such as Hive, Impala, Presto, Spark and more
EMR File System (EMRFS) allows EMR clusters to efficiently and securely use Amazon S3 for storage at any scale
Integrated with Amazon S3, RDS, Redshift & any JDBC-compliant data store
On-demand and Spot pricing; pay as you go
- 12.
PROCESS / ANALYZE
Amazon Redshift
Fully managed relational data warehouse
Massively parallel; petabyte scale
Data Compression reduces I/O massively
Columnar data storage designed for scale
$1,000/TB/Year; starts at $0.25/hour
a lot faster
a lot simpler
a lot cheaper
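To put the two price points on this slide side by side, the following sketch annualizes the hourly entry price and scales the per-terabyte figure. The rates are the slide's 2018 numbers used as defaults; the function names and the 50 TB example are illustrative assumptions, and real pricing depends on node type, region, and purchase terms.

```python
HOURS_PER_YEAR = 24 * 365


def entry_node_cost_per_year(hourly_rate=0.25):
    """Cost of running the smallest configuration continuously for one year."""
    return hourly_rate * HOURS_PER_YEAR


def warehouse_cost_per_year(terabytes, rate_per_tb_year=1000.0):
    """Cost of a warehouse of the given size at the quoted per-TB rate."""
    return terabytes * rate_per_tb_year


print(f"Entry node, always on: ${entry_node_cost_per_year():,.0f} per year")
print(f"50 TB warehouse: ${warehouse_cost_per_year(50):,.0f} per year")
```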
- 13.
PROCESS / ANALYZE
Amazon Kinesis
Managed service for real-time big data processing
Kinesis Streams
Create streams to produce & consume data
Elastically add and remove shards for performance and scale
Kinesis Firehose
Easily load massive amounts of streaming data into S3 and Redshift
Kinesis Analytics
Easily analyze data streams using standard SQL queries
Elastically scales to match data throughput
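Adding shards raises throughput because each record's partition key is hashed with MD5 into a 128-bit value, and each shard owns a contiguous range of that hash space. The sketch below is a local simulation of that mapping, not a call to the Kinesis API; it assumes the hash space is split evenly (real streams can have uneven ranges after splits and merges), and the `device-N` keys are made up for illustration.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 yields a 128-bit value


def shard_ranges(shard_count):
    """Split the hash key space into contiguous, equally sized ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the space is fully covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all ranges")


keys = [f"device-{n}" for n in range(1000)]  # hypothetical partition keys
loads = {}
for count in (2, 4):
    ranges = shard_ranges(count)
    load = [0] * count
    for key in keys:
        load[shard_for(key, ranges)] += 1
    loads[count] = load
    print(count, "shards:", load)
```

Doubling the shard count roughly halves the number of keys each shard receives, which is the scaling effect the slide refers to.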
- 14.
PROCESS / ANALYZE
Amazon Athena
An interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.
Serverless – No infrastructure or resources to manage at any scale
Schema on read – Same data, many views
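"Schema on read" means the stored bytes carry no fixed schema; each query applies its own column names and types when it reads them. The sketch below illustrates the idea in plain Python rather than with Athena itself. The log lines, column names, and the `read_with_schema` helper are all hypothetical.

```python
import csv
import io

# Raw data as it might sit in an object store: no header, no types.
RAW = """2018-03-01,sg-web-1,200,512
2018-03-01,sg-web-2,500,128
2018-03-02,sg-web-1,200,2048
"""


def read_with_schema(raw_text, schema):
    """Apply (name, converter) pairs to each row at read time; None skips a column."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, convert), value in zip(schema, fields):
            if convert is not None:
                row[name] = convert(value)
        rows.append(row)
    return rows


# Two views over the same bytes.
traffic_view = [("day", str), ("host", str), ("status", None), ("bytes", int)]
error_view = [("day", None), ("host", str), ("status", int), ("bytes", None)]

total_bytes = sum(r["bytes"] for r in read_with_schema(RAW, traffic_view))
errors = [r["host"] for r in read_with_schema(RAW, error_view) if r["status"] >= 500]
print(total_bytes, errors)
```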
- 15.
PROCESS / ANALYZE
AWS Glue
Data Catalog
Hive Metastore compatible with enhanced functionality
Crawlers automatically extract metadata and create tables
Managed Transform Engine
Auto-generates ETL code
Built on open frameworks – Python and Spark
Job Scheduler
Runs jobs on a serverless Spark platform; massively scalable
Integrated with S3, RDS, EMR, Redshift, Athena & any JDBC-compliant data store
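The crawler step, which scans data and produces a table definition, can be pictured with a toy schema-inference routine. This is only a conceptual stand-in: it is not how Glue's classifiers work internally, and the type names, sample records, and function names are assumptions chosen for illustration.

```python
def infer_type(values):
    """Return the narrowest of bigint, double, string that fits every value."""
    for type_name, parse in (("bigint", int), ("double", float)):
        try:
            for value in values:
                parse(value)
            return type_name
        except ValueError:
            continue
    return "string"


def crawl(records):
    """Build a column -> type mapping from a list of dicts of raw strings."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}


# Hypothetical records as they might be read from raw files.
sample = [
    {"order_id": "1001", "amount": "19.90", "country": "SG"},
    {"order_id": "1002", "amount": "5", "country": "MY"},
]
table = crawl(sample)
print(table)
```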
- 16.
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
data → answers
- 17.
CONSUME
Apps & Services
API
Amazon QuickSight
Analysis and Visualization Notebooks
- 18.
Putting It All Together
- 19.
COLLECT
Applications: Mobile apps, Web apps, Enterprise apps
IoT: Devices, Sensors & IoT solutions, AWS IoT Analytics
Logging: Amazon CloudWatch, AWS CloudTrail
STORE
Search: Amazon Elasticsearch Service
SQL: Amazon RDS, Amazon Redshift
NoSQL: Amazon DynamoDB
File: Amazon S3
Stream: Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
PROCESS/ANALYZE
Batch, Interactive, Stream, ML: Amazon Redshift, Presto, Amazon EMR, Amazon Machine Learning, Amazon EC2
Streaming: Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda
ETL
CONSUME
Amazon QuickSight
Apps & Services
Analysis & visualization, Notebooks, API
- 20.
Architectural Patterns
- 21.
Building Event-Driven Batch Analytics on AWS
Your data: On-premises data, Web app data, Amazon RDS, Other databases, Streaming data
Staging data
Input validation/conversion layer
Pre-processed data
Input tracking layer
Aggregation job submission and monitoring layer
Aggregation and load layer
State management store
Services: AWS Lambda, Amazon EMR, Amazon Redshift
Identity and Access Management (IAM)
Monitoring and logging (CloudWatch)
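The input tracking and job submission layers in this pattern boil down to a small piece of state: which inputs have arrived for a batch, and whether its aggregation job has already been submitted. The sketch below keeps that state in memory. The rule "submit once every required source has arrived" is an assumption made for illustration, as are the class, method, and source names; in the architecture on the slide this logic would run in AWS Lambda against the state management store.

```python
class InputTracker:
    """In-memory stand-in for the state management store."""

    def __init__(self, required_sources):
        self.required = set(required_sources)
        self.arrived = {}       # batch_id -> set of sources seen so far
        self.submitted = set()  # batch_ids already sent for aggregation

    def on_file_validated(self, batch_id, source):
        """Record one pre-processed file; return True when the job should be submitted."""
        seen = self.arrived.setdefault(batch_id, set())
        seen.add(source)
        ready = self.required <= seen and batch_id not in self.submitted
        if ready:
            self.submitted.add(batch_id)  # guard against duplicate submissions
        return ready


tracker = InputTracker(["orders", "payments", "inventory"])  # hypothetical sources
events = [
    ("2018-03-01", "orders"),
    ("2018-03-01", "payments"),
    ("2018-03-01", "orders"),     # duplicate delivery
    ("2018-03-01", "inventory"),  # last missing input arrives
    ("2018-03-01", "inventory"),  # late duplicate must not resubmit
]
decisions = [tracker.on_file_validated(batch, source) for batch, source in events]
print(decisions)
```

Tracking the submitted set separately from the arrived set is what makes the handler safe to invoke more than once for the same file.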
- 22.
Real-Time and Batch Analytics
Your data: On-premises data, Web app data, Amazon RDS, Other databases, Streaming data
Raw data in: Kinesis Firehose
Batch Layer: Raw data, S3 buckets, Amazon EMR
Speed Layer: Kinesis Firehose, Kinesis Analytics, User device settings
Serving Layer: Pre-processed views, Filtered data, S3 buckets, Amazon Athena, Amazon QuickSight
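The three layers on this slide follow the usual batch, speed, and serving split: the batch layer periodically recomputes views from all raw data, the speed layer covers whatever arrived since the last batch run, and the serving layer merges the two at query time. A minimal sketch of that merge, using made-up device events and an integer timestamp as the batch cutoff:

```python
from collections import Counter


def batch_view(raw_events, cutoff):
    """Recompute counts from all raw events up to the batch cutoff."""
    return Counter(e["device"] for e in raw_events if e["ts"] <= cutoff)


def speed_view(raw_events, cutoff):
    """Count only events that arrived after the last batch run."""
    return Counter(e["device"] for e in raw_events if e["ts"] > cutoff)


def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch + speed


# Hypothetical events; "ts" is a simple logical timestamp.
events = [
    {"ts": 1, "device": "thermostat"},
    {"ts": 2, "device": "camera"},
    {"ts": 3, "device": "thermostat"},
    {"ts": 5, "device": "thermostat"},
    {"ts": 6, "device": "camera"},
]
cutoff = 3
merged = serve(batch_view(events, cutoff), speed_view(events, cutoff))
print(dict(merged))
```

Because the two views partition the events at the cutoff, the merged result equals a full recount, but only the small speed view has to be kept up to date in real time.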
- 23.
Extending Data Warehousing out to Exabytes
Amazon Redshift Spectrum extends data warehousing out to exabytes, no loading required
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
Amazon Redshift (JDBC/ODBC)
1 2 3 4 ... N
Amazon S3: Exabyte-scale object storage
Data Catalog: Apache Hive Metastore
- 24.
Data lake on Amazon S3 with AWS Glue
Your data: On-premises data, Web app data, Amazon RDS, Other databases, Streaming data
AWS Glue ETL
Amazon QuickSight
- 25.
Customer Data Lake Success Stories
- 26.
ETL, SLA, Production: 2200+ m1.xlarge
Ad hoc, Exploratory, Test: 2000+ m1.xlarge
bonus clusters
3 x 150 m2.4xlarge
S3
12 am – 10 am
250 m2.4xlarge
Netflix Uses S3 to Back its Various Clusters
- 27.
Case Study: Re-architecting Compliance
“For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.”
- Steve Randich, CIO
What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75 billion market events every day
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20 million annually by using AWS
- 28.
Fraud Detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
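For a sense of scale, the daily event volume can be converted to a sustained per-second rate. The figure below is an average derived from the slide's peak daily number; it assumes events are spread evenly over 24 hours, which real market traffic is not.

```python
EVENTS_PER_DAY = 75_000_000_000  # peak daily volume quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if the events were spread evenly across the day.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{events_per_second:,.0f} events per second on average")
```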
- 29.
Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands
More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes
Nasdaq technology is used to power more than 100 marketplaces in 50 countries
Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds
We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes & geographies
- 30.
• Nasdaq implements an S3 data lake + Redshift data warehouse architecture
• Most recent two years of data is kept in the Redshift data warehouse and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used for ad hoc queries on data in S3
• Transitioned from an on-premises data warehouse to an Amazon Redshift & S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of legacy budget for the same data set (~1,100 tables)
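The age-based tiering described in these bullets can be written as a small routing rule. This is a sketch of the policy as stated on the slide, not Nasdaq's implementation; the function name, tier labels, and the 365.25-day year are assumptions, and the slide does not say what happens to data older than five years.

```python
from datetime import date


def storage_tier(record_date, today):
    """Pick the tier described on the slide from a record's age in years."""
    age_years = (today - record_date).days / 365.25
    if age_years < 2:
        return "redshift"    # recent data, also snapshotted to S3 for DR
    if age_years <= 5:
        return "s3"          # queried ad hoc with Presto on EMR
    return "unspecified"     # the slide does not cover data older than five years


today = date(2018, 4, 1)  # arbitrary reference date for the example
for d in (date(2017, 6, 30), date(2015, 1, 15), date(2011, 1, 1)):
    print(d, storage_tier(d, today))
```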
- 31.
• Began implementing an S3 data lake on AWS in 2014
• Has been running in production since early 2015
• Now able to integrate all data sets together in one analytics platform, e.g. sales data, marketing data, manufacturing line data, patient population data, FDA public datasets
• Rapid data experimentation
• Enables new use cases & data innovations not previously possible
• Leverages Amazon EMR and Amazon Redshift for their analytics layer around the lake
• Leverages RStudio and SAS for the data science layer on top of EMR and Redshift
• They use EMR for their ETL layer
• EMR is 50% faster & 30% cheaper than their legacy ETL solution
• Amgen’s AWS S3 data lake won a Best Practices Award at Bio-IT World 2016 for ‘Real World Data Platform & Analytics’
Bio-IT World Best Practices Awards 2016 Winner
- 32.
A CONCEPTUAL VIEW OF AMGEN’S DATA LAKE
DATA LAKE PLATFORM
Common Tools and Capabilities
Innovative new tools and capabilities are reused across all functions (e.g. search, data processing & storage, visualization tools)
Functions can manage their own data, while contributing to the common data layer
Business applications are built to meet specific information needs, from simple data access, data visualization, to complex statistical/predictive models
Manufacturing Data: Batch Genealogy Visualization, PD Self-Service Analytics, Instrument Bus Data Search
Real World Data: FDA Sentinel Analytics, Epi Programmer’s Workbench, Patient Population Analytics
Commercial Data: US Field Sales Reporting, Global Forecasting, US Marketing Analytics
- 33.
Key Learnings
• Use Amazon S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allow us to evolve to clusterless architectures like Amazon Athena
• Do not build data silos in Hadoop or the Enterprise DW
• Gain flexibility to use all the analytics tools in the ecosystem around Amazon S3 & future-proof the architecture
- 34.
Please complete the session survey
- 35.
Thank you!