© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Bob Griffiths, AWS Solutions Architect Manager
John Hitchingham, FINRA Engineering
August 14, 2017
FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Overview of Big Data Services
What is big data?
When your data sets become so large and complex
you have to start innovating around how to
collect, store, process, analyze, and share them.
Big Data services on AWS
Categories: Collect, Store, Process & Analyze
Services: AWS Import/Export, AWS Direct Connect, Amazon Kinesis, Amazon EMR, Amazon EC2, Amazon Glacier, Amazon S3, Amazon Machine Learning, Amazon Redshift, Amazon DynamoDB, Amazon Kinesis Analytics, Amazon QuickSight, AWS Database Migration Service, AWS Data Pipeline, Amazon RDS, Amazon Aurora, Amazon Elasticsearch Service, Amazon Athena, AWS Glue, AWS Snowball
Scale as your data and business grow
The volume, variety, and velocity at which data is being generated are
leaving organizations with new questions to answer, such as:
Data Lake
Central Storage: Secure, cost-effective storage in Amazon S3
Data Ingestion: Get your data into S3 quickly and securely (Kinesis Firehose, Direct Connect, Snowball, Database Migration Service)
Catalog & Search: Access and search metadata (DynamoDB, Elasticsearch Service)
Access & User Interface: Give your users easy and secure access (API Gateway, Directory Service, Cognito)
Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding (Athena, QuickSight, EMR, Amazon Redshift)
Protect & Secure: Use entitlements to ensure data is secure and users’ identities are verified (IAM, CloudWatch, CloudTrail, KMS)
Store and analyze all your data—structured and
unstructured—from all of your sources, in one centralized
location at low cost.
Quickly ingest data without needing to force it into a
predefined schema, enabling ad-hoc analysis by applying
schemas on read, not write.
Separating your storage and compute allows you to scale
each component as required and attach multiple data
processing and analytics services to the same data set.
Scale
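Schema on read, as described above, can be illustrated with a minimal sketch. The record layout, the two schemas, and the `apply_schema` helper are hypothetical and not from the talk; the point is only that raw data is stored untouched and a schema is applied when the data is read.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
RAW_LINES = [
    '{"symbol": "ABC", "qty": "100", "price": "10.50"}',
    '{"symbol": "XYZ", "qty": "25", "price": "99.10", "venue": "X1"}',
]

# Two hypothetical read-time schemas over the same stored data.
TRADE_SCHEMA = {"symbol": str, "qty": int, "price": float}
VENUE_SCHEMA = {"symbol": str, "venue": str}


def apply_schema(line, schema):
    """Parse one raw line and cast only the fields the schema asks for."""
    record = json.loads(line)
    return {
        name: cast(record[name]) if name in record else None
        for name, cast in schema.items()
    }


# The same stored lines serve two different analyses.
trades = [apply_schema(line, TRADE_SCHEMA) for line in RAW_LINES]
venues = [apply_schema(line, VENUE_SCHEMA) for line in RAW_LINES]
notional = sum(t["qty"] * t["price"] for t in trades)
```

Because nothing was discarded at ingest, a later analysis can read the `venue` field even though the first schema never mentioned it.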
Use only the services you need
Scale only the services you need
Pay only for what you use
Discounts through Reserved Instances, Spot Instances, and upfront commitments
Cost
Visibility/control of all APIs and retrievals
Encryption of all data at each step
Store an exabyte of data or more in Amazon S3
Analyze GB to PB using standard tools
Control egress and ingress points using VPCs
Security
and scale
Big data does not mean just batch
• Can be streamed in
• Processed in real time
• Can be used to respond quickly to requests
and actionable events, generate business
value
You can mix and match
• On-premises and cloud
• Custom development and managed services
Agility & actionable insights
FINRA’s Managed Data Lake
FINRA’s experience
In order to solve its market regulation challenges, over the past three years,
FINRA’s Technology team has pioneered a managed cloud service to
operate big data workloads and perform analytics at large scale.
The results of FINRA’s innovations have been significant:
• A 30% operating cost reduction, in both labor and infrastructure
• A 5x increase in operational resiliency
• The business is able to perform analytics at an unprecedented scale and depth
To achieve these gains and operate its big data ecosystem, FINRA
Technology has built a set of cutting edge tools, processes, and know-how.
Legacy pain points – infrastructure and ops
Did not scale well as volumes and
workloads increased
Duplication of effort in data management
(data lifecycle, retention, versioning, etc.)
Data sync issues – manual effort to keep
data in sync
Costly system maintenance and
upgrades
Legacy pain points – analytics and data science
Business Analysts
Data Scientists
Data Analysts
Data Engineers
Ops
What data do we have?
What format is it in?
Where do I get it?
Get this data for them…
Not on disk – pull from tape
Wait for tapes
from offsite
Prepare & Format
Oops, I need more data … Repeat!
I need data in different format …
Repeat!
etc…, etc…
Summary of cloud drivers
• Fast-growing data volumes YoY
• High cost of pre-building for peak
• Escalating costs of in-house technology infrastructure
• Long time-to-market for finding insights in data
• Appliance platforms were facing obsolescence and end-of-life as
a result of new big data technologies
Keep spending more on legacy infrastructure or
redirect dollars to core business of regulation?
FINRA cloud program business objectives
• Discover data easily
• Access (all the) data easily
• Increase the power of analytic tools
• Make data processing resilient
• Make data processing cost effective
Could this be achieved in the cloud?
Cloud architectural principles
Manage Data
Consistently
• Define, store and share our
data as an enterprise asset
• All data should be enabled
for analytics
• Protect data in a holistic
manner (data at rest and
data in transit)
Integrate our
Portfolio
• Shared solutions for
common business
processes across the
organization
• All "business" data in cloud
will be tracked by a
centralized Data
Management System so that
FINRA can manage the data
lifecycle in a productive and
cost effective manner
• All FINRA-developed
applications will have
service interfaces
Operational
Resiliency
• Multi-AZ components and
fail-over
• Auto-scaling and load
balancing to achieve high
availability
• No logon to servers or
services for routine
operations
• Applications should include
automated operations jobs
to handle known failure
scenarios, recovery, data
issues, and notifications.
From data puddles to Data Lake
FINRA in Data Center (silos): Database1, Database2, … Databasen, each with its own Storage, Query/Compute, and Catalog
FINRA in AWS (scales):
Storage: Amazon S3
Query/Compute: EMR Spark, Lambda, EMR Presto, EMR HBase
Catalog: herd, Hive metastore
Data processing stream on Data Lake
Sources: Broker Dealers, Exchanges, 3rd Party Providers (data files)
Stages: Ingest, Validation, ETL (Normalize, Enrich, Reformat), Analytics (Automated Surveillance patterns, human analytics), all on shared Catalog & Storage
Users: Analyst, Data Scientist, Regulatory User
• Centralized Catalog
• 100s of EMR clusters
• As many Lambda functions as needed
Power of parallelization
ETL Job1
Input Result
ETL Job2
Input Result
ETL Jobn
Input Result
Workloads run in parallel for workload isolation to meet SLAs
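The parallel pattern above can be sketched in a few lines. This is a minimal illustration only: the job names, inputs, and the `run_etl_job` function are hypothetical, and a local thread pool stands in for the separate compute that each workload would get.

```python
from concurrent.futures import ThreadPoolExecutor


def run_etl_job(job_name, rows):
    """Hypothetical ETL step: normalize each input row and return the result."""
    return job_name, [row.strip().upper() for row in rows]


# Each job has its own input and produces its own result.
JOB_INPUTS = {
    "etl_job_1": [" abc ", "def"],
    "etl_job_2": ["ghi "],
    "etl_job_n": [" jkl", " mno "],
}

# One worker per job, so no job waits behind another.
with ThreadPoolExecutor(max_workers=len(JOB_INPUTS)) as pool:
    futures = [
        pool.submit(run_etl_job, name, rows) for name, rows in JOB_INPUTS.items()
    ]
    results = dict(future.result() for future in futures)
```

Since the jobs share no state, adding a job only adds a worker; it does not lengthen the others.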
Processing scales to meet demand
[Figure: Daily Order Volume (Billions), 11/1 to 11/29; y-axis 0.00 to 5.00]
[Figure: AWS EMR compute on EC2; compute nodes (y-axis 0 to 12000) by hour of day, 2016-10-17 to 2016-10-24]
Catalog for centralized data management
http://finraos.github.io/herd
Unified catalog
• Schemas
• Versions
• Encryption type
• Storage policies
Lineage and Usage
• Track publishers and consumers
• Easily identify jobs and derived data sets
Shared Metastore
• Common definition of tables and
partitions
• Use with Spark, Presto, Hive, etc.
• Faster instantiation of clusters
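The catalog responsibilities listed above (schemas, versions, encryption type, storage policies, publishers, consumers, derived data sets) can be pictured as a small data structure. The sketch below is an assumption for illustration only: the class names, fields, and sample values are hypothetical and are not the herd API.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One registered version of a data set (hypothetical fields)."""
    name: str
    version: int
    schema: dict            # column name -> type name
    encryption_type: str
    storage_policy: str
    publisher: str
    consumers: set = field(default_factory=set)
    derived_from: list = field(default_factory=list)  # (name, version) parents


class Catalog:
    def __init__(self):
        self._entries = {}  # (name, version) -> CatalogEntry

    def register(self, entry):
        self._entries[(entry.name, entry.version)] = entry

    def latest(self, name):
        versions = [v for (n, v) in self._entries if n == name]
        return self._entries[(name, max(versions))]

    def record_use(self, name, version, consumer):
        self._entries[(name, version)].consumers.add(consumer)

    def derived_data_sets(self, name):
        """Names of data sets whose lineage includes `name`."""
        return sorted({
            e.name for e in self._entries.values()
            if any(parent == name for parent, _ in e.derived_from)
        })


# Sample values below are made up for the example.
catalog = Catalog()
catalog.register(CatalogEntry(
    "orders", 1, {"order_id": "bigint"}, "KMS", "archive-after-1y", "ingest-job"))
catalog.register(CatalogEntry(
    "orders", 2, {"order_id": "bigint", "trade_date": "date"},
    "KMS", "archive-after-1y", "ingest-job"))
catalog.register(CatalogEntry(
    "order_summary", 1, {"trade_date": "date", "cnt": "bigint"},
    "KMS", "archive-after-1y", "etl-job", derived_from=[("orders", 2)]))
catalog.record_use("orders", 2, "etl-job")
```

Keeping lineage on the entry itself is what makes it cheap to answer "which derived data sets must be reprocessed if this input was wrong?".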
Catalog and the Data Lake ecosystem
Components: Data Catalog, Data Catalog UI, Hive Metastore, Object Storage (S3)
Users: Analyst, Data Scientist (Explore, Use)
The Data Catalog knows each object/file in Object Storage (S3).
Processing (Validation, ETL, Machine Surveillance; Lambda, EMR): a custom handler requests object info, results are stored, and a custom handler registers the output
Interactive Analytics (EMR, Redshift Spectrum): get object info (optl. DDL), get DDL, query (w/ DDL)
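The "get DDL" step can be pictured as rendering a table definition from catalog metadata so that a query engine can read the files in place. The sketch below is an assumption, not the catalog's actual output: the `build_hive_ddl` helper, the column list, and the bucket name are hypothetical; only the general shape of a Hive `CREATE EXTERNAL TABLE` statement is standard.

```python
def build_hive_ddl(table, columns, partitions, file_format, location):
    """Render a Hive-style CREATE EXTERNAL TABLE statement from metadata."""
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    parts = ", ".join(f"`{name}` {type_}" for name, type_ in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{location}'"
    )


ddl = build_hive_ddl(
    table="table_1",
    columns=[("col1", "string"), ("col5", "double")],
    partitions=[("trade_date", "date")],
    file_format="ORC",
    location="s3://example-bucket/table_1/",  # hypothetical bucket
)
print(ddl)
```

Because the table is external, dropping it in one cluster leaves the objects in S3 untouched, which is what lets several clusters share one copy of the data.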
Analytics – one-stop shop for data
Data
Analyst
Data
Scientist
JDBC
Client
JDBC
Client
Table 1
Table 2
AuthN
AuthZ
Metastore
Table N
Logical “Database”
Achieve interactive query speed with Data Lake
Query | Table size (rows) | Output size (rows) | ORC | TXT/BZ2
select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date) | 2469171608 | 1 | 4s | 1m56s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2469171608 | 12 | 3s | 1m51s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2469171608 | 8364 | 5s | 2m5s
select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0 | 2469171608 | 760 | 10s | 2m3s
Test Config:
Presto 0.167.0.6t (Teradata) On EMR
Data on S3 (external tables)
Cluster size: 60 worker nodes x r4.4xlarge
Key points:
Use ORC (or Parquet) for performant queries
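To put the ORC versus TXT/BZ2 timings listed above on one scale, the short sketch below converts each duration to seconds and computes the ratio. The `to_seconds` helper is illustrative; the timings are copied from the four query rows.

```python
import re


def to_seconds(text):
    """Convert a duration such as '1m56s' or '4s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


# (ORC, TXT/BZ2) timings for the four queries, in the order listed.
TIMINGS = [("4s", "1m56s"), ("3s", "1m51s"), ("5s", "2m5s"), ("10s", "2m3s")]

speedups = [to_seconds(txt) / to_seconds(orc) for orc, txt in TIMINGS]
for (orc, txt), ratio in zip(TIMINGS, speedups):
    print(f"ORC {orc} vs TXT/BZ2 {txt}: {ratio:.1f}x faster")
```

On these four queries ORC comes out roughly 12x to 37x faster than bzip2-compressed text on the same cluster.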
Grow the data store with no work
[Figure: Main Production Data Store (bucket on S3); size in PB, y-axis 0 to 4.5]
• Data footprint grows seamlessly
• All data is accessible for interactive (or batch) query from the moment it is stored
Or scale out with multiple clusters…
User A JDBC Client
Table 1
Table 2
AuthN
Metastore
Table N
Logical “Database”
User B JDBC Client
JDBC App
Cluster A
Cluster B
Cluster C
Still One Copy
Of Data!
Data needs for data science and ML
• Allow discovery & exploration
• Bring disparate sources of data together
• Allow users to focus on the problem, not the infrastructure
• Safeguard information with a high degree of security and
least-privilege access
A single way to access all of the data
Before Data Lake: Data Scientist, with a Data Engineer for each Logical Data Repository (1, 2, … N)
Data Lake: Data Scientist with a single Logical Data Repository
Accelerate discovery through self-service
Data science on the Data Lake
Data
Scientist
JDBC
Client
Logical ‘Database’
EMR Cluster
Still one copy
of data!
Spark Cluster
DS-in-a-box
AuthN
Data
Scientist
Notebook
Interface
Data
Scientist
Catalog
Notebook or
Shell
Universal Data Science Platform (UDSP)
• Environment (EC2) for each Data Scientist
• Simple provisioning interface
• Right instance (memory or GPU) for the job
• Access to all the data in the Data Lake
• Shut off when not in use for savings
• Secure (LDAP AuthN/Z + encryption)
Data
Scientist
UDSP – Inventory – not just R
• R 3.2.5, Python (2.7.12 and 3.4.3)
• Packages
• R: 300+ Python: 100+
• Tools for Building Packages
• gcc, gfortran, make, java, maven, ant…
• IDEs
• Jupyter, RStudio Server
• Deep Learning
• CUDA, cuDNN (if GPU present)
• Theano, Caffe, Torch
• TensorFlow
Some business benefits with Data Lake
• Market volume changes are no longer disruptive technology events
• Regulatory analysts can now interactively analyze 1000x more market events (billions of rows vs. millions before)
• Easily reprocess data if there are upstream data errors: finding capacity used to take weeks; now it can be done in a day or a few days
• Querying order route detail went from 10s of minutes to seconds
• Quicker turnaround to provide data for oversight
• Machine Learning model development is easier
Want to hear more?
Feel free to contact me:
john.hitchingham@finra.org
Thank you!

FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Bob Griffiths, AWS Solutions Architect Manager John Hitchingham, FINRA Engineering August 14, 2017 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
  • 2. Overview of Big Data Services
  • 3. What is big data? When your data sets become so large and complex you have to start innovating around how to collect, store, process, analyze, and share them.
  • 4. Collect AWS Import/Export AWS Direct Connect Amazon Kinesis Amazon EMR Amazon EC2 Process & Analyze Amazon Glacier Amazon S3 Store Amazon Machine Learning Amazon Redshift Amazon DynamoDB Amazon Kinesis Analytics Amazon QuickSight AWS Database Migration Service AWS Data Pipeline Amazon RDS, Amazon Aurora Big Data services on AWS Amazon Elasticsearch Service Amazon Athena AWS Glue AWS Snowball
  • 5. Scale as your data and business grows The volume, variety, and velocity at which data is being generated are leaving organizations with new questions to answer, such as:
  • 6. Data Lake • Central Storage: secure, cost-effective storage in Amazon S3 • Data Ingestion: get your data into S3 quickly and securely (Kinesis Firehose, Direct Connect, Snowball, Database Migration Service) • Catalog & Search: access and search metadata (DynamoDB, Elasticsearch Service) • Access & User Interface: give your users easy and secure access (API Gateway, Directory Service, Cognito) • Processing & Analytics: use of predictive and prescriptive analytics to gain better understanding (Athena, QuickSight, EMR, Amazon Redshift) • Protect & Secure: use entitlements to ensure data is secure and users’ identities are verified (IAM, CloudWatch, CloudTrail, KMS)
  • 7. Scale • Store and analyze all your data (structured and unstructured) from all of your sources, in one centralized location at low cost. • Quickly ingest data without needing to force it into a predefined schema, enabling ad-hoc analysis by applying schemas on read, not write. • Separating your storage and compute allows you to scale each component as required and attach multiple data processing and analytics services to the same data set.
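The schema-on-read idea from this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FINRA's or AWS's implementation: the sample records, the field names, and the `read_with_schema` helper are all hypothetical. The raw data is stored untouched, and each reader applies its own schema when it reads.

```python
import csv
import io
from datetime import date

# Raw data lands in storage as-is; no schema is enforced on write.
# (Hypothetical sample records, for illustration only.)
RAW = "2016-08-09,ABC,100\n2016-08-09,XYZ,250\n2016-08-10,ABC,75\n"

def read_with_schema(raw_text, schema):
    """Apply a schema (a list of (name, converter) pairs) while reading."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, record)})
    return rows

# Two readers can interpret the same stored bytes differently.
typed_schema = [("trade_date", date.fromisoformat), ("symbol", str), ("quantity", int)]
string_schema = [("trade_date", str), ("symbol", str), ("quantity", str)]

typed_rows = read_with_schema(RAW, typed_schema)
string_rows = read_with_schema(RAW, string_schema)
total_quantity = sum(row["quantity"] for row in typed_rows)
```

Because nothing is converted on write, a new consumer can add or change a schema later without reloading the stored data.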
  • 8. Cost • Use only the services you need • Scale only the services you need • Pay only for what you use • Discounts through Reserved Instances, instance types including Spot, and upfront commitments
  • 9. Security and scale • Visibility/control of all APIs and retrievals • Encryption of all data at each step • Store an exabyte of data or more in Amazon S3 • Analyze GB to PB using standard tools • Control egress and ingress points using VPCs
  • 10. Agility & actionable insights. Big data does not mean just batch: • Can be streamed in • Processed in real time • Can be used to respond quickly to requests and actionable events and to generate business value. You can mix and match: • On-premises and cloud • Custom development and managed services
  • 12. FINRA’s experience. To solve its market regulation challenges, over the past three years FINRA’s Technology team has pioneered a managed cloud service to operate big data workloads and perform analytics at large scale. The results of FINRA’s innovations have been significant: • A 30% operating cost reduction, in both labor and infrastructure • A 5x increase in operational resiliency • The business is able to perform analytics at an unprecedented scale and depth. To achieve these gains and operate its big data ecosystem, FINRA Technology has built a set of cutting-edge tools, processes, and know-how.
  • 13.
  • 14. Legacy pain points – infrastructure and ops • Did not scale well as volumes and workloads increased • Duplication of effort in data management (data lifecycle, retention, versioning, etc.) • Data sync issues – manual effort to keep data in sync • Costly system maintenance and upgrades
  • 15. Legacy pain points – analytics and data science Business Analysts Data Scientists Data Analysts Data Engineers Ops What data do we have? What format is it in? Where do I get it? Get this data for them… Not on disk – pull from tape Wait for tapes from offsite Prepare & Format Oops, I need more data … Repeat! I need data in a different format … Repeat! etc…, etc…
  • 16. Summary of cloud drivers • Fast-growing data volumes YoY • High cost of pre-building for peak • Escalating costs of in-house technology infrastructure • Long time-to-market for finding insights in data • Appliance platforms were facing obsolescence and end-of-life as a result of new big data technologies. Keep spending more on legacy infrastructure, or redirect dollars to the core business of regulation?
  • 17. FINRA cloud program business objectives • Discover data easily • Access (all the) data easily • Increase the power of analytic tools • Make data processing resilient • Make data processing cost effective Could this be achieved in the cloud?
  • 18. Cloud architectural principles Manage Data Consistently • Define, store and share our data as an enterprise asset • All data should be enabled for analytics • Protect data in a holistic manner (data at rest and data in transit) Integrate our Portfolio • Shared solutions for common business processes across the organization • All "business" data in cloud will be tracked by a centralized Data Management System so that FINRA can manage the data lifecycle in a productive and cost effective manner • All FINRA-developed applications will have service interfaces Operational Resiliency • Multi-AZ components and fail-over • Auto-scaling and load balancing to achieve high availability • No logon to servers or services for routine operations • Applications should include automated operations jobs to handle known failure scenarios, recovery, data issues, and notifications.
  • 19. From data puddles to Data Lake Database 1 Storage Query/Compute Catalog Database 2 Storage Query/Compute Catalog Database n Storage Query/Compute Catalog Storage Query/Compute Catalog EMR Spark Lambda EMR Presto EMR HBase herd Hive metastore FINRA in Data Center FINRA in AWS Scales Silo Amazon S3
  • 20. Data processing stream on Data Lake Catalog & Storage ETL Normalize, Enrich, Reformat Human Analytics Validation Ingest Broker Dealers Exchanges 3rd Party Providers Data Files Analyst Data Scientist Regulatory User • Centralized Catalog • 100s of EMR clusters • As many Lambda functions as needed Patterns Automated Surveillance
  • 21. Power of parallelization ETL Job1 Input Result ETL Job2 Input Result ETL Jobn Input Result Workloads run in parallel for workload isolation to meet SLAs
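The pattern on this slide, where independent ETL jobs each take their own input and produce their own result, can be sketched as follows. This is a toy analogy under stated assumptions: the `etl_job` and `run_jobs_in_parallel` functions and the sample inputs are hypothetical, and threads in one process stand in for the separate clusters the slides describe, which give much stronger workload isolation.

```python
from concurrent.futures import ThreadPoolExecutor

def etl_job(job_input):
    """Hypothetical ETL step: normalize symbols and drop empty records."""
    return [record.strip().upper() for record in job_input if record.strip()]

# Each job has its own input and produces its own result.
job_inputs = {
    "job1": [" abc ", "xyz"],
    "job2": ["def", "", " ghi"],
    "job3": ["jkl"],
}

def run_jobs_in_parallel(inputs):
    """Submit every job at once and collect results keyed by job name."""
    with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
        futures = {name: pool.submit(etl_job, data) for name, data in inputs.items()}
        return {name: future.result() for name, future in futures.items()}

results = run_jobs_in_parallel(job_inputs)
```

Since no job reads another job's output, a slow or failed job does not hold up the others, which is the property the slide ties to meeting SLAs.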
  • 22. Processing scales to meet demand [Chart: Daily Order Volume (Billions), 11/1 to 11/29, y-axis 0 to 5] [Chart: AWS EMR compute on EC2, compute nodes (0 to 12,000) by hour of day, 2016-10-17 to 2016-10-24]
  • 23. Catalog for centralized data management http://finraos.github.io/herd Unified catalog • Schemas • Versions • Encryption type • Storage policies Lineage and Usage • Track publishers and consumers • Easily identify jobs and derived data sets Shared Metastore • Common definition of tables and partitions • Use with Spark, Presto, Hive, etc. • Faster instantiation of clusters
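To make the catalog responsibilities listed on this slide concrete (schemas, versions, encryption type, storage policies, and lineage), here is a minimal in-memory sketch. It is not herd's API or data model: the `Catalog` and `DatasetVersion` classes, their fields, and the dataset names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetVersion:
    """Hypothetical catalog entry; field names are illustrative only."""
    schema: dict             # column name -> type name
    encryption: str          # e.g. "SSE-KMS"
    storage_policy: str      # e.g. "keep-7-years"
    derived_from: list = field(default_factory=list)  # lineage: parent dataset names

class Catalog:
    def __init__(self):
        self._versions = {}  # dataset name -> list of DatasetVersion

    def register(self, name, version):
        """Record a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(version)
        return len(self._versions[name])

    def latest(self, name):
        return self._versions[name][-1]

    def consumers_of(self, name):
        """Lineage lookup: datasets derived from the named dataset."""
        return sorted(child for child, versions in self._versions.items()
                      if any(name in v.derived_from for v in versions))

catalog = Catalog()
catalog.register("raw_orders",
                 DatasetVersion({"order_id": "string"}, "SSE-KMS", "keep-7-years"))
second = catalog.register("raw_orders",
                          DatasetVersion({"order_id": "string", "trade_date": "date"},
                                         "SSE-KMS", "keep-7-years"))
catalog.register("order_summary",
                 DatasetVersion({"trade_date": "date", "orders": "bigint"},
                                "SSE-KMS", "keep-1-year", derived_from=["raw_orders"]))
```

Keeping lineage next to the schema is what lets a question such as "which derived data sets must be rebuilt if `raw_orders` is corrected?" be answered from the catalog alone.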
  • 24. Catalog and the Data Lake ecosystem Hive Metastore Data Catalog Data Catalog UI Analyst Data Scientist Explore Use Object Storage (S3) Custom Handler Request object info Processing Get object info (optional DDL) Knows Object/File Object/File Object/File Query (w/ DDL) Store Results Custom Handler Register Output Validation ETL Machine Surveillance Lambda EMR Interactive Analytics EMR Redshift (Spectrum) Get DDL
  • 25. Analytics – one-stop shop for data Data Analyst Data Scientist JDBC Client JDBC Client Table 1 Table 2 AuthN AuthZ Metastore Table N Logical “Database”
  • 26. Achieve interactive query speed with Data Lake
      Query 1: select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date)
        Table size: 2469171608 rows; output size: 1 row; ORC: 4s; TXT/BZ2: 1m56s
      Query 2: select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1
        Table size: 2469171608 rows; output size: 12 rows; ORC: 3s; TXT/BZ2: 1m51s
      Query 3: select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1
        Table size: 2469171608 rows; output size: 8364 rows; ORC: 5s; TXT/BZ2: 2m5s
      Query 4: select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0
        Table size: 2469171608 rows; output size: 760 rows; ORC: 10s; TXT/BZ2: 2m3s
      Test config: Presto 0.167.0.6t (Teradata) on EMR; data on S3 (external tables); cluster size: 60 worker nodes x r4.4xlarge
      Key point: use ORC (or Parquet) for performant queries
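One reason a columnar format such as ORC or Parquet helps with queries like these is that a filter or count on one column does not need to read the other columns. The toy model below illustrates only that effect; the sample rows and function names are hypothetical. It does not model the compression, built-in statistics, or predicate pushdown that real columnar formats also provide, and it does not reproduce the benchmark above, where the text files additionally pay for bzip2 decompression.

```python
# Row-oriented layout: answering a query reads every field of every record.
# (Hypothetical sample data, for illustration only.)
rows = [
    ("2016-08-09", "ABC", 100),
    ("2016-08-09", "XYZ", 250),
    ("2016-08-10", "ABC", 75),
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "trade_date": [r[0] for r in rows],
    "symbol": [r[1] for r in rows],
    "quantity": [r[2] for r in rows],
}

def count_row_layout(records, target_date):
    """Return (matching rows, values read) when scanning whole records."""
    values_read = 0
    matches = 0
    for record in records:
        values_read += len(record)   # the whole record is read
        if record[0] == target_date:
            matches += 1
    return matches, values_read

def count_column_layout(cols, target_date):
    """Return (matching rows, values read) when scanning one column."""
    dates = cols["trade_date"]       # only this column is read
    matches = sum(1 for d in dates if d == target_date)
    return matches, len(dates)

row_result = count_row_layout(rows, "2016-08-09")
col_result = count_column_layout(columns, "2016-08-09")
```

Both layouts return the same count, but the column layout reads one value per row instead of all three; on wide tables with billions of rows that gap grows with the number of columns.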
  • 27. Grow the data store with no work [Chart: Main Production Data Store (Bucket on S3), size in PB, y-axis 0 to 4.5] • Data footprint grows seamlessly • All data accessible for interactive (or batch) query from the moment it is stored
  • 28. Or scale out with multiple clusters… User A JDBC Client Table 1 Table 2 AuthN Metastore Table N Logical “Database” JDBC Client User B JDBC App Cluster A Cluster B Cluster C Still One Copy Of Data!
  • 29. Data needs for data science and ML • Allow discovery & exploration • Bring disparate sources of data together • Allow users to focus on the problem, not the infrastructure • Safeguard information with a high degree of security and least-privilege access
  • 30. A single way to access all of the data. Accelerate discovery through self-service. [Diagram: Before Data Lake, the Data Scientist goes through Data Engineers to reach separate Logical Data Repositories 1, 2, … N; with the Data Lake, the Data Scientist and Data Engineer use one Logical Data Repository.]
  • 31. Data science on the Data Lake Data Scientist JDBC Client Logical ‘Database’ EMR Cluster Still one copy of data! Spark Cluster DS-in-a-box AuthN Data Scientist Notebook Interface Data Scientist Catalog Notebook or Shell
  • 32. Universal Data Science Platform (UDSP) • Environment (EC2) for each Data Scientist • Simple provisioning interface • Right instance (memory or GPU) for job • Access to all the data in Data Lake • Shut off when not using for savings • Secure (LDAP AuthN/Z + Encryption) Data Scientist
  • 33. UDSP – Inventory – not just R • Languages: R 3.2.5, Python (2.7.12 and 3.4.3) • Packages: R 300+, Python 100+ • Tools for building packages: gcc, gfortran, make, java, maven, ant… • IDEs: Jupyter, RStudio Server • Deep learning: CUDA, cuDNN (if GPU present), Theano, Caffe, Torch, TensorFlow
  • 34. Some business benefits with Data Lake • Market volume changes are no longer disruptive technology events • Regulatory analysts can now interactively analyze 1000x more market events (billions of rows vs. millions before) • Easily reprocess data if there are upstream data errors: finding capacity used to take weeks, and reprocessing can now be done in a day or a few days • Querying order route detail went from tens of minutes to seconds • Quicker turnaround to provide data for oversight • Machine Learning model development is easier
  • 35. Want to hear more? Feel free to contact me: john.hitchingham@finra.org