FINRA’s Data Lake unlocks the value in its data to accelerate analytics and machine learning at scale. FINRA's Technology group has changed its customers' relationship with data by creating a Managed Data Lake that enables discovery on petabytes of capital markets data, while saving time and money over traditional analytics solutions. FINRA’s Managed Data Lake includes a centralized data catalog and separates storage from compute, allowing users to query petabytes of data in seconds. Learn how FINRA uses Spot Instances and services such as Amazon S3, Amazon EMR, Amazon Redshift, and AWS Lambda to provide the "right tool for the right job" at each step in the data processing pipeline. All of this is done while meeting FINRA’s security and compliance responsibilities as a financial regulator.
3. What is big data?
When your data sets become so large and complex that you have to start innovating around how to collect, store, process, analyze, and share them.
4. Big Data services on AWS
Collect: AWS Import/Export, AWS Snowball, AWS Direct Connect, Amazon Kinesis, AWS Database Migration Service, AWS Data Pipeline
Store: Amazon S3, Amazon Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB
Process & Analyze: Amazon EMR, Amazon EC2, Amazon Machine Learning, Amazon Redshift, Amazon Kinesis Analytics, Amazon QuickSight, Amazon Elasticsearch Service, Amazon Athena, AWS Glue
5. Scale as your data and business grows
The volume, variety, and velocity at which data is being generated leave organizations with new questions to answer.
6. Data Lake
Central Storage – secure, cost-effective storage in Amazon S3
Data Ingestion – get your data into S3 quickly and securely: Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
Catalog & Search – access and search metadata: DynamoDB, Elasticsearch Service
Access & User Interface – give your users easy and secure access: API Gateway, Directory Service, Cognito
Processing & Analytics – use predictive and prescriptive analytics to gain better understanding: Athena, QuickSight, EMR, Amazon Redshift
Protect & Secure – use entitlements to ensure data is secure and users’ identities are verified: IAM, CloudWatch, CloudTrail, KMS
7. Scale
Store and analyze all your data—structured and unstructured—from all of your sources, in one centralized location at low cost.
Quickly ingest data without needing to force it into a predefined schema, enabling ad hoc analysis by applying schemas on read, not on write.
Separating your storage and compute allows you to scale each component as required and attach multiple data processing and analytics services to the same data set.
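The schema-on-read idea can be sketched in plain Python: raw records are ingested as-is, and each consumer applies its own schema only when reading. The field names and types below are hypothetical, for illustration only.

```python
import csv
import io
from datetime import datetime

# Raw data lands as-is at ingest time -- no schema is enforced on write.
RAW = """2016-08-09,AAPL,100,102.5
2016-08-09,MSFT,200,57.8
"""

def read_with_schema(raw, schema):
    """Apply a (name, cast) schema only at read time: schema on read."""
    for row in csv.reader(io.StringIO(raw)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

# A hypothetical consumer's schema; a different consumer could read the
# same raw bytes with a different schema, with no rewrite of stored data.
trade_schema = [
    ("trade_date", lambda v: datetime.strptime(v, "%Y-%m-%d").date()),
    ("symbol", str),
    ("quantity", int),
    ("price", float),
]

trades = list(read_with_schema(RAW, trade_schema))
```

Because the cast happens at read time, adding a second schema over the same stored bytes costs nothing at ingest.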
8. Cost
• Use only the services you need
• Scale only the services you need
• Pay only for what you use
• Discounts through Reserved Instances, Spot Instances, and upfront commitments
9. Security and scale
• Visibility and control of all APIs and retrievals
• Encryption of all data at each step
• Store an exabyte of data or more in Amazon S3
• Analyze GB to PB using standard tools
• Control egress and ingress points using VPCs
10. Agility & actionable insights
Big data does not mean just batch:
• Data can be streamed in
• Processed in real time
• Used to respond quickly to requests and actionable events, and to generate business value
You can mix and match:
• On-premises and cloud
• Custom development and managed services
12. FINRA’s experience
In order to solve its market regulation challenges, over the past three years FINRA’s Technology team has pioneered a managed cloud service to operate big data workloads and perform analytics at large scale. The results of FINRA’s innovations have been significant:
• A 30% operating cost reduction, in both labor and infrastructure
• A 5x increase in operational resiliency
• The business is able to perform analytics at unprecedented scale and depth
To achieve these gains and operate its big data ecosystem, FINRA Technology has built a set of cutting-edge tools, processes, and know-how.
14. Legacy pain points – infrastructure and ops
• Did not scale well as volumes and workloads increased
• Duplication of effort in data management (data lifecycle, retention, versioning, etc.)
• Data sync issues – manual effort to keep data in sync
• Costly system maintenance and upgrades
15. Legacy pain points – analytics and data science
Business analysts, data analysts, data scientists, data engineers, and ops were stuck in a slow loop:
• "What data do we have? What format is it in? Where do I get it?"
• "Get this data for them…"
• "Not on disk – pull from tape." Wait for tapes from offsite.
• Prepare and format the data.
• "Oops, I need more data…" Repeat!
• "I need the data in a different format…" Repeat!
• etc., etc.
16. Summary of cloud drivers
• Fast-growing data volumes year over year
• High cost of pre-building for peak
• Escalating costs of in-house technology infrastructure
• Long time-to-market for finding insights in data
• Appliance platforms were facing obsolescence and end-of-life as a result of new big data technologies
Keep spending more on legacy infrastructure, or redirect dollars to the core business of regulation?
17. FINRA cloud program business objectives
• Discover data easily
• Access (all the) data easily
• Increase the power of analytic tools
• Make data processing resilient
• Make data processing cost effective
Could this be achieved in the cloud?
18. Cloud architectural principles
Manage Data Consistently
• Define, store, and share our data as an enterprise asset
• All data should be enabled for analytics
• Protect data in a holistic manner (data at rest and data in transit)
Integrate our Portfolio
• Shared solutions for common business processes across the organization
• All "business" data in the cloud will be tracked by a centralized Data Management System so that FINRA can manage the data lifecycle in a productive and cost-effective manner
• All FINRA-developed applications will have service interfaces
Operational Resiliency
• Multi-AZ components and fail-over
• Auto-scaling and load balancing to achieve high availability
• No logon to servers or services for routine operations
• Applications should include automated operations jobs to handle known failure scenarios, recovery, data issues, and notifications
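The last principle, automated operations jobs for known failure scenarios, can be sketched as a small retry wrapper. This is illustrative only: the exception types, the notification hook, and the flaky job are assumptions, not FINRA's implementation.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ops")

def run_with_recovery(job, retries=3, delay=0.1, notify=log.warning):
    """Run a job, retrying known transient failures and notifying each time."""
    for attempt in range(1, retries + 1):
        try:
            return job()
        except (TimeoutError, ConnectionError) as exc:  # known failure scenarios
            notify("attempt %d failed: %s", attempt, exc)
            time.sleep(delay)
    raise RuntimeError(f"job failed after {retries} attempts")

# Hypothetical flaky job: fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"
```

The point is that recovery and notification live in code, so no one has to log on to a server for a routine failure.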
19. From data puddles to Data Lake
FINRA in the data center ran silos: each database (Database1, Database2, … Databasen) had its own storage, query/compute, and catalog.
FINRA in AWS inverts this: a single shared storage layer (Amazon S3) and a single catalog (herd, with a shared Hive metastore) serve many query/compute engines (EMR Spark, EMR Presto, EMR HBase, Lambda), and each piece scales independently.
20. Data processing stream on the Data Lake
Ingest: data files arrive from broker-dealers, exchanges, and 3rd-party providers.
Validation and ETL: normalize, enrich, and reformat the data.
Catalog & Storage: a centralized catalog tracks everything that lands in storage.
Consume: human analytics (analysts, data scientists, regulatory users) and automated surveillance (pattern detection).
Scale: 100s of EMR clusters and as many Lambda functions as needed.
21. Power of parallelization
ETL Job1: Input → Result
ETL Job2: Input → Result
…
ETL Jobn: Input → Result
Workloads run in parallel, with workload isolation, to meet SLAs.
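The pattern above, each ETL job owning its own input and result, can be sketched with the standard library (a toy normalization step stands in for real ETL):

```python
from concurrent.futures import ThreadPoolExecutor

def etl_job(records):
    """A self-contained ETL job: normalize and reformat its own input."""
    return [r.strip().upper() for r in records]

# Each job owns its input and result -- no shared state, so one slow or
# failing job cannot block the others (workload isolation to meet SLAs).
inputs = [
    [" aapl ", " msft "],   # ETL Job1
    [" goog "],             # ETL Job2
    [" ibm ", " orcl "],    # ETL Jobn
]

with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
    results = list(pool.map(etl_job, inputs))
```

At FINRA's scale the "pool" is hundreds of EMR clusters rather than threads, but the isolation argument is the same.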
23. Catalog for centralized data management
http://finraos.github.io/herd
Unified catalog
• Schemas
• Versions
• Encryption type
• Storage policies
Lineage and usage
• Track publishers and consumers
• Easily identify jobs and derived data sets
Shared metastore
• Common definition of tables and partitions
• Use with Spark, Presto, Hive, etc.
• Faster instantiation of clusters
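As a rough illustration of what a unified catalog entry carries, here is an in-memory sketch. The field names and the registered data set are invented for illustration; this is not herd's actual REST API.

```python
from datetime import date

catalog = {}  # (name, version) -> metadata

def register(name, version, schema, encryption, storage_policy, lineage):
    """Register one version of a data set with its schema and policies."""
    catalog[(name, version)] = {
        "schema": schema,
        "encryption": encryption,
        "storage_policy": storage_policy,
        "lineage": lineage,  # publishers and consumers, for lineage/usage
        "registered": date.today().isoformat(),
    }

register(
    name="trade_detail",          # hypothetical data set
    version=2,
    schema=[("trade_date", "date"), ("symbol", "string"), ("price", "double")],
    encryption="SSE-KMS",
    storage_policy="S3 Standard, archive after 90 days",
    lineage={"publisher": "etl_normalize", "consumers": ["surveillance"]},
)

entry = catalog[("trade_detail", 2)]
```

Versioned entries like this are what let consumers ask "what data do we have, in what format, and who produced it?" without tracking anyone down.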
24. Catalog and the Data Lake ecosystem
• Processing (validation, ETL) requests object info from the Data Catalog through a custom handler, reads the objects from object storage (S3), stores its results back to S3, and registers the output with the catalog through another custom handler.
• The Data Catalog knows every object/file and maintains a Hive Metastore; interactive analytics engines (EMR, Redshift Spectrum) get DDL from the catalog and query the data directly on S3.
• Analysts and data scientists explore data through the Data Catalog UI, then use it via interactive analytics (EMR, Redshift Spectrum) and machine surveillance (Lambda, EMR).
25. Analytics – one-stop shop for data
Data analysts and data scientists connect with JDBC clients to a logical "database": authentication and authorization sit in front of a shared metastore that exposes Table 1 through Table N over the same underlying data.
26. Achieve interactive query speed with the Data Lake

Query | Table size (rows) | Output size (rows) | ORC | TXT/BZ2
select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date) | 2,469,171,608 | 1 | 4s | 1m56s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2,469,171,608 | 12 | 3s | 1m51s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2,469,171,608 | 8,364 | 5s | 2m5s
select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0 | 2,469,171,608 | 760 | 10s | 2m3s

Test config: Presto 0.167.0.6t (Teradata) on EMR; data on S3 (external tables); cluster size: 60 worker nodes x r4.4xlarge.
Key point: use ORC (or Parquet) for performant queries.
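Much of the ORC vs. text gap comes from column pruning: an aggregate that touches one column does not have to read the rest of each row. A toy pure-Python sketch of that idea (not ORC itself, and with made-up sample rows):

```python
rows = [
    ("2016-08-09", "AAPL", 100),
    ("2016-08-09", "MSFT", 200),
    ("2016-08-10", "AAPL", 150),
]

# Row-oriented layout (like delimited text): every query walks whole records.
row_total = sum(qty for (_, _, qty) in rows)

# Column-oriented layout: the same data stored column by column, as in
# ORC or Parquet; an aggregate reads only the column it needs.
columns = {
    "trade_date": [r[0] for r in rows],
    "symbol": [r[1] for r in rows],
    "quantity": [r[2] for r in rows],
}
col_total = sum(columns["quantity"])  # touches one column, same answer
```

Real ORC adds stripe-level statistics and predicate pushdown on top of this layout, which is why the filter queries in the table above also speed up.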
27. Grow the data store with no work
[Chart: main production data store (bucket on S3), size in PB, growing from 0 to roughly 4.5 PB]
• Data footprint grows seamlessly
• All data is accessible for interactive (or batch) query from the moment it is stored
28. Or scale out with multiple clusters…
Users A and B (JDBC clients and a JDBC app) authenticate to the same logical "database" (a shared metastore exposing Table 1 through Table N), while their queries run on separate clusters A, B, and C. Still one copy of the data!
29. Data needs for data science and ML
• Allow discovery and exploration
• Bring disparate sources of data together
• Allow users to focus on the problem, not the infrastructure
• Safeguard information with a high degree of security and least-privilege access
30. A single way to access all of the data
Before the Data Lake, each data scientist worked through data engineers against N separate logical data repositories. With the Data Lake, data scientists go directly to a single logical data repository, accelerating discovery through self-service.
31. Data science on the Data Lake
Data scientists reach the same single copy of data through several paths: a JDBC client against the logical "database" (EMR cluster), a notebook interface to a Spark cluster, or a notebook or shell on a "DS-in-a-box" environment, all through AuthN and the catalog.
32. Universal Data Science Platform (UDSP)
• An environment (EC2) for each data scientist
• Simple provisioning interface
• Right instance type (memory or GPU) for the job
• Access to all the data in the Data Lake
• Shut off when not in use, for savings
• Secure (LDAP AuthN/Z + encryption)
33. UDSP – inventory – not just R
• R 3.2.5, Python (2.7.12 and 3.4.3)
• Packages: R 300+, Python 100+
• Tools for building packages: gcc, gfortran, make, java, maven, ant…
• IDEs: Jupyter, RStudio Server
• Deep learning: CUDA, cuDNN (if a GPU is present); Theano, Caffe, Torch; TensorFlow
34. Some business benefits with the Data Lake
• Market volume changes are no longer disruptive technology events
• Regulatory analysts can now interactively analyze 1000x more market events (billions of rows vs. millions before)
• Easily reprocess data when there are upstream data errors – it used to take weeks to find capacity; now it can be done in a day or days
• Querying order route detail went from tens of minutes to seconds
• Quicker turnaround to provide data for oversight
• Machine learning model development is easier
35. Want to hear more?
Feel free to contact me:
john.hitchingham@finra.org