©2017 Discover Financial Services - Confidential and Proprietary - Do not copy or distribute
Next-gen Data Flow Platform for the Enterprise
Santosh Bardwaj
Vice President, Advanced Analytics
The opinions expressed in this presentation are those of the presenters,
in their individual capacities, and not necessarily those of Discover.
Agenda
1. What it takes to build an enterprise-ready platform
2. Discover’s next-gen data ingestion platform built on NiFi
3. Challenges and how we overcame them
4. Next steps with the platform
Discover is a leading U.S. direct bank & payments partner
• Deposits & Lending: $37Bn Consumer Deposits, $9Bn Private Student Loans, $7Bn Personal Loans
• $60Bn in Credit Card Receivables
• 1 in 4 Households (1)
• Leading Cash Rewards
• $183Bn Payment Services Volume
• 185+ Countries/Territories
Note(s)
Balances as of March 31, 2017; volume based on the trailing four quarters ending 1Q17; direct-to-consumer deposits include affinity deposits
1. TNS’ Consumer Payment Strategies Study
Advancing our data-analytic capabilities
Advanced Analytics Capabilities
1. Ingest, classify and transform data from “source to insight” in minutes
2. Centralize data, next-generation analytic tools and reporting on the Hadoop Data Lake
3. Extend the Data Lake and advanced analytic stack on the Cloud to enable speed to market
4. Operationalize business use cases leveraging advanced analytic capabilities
5. Provide real-time customer insight and rapid deployment of new strategies into the decision engines
From hours to minutes
Built around a foundation of a continuous data pipeline and hybrid data-analytic lake
Unified data ingestion platform built on NiFi
Unified data ingestion platform
• Ingest data from source systems
• Push to the Enterprise Data Lake
• Governed process leveraging common-reusable templates
What is NiFi?
• Enables automated data flow management
• Acquires data from producers
• Delivers to consumers while orchestrating the flow
Why we chose NiFi to build our data ingestion platform
• Scalable and Customizable
• Provenance
• Promotes reuse
• Secure
• User Interface (drag & drop)
The next-gen platform built on NiFi and Spark is designed to streamline our data pipeline into a near real-time paradigm
Nightly batch to near real-time
[Diagram, legacy pipeline (~24 hours). Labels: Operational Database; Database file extracts; SFTP; Raw Data Lake (flat file), limited user access and tools; ETL Grid; Source of Truth; ETL Grid; Enterprise DW]
[Diagram, next-gen pipeline (minutes). Labels: Enterprise Data Lake (Hortonworks); Raw data; Source of truth; Source of truth - Enriched; Phase 1 “True Sourcing”; Phase 2 “Enriched Sourcing”; NiFi; Spark; NiFi]
We are also extending the capability into the cloud
[Diagram labels: Batch sources; Event Bus (Kafka); Mini-batch; Real-time; On-premise Data Lake (Hortonworks); AWS Data Lake (Amazon S3, Hortonworks); Spark; Model scoring/decisioning; Real-time analytics; History; Operational Data Store]
Data Flow Categorization within the Hadoop Data Lake
• System of Record (SOR)
• Source of Truth (SOT)
• Source of Truth – Enriched (SOT-E)
Detail flow and foundational components
Flow: SOURCES → RAW → SOR → SOT → SOT-E
Flow steps
• Source files
• Landing area
• File Catalog
• Convert to standard format
• Schema evolution
• Apply schema changes
• Raw data consumable
• Technical metadata
• Business metadata
• DQ checks
• Data enrichment (Business transformation)
• Ability to export data out of Lake
Foundational components
• Continuous integration
• Monitoring
• Data lineage
• Data governance
• Exception handling
• Security
• Data reconciliation
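The slide names data reconciliation as a platform component but gives no detail. As a rough illustration only, one common form compares a row count and an order-independent checksum between a source extract and what landed in the lake. The function names `fingerprint` and `reconcile` are hypothetical and are not taken from the presentation.

```python
import hashlib

MODULUS = 2 ** 256  # keeps the running checksum at a fixed size


def fingerprint(rows):
    """Return (row_count, checksum) for an iterable of text rows.

    The checksum is the sum of per-row SHA-256 digests modulo 2**256,
    so it does not depend on the order in which rows arrive.
    """
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(row.encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, checksum


def reconcile(source_rows, target_rows):
    """True when both sides hold the same rows, ignoring order."""
    return fingerprint(source_rows) == fingerprint(target_rows)


if __name__ == "__main__":
    source = ["John,Doe,Chicago", "Mary,Johnston,Atlanta"]
    landed = ["Mary,Johnston,Atlanta", "John,Doe,Chicago"]
    print(reconcile(source, landed))
```

A count-plus-checksum comparison is cheap to run after every load, but it only reports that a mismatch exists; locating the differing rows needs a separate step.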
Ingesting complex data - How complex?
File formats vary: some are easy to consume, others are hard.
Example: Records with Dynamic arrays/vectors of primitives or strings
Schema: First Name, Last Name, Array_size of Sibling_Name[], Sibling_Name[0-N], City
Data:
John, Doe, 2, Susie, Chris, Chicago
Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta
Frank, Smith, 1, Ralph, Toronto
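A minimal sketch of how a record with a dynamic array could be parsed, assuming the layout in the schema line above: the size field says how many sibling names follow before the city. The function name `parse_sibling_record` and the output keys are hypothetical. The naive comma split assumes no field contains a comma.

```python
def parse_sibling_record(line):
    """Parse 'first, last, n, sibling_1..sibling_n, city' into a dict."""
    fields = [field.strip() for field in line.split(",")]
    first_name, last_name = fields[0], fields[1]
    count = int(fields[2])              # declared array size
    siblings = fields[3:3 + count]      # the next `count` fields are the array
    rest = fields[3 + count:]           # whatever remains must be the city
    if len(siblings) != count or len(rest) != 1:
        raise ValueError(f"malformed record: {line!r}")
    return {
        "first_name": first_name,
        "last_name": last_name,
        "sibling_names": siblings,
        "city": rest[0],
    }


if __name__ == "__main__":
    for sample in (
        "John, Doe, 2, Susie, Chris, Chicago",
        "Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta",
        "Frank, Smith, 1, Ralph, Toronto",
    ):
        print(parse_sibling_record(sample))
```

The point of the example is that the column position of `City` is not fixed: it can only be found after reading the size field, which is what makes such files awkward for fixed-schema loaders.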
Example: Records with an array of Struct data types
Schema: First Name, Array_size of CompanyStruct[], CompanyStruct.Name, CompanyStruct.City, CompanyStruct.YearsWorked, Age
Data:
John, 1, Discover, Chicago, 3, 44
Mary, 3, Sales Unlimited, Dallas, 2, Auditors R’ Us, Atlanta, 5, Discover, Chicago, 4, 35
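The array-of-struct case can be sketched the same way, assuming each struct occupies three consecutive fields (name, city, years worked) and that every field is comma-separated, including a comma between "Chicago" and "4" in the second sample. The names `parse_company_record` and `STRUCT_WIDTH` are hypothetical.

```python
STRUCT_WIDTH = 3  # CompanyStruct.Name, CompanyStruct.City, CompanyStruct.YearsWorked


def parse_company_record(line):
    """Parse 'first, n, (name, city, years) * n, age' into a dict."""
    fields = [field.strip() for field in line.split(",")]
    first_name = fields[0]
    count = int(fields[1])                  # number of structs that follow
    end = 2 + count * STRUCT_WIDTH          # index of the trailing Age field
    if len(fields) != end + 1:
        raise ValueError(f"malformed record: {line!r}")
    companies = [
        {
            "name": fields[i],
            "city": fields[i + 1],
            "years_worked": int(fields[i + 2]),
        }
        for i in range(2, end, STRUCT_WIDTH)
    ]
    return {"first_name": first_name, "companies": companies, "age": int(fields[end])}


if __name__ == "__main__":
    print(parse_company_record("John, 1, Discover, Chicago, 3 , 44"))
    print(parse_company_record(
        "Mary, 3, Sales Unlimited, Dallas, 2, Auditors R' Us, Atlanta, 5, "
        "Discover, Chicago, 4, 35"
    ))
```

Because the struct width is a constant, the length check catches records where a delimiter is missing, which is a cheap guard before the data moves further down the pipeline.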
Our solution – A custom NiFi processor to handle complex data types
[Diagram labels: Ingestion Pipeline; NiFi Process Group; Spark Converter; Discover schema.json; Data File.001; Data File.avsc; Data File 001.avro; System of Record; Source of Truth - Source]
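The diagram shows a schema file, an Avro schema (.avsc) and an Avro data file (.avro), but the slide does not show their contents. As a hedged illustration, an Avro schema for the sibling-name record from the earlier example might be generated as below. The record name, namespace and field names are assumptions, and this is not Discover's converter.

```python
import json


def build_person_schema():
    """Return an Avro record schema as a plain dict (hypothetical field names)."""
    return {
        "type": "record",
        "name": "Person",
        "namespace": "example.ingest",
        "fields": [
            {"name": "first_name", "type": "string"},
            {"name": "last_name", "type": "string"},
            # The dynamic array becomes a native Avro array, so no size field is needed.
            {"name": "sibling_names", "type": {"type": "array", "items": "string"}},
            {"name": "city", "type": "string"},
        ],
    }


# An .avsc file is simply this schema serialized as JSON text.
avsc_text = json.dumps(build_person_schema(), indent=2)

if __name__ == "__main__":
    print(avsc_text)
```

Once data is expressed against a schema like this, the explicit `Array_size` column from the flat file disappears, since Avro arrays carry their own length.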
Continuous improvement of real-time data ingestion using NiFi
NiFi Ingestion Flow Version I
Source: Flat File; Destination: Hadoop
24 hours
NiFi Ingestion Flow Version II
Source: Event Bus; Destination: Hadoop
Complex logic, limited scale
NiFi Ingestion Flow Version III
Source: Event Bus; Destination: Hadoop
Custom NiFi processor developed in-house, reusable and scalable
Seconds
ETL on Hadoop progression
Data enrichment from SOR to SOT (~600 jobs)
Version I: Traditional ETL tool. Run time: ~18 hours
Version II: ETL on HiveQL. Run time: ~8 hours
Version III: ETL on Spark (hand-coded). Run time: ~1 hour
Coming soon: Automated (flow-based) ETL on Spark
Upcoming enhancements to our data pipeline
• Integrating data quality, catalog into NiFi flow
• Custom processors to parse complex data structures
• Enterprise scale ETL on Hadoop using Spark
• Self-service data pipelines
• Integrating batch and real-time data pipelines
Hiring Data Engineers
Q & A

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Continuous Data Ingestion pipeline for the Enterprise

  • 1. ©2017 Discover Financial Services - Confidential and Proprietary - Do not copy or distribute. Next-gen Data Flow Platform for the Enterprise. Santosh Bardwaj, Vice President, Advanced Analytics. The opinions expressed in this presentation are those of the presenters, in their individual capacities, and not necessarily those of Discover.
  • 2. Agenda: (1) What it takes to build an enterprise-ready platform; (2) Discover’s next-gen data ingestion platform built on NiFi; (3) Challenges and how we overcame them; (4) Next steps with the platform.
  • 3. Discover is a leading U.S. direct bank & payments partner. Deposits & Lending: $37Bn Consumer Deposits; $9Bn Private Student Loans; $7Bn Personal Loans. 1 in 4 Households (1); $60Bn in Credit Card Receivables; Leading Cash Rewards. $183Bn Payment Services Volume; 185+ Countries/Territories. Note(s): Balances as of March 31, 2017; volume based on the trailing four quarters ending 1Q17; direct-to-consumer deposits includes affinity deposits. (1) TNS’ Consumer Payment Strategies Study.
  • 4. Advancing our data-analytic capabilities, from hours to minutes. Advanced Analytics Capabilities: Ingest, classify and transform data from “source to insight” in minutes; Centralize data, next-generation analytic tools and reporting on the Hadoop Data Lake; Extend the Data Lake and advanced analytic stack on the Cloud to enable speed to market; Operationalize business use cases leveraging advanced analytic capabilities; Provide real-time customer insight and rapid deployment of new strategies into the decision engines. Built around a foundation of a continuous data pipeline and hybrid data-analytic lake.
  • 5. Unified data ingestion platform built on NiFi. Unified data ingestion platform: Ingest data from source systems; Push to the Enterprise Data Lake; Governed process leveraging common, reusable templates. What is NiFi? Enables automated data flow management; Acquires data from producers; Delivers to consumers while orchestrating the flow. Why we chose NiFi to build our data ingestion platform: Scalable and Customizable; Provenance; Promotes reuse; Secure; User Interface (drag & drop).
  • 6. The next-gen platform built on NiFi and Spark is designed to streamline our data pipeline into a near real-time paradigm: nightly batch to near real-time. [Diagram labels: Operational Database; Database file extracts; SFTP; ETL Grid; Raw Data Lake (flat file), limited user access and tools; Source of Truth; Enterprise DW; ~24 hours. Enterprise Data Lake: Raw data; Source of truth; Source of truth - Enriched; Phase 1 “True Sourcing”; Phase 2 “Enriched Sourcing”; NiFi; Spark; Hortonworks; Minutes.]
  • 7. We are also extending the capability into the cloud. [Diagram labels: Batch sources; Event Bus; Mini-batch; Real-time; On-premise Data Lake; Model scoring/decisioning; Real-time analytics; History; Operational Data Store; AWS Data Lake; Kafka; Hortonworks; Amazon S3; Spark.]
  • 8. Data Flow Categorization within the Hadoop Data Lake: System of Record (SOR); Source of Truth (SOT); Source of Truth – Enriched (SOT-E).
  • 9. Detail flow and foundational components. Stages: SOURCES; RAW; SOR; SOT; SOT-E. Components: Source files; Landing area; File Catalog; Convert to standard format; Schema evolution; Apply schema changes; Raw data consumable; Technical metadata; Business metadata; DQ checks; Data enrichment (Business transformation); Ability to export data out of Lake; Continuous integration; Monitoring; Data lineage; Data governance; Exception handling; Security; Data reconciliation.
  • 10. Ingesting complex data - How complex? The format of files will vary; some are easy to consume, others hard.
    Example: Records with dynamic arrays/vectors of primitives or strings
    Schema: First Name, Last Name, Array_size of Sibling_Name[], Sibling_Name[0-N], City
    Data:
      John, Doe, 2, Susie, Chris, Chicago
      Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta
      Frank, Smith, 1, Ralph, Toronto
    Example: Records with an array of Struct data types
    Schema: First Name, Array_size of CompanyStruct[], CompanyStruct.Name, CompanyStruct.City, CompanyStruct.YearsWorked, Age
    Data:
      John, 1, Discover, Chicago, 3, 44
      Mary, 3, Sales Unlimited, Dallas, 2, Auditors R’ Us, Atlanta, 5, Discover, Chicago, 4, 35
  • 11. Our solution: a custom NiFi processor to handle complex data types. [Diagram labels: System of Record; NiFi Process Group; Data File.001; Discover schema.json; Spark Converter; Data File 001.avro; Data File.avsc; Ingestion Pipeline; Source of Truth - Source.]
  • 12. Continuous improvement of real-time data ingestion using NiFi. NiFi Ingestion Flow Version I: Source: Flat File; Destination: Hadoop; 24 hours. NiFi Ingestion Flow Version II: Source: Event Bus; Destination: Hadoop; complex logic, limited scale. NiFi Ingestion Flow Version III: Source: Event Bus; Destination: Hadoop; custom NiFi processor developed in-house, reusable and scalable; seconds.
  • 13. ETL on Hadoop progression: data enrichment from SOR to SOT (~600 jobs). Version I: Traditional ETL tool, run time ~18 hours. Version II: ETL on HiveQL, run time ~8 hours. Version III: ETL on Spark (hand-coded), run time ~1 hour. Coming soon: Automated (flow-based) ETL on Spark.
  • 14. Upcoming enhancements to our data pipeline: Integrating data quality, catalog into NiFi flow; Custom processors to parse complex data structures; Enterprise scale ETL on Hadoop using Spark; Self-service data pipelines; Integrating batch and real-time data pipelines.
  • 15. Hiring Data Engineers. Q & A.
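
The two record shapes on slide 10 (a count followed by that many values, and a count followed by that many structs) can be parsed with a small layout-driven reader. The following is a minimal Python sketch, not Discover's custom NiFi processor or Spark converter: the layout tuples, the names `SIBLINGS_LAYOUT`, `COMPANIES_LAYOUT` and `parse_record`, and the lower-case field names are all hypothetical and introduced only for illustration. It assumes comma-delimited records with no embedded commas, and it keeps every value as a string.

```python
# Hypothetical layout descriptors (an assumption, not Discover's schema format):
#   ("scalar", name)              reads one token
#   ("array", name)               reads a count token, then that many tokens
#   ("structs", name, subfields)  reads a count token, then count * len(subfields) tokens
SIBLINGS_LAYOUT = [
    ("scalar", "first_name"),
    ("scalar", "last_name"),
    ("array", "sibling_names"),
    ("scalar", "city"),
]

COMPANIES_LAYOUT = [
    ("scalar", "first_name"),
    ("structs", "companies", ("name", "city", "years_worked")),
    ("scalar", "age"),
]


def parse_record(line, layout):
    """Parse one comma-delimited record into a dict, following the layout."""
    tokens = [t.strip() for t in line.split(",")]
    pos = 0

    def take(n):
        # Consume the next n tokens, failing loudly if the record is too short.
        nonlocal pos
        if n < 0 or pos + n > len(tokens):
            raise ValueError(f"record does not match layout: {line!r}")
        chunk = tokens[pos:pos + n]
        pos += n
        return chunk

    record = {}
    for field in layout:
        kind, name = field[0], field[1]
        if kind == "scalar":
            record[name] = take(1)[0]
        elif kind == "array":
            size = int(take(1)[0])  # the count precedes the values
            record[name] = take(size)
        elif kind == "structs":
            subfields = field[2]
            size = int(take(1)[0])  # the count precedes the structs
            record[name] = [
                dict(zip(subfields, take(len(subfields)))) for _ in range(size)
            ]
        else:
            raise ValueError(f"unknown field kind: {kind!r}")

    if pos != len(tokens):
        raise ValueError(f"unexpected trailing tokens in record: {line!r}")
    return record


if __name__ == "__main__":
    print(parse_record("Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta", SIBLINGS_LAYOUT))
    print(parse_record("John, 1, Discover, Chicago, 3, 44", COMPANIES_LAYOUT))
```

Because the element count is stored inside the record, a fixed column split cannot work; the reader has to walk the tokens in order and let each count decide how many tokens the next field consumes. A record whose count disagrees with its length raises `ValueError` rather than producing shifted columns.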

Editor's Notes

  1. Discover has a tradition of operating on mature data-analytic platforms such as TD and SAS; these platforms are proprietary and expensive. Since the beginning of this decade, three key trends have influenced the future of the industry: Big Data; open source tools; real-time analytics and cloud. Business: reinvent our key decisioning platforms such as Fraud, Credit decisioning and Collections, with faster, richer data, better-quality insights, and faster development and deployment. Technology: the foundation consists of Hadoop and a new data pipeline. Collectively, these should help improve our speed to market from days/hours to minutes.
  2. Multiple record formats within a single file. Records will contain complex data structures (sub-records, dynamic arrays/vectors). Formats include fixed width, single- and multiple-delimited, and mainframe.
  3. Systematically convert source files to a standard format with schema information attached. Apply our own “Discover Schema” (stored in JSON) to the raw source file (or use a CopyBook for mainframe files). Feed the source data and our “Discover Schema” into a Spark application; the “Discover Schema” is needed so our converter knows how to parse the incoming data file. The output is an Avro data file along with the corresponding .avsc schema. The Avro data and schema are then passed on to the ingestion pipeline for further Hive loading and processing.
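
To make the schema step in note 3 concrete, here is a minimal sketch of deriving an Avro schema (the contents of a .avsc file) from a JSON field description. The real “Discover Schema” format is not shown in the slides, so the JSON layout below, the `kind` and `subfields` keys, and the function name `to_avro_schema` are assumptions made for illustration. The sketch is plain Python rather than the Spark application the note describes, and it types every value as an Avro `string`; a real schema would presumably carry per-field types.

```python
import json

# Hypothetical stand-in for a "Discover Schema" JSON file (format assumed).
DISCOVER_SCHEMA_JSON = """
{
  "name": "Employee",
  "fields": [
    {"name": "first_name", "kind": "scalar"},
    {"name": "companies", "kind": "structs",
     "subfields": ["name", "city", "years_worked"]},
    {"name": "age", "kind": "scalar"}
  ]
}
"""


def to_avro_schema(discover_schema):
    """Map the assumed field description onto an Avro record schema."""
    fields = []
    for field in discover_schema["fields"]:
        kind = field["kind"]
        if kind == "scalar":
            avro_type = "string"
        elif kind == "array":
            # Dynamic array of primitives becomes an Avro array of strings.
            avro_type = {"type": "array", "items": "string"}
        elif kind == "structs":
            # Array of structs becomes an Avro array of nested records.
            item = {
                "type": "record",
                "name": field["name"] + "_item",
                "fields": [
                    {"name": sub, "type": "string"} for sub in field["subfields"]
                ],
            }
            avro_type = {"type": "array", "items": item}
        else:
            raise ValueError(f"unknown field kind: {kind!r}")
        fields.append({"name": field["name"], "type": avro_type})
    return {"type": "record", "name": discover_schema["name"], "fields": fields}


# Text that could be written out as the .avsc file alongside the Avro data.
avsc_text = json.dumps(to_avro_schema(json.loads(DISCOVER_SCHEMA_JSON)), indent=2)

if __name__ == "__main__":
    print(avsc_text)
```

The count fields from the source records do not appear in the output schema: Avro arrays carry their own length, so the count is only needed while parsing the raw file.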