Continuous data ingestion platform built on NiFi and Spark that integrates a variety of data sources, including real-time events, data from external sources, and structured and unstructured data, with in-flight governance, providing a real-time pipeline that moves data from source to consumption in minutes. The next-gen data pipeline has helped eliminate legacy batch latency and improve data quality and governance through custom NiFi processors and embedded Spark code. To meet stringent regulatory requirements, the pipeline is being augmented with in-flight ETL and DQ checks, enabling a continuous workflow that enhances Raw/unclassified data into Enriched/classified data available for consumption by users and production processes.
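A minimal sketch of the kind of in-flight DQ check described above, assuming simple per-field rules; the field names and rules here are hypothetical, and the real checks live in custom NiFi processors and embedded Spark code rather than standalone Python:

```python
# Illustrative in-flight data-quality check applied between the Raw and
# Enriched zones. Field names and rules are hypothetical examples.

def dq_check(record, rules):
    """Return a list of rule violations for one record (empty list = clean)."""
    violations = []
    for field, rule in rules.items():
        value = record.get(field)
        if rule.get("required") and value in (None, ""):
            violations.append(f"{field}: missing required value")
            continue
        if value is not None and "max_len" in rule and len(str(value)) > rule["max_len"]:
            violations.append(f"{field}: exceeds max length {rule['max_len']}")
    return violations

# Hypothetical rule set and records
rules = {
    "account_id": {"required": True, "max_len": 16},
    "amount": {"required": True},
}
clean = {"account_id": "1234567890", "amount": "42.50"}
dirty = {"account_id": "", "amount": "42.50"}
```

Records that pass continue down the pipeline as Enriched data; records with violations can be routed to a quarantine flow for remediation.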
Discover has a tradition of operating on mature data and analytics platforms such as TD and SAS; these platforms are proprietary and expensive
Since the beginning of this decade, several key trends have influenced the future of the industry: big data, open-source tools, real-time analytics, and cloud
Business: Reinvent our key decisioning platforms such as Fraud, Credit decisioning, and Collections, delivering faster, richer data, better-quality insights, and faster development & deployment
Technology foundation consists of Hadoop and a new data pipeline
Collectively, these should help improve our speed to market from days/hours to minutes
Multiple record formats within a single file
Records will contain complex data structures (sub-records, dynamic arrays/vectors)
Fixed width, single and multiple delimited, Mainframe
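The fixed-width case can be sketched as simple slice-based parsing driven by a field layout; the layout and sample record below are hypothetical, and in practice the layout would come from the "Discover Schema" (or a copybook for mainframe files):

```python
# Hypothetical sketch: slicing one fixed-width record into named fields
# using a simple (field_name, width) layout.

def parse_fixed_width(line, layout):
    """layout: list of (field_name, width) tuples, in record order."""
    record, pos = {}, 0
    for name, width in layout:
        record[name] = line[pos:pos + width].strip()
        pos += width
    return record

# Illustrative layout and record; widths sum to the record length (20).
layout = [("record_type", 2), ("account_id", 10), ("amount", 8)]
line = "01123456789000042.50"
parsed = parse_fixed_width(line, layout)
```

A leading record-type field like the one above is also how a single file containing multiple record formats can be dispatched to the right layout per line.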
Systematically convert source files to a standard format with schema information attached
Apply our own "Discover Schema" (stored in JSON) to the raw source file (or use a copybook for mainframe files)
Feed the source data and our "Discover Schema" into a Spark application
The "Discover Schema" is needed so our converter knows how to parse the incoming data file
Output is an Avro data file along with the corresponding .avsc schema
The Avro data and schema are then passed on to the ingestion pipeline for Hive loading and further processing
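The schema-conversion step above can be sketched as a mapping from a hypothetical "Discover Schema" JSON into an Avro record schema (the .avsc content); the field names and type mapping here are illustrative assumptions, and the actual converter is a Spark application:

```python
# Hedged sketch: deriving an Avro .avsc schema (as a dict) from a
# hypothetical JSON "Discover Schema". The type mapping is illustrative.
import json

TYPE_MAP = {"string": "string", "int": "int", "decimal": "double"}

def to_avsc(discover_schema):
    """Build an Avro record schema; fields are nullable with null defaults."""
    return {
        "type": "record",
        "name": discover_schema["name"],
        "fields": [
            {"name": f["name"], "type": ["null", TYPE_MAP[f["type"]]], "default": None}
            for f in discover_schema["fields"]
        ],
    }

# Illustrative "Discover Schema" for a transaction record
discover_schema = {
    "name": "transaction",
    "fields": [
        {"name": "account_id", "type": "string"},
        {"name": "amount", "type": "decimal"},
    ],
}

avsc = to_avsc(discover_schema)
avsc_json = json.dumps(avsc)  # what would be written out as the .avsc file
```

Making every field a nullable union with a null default is one common convention that keeps the Avro schema tolerant of sparse source records and eases later schema evolution.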