Continuous data ingestion platform built on NiFi and Spark that integrates a variety of data sources, including real-time events, data from external sources, and structured and unstructured data, with in-flight governance, providing a real-time pipeline that moves data from source to consumption in minutes. The next-gen data pipeline has helped eliminate legacy batch latency and improve data quality and governance through custom NiFi processors and embedded Spark code. To meet stringent regulatory requirements, the pipeline is being augmented with in-flight ETL and DQ checks, enabling a continuous workflow that enhances Raw/unclassified data into Enriched/classified data available for consumption by users and production processes.
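A minimal sketch of the kind of in-flight DQ check described above, assuming simple per-field rules; the field names and rules here are hypothetical, and the real checks live in custom NiFi processors and embedded Spark code rather than standalone Python:

```python
# Illustrative in-flight data-quality check applied between the Raw and
# Enriched zones. Field names and rules are hypothetical examples.

def dq_check(record, rules):
    """Return a list of rule violations for one record (empty list = clean)."""
    violations = []
    for field, rule in rules.items():
        value = record.get(field)
        if rule.get("required") and value in (None, ""):
            violations.append(f"{field}: missing required value")
            continue
        if value is not None and "max_len" in rule and len(str(value)) > rule["max_len"]:
            violations.append(f"{field}: exceeds max length {rule['max_len']}")
    return violations

# Hypothetical rule set and records
rules = {
    "account_id": {"required": True, "max_len": 16},
    "amount": {"required": True},
}
clean = {"account_id": "1234567890", "amount": "42.50"}
dirty = {"account_id": "", "amount": "42.50"}
```

Records that pass continue down the pipeline as Enriched data; records with violations can be routed to a quarantine flow for remediation.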
Discover has a tradition of operating on mature data and analytics platforms such as TD and SAS; these platforms are proprietary and expensive
Since the beginning of this decade, several key trends have influenced the future of the industry: big data, open-source tools, real-time analytics, and cloud
Business: Reinvent our key decisioning platforms such as Fraud, Credit decisioning, and Collections, delivering faster, richer data, better-quality insights, and faster development & deployment
Technology foundation consists of Hadoop and a new data pipeline
Collectively, these should help improve our speed to market from days/hours to minutes
Multiple record formats within a single file
Records will contain complex data structures (sub-records, dynamic arrays/vectors)
Fixed width, single and multiple delimited, Mainframe
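The fixed-width case can be sketched as simple slice-based parsing driven by a field layout; the layout and sample record below are hypothetical, and in practice the layout would come from the "Discover Schema" (or a copybook for mainframe files):

```python
# Hypothetical sketch: slicing one fixed-width record into named fields
# using a simple (field_name, width) layout.

def parse_fixed_width(line, layout):
    """layout: list of (field_name, width) tuples, in record order."""
    record, pos = {}, 0
    for name, width in layout:
        record[name] = line[pos:pos + width].strip()
        pos += width
    return record

# Illustrative layout and record; widths sum to the record length (20).
layout = [("record_type", 2), ("account_id", 10), ("amount", 8)]
line = "01123456789000042.50"
parsed = parse_fixed_width(line, layout)
```

A leading record-type field like the one above is also how a single file containing multiple record formats can be dispatched to the right layout per line.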
Systematically convert source files to a standard format with schema information attached
Apply our own "Discover Schema" (stored in JSON) to the raw source file (or use a copybook for mainframe files)
Feed the source data and our "Discover Schema" into a Spark application
The "Discover Schema" is needed so our converter knows how to parse the incoming data file
Output is an Avro data file along with the corresponding .avsc schema
The Avro data and schema are then passed on to the ingestion pipeline for Hive loading and further processing
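The schema-conversion step above can be sketched as a mapping from a hypothetical "Discover Schema" JSON into an Avro record schema (the .avsc content); the field names and type mapping here are illustrative assumptions, and the actual converter is a Spark application:

```python
# Hedged sketch: deriving an Avro .avsc schema (as a dict) from a
# hypothetical JSON "Discover Schema". The type mapping is illustrative.
import json

TYPE_MAP = {"string": "string", "int": "int", "decimal": "double"}

def to_avsc(discover_schema):
    """Build an Avro record schema; fields are nullable with null defaults."""
    return {
        "type": "record",
        "name": discover_schema["name"],
        "fields": [
            {"name": f["name"], "type": ["null", TYPE_MAP[f["type"]]], "default": None}
            for f in discover_schema["fields"]
        ],
    }

# Illustrative "Discover Schema" for a transaction record
discover_schema = {
    "name": "transaction",
    "fields": [
        {"name": "account_id", "type": "string"},
        {"name": "amount", "type": "decimal"},
    ],
}

avsc = to_avsc(discover_schema)
avsc_json = json.dumps(avsc)  # what would be written out as the .avsc file
```

Making every field a nullable union with a null default is one common convention that keeps the Avro schema tolerant of sparse source records and eases later schema evolution.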