
Big Data Architecture for Enterprise



  1. Big Data Architecture for Enterprise. Wei Zhang, Big Data Architect, Up up consultant, LLC
  2. Design Principles • Future-proof, scalable, and auto-recoverable • Compatible with existing technologies • Loosely coupled, layered architecture
  3. Centralized Data Governance Service • Build a schema catalog service to track all data entities and attributes for both structured and unstructured data sets • Establish and enforce proper practices, including solution patterns/design, coding, test automation, and release procedures
  4. Logical Architecture (diagram): Data Acquisition (text, image, XML, and EDI files; events; …) → Data Transformation and Storage / Data Processing Pipeline (Hadoop HDFS, MapReduce, Hive, Pig, Flume, Spark, Java/Scala; NoSQL: MongoDB, Cassandra; relational databases: MS SQL Server, Oracle, MySQL) → Data Distribution (BI reports; text, image, XML, and EDI files; events; …)
  5. Logical Architecture • Data lifecycle control, access audit, replication, and DR • On-disk and in-memory data processing technology stack: SQL or NoSQL databases, Hadoop MapReduce, Spark, ETL tools, etc. • Central data inventory services for discovery, tracking, and optimization
  6. Technology Stack • HDFS, MapReduce, YARN • Oozie, Hive, Spark, Kafka, Cassandra, MongoDB • BI & reporting, data acquisition and distribution, data inventory and data model
  7. Schema Catalog • MongoDB schema store • Schemas, entities, and attributes defined in Avro format (see the Avro schema sketch after this list) • Define all data sources and destinations, including format, transfer protocol, file system, schedule, etc.
  8. Data Ledger • Ledger inventory of all business data sets across the enterprise • Data set producer and consumer registration • Data sets are tagged and can be queried for traceability and usage (see the ledger-entry sketch after this list)
  9. Data Processing and Persistence • Relational databases for OLTP, data warehouse, and BI workloads that need SQL access and integration with existing systems • HDFS for source, destination, and staging data, unstructured documents, and large-scale processing; data saved in either Avro or Parquet format for better exchange and performance (see the Spark sketch after this list) • Cassandra for high-frequency, write-heavy transactional systems; MongoDB for document storage
  10. Automated and Regression Testing • Maven, SBT, JUnit, ScalaTest (see the ScalaTest sketch after this list)
  11. Physical Deployment • Low end: 7.2K RPM / 75 IOPS disks, 16 cores, 128 GB RAM (data acquisition and distribution) • Medium: 15K RPM / 175 IOPS disks, 24 cores, 512 GB RAM (batch processing) • High end: 6K-500K IOPS, 80 cores, 1.5 TB RAM (real-time processing/analytics)
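
A minimal sketch of how a schema catalog entry might be defined, using Avro's SchemaBuilder API as the deck suggests; the `registerSchema` helper and the `Customer` entity are hypothetical stand-ins for whatever the MongoDB-backed catalog service actually exposes.

```scala
import org.apache.avro.{Schema, SchemaBuilder}

object SchemaCatalogExample {
  // Define an entity schema in Avro, as the schema catalog slide describes.
  val customerSchema: Schema = SchemaBuilder
    .record("Customer").namespace("com.enterprise.catalog")
    .fields()
    .requiredString("customerId")
    .requiredString("name")
    .optionalString("email")
    .endRecord()

  // Hypothetical registration call; in this architecture the schema JSON
  // would be persisted as a document in the MongoDB schema store.
  def registerSchema(schema: Schema): Unit =
    println(s"registering ${schema.getFullName}:\n${schema.toString(true)}")

  def main(args: Array[String]): Unit = registerSchema(customerSchema)
}
```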
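A sketch of a ledger entry for the data-ledger slide; the field names (producer, consumers, tags) are assumptions inferred from the registration and tagging points above, not a published schema.

```scala
import java.time.Instant

// Hypothetical ledger record for one registered data set.
final case class LedgerEntry(
  dataSetId: String,      // unique id in the enterprise inventory
  producer: String,       // registered producing system
  consumers: Set[String], // registered consuming systems
  tags: Set[String],      // tags used for traceability queries
  registeredAt: Instant
)

object DataLedgerExample {
  private var ledger: Vector[LedgerEntry] = Vector.empty

  def register(entry: LedgerEntry): Unit = ledger :+= entry

  // Query by tag, supporting the traceability/usage use case on the slide.
  def byTag(tag: String): Vector[LedgerEntry] =
    ledger.filter(_.tags.contains(tag))
}
```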
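To illustrate the Avro/Parquet point on the processing slide, here is a small Spark sketch that reads Avro from an HDFS staging path and writes Parquet; the paths are placeholders, and the spark-avro module is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object AvroToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-to-parquet")
      .getOrCreate()

    // Placeholder HDFS paths; real jobs would resolve these from the
    // schema catalog / data ledger described above.
    val source = spark.read.format("avro")
      .load("hdfs:///staging/customers/avro")

    // Parquet's columnar layout gives better scan performance for
    // downstream analytics, matching the slide's recommendation.
    source.write.mode("overwrite")
      .parquet("hdfs:///warehouse/customers/parquet")

    spark.stop()
  }
}
```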
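A minimal ScalaTest example of the kind the testing slide implies, assuming ScalaTest 3.x's AnyFunSuite; it exercises the hypothetical ledger query from the sketch above.

```scala
import org.scalatest.funsuite.AnyFunSuite
import java.time.Instant

class DataLedgerSpec extends AnyFunSuite {
  test("a registered data set is findable by tag") {
    val entry = LedgerEntry(
      dataSetId = "customers-v1",
      producer = "crm",
      consumers = Set("bi-reports"),
      tags = Set("pii"),
      registeredAt = Instant.now()
    )
    DataLedgerExample.register(entry)
    assert(DataLedgerExample.byTag("pii").contains(entry))
  }
}
```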