
Big Data Architecture for Enterprise



  1. Big Data Architecture for Enterprise. Wei Zhang, Big Data Architect, Up up consultant, LLC
  2. Design Principles • Future-proof, scalable, and auto-recoverable • Compatible with existing technologies • Loosely coupled, layered architecture
  3. Centralized Data Governance Service • Build a schema catalog service to track all data entities and attributes for both structured and unstructured data sets • Establish and enforce proper practices, including solution patterns/design, coding, test automation, and release procedures
  4. Logical Architecture (diagram): Data Acquisition (text, image, XML, and EDI files; events; …) → Data Transformation and Storage / Data Processing Pipeline (Hadoop HDFS, MapReduce, Hive, Pig, Flume, Spark, Java/Scala; NoSQL: MongoDB, Cassandra; relational databases: MS SQL Server, Oracle, MySQL) → Data Distribution (BI reports; text, image, XML, and EDI files; events; …)
  5. Logical Architecture • Data lifecycle control, access audit, replication, and DR • On-disk and in-memory data processing technology stack: SQL or NoSQL databases, Hadoop MapReduce, Spark, ETL tools, etc. • Central data inventory services for discovery, tracking, and optimization
  6. Technology Stack • HDFS, MapReduce, YARN • Oozie, Hive, Spark, Kafka, Cassandra, MongoDB • BI & reporting, data acquisition and distribution, data inventory and data model
  7. Schema Catalog • MongoDB schema store • Schemas, entities, and attributes defined in Avro format (see the Avro schema sketch after this list) • Define all data sources and destinations, including format, transfer protocol, file system, schedule, etc.
  8. Data Ledger • Ledger inventory of all business data sets across the enterprise • Data set producer and consumer registration • Data sets are tagged and can be queried for traceability and usage (see the ledger-entry sketch after this list)
  9. Data Processing and Persistence • Relational databases for OLTP, data warehouse, and BI workloads that need SQL access and integration with existing systems • HDFS for source, destination, and staging data, unstructured documents, and large-scale processing; data saved in either Avro or Parquet format for better exchange and performance (see the Spark sketch after this list) • Cassandra for high-frequency, write-heavy transactional systems; MongoDB for document storage
  10. Automated and Regression Testing • Maven, SBT, JUnit, ScalaTest (see the ScalaTest sketch after this list)
  11. Physical Deployment • Low end: 7.2K RPM / 75 IOPS disks, 16 cores, 128 GB RAM (data acquisition and distribution) • Medium: 15K RPM / 175 IOPS disks, 24 cores, 512 GB RAM (batch processing) • High end: 6K-500K IOPS, 80 cores, 1.5 TB RAM (real-time processing/analytics)
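
A minimal sketch of how a schema catalog entry might be defined, using Avro's SchemaBuilder API as the deck suggests; the `registerSchema` helper and the `Customer` entity are hypothetical stand-ins for whatever the MongoDB-backed catalog service actually exposes.

```scala
import org.apache.avro.{Schema, SchemaBuilder}

object SchemaCatalogExample {
  // Define an entity schema in Avro, as the schema catalog slide describes.
  val customerSchema: Schema = SchemaBuilder
    .record("Customer").namespace("com.enterprise.catalog")
    .fields()
    .requiredString("customerId")
    .requiredString("name")
    .optionalString("email")
    .endRecord()

  // Hypothetical registration call; in this architecture the schema JSON
  // would be persisted as a document in the MongoDB schema store.
  def registerSchema(schema: Schema): Unit =
    println(s"registering ${schema.getFullName}:\n${schema.toString(true)}")

  def main(args: Array[String]): Unit = registerSchema(customerSchema)
}
```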
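A sketch of a ledger entry for the data-ledger slide; the field names (producer, consumers, tags) are assumptions inferred from the registration and tagging points above, not a published schema.

```scala
import java.time.Instant

// Hypothetical ledger record for one registered data set.
final case class LedgerEntry(
  dataSetId: String,      // unique id in the enterprise inventory
  producer: String,       // registered producing system
  consumers: Set[String], // registered consuming systems
  tags: Set[String],      // tags used for traceability queries
  registeredAt: Instant
)

object DataLedgerExample {
  private var ledger: Vector[LedgerEntry] = Vector.empty

  def register(entry: LedgerEntry): Unit = ledger :+= entry

  // Query by tag, supporting the traceability/usage use case on the slide.
  def byTag(tag: String): Vector[LedgerEntry] =
    ledger.filter(_.tags.contains(tag))
}
```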
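To illustrate the Avro/Parquet point on the processing slide, here is a small Spark sketch that reads Avro from an HDFS staging path and writes Parquet; the paths are placeholders, and the spark-avro module is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object AvroToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-to-parquet")
      .getOrCreate()

    // Placeholder HDFS paths; real jobs would resolve these from the
    // schema catalog / data ledger described above.
    val source = spark.read.format("avro")
      .load("hdfs:///staging/customers/avro")

    // Parquet's columnar layout gives better scan performance for
    // downstream analytics, matching the slide's recommendation.
    source.write.mode("overwrite")
      .parquet("hdfs:///warehouse/customers/parquet")

    spark.stop()
  }
}
```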
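A minimal ScalaTest example of the kind the testing slide implies, assuming ScalaTest 3.x's AnyFunSuite; it exercises the hypothetical ledger query from the sketch above.

```scala
import org.scalatest.funsuite.AnyFunSuite
import java.time.Instant

class DataLedgerSpec extends AnyFunSuite {
  test("a registered data set is findable by tag") {
    val entry = LedgerEntry(
      dataSetId = "customers-v1",
      producer = "crm",
      consumers = Set("bi-reports"),
      tags = Set("pii"),
      registeredAt = Instant.now()
    )
    DataLedgerExample.register(entry)
    assert(DataLedgerExample.byTag("pii").contains(entry))
  }
}
```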