Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Stopping the Lake from becoming a Swamp

Garbage in, Garbage out. This truism is not new only for Big Data but it has become significantly more impactful as the move has begun away from traditional schema based approaches to more flexible and dynamic file system approaches. Many companies are finding that early stage successes rapidly become mid-term failures as data poured into Hadoop can no longer be managed and marshaled effectively, creating a cycle of reduced effectiveness and increased cost. This session covers how Big Data changes the way that information needs to be managed, how traditional approaches like MDM need to change and how new technologies and methods such as machine learning and business meta-data become crucial to the long term viability of a heterogeneous Big Data infrastructure.

Presented at Informatica World 2016 by Steve Jones, Global VP Big Data, Capgemini Insights & Data

  • Login to see the comments

Stopping the Lake from becoming a Swamp

  1. 1. Stopping your Lake becoming a Swamp Steve Jones – Global VP Big Data – Capgemini 4 May 2016
  2. 2. #INFA16 The biggest complaint in Data Governance… The business won’t engage
  3. 3. #INFA16 The biggest complaint in Data Governance… Want to know why?
  4. 4. #INFA16 The biggest complaint in Data Governance… Its because they never cared about Data Governance
  5. 5. #INFA16 The biggest complaint in Data Governance… Because it didn’t actually govern the data in the way they wanted
  6. 6. #INFA16 Corporate Ad-hoc LOB management Operations Market Multiple internal views – consistently compromised Operations LOB mart Spreadsheets Line of business Transactional systems CRMERP PLM EDW Corporate ODS Web
  7. 7. #INFA16 Corporate Ad-hoc LOB management Operations Market Multiple internal views - you already have a swamp Operations LOB mart Spreadsheets Line of business Transactional systems CRMERP PLM EDW Fit Detail Freshness Fidelity Corporate ODS Web
  8. 8. #INFA16 IT’s answer – just make the EDW ‘bigger’- one view for all
  9. 9. #INFA16 Why the single view & the golden record fails Division1 Sales Finance Supply chain Marketing R&D PersonalKPIs DivsionalKPIs Corporate Now agree on everythingDivision2 Sales Finance Supply chain Marketing R&D PersonalKPIs DivsionalKPIs Division3 Sales Finance Supply chain Marketing R&D PersonalKPIs DivsionalKPIs Division4 Sales Finance Supply chain Marketing R&D PersonalKPIs DivsionalKPIs Corporate KPIs
  10. 10. #INFA16 The biggest complaint in Data Governance… But good news, Big Data is changing the game
  11. 11. #INFA16 Marketing Business Meta-Data – recognizing how people work Data Lake Debt RecoveryHigh Value Customers Explore the Lake Prioritize
  12. 12. #INFA16 Marketing Business Meta-Data – recognizing how people work Data Lake Debt RecoveryHigh Value Customers Explore the Lake Business Meta- DataPurpose: Debt Recovery Verify Purpose Exclude Purpose: Mktg Exclude
  13. 13. #INFA16 So what do we need? Govern where it matters !  Focus on Meta-data, MDM and RDM !  Enforce only when sharing !  Treat Corporate as aggregation of Local. Encourage local requirements !  Let the business decide what they need !  Build from the bottom !  Enable traceability to source !  Disposable data views. Distill on demand !  Select only what you want !  Business friendly tooling !  Re-usable information maps !  Rapid change cycle. Store everything !  Store everything ‘as is’ !  Include structured and unstructured data !  Store it cheaply.
  14. 14. #INFA16 Why Data Lakes work Archives Operational systems Case Mgmt Operation s Schedulin g Data Explosion Geographic Data One Place for Everything Supply Chain Optimization Fraud Detection Next Best Action Marketing RMADLibSAS
  15. 15. #INFA16 Keep governance small and focused – the MDM RADAR Core Sync Share Federated
  16. 16. #INFA16 Collaboration over quality The cross-reference is the most important thing an MDM project can deliver
  17. 17. #INFA16 Collaboration over quality Quality is a side-effect – not a goal of the first phase of your MDM project
  18. 18. #INFA16 Why the X-Ref matters Local view Corporate standards Master data and reference data Corporate view Customer x-ref Customer MDM Invoices Orders Customer Invoices Orders BU1 Information governance BU2 BU3 Customer Invoices Orders Customer Invoices Orders Customer Invoices Orders
  19. 19. #INFA16 The Quality mess Internal systems CRMBack Office Trading Market Data Reports Reports Reports Reports Trading Trading ETL ETL ETL
  20. 20. #INFA16 The current situation – add a new System Internal systems CRMBack Office Trading Market Data Reports Reports Reports Reports Trading Trading ETL ETL ETL Reports ETL
  21. 21. #INFA16 The current situation – Where does Data Quality go? Internal systems CRMBack Office Trading Market Data Reports Reports Reports Reports Trading Trading ETL ETL ETL Reports ETL Fix at Source Patch Patch Patch Patch Patch DQ DQ DQ DQ
  22. 22. #INFA16 Lets try another way – Fix at Source is still the ideal but Internal systems CRMBack Office Trading Market Data Reports Reports Reports Reports Trading Trading Data Lake Data Lake + Quality Data Quality & MDM
  23. 23. #INFA16 Then lets go further Internal systems CRMBack Office Trading Market Data Reports Reports Reports Reports Trading Trading Data Lake Data Lake + Quality Data Quality & MDM Data Science
  24. 24. #INFA16 Why the Business Data Lake succeeds Division1 Sales Finance Supply chain Marketing R&D PersonalKPIs Now agree where it counts Division2 Sales Finance Supply chain Marketing R&D PersonalKPIs DivsionalKPIs Division3 Sales Finance Supply chain Marketing R&D PersonalKPIs DivsionalKPIs Division4 Sales Finance Supply chain Marketing R&D PersonalKPIs DivsionalKPIs Corporate Corporate KPIs DivsionalKPIs
  25. 25. #INFA16 Ingestion tier Insights tier Unified operations tier System monitoring System management Unified data management tier Data mgmt. services MDM RDM Audit and policy mgmt. Processing tier Workflow management Distillation tier HDFS storage Unstructured and structured data In-memory MPP database Real time Micro batch Mega batch SQL NoSQL SparkSQL Query interface s SQL Help me ingestion – you’re my only hope Sources Action tier Real-time ingestion Micro batch ingestion Batch ingestion Real-time insights Interactive insights Batch insights Ingestion The first and LAST place you can have control Distillation Think ‘Excel’ not EDW
  26. 26. #INFA16 If your business isn’t trivial – you’ll end up with multiple lakes Client Data Lake Partner Data Lake Geo specific Geo specificGeo specific Corporate Data Lake Analytics Results Distillatio n Summary AnalyticsAnalytics Results Analytics Results
  27. 27. #INFA16 Federation requires THREE things The cross-reference •  If you can’t link data sets its impossible Rationalization •  The lesson of Google & Amazon is less choice, not more Business Meta-Data •  Technical Meta-Data is no-longer enough
  28. 28. #INFA16 Building a Lake and the challenge of federation Client specific Client specific Geo specific Geo specificGeo specific Corporate Data LakeTo do this you need to standardize technology
  29. 29. #INFA16 Business Meta-Data and the x-ref – the two things that must be shared Client specific Client specific Geo specific Geo specificGeo specific Corporate Data Lake Business Meta- Data & x-ref
  30. 30. #INFA16 The new philosophy Business Data Lake Store everything Encourage local Govern only the common Assume Federation It’s all about insight at the point of action
  31. 31. #INFA16 But remember Your next challenge will be analytical collaboration beyond your company boundaries

×