
Enterprise Metadata Integration, Cloudera

GraphConnect Europe 2017
Mirko Kämpf, Cloudera Inc

  1. Enterprise Metadata Integration. Mirko Kämpf | Cloudera. GraphConnect 2017 – London. © Cloudera, Inc. All rights reserved.
  2. Who is speaking?
     • Solutions Architect @ Cloudera
       - time series analysis, network analysis, data enrichment pipelines
       - personal interest: QA systems and semantic search
     • Data science activities:
       - The Detection of Emerging Trends Using Wikipedia Traffic Data and Context Networks (PLOS ONE, 2015)
       - Hadoop.TS (IJCA, 2013)
       - Fluctuations in Wikipedia Access-Rate and Edit-Event Data (Physica A, 2012)
  3. Our Approach: Multilayer Metadata Integration … Integration of facts to gain business knowledge
     • Status dashboards are provided per use case.
     • Each dashboard offers facts from multiple layers:
       - (L1) technical layer
       - (L2) operational metadata (Hadoop-specific only)
       - (L3) application-specific operational metadata
       - (L4) quality metrics (second-order metadata)
     • Our achievements:
       - The graph database (Neo4j) allows context exploration.
       - Cluster-spanning metadata exploration is now possible.
       - Exposing inherent but sometimes hidden facts becomes as easy as writing an email.
  4. Intro
  5. People have been mining … for centuries! Gold & diamonds, ore & coal, minerals, oil … The outcome drives the whole economy. http://www.montanregion-erzgebirge.de/welterbe-erleben/montanregion-fuer-bergbauspezialisten/geschichtliches.html
  6. People have been using computers … for decades!
     • 1938: Z1, the world's first freely programmable device, created by Konrad Zuse. http://www.horst-zuse.homepage.t-online.de/z1.html
     • 2015: The U.S. Department of Energy uses an Intel supercomputer at Argonne National Laboratory. http://www.intel.com/content/dam/www/public/us/en/images/photography-business/RWD/aurora-aerial-reflection-floor-rwd.png
  7. DATA MINING. Blog: About Learning Data Mining & Data Analysis. http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/
  8. If data is the new oil … metadata are nuggets and brilliants of our age. Screenshot taken from: https://www.quora.com/Who-should-get-credit-for-the-quote-data-is-the-new-oil
  9. Diamonds and brilliants
     • Diamonds: beautiful even as raw material
     • Brilliant: result of expert's work
     • Even more exciting in combination with other material and skills …
  10. Success Factors: http://www.burkhard-beyer.net/Reportage_Goldschmied.html
     • Idea & Vision
     • Material
     • Skills / Methods
     • Tools
  11. Be very careful with initial success … work towards a professional level!
     • High quality and reproducibility are the results of professional management.
     • It is hard to believe what you can get and which options arise …
     • Manage overwhelming excitement!
     • Do not start new activities at random …
  12. Let's Think Data Driven!
     • Build a mid-term or, better, a long-term strategy.
     • Try to stay independent of a particular technology or tool. Data, not the fancy toolset, is what matters most.
     • After initial success, slow down and control the speed of expansion.
     • Focus on maximizing the accessibility of data. Google's goal was to make the data of the internet accessible. You should become your own Google!
  13. Dataset Profiles / Flow Descriptors
     • Our material is data & metadata:
       - Data about data: descriptive data, Dublin Core metadata model, …
       - Derived data: statistics extracted from processes, documents, …
       - Results of ML/AI procedures: extracted structure and learned models
       - Outcome of crowd-based operations: Wikipedia with its inherent structure, communication logs, access and edit history
  14. Knowledge Extraction for Better Data Science
  15. Science. According to Wikipedia: "Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe." https://en.wikipedia.org/wiki/Science
  16. Data Science. My observation: Commercial data science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the market / business context. https://en.wikipedia.org/wiki/Infographic#/media/File:Gartner_Hype_Cycle_for_Emerging_Technologies.gif
  17. Details. Look into nature …
  18. Context. Look into nature …
  19. Result: Visualization of Facts
     • An image shows what the text says. > Multi-channel communication
     • Data science benefits from such an approach. > Today we still use infographics.
     • Difference: The biologist who created the image on the left observed by eye. Today, we increasingly use data analysis methods.
  20. Process: Knowledge Extraction is a Natural Process
     • Combine multiple sources
     • Repeat observation
     • Incorporate context to explain differences/variation
     • Cross-checks to identify anomalies
  21. Process: Knowledge Extraction is a Natural Process (figure labels: Knowledge, Facts, Data)
  22. How did we implement EMDM?
     - Hadoop Based: for scalability
     - Open Graph Data Model: for flexibility and connectivity
     - Data Centric: following the Big Data paradigm
  23. Big Data Processing: e.g., with Hadoop
  24. Big Graph Processing on Hadoop: e.g., with Giraph
  25. Project Name should stand for: Graphs, Hadoop, and the ecosystem …
  26. Project Name should stand for: Graphs, Hadoop, and the ecosystem …
  27. Data Science Process Model (DSPM): Toolbox and Management Methods
     • DSPM defines core artifacts for knowledge management.
     • Describes the analysis / transformation context.
     • Allows repeatable execution.
     • Process properties become measurable.
     • Supports comparison of results from multiple procedures.
     • All those facts are essential ingredients of business optimization.
     • But: Logging & tracking should never block creativity!
     • Remember: Scientists often act like artists.
  28. Data Science Process Model (DSPM) (figure labels, in slide order)
     • Representation of domain knowledge (in our case, data science in general)
     • Human Interaction
     • Ontology
     • Toolbox and Management Methods
     • Ability to solve a problem using IT and data
     • Technology Aspects: represent and interact with facts & data
     • Data Governance
     • Certified QM
  29. Semantic Logging
     • Property with a name: (K,V) is a key-value pair.
     • Property of a thing: S => (K,V), so (S,P,O) is a triple; K becomes P and V becomes O.
     • Many of those triples in one common context with name G: G => (S,P,O) is called a quad or named graph.
     • Log4j is the logging standard we build on.
     • Using structured data instead of plain strings allows easy parsing (e.g., Apache log format).
     • Triple representation avoids format-specific parsing and makes log data part of the linked data graph.
  30. Etosha Toolbox: Concept
     • Data extractors
     • Data transformers
     • Ontology-based orchestration
     • People and machines contribute facts
     • Iterative approach with closed feedback loops
     • Scalable environment …
  31. Initial Implementation
     • Multi-layer metadata capturing
     • Operational metrics
     • Metrics about fast & static data
     • Business metrics
     • Contextualized presentation
     • Ad-hoc queries for exploration
     • Graph analytics
     • > Knowledge exposure > Self-service
     • DS and BI can speak the same language.
  32. Results: Access Facts & Context of Critical Processes. Demo of context exploration: https://www.youtube.com/watch?v=ZE7Gcanv90s&feature=youtu.be
  33. Results: Better Collaboration for (Hadoop) Knowledge Workers
     • Our achievements:
       - The open graph model is language-, OS-, and hardware-independent.
       - Merging of knowledge partitions enables cluster-spanning metadata exploration.
       - Query beans expose facts from multiple stores to web-based interfaces.
     • Next steps:
       - Improve implicit triplification (query the Solr index and get RDF data).
       - Standardize the process and integrate with existing ontologies.
       - Grow a community … and enter the Apache Incubator.
  34. Thank you! mirko@cloudera.com @semanpix
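
The semantic logging scheme from slide 29 (a key-value property of a thing becomes a triple, and triples sharing one named context become quads) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: it uses Python's standard `logging` module in place of Log4j, and the names `Quad`, `to_quads`, `format_nquad`, `log_facts`, the `http://example.org/` base IRI, and the sample job and run identifiers are all invented for the example. Each quad is written as one N-Quads line, so the log itself is structured data rather than a free-form string.

```python
import io
import logging
from typing import List, Mapping, NamedTuple


class Quad(NamedTuple):
    graph: str      # G: name of the shared context (named graph)
    subject: str    # S: the thing being described
    predicate: str  # P: the former key K
    obj: str        # O: the former value V


def to_quads(graph: str, subject: str, properties: Mapping[str, object]) -> List[Quad]:
    """Turn the (K, V) properties of one subject into quads in context `graph`."""
    return [Quad(graph, subject, key, str(value)) for key, value in properties.items()]


def format_nquad(quad: Quad, base: str = "http://example.org/") -> str:
    """Serialize a quad as one N-Quads line: <S> <P> "O" <G> ."""
    # Assumption: subject, predicate and graph names are already IRI-safe
    # (no spaces or angle brackets); only the literal value is escaped.
    literal = (quad.obj.replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return (f"<{base}{quad.subject}> <{base}{quad.predicate}> "
            f'"{literal}" <{base}{quad.graph}> .')


def log_facts(logger: logging.Logger, graph: str, subject: str,
              properties: Mapping[str, object]) -> None:
    """Emit one log record per quad instead of one free-form message."""
    for quad in to_quads(graph, subject, properties):
        logger.info(format_nquad(quad))


# Demo: capture the log output in memory (hypothetical job and run names).
buffer = io.StringIO()
demo_logger = logging.getLogger("semantic-demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(logging.StreamHandler(buffer))

log_facts(demo_logger, "run-42", "job-7", {"status": "SUCCEEDED", "rows": 1200})
print(buffer.getvalue(), end="")
```

Because every record carries its own subject and context, lines from different jobs or clusters can be concatenated and still be attributed correctly, which is the property the slide relies on when it calls log data part of the linked data graph.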
