Enterprise Metadata Integration, Cloudera

© Cloudera, Inc. All rights reserved.
Enterprise Metadata Integration
Mirko Kämpf | Cloudera
GraphConnect 2017 – London

Who is speaking?
Solutions Architect @ Cloudera
- time series analysis, network analysis, data enrichment pipelines
- personal interest: QA systems and semantic search
Data Science Activities
The Detection of Emerging Trends Using Wikipedia Traffic Data
and Context Networks (PLOS ONE, 2015)
Hadoop.TS (IJCA, 2013)
Fluctuations in Wikipedia Access-Rate and Edit-Event Data.
(Physica A, 2012).

Our Approach: Multilayer Metadata Integration …
• Status dashboards are provided per Use-Case.
• Each dashboard offers facts from multiple layers:
- (L1) technical layer
- (L2) operational metadata (Hadoop specific only)
- (L3) application specific operational metadata
- (L4) quality metrics (second order metadata)
• Our Achievements:
• Graph database (Neo4j) allows context exploration.
• Cluster-spanning metadata exploration is now possible.
• Exposure of inherent but sometimes hidden facts becomes as easy as writing an email.
Integration of facts to gain business knowledge
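A minimal sketch of how facts from the four layers named on this slide might be grouped per use case before they are shown on a dashboard. The layer keys follow the list above; the `Fact` type, the `build_dashboard` helper, the use-case names, and all sample values are hypothetical and not taken from the actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Layer identifiers as named on the slide.
LAYERS = {
    "L1": "technical layer",
    "L2": "operational metadata (Hadoop specific)",
    "L3": "application specific operational metadata",
    "L4": "quality metrics (second order metadata)",
}


@dataclass(frozen=True)
class Fact:
    use_case: str   # hypothetical use-case identifier
    layer: str      # one of the keys in LAYERS
    name: str
    value: object


def build_dashboard(facts, use_case):
    """Group all facts of one use case by layer, in layer order."""
    grouped = defaultdict(dict)
    for fact in facts:
        if fact.use_case != use_case:
            continue
        if fact.layer not in LAYERS:
            raise ValueError(f"unknown layer: {fact.layer}")
        grouped[fact.layer][fact.name] = fact.value
    # Layers without facts are kept as empty sections.
    return {layer: grouped.get(layer, {}) for layer in LAYERS}


# Illustrative values only.
facts = [
    Fact("churn-report", "L1", "hdfs_path", "/data/churn"),
    Fact("churn-report", "L2", "last_job_state", "SUCCEEDED"),
    Fact("churn-report", "L4", "null_ratio", 0.02),
    Fact("other-case", "L1", "hdfs_path", "/data/other"),
]
dashboard = build_dashboard(facts, "churn-report")
```

Keeping empty layers in the result makes gaps visible: a use case without any L3 facts still shows an L3 section, which is itself a useful observation.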

Intro

People do mining … for centuries!
http://www.montanregion-erzgebirge.de/welterbe-erleben/montanregion-fuer-bergbauspezialisten/geschichtliches.html
gold & diamonds,
ore & coal,
minerals,
oil …
The outcome drives the whole economy

People use computers … for decades!
1938: Z1, the world’s first freely programmable device, created by Konrad Zuse.
2015: U.S. Department of Energy uses an Intel supercomputer at Argonne National Laboratory.
http://www.intel.com/content/dam/www/public/us/en/images/photography-business/RWD/aurora-aerial-reflection-floor-rwd.png
http://www.horst-zuse.homepage.t-online.de/z1.html

DATA
MINING
http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/
Blog: About Learning Data Mining & Data Analysis

If data is the new oil …
… metadata are nuggets
and brilliants of our age.
Screenshot taken from:
https://www.quora.com/Who-should-get-credit-for-the-quote-data-is-the-new-oil

Diamonds: beautiful even as raw material
Brilliant: result of expert’s work
Even more exciting in combination
with other material and skills …

• Idea & Vision
• Material
• Skills / Methods
• Tools
Success Factors:
http://www.burkhard-beyer.net/Reportage_Goldschmied.html

Be very careful with initial success …
… work towards a professional level!
High quality and reproducibility are results of professional management.
It is hard to believe what you can get and which options arise …
Manage overwhelming excitement!
Do not start new activities randomly …

Let’s Think Data Driven!
• Build a mid-term or better a long-term strategy.
• Try to stay independent of a particular technology or tool.
Not the fancy toolset but rather data is what matters most.
• After initial success, slow down and control the speed of expansion.
• Focus on: maximized accessibility of data.
Google’s goal was to make the data of the internet accessible.
You should become your own Google!

Dataset Profiles / Flow Descriptors
• Our material is data & metadata:
- Data about data: descriptive data, Dublin Core metadata model, …
- Derived data: statistics extracted from processes, documents, …
- Results of ML/AI procedures: extracted structure and learned models
- Outcome of crowd-based operations: Wikipedia with its inherent structure, communication logs, access and edit history.
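A dataset profile can combine the first two kinds of material listed above: descriptive metadata and derived statistics. The following is a minimal sketch under stated assumptions. The `DatasetProfile` type and the `profile_numeric_column` helper are hypothetical, the keys `title`, `creator`, and `format` are Dublin Core element names, and the page-view numbers are made up for illustration.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class DatasetProfile:
    # Descriptive metadata, keyed by Dublin Core element names.
    dublin_core: dict
    # Derived metadata: statistics computed from the data itself.
    derived: dict = field(default_factory=dict)


def profile_numeric_column(values):
    """Derive simple statistics from a non-empty list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
    }


# Hypothetical hourly page-view counts.
page_views = [120, 98, 143, 110]

profile = DatasetProfile(
    dublin_core={
        "title": "Hourly page views (sample)",
        "creator": "example pipeline",
        "format": "text/csv",
    },
    derived={"page_views": profile_numeric_column(page_views)},
)
```

Keeping descriptive and derived parts separate lets the derived part be recomputed whenever the data changes, while the descriptive part stays stable.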

Knowledge Extraction for
Better Data Science

Science:
According to Wikipedia:
Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
https://en.wikipedia.org/wiki/Science

Data Science:
My observation:
Commercial Data Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the market / business context.
https://en.wikipedia.org/wiki/Infographic#/media/File:Gartner_Hype_Cycle_for_Emerging_Technologies.gif

Details
Look into nature ….

Context
Look into nature ….

Result: Visualization of Facts
• An image shows what the text says.
> Multi-channel communication
• Data Science benefits from such an approach.
> Today we still use infographics
Difference:
The biologist who created the drawing on the left observed by eye. Today, we increasingly use data analysis methods.

Process: Knowledge Extraction is a Natural Process
• Combine multiple sources
• Repeat observation
• Incorporate context to explain
differences/variation
• Cross-checks to identify
anomalies

Process: Knowledge Extraction is a Natural Process
Knowledge
Facts
Data

How did we implement EMDM?
- Hadoop-based: for scalability
- Open graph data model: for flexibility and connectivity
- Data-centric: following the Big Data paradigm

Big Data Processing:
e.g., with Hadoop

Big Graph Processing on Hadoop:
e.g., with Giraph

Project Name should stand for:
Graphs, Hadoop, and the ecosystem …

Data Science Process Model (DSPM)
• DSPM defines core artifacts for knowledge management
• Describes analysis / transformation context
• Allows repeatable execution
• Process properties become measurable
• Supports comparison of results from multiple procedures
• All those facts are essential ingredients for business optimization.
• But: Logging & tracking should never block creativity!
• Remember: Scientists often act like artists.
Toolbox and
Management Methods
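The DSPM points on repeatable execution, measurable process properties, and comparison of results can be sketched as a small tracking wrapper. This is an illustration under stated assumptions, not the DSPM implementation: `RUN_LOG`, `tracked`, and `moving_average` are hypothetical names, and an in-memory list stands in for a real metadata store.

```python
import functools
import hashlib
import json
import time

RUN_LOG = []  # in-memory stand-in for a metadata store


def tracked(step_name):
    """Record parameters, a result fingerprint, and the duration of a step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**params):
            # Keyword-only calls keep every logged parameter named.
            started = time.perf_counter()
            result = func(**params)
            RUN_LOG.append({
                "step": step_name,
                "params": params,
                # The fingerprint allows comparing results across runs.
                "result_sha256": hashlib.sha256(
                    json.dumps(result, sort_keys=True).encode("utf-8")
                ).hexdigest(),
                "seconds": time.perf_counter() - started,
            })
            return result
        return wrapper
    return decorator


@tracked("moving-average")
def moving_average(series, window):
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


first = moving_average(series=[1, 2, 3, 4], window=2)
second = moving_average(series=[1, 2, 3, 4], window=2)
```

The analysis function itself stays free of tracking code, which fits the point that logging and tracking should never block creativity. Two runs with the same parameters produce the same fingerprint, so a deviating fingerprint points to a changed input or procedure.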

Data Science Process Model (DSPM)
Representation of domain knowledge
(in our case it is data science in general)
Human
Interaction
Ontology Toolbox and
Management Methods
Ability to solve
a problem using
IT and data
Technology Aspects
- represent and interact with facts & data
Data Governance
Certified QM

Semantic Logging
• Named property: (K,V) is a key-value pair
• Property of a thing: S => (K,V), so (S,P,O) is a triple;
K becomes P; V becomes O
• Many such triples in one common context named G:
G => (S,P,O) is called a quad or named graph
• Log4j is the logging standard we build on.
• Using structured data instead of plain strings allows easy parsing
(e.g., Apache log format).
• Triple representation avoids specific parsing and makes log data
part of the linked data graph.
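The step from key-value pairs to triples and quads can be sketched in a few lines. This is a simplified illustration, not the Etosha code: Python's standard `logging` module stands in for Log4j, the function names and all `urn:example:` identifiers are hypothetical, and the output only imitates the N-Quads layout (graph label last) without escaping or typed literals.

```python
import logging


def to_triples(subject, properties):
    """S => (K,V): every key becomes a predicate, every value an object."""
    return [(subject, key, value) for key, value in properties.items()]


def to_quads(graph, triples):
    """G => (S,P,O): put triples into one named context (named graph)."""
    return [(graph, s, p, o) for (s, p, o) in triples]


def format_quad(quad):
    """Render a quad as one log line in N-Quads style (graph last).

    Simplified: objects are written as plain string literals and no
    escaping is applied, so this is not a full N-Quads serializer.
    """
    g, s, p, o = quad
    return f'<{s}> <{p}> "{o}" <{g}> .'


# Hypothetical identifiers and values.
properties = {
    "urn:example:status": "SUCCEEDED",
    "urn:example:durationSec": 17,
}
triples = to_triples("urn:example:job-42", properties)
quads = to_quads("urn:example:run-1", triples)
lines = [format_quad(q) for q in quads]

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("semantic")
for line in lines:
    log.info(line)
```

Because every log line already has subject, predicate, object, and context, a consumer needs no log-specific parser and can load the lines into a graph store alongside other linked data.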

Etosha Toolbox
Data extractors,
Data transformers,
Ontology based orchestration,
People and machines,
contribute facts,
Iterative approach with
closed feedback-loops,
Scalable environment …
CONCEPT

Multi-layer metadata capturing
Operational metrics
Metrics about fast & static data
Business metrics
Contextualized presentation
Ad-hoc queries for exploration
Graph-analytics
> Knowledge exposure
> Self-Service DS and BI can
speak the same language.
INITIAL IMPLEMENTATION

Results: Access Facts & Context of Critical Processes
DEMO of context exploration:
https://www.youtube.com/watch?v=ZE7Gcanv90s&feature=youtu.be

Results: Better Collaboration for
(Hadoop) Knowledge Workers
• Our Achievements:
• The open graph model is language-, OS-, and hardware-independent.
• Merging of knowledge partitions enables cluster-spanning metadata exploration.
• Query beans expose facts from multiple stores to web-based interfaces.
• Next Steps:
• Improve implicit triplification (query a Solr index and get RDF data)
• Standardize the process and integrate with existing ontologies.
• Grow a community … and enter the Apache Incubator.
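Two of the ideas on this slide, triplification of index documents and merging of knowledge partitions, can be sketched together. The code below is an assumption-laden illustration: it does not query Solr, it only takes flat documents shaped like the entries of a search response (field name to value, with lists for multi-valued fields). The function names, field names, and identifiers are hypothetical.

```python
def triplify(doc, id_field="id"):
    """Turn one flat document into (subject, predicate, object) triples."""
    subject = doc[id_field]
    triples = set()
    for field_name, value in doc.items():
        if field_name == id_field:
            continue
        # Multi-valued fields yield one triple per value.
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.add((subject, field_name, item))
    return triples


def merge_partitions(*partitions):
    """Union of triple sets; identical facts collapse into one."""
    merged = set()
    for partition in partitions:
        merged |= partition
    return merged


# Hypothetical documents describing the same dataset on two clusters.
cluster_a = triplify(
    {"id": "dataset:clicks", "owner": "team-web", "tags": ["raw", "daily"]}
)
cluster_b = triplify(
    {"id": "dataset:clicks", "owner": "team-web", "usedBy": "job:sessionize"}
)
merged = merge_partitions(cluster_a, cluster_b)
```

Since triples are plain values, merging is a set union: the shared `owner` fact appears once, and facts known to only one cluster become visible in the combined view.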

Thank you!
mirko@cloudera.com
@semanpix
