This talk addresses two questions: “How can the quality of taxonomies be defined?” and “How can it be measured?” See how quality criteria vary depending on how a taxonomy is applied, such as automatic content classification in ecommerce or a knowledge graph for data integration in enterprises. Distinguish between formal quality, structural properties, content coverage, and network topology. Investigate how standards-based, machine-processable SKOS taxonomies make it possible to measure taxonomy quality automatically, and survey several tools and techniques for quality assessment.
Taxonomy Quality Assessment
1. Andreas Blumauer
CEO & Managing Partner
Semantic Web Company &
PoolParty Semantic Suite
TAXONOMY QUALITY ASSESSMENT: TOOLS & TECHNIQUES
Taxonomy Boot Camp 2016, Washington, DC
2. INTRODUCTION
[Diagram: a small knowledge graph introducing the speaker. Nodes: Andreas Blumauer, Semantic Web Company, Vienna, 2004, 5.5, >200, Taxonomy, Ontology, Knowledge Graph. Edge labels: founder & CEO of, developer and vendor of, founded, current version, located, active at, based on, serves customers, manages, standard for, part of, is a.]
4. Why is taxonomy quality important? Some examples of quality issues and their possible consequences
▸ Missing labels
▹ AGROVOC (FAO) defines concepts in 25 different languages. While most concepts have
English labels attached, only 38% have German labels.
▹ This can be a problem for multilingual applications that rely on label translations.
▸ Orphan concepts
▹ An orphan concept is a concept that has no semantic relation with any other concept.
Although it might have attached lexical labels, it lacks valuable context information.
▹ This can be crucial for retrieval tasks such as search query expansion.
▸ Mismatch between content and taxonomy
▹ There are only minor overlaps between the scope of the documents (or data) to be
indexed and the scope of the controlled vocabulary in use.
▹ As a result, the document index is only sparsely enriched with semantic information.
See also: Finding quality issues in SKOS vocabularies
(Christian Mader, Bernhard Haslhofer, Antoine Isaac)
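The first two issues above (missing labels and orphan concepts) can be measured mechanically once a taxonomy is machine-processable. The following is a minimal sketch, not the method used by any particular tool: the `concepts` dictionary, its field names, and the sample entries are hypothetical stand-ins for data loaded from a SKOS file.

```python
# Hypothetical toy vocabulary (an assumption for illustration only):
# concept id -> labels per language, plus broader/related links.
concepts = {
    "c1": {"labels": {"en": "maize", "de": "Mais"}, "broader": ["c2"], "related": []},
    "c2": {"labels": {"en": "cereals"}, "broader": [], "related": []},
    "c3": {"labels": {"en": "irrigation"}, "broader": [], "related": []},
}


def label_coverage(concepts, lang):
    """Share of concepts that carry at least one label in `lang`."""
    if not concepts:
        return 0.0
    labelled = sum(1 for c in concepts.values() if lang in c["labels"])
    return labelled / len(concepts)


def orphan_concepts(concepts):
    """Concepts with neither outgoing nor incoming semantic relations."""
    connected = set()
    for concept_id, concept in concepts.items():
        for target in concept["broader"] + concept["related"]:
            connected.add(concept_id)  # source of a relation
            connected.add(target)      # target of a relation
    return sorted(set(concepts) - connected)


if __name__ == "__main__":
    print(f"German label coverage: {label_coverage(concepts, 'de'):.0%}")
    print("Orphan concepts:", orphan_concepts(concepts))
```

A coverage figure per language, computed this way, is the kind of number behind a statement such as "only 38% have German labels".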
5. Taxonomy quality issues are more frequently observed than some might expect
See also: Finding quality issues in SKOS vocabularies
6. Taxonomy quality criteria and issues at different levels
1. Formal integrity conditions based on SKOS
▹ Construction of well-formed and consistent data to promote interoperability
▹ Example: No two concepts may be connected by both skos:related and skos:broaderTransitive
▹ Read more: SKOS: A Guide for Information Professionals (Jane Frazier)
2. Labeling and documentation issues
▹ Construction of taxonomies that allow support for complex retrieval tasks
▹ Example: No two concepts of a concept scheme may have the same preferred label
▹ Read more: SKOS Primer (Antoine Isaac / Ed Summers)
3. Structural issues
▹ Logic-based processing of taxonomies
▹ Example: Avoidance of hierarchical cycles
▹ Read more: Key choices in the design of SKOS (Thomas Baker et al)
4. Content coverage
▹ Development of taxonomies that closely reflect the scope of the represented content
▹ Example: Avoid maintaining subtrees that only have limited occurrences in a representative
document corpus
▹ Read more: Corpus management with PoolParty
5. Network topological issues (experimental)
▹ (Co-)occurrences of concepts in a corpus should be reflected in the network topology of a
knowledge graph
▹ Example: Nodes/concepts with high betweenness centrality should occur correspondingly
in a reference document corpus
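The examples given for levels 1 to 3 above can be sketched as simple graph checks. This is an illustrative sketch under stated assumptions, not the implementation of any tool mentioned in this talk: the input shapes (`broader` as an adjacency dictionary, `related` as a list of pairs, `pref_labels` as a dictionary of language/label tuples) and the sample data are hypothetical, and comparing labels case-insensitively is a choice made here.

```python
def broader_transitive(broader, start):
    """All concepts reachable from `start` via one or more broader links."""
    seen, stack = set(), list(broader.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(broader.get(node, ()))
    return seen


def hierarchical_cycles(broader):
    """Concepts that turn up among their own transitive broader concepts."""
    return sorted(c for c in broader if c in broader_transitive(broader, c))


def related_hierarchy_clashes(broader, related):
    """Pairs linked by 'related' that are also linked hierarchically."""
    clashes = set()
    for a, b in related:
        if b in broader_transitive(broader, a) or a in broader_transitive(broader, b):
            clashes.add(tuple(sorted((a, b))))
    return sorted(clashes)


def duplicate_pref_labels(pref_labels):
    """Preferred labels (per language, case-insensitive) used by several concepts."""
    by_label = {}
    for concept, (lang, label) in pref_labels.items():
        by_label.setdefault((lang, label.lower()), []).append(concept)
    return {key: sorted(ids) for key, ids in by_label.items() if len(ids) > 1}


# Hypothetical sample data for illustration.
broader = {"a": ["b"], "b": ["c"], "c": [], "x": ["y"], "y": ["x"]}
related = [("a", "c"), ("a", "x")]
pref_labels = {"a": ("en", "Oil"), "b": ("en", "oil"), "c": ("en", "Gas")}

if __name__ == "__main__":
    print("Cycles:", hierarchical_cycles(broader))
    print("Related/hierarchy clashes:", related_hierarchy_clashes(broader, related))
    print("Duplicate preferred labels:", duplicate_pref_labels(pref_labels))
```

All three checks reuse the same transitive-closure helper, which is why a standards-based representation of the hierarchy makes them cheap to automate.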
7. Why are standards-based technologies and tools so important when it comes to taxonomy quality management?
Spreadsheet editors are still the most common type of software application
being used for taxonomy management. They cannot measure quality automatically.
8. ‘Good’ quality depends on the usage scenario
Example: Google Product Taxonomy has no synonyms at all, only hierarchical relations
9. How to pick the most relevant quality criteria for a taxonomy project
PoolParty supports various application scenarios. Quality checks can be enforced,
reported, or ignored.
10. How to pick the most relevant quality criteria for a taxonomy project
▸ General purpose thesaurus vs. custom enterprise taxonomy
▹ Custom enterprise taxonomies can be developed specifically on top of reference corpora
▹ General purpose thesauri are frequently used in the context of linked data environments
→ Linked data specific issues become more important
■ Missing In-Links
■ Missing Out-Links
■ Broken Links
■ Undefined SKOS Resources
■ HTTP URI Scheme Violation
See also: PoolParty SKOS Quality Checker based on qSKOS
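Two of the linked data issues listed above lend themselves to a short offline illustration. The sketch below uses one simplified reading of "HTTP URI Scheme Violation" (a URI whose scheme is neither http nor https) and of "Missing Out-Links" (a concept with no link to a resource on another host); the function names, the `links` structure, and the example.org URIs are assumptions made here, and qSKOS may define these checks differently in detail.

```python
from urllib.parse import urlparse


def http_uri_violations(uris):
    """URIs whose scheme is neither http nor https."""
    return [u for u in uris if urlparse(u).scheme not in ("http", "https")]


def missing_out_links(links, own_host):
    """Concepts that link to no resource hosted outside `own_host`."""
    return sorted(
        concept
        for concept, targets in links.items()
        if not any(urlparse(t).netloc != own_host for t in targets)
    )


# Hypothetical sample data for illustration.
uris = ["http://example.org/c1", "urn:x:c2", "https://example.org/c3"]
links = {
    "http://example.org/c1": ["http://other.example.com/a"],
    "http://example.org/c2": ["http://example.org/c1"],
    "http://example.org/c3": [],
}

if __name__ == "__main__":
    print("Scheme violations:", http_uri_violations(uris))
    print("Missing out-links:", missing_out_links(links, "example.org"))
```

Checks such as broken links would additionally require dereferencing each URI over the network, which is left out of this sketch.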
14. Unveil mismatch between taxonomy and document corpus
[Diagram: components and roles involved in corpus analysis. Roles: Content Manager, Integrator, Taxonomist/Ontologist. Components: Thesaurus Server, Extractor, PowerTagging, Corpus Learning/Semantic Analysis, Index, CMS. Edge labels: uses API, is user of, is basis of, annotates, enriches, extends, analyzes.]
15. Unveil mismatch between taxonomy and document corpus
PoolParty identifies concepts that are not used in a reference corpus at all and suggests
how those concepts could be reworked or extended to become relevant.
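The idea of finding concepts that never occur in a reference corpus can be illustrated with plain label matching. This is a deliberately naive sketch and not PoolParty's extraction method: the function names, the `labels_by_concept` structure, and the two sample documents are hypothetical, and real systems would typically add linguistic processing such as stemming or disambiguation.

```python
import re


def document_frequency(labels_by_concept, documents):
    """For each concept, count the documents containing at least one of its labels."""
    freq = {}
    for concept, labels in labels_by_concept.items():
        # Whole-word, case-insensitive match for every label of the concept.
        patterns = [
            re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
            for label in labels
        ]
        freq[concept] = sum(
            1 for doc in documents if any(p.search(doc) for p in patterns)
        )
    return freq


def unused_concepts(labels_by_concept, documents):
    """Concepts whose labels occur in no document of the corpus."""
    freq = document_frequency(labels_by_concept, documents)
    return sorted(c for c, n in freq.items() if n == 0)


# Hypothetical sample data for illustration.
labels_by_concept = {
    "crude_oil": ["crude oil", "petroleum"],
    "opec": ["OPEC"],
    "wind_power": ["wind power"],
}
documents = [
    "Crude oil prices fell after the OPEC meeting.",
    "Petroleum exports rose.",
]

if __name__ == "__main__":
    print("Document frequency:", document_frequency(labels_by_concept, documents))
    print("Unused concepts:", unused_concepts(labels_by_concept, documents))
```

Concepts reported as unused are candidates for review: they may need additional synonyms, or they may fall outside the scope of the content.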
17. Unveil mismatch between taxonomy and document corpus
PoolParty suggests possible ‘right places’ for the candidate concepts within the approved
taxonomy.
22. Combined analysis over network topology and reference corpus: Correlation Betweenness & Document Frequency
Example: STW Thesaurus for Economics and reference corpus about ‘Crude Oil Market’
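A combined analysis of this kind can be sketched in two steps: compute betweenness centrality for each concept in the taxonomy graph, then correlate it with the concept's document frequency in the reference corpus. The sketch below is an assumption about how such an analysis might look, not the procedure used in the talk: it uses Brandes' algorithm on an unweighted, undirected graph and a plain Pearson correlation, and the graph and the frequency counts are made-up sample values.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected adjacency dict."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        dist = {v: -1 for v in graph}   # BFS distance from s
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical taxonomy graph (a simple path) and document frequencies.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
doc_freq = {"a": 1, "b": 7, "c": 6, "d": 2}

if __name__ == "__main__":
    central = betweenness(graph)
    nodes = sorted(graph)
    r = pearson([central[n] for n in nodes], [doc_freq[n] for n in nodes])
    print("Betweenness:", central)
    print("Correlation with document frequency:", round(r, 3))
```

A low or negative correlation would indicate that structurally central concepts rarely occur in the corpus, which is the kind of mismatch this experimental criterion is meant to surface.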
25. qSKOS
▸ qSKOS is a tool for finding quality issues in SKOS vocabularies
▸ Available as free online service at http://qskos.poolparty.biz/
▸ A SKOS taxonomy is analyzed with regard to 24 issues
26. PoolParty Import Validator
▸ RDF Validation to go beyond SKOS
▸ Checks are defined in RDF; repair strategies are also defined in RDF
▸ 15 checks have been integrated
27. Shapes Constraint Language (SHACL)
▸ “Do for RDF what XML Schema does for XML”
▸ Language for validating RDF graphs against a set of conditions
▸ SHACL shape graphs are used to validate that data graphs satisfy a set of
conditions
▸ Current status: W3C Working Draft (14 August 2016)
See also: Towards maintainable constraint validation and repair for taxonomies:
The PoolParty approach (Christian Mader and Monika Solanki)
28. GET YOUR TEST ACCOUNT, GET CERTIFIED
Get your test account at
www.poolparty.biz/demo
Get certified at
www.poolparty.biz/academy/