Semantic enrichment of metadata is an important and difficult problem for digital heritage efforts such as Europeana. This presentation motivates and presents the work of a recently completed Task Force on the evaluation of semantic enrichment. In particular, we report on the design and results of a comparative evaluation experiment in which we assessed the enrichments of seven tools (or configurations thereof) on a sample benchmark dataset from Europeana.
1. Exploring Comparative Evaluation of Semantic
Enrichment Tools for Cultural Heritage Metadata
Hugo Manguinhas, Nuno Freire, Antoine Isaac, Valentine Charles | Europeana Foundation
Juliane Stiller | Humboldt-Universität zu Berlin
Aitor Soroa | University of the Basque Country
Rainer Simon | Austrian Institute of Technology
Vladimir Alexiev | Ontotext Corp
2. Agenda
TPDL 2016 - Exploring Comparative Evaluation of Semantic
Enrichment Tools for Cultural Heritage Metadata
CC BY-SA
• Europeana and Semantic Enrichments
• Goal and Steps of the Evaluation
• Results and Evaluation
• Conclusion
3. What is Europeana?
We aggregate metadata:
• From all EU countries
• ~3,500 galleries, libraries,
archives and museums
• More than 52M objects
• In about 50 languages
• Huge amount of references to
places, agents, concepts, time
Europeana aggregation infrastructure
Europeana| CC BY-SA
The Platform for Europe’s Digital Cultural Heritage
4. What are Semantic Enrichments?
● Key technique to contextualize objects
● Aims at adding new information at the semantic level to the
metadata about certain resources
● Linked Data context -> creation of new links between the
enriched resources and others
● Information Retrieval -> adding new terms to a query or
document and therefore reaching a higher visibility of
documents within the document space
5. Semantic Enrichments in Europeana
Offers richer context to items by linking them to
relevant entities (people, places, objects, types)
6. Automatic Enrichments are Challenging because of:
• Bewildering variety of objects;
• Differing provider practices for providing extra information;
• Data normalization issues;
• Information in a variety of languages, without the
appropriate language tags to determine the language.
7. Established a Task Force Gathering Domain Experts
8. Goals of the Task Force
• Perform concrete evaluations and comparisons of enrichment
tools in the CH domain
• Identify methodologies for evaluating enrichment in CH, which:
• are realistic with respect to the amount of resources to be employed;
• facilitate measuring enrichment progress over time;
• Build a reference set of metadata and enrichments that can be used
for future comparative evaluations.
9. Enrichment Tools Evaluated
• Semantic Enrichment Framework used at Europeana
• Dictionary-based, against a curated subset of DBpedia, GeoNames and
SemiumTime
• The European Library named entity recognition process
• NERD and heuristics, against GeoNames and the Gemeinsame Normdatei (GND)
• The Background Link service developed within the LoCloud project
• NERD applying supervised statistical methods, against DBpedia
• The Vocabulary Matching service (also from the LoCloud project)
• Dictionary-based, against several thesauri in SKOS
• The Recogito open source tool used in the Pelagios 3 project
• NERD and heuristics, against Wikidata
• The Ontotext Semantic Platform enrichment tool
• NERD with NLP, against MediaGraph
10. Evaluation of Enrichments Tools
Steps performed:
1. Select a sample dataset for enrichment,
2. Apply different tools to enrich the sample dataset,
3. Manually create reference annotated corpus,
4. Compare automatic annotations with the manual ones to
determine performance of enrichment tools.
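Step 4 can be sketched as a simple comparison of each tool's enrichments against the manually annotated reference. All record identifiers, fields and entity URIs below are hypothetical examples, not data from the actual corpus:

```python
# Step 4 sketch: compare a tool's enrichments against manual reference
# annotations, keyed by (record, field) pairs pointing at entity URIs.
tool_enrichments = {
    ("record/001", "dc:subject"): "http://dbpedia.org/resource/Paris",
    ("record/002", "dc:creator"): "http://dbpedia.org/resource/Victor_Hugo",
}
reference = {
    ("record/001", "dc:subject"): "http://dbpedia.org/resource/Paris",
    ("record/002", "dc:creator"): "http://dbpedia.org/resource/Hugo_(name)",
}

# An enrichment counts as correct when it targets the same entity as
# the manual annotation for the same record and field.
correct = sum(1 for key, uri in tool_enrichments.items()
              if reference.get(key) == uri)
print(f"{correct}/{len(tool_enrichments)} enrichments match the reference")
```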
11. The Sample Dataset
• A subset of The European Library (TEL) submitted to
Europeana
• Biggest data aggregator for Europeana -> widest coverage of
countries and languages.
• For each of the 19 countries, we randomly selected a
maximum of 1000 records.
Resulting in a dataset of 17,300 metadata
records in 10 languages
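The per-country sampling can be sketched as follows; the cap of 1000 records matches the slide, while the country codes and collection sizes are hypothetical:

```python
import random

def sample_per_country(records_by_country, cap=1000, seed=0):
    """Randomly select at most `cap` records per country (sketch)."""
    rng = random.Random(seed)
    selected = []
    for country, records in sorted(records_by_country.items()):
        selected.extend(rng.sample(records, min(cap, len(records))))
    return selected

# Hypothetical example: one large and one small national collection.
records = {"fr": [f"fr-{i}" for i in range(1500)],
           "mt": [f"mt-{i}" for i in range(120)]}
sample = sample_per_country(records)
print(len(sample))  # 1000 + 120 = 1120
```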
12. Building the corpus
● Each tool was applied to the dataset, resulting in a total of ~360K
enrichments
● Each enrichment set for each tool was:
○ normalized to a reference target vocabulary using co-referencing
information (~62.16% success rate)
■ GeoNames (for places), DBpedia (for other resource types)
○ compared to identify the overlap between tools, resulting in 26
agreement combinations (plus non-agreements)
● At most 1000 enrichments were selected per set:
○ resulting in a main corpus of 1757 enrichments
○ plus a secondary corpus of 46 enrichments for measuring
inter-rater agreement
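The agreement combinations can be computed by grouping each distinct enrichment by the set of tools that produced it. A minimal sketch with hypothetical enrichment sets (in the study this yielded 26 such combinations):

```python
from collections import defaultdict

# Hypothetical normalized enrichment sets per tool: (record, target URI).
enrichments_by_tool = {
    "EF":       {("r1", "geo:2988507"), ("r2", "geo:2643743")},
    "TEL":      {("r1", "geo:2988507")},
    "Pelagios": {("r2", "geo:2643743"), ("r3", "geo:3169070")},
}

# For each distinct enrichment, collect the tools that agree on it...
tools_per_enrichment = defaultdict(set)
for tool, enrichments in enrichments_by_tool.items():
    for enrichment in enrichments:
        tools_per_enrichment[enrichment].add(tool)

# ...then group enrichments by the combination of agreeing tools.
by_combination = defaultdict(list)
for enrichment, tools in tools_per_enrichment.items():
    by_combination[frozenset(tools)].append(enrichment)

print(sorted(len(v) for v in by_combination.values()))
```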
13. Building the corpus (overview)
[Figure: overview of the corpus-building pipeline]
● TEL Dataset (~175K records) → randomly selected ~1000 records per country (19 countries) → subset of the TEL dataset (17.3K records)
● Applied 7 enrichment tools / tool settings → raw enrichments (~360K)
● Normalized using a reference target dataset → normalized enrichments (~360K)
● Compared amongst tools → agreement combinations (26 sets)
● Selected at most 1000 distinct enrichments per distinct set → evaluation corpus (1757 distinct enrichments)
● Annotated according to the Annotation Guidelines → annotated corpus (1757 enrichments), plus an inter-annotator corpus (46 enrichments)
14. Annotation criteria and guidelines
● Semantic correctness (C = Correct, I = Incorrect, U = Unsure):
Is the enrichment semantically correct or not?
● Completeness of name match (F = Full match, P = Partial match, U = Unsure):
Was the whole phrase/named entity enriched or only parts of it?
● Completeness of concept match (F = Full match, B = Broader than, N = Narrower than, U = Unsure):
Whether the matched concept is at the same level of conceptual abstraction
as the named entity/phrase. Since the exact concept is sometimes not
available in the target vocabulary, a narrower or broader concept may be
used in the enrichment. Use B when the identified (target) concept is
broader than the intended concept, and N when it is narrower. This also
applies to geographical regions, where the identified concept may describe
a smaller entity (narrower) or a bigger geographical region (broader).
In addition to the criteria, the guidelines also included:
● explanation of how to apply the criteria
● concrete examples matching all possible combinations of the criteria
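One possible encoding of an annotated enrichment following these three criteria; the class and field names are our own illustration, not the Task Force's actual schema:

```python
from dataclasses import dataclass

@dataclass
class AnnotatedEnrichment:
    enrichment_id: str
    semantic_correctness: str  # C = correct, I = incorrect, U = unsure
    name_match: str            # F = full, P = partial, U = unsure
    concept_match: str         # F = full, B = broader, N = narrower, U = unsure

    def validate(self):
        """Check that each annotation uses one of the allowed codes."""
        assert self.semantic_correctness in "CIU"
        assert self.name_match in "FPU"
        assert self.concept_match in "FBNU"

# Example: a correct enrichment matching only part of the phrase,
# with a broader concept (e.g. "France" matched for "Paris, France").
a = AnnotatedEnrichment("enr-0001", "C", "P", "B")
a.validate()
```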
15. Example of the annotated corpus
16 members of our Task Force
manually assessed the correctness of
both corpora applying the criteria
16. Measuring performance of each tool
● We adapted the Information Retrieval notions of precision and
recall to measure the performance of enrichment tools
● Two measures were defined based on the criteria:
○ relaxed: all enrichments annotated as semantically correct are
considered “true”, regardless of their completeness
○ strict: considers as “true” only the semantically correct enrichments
with full name and concept completeness
● As it was not possible within the Task Force to identify all possible
true enrichments, we opted for a pooled recall using the total number
of enrichments
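The two measures can be sketched as follows, assuming each annotation is a (correctness, name match, concept match) code triple; this is an illustrative sketch, not the Task Force's exact implementation:

```python
def measure(annotations, pool_size):
    """Relaxed/strict precision and relaxed pooled recall for one tool.

    `annotations`: (correctness, name_match, concept_match) code triples
    for the tool's annotated enrichments; `pool_size`: total number of
    enrichments in the pooled corpus (the pooled-recall denominator).
    """
    n = len(annotations)
    relaxed = sum(c == "C" for c, nm, cm in annotations)
    strict = sum(c == "C" and nm == "F" and cm == "F"
                 for c, nm, cm in annotations)
    return {
        "precision_relaxed": relaxed / n,
        "precision_strict": strict / n,
        "pooled_recall": relaxed / pool_size,
    }

# Hypothetical example: 4 annotated enrichments out of a pool of 10.
scores = measure([("C", "F", "F"), ("C", "P", "B"),
                  ("I", "F", "F"), ("U", "F", "F")], pool_size=10)
print(scores)  # precision_relaxed 0.5, precision_strict 0.25, pooled_recall 0.2
```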
17. Measuring performance of each tool
| Group | Tool | Annotated enrichments (% of full corpus) | Precision (relax.) | Precision (strict) | Max pooled recall | Est. recall (relax.) | Est. recall (strict) | Est. F-measure (relax.) | Est. F-measure (strict) |
|---|---|---|---|---|---|---|---|---|---|
| A | EF | 550 (31.3%) | 0.985 | 0.965 | 0.458 | 0.355 | 0.432 | 0.522 | 0.597 |
| A | TEL | 391 (22.3%) | 0.982 | 0.982 | 0.325 | 0.254 | 0.315 | 0.404 | 0.477 |
| B | BgLinks | 427 (24.3%) | 0.888 | 0.574 | 0.355 | 0.249 | 0.200 | 0.389 | 0.296 |
| B | Pelagios | 502 (28.6%) | 0.854 | 0.820 | 0.418 | 0.286 | 0.340 | 0.428 | 0.481 |
| B | VocMatch | 100 (5.7%) | 0.774 | 0.312 | 0.083 | 0.048 | 0.024 | 0.091 | 0.045 |
| B | Ontotext v1 | 489 (27.8%) | 0.842 | 0.505 | 0.407 | 0.272 | 0.202 | 0.411 | 0.289 |
| B | Ontotext v2 | 682 (38.8%) | 0.924 | 0.632 | 0.567 | 0.418 | 0.354 | 0.576 | 0.454 |
The observed percentage agreement (the agreement component of Fleiss' kappa) was 79.9%
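The observed percentage agreement underlying Fleiss' kappa can be sketched as the mean pairwise agreement across raters; the rater codes below are hypothetical:

```python
from itertools import combinations

def observed_agreement(ratings):
    """Fraction of rater pairs, over all items, assigning the same code."""
    agreeing = total = 0
    for codes in ratings.values():
        for a, b in combinations(codes, 2):
            total += 1
            agreeing += (a == b)
    return agreeing / total

# Hypothetical codes from three raters on two enrichments.
sample = {"enr-1": ["C", "C", "C"], "enr-2": ["C", "I", "C"]}
agreement = observed_agreement(sample)
print(round(agreement, 3))  # 4 of 6 rater pairs agree -> 0.667
```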
18. Conclusions
1. Select a dataset for your evaluation that represents the diversity of
your (meta)data
2. Building a gold standard is ideal but not always possible
3. Consider using the semantics of target datasets
4. Try to keep a balance between the evaluated tools
5. Give clear guidelines on how to annotate the corpus
6. Use the right tool for annotating the corpus
19. Recommendation for Enrichment Tools
● Consider applying different techniques depending on the kind of
values used within the enriched property
● Matches on parts of a field's textual content may result in too
general or even meaningless enrichments if they fail to recognize
compound expressions
○ In the Europeana context, concepts as broad as "general period" add
no value as enrichments
● Apply a strong disambiguation mechanism that considers the
accuracy of a name reference together with the relevance of the
entity
● Quality issues originating in the metadata mapping process are a
major obstacle to obtaining good enrichments
20. Next steps...
● Build a gold standard with
○ significant language and spatial coverage and
○ diversity of subjects and domains
● Develop a tool for helping annotators in creating and managing
the gold standard
○ tuned for the annotation task, collaborative and easily accessible
● Develop (or extend) a solution for running evaluations
○ possibly inspired by GERBIL
● Re-evaluate the tools
● Integrate enrichment evaluation into a holistic evaluation
framework for search
22. References
● Monti, Johanna, Mario Monteleone, Maria Pia di Buono, and Federica Marano. 2013. “Cross-Lingual
Information Retrieval and Semantic Interoperability for Cultural Heritage Repositories.”
In Recent Advances in Natural Language Processing, RANLP 2013, 9-11 September, 2013, Hissar,
Bulgaria, 483–90. http://aclweb.org/anthology/R/R13/R13-1063.pdf.
● Petras, Vivien, Toine Bogers, Elaine G. Toms, Mark M. Hall, Jacques Savoy, Piotr Malak, Adam
Pawlowski, Nicola Ferro, and Ivano Masiero. 2013. “Cultural Heritage in CLEF (CHiC) 2013.” In
Information Access Evaluation. Multilinguality, Multimodality, and Visualization - 4th International
Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Proceedings,
192–211. doi:10.1007/978-3-642-40802-1_23.
● Stiller, Juliane, Vivien Petras, Maria Gäde, and Antoine Isaac. 2014. “Automatic Enrichments with
Controlled Vocabularies in Europeana: Challenges and Consequences.” In Digital Heritage.
Progress in Cultural Heritage: Documentation, Preservation, and Protection, 238–47.
● Usbeck, Ricardo, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin
Brümmer, Diego Ceccarelli, et al. 2015. “GERBIL: General Entity Annotator Benchmarking
Framework.” In Proceedings of the 24th International Conference on World Wide Web, 1133–43.
WWW ’15. New York, NY, USA: ACM. doi:10.1145/2736277.2741626.
23. Previous Studies
• Evaluations of enrichments were performed sporadically:
• Sample evaluation of alignments of places and agents to DBpedia
• Evaluation of the relationship between enrichments, objects and
queries which retrieve enriched objects (Stiller et al. 2014)
• Evaluations of retrieval performance in the CH domain also indirectly
target enrichments (e.g. Monti et al. 2013, Petras et al. 2013)
• Tools and frameworks for enrichment:
• GERBIL (Usbeck et al. 2015)