Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval

Ontologies and their use in
Information Retrieval
Mauro Dragoni
Fondazione Bruno Kessler (FBK), Shape and Evolve Living Knowledge Unit (SHELL)
https://shell.fbk.eu/index.php/Mauro_Dragoni - dragoni@fbk.eu
KEYSTONE Training School, Malta
July, 20th 2015

Outline
1. On your marks and get set…
2. A general approach: pros and cons of concept-based structured
representations
3. Ontology-based IR platforms
4. Behind the lines
a) Cross-language Information Retrieval
b) Ontology Matching

Before we start…
 What is an ontology?
 What is a machine-readable dictionary?
 What about ambiguity?
 Terms vs. concepts, is everything clear?

What is an ontology?
 “the branch of philosophy which deals with the nature and the organization
of reality”
 “an ontology is an explicit specification of a conceptualization”
[Gruber1993]
 conceptualization: abstract model of the world
 explicit specification: model described by using unambiguous language
 domain ontology
 upper ontology
 example: DOLCE [Guarino2002]

Ontology Components
 Classes: entities describing the characteristics common to a set of objects (for
example: “Agricultural Method”).
 Individuals: entities that are instances of classes (for example “Multi Crops
Farming” is an instance of “Agricultural Method”).
 Properties: binary relations between entities (for example “IsAffectedBy”).
 Attributes (or DataType Properties): characteristics that qualify individuals
(for example “Has Name”).

Hierarchies
 Concepts can be organized in subsumption hierarchies
 Meaning: every instance of a sub-concept is also an instance of its super-concept
 Examples:
 “Intensive Farming” is-a “Agricultural Method”
 “Agricultural Method” is-a “Method”
 Concept hierarchies are generally represented by using tree structures
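
A minimal sketch of a subsumption check over a tree-shaped hierarchy, assuming each concept has at most one direct super-concept. The `PARENT` table and the function names are illustrative and only reuse the examples from this slide.

```python
# Hypothetical is-a edges taken from the examples on this slide:
# each concept maps to its direct super-concept.
PARENT = {
    "Intensive Farming": "Agricultural Method",
    "Agricultural Method": "Method",
}

def ancestors(concept, parent=PARENT):
    """Return all super-concepts of `concept`, nearest first."""
    result = []
    while concept in parent:
        concept = parent[concept]
        result.append(concept)
    return result

def is_a(sub, sup, parent=PARENT):
    """True if `sub` is subsumed by `sup` (is-a is reflexive and transitive)."""
    return sub == sup or sup in ancestors(sub, parent)

print(ancestors("Intensive Farming"))
print(is_a("Intensive Farming", "Method"))
```

Because is-a is transitive, “Intensive Farming” is also a “Method” even though that edge is never stated explicitly.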

Attributes and Properties
 Properties: binary relations between classes
 Domain and co-domain: classes to which individuals need to belong to be in
relation
 Example: “Agriculture” <isAffectedBy> “Agriculture Pollution”
 Attributes: binary relations between an individual and values (not other
entities)
 Domain: class to which the attribute is applied
 Co-domain: the type of the value (for example “String”)
 Properties and Attributes can be organized in hierarchies.

Steps for building an ontology
 To identify the classes of the domain.
 To organize them in a hierarchy.
 To define properties and attributes.
 To define individuals, if there are any.

Why are ontologies useful?
 Ontologies provide:
 a common dictionary of terms;
 a shared and formal interpretation of the domain.
 Ontologies make it possible to:
 solve ambiguities;
 share knowledge (not only between humans, but also between machines);
 use automatic reasoning techniques.

Use of ontologies in IR
 Exploit metadata
 Entity linking
 “which president …”  “Barack Obama is-a President”
 Extraction of triples from text
 applying NLP parsers for extracting dependencies

What is a thesaurus?
 A “coarse” version of ontologies
 Generally, 3 kinds of relations are represented:
 hierarchical (generalization/specialization)
 equivalence (synonymy)
 associative (other kinds of relationships)
 Extensively used for query expansion approaches [Bhogal2007,
Grootjen2006, Qiu1993, Mandala2000]

Machine-readable dictionaries
 A dictionary in an electronic form.
 The power of MRD is characterized by word senses. [Kilgariff1997,
Lakoff1987, Ruhl1989]
 Identity of meaning: synonyms [Gove1973]
 Inclusion of meaning: hyponymy or hyperonymy; troponymy [Cruse1986,
Green2002, Fellbaum1998]
 transitive relationship
 Part-whole meaning: meronymy (has part), holonymy (part of)
[Green2002, Cruse1986, Evens1986]
 Opposite meaning: antonymy

and now…
… let’s see how we can exploit this within
an information retrieval system…

Motivations and Challenges
 Considering how information is usually represented and classified.
 Documents and Queries are represented using terms.
 Indexing:
 terms are extracted from each document;
 the frequency of each term within each document is computed (TF);
 the inverse document frequency of each term over the entire index is computed (IDF).
 Searching:
 the vector space model is used to compute the similarity between documents and
queries;
 queries are generally expanded to increase the recall of the system.
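
The term-based pipeline above can be sketched as follows. This is only an illustration: the smoothed IDF variant (`log(N/df) + 1`), the toy documents and the function names are assumptions, not taken from the slides.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns ({term: weight} per document, idf table)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption; many variants exist).
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["organic", "farming", "methods"], ["intensive", "farming"], ["stock", "market"]]
vectors, idf = tf_idf_vectors(docs)
# The query is weighted with the same IDF table; unknown terms get weight 0.
query = {t: idf.get(t, 0.0) for t in ["organic", "farming"]}
scores = [cosine(query, v) for v in vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Note that a document using only a synonym of a query term would score 0 here, which is the drawback discussed on the next slides.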

Drawbacks of the
Term-Based representation – 1/2
 The “semantic connections” between terms in documents and queries are
not considered.
 Different vector positions may be allocated to the synonyms of the same
term:
 the importance of a given concept is distributed among different vector
components;
 information loss.

Drawbacks of the
Term-Based representation – 2/2
 The query expansion has to be used carefully.
 It is easier to increase the recall of a system than its precision.
Which is better? [Abdelali2007]
 In the worst case, the size of a document vector could be close to the
number of terms used in the repository:
 in general, the number of concepts is less than the number of words;
 the time needed to compare documents is higher;

Intuition Behind
 Using concepts to represent the terms contained in documents and
queries. [Dragoni2012b]
1. Documents and Queries may be represented in the same way.
2. The question of how many and which terms to use for query
expansion does not arise.
3. The size of a concept vector is generally smaller than the size of a term vector.
 IMPORTANT: This is not a query expansion technique !!!
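
A minimal sketch of the idea, assuming a hypothetical term-to-concept table (in practice it would come from a machine-readable dictionary). The concept identifiers and the raw-count weighting are illustrative only; the actual concept weighting scheme is not reproduced here.

```python
from collections import Counter

# Hypothetical term-to-concept map: synonyms share one concept identifier.
TERM_TO_CONCEPT = {
    "car": "C_vehicle", "automobile": "C_vehicle",
    "doctor": "C_physician", "physician": "C_physician",
}

def concept_vector(tokens, mapping=TERM_TO_CONCEPT):
    """Count concept occurrences; terms without a known concept are skipped."""
    return Counter(mapping[t] for t in tokens if t in mapping)

doc = ["car", "automobile", "doctor"]
query = ["physician", "automobile"]
doc_vec = concept_vector(doc)
query_vec = concept_vector(query)
print(doc_vec)
print(query_vec)
```

The document and the query share no term, yet their concept vectors overlap on both components, and no expansion step was needed. Terms missing from the table are silently dropped, which is why coverage of the knowledge base matters.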

a first simple example …
 a closed vocabulary:

 how to compute concept weights?

 how is each concept of the vocabulary weighted?
 suppose we have the document “xxyyyz”

… that we evaluated
 Experiments on the MuchMore Collection (http://muchmore.dfki.de)
 The collection contains numerous medical terms.
 The term-based representation has an advantage over the semantic
representation.
 Experiments on the TREC Ad-Hoc Collection:
 Results have been compared with the IR systems presented at the TREC-7 and TREC-8
conferences.
 Only the systems that implement a semantic representation of queries have
been considered.
 Among dozens of runs, the three systems that perform best at recall 0.0 have
been chosen. [Spink2006]

MuchMore Collection
System                                 P@5    P@10   P@15   P@30   MAP
Term-Based                             0.544  0.480  0.405  0.273  0.449
Synset-Based                           0.648  0.484  0.403  0.309  0.459
Conceptual Indexing                    0.770  0.735  0.690  0.523  0.449
Ontology Indexing                      0.784  0.765  0.728  0.594  0.477

TREC-7
System                                 P@5    P@10   P@15   P@30   MAP
Term-Based                             0.444  0.414  0.375  0.348  0.199
AT&T Labs 1                            0.644  0.558  0.499  0.419  0.296
AT&T Labs 2                            0.644  0.558  0.497  0.413  0.294
City University, Sheffield, Microsoft  0.572  0.542  0.507  0.412  0.288

TREC-8
System                                 P@5    P@10   P@15   P@30   MAP
Term-Based                             0.476  0.436  0.389  0.362  0.243
IBM Watson                             0.588  0.504  0.472  0.410  0.301
Microsoft Research                     0.580  0.550  0.499  0.425  0.317
TwentyOne                              0.500  0.454  0.433  0.368  0.292
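
For readers unfamiliar with the measures in these tables, a minimal sketch of P@k and of the per-query average precision that MAP averages over all queries. The ranked list and the relevance judgments are made-up toy data.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy data: "d5" is relevant but never retrieved, which lowers the average precision.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
print(p5, ap)
```

MAP is then the mean of `average_precision` over the whole query set.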

Some considerations
 Two drawbacks have been identified:
 The absence of some terms in the ontology (in particular terms related to
specific domains like biomedical, mechanical, business, etc.) may affect the
final retrieval result.
 a more complete knowledge base is needed.
 Term ambiguity. By using a Word Sense Disambiguation approach, concepts
associated with incorrect senses would be discarded or weighted less.
 a Word Sense Disambiguation algorithm is required: but it has to be used carefully.

Checkpoint 1
 the use of machine-readable dictionaries is suitable for implementing a
first semantic engine
 but if we use ontologies we have much more information
 properties
 attributes
 the problem is: how can we exploit all this information?

Ontology enhanced IR
 Enrichment of documents (and queries) with information coming from
semantic resources
 information expansion: adding synonyms, antonyms, … not new but still helpful
 annotations: relation or association between a semantic entity and a document
 Most of the information expansion systems are based on WordNet and
Roget’s Thesaurus
 Systems using annotations are interfaced with the Linked Open Data
cloud, and mainly with Freebase and Wikipedia

Classification of
Semantic IR approaches
Criterion: Semantic knowledge representation
• Statistical [Deerwester1990]
• Linguistic conceptualization [Gonzalo1998, Mandala1998, Giunchiglia2009]
• Ontology-based [Guha2003, Popov2004]
Criterion: Scope
• Web search [Finin2005, Fernandez2008]
• Limited domain repositories [Popov2004]
• Desktop search [Chirita2005]
Criterion: Query
• Keyword query [Guha2003]
• Natural language query [Lopez2009]
• Controlled natural language query [Bernstein2006, Cohen2003]
• Structured query based on ontology query language [notes]
Criterion: Content retrieved
• Data retrieval
• Information retrieval
Criterion: Content ranking
• No ranking
• Keyword-based ranking [Guha2003]
• Semantic-based ranking [Stojanovic2003]

Limitation of Semantic
IR approaches – 1/2
Criterion Limitation IR Semantic
Semantic knowledge representation
• No exploitation of the full potential of an ontological language, beyond what can be reduced to conventional classification schemes.
x (Partially)
Scope
• No scalability to large and heterogeneous repositories of documents.
x
Goal
• Boolean retrieval models where the Information Retrieval problem is reduced to a data retrieval task.
x
Query
• Limited usability
x
Limitation of Semantic
IR approaches – 2/2
Criterion Limitation IR Semantic
Content retrieved
• Focus on textual content: no management of different formats (multimedia)
(Partially) (Partially)
Content ranking
• Lack of semantic ranking criterion. The ranking (if provided) relies on keyword-based approaches.
x x
Coverage
• Knowledge incompleteness. [Croft1986]
(Partially) x
Evaluation
• Lack of standard evaluation frameworks. [Giunchiglia2009]
x

A basic ontology-based IR model
[Figure: architecture of the basic ontology-based IR model. Components shown: User, SPARQL Editor, SPARQL Query, Query Processing, Searching, Indexing, Ranking, Semantic Entities, Semantic Knowledge (ontology + KB), Document Corpus, Semantic Index (weighted annotations), Unsorted Documents, Ranked Documents]
Basic ontology-based IR model - Limits
 Heterogeneity
 a single ontology (or even a set of them) cannot cover all possible domains
 Scalability
 imagine annotating the Web by using all knowledge bases currently available
 a final solution does not exist… but nice and practical approaches can be used
 Usability
 try to think… are all the people you know able to write queries in SPARQL?

Extended ontology-based IR model
[Figure: architecture of the extended ontology-based IR model. Components shown: User, Nat. Lang. Interface, Natural Lang. Query, Query Processing, Searching, Indexing, Ranking, Semantic Entities, Preprocessed Semantic Knowledge, Semantic Web, Unstructured Web contents, Semantic Index (weighted annotations), Unsorted Documents, Ranked Documents]

Evaluation Results
 Mean Average Precision
Semantic System: 0.16; Lucene: 0.1; TREC Automatic: 0.2
 Prec@10
Semantic System: 0.37; Lucene: 0.25; TREC Automatic: 0.30

A focus on the indexing procedure
 Challenge: to link semantic knowledge with documents and queries in an
efficient and effective way:
 document corpus and semantic knowledge should remain decoupled;
 annotations have to be provided in a flexible and scalable way.
 Annotations can be provided in two ways:
 by applying an information extraction technique based on pure NLP
approaches;
 by applying a contextual semantic information approach.

Annotator Requirements
 Identification of the entities within the documents
 conceptually, it is not very different from a traditional IR indexing process
 Ontologies must not be touched (decoupling)
 Should be open-domain
 Scalability-friendly:
 indexing of ontologies;
 indexing of documents;
 an interesting alternative: usage of non-embedded annotations

Natural Language Processing Annotation
Input HTML document:
<html>
<body>
<p>Schizophrenia patients whose medication couldn’t stop
the imaginary voices in their heads</p>
</body>
</html>
→ HTML Parser →
Schizophrenia patients whose medication couldn’t stop the imaginary voices in their heads
→ NLP Tools →
<document><p><s>
<w c="w" pos="NNP" stem="Schizophrenia">Schizophrenia</w>
<w c="w" pos="NN$" stem="patient">patients</w>
<w c="w" pos="WP$">whose</w>
<w c="w" pos="NN" stem="medication">medications</w>
<w c="w" pos="MD">could</w><w c="w" pos="RB">not</w>
<w c="w" pos="VB" stem="stop">stop</w>
<w c="w" pos="DT">the</w>
<w c="w" pos="JJ">imaginary</w>
<w c="w" pos="NN$" stem="voice">voices</w>
<w c="w" pos="IN">in</w>
<w c="w" pos="PRP$">their</w>
<w c="w" pos="NN$" stem="head">heads</w>
</s></p></document>
→ Token Filter →
schizophrenia
patient
medication
stop
voice
head
Natural Language Processing Annotation
Input tokens: schizophrenia, patient, medication, stop, voice, head
→ Index Searcher →
Keyword          Ontology Entities
schizophrenia    E1
patient          E4, E5
…                …
head             E2, E8
→ Frequency Counter →
Ontology Entity  Document Frequencies
E1               D1(1), D4(2)
…                …
E8               D1(1), D5(4), D6(3)
→ Annotation Creator →
Ontology Entity  Document  Weight
E1               D1        0.9
…                …
E8               D1        0.3

Contextual Semantic Annotation
Input: Ontology
Selection of a semantic entity:
E1: Individual Maradona; Labels: {“Maradona”, “Diego Maradona”, “Pelusa”}
Search of terms in the document index:
Keyword          Documents
Maradona         D1, D2, D87
Pelusa           D95, D140
Potential documents to annotate: {D1, D2, D87, D95, D140}
Selection of the semantic context:
E34 = Class: football_player; Labels: {“football player”}
E22 = Individual: Argentina; Labels: {“Argentina”}
Selection of contextualized terms in the document index:
Keyword          Documents
football_player  D87, D61, D44, D1
Argentina        D43, D32, D2
Contextualized documents: {D1, D2, D32, D43, D44, D61, D87}

Contextual Semantic Annotation
Potential documents to annotate: {D1, D2, D87, D95, D140}
Contextualized documents: {D1, D2, D32, D43, D44, D61, D87}
Selection of semantic contextualized documents:
Documents to annotate: {D1, D2, D87, D95, D140}
Creation of annotations:
Ontology Entity  Document  Weight
E1               D1        0.5
E1               D2        0.2
E1               D87       0.67
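
The two document sets of this example can be reproduced with plain set operations. This is a sketch over the toy index shown on the slide; the helper name `docs_for` is illustrative, and the annotation weights are not computed because the weighting formula is not given here.

```python
# Toy document index copied from the example: keyword -> documents containing it.
INDEX = {
    "Maradona": {"D1", "D2", "D87"},
    "Pelusa": {"D95", "D140"},
    "football_player": {"D87", "D61", "D44", "D1"},
    "Argentina": {"D43", "D32", "D2"},
}

def docs_for(keywords, index=INDEX):
    """Union of the posting sets of all given keywords (unknown keywords match nothing)."""
    found = set()
    for kw in keywords:
        found |= index.get(kw, set())
    return found

# Labels of entity E1; "Diego Maradona" does not occur in the toy index.
potential = docs_for(["Maradona", "Diego Maradona", "Pelusa"])
# Index keywords of the semantic context of E1 (entities E34 and E22).
contextualized = docs_for(["football_player", "Argentina"])
both = potential & contextualized

print(sorted(potential))
print(sorted(contextualized))
print(sorted(both))
```

The intersection is {D1, D2, D87}, which are exactly the three documents that appear in the annotation table for E1.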

An idea for aggregating rankings
 Multi-dimensional aggregation criteria
 Document score is computed from different perspectives (criteria)
 Assignment of priorities to criteria
 Compute criteria weights
 Weight of criteria with low priority depends on the score of criteria with high
priority
 Aggregate criteria scores [Dragoni2012]
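
One possible way to make the weight of a low-priority criterion depend on the scores of the higher-priority ones is sketched below. The weighting rule (first weight 1, each following weight equal to the previous weight times the previous score) is an assumption chosen to match the bullets above; the slide does not state the exact operator used in [Dragoni2012].

```python
def prioritized_score(scores):
    """Aggregate criterion scores in [0, 1], ordered from highest to lowest priority.

    Assumed scheme: weight_1 = 1 and weight_i = weight_(i-1) * score_(i-1).
    """
    weight, total = 1.0, 0.0
    for s in scores:
        total += weight * s
        weight *= s  # a low score on this criterion dampens all later criteria
    return total

high_first = prioritized_score([0.9, 0.5, 0.8])  # 0.9 + 0.9*0.5 + 0.45*0.8, about 1.71
low_first = prioritized_score([0.1, 0.9, 0.9])   # 0.1 + 0.1*0.9 + 0.09*0.9, about 0.271
print(high_first, low_first)
```

A document that fails the most important criterion cannot recover through good scores on the less important ones, which is the intended effect of assigning priorities.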

Querying and Ranking
 Queries are transformed by mapping terms to ontology entities
 Contextual disambiguation is very important
 simple example: “Rock musicians Britain”
 Ranking: two options
 to evaluate only the “matches” between detected entities
 to aggregate (in your own way) the rankings produced by using only the entities, only the
query terms, or both of them

Use of multiple ontologies
 What we need: an Ontology Gateway
 Tasks of an ontology gateway:
 collect available semantic content;
 store the semantic content efficiently in order to ease its access;
 implement an approach for the “selection” of the content
 Most important ontology gateways online:
 Swoogle [Ding2004,Brin1998]
 Watson [Aquin2007,Aquin2007b]
 WebCORE [Fernandez2006,Fernandez2007]

Use of multiple ontologies - opportunities
 Recall improvement:
 Ontology 1 focused on entities → stress on the identification of semantic
entities within the document
 Ontology 2 focused on properties → stress on the identification of relationships
between entities in the document
 precision should also increase, but some drops are possible.
 Supporting multiple perspectives:
 analysis of each entity from different points of view

Use of multiple ontologies - challenges
 To figure out how to use them:
 it is necessary to formally represent the relationships between the ontologies
and the techniques used for extracting information from them;
 example: you may have ontologies describing the same domain by using
different structures!!!
 To find suitable ontologies and mappings:
 again: more than one ontology may describe the same domain;
 not a good practice to select only one → build mappings!!!

A use case
 Information system containing technical data about products
 users look for something that satisfies their needs
 engineers want to exploit information for creating new product variants
 Ontologies focused on particular aspects of products
 product conceptualizations are separated

Checkpoint 2
 Annotation of documents is more important than the querying of the
repositories… why?
 differences in the amount of content
 once we have decided how to annotate documents, queries should be
annotated by using the same procedure in order to homogenize the process
 Challenges in building knowledge bases
 Ranking… play with them and “stress your creativity”

Ontologies and IR – 2 use cases
 Demonstrate the usefulness of semantic approaches used in combination
with traditional IR techniques.
 Show how IR and Semantics may help each other
 Two scenarios:
 Cross-language information retrieval [Dragoni2014]
 Ontology matching [Dragoni2015]
 Sentiment analysis

Cross-Language Information Retrieval
Background - Challenges
 Out-of-Vocabulary issue
 improve the corpora used for training the machine translation model.
 usage of domain information for increasing the coverage of the
dictionaries.
 Usage of semantic artifacts for structuring the representation of
(multilingual) documents.
 GOAL: to integrate domain-specific semantic knowledge
within a CLIR system and evaluate its effectiveness

Our Scenario
 Use case: the agricultural domain
 Knowledge resources: Agrovoc and Organic.Lingua ontologies
 3 components used in the proposed approach:
 Annotator
 Indexer
 Retriever

Annotation Process – Step 1
[Figure: two panels listing the languages en, es, it, de, fr, …]
 Document content is used as the query.
 Among the candidate results, only “exact matches” are
considered.
Annotation Process – Step 2

Approach – Annotation Stats
Domain ontology: number of concepts; manual annotations; automatic annotations
Agrovoc (AV): 32061; 0; 133596 (5834 distinct concepts used)
Organic.Lingua (OL): 291; 27871 (264 distinct concepts used); 16434 (208 distinct concepts used)

Approach - Index
 Given a document:
 Text and annotations are extracted.
 The context of each concept is retrieved from the ontologies.
 Each contextual concept is indexed with a weight proportional
to its semantic distance from the semantic annotation.
 Structure of each index record:

Approach - Retriever
 Three retrieval configurations available:
 Only translations: query terms are translated by using machine
translation services.
 Semantic expansion by exploiting the domain ontology: query terms
are matched with ontology concepts; if an exact match exists, query
is expanded by using the URI of the concept and the URIs of the
contextual ones.
 Ontology matching only: terms not having an exact match with
ontology concepts are discarded.

Evaluation - Setup
 Collection of 13,000 multilingual documents.
 48 queries originally provided in English and manually translated
into 12 languages under the supervision of both domain and
language experts.
 Gold standard manually built by the domain experts.
 MAP, Prec@5, Prec@10, Prec@20 and Recall have been used as evaluation measures.

Results - 1
Avg. MAP Prec@5 Prec@10 Prec@20 Avg. Rec.
BASELINE 0.554 0.617 0.545 0.465 0.920
Auto: AV 3.24% 3.11% 5.04% 3.81% 2.52%
Auto: OL 2.31% 1.91% 2.88% 2.98% 0.77%
Auto: AV+OL 3.13% 2.95% 4.63% 3.86% 2.53%
Auto+Man: OL 1.65% 3.40% 3.95% 4.48% 1.37%
Auto+Man: AV+OL 4.38% 5.96% 7.18% 6.07% 2.97%
Auto+Man*2: OL 1.00% 3.30% 4.02% 3.27% 1.36%
Auto+Man*2: AV+OL 3.29% 4.86% 6.73% 6.03% 2.97%

Results - 2
Ontology  Query Cov.       Avg. MAP  Prec@5  Prec@10  Prec@20  Avg. Rec.
AV        39.3 (9 langs)   0.137     0.189   0.191    0.179    0.552
OL        15.7 (10 langs)  0.260     0.359   0.319    0.322    0.635
AV + OL   33.3 (12 langs)  0.173     0.247   0.226    0.221    0.586

Ontology Matching
 Given two thesauri/ontologies/vocabularies find alignments between
entities
 Formally a “match” may be represented with the following 5-tuple:
‹ id, e1, e2, R, c ›
 Extensive literature about matching approaches (since the early ’80s)
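
The 5-tuple can be written down as a small record type. In the usual reading, `id` identifies the correspondence, `e1` and `e2` are the two entities, `R` is the relation holding between them and `c` is a confidence value; the field names and the example identifiers below are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    id: str            # identifier of the correspondence
    e1: str            # entity from the first ontology
    e2: str            # entity from the second ontology
    relation: str      # R, e.g. "=" for equivalence or "<" for subsumption
    confidence: float  # c, usually a value in [0, 1]

# Hypothetical example: the entity identifiers are made up for illustration.
m = Match("m1", "source:food_chains", "target:food_chain", "=", 0.93)
print(m)
```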

Motivations
 Need: a system, for experts, able to suggest possible matches between
concepts
 Exploit multilinguality… why?
 it reduces ambiguity: the probability, for two different concepts, of having
the same label across several languages is very low.
 term translations have been adapted to the domain: experts in charge of
translations put a lot of their cultural heritage in choosing the right terms for
each concept.

The Proposed Approach - 1
 Inspired by information retrieval techniques
 Built on top of the Lucene search engine
 For each element of the thesaurus a structured multilingual representation
is built:
 An index for each thesaurus is built
[prefLabel] "Food chains"@en
[prefLabel] "Catene alimentari"@it
[altLabel] "Food distributions"@en
[altLabel] "Reti alimentari"@it
label-en: “food chain”
label-en: “food distribution”
label-it: “catena alimentare”
label-it: “rete alimentare”
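
A sketch of how the per-language fields could be derived from the SKOS labels of a concept. The function name and the triple layout are assumptions; the normalization here is only lowercasing, so unlike the example above it does not reduce “chains” to “chain” (a language-specific stemmer or lemmatizer would be plugged in through `normalize`).

```python
from collections import defaultdict

def build_record(labels, normalize=str.lower):
    """labels: (kind, text, lang) triples, e.g. ("prefLabel", "Food chains", "en").

    Returns {"label-<lang>": [normalized label, ...]}, one field per language.
    """
    record = defaultdict(list)
    for _kind, text, lang in labels:
        record[f"label-{lang}"].append(normalize(text))
    return dict(record)

labels = [
    ("prefLabel", "Food chains", "en"),
    ("prefLabel", "Catene alimentari", "it"),
    ("altLabel", "Food distributions", "en"),
    ("altLabel", "Reti alimentari", "it"),
]
record = build_record(labels)
print(record)
```

Preferred and alternative labels end up in the same language field, mirroring the flat `label-en` / `label-it` layout shown above.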

The Proposed Approach - 2
 How are matches suggested?
 source and target thesauri are chosen
 for each concept, a query is performed from the source to the target thesaurus
 the standard Lucene scoring formula is used for computing the ranking
 for each query, a ranking of 5 suggestions is provided to the user
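
The suggestion loop can be sketched without a search engine. Here a simple token-overlap (Jaccard) score stands in for the Lucene scoring formula, which is a deliberate simplification; the record layout, function names and the toy target index are assumptions.

```python
def jaccard(a, b):
    """Token overlap between two label strings (stand-in for Lucene scoring)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def suggest(source_record, target_index, k=5):
    """Rank target concepts by their best label match in any shared language field."""
    scored = []
    for concept_id, target_record in target_index.items():
        best = 0.0
        for field, labels in source_record.items():
            for s in labels:
                for t in target_record.get(field, []):
                    best = max(best, jaccard(s, t))
        if best > 0:
            scored.append((best, concept_id))
    # Highest score first; ties are broken by concept id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(cid, score) for score, cid in scored[:k]]

source = {"label-en": ["food chain"], "label-it": ["catena alimentare"]}
target_index = {
    "t1": {"label-en": ["food chain"], "label-it": ["catena alimentare"]},
    "t2": {"label-en": ["food industry"]},
    "t3": {"label-en": ["soil erosion"]},
}
suggestions = suggest(source, target_index)
print(suggestions)
```

In the matching task only the first suggestion would be kept; in the suggestion task the whole top-5 list is shown to the expert.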

Evaluation Set-Up
 2 contexts:
 six multilingual thesauri (3 medical domain, 3 agricultural domain)
 adapted Multifarm benchmark
 2 tasks:
 matching system (only the first suggestion is considered)
 suggestion system

Results - 1
Mapping Set # of Mappings Prec@1 Prec@3 Prec@5 Recall
Eurovoc → Agrovoc   1297    0.816   0.931   0.967   0.874
Agrovoc → Eurovoc   1297    0.906   0.969   0.988   0.695
Avg.                        0.861   0.950   0.978   0.785
Gemet → Agrovoc     1181    0.909   0.964   0.983   0.546
Agrovoc → Gemet     1181    0.943   0.981   0.994   0.740
Avg.                        0.926   0.973   0.989   0.643
MDR → MeSH          6061    0.776   0.914   0.956   0.807
MeSH → MDR          6061    0.716   0.888   0.939   0.789
Avg.                        0.746   0.901   0.948   0.798
MDR → SNOMED        19971   0.621   0.826   0.908   0.559
SNOMED → MDR        19971   0.556   0.760   0.855   0.519
Avg.                        0.589   0.793   0.882   0.539
MeSH → SNOMED       26634   0.690   0.871   0.931   0.660
SNOMED → MeSH       26634   0.657   0.835   0.908   0.564
Avg.                        0.674   0.853   0.920   0.612
Results obtained by the proposed system on the domain-specific thesauri

Results - 2
Mapping Set        IRBOM  WeSeE (2012)  RiMOM (2013)  YAM++ (2013)  YAM++ (2012)  AUTOMSv2 (2012)
Agrovoc - Eurovoc  0.821  0.785         0.628         0.615         0.615         0.599
Gemet - Agrovoc    0.759  0.726         0.548         0.579         0.579         0.485
MDR - MeSH         0.771  0.749         0.611         0.613         0.613         0.536
MDR - SNOMED       0.563  0.624         0.495         0.473         0.473         0.405
MeSH - SNOMED      0.642  0.631         0.457         0.458         0.458         0.497
Results obtained by all the systems on the domain-specific thesauri

Results - 3
System Name Precision Recall F-Measure
IRBOM 0.68 0.43 0.53
WeSeE (2012) 0.61 0.32 0.41
RiMOM (2013) 0.52 0.13 0.21
YAM++ (2013) 0.51 0.36 0.40
YAM++ (2012) 0.50 0.36 0.40
AUTOMSv2 (2012) 0.49 0.10 0.36
Results obtained by all systems on the adapted Multifarm Benchmark

So… at the end…
 The use of ontologies in IR is still a controversial topic
 Personal opinion: combining structured and unstructured representations
seems to be the most suitable solution
 Pay attention to the kind of queries performed by users
 Aggregation of results
 Be brave… try to work with triples!!!!

Mauro Dragoni
https://shell.fbk.eu/index.php/Mauro_Dragoni
dragoni@fbk.eu
