The presentation provides an overview of what an ontology is and how it can be used to represent information and retrieve data, with a particular focus on the linguistic resources available to support this kind of task. It surveys semantic-based retrieval approaches, highlighting the pros and cons of semantic approaches with respect to classic ones. Use cases are presented and discussed.
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
1. Ontologies and their use in Information Retrieval
Mauro Dragoni
Fondazione Bruno Kessler (FBK), Shape and Evolve Living Knowledge Unit (SHELL)
https://shell.fbk.eu/index.php/Mauro_Dragoni - dragoni@fbk.eu
KEYSTONE Training School, Malta
July, 20th 2015
2. Outline
1. On your marks and get set…
2. A general approach: pros and cons of concept-based structured
representations
3. Ontology-based IR platforms
4. Behind the lines
a) Cross-language Information Retrieval
b) Ontology Matching
3. Before we start…
What is an ontology?
What is a machine-readable dictionary?
What about ambiguity?
Terms vs. concepts, is everything clear?
4. What is an ontology?
“the branch of philosophy which deals with the nature and the organization
of reality”
“an ontology is an explicit specification of a conceptualization”
[Gruber1993]
conceptualization: abstract model of the world
explicit specification: model described by using unambiguous language
domain ontology
upper ontology
example: DOLCE [Guarino2002]
5. Ontology Components
Classes: entities describing objects common characteristics (for example:
“Agricultural Method”).
Individuals: entities that are instances of classes (for example “Multi Crops
Farming” is an instance of “Agricultural Method”).
Properties: binary relations between entities (for example “IsAffectedBy”).
Attributes (or DataType Properties): characteristics that qualify individuals
(for example “Has Name”).
6. Hierarchies
Concepts can be organized in subsumption hierarchies
Meaning: every instance of a sub-concept is also an instance of its super-concept
Examples:
“Intensive Farming” is-a “Agricultural Method”
“Agricultural Method” is-a “Method”
Concept hierarchies are generally represented by using tree structures
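The is-a examples above can be sketched as a small parent map with a transitive subsumption check. This is a minimal illustration, assuming a strict tree (one parent per concept); the names `IS_A`, `ancestors` and `is_subsumed_by` are hypothetical and not part of the presentation.

```python
# Toy subsumption hierarchy: child concept -> parent concept (a tree, one parent each).
IS_A = {
    "Intensive Farming": "Agricultural Method",
    "Agricultural Method": "Method",
}

def ancestors(concept, is_a=IS_A):
    """Return every super-concept of `concept`, nearest first."""
    found = []
    while concept in is_a:
        concept = is_a[concept]
        found.append(concept)
    return found

def is_subsumed_by(sub, sup, is_a=IS_A):
    """True if `sub` is `sup` itself or one of its (transitive) sub-concepts."""
    return sub == sup or sup in ancestors(sub, is_a)

print(ancestors("Intensive Farming"))
print(is_subsumed_by("Intensive Farming", "Method"))
```

Because is-a is transitive, “Intensive Farming” is subsumed by “Method” even though the link is never stated directly.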
7. Attributes and Properties
Properties: binary relations between classes
Domain and co-domain: classes to which individuals need to belong to be in
relation
Example: “Agriculture” <isAffectedBy> “Agriculture Pollution”
Attributes: binary relations between an individual and values (not other
entities)
Domain: class to which the attribute is applied
Co-domain: the type of the value (for example “String”)
Properties and Attributes can be organized in hierarchies.
8. Steps for building an ontology
To identify the classes of the domain.
To organize them in a hierarchy.
To define properties and attributes.
To define individuals, if there are any.
9. Why are ontologies useful?
Ontologies provide:
a common dictionary of terms;
a shared and formal interpretation of the domain.
Ontologies make it possible to:
solve ambiguities;
share knowledge (not only between humans, but also between machines);
use automatic reasoning techniques.
10. Use of ontologies in IR
Exploit metadata
Entity linking
“which president …” “Barack Obama is-a President”
Extraction of triples from text
applying NLP parsers for extracting dependencies
11. What is a thesaurus?
A “coarse” version of ontologies
Generally, 3 kinds of relations are represented:
hierarchical (generalization/specialization)
equivalence (synonymy)
associative (other kinds of relationships)
Tool used extensively in query expansion approaches [Bhogal2007,
Grootjen2006,Qiu1993,Mandala2000]
12. Machine-readable dictionaries
A dictionary in an electronic form.
The power of MRD is characterized by word senses. [Kilgariff1997,
Lakoff1987, Ruhl1989]
Identity of meaning: synonyms [Gove1973]
Inclusion of meaning: hyponymy or hyperonymy; troponymy [Cruse1986,
Green2002, Fellbaum1998]
transitive relationship
Part-whole meaning: meronymy (has part), holonymy (part of)
[Green2002, Cruse1986, Evens1986]
Opposite meaning: antonymy
13. and now…
… let’s see how we can exploit this within
an information retrieval system…
14. Motivations and Challenges
Considering how information is usually represented and classified.
Documents and Queries are represented using terms.
Indexing:
terms are extracted from each document;
term frequency within each document is computed (TF);
inverse document frequency of each term over the entire index is computed (IDF).
Searching:
the vector space model is used to compute the similarity between documents and
queries;
queries are generally expanded to increase the recall of the system.
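The indexing and searching steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system used in the presentation: tokenization is a plain whitespace split, the IDF variant `log(N/df) + 1` is one common choice among several, and the function names are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (dict term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so terms found everywhere keep some weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "organic farming reduces soil pollution".split(),
    "intensive farming increases crop yield".split(),
    "library opening hours and fees".split(),
]
vectors, idf = tf_idf_vectors(docs)

# The query is weighted with the same IDF table; unknown terms get weight 0.
query = Counter("soil pollution farming".split())
q_vec = {t: tf * idf.get(t, 0.0) for t, tf in query.items()}

ranking = sorted(range(len(docs)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
print(ranking)
```

Note that the third document scores zero because it shares no term with the query, even if it had contained synonyms of the query terms.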
15. Drawbacks of the Term-Based representation – 1/2
The “semantic connections” between terms in documents and queries are
not considered.
Different vector positions may be allocated to the synonyms of the same
term:
the importance of a given concept is distributed among different vector
components;
information loss.
16. Drawbacks of the Term-Based representation – 2/2
Query expansion has to be used carefully.
It is easier to increase the recall of a system than its precision.
Which is better? [Abdelali2007]
In the worst case, the size of a document vector could be close to the
number of terms used in the repository:
in general, the number of concepts is less than the number of words;
the time needed to compare documents is higher;
17. Intuition Behind
Using concepts to represent the terms contained in documents and
queries. [Dragoni2012b]
1. Documents and Queries may be represented in the same way.
2. The question of how many and which terms to use for query expansion does not arise.
3. The size of a concept vector is generally smaller than the size of a term vector.
IMPORTANT: This is not a query expansion technique !!!
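A minimal sketch of the intuition: terms are replaced by concept identifiers, so synonyms fall into the same vector component and documents and queries share one representation. The mapping `TERM_TO_CONCEPT` and the concept identifiers are invented for illustration; a real system would obtain them from a machine-readable dictionary or an ontology.

```python
from collections import Counter

# Hypothetical term -> concept mapping (illustrative identifiers only).
TERM_TO_CONCEPT = {
    "car": "C_automobile",
    "automobile": "C_automobile",
    "auto": "C_automobile",
    "pollution": "C_pollution",
    "contamination": "C_pollution",
}

def concept_vector(tokens, mapping=TERM_TO_CONCEPT):
    """Count concepts instead of terms; terms without a concept are dropped."""
    return Counter(mapping[t] for t in tokens if t in mapping)

doc = "car pollution and automobile contamination".split()
query = "auto contamination".split()

print(Counter(doc))           # five separate term components
print(concept_vector(doc))    # two concept components
print(concept_vector(query))  # same concept space as the document
```

The query shares no surface term with the document except “contamination”, yet both of its concepts occur in the document vector. Terms missing from the mapping are simply lost in this sketch.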
20. how to compute concept weights?
a first simple example …
21. how is each concept of the vocabulary weighted?
suppose we have the document “xxyyyz”
a first simple example …
22. … that we evaluated
Experiments on the MuchMore Collection (http://muchmore.dfki.de)
The collection contains numerous medical terms.
The term-based representation has an advantage over the semantic
representation.
Experiments on the TREC Ad-Hoc Collection:
Results have been compared with the IR systems presented at the TREC-7 and TREC-8
conferences.
Only the systems that implement a semantic representation of queries have
been considered.
Among dozens of runs, the three systems that perform best at recall 0.0 have
been chosen. [Spink2006]
28. Some considerations
Two drawbacks have been identified:
The absence of some terms in the ontology (in particular terms related to
specific domains like biomedical, mechanical, business, etc.) may affect the
final retrieval result.
a more complete knowledge base is needed.
Term ambiguity. By using a Word Sense Disambiguation approach, concepts
associated with incorrect senses would be discarded or weighted less.
a Word Sense Disambiguation algorithm is required: but it has to be used carefully.
31. Checkpoint 1
the use of machine-readable dictionaries is suitable for implementing a
first semantic engine
but if we use ontologies we have more and more information
properties
attributes
the problem is: how can we exploit all this information?
32. Ontology enhanced IR
Enrichment of documents (and queries) with information coming from
semantic resources
information expansion: adding synonyms, antonyms, … not new but still helpful
annotations: relation or association between a semantic entity and a document
Most of the information expansion systems are based on WordNet and the
Roget’s Thesaurus
Systems using annotations are interfaced with the Linked Open Data
cloud, and mainly with Freebase and Wikipedia
33. Classification of Semantic IR approaches
Criterion: Approaches
Semantic knowledge representation:
• Statistical [Deerwester1990]
• Linguistic conceptualization [Gonzalo1998, Mandala1998, Giunchiglia2009]
• Ontology-based [Guha2003, Popov2004]
Scope:
• Web search [Finin2005, Fernandez2008]
• Limited domain repositories [Popov2004]
• Desktop search [Chirita2005]
Query:
• Keyword query [Guha2003]
• Natural language query [Lopez2009]
• Controlled natural language query [Bernstein2006, Cohen2003]
• Structured query based on ontology query language [notes]
Content retrieved:
• Data retrieval
• Information retrieval
Content ranking:
• No ranking
• Keyword-based ranking [Guha2003]
• Semantic-based ranking [Stojanovic2003]
34. Limitation of Semantic IR approaches – 1/2
Criterion: Limitation [IR / Semantic marks]
Semantic knowledge representation: No exploitation of the full potential of an ontological language, beyond those that could be reduced to conventional classification schemes. [x (Partially)]
Scope: No scalability to large and heterogeneous repositories of documents. [x]
Goal: Boolean retrieval models where the Information Retrieval problem is reduced to a data retrieval task. [x]
Query: Limited usability. [x]
35. Limitation of Semantic IR approaches – 2/2
Criterion: Limitation [IR / Semantic marks]
Content retrieved: Focus on textual content: no management of different formats (multimedia). [(Partially) (Partially)]
Content ranking: Lack of semantic ranking criterion. The ranking (if provided) relies on keyword-based approaches. [x x]
Coverage: Knowledge incompleteness. [Croft1986] [(Partially) x]
Evaluation: Lack of standard evaluation frameworks. [Giunchiglia2009] [x]
36. A basic ontology-based IR model
[Figure: architecture of the basic ontology-based IR model. Labeled elements: User, SPARQL Editor, SPARQL Query, Query Processing, Searching, Indexing, Ranking, Semantic Entities, Semantic Knowledge (ontology + KB), Document Corpus, Semantic Index (weighted annotations), Unsorted Documents, Ranked Documents]
37. Basic ontology-based IR model - Limits
Heterogeneity
a single ontology (or even a set of them) cannot cover all possible domains
Scalability
imagine annotating the Web by using all knowledge bases currently available
a final solution does not exist… but nice and practical approaches can be used
Usability
try to think… are all the people you know able to write queries in SPARQL?
38. Extended ontology-based IR model
[Figure: architecture of the extended ontology-based IR model. Labeled elements: User, Nat. Lang. Interface, Natural Lang. Query, Query Processing, Searching, Indexing, Ranking, Semantic Entities, Preprocessed Semantic Knowledge, Semantic Web, Unstructured Web contents, Semantic Index (weighted annotations), Unsorted Documents, Ranked Documents]
39. Evaluation Results
Mean Average Precision: Semantic System 0.16, Lucene 0.1, TREC Automatic 0.2
Prec@10: Semantic System 0.37, Lucene 0.25, TREC Automatic 0.30
40. A focus on the indexing procedure
Challenge: to link semantic knowledge with documents and queries in an
efficient and effective way:
document corpus and semantic knowledge should remain decoupled;
annotations have to be provided in a flexible and scalable way.
Annotations can be provided in two ways:
by applying an information extraction technique based on pure NLP
approaches;
by applying a contextual semantic information approach.
41. Annotator Requirements
Identification of the entities within the documents
conceptually, it is not very different from a traditional IR indexing process
Ontologies must not be touched (decoupling)
Should be open-domain
Scalability-friendly:
indexing of ontologies;
indexing of documents;
an interesting alternative: usage of non-embedded annotations
46. An idea for aggregating rankings
Multi-dimensional aggregation criteria
Document score is computed from different perspectives (criteria)
Assignment of priorities to criteria
Compute criteria weights
Weight of criteria with low priority depends on the score of criteria with high
priority
Aggregate criteria scores [Dragoni2012]
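One possible reading of the bullets above is a prioritized scoring scheme in which the weight of each criterion is the product of the scores of all higher-priority criteria. The slide does not give the exact formula, so the sketch below is an assumption and may differ from the operator in [Dragoni2012]; the function name and the example scores are hypothetical.

```python
def prioritized_score(scores):
    """Aggregate criterion scores in [0, 1], listed from highest to lowest priority.

    Assumed scheme: the first criterion has weight 1; each later weight is the
    previous weight multiplied by the previous criterion's score.
    """
    total, weight = 0.0, 1.0
    for s in scores:
        total += weight * s
        weight *= s
    return total

# Two documents scored on three criteria, most important criterion first.
doc_a = [0.9, 0.2, 0.8]
doc_b = [0.3, 0.9, 0.9]
print(prioritized_score(doc_a), prioritized_score(doc_b))
```

Under this scheme a document that does poorly on the top criterion cannot compensate with high scores on the lower ones, which is why `doc_a` ends up ahead of `doc_b`.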
47. Querying and Ranking
Queries transformed by mapping terms with ontology entities
Contextual disambiguation is very important
simple example: “Rock musicians Britain”
Ranking: two options
to evaluate only the “matches” between detected entities
to aggregate (in your own way) the rankings produced by using only the entities, only the
query terms, and/or both of them
48. Use of multiple ontologies
What we need: an Ontology Gateway
Tasks of an ontology gateway:
collect available semantic content;
store the semantic content efficiently in order to ease its access;
implement an approach for the “selection” of the content
Most important ontology gateways online:
Swoogle [Ding2004,Brin1998]
Watson [Aquin2007,Aquin2007b]
WebCORE [Fernandez2006,Fernandez2007]
49. Use of multiple ontologies - opportunities
Recall improvement:
Ontology 1, focused on entities: stress on the identification of semantic
entities within the document
Ontology 2, focused on properties: stress on the identification of relationships
between entities in the document
precision should also increase, but some drops are possible.
Supporting multiple perspectives:
analysis of each entity from different points of view
50. Use of multiple ontologies - challenges
To figure out how to use them:
it is necessary to formally represent the relationships between the ontologies
and the techniques used for extracting information from them;
example: you may have ontologies describing the same domain by using
different structures!!!
To find suitable ontologies and mappings:
again: more than one ontology may describe the same domain;
it is not good practice to select only one: build mappings!!!
51. A use case
Information system containing technical product data
users look for something that satisfies their needs
engineers want to exploit information for creating new product variants
Ontologies focused on particular aspects of products
product conceptualizations are separated
57. Checkpoint 2
Annotation of documents is more important than the querying of the
repositories… why?
differences in the amount of content
once we have decided how to annotate documents, queries should be
annotated by using the same procedure in order to homogenize the process
Challenges in building knowledge bases
Ranking… play with them and “stress your creativity”
58. Ontologies and IR – 2 use cases
Demonstrate the usefulness of semantic approaches used in combination
with traditional IR techniques.
Show how IR and Semantics may help each other
Two scenarios:
Cross-language information retrieval [Dragoni2014]
Ontology matching [Dragoni2015]
Sentiment analysis
59. Cross-Language Information Retrieval
Background - Challenges
Out-of-Vocabulary issue
improve the corpora used for training the machine translation model.
usage of domain information for increasing the coverage of the
dictionaries.
Usage of semantic artifacts for structuring the representation of
(multilingual) documents.
GOAL: to integrate domain-specific semantic knowledge
within a CLIR system and evaluate its effectiveness
60. Our Scenario
Use case: the agricultural domain
Knowledge resources: Agrovoc and Organic.Lingua ontologies
3 components used in the proposed approach:
Annotator
Indexer
Retriever
62. Annotation Process – Step 2
[Figure: language labels en, es, it, de, fr, …]
Document content is used as query.
Among the candidate results, only “exact matches” are
considered.
64. Approach - Index
Given a document:
Text and annotations are extracted.
The context of each concept is retrieved from the ontologies.
Each contextual concept is indexed with a weight proportional
to its semantic distance from the semantic annotation.
Structure of each index record:
65. Approach - Retriever
Three retrieval configurations available:
Only translations: query terms are translated by using machine
translation services.
Semantic expansion by exploiting the domain ontology: query terms
are matched with ontology concepts; if an exact match exists, query
is expanded by using the URI of the concept and the URIs of the
contextual ones.
Ontology matching only: terms not having an exact match with
ontology concepts are discarded.
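The three configurations can be contrasted with a small sketch. Everything here is hypothetical: the label and context tables, the `ex:` URIs, and the `translate` stand-in (an identity function in place of a machine translation service). Whether the ontology-matching-only configuration also adds the contextual URIs is not stated on the slide; this sketch assumes it does.

```python
# Hypothetical toy resources: label -> concept URI, and concept URI -> contextual URIs.
LABEL_TO_URI = {"food chain": "ex:FoodChain", "soil": "ex:Soil"}
CONTEXT = {"ex:FoodChain": ["ex:Ecosystem"], "ex:Soil": ["ex:Land"]}

def translate(term):
    """Stand-in for a machine translation service (identity in this sketch)."""
    return term

def build_query(terms, mode):
    """Return the query tokens for 'translation', 'expansion' or 'matching_only'."""
    if mode not in {"translation", "expansion", "matching_only"}:
        raise ValueError(f"unknown mode: {mode}")
    tokens = []
    for term in terms:
        uri = LABEL_TO_URI.get(term)  # exact match against ontology labels
        if mode != "matching_only":
            tokens.append(translate(term))
        if mode != "translation" and uri is not None:
            tokens.append(uri)
            tokens.extend(CONTEXT.get(uri, []))  # assumed: context added in both modes
    return tokens

terms = ["food chain", "pesticide"]
for mode in ("translation", "expansion", "matching_only"):
    print(mode, build_query(terms, mode))
```

The term “pesticide” has no exact match in the toy ontology, so it survives only in the first two configurations.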
66. Evaluation - Setup
Collection of 13,000 multilingual documents.
48 queries originally provided in English and manually translated
in 12 languages under the supervision of both domain and
language experts.
Gold standard manually built by the domain experts.
MAP, Prec@5, Prec@10, Prec@20, Recall have been used.
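For reference, the measures listed above can be computed as follows. This is a minimal sketch using the standard textbook definitions; the document identifiers and the relevance judgments in the example are invented.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}  # d6 is relevant but was never retrieved
print(precision_at_k(ranked, relevant, 5))
print(recall(ranked, relevant))
print(average_precision(ranked, relevant))
```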
69. Ontology Matching
Given two thesauri/ontologies/vocabularies find alignments between
entities
Formally a “match” may be represented with the following 5-tuple:
‹ id, e1, e2, R, c ›
Extensive literature about matching approaches (early ‘80s)
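The 5-tuple can be written down as a small record type. The slide does not spell out the fields, so the reading below follows the usual convention in the ontology matching literature (identifier, two entities, relation, confidence) and is an assumption; the confidence range [0, 1] and the example URIs are also assumed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    id: str    # identifier of the correspondence
    e1: str    # entity from the first ontology
    e2: str    # entity from the second ontology
    R: str     # relation holding between e1 and e2, e.g. "=" for equivalence
    c: float   # confidence of the correspondence (assumed to lie in [0, 1])

    def __post_init__(self):
        if not 0.0 <= self.c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")

# Hypothetical example; both URIs are invented for illustration.
m = Match(id="m1", e1="src:FoodChains", e2="tgt:FoodWebs", R="=", c=0.87)
print(m)
```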
70. Motivations
Need: a system, for experts, able to suggest possible matches between
concepts
Exploit multilinguality… why?
it reduces ambiguity: the probability that two different concepts have
the same label across several languages is very low.
term translations have been adapted to the domain: experts in charge of
translations put a lot of their cultural heritage in choosing the right terms for
each concept.
71. The Proposed Approach - 1
Inspired by information retrieval techniques
Built on top of the Lucene search engine
For each element of the thesaurus a structured multilingual representation
is built:
An index for each thesaurus is built
[prefLabel] "Food chains"@en
[prefLabel] "Catene alimentari"@it
[altLabel] "Food distributions"@en
[altLabel] "Reti alimentari"@it
label-en: “food chain”
label-en: “food distribution”
label-it: “catena alimentare”
label-it: “rete alimentare”
72. The Proposed Approach - 2
How are matches suggested?
source and target thesauri are chosen
for each concept, a query is performed from the source to the target thesaurus
the standard Lucene scoring formula is used for computing the ranking
for each query, a ranking of 5 suggestions is provided to the user
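The suggestion loop can be illustrated without a search engine. The sketch below is not the Lucene scoring formula: it swaps in a simple per-language Jaccard overlap of label tokens, and it skips the stemming that the `label-en` / `label-it` fields on the previous slide appear to apply. The concept identifiers and the data are hypothetical.

```python
from collections import defaultdict

def fields(labels):
    """Group lowercased label tokens by language: {'label-en': {...}, ...}."""
    out = defaultdict(set)
    for text, lang in labels:
        out[f"label-{lang}"].update(text.lower().split())
    return out

def score(source, target):
    """Sum, over shared language fields, of the Jaccard overlap of their tokens."""
    total = 0.0
    for name, tokens in source.items():
        other = target.get(name, set())
        if tokens and other:
            total += len(tokens & other) / len(tokens | other)
    return total

def suggest(source_labels, target_thesaurus, k=5):
    """Return up to k target concept ids, best first, for one source concept."""
    src = fields(source_labels)
    scored = [(score(src, fields(lbls)), cid) for cid, lbls in target_thesaurus.items()]
    return [cid for s, cid in sorted(scored, reverse=True) if s > 0][:k]

source = [("Food chains", "en"), ("Catene alimentari", "it")]
target = {
    "t:1": [("Food chain", "en"), ("Catena alimentare", "it")],
    "t:2": [("Food chains", "en"), ("Reti alimentari", "it")],
    "t:3": [("Soil", "en"), ("Suolo", "it")],
}
print(suggest(source, target))
```

Labels in several languages contribute independently to the score, so a candidate that agrees in both English and Italian outranks one that agrees in English only.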
73. Evaluation Set-Up
2 contexts:
six multilingual thesauri (3 medical domain, 3 agricultural domain)
adapted Multifarm benchmark
2 tasks:
matching system (only the first suggestion is considered)
suggestion system
76. Results - 3
System Name Precision Recall F-Measure
IRBOM 0.68 0.43 0.53
WeSeE (2012) 0.61 0.32 0.41
RiMOM (2013) 0.52 0.13 0.21
YAM++ (2013) 0.51 0.36 0.40
YAM++ (2012) 0.50 0.36 0.40
AUTOMSv2 (2012) 0.49 0.10 0.36
Results obtained by all systems on the adapted Multifarm Benchmark
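As a reading aid for the table, the sketch below computes the balanced F-measure from precision and recall, assuming the F-Measure column is the usual harmonic mean (F1). Reported F-measures are often averaged over test cases, so recomputing them from averaged precision and recall need not reproduce every row exactly.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# IRBOM row of the table: precision 0.68, recall 0.43.
print(round(f_measure(0.68, 0.43), 2))
```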
77. So… at the end…
The use of ontologies in IR is still a controversial topic
Personal opinion: combining structured and unstructured representations
seems to be the most suitable solution
Pay attention to the kind of queries performed by users
Aggregation of results
Be brave… try to work with triples!!!!
[Ruhl1989] C. Ruhl. On Monosemy: A study in linguistic semantics. State University of New York Press, Albany, NY, 1989.
[Gove1973] P.B. Gove. Webster’s New Dictionary of Synonyms. G. & C. Merriam Company, Springfield, MA, 1973.
[Cruse1986] A.D. Cruse. Lexical Semantics. Cambridge University Press, 1986.
[Green2002] R. Green, C.A. Bean, and S.H. Myaeng. The Semantics of Relationships: An Interdisciplinary Perspective. Cambridge University Press, 2002.
[Fellbaum1998] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[Evens1986] M.W. Evens. Relational Models of the Lexicon. Cambridge University Press, 1986.