2. A Tutorial on Models of Information Seeking, Searching & Retrieval by @leifos & @guidozuc
3. Core Research Questions
How to represent information?
- The information need and search requests
- The objects to be shown in response to an information request
How to match information representations?
The information objects to be retrieved are not necessarily textual!
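These two questions can be illustrated with a minimal sketch: a toy inverted index as the representation, and disjunctive term matching as the matching step. The collection and document ids below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy collection; representation = bag of words per document.
docs = {
    "d1": "information retrieval models",
    "d2": "database query processing",
    "d3": "models of information seeking",
}

# Build an inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def match(query):
    """Return documents containing any query term (disjunctive matching)."""
    return set().union(*(index.get(t, set()) for t in query.split()))

print(sorted(match("information models")))
```

Real systems of course add term weighting and ranking on top of this, but the representation/matching split is already visible here.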
Van Rijsbergen, 1979
4. Two views on ‘search’
DB (symbolic):
- Business applications
- Deductive reasoning
- Precise and efficient query processing
- Users with technical skills (SQL) and precise information needs
- Selection: books where category=‘CS’
IR (connectionist):
- Digital libraries, patent collections, etc.
- Inductive reasoning
- Best-effort processing
- Untrained users with imprecise information needs
- Ranking: books about CS
Note: SemWeb more DB than IR!
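The contrast between the two views can be sketched in a few lines: exact Boolean selection versus best-effort ranking over the same toy book list. The titles and the crude scoring function are invented for illustration.

```python
books = [
    {"title": "Introduction to Algorithms", "category": "CS"},
    {"title": "CS for Everyone", "category": "Education"},
    {"title": "Modern Databases", "category": "CS"},
]

# DB view: deductive selection -- books WHERE category = 'CS'.
selected = [b for b in books if b["category"] == "CS"]

# IR view: inductive ranking -- books *about* CS, scored by a crude
# relevance signal (here: query-term overlap with the title).
def score(book, query="CS"):
    return sum(term in book["title"].split() for term in query.split())

ranked = sorted(books, key=lambda b: score(b), reverse=True)
```

Selection returns an exact, unordered answer set; ranking returns every book in a best-first order, even when no book matches exactly.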
5. Search Flow Chart
6. IR vs. AI
Many related topics in AI:
- Computational Linguistics
- Natural Language Processing
- Question Answering
- Information Extraction
- Machine Translation
- Computer vision / Multimedia
vs.
Information Retrieval?
7. IR vs. AI (Kunstmatige Intelligentie, Dutch for ‘Artificial Intelligence’)
“In some sense, of course, classic IR is superhuman: there was
no pre-existing human skill, as there was with seeing, talking or
even chess playing that corresponded to the search through
millions of words of text on the basis of indices. But if one took
the view, by contrast, that theologians, lawyers and, later, literary
scholars were able, albeit slowly, to search vast libraries of
sources for relevant material, then on that view IR is just the
optimisation of a human skill and not a superhuman activity. If
one takes that view, IR is a proper part of AI, as traditionally
conceived.”
Yorick Wilks, Unhappy bedfellows: the relationship of AI and IR
An “Essay in honour of Karen Spärck Jones”, 2006
10. Relevance
Inherently dependent on user, context and task
Different “relevance criteria”
- Topicality: is the document about the information request?
- Readability: can I understand the text?
- Authoritativeness: can I trust the text?
- Child-suitability: is the text appropriate for children?
- Etc.
11. “Computational Relevance”
“Intellectually it is possible for a human to establish the
relevance of a document to a query. For a computer to do
this we need to construct a model within which
relevance decisions can be quantified. It is interesting to
note that most research in information retrieval can be
shown to have been concerned with different aspects of
such a model.”
Van Rijsbergen, 1976
Retrieval Model
12. ‘Computational Relevance’
How to combine different indicators of relevance?
- E.g., topicality, child-suitability, polarity, …
Apply ‘copulas’ (a technique from econometrics) to model non-linear dependencies (SIGIR 2013, CIKM 2014)
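As a hedged sketch of the idea (not the actual models from the cited papers), a Clayton copula joins two marginal relevance probabilities while capturing non-linear dependence between them; the input scores below are made up.

```python
def clayton_copula(u, v, theta=2.0):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0.

    Joins two marginal relevance probabilities (e.g., topicality and
    child-suitability); larger theta means stronger lower-tail dependence.
    """
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Under independence the joint would be 0.8 * 0.9 = 0.72;
# the copula-based joint differs because it models dependence.
joint = clayton_copula(0.8, 0.9)
```

The point is exactly the slide's: a simple product (or weighted sum) assumes the relevance indicators are independent, which they generally are not.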
13. Relevance
Various aspects of understanding this notion of relevance position information retrieval between computer science and information science.
Examples of questions that traditionally do not even presume involvement of a computer:
- What makes an information object relevant?
- What stages constitute a search process?
- How does relevance evolve during this search process?
- How do users learn from the search process?
- Why do users issue short queries even if we know that long ones are more effective?
- Etc.
14. NLP in IR
Stemming & Stopping
- De facto default setting
N-grams (bi-grams)
- SDM (Sequential Dependence Model)
Entity tagging
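A minimal sketch of this default pipeline, with a deliberately crude suffix-stripping stemmer standing in for a real one (e.g., Porter), an illustrative stopword list, and adjacent-pair bigrams of the kind dependence models such as SDM build on:

```python
STOPWORDS = {"the", "of", "a", "an", "in", "on", "and", "is"}

def stem(term):
    """Very crude suffix stripping (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ies", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    """Default IR setting: lowercase, stop, stem."""
    return [stem(t) for t in text.lower().split() if t not in STOPWORDS]

def bigrams(terms):
    """Adjacent term pairs, as used by sequential dependence models."""
    return list(zip(terms, terms[1:]))

terms = preprocess("the seeking of information models")
pairs = bigrams(terms)
```

In practice one would use a tested stemmer and a curated stopword list; the structure of the pipeline is the point here.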
15. Footnote in Victor Lavrenko’s PhD thesis
“It is my personal observation that almost every
mathematically inclined graduate student in Information
Retrieval attempts to formulate some sort of a non-
independent model of IR within the first two or three years
of his studies. The vast majority of these attempts yield no
improvements and remain unpublished.”
18. The Secret
The user can simply reformulate their information need in
response to insufficiently relevant results retrieved by the
system!
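Systems exploit the same loop automatically via pseudo-relevance feedback: assume the top-ranked results are relevant and expand the query with their frequent terms. A toy sketch, with an invented query and documents:

```python
from collections import Counter

def expand_query(query, top_docs, k=2):
    """Pseudo-relevance feedback sketch: add the k most frequent
    terms from the assumed-relevant top documents to the query."""
    counts = Counter(t for doc in top_docs for t in doc.split())
    new_terms = [t for t, _ in counts.most_common() if t not in query.split()][:k]
    return query + " " + " ".join(new_terms)

expanded = expand_query("jaguar", ["jaguar big cat habitat", "jaguar cat prey"])
```

This mimics the user's reformulation step, at the risk of query drift when the top results are not actually relevant.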
19. Why Search Remains Difficult to Get Right
Heterogeneous data sources
- WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
Varying result types
- “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
Multiple dimensions of relevance
- Topicality, recency, reading level, …
Actual information needs often require a mix within
and across dimensions. E.g., “recent news and
patents from our top competitors”
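One simple way to mix dimensions, sketched here with invented numbers: combine a topical score with an exponential recency decay. The weight and half-life are illustrative knobs, not values from the tutorial.

```python
import math

def combined_score(topical_score, age_days, half_life=30.0, weight=0.7):
    """Mix topicality with recency via a weighted sum; the recency
    component halves every `half_life` days."""
    recency = math.exp(-math.log(2) * age_days / half_life)
    return weight * topical_score + (1 - weight) * recency

# A fresh, fairly topical result vs an old, highly topical one.
fresh = combined_score(0.6, age_days=5)
old = combined_score(0.9, age_days=365)
```

Even this crude mix lets a slightly less topical but recent result outrank an old one, matching needs like “recent news and patents”.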
20. System’s internal information representation
- Linguistic annotations
- Named entities, sentiment, dependencies, …
- Knowledge resources
- Wikipedia, Freebase, ICD-9, IPTC, …
- Links to related documents
- Citations, URLs
Anchors that describe the URI
- Anchor text
Queries that lead to clicks on the URI
- Session, user, dwell-time, …
Tweets that mention the URI
- Time, location, user, …
Other social media that describe the URI
- User, rating
- Tag, organisation of ‘folksonomy’
+ UNCERTAINTY ALL OVER!
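These signals suggest a fielded document representation scored with per-field weights, in the spirit of BM25F; the fields, texts, and weights below are all invented for illustration.

```python
# Hypothetical fielded representation of one URI, mirroring the
# signal types listed above; field weights are made-up illustrations.
doc = {
    "body": "models of information seeking and retrieval",
    "anchor": "IR tutorial slides",
    "clicked_queries": "information retrieval models tutorial",
    "tweets": "great IR tutorial",
}
FIELD_WEIGHTS = {"body": 1.0, "anchor": 2.0, "clicked_queries": 3.0, "tweets": 0.5}

def fielded_score(query):
    """Weighted field-level term matching: each field's term matches
    are counted, then scaled by that field's weight."""
    return sum(
        w * sum(t in doc[f].lower().split() for t in query.lower().split())
        for f, w in FIELD_WEIGHTS.items()
    )

s = fielded_score("IR tutorial")
```

Each signal type is uncertain on its own (anchors and tweets can be noisy, clicks ambiguous), which is why the slide ends with “uncertainty all over”.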
Editor's Notes
The fundamental research questions are all about REPRESENTATION and MATCHING these representations.
The long term research agenda is to unify two fundamentally different views on these problems: those from the database domain, and those from the information retrieval domain
Fundamental, as the deductive approach of the DB world is not that easily brought together with the inductive approach underlying IR.
Some of our research is really about the mathematical modelling, like our recent ACM SIGIR paper on deploying copulas (a mathematical approach first applied in economics to represent macro-economic processes) to model the interactions between different types of relevance; here, topic relevance and subjectivity.