2. A Tutorial on Models of Information Seeking, Searching & Retrieval by @leifos & @guidozuc
3. Core Research Questions
How to represent information?
- The information need and search requests
- The objects to be shown in response to an information request
How to match information representations?
The information objects to be retrieved are not necessarily textual!
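These two questions can be illustrated with a minimal sketch: a toy inverted index as the representation, and disjunctive term matching as the matching step. The collection and document ids below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy collection; representation = bag of words per document.
docs = {
    "d1": "information retrieval models",
    "d2": "database query processing",
    "d3": "models of information seeking",
}

# Build an inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def match(query):
    """Return documents containing any query term (disjunctive matching)."""
    return set().union(*(index.get(t, set()) for t in query.split()))

print(sorted(match("information models")))
```

Real systems of course add term weighting and ranking on top of this, but the representation/matching split is already visible here.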
Van Rijsbergen, 1979
4. Two views on ‘search’
DB (symbolic):
- Business applications
- Deductive reasoning
- Precise and efficient query processing
- Users with technical skills (SQL) and precise information needs
- Selection: books where category=‘CS’
IR (connectionist):
- Digital libraries, patent collections, etc.
- Inductive reasoning
- Best-effort processing
- Untrained users with imprecise information needs
- Ranking: books about CS
Note: SemWeb more DB than IR!
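The contrast between the two views can be sketched in a few lines: exact Boolean selection versus best-effort ranking over the same toy book list. The titles and the crude scoring function are invented for illustration.

```python
books = [
    {"title": "Introduction to Algorithms", "category": "CS"},
    {"title": "CS for Everyone", "category": "Education"},
    {"title": "Modern Databases", "category": "CS"},
]

# DB view: deductive selection -- books WHERE category = 'CS'.
selected = [b for b in books if b["category"] == "CS"]

# IR view: inductive ranking -- books *about* CS, scored by a crude
# relevance signal (here: query-term overlap with the title).
def score(book, query="CS"):
    return sum(term in book["title"].split() for term in query.split())

ranked = sorted(books, key=lambda b: score(b), reverse=True)
```

Selection returns an exact, unordered answer set; ranking returns every book in a best-first order, even when no book matches exactly.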
5. Search Flow Chart
6. IR vs. AI
Many related topics in AI:
- Computational Linguistics
- Natural Language Processing
- Question Answering
- Information Extraction
- Machine Translation
- Computer vision / Multimedia
vs.
Information Retrieval?
7. IR vs. AI (Kunstmatige Intelligentie, Dutch for ‘Artificial Intelligence’)
“In some sense, of course, classic IR is superhuman: there was
no pre-existing human skill, as there was with seeing, talking or
even chess playing that corresponded to the search through
millions of words of text on the basis of indices. But if one took
the view, by contrast, that theologians, lawyers and, later, literary
scholars were able, albeit slowly, to search vast libraries of
sources for relevant material, then on that view IR is just the
optimisation of a human skill and not a superhuman activity. If
one takes that view, IR is a proper part of AI, as traditionally
conceived.”
Yorick Wilks, Unhappy bedfellows: the relationship of AI and IR
An “Essay in honour of Karen Spärck Jones”, 2006
10. Relevance
Inherently dependent on user, context and task
Different “relevance criteria”
- Topicality: is the document about the information request?
- Readability: can I understand the text?
- Authoritativeness: can I trust the text?
- Child-suitability: is the text appropriate for children?
- Etc.
11. “Computational Relevance”
“Intellectually it is possible for a human to establish the
relevance of a document to a query. For a computer to do
this we need to construct a model within which
relevance decisions can be quantified. It is interesting to
note that most research in information retrieval can be
shown to have been concerned with different aspects of
such a model.”
Van Rijsbergen, 1976
Retrieval Model
12. ‘Computational Relevance’
How to combine different indicators of relevance?
- E.g., topicality, child-suitability, polarity, …
Apply ‘copulas’ (a technique from econometrics) to model non-linear dependencies (SIGIR 2013, CIKM 2014)
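As a hedged sketch of the idea (not the actual models from the cited papers), a Clayton copula joins two marginal relevance probabilities while capturing non-linear dependence between them; the input scores below are made up.

```python
def clayton_copula(u, v, theta=2.0):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0.

    Joins two marginal relevance probabilities (e.g., topicality and
    child-suitability); larger theta means stronger lower-tail dependence.
    """
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Under independence the joint would be 0.8 * 0.9 = 0.72;
# the copula-based joint differs because it models dependence.
joint = clayton_copula(0.8, 0.9)
```

The point is exactly the slide's: a simple product (or weighted sum) assumes the relevance indicators are independent, which they generally are not.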
13. Relevance
Various aspects of understanding this notion of relevance position information retrieval between computer science and information science.
Examples of questions that traditionally do not even presume involvement of a computer:
- What makes an information object relevant?
- What stages constitute a search process?
- How does relevance evolve during this search process?
- How do users learn from the search process?
- Why do users issue short queries even if we know that long ones are more effective?
- Etc.
14. NLP in IR
Stemming & Stopping
- De facto default setting
N-grams (bi-grams)
- SDM (Sequential Dependence Model)
Entity tagging
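A minimal sketch of this default pipeline, with a deliberately crude suffix-stripping stemmer standing in for a real one (e.g., Porter), an illustrative stopword list, and adjacent-pair bigrams of the kind dependence models such as SDM build on:

```python
STOPWORDS = {"the", "of", "a", "an", "in", "on", "and", "is"}

def stem(term):
    """Very crude suffix stripping (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ies", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    """Default IR setting: lowercase, stop, stem."""
    return [stem(t) for t in text.lower().split() if t not in STOPWORDS]

def bigrams(terms):
    """Adjacent term pairs, as used by sequential dependence models."""
    return list(zip(terms, terms[1:]))

terms = preprocess("the seeking of information models")
pairs = bigrams(terms)
```

In practice one would use a tested stemmer and a curated stopword list; the structure of the pipeline is the point here.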
15. Footnote in Victor Lavrenko’s PhD thesis
“It is my personal observation that almost every
mathematically inclined graduate student in Information
Retrieval attempts to formulate some sort of a non-
independent model of IR within the first two or three years
of his studies. The vast majority of these attempts yield no
improvements and remain unpublished.”
18. The Secret
The user can simply reformulate their information need in
response to insufficiently relevant results retrieved by the
system!
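Systems exploit the same loop automatically via pseudo-relevance feedback: assume the top-ranked results are relevant and expand the query with their frequent terms. A toy sketch, with an invented query and documents:

```python
from collections import Counter

def expand_query(query, top_docs, k=2):
    """Pseudo-relevance feedback sketch: add the k most frequent
    terms from the assumed-relevant top documents to the query."""
    counts = Counter(t for doc in top_docs for t in doc.split())
    new_terms = [t for t, _ in counts.most_common() if t not in query.split()][:k]
    return query + " " + " ".join(new_terms)

expanded = expand_query("jaguar", ["jaguar big cat habitat", "jaguar cat prey"])
```

This mimics the user's reformulation step, at the risk of query drift when the top results are not actually relevant.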
19. Why Search Remains Difficult to Get Right
Heterogeneous data sources
- WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
Varying result types
- “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
Multiple dimensions of relevance
- Topicality, recency, reading level, …
Actual information needs often require a mix within
and across dimensions. E.g., “recent news and
patents from our top competitors”
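One simple way to mix dimensions, sketched here with invented numbers: combine a topical score with an exponential recency decay. The weight and half-life are illustrative knobs, not values from the tutorial.

```python
import math

def combined_score(topical_score, age_days, half_life=30.0, weight=0.7):
    """Mix topicality with recency via a weighted sum; the recency
    component halves every `half_life` days."""
    recency = math.exp(-math.log(2) * age_days / half_life)
    return weight * topical_score + (1 - weight) * recency

# A fresh, fairly topical result vs an old, highly topical one.
fresh = combined_score(0.6, age_days=5)
old = combined_score(0.9, age_days=365)
```

Even this crude mix lets a slightly less topical but recent result outrank an old one, matching needs like “recent news and patents”.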
20. System’s internal information representation
- Linguistic annotations
- Named entities, sentiment, dependencies, …
- Knowledge resources
- Wikipedia, Freebase, ICD-9, IPTC, …
- Links to related documents
- Citations, URLs
Anchors that describe the URI
- Anchor text
Queries that lead to clicks on the URI
- Session, user, dwell-time, …
Tweets that mention the URI
- Time, location, user, …
Other social media that describe the URI
- User, rating
- Tag, organisation of ‘folksonomy’
+ UNCERTAINTY ALL OVER!
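These signals suggest a fielded document representation scored with per-field weights, in the spirit of BM25F; the fields, texts, and weights below are all invented for illustration.

```python
# Hypothetical fielded representation of one URI, mirroring the
# signal types listed above; field weights are made-up illustrations.
doc = {
    "body": "models of information seeking and retrieval",
    "anchor": "IR tutorial slides",
    "clicked_queries": "information retrieval models tutorial",
    "tweets": "great IR tutorial",
}
FIELD_WEIGHTS = {"body": 1.0, "anchor": 2.0, "clicked_queries": 3.0, "tweets": 0.5}

def fielded_score(query):
    """Weighted field-level term matching: each field's term matches
    are counted, then scaled by that field's weight."""
    return sum(
        w * sum(t in doc[f].lower().split() for t in query.lower().split())
        for f, w in FIELD_WEIGHTS.items()
    )

s = fielded_score("IR tutorial")
```

Each signal type is uncertain on its own (anchors and tweets can be noisy, clicks ambiguous), which is why the slide ends with “uncertainty all over”.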
Editor's Notes
The fundamental research questions are all about REPRESENTATION and MATCHING these representations.
The long term research agenda is to unify two fundamentally different views on these problems: those from the database domain, and those from the information retrieval domain
Fundamental, as the deductive approach of the DB world is not that easily brought together with the inductive approach underlying IR.
Some of our research is really about the mathematical modelling, like our recent ACM SIGIR paper on deploying copulas (a mathematical approach first applied in economics to represent macro-economic processes) to model the interactions between different types of relevance; here, topic relevance and subjectivity.