Introduction to Information Retrieval 
June, 2013 Roi Blanco
Acknowledgements 
• Many of these slides were taken from other presentations 
– P. Raghavan, C. Manning, H. Schutze IR lectures 
– Mounia Lalmas’s personal stash 
– Other random slide decks 
• Textbooks 
– Ricardo Baeza-Yates, Berthier Ribeiro Neto 
– Raghavan, Manning, Schutze 
– … among other good books 
• Many online tutorials, many online tools available (full toolkits) 
2
Big Plan 
• What is Information Retrieval? 
– Search engine history 
– Examples of IR systems (you might not have known about!) 
• Is IR hard? 
– Users and human cognition 
– What is it like to be a search engine? 
• Web Search 
– Architecture 
– Differences between Web search and IR 
– Crawling 
3
• Representation 
– Document view 
– Document processing 
– Indexing 
• Modeling 
– Vector space 
– Probabilistic 
– Language Models 
– Extensions 
• Others 
– Distributed 
– Efficiency 
– Caching 
– Temporal issues 
– Relevance feedback 
– … 
4
5
Information Retrieval 
Information Retrieval (IR) is finding material 
(usually documents) of an unstructured nature 
(usually text) that satisfies an information need 
from within large collections (usually stored on 
computers). 
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze 
Introduction to Information Retrieval 
6
Information Retrieval (II) 
• What do we understand by documents? How do 
we decide what is a document and what is not? 
• What is an information need? What types of 
information needs can we satisfy automatically? 
• What is a large collection? Which environments 
are suitable for IR? 
7
Basic assumptions of Information Retrieval 
• Collection: A set of documents 
– Assume it is a static collection 
• Goal: Retrieve documents with information that is 
relevant to the user’s information need and helps 
the user complete a task 
8
Key issues 
• How to describe information resources or information-bearing 
objects in ways that they can be effectively used 
by those who need to use them ? 
– Organizing/Indexing/Storing 
• How to find the appropriate information resources or 
information-bearing objects for someone’s (or your own) 
needs 
– Retrieving / Accessing / Filtering 
9
Unstructured data 
Unstructured data? 
SELECT * from HOTELS 
where city = Bangalore and 
$$$ < 2 
Cheap hotels in 
Bangalore 
CITY $$$ name 
Bangalore 1.5 Cheapo one 
Barcelona 1 EvenCheapoer 
10
Unstructured (text) vs. structured (database) data in the 
mid-nineties 
11
Unstructured (text) vs. structured (database) data today
13
Search Engine 
Index 
Square 
Pants! 
14
15
Timeline 
1990 1991 1993 1994 1998 
... 
16
... 
1995 
1996 
1997 
1998 
1999 
2000 
17
2009 
2005 
... 
2008 
18
2001 
2003 
2002 
2003 
2003 
2003 
2003 
2010 2010 
2003 
19
20
Your ads here! 
21
Usability 
We also fail at using the technology 
Sometimes
36
Applications 
• Text Search 
• Ad search 
• Image/Video search 
• Email Search 
• Question Answering systems 
• Recommender systems 
• Desktop Search 
• Expert Finding 
• .... 
Jobs 
Prizes 
Products 
News 
Source code 
Videogames 
Maps 
Partners 
Mashups 
... 
37
Types of search engines 
• Q&A engines 
• Collaborative 
• Enterprise 
• Web 
• Metasearch 
• Semantic 
• NLP 
• ... 
38
40
IR issues 
• Find out what the user needs 
… and do it quickly 
• Challenges: user intention, accessibility, volatility, 
redundancy, lack of structure, low quality, different data 
sources, volume, scale 
• The main bottleneck is human cognition, not 
computation 
41
IR is mostly about relevance 
• Relevance is the core concept in IR, but nobody has a good 
definition 
• Relevance = useful 
• Relevance = topically related 
• Relevance = new 
• Relevance = interesting 
• Relevance = ??? 
• However we still want relevant information 
42
• Information needs must be expressed as a query 
– But users often don't know what they want 
• Problems 
– Verbalizing information needs 
– Understanding query syntax 
– Understanding search engines 
43
Understanding(?) the user 
I am a hungry tourist in 
Barcelona, and I want to 
find a place to eat; 
however I don’t want to 
spend a lot of money 
I want information 
on places with 
cheap food in 
Barcelona 
Info about bars in 
Barcelona 
Bar celona 
Misconception 
Mistranslation 
Misformulation 
44
Why is this hard? 
• Documents/images/ video/speech/etc are complex. We 
need some representation 
• Semantics 
– What do words mean? 
• Natural language 
– How do we say things? 
• Computers cannot deal with these easily 
45
… and even harder 
• Context 
• Opinion 
Funny? Talented? Honest? 
46
Semantics 
Bank note · River bank · Bank · Blood bank 
47 
What is it like to be a search engine? 
• How can we figure out what you’re trying to do? 
• Signal can be somewhat weak, sometimes! 
[ jaguar ] 
[ iraq ] 
[ latest release Thinkpad drivers touchpad ] 
[ ebay ] 
[ first ] 
[ google ] 
[ brittttteny spirs ] 
48
Search is a multi-step process 
• Session search 
– Verbalize your query 
– Look for a document 
– Find your information there 
– Refine 
• Teleporting 
– Go directly to the site you like 
– Formulating the query is too hard, you trust more 
the final site, etc. 
49
• Someone told me that in the mid-1800’s, people often would carry 
around a special kind of notebook. They would use the notebook to 
write down quotations that they heard, or copy passages from books 
they’d read. The notebook was an important part of their education, 
and it had a particular name. 
– What was the name of the notebook? 
50 
Examples from Dan Russell
Naming the un-nameable 
• What’s this thing called? 
51
More tasks … 
• Going beyond a search engine 
– Using images / multimedia content 
– Using maps 
– Using other sources 
• Think of how to express things differently (synonyms) 
– A friend told me that there is an abandoned city in the waters of San Francisco 
Bay. Is that true? If it IS true, what was the name of the supposed city? 
• Exploring a topic further in depth 
• Refining a question 
– Suppose you want to buy a unicycle for your Mom or Dad. How would you find 
it? 
• Looking for lists of information 
– Can you find a list of all the groups that inhabited California at the time of the 
missions? 
52
IR tasks 
• Known-item finding 
– You want to retrieve some data that you know exists 
– What year was Peter Mika born? 
• Exploratory seeking 
– You want to find some information through an iterative process 
– Not a single answer to your query 
• Exhaustive search 
– You want to find all the information possible about a particular issue 
– Issuing several queries to cover the user information need 
• Re-finding 
– You want to find an item you have found already 
53
Scale 
• >300TB of print data produced per year 
– +Video, speech, domain-specific information (>600PB per year) 
• IR has to be fast + scalable 
• Information is dynamic 
– News, web pages, maps, … 
– Queries are dynamic (you might even change your information needs while 
searching) 
• Cope with data and searcher change 
– This introduces tensions in every component of a search engine 
54
Methodology 
• Experimentation in IR 
• Three fundamental types of IR research: 
– Systems (efficiency) 
– Methods (effectiveness) 
– Applications (user utility) 
• Empirical evaluation plays a critical role across all three types 
of research 
55
Methodology (II) 
• Information retrieval (IR) is a highly applied scientific 
discipline 
• Experimentation is a critical component of the scientific 
method 
• Poor experimental methodologies are not scientifically 
sound and should be avoided 
56
57
58 
[Diagram: task → information need → verbal form → query → search engine (over a corpus) → results → query refinement]
[Diagram: search engine architecture – user interface and query interpretation on the query side; crawling, text processing, document interpretation and indexing ("general voodoo") over the document collection; matching and ranking against the index and metadata]
59
[Diagram: documents → crawler → NLP pipeline → tokens → indexer → index ← query system]
60
[Diagram: serving architecture – DNS directs queries to a broker, which consults a cache and fans out to clusters of servers; the index is split into partitions, and each partition is replicated]
61
<a href= 
• Web pages are linked 
– AKA Web Graph 
• We can walk through the 
graph to crawl 
• We can rank using the 
graph 
62
Web pages are connected 
63
Web Search 
• Basic search technology shared with IR systems 
– Representation 
– Indexing 
– Ranking 
• Scale (in terms of data and users) changes the game 
– Efficiency/architectural design decisions 
• Link structure 
– For data acquisition (crawling) 
– For ranking (PageRank, HITS) 
– For spam detection 
– For extending document representations (anchor text) 
• Adversarial IR 
• Monetization 
64
User Needs 
• Need 
– Informational – want to learn about something (~40% / 65%) 
– Navigational – want to go to that page (~25% / 15%) 
– Transactional – want to do something (web-mediated) (~35% / 20%) 
• Access a service 
• Downloads 
• Shop 
– Gray areas 
• Find a good hub 
• Exploratory search “see what’s there” 
Low hemoglobin 
United Airlines 
Seattle weather 
Mars surface images 
Canon S410 
Car rental Brasil 
65
How far do people look for results? 
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf) 
66
Users’ empirical evaluation of results 
• Quality of pages varies widely 
– Relevance is not enough 
– Other desirable qualities (non IR!!) 
• Content: Trustworthy, diverse, non-duplicated, well maintained 
• Web readability: display correctly & fast 
• No annoyances: pop-ups, etc. 
• Precision vs. recall 
– On the web, recall seldom matters 
• What matters 
– Precision at 1? Precision above the fold? 
– Comprehensiveness – must be able to deal with obscure queries 
• Recall matters when the number of matches is very small 
• User perceptions may be unscientific, but are significant 
over a large aggregate 
67
Users’ empirical evaluation of engines 
• Relevance and validity of results 
• UI – Simple, no clutter, error tolerant 
• Trust – Results are objective 
• Coverage of topics for ambiguous queries 
• Pre/Post process tools provided 
– Mitigate user errors (auto spell check, search assist,…) 
– Explicit: Search within results, more like this, refine ... 
– Anticipative: related searches 
• Deal with idiosyncrasies 
– Web specific vocabulary 
• Impact on stemming, spell-check, etc. 
– Web addresses typed in the search box 
• “The first, the last, the best and the worst …” 
68
The Web document collection 
• No design/co-ordination 
• Distributed content creation, linking, 
democratization of publishing 
• Content includes truth, lies, obsolete 
information, contradictions … 
• Unstructured (text, html, …), semi-structured 
(XML, annotated photos), structured 
(Databases)… 
• Scale much larger than previous text collections 
… but corporate records are catching up 
• Growth – slowed down from initial “volume 
doubling every few months” but still expanding 
• Content can be dynamically generated The Web 
69
Basic crawler operation 
• Begin with known “seed” URLs 
• Fetch and parse them 
–Extract URLs they point to 
–Place the extracted URLs on a queue 
• Fetch each URL on the queue and 
repeat 
70
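
A minimal single-machine sketch of this loop in Python, assuming the requests and beautifulsoup4 libraries are available; the names are illustrative, and a real crawler adds politeness, robots.txt handling, deduplication and distribution:

# Minimal sketch of the crawl loop above (single-threaded, no politeness or robots.txt).
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URL frontier
    seen = set(seed_urls)                # URLs already queued
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                     # skip unreachable pages
        pages[url] = html
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, link["href"])   # resolve relative links
            if target not in seen:
                seen.add(target)
                frontier.append(target)  # place extracted URLs on the queue
    return pages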
Crawling picture 
[Diagram: seed pages → URLs crawled and parsed → URL frontier, expanding into the unseen Web]
71
Simple picture – complications 
• Web crawling isn’t feasible with one machine 
– All of the above steps distributed 
• Malicious pages 
– Spam pages 
– Spider traps – including dynamically generated 
• Even non-malicious pages pose challenges 
– Latency/bandwidth to remote servers vary 
– Webmasters’ stipulations 
• How “deep” should you crawl a site’s URL hierarchy? 
– Site mirrors and duplicate pages 
• Politeness – don’t hit a server too often 
72
What any crawler must do 
• Be Polite: Respect implicit and explicit 
politeness considerations 
– Only crawl allowed pages 
– Respect robots.txt 
• Be Robust: Be immune to spider traps 
and other malicious behavior from 
web servers 
–Be efficient 
73
What any crawler should do 
• Be capable of distributed operation: designed to 
run on multiple distributed machines 
• Be scalable: designed to increase the crawl rate 
by adding more machines 
• Performance/efficiency: permit full use of 
available processing and network resources 
74
What any crawler should do 
• Fetch pages of “higher quality” first 
• Continuous operation: Continue fetching 
fresh copies of a previously fetched page 
• Extensible: Adapt to new data formats, 
protocols 
75
Updated crawling picture 
[Diagram: as before, but with multiple crawling threads working off the URL frontier]
76
77
Document views 
[Diagram: the same document ("Sailing in Greece" by B. Smith) seen through four views – 
content view: index terms such as sailing, greece, mediterranean, fish, sunset; 
data view: factual attributes such as Author = "B. Smith", Crdate = "14.12.96", Ladate = "11.07.02"; 
structure view: head, title, author, chapters and sections; 
layout view: how the document is displayed]
78
What is a document: document views 
• Content view is concerned with representing the content 
of the document; that is, what is the document about. 
• Data view is concerned with factual data associated with 
the document (e.g. author names, publishing date) 
• Layout view is concerned with how documents are 
displayed to the users; this view is related to user interface 
and visualization issues. 
• Structure view is concerned with the logical structure of 
the document, (e.g. a book being composed of chapters, 
themselves composed of sections, etc.) 
79
Indexing language 
• An indexing language: 
– Is the language used to describe the content of 
documents (and queries) 
– And it usually consists of index terms that are derived 
from the text (automatic indexing), or arrived at 
independently (manual indexing), using a controlled 
or uncontrolled vocabulary 
– Basic operation: is this query term present in this 
document? 
80
Generating document representations 
• The building of the indexing language, that is generating 
the document representation, is done in several steps: 
– Character encoding 
– Language recognition 
– Page segmentation (boilerplate detection) 
– Tokenization (identification of words) 
– Term normalization 
– Stopword removal 
– Stemming 
– Others (document expansion, etc.) 
81
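
A toy sketch of the steps above, assuming NLTK's Porter stemmer is available; the regex tokenizer and the tiny stop list are illustrative choices, not the only possible pipeline:

# Toy document-processing pipeline: tokenization + case folding, stop-word removal, stemming.
import re
from nltk.stem import PorterStemmer   # any stemmer would do

STOP = {"the", "a", "and", "to", "be", "in", "of", "is"}   # tiny illustrative stop list
stemmer = PorterStemmer()

def index_terms(text):
    tokens = re.findall(r"\w+", text.lower())      # tokenization + case folding
    tokens = [t for t in tokens if t not in STOP]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]       # stemming -> index terms

# index_terms("Sailing in Greece")  ->  stemmed index terms (output depends on the stemmer)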
Generating document representations: overview 
[Diagram: documents → (tokenization) → tokens → (remove noisy words: stop-words) → (reduce to stems) → stems → terms (index terms); plus others, e.g. thesaurus lookup and more complex processing]
82
Parsing a document 
• What format is it in? 
– pdf/word/excel/html? 
• What language is it in? 
• What character set is in use? 
– (ISO-8859, UTF-8, …) 
But these tasks are often done heuristically … 
83
Complications: Format/language 
• Documents being indexed can include docs from many 
different languages 
– A single index may contain terms from many languages. 
• Sometimes a document or its components can contain 
multiple languages/formats 
– French email with a German pdf attachment. 
– French email quote clauses from an English-language 
contract 
• There are commercial and open source libraries that can 
handle a lot of this stuff 
84
Complications: What is a document? 
We return from our query “documents” but there are often 
interesting questions of grain size: 
What is a unit document? 
– A file? 
– An email? (Perhaps one of many in a single mbox file) 
• What about an email with 5 attachments? 
– A group of files (e.g., PPT or LaTeX split over HTML pages) 
85
Tokenization 
• Input: “Friends, Romans and Countrymen” 
• Output: Tokens 
– Friends 
– Romans 
– Countrymen 
• A token is an instance of a sequence of characters 
• Each such token is now a candidate for an index entry, after 
further processing 
• But what are valid tokens to emit? 
86
Tokenization 
• Issues in tokenization: 
– Finland's capital → Finland AND s? Finlands? Finland's? 
– Hewlett-Packard → Hewlett and Packard as two tokens? 
• state-of-the-art: break up hyphenated sequence. 
• co-education 
• lowercase, lower-case, lower case ? 
• It can be effective to get the user to put in possible hyphens 
– San Francisco: one token or two? 
• How do you decide it is one token? 
87
Numbers 
• 3/20/91 Mar. 12, 1991 20/3/91 
• 55 B.C. 
• B-52 
• My PGP key is 324a3df234cb23e 
• (800) 234-2333 
• Often have embedded spaces 
• Older IR systems may not index numbers 
But often very useful: think about things like looking up error 
codes/stacktraces on the web 
• Will often index “meta-data” separately 
Creation date, format, etc. 
88
Tokenization: language issues 
• French 
– L'ensemble  one token or two? 
• L ? L’ ? Le ? 
• Want l’ensemble to match with un ensemble 
– Until at least 2003, it didn’t on Google 
» Internationalization! 
• German noun compounds are not segmented 
– Lebensversicherungsgesellschaftsangestellter 
– ‘life insurance company employee’ 
– German retrieval systems benefit greatly from a compound splitter 
module 
– Can give a 15% performance boost for German 
89
Tokenization: language issues 
• Chinese and Japanese have no spaces between words: 
– 莎拉波娃现在居住在美国东南部的佛罗里达。 
– Not always guaranteed a unique tokenization 
• Further complicated in Japanese, with multiple alphabets 
intermingled 
– Dates/amounts in multiple formats 
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) 
Katakana Hiragana Kanji Romaji 
End-user can express query entirely in hiragana! 
90
Tokenization: language issues 
• Arabic (or Hebrew) is basically written right to left, but with certain items 
like numbers written left to right 
• Words are separated, but letter forms within a word form complex 
ligatures 
← → ← → ← start 
‘Algeria achieved its independence in 1962 after 132 years of French 
occupation.’ 
• With Unicode, the surface presentation is complex, but the stored 
form is straightforward 
91
Stop words 
• With a stop list, you exclude from the dictionary entirely the commonest 
words. Intuition: 
– They have little semantic content: the, a, and, to, be 
– There are a lot of them: ~30% of postings for top 30 words 
• But the trend is away from doing this: 
– Good compression techniques means the space for including stop words in a system 
can be small 
– Good query optimization techniques mean you pay little at query time for including 
stop words. 
– You need them for: 
• Phrase queries: “King of Denmark” 
• Various song titles, etc.: “Let it be”, “To be or not to be” 
• “Relational” queries: “flights to London” 
92
Normalization to terms 
• Want: matches to occur despite superficial differences in the 
character sequences of the tokens 
• We may need to “normalize” words in indexed text as well as query words 
into the same form 
– We want to match U.S.A. and USA 
• Result is terms: a term is a (normalized) word type, which is an entry in 
our IR system dictionary 
• We most commonly implicitly define equivalence classes of terms by, e.g., 
– deleting periods to form a term 
• U.S.A., USA → USA 
– deleting hyphens to form a term 
• anti-discriminatory, antidiscriminatory → antidiscriminatory 
93
Normalization: other languages 
• Accents: e.g., French résumé vs. resume. 
• Umlauts: e.g., German: Tuebingen vs. Tübingen 
– Should be equivalent 
• Most important criterion: 
– How are your users likely to write their queries for these words? 
• Even in languages that standardly have accents, users often may not type 
them 
– Often best to normalize to a de-accented term 
• Tuebingen, Tübingen, Tubingen → Tubingen 
94
Case folding 
• Reduce all letters to lower case 
– exception: upper case in mid-sentence? 
• e.g., General Motors 
• Fed vs. fed 
• SAIL vs. sail 
– Often best to lower case everything, since users will use lowercase 
regardless of ‘correct’ capitalization… 
• Longstanding Google example: [fixed in 2011…] 
– Query C.A.T. 
– #1 result is for “cats” (well, Lolcats) not Caterpillar Inc. 
95
Normalization to terms 
• An alternative to equivalence classing is to do asymmetric 
expansion 
• An example of where this may be useful 
– Enter: window Search: window, windows 
– Enter: windows Search: Windows, windows, window 
– Enter: Windows Search: Windows 
• Potentially more powerful, but less efficient 
96
Thesauri and soundex 
• Do we handle synonyms and homonyms? 
– E.g., by hand-constructed equivalence classes 
• car = automobile color = colour 
– We can rewrite to form equivalence-class terms 
• When the document contains automobile, index it under 
car-automobile (and vice-versa) 
– Or we can expand a query 
• When the query contains automobile, look under car as 
well 
• What about spelling mistakes? 
– One approach is Soundex, which forms equivalence classes of 
words based on phonetic heuristics 
97
Lemmatization 
• Reduce inflectional/variant forms to base form 
• E.g., 
– am, are, is → be 
– car, cars, car's, cars' → car 
• the boy's cars are different colors → the boy car be 
different color 
• Lemmatization implies doing “proper” reduction to 
dictionary headword form 
98
Stemming 
• Reduce terms to their “roots” before indexing 
• “Stemming” suggests crude affix chopping 
– language dependent 
– e.g., automate(s), automatic, automation all reduced to automat. 
for example compressed 
and compression are both 
accepted as equivalent to 
compress. 
for exampl compress and 
compress ar both accept 
as equival to compress 
99
– Affix removal 
• remove the longest affix: {sailing, sailor} => sail 
• simple and effective stemming 
• a widely used such stemmer is Porter’s algorithm 
– Dictionary-based using a look-up table 
• look for stem of a word in table: play + ing => play 
• space is required to store the (large) table, so often not practical 
100
Stemming: some issues 
• Detect equivalent stems: 
– {organize, organise}: e as the longest affix leads to {organiz, 
organis}, which should lead to one stem: organis 
– Heuristics are therefore used to deal with such cases. 
• Over-stemming: 
– {organisation, organ} reduced into org, which is incorrect 
– Again heuristics are used to deal with such cases. 
101
Porter’s algorithm 
• Commonest algorithm for stemming English 
– Results suggest it’s at least as good as other stemming options 
• Conventions + 5 phases of reductions 
– phases applied sequentially 
– each phase consists of a set of commands 
– sample convention: Of the rules in a compound command, select 
the one that applies to the longest suffix. 
102
Typical rules in Porter 
• sses → ss 
• ies → i 
• ational → ate 
• tional → tion 
103
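
A sketch of how such rules might be applied, using the convention above of selecting the longest matching suffix; the real Porter algorithm runs five ordered phases with extra conditions on the remaining stem:

# Illustrative application of the sample rules on this slide (not the full Porter algorithm).
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_longest_suffix(word):
    # Of the rules that match, pick the one with the longest suffix
    matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
    if not matches:
        return word
    suf, rep = max(matches, key=lambda m: len(m[0]))
    return word[:-len(suf)] + rep

print(apply_longest_suffix("caresses"))    # caress
print(apply_longest_suffix("ponies"))      # poni
print(apply_longest_suffix("relational"))  # relate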
Language-specificity 
• The above methods embody transformations that are 
– Language-specific, and often 
– Application-specific 
• These are “plug-in” addenda to the indexing process 
• Both open source and commercial plug-ins are 
available for handling these 
104
Does stemming help? 
• English: very mixed results. Helps recall for some queries but 
harms precision on others 
– E.g., operative (dentistry) ⇒ oper 
• Definitely useful for Spanish, German, Finnish, … 
– 30% performance gains for Finnish! 
105
Others: Using a thesaurus 
• A thesaurus provides a standard vocabulary for indexing 
(and searching) 
• More precisely, a thesaurus provides a classified 
hierarchy for broadening and narrowing terms 
bank: 1. Finance institute 
2. River edge 
– if a document is indexed with bank, then index it with 
“finance institute” or “river edge” 
– need to disambiguate the sense of bank in the text: e.g. if 
money appears in the document, then choose "finance 
institute" 
• A widely used online thesaurus: WordNet 
106
Information storage 
• Whole topic on its own 
• How do we keep fresh copies of the web, manageable by a cluster of 
computers, and still answer millions of queries in milliseconds? 
– Inverted indexes 
– Compression 
– Caching 
– Distributed architectures 
– … and a lot of tricks 
• Inverted indexes: cornerstone data structure of IR systems 
– For each term t, we must store a list of all documents that contain t. 
– Identify each doc by a docID, a document serial number 
– Index construction is tricky (can’t hold all the information needed in memory) 
107
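
A minimal in-memory sketch of an inverted index (term → postings list of (docID, tf) pairs, plus document frequencies); the names are illustrative, and real systems build the index out of core and compress the postings:

from collections import defaultdict, Counter

def build_index(docs):                          # docs: {docID: [term, term, ...]}
    postings = defaultdict(list)
    for doc_id in sorted(docs):                 # keep postings sorted by docID
        for term, tf in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, tf))
    df = {term: len(plist) for term, plist in postings.items()}
    return postings, df

postings, df = build_index({1: ["a", "as", "a"], 2: ["a"], 3: ["as"]})
# postings["a"] == [(1, 2), (2, 1)], df["a"] == 2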
108 
docs t1 t2 t3 
D1 1 0 1 
D2 1 0 0 
D3 0 1 1 
D4 1 0 0 
D5 1 1 1 
D6 1 1 0 
D7 0 1 0 
D8 0 1 0 
D9 0 1 1 
D10 0 1 1 
Terms D1 D2 D3 D4 
t1 1 1 0 1 
t2 0 0 1 0 
t3 1 0 1 0
• Most basic form: 
– Document frequency 
– Term frequency 
– Document identifiers 
109 
term   term id   df   postings (docID, tf) 
a      1         4    (1,2), (2,5), (10,1), (11,1) 
as     2         3    (1,3), (3,4), (20,1) 
• Indexes contain more information 
– Position in the document 
• Useful for “phrase queries” or “proximity queries” 
– Fields in which the term appears in the document 
– Metadata … 
– All that can be used for ranking 
110 
(1,2, [1, 1], [2,10]), … 
Field 1 (title), position 1
Queries 
• How do we process a query? 
• Several kinds of queries 
– Boolean 
•Chicken AND salt 
• Gnome OR KDE 
• Salt AND NOT pepper 
– Phrase queries 
– Ranked 
111
List Merging 
•“Exact match” queries 
– Chicken AND curry 
– Locate Chicken in the dictionary 
– Fetch its postings 
– Locate curry in the dictionary 
–Fetch its postings 
–Merge both postings 
112
Intersecting two postings lists 
113
List Merging 
Walk through the postings in O(x+y) time (a sketch follows below) 
salt:   3 → 22 → 23 → 25 
pepper: 3 → 5 → 22 → 25 → 36 
result: 3, 22, 25 
114
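
A sketch of the linear-time merge, using the postings above:

def intersect(p1, p2):
    # Linear-time merge of two sorted postings lists (the AND query above)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([3, 22, 23, 25], [3, 5, 22, 25, 36]))   # [3, 22, 25]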
115
Models of information retrieval 
• A model: 
– abstracts away from the real world 
– uses a branch of mathematics 
– possibly: uses a metaphor for searching 
116
Short history of IR modelling 
• Boolean model (±1950) 
• Document similarity (±1957) 
• Vector space model (±1970) 
• Probabilistic retrieval (±1976) 
• Language models (±1998) 
• Linkage-based models (±1998) 
• Positional models (±2004) 
• Fielded models (±2005) 
117
The Boolean model (±1950) 
• Exact matching: data retrieval (instead of 
information retrieval) 
– A term specifies a set of documents 
– Boolean logic to combine terms / document sets 
– AND, OR and NOT: intersection, union, and 
difference 
118
Statistical similarity between documents (±1957) 
• The principle of similarity 
"The more two representations agree in given elements and their 
distribution, the higher would be the probability of their representing 
similar information” 
(Luhn 1957) 
"It is here proposed that the frequency of word [term] occurrence in an 
article [document] furnishes a useful measurement of word [term] 
significance" 
119
Zipf’s law 
[Figure: frequency of terms (f) against terms by rank order (r)]
120
Zipf’s law 
• Relative frequencies of terms. 
• In natural language, there are a few very frequent terms and very many 
very rare terms. 
• Zipf's law: The i-th most frequent term has frequency proportional to 1/i. 
• cf_i ∝ 1/i = K/i where K is a normalizing constant 
• cf_i is the collection frequency: the number of occurrences of the term t_i in the 
collection. 
• Zipf’s law holds for different languages 
121
Zipf consequences 
• If the most frequent term (the) occurs cf_1 times 
– then the second most frequent term (of) occurs cf_1/2 times 
– the third most frequent term (and) occurs cf_1/3 times … 
• Equivalently: cf_i = K/i where K is a normalizing factor, so 
– log cf_i = log K - log i 
– Linear relationship between log cf_i and log i 
• Another power law relationship 
122
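
A quick empirical check of this on any token stream: rank × frequency should be roughly constant (equivalently, log cf_i against log i is roughly linear). A minimal sketch:

from collections import Counter

def rank_frequency(tokens):
    # frequencies in decreasing order; under Zipf's law rank * cf is roughly constant
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return [(rank, cf, rank * cf) for rank, cf in enumerate(freqs, start=1)]

# for rank, cf, product in rank_frequency(tokens)[:30]: print(rank, cf, product)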
Zipf’s law in action 
123
Luhn’s analysis – Observation 
[Figure: frequency of terms (f) against terms by rank order (r), with an upper and a lower cut-off 
separating common terms, significant terms, and rare terms; the resolving-power curve is overlaid] 
Resolving power of significant terms: 
ability of terms to discriminate document content; 
peaks at the rank-order position half way between the two cut-offs 
124
Luhn’s analysis - Implications 
• Common terms are not good at representing document 
content 
– partly implemented through the removal of stop words 
• Rare words are also not good at representing document 
content 
– usually nothing is done 
– Not true for every “document” 
• Need a means to quantify the resolving power of a term: 
– associate weights to index terms 
– tf×idf approach 
125
Ranked retrieval 
• Boolean queries are good for expert users with precise 
understanding of their needs and the collection. 
– Also good for applications: Applications can easily consume 
1000s of results. 
• Not good for the majority of users. 
– Most users incapable of writing Boolean queries (or they are, 
but they think it’s too much work). 
– Most users don’t want to wade through 1000s of results. 
• This is particularly true of web search.
Feast or Famine 
• Boolean queries often result in either too few (=0) or too 
many (1000s) results. 
• Query 1: “standard user dlink 650” → 200,000 hits 
• Query 2: “standard user dlink 650 no card found”: 0 hits 
• It takes a lot of skill to come up with a query that produces 
a manageable number of hits. 
– AND gives too few; OR gives too many
Ranked retrieval models 
• Rather than a set of documents satisfying a query expression, 
in ranked retrieval, the system returns an ordering over the 
(top) documents in the collection for a query 
• Free text queries: Rather than a query language of operators 
and expressions, the user’s query is just one or more words in 
a human language 
• In principle, there are two separate choices here, but in 
practice, ranked retrieval has normally been associated with 
free text queries and vice versa 
128
Feast or famine: not a problem in ranked retrieval 
• When a system produces a ranked result set, large result sets 
are not an issue 
– Indeed, the size of the result set is not an issue 
– We just show the top k ( ≈ 10) results 
– We do not overwhelm the user 
– Premise: the ranking algorithm works
Scoring as the basis of ranked retrieval 
• We wish to return in order the documents most likely to 
be useful to the searcher 
• How can we rank-order the documents in the collection 
with respect to a query? 
• Assign a score – say in [0, 1] – to each document 
• This score measures how well document and query 
“match”.
Query-document matching scores 
• We need a way of assigning a score to a query/document 
pair 
• Let’s start with a one-term query 
• If the query term does not occur in the document: score 
should be 0 
• The more frequent the query term in the document, the 
higher the score (should be) 
• We will look at a number of alternatives for this.
Bag of words model 
• Vector representation does not consider the ordering of 
words in a document 
• John is quicker than Mary and Mary is quicker than John 
have the same vectors 
• This is called the bag of words model.
Term frequency tf 
• The term frequency tf(t,d) of term t in document d is defined 
as the number of times that t occurs in d. 
• We want to use tf when computing query-document match 
scores. But how? 
• Raw term frequency is not what we want: 
– A document with 10 occurrences of the term is more 
relevant than a document with 1 occurrence of the term. 
– But not 10 times more relevant. 
• Relevance does not increase proportionally with term 
frequency.
Log-frequency weighting 
• The log frequency weight of term t in d is 
w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise 
• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. 
• Score for a document-query pair: sum over terms t in both q and d: 
score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d}) 
• The score is 0 if none of the query terms is present in the document. 
Document frequency 
• Rare terms are more informative than frequent terms 
– Recall stop words 
• Consider a term in the query that is rare in the collection (e.g., 
arachnocentric) 
• A document containing this term is very likely to be relevant to 
the query arachnocentric 
• → We want a high weight for rare terms like arachnocentric.
Document frequency, continued 
• Frequent terms are less informative than rare terms 
• Consider a query term that is frequent in the collection (e.g., high, 
increase, line) 
• A document containing such a term is more likely to be relevant than a 
document that does not 
• But it’s not a sure indicator of relevance. 
• → For frequent terms, we want high positive weights for words like high, 
increase, and line 
• But lower weights than for rare terms. 
• We will use document frequency (df) to capture this.
idf weight 
• df_t is the document frequency of t: the number of documents that contain 
t 
– df_t is an inverse measure of the informativeness of t 
– df_t ≤ N 
• We define the idf (inverse document frequency) of t by 
idf_t = log10(N / df_t) 
– We use log10(N/df_t) instead of N/df_t to “dampen” the effect of idf. 
Effect of idf on ranking 
• Does idf have an effect on ranking for one-term queries, like 
– iPhone 
• idf has no effect on ranking one term queries 
– idf affects the ranking of documents for queries with at least 
two terms 
– For the query capricious person, idf weighting makes 
occurrences of capricious count for much more in the final 
document ranking than occurrences of person. 
138
tf-idf weighting 
• The tf-idf weight of a term is the product of its tf weight and its 
idf weight. 
w_{t,d} = log(1 + tf_{t,d}) × log10(N / df_t) 
• Best known weighting scheme in information retrieval 
– Note: the “-” in tf-idf is a hyphen, not a minus sign! 
– Alternative names: tf.idf, tf x idf 
• Increases with the number of occurrences within a document 
• Increases with the rarity of the term in the collection
Score for a document given a query 
Score(q,d) = Σ_{t ∈ q ∩ d} tf.idf_{t,d} 
• There are many variants (a scoring sketch follows below) 
– How “tf” is computed (with/without logs) 
– Whether the terms in the query are also weighted 
– … 
140 
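
A minimal scoring sketch for this scheme, assuming the log-frequency tf variant from the earlier slide and the postings/df structures from the inverted-index sketch:

import math

def tfidf_score(query_terms, doc_id, postings, df, N):
    # sum (1 + log10 tf) * log10(N / df) over query terms present in the document
    score = 0.0
    for t in query_terms:
        tf = dict(postings.get(t, [])).get(doc_id, 0)
        if tf > 0:
            score += (1 + math.log10(tf)) * math.log10(N / df[t])
    return score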
Documents as vectors 
• So we have a |V|-dimensional vector space 
• Terms are axes of the space 
• Documents are points or vectors in this space 
• Very high-dimensional: tens of millions of dimensions when 
you apply this to a web search engine 
• These are very sparse vectors - most entries are zero.
Statistical similarity between documents (±1957) 
• Vector product 
– If the vector has binary components, then the product 
measures the number of shared terms 
– Vector components might be "weights" 
score(q,d) = Σ_{k ∈ matching terms} q_k · d_k 
Why distance is a bad idea 
The Euclidean 
distance between q 
and d2 is large even 
though the 
distribution of terms 
in the query q and the 
distribution of 
terms in the 
document d2 are 
very similar.
Vector space model (±1970) 
• Documents and 
queries are vectors in 
a high-dimensional 
space 
• Geometric measures 
(distances, angles)
Vector space model (±1970) 
• Cosine of an angle: 
– close to 1 if angle is small 
– 0 if vectors are orthogonal 
cos(d, q) = (d · q) / (|d| |q|) = Σ_{k=1}^{m} d_k q_k / ( sqrt(Σ_{k=1}^{m} d_k²) · sqrt(Σ_{k=1}^{m} q_k²) ) = ⟨n(d), n(q)⟩ 
where n(v) = v / |v| is the length-normalized vector (a computation sketch follows below) 
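
A sketch of the cosine computation over sparse term-weight vectors (dictionaries mapping term → weight):

import math

def cosine(d, q):
    # dot product of the two vectors divided by the product of their lengths
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0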
Vector space model (±1970) 
• PRO: Nice metaphor, easily explained; 
Mathematically sound: geometry; 
Great for relevance feedback 
• CON: Need term weighting (tf-idf); 
Hard to model structured queries
Probabilistic IR 
• An IR system has an uncertain understanding of user’s queries and 
makes uncertain guesses on whether a document satisfies a query 
or not. 
• Probability theory provides a principled foundation for reasoning 
under uncertainty. 
• Probabilistic models build upon this foundation to estimate how 
likely it is that a document is relevant for a query. 
147
Event Space 
• Query representation 
• Document representation 
• Relevance 
• Event space 
• Conceptually there might be pairs with same q and d, 
but different r 
• Sometimes we also include the user u, context c, etc. 
148
Probability Ranking Principle 
• Robertson (1977) 
– “If a reference retrieval system’s response to each 
request is a ranking of the documents in the collection 
in order of decreasing probability of relevance to the 
user who submitted the request, where the 
probabilities are estimated as accurately as possible 
on the basis of whatever data have been made 
available to the system for this purpose, the overall 
effectiveness of the system to its user will be the best 
that is obtainable on the basis of those data.” 
• Basis for probabilistic approaches for IR 
149
Dissecting PRP 
• Probability of relevance 
• Estimated accurately 
• Based on whatever data available 
• Best possible accuracy 
– The perfect IR system! 
– Assumes relevance is independent of other 
documents in the collection 
150
Relevance? 
• What is relevance? 
– Isn’t it decided by the user? her opinion? 
• User doesn’t mean a human being! 
– We are working with representations 
– ... or parts of the reality available to us 
• 2/3 keywords, no profile, no context ... 
– relevance is uncertain 
• depends on what the system sees 
• may be marginalized over all the 
unseen context/profiles 
151
Retrieval as binary classification 
• For every (q,d), r takes two values 
– Relevant and non-relevant documents 
– can be extended to multiple values 
• Retrieve using Bayes’ decision 
– PRP is related to the Bayes error rate (lowest 
possible error rate for a class) 
– How do we estimate this probability? 
152
PRP ranking 
• How to represent the random variables? 
• How to estimate the model’s parameters? 
153
• d is a binary vector 
• Multiple Bernoulli variables 
• Under MB, we can decompose into a 
product of probabilities, with likelihoods: 
154
If the terms are not in the query: 
Otherwise we need estimates for them! 
155
Estimates 
• Assign new weights for query terms based on relevant/non-relevant 
documents 
• Give higher weights to important terms: 
                      Relevant    Non-relevant    Total 
Documents with t      r           n-r             n 
Documents without t   R-r         N-n-R+r         N-n 
Total                 R           N-R             N 
156 
Robertson-Spärck Jones weight (a sketch follows below) 
157 
Relevant docs with t 
Relevant docs without t 
Non-relevant docs with t 
Non-relevant docs without t
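
A sketch of this weight computed from the contingency table above, with the usual +0.5 smoothing added to each cell (the smoothing constant is an assumption, not shown on the slide):

import math

def rsj_weight(r, n, R, N):
    # r: relevant docs with t, n: docs with t, R: relevant docs, N: all docs
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))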
Estimates without relevance info 
• If we pick a relevant document, words are equally likely to be 
present or absent 
• Non-relevant can be approximated with the collection as a 
whole 
158
Modeling term frequencies 
159
Modeling TF 
• Naïve estimation: separate probability for every 
outcome 
• BIR had only two parameters, now we have plenty 
(~many outcomes) 
• We can plug in a parametric estimate for the term 
frequencies 
• For instance, a Poisson mixture 
160
Okapi BM25 
• Same ranking function as before but with new 
estimates. Models term frequencies and 
document length. 
• Words are generated by a mixture of two 
Poissons 
• Assumes an eliteness variable (elite ~ word 
occurs unusually frequently, non-elite ~ word 
occurs as expected by chance). 
161
BM25 
• As a graphical model 
162
BM25 
• In order to approximate the formula, Robertson and Walker came up 
with: 
• Two model parameters 
• Very effective 
• The more words in common with the query the better 
• Repetitions less important than different query words 
– But more important if the document is relatively long 
163
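
A sketch of one common form of the BM25 ranking function; the slide's exact variant is not shown, so the idf form and the default parameters k1 = 1.2, b = 0.75 below are assumptions:

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)     # smoothed idf-style weight
        denom = tf + k1 * (1 - b + b * doc_len / avg_doc_len)     # tf saturation + length normalization
        score += idf * tf * (k1 + 1) / denom
    return score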
Generative Probabilistic Language Models 
• The generative approach – A generator which produces 
events/tokens with some probability 
– Probability distribution over strings of text 
– URN Metaphor – a bucket of different colour balls (10 red, 5 
blue, 3 yellow, 2 white) 
• What is the probability of drawing a yellow ball? 3/20 
• what is the probability of drawing (with replacement) a red ball and a 
white ball? ½*1/10 
– IR Metaphor: Documents are urns, full of tokens (balls) of (in) 
different terms (colors)
What is a language model? 
• How likely is a string of words in a “language”? 
– P1(“the cat sat on the mat”) 
– P2(“the mat sat on the cat”) 
– P3(“the cat sat en la alfombra”) 
– P4(“el gato se sentó en la alfombra”) 
• Given a model M and an observation s we want 
– Probability of getting s through random sampling from M 
– A mechanism to produce observations (strings) legal in M 
• User thinks of a relevant document and then picks some keywords 
to use as a query 
165
Generative Probabilistic Models 
• What is the probability of producing the query from a document? p(q|d) 
• Referred to as query-likelihood 
• Assumptions: 
• The probability of a document being relevant is strongly correlated with 
the probability of a query given a document, i.e. p(d|r) is correlated 
with p(q|d) 
• User has a reasonable idea of the terms that are likely to appear in the 
“ideal” document 
• User’s query terms can distinguish the “ideal” document from the rest 
of the corpus 
• The query is generated as a representative of the “ideal” document 
• System’s task is to estimate for each of the documents in the collection, 
which is most likely to be the “ideal” document
Language Models (1998/2001) 
• Let’s assume we point blindly, one at a time, at 3 words 
in a document 
– What is the probability that I, by accident, pointed at the words 
“Master”, “computer” and “Science”? 
– Compute the probability, and use it to rank the documents. 
• Words are “sampled” independently of each other 
– Joint probability decomposed into a product of marginals 
– Estimation of probabilities just by counting 
• Higher models or unigrams? 
– Parameter estimation can be very expensive
Standard LM Approach 
• Assume that query terms are drawn identically and 
independently from a document
Estimating language models 
• Usually we don’t know M 
• Maximum Likelihood Estimate of 
– Simply use the number of times the query term occurs in 
the document divided by the total number of term 
occurrences. 
• Zero Probability (frequency) problem 
169
Document Models 
• Solution: Infer a language model for each document, 
where 
• Then we can estimate 
• Standard approach is to use the probability of a term to 
smooth the document model. 
• Interpolate the ML estimator with general language 
expectations
Estimating Document Models 
• Basic Components 
– Probability of a term given a document (maximum likelihood estimate) 
– Probability of a term given the collection 
– tf(t,d) is the number of times term t occurs in document d (term frequency)
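
A sketch of query-likelihood scoring with the interpolation described above (Jelinek-Mercer smoothing); the parameter name lambda_ and its value are assumptions:

import math

def lm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lambda_=0.8):
    # interpolate the document's ML estimate with the collection model
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(t, 0) / coll_len
        p = lambda_ * p_doc + (1 - lambda_) * p_coll
        if p > 0:
            score += math.log(p)     # sum of logs = log of the product over query terms
    return score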
Language Models 
• Implementation
Implementation as vector product 
p(t) = df(t) / Σ_{t'} df(t')          p(t|D) = tf(t,D) / Σ_{t'} tf(t',D) 
Recall: score(q,d) = Σ_k q_k · d_k, with q_k = tf(k,q) 
[Formula: the query-likelihood score rewritten as a vector product over matching terms, where the 
document component d_k combines, inside a logarithm, a tf.idf-like weight of term k in document d 
(tf(k,d)/df(k)), the inverse length of d, and the odds of the term probabilities] 
Document length normalization 
• Probabilistic models assume causes for documents differing in 
length 
– Scope 
– Verbosity 
• In practice, document length softens the term frequency 
contribution to the final score 
– We’ve seen it in BM25 and LMs 
– Usually with a tunable parameter that regulates the 
amount of softening 
– Can be a function of the deviation of the average 
document length 
– Can be incorporated into vanilla tf-idf 
174
Other models 
• Modeling term dependencies (positions) in the language 
modeling framework 
– Markov Random Fields 
• Modeling matches (occurrences of words) in different 
parts of a document -> fielded models 
– BM25F 
– Markov Random Fields can account for this as well 
175
More involved signals for ranking 
• From document understanding to query 
understanding 
• Query rewrites (gazetteers, spell correction), 
named entity recognition, query suggestions, 
query categories, query segmentation ... 
• Detecting query intent, triggering verticals 
– direct target towards answers 
– richer interfaces 
176
Signals for Ranking 
• Signals for ranking: matches of query terms in 
documents, query-independent quality measures, 
CTR, among others 
• Probabilistic IR models are all about counting 
– occurrences of terms in documents, in sets of 
documents, etc. 
• How to aggregate efficiently a large number of 
“different” counts 
– coming from the same terms 
– no double counts! 
177
Searching for food 
• New York’s greatest pizza 
‣ New OR York’s OR greatest OR pizza 
‣ New AND York’s AND greatest AND pizza 
‣ New OR York OR great OR pizza 
‣ “New York” OR “great pizza” 
‣ “New York” AND “great pizza” 
‣ York < New AND great OR pizza 
• among many more. 
178
“Refined” matching 
• Extract a number of virtual regions in the document 
that match some version of the query (operators) 
– Each region provides a different evidence of 
relevance (i.e. signal) 
• Aggregate the scores over the different regions 
• Ex. :“at least any two words in the query appear 
either consecutively or with an extra word between 
them” 
179
Probability of Relevance 
180
Remember BM25 
• Term (tf) independence 
• Vague Prior over terms not 
appearing in the query 
• Eliteness - topical model that 
perturbs the word distribution 
• 2-poisson distribution of term 
frequencies over relevant and non-relevant 
documents 
181
Feature dependencies 
• Class-linearly dependent (or affine) features 
– add no extra evidence/signal 
– model overfitting (vs capacity) 
• Still, it is desirable to enrich the model with more 
involved features 
• Some features are surprisingly correlated 
• Positional information requires a large number of 
parameters to estimate 
• Potentially up to 
182
Query concept segmentation 
• Queries are made up of basic conceptual units, 
comprising many words 
– “Indian summer victor herbert” 
• Spurious matches: “san jose airport” -> “san jose 
city airport” 
• Model to detect segments based on generative 
language models and Wikipedia 
• Relax matches using factors of the max ratio 
between span length and segment length 
183
Virtual regions 
• Different parts of the document 
provide different evidence of 
relevance 
• Create a (finite) set of (latent) 
artificial regions and re-weight 
184
Implementation 
• An operator maps a query to a set of queries, 
which could match a document 
• Each operator has a weight 
• The average term frequency in a document is 
185
Remarks 
• Different saturation (eliteness) function? 
– learn the real functional shape! 
– log-logistic is good if the class-conditional 
distributions are drawn from an exp. family 
• Positions as variables? 
– kernel-like method or exp. #parameters 
• Apply operators on a per query or per query class 
basis? 
186
Operator examples 
• BOW: maps a raw query to the set of queries 
whose elements are the single terms 
• p-grams: set of all p-gram of consecutive terms 
• p-and: all conjunctions of p arbitrary terms 
• segments: match only the “concepts” 
• Enlargement: some words might sneak in 
between the phrases/segments 
187
How does it work in practice? 
188
... not that far away 
term frequency 
link information 
query intent information 
editorial information 
click-through information 
geographical information 
language information 
user preferences 
document length 
document fields 
other gazillion sources of information 
189
Dictionaries 
• Fast look-up 
– Might need specific structures to scale up 
• Hash tables 
• Trees 
– Tolerant retrieval (prefixes) 
– Spell checking 
• Document correction (OCR) 
• Query misspellings (did you mean … ?) 
• (Weighted) edit distance – dynamic programming 
• Jaccard overlap (index character k-grams) 
• Context sensitive 
• http://norvig.com/spell-correct.html 
– Wild-card queries 
• Permuterm index 
• K-gram indexes 
190
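
A sketch of the dynamic-programming edit distance mentioned above, with unit costs (a weighted version would replace the 1s and the substitution cost with per-operation weights):

def edit_distance(a, b):
    # standard Levenshtein distance via dynamic programming
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("colour", "color"))   # 1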
Hardware basics 
• Access to data in memory is much faster than access to data on disk. 
• Disk seeks: No data is transferred from disk while the disk head is being 
positioned. 
• Therefore: Transferring one large chunk of data from disk to memory is 
faster than transferring many small chunks. 
• Disk I/O is block-based: Reading and writing of entire blocks (as opposed 
to smaller chunks). 
• Block sizes: 8KB to 256 KB. 
191
Hardware basics 
• Many design decisions in information retrieval are based on the 
characteristics of hardware 
• Servers used in IR systems now typically have several GB of main memory, 
sometimes tens of GB. 
• Available disk space is several (2-3) orders of magnitude larger. 
• Fault tolerance is very expensive: It is much cheaper to use many regular 
machines rather than one fault tolerant machine. 
192
Data flow 
[Diagram: index-construction data flow – a master assigns input splits to parsers (map phase), which write (term, docID) pairs into segment files partitioned by term range (a-f, g-p, q-z); inverters (reduce phase) each collect one term range and produce its postings]
193
MapReduce 
• The index construction algorithm we just described is an instance of 
MapReduce. 
• MapReduce (Dean and Ghemawat 2004) is a robust and conceptually 
simple framework for distributed computing … 
• … without having to write code for the distribution part. 
• They describe the Google indexing system (ca. 2002) as consisting of a 
number of phases, each implemented in MapReduce. 
• Open source implementation Hadoop 
– Widely used throughout industry 
194
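
A conceptual map/reduce pair for index construction, with the framework plumbing (splitting, shuffling, assignment) omitted; the function names are illustrative:

def map_fn(doc_id, text):
    # emit (term, docID) pairs for every token in the document
    for term in text.lower().split():
        yield (term, doc_id)

def reduce_fn(term, doc_ids):
    # collect all docIDs for a term into its postings list
    return (term, sorted(set(doc_ids)))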
MapReduce 
• Index construction was just one phase. 
• Another phase: transforming a term-partitioned index 
into a document-partitioned index. 
– Term-partitioned: one machine handles a subrange of 
terms 
– Document-partitioned: one machine handles a 
subrange of documents 
• Most search engines use a document-partitioned index for 
better load balancing, etc. 
195
Distributed IR 
• Basic process 
– All queries sent to a director machine 
– Director then sends messages to many index servers 
• Each index server does some portion of the query processing 
– Director organizes the results and returns them to the user 
• Two main approaches 
– Document distribution 
• by far the most popular 
– Term distribution 
196
Distributed IR (II) 
• Document distribution 
– each index server acts as a search engine for a small fraction of 
the total collection 
– director sends a copy of the query to each of the index servers, 
each of which returns the top k results 
– results are merged into a single ranked list by the director 
• Collection statistics should be shared for effective ranking 
197
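
A sketch of the director's merge step under document distribution: each index server returns its local top-k (score, docID) pairs and the director merges them into a single global ranking:

import heapq

def merge_results(per_server_results, k=10):
    # per_server_results: list of lists of (score, docID), each locally ranked
    all_hits = [hit for server in per_server_results for hit in server]
    return heapq.nlargest(k, all_hits)      # global top-k by score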
Caching 
• Query distributions similar to Zipf 
• About half of the queries each day are unique, but some are very popular 
– Caching can significantly improve effectiveness 
• Cache popular query results 
• Cache common inverted lists 
– Inverted list caching can help with unique queries 
– Cache must be refreshed to prevent stale data 
198
Others 
• Efficiency (compression, storage, caching, 
distribution) 
• Novelty and diversity 
• Evaluation 
• Relevance feedback 
• Learning to rank 
• User models 
– Context, personalization 
• Sponsored Search 
• Temporal aspects 
• Social aspects 
199
200

Relevancy and Search Quality Analysis - Search Technologies
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
 
Information retrieval 1 introduction to ir
Information retrieval 1 introduction to irInformation retrieval 1 introduction to ir
Information retrieval 1 introduction to ir
 
Information RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptxInformation RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptx
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 

More from Roi Blanco

From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the WebRoi Blanco
 
Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationRoi Blanco
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataRoi Blanco
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Roi Blanco
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and futureRoi Blanco
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Roi Blanco
 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF GraphsRoi Blanco
 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operatorsRoi Blanco
 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search EnginesRoi Blanco
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataRoi Blanco
 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesRoi Blanco
 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entitiesRoi Blanco
 

More from Roi Blanco (12)

From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the Web
 
Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance Minimization
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations
 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF Graphs
 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operators
 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental Indices
 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entities
 

Recently uploaded

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 

Recently uploaded (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 

Introduction to Information Retrieval

  • 42. IR is mostly about relevance • Relevance is the core concept in IR, but nobody has a good definition • Relevance = useful • Relevance = topically related • Relevance = new • Relevance = interesting • Relevance = ??? • However we still want relevant information 42
  • 43. • Information needs must be expressed as a query – But users don’t often know what they want • Problems – Verbalizing information needs – Understanding query syntax – Understanding search engines 43
  • 44. Understanding(?) the user I am a hungry tourist in Barcelona, and I want to find a place to eat; however I don’t want to spend a lot of money I want information on places with cheap food in Barcelona Info about bars in Barcelona Bar celona Misconception Mistranslation Misformulation 44
  • 45. Why is this hard? • Documents/images/video/speech/etc. are complex. We need some representation • Semantics – What do words mean? • Natural language – How do we say things? • Computers cannot deal with these easily 45
  • 46. … and even harder • Context • Opinion Funny? Talented? Honest? 46
  • 47. Semantics Bank Note River Bank Bank 47 Blood bank
  • 48. What is it like to be a search engine? • How can we figure out what you’re trying to do? • Signal can be somehow weak, sometimes! [ jaguar ] [ iraq ] [ latest release Thinkpad drivers touchpad ] [ ebay ] [ first ] [ google ] [ brittttteny spirs ] 48
  • 49. Search is a multi-step process • Session search – Verbalize your query – Look for a document – Find your information there – Refine • Teleporting – Go directly to the site you like – Formulating the query is too hard, you trust more the final site, etc. 49
  • 50. • Someone told me that in the mid-1800’s, people often would carry around a special kind of notebook. They would use the notebook to write down quotations that they heard, or copy passages from books they’d read. The notebook was an important part of their education, and it had a particular name. – What was the name of the notebook? 50 Examples from Dan Russel
  • 51. Naming the un-nameable • What’s this thing called? 51
  • 52. More tasks … • Going beyond a search engine – Using images / multimedia content – Using maps – Using other sources • Think of how to express things differently (synonyms) – A friend told me that there is an abandoned city in the waters of San Francisco Bay. Is that true? If it IS true, what was the name of the supposed city? • Exploring a topic further in depth • Refining a question – Suppose you want to buy a unicycle for your Mom or Dad. How would you find it? • Looking for lists of information – Can you find a list of all the groups that inhabited California at the time of the missions? 52
  • 53. IR tasks • Known-item finding – You want to retrieve some data that you know exists – What year was Peter Mika born? • Exploratory seeking – You want to find some information through an iterative process – Not a single answer to your query • Exhaustive search – You want to find all the information possible about a particular issue – Issuing several queries to cover the user information need • Re-finding – You want to find an item you have already found 53
  • 54. Scale • >300TB of print data produced per year – +Video, speech, domain-specific information (>600PB per year) • IR has to be fast + scalable • Information is dynamic – News, web pages, maps, … – Queries are dynamic (you might even change your information needs while searching) • Cope with data and searcher change – This introduces tensions in every component of a search engine 54
  • 55. Methodology • Experimentation in IR • Three fundamental types of IR research: – Systems (efficiency) – Methods (effectiveness) – Applications (user utility) • Empirical evaluation plays a critical role across all three types of research 55
  • 56. Methodology (II) • Information retrieval (IR) is a highly applied scientific discipline • Experimentation is a critical component of the scientific method • Poor experimental methodologies are not scientifically sound and should be avoided 56
  • 57. 57
  • 58. 58 Task Info need Verbal form query Search engine Corpus results Query refinement
  • 59. User Interface Query interpretation Document Collection Crawling Text Processing Indexing General Voodoo Matching Ranking Metadata Index Document Interpretation 59
  • 60. Crawler NLP pipeline Indexer Documents Tokens Index Query System 60
  • 61. Broker DNS Cluster Cluster cache server partition replication 61
  • 62. <a href= • Web pages are linked – AKA Web Graph • We can walk through the graph to crawl • We can rank using the graph 62
  • 63. Web pages are connected 63
  • 64. Web Search • Basic search technology shared with IR systems – Representation – Indexing – Ranking • Scale (in terms of data and users) changes the game – Efficiency/architectural design decisions • Link structure – For data acquisition (crawling) – For ranking (PageRank, HITS) – For spam detection – For extending document representations (anchor text) • Adversarial IR • Monetization 64
  • 65. User Needs • Need – Informational – want to learn about something (~40% / 65%) – Navigational – want to go to that page (~25% / 15%) – Transactional – want to do something (web-mediated) (~35% / 20%) • Access a service • Downloads • Shop – Gray areas • Find a good hub • Exploratory search “see what’s there” Low hemoglobin United Airlines Seattle weather Mars surface images Canon S410 Car rental Brasil 65
  • 66. How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf) 66
  • 67. Users’ empirical evaluation of results • Quality of pages varies widely – Relevance is not enough – Other desirable qualities (non IR!!) • Content: Trustworthy, diverse, non-duplicated, well maintained • Web readability: display correctly & fast • No annoyances: pop-ups, etc. • Precision vs. recall – On the web, recall seldom matters • What matters – Precision at 1? Precision above the fold? – Comprehensiveness – must be able to deal with obscure queries • Recall matters when the number of matches is very small • User perceptions may be unscientific, but are significant over a large aggregate 67
  • 68. Users’ empirical evaluation of engines • Relevance and validity of results • UI – Simple, no clutter, error tolerant • Trust – Results are objective • Coverage of topics for ambiguous queries • Pre/Post process tools provided – Mitigate user errors (auto spell check, search assist,…) – Explicit: Search within results, more like this, refine ... – Anticipative: related searches • Deal with idiosyncrasies – Web specific vocabulary • Impact on stemming, spell-check, etc. – Web addresses typed in the search box • “The first, the last, the best and the worst …” 68
  • 69. The Web document collection • No design/co-ordination • Distributed content creation, linking, democratization of publishing • Content includes truth, lies, obsolete information, contradictions … • Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases)… • Scale much larger than previous text collections … but corporate records are catching up • Growth – slowed down from initial “volume doubling every few months” but still expanding • Content can be dynamically generated The Web 69
  • 70. Basic crawler operation • Begin with known “seed” URLs • Fetch and parse them – Extract URLs they point to – Place the extracted URLs on a queue • Fetch each URL on the queue and repeat 70
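A minimal sketch of this crawl loop, using only the Python standard library; the regex-based link extraction, the page limit and the seed URLs are illustrative assumptions, and a real crawler would add politeness delays, robots.txt handling and distribution across machines (see the following slides).

    import re
    import urllib.request
    from urllib.parse import urljoin
    from collections import deque

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)              # the URL frontier (queue)
        seen = set(seed_urls)                    # URLs already queued
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                page = urllib.request.urlopen(url, timeout=5)
                html = page.read().decode("utf-8", "ignore")
            except Exception:
                continue                         # skip unreachable pages
            fetched += 1
            # naive link extraction; a real crawler would use an HTML parser
            for link in re.findall(r'href="(http[^"]+)"', html):
                link = urljoin(url, link)
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)        # place extracted URLs on the queue
            yield url, html                      # hand the page to the parser/indexer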
  • 71. Crawling picture Web URLs frontier Unseen Web URLs crawled and parsed Seed pages 71
  • 72. Simple picture – complications • Web crawling isn’t feasible with one machine – All of the above steps distributed • Malicious pages – Spam pages – Spider traps – including dynamically generated • Even non-malicious pages pose challenges – Latency/bandwidth to remote servers vary – Webmasters’ stipulations • How “deep” should you crawl a site’s URL hierarchy? – Site mirrors and duplicate pages • Politeness – don’t hit a server too often 72
  • 73. What any crawler must do • Be Polite: Respect implicit and explicit politeness considerations – Only crawl allowed pages – Respect robots.txt • Be Robust: Be immune to spider traps and other malicious behavior from web servers – Be efficient 73
  • 74. What any crawler should do • Be capable of distributed operation: designed to run on multiple distributed machines • Be scalable: designed to increase the crawl rate by adding more machines • Performance/efficiency: permit full use of available processing and network resources 74
  • 75. What any crawler should do • Fetch pages of “higher quality” first • Continuous operation: Continue fetching fresh copies of a previously fetched page • Extensible: Adapt to new data formats, protocols 75
  • 76. Updated crawling picture URLs crawled and parsed Unseen Web Seed Pages URL frontier Crawling thread 76
  • 77. 77
  • 78. Document views sailing greece mediterranean fish sunset Author = “B. Smith” Crdate = “14.12.96” Ladate = “11.07.02” Sailing in Greece B. Smith content view head title author chapter section section structure view data view layout view 78
  • 79. What is a document: document views • Content view is concerned with representing the content of the document; that is, what is the document about. • Data view is concerned with factual data associated with the document (e.g. author names, publishing date) • Layout view is concerned with how documents are displayed to the users; this view is related to user interface and visualization issues. • Structure view is concerned with the logical structure of the document, (e.g. a book being composed of chapters, themselves composed of sections, etc.) 79
  • 80. Indexing language • An indexing language: – Is the language used to describe the content of documents (and queries) – And it usually consists of index terms that are derived from the text (automatic indexing), or arrived at independently (manual indexing), using a controlled or uncontrolled vocabulary – Basic operation: is this query term present in this document? 80
  • 81. Generating document representations • The building of the indexing language, that is generating the document representation, is done in several steps: – Character encoding – Language recognition – Page segmentation (boilerplate detection) – Tokenization (identification of words) – Term normalization – Stopword removal – Stemming – Others (doc. Expansion, etc.) 81
  • 82. Generating document representations: overview documents tokens stop-words stems terms (index terms) tokenization remove noisy words reduce to stems + others: e.g. - thesaurus - more complex processing 82
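A toy version of the pipeline above (tokenization, stop-word removal, crude suffix stripping); the stop-word list and the suffix rules are tiny illustrative assumptions, not a production analyzer.

    import re

    STOP_WORDS = {"the", "a", "an", "and", "to", "be", "of", "in", "is"}   # assumed tiny list

    def tokenize(text):
        # identification of words: lowercase and split on non-alphanumeric characters
        return re.findall(r"[a-z0-9]+", text.lower())

    def stem(token):
        # crude suffix stripping standing in for a real stemmer (e.g. Porter)
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def index_terms(document):
        tokens = tokenize(document)                           # document -> tokens
        tokens = [t for t in tokens if t not in STOP_WORDS]   # remove noisy words
        return [stem(t) for t in tokens]                      # reduce to stems

    print(index_terms("Sailing in Greece: the sunsets and the fish"))
    # -> ['sail', 'greece', 'sunset', 'fish']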
  • 83. Parsing a document • What format is it in? – pdf/word/excel/html? • What language is it in? • What character set is in use? – (ISO-8859, UTF-8, …) But these tasks are often done heuristically … 83
  • 84. Complications: Format/language • Documents being indexed can include docs from many different languages – A single index may contain terms from many languages. • Sometimes a document or its components can contain multiple languages/formats – French email with a German pdf attachment. – French email quote clauses from an English-language contract • There are commercial and open source libraries that can handle a lot of this stuff 84
  • 85. Complications: What is a document? We return from our query “documents” but there are often interesting questions of grain size: What is a unit document? – A file? – An email? (Perhaps one of many in a single mbox file) • What about an email with 5 attachments? – A group of files (e.g., PPT or LaTeX split over HTML pages) 85
  • 86. Tokenization • Input: “Friends, Romans and Countrymen” • Output: Tokens – Friends – Romans – Countrymen • A token is an instance of a sequence of characters • Each such token is now a candidate for an index entry, after further processing • But what are valid tokens to emit? 86
  • 87. Tokenization • Issues in tokenization: – Finland’s capital → Finland AND s? Finlands? Finland’s? – Hewlett-Packard → Hewlett and Packard as two tokens? • state-of-the-art: break up hyphenated sequence. • co-education • lowercase, lower-case, lower case ? • It can be effective to get the user to put in possible hyphens – San Francisco: one token or two? • How do you decide it is one token? 87
  • 88. Numbers • 3/20/91 Mar. 12, 1991 20/3/91 • 55 B.C. • B-52 • My PGP key is 324a3df234cb23e • (800) 234-2333 • Often have embedded spaces • Older IR systems may not index numbers But often very useful: think about things like looking up error codes/stacktraces on the web • Will often index “meta-data” separately Creation date, format, etc. 88
  • 89. Tokenization: language issues • French – L'ensemble  one token or two? • L ? L’ ? Le ? • Want l’ensemble to match with un ensemble – Until at least 2003, it didn’t on Google » Internationalization! • German noun compounds are not segmented – Lebensversicherungsgesellschaftsangestellter – ‘life insurance company employee’ – German retrieval systems benefit greatly from a compound splitter module – Can give a 15% performance boost for German 89
  • 90. Tokenization: language issues • Chinese and Japanese have no spaces between words: – 莎拉波娃现在居住在美国东南部的佛罗里达。 – Not always guaranteed a unique tokenization • Further complicated in Japanese, with multiple alphabets intermingled – Dates/amounts in multiple formats フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) Katakana Hiragana Kanji Romaji End-user can express query entirely in hiragana! 90
  • 91. Tokenization: language issues • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right • Words are separated, but letter forms within a word form complex ligatures ← → ← → ← start ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’ • With Unicode, the surface presentation is complex, but the stored form is straightforward 91
  • 92. Stop words • With a stop list, you exclude from the dictionary entirely the commonest words. Intuition: – They have little semantic content: the, a, and, to, be – There are a lot of them: ~30% of postings for top 30 words • But the trend is away from doing this: – Good compression techniques means the space for including stop words in a system can be small – Good query optimization techniques mean you pay little at query time for including stop words. – You need them for: • Phrase queries: “King of Denmark” • Various song titles, etc.: “Let it be”, “To be or not to be” • “Relational” queries: “flights to London” 92
  • 93. Normalization to terms • Want: matches to occur despite superficial differences in the character sequences of the tokens • We may need to “normalize” words in indexed text as well as query words into the same form – We want to match U.S.A. and USA • Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary • We most commonly implicitly define equivalence classes of terms by, e.g., – deleting periods to form a term • U.S.A., USA → USA – deleting hyphens to form a term • anti-discriminatory, antidiscriminatory → antidiscriminatory 93
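A small sketch of this equivalence classing (deleting periods and hyphens); purely illustrative, and case folding is handled on the following slides.

    def normalize(token):
        # delete periods and hyphens so superficially different tokens map to one term
        return token.replace(".", "").replace("-", "")

    print(normalize("U.S.A."))                  # -> USA
    print(normalize("anti-discriminatory"))     # -> antidiscriminatory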
  • 94. Normalization: other languages • Accents: e.g., French résumé vs. resume. • Umlauts: e.g., German: Tuebingen vs. Tübingen – Should be equivalent • Most important criterion: – How are your users likely to write their queries for these words? • Even in languages that standardly have accents, users often may not type them – Often best to normalize to a de-accented term • Tuebingen, Tübingen, Tubingen → Tubingen 94
  • 95. Case folding • Reduce all letters to lower case – exception: upper case in mid-sentence? • e.g., General Motors • Fed vs. fed • SAIL vs. sail – Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization… • Longstanding Google example: [fixed in 2011…] – Query C.A.T. – #1 result is for “cats” (well, Lolcats) not Caterpillar Inc. 95
  • 96. Normalization to terms • An alternative to equivalence classing is to do asymmetric expansion • An example of where this may be useful – Enter: window Search: window, windows – Enter: windows Search: Windows, windows, window – Enter: Windows Search: Windows • Potentially more powerful, but less efficient 96
  • 97. Thesauri and soundex • Do we handle synonyms and homonyms? – E.g., by hand-constructed equivalence classes • car = automobile color = colour – We can rewrite to form equivalence-class terms • When the document contains automobile, index it under car-automobile (and vice-versa) – Or we can expand a query • When the query contains automobile, look under car as well • What about spelling mistakes? – One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics 97
  • 98. Lemmatization • Reduce inflectional/variant forms to base form • E.g., – am, are, is → be – car, cars, car’s, cars’ → car • the boy’s cars are different colors → the boy car be different color • Lemmatization implies doing “proper” reduction to dictionary headword form 98
  • 99. Stemming • Reduce terms to their “roots” before indexing • “Stemming” suggests crude affix chopping – language dependent – e.g., automate(s), automatic, automation all reduced to automat. for example compressed and compression are both accepted as equivalent to compress. for exampl compress and compress ar both accept as equival to compress 99
  • 100. – Affix removal • remove the longest affix: {sailing, sailor} => sail • simple and effective stemming • a widely used such stemmer is Porter’s algorithm – Dictionary-based using a look-up table • look for stem of a word in table: play + ing => play • space is required to store the (large) table, so often not practical 100
  • 101. Stemming: some issues • Detect equivalent stems: – {organize, organise}: e as the longest affix leads to {organiz, organis}, which should lead to one stem: organis – Heuristics are therefore used to deal with such cases. • Over-stemming: – {organisation, organ} reduced into org, which is incorrect – Again heuristics are used to deal with such cases. 101
  • 102. Porter’s algorithm • Commonest algorithm for stemming English – Results suggest it’s at least as good as other stemming options • Conventions + 5 phases of reductions – phases applied sequentially – each phase consists of a set of commands – sample convention: Of the rules in a compound command, select the one that applies to the longest suffix. 102
  • 103. Typical rules in Porter • sses → ss • ies → i • ational → ate • tional → tion 103
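A sketch of how such suffix rules are applied, using only the four example rules from this slide; the real Porter stemmer has five ordered phases with extra conditions that are omitted here.

    RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

    def apply_rules(word):
        # of the rules that match, select the one applying to the longest suffix
        matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
        if not matches:
            return word
        suf, rep = max(matches, key=lambda m: len(m[0]))
        return word[: -len(suf)] + rep

    for w in ("caresses", "ponies", "relational", "conditional"):
        print(w, "->", apply_rules(w))
    # caresses -> caress, ponies -> poni, relational -> relate, conditional -> condition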
  • 104. Language-specificity • The above methods embody transformations that are – Language-specific, and often – Application-specific • These are “plug-in” addenda to the indexing process • Both open source and commercial plug-ins are available for handling these 104
  • 105. Does stemming help? • English: very mixed results. Helps recall for some queries but harms precision on others – E.g., operative (dentistry) ⇒ oper • Definitely useful for Spanish, German, Finnish, … – 30% performance gains for Finnish! 105
  • 106. Others: Using a thesaurus • A thesaurus provides a standard vocabulary for indexing (and searching) • More precisely, a thesaurus provides a classified hierarchy for broadening and narrowing terms bank: 1. Finance institute 2. River edge – if a document is indexed with bank, then index it with “finance institute” or “river edge” – need to disambiguate the sense of bank in the text: e.g. if money appears in the document, then chose “finance institute” • A widely used online thesaurus: WordNet 106
  • 107. Information storage • Whole topic on its own • How do we keep fresh copies of the web manageable by a cluster of computers and are able to answer millions of queries in milliseconds – Inverted indexes – Compression – Caching – Distributed architectures – … and a lot of tricks • Inverted indexes: cornerstone data structure of IR systems – For each term t, we must store a list of all documents that contain t. – Identify each doc by a docID, a document serial number – Index construction is tricky (can’t hold all the information needed in memory) 107
  • 108. Term-document incidence matrices 108
        docs   t1  t2  t3
        D1      1   0   1
        D2      1   0   0
        D3      0   1   1
        D4      1   0   0
        D5      1   1   1
        D6      1   1   0
        D7      0   1   0
        D8      0   1   0
        D9      0   1   1
        D10     0   1   1
        Terms  D1  D2  D3  D4
        t1      1   1   0   1
        t2      0   0   1   0
        t3      1   0   1   0
  • 109. • Most basic form: – Document frequency – Term frequency – Document identifiers 109
        term  term id  df  postings (docID, tf)
        a     1        4   (1,2), (2,5), (10,1), (11,1)
        as    2        3   (1,3), (3,4), (20,1)
  • 110. • Indexes contain more information – Position in the document • Useful for “phrase queries” or “proximity queries” – Fields in which the term appears in the document – Metadata … – All that can be used for ranking 110 (1,2, [1, 1], [2,10]), … Field 1 (title), position 1
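A minimal in-memory inverted index sketch matching the layout above (document frequency plus per-document term frequency); real index construction is disk-based, sorted and compressed, which is glossed over here.

    from collections import defaultdict

    def build_index(docs):
        # docs: dict mapping docID -> list of index terms
        postings = defaultdict(dict)                 # term -> {docID: term frequency}
        for doc_id, terms in docs.items():
            for term in terms:
                postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
        # document frequency is simply the length of each postings list
        return {t: (len(p), sorted(p.items())) for t, p in postings.items()}

    index = build_index({1: ["a", "as", "a"], 2: ["a"], 3: ["as"]})
    print(index["a"])    # -> (2, [(1, 2), (2, 1)])  i.e. df and (docID, tf) pairs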
  • 111. Queries • How do we process a query? • Several kinds of queries – Boolean •Chicken AND salt • Gnome OR KDE • Salt AND NOT pepper – Phrase queries – Ranked 111
  • 112. List Merging • “Exact match” queries – Chicken AND curry – Locate Chicken in the dictionary – Fetch its postings – Locate curry in the dictionary – Fetch its postings – Merge both postings 112
  • 114. List Merging • Walk through the postings in O(x+y) time 114
        salt:   3 → 22 → 23 → 25
        pepper: 3 → 5 → 22 → 25 → 36
        result: 3 → 22 → 25
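A sketch of the O(x+y) merge on the salt/pepper postings above; postings are assumed to be sorted lists of docIDs.

    def intersect(p1, p2):
        # walk through both sorted postings lists simultaneously
        i, j, answer = 0, 0, []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([3, 22, 23, 25], [3, 5, 22, 25, 36]))   # -> [3, 22, 25]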
  • 115. 115
  • 116. Models of information retrieval • A model: – abstracts away from the real world – uses a branch of mathematics – possibly: uses a metaphor for searching 116
  • 117. Short history of IR modelling • Boolean model (±1950) • Document similarity (±1957) • Vector space model (±1970) • Probabilistic retrieval (±1976) • Language models (±1998) • Linkage-based models (±1998) • Positional models (±2004) • Fielded models (±2005) 117
  • 118. The Boolean model (±1950) • Exact matching: data retrieval (instead of information retrieval) – A term specifies a set of documents – Boolean logic to combine terms / document sets – AND, OR and NOT: intersection, union, and difference 118
  • 119. Statistical similarity between documents (±1957) • The principle of similarity "The more two representations agree in given elements and their distribution, the higher would be the probability of their representing similar information” (Luhn 1957) It is here proposed that the frequency of word [term] occurrence in an article [document ] furnishes a useful measurement of word [term] significance” 119
  • 120. Zipf’s law (figure: frequency of terms f plotted against terms by rank order r) 120
  • 121. Zipf’s law • Relative frequencies of terms. • In natural language, there are a few very frequent terms and very many very rare terms. • Zipf’s law: the ith most frequent term has frequency proportional to 1/i • cf_i ∝ 1/i = K/i, where K is a normalizing constant • cf_i is the collection frequency: the number of occurrences of the term t_i in the collection • Zipf’s law holds for different languages 121
  • 122. Zipf consequences • If the most frequent term (the) occurs cf_1 times – then the second most frequent term (of) occurs cf_1/2 times – the third most frequent term (and) occurs cf_1/3 times … • Equivalent: cf_i = K/i where K is a normalizing factor, so – log cf_i = log K - log i – Linear relationship between log cf_i and log i • Another power law relationship 122
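A quick numerical illustration of this consequence; the collection frequency of the top term is an assumed example value.

    cf_1 = 1_000_000                     # assumed frequency of the most frequent term
    for i in (1, 2, 3, 10, 100):
        print(i, round(cf_1 / i))        # Zipf: cf_i = K/i with K = cf_1
    # 1 -> 1000000, 2 -> 500000, 3 -> 333333, 10 -> 100000, 100 -> 10000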
  • 123. Zipf’s law in action 123
  • 124. Luhn’s analysis – Observation (figure: frequency of terms f and resolving power plotted against terms by rank order r, with an upper and a lower cut-off separating common terms, significant terms and rare terms) • Resolving power of significant terms: the ability of terms to discriminate document content; it peaks at the rank-order position halfway between the two cut-offs 124
  • 125. Luhn’s analysis - Implications • Common terms are not good at representing document content – partly implemented through the removal of stop words • Rare words are also not good at representing document content – usually nothing is done – Not true for every “document” • Need a means to quantify the resolving power of a term: – associate weights to index terms – tf×idf approach 125
  • 126. Ranked retrieval • Boolean queries are good for expert users with precise understanding of their needs and the collection. – Also good for applications: Applications can easily consume 1000s of results. • Not good for the majority of users. – Most users incapable of writing Boolean queries (or they are, but they think it’s too much work). – Most users don’t want to wade through 1000s of results. • This is particularly true of web search.
  • 127. Feast or Famine • Boolean queries often result in either too few (=0) or too many (1000s) results. • Query 1: “standard user dlink 650” → 200,000 hits • Query 2: “standard user dlink 650 no card found”: 0 hits • It takes a lot of skill to come up with a query that produces a manageable number of hits. – AND gives too few; OR gives too many
  • 128. Ranked retrieval models • Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query • Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language • In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa 128
  • 129. Feast or famine: not a problem in ranked retrieval • When a system produces a ranked result set, large result sets are not an issue – Indeed, the size of the result set is not an issue – We just show the top k ( ≈ 10) results – We do not overwhelm the user – Premise: the ranking algorithm works
  • 130. Scoring as the basis of ranked retrieval • We wish to return in order the documents most likely to be useful to the searcher • How can we rank-order the documents in the collection with respect to a query? • Assign a score – say in [0, 1] – to each document • This score measures how well document and query “match”.
  • 131. Query-document matching scores • We need a way of assigning a score to a query/document pair • Let’s start with a one-term query • If the query term does not occur in the document: score should be 0 • The more frequent the query term in the document, the higher the score (should be) • We will look at a number of alternatives for this.
  • 132. Bag of words model • Vector representation does not consider the ordering of words in a document • John is quicker than Mary and Mary is quicker than John have the same vectors • This is called the bag of words model.
  • 133. Term frequency tf • The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d. • We want to use tf when computing query-document match scores. But how? • Raw term frequency is not what we want: – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term. – But not 10 times more relevant. • Relevance does not increase proportionally with term frequency.
  • 134. Log-frequency weighting • The log frequency weight of term t in d is $w_{t,d} = 1 + \log_{10}(\mathrm{tf}_{t,d})$ if $\mathrm{tf}_{t,d} > 0$, and $0$ otherwise • 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. • Score for a document-query pair: sum over terms t in both q and d: $\mathrm{score}(q,d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$ • The score is 0 if none of the query terms is present in the document.
  • 135. Document frequency • Rare terms are more informative than frequent terms – Recall stop words • Consider a term in the query that is rare in the collection (e.g., arachnocentric) • A document containing this term is very likely to be relevant to the query arachnocentric • → We want a high weight for rare terms like arachnocentric.
  • 136. Document frequency, continued • Frequent terms are less informative than rare terms • Consider a query term that is frequent in the collection (e.g., high, increase, line) • A document containing such a term is more likely to be relevant than a document that does not • But it’s not a sure indicator of relevance. • → For frequent terms, we want high positive weights for words like high, increase, and line • But lower weights than for rare terms. • We will use document frequency (df) to capture this.
  • 137. idf weight • df_t is the document frequency of t: the number of documents that contain t – df_t is an inverse measure of the informativeness of t – df_t ≤ N • We define the idf (inverse document frequency) of t by $\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$ – We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
  • 138. Effect of idf on ranking • Does idf have an effect on ranking for one-term queries, like – iPhone • idf has no effect on ranking one term queries – idf affects the ranking of documents for queries with at least two terms – For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person. 138
  • 139. tf-idf weighting • The tf-idf weight of a term is the product of its tf weight and its idf weight: $w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$ • Best known weighting scheme in information retrieval – Note: the “-” in tf-idf is a hyphen, not a minus sign! – Alternative names: tf.idf, tf x idf • Increases with the number of occurrences within a document • Increases with the rarity of the term in the collection
  • 140. Score for a document given a query: $\mathrm{Score}(q,d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}$ • There are many variants – How “tf” is computed (with/without logs) – Whether the terms in the query are also weighted – … 140
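A sketch of this scoring with the weighting from the previous slides (log term frequency times idf); the toy document frequencies, the query and the collection size are assumptions, and a real engine would compute the same sums from the inverted index.

    import math
    from collections import Counter

    def tf_idf_score(query_terms, doc_terms, df, N):
        # df: term -> number of documents containing it; N: number of documents
        tf = Counter(doc_terms)
        score = 0.0
        for t in query_terms:
            if tf[t] > 0 and df.get(t, 0) > 0:
                w_tf = 1 + math.log10(tf[t])        # log-frequency weight
                w_idf = math.log10(N / df[t])       # inverse document frequency
                score += w_tf * w_idf
        return score

    df = {"cheap": 100, "hotels": 500, "bangalore": 20}
    doc = ["cheap", "hotels", "in", "bangalore", "cheap"]
    print(round(tf_idf_score(["cheap", "hotels", "bangalore"], doc, df, N=10000), 3))
    # -> 6.602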
  • 141. Documents as vectors • So we have a |V|-dimensional vector space • Terms are axes of the space • Documents are points or vectors in this space • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine • These are very sparse vectors - most entries are zero.
  • 142. Statistical similarity between documents (±1957) • Vector product – If the vector has binary components, then the product measures the number of shared terms – Vector components might be “weights” $\mathrm{score}(q,d) = \sum_{k \in \text{matching terms}} q_k \cdot d_k$
  • 143. Why distance is a bad idea The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
  • 144. Vector space model (±1970) • Documents and queries are vectors in a high-dimensional space • Geometric measures (distances, angles)
  • 145. Vector space model (±1970) • Cosine of an angle: – close to 1 if angle is small – 0 if vectors are orthogonal $\cos(d,q) = \dfrac{\sum_{k=1}^{m} d_k\, q_k}{\sqrt{\sum_{k=1}^{m} d_k^2}\,\sqrt{\sum_{k=1}^{m} q_k^2}} = \langle n(d), n(q) \rangle$, where $n(v)_k = v_k \big/ \sqrt{\sum_{i=1}^{m} v_i^2}$
  • 146. Vector space model (±1970) • PRO: Nice metaphor, easily explained; Mathematically sound: geometry; Great for relevance feedback • CON: Need term weighting (tf-idf); Hard to model structured queries
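A sketch of the cosine measure above on sparse term-weight vectors; the example weights are assumptions.

    import math

    def cosine(d, q):
        # d, q: dicts mapping term -> weight (e.g. tf-idf)
        dot = sum(w * q.get(t, 0.0) for t, w in d.items())
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    doc = {"sailing": 2.6, "greece": 1.3, "sunset": 0.7}
    query = {"sailing": 1.0, "greece": 1.0}
    print(round(cosine(doc, query), 3))   # -> 0.922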
  • 147. Probabilistic IR • An IR system has an uncertain understanding of user’s queries and makes uncertain guesses on whether a document satisfies a query or not. • Probability theory provides a principled foundation for reasoning under uncertainty. • Probabilistic models build upon this foundation to estimate how likely it is that a document is relevant for a query. 147
  • 148. Event Space • Query representation • Document representation • Relevance • Event space • Conceptually there might be pairs with same q and d, but different r • Sometimes we also include the user u, context c, etc. 148
  • 149. Probability Ranking Principle • Robertson (1977) – “If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.” • Basis for probabilistic approaches for IR 149
  • 150. Dissecting PRP • Probability of relevance • Estimated accurately • Based on whatever data available • Best possible accuracy – The perfect IR system! – Assumes relevance is independent of other documents in the collection 150
  • 151. Relevance? • What is ? – Isn’t it decided by the user? her opinion? • User doesn’t mean a human being! – We are working with representations – ... or parts of the reality available to us • 2/3 keywords, no profile, no context ... – relevance is uncertain • depends on what the system sees • may be marginalized over all the unseen context/profiles 151
  • 152. Retrieval as binary classification • For every (q,d), r takes two values – Relevant and non-relevant documents – can be extended to multiple values • Retrieve using Bayes’ decision – PRP is related to the Bayes error rate (lowest possible error rate for a class) – How do we estimate this probability? 152
  • 153. PRP ranking • How to represent the random variables? • How to estimate the model’s parameters? 153
  • 154. • d is a binary vector • Multiple Bernoulli variables • Under MB, we can decompose into a product of probabilities, with likelihoods: 154
  • 155. If the terms are not in the query: Otherwise we need estimates for them! 155
  • 156. Estimates • Assign new weights for query terms based on relevant/non-relevant documents • Give higher weights to important terms. Contingency table (N documents in total, R of them relevant; n contain t, r of those relevant) 156
                              Relevant   Non-relevant   Total
        Documents with t      r          n-r            n
        Documents without t   R-r        N-n-R+r        N-n
        Total                 R          N-R            N
  • 157. Robertson/Sparck Jones weight 157 • Combines four counts: relevant docs with t, relevant docs without t, non-relevant docs with t, non-relevant docs without t
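The weight itself does not survive in the transcript; its standard form, with 0.5 added to each count for smoothing, is usually written as

    $w_t = \log \dfrac{(r + 0.5)\,/\,(R - r + 0.5)}{(n - r + 0.5)\,/\,(N - n - R + r + 0.5)}$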
  • 158. Estimates without relevance info • If we pick a relevant document, words are equally like to be present or absent • Non-relevant can be approximated with the collection as a whole 158
  • 160. Modeling TF • Naïve estimation: separate probability for every outcome • BIR had only two parameters, now we have plenty (~many outcomes) • We can plug in a parametric estimate for the term frequencies • For instance, a Poisson mixture 160
  • 161. Okapi BM25 • Same ranking function as before but with new estimates. Models term frequencies and document length. • Words are generated by a mixture of two Poissons • Assumes an eliteness variable (elite ~ word occurs unusually frequently, non-elite ~ word occurs as expected by chance). 161
  • 162. BM25 • As a graphical model 162
  • 163. BM25 • In order to approximate the formula, Robertson and Walker came up with: • Two model parameters • Very effective • The more words in common with the query the better • Repetitions less important than different query words – But more important if the document is relatively long 163
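A minimal sketch of the resulting scoring function, assuming the two parameters are the usual k1 (term-frequency saturation) and b (length normalization) and using the no-relevance-information RSJ weight as the idf component; the dict-based representation is an assumption of this sketch, not the slides':

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    # query_terms: list of terms; doc_tf: {term: tf in this document}
    # df: {term: document frequency}; N: number of documents in the collection
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)  # smoothed, kept non-negative
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)        # document length softening
        score += idf * tf * (k1 + 1) / (tf + norm)               # saturating tf contribution
    return score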
  • 164. Generative Probabilistic Language Models • The generative approach – A generator which produces events/tokens with some probability – Probability distribution over strings of text – URN Metaphor – a bucket of different colour balls (10 red, 5 blue, 3 yellow, 2 white) • What is the probability of drawing a yellow ball? 3/20 • what is the probability of drawing (with replacement) a red ball and a white ball? ½*1/10 – IR Metaphor: Documents are urns, full of tokens (balls) of (in) different terms (colors)
  • 165. What is a language model? • How likely is a string of words in a “language”? – P1(“the cat sat on the mat”) – P2(“the mat sat on the cat”) – P3(“the cat sat en la alfombra”) – P4(“el gato se sentó en la alfombra”) • Given a model M and an observation s we want – The probability of getting s through random sampling from M – A mechanism to produce observations (strings) legal in M • The user thinks of a relevant document and then picks some keywords to use as a query 165
  • 166. Generative Probabilistic Models • What is the probability of producing the query from a document? p(q|d) • Referred to as query likelihood • Assumptions: • The probability of a document being relevant is strongly correlated with the probability of the query given the document, i.e. p(d|r) is correlated with p(q|d) • The user has a reasonable idea of the terms that are likely to appear in the “ideal” document • The user’s query terms can distinguish the “ideal” document from the rest of the corpus • The query is generated as a representative of the “ideal” document • The system’s task is to estimate, for each document in the collection, how likely it is to be the “ideal” document
  • 167. Language Models (1998/2001) • Let’s assume we point blindly, one at a time, at 3 words in a document – What is the probability that I, by accident, pointed at the words “Master”, “computer” and “Science”? – Compute that probability, and use it to rank the documents. • Words are “sampled” independently of each other – The joint probability decomposes into a product of marginals – Probabilities are estimated just by counting • Higher-order models or unigrams? – Parameter estimation can be very expensive
  • 168. Standard LM Approach • Assume that query terms are drawn identically and independently from a document
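Written out, the i.i.d. sampling assumption gives the query likelihood (the formula appears as an image in the original slide):

P(q \mid M_d) = \prod_{t \in q} P(t \mid M_d)^{tf(t,q)}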
  • 169. Estimating language models • Usually we don’t know M • Maximum Likelihood Estimate of – Simply use the number of times the query term occurs in the document divided by the total number of term occurrences. • Zero Probability (frequency) problem 169
  • 170. Document Models • Solution: Infer a language model for each document, where • Then we can estimate • Standard approach is to use the probability of a term to smooth the document model. • Interpolate the ML estimator with general language expectations
  • 171. Estimating Document Models • Basic Components – Probability of a term given a document (maximum likelihood estimate) – Probability of a term given the collection – tf(t,d) is the number of times term t occurs in document d (term frequency)
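A minimal sketch of query-likelihood scoring with this kind of interpolation (Jelinek–Mercer smoothing with a tunable λ); the dict-based representation and the choice to skip terms unseen in the whole collection are assumptions of the sketch:

import math

def lm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # doc_tf: {term: tf(t,d)}, doc_len: total term occurrences in d
    # coll_tf: {term: tf(t,C)}, coll_len: total term occurrences in the collection
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len if doc_len else 0.0   # maximum likelihood estimate
        p_coll = coll_tf.get(t, 0) / coll_len                    # collection (background) model
        p = lam * p_doc + (1 - lam) * p_coll                     # interpolated estimate
        if p > 0:
            score += math.log(p)  # sum of logs: rank-equivalent to the product, avoids underflow
    return score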
  • 172. Language Models • Implementation
  • 173. Implementation as vector product
  p(t) = \frac{df(t)}{\sum_{t'} df(t')} \qquad p(t \mid D) = \frac{tf(t,D)}{\sum_{t'} tf(t',D)}
  Recall: score(q,d) = \sum_{k \in q \cap d} tf(k,q) \cdot \log\!\left(1 + \frac{\lambda_k}{1-\lambda_k} \cdot \frac{tf(k,d)}{df(k)} \cdot \frac{\sum_{t} df(t)}{|d|}\right)
  – tf(k,d)/df(k): tf.idf of term k in document d – \lambda_k/(1-\lambda_k): odds of the probability of term importance – 1/|d|: inverse length of d – the log(1 + …) factor is summed only over matching text
  • 174. Document length normalization • Probabilistic models assume causes for documents differing in length – Scope – Verbosity • In practice, document length softens the term frequency contribution to the final score – We’ve seen it in BM25 and LMs – Usually with a tunable parameter that regulates the amount of softening – Can be a function of the deviation from the average document length – Can be incorporated into vanilla tf-idf 174
  • 175. Other models • Modeling term dependencies (positions) in the language modeling framework – Markov Random Fields • Modeling matches (occurrences of words) in different parts of a document -> fielded models – BM25F – Markov Random Fields can account for this as well 175
  • 176. More involved signals for ranking • From document understanding to query understanding • Query rewrites (gazetteers, spell correction), named entity recognition, query suggestions, query categories, query segmentation ... • Detecting query intent, triggering verticals – direct target towards answers – richer interfaces 176
  • 177. Signals for Ranking • Signals for ranking: matches of query terms in documents, query-independent quality measures, CTR, among others • Probabilistic IR models are all about counting – occurrences of terms in documents, in sets of documents, etc. • How to aggregate efficiently a large number of “different” counts – coming from the same terms – no double counts! 177
  • 178. Searching for food • New York’s greatest pizza ‣ New OR York’s OR greatest OR pizza ‣ New AND York’s AND greatest AND pizza ‣ New OR York OR great OR pizza ‣ “New York” OR “great pizza” ‣ “New York” AND “great pizza” ‣ York < New AND great OR pizza • among many more. 178
  • 179. “Refined” matching • Extract a number of virtual regions in the document that match some version of the query (operators) – Each region provides a different evidence of relevance (i.e. signal) • Aggregate the scores over the different regions • Ex.: “at least any two words in the query appear either consecutively or with an extra word between them” 179
  • 181. Remember BM25 • Term (tf) independence • Vague Prior over terms not appearing in the query • Eliteness - topical model that perturbs the word distribution • 2-poisson distribution of term frequencies over relevant and non-relevant documents 181
  • 182. Feature dependencies • Class-linearly dependent (or affine) features – add no extra evidence/signal – model overfitting (vs capacity) • Still, it is desirable to enrich the model with more involved features • Some features are surprisingly correlated • Positional information requires a large number of parameters to estimate • Potentially up to 182
  • 183. Query concept segmentation • Queries are made up of basic conceptual units, comprising many words – “Indian summer victor herbert” • Spurious matches: “san jose airport” -> “san jose city airport” • Model to detect segments based on generative language models and Wikipedia • Relax matches using factors of the max ratio between span length and segment length 183
  • 184. Virtual regions • Different parts of the document provide different evidence of relevance • Create a (finite) set of (latent) artificial regions and re-weight 184
  • 185. Implementation • An operator maps a query to a set of queries, which could match a document • Each operator has a weight • The average term frequency in a document is 185
  • 186. Remarks • Different saturation (eliteness) function? – learn the real functional shape! – log-logistic is good if the class-conditional distributions are drawn from an exp. family • Positions as variables? – kernel-like method or exp. #parameters • Apply operators on a per query or per query class basis? 186
  • 187. Operator examples • BOW: maps a raw query to the set of queries whose elements are the single terms • p-grams: set of all p-gram of consecutive terms • p-and: all conjunctions of p arbitrary terms • segments: match only the “concepts” • Enlargement: some words might sneak in between the phrases/segments 187
  • 188. How does it work in practice? 188
  • 189. ... not that far away term frequency link information query intent information editorial information click-through information geographical information language information user preferences document length document fields other gazillion sources of information 189
  • 190. Dictionaries • Fast look-up – Might need specific structures to scale up • Hash tables • Trees – Tolerant retrieval (prefixes) – Spell checking • Document correction (OCR) • Query misspellings (did you mean … ?) • (Weighted) edit distance – dynamic programming • Jaccard overlap (index character k-grams) • Context sensitive • http://norvig.com/spell-correct.html – Wild-card queries • Permuterm index • K-gram indexes 190
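A minimal dynamic-programming sketch of the (unweighted) edit distance mentioned above; a weighted variant would replace the unit costs:

def edit_distance(a, b):
    # Levenshtein distance between strings a and b, computed row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

# edit_distance("kitten", "sitting") == 3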
  • 191. Hardware basics • Access to data in memory is much faster than access to data on disk. • Disk seeks: No data is transferred from disk while the disk head is being positioned. • Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks. • Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). • Block sizes: 8KB to 256 KB. 191
  • 192. Hardware basics • Many design decisions in information retrieval are based on the characteristics of hardware • Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB. • Available disk space is several (2-3) orders of magnitude larger. • Fault tolerance is very expensive: It is much cheaper to use many regular machines rather than one fault tolerant machine. 192
  • 193. Data flow [diagram: input splits are assigned by a master to parsers (map phase), which write (term, docID) pairs into segment files partitioned a-f, g-p, q-z; inverters (reduce phase) turn each partition into postings] 193
  • 194. MapReduce • The index construction algorithm we just described is an instance of MapReduce. • MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing … • … without having to write code for the distribution part. • They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce. • Open source implementation Hadoop – Widely used throughout industry 194
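A toy, single-process sketch of the two phases just described (a real MapReduce/Hadoop job would distribute the map and reduce tasks across machines; all names here are illustrative):

from collections import defaultdict

def map_phase(doc_id, text):
    # parser step: emit (term, doc_id) pairs for one document
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    # inverter step: collect sorted, de-duplicated postings per term
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return postings

docs = {1: "new york pizza", 2: "cheap hotels in bangalore", 3: "new pizza places"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)   # e.g. index["new"] == [1, 3]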
  • 195. MapReduce • Index construction was just one phase. • Another phase: transforming a term-partitioned index into a document-partitioned index. – Term-partitioned: one machine handles a subrange of terms – Document-partitioned: one machine handles a subrange of documents • Most search engines use a document-partitioned index for better load balancing, etc. 195
  • 196. Distributed IR • Basic process – All queries sent to a director machine – Director then sends messages to many index servers • Each index server does some portion of the query processing – Director organizes the results and returns them to the user • Two main approaches – Document distribution • by far the most popular – Term distribution 196
  • 197. Distributed IR (II) • Document distribution – each index server acts as a search engine for a small fraction of the total collection – director sends a copy of the query to each of the index servers, each of which returns the top k results – results are merged into a single ranked list by the director • Collection statistics should be shared for effective ranking 197
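A minimal sketch of the director's merge step under document distribution, assuming each index server returns its local top-k as (score, docID) pairs; in practice the director would also rely on shared collection statistics so that the shard scores are comparable:

import heapq

def merge_topk(shard_results, k):
    # shard_results: list of lists of (score, doc_id), one list per index server
    all_hits = [hit for shard in shard_results for hit in shard]
    return heapq.nlargest(k, all_hits)  # final ranked list of length <= k

shard_a = [(3.2, "d7"), (2.9, "d1")]
shard_b = [(3.5, "d9"), (1.1, "d4")]
print(merge_topk([shard_a, shard_b], 3))  # [(3.5, 'd9'), (3.2, 'd7'), (2.9, 'd1')]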
  • 198. Caching • Query distributions are similar to Zipf • About half of each day’s queries are unique, but some are very popular – Caching can significantly improve efficiency • Cache popular query results • Cache common inverted lists – Inverted list caching can help with unique queries – The cache must be refreshed to prevent stale data 198
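A minimal sketch of a query-result cache with LRU eviction; the capacity and eviction policy are illustrative choices, and a production cache would also expire entries to address the stale-data problem noted above:

from collections import OrderedDict

class QueryResultCache:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.cache = OrderedDict()  # query string -> cached result list

    def get(self, query):
        if query not in self.cache:
            return None
        self.cache.move_to_end(query)       # mark as most recently used
        return self.cache[query]

    def put(self, query, results):
        self.cache[query] = results
        self.cache.move_to_end(query)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry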
  • 199. Others • Efficiency (compression, storage, caching, distribution) • Novelty and diversity • Evaluation • Relevance feedback • Learning to rank • User models – Context, personalization • Sponsored Search • Temporal aspects • Social aspects 199
  • 200. 200

Editor's Notes

  1. Not only the data is different, also the queries, and the results we get from it!
  2. To the surprise of many, the search box has become the preferred method of information access. Customers ask: Why can’t I search my database in the same way?
  3. Archie is a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine. In the summer of 1993, no search engine existed for the web, just catalogs. One of the first "all text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.
  4. In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine on Netscape's web browser. There was so much interest that instead Netscape struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.[7][8] Google adopted the idea of selling search terms in 1998, from a small search engine company named goto.com. This move had a significant effect on the SE business, which went from struggling to one of the most profitable businesses in the internet.[6]
  5. Aardvark was a social search service that connected users live with friends or friends-of-friends who were able to answer their questions, also known as a knowledge market. Bought by Google in 2010. Kaltix Corp., commonly known as Kaltix, is a personalized search engine company founded at Stanford University in June 2003 by Sep Kamvar, Taher Haveliwala and Glen Jeh.[1][2] It was acquired by Google in September 2003.
  6. How do we communicate with search engines
  7. Information needs must be expressed as a query – But users often don’t know what they want. ASK hypothesis: Belkin et al. (1982) proposed a model called Anomalous State of Knowledge (ASK): – it is difficult for people to define exactly what their information need is, because that information is a gap in their knowledge – search engines should look for information that fills those gaps. Interesting ideas, little practical impact (yet)
  8. Under-specified. Ambiguous. Context sensitive. Queries represent different types of search – e.g. decision making – background search – fact search
  9. Need to have fairly deep knowledge... –  What sites are possible –  What’s in a given site (what’s likely to be there) –  Authority of source / site –  Index structure (time, place, person, ...) what kinds of searches? –  How to read a SERP critically
  10. Commonplace book
  11. Start with the simplest search you can think of: [ upper lip indentation ] If it’s not right, you can always modify it. • When I did this, I clicked on the first result, which took me to Yahoo Answers. There’s a nice article there about something called the philtrum.
  12. Ghost town vs abandoned 1750 Search for images with creative commons attributions
  13. The need is verbalized mentally
  14. Queries and documents must share a (at least comparable if not the same) representation
  15. SCC – single connected component IN – pages not discovered yet OUT – sites that contain only in-host link Tendrils – can’t reach or be reached from the SCC
  16. creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/..... dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow. pages filled with a large number of characters, crashing the lexical analyzer parsing the page. pages with session-id's based on required cookies.
  17. Data: ; this type of data is conventionally dealt with a database management system. Structure: With this view, documents are not treated as flat entities, so a document and its components (e.g. sections) can be retrieved
  18. How do we arrive to the content representation of a document?
  19. Nontrivial issues. Requires some design decisions.
  20. Nontrivial issues. Requires some design decisions. Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document. But why stop there? We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find. The problems with large document units can be alleviated by use of explicit or implicit proximity search
  21. A simple strategy is to just split on all non-alphanumeric characters – bad: you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer. Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles) but also with borrowed foreign phrases (au fait)
  22. Index numbers -> (One answer is using n-grams: IIR ch. 3)
  23. Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words
  24. No unique tokenization + completely different interpretation of a sequence depending on where you split
  25. Nevertheless: “Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results.” (Though you can explicitly ask for them to remain.)
  26. Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either. The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance: the terms that happen to become identical as the result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
  27. An alternative to creating equivalence classes is to maintain relations between not normalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile, a topic we discuss further in
  28. Too much equivalence class
  29. Why not the reverse?
  30. Also stemmers based on N-grams. For example, trigrams: information => {inf, nfo, for, etc}
  31. caresses, parties; separational -> separate; factional -> faction
  32. Compression Cache pressure
  33. The distribution of term frequencies is similar for different texts of significant large size.
  34. Heaps’ law gives the vocabulary size in collections.
  35. Positional indexes are helpful, but we’ll ignore them for now
  36. (Salton & McGill 1983)
  37. The classifier that assigns a vector x to the class with the highest posterior is called the Bayes classifier. The error associated with this classifier is called the Bayes error. This is the lowest possible error rate for any classifier over the distribution of all examples and for a chosen hypothesis space
  38. A complete probability distribution over documents − defines a likelihood for any possible document d (observation) − P(relevant) via P(document): P(R|d) ∝ P(d|R) P(R) − can “generate” synthetic documents that will share some properties of the original collection. Not all IR models do this – it is possible to estimate p(R|d) directly, e.g. with logistic regression. Assumptions: one relevance value for every word w; words are conditionally independent given R – false, but it lowers the number of parameters; all absent words are equally likely to be observed in the relevant and non-relevant classes
  39. One relevance status value per word empty document (all words absent) is equally likely to be observed in relevant and non-relevant classes (provides a natural zero) - practical reason, only score terms that appear in the query (TAT)
  40. Doesn’t model word dependence. Doesn’t account for document length. Doesn’t model word frequencies
  41. Now D_t = d_t account for the number of times we observe the term in the document (we have a vector of frequencies)
  42. Can be seen as probabilistic automata. They originate from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980's (see e.g. Rabiner 1990). Automatic speech recognition systems combine probabilities of two distinct models: the acoustic model and the language model. The acoustic model might for instance produce the following candidate texts in decreasing order of probability: “food born thing”, “good corn sing”, “mood morning”, and “good morning”. Now, the language model would determine that the phrase “good morning” is much more probable, i.e., it occurs more frequently in English than the other phrases. When combined with the acoustic model, the system is able to decide that “good morning” was the most likely utterance, thereby increasing the system's performance. For information retrieval, language models are built for each document. By following this approach, the language model of the book you are reading now would assign an exceptionally high probability to the word “retrieval”, indicating that this book would be a good candidate for retrieval if the query contains this word.
  43. For some applications we want all this highly probable P3 In IR P1=P2
  44. Veto terms. Original: multiple Bernoulli; the multinomial is widely used now – accounts for multiple word occurrences in the query (primitive) – well understood: lots of research in related fields (and now in IR) – possibility for integration with ASR/MT/NLP (same event space)
  45. Discounting methods. Problem with all discounting methods: – discounting treats unseen words equally (add or subtract ε) – some words are more frequent than others. Essentially, the data model and retrieval function are one and the same
  46. Different ways of smoothing; Dirichlet prior smoothing is particularly popular