Huygens colloquium at Radboud University Science Faculty.
Effective web search engines (and the commercial success of a few internet giants) depend upon the data collected from the online seeking behaviour of huge numbers of users. Put differently, the high-quality search results we take for granted every day come at the price of reduced privacy.
A personal search engine would not only search the web, but also rich personal data including email, browsing history, documents read and the contents of the user’s home directory. Research on so-called "slow search" indicates that the user experience can be improved significantly when the search engine gains access to additional data. However, are we prepared to give up even more of our privacy, and eventually to surrender control over all that personal information?
My proposal is to mitigate these concerns by developing a new architecture for web search, in which users control the trade-off between search result quality and the privacy risk inherent in sharing usage logs. Under this design, all data of the “personal search engine” (PSE), both web data and usage data, resides in its owner’s personal digital infrastructure.
Two challenges need to be overcome to turn this into a viable alternative. Can we compensate for the loss of information about searches of large numbers of users? And, can we maintain an up-to-date index in a cost-effective manner? As a solution, I propose to organise personal search engines in a decentralised social network. This serves two goals: the index can be kept up-to-date collaboratively, and usage data may be traded with peers.
3. “Computational Relevance”
“Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model.”
Van Rijsbergen, 1976
Retrieval Model
4. Probabilistic Ranking Principle
“Provides a theoretical justification for why documents should be ranked by the probability of relevance”
Stephen Robertson, 1977
5. IR Solved?
“Provides a theoretical justification for why documents should be ranked by the probability of relevance”
Stephen Robertson, 1977
PRP assumes (unreasonably?) independence between results and 1/0 loss (or Boolean relevance assessments)
PRP does not state how the probability of relevance should be estimated
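A compact way to state the principle (in my notation, not Robertson’s original formulation): under 1/0 loss, retrieving document $d$ for query $q$ incurs expected loss
\[
\mathbb{E}[\ell(d)] = 0 \cdot P(R{=}1 \mid d, q) + 1 \cdot P(R{=}0 \mid d, q) = 1 - P(R{=}1 \mid d, q),
\]
so presenting documents in decreasing order of $P(R{=}1 \mid d, q)$ minimises the expected loss at every rank cutoff. Note that this argument only goes through under exactly the independence and 1/0-loss assumptions questioned above.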
6. Why Search Remains Difficult to Get Right
Heterogeneous data sources
- WWW, Wikipedia, news, email, patents, Twitter, personal information, …
Varying result types
- “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
Multiple dimensions of relevance
- Topicality, recency, reading level, …
Actual information needs often require a mix within and across dimensions. E.g., “recent news and patents from our top competitors”
7. System’s internal information representation
- Linguistic annotations
- Named entities, sentiment, dependencies, …
- Knowledge resources
- Wikipedia, Freebase, ICD-9, IPTC, …
- Links to related documents
- Citations, URLs
Anchors that describe the URI
- Anchor text
Queries that lead to clicks on the URI
- Session, user, dwell-time, …
Tweets that mention the URI
- Time, location, user, …
Other social media that describe the URI
- User, rating
- Tags, organisation of the ‘folksonomy’
+ UNCERTAINTY ALL OVER!
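To make the shape of such a representation concrete, a minimal sketch in Python; the class and field names are my illustrative assumptions, not an existing system’s schema. Every usage-derived evidence item carries a confidence score, reflecting that uncertainty pervades each source.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DocumentRepresentation:
    """Hypothetical multi-evidence representation of one URI."""
    uri: str
    # Content-derived evidence (linguistic annotations)
    entities: list[str] = field(default_factory=list)       # named entities
    sentiment: Optional[float] = None                       # document-level sentiment
    # Link-derived evidence
    anchors: list[str] = field(default_factory=list)        # anchor texts pointing here
    citations: list[str] = field(default_factory=list)      # citing documents / URLs
    # Usage-derived evidence: (value, confidence in [0, 1])
    click_queries: list[tuple[str, float]] = field(default_factory=list)
    tweets: list[tuple[str, float]] = field(default_factory=list)
    tags: list[tuple[str, float]] = field(default_factory=list)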
8. Learning to Rank (LTOR)
IR as a machine learning problem
Learn the matching function from observations
- E.g., pairwise: a clicked document ranked below a non-clicked result should trigger a swap of their positions (sketched below)
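A minimal sketch of that pairwise idea, assuming made-up feature vectors and a toy click log; this illustrates the principle, not any particular learning-to-rank algorithm.

import numpy as np

def preference_pairs(ranking, clicked):
    """Yield (preferred, other): a clicked document ranked below a
    skipped one should have been ranked above it."""
    for i, doc in enumerate(ranking):
        if doc in clicked:
            for skipped in ranking[:i]:
                if skipped not in clicked:
                    yield doc, skipped

def pairwise_update(w, features, pairs, lr=0.1):
    """Perceptron-style update: nudge w so preferred docs score higher."""
    for preferred, other in pairs:
        diff = features[preferred] - features[other]
        if w @ diff <= 0:          # model currently ranks this pair wrongly
            w += lr * diff
    return w

features = {"d1": np.array([0.2, 0.9]), "d2": np.array([0.8, 0.1])}
w = np.zeros(2)
w = pairwise_update(w, features, preference_pairs(["d2", "d1"], {"d1"}))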
Detect and classify named entities (NEs)
Rank search results
Predict query intent
Search suggestions
12. WWW
The Web has become ever more centralised
+ Cloud services – good value-for-money/value-for-effort
Mobile only makes things worse
“There is an app for that”
13. Without the log data, web search isn’t as good
This also hinders retrieval experiments in academia!
- Reproducibility vs. Representativeness of research results?
Samar, T., Bellogín, A. & de Vries, A.P. Inf Retrieval J (2016) 19: 230.
doi: 10.1007/s10791-015-9276-9
19. Realistic?
Clueweb 2012: 80TB
Recent CommonCrawl: 150TB
Average web page takes up 320 KB
- Large sample collected with Googlebot, May 26th, 2010
- Reported 4.2B pages (would require ~1.3 Petabyte)
De Kunder & Van den Bosch estimate an upper bound of ~50B pages
- http://www.worldwidewebsize.com/
Also considering continuing growth (claimed in unpublished work by colleagues)
- Andrew Trotman, Jinglan Zhang. Future Web Growth and its Consequences for Web Search Architectures. https://arxiv.org/abs/1307.1179
https://web-beta.archive.org/web/20100628055041/http://code.google.com/speed/articles/web-metrics.html
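Back-of-the-envelope, using the 320 KB average page size:
\[
4.2 \times 10^{9} \times 320\,\text{KB} \approx 1.34 \times 10^{15}\,\text{B} \approx 1.3\,\text{PB},
\qquad
50 \times 10^{9} \times 320\,\text{KB} \approx 16\,\text{PB},
\]
so even the lower estimate is an order of magnitude beyond the 80TB Clueweb 2012 crawl, and the upper bound roughly two hundred times its size.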
20. Two Problems
How to get the web data on the personal search engine?
How to replace the lack of usage data from many?
21. Getting the Data
Idea:
- Organise the web crawl in topically related bundles
- Apply BitTorrent-like decentralisation to share & update bundles
Use techniques inspired by query obfuscation to hide the real user’s interests when downloading bundles
Web Archives to the rescue?
- Web Archive to play a role as “super-peer”
See also WebRTC-based in-browser implementations:
- WebTorrent: https://webtorrent.io/
- CacheP2P: http://www.cachep2p.com/
And http://academictorrents.com/ shares 16TB of research data, including Clueweb 2009 and 2012
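To illustrate the bundle idea, a minimal Python sketch of a BitTorrent-style manifest for one topical bundle; the manifest format and names are assumptions for illustration. Query obfuscation would sit on top: a peer requests decoy bundles alongside the ones it actually wants.

import hashlib

PIECE_SIZE = 1 << 20  # 1 MiB pieces

def make_manifest(bundle_bytes: bytes, topic: str) -> dict:
    """Split a topical crawl bundle into pieces and record their hashes,
    so peers can verify every piece they download independently."""
    pieces = [bundle_bytes[i:i + PIECE_SIZE]
              for i in range(0, len(bundle_bytes), PIECE_SIZE)]
    return {
        "topic": topic,
        "length": len(bundle_bytes),
        "piece_hashes": [hashlib.sha256(p).hexdigest() for p in pieces],
    }

def verify_piece(manifest: dict, index: int, piece: bytes) -> bool:
    """Accept a piece from an untrusted peer only if its hash matches."""
    return hashlib.sha256(piece).hexdigest() == manifest["piece_hashes"][index]

manifest = make_manifest(b"crawled pages, packed" * 100_000, topic="cycling")
assert verify_piece(manifest, 0, (b"crawled pages, packed" * 100_000)[:PIECE_SIZE])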
22. “… communication and media limitations, due to the distance between Earth and Mars, resulting in time delays: they will have to request the movies or news broadcasts they want to see in advance.
[…]
Easy Internet access will be limited to their preferred sites that are constantly updated on the local Mars web server. Other websites will take between 6 and 45 minutes to appear on their screen - first 3-22 minutes for your click to reach Earth, and then another 3-22 minutes for the website data to reach Mars.”
http://www.mars-one.com/faq/mission-to-mars/what-will-the-astronauts-do-on-mars
24. “Searching from Mars”
Tradeoff between “effort” (waiting for responses from Earth) and “data transfer” (pre-fetching or caching data on Mars).
Related work:
- Jimmy Lin, Charles L. A. Clarke, and Gaurav Baruah. Searching from Mars. Internet Computing, 20(1):77-82, 2016. http://dx.doi.org/10.1109/MIC.2016.2
- Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest. Total Recall: Blue Sky on Mars. ICTIR '16. http://dx.doi.org/10.1145/2970398.2970430
- Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest. Ten Blue Links on Mars. https://arxiv.org/abs/1610.06468
25. Pre-fetching & Caching
Hide latencies of getting the data from the live web
- Pre-fetch pages linked from initial query results page
- Pre-fetch additional related pages
- Expand pre-fetches with results from query suggestions
Cache web data to avoid accessing the live web (see the sketch below)
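A minimal sketch of this pre-fetch-and-cache loop; the toy link graph and the fetch stub are assumptions standing in for a real crawler and the Earth-Mars round trip.

# Toy stand-ins: a link graph and a "slow" fetch over the Earth-Mars link.
links = {"earth.example/a": ["earth.example/b"]}
cache: dict[str, str] = {}

def fetch_from_earth(url: str) -> str:
    return f"<contents of {url}>"   # stands in for a 6-45 minute round trip

def get(url: str) -> str:
    if url not in cache:            # only pay the round trip on a cache miss
        cache[url] = fetch_from_earth(url)
    return cache[url]

def prefetch(result_urls, suggestion_urls):
    for url in result_urls:
        get(url)
        for linked in links.get(url, []):   # pre-fetch pages linked from results
            get(linked)
    for url in suggestion_urls:             # expand with query-suggestion results
        get(url)

prefetch(["earth.example/a"], ["earth.example/c"])
assert "earth.example/b" in cache           # the linked page now answers locally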
26. Two Problems
How to get the web data on the personal search engine?
How to replace the lack of usage data from many?
27. Truly personal search?
Safely gain access to rich personal data including email, browsing history, documents read and the contents of the user’s home directory
Can high-quality evidence about an individual’s recurring long-term interests replace the shallow information of many?
29. Better Search – “Deep Personalization”
“Even more broadly than trying to get people the right content based on their context, we as a community need to be thinking about how to support people through the entire search experience.”
Jaime Teevan on “Slow Search”
Search as a dialogue
My first journal paper:
De Vries, Van der Veer and Blanken: Let’s talk about it: dialogues with multimedia databases (1998)
30. Alternatives for Log Data?
Social annotations
- E.g., bit.ly shortened URLs
- Still requires access to an API conveying the query representation
- E.g., anchor text
- E.g., “twanchor text” – tweets providing context to a URL
31. Anchor Text & Timestamps
Anchor text exhibits characteristics similar to user queries and document titles [Eiron & McCurley, Jin et al.]
Anchor text with timestamps can be used to capture & trace entity evolution [Kanhabua and Nejdl]
Anchor text with timestamps lets us reconstruct (past) topic popularity [Samar et al.]
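A small sketch of that last point, with made-up (anchor text, target, year) triples: anchor texts accumulate into a query-like surrogate document per URI, and counting them per time window recovers topic popularity over time.

from collections import Counter, defaultdict

# Made-up timestamped anchors: (anchor_text, target_uri, year)
anchors = [
    ("obama inauguration", "ex.org/obama", 2009),
    ("president obama", "ex.org/obama", 2012),
    ("president obama", "ex.org/obama", 2012),
]

surrogate = defaultdict(list)       # anchor texts as a pseudo-document per URI
popularity = defaultdict(Counter)   # mentions per URI per year

for text, uri, year in anchors:
    surrogate[uri].append(text)
    popularity[uri][year] += 1

print(popularity["ex.org/obama"])   # Counter({2012: 2, 2009: 1})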
34. Open challenges
How to select the part of your log data you are willing to
share?
How to estimate the value of this log data?
35. Blueprint of the Personal Search Engine
Decentralize search
Web archives to the rescue
- Super-peers in a P2P network of personal search engines
“Deep personalization”
- Exploit the rich source data that can safely be processed locally
A sharing economy:
- Data markets to trade log data and mutually improve your search results