The personal search engine
Prof.dr.ir. Arjen P. de Vries
arjen@acm.org
Nijmegen, November 7th, 2016
Disclaimer:
“Computational Relevance”
“Intellectually it is possible for a human to establish the
relevance of a document to a query. For a computer to do
this we need to construct a model within which
relevance decisions can be quantified. It is interesting to
note that most research in information retrieval can be
shown to have been concerned with different aspects of
such a model.”
Van Rijsbergen, 1976
Retrieval Model
Probabilistic Ranking Principle
 “Provides a theoretical justification for why documents
should be ranked by the probability of relevance”
Stephen Robertson, 1977
IR Solved?
 “Provides a theoretical justification for why documents
should be ranked by the probability of relevance”
Stephen Robertson, 1977
 PRP assumes (unreasonably?) independence between
results and 1/0 loss (or Boolean relevance assessments)
 PRP does not state how the probability of relevance should
be estimated
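A minimal sketch of what the principle prescribes, assuming some estimator has already produced a probability of relevance per document. The function names and the scores are made up for illustration; the PRP itself does not say where the estimates come from. Under 1/0 loss and independent results, sorting by that probability maximizes the expected number of relevant documents in every top-k prefix.

```python
def rank_by_prp(prob_relevant):
    """Return document ids sorted by decreasing estimated P(relevant | doc, query)."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)


def expected_relevant_at_k(ranking, prob_relevant, k):
    """Expected number of relevant documents in the top k, assuming independence."""
    return sum(prob_relevant[doc] for doc in ranking[:k])


# Hypothetical estimates for four documents and one query.
estimates = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}
ranking = rank_by_prp(estimates)
print(ranking)
print(expected_relevant_at_k(ranking, estimates, k=2))
```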
Why Search Remains Difficult to Get Right
 Heterogeneous data sources
- WWW, wikipedia, news, e-mail, patents, twitter, personal
information, …
 Varying result types
- “Documents”, tweets, courses, people, experts, gene
expressions, temperatures, …
 Multiple dimensions of relevance
- Topicality, recency, reading level, …
Actual information needs often require a mix within
and across dimensions. E.g., “recent news and
patents from our top competitors”
 System’s internal information representation
- Linguistic annotations
- Named entities, sentiment, dependencies, …
- Knowledge resources
- Wikipedia, Freebase, ICD9, IPTC, …
- Links to related documents
- Citations, urls
 Anchors that describe the URI
- Anchor text
 Queries that lead to clicks on the URI
- Session, user, dwell-time, …
 Tweets that mention the URI
- Time, location, user, …
 Other social media that describe the URI
- User, rating
- Tag, organisation of ‘folksonomy’
+ UNCERTAINTY ALL OVER!
Learning to Rank (LTOR)
 IR as a machine learning problem
 Learn the matching function from observations
- E.g., pairwise – clicked document below retrieved document
should trigger a swap of their positions
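The pairwise idea can be sketched with a perceptron-style update on a linear scoring function. This is one simple way to learn from such preferences, not necessarily the method the talk has in mind; the two features, the learning rate and all names are assumptions for illustration.

```python
def score(weights, features):
    """Linear matching function: dot product of weights and document features."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_update(weights, clicked, skipped, learning_rate=0.1):
    """If the clicked document does not outscore the skipped one, nudge the weights."""
    if score(weights, clicked) > score(weights, skipped):
        return list(weights)  # order already agrees with the observation
    return [w + learning_rate * (c - s) for w, c, s in zip(weights, clicked, skipped)]


def train(weights, preferences, epochs=20, learning_rate=0.1):
    """Repeatedly apply the pairwise update over all observed (clicked, skipped) pairs."""
    for _ in range(epochs):
        for clicked, skipped in preferences:
            weights = pairwise_update(weights, clicked, skipped, learning_rate)
    return weights


# Hypothetical features: (text match score, freshness).
clicked_doc = [0.4, 0.9]   # shown lower in the ranking, but clicked
skipped_doc = [0.8, 0.1]   # shown higher, but skipped
initial = [1.0, 0.0]       # starts out ranking on text match only
learned = train(initial, [(clicked_doc, skipped_doc)])
print(score(learned, clicked_doc) > score(learned, skipped_doc))
```

With the learned weights the clicked document now scores above the skipped one, which is the "swap" the slide describes.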
Detect and classify NEs
Rank search results
Predict query intent
Search suggestions
Spelling correction
Predict query intent
Rank Verticals
Search suggestions
Robert Johnson (1911-38): “Early this morning
when you knocked upon my door / And I said, ‘Hello,
Satan, I believe it’s time to go.’”
https://youtu.be/3MCHI23FTP8
WWW
 The Web has become ever more centralized
+ Cloud services – good value-for-money/value-for-effort
 Mobile only makes things worse
“There is an app for that”
Without the log data, web search isn’t as good
 This also hinders retrieval experiments in academia!
- Reproducibility vs. Representativeness of research results?
Samar, T., Bellogín, A. & de Vries, A.P. Inf Retrieval J (2016) 19: 230.
doi: 10.1007/s10791-015-9276-9
Decentralize Web Search?
Personal search engine!
http://www.mkomo.com/cost-per-gigabyte-update
WESTERN DIGITAL DEMONSTRATES PROTOTYPE OF THE WORLD’S FIRST
1 TERABYTE SDXC CARD AT PHOTOKINA 2016
SEP 20, 2016
Realistic?
 Clueweb 2012: 80TB
Recent CommonCrawl: 150TB
 Average web page takes up 320 KB
- Large sample collected with Googlebot, May 26th, 2010
- Reported 4.2B pages (would require ~1.3 Petabyte)
 De Kunder & Van den Bosch estimate an upper bound of ~50B pages
- http://www.worldwidewebsize.com/
 Also considering continuing growth (claimed in unpublished work by colleagues)
- Andrew Trotman, Jinglan Zhang, Future Web Growth and its Consequences for Web
Search Architectures. https://arxiv.org/abs/1307.1179
https://web-beta.archive.org/web/20100628055041/http://code.google.com/speed/articles/web-metrics.html
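The back-of-envelope figures on this slide can be checked in a few lines. The sketch assumes decimal units (1 KB = 1000 bytes) and applies the 320 KB average page size to the ~50B upper bound as well, which the slide does not do explicitly.

```python
KB = 10**3   # decimal units, an assumption
TB = 10**12
PB = 10**15

AVG_PAGE_BYTES = 320 * KB  # average page size from the 2010 Googlebot sample


def crawl_size_bytes(num_pages, avg_page_bytes=AVG_PAGE_BYTES):
    """Storage needed for num_pages pages of the given average size."""
    return num_pages * avg_page_bytes


reported = crawl_size_bytes(4_200_000_000)      # 4.2B pages in the sample
upper_bound = crawl_size_bytes(50_000_000_000)  # ~50B pages upper bound

print(f"4.2B pages: {reported / PB:.2f} PB")
print(f"50B pages:  {upper_bound / PB:.1f} PB")
print(f"1 TB cards for 4.2B pages: {reported // TB}")
```

The first figure reproduces the ~1.3 Petabyte on the slide; the second shows the upper bound is roughly an order of magnitude larger.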
Two Problems
 How to get the web data on the personal search engine?
 How to compensate for the lack of usage data from many users?
Getting the Data
 Idea:
- Organize the web crawl in topically related bundles
- Apply BitTorrent-like decentralization to share & update bundles
 Use techniques inspired by query obfuscation to hide the
real user’s interests when downloading bundles
 Web Archives to the rescue?
- Web Archive to play a role as “super-peer”
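One very simple way to picture the obfuscation idea: request the bundles you actually want together with randomly chosen decoy bundles, in shuffled order. This is only an illustrative sketch; the bundle names, the `obfuscated_requests` helper and the uniform choice of decoys are assumptions, and uniformly random decoys may still leak interests over repeated downloads.

```python
import random


def obfuscated_requests(wanted, catalogue, decoys_per_bundle=3, rng=None):
    """Mix the wanted bundles with random decoys so a peer cannot tell which is which."""
    rng = rng or random.Random()
    others = [bundle for bundle in catalogue if bundle not in wanted]
    n_decoys = min(len(others), decoys_per_bundle * len(wanted))
    requests = list(wanted) + rng.sample(others, n_decoys)
    rng.shuffle(requests)  # do not reveal the real interest through request order
    return requests


# Hypothetical topical bundles of a shared web crawl.
catalogue = ["cycling", "jazz", "gardening", "chess",
             "astronomy", "cooking", "sailing", "knitting"]
requests = obfuscated_requests({"jazz"}, catalogue, rng=random.Random(42))
print(requests)
```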
See also WebRTC based in-browser implementations:
 Webtorrent: https://webtorrent.io/
 CacheP2P: http://www.cachep2p.com/
And http://academictorrents.com/ shares 16 TB of
research data, including Clueweb 2009 and 2012
“… communication and media
limitations, due to the distance
between Earth and Mars,
resulting in time delays: they will
have to request the movies or
news broadcasts they want to
see in advance.
[…]
Easy Internet access will be
limited to their preferred sites
that are constantly updated on
the local Mars web server. Other
websites will take between 6 and
45 minutes to appear on their
screen - first 3-22 minutes for
your click to reach Earth, and
then another 3-22 minutes for
the website data to reach Mars.”
http://www.mars-one.com/faq/mission-to-mars/what-will-the-astronauts-do-on-mars
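The 3-22 minute figure follows from the speed of light and the Earth-Mars distance. The distances below are approximate round numbers for closest and farthest separation and are an assumption, not taken from the quoted page.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458

# Approximate Earth-Mars distances in km (assumed round figures).
CLOSEST_KM = 55e6
FARTHEST_KM = 400e6


def one_way_delay_minutes(distance_km):
    """Time for a signal to cover distance_km at the speed of light, in minutes."""
    return distance_km / SPEED_OF_LIGHT_KM_S / 60


for label, distance in (("closest", CLOSEST_KM), ("farthest", FARTHEST_KM)):
    one_way = one_way_delay_minutes(distance)
    print(f"{label}: one-way {one_way:.1f} min, round trip {2 * one_way:.1f} min")
```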
Analogy
 Web Archive ~ Earth
 Personal search engine (@ people’s homes) ~ Mars
“Searching from Mars”
 Tradeoff between “effort” (waiting for responses from Earth) and “data
transfer” (pre-fetching or caching data on Mars).
 Related work:
- Jimmy Lin, Charles L. A. Clarke, and Gaurav Baruah. Searching from Mars. Internet
Computing, 20(1):77-82, 2016. http://dx.doi.org/10.1109/MIC.2016.2
- Charles L.A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest.
Total Recall: Blue Sky on Mars. ICTIR '16. http://dx.doi.org/10.1145/2970398.2970430
- Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, Adam Roegiest.
Ten Blue Links on Mars. https://arxiv.org/abs/1610.06468
Pre-fetching & Caching
 Hide latencies of getting the data from the live web
- Pre-fetch pages linked from initial query results page
- Pre-fetch additional related pages
- Pre-fetches expanded with those from query suggestions
 Cache web data to avoid accessing the live web
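A minimal sketch of the two mechanisms combined, assuming a slow `fetch` function that stands in for access to the live web. The class name, the least-recently-used eviction policy and the toy "web" are all assumptions for illustration.

```python
from collections import OrderedDict


class PrefetchingCache:
    """Local page store that counts how often the live web had to be contacted."""

    def __init__(self, fetch, capacity=100):
        self.fetch = fetch            # slow call to the live web
        self.capacity = capacity
        self.pages = OrderedDict()    # url -> content, least recently used first
        self.live_fetches = 0

    def _store(self, url):
        page = self.fetch(url)
        self.live_fetches += 1
        self.pages[url] = page
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return page

    def get(self, url):
        """Serve from the cache when possible, otherwise go to the live web."""
        if url in self.pages:
            self.pages.move_to_end(url)
            return self.pages[url]
        return self._store(url)

    def prefetch(self, urls):
        """Fetch pages ahead of time, e.g. those linked from a results page."""
        for url in urls:
            if url not in self.pages:
                self._store(url)


# Toy stand-in for the live web.
FAKE_WEB = {"serp": "results for 'mars'", "a": "page a", "b": "page b"}
cache = PrefetchingCache(fetch=FAKE_WEB.__getitem__, capacity=10)
cache.get("serp")
cache.prefetch(["a", "b"])     # links found on the results page
before = cache.live_fetches
page = cache.get("a")          # served locally, no further live access
print(page, cache.live_fetches - before)
```

The latency is paid at prefetch time, in the background, so the later click on a result costs no round trip.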
Two Problems
 How to get the web data on the personal search engine?
 How to compensate for the lack of usage data from many users?
Truly personal search?
 Safely gain access to rich personal data including email,
browsing history, documents read and contents of the
user’s home directory
 Can high-quality evidence about an individual’s recurring
long-term interests replace the shallow information of
many?
Better Search – “Deep Personalization”
 “Even more broadly than trying to get people the right
content based on their context, we as a community need to
be thinking about how to support people through the entire
search experience.”
Jaime Teevan on “Slow Search”
 Search as a dialogue
My first journal paper:
De Vries, Van der Veer and Blanken: Let’s talk about it: dialogues with multimedia databases (1998)
Alternatives for Log Data?
 Social annotations
- E.g., bit.ly shortened urls
- Still requires access to an API conveying the query representation
- E.g., anchor text
- E.g., “twanchor text” – tweets providing context to a URL
Anchor Text & Timestamps
 Anchor text exhibits characteristics similar to user queries and
document titles [Eiron & McCurley, Jin et al.]
 Anchor text with timestamps can be used to capture &
trace entity evolution [Kanhabua and Nejdl]
 Anchor text with timestamps lets us reconstruct (past) topic
popularity [Samar et al.]
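As a rough illustration of the last point, counting timestamped anchor texts that mention a topic gives a crude popularity curve over time. This is not the method of the cited work, only a sketch; the anchor records and the substring matching are assumptions.

```python
from collections import Counter


def popularity_by_year(anchors, topic):
    """Count anchor texts mentioning the topic, grouped by the year of the link."""
    counts = Counter()
    for year, text in anchors:
        if topic.lower() in text.lower():
            counts[year] += 1
    return dict(sorted(counts.items()))


# Hypothetical (crawl year, anchor text) pairs extracted from a web archive.
anchors = [
    (2009, "Swine flu outbreak map"),
    (2009, "swine flu symptoms"),
    (2010, "flu vaccine"),
    (2010, "World Cup schedule"),
    (2011, "swine flu report"),
]
trend = popularity_by_year(anchors, "swine flu")
print(trend)
```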
Trade log data!
Open challenges
 How to select the part of your log data you are willing to
share?
 How to estimate the value of this log data?
Blueprint of the Personal Search Engine
 Decentralize search
 Web archives to the rescue
- Super-peers in a P2P network of personal search engines
 “Deep personalization”
- Exploit the rich source data that can be processed safely locally
 A sharing economy:
- Data markets to trade log-data and improve – mutually – your
search results

The personal search engine

  • 1. The personal search engine Prof.dr.ir. Arjen P. de Vries arjen@acm.org Nijmegen, November 7th, 2016
  • 3. “Computational Relevance” “Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model.” Van Rijsbergen, 1976 Retrieval Model
  • 4. Probabilistic Ranking Principle
      - “Provides a theoretical justification for why documents should be ranked by the probability of relevance” (Stephen Robertson, 1977)
  • 5. IR Solved?
      - “Provides a theoretical justification for why documents should be ranked by the probability of relevance” (Stephen Robertson, 1977)
      - PRP assumes (unreasonably?) independence between results and 1/0 loss (or Boolean relevance assessments)
      - PRP does not state how the probability of relevance should be estimated
  • 6. Why Search Remains Difficult to Get Right
      - Heterogeneous data sources
          - WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
      - Varying result types
          - “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
      - Multiple dimensions of relevance
          - Topicality, recency, reading level, …
      Actual information needs often require a mix within and across dimensions, e.g., “recent news and patents from our top competitors”.
  • 7.
      - System’s internal information representation
          - Linguistic annotations
              - Named entities, sentiment, dependencies, …
          - Knowledge resources
              - Wikipedia, Freebase, ICD9, IPTC, …
          - Links to related documents
              - Citations, URLs
      - Anchors that describe the URI
          - Anchor text
      - Queries that lead to clicks on the URI
          - Session, user, dwell-time, …
      - Tweets that mention the URI
          - Time, location, user, …
      - Other social media that describe the URI
          - User, rating
          - Tag, organisation of ‘folksonomy’
      + UNCERTAINTY ALL OVER!
  • 8. Learning to Rank (LTOR)
      - IR as a machine learning problem
      - Learn the matching function from observations
          - E.g., pairwise: a clicked document ranked below another retrieved document should trigger a swap of their positions
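
The pairwise idea on this slide can be illustrated with a minimal sketch. It assumes a simple "click beats skipped-above" heuristic, which is one common way to read click logs and is not stated on the slide. The function name `pairwise_preferences` and the document ids are hypothetical.

```python
def pairwise_preferences(ranking, clicked):
    """Derive (preferred, less_preferred) pairs from one result list.

    Assumption: a clicked document is preferred over every unclicked
    document that was ranked above it.
    """
    clicked = set(clicked)
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:pos]:
                if above not in clicked:
                    pairs.append((doc, above))
    return pairs


# Hypothetical result list in which only the third result was clicked.
ranking = ["d1", "d2", "d3", "d4"]
pairs = pairwise_preferences(ranking, clicked=["d3"])
print(pairs)
```

Each pair is a training observation for a pairwise learner: the ranking function should score the first document above the second.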
  • 9.
      - Detect and classify NEs
      - Rank search results
      - Predict query intent
      - Search suggestions
  • 10.
      - Spelling correction
      - Predict query intent
      - Rank Verticals
      - Search suggestions
  • 12. WWW
      - The Web has become ever more centralized
          + Cloud services – good value-for-money/value-for-effort
      - Mobile only makes things worse
          “There is an app for that”
  • 13. Without the log data, web search isn’t as good
      - This also hinders retrieval experiments in academia!
          - Reproducibility vs. representativeness of research results?
      Samar, T., Bellogín, A. & de Vries, A.P. Inf Retrieval J (2016) 19: 230. doi: 10.1007/s10791-015-9276-9
  • 18. WESTERN DIGITAL DEMONSTRATES PROTOTYPE OF THE WORLD’S FIRST 1TERABYTE SDXC CARD AT PHOTOKINA 2016 (SEP 20, 2016)
  • 19. Realistic?
      - Clueweb 2012: 80TB; recent CommonCrawl: 150TB
      - Average web page takes up 320 KB
          - Large sample collected with Googlebot, May 26th, 2010
          - Reported 4.2B pages (would require ~1.3 Petabyte)
          - https://web-beta.archive.org/web/20100628055041/http://code.google.com/speed/articles/web-metrics.html
      - De Kunder & Van den Bosch estimate an upper bound of ~50B pages
          - http://www.worldwidewebsize.com/
      - Also considering continuing growth (claimed in unpublished work by colleagues)
          - Andrew Trotman, Jinglan Zhang, Future Web Growth and its Consequences for Web Search Architectures. https://arxiv.org/abs/1307.1179
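
The storage figures on this slide can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes). The helper names are hypothetical, and the 1 TB card size is taken from the SDXC prototype mentioned on the previous slide.

```python
import math

KB = 10**3
TB = 10**12
PB = 10**15

AVG_PAGE_BYTES = 320 * KB  # average page size reported on the slide


def crawl_size_bytes(num_pages, avg_page_bytes=AVG_PAGE_BYTES):
    """Raw size of a crawl, ignoring compression and index overhead."""
    return num_pages * avg_page_bytes


def cards_needed(size_bytes, card_bytes=1 * TB):
    """Number of 1 TB cards needed to hold the crawl."""
    return math.ceil(size_bytes / card_bytes)


reported = crawl_size_bytes(4_200_000_000)      # 4.2B pages in the Googlebot sample
upper_bound = crawl_size_bytes(50_000_000_000)  # ~50B pages upper bound

print(f"4.2B pages: {reported / PB:.2f} PB, {cards_needed(reported)} cards")
print(f"50B pages:  {upper_bound / PB:.2f} PB, {cards_needed(upper_bound)} cards")
```

Under these assumptions the 4.2B-page sample comes to about 1.34 PB, in line with the ~1.3 Petabyte on the slide, and the 50B-page upper bound comes to about 16 PB.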
  • 20. Two Problems
      - How to get the web data onto the personal search engine?
      - How to compensate for the lack of usage data from many users?
  • 21. Getting the Data
      - Idea:
          - Organize the web crawl in topically related bundles
          - Apply BitTorrent-like decentralization to share & update bundles
      - Use techniques inspired by query obfuscation to hide the user’s real interests when downloading bundles
      - Web Archives to the rescue?
          - Web Archive to play a role as “super-peer”
      See also WebRTC-based in-browser implementations:
      - WebTorrent: https://webtorrent.io/
      - CacheP2P: http://www.cachep2p.com/
      And http://academictorrents.com/ shares 16TB of research data, including Clueweb 2009 and 2012
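
One way to picture obfuscated bundle downloads is to mix each wanted bundle with randomly chosen decoys, so that an observer cannot tell which topic the user actually cares about. The sketch below is purely illustrative: the slide does not specify a mechanism, and the function name, bundle names, and decoy ratio are all assumptions.

```python
import random


def bundles_to_request(wanted, all_bundles, decoys_per_wanted=3, rng=None):
    """Return the wanted bundles mixed with random decoy bundles.

    Assumption: downloading k decoys per wanted bundle, in shuffled
    order, is enough to blur the user's real interests.
    """
    rng = rng or random.Random()
    wanted = sorted(set(wanted))
    candidates = sorted(set(all_bundles) - set(wanted))
    k = min(len(candidates), decoys_per_wanted * len(wanted))
    request = wanted + rng.sample(candidates, k)
    rng.shuffle(request)
    return request


# Hypothetical topical bundles of a web crawl.
ALL_BUNDLES = ["cooking", "cycling", "genomics", "jazz", "patents", "politics", "sailing"]
request = bundles_to_request(["genomics"], ALL_BUNDLES, rng=random.Random(42))
print(request)
```

Uniformly random decoys are the simplest choice. A real design would likely need decoys that look plausible over time, since repeated requests for the same real bundle would otherwise stand out.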
  • 22. “… communication and media limitations, due to the distance between Earth and Mars, resulting in time delays: they will have to request the movies or news broadcasts they want to see in advance. […] Easy Internet access will be limited to their preferred sites that are constantly updated on the local Mars web server. Other websites will take between 6 and 45 minutes to appear on their screen - first 3-22 minutes for your click to reach Earth, and then another 3-22 minutes for the website data to reach Mars.” http://www.mars-one.com/faq/mission-to-mars/what-will-the-astronauts-do-on-mars
  • 23. Analogy
      - Web Archive ~ Earth
      - Personal search engine (@ people’s homes) ~ Mars
  • 24. “Searching from Mars”
      - Tradeoff between “effort” (waiting for responses from Earth) and “data transfer” (pre-fetching or caching data on Mars).
      - Related work:
          - Jimmy Lin, Charles L. A. Clarke, and Gaurav Baruah. Searching from Mars. Internet Computing, 20(1):77-82, 2016. http://dx.doi.org/10.1109/MIC.2016.2
          - Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest. Total Recall: Blue Sky on Mars. ICTIR '16. http://dx.doi.org/10.1145/2970398.2970430
          - Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest. Ten Blue Links on Mars. https://arxiv.org/abs/1610.06468
  • 25. Pre-fetching & Caching
      - Hide latencies of getting the data from the live web
          - Pre-fetch pages linked from initial query results page
          - Pre-fetch additional related pages
          - Pre-fetches expanded with those from query suggestions
      - Cache web data to avoid accessing the live web
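
The pre-fetching and caching strategy on this slide can be sketched as a small cache that, after loading a results page, eagerly loads the pages it links to. This is a toy model under stated assumptions: the class `PrefetchingCache`, the in-memory `TOY_WEB`, and the representation of a page as its list of outgoing links are all hypothetical stand-ins for a real fetcher and parser.

```python
class PrefetchingCache:
    """Local page store that pre-fetches pages linked from a given page."""

    def __init__(self, fetch, max_prefetch=10):
        self.fetch = fetch          # function: url -> list of outgoing links
        self.max_prefetch = max_prefetch
        self.store = {}             # url -> page (here: its list of links)
        self.live_fetches = 0       # number of slow "live web" accesses

    def _load(self, url):
        if url not in self.store:
            self.store[url] = self.fetch(url)
            self.live_fetches += 1
        return self.store[url]

    def get(self, url):
        """Return (page, was_cached)."""
        was_cached = url in self.store
        return self._load(url), was_cached

    def prefetch_links(self, url):
        """Load a page and up to max_prefetch of the pages it links to."""
        for link in self._load(url)[: self.max_prefetch]:
            self._load(link)


# Toy "live web": each URL maps to the list of URLs it links to.
TOY_WEB = {"results": ["a", "b"], "a": [], "b": ["c"], "c": []}

cache = PrefetchingCache(fetch=TOY_WEB.__getitem__)
cache.prefetch_links("results")
page, was_cached = cache.get("a")
print(was_cached, cache.live_fetches)
```

After the pre-fetch, a click on a result is served locally. Only pages that were not linked from the results page, such as `c` here, still incur the latency of a live fetch.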
  • 26. Two Problems
      - How to get the web data onto the personal search engine?
      - How to compensate for the lack of usage data from many users?
  • 27. Truly personal search?
      - Safely gain access to rich personal data including email, browsing history, documents read and contents of the user’s home directory
      - Can high quality evidence about an individual’s recurring long-term interests replace the shallow information of many?
  • 29. Better Search – “Deep Personalization”
      - “Even more broadly than trying to get people the right content based on their context, we as a community need to be thinking about how to support people through the entire search experience.” (Jaime Teevan on “Slow Search”)
      - Search as a dialogue
      My first journal paper: De Vries, Van der Veer and Blanken: Let’s talk about it: dialogues with multimedia databases (1998)
  • 30. Alternatives for Log Data?
      - Social annotations
          - E.g., bit.ly shortened URLs
          - Still requires access to an API conveying the query representation
          - E.g., anchor text
          - E.g., “twanchor text” – tweets providing context to a URL
  • 31. Anchor Text & Timestamps
      - Anchor text exhibits characteristics similar to user queries and document titles [Eiron & McCurley, Jin et al.]
      - Anchor text with timestamps can be used to capture & trace entity evolution [Kanhabua and Nejdl]
      - Anchor text with timestamps lets us reconstruct (past) topic popularity [Samar et al.]
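
As a rough illustration of reconstructing past topic popularity from timestamped anchor text, the sketch below counts, per month, how many anchors mention a term. It is not the method of the cited papers. The data, the function name `popularity_by_month`, and the whitespace tokenization (which ignores punctuation) are simplifying assumptions.

```python
from collections import Counter
from datetime import date


def popularity_by_month(anchors, term):
    """Count anchors mentioning `term`, grouped by (year, month).

    `anchors` is an iterable of (crawl_date, anchor_text) pairs.
    """
    term = term.lower()
    counts = Counter()
    for crawl_date, text in anchors:
        if term in text.lower().split():
            counts[(crawl_date.year, crawl_date.month)] += 1
    return dict(sorted(counts.items()))


# Hypothetical timestamped anchor texts from an archived crawl.
ANCHORS = [
    (date(2012, 6, 3), "Olympics 2012 schedule"),
    (date(2012, 6, 20), "London olympics tickets"),
    (date(2012, 7, 1), "olympics opening ceremony"),
    (date(2012, 7, 2), "weather forecast"),
]

trend = popularity_by_month(ANCHORS, "olympics")
print(trend)
```

The resulting time series acts as a stand-in for query-log volume: months in which many new links mention a topic are taken to be months in which the topic was popular.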
  • 34. Open challenges
      - How to select the part of your log data you are willing to share?
      - How to estimate the value of this log data?
  • 35. Blueprint of the Personal Search Engine
      - Decentralize search
      - Web archives to the rescue
          - Super-peers in a P2P network of personal search engines
      - “Deep personalization”
          - Exploit the rich source data that can be processed safely locally
      - A sharing economy:
          - Data markets to trade log data and mutually improve your search results