Web Archives and the dream of the
Personal Search Engine
Prof.dr.ir. Arjen P. de Vries
arjen@acm.org
Hannover, October 19th, 2017
Library of the “Muntmuseum” in Utrecht (Erik van Hannen)
Why Search Remains Difficult to Get Right
 Heterogeneous data sources
- WWW, wikipedia, news, e-mail, patents, twitter, personal
information, …
 Varying result types
- “Documents”, tweets, courses, people, experts, gene
expressions, temperatures, …
 Multiple dimensions of relevance
- Topicality, recency, reading level, …
Actual information needs often require a mix within
and across dimensions. E.g., “recent news and
patents from our top competitors”
 System’s internal information representation
- Linguistic annotations
- Named entities, sentiment, dependencies, …
- Knowledge resources
- Wikipedia, Freebase, ICD-9, IPTC, …
- Links to related documents
- Citations, urls
 Anchors that describe the URI
- Anchor text
 Queries that lead to clicks on the URI
- Session, user, dwell-time, …
 Tweets that mention the URI
- Time, location, user, …
 Other social media that describe the URI
- User, rating
- Tag, organisation of “folksonomy”
+ UNCERTAINTY ALL OVER!
Learning to Rank (LTOR)
 IR as a machine learning problem
 Learn the matching function from observations
- E.g., pairwise: a clicked document ranked below a retrieved but unclicked
document should trigger a swap of their positions
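A minimal sketch of the pairwise idea above, assuming a click log that records the ranked list shown and the set of clicked documents. The function names are hypothetical, and a real learning-to-rank system would use such pairs to update a ranking model rather than swap results directly.

```python
def preference_pairs(ranking, clicked):
    """Return (preferred, skipped) pairs: each clicked document is preferred
    over every unclicked document that was ranked above it."""
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:pos]:
                if above not in clicked:
                    pairs.append((doc, above))
    return pairs


def apply_swaps(ranking, pairs):
    """Naively swap each preferred document with the document it should beat."""
    ranked = list(ranking)
    for preferred, skipped in pairs:
        i, j = ranked.index(preferred), ranked.index(skipped)
        if i > j:  # the preferred document is still ranked lower: swap
            ranked[i], ranked[j] = ranked[j], ranked[i]
    return ranked


pairs = preference_pairs(["a", "b"], clicked={"b"})
reranked = apply_swaps(["a", "b"], pairs)
```

Here the click on "b" yields one preference pair, and applying it moves "b" above "a".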
[Figure labels: detect and classify NEs, rank search results, predict query intent, search suggestions, spelling correction, rank verticals]
Robert Johnson (1911-1938): “Early this morning
when you knocked upon my door / And I said, ‘Hello,
Satan, I believe it’s time to go.’”
https://youtu.be/3MCHI23FTP8
WWW
 The Web has become ever more centralized
+ Cloud services – good value-for-money/value-for-effort
 Mobile only makes things worse
“There is an app for that”
Decentralize Web Search?
See also yacy.net/
Without the log data, web search isn’t as good
 This also hinders retrieval experiments in academia!
- Reproducibility vs. Representativeness of research results?
Samar, T., Bellogín, A. & de Vries, A.P. Inf Retrieval J (2016) 19: 230.
doi: 10.1007/s10791-015-9276-9
Disclaimer:
Personal search engine!
http://www.mkomo.com/cost-per-gigabyte-update
WESTERN DIGITAL DEMONSTRATES PROTOTYPE OF THE WORLD’S FIRST
1 TERABYTE SDXC CARD AT PHOTOKINA 2016
(SEP 20, 2016)
Realistic?
 Clueweb 2012: 80TB
Recent CommonCrawl (August 2017): 3.28B pages, 280TB
 Average web page takes up 320 KB
- Large sample collected with Googlebot, May 26th, 2010
- Reported 4.2B pages (would require ~1.3 Petabyte)
 De Kunder & Van den Bosch estimate an upper bound of ~50B pages
- http://www.worldwidewebsize.com/
 Also considering continuing growth (claimed in unpublished work)
- Andrew Trotman, Jinglan Zhang, Future Web Growth and its Consequences for Web
Search Architectures. https://arxiv.org/abs/1307.1179
https://web-beta.archive.org/web/20100628055041/http://code.google.com/speed/articles/web-metrics.html
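The storage estimates on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 KB = 1000 bytes). The 50B-page figure is an extrapolation using the same 320 KB average, not a number from the slides, and the CommonCrawl average is simply the quoted 280TB divided by the quoted page count, whatever that total includes.

```python
KB, TB, PB = 10**3, 10**12, 10**15  # decimal units (assumption)


def corpus_bytes(pages, avg_page_bytes):
    """Total size of a crawl, given a page count and an average page size."""
    return pages * avg_page_bytes


AVG_PAGE = 320 * KB  # average page size reported for the 2010 Googlebot sample

googlebot_2010 = corpus_bytes(4_200_000_000, AVG_PAGE)
upper_bound = corpus_bytes(50_000_000_000, AVG_PAGE)  # extrapolation, not from the slides
commoncrawl_avg = 280 * TB / 3_280_000_000  # bytes per page implied by the crawl totals

print(f"4.2B pages at 320 KB: {googlebot_2010 / PB:.2f} PB")
print(f"50B pages at 320 KB: {upper_bound / PB:.1f} PB")
print(f"CommonCrawl average: {commoncrawl_avg / KB:.0f} KB per page")
```

The first figure reproduces the ~1.3 Petabyte on the slide. The CommonCrawl numbers imply a much smaller average per stored page than 320 KB.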
Realistic?
 Who actually needs all of the Web if their search engine is
truly personal?
 E.g., I do not read more than 4 or 5 languages…
 And, I do not want to see or read anything related to
qualifiers for the world cup
Two Problems
 How to get the web data on the personal search engine?
 How to compensate for the lack of usage data from many users?
Getting the Data
 Idea:
- Organize the web crawl in topically related bundles
- Apply bittorrent-like decentralization to share & update bundles
 Use techniques inspired by query obfuscation to hide the
real user’s interests when downloading bundles
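One way to picture the obfuscation idea: alongside each bundle the user really wants, the client also requests a few randomly chosen decoy bundles, so an observer cannot tell which topics matter. This is only a sketch under that assumption. The bundle names, the `decoys_per_bundle` parameter and the function itself are hypothetical, and random decoys are a weak defence compared with published query-obfuscation techniques.

```python
import random


def bundles_to_request(wanted, catalogue, decoys_per_bundle=2, rng=None):
    """Mix the wanted topical bundles with random decoys from the catalogue."""
    rng = rng or random.Random()
    wanted = set(wanted)
    candidates = sorted(set(catalogue) - wanted)
    n_decoys = min(len(candidates), decoys_per_bundle * len(wanted))
    request = sorted(wanted) + rng.sample(candidates, n_decoys)
    rng.shuffle(request)  # do not reveal the real interests through ordering
    return request


catalogue = ["cooking", "cycling", "football", "information-retrieval", "politics"]
request = bundles_to_request(["information-retrieval"], catalogue, rng=random.Random(0))
```

The request contains the real bundle plus two decoys, in random order.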
See also WebRTC based in-browser implementations:
 Webtorrent: https://webtorrent.io/
 CacheP2P: http://www.cachep2p.com/
academictorrents.com shares 16TB research data, including
Clueweb 2009 and 2012 anchor text
And, IPFS: https://ipfs.io/
“A peer-to-peer hypermedia protocol to make the web faster, safer,
and more open.”
Web Archives to the Rescue?
 Web Archives already store the data that the personal
search engine would need
- Just not (yet) organized in topical bundles
Rescue the Web archives?!
 Q: a business model for archiving?
 Q: enrich the (rarely used) web archives with usage data?
 Q: crowd-sourced seed-lists for crawling?
See also a different direction to rescue the Web Archive:
bit.ly/VisualNavigationProject by Hugo Huurdeman
IPFS
 “The Permanent Web”
- Smart mix of bittorrent for peer-to-peer filesharing and git for
versioning
- Each file and all of the blocks within it are given a unique
fingerprint called a cryptographic hash
- This hash is used to look up files
 IPFS = the Inter-Planetary File System
 Decentralized file sharing, but no decentralized search
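The lookup-by-hash principle can be illustrated with a toy content-addressed store. This is a simplification: it assumes plain SHA-256 hex digests as keys and a single in-memory dictionary, whereas IPFS uses its own content identifiers and distributes blocks over peers. The class name is hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key of a block is the hash of its bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Anyone can verify the content against the key it was requested by.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("block does not match its hash")
        return data


store = ContentStore()
key = store.put(b"archived web page")
page = store.get(key)
```

Because the key is derived from the content, identical pages stored by different peers get the same key, and a page cannot be altered without changing its address.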
“… communication and media limitations, due to the distance between Earth and Mars, resulting in time delays: they will have to request the movies or news broadcasts they want to see in advance.
[…]
Easy Internet access will be limited to their preferred sites that are constantly updated on the local Mars web server. Other websites will take between 6 and 45 minutes to appear on their screen - first 3-22 minutes for your click to reach Earth, and then another 3-22 minutes for the website data to reach Mars.”
http://www.mars-one.com/faq/mission-to-mars/what-will-the-astronauts-do-on-mars
“Searching from Mars”
 Tradeoff between “effort” (waiting for responses from Earth) and “data
transfer” (pre-fetching or caching data on Mars).
 Related work:
- Jimmy Lin, Charles L. A. Clarke, and Gaurav Baruah. Searching from Mars. Internet
Computing, 20(1):77-82, 2016. http://dx.doi.org/10.1109/MIC.2016.2
- Charles L.A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest.
Total Recall: Blue Sky on Mars. ICTIR '16. http://dx.doi.org/10.1145/2970398.2970430
- Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, Adam Roegiest.
Ten Blue Links on Mars. https://arxiv.org/abs/1610.06468
Pre-fetching & Caching
 Hide latencies of getting the data from the live web
- Pre-fetch pages linked from initial query results page
- Pre-fetch additional related pages
- Pre-fetches expanded with those from query suggestions
 Cache web data to avoid accessing the live web
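A minimal sketch of the pre-fetch-and-cache idea, assuming two caller-supplied functions: `fetch(url)` retrieves a page from the remote side, and `links_of(page)` lists the URLs it links to. Both names are hypothetical. A real system would pre-fetch in the background and bound the cache size.

```python
class PrefetchingCache:
    """Serve pages from a local cache and pre-fetch the pages they link to."""

    def __init__(self, fetch, links_of):
        self.fetch = fetch          # slow call to the remote side
        self.links_of = links_of    # extracts outgoing URLs from a page
        self.cache = {}
        self.remote_fetches = 0

    def _load(self, url):
        if url not in self.cache:
            self.cache[url] = self.fetch(url)
            self.remote_fetches += 1
        return self.cache[url]

    def get(self, url):
        page = self._load(url)
        for linked in self.links_of(page):  # pre-fetch one level of links
            self._load(linked)
        return page


# Stand-in for the live web: each "page" is just its list of outgoing links.
web = {"results": ["a", "b"], "a": [], "b": []}
cache = PrefetchingCache(fetch=web.__getitem__, links_of=lambda page: page)
cache.get("results")   # fetches the results page plus both linked pages
cache.get("a")         # answered locally, no further remote fetch
```

After the results page has been loaded, clicking through to a result costs no further round trip.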
Analogy
 Web Archive ~ Earth
 Personal search engine (@ people’s homes) ~ Mars
Two Problems
 How to get the web data on the personal search engine?
 How to compensate for the lack of usage data from many users?
Alternatives for Log Data?
 Social annotations
- E.g., bit.ly shortened urls
- Still requires access to an API conveying the query representation
- E.g., anchor text
- E.g., “twanchor text” – tweets providing context to a URL
“SearsiaSuggest”
 Searsia (federated search engine created by Djoerd
Hiemstra) uses anchor text instead of query logs for its
autocompletions
- “… for queries of 2 words or more (the average query length in
the test data is 2.6), anchor text autocompletions perform better
than query log autocompletions”
- No more tracking of users!
 See also:
- searsia.org/blog/2017-03-18-query-suggestions-without-tracking-users/
- github.com/searsia/searsiasuggest
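To make the idea concrete, here is a small sketch of prefix autocompletion driven by anchor-text frequencies instead of a query log. It is not the SearsiaSuggest implementation. The function names and the ranking by raw frequency are assumptions for illustration only.

```python
from collections import Counter


def build_suggester(anchor_texts):
    """Count normalised anchor texts and return a prefix-completion function."""
    counts = Counter(text.strip().lower() for text in anchor_texts if text.strip())

    def suggest(prefix, k=5):
        prefix = prefix.lower()
        matches = [(t, n) for t, n in counts.items() if t.startswith(prefix)]
        # Most frequent anchors first, ties broken alphabetically.
        matches.sort(key=lambda tn: (-tn[1], tn[0]))
        return [t for t, _ in matches[:k]]

    return suggest


suggest = build_suggester(["Web archive", "web archive", "web search", "Weather"])
completions = suggest("web")
```

No user is tracked to produce these completions: the only input is text that page authors published themselves.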
Anchor Text & Timestamps
 Anchor text exhibits characteristics similar to user query
and document title [Eiron & McCurley, Jin et al.]
 Anchor text with timestamps can be used to capture &
trace entity evolution [Kanhabua and Nejdl]
 Anchor text with timestamps lets us reconstruct (past) topic
popularity [Samar et al.]
Again, the Web Archive to the rescue!
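As an illustration of how timestamped anchor text could expose past topic popularity, the sketch below counts anchor texts per crawl year. The input format (pairs of crawl year and anchor text) and the function names are assumptions, and the cited studies use more careful normalisation and de-duplication.

```python
from collections import Counter, defaultdict


def popularity_by_year(links):
    """links: iterable of (crawl_year, anchor_text) pairs from archived pages."""
    per_year = defaultdict(Counter)
    for year, anchor in links:
        per_year[year][anchor.strip().lower()] += 1
    return per_year


def top_topics(per_year, year, k=3):
    """The k most frequent anchor texts observed in the given crawl year."""
    return [topic for topic, _ in per_year[year].most_common(k)]


links = [
    (2010, "World Cup"), (2010, "world cup"), (2010, "Elections"),
    (2012, "Olympics"),
]
per_year = popularity_by_year(links)
popular_2010 = top_topics(per_year, 2010, k=1)
```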
Recover Past Trends
 “Ground-truth” from WikiStats, Google Trends and the KB
online newspaper archive’s query log
 Anchor Text combined with timestamps can be used to find
past popular topics
- The % of coverage varies across the sources of past trends
- Anchor Text popularity correlates with the % of coverage
 Crawl strategy: KB vs. CommonCrawl
- Breadth-first (CommonCrawl) covers more topics globally and
from the NL domain
Investigate Bias
 Our study, on the Dutch Web Archive:
- Anchor Text from external links
- Create query sets with a timestamp per query (2009 – 2012)
- De-duplicated for year of crawl
(Most sites crawled once a year, but a subset more frequently.)
 Retrievability study
- Number of sites crawled in a year does not influence the
retrievability of documents from that year
- The difficulty of retrieving a document from a certain timeframe does
depend on the subset size
L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
http://dx.doi.org/10.1007/s00799-017-0215-9
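For readers unfamiliar with the measure, a compact sketch of cumulative retrievability follows: a document scores one point for every query that returns it within the top `cutoff` results, and the Gini coefficient summarises how unequally those scores are distributed. Uniform query weights and the function names are assumptions, and this is not the code used in the study.

```python
def retrievability(rankings, cutoff=10):
    """rankings: dict mapping each query to its ranked list of document ids.
    Returns, per document, the number of queries that rank it within the cutoff."""
    scores = {}
    for ranked in rankings.values():
        for doc in ranked[:cutoff]:
            scores[doc] = scores.get(doc, 0) + 1
    return scores


def gini(values):
    """Gini coefficient: 0 means all documents are equally retrievable."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


rankings = {"q1": ["d1", "d2"], "q2": ["d1", "d3"]}
scores = retrievability(rankings, cutoff=1)
bias = gini([scores.get(d, 0) for d in ["d1", "d2", "d3"]])
```

Documents that are never retrieved must be included with a score of zero, as in the last line, otherwise the bias is underestimated.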
PhD defense: October 30th, 2017
Trade log data!
IR-809: (2011) Feild, H., Allan, J. and Glatt, J.,
"CrowdLogging: Distributed, private, and anonymous search logging,"
Proceedings of the International Conference on Research and
Development in Information Retrieval (SIGIR'11), pp. 375-384.
We describe an approach for distributed search log collection, storage, and mining,
with the dual goals of preserving privacy and making the mined information broadly
available. [..] The approach works with any search behavior artifact that can be
extracted from a search log, including queries, query reformulations, and
query-click pairs.
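One simple policy in this spirit, shown only as an illustration and not as the protocol from the paper: release a search artifact only when at least `k` distinct users have contributed it, so that rare and potentially identifying queries stay private. The function name and the threshold are assumptions.

```python
from collections import defaultdict


def shareable_artifacts(log, k=2):
    """log: iterable of (user_id, artifact) pairs, e.g. queries or query-click pairs.
    Returns the artifacts seen from at least k distinct users, with that user count."""
    users = defaultdict(set)
    for user, artifact in log:
        users[artifact].add(user)
    return {a: len(u) for a, u in users.items() if len(u) >= k}


log = [
    ("u1", "web archive"), ("u2", "web archive"),
    ("u1", "web archive"), ("u1", "my home address"),
]
shared = shareable_artifacts(log, k=2)
```

Repeated submissions by the same user do not count towards the threshold, and the query issued by a single user is withheld.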
Open challenges
 How to select the part of your log data you are willing to
share?
 How to estimate the value of this log data?
Share Log Segments by Topic?
 Represent searchers’ previous search history in the form of
concise human-readable topical profiles
- Classifier trained on ODP applied to clicked pages
Carsten Eickhoff, Kevyn Collins-Thompson, Paul Bennett, and
Susan Dumais. Designing human-readable user profiles for
search evaluation (ECIR’13)
Share Log Segments by Topic?
 Linking a year of query logs to Wikipedia categories helped
distill the segments corresponding to events like marriage,
a first-born, and expat life (taxes)
- Jiyin He, Marc Bron: Measuring Demonstrated Potential
Domain Knowledge with Knowledge Graphs.
KG4IR@SIGIR 2017: 13-18
But wait…
… do we REALLY need all that query log info?
Personal search engine
 Safely gain access to rich personal data including email,
browsing history, documents read and contents of the
user’s home directory
 Can high quality evidence about an individual’s recurring
long-term interests replace the shallow information of
many?
Better Search – “Deep Personalization”
 “Even more broadly than trying to get people the right
content based on their context, we as a community need to
be thinking about how to support people through the entire
search experience.”
Jaime Teevan on “Slow Search”
 Search as a dialogue
My first journal paper:
De Vries, Van der Veer and Blanken: Let’s talk about it: dialogues with multimedia databases (1998)
“Deep Personalization”
 How could the indexer know about the wide variety of
sources and their schema information...
 Or, How to build 1000+ search engines?!
Create LOD representation
Engineer the Search Engine
Model the Search Engine
“Search by Strategy”
• “No idealized one-shot search engine”
• Hand over control to the user (or, most
likely, the search intermediary)
• Search (and link) strategies can be
shared!
Note: Enhances Reproducibility of IR Research!
Web Archives to Lead the Revolution!
 Two main opportunities:
- Free us from the mass surveillance that is now the default
business model of the internet
- Improve Web Archive and Web Archive search
 Long run: realize truly personal search engines?
Blueprint of the Personal Search Engine
 Decentralize search
 Web archives to the rescue
- Super-peers in a P2P network of personal search engines
 “Deep personalization”
- Exploit the rich source data that can be processed safely locally
 A sharing economy:
- Data markets to trade log-data and improve – mutually – your
search results

 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 

Recently uploaded (20)

Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
NSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationNSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentation
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 

Web Archives and the dream of the Personal Search Engine

  • 1. Web Archives and the dream of the Personal Search Engine. Prof.dr.ir. Arjen P. de Vries (arjen@acm.org). Hannover, October 19th, 2017
  • 2. Library of the “Muntmuseum” in Utrecht (Erik van Hannen)
  • 3. Why Search Remains Difficult to Get Right
      - Heterogeneous data sources: WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
      - Varying result types: “documents”, tweets, courses, people, experts, gene expressions, temperatures, …
      - Multiple dimensions of relevance: topicality, recency, reading level, …
      Actual information needs often require a mix within and across dimensions, e.g., “recent news and patents from our top competitors”.
  • 4.
      - System’s internal information representation
          - Linguistic annotations: named entities, sentiment, dependencies, …
          - Knowledge resources: Wikipedia, Freebase, ICD-9, IPTC, …
          - Links to related documents: citations, URLs
      - Anchors that describe the URI: anchor text
      - Queries that lead to clicks on the URI: session, user, dwell-time, …
      - Tweets that mention the URI: time, location, user, …
      - Other social media that describe the URI: user, rating; tag, organisation of “folksonomy”
      + UNCERTAINTY ALL OVER!
  • 5. Learning to Rank (LTOR)
      - IR as a machine learning problem
      - Learn the matching function from observations
          - E.g., pairwise: a clicked document ranked below a retrieved but unclicked document should trigger a swap of their positions
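The pairwise idea on this slide can be sketched in a few lines. This is a minimal illustration, not the method of any particular search engine: it assumes the common "a click beats the unclicked results above it" heuristic, a linear scoring function, and a perceptron-style update. The document ids, feature vectors, learning rate and function names are all made up for the example.

```python
from typing import Dict, List, Set, Tuple


def click_preferences(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) pairs: a clicked document is
    preferred over every unclicked document that was ranked above it."""
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:pos]:
                if above not in clicked:
                    pairs.append((doc, above))
    return pairs


def score(weights: List[float], features: List[float]) -> float:
    """Linear matching function: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train(weights: List[float], features: Dict[str, List[float]],
          pairs: List[Tuple[str, str]], lr: float = 0.5,
          max_epochs: int = 50) -> List[float]:
    """Perceptron-style pairwise training: whenever the preferred document
    does not outscore the other one, move the weights towards the
    difference of their feature vectors. Stops once no pair is violated."""
    w = list(weights)
    for _ in range(max_epochs):
        updated = False
        for better, worse in pairs:
            if score(w, features[better]) <= score(w, features[worse]):
                w = [wi + lr * (b - a)
                     for wi, b, a in zip(w, features[better], features[worse])]
                updated = True
        if not updated:
            break
    return w


# Hypothetical toy data: three retrieved documents, the user clicked the last.
ranking = ["d1", "d2", "d3"]
clicked = {"d3"}
features = {"d1": [1.0, 0.0], "d2": [0.8, 0.1], "d3": [0.2, 0.9]}

pairs = click_preferences(ranking, clicked)
weights = train([1.0, 0.0], features, pairs)
reranked = sorted(ranking, key=lambda d: score(weights, features[d]), reverse=True)
```

After training, the clicked document moves to the top of `reranked`. Production systems use far richer features and loss functions; the point here is only that click observations are turned into ordered pairs, and that ordered pairs drive the learning.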
  • 6. Detect and classify NEs; rank search results; predict query intent; search suggestions
  • 7. Spelling correction; predict query intent; rank verticals; search suggestions
  • 9. WWW
      - The Web has become ever more centralized
          + Cloud services: good value-for-money/value-for-effort
      - Mobile only makes things worse: “There is an app for that”
  • 11. Without the log data, web search isn’t as good
      - This also hinders retrieval experiments in academia!
          - Reproducibility vs. representativeness of research results? Samar, T., Bellogín, A. & de Vries, A.P. Inf Retrieval J (2016) 19: 230. doi: 10.1007/s10791-015-9276-9
  • 16. WESTERN DIGITAL DEMONSTRATES PROTOTYPE OF THE WORLD’S FIRST 1 TERABYTE SDXC CARD AT PHOTOKINA 2016 (SEP 20, 2016)
  • 17. Realistic?
      - ClueWeb 2012: 80TB
      - Recent CommonCrawl (August 2017): 3.28B pages, 280TB
      - Average web page takes up 320 KB
          - Large sample collected with Googlebot, May 26th, 2010
          - Reported 4.2B pages (would require ~1.3 Petabyte)
          - https://web-beta.archive.org/web/20100628055041/http://code.google.com/speed/articles/web-metrics.html
      - De Kunder & Van den Bosch estimate an upper bound of ~50B pages
          - http://www.worldwidewebsize.com/
      - Also considering continuing growth (claimed in unpublished work)
          - Andrew Trotman, Jinglan Zhang, Future Web Growth and its Consequences for Web Search Architectures. https://arxiv.org/abs/1307.1179
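The storage estimate on this slide is simple arithmetic, spelled out below as a back-of-the-envelope check. The sketch assumes decimal units (1 KB = 1000 bytes), uncompressed storage at the 320 KB average, and 1 TB cards like the prototype on the previous slide. The variable names are illustrative.

```python
KB = 10**3
TB = 10**12
PB = 10**15

avg_page_bytes = 320 * KB            # average page size from the 2010 Googlebot sample
googlebot_pages = 4_200_000_000      # 4.2B pages in that sample
upper_bound_pages = 50_000_000_000   # ~50B pages, the upper bound estimate

googlebot_bytes = googlebot_pages * avg_page_bytes
upper_bound_bytes = upper_bound_pages * avg_page_bytes

googlebot_pb = googlebot_bytes / PB       # about 1.34 PB, i.e. the "~1.3 Petabyte" on the slide
upper_bound_pb = upper_bound_bytes / PB   # about 16 PB at the same average page size


def cards_needed(total_bytes: int, card_bytes: int = TB) -> int:
    """Number of cards of the given capacity needed (rounded up)."""
    return -(-total_bytes // card_bytes)


commoncrawl_cards = cards_needed(280 * TB)        # the August 2017 CommonCrawl
googlebot_cards = cards_needed(googlebot_bytes)   # the 4.2B-page sample

print(f"4.2B pages: {googlebot_pb:.2f} PB, {googlebot_cards} cards of 1 TB")
print(f"50B pages:  {upper_bound_pb:.1f} PB")
print(f"CommonCrawl (280TB): {commoncrawl_cards} cards of 1 TB")
```

Compression and de-duplication would lower these figures. They are not modelled here.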
  • 18. Realistic?
      - Who actually needs all of the Web if their search engine is truly personal?
      - E.g., I do not read more than 4 or 5 languages…
      - And, I do not want to see or read anything related to qualifiers for the world cup
  • 19. Two Problems
      - How to get the web data on the personal search engine?
      - How to replace the lack of usage data from many?
  • 20. Getting the Data
      - Idea:
          - Organize the web crawl in topically related bundles
          - Apply BitTorrent-like decentralization to share & update bundles
      - Use techniques inspired by query obfuscation to hide the user’s real interests when downloading bundles
      See also WebRTC-based in-browser implementations:
      - WebTorrent: https://webtorrent.io/
      - CacheP2P: http://www.cachep2p.com/
      academictorrents.com shares 16TB research data, including ClueWeb 2009 and 2012 anchor text
      And, IPFS: https://ipfs.io/ “A peer-to-peer hypermedia protocol to make the web faster, safer, and more open.”
  • 21. Web Archives to the Rescue?
      - Web Archives already store the data that the personal search engine would need
          - Just not (yet) organized in topical bundles
  • 22. Rescue the Web archives?!
      - Q: a business model for archiving?
      - Q: enrich the (rarely used) web archives with usage data?
      - Q: crowd-sourced seed-lists for crawling?
      See also a different direction to rescue the Web Archive: bit.ly/VisualNavigationProject by Hugo Huurdeman
  • 23. IPFS
      - “The Permanent Web”
          - Smart mix of BitTorrent for peer-to-peer filesharing and git for versioning
          - Each file and all of the blocks within it are given a unique fingerprint called a cryptographic hash
          - This hash is used to look up files
      - IPFS = the Inter-Planetary File System
      - Decentralized file sharing, but no decentralized search
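To make "the hash is used to look up files" concrete, here is a minimal content-addressed store. It is a toy sketch only: real IPFS uses a Merkle DAG, multihash-based content identifiers and a distributed hash table, none of which is modelled here. The class name, the tiny block size and the plain-text manifest format are assumptions made for the example.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Toy content-addressed store: every block is stored under the
    SHA-256 hash of its bytes, and a file is a manifest of block hashes."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size      # unrealistically small, for illustration
        self.blocks: Dict[str, bytes] = {}

    def put_block(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data           # identical content maps to the same key
        return key

    def put_file(self, data: bytes) -> str:
        """Split the file into blocks, store each block, then store a
        manifest listing the block hashes. Returns the manifest's hash."""
        keys = [self.put_block(data[i:i + self.block_size])
                for i in range(0, len(data), self.block_size)]
        manifest = "\n".join(keys).encode("ascii")
        return self.put_block(manifest)

    def get_file(self, file_key: str) -> bytes:
        """Look the manifest up by hash, then fetch and join its blocks."""
        manifest = self.blocks[file_key].decode("ascii")
        return b"".join(self.blocks[k] for k in manifest.split("\n") if k)


store = ContentStore()
key_a = store.put_file(b"hello web archive")
key_b = store.put_file(b"hello web archive")   # same content, same address
key_c = store.put_file(b"hello web archives")  # different content, new address
restored = store.get_file(key_a)
```

Because the address is derived from the content, any peer holding the blocks can serve them, and the receiver can verify them by rehashing. Finding content by what it is about (search) remains a separate problem, as the slide notes.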
  • 24. “… communication and media limitations, due to the distance between Earth and Mars, resulting in time delays: they will have to request the movies or news broadcasts they want to see in advance. […] Easy Internet access will be limited to their preferred sites that are constantly updated on the local Mars web server. Other websites will take between 6 and 45 minutes to appear on their screen - first 3-22 minutes for your click to reach Earth, and then another 3-22 minutes for the website data to reach Mars.” http://www.mars-one.com/faq/mission-to-mars/what-will-the-astronauts-do-on-mars
  • 25. “Searching from Mars”
      - Tradeoff between “effort” (waiting for responses from Earth) and “data transfer” (pre-fetching or caching data on Mars).
      - Related work:
          - Jimmy Lin, Charles L. A. Clarke, and Gaurav Baruah. Searching from Mars. Internet Computing, 20(1):77-82, 2016. http://dx.doi.org/10.1109/MIC.2016.2
          - Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest. Total Recall: Blue Sky on Mars. ICTIR '16. http://dx.doi.org/10.1145/2970398.2970430
          - Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, Adam Roegiest. Ten Blue Links on Mars. https://arxiv.org/abs/1610.06468
  • 26. Pre-fetching & Caching
      - Hide latencies of getting the data from the live web
          - Pre-fetch pages linked from initial query results page
          - Pre-fetch additional related pages
          - Pre-fetches expanded with those from query suggestions
      - Cache web data to avoid accessing the live web
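A minimal sketch of the pre-fetch-and-cache idea follows. The "live web" is a hypothetical in-memory dictionary standing in for the slow remote source (Earth, or the web archive). The URLs, the `links:` page format and all names are invented for illustration.

```python
from typing import Callable, Dict, Iterable, List


class PrefetchingCache:
    """Local cache in front of a slow remote source. Pages fetched ahead
    of time are served locally when the user eventually asks for them."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch               # the slow call to the remote source
        self._cache: Dict[str, str] = {}
        self.slow_fetches = 0             # how often the remote source was used

    def get(self, url: str) -> str:
        if url not in self._cache:
            self.slow_fetches += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]

    def prefetch(self, urls: Iterable[str]) -> None:
        """Fetch pages before the user asks for them."""
        for url in urls:
            self.get(url)


# Hypothetical stand-in for the live web.
LIVE_WEB = {
    "results?q=web+archives": "links: a.example/1 a.example/2",
    "a.example/1": "page one",
    "a.example/2": "page two",
}


def slow_fetch(url: str) -> str:
    return LIVE_WEB[url]


def extract_links(page: str) -> List[str]:
    """Toy link extractor for the made-up 'links: ...' results format."""
    return page.split()[1:] if page.startswith("links:") else []


cache = PrefetchingCache(slow_fetch)
results_page = cache.get("results?q=web+archives")
cache.prefetch(extract_links(results_page))   # pre-fetch pages linked from the results
fetches_before_click = cache.slow_fetches

clicked_page = cache.get("a.example/1")       # served locally, no extra remote fetch
```

The trade-off is the one from "Searching from Mars": every pre-fetched page the user never opens is wasted transfer, and every miss costs a full round trip.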
  • 27. Analogy
      - Web Archive ~ Earth
      - Personal search engine (@ people’s homes) ~ Mars
  • 28. Two Problems
      - How to get the web data on the personal search engine?
      - How to replace the lack of usage data from many?
  • 29. Alternatives for Log Data?
      - Social annotations
          - E.g., bit.ly shortened URLs
          - Still requires access to an API conveying the query representation
          - E.g., anchor text
          - E.g., “twanchor text”: tweets providing context to a URL
  • 30. “SearsiaSuggest”
      - Searsia (federated search engine created by Djoerd Hiemstra) uses anchor text instead of query logs for its autocompletions
          - “… for queries of 2 words or more (the average query length in the test data is 2.6), anchor text autocompletions perform better than query log autocompletions”
          - No more tracking of users!
      - See also:
          - searsia.org/blog/2017-03-18-query-suggestions-without-tracking-users/
          - github.com/searsia/searsiasuggest
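How anchor text can stand in for a query log in autocompletion is sketched below. This is not Searsia's actual implementation, only a minimal assumption-laden illustration: anchor texts are normalized, counted, and the most frequent ones that start with the typed prefix are suggested. The sample anchors are invented.

```python
from collections import Counter
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace."""
    return " ".join(text.lower().split())


def build_suggester(anchor_texts: Iterable[str]) -> Counter:
    """Count how often each (normalized) anchor text occurs in the crawl."""
    return Counter(normalize(t) for t in anchor_texts if t.strip())


def suggest(counts: Counter, prefix: str, k: int = 3) -> List[str]:
    """Return up to k anchor texts starting with the prefix, most frequent
    first; ties are broken alphabetically."""
    p = normalize(prefix)
    matches = [(text, n) for text, n in counts.items() if text.startswith(p)]
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [text for text, _ in matches[:k]]


# Hypothetical anchor texts harvested from links in a web crawl.
anchors = ["Web Archive", "web archive", "Web search", "weather", "Web  Archive"]
counts = build_suggester(anchors)

print(suggest(counts, "we"))     # ['web archive', 'weather', 'web search']
print(suggest(counts, "web s"))  # ['web search']
```

No user ever has to be tracked to build `counts`: the evidence comes from what page authors wrote when linking, not from what searchers typed.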
  • 31. Anchor Text & Timestamps
      - Anchor text exhibits characteristics similar to user query and document title [Eiron & McCurley, Jin et al.]
      - Anchor text with timestamps can be used to capture & trace entity evolution [Kanhabua and Nejdl]
      - Anchor text with timestamps lets us reconstruct (past) topic popularity [Samar et al.]
      Again, the Web Archive to the rescue!
  • 32. Recover Past Trends
      - “Ground truth” from WikiStats, Google Trends and the KB online newspaper archive’s query log
      - Anchor text combined with timestamps can be used to find past popular topics
          - The % of coverage varies across the sources of past trends
          - Anchor text popularity correlates with the % of coverage
      - Crawl strategy: KB vs. CommonCrawl
          - Breadth-first (CommonCrawl) covers more topics globally and from the NL domain
  • 33. Investigate Bias
      - Our study, on the Dutch Web Archive:
          - Anchor text from external links
          - Create query sets with a timestamp per query (2009-2012)
          - De-duplicated for year of crawl (most sites crawled once a year, but a subset more frequently)
      - Retrievability study
          - Number of sites crawled in a year does not influence the retrievability of documents from that year
          - Difficulty to retrieve a document from a certain timeframe does depend on the subset size
      L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
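For readers unfamiliar with the retrievability measure cited on this slide, the sketch below shows the cumulative variant: a document's score is the (weighted) number of queries for which it appears in the top c results, and the Gini coefficient over all scores summarizes how unevenly retrievability is spread. The query set, rankings and cutoff are invented toy values, not data from the study.

```python
from typing import Dict, Iterable, List, Optional


def retrievability(rankings: Dict[str, List[str]], all_docs: Iterable[str],
                   c: int = 10,
                   query_weights: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Cumulative retrievability: r(d) = sum over queries q of o_q * f(k_dq, c),
    where f is 1 if document d is ranked within the top c for q, else 0,
    and o_q is the weight of query q (1.0 when no weights are given)."""
    r = {d: 0.0 for d in all_docs}
    for query, ranked in rankings.items():
        o_q = 1.0 if query_weights is None else query_weights.get(query, 1.0)
        for doc in ranked[:c]:
            r[doc] = r.get(doc, 0.0) + o_q
    return r


def gini(values: Iterable[float]) -> float:
    """Gini coefficient: 0 means all documents are equally retrievable,
    values near 1 mean a few documents dominate the result lists."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


# Hypothetical rankings for three queries over a four-document collection.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["a", "c", "b"],
    "q3": ["a", "b", "d"],
}
scores = retrievability(rankings, all_docs=["a", "b", "c", "d"], c=2)
bias = gini(scores.values())
```

With a cutoff of 2, document `a` is retrievable for all three queries and `d` for none. Comparing such score distributions per crawl year is the kind of analysis the bullets above summarize.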
  • 36. Trade log data!
      IR-809: (2011) Feild, H., Allan, J. and Glatt, J., "CrowdLogging: Distributed, private, and anonymous search logging," Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR'11), pp. 375-384.
      “We describe an approach for distributed search log collection, storage, and mining, with the dual goals of preserving privacy and making the mined information broadly available. [..] The approach works with any search behavior artifact that can be extracted from a search log, including queries, query reformulations, and query-click pairs.”
  • 38. Open challenges
      - How to select the part of your log data you are willing to share?
      - How to estimate the value of this log data?
  • 39. Share Log Segments by Topic?
      - Represent searchers’ previous search history in the form of concise human-readable topical profiles
          - Classifier trained on ODP applied to clicked pages
      Carsten Eickhoff, Kevyn Collins-Thompson, Paul Bennett, and Susan Dumais. Designing human-readable user profiles for search evaluation (ECIR’13)
  • 40. Share Log Segments by Topic?
      - Linking a year of query logs to Wikipedia categories helped distill the segments corresponding to events like marriage, a first-born, and expat life (taxes)
          - Jiyin He, Marc Bron: Measuring Demonstrated Potential Domain Knowledge with Knowledge Graphs. KG4IR@SIGIR 2017: 13-18
  • 41. But wait… … do we REALLY need all that query log info?
  • 42. Personal search engine
      - Safely gain access to rich personal data including email, browsing history, documents read and contents of the user’s home directory
      - Can high quality evidence about an individual’s recurring long-term interests replace the shallow information of many?
  • 44. Better Search – “Deep Personalization”
      - “Even more broadly than trying to get people the right content based on their context, we as a community need to be thinking about how to support people through the entire search experience.” (Jaime Teevan on “Slow Search”)
      - Search as a dialogue
      My first journal paper: De Vries, Van der Veer and Blanken: Let’s talk about it: dialogues with multimedia databases (1998)
  • 45. “Deep Personalization”
      - How could the indexer know about the wide variety of sources and their schema information...
      - Or, how to build 1000+ search engines?!
  • 47. Engineer the Search Engine / Model the Search Engine
  • 48. “Search by Strategy”
      - “No idealized one-shot search engine”
      - Hand over control to the user (or, most likely, the search intermediary)
  • 50. “Search by Strategy”
      - “No idealized one-shot search engine”
      - Hand over control to the user (or, most likely, the search intermediary)
      - Search (and link) strategies can be shared!
  • 54. Web Archives to Lead the Revolution!
      - Two main opportunities:
          - Free us from the mass surveillance that is now the default business model of the internet
          - Improve Web Archive and Web Archive search
      - Long run: realize truly personal search engines?
  • 55. Blueprint of the Personal Search Engine
      - Decentralize search
      - Web archives to the rescue
          - Super-peers in a P2P network of personal search engines
      - “Deep personalization”
          - Exploit the rich source data that can be processed safely locally
      - A sharing economy
          - Data markets to trade log data and mutually improve your search results
