Servosystem Theory / Cybernetic Theory by Petrovic
Tpdl 2016 topic_coverage_indf-bf_crawls
1. Comparing Topic Coverage in Breadth-first & Depth-first
Crawls using Anchor Texts
Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries
2.
Web archives were initiated to preserve the fast
changing Web
Web archives are created using Web crawlers
Crawling strategy has a great influence on the
data that is archived
Web Archives
2
3. Web Crawlers
Two main strategies
Depth-first, focus only on selected web sites, as deep
as possible
Breadth-first, focus on the entire web, but not in depth
3
4. Our Study
Research Question:
How does the crawling strategy impact web
archive's coverage of past popular topics?
Challenges
How to approximate user interest when no query log is
available?
Scalability, Web archives or crawls consist of huge
amount of data.
e.g., one month snapshot from Common Crawl can be 80 TB of
compressed data.
4
5. Approach
We use hyperlinks anchor text from the raw
content of two datasets
crawled using different crawling strategies
We compare anchor texts with external sources
that identify past popular topics on the Web at the
time of our datasets
5
6. Based on What (Anchor Text)
Hyperlinks define the Web structure
link information: source URL, target URL, and the anchor text
Anchor text is a short text describing the target page
exhibits characteristics similar to user query and document
title [Eiron & McCurley, Jin et al.]
widely used in Information Retrieval to improve search
effectiveness
combined with timestamps has be used to
capture & trace entity evolution [Kanhabua and Nejdl]
uncover & provide representation of missing pages from the archive
[Klein & Nelson, Huurdeman et al.]
6
7. Anchor Text (data)
Anchor Text from two different crawls
Dutch Web archive by National Library of the
Netherlands (KB)
depth-first (selective)
+10,000 websites
Dutch domain
100 million crawled urls
Common Crawl (CC)
breadth-first
millions of websites
entire Web
2.8 billion crawled urls
7
8. Link Processing
Extraction
Cleaning
URL normalization; get host of the source and the target
Clean spam e.g., rolex watches
Aggregation
Pre-processing: lowercase, Stop-words removing &
stemming
Aggregate based on the anchor text
Deduplication
Remove duplicate links; due to crawling frequency
Same source, target, anchor text, year of crawl-date
SourceURL, targetURL, anchorText
anchorText, count
8
SourceURL, targetURL, anchorText, crawl-date (year)
9. Anchor Text Summary (1/2)
Short texts describing target pages
Available for both archived & unarchived pages
not all target pages exist in the archive
9
e.g., 6.5% of the unique
hosts of target pages of
external links were
crawled
10. Anchor Text Summary (2/2)
The breadth-first crawl (KB) and the breadth-
first crawl (CC) differ in terms of size &
coverage
links from CC is 559x times larger than the number of KB
links
KB dataset mainly covers the NL domain, and CC dataset
covers the entire Web
CC contains more hosts (websites) than the KB; source
and target hosts
10
11. For Fair Comparison
Subsets from CC dataset
NL part, we focus on links that originate from the NL
domain.
Links whose source is from the KB seeds list
11
Number of links from
the NL part of CC is
comparable to number
of links from the KB
12. Past Popular Topics (of interests to users)
Queries represent user's information need
User queries have usually not been preserved
Impossible to go back in time reconstruct which
queries the user would have used to search the
archive
12
13. Past Popular Topics (data 1/3)
Three different sources to identify past popular
topics
Wikistats
views aggregation of Wikipedia (WP) pages over time
contains WP title, number of views, timestamp, and the language
keep WP titles viewed >= 1,000 times
Google trends
top searched queries from entire world or per country
Query log from users visiting the public Dutch historic
newspaper archive
13
14. Past Popular Topics (data 2/3)
Our assumption
is that the CC (a breadth-first crawl) covers more global
topics, and that KB (a depth-first crawl) covers more topics
from NL domain
For validation:
split topic sources into topics that attracted attention in the
entire Web (global) and topics that were only picked up in
the NL domain
14
15. Past Popular Topics (data 3/3)
Wikistats
we used the language to get the Dutch Wikipedia pages
other pages, we labeled them as global
Google trends
most searched queries from the entire world
most searched queries from Netherlands
15
16. Match Anchor Text to Topics
Pre-process topics from all sources
lowercase
stop-words removing, and stemming
Exact string matching between anchor text and
topics from all sources
16
17. Topic Coverage (1/3)
The breadth-first crawl (CC) covers more topics
than depth-first crawl (KB)
for all topic sources
not only for global topics (as expected)
but also for topics from the NL domain!
17
19. Topic Coverage (3/3)
Influence of anchor text popularity in the archive
rank anchor texts based on their frequency
high percentage of anchor text occurs occurs once
match again to topics at different rank cut-off
19
20.
Anchor text combined with timestamps can be
used to find past popular topics
the % of coverage varies across the sources
we find a relation between anchor text frequency and the
percentage of coverage
Breadth-first (CC) covers more topics globally
and from the NL domain
Subsets from CC
the NL part covers more topics
the KB seeds part shows comparable results
20
21. References
[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003
[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information
retrieval. In SIGIR 2002
[Kanhabua and Nejdl] Nattiya Kanhabua and Wolfgang Nejdl. On the value of temporal anchor texts in
wikipedia. In SIGIR Workshop on Temporal, Social and Spatially-aware Information Access, 2014.
[Klein & Nelson] Martin Klein and Michael L. Nelson. Moved but not gone: an evaluation of real-time methods
for discovering replacement web pages. Int. J. on Digital Libraries, 2014.
[Huurdeman et al.] Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, and
Richard A. Rogers. Lost but not forgotten: finding pages on the unarchived web. Int. J. on Digital
Libraries, 2015.
[CommonCrawl] https://commoncrawl.org/
[WikiStats] http://wikistats.ins.cwi.nl/
22. Comparing Topic Coverage in Breadth-first & Depth-first Crawls
using Anchor Texts
We would like to thank
for making the Web archive data available for us
And
For giving us access to their Hadoop cluster