Tpdl 2016 topic_coverage_indf-bf_crawls

Comparing Topic Coverage in Breadth-first & Depth-first
Crawls using Anchor Texts
Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries


Web Archives

Web archives were initiated to preserve the fast-changing Web

Web archives are created using Web crawlers

The crawling strategy has a great influence on the data that is archived

Web Crawlers
• Two main strategies
  • Depth-first: focus only on selected websites, as deep as possible
  • Breadth-first: focus on the entire Web, but not in depth

Our Study
• Research question:
  How does the crawling strategy impact a web archive's coverage of past popular topics?
• Challenges
  • How to approximate user interest when no query log is available?
  • Scalability: Web archives or crawls consist of huge amounts of data,
    e.g., a one-month snapshot from Common Crawl can be 80 TB of compressed data.

Approach
• We use hyperlink anchor text from the raw content of two datasets
  • crawled using different crawling strategies
• We compare anchor texts with external sources that identify past popular topics on the Web at the time of our datasets

Based on What (Anchor Text)

Hyperlinks define the Web structure

link information: source URL, target URL, and the anchor text

Anchor text is a short text describing the target page

exhibits characteristics similar to user queries and document titles [Eiron & McCurley, Jin et al.]

widely used in Information Retrieval to improve search
effectiveness

combined with timestamps, has been used to

capture & trace entity evolution [Kanhabua and Nejdl]

uncover & provide representation of missing pages from the archive
[Klein & Nelson, Huurdeman et al.]

Anchor Text (data)

Anchor Text from two different crawls

Dutch Web archive by National Library of the
Netherlands (KB)

depth-first (selective)

+10,000 websites

Dutch domain

100 million crawled urls

Common Crawl (CC)

breadth-first

millions of websites

entire Web

2.8 billion crawled urls

Link Processing
Extraction
Cleaning

URL normalization; get host of the source and the target

Clean spam, e.g., rolex watches
Aggregation

Pre-processing: lowercasing, stop-word removal & stemming

Aggregate based on the anchor text
Deduplication

Remove duplicate links, which arise from the crawling frequency

Duplicates share the same source, target, anchor text, and year of crawl date

Record formats:
SourceURL, targetURL, anchorText, crawl-date (year)
SourceURL, targetURL, anchorText
anchorText, count
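The link-processing steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the stop-word list, the spam list, the `toy_stem` placeholder stemmer, and all function names are assumptions introduced here, and the steps are applied in the order cleaning, deduplication, aggregation.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

STOP_WORDS = {"the", "a", "an", "of", "and", "in"}  # tiny illustrative list (assumption)
SPAM_TERMS = {"rolex watches"}  # the slide names only this one example


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment, use '/' for an empty path."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def toy_stem(token):
    """Placeholder stemmer (assumption): strips a trailing 's' from longer tokens."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def preprocess(text):
    """Lowercase, remove stop words, and stem each remaining token."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(toy_stem(t) for t in tokens)


def process_links(links):
    """links: iterable of (source_url, target_url, anchor_text, crawl_year).

    Returns a Counter mapping each pre-processed anchor text to its count.
    """
    seen = set()
    counts = Counter()
    for src, tgt, anchor, year in links:
        # Cleaning: skip anchors on the (illustrative) spam list.
        if anchor.lower().strip() in SPAM_TERMS:
            continue
        # Deduplication: same source, target, anchor text, and crawl year.
        key = (normalize_url(src), normalize_url(tgt), anchor, year)
        if key in seen:
            continue
        seen.add(key)
        # Aggregation: count links per pre-processed anchor text.
        norm = preprocess(anchor)
        if norm:
            counts[norm] += 1
    return counts
```

At the scale mentioned earlier (billions of URLs), the same logic would have to run as a distributed job; the in-memory `set` and `Counter` here only illustrate the record flow.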

Anchor Text Summary (1/2)

Short texts describing target pages

Available for both archived & unarchived pages

not all target pages exist in the archive, e.g., 6.5% of the unique hosts of target pages of external links were crawled
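A statistic of this kind could be computed roughly as follows. This is a hedged sketch: the function names are hypothetical, and treating a link as "external" when its source and target hosts differ is an assumption, not a definition taken from the slides.

```python
from urllib.parse import urlsplit


def host(url):
    """Return the lowercased host of a URL ('' if none can be parsed)."""
    return (urlsplit(url).hostname or "").lower()


def crawled_target_host_share(links, crawled_urls):
    """links: iterable of (source_url, target_url) pairs.

    Returns the fraction of unique target hosts of external links
    that also appear among the hosts of the crawled URLs.
    """
    crawled_hosts = {host(u) for u in crawled_urls}
    # Assumption: a link is external when source and target hosts differ.
    target_hosts = {host(t) for s, t in links if host(s) != host(t)}
    target_hosts.discard("")
    if not target_hosts:
        return 0.0
    return len(target_hosts & crawled_hosts) / len(target_hosts)
```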

Anchor Text Summary (2/2)

The depth-first crawl (KB) and the breadth-first crawl (CC) differ in terms of size & coverage

The number of links from CC is 559 times larger than the number of KB links

KB dataset mainly covers the NL domain, and CC dataset
covers the entire Web

CC contains more hosts (websites) than the KB, both source and target hosts

For Fair Comparison

Subsets from CC dataset

NL part: links that originate from the NL domain

KB seeds part: links whose source is in the KB seeds list

The number of links from the NL part of CC is comparable to the number of links from the KB
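The two subsets could be selected along these lines. This is a sketch under stated assumptions: "NL domain" is read as a source host under the `.nl` top-level domain, the seeds list is compared at host level, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def host(url):
    """Return the lowercased host of a URL ('' if none can be parsed)."""
    return (urlsplit(url).hostname or "").lower()


def nl_part(links):
    """Keep links whose source host is under the .nl top-level domain (assumption)."""
    return [link for link in links if host(link[0]).endswith(".nl")]


def kb_seeds_part(links, seed_hosts):
    """Keep links whose source host appears in the given seed host list."""
    seeds = {h.lower() for h in seed_hosts}
    return [link for link in links if host(link[0]) in seeds]
```

Each link is a tuple whose first element is the source URL, so both filters work on (source, target, anchor text) records as well as on plain (source, target) pairs.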

Past Popular Topics (of interest to users)

Queries represent users' information needs

User queries have usually not been preserved

Impossible to go back in time and reconstruct which queries users would have used to search the archive

Past Popular Topics (data 1/3)

Three different sources to identify past popular
topics

Wikistats

aggregated view counts of Wikipedia (WP) pages over time

contains WP title, number of views, timestamp, and the language

keep WP titles viewed >= 1,000 times

Google Trends

top searched queries worldwide or per country

Query log from users visiting the public Dutch historic
newspaper archive


Our assumption is that CC (a breadth-first crawl) covers more global topics, and that KB (a depth-first crawl) covers more topics from the NL domain

For validation:

split topic sources into topics that attracted attention in the
entire Web (global) and topics that were only picked up in
the NL domain

• Wikistats
  • we used the language to select the Dutch Wikipedia pages
  • all other pages were labeled as global
• Google Trends
  • most searched queries from the entire world
  • most searched queries from the Netherlands
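For the Wikistats source, the view threshold and the language-based split might look like the sketch below. The record layout `(title, views, language)`, the language code `"nl"` for Dutch Wikipedia, and the function name are assumptions; the view counts are assumed to be already aggregated per title and language.

```python
def split_wikistats(records, min_views=1000, dutch_code="nl"):
    """records: iterable of (title, views, language) tuples.

    Keeps titles viewed at least `min_views` times and splits them into
    NL topics (Dutch Wikipedia) and global topics (every other language).
    """
    nl_topics, global_topics = set(), set()
    for title, views, language in records:
        if views < min_views:
            continue  # below the popularity threshold
        if language == dutch_code:
            nl_topics.add(title)
        else:
            global_topics.add(title)
    return nl_topics, global_topics
```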

Match Anchor Text to Topics
• Pre-process topics from all sources
  • lowercasing
  • stop-word removal and stemming
• Exact string matching between anchor text and topics from all sources
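The matching step can be illustrated with a small Python sketch. It assumes that coverage is the percentage of topics for which at least one anchor text is an exact match after pre-processing; the stop-word list and function names are hypothetical, and stemming is left as a pluggable function that defaults to no stemming.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}  # tiny illustrative list (assumption)


def normalize(text, stem=lambda token: token):
    """Lowercase, drop stop words, and apply the given stemmer to each token."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(stem(t) for t in tokens)


def topic_coverage(anchor_texts, topics, stem=lambda token: token):
    """Return the percentage of topics that exactly match some anchor text."""
    anchors = {normalize(a, stem) for a in anchor_texts}
    wanted = {normalize(t, stem) for t in topics}
    wanted.discard("")  # ignore topics that consist only of stop words
    if not wanted:
        return 0.0
    return 100.0 * len(wanted & anchors) / len(wanted)
```

Because both sides are reduced to sets of normalized strings, the match is a set intersection, which keeps the comparison cheap even for large anchor-text vocabularies.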

Topic Coverage (1/3)

The breadth-first crawl (CC) covers more topics than the depth-first crawl (KB)

for all topic sources

not only for global topics (as expected)

but also for topics from the NL domain!


Subsets from the CC dataset have comparable
results
[Figure panels: NL part, KB seeds]


Influence of anchor text popularity in the archive

rank anchor texts based on their frequency

a high percentage of anchor texts occur only once

match again to topics at different rank cut-offs
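One way to read the rank cut-off analysis is sketched below: anchor texts are ranked by frequency, and topic coverage is recomputed using only the top-k anchor texts. This interpretation of "rank cut-off", the function name, and the tie-breaking rule are assumptions; anchor texts and topics are assumed to be pre-processed already.

```python
def coverage_at_cutoffs(anchor_counts, topics, cutoffs):
    """anchor_counts: dict mapping anchor text to its frequency.

    Returns {k: percentage of topics matched by the k most frequent anchor texts}.
    """
    # Rank by descending frequency; ties are broken alphabetically for determinism.
    ranked = [anchor for anchor, _ in
              sorted(anchor_counts.items(), key=lambda item: (-item[1], item[0]))]
    wanted = set(topics)
    result = {}
    for k in cutoffs:
        top_k = set(ranked[:k])
        result[k] = 100.0 * len(wanted & top_k) / len(wanted) if wanted else 0.0
    return result
```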


Anchor text combined with timestamps can be
used to find past popular topics

the % of coverage varies across the sources

we find a relation between anchor text frequency and the
percentage of coverage

Breadth-first (CC) covers more topics globally
and from the NL domain

Subsets from CC

the NL part covers more topics

the KB seeds part shows comparable results

References
[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR, 2003.
[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR, 2002.
[Kanhabua and Nejdl] Nattiya Kanhabua and Wolfgang Nejdl. On the value of temporal anchor texts in Wikipedia. In SIGIR Workshop on Temporal, Social and Spatially-aware Information Access, 2014.
[Klein & Nelson] Martin Klein and Michael L. Nelson. Moved but not gone: an evaluation of real-time methods for discovering replacement web pages. Int. J. on Digital Libraries, 2014.
[Huurdeman et al.] Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, and Richard A. Rogers. Lost but not forgotten: finding pages on the unarchived web. Int. J. on Digital Libraries, 2015.
[CommonCrawl] https://commoncrawl.org/
[WikiStats] http://wikistats.ins.cwi.nl/

Comparing Topic Coverage in Breadth-first & Depth-first Crawls
using Anchor Texts
We would like to thank
for making the Web archive data available for us
And
For giving us access to their Hadoop cluster
