SlideShare a Scribd company logo
1 of 22
Download to read offline
Comparing Topic Coverage in Breadth-first & Depth-first
Crawls using Anchor Texts
Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries

Web archives were initiated to preserve the fast
changing Web

Web archives are created using Web crawlers

Crawling strategy has a great influence on the
data that is archived
Web Archives
2
Web Crawlers
 Two main strategies
 Depth-first, focus only on selected web sites, as deep
as possible
 Breadth-first, focus on the entire web, but not in depth
3
Our Study
 Research Question:
How does the crawling strategy impact web
archive's coverage of past popular topics?
 Challenges
 How to approximate user interest when no query log is
available?
 Scalability, Web archives or crawls consist of huge
amount of data.
 e.g., one month snapshot from Common Crawl can be 80 TB of
compressed data.
4
Approach
 We use hyperlinks anchor text from the raw
content of two datasets
 crawled using different crawling strategies
 We compare anchor texts with external sources
that identify past popular topics on the Web at the
time of our datasets
5
Based on What (Anchor Text)

Hyperlinks define the Web structure

link information: source URL, target URL, and the anchor text

Anchor text is a short text describing the target page

exhibits characteristics similar to user query and document
title [Eiron & McCurley, Jin et al.]

widely used in Information Retrieval to improve search
effectiveness

combined with timestamps has be used to

capture & trace entity evolution [Kanhabua and Nejdl]

uncover & provide representation of missing pages from the archive
[Klein & Nelson, Huurdeman et al.]
6
Anchor Text (data)

Anchor Text from two different crawls

Dutch Web archive by National Library of the
Netherlands (KB)

depth-first (selective)

+10,000 websites

Dutch domain

100 million crawled urls

Common Crawl (CC)

breadth-first

millions of websites

entire Web

2.8 billion crawled urls
7
Link Processing
Extraction
Cleaning

URL normalization; get host of the source and the target

Clean spam e.g., rolex watches
Aggregation

Pre-processing: lowercase, Stop-words removing &
stemming

Aggregate based on the anchor text
Deduplication

Remove duplicate links; due to crawling frequency

Same source, target, anchor text, year of crawl-date
SourceURL, targetURL, anchorText
anchorText, count
8
SourceURL, targetURL, anchorText, crawl-date (year)
Anchor Text Summary (1/2)

Short texts describing target pages

Available for both archived & unarchived pages

not all target pages exist in the archive
9
e.g., 6.5% of the unique
hosts of target pages of
external links were
crawled
Anchor Text Summary (2/2)

The breadth-first crawl (KB) and the breadth-
first crawl (CC) differ in terms of size &
coverage

links from CC is 559x times larger than the number of KB
links

KB dataset mainly covers the NL domain, and CC dataset
covers the entire Web

CC contains more hosts (websites) than the KB; source
and target hosts
10
For Fair Comparison

Subsets from CC dataset

NL part, we focus on links that originate from the NL
domain.

Links whose source is from the KB seeds list
11
Number of links from
the NL part of CC is
comparable to number
of links from the KB
Past Popular Topics (of interests to users)

Queries represent user's information need

User queries have usually not been preserved

Impossible to go back in time reconstruct which
queries the user would have used to search the
archive
12
Past Popular Topics (data 1/3)

Three different sources to identify past popular
topics

Wikistats

views aggregation of Wikipedia (WP) pages over time

contains WP title, number of views, timestamp, and the language

keep WP titles viewed >= 1,000 times

Google trends

top searched queries from entire world or per country

Query log from users visiting the public Dutch historic
newspaper archive
13
Past Popular Topics (data 2/3)

Our assumption

is that the CC (a breadth-first crawl) covers more global
topics, and that KB (a depth-first crawl) covers more topics
from NL domain

For validation:

split topic sources into topics that attracted attention in the
entire Web (global) and topics that were only picked up in
the NL domain
14
Past Popular Topics (data 3/3)
 Wikistats
 we used the language to get the Dutch Wikipedia pages
 other pages, we labeled them as global
 Google trends
 most searched queries from the entire world
 most searched queries from Netherlands
15
Match Anchor Text to Topics
 Pre-process topics from all sources
 lowercase
 stop-words removing, and stemming
 Exact string matching between anchor text and
topics from all sources
16
Topic Coverage (1/3)

The breadth-first crawl (CC) covers more topics
than depth-first crawl (KB)

for all topic sources

not only for global topics (as expected)

but also for topics from the NL domain!
17
Topic Coverage (2/3)

Subsets from the CC dataset have comparable
results
18
NL part KB seeds
Topic Coverage (3/3)

Influence of anchor text popularity in the archive

rank anchor texts based on their frequency

high percentage of anchor text occurs occurs once

match again to topics at different rank cut-off
19

Anchor text combined with timestamps can be
used to find past popular topics

the % of coverage varies across the sources

we find a relation between anchor text frequency and the
percentage of coverage

Breadth-first (CC) covers more topics globally
and from the NL domain

Subsets from CC

the NL part covers more topics

the KB seeds part shows comparable results
20
References
[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003
[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information
retrieval. In SIGIR 2002
[Kanhabua and Nejdl] Nattiya Kanhabua and Wolfgang Nejdl. On the value of temporal anchor texts in
wikipedia. In SIGIR Workshop on Temporal, Social and Spatially-aware Information Access, 2014.
[Klein & Nelson] Martin Klein and Michael L. Nelson. Moved but not gone: an evaluation of real-time methods
for discovering replacement web pages. Int. J. on Digital Libraries, 2014.
[Huurdeman et al.] Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, and
Richard A. Rogers. Lost but not forgotten: finding pages on the unarchived web. Int. J. on Digital
Libraries, 2015.
[CommonCrawl] https://commoncrawl.org/
[WikiStats] http://wikistats.ins.cwi.nl/
Comparing Topic Coverage in Breadth-first & Depth-first Crawls
using Anchor Texts
We would like to thank
for making the Web archive data available for us
And
For giving us access to their Hadoop cluster

More Related Content

Similar to Tpdl 2016 topic_coverage_indf-bf_crawls

Temporal Anchor Text as Proxy for user Queries
Temporal Anchor Text as Proxy for user Queries Temporal Anchor Text as Proxy for user Queries
Temporal Anchor Text as Proxy for user Queries Thaer Samar
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
The WARCnet Code Book of web archive data formats
The WARCnet Code Book of web archive data formatsThe WARCnet Code Book of web archive data formats
The WARCnet Code Book of web archive data formatsWARCnet
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archiveLewis Crawford
 
Towards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the WebTowards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the WebStefan Dietze
 
Collision course presentation (corrrect)
Collision course presentation (corrrect)Collision course presentation (corrrect)
Collision course presentation (corrrect)William Worford
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexingKhwaja Aamer
 
Compressed full text indexes
Compressed full text indexesCompressed full text indexes
Compressed full text indexesunyil96
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better Michael Nelson
 
Uk discovery-jisc-project-showcase
Uk discovery-jisc-project-showcaseUk discovery-jisc-project-showcase
Uk discovery-jisc-project-showcaseRDTF-Discovery
 
Best Practices for Descriptive Metadata for Web Archiving
Best Practices for Descriptive Metadata for Web ArchivingBest Practices for Descriptive Metadata for Web Archiving
Best Practices for Descriptive Metadata for Web ArchivingOCLC
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)Vladimir Alexiev, PhD, PMP
 
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintSw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintokeee
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAnkur Biswas
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...Micah Altman
 
CLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationCLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationEnno Meijers
 
Identifying Topics in Social Media Posts using DBpedia
Identifying Topics in Social Media Posts using DBpediaIdentifying Topics in Social Media Posts using DBpedia
Identifying Topics in Social Media Posts using DBpediaÓscar Muñoz García
 
The significance of linking
The significance of linkingThe significance of linking
The significance of linkingunyil96
 

Similar to Tpdl 2016 topic_coverage_indf-bf_crawls (20)

Temporal Anchor Text as Proxy for user Queries
Temporal Anchor Text as Proxy for user Queries Temporal Anchor Text as Proxy for user Queries
Temporal Anchor Text as Proxy for user Queries
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
The WARCnet Code Book of web archive data formats
The WARCnet Code Book of web archive data formatsThe WARCnet Code Book of web archive data formats
The WARCnet Code Book of web archive data formats
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
 
Towards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the WebTowards embedded Markup of Learning Resources on the Web
Towards embedded Markup of Learning Resources on the Web
 
Collision course presentation (corrrect)
Collision course presentation (corrrect)Collision course presentation (corrrect)
Collision course presentation (corrrect)
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexing
 
Compressed full text indexes
Compressed full text indexesCompressed full text indexes
Compressed full text indexes
 
Longwell final ppt
Longwell final pptLongwell final ppt
Longwell final ppt
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
 
Uk discovery-jisc-project-showcase
Uk discovery-jisc-project-showcaseUk discovery-jisc-project-showcase
Uk discovery-jisc-project-showcase
 
Best Practices for Descriptive Metadata for Web Archiving
Best Practices for Descriptive Metadata for Web ArchivingBest Practices for Descriptive Metadata for Web Archiving
Best Practices for Descriptive Metadata for Web Archiving
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)
 
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintSw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
 
CLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationCLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage information
 
Identifying Topics in Social Media Posts using DBpedia
Identifying Topics in Social Media Posts using DBpediaIdentifying Topics in Social Media Posts using DBpedia
Identifying Topics in Social Media Posts using DBpedia
 
The significance of linking
The significance of linkingThe significance of linking
The significance of linking
 

Recently uploaded

User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squaresusmanzain586
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxzaydmeerab121
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Biological classification of plants with detail
Biological classification of plants with detailBiological classification of plants with detail
Biological classification of plants with detailhaiderbaloch3
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
Quarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsQuarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsCharlene Llagas
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxmaryFF1
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests GlycosidesNandakishor Bhaurao Deshmukh
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicAditi Jain
 

Recently uploaded (20)

User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
AZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTXAZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTX
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squares
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptx
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Biological classification of plants with detail
Biological classification of plants with detailBiological classification of plants with detail
Biological classification of plants with detail
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
Quarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsQuarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and Functions
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by Petrovic
 

Tpdl 2016 topic_coverage_indf-bf_crawls

  • 1. Comparing Topic Coverage in Breadth-first & Depth-first Crawls using Anchor Texts Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries
  • 2.  Web archives were initiated to preserve the fast changing Web  Web archives are created using Web crawlers  Crawling strategy has a great influence on the data that is archived Web Archives 2
  • 3. Web Crawlers  Two main strategies  Depth-first, focus only on selected web sites, as deep as possible  Breadth-first, focus on the entire web, but not in depth 3
  • 4. Our Study  Research Question: How does the crawling strategy impact web archive's coverage of past popular topics?  Challenges  How to approximate user interest when no query log is available?  Scalability, Web archives or crawls consist of huge amount of data.  e.g., one month snapshot from Common Crawl can be 80 TB of compressed data. 4
  • 5. Approach  We use hyperlinks anchor text from the raw content of two datasets  crawled using different crawling strategies  We compare anchor texts with external sources that identify past popular topics on the Web at the time of our datasets 5
  • 6. Based on What (Anchor Text)  Hyperlinks define the Web structure  link information: source URL, target URL, and the anchor text  Anchor text is a short text describing the target page  exhibits characteristics similar to user query and document title [Eiron & McCurley, Jin et al.]  widely used in Information Retrieval to improve search effectiveness  combined with timestamps has be used to  capture & trace entity evolution [Kanhabua and Nejdl]  uncover & provide representation of missing pages from the archive [Klein & Nelson, Huurdeman et al.] 6
  • 7. Anchor Text (data)  Anchor Text from two different crawls  Dutch Web archive by National Library of the Netherlands (KB)  depth-first (selective)  +10,000 websites  Dutch domain  100 million crawled urls  Common Crawl (CC)  breadth-first  millions of websites  entire Web  2.8 billion crawled urls 7
  • 8. Link Processing Extraction Cleaning  URL normalization; get host of the source and the target  Clean spam e.g., rolex watches Aggregation  Pre-processing: lowercase, Stop-words removing & stemming  Aggregate based on the anchor text Deduplication  Remove duplicate links; due to crawling frequency  Same source, target, anchor text, year of crawl-date SourceURL, targetURL, anchorText anchorText, count 8 SourceURL, targetURL, anchorText, crawl-date (year)
  • 9. Anchor Text Summary (1/2)  Short texts describing target pages  Available for both archived & unarchived pages  not all target pages exist in the archive 9 e.g., 6.5% of the unique hosts of target pages of external links were crawled
  • 10. Anchor Text Summary (2/2)  The breadth-first crawl (KB) and the breadth- first crawl (CC) differ in terms of size & coverage  links from CC is 559x times larger than the number of KB links  KB dataset mainly covers the NL domain, and CC dataset covers the entire Web  CC contains more hosts (websites) than the KB; source and target hosts 10
  • 11. For Fair Comparison  Subsets from CC dataset  NL part, we focus on links that originate from the NL domain.  Links whose source is from the KB seeds list 11 Number of links from the NL part of CC is comparable to number of links from the KB
  • 12. Past Popular Topics (of interests to users)  Queries represent user's information need  User queries have usually not been preserved  Impossible to go back in time reconstruct which queries the user would have used to search the archive 12
  • 13. Past Popular Topics (data 1/3)  Three different sources to identify past popular topics  Wikistats  views aggregation of Wikipedia (WP) pages over time  contains WP title, number of views, timestamp, and the language  keep WP titles viewed >= 1,000 times  Google trends  top searched queries from entire world or per country  Query log from users visiting the public Dutch historic newspaper archive 13
  • 14. Past Popular Topics (data 2/3)  Our assumption  is that the CC (a breadth-first crawl) covers more global topics, and that KB (a depth-first crawl) covers more topics from NL domain  For validation:  split topic sources into topics that attracted attention in the entire Web (global) and topics that were only picked up in the NL domain 14
  • 15. Past Popular Topics (data 3/3)  Wikistats  we used the language to get the Dutch Wikipedia pages  other pages, we labeled them as global  Google trends  most searched queries from the entire world  most searched queries from Netherlands 15
  • 16. Match Anchor Text to Topics  Pre-process topics from all sources  lowercase  stop-words removing, and stemming  Exact string matching between anchor text and topics from all sources 16
  • 17. Topic Coverage (1/3)  The breadth-first crawl (CC) covers more topics than depth-first crawl (KB)  for all topic sources  not only for global topics (as expected)  but also for topics from the NL domain! 17
  • 18. Topic Coverage (2/3)  Subsets from the CC dataset have comparable results 18 NL part KB seeds
  • 19. Topic Coverage (3/3)  Influence of anchor text popularity in the archive  rank anchor texts based on their frequency  high percentage of anchor text occurs occurs once  match again to topics at different rank cut-off 19
  • 20.  Anchor text combined with timestamps can be used to find past popular topics  the % of coverage varies across the sources  we find a relation between anchor text frequency and the percentage of coverage  Breadth-first (CC) covers more topics globally and from the NL domain  Subsets from CC  the NL part covers more topics  the KB seeds part shows comparable results 20
  • 21. References [Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003 [Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR 2002 [Kanhabua and Nejdl] Nattiya Kanhabua and Wolfgang Nejdl. On the value of temporal anchor texts in wikipedia. In SIGIR Workshop on Temporal, Social and Spatially-aware Information Access, 2014. [Klein & Nelson] Martin Klein and Michael L. Nelson. Moved but not gone: an evaluation of real-time methods for discovering replacement web pages. Int. J. on Digital Libraries, 2014. [Huurdeman et al.] Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, and Richard A. Rogers. Lost but not forgotten: finding pages on the unarchived web. Int. J. on Digital Libraries, 2015. [CommonCrawl] https://commoncrawl.org/ [WikiStats] http://wikistats.ins.cwi.nl/
  • 22. Comparing Topic Coverage in Breadth-first & Depth-first Crawls using Anchor Texts We would like to thank for making the Web archive data available for us And For giving us access to their Hadoop cluster