Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from these documents is essential for the performance of derived applications. To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text elements in an HTML page as either boilerplate or main content. Our method uses convolutional networks on top of DOM tree features to learn unary classification potentials for each block of text on the page and pairwise potentials for each pair of neighboring text blocks. We find the most likely labeling according to these potentials using the Viterbi algorithm.
The proposed method improves page cleaning performance on the CleanEval benchmark compared to the state of the art. As a component of information retrieval pipelines, it improves retrieval performance on the ClueWeb12 collection.
Web2Text: Deep Structured Boilerplate Removal
1. Web2Text: Deep Structured Boilerplate Removal
Thijs Vogels¹, Octavian-Eugen Ganea², Carsten Eickhoff³
¹ Disney Research, ² ETH Zurich, ³ Brown University
21. Unary Features (128) per Block
● Avg. word length
● Stopword ratio
● Numeric character ratio
● Relative distance from root
● Parent/grandparent information
● ...
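The text-statistics features above can be sketched in a few lines. A minimal Python illustration; the stopword list and the exact normalization are assumptions of this sketch, not the paper's definitions:

```python
# Sketch of a few of the hand-crafted unary features listed above.
# STOPWORDS is a tiny illustrative list, not the one used by Web2Text.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def unary_features(text):
    """Compute example unary features for one text block."""
    words = text.split()
    n_words = max(len(words), 1)   # avoid division by zero on empty blocks
    n_chars = max(len(text), 1)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "stopword_ratio": sum(w.lower() in STOPWORDS for w in words) / n_words,
        "numeric_ratio": sum(c.isdigit() for c in text) / n_chars,
    }

# Boilerplate-like blocks tend to have high numeric ratios and few stopwords.
feats = unary_features("Copyright 2018 ACM, page 42 of 100")
```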
22. Pairwise Features (25) per Neighboring Pair
● Tree distance in CDOM
● Are blocks separated by line break?
● Node features of common ancestor...
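The CDOM tree-distance feature can be illustrated as follows. Representing each text block by its root-to-leaf path of node identifiers is an assumption of this sketch:

```python
# Illustrative sketch of the "tree distance in CDOM" pairwise feature:
# the number of edges between two text blocks in the collapsed DOM tree.

def tree_distance(path_a, path_b):
    """Edges between two leaves: total depth of both leaves minus
    twice the depth of their lowest common ancestor."""
    common = 0
    for u, v in zip(path_a, path_b):
        if u != v:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)

# e.g. <html><body><div><p>A</p><p>B</p></div></body></html>
a = ["html", "body", "div", "p#1"]
b = ["html", "body", "div", "p#2"]
```

Neighboring blocks with a small tree distance (siblings, here distance 2) are more likely to share the same label than blocks far apart in the tree.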
24. Representation Learning
● 128/25 hand-crafted features
● Consolidate “raw” features in 2 CNNs:
● Unary
○ 5 layers
○ ReLU
○ Softmax
○ Cross-entropy loss
○ Outputs:
■ p_i(l_i = 1)
■ p_i(l_i = 0)
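A toy numpy sketch of such a unary network: stacked 1-D convolutions over the block sequence with ReLU activations and a final softmax over the two labels. The layer count and widths here are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """'Same'-padded 1-D convolution along the block sequence.
    x: (n_blocks, c_in), w: (kernel, c_in, c_out), b: (c_out,)"""
    k, c_in, c_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], c_out))
    for i in range(x.shape[0]):
        out[i] = np.tensordot(xp[i:i + k], w, axes=([0, 1], [0, 1])) + b
    return out

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy unary net: conv + ReLU, then a conv to 2 logits per block.
x = rng.normal(size=(10, 128))           # 10 blocks, 128 unary features
w1, b1 = rng.normal(size=(3, 128, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(3, 16, 2)), np.zeros(2)
h = np.maximum(conv1d(x, w1, b1), 0.0)   # ReLU
p = softmax(conv1d(h, w2, b2))           # p[i] = (p_i(l_i=0), p_i(l_i=1))
```

Training would minimize the cross-entropy between these per-block distributions and the gold labels.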
25. Representation Learning
● 128/25 hand-crafted features
● Consolidate “raw” features in 2 CNNs:
● Pairwise
○ 5 layers
○ ReLU
○ Softmax
○ Cross-entropy loss
○ Outputs:
■ p_{i,i+1}(l_i = 1, l_{i+1} = 1)
■ p_{i,i+1}(l_i = 1, l_{i+1} = 0)
■ p_{i,i+1}(l_i = 0, l_{i+1} = 1)
■ p_{i,i+1}(l_i = 0, l_{i+1} = 0)
● Unary
○ 5 layers
○ ReLU
○ Softmax
○ Cross-entropy loss
○ Outputs:
■ p_i(l_i = 1)
■ p_i(l_i = 0)
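At inference time, the unary and pairwise outputs are combined by Viterbi decoding over the chain of text blocks. A minimal numpy sketch assuming log-space potentials (the paper additionally weights the unary and pairwise terms against each other; that trade-off is omitted here):

```python
import numpy as np

def viterbi_binary(log_unary, log_pair):
    """Most likely 0/1 labeling of a block sequence.
    log_unary: (n, 2) = log p_i(l_i)
    log_pair:  (n-1, 2, 2) = log p_{i,i+1}(l_i, l_{i+1})"""
    n = log_unary.shape[0]
    score = log_unary[0].copy()          # best score ending in each label
    back = np.zeros((n, 2), dtype=int)   # backpointers
    for i in range(1, n):
        # cand[prev, cur] = score up to i-1 + transition + unary at i
        cand = score[:, None] + log_pair[i - 1] + log_unary[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    labels = [int(score.argmax())]
    for i in range(n - 1, 0, -1):        # follow backpointers
        labels.append(int(back[i][labels[-1]]))
    return labels[::-1]

# Unary scores favor blocks 0-1 as content and block 2 as boilerplate;
# with neutral pairwise terms the decoder returns [1, 1, 0].
path = viterbi_binary(np.log([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]),
                      np.zeros((2, 2, 2)))
```

Non-trivial pairwise terms make the decoding smooth the labels, discouraging isolated content blocks inside long boilerplate runs.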
39. Experiment II: Document Retrieval
● Low-recall extractors hurt retrieval performance (BTE, art, text, Unfluff)
● CRF and Web2Text extraction significantly better than raw content indexing
40. Experiment II: Document Retrieval
● Web2Text extraction significantly better than all compared methods
41. Run Times
● Average runtime per Web page
● MacBook with 2.8 GHz Intel Core i5 processor
● Global: 54 ms
○ DOM parsing & feature extraction: 35 ms
○ NN forward pass & Viterbi: 19 ms
42. Conclusion
● Deep structure prediction pipeline for Web content extraction
○ Collapsed DOMs
○ Unary and pairwise potentials
○ CNN representation learning
○ HMM-based inference
● State-of-the-art content extraction performance on CleanEval
● Can translate into increased downstream effectiveness