Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from these documents is essential for the performance of derived applications. To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text elements in an HTML page as either boilerplate or main content. Our method uses convolutional networks on top of DOM tree features to learn unary classification potentials for each block of text on the page and pairwise potentials for each pair of neighboring text blocks. We find the most likely labeling according to these potentials using the Viterbi algorithm.
The proposed method improves page cleaning performance on the CleanEval benchmark compared to the state of the art. As a component of information retrieval pipelines, it improves retrieval performance on the ClueWeb12 collection.
Web2Text: Deep Structured Boilerplate Removal
1. Web2Text: Deep Structured Boilerplate Removal
Thijs Vogels¹, Octavian-Eugen Ganea², Carsten Eickhoff³
¹ Disney Research, ² ETH Zurich, ³ Brown University
21. Unary Features (128) per Block
● Avg. word length
● Stopword ratio
● Numeric character ratio
● Relative distance from root
● Parent/grandparent information
● ...
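The text-statistics features above can be sketched in a few lines. A minimal Python illustration; the stopword list and the exact normalization are assumptions of this sketch, not the paper's definitions:

```python
# Sketch of a few of the hand-crafted unary features listed above.
# STOPWORDS is a tiny illustrative list, not the one used by Web2Text.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def unary_features(text):
    """Compute example unary features for one text block."""
    words = text.split()
    n_words = max(len(words), 1)   # avoid division by zero on empty blocks
    n_chars = max(len(text), 1)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "stopword_ratio": sum(w.lower() in STOPWORDS for w in words) / n_words,
        "numeric_ratio": sum(c.isdigit() for c in text) / n_chars,
    }

# Boilerplate-like blocks tend to have high numeric ratios and few stopwords.
feats = unary_features("Copyright 2018 ACM, page 42 of 100")
```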
22. Pairwise Features (25) per Neighboring Pair
● Tree distance in CDOM
● Are blocks separated by line break?
● Node features of common ancestor...
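The CDOM tree-distance feature can be illustrated as follows. Representing each text block by its root-to-leaf path of node identifiers is an assumption of this sketch:

```python
# Illustrative sketch of the "tree distance in CDOM" pairwise feature:
# the number of edges between two text blocks in the collapsed DOM tree.

def tree_distance(path_a, path_b):
    """Edges between two leaves: total depth of both leaves minus
    twice the depth of their lowest common ancestor."""
    common = 0
    for u, v in zip(path_a, path_b):
        if u != v:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)

# e.g. <html><body><div><p>A</p><p>B</p></div></body></html>
a = ["html", "body", "div", "p#1"]
b = ["html", "body", "div", "p#2"]
```

Neighboring blocks with a small tree distance (siblings, here distance 2) are more likely to share the same label than blocks far apart in the tree.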
24. Representation Learning
● 128/25 hand-crafted features
● Consolidate “raw” features in 2 CNNs:
● Unary
○ 5 layers
○ ReLU
○ Softmax
○ Cross-entropy loss
○ Outputs:
■ p_i(l_i = 1)
■ p_i(l_i = 0)
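A toy numpy sketch of such a unary network: stacked 1-D convolutions over the block sequence with ReLU activations and a final softmax over the two labels. The layer count and widths here are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """'Same'-padded 1-D convolution along the block sequence.
    x: (n_blocks, c_in), w: (kernel, c_in, c_out), b: (c_out,)"""
    k, c_in, c_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], c_out))
    for i in range(x.shape[0]):
        out[i] = np.tensordot(xp[i:i + k], w, axes=([0, 1], [0, 1])) + b
    return out

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy unary net: conv + ReLU, then a conv to 2 logits per block.
x = rng.normal(size=(10, 128))           # 10 blocks, 128 unary features
w1, b1 = rng.normal(size=(3, 128, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(3, 16, 2)), np.zeros(2)
h = np.maximum(conv1d(x, w1, b1), 0.0)   # ReLU
p = softmax(conv1d(h, w2, b2))           # p[i] = (p_i(l_i=0), p_i(l_i=1))
```

Training would minimize the cross-entropy between these per-block distributions and the gold labels.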
25. Representation Learning
● 128/25 hand-crafted features
● Consolidate “raw” features in 2 CNNs:
● Pairwise
○ 5 layers
○ ReLU
○ Softmax
○ Cross-entropy loss
○ Outputs:
■ p_{i,i+1}(l_i = 1, l_{i+1} = 1)
■ p_{i,i+1}(l_i = 1, l_{i+1} = 0)
■ p_{i,i+1}(l_i = 0, l_{i+1} = 1)
■ p_{i,i+1}(l_i = 0, l_{i+1} = 0)
● Unary
○ 5 layers
○ ReLU
○ Softmax
○ Cross-entropy loss
○ Outputs:
■ p_i(l_i = 1)
■ p_i(l_i = 0)
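At inference time, the unary and pairwise outputs are combined by Viterbi decoding over the chain of text blocks. A minimal numpy sketch assuming log-space potentials (the paper additionally weights the unary and pairwise terms against each other; that trade-off is omitted here):

```python
import numpy as np

def viterbi_binary(log_unary, log_pair):
    """Most likely 0/1 labeling of a block sequence.
    log_unary: (n, 2) = log p_i(l_i)
    log_pair:  (n-1, 2, 2) = log p_{i,i+1}(l_i, l_{i+1})"""
    n = log_unary.shape[0]
    score = log_unary[0].copy()          # best score ending in each label
    back = np.zeros((n, 2), dtype=int)   # backpointers
    for i in range(1, n):
        # cand[prev, cur] = score up to i-1 + transition + unary at i
        cand = score[:, None] + log_pair[i - 1] + log_unary[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    labels = [int(score.argmax())]
    for i in range(n - 1, 0, -1):        # follow backpointers
        labels.append(int(back[i][labels[-1]]))
    return labels[::-1]

# Unary scores favor blocks 0-1 as content and block 2 as boilerplate;
# with neutral pairwise terms the decoder returns [1, 1, 0].
path = viterbi_binary(np.log([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]),
                      np.zeros((2, 2, 2)))
```

Non-trivial pairwise terms make the decoding smooth the labels, discouraging isolated content blocks inside long boilerplate runs.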
39. Experiment II: Document Retrieval
● Low-recall extractors hurt retrieval performance (BTE, art, text, Unfluff)
● CRF and Web2Text extraction significantly better than raw content indexing
40. Experiment II: Document Retrieval
● Web2Text extraction significantly better than all compared methods
41. Run Times
● Average runtime per Web page
● MacBook with 2.8 GHz Intel Core i5 processor
● Global: 54 ms
○ DOM parsing & feature extraction: 35 ms
○ NN forward pass & Viterbi: 19 ms
42. Conclusion
● Deep structure prediction pipeline for Web content extraction
○ Collapsed DOMs
○ Unary and pairwise potentials
○ CNN representation learning
○ HMM-based inference
● State-of-the-art content extraction performance on CleanEval
● Can translate into increased downstream effectiveness