Feature engineering is a fundamental but poorly documented component of learning-to-rank (LTR) search applications.
As a result, there are still few open access software packages that allow researchers and practitioners to easily simulate a feature extraction pipeline and conduct experiments in a lab setting.
This talk introduces Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt may be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems.
The talk details how we built and documented a reproducible feature extraction pipeline with LTR experiments using the ClueWeb09B collection.
This LTR dataset is publicly available.
We’ll also discuss some of the benefits (feature extraction efficiency, model interpretation) of having open access tooling in this area for researchers and practitioners alike.
Feature Extraction for Large-Scale Text Collections
1. Feature Extraction for Large-Scale Text Collections
Luke Gallagher1
Antonio Mallia2
J. Shane Culpepper1
Torsten Suel2
B. Barla Cambazoglu1
RMIT University1
New York University2
2. Feature Extraction—Why Do We Care?
We want open and accessible tooling around feature extraction,
because many people in our research group are interested in
Efficient retrieval over massive text collections
Efficient and scalable algorithm design
Multi-stage retrieval systems
LTR and cascade ranking
End-to-end retrieval experiments
3. Multi-Stage Search
Multi-stage system described by Pedersen
(Graphic: J. Mackenzie1)
1 jmmackenzie.io/publication/thesis
J. Pedersen. “Query Understanding at Bing”. In: Proc. SIGIR Industry Track Keynote. 2010.
4. What is Feature Extraction?
Map a larger problem space to a smaller problem space
Parts of an inverted index are a result of feature extraction
D. Manolescu. “Feature Extraction–A Pattern for Information Retrieval”. In: Proc. PLOP. 1998.
G. Salton. Interactive Information Retrieval. Tech. rep. TR69-40. Cornell University, 1969.
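As a sketch of that mapping (hypothetical feature names, not Fxt's API): a query-document pair from the large space of raw text is reduced to a small fixed-size feature vector.

```python
# Toy query-document feature extractor: maps the large space of raw text
# pairs down to a small fixed-size vector. Feature names here are
# illustrative inventions, not features implemented by Fxt.
def extract_features(query, document):
    q_terms = query.lower().split()
    d_terms = document.lower().split()
    d_set = set(d_terms)
    overlap = sum(1 for t in q_terms if t in d_set)
    return {
        "query_len": len(q_terms),                     # query-only feature
        "doc_len": len(d_terms),                       # document-only (static) feature
        "term_overlap": overlap,                       # query-dependent feature
        "overlap_ratio": overlap / max(len(q_terms), 1),
    }

features = extract_features("feature extraction",
                            "scalable feature extraction for text")
```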
5. Feature Extraction in Multi-Stage Retrieval
Many search applications use LTR (GBRT, LambdaMART)
Feature-based models depend on feature engineering and infrastructure to support feature extraction
Sets a high bar for research on feature extraction and related tasks (e.g. efficiency, model interpretation)
6. Feature Extraction in LTR
Which features to implement or extract?
Features may depend on the search task
Many “good” ranking features are query dependent (i.e. require both query and document)
Results for “seen” queries can be cached/pre-computed
In general, it is not possible to pre-compute query-dependent features
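The caching point above can be sketched as follows (illustrative only, not Fxt's design): because query-dependent features need both the query and the document, the cache must be keyed on the pair, not on the document alone.

```python
# A query-dependent feature requires both query and document, so any
# cache must be keyed on the (query, doc_id) pair; a per-document cache
# alone cannot pre-compute it. Illustrative sketch, not Fxt's internals.
calls = 0
cache = {}

def expensive_feature(query, doc_id):
    global calls
    calls += 1                      # count real computations
    return len(query) + doc_id      # stand-in for a costly score

def cached_feature(query, doc_id):
    key = (query, doc_id)           # query is part of the cache key
    if key not in cache:
        cache[key] = expensive_feature(query, doc_id)
    return cache[key]

first = cached_feature("trec query", 7)
second = cached_feature("trec query", 7)  # "seen" query: served from cache
```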
8. Key Contributions
Feature extraction software
Easier to test out ideas in feature extraction
Better simulation of search tasks dependent upon feature extraction
LTR dataset on ClueWeb09B
Possibly the first public LTR dataset that is completely transparent (i.e., queries, documents, features, etc. are known)
9. Open Feature Extraction Tooling
Existing open source solutions:
Support LTR but do not implement features for users
(Solr, Elastic, Terrier)
Anserini provides some features with LTR
Published work tends to be “single-use” engineering
We don’t have open feature extraction tools that:
Provide a large set of text-based features
Facilitate the feature extraction process
Can be used standalone or within a retrieval pipeline
L. Wang, J. Lin, and D. Metzler. In: Proc. SIGIR. 2011.
10. Fxt – Feature Extraction Toolkit
What does the Fxt2 software provide?
Configurable collection of 448 features
Features mainly from literature in QPP and LTR
indexer—build feature index
extractor—extract features from candidate documents
Use cases
Standalone feature extraction
Generate training data
End-to-end retrieval experiments (more work required)
2 github.com/ten-blue-links/fxt
N. Asadi and J. Lin. In: Inf. Retr. (2013).
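One way to picture the “generate training data” use case: serialize extracted feature vectors into the SVMlight/LETOR-style layout that LTR toolkits commonly consume. A minimal sketch (the feature values are made up; this is not Fxt's output code):

```python
# Serialize one training example in the SVMlight/LETOR text layout used
# by common LTR tools: "<label> qid:<query id> 1:<v1> 2:<v2> ...".
def to_letor_line(label, qid, features):
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats}"

# Example: relevance grade 2, query id 101, three feature values.
line = to_letor_line(2, 101, [0.5, 13.7, 0.0])
```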
11. Summary of Features in Fxt
Description                        No. Features
Term Score Aggregation (Unigram)   159
Term Score Aggregation (Bigram)    147
Query Document Score (Unigram)     106
Query Document Score (Bigram)      4
Static Document Quality            23
Query (Document Independent)       13
12. LTR Dataset – ClueWeb09B
Web Track queries and judgments from 2009–2012
134 features
Feature classes (see paper for details):
Query-document unigram (e.g. BM25, BM25-title)
Query-document bigram (e.g. SDM, BM25-TP)
Static document quality (e.g. Stop Ratio, AlexaRank)
Publicly available3
Download the dataset
Reproduce4 the dataset and/or experiments
3 github.com/ten-blue-links/cikm20
4 Definitions for “replicate” and “reproduce” were recently swapped: tinyurl.com/acm-replicate-reproduce
J. S. Culpepper, C. L. A. Clarke, and J. Lin. In: Proc. ADCS. 2016.
M. Bendersky, W. B. Croft, and Y. Diao. In: Proc. WSDM. 2011.
X. Lu, A. Moffat, and J. S. Culpepper. In: Proc. CIKM. 2015.
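For intuition about the query-document unigram class above, here is a minimal single-term BM25 sketch (Robertson-style idf; the k1 and b values are arbitrary placeholders, not the settings used for the dataset):

```python
import math

# Single-term BM25 contribution: tf = term frequency in the document,
# df = document frequency of the term, N = number of documents in the
# collection. k1 and b are illustrative defaults, not the paper's.
def bm25_term(tf, df, doc_len, avg_doc_len, N, k1=0.9, b=0.4):
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm

score = bm25_term(tf=3, df=50, doc_len=120, avg_doc_len=100, N=10_000)
```

Variants such as BM25-title simply run the same scoring over a field-restricted index.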
13. Summary of Features in Dataset
Description                      No. Features
Query Document Score (Unigram)   106
Query Document Score (Bigram)    4
Static Document Quality          23
AlexaRank (2010)                 1
14. Dataset Construction Process
1. Indri index with fields
2. Fxt index
3. BM25 to generate candidate set for queries (depth 1k)
4. Extract features to CSV
5. Random shuffle and split into train/val/test
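Step 5 above can be sketched as a seeded shuffle and proportional split (the 80/10/10 proportions here are an assumption for illustration, not necessarily those used for the released dataset):

```python
import random

# Randomly shuffle extracted feature rows and split into
# train/validation/test partitions. Proportions are illustrative.
def shuffle_split(rows, seed=42, train=0.8, val=0.1):
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

tr, va, te = shuffle_split(range(1000))
```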
15. ClueWeb09B Relevance Judgment Methods
Track   Grades    Method
MQ09    0–2       TREC, MTC, NEU
WT095   0–2       TREC, MTC, NEU
WT10    -2, 0–4   TREC
WT11    -2, 0–3   TREC
WT12    -2, 0–4   TREC
5 Judgments for WT09 topics are identical to those from MQ09
B. Carterette, J. Allan, and R. Sitaraman. “Minimal Test Collections for Retrieval Evaluation”. In: Proc. SIGIR. 2006.
J. Aslam and V. Pavlu. A Practical Sampling Strategy for Efficient Retrieval Evaluation. Tech. rep. Northeastern U., 2007.
B. Carterette et al. “Million Query Track 2009 Overview”. In: Proc. TREC. 2009.
X. Lu, A. Moffat, and J. S. Culpepper. “The Effect of Pooling and Evaluation Depth on IR Metrics”. In: Inf. Retr. (2016).
17. No Dataset is Perfect—What’s Broken Here?
Building datasets is hard, but we’re lucky (system study)
SDM feature is broken
Unfortunately, this was found after the camera-ready deadline
SDM was decoding postings for every document!
Used a workaround that created a post-processing bug
tinyurl.com/yxkm9867
The dataset is versioned6 and a fix will be released
6 github.com/ten-blue-links/cikm20/releases
18. Experimental Details
Compare effectiveness of LambdaMART7 to traditional baselines
Evaluation on the 4 Web Track query sets
Conduct brief study on feature importance
7 lightgbm.readthedocs.io
21. Summary
Fxt software for feature based machine learning in IR
Released LTR dataset on ClueWeb09B
Facilitate more open research and collaboration
Avenues for future research:
Efficiency in feature extraction
Model interpretation
Ablation/Feature selection
End-to-end system prototyping