Feature engineering is a fundamental but poorly documented component of learning-to-rank (LTR) search applications.
As a result, there are still few open access software packages that allow researchers and practitioners to easily simulate a feature extraction pipeline and conduct experiments in a lab setting.
This talk introduces Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt may be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems.
The talk details how we built and documented a reproducible feature extraction pipeline with LTR experiments using the ClueWeb09B collection.
This LTR dataset is publicly available.
We’ll also discuss some of the benefits (feature extraction efficiency, model interpretation) of having open access tooling in this area for researchers and practitioners alike.
Feature Extraction for Large-Scale Text Collections
1. Feature Extraction for Large-Scale Text Collections
Luke Gallagher1
Antonio Mallia2
J. Shane Culpepper1
Torsten Suel2
B. Barla Cambazoglu1
RMIT University1
New York University2
2. Feature Extraction—Why Do We Care?
We want open and accessible tooling around feature extraction,
because many people in our research group are interested in
Efficient retrieval over massive text collections
Efficient and scalable algorithm design
Multi-stage retrieval systems
LTR and cascade ranking
End-to-end retrieval experiments
3. Multi-Stage Search
Multi-stage system described by Pedersen
(Graphic: J. Mackenzie1)
1 jmmackenzie.io/publication/thesis
J. Pedersen. “Query Understanding at Bing”. In: Proc. SIGIR Industry Track Keynote. 2010.
4. What is Feature Extraction?
Map a larger problem space to a smaller problem space
Parts of an inverted index are a result of feature extraction
D. Manolescu. “Feature Extraction–A Pattern for Information Retrieval”. In: Proc. PLOP. 1998.
G. Salton. Interactive Information Retrieval. Tech. rep. TR69-40. Cornell University, 1969.
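As a sketch of that mapping (hypothetical feature names, not Fxt's API): a query-document pair from the large space of raw text is reduced to a small fixed-size feature vector.

```python
# Toy query-document feature extractor: maps the large space of raw text
# pairs down to a small fixed-size vector. Feature names here are
# illustrative inventions, not features implemented by Fxt.
def extract_features(query, document):
    q_terms = query.lower().split()
    d_terms = document.lower().split()
    d_set = set(d_terms)
    overlap = sum(1 for t in q_terms if t in d_set)
    return {
        "query_len": len(q_terms),                     # query-only feature
        "doc_len": len(d_terms),                       # document-only (static) feature
        "term_overlap": overlap,                       # query-dependent feature
        "overlap_ratio": overlap / max(len(q_terms), 1),
    }

features = extract_features("feature extraction",
                            "scalable feature extraction for text")
```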
5. Feature Extraction in Multi-Stage Retrieval
Many search applications use LTR (GBRT, LambdaMART)
Feature-based models depend on feature engineering and infrastructure to support feature extraction
Sets a high bar for research on feature extraction and related tasks (e.g. efficiency, model interpretation)
6. Feature Extraction in LTR
Which features to implement or extract?
Features may depend on the search task
Many “good” ranking features are query dependent (i.e. require both query and document)
Results for “seen” queries can be cached/pre-computed
In general, it is not possible to pre-compute query-dependent features
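The caching point above can be sketched as follows (illustrative only, not Fxt's design): because query-dependent features need both the query and the document, the cache must be keyed on the pair, not on the document alone.

```python
# A query-dependent feature requires both query and document, so any
# cache must be keyed on the (query, doc_id) pair; a per-document cache
# alone cannot pre-compute it. Illustrative sketch, not Fxt's internals.
calls = 0
cache = {}

def expensive_feature(query, doc_id):
    global calls
    calls += 1                      # count real computations
    return len(query) + doc_id      # stand-in for a costly score

def cached_feature(query, doc_id):
    key = (query, doc_id)           # query is part of the cache key
    if key not in cache:
        cache[key] = expensive_feature(query, doc_id)
    return cache[key]

first = cached_feature("trec query", 7)
second = cached_feature("trec query", 7)  # "seen" query: served from cache
```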
8. Key Contributions
Feature extraction software
Easier to test out ideas in feature extraction
Better simulation of search tasks dependent upon feature extraction
LTR dataset on ClueWeb09B
Possibly the first public LTR dataset that is completely transparent (i.e., queries, documents, features, etc. are known)
9. Open Feature Extraction Tooling
Existing open source solutions:
Support LTR but do not implement features for users
(Solr, Elastic, Terrier)
Anserini provides some features with LTR
Published work tends to be “single-use” engineering
We don’t have open feature extraction tools that:
Provide a large set of text-based features
Facilitate the feature extraction process
Can be used standalone or within a retrieval pipeline
L. Wang, J. Lin, and D. Metzler. In: Proc. SIGIR. 2011.
10. Fxt – Feature Extraction Toolkit
What does the Fxt2 software provide?
Configurable collection of 448 features
Features mainly from literature in QPP and LTR
indexer—build feature index
extractor—extract features from candidate documents
Use cases
Standalone feature extraction
Generate training data
End-to-end retrieval experiments (more work required)
2 github.com/ten-blue-links/fxt
N. Asadi and J. Lin. In: Inf. Retr. (2013).
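One way to picture the “generate training data” use case: serialize extracted feature vectors into the SVMlight/LETOR-style layout that LTR toolkits commonly consume. A minimal sketch (the feature values are made up; this is not Fxt's output code):

```python
# Serialize one training example in the SVMlight/LETOR text layout used
# by common LTR tools: "<label> qid:<query id> 1:<v1> 2:<v2> ...".
def to_letor_line(label, qid, features):
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats}"

# Example: relevance grade 2, query id 101, three feature values.
line = to_letor_line(2, 101, [0.5, 13.7, 0.0])
```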
11. Summary of Features in Fxt
Description                        No. Features
Term Score Aggregation (Unigram)   159
Term Score Aggregation (Bigram)    147
Query Document Score (Unigram)     106
Query Document Score (Bigram)      4
Static Document Quality            23
Query (Document Independent)       13
12. LTR Dataset – ClueWeb09B
Web Track queries and judgments from 2009–2012
134 features
Feature classes (see paper for details):
Query-document unigram (e.g. BM25, BM25-title)
Query-document bigram (e.g. SDM, BM25-TP)
Static document quality (e.g. Stop Ratio, AlexaRank)
Publicly available3
Download the dataset
Reproduce4 the dataset and/or experiments
3 github.com/ten-blue-links/cikm20
4 Definitions for “replicate” and “reproduce” were recently swapped: tinyurl.com/acm-replicate-reproduce
J. S. Culpepper, C. L. A. Clarke, and J. Lin. In: Proc. ADCS. 2016.
M. Bendersky, W. B. Croft, and Y. Diao. In: Proc. WSDM. 2011.
X. Lu, A. Moffat, and J. S. Culpepper. In: Proc. CIKM. 2015.
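For intuition about the query-document unigram class above, here is a minimal single-term BM25 sketch (Robertson-style idf; the k1 and b values are arbitrary placeholders, not the settings used for the dataset):

```python
import math

# Single-term BM25 contribution: tf = term frequency in the document,
# df = document frequency of the term, N = number of documents in the
# collection. k1 and b are illustrative defaults, not the paper's.
def bm25_term(tf, df, doc_len, avg_doc_len, N, k1=0.9, b=0.4):
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm

score = bm25_term(tf=3, df=50, doc_len=120, avg_doc_len=100, N=10_000)
```

Variants such as BM25-title simply run the same scoring over a field-restricted index.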
13. Summary of Features in Dataset
Description                      No. Features
Query Document Score (Unigram)   106
Query Document Score (Bigram)    4
Static Document Quality          23
AlexaRank (2010)                 1
14. Dataset Construction Process
1. Indri index with fields
2. Fxt index
3. BM25 to generate candidate set for queries (depth 1k)
4. Extract features to CSV
5. Random shuffle and split into train/val/test
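Step 5 above can be sketched as a seeded shuffle and proportional split (the 80/10/10 proportions here are an assumption for illustration, not necessarily those used for the released dataset):

```python
import random

# Randomly shuffle extracted feature rows and split into
# train/validation/test partitions. Proportions are illustrative.
def shuffle_split(rows, seed=42, train=0.8, val=0.1):
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

tr, va, te = shuffle_split(range(1000))
```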
15. ClueWeb09B Relevance Judgment Methods
Track   Grades    Method
MQ09    0–2       TREC, MTC, NEU
WT095   0–2       TREC, MTC, NEU
WT10    -2, 0–4   TREC
WT11    -2, 0–3   TREC
WT12    -2, 0–4   TREC
5 Judgments for WT09 topics are identical to those from MQ09
B. Carterette, J. Allan, and R. Sitaraman. “Minimal Test Collections for Retrieval Evaluation”. In: Proc. SIGIR. 2006.
J. Aslam and V. Pavlu. A Practical Sampling Strategy for Efficient Retrieval Evaluation. Tech. rep. Northeastern U., 2007.
B. Carterette et al. “Million Query Track 2009 Overview”. In: Proc. TREC. 2009.
X. Lu, A. Moffat, and J. S. Culpepper. “The Effect of Pooling and Evaluation Depth on IR Metrics”. In: Inf. Retr. (2016).
17. No Dataset is Perfect—What’s Broken Here?
Building datasets is hard, but we’re lucky (system study)
SDM feature is broken
Unfortunately, this was found after the camera-ready deadline
SDM was decoding postings for every document!
Used a workaround that created a post-processing bug
tinyurl.com/yxkm9867
The dataset is versioned6 and a fix will be released
6 github.com/ten-blue-links/cikm20/releases
18. Experimental Details
Compare effectiveness of LambdaMART7 to traditional baselines
Evaluation on the 4 Web Track query sets
Conduct brief study on feature importance
7 lightgbm.readthedocs.io
21. Summary
Fxt software for feature based machine learning in IR
Released LTR dataset on ClueWeb09B
Facilitate more open research and collaboration
Avenues for future research:
Efficiency in feature extraction
Model interpretation
Ablation/Feature selection
End-to-end system prototyping