Feature Extraction for Large-Scale Text Collections


Feature engineering is a fundamental but poorly documented component in LTR search applications.
As a result, there are still few open access software packages that allow researchers and practitioners to easily simulate a feature extraction pipeline and conduct experiments in a lab setting.

This talk introduces Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt may be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems.
The talk details how we built and documented a reproducible feature extraction pipeline with LTR experiments using the ClueWeb09B collection.
This LTR dataset is publicly available.
We’ll also discuss some of the benefits (feature extraction efficiency, model interpretation) of having open access tooling in this area for researchers and practitioners alike.



  1. Feature Extraction for Large-Scale Text Collections
     Luke Gallagher (RMIT University), Antonio Mallia (New York University),
     J. Shane Culpepper (RMIT University), Torsten Suel (New York University),
     B. Barla Cambazoglu (RMIT University)
  2. Feature Extraction—Why Do We Care?
     We want open and accessible tooling around feature extraction, because many
     people in our research group are interested in:
     - Efficient retrieval over massive text collections
     - Efficient and scalable algorithm design
     - Multi-stage retrieval systems
     - LTR and cascade ranking
     - End-to-end retrieval experiments
  3. Multi-Stage Search
     Multi-stage system described by Pedersen (graphic: J. Mackenzie,
     jmmackenzie.io/publication/thesis)
     J. Pedersen. “Query Understanding at Bing”. In: Proc. SIGIR Industry Track Keynote. 2010.
  4. What is Feature Extraction?
     - Map a larger problem space to a smaller problem space
     - Parts of an inverted index are a result of feature extraction
     D. Manolescu. “Feature Extraction–A Pattern for Information Retrieval”. In: Proc. PLOP. 1998.
     G. Salton. Interactive Information Retrieval. Tech. rep. TR69-40. Cornell University, 1969.
  5. Feature Extraction in Multi-Stage Retrieval
     - Many search applications use LTR (GBRT, LambdaMART)
     - Feature-based models depend on feature engineering and infrastructure to
       support feature extraction
     - Sets a high bar for research on feature extraction and related tasks
       (e.g. efficiency, model interpretation)
  6. Feature Extraction in LTR
     - Which features to implement or extract?
     - Features may depend on the search task
     - Many “good” ranking features are query dependent (i.e. require both the
       query and the document)
     - Results for “seen” queries can be cached/pre-computed
     - In general, it is not possible to pre-compute query-dependent features
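The caching point above can be sketched with a toy example: a query-dependent feature needs both the query and the document at extraction time, but repeated ("seen") queries can reuse earlier work. All names below are hypothetical illustrations, not part of the Fxt API.

```python
from functools import lru_cache

def bm25_like(query: str, doc: str) -> float:
    # Stand-in scorer (hypothetical): counts query-term matches in the document.
    words = doc.lower().split()
    return float(sum(words.count(t) for t in query.lower().split()))

@lru_cache(maxsize=10_000)
def extract_features(query: str, doc: str) -> tuple:
    # Cached per (query, document) pair; a repeated "seen" query hits the
    # cache instead of re-extracting. Unseen queries still pay the full cost,
    # which is why query-dependent features cannot be fully pre-computed.
    return (bm25_like(query, doc), float(len(doc.split())))

print(extract_features("feature extraction", "feature extraction for text"))
# → (2.0, 4.0)
```

A real pipeline would key the cache on query/document identifiers and return the full feature vector, but the trade-off is the same.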
  7. Runtime Feature Extraction
  8. Key Contributions
     - Feature extraction software
       - Easier to test out ideas in feature extraction
       - Better simulation of search tasks dependent upon feature extraction
     - LTR dataset on ClueWeb09B
       - Possibly the first public LTR dataset that is completely transparent
         (i.e., queries, documents, features, etc. are known)
  9. Open Feature Extraction Tooling
     Existing open-source solutions:
     - Support LTR but do not implement features for users (Solr, Elastic, Terrier)
     - Anserini provides some features with LTR
     - Published work tends to be “single-use” engineering
     We don’t have open feature extraction tools that:
     - Provide a large set of text-based features
     - Facilitate the feature extraction process
     - Can be used standalone or within a retrieval pipeline
     L. Wang, J. Lin, and D. Metzler. In: Proc. SIGIR. 2011.
 10. Fxt – Feature Extraction Toolkit
     What does the Fxt software provide? (github.com/ten-blue-links/fxt)
     - Configurable collection of 448 features
     - Features mainly from the QPP and LTR literature
     - indexer—build a feature index
     - extractor—extract features from candidate documents
     Use cases:
     - Standalone feature extraction
     - Generate training data
     - End-to-end retrieval experiments (more work required)
     N. Asadi and J. Lin. In: Inf. Retr. (2013).
 11. Summary of Features in Fxt

     Description                        No. Features
     Term Score Aggregation (Unigram)   159
     Term Score Aggregation (Bigram)    147
     Query Document Score (Unigram)     106
     Query Document Score (Bigram)      4
     Static Document Quality            23
     Query (Document Independent)       13
 12. LTR Dataset – ClueWeb09B
     - Web Track queries and judgments from 2009–2012
     - 134 features
     - Feature classes (see paper for details):
       - Query-document unigram (e.g. BM25, BM25-title)
       - Query-document bigram (e.g. SDM, BM25-TP)
       - Static document quality (e.g. Stop Ratio, AlexaRank)
     - Publicly available at github.com/ten-blue-links/cikm20:
       - Download the dataset
       - Reproduce the dataset and/or experiments
     Note: definitions for “replicate” and “reproduce” were recently swapped:
     tinyurl.com/acm-replicate-reproduce
     J. S. Culpepper, C. L. A. Clarke, and J. Lin. In: Proc. ADCS. 2016.
     M. Bendersky, W. B. Croft, and Y. Diao. In: Proc. WSDM. 2011.
     X. Lu, A. Moffat, and J. S. Culpepper. In: Proc. CIKM. 2015.
 13. Summary of Features in Dataset

     Description                      No. Features
     Query Document Score (Unigram)   106
     Query Document Score (Bigram)    4
     Static Document Quality          23
     AlexaRank (2010)                 1
 14. Dataset Construction Process
     1. Indri index with fields
     2. Fxt index
     3. BM25 to generate the candidate set for queries (depth 1k)
     4. Extract features to CSV
     5. Random shuffle and split into train/val/test
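Step 5 above can be sketched as follows. The split proportions and seed here are illustrative assumptions, not the paper's exact setup (which splits per track, as shown later).

```python
import random

def shuffle_split(rows, train=0.8, valid=0.1, seed=42):
    # Shuffle the extracted feature rows, then cut into train/valid/test.
    # Fractions and seed are illustrative assumptions.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

train, valid, test = shuffle_split(range(1000))
print(len(train), len(valid), len(test))  # → 800 100 100
```

In an LTR setting the shuffle would typically be applied at the query level so all candidate documents for a query land in the same split.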
 15. ClueWeb09B Relevance Judgment Methods

     Track   Grades    Method
     MQ09    0–2       TREC, MTC, NEU
     WT09    0–2       TREC, MTC, NEU
     WT10    -2, 0–4   TREC
     WT11    -2, 0–3   TREC
     WT12    -2, 0–4   TREC

     Judgments for WT09 topics are identical to those from MQ09.
     B. Carterette, J. Allan, and R. Sitaraman. “Minimal Test Collections for Retrieval Evaluation”. In: Proc. SIGIR. 2006.
     J. Aslam and V. Pavlu. A Practical Sampling Strategy for Efficient Retrieval Evaluation. Tech. rep. Northeastern U., 2007.
     B. Carterette et al. “Million Query Track 2009 Overview”. In: Proc. TREC. 2009.
     X. Lu, A. Moffat, and J. S. Culpepper. “The Effect of Pooling and Evaluation Depth on IR Metrics”. In: Inf. Retr. (2016).
 16. Train–Test Setup

     Test Queries   Train/Valid Queries
     WT09           MQ09
     WT10           WT09, WT11, WT12
     WT11           WT09, WT10, WT12
     WT12           WT09, WT10, WT11
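The leave-one-track-out setup in the table above can be expressed directly: each Web Track query set is held out for testing while the remaining tracks are used for training and validation, except WT09, which trains on MQ09 (whose judgments it shares).

```python
# Leave-one-track-out splits matching the Train–Test Setup table.
tracks = ["WT09", "WT10", "WT11", "WT12"]
splits = {"WT09": ["MQ09"]}  # WT09 judgments are identical to MQ09's
for held_out in tracks[1:]:
    splits[held_out] = [t for t in tracks if t != held_out]

print(splits["WT11"])  # → ['WT09', 'WT10', 'WT12']
```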
 17. No Dataset is Perfect—What’s Broken Here?
     - Building datasets is hard, but we’re lucky (system study)
     - The SDM feature is broken:
       - Unfortunately, this was found after camera-ready
       - SDM was decoding postings for every document!
       - A workaround created a post-processing bug: tinyurl.com/yxkm9867
     - The dataset is versioned (github.com/ten-blue-links/cikm20/releases)
       and a fix will be released
 18. Experimental Details
     - Compare the effectiveness of LambdaMART (lightgbm.readthedocs.io) to
       traditional baselines
     - Evaluation on the 4 Web Track query sets
     - Conduct a brief study on feature importance
 19. Effectiveness Results
     [Bar chart: NDCG@20 (approx. 0.10–0.30) on WT09–WT12 for BM25, SDM,
     LambdaMART, and TREC Best.]
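The metric behind the chart, NDCG@20, can be computed with a short sketch. This uses the common exponential-gain formulation and assumes non-negative grades (the Web Track spam grade of -2 would need clipping first); it is illustrative, not the exact evaluation configuration used in the talk.

```python
import math

def dcg(rels, k=20):
    # Discounted cumulative gain with exponential gains:
    # gain = (2^rel - 1), discount = log2(rank + 1), ranks starting at 1.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, k=20):
    # Normalize by the DCG of an ideal (descending-grade) ordering.
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the top-ranked documents, in rank order.
print(round(ndcg([2, 0, 1]), 4))  # → 0.9639
```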
 20. Feature Importance Summary
     [Bar chart: average feature importance over 2009–2012. Top features
     include BM25kMax, StreamLenInlink, AlexaRank, BEInlink, StopCover,
     AvgTermLen, LM2500, LM1500Inlink, LM2500Inlink, DPHInlink, LM2500Body,
     VisibleText, Stage0, LM2500Title.]
 21. Summary
     - Fxt software for feature-based machine learning in IR
     - Released an LTR dataset on ClueWeb09B
     - Facilitate more open research and collaboration
     Avenues for future research:
     - Efficiency in feature extraction
     - Model interpretation
     - Ablation/feature selection
     - End-to-end system prototyping

