Feature Extraction for Large-Scale Text Collections
Luke Gallagher (RMIT University)
Antonio Mallia (New York University)
J. Shane Culpepper (RMIT University)
Torsten Suel (New York University)
B. Barla Cambazoglu (RMIT University)
Feature Extraction—Why Do We Care?
We want open and accessible tooling around feature extraction,
because many people in our research group are interested in:
Efficient retrieval over massive text collections
Efficient and scalable algorithm design
Multi-stage retrieval systems
LTR and cascade ranking
End-to-end retrieval experiments
Multi-Stage Search
Multi-stage system described by Pedersen
(Graphic: J. Mackenzie, jmmackenzie.io/publication/thesis)
J. Pedersen. “Query Understanding at Bing”. In: Proc. SIGIR Industry Track Keynote. 2010.
What is Feature Extraction?
Map a larger problem space to a smaller problem space
Parts of an inverted index are a result of feature extraction
D. Manolescu. “Feature Extraction–A Pattern for Information Retrieval”. In: Proc. PLOP. 1998.
G. Salton. Interactive Information Retrieval. Tech. rep. TR69-40. Cornell University, 1969.
Feature Extraction in Multi-Stage Retrieval
Many search applications use LTR (GBRT, LambdaMART)
Feature-based models depend on feature engineering and
infrastructure to support feature extraction
Sets a high bar for research on feature extraction and related
tasks (e.g. efficiency, model interpretation)
Feature Extraction in LTR
Which features to implement or extract?
Features may depend on the search task
Many “good” ranking features are query dependent
(i.e. require both query and document)
Results for “seen” queries can be cached/pre-computed
In general, not possible to pre-compute query dependent
features
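To make the query-dependence point concrete, here is a minimal Python sketch, not Fxt's implementation. It scores a document with one common BM25 variant and memoises the value for (query, document) pairs that have been seen before. The `k1` and `b` values, the `FeatureCache` class, and the toy collection statistics are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=0.9, b=0.4):
    """Score one document against one query (one common BM25 variant)."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # term absent from the document contributes nothing
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf[term] * (k1 + 1.0) / norm
    return score


class FeatureCache:
    """Hypothetical cache: reuse feature values for seen (query, doc) pairs."""

    def __init__(self, scorer):
        self.scorer = scorer
        self.store = {}
        self.misses = 0

    def get(self, query, doc_id, doc_terms, **stats):
        key = (query, doc_id)
        if key not in self.store:
            # Unseen pair: the feature must be computed at query time.
            self.misses += 1
            self.store[key] = self.scorer(query.split(), doc_terms, **stats)
        return self.store[key]


# Toy statistics, made up for the example.
stats = dict(doc_freq={"feature": 2, "extraction": 1}, num_docs=3, avg_doc_len=4.0)
doc = "feature extraction for text".split()
cache = FeatureCache(bm25)
first = cache.get("feature extraction", "d1", doc, **stats)
again = cache.get("feature extraction", "d1", doc, **stats)  # served from cache
```

The score needs both the query terms and the document terms, so it cannot be computed before the query arrives. Only repeated queries benefit from the cache.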
Runtime Feature Extraction
Key Contributions
Feature extraction software
Easier to test out ideas in feature extraction
Better simulation of search tasks dependent upon feature
extraction
LTR dataset on ClueWeb09B
Possibly the first public LTR dataset that is completely transparent
(i.e., queries, documents, features, etc. are all known)
Open Feature Extraction Tooling
Existing open source solutions:
Support LTR but do not implement features for users
(Solr, Elastic, Terrier)
Anserini provides some features with LTR
Published work tends to be “single-use” engineering
We don’t have open feature extraction tools that:
Provide a large set of text-based features
Facilitate the feature extraction process
Can be used standalone or within a retrieval pipeline
L. Wang, J. Lin, and D. Metzler. In: Proc. SIGIR. 2011.
Fxt – Feature Extraction Toolkit
What does the Fxt software (github.com/ten-blue-links/fxt) provide?
Configurable collection of 448 features
Features mainly from literature in QPP and LTR
indexer: build feature index
extractor: extract features from candidate documents
Use cases
Standalone feature extraction
Generate training data
End-to-end retrieval experiments (more work required)
N. Asadi and J. Lin. In: Inf. Retr. (2013).
Summary of Features in Fxt
Description                         No. Features
Term Score Aggregation (Unigram)    159
Term Score Aggregation (Bigram)     147
Query Document Score (Unigram)      106
Query Document Score (Bigram)       4
Static Document Quality             23
Query (Document Independent)        13
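As a rough illustration of what a "term score aggregation" feature family can look like, the sketch below collapses a variable-length list of per-term scores into a fixed set of summary statistics. The slides do not list which statistics Fxt computes, so the set chosen here (min, max, mean, sum, variance) and the function name are assumptions.

```python
import statistics


def aggregate_term_scores(term_scores):
    """Collapse per-term scores into a fixed-width feature dictionary."""
    if not term_scores:
        # An empty query yields all-zero features (a convention, not a rule).
        return {"min": 0.0, "max": 0.0, "mean": 0.0, "sum": 0.0, "var": 0.0}
    return {
        "min": min(term_scores),
        "max": max(term_scores),
        "mean": statistics.mean(term_scores),
        "sum": sum(term_scores),
        "var": statistics.pvariance(term_scores),  # population variance
    }


# Toy per-term scores for a three-term query.
features = aggregate_term_scores([1.0, 2.0, 3.0])
```

Aggregating this way gives every query the same number of features regardless of its length, which is what a feature-based ranker expects.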
LTR Dataset – ClueWeb09B
Web Track queries and judgments from 2009–2012
134 features
Feature classes (see paper for details):
Query-document unigram (e.g. BM25, BM25-title)
Query-document bigram (e.g. SDM, BM25-TP)
Static document quality (e.g. Stop Ratio, AlexaRank)
Publicly available: github.com/ten-blue-links/cikm20
Download the dataset
Reproduce the dataset and/or experiments
(Note: definitions for “replicate” and “reproduce” were recently swapped: tinyurl.com/acm-replicate-reproduce)
J. S. Culpepper, C. L. A. Clarke, and J. Lin. In: Proc. ADCS. 2016.
M. Bendersky, W. B. Croft, and Y. Diao. In: Proc. WSDM. 2011.
X. Lu, A. Moffat, and J. S. Culpepper. In: Proc. CIKM. 2015.
Summary of Features in Dataset
Description                       No. Features
Query Document Score (Unigram)    106
Query Document Score (Bigram)     4
Static Document Quality           23
AlexaRank (2010)                  1
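Static document quality features are query independent, so they can be computed once per document. The sketch below shows three plausible signals of this kind. The definitions are assumptions, not the dataset's exact ones: stop ratio is taken as the fraction of tokens that are stopwords, stop cover as the fraction of the stopword list that occurs in the document, and the stopword list itself is a toy one.

```python
STOPWORDS = frozenset({"the", "a", "of", "and", "to", "in"})  # toy list (assumption)


def static_quality_features(tokens, stopwords=STOPWORDS):
    """Compute query-independent quality signals for one tokenised document."""
    stopwords = frozenset(stopwords)
    if not tokens:
        return {"stop_ratio": 0.0, "stop_cover": 0.0, "avg_term_len": 0.0}
    lowered = [t.lower() for t in tokens]
    n_stop = sum(1 for t in lowered if t in stopwords)
    cover = len(stopwords & set(lowered)) / len(stopwords) if stopwords else 0.0
    return {
        "stop_ratio": n_stop / len(lowered),
        "stop_cover": cover,
        "avg_term_len": sum(len(t) for t in lowered) / len(lowered),
    }


features = static_quality_features("the cat sat in the hat".split())
```

Because nothing here depends on the query, values like these can be stored alongside the index and looked up at extraction time.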
Dataset Construction Process
1. Indri index with fields
2. Fxt index
3. BM25 to generate candidate set for queries (depth 1k)
4. Extract features to CSV
5. Random shuffle and split into train/val/test
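Step 5 can be sketched as follows. The slides do not say whether the shuffle is over queries or over rows, nor what the split proportions are. This example assumes a split by query ID (so all candidates for a query stay in one partition) and uses placeholder fractions and a placeholder seed.

```python
import random


def split_queries(query_ids, seed=42, train=0.8, valid=0.1):
    """Shuffle unique query IDs and cut them into train/valid/test lists."""
    qids = sorted(set(query_ids))        # sort first so the seed is reproducible
    random.Random(seed).shuffle(qids)    # local RNG, no global state touched
    n_train = int(len(qids) * train)
    n_valid = int(len(qids) * valid)
    return {
        "train": qids[:n_train],
        "valid": qids[n_train:n_train + n_valid],
        "test": qids[n_train + n_valid:],
    }


# Hypothetical query IDs 1..100.
parts = split_queries(range(1, 101))
```

Splitting by query rather than by row avoids leaking documents from a test query into training.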
ClueWeb09B Relevance Judgment Methods
Track    Grades     Method
MQ09     0–2        TREC, MTC, NEU
WT09*    0–2        TREC, MTC, NEU
WT10     -2, 0–4    TREC
WT11     -2, 0–3    TREC
WT12     -2, 0–4    TREC
* Judgments for WT09 topics are identical to those from MQ09
B. Carterette, J. Allan, and R. Sitaraman. “Minimal Test Collections for Retrieval Evaluation”. In: Proc. SIGIR. 2006.
J. Aslam and V. Pavlu. A Practical Sampling Strategy for Efficient Retrieval Evaluation. Tech. rep. Northeastern U., 2007.
B. Carterette et al. “Million Query Track 2009 Overview”. In: Proc. TREC. 2009.
X. Lu, A. Moffat, and J. S. Culpepper. “The Effect of Pooling and Evaluation Depth on IR Metrics”. In: Inf. Retr. (2016).
Train–Test Setup
Test Queries    Train/Valid Queries
WT09            MQ09
WT10            WT09, WT11, WT12
WT11            WT09, WT10, WT12
WT12            WT09, WT10, WT11
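The setup in this table is a leave-one-track-out scheme with WT09 as a special case. A small sketch that reproduces the table programmatically is shown below. The function and variable names are made up for illustration.

```python
WEB_TRACKS = ["WT09", "WT10", "WT11", "WT12"]


def train_test_folds(tracks=WEB_TRACKS):
    """Map each test track to the query sets used for training/validation."""
    folds = {}
    for test in tracks:
        if test == "WT09":
            # Special case taken from the table: WT09 is trained on MQ09.
            folds[test] = ["MQ09"]
        else:
            folds[test] = [t for t in tracks if t != test]
    return folds


folds = train_test_folds()
```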
No Dataset is Perfect—What’s Broken Here?
Building datasets is hard, but we’re lucky (system study)
SDM feature is broken
Unfortunately, the problem was found after the camera-ready version
SDM was decoding postings for every document!
Used a workaround that created a post-processing bug
tinyurl.com/yxkm9867
The dataset is versioned (github.com/ten-blue-links/cikm20/releases) and a fix will be released
Experimental Details
Compare effectiveness of LambdaMART (LightGBM, lightgbm.readthedocs.io) to traditional baselines
Evaluation on the 4 Web Track query sets
Conduct brief study on feature importance
Effectiveness Results
[Figure: NDCG@20 (y-axis, 0.10 to 0.30) on WT09, WT10, WT11, and WT12 for BM25, LambdaMART, SDM, and TREC Best]
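For reference, the NDCG@20 metric used in these results can be computed along the following lines. This sketch uses the exponential-gain, log2-discount formulation, which is one common choice. The slides do not state which variant or evaluation tool was used. Treating negative grades (such as -2) as zero gain is also an assumption.

```python
import math


def dcg(grades, k):
    """Discounted cumulative gain over the top-k grades."""
    return sum(
        (2 ** max(g, 0) - 1) / math.log2(rank + 2)  # rank is 0-based
        for rank, g in enumerate(grades[:k])
    )


def ndcg(ranked_grades, judged_grades, k=20):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering.

    ranked_grades: relevance grades in the order the system returned them.
    judged_grades: grades of all judged documents for the query.
    """
    ideal = dcg(sorted(judged_grades, reverse=True), k)
    return dcg(ranked_grades, k) / ideal if ideal > 0 else 0.0


# Toy example: the only relevant document is returned at rank 2.
score = ndcg([0, 1], [1, 0])
```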
Feature Importance Summary
[Figure: average feature importance (2009-2012), x-axis 0 to 150. Features shown: BM25kMax, StreamLenInlink, AlexaRank, BEInlink, StopCover, AvgTermLen, LM2500, LM1500Inlink, LM2500Inlink, DPHInlink, LM2500Body, VisibleText, Stage0, LM2500Title]
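An average like the one summarised here can be produced by averaging per-feature importance scores over the models trained for each year. The sketch below assumes one importance dictionary per year and counts a feature missing from a year as zero. The slides do not say which importance measure was used or how missing features were handled, and the numbers and feature names in the example are invented.

```python
def average_importance(per_year):
    """Average per-feature importance over years, sorted high to low."""
    totals = {}
    for importances in per_year.values():
        for name, value in importances.items():
            totals[name] = totals.get(name, 0.0) + value
    n_years = len(per_year)
    averaged = {name: total / n_years for name, total in totals.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)


# Toy values only, not results from the talk.
ranking = average_importance({
    "2009": {"feat_a": 10, "feat_b": 2},
    "2010": {"feat_a": 20},
})
```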
Summary
Fxt software for feature based machine learning in IR
Released LTR dataset on ClueWeb09B
Facilitate more open research and collaboration
Avenues for future research:
Efficiency in feature extraction
Model interpretation
Ablation/Feature selection
End-to-end system prototyping

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Recently uploaded (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

Feature Extraction for Large-Scale Text Collections

  • 1. Feature Extraction for Large-Scale Text Collections. Luke Gallagher (1), Antonio Mallia (2), J. Shane Culpepper (1), Torsten Suel (2), B. Barla Cambazoglu (1). (1) RMIT University; (2) New York University.
  • 2. Feature Extraction: Why Do We Care? We want open and accessible tooling around feature extraction, because many people in our research group are interested in: efficient retrieval over massive text collections; efficient and scalable algorithm design; multi-stage retrieval systems; LTR and cascade ranking; end-to-end retrieval experiments.
  • 3. Multi-Stage Search. Multi-stage system described by Pedersen (graphic: J. Mackenzie, jmmackenzie.io/publication/thesis). Reference: J. Pedersen. “Query Understanding at Bing”. In: Proc. SIGIR Industry Track Keynote. 2010.
  • 4. What is Feature Extraction? Map a larger problem space to a smaller problem space. Parts of an inverted index are a result of feature extraction. References: D. Manolescu. “Feature Extraction–A Pattern for Information Retrieval”. In: Proc. PLOP. 1998. G. Salton. Interactive Information Retrieval. Tech. rep. TR69-40. Cornell University, 1969.
  • 5. Feature Extraction in Multi-Stage Retrieval. Many search applications use LTR (GBRT, LambdaMART). Feature-based models depend on feature engineering and on infrastructure to support feature extraction. This sets a high bar for research on feature extraction and related tasks (e.g. efficiency, model interpretation).
  • 6. Feature Extraction in LTR. Which features to implement or extract? Features may depend on the search task. Many “good” ranking features are query dependent (i.e. they require both query and document). Results for “seen” queries can be cached or pre-computed. In general, it is not possible to pre-compute query-dependent features.
  • 8. Key Contributions. Feature extraction software: easier to test out ideas in feature extraction; better simulation of search tasks dependent upon feature extraction. LTR dataset on ClueWeb09B: possibly the first public LTR dataset that is completely transparent (i.e., queries, documents, features, etc. are known).
  • 9. Open Feature Extraction Tooling. Existing open source solutions: support LTR but do not implement features for users (Solr, Elastic, Terrier); Anserini provides some features with LTR; published work tends to be “single-use” engineering. We don’t have open feature extraction tools that: provide a large set of text-based features; facilitate the feature extraction process; can be used standalone or within a retrieval pipeline. Reference: L. Wang, J. Lin, and D. Metzler. In: Proc. SIGIR. 2011.
  • 10. Fxt – Feature Extraction Toolkit. What does the Fxt software (github.com/ten-blue-links/fxt) provide? A configurable collection of 448 features, mainly from the literature in QPP and LTR; indexer: build the feature index; extractor: extract features from candidate documents. Use cases: standalone feature extraction; generating training data; end-to-end retrieval experiments (more work required). Reference: N. Asadi and J. Lin. In: Inf. Retr. (2013).
  • 11. Summary of Features in Fxt

      Description                         No. Features
      Term Score Aggregation (Unigram)    159
      Term Score Aggregation (Bigram)     147
      Query Document Score (Unigram)      106
      Query Document Score (Bigram)       4
      Static Document Quality             23
      Query (Document Independent)        13
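The "Term Score Aggregation" class in the table above can be illustrated with a small sketch: score each query term against a document separately, then collapse the per-term scores into a fixed number of statistics. The slides do not say which scorers or statistics Fxt uses, so the BM25 variant, the parameter defaults (`k1`, `b`), and the function names below are assumptions for illustration only.

```python
import math
from statistics import mean


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=0.9, b=0.4):
    """BM25 contribution of a single query term to a single document.

    k1 and b are illustrative defaults, not values taken from the slides.
    """
    if tf == 0:
        return 0.0
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm


def aggregate_term_scores(scores):
    """Collapse a variable-length list of per-term scores into fixed features."""
    if not scores:
        return {"sum": 0.0, "min": 0.0, "max": 0.0, "mean": 0.0}
    return {
        "sum": sum(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
    }


# Hypothetical (term frequency in document, document frequency) pairs
# for a three-term query.
term_stats = [(3, 120), (0, 4000), (1, 15)]
scores = [
    bm25_term_score(tf, df, doc_len=250, avg_doc_len=300.0, num_docs=50000)
    for tf, df in term_stats
]
features = aggregate_term_scores(scores)
```

The sum here coincides with the usual query-document BM25 score; the other statistics describe how that score is distributed over the query terms, which is what makes them separate features.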
  • 12. LTR Dataset – ClueWeb09B. Web Track queries and judgments from 2009–2012. 134 features. Feature classes (see paper for details): query-document unigram (e.g. BM25, BM25-title); query-document bigram (e.g. SDM, BM25-TP); static document quality (e.g. Stop Ratio, AlexaRank). Publicly available (github.com/ten-blue-links/cikm20): download the dataset; reproduce the dataset and/or experiments. Note: definitions for “replicate” and “reproduce” were recently swapped (tinyurl.com/acm-replicate-reproduce). References: J. S. Culpepper, C. L. A. Clarke, and J. Lin. In: Proc. ADCS. 2016. M. Bendersky, W. B. Croft, and Y. Diao. In: Proc. WSDM. 2011. X. Lu, A. Moffat, and J. S. Culpepper. In: Proc. CIKM. 2015.
  • 13. Summary of Features in Dataset

      Description                         No. Features
      Query Document Score (Unigram)      106
      Query Document Score (Bigram)       4
      Static Document Quality             23
      AlexaRank (2010)                    1
  • 14. Dataset Construction Process: (1) Indri index with fields; (2) Fxt index; (3) BM25 to generate candidate set for queries (depth 1k); (4) extract features to CSV; (5) random shuffle and split into train/val/test.
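The final step of the process above (random shuffle and split) might look like the sketch below. The slides do not state the split proportions, the seed, or whether the split is made per query or per row; splitting by query ID, the 60/20/20 fractions, and the `qid` field name are all assumptions made here so that documents for one query never straddle two partitions.

```python
import random


def split_queries(query_ids, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle unique query IDs reproducibly and cut them into three partitions."""
    ids = sorted(set(query_ids))      # sort first so the shuffle is deterministic
    rng = random.Random(seed)         # private RNG, global random state untouched
    rng.shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def assign_rows(rows, splits):
    """Route each feature row to the partition that owns its query ID."""
    owner = {qid: name for name, ids in splits.items() for qid in ids}
    out = {name: [] for name in splits}
    for row in rows:
        out[owner[row["qid"]]].append(row)
    return out


# Hypothetical rows as they might be read from the extracted CSV.
rows = [{"qid": q, "docid": f"doc-{q}-{i}"} for q in range(1, 11) for i in range(3)]
splits = split_queries(r["qid"] for r in rows)
partitions = assign_rows(rows, splits)
```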
  • 15. ClueWeb09B Relevance Judgment Methods

      Track    Grades      Method
      MQ09     0–2         TREC, MTC, NEU
      WT09     0–2         TREC, MTC, NEU
      WT10     -2, 0–4     TREC
      WT11     -2, 0–3     TREC
      WT12     -2, 0–4     TREC

      Note: judgments for WT09 topics are identical to those from MQ09.
      References: B. Carterette, J. Allan, and R. Sitaraman. “Minimal Test Collections for Retrieval Evaluation”. In: Proc. SIGIR. 2006. J. Aslam and V. Pavlu. A Practical Sampling Strategy for Efficient Retrieval Evaluation. Tech. rep. Northeastern U., 2007. B. Carterette et al. “Million Query Track 2009 Overview”. In: Proc. TREC. 2009. X. Lu, A. Moffat, and J. S. Culpepper. “The Effect of Pooling and Evaluation Depth on IR Metrics”. In: Inf. Retr. (2016).
  • 16. Train–Test Setup

      Test Queries    Train/Valid Queries
      WT09            MQ09
      WT10            WT09, WT11, WT12
      WT11            WT09, WT10, WT12
      WT12            WT09, WT10, WT11
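The setup in the table above can be written down as a small lookup, sketched here under the assumption that every feature row carries a `track` label (a hypothetical field, not something the slides specify). Each Web Track query set is held out in turn, and WT09 is the special case that trains on MQ09.

```python
# Mapping copied from the train-test table: test set -> train/valid sets.
TRAIN_VALID_FOR = {
    "WT09": ["MQ09"],
    "WT10": ["WT09", "WT11", "WT12"],
    "WT11": ["WT09", "WT10", "WT12"],
    "WT12": ["WT09", "WT10", "WT11"],
}


def make_fold(rows, test_track):
    """Return (train_valid_rows, test_rows) for one held-out query set."""
    train_tracks = set(TRAIN_VALID_FOR[test_track])
    train_valid = [r for r in rows if r["track"] in train_tracks]
    test = [r for r in rows if r["track"] == test_track]
    return train_valid, test


# Hypothetical rows: two per query set, just to exercise the fold logic.
rows = [
    {"track": t, "qid": f"{t}-{i}"}
    for t in ["MQ09", "WT09", "WT10", "WT11", "WT12"]
    for i in range(2)
]
train_valid, test = make_fold(rows, "WT10")
```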
  • 17. No Dataset is Perfect: What’s Broken Here? Building datasets is hard, but we’re lucky (system study). The SDM feature is broken. Unfortunately, this was found after camera-ready. SDM was decoding postings for every document! We used a workaround that created a post-processing bug: tinyurl.com/yxkm9867. The dataset is versioned (github.com/ten-blue-links/cikm20/releases) and a fix will be released.
  • 18. Experimental Details. Compare effectiveness of LambdaMART (lightgbm.readthedocs.io) to traditional baselines. Evaluation on the 4 Web Track query sets. Conduct a brief study on feature importance.
  • 19. Effectiveness Results. [Figure: NDCG@20 (y-axis, 0.10 to 0.30) on WT09, WT10, WT11, and WT12 for BM25, LambdaMART, SDM, and TREC Best.]
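For readers unfamiliar with the metric in the figure, the following is a minimal sketch of NDCG at depth 20. The slides do not specify the exact formulation, so the exponential gain, the log2 discount, and the choice to treat negative grades (such as the -2 grade in WT10–WT12) as zero gain are assumptions here, and the function names are illustrative.

```python
import math


def dcg_at_k(grades, k=20):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(
        (2 ** max(g, 0) - 1) / math.log2(rank + 2)   # negative grades give zero gain
        for rank, g in enumerate(grades[:k])
    )


def ndcg_at_k(ranked_grades, judged_grades, k=20):
    """DCG of the system ranking divided by the DCG of the ideal ordering.

    ranked_grades: grades of the returned documents, in rank order.
    judged_grades: grades of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([2, 1, 0], [2, 1, 0])
reversed_order = ndcg_at_k([0, 1, 2], [2, 1, 0])
```

A ranking that already places documents in decreasing grade order scores 1.0, and any other ordering of the same documents scores lower.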
  • 21. Summary. Fxt software for feature-based machine learning in IR. Released LTR dataset on ClueWeb09B. Facilitate more open research and collaboration. Avenues for future research: efficiency in feature extraction; model interpretation; ablation/feature selection; end-to-end system prototyping.