This document summarizes Bloomberg's use of machine learning for search ranking within their Solr implementation. It discusses how they process 8 million searches per day and need machine learning to automatically tune rankings over time as their index grows to 400 million documents. They use a Learning to Rank approach where features are extracted from queries and documents, training data is collected, and a ranking model is generated to optimize metrics like click-through rates. Their Solr Learning to Rank plugin allows this model to re-rank search results in Solr for improved relevance.
1. Learning To Rank For Solr
Michael Nilsson – Software Engineer
Diego Ceccarelli – Software Engineer
Joshua Pantony – Software Engineer
Bloomberg LP
2. OUTLINE
● Search at Bloomberg
● Why do we need machine learning for search?
● Learning to Rank
● Solr Learning to Rank Plugin
3. 8 million searches PER DAY
1 million PER DAY
400 million stories in the index
4. SOLR IN BLOOMBERG
● Search engine of choice at Bloomberg
─ Large community / Well distributed committers
─ Open source Apache Project
─ Used within many commercial products
─ Large feature set and rapid growth
● Committed to open-source
─ Ability to contribute to core engine
─ Ability to fix bugs ourselves
─ Contributions in almost every Solr release since 4.5.0
10. PROBLEM SETUP
● It’s hard to manually tweak the ranking
─ You must be an expert in the domain
─ … or a magician

score = 99.9 * scoreOnTitle
      + 3.1114 * scoreOnDescription
      + 42.42 * clicks
      + 5 * timeElapsedFromLastUpdate

query = solr, query = lucene, query = austin, query = bloomberg, query = …
11. PROBLEM SETUP
It’s easier with Machine Learning
● 2,000+ parameters (non-linear, so the space of combinations is factorially larger than the linear form)
● 8,000+ queries that are regularly tuned
● Early on we spent many days hand tuning…
15. TRAINING DATA: IMPLICIT VS EXPLICIT
What is explicit data?
● A set of judges manually assesses the search results for a given query
─ Experts
─ Crowd
● Pros:
─ Data is very clean
● Cons:
─ Can be very expensive!
What is implicit data?
● Infer user preferences from user behavior
─ Aggregated result clicks
─ Query reformulation
─ Dwell time
● Pros:
─ A lot of data!
● Cons:
─ Extremely noisy
─ Privacy concerns
17. FEATURES
● A feature is an individual measurable property
● Given a query and a collection, we can produce many features for each
document in the collection
─ Does the query match the title?
─ Length of the document
─ Number of views
─ How old is it?
─ Can it be viewed on a mobile device?
19-21. FEATURES (table built up one feature per slide)
Extract “features”
Features are signals that give an indication of a result’s importance

Query: AAPL US

Was the result a cofounder?                  0
Does the query match the document title?     0
Does the document have an exec. position?    1
Popularity (%)                               0.9
22. FEATURES
Extract “features”
Features are signals that give an indication of a result’s importance

Was the result a cofounder?                  0
Does the query match the document title?     1
Does the document have an exec. position?    0
Popularity (%)                               0.6
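A quick illustration of how a ranking model turns these feature vectors into an ordering. Only the feature values come from the two slides above; the weights below are invented purely for illustration:

# Toy linear scorer over the feature vectors from slides 19-22.
# The weights are made up; a real model learns them from data.
weights = {"isCofounder": 0.5, "matchTitle": 2.0,
           "isExecutive": 1.5, "popularity": 1.0}

tim_cook  = {"isCofounder": 0, "matchTitle": 0, "isExecutive": 1, "popularity": 0.9}
candidate = {"isCofounder": 0, "matchTitle": 1, "isExecutive": 0, "popularity": 0.6}

def score(doc):
    # Weighted sum of feature values, as in the slide 10 formula.
    return sum(weights[f] * v for f, v in doc.items())

print(score(tim_cook), score(candidate))  # 2.4 vs 2.6: candidate ranks first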
24. METRICS
How do we know if our model is doing better?
● Offline metrics
─ Precision / Recall / F1 score
─ nDCG (Normalized Discounted Cumulative Gain; see the sketch after this slide)
─ Other metrics (e.g., ERR, MAP, …)
● Online metrics
─ Click-through rate → higher is better
─ Time to first click → lower is better
─ Interleaving¹

¹ O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1), 2012.
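For the offline side, nDCG is straightforward to compute. A minimal sketch; the graded relevance labels are illustrative:

import math

def dcg(relevances):
    # Discounted cumulative gain: each result's graded relevance is
    # discounted by the log of its rank (rank 1 -> log2(2), etc.).
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending) ordering,
    # so a perfect ranking scores 1.0.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance labels of the top-5 results, in ranked order.
print(ndcg([3, 2, 0, 1, 2]))  # ~0.96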
26. LEARNING TO RANK
● Learn how to combine the features for optimizing one or more metrics
● Many learning algorithms (a toy pairwise sketch follows the references)
─ RankSVM¹
─ LambdaMART²
─ …

¹ T. Joachims. Optimizing Search Engines Using Clickthrough Data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
² C.J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82, 2010.
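To make the pairwise idea behind RankSVM concrete, here is a toy sketch: it reduces graded judgments to "document a beats document b" pairs and learns linear weights with perceptron-style updates on the difference vectors. The data and the update rule are illustrative stand-ins, not the actual RankSVM optimization:

# Toy pairwise learner: (feature_vector, relevance_label) per document,
# grouped by query; higher label = more relevant. Data is made up.
training = {
    "q1": [([1.0, 0.2], 2), ([0.3, 0.9], 1), ([0.1, 0.1], 0)],
    "q2": [([0.8, 0.5], 1), ([0.2, 0.4], 0)],
}

w = [0.0, 0.0]   # one weight per feature
LR = 0.1

for _ in range(100):
    for docs in training.values():
        for xa, ya in docs:
            for xb, yb in docs:
                if ya > yb:
                    # If the model mis-orders the pair, nudge w toward
                    # scoring the more relevant document higher.
                    diff = [a - b for a, b in zip(xa, xb)]
                    if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                        w = [wi + LR * di for wi, di in zip(w, diff)]

print(w)  # learned weights; score(doc) = w · features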
30. SEARCH PIPELINE: SOLR INTEGRATION
[Diagram] User query → Solr: top-k retrieval from the index (People, Commodities, News, Other Sources) → online ranking model → top-x reranked results
31. SOLR RELEVANCY
● Pros
─ Simple and quick scoring computation
─ Phrase matching
─ Function query boosting on time, distance, popularity, etc
─ Customized fields for stemming, synonyms, etc
● Cons
─ Creating a well-tuned query takes a lot of manual time
─ Weights are brittle and may stop working as more documents or fields are added
32. LTR PLUGIN: GOALS
● Don’t tune the relevancy manually!
─ Uses machine learning to power automatic relevancy tuning
● Significant relevancy improvements
● Allow comparable scores across collections
─ Collections of different sizes
● Maintaining low latency
─ Re-use the vast Solr search functionality that is already built-in
─ Less data transport
● Makes it simple to use domain knowledge to rapidly create features
─ Features are no longer coded but rather scripted
33. STANDARD SOLR SEARCH REQUEST
[Diagram] User query → top-k retrieval from the index (People, Commodities, News, Other Sources)
34. STANDARD SOLR SEARCH REQUEST
[Diagram] User query → Solr query → index [10 million] → matches [10k] → scored [10k] → top-10 retrieval (People, Commodities, News, Other Sources)
35. LTR SOLR SEARCH REQUEST
[Diagram] User query → Solr query → index [10 million] → matches [10k] → scored [10k] → top-1000 retrieval → LTR query → ranking model → top-10 reranked (People, Commodities, News, Other Sources)
36. LTR PLUGIN: RERANKING
<!-- Query parser used to rerank top docs with a provided model -->
<queryParser name="ltr" class="org.apache.solr.ltr.ranking.LTRQParserPlugin" />
● LTRQuery extends Solr’s RankQuery
─ Wraps main query to fetch initial results
─ Returns custom TopDocsCollector for reranked ordered results
● Solr rerank request parameter (full request example below)
rq={!ltr model=myModel1 reRankDocs=100 efi.user_query='james' efi.my_var=123}
─ !ltr – name used in the solrconfig.xml for the LTRQParserPlugin
─ model – name of deployed model to use for reranking
─ reRankDocs – total number of documents to rerank
─ efi.* – custom parameters used to pass external feature information for your
features to use
• Query intent
• Personalization
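Putting the pieces together, a full rerank request can look like the following. The host and collection are the placeholders from the deploy examples later in the deck; the /select handler and the query values are illustrative:

curl "http://yoursolrserver/solr/collection/select" \
  --data-urlencode "q=james" \
  --data-urlencode "rq={!ltr model=myModel1 reRankDocs=100 efi.user_query=james}" \
  --data-urlencode "fl=*,score"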
37. SEARCH PIPELINE (ONLINE)
[Diagram] User query → top-1000 retrieval from the index [10 million] → matches [10k] → scored [10k] → feature extraction → ranking model → top-10 reranked (People, Commodities, News, Other Sources)
38. FEATURES
Extract “features”
Features are signals that give an indication of a result’s importance

{ "name": "Tim Cook", "primary_position": "ceo", "category": "person", … }

Was the result a cofounder?                  0
Does the query match the document title?     0
Does the document have an exec. position?    1
Popularity (%)                               0.9
41. LTR PLUGIN: FUNCTION QUERIES
[
  {
    "name": "documentRecency",
    "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
    "params": {
      "q": "{!func}recip(ms(NOW,publish_date),3.16e-11,1,1)"
    }
  },
  …
]
Yields 1 for docs dated now, 1/2 for docs dated 1 year ago, 1/3 for docs dated 2 years ago, etc.
See http://wiki.apache.org/solr/FunctionQuery#Date_Boosting
42. LTR PLUGIN: FEATURE STORE
● FeatureStore is a Solr Managed Resource
─ REST API endpoint for performing CRUD operations on Solr objects
─ Stored and maintained in ZooKeeper
● Deploy
─ curl -XPUT 'http://yoursolrserver/solr/collection/config/fstore'
--data-binary @./features.json -H 'Content-type:application/json'
● View
─ http://yoursolrserver/solr/collection/config/fstore
43. LTR PLUGIN: FEATURES
● Simplifies feature engineering through a configuration file (see the sketch after this list)
● Utilizes rich search functionality built into Solr
─ Phrase matching
─ Synonyms, stemming, etc.
● Inherit the Feature class for specialized features
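As a sketch of what a "scripted" feature looks like, here is a hedged title-match feature in the style of the documentRecency example from slide 41. The title field name and the ${user_query} substitution of the efi parameter are assumptions about the plugin's templating, not confirmed syntax:

[
  {
    "name": "matchTitle",
    "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
    "params": {
      "q": "{!field f=title}${user_query}"
    }
  }
]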
44. SEARCH PIPELINE (ONLINE)
[Diagram] User query → top-1000 retrieval from the index [10 million] → matches [10k] → scored [10k] → feature extraction → ranking model → top-10 reranked (People, Commodities, News, Other Sources)
45. TRAINING PIPELINE (OFFLINE)
[Diagram] Training queries → top-1000 retrieval from the index [10 million] → matches [10k] → scored [10k] → feature extraction → learning algorithm → ranking model (People, Commodities, News, Other Sources)
46. FEATURES
Extract “features”
Features are signals that give an indication of a result’s importance

{ "name": "Tim Cook", "primary_position": "ceo", "category": "person", … }

Was the result a cofounder?                  0
Does the query match the document title?     0
Does the document have an exec. position?    1
Popularity (%)                               0.9
47. LTR PLUGIN: FEATURE EXTRACTION
<!-- Document transformer adding feature vectors with each retrieved document -->
<transformer name="fv" class="org.apache.solr.ltr.ranking.LTRFeatureTransformer" />
● Feature extraction uses Solr’s TransformerFactory
─ Returns a custom field with each document
● fl = *, [fv] (parsing sketch below)
{
  "name": "Tim Cook",
  "primary_position": "ceo",
  "category": "person",
  …
  "[fv]": "isCofounder:0.0,isPersonAndExecutive:1.0,matchTitle:0.0,popularity:0.9"
}
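On the training side, the [fv] field is easy to turn into the flat label/qid/features rows that common LETOR-style ranking tools expect. A minimal sketch; the feature ordering and the judgment values are illustrative:

# Convert a judged result's [fv] string into a LETOR/RankLib-style row.
FEATURES = ["isCofounder", "isPersonAndExecutive", "matchTitle", "popularity"]

def to_letor(label, qid, fv):
    # fv looks like "isCofounder:0.0,isPersonAndExecutive:1.0,..."
    values = dict(pair.strip().split(":") for pair in fv.split(","))
    feats = " ".join(f"{i + 1}:{float(values[name])}"
                     for i, name in enumerate(FEATURES))
    return f"{label} qid:{qid} {feats}"

print(to_letor(3, 42,
               "isCofounder:0.0,isPersonAndExecutive:1.0,"
               "matchTitle:0.0,popularity:0.9"))
# -> 3 qid:42 1:0.0 2:1.0 3:0.0 4:0.9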
49. LTR PLUGIN: MODEL
● ModelStore is also a Solr Managed Resource
● Deploy
─ curl -XPUT 'http://yoursolrserver/solr/collection/config/mstore'
--data-binary @./model.json -H 'Content-type:application/json'
● View
─ http://yoursolrserver/solr/collection/config/mstore
● Inherit from the model class for new scoring algorithms
─ score()
─ explain()
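For reference, a hedged sketch of what a deployed model.json for a simple linear model can look like. The class name and JSON layout follow the LTR module as later merged into Apache Solr (org.apache.solr.ltr.model.LinearModel); the pre-merge plugin described in this deck may use different names, and the weights are invented:

{
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "myModel1",
  "features": [
    { "name": "matchTitle" },
    { "name": "documentRecency" }
  ],
  "params": {
    "weights": {
      "matchTitle": 1.0,
      "documentRecency": 0.5
    }
  }
}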
50. LTR PLUGIN: EVALUATION
● Offline Metrics
─ nDCG increased approximately 10% after reranking
● Online Metrics
─ Clicks @ 1 up by approximately 10%
52. LTR PLUGIN: EVALUATION
● Offline metrics
─ nDCG increased approximately 10% after reranking
● Online metrics
─ Clicks @ 1 up by approximately 10%
● Performance
─ About 30% faster than the previous external ranking system
─ Benchmark setup: 10 million documents in the collection, 100k queries, 1k features, 1k documents reranked per query
53. LTR PLUGIN: BENEFITS
● Simpler feature engineering, with no compilation step
● Access to rich internal Solr search functionality for feature building
● Search-result relevancy improvements over stock Solr relevance
● Automatic relevancy tuning
● Comparable scores across collections
● Performance benefits over the external ranking system
54. FUTURE WORK
● Continue work to open source the plugin
● Support pipelining multiple reranking models
● Allow a simple ranking model to be used in the first pass