© 2018 Bloomberg Finance L.P. All rights reserved.
Learning to Rank:
From Theory to Production
Malvina Josephidou & Diego Ceccarelli
Bloomberg
@malvijosephidou | @diegoceccarelli
#Activate18 #ActivateSearch
About Us
o Software Engineers at Bloomberg
o Working on relevance of the News search engine
o Before joining, PhDs in ML and IR
Bloomberg – Who are we?
o A technology company with
5,000+ software engineers
o Financial data, analytics,
communication and trading tools
o More than 325K subscribers in
170 countries
News search in numbers
How do we retrieve relevant results?
o Use relevance
functions to assign
scores to each
matching document
o Sort documents by
relevance score
Example relevance scores: 50.1, 35.5, 10.2
How do we design relevance functions?
score = tf(body)
+ 5.2 x tf(title)
+ 4.5 x tf(desc)
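A minimal sketch of what such a hand-tuned relevance function looks like in code. The weights 5.2 and 4.5 come from the slide; treating the unweighted term as the body-field term frequency, the whitespace tokenizer and the sample documents are assumptions made for illustration.

```python
from collections import Counter

# Weights from the slide; a weight of 1.0 for the body field is an assumption.
WEIGHTS = {"body": 1.0, "title": 5.2, "desc": 4.5}


def tf(term, text):
    """Raw term frequency: how many times `term` occurs in `text`."""
    return Counter(text.lower().split())[term.lower()]


def score(term, doc):
    """score = tf(body) + 5.2 x tf(title) + 4.5 x tf(desc)"""
    return sum(w * tf(term, doc.get(field, "")) for field, w in WEIGHTS.items())


def rank(term, docs):
    """Sort documents by descending relevance score."""
    return sorted(docs, key=lambda d: score(term, d), reverse=True)


# Hypothetical documents, for illustration only.
docs = [
    {"id": "a", "title": "Solr tips", "desc": "search", "body": "solr solr lucene"},
    {"id": "b", "title": "Cooking", "desc": "solr mention", "body": "pasta"},
]
ranked = rank("solr", docs)
```

Every new signal means another hand-picked weight, which is the problem the following slides address.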
Good Luck With That…
Example queries: Solr, Italy, Facebook, Trump
score = tf(body)
+ 5.2 x tf(title)
+ 4.5 x tf(desc)
+ ??? x doc-length
+ ??? x freshness
+ ??? x popularity
+ ??? x author
+ ..... ?????
How do we come up with ranking functions?
o We don’t. Hand-tuning ranking functions is
insane
o ML to the rescue: Use data to train algorithms
to distinguish relevant documents from
irrelevant documents
2018 achievement
Learning-to-Rank fully deployed in production
Learning-to-Rank
(aka LTR)
Use machine learning algorithms to rank
results in a way that optimizes search
relevance
How does this work?
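Diagram: without LTR, the user query runs against the index (People, Commodities, News, Other Sources) and top-k retrieval returns the results directly.
With LTR, top-x retrieval (x >> k) feeds a re-ranking model, which returns the top-k re-ranked results.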
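The two-stage flow (retrieve top-x cheaply, then re-rank down to top-k with the model) can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the in-memory index, the `base_score` and `freshness` fields and the toy model are all hypothetical.

```python
import heapq


def retrieve_top_x(index, query, x):
    """First stage: cheap base score over every matching document."""
    matches = [doc for doc in index if query in doc["text"]]
    return heapq.nlargest(x, matches, key=lambda doc: doc["base_score"])


def rerank_top_k(candidates, model, k):
    """Second stage: apply the more expensive model to the x candidates only."""
    return sorted(candidates, key=model, reverse=True)[:k]


# Hypothetical index and model, for illustration only.
toy_index = [
    {"id": 1, "text": "apple earnings", "base_score": 3.0, "freshness": 0.1},
    {"id": 2, "text": "apple iphone", "base_score": 2.0, "freshness": 0.9},
    {"id": 3, "text": "apple pie", "base_score": 1.0, "freshness": 0.5},
    {"id": 4, "text": "banana", "base_score": 9.0, "freshness": 1.0},
]


def toy_model(doc):
    return doc["base_score"] + 5.0 * doc["freshness"]


candidates = retrieve_top_x(toy_index, "apple", x=2)
top = rerank_top_k(candidates, toy_model, k=1)
```

The model only ever sees x documents, so its cost is bounded regardless of how many documents match the query.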
How to Ship LTR in Production in 3 Steps
Make it Work
Make it Fast
Deploy to Production
LTR steps
I. Collect query-document judgments [Offline]
II. Extract query-document features [Solr]
III. Train model with judgments + features [Offline]
IV. Deploy model [Solr]
V. Apply model [Solr]
VI. Evaluate results [Offline]
I. Collect judgements
Judgement (good/bad) or judgement (5 stars), e.g. 3/5, 5/5, 0/5
I. Collect judgements
Explicit – judges assess search results manually
o Experts
o Crowdsourced
Implicit – infer assessments through user behavior
o Aggregated result clicks
o Query reformulation
o Dwell time
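As one possible illustration of implicit judgments, aggregated result clicks can be turned into graded labels. The log format, the click-through-rate thresholds and the three-level scale below are assumptions for this sketch and are not taken from the talk.

```python
from collections import defaultdict


def aggregate_clicks(log):
    """log: iterable of (query, doc_id, clicked) impressions -> CTR per pair."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in log:
        shown[(query, doc_id)] += 1
        clicked[(query, doc_id)] += int(was_clicked)
    return {pair: clicked[pair] / shown[pair] for pair in shown}


def to_judgment(ctr, thresholds=(0.1, 0.5)):
    """Map a click-through rate to a graded label: 0 (bad), 1 (fair), 2 (good)."""
    return sum(ctr >= t for t in thresholds)


# Hypothetical click log, for illustration only.
log = [
    ("aapl", "d1", True), ("aapl", "d1", True),
    ("aapl", "d2", True), ("aapl", "d2", False),
    ("aapl", "d2", False), ("aapl", "d2", False),
    ("aapl", "d3", False),
]
judgments = {pair: to_judgment(ctr) for pair, ctr in aggregate_clicks(log).items()}
```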
II. Extract Features
Signals that give an indication of a result’s importance
Query matches the title | Freshness | Is it from bloomberg.com? | Popularity
0 | 0.7 | 0 | 3583
1 | 0.9 | 1 | 625
0 | 0.1 | 0 | 129
II. Extract Features
o Define features to extract in
myFeatures.json
o Deploy features definition file
to Solr
curl -XPUT 'http://localhost:8983/solr/myCollection/schema/feature-store' --data-binary "@/path/myFeatures.json" -H 'Content-type:application/json'
[
  {
    "name": "matchTitle",
    "type": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "q": "{!field f=title}${text}"
    }
  },
  {
    "name": "freshness",
    "type": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "q": "{!func}recip(ms(NOW,timestamp),3.16e-11,1,1)"
    }
  },
  { "name": "isFromBloomberg", … },
  { "name": "popularity", … }
]
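The same upload can be scripted instead of typed as a curl command. A minimal sketch using only the Python standard library; the helper names are hypothetical, and the URL and file path simply mirror the curl example.

```python
import urllib.request


def build_put_request(url, payload):
    """Build (but do not send) a JSON PUT request, mirroring `curl -XPUT`."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Content-type": "application/json"},
    )


def upload_features(collection_url, path):
    """Read the feature definitions from `path` and send them to Solr."""
    with open(path, "rb") as f:
        request = build_put_request(collection_url + "/schema/feature-store", f.read())
    with urllib.request.urlopen(request) as response:
        return response.status


# Building the request needs no running Solr instance.
request = build_put_request(
    "http://localhost:8983/solr/myCollection/schema/feature-store", b"[]"
)
```

Calling `upload_features("http://localhost:8983/solr/myCollection", "/path/myFeatures.json")` would perform the actual upload against a running instance.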
II. Extract Features
o Add features transformer to Solr config
<!-- Document transformer adding feature vectors with each retrieved document -->
<transformer name="features"
class="org.apache.solr…LTRFeatureLoggerTransformerFactory" />
o Request features for a document by adding [features] to the fl parameter
http://localhost:8983/solr/myCollection/query?q=test&fl=title,url,[features]
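Once features are logged with each document, the `[features]` value has to be parsed before it can be joined with judgments. The sketch below assumes the value is a comma-separated list of `name=value` pairs; the exact delimiters depend on the Solr version and transformer configuration, so both are parameters here, and the sample document is hypothetical.

```python
def parse_feature_vector(raw, pair_sep=",", kv_sep="="):
    """Turn 'name=value,name=value' into a {name: float} dict."""
    features = {}
    for pair in raw.split(pair_sep):
        if not pair:
            continue  # tolerate an empty vector or a trailing separator
        name, value = pair.split(kv_sep, 1)
        features[name.strip()] = float(value)
    return features


# Hypothetical response document, for illustration only.
doc = {
    "title": "Example story",
    "[features]": "matchTitle=0.0,freshness=0.7,isFromBloomberg=0.0,popularity=3583.0",
}
vector = parse_feature_vector(doc["[features]"])
```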
III. Train Model
o Combine query-document judgments & features into a training data file
o Train ranking model offline
• RankSVM¹ [liblinear]
• LambdaMART² [ranklib]
Example: Linear model
score = 1.2 x matchTitle + 10 x popularity
+ 5.4 x isFromBloomberg - 2 x freshness
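A small sketch of both halves of this step: writing one judged query-document pair as a training line, and applying the example linear model. The `label qid:<id> <index>:<value>` layout is the common SVMlight/RankLib-style convention and is assumed here; the feature order and the sample row are illustrative.

```python
# Feature order is an assumption; the weights are the ones in the example above.
FEATURES = ["matchTitle", "freshness", "isFromBloomberg", "popularity"]
WEIGHTS = {"matchTitle": 1.2, "popularity": 10.0, "isFromBloomberg": 5.4, "freshness": -2.0}


def to_training_line(label, qid, features):
    """Format one judged (query, document) pair as '<label> qid:<id> 1:<v> 2:<v> ...'."""
    cols = " ".join(f"{i}:{features[name]:g}" for i, name in enumerate(FEATURES, start=1))
    return f"{label} qid:{qid} {cols}"


def linear_score(features):
    """score = 1.2 x matchTitle + 10 x popularity + 5.4 x isFromBloomberg - 2 x freshness"""
    return sum(WEIGHTS[name] * features[name] for name in FEATURES)


row = {"matchTitle": 1, "freshness": 0.9, "isFromBloomberg": 1, "popularity": 625}
line = to_training_line(3, 7, row)
row_score = linear_score(row)
```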
¹ T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
² C.J.C. Burges, "From RankNet to LambdaRank to LambdaMART: An Overview", Microsoft Research Technical Report MSR-TR-2010-82, 2010.
IV. Deploy Model
o Generate trained output model
in myModelName.json
o Deploy model definition file to
Solr
curl -XPUT 'http://localhost:8983/solr/techproducts/schema/model-store' --data-binary "@/path/myModelName.json" -H 'Content-type:application/json'
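For a linear model, the model definition file can be generated from the trained weights. The sketch below assumes the JSON layout that upstream Solr documents for `org.apache.solr.ltr.model.LinearModel`; the deployment described in this talk may use a different layout, and the weights are the ones from the earlier linear-model example.

```python
import json


def linear_model_definition(name, weights):
    """Build a model definition dict from a {feature name: weight} mapping."""
    return {
        "class": "org.apache.solr.ltr.model.LinearModel",
        "name": name,
        "features": [{"name": feature} for feature in weights],
        "params": {"weights": dict(weights)},
    }


weights = {"matchTitle": 1.2, "popularity": 10.0, "isFromBloomberg": 5.4, "freshness": -2.0}
model_json = json.dumps(linear_model_definition("myModelName", weights), indent=2)
```

Writing `model_json` to myModelName.json gives the file that the curl command uploads.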
V. Re-rank Results
o Add LTR query parser to Solr config
<!-- Query parser used to re-rank top docs with a provided model -->
<queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>
o Search and re-rank results
http://localhost:8983/solr/myCollection/query?q=AAPL&rq={!ltr model="myModelName" reRankDocs=100}
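When the re-rank request is built programmatically, the `rq` value needs URL encoding because of its braces and spaces. A minimal sketch; the helper name is hypothetical and the parameter values mirror the example request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def rerank_url(base, query, model, rerank_docs):
    """Build a query URL that re-ranks the top `rerank_docs` results with `model`."""
    params = {
        "q": query,
        "rq": f"{{!ltr model={model} reRankDocs={rerank_docs}}}",
    }
    return f"{base}/query?{urlencode(params)}"


url = rerank_url("http://localhost:8983/solr/myCollection", "AAPL", "myModelName", 100)
# Decoding the query string recovers the original local-params expression.
decoded = parse_qs(urlsplit(url).query)
```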
VI. Evaluate quality of search
Precision
number of relevant results returned divided by the total number of results returned
Recall
number of relevant results returned divided by the total number of relevant results for the query
NDCG (normalized discounted cumulative gain)
Image credit: https://commons.wikimedia.org/wiki/File:Precisionrecall.svg by User:Walber
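The three metrics can be written down in a few lines. This sketch uses the common formulation of DCG with linear gains and a log2 rank discount, which is one of several variants in use; the sample inputs are illustrative.

```python
import math


def precision(returned, relevant):
    """Relevant results returned / total results returned."""
    if not returned:
        return 0.0
    return len(set(returned) & set(relevant)) / len(returned)


def recall(returned, relevant):
    """Relevant results returned / total relevant results for the query."""
    if not relevant:
        return 0.0
    return len(set(returned) & set(relevant)) / len(relevant)


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


returned = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
p = precision(returned, relevant)
r = recall(returned, relevant)
n = ndcg([3, 5, 0])  # graded judgments of the returned results, in ranked order
```

Precision and recall ignore the order of results, while NDCG rewards placing the best-judged documents first.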
How to Ship LTR in Production in 3 Steps
Make it Work
Make it Fast
Deploy to Production
Are We Good to Go?
o Mid 2017: the infrastructure bit
seemed to be done.
o We had a ‘placeholder’ feature store.
But, we needed useful features
computed in production.
o We deployed a new shiny feature
store…
There’s no model without features…
Search latency when we rolled out a
new set of features
Why was it slow? Which features were to blame?
Metrics on feature latency
o We added instrumentation to Apache Solr to log the time it took to compute each feature on each document.
o We added analytics on top of that.
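The idea of per-feature timing can be illustrated outside Solr. The sketch below is not the Solr instrumentation from the talk; it only shows the pattern of timing each feature on each document and then aggregating, with hypothetical feature functions.

```python
import time
from collections import defaultdict


def timed_features(feature_fns, doc, timings):
    """Compute every feature for `doc`, recording how long each one took."""
    values = {}
    for name, fn in feature_fns.items():
        start = time.perf_counter()
        values[name] = fn(doc)
        timings[name].append(time.perf_counter() - start)
    return values


def total_time_per_feature(timings):
    """Aggregate the logged samples into total seconds per feature."""
    return {name: sum(samples) for name, samples in timings.items()}


def sloth(doc):
    """Hypothetical slow feature, standing in for an expensive computation."""
    time.sleep(0.01)
    return 1.0


feature_fns = {"title_length": lambda doc: len(doc["title"]), "sloth": sloth}
timings = defaultdict(list)
for doc in [{"title": "a"}, {"title": "bb"}, {"title": "ccc"}]:
    timed_features(feature_fns, doc, timings)

totals = total_time_per_feature(timings)
slowest = max(totals, key=totals.get)
```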
Why is it slow?
o 1.9ms is ok for 10 documents, but not for 100 docs…
Charts (y-axis: total feature time per doc):
Feature latencies using the old set of features. Total time per search: 19ms
Feature latencies using the new set of features (SlothFeature). Total time per search: 145ms
Feature latencies using the new set of features, including the FastSlothFeature. Total time per search: 15ms
How do we go faster?
o Some of the features are unrelated to the query, for example: is the source reliable?
o We can precompute them and store them in the index.
Index Static Features
• Add feature_is_wire_BLAH to the Solr schema
<field name="feature_is_wire_BLAH" type="tdouble" indexed="false" stored="true"
docValues="false" required="false"/>
• Use UpdateRequestProcessors in Solr config to create new fields
<updateRequestProcessorChain>
  <processor
  class="com.internal.solr.update.processor.IsFoundUpdateProcessorFactory">
    <str name="source">wire</str>
    <str name="dest">feature_is_wire_BLAH</str>
    <str name="map">
    {
    "BLAH" : 1.0
    }
    </str>
    <double name="default">0.0</double>
  </processor>
</updateRequestProcessorChain>
Index Static Features
• Features at runtime are produced by reading the value from the
index using the FieldValueFeature.
Would this ease all our performance troubles?
"features":[
{
"type":"ltr.feature.impl.FieldValueFeature",
"params": {
"field": "feature_is_wire_BLAH"
},
"name": "is_wire_BLAH",
"default": 0.0
},
Better?
Why is this happening to us?
Reading from the index can still be slow…
o Retrieving field values from their stored values is slow
o So we changed the implementation of FieldValueFeature to use DocValues if they are available
o DocValues record field values in a column-oriented way, mapping doc ids to values
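The difference between the two layouts can be pictured with plain Python containers. This is a conceptual sketch only and does not model Lucene's actual storage formats; the documents and values are hypothetical.

```python
# Row-oriented (like stored fields): one record per document with all its fields.
# Reading one feature means fetching the whole record for that document first.
stored_fields = {
    0: {"title": "Story A", "body": "long text ...", "feature_is_wire_BLAH": 1.0},
    1: {"title": "Story B", "body": "long text ...", "feature_is_wire_BLAH": 0.0},
    2: {"title": "Story C", "body": "long text ...", "feature_is_wire_BLAH": 1.0},
}

# Column-oriented (like DocValues): one array per field, indexed by doc id.
doc_values = {"feature_is_wire_BLAH": [1.0, 0.0, 1.0]}


def feature_from_stored(doc_id, field):
    record = stored_fields[doc_id]  # loads every field of the document
    return record[field]


def feature_from_doc_values(doc_id, field):
    return doc_values[field][doc_id]  # touches only the requested column


same = all(
    feature_from_stored(d, "feature_is_wire_BLAH")
    == feature_from_doc_values(d, "feature_is_wire_BLAH")
    for d in stored_fields
)
```

Both paths return the same value; the column layout simply avoids reading unrelated fields when only one number per document is needed.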
From Stored Fields to Doc Values
Feature latencies when retrieving 100 docs using StoredFields. Total time per search: 135ms
Feature latencies when retrieving 100 docs using DocValues. Total time per search: 25ms
DocValues: 5x faster!
Are we there yet?
Feature logging/computation
Document re-ranking
Evaluating performance: NO-OP Model
o A linear model, performing all the computations but not modifying the original order
Original Solr Score | Query matches the title | Freshness | Is the document from Bloomberg.com? | Popularity
2.3 | 0 | 0.7 | 0 | 3583
2.1 | 1 | 0.9 | 1 | 625
1.3 | 0 | 0.1 | 0 | 129
No-op weights: 1.0, 0.0, 0.0, 0.0, 0.0
Feature matrix x no-op weights = Final Score: 2.3, 2.1, 1.3
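The no-op model is just a weight vector that keeps the original Solr score and zeroes out everything else. A short check with the numbers from the slide; the function name is hypothetical.

```python
# Columns: original Solr score, matchTitle, freshness, isFromBloomberg, popularity
rows = [
    [2.3, 0, 0.7, 0, 3583],
    [2.1, 1, 0.9, 1, 625],
    [1.3, 0, 0.1, 0, 129],
]
NOOP_WEIGHTS = [1.0, 0.0, 0.0, 0.0, 0.0]


def apply_linear(rows, weights):
    """Dot product of every feature row with the weight vector."""
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]


final_scores = apply_linear(rows, NOOP_WEIGHTS)
original_order = sorted(range(len(rows)), key=lambda i: rows[i][0], reverse=True)
reranked_order = sorted(range(len(rows)), key=lambda i: final_scores[i], reverse=True)
```

Because the final scores equal the original ones, any extra latency measured with this model comes from the LTR machinery itself, not from a change in ranking.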
Need for speed
Median search time:
No LTR, retrieve 3 docs: 39 ms
LTR, retrieve 3 docs, re-rank 25: 77 ms
Why is it so slow???
News needs grouping
o Similar news stories are grouped (clustered) together and only the
top-ranked story in each group is shown
Grouping + Re-ranking:
Not a match made in heaven
o Regular grouping involves 3 communication rounds between coordinator and shards
o With re-ranking, we have to re-rank the groups and the documents in each group
[ SOLR-8776 Support RankQuery in grouping ]
Vegas baby
What is the Las Vegas Patch?
Grouping
Three requests from the coordinator:
1. Coordinator asks each shard for its top n groups for the query and computes the overall top n groups.
2. Each shard computes the top m documents for the top n groups.
3. Coordinator computes the top docs for each group and retrieves them from the shards.
Why? Example:
o Top 2 groups, top 2 documents, 2 shards
Documents on the two machines (Machine 1, Machine 2):
Doc | Group | Score
doc6 | group3 | 70.0
doc7 | group2 | 65.0
doc8 | group3 | 60.0
doc9 | group2 | 60.0
doc10 | group1 | 50.0

Doc | Group | Score
doc1 | group1 | 20.0
doc2 | group2 | 5.0
doc3 | group1 | 100.0
doc4 | group1 | 120.0
doc5 | group2 | 10.0

Top groups computed on each machine:
Group | Docs | Score
group1 | doc4 | 120.0
group1 | doc1 | 20.0
group2 | doc5 | 10.0
group2 | doc2 | 5.0

Group | Docs | Score
group3 | doc6 | 70.0
group3 | doc8 | 60.0
group2 | doc7 | 65.0
group2 | doc9 | 60.0

Top Groups:
Group1: doc4 doc1
Group3: doc6 doc8
Top docs should be:
Group1: doc4 doc10
Group3: doc6 doc8
Las Vegas Idea
o If you want just one document per group,
you do not have this problem
o We can return the top document from each
group in the first step and skip the second
step entirely
o For LTR: Re-rank only the top document of
each group
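The single-round idea can be sketched as follows: each shard returns its top n groups together with the best document of each, and the coordinator merges them. This is an illustration of the idea, not the actual patch; the function names are hypothetical and the document data is taken from the example slide.

```python
def shard_top_groups(docs, n):
    """On one shard: best (doc, score) per group, limited to the shard's top n groups."""
    best = {}
    for doc_id, group, score in docs:
        if group not in best or score > best[group][1]:
            best[group] = (doc_id, score)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return dict(ranked[:n])


def merge_top_groups(shard_results, n):
    """On the coordinator: best document per group across shards, then the top n groups."""
    merged = {}
    for result in shard_results:
        for group, (doc_id, score) in result.items():
            if group not in merged or score > merged[group][1]:
                merged[group] = (doc_id, score)
    return sorted(merged.items(), key=lambda item: item[1][1], reverse=True)[:n]


shard_a = [("doc6", "group3", 70.0), ("doc7", "group2", 65.0), ("doc8", "group3", 60.0),
           ("doc9", "group2", 60.0), ("doc10", "group1", 50.0)]
shard_b = [("doc1", "group1", 20.0), ("doc2", "group2", 5.0), ("doc3", "group1", 100.0),
           ("doc4", "group1", 120.0), ("doc5", "group2", 10.0)]

top = merge_top_groups([shard_top_groups(s, n=2) for s in (shard_a, shard_b)], n=2)
```

With one document per group, a group in the global top n must already be in the top n of the shard holding its best document, so no second round is needed to get the right answer.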
Show me the numbers
o We made plain old search faster: by about 40%!
o LTR-served searches are still faster than they were before we did the Las Vegas optimization
Method | Median time | Perc95 time
Normal Grouping (No LTR) | 0.20 | 0.35
Las Vegas (No LTR) | 0.12 | 0.26
Las Vegas+LTR no-op | 0.18 | 0.27
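The 40% figure follows directly from the median times in the table, as this short calculation shows. The helper name is hypothetical.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Median times from the table above.
las_vegas_gain = relative_reduction(0.20, 0.12)      # plain search, Las Vegas vs normal grouping
ltr_vs_old_search = relative_reduction(0.20, 0.18)   # Las Vegas + LTR no-op vs normal grouping
```

The first value is about 0.40 (the quoted 40%), and the second is about 0.10, which is why LTR-served searches still come out ahead of the old grouping.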
How to Ship LTR in Production in 3 Steps
Make it Work
Make it Fast
Deploy to Production
Where is the model?
Let the LTR hackathon start…
o Write code to process training data in SVMlight format.
o Wrappers around scikit-learn to
train various linear models, do
regularization, hyperparameter
optimization and model debugging
o Evaluate the model: MAP, NDCG,
MRR
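For reference, MAP and MRR can be computed from per-query lists of binary relevance labels in ranked order. In this sketch average precision is normalized by the number of relevant documents found in the list, which assumes every relevant document was retrieved; the sample data is hypothetical.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance):
    """Mean of precision@rank taken at each relevant result."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# One list per query: 1 = relevant, 0 = not relevant, in ranked order.
queries = [[1, 0, 1], [0, 1]]
mrr = mean(reciprocal_rank(q) for q in queries)
map_score = mean(average_precision(q) for q in queries)
```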
A Small Model for LTR, a Giant Step for
Bloomberg
o Released initially to select
internal users
o Then, to all internal users
o Then, to 10% of our clients
o And finally, to 100% of our
clients
It’s a whole new (LTR) world!
o Trying out new features, new models
and classes of models
o Experimenting with different types of
training data
o Rolling out new rankers, for different
types of queries
Take home messages
o Make sure you can measure success
and failure – metrics, metrics, metrics!
o If a feature is static, index it
o Don’t use stored values for static
features, always rely on DocValues
o If you are not happy with the performance of search, consider a trip to Las Vegas: you may end up improving performance by 40%
Thank you! Eυχαριστούμε! Grazie!
And btw: we are hiring a senior search relevance engineer!
bit.ly/2Orb8bc
https://www.bloomberg.com/careers

Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 

Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Ceccarelli, Bloomberg

  • 1. © 2018 Bloomberg Finance L.P. All rights reserved. Learning to Rank: From Theory to Production Malvina Josephidou & Diego Ceccarelli Bloomberg @malvijosephidou | @diegoceccarelli #Activate18 #ActivateSearch
  • 2. About Us o Software Engineers at Bloomberg o Working on relevance of the News search engine o Before joining, PhDs in ML and IR
  • 3. Bloomberg – Who are we? o A technology company with 5,000+ software engineers o Financial data, analytics, communication and trading tools o More than 325K subscribers in 170 countries
  • 4. News search in numbers
  • 5. How do we retrieve relevant results? o Use relevance functions to assign scores to each matching document o Sort documents by relevance score (example relevance scores shown: 50.1, 35.5, 10.2)
  • 6. How do we design relevance functions? score = tf + 5.2 x tf(title) + 4.5 x tf(desc)
  • 7. Good Luck With That… query = Solr query = Italy query = Facebook query = Trump score = tf(body) + 5.2 x tf(title) + 4.5 x tf(desc) + ??? x doc-length + ??? x freshness + ??? x popularity + ??? x author + ..... ????? ?
  • 8. How do we come up with ranking functions? o We don’t. Hand-tuning ranking functions is insane o ML to the rescue: Use data to train algorithms to distinguish relevant documents from irrelevant documents
  • 9. 2018 achievement Learning-to-Rank fully deployed in production
  • 10. Learning-to-Rank (aka LTR) Use machine learning algorithms to rank results in a way that optimizes search relevance
  • 11. How does this work? Index Top-k retrieval User Query People Commodities News Other Sources ReRanking Model Top-k reranked Top-x retrieval x >> k
  • 12. How to Ship LTR in Production in 3 Steps Make it Work Make it Fast Deploy to Production
  • 13. LTR steps I. Collect query-document judgments [Offline] II. Extract query-document features [Solr] III. Train model with judgments + features [Offline] IV. Deploy model [Solr] V. Apply model [Solr] VI. Evaluate results [Offline]
  • 14. I. Collect judgements Judgement (good/bad) Judgement (5 stars) 3/5 5/5 0/5
  • 15. I. Collect judgements Explicit – judges assess search results manually o Experts o Crowdsourced Implicit – infer assessments through user behavior o Aggregated result clicks o Query reformulation o Dwell time
  • 16. II. Extract Features Signals that give an indication of a result’s importance
        Query matches the title | Freshness | Is it from bloomberg.com? | Popularity
        0                       | 0.7       | 0                         | 3583
        1                       | 0.9       | 1                         | 625
        0                       | 0.1       | 0                         | 129
  • 17. II. Extract Features o Define features to extract in myFeatures.json o Deploy features definition file to Solr
        curl -XPUT 'http://localhost:8983/solr/myCollection/schema/feature-store' --data-binary "@/path/myFeatures.json" -H 'Content-type:application/json'
        [
          { "name": "matchTitle", "type": "org.apache.solr.ltr.feature.SolrFeature", "params": { "q": "{!field f=title}${text}" } },
          { "name": "freshness", "type": "org.apache.solr.ltr.feature.SolrFeature", "params": { "q": "{!func}recip(ms(NOW,timestamp),3.16e-11,1,1)" } },
          { "name": "isFromBloomberg", … },
          { "name": "popularity", … }
        ]
  • 18. II. Extract Features o Add features transformer to Solr config
        <!-- Document transformer adding feature vectors with each retrieved document -->
        <transformer name="features" class="org.apache.solr…LTRFeatureLoggerTransformerFactory" />
      o Request features for document by adding [features] to the fl parameter
        http://localhost:8983/solr/myCollection/query?q=test&fl=title,url,[features]
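The freshness feature on slide 17 uses Solr's `recip(x,m,a,b)` function, which evaluates a / (m*x + b). The following is a minimal Python sketch of that arithmetic, not Solr code. It assumes `timestamp` holds epoch milliseconds, and the helper name `freshness` is hypothetical.

```python
# Mirrors the arithmetic of Solr's recip(x, m, a, b) = a / (m * x + b)
# as used by the slide's freshness feature. Illustrative only.
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # roughly 3.16e10 milliseconds


def recip(x, m, a, b):
    """Reciprocal function with the same argument order as Solr's recip()."""
    return a / (m * x + b)


def freshness(age_ms):
    """Hypothetical helper: freshness of a document that is age_ms old."""
    return recip(age_ms, 3.16e-11, 1, 1)


if __name__ == "__main__":
    for label, age in [("new", 0), ("1 year", MS_PER_YEAR), ("10 years", 10 * MS_PER_YEAR)]:
        print(f"{label}: {freshness(age):.3f}")
```

Because 3.16e-11 is close to one divided by the number of milliseconds in a year, a brand-new document scores 1.0, a one-year-old document scores about 0.5, and the score keeps decaying with age.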
  • 19. III. Train Model o Combine query-document judgments & features into training data file o Train ranking model offline • RankSVM [1] [liblinear] • LambdaMART [2] [ranklib] Example: Linear model score = 1.2 x matchTitle + 10 x popularity + 5.4 x isFromBloomberg - 2 x freshness
        [1] T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
        [2] C.J.C. Burges, "From RankNet to LambdaRank to LambdaMART: An Overview", Microsoft Research Technical Report MSR-TR-2010-82, 2010.
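To make the example linear model concrete, here is a small Python sketch that applies the slide's weights to the three feature rows shown on the "II. Extract Features" slide. The dictionary layout and function names are assumptions made for illustration.

```python
# Weights copied from the slide's example linear model.
WEIGHTS = {
    "matchTitle": 1.2,
    "popularity": 10.0,
    "isFromBloomberg": 5.4,
    "freshness": -2.0,
}

# Feature rows from the "II. Extract Features" slide, one dict per document.
DOCS = [
    {"matchTitle": 0, "freshness": 0.7, "isFromBloomberg": 0, "popularity": 3583},
    {"matchTitle": 1, "freshness": 0.9, "isFromBloomberg": 1, "popularity": 625},
    {"matchTitle": 0, "freshness": 0.1, "isFromBloomberg": 0, "popularity": 129},
]


def linear_score(features, weights=WEIGHTS):
    """Weighted sum of the feature values."""
    return sum(weight * features[name] for name, weight in weights.items())


def rank(docs):
    """Return the documents sorted by descending model score."""
    return sorted(docs, key=linear_score, reverse=True)


if __name__ == "__main__":
    for doc in rank(DOCS):
        print(round(linear_score(doc), 1), doc)
```

With these illustrative weights the raw popularity count dominates every other signal, which suggests that features would usually be scaled before training.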
  • 20. IV. Deploy Model o Generate trained output model in myModelName.json o Deploy model definition file to Solr
        curl -XPUT 'http://localhost:8983/solr/techproducts/schema/model-store' --data-binary "@/path/myModelName.json" -H 'Content-type:application/json'
  • 21. V. Re-rank Results o Add LTR query parser to Solr config
        <!-- Query parser used to re-rank top docs with a provided model -->
        <queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>
      o Search and re-rank results
        http://localhost:8983/solr/myCollection/query?q=AAPL&rq={!ltr model="myModelName" reRankDocs=100}
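The re-rank request contains braces, spaces and `!`, so it is easier to build programmatically than by hand. This Python sketch only constructs the URL; the function name `ltr_query_url` and its arguments are hypothetical, while the `q`, `rq` and `fl` parameters are the ones shown on the slides.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def ltr_query_url(base_url, query, model, rerank_docs=100, fields=None):
    """Build a Solr query URL that re-ranks the top documents with an LTR model."""
    params = {
        "q": query,
        # Local-params syntax for the ltr query parser, as on the slide.
        "rq": f"{{!ltr model={model} reRankDocs={rerank_docs}}}",
    }
    if fields:
        params["fl"] = ",".join(fields)
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    url = ltr_query_url(
        "http://localhost:8983/solr/myCollection/query",
        "AAPL",
        "myModelName",
        fields=["title", "url", "[features]"],
    )
    print(url)
    print(parse_qs(urlsplit(url).query))
```

Percent-encoding the `rq` value avoids problems with clients or proxies that reject raw braces in URLs.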
  • 22. VI. Evaluate quality of search Precision: the number of relevant results returned divided by the total number of results returned Recall: the number of relevant results returned divided by the total number of relevant results for the query NDCG (normalized discounted cumulative gain) Image credit: https://commons.wikimedia.org/wiki/File:Precisionrecall.svg by User:Walber
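The three metrics named on this slide can be sketched in a few lines of Python. This version assumes the common NDCG variant with linear gain and a log2 rank discount; other formulations (for example an exponential gain) exist, and the slides do not say which one was used.

```python
import math


def precision(returned, relevant):
    """Fraction of returned documents that are relevant."""
    if not returned:
        return 0.0
    return sum(1 for doc in returned if doc in relevant) / len(returned)


def recall(returned, relevant):
    """Fraction of relevant documents that were returned."""
    if not relevant:
        return 0.0
    return sum(1 for doc in returned if doc in relevant) / len(relevant)


def dcg(gains):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    returned = ["d1", "d2", "d3", "d4"]
    relevant = {"d1", "d3", "d9"}
    print(precision(returned, relevant), recall(returned, relevant))
    # Star ratings from the judgments slide, in ranked order: 3/5, 5/5, 0/5.
    print(round(ndcg([3, 5, 0]), 4))
```

For the ratings 3, 5, 0 the ranking is penalized because the best document sits in second place; swapping the first two results would bring NDCG to 1.0.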
  • 23. How to Ship LTR in Production in 3 Steps Make it Work Make it Fast Deploy to Production
  • 24. Are We Good to Go? o Mid 2017: the infrastructure bit seemed to be done. o We had a ‘placeholder’ feature store. But, we needed useful features computed in production. o We deployed a new shiny feature store…
  • 25. There’s no model without features… Search latency when we rolled out a new set of features
  • 26. Why was it slow? Which features were to blame?
  • 27. Metrics on feature latency o We instrumented support in Apache Solr to log the time it took to compute each feature on each document. o We added analytics on top of that.
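The idea of timing each feature on each document can be illustrated outside Solr. The sketch below is plain Python, not the authors' Solr instrumentation: every name in it is hypothetical, and the "sloth" feature simply sleeps to simulate an expensive computation.

```python
import time
from collections import defaultdict


def timed_feature_vector(features, query, doc, timings):
    """Compute every feature for one document and record how long each took."""
    vector = {}
    for name, compute in features.items():
        start = time.perf_counter()
        vector[name] = compute(query, doc)
        timings[name].append(time.perf_counter() - start)
    return vector


def slowest_features(timings, top=3):
    """Return (feature name, total seconds) pairs, slowest first."""
    totals = {name: sum(samples) for name, samples in timings.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


def sloth(query, doc):
    """Simulated expensive feature."""
    time.sleep(0.002)
    return 0.0


if __name__ == "__main__":
    features = {
        "matchTitle": lambda query, doc: float(query in doc["title"]),
        "sloth": sloth,
    }
    timings = defaultdict(list)
    for doc in [{"title": "solr news"}, {"title": "markets"}]:
        timed_feature_vector(features, "solr", doc, timings)
    print(slowest_features(timings))
```

Aggregating per-feature totals like this is what makes a single slow feature stand out from the rest of the feature set.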
  • 28. Why is it slow? o 1.9ms is ok for 10 documents, but not for 100 docs… [Figure: total feature time per doc. Old set of features: total time per search 19ms. New set of features (with the SlothFeature): total time per search 145ms. New set of features, including the FastSlothFeature: total time per search 15ms.]
  • 29. How do we go faster? o Some of the features are unrelated to the query. For example: is the source reliable? o We can precompute them and store them in the index.
  • 30. Index Static Features • Add feature_is_wire_BLAH to the Solr schema
        <field name="feature_is_wire_BLAH" type="tdouble" indexed="false" stored="true" docValues="false" required="false"/>
      • Use UpdateRequestProcessors in Solr config to create new fields
        <updateRequestProcessorChain>
          <processor class="com.internal.solr.update.processor.IsFoundUpdateProcessorFactory">
            <str name="source">wire</str>
            <str name="dest">feature_is_wire_BLAH</str>
            <str name="map"> { "BLAH" : 1.0 } </str>
            <double name="default">0.0</double>
          </processor>
        </updateRequestProcessorChain>
  • 31. Index Static Features • Features at runtime are produced by reading the value from the index using the FieldValueFeature. Would this ease all our performance troubles?
        "features":[ { "type":"ltr.feature.impl.FieldValueFeature", "params": { "field": "feature_is_wire_BLAH" }, "name": "is_wire_BLAH", "default": 0.0 },
  • 32. Better?
  • 33. Why is this happening to us?
  • 34. Reading from the index can still be slow… o Retrieving field values from their stored values is slow o So we changed the implementation of FieldValueFeature to use DocValues if they are available o DocValues record field values in a column-oriented way, mapping doc ids to values
  • 35. From Stored Fields to Doc Values Feature latencies when retrieving 100 docs: o Using StoredFields: total time per search 135ms o Using DocValues: total time per search 25ms (5x faster!)
  • 36. Are we there yet? Feature logging/computation Document re-ranking
  • 37. Evaluating performance: NO-OP Model o A linear model, performing all the computations but not modifying the original order
        Features:       Original Solr Score | Query matches the title | Freshness | Is the document from Bloomberg.com? | Popularity
                        2.3                 | 0                       | 0.7       | 0                                   | 3583
                        2.1                 | 1                       | 0.9       | 1                                   | 625
                        1.3                 | 0                       | 0.1       | 0                                   | 129
        No-op weights:  1.0                 | 0.0                     | 0.0       | 0.0                                 | 0.0
        Features x No-op weights = Final Score: 2.3, 2.1, 1.3
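The no-op model is just a dot product whose only non-zero weight is on the original Solr score. A short Python sketch using the slide's numbers (the variable and function names are illustrative):

```python
# Column order: original Solr score, matchTitle, freshness, isFromBloomberg, popularity.
NO_OP_WEIGHTS = [1.0, 0.0, 0.0, 0.0, 0.0]

ROWS = [
    [2.3, 0, 0.7, 0, 3583],
    [2.1, 1, 0.9, 1, 625],
    [1.3, 0, 0.1, 0, 129],
]


def score(weights, features):
    """Dot product of the model weights and one document's feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def rerank(rows, weights):
    """Return row indices sorted by descending model score."""
    return sorted(range(len(rows)), key=lambda i: score(weights, rows[i]), reverse=True)


if __name__ == "__main__":
    print([score(NO_OP_WEIGHTS, row) for row in ROWS])
    print(rerank(ROWS, NO_OP_WEIGHTS))
```

Every feature is still computed and multiplied, so the model pays the full LTR cost, yet the final scores equal the original Solr scores and the order is unchanged. That isolates the latency overhead of re-ranking from any change in relevance.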
  • 38. Need for speed Median search time: 39 ms with no LTR (retrieve 3 docs); 77 ms with LTR (retrieve 3 docs, re-rank 25). Why is it so slow???
  • 39. News needs grouping o Similar news stories are grouped (clustered) together and only the top-ranked story in each group is shown
  • 40. Grouping + Re-ranking: Not a match made in heaven o Regular grouping involves 3 communication rounds between coordinator and shards o With re-ranking, we have to re-rank the groups and the documents in each group [ SOLR-8776 Support RankQuery in grouping ]
  • 41.
  • 42. Vegas baby
  • 43. What is the Las Vegas Patch?
  • 44. Grouping Three requests from the coordinator: 1. Coordinator asks the shards for the top n groups for the query and computes the overall top n groups. 2. Each shard computes the top m documents for the top n groups. 3. Coordinator selects the top docs for each group and retrieves them from the shards.
  • 45. Why? Example: o Top 2 groups, top 2 documents, 2 shards (Machine 1 and Machine 2)
        Machine holding doc1 to doc5:
          Doc  | Group  | Score
          doc1 | group1 | 20.0
          doc2 | group2 | 5.0
          doc3 | group1 | 100.0
          doc4 | group1 | 120.0
          doc5 | group2 | 10.0
          Local top groups: group1 (doc4 120.0, doc1 20.0), group2 (doc5 10.0, doc2 5.0)
        Machine holding doc6 to doc10:
          Doc   | Group  | Score
          doc6  | group3 | 70.0
          doc7  | group2 | 65.0
          doc8  | group3 | 60.0
          doc9  | group2 | 60.0
          doc10 | group1 | 50.0
          Local top groups: group3 (doc6 70.0, doc8 60.0), group2 (doc7 65.0, doc9 60.0)
        Top Groups: Group1: doc4 doc1 Group3: doc6 doc8
        Top docs should be: Group1: doc4 doc10 Group3: doc6 doc8
  • 46. Las Vegas Idea o If you want just one document per group, you do not have this problem o We can return the top document from each group in the first step and skip the second step entirely o For LTR: Re-rank only the top document of each group
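The Las Vegas idea can be simulated with two in-memory shards. This Python sketch is an illustration of the single-round merge, not the Solr patch itself; the function names are hypothetical. The shard contents follow the "Why?" slide, with doc3 left out as an assumption because its row does not agree with the slide's grouped tables.

```python
# (doc id, group, score) triples per shard, based on the "Why?" slide.
SHARD_A = [
    ("doc6", "group3", 70.0), ("doc7", "group2", 65.0), ("doc8", "group3", 60.0),
    ("doc9", "group2", 60.0), ("doc10", "group1", 50.0),
]
SHARD_B = [
    ("doc1", "group1", 20.0), ("doc2", "group2", 5.0),
    ("doc4", "group1", 120.0), ("doc5", "group2", 10.0),
]


def shard_top_groups(docs, n):
    """Single shard request: best document of each of the shard's top-n groups."""
    best = {}
    for doc_id, group, score in docs:
        if group not in best or score > best[group][1]:
            best[group] = (doc_id, score)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return dict(ranked[:n])


def merge_top_groups(shard_responses, n):
    """Coordinator: keep the best document per group, then the top-n groups."""
    merged = {}
    for response in shard_responses:
        for group, (doc_id, score) in response.items():
            if group not in merged or score > merged[group][1]:
                merged[group] = (doc_id, score)
    ranked = sorted(merged.items(), key=lambda item: item[1][1], reverse=True)
    return [(group, doc_id, score) for group, (doc_id, score) in ranked[:n]]


if __name__ == "__main__":
    responses = [shard_top_groups(shard, 2) for shard in (SHARD_A, SHARD_B)]
    print(merge_top_groups(responses, 2))
```

Because each group is ranked by its best document, the shard that holds a winning group's best document always reports that group in its own top n, so one round is enough when only one document per group is needed. An LTR model would then be applied to the merged top documents only.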
  • 47. Show me the numbers o We made plain old search faster: by about 40%! o LTR-served searches are still faster than they were before we did the Las Vegas optimization
        Method                   | Median time | Perc95 time
        Normal Grouping (No LTR) | 0.20        | 0.35
        Las Vegas (No LTR)       | 0.12        | 0.26
        Las Vegas+LTR no-op      | 0.18        | 0.27
  • 48. How to Ship LTR in Production in 3 Steps Make it Work Make it Fast Deploy to Production
  • 49. Where is the model? Let the LTR hackathon start… o Write code to process training data in SVMlight format. o Wrappers around scikit-learn to train various linear models, do regularization, hyperparameter optimization and model debugging o Evaluate the model: MAP, NDCG, MRR
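Ranking training data is commonly stored one judged document per line as `<label> qid:<query id> <feature index>:<value> ...`. The following stdlib-only Python sketch parses such lines and groups them by query. The sample rows reuse the ratings and feature values from the earlier slides, and the mapping of feature indices 1 to 4 onto matchTitle, freshness, isFromBloomberg and popularity is an assumption.

```python
from collections import defaultdict

# Hypothetical training rows: label, query id, then index:value feature pairs.
SAMPLE = [
    "3 qid:q1 1:0 2:0.7 3:0 4:3583  # doc A",
    "5 qid:q1 1:1 2:0.9 3:1 4:625   # doc B",
    "0 qid:q1 1:0 2:0.1 3:0 4:129   # doc C",
]


def parse_line(line):
    """Parse one line into (label, query id, {feature index: value})."""
    tokens = line.split("#", 1)[0].split()  # drop the trailing comment
    label = float(tokens[0])
    qid = None
    features = {}
    for token in tokens[1:]:
        key, value = token.split(":", 1)
        if key == "qid":
            qid = value
        else:
            features[int(key)] = float(value)
    return label, qid, features


def group_by_query(lines):
    """Group parsed (label, features) pairs by query id."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip() and not line.lstrip().startswith("#"):
            label, qid, features = parse_line(line)
            groups[qid].append((label, features))
    return dict(groups)


if __name__ == "__main__":
    for qid, rows in group_by_query(SAMPLE).items():
        print(qid, rows)
```

Grouping by query matters because ranking metrics such as NDCG and MRR are computed per query and then averaged. scikit-learn also ships a loader for this format, `sklearn.datasets.load_svmlight_file`, which accepts `query_id=True`.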
  • 50. A Small Model for LTR, a Giant Step for Bloomberg o Released initially to select internal users o Then, to all internal users o Then, to 10% of our clients o And finally, to 100% of our clients
  • 51. It’s a whole new (LTR) world! o Trying out new features, new models and classes of models o Experimenting with different types of training data o Rolling out new rankers, for different types of queries
  • 52. Take home messages o Make sure you can measure success and failure – metrics, metrics, metrics! o If a feature is static, index it o Don’t use stored values for static features, always rely on DocValues o If you are not happy with the performance of search, consider a trip to Las Vegas: you may end up improving performance by 40%
  • 53. Thank you! Eυχαριστούμε! Grazie! And btw: we are hiring a senior search relevance engineer! bit.ly/2Orb8bc https://www.bloomberg.com/careers Malvina Josephidou & Diego Ceccarelli Bloomberg @malvijosephidou |@diegoceccarelli #Activate18 #ActivateSearch

Editor's Notes

  1. Malvina starts here
  2. Diego 16 slides
  3. Malvina starts here
  4. https://pxhere.com/en/photo/1143377 https://creativecommons.org/publicdomain/zero/1.0/ You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
  5. https://upload.wikimedia.org/wikipedia/commons/a/a7/Cute_Sloth.jpg the sloth is under CC commons, can be reused https://commons.wikimedia.org/wiki/File:Cute_Sloth.jpg You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission
  6. Every update request in Solr is passed through a chain of processors defined in the Solr config under the updateRequestProcessorChain. UpdateRequestProcessors allow us to create new fields from existing ones or change existing fields at index time. We use UpdateRequestProcessors to compute features at index time and write those values to new feature fields, which are kept in the index. In this snippet we add a custom UpdateRequestProcessor to the updateRequestProcessorChain. It looks at the wire field: if it finds the string, it creates the new feature field with value 1; otherwise it creates it with the default value of 0. We can define other such UpdateRequestProcessors with our feature computation logic. For example, a feature might involve counting terms in a multivalued field, and we can easily do that too.
  7. https://www.pexels.com/photo/baby-child-close-up-crying-47090/ CC0 License ✓ Free for personal and commercial use ✓ No attribution required
  8. Picture from https://pxhere.com/en/photo/1143377 Licence is CC0 Public Domain Free for personal and commercial use No attribution required
  9. /* Change FieldValueFeature to use docvalues to fetch a feature value when the field has docValues Docvalues are faster for this particular case because they build a forward index from documents to values. They
  10. Public domain license: https://www.goodfreephotos.com/vector-images/you-shall-not-pass-sign-with-gandalf-vector-clipart.png.php  the Work may be freely reproduced, distributed, transmitted, used, modified, built upon, or otherwise exploited by anyone for any purpose, commercial or non-commercial, and in any way, including by methods that have not yet been invented or conceived.
  11. https://www.pexels.com/photo/attraction-building-city-hotel-415999/ CC0 license You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission
  12. In cases where we do grouping and ask for group.limit=1 only, it is possible to skip the second grouping step. In our test datasets it improved speed by around 40%. Essentially, in the first grouping step each shard returns the top K groups based on the highest scoring document in each group. The top K groups from each shard are merged in the federator, and in the second step we ask all the shards to return the top documents from each of the top ranking groups. If we only want to return the highest scoring document per group, we can return the top document id in the first step, merge results in the federator to retain the top K groups, and then skip the second grouping step entirely. This is possible provided that: a) we do not need to know the total number of matching documents per group; b) the within-group sort and the between-group sort are the same. The LTR optimization is then to compute the LTR score on the top-ranking document per group only (rather than on all members of a group).
  13. Instead of applying the model to each document in each group and each shard, apply LTR on the top document per group, where ‘top’ is determined by the Solr score. So documents in groups are ranked using the Solr score, and groups are ranked using LTR.
  14. Malvina starts here
  15. We were ready to roll out the model. https://www.pexels.com/photo/view-ape-thinking-primate-33535/ CC0 license What we realized at this point was that there was no model. There was something called ltr1 we had developed using features computed offline but we realized that model was broken. We had product on our backs, our code worked and was fast enough, but we didn’t actually have anything to roll out and we also hadn't written a single line of code related to training models using features logged in production.
  16. https://www.pexels.com/photo/drink-beer-cheers-toast-8859/ CC0
  17. You can’t optimize what you can’t measure