Interleaving: from Evaluation to Self-learning Search @904Labs
Presented at Open Source Connections' Haystack relevance conference: 904Labs' "Interleaving: from Evaluation to Self-Learning". 904Labs is the first company to commercialize online learning to rank, a state-of-the-art technique for self-learning search ranking that automatically takes customers' behavior into account to personalize search results.
In search, one of the core problems is determining the relevance of items for a given search query. Research in this field started in the 1950s and led to the introduction of the popular TF.IDF formula in the 70s. TF.IDF is simple word counting, yet it has been the most popular ranking function for many decades. BM25, a parameterized version of it grounded in the probabilistic retrieval models of the same era, was formalized in the 1990s. It is important to note that state-of-the-art open source search technology like Lucene used TF.IDF until 2015, and only then replaced it with BM25, roughly 40 years after TF.IDF's introduction! No wonder the out-of-the-box ranking performance of systems like Apache Solr and Elasticsearch can be improved; a new paradigm is needed.
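To make the word counting concrete, here is a minimal sketch of the classic BM25 scoring formula in Python. The toy in-memory index, the parameter defaults (k1=1.2, b=0.75), and the Lucene-style smoothed IDF are illustrative assumptions, not any particular engine's implementation.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len,
               k1=1.2, b=0.75):
    """Score one document for a query with the classic BM25 formula."""
    score = 0.0
    dl = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)   # raw term frequency in this document
        df = doc_freq.get(term, 0)   # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        # smoothed IDF: rare terms contribute more to the score
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        # saturating TF component, normalized by document length
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score
```

Note how the length normalization (the `b` term) penalizes long documents, and how `k1` caps the benefit of repeating a query term; these two knobs are exactly what "parameterized" means here.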
But things are changing!
As early as 1992, researchers started thinking about learning the parameters of ranking functions from data. From the 2000s onwards, research in this direction really took off, and between 2000 and 2010 most of the currently used learning to rank approaches were introduced. The idea behind learning to rank is rather simple: estimating relevance is too complex a task to solve with a naive model like BM25. Many more features play a role than word counts alone. To capture all these relevance features and their relative weights, we apply machine learning to learn them from usage data and/or relevance assessments.
The paradigm shift from trying to model relevance to learning it from actual data led to the introduction of learning to rank plugins for Apache Solr and Elasticsearch, making these methods available to a large audience. The popularity of these plugins is apparent from the number of talks here at Haystack about learning to rank.
A quick reminder on what learning to rank is all about, although everyone should know by now… We give a machine training data (queries and clicked documents, for example), features that describe the query and the documents, and an objective function. The machine learns a ranking model, which is put into production to generate rankings.
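As a toy illustration of that recipe (not any production algorithm), the sketch below fits a pointwise linear relevance score to hypothetical (feature vector, clicked) examples with plain gradient descent on squared error; the features and labels are made up for the example.

```python
def train_pointwise(examples, epochs=50, lr=0.1):
    """Batch learning to rank, pointwise flavor: fit a linear relevance
    score to (feature_vector, clicked) training examples using gradient
    descent on squared error."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for features, clicked in examples:
            pred = sum(wi * fi for wi, fi in zip(w, features))
            err = clicked - pred
            w = [wi + lr * err * fi for wi, fi in zip(w, features)]
    return w

def rank(w, candidates):
    """Order candidate documents by the learned linear score."""
    return sorted(candidates, key=lambda f: -sum(wi * fi for wi, fi in zip(w, f)))
```

The features here could be anything: a BM25 score, freshness, popularity, price. The point is that their relative weights are learned from data rather than hand-tuned.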
Learning to rank is a batch process. The pipeline of collecting training data, extracting features, and training a model is repeated every couple of hours, days, or weeks, depending on the organization. This batch processing often requires quite an impressive data processing pipeline. The question is whether we really need this batch process...
We would not ask the question if the answer were not "no". So, what's the next step? Rather than periodically retraining and producing a completely new model, we can apply reinforcement learning to update the existing model in real time based on feedback. There are numerous advantages. The most important one is that we move away from periodic retraining: instead of feeding training data through large processing pipelines every couple of hours to train a new model, we take real-time feedback on the quality of the current model, interpret it, and update the model slightly. Secondly, since the system learns continuously, it is easy to introduce new features on the fly: assign a new feature some initial weight, and the system quickly adjusts that weight according to the feature's importance. Thirdly, the system can adapt to changes in user behavior quickly, instead of having to wait for the next iteration of batch training.
What does online learning look like at a high level? It starts out much like regular learning to rank. A model is learned from training data, features, and an objective function. The initial model is used to generate a first ranking, which is shown to the user issuing the query. This user then interacts with the ranking, providing (implicit) feedback on its quality. This feedback is immediately taken into account by the system, updating the model slightly to match the feedback. The updated model becomes the active model, which is used to generate the next ranking, and so on. This probably all sounds very nice, but how can we build such an online learning to rank, or self-learning, search system? One of the methods that can power such a system is interleaving, though theory and practice differ in the details.
Interleaving started out as an evaluation method for comparing ranking algorithms. Currently, however, it is also used to power self-learning search engines. I’ll explain how interleaving is used for evaluation and how this can be translated into an online learning to rank setting. A practical offline example uses multiple Google search sessions for the same search terms: Screen 1 illustrates search engine A in red and B in blue.
Let’s say we’re running a web search engine and we want to find out whether a new ranking algorithm works better than our current one. We can run an interleaving experiment to find out. Our current algorithm is search engine A; the new version is search engine B.
Every query that is issued on our site is fired to both versions of the search engine. In this example, the query “online learning to rank” is issued to both A and B, and both engines return a list of results.
The next step is to actually interleave the results from A and B into one final result list. For the interleaving of results we can use a variety of methods, but to simplify things, we simply assume that we pick the first result from A, followed by the first result from B, the second result from A, etc. The final interleaved result list is the ranking that is shown to the user who issued the query. To this user there is no difference between results from A or B, they all look identical.
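The alternating scheme described above can be sketched as follows. Real systems typically use randomized variants such as Team Draft or Balanced Interleave, so treat this strict A-then-B alternation with de-duplication as a simplification.

```python
def interleave(results_a, results_b, k=10):
    """Alternate results from engines A and B into one list, skipping
    duplicates, and record which engine contributed each position so
    clicks can be credited later."""
    merged, credit, seen = [], [], set()
    sources = [iter(results_a), iter(results_b)]
    turn = 0
    while len(merged) < k:
        progressed = False
        for offset in range(2):           # preferred team first, then fallback
            team = (turn + offset) % 2
            for doc in sources[team]:     # skip docs the other engine already placed
                if doc not in seen:
                    seen.add(doc)
                    merged.append(doc)
                    credit.append("A" if team == 0 else "B")
                    progressed = True
                    break
            if progressed:
                break
        if not progressed:
            break                         # both engines exhausted
        turn += 1
    return merged, credit
```

The returned `credit` list is the backend bookkeeping mentioned above: the user sees only `merged`, while the system remembers which engine produced each position.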
The user interacts with the final ranking, and clicks a result. In the backend we know that this result came from A, and thus A is the winner for this mini-competition between the two engines.
So, to summarize, if we want to compare two search algorithms, A and B, we turn it into a competition. Both engines generate results for the same query, these results are interleaved into one final result list. This list is shown to the user, who clicks the results she wants. These clicks are mapped to the engine that produced the particular result, and in the end the search engine with most clicks is the winner.
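Given the per-position credit from the interleaving step, determining the winner of the mini-competition is a simple count. This helper and its "tie" outcome are an illustrative assumption rather than a fixed rule.

```python
def winner(credit, clicked_positions):
    """Credit each click to the engine that produced that result and
    declare the engine with more clicks the winner ('tie' if equal)."""
    clicks = {"A": 0, "B": 0}
    for pos in clicked_positions:
        clicks[credit[pos]] += 1
    if clicks["A"] == clicks["B"]:
        return "tie"
    return "A" if clicks["A"] > clicks["B"] else "B"
```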
Why would we want to use interleaving to compare two algorithms, and not use the more common A/B test? There’s nothing wrong with doing an A/B test, but an interleaving experiment is faster to run and is low-risk.
Interleaving is faster than an A/B test because each user evaluates both engines at the same time, unlike with an A/B test, in which a user is assigned to only one version. At the same time, because each user is shown at least some results from the current search engine (A), the experiment has a low risk, especially compared to an A/B test in which a certain percentage of users get to see only results from the new engine (B), which might lead to a bad user experience.
Now that we know how we can use interleaving to evaluate search algorithms, we can turn it into a method for self-learning search engines.
The trick to get an online learning to rank system using interleaving is to continuously run a competition between the current best model and a slight adaptation of that model, and to immediately update the current model when it is beaten by the adapted version.
In other words, we have our current model A, and take a slight adaptation of that model to be version B. When a query comes in, we run an interleaving experiment with these two versions, just like before. When clicks come in for that query, we determine the winner right away (instead of waiting for more queries to come in). In case version B wins, we update the current model A in such a way that we move into the direction of B. This updated version of A then becomes the new current model (A), and we generate a new adaptation from that model to be the new B. When a new query comes in, the process repeats itself.
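One published method in this family is Dueling Bandit Gradient Descent (Yue & Joachims). A minimal sketch of the perturb-and-update idea, assuming a linear ranking model, might look like this; the step sizes `delta` and `alpha` are arbitrary illustrative values, not recommended settings.

```python
import random

def perturb(w, delta=0.1, rng=random):
    """Create candidate model B: the current weights plus a small random
    step in a uniformly drawn direction, scaled by delta."""
    direction = [rng.gauss(0.0, 1.0) for _ in w]
    norm = sum(d * d for d in direction) ** 0.5 or 1.0
    return [wi + delta * d / norm for wi, d in zip(w, direction)], direction

def update(w, direction, b_won, alpha=0.05):
    """If the perturbed model B won the interleaved comparison, move the
    current model A a small step in B's direction; otherwise keep A."""
    if not b_won:
        return w
    norm = sum(d * d for d in direction) ** 0.5 or 1.0
    return [wi + alpha * d / norm for wi, d in zip(w, direction)]
```

Each incoming query triggers one perturb/interleave/update cycle, so the model drifts toward whatever users click on, one small step at a time.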
As always, the devil is in the details, but these are too complex to discuss here. Examples of such details are how exactly to do the updating of the model A. There’s a lot of scientific literature on this available.
So, what does online learning to rank look like in practice? Unfortunately, we do not yet have a demo of what is happening. We do have a couple of screenshots from one of our customers and some general numbers to report. When our demo becomes available, we’ll send it around on social media etc.
On the left are the results from Apache Solr with some manual tuning for the query “case”. Results show mainly tool cases at the top.
After adding (online) learning to rank on top of Apache Solr, results become like the right hand side: suitcases have moved to the top, with a slight preference for those on sale.
Another example comparing Apache Solr with manual tuning to (online) learning to rank, now for the query “kitchen”. Results from Solr show toy kitchens and some kitchen equipment at the top.
After learning from what users really want, we see actual kitchens “naturally” moving to the top. Again, we observe a preference for kitchens that are on sale.
We have compared our online learning to rank system to manually tuned Apache Solr instances for three of our customers using A/B tests. In general, we observe increases in revenue of about 30% when using online learning to rank on top of an Apache Solr index. We observe similar improvements for conversion rates. Note that these numbers could be achieved using “batch” learning to rank as well, but without the aforementioned benefits of online learning.
Orange denotes the customer’s network, purple 904Labs’. (Explain the procedure.) This type of architecture is great for shortcutting the integration process; however, it is sensitive to network latency.
904Labs’ core ranking algorithms are based on scientific publications, which are available to everyone. The software, however, is not open source. 904Labs does encourage the use of open source by its clients. In fact, the implementation assumes that the client is running her own Apache Solr or Elasticsearch index for search, but would like to improve its ranking quality. 904Labs acts as middleware between the web app and the existing search index, and uses this index for its online learning to rank algorithms. There are plenty of advantages for clients, including easy implementation, data remaining at the client, and no vendor lock-in. If a client removes 904Labs’ services, she’s back at her original setup, including her original search index.
This makes 904Labs unique among search-as-a-service providers that build on open source search engines. With most other such providers, when a client wants to move away from the solution, she also loses all data stored in the search index, as the index is part of the provider’s solution. It is an open-source-powered vendor lock-in :-)
There are still a lot of open issues when it comes to online learning to rank. Feature engineering is one, although it also applies to batch learning to rank. Which features are available, and which ones could actually add something? The benefit of an online learning system is that you can easily add new features to analyze their impact. Delayed feedback is a technical issue related to the order in which feedback comes back into the system. What to do when feedback for a particular query comes in after feedback for a more recent query has already led to a model update? Should we ignore the feedback or still take it into account? Feel free to volunteer and work with 904Labs on these open issues.
Efficiency is a potential problem when trying to achieve maximum effectiveness. Learning to rank assumes that there is some initial seed set of documents to rerank. Ideally, this set contains all relevant documents, but we can only select a limited seed set. How to optimize this efficiency-effectiveness trade-off?
Finally, when we have learned a model, we would like to make use of it, or exploit it. But if we would only exploit, we can not learn anything anymore. For that, we need to explore as well. How do we balance between exploiting the current best model and exploring to allow for learning?
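A common way to picture this balance is an epsilon-greedy rule: exploit the current best model most of the time, and explore a candidate with some small probability. The sketch below is a deliberately simple illustration, not 904Labs' actual strategy.

```python
import random

def choose_model(current, candidate, epsilon=0.1, rng=random):
    """Epsilon-greedy balance: mostly exploit the current best model,
    occasionally explore the candidate so learning can continue."""
    return candidate if rng.random() < epsilon else current
```

With `epsilon=0.1`, roughly one query in ten is served by the exploratory candidate; tuning that fraction is exactly the open question raised above.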
To summarize this talk: we observe that a paradigm shift in search is happening, moving from old-fashioned ranking functions to learning from user behavior data. The current state of the art uses learning to rank, but online learning to rank is on its way and offers additional advantages on top of it. As with learning to rank, there are still quite a few open issues for online learning to rank, so keep an eye on the research community to come up with cool new methods!
If you want to know (much) more about this technology, feel free to contact Manos or Wouter directly.
If you want to know (much) more about these details, you could for example check out this tutorial on interleaving.
Two pointers for online learning to rank.
There are three approaches to learning to rank. Pointwise tries to predict the relevance of each single document. Pairwise looks at two documents and tries to determine which one is more relevant. Listwise, finally, tries to optimize the full ranking in one go, using existing IR metrics like nDCG. Going from pointwise to listwise should lead to better effectiveness, but also to a decrease in efficiency; the usual efficiency-effectiveness trade-off.
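To make the pairwise flavor concrete, here is a tiny perceptron-style hinge update for a linear scorer; the margin and learning rate are illustrative assumptions.

```python
def pairwise_update(w, feats_pos, feats_neg, lr=0.1, margin=1.0):
    """Pairwise learning to rank: if the more-relevant document does not
    outscore the less-relevant one by the margin, nudge the weights to
    widen the gap (perceptron-style hinge update)."""
    score = lambda f: sum(wi * fi for wi, fi in zip(w, f))
    if score(feats_pos) - score(feats_neg) < margin:
        w = [wi + lr * (p - n) for wi, p, n in zip(w, feats_pos, feats_neg)]
    return w
```

Repeated over many preference pairs (e.g., clicked vs. skipped documents), the weights converge to a model that orders each pair correctly, without ever predicting an absolute relevance value, which is exactly what distinguishes pairwise from pointwise methods.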
Interleaving: from Evaluation to Self-learning Search @904Labs
John T. Kane – representing 904Labs in the USA
Solution Architect / Product Manager @ Voyager Search
About myself and 904Labs
I’ve been in the search field for 15+ years, starting with SQL Server Full-text
Search (FTS) in 1998, with roles in Tech Support and Sales Engineering (FAST),
and Product Manager roles at HP, Lucidworks (Fusion 1.0), and recently HPE.
While I currently work for Voyager Search, I’m at Haystack representing 904Labs.
904Labs is a Dutch search company founded by Manos Tsagkias and Wouter
Weerkamp, two former academic researchers in the field of Information
Retrieval. The company offers Online Learning to Rank as-a-Service.
For decades people tried to come up with clever ways to model
“relevance”. In the early 70s, TF-IDF was introduced, relying on counting
word overlap between queries and documents. (main use case: early digital
library / card catalog)
In the 1990s, researchers formalized BM25 (used in early SharePoint
Search 2001), a parameterized version of TF-IDF rooted in 1970s
probabilistic models. It wasn’t until 2015 that Lucene/Solr changed its
default ranking function to BM25.
So, today’s standard search relevance still builds on ranking functions rooted in the 1970s.
How to determine relevance?
Enter machine learning
In recent years, people have started to realize that search,
or modeling relevance, has become too complex to fit into BM25. A
paradigm shift is taking place, moving in the direction of
learning the ranking function from training data.
This paradigm shift is reflected in learning to rank plugins for
Apache Solr and Elasticsearch, and is also apparent from the
many talks at Haystack about learning to rank.
Learning to rank is a batch process. Training data is collected,
features are extracted, and a model is trained using an objective
function. Every couple of hours/days/weeks, this process is repeated
and a new model is trained. This requires heavy data processing
infrastructure, plus software and expert personnel to run it.
So, what’s next?
Reinforcement learning: don’t retrain, but update the existing
model in real time using feedback on the ranking produced by the
current model. Think of this as stage 2 of the paradigm shift.
No need to retrain, no need for batch data processing. This
allows us to easily launch new features, with weights learned
on the fly (online), and lets the system adapt to changing user
behavior almost immediately (in real time).
Online learning to rank uses a pre-trained model to generate an initial
ranking. The user interacts with the ranking, giving (implicit) feedback
on its quality. This feedback is used to update the current model, and
the updated model then becomes the active model. And repeat...
Interleaving for evaluation (recap 1)
Two competing search engines, A and B:
1) Both generate results for the same query
2) Results are then interleaved into one final result list
3) The final result list is shown to the user
4) Clicks on results are mapped to the originating search engine
5) Winner is the search engine that receives most clicks
Interleaving for evaluation (recap 2)
Fast and low-risk evaluation method for algorithmic changes,
especially compared to an A/B test. It is always ongoing, and it is:
...faster, because every user evaluates both search engines at the
same time;
...low-risk, because every user always sees several results from
the current search engine, which has a known quality.
Interleaving for online learning
Interleaving is about identifying the winning search engine in a
competition. We can run a competition with every query to get a
continuous learning cycle (think of two ranking models within one search engine).
Model B is always a slight adaptation of the current model. In
case B wins the competition, the original model (A) is updated into the
direction of B. The updated model becomes search engine A for the
next query, and competes with a new B.
Online learning to rank
in practice (demo offline)
Increased revenue and conversion rates for three eCommerce
search customers using online learning to rank on top of Apache Solr.
Blog posts with improvements in revenue:
Is 904Labs Open Source?
904Labs’ online learning to rank system is SaaS. It is implemented
on top of a client’s (or customer’s) own Apache Solr or
Elasticsearch. The data remains at the client side, and if the client
wants to move away from 904Labs, they can do so without
vendor lock-in!
Many other (SaaS) search solutions provide Solr/Elasticsearch as
core part of their solution. Moving away from these solutions
leaves clients without any search infrastructure.
Feature engineering. Which features are readily available?
Delayed feedback. How to update the model when feedback is
delayed until after another update has already happened?
Efficiency vs. effectiveness. How to balance the number of
queries to Solr and the extent of the candidate document set?
Exploration vs. exploitation. We want to exploit the current best
model, but need to explore to keep learning. What is the best balance?
Some open issues (as time allows)
Take home message for 904Labs
Search has moved from modeling relevance to learning from
user behavior data. The next Paradigm Shift is to learn these
models in real time, allowing immediate adaptation to changes in
user behavior and removing the necessity of large-scale data
(pre)processing for batch learning.
Many open issues remain, so expect lots of cool research on this topic!
(and you all know the LtR plugins for Apache Solr and Elasticsearch)
Pointwise: Try to predict the relevance of one document at a time.
Pairwise: For a pair of documents, predict which is more relevant.
Listwise: Try to optimize the full ranking using existing IR metrics like nDCG.
Approaches to learning to rank