Companies like Amazon, Netflix, and YouTube have popularized "More Like This" recommendations, and almost every online shop or content site now offers something similar.
But is it possible to build a scalable content-based recommendation system with open-source software that is easy to maintain, simple to tune, and straightforward to deploy?
I will show how to use the "More Like This" feature of Apache SOLR. Although built as a search tool, Apache SOLR can also serve as a recommendation system, because both tasks come down to computing relevance.
However, the "More Like This" functionality of SOLR uses only text fields. I will show you how to overcome this limitation and take full advantage of SOLR's capabilities.
I will also explain how an inverted index works, how the TF/IDF scoring formula is computed, and how to measure the performance of a recommendation system, all through a step-by-step example.
6. “Movie Store”
Use Case
When
● A user views the details of a movie
Then
● The application recommends “similar” movies
7. Example
Target Movie
● The Lord of the Rings: The Fellowship of the Ring
Recommendations
1) The Lord of the Rings: The Return of the King
2) The Lord of the Rings: The Two Towers
3) The Lord of the Rings
4) Lord of War
5) The Lord Protector
8. What Does “Similar” Mean?
Target Movie
● “The Lord of the Rings: The
Fellowship of the Ring”
Action / Adventure / Drama
8.8 on IMDB
Recommended (Similar) Movies
● The same words in the title
● The same movie genre
● The same words in the description
● Similar IMDB vote
9. Questions
Questions for our Recommendation System
● Do all the words have the same importance?
● Do all the fields have the same importance?
● How does the engine differentiate between results?
13. Movie Fields -> with Types
● imdb_title_id -> string
● original_title -> “analyzed” text
● description -> “analyzed” text
● genre -> array of strings
● avg_vote -> number
14. String vs “Analyzed” Text Field Types
● Field Type: String
● Example: “Comedy” (field: genre)
Indexed: “Comedy”
● Field Type: “Analyzed” Text
● Example: “The Lord of the Rings: The Fellowship of the Ring” (field: original_title)
Indexed (lowercased and without stopwords):
○ “lord”
○ “rings”
○ “fellowship”
○ “ring”
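The analysis step for an “analyzed” text field can be sketched in a few lines of Python. This is only an illustration: the stop-word list below is an assumption, and in SOLR the actual tokenizer and filters are whatever the schema configures for the field.

```python
import re

# ASSUMPTION: a small illustrative stop-word list; the real list comes from
# the analyzer configured for the field in the SOLR schema.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

def analyze(text):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

title_tokens = analyze("The Lord of the Rings: The Fellowship of the Ring")
print(title_tokens)  # ['lord', 'rings', 'fellowship', 'ring']
```

A string field such as genre skips this step entirely, so “Comedy” is indexed as the single value “Comedy”.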
15. “The Lord of the Rings: The Fellowship of the Ring”
● Movie Id (imdb_title_id): tt0120737
● Original Title
“The Lord of the Rings: The Fellowship of the Ring”
● Description
“A meek Hobbit from the Shire and eight companions set out on a journey to destroy the powerful One Ring and save Middle-earth from the Dark Lord Sauron.”
● Genre
“Action, Adventure, Drama”
● Imdb vote (avg_vote): 8.8
17. “More Like This” Feature in SOLR
More Like This
● Given a movie id => list “similar” movies
● Uses the “Search” functionality
20. “Search” Example 2:
Query
original_title: “Lord” AND original_title: “Rings”
Results (4)
1) "The Lord of the Rings"
2) "The Lord of the Rings: The Fellowship of the Ring"
3) "The Lord of the Rings: The Return of the King"
4) "The Lord of the Rings: The Two Towers"
Execution time: 21 ms
21. How Does the Search original_title: “Lord” AND original_title: “Rings” Work?
● Looks up in the original_title index all the movies that contain both words, “lord” AND “rings” (lowercased!)
● Computes a search score for each match based on Boosting, Term Frequency (TF) and Inverse Document Frequency (IDF)
● Displays the results in descending order of score
22. The TF / IDF Scoring Formula
score[movie] = ∑ (boost(field[j]) * tf(word[i]) * idf(word[i]))
where:
boost(field[j]) = custom weight given to the field j
tf(word[i]) = countTermFreq / (countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength / avgFieldLength))
idf(word[i]) = log(1 + (countDocs - countDocumentFreq + 0.5) / (countDocumentFreq + 0.5))
word[i] = every word in the field, excluding stop words (in our case)
countTermFreq = number of occurrences of the word in the field of this movie
countDocumentFreq = number of movies in the corpus whose field contains the word
countDocs = total number of movies in the corpus
fieldLength = count of words in the field, excluding stop words (in our case)
avgFieldLength = average length of the field over the corpus
23. Debug the Scoring Formula
score[movie] = ∑ (boost(field[j]) * tf(word[i]) * idf(word[i]))
original_title = “The Lord of the Rings”
genre = “Animation, Adventure, Fantasy”
description = “The Fellowship of the Ring embark ...”
score = 1 * tf(“lord”) * idf(“lord”) +
1 * tf(“rings”) * idf(“rings”) +
1 * tf(“Animation”) * idf(“Animation”) + ...
24. Debug the TF / IDF Formula for the
QUERY = original_title:Lord AND original_title:Rings
CTF = countTermFreq (occurrences of the word in the field); CDF = countDocumentFreq (movies in the corpus whose field contains the word)
Original title | CTF Lord | CTF Rings | CDF Lord | CDF Rings | Field Length | Score
The Lord of the Rings | 1 | 1 | 26 | 10 | 2 | 8.29
The Lord of the Rings: The Fellowship of the Ring | 1 | 1 | 26 | 10 | 4 | 6.06
The Lord of the Rings: The Return of the King | 1 | 1 | 26 | 10 | 4 | 6.06
The Lord of the Rings: The Two Towers | 1 | 1 | 26 | 10 | 4 | 6.06
tf(word[i]) = countTermFreq / (countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength / avgFieldLength))
idf(word[i]) = log(1 + (countDocs - countDocumentFreq + 0.5) / (countDocumentFreq + 0.5)), with countDocs = total number of movies in the corpus
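The debug table can be checked with a small Python sketch of the scoring formula. It is an illustration under stated assumptions, not SOLR's actual code path. The constants 1.2 and 0.75 correspond to the default k1 and b of BM25. The IDF is written in the standard BM25 form, with the total number of movies (85855, the corpus size quoted in this deck) and the per-word document counts from the table. The average field length is not given anywhere in the slides, so the value 2.365 is an assumption, chosen because it brings the computed scores close to the reported ones.

```python
import math

K1, B = 1.2, 0.75          # constants that appear in the tf formula
NUM_DOCS = 85855           # corpus size quoted in the slides
AVG_FIELD_LENGTH = 2.365   # ASSUMPTION: not given in the slides

def tf(term_freq, field_length, avg_len=AVG_FIELD_LENGTH):
    """Term-frequency part, normalized by the length of the field."""
    return term_freq / (term_freq + K1 * (1 - B + B * field_length / avg_len))

def idf(doc_freq, num_docs=NUM_DOCS):
    """Inverse document frequency: rarer words weigh more (natural log)."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def score(field_length, terms, boost=1.0):
    """terms: (term_freq, doc_freq) pairs for the matched query words."""
    return sum(boost * tf(f, field_length) * idf(df) for f, df in terms)

query_terms = [(1, 26), (1, 10)]          # "lord", "rings" (CTF, CDF)
short_title = score(2, query_terms)       # "The Lord of the Rings"
long_title = score(4, query_terms)        # the three longer titles
print(round(short_title, 2), round(long_title, 2))
```

The shorter title scores higher only because of the field-length normalization in tf: both titles contain each query word exactly once.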
26. Inverted Index (original_title)
Movies
Id (imdb_title_id) | Title (original_title)
tt0120737 | The Lord of the Rings: The Fellowship of the Ring
tt0167260 | The Lord of the Rings: The Return of the King
tt0167261 | The Lord of the Rings: The Two Towers
tt0077869 | The Lord of the Rings
Inverted index
Word | Ids (imdb_title_id)
lord | tt0120737, tt0167260, tt0167261, tt0077869
rings | tt0120737, tt0167260, tt0167261, tt0077869
ring | tt0120737
fellowship | tt0120737
return | tt0167260
king | tt0167260
towers | tt0167261
two | tt0167261
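The inverted index above can be reproduced with a short Python sketch. It is a toy model of the idea, not how SOLR stores its index: the stop-word list is an assumption, and an AND query is modeled as a set intersection of the id lists.

```python
from collections import defaultdict

STOP_WORDS = {"the", "of"}  # ASSUMPTION: minimal stop-word list

MOVIES = {
    "tt0120737": "The Lord of the Rings: The Fellowship of the Ring",
    "tt0167260": "The Lord of the Rings: The Return of the King",
    "tt0167261": "The Lord of the Rings: The Two Towers",
    "tt0077869": "The Lord of the Rings",
}

def tokenize(title):
    """Lowercase, split on whitespace, and drop stop words."""
    words = title.lower().replace(":", " ").split()
    return [w for w in words if w not in STOP_WORDS]

def build_index(movies):
    """Map every word to the set of movie ids whose title contains it."""
    index = defaultdict(set)
    for movie_id, title in movies.items():
        for word in tokenize(title):
            index[word].add(movie_id)
    return index

def search_all(index, *words):
    """AND query: ids of the movies whose title contains every word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()

index = build_index(MOVIES)
print(sorted(search_all(index, "Lord", "Rings")))  # all four movies
print(sorted(search_all(index, "Lord", "King")))   # ['tt0167260']
```

Looking up a word is a single dictionary access, which is why the search does not need to scan every title.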
28. “More Like This” Example
Query
● q = imdb_title_id:tt0120737 (“The Lord of the Rings: The Fellowship of the Ring”)
● Other parameters:
mlt = true
mlt.fl = original_title, description, genre, avg_vote
mlt.mintf = 1
mlt.count = 5
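As a sketch, the request above could be assembled in Python as follows. The host, port, and core name (`localhost:8983`, `movies`) and the helper name `mlt_url` are assumptions for illustration; only the query parameters come from the slide. The code builds the URL and does not send it.

```python
from urllib.parse import urlencode

# ASSUMPTION: a local SOLR instance with a core named "movies".
SOLR_SELECT_URL = "http://localhost:8983/solr/movies/select"

def mlt_url(movie_id, fields, min_tf=1, count=5):
    """Build a "More Like This" request URL for the given movie id."""
    params = {
        "q": f"imdb_title_id:{movie_id}",
        "mlt": "true",
        "mlt.fl": ",".join(fields),   # fields used to find similar movies
        "mlt.mintf": min_tf,          # minimum term frequency in the target
        "mlt.count": count,           # number of similar movies to return
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"

url = mlt_url("tt0120737", ["original_title", "description", "genre", "avg_vote"])
print(url)
```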
30. Results
Results (“The Lord of the Rings: The Fellowship of the Ring”)
● Execution Time: <100 ms
● Total Results: 62387
31. Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score | Title | Year | Genre | Vote
24.49 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
14.78 | The Ring Thing | 2004 | Adventure / Comedy | 3.5
13.11 | The Dork of the Rings | 2006 | Adventure / Comedy / Fantasy | 3.2
12.65 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
11.23 | The Lord Protector | 1996 | Action / Adventure / Fantasy | 4.2
37. Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score | Title | Year | Genre | Vote
1132 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
894 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
881 | The Lord of the Rings: The Two Towers | 2002 | Action / Adventure / Drama | 8.7
667 | Rings | 2017 | Drama / Horror / Mystery | 4.5
661 | The Dork of the Rings | 2006 | Adventure / Comedy / Fantasy | 3.2
40. Numeric Fields Ignored in MLT
Issue
● Only text fields are used in MLT queries
Solution
● Rewrite the whole query as a search query and also include the numeric fields
42. “More Like This” Steps
1) Extract the “interesting terms” from the target movie
2) Add boostings / field (as given in the query) for every interesting term
3) Perform a Search with those words and boostings
43. “More Like This” Step 1
1) Extract the “interesting terms” from the target movie (from the field list in the query): take all the words from all the fields, compute their relevance, and keep the top 25.
Ex: the word “ring” is very relevant for the movie “The Lord of the Rings: The Fellowship of the Ring”, because it is rare in the corpus:
- 2 occurrences in the movie: once in “original_title” and once in “description”
- in the whole corpus of 85855 movies, it appears:
- 35 times in the field “original_title” and
- 282 times in the field “description”
2) Add boostings / field (as given in the query) for every interesting term
3) Perform a Search with those words and boostings
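Step 1 can be sketched in Python as a ranking of the target movie's words by tf × idf, keeping the best 25. This is a simplified model, not SOLR's exact implementation: the IDF variant used here is an assumption, and the corpus counts are read as the number of movies containing the word. The counts for “ring” come from the example above, and those for “lord” and “rings” from the debug table; words without a known count are skipped.

```python
import math
from collections import Counter

NUM_DOCS = 85855  # corpus size quoted in the slides

def idf(doc_freq, num_docs=NUM_DOCS):
    # ASSUMPTION: a classic TF-IDF style inverse document frequency.
    return 1 + math.log((num_docs + 1) / (doc_freq + 1))

def interesting_terms(field_tokens, doc_freqs, max_terms=25):
    """field_tokens: {field: [tokens of the target movie]}
    doc_freqs: {(field, token): number of movies containing the token}
    Returns up to max_terms (field, token, score) tuples, best first."""
    scored = []
    for field, tokens in field_tokens.items():
        for token, term_freq in Counter(tokens).items():
            doc_freq = doc_freqs.get((field, token))
            if doc_freq:  # skip words whose corpus count is unknown
                scored.append((field, token, term_freq * idf(doc_freq)))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:max_terms]

target = {
    "original_title": ["lord", "rings", "fellowship", "ring"],
    "description": ["ring"],  # truncated to the word discussed above
}
doc_freqs = {
    ("original_title", "lord"): 26,
    ("original_title", "rings"): 10,
    ("original_title", "ring"): 35,
    ("description", "ring"): 282,
}
top = interesting_terms(target, doc_freqs)
for field, token, value in top:
    print(f"{field}:{token} {value:.2f}")
```

The rarer a word is in a field across the corpus, the higher it ranks, which is why “ring” in the title outranks “ring” in the description.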
45. “More Like This” Step 2
1) Extract the “interesting terms” from the target movie (from the field list in
the query)
2) Add boostings / field (as given in the query) for every interesting term:
avg_vote^40 genre^30 original_title^20 description
3) Perform a Search with those words and boostings
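Step 2 can be illustrated with a small Python helper that attaches the field boost to each interesting term. The helper name, the sample terms, and the choice to join clauses with OR are assumptions for illustration. The boosts are the ones listed above, with the unboosted description field treated as boost 1; avg_vote is left out here because it is numeric and MLT only uses text fields.

```python
# Field boosts from the slide; "description" has no explicit boost,
# which is treated here as 1.
FIELD_BOOSTS = {"genre": 30, "original_title": 20, "description": 1}

def boosted_clauses(terms, boosts=FIELD_BOOSTS):
    """terms: iterable of (field, word) pairs -> list of 'field:word^boost'."""
    clauses = []
    for field, word in terms:
        boost = boosts.get(field, 1)
        suffix = f"^{boost}" if boost != 1 else ""
        clauses.append(f"{field}:{word}{suffix}")
    return clauses

# Hypothetical sample of interesting terms for the target movie.
terms = [("original_title", "ring"), ("genre", "Adventure"), ("description", "hobbit")]
query = " OR ".join(boosted_clauses(terms))
print(query)
# original_title:ring^20 OR genre:Adventure^30 OR description:hobbit
```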
47. “More Like This” Step 3
1) Extract the “interesting terms” from the target movie (from the field list in
the query)
2) Add boostings / field (as given in the query) for every interesting term
3) Perform a Search with those words and boostings
48. Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score | Title | Year | Genre | Vote
1132 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
894 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
881 | The Lord of the Rings: The Two Towers | 2002 | Action / Adventure / Drama | 8.7
667 | Rings | 2017 | Drama / Horror / Mystery | 4.5
661 | The Dork of the Rings | 2006 | Adventure / Comedy / Fantasy | 3.2
49. Add Numeric Fields to “More Like This”
1) SOLR Request 1: perform an MLT query and get the “interesting terms”
2) Add boostings
3) Add numeric fields with their boostings
4) SOLR Request 2: perform a Search with numeric fields and “interesting terms” with their respective boostings
50. Example of Numeric Field Syntax
Target movie: avg_vote = 8.8
=> a similar movie would have:
avg_vote: [8.8 - 1.5 TO 8.8 + 1.5]
=> add boosting factor:
avg_vote: [7.3 TO 10.3] ^ 40
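The numeric clause can be generated with a small helper, sketched below in Python. The function name and the sample term clauses are assumptions for illustration; the field, the ±1.5 window, and the boost of 40 come from the example above. The bounds are formatted to one decimal because 8.8 - 1.5 is not exactly 7.3 in floating point, and the clause is written without the spaces used on the slide.

```python
def range_clause(field, value, delta, boost):
    """Build a boosted range clause around the target movie's value."""
    low, high = value - delta, value + delta
    return f"{field}:[{low:.1f} TO {high:.1f}]^{boost}"

vote_clause = range_clause("avg_vote", 8.8, 1.5, 40)
print(vote_clause)  # avg_vote:[7.3 TO 10.3]^40

# Hypothetical term clauses from the first request, combined with the
# numeric clause to form the query of the second request.
term_clauses = ["original_title:ring^20", "genre:Adventure^30"]
final_query = " OR ".join(term_clauses + [vote_clause])
print(final_query)
```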
53. Final Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score | Title | Year | Genre | Vote
249 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
246 | The Lord of the Rings: The Two Towers | 2002 | Action / Adventure / Drama | 8.7
222 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
161 | Lord of War | 2005 | Action / Crime / Drama | 7.6
157 | The Lord Protector | 1996 | Action / Adventure / Fantasy | 4.2