2. April 14, 2014
I’m Erik Bernhardsson
Engineering Manager at Spotify in NYC
@fulhack
3. The “Prism” team
Chris Johnson
Andy Sloane
Sam Rozenberg
Ahmad Qamar
Romain Yon
Gandalf Hernandez
Neville Li
Rodrigo Araya
Edward Newett
Emily Samuels
Vidhya Murali
Rohan Agrawal
8. How do we structure music understanding?
How do you teach music to machines?
Editorial tagging
Audio analysis
Metadata
Natural language processing
Collaborative filtering
9. Collaborative filtering
Find patterns in usage data
With millions of users and billions of streams, there are lots of patterns
“Hey, I like tracks P, Q, R, S!”
“Well, I like tracks Q, R, S, T!”
“Then you should check out track P!”
“Nice! Btw try track T!”
10. Some real data points
36.5% of playlists containing Notorious BIG also contain 2Pac
(6.4% of playlists containing Notorious BIG also contain Justin Bieber)
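A figure like the ones above is a conditional co-occurrence rate. Here is a minimal sketch of how one could be computed; the playlists and the function name `cooccurrence_rate` are invented for illustration and are not Spotify data or code.

```python
def cooccurrence_rate(playlists, artist_a, artist_b):
    """Fraction of playlists containing artist_a that also contain artist_b."""
    with_a = [p for p in playlists if artist_a in p]
    if not with_a:
        return 0.0
    both = sum(1 for p in with_a if artist_b in p)
    return both / len(with_a)


# Toy playlists, each represented as a set of artist names (invented data).
playlists = [
    {"Notorious BIG", "2Pac", "Dr. Dre"},
    {"Notorious BIG", "2Pac"},
    {"Notorious BIG", "Justin Bieber"},
    {"Notorious BIG", "Nas"},
    {"Justin Bieber", "Coldplay"},
]

rate_2pac = cooccurrence_rate(playlists, "Notorious BIG", "2Pac")             # 2 of 4
rate_bieber = cooccurrence_rate(playlists, "Notorious BIG", "Justin Bieber")  # 1 of 4
```

Note that the rate is not symmetric: conditioning on the other artist generally gives a different number.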
11. Main problem: how similar are two items?
If you understand that well, you can do most other things.
So our main problem: how do you model a function similarity(x, y)?
For item similarity it’s also much easier to acquire good test set data than it is for personal recommendations. Personal recommendations are hard to evaluate: most offline metrics like precision are irrelevant.
12. “Essentially, all models are wrong,
but some are useful.”
– George Box
We can’t perfectly model how users choose music. But modeling is a craft not a science and we can
use common sense when building models.
For play count, is a Poisson or a Normal distribution better?
Always check your assumptions. E.g. SVD minimizes squared loss, which assumes the underlying data is Gaussian. Is it?
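One rough way to check such an assumption is to fit both distributions by maximum likelihood and compare log-likelihoods. The sketch below does that for a handful of invented play counts; the function names and the data are assumptions for illustration. Comparing a discrete probability with a continuous density is only a rough guide, not a formal test.

```python
import math


def poisson_log_likelihood(counts):
    """Log-likelihood of the counts under a Poisson fitted by maximum likelihood."""
    lam = sum(counts) / len(counts)
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def normal_log_likelihood(counts):
    """Log-likelihood of the counts under a Normal fitted by maximum likelihood."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((k - mean) ** 2 for k in counts) / n
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (k - mean) ** 2 / (2 * var)
        for k in counts
    )


# Invented play counts: mostly small values plus one heavy listener.
play_counts = [1, 1, 1, 2, 1, 3, 1, 1, 2, 40]

ll_poisson = poisson_log_likelihood(play_counts)
ll_normal = normal_log_likelihood(play_counts)
```

A large gap in either direction, or a poor fit from both, is a hint that the loss function implied by the model deserves a second look.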
13. OK so how do we do it?
There are a lot of interesting unsupervised language models that work really well for us. Docs = playlists/users, words = tracks/artists/albums. You could also call it implicit collaborative filtering because we have no ratings whatsoever.
Main approach: matrix factorization (or latent factor methods), historically with bag-of-words on play counts (but today sequence is also important).
Or more generally:
$$P = \begin{pmatrix} p_{11} & p_{12} & \ldots & p_{1n} \\ p_{21} & p_{22} & \ldots & p_{2n} \\ \vdots & \vdots & & \vdots \\ p_{m1} & p_{m2} & \ldots & p_{mn} \end{pmatrix}$$
The idea with matrix factorization is to represent this probability distribution like this:
$$p_{ui} = a_u^T b_i$$
$$M' = A^T B$$
[Figure: a large matrix approximated (≈) by a product of narrow factor matrices whose inner dimension is f]
15. Step 1: Put everything into a big sparse matrix
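[Figure: a sparse matrix, mostly empty, with an occasional nonzero entry such as 7]
A very big matrix too:
$$M = \begin{pmatrix} c_{11} & c_{12} & \ldots & c_{1n} \\ c_{21} & c_{22} & \ldots & c_{2n} \\ \vdots & \vdots & & \vdots \\ c_{m1} & c_{m2} & \ldots & c_{mn} \end{pmatrix}$$
with $10^7$ users (rows) and $10^7$ items (columns).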
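As a minimal sketch of this step, the play counts can be kept in a dictionary keyed by (user, track), so only nonzero entries are stored. The log format, the names and the function `build_sparse_counts` are assumptions made for the example.

```python
from collections import Counter


def build_sparse_counts(stream_log):
    """Map (user, track) -> play count, storing only nonzero entries."""
    return Counter((user, track) for user, track in stream_log)


# Invented stream log: one (user, track) tuple per stream.
stream_log = [
    ("Erik", "Never gonna give you up"),
    ("Erik", "Track Q"),
    ("Erik", "Track Q"),
    ("Anna", "Track Q"),
]

counts = build_sparse_counts(stream_log)
```

Looking up a pair that never occurred returns 0 without storing anything, which is what keeps the representation sparse.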
16. Matrix example
Roughly 25 billion nonzero entries
Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”)
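The arithmetic can be spelled out as below. The 12 bytes per entry are from the slide; splitting them into three 4-byte fields is my assumption, as is the dense comparison, which assumes roughly 10^7 users and 10^7 items.

```python
NONZERO_ENTRIES = 25_000_000_000  # figure from the slide

# Assumed layout of one entry (not stated in the slides):
# 4-byte user id + 4-byte item id + 4-byte play count = 12 bytes.
BYTES_PER_ENTRY = 4 + 4 + 4

total_bytes = NONZERO_ENTRIES * BYTES_PER_ENTRY
total_gb = total_bytes / 10**9  # decimal gigabytes

# For comparison: a dense 10^7 x 10^7 matrix with one 4-byte count per cell.
dense_bytes = 10**7 * 10**7 * 4
density = NONZERO_ENTRIES / (10**7 * 10**7)  # fraction of cells that are nonzero
```

Under these assumptions the dense matrix would take 400 TB and only 0.025% of its cells would be nonzero, which is why the sparse form is the only practical one.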
[Figure: one matrix cell. Row: Erik. Column: Never gonna give you up. Value: 1, meaning Erik listened to Never gonna give you up 1 time.]
17. For instance, for PLSA
Probabilistic Latent Semantic Indexing (Hofmann, 1999)
Invented as a method intended for text classification
$$P = \begin{pmatrix} \cdot & \cdots & \cdot \\ \vdots & & \vdots \\ \cdot & \cdots & \cdot \end{pmatrix} \approx \underbrace{\begin{pmatrix} \cdot & \cdot \\ \vdots & \vdots \\ \cdot & \cdot \end{pmatrix}}_{\text{user vectors}} \underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \cdot & \cdots & \cdot \end{pmatrix}}_{\text{item vectors}}$$
PLSA:
$$\underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \vdots & & \vdots \\ \cdot & \cdots & \cdot \end{pmatrix}}_{P(u,i) = \sum_z P(u \mid z) P(i,z)} \approx \underbrace{\begin{pmatrix} \cdot & \cdot \\ \vdots & \vdots \\ \cdot & \cdot \end{pmatrix}}_{P(u \mid z)} \underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \cdot & \cdots & \cdot \end{pmatrix}}_{P(i,z)}$$
18. Run this for n iterations
Start with random vectors around the origin.
Then run alternating least squares, gradient descent, or something like that.
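To make the loop concrete, here is a minimal sketch of the gradient descent variant, using plain stochastic gradient descent on squared error over the observed entries. It is not the model described in these slides: the loss, the hyperparameters and the function names are assumptions chosen to keep the example short.

```python
import random


def factorize(entries, n_users, n_items, f=2, n_iterations=200,
              learning_rate=0.05, regularization=0.01, seed=0):
    """Fit user vectors a_u and item vectors b_i so that a_u . b_i ~ value."""
    rng = random.Random(seed)
    # Start with small random vectors around the origin.
    a = [[rng.gauss(0, 0.1) for _ in range(f)] for _ in range(n_users)]
    b = [[rng.gauss(0, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(n_iterations):
        for u, i, value in entries:
            prediction = sum(x * y for x, y in zip(a[u], b[i]))
            error = value - prediction
            for k in range(f):
                a_uk, b_ik = a[u][k], b[i][k]
                a[u][k] += learning_rate * (error * b_ik - regularization * a_uk)
                b[i][k] += learning_rate * (error * a_uk - regularization * b_ik)
    return a, b


def squared_error(entries, a, b):
    """Sum of squared errors over the observed entries."""
    return sum(
        (value - sum(x * y for x, y in zip(a[u], b[i]))) ** 2
        for u, i, value in entries
    )


# Invented observations: (user index, item index, value).
entries = [(0, 0, 1.0), (0, 1, 1.0), (1, 1, 1.0),
           (1, 2, 1.0), (2, 0, 1.0), (2, 2, 1.0)]

a0, b0 = factorize(entries, n_users=3, n_items=3, n_iterations=0)
a, b = factorize(entries, n_users=3, n_items=3)
initial_error = squared_error(entries, a0, b0)
final_error = squared_error(entries, a, b)
```

Alternating least squares replaces the inner update with a closed-form solve for all user vectors while the item vectors are held fixed, and then the other way around.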
19. Why are latent factor models nice?
They find vectors which are super small fingerprints of the musical style or the user’s taste
Usually something like 40-1000 elements
Track X: 0.87 1.17 -0.26 0.56 2.21 0.77 -0.03
[Figure: track x's vector drawn against the axes latent factor 1 and latent factor 2]
20. Why are latent factor models nice? (part 2)
- Fast (linear in input size)
- Do not have a big problem with overfitting
- Have a solid underlying model (i.e. not just a bunch of heuristics)
- Easy to scale (at least compared to other models)
- Give a compact representation of items
21. Similarity now becomes schoolbook trigonometry
[Figure: track x and track y drawn as vectors against the axes latent factor 1 and latent factor 2; cos(x, y) = HIGH]
IPMF item item:
$$P(i \to j) = \frac{\exp(b_j^T b_i)}{Z_i} = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$$
VECTORS:
$$p_{ui} = a_u^T b_i$$
$$\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i| |b_j|}$$
Computing this is O(f).
i | j | sim_{i,j}
2pac | 2pac | 1.0
2pac | Notorious B.I.G. | 0.91
2pac | Dr. Dre | 0.87
2pac | Florence + the Machine | 0.26
Florence + the Machine | Lana Del Rey | 0.81
IPMF item item MDS:
$$P(i \to j) = \frac{\exp(-|b_j - b_i|^2)}{Z_i} = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}$$
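The inner-product form of P(i → j) is a softmax over all items. A minimal sketch, with invented two-dimensional item vectors and a hypothetical function name:

```python
import math


def transition_probabilities(item_vectors, i):
    """P(i -> j) = exp(b_j . b_i) / sum_k exp(b_k . b_i) for every item j."""
    b_i = item_vectors[i]
    scores = [sum(x * y for x, y in zip(b_j, b_i)) for b_j in item_vectors]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    z_i = sum(exps)
    return [e / z_i for e in exps]


# Invented 2-dimensional item vectors.
item_vectors = [[1.0, 0.2], [0.9, 0.3], [-0.5, 1.0]]
probs = transition_probabilities(item_vectors, 0)
```

The normalizer Z_i sums over every item, so computing it exactly costs one pass over the whole catalogue per item i.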
22. Why does cosine make sense?
Intuitively it makes sense, because we’re factoring out popularity and introducing a distance metric.
In fact, the best result seems to come from training a latent factor model as usual and then normalizing all vectors as a post-processing step.
Even for models without any geometric interpretation (like LDA), cosine works.
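A minimal sketch of the normalization step: once vectors are scaled to unit length, cosine similarity is just a dot product. The vectors are invented, and the comments assume that vector length loosely tracks popularity.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine(x, y):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    nx, ny = normalize(x), normalize(y)
    return sum(a * b for a, b in zip(nx, ny))


popular_track = [4.0, 2.0]  # same direction as niche_track, larger norm
niche_track = [2.0, 1.0]
other_track = [-1.0, 2.0]   # orthogonal to popular_track

sim_same_direction = cosine(popular_track, niche_track)
sim_orthogonal = cosine(popular_track, other_track)
```

The two tracks that point the same way get similarity 1 regardless of their lengths, which is the sense in which popularity is factored out.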
23. It’s still tricky to search for similar tracks though
Locality Sensitive Hashing: cut the space recursively with random planes. If two points are close, they are more likely to end up on the same side of each plane.
https://github.com/spotify/annoy
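The random-plane idea can be sketched as below. This is a flat signature hash written for illustration, not how Annoy is implemented; all function names are hypothetical.

```python
import random


def random_planes(n_planes, f, seed=0):
    """Draw n_planes random hyperplanes through the origin (as normal vectors)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(f)] for _ in range(n_planes)]


def signature(vector, planes):
    """One bit per plane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vector)) >= 0 else 0
        for plane in planes
    )


def hamming(sig_a, sig_b):
    """Number of planes that separate the two vectors."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


planes = random_planes(n_planes=16, f=3)
x = [1.0, 0.5, 0.2]
x_scaled = [2.0, 1.0, 0.4]       # same direction as x
x_opposite = [-1.0, -0.5, -0.2]  # opposite direction

same = hamming(signature(x, planes), signature(x_scaled, planes))
opposite = hamming(signature(x, planes), signature(x_opposite, planes))
```

Vectors with a small angle between them disagree on few bits, so candidates for "similar tracks" can be found by looking only at items with nearby signatures instead of scanning everything.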
25. Old school models
- Latent Semantic Analysis (LSA)
- Probabilistic Latent Semantic Analysis (PLSA)
- Latent Dirichlet Allocation (LDA)
Bag of words models
Need a lot of topics, and usually not very great for music recs
26. What about scalability of models?
When we started experimenting with latent factor models, PLSA needed at least 400 factors (topics)
to give decent results.
That gives at least 10 billion parameters, way more than you could conveniently fit in RAM.
So what to do? We turned to Hadoop.
27. One iteration, one map/reduce job
“Google News Personalization: Scalable Online Collaborative Filtering”
[Figure: map step and reduce step. All log entries are split into a K × L grid of blocks, one per pair (u % K, i % L), from (0, 0) to (K-1, L-1). User vectors are sharded by u % K (0 to K-1) and item vectors by i % L (0 to L-1).]
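My reading of the diagram labels is that log entries are partitioned by (u % K, i % L), so each block only needs one shard of user vectors and one shard of item vectors. The sketch below illustrates that partitioning in plain Python; it is an assumption-laden illustration, not the actual Hadoop job, and the function names are hypothetical.

```python
from collections import defaultdict


def map_step(log_entries, K, L):
    """Group log entries (u, i, count) into blocks keyed by (u % K, i % L)."""
    blocks = defaultdict(list)
    for u, i, count in log_entries:
        blocks[(u % K, i % L)].append((u, i, count))
    return blocks


def vectors_needed(block_key, user_ids, item_ids, K, L):
    """User and item ids whose vectors a worker for this block would load."""
    u_shard, i_shard = block_key
    users = [u for u in user_ids if u % K == u_shard]
    items = [i for i in item_ids if i % L == i_shard]
    return users, items


# Invented log entries: (user id, item id, play count).
log_entries = [(0, 0, 3), (1, 5, 1), (2, 4, 7), (3, 1, 2), (4, 2, 1)]
blocks = map_step(log_entries, K=2, L=3)
users, items = vectors_needed((0, 1), range(5), range(6), K=2, L=3)
```

With this layout each worker holds roughly 1/K of the user vectors and 1/L of the item vectors, rather than the full parameter set.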
28. Other MF models
- Collaborative Filtering for Implicit Feedback Datasets (“Koren”)
- “vector_exp”: our own model, in which every stream is a softmax over all tracks
Need a much more compact representation of items, typically only about 40 elements.
Benefit a lot from handling the zero case separately.
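For reference, the objective of the Collaborative Filtering for Implicit Feedback Datasets paper can be written as below. The notation is adapted to this deck and is my own choice rather than the paper's: $c_{ui}$ is the play count from the matrix $M$, $y_{ui}$ is a binary preference, $w_{ui}$ is a confidence weight, and $\alpha$ and $\lambda$ are hyperparameters.

```latex
y_{ui} = \begin{cases} 1 & \text{if } c_{ui} > 0 \\ 0 & \text{if } c_{ui} = 0 \end{cases}
\qquad
w_{ui} = 1 + \alpha c_{ui}

\min_{a, b} \; \sum_{u, i} w_{ui} \left( y_{ui} - a_u^T b_i \right)^2
  + \lambda \left( \sum_u |a_u|^2 + \sum_i |b_i|^2 \right)
```

The sum runs over all (u, i) pairs, including the zeros, which get the minimum weight of 1.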
29. New trendy models
- Recurrent neural networks (RNN)
- word2vec
Take into account the sequence of events
Future: take into account the time, maybe with hidden Markov models, etc.?
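One simple way to feed sequence information to a word2vec-style model is to turn each listening session into (track, context track) pairs within a small window. The sketch below shows only that data preparation step, under my own assumptions about the input; it does not train a model, and `context_pairs` is a hypothetical name.

```python
def context_pairs(session, window=2):
    """Return (track, context_track) pairs for tracks played close together."""
    pairs = []
    for pos, track in enumerate(session):
        start = max(0, pos - window)
        end = min(len(session), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                pairs.append((track, session[ctx_pos]))
    return pairs


session = ["P", "Q", "R", "S"]  # one user's plays, in order (invented)
pairs = context_pairs(session, window=1)
```

Unlike a bag-of-words count, these pairs depend on play order: P and S never appear together with a window of 1, even though they are in the same session.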
30. Power of combining models
All models have their own objective and their own biases. Combining them (with gradient boosted decision trees) yields kickass results.
31. Album cover based models
Just a fun experiment that shows that any signal (weak learner) adds value to the ensemble. Turns
out it probably just works as a classifier for minimal techno. We will most likely never put this in
production :)
32. What happened with Hadoop?
Most newer models don’t need a ton of latent factors, so all parameters fit nicely in RAM.
Additionally, you can do more complex things on a single machine. Lately we’ve started focusing on a combination of non-scalable models (more complex, less data) and scalable models (simple, but with more data).
Hadoop makes things “scale”, but at a ridiculous constant I/O overhead. We are in the process of moving our models to Spark instead.
33. Orders of magnitude numbers
Model | Data points | Parameters | Time to train
Single-machine model | 1B | 100M | 10h
Hadoop model | 100B | 10B | 10h
Spark?? | 100B | 10B | 1h
34. What are we optimizing for?
... a story of surrogate loss functions
35. We want to optimize Spotify’s “success”
Long term business value or something similar.
Problem: You only get one shot!
36. Let’s run A/B tests
Typically: DAU (daily active users), Day 2 retention, etc
Super inefficient way of collecting roughly 1 bit of information!
37. So let’s do offline testing
Editorial judgement
“Look at the results”
38. The “Daft Punk Test”
… why does collaborative filtering always fail?
LDA | RNN | Koren | PLSA | vector_exp
Daft Punk | Daft Punk | Daft Punk | Daft Punk | Daft Punk
Daft Punk - Stardust | Rizzle Kicks | Coldplay | The PURSUIT | Gorillaz
Raccoon | Daft Funk | Gotye | Junior Senior | deadmau5
Dave Droid | La Roux | Lana Del Rey | Chuckie & LMFAO | Macklemore & Ryan Lewis
The Local Abilities | Rudimental | Of Monsters And Men | Beatbullyz | M83
Daft Funk | Pacjam | The Lumineers | Pursuit | Gotye
M83 VS Big Black Delta | Su Bailey | Green Day | La Roux | The xx
Leandro Dutra | Capital Cities | John Mayer | Fatboy Slim | Calvin Harris
Huw Costin | YYZ | Foster The People | Chase & Drive | Kavinsky
Jesús Alonso | Various, WMGA | Florence + The Machine | Knivez Out | Coldplay
39. Wait maybe machines can evaluate things?
Sure! We just need a ground truth data set.
Use things like thumbs, skips, editorial data sets.
Note that thumbs etc. have observation bias.
It doesn’t have to be high volume: a few thousand data points are enough.
We can also optimize for this using e.g. GBDT.
41. Ensemble workflow
[Figure: workflow diagram with the boxes Model 1, Model 2, Model 3, ..., Model n; Thumbs; Gradient boosted decision tree; Cross validate ensemble model; Combined model; Offline metrics; Editorial data sets; Production]
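To give a feel for the combining step, here is a toy gradient boosting sketch with one-split regression stumps and squared loss. Each row holds the similarity scores that several base models gave to one item pair, and the label is a thumb (1 up, 0 down). The data, function names and settings are invented; a production GBDT would use deeper trees, usually a classification loss, and cross validation.

```python
def fit_stump(rows, residuals):
    """Best one-split regression stump, or None if no split is possible."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            loss = (sum((r - left_mean) ** 2 for r in left)
                    + sum((r - right_mean) ** 2 for r in right))
            if best is None or loss < best[0]:
                best = (loss, feature, threshold, left_mean, right_mean)
    return best[1:] if best else None


def predict(model, row):
    """Base value plus the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    score = base
    for feature, threshold, left_value, right_value in stumps:
        score += learning_rate * (left_value if row[feature] <= threshold else right_value)
    return score


def fit_boosted(rows, labels, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals (squared-loss gradient)."""
    model = (sum(labels) / len(labels), learning_rate, [])
    for _ in range(n_rounds):
        residuals = [y - predict(model, row) for row, y in zip(rows, labels)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break
        model[2].append(stump)
    return model


# Invented scores from three base models for six item pairs, with thumbs.
rows = [[0.9, 0.8, 0.1], [0.8, 0.2, 0.9], [0.7, 0.9, 0.5],
        [0.2, 0.1, 0.8], [0.1, 0.3, 0.2], [0.3, 0.7, 0.6]]
labels = [1, 1, 1, 0, 0, 0]

model = fit_boosted(rows, labels)
score_high = predict(model, [0.85, 0.5, 0.5])
score_low = predict(model, [0.15, 0.5, 0.5])
```

In this toy data only the first model's score separates the thumbs, so the ensemble learns to rely on it; with real data the trees can pick up interactions between models.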
42. This One Weird Trick Sort of Fixes Observation Bias
Augment the data set with lots of random negative data. Works well in practice.
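A minimal sketch of the augmentation, assuming the ground truth consists of (track, similar track) pairs with positive feedback. The data layout and the function name `add_random_negatives` are assumptions for illustration.

```python
import random


def add_random_negatives(positives, all_tracks, n_negatives, seed=0):
    """Return labeled pairs: observed positives plus randomly drawn negatives."""
    rng = random.Random(seed)
    observed = set(positives)
    labeled = [(pair, 1) for pair in positives]
    negatives = set()
    attempts = 0
    # Bound the attempts so the loop ends even if too many negatives are requested.
    while len(negatives) < n_negatives and attempts < 100 * n_negatives:
        attempts += 1
        pair = (rng.choice(all_tracks), rng.choice(all_tracks))
        if pair[0] != pair[1] and pair not in observed:
            negatives.add(pair)
    labeled.extend((pair, 0) for pair in sorted(negatives))
    return labeled


all_tracks = ["P", "Q", "R", "S", "T"]
positives = [("P", "Q"), ("Q", "R")]  # invented thumbs-up pairs
labeled = add_random_negatives(positives, all_tracks, n_negatives=6)
```

A random pair is almost always a bad recommendation in a large catalogue, so treating it as negative adds little label noise while covering parts of the space users never rated.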
[Figure: scatter plot over parameter 1 and parameter 2 showing positive (+) and negative (-) data points from earlier batches around the current best estimate]
43. What have we learned so far?
- Figuring out what to optimize for is hard
- Combining lots of models really helps
- Large scale algorithms are great, but not everything has to scale
44. So what are we working on now?
Combine even more signals
- Content-based methods: use audio, lyrics, images
- Read about music and understand it
- Personalize everything
- Just acquired Echo Nest in Boston!