2. Introduction
1. Developed by engineers at Twitter and contributed to Spark MLlib
a. https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
b. https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
2. A more efficient algorithm for finding pairs of similar items
3. Twitter reported an efficiency gain of 40%
3. Cosine Similarity
● Is a measure of similarity between 2 non-zero vectors
● Measures orientation not magnitude
● Outcome is bounded in [-1, 1] in general, and in [0, 1] when all vector components are non-negative (e.g. counts or ratings)
● Similarity = cos(θ) = (A · B) / (||A|| ||B||)
● Tutorial in python
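As a small illustration of the formula above, here is a minimal Python sketch using only the standard library. The function name cosine_similarity and the list-of-floats representation are assumptions made for this example, not part of any tutorial referenced in these notes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The measure is only defined for non-zero vectors.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Scaling a vector does not change the result (orientation, not magnitude), so [1, 2, 3] and [2, 4, 6] score 1, while orthogonal vectors score 0.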
4. Problem Statement:
1. All pairs similarity for Sparse Vectors
2. Cosine Similarity as objective function
3. Computing similarity for all pairs is computationally expensive
4. Implementation using spark-scala:
https://github.com/pramitchoudhary/BigDataFramework/blob/master/itemRecommendationApp/src/main/scala/com/recommendation/core/CollaborativeFiltering.scala
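To make the cost of the all-pairs problem concrete, here is a minimal single-machine Python sketch. It is not the Scala implementation linked above. The names sparse_cosine and all_pairs_similarity, and the choice of storing each sparse vector as a dict of index to non-zero value, are assumptions made for illustration.

```python
import math
from itertools import combinations


def sparse_cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: non-zero value}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(val * v[idx] for idx, val in u.items() if idx in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def all_pairs_similarity(vectors, threshold):
    """Score every pair and keep those at or above the threshold.

    vectors maps an item name to its sparse vector. With n items the loop
    below visits n * (n - 1) / 2 pairs, which is the quadratic cost noted above.
    """
    results = {}
    for (name_a, u), (name_b, v) in combinations(vectors.items(), 2):
        sim = sparse_cosine(u, v)
        if sim >= threshold:
            results[(name_a, name_b)] = sim
    return results
```

Every pair is scored even though most of them fall below the threshold and are thrown away, which is the waste the sampling approach tries to avoid.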
5. Solutions:
1. Brute Force: Compute for all pairs
a. Not all pairs are similar
b. Considering all pairs is an O(n^2) problem, which quickly becomes infeasible as n grows
2. DIMSUM (Dimension Independent Matrix Square using MapReduce), by Reza Bosagh Zadeh and Gunnar Carlsson
a. Clever sampling technique that computes similarity scores only for pairs of items whose similarity exceeds a given threshold.
b. Single values of X are estimated within relative error with constant probability
c. Needs access
d. For example, let the user-item matrix X have dimension m × n
i. x_ij is a user's response to an item (i: row index, j: column index)
ii. Oversampling parameter γ = 4 * log(n) / s, where s is the similarity threshold
e. Helps compute A^T * A efficiently; its entries are the pairwise column dot products, and it is a building block for matrix decompositions such as SVD
f. Reduces complexity to O(n log(n) / s)
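The sampling idea can be sketched on a single machine in Python. This is a simplified illustration, not the Spark MLlib code. It assumes the mapper/reducer form in which each co-occurring pair of entries in a row is kept with probability min(1, γ / (||c_j|| ||c_k||)), where c_j is column j. The names dimsum_similarities and rng, and the list-of-dicts row format, are hypothetical choices for this example.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def dimsum_similarities(rows, n_cols, threshold, rng=None):
    """Estimate cosine similarity between columns of a sparse matrix.

    rows: list of {column index: non-zero value}, one dict per row.
    n_cols: number of columns n (must be > 1 so that log(n) > 0).
    threshold: similarity threshold s used in gamma = 4 * log(n) / s.
    """
    rng = rng or random.Random(0)
    gamma = 4.0 * math.log(n_cols) / threshold

    # Pass 1: column norms ||c_j||.
    squared = defaultdict(float)
    for row in rows:
        for j, x in row.items():
            squared[j] += x * x
    norm = {j: math.sqrt(v) for j, v in squared.items()}

    # "Map": for each row, sample co-occurring column pairs.
    sums = defaultdict(float)
    for row in rows:
        for (j, xj), (k, xk) in combinations(sorted(row.items()), 2):
            keep_prob = min(1.0, gamma / (norm[j] * norm[k]))
            if rng.random() < keep_prob:
                sums[(j, k)] += xj * xk

    # "Reduce": rescale so the estimate is unbiased for the cosine.
    estimates = {}
    for (j, k), total in sums.items():
        scale = norm[j] * norm[k]
        if gamma / scale > 1.0:
            estimates[(j, k)] = total / scale  # nothing was dropped: exact
        else:
            estimates[(j, k)] = total / gamma  # undo the sampling rate
    return estimates
```

Pairs of columns with large norms are sampled less often, so popular items no longer dominate the work. In this sketch, estimates for pairs whose true similarity is below s should be treated as unreliable.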