Minhash is implemented in Hadoop to estimate Jaccard similarities, which are then used to make friend recommendations on the Facebook New Orleans friend-links data through collaborative filtering.
Written by Chengeng Ma
1. Locality-Sensitive Hashing & Minhash on Facebook friend-links data & friend recommendation
Chengeng Ma
Stony Brook University
2016/03/05
2. 1. What is Locality-Sensitive Hashing & Minhash?
• If you are already familiar with LSH and Minhash, please skip directly to page 12; the following pages are just fundamental background on this topic. You can find more details in the book Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman.
3. What are LSH & Minhash about?
• Locality-Sensitive Hashing (LSH) & Minhash are two profoundly important methods in Big Data for finding similar items.
• On Amazon, if you can find two similar persons, you can recommend to one person the items the other has purchased.
• For Google, Baidu, …, users always hope the search engine can find pictures similar to the one they have uploaded.
4. Calculating the similarity of every pair is a lot of computation (Why LSH?)
• If you have 10^6 items in your data, you will need almost 0.5 × 10^12 computations to know the similarities of all pairs.
• You will need to parallelize a lot of tasks to deal with this huge amount of computation.
• You can do this with the help of Hadoop, but you can do better with the help of LSH & Minhash.
• LSH hashes an item to a bucket based on the feature list that item has.
• If two items are quite similar to each other in their feature lists, then they have a large probability of being hashed into the same bucket.
• You can amplify this effect to different extents by setting parameters.
• Finally, you only need to compute similarities for the pairs formed by items within the same bucket.
5. How does Minhash come in?
• LSH needs you to keep the feature list of each item in the format of a matrix (the ordering is important).
• If the size of the universal set is fixed or small, e.g., a fingerprint array, then LSH alone can work well.
(Example matrix: the 1st column represents the items person S1 has purchased; the 1st row represents who has purchased item a.)
6. How does Minhash come in?
• Jaccard Similarity = 2/7
• However, if the universal set is large or not size-fixed, e.g., the items purchased by each account, friend lists on a social network, …
• Then formatting the dataset into a matrix is not efficient, since the dataset is usually very sparse.
• Then Minhash works, provided the similarity between two feature lists is calculated as the Jaccard similarity.
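As a quick sketch of the definition (in Python, with hypothetical purchase sets chosen so the similarity comes out to the slide's 2/7):

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two sets."""
    return len(a & b) / len(a | b)

# Hypothetical purchase lists: 2 items in common, 7 distinct items in
# total, matching the slide's example value of 2/7.
s1 = {"a", "b", "c", "d"}
s2 = {"c", "d", "e", "f", "g"}
print(jaccard(s1, s2))  # 2/7, about 0.2857
```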
7. What's a Minhash value?
• Permute the original matrix by row.
• For each column (set), the row index of the 1st non-empty element is the minhash value of that column.
Original matrix; permute to a different order: b, e, a, d, c.
H(S1)=a, H(S2)=c, H(S3)=b, H(S4)=a.
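A minimal Python sketch of this definition. The characteristic matrix is assumed to be the standard example from Mining of Massive Datasets (S1={a,d}, S2={c}, S3={b,d,e}, S4={a,c,d}), which reproduces the slide's minhash values under the permutation b, e, a, d, c:

```python
# Assumed characteristic matrix: each set lists the rows where it has a 1.
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}

def minhash(column, row_order):
    """Return the first row (in permuted order) where the column has a 1."""
    for row in row_order:
        if row in column:
            return row

order = ["b", "e", "a", "d", "c"]  # the permutation from the slide
print({name: minhash(s, order) for name, s in sets.items()})
# {'S1': 'a', 'S2': 'c', 'S3': 'b', 'S4': 'a'}
```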
8. Minhash's property (similarity preserved):
• There are 3 kinds of rows between sets Si & Sj:
(x): both sets have 1;
(y): one has 1, the other has 0;
(z): both sets have 0.
J_ij = |x| / (|x| + |y|)
Pr[ h(Si) = h(Sj) ] = |x| / (|x| + |y|)
• If you do 100 different minhashes, you reduce one dimension of the matrix from an unknown large size down to 100.
• The probability that two sets share the same minhash value equals the Jaccard similarity between them:
Pr[ h(Si) = h(Sj) ] = J_ij
9. Permutations can be simulated by hash functions
• For the j-th column of the original matrix, find all the non-empty elements and feed their row indexes into the i-th hash function; the minimum output is the element SIG(i, j).
• Hash function: h(x) = (a·x + b) % N:
• where N is a prime, equal to or slightly larger than the size of the universal set (the number of rows of the original matrix),
• a & b must be integers within [1, N−1].
• The result is the signature matrix SIG, where the row index is for hash functions and the column index is for sets.
For example, we use 2 hash functions to simulate 2 permutations: (x+1)%5 and (3x+1)%5, where x is the row index.
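Computing the signature matrix for the slide's two hash functions can be sketched directly (the sets are assumed to be the textbook example with rows indexed 0-4: S1={0,3}, S2={2}, S3={1,3,4}, S4={0,2,3}):

```python
# Assumed example matrix (rows indexed 0-4), following the standard
# example in Mining of Massive Datasets.
sets = {"S1": {0, 3}, "S2": {2}, "S3": {1, 3, 4}, "S4": {0, 2, 3}}

# Two hash functions simulating two row permutations.
hashes = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]

# SIG[i][j] = minimum of hash i over the rows where column j has a 1.
SIG = [[min(h(x) for x in s) for s in sets.values()] for h in hashes]
print(SIG)  # [[1, 3, 0, 1], [0, 2, 0, 0]]
```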
10. Now that you have the signature matrix, use it instead of the original matrix for LSH.
• Divide the signature matrix into b bands, each of which has r rows.
• For each band range, build an empty hashtable and hash each column (the portion within the band range) into a bucket, so that only identical band portions are hashed into the same bucket.
• Columns within the same bucket are considered candidates: form pairs from them and calculate their similarities.
• Take the union over the different band ranges and filter out the false positives.
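The banding step above can be sketched as follows (a toy signature matrix; bucket keys here are the raw band portions rather than hash values, which is equivalent for a local demonstration):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(SIG, b, r):
    """Bucket each column by its band portion; columns sharing a bucket
    in some band become candidate pairs."""
    n_cols = len(SIG[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col in range(n_cols):
            portion = tuple(SIG[band * r + row][col] for row in range(r))
            buckets[portion].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

# Toy signature matrix: 4 rows (b=2 bands of r=2 rows), 3 columns.
# Columns 0 and 1 agree on band 0; columns 0 and 2 agree on band 1.
SIG = [[1, 1, 2],
       [0, 0, 3],
       [5, 7, 5],
       [6, 8, 6]]
print(sorted(lsh_candidates(SIG, b=2, r=2)))  # [(0, 1), (0, 2)]
```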
12. 2. Details of my class project: dataset
• User-to-user links from the Facebook New Orleans networks data.
• The data was created by Bimal Viswanath et al. and used for their paper On the Evolution of User Interaction in Facebook.
• It can be downloaded from http://socialnetworks.mpi-sws.org/data/facebook-links.txt.gz
• It has 63,731 persons and 1,545,686 links, 10.4 MB in size.
• The data is not large, but as a training exercise, I will use Hadoop for this project.
13. My class project plan:
• First find similar persons based on users' friend lists, where LSH and Minhash will be implemented in Hadoop.
• The similar persons are called "close friends" in this project.
• Then recommend to you the persons who are friends of your close friends but not yet of you.
• This generally sounds like Collaborative Filtering.
• Two persons who have similar friend lists are considered "close friends", since they must have some relationship in the real world, e.g., schoolmates, workmates, teammates, …
• If you're a good friend of someone, you may like to know more of his/her friends.
• We do not set too high a threshold for similarity, since finding a duplicate of you is not interesting.
14. Why not just use common friend counts?
• The classical way is based on the number of common friends.
• However, there are some persons who have a lot of common friends with you but have nothing to do with you, e.g., celebrities, politicians, salesmen who want to sell their stuff through the social network, or even swindlers…
• People use a social network to find friends who can physically reach them, not persons too far away from them.
• Most of my friends may like a pop singer and become friends of his. Based on common friends, the system will recommend that pop singer to me.
• But the pop singer can never remember me, since he has millions of friends on the site.
15. Preparation work:
• 1. Put the data into the format below, where j represents the j-th person and Pj is the list of friends of person j:
1: P1
2: P2
… …
n: Pn
• 2. In this study, 63731 is both the number of sets to compare and the size of the universal element set, because both the key j and the elements within set Pj are user ids.
• 3. 63731 is not a prime (63731 = 101 × 631). Only a prime modulus can simulate true permutations, so we use 63737 instead, which is equivalent to adding 6 persons who have no friends online.
• 4. Hash function for Minhash:
N = 63737L; hashNum = 100;
private long fhash ( int i, long x ) {
    return (13 + (x - 1) * (i * N / (3 * hashNum) + 1)) % N;
}
where 1 ≤ x ≤ N and 0 ≤ i ≤ hashNum − 1.
16. Pseudocode of Minhash (Map job only)
• Mapper input: (c, Pc), where Pc represents a list [j1, j2, …, js];
• Build a new array s[hashNum] (hashNum = 100 here), initialized to infinity everywhere.
• For the i-th hash function, each element jj in Pc is an opportunity to get a lower hash value; the minimum hash value over all jj is the minhash SIG[i, c].
• Output c as key and the content of array s as value.
input (c, Pc), where Pc = [j1, j2, …, js]
long[] s = new long[hashNum];
for 0 ≤ ii ≤ hashNum − 1:
    s[ii] = infinity;
end
for jj in [j1, j2, …, js]:
    for 0 ≤ ii ≤ hashNum − 1:
        s[ii] = min(s[ii], fhash(ii, jj));
    end
end
Output (c, array s);
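A runnable Python rendering of this mapper. It uses a generic (a·x + b) % N hash family in place of the project's exact fhash; the coefficients below are illustrative, not the slide's:

```python
N, HASH_NUM = 63737, 100  # prime modulus and number of hash functions

# Illustrative (a*x + b) % N family standing in for the project's fhash;
# a and b are arbitrary fixed integers in [1, N-1], not the slide's values.
AB = [(3 * i + 1, 7 * i + 13) for i in range(HASH_NUM)]

def fhash(i, x):
    a, b = AB[i]
    return (a * x + b) % N

def minhash_map(c, Pc):
    """Mapper: (c, Pc) -> (c, signature column of length HASH_NUM)."""
    s = [float("inf")] * HASH_NUM
    for j in Pc:                     # each friend id is a chance to lower s[i]
        for i in range(HASH_NUM):
            s[i] = min(s[i], fhash(i, j))
    return c, s
```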
17. Pseudocode for LSH:
• Mapper input: (j, Vj), where Vj is the j-th column of the signature matrix.
• Split array Vj into B bands: Vj1, Vj2, …, VjB.
• For the b-th band, get its hash value and store it in myHash.
• Output the tuple (b, myHash) as key and j as value.
for 1 ≤ b ≤ B:
    myHash = getHashValue(Vjb)
    Output { (b, myHash), j }
end
• Reducer input:
{ (b, aHashValue), [j1, j2, …, js] }
Now form pairs among j1, …, js and output them as candidate pairs.
for 1 ≤ x ≤ s − 1:
    for x + 1 ≤ y ≤ s:
        output (jx, jy)
    end
end
One more program is needed to remove duplicates. Hadoop's sorting procedure helps us gather all the items that both have the same hash value and come from the same band range.
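The two jobs above might be sketched in Python like this (a local simulation of Hadoop's shuffle; getHashValue is rendered as simple string joining, as slide 18 suggests):

```python
from collections import defaultdict
from itertools import combinations

B, R = 50, 2  # bands and rows per band (hashNum = B * R = 100)

def lsh_map(j, Vj):
    """Mapper: for each band b, emit key (b, string hash of the portion)."""
    for b in range(B):
        portion = Vj[b * R:(b + 1) * R]
        yield (b, ",".join(map(str, portion))), j

def lsh_reduce(columns):
    """Reducer: columns sharing one (band, hash) key form candidate pairs."""
    return list(combinations(sorted(columns), 2))

# Local stand-in for Hadoop's shuffle/sort: group mapper outputs by key.
signatures = {1: [0] * 100, 2: [0, 0] + [9] * 98}  # agree only on band 0
grouped = defaultdict(list)
for j, Vj in signatures.items():
    for key, value in lsh_map(j, Vj):
        grouped[key].append(value)

pairs = set()
for cols in grouped.values():
    pairs.update(lsh_reduce(cols))  # the set removes duplicate pairs
print(pairs)  # {(1, 2)}
```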
18. Hash function for LSH:
• LSH needs to hash a band portion of a vector into a value.
• We want only identical vectors to be hashed into the same bucket.
• An easy way is to directly use the portion's string expression, since Hadoop also uses Text to transport data.
• For example, hash the portion below to the string "21,14,36,55".
• In this way, only exactly identical vector portions can come into the same bucket.
19. Parameter settings:
• We do not want to set the similarity threshold too high, since finding a duplicate of you on the web is not interesting.
• So we set the similarity threshold near 0.1.
• We set B=50 and hashNum=100, so that each band in LSH has R=2 rows.
P(candidate | x) = 1 − (1 − x^R)^B
• The S curve grows quickly (B=50, R=2):
• x=0.1, P=0.39
• x=0.15, P=0.68
• x=0.2, P=0.87
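The S-curve values quoted on the slide can be checked directly:

```python
def p_candidate(x, b=50, r=2):
    """Probability that a pair with Jaccard similarity x becomes a
    candidate under b bands of r rows: 1 - (1 - x**r) ** b."""
    return 1 - (1 - x ** r) ** b

for x in (0.1, 0.15, 0.2):
    print(x, p_candidate(x))  # roughly 0.39, 0.68, 0.87, as on the slide
```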
20. Result Test:
• P(candidate | x) = P(candidate & x ≤ s < x+dx) / P(x ≤ s < x+dx)
• P(candidate | x) = 1 − (1 − x^R)^B
• The Hadoop output can be analyzed to get P(candidate & x ≤ s < x+dx).
• For P(x ≤ s < x+dx), another Hadoop program is written to really calculate the similarities of all possible pairs (which takes N(N−1)/2 computations), since the dataset is not too large.
• But only the similarities equal to or larger than 0.1 are stored in the output file, because it would take several terabytes to store all the similarities.
22. Histogram of the LSH-recommended pairs and all existing pairs (cut at 0.1) within the data
23. Statistics
• LSH & Minhash recommend 1,065,318 pairs.
• There are 660,334 existing pairs that really have s larger than 0.1.
• Their intersection has 429,176 pairs, which covers 65% of the similar pairs (s > 0.1).
• But the computation is hundreds of times faster than before.
• 1 − (1 − x^R)^B = P(candidate & x ≤ s < x+dx) / P(x ≤ s < x+dx)
• We define a reference value x_eff:
1 − (1 − x_eff^R)^B = [∫ from 0.1 to 1 of P(candidate & x ≤ s < x+dx) dx] / [∫ from 0.1 to 1 of P(x ≤ s < x+dx) dx] = 429176/660334 = 0.649938
Taking in the parameters B=50, R=2:
x_eff = 0.1441, which is slightly above 0.1.
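Inverting the S-curve reproduces the slide's reference value:

```python
def x_eff(p, b=50, r=2):
    """Solve 1 - (1 - x**r) ** b = p for x."""
    return (1 - (1 - p) ** (1 / b)) ** (1 / r)

print(x_eff(0.649938))  # about 0.1441, slightly above the 0.1 threshold
```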
24. How to calculate P(x ≤ s < x+dx)? (Similarity Joins Problem)
• To get the exact P.D.F., you need to really calculate the similarities of all N(N−1)/2 pairs.
• Using Hadoop can parallelize and speed this up, but don't use too high a replication rate.
• What about the method below?
Mapper input: (i, Pi)
for 1 ≤ j ≤ N:
    if i < j: output { (i, j), Pi }
    else if i > j: output { (j, i), Pi }
end
Reducer input: { (i, j), [Pi, Pj] }
Output { (i, j), Sij }
• This method has a replication rate of N and will definitely fail. The correct way is to split the persons into G groups.
25. The correct way to get the similarities of all pairs with Hadoop.
• Mapper input: (i, Pi)
• Determine its group number as u = i % G, where G is the number of groups you split into; G is also the replication rate.
• For 0 ≤ v ≤ G−1:
• If u < v: output { (u, v), (i, Pi) }
• Else if u > v: output { (v, u), (i, Pi) }
• end
• Reducer input:
• { (u, v), [ (i, Pi) ∈ Group u, (j, Pj) ∈ Group v ] }
• Create two empty lists, uList & vList, to separately gather all the (i, Pi) that belong to group u and to group v.
For 0 ≤ α ≤ size(uList) − 1:
    Get i and Pi from uList[α]
    For 0 ≤ β ≤ size(vList) − 1:
        Get j and Pj from vList[β]
        If i < j: output { (i, j), Sij }
        Else if i > j: output { (j, i), Sij }
(Continued on the next page.)
26. Still within the reducer:
• The above only considers pairs whose elements come from different groups.
• Now we consider elements within the same group.
• We manage to avoid calculating the same pairs multiple times by setting if-conditions.
If v == u+1:
    For 0 ≤ α ≤ size(uList) − 2:
        Get i and Pi from uList[α]
        For α + 1 ≤ β ≤ size(uList) − 1:
            Get j and Pj from uList[β]
            If i < j: output { (i, j), Sij }
            Else if i > j: output { (j, i), Sij }
If u == 0 & v == G−1:
    For 0 ≤ α ≤ size(vList) − 2:
        Get i and Pi from vList[α]
        For α + 1 ≤ β ≤ size(vList) − 1:
            Get j and Pj from vList[β]
            If i < j: output { (i, j), Sij }
            Else if i > j: output { (j, i), Sij }
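As a local check (Python, no Hadoop) that this grouping scheme emits every unordered pair exactly once:

```python
from itertools import combinations

def grouped_join_pairs(ids, G):
    """Emit pairs exactly as the reducers above would: cross-group pairs
    at key (u, v); within-group-u pairs only when v == u + 1;
    within-group-v pairs only when u == 0 and v == G - 1."""
    groups = {u: [i for i in ids if i % G == u] for u in range(G)}
    pairs = []  # a list, so any duplicate emission would show up
    for u in range(G):
        for v in range(u + 1, G):
            for i in groups[u]:          # cross-group pairs
                for j in groups[v]:
                    pairs.append((min(i, j), max(i, j)))
            if v == u + 1:               # within-group-u pairs, done once
                pairs.extend(combinations(sorted(groups[u]), 2))
            if u == 0 and v == G - 1:    # within-last-group pairs
                pairs.extend(combinations(sorted(groups[v]), 2))
    return pairs

pairs = grouped_join_pairs(range(20), G=4)
print(len(pairs))  # 190 = 20*19/2, each pair exactly once
```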
27. Post-processing work:
• 1. Filter out the false positives by calculating similarities for the candidate pairs. Then we will have the similar persons ("close friends") for a lot of users.
• The general idea is to use 2 MR jobs:
• The 1st MR job uses i as key and changes (i, j) to (i, j, Pi);
• The 2nd MR job changes (i, j, Pi) to (i, j, Pi, Pj), so you can get the similarity Sij.
• 2. Recommendation: for each user, take the union of his/her close friends' friend lists and filter out the members he/she already knows.
• The general idea is:
• When you have a similar-person list like {a, [b1, b2, …, bs]}, transform it to {a, [Pb1, Pb2, …, Pbs]}, where Pbi is the friend list of person bi.
• Then take the union of the Pbi and finally subtract Pa.
28. Filter Out False Positives (2 MR jobs)
• 1st Mapper (multiple inputs):
Recommendation data:
(i, j) → { i, (j, "R") } if i < j
         { j, (i, "R") } if i > j
Friend-list data:
(i, Pi) → { i, (Pi, "F") }
• 1st Reducer input:
{ i, [ (j, "R") for every j candidate-paired with i and j > i; (Pi, "F") ] }
For each j from the input:
    Output { j, (i, Pi, "temp") }
• 2nd Mapper (multiple inputs):
Temporary data: pass through
Friend-list data:
(j, Pj) → { j, (Pj, "F") }
• 2nd Reducer input:
{ j, [ (i, Pi, "temp") for every i associated with j; (Pj, "F") ] }
For each i:
    Sij = similarity(Pi, Pj)
    If Sij >= 0.1: output { (i, j), Sij }
29. Recommendation (3 MR jobs):
• 1st Mapper (multiple inputs):
Similar-persons list data:
{ a, [b1, b2, …, bs] } → { bi, (a, "S") } for all i
Friend-list data:
(bi, Pbi) → { bi, (Pbi, "F") }
• 1st Reducer input:
{ bi, [ (a, "S") for every a similar to bi; (Pbi, "F") ] }
For each a from the input:
    Output { a, Pbi }
• 2nd Mapper: pass through
• 2nd Reducer input: { a, [Pb1, Pb2, …, Pbs] }
U = Pb1 ∪ Pb2 ∪ … ∪ Pbs
Output { a, U }
• 3rd Mapper (multiple inputs):
(i, Ui) → { i, (Ui, "u") }
(i, Pi) → { i, (Pi, "F") }
• 3rd Reducer input:
{ i, [ (Ui, "u"), (Pi, "F") ] }
Output { i, Ui − Pi }
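The whole recommendation step reduces to set algebra; a Python sketch with hypothetical friend lists:

```python
def recommend(a, close_friends, P):
    """Union of the close friends' friend lists, minus the people user a
    already knows and a itself (P maps user id -> friend set)."""
    pool = set()
    for b in close_friends:
        pool |= P[b]          # union of the close friends' friend lists
    return pool - P[a] - {a}  # subtract Pa and the user themself

# Hypothetical data: user 1's close friends are 2 and 3.
P = {1: {2, 3}, 2: {1, 4}, 3: {1, 5}}
print(recommend(1, [2, 3], P))  # {4, 5}
```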
30. References:
• 1. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets.
• 2. Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks (WOSN '09). ACM, New York, NY, USA, 37-42. DOI=http://dx.doi.org/10.1145/1592665.1592675