Locality-Sensitive Hashing & Minhash on Facebook friend links data & friends recommendation
Chengeng Ma
Stony Brook University
2016/03/05
1. What are Locality-Sensitive Hashing & Minhash?
• If you are already familiar with LSH and Minhash, please go directly to page 12. The pages before it cover only the fundamentals of this topic; more details can be found in the book Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.
What are LSH & Minhash about?
• Locality-Sensitive Hashing (LSH) & Minhash are two important methods in Big Data for finding similar items.
• At Amazon, if you can find two similar persons, you can recommend to one person the items the other has purchased.
• For Google, Baidu, …, users hope the search engine can find pictures similar to the one they have uploaded.
Calculating the similarity between each pair takes a lot of computation (Why LSH?)
• If you have $10^6$ items in your data, you need almost $0.5 \times 10^{12}$ similarity computations to know the similarity between every pair.
• You will need to run many tasks in parallel to deal with this amount of computation.
• You can do this with the help of Hadoop, but you can do better with the help of LSH & Minhash.
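The pair count quoted above is just n(n-1)/2. A minimal Python check (the helper name `pair_count` is ours, not from the slides):

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(pair_count(10**6))  # 499999500000, i.e. about 0.5e12
```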
• LSH hashes each item to a bucket based on the item's feature list.
• If two items have very similar feature lists, they have a large probability of being hashed into the same bucket.
• You can amplify this effect to different extents by setting parameters.
• Finally, you only need to compute similarities for the pairs formed by items within the same bucket.
How does Minhash come in?
• LSH needs the feature list of each item to be kept in a matrix-like format (the order of the rows is important).
• If the universal set is fixed in size or small, e.g., a fingerprint array, then LSH alone can work well.
[Figure: example matrix. The 1st column represents the items person S1 has purchased; the 1st row represents who has purchased item a.]
How does Minhash come in?
• Jaccard Similarity = 2/7
• However, the universal set may be large or not fixed in size, e.g., the items purchased by each account, or the friend lists on a social network, …
• Formatting such a dataset as a matrix is not efficient, since the dataset is usually very sparse.
• Minhash works in this case, provided the similarity between two feature lists is measured as Jaccard similarity.
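For reference, Jaccard similarity is the size of the intersection divided by the size of the union. A minimal Python sketch; the two sets below are made up for illustration and are chosen only so that the result equals 2/7:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical feature lists: 2 shared elements out of 7 distinct ones.
s_a = {1, 2, 3, 4}
s_b = {3, 4, 5, 6, 7}
print(jaccard(s_a, s_b))  # 2/7, about 0.2857
```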
What is a Minhash value?
• Permute the rows of the original matrix.
• For each column (set), the row index of the 1st non-empty element is the minhash value of that column.
[Figure: the original matrix, and the same matrix with its rows permuted to a different order: b, e, a, d, c.]
H(S1)=a, H(S2)=c, H(S3)=b, H(S4)=a.
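A minimal Python sketch of this step. It assumes the original matrix is the one from the corresponding example in Mining of Massive Datasets (S1={a,d}, S2={c}, S3={b,d,e}, S4={a,c,d}); the helper name `minhash` is ours. Under that assumption it reproduces the four values listed above:

```python
def minhash(order, s):
    """Return the first element of the permuted row order that belongs to set s."""
    for element in order:
        if element in s:
            return element
    return None  # empty set: no minhash value


# Assumed example sets (taken from the textbook example, not from the slide text).
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}
order = ["b", "e", "a", "d", "c"]  # permuted row order from the slide

print({name: minhash(order, s) for name, s in sets.items()})
# {'S1': 'a', 'S2': 'c', 'S3': 'b', 'S4': 'a'}
```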
Minhash's property (similarity preserved):
• There are 3 kinds of rows between sets $S_a$ & $S_b$:
(x): both sets have 1;
(y): one has 1, the other has 0;
(z): both sets have 0.
$J_{ab} = \frac{|X|}{|X| + |Y|}$
$\Pr[h(S_a) = h(S_b)] = \frac{|X|}{|X| + |Y|}$
• If you apply 100 different minhash functions, you reduce one dimension of the matrix from an unknown large size to 100.
• The probability that two sets share the same minhash value equals the Jaccard similarity between them.
$\Pr[h(S_a) = h(S_b)] = J_{ab}$
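The property can be checked exhaustively on a tiny universe. The sketch below enumerates all 120 permutations of five rows and counts how often two sets share a minhash value; the sets and the helper names are illustrative assumptions, not taken from the slides:

```python
from fractions import Fraction
from itertools import permutations


def minhash(order, s):
    """First element of the permuted order that belongs to the non-empty set s."""
    return next(e for e in order if e in s)


def collision_probability(universe, a, b):
    """Exact fraction of permutations of the universe where a and b share a minhash."""
    perms = list(permutations(universe))
    hits = sum(1 for p in perms if minhash(p, a) == minhash(p, b))
    return Fraction(hits, len(perms))


universe = "abcde"
s_a, s_b = {"a", "d"}, {"b", "d", "e"}  # hypothetical sets
jaccard = Fraction(len(s_a & s_b), len(s_a | s_b))
print(collision_probability(universe, s_a, s_b), jaccard)  # 1/4 1/4
```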
Permutations can be simulated by hash functions
• For the j-th column of the original matrix, find all the non-empty elements and feed their row indexes into the i-th hash function; the minimum output is the element SIG(i, j).
• Hash function: $(a \cdot x + b) \bmod N$,
• where N is a prime equal to or slightly larger than the size of the universal set (the number of rows of the original matrix),
• and a & b must be integers within [1, N-1].
• The result is the signature matrix, where the row index is for hash functions and the column index is for sets.
For example, we use 2 hash functions to simulate 2 permutations: (x+1)%5 and (3x+1)%5, where x is the row index.
[Figure: the resulting signature matrix SIG.]
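A minimal Python sketch of building SIG with these two hash functions. It assumes rows a to e are numbered 0 to 4 and that the four sets are those of the textbook example (S1={a,d}, S2={c}, S3={b,d,e}, S4={a,c,d}); `signature_matrix` is a hypothetical helper name:

```python
def signature_matrix(sets, hash_funcs):
    """SIG[i][j] = minimum of hash_funcs[i](x) over the row indexes x in sets[j]."""
    return [[min(h(x) for x in s) for s in sets] for h in hash_funcs]


# Assumed sets, written as row indexes (a=0, b=1, c=2, d=3, e=4).
sets = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]
hash_funcs = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]

print(signature_matrix(sets, hash_funcs))
# [[1, 3, 0, 1], [0, 2, 0, 0]]
```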
Now that you have the signature matrix, use it instead of the original matrix to do LSH.
• Divide the signature matrix into b bands, each of which has r rows.
• For each band, build an empty hash table and hash each column (the portion within the band) into a bucket, so that only identical band portions are hashed into the same bucket.
• Columns within the same bucket are considered candidates; form pairs from them and calculate their similarities.
• Take the union of the candidates from all bands and filter out the false positives.
Why does LSH work? The amplification effect
[Figure: probability of becoming a candidate versus Jaccard similarity.]
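The S-shaped curve follows from a short calculation, sketched here under the usual assumption that the minhash rows behave independently. Here s is the Jaccard similarity of a pair, r the number of rows per band and b the number of bands:

```latex
\begin{align*}
\Pr[\text{a given band is identical}] &= s^{r} \\
\Pr[\text{a given band differs}] &= 1 - s^{r} \\
\Pr[\text{all } b \text{ bands differ}] &= (1 - s^{r})^{b} \\
\Pr[\text{the pair becomes a candidate}] &= 1 - (1 - s^{r})^{b}
\end{align*}
```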
2. Details of my class project: dataset
• User-to-user links from the Facebook New Orleans network data.
• The data was created by Bimal Viswanath et al. and used for their paper On the Evolution of User Interaction in Facebook.
• It can be downloaded from http://socialnetworks.mpi-sws.org/data/facebook-links.txt.gz
• It has 63,731 persons and 1,545,686 links, and is 10.4 MB in size.
• The data is not large, but as practice I will use Hadoop during this project.
My class project plan:
• First, find similar persons based on users' friend lists, with LSH and Minhash implemented in Hadoop.
• The similar persons are called "close friends" in this project.
• Then recommend to you the persons who are friends of your close friends but not yet friends of yours.
• This generally sounds like Collaborative Filtering.
• Two persons who have similar friend lists are considered "close friends", since they must have some relationship in the real world, e.g., schoolmates, workmates, teammates, …
• If you are a good friend of someone, you may like to know more of his/her friends.
• We do not set too high a threshold for similarity, since finding a duplicate of you is not interesting.
Why not just use common friend counts?
• The classical way is based on the number of common friends.
• However, some persons have a lot of common friends with you but have nothing to do with you, e.g., celebrities, politicians, salesmen who want to sell their stuff through the social network, or even swindlers…
• People use a social network to find friends who can physically reach them, not persons too far away from them.
• Most of my friends may like a pop singer and become friends of him. Based on common friends, the system will recommend that pop singer to me.
• But the pop singer can never remember me, since he has millions of friends on the site.
Preparation work:
• 1. Put the data into the format below, where j represents the j-th person and Pj is the list of friends of person j.
1: P1
2: P2
… …
n: Pn
• 2. In this study, 63731 is both the number of sets to compare and the size of the universal element set, because both the key j and the elements within the set Pj are user ids.
• 3. 63731 is not a prime (101*631). Only a prime modulus can simulate true permutations, so we use 63737 instead, which is equivalent to adding 6 persons who have no friends online.
• 4. Hash function for Minhash:
long N = 63737L; int hashNum = 100;
private long fhash ( int i, long x ) {
    // i selects the hash function, x is the user id being hashed
    return ( 13 + (x - 1) * ( N*i / (3*hashNum) + 1 ) ) % N;
}
1 ≤ x ≤ N, 0 ≤ i ≤ hashNum − 1
Pseudocode of Minhash (Map job only)
• Mapper input: (c, Pc), where Pc represents a list [ j1, j2, …, js ];
• Build a new array s[hashNum] (hashNum=100 here), initialized to infinity everywhere.
• For the i-th hash function, each element jj in Pc is an opportunity to get a lower hash value; the minimum hash value over all jj is the minhash SIG[i,c].
• Output c as key and the content of array s as value.
input (c, Pc), where Pc = [ j1, j2, …, js ]
long[] s = new long[hashNum];
for 0 ≤ ii ≤ hashNum − 1:
    s[ii] = infinity;
end
for jj in [ j1, j2, …, js ]:
    for 0 ≤ ii ≤ hashNum − 1:
        s[ii] = min (s[ii], fhash(ii, jj));
    end
end
Output (c, array s);
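The mapper above can be sketched in plain Python, outside Hadoop. This is an illustration under assumptions, not the project's actual code: the grouping of terms inside `fhash` is our reading of the formula on the slide, and `minhash_mapper` is a hypothetical name.

```python
N = 63737        # prime modulus from the slides
HASH_NUM = 100   # number of hash functions from the slides


def fhash(i: int, x: int) -> int:
    """i-th hash of user id x, of the form (a * x + b) % N.

    Assumption: the multiplier is N*i // (3*HASH_NUM) + 1 (integer division).
    """
    return (13 + (x - 1) * (N * i // (3 * HASH_NUM) + 1)) % N


def minhash_mapper(c, friends):
    """Return (c, signature): per hash function, the minimum hash over the friend list.

    An empty friend list keeps the initial value infinity, as in the pseudocode.
    """
    sig = [min((fhash(i, j) for j in friends), default=float("inf"))
           for i in range(HASH_NUM)]
    return c, sig


key, sig = minhash_mapper(7, [1, 2, 3])
print(key, len(sig), sig[0])  # 7 100 13
```

Because N is prime and every multiplier lies in [1, N-1], each `fhash(i, ·)` maps the ids 1..N onto N distinct values, which is what lets it stand in for a permutation.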
Pseudocode for LSH:
• Mapper input: (j, Sj), where Sj is the j-th column of the signature matrix.
• Split the array Sj into B bands: Sj1, Sj2, …, SjB.
• For the b-th band, compute its hash value and store it in myHash.
• Output the tuple (b, myHash) as key and j as value.
for 1 ≤ b ≤ B:
    myHash = getHashValue(Sjb)
    Output { ( b, myHash ), j }
end
• Reducer input: { (b, aHashValue), [j1, j2, …, jp] }
Now form pairs between j1, …, jp and output them as candidate pairs.
for 1 ≤ x ≤ p − 1:
    for x + 1 ≤ y ≤ p:
        output (jx, jy)
    end
end
One more program is needed to remove duplicates.
Hadoop's sorting procedure helps us gather all the items that both have the same hash value and come from the same band.
Hash function for LSH:
• LSH needs to hash the band portion of a vector into a value.
• Ideally, only identical vectors are hashed into the same bucket.
• An easy way is to directly use the portion's string representation, since Hadoop also uses Text to transport data.
• For example, hash the portion below into a string:
Hash to string: "21,14,36,55"
• In this way, only exactly identical vector portions can come into the same bucket.
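The LSH mapper, the reducer and the string-based bucket key can be simulated in memory. The sketch below is plain Python, not Hadoop code: a dictionary stands in for the shuffle/sort phase, a set stands in for the extra de-duplication program, and the signatures and function names are made up for illustration. It assumes the signature length is divisible by the number of bands.

```python
from collections import defaultdict
from itertools import combinations


def lsh_mapper(j, sig, bands):
    """Emit ((band index, band portion as string), j) for every band of column j."""
    rows = len(sig) // bands
    for b in range(bands):
        portion = sig[b * rows:(b + 1) * rows]
        my_hash = ",".join(map(str, portion))  # e.g. "21,14"
        yield (b, my_hash), j


def lsh_reducer(members):
    """Form every pair among the columns that landed in one bucket."""
    for x, y in combinations(sorted(members), 2):
        yield x, y


def lsh_candidates(signatures, bands):
    buckets = defaultdict(list)          # stands in for Hadoop's shuffle/sort
    for j, sig in signatures.items():
        for key, value in lsh_mapper(j, sig, bands):
            buckets[key].append(value)
    pairs = set()                        # the set removes duplicate pairs
    for members in buckets.values():
        pairs.update(lsh_reducer(members))
    return pairs


# Hypothetical 4-row signatures, split into 2 bands of 2 rows each.
signatures = {
    1: [21, 14, 36, 55],
    2: [21, 14, 7, 8],
    3: [9, 9, 36, 55],
    4: [1, 2, 3, 4],
}
print(sorted(lsh_candidates(signatures, bands=2)))  # [(1, 2), (1, 3)]
```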
Parameter settings:
• We do not want to set the similarity threshold too high, since finding a duplicate of you on the web is not interesting.
• So we set the similarity threshold near 0.1.
• We set B=50 and hashNum=100, so that each band in LSH has R=2 rows.
$P(\text{recommend} \mid x) = 1 - (1 - x^{R})^{B}$
• The S curve grows quickly:
• x=0.1, P=0.39
• x=0.15, P=0.68
• x=0.2, P=0.87
[Figure: the S curve for B=50, R=2.]
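The three values above can be reproduced directly from the formula. A minimal Python check (the function name `candidate_probability` is ours):

```python
def candidate_probability(x: float, rows: int, bands: int) -> float:
    """Probability that a pair with Jaccard similarity x shares at least one band."""
    return 1.0 - (1.0 - x ** rows) ** bands


for x in (0.1, 0.15, 0.2):
    print(x, round(candidate_probability(x, rows=2, bands=50), 2))
# 0.1 0.39
# 0.15 0.68
# 0.2 0.87
```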
Result test:
• $P(\text{recommend} \mid x) = \frac{P(\text{recommend} \,\&\, x \le s < x + dx)}{P(x \le s < x + dx)}$
• $P(\text{recommend} \mid x) = 1 - (1 - x^{R})^{B}$
• The Hadoop output can be analyzed to get $P(\text{recommend} \,\&\, x \le s < x + dx)$.
• For $P(x \le s < x + dx)$, another Hadoop program is written to actually calculate the similarities of all possible pairs (which takes N(N-1)/2 computations), since the dataset is not too large.
• But only the similarities equal to or larger than 0.1 are stored in the output file, because it would take several terabytes to store all the similarities.
[Figure: $P(\text{recommend} \mid x)$ derived from the real dataset (blue) and the theoretical curve (red).]
[Figure: histogram of the LSH-recommended pairs and of all existing pairs (cut at 0.1) within the data.]
Statistics
• LSH & Minhash recommend 1,065,318 pairs.
• There are 660,334 existing pairs that really have s larger than 0.1.
• Their intersection has 429,176 pairs, which covers 65% of the similar pairs (s>0.1).
• But the computation is hundreds of times faster than before.
• $1 - (1 - x^{R})^{B} = \frac{P(\text{recommend} \,\&\, x \le s < x + dx)}{P(x \le s < x + dx)}$
• We define a reference value $x_{Ref}$:
• $1 - (1 - x_{Ref}^{R})^{B} = \frac{\int_{0.1}^{1} P(\text{recommend} \,\&\, x \le s < x + dx)\, dx}{\int_{0.1}^{1} P(x \le s < x + dx)\, dx} = 429176/660334 = 0.649938$
Substituting the parameters B=50, R=2 gives $x_{Ref} = 0.1441$, which is slightly above 0.1.
How to calculate $P(x \le s < x + dx)$? (Similarity Joins Problem)
• To get the exact P.D.F., you need to actually calculate the similarities for all N(N-1)/2 pairs.
• Hadoop can parallelize and speed up the work, but do not use too high a replication rate.
• How about the method below?
Mapper input: (i, Pi)
for 1 ≤ j ≤ N:
    if i < j: output { (i, j), Pi }
    else if i > j: output { (j, i), Pi }
end
Reducer input: { (i, j), [Pi, Pj] }
Output { (i, j), Sij }
• This method has a replication rate of N and will definitely fail. The correct way is to split the persons into G groups.
The correct way to get the similarities between all pairs with Hadoop.
• Mapper input: (i, Pi)
• Determine its group number as u=i%G, where G is the number of groups you split the persons into; G also sets the replication rate (each record is sent to G-1 reducers).
• For 0 ≤ v ≤ G-1:
•     If u < v: Output { (u, v), (i, Pi) }
•     Else if u > v: Output { (v, u), (i, Pi) }
• end
• Reducer input:
• { (u, v), [∀ (i, Pi) ∈ Group u, ∀ (j, Pj) ∈ Group v] }
• Create two empty lists, uList & vList, to separately gather all (i, Pi) that belong to group u and to group v.
For 0 ≤ α ≤ size(uList) − 1:
    Get i and Pi from uList[α]
    For 0 ≤ β ≤ size(vList) − 1:
        Get j and Pj from vList[β]
        If i<j: output { (i, j), Sij }
        Else if i>j: output { (j, i), Sij }
Still within the reducer:
• The above only considers pairs whose elements come from different groups.
• Now we consider pairs of elements within the same group.
• We avoid calculating the same pair multiple times by setting if conditions.
If v==u+1:
    For 0 ≤ α ≤ size(uList) − 2:
        Get i and Pi from uList[α]
        For α + 1 ≤ β ≤ size(uList) − 1:
            Get j and Pj from uList[β]
            If i<j: output { (i, j), Sij }
            Else if i>j: output { (j, i), Sij }
If u==0 & v==G-1:
    For 0 ≤ α ≤ size(vList) − 2:
        Get i and Pi from vList[α]
        For α + 1 ≤ β ≤ size(vList) − 1:
            Get j and Pj from vList[β]
            If i<j: output { (i, j), Sij }
            Else if i>j: output { (j, i), Sij }
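The grouping scheme can be simulated in memory to check that every pair is computed exactly once. The Python sketch below is an illustration, not the Hadoop job itself: a dictionary plays the role of the shuffle, the function names and the sample friend lists are hypothetical, and G is assumed to be at least 2.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_mapper(i, friends, g):
    """Send (i, Pi) to every reducer key (u, v), u < v, that involves i's group."""
    u = i % g
    for v in range(g):
        if u != v:
            yield (min(u, v), max(u, v)), (i, friends)


def group_reducer(key, records, g):
    u, v = key
    u_list = [rec for rec in records if rec[0] % g == u]
    v_list = [rec for rec in records if rec[0] % g == v]
    # Pairs with one member in group u and the other in group v.
    for i, pi in u_list:
        for j, pj in v_list:
            yield (min(i, j), max(i, j)), jaccard(pi, pj)
    # Pairs inside group u are handled only by reducer (u, u+1).
    if v == u + 1:
        for (i, pi), (j, pj) in combinations(u_list, 2):
            yield (min(i, j), max(i, j)), jaccard(pi, pj)
    # Pairs inside the last group are handled only by reducer (0, G-1).
    if u == 0 and v == g - 1:
        for (i, pi), (j, pj) in combinations(v_list, 2):
            yield (min(i, j), max(i, j)), jaccard(pi, pj)


def all_pairs_similarity(friend_lists, g):
    shuffled = defaultdict(list)  # stands in for Hadoop's shuffle/sort
    for i, friends in friend_lists.items():
        for key, value in group_mapper(i, friends, g):
            shuffled[key].append(value)
    results = []
    for key, records in shuffled.items():
        results.extend(group_reducer(key, records, g))
    return results


# Hypothetical friend lists for 5 persons: 5 * 4 / 2 = 10 pairs are expected.
friend_lists = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(len(all_pairs_similarity(friend_lists, g=2)))  # 10
```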
Post-processing work:
• 1. Filter out the false positives by calculating the similarities of the candidate pairs. Then we will have the similar persons ("close friends") for a lot of users.
• The general idea is to use 2 MR jobs:
• The 1st MR job uses i as key and changes (i, j) to (i, j, Pi);
• The 2nd MR job changes (i, j, Pi) to (i, j, Pi, Pj), so you can get the similarity Sij.
• 2. Recommendation: for each user, take the union of his/her close friends' friend lists and filter out the members he/she already knows.
• The general idea is:
• When you have a similar-person list like {a, [b1, b2, …, bs]}, transform it to {a, [Pb1, Pb2, …, Pbs]}, where Pbi is the friend list of person bi.
• Then take the union of the Pbi and finally subtract Pa.
Filter Out False Positives (2 MR jobs)
• 1st Mapper (multiple inputs):
Recommendation data:
    (i, j) → { i, (j, "R") } if i<j
             { j, (i, "R") } if i>j
Friend List data:
    (i, Pi) → { i, (Pi, "F") }
• 1st Reducer: input:
{ i, [ (j, "R") ∀ j candidate-paired with i & j>i; (Pi, "F") ] }
For each j from input:
    Output { j, (i, Pi, "temp") }
• 2nd Mapper (multiple inputs):
Temporary data: pass through
Friend List data:
    (j, Pj) → { j, (Pj, "F") }
• 2nd Reducer: input:
{ j, [ (i, Pi, "temp") ∀ i associated with j; (Pj, "F") ] }
For each i:
    Sij = similarity(Pi, Pj)
    If Sij>=0.1: output { (i, j), Sij }
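The two jobs amount to a two-step join: first attach Pi to each candidate pair, then attach Pj and compute the similarity. A minimal in-memory Python sketch of that flow; the dictionaries stand in for the Hadoop shuffle, and the names and sample data are hypothetical:

```python
from collections import defaultdict


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_false_positives(candidates, friend_lists, threshold=0.1):
    # Job 1: key each candidate pair by its smaller id i, then attach Pi
    # and re-key the record by the larger id j.
    by_i = defaultdict(list)
    for a, b in candidates:
        by_i[min(a, b)].append(max(a, b))
    temp = defaultdict(list)
    for i, js in by_i.items():
        for j in js:
            temp[j].append((i, friend_lists[i]))
    # Job 2: for each j, attach Pj, compute Sij and keep pairs at or above the threshold.
    result = {}
    for j, items in temp.items():
        pj = friend_lists[j]
        for i, pi in items:
            s = jaccard(pi, pj)
            if s >= threshold:
                result[(i, j)] = s
    return result


# Hypothetical data: (1, 2) is a true positive, (3, 1) is a false positive.
friend_lists = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {9}}
print(filter_false_positives({(1, 2), (3, 1)}, friend_lists))  # {(1, 2): 0.5}
```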
Recommendation (3 MR jobs):
• 1st Mapper (multiple inputs):
• Similar persons list data:
    { a, [b1, b2, …, bs] } → { bi, (a, "S") } for all i
• Friend List data:
    (bi, Pbi) → { bi, (Pbi, "F") }
• 1st Reducer: input:
{ bi, [ (a, "S") ∀ a similar to bi; (Pbi, "F") ] }
For each a from input:
    Output { a, Pbi }
• 2nd Mapper: pass through
• 2nd Reducer:
input: { a, [Pb1, Pb2, …, Pbs] }
U = Pb1 ∪ Pb2 ∪ … ∪ Pbs
Output { a, U }
• 3rd Mapper (multiple inputs):
    (i, Ui) → { i, (Ui, "u") }
    (i, Pi) → { i, (Pi, "F") }
• 3rd Reducer: input:
{ i, [ (Ui, "u"), (Pi, "F") ] }
Output { i, Ui − Pi }
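Taken together, the three jobs compute, for each user, the union of the close friends' friend lists minus the user's own friend list. A minimal in-memory Python sketch of that end result (not the Hadoop jobs themselves; the function name and sample data are hypothetical):

```python
def recommend(similar, friend_lists):
    """For each user a: union of the friend lists of a's close friends, minus Pa."""
    recommendations = {}
    for a, close_friends in similar.items():
        union = set()
        for b in close_friends:
            union |= friend_lists.get(b, set())
        recommendations[a] = union - friend_lists.get(a, set())
    return recommendations


# Hypothetical data: user 1 has close friends 2 and 5.
friend_lists = {1: {2, 3}, 2: {1, 3, 4}, 5: {3, 6}}
similar = {1: [2, 5]}
print(sorted(recommend(similar, friend_lists)[1]))  # [1, 4, 6]
```

Note that the user's own id (1 in this example) can survive the subtraction when a close friend lists the user as a friend; the slides do not say how this case is handled.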
References:
• 1. Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets.
• 2. Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks (WOSN '09). ACM, New York, NY, USA, 37-42. DOI=http://dx.doi.org/10.1145/1592665.1592675

ย 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
ย 
Similarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via HashingSimilarity Search in High Dimensions via Hashing
Similarity Search in High Dimensions via Hashing
ย 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
ย 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 update
ย 
Skiena algorithm 2007 lecture06 sorting
Skiena algorithm 2007 lecture06 sortingSkiena algorithm 2007 lecture06 sorting
Skiena algorithm 2007 lecture06 sorting
ย 
Bioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekingeBioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekinge
ย 
Hashing And Hashing Tables
Hashing And Hashing TablesHashing And Hashing Tables
Hashing And Hashing Tables
ย 
Introduction to simulating data to improve your research
Introduction to simulating data to improve your researchIntroduction to simulating data to improve your research
Introduction to simulating data to improve your research
ย 
Project - Deep Locality Sensitive Hashing
Project - Deep Locality Sensitive HashingProject - Deep Locality Sensitive Hashing
Project - Deep Locality Sensitive Hashing
ย 
Hashing
HashingHashing
Hashing
ย 
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdf
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdfLecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdf
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdf
ย 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)
ย 
NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment Analysis
ย 
QBIC
QBICQBIC
QBIC
ย 
How Does Math Matter in Data Science
How Does Math Matter in Data ScienceHow Does Math Matter in Data Science
How Does Math Matter in Data Science
ย 
Data simulation basics
Data simulation basicsData simulation basics
Data simulation basics
ย 
Hash in datastructures by using the c language.pptx
Hash in datastructures by using the c language.pptxHash in datastructures by using the c language.pptx
Hash in datastructures by using the c language.pptx
ย 
Sequence alignment unit 3
Sequence alignment unit 3Sequence alignment unit 3
Sequence alignment unit 3
ย 

More from Chengeng Ma

Certificate ofcompletion extending hadoop for data science streaming spark st...
Certificate ofcompletion extending hadoop for data science streaming spark st...Certificate ofcompletion extending hadoop for data science streaming spark st...
Certificate ofcompletion extending hadoop for data science streaming spark st...Chengeng Ma
ย 
Tang poetry inspiration machine using seq2seq
Tang poetry inspiration machine using seq2seqTang poetry inspiration machine using seq2seq
Tang poetry inspiration machine using seq2seqChengeng Ma
ย 
Tang poetry inspiration machine using char level rnn
Tang poetry inspiration machine using char level rnnTang poetry inspiration machine using char level rnn
Tang poetry inspiration machine using char level rnnChengeng Ma
ย 
Yelp challenge reviews_sentiment_classification
Yelp challenge reviews_sentiment_classificationYelp challenge reviews_sentiment_classification
Yelp challenge reviews_sentiment_classificationChengeng Ma
ย 
A hadoop implementation of pagerank
A hadoop implementation of pagerankA hadoop implementation of pagerank
A hadoop implementation of pagerankChengeng Ma
ย 
Hadoop implementation for algorithms apriori, pcy, son
Hadoop implementation for algorithms apriori, pcy, sonHadoop implementation for algorithms apriori, pcy, son
Hadoop implementation for algorithms apriori, pcy, sonChengeng Ma
ย 

More from Chengeng Ma (6)

Certificate ofcompletion extending hadoop for data science streaming spark st...
Certificate ofcompletion extending hadoop for data science streaming spark st...Certificate ofcompletion extending hadoop for data science streaming spark st...
Certificate ofcompletion extending hadoop for data science streaming spark st...
ย 
Tang poetry inspiration machine using seq2seq
Tang poetry inspiration machine using seq2seqTang poetry inspiration machine using seq2seq
Tang poetry inspiration machine using seq2seq
ย 
Tang poetry inspiration machine using char level rnn
Tang poetry inspiration machine using char level rnnTang poetry inspiration machine using char level rnn
Tang poetry inspiration machine using char level rnn
ย 
Yelp challenge reviews_sentiment_classification
Yelp challenge reviews_sentiment_classificationYelp challenge reviews_sentiment_classification
Yelp challenge reviews_sentiment_classification
ย 
A hadoop implementation of pagerank
A hadoop implementation of pagerankA hadoop implementation of pagerank
A hadoop implementation of pagerank
ย 
Hadoop implementation for algorithms apriori, pcy, son
Hadoop implementation for algorithms apriori, pcy, sonHadoop implementation for algorithms apriori, pcy, son
Hadoop implementation for algorithms apriori, pcy, son
ย 

Recently uploaded

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
ย 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
ย 
Call Girls in Defence Colony Delhi ๐Ÿ’ฏCall Us ๐Ÿ”8264348440๐Ÿ”
Call Girls in Defence Colony Delhi ๐Ÿ’ฏCall Us ๐Ÿ”8264348440๐Ÿ”Call Girls in Defence Colony Delhi ๐Ÿ’ฏCall Us ๐Ÿ”8264348440๐Ÿ”
Call Girls in Defence Colony Delhi ๐Ÿ’ฏCall Us ๐Ÿ”8264348440๐Ÿ”soniya singh
ย 
1:1ๅฎšๅˆถ(UQๆฏ•ไธš่ฏ๏ผ‰ๆ˜†ๅฃซๅ…ฐๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•ไฟฎๆ”น็•™ไฟกๅญฆๅŽ†่ฎค่ฏๅŽŸ็‰ˆไธ€ๆจกไธ€ๆ ท
1:1ๅฎšๅˆถ(UQๆฏ•ไธš่ฏ๏ผ‰ๆ˜†ๅฃซๅ…ฐๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•ไฟฎๆ”น็•™ไฟกๅญฆๅŽ†่ฎค่ฏๅŽŸ็‰ˆไธ€ๆจกไธ€ๆ ท1:1ๅฎšๅˆถ(UQๆฏ•ไธš่ฏ๏ผ‰ๆ˜†ๅฃซๅ…ฐๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•ไฟฎๆ”น็•™ไฟกๅญฆๅŽ†่ฎค่ฏๅŽŸ็‰ˆไธ€ๆจกไธ€ๆ ท
1:1ๅฎšๅˆถ(UQๆฏ•ไธš่ฏ๏ผ‰ๆ˜†ๅฃซๅ…ฐๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•ไฟฎๆ”น็•™ไฟกๅญฆๅŽ†่ฎค่ฏๅŽŸ็‰ˆไธ€ๆจกไธ€ๆ ทvhwb25kk
ย 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
ย 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
ย 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]๐Ÿ“Š Markus Baersch
ย 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
ย 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
ย 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
ย 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
ย 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
ย 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
ย 
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”DelhiRS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhijennyeacort
ย 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
ย 
ไธ“ไธšไธ€ๆฏ”ไธ€็พŽๅ›ฝไฟ„ไบฅไฟ„ๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•pdf็”ตๅญ็‰ˆๅˆถไฝœไฟฎๆ”น
ไธ“ไธšไธ€ๆฏ”ไธ€็พŽๅ›ฝไฟ„ไบฅไฟ„ๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•pdf็”ตๅญ็‰ˆๅˆถไฝœไฟฎๆ”นไธ“ไธšไธ€ๆฏ”ไธ€็พŽๅ›ฝไฟ„ไบฅไฟ„ๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•pdf็”ตๅญ็‰ˆๅˆถไฝœไฟฎๆ”น
ไธ“ไธšไธ€ๆฏ”ไธ€็พŽๅ›ฝไฟ„ไบฅไฟ„ๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•pdf็”ตๅญ็‰ˆๅˆถไฝœไฟฎๆ”นyuu sss
ย 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
ย 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
ย 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
ย 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
ย 

Recently uploaded (20)

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
ย 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
ย 
Call Girls in Defence Colony Delhi ๐Ÿ’ฏCall Us ๐Ÿ”8264348440๐Ÿ”
Call Girls in Defence Colony Delhi ๐Ÿ’ฏCall Us ๐Ÿ”8264348440๐Ÿ”Call Girls in Defence Colony Delhi ๐Ÿ’ฏCall Us ๐Ÿ”8264348440๐Ÿ”
Call Girls in Defence Colony Delhi ๐Ÿ’ฏCall Us ๐Ÿ”8264348440๐Ÿ”
ย 
1:1ๅฎšๅˆถ(UQๆฏ•ไธš่ฏ๏ผ‰ๆ˜†ๅฃซๅ…ฐๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•ไฟฎๆ”น็•™ไฟกๅญฆๅŽ†่ฎค่ฏๅŽŸ็‰ˆไธ€ๆจกไธ€ๆ ท
1:1ๅฎšๅˆถ(UQๆฏ•ไธš่ฏ๏ผ‰ๆ˜†ๅฃซๅ…ฐๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•ไฟฎๆ”น็•™ไฟกๅญฆๅŽ†่ฎค่ฏๅŽŸ็‰ˆไธ€ๆจกไธ€ๆ ท1:1ๅฎšๅˆถ(UQๆฏ•ไธš่ฏ๏ผ‰ๆ˜†ๅฃซๅ…ฐๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•ไฟฎๆ”น็•™ไฟกๅญฆๅŽ†่ฎค่ฏๅŽŸ็‰ˆไธ€ๆจกไธ€ๆ ท
1:1ๅฎšๅˆถ(UQๆฏ•ไธš่ฏ๏ผ‰ๆ˜†ๅฃซๅ…ฐๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•ไฟฎๆ”น็•™ไฟกๅญฆๅŽ†่ฎค่ฏๅŽŸ็‰ˆไธ€ๆจกไธ€ๆ ท
ย 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
ย 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
ย 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
ย 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
ย 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
ย 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
ย 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
ย 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
ย 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
ย 
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”DelhiRS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)โ‡›9711147426๐Ÿ”Delhi
ย 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
ย 
ไธ“ไธšไธ€ๆฏ”ไธ€็พŽๅ›ฝไฟ„ไบฅไฟ„ๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•pdf็”ตๅญ็‰ˆๅˆถไฝœไฟฎๆ”น
ไธ“ไธšไธ€ๆฏ”ไธ€็พŽๅ›ฝไฟ„ไบฅไฟ„ๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•pdf็”ตๅญ็‰ˆๅˆถไฝœไฟฎๆ”นไธ“ไธšไธ€ๆฏ”ไธ€็พŽๅ›ฝไฟ„ไบฅไฟ„ๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•pdf็”ตๅญ็‰ˆๅˆถไฝœไฟฎๆ”น
ไธ“ไธšไธ€ๆฏ”ไธ€็พŽๅ›ฝไฟ„ไบฅไฟ„ๅคงๅญฆๆฏ•ไธš่ฏๆˆ็ปฉๅ•pdf็”ตๅญ็‰ˆๅˆถไฝœไฟฎๆ”น
ย 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
ย 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
ย 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
ย 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
ย 

Local sensitive hashing & minhash on Facebook friend

  • 1. Local Sensitive Hashing & Minhash on Facebook friend links data & friends recommendation. Chengeng Ma, Stony Brook University, 2016/03/05
  • 2. 1. What is Local Sensitive Hash & Minhash? • If you are already familiar with LSH and Minhash, please go directly to page 12. The pages before it only cover the fundamentals of the topic, which are treated in more detail in the book Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.
  • 3. What is LSH & Minhash about? • Local Sensitive Hash (LSH, usually called locality-sensitive hashing) & Minhash are two important methods in Big Data for finding similar items. • On Amazon, if you can find two similar persons, you can recommend to one person the items the other has purchased. • On Google, Baidu, …, users hope the search engine can find pictures similar to the one they have uploaded.
  • 4. Calculating similarity between each pair is a lot of computation (Why LSH?) • If you have $10^6$ items in your data, you need about $0.5 \times 10^{12}$ similarity computations (one per pair, $10^6(10^6-1)/2$ in total) to know the similarity of every pair. • You will need to parallelize a lot of tasks to deal with this amount of computation. • You can do this with the help of Hadoop, but you can do better with the help of LSH & Minhash. • LSH hashes each item to a bucket based on the feature list of that item. • If two items have quite similar feature lists, they have a large probability of being hashed into the same bucket. • You can amplify this effect to different extents by setting parameters. • Finally, you only need to compute similarities for the pairs formed by items within the same bucket.
  • 5. How does Minhash come in? • LSH needs the feature list of each item to be kept in a matrix-like format (the order of the rows matters). • If the size of the universal set is fixed or small, e.g., a fingerprint array, then LSH alone can work well. (Figure: in the example matrix, the 1st column represents the items person S1 has purchased, and the 1st row represents who has purchased item a.)
  • 6. How does Minhash come in? • (Figure: example pair of sets with Jaccard similarity = 2/7.) • However, if the universal set is large or not of fixed size, e.g., the items purchased by each account or the friend lists on a social network, then formatting the dataset as a matrix is not efficient, since the dataset is usually very sparse. • Minhash works in this case, provided the similarity between two feature lists is measured as Jaccard similarity.
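The Jaccard similarity used on this slide is the size of the intersection divided by the size of the union. The following is a minimal sketch of that definition, not code from the deck: the class name `JaccardSketch`, the example sets, and the convention of returning 0 for two empty sets are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class JaccardSketch {
    /**
     * Jaccard similarity |A ∩ B| / |A ∪ B|.
     * Returning 0 when both sets are empty is an assumption of this sketch.
     */
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Hypothetical sets: 2 shared elements out of 7 distinct ones.
        Set<Integer> a = Set.of(1, 2, 3, 4);
        Set<Integer> b = Set.of(3, 4, 5, 6, 7);
        System.out.println(jaccard(a, b));
    }
}
```

The two hypothetical sets share 2 of their 7 distinct elements, so the call in `main` evaluates to 2/7.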
  • 7. What's a Minhash value? • Permute the rows of the original matrix. • For each column (set), the row index of the 1st non-empty element in the permuted order is the minhash value of that column. (Figure: the original matrix and the matrix permuted to the row order b, e, a, d, c, giving H(S1)=a, H(S2)=c, H(S3)=b, H(S4)=a.)
  • 8. Minhash's property (similarity preserved): • There are 3 kinds of rows between sets $S_a$ and $S_b$: (x) both sets have 1; (y) one has 1, the other has 0; (z) both sets have 0. • $J_{ab} = \frac{|X|}{|X| + |Y|}$ and $\Pr[h(S_a) = h(S_b)] = \frac{|X|}{|X| + |Y|}$, because after a random permutation the first row that is not of type (z) decides both minhash values, and they agree exactly when that row is of type (x). • So the probability that two sets share the same minhash value equals the Jaccard similarity between them: $\Pr[h(S_a) = h(S_b)] = J_{ab}$. • If you apply 100 different minhash functions, you reduce one dimension of the matrix from an unknown, large size to 100.
  • 9. Permutations can be simulated by hash functions • For the j-th column of the original matrix, find all the non-empty elements and feed their row indexes into the i-th hash function; the minimum output is the element SIG(i, j). • Hash function: $(a x + b) \bmod N$, where N is a prime equal to or slightly larger than the size of the universal set (the number of rows of the original matrix), and a & b must be integers within [1, N-1]. • The result is the signature matrix SIG, whose row index is for hash functions and whose column index is for sets. • For example, we use 2 hash functions to simulate 2 permutations: (x+1)%5 and (3x+1)%5, where x is the row index.
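The signature computation described on this slide can be sketched with the two example hash functions (x+1)%5 and (3x+1)%5. The matrix itself did not survive in the transcript, so the four columns below are an assumption: they are the textbook example from Mining of Massive Datasets (rows a..e numbered 0..4), which appears consistent with the minhash values quoted on the previous slide. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class ToySignature {
    /** The two hash functions from the slide; each simulates one permutation of 5 rows. */
    static int hash(int i, int x) {
        return i == 0 ? (x + 1) % 5 : (3 * x + 1) % 5;
    }

    /**
     * sets[j] lists the row indexes of the non-empty elements of column j.
     * Returns SIG, where SIG[i][j] is the minimum of hash i over column j.
     */
    static int[][] signature(int[][] sets, int numHashes) {
        int[][] sig = new int[numHashes][sets.length];
        for (int[] row : sig) {
            Arrays.fill(row, Integer.MAX_VALUE); // stands in for "infinity"
        }
        for (int j = 0; j < sets.length; j++) {
            for (int x : sets[j]) {
                for (int i = 0; i < numHashes; i++) {
                    sig[i][j] = Math.min(sig[i][j], hash(i, x));
                }
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        // Assumed example columns: S1={a,d}, S2={c}, S3={b,d,e}, S4={a,c,d}.
        int[][] sets = {{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}};
        System.out.println(Arrays.deepToString(signature(sets, 2)));
    }
}
```

Only the non-empty elements of each column are visited, which is why this approach suits sparse data: the full matrix is never materialized.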
  • 10. Now that you have the signature matrix, use it instead of the original matrix to do LSH. • Divide the signature matrix into b bands, each of which has r rows. • For each band, build an empty hash table and hash each column (the portion within that band) into a bucket, so that only identical band portions are hashed into the same bucket. • Columns within the same bucket are candidates: form pairs from them and calculate their similarities. • Take the union of the candidate pairs from all bands and filter out the false positives.
  • 11. Why does LSH work? The amplification effect (Figure: probability of becoming a candidate versus Jaccard similarity.)
  • 12. 2. Details of my class project: dataset • User-to-user links from the Facebook New Orleans network data. • The data was created by Bimal Viswanath et al. and used for their paper On the Evolution of User Interaction in Facebook. • It can be downloaded from http://socialnetworks.mpi-sws.org/data/facebook-links.txt.gz • It has 63,731 persons and 1,545,686 links, and is 10.4 MB in size. • The data is not large, but as practice I will use Hadoop throughout this project.
  • 13. My class project plan: • First, find similar persons based on users' friend lists, with LSH and Minhash implemented in Hadoop. • The similar persons are called “close friends” in this project. • Then recommend to you the persons who are friends of your close friends but not yet friends of yours. • This generally sounds like Collaborative Filtering. • Two persons who have similar friend lists are considered “close friends”, since they probably have some relationship in the real world, e.g., schoolmates, workmates, teammates, … • If you are a good friend of someone, you may like to get to know more of his/her friends. • We do not set too high a threshold for similarity, since finding a duplicate of you is not interesting.
  • 14. Why not just use common friend counts? • The classical way is based on the number of common friends. • However, there are some persons who have a lot of common friends with you but have nothing to do with you, e.g., celebrities, politicians, salesmen who want to sell their stuff through the social network, or even swindlers… • People use social networks to find friends who can physically reach them, not persons too far away from them. • Most of my friends may like a pop singer and become friends of his. Based on common friends, the system will recommend that pop singer to me. • But the pop singer can never remember me, since he has millions of friends on the site.
  • 15. Preparation work: • 1. Put the data into the format below, where j denotes the j-th person and Pj is the list of friends of person j: 1: P1, 2: P2, …, n: Pn. • 2. In this study, 63731 is both the number of sets to compare and the size of the universal element set, because both the key j and the elements within set Pj are user ids. • 3. 63731 is not a prime (101 × 631), and only a prime modulus can simulate true permutations. We use 63737 instead, which is equivalent to adding 6 persons who have no friends online. • 4. Hash function for Minhash, with N=63737L and hashNum=100: private long fhash ( int i, long x ) { return ( 13 + (x − 1) * ( N*i / (3*hashNum) + 1 ) ) % N; }, where 1 ≤ x ≤ N and 0 ≤ i ≤ hashNum − 1.
  • 16. Pseudocode of Minhash (Map job only) • Mapper input: (c, Pc), where Pc is the friend list [ j1, j2, …, js ]. • Build a new array s[hashNum] (hashNum=100 here), initialized to infinity everywhere. • For the i-th hash function, each element jj in Pc is an opportunity to get a lower hash value; the minimum hash value over all jj is the minhash SIG[i, c]. • Output c as key and the content of array s as value.
      input (c, Pc), where Pc = [ j1, j2, …, js ]
      long[] s = new long[hashNum];
      for 0 ≤ ii ≤ hashNum − 1:
          s[ii] = infinity;
      end
      for jj in [ j1, j2, …, js ]:
          for 0 ≤ ii ≤ hashNum − 1:
              s[ii] = min (s[ii], fhash(ii, jj));
          end
      end
      output (c, array s);
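The mapper pseudocode above can be written as plain Java without the Hadoop plumbing. This is a sketch under assumptions rather than the author's actual job: the class name is hypothetical, `Long.MAX_VALUE` stands in for infinity, and the hash function assumes that the whole expression is reduced modulo N and that N*i/(3*hashNum) uses Java integer division.

```java
import java.util.Arrays;

final class MinhashMapperSketch {
    static final long N = 63737L;    // prime just above the 63,731 user ids
    static final int HASH_NUM = 100; // number of simulated permutations

    /** i-th hash function; the multiplier relies on integer division (assumption). */
    static long fhash(int i, long x) {
        long a = N * i / (3L * HASH_NUM) + 1; // stays within [1, N-1] for 0 <= i < HASH_NUM
        return (13 + (x - 1) * a) % N;
    }

    /** Signature column of one user: per hash function, the minimum over the friend list. */
    static long[] signature(long[] friends) {
        long[] s = new long[HASH_NUM];
        Arrays.fill(s, Long.MAX_VALUE); // "infinity"
        for (long jj : friends) {
            for (int ii = 0; ii < HASH_NUM; ii++) {
                s[ii] = Math.min(s[ii], fhash(ii, jj));
            }
        }
        return s;
    }

    public static void main(String[] args) {
        long[] sig = signature(new long[]{5L, 9L}); // hypothetical friend list
        System.out.println(sig[0] + " " + sig[1]);
    }
}
```

A user with an empty friend list keeps `Long.MAX_VALUE` in every slot, so such columns would all look identical to the LSH step; a real job would probably want to drop them first.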
  • 17. Pseudocode for LSH: • Mapper input: (j, Sj), where Sj is the j-th column of the signature matrix. • Split array Sj into B bands: Sj1, Sj2, …, SjB. • For the b-th band, compute its hash value and store it in myHash. • Output the tuple (b, myHash) as key and j as value.
      for 1 ≤ b ≤ B:
          myHash = getHashValue(Sjb)
          output { (b, myHash), j }
      end
    • Reducer input: { (b, aHashValue), [ j1, j2, …, jp ] }. Hadoop's sorting procedure gathers all the items that both have the same hash value and come from the same band. Now form pairs among j1, …, jp and output them as candidate pairs.
      for 1 ≤ x ≤ p − 1:
          for x + 1 ≤ y ≤ p:
              output (jx, jy)
          end
      end
    • One more program is needed to remove duplicate pairs.
  • 18. Hash function for LSH: • LSH needs to hash the band portion of a signature vector into a value. • Ideally, only identical vectors are hashed into the same bucket. • An easy way is to use the string representation of the band directly, since Hadoop also uses Text to transport data. • For example, the band portion (21, 14, 36, 55) is hashed to the string “21,14,36,55”. • In this way, only exactly identical vector portions fall into the same bucket.
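The banding step with string keys can be sketched in memory as follows. This is an illustration rather than the Hadoop job: a `HashMap` stands in for the shuffle that groups values by key, a `HashSet` stands in for the separate duplicate-removal program, and the class name, method names, and example signatures are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class LshBandingSketch {
    /** Bucket key for band b of one signature column: band number plus comma-joined values. */
    static String bandKey(long[] sig, int b, int r) {
        StringBuilder sb = new StringBuilder().append(b).append(':');
        for (int k = b * r; k < (b + 1) * r; k++) {
            if (k > b * r) {
                sb.append(',');
            }
            sb.append(sig[k]);
        }
        return sb.toString();
    }

    /** Candidate pairs (i < j): columns that agree on at least one whole band. */
    static Set<List<Integer>> candidatePairs(long[][] sigs, int bands, int r) {
        // "Map" phase: group column ids by (band, band content).
        Map<String, List<Integer>> buckets = new HashMap<>();
        for (int j = 0; j < sigs.length; j++) {
            for (int b = 0; b < bands; b++) {
                buckets.computeIfAbsent(bandKey(sigs[j], b, r), key -> new ArrayList<>()).add(j);
            }
        }
        // "Reduce" phase: pair up the members of each bucket; the set removes duplicates.
        Set<List<Integer>> pairs = new HashSet<>();
        for (List<Integer> bucket : buckets.values()) {
            for (int x = 0; x < bucket.size() - 1; x++) {
                for (int y = x + 1; y < bucket.size(); y++) {
                    pairs.add(List.of(bucket.get(x), bucket.get(y)));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // sigs[j] is the signature column of set j (4 hash values, 2 bands of 2 rows).
        long[][] sigs = {{21, 14, 36, 55}, {21, 14, 7, 8}, {1, 2, 36, 55}, {9, 9, 9, 9}};
        System.out.println(candidatePairs(sigs, 2, 2));
    }
}
```

Putting the band number into the key mirrors the (b, myHash) key of the mapper: two columns only meet in a bucket when the same band matches, not when band 0 of one happens to equal band 1 of the other.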
  • 19. Parameter settings: • We do not want to set the similarity threshold too high, since finding a duplicate of you on the web is not interesting. • So we set the similarity threshold near 0.1. • We set B=50 and hashNum=100, so that each band in LSH has R=2 rows. • $P(\text{recommend} \mid x) = 1 - (1 - x^R)^B$ • With B=50, R=2 the S curve grows quickly: X=0.1, P=0.39; X=0.15, P=0.68; X=0.2, P=0.87.
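The S-curve values on this slide can be reproduced directly from the formula. A minimal sketch, with a hypothetical class and method name, where x is the Jaccard similarity of a pair, r the rows per band and b the number of bands:

```java
final class SCurveSketch {
    /** Probability that a pair with Jaccard similarity x agrees on at least one of b bands of r rows. */
    static double candidateProbability(double x, int r, int b) {
        return 1.0 - Math.pow(1.0 - Math.pow(x, r), b);
    }

    public static void main(String[] args) {
        for (double x : new double[]{0.10, 0.15, 0.20}) {
            System.out.printf("x=%.2f  P=%.4f%n", x, candidateProbability(x, 2, 50));
        }
    }
}
```

With R=2 and B=50 the three probabilities come out close to the 0.39, 0.68 and 0.87 quoted above.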
  • 20. Testing the result: • $P(\text{recommend} \mid x) = \frac{P(\text{recommend} \,\&\, x \le s < x + dx)}{P(x \le s < x + dx)}$ • $P(\text{recommend} \mid x) = 1 - (1 - x^R)^B$ • The Hadoop output can be analyzed to get $P(\text{recommend} \,\&\, x \le s < x + dx)$. • For $P(x \le s < x + dx)$, another Hadoop program is written to actually calculate the similarities of all possible pairs (which takes N(N-1)/2 similarity computations), since the dataset is not too large. • But only the similarities equal to or larger than 0.1 are stored in the output file, because it would take several terabytes to store all the similarities.
  • 22. Histogram of LSH recommended pairs and all existing pairs (but cut at 0.1) within the data
  • 23. Statistics • LSH & Minhash recommends 1,065,318 pairs. • There are 660,334 existing pairs that really have s larger than 0.1. • Their intersection has 429,176 pairs, so LSH finds 65% of the truly similar pairs (s > 0.1). • But the computation is hundreds of times faster than before. • $1 - (1 - x^R)^B = \frac{P(\text{recommend} \,\&\, x \le s < x + dx)}{P(x \le s < x + dx)}$ • We define a reference value $x_{Ref}$ by $1 - (1 - x_{Ref}^R)^B = \frac{\int_{0.1}^{1} P(\text{recommend} \,\&\, x \le s < x + dx)\, dx}{\int_{0.1}^{1} P(x \le s < x + dx)\, dx} = 429176/660334 = 0.649938$. • Taking the parameters B=50, R=2 gives $x_{Ref} = 0.1441$, which is slightly above 0.1.
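The reference value can be obtained in closed form by inverting the S-curve: solving $1 - (1 - x^R)^B = p$ for x gives $x = (1 - (1 - p)^{1/B})^{1/R}$. The sketch below applies this to the measured ratio 429176/660334; the class and method names are hypothetical.

```java
final class ReferenceSimilaritySketch {
    /** Inverse of p = 1 - (1 - x^r)^b, i.e. the similarity whose candidate probability is p. */
    static double referenceSimilarity(double p, int r, int b) {
        return Math.pow(1.0 - Math.pow(1.0 - p, 1.0 / b), 1.0 / r);
    }

    public static void main(String[] args) {
        double recall = 429176.0 / 660334.0; // fraction of truly similar pairs that LSH found
        System.out.println(referenceSimilarity(recall, 2, 50));
    }
}
```

For B=50 and R=2 this evaluates to roughly 0.144, in line with the value on the slide.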
  • 24. How to calculate $P(x \le s < x + dx)$? (Similarity Joins Problem) • To get the exact P.D.F., you need to actually calculate the similarities of all N(N-1)/2 pairs. • Hadoop can parallelize and speed this up, but do not use too high a replication rate. • How about the naive method below?
      Mapper input: (i, Pi)
      for 1 ≤ j ≤ N:
          if i < j: output { (i, j), Pi }
          else if i > j: output { (j, i), Pi }
      end
      Reducer input: { (i, j), [Pi, Pj] }
      output { (i, j), Sij }
    • This method has a replication rate of about N and will definitely fail. The correct way is to split the persons into G groups.
  • 25. The correct way to get the similarities of all pairs with Hadoop. • Mapper input: (i, Pi) • Determine its group number as u = i % G, where G is the number of groups you split the persons into; G also sets the replication rate (each record is sent to G − 1 reducers).
      for 0 ≤ v ≤ G − 1:
          if u < v: output { (u, v), (i, Pi) }
          else if u > v: output { (v, u), (i, Pi) }
      end
    • Reducer input: { (u, v), [ ∀ (i, Pi) ∈ Group u, ∀ (j, Pj) ∈ Group v ] } • Create two empty lists, uList & vList, to gather separately all (i, Pi) that belong to group u and to group v.
      for 0 ≤ α ≤ size(uList) − 1:
          get i and Pi from uList[α]
          for 0 ≤ β ≤ size(vList) − 1:
              get j and Pj from vList[β]
              if i < j: output { (i, j), Sij }
              else if i > j: output { (j, i), Sij }
    (Continued on the next page.)
  • 26. Still within the reducer: • The code above only considers pairs whose elements come from different groups. • Now we consider pairs of elements within the same group. • The if conditions make sure each group's internal pairs are calculated by exactly one reducer, so no pair is calculated multiple times.
      if v == u + 1:
          for 0 ≤ α ≤ size(uList) − 2:
              get i and Pi from uList[α]
              for α + 1 ≤ β ≤ size(uList) − 1:
                  get j and Pj from uList[β]
                  if i < j: output { (i, j), Sij }
                  else if i > j: output { (j, i), Sij }
      if u == 0 & v == G − 1:
          for 0 ≤ α ≤ size(vList) − 2:
              get i and Pi from vList[α]
              for α + 1 ≤ β ≤ size(vList) − 1:
                  get j and Pj from vList[β]
                  if i < j: output { (i, j), Sij }
                  else if i > j: output { (j, i), Sij }
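The claim that the grouping scheme touches every pair exactly once can be checked with a small in-memory simulation. The sketch below only does the bookkeeping: it counts how often each pair would be emitted and leaves out the similarity computation itself. It assumes user ids 1..n and G ≥ 2, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GroupedAllPairsSketch {
    /** Counts how many times each pair "i,j" (i < j) is emitted for n users split into g groups. */
    static Map<String, Integer> pairCounts(int n, int g) {
        // "Map" phase: user i belongs to group i % g.
        List<List<Integer>> groups = new ArrayList<>();
        for (int u = 0; u < g; u++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 1; i <= n; i++) {
            groups.get(i % g).add(i);
        }
        // "Reduce" phase: one reducer per key (u, v) with u < v.
        Map<String, Integer> counts = new HashMap<>();
        for (int u = 0; u < g; u++) {
            for (int v = u + 1; v < g; v++) {
                List<Integer> uList = groups.get(u);
                List<Integer> vList = groups.get(v);
                for (int i : uList) {
                    for (int j : vList) {
                        emit(counts, i, j); // pairs across the two groups
                    }
                }
                if (v == u + 1) {
                    within(counts, uList); // internal pairs of groups 0 .. g-2
                }
                if (u == 0 && v == g - 1) {
                    within(counts, vList); // internal pairs of the last group
                }
            }
        }
        return counts;
    }

    static void within(Map<String, Integer> counts, List<Integer> list) {
        for (int a = 0; a < list.size() - 1; a++) {
            for (int b = a + 1; b < list.size(); b++) {
                emit(counts, list.get(a), list.get(b));
            }
        }
    }

    static void emit(Map<String, Integer> counts, int i, int j) {
        counts.merge(Math.min(i, j) + "," + Math.max(i, j), 1, Integer::sum);
    }

    /** True when all n(n-1)/2 pairs are emitted and none is emitted twice. */
    static boolean coversEveryPairOnce(int n, int g) {
        Map<String, Integer> counts = pairCounts(n, g);
        if (counts.size() != n * (n - 1) / 2) {
            return false;
        }
        for (int c : counts.values()) {
            if (c != 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(coversEveryPairOnce(20, 4));
    }
}
```

The two if conditions assign each group's internal pairs to a single reducer: group u (for u < G − 1) is handled by reducer (u, u+1), and the last group by reducer (0, G − 1).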
  • 27. Post-processing work: • 1. Filter out the false positives by calculating the similarities of the candidate pairs. Then we will have the similar persons (“close friends”) for a lot of users. • The general idea is to use 2 MR jobs: the 1st MR job uses i as key and changes (i, j) to (i, j, Pi); the 2nd MR job changes (i, j, Pi) to (i, j, Pi, Pj), so you can get the similarity Sij. • 2. Recommendation: for each user, take the union of his/her close friends' friend lists and filter out the members he/she already knows. • The general idea is: when you have a similar-person list like {a, [b1, b2, …, bs]}, you transform it to {a, [Pb1, Pb2, …, Pbs]}, where Pbi is the friend list of person bi. Then take the union of the Pbi and finally subtract Pa.
  • 28. Filter Out False Positives (2 MR jobs) • 1st Mapper (multiple inputs): Recommendation data: (i, j) → { i, (j, “R”) } if i < j, or { j, (i, “R”) } if i > j. Friend list data: (i, Pi) → { i, (Pi, “F”) }. • 1st Reducer: input { i, [ (j, “R”) ∀ j candidate paired with i & j > i; (Pi, “F”) ] }. For each j from the input: output { j, (i, Pi, “temp”) }. • 2nd Mapper (multiple inputs): Temporary data: pass through. Friend list data: (j, Pj) → { j, (Pj, “F”) }. • 2nd Reducer: input { j, [ (i, Pi, “temp”) ∀ i associated with j; (Pj, “F”) ] }. For each i: Sij = similarity(Pi, Pj); if Sij >= 0.1: output { (i, j), Sij }.
  • 29. Recommendation (3 MR jobs): • 1st Mapper (multiple inputs): Similar-persons list data: { a, [b1, b2, …, bs] } → { bi, (a, “S”) } for all i. Friend list data: (bi, Pbi) → { bi, (Pbi, “F”) }. • 1st Reducer: input { bi, [ (a, “S”) ∀ a similar to bi; (Pbi, “F”) ] }. For each a from the input: output { a, Pbi }. • 2nd Mapper: pass through. • 2nd Reducer: input { a, [Pb1, Pb2, …, Pbs] }. U = Pb1 ∪ Pb2 ∪ … ∪ Pbs. Output { a, U }. • 3rd Mapper (multiple inputs): (i, Ui) → { i, (Ui, “u”) }; (i, Pi) → { i, (Pi, “F”) }. • 3rd Reducer: input { i, [ (Ui, “u”), (Pi, “F”) ] }. Output { i, Ui − Pi }.
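The net effect of the three jobs is a union followed by a set difference per user. The in-memory sketch below shows that logic only, not the MapReduce plumbing. The class name and example data are hypothetical, and removing user a from their own recommendations is an added assumption that the slides do not state.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RecommendationSketch {
    /**
     * friends maps each user to their friend set; closeFriends holds the users
     * found similar to a. Returns (union of the close friends' friend lists) minus Pa.
     */
    static Set<Integer> recommend(int a, List<Integer> closeFriends, Map<Integer, Set<Integer>> friends) {
        Set<Integer> union = new TreeSet<>();
        for (int b : closeFriends) {
            union.addAll(friends.getOrDefault(b, Set.of())); // U = Pb1 ∪ Pb2 ∪ … ∪ Pbs
        }
        union.removeAll(friends.getOrDefault(a, Set.of())); // U − Pa
        union.remove(a); // assumption: never recommend a user to themselves
        return union;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> friends = Map.of(
                1, Set.of(2, 3),
                2, Set.of(1, 3, 4),
                5, Set.of(1, 3, 6));
        // User 1 is assumed to have close friends 2 and 5.
        System.out.println(recommend(1, List.of(2, 5), friends));
    }
}
```

In the example, the union of the two close friends' lists is {1, 3, 4, 6}; removing the existing friends {2, 3} of user 1 and user 1 itself leaves users 4 and 6.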
  • 30. References: • 1. Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets. • 2. Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P. Gummadi. 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM workshop on Online social networks (WOSN '09). ACM, New York, NY, USA, 37-42. DOI=http://dx.doi.org/10.1145/1592665.1592675