Finding Similar Projects in GitHub using Word2Vec and WMD
1. Finding Similar Projects in GitHub using Word2Vec and WMD
MD MASUDUR RAHMAN
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF VIRGINIA
2. Introduction
Given project details (description and source code), the aim is to find functionally similar projects
Finding functionally similar projects is important for:
Application/project recommendation
Code re-use, rapid prototyping
Discovering code plagiarism
CS@UVa 2
How do developers search for similar projects?
3. General Purpose Search (Google)
Query: android browser
Tries to find applications relevant to the query
Not intended for searching source code
4. GitHub Search: android browser
Mostly keyword-based search on textual content
Project name, description, etc.
Open and analyze jar, class, apk, etc.
Might rank irrelevant projects at the top
Less textual content
Use source code content
Augment content with method, class, and API names
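The idea of augmenting sparse text with method, class, and API names can be sketched as follows. This is only an illustration under stated assumptions: the Java snippet, the regular expressions, the camel-case splitting step, and the function name extract_code_features are all hypothetical and are not taken from the presented system, which may well use a proper parser.

```python
import re

# Hypothetical sample source; a real pipeline would read files from a repository.
JAVA_SOURCE = """
import android.webkit.WebView;
import android.app.Activity;

public class BrowserActivity extends Activity {
    private WebView webView;

    public void loadUrl(String url) {
        webView.loadUrl(url);
    }

    private boolean isBookmarked(String url) {
        return false;
    }
}
"""

# Rough patterns (assumptions): good enough for a sketch, not a Java parser.
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)\.(\w+)\s*;", re.MULTILINE)
CLASS_RE = re.compile(r"\bclass\s+(\w+)")
METHOD_RE = re.compile(
    r"\b(?:public|private|protected)\s+(?:static\s+)?[\w<>\[\]]+\s+(\w+)\s*\("
)


def split_camel_case(name):
    """Split an identifier such as 'loadUrl' into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def extract_code_features(source):
    """Collect words from class names, method names, and imported APIs."""
    imports = IMPORT_RE.findall(source)  # list of (package, class) tuples
    return {
        "class_names": [t for n in CLASS_RE.findall(source) for t in split_camel_case(n)],
        "method_names": [t for n in METHOD_RE.findall(source) for t in split_camel_case(n)],
        "api_packages": [pkg for pkg, _ in imports],
        "api_classes": [t for _, cls in imports for t in split_camel_case(cls)],
    }


features = extract_code_features(JAVA_SOURCE)
print(features)
```

The extracted words can then be appended to the project's description and readme text, so that a project with little prose still yields a usable document.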
5. Model Workflow
GitHub Projects
Data Preprocessing (per feature): Tokenization, Normalization, Stemming, Stopwords Removal, TF-IDF score based word filtering
Feature Extraction: Description, Readme, Method & Class Name, API Package Name, API Class Name
Document Generation (combined all features)
Search Interface
Candidate Project Documents
Query Project Documents
Document Similarity Computation (Word2Vec, WMD)
Search Result (Ranked list of similar projects)
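The preprocessing stage named in the workflow can be sketched roughly as below. This is a minimal illustration, not the presented implementation: the stopword list, the threshold value, and the choice to drop words whose TF-IDF score falls below the threshold are assumptions, and stemming is left out (a real pipeline might add a Porter stemmer).

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "is", "in", "on", "with"}


def tokenize(text):
    """Tokenization and normalization: lowercase, split on non-letters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def preprocess(text):
    """Tokenize, normalize, and remove stopwords."""
    return [t for t in tokenize(text) if t not in STOPWORDS]


def tfidf_filter(docs, min_score):
    """Keep only words whose TF-IDF score in their document reaches min_score."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    filtered = []
    for doc in docs:
        term_freq = Counter(doc)
        keep = [
            w for w in doc
            if (term_freq[w] / len(doc)) * math.log(n_docs / doc_freq[w]) >= min_score
        ]
        filtered.append(keep)
    return filtered


# Hypothetical project descriptions.
raw = [
    "Image gallery app for the Lollipop app",
    "Android photo viewer app",
    "A web browser app for Android",
]
docs = [preprocess(text) for text in raw]
print(tfidf_filter(docs, min_score=0.05))
```

In this toy corpus the word "app" occurs in every document, so its inverse document frequency is zero and it is filtered out, while the more distinctive words survive.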
7. How to measure document similarity?
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
Keyword-based cosine similarity, Bag of Words (BOW)
No common keyword!
Cosine similarity = 0
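The zero score can be reproduced with a small bag-of-words sketch. The function name and the whitespace tokenization are illustrative assumptions; the two documents are the ones from the slide.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())  # shared words only
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = "image gallery app for lollipop".split()
doc2 = "android photo viewer".split()
print(bow_cosine(doc1, doc2))  # 0.0, because the documents share no word
```

Since the dot product only counts words that appear in both documents, any pair of documents with disjoint vocabularies scores 0, no matter how close they are in meaning.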
8. How to measure document similarity?
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
Word Embedding
[Figure: words of the two documents linked by weights w1, w2, w3, w4]
9. Word Embedding
“You shall know a word by the company it keeps” (J. R. Firth, 1957)
Open source upgrade path for Odoo/OpenERP
Plugin to check for obvious upgrade points on the path to 3.0
Codes related to upgrade project
Demo app to demonstrate how to upgrade from Angular 1 to Angular 2
Learn the word vector for "upgrade" from its surrounding words
Word2Vec
upgrade: [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
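To make "surrounding words" concrete, the sketch below collects the (target, context) pairs that a skip-gram style Word2Vec model would be trained on. It shows only the data preparation step, not the training itself; the window size of 2 and the function name context_pairs are assumptions, and the sentences are the lowercased examples from the slide.

```python
def context_pairs(tokens, target, window=2):
    """Return (target, context) pairs within `window` words of each target occurrence."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


sentences = [
    "open source upgrade path for odoo openerp",
    "plugin to check for obvious upgrade points on the path to 3.0",
    "codes related to upgrade project",
    "demo app to demonstrate how to upgrade from angular 1 to angular 2",
]

for sentence in sentences:
    print(context_pairs(sentence.split(), "upgrade"))
```

Words that end up with similar sets of context words, such as "upgrade" and "update", would be pushed toward similar vectors during training.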
10. Word2Vec
Input: Text corpus
Training: Word2Vec Model
Output: Word vectors (Word Embedding)
upgrade: [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
12. Example Word Embedding
In the embedding space, words with similar meanings cluster together
image, photo, picture, figure
sample, example, demo, illustration
upgrade, update, modify, change
install, setup, launch
dimension, size, height, length, range
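The clustering can be illustrated with cosine similarity between word vectors. The three-dimensional vectors below are made up for the example (real Word2Vec vectors are learned and typically have a hundred or more dimensions), and the helper names are hypothetical.

```python
import math

# Hypothetical toy vectors, hand-picked so that related words point the same way.
TOY_VECTORS = {
    "image": [0.9, 0.1, 0.0],
    "photo": [0.8, 0.2, 0.1],
    "picture": [0.85, 0.15, 0.05],
    "upgrade": [0.1, 0.9, 0.1],
    "update": [0.15, 0.85, 0.2],
    "height": [0.0, 0.2, 0.9],
    "length": [0.1, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def most_similar(word, k=2):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    scores = [
        (other, cosine(TOY_VECTORS[word], vec))
        for other, vec in TOY_VECTORS.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:k]]


print(most_similar("image"))
print(most_similar("upgrade", k=1))
```

With these toy values, the nearest neighbours of "image" are "picture" and "photo", while "upgrade" is closest to "update".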
Embedding for each word
How to get document/sentence level similarity?
Word Mover’s Distance (WMD)
18. Word Mover’s Distance
D1: image, gallery, app, Lollipop
D2: android, photo, viewer
Word-to-word distances: 0.15, 0.2, 0.1, 0.1
Similarity Score(D1, D2) = 0.1 + 0.2 + 0.15 + 0.1 = 0.55
Smaller score means more similar
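The arithmetic on this slide can be sketched as follows. Note the assumptions: the slide gives four distances but not which word pairs they belong to, so the pairing and every other entry in the distance table below are hypothetical. The sketch is also a simplification in which each word of D1 moves entirely to its nearest word in D2; full Word Mover's Distance instead solves an optimal transport problem over normalized word weights.

```python
# Hypothetical word-to-word distances in embedding space.
# Only the values 0.1, 0.2, 0.15, 0.1 come from the slide; their pairing is assumed.
DIST = {
    "image": {"android": 0.7, "photo": 0.1, "viewer": 0.5},
    "gallery": {"android": 0.8, "photo": 0.3, "viewer": 0.2},
    "app": {"android": 0.15, "photo": 0.6, "viewer": 0.4},
    "lollipop": {"android": 0.1, "photo": 0.9, "viewer": 0.8},
}


def nearest_word_cost(doc_a, doc_b, dist):
    """Sum, over words in doc_a, of the distance to the closest word in doc_b."""
    return sum(min(dist[a][b] for b in doc_b) for a in doc_a)


d1 = ["image", "gallery", "app", "lollipop"]
d2 = ["android", "photo", "viewer"]
score = nearest_word_cost(d1, d2, DIST)
print(round(score, 2))  # 0.55 with the assumed table
```

A smaller total means the words of one document have to travel less far to reach the other, so the documents are more similar. In practice, a library routine such as gensim's `wmdistance` could be used on trained vectors rather than a hand-written table.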
19. Preliminary Results
Query/Rank | Project Name | Description | Project Type
Query | android_browser | Customize android webclient (source code with readme file) | Lightning based android browser
1 | Myfacebook | MyFacebook source code | Lightning based android browser
2 | Speed-Browser-4G-Plus | Speed Browser 4G Plus is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0.. | Lightning based android browser
3 | Web-browser | Web browser is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0.. | Lightning based android browser
4 | JumpGo | JumpGo Web Browser for Android | JumpGo Android Browser
5 | VChrome | Build an test browser for Viettel in job interview | Android Browser
20. Summary
We proposed a model for finding functionally similar projects in GitHub
Used textual and source code content to construct documents
Measured similarity between documents using Word Mover’s Distance
Leveraged Word2Vec word embeddings
Hello everyone,
I am Masudur Rahman, a PhD student in the Department of Computer Science at the University of Virginia. I will present our work on finding similar projects in GitHub, in which we used Word Mover's Distance and Word2Vec word embeddings.
Finding functionally similar projects is very important for app recommendation, code re-use, rapid prototyping, and plagiarism checking.
There is no convenient way to search for similar projects using all of the project information (source code, description, readme, etc.).
No surprise! Google tries to find applications relevant to the query; it is not intended for project-level search of source code. We might augment the query to get more meaningful results for the developer, but the intent of these general-purpose search engines remains the same: they will try to find applications, not the source code that a developer might be willing to use.
We will see how we incorporated method, class, and API names to augment the textual information.
Let’s consider these two documents. They share no common keyword, so keyword-based cosine similarity gives 0, which would mean they are totally dissimilar. But they are not; they even convey the same meaning. In project documentation, developers often use different words to represent the same thing.
Though these two documents are similar in meaning, ordinary keyword-based similarity cannot capture this.
If we look closely, android and lollipop are similar in meaning, and the same holds for the other keywords. Now, instead of matching words exactly, can we assign a value to a pair of words that indicates how similar they are in meaning? Yes, we can.
Learn a weight w, where a higher weight means strongly similar and a lower weight means less similar.
Intuition: similar words tend to appear with the same context words.
One of the most effective ways of doing this is Word2Vec.
How can we convert this word-level information into sentence-level similarity? WMD leverages the word embedding to compute sentence-level similarity.