Finding Similar Projects in GitHub using Word2Vec and WMD
MD MASUDUR RAHMAN
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF VIRGINIA
Introduction
Given project details (description and source code), the aim is to find functionally similar projects.
Finding functionally similar projects is important for:
- Application/project recommendation
- Code re-use and rapid prototyping
- Discovering code plagiarism
CS@UVa
How do developers search for similar projects?
General-Purpose Search (Google)
Query: android browser
- Tries to find applications relevant to the query
- Not intended for searching source code
GitHub Search: android browser
- Mostly keyword-based search on textual content (project name, description, etc.)
- Open and analyze jar, class, apk, etc.
- Might rank irrelevant projects at the top
- Less textual content
Use source code content
- Augment the content with method, class, and API names
Model Workflow
- GitHub Projects
- Data Preprocessing, per feature (Tokenization, Normalization, Stemming, Stopword Removal, TF-IDF score-based word filtering)
- Feature Extraction (Description, Readme, Method & Class Name, API Package Name, API Class Name)
- Document Generation (all features combined)
- Search Interface
- Candidate Project Documents
- Query Project Documents
- Document Similarity Computation (Word2Vec, WMD)
- Search Result (ranked list of similar projects)
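The preprocessing and document generation stages listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword list, the function names (`split_identifier`, `preprocess`, `build_document`), and the feature keys are hypothetical, and stemming and TF-IDF score-based filtering are left out to keep the sketch short.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "in", "on", "is"}


def split_identifier(name):
    """Split a code identifier such as 'loadImageFromUrl' or 'photo_viewer' into words."""
    words = []
    # First split on anything that is not a letter or digit (underscores, dots, ...).
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Then split camelCase / PascalCase boundaries, keeping acronyms together.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return words


def preprocess(text):
    """Tokenize, lowercase (normalize), and drop stopwords."""
    tokens = []
    for raw in text.split():
        for word in split_identifier(raw):
            word = word.lower()
            if word not in STOPWORDS:
                tokens.append(word)
    return tokens


def build_document(features):
    """Preprocess each feature separately, then combine all features into one document."""
    document = []
    for text in features.values():
        document.extend(preprocess(text))
    return document


# Hypothetical project features, not taken from the data set used in the talk.
project = {
    "description": "Android photo viewer",
    "method_names": "loadImageFromUrl",
}
print(build_document(project))
```

Splitting identifiers is one possible way to turn method, class, and API names into ordinary words that share a vocabulary with the description and readme text.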
How to measure document similarity?
Keyword-based cosine similarity over a bag of words (BOW)
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
No common keyword!
Cosine similarity = 0
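The zero score can be reproduced with a few lines of Python. This is a minimal sketch assuming plain whitespace tokenization and raw term counts; the function name `bow_cosine` is made up for illustration.

```python
import math
from collections import Counter


def bow_cosine(doc_a, doc_b):
    """Cosine similarity between two documents represented as bags of words."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = "image gallery app for Lollipop"
doc2 = "android photo viewer"
print(bow_cosine(doc1, doc2))  # 0.0, because the two documents share no word
```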
How to measure document similarity?
Word embedding
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
[Figure: words of Document 1 linked to words of Document 2 by similarity weights w1, w2, w3, w4]
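One common way to obtain such a weight between two words is the cosine similarity of their embedding vectors. The sketch below assumes that choice; the 3-dimensional vectors are invented for illustration and do not come from a trained model.

```python
import math

# Hypothetical word vectors, invented for illustration only.
vectors = {
    "image": [0.9, 0.1, 0.2],
    "photo": [0.8, 0.2, 0.3],
    "lollipop": [0.1, 0.9, 0.3],
    "android": [0.2, 0.8, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Related words get a higher weight than unrelated ones, even without a shared keyword.
print(cosine(vectors["image"], vectors["photo"]))
print(cosine(vectors["lollipop"], vectors["android"]))
print(cosine(vectors["image"], vectors["android"]))
```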
Word Embedding
“You shall know a word by the company it keeps” (J. R. Firth, 1957)
Example project descriptions containing the word "upgrade":
- Open source upgrade path for Odoo/OpenERP
- Plugin to check for obvious upgrade points on the path to 3.0
- Codes related to upgrade project
- Demo app to demonstrate how to upgrade from Angular 1 to Angular 2
Learn the word vector for "upgrade" from its surrounding words
- Word2Vec
upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
Word2Vec
Input: text corpus
Training: Word2Vec model
Output: word vectors (word embedding), e.g. upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
Word2Vec Model
Skip-gram
Document: image gallery app for android
[Figure: skip-gram model over the words image, gallery, app, for, android]
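Skip-gram trains on (center word, context word) pairs taken from a sliding window over the text. The sketch below only generates those training pairs for the example document; the window size of 2 is an assumption, and the training of the network itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "image gallery app for android".split()
for center, context in skipgram_pairs(tokens):
    if center == "app":
        print(center, "->", context)
```

With this window, the center word "app" is paired with image, gallery, for, and android, so its vector is learned from all four neighbors.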
Example Word Embedding
In the embedded space, words with similar meanings are clustered together:
- image, photo, picture, figure
- sample, example, demo, illustration
- upgrade, update, modify, change
- install, setup, launch
- dimension, size, height, length, range
Embedding for each word
How to get document/sentence-level similarity?
- Word Mover’s Distance (WMD)
Word Mover’s Distance (WMD)
D1: image, gallery, app, Lollipop
D2: android, photo, viewer
[Figure, built up over several slides: each D1 word is compared with the three D2 words. Distances shown per step: (0.1, 0.5, 0.7), (0.35, 0.2, 0.6), (0.35, 0.15, 0.2), (0.4, 0.3, 0.1). The final step keeps one distance per D1 word: 0.1, 0.2, 0.15, 0.1.]
Similarity Score(D1, D2) = 0.1 + 0.2 + 0.15 + 0.1 = 0.55
A smaller score means more similar
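The worked example matches each D1 word with its closest D2 word and adds up the distances. A minimal sketch of that idea is given below. It implements a relaxed variant of WMD (nearest-word matching in both directions, which is a lower bound on the full transport problem), not the exact optimal-transport computation, and it averages the distances instead of summing them so that documents of different lengths stay comparable. The 2-D vectors and the third document are invented for illustration.

```python
import math

# Hypothetical 2-D embeddings, invented for illustration only.
EMB = {
    "image": (0.90, 0.10), "photo": (0.85, 0.15),
    "gallery": (0.70, 0.40), "viewer": (0.65, 0.45),
    "app": (0.30, 0.80), "android": (0.15, 0.85), "lollipop": (0.10, 0.90),
    "database": (-0.80, -0.50), "server": (-0.70, -0.60),
}


def word_distance(u, v):
    """Euclidean distance between the embeddings of two words."""
    (x1, y1), (x2, y2) = EMB[u], EMB[v]
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)


def one_sided(src, dst):
    """Each source word has weight 1/len(src) and moves entirely to its nearest target word."""
    return sum(min(word_distance(s, t) for t in dst) for s in src) / len(src)


def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs."""
    return max(one_sided(doc_a, doc_b), one_sided(doc_b, doc_a))


# Documents are lists of distinct, already preprocessed words (an assumption).
d1 = ["image", "gallery", "app", "lollipop"]
d2 = ["android", "photo", "viewer"]
d3 = ["database", "server"]
print(relaxed_wmd(d1, d2))  # small distance: similar meaning, no shared keyword
print(relaxed_wmd(d1, d3))  # larger distance: unrelated documents
```

Ranking candidate projects then amounts to sorting them by this distance to the query document, smallest first.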
Preliminary Results
Query/Rank | Project Name | Description | Project Type
---------- | ------------ | ----------- | ------------
Query | android_browser | Customize android webclient (source code with readme file) | Lightning based android browser
1 | Myfacebook | MyFacebook source code | Lightning based android browser
2 | Speed-Browser-4G-Plus | Speed Browser 4G Plus is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. | Lightning based android browser
3 | Web-browser | Web browser is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. | Lightning based android browser
4 | JumpGo | JumpGo Web Browser for Android | JumpGo Android Browser
5 | VChrome | Build an test browser for Viettel in job interview | Android Browser
Summary
- We proposed a model for finding functionally similar projects in GitHub
- Used textual and source code content to construct a document for each project
- Measured similarity between documents using Word Mover’s Distance
- Leveraged Word2Vec word embeddings
Reference
- Word2Vec: Gensim Python library, https://radimrehurek.com/gensim/models/word2vec.html
- WMD: https://github.com/mkusner/wmd
- Wikipedia dump: https://dumps.wikimedia.org/enwiki/
- GitHub projects data: the GHTorrent project, http://ghtorrent.org/
Questions?


Editor's Notes

  1. Hello everyone, I am Masudur Rahman, a PhD student in the Department of Computer Science at the University of Virginia. I will present our work on finding similar projects in GitHub, where we used Word Mover’s Distance and Word2Vec word embeddings.
  2. Finding functionally similar projects is very important for app recommendation, code re-use, rapid prototyping, and plagiarism checking.
  3. There is no convenient way to search for similar projects using all the project information (source code, description, readme, etc.). No surprise! Google tries to find applications relevant to the query; it is not intended for project-level search over source code. We might augment the query to get more meaningful results for the developer, but the intent of a general-purpose search engine stays the same: it will try to find applications, not source code that the developer might want to use.
  4. There is no convenient way to search for similar projects using all the project information (source code, description, readme, etc.). We will see how we incorporated method, class, and API names to augment the textual information.
  5. Let’s consider these two documents. They have no keyword in common, so keyword-based cosine similarity gives 0, which says they are totally dissimilar. But they are not; they even have the same meaning. In project documentation, developers often use different words for the same thing. Although these two documents are similar in meaning, ordinary keyword-based similarity cannot capture this.
  6. If we look closely, android and lollipop are similar in meaning, and the same holds for the other keywords. Instead of matching words exactly, can we assign each pair of words a value that indicates how similar they are in meaning? Yes, we can: learn a weight w, where a higher weight means strongly similar and a lower weight means less similar.
  7. Intuition: similar words appear with the same context words. One of the most effective ways of learning this is Word2Vec.
  8. How can we convert this word-level information into sentence-level similarity? WMD leverages the word embedding to get sentence-level similarity.