Similarity of Source Code in the Presence of Pervasive Modifications
Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark
Centre for Research on Evolution, Search and Testing (CREST)
Dept. of Computer Science, UCL, London, UK
Similarity of Source Code in the Presence of Pervasive Modifications — C. Ragkhitwetsagul, J. Krinke, D. Clark — CREST, UCL, UK
Pervasive Modifications
/* ORIGINAL */
private static int partition(Comparable[] a, int lo, int hi) {
    int i = lo;
    int j = hi+1;
    Comparable v = a[lo];
    while (true) {
        while (less(a[++i], v)) {
            if (i == hi) break;
        }
        while (less(v, a[--j])) {
            if (j == lo) break;
        }
        if (i >= j) break;
        exch(a, i, j);
    }
    exch(a, lo, j);
    return j;
}
/* PERVASIVELY MODIFIED CODE */
private static int partition(int[] bob, int left, int right){
    int x = left;
    int y = right+1;
    for (;;) {
        while (less(bob[left],bob[--y]))
            if (y == left) break;
        while (less(bob[++x],bob[left]))
            if (x == right) break;
        if (x >= y) break;
        swap(bob, y, x);
    }
    swap(bob, y, left);
    return y;
}
From: https://www.princeton.edu/pr/pub/integrity/pages/plagiarism/
Pervasive Modifications
Changes affecting many locations in the whole method, file, or project
Examples: layout changes, identifier renaming, API changes, refactoring
Arise in code cloning, software plagiarism, and software evolution
Do not include (strong) code obfuscation
When source code is pervasively modified, which similarity detection techniques or tools get the most accurate results?
30 Similarity Analysers
Clone detectors: CCFinderX, iClones, Simian, NiCad, Deckard
Plagiarism detectors: JPlag, Plaggie, Sherlock, Sim
Compression: 7zncd, bzip2ncd, gzipncd, xz-ncd, icd, ncd
Others: diff, bsdiff, difflib, fuzzywuzzy, jellyfish, ngram, sklearn
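The "ncd" tools in the compression group are based on normalised compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The sketch below is only an illustration of that formula: it uses the JDK's DEFLATE compressor rather than any of the compressors named above, and the class and method names are hypothetical, not taken from the evaluated tools.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative sketch of normalised compression distance (NCD).
// Assumption: DEFLATE via java.util.zip stands in for the compressors
// (bzip2, gzip, xz, 7z) used by the tools on the slide.
final class Ncd {

    /** Returns the number of bytes the input occupies after DEFLATE compression. */
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). */
    static double ncd(String x, String y) {
        int cx = compressedSize(x.getBytes(StandardCharsets.UTF_8));
        int cy = compressedSize(y.getBytes(StandardCharsets.UTF_8));
        int cxy = compressedSize((x + y).getBytes(StandardCharsets.UTF_8));
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    public static void main(String[] args) {
        String original = "int i = lo; int j = hi+1; while (true) { exch(a, i, j); }";
        String renamed = "int x = left; int y = right+1; for (;;) { swap(bob, y, x); }";
        System.out.println("NCD(original, original) = " + ncd(original, original));
        System.out.println("NCD(original, renamed)  = " + ncd(original, renamed));
    }
}
```

A value near 0 means the two inputs share most of their information; values near 1 mean they share little. How each tool maps this distance onto a similarity score is not stated on the slide.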
Test Data Generation
Original source: InfixConverter.java, SqrtAlgorithm.java, Hanoi.java, Queens.java, MagicSquare.java
Source obfuscator: ARTIFICE
Compiler: javac
Bytecode obfuscator: ProGuard
Decompilers: Krakatau, Procyon
Result: pervasively modified code to be used in detection phase
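The stages above combine into several variants of each original file. A minimal sketch of that enumeration follows, assuming the naming scheme visible in the similarity report labels (for example orig_no_krakatau and artifice_pg_procyon), with "no" read as "no bytecode obfuscation" and "pg" as "ProGuard". That reading, and the class and method names, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: enumerate the variants generated for one program.
// Assumption: labels follow the pattern <program>/<source>_<bytecode>_<decompiler>.
final class VariantNames {

    static List<String> variantsOf(String program) {
        String[] sources = {"orig", "artifice"};         // original or ARTIFICE-obfuscated source
        String[] bytecode = {"no", "pg"};                // assumed: no obfuscation or ProGuard
        String[] decompilers = {"krakatau", "procyon"};  // decompiler applied to the bytecode

        List<String> names = new ArrayList<>();
        // The two source-level variants are kept as they are.
        for (String s : sources) {
            names.add(program + "/" + s);
        }
        // Every source variant is compiled, optionally obfuscated, then decompiled.
        for (String s : sources) {
            for (String b : bytecode) {
                for (String d : decompilers) {
                    names.add(program + "/" + s + "_" + b + "_" + d);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : variantsOf("InfConv")) {
            System.out.println(name);
        }
    }
}
```

Under this reading each program yields 2 + 2 x 2 x 2 = 10 files, which matches the ten InfConv rows in the similarity report.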
Parameter Settings
Similarity Report
Columns follow the row order: 1 = InfConv/orig, 2 = InfConv/artifice, 3 = InfConv/orig_no_krakatau, 4 = InfConv/orig_no_procyon, 5 = InfConv/orig_pg_krakatau, 6 = InfConv/orig_pg_procyon, 7 = InfConv/artifice_no_krakatau, 8 = InfConv/artifice_no_procyon, 9 = InfConv/artifice_pg_krakatau, 10 = InfConv/artifice_pg_procyon, 11 = Sqrt/orig, 12 = Sqrt/artifice, K = Square/artifice_pg_krakatau, P = Square/artifice_pg_procyon. An ellipsis marks omitted rows and columns.

                                 1   2   3   4   5   6   7   8   9  10  11  12   …   K   P
InfConv/orig                   100  55  36  63  32  43  34  60  31  43  20  20   …  14  17
InfConv/artifice                55 100  35  54  33  39  37  56  32  39  19  30   …  14  17
InfConv/orig_no_krakatau        36  35 100  38  60  26  80  35  59  26  13  14   …  28  17
InfConv/orig_no_procyon         63  54  38 100  34  58  37  80  34  58  21  20   …  15  21
InfConv/orig_pg_krakatau        32  33  60  34 100  33  61  33  82  33  17  17   …  29  20
InfConv/orig_pg_procyon         43  39  26  58  33 100  26  59  33 100  19  20   …  14  21
InfConv/artifice_no_krakatau    34  37  80  37  61  26 100  36  59  26  14  14   …  28  17
InfConv/artifice_no_procyon     60  56  35  80  33  59  36 100  32  59  19  20   …  15  19
InfConv/artifice_pg_krakatau    31  32  59  34  82  33  59  32 100  33  15  16   …  28  17
InfConv/artifice_pg_procyon     43  39  26  58  33 100  26  59  33 100  19  20   …  14  21
Sqrt/orig                       20  19  13  21  17  19  14  19  15  19 100  32   …  14  16
Sqrt/artifice                   20  30  14  20  17  20  14  20  16  20  32 100   …  15  18
…                                …   …   …   …   …   …   …   …   …   …   …   …   …   …   …
Square/artifice_pg_krakatau     14  14  28  15  29  14  28  15  28  14  14  15   … 100  32
Square/artifice_pg_procyon      17  17  17  21  20  21  17  19  17  21  16  18   …  32 100
Similarity Threshold = 50
The similarity report is the same as on the previous slide. With a threshold of 50, file pairs whose score reaches 50 are reported as similar (for example InfConv/orig vs InfConv/artifice at 55), and pairs scoring below 50 are not (for example Sqrt/orig vs Sqrt/artifice at 32).
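To make the thresholding step concrete, the sketch below applies T = 50 to a four-file excerpt of the similarity report and counts true and false positives. It rests on three assumptions that the slides do not spell out: a pair counts as similar when its score is at or above the threshold, two files are truly similar when they derive from the same program (the part of the label before the slash), and every ordered pair, including a file with itself, is evaluated. The class and method names are hypothetical.

```java
// Illustrative sketch: classify file pairs with a similarity threshold and
// score the result against a ground truth derived from the file labels.
final class ThresholdEval {

    // Four-file excerpt of the similarity report.
    static final String[] LABELS = {
        "InfConv/orig", "InfConv/artifice", "Sqrt/orig", "Sqrt/artifice"
    };
    static final int[][] SIM = {
        {100,  55,  20,  20},
        { 55, 100,  19,  30},
        { 20,  19, 100,  32},
        { 20,  30,  32, 100},
    };

    /** Program name, i.e. the part of the label before the slash. */
    static String program(String label) {
        return label.substring(0, label.indexOf('/'));
    }

    /** Returns {TP, FP, FN, TN} over all ordered pairs (assumed evaluation set). */
    static int[] confusion(int[][] sim, String[] labels, int threshold) {
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < labels.length; i++) {
            for (int j = 0; j < labels.length; j++) {
                boolean predicted = sim[i][j] >= threshold;  // assumed: at or above T
                boolean actual = program(labels[i]).equals(program(labels[j]));
                if (predicted && actual) tp++;
                else if (predicted) fp++;
                else if (actual) fn++;
                else tn++;
            }
        }
        return new int[] {tp, fp, fn, tn};
    }

    /** F1 = 2TP / (2TP + FP + FN), or 0 when nothing is predicted or expected. */
    static double f1(int[] c) {
        int denominator = 2 * c[0] + c[1] + c[2];
        return denominator == 0 ? 0.0 : 2.0 * c[0] / denominator;
    }

    public static void main(String[] args) {
        int[] c = confusion(SIM, LABELS, 50);
        System.out.println("TP=" + c[0] + " FP=" + c[1] + " FN=" + c[2] + " TN=" + c[3]);
        System.out.println("F1=" + f1(c));
    }
}
```

On this excerpt, T = 50 misses the Sqrt/orig and Sqrt/artifice pair (score 32) in both directions, so the threshold trades recall for precision.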
Best Threshold
[Plot: F-measure (axis 0.00 to 0.90) against threshold value T (0 to 100); the best threshold is T = 31, where F-measure = 0.8282]
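Finding the best threshold amounts to sweeping T over 0 to 100 and keeping the value with the highest F-measure. A minimal sketch of such a sweep is given below. The scores and labels in main are made-up illustration data, not results from the study; the "at or above T" rule and the tie-break (the lowest T wins) are assumptions, and all names are hypothetical.

```java
// Illustrative sketch: sweep the similarity threshold and keep the one
// that maximises the F-measure.
final class ThresholdSweep {

    /** F1 of the classification "score >= threshold" against the given labels. */
    static double f1At(int[] scores, boolean[] similar, int threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (int k = 0; k < scores.length; k++) {
            boolean predicted = scores[k] >= threshold;  // assumed: at or above T
            if (predicted && similar[k]) tp++;
            else if (predicted) fp++;
            else if (similar[k]) fn++;
        }
        int denominator = 2 * tp + fp + fn;
        return denominator == 0 ? 0.0 : 2.0 * tp / denominator;
    }

    /** Returns the lowest threshold in 0..100 with the highest F1. */
    static int bestThreshold(int[] scores, boolean[] similar) {
        int best = 0;
        double bestF1 = -1.0;
        for (int t = 0; t <= 100; t++) {
            double f = f1At(scores, similar, t);
            if (f > bestF1) {  // strict comparison keeps the first maximum
                bestF1 = f;
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical pair scores and ground-truth labels, for illustration only.
        int[] scores = {90, 72, 64, 41, 38, 25, 12};
        boolean[] similar = {true, true, true, false, true, false, false};
        int t = bestThreshold(scores, similar);
        System.out.println("best T = " + t + ", F1 = " + f1At(scores, similar, t));
    }
}
```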
Optimal Configuration
Best Threshold
Best Parameter Settings
Results
Tool         Settings    T   Acc     Prec    Rec     AUC     Prec@n  F1
ccfx         b=20,t=1    4   0.9640  0.9145  0.9040  0.9468  0.9040  0.9095
simjava      r=22        5   0.9568  0.8769  0.9120  0.9490  0.8840  0.8941
jplag-text   t=8         2   0.9408  0.8235  0.8960  0.9453  0.8440  0.8582
py-difflib   noautojunk  35  0.9392  0.8901  0.7940  0.9147  0.8080  0.8393
7zncd-BZip2  mx=1        39  0.9368  0.8977  0.7720  0.9419  0.8180  0.8301
ncd-bzlib    -           31  0.9336  0.8584  0.8000  0.9482  0.8200  0.8282
jplag-java   t=3         43  0.9160  0.7526  0.8640  0.9667  0.7860  0.8045
py-sklearn   -           33  0.8488  0.5894  0.8040  0.9146  0.6200  0.6802
[Bar chart: F1 (axis 0.5 to 1) for each tool, grouped by category]
Clone det.: ccfx, deckard, iclones, nicad, simian
Plag. det.: jplag-java, jplag-text, plaggie, sherlock, simjava, simtext
Comp.: 7zncd-BZip2, 7zncd-LZMA, 7zncd-LZMA2, 7zncd-Deflate, 7zncd-Deflate64, 7zncd-PPMd, bzip2ncd, gzipncd, icd, ncd-bzlib, ncd-zlib, xz-ncd
Others: bsdiff, diff, py-difflib, py-fuzzywuzzy, py-jellyfish, py-ngram, py-sklearn
Highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures.
Normalisation by Decompilation
Normalisation: pervasively modified code -> Compile (javac) -> Decompile (Krakatau, Procyon) -> normalised code
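The compile-then-decompile normalisation could be wired together roughly as sketched below. The compile step uses the JDK's standard javax.tools compiler API. The decompile step is deliberately left as a caller-supplied function, since the command lines of Krakatau and Procyon are not given in the slides; the class name, helper methods, and the stub decompiler in main are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Illustrative sketch: normalise Java source by compiling it and handing
// the class files to a decompiler supplied by the caller.
final class DecompilationNormaliser {

    /** Compiles one source file into outDir with the JDK's built-in compiler. */
    static Path compile(Path sourceFile, Path outDir) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("No system Java compiler; a JDK is required");
        }
        int status = javac.run(null, null, null,
                "-d", outDir.toString(), sourceFile.toString());
        if (status != 0) {
            throw new IllegalStateException("Compilation failed for " + sourceFile);
        }
        return outDir;
    }

    /** Compile, then decompile; the decompiler maps a class directory to source text. */
    static String normalise(Path sourceFile, Function<Path, String> decompiler) {
        try {
            Path classDir = Files.createTempDirectory("normalise-classes");
            return decompiler.apply(compile(sourceFile, classDir));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: writes <className>.java into a fresh temp directory. */
    static Path writeTempSource(String className, String code) {
        try {
            Path dir = Files.createTempDirectory("normalise-src");
            Path file = dir.resolve(className + ".java");
            Files.write(file, code.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path src = writeTempSource("Hello", "class Hello { int one() { return 1; } }");
        // Stub decompiler: a real setup would invoke Krakatau or Procyon here.
        String result = normalise(src, classDir -> "class files written to " + classDir);
        System.out.println(result);
    }
}
```

Keeping the decompiler behind a function makes it possible to run the same files through more than one decompiler, as the slide does with Krakatau and Procyon.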
Code Before Decompilation
Code After Decompilation
[Bar chart: F1 (axis 0.5 to 1) for each of the 30 tools on the original code (Orig.) and on the decompiled code (Dec.), grouped into clone detectors, plagiarism detectors, compression-based tools, and others]
Compilation and decompilation can be used as an effective normalisation method that greatly improves similarity detection on Java source code.
Summary
Compilation and decompilation can be used as an effective normalisation method that greatly improves similarity detection on Java source code.
Highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures.
More info: http://crest.cs.ucl.ac.uk/resources/cloplag/
