SUPERVISED BY: Dr Hitham M. Abo Bakr
Implementing Plagiarism
Detection Engine
For English Academic Papers
By
Muhamed Gameel Abd El Aziz
Ahmed Motair El Said Mater
Mohamed Hessien Mohamed
Shreif Hosni Zidan Esmail
Manar Mohamed Said Ahmed
Doaa Abd El Hamid Abd El Hamid
Implementing Plagiarism Detection Engine for English Academic Papers 1
Abstract
Plagiarism has become a serious issue nowadays due to the vast resources easily available on the web, which makes developing a plagiarism detection tool a useful and challenging task due to scalability issues.
Our project implements a Plagiarism Detection Engine oriented toward English academic papers using text Information Retrieval methods, a relational database, and Natural Language Processing techniques.
The main parts of the project are:
Gathering and cleaning data: crawling the web, collecting academic papers, and parsing them to extract information about each paper, building a large dataset of scientific paper content.
Tokenization: parsing, tokenizing, and preprocessing documents.
Plagiarism engine: checking similarity between the input document and the database to detect potential plagiarism.
Table of Contents
Abstract
Table of Contents
Table of Figures
Table of Tables
Chapter 1 Introduction
1.1 What is Plagiarism?
1.2 What is Self-Plagiarism?
1.3 Plagiarism on the Internet
1.4 Plagiarism Detection System
1.4.1 Local similarity
1.4.2 Global similarity
1.4.3 Fingerprinting
1.4.4 String Matching
1.4.5 Bag of words
1.4.6 Citation-based Analysis
1.4.7 Stylometry
Chapter 2 Background Theory
2.1 Linear Algebra Basics
2.1.1 Vectors
2.2 Information Retrieval (IR)
2.3 Regular Expression
2.4 NLTK Toolkit
2.5 Node.js
2.6 Express.js
2.7 Sockets.io
2.8 Languages Used
Chapter 3 Design and Architecture
3.1 Extract, Transform and Load (ETL)
3.2 Plagiarism Engine
3.2.1 Natural Language Processing (Generating k-grams) and Vectorization
3.2.2 Semantic Analysis (Vector Space Model VSM Representation)
3.2.3 Calculating Similarity
3.2.4 Clustering
3.2.5 Communicating Results
Chapter 4 Implementation
4.1 Extract, Transform and Load (ETL)
4.1.1 The Crawler
4.1.2 The Parser
4.1.3 The Data Extracted from the Paper
4.1.4 The Parser Implementation
4.1.5 How it Works
4.1.6 Steps of Parsing
4.1.7 The Paper Class
4.1.8 The Paragraph Structure
4.1.9 Parsing the First Page in Detail (ex: an IEEE Paper)
4.1.10 Parsing the Other Pages in Detail (ex: an IEEE Paper)
4.2 Natural Language Processing (NLP)
4.2.1 Introduction
4.2.2 The Implementation Overview
4.2.3 The Text Processing Procedure
4.2.4 Example of the Text Processing
4.3 Term Weighting
4.3.1 Lost Connection to Database Problem
4.3.2 Process Paragraph
4.3.3 Generating Terms
4.3.4 Populating term, paragraphVector Tables
4.3.5 Executing the VSM Algorithm
4.4 Testing Plagiarism
4.4.1 Process Paragraph
4.4.2 Calculate Similarity
4.4.3 Get Results
4.5 The VSM Algorithm
4.5.1 Calculating Similarity
4.5.2 K-means and Clustering
4.6 Server Side
4.6.1 Handling Routing
4.6.2 Running the Python System
4.7 Client Side
4.8 The GUI of the System
Chapter 5 Results and Discussion
5.1 Dataset of the Parser
5.2 Exploring the Dataset
5.2.1 Small dataset (15K)
5.2.2 Big dataset (50K)
5.3 Performance
5.4 Detecting Plagiarism
5.4.1 Percentage score functions
5.5 Discussing Results
Chapter 6 Conclusion
Chapter 7 Appendix
7.1 Entity-Relation Diagram (ERD)
7.2 Stored Procedures
References
Table of Figures
Figure 1.1 Plagiarism Detection Approaches
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3)
Figure 2.2 Geometric representation of documents
Figure 3.1 High-level block diagram
Figure 3.2 Detailed block diagram of the Plagiarism Engine
Figure 4.1 Overview of the Crawler and Parser
Figure 4.2 UML of the Parser Application
Figure 4.3 The Flow Chart of the Parser
Figure 4.4 The main function of Parsing
Figure 4.5 The First Page of an IEEE Paper (as Blocks)
Figure 4.6 First Page of a Science Direct Paper
Figure 4.7 First Page of a Springer Paper
Figure 4.8 The function parseOtherPages
Figure 4.9 Block of String before Enhancing
Figure 4.10 The Paragraphs after Enhancing
Figure 4.11 The Paper Structure
Figure 4.12 The Paragraph Structure
Figure 4.13 Different forms of an IEEE Top Header
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper
Figure 4.15 The supported Regex of the IEEE Header formats
Figure 4.16 The Function of Extracting the Volume Number
Figure 4.17 The Function of Extracting the Issue Number
Figure 4.18 The Function of Extracting the DOI
Figure 4.19 The Function of Extracting the Start and End Pages
Figure 4.20 The Function of Extracting the Journal Title
Figure 4.21 Parsing the rest of the blocks in the first Page
Figure 4.22 The Function of Extracting the DOI and PII
Figure 4.23 The Function of Extracting the ISSN
Figure 4.24 The Function of Extracting the Paper Dates
Figure 4.25 The Function of Extracting the Keywords
Figure 4.26 The Function of Extracting the Keywords
Figure 4.27 The Function of Extracting the Title and the Authors
Figure 4.28 Defining the Style of the Header
Figure 4.29 The Function of Extracting the Figure Captions
Figure 4.30 The Function of Separating the Lists
Figure 4.31 The Function of Extracting the Paragraph
Figure 4.32 The Function of Extracting the Paragraph
Figure 4.33 Process Text Function
Figure 4.34 Tokenizing Words Function
Figure 4.35 Tokenization Example
Figure 4.36 POS Function
Figure 4.37 POS Output Example
Figure 4.38 WordNet POS Function
Figure 4.39 Removing Punctuation Function
Figure 4.40 Removing Stop Words Function
Figure 4.41 Stop Words List
Figure 4.42 Lemmatization Function
Figure 4.43 Paragraph before Text Processing
Figure 4.44 Paragraph after Text Processing
Figure 4.45 Retrieving Paragraphs
Figure 4.46 Process Paragraph Function
Figure 4.47 Generate k-gram Terms Function
Figure 4.48 Paragraph Example
Figure 4.49 1-gram terms
Figure 4.50 2-gram terms
Figure 4.51 3-gram terms
Figure 4.52 4-gram terms
Figure 4.53 5-gram terms
Figure 4.54 Calculate Term Frequency
Figure 4.55 Insert Terms in Database
Figure 4.56 Insert Paragraph Vector in Database
Figure 4.57 Executing the VSM Algorithm
Figure 4.58 Tokenizing and linking paragraphs together
Figure 4.59 Process input paragraphs
Figure 4.60 Populate input paragraph vector
Figure 4.61 Calculate Similarity
Figure 4.62 Get Results
Figure 4.63 Flowchart of the K-means text clustering algorithm
Figure 4.64 Home Page Routing
Figure 4.65 Pre-Process Page Routing
Figure 4.66 Communication between the Server and the Core Engine for testing plagiarism
Figure 4.67 Communication between the Server and the Core Engine for pre-processing
Figure 4.68 Longest Common Subsequence (LCS) Algorithm
Figure 4.69 Longest Common Subsequence (LCS) Algorithm
Figure 4.70 Submitting an input document
Figure 4.71 The Results of the Process, Part 1
Figure 4.72 The Results of the Process, Part 2
Figure 5.1 Number of Papers Published per Year in IEEE
Figure 5.2 Number of Papers Published per Year in Springer
Figure 5.3 Number of Papers Published per Year in Science Direct
Figure 5.4 Response time against number of paragraphs tested on the small dataset
Figure 5.5 Screenshot of the System Performance from the System GUI
Figure 7.1 ERD of the Plagiarism Engine database
Table of Tables
Table 1 Statistics of the Parser
Table 2 Dataset Statistics
Table 3 Unique Term Count in each Paragraph
Table 4 Unique Term Count in the Dataset
Table 5 Dataset Statistics
Table 6 Unique Term Count in each Paragraph
Table 7 Unique Term Count in the Dataset
Table 8 Processing time of each module in the Plagiarism Engine
Table 9 Parameters
Table 10 Testing Paragraphs and Results
Chapter 1 Introduction
1.1 What is Plagiarism?
Plagiarism is the act of academic theft: taking someone else's work, such as copying words from a book or a scientific paper, and publishing it as one's own. Stealing ideas, images, videos, or music and using them without permission or a proper citation is also plagiarism.
1.2 What is Self-Plagiarism?
Self-plagiarism is the act of reusing a portion of an article or work one has published before without disclosing that one is doing so. The reused portion could be significant, identical, or nearly identical. It may also cause copyright issues, as the copyright of the old work is transferred to the new one. Such articles and works are called duplicate or multiple publications.
1.3 Plagiarism on the Internet
Today, blogs, Facebook pages, and some websites copy and paste information, violating many copyrights. Several measures are used to discourage this, such as disabling right-click to prevent copying, placing copyright warnings on every page of the website as banners or pictures, and using the DMCA copyright law to report copyright infringement. Such a report can be sent to the website owner or to the ISP hosting the website, and the infringing website will be taken down.
1.4 Plagiarism Detection System
A plagiarism detection system tests whether a piece of material contains plagiarism. The material could be a scientific article, a technical report, an essay, or other text. The system can also highlight the plagiarized parts of the material and state where they were copied from, even when some words have been replaced by others with the same meaning.
Figure 1.1 Plagiarism Detection Approaches
1.4.1 Local similarity:
Given a small dataset, the system checks the similarity between each pair of paragraphs in the dataset, for example checking whether two students cheated on an assignment.
1.4.2 Global similarity:
Global similarity systems checks the similarity between a small input paragraphs against a large dataset,
like checking if a submitted paper is plagiarized from an already published paper.
1.4.3 Fingerprinting
In this approach, the dataset consists of sets of n-grams taken from documents. These n-grams are selected as random substrings of each document; each set of n-grams, called minutiae, represents a fingerprint of that document, and all fingerprints are indexed in the database. The input text is processed in the same way and compared with the stored fingerprints; if it matches some of them, it plagiarizes those documents. [1]
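The fingerprinting idea above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's implementation: the n-gram length, the number of selected minutiae, and the use of MD5 for compact indexing are all assumptions.

```python
import hashlib
import random

def fingerprint(text, n=5, k=4, seed=0):
    # Build all word n-grams, then select k of them at random (the minutiae).
    # The fixed seed stands in for a deterministic selection scheme.
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    rng = random.Random(seed)
    chosen = rng.sample(grams, min(k, len(grams)))
    # Hash each selected n-gram so fingerprints can be indexed compactly.
    return {hashlib.md5(g.encode()).hexdigest() for g in chosen}

doc = "the quick brown fox jumps over the lazy dog near the river bank"
query = "the quick brown fox jumps over the lazy dog near the river bank"
# Identical texts processed the same way share all fingerprint hashes.
overlap = fingerprint(doc) & fingerprint(query)
```

A match between the query's fingerprint hashes and the indexed ones signals potential plagiarism of the corresponding document.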
1.4.4 String Matching
Exact string matching is one of the main problems in plagiarism detection systems: to detect plagiarism this way you have to find exact matches, but comparing the tested document against the entire database requires a huge amount of resources and storage, so suffix trees and suffix vectors are used to overcome this problem. [2]
1.4.5 Bag of words
This approach is an adaptation of vector space retrieval, where the document is represented as a bag of words. These words are inserted in the database as n-grams along with their locations in the document and their frequencies in this and other documents. The document to be tested is likewise represented as a bag of words and compared with the n-grams in the database. [3]
1.4.6 Citation-based Analysis
This is the only approach that does not rely on text similarity. It examines the citation and reference information in texts to identify similar patterns in the citation sequences. It is not widely used in commercial software, but prototypes exist.
1.4.7 Stylometry
Stylometry analyzes only the suspicious document itself, detecting plagiarized passages through differences in linguistic characteristics.
This method is not accurate on small documents, as it needs to analyze large passages, up to thousands of words per chunk, to extract linguistic properties [4].
Our project uses Global Similarity with the Bag of Words approach: the system keeps a dataset of many scientific papers divided into paragraphs, and the input text is likewise divided into paragraphs and compared against that large dataset of paragraphs.
Chapter 2 Background Theory
2.1 Linear Algebra Basics
Since we use Vector space model to represent and retrieve text documents, a basic linear algebra
is needed.
2.1.1 Vectors
A vector is a geometric object that has a magnitude and a direction, or equivalently a mathematical object consisting of ordered values.
1. Representation in 2D and 3D
1) Graphical (Geometric) representation
A vector is represented graphically as an arrow in the Cartesian 2D plane or the Cartesian 3D
space.
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3).
Source: Wikimedia commons.
2) Cartesian representation
Vectors in an n-dimensional Euclidean space can be represented as coordinate vectors; the
endpoint of a vector can be identified with an ordered list of n real numbers (n-tuple). [5]
2D vector: a = (a_x, a_y)
3D vector: a = (a_x, a_y, a_z)
2. Operations on vectors
1) Scalar product
ra = (ra_x, ra_y, ra_z)
2) Sum
a + b = (a_x + b_x, a_y + b_y, a_z + b_z)
3) Subtraction
a − b = (a_x − b_x, a_y − b_y, a_z − b_z)
4) Dot product
Algebraic definition: a · b = a_x b_x + a_y b_y + a_z b_z
Geometric definition: a · b = |a||b| cos θ
Where: |a| is the magnitude of vector a, |b| is the magnitude of vector b, and θ is the angle between a and b.
The projection of a vector a in the direction of another vector b is given by: a_b = a · b̂
Where: b̂ is the normalized vector (unit vector) of b.
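The operations above translate directly into code. A minimal sketch in plain Python (the function names are illustrative, not from the project):

```python
import math

def dot(a, b):
    # Algebraic dot product: sum of the componentwise products (a scalar).
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    # |a| = sqrt(a . a)
    return math.sqrt(dot(a, a))

def scalar_projection(a, b):
    # Projection of a in the direction of b: a . b_hat, with b_hat = b / |b|.
    return dot(a, b) / magnitude(b)

a, b = (2, 3, 0), (4, 0, 0)
# Projecting a onto the x-axis recovers its x component, 2.0.
```
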
2.2 Information Retrieval (IR)
Information retrieval can be defined as "the process of finding material of an unstructured nature, usually text, that satisfies an information need or is relevant to a query, from a large collection of data" [6].
As the definition suggests, IR differs from an ordinary select query in that the information retrieved is unstructured and does not always exactly match the query.
Information Retrieval methods are used in search engines, in text classification such as spam filtering, and, in our case, in the plagiarism engine.
1. Vector Space Model (VSM)
The basic idea of VSM is to represent text documents as vectors in a term-weight space.
1) Term Frequency weighting (TF)
The simplest VSM weighting is plain Term Frequency; all other weighting functions are modifications of it.
In TF weighting we represent each text by a vector of d dimensions, where d is the number of terms in the dataset; the value of the vector's nth dimension equals the frequency of the nth term in the document.
For example, let's assume a dataset of 2 dimensions/terms (play, ground):
Document 1 "play ground" is represented as d1 = (1, 1)
Document 2 "play play" is represented as d2 = (2, 0)
Document 3 "ground" is represented as d3 = (0, 1)
More generally, the weight of word w in document d is defined as weight(w, d) = count(w, d).
The dot product similarity between d1 and d2 is d1 · d2 = (1, 1) · (2, 0) = 1 × 2 + 1 × 0 = 2.
2) Term Frequency with Inverse Document Frequency weighting (TF-IDF)
Document Frequency df(w) is the number of documents that contain the word w.
TF-IDF adds an Inverse Document Frequency factor to penalize common terms: they have a high probability [7] of appearing in any document, so they do not strongly indicate plagiarism, unlike rarer terms, which have a lower probability and carry more information.
weight(w, d) = count(w, d) × 1 / df(w)
So for the above example:
df(play) = 2
df(ground) = 2
and in this case all the weights will be scaled by a half.
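The play/ground example can be worked through in code. This is an illustrative sketch; the helper names are not from the project:

```python
from collections import Counter

docs = {"d1": "play ground", "d2": "play play", "d3": "ground"}
terms = ["play", "ground"]                  # the 2 dimensions of the space

def tf_vector(text):
    # TF weighting: the nth component is the frequency of the nth term.
    counts = Counter(text.split())
    return tuple(counts[t] for t in terms)

def df(term):
    # Document frequency: number of documents containing the term.
    return sum(1 for text in docs.values() if term in text.split())

def tfidf_vector(text):
    # The simple TF-IDF defined above: weight(w, d) = count(w, d) * 1/df(w).
    return tuple(c / df(t) for c, t in zip(tf_vector(text), terms))

d1, d2 = tf_vector(docs["d1"]), tf_vector(docs["d2"])
dot = sum(x * y for x, y in zip(d1, d2))    # (1,1).(2,0) = 2, as in the text
```

Since both terms appear in 2 of the 3 documents, every TF-IDF weight here is the TF weight scaled by a half, matching the worked example.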
2. State-of-the-art VSM functions
1) Pivoted Length Normalization [8]
weight(w, d) = ( ln[1 + ln[1 + count(w, d)]] / (1 − b + b |d| / avdl) ) × log( (M + 1) / df(w) )
Where: weight(w, d) is the weight of word w in document/paragraph d
count(w, d) is the count of word w in document d, i.e. its term frequency
Figure 2.2 Geometric representation of documents (d1, d2, d3 in the play/ground term space)
b is the document length normalization parameter, b ∈ [0, 1]
|d| is the length of document d
avdl is the average length of the documents in the dataset
M is the number of documents in the dataset
df(w) is the number of documents that contain the word w, i.e. its document frequency
The document length normalization term 1 − b + b |d| / avdl linearly penalizes long documents whose length is larger than the average document length (avdl), and rewards short documents whose length is smaller than the average document length.
The parameter b controls the normalization: if it equals zero there is no normalization at all; if it equals 1 the normalization is linear with offset zero and slope 1.
The Inverse Document Frequency (IDF) term log((M + 1) / df(w)) penalizes common terms as explained above. The document frequency is normalized by the number of documents because the probability of a term depends not only on its document frequency but also on the size of the dataset: a term appearing in 10 documents of a 100-document dataset is much more common than a term appearing in 10 documents of a 1000-document dataset. The logarithmic function smooths the IDF weighting, i.e. it reduces the variation in weight when the document frequency varies a lot.
The Term Frequency (TF) term ln[1 + ln[1 + count(w, d)]] applies a double natural logarithm to achieve a sublinear transformation, i.e. smoothing the TF curve, to avoid over-scoring documents with heavily repeated words; the first occurrence of a term should carry the highest weight.
Imagine a document with an extremely large frequency of one term: without the sublinear transformation, this document would always score a high similarity against any input query containing that term, even higher than a genuinely more similar document.
2) Okapi BM25 [9]
BM stands for Best Match; the weights are defined as follows:
weight(w, d) = ( (k + 1) count(w, d) / (count(w, d) + k (1 − b + b |d| / avdl)) ) × log( (M + 1) / df(w) )
Where all symbols are defined as in Pivoted Length Normalization, and k ∈ [0, ∞].
It is similar to Pivoted Length Normalization, but instead of natural logarithms it uses division and the k parameter to achieve the sublinear transformation.
It was originally developed based on the probabilistic model; however, it is very similar to the Vector Space Model.
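Both weighting functions can be sketched directly from the formulas. The default values k = 1.2 and b = 0.75 below are common choices in the literature, not values taken from this project:

```python
import math

def pivoted_weight(count, doc_len, avdl, M, df, b=0.75):
    # Pivoted Length Normalization: double-log TF over a linear length
    # penalty, multiplied by the smoothed log IDF.
    tf = math.log(1 + math.log(1 + count)) if count > 0 else 0.0
    norm = 1 - b + b * doc_len / avdl
    idf = math.log((M + 1) / df)
    return tf / norm * idf

def bm25_weight(count, doc_len, avdl, M, df, k=1.2, b=0.75):
    # Okapi BM25: division by k-scaled normalization achieves the
    # sublinear TF transformation instead of logarithms.
    norm = 1 - b + b * doc_len / avdl
    tf = (k + 1) * count / (count + k * norm)
    idf = math.log((M + 1) / df)
    return tf * idf
```

Both functions grow sublinearly in the term count and shrink as the term becomes more common across the dataset.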
3. Similarity functions
After representing text documents as vectors in the space, we need functions to calculate the similarity (or distance) between any two vectors.
1) Dot product similarity
similarity(q, d) = Σ_{w ∈ q ∩ d} count(w, q) × weight(w, d)
Where: similarity(q, d) is the similarity score between document d and input query q.
The score is simply the sum, over each word appearing in both the document and the query, of the product of the term weights.
It is very popular because it is general and can be used with any fancy term weighting.
2) Cosine similarity
similarity(q, d) = ( Σ_{w ∈ q ∩ d} count(w, q) × weight(w, d) ) / ( |q| × |d| )
Where: |q| is the magnitude of the query vector, and |d| is the magnitude of the document vector.
It is basically a dot product divided by the product of the lengths of the two vectors, which yields the cosine of the angle between the vectors.
This function has built-in document (and query) length normalization.
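A minimal sketch of cosine similarity over the dense TF vectors of the earlier play/ground example:

```python
import math

def cosine_similarity(q, d):
    # Dot product divided by the product of the two vector magnitudes,
    # yielding the cosine of the angle between them.
    dot = sum(x * y for x, y in zip(q, d))
    nq = math.sqrt(sum(x * x for x in q))
    nd = math.sqrt(sum(x * x for x in d))
    return dot / (nq * nd)

# d1 = (1, 1) and d2 = (2, 0) are 45 degrees apart, so the score is ~0.707.
s = cosine_similarity((1, 1), (2, 0))
```

Unlike the raw dot product, doubling a document's length leaves its cosine score unchanged, which is the built-in length normalization mentioned above.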
4. Clustering
Clustering is an unsupervised machine learning method (unsupervised because the data are not labeled) and a powerful data mining technique; it is the process of grouping similar objects together.
This technique can theoretically speed up the information retrieval process by a factor of K, where K is the number of clusters.
This is achieved by clustering similar paragraphs together, measuring the similarity between each new query and the centroids of the clusters, and then measuring the similarity between the query and the paragraphs of one cluster only; this is much faster than measuring the similarity of the query against every paragraph in the dataset.
We use Kmeans algorithm-centroid based clustering-, which is an iterative improvement
algorithm that groups the data set into a pre defined number of clusters K.
Which goes like this [10]:
1 Select K random points from data set to be the initial guess of the centroids –cluster centers.
2 Assign each record in the data set to the closest centroid based on a given similarity function.
3 Move each centroid closer to points assigned to it by calculating the mean value of the points
in the cluster.
4 If reached local optima –i.e. centroids stopped moving- stop, else repeat
Since Kmeans algorithm is sensitive to initial choose of centroids and can stuck in local optima
we repeat it with different initial centroids and keep the best results –which have the least mean square
error-.
The time complexity is 𝛰(𝑡𝑘𝑛𝑑), where t is the number of iterations until convergence, k is the number of
clusters, n is the number of records in the data set, and d is the number of dimensions.
Since usually 𝑡𝑘 ≪ 𝑛, the algorithm is considered a linear-time algorithm.
Note: that is the typical time complexity for applying the algorithm to a dataset with a negligible
number of dimensions. In our case we have a very large number of dimensions (all terms in
the dataset), but fortunately, for each centroid or paragraph we iterate only over the terms that appear in it, not all
dimensions, so the time complexity differs, as we will discuss in detail in the implementation
section.
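The steps above, including the random restarts that guard against a bad initial choice of centroids, can be sketched on small dense vectors. Euclidean distance is used here for brevity; the engine itself works on sparse high-dimensional term vectors, and all names are illustrative.

```python
import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, restarts=5, seed=0):
    """Lloyd's iterative improvement: assign points to the nearest
    centroid, move each centroid to the mean of its points, stop at a
    local optimum; restart and keep the lowest-error run."""
    rng = random.Random(seed)
    best_sse, best_centroids = None, None
    for _ in range(restarts):
        centroids = rng.sample(points, k)  # step 1: random initial guess
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:  # step 2: assign to the closest centroid
                nearest = min(range(k), key=lambda i: squared_dist(p, centroids[i]))
                clusters[nearest].append(p)
            # step 3: move each centroid to the mean of its points
            moved = [tuple(sum(xs) / len(pts) for xs in zip(*pts)) if pts else centroids[i]
                     for i, pts in enumerate(clusters)]
            if moved == centroids:  # step 4: local optimum reached
                break
            centroids = moved
        sse = sum(min(squared_dist(p, c) for c in centroids) for p in points)
        if best_sse is None or sse < best_sse:
            best_sse, best_centroids = sse, centroids
    return best_centroids
```

With the seed fixed the run is deterministic; in practice the distance would be a similarity over sparse k-gram term weights rather than Euclidean distance.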
2.3 Regular Expression
It’s a sequence of character and symbols written in some way to detect patterns, where each
symbol in the sequence has a meaning ex: + means one or more, * means zero or more, - means range
(A-Z: all capital letters from A to Z).
For example, a birth date could be written as May 15th, 1993, so the pattern of the date is
[Month] [Day][st or nd or rd or th], [Year], and the Regular Expression for it is
[A-Za-z]{3,9} [0-9]{1,2}(st|nd|rd|th), [0-9]{4}
First, the month is one of 12 fixed words; they could be written explicitly, or more simply it's a
sequence of 3 to 9 characters, since the shortest month name (May) has 3 letters and the longest (September) has 9. Then comes a space,
then the day, which is a number of 1 or 2 digits followed by one of the 4 suffixes (st, nd, rd, th), then a space, then
the year, which is a number of 4 digits.
Also this isn’t the only format of date so the date expression could be more complicated than
this.
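The pattern can be tried out directly with Python's re module (a sketch; a comma is included after the ordinal suffix because the example date "May 15th, 1993" contains one):

```python
import re

# Month: 3-9 letters; day: 1-2 digits plus an ordinal suffix; year: 4 digits.
date_pattern = re.compile(r"[A-Za-z]{3,9} [0-9]{1,2}(?:st|nd|rd|th), [0-9]{4}")

def find_dates(text):
    """Return every substring of `text` that matches the date pattern."""
    return date_pattern.findall(text)
```

As noted above, real-world date formats vary far more than this, so a production pattern would need many more alternatives.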
2.4 NLTK Toolkit
It’s a python module that is responsible for Natural Language Processing (NLP) used for text
processing, It has algorithms for sentence and word tokenization, and contains a large number of corpus
(data), also has its own Wordnet corpus, and it’s used for Part of Speech (POS) Tagging, Stemming, and
lemmatization.
2.5 Node.js
It’s a runtime environment built on Chrome’s V8 JavaScript Enginer for developing server-side
web applications, it uses an event driven, non-blocking I/O model.
2.6 Express.js
It’s a Node.js web application server framework, It’s a standard server framework for Node.js, It
has a very thin layer with many features available as plugins.
2.7 Socket.IO
It's a library for real-time web applications. It enables bi-directional communication between the
web client and the server, primarily using the WebSocket protocol with polling as a fallback option.
2.8 Languages Used
1. Java
2. Python
3. SQL
4. JavaScript
5. HTML & CSS
Chapter 3 Design and Architecture
3.1 Extract, Transform and Load (ETL)
In this part, we build the database by downloading many scientific papers using the
Crawler software (Extract); they are then passed to the Parser software, where all the paper information
and text content are extracted as paragraphs (Transform) and inserted into the database (Load).
3.2 Plagiarism Engine
The Plagiarism Engine preprocesses a huge dataset of academic English papers and analyzes it using
Natural Language Processing techniques to extract useful information, then measures the similarity
between an input query and the dataset using Information Retrieval methods, to detect both identical and
paraphrased plagiarism in a fast and intelligent way.
Figure 3.1 High level block diagram
Figure 3.2 Detailed block diagram of the Plagiarism Engine
3.2.1 Natural Language Processing, (Generating k-grams), and vectorization
The text processing part works on these data to extract the most important words from the
paragraphs and ignore the common words; k-gram terms are then generated from these words, and
each bag of words is linked to its corresponding paragraph in the database.
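A minimal sketch of this step, with an illustrative stop-word list and hypothetical helper names (not the project's actual tokenizer):

```python
import re

# Illustrative subset of common words to ignore.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "is", "it"}

def important_words(paragraph):
    """Lowercase the paragraph, split it into word tokens,
    and drop the common (stop) words."""
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def k_grams(tokens, k):
    """Every contiguous run of k tokens, joined into one term;
    these terms become the dimensions of the paragraph vector."""
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
```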
3.2.2 Semantic Analysis (Vector Space Model VSM Representation)
Input: simple term frequency vector representation stored in the paragraphVector table.
Output: dataset statistics (number of paragraphs, number of terms, average paragraph length)
stored in the dataSetInfo table; document frequency (for each term, the number of paragraphs in which that term
appears) stored in the IDF column of the term table; and pivoted length normalization and BM25 vector weights.
In this part we calculate a more sophisticated vector representation of our text corpus than plain
term frequency.
We calculate a TF-IDF normalized vector representation of the text using both pivoted length
normalization and BM25, as discussed later.
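As a sketch of the kind of weight computed here, the standard Okapi BM25 term weight combines a saturating, length-normalized TF component with an IDF component. This is the textbook form with common default parameters; the project's exact variant, described later, may differ.

```python
import math

def bm25_weight(tf, df, n_paragraphs, par_len, avg_len, k1=1.2, b=0.75):
    """tf: term frequency in the paragraph; df: number of paragraphs
    containing the term; par_len / avg_len: this paragraph's length and
    the average paragraph length from the dataset statistics."""
    idf = math.log((n_paragraphs - df + 0.5) / (df + 0.5) + 1.0)
    tf_norm = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * par_len / avg_len))
    return idf * tf_norm
```

The saturation in the TF part means repeating a word gives diminishing returns, and the paragraph-length ratio penalizes long paragraphs, similarly in spirit to pivoted length normalization.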
3.2.3 Calculating Similarity
Input: vectorized input paragraphs stored in the inputPargraphVector table, and BM25 or pivoted
length normalization weights in the BM25 and pivotNorm columns of the paragraphVector table.
Output: the similarity between the input paragraph and relevant paragraphs in the dataset, stored in the
similarity table.
We check the similarity between the input paragraph and the paragraphs in the dataset, and report
possible plagiarism if the similarity measure between the input paragraph and any paragraph from the
dataset exceeds a predetermined threshold.
We implemented both Okapi BM25 and pivoted length normalization similarity functions.
The system first measures similarity on 5-gram vectors, then 4-grams and so on; whenever it finds a
high similarity at one k-gram size, it limits its scope to those paragraphs with high similarity in the preceding k-
grams, to increase performance.
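A sketch of this cascading narrowing (all names hypothetical; `sim` is any similarity function over two vectors, and `dataset` maps a paragraph id to its vector for each k-gram size):

```python
def cascade_similarity(sim, query, dataset, ks=(5, 4, 3, 2, 1), threshold=0.8):
    """Score every candidate paragraph on the largest k first; whenever
    some paragraphs score above the threshold, restrict the remaining
    (smaller-k) passes to those paragraphs only."""
    candidates = set(dataset)
    scores_by_k = {}
    for k in ks:
        scores = {pid: sim(query[k], dataset[pid][k]) for pid in candidates}
        high = {pid for pid, s in scores.items() if s >= threshold}
        if high:
            candidates = high  # narrow the scope for later passes
        scores_by_k[k] = scores
    return scores_by_k
```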
3.2.4 Clustering
Input: paragraph vectors with BM25 (or pivoted length normalization) weights stored in
paragraphVector table.
Output: the cluster of each paragraph stored in clusterId column in paragraph table, and the
centroids of the clusters stored in centroid table.
We clustered similar paragraphs together so that we can measure similarity against only similar
paragraphs, to speed up the similarity-checking step.
An input paragraph is first measured against the centroids to determine its cluster, then
the regular similarity measure is applied against all the dataset paragraphs in that cluster.
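A sketch of that two-step lookup, again with hypothetical names; `centroids` maps a cluster id to its centroid vector, and `clusters` maps a cluster id to the vectors of the paragraphs assigned to it:

```python
def cluster_search(sim, query_vec, centroids, clusters):
    """Find the query's cluster by comparing it with the centroids only,
    then measure similarity against the paragraphs of that one cluster."""
    best_cluster = max(centroids, key=lambda cid: sim(query_vec, centroids[cid]))
    return {pid: sim(query_vec, vec) for pid, vec in clusters[best_cluster].items()}
```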
3.2.5 Communicating Results
It’s the interface where the user can check his document to plagiarism by inserting the document
in a text box and the document is parsed in a similar way as the Parser of the system by splitting the
document into paragraphs and they are passed by the text processing part and compared by the dataset in
the database and results appear as plagiarism percentage in the document and showing the plagiarized
parts in the document with other documents.
Chapter 4 Implementation
4.1 Extract, Transform and Load (ETL)
Figure 4.1 Overview for the Crawler and Parser
4.1.1 The Crawler
It’s a software that download all the scientific papers from the web into a folders for each
publisher where the parser will start working on them.
4.1.2 The Parser
It’s a software that take a PDF document (Scientific paper) as an input and extract the paper
information and content of the paper and insert them in the database of the system.
4.1.3 The Data Extracted from the paper
a. Paper Information
1. Paper Title
2. Paper authors
3. Journal and its ISSN
4. Volume, Issue, Paper Date and other dates (Accepted, Received, Revised, Published)
5. DOI (Digital Object Identifier) or PII (Publisher Item Identifier)
6. Starting Page and Ending Page
b. Abstract and Keywords
c. Table of Contents
d. Figure and Table captions
e. Paper text content (as Paragraphs)
4.1.4 The Parser Implementation
Figure 4.2 UML of the Parser Application
4.1.5 How it works
The Parser consists of a parent class (Parser) and child classes (IEEE, Springer, APEM,
and Science Direct). The parent class has the general functions that parse the PDF document and extract
the table of contents, the figure and table captions, and the text content of the paper; the child classes
have specific functions and Regular Expressions for each publisher's structure to extract the paper
information (Title, Authors, DOI ...).
Figure 4.3 The flow chart of the Parser
Each publisher has its own folder into which the scientific papers are downloaded by the Crawler, and
the Parser monitors each folder for new documents and uses the suitable child class to parse each
new document found and extract all the needed information and data.
If the paper information and content are extracted completely, the file is moved to the
Processed directory; otherwise, the file is moved to the Unprocessed directory and the
error is logged, so the developer can check whether it's a new structure to be supported in the Parser, or
something went wrong that he has to fix.
4.1.6 Steps of Parsing
4.1.6.1 Extracting the Text from the PDF file (extractBlocks Function)
The Parser uses the PDFxStream Java library, which extracts the text from the PDF file as
blocks of strings. This function loops over the file page by page; for each page it extracts the
content of the page into an ArrayList<String> object called page, and adds this page with its page
number to a HashMap<Integer,ArrayList<String>> object called pages.
Figure 4.5 The First Page of an IEEE Paper (as Blocks)
public void parsePaper(String publisher) throws Exception {
extractBlocks();
try {parseFirstPage();}
catch (Exception e) {throw new Exception("Error Not Processed");}
parserOtherPages();
paper.enhaceParagraphs();
try {paper.insertPaperInDatabase(publisher);}
catch (SQLException e) {throw new Exception("Error Database");}
}
Figure 4.4 The main function of Parsing
4.1.6.2 Extracting the Paper Information (parseFirstPage Function)
Each publisher accepts its scientific papers in a specific structure that differs from publisher to
publisher, and the difference lies in the first page, where the paper information is written. So there has to
be a parser for each publisher designed to support its structure; this function, which is abstract
in the parent class, is implemented in each child class, one per publisher.
Figure 4.6 First Page of a Science Direct Paper
Figure 4.7 First Page of a Springer Paper
These are the different structures of Science Direct and Springer papers, shown to illustrate the
difference; it lies in the organization and structure of the information, e.g.:
1. This is a header of a Springer Paper
Kong et al. EURASIP Journal on Advances in Signal Processing 2014, 2014:44
http://asp.eurasipjournals.com/content/2014/1/44
2. This is a header of an IEEE Paper
IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
4.1.6.3 Extracting the Paper text content (parserOtherPages Function)
This function uses the general Parser functions: it loops over all the pages and the blocks of
strings in each page and extracts the data from the blocks, which could be table of contents entries,
figure and table captions, lists, or paragraphs.
Each block passes several stages:
1) First, test if the block is a figure caption.
2) Then test if it's a table caption.
3) Then test if it has a header (table of contents entry).
4) Then test if the block has lists (numeric, dash, or dot).
In the 3rd stage, if there are headers in the block, they are extracted and the rest of the block
is returned to the function, which continues with the other stages.
void parserOtherPages(){
for (Entry<Integer, ArrayList<String>> entrySet : pages.entrySet()) {
Integer pageNumber = entrySet.getKey();
ArrayList<String> page = entrySet.getValue();
Iterator<String> it = page.iterator();
while (it.hasNext()) {
String block = it.next().trim();
boolean isFigureCaption = false, isTableCaption = false;
boolean isList = false, isEmptyParagraph = false;
isFigureCaption = parseFigureCaption(block, pageNumber);
isTableCaption = parseTableCaption(block, pageNumber);
block = parseHeaders(block);
isList = parseLists(block, pageNumber);
isEmptyParagraph = "".equals(block);
if(!isFigureCaption && !isTableCaption
&& !isEmptyParagraph && !isList)
parseParagraph(block, pageNumber);
}
}
}
Figure 4.8 The parserOtherPages function
5) Finally, if the block isn't one of the previous types (not a figure or table caption, has no
list, or had its header extracted and the rest of the block returned), then it's a paragraph
and it is extracted.
4.1.6.4 Enhancing the Paragraphs (enhaceParagraphs Function)
As shown in Fig 4.9, some paragraphs are not in good shape when extracted:
1) Some words may be split across two lines with a hyphen, so they have to be rejoined; there are
also many extra spaces between words, which have to be removed.
2) The paragraph is extracted as lines (each with a newline character at the end), not as one
continuous string, so it has to be refined.
3) Some of the paper info is in uppercase, so it is capitalized.
Figure 4.9 Block of String before Enhancing
Page Number: 1
The Content:
However, as the number of metal layers increases and interconnect dimensions
decrease, the parasitic capacitance increases associated with fill metal have
become more significant, which can lead to timing and signal integrity
problems.
Page Number: 1
The Content:
Previous research has primarily focused on two important aspects of fill metal
realization: 1) the development of fill metal generation methods – which we
discuss further in Section II and 2) the modeling and analysis of capacitance
increases due to fill metal – Several studies have examined the parasitic
capacitance associated with fill metal for small scale interconnect test
structures in order to provide general guidelines on fill metal geometry
selection and placement. For large-scale designs,
Figure 4.10 The Paragraphs after enhancing
4.1.6.5 Finally inserting all these data in the Database
When the Parser starts, an object of type Paper is created, and every piece of information and data
extracted from the scientific paper is assigned to its attribute in this object; at the end of the
parsing, all this information and data is inserted into the database by calling this function.
1) Retrieve the journal ID from the database by its name or ISSN; if it's already there, the ID is
returned, otherwise it is considered a new journal, inserted, and its ID returned.
2) Test whether the paper was already inserted in the database before; if it is found, the
Parser throws an exception stating that it was already inserted, but if it's a new paper it
is inserted with its information (title, volume, issue ...) and the paper ID is returned.
3) With the paper ID, the rest of the data is inserted (authors, keywords, table of contents,
figure captions, table captions, and the text content of the paper, i.e. the paragraphs).
4.1.7 The Paper Class
This class works as a structure for the paper. It has the attributes that hold the information and
data of the paper, the function enhaceParagraphs(), which is responsible for
improving the text and enhancing the paragraphs to be ready for the next processing step in the
Natural Language Processing part, and the function insertPaperInDatabase(), which is
responsible for testing whether the paper was already inserted in the database or not; if it's a new
public class Paper {
public String title = "";
public int volume = -1,issue = -1;
public int startingPage = -1, endingPage = -1;
public String journal = "", ISSN = "";
public String DOI = "";
public ArrayList<String> headers = new ArrayList<>();
public ArrayList<String> authors = new ArrayList<>();
public ArrayList<String> keywords = new ArrayList<>();
public ArrayList<Paragraph> figureCaptions = new ArrayList<>();
public ArrayList<Paragraph> tableCaptions = new ArrayList<>();
public ArrayList<Paragraph> paragraphs = new ArrayList<>();
public String date="";
public String dateReceived="NULL", dateRevised="NULL";
public String dateAccepted="NULL", dateOnlinePublishing="NULL";
public void enhaceParagraphs()
public void insertPaperInDatabase(String publisher)
}
Figure 4.11 the Paper Structure
paper, it is inserted with all of its data: the paragraphs, the figure and table captions, and the
paper information.
Note that when the Parser finds a new PDF document in the publishers' folders, it creates a new
object of type Paper; during parsing, each piece of information extracted is
assigned to its attribute in this object, and at the end of the parsing process this Paper object executes
its two member functions: enhaceParagraphs() to refine the paragraph content, then
insertPaperInDatabase() to insert all the data into the database.
4.1.8 The Paragraph Structure
As shown in Figure 4.12, the paragraph structure is very simple: it contains the content of
the extracted paragraph and the number of the page from which the paragraph was extracted.
4.1.9 Parsing the First Page in Detail (ex: an IEEE Paper)
As shown in Figure 4.14, the page is divided into blocks of strings, and each
block holds one or more pieces of the paper information. This function in the parser is implemented specifically for
each publisher, so the IEEE parser's version won't work for the Springer parser and so on; it
parses only the first page, extracts the paper information in it, and
assigns it to the attributes of the Paper object.
Even within one publisher there are differences in the location of the paper information on the
page, and the structure changes over time; for example, the IEEE parser supports 8 different forms of
paper header.
Figure 4.13 Different forms for an IEEE Top Header
public class Paragraph {
public int pageNum;
public String content;
}
Figure 4.12 the Paragraph Structure
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper
4.1.9.1 Parsing the Paper Header
The parseFirstPage() function starts by parsing the header of the paper, which is the
first block in the paper; the block is sent to the parsePaperHeader() function, which has a different
Regular Expression for every form that the parser supports, as shown in Fig 4.15.
When the function receives the block, the block passes through the different expressions; if
it matches one of the supported formats, the function starts extracting the information, otherwise
it throws an exception stating that this header format isn't supported and the developer
has to add support for it.
As shown in Figure 4.13, the header may contain information such as the starting page
number (at the start or the end of the line), the journal title, the volume number, the issue number,
and the date. This information may or may not be present in the header, so according to the format of
the header the suitable functions (parsePaperDate(), parseVolume(), parseIssue(), parseJournal(),
parseStartingPage()) are called to extract it.
// ex: Chang et al. VOL. 1, NO. 4/SEPTEMBER 2009/ J. OPT. COMMUN. NETW. C35
String header1_Exp = "^([A-Z]+ ET AL. " + volume_Exp + ", " + issue_Exp
+ "[ ]*/[ ]*" + paperDate_Exp + "[ ]*/[ ]*" + journalTitle_Exp + " [A-Z0-9]+)$";
// ex: 594 J. OPT. COMMUN. NETW. /VOL. 1, NO. 7/DECEMBER 2009 Lim et al.
String header2_Exp = "^([A-Z0-9]+ " + journalTitle_Exp + "/"
+ volume_Exp + ", " + issue_Exp + "/" + paperDate_Exp + " [A-Z]+ ET AL.)$";
// ex: IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
// ex: 93 IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007
// ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May-June 2008
// ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May/June 2008
// ex: 93 IEEE TRANSACTIONS ON MAGNETICS Vol. 13, No. 6; December 2006
String header3_Exp = "^(([0-9]+ )*" + journalTitle_Exp +"(,)* " + volume_Exp
+ ", " + issue_Exp + "(,|;) " + paperDate_Exp + "( [0-9]+)*)$";
// ex: 598 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING
String header4_Exp = "^([0-9]+ " + journalTitle_Exp + ")$";
// ex: IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 598
String header5_Exp = "^(" + journalTitle_Exp + " [0-9]+)$";
// ex: 1956 lRE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES 75
String header6_Exp = "^([0-9]{4} " + journalTitle_Exp + "[0-9]+)$";
// ex: 112 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS May
String header7_Exp = "^([0-9]+ " + journalTitle_Exp + "[A-Z]{3,9})$";
// ex: SUPPLEMENT TO IEEE TRANSACTIONS ON AEROSPACE / JUNE 1965
String header8_Exp = "^(" + journalTitle_Exp + "[ ]*/[ ]*" + dateExp + ")$";
Figure 4.15 The supported Regex of the IEEE Header formats
4.1.9.2 Extracting the Volume from the Header
The IEEE parser uses volume_Exp = VOL(.)* [A-Z-]*[0-9]+ to detect the
volume part of the header and passes it to the parseVolume() function, which then uses another
expression to extract the number from this part. For example, in the first line of the header forms in Fig 4.13, the
parser detects the part (VOL. 18), then detects the number in that result (18), and converts it
from String to int to assign it to the volume attribute of the Paper object.
4.1.9.3 Extracting the Issue number from the Header
The IEEE parser uses issue_Exp = NO(.|,) [0-9]+ to detect the issue part of the
header and passes it to the parseIssue() function, which then uses another expression to extract the
number from this part. For the same example presented in the volume section, the parser detects the
part (NO. 3), then detects the number in that result (3), and converts it from String to int to
assign it to the issue attribute of the Paper object.
4.1.9.4 Extracting the Paper Date from the Header
@Override
void parseVolume(String volume) {
Matcher matcher = Pattern.compile(volume_Exp).matcher(volume);
if(matcher.find()){
Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
while(numMatcher.find())
paper.volume = Integer.parseInt(numMatcher.group());
}
}
Figure 4.16 The Function of extracting the Volume Number
@Override
void parseIssue(String issue) {
Matcher matcher = Pattern.compile(issue_Exp).matcher(issue);
if(matcher.find()){
Matcher numMatcher = Pattern.compile("[0-9]+")
.matcher(matcher.group());
if(numMatcher.find())
paper.issue = Integer.parseInt(numMatcher.group());
}
}
Figure 4.17 The Function of Extracting the Issue Number
@Override
void parsePaperDate(String date) {
Matcher matcher = Pattern.compile(paperDate_Exp).matcher(date.trim());
if(matcher.find())
paper.date = matcher.group().replaceAll("^/", "").trim();
}
Figure 4.18 The Function of Extracting the Paper Date
Like the other parts of the header, the IEEE parser uses date_Exp = [A-Z]{0,9}[/-]*[A-Z]{3,9}(.)*( [0-9]{1,2}(,)*)* [0-9]{4} to extract the date part from the header,
then assigns it to the date attribute in the Paper object. The date can be written in different formats
(2016, March 2016, May/June 2016, May-June 2016), and the expression is written to detect all
of these date formats.
Note that after each piece of information is extracted, it is removed
from the header block string, so after removing the volume, issue, and date, the information left in
the header is the journal title and the starting page; the starting page can be at the start or
the end of the header.
4.1.9.5 Extracting the Start and End Page numbers from the Header
Now we know that the header contains only the journal title and the start page number, so the
IEEE parser uses startPage_Exp = ^[0-9]+|[0-9]+$. This expression extracts a number
that lies at the start or the end of the checked string, so whether the start page number lies at the start of the
header or at the end of it, it is detected, extracted, and then, like the other information,
assigned to its attribute in the Paper object.
The end page is very simple: the IEEE parser adds the number of pages of the
paper to the start page number and assigns the result to the end page of the Paper object.
4.1.9.6 Extracting the Journal Title from the Header
Now, finally, for the journal title: the IEEE parser uses journalTitle_Exp = [A-Z :-—/)(.,]+ to extract the journal title part from the header, then passes the
title to the parseJournal() function.
The title may have some extra words that aren't needed, such as ([author name] et al.), or it may
end with separating characters (a comma or forward slash), so these must be removed first; the rest is
then assigned to the journal attribute in the Paper object.
@Override
void parseStartingPage(String startingPage) {
Matcher matcher = Pattern.compile(startingPage_Exp).matcher(startingPage);
if(matcher.find())
paper.startingPage = Integer.parseInt(matcher.group().trim());
parseEndingPage(startingPage);
}
@Override
void parseEndingPage(String endingPage) {
paper.endingPage = paper.startingPage + pages.size();
}
Figure 4.19 The Function of Extracting the Start and End Pages
4.1.9.7 Parsing the Rest of the first page’s blocks
@Override
void parseJournal(String journal) {
journal = journal.replaceAll("( /|, )", "").trim();
Matcher matcher = Pattern.compile(journalTitle_Exp).matcher(journal);
if(matcher.find()){
String journalName = matcher.group().replaceAll("[A-Z ]+ ET AL.", "");
if (journalName.charAt(journalName.length()-1) == '/')
paper.journal = journalName.substring(0, journalName.length()-1);
else
paper.journal = journalName;
}
}
Figure 4.20 The Function of Extracting the Journal Title
Iterator<String> it = pageOne.iterator();
while (it.hasNext()) {
String mainBlock = it.next();
String block = mainBlock.replaceAll("[ ]+", " ");
if(Pattern.compile(IEEE_DOI_Exp + "|" + PII_Exp).matcher(block).find())
{ parseDOI(block); blockList.add(mainBlock); }
if(ISSN_Pattern.matcher(block).find())
{ parseISSN(block); blockList.add(mainBlock); }
if(Pattern.compile("Index Terms").matcher(block).find())
{ parseKeywords(block); blockList.add(mainBlock); }
if(Pattern.compile("(Abstract|ABSTRACT|Summary)").matcher(block).find())
{ parseAbstract(block); blockList.add(mainBlock); }
if(date_Pattern.matcher(block.toUpperCase()).find() && !datesFound){
parseDates(block);
if (!paper.dateAccepted.equals("NULL") ||
!paper.dateOnlinePublishing.equals("NULL") ||
!paper.dateReceived.equals("NULL") ||
!paper.dateRevised.equals("NULL"))
{ blockList.add(mainBlock); datesFound = true; }
}
}
removeUnimportantBlocks();
for (String blockList1 : blockList)
pageOne.remove(blockList1);
Figure 4.21 Parsing the rest of blocks in the first Page
After parsing the header block and extracting all the information from it, the IEEE parser
continues to parse the other blocks, searching for the rest of the information. Due to differences in
structure, the location of this information can differ from structure to structure, so the best way to
extract it is to loop through all the first-page blocks; using the Regular Expressions for
this information (DOI, ISSN ...), the parser can locate it, and it also tries to detect some
other blocks, such as the Abstract, Keywords, Nomenclature, and the paper dates (when it was
received, accepted, revised, and published online).
In every loop iteration, if a piece of information is detected, the block is passed to the suitable function to
extract it; once the information is extracted, the block isn't needed, so the parser
adds this block to a TreeSet<String> (blockList), and after finishing all the iterations on the blocks
of page one, these blocks are removed from the blocks of the page.
Also, there may be some other blocks that don't carry important information, such as the website
of the publisher or the publisher's logo with its name under it; these also have to be
detected and removed, using the function removeUnimportantBlocks().
4.1.9.8 Extracting the DOI or the PII
In the loop, if a block is detected to have the DOI (Digital Object Identifier) or PII (Publisher
Item Identifier) using IEEE_DOI_Exp = [0-9]{2}.[0-9]{4}/[A-Z-]+.[0-9]+.[0-9]+ or PII_Exp = [0-9]{4}-[0-9xX]{4}([0-9]{2})[0-9]{5}-(x|X|[0-9]), the IEEE parser passes the block to the parseDOI() function, and the DOI
or PII is extracted. If it's a DOI, it is concatenated with the DOI domain of the
papers (http://dx.doi.org/); if it's a PII, it is concatenated with (http://dx.doi.org/10.1109/S);
the result is assigned to the DOI attribute in the Paper object.
4.1.9.9 Extracting the ISSN
@Override
void parseDOI(String DOI) {
Matcher matcher = Pattern.compile(IEEE_DOI_Exp).matcher(DOI);
while(matcher.find())
paper.DOI = "http://dx.doi.org/" + matcher.group();
matcher = Pattern.compile(PII_Exp).matcher(DOI);
while(matcher.find())
paper.DOI = "http://dx.doi.org/10.1109/S" + matcher.group();
}
Figure 4.22 The Function of Extracting the DOI and PII
void parseISSN(String ISSN){
Matcher matcher = ISSN_Pattern.matcher(ISSN);
while(matcher.find())
paper.ISSN = matcher.group().replaceAll("(–|-|‐)", "-");
}
Figure 4.23 The Function of Extracting the ISSN
Also, if a block is detected to have the ISSN using ISSN_Exp = [0-9]{4}(–|-|‐)[0-9]{3}[0-9xX], the IEEE parser passes the block to the parseISSN() function, which
extracts the ISSN and assigns it to the ISSN attribute in the Paper object.
4.1.9.10 Extracting the Dates of the Paper:
If a block is detected during the iteration to have dates, using date_Exp = ([0-9]{1,2}(-| )[A-Z]{3,9}[.]*(-| )[0-9]{4})|([A-Z]{3,9}[.]*( [0-9]{1,2},)* [0-9]{4}), the IEEE parser passes the block to the parseDates() function; this
block has dates related to the paper, such as when it was received by the publisher and when it was revised,
void parseDates(String dates){
    dates = dates.replaceAll(separatedWord_Fixing, "")
                 .replaceAll(newLine_Removal, " ").toUpperCase();
    Matcher matcher = receivedDate_Pattern.matcher(dates);
    while(matcher.find()){
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while(dateMatcher.find())
            paper.dateReceived = dateMatcher.group().trim();
    }
    matcher = revisedDate_Pattern.matcher(dates);
    while(matcher.find()){
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while(dateMatcher.find())
            paper.dateRevised = dateMatcher.group().trim();
    }
    matcher = acceptedDate_Pattern.matcher(dates);
    while(matcher.find()){
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while(dateMatcher.find())
            paper.dateAccepted = dateMatcher.group().trim();
    }
    matcher = publishingDate_Pattern.matcher(dates);
    while(matcher.find()){
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while(dateMatcher.find())
            paper.dateOnlinePublishing = dateMatcher.group().trim();
    }
}
Figure 4.24 The Function of extracting the paper Dates
accepted, and published online. For each of these dates there is a Regular Expression to detect it. Note that not all papers include these dates, but most of them do; when they are present, they are extracted and assigned to their attributes in the paper object.
The dates can be written in many formats: (30 OCTOBER 2007), (17 AUG. 2007), (28-JULY-2009), (OCTOBER 6, 2006), so the Regular Expression for the date itself is necessarily complicated in order to detect all of these formats.
Also, the word before the date can be written in different forms: (Received), (Received:), (Revised), (Revised:), or (Received in revised form), and it may be lowercase or capitalized. The Regular Expressions are therefore constructed to detect all forms of those words, and to handle character case we transform the string to uppercase before comparing.
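The combined expression can be checked against the four formats above with a short Python sketch (a hypothetical test harness with the escapes restored, not project code):

```python
import re

# date_Exp as described above, with a grouping parenthesis restored around
# the second alternative; input is assumed to be uppercased first.
DATE_EXP = (r"([0-9]{1,2}(-| )[A-Z]{3,9}[.]*(-| )[0-9]{4})"
            r"|([A-Z]{3,9}[.]*( [0-9]{1,2},)* [0-9]{4})")

samples = ["30 OCTOBER 2007", "17 AUG. 2007", "28-JULY-2009", "OCTOBER 6, 2006"]
for s in samples:
    # each sample date format is matched in full
    assert re.fullmatch(DATE_EXP, s), s
```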
4.1.9.11 Extracting the Keywords
If the block of the keywords is detected, the IEEE Parser passes it to the parseKeywords() function. The keywords may be found inside the block of the Abstract, so the first step is to crop the keywords part if it sits together with the abstract. The block may also be broken across two lines, or contain a word split across two lines with a hyphen, so these have to be removed and fixed. After that, note that some papers separate the keywords with a comma (,) and others with a semicolon (;). The split keywords are then added to the list of keywords in the paper object.
4.1.9.12 Extracting the Abstract
For the block of the abstract: during the iteration in the parseFirstPage() procedure, if one of the blocks matches the word Abstract or Summary, that block is passed to the parseAbstract() function and considered the first paragraph of the page, with the header Abstract.
In some cases the abstract may contain other information, such as the keywords or the Nomenclature, so these have to be cropped first and parsed separately.
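The cropping logic can be sketched in Python (a hedged stand-in for the Java parseAbstract() cropping step; the sample string is invented):

```python
def crop_abstract(text):
    """Cut off trailing 'Index Terms' or 'NOMENCLATURE' sections, mirroring
    the cropping performed before the abstract is stored."""
    for marker in ("Index Terms", "NOMENCLATURE"):
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text.strip()

raw = "Abstract-We study X. Index Terms-plagiarism, detection"
print(crop_abstract(raw))  # -> Abstract-We study X.
```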
@Override
void parseKeywords(String keywords) {
    keywords = keywords.substring(keywords.indexOf("Index Terms"));
    String indexTerms_Removal = "-\r\n|Index Terms|-";
    keywords = keywords.replaceAll(indexTerms_Removal, "");
    String[] splitted = keywords.replaceAll(newLine_Removal, " ").split(",|;");
    for (int i = 0; i < splitted.length; i++)
        paper.keywords.add(splitted[i].trim());
}
Figure 4.25 The Function of Extracting the Keywords
4.1.9.13 Extracting the Title and Authors
After extracting all the information of the paper and removing those blocks, the next block contains the Title of the paper, then the Authors, then the Introduction.
First the title is passed to the parseTitle() procedure; if it is split over more than one line, the newline characters are removed, and the result is assigned to the title attribute.
Next the authors are passed to the parseAuthors() procedure, where they are separated, by comma, semicolon, or some other separator depending on the publisher style, and each author is added to the authors list in the paper object.
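A Python sketch of this splitting step (mirroring, not reproducing, the Java parseAuthors(); the sample names are invented):

```python
import re

def split_authors(raw):
    """Split an author line on commas or 'and', drop affiliation digits,
    and keep only non-empty names."""
    parts = re.split(r",| and | And | AND ", raw)
    return [re.sub(r"[0-9]+", "", p).strip() for p in parts if p.strip()]

print(split_authors("A. Smith1, B. Jones2 and C. Lee3"))
# -> ['A. Smith', 'B. Jones', 'C. Lee']
```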
@Override
void parseAbstract(String abstractContent) {
    int indexOfIndexTerms = abstractContent.indexOf("Index Terms");
    if (indexOfIndexTerms != -1)
        abstractContent = abstractContent.substring(0, indexOfIndexTerms);
    int indexOfNomenclature = abstractContent.indexOf("NOMENCLATURE");
    if (indexOfNomenclature != -1)
        abstractContent = abstractContent.substring(0, indexOfNomenclature);
    paper.headers.add("Abstract");
    abstractContent = abstractContent.replaceAll("(Abstract|Summary)(-)*", "");
    String lastHeader = paper.headers.get(paper.headers.size()-1);
    Paragraph paragraph = new Paragraph(1, lastHeader, abstractContent);
    paper.paragraphs.add(paragraph);
}
Figure 4.26 The Function of Extracting the Abstract
void parseTitle(String title) {
    paper.title = title.replaceAll(newLine_Removal, " ").trim();
}

@Override
void parseAuthors(String authors) {
    authors = authors.replaceAll(author_Removal, "")
                     .replaceAll("[ ]+", " ");
    authors = authors.replaceAll(separatedWord_Fixing, "")
                     .replaceAll(newLine_Removal, " ");
    String[] split = authors.split(",| and| And| AND");
    for (String author : split)
        if(!author.trim().isEmpty())
            paper.authors.add(author.replaceAll("[0-9]+", "").trim());
}
Figure 4.27 The Function of Extracting the Title and the Authors
4.1.10 Parsing the Other Pages in Detail (ex: an IEEE Paper)
Now all the paper information is extracted, and the blocks remaining on the first page are the introduction and the rest of the page content. Once the parseFirstPage() procedure finishes, the parseOtherPages() procedure starts executing. As demonstrated before, it loops over all the blocks of strings in the pages and extracts all the possible data from them, such as headers (table of contents), figure and table captions, and lists; if a block is none of the previous, it is treated as a paragraph. All of these procedures are part of the parent Parser.
4.1.10.1 Extracting the Headers
This procedure is very general and works efficiently for most types of headers. First it detects the style of the level 1 headers; it supports (I. INTRODUCTION), (1 INTRODUCTION), (1. INTRODUCTION), and (1 Introduction). Headers may be numbered with Roman numerals or Arabic numbers, the number may or may not be followed by a dot, and the header text may be uppercase or capitalized, so the function first detects the type of the header.
For the level 2 headers there are also different styles, and another function detects them. It supports three different types, for example: (A. Level 2 Header), (1.1 Level 2 Header), and (1.1. Level 2 Header); the header may be listed alphabetically, as number dot number, or as number dot number dot, followed by the title of the header.
Once the header styles are identified, the headers' Regular Expressions are created and tested on all the passed blocks to detect any headers. These Regexes are not constant but change as parsing proceeds: for example, if (1. Introduction) was detected, the next expected header is (2. Another Header), so the number is incremented.
Another point: the header always comes at the start of a block of string, and the rest of the block is a paragraph (or the header may have been extracted from the beginning as its own block), so the function extracts only the header and returns the rest of the block to be parsed as a paragraph.
There are also headers without numbers, such as Abstract, References, Acknowledgements, Appendix, and others; those headers are detected separately with a separate Regex.
This procedure can also detect level 3 and level 4 headers, whose style is determined by the style of the level 2 headers, for example (1.1.1 Header) or (1.1.1. Header). All detected headers are added to the headers list in the paper object.
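The incrementing-expectation idea can be sketched as follows (a hypothetical Python helper, not project code; the style names borrow from Figure 4.28):

```python
import re

def next_header_pattern(n, style="NUM_DOT_CAPITALIZED"):
    """Build the regex for the next expected level-1 header, illustrating
    how the parser's header Regex changes as headers are consumed."""
    if style == "NUM_DOT_CAPITALIZED":
        return re.compile(r"^%d\. [A-Z][a-z]+" % n)
    if style == "NUM_SPACE_UPPERCASE":
        return re.compile(r"^%d [A-Z]+" % n)
    raise ValueError(style)

# After matching "1. Introduction", the parser waits for header number 2.
pat = next_header_pattern(2)
assert pat.match("2. Related Work")
assert not pat.match("3. Method")
```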
4.1.10.2 Extracting the Figure and Table Captions
In this procedure, the parent Parser uses figure_Exp = ^(Fig\.|Figure)[ ]+[0-9]+(\.|:) and table_Exp = ^(TABLE|Table)[ ]+([0-9]+|[IVX]+) to detect the figure and table captions. These may appear in different styles, for example (Figure 1.), (Fig. 1), (Figure 1:), (TABLE 1), and (Table II); the numbering may be numeric or Roman. After extraction, the caption is added to the list of captions as a paragraph with its page number.
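With the escapes restored, the two expressions can be exercised directly (a Python sketch; the parser itself is Java, and the caption strings are invented):

```python
import re

# figure_Exp and table_Exp as described above, anchored at block start.
FIGURE_EXP = re.compile(r"^(Fig\.|Figure)[ ]+[0-9]+(\.|:)")
TABLE_EXP = re.compile(r"^(TABLE|Table)[ ]+([0-9]+|[IVX]+)")

assert FIGURE_EXP.match("Figure 1. System overview")
assert FIGURE_EXP.match("Fig. 3: Results")
assert TABLE_EXP.match("TABLE 1 Parameters")
assert TABLE_EXP.match("Table II Comparison")
assert not FIGURE_EXP.match("The figure shows the pipeline")
```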
private enum HeaderMode {
    NUM_SPACE_UPPERCASE, NUM_SPACE_CAPITALIZED, ROMAN_DOT_UPPERCASE,
    NUM_DOT_UPPERCASE, NUM_DOT_CAPITALIZED, ABC_DOT_CAPITALIZED,
    NUM_DOT_NUM_DOT_CAPITALIZED, NUM_DOT_NUM_CAPITALIZED
}

Pattern restHeader_Pattern = Pattern.compile("^(REFERENCES|References|"
        + "ACKNOWLEDGMENT[S]*|Acknowledg[e]*ment[s]*|Nomenclature|DEFINITIONS"
        + "|Contents|NOMENCLATURE|ACRONYM|ACRONYMS|NOTATION|APPENDIX|"
        + "Appendix)(\r\n| )*");

void detect_Header1Mode(String block){
    if (Pattern.compile("I. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.ROMAN_DOT_UPPERCASE;
    else if (Pattern.compile("1 INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_UPPERCASE;
    else if (Pattern.compile("1 Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_CAPITALIZED;
    else if (Pattern.compile("1. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_UPPERCASE;
    else if (Pattern.compile("1. Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_CAPITALIZED;
}

void detect_Header2Mode(int _1st_header, String block){
    if (Pattern.compile("^A. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.ABC_DOT_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header
            + ".1 [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header
            + ".1. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_DOT_CAPITALIZED;
}
Figure 4.28 Defining the Style of the Headers
4.1.10.3 Extracting the Lists
In this procedure, the parent Parser detects lists in the text content and separates each list as a whole paragraph by itself. It supports numeric, bulleted, and dashed lists. The block may have a paragraph at the beginning and another at the end, so these have to be separated; each paragraph (if found) and the list are added as paragraphs with the page number to the paragraph list in the paper object.
private boolean parseFigureCaption(String block, int pageNumber){
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = figureCaption_Pattern.matcher(block);
    while(matcher.find()){
        String figureTitle = block.replaceAll(separatedWord_Fixing, "")
                                  .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph figure = new Paragraph(pageNumber, lastHeader, figureTitle);
        paper.figureCaptions.add(figure);
        return true;
    }
    return false;
}

private boolean parseTableCaption(String block, int pageNumber){
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = tableCaption_Pattern.matcher(block);
    while(matcher.find()){
        String tableTitle = block.replaceAll(separatedWord_Fixing, "")
                                 .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph table = new Paragraph(pageNumber, lastHeader, tableTitle);
        paper.tableCaptions.add(table);
        return true;
    }
    return false;
}
Figure 4.29 The Functions of Extracting the Figure and Table Captions
Pattern newList_Pattern = Pattern.compile("\r\n[ ]*([0-9]|-|\\.|·|•)");
Pattern numericList1_Pattern = Pattern.compile("^[0-9](\\.|\\))[ ]+[A-Z]");
Pattern numericList2_Pattern = Pattern.compile("(\\.|:)\r\n[ ]*[0-9](\\.|\\))[ ]+[A-Z]");
Pattern dotList1_Pattern = Pattern.compile("^(\\.|·|•)[ ]+[A-Za-z]");
Pattern dotList2_Pattern = Pattern.compile("(\\.|:)\r\n[ ]*(\\.|·|•)[ ]+");
Pattern dashList1_Pattern = Pattern.compile("^-[ ]+[A-Za-z]+");
Pattern dashList2_Pattern = Pattern.compile("(\\.|:)\r\n[ ]*-[ ]+[A-Z]");

private boolean parseLists(String block, int pageNumber){
    Matcher orderList1_Matcher = numericList1_Pattern.matcher(block);
    Matcher orderList2_Matcher = numericList2_Pattern.matcher(block);
    if(orderList1_Matcher.find() || orderList2_Matcher.find())
        return parseList(block, pageNumber, numericList2_Pattern, newList_Pattern);
    Matcher dotList1_Matcher = dotList1_Pattern.matcher(block);
    Matcher dotList2_Matcher = dotList2_Pattern.matcher(block);
    if(dotList1_Matcher.find() || dotList2_Matcher.find())
        return parseList(block, pageNumber, dotList2_Pattern, newList_Pattern);
    Matcher dashList1_Matcher = dashList1_Pattern.matcher(block);
    Matcher dashList2_Matcher = dashList2_Pattern.matcher(block);
    if(dashList1_Matcher.find() || dashList2_Matcher.find())
        return parseList(block, pageNumber, dashList2_Pattern, newList_Pattern);
    return false;
}
Figure 4.30 The Function of Separating the Lists
4.1.10.4 Extracting the Paragraph
In this procedure, the parent Parser detects paragraphs using paragraph_Exp = \.\r\n[ ]+[A-Z]. As mentioned before, a block is passed to this procedure after being tested as a figure or table caption or a list; it could also contain a header, which has to be extracted first, with the rest of the block returned. If the block passes all these tests, it is considered a paragraph and passed to the parseParagraph() procedure.
Note that the block may contain one or more paragraphs, so all of them have to be detected and separated, and each of them is added to the paragraphs list with its page number in the paper object.
void parseParagraph(String block, int pageNumber){
    Matcher matcher = newParagraph_Pattern.matcher(block);
    String lastHeader = "", content;
    int startIndex = 0, endIndex;
    while(matcher.find()){
        endIndex = matcher.start();
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        content = block.substring(startIndex, endIndex+1);
        Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
        paper.paragraphs.add(paragraph);
        startIndex = endIndex + matcher.group().length()-1;
    }
    if(paper.headers.size() > 0)
        lastHeader = paper.headers.get(paper.headers.size()-1);
    content = block.substring(startIndex);
    Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
    paper.paragraphs.add(paragraph);
}
Figure 4.31 The Function of Extracting the Paragraph
4.2 The Natural Language Processing (NLP)
4.2.1 Introduction
In this section, the text extracted from the scientific papers has to be refined. We focus on the important words in the text, such as nouns and verbs, and ignore the stop words, such as prepositions and adverbs, so that plagiarism can be detected efficiently even if the user tries to play with the wording.
4.2.2 The Implementation Overview
First, each paragraph in the database is selected and passed to the processText() procedure, which performs the text processing and returns an array of refined words. In this procedure the paragraph passes through several steps:
1. Lowercase
2. Tokenization
3. Part of Speech (POS) tagging
4. Remove Punctuations
5. Remove Stop words
6. Lemmatization
4.2.3 The Text Processing Procedure
4.2.3.1 Lowercase
In this step, all the text is converted to lowercase, so we do not store redundant entries for the same word written in different cases (Play, play).
4.2.3.2 Tokenization
def processText(document):
    document = document.lower()
    words = tokenizeWords(document)
    tagged_words = pos_tag(words)
    filtered_words = removePunctuation(tagged_words)
    filtered_words = removeStopWords(filtered_words)
    filtered_words = lemmatizeWords(filtered_words)
    return filtered_words
Figure 4.33 Process Text Function
def tokenizeWords(sentence):
    return word_tokenize(sentence)
Figure 4.34 Tokenizing words Function
Here we split the text into words using the Treebank tokenization algorithm. This algorithm splits the words in an intelligent way based on a corpus retrieved from NLTK, and it also separates words from surrounding punctuation.
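A rough, NLTK-free sketch of the contraction splitting this tokenizer performs (a simplification for illustration only; the project uses nltk.word_tokenize):

```python
import re

def treebank_like_tokenize(text):
    """Very rough stand-in for NLTK's Treebank tokenizer: only the
    contraction splitting is illustrated here."""
    text = re.sub(r"n't\b", " n't", text)                 # won't -> wo n't
    text = re.sub(r"'(m|s|re|ve|ll|d)\b", r" '\1", text)  # i'm  -> i 'm
    return re.findall(r"n't|'[a-z]+|\w+|[^\w\s]", text)

assert treebank_like_tokenize("i'm") == ["i", "'m"]
assert treebank_like_tokenize("won't") == ["wo", "n't"]
```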
For example:
1. i’m → [ 'i', "'m" ]
2. won’t → ['wo', "n't"]
3. gonna (tested) {helping} (25) → ['gon', 'na', 'tested', 'helping', '25']
Figure 4.35 Tokenization Example
4.2.3.3 Part of Speech (POS) tagging
The purpose of POS tagging is to find the grammatical role of each word in the sentence: it detects whether the word is a verb, noun, adjective, or adverb. This information helps return the words to their origins; verbs, for example, are reduced to their infinitives.
We use the WordNet database to get the word origins.
words = ['at', '5', 'am', 'tomorrow', 'morning', 'the',
         'weather', 'will', 'be', 'very', 'good', '.']
tagged_words = nltk.pos_tag(words)
Figure 4.36 POS Function
[('at', 'IN'), ('5', 'CD'), ('am', 'VBP'), ('tomorrow', 'NN'),
('morning', 'NN'), ('the', 'DT'), ('weather', 'NN'), ('will',
'MD'), ('be', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]
Figure 4.37 POS Output Example
def getWordnetPos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
Figure 4.38 WordNet POS Function
4.2.3.4 Remove Punctuations
In this step the punctuation is removed from the text: commas, full stops, single and double quotes, and parentheses, whether round, square, or curly.
4.2.3.5 Remove Stop words
In this process the stop words are removed.
def removePunctuation(words):
    new_words = []
    for word in words:
        if len(word[0]) > 1:
            new_words.append(word)
    return new_words
Figure 4.39 Removing Punctuations Function
def removeStopWords(words):
    stop_words = set(stopwords.words("english"))
    new_words = []
    for word in words:
        if word[0] not in stop_words:
            new_words.append(word)
    return new_words
Figure 4.40 Removing Stop Words Function
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
Figure 4.41 Stop Words list
4.2.3.6 Lemmatization
In this step, we use the information retrieved from the POS tagger to get the origins of the words, by passing each word and its WordNet position to the lemmatize function.
After processing, the paragraph contains only the important words that describe its real meaning.
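For illustration, a crude rule-based stand-in (not the project code, which relies on WordNet) shows how the POS guides the reduction:

```python
def naive_lemma(word, pos):
    """Crude, dictionary-free stand-in for WordNetLemmatizer: the rules
    below only cover the illustrated cases; real coverage needs WordNet."""
    if pos == "n":
        if word.endswith("ies") and len(word) > 4:
            return word[:-3] + "y"       # penalties -> penalty
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]             # rules -> rule
    if pos == "v":
        if word.endswith("ied"):
            return word[:-3] + "y"       # identified -> identify
    return word

assert naive_lemma("penalties", "n") == "penalty"
assert naive_lemma("identified", "v") == "identify"
```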
4.2.4 Example of the Text Processing
def lemmatizeWords(words):
    new_words = []
    wordnet_lemmatizer = WordNetLemmatizer()
    for word in words:
        new_word = wordnet_lemmatizer.lemmatize(word[0],
                                                getWordnetPos(word[1]))
        new_words.append(new_word)
    return new_words
Figure 4.42 Lemmatization Function
Plagiarism is the wrongful appropriation and stealing and publication of another
author's language, thoughts, ideas, or expressions and the representation of them
as one's own original work. The idea remains problematic with unclear definitions
and unclear rules. The modern concept of plagiarism as immoral and originality as
an ideal emerged in Europe only in the 18th century, particularly with the
Romantic movement.
Plagiarism is considered academic dishonesty and a breach of journalistic ethics.
It is subject to sanctions like penalties, suspension, and even expulsion. Recently,
cases of 'extreme plagiarism' have been identified in academia.
Plagiarism is not in itself a crime, but can constitute copyright infringement. In
academia and industry, it is a serious ethical offense. Plagiarism and copyright
infringement overlap to a considerable extent, but they are not equivalent
concepts, and many types of plagiarism do not constitute copyright infringement,
which is defined by copyright law and may be adjudicated by courts. Plagiarism is
not defined or punished by law, but rather by institutions (including professional
associations, educational institutions, and commercial entities, such as publishing
companies).
Figure 4.43 Paragraph before Text Processing
['plagiarism', 'wrongful', 'appropriation', 'stealing', 'publication',
'another', 'author', "'s", 'language', 'thought', 'idea', 'expression',
'representation', 'one', "'s", 'original', 'work', 'idea', 'remain',
'problematic', 'unclear', 'definition', 'unclear', 'rule', 'modern',
'concept', 'plagiarism', 'immoral', 'originality', 'ideal', 'emerge',
'europe', '18th', 'century', 'particularly', 'romantic', 'movement',
'plagiarism', 'consider', 'academic', 'dishonesty', 'breach',
'journalistic', 'ethic', 'subject', 'sanction', 'like', 'penalty',
'suspension', 'even', 'expulsion', 'recently', 'case', "'extreme",
'plagiarism', 'identify', 'academia', 'plagiarism', 'crime',
'constitute', 'copyright', 'infringement', 'academia', 'industry',
'serious', 'ethical', 'offense', 'plagiarism', 'copyright',
'infringement', 'overlap', 'considerable', 'extent', 'equivalent',
'concept', 'many', 'type', 'plagiarism', 'constitute', 'copyright',
'infringement', 'define', 'copyright', 'law', 'may', 'adjudicate',
'court', 'plagiarism', 'define', 'punish', 'law', 'rather',
'institution', 'include', 'professional', 'association', 'educational',
'institution', 'commercial', 'entity', 'publish', 'company']
Figure 4.44 Paragraph after Text Processing
4.3 Term Weighting
In this section we calculate the term weighting for our system using the data extracted from the scientific papers by the parser. The parser extracts the data as paragraphs and stores them in the database; here we retrieve these paragraphs and calculate the term weights for the system.
4.3.1 Lost Connection to Database Problem
First we open a connection to the database and retrieve the unprocessed paragraphs. But we are processing a large number of paragraphs and the connection must stay open all that time, so we face a lost-connection problem when the database's internal timeout expires.
1) Increasing the timeout
This problem could be solved by increasing the timeout, but this solution is limited, as we might have a very large number of paragraphs that exceeds whatever timeout was set.
2) A better solution
We retrieve 100 paragraphs, process them, then close the connection. We then open a new connection and retrieve another 100 paragraphs, and so on until all the unprocessed paragraphs are processed.
cursor = connection.run("SELECT COUNT(*) FROM paragraph WHERE processed = false")
(unprocessedParagraphsNum,) = cursor.fetchone()
connection.endConnect()
pCounter = 0
insertTermsBeginTime = time.time()
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 100: remain = 100
    rows = connection1.run("SELECT paragraphId, content FROM paragraph"
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        # Process Paragraph
    connection1.endConnect()
    connection2.endConnect()
Figure 4.45 Retrieving Paragraphs
4.3.2 Process Paragraph
Each paragraph is passed to the processText() procedure to get an array of refined words; if the array is empty, it means there were no important words in the paragraph, and the paragraph is deleted.
The returned words are used to generate k-gram terms and populate the term and paragraphVector tables.
Finally, we update the length of the paragraph with the number of words returned from processText(), and mark the paragraph as processed.
4.3.3 Generating Terms
To generate terms we call the generateTerms() procedure and pass it the bag of words and the kinds of k-grams we want to generate.
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 1000: remain = 1000
    rows = connection1.run("SELECT paragraphId, content FROM paragraph"
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        data = processText(content)
        length = len(data)
        if length < 1:
            connection2.run("DELETE FROM paragraph WHERE paragraphId = %s;",
                            (paragraphId,))
            connection2.commit()
            continue
        term.populateTerms_ParagraphVector(connection2, data, paragraphId)
        connection2.run("UPDATE paragraph SET length = %s, processed = %s"
                        " WHERE paragraphId = %s;", (length, True, paragraphId))
        connection2.commit()
    connection1.endConnect()
    connection2.endConnect()
Figure 4.46 Process Paragraph Function
Example on k-grams
data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)

def generateTerms(data, kgrams, paragraphId=0):
    all_terms = {}
    for i in kgrams:
        if len(data) < i: continue
        terms = createTerms(data, i)
        all_terms[i] = terms
    data = {
        'paragraphId': paragraphId,
        'terms': all_terms
    }
    return data

def createTerms(words, kgram):
    length = len(words) - kgram + 1
    i = 0
    terms = []
    while i < length:
        term = createTerm(words, i, kgram)
        terms.append(term)
        i += 1
    return terms

def createTerm(words, start, kgram):
    i = start
    term = []
    while i < kgram + start:
        term.append(words[i])
        i += 1
    t = ' '.join(term)
    if len(t) > 180:
        t = t[0:180]
    return t
Figure 4.47 Generate k-gram Terms Function
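The sliding-window generation above reduces to a short equivalent in idiomatic Python (an equivalent sketch, not project code):

```python
def kgrams(words, k):
    """Sliding-window k-gram terms, equivalent to createTerms() above:
    join each window of k consecutive words into one term string."""
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

assert kgrams(["a", "b", "c", "d"], 2) == ["a b", "b c", "c d"]
assert kgrams(["a", "b"], 3) == []   # too few words for a 3-gram
```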
Physics is one of the oldest academic disciplines, perhaps the oldest through
its inclusion of astronomy. Over the last two millennia, physics was a part of
natural philosophy along with chemistry, biology, and certain branches of
mathematics.
Figure 4.48 Paragraph Example
['physic', 'one', 'old', 'academic', 'discipline', 'perhaps', 'old',
'inclusion', 'astronomy', 'last', 'two', 'millennium', 'physic', 'part',
'natural', 'philosophy', 'along', 'chemistry', 'biology', 'certain', 'branch',
'mathematics']
Figure 4.49 1-gram terms
['physic one', 'one old', 'old academic', 'academic discipline', 'discipline
perhaps', 'perhaps old', 'old inclusion', 'inclusion astronomy', 'astronomy
last', 'last two', 'two millennium', 'millennium physic', 'physic part', 'part
natural', 'natural philosophy', 'philosophy along', 'along chemistry',
'chemistry biology', 'biology certain', 'certain branch', 'branch mathematics']
Figure 4.50 2-gram terms
['physic one old', 'one old academic', 'old academic discipline', 'academic
discipline perhaps', 'discipline perhaps old', 'perhaps old inclusion', 'old
inclusion astronomy', 'inclusion astronomy last', 'astronomy last two', 'last
two millennium', 'two millennium physic', 'millennium physic part', 'physic
part natural', 'part natural philosophy', 'natural philosophy along',
'philosophy along chemistry', 'along chemistry biology', 'chemistry biology
certain', 'biology certain branch', 'certain branch mathematics']
Figure 4.51 3-gram terms
['physic one old academic', 'one old academic discipline', 'old academic
discipline perhaps', 'academic discipline perhaps old', 'discipline perhaps old
inclusion', 'perhaps old inclusion astronomy', 'old inclusion astronomy last',
'inclusion astronomy last two', 'astronomy last two millennium', 'last two
millennium physic', 'two millennium physic part', 'millennium physic part
natural', 'physic part natural philosophy', 'part natural philosophy along',
'natural philosophy along chemistry', 'philosophy along chemistry biology',
'along chemistry biology certain', 'chemistry biology certain branch', 'biology
certain branch mathematics']
Figure 4.52 4-gram terms
['physic one old academic discipline', 'one old academic discipline perhaps',
'old academic discipline perhaps old', 'academic discipline perhaps old
inclusion', 'discipline perhaps old inclusion astronomy', 'perhaps old
inclusion astronomy last', 'old inclusion astronomy last two', 'inclusion
astronomy last two millennium', 'astronomy last two millennium physic', 'last
two millennium physic part', 'two millennium physic part natural', 'millennium
physic part natural philosophy', 'physic part natural philosophy along', 'part
natural philosophy along chemistry', 'natural philosophy along chemistry
biology', 'philosophy along chemistry biology certain', 'along chemistry
biology certain branch', 'chemistry biology certain branch mathematics']
Figure 4.53 5-gram terms
4.3.4 Populating term, paragraphVector Tables
After generating the terms, we use them to populate the term and paragraphVector tables.
4.3.4.1 Calculate Term Frequency
We use the nltk.FreqDist() function to calculate the term frequency of each k-gram term in the paragraph.
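The project uses nltk.FreqDist(); for plain counts, the standard library's collections.Counter behaves the same way, as a quick check shows (sample terms taken from the 2-gram example above):

```python
from collections import Counter

# Counting raw term occurrences, as FreqDist does for the paragraph vector.
terms = ["physic one", "one old", "physic one"]
tf = Counter(terms)
assert tf["physic one"] == 2
assert tf["one old"] == 1
```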
4.3.4.2 Inserting Terms
We will insert each term with its corresponding term gram.
4.3.4.3 Inserting ParagraphVector
In this step we will link each term with its paragraph and the term frequency by inserting these
into the paragraphVector table.
tf = {}
for kgram in data['terms']:
    tf[kgram] = nltk.FreqDist(data['terms'][kgram])
Figure 4.54 Calculate Term Frequency
query1 = ("INSERT INTO term (kgram, term) VALUES (%s, %s)"
          " ON DUPLICATE KEY UPDATE kgram = kgram, term = term;")
insertTerms = [(str(kgram), str(term)) for kgram in tf for term in tf[kgram]]
connection.runMany(query1, insertTerms)
connection.commit()
Figure 4.55 insert Terms in Database
query2 = ("INSERT IGNORE INTO paragraphVector (paragraphId, termId, termFreq, kgram)"
          " VALUES (%s, (SELECT termId FROM term WHERE term = %s AND kgram = %s),"
          " %s, %s);")
insertDocVec = [(data['paragraphId'], str(term), str(kgram), tf[kgram][term],
                 str(kgram)) for kgram in tf for term in tf[kgram]]
connection.runMany(query2, insertDocVec)
connection.commit()
Figure 4.56 insert Paragraph Vector in Database
4.3.5 Executing VSM Algorithm
After all paragraphs have been processed and inserted into the database, we run some stored SQL procedures to update the inverseDocFreq, BM25, and pivotNorm columns in the term and paragraphVector tables.
Now the system is ready: all terms are weighted and prepared for testing plagiarism.
connection.callProcedure('update_inverseDocFreq')
connection.callProcedure('update_BM25', (0.75, 1.5))
connection.callProcedure('update_pivotNorm', (0.75,))
Figure 4.57 Executing the VSM Algorithm
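The stored procedures themselves are not shown here; a common BM25 formulation they may approximate looks like the following sketch (an assumption: the exact SQL may differ, but the parameters mirror update_BM25(0.75, 1.5)):

```python
import math

def bm25_weight(tf, doc_len, avg_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """One common BM25 term weight: IDF scaled by a saturating,
    length-normalized term frequency."""
    idf = math.log(n_docs / doc_freq)
    norm = tf + k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / norm

w1 = bm25_weight(tf=1, doc_len=20, avg_len=20, n_docs=1000, doc_freq=10)
w2 = bm25_weight(tf=3, doc_len=20, avg_len=20, n_docs=1000, doc_freq=10)
assert w2 > w1 > 0   # more occurrences -> higher weight, with saturation
```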
4.4 Testing Plagiarism
When a user submits a text or a file to test for plagiarism, the text is first split into paragraphs, and an inputPaper row is inserted to relate these paragraphs together.
4.4.1 Process Paragraph
Each paragraph is then processed in a way similar to the pre-processing: first the paragraph is inserted into the inputParagraph table, then the text is passed to the processText() procedure, which returns a refined bag of words. Finally, these words are used to generate terms and populate the inputParagraphVector table.
connection.run(" INSERT INTO inputPaper (inputPaperId) VALUES(''); ")
paragraphs = tokenizeParagraphs(text)
Figure 4.58 tokenizing and link paragraphs together
for paragraph in paragraphs:
    data = processText(paragraph)
    length = len(data)
    if length < 1: continue
    cursor = connection.run("INSERT INTO inputParagraph (content, inputPaperId)"
                            " VALUES (%s, %s)", (paragraph, paperId))
    connection.commit()
    paragraphId = cursor.getlastrowid()
    term.populateInput_Terms_ParagraphVector(connection, data, paragraphId)
Figure 4.59 Process input paragraphs
def populateInput_Terms_ParagraphVector(connection, words, paragraphId):
    data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)
    # Term Frequency representation
    tf = {}
    for kgram in data['terms']:
        tf[kgram] = FreqDist(data['terms'][kgram])
    query = ("INSERT INTO inputParagraphVector (inputParagraphId, termId, termFreq, kgram) "
             "SELECT %s, termId, %s, %s FROM term WHERE term = %s AND kgram = %s;")
    insertDocVec = [(data['paragraphId'], tf[kgram][term], str(kgram), str(term), str(kgram))
                    for kgram in tf for term in tf[kgram]]
    connection.runMany(query, insertDocVec)
    connection.commit()
Figure 4.60 Populate input paragraph vector

More Related Content

What's hot

Placement Cell project
Placement Cell projectPlacement Cell project
Placement Cell project
Manish Kumar
 
Attendance Management Report 2016
Attendance Management Report 2016Attendance Management Report 2016
Attendance Management Report 2016
Pooja Maan
 

What's hot (20)

Face Recognition Attendance System
Face Recognition Attendance System Face Recognition Attendance System
Face Recognition Attendance System
 
Hostel management system srs
Hostel management system srsHostel management system srs
Hostel management system srs
 
Automatic Attendance system using Facial Recognition
Automatic Attendance system using Facial RecognitionAutomatic Attendance system using Facial Recognition
Automatic Attendance system using Facial Recognition
 
Online Railway Reservation System
Online Railway Reservation SystemOnline Railway Reservation System
Online Railway Reservation System
 
Mobile based attandance system
Mobile based attandance systemMobile based attandance system
Mobile based attandance system
 
Expense tracker
Expense trackerExpense tracker
Expense tracker
 
Food ordering System
Food ordering SystemFood ordering System
Food ordering System
 
FAKE NEWS DETECTION PPT
FAKE NEWS DETECTION PPT FAKE NEWS DETECTION PPT
FAKE NEWS DETECTION PPT
 
Notepad Testing Report
Notepad Testing Report  Notepad Testing Report
Notepad Testing Report
 
Face recognition a survey
Face recognition a surveyFace recognition a survey
Face recognition a survey
 
Voice assistant ppt
Voice assistant pptVoice assistant ppt
Voice assistant ppt
 
college website project report
college website project reportcollege website project report
college website project report
 
Placement Cell project
Placement Cell projectPlacement Cell project
Placement Cell project
 
SRS for Library Management System
SRS for Library Management SystemSRS for Library Management System
SRS for Library Management System
 
Attendance Management Report 2016
Attendance Management Report 2016Attendance Management Report 2016
Attendance Management Report 2016
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine Learning
 
Handwritten Character Recognition
Handwritten Character RecognitionHandwritten Character Recognition
Handwritten Character Recognition
 
Online Shopping project report
Online Shopping project report Online Shopping project report
Online Shopping project report
 
Online Examination System Project report
Online Examination System Project report Online Examination System Project report
Online Examination System Project report
 
Online Attendance Management System
Online Attendance Management SystemOnline Attendance Management System
Online Attendance Management System
 

Viewers also liked

Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Maarten van Wesel
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
Traian Rebedea
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarism
guestf17a2e
 
Authorship analysis using function words forensic linguistics
Authorship analysis using function words forensic linguisticsAuthorship analysis using function words forensic linguistics
Authorship analysis using function words forensic linguistics
Vlad Mackevic
 
Plagiarism and its detection
Plagiarism and its detectionPlagiarism and its detection
Plagiarism and its detection
ankit_saluja
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
butest
 
Forensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical ApplicationsForensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical Applications
dahveed123
 

Viewers also liked (17)

plagiarism detection tools and techniques
plagiarism detection tools and techniquesplagiarism detection tools and techniques
plagiarism detection tools and techniques
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
 
"Processors for Embedded Vision: Technology and Market Trends," A Presentatio...
"Processors for Embedded Vision: Technology and Market Trends," A Presentatio..."Processors for Embedded Vision: Technology and Market Trends," A Presentatio...
"Processors for Embedded Vision: Technology and Market Trends," A Presentatio...
 
Authorship attribution
Authorship attributionAuthorship attribution
Authorship attribution
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
 
Plag detection
Plag detectionPlag detection
Plag detection
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarism
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...
 
Authorship analysis using function words forensic linguistics
Authorship analysis using function words forensic linguisticsAuthorship analysis using function words forensic linguistics
Authorship analysis using function words forensic linguistics
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk
 
Plagiarism and its detection
Plagiarism and its detectionPlagiarism and its detection
Plagiarism and its detection
 
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep Learning
 
Forensic linguistics
Forensic linguisticsForensic linguistics
Forensic linguistics
 
Forensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical ApplicationsForensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical Applications
 

Similar to My Graduation Project Documentation: Plagiarism Detection System for English Scientific Papers

26 Lab #1 Performing Reconnaissance and Probing Using Comm.docx
26  Lab #1 Performing Reconnaissance and Probing Using Comm.docx26  Lab #1 Performing Reconnaissance and Probing Using Comm.docx
26 Lab #1 Performing Reconnaissance and Probing Using Comm.docx
tamicawaysmith
 
Securing ASP.NET MVC 5 Web Applications
Securing ASP.NET MVC 5 Web ApplicationsSecuring ASP.NET MVC 5 Web Applications
Securing ASP.NET MVC 5 Web Applications
Martin Åhlin
 
Saravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan_QA_Automation & Manual Testing_2Years-ExpSaravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan Sangapillai
 
Java documentaion template(mini)
Java documentaion template(mini)Java documentaion template(mini)
Java documentaion template(mini)
cbhareddy
 
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docxCopyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
vanesaburnand
 

Similar to My Graduation Project Documentation: Plagiarism Detection System for English Scientific Papers (20)

Event Syndication – Requirements Specification For Linked Event Diaries
Event Syndication – Requirements Specification For Linked Event DiariesEvent Syndication – Requirements Specification For Linked Event Diaries
Event Syndication – Requirements Specification For Linked Event Diaries
 
26 Lab #1 Performing Reconnaissance and Probing Using Comm.docx
26  Lab #1 Performing Reconnaissance and Probing Using Comm.docx26  Lab #1 Performing Reconnaissance and Probing Using Comm.docx
26 Lab #1 Performing Reconnaissance and Probing Using Comm.docx
 
Securing ASP.NET MVC 5 Web Applications
Securing ASP.NET MVC 5 Web ApplicationsSecuring ASP.NET MVC 5 Web Applications
Securing ASP.NET MVC 5 Web Applications
 
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug TriageSurvey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
 
Codec Networks is Present Training in Penetration testing,VAPT in Delhi,India.
 Codec Networks is Present Training in Penetration testing,VAPT in Delhi,India.  Codec Networks is Present Training in Penetration testing,VAPT in Delhi,India.
Codec Networks is Present Training in Penetration testing,VAPT in Delhi,India.
 
SELF CORRECTING MEMORY DESIGN FOR FAULT FREE CODING IN PROGRESSIVE DATA STREA...
SELF CORRECTING MEMORY DESIGN FOR FAULT FREE CODING IN PROGRESSIVE DATA STREA...SELF CORRECTING MEMORY DESIGN FOR FAULT FREE CODING IN PROGRESSIVE DATA STREA...
SELF CORRECTING MEMORY DESIGN FOR FAULT FREE CODING IN PROGRESSIVE DATA STREA...
 
Ch03
Ch03Ch03
Ch03
 
Saravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan_QA_Automation & Manual Testing_2Years-ExpSaravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan_QA_Automation & Manual Testing_2Years-Exp
 
D space manual-1_8
D space manual-1_8D space manual-1_8
D space manual-1_8
 
Chemread – a chemical informant
Chemread – a chemical informantChemread – a chemical informant
Chemread – a chemical informant
 
Introduction to Christian Education: Final Exam 2011
Introduction to Christian Education: Final Exam 2011Introduction to Christian Education: Final Exam 2011
Introduction to Christian Education: Final Exam 2011
 
Divya_Resume
Divya_ResumeDivya_Resume
Divya_Resume
 
Java documentaion template(mini)
Java documentaion template(mini)Java documentaion template(mini)
Java documentaion template(mini)
 
Manual for WordFinder 10 Professional, PC
Manual for WordFinder 10 Professional, PCManual for WordFinder 10 Professional, PC
Manual for WordFinder 10 Professional, PC
 
2007
20072007
2007
 
WordFinder Grammatik 3 - Manual
WordFinder Grammatik 3 - ManualWordFinder Grammatik 3 - Manual
WordFinder Grammatik 3 - Manual
 
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docxCopyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
 
Software Bug Detection Algorithm using Data mining Techniques
Software Bug Detection Algorithm using Data mining TechniquesSoftware Bug Detection Algorithm using Data mining Techniques
Software Bug Detection Algorithm using Data mining Techniques
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
FIB MIT208.pdf
FIB   MIT208.pdfFIB   MIT208.pdf
FIB MIT208.pdf
 

Recently uploaded

Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 

My Graduation Project Documentation: Plagiarism Detection System for English Scientific Papers

  • 1. SUPERVISED BY: Dr Hitham M. Abo Bakr Implementing Plagiarism Detection Engine For English Academic Papers By Muhamed Gameel Abd El Aziz Ahmed Motair El Said Mater Mohamed Hessien Mohamed Shreif Hosni Zidan Esmail Manar Mohamed Said Ahmed Doaa Abd El Hamid Abd El Hamid
  • 2. Implementing Plagiarism Detection Engine for English Academic Papers 1 Abstract Plagiarism became a serious issue now days due to the presence of vast resources easily available on the web, which makes developing plagiarism detection tool a useful and challenging task due to the scalability issues. Our project is implementing a Plagiarism Detection Engine oriented for English academic papers using text Information Retrieval methods, relational database, and Natural Language Processing techniques. The main parts of the projects are: Gathering and cleaning data: crawling the web and collecting academic papers and parsing it to extract information about the paper and make a big dataset of these scientific paper content. Tokenization: Parse, tokenize, and preprocess documents. Plagiarism engine: checking similarity between the input document and the database to detect potential plagiarism.
  • 3. Implementing Plagiarism Detection Engine for English Academic Papers 2 Table of Contents Abstract___________________________________________________________________________ 1 Table of Contents ___________________________________________________________________ 2 Table of Figures____________________________________________________________________ 4 Table of Tables_____________________________________________________________________ 7 Chapter 1 Introduction ___________________________________________________________ 8 1.1 What is Plagiarism? _________________________________________________________________8 1.2 What is Self-Plagiarism? _____________________________________________________________8 1.3 Plagiarism on the Internet ____________________________________________________________8 1.4 Plagiarism Detection System __________________________________________________________8 1.4.1 Local similarity: __________________________________________________________________________8 1.4.2 Global similarity: _________________________________________________________________________9 1.4.3 Fingerprinting ___________________________________________________________________________9 1.4.4 String Matching __________________________________________________________________________9 1.4.5 Bag of words _____________________________________________________________________________9 1.4.6 Citation-based Analysis____________________________________________________________________9 1.4.7 Stylometry_______________________________________________________________________________9 Chapter 2 Background Theory ____________________________________________________ 10 2.1 Linear Algebra Basics______________________________________________________________ 10 2.1.1 Vectors_________________________________________________________________________________10 2.2 Information Retrieval (IR)__________________________________________________________ 11 2.3 Regular 
Expression________________________________________________________________ 15 2.4 NLTK Toolkit ____________________________________________________________________ 16 2.5 Node.js __________________________________________________________________________ 16 2.6 Express.js________________________________________________________________________ 16 2.7 Sockets.io ________________________________________________________________________ 16 2.8 Languages Used___________________________________________________________________ 16 Chapter 3 Design and Architecture_________________________________________________ 17 3.1 Extract, Transfer and Load (ETL) ___________________________________________________ 17 3.2 Plagiarism Engine_________________________________________________________________ 17 3.2.1 Natural Language Processing, (Generating k-grams), and vectorization__________________________18 3.2.2 Semantic Analysis (Vector Space Model VSM Representation) _________________________________18 3.2.3 Calculating Similarity ____________________________________________________________________18 3.2.4 Clustering ______________________________________________________________________________18 3.2.5 Communicating Results___________________________________________________________________19 Chapter 4 Implementation________________________________________________________ 20 4.1 Extract, Load and Transform (ETL) _________________________________________________ 20 4.1.1 The Crawler ____________________________________________________________________________20 4.1.2 The Parser______________________________________________________________________________20 4.1.3 The Data Extracted from the paper_________________________________________________________20 4.1.4 The Parser Implementation _______________________________________________________________21
  • 4. Implementing Plagiarism Detection Engine for English Academic Papers 3 4.1.5 How it works____________________________________________________________________________21 4.1.6 Steps of Parsing _________________________________________________________________________22 4.1.7 The Paper Class _________________________________________________________________________26 4.1.8 The Paragraph Structure _________________________________________________________________27 4.1.9 The Parsing the First Page in Details (ex: an IEEE Paper) _____________________________________27 4.1.10 The Parsing the Other Pages in Details (ex: an IEEE Paper)____________________________________37 4.2 The Natural Language Processing (NLP)______________________________________________ 42 4.2.1 Introduction ____________________________________________________________________________42 4.2.2 The Implementation Overview_____________________________________________________________42 4.2.3 The Text Processing Procedure ____________________________________________________________42 4.2.4 Example of the Text Processing ____________________________________________________________45 4.3 Term Weighting __________________________________________________________________ 47 4.3.1 Lost Connection to Database Problem ______________________________________________________47 4.3.2 Process Paragraph _______________________________________________________________________48 4.3.3 Generating Terms _______________________________________________________________________48 4.3.4 Populating term, paragraphVector Tables ___________________________________________________51 4.3.5 Executing VSM Algorithm ________________________________________________________________52 4.4 Testing Plagiarism ________________________________________________________________ 53 4.4.1 Process Paragraph _______________________________________________________________________53 4.4.2 Calculate Similarity 
______________________________________________________________________54 4.4.3 Get Results _____________________________________________________________________________54 4.5 The VSM Algorithm _______________________________________________________________ 55 4.5.1 Calculating similarity ____________________________________________________________________55 4.5.2 K-means and Clustering __________________________________________________________________56 4.6 Server Side_______________________________________________________________________ 59 4.6.1 Handling Routing________________________________________________________________________59 4.6.2 Running Python System __________________________________________________________________60 4.7 Client Side _______________________________________________________________________ 62 4.8 The GUI of the System _____________________________________________________________ 63 Chapter 5 Results and Discussion__________________________________________________ 66 5.1 Dataset of the Parser_______________________________________________________________ 66 5.2 Exploring dataset _________________________________________________________________ 68 5.4.1 Small dataset (15K) ______________________________________________________________________68 5.4.2 Big dataset (50K) ________________________________________________________________________69 5.3 Performance _____________________________________________________________________ 70 5.4 Detecting plagiarism _______________________________________________________________ 72 5.4.1 Percentage score functions:________________________________________________________________72 5.5 Discussing results _________________________________________________________________ 74 Chapter 6 Conclusion ___________________________________________________________ 75 Chapter 7 Appendix _____________________________________________________________ 76 7.1 Entity-Relation Diagram (ERD) 
_____________________________________________________ 76 7.2 Stored procedures_________________________________________________________________ 77 References _______________________________________________________________________ 84
Table of Figures
Figure 1.1 Plagiarism Detection Approaches ____ 8
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3) ____ 10
Figure 2.2 Geometric representation of documents ____ 12
Figure 3.1 High level block diagram ____ 17
Figure 3.2 Detailed block diagram of the Plagiarism Engine ____ 17
Figure 4.1 Overview for the Crawler and Parser ____ 20
Figure 4.2 UML of the Parser Application ____ 21
Figure 4.3 The Flow Chart of the Parser ____ 21
Figure 4.4 The main function of Parsing ____ 22
Figure 4.5 The First Page of an IEEE Paper (as Blocks) ____ 22
Figure 4.6 First Page of a Science Direct Paper ____ 23
Figure 4.7 First Page of a Springer Paper ____ 23
Figure 4.8 The function of parserOtherPages ____ 24
Figure 4.9 Block of String before Enhancing ____ 25
Figure 4.10 The Paragraphs after Enhancing ____ 25
Figure 4.11 The Paper Structure ____ 26
Figure 4.12 The Paragraph Structure ____ 27
Figure 4.13 Different forms for an IEEE Top Header ____ 27
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper ____ 28
Figure 4.15 The supported Regex of the IEEE Header formats ____ 29
Figure 4.16 The Function of Extracting the Volume Number ____ 30
Figure 4.17 The Function of Extracting the Issue Number ____ 30
Figure 4.18 The Function of Extracting the DOI ____ 30
Figure 4.19 The Function of Extracting the Start and End Pages ____ 31
Figure 4.20 The Function of Extracting the Journal Title ____ 32
Figure 4.21 Parsing the rest of blocks in the first Page ____ 32
Figure 4.22 The Function of Extracting the DOI and PII ____ 33
Figure 4.23 The Function of Extracting the ISSN ____ 33
Figure 4.24 The Function of Extracting the Paper Dates ____ 34
Figure 4.25 The Function of Extracting the Keywords ____ 35
Figure 4.26 The Function of Extracting the Keywords ____ 36
Figure 4.27 The Function of Extracting the Title and the Authors ____ 36
Figure 4.28 Defining the Style of the Header ____ 38
Figure 4.29 The Function of Extracting the Figure Captions ____ 39
Figure 4.30 The Function of Separating the Lists ____ 40
Figure 4.31 The Function of Extracting the Paragraph ____ 40
Figure 4.32 The Function of Extracting the Paragraph ____ 41
Figure 4.33 Process Text Function ____ 42
Figure 4.34 Tokenizing Words Function ____ 42
Figure 4.35 Tokenization Example ____ 43
Figure 4.36 POS Function ____ 43
Figure 4.37 POS Output Example ____ 43
Figure 4.38 WordNet POS Function ____ 43
Figure 4.39 Removing Punctuations Function ____ 44
Figure 4.40 Removing Stop Words Function ____ 44
Figure 4.41 Stop Words List ____ 44
Figure 4.42 Lemmatization Function ____ 45
Figure 4.43 Paragraph before Text Processing ____ 45
Figure 4.44 Paragraph after Text Processing ____ 46
Figure 4.45 Retrieving Paragraphs ____ 47
Figure 4.46 Process Paragraph Function ____ 48
Figure 4.47 Generate k-gram Terms Function ____ 49
Figure 4.48 Paragraph Example ____ 49
Figure 4.49 1-gram terms ____ 50
Figure 4.50 2-gram terms ____ 50
Figure 4.51 3-gram terms ____ 50
Figure 4.52 4-gram terms ____ 50
Figure 4.53 5-gram terms ____ 50
Figure 4.54 Calculate Term Frequency ____ 51
Figure 4.55 Insert Terms in Database ____ 51
Figure 4.56 Insert Paragraph Vector in Database ____ 51
Figure 4.57 Executing the VSM Algorithm ____ 52
Figure 4.58 Tokenizing and Linking Paragraphs Together ____ 53
Figure 4.59 Process Input Paragraphs ____ 53
Figure 4.60 Populate Input Paragraph Vector ____ 53
Figure 4.61 Calculate Similarity ____ 54
Figure 4.62 Get Results ____ 54
Figure 4.63 Flowchart of the Kmeans Text Clustering Algorithm ____ 57
Figure 4.64 Home Page Routing ____ 59
Figure 4.65 Pre-Process Page Routing ____ 59
Figure 4.66 Communicating between the Server and the Core Engine for Testing Plagiarism ____ 60
Figure 4.67 Communicating between the Server and the Core Engine for Pre-processing ____ 61
Figure 4.68 Longest Common Subsequence (LCS) Algorithm ____ 62
Figure 4.69 Longest Common Subsequence (LCS) Algorithm ____ 62
Figure 4.70 Submitting an Input Document ____ 63
Figure 4.71 The Results of the Process, Part 1 ____ 64
Figure 4.72 The Results of the Process, Part 2 ____ 65
Figure 5.1 Number of Papers Published per Year in IEEE ____ 66
Figure 5.2 Number of Papers Published per Year in Springer ____ 67
Figure 5.3 Number of Papers Published per Year in Science Direct ____ 67
Figure 5.4 Response Time against Number of Paragraphs Tested on a Small Dataset ____ 70
Figure 5.5 Screenshot of the System Performance from the System GUI ____ 71
Figure 7.1 ERD of the Plagiarism Engine Database ____ 76
Table of Tables
Table 1 Statistics of the Parser ____ 66
Table 2 Dataset Statistics ____ 68
Table 3 Unique Terms Count in each Paragraph ____ 68
Table 4 Unique Terms Count in Dataset ____ 68
Table 5 Dataset Statistics ____ 69
Table 6 Unique Terms Count in each Paragraph ____ 69
Table 7 Unique Terms Count in Dataset ____ 69
Table 8 Processing Time of each Module in the Plagiarism Engine ____ 70
Table 9 Parameters ____ 72
Table 10 Testing Paragraphs and Results ____ 73
Chapter 1 Introduction

1.1 What is Plagiarism?
Plagiarism is the act of academic theft: copying words from a book or a scientific paper and publishing them as one's own work. Stealing ideas, images, videos, or music and using them without permission or a proper citation is also plagiarism.

1.2 What is Self-Plagiarism?
Self-plagiarism is reusing a portion of an article or other work one has published before without citing that one is doing so. The reused portion may be significant, identical, or nearly identical. It may also cause copyright issues, as the copyright of the old work will have been transferred with the earlier publication. Such articles are called duplicate or multiple publications.

1.3 Plagiarism on the Internet
Blogs, Facebook pages, and some websites copy and paste information, violating many copyrights, so several measures are used to discourage plagiarism: disabling right-click to prevent copying, placing copyright warnings on every page of a website as banners or pictures, and filing reports of copyright infringement under the DMCA copyright law. Such a report can be sent to the website owner or to the ISP hosting the website, and the infringing website will be removed.

1.4 Plagiarism Detection System
A plagiarism detection system tests whether a piece of material contains plagiarism. The material could be a scientific article, a technical report, an essay, or anything else. The system can also highlight the plagiarized parts of the material and state where they were copied from, even when some words have been replaced with synonyms.

Figure 1.1 Plagiarism Detection Approaches

1.4.1 Local similarity
Given a small dataset, the system checks the similarity between each pair of paragraphs in the dataset, for example to check whether two students cheated on an assignment.
1.4.2 Global similarity
Global similarity systems check a small set of input paragraphs against a large dataset, for example to check whether a submitted paper is plagiarized from an already published paper.

1.4.3 Fingerprinting
In this approach, the dataset consists of sets of n-grams selected randomly as substrings of each document. Each set of n-grams represents a fingerprint for that document, called its minutiae, and all fingerprints are indexed in the database. The input text is processed the same way and compared with the fingerprints in the database; if it matches some of them, it plagiarizes those documents. [1]

1.4.4 String Matching
Exact string matching is one of the central problems in plagiarism detection: comparing the document to be tested against the whole database requires a huge amount of resources and storage, so suffix trees and suffix vectors are used to overcome this problem. [2]

1.4.5 Bag of words
This approach is an adaptation of vector space retrieval. Each document is represented as a bag of words, and these words are inserted into the database as n-grams together with their locations in the document and their frequencies in this and other documents. The document to be tested is represented as a bag of words too and compared with the n-grams in the database. [3]

1.4.6 Citation-based Analysis
This is the only approach that does not rely on text similarity. It examines the citation and reference information in texts to identify similar patterns in the citation sequences. It is not widely used in commercial software, but prototypes of it exist.

1.4.7 Stylometry
Stylometry analyzes only the suspicious document, detecting plagiarized passages through differences in linguistic characteristics.
This method is not accurate on small documents, as it needs to analyze large passages (up to thousands of words per chunk) to extract reliable linguistic properties [4].

Our project uses the Global Similarity and Bag of Words approaches: the system holds a dataset of many scientific papers split into paragraphs, and the input text is likewise split into paragraphs and compared against this large dataset of paragraphs.
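As an illustration of the fingerprinting approach from section 1.4.3, the sketch below hashes every word n-gram and keeps the hashes divisible by p (a common "0 mod p" selection heuristic standing in for random substring selection). This is a minimal sketch, not the exact scheme of [1]; the function names and parameters are illustrative.

```python
import hashlib

def fingerprints(text, n=5, p=4):
    """Select a subset of word n-grams as a document fingerprint (minutiae).

    Every n-gram is hashed; hashes divisible by p are kept. The kept
    hashes can be indexed in a database and compared against queries.
    """
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))]
    hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams]
    return {h for h in hashes if h % p == 0}

def overlap(fp_query, fp_doc):
    """Fraction of the query's minutiae found in a stored fingerprint."""
    return len(fp_query & fp_doc) / max(len(fp_query), 1)
```

Because hashing is deterministic, a verbatim copy of an indexed document produces the same minutiae and a high overlap score.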
Chapter 2 Background Theory

2.1 Linear Algebra Basics
Since we use the Vector Space Model to represent and retrieve text documents, some basic linear algebra is needed.

2.1.1 Vectors
A vector is a geometric object that has a magnitude and a direction, or, algebraically, a mathematical object consisting of ordered values.

1. Representation in 2D and 3D
1) Graphical (geometric) representation: a vector is represented graphically as an arrow in the Cartesian 2D plane or the Cartesian 3D space.

Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3). Source: Wikimedia Commons.

2) Cartesian representation: vectors in an n-dimensional Euclidean space can be represented as coordinate vectors; the endpoint of a vector can be identified with an ordered list of n real numbers (an n-tuple). [5]
2D vector: a = (a_x, a_y)
3D vector: a = (a_x, a_y, a_z)

2. Operations on vectors
1) Scalar product: r*a = (r*a_x, r*a_y, r*a_z)
2) Sum: a + b = (a_x + b_x, a_y + b_y, a_z + b_z)
3) Subtraction: a - b = (a_x - b_x, a_y - b_y, a_z - b_z)
4) Dot product
Algebraic definition: a . b = a_x*b_x + a_y*b_y + a_z*b_z
Geometric definition: a . b = |a| |b| cos θ, where |a| is the magnitude of vector a, |b| is the magnitude of vector b, and θ is the angle between a and b.
The projection of a vector a in the direction of another vector b is given by a_b = a . b̂, where b̂ is the normalized (unit) vector of b.

2.2 Information Retrieval (IR)
Information retrieval can be defined as "the process of finding material of an unstructured nature (usually text) that satisfies an information need, or is relevant to a query, from a large collection of data" [6]. As the definition suggests, IR differs from an ordinary select query in that the retrieved information is unstructured and does not always exactly match the query. Information retrieval methods are used in search engines, in text classification (such as spam filtering), and, in our case, in a plagiarism engine.

1. Vector Space Model (VSM)
The basic idea of VSM is to represent text documents as vectors in a space of term weights.

1) Term Frequency weighting (TF)
The simplest VSM weighting is plain term frequency; all other weighting functions are modifications of it. In TF weighting each text is represented by a vector of d dimensions, where d is the number of terms in the dataset; the value of the vector's nth dimension equals the frequency of the nth term in the document.
For example, assume a dataset of 2 dimensions/terms (play, ground):
Document 1 "play ground" is represented as d1 = (1, 1)
Document 2 "play play" is represented as d2 = (2, 0)
Document 3 "ground" is represented as d3 = (0, 1)
More generally, the weight of word w in document d is defined as weight(w, d) = count(w, d)
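The running play/ground example can be reproduced with sparse, dictionary-based vectors; a small sketch (the function names are illustrative), with dot-product similarity included for the comparison that follows:

```python
from collections import Counter

def tf_vector(text):
    """Term-frequency vector: one dimension per term, value = its count."""
    return Counter(text.split())

def dot_similarity(q, d):
    """Dot product over the terms shared by the two sparse vectors."""
    return sum(q[w] * d[w] for w in q.keys() & d.keys())

d1 = tf_vector("play ground")  # (1, 1) in (play, ground) space
d2 = tf_vector("play play")    # (2, 0)
d3 = tf_vector("ground")       # (0, 1)
```

Terms absent from a document simply do not appear in its dictionary, which is how the engine avoids storing the full (mostly zero) vector for every paragraph.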
The dot product similarity between d1 and d2 = d1 . d2 = (1, 1) . (2, 0) = 1*2 + 1*0 = 2

2) Term Frequency with Inverse Document Frequency weighting (TF-IDF)
Document frequency df(w) is the number of documents that contain the word w. TF-IDF adds an inverse document frequency factor to penalize common terms: since they have a high probability [7] of appearing in any document, they do not strongly indicate plagiarism, unlike less probable terms, which carry more information.
weight(w, d) = count(w, d) * 1/df(w)
So for the above example: df(play) = 2 and df(ground) = 2, and in this case all the weights are scaled by a half.

Figure 2.2 Geometric representation of documents

2. State-of-the-art VSM functions
1) Pivoted Length Normalization [8]
weight(w, d) = ln[1 + ln[1 + count(w, d)]] / (1 - b + b*|d|/avdl) * log((M + 1)/df(w))
Where:
weight(w, d) is the weight of word w in document/paragraph d
count(w, d) is the count of word w in document d, i.e. the term frequency
b ∈ [0, 1] is the document length normalization parameter
|d| is the length of document d
avdl is the average length of the documents in the dataset
M is the number of documents in the dataset
df(w) is the number of documents that contain the word w, i.e. the document frequency

The document length normalization term (1 - b + b*|d|/avdl) linearly penalizes a document whose length is larger than the average document length (avdl), and rewards a document whose length is smaller than the average. The parameter b controls the amount of normalization: if b = 0 there is no normalization at all; if b = 1 the normalization is linear with offset zero and slope 1.

The inverse document frequency (IDF) term log((M + 1)/df(w)) penalizes common terms, as explained above. The IDF is normalized by the number of documents because how common a term is depends not only on its document frequency but also on the size of the dataset: a term that appears in 10 documents out of 100 is much more common than a term that appears in 10 documents out of 1000. The logarithm smooths the IDF weighting, i.e. reduces the variation of the weight when the document frequency varies a lot.

The term frequency (TF) part ln[1 + ln[1 + count(w, d)]] applies a double natural logarithm to achieve a sublinear transformation (smoothing the TF curve) and avoid over-scoring documents with heavily repeated words; the first occurrence of a term should carry the highest weight. Imagine a document with an extremely large frequency of one term: without the sublinear transformation, this document would always score a high similarity against any input query containing that term, even higher than a genuinely more similar document.
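The pivoted length normalization weight translates directly into code; a minimal sketch using the same symbols as the formula above (count, |d|, avdl, df, M, b):

```python
import math

def pivoted_weight(count_wd, doc_len, avdl, df_w, M, b=0.5):
    """Pivoted length normalization weight of word w in document d.

    count_wd: term frequency of w in d; doc_len: |d|; avdl: average
    document length; df_w: document frequency of w; M: number of
    documents; b in [0, 1] controls the length normalization.
    """
    tf = math.log(1 + math.log(1 + count_wd))   # double-log sublinear TF
    norm = 1 - b + b * doc_len / avdl           # length normalization
    idf = math.log((M + 1) / df_w)              # smoothed IDF
    return tf / norm * idf
```

With b = 0 the weight ignores document length entirely, and a term with a large df gets a small IDF factor, matching the behaviour described above.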
2) Okapi BM25 [9]
BM stands for Best Match. The weight is defined as follows:
weight(w, d) = (k + 1)*count(w, d) / (count(w, d) + k*(1 - b + b*|d|/avdl)) * log((M + 1)/df(w))
where all symbols are defined as in Pivoted Length Normalization and k ∈ [0, ∞).
It is similar to Pivoted Length Normalization, but instead of natural logarithms it uses division and the k parameter to achieve the sublinear transformation.
It was originally developed from the probabilistic model; however, it is very similar to the Vector Space Model.

3. Similarity functions
After representing text documents as vectors in the space, we need functions to calculate the similarity (or distance) between any two vectors.

1) Dot product similarity
similarity(q, d) = Σ_{w ∈ q ∩ d} count(w, q) * weight(w, d)
where similarity(q, d) is the similarity score between document d and input query q. The score is simply the sum, over every word appearing in both the document and the query, of the product of its term weights. It is very popular because it is general and can be used with any term weighting.

2) Cosine similarity
similarity(q, d) = (Σ_{w ∈ q ∩ d} count(w, q) * weight(w, d)) / (|q| * |d|)
where |q| is the magnitude of the query vector and |d| is the magnitude of the document vector. It is basically the dot product divided by the product of the lengths of the two vectors, which yields the cosine of the angle between them. This function has built-in document (and query) length normalization.

4. Clustering
Clustering is an unsupervised machine learning method (unsupervised because the data are not labeled) and a powerful data mining technique: the process of grouping similar objects together. It can theoretically speed up the information retrieval process by a factor of K, where K is the number of clusters. This is achieved by clustering similar paragraphs together: each new query is first compared with the centroids of the clusters, and then only with the paragraphs of the closest cluster, which is much faster than measuring the similarity of the query against every paragraph in the dataset.
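The BM25 weighting and cosine similarity described above can be sketched over the same sparse dictionary vectors. This is a simplified illustration (the real engine stores the weights in the database); function names are illustrative.

```python
import math

def bm25_weight(count_wd, doc_len, avdl, df_w, M, k=1.2, b=0.75):
    """Okapi BM25 weight of word w in document d; k >= 0, b in [0, 1].

    The division by (count + k * length-norm) saturates the TF part
    below k + 1 instead of using nested logarithms.
    """
    tf = (k + 1) * count_wd / (count_wd + k * (1 - b + b * doc_len / avdl))
    idf = math.log((M + 1) / df_w)
    return tf * idf

def cosine_similarity(q, d):
    """Dot product of two sparse vectors divided by the product of magnitudes."""
    dot = sum(q[w] * d[w] for w in q.keys() & d.keys())
    mag = math.sqrt(sum(v * v for v in q.values())) * \
          math.sqrt(sum(v * v for v in d.values()))
    return dot / mag if mag else 0.0
```

Note how even an extreme term frequency cannot push the BM25 TF factor above k + 1, which is the saturation behaviour the k parameter provides.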
We use the K-means algorithm (centroid-based clustering), an iterative improvement algorithm that groups the dataset into a predefined number of clusters K. It goes like this [10]:
1. Select K random points from the dataset as the initial guess of the centroids (cluster centers).
2. Assign each record in the dataset to the closest centroid based on a given similarity function.
3. Move each centroid closer to the points assigned to it by calculating the mean of the points in the cluster.
4. If a local optimum is reached (i.e. the centroids stopped moving), stop; otherwise repeat from step 2.
Since K-means is sensitive to the initial choice of centroids and can get stuck in local optima, we repeat it with different initial centroids and keep the best result (the one with the least mean squared error).
The time complexity is O(tknd), where t is the number of iterations until convergence, k is the number of clusters, n is the number of records in the dataset, and d is the number of dimensions. Since usually tk ≪ n, the algorithm is considered linear-time.
Note: this is the typical time complexity when the algorithm is applied to a dataset with a negligible number of dimensions. In our case the number of dimensions is very large (all terms in the dataset), but fortunately, for each centroid or paragraph we iterate only over the terms that appear in it, not over all dimensions, so the time complexity differs, as discussed in detail in the implementation section.

2.3 Regular Expression
A regular expression is a sequence of characters and symbols written to detect patterns, where each symbol in the sequence has a meaning, e.g.: + means one or more, * means zero or more, and - means a range (A-Z: all capital letters from A to Z).
For example, a birth date could be written as "May 15th, 1993", so the pattern of the date is [Month] [day][st|nd|rd|th], [year], and a regular expression for it is [A-Za-z]{3,9} [0-9]{1,2}(st|nd|rd|th), [0-9]{4}
First comes the month, one of 12 fixed words; they could be written out explicitly, or treated simply as a sequence of 3 to 9 letters, since the shortest month name (May) has 3 letters and the longest (September) has 9. Then a space, then the day as a number of 1 or 2 digits followed by one of the four suffixes (st, nd, rd, th), then a comma and a space, then the year as a number of 4 digits. This is not the only date format, so a full date expression could be more complicated than this.
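The date pattern above can be tried out directly with Python's re module; the sketch below uses a non-capturing group for the ordinal suffix and includes the comma that separates day and year in the "May 15th, 1993" example:

```python
import re

# 3-9 letter month, 1-2 digit day with ordinal suffix, comma, 4-digit year.
DATE_RE = re.compile(r"[A-Za-z]{3,9} [0-9]{1,2}(?:st|nd|rd|th), [0-9]{4}")

assert DATE_RE.fullmatch("May 15th, 1993")
assert DATE_RE.fullmatch("September 1st, 2007")
assert DATE_RE.fullmatch("May 15, 1993") is None  # ordinal suffix required
```

The same compile-then-match pattern is how the Parser's publisher-specific regular expressions are applied to the header blocks.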
2.4 NLTK Toolkit
NLTK is a Python module for Natural Language Processing (NLP) used for text processing. It has algorithms for sentence and word tokenization, contains a large number of corpora, ships its own WordNet corpus, and is used for Part-of-Speech (POS) tagging, stemming, and lemmatization.

2.5 Node.js
Node.js is a runtime environment built on Chrome's V8 JavaScript engine for developing server-side web applications. It uses an event-driven, non-blocking I/O model.

2.6 Express.js
Express.js is a web application server framework for Node.js and its de facto standard. It is a very thin layer, with many features available as plugins.

2.7 Socket.io
Socket.io is a library for real-time web applications. It enables bi-directional communication between the web client and the server, primarily using the WebSocket protocol with polling as a fallback option.

2.8 Languages Used
1. Java
2. Python
3. SQL
4. JavaScript
5. HTML & CSS
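The preprocessing stages that NLTK provides (section 2.4) can be mimicked in a few lines of plain Python; a simplified stand-in, since the real NLTK pipeline needs its corpora downloaded. The stop-word list here is a small sample, and lowercasing stands in for real lemmatization (e.g. NLTK's WordNetLemmatizer):

```python
import re

# Sample stop-word list; the engine uses a much longer one (Figure 4.41).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    """Simplified stand-in for the NLTK pipeline: tokenize on letters,
    lowercase, drop punctuation and stop words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The surviving tokens are what the engine turns into k-gram terms and vectors in the chapters that follow.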
Chapter 3 Design and Architecture

3.1 Extract, Transform and Load (ETL)
In this part we build the database: many scientific papers are downloaded using the Crawler software (Extract), then passed to the Parser software, where all the paper information and text content are extracted as paragraphs (Transform) and inserted into the database (Load).

3.2 Plagiarism Engine
The plagiarism engine preprocesses a huge dataset of academic English papers and analyzes it using natural language processing techniques to extract useful information, then measures the similarity between an input query and the dataset using information retrieval methods, detecting both identical and paraphrased plagiarism in a fast and intelligent way.

Figure 3.1 High level block diagram
Figure 3.2 Detailed block diagram of the Plagiarism Engine
3.2.1 Natural Language Processing, k-gram Generation, and Vectorization
The text processing part works on this data to extract the most important words from the paragraphs and ignore the common words; k-gram terms are then generated from these words, and each bag of words is linked to its corresponding paragraph in the database.

3.2.2 Semantic Analysis (Vector Space Model Representation)
Input: simple term frequency vector representation, stored in the paragraphVector table.
Output: dataset statistics (number of paragraphs, number of terms, average paragraph length) stored in the dataSetInfo table; document frequency (for each term, the number of paragraphs in which it appears) stored in the IDF column of the term table; and pivoted length normalization and BM25 vector weights.
In this part we calculate a more sophisticated vector representation of our text corpus than plain term frequency: a normalized TF-IDF representation using both pivoted length normalization and BM25, as discussed later.

3.2.3 Calculating Similarity
Input: vectorized input paragraphs stored in the inputPargraphVector table, and BM25 or pivoted length normalization weights in the BM25 and pivotNorm columns of the paragraphVector table.
Output: similarity between each input paragraph and the relevant dataset paragraphs, stored in the similarity table.
We check the similarity between the input paragraph and the paragraphs in the dataset, and report possible plagiarism if the similarity between the input paragraph and any dataset paragraph exceeds a predetermined threshold. We implemented both Okapi BM25 and pivoted length normalization similarity functions. The system first measures similarity on 5-gram vectors, then 4-grams, and so on; whenever it finds high similarity at one k-gram level, it limits its scope to the paragraphs that scored high at the preceding k-gram levels, to increase performance.
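The k-gram generation of section 3.2.1 amounts to sliding a window of k words over each processed paragraph; a minimal sketch:

```python
def k_grams(tokens, k):
    """All contiguous k-word terms of a token list (empty if too short)."""
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

tokens = "plagiarism detection engine for papers".split()
```

Running this for k = 1 through 5 on each paragraph yields the term sets whose frequencies populate the paragraph vectors.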
3.2.4 Clustering
Input: paragraph vectors with BM25 (or pivoted length normalization) weights, stored in the paragraphVector table.
Output: the cluster of each paragraph, stored in the clusterId column of the paragraph table, and the centroids of the clusters, stored in the centroid table.
We cluster similar paragraphs together so that similarity is measured only against similar paragraphs, speeding up the similarity checking step. An input paragraph is first measured against the centroids to determine its cluster, and then the regular similarity measure is applied against all dataset paragraphs in that cluster.

3.2.5 Communicating Results
This is the interface where the user checks a document for plagiarism: the document is entered in a text box and parsed in a similar way to the system's Parser, being split into paragraphs that pass through the text processing part and are compared with the dataset in the database. The results appear as a plagiarism percentage for the document, with the plagiarized parts highlighted alongside the matching documents.
Chapter 4 Implementation

4.1 Extract, Transform and Load (ETL)

Figure 4.1 Overview for the Crawler and Parser

4.1.1 The Crawler
The Crawler is a piece of software that downloads scientific papers from the web into a folder per publisher, where the Parser starts working on them.

4.1.2 The Parser
The Parser is a piece of software that takes a PDF document (a scientific paper) as input, extracts the paper's information and content, and inserts them into the system database.

4.1.3 The Data Extracted from the Paper
a. Paper information
1. Paper title
2. Paper authors
3. Journal and its ISSN
4. Volume, issue, paper date, and other dates (Accepted, Received, Revised, Published)
5. DOI (Digital Object Identifier) or PII (Publisher Item Identifier)
6. Starting page and ending page
b. Abstract and keywords
c. Table of contents
d. Figure and table captions
e. Paper text content (as paragraphs)
4.1.4 The Parser Implementation

Figure 4.2 UML of the Parser Application

4.1.5 How it works

Figure 4.3 The Flow Chart of the Parser

The Parser consists of a parent class (Parser) and children classes (IEEE, Springer, APEM, and Science Direct). The parent class has the general functions that parse the PDF document and extract the table of contents, the figure and table captions, and the text content of the paper; the children classes
have specific functions and regular expressions for each publisher's structure to extract the paper information (title, authors, DOI, ...). Each publisher has its own folder where its scientific papers are downloaded by the Crawler; the Parser monitors each folder for new documents and uses the suitable child class to parse each new document found and extract all the needed information and data.
If the paper information and content are extracted completely, the file is moved to the Processed directory; otherwise, it is moved to the Unprocessed directory and the error is logged, so the developer can check whether it is a new structure to be supported in the Parser, or whether something went wrong that he has to fix.

4.1.6 Steps of Parsing
4.1.6.1 Extracting the Text from the PDF file (extractBlocks function)
The Parser uses the PDFxStream Java library, which extracts the text from the PDF file as blocks of strings. This function loops over the file page by page; for each page it extracts the content into an ArrayList<String> object called page and adds this page, together with the page number, to a HashMap<Integer, ArrayList<String>> object called pages.

Figure 4.5 The First Page of an IEEE Paper (as Blocks)

public void parsePaper(String publisher) throws Exception {
    extractBlocks();
    try { parseFirstPage(); }
    catch (Exception e) { throw new Exception("Error Not Processed"); }
    parserOtherPages();
    paper.enhaceParagraphs();
    try { paper.insertPaperInDatabase(publisher); }
    catch (SQLException e) { throw new Exception("Error Database"); }
}
Figure 4.4 The main function of Parsing
4.1.6.2 Extracting the Paper Information (parseFirstPage function)
Each publisher accepts its scientific papers in a specific structure that differs from publisher to publisher; the differences lie in the first page, where the paper information is written. A parser therefore has to be designed to support each publisher's structure, which is why this function, abstract in the parent class, is implemented in each publisher's child class.

Figure 4.6 First Page of a Science Direct Paper
Figure 4.7 First Page of a Springer Paper

These are different structures for Science Direct and Springer, showing the difference in the organization and layout of the information, e.g.:
1. This is a header of a Springer paper:
Kong et al. EURASIP Journal on Advances in Signal Processing 2014, 2014:44
http://asp.eurasipjournals.com/content/2014/1/44
2. This is a header of an IEEE paper:
IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93

4.1.6.3 Extracting the Paper Text Content (parserOtherPages function)
This function uses the general Parser functions: it loops over all the pages and over the blocks of strings in each page, and extracts from the blocks data that could be table-of-contents headers, figure and table captions, lists, and paragraphs. Each block passes through several stages:
1) First, test whether the block is a figure caption.
2) Then test whether it is a table caption.
3) Then test whether it contains a header (table of contents).
4) Then test whether the block contains lists (numeric, dash, or dot).
In the 3rd stage, if there are headers in the block, they are extracted and the rest of the block is returned to the function, which continues with the remaining stages.

void parserOtherPages() {
    for (Entry<Integer, ArrayList<String>> entrySet : pages.entrySet()) {
        Integer pageNumber = entrySet.getKey();
        ArrayList<String> page = entrySet.getValue();
        Iterator<String> it = page.iterator();
        while (it.hasNext()) {
            String block = it.next().trim();
            boolean isFigureCaption = false, isTableCaption = false;
            boolean isList = false, isEmptyParagraph = false;
            isFigureCaption = parseFigureCaption(block, pageNumber);
            isTableCaption = parseTableCaption(block, pageNumber);
            block = parseHeaders(block);
            isList = parseLists(block, pageNumber);
            isEmptyParagraph = "".equals(block);
            if (!isFigureCaption && !isTableCaption && !isEmptyParagraph && !isList)
                parseParagraph(block, pageNumber);
        }
    }
}
Figure 4.8 The function of parserOtherPages
5) Finally, if the block is none of the previous types (not a figure or table caption, contains no list, or its header has been extracted and the rest of the block returned), it is a paragraph and is extracted as such.
4.1.6.4 Enhancing the Paragraphs (enhanceParagraph Function)
As shown in Figure 4.9, some paragraphs are not in good shape when first extracted:
1) Some words may be split across two lines with a hyphen, so they have to be rejoined; there may also be extra spaces between words that have to be removed.
2) The paragraph is extracted as separate lines (each ending with a newline character) rather than one continuous string, so it has to be refined.
3) Some of the paper information is in uppercase, so it is capitalized.
Figure 4.9 Block of String before Enhancing
Page Number: 1 The Content: However, as the number of metal layers increases and interconnect dimensions decrease, the parasitic capacitance increases associated with fill metal have become more significant, which can lead to timing and signal integrity problems.
Page Number: 1 The Content: Previous research has primarily focused on two important aspects of fill metal realization: 1) the development of fill metal generation methods – which we discuss further in Section II and 2) the modeling and analysis of capacitance increases due to fill metal – Several studies have examined the parasitic capacitance associated with fill metal for small scale interconnect test structures in order to provide general guidelines on fill metal geometry selection and placement. For large-scale designs,
Figure 4.10 The Paragraphs after Enhancing
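The cleanup steps listed above can be sketched with a few regular-expression passes. The class name and the exact regexes below are illustrative assumptions, not the project's actual enhanceParagraph implementation:

```java
// Sketch of the paragraph-enhancement steps described above.
// Class name and regexes are assumptions, not the project's actual code.
public class ParagraphEnhancer {
    public static String enhance(String raw) {
        // 1) Rejoin words split across lines with a hyphen: "capaci-\ntance" -> "capacitance"
        String text = raw.replaceAll("-\\s*\\r?\\n\\s*", "");
        // 2) Turn the remaining line breaks into spaces so the paragraph is one string
        text = text.replaceAll("\\r?\\n", " ");
        // 3) Collapse runs of spaces into a single space
        return text.replaceAll("[ ]{2,}", " ").trim();
    }
}
```

Running the sketch on a broken extract such as "the parasitic capaci-\ntance  increases" yields the continuous string shown in Figure 4.10.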
4.1.6.5 Finally Inserting All These Data in the Database
When the parser starts, an object of type Paper is created, and every piece of information and data extracted from the scientific paper is assigned to its attribute in this object. At the end of parsing, all this information and data are inserted into the database by calling this function:
1) Retrieve the journal ID from the database by its name or ISSN. If the journal is already present, its ID is returned; otherwise it is considered a new journal, inserted, and its new ID is returned.
2) Test whether the paper has already been inserted into the database. If it has, the parser throws an exception stating that it was inserted before; if it is a new paper, it is inserted with its information (title, volume, issue, ...) and the paper ID is returned.
3) With the paper ID, the rest of the data is inserted (authors, keywords, table of contents, figure captions, table captions, and the text content of the paper, i.e. the paragraphs).
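The get-or-insert logic of step 1 and the duplicate check of step 2 can be sketched without a real database. The in-memory maps below are a hypothetical stand-in for the journals and papers tables, and all names here are illustrative, not the project's schema:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the journals/papers tables, illustrating the
// lookup-or-insert and duplicate-check logic described above (not project code).
public class PaperInserter {
    private final Map<String, Integer> journals = new HashMap<>(); // name -> ID
    private final Map<String, Integer> papers = new HashMap<>();   // title -> ID
    private int nextJournalId = 1, nextPaperId = 1;

    // Step 1: return the journal's ID, inserting it first if it is new.
    public int getOrInsertJournal(String name) {
        Integer id = journals.get(name);
        if (id != null) return id;
        journals.put(name, nextJournalId);
        return nextJournalId++;
    }

    // Step 2: insert the paper, or throw if it was inserted before.
    public int insertPaper(String title, int journalId) {
        if (papers.containsKey(title))
            throw new IllegalStateException("Paper already inserted: " + title);
        papers.put(title, nextPaperId);
        return nextPaperId++;
    }
}
```

In the real system the maps would be SELECT/INSERT statements against the relational database, but the control flow is the same.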
4.1.7 The Paper Class
This class works as a structure for the paper. It has the attributes that hold the information and data of the paper, together with the function enhanceParagraphs(), which is responsible for improving the text and preparing the paragraphs for the next processing step in the Natural Language Processing part, and the function insertPaperInDatabase(), which is responsible for testing whether the paper was already inserted into the database, and if it is a new

public class Paper {
    public String title = "";
    public int volume = -1, issue = -1;
    public int startingPage = -1, endingPage = -1;
    public String journal = "", ISSN = "";
    public String DOI = "";
    public ArrayList<String> headers = new ArrayList<>();
    public ArrayList<String> authors = new ArrayList<>();
    public ArrayList<String> keywords = new ArrayList<>();
    public ArrayList<Paragraph> figureCaptions = new ArrayList<>();
    public ArrayList<Paragraph> tableCaptions = new ArrayList<>();
    public ArrayList<Paragraph> paragraphs = new ArrayList<>();
    public String date = "";
    public String dateReceived = "NULL", dateRevised = "NULL";
    public String dateAccepted = "NULL", dateOnlinePublishing = "NULL";

    public void enhanceParagraphs() { /* ... */ }
    public void insertPaperInDatabase(String publisher) { /* ... */ }
}
Figure 4.11 The Paper Structure
paper, it is inserted with all of its data: the paragraphs, the figure and table captions, and the paper information.
Note that when the parser finds a new PDF document in the publishers' folders, it creates a new object of type Paper, and while parsing the document, each piece of extracted information is assigned to its attribute in this object. At the end of the parsing process, the Paper object executes its two member functions: enhanceParagraphs() to refine the paragraph content, then insertPaperInDatabase() to insert all the data into the database.
4.1.8 The Paragraph Structure
As shown in Figure 4.12, the paragraph structure is very simple: it contains the content of the extracted paragraph and the number of the page from which it was extracted.

public class Paragraph {
    public int pageNum;
    public String content;
}
Figure 4.12 The Paragraph Structure

4.1.9 Parsing the First Page in Detail (ex: an IEEE Paper)
As shown in Figure 4.13, the page is divided into blocks of string, and each block contains one or more pieces of the paper information. This function in the parser is implemented specifically for each publisher, so the function of the IEEE parser will not work for the Springer parser and so on; it parses only the first page, extracting the paper information found there and assigning it to the attributes of the Paper object.
Even within a single publisher there are differences in the location of the paper information on the page, and the structure changes over time; for example, the IEEE parser supports 8 different forms of the paper header.
Figure 4.13 Different Forms of an IEEE Top Header
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper
4.1.9.1 Parsing the Paper Header
The parseFirstPage() function starts by parsing the header of the paper, which is the first block in the paper. The block is sent to the parsePaperHeader() function, which has a different regular expression for every header form that the parser supports, as shown in Figure 4.15. When the function receives the block, the block passes through the different expressions; if it matches one of the supported formats, the function starts extracting the information, otherwise it throws an exception stating that this header format is not supported and has to be added by the developer.
As shown in Figure 4.13, the header may contain information such as the starting page number (at the start or at the end of the line), the journal title, the volume number, the issue number, and the date. Since this information may or may not be present in the header, the suitable functions (parsePaperDate(), parseVolume(), parseIssue(), parseJournal(), parseStartingPage()) are called according to the format of the header to extract it.

// ex: Chang et al. VOL. 1, NO. 4/SEPTEMBER 2009/ J. OPT. COMMUN. NETW. C35
String header1_Exp = "^([A-Z]+ ET AL. " + volume_Exp + ", " + issue_Exp
        + "[ ]*/[ ]*" + paperDate_Exp + "[ ]*/[ ]*" + journalTitle_Exp + " [A-Z0-9]+)$";

// ex: 594 J. OPT. COMMUN. NETW. /VOL. 1, NO. 7/DECEMBER 2009 Lim et al.
String header2_Exp = "^([A-Z0-9]+ " + journalTitle_Exp + "/" + volume_Exp
        + ", " + issue_Exp + "/" + paperDate_Exp + " [A-Z]+ ET AL.)$";

// ex: IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
// ex: 93 IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007
// ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May-June 2008
// ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May/June 2008
// ex: 93 IEEE TRANSACTIONS ON MAGNETICS Vol. 13, No. 6; December 2006
String header3_Exp = "^(([0-9]+ )*" + journalTitle_Exp + "(,)* " + volume_Exp
        + ", " + issue_Exp + "(,|;) " + paperDate_Exp + "( [0-9]+)*)$";

// ex: 598 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING
String header4_Exp = "^([0-9]+ " + journalTitle_Exp + ")$";

// ex: IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 598
String header5_Exp = "^(" + journalTitle_Exp + " [0-9]+)$";

// ex: 1956 lRE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES 75
String header6_Exp = "^([0-9]{4} " + journalTitle_Exp + "[0-9]+)$";

// ex: 112 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS May
String header7_Exp = "^([0-9]+ " + journalTitle_Exp + "[A-Z]{3,9})$";

// ex: SUPPLEMENT TO IEEE TRANSACTIONS ON AEROSPACE / JUNE 1965
String header8_Exp = "^(" + journalTitle_Exp + "[ ]*/[ ]*" + dateExp + ")$";
Figure 4.15 The Supported Regexes of the IEEE Header Formats
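To show how one of these forms matches end to end, header3_Exp can be composed from the sub-expressions quoted elsewhere in this chapter. This is an illustrative, self-contained sketch, not the project's class: it assumes paperDate_Exp is the date expression given in the date section, and the dash in the date expression's character class is moved next to the brackets so the class stays a literal set in Java:

```java
import java.util.regex.Pattern;

// Illustrative composition of header3_Exp from the sub-expressions quoted
// in the text (assumptions noted in the lead-in; not the project's code).
public class HeaderFormDemo {
    static final String volume_Exp = "VOL(.)* [A-Z-]*[0-9]+";
    static final String issue_Exp = "NO(.|,) [0-9]+";
    static final String journalTitle_Exp = "[A-Z :-—/)(.,]+";
    // Dash placed so "[ /-]" is a literal character set, not a range.
    static final String paperDate_Exp =
            "[A-Z]{0,9}[ /-]*[A-Z]{3,9}(.)*( [0-9]{1,2}(,)*)* [0-9]{4}";
    static final String header3_Exp = "^(([0-9]+ )*" + journalTitle_Exp + "(,)* "
            + volume_Exp + ", " + issue_Exp + "(,|;) " + paperDate_Exp + "( [0-9]+)*)$";

    static boolean matchesHeader3(String header) {
        return Pattern.compile(header3_Exp).matcher(header).matches();
    }
}
```

With this composition, the example header "IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93" is accepted, while ordinary paragraph text is not.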
4.1.9.2 Extracting the Volume from the Header
The IEEE parser uses volume_Exp = VOL(.)* [A-Z-]*[0-9]+ to detect the volume part of the header and passes it to the parseVolume() function, which then uses another expression to extract the number from this part. For example, from a header containing (VOL. 18) the parser first detects that part, then detects the number in it (18), then converts it from String to int and assigns it to the volume attribute of the Paper object.
4.1.9.3 Extracting the Issue Number from the Header
The IEEE parser uses issue_Exp = NO(.|,) [0-9]+ to detect the issue part of the header and passes it to the parseIssue() function, which then uses another expression to extract the number from this part. For the same example presented in the volume section, the parser detects the part (NO. 3), then detects the number in this result (3), then converts it from String to int and assigns it to the issue attribute of the Paper object.
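The two-stage idea described above (first match the volume or issue part, then pull the number out of the match) can be exercised in a self-contained form. The class and method names here are illustrative; only the two expressions are taken from the text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Self-contained demonstration of the two-stage extraction described above,
// using the volume_Exp and issue_Exp expressions quoted in the text.
public class HeaderNumberDemo {
    static final String volume_Exp = "VOL(.)* [A-Z-]*[0-9]+";
    static final String issue_Exp = "NO(.|,) [0-9]+";

    // Stage 1: find the part (e.g. "VOL. 43"); stage 2: extract the number.
    static int extract(String header, String partExp) {
        Matcher part = Pattern.compile(partExp).matcher(header);
        if (part.find()) {
            Matcher num = Pattern.compile("[0-9]+").matcher(part.group());
            if (num.find())
                return Integer.parseInt(num.group());
        }
        return -1; // same default value the Paper attributes start with
    }
}
```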
4.1.9.4 Extracting the Paper Date from the Header

@Override
void parseVolume(String volume) {
    Matcher matcher = Pattern.compile(volume_Exp).matcher(volume);
    if (matcher.find()) {
        Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
        while (numMatcher.find())
            paper.volume = Integer.parseInt(numMatcher.group());
    }
}
Figure 4.16 The Function of Extracting the Volume Number

@Override
void parseIssue(String issue) {
    Matcher matcher = Pattern.compile(issue_Exp).matcher(issue);
    if (matcher.find()) {
        Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
        if (numMatcher.find())
            paper.issue = Integer.parseInt(numMatcher.group());
    }
}
Figure 4.17 The Function of Extracting the Issue Number

@Override
void parsePaperDate(String date) {
    Matcher matcher = Pattern.compile(paperDate_Exp).matcher(date.trim());
    if (matcher.find())
        paper.date = matcher.group().replaceAll("^/", "").trim();
}
Figure 4.18 The Function of Extracting the Paper Date
Like the other parts of the header, the IEEE parser uses date_Exp = [A-Z]{0,9}[/- ]*[A-Z]{3,9}(.)*( [0-9]{1,2}(,)*)* [0-9]{4} to extract the date part of the header and assign it to the date attribute of the Paper object. The date can be written in different formats (2016, March 2016, May/June 2016, May-June 2016), and the expression is written to detect all of these forms.
Note that after each piece of information is extracted, it is removed from the header string, so after removing the volume, issue, and date, the information left in the header is the journal title and the starting page, and the starting page can be at the start or at the end of the header.
4.1.9.5 Extracting the Start and End Page Numbers from the Header
Now the header contains only the journal title and the start page number, so the IEEE parser uses startPage_Exp = ^[0-9]+|[0-9]+$, an expression that matches a number at the start or at the end of the checked string; whether the start page number lies at the start or at the end of the header, it is detected and extracted, and like the other information it is assigned to its attribute in the Paper object.
For the end page it is very simple: the IEEE parser adds the number of pages of the paper to the start page number and assigns the result to the endingPage attribute of the Paper object.
4.1.9.6 Extracting the Journal Title from the Header
Finally, for the journal title, the IEEE parser uses journalTitle_Exp = [A-Z :-—/)(.,]+ to extract the journal title part of the header, then passes the title to the parseJournalTitle() function. The title may contain extra words that are not needed, such as ([author name] et al.)
or it may end with separating characters (a comma or a forward slash), so these must be removed first; the rest is then assigned to the journal attribute of the Paper object.

@Override
void parseStartingPage(String startingPage) {
    Matcher matcher = Pattern.compile(startingPage_Exp).matcher(startingPage);
    if (matcher.find())
        paper.startingPage = Integer.parseInt(matcher.group().trim());
    parseEndingPage(startingPage);
}

@Override
void parseEndingPage(String endingPage) {
    paper.endingPage = paper.startingPage + pages.size();
}
Figure 4.19 The Function of Extracting the Start and End Pages
4.1.9.7 Parsing the Rest of the First Page's Blocks

@Override
void parseJournal(String journal) {
    journal = journal.replaceAll("( /|, )", "").trim();
    Matcher matcher = Pattern.compile(journalTitle_Exp).matcher(journal);
    if (matcher.find()) {
        String journalName = matcher.group().replaceAll("[A-Z ]+ ET AL.", "");
        if (journalName.charAt(journalName.length()-1) == '/')
            paper.journal = journalName.substring(0, journalName.length()-1);
        else
            paper.journal = journalName;
    }
}
Figure 4.20 The Function of Extracting the Journal Title

Iterator<String> it = pageOne.iterator();
while (it.hasNext()) {
    String mainBlock = it.next();
    String block = mainBlock.replaceAll("[ ]+", " ");
    if (Pattern.compile(IEEE_DOI_Exp + "|" + PII_Exp).matcher(block).find()) {
        parseDOI(block);
        blockList.add(mainBlock);
    }
    if (ISSN_Pattern.matcher(block).find()) {
        parseISSN(block);
        blockList.add(mainBlock);
    }
    if (Pattern.compile("Index Terms").matcher(block).find()) {
        parseKeywords(block);
        blockList.add(mainBlock);
    }
    if (Pattern.compile("(Abstract|ABSTRACT|Summary)").matcher(block).find()) {
        parseAbstract(block);
        blockList.add(mainBlock);
    }
    if (date_Pattern.matcher(block.toUpperCase()).find() && !datesFound) {
        parseDates(block);
        if (!paper.dateAccepted.equals("NULL")
                || !paper.dateOnlinePublishing.equals("NULL")
                || !paper.dateReceived.equals("NULL")
                || !paper.dateRevised.equals("NULL")) {
            blockList.add(mainBlock);
            datesFound = true;
        }
    }
}
removeUnimportantBlocks();
for (String blockList1 : blockList)
    pageOne.remove(blockList1);
Figure 4.21 Parsing the Rest of the Blocks in the First Page
After parsing the header block and extracting all the information from it, the IEEE parser continues to parse the other blocks, searching for the rest of the information. Due to the differences in structure, the location of this information can differ from structure to structure, so the best way to extract it is to loop through all the first-page blocks: using the regular expressions for this information (DOI, ISSN, ...) the parser can locate it, and it also tries to detect some other blocks such as the abstract, keywords, nomenclature, and the paper dates (when it was received, accepted, revised, and published online).
In every iteration, if a piece of information is detected, the block is passed to the suitable function to extract it. Once the information has been extracted, the block is no longer needed, so the parser adds it to a TreeSet<String> (blockList); after all the iterations over the blocks of page one are finished, these blocks are removed from the page.
There may also be other blocks that hold no important information, such as the publisher's website or the publisher's logo with its name underneath; these all have to be detected and removed as well, using the function removeUnimportantBlocks().
4.1.9.8 Extracting the DOI or the PII
In the loop, if a block is detected to contain the DOI (Digital Object Identifier) or the PII (Publisher Item Identifier) using IEEE_DOI_Exp = [0-9]{2}.[0-9]{4}/[A-Z-]+.[0-9]+.[0-9]+ or PII_Exp = [0-9]{4}-[0-9xX]{4}([0-9]{2})[0-9]{5}-(x|X|[0-9]), the IEEE parser passes the block to the parseDOI() function, and the DOI or PII is extracted. If it is the DOI, it is concatenated with the DOI domain for papers (http://dx.doi.org/); if it is the PII, it is concatenated with (http://dx.doi.org/10.1109/S); the result is assigned to the DOI attribute of the Paper object.
4.1.9.9 Extracting the ISSN

@Override
void parseDOI(String DOI) {
    Matcher matcher = Pattern.compile(IEEE_DOI_Exp).matcher(DOI);
    while (matcher.find())
        paper.DOI = "http://dx.doi.org/" + matcher.group();
    matcher = Pattern.compile(PII_Exp).matcher(DOI);
    while (matcher.find())
        paper.DOI = "http://dx.doi.org/10.1109/S" + matcher.group();
}
Figure 4.22 The Function of Extracting the DOI and PII

void parseISSN(String ISSN) {
    Matcher matcher = ISSN_Pattern.matcher(ISSN);
    while (matcher.find())
        paper.ISSN = matcher.group().replaceAll("(–|-|‐)", "-");
}
Figure 4.23 The Function of Extracting the ISSN
Also, if a block is detected to contain the ISSN using ISSN_Exp = [0-9]{4}(–|-|‐| )[0-9]{3}[0-9xX], the IEEE parser passes the block to the parseISSN() function, which extracts the ISSN and assigns it to the ISSN attribute of the Paper object.
4.1.9.10 Extracting the Dates of the Paper
If a block in the iteration is detected to contain dates using date_Exp = (([0-9]{1,2}(-| )[A-Z]{3,9}[.]*(-| )[0-9]{4})|[A-Z]{3,9}[.]*( [0-9]{1,2},)* [0-9]{4}), the IEEE parser passes the block to the parseDates() function. This block contains dates related to the paper, such as when it was received by the publisher and when it was revised,

void parseDates(String dates) {
    dates = dates.replaceAll(separatedWord_Fixing, "")
            .replaceAll(newLine_Removal, " ").toUpperCase();
    Matcher matcher = receivedDate_Pattern.matcher(dates);
    while (matcher.find()) {
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while (dateMatcher.find())
            paper.dateReceived = dateMatcher.group().trim();
    }
    matcher = revisedDate_Pattern.matcher(dates);
    while (matcher.find()) {
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while (dateMatcher.find())
            paper.dateRevised = dateMatcher.group().trim();
    }
    matcher = acceptedDate_Pattern.matcher(dates);
    while (matcher.find()) {
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while (dateMatcher.find())
            paper.dateAccepted = dateMatcher.group().trim();
    }
    matcher = publishingDate_Pattern.matcher(dates);
    while (matcher.find()) {
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while (dateMatcher.find())
            paper.dateOnlinePublishing = dateMatcher.group().trim();
    }
}
Figure 4.24 The Function of Extracting the Paper Dates
accepted, and published online; for each of those dates there is a regular expression to detect it. Note that not all papers include these dates, but most of them do, so they are extracted if present and assigned to their attributes in the Paper object.
The dates can be written in many formats: (30 OCTOBER 2007), (17 AUG. 2007), (28-JULY-2009), (OCTOBER 6, 2006), so the regular expression for the date itself is complicated, as it has to detect all of these date formats.
Also, the word before the date can be written in different forms, (Received), (Received:), (Revised), (Revised:), or (Received in revised form), and may be lowercase or capitalized, so the regular expressions are constructed to detect all forms of those words; for the character case, the string is transformed to uppercase before comparison.
4.1.9.11 Extracting the Keywords
If the keywords block is detected, the IEEE parser passes it to the parseKeywords() function. The keywords may be found inside the abstract block, so the first step is to crop the keywords part if it appears together with the abstract. The block may also be split over two lines, or contain a word split over two lines with a hyphen, so these have to be removed and fixed. After that, since some papers separate the keywords with a comma (,) and others with a semicolon (;), the block is split on both, and the split keywords are added to the keywords list in the Paper object.
4.1.9.12 Extracting the Abstract
For the abstract block, during the iteration in the parseFirstPage() procedure, if one of the blocks matches the word Abstract or Summary, this block is passed to the parseAbstract() function and is considered the first paragraph in the page, with the header Abstract.
In some cases the abstract may contain other information, such as the keywords or the nomenclature, so these have to be cropped first and parsed separately.

@Override
void parseKeywords(String keywords) {
    keywords = keywords.substring(keywords.indexOf("Index Terms"));
    String indexTerms_Removal = "-\\r\\n|Index Terms|-";
    keywords = keywords.replaceAll(indexTerms_Removal, "");
    String[] splitted = keywords.replaceAll(newLine_Removal, " ").split(",|;");
    for (int i = 0; i < splitted.length; i++)
        paper.keywords.add(splitted[i].trim());
}
Figure 4.25 The Function of Extracting the Keywords
4.1.9.13 Extracting the Title and Authors
After all the paper information has been extracted and those blocks removed, the next blocks contain the title of the paper, then the authors, then the introduction.
First, the title is passed to the parseTitle() procedure: if it is split over more than one line, the newline characters are removed, and it is assigned to the title attribute.
Next, the authors are passed to the parseAuthors() procedure, where they are separated, by comma, semicolon, or some other separator depending on the publisher's style, and each author is added to the authors list in the Paper object.

@Override
void parseAbstract(String abstractContent) {
    int indexOfIndexTerms = abstractContent.indexOf("Index Terms");
    if (indexOfIndexTerms != -1)
        abstractContent = abstractContent.substring(0, indexOfIndexTerms);
    int indexOfNomenclature = abstractContent.indexOf("NOMENCLATURE");
    if (indexOfNomenclature != -1)
        abstractContent = abstractContent.substring(0, indexOfNomenclature);
    paper.headers.add("Abstract");
    abstractContent = abstractContent.replaceAll("(Abstract|Summary)(-)*", "");
    String lastHeader = paper.headers.get(paper.headers.size()-1);
    Paragraph paragraph = new Paragraph(1, lastHeader, abstractContent);
    paper.paragraphs.add(paragraph);
}
Figure 4.26 The Function of Extracting the Abstract

void parseTitle(String title) {
    paper.title = title.replaceAll(newLine_Removal, " ").trim();
}

@Override
void parseAuthors(String authors) {
    authors = authors.replaceAll(author_Removal, "").replaceAll("[ ]+", " ");
    authors = authors.replaceAll(separatedWord_Fixing, "")
                     .replace(newLine_Removal, " ");
    String[] split = authors.split(",| and| And| AND");
    for (String author : split)
        if (!author.trim().isEmpty())
            paper.authors.add(author.replaceAll("[0-9]+", "").trim());
}
Figure 4.27 The Function of Extracting the Title and the Authors
4.1.10 Parsing the Other Pages in Detail (ex: an IEEE Paper)
Now all the paper information has been extracted, and the blocks remaining on the first page are the introduction and the rest of the page content. The parseFirstPage() procedure is done executing, and the parseOtherPages() procedure starts: as demonstrated before, it loops over all the string blocks in the pages and extracts all the possible data from them, such as headers (table of contents entries), figure and table captions, and lists; anything that is none of these is a paragraph. All of these procedures are part of the parent parser.
4.1.10.1 Extracting the Headers
This is a very general procedure that works efficiently for most types of headers. First it detects the style of the level 1 headers; it supports (I. INTRODUCTION), (1 INTRODUCTION), (1. INTRODUCTION), and (1 Introduction): the headers can be numbered with Roman numerals or with numbers, with or without a dot after the number, and written in uppercase or capitalized, so the function first detects the type of the header.
For the level 2 headers there are also different styles, and another function detects them; it supports 3 different types, for example (A. Level 2 Header), (1.1 Level 2 Header), and (1.1. Level 2 Header): the header can be listed alphabetically, as number dot number, or as number dot number dot, followed by the title of the header.
Once the header styles are identified, the headers' regular expressions are created and tested on all the passed blocks to detect any headers. These regexes are not constant but change as parsing proceeds: for example, after detecting (1. Introduction), the next expected header is (2. Another Header), so the number is incremented.
Another thing: the header always comes at the start of the string block, and the rest of the string is a paragraph (or the header may be extracted on its own in one block), so the function extracts the header only and returns the rest of the block, which is then parsed as a paragraph.
There are also headers that have no numbers, such as Abstract, References, Acknowledgements, and Appendix; these are detected separately with a separate regex.
This procedure can also detect the level 3 and level 4 headers, whose style is determined by the style of the level 2 headers, for example (1.1.1 Header) or (1.1.1. Header), and all headers are added to the headers list in the Paper object.
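The incrementing behavior described above can be sketched as follows. The class name and regex are illustrative assumptions for one header style, not the project's actual implementation:

```java
import java.util.regex.Pattern;

// Illustrative sketch (not the project's code) of a level 1 header matcher
// whose regex advances to the next expected section number after each match.
public class IncrementingHeaderMatcher {
    private int nextNumber = 1; // start by expecting "1. ..."

    // Returns true if the block starts with the expected "N. Capitalized" header,
    // and advances the expected number when it does.
    public boolean matches(String block) {
        Pattern p = Pattern.compile("^" + nextNumber + "\\. [A-Z][a-z]+");
        if (p.matcher(block).find()) {
            nextNumber++;
            return true;
        }
        return false;
    }
}
```

After "1. Introduction" is seen, the same block no longer matches, because the matcher is now waiting for section 2.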
4.1.10.2 Extracting the Figure and Table Captions
In this procedure, the parent parser uses figure_Exp = ^(Fig.|Figure)[ ]+[0-9]+(.|:) and table_Exp = ^(TABLE|Table)[ ]+([0-9]+|[IVX]+) to detect the figure and table captions. They can appear in different styles, for example (Figure 1.), (Fig. 1), (Figure 1:), (TABLE 1), and (Table II), and the numbering can be numeric or alphabetic. After extraction, the caption is added to the list of captions as a paragraph with its page number.

private enum HeaderMode {
    NUM_SPACE_UPPERCASE, NUM_SPACE_CAPITALIZED, ROMAN_DOT_UPPERCASE,
    NUM_DOT_UPPERCASE, NUM_DOT_CAPITALIZED, ABC_DOT_CAPITALIZED,
    NUM_DOT_NUM_DOT_CAPITALIZED, NUM_DOT_NUM_CAPITALIZED
}

Pattern restHeader_Pattern = Pattern.compile("^(REFERENCES|References|"
        + "ACKNOWLEDGMENT[S]*|Acknowledg[e]*ment[s]*|Nomenclature|DEFINITIONS"
        + "|Contents|NOMENCLATURE|ACRONYM|ACRONYMS|NOTATION|APPENDIX|"
        + "Appendix)(\\r\\n| )*");

void detect_Header1Mode(String block) {
    if (Pattern.compile("I. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.ROMAN_DOT_UPPERCASE;
    else if (Pattern.compile("1 INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_UPPERCASE;
    else if (Pattern.compile("1 Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_CAPITALIZED;
    else if (Pattern.compile("1. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_UPPERCASE;
    else if (Pattern.compile("1. Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_CAPITALIZED;
}

void detect_Header2Mode(int _1st_header, String block) {
    if (Pattern.compile("^A. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.ABC_DOT_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header + ".1 [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header + ".1. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_DOT_CAPITALIZED;
}
Figure 4.28 Defining the Style of the Header
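The two caption expressions quoted above can be checked directly; this small self-contained class is illustrative (its names are not the project's), while the two patterns are exactly those given in the text:

```java
import java.util.regex.Pattern;

// Illustrative check of the figure/table caption expressions quoted above;
// the class and method names are assumptions, the patterns come from the text.
public class CaptionDemo {
    static final Pattern figure = Pattern.compile("^(Fig.|Figure)[ ]+[0-9]+(.|:)");
    static final Pattern table = Pattern.compile("^(TABLE|Table)[ ]+([0-9]+|[IVX]+)");

    static boolean isFigureCaption(String block) { return figure.matcher(block).find(); }
    static boolean isTableCaption(String block)  { return table.matcher(block).find(); }
}
```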
4.1.10.3 Extracting the Lists
In this procedure, the parent parser detects the lists in the text content and separates each one as a paragraph of its own. It supports numeric, dot, and dashed lists. The block may have a paragraph before the list and a paragraph after it, so they have to be separated, and each paragraph (if found) and each list is added as a paragraph with its page number to the paragraphs list in the Paper object.

private boolean parseFigureCaption(String block, int pageNumber) {
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = figureCaption_Pattern.matcher(block);
    while (matcher.find()) {
        String figureTitle = block.replaceAll(separatedWord_Fixing, "")
                .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if (paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph figure = new Paragraph(pageNumber, lastHeader, figureTitle);
        paper.figureCaptions.add(figure);
        return true;
    }
    return false;
}

private boolean parseTableCaption(String block, int pageNumber) {
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = tableCaption_Pattern.matcher(block);
    while (matcher.find()) {
        String tableTitle = block.replaceAll(separatedWord_Fixing, "")
                .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if (paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph table = new Paragraph(pageNumber, lastHeader, tableTitle);
        paper.tableCaptions.add(table);
        return true;
    }
    return false;
}
Figure 4.29 The Functions of Extracting the Figure and Table Captions
  • 41. Implementing Plagiarism Detection Engine for English Academic Papers 40 4.1.10.4 Extracting the Paragraph Pattern newList_Pattern = Pattern.compile("rn[ ]*([0-9]|-|.|·|•)"); Pattern numericList1_Pattern = Pattern.compile("^[0-9](.|))[ ]+[A-Z]"); Pattern numericList2_Pattern = Pattern.compile("(.|:)rn[ ]*[0-9](.|))[ ]+[A- Z]"); Pattern dotList1_Pattern = Pattern.compile("^(.|·|•)[ ]+[A-Za-z]"); Pattern dotList2_Pattern = Pattern.compile("(.|:)rn[ ]*(.|·|•)[ ]+"); Pattern dashList1_Pattern = Pattern.compile("^-[ ]+[A-Za-z]+"); Pattern dashList2_Pattern = Pattern.compile("(.|:)rn[ ]*-[ ]+[A-Z]"); private boolean parseLists(String block,int pageNumber){ Matcher orderList1_Matcher = numericList1_Pattern.matcher(block); Matcher orderList2_Matcher = numericList2_Pattern.matcher(block); if(orderList1_Matcher.find() || orderList2_Matcher.find()) return parseList(block, pageNumber, numericList2_Pattern, newList_Pattern); Matcher dotList1_Matcher = dotList1_Pattern.matcher(block); Matcher dotList2_Matcher = dotList2_Pattern.matcher(block); if(dotList1_Matcher.find() || dotList2_Matcher.find()) return parseList(block, pageNumber, dotList2_Pattern, newList_Pattern); Matcher dashList1_Matcher = dashList1_Pattern.matcher(block); Matcher dashList2_Matcher = dashList2_Pattern.matcher(block); if(dashList1_Matcher.find() || dashList2_Matcher.find()) return parseList(block, pageNumber, dashList2_Pattern, newList_Pattern); return false; } Figure 4.30 the Function of separating the lists void parseParagraph(String block, int pageNumber){ Matcher matcher = newParagraph_Pattern.matcher(block); String lastHeader="",content; int startIndex =0, endIndex; while(matcher.find()){ endIndex = matcher.start(); if(paper.headers.size()>0) lastHeader = paper.headers.get(paper.headers.size()-1); content = block.substring(startIndex, endIndex+1); Figure 4.31 the Function of Extracting the Paragraph
In this procedure the parser detects paragraphs using the expression paragraph_Exp = "\.\r\n[ ]+[A-Z]". As mentioned before, a block is passed to this procedure only after it has been tested as a figure or table caption and as a list; it could also contain a header, which has to be extracted first so that only the rest of the block is processed. If the block passes all these tests, it is considered a paragraph and passed to the parseParagraph() procedure. Note that a block may contain more than one paragraph, so all of them have to be detected and separated, and each one is added, with its page number, to the paragraph list in the paper object.

        Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
        paper.paragraphs.add(paragraph);
        startIndex = endIndex + matcher.group().length() - 1;
    }
    if (paper.headers.size() > 0)
        lastHeader = paper.headers.get(paper.headers.size() - 1);
    content = block.substring(startIndex);
    Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
    paper.paragraphs.add(paragraph);
}

Figure 4.32 The Function of Extracting the Paragraph
4.2 The Natural Language Processing (NLP)

4.2.1 Introduction

In this section, the text extracted from the scientific papers is refined. We focus on the important content words, such as nouns and verbs, and ignore function words (stop words) such as prepositions and adverbs, so that plagiarism can be detected efficiently even if the user tries to play with the wording.

4.2.2 The Implementation Overview

First, each paragraph in the database is selected and passed to the processText() procedure, which performs the text processing and returns an array of refined words. In this procedure the paragraph passes through several steps:
1. Lowercase
2. Tokenization
3. Part of Speech (POS) tagging
4. Remove punctuation
5. Remove stop words
6. Lemmatization

4.2.3 The Text Processing Procedure

4.2.3.1 Lowercase

In this step, all the text is converted to lowercase, so we do not store redundant data for the same word written in different cases (Play, play).

4.2.3.2 Tokenization

def processText(document):
    document = document.lower()
    words = tokenizeWords(document)
    tagged_words = pos_tag(words)
    filtered_words = removePunctuation(tagged_words)
    filtered_words = removeStopWords(filtered_words)
    filtered_words = lemmatizeWords(filtered_words)
    return filtered_words

Figure 4.33 Process Text Function

def tokenizeWords(sentence):
    return word_tokenize(sentence)

Figure 4.34 Tokenizing Words Function
Here, we split the text into words using the Treebank tokenization algorithm. This algorithm splits the words in an intelligent way based on corpus data retrieved from NLTK, and it also separates words from surrounding punctuation. For example:

1. i’m → ['i', "'m"]
2. won’t → ['wo', "n't"]
3. gonna (tested) {helping} (25) → ['gon', 'na', 'tested', 'helping', '25']

Figure 4.35 Tokenization Example

4.2.3.3 Part of Speech (POS) tagging

POS tagging determines the grammatical role of each word in the sentence: it can detect whether the word is a verb, noun, adjective, or adverb. This information helps return the words to their base forms; verbs, for example, are reduced to their infinitives. We use the WordNet database to obtain the words' base forms.

words = ['At', '5 am', "tomorrow", 'morning', 'the', 'weather', "will", 'be', 'very', 'good', '.']
tagged_words = nltk.pos_tag(words)

Figure 4.36 POS Function

[('at', 'IN'), ('5', 'CD'), ('am', 'VBP'), ('tomorrow', 'NN'), ('morning', 'NN'), ('the', 'DT'), ('weather', 'NN'), ('will', 'MD'), ('be', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]

Figure 4.37 POS Output Example

def getWordnetPos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

Figure 4.38 WordNet POS Function
4.2.3.4 Remove Punctuation

In this step the punctuation is removed from the text: commas, full stops, single and double quotes, and parentheses, whether round, square, or curly.

def removePunctuation(words):
    new_words = []
    for word in words:
        # keep only tokens longer than one character
        # (this drops punctuation and single-letter tokens)
        if len(word[0]) > 1:
            new_words.append(word)
    return new_words

Figure 4.39 Removing Punctuation Function

4.2.3.5 Remove Stop Words

In this process the function words (stop words) are removed.

def removeStopWords(words):
    stop_words = set(stopwords.words("english"))
    new_words = []
    for word in words:
        if word[0] not in stop_words:
            new_words.append(word)
    return new_words

Figure 4.40 Removing Stop Words Function

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

Figure 4.41 Stop Words List
4.2.3.6 Lemmatization

In this step, we use the information retrieved from POS tagging to get the base forms of the words, by passing each word and its WordNet position to the lemmatize function.

def lemmatizeWords(words):
    new_words = []
    wordnet_lemmatizer = WordNetLemmatizer()
    for word in words:
        new_word = wordnet_lemmatizer.lemmatize(word[0], getWordnetPos(word[1]))
        new_words.append(new_word)
    return new_words

Figure 4.42 Lemmatization Function

After processing, the paragraph contains only the important words that describe its real meaning.

4.2.4 Example of the Text Processing

Plagiarism is the wrongful appropriation and stealing and publication of another author's language, thoughts, ideas, or expressions and the representation of them as one's own original work. The idea remains problematic with unclear definitions and unclear rules. The modern concept of plagiarism as immoral and originality as an ideal emerged in Europe only in the 18th century, particularly with the Romantic movement. Plagiarism is considered academic dishonesty and a breach of journalistic ethics. It is subject to sanctions like penalties, suspension, and even expulsion. Recently, cases of 'extreme plagiarism' have been identified in academia. Plagiarism is not in itself a crime, but can constitute copyright infringement. In academia and industry, it is a serious ethical offense. Plagiarism and copyright infringement overlap to a considerable extent, but they are not equivalent concepts, and many types of plagiarism do not constitute copyright infringement, which is defined by copyright law and may be adjudicated by courts. Plagiarism is not defined or punished by law, but rather by institutions (including professional associations, educational institutions, and commercial entities, such as publishing companies).

Figure 4.43 Paragraph before Text Processing
['plagiarism', 'wrongful', 'appropriation', 'stealing', 'publication', 'another', 'author', "'s", 'language', 'thought', 'idea', 'expression', 'representation', 'one', "'s", 'original', 'work', 'idea', 'remain', 'problematic', 'unclear', 'definition', 'unclear', 'rule', 'modern', 'concept', 'plagiarism', 'immoral', 'originality', 'ideal', 'emerge', 'europe', '18th', 'century', 'particularly', 'romantic', 'movement', 'plagiarism', 'consider', 'academic', 'dishonesty', 'breach', 'journalistic', 'ethic', 'subject', 'sanction', 'like', 'penalty', 'suspension', 'even', 'expulsion', 'recently', 'case', "'extreme", 'plagiarism', 'identify', 'academia', 'plagiarism', 'crime', 'constitute', 'copyright', 'infringement', 'academia', 'industry', 'serious', 'ethical', 'offense', 'plagiarism', 'copyright', 'infringement', 'overlap', 'considerable', 'extent', 'equivalent', 'concept', 'many', 'type', 'plagiarism', 'constitute', 'copyright', 'infringement', 'define', 'copyright', 'law', 'may', 'adjudicate', 'court', 'plagiarism', 'define', 'punish', 'law', 'rather', 'institution', 'include', 'professional', 'association', 'educational', 'institution', 'commercial', 'entity', 'publish', 'company']

Figure 4.44 Paragraph after Text Processing
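For quick experimentation, the pipeline above can be approximated without NLTK. The sketch below is a simplified, hypothetical stand-in for processText(): it uses a tiny hard-coded stop-word list and a crude plural-stripping rule in place of the real tokenizer, POS tagger, and WordNet lemmatizer, so its output is rougher than the project's.

```python
import re

# Minimal, dependency-free approximation of the processText() pipeline.
# STOP_WORDS and the suffix rule are illustrative placeholders, not the
# project's actual resources.
STOP_WORDS = {'is', 'the', 'and', 'of', 'as', 'a', 'an', 'or', 'in'}

def process_text_sketch(document):
    words = re.findall(r"[a-z']+", document.lower())      # lowercase + naive tokenize
    words = [w for w in words if len(w) > 1]              # drop single-char tokens
    words = [w for w in words if w not in STOP_WORDS]     # remove stop words
    return [w[:-1] if w.endswith('s') else w for w in words]  # crude "lemmatizer"

print(process_text_sketch("Plagiarism is the stealing of ideas."))
# → ['plagiarism', 'stealing', 'idea']
```

The real pipeline differs mainly in that lemmatization is POS-aware (so "stealing" would reduce to "steal"), which this sketch deliberately skips.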
4.3 Term Weighting

In this section we calculate the term weights for our system using the data extracted from the scientific papers by the parser. The parser extracts the data as paragraphs and stores them in the database; here we retrieve these paragraphs and calculate the term weights for the system.

4.3.1 Lost Connection to Database Problem

First we open a connection to the database and retrieve the unprocessed paragraphs. But we are processing a large number of paragraphs, and the connection must stay open all that time, so we face the problem of a lost database connection when its internal timeout expires.

1) Increasing the timeout
This problem could be solved by increasing the timeout, but this solution is limited, as the number of paragraphs might be so large that processing exceeds any timeout we set.

2) A better solution
We retrieve 100 paragraphs, process them, and close the connection. Then we open a new connection, retrieve another 100 paragraphs, and so on until all the unprocessed paragraphs are processed.

cursor = connection.run("SELECT COUNT(*) FROM paragraph WHERE processed = false")
(unprocessedParagraphsNum,) = cursor.fetchone()
connection.endConnect()

pCounter = 0
insertTermsBeginTime = time.time()
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 100:
        remain = 100
    rows = connection1.run("SELECT paragraphId,content FROM paragraph WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        # Process Paragraph
    connection1.endConnect()
    connection2.endConnect()

Figure 4.45 Retrieving Paragraphs
4.3.2 Process Paragraph

Each paragraph is passed to the processText() procedure to get an array of refined words. If the array is empty, there were no important words in the paragraph, and the paragraph is deleted. The returned words are used to generate k-gram terms and to populate the term table and the paragraphVector table. Finally, we update the length of the paragraph with the number of words returned from processText() and mark the paragraph as processed.

4.3.3 Generating Terms

To generate terms we call the generateTerms() procedure and pass it the bag of words and the kinds of k-grams we want to generate.

while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 1000:
        remain = 1000
    rows = connection1.run("SELECT paragraphId,content FROM paragraph" +
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        data = processText(content)
        length = len(data)
        if length < 1:
            connection2.run("DELETE FROM paragraph WHERE paragraphId = %s;", (paragraphId,))
            connection2.commit()
            continue
        term.populateTerms_ParagraphVector(connection2, data, paragraphId)
        connection2.run("UPDATE paragraph SET length = %s, processed = %s " +
                        " WHERE paragraphId = %s;", (length, True, paragraphId))
        connection2.commit()
    connection1.endConnect()
    connection2.endConnect()

Figure 4.46 Process Paragraph Function
Example of k-grams:

data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)

def generateTerms(data, kgrams, paragraphId=0):
    all_terms = {}
    for i in kgrams:
        if len(data) < i:
            continue
        terms = createTerms(data, i)
        all_terms[i] = terms
    data = {
        'paragraphId': paragraphId,
        'terms': all_terms
    }
    return data

def createTerms(words, kgram):
    length = len(words) - kgram + 1
    i = 0
    terms = []
    while i < length:
        term = createTerm(words, i, kgram)
        terms.append(term)
        i += 1
    return terms

def createTerm(words, start, kgram):
    i = start
    term = []
    while i < kgram + start:
        term.append(words[i])
        i += 1
    t = ' '.join(term)
    if len(t) > 180:
        t = t[0:180]
    return t

Figure 4.47 Generate k-gram Terms Function

Physics is one of the oldest academic disciplines, perhaps the oldest through its inclusion of astronomy. Over the last two millennia, physics was a part of natural philosophy along with chemistry, biology, and certain branches of mathematics.

Figure 4.48 Paragraph Example
['physic', 'one', 'old', 'academic', 'discipline', 'perhaps', 'old', 'inclusion', 'astronomy', 'last', 'two', 'millennium', 'physic', 'part', 'natural', 'philosophy', 'along', 'chemistry', 'biology', 'certain', 'branch', 'mathematics']

Figure 4.49 1-gram terms

['physic one', 'one old', 'old academic', 'academic discipline', 'discipline perhaps', 'perhaps old', 'old inclusion', 'inclusion astronomy', 'astronomy last', 'last two', 'two millennium', 'millennium physic', 'physic part', 'part natural', 'natural philosophy', 'philosophy along', 'along chemistry', 'chemistry biology', 'biology certain', 'certain branch', 'branch mathematics']

Figure 4.50 2-gram terms

['physic one old', 'one old academic', 'old academic discipline', 'academic discipline perhaps', 'discipline perhaps old', 'perhaps old inclusion', 'old inclusion astronomy', 'inclusion astronomy last', 'astronomy last two', 'last two millennium', 'two millennium physic', 'millennium physic part', 'physic part natural', 'part natural philosophy', 'natural philosophy along', 'philosophy along chemistry', 'along chemistry biology', 'chemistry biology certain', 'biology certain branch', 'certain branch mathematics']

Figure 4.51 3-gram terms

['physic one old academic', 'one old academic discipline', 'old academic discipline perhaps', 'academic discipline perhaps old', 'discipline perhaps old inclusion', 'perhaps old inclusion astronomy', 'old inclusion astronomy last', 'inclusion astronomy last two', 'astronomy last two millennium', 'last two millennium physic', 'two millennium physic part', 'millennium physic part natural', 'physic part natural philosophy', 'part natural philosophy along', 'natural philosophy along chemistry', 'philosophy along chemistry biology', 'along chemistry biology certain', 'chemistry biology certain branch', 'biology certain branch mathematics']

Figure 4.52 4-gram terms

['physic one old academic discipline', 
'one old academic discipline perhaps', 'old academic discipline perhaps old', 'academic discipline perhaps old inclusion', 'discipline perhaps old inclusion astronomy', 'perhaps old inclusion astronomy last', 'old inclusion astronomy last two', 'inclusion astronomy last two millennium', 'astronomy last two millennium physic', 'last two millennium physic part', 'two millennium physic part natural', 'millennium physic part natural philosophy', 'physic part natural philosophy along', 'part natural philosophy along chemistry', 'natural philosophy along chemistry biology', 'philosophy along chemistry biology certain', 'along chemistry biology certain branch', 'chemistry biology certain branch mathematics'] Figure 4.53 5-gram terms
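The term generation illustrated in Figures 4.49 to 4.53 can be sketched compactly. The functions below are illustrative rewrites of createTerms()/generateTerms() (renamed create_terms/generate_terms), not the project's exact code:

```python
# Dependency-free sketch of the k-gram term generation described above.
def create_terms(words, k, max_len=180):
    """Return all contiguous k-grams of `words`, each truncated to max_len chars."""
    return [' '.join(words[i:i + k])[:max_len] for i in range(len(words) - k + 1)]

def generate_terms(words, kgrams=(1, 2, 3, 4, 5)):
    """Map each k to its list of k-gram terms, skipping k larger than the text."""
    return {k: create_terms(words, k) for k in kgrams if len(words) >= k}

words = ['physic', 'one', 'old', 'academic', 'discipline']
terms = generate_terms(words)
print(terms[2])
# → ['physic one', 'one old', 'old academic', 'academic discipline']
```

Note that a paragraph of n words yields n - k + 1 terms for each k, which is why the lists above shrink by one entry as k grows.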
4.3.4 Populating the term and paragraphVector Tables

After generating the terms, we use them to populate the term and paragraphVector tables.

4.3.4.1 Calculate Term Frequency

We use the nltk.FreqDist() function to calculate the term frequency of each k-gram term in the paragraph.

tf = {}
for kgram in data['terms']:
    tf[kgram] = nltk.FreqDist(data['terms'][kgram])

Figure 4.54 Calculate Term Frequency

4.3.4.2 Inserting Terms

We insert each term with its corresponding k-gram.

query1 = "INSERT INTO term (kgram, term) VALUES (%s, %s) ON DUPLICATE KEY UPDATE kgram = kgram, term = term;"
insertTerms = [(str(kgram), str(term)) for kgram in tf for term in tf[kgram]]
connection.runMany(query1, insertTerms)
connection.commit()

Figure 4.55 Insert Terms into the Database

4.3.4.3 Inserting the Paragraph Vector

In this step we link each term with its paragraph and its term frequency by inserting them into the paragraphVector table.

query2 = "INSERT IGNORE INTO paragraphVector (paragraphId, termId, termFreq, kgram) VALUES (%s, (SELECT termId FROM term WHERE term = %s AND kgram = %s), %s, %s);"
insertDocVec = [(data['paragraphId'], str(term), str(kgram), tf[kgram][term], str(kgram)) for kgram in tf for term in tf[kgram]]
connection.runMany(query2, insertDocVec)
connection.commit()

Figure 4.56 Insert Paragraph Vector into the Database
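nltk.FreqDist is essentially a counting dictionary, so the term-frequency step can be illustrated with the standard library's collections.Counter (the sample terms below are made up for the example):

```python
from collections import Counter

# Illustrative stand-in for nltk.FreqDist: both map each term to its count.
terms_by_kgram = {
    1: ['physic', 'old', 'physic', 'part'],
    2: ['physic old', 'old physic', 'physic part'],
}
tf = {k: Counter(terms) for k, terms in terms_by_kgram.items()}
print(tf[1]['physic'])   # → 2, since 'physic' occurs twice among the 1-gram terms
```

This is exactly the shape the insertion code above iterates over: for each k-gram size, every distinct term paired with its frequency.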
4.3.5 Executing the VSM Algorithm

After all paragraphs have been processed and inserted into the database, we run stored SQL procedures to update several columns (inverseDocFreq, BM25, pivotNorm) in the term and paragraphVector tables. Now the system is complete: all terms are weighted and ready for plagiarism testing.

connection.callProcedure('update_inverseDocFreq')
connection.callProcedure('update_BM25', (0.75, 1.5))
connection.callProcedure('update_pivotNorm', (0.75,))

Figure 4.57 Executing the VSM Algorithm
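The stored procedures themselves are not reproduced here, but a hedged sketch of the standard Okapi BM25 weighting they presumably implement looks like this, assuming the parameters (0.75, 1.5) passed above correspond to b and k1:

```python
import math

def inverse_doc_freq(total_paragraphs, paragraphs_with_term):
    # Plain IDF; the project's stored procedure may use different smoothing.
    return math.log(total_paragraphs / paragraphs_with_term)

def bm25_weight(tf, doc_len, avg_doc_len, idf, k1=1.5, b=0.75):
    # Okapi BM25 term weight: saturates with tf and normalizes by
    # paragraph length relative to the average paragraph length.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

idf = inverse_doc_freq(1000, 10)              # rare term -> high IDF
print(round(bm25_weight(3, 120, 100, idf), 3))
```

The pivotNorm column likewise suggests pivoted length normalization with slope 0.75, though the exact formula is defined by the project's SQL.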
4.4 Testing Plagiarism

When a user submits a text or a file to test for plagiarism, the text must first be split into paragraphs, and an inputPaper record is inserted to relate these paragraphs together.

connection.run(" INSERT INTO inputPaper (inputPaperId) VALUES(''); ")
paragraphs = tokenizeParagraphs(text)

Figure 4.58 Tokenizing and Linking Paragraphs Together

4.4.1 Process Paragraph

Each paragraph is then processed in a similar way to the pre-processing stage: first the paragraph is inserted into the inputParagraph table, then the text is passed to the processText() procedure, which returns a refined bag of words. Finally, these words are used to generate terms and populate the inputParagraphVector table.

for paragraph in paragraphs:
    data = processText(paragraph)
    length = len(data)
    if length < 1:
        continue
    cursor = connection.run("INSERT INTO inputParagraph (content,inputPaperId) VALUES (%s,%s)", (paragraph, paperId))
    connection.commit()
    paragraphId = cursor.getlastrowid()
    term.populateInput_Terms_ParagraphVector(connection, data, paragraphId)

Figure 4.59 Process Input Paragraphs

def populateInput_Terms_ParagraphVector(connection, words, paragraphId):
    data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)
    # Term frequency representation
    tf = {}
    for kgram in data['terms']:
        tf[kgram] = FreqDist(data['terms'][kgram])
    query = "INSERT INTO inputParagraphVector (inputParagraphId, termId, termFreq, kgram) SELECT %s, termId, %s, %s FROM term WHERE term = %s AND kgram = %s ;"
    insertDocVec = [(data['paragraphId'], tf[kgram][term], str(kgram), str(term), str(kgram)) for kgram in tf for term in tf[kgram]]
    connection.runMany(query, insertDocVec)
    connection.commit()

Figure 4.60 Populate Input Paragraph Vector
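Once the input paragraph vectors are populated, the engine can compare them against the stored paragraph vectors. The comparison itself is beyond this section, but a common choice in the vector space model is cosine similarity over weighted term vectors; the function below is an illustrative sketch, not the project's SQL implementation:

```python
import math

def cosine_similarity(vec_a, vec_b):
    # vec_a, vec_b: dicts mapping term -> weight (e.g. term frequency or BM25).
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

stored = {'plagiarism': 2.0, 'copyright': 1.0}      # hypothetical stored vector
submitted = {'plagiarism': 1.0, 'detection': 1.0}   # hypothetical input vector
print(round(cosine_similarity(stored, submitted), 3))  # → 0.632
```

A score near 1.0 indicates a strong overlap between the submitted paragraph and a stored one, flagging it as potentially plagiarized.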