SUPERVISED BY: Dr Hitham M. Abo Bakr
Implementing Plagiarism
Detection Engine
For English Academic Papers
By
Muhamed Gameel Abd El Aziz
Ahmed Motair El Said Mater
Mohamed Hessien Mohamed
Shreif Hosni Zidan Esmail
Manar Mohamed Said Ahmed
Doaa Abd El Hamid Abd El Hamid
Implementing Plagiarism Detection Engine for English Academic Papers 1
Abstract
Plagiarism has become a serious issue nowadays due to the vast resources easily available on the web, which makes developing a plagiarism detection tool a useful and challenging task due to scalability issues.
Our project implements a Plagiarism Detection Engine oriented toward English academic papers using text Information Retrieval methods, a relational database, and Natural Language Processing techniques.
The main parts of the project are:
Gathering and cleaning data: crawling the web, collecting academic papers, and parsing them to extract information about each paper, building a large dataset of scientific paper content.
Tokenization: parsing, tokenizing, and preprocessing documents.
Plagiarism engine: checking similarity between the input document and the database to detect potential plagiarism.
Table of Contents
Abstract
Table of Contents
Table of Figures
Table of Tables
Chapter 1 Introduction
1.1 What is Plagiarism?
1.2 What is Self-Plagiarism?
1.3 Plagiarism on the Internet
1.4 Plagiarism Detection System
1.4.1 Local similarity
1.4.2 Global similarity
1.4.3 Fingerprinting
1.4.4 String Matching
1.4.5 Bag of words
1.4.6 Citation-based Analysis
1.4.7 Stylometry
Chapter 2 Background Theory
2.1 Linear Algebra Basics
2.1.1 Vectors
2.2 Information Retrieval (IR)
2.3 Regular Expression
2.4 NLTK Toolkit
2.5 Node.js
2.6 Express.js
2.7 Sockets.io
2.8 Languages Used
Chapter 3 Design and Architecture
3.1 Extract, Transform and Load (ETL)
3.2 Plagiarism Engine
3.2.1 Natural Language Processing (Generating k-grams) and Vectorization
3.2.2 Semantic Analysis (Vector Space Model VSM Representation)
3.2.3 Calculating Similarity
3.2.4 Clustering
3.2.5 Communicating Results
Chapter 4 Implementation
4.1 Extract, Transform and Load (ETL)
4.1.1 The Crawler
4.1.2 The Parser
4.1.3 The Data Extracted from the Paper
4.1.4 The Parser Implementation
4.1.5 How it Works
4.1.6 Steps of Parsing
4.1.7 The Paper Class
4.1.8 The Paragraph Structure
4.1.9 Parsing the First Page in Detail (ex: an IEEE Paper)
4.1.10 Parsing the Other Pages in Detail (ex: an IEEE Paper)
4.2 Natural Language Processing (NLP)
4.2.1 Introduction
4.2.2 The Implementation Overview
4.2.3 The Text Processing Procedure
4.2.4 Example of the Text Processing
4.3 Term Weighting
4.3.1 Lost Connection to Database Problem
4.3.2 Process Paragraph
4.3.3 Generating Terms
4.3.4 Populating term, paragraphVector Tables
4.3.5 Executing the VSM Algorithm
4.4 Testing Plagiarism
4.4.1 Process Paragraph
4.4.2 Calculate Similarity
4.4.3 Get Results
4.5 The VSM Algorithm
4.5.1 Calculating Similarity
4.5.2 K-means and Clustering
4.6 Server Side
4.6.1 Handling Routing
4.6.2 Running the Python System
4.7 Client Side
4.8 The GUI of the System
Chapter 5 Results and Discussion
5.1 Dataset of the Parser
5.2 Exploring the Dataset
5.2.1 Small dataset (15K)
5.2.2 Big dataset (50K)
5.3 Performance
5.4 Detecting Plagiarism
5.4.1 Percentage score functions
5.5 Discussing Results
Chapter 6 Conclusion
Chapter 7 Appendix
7.1 Entity-Relation Diagram (ERD)
7.2 Stored Procedures
References
Table of Figures
Figure 1.1 Plagiarism Detection Approaches
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3)
Figure 2.2 Geometric representation of documents
Figure 3.1 High-level block diagram
Figure 3.2 Detailed block diagram of the Plagiarism Engine
Figure 4.1 Overview of the Crawler and Parser
Figure 4.2 UML of the Parser Application
Figure 4.3 The Flow Chart of the Parser
Figure 4.4 The main function of Parsing
Figure 4.5 The First Page of an IEEE Paper (as Blocks)
Figure 4.6 First Page of a Science Direct Paper
Figure 4.7 First Page of a Springer Paper
Figure 4.8 The function parseOtherPages
Figure 4.9 Block of String before Enhancing
Figure 4.10 The Paragraphs after Enhancing
Figure 4.11 The Paper Structure
Figure 4.12 The Paragraph Structure
Figure 4.13 Different forms of an IEEE Top Header
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper
Figure 4.15 The supported Regex of the IEEE Header formats
Figure 4.16 The Function of Extracting the Volume Number
Figure 4.17 The Function of Extracting the Issue Number
Figure 4.18 The Function of Extracting the DOI
Figure 4.19 The Function of Extracting the Start and End Pages
Figure 4.20 The Function of Extracting the Journal Title
Figure 4.21 Parsing the rest of the blocks in the first Page
Figure 4.22 The Function of Extracting the DOI and PII
Figure 4.23 The Function of Extracting the ISSN
Figure 4.24 The Function of Extracting the Paper Dates
Figure 4.25 The Function of Extracting the Keywords
Figure 4.26 The Function of Extracting the Keywords
Figure 4.27 The Function of Extracting the Title and the Authors
Figure 4.28 Defining the Style of the Header
Figure 4.29 The Function of Extracting the Figure Captions
Figure 4.30 The Function of Separating the Lists
Figure 4.31 The Function of Extracting the Paragraph
Figure 4.32 The Function of Extracting the Paragraph
Figure 4.33 Process Text Function
Figure 4.34 Tokenizing Words Function
Figure 4.35 Tokenization Example
Figure 4.36 POS Function
Figure 4.37 POS Output Example
Figure 4.38 WordNet POS Function
Figure 4.39 Removing Punctuation Function
Figure 4.40 Removing Stop Words Function
Figure 4.41 Stop Words List
Figure 4.42 Lemmatization Function
Figure 4.43 Paragraph before Text Processing
Figure 4.44 Paragraph after Text Processing
Figure 4.45 Retrieving Paragraphs
Figure 4.46 Process Paragraph Function
Figure 4.47 Generate k-gram Terms Function
Figure 4.48 Paragraph Example
Figure 4.49 1-gram terms
Figure 4.50 2-gram terms
Figure 4.51 3-gram terms
Figure 4.52 4-gram terms
Figure 4.53 5-gram terms
Figure 4.54 Calculate Term Frequency
Figure 4.55 Insert Terms in Database
Figure 4.56 Insert Paragraph Vector in Database
Figure 4.57 Executing the VSM Algorithm
Figure 4.58 Tokenizing and linking paragraphs together
Figure 4.59 Process input paragraphs
Figure 4.60 Populate input paragraph vector
Figure 4.61 Calculate Similarity
Figure 4.62 Get Results
Figure 4.63 Flowchart of the K-means text clustering algorithm
Figure 4.64 Home Page Routing
Figure 4.65 Pre-Process Page Routing
Figure 4.66 Communication between the Server and the Core Engine for testing plagiarism
Figure 4.67 Communication between the Server and the Core Engine for pre-processing
Figure 4.68 Longest Common Subsequence (LCS) Algorithm
Figure 4.69 Longest Common Subsequence (LCS) Algorithm
Figure 4.70 Submitting an input document
Figure 4.71 The Results of the Process, Part 1
Figure 4.72 The Results of the Process, Part 2
Figure 5.1 Number of Papers Published per Year in IEEE
Figure 5.2 Number of Papers Published per Year in Springer
Figure 5.3 Number of Papers Published per Year in Science Direct
Figure 5.4 Response time against number of paragraphs tested on the small dataset
Figure 5.5 Screenshot of the System Performance from the System GUI
Figure 7.1 ERD of the Plagiarism Engine database
Table of Tables
Table 1 Statistics of the Parser
Table 2 Dataset Statistics
Table 3 Unique Term Count in each Paragraph
Table 4 Unique Term Count in the Dataset
Table 5 Dataset Statistics
Table 6 Unique Term Count in each Paragraph
Table 7 Unique Term Count in the Dataset
Table 8 Processing time of each module in the Plagiarism Engine
Table 9 Parameters
Table 10 Testing Paragraphs and Results
Chapter 1 Introduction
1.1 What is Plagiarism?
Plagiarism is the act of academic theft: taking someone else's work, such as copying words from a book or a scientific paper, and publishing it as one's own. Stealing ideas, images, videos, or music and using them without permission or a proper citation is also plagiarism.
1.2 What is Self-Plagiarism?
Self-plagiarism is the act of reusing a portion of an article or work one has published before without disclosing that one is doing so. The reused portion could be significant, identical, or nearly identical. It may also cause copyright issues, as the copyright of the old work is transferred to the new one. Such articles and works are called duplicate or multiple publications.
1.3 Plagiarism on the Internet
Today, blogs, Facebook pages, and some websites copy and paste information, violating many copyrights. Several measures are used to discourage this, such as disabling right-click to prevent copying, placing copyright warnings on every page of the website as banners or pictures, and using the DMCA copyright law to report copyright infringement. Such a report can be sent to the website owner or to the ISP hosting the website, and the infringing website will be taken down.
1.4 Plagiarism Detection System
A plagiarism detection system tests whether a piece of material contains plagiarism. The material could be a scientific article, a technical report, an essay, or other text. The system can also highlight the plagiarized parts of the material and state where they were copied from, even when some words have been replaced by others with the same meaning.
Figure 1.1 Plagiarism Detection Approaches
1.4.1 Local similarity:
Given a small dataset, the system checks the similarity between each pair of paragraphs in the dataset, for example checking whether two students cheated on an assignment.
1.4.2 Global similarity:
Global similarity systems checks the similarity between a small input paragraphs against a large dataset,
like checking if a submitted paper is plagiarized from an already published paper.
1.4.3 Fingerprinting
In this approach, the dataset consists of sets of n-grams taken from documents. These n-grams are selected as random substrings of each document; each set of n-grams, called minutiae, represents a fingerprint of that document, and all fingerprints are indexed in the database. The input text is processed in the same way and compared with the stored fingerprints; if it matches some of them, it plagiarizes those documents. [1]
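The fingerprinting idea above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's implementation: the n-gram length, the number of selected minutiae, and the use of MD5 for compact indexing are all assumptions.

```python
import hashlib
import random

def fingerprint(text, n=5, k=4, seed=0):
    # Build all word n-grams, then select k of them at random (the minutiae).
    # The fixed seed stands in for a deterministic selection scheme.
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    rng = random.Random(seed)
    chosen = rng.sample(grams, min(k, len(grams)))
    # Hash each selected n-gram so fingerprints can be indexed compactly.
    return {hashlib.md5(g.encode()).hexdigest() for g in chosen}

doc = "the quick brown fox jumps over the lazy dog near the river bank"
query = "the quick brown fox jumps over the lazy dog near the river bank"
# Identical texts processed the same way share all fingerprint hashes.
overlap = fingerprint(doc) & fingerprint(query)
```

A match between the query's fingerprint hashes and the indexed ones signals potential plagiarism of the corresponding document.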
1.4.4 String Matching
Exact string matching is one of the main problems in plagiarism detection systems: to detect plagiarism this way you have to find exact matches, but comparing the tested document against the entire database requires a huge amount of resources and storage, so suffix trees and suffix vectors are used to overcome this problem. [2]
1.4.5 Bag of words
This approach is an adaptation of vector space retrieval, where the document is represented as a bag of words. These words are inserted in the database as n-grams along with their locations in the document and their frequencies in this and other documents. The document to be tested is likewise represented as a bag of words and compared with the n-grams in the database. [3]
1.4.6 Citation-based Analysis
This is the only approach that does not rely on text similarity. It examines the citation and reference information in texts to identify similar patterns in the citation sequences. It is not widely used in commercial software, but prototypes exist.
1.4.7 Stylometry
Stylometry analyzes only the suspicious document itself, detecting plagiarized passages through differences in linguistic characteristics.
This method is not accurate on small documents, as it needs to analyze large passages, up to thousands of words per chunk, to extract linguistic properties [4].
Our project uses Global Similarity with the Bag of Words approach: the system keeps a dataset of many scientific papers divided into paragraphs, and the input text is likewise divided into paragraphs and compared against that large dataset of paragraphs.
Chapter 2 Background Theory
2.1 Linear Algebra Basics
Since we use Vector space model to represent and retrieve text documents, a basic linear algebra
is needed.
2.1.1 Vectors
A vector is a geometric object that has a magnitude and a direction, or equivalently a mathematical object consisting of ordered values.
1. Representation in 2D and 3D
1) Graphical (Geometric) representation
A vector is represented graphically as an arrow in the Cartesian 2D plane or the Cartesian 3D
space.
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3).
Source: Wikimedia commons.
2) Cartesian representation
Vectors in an n-dimensional Euclidean space can be represented as coordinate vectors; the
endpoint of a vector can be identified with an ordered list of n real numbers (n-tuple). [5]
2D vector: a = (a_x, a_y)
3D vector: a = (a_x, a_y, a_z)
2. Operations on vectors
1) Scalar product
ra = (ra_x, ra_y, ra_z)
2) Sum
a + b = (a_x + b_x, a_y + b_y, a_z + b_z)
3) Subtraction
a − b = (a_x − b_x, a_y − b_y, a_z − b_z)
4) Dot product
Algebraic definition: a · b = a_x b_x + a_y b_y + a_z b_z
Geometric definition: a · b = |a||b| cos θ
Where: |a| is the magnitude of vector a, |b| is the magnitude of vector b, and θ is the angle between a and b.
The projection of a vector a in the direction of another vector b is given by: a_b = a · b̂
Where: b̂ is the normalized vector (unit vector) of b.
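The operations above translate directly into code. A minimal sketch in plain Python (the function names are illustrative, not from the project):

```python
import math

def dot(a, b):
    # Algebraic dot product: sum of the componentwise products (a scalar).
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    # |a| = sqrt(a . a)
    return math.sqrt(dot(a, a))

def scalar_projection(a, b):
    # Projection of a in the direction of b: a . b_hat, with b_hat = b / |b|.
    return dot(a, b) / magnitude(b)

a, b = (2, 3, 0), (4, 0, 0)
# Projecting a onto the x-axis recovers its x component, 2.0.
```
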
2.2 Information Retrieval (IR)
Information retrieval can be defined as "the process of finding material of an unstructured nature, usually text, that satisfies an information need or is relevant to a query, from a large collection of data" [6].
As the definition suggests, IR differs from an ordinary select query in that the information retrieved is unstructured and does not always exactly match the query.
Information Retrieval methods are used in search engines, in text classification such as spam filtering, and, in our case, in the plagiarism engine.
1. Vector Space Model (VSM)
The basic idea of VSM is to represent text documents as vectors in a term-weight space.
1) Term Frequency weighting (TF)
The simplest VSM weighting is plain Term Frequency; all other weighting functions are modifications of it.
In TF weighting we represent each text by a vector of d dimensions, where d is the number of terms in the dataset; the value of the vector's nth dimension equals the frequency of the nth term in the document.
For example, let's assume a dataset of 2 dimensions/terms (play, ground):
Document 1 "play ground" is represented as d1 = (1, 1)
Document 2 "play play" is represented as d2 = (2, 0)
Document 3 "ground" is represented as d3 = (0, 1)
More generally, the weight of word w in document d is defined as weight(w, d) = count(w, d).
The dot product similarity between d1 and d2 is d1 · d2 = (1, 1) · (2, 0) = 1 × 2 + 1 × 0 = 2.
2) Term Frequency with Inverse Document Frequency weighting (TF-IDF)
Document Frequency df(w) is the number of documents that contain the word w.
TF-IDF adds an Inverse Document Frequency factor to penalize common terms: they have a high probability [7] of appearing in any document, so they do not strongly indicate plagiarism, unlike rarer terms, which have a lower probability and carry more information.
weight(w, d) = count(w, d) × 1 / df(w)
So for the above example:
df(play) = 2
df(ground) = 2
and in this case all the weights will be scaled by a half.
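The play/ground example can be worked through in code. This is an illustrative sketch; the helper names are not from the project:

```python
from collections import Counter

docs = {"d1": "play ground", "d2": "play play", "d3": "ground"}
terms = ["play", "ground"]                  # the 2 dimensions of the space

def tf_vector(text):
    # TF weighting: the nth component is the frequency of the nth term.
    counts = Counter(text.split())
    return tuple(counts[t] for t in terms)

def df(term):
    # Document frequency: number of documents containing the term.
    return sum(1 for text in docs.values() if term in text.split())

def tfidf_vector(text):
    # The simple TF-IDF defined above: weight(w, d) = count(w, d) * 1/df(w).
    return tuple(c / df(t) for c, t in zip(tf_vector(text), terms))

d1, d2 = tf_vector(docs["d1"]), tf_vector(docs["d2"])
dot = sum(x * y for x, y in zip(d1, d2))    # (1,1).(2,0) = 2, as in the text
```

Since both terms appear in 2 of the 3 documents, every TF-IDF weight here is the TF weight scaled by a half, matching the worked example.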
2. State-of-the-art VSM functions
1) Pivoted Length Normalization [8]
weight(w, d) = ( ln[1 + ln[1 + count(w, d)]] / (1 − b + b |d| / avdl) ) × log( (M + 1) / df(w) )
Where: weight(w, d) is the weight of word w in document/paragraph d
count(w, d) is the count of word w in document d, i.e. its term frequency
Figure 2.2 Geometric representation of documents (d1, d2, d3 in the play/ground term space)
b is the document length normalization parameter, b ∈ [0, 1]
|d| is the length of document d
avdl is the average length of the documents in the dataset
M is the number of documents in the dataset
df(w) is the number of documents that contain the word w, i.e. its document frequency
The document length normalization term 1 − b + b |d| / avdl linearly penalizes long documents whose length is larger than the average document length (avdl), and rewards short documents whose length is smaller than the average document length.
The parameter b controls the normalization: if it equals zero there is no normalization at all; if it equals 1 the normalization is linear with offset zero and slope 1.
The Inverse Document Frequency (IDF) term log((M + 1) / df(w)) penalizes common terms as explained above. The document frequency is normalized by the number of documents because the probability of a term depends not only on its document frequency but also on the size of the dataset: a term appearing in 10 documents of a 100-document dataset is much more common than a term appearing in 10 documents of a 1000-document dataset. The logarithmic function smooths the IDF weighting, i.e. it reduces the variation in weight when the document frequency varies a lot.
The Term Frequency (TF) term ln[1 + ln[1 + count(w, d)]] applies a double natural logarithm to achieve a sublinear transformation, i.e. smoothing the TF curve, to avoid over-scoring documents with heavily repeated words; the first occurrence of a term should carry the highest weight.
Imagine a document with an extremely large frequency of one term: without the sublinear transformation, this document would always score a high similarity against any input query containing that term, even higher than a genuinely more similar document.
2) Okapi BM25 [9]
BM stands for Best Match; the weights are defined as follows:
weight(w, d) = ( (k + 1) count(w, d) / (count(w, d) + k (1 − b + b |d| / avdl)) ) × log( (M + 1) / df(w) )
Where all symbols are defined as in Pivoted Length Normalization, and k ∈ [0, ∞].
It is similar to Pivoted Length Normalization, but instead of natural logarithms it uses division and the k parameter to achieve the sublinear transformation.
It was originally developed based on the probabilistic model; however, it is very similar to the Vector Space Model.
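Both weighting functions can be sketched directly from the formulas. The default values k = 1.2 and b = 0.75 below are common choices in the literature, not values taken from this project:

```python
import math

def pivoted_weight(count, doc_len, avdl, M, df, b=0.75):
    # Pivoted Length Normalization: double-log TF over a linear length
    # penalty, multiplied by the smoothed log IDF.
    tf = math.log(1 + math.log(1 + count)) if count > 0 else 0.0
    norm = 1 - b + b * doc_len / avdl
    idf = math.log((M + 1) / df)
    return tf / norm * idf

def bm25_weight(count, doc_len, avdl, M, df, k=1.2, b=0.75):
    # Okapi BM25: division by k-scaled normalization achieves the
    # sublinear TF transformation instead of logarithms.
    norm = 1 - b + b * doc_len / avdl
    tf = (k + 1) * count / (count + k * norm)
    idf = math.log((M + 1) / df)
    return tf * idf
```

Both functions grow sublinearly in the term count and shrink as the term becomes more common across the dataset.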
3. Similarity functions
After representing text documents as vectors in the space, we need functions to calculate the similarity (or distance) between any two vectors.
1) Dot product similarity
similarity(q, d) = Σ_{w ∈ q ∩ d} count(w, q) × weight(w, d)
Where: similarity(q, d) is the similarity score between document d and input query q.
The score is simply the sum, over each word appearing in both the document and the query, of the product of the term weights.
It is very popular because it is general and can be used with any fancy term weighting.
2) Cosine similarity
similarity(q, d) = ( Σ_{w ∈ q ∩ d} count(w, q) × weight(w, d) ) / ( |q| × |d| )
Where: |q| is the magnitude of the query vector, and |d| is the magnitude of the document vector.
It is basically a dot product divided by the product of the lengths of the two vectors, which yields the cosine of the angle between the vectors.
This function has built-in document (and query) length normalization.
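A minimal sketch of cosine similarity over the dense TF vectors of the earlier play/ground example:

```python
import math

def cosine_similarity(q, d):
    # Dot product divided by the product of the two vector magnitudes,
    # yielding the cosine of the angle between them.
    dot = sum(x * y for x, y in zip(q, d))
    nq = math.sqrt(sum(x * x for x in q))
    nd = math.sqrt(sum(x * x for x in d))
    return dot / (nq * nd)

# d1 = (1, 1) and d2 = (2, 0) are 45 degrees apart, so the score is ~0.707.
s = cosine_similarity((1, 1), (2, 0))
```

Unlike the raw dot product, doubling a document's length leaves its cosine score unchanged, which is the built-in length normalization mentioned above.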
4. Clustering
Clustering is an unsupervised machine learning method (unsupervised because the data are not labeled) and a powerful data mining technique; it is the process of grouping similar objects together.
This technique can theoretically speed up the information retrieval process by a factor of K, where K is the number of clusters.
This is achieved by clustering similar paragraphs together, measuring the similarity between each new query and the centroids of the clusters, and then measuring the similarity between the query and the paragraphs of one cluster only; this is much faster than measuring the similarity of the query against every paragraph in the dataset.
We use Kmeans algorithm-centroid based clustering-, which is an iterative improvement
algorithm that groups the data set into a pre defined number of clusters K.
Which goes like this [10]:
1 Select K random points from data set to be the initial guess of the centroids –cluster centers.
2 Assign each record in the data set to the closest centroid based on a given similarity function.
3 Move each centroid closer to points assigned to it by calculating the mean value of the points
in the cluster.
4 If reached local optima –i.e. centroids stopped moving- stop, else repeat
Since Kmeans algorithm is sensitive to initial choose of centroids and can stuck in local optima
we repeat it with different initial centroids and keep the best results –which have the least mean square
error-.
The time complexity is 𝛰(𝑡𝑘𝑛𝑑), where t is the number of iterations until convergence, k is the number of
clusters, n is the number of records in the data set, and d is the number of dimensions.
Since usually 𝑡𝑘 ≪ 𝑛, the algorithm is considered a linear-time algorithm.
Note: that is the typical time complexity for applying the algorithm to a dataset with a negligible
number of dimensions. In our case we have a very large number of dimensions (all terms in
the dataset), but fortunately, for each centroid or paragraph we iterate only over the terms that appear in it, not all
dimensions, so the time complexity differs, as we will discuss in detail in the implementation
section.
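The steps above, including the random restarts that guard against a bad initial choice of centroids, can be sketched on small dense vectors. Euclidean distance is used here for brevity; the engine itself works on sparse high-dimensional term vectors, and all names are illustrative.

```python
import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, restarts=5, seed=0):
    """Lloyd's iterative improvement: assign points to the nearest
    centroid, move each centroid to the mean of its points, stop at a
    local optimum; restart and keep the lowest-error run."""
    rng = random.Random(seed)
    best_sse, best_centroids = None, None
    for _ in range(restarts):
        centroids = rng.sample(points, k)  # step 1: random initial guess
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:  # step 2: assign to the closest centroid
                nearest = min(range(k), key=lambda i: squared_dist(p, centroids[i]))
                clusters[nearest].append(p)
            # step 3: move each centroid to the mean of its points
            moved = [tuple(sum(xs) / len(pts) for xs in zip(*pts)) if pts else centroids[i]
                     for i, pts in enumerate(clusters)]
            if moved == centroids:  # step 4: local optimum reached
                break
            centroids = moved
        sse = sum(min(squared_dist(p, c) for c in centroids) for p in points)
        if best_sse is None or sse < best_sse:
            best_sse, best_centroids = sse, centroids
    return best_centroids
```

With the seed fixed the run is deterministic; in practice the distance would be a similarity over sparse k-gram term weights rather than Euclidean distance.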
2.3 Regular Expression
It’s a sequence of character and symbols written in some way to detect patterns, where each
symbol in the sequence has a meaning ex: + means one or more, * means zero or more, - means range
(A-Z: all capital letters from A to Z).
For example, a birth date could be written as May 15th, 1993, so the pattern of the date is
[Month] [Day][st or nd or rd or th], [Year], and the Regular Expression for it is
[A-Za-z]{3,9} [0-9]{1,2}(st|nd|rd|th), [0-9]{4}
First, the month is one of 12 fixed words; they could be written explicitly, or more simply it's a
sequence of 3 to 9 characters, since the shortest month name (May) has 3 letters and the longest (September) has 9. Then comes a space,
then the day, which is a number of 1 or 2 digits followed by one of the 4 suffixes (st, nd, rd, th), then a space, then
the year, which is a number of 4 digits.
Also this isn’t the only format of date so the date expression could be more complicated than
this.
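The pattern can be tried out directly with Python's re module (a sketch; a comma is included after the ordinal suffix because the example date "May 15th, 1993" contains one):

```python
import re

# Month: 3-9 letters; day: 1-2 digits plus an ordinal suffix; year: 4 digits.
date_pattern = re.compile(r"[A-Za-z]{3,9} [0-9]{1,2}(?:st|nd|rd|th), [0-9]{4}")

def find_dates(text):
    """Return every substring of `text` that matches the date pattern."""
    return date_pattern.findall(text)
```

As noted above, real-world date formats vary far more than this, so a production pattern would need many more alternatives.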
2.4 NLTK Toolkit
It’s a python module that is responsible for Natural Language Processing (NLP) used for text
processing, It has algorithms for sentence and word tokenization, and contains a large number of corpus
(data), also has its own Wordnet corpus, and it’s used for Part of Speech (POS) Tagging, Stemming, and
lemmatization.
2.5 Node.js
It’s a runtime environment built on Chrome’s V8 JavaScript Enginer for developing server-side
web applications, it uses an event driven, non-blocking I/O model.
2.6 Express.js
It’s a Node.js web application server framework, It’s a standard server framework for Node.js, It
has a very thin layer with many features available as plugins.
2.7 Socket.IO
It's a library for real-time web applications. It enables bi-directional communication between the
web client and the server, primarily using the WebSocket protocol with polling as a fallback option.
2.8 Languages Used
1. Java
2. Python
3. SQL
4. JavaScript
5. HTML & CSS
Chapter 3 Design and Architecture
3.1 Extract, Transform and Load (ETL)
In this part, we build the database by downloading many scientific papers using the
Crawler software (Extract); they are then passed to the Parser software, where all the paper information
and text content are extracted as paragraphs (Transform) and inserted into the database (Load).
3.2 Plagiarism Engine
The Plagiarism Engine preprocesses a huge dataset of academic English papers and analyzes it using
Natural Language Processing techniques to extract useful information, then measures the similarity
between an input query and the dataset using Information Retrieval methods, to detect both identical and
paraphrased plagiarism in a fast and intelligent way.
Figure 3.1 High level block diagram
Figure 3.2 Detailed block diagram of the Plagiarism Engine
3.2.1 Natural Language Processing, (Generating k-grams), and vectorization
The text processing part works on these data to extract the most important words from the
paragraphs and ignore the common words; k-gram terms are then generated from these words, and
each bag of words is linked to its corresponding paragraph in the database.
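A minimal sketch of this step, with an illustrative stop-word list and hypothetical helper names (not the project's actual tokenizer):

```python
import re

# Illustrative subset of common words to ignore.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "is", "it"}

def important_words(paragraph):
    """Lowercase the paragraph, split it into word tokens,
    and drop the common (stop) words."""
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def k_grams(tokens, k):
    """Every contiguous run of k tokens, joined into one term;
    these terms become the dimensions of the paragraph vector."""
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
```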
3.2.2 Semantic Analysis (Vector Space Model VSM Representation)
Input: simple term frequency vector representation stored in the paragraphVector table.
Output: dataset statistics (number of paragraphs, number of terms, average paragraph length)
stored in the dataSetInfo table; document frequency (for each term, the number of paragraphs in which that term
appears) stored in the IDF column of the term table; and pivoted length normalization and BM25 vector weights.
In this part we calculate a more sophisticated vector representation of our text corpus than plain
term frequency.
We calculate a TF-IDF normalized vector representation of the text using both pivoted length
normalization and BM25, as discussed later.
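As a sketch of the kind of weight computed here, the standard Okapi BM25 term weight combines a saturating, length-normalized TF component with an IDF component. This is the textbook form with common default parameters; the project's exact variant, described later, may differ.

```python
import math

def bm25_weight(tf, df, n_paragraphs, par_len, avg_len, k1=1.2, b=0.75):
    """tf: term frequency in the paragraph; df: number of paragraphs
    containing the term; par_len / avg_len: this paragraph's length and
    the average paragraph length from the dataset statistics."""
    idf = math.log((n_paragraphs - df + 0.5) / (df + 0.5) + 1.0)
    tf_norm = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * par_len / avg_len))
    return idf * tf_norm
```

The saturation in the TF part means repeating a word gives diminishing returns, and the paragraph-length ratio penalizes long paragraphs, similarly in spirit to pivoted length normalization.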
3.2.3 Calculating Similarity
Input: vectorized input paragraphs stored in the inputPargraphVector table, and BM25 or pivoted
length normalization weights in the BM25 and pivotNorm columns of the paragraphVector table.
Output: the similarity between the input paragraph and relevant paragraphs in the dataset, stored in the
similarity table.
We check the similarity between the input paragraph and the paragraphs in the dataset, and report
possible plagiarism if the similarity measure between the input paragraph and any paragraph from the
dataset exceeds a predetermined threshold.
We implemented both Okapi BM25 and pivoted length normalization similarity functions.
The system first measures similarity on 5-gram vectors, then 4-grams and so on; whenever it finds a
high similarity at one k-gram size, it limits its scope to those paragraphs with high similarity in the preceding k-
grams, to increase performance.
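A sketch of this cascading narrowing (all names hypothetical; `sim` is any similarity function over two vectors, and `dataset` maps a paragraph id to its vector for each k-gram size):

```python
def cascade_similarity(sim, query, dataset, ks=(5, 4, 3, 2, 1), threshold=0.8):
    """Score every candidate paragraph on the largest k first; whenever
    some paragraphs score above the threshold, restrict the remaining
    (smaller-k) passes to those paragraphs only."""
    candidates = set(dataset)
    scores_by_k = {}
    for k in ks:
        scores = {pid: sim(query[k], dataset[pid][k]) for pid in candidates}
        high = {pid for pid, s in scores.items() if s >= threshold}
        if high:
            candidates = high  # narrow the scope for later passes
        scores_by_k[k] = scores
    return scores_by_k
```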
3.2.4 Clustering
Input: paragraph vectors with BM25 (or pivoted length normalization) weights stored in
paragraphVector table.
Output: the cluster of each paragraph stored in clusterId column in paragraph table, and the
centroids of the clusters stored in centroid table.
We clustered similar paragraphs together so that we can measure similarity against only similar
paragraphs, to speed up the similarity-checking step.
An input paragraph is first measured against the centroids to determine its cluster, then
the regular similarity measure is applied against all the dataset paragraphs in that cluster.
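A sketch of that two-step lookup, again with hypothetical names; `centroids` maps a cluster id to its centroid vector, and `clusters` maps a cluster id to the vectors of the paragraphs assigned to it:

```python
def cluster_search(sim, query_vec, centroids, clusters):
    """Find the query's cluster by comparing it with the centroids only,
    then measure similarity against the paragraphs of that one cluster."""
    best_cluster = max(centroids, key=lambda cid: sim(query_vec, centroids[cid]))
    return {pid: sim(query_vec, vec) for pid, vec in clusters[best_cluster].items()}
```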
3.2.5 Communicating Results
It’s the interface where the user can check his document to plagiarism by inserting the document
in a text box and the document is parsed in a similar way as the Parser of the system by splitting the
document into paragraphs and they are passed by the text processing part and compared by the dataset in
the database and results appear as plagiarism percentage in the document and showing the plagiarized
parts in the document with other documents.
Chapter 4 Implementation
4.1 Extract, Transform and Load (ETL)
Figure 4.1 Overview for the Crawler and Parser
4.1.1 The Crawler
It’s a software that download all the scientific papers from the web into a folders for each
publisher where the parser will start working on them.
4.1.2 The Parser
It’s a software that take a PDF document (Scientific paper) as an input and extract the paper
information and content of the paper and insert them in the database of the system.
4.1.3 The Data Extracted from the paper
a. Paper Information
1. Paper Title
2. Paper authors
3. Journal and its ISSN
4. Volume, Issue, Paper Date and other dates (Accepted, Received, Revised, Published)
5. DOI (Digital Object Identifier) or PII (Publisher Item Identifier)
6. Starting Page and Ending Page
b. Abstract and Keywords
c. Table of Contents
d. Figure and Table captions
e. Paper text content (as Paragraphs)
4.1.4 The Parser Implementation
Figure 4.2 UML of the Parser Application
4.1.5 How it works
The Parser consists of a parent class (Parser) and child classes (IEEE, Springer, APEM,
and Science Direct). The parent class has the general functions that parse the PDF document and extract
the table of contents, the figure and table captions, and the text content of the paper; the child classes
have specific functions and Regular Expressions for each publisher's structure to extract the paper
information (Title, Authors, DOI ...).
Figure 4.3 The flow chart of the Parser
Each publisher has its own folder into which the scientific papers are downloaded by the Crawler, and
the Parser monitors each folder for new documents and uses the suitable child class to parse each
new document found and extract all the needed information and data.
If the paper information and content are extracted completely, the file is moved to the
Processed directory; otherwise, the file is moved to the Unprocessed directory and the
error is logged, so the developer can check whether it's a new structure to be supported in the Parser, or
something went wrong that he has to fix.
4.1.6 Steps of Parsing
4.1.6.1 Extracting the Text from the PDF file (extractBlocks Function)
The Parser uses the PDFxStream Java library, which extracts the text from the PDF file as
blocks of strings. This function loops over the file page by page; for each page it extracts the
content of the page into an ArrayList<String> object called page, and adds this page with its page
number to a HashMap<Integer,ArrayList<String>> object called pages.
Figure 4.5 The First Page of an IEEE Paper (as Blocks)
public void parsePaper(String publisher) throws Exception {
extractBlocks();
try {parseFirstPage();}
catch (Exception e) {throw new Exception("Error Not Processed");}
parserOtherPages();
paper.enhaceParagraphs();
try {paper.insertPaperInDatabase(publisher);}
catch (SQLException e) {throw new Exception("Error Database");}
}
Figure 4.4 The main function of Parsing
4.1.6.2 Extracting the Paper Information (parseFirstPage Function)
Each publisher accepts its scientific papers in a specific structure that differs from publisher to
publisher, and the difference lies in the first page, where the paper information is written. So there has to
be a parser for each publisher designed to support its structure; this function, which is abstract
in the parent class, is implemented in each child class, one per publisher.
Figure 4.6 First Page of a Science Direct Paper
Figure 4.7 First Page of a Springer Paper
These are the different structures of Science Direct and Springer papers, shown to illustrate the
difference; it lies in the organization and structure of the information, e.g.:
1. This is a header of a Springer Paper
Kong et al. EURASIP Journal on Advances in Signal Processing 2014, 2014:44
http://asp.eurasipjournals.com/content/2014/1/44
2. This is a header of an IEEE Paper
IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
4.1.6.3 Extracting the Paper text content (parserOtherPages Function)
This function uses the general Parser functions: it loops over all the pages and the blocks of
strings in each page and extracts the data from the blocks, which could be table of contents entries,
figure and table captions, lists, or paragraphs.
Each block passes several stages:
1) First, test if the block is a figure caption.
2) Then test if it's a table caption.
3) Then test if it has a header (table of contents entry).
4) Then test if the block has lists (numeric, dash, or dot).
In the 3rd stage, if there are headers in the block, they are extracted and the rest of the block
is returned to the function, which continues with the other stages.
void parserOtherPages(){
for (Entry<Integer, ArrayList<String>> entrySet : pages.entrySet()) {
Integer pageNumber = entrySet.getKey();
ArrayList<String> page = entrySet.getValue();
Iterator<String> it = page.iterator();
while (it.hasNext()) {
String block = it.next().trim();
boolean isFigureCaption = false, isTableCaption = false;
boolean isList = false, isEmptyParagraph = false;
isFigureCaption = parseFigureCaption(block, pageNumber);
isTableCaption = parseTableCaption(block, pageNumber);
block = parseHeaders(block);
isList = parseLists(block, pageNumber);
isEmptyParagraph = "".equals(block);
if(!isFigureCaption && !isTableCaption
&& !isEmptyParagraph && !isList)
parseParagraph(block, pageNumber);
}
}
}
Figure 4.8 The parserOtherPages function
5) Finally, if the block isn't one of the previous types (not a figure or table caption, has no
list, or had its header extracted and the rest of the block returned), then it's a paragraph
and it is extracted.
4.1.6.4 Enhancing the Paragraphs (enhaceParagraphs Function)
As shown in Fig 4.9, some paragraphs are not in good shape when extracted:
1) Some words may be split across two lines with a hyphen, so they have to be rejoined; there are
also many extra spaces between words, which have to be removed.
2) The paragraph is extracted as lines (each with a newline character at the end), not as one
continuous string, so it has to be refined.
3) Some of the paper info is in uppercase, so it is capitalized.
Figure 4.9 Block of String before Enhancing
Page Number: 1
The Content:
However, as the number of metal layers increases and interconnect dimensions
decrease, the parasitic capacitance increases associated with fill metal have
become more significant, which can lead to timing and signal integrity
problems.
Page Number: 1
The Content:
Previous research has primarily focused on two important aspects of fill metal
realization: 1) the development of fill metal generation methods – which we
discuss further in Section II and 2) the modeling and analysis of capacitance
increases due to fill metal – Several studies have examined the parasitic
capacitance associated with fill metal for small scale interconnect test
structures in order to provide general guidelines on fill metal geometry
selection and placement. For large-scale designs,
Figure 4.10 The Paragraphs after enhancing
4.1.6.5 Finally inserting all these data in the Database
When the Parser starts, an object of type Paper is created, and every piece of information and data
extracted from the scientific paper is assigned to its attribute in this object; at the end of the
parsing, all this information and data is inserted into the database by calling this function.
1) Retrieve the journal ID from the database by its name or ISSN; if it's already there, the ID is
returned, otherwise it is considered a new journal, inserted, and its ID returned.
2) Test whether the paper was already inserted in the database before; if it is found, the
Parser throws an exception stating that it was already inserted, but if it's a new paper it
is inserted with its information (title, volume, issue ...) and the paper ID is returned.
3) With the paper ID, the rest of the data is inserted (authors, keywords, table of contents,
figure captions, table captions, and the text content of the paper, i.e. the paragraphs).
4.1.7 The Paper Class
This class works as a structure for the paper. It has the attributes that hold the information and
data of the paper, the function enhaceParagraphs(), which is responsible for
improving the text and enhancing the paragraphs to be ready for the next processing step in the
Natural Language Processing part, and the function insertPaperInDatabase(), which is
responsible for testing whether the paper was already inserted in the database or not; if it's a new
public class Paper {
public String title = "";
public int volume = -1,issue = -1;
public int startingPage = -1, endingPage = -1;
public String journal = "", ISSN = "";
public String DOI = "";
public ArrayList<String> headers = new ArrayList<>();
public ArrayList<String> authors = new ArrayList<>();
public ArrayList<String> keywords = new ArrayList<>();
public ArrayList<Paragraph> figureCaptions = new ArrayList<>();
public ArrayList<Paragraph> tableCaptions = new ArrayList<>();
public ArrayList<Paragraph> paragraphs = new ArrayList<>();
public String date="";
public String dateReceived="NULL", dateRevised="NULL";
public String dateAccepted="NULL", dateOnlinePublishing="NULL";
public void enhaceParagraphs()
public void insertPaperInDatabase(String publisher)
}
Figure 4.11 the Paper Structure
paper, it is inserted with all of its data: the paragraphs, the figure and table captions, and the
paper information.
Note that when the Parser finds a new PDF document in the publishers' folders, it creates a new
object of type Paper; during parsing, each piece of information extracted is
assigned to its attribute in this object, and at the end of the parsing process this Paper object executes
its two member functions: enhaceParagraphs() to refine the paragraph content, then
insertPaperInDatabase() to insert all the data into the database.
4.1.8 The Paragraph Structure
As shown in Figure 4.12, the paragraph structure is very simple: it contains the content of
the extracted paragraph and the number of the page from which the paragraph was extracted.
4.1.9 Parsing the First Page in Detail (ex: an IEEE Paper)
As shown in Figure 4.14, the page is divided into blocks of strings, and each
block holds one or more pieces of the paper information. This function in the parser is implemented specifically for
each publisher, so the IEEE parser's version won't work for the Springer parser and so on; it
parses only the first page, extracts the paper information in it, and
assigns it to the attributes of the Paper object.
Even within one publisher there are differences in the location of the paper information on the
page, and the structure changes over time; for example, the IEEE parser supports 8 different forms of
paper header.
Figure 4.13 Different forms for an IEEE Top Header
public class Paragraph {
public int pageNum;
public String content;
}
Figure 4.12 the Paragraph Structure
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper
4.1.9.1 Parsing the Paper Header
The parseFirstPage() function starts by parsing the header of the paper, which is the
first block in the paper; the block is sent to the parsePaperHeader() function, which has a different
Regular Expression for every form that the parser supports, as shown in Fig 4.15.
When the function receives the block, the block passes through the different expressions; if
it matches one of the supported formats, the function starts extracting the information, otherwise
it throws an exception stating that this header format isn't supported and the developer
has to add support for it.
As shown in Figure 4.13, the header may contain information such as the starting page
number (at the start or the end of the line), the journal title, the volume number, the issue number,
and the date. This information may or may not be present in the header, so according to the format of
the header the suitable functions (parsePaperDate(), parseVolume(), parseIssue(), parseJournal(),
parseStartingPage()) are called to extract it.
// ex: Chang et al. VOL. 1, NO. 4/SEPTEMBER 2009/ J. OPT. COMMUN. NETW. C35
String header1_Exp = "^([A-Z]+ ET AL. " + volume_Exp + ", " + issue_Exp
+ "[ ]*/[ ]*" + paperDate_Exp + "[ ]*/[ ]*" + journalTitle_Exp + " [A-Z0-9]+)$";
// ex: 594 J. OPT. COMMUN. NETW. /VOL. 1, NO. 7/DECEMBER 2009 Lim et al.
String header2_Exp = "^([A-Z0-9]+ " + journalTitle_Exp + "/"
+ volume_Exp + ", " + issue_Exp + "/" + paperDate_Exp + " [A-Z]+ ET AL.)$";
// ex: IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
// ex: 93 IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007
// ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May-June 2008
// ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May/June 2008
// ex: 93 IEEE TRANSACTIONS ON MAGNETICS Vol. 13, No. 6; December 2006
String header3_Exp = "^(([0-9]+ )*" + journalTitle_Exp +"(,)* " + volume_Exp
+ ", " + issue_Exp + "(,|;) " + paperDate_Exp + "( [0-9]+)*)$";
// ex: 598 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING
String header4_Exp = "^([0-9]+ " + journalTitle_Exp + ")$";
// ex: IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 598
String header5_Exp = "^(" + journalTitle_Exp + " [0-9]+)$";
// ex: 1956 lRE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES 75
String header6_Exp = "^([0-9]{4} " + journalTitle_Exp + "[0-9]+)$";
// ex: 112 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS May
String header7_Exp = "^([0-9]+ " + journalTitle_Exp + "[A-Z]{3,9})$";
// ex: SUPPLEMENT TO IEEE TRANSACTIONS ON AEROSPACE / JUNE 1965
String header8_Exp = "^(" + journalTitle_Exp + "[ ]*/[ ]*" + dateExp + ")$";
Figure 4.15 The supported Regex of the IEEE Header formats
4.1.9.2 Extracting the Volume from the Header
The IEEE parser uses volume_Exp = VOL(.)* [A-Z-]*[0-9]+ to detect the
volume part of the header and passes it to the parseVolume() function, which then uses another
expression to extract the number from this part. For example, in the first line of the header forms in Fig 4.13, the
parser detects the part (VOL. 18), then detects the number in that result (18), and converts it
from String to int to assign it to the volume attribute of the Paper object.
4.1.9.3 Extracting the Issue number from the Header
The IEEE parser uses issue_Exp = NO(.|,) [0-9]+ to detect the issue part of the
header and passes it to the parseIssue() function, which then uses another expression to extract the
number from this part. For the same example presented in the volume section, the parser detects the
part (NO. 3), then detects the number in that result (3), and converts it from String to int to
assign it to the issue attribute of the Paper object.
4.1.9.4 Extracting the Paper Date from the Header
@Override
void parseVolume(String volume) {
Matcher matcher = Pattern.compile(volume_Exp).matcher(volume);
if(matcher.find()){
Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
while(numMatcher.find())
paper.volume = Integer.parseInt(numMatcher.group());
}
}
Figure 4.16 The Function of extracting the Volume Number
@Override
void parseIssue(String issue) {
Matcher matcher = Pattern.compile(issue_Exp).matcher(issue);
if(matcher.find()){
Matcher numMatcher = Pattern.compile("[0-9]+")
.matcher(matcher.group());
if(numMatcher.find())
paper.issue = Integer.parseInt(numMatcher.group());
}
}
Figure 4.17 The Function of Extracting the Issue Number
@Override
void parsePaperDate(String date) {
Matcher matcher = Pattern.compile(paperDate_Exp).matcher(date.trim());
if(matcher.find())
paper.date = matcher.group().replaceAll("^/", "").trim();
}
Figure 4.18 The Function of Extracting the Paper Date
Like the other parts of the header, the IEEE parser uses date_Exp = [A-Z]{0,9}[/-]*[A-Z]{3,9}(.)*( [0-9]{1,2}(,)*)* [0-9]{4} to extract the date part from the header,
then assigns it to the date attribute in the Paper object. The date can be written in different formats
(2016, March 2016, May/June 2016, May-June 2016), and the expression is written to detect all
of these date formats.
Note that after each piece of information is extracted, it is removed
from the header block string, so after removing the volume, issue, and date, the information left in
the header is the journal title and the starting page; the starting page can be at the start or
the end of the header.
4.1.9.5 Extracting the Start and End Page numbers from the Header
Now we know that the header contains only the journal title and the start page number, so the
IEEE parser uses startPage_Exp = ^[0-9]+|[0-9]+$. This expression extracts a number
that lies at the start or the end of the checked string, so whether the start page number lies at the start of the
header or at the end of it, it is detected, extracted, and then, like the other information,
assigned to its attribute in the Paper object.
The end page is very simple: the IEEE parser adds the number of pages of the
paper to the start page number and assigns the result to the end page of the Paper object.
4.1.9.6 Extracting the Journal Title from the Header
Now, finally, for the journal title: the IEEE parser uses journalTitle_Exp = [A-Z :-—/)(.,]+ to extract the journal title part from the header, then passes the
title to the parseJournal() function.
The title may have some extra words that aren't needed, such as ([author name] et al.), or it may
end with separating characters (a comma or forward slash), so these must be removed first; the rest is
then assigned to the journal attribute in the Paper object.
@Override
void parseStartingPage(String startingPage) {
Matcher matcher = Pattern.compile(startingPage_Exp).matcher(startingPage);
if(matcher.find())
paper.startingPage = Integer.parseInt(matcher.group().trim());
parseEndingPage(startingPage);
}
@Override
void parseEndingPage(String endingPage) {
paper.endingPage = paper.startingPage + pages.size();
}
Figure 4.19 The Function of Extracting the Start and End Pages
4.1.9.7 Parsing the Rest of the first page’s blocks
@Override
void parseJournal(String journal) {
journal = journal.replaceAll("( /|, )", "").trim();
Matcher matcher = Pattern.compile(journalTitle_Exp).matcher(journal);
if(matcher.find()){
String journalName = matcher.group().replaceAll("[A-Z ]+ ET AL.", "");
if (journalName.charAt(journalName.length()-1) == '/')
paper.journal = journalName.substring(0, journalName.length()-1);
else
paper.journal = journalName;
}
}
Figure 4.20 The Function of Extracting the Journal Title
Iterator<String> it = pageOne.iterator();
while (it.hasNext()) {
String mainBlock = it.next();
String block = mainBlock.replaceAll("[ ]+", " ");
if(Pattern.compile(IEEE_DOI_Exp + "|" + PII_Exp).matcher(block).find())
{ parseDOI(block); blockList.add(mainBlock); }
if(ISSN_Pattern.matcher(block).find())
{ parseISSN(block); blockList.add(mainBlock); }
if(Pattern.compile("Index Terms").matcher(block).find())
{ parseKeywords(block); blockList.add(mainBlock); }
if(Pattern.compile("(Abstract|ABSTRACT|Summary)").matcher(block).find())
{ parseAbstract(block); blockList.add(mainBlock); }
if(date_Pattern.matcher(block.toUpperCase()).find() && !datesFound){
parseDates(block);
if (!paper.dateAccepted.equals("NULL") ||
!paper.dateOnlinePublishing.equals("NULL") ||
!paper.dateReceived.equals("NULL") ||
!paper.dateRevised.equals("NULL"))
{ blockList.add(mainBlock); datesFound = true; }
}
}
removeUnimportantBlocks();
for (String blockList1 : blockList)
pageOne.remove(blockList1);
Figure 4.21 Parsing the rest of blocks in the first Page
After parsing the header block and extracting all the information from it, the IEEE parser
continues to parse the other blocks, searching for the rest of the information. Due to differences in
structure, the location of this information can differ from structure to structure, so the best way to
extract it is to loop through all the first-page blocks; using the Regular Expressions for
this information (DOI, ISSN ...), the parser can locate it, and it also tries to detect some
other blocks, such as the Abstract, Keywords, Nomenclature, and the paper dates (when it was
received, accepted, revised, and published online).
In every loop iteration, if a piece of information is detected, the block is passed to the suitable function to
extract it; once the information is extracted, the block isn't needed, so the parser
adds this block to a TreeSet<String> (blockList), and after finishing all the iterations on the blocks
of page one, these blocks are removed from the blocks of the page.
Also, there may be some other blocks that don't carry important information, such as the website
of the publisher or the publisher's logo with its name under it; these also have to be
detected and removed, using the function removeUnimportantBlocks().
4.1.9.8 Extracting the DOI or the PII
In the loop, if a block is detected to have the DOI (Digital Object Identifier) or PII (Publisher
Item Identifier) using IEEE_DOI_Exp = [0-9]{2}.[0-9]{4}/[A-Z-]+.[0-9]+.[0-9]+ or PII_Exp = [0-9]{4}-[0-9xX]{4}([0-9]{2})[0-9]{5}-(x|X|[0-9]), the IEEE parser passes the block to the parseDOI() function, and the DOI
or PII is extracted. If it's a DOI, it is concatenated with the DOI domain of the
papers (http://dx.doi.org/); if it's a PII, it is concatenated with (http://dx.doi.org/10.1109/S);
the result is assigned to the DOI attribute in the Paper object.
4.1.9.9 Extracting the ISSN
@Override
void parseDOI(String DOI) {
Matcher matcher = Pattern.compile(IEEE_DOI_Exp).matcher(DOI);
while(matcher.find())
paper.DOI = "http://dx.doi.org/" + matcher.group();
matcher = Pattern.compile(PII_Exp).matcher(DOI);
while(matcher.find())
paper.DOI = "http://dx.doi.org/10.1109/S" + matcher.group();
}
Figure 4.22 The Function of Extracting the DOI and PII
void parseISSN(String ISSN){
Matcher matcher = ISSN_Pattern.matcher(ISSN);
while(matcher.find())
paper.ISSN = matcher.group().replaceAll("(–|-|‐)", "-");
}
Figure 4.23 The Function of Extracting the ISSN
Also, if a block is detected to have the ISSN using ISSN_Exp = [0-9]{4}(–|-|‐)[0-9]{3}[0-9xX], the IEEE parser passes the block to the parseISSN() function, which
extracts the ISSN and assigns it to the ISSN attribute in the Paper object.
4.1.9.10 Extracting the Dates of the Paper:
If a block is detected during the iteration to have dates, using date_Exp = ([0-9]{1,2}(-| )[A-Z]{3,9}[.]*(-| )[0-9]{4})|([A-Z]{3,9}[.]*( [0-9]{1,2},)* [0-9]{4}), the IEEE parser passes the block to the parseDates() function; this
block has dates related to the paper, such as when it was received by the publisher and when it was revised,
void parseDates(String dates){
    dates = dates.replaceAll(separatedWord_Fixing, "")
                 .replaceAll(newLine_Removal, " ").toUpperCase();
    Matcher matcher = receivedDate_Pattern.matcher(dates);
    while(matcher.find()){
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while(dateMatcher.find())
            paper.dateReceived = dateMatcher.group().trim();
    }
    matcher = revisedDate_Pattern.matcher(dates);
    while(matcher.find()){
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while(dateMatcher.find())
            paper.dateRevised = dateMatcher.group().trim();
    }
    matcher = acceptedDate_Pattern.matcher(dates);
    while(matcher.find()){
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while(dateMatcher.find())
            paper.dateAccepted = dateMatcher.group().trim();
    }
    matcher = publishingDate_Pattern.matcher(dates);
    while(matcher.find()){
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while(dateMatcher.find())
            paper.dateOnlinePublishing = dateMatcher.group().trim();
    }
}
Figure 4.24 The Function of extracting the paper Dates
accepted, and published online. For each of these dates there is a Regular Expression to detect it. Note that not all papers include these dates, but most of them do; when they are present, they are extracted and assigned to their attributes in the paper object.
The dates can be written in many formats: (30 OCTOBER 2007), (17 AUG. 2007), (28-JULY-2009), (OCTOBER 6, 2006), so the Regular Expression for the date itself is necessarily complicated in order to detect all of these formats.
Also, the word before the date can be written in different forms: (Received), (Received:), (Revised), (Revised:), or (Received in revised form), and it may be lowercase or capitalized. The Regular Expressions are therefore constructed to detect all forms of those words, and to handle character case we transform the string to uppercase before comparing.
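The combined expression can be checked against the four formats above with a short Python sketch (a hypothetical test harness with the escapes restored, not project code):

```python
import re

# date_Exp as described above, with a grouping parenthesis restored around
# the second alternative; input is assumed to be uppercased first.
DATE_EXP = (r"([0-9]{1,2}(-| )[A-Z]{3,9}[.]*(-| )[0-9]{4})"
            r"|([A-Z]{3,9}[.]*( [0-9]{1,2},)* [0-9]{4})")

samples = ["30 OCTOBER 2007", "17 AUG. 2007", "28-JULY-2009", "OCTOBER 6, 2006"]
for s in samples:
    # each sample date format is matched in full
    assert re.fullmatch(DATE_EXP, s), s
```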
4.1.9.11 Extracting the Keywords
If the block of the keywords is detected, the IEEE Parser passes it to the parseKeywords() function. The keywords may be found inside the block of the Abstract, so the first step is to crop the keywords part if it sits together with the abstract. The block may also be broken across two lines, or contain a word split across two lines with a hyphen, so these have to be removed and fixed. After that, note that some papers separate the keywords with a comma (,) and others with a semicolon (;). The split keywords are then added to the list of keywords in the paper object.
4.1.9.12 Extracting the Abstract
For the block of the abstract: during the iteration in the parseFirstPage() procedure, if one of the blocks matches the word Abstract or Summary, that block is passed to the parseAbstract() function and considered the first paragraph of the page, with the header Abstract.
In some cases the abstract may contain other information, such as the keywords or the Nomenclature, so these have to be cropped first and parsed separately.
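The cropping logic can be sketched in Python (a hedged stand-in for the Java parseAbstract() cropping step; the sample string is invented):

```python
def crop_abstract(text):
    """Cut off trailing 'Index Terms' or 'NOMENCLATURE' sections, mirroring
    the cropping performed before the abstract is stored."""
    for marker in ("Index Terms", "NOMENCLATURE"):
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text.strip()

raw = "Abstract-We study X. Index Terms-plagiarism, detection"
print(crop_abstract(raw))  # -> Abstract-We study X.
```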
@Override
void parseKeywords(String keywords) {
    keywords = keywords.substring(keywords.indexOf("Index Terms"));
    String indexTerms_Removal = "-\r\n|Index Terms|-";
    keywords = keywords.replaceAll(indexTerms_Removal, "");
    String[] splitted = keywords.replaceAll(newLine_Removal, " ").split(",|;");
    for (int i = 0; i < splitted.length; i++)
        paper.keywords.add(splitted[i].trim());
}
Figure 4.25 The Function of Extracting the Keywords
4.1.9.13 Extracting the Title and Authors
After extracting all the information of the paper and removing those blocks, the next block contains the Title of the paper, then the Authors, then the Introduction.
First the title is passed to the parseTitle() procedure; if it is split over more than one line, the newline characters are removed, and the result is assigned to the title attribute.
Next the authors are passed to the parseAuthors() procedure, where they are separated, by comma, semicolon, or some other separator depending on the publisher style, and each author is added to the authors list in the paper object.
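A Python sketch of this splitting step (mirroring, not reproducing, the Java parseAuthors(); the sample names are invented):

```python
import re

def split_authors(raw):
    """Split an author line on commas or 'and', drop affiliation digits,
    and keep only non-empty names."""
    parts = re.split(r",| and | And | AND ", raw)
    return [re.sub(r"[0-9]+", "", p).strip() for p in parts if p.strip()]

print(split_authors("A. Smith1, B. Jones2 and C. Lee3"))
# -> ['A. Smith', 'B. Jones', 'C. Lee']
```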
@Override
void parseAbstract(String abstractContent) {
    int indexOfIndexTerms = abstractContent.indexOf("Index Terms");
    if (indexOfIndexTerms != -1)
        abstractContent = abstractContent.substring(0, indexOfIndexTerms);
    int indexOfNomenclature = abstractContent.indexOf("NOMENCLATURE");
    if (indexOfNomenclature != -1)
        abstractContent = abstractContent.substring(0, indexOfNomenclature);
    paper.headers.add("Abstract");
    abstractContent = abstractContent.replaceAll("(Abstract|Summary)(-)*", "");
    String lastHeader = paper.headers.get(paper.headers.size()-1);
    Paragraph paragraph = new Paragraph(1, lastHeader, abstractContent);
    paper.paragraphs.add(paragraph);
}
Figure 4.26 The Function of Extracting the Abstract
void parseTitle(String title) {
    paper.title = title.replaceAll(newLine_Removal, " ").trim();
}

@Override
void parseAuthors(String authors) {
    authors = authors.replaceAll(author_Removal, "")
                     .replaceAll("[ ]+", " ");
    authors = authors.replaceAll(separatedWord_Fixing, "")
                     .replaceAll(newLine_Removal, " ");
    String[] split = authors.split(",| and| And| AND");
    for (String author : split)
        if(!author.trim().isEmpty())
            paper.authors.add(author.replaceAll("[0-9]+", "").trim());
}
Figure 4.27 The Function of Extracting the Title and the Authors
4.1.10 Parsing the Other Pages in Detail (ex: an IEEE Paper)
Now all the paper information is extracted, and the blocks remaining on the first page are the introduction and the rest of the page content. Once the parseFirstPage() procedure finishes, the parseOtherPages() procedure starts executing. As demonstrated before, it loops over all the blocks of strings in the pages and extracts all the possible data from them, such as headers (table of contents), figure and table captions, and lists; if a block is none of the previous, it is treated as a paragraph. All of these procedures are part of the parent Parser.
4.1.10.1 Extracting the Headers
This procedure is very general and works efficiently for most types of headers. First it detects the style of the level 1 headers; it supports (I. INTRODUCTION), (1 INTRODUCTION), (1. INTRODUCTION), and (1 Introduction). Headers may be numbered with Roman numerals or Arabic numbers, the number may or may not be followed by a dot, and the header text may be uppercase or capitalized, so the function first detects the type of the header.
For the level 2 headers there are also different styles, and another function detects them. It supports three different types, for example: (A. Level 2 Header), (1.1 Level 2 Header), and (1.1. Level 2 Header); the header may be listed alphabetically, as number dot number, or as number dot number dot, followed by the title of the header.
Once the header styles are identified, the headers' Regular Expressions are created and tested on all the passed blocks to detect any headers. These Regexes are not constant but change as parsing proceeds: for example, if (1. Introduction) was detected, the next expected header is (2. Another Header), so the number is incremented.
Another point: the header always comes at the start of a block of string, and the rest of the block is a paragraph (or the header may have been extracted from the beginning as its own block), so the function extracts only the header and returns the rest of the block to be parsed as a paragraph.
There are also headers without numbers, such as Abstract, References, Acknowledgements, Appendix, and others; those headers are detected separately with a separate Regex.
This procedure can also detect level 3 and level 4 headers, whose style is determined by the style of the level 2 headers, for example (1.1.1 Header) or (1.1.1. Header). All detected headers are added to the headers list in the paper object.
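The incrementing-expectation idea can be sketched as follows (a hypothetical Python helper, not project code; the style names borrow from Figure 4.28):

```python
import re

def next_header_pattern(n, style="NUM_DOT_CAPITALIZED"):
    """Build the regex for the next expected level-1 header, illustrating
    how the parser's header Regex changes as headers are consumed."""
    if style == "NUM_DOT_CAPITALIZED":
        return re.compile(r"^%d\. [A-Z][a-z]+" % n)
    if style == "NUM_SPACE_UPPERCASE":
        return re.compile(r"^%d [A-Z]+" % n)
    raise ValueError(style)

# After matching "1. Introduction", the parser waits for header number 2.
pat = next_header_pattern(2)
assert pat.match("2. Related Work")
assert not pat.match("3. Method")
```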
4.1.10.2 Extracting the Figure and Table Captions
In this procedure, the parent Parser uses figure_Exp = ^(Fig\.|Figure)[ ]+[0-9]+(\.|:) and table_Exp = ^(TABLE|Table)[ ]+([0-9]+|[IVX]+) to detect the figure and table captions. These may appear in different styles, for example (Figure 1.), (Fig. 1), (Figure 1:), (TABLE 1), and (Table II); the numbering may be numeric or Roman. After extraction, the caption is added to the list of captions as a paragraph with its page number.
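With the escapes restored, the two expressions can be exercised directly (a Python sketch; the parser itself is Java, and the caption strings are invented):

```python
import re

# figure_Exp and table_Exp as described above, anchored at block start.
FIGURE_EXP = re.compile(r"^(Fig\.|Figure)[ ]+[0-9]+(\.|:)")
TABLE_EXP = re.compile(r"^(TABLE|Table)[ ]+([0-9]+|[IVX]+)")

assert FIGURE_EXP.match("Figure 1. System overview")
assert FIGURE_EXP.match("Fig. 3: Results")
assert TABLE_EXP.match("TABLE 1 Parameters")
assert TABLE_EXP.match("Table II Comparison")
assert not FIGURE_EXP.match("The figure shows the pipeline")
```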
private enum HeaderMode {
    NUM_SPACE_UPPERCASE, NUM_SPACE_CAPITALIZED, ROMAN_DOT_UPPERCASE,
    NUM_DOT_UPPERCASE, NUM_DOT_CAPITALIZED, ABC_DOT_CAPITALIZED,
    NUM_DOT_NUM_DOT_CAPITALIZED, NUM_DOT_NUM_CAPITALIZED
}

Pattern restHeader_Pattern = Pattern.compile("^(REFERENCES|References|"
        + "ACKNOWLEDGMENT[S]*|Acknowledg[e]*ment[s]*|Nomenclature|DEFINITIONS"
        + "|Contents|NOMENCLATURE|ACRONYM|ACRONYMS|NOTATION|APPENDIX|"
        + "Appendix)(\r\n| )*");

void detect_Header1Mode(String block){
    if (Pattern.compile("I. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.ROMAN_DOT_UPPERCASE;
    else if (Pattern.compile("1 INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_UPPERCASE;
    else if (Pattern.compile("1 Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_CAPITALIZED;
    else if (Pattern.compile("1. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_UPPERCASE;
    else if (Pattern.compile("1. Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_CAPITALIZED;
}

void detect_Header2Mode(int _1st_header, String block){
    if (Pattern.compile("^A. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.ABC_DOT_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header
            + ".1 [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header
            + ".1. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_DOT_CAPITALIZED;
}
Figure 4.28 Defining the Style of the Headers
4.1.10.3 Extracting the Lists
In this procedure, the parent Parser detects lists in the text content and separates each list as a whole paragraph by itself. It supports numeric, bulleted, and dashed lists. The block may have a paragraph at the beginning and another at the end, so these have to be separated; each paragraph (if found) and the list are added as paragraphs with the page number to the paragraph list in the paper object.
private boolean parseFigureCaption(String block, int pageNumber){
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = figureCaption_Pattern.matcher(block);
    while(matcher.find()){
        String figureTitle = block.replaceAll(separatedWord_Fixing, "")
                                  .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph figure = new Paragraph(pageNumber, lastHeader, figureTitle);
        paper.figureCaptions.add(figure);
        return true;
    }
    return false;
}

private boolean parseTableCaption(String block, int pageNumber){
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = tableCaption_Pattern.matcher(block);
    while(matcher.find()){
        String tableTitle = block.replaceAll(separatedWord_Fixing, "")
                                 .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph table = new Paragraph(pageNumber, lastHeader, tableTitle);
        paper.tableCaptions.add(table);
        return true;
    }
    return false;
}
Figure 4.29 The Functions of Extracting the Figure and Table Captions
Pattern newList_Pattern = Pattern.compile("\r\n[ ]*([0-9]|-|\\.|·|•)");
Pattern numericList1_Pattern = Pattern.compile("^[0-9](\\.|\\))[ ]+[A-Z]");
Pattern numericList2_Pattern = Pattern.compile("(\\.|:)\r\n[ ]*[0-9](\\.|\\))[ ]+[A-Z]");
Pattern dotList1_Pattern = Pattern.compile("^(\\.|·|•)[ ]+[A-Za-z]");
Pattern dotList2_Pattern = Pattern.compile("(\\.|:)\r\n[ ]*(\\.|·|•)[ ]+");
Pattern dashList1_Pattern = Pattern.compile("^-[ ]+[A-Za-z]+");
Pattern dashList2_Pattern = Pattern.compile("(\\.|:)\r\n[ ]*-[ ]+[A-Z]");

private boolean parseLists(String block, int pageNumber){
    Matcher orderList1_Matcher = numericList1_Pattern.matcher(block);
    Matcher orderList2_Matcher = numericList2_Pattern.matcher(block);
    if(orderList1_Matcher.find() || orderList2_Matcher.find())
        return parseList(block, pageNumber, numericList2_Pattern, newList_Pattern);
    Matcher dotList1_Matcher = dotList1_Pattern.matcher(block);
    Matcher dotList2_Matcher = dotList2_Pattern.matcher(block);
    if(dotList1_Matcher.find() || dotList2_Matcher.find())
        return parseList(block, pageNumber, dotList2_Pattern, newList_Pattern);
    Matcher dashList1_Matcher = dashList1_Pattern.matcher(block);
    Matcher dashList2_Matcher = dashList2_Pattern.matcher(block);
    if(dashList1_Matcher.find() || dashList2_Matcher.find())
        return parseList(block, pageNumber, dashList2_Pattern, newList_Pattern);
    return false;
}
Figure 4.30 The Function of Separating the Lists
4.1.10.4 Extracting the Paragraph
In this procedure, the parent Parser detects paragraphs using paragraph_Exp = \.\r\n[ ]+[A-Z]. As mentioned before, a block is passed to this procedure after being tested as a figure or table caption or a list; it could also contain a header, which has to be extracted first, with the rest of the block returned. If the block passes all these tests, it is considered a paragraph and passed to the parseParagraph() procedure.
Note that the block may contain one or more paragraphs, so all of them have to be detected and separated, and each of them is added to the paragraphs list with its page number in the paper object.
void parseParagraph(String block, int pageNumber){
    Matcher matcher = newParagraph_Pattern.matcher(block);
    String lastHeader = "", content;
    int startIndex = 0, endIndex;
    while(matcher.find()){
        endIndex = matcher.start();
        if(paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        content = block.substring(startIndex, endIndex+1);
        Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
        paper.paragraphs.add(paragraph);
        startIndex = endIndex + matcher.group().length()-1;
    }
    if(paper.headers.size() > 0)
        lastHeader = paper.headers.get(paper.headers.size()-1);
    content = block.substring(startIndex);
    Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
    paper.paragraphs.add(paragraph);
}
Figure 4.31 The Function of Extracting the Paragraph
4.2 The Natural Language Processing (NLP)
4.2.1 Introduction
In this section, the text extracted from the scientific papers has to be refined. We focus on the important words in the text, such as nouns and verbs, and ignore the stop words, such as prepositions and adverbs, so that plagiarism can be detected efficiently even if the user tries to play with the wording.
4.2.2 The Implementation Overview
First, each paragraph in the database is selected and passed to the processText() procedure, which performs the text processing and returns an array of refined words. In this procedure the paragraph passes through several steps:
1. Lowercase
2. Tokenization
3. Part of Speech (POS) tagging
4. Remove Punctuations
5. Remove Stop words
6. Lemmatization
4.2.3 The Text Processing Procedure
4.2.3.1 Lowercase
In this step, all the text is converted to lowercase, so we do not store redundant entries for the same word written in different cases (Play, play).
4.2.3.2 Tokenization
def processText(document):
    document = document.lower()
    words = tokenizeWords(document)
    tagged_words = pos_tag(words)
    filtered_words = removePunctuation(tagged_words)
    filtered_words = removeStopWords(filtered_words)
    filtered_words = lemmatizeWords(filtered_words)
    return filtered_words
Figure 4.33 Process Text Function
def tokenizeWords(sentence):
    return word_tokenize(sentence)
Figure 4.34 Tokenizing words Function
Here we split the text into words using the Treebank tokenization algorithm. This algorithm splits the words in an intelligent way based on a corpus retrieved from NLTK, and it also separates words from surrounding punctuation.
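A rough, NLTK-free sketch of the contraction splitting this tokenizer performs (a simplification for illustration only; the project uses nltk.word_tokenize):

```python
import re

def treebank_like_tokenize(text):
    """Very rough stand-in for NLTK's Treebank tokenizer: only the
    contraction splitting is illustrated here."""
    text = re.sub(r"n't\b", " n't", text)                 # won't -> wo n't
    text = re.sub(r"'(m|s|re|ve|ll|d)\b", r" '\1", text)  # i'm  -> i 'm
    return re.findall(r"n't|'[a-z]+|\w+|[^\w\s]", text)

assert treebank_like_tokenize("i'm") == ["i", "'m"]
assert treebank_like_tokenize("won't") == ["wo", "n't"]
```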
For example:
1. i’m → [ 'i', "'m" ]
2. won’t → ['wo', "n't"]
3. gonna (tested) {helping} (25) → ['gon', 'na', 'tested', 'helping', '25']
Figure 4.35 Tokenization Example
4.2.3.3 Part of Speech (POS) tagging
The purpose of POS tagging is to find the grammatical role of each word in the sentence: it detects whether the word is a verb, noun, adjective, or adverb. This information helps return the words to their origins; verbs, for example, are reduced to their infinitives.
We use the WordNet database to get the word origins.
words = ['at', '5', 'am', 'tomorrow', 'morning', 'the',
         'weather', 'will', 'be', 'very', 'good', '.']
tagged_words = nltk.pos_tag(words)
Figure 4.36 POS Function
[('at', 'IN'), ('5', 'CD'), ('am', 'VBP'), ('tomorrow', 'NN'),
('morning', 'NN'), ('the', 'DT'), ('weather', 'NN'), ('will',
'MD'), ('be', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]
Figure 4.37 POS Output Example
def getWordnetPos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
Figure 4.38 WordNet POS Function
4.2.3.4 Remove Punctuations
In this step the punctuation is removed from the text: commas, full stops, single and double quotes, and parentheses, whether round, square, or curly.
4.2.3.5 Remove Stop words
In this process the stop words are removed.
def removePunctuation(words):
    new_words = []
    for word in words:
        if len(word[0]) > 1:
            new_words.append(word)
    return new_words
Figure 4.39 Removing Punctuations Function
def removeStopWords(words):
    stop_words = set(stopwords.words("english"))
    new_words = []
    for word in words:
        if word[0] not in stop_words:
            new_words.append(word)
    return new_words
Figure 4.40 Removing Stop Words Function
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
Figure 4.41 Stop Words list
4.2.3.6 Lemmatization
In this step, we use the information retrieved from the POS tagger to get the origins of the words, by passing each word and its WordNet position to the lemmatize function.
After processing, the paragraph contains only the important words that describe its real meaning.
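For illustration, a crude rule-based stand-in (not the project code, which relies on WordNet) shows how the POS guides the reduction:

```python
def naive_lemma(word, pos):
    """Crude, dictionary-free stand-in for WordNetLemmatizer: the rules
    below only cover the illustrated cases; real coverage needs WordNet."""
    if pos == "n":
        if word.endswith("ies") and len(word) > 4:
            return word[:-3] + "y"       # penalties -> penalty
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]             # rules -> rule
    if pos == "v":
        if word.endswith("ied"):
            return word[:-3] + "y"       # identified -> identify
    return word

assert naive_lemma("penalties", "n") == "penalty"
assert naive_lemma("identified", "v") == "identify"
```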
4.2.4 Example of the Text Processing
def lemmatizeWords(words):
    new_words = []
    wordnet_lemmatizer = WordNetLemmatizer()
    for word in words:
        new_word = wordnet_lemmatizer.lemmatize(word[0],
                                                getWordnetPos(word[1]))
        new_words.append(new_word)
    return new_words
Figure 4.42 Lemmatization Function
Plagiarism is the wrongful appropriation and stealing and publication of another
author's language, thoughts, ideas, or expressions and the representation of them
as one's own original work. The idea remains problematic with unclear definitions
and unclear rules. The modern concept of plagiarism as immoral and originality as
an ideal emerged in Europe only in the 18th century, particularly with the
Romantic movement.
Plagiarism is considered academic dishonesty and a breach of journalistic ethics.
It is subject to sanctions like penalties, suspension, and even expulsion. Recently,
cases of 'extreme plagiarism' have been identified in academia.
Plagiarism is not in itself a crime, but can constitute copyright infringement. In
academia and industry, it is a serious ethical offense. Plagiarism and copyright
infringement overlap to a considerable extent, but they are not equivalent
concepts, and many types of plagiarism do not constitute copyright infringement,
which is defined by copyright law and may be adjudicated by courts. Plagiarism is
not defined or punished by law, but rather by institutions (including professional
associations, educational institutions, and commercial entities, such as publishing
companies).
Figure 4.43 Paragraph before Text Processing
['plagiarism', 'wrongful', 'appropriation', 'stealing', 'publication',
'another', 'author', "'s", 'language', 'thought', 'idea', 'expression',
'representation', 'one', "'s", 'original', 'work', 'idea', 'remain',
'problematic', 'unclear', 'definition', 'unclear', 'rule', 'modern',
'concept', 'plagiarism', 'immoral', 'originality', 'ideal', 'emerge',
'europe', '18th', 'century', 'particularly', 'romantic', 'movement',
'plagiarism', 'consider', 'academic', 'dishonesty', 'breach',
'journalistic', 'ethic', 'subject', 'sanction', 'like', 'penalty',
'suspension', 'even', 'expulsion', 'recently', 'case', "'extreme",
'plagiarism', 'identify', 'academia', 'plagiarism', 'crime',
'constitute', 'copyright', 'infringement', 'academia', 'industry',
'serious', 'ethical', 'offense', 'plagiarism', 'copyright',
'infringement', 'overlap', 'considerable', 'extent', 'equivalent',
'concept', 'many', 'type', 'plagiarism', 'constitute', 'copyright',
'infringement', 'define', 'copyright', 'law', 'may', 'adjudicate',
'court', 'plagiarism', 'define', 'punish', 'law', 'rather',
'institution', 'include', 'professional', 'association', 'educational',
'institution', 'commercial', 'entity', 'publish', 'company']
Figure 4.44 Paragraph after Text Processing
4.3 Term Weighting
In this section we calculate the term weighting for our system using the data extracted from the scientific papers by the parser. The parser extracts the data as paragraphs and stores them in the database; here we retrieve these paragraphs and calculate the term weights for the system.
4.3.1 Lost Connection to Database Problem
First we open a connection to the database and retrieve the unprocessed paragraphs. But we are processing a large number of paragraphs and the connection must stay open all that time, so we face a lost-connection problem when the database's internal timeout expires.
1) Increasing the timeout
This problem could be solved by increasing the timeout, but this solution is limited, as we might have a very large number of paragraphs that exceeds whatever timeout was set.
2) A better solution
We retrieve 100 paragraphs, process them, then close the connection. We then open a new connection and retrieve another 100 paragraphs, and so on until all the unprocessed paragraphs are processed.
cursor = connection.run("SELECT COUNT(*) FROM paragraph WHERE processed = false")
(unprocessedParagraphsNum,) = cursor.fetchone()
connection.endConnect()
pCounter = 0
insertTermsBeginTime = time.time()
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 100: remain = 100
    rows = connection1.run("SELECT paragraphId, content FROM paragraph"
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        # Process Paragraph
    connection1.endConnect()
    connection2.endConnect()
Figure 4.45 Retrieving Paragraphs
4.3.2 Process Paragraph
Each paragraph is passed to the processText() procedure to get an array of refined words; if the array is empty, it means there were no important words in the paragraph, and the paragraph is deleted.
The returned words are used to generate k-gram terms and populate the term and paragraphVector tables.
Finally, we update the length of the paragraph with the number of words returned from processText(), and mark the paragraph as processed.
4.3.3 Generating Terms
To generate terms we call the generateTerms() procedure and pass it the bag of words and the kinds of k-grams we want to generate.
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 1000: remain = 1000
    rows = connection1.run("SELECT paragraphId, content FROM paragraph"
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        data = processText(content)
        length = len(data)
        if length < 1:
            connection2.run("DELETE FROM paragraph WHERE paragraphId = %s;",
                            (paragraphId,))
            connection2.commit()
            continue
        term.populateTerms_ParagraphVector(connection2, data, paragraphId)
        connection2.run("UPDATE paragraph SET length = %s, processed = %s"
                        " WHERE paragraphId = %s;", (length, True, paragraphId))
        connection2.commit()
    connection1.endConnect()
    connection2.endConnect()
Figure 4.46 Process Paragraph Function
Example on k-grams
data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)

def generateTerms(data, kgrams, paragraphId=0):
    all_terms = {}
    for i in kgrams:
        if len(data) < i: continue
        terms = createTerms(data, i)
        all_terms[i] = terms
    data = {
        'paragraphId': paragraphId,
        'terms': all_terms
    }
    return data

def createTerms(words, kgram):
    length = len(words) - kgram + 1
    i = 0
    terms = []
    while i < length:
        term = createTerm(words, i, kgram)
        terms.append(term)
        i += 1
    return terms

def createTerm(words, start, kgram):
    i = start
    term = []
    while i < kgram + start:
        term.append(words[i])
        i += 1
    t = ' '.join(term)
    if len(t) > 180:
        t = t[0:180]
    return t
Figure 4.47 Generate k-gram Terms Function
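The sliding-window generation above reduces to a short equivalent in idiomatic Python (an equivalent sketch, not project code):

```python
def kgrams(words, k):
    """Sliding-window k-gram terms, equivalent to createTerms() above:
    join each window of k consecutive words into one term string."""
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

assert kgrams(["a", "b", "c", "d"], 2) == ["a b", "b c", "c d"]
assert kgrams(["a", "b"], 3) == []   # too few words for a 3-gram
```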
Physics is one of the oldest academic disciplines, perhaps the oldest through
its inclusion of astronomy. Over the last two millennia, physics was a part of
natural philosophy along with chemistry, biology, and certain branches of
mathematics.
Figure 4.48 Paragraph Example
['physic', 'one', 'old', 'academic', 'discipline', 'perhaps', 'old',
'inclusion', 'astronomy', 'last', 'two', 'millennium', 'physic', 'part',
'natural', 'philosophy', 'along', 'chemistry', 'biology', 'certain', 'branch',
'mathematics']
Figure 4.49 1-gram terms
['physic one', 'one old', 'old academic', 'academic discipline', 'discipline
perhaps', 'perhaps old', 'old inclusion', 'inclusion astronomy', 'astronomy
last', 'last two', 'two millennium', 'millennium physic', 'physic part', 'part
natural', 'natural philosophy', 'philosophy along', 'along chemistry',
'chemistry biology', 'biology certain', 'certain branch', 'branch mathematics']
Figure 4.50 2-gram terms
['physic one old', 'one old academic', 'old academic discipline', 'academic
discipline perhaps', 'discipline perhaps old', 'perhaps old inclusion', 'old
inclusion astronomy', 'inclusion astronomy last', 'astronomy last two', 'last
two millennium', 'two millennium physic', 'millennium physic part', 'physic
part natural', 'part natural philosophy', 'natural philosophy along',
'philosophy along chemistry', 'along chemistry biology', 'chemistry biology
certain', 'biology certain branch', 'certain branch mathematics']
Figure 4.51 3-gram terms
['physic one old academic', 'one old academic discipline', 'old academic
discipline perhaps', 'academic discipline perhaps old', 'discipline perhaps old
inclusion', 'perhaps old inclusion astronomy', 'old inclusion astronomy last',
'inclusion astronomy last two', 'astronomy last two millennium', 'last two
millennium physic', 'two millennium physic part', 'millennium physic part
natural', 'physic part natural philosophy', 'part natural philosophy along',
'natural philosophy along chemistry', 'philosophy along chemistry biology',
'along chemistry biology certain', 'chemistry biology certain branch', 'biology
certain branch mathematics']
Figure 4.52 4-gram terms
['physic one old academic discipline', 'one old academic discipline perhaps',
'old academic discipline perhaps old', 'academic discipline perhaps old
inclusion', 'discipline perhaps old inclusion astronomy', 'perhaps old
inclusion astronomy last', 'old inclusion astronomy last two', 'inclusion
astronomy last two millennium', 'astronomy last two millennium physic', 'last
two millennium physic part', 'two millennium physic part natural', 'millennium
physic part natural philosophy', 'physic part natural philosophy along', 'part
natural philosophy along chemistry', 'natural philosophy along chemistry
biology', 'philosophy along chemistry biology certain', 'along chemistry
biology certain branch', 'chemistry biology certain branch mathematics']
Figure 4.53 5-gram terms
4.3.4 Populating term, paragraphVector Tables
After generating the terms, we use them to populate the term and paragraphVector tables.
4.3.4.1 Calculate Term Frequency
We use the nltk.FreqDist() function to calculate the term frequency of each k-gram term in the paragraph.
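The project uses nltk.FreqDist(); for plain counts, the standard library's collections.Counter behaves the same way, as a quick check shows (sample terms taken from the 2-gram example above):

```python
from collections import Counter

# Counting raw term occurrences, as FreqDist does for the paragraph vector.
terms = ["physic one", "one old", "physic one"]
tf = Counter(terms)
assert tf["physic one"] == 2
assert tf["one old"] == 1
```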
4.3.4.2 Inserting Terms
We will insert each term with its corresponding term gram.
4.3.4.3 Inserting ParagraphVector
In this step we will link each term with its paragraph and the term frequency by inserting these
into the paragraphVector table.
tf = {}
for kgram in data['terms']:
    tf[kgram] = nltk.FreqDist(data['terms'][kgram])
Figure 4.54 Calculate Term Frequency
query1 = ("INSERT INTO term (kgram, term) VALUES (%s, %s)"
          " ON DUPLICATE KEY UPDATE kgram = kgram, term = term;")
insertTerms = [(str(kgram), str(term)) for kgram in tf for term in tf[kgram]]
connection.runMany(query1, insertTerms)
connection.commit()
Figure 4.55 insert Terms in Database
query2 = ("INSERT IGNORE INTO paragraphVector (paragraphId, termId, termFreq, kgram)"
          " VALUES (%s, (SELECT termId FROM term WHERE term = %s AND kgram = %s),"
          " %s, %s);")
insertDocVec = [(data['paragraphId'], str(term), str(kgram), tf[kgram][term],
                 str(kgram)) for kgram in tf for term in tf[kgram]]
connection.runMany(query2, insertDocVec)
connection.commit()
Figure 4.56 insert Paragraph Vector in Database
4.3.5 Executing VSM Algorithm
After all paragraphs have been processed and inserted into the database, we run some stored SQL procedures to update the inverseDocFreq, BM25, and pivotNorm columns in the term and paragraphVector tables.
Now the system is ready: all terms are weighted and prepared for testing plagiarism.
connection.callProcedure('update_inverseDocFreq')
connection.callProcedure('update_BM25', (0.75, 1.5))
connection.callProcedure('update_pivotNorm', (0.75,))
Figure 4.57 Executing the VSM Algorithm
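The stored procedures themselves are not shown here; a common BM25 formulation they may approximate looks like the following sketch (an assumption: the exact SQL may differ, but the parameters mirror update_BM25(0.75, 1.5)):

```python
import math

def bm25_weight(tf, doc_len, avg_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """One common BM25 term weight: IDF scaled by a saturating,
    length-normalized term frequency."""
    idf = math.log(n_docs / doc_freq)
    norm = tf + k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / norm

w1 = bm25_weight(tf=1, doc_len=20, avg_len=20, n_docs=1000, doc_freq=10)
w2 = bm25_weight(tf=3, doc_len=20, avg_len=20, n_docs=1000, doc_freq=10)
assert w2 > w1 > 0   # more occurrences -> higher weight, with saturation
```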
4.4 Testing Plagiarism
When a user submits a text or a file to test for plagiarism, the text is first split into paragraphs, and an inputPaper row is inserted to relate these paragraphs together.
4.4.1 Process Paragraph
Each paragraph is then processed in a way similar to the pre-processing: first the paragraph is inserted into the inputParagraph table, then the text is passed to the processText() procedure, which returns a refined bag of words. Finally, these words are used to generate terms and populate the inputParagraphVector table.
connection.run(" INSERT INTO inputPaper (inputPaperId) VALUES(''); ")
paragraphs = tokenizeParagraphs(text)
Figure 4.58 tokenizing and link paragraphs together
for paragraph in paragraphs:
    data = processText(paragraph)
    length = len(data)
    if length < 1: continue
    cursor = connection.run("INSERT INTO inputParagraph (content, inputPaperId)"
                            " VALUES (%s, %s)", (paragraph, paperId))
    connection.commit()
    paragraphId = cursor.getlastrowid()
    term.populateInput_Terms_ParagraphVector(connection, data, paragraphId)
Figure 4.59 Process input paragraphs
def populateInput_Terms_ParagraphVector(connection, words, paragraphId):
    data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)
    # Term Frequency representation
    tf = {}
    for kgram in data['terms']:
        tf[kgram] = FreqDist(data['terms'][kgram])
    query = ("INSERT INTO inputParagraphVector (inputParagraphId, termId, termFreq, kgram) "
             "SELECT %s, termId, %s, %s FROM term WHERE term = %s AND kgram = %s;")
    insertDocVec = [(data['paragraphId'], tf[kgram][term], str(kgram), str(term), str(kgram))
                    for kgram in tf for term in tf[kgram]]
    connection.runMany(query, insertDocVec)
    connection.commit()
Figure 4.60 Populate input paragraph vector

More Related Content

What's hot

Placement Cell project
Placement Cell projectPlacement Cell project
Placement Cell project
Manish Kumar
 
Attendance Management Report 2016
Attendance Management Report 2016Attendance Management Report 2016
Attendance Management Report 2016
Pooja Maan
 

What's hot (20)

Face Recognition Attendance System
Face Recognition Attendance System Face Recognition Attendance System
Face Recognition Attendance System
 
Hostel management system srs
Hostel management system srsHostel management system srs
Hostel management system srs
 
Automatic Attendance system using Facial Recognition
Automatic Attendance system using Facial RecognitionAutomatic Attendance system using Facial Recognition
Automatic Attendance system using Facial Recognition
 
Online Railway Reservation System
Online Railway Reservation SystemOnline Railway Reservation System
Online Railway Reservation System
 
Mobile based attandance system
Mobile based attandance systemMobile based attandance system
Mobile based attandance system
 
Expense tracker
Expense trackerExpense tracker
Expense tracker
 
Food ordering System
Food ordering SystemFood ordering System
Food ordering System
 
FAKE NEWS DETECTION PPT
FAKE NEWS DETECTION PPT FAKE NEWS DETECTION PPT
FAKE NEWS DETECTION PPT
 
Notepad Testing Report
Notepad Testing Report  Notepad Testing Report
Notepad Testing Report
 
Face recognition a survey
Face recognition a surveyFace recognition a survey
Face recognition a survey
 
Voice assistant ppt
Voice assistant pptVoice assistant ppt
Voice assistant ppt
 
college website project report
college website project reportcollege website project report
college website project report
 
Placement Cell project
Placement Cell projectPlacement Cell project
Placement Cell project
 
SRS for Library Management System
SRS for Library Management SystemSRS for Library Management System
SRS for Library Management System
 
Attendance Management Report 2016
Attendance Management Report 2016Attendance Management Report 2016
Attendance Management Report 2016
 
Applications of Machine Learning
Applications of Machine LearningApplications of Machine Learning
Applications of Machine Learning
 
Handwritten Character Recognition
Handwritten Character RecognitionHandwritten Character Recognition
Handwritten Character Recognition
 
Online Shopping project report
Online Shopping project report Online Shopping project report
Online Shopping project report
 
Online Examination System Project report
Online Examination System Project report Online Examination System Project report
Online Examination System Project report
 
Online Attendance Management System
Online Attendance Management SystemOnline Attendance Management System
Online Attendance Management System
 

Viewers also liked

Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Maarten van Wesel
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
Traian Rebedea
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarism
guestf17a2e
 
Authorship analysis using function words forensic linguistics
Authorship analysis using function words forensic linguisticsAuthorship analysis using function words forensic linguistics
Authorship analysis using function words forensic linguistics
Vlad Mackevic
 
Plagiarism and its detection
Plagiarism and its detectionPlagiarism and its detection
Plagiarism and its detection
ankit_saluja
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
butest
 
Forensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical ApplicationsForensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical Applications
dahveed123
 

Viewers also liked (17)

plagiarism detection tools and techniques
plagiarism detection tools and techniquesplagiarism detection tools and techniques
plagiarism detection tools and techniques
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
 
"Processors for Embedded Vision: Technology and Market Trends," A Presentatio...
"Processors for Embedded Vision: Technology and Market Trends," A Presentatio..."Processors for Embedded Vision: Technology and Market Trends," A Presentatio...
"Processors for Embedded Vision: Technology and Market Trends," A Presentatio...
 
Authorship attribution
Authorship attributionAuthorship attribution
Authorship attribution
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
 
Plag detection
Plag detectionPlag detection
Plag detection
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarism
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...
 
Authorship analysis using function words forensic linguistics
Authorship analysis using function words forensic linguisticsAuthorship analysis using function words forensic linguistics
Authorship analysis using function words forensic linguistics
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk
 
Plagiarism and its detection
Plagiarism and its detectionPlagiarism and its detection
Plagiarism and its detection
 
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep Learning
 
Forensic linguistics
Forensic linguisticsForensic linguistics
Forensic linguistics
 
Forensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical ApplicationsForensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical Applications
 

Similar to My Graduation Project Documentation: Plagiarism Detection System for English Scientific Papers

26 Lab #1 Performing Reconnaissance and Probing Using Comm.docx
26  Lab #1 Performing Reconnaissance and Probing Using Comm.docx26  Lab #1 Performing Reconnaissance and Probing Using Comm.docx
26 Lab #1 Performing Reconnaissance and Probing Using Comm.docx
tamicawaysmith
 
Securing ASP.NET MVC 5 Web Applications
Securing ASP.NET MVC 5 Web ApplicationsSecuring ASP.NET MVC 5 Web Applications
Securing ASP.NET MVC 5 Web Applications
Martin Åhlin
 
Saravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan_QA_Automation & Manual Testing_2Years-ExpSaravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan Sangapillai
 
Java documentaion template(mini)
Java documentaion template(mini)Java documentaion template(mini)
Java documentaion template(mini)
cbhareddy
 
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docxCopyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
vanesaburnand
 

Similar to My Graduation Project Documentation: Plagiarism Detection System for English Scientific Papers (20)

Event Syndication – Requirements Specification For Linked Event Diaries
Event Syndication – Requirements Specification For Linked Event DiariesEvent Syndication – Requirements Specification For Linked Event Diaries
Event Syndication – Requirements Specification For Linked Event Diaries
 
26 Lab #1 Performing Reconnaissance and Probing Using Comm.docx
26  Lab #1 Performing Reconnaissance and Probing Using Comm.docx26  Lab #1 Performing Reconnaissance and Probing Using Comm.docx
26 Lab #1 Performing Reconnaissance and Probing Using Comm.docx
 
Securing ASP.NET MVC 5 Web Applications
Securing ASP.NET MVC 5 Web ApplicationsSecuring ASP.NET MVC 5 Web Applications
Securing ASP.NET MVC 5 Web Applications
 
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug TriageSurvey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
 
Codec Networks is Present Training in Penetration testing,VAPT in Delhi,India.
 Codec Networks is Present Training in Penetration testing,VAPT in Delhi,India.  Codec Networks is Present Training in Penetration testing,VAPT in Delhi,India.
Codec Networks is Present Training in Penetration testing,VAPT in Delhi,India.
 
SELF CORRECTING MEMORY DESIGN FOR FAULT FREE CODING IN PROGRESSIVE DATA STREA...
SELF CORRECTING MEMORY DESIGN FOR FAULT FREE CODING IN PROGRESSIVE DATA STREA...SELF CORRECTING MEMORY DESIGN FOR FAULT FREE CODING IN PROGRESSIVE DATA STREA...
SELF CORRECTING MEMORY DESIGN FOR FAULT FREE CODING IN PROGRESSIVE DATA STREA...
 
Ch03
Ch03Ch03
Ch03
 
Saravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan_QA_Automation & Manual Testing_2Years-ExpSaravanan_QA_Automation & Manual Testing_2Years-Exp
Saravanan_QA_Automation & Manual Testing_2Years-Exp
 
D space manual-1_8
D space manual-1_8D space manual-1_8
D space manual-1_8
 
Chemread – a chemical informant
Chemread – a chemical informantChemread – a chemical informant
Chemread – a chemical informant
 
Introduction to Christian Education: Final Exam 2011
Introduction to Christian Education: Final Exam 2011Introduction to Christian Education: Final Exam 2011
Introduction to Christian Education: Final Exam 2011
 
Divya_Resume
Divya_ResumeDivya_Resume
Divya_Resume
 
Java documentaion template(mini)
Java documentaion template(mini)Java documentaion template(mini)
Java documentaion template(mini)
 
Manual for WordFinder 10 Professional, PC
Manual for WordFinder 10 Professional, PCManual for WordFinder 10 Professional, PC
Manual for WordFinder 10 Professional, PC
 
2007
20072007
2007
 
WordFinder Grammatik 3 - Manual
WordFinder Grammatik 3 - ManualWordFinder Grammatik 3 - Manual
WordFinder Grammatik 3 - Manual
 
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docxCopyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
Copyright © 2014 by Jones & Bartlett Learning, LLC, an Ascend .docx
 
Software Bug Detection Algorithm using Data mining Techniques
Software Bug Detection Algorithm using Data mining TechniquesSoftware Bug Detection Algorithm using Data mining Techniques
Software Bug Detection Algorithm using Data mining Techniques
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
FIB MIT208.pdf
FIB   MIT208.pdfFIB   MIT208.pdf
FIB MIT208.pdf
 

Recently uploaded

Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 

My Graduation Project Documentation: Plagiarism Detection System for English Scientific Papers

  • 1. SUPERVISED BY: Dr Hitham M. Abo Bakr Implementing Plagiarism Detection Engine For English Academic Papers By Muhamed Gameel Abd El Aziz Ahmed Motair El Said Mater Mohamed Hessien Mohamed Shreif Hosni Zidan Esmail Manar Mohamed Said Ahmed Doaa Abd El Hamid Abd El Hamid
  • 2. Implementing Plagiarism Detection Engine for English Academic Papers 1 Abstract Plagiarism became a serious issue now days due to the presence of vast resources easily available on the web, which makes developing plagiarism detection tool a useful and challenging task due to the scalability issues. Our project is implementing a Plagiarism Detection Engine oriented for English academic papers using text Information Retrieval methods, relational database, and Natural Language Processing techniques. The main parts of the projects are: Gathering and cleaning data: crawling the web and collecting academic papers and parsing it to extract information about the paper and make a big dataset of these scientific paper content. Tokenization: Parse, tokenize, and preprocess documents. Plagiarism engine: checking similarity between the input document and the database to detect potential plagiarism.
  • 3. Implementing Plagiarism Detection Engine for English Academic Papers 2 Table of Contents Abstract___________________________________________________________________________ 1 Table of Contents ___________________________________________________________________ 2 Table of Figures____________________________________________________________________ 4 Table of Tables_____________________________________________________________________ 7 Chapter 1 Introduction ___________________________________________________________ 8 1.1 What is Plagiarism? _________________________________________________________________8 1.2 What is Self-Plagiarism? _____________________________________________________________8 1.3 Plagiarism on the Internet ____________________________________________________________8 1.4 Plagiarism Detection System __________________________________________________________8 1.4.1 Local similarity: __________________________________________________________________________8 1.4.2 Global similarity: _________________________________________________________________________9 1.4.3 Fingerprinting ___________________________________________________________________________9 1.4.4 String Matching __________________________________________________________________________9 1.4.5 Bag of words _____________________________________________________________________________9 1.4.6 Citation-based Analysis____________________________________________________________________9 1.4.7 Stylometry_______________________________________________________________________________9 Chapter 2 Background Theory ____________________________________________________ 10 2.1 Linear Algebra Basics______________________________________________________________ 10 2.1.1 Vectors_________________________________________________________________________________10 2.2 Information Retrieval (IR)__________________________________________________________ 11 2.3 Regular 
Expression________________________________________________________________ 15 2.4 NLTK Toolkit ____________________________________________________________________ 16 2.5 Node.js __________________________________________________________________________ 16 2.6 Express.js________________________________________________________________________ 16 2.7 Sockets.io ________________________________________________________________________ 16 2.8 Languages Used___________________________________________________________________ 16 Chapter 3 Design and Architecture_________________________________________________ 17 3.1 Extract, Transfer and Load (ETL) ___________________________________________________ 17 3.2 Plagiarism Engine_________________________________________________________________ 17 3.2.1 Natural Language Processing, (Generating k-grams), and vectorization__________________________18 3.2.2 Semantic Analysis (Vector Space Model VSM Representation) _________________________________18 3.2.3 Calculating Similarity ____________________________________________________________________18 3.2.4 Clustering ______________________________________________________________________________18 3.2.5 Communicating Results___________________________________________________________________19 Chapter 4 Implementation________________________________________________________ 20 4.1 Extract, Load and Transform (ETL) _________________________________________________ 20 4.1.1 The Crawler ____________________________________________________________________________20 4.1.2 The Parser______________________________________________________________________________20 4.1.3 The Data Extracted from the paper_________________________________________________________20 4.1.4 The Parser Implementation _______________________________________________________________21
  • 4. Implementing Plagiarism Detection Engine for English Academic Papers 3 4.1.5 How it works____________________________________________________________________________21 4.1.6 Steps of Parsing _________________________________________________________________________22 4.1.7 The Paper Class _________________________________________________________________________26 4.1.8 The Paragraph Structure _________________________________________________________________27 4.1.9 The Parsing the First Page in Details (ex: an IEEE Paper) _____________________________________27 4.1.10 The Parsing the Other Pages in Details (ex: an IEEE Paper)____________________________________37 4.2 The Natural Language Processing (NLP)______________________________________________ 42 4.2.1 Introduction ____________________________________________________________________________42 4.2.2 The Implementation Overview_____________________________________________________________42 4.2.3 The Text Processing Procedure ____________________________________________________________42 4.2.4 Example of the Text Processing ____________________________________________________________45 4.3 Term Weighting __________________________________________________________________ 47 4.3.1 Lost Connection to Database Problem ______________________________________________________47 4.3.2 Process Paragraph _______________________________________________________________________48 4.3.3 Generating Terms _______________________________________________________________________48 4.3.4 Populating term, paragraphVector Tables ___________________________________________________51 4.3.5 Executing VSM Algorithm ________________________________________________________________52 4.4 Testing Plagiarism ________________________________________________________________ 53 4.4.1 Process Paragraph _______________________________________________________________________53 4.4.2 Calculate Similarity 
______________________________________________________________________54 4.4.3 Get Results _____________________________________________________________________________54 4.5 The VSM Algorithm _______________________________________________________________ 55 4.5.1 Calculating similarity ____________________________________________________________________55 4.5.2 K-means and Clustering __________________________________________________________________56 4.6 Server Side_______________________________________________________________________ 59 4.6.1 Handling Routing________________________________________________________________________59 4.6.2 Running Python System __________________________________________________________________60 4.7 Client Side _______________________________________________________________________ 62 4.8 The GUI of the System _____________________________________________________________ 63 Chapter 5 Results and Discussion__________________________________________________ 66 5.1 Dataset of the Parser_______________________________________________________________ 66 5.2 Exploring dataset _________________________________________________________________ 68 5.4.1 Small dataset (15K) ______________________________________________________________________68 5.4.2 Big dataset (50K) ________________________________________________________________________69 5.3 Performance _____________________________________________________________________ 70 5.4 Detecting plagiarism _______________________________________________________________ 72 5.4.1 Percentage score functions:________________________________________________________________72 5.5 Discussing results _________________________________________________________________ 74 Chapter 6 Conclusion ___________________________________________________________ 75 Chapter 7 Appendix _____________________________________________________________ 76 7.1 Entity-Relation Diagram (ERD) 
_____________________________________________________ 76 7.2 Stored procedures_________________________________________________________________ 77 References _______________________________________________________________________ 84
Table of Figures
Figure 1.1 Plagiarism Detection Approaches ____ 8
Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3) ____ 10
Figure 2.2 Geometric representation of documents ____ 12
Figure 3.1 High level block diagram ____ 17
Figure 3.2 Detailed block diagram of the Plagiarism Engine ____ 17
Figure 4.1 Overview for the Crawler and Parser ____ 20
Figure 4.2 UML of the Parser Application ____ 21
Figure 4.3 The Flow Chart of the Parser ____ 21
Figure 4.4 The main function of Parsing ____ 22
Figure 4.5 The First Page of an IEEE Paper (as Blocks) ____ 22
Figure 4.6 First Page of a Science Direct Paper ____ 23
Figure 4.7 First Page of a Springer Paper ____ 23
Figure 4.8 The function of parserOtherPages ____ 24
Figure 4.9 Block of String before Enhancing ____ 25
Figure 4.10 The Paragraphs after Enhancing ____ 25
Figure 4.11 The Paper Structure ____ 26
Figure 4.12 The Paragraph Structure ____ 27
Figure 4.13 Different forms for an IEEE Top Header ____ 27
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper ____ 28
Figure 4.15 The supported Regex of the IEEE Header formats ____ 29
Figure 4.16 The Function of Extracting the Volume Number ____ 30
Figure 4.17 The Function of Extracting the Issue Number ____ 30
Figure 4.18 The Function of Extracting the DOI ____ 30
Figure 4.19 The Function of Extracting the Start and End Pages ____ 31
Figure 4.20 The Function of Extracting the Journal Title ____ 32
Figure 4.21 Parsing the rest of blocks in the first Page ____ 32
Figure 4.22 The Function of Extracting the DOI and PII ____ 33
Figure 4.23 The Function of Extracting the ISSN ____ 33
Figure 4.24 The Function of Extracting the Paper Dates ____ 34
Figure 4.25 The Function of Extracting the Keywords ____ 35
Figure 4.26 The Function of Extracting the Keywords ____ 36
Figure 4.27 The Function of Extracting the Title and the Authors ____ 36
Figure 4.28 Defining the Style of the Header ____ 38
Figure 4.29 The Function of Extracting the Figure Captions ____ 39
Figure 4.30 The Function of Separating the Lists ____ 40
Figure 4.31 The Function of Extracting the Paragraph ____ 40
Figure 4.32 The Function of Extracting the Paragraph ____ 41
Figure 4.33 Process Text Function ____ 42
Figure 4.34 Tokenizing Words Function ____ 42
Figure 4.35 Tokenization Example ____ 43
Figure 4.36 POS Function ____ 43
Figure 4.37 POS Output Example ____ 43
Figure 4.38 WordNet POS Function ____ 43
Figure 4.39 Removing Punctuations Function ____ 44
Figure 4.40 Removing Stop Words Function ____ 44
Figure 4.41 Stop Words List ____ 44
Figure 4.42 Lemmatization Function ____ 45
Figure 4.43 Paragraph before Text Processing ____ 45
Figure 4.44 Paragraph after Text Processing ____ 46
Figure 4.45 Retrieving Paragraphs ____ 47
Figure 4.46 Process Paragraph Function ____ 48
Figure 4.47 Generate k-gram Terms Function ____ 49
Figure 4.48 Paragraph Example ____ 49
Figure 4.49 1-gram terms ____ 50
Figure 4.50 2-gram terms ____ 50
Figure 4.51 3-gram terms ____ 50
Figure 4.52 4-gram terms ____ 50
Figure 4.53 5-gram terms ____ 50
Figure 4.54 Calculate Term Frequency ____ 51
Figure 4.55 Insert Terms in Database ____ 51
Figure 4.56 Insert Paragraph Vector in Database ____ 51
Figure 4.57 Executing the VSM Algorithm ____ 52
Figure 4.58 Tokenizing and Linking Paragraphs Together ____ 53
Figure 4.59 Process Input Paragraphs ____ 53
Figure 4.60 Populate Input Paragraph Vector ____ 53
Figure 4.61 Calculate Similarity ____ 54
Figure 4.62 Get Results ____ 54
Figure 4.63 Flowchart of the Kmeans Text Clustering Algorithm ____ 57
Figure 4.64 Home Page Routing ____ 59
Figure 4.65 Pre-Process Page Routing ____ 59
Figure 4.66 Communicating between the Server and the Core Engine for Testing Plagiarism ____ 60
Figure 4.67 Communicating between the Server and the Core Engine for Pre-processing ____ 61
Figure 4.68 Longest Common Subsequence (LCS) Algorithm ____ 62
Figure 4.69 Longest Common Subsequence (LCS) Algorithm ____ 62
Figure 4.70 Submitting an Input Document ____ 63
Figure 4.71 The Results of the Process, Part 1 ____ 64
Figure 4.72 The Results of the Process, Part 2 ____ 65
Figure 5.1 Number of Papers Published per Year in IEEE ____ 66
Figure 5.2 Number of Papers Published per Year in Springer ____ 67
Figure 5.3 Number of Papers Published per Year in Science Direct ____ 67
Figure 5.4 Response Time against Number of Paragraphs Tested on a Small Dataset ____ 70
Figure 5.5 Screenshot of the System Performance from the System GUI ____ 71
Figure 7.1 ERD of the Plagiarism Engine Database ____ 76
Table of Tables
Table 1 Statistics of the Parser ____ 66
Table 2 Dataset Statistics ____ 68
Table 3 Unique Terms Count in each Paragraph ____ 68
Table 4 Unique Terms Count in Dataset ____ 68
Table 5 Dataset Statistics ____ 69
Table 6 Unique Terms Count in each Paragraph ____ 69
Table 7 Unique Terms Count in Dataset ____ 69
Table 8 Processing Time of each Module in the Plagiarism Engine ____ 70
Table 9 Parameters ____ 72
Table 10 Testing Paragraphs and Results ____ 73
Chapter 1 Introduction

1.1 What is Plagiarism?
Plagiarism is the act of academic theft: copying words from a book or a scientific paper and publishing them as one's own work. Stealing ideas, images, videos, or music and using them without permission or a proper citation is also plagiarism.

1.2 What is Self-Plagiarism?
Self-plagiarism is reusing a portion of an article or other work one has published before without citing that one is doing so. The reused portion may be significant, identical, or nearly identical. It may also cause copyright issues, as the copyright of the old work will have been transferred with the earlier publication. Such articles are called duplicate or multiple publications.

1.3 Plagiarism on the Internet
Blogs, Facebook pages, and some websites copy and paste information, violating many copyrights, so several measures are used to discourage plagiarism: disabling right-click to prevent copying, placing copyright warnings on every page of a website as banners or pictures, and filing reports of copyright infringement under the DMCA copyright law. Such a report can be sent to the website owner or to the ISP hosting the website, and the infringing website will be removed.

1.4 Plagiarism Detection System
A plagiarism detection system tests whether a piece of material contains plagiarism. The material could be a scientific article, a technical report, an essay, or anything else. The system can also highlight the plagiarized parts of the material and state where they were copied from, even when some words have been replaced with synonyms.

Figure 1.1 Plagiarism Detection Approaches

1.4.1 Local similarity
Given a small dataset, the system checks the similarity between each pair of paragraphs in the dataset, for example to check whether two students cheated on an assignment.
1.4.2 Global similarity
Global similarity systems check a small set of input paragraphs against a large dataset, for example to check whether a submitted paper is plagiarized from an already published paper.

1.4.3 Fingerprinting
In this approach, the dataset consists of sets of n-grams selected randomly as substrings of each document. Each set of n-grams represents a fingerprint for that document, called its minutiae, and all fingerprints are indexed in the database. The input text is processed the same way and compared with the fingerprints in the database; if it matches some of them, it plagiarizes those documents. [1]

1.4.4 String Matching
Exact string matching is one of the central problems in plagiarism detection: comparing the document to be tested against the whole database requires a huge amount of resources and storage, so suffix trees and suffix vectors are used to overcome this problem. [2]

1.4.5 Bag of words
This approach is an adaptation of vector space retrieval. Each document is represented as a bag of words, and these words are inserted into the database as n-grams together with their locations in the document and their frequencies in this and other documents. The document to be tested is represented as a bag of words too and compared with the n-grams in the database. [3]

1.4.6 Citation-based Analysis
This is the only approach that does not rely on text similarity. It examines the citation and reference information in texts to identify similar patterns in the citation sequences. It is not widely used in commercial software, but prototypes of it exist.

1.4.7 Stylometry
Stylometry analyzes only the suspicious document, detecting plagiarized passages through differences in linguistic characteristics.
This method is not accurate on small documents, as it needs to analyze large passages (up to thousands of words per chunk) to extract reliable linguistic properties [4].

Our project uses the Global Similarity and Bag of Words approaches: the system holds a dataset of many scientific papers split into paragraphs, and the input text is likewise split into paragraphs and compared against this large dataset of paragraphs.
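As an illustration of the fingerprinting approach from section 1.4.3, the sketch below hashes every word n-gram and keeps the hashes divisible by p (a common "0 mod p" selection heuristic standing in for random substring selection). This is a minimal sketch, not the exact scheme of [1]; the function names and parameters are illustrative.

```python
import hashlib

def fingerprints(text, n=5, p=4):
    """Select a subset of word n-grams as a document fingerprint (minutiae).

    Every n-gram is hashed; hashes divisible by p are kept. The kept
    hashes can be indexed in a database and compared against queries.
    """
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))]
    hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams]
    return {h for h in hashes if h % p == 0}

def overlap(fp_query, fp_doc):
    """Fraction of the query's minutiae found in a stored fingerprint."""
    return len(fp_query & fp_doc) / max(len(fp_query), 1)
```

Because hashing is deterministic, a verbatim copy of an indexed document produces the same minutiae and a high overlap score.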
Chapter 2 Background Theory

2.1 Linear Algebra Basics
Since we use the Vector Space Model to represent and retrieve text documents, some basic linear algebra is needed.

2.1.1 Vectors
A vector is a geometric object that has a magnitude and a direction, or, algebraically, a mathematical object consisting of ordered values.

1. Representation in 2D and 3D
1) Graphical (geometric) representation: a vector is represented graphically as an arrow in the Cartesian 2D plane or the Cartesian 3D space.

Figure 2.1 A vector in the Cartesian plane, showing the position of a point A with coordinates (2, 3). Source: Wikimedia Commons.

2) Cartesian representation: vectors in an n-dimensional Euclidean space can be represented as coordinate vectors; the endpoint of a vector can be identified with an ordered list of n real numbers (an n-tuple). [5]
2D vector: a = (a_x, a_y)
3D vector: a = (a_x, a_y, a_z)

2. Operations on vectors
1) Scalar product: r*a = (r*a_x, r*a_y, r*a_z)
2) Sum: a + b = (a_x + b_x, a_y + b_y, a_z + b_z)
3) Subtraction: a - b = (a_x - b_x, a_y - b_y, a_z - b_z)
4) Dot product
Algebraic definition: a . b = a_x*b_x + a_y*b_y + a_z*b_z
Geometric definition: a . b = |a| |b| cos θ, where |a| is the magnitude of vector a, |b| is the magnitude of vector b, and θ is the angle between a and b.
The projection of a vector a in the direction of another vector b is given by a_b = a . b̂, where b̂ is the normalized (unit) vector of b.

2.2 Information Retrieval (IR)
Information retrieval can be defined as "the process of finding material of an unstructured nature (usually text) that satisfies an information need, or is relevant to a query, from a large collection of data" [6]. As the definition suggests, IR differs from an ordinary select query in that the retrieved information is unstructured and does not always exactly match the query. Information retrieval methods are used in search engines, in text classification (such as spam filtering), and, in our case, in a plagiarism engine.

1. Vector Space Model (VSM)
The basic idea of VSM is to represent text documents as vectors in a space of term weights.

1) Term Frequency weighting (TF)
The simplest VSM weighting is plain term frequency; all other weighting functions are modifications of it. In TF weighting each text is represented by a vector of d dimensions, where d is the number of terms in the dataset; the value of the vector's nth dimension equals the frequency of the nth term in the document.
For example, assume a dataset of 2 dimensions/terms (play, ground):
Document 1 "play ground" is represented as d1 = (1, 1)
Document 2 "play play" is represented as d2 = (2, 0)
Document 3 "ground" is represented as d3 = (0, 1)
More generally, the weight of word w in document d is defined as weight(w, d) = count(w, d)
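The running play/ground example can be reproduced with sparse, dictionary-based vectors; a small sketch (the function names are illustrative), with dot-product similarity included for the comparison that follows:

```python
from collections import Counter

def tf_vector(text):
    """Term-frequency vector: one dimension per term, value = its count."""
    return Counter(text.split())

def dot_similarity(q, d):
    """Dot product over the terms shared by the two sparse vectors."""
    return sum(q[w] * d[w] for w in q.keys() & d.keys())

d1 = tf_vector("play ground")  # (1, 1) in (play, ground) space
d2 = tf_vector("play play")    # (2, 0)
d3 = tf_vector("ground")       # (0, 1)
```

Terms absent from a document simply do not appear in its dictionary, which is how the engine avoids storing the full (mostly zero) vector for every paragraph.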
The dot product similarity between d1 and d2 = d1 . d2 = (1, 1) . (2, 0) = 1*2 + 1*0 = 2

2) Term Frequency with Inverse Document Frequency weighting (TF-IDF)
Document frequency df(w) is the number of documents that contain the word w. TF-IDF adds an inverse document frequency factor to penalize common terms: since they have a high probability [7] of appearing in any document, they do not strongly indicate plagiarism, unlike less probable terms, which carry more information.
weight(w, d) = count(w, d) * 1/df(w)
So for the above example: df(play) = 2 and df(ground) = 2, and in this case all the weights are scaled by a half.

Figure 2.2 Geometric representation of documents

2. State-of-the-art VSM functions
1) Pivoted Length Normalization [8]
weight(w, d) = ln[1 + ln[1 + count(w, d)]] / (1 - b + b*|d|/avdl) * log((M + 1)/df(w))
Where:
weight(w, d) is the weight of word w in document/paragraph d
count(w, d) is the count of word w in document d, i.e. the term frequency
b ∈ [0, 1] is the document length normalization parameter
|d| is the length of document d
avdl is the average length of the documents in the dataset
M is the number of documents in the dataset
df(w) is the number of documents that contain the word w, i.e. the document frequency

The document length normalization term (1 - b + b*|d|/avdl) linearly penalizes a document whose length is larger than the average document length (avdl), and rewards a document whose length is smaller than the average. The parameter b controls the amount of normalization: if b = 0 there is no normalization at all; if b = 1 the normalization is linear with offset zero and slope 1.

The inverse document frequency (IDF) term log((M + 1)/df(w)) penalizes common terms, as explained above. The IDF is normalized by the number of documents because how common a term is depends not only on its document frequency but also on the size of the dataset: a term that appears in 10 documents out of 100 is much more common than a term that appears in 10 documents out of 1000. The logarithm smooths the IDF weighting, i.e. reduces the variation of the weight when the document frequency varies a lot.

The term frequency (TF) part ln[1 + ln[1 + count(w, d)]] applies a double natural logarithm to achieve a sublinear transformation (smoothing the TF curve) and avoid over-scoring documents with heavily repeated words; the first occurrence of a term should carry the highest weight. Imagine a document with an extremely large frequency of one term: without the sublinear transformation, this document would always score a high similarity against any input query containing that term, even higher than a genuinely more similar document.
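The pivoted length normalization weight translates directly into code; a minimal sketch using the same symbols as the formula above (count, |d|, avdl, df, M, b):

```python
import math

def pivoted_weight(count_wd, doc_len, avdl, df_w, M, b=0.5):
    """Pivoted length normalization weight of word w in document d.

    count_wd: term frequency of w in d; doc_len: |d|; avdl: average
    document length; df_w: document frequency of w; M: number of
    documents; b in [0, 1] controls the length normalization.
    """
    tf = math.log(1 + math.log(1 + count_wd))   # double-log sublinear TF
    norm = 1 - b + b * doc_len / avdl           # length normalization
    idf = math.log((M + 1) / df_w)              # smoothed IDF
    return tf / norm * idf
```

With b = 0 the weight ignores document length entirely, and a term with a large df gets a small IDF factor, matching the behaviour described above.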
2) Okapi BM25 [9]
BM stands for Best Match. The weight is defined as follows:
weight(w, d) = (k + 1)*count(w, d) / (count(w, d) + k*(1 - b + b*|d|/avdl)) * log((M + 1)/df(w))
where all symbols are defined as in Pivoted Length Normalization and k ∈ [0, ∞).
It is similar to Pivoted Length Normalization, but instead of natural logarithms it uses division and the k parameter to achieve the sublinear transformation.
It was originally developed from the probabilistic model; however, it is very similar to the Vector Space Model.

3. Similarity functions
After representing text documents as vectors in the space, we need functions to calculate the similarity (or distance) between any two vectors.

1) Dot product similarity
similarity(q, d) = Σ_{w ∈ q ∩ d} count(w, q) * weight(w, d)
where similarity(q, d) is the similarity score between document d and input query q. The score is simply the sum, over every word appearing in both the document and the query, of the product of its term weights. It is very popular because it is general and can be used with any term weighting.

2) Cosine similarity
similarity(q, d) = (Σ_{w ∈ q ∩ d} count(w, q) * weight(w, d)) / (|q| * |d|)
where |q| is the magnitude of the query vector and |d| is the magnitude of the document vector. It is basically the dot product divided by the product of the lengths of the two vectors, which yields the cosine of the angle between them. This function has built-in document (and query) length normalization.

4. Clustering
Clustering is an unsupervised machine learning method (unsupervised because the data are not labeled) and a powerful data mining technique: the process of grouping similar objects together. It can theoretically speed up the information retrieval process by a factor of K, where K is the number of clusters. This is achieved by clustering similar paragraphs together: each new query is first compared with the centroids of the clusters, and then only with the paragraphs of the closest cluster, which is much faster than measuring the similarity of the query against every paragraph in the dataset.
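The BM25 weighting and cosine similarity described above can be sketched over the same sparse dictionary vectors. This is a simplified illustration (the real engine stores the weights in the database); function names are illustrative.

```python
import math

def bm25_weight(count_wd, doc_len, avdl, df_w, M, k=1.2, b=0.75):
    """Okapi BM25 weight of word w in document d; k >= 0, b in [0, 1].

    The division by (count + k * length-norm) saturates the TF part
    below k + 1 instead of using nested logarithms.
    """
    tf = (k + 1) * count_wd / (count_wd + k * (1 - b + b * doc_len / avdl))
    idf = math.log((M + 1) / df_w)
    return tf * idf

def cosine_similarity(q, d):
    """Dot product of two sparse vectors divided by the product of magnitudes."""
    dot = sum(q[w] * d[w] for w in q.keys() & d.keys())
    mag = math.sqrt(sum(v * v for v in q.values())) * \
          math.sqrt(sum(v * v for v in d.values()))
    return dot / mag if mag else 0.0
```

Note how even an extreme term frequency cannot push the BM25 TF factor above k + 1, which is the saturation behaviour the k parameter provides.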
We use the K-means algorithm (centroid-based clustering), an iterative improvement algorithm that groups the dataset into a predefined number of clusters K. It goes like this [10]:
1. Select K random points from the dataset as the initial guess of the centroids (cluster centers).
2. Assign each record in the dataset to the closest centroid based on a given similarity function.
3. Move each centroid closer to the points assigned to it by calculating the mean of the points in the cluster.
4. If a local optimum is reached (i.e. the centroids stopped moving), stop; otherwise repeat from step 2.
Since K-means is sensitive to the initial choice of centroids and can get stuck in local optima, we repeat it with different initial centroids and keep the best result (the one with the least mean squared error).
The time complexity is O(tknd), where t is the number of iterations until convergence, k is the number of clusters, n is the number of records in the dataset, and d is the number of dimensions. Since usually tk ≪ n, the algorithm is considered linear-time.
Note: this is the typical time complexity when the algorithm is applied to a dataset with a negligible number of dimensions. In our case the number of dimensions is very large (all terms in the dataset), but fortunately, for each centroid or paragraph we iterate only over the terms that appear in it, not over all dimensions, so the time complexity differs, as discussed in detail in the implementation section.

2.3 Regular Expression
A regular expression is a sequence of characters and symbols written to detect patterns, where each symbol in the sequence has a meaning, e.g.: + means one or more, * means zero or more, and - means a range (A-Z: all capital letters from A to Z).
For example, a birth date could be written as "May 15th, 1993", so the pattern of the date is [Month] [day][st|nd|rd|th], [year], and a regular expression for it is [A-Za-z]{3,9} [0-9]{1,2}(st|nd|rd|th), [0-9]{4}
First comes the month, one of 12 fixed words; they could be written out explicitly, or treated simply as a sequence of 3 to 9 letters, since the shortest month name (May) has 3 letters and the longest (September) has 9. Then a space, then the day as a number of 1 or 2 digits followed by one of the four suffixes (st, nd, rd, th), then a comma and a space, then the year as a number of 4 digits. This is not the only date format, so a full date expression could be more complicated than this.
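The date pattern above can be tried out directly with Python's re module; the sketch below uses a non-capturing group for the ordinal suffix and includes the comma that separates day and year in the "May 15th, 1993" example:

```python
import re

# 3-9 letter month, 1-2 digit day with ordinal suffix, comma, 4-digit year.
DATE_RE = re.compile(r"[A-Za-z]{3,9} [0-9]{1,2}(?:st|nd|rd|th), [0-9]{4}")

assert DATE_RE.fullmatch("May 15th, 1993")
assert DATE_RE.fullmatch("September 1st, 2007")
assert DATE_RE.fullmatch("May 15, 1993") is None  # ordinal suffix required
```

The same compile-then-match pattern is how the Parser's publisher-specific regular expressions are applied to the header blocks.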
2.4 NLTK Toolkit
NLTK is a Python module for Natural Language Processing (NLP) used for text processing. It has algorithms for sentence and word tokenization, contains a large number of corpora, ships its own WordNet corpus, and is used for Part-of-Speech (POS) tagging, stemming, and lemmatization.

2.5 Node.js
Node.js is a runtime environment built on Chrome's V8 JavaScript engine for developing server-side web applications. It uses an event-driven, non-blocking I/O model.

2.6 Express.js
Express.js is a web application server framework for Node.js and its de facto standard. It is a very thin layer, with many features available as plugins.

2.7 Socket.io
Socket.io is a library for real-time web applications. It enables bi-directional communication between the web client and the server, primarily using the WebSocket protocol with polling as a fallback option.

2.8 Languages Used
1. Java
2. Python
3. SQL
4. JavaScript
5. HTML & CSS
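The preprocessing stages that NLTK provides (section 2.4) can be mimicked in a few lines of plain Python; a simplified stand-in, since the real NLTK pipeline needs its corpora downloaded. The stop-word list here is a small sample, and lowercasing stands in for real lemmatization (e.g. NLTK's WordNetLemmatizer):

```python
import re

# Sample stop-word list; the engine uses a much longer one (Figure 4.41).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    """Simplified stand-in for the NLTK pipeline: tokenize on letters,
    lowercase, drop punctuation and stop words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The surviving tokens are what the engine turns into k-gram terms and vectors in the chapters that follow.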
Chapter 3 Design and Architecture

3.1 Extract, Transform and Load (ETL)
In this part we build the database: many scientific papers are downloaded using the Crawler software (Extract), then passed to the Parser software, where all the paper information and text content are extracted as paragraphs (Transform) and inserted into the database (Load).

3.2 Plagiarism Engine
The plagiarism engine preprocesses a huge dataset of academic English papers and analyzes it using natural language processing techniques to extract useful information, then measures the similarity between an input query and the dataset using information retrieval methods, detecting both identical and paraphrased plagiarism in a fast and intelligent way.

Figure 3.1 High level block diagram
Figure 3.2 Detailed block diagram of the Plagiarism Engine
3.2.1 Natural Language Processing, k-gram Generation, and Vectorization
The text processing part works on this data to extract the most important words from the paragraphs and ignore the common words; k-gram terms are then generated from these words, and each bag of words is linked to its corresponding paragraph in the database.

3.2.2 Semantic Analysis (Vector Space Model Representation)
Input: simple term frequency vector representation, stored in the paragraphVector table.
Output: dataset statistics (number of paragraphs, number of terms, average paragraph length) stored in the dataSetInfo table; document frequency (for each term, the number of paragraphs in which it appears) stored in the IDF column of the term table; and pivoted length normalization and BM25 vector weights.
In this part we calculate a more sophisticated vector representation of our text corpus than plain term frequency: a normalized TF-IDF representation using both pivoted length normalization and BM25, as discussed later.

3.2.3 Calculating Similarity
Input: vectorized input paragraphs stored in the inputPargraphVector table, and BM25 or pivoted length normalization weights in the BM25 and pivotNorm columns of the paragraphVector table.
Output: similarity between each input paragraph and the relevant dataset paragraphs, stored in the similarity table.
We check the similarity between the input paragraph and the paragraphs in the dataset, and report possible plagiarism if the similarity between the input paragraph and any dataset paragraph exceeds a predetermined threshold. We implemented both Okapi BM25 and pivoted length normalization similarity functions. The system first measures similarity on 5-gram vectors, then 4-grams, and so on; whenever it finds high similarity at one k-gram level, it limits its scope to the paragraphs that scored high at the preceding k-gram levels, to increase performance.
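The k-gram generation of section 3.2.1 amounts to sliding a window of k words over each processed paragraph; a minimal sketch:

```python
def k_grams(tokens, k):
    """All contiguous k-word terms of a token list (empty if too short)."""
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

tokens = "plagiarism detection engine for papers".split()
```

Running this for k = 1 through 5 on each paragraph yields the term sets whose frequencies populate the paragraph vectors.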
3.2.4 Clustering
Input: paragraph vectors with BM25 (or pivoted length normalization) weights, stored in the paragraphVector table.
Output: the cluster of each paragraph, stored in the clusterId column of the paragraph table, and the centroids of the clusters, stored in the centroid table.
We cluster similar paragraphs together so that similarity is measured only against similar paragraphs, speeding up the similarity checking step. An input paragraph is first measured against the centroids to determine its cluster, and then the regular similarity measure is applied against all dataset paragraphs in that cluster.

3.2.5 Communicating Results
This is the interface where the user checks a document for plagiarism: the document is entered in a text box and parsed in a similar way to the system's Parser, being split into paragraphs that pass through the text processing part and are compared with the dataset in the database. The results appear as a plagiarism percentage for the document, with the plagiarized parts highlighted alongside the matching documents.
Chapter 4 Implementation

4.1 Extract, Transform and Load (ETL)

Figure 4.1 Overview for the Crawler and Parser

4.1.1 The Crawler
The Crawler is a piece of software that downloads scientific papers from the web into a folder per publisher, where the Parser starts working on them.

4.1.2 The Parser
The Parser is a piece of software that takes a PDF document (a scientific paper) as input, extracts the paper's information and content, and inserts them into the system database.

4.1.3 The Data Extracted from the Paper
a. Paper information
1. Paper title
2. Paper authors
3. Journal and its ISSN
4. Volume, issue, paper date, and other dates (Accepted, Received, Revised, Published)
5. DOI (Digital Object Identifier) or PII (Publisher Item Identifier)
6. Starting page and ending page
b. Abstract and keywords
c. Table of contents
d. Figure and table captions
e. Paper text content (as paragraphs)
4.1.4 The Parser Implementation

Figure 4.2 UML of the Parser Application

4.1.5 How it works

Figure 4.3 The Flow Chart of the Parser

The Parser consists of a parent class (Parser) and children classes (IEEE, Springer, APEM, and Science Direct). The parent class has the general functions that parse the PDF document and extract the table of contents, the figure and table captions, and the text content of the paper; the children classes
have specific functions and regular expressions for each publisher's structure to extract the paper information (title, authors, DOI, ...). Each publisher has its own folder where its scientific papers are downloaded by the Crawler; the Parser monitors each folder for new documents and uses the suitable child class to parse each new document found and extract all the needed information and data.
If the paper information and content are extracted completely, the file is moved to the Processed directory; otherwise, it is moved to the Unprocessed directory and the error is logged, so the developer can check whether it is a new structure to be supported in the Parser, or whether something went wrong that he has to fix.

4.1.6 Steps of Parsing
4.1.6.1 Extracting the Text from the PDF file (extractBlocks function)
The Parser uses the PDFxStream Java library, which extracts the text from the PDF file as blocks of strings. This function loops over the file page by page; for each page it extracts the content into an ArrayList<String> object called page and adds this page, together with the page number, to a HashMap<Integer, ArrayList<String>> object called pages.

Figure 4.5 The First Page of an IEEE Paper (as Blocks)

public void parsePaper(String publisher) throws Exception {
    extractBlocks();
    try { parseFirstPage(); }
    catch (Exception e) { throw new Exception("Error Not Processed"); }
    parserOtherPages();
    paper.enhaceParagraphs();
    try { paper.insertPaperInDatabase(publisher); }
    catch (SQLException e) { throw new Exception("Error Database"); }
}
Figure 4.4 The main function of Parsing
4.1.6.2 Extracting the Paper Information (parseFirstPage function)
Each publisher accepts its scientific papers in a specific structure that differs from publisher to publisher; the differences lie in the first page, where the paper information is written. A parser therefore has to be designed to support each publisher's structure, which is why this function, abstract in the parent class, is implemented in each publisher's child class.

Figure 4.6 First Page of a Science Direct Paper
Figure 4.7 First Page of a Springer Paper

These are different structures for Science Direct and Springer, showing the difference in the organization and layout of the information, e.g.:
1. This is a header of a Springer paper:
Kong et al. EURASIP Journal on Advances in Signal Processing 2014, 2014:44
http://asp.eurasipjournals.com/content/2014/1/44
2. This is a header of an IEEE paper:
IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93

4.1.6.3 Extracting the Paper Text Content (parserOtherPages function)
This function uses the general Parser functions: it loops over all the pages and over the blocks of strings in each page, and extracts from the blocks data that could be table-of-contents headers, figure and table captions, lists, and paragraphs. Each block passes through several stages:
1) First, test whether the block is a figure caption.
2) Then test whether it is a table caption.
3) Then test whether it contains a header (table of contents).
4) Then test whether the block contains lists (numeric, dash, or dot).
In the 3rd stage, if there are headers in the block, they are extracted and the rest of the block is returned to the function, which continues with the remaining stages.

void parserOtherPages() {
    for (Entry<Integer, ArrayList<String>> entrySet : pages.entrySet()) {
        Integer pageNumber = entrySet.getKey();
        ArrayList<String> page = entrySet.getValue();
        Iterator<String> it = page.iterator();
        while (it.hasNext()) {
            String block = it.next().trim();
            boolean isFigureCaption = false, isTableCaption = false;
            boolean isList = false, isEmptyParagraph = false;
            isFigureCaption = parseFigureCaption(block, pageNumber);
            isTableCaption = parseTableCaption(block, pageNumber);
            block = parseHeaders(block);
            isList = parseLists(block, pageNumber);
            isEmptyParagraph = "".equals(block);
            if (!isFigureCaption && !isTableCaption && !isEmptyParagraph && !isList)
                parseParagraph(block, pageNumber);
        }
    }
}
Figure 4.8 The function of parserOtherPages
5) Finally, if the block is none of the previous types (not a figure or table caption, contains no list, or its header has been extracted and the rest of the block returned), it is a paragraph and is extracted as such.
4.1.6.4 Enhancing the Paragraphs (enhanceParagraph Function)
As shown in Figure 4.9, some paragraphs are not in good shape when first extracted:
1) Some words may be split across two lines with a hyphen, so they have to be rejoined; there may also be extra spaces between words that have to be removed.
2) The paragraph is extracted as separate lines (each ending with a newline character) rather than one continuous string, so it has to be refined.
3) Some of the paper information is in uppercase, so it is capitalized.
Figure 4.9 Block of String before Enhancing
Page Number: 1 The Content: However, as the number of metal layers increases and interconnect dimensions decrease, the parasitic capacitance increases associated with fill metal have become more significant, which can lead to timing and signal integrity problems.
Page Number: 1 The Content: Previous research has primarily focused on two important aspects of fill metal realization: 1) the development of fill metal generation methods – which we discuss further in Section II and 2) the modeling and analysis of capacitance increases due to fill metal – Several studies have examined the parasitic capacitance associated with fill metal for small scale interconnect test structures in order to provide general guidelines on fill metal geometry selection and placement. For large-scale designs,
Figure 4.10 The Paragraphs after Enhancing
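The cleanup steps listed above can be sketched with a few regular-expression passes. The class name and the exact regexes below are illustrative assumptions, not the project's actual enhanceParagraph implementation:

```java
// Sketch of the paragraph-enhancement steps described above.
// Class name and regexes are assumptions, not the project's actual code.
public class ParagraphEnhancer {
    public static String enhance(String raw) {
        // 1) Rejoin words split across lines with a hyphen: "capaci-\ntance" -> "capacitance"
        String text = raw.replaceAll("-\\s*\\r?\\n\\s*", "");
        // 2) Turn the remaining line breaks into spaces so the paragraph is one string
        text = text.replaceAll("\\r?\\n", " ");
        // 3) Collapse runs of spaces into a single space
        return text.replaceAll("[ ]{2,}", " ").trim();
    }
}
```

Running the sketch on a broken extract such as "the parasitic capaci-\ntance  increases" yields the continuous string shown in Figure 4.10.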
4.1.6.5 Finally Inserting All These Data in the Database
When the parser starts, an object of type Paper is created, and every piece of information and data extracted from the scientific paper is assigned to its attribute in this object. At the end of parsing, all this information and data are inserted into the database by calling this function:
1) Retrieve the journal ID from the database by its name or ISSN. If the journal is already present, its ID is returned; otherwise it is considered a new journal, inserted, and its new ID is returned.
2) Test whether the paper has already been inserted into the database. If it has, the parser throws an exception stating that it was inserted before; if it is a new paper, it is inserted with its information (title, volume, issue, ...) and the paper ID is returned.
3) With the paper ID, the rest of the data is inserted (authors, keywords, table of contents, figure captions, table captions, and the text content of the paper, i.e. the paragraphs).
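The get-or-insert logic of step 1 and the duplicate check of step 2 can be sketched without a real database. The in-memory maps below are a hypothetical stand-in for the journals and papers tables, and all names here are illustrative, not the project's schema:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the journals/papers tables, illustrating the
// lookup-or-insert and duplicate-check logic described above (not project code).
public class PaperInserter {
    private final Map<String, Integer> journals = new HashMap<>(); // name -> ID
    private final Map<String, Integer> papers = new HashMap<>();   // title -> ID
    private int nextJournalId = 1, nextPaperId = 1;

    // Step 1: return the journal's ID, inserting it first if it is new.
    public int getOrInsertJournal(String name) {
        Integer id = journals.get(name);
        if (id != null) return id;
        journals.put(name, nextJournalId);
        return nextJournalId++;
    }

    // Step 2: insert the paper, or throw if it was inserted before.
    public int insertPaper(String title, int journalId) {
        if (papers.containsKey(title))
            throw new IllegalStateException("Paper already inserted: " + title);
        papers.put(title, nextPaperId);
        return nextPaperId++;
    }
}
```

In the real system the maps would be SELECT/INSERT statements against the relational database, but the control flow is the same.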
4.1.7 The Paper Class
This class works as a structure for the paper. It has the attributes that hold the information and data of the paper, together with the function enhanceParagraphs(), which is responsible for improving the text and preparing the paragraphs for the next processing step in the Natural Language Processing part, and the function insertPaperInDatabase(), which is responsible for testing whether the paper was already inserted into the database, and if it is a new

public class Paper {
    public String title = "";
    public int volume = -1, issue = -1;
    public int startingPage = -1, endingPage = -1;
    public String journal = "", ISSN = "";
    public String DOI = "";
    public ArrayList<String> headers = new ArrayList<>();
    public ArrayList<String> authors = new ArrayList<>();
    public ArrayList<String> keywords = new ArrayList<>();
    public ArrayList<Paragraph> figureCaptions = new ArrayList<>();
    public ArrayList<Paragraph> tableCaptions = new ArrayList<>();
    public ArrayList<Paragraph> paragraphs = new ArrayList<>();
    public String date = "";
    public String dateReceived = "NULL", dateRevised = "NULL";
    public String dateAccepted = "NULL", dateOnlinePublishing = "NULL";

    public void enhanceParagraphs() { /* ... */ }
    public void insertPaperInDatabase(String publisher) { /* ... */ }
}
Figure 4.11 The Paper Structure
paper, it is inserted with all of its data: the paragraphs, the figure and table captions, and the paper information.
Note that when the parser finds a new PDF document in the publishers' folders, it creates a new object of type Paper, and while parsing the document, each piece of extracted information is assigned to its attribute in this object. At the end of the parsing process, the Paper object executes its two member functions: enhanceParagraphs() to refine the paragraph content, then insertPaperInDatabase() to insert all the data into the database.
4.1.8 The Paragraph Structure
As shown in Figure 4.12, the paragraph structure is very simple: it contains the content of the extracted paragraph and the number of the page from which it was extracted.

public class Paragraph {
    public int pageNum;
    public String content;
}
Figure 4.12 The Paragraph Structure

4.1.9 Parsing the First Page in Detail (ex: an IEEE Paper)
As shown in Figure 4.13, the page is divided into blocks of string, and each block contains one or more pieces of the paper information. This function in the parser is implemented specifically for each publisher, so the function of the IEEE parser will not work for the Springer parser and so on; it parses only the first page, extracting the paper information found there and assigning it to the attributes of the Paper object.
Even within a single publisher there are differences in the location of the paper information on the page, and the structure changes over time; for example, the IEEE parser supports 8 different forms of the paper header.
Figure 4.13 Different Forms of an IEEE Top Header
Figure 4.14 Blocks to be extracted from the first page of an IEEE Paper
4.1.9.1 Parsing the Paper Header
The parseFirstPage() function starts by parsing the header of the paper, which is the first block in the paper. The block is sent to the parsePaperHeader() function, which has a different regular expression for every header form that the parser supports, as shown in Figure 4.15. When the function receives the block, the block passes through the different expressions; if it matches one of the supported formats, the function starts extracting the information, otherwise it throws an exception stating that this header format is not supported and has to be added by the developer.
As shown in Figure 4.13, the header may contain information such as the starting page number (at the start or at the end of the line), the journal title, the volume number, the issue number, and the date. Since this information may or may not be present in the header, the suitable functions (parsePaperDate(), parseVolume(), parseIssue(), parseJournal(), parseStartingPage()) are called according to the format of the header to extract it.

// ex: Chang et al. VOL. 1, NO. 4/SEPTEMBER 2009/ J. OPT. COMMUN. NETW. C35
String header1_Exp = "^([A-Z]+ ET AL. " + volume_Exp + ", " + issue_Exp
        + "[ ]*/[ ]*" + paperDate_Exp + "[ ]*/[ ]*" + journalTitle_Exp + " [A-Z0-9]+)$";

// ex: 594 J. OPT. COMMUN. NETW. /VOL. 1, NO. 7/DECEMBER 2009 Lim et al.
String header2_Exp = "^([A-Z0-9]+ " + journalTitle_Exp + "/" + volume_Exp
        + ", " + issue_Exp + "/" + paperDate_Exp + " [A-Z]+ ET AL.)$";

// ex: IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93
// ex: 93 IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007
// ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May-June 2008
// ex: 22 IEEE TRANSACTIONS ON MAGNETICS, VOL. 5, NO. 1, May/June 2008
// ex: 93 IEEE TRANSACTIONS ON MAGNETICS Vol. 13, No. 6; December 2006
String header3_Exp = "^(([0-9]+ )*" + journalTitle_Exp + "(,)* " + volume_Exp
        + ", " + issue_Exp + "(,|;) " + paperDate_Exp + "( [0-9]+)*)$";

// ex: 598 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING
String header4_Exp = "^([0-9]+ " + journalTitle_Exp + ")$";

// ex: IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING 598
String header5_Exp = "^(" + journalTitle_Exp + " [0-9]+)$";

// ex: 1956 lRE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES 75
String header6_Exp = "^([0-9]{4} " + journalTitle_Exp + "[0-9]+)$";

// ex: 112 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS May
String header7_Exp = "^([0-9]+ " + journalTitle_Exp + "[A-Z]{3,9})$";

// ex: SUPPLEMENT TO IEEE TRANSACTIONS ON AEROSPACE / JUNE 1965
String header8_Exp = "^(" + journalTitle_Exp + "[ ]*/[ ]*" + dateExp + ")$";
Figure 4.15 The Supported Regexes of the IEEE Header Formats
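To show how one of these forms matches end to end, header3_Exp can be composed from the sub-expressions quoted elsewhere in this chapter. This is an illustrative, self-contained sketch, not the project's class: it assumes paperDate_Exp is the date expression given in the date section, and the dash in the date expression's character class is moved next to the brackets so the class stays a literal set in Java:

```java
import java.util.regex.Pattern;

// Illustrative composition of header3_Exp from the sub-expressions quoted
// in the text (assumptions noted in the lead-in; not the project's code).
public class HeaderFormDemo {
    static final String volume_Exp = "VOL(.)* [A-Z-]*[0-9]+";
    static final String issue_Exp = "NO(.|,) [0-9]+";
    static final String journalTitle_Exp = "[A-Z :-—/)(.,]+";
    // Dash placed so "[ /-]" is a literal character set, not a range.
    static final String paperDate_Exp =
            "[A-Z]{0,9}[ /-]*[A-Z]{3,9}(.)*( [0-9]{1,2}(,)*)* [0-9]{4}";
    static final String header3_Exp = "^(([0-9]+ )*" + journalTitle_Exp + "(,)* "
            + volume_Exp + ", " + issue_Exp + "(,|;) " + paperDate_Exp + "( [0-9]+)*)$";

    static boolean matchesHeader3(String header) {
        return Pattern.compile(header3_Exp).matcher(header).matches();
    }
}
```

With this composition, the example header "IEEE TRANSACTIONS ON MAGNETICS, VOL. 43, NO. 1, JANUARY 2007 93" is accepted, while ordinary paragraph text is not.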
4.1.9.2 Extracting the Volume from the Header
The IEEE parser uses volume_Exp = VOL(.)* [A-Z-]*[0-9]+ to detect the volume part of the header and passes it to the parseVolume() function, which then uses another expression to extract the number from this part. For example, from a header containing (VOL. 18) the parser first detects that part, then detects the number in it (18), then converts it from String to int and assigns it to the volume attribute of the Paper object.
4.1.9.3 Extracting the Issue Number from the Header
The IEEE parser uses issue_Exp = NO(.|,) [0-9]+ to detect the issue part of the header and passes it to the parseIssue() function, which then uses another expression to extract the number from this part. For the same example presented in the volume section, the parser detects the part (NO. 3), then detects the number in this result (3), then converts it from String to int and assigns it to the issue attribute of the Paper object.
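The two-stage idea described above (first match the volume or issue part, then pull the number out of the match) can be exercised in a self-contained form. The class and method names here are illustrative; only the two expressions are taken from the text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Self-contained demonstration of the two-stage extraction described above,
// using the volume_Exp and issue_Exp expressions quoted in the text.
public class HeaderNumberDemo {
    static final String volume_Exp = "VOL(.)* [A-Z-]*[0-9]+";
    static final String issue_Exp = "NO(.|,) [0-9]+";

    // Stage 1: find the part (e.g. "VOL. 43"); stage 2: extract the number.
    static int extract(String header, String partExp) {
        Matcher part = Pattern.compile(partExp).matcher(header);
        if (part.find()) {
            Matcher num = Pattern.compile("[0-9]+").matcher(part.group());
            if (num.find())
                return Integer.parseInt(num.group());
        }
        return -1; // same default value the Paper attributes start with
    }
}
```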
4.1.9.4 Extracting the Paper Date from the Header

@Override
void parseVolume(String volume) {
    Matcher matcher = Pattern.compile(volume_Exp).matcher(volume);
    if (matcher.find()) {
        Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
        while (numMatcher.find())
            paper.volume = Integer.parseInt(numMatcher.group());
    }
}
Figure 4.16 The Function of Extracting the Volume Number

@Override
void parseIssue(String issue) {
    Matcher matcher = Pattern.compile(issue_Exp).matcher(issue);
    if (matcher.find()) {
        Matcher numMatcher = Pattern.compile("[0-9]+").matcher(matcher.group());
        if (numMatcher.find())
            paper.issue = Integer.parseInt(numMatcher.group());
    }
}
Figure 4.17 The Function of Extracting the Issue Number

@Override
void parsePaperDate(String date) {
    Matcher matcher = Pattern.compile(paperDate_Exp).matcher(date.trim());
    if (matcher.find())
        paper.date = matcher.group().replaceAll("^/", "").trim();
}
Figure 4.18 The Function of Extracting the Paper Date
Like the other parts of the header, the IEEE parser uses date_Exp = [A-Z]{0,9}[/- ]*[A-Z]{3,9}(.)*( [0-9]{1,2}(,)*)* [0-9]{4} to extract the date part of the header and assign it to the date attribute of the Paper object. The date can be written in different formats (2016, March 2016, May/June 2016, May-June 2016), and the expression is written to detect all of these forms.
Note that after each piece of information is extracted, it is removed from the header string, so after removing the volume, issue, and date, the information left in the header is the journal title and the starting page, and the starting page can be at the start or at the end of the header.
4.1.9.5 Extracting the Start and End Page Numbers from the Header
Now the header contains only the journal title and the start page number, so the IEEE parser uses startPage_Exp = ^[0-9]+|[0-9]+$, an expression that matches a number at the start or at the end of the checked string; whether the start page number lies at the start or at the end of the header, it is detected and extracted, and like the other information it is assigned to its attribute in the Paper object.
For the end page it is very simple: the IEEE parser adds the number of pages of the paper to the start page number and assigns the result to the endingPage attribute of the Paper object.
4.1.9.6 Extracting the Journal Title from the Header
Finally, for the journal title, the IEEE parser uses journalTitle_Exp = [A-Z :-—/)(.,]+ to extract the journal title part of the header, then passes the title to the parseJournalTitle() function. The title may contain extra words that are not needed, such as ([author name] et al.)
or it may end with separating characters (a comma or a forward slash), so these must be removed first; the rest is then assigned to the journal attribute of the Paper object.

@Override
void parseStartingPage(String startingPage) {
    Matcher matcher = Pattern.compile(startingPage_Exp).matcher(startingPage);
    if (matcher.find())
        paper.startingPage = Integer.parseInt(matcher.group().trim());
    parseEndingPage(startingPage);
}

@Override
void parseEndingPage(String endingPage) {
    paper.endingPage = paper.startingPage + pages.size();
}
Figure 4.19 The Function of Extracting the Start and End Pages
4.1.9.7 Parsing the Rest of the First Page's Blocks

@Override
void parseJournal(String journal) {
    journal = journal.replaceAll("( /|, )", "").trim();
    Matcher matcher = Pattern.compile(journalTitle_Exp).matcher(journal);
    if (matcher.find()) {
        String journalName = matcher.group().replaceAll("[A-Z ]+ ET AL.", "");
        if (journalName.charAt(journalName.length()-1) == '/')
            paper.journal = journalName.substring(0, journalName.length()-1);
        else
            paper.journal = journalName;
    }
}
Figure 4.20 The Function of Extracting the Journal Title

Iterator<String> it = pageOne.iterator();
while (it.hasNext()) {
    String mainBlock = it.next();
    String block = mainBlock.replaceAll("[ ]+", " ");
    if (Pattern.compile(IEEE_DOI_Exp + "|" + PII_Exp).matcher(block).find()) {
        parseDOI(block);
        blockList.add(mainBlock);
    }
    if (ISSN_Pattern.matcher(block).find()) {
        parseISSN(block);
        blockList.add(mainBlock);
    }
    if (Pattern.compile("Index Terms").matcher(block).find()) {
        parseKeywords(block);
        blockList.add(mainBlock);
    }
    if (Pattern.compile("(Abstract|ABSTRACT|Summary)").matcher(block).find()) {
        parseAbstract(block);
        blockList.add(mainBlock);
    }
    if (date_Pattern.matcher(block.toUpperCase()).find() && !datesFound) {
        parseDates(block);
        if (!paper.dateAccepted.equals("NULL")
                || !paper.dateOnlinePublishing.equals("NULL")
                || !paper.dateReceived.equals("NULL")
                || !paper.dateRevised.equals("NULL")) {
            blockList.add(mainBlock);
            datesFound = true;
        }
    }
}
removeUnimportantBlocks();
for (String blockList1 : blockList)
    pageOne.remove(blockList1);
Figure 4.21 Parsing the Rest of the Blocks in the First Page
After parsing the header block and extracting all the information from it, the IEEE parser continues to parse the other blocks, searching for the rest of the information. Due to the differences in structure, the location of this information can differ from structure to structure, so the best way to extract it is to loop through all the first-page blocks: using the regular expressions for this information (DOI, ISSN, ...) the parser can locate it, and it also tries to detect some other blocks such as the abstract, keywords, nomenclature, and the paper dates (when it was received, accepted, revised, and published online).
In every iteration, if a piece of information is detected, the block is passed to the suitable function to extract it. Once the information has been extracted, the block is no longer needed, so the parser adds it to a TreeSet<String> (blockList); after all the iterations over the blocks of page one are finished, these blocks are removed from the page.
There may also be other blocks that hold no important information, such as the publisher's website or the publisher's logo with its name underneath; these all have to be detected and removed as well, using the function removeUnimportantBlocks().
4.1.9.8 Extracting the DOI or the PII
In the loop, if a block is detected to contain the DOI (Digital Object Identifier) or the PII (Publisher Item Identifier) using IEEE_DOI_Exp = [0-9]{2}.[0-9]{4}/[A-Z-]+.[0-9]+.[0-9]+ or PII_Exp = [0-9]{4}-[0-9xX]{4}([0-9]{2})[0-9]{5}-(x|X|[0-9]), the IEEE parser passes the block to the parseDOI() function, and the DOI or PII is extracted. If it is the DOI, it is concatenated with the DOI domain for papers (http://dx.doi.org/); if it is the PII, it is concatenated with (http://dx.doi.org/10.1109/S); the result is assigned to the DOI attribute of the Paper object.
4.1.9.9 Extracting the ISSN

@Override
void parseDOI(String DOI) {
    Matcher matcher = Pattern.compile(IEEE_DOI_Exp).matcher(DOI);
    while (matcher.find())
        paper.DOI = "http://dx.doi.org/" + matcher.group();
    matcher = Pattern.compile(PII_Exp).matcher(DOI);
    while (matcher.find())
        paper.DOI = "http://dx.doi.org/10.1109/S" + matcher.group();
}
Figure 4.22 The Function of Extracting the DOI and PII

void parseISSN(String ISSN) {
    Matcher matcher = ISSN_Pattern.matcher(ISSN);
    while (matcher.find())
        paper.ISSN = matcher.group().replaceAll("(–|-|‐)", "-");
}
Figure 4.23 The Function of Extracting the ISSN
Also, if a block is detected to contain the ISSN using ISSN_Exp = [0-9]{4}(–|-|‐| )[0-9]{3}[0-9xX], the IEEE parser passes the block to the parseISSN() function, which extracts the ISSN and assigns it to the ISSN attribute of the Paper object.
4.1.9.10 Extracting the Dates of the Paper
If a block in the iteration is detected to contain dates using date_Exp = (([0-9]{1,2}(-| )[A-Z]{3,9}[.]*(-| )[0-9]{4})|[A-Z]{3,9}[.]*( [0-9]{1,2},)* [0-9]{4}), the IEEE parser passes the block to the parseDates() function. This block contains dates related to the paper, such as when it was received by the publisher and when it was revised,

void parseDates(String dates) {
    dates = dates.replaceAll(separatedWord_Fixing, "")
            .replaceAll(newLine_Removal, " ").toUpperCase();
    Matcher matcher = receivedDate_Pattern.matcher(dates);
    while (matcher.find()) {
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while (dateMatcher.find())
            paper.dateReceived = dateMatcher.group().trim();
    }
    matcher = revisedDate_Pattern.matcher(dates);
    while (matcher.find()) {
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while (dateMatcher.find())
            paper.dateRevised = dateMatcher.group().trim();
    }
    matcher = acceptedDate_Pattern.matcher(dates);
    while (matcher.find()) {
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while (dateMatcher.find())
            paper.dateAccepted = dateMatcher.group().trim();
    }
    matcher = publishingDate_Pattern.matcher(dates);
    while (matcher.find()) {
        String stMatch = matcher.group();
        Matcher dateMatcher = Pattern.compile(dateExp).matcher(stMatch);
        while (dateMatcher.find())
            paper.dateOnlinePublishing = dateMatcher.group().trim();
    }
}
Figure 4.24 The Function of Extracting the Paper Dates
accepted, and published online; for each of those dates there is a regular expression to detect it. Note that not all papers include these dates, but most of them do, so they are extracted if present and assigned to their attributes in the Paper object.
The dates can be written in many formats: (30 OCTOBER 2007), (17 AUG. 2007), (28-JULY-2009), (OCTOBER 6, 2006), so the regular expression for the date itself is complicated, as it has to detect all of these date formats.
Also, the word before the date can be written in different forms, (Received), (Received:), (Revised), (Revised:), or (Received in revised form), and may be lowercase or capitalized, so the regular expressions are constructed to detect all forms of those words; for the character case, the string is transformed to uppercase before comparison.
4.1.9.11 Extracting the Keywords
If the keywords block is detected, the IEEE parser passes it to the parseKeywords() function. The keywords may be found inside the abstract block, so the first step is to crop the keywords part if it appears together with the abstract. The block may also be split over two lines, or contain a word split over two lines with a hyphen, so these have to be removed and fixed. After that, since some papers separate the keywords with a comma (,) and others with a semicolon (;), the block is split on both, and the split keywords are added to the keywords list in the Paper object.
4.1.9.12 Extracting the Abstract
For the abstract block, during the iteration in the parseFirstPage() procedure, if one of the blocks matches the word Abstract or Summary, this block is passed to the parseAbstract() function and is considered the first paragraph in the page, with the header Abstract.
In some cases the abstract may contain other information, such as the keywords or the nomenclature, so these have to be cropped first and parsed separately.

@Override
void parseKeywords(String keywords) {
    keywords = keywords.substring(keywords.indexOf("Index Terms"));
    String indexTerms_Removal = "-\\r\\n|Index Terms|-";
    keywords = keywords.replaceAll(indexTerms_Removal, "");
    String[] splitted = keywords.replaceAll(newLine_Removal, " ").split(",|;");
    for (int i = 0; i < splitted.length; i++)
        paper.keywords.add(splitted[i].trim());
}
Figure 4.25 The Function of Extracting the Keywords
4.1.9.13 Extracting the Title and Authors
After all the paper information has been extracted and those blocks removed, the next blocks contain the title of the paper, then the authors, then the introduction.
First, the title is passed to the parseTitle() procedure: if it is split over more than one line, the newline characters are removed, and it is assigned to the title attribute.
Next, the authors are passed to the parseAuthors() procedure, where they are separated, by comma, semicolon, or some other separator depending on the publisher's style, and each author is added to the authors list in the Paper object.

@Override
void parseAbstract(String abstractContent) {
    int indexOfIndexTerms = abstractContent.indexOf("Index Terms");
    if (indexOfIndexTerms != -1)
        abstractContent = abstractContent.substring(0, indexOfIndexTerms);
    int indexOfNomenclature = abstractContent.indexOf("NOMENCLATURE");
    if (indexOfNomenclature != -1)
        abstractContent = abstractContent.substring(0, indexOfNomenclature);
    paper.headers.add("Abstract");
    abstractContent = abstractContent.replaceAll("(Abstract|Summary)(-)*", "");
    String lastHeader = paper.headers.get(paper.headers.size()-1);
    Paragraph paragraph = new Paragraph(1, lastHeader, abstractContent);
    paper.paragraphs.add(paragraph);
}
Figure 4.26 The Function of Extracting the Abstract

void parseTitle(String title) {
    paper.title = title.replaceAll(newLine_Removal, " ").trim();
}

@Override
void parseAuthors(String authors) {
    authors = authors.replaceAll(author_Removal, "").replaceAll("[ ]+", " ");
    authors = authors.replaceAll(separatedWord_Fixing, "")
                     .replace(newLine_Removal, " ");
    String[] split = authors.split(",| and| And| AND");
    for (String author : split)
        if (!author.trim().isEmpty())
            paper.authors.add(author.replaceAll("[0-9]+", "").trim());
}
Figure 4.27 The Function of Extracting the Title and the Authors
4.1.10 Parsing the Other Pages in Detail (ex: an IEEE Paper)
Now all the paper information has been extracted, and the blocks remaining on the first page are the introduction and the rest of the page content. The parseFirstPage() procedure is done executing, and the parseOtherPages() procedure starts: as demonstrated before, it loops over all the string blocks in the pages and extracts all the possible data from them, such as headers (table of contents entries), figure and table captions, and lists; anything that is none of these is a paragraph. All of these procedures are part of the parent parser.
4.1.10.1 Extracting the Headers
This is a very general procedure that works efficiently for most types of headers. First it detects the style of the level 1 headers; it supports (I. INTRODUCTION), (1 INTRODUCTION), (1. INTRODUCTION), and (1 Introduction): the headers can be numbered with Roman numerals or with numbers, with or without a dot after the number, and written in uppercase or capitalized, so the function first detects the type of the header.
For the level 2 headers there are also different styles, and another function detects them; it supports 3 different types, for example (A. Level 2 Header), (1.1 Level 2 Header), and (1.1. Level 2 Header): the header can be listed alphabetically, as number dot number, or as number dot number dot, followed by the title of the header.
Once the header styles are identified, the headers' regular expressions are created and tested on all the passed blocks to detect any headers. These regexes are not constant but change as parsing proceeds: for example, after detecting (1. Introduction), the next expected header is (2. Another Header), so the number is incremented.
Another thing: the header always comes at the start of the string block, and the rest of the string is a paragraph (or the header may be extracted on its own in one block), so the function extracts the header only and returns the rest of the block, which is then parsed as a paragraph.
There are also headers that have no numbers, such as Abstract, References, Acknowledgements, and Appendix; these are detected separately with a separate regex.
This procedure can also detect the level 3 and level 4 headers, whose style is determined by the style of the level 2 headers, for example (1.1.1 Header) or (1.1.1. Header), and all headers are added to the headers list in the Paper object.
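The incrementing behavior described above can be sketched as follows. The class name and regex are illustrative assumptions for one header style, not the project's actual implementation:

```java
import java.util.regex.Pattern;

// Illustrative sketch (not the project's code) of a level 1 header matcher
// whose regex advances to the next expected section number after each match.
public class IncrementingHeaderMatcher {
    private int nextNumber = 1; // start by expecting "1. ..."

    // Returns true if the block starts with the expected "N. Capitalized" header,
    // and advances the expected number when it does.
    public boolean matches(String block) {
        Pattern p = Pattern.compile("^" + nextNumber + "\\. [A-Z][a-z]+");
        if (p.matcher(block).find()) {
            nextNumber++;
            return true;
        }
        return false;
    }
}
```

After "1. Introduction" is seen, the same block no longer matches, because the matcher is now waiting for section 2.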
4.1.10.2 Extracting the Figure and Table Captions
In this procedure, the parent parser uses figure_Exp = ^(Fig.|Figure)[ ]+[0-9]+(.|:) and table_Exp = ^(TABLE|Table)[ ]+([0-9]+|[IVX]+) to detect the figure and table captions. They can appear in different styles, for example (Figure 1.), (Fig. 1), (Figure 1:), (TABLE 1), and (Table II), and the numbering can be numeric or alphabetic. After extraction, the caption is added to the list of captions as a paragraph with its page number.

private enum HeaderMode {
    NUM_SPACE_UPPERCASE, NUM_SPACE_CAPITALIZED, ROMAN_DOT_UPPERCASE,
    NUM_DOT_UPPERCASE, NUM_DOT_CAPITALIZED, ABC_DOT_CAPITALIZED,
    NUM_DOT_NUM_DOT_CAPITALIZED, NUM_DOT_NUM_CAPITALIZED
}

Pattern restHeader_Pattern = Pattern.compile("^(REFERENCES|References|"
        + "ACKNOWLEDGMENT[S]*|Acknowledg[e]*ment[s]*|Nomenclature|DEFINITIONS"
        + "|Contents|NOMENCLATURE|ACRONYM|ACRONYMS|NOTATION|APPENDIX|"
        + "Appendix)(\\r\\n| )*");

void detect_Header1Mode(String block) {
    if (Pattern.compile("I. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.ROMAN_DOT_UPPERCASE;
    else if (Pattern.compile("1 INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_UPPERCASE;
    else if (Pattern.compile("1 Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_SPACE_CAPITALIZED;
    else if (Pattern.compile("1. INTRODUCTION").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_UPPERCASE;
    else if (Pattern.compile("1. Introduction").matcher(block).find())
        header1_Mode = HeaderMode.NUM_DOT_CAPITALIZED;
}

void detect_Header2Mode(int _1st_header, String block) {
    if (Pattern.compile("^A. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.ABC_DOT_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header + ".1 [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_CAPITALIZED;
    else if (Pattern.compile("^" + _1st_header + ".1. [A-Z][a-z]+").matcher(block).find())
        header2_Mode = HeaderMode.NUM_DOT_NUM_DOT_CAPITALIZED;
}
Figure 4.28 Defining the Style of the Header
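The two caption expressions quoted above can be checked directly; this small self-contained class is illustrative (its names are not the project's), while the two patterns are exactly those given in the text:

```java
import java.util.regex.Pattern;

// Illustrative check of the figure/table caption expressions quoted above;
// the class and method names are assumptions, the patterns come from the text.
public class CaptionDemo {
    static final Pattern figure = Pattern.compile("^(Fig.|Figure)[ ]+[0-9]+(.|:)");
    static final Pattern table = Pattern.compile("^(TABLE|Table)[ ]+([0-9]+|[IVX]+)");

    static boolean isFigureCaption(String block) { return figure.matcher(block).find(); }
    static boolean isTableCaption(String block)  { return table.matcher(block).find(); }
}
```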
4.1.10.3 Extracting the Lists
In this procedure, the parent parser detects the lists in the text content and separates each one as a paragraph of its own. It supports numeric, dot, and dashed lists. The block may have a paragraph before the list and a paragraph after it, so they have to be separated, and each paragraph (if found) and each list is added as a paragraph with its page number to the paragraphs list in the Paper object.

private boolean parseFigureCaption(String block, int pageNumber) {
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = figureCaption_Pattern.matcher(block);
    while (matcher.find()) {
        String figureTitle = block.replaceAll(separatedWord_Fixing, "")
                .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if (paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph figure = new Paragraph(pageNumber, lastHeader, figureTitle);
        paper.figureCaptions.add(figure);
        return true;
    }
    return false;
}

private boolean parseTableCaption(String block, int pageNumber) {
    block = block.replaceAll("[ ]+", " ").trim();
    Matcher matcher = tableCaption_Pattern.matcher(block);
    while (matcher.find()) {
        String tableTitle = block.replaceAll(separatedWord_Fixing, "")
                .replaceAll(newLine_Removal, " ").trim();
        String lastHeader = "";
        if (paper.headers.size() > 0)
            lastHeader = paper.headers.get(paper.headers.size()-1);
        Paragraph table = new Paragraph(pageNumber, lastHeader, tableTitle);
        paper.tableCaptions.add(table);
        return true;
    }
    return false;
}
Figure 4.29 The Functions of Extracting the Figure and Table Captions
  • 41. Implementing Plagiarism Detection Engine for English Academic Papers 40 4.1.10.4 Extracting the Paragraph Pattern newList_Pattern = Pattern.compile("rn[ ]*([0-9]|-|.|·|•)"); Pattern numericList1_Pattern = Pattern.compile("^[0-9](.|))[ ]+[A-Z]"); Pattern numericList2_Pattern = Pattern.compile("(.|:)rn[ ]*[0-9](.|))[ ]+[A- Z]"); Pattern dotList1_Pattern = Pattern.compile("^(.|·|•)[ ]+[A-Za-z]"); Pattern dotList2_Pattern = Pattern.compile("(.|:)rn[ ]*(.|·|•)[ ]+"); Pattern dashList1_Pattern = Pattern.compile("^-[ ]+[A-Za-z]+"); Pattern dashList2_Pattern = Pattern.compile("(.|:)rn[ ]*-[ ]+[A-Z]"); private boolean parseLists(String block,int pageNumber){ Matcher orderList1_Matcher = numericList1_Pattern.matcher(block); Matcher orderList2_Matcher = numericList2_Pattern.matcher(block); if(orderList1_Matcher.find() || orderList2_Matcher.find()) return parseList(block, pageNumber, numericList2_Pattern, newList_Pattern); Matcher dotList1_Matcher = dotList1_Pattern.matcher(block); Matcher dotList2_Matcher = dotList2_Pattern.matcher(block); if(dotList1_Matcher.find() || dotList2_Matcher.find()) return parseList(block, pageNumber, dotList2_Pattern, newList_Pattern); Matcher dashList1_Matcher = dashList1_Pattern.matcher(block); Matcher dashList2_Matcher = dashList2_Pattern.matcher(block); if(dashList1_Matcher.find() || dashList2_Matcher.find()) return parseList(block, pageNumber, dashList2_Pattern, newList_Pattern); return false; } Figure 4.30 the Function of separating the lists void parseParagraph(String block, int pageNumber){ Matcher matcher = newParagraph_Pattern.matcher(block); String lastHeader="",content; int startIndex =0, endIndex; while(matcher.find()){ endIndex = matcher.start(); if(paper.headers.size()>0) lastHeader = paper.headers.get(paper.headers.size()-1); content = block.substring(startIndex, endIndex+1); Figure 4.31 the Function of Extracting the Paragraph
In this procedure the parser detects paragraphs using the expression paragraph_Exp = "\.\r\n[ ]+[A-Z]". As mentioned before, a block is passed to this procedure only after it has been tested as a figure or table caption and as a list; it could also contain a header, which has to be extracted first so that only the rest of the block is processed. If the block passes all these tests, it is considered a paragraph and passed to the parseParagraph() procedure. Note that a block may contain more than one paragraph, so all of them have to be detected and separated, and each one is added, with its page number, to the paragraph list in the paper object.

        Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
        paper.paragraphs.add(paragraph);
        startIndex = endIndex + matcher.group().length() - 1;
    }
    if (paper.headers.size() > 0)
        lastHeader = paper.headers.get(paper.headers.size() - 1);
    content = block.substring(startIndex);
    Paragraph paragraph = new Paragraph(pageNumber, lastHeader, content);
    paper.paragraphs.add(paragraph);
}

Figure 4.32 The Function of Extracting the Paragraph
4.2 The Natural Language Processing (NLP)

4.2.1 Introduction

In this section, the text extracted from the scientific papers is refined. We focus on the important content words, such as nouns and verbs, and ignore function words (stop words) such as prepositions and adverbs, so that plagiarism can be detected efficiently even if the user tries to play with the wording.

4.2.2 The Implementation Overview

First, each paragraph in the database is selected and passed to the processText() procedure, which performs the text processing and returns an array of refined words. In this procedure the paragraph passes through several steps:
1. Lowercase
2. Tokenization
3. Part of Speech (POS) tagging
4. Remove punctuation
5. Remove stop words
6. Lemmatization

4.2.3 The Text Processing Procedure

4.2.3.1 Lowercase

In this step, all the text is converted to lowercase, so we do not store redundant data for the same word written in different cases (Play, play).

4.2.3.2 Tokenization

def processText(document):
    document = document.lower()
    words = tokenizeWords(document)
    tagged_words = pos_tag(words)
    filtered_words = removePunctuation(tagged_words)
    filtered_words = removeStopWords(filtered_words)
    filtered_words = lemmatizeWords(filtered_words)
    return filtered_words

Figure 4.33 Process Text Function

def tokenizeWords(sentence):
    return word_tokenize(sentence)

Figure 4.34 Tokenizing Words Function
Here, we split the text into words using the Treebank tokenization algorithm. This algorithm splits the words in an intelligent way based on corpus data retrieved from NLTK, and it also separates words from surrounding punctuation. For example:

1. i’m → ['i', "'m"]
2. won’t → ['wo', "n't"]
3. gonna (tested) {helping} (25) → ['gon', 'na', 'tested', 'helping', '25']

Figure 4.35 Tokenization Example

4.2.3.3 Part of Speech (POS) tagging

POS tagging determines the grammatical role of each word in the sentence: it can detect whether the word is a verb, noun, adjective, or adverb. This information helps return the words to their base forms; verbs, for example, are reduced to their infinitives. We use the WordNet database to obtain the words' base forms.

words = ['At', '5 am', "tomorrow", 'morning', 'the', 'weather', "will", 'be', 'very', 'good', '.']
tagged_words = nltk.pos_tag(words)

Figure 4.36 POS Function

[('at', 'IN'), ('5', 'CD'), ('am', 'VBP'), ('tomorrow', 'NN'), ('morning', 'NN'), ('the', 'DT'), ('weather', 'NN'), ('will', 'MD'), ('be', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]

Figure 4.37 POS Output Example

def getWordnetPos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

Figure 4.38 WordNet POS Function
4.2.3.4 Remove Punctuation

In this step the punctuation is removed from the text: commas, full stops, single and double quotes, and parentheses, whether round, square, or curly.

def removePunctuation(words):
    new_words = []
    for word in words:
        # keep only tokens longer than one character
        # (this drops punctuation and single-letter tokens)
        if len(word[0]) > 1:
            new_words.append(word)
    return new_words

Figure 4.39 Removing Punctuation Function

4.2.3.5 Remove Stop Words

In this process the function words (stop words) are removed.

def removeStopWords(words):
    stop_words = set(stopwords.words("english"))
    new_words = []
    for word in words:
        if word[0] not in stop_words:
            new_words.append(word)
    return new_words

Figure 4.40 Removing Stop Words Function

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

Figure 4.41 Stop Words List
4.2.3.6 Lemmatization

In this step, we use the information retrieved from POS tagging to get the base forms of the words, by passing each word and its WordNet position to the lemmatize function.

def lemmatizeWords(words):
    new_words = []
    wordnet_lemmatizer = WordNetLemmatizer()
    for word in words:
        new_word = wordnet_lemmatizer.lemmatize(word[0], getWordnetPos(word[1]))
        new_words.append(new_word)
    return new_words

Figure 4.42 Lemmatization Function

After processing, the paragraph contains only the important words that describe its real meaning.

4.2.4 Example of the Text Processing

Plagiarism is the wrongful appropriation and stealing and publication of another author's language, thoughts, ideas, or expressions and the representation of them as one's own original work. The idea remains problematic with unclear definitions and unclear rules. The modern concept of plagiarism as immoral and originality as an ideal emerged in Europe only in the 18th century, particularly with the Romantic movement. Plagiarism is considered academic dishonesty and a breach of journalistic ethics. It is subject to sanctions like penalties, suspension, and even expulsion. Recently, cases of 'extreme plagiarism' have been identified in academia. Plagiarism is not in itself a crime, but can constitute copyright infringement. In academia and industry, it is a serious ethical offense. Plagiarism and copyright infringement overlap to a considerable extent, but they are not equivalent concepts, and many types of plagiarism do not constitute copyright infringement, which is defined by copyright law and may be adjudicated by courts. Plagiarism is not defined or punished by law, but rather by institutions (including professional associations, educational institutions, and commercial entities, such as publishing companies).

Figure 4.43 Paragraph before Text Processing
['plagiarism', 'wrongful', 'appropriation', 'stealing', 'publication', 'another', 'author', "'s", 'language', 'thought', 'idea', 'expression', 'representation', 'one', "'s", 'original', 'work', 'idea', 'remain', 'problematic', 'unclear', 'definition', 'unclear', 'rule', 'modern', 'concept', 'plagiarism', 'immoral', 'originality', 'ideal', 'emerge', 'europe', '18th', 'century', 'particularly', 'romantic', 'movement', 'plagiarism', 'consider', 'academic', 'dishonesty', 'breach', 'journalistic', 'ethic', 'subject', 'sanction', 'like', 'penalty', 'suspension', 'even', 'expulsion', 'recently', 'case', "'extreme", 'plagiarism', 'identify', 'academia', 'plagiarism', 'crime', 'constitute', 'copyright', 'infringement', 'academia', 'industry', 'serious', 'ethical', 'offense', 'plagiarism', 'copyright', 'infringement', 'overlap', 'considerable', 'extent', 'equivalent', 'concept', 'many', 'type', 'plagiarism', 'constitute', 'copyright', 'infringement', 'define', 'copyright', 'law', 'may', 'adjudicate', 'court', 'plagiarism', 'define', 'punish', 'law', 'rather', 'institution', 'include', 'professional', 'association', 'educational', 'institution', 'commercial', 'entity', 'publish', 'company']

Figure 4.44 Paragraph after Text Processing
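For quick experimentation, the pipeline above can be approximated without NLTK. The sketch below is a simplified, hypothetical stand-in for processText(): it uses a tiny hard-coded stop-word list and a crude plural-stripping rule in place of the real tokenizer, POS tagger, and WordNet lemmatizer, so its output is rougher than the project's.

```python
import re

# Minimal, dependency-free approximation of the processText() pipeline.
# STOP_WORDS and the suffix rule are illustrative placeholders, not the
# project's actual resources.
STOP_WORDS = {'is', 'the', 'and', 'of', 'as', 'a', 'an', 'or', 'in'}

def process_text_sketch(document):
    words = re.findall(r"[a-z']+", document.lower())      # lowercase + naive tokenize
    words = [w for w in words if len(w) > 1]              # drop single-char tokens
    words = [w for w in words if w not in STOP_WORDS]     # remove stop words
    return [w[:-1] if w.endswith('s') else w for w in words]  # crude "lemmatizer"

print(process_text_sketch("Plagiarism is the stealing of ideas."))
# → ['plagiarism', 'stealing', 'idea']
```

The real pipeline differs mainly in that lemmatization is POS-aware (so "stealing" would reduce to "steal"), which this sketch deliberately skips.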
4.3 Term Weighting

In this section we calculate the term weights for our system using the data extracted from the scientific papers by the parser. The parser extracts the data as paragraphs and stores them in the database; here we retrieve these paragraphs and calculate the term weights for the system.

4.3.1 Lost Connection to Database Problem

First we open a connection to the database and retrieve the unprocessed paragraphs. But we are processing a large number of paragraphs, and the connection must stay open all that time, so we face the problem of a lost database connection when its internal timeout expires.

1) Increasing the timeout
This problem could be solved by increasing the timeout, but this solution is limited, as the number of paragraphs might be so large that processing exceeds any timeout we set.

2) A better solution
We retrieve 100 paragraphs, process them, and close the connection. Then we open a new connection, retrieve another 100 paragraphs, and so on until all the unprocessed paragraphs are processed.

cursor = connection.run("SELECT COUNT(*) FROM paragraph WHERE processed = false")
(unprocessedParagraphsNum,) = cursor.fetchone()
connection.endConnect()

pCounter = 0
insertTermsBeginTime = time.time()
while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 100:
        remain = 100
    rows = connection1.run("SELECT paragraphId,content FROM paragraph WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        # Process Paragraph
    connection1.endConnect()
    connection2.endConnect()

Figure 4.45 Retrieving Paragraphs
4.3.2 Process Paragraph

Each paragraph is passed to the processText() procedure to get an array of refined words. If the array is empty, there were no important words in the paragraph, and the paragraph is deleted. The returned words are used to generate k-gram terms and to populate the term table and the paragraphVector table. Finally, we update the length of the paragraph with the number of words returned from processText() and mark the paragraph as processed.

4.3.3 Generating Terms

To generate terms we call the generateTerms() procedure and pass it the bag of words and the kinds of k-grams we want to generate.

while pCounter < unprocessedParagraphsNum:
    connection1 = Connection(caller)
    connection2 = Connection(caller)
    remain = unprocessedParagraphsNum - pCounter
    if remain > 1000:
        remain = 1000
    rows = connection1.run("SELECT paragraphId,content FROM paragraph" +
                           " WHERE processed = false LIMIT %s", (remain,))
    for (paragraphId, content) in rows:
        pCounter += 1
        data = processText(content)
        length = len(data)
        if length < 1:
            connection2.run("DELETE FROM paragraph WHERE paragraphId = %s;", (paragraphId,))
            connection2.commit()
            continue
        term.populateTerms_ParagraphVector(connection2, data, paragraphId)
        connection2.run("UPDATE paragraph SET length = %s, processed = %s " +
                        " WHERE paragraphId = %s;", (length, True, paragraphId))
        connection2.commit()
    connection1.endConnect()
    connection2.endConnect()

Figure 4.46 Process Paragraph Function
Example of k-grams:

data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)

def generateTerms(data, kgrams, paragraphId=0):
    all_terms = {}
    for i in kgrams:
        if len(data) < i:
            continue
        terms = createTerms(data, i)
        all_terms[i] = terms
    data = {
        'paragraphId': paragraphId,
        'terms': all_terms
    }
    return data

def createTerms(words, kgram):
    length = len(words) - kgram + 1
    i = 0
    terms = []
    while i < length:
        term = createTerm(words, i, kgram)
        terms.append(term)
        i += 1
    return terms

def createTerm(words, start, kgram):
    i = start
    term = []
    while i < kgram + start:
        term.append(words[i])
        i += 1
    t = ' '.join(term)
    if len(t) > 180:
        t = t[0:180]
    return t

Figure 4.47 Generate k-gram Terms Function

Physics is one of the oldest academic disciplines, perhaps the oldest through its inclusion of astronomy. Over the last two millennia, physics was a part of natural philosophy along with chemistry, biology, and certain branches of mathematics.

Figure 4.48 Paragraph Example
['physic', 'one', 'old', 'academic', 'discipline', 'perhaps', 'old', 'inclusion', 'astronomy', 'last', 'two', 'millennium', 'physic', 'part', 'natural', 'philosophy', 'along', 'chemistry', 'biology', 'certain', 'branch', 'mathematics']

Figure 4.49 1-gram terms

['physic one', 'one old', 'old academic', 'academic discipline', 'discipline perhaps', 'perhaps old', 'old inclusion', 'inclusion astronomy', 'astronomy last', 'last two', 'two millennium', 'millennium physic', 'physic part', 'part natural', 'natural philosophy', 'philosophy along', 'along chemistry', 'chemistry biology', 'biology certain', 'certain branch', 'branch mathematics']

Figure 4.50 2-gram terms

['physic one old', 'one old academic', 'old academic discipline', 'academic discipline perhaps', 'discipline perhaps old', 'perhaps old inclusion', 'old inclusion astronomy', 'inclusion astronomy last', 'astronomy last two', 'last two millennium', 'two millennium physic', 'millennium physic part', 'physic part natural', 'part natural philosophy', 'natural philosophy along', 'philosophy along chemistry', 'along chemistry biology', 'chemistry biology certain', 'biology certain branch', 'certain branch mathematics']

Figure 4.51 3-gram terms

['physic one old academic', 'one old academic discipline', 'old academic discipline perhaps', 'academic discipline perhaps old', 'discipline perhaps old inclusion', 'perhaps old inclusion astronomy', 'old inclusion astronomy last', 'inclusion astronomy last two', 'astronomy last two millennium', 'last two millennium physic', 'two millennium physic part', 'millennium physic part natural', 'physic part natural philosophy', 'part natural philosophy along', 'natural philosophy along chemistry', 'philosophy along chemistry biology', 'along chemistry biology certain', 'chemistry biology certain branch', 'biology certain branch mathematics']

Figure 4.52 4-gram terms

['physic one old academic discipline', 
'one old academic discipline perhaps', 'old academic discipline perhaps old', 'academic discipline perhaps old inclusion', 'discipline perhaps old inclusion astronomy', 'perhaps old inclusion astronomy last', 'old inclusion astronomy last two', 'inclusion astronomy last two millennium', 'astronomy last two millennium physic', 'last two millennium physic part', 'two millennium physic part natural', 'millennium physic part natural philosophy', 'physic part natural philosophy along', 'part natural philosophy along chemistry', 'natural philosophy along chemistry biology', 'philosophy along chemistry biology certain', 'along chemistry biology certain branch', 'chemistry biology certain branch mathematics'] Figure 4.53 5-gram terms
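The term generation illustrated in Figures 4.49 to 4.53 can be sketched compactly. The functions below are illustrative rewrites of createTerms()/generateTerms() (renamed create_terms/generate_terms), not the project's exact code:

```python
# Dependency-free sketch of the k-gram term generation described above.
def create_terms(words, k, max_len=180):
    """Return all contiguous k-grams of `words`, each truncated to max_len chars."""
    return [' '.join(words[i:i + k])[:max_len] for i in range(len(words) - k + 1)]

def generate_terms(words, kgrams=(1, 2, 3, 4, 5)):
    """Map each k to its list of k-gram terms, skipping k larger than the text."""
    return {k: create_terms(words, k) for k in kgrams if len(words) >= k}

words = ['physic', 'one', 'old', 'academic', 'discipline']
terms = generate_terms(words)
print(terms[2])
# → ['physic one', 'one old', 'old academic', 'academic discipline']
```

Note that a paragraph of n words yields n - k + 1 terms for each k, which is why the lists above shrink by one entry as k grows.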
4.3.4 Populating the term and paragraphVector Tables

After generating the terms, we use them to populate the term and paragraphVector tables.

4.3.4.1 Calculate Term Frequency

We use the nltk.FreqDist() function to calculate the term frequency of each k-gram term in the paragraph.

tf = {}
for kgram in data['terms']:
    tf[kgram] = nltk.FreqDist(data['terms'][kgram])

Figure 4.54 Calculate Term Frequency

4.3.4.2 Inserting Terms

We insert each term with its corresponding k-gram.

query1 = "INSERT INTO term (kgram, term) VALUES (%s, %s) ON DUPLICATE KEY UPDATE kgram = kgram, term = term;"
insertTerms = [(str(kgram), str(term)) for kgram in tf for term in tf[kgram]]
connection.runMany(query1, insertTerms)
connection.commit()

Figure 4.55 Insert Terms into the Database

4.3.4.3 Inserting the Paragraph Vector

In this step we link each term with its paragraph and its term frequency by inserting them into the paragraphVector table.

query2 = "INSERT IGNORE INTO paragraphVector (paragraphId, termId, termFreq, kgram) VALUES (%s, (SELECT termId FROM term WHERE term = %s AND kgram = %s), %s, %s);"
insertDocVec = [(data['paragraphId'], str(term), str(kgram), tf[kgram][term], str(kgram)) for kgram in tf for term in tf[kgram]]
connection.runMany(query2, insertDocVec)
connection.commit()

Figure 4.56 Insert Paragraph Vector into the Database
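nltk.FreqDist is essentially a counting dictionary, so the term-frequency step can be illustrated with the standard library's collections.Counter (the sample terms below are made up for the example):

```python
from collections import Counter

# Illustrative stand-in for nltk.FreqDist: both map each term to its count.
terms_by_kgram = {
    1: ['physic', 'old', 'physic', 'part'],
    2: ['physic old', 'old physic', 'physic part'],
}
tf = {k: Counter(terms) for k, terms in terms_by_kgram.items()}
print(tf[1]['physic'])   # → 2, since 'physic' occurs twice among the 1-gram terms
```

This is exactly the shape the insertion code above iterates over: for each k-gram size, every distinct term paired with its frequency.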
4.3.5 Executing the VSM Algorithm

After all paragraphs have been processed and inserted into the database, we run stored SQL procedures to update several columns (inverseDocFreq, BM25, pivotNorm) in the term and paragraphVector tables. Now the system is complete: all terms are weighted and ready for plagiarism testing.

connection.callProcedure('update_inverseDocFreq')
connection.callProcedure('update_BM25', (0.75, 1.5))
connection.callProcedure('update_pivotNorm', (0.75,))

Figure 4.57 Executing the VSM Algorithm
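The stored procedures themselves are not reproduced here, but a hedged sketch of the standard Okapi BM25 weighting they presumably implement looks like this, assuming the parameters (0.75, 1.5) passed above correspond to b and k1:

```python
import math

def inverse_doc_freq(total_paragraphs, paragraphs_with_term):
    # Plain IDF; the project's stored procedure may use different smoothing.
    return math.log(total_paragraphs / paragraphs_with_term)

def bm25_weight(tf, doc_len, avg_doc_len, idf, k1=1.5, b=0.75):
    # Okapi BM25 term weight: saturates with tf and normalizes by
    # paragraph length relative to the average paragraph length.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

idf = inverse_doc_freq(1000, 10)              # rare term -> high IDF
print(round(bm25_weight(3, 120, 100, idf), 3))
```

The pivotNorm column likewise suggests pivoted length normalization with slope 0.75, though the exact formula is defined by the project's SQL.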
4.4 Testing Plagiarism

When a user submits a text or a file to test for plagiarism, the text must first be split into paragraphs, and an inputPaper record is inserted to relate these paragraphs together.

connection.run(" INSERT INTO inputPaper (inputPaperId) VALUES(''); ")
paragraphs = tokenizeParagraphs(text)

Figure 4.58 Tokenizing and Linking Paragraphs Together

4.4.1 Process Paragraph

Each paragraph is then processed in a similar way to the pre-processing stage: first the paragraph is inserted into the inputParagraph table, then the text is passed to the processText() procedure, which returns a refined bag of words. Finally, these words are used to generate terms and populate the inputParagraphVector table.

for paragraph in paragraphs:
    data = processText(paragraph)
    length = len(data)
    if length < 1:
        continue
    cursor = connection.run("INSERT INTO inputParagraph (content,inputPaperId) VALUES (%s,%s)", (paragraph, paperId))
    connection.commit()
    paragraphId = cursor.getlastrowid()
    term.populateInput_Terms_ParagraphVector(connection, data, paragraphId)

Figure 4.59 Process Input Paragraphs

def populateInput_Terms_ParagraphVector(connection, words, paragraphId):
    data = generateTerms(words, [1, 2, 3, 4, 5], paragraphId)
    # Term frequency representation
    tf = {}
    for kgram in data['terms']:
        tf[kgram] = FreqDist(data['terms'][kgram])
    query = "INSERT INTO inputParagraphVector (inputParagraphId, termId, termFreq, kgram) SELECT %s, termId, %s, %s FROM term WHERE term = %s AND kgram = %s ;"
    insertDocVec = [(data['paragraphId'], tf[kgram][term], str(kgram), str(term), str(kgram)) for kgram in tf for term in tf[kgram]]
    connection.runMany(query, insertDocVec)
    connection.commit()

Figure 4.60 Populate Input Paragraph Vector
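Once the input paragraph vectors are populated, the engine can compare them against the stored paragraph vectors. The comparison itself is beyond this section, but a common choice in the vector space model is cosine similarity over weighted term vectors; the function below is an illustrative sketch, not the project's SQL implementation:

```python
import math

def cosine_similarity(vec_a, vec_b):
    # vec_a, vec_b: dicts mapping term -> weight (e.g. term frequency or BM25).
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

stored = {'plagiarism': 2.0, 'copyright': 1.0}      # hypothetical stored vector
submitted = {'plagiarism': 1.0, 'detection': 1.0}   # hypothetical input vector
print(round(cosine_similarity(stored, submitted), 3))  # → 0.632
```

A score near 1.0 indicates a strong overlap between the submitted paragraph and a stored one, flagging it as potentially plagiarized.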