Novel and efficient approach for detection of duplicate pages in web crawling

A Novel And Efficient Approach
For Near Duplicate Page
Detection In Web Crawling

VIPIN KP, 08103066, S7 CSE A
Guided by: Mr. Aneesh M Haneef, Asst. Professor, Department of CSE, MESCE

Presentation Outline
 Introduction
 What are near duplicates
 Drawbacks of near duplicate pages
 What is a Web crawler
 Simplified Crawl Architecture
 Near duplicate detection
 Advantages
 Conclusion
 Reference

1/2/2012 2

Introduction
 Search engines are the main gateways to information on
the web.
 A search engine operates in the following order:
Web crawling
Indexing
Searching
 Web crawling is the process that creates the indexed
repository used by search engines.
 The large volume of documents on the web
poses a huge challenge to search engines, making
their results less relevant to the user.


Introduction cont'd…
 Web search engines face additional problems
due to near duplicate web pages.
 It is an important requirement for search
engines to provide users with relevant results
without duplication.
 Near duplicate page detection is a challenging
problem.


What are near duplicates?
 Near duplicates are not "exact duplicates";
they are files with minute
differences.
 They differ slightly in advertisements, counters,
timestamps, etc.
 Most web sites contain boilerplate code.



http://shop.asus.co.uk/shop/gb/en-gb/home.aspx



http://shop.asus.es/shop/gb/en-gb/home.aspx

Drawbacks of Near Duplicate Web Pages

 Waste network bandwidth
 Increase storage cost
 Affect the quality of search indexes
 Increase the load on the remote host that is
serving such web pages
 Affect customer satisfaction


Web Crawler
 A Web crawler is a computer program that browses
the World Wide Web in an orderly fashion.
 Other terms for Web crawlers are ants, automatic
indexers, bots, Web spiders, and Web robots.
 Search engines use Web crawlers to create a
copy of all the visited pages for later processing by
a search engine that will index the downloaded
pages to provide fast searches.
 This indexed database is then used for the searching
process.
 A crawler may examine the URL and request the resource
only if the URL ends with certain characters such as
.html, .htm, .asp, .aspx, .php, .jsp, .jspx or a slash.
 Some crawlers may also avoid requesting any
resources that have a "?" in them.
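
The URL checks described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `should_request` and the `skip_query_urls` flag are hypothetical, and treating an empty path as a slash is an assumption.

```python
from urllib.parse import urlparse

# Endings taken from the slide; "/" covers URLs that end in a slash.
PAGE_SUFFIXES = (".html", ".htm", ".asp", ".aspx", ".php", ".jsp", ".jspx", "/")


def should_request(url: str, skip_query_urls: bool = True) -> bool:
    """Return True if the URL looks like an HTML page worth fetching."""
    # Some crawlers avoid any resource with a "?" in it.
    if skip_query_urls and "?" in url:
        return False
    # Assumption: an empty path ("http://host") is treated like "/".
    path = urlparse(url).path or "/"
    return path.lower().endswith(PAGE_SUFFIXES)


print(should_request("http://shop.asus.co.uk/shop/gb/en-gb/home.aspx"))
```

Checking only the path (not the full URL) keeps a query string from hiding the extension when `skip_query_urls` is turned off.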

Simplified Crawl Architecture
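[Diagram: the crawler takes one HTML document at a time from the Web and traverses its links. Each newly crawled document is checked against the entire Web index for near-duplicates, then either inserted into the index or sent to trash.]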
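
A rough sketch of this loop, assuming a toy in-memory "web" instead of real HTTP requests. The `crawl` function, the `Web` type, and the pluggable `is_near_duplicate` predicate are all hypothetical names; following links from pages that end up in the trash is also an assumption.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
Web = Dict[str, Tuple[str, List[str]]]


def crawl(web: Web, seed: str,
          is_near_duplicate: Callable[[str, List[str]], bool]) -> List[str]:
    """Breadth-first crawl that indexes a page only if it is not a near duplicate."""
    index: List[str] = []      # texts accepted into the index
    frontier = deque([seed])
    seen = {seed}
    while frontier:
        url = frontier.popleft()
        if url not in web:
            continue
        text, links = web[url]
        # Assumption: links are traversed whether or not the page is kept.
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        if is_near_duplicate(text, index):
            continue           # "trash": do not insert
        index.append(text)     # "insert"
    return index
```

Any detector can be passed as `is_near_duplicate`; it receives the new page text and the texts already in the index.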


Near Duplicate Detection
The steps involved in this approach are:

Web document parsing
Stemming algorithm
Keyword representation
Similarity score calculation


cont'd…
Web Document Parsing:

• Parsing may be as simple as URL extraction or as complex
as removing the HTML tags and JavaScript from a web
page.

• Stop Word Removal
Remove commonly used words such as 'an', 'and',
'the', 'to', 'with', 'by', 'for', etc. This helps to reduce the
size of the index file.
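
As an illustration of the parsing step, the sketch below strips tags and script contents with Python's standard `html.parser` and then drops the stop words named on the slide. The names `TextExtractor` and `parse_page` are hypothetical, the tokenizer (lowercase ASCII letters only) is an assumption, and a practical stop-word list would be much longer.

```python
import re
from html.parser import HTMLParser

# Stop words listed on the slide; a real list would be much longer.
STOP_WORDS = {"an", "and", "the", "to", "with", "by", "for"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def parse_page(html: str) -> list:
    """Strip tags and scripts, lowercase, tokenize, and drop stop words."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    words = re.findall(r"[a-z]+", " ".join(extractor.chunks).lower())
    return [w for w in words if w not in STOP_WORDS]
```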


cont'd…
Stemming Algorithm:

• Stemming is the process of reducing derived words to
their stem, base, or root form, generally a written word
form.
• The relation between a query and a document is
determined by the number and frequency of the terms
they have in common.
• Affix removal algorithms remove suffixes and/or
prefixes from terms, leaving a stem.
e.g. "connect", "connected", and "connecting" are all
condensed to "connect".


cont'd…
Stemming Algorithm cont'd…
•The prefix removal algorithm removes:
anti,bi,co,contra,de,di,des,en,inter,intra,mini,multi,pre,pro

•The suffix removal algorithm removes:
ly,ness,ioc,iez,able,ance,ary,ce,y,dom,ee,eer,ence,ory,o

• The derivations are converted to their stems, which are related
to the original in both form and semantics.
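
A minimal sketch of suffix removal, under stated assumptions. The suffix list is a subset of the one on the slide plus "ed" and "ing", which are added here only so that the "connect" example works; `MIN_STEM` and the function name `stem` are hypothetical. Prefix removal is left out on purpose: naively stripping "co" would turn "connect" into "nnect".

```python
# Subset of the slide's suffixes, plus "ed" and "ing" (an assumption
# added so that "connected" and "connecting" reduce to "connect").
SUFFIXES = ["ly", "ness", "able", "ance", "ary", "dom", "ence", "ory",
            "ed", "ing"]
MIN_STEM = 3  # assumed lower bound so that short words are left alone


def stem(word: str) -> str:
    """Remove at most one suffix, trying the longest suffixes first."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


print([stem(w) for w in ("connect", "connected", "connecting")])
```

Established stemmers such as the Lovins algorithm use much larger suffix lists together with recoding rules; this sketch only shows the basic idea.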


cont'd…
Keyword Representation:

• Stemming yields the keywords of each crawled page
and their counts

• Keywords are sorted in descending order of
their counts

• The keywords with the highest counts are called prime
keywords and are stored in one table; the remaining keywords
are indexed and stored in another table.
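
The keyword representation could look like the following sketch. The slide does not say how many keywords count as "prime", so the cutoff `prime_k` is an assumption, as are the function name `keyword_tables` and the alphabetical tie-break.

```python
from collections import Counter
from typing import List, Tuple

KeywordTable = List[Tuple[str, int]]


def keyword_tables(stems: List[str], prime_k: int = 3
                   ) -> Tuple[KeywordTable, KeywordTable]:
    """Return (prime table, remaining table) of (keyword, count) pairs,
    both sorted by descending count."""
    counts = Counter(stems)
    # Sort by count descending, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kc: (-kc[1], kc[0]))
    return ranked[:prime_k], ranked[prime_k:]
```

For example, with `prime_k=2` the stems `web, crawl, web, index, crawl, web, page` give a prime table of `web` (3) and `crawl` (2), with `index` and `page` in the second table.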


cont'd…
Similarity score calculation:
• If the prime keywords of the new web page do not match
the prime keywords of the pages in the table, the new
page is added to the repository.

• If all the keywords of both pages are the same, the new
page is a duplicate.

• If the prime keywords of both pages are the same, the
similarity score (SSM) is calculated as follows.
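
One possible reading of these three rules, comparing a new page against a single stored page. Several points are assumptions: "match" is taken to mean equal sets of prime keywords, "all the keywords are the same" is taken to mean equal keyword sets regardless of counts, and `prime_k`, `top_keywords`, and `triage` are hypothetical names. In a full crawler, the "add" outcome would apply only when no stored page matches.

```python
from typing import Dict, Set


def top_keywords(table: Dict[str, int], prime_k: int) -> Set[str]:
    """Keywords with the prime_k highest counts (ties broken alphabetically)."""
    ranked = sorted(table, key=lambda k: (-table[k], k))
    return set(ranked[:prime_k])


def triage(new: Dict[str, int], stored: Dict[str, int], prime_k: int = 3) -> str:
    """Return 'add', 'duplicate', or 'score' for a new page against one stored page."""
    if top_keywords(new, prime_k) != top_keywords(stored, prime_k):
        return "add"        # prime keywords do not match
    if set(new) == set(stored):
        return "duplicate"  # all keywords are the same
    return "score"          # same prime keywords: compute the similarity score
```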


cont'd…
K1  K2  ...  Kn
C1  C2  ...  Cn
Table of the web page in the repository, containing keywords and counts

K1  K2  ...  Kn
C1  C2  ...  Cn
Table of the new web page, containing keywords and counts

If a keyword is present in both tables, then
$a = \Delta[K_i]_{T_1}$
$b = \Delta[K_i]_{T_2}$

Using the formula
$SD_C = \log\left(\frac{\mathrm{count}(a)}{\mathrm{count}(b)}\right) \cdot \mathrm{Abs}(1 + (a - b))$


cont'd…
• If keywords are present in T1 but not in T2, and the number of such keywords
is $N_{T_1}$, then
$SD_{T_1} = \log(\mathrm{count}(a)) \cdot \mathrm{Abs}(1 + |T_2|)$

• If keywords are present in T2 but not in T1, and the number of such keywords
is $N_{T_2}$, then
$SD_{T_2} = \log(\mathrm{count}(b)) \cdot \mathrm{Abs}(1 + |T_1|)$

• The similarity score of a page against another page is calculated by

$$SSM = \frac{\sum_{i=1}^{|N_C|} SD_C + \sum_{i=1}^{|N_{T_1}|} SD_{T_1} + \sum_{i=1}^{|N_{T_2}|} SD_{T_2}}{N}$$

where $N = (|T_1| + |T_2|)/2$
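
A sketch of how the score might be computed from two keyword-count tables. It rests on assumptions the slides do not spell out: a and b are read as the positions of a keyword in T1 and T2 when each table is sorted by descending count, count(a) and count(b) are the counts at those positions, |T1| and |T2| are the numbers of keywords in each table, and the logarithm is natural. The function name `similarity_score` is hypothetical.

```python
import math
from typing import Dict


def similarity_score(t1: Dict[str, int], t2: Dict[str, int]) -> float:
    """SSM for two keyword -> count tables, under the assumptions stated above."""
    if not t1 and not t2:
        raise ValueError("both keyword tables are empty")

    def ranks(table: Dict[str, int]) -> Dict[str, int]:
        # Position of each keyword when sorted by descending count (1-based).
        ordered = sorted(table, key=lambda k: (-table[k], k))
        return {k: i for i, k in enumerate(ordered, start=1)}

    r1, r2 = ranks(t1), ranks(t2)
    total = 0.0
    for k in t1.keys() & t2.keys():   # keywords in both tables: SD_C
        total += math.log(t1[k] / t2[k]) * abs(1 + (r1[k] - r2[k]))
    for k in t1.keys() - t2.keys():   # keywords only in T1: SD_T1
        total += math.log(t1[k]) * abs(1 + len(t2))
    for k in t2.keys() - t1.keys():   # keywords only in T2: SD_T2
        total += math.log(t2[k]) * abs(1 + len(t1))

    n = (len(t1) + len(t2)) / 2
    return total / n
```

Under this reading, two identical tables score 0, because every common keyword contributes log(1) = 0.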


cont'd…
• Web documents with a similarity score greater than
a predefined threshold are considered near
duplicates

• These near duplicate pages are not added to the
repository of the search engine


Advantages
• Saves network bandwidth

• Reduces the storage cost of search engines

• Improves the quality of the search index


Conclusion
• The proposed method solves the difficulties of
information retrieval from the web.

• The approach detects near duplicate web
pages efficiently, based on the keywords extracted from
the web pages.

• It reduces the memory space needed for web repositories.

• Near duplicate detection increases the quality of
search engines.


Reference
• Brin, S., Davis, J. and Garcia-Molina, H. (1995) "Copy detection
mechanisms for digital documents", In Proceedings of the Special
Interest Group on Management of Data (SIGMOD 1995), ACM Press.

• Pandey, S. and Olston, C. (2005) "User-centric Web crawling",
Proceedings of the 14th International Conference on World Wide Web,
pp. 401 - 41

• Xiao, C., Wang, W., Lin, X. and Xu Yu, J. (2008) "Efficient Similarity Joins
for Near Duplicate Detection", Proceedings of the 17th International
Conference on World Wide Web, pp. 131-140.

• Lovins, J.B. (1968) "Development of a stemming algorithm",
Mechanical Translation and Computational Linguistics.


Questions

