A Novel and Efficient Approach for Near Duplicate Page Detection in Web Crawling

VIPIN KP, 08103066, S7 CSE A
Guided by: Mr. Aneesh M Haneef, Asst. Professor, Department of CSE, MESCE
Presentation Outline
• Introduction
• What are near duplicates
• Drawbacks of near duplicate pages
• What is a Web crawler
• Simplified Crawl Architecture
• Near duplicate detection
• Advantages
• Conclusion
• Reference

                         1/2/2012       2
Introduction
• Search engines are the main gateways for accessing information on the web.
• A search engine operates in the following order:
   - Web crawling
   - Indexing
   - Searching
• Web crawling is the process that creates the indexed repository used by search engines.
• The sheer volume of documents on the web poses a major challenge to search engines and makes their results less relevant to the user.
Introduction (cont'd)
• Web search engines face additional problems due to near duplicate web pages.
• Providing users with relevant results without duplication is an important requirement for search engines.
• Near duplicate page detection is a challenging problem.
What are near duplicates?
• Near duplicates are not exact duplicates; they are files that differ only in minute details.
• They differ slightly in advertisements, counters, timestamps, etc.
• Most web sites contain boilerplate code.
What are near duplicates? (example)
Example pair of near duplicate pages (page images not reproduced):
• http://shop.asus.co.uk/shop/gb/en-gb/home.aspx
• http://shop.asus.es/shop/gb/en-gb/home.aspx
Drawbacks of Near Duplicate Web Pages
• Waste network bandwidth
• Increase storage cost
• Degrade the quality of search indexes
• Increase the load on the remote host that serves such web pages
• Reduce customer satisfaction
Web Crawler
• A Web crawler is a computer program that browses the World Wide Web in an orderly fashion.
• Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders and Web robots.
• Search engines use Web crawlers to create a copy of all visited pages; the search engine later indexes the downloaded pages to provide fast searches.
• This indexed database is then used for the searching process.
• A crawler may check whether a URL ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx or a slash.
• Some crawlers may also avoid requesting any resources that have a "?" in them.
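The two URL checks mentioned on this slide could be sketched as follows. This is only an illustration, not the rule set of any particular crawler: the function name `looks_crawlable` and the `skip_queries` flag are hypothetical, and the suffix list is simply the one given on the slide.

```python
from urllib.parse import urlsplit

# Suffixes listed on the slide; a trailing slash is treated like a page suffix.
PAGE_SUFFIXES = (".html", ".htm", ".asp", ".aspx", ".php", ".jsp", ".jspx", "/")


def looks_crawlable(url: str, skip_queries: bool = True) -> bool:
    """Return True if the URL path ends like a web page (hypothetical helper)."""
    # Optionally skip any resource that has a "?" in it.
    if skip_queries and "?" in url:
        return False
    # An empty path (bare host name) is treated as "/".
    path = urlsplit(url).path or "/"
    return path.lower().endswith(PAGE_SUFFIXES)


print(looks_crawlable("http://shop.asus.co.uk/shop/gb/en-gb/home.aspx"))
print(looks_crawlable("http://example.com/logo.png"))
print(looks_crawlable("http://example.com/list.php?id=3"))
```

With `skip_queries=False` only the path is checked, so a URL such as `list.php?id=3` would pass the suffix test.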
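Simplified Crawl Architecture
[Figure: simplified crawl architecture. The crawler traverses links on the Web and obtains newly-crawled HTML documents, one document at a time. Each document is checked against the entire Web index by a "Near-duplicate?" test, whose two outcomes are labelled "insert" and "trash".]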
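One plausible reading of this architecture is a loop over newly crawled documents in which a near duplicate test decides between "insert" and "trash". The sketch below assumes that reading; `filter_crawl` and the pluggable `is_near_duplicate` predicate are hypothetical names, and the exact-match predicate in the usage lines is only a stand-in for a real similarity test.

```python
from typing import Callable, Iterable, List, Tuple


def filter_crawl(
    documents: Iterable[str],
    is_near_duplicate: Callable[[str, List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Split crawled documents into an index and a trash list."""
    index: List[str] = []
    trash: List[str] = []
    for doc in documents:
        # Each document is compared against everything indexed so far.
        if is_near_duplicate(doc, index):
            trash.append(doc)
        else:
            index.append(doc)
    return index, trash


# Stand-in predicate: exact match only (a real test would score similarity).
index, trash = filter_crawl(["a", "b", "a"], lambda doc, idx: doc in idx)
print(index, trash)
```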
Near Duplicate Detection
The steps involved in this approach are:
• Web document parsing
• Stemming algorithm
• Keyword representation
• Similarity score calculation
Near Duplicate Detection (cont'd)
Web Document Parsing:

• Parsing may be as simple as URL extraction or as complex as removing the HTML tags and JavaScript from a web page.

• Stop word removal: remove commonly used words such as 'an', 'and', 'the', 'to', 'with', 'by', 'for', etc. This helps to reduce the size of the indexing file.
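The parsing step could look roughly like the sketch below, which uses Python's standard `html.parser` module to drop tags and script content and then removes stop words. The stop word set contains only the examples from the slide, and the names `parse_page` and `_TextExtractor` are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List

# Only the examples named on the slide; a real list would be much longer.
STOP_WORDS = {"an", "and", "the", "to", "with", "by", "for"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def parse_page(html: str) -> List[str]:
    """Return the lower-cased words of a page with stop words removed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    words = re.findall(r"[a-z]+", " ".join(extractor.chunks).lower())
    return [w for w in words if w not in STOP_WORDS]


page = "<html><script>var the = 1;</script><p>The crawler connected to the web</p></html>"
print(parse_page(page))
```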
Near Duplicate Detection (cont'd)
Stemming Algorithm:

• Stemming is the process of reducing derived words to their stem, base or root form, generally a written word form.
• The relation between a query and a document is determined by the number and frequency of the terms they have in common.
• Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem.
  e.g. "connect", "connected" and "connecting" are all condensed to "connect".
Near Duplicate Detection (cont'd)
Stemming Algorithm (cont'd):
• The prefix removal algorithm removes:
   anti,bi,co,contra,de,di,des,en,inter,intra,mini,multi,pre,pro

• The suffix removal algorithm removes:
   ly,ness,ioc,iez,able,ance,ary,ce,y,dom,ee,eer,ence,ory,o

• The derived words are converted to their stems, which are related to the original in both form and semantics.
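A deliberately naive affix removal stemmer is sketched below to illustrate the idea. It is not the algorithm used in the presentation: it uses only a subset of the prefixes and suffixes listed above, it adds "ing" and "ed" (an assumption, needed for the "connect" example), and the `min_stem` guard and the name `strip_affixes` are hypothetical.

```python
# Subset of the affixes listed on the slide; "ing" and "ed" are assumed additions.
PREFIXES = ("anti", "contra", "inter", "intra", "mini", "multi", "pre", "pro")
SUFFIXES = ("ness", "able", "ance", "ence", "ing", "ed", "ly")


def strip_affixes(word: str, min_stem: int = 3) -> str:
    """Remove at most one prefix and one suffix, longest match first."""
    word = word.lower()
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        # Keep at least min_stem characters so short words are left alone.
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[: -len(suffix)]
            break
    return word


for w in ("connect", "connected", "connecting"):
    print(w, "->", strip_affixes(w))
```

Blind affix stripping over-stems many ordinary words (for example, anything that merely begins with "pro"), which is why practical stemmers add conditions on what may be removed.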
Near Duplicate Detection (cont'd)
Keyword Representation:

• The result of stemming is the set of keywords, with their counts, for each crawled page.

• The keywords are sorted in descending order of their counts.

• The keywords with the highest counts are called prime keywords and are stored in one table; the remaining keywords are indexed and stored in another table.
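The keyword representation could be built as in the sketch below. The slides do not say how many keywords count as prime, so the `prime_count` parameter is an assumption, as is the function name `keyword_tables`; ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter
from typing import List, Tuple

Table = List[Tuple[str, int]]  # (keyword, count) pairs


def keyword_tables(stems: List[str], prime_count: int = 3) -> Tuple[Table, Table]:
    """Return (prime keywords, remaining keywords), each sorted by count."""
    counts = Counter(stems)
    # Descending count; alphabetical order breaks ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:prime_count], ranked[prime_count:]


stems = ["web", "crawl", "web", "page", "crawl", "web", "index"]
prime, rest = keyword_tables(stems, prime_count=2)
print(prime)
print(rest)
```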
Near Duplicate Detection (cont'd)
Similarity score calculation:
• If the prime keywords of the new web page do not match the prime keywords of the pages in the table, the new page is added to the repository.

• If all the keywords of both pages are the same, the new page is a duplicate.

• If the prime keywords of both pages are the same, the similarity score (SSM) is calculated as follows.
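Near Duplicate Detection (cont'd)

Table T1: keywords and counts of a web page in the repository
   Keyword | K1 | K2 | ... | Kn
   Count   | C1 | C2 | ... | Cn

Table T2: keywords and counts of the new web page
   Keyword | K1 | K2 | ... | Kn
   Count   | C1 | C2 | ... | Cn

If a keyword ki is present in both tables, then
   a = Δ[ki]T1   (the index of ki in T1)
   b = Δ[ki]T2   (the index of ki in T2)

and the formula used is
   \[ \mathrm{SD}_{C} = \log\!\left(\frac{\mathrm{count}(a)}{\mathrm{count}(b)}\right) \cdot \mathrm{Abs}(1 + (a - b)) \]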
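Near Duplicate Detection (cont'd)
• If a keyword is present in T1 but not in T2, and the number of such keywords is $N_{T1}$, then
   \[ \mathrm{SD}_{T1} = \log(\mathrm{count}(a)) \cdot \mathrm{Abs}(1 + |T_2|) \]

• If a keyword is present in T2 but not in T1, and the number of such keywords is $N_{T2}$, then
   \[ \mathrm{SD}_{T2} = \log(\mathrm{count}(b)) \cdot \mathrm{Abs}(1 + |T_1|) \]

• The similarity score of one page against another is calculated by
   \[ \mathrm{SSM} = \frac{\sum_{i=1}^{|N_C|} \mathrm{SD}_{C} + \sum_{i=1}^{|N_{T1}|} \mathrm{SD}_{T1} + \sum_{i=1}^{|N_{T2}|} \mathrm{SD}_{T2}}{N}, \qquad \text{where } N = \frac{|T_1| + |T_2|}{2} \]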
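The similarity score could be computed along the lines of the sketch below. Several points are assumptions rather than statements from the slides: `a` and `b` are taken to be 1-based positions of a keyword in T1 and T2, `count(a)` and `count(b)` are the counts stored at those positions, the logarithm is natural, and two empty tables score 0. The name `similarity_score` is hypothetical.

```python
import math
from typing import List, Tuple

Table = List[Tuple[str, int]]  # (keyword, count), sorted by descending count


def similarity_score(t1: Table, t2: Table) -> float:
    """Sum the three kinds of terms and divide by N = (|T1| + |T2|) / 2."""
    # Map each keyword to (1-based position, count) in its table.
    pos1 = {kw: (i, c) for i, (kw, c) in enumerate(t1, start=1)}
    pos2 = {kw: (i, c) for i, (kw, c) in enumerate(t2, start=1)}
    total = 0.0
    for kw, (a, count_a) in pos1.items():
        if kw in pos2:
            # Keyword in both tables: log(count(a)/count(b)) * Abs(1 + (a - b))
            b, count_b = pos2[kw]
            total += math.log(count_a / count_b) * abs(1 + (a - b))
        else:
            # Keyword only in T1: log(count(a)) * Abs(1 + |T2|)
            total += math.log(count_a) * abs(1 + len(t2))
    for kw, (b, count_b) in pos2.items():
        if kw not in pos1:
            # Keyword only in T2: log(count(b)) * Abs(1 + |T1|)
            total += math.log(count_b) * abs(1 + len(t1))
    n = (len(t1) + len(t2)) / 2
    return total / n if n else 0.0


t1 = [("web", 4), ("crawl", 2)]
t2 = [("web", 2), ("page", 1)]
print(similarity_score(t1, t2))
```

Under these assumptions two identical tables score 0.0, because every common-keyword term then contains log(1).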
Near Duplicate Detection (cont'd)
• Web documents with a similarity score greater than a predefined threshold are considered near duplicates.

• These near duplicate pages are not added to the repository of the search engine.
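The decision rules from the similarity score slides can be combined into one small function, sketched below. It assumes that "match" means the two sets of prime keywords are equal, that the first `prime_count` entries of each sorted table are the prime keywords, and that the scoring function is supplied by the caller; `classify_page`, `prime_count` and the returned labels are hypothetical.

```python
from typing import Callable, List, Tuple

Table = List[Tuple[str, int]]  # (keyword, count), sorted by descending count


def classify_page(
    new: Table,
    stored: Table,
    score: Callable[[Table, Table], float],
    threshold: float,
    prime_count: int = 3,
) -> str:
    """Return "add", "duplicate" or "near-duplicate" for a new page."""
    new_keys = [kw for kw, _ in new]
    stored_keys = [kw for kw, _ in stored]
    # Prime keywords differ: the page is new and is added to the repository.
    if set(new_keys[:prime_count]) != set(stored_keys[:prime_count]):
        return "add"
    # All keywords are the same: the page is a duplicate.
    if set(new_keys) == set(stored_keys):
        return "duplicate"
    # Prime keywords are the same: compare the score with the threshold.
    return "near-duplicate" if score(stored, new) > threshold else "add"


stored = [("web", 3), ("crawl", 2), ("index", 1)]
new = [("web", 3), ("crawl", 2), ("page", 1)]
# Stub score for illustration only.
print(classify_page(new, stored, lambda t1, t2: 5.0, threshold=1.0, prime_count=2))
```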
Advantages
• Saves network bandwidth

• Reduces the storage cost of search engines

• Improves the quality of the search index
Conclusion
• The proposed method addresses difficulties of information retrieval from the web.

• The approach detects near duplicate web pages efficiently, based on the keywords extracted from the web pages.

• It reduces the memory space needed for web repositories.

• Near duplicate detection improves search engine quality.
Reference
• Brin, S., Davis, J. and Garcia-Molina, H. (1995) "Copy detection mechanisms for digital documents", In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), ACM Press.

• Pandey, S. and Olston, C. (2005) "User-centric Web crawling", Proceedings of the 14th International Conference on World Wide Web, pp. 401 - 41

• Xiao, C., Wang, W., Lin, X. and Xu Yu, J. (2008) "Efficient Similarity Joins for Near Duplicate Detection", Proceedings of the 17th International Conference on World Wide Web, pp. 131-140.

• Lovins, J.B. (1968) "Development of a stemming algorithm", Mechanical Translation and Computational Linguistics.
Questions

Thank you
