SlideShare a Scribd company logo
1 of 8
Heritrix DecideRules
Roger G. Coram
Web Crawl Engineer
2
DecideRuleSequence
A list of DecideRules:
• Processed in order
• *Every rule is processed
• Result will be:
• ACCEPT: URI is rule in scope
• REJECT: URI is ruled out of scope
• PASS: DecideRule has no effect
*DecideRules can have a onlyDecision method to skip processing if they can’t change the outcome.
3
UK Domain Crawl 2014
RejectDecideRule REJECT everything by default
SurtPrefixedDecideRule ACCEPT .uk, london, etc.
MatchesRegexDecideRule ACCEPT; try to capture media files
HopsPathMatchesRegexDecideRule ACCEPT anything embedded on a seed; ^E*$
*ExternalGeoLocationDecideRule ACCEPT IP addresses in GB
*OnDomainsDecideRule ACCEPT specific domains; disabled by default
*HopsPathMatchesRegexDecideRule ACCEPT redirects from seeds; ^R+$
CompressibilityDecideRule REJECT highly (in)compressible URIs; experimental
*TooManyHopsDecideRule REJECT URIs more than 20 hops from a seed
*MatchesListRegexDecideRule REJECT specific patterns
PathologicalPathDecideRule REJECT URIs with more than 3 recurrences of a pattern
TooManyPathSegmentsDecideRule REJECT URIs with more than 15 path segments
SurtPrefixedDecideRule ACCEPT URIs matching a list of URL-shortening services
SurtPrefixedDecideRule REJECT a list of SURTs from a file—exclude.txt
PrerequisiteAcceptDecideRule ACCEPT prerequisites
4
UK Domain Crawl 2014
Basic flow:
• REJECT everything.
• Look for reasons to ACCEPT content.
• REJECT anything you absolutely do not want.
Those marked with a ‘*’ are specified using Spring’s <ref bean…/> syntax. This
facilitates changing their values dynamically or with Sheets.
Those struck out are disabled by default (and typically enabled using Sheets for
specific sites).
Experimental DecideRules:
• ExternalGeoLocationDecideRule: found 2,544,426 new hosts.
• CompressibilityDecideRule: REJECTed 1,650,861 URIs.
5
Beyond Scoping
All Processors have a shouldProcessRule property—you can use DecideRules
instead of simple true/false values.
We use this to filter viral content:
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.AcceptDecideRule"/>
<bean class="uk.bl.wap.modules.deciderules.AnnotationMatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regexList">
<list>
<value>^.*stream:.+FOUND.*$</value>
</list>
</property>
</bean>
</list>
6
logging.properties
In the logging.properties file:
org.archive.modules.deciderules.DecideRuleSequence.level=FINEST
This will output the decision of every DecideRule for every URI.
This will generate a lot of log entries.
Only practical for small crawls or testing.
scope.log
Alternatively, a recent addition to the DecideRuleSequence:
<property name="logToFile" value="true" />
This will create a file, scope.log, containing the final decision for every
URI along with the specific rule which made that decision:
2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull
7
List of DecideRules
AcceptDecideRule
ContentLengthDecideRule
ContentTypeMatchesRegexDecideRule
ContentTypeNotMatchesRegexDecideRule
ExternalGeoLocationDecideRule
FetchStatusDecideRule
FetchStatusMatchesRegexDecideRule
FetchStatusNotMatchesRegexDecideRule
HasViaDecideRule
HopCrossesAssignmentLevelDomainDecideRule
HopsPathMatchesRegexDecideRule
IpAddressSetDecideRule
MatchesFilePatternDecideRule
MatchesListRegexDecideRule
MatchesRegexDecideRule
MatchesStatusCodeDecideRule
NotMatchesFilePatternDecideRule
NotMatchesListRegexDecideRule
NotMatchesRegexDecideRule
NotMatchesStatusCodeDecideRule
PathologicalPathDecideRule
PredicatedDecideRule
PrerequisiteAcceptDecideRule
RejectDecideRule
ResourceLongerThanDecideRule
ResourceNoLongerThanDecideRule
ResponseContentLengthDecideRule
SchemeNotInSetDecideRule
ScriptedDecideRule
SeedAcceptDecideRule
TooManyHopsDecideRule
TooManyPathSegmentsDecideRule
TransclusionDecideRule
ViaSurtPrefixedDecideRule
IdenticalDigestDecideRule
NotOnDomainsDecideRule
NotOnHostsDecideRule
NotSurtPrefixedDecideRule
OnDomainsDecideRule
OnHostsDecideRule
SurtPrefixedDecideRule
8

More Related Content

Similar to Heritrix DecideRules

Seedhack MongoDB 2011
Seedhack MongoDB 2011Seedhack MongoDB 2011
Seedhack MongoDB 2011Rainforest QA
 
Testing at scale with Redis queues
Testing at scale with Redis queuesTesting at scale with Redis queues
Testing at scale with Redis queuesAAron EvaNS
 
Testing at scale with redis queues
Testing at scale with redis queuesTesting at scale with redis queues
Testing at scale with redis queuesAAron EvaNS
 
DNN Database Tips & Tricks
DNN Database Tips & TricksDNN Database Tips & Tricks
DNN Database Tips & TricksWill Strohl
 
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFSHadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFSErik Krogen
 
Scaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersScaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersJonathan Levin
 
REST in Piece - Administration of an Oracle Cluster/Database using REST
REST in Piece - Administration of an Oracle Cluster/Database using RESTREST in Piece - Administration of an Oracle Cluster/Database using REST
REST in Piece - Administration of an Oracle Cluster/Database using RESTChristian Gohmann
 
Pure Speed Drupal 4 Gov talk
Pure Speed Drupal 4 Gov talkPure Speed Drupal 4 Gov talk
Pure Speed Drupal 4 Gov talkBryan Ollendyke
 
10 Ways to Scale Your Website Silicon Valley Code Camp 2019
10 Ways to Scale Your Website Silicon Valley Code Camp 201910 Ways to Scale Your Website Silicon Valley Code Camp 2019
10 Ways to Scale Your Website Silicon Valley Code Camp 2019Dave Nielsen
 
Breaking Dependencies To Allow Unit Testing - Steve Smith | FalafelCON 2014
Breaking Dependencies To Allow Unit Testing - Steve Smith | FalafelCON 2014Breaking Dependencies To Allow Unit Testing - Steve Smith | FalafelCON 2014
Breaking Dependencies To Allow Unit Testing - Steve Smith | FalafelCON 2014FalafelSoftware
 
Breaking Dependencies to Allow Unit Testing
Breaking Dependencies to Allow Unit TestingBreaking Dependencies to Allow Unit Testing
Breaking Dependencies to Allow Unit TestingSteven Smith
 
Building node.js applications with Database Jones
Building node.js applications with Database JonesBuilding node.js applications with Database Jones
Building node.js applications with Database JonesJohn David Duncan
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systexJames Chen
 
Professional SharePoint Solution Deployment with PowerShell
Professional SharePoint Solution Deployment with PowerShellProfessional SharePoint Solution Deployment with PowerShell
Professional SharePoint Solution Deployment with PowerShellMatthias Einig
 
Performance Test Driven Development with Oracle Coherence
Performance Test Driven Development with Oracle CoherencePerformance Test Driven Development with Oracle Coherence
Performance Test Driven Development with Oracle Coherencearagozin
 
2008 Collaborate IOUG Presentation
2008 Collaborate IOUG Presentation2008 Collaborate IOUG Presentation
2008 Collaborate IOUG PresentationBiju Thomas
 
Java colombo-deep-dive-into-jax-rs
Java colombo-deep-dive-into-jax-rsJava colombo-deep-dive-into-jax-rs
Java colombo-deep-dive-into-jax-rsSagara Gunathunga
 
Fine-grained authorization with XACML
Fine-grained authorization with XACMLFine-grained authorization with XACML
Fine-grained authorization with XACMLPrabath Siriwardena
 

Similar to Heritrix DecideRules (20)

Seedhack MongoDB 2011
Seedhack MongoDB 2011Seedhack MongoDB 2011
Seedhack MongoDB 2011
 
Testing at scale with Redis queues
Testing at scale with Redis queuesTesting at scale with Redis queues
Testing at scale with Redis queues
 
Testing at scale with redis queues
Testing at scale with redis queuesTesting at scale with redis queues
Testing at scale with redis queues
 
DNN Database Tips & Tricks
DNN Database Tips & TricksDNN Database Tips & Tricks
DNN Database Tips & Tricks
 
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFSHadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
 
Scaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersScaling MySQL Strategies for Developers
Scaling MySQL Strategies for Developers
 
REST in Piece - Administration of an Oracle Cluster/Database using REST
REST in Piece - Administration of an Oracle Cluster/Database using RESTREST in Piece - Administration of an Oracle Cluster/Database using REST
REST in Piece - Administration of an Oracle Cluster/Database using REST
 
Pure Speed Drupal 4 Gov talk
Pure Speed Drupal 4 Gov talkPure Speed Drupal 4 Gov talk
Pure Speed Drupal 4 Gov talk
 
10 Ways to Scale Your Website Silicon Valley Code Camp 2019
10 Ways to Scale Your Website Silicon Valley Code Camp 201910 Ways to Scale Your Website Silicon Valley Code Camp 2019
10 Ways to Scale Your Website Silicon Valley Code Camp 2019
 
Breaking Dependencies To Allow Unit Testing - Steve Smith | FalafelCON 2014
Breaking Dependencies To Allow Unit Testing - Steve Smith | FalafelCON 2014Breaking Dependencies To Allow Unit Testing - Steve Smith | FalafelCON 2014
Breaking Dependencies To Allow Unit Testing - Steve Smith | FalafelCON 2014
 
Breaking Dependencies to Allow Unit Testing
Breaking Dependencies to Allow Unit TestingBreaking Dependencies to Allow Unit Testing
Breaking Dependencies to Allow Unit Testing
 
Building node.js applications with Database Jones
Building node.js applications with Database JonesBuilding node.js applications with Database Jones
Building node.js applications with Database Jones
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 
Professional SharePoint Solution Deployment with PowerShell
Professional SharePoint Solution Deployment with PowerShellProfessional SharePoint Solution Deployment with PowerShell
Professional SharePoint Solution Deployment with PowerShell
 
Performance Test Driven Development with Oracle Coherence
Performance Test Driven Development with Oracle CoherencePerformance Test Driven Development with Oracle Coherence
Performance Test Driven Development with Oracle Coherence
 
2008 Collaborate IOUG Presentation
2008 Collaborate IOUG Presentation2008 Collaborate IOUG Presentation
2008 Collaborate IOUG Presentation
 
Java colombo-deep-dive-into-jax-rs
Java colombo-deep-dive-into-jax-rsJava colombo-deep-dive-into-jax-rs
Java colombo-deep-dive-into-jax-rs
 
Configuration management with Chef
Configuration management with ChefConfiguration management with Chef
Configuration management with Chef
 
Fine-grained authorization with XACML
Fine-grained authorization with XACMLFine-grained authorization with XACML
Fine-grained authorization with XACML
 
Hadoop 설치
Hadoop 설치Hadoop 설치
Hadoop 설치
 

Recently uploaded

Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleanscorenetworkseo
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
Intellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxIntellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxBipin Adhikari
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 

Recently uploaded (20)

Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
Intellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptxIntellectual property rightsand its types.pptx
Intellectual property rightsand its types.pptx
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 

Heritrix DecideRules