4. Configuring Nutch
• Edit the file conf/regex-urlfilter.txt and replace the catch-all rule
# accept anything else
+.
with a regular expression matching the domain you wish to crawl.
• For example, to crawl only the nutch.apache.org domain:
+^http://([a-z0-9]*\.)*nutch\.apache\.org/
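To see which URLs a rule like the one above would let through, it can be tried out with plain java.util.regex before starting a crawl. This is only a sketch: the class name UrlFilterSketch and the method accepts are made up for illustration, the dots in the pattern are escaped so they match literal dots only, and it assumes the filter applies the pattern with find() semantics on the full URL. It does not call any Nutch code.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // The pattern from the "+" rule, without the leading "+".
    // Dots are escaped so "." cannot stand in for an arbitrary character.
    static final Pattern ALLOWED =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch\\.apache\\.org/");

    // Hypothetical helper: true if the URL would pass this single rule.
    // Assumes find() semantics; the "^" anchors the match at the URL start.
    static boolean accepts(String url) {
        return ALLOWED.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://nutch.apache.org/",
            "http://wiki.nutch.apache.org/FrontPage",
            "http://example.com/",
            "http://nutchxapache.org/"
        };
        for (String url : urls) {
            System.out.println(url + " accepted: " + accepts(url));
        }
    }
}
```

The last URL is the interesting case: with unescaped dots the pattern would accept it, because "." matches any character.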
8. Scaling Nutch
• HDFS – scales storage
• MapReduce – scales crawling
• Gora – scales the back end
9. Gora
• Data Persistence : Persisting objects to column stores such as
HBase, Cassandra, Hypertable; key-value stores such as Voldemort,
Redis, etc.; SQL databases such as MySQL, HSQLDB; and flat files in
the local file system or Hadoop HDFS
• Data Access : Java-friendly API for accessing the data regardless of
its location
• Indexing : Solr
• Analysis : Apache Pig, Apache Hive and Cascading
• MapReduce support
14. Selenium (with demo)
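import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

// Start a Firefox session
WebDriver driver = new FirefoxDriver();
// Go to the login page
driver.get("https://mysite.com");
// Fill in the username field
WebElement field = driver.findElement(By.name("username-element"));
field.sendKeys("your-user-name");
// Fill in the password field
field = driver.findElement(By.name("password-element"));
field.sendKeys("real-password");
// Run the page's login script (placeholder name); no "javascript:" prefix is needed
((JavascriptExecutor) driver).executeScript("whateverLoginScript();");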