Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)

Mark Kerzner

Presented at Houston Hadoop Meetup in March '14

Houston Hadoop Meetup
2/12/14
Nutch + Hadoop with
Selenium and Burp
By Mark Kerzner, Elephant Scale

Nutch story
• Created by Doug Cutting to crawl the web

• Not scalable
• Enter HDFS
• Nutch on HDFS
• Nutch on Hadoop
• Nutch 1.x, Nutch 2.x

Nutch 1.x
• Local or HDFS
• Command-line
• Crawl-db

Configuring Nutch
• Edit the file conf/regex-urlfilter.txt and replace the default rule

# accept anything else
+.

• with a regular expression matching the domain you wish to crawl.
• For example, to crawl only the nutch.apache.org domain:

+^http://([a-z0-9]*\.)*nutch.apache.org/
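
The effect of such a rule can be tried out in plain Java. This is a minimal sketch, assuming the filter applies the text after the leading `+` as a Java regular expression and accepts a URL when the pattern is found in it. The class and method names are made up for illustration and are not part of Nutch. It also shows why the dot after the subdomain group should be escaped (`\.`): with a bare `.`, the group can swallow an unrelated host.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: checks URLs against the accept rule.
class UrlFilterSketch {
    // The rule with the '+' prefix removed and the subdomain dot escaped.
    static final Pattern ACCEPT =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // The same rule with an unescaped dot, which matches any character.
    static final Pattern LOOSE =
            Pattern.compile("^http://([a-z0-9]*.)*nutch.apache.org/");

    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    static boolean acceptsLoose(String url) {
        return LOOSE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://nutch.apache.org/",
            "http://wiki.nutch.apache.org/docs",
            "http://example.com/",
            "http://evil.com/nutch.apache.org/"
        };
        for (String url : urls) {
            System.out.println(url + " escaped=" + accepts(url)
                    + " loose=" + acceptsLoose(url));
        }
    }
}
```

The last URL is the interesting one: the escaped rule rejects it, while the loose rule lets it through.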

Scaling Nutch
• HDFS – scaling storage
• MapReduce – scale crawling
• Gora – scale back end
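
One way to picture how a crawl can be spread over parallel tasks is to group URLs by host, so that each task fetches whole hosts. The sketch below only illustrates that idea in plain Java. It is not Nutch's partitioner, and the class name, method names, and URLs are assumptions.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: assign each URL to a fetch task by its host.
class HostPartitionSketch {
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        // Fall back to the whole URL when no host can be parsed.
        String key = (host != null) ? host : url;
        // floorMod keeps the result in [0, numPartitions) for negative hashes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    static Map<Integer, List<String>> partition(List<String> urls, int numPartitions) {
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String url : urls) {
            buckets.computeIfAbsent(partitionFor(url, numPartitions),
                    k -> new ArrayList<>()).add(url);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://nutch.apache.org/a",
                "http://nutch.apache.org/b",
                "http://example.com/",
                "http://example.org/index.html");
        partition(urls, 3).forEach(
                (task, list) -> System.out.println("task " + task + ": " + list));
    }
}
```

URLs from the same host always land in the same bucket, which keeps per-host rate limiting local to one task.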

Gora
• Data Persistence: persisting objects to column stores such as HBase, Cassandra, and Hypertable; key-value stores such as Voldemort and Redis; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS
• Data Access: Java-friendly API for accessing the data regardless of its location
• Indexing: Solr
• Analysis: Apache Pig, Apache Hive, and Cascading
• MapReduce support
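
The "regardless of its location" point can be sketched as code written against a small store interface, with the backend chosen separately. Everything below is hypothetical and is not Gora's API: `PageStore`, `InMemoryPageStore`, and `storeAll` are made-up names, and the in-memory map stands in for a real backend.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a backend-agnostic store; not Gora's actual API.
class StoreSketch {
    interface PageStore {
        void put(String url, String content);
        String get(String url);          // returns null when the URL is absent
        boolean delete(String url);      // returns true when something was removed
    }

    // Stand-in backend; a real one would talk to HBase, Cassandra, SQL, etc.
    static class InMemoryPageStore implements PageStore {
        private final Map<String, String> pages = new HashMap<>();

        public void put(String url, String content) { pages.put(url, content); }
        public String get(String url) { return pages.get(url); }
        public boolean delete(String url) { return pages.remove(url) != null; }
    }

    // Crawl-side code depends only on the interface, not on the backend.
    static int storeAll(PageStore store, Map<String, String> fetched) {
        int count = 0;
        for (Map.Entry<String, String> page : fetched.entrySet()) {
            store.put(page.getKey(), page.getValue());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> fetched = new LinkedHashMap<>();
        fetched.put("http://example.com/", "<html>home</html>");
        fetched.put("http://example.com/about", "<html>about</html>");

        PageStore store = new InMemoryPageStore();
        System.out.println("stored " + storeAll(store, fetched) + " pages");
        System.out.println(store.get("http://example.com/about"));
    }
}
```

Swapping the backend then means providing another `PageStore` implementation, with `storeAll` left unchanged.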

Passwords? – Oops!

1. Burp + HttpClient
2. Selenium + Java

HttpClient
CloseableHttpClient httpclient = HttpClients.createDefault();
try {
    HttpPost httpPost = new HttpPost(getUrl());
    // put in all custom headers
    Map<String, String> headers = getHeaders();
    for (Map.Entry<String, String> header : headers.entrySet()) {
        httpPost.addHeader(header.getKey(), header.getValue());
    }
    // send the POST body as UTF-8 bytes
    HttpEntity entity = new ByteArrayEntity(getPostBody().getBytes("UTF-8"));
    httpPost.setEntity(entity);
    // read the response before the client is closed
    response = httpclient.execute(httpPost);
} finally {
    httpclient.close();
}
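
The same replay idea can be tried without extra libraries. The sketch below swaps in the JDK's built-in `java.net.http` API (Java 11+) in place of Apache HttpClient, and only builds the request so it can be inspected before anything is sent. The class name, URL, header values, and body are placeholders, not values from the talk. The JDK refuses some header names, such as `Host`, so the sketch skips any header the builder rejects.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rebuild a captured POST request with custom headers.
class ReplayRequestSketch {
    static HttpRequest buildPost(String url, Map<String, String> headers, String body) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url));
        for (Map.Entry<String, String> header : headers.entrySet()) {
            try {
                builder.header(header.getKey(), header.getValue());
            } catch (IllegalArgumentException e) {
                // The JDK client sets some headers itself and rejects them here; skip.
            }
        }
        return builder
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "application/x-www-form-urlencoded");
        headers.put("Cookie", "session=placeholder");
        headers.put("Host", "example.com"); // typically present in a capture; skipped

        HttpRequest request = buildPost(
                "https://example.com/login", headers, "user=demo&pass=demo");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.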

Browser interaction? – Oops!

Selenium
Selenium + Java

Selenium (with demo)
WebDriver driver = new FirefoxDriver();
// Go to the login page
driver.get("https://mysite.com");

// put in the username
WebElement query = driver.findElement(By.name("username-element"));
query.sendKeys("your-user-name");
// put in the password
query = driver.findElement(By.name("password-element"));
query.sendKeys("real-password");
// submit the form by calling the page's own login script
((JavascriptExecutor) driver).executeScript("javascript:whatever-login-script();");

5. Nutch architecture

6. Solr integration

7. Solr Application (FreeEed, demo)

11. Burp (with demo)
