SlideShare a Scribd company logo
1 of 20
Web Scraping 101
with F.O.S. Goutte
By Joshua Copeland
About Me
● CTO of Engaged Nation
● PHP Developer for 6+ years
● Java, .NET, and C/++ exp.
● Serial Entrepreneur
● Prior Real Estate Agent
● ♥’s Family, Tech, & Skating
● Self Proclaimed Computer
Josh Copeland
@OGProgrammer
What is Web Scraping?
Web scraping is the process of automatically collecting information from the web.
Requires breakthroughs in text processing, semantic understanding, artificial
intelligence and human-computer interactions.
Current web scraping solutions range from the ad-hoc, requiring human effort, to
fully automated systems that are able to convert entire sites into structured
information, with limitations.
Traditional Methods
In 2009 there was no “all-in-one” library with
both an HTTP Client & a HTML Parser. These
where your choices in PHP back then:
➔ Tidy Extension
Wasn’t designed for extraction, only
HTML error fixing.
➔ DOM
➔ SimpleXML
➔ XMLReader
➔ CSS Selectors
Works fine for HTML parsing but isn’t a
crawler.
Introducing Goutte!
A simple PHP Web Scraper.
Did you know?
Goutte was built by
Fabien Potencier who
also built the Symfony
Framework.
FriendsOfSymfony is the
group that maintains this
package and others in
the Symfony world.
Examples from this presentation available at
https://github.com/php-vegas/web-scraper-
examples
What does Goutte use?
● Symfony Components
a. BrowserKit
b. CssSelector
c. DomCrawler
● Guzzle HTTP Component.
Did you know?
Fabien Potencier also
built these Symfony
components.
You should check out
his github profile where
his username is “fabpot”.
He’s kind of a big deal.
What does Goutte do
● Uses Guzzle (cURL, streams, sockets, or event loops)
○ GET/POST Requests
● Fine tune cURL settings
● Follow links - Crawl the site
● Extract data
○ XPath, CssSelector
● Submit forms
○ Login!
What Goutte doesn’t do
● Does not interpret the response in any way.
○ Will not execute JavaScript
■ Which means no AJAX
● Could simulate the AJAX request
■ Try Google cached versions of the site
■ Use PhantomJS, Spiderling, CasperJS, Selenium
● Can’t render or screenshot the page
○ Could save the HTML & assets
Let’s get started!
What you’ll need
➔ Recommend using Composer
Easiest way to install PHP libraries
➔ Alternatively could use PHAR
Available releases on their GitHub
➔ Version 3
◆ PHP 5.5+
◆ Guzzle 6+
➔ Version 2
◆ PHP 5.4
◆ Guzzle 4-5
➔ Version 1
◆ PHP 5.3
◆ Guzzle 3
Require Goutte in your project
composer require fabpot/goutte
Basic Example
use GoutteClient;
$client = new Client();
// Go to the symfony.com website
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');
// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
// Get the latest post in this category and display the titles
$crawler->filter('h2 > a')->each(function ($node) {
print $node->text()."n";
});
Guzzle Settings Example
use GoutteClient;
use GuzzleHttpClient as GuzzleClient;
// Create the guzzle client with your default options
$guzzle = new GuzzleClient(
array(
// base_uri isn't supported due to BrowserKit, anyone want to make a PR on github for this?
// 'base_uri' => 'https://www.symfony.com',
'timeout' => 0,
'allow_redirects' => false,
'cookies' => true,
// Proxy from proxylist.hidemyass.com
'proxy' => 'tcp://63.150.152.151:3128'
)
);
$client = new Client();
$client->setClient($guzzle);
Check out all the Guzzle options at http://docs.guzzlephp.org/en/latest/request-options.html
Basic HTTP Auth Example
$client = new Client();
// Params are username, password, and auth type (basic & digest)
$client->setAuth('test', 'test', 'basic');
$crawler = $client->request('GET', 'http://browserspy.dk/password-ok.php');
print $client->getResponse()->getStatus();
// 401 = no good, 200 = happy
Form Login Example
$crawler = $client->request('GET', 'http://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
print $node->text()."n";
});
// outputs “You can't perform that action at this time.”
Getting info from the Response
// Get the URI
print 'Request URI : ' . $crawler->getUri() . PHP_EOL;
// Get the SymfonyComponentBrowserKitResponse object
$response = $client->getResponse();
// Get important stuff out of the Response object
$status = $response->getStatus();
$content = $response->getContent();
$headers = $response->getHeaders();
Watch out for...
● DDOSing a site, put a sleep(x) between calls
○ Good way to get your IP banned, use a proxy
● Pulling bad/malformed data
○ Write tests to make sure this doesn’t happen
● Fetching elements by unique IDs, hashes, etc
○ Get creative, find an RSS feed, API, or structured data
● Protections against scrapping like javascript or AJAX
○ Some buttons have JS events attached
More examples in GitHub
This presentation + dell product info scraper
https://github.com/php-vegas/web-scraper-
examples
CSRF Scanner
https://github.com/marlon-be/marlon-csrfscanner
There are extensions for WP, Laravel, mink, & others.
Just search pacakgist.org for “goutte”.
In theory you could use PHPv8 (v8js engine)
to execute javascript and create a handler aka
middleware in Guzzle.
This would be awesome but there are existing
projects out there that already do this. Just
that PHP doesn’t have a way right now.
What were those projects?
● PhantomJS
● Spiderling
● CasperJS
● Selenium
Questions?
@OGProgrammer
Rate this talk
http://spkr8.com/t/68291

More Related Content

What's hot

Web Scraping In Ruby Utosc 2009.Key
Web Scraping In Ruby Utosc 2009.KeyWeb Scraping In Ruby Utosc 2009.Key
Web Scraping In Ruby Utosc 2009.Keyjtzemp
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in PythonSatwik Kansal
 
High Performance Social Plugins
High Performance Social PluginsHigh Performance Social Plugins
High Performance Social PluginsStoyan Stefanov
 
Website Performance Basics
Website Performance BasicsWebsite Performance Basics
Website Performance Basicsgeku
 
Progressive Downloads and Rendering - take #2
Progressive Downloads and Rendering - take #2Progressive Downloads and Rendering - take #2
Progressive Downloads and Rendering - take #2Stoyan Stefanov
 
The 5 most common reasons for a slow WordPress site and how to fix them
The 5 most common reasons for a slow WordPress site and how to fix themThe 5 most common reasons for a slow WordPress site and how to fix them
The 5 most common reasons for a slow WordPress site and how to fix themOtto Kekäläinen
 
REST, the internet as a database?
REST, the internet as a database?REST, the internet as a database?
REST, the internet as a database?Andrej Koelewijn
 
JavaScript Performance Patterns
JavaScript Performance PatternsJavaScript Performance Patterns
JavaScript Performance PatternsStoyan Stefanov
 
JavaScript performance patterns
JavaScript performance patternsJavaScript performance patterns
JavaScript performance patternsStoyan Stefanov
 
moma-django overview --> Django + MongoDB: building a custom ORM layer
moma-django overview --> Django + MongoDB: building a custom ORM layermoma-django overview --> Django + MongoDB: building a custom ORM layer
moma-django overview --> Django + MongoDB: building a custom ORM layerGadi Oren
 
Put a little Backbone in your WordPress vs. 3
Put a little Backbone in your WordPress vs. 3Put a little Backbone in your WordPress vs. 3
Put a little Backbone in your WordPress vs. 3adamsilverstein
 
Scaling my sql_in_3d
Scaling my sql_in_3dScaling my sql_in_3d
Scaling my sql_in_3dsarahnovotny
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with PythonMaris Lemba
 
Data Visualization on the Web - Intro to D3
Data Visualization on the Web - Intro to D3Data Visualization on the Web - Intro to D3
Data Visualization on the Web - Intro to D3Angela Zoss
 
all data everywhere
all data everywhereall data everywhere
all data everywheresarahnovotny
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python Viren Rajput
 
HTML5 Overview
HTML5 OverviewHTML5 Overview
HTML5 Overviewreybango
 
Develop High Performance Windows 8 Application with HTML5 and JavaScriptHigh ...
Develop High Performance Windows 8 Application with HTML5 and JavaScriptHigh ...Develop High Performance Windows 8 Application with HTML5 and JavaScriptHigh ...
Develop High Performance Windows 8 Application with HTML5 and JavaScriptHigh ...Doris Chen
 

What's hot (20)

Web Scraping In Ruby Utosc 2009.Key
Web Scraping In Ruby Utosc 2009.KeyWeb Scraping In Ruby Utosc 2009.Key
Web Scraping In Ruby Utosc 2009.Key
 
Beautiful soup
Beautiful soupBeautiful soup
Beautiful soup
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in Python
 
High Performance Social Plugins
High Performance Social PluginsHigh Performance Social Plugins
High Performance Social Plugins
 
Website Performance Basics
Website Performance BasicsWebsite Performance Basics
Website Performance Basics
 
Progressive Downloads and Rendering - take #2
Progressive Downloads and Rendering - take #2Progressive Downloads and Rendering - take #2
Progressive Downloads and Rendering - take #2
 
The 5 most common reasons for a slow WordPress site and how to fix them
The 5 most common reasons for a slow WordPress site and how to fix themThe 5 most common reasons for a slow WordPress site and how to fix them
The 5 most common reasons for a slow WordPress site and how to fix them
 
REST, the internet as a database?
REST, the internet as a database?REST, the internet as a database?
REST, the internet as a database?
 
JavaScript Performance Patterns
JavaScript Performance PatternsJavaScript Performance Patterns
JavaScript Performance Patterns
 
JavaScript performance patterns
JavaScript performance patternsJavaScript performance patterns
JavaScript performance patterns
 
moma-django overview --> Django + MongoDB: building a custom ORM layer
moma-django overview --> Django + MongoDB: building a custom ORM layermoma-django overview --> Django + MongoDB: building a custom ORM layer
moma-django overview --> Django + MongoDB: building a custom ORM layer
 
Put a little Backbone in your WordPress vs. 3
Put a little Backbone in your WordPress vs. 3Put a little Backbone in your WordPress vs. 3
Put a little Backbone in your WordPress vs. 3
 
Scaling my sql_in_3d
Scaling my sql_in_3dScaling my sql_in_3d
Scaling my sql_in_3d
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with Python
 
Data Visualization on the Web - Intro to D3
Data Visualization on the Web - Intro to D3Data Visualization on the Web - Intro to D3
Data Visualization on the Web - Intro to D3
 
all data everywhere
all data everywhereall data everywhere
all data everywhere
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python
 
Html5 Overview
Html5 OverviewHtml5 Overview
Html5 Overview
 
HTML5 Overview
HTML5 OverviewHTML5 Overview
HTML5 Overview
 
Develop High Performance Windows 8 Application with HTML5 and JavaScriptHigh ...
Develop High Performance Windows 8 Application with HTML5 and JavaScriptHigh ...Develop High Performance Windows 8 Application with HTML5 and JavaScriptHigh ...
Develop High Performance Windows 8 Application with HTML5 and JavaScriptHigh ...
 

Similar to Web scraping 101 with goutte

PHP SA 2014 - Releasing Your Open Source Project
PHP SA 2014 - Releasing Your Open Source ProjectPHP SA 2014 - Releasing Your Open Source Project
PHP SA 2014 - Releasing Your Open Source Projectxsist10
 
Php on the Web and Desktop
Php on the Web and DesktopPhp on the Web and Desktop
Php on the Web and DesktopElizabeth Smith
 
PHP BASIC PRESENTATION
PHP BASIC PRESENTATIONPHP BASIC PRESENTATION
PHP BASIC PRESENTATIONkrutitrivedi
 
Google I/O 2012 - Protecting your user experience while integrating 3rd party...
Google I/O 2012 - Protecting your user experience while integrating 3rd party...Google I/O 2012 - Protecting your user experience while integrating 3rd party...
Google I/O 2012 - Protecting your user experience while integrating 3rd party...Patrick Meenan
 
Accelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
Accelerated Adoption: HTML5 and CSS3 for ASP.NET DevelopersAccelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
Accelerated Adoption: HTML5 and CSS3 for ASP.NET DevelopersTodd Anglin
 
Twas the night before Malware...
Twas the night before Malware...Twas the night before Malware...
Twas the night before Malware...DoktorMandrake
 
Mojolicious - A new hope
Mojolicious - A new hopeMojolicious - A new hope
Mojolicious - A new hopeMarcus Ramberg
 
Mojolicious. Веб в коробке!
Mojolicious. Веб в коробке!Mojolicious. Веб в коробке!
Mojolicious. Веб в коробке!Anatoly Sharifulin
 
PHP Web Development
PHP Web DevelopmentPHP Web Development
PHP Web Developmentgaplabs
 
A Holistic View of Website Performance
A Holistic View of Website PerformanceA Holistic View of Website Performance
A Holistic View of Website PerformanceRene Churchill
 
Web 2.0 Expo: Even Faster Web Sites
Web 2.0 Expo: Even Faster Web SitesWeb 2.0 Expo: Even Faster Web Sites
Web 2.0 Expo: Even Faster Web SitesSteve Souders
 
Use Xdebug to profile PHP
Use Xdebug to profile PHPUse Xdebug to profile PHP
Use Xdebug to profile PHPSeravo
 
Brian hogg word camp preparing a plugin for translation
Brian hogg   word camp preparing a plugin for translationBrian hogg   word camp preparing a plugin for translation
Brian hogg word camp preparing a plugin for translationwcto2017
 
Intro to Php Security
Intro to Php SecurityIntro to Php Security
Intro to Php SecurityDave Ross
 
Using Geeklog as a Web Application Framework
Using Geeklog as a Web Application FrameworkUsing Geeklog as a Web Application Framework
Using Geeklog as a Web Application FrameworkDirk Haun
 
Apache and PHP: Why httpd.conf is your new BFF!
Apache and PHP: Why httpd.conf is your new BFF!Apache and PHP: Why httpd.conf is your new BFF!
Apache and PHP: Why httpd.conf is your new BFF!Jeff Jones
 
jQuery Features to Avoid
jQuery Features to AvoidjQuery Features to Avoid
jQuery Features to Avoiddmethvin
 

Similar to Web scraping 101 with goutte (20)

PHP SA 2014 - Releasing Your Open Source Project
PHP SA 2014 - Releasing Your Open Source ProjectPHP SA 2014 - Releasing Your Open Source Project
PHP SA 2014 - Releasing Your Open Source Project
 
Php on the Web and Desktop
Php on the Web and DesktopPhp on the Web and Desktop
Php on the Web and Desktop
 
PHP BASIC PRESENTATION
PHP BASIC PRESENTATIONPHP BASIC PRESENTATION
PHP BASIC PRESENTATION
 
Google I/O 2012 - Protecting your user experience while integrating 3rd party...
Google I/O 2012 - Protecting your user experience while integrating 3rd party...Google I/O 2012 - Protecting your user experience while integrating 3rd party...
Google I/O 2012 - Protecting your user experience while integrating 3rd party...
 
Accelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
Accelerated Adoption: HTML5 and CSS3 for ASP.NET DevelopersAccelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
Accelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
 
Twas the night before Malware...
Twas the night before Malware...Twas the night before Malware...
Twas the night before Malware...
 
Mojolicious - A new hope
Mojolicious - A new hopeMojolicious - A new hope
Mojolicious - A new hope
 
Introduction to python scrapping
Introduction to python scrappingIntroduction to python scrapping
Introduction to python scrapping
 
Mojolicious. Веб в коробке!
Mojolicious. Веб в коробке!Mojolicious. Веб в коробке!
Mojolicious. Веб в коробке!
 
PHP Web Development
PHP Web DevelopmentPHP Web Development
PHP Web Development
 
A Holistic View of Website Performance
A Holistic View of Website PerformanceA Holistic View of Website Performance
A Holistic View of Website Performance
 
Web 2.0 Expo: Even Faster Web Sites
Web 2.0 Expo: Even Faster Web SitesWeb 2.0 Expo: Even Faster Web Sites
Web 2.0 Expo: Even Faster Web Sites
 
Use Xdebug to profile PHP
Use Xdebug to profile PHPUse Xdebug to profile PHP
Use Xdebug to profile PHP
 
Brian hogg word camp preparing a plugin for translation
Brian hogg   word camp preparing a plugin for translationBrian hogg   word camp preparing a plugin for translation
Brian hogg word camp preparing a plugin for translation
 
SearchMonkey
SearchMonkeySearchMonkey
SearchMonkey
 
Intro to Php Security
Intro to Php SecurityIntro to Php Security
Intro to Php Security
 
Using Geeklog as a Web Application Framework
Using Geeklog as a Web Application FrameworkUsing Geeklog as a Web Application Framework
Using Geeklog as a Web Application Framework
 
Apache and PHP: Why httpd.conf is your new BFF!
Apache and PHP: Why httpd.conf is your new BFF!Apache and PHP: Why httpd.conf is your new BFF!
Apache and PHP: Why httpd.conf is your new BFF!
 
jQuery Features to Avoid
jQuery Features to AvoidjQuery Features to Avoid
jQuery Features to Avoid
 
Api Design
Api DesignApi Design
Api Design
 

More from Joshua Copeland (7)

WooCommerce
WooCommerceWooCommerce
WooCommerce
 
Universal Windows Platform Overview
Universal Windows Platform OverviewUniversal Windows Platform Overview
Universal Windows Platform Overview
 
LVPHP.org
LVPHP.orgLVPHP.org
LVPHP.org
 
PHP Rocketeer
PHP RocketeerPHP Rocketeer
PHP Rocketeer
 
PHP 7
PHP 7PHP 7
PHP 7
 
Blackfire
BlackfireBlackfire
Blackfire
 
Lumen
LumenLumen
Lumen
 

Recently uploaded

Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 

Recently uploaded (20)

Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 

Web scraping 101 with goutte

  • 1. Web Scraping 101 with F.O.S. Goutte By Joshua Copeland
  • 2. About Me ● CTO of Engaged Nation ● PHP Developer for 6+ years ● Java, .NET, and C/++ exp. ● Serial Entrepreneur ● Prior Real Estate Agent ● ♥’s Family, Tech, & Skating ● Self Proclaimed Computer Josh Copeland @OGProgrammer
  • 3. What is Web Scraping? Web scraping is the process of automatically collecting information from the web. Requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. Current web scraping solutions range from the ad-hoc, requiring human effort, to fully automated systems that are able to convert entire sites into structured information, with limitations.
  • 4. Traditional Methods In 2009 there was no “all-in-one” library with both an HTTP Client & a HTML Parser. These where your choices in PHP back then: ➔ Tidy Extension Wasn’t designed for extraction, only HTML error fixing. ➔ DOM ➔ SimpleXML ➔ XMLReader ➔ CSS Selectors Works fine for HTML parsing but isn’t a crawler.
  • 5. Introducing Goutte! A simple PHP Web Scraper. Did you know? Goutte was built by Fabien Potencier who also built the Symfony Framework. FriendsOfSymfony is the group that maintains this package and others in the Symfony world. Examples from this presentation available at https://github.com/php-vegas/web-scraper- examples
  • 6. What does Goutte use? ● Symfony Components a. BrowserKit b. CssSelector c. DomCrawler ● Guzzle HTTP Component. Did you know? Fabien Potencier also built these Symfony components. You should check out his github profile where his username is “fabpot”. He’s kind of a big deal.
  • 7. What does Goutte do ● Uses Guzzle (cURL, streams, sockets, or event loops) ○ GET/POST Requests ● Fine tune cURL settings ● Follow links - Crawl the site ● Extract data ○ XPath, CssSelector ● Submit forms ○ Login!
  • 8. What Goutte doesn’t do ● Does not interpret the response in any way. ○ Will not execute JavaScript ■ Which means no AJAX ● Could simulate the AJAX request ■ Try Google cached versions of the site ■ Use PhantomJS, Spiderling, CasperJS, Selenium ● Can’t render or screenshot the page ○ Could save the HTML & assets
  • 10. What you’ll need ➔ Recommend using Composer Easiest way to install PHP libraries ➔ Alternatively could use PHAR Available releases on their GitHub ➔ Version 3 ◆ PHP 5.5+ ◆ Guzzle 6+ ➔ Version 2 ◆ PHP 5.4 ◆ Guzzle 4-5 ➔ Version 1 ◆ PHP 5.3 ◆ Guzzle 3
  • 11. Require Goutte in your project composer require fabpot/goutte
  • 12. Basic Example use GoutteClient; $client = new Client(); // Go to the symfony.com website $crawler = $client->request('GET', 'http://www.symfony.com/blog/'); // Click on the "Security Advisories" link $link = $crawler->selectLink('Security Advisories')->link(); $crawler = $client->click($link); // Get the latest post in this category and display the titles $crawler->filter('h2 > a')->each(function ($node) { print $node->text()."n"; });
  • 13. Guzzle Settings Example use GoutteClient; use GuzzleHttpClient as GuzzleClient; // Create the guzzle client with your default options $guzzle = new GuzzleClient( array( // base_uri isn't supported due to BrowserKit, anyone want to make a PR on github for this? // 'base_uri' => 'https://www.symfony.com', 'timeout' => 0, 'allow_redirects' => false, 'cookies' => true, // Proxy from proxylist.hidemyass.com 'proxy' => 'tcp://63.150.152.151:3128' ) ); $client = new Client(); $client->setClient($guzzle); Check out all the Guzzle options at http://docs.guzzlephp.org/en/latest/request-options.html
  • 14. Basic HTTP Auth Example $client = new Client(); // Params are username, password, and auth type (basic & digest) $client->setAuth('test', 'test', 'basic'); $crawler = $client->request('GET', 'http://browserspy.dk/password-ok.php'); print $client->getResponse()->getStatus(); // 401 = no good, 200 = happy
  • 15. Form Login Example $crawler = $client->request('GET', 'http://github.com/'); $crawler = $client->click($crawler->selectLink('Sign in')->link()); $form = $crawler->selectButton('Sign in')->form(); $crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx')); $crawler->filter('.flash-error')->each(function ($node) { print $node->text()."n"; }); // outputs “You can't perform that action at this time.”
  • 16. Getting info from the Response // Get the URI print 'Request URI : ' . $crawler->getUri() . PHP_EOL; // Get the SymfonyComponentBrowserKitResponse object $response = $client->getResponse(); // Get important stuff out of the Response object $status = $response->getStatus(); $content = $response->getContent(); $headers = $response->getHeaders();
  • 17. Watch out for... ● DDOSing a site, put a sleep(x) between calls ○ Good way to get your IP banned, use a proxy ● Pulling bad/malformed data ○ Write tests to make sure this doesn’t happen ● Fetching elements by unique IDs, hashes, etc ○ Get creative, find an RSS feed, API, or structured data ● Protections against scrapping like javascript or AJAX ○ Some buttons have JS events attached
  • 18. More examples in GitHub This presentation + dell product info scraper https://github.com/php-vegas/web-scraper- examples CSRF Scanner https://github.com/marlon-be/marlon-csrfscanner There are extensions for WP, Laravel, mink, & others. Just search pacakgist.org for “goutte”.
  • 19. In theory you could use PHPv8 (v8js engine) to execute javascript and create a handler aka middleware in Guzzle. This would be awesome but there are existing projects out there that already do this. Just that PHP doesn’t have a way right now. What were those projects? ● PhantomJS ● Spiderling ● CasperJS ● Selenium

Editor's Notes

  1. Http clients were pretty clunky too, thank you Amazon for sponsoring Guzzle!
  2. Allows you to make requests to sites and crawl the site (just as Google does) to “scrape” HTML for data.
  3. There is always manual scraping! Pay someone overseas?