The document provides an overview of Scrapy, an open-source and Python-based web scraping framework. It discusses Scrapy's key features such as being portable, simple, productive, extensible, and well-documented. The architecture is explained, including the typical project layout containing items, pipelines, settings, and spiders. Basic scraping operations are outlined involving defining items, spiders to extract data, and using pipelines for post-processing. XPath and regular expressions can be used for parsing pages within a spider's parse function. CrawlSpider is also introduced as a common type of spider that provides rules-based crawling.
2. Related Resources
Web Connection
Urllib2
Httplib2
Requests
Screen Scraping
lxml
XML parsing library (which also parses HTML) with a Pythonic API based on ElementTree; lxml itself is not part of the Python standard library
Beautiful Soup
Provides a few simple methods for navigating, searching and modifying a parse tree (see the short sketch after this list)
Automatically converts incoming documents to Unicode and outgoing documents to UTF-8
Sits on top of popular Python parsers like lxml and html5lib
Deals reasonably well with bad markup
One drawback: it's slow
Mechanize
Programmatic web browsing
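For flavor, a minimal Beautiful Soup sketch; the HTML snippet, with its unclosed tags, is made up for illustration:

    # Beautiful Soup copes with the unclosed tags in this snippet
    from bs4 import BeautifulSoup

    html = "<html><body><p class='intro'>Hello</p><a href='/next'>next"
    soup = BeautifulSoup(html, "lxml")   # sits on top of lxml (or html5lib)

    print soup.p.get_text()              # navigating: u'Hello'
    print soup.find("a")["href"]         # searching: '/next'
    soup.p["class"] = ["lead"]           # modifying the parse tree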
3. Why Scrapy?
Portable, open-source, 100% Python
Scrapy is written entirely in Python and runs on Linux, Windows, Mac and BSD
It currently supports only Python 2.6 and 2.7
Simple
Scrapy was designed with simplicity in mind, providing the features you need without getting in your way
Productive
Just write the rules to extract the data from web pages and let Scrapy
crawl the entire web site for you
Extensible
Scrapy was designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core
4. Batteries included
Scrapy comes with lots of functionality built in.
Well-documented & well-tested
Scrapy is extensively documented and has a comprehensive test suite with very good code coverage
Good community and commercial support
https://groups.google.com/forum/?fromgroups#!forum/scrapy-users
#scrapy @ freenode
Companies using Scrapy include:
Parsely
Direct Employers Foundation
Scrapinghub
6. Project layout
scrapy.cfg: the project configuration file
tutorial/: the project's Python module; you'll later import your code from here.
tutorial/items.py: the project’s items file.
tutorial/pipelines.py: the project’s pipelines file.
tutorial/settings.py: the project’s settings file.
tutorial/spiders/: a directory where you’ll later put your spiders.
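For reference, the scrapy startproject tutorial command generates this skeleton:

    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py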
7. Basic SOP
1. Define item: decide what to extract
2. Define Spider: decide crawling strategy
3. Define parse function: find patterns to extract data
4. Define pipeline: decide how to post-process the items
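A rough sketch of steps 1 and 4 (Scrapy 0.x API; the field names and the pipeline's behavior are illustrative, not from the talk):

    # items.py -- step 1: declare the fields you want to extract
    from scrapy.item import Item, Field

    class BlogPostItem(Item):
        title = Field()
        url = Field()

    # pipelines.py -- step 4: post-process every scraped item
    class StripTitlePipeline(object):
        def process_item(self, item, spider):
            item["title"] = item["title"].strip()
            return item

The pipeline still has to be enabled through the ITEM_PIPELINES setting in settings.py.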
9. Spider
Spiders are classes which define
how to perform the crawl
how to extract structured data from their pages
Built-in spiders
BaseSpider
The simplest spider, and the one from which every other spider must inherit
start_requests() generates Requests for the URLs specified in start_urls
The parse() method is the default callback for those Requests
parse()
Receives a response and returns either Item objects, Request objects, or an iterable of both
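A minimal sketch of such a spider, using the Scrapy 0.x API the talk targets; the spider name, URL and item field are illustrative:

    from scrapy.spider import BaseSpider
    from scrapy.item import Item, Field

    class PageItem(Item):
        url = Field()

    class ExampleSpider(BaseSpider):
        name = "example"
        start_urls = ["http://www.example.com/"]  # start_requests() turns these into Requests

        def parse(self, response):
            # default callback: return Items, Requests, or an iterable of both
            item = PageItem()
            item["url"] = response.url
            yield item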
10. Spider
CrawlSpider
This is the most commonly used spider for crawling regular
websites
It provides a convenient mechanism for following links by
defining a set of rules
rules: a list of one or more Rule objects.
Each Rule defines a certain behavior for crawling the site.
BaseSpider: the crawl order is up to your callbacks (e.g. a customized BFS)
CrawlSpider: follows links depth-first by default
Other built-in spiders
XMLFeedSpider, CSVFeedSpider, SitemapSpider
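A minimal CrawlSpider sketch under the same Scrapy 0.x assumptions; the domain and the two URL patterns are made up:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ExampleCrawlSpider(CrawlSpider):
        name = "example_crawl"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/"]

        rules = (
            # follow category pages, but extract nothing from them
            Rule(SgmlLinkExtractor(allow=(r"/category/",))),
            # parse item pages with the callback below
            Rule(SgmlLinkExtractor(allow=(r"/item/\d+",)), callback="parse_item"),
        )

        def parse_item(self, response):
            self.log("Item page: %s" % response.url)

Note that CrawlSpider uses parse() internally, so your callback must have a different name (parse_item here).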
11. Parse()
Selector
Scrapy comes with its own mechanism for extracting data: selectors.
XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
Scrapy Selectors are built on top of the libxml2 library,
the same library lxml uses, which means they are very similar in speed and parsing accuracy.
Selectors also support regular expressions via the re() method.
re() vs. XPath: extract() returns the selected nodes as strings, while re() applies a regular expression to them (see the sketch below).
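A sketch of both extraction styles inside a parse() callback (Scrapy 0.x selector API; the XPaths and the price pattern are illustrative):

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class PriceSpider(BaseSpider):
        name = "prices"
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # XPath: extract() returns the selected nodes as unicode strings
            titles = hxs.select("//h1/text()").extract()
            # re(): applies a regular expression to the selected text instead
            prices = hxs.select("//span[@class='price']/text()").re(r"\d+\.\d{2}")
            self.log("%d titles, %d prices" % (len(titles), len(prices)))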
12. XPath
Expression            Meaning
name                  matches all nodes at the current level with the specified name
name[n]               matches the nth element at the current level with the specified name
/                     as the first character, denotes the document root; otherwise moves down one level
//                    the current level and all sublevels, to any depth
*                     matches all nodes at the current level
. or ..               the current level / one level up
@name                 the attribute with the specified name
[@key='value']        all elements with an attribute matching the specified key/value pair
name[@key='value']    all elements with the specified name and a matching attribute
[text()='value']      all elements with the specified text
name[text()='value']  all elements with the specified name and text
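A few of the expressions above, exercised against a made-up page with the same Scrapy 0.x selector API:

    from scrapy.http import HtmlResponse
    from scrapy.selector import HtmlXPathSelector

    html = ("<html><body><div id='main'>"
            "<a href='/a'>first</a><a href='/b'>second</a>"
            "</div></body></html>")
    response = HtmlResponse(url="http://www.example.com/", body=html)
    hxs = HtmlXPathSelector(response)

    print hxs.select("//a/text()").extract()    # ['first', 'second']
    print hxs.select("//a[1]/@href").extract()  # ['/a']: first <a> at its level
    print hxs.select("//div[@id='main']/a[text()='second']/@href").extract()  # ['/b']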