The document provides an overview of Scrapy, an open-source and Python-based web scraping framework. It discusses Scrapy's key features such as being portable, simple, productive, extensible, and well-documented. The architecture is explained, including the typical project layout containing items, pipelines, settings, and spiders. Basic scraping operations are outlined involving defining items, spiders to extract data, and using pipelines for post-processing. XPath and regular expressions can be used for parsing pages within a spider's parse function. CrawlSpider is also introduced as a common type of spider that provides rules-based crawling.
2. Related Resources
Web Connection
Urllib2
Httplib2
Requests
Screen Scraping
lxml
XML parsing library (which also parses HTML) with a Pythonic API based on ElementTree; lxml itself is not part of the Python standard library
Beautiful Soup
Provides a few simple methods for navigating, searching and modifying a parse tree (see the short sketch after this list)
Automatically converts incoming documents to Unicode and outgoing documents to UTF-8
Sits on top of popular Python parsers like lxml and html5lib
Deals reasonably well with bad markup
One drawback: it's slow
Mechanize
Programmatic web browsing
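For flavor, a minimal Beautiful Soup sketch; the HTML snippet, with its unclosed tags, is made up for illustration:

    # Beautiful Soup copes with the unclosed tags in this snippet
    from bs4 import BeautifulSoup

    html = "<html><body><p class='intro'>Hello</p><a href='/next'>next"
    soup = BeautifulSoup(html, "lxml")   # sits on top of lxml (or html5lib)

    print soup.p.get_text()              # navigating: u'Hello'
    print soup.find("a")["href"]         # searching: '/next'
    soup.p["class"] = ["lead"]           # modifying the parse tree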
3. Why Scrapy?
Portable, open-source, 100% Python
Scrapy is written entirely in Python and runs on Linux, Windows, Mac and BSD
It currently supports only Python 2.6 and 2.7
Simple
Scrapy was designed with simplicity in mind, providing the features you need without getting in your way
Productive
Just write the rules to extract the data from web pages and let Scrapy
crawl the entire web site for you
Extensible
Scrapy was designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core
4. Batteries included
Scrapy comes with lots of functionality built in.
Well-documented & well-tested
Scrapy is extensively documented and has a comprehensive test suite with very good code coverage
Good community and commercial support
https://groups.google.com/forum/?fromgroups#!forum/scrapy-users
#scrapy @ freenode
Companies using Scrapy include:
Parsely
Direct Employers Foundation
Scrapinghub
6. Project layout
scrapy.cfg: the project configuration file
tutorial/: the project's Python module; you'll later import your code from here.
tutorial/items.py: the project’s items file.
tutorial/pipelines.py: the project’s pipelines file.
tutorial/settings.py: the project’s settings file.
tutorial/spiders/: a directory where you’ll later put your spiders.
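For reference, the scrapy startproject tutorial command generates this skeleton:

    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py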
7. Basic SOP
1. Define item: decide what to extract
2. Define Spider: decide crawling strategy
3. Define parse function: find patterns to extract data
4. Define pipeline: decide how to post-process the items
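A rough sketch of steps 1 and 4 (Scrapy 0.x API; the field names and the pipeline's behavior are illustrative, not from the talk):

    # items.py -- step 1: declare the fields you want to extract
    from scrapy.item import Item, Field

    class BlogPostItem(Item):
        title = Field()
        url = Field()

    # pipelines.py -- step 4: post-process every scraped item
    class StripTitlePipeline(object):
        def process_item(self, item, spider):
            item["title"] = item["title"].strip()
            return item

The pipeline still has to be enabled through the ITEM_PIPELINES setting in settings.py.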
9. Spider
Spiders are classes which define
how to perform the crawl
how to extract structured data from their pages
Built-in spiders
BaseSpider
The simplest spider, and the one from which every other spider must inherit
start_requests() generates Requests for the URLs specified in start_urls
The parse() method is the default callback for those Requests
parse()
Receives a response and returns either Item objects, Request objects, or an iterable of both
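A minimal sketch of such a spider, using the Scrapy 0.x API the talk targets; the spider name, URL and item field are illustrative:

    from scrapy.spider import BaseSpider
    from scrapy.item import Item, Field

    class PageItem(Item):
        url = Field()

    class ExampleSpider(BaseSpider):
        name = "example"
        start_urls = ["http://www.example.com/"]  # start_requests() turns these into Requests

        def parse(self, response):
            # default callback: return Items, Requests, or an iterable of both
            item = PageItem()
            item["url"] = response.url
            yield item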
10. Spider
CrawlSpider
This is the most commonly used spider for crawling regular
websites
It provides a convenient mechanism for following links by
defining a set of rules
rules: a list of one or more Rule objects.
Each Rule defines a certain behavior for crawling the site.
BaseSpider: the crawl order is up to your callbacks (e.g. a customized BFS)
CrawlSpider: follows links depth-first by default
Other built-in spiders
XMLFeedSpider, CSVFeedSpider, SitemapSpider
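A minimal CrawlSpider sketch under the same Scrapy 0.x assumptions; the domain and the two URL patterns are made up:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ExampleCrawlSpider(CrawlSpider):
        name = "example_crawl"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/"]

        rules = (
            # follow category pages, but extract nothing from them
            Rule(SgmlLinkExtractor(allow=(r"/category/",))),
            # parse item pages with the callback below
            Rule(SgmlLinkExtractor(allow=(r"/item/\d+",)), callback="parse_item"),
        )

        def parse_item(self, response):
            self.log("Item page: %s" % response.url)

Note that CrawlSpider uses parse() internally, so your callback must have a different name (parse_item here).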
11. Parse()
Selector
Scrapy comes with its own mechanism for extracting data: selectors.
XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
Scrapy Selectors are built on top of the libxml2 library,
the same library lxml uses, which means they are very similar in speed and parsing accuracy.
Selectors also support regular expressions via the re() method.
re() vs. XPath: extract() returns the selected nodes as strings, while re() applies a regular expression to them (see the sketch below).
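A sketch of both extraction styles inside a parse() callback (Scrapy 0.x selector API; the XPaths and the price pattern are illustrative):

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class PriceSpider(BaseSpider):
        name = "prices"
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # XPath: extract() returns the selected nodes as unicode strings
            titles = hxs.select("//h1/text()").extract()
            # re(): applies a regular expression to the selected text instead
            prices = hxs.select("//span[@class='price']/text()").re(r"\d+\.\d{2}")
            self.log("%d titles, %d prices" % (len(titles), len(prices)))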
12. XPath
Expression            Meaning
name                  matches all nodes at the current level with the specified name
name[n]               matches the nth element at the current level with the specified name
/                     as the first character, denotes the document root; otherwise moves down one level
//                    the current level and all sublevels, to any depth
*                     matches all nodes at the current level
. or ..               the current level / one level up
@name                 the attribute with the specified name
[@key='value']        all elements with an attribute matching the specified key/value pair
name[@key='value']    all elements with the specified name and a matching attribute
[text()='value']      all elements with the specified text
name[text()='value']  all elements with the specified name and text
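A few of the expressions above, exercised against a made-up page with the same Scrapy 0.x selector API:

    from scrapy.http import HtmlResponse
    from scrapy.selector import HtmlXPathSelector

    html = ("<html><body><div id='main'>"
            "<a href='/a'>first</a><a href='/b'>second</a>"
            "</div></body></html>")
    response = HtmlResponse(url="http://www.example.com/", body=html)
    hxs = HtmlXPathSelector(response)

    print hxs.select("//a/text()").extract()    # ['first', 'second']
    print hxs.select("//a[1]/@href").extract()  # ['/a']: first <a> at its level
    print hxs.select("//div[@id='main']/a[text()='second']/@href").extract()  # ['/b']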