Related Resources
- Web connection: urllib2, httplib2, Requests
- Screen scraping:
  - lxml: XML parsing library (which also parses HTML) with a Pythonic API based on ElementTree (not part of the Python standard library)
  - Beautiful Soup:
    - Provides a few simple methods for navigating, searching, and modifying a parse tree
    - Automatically converts incoming documents to Unicode and outgoing documents to UTF-8
    - Sits on top of popular Python parsers like lxml and html5lib
    - Deals with bad markup reasonably well
    - One drawback: it's slow
  - Mechanize: programmatic web browsing
Why Scrapy?
- Portable, open-source, 100% Python: Scrapy is completely written in Python and runs on Linux, Windows, Mac, and BSD; it currently supports only Python 2.6 and 2.7
- Simple: Scrapy was designed with simplicity in mind, providing the features you need without getting in your way
- Productive: just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
- Extensible: Scrapy was designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core
- Batteries included: Scrapy comes with lots of functionality built in
- Well-documented and well-tested: Scrapy is extensively documented and has a comprehensive test suite with very good code coverage
- Good community and commercial support:
  - https://groups.google.com/forum/?fromgroups#!forum/scrapy-users
  - #scrapy @ freenode
  - Companies such as Parsely, Direct Employers Foundation, and Scrapinghub
Project layout
- scrapy.cfg: the project configuration file
- tutorial/: the project's Python module; you'll later import your code from here
- tutorial/items.py: the project's items file
- tutorial/pipelines.py: the project's pipelines file
- tutorial/settings.py: the project's settings file
- tutorial/spiders/: a directory where you'll later put your spiders
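The scrapy.cfg at the project root is a small INI file whose main job is to point Scrapy at the project's settings module; for a project named tutorial it typically contains:

```ini
[settings]
default = tutorial.settings
```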
Basic SOP
1. Define item: decide what to extract
2. Define spider: decide the crawling strategy
3. Define parse function: find patterns to extract data
4. Define pipeline: post-process the extracted items
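The four steps can be sketched without the framework at all; this is a toy, framework-free illustration (make_item, parse, and pipeline are hypothetical names, standing in for Scrapy's Item, Spider callback, and item pipeline):

```python
import re

# 1. Define item: decide what to extract (a dict stands in for scrapy's Item)
def make_item(title, link):
    return {"title": title, "link": link}

# 2/3. Define spider + parse function: find patterns and extract data
def parse(html):
    # crude link pattern; real spiders would use XPath selectors instead
    for m in re.finditer(r'<a href="([^"]+)">([^<]+)</a>', html):
        yield make_item(title=m.group(2), link=m.group(1))

# 4. Pipeline: post-process each extracted item
def pipeline(item):
    item["title"] = item["title"].strip().title()
    return item

page = '<a href="/a">first post </a><a href="/b"> second post</a>'
items = [pipeline(i) for i in parse(page)]
```

In real Scrapy the same shape appears as an Item subclass in items.py, a parse() callback on the spider, and a pipeline class registered in settings.py.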
Spider
- Spiders are classes which define:
  - how to perform the crawl
  - how to extract structured data from the pages
- Built-in spiders
  - BaseSpider
    - The simplest spider, and the one from which every other spider must inherit
    - start_requests() generates a Request for each of the URLs specified in start_urls, with the parse() method as the default callback for those Requests
    - parse() receives the response and returns either Item objects, Request objects, or an iterable of both
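The start_requests()/parse() contract described above can be mocked in plain Python (these are stand-in classes, not real Scrapy; the real base class lives at scrapy.spider.BaseSpider):

```python
# Mock of the BaseSpider contract: start_requests() yields one Request
# per start_url, with parse() attached as the default callback.
class Request(object):
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class BaseSpider(object):
    start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # subclasses return Items, Requests, or an iterable of both
        raise NotImplementedError

class MySpider(BaseSpider):
    start_urls = ["http://example.com/1", "http://example.com/2"]

    def parse(self, response):
        return {"url": response}  # a real spider would return Item objects

reqs = list(MySpider().start_requests())
```

The engine then downloads each Request's URL and feeds the response to the attached callback, which is exactly what reqs[0].callback(...) simulates here.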
Spider
- CrawlSpider
  - The most commonly used spider for crawling regular websites
  - Provides a convenient mechanism for following links by defining a set of rules
  - rules: a list of one (or more) Rule objects; each Rule defines a certain behavior for crawling the site
- Crawl order
  - BaseSpider: customize for BFS
  - CrawlSpider: DFS
- Other built-in spiders: XMLFeedSpider, CSVFeedSpider, SitemapSpider
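The rule mechanism can be sketched in miniature; this simplified Rule and extract_links are hypothetical (Scrapy's real Rule takes a link-extractor object rather than a raw pattern), but the idea is the same: each rule says which URLs to follow and which callback handles them.

```python
import re

# Simplified stand-in for scrapy's Rule: a URL pattern, whether to keep
# following links from matched pages, and an optional callback name.
class Rule(object):
    def __init__(self, allow, follow=True, callback=None):
        self.allow = re.compile(allow)
        self.follow = follow
        self.callback = callback

def extract_links(html, rules):
    # pair each href on the page with the first/each rule that matches it
    links = re.findall(r'href="([^"]+)"', html)
    return [(url, rule) for url in links for rule in rules
            if rule.allow.search(url)]

rules = [
    Rule(allow=r"/category/", follow=True),                    # keep crawling
    Rule(allow=r"/item/\d+", follow=False, callback="parse_item"),  # extract
]
page = '<a href="/category/books">Books</a> <a href="/item/42">Item</a>'
matched = extract_links(page, rules)
```

Category pages are followed deeper while item pages are routed to an extraction callback, which is the typical listing-page/detail-page split CrawlSpider is used for.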
Parse(): Selectors
- Scrapy comes with its own mechanism for extracting data: selectors
- XPath is a language for selecting nodes in XML documents, which can also be used with HTML
- Scrapy selectors are built over the libxml2 library, the same as lxml, which means they're very similar in speed and parsing accuracy
- Selectors also support regular expressions (re vs. XPath)
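The selector idea can be tried with the standard library's ElementTree, which supports a subset of XPath (Scrapy's own selectors expose the same style of query through select(...).extract(); the document below is made up for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<div class="post"><a href="/a">First</a></div>'
    '<div class="post"><a href="/b">Second</a></div>'
    '</body></html>')

# XPath-style queries: descend to any depth (.//) and filter on attributes
titles = [a.text for a in doc.findall('.//div[@class="post"]/a')]
hrefs = [a.get("href") for a in doc.findall('.//a')]
```

In Scrapy the equivalent query would run against the downloaded response inside parse(), returning the extracted strings rather than element objects.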
XPath
Expression          Meaning
name                matches all nodes on the current level with the specified name
name[n]             matches the nth element on the current level with the specified name
/                   if used as the first character, denotes the top-level document; otherwise denotes moving down a level
//                  the current level and all sublevels, to any depth
*                   matches all nodes on the current level
.                   the current level
..                  go up one level
@name               the attribute with the specified name
[@key=value]        all elements with an attribute that matches the specified key/value pair
name[@key=value]    all elements with the specified name and an attribute that matches the specified key/value pair
[text()=value]      all elements with the specified text
name[text()=value]  all elements with the specified name and text
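A few rows of the table can be checked with stdlib ElementTree (which implements only a subset of XPath; text() predicates and absolute paths need a fuller engine such as lxml's, and the sample document here is invented):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<root><item id="1">a</item><item id="2">b</item><note>c</note></root>')

first = doc.find('item').text            # name: first node with that name
second = doc.find('item[2]').text        # name[n]: nth element (1-based)
children = len(doc.findall('*'))         # *: all nodes on this level
by_attr = doc.find('item[@id="2"]').text # name[@key=value]
anywhere = len(doc.findall('.//item'))   # //: any depth below here
```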