SlideShare a Scribd company logo
1 of 30
Download to read offline
Downloading the
internet with
Python + scrapy 💻🐍
Erin Shellman @erinshellman
Puget Sound 
Programming Python meet-up
January 14, 2015
hi!
I’m a data scientist in the Nordstrom Data Lab. I’ve built
scrapers to monitor the product catalogs of various
sports retailers.
Getting data can be hard
Despite the open-data movement and popularity
of APIs, volumes of data are locked up in DOMs all
over the internet.
Monitoring
competitor prices
• As a retailer, I want to strategically set prices in
relation to my competitors.
• But they aren’t interested in sharing their prices and
mark-down strategies with me. 😭
• “Scrapy is an application framework for crawling web
sites and extracting structured data which can be
used for a wide range of useful applications, like data
mining, information processing or historical archival.”
• scrapin’ on rails!
scrapy startproject prices
scrapy project
prices/
scrapy.cfg
prices/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
...
scrapy project
prices/
scrapy.cfg
prices/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
backcountry.py
...
class Product(Item):!
!
product_title = Field()!
description = Field()!
price = Field() !
Define what to
scrape in items.py
protip: get to know
the DOM.
protip: get to know
the DOM.
Sometimes
there are
hidden gems.
SKU-level inventory
availability?
Score!
Spider design
Spiders have two primary components:
1. Crawling (navigation) instructions
2. Parsing instructions
Define the crawl behavior
in spiders/backcountry.py
After spending some
time on backcountry.com,
I decided the all brands
landing page was the
best starting URL.
class BackcountrySpider(CrawlSpider):!
name = 'backcountry'!
def __init__(self, *args, **kwargs):!
super(BackcountrySpider, self).__init__(*args, **kwargs)!
self.base_url = 'http://www.backcountry.com'!
self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']!
!
def parse_start_url(self, response):!
brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()!
!
for brand in brands:!
brand_url = str(self.base_url + brand)!
self.log("Queued up: %s" % brand_url)!
!
yield scrapy.Request(url = brand_url, !
callback = self.parse_brand_landing_pages)!
Part I: Crawl Setup
e.g. brand_url = http://www.backcountry.com/burton
!
def parse_start_url(self, response):!
brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()!
!
for brand in brands:!
brand_url = str(self.base_url + brand)!
self.log("Queued up: %s" % brand_url)!
!
yield scrapy.Request(url = brand_url, !
callback = self.parse_brand_landing_pages)!
!
def parse_brand_landing_pages(self, response):!
shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"!
shop_all_link = response.xpath(shop_all_pattern).extract()!
!
if shop_all_link:!
all_product_url = str(self.base_url + shop_all_link[0]) !
!
yield scrapy.Request(url = all_product_url,!
callback = self.parse_product_pages)!
else: !
yield scrapy.Request(url = response.url,!
callback = self.parse_product_pages)
def parse_product_pages(self, response):!
product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"!
pagination_pattern = "//li[@class='page-link page-number']/a/@href"!
!
product_pages = response.xpath(product_page_pattern).extract()!
more_pages = response.xpath(pagination_pattern).extract()!
!
# Paginate!!
for page in more_pages:!
next_page = str(self.base_url + page)!
yield scrapy.Request(url = next_page,!
callback = self.parse_product_pages)!
!
for product in product_pages:!
product_url = str(self.base_url + product)!
!
yield scrapy.Request(url = product_url,!
callback = self.parse_item)
def parse_product_pages(self, response):!
product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"!
pagination_pattern = "//li[@class='page-link page-number']/a/@href"!
!
product_pages = response.xpath(product_page_pattern).extract()!
more_pages = response.xpath(pagination_pattern).extract()!
!
# Paginate!!
for page in more_pages:!
next_page = str(self.base_url + page)!
yield scrapy.Request(url = next_page,!
callback = self.parse_product_pages)!
!
for product in product_pages:!
product_url = str(self.base_url + product)!
!
yield scrapy.Request(url = product_url,!
callback = self.parse_item)
# Paginate!!
for page in more_pages:!
next_page = str(self.base_url + page)!
yield scrapy.Request(url = next_page,!
callback = self.parse_product_pages)!
!
for product in product_pages:!
product_url = str(self.base_url + product)!
!
yield scrapy.Request(url = product_url,!
callback = self.parse_item)
def parse_item(self, response):!
!
item = Product()!
dirty_data = {}!
!
dirty_data['product_title'] = response.xpath(“//*[@id=‘product-buy-box’]/div/div[1]/h1/text()“).extract()!
dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()!
dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()!
!
for variable in dirty_data.keys():!
if dirty_data[variable]: !
if variable == 'price':!
item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))!
else: !
item[variable] = ''.join(dirty_data[variable]).strip()!
!
yield item!
Part II: Parsing
for variable in dirty_data.keys():!
if dirty_data[variable]: !
if variable == 'price':!
item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))!
else: !
item[variable] = ''.join(dirty_data[variable]).strip()
Part II: Clean it now!
scrapy crawl backcountry -o bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished)
2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json
2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 33068379,
'downloader/request_count': 41848,
'downloader/request_method_count/GET': 41848,
'downloader/response_bytes': 1715232531,
'downloader/response_count': 41848,
'downloader/response_status_count/200': 41835,
'downloader/response_status_count/301': 9,
'downloader/response_status_count/404': 4,
'dupefilter/filtered': 12481,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929),
'item_scraped_count': 38881,
'log_count/DEBUG': 81784,
'log_count/ERROR': 23,
'log_count/INFO': 26,
'request_depth_max': 7,
'response_received_count': 41839,
'scheduler/dequeued': 41848,
'scheduler/dequeued/memory': 41848,
'scheduler/enqueued': 41848,
'scheduler/enqueued/memory': 41848,
'spider_exceptions/IndexError': 23,
'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)}
2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
{
"review_count": 18,
"product_id": "BAF0028",
"brand": "Baffin",
"product_url": "http://www.backcountry.com/baffin-cush-slipper-mens",
"source": "backcountry",
"inventory": {
"BAF0028-ESP-S3XL": 27,
"BAF0028-BKA-XL": 40,
"BAF0028-NVA-XL": 5,
"BAF0028-NVA-L": 7,
"BAF0028-BKA-L": 17,
"BAF0028-ESP-XXL": 12,
"BAF0028-NVA-XXL": 6,
"BAF0028-BKA-XXL": 44,
"BAF0028-NVA-S3XL": 10,
"BAF0028-ESP-L": 50,
"BAF0028-ESP-XL": 52,
"BAF0028-BKA-S3XL": 19
},
"price_high": 24.95,
"price": 23.95,
"description_short": "Cush Slipper - Men's",
"price_low": 23.95,
"review_score": 4
}
prices/
scrapy.cfg
prices/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
backcountry.py
evo.py
rei.py
...
–Monica Rogati, VP of Data at Jawbone
“Data wrangling is a huge — and
surprisingly so — part of the job. It’s
something that is not appreciated by data
civilians. At times, it feels like everything
we do.”
Resources
• Code here!:
• https://github.com/erinshellman/backcountry-scraper
• Lynn Root’s excellent end-to-end tutorial.
• http://newcoder.io/Intro-Scrape/
• Web scraping - It’s your civic duty
• http://pbpython.com/web-scraping-mn-budget.html
Bring your projects to hacknight!
http://www.meetup.com/Seattle-PyLadies
Ladies!!
Bring your projects to hacknight!
http://www.meetup.com/Seattle-PyLadies
Ladies!!
Thursday, January
29th
6PM
!
Intro to iPython
and Matplotlib
Ada Developers Academy
1301 5th Avenue #1350

More Related Content

What's hot

Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfErin Shellman
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.Diep Nguyen
 
Open Hack London - Introduction to YQL
Open Hack London - Introduction to YQLOpen Hack London - Introduction to YQL
Open Hack London - Introduction to YQLChristian Heilmann
 
Using YQL Sensibly - YUIConf 2010
Using YQL Sensibly - YUIConf 2010Using YQL Sensibly - YUIConf 2010
Using YQL Sensibly - YUIConf 2010Christian Heilmann
 
Things you can use (by the Yahoo Developer Network and friends)
Things you can use (by the Yahoo Developer Network and friends)Things you can use (by the Yahoo Developer Network and friends)
Things you can use (by the Yahoo Developer Network and friends)Christian Heilmann
 
Grails 1.2 探検隊 -新たな聖杯をもとめて・・・-
Grails 1.2 探検隊 -新たな聖杯をもとめて・・・-Grails 1.2 探検隊 -新たな聖杯をもとめて・・・-
Grails 1.2 探検隊 -新たな聖杯をもとめて・・・-Tsuyoshi Yamamoto
 
Django - 次の一歩 gumiStudy#3
Django - 次の一歩 gumiStudy#3Django - 次の一歩 gumiStudy#3
Django - 次の一歩 gumiStudy#3makoto tsuyuki
 
Undercover Pods / WP Functions
Undercover Pods / WP FunctionsUndercover Pods / WP Functions
Undercover Pods / WP Functionspodsframework
 
Hd insight programming
Hd insight programmingHd insight programming
Hd insight programmingCasear Chu
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineAndy McKay
 
Django Overview
Django OverviewDjango Overview
Django OverviewBrian Tol
 
Building Go Web Apps
Building Go Web AppsBuilding Go Web Apps
Building Go Web AppsMark
 
Django tech-talk
Django tech-talkDjango tech-talk
Django tech-talkdtdannen
 
Introduction to the Pods JSON API
Introduction to the Pods JSON APIIntroduction to the Pods JSON API
Introduction to the Pods JSON APIpodsframework
 

What's hot (20)

Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourself
 
Pydata-Python tools for webscraping
Pydata-Python tools for webscrapingPydata-Python tools for webscraping
Pydata-Python tools for webscraping
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.
 
Selenium&scrapy
Selenium&scrapySelenium&scrapy
Selenium&scrapy
 
Scrapy
ScrapyScrapy
Scrapy
 
Webscraping with asyncio
Webscraping with asyncioWebscraping with asyncio
Webscraping with asyncio
 
Django
DjangoDjango
Django
 
Open Hack London - Introduction to YQL
Open Hack London - Introduction to YQLOpen Hack London - Introduction to YQL
Open Hack London - Introduction to YQL
 
Using YQL Sensibly - YUIConf 2010
Using YQL Sensibly - YUIConf 2010Using YQL Sensibly - YUIConf 2010
Using YQL Sensibly - YUIConf 2010
 
Things you can use (by the Yahoo Developer Network and friends)
Things you can use (by the Yahoo Developer Network and friends)Things you can use (by the Yahoo Developer Network and friends)
Things you can use (by the Yahoo Developer Network and friends)
 
Grails 1.2 探検隊 -新たな聖杯をもとめて・・・-
Grails 1.2 探検隊 -新たな聖杯をもとめて・・・-Grails 1.2 探検隊 -新たな聖杯をもとめて・・・-
Grails 1.2 探検隊 -新たな聖杯をもとめて・・・-
 
Django - 次の一歩 gumiStudy#3
Django - 次の一歩 gumiStudy#3Django - 次の一歩 gumiStudy#3
Django - 次の一歩 gumiStudy#3
 
Routing @ Scuk.cz
Routing @ Scuk.czRouting @ Scuk.cz
Routing @ Scuk.cz
 
Undercover Pods / WP Functions
Undercover Pods / WP FunctionsUndercover Pods / WP Functions
Undercover Pods / WP Functions
 
Hd insight programming
Hd insight programmingHd insight programming
Hd insight programming
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App Engine
 
Django Overview
Django OverviewDjango Overview
Django Overview
 
Building Go Web Apps
Building Go Web AppsBuilding Go Web Apps
Building Go Web Apps
 
Django tech-talk
Django tech-talkDjango tech-talk
Django tech-talk
 
Introduction to the Pods JSON API
Introduction to the Pods JSON APIIntroduction to the Pods JSON API
Introduction to the Pods JSON API
 

Viewers also liked

Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Sammy Fung
 
Web Scraping in Python with Scrapy
Web Scraping in Python with ScrapyWeb Scraping in Python with Scrapy
Web Scraping in Python with Scrapyorangain
 
Fun! with the Twitter API
Fun! with the Twitter APIFun! with the Twitter API
Fun! with the Twitter APIErin Shellman
 
Collaborative Filtering for fun ...and profit!
Collaborative Filtering for fun ...and profit!Collaborative Filtering for fun ...and profit!
Collaborative Filtering for fun ...and profit!Erin Shellman
 
香港中文開源軟件翻譯
香港中文開源軟件翻譯香港中文開源軟件翻譯
香港中文開源軟件翻譯Sammy Fung
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profitFederico Feroldi
 
Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4Eueung Mulyana
 
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghubDana Brophy
 
Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source toolsSammy Fung
 
Catching the most with high-throughput screening
Catching the most with high-throughput screeningCatching the most with high-throughput screening
Catching the most with high-throughput screeningErin Shellman
 
Optimizing Cypher Queries in Neo4j
Optimizing Cypher Queries in Neo4jOptimizing Cypher Queries in Neo4j
Optimizing Cypher Queries in Neo4jNeo4j
 
快快樂樂學 Scrapy
快快樂樂學 Scrapy快快樂樂學 Scrapy
快快樂樂學 Scrapyrecast203
 
Quokka CMS - Content Management with Flask and Mongo #tdc2014
Quokka CMS - Content Management with Flask and Mongo #tdc2014Quokka CMS - Content Management with Flask and Mongo #tdc2014
Quokka CMS - Content Management with Flask and Mongo #tdc2014Bruno Rocha
 

Viewers also liked (19)

Bot or Not
Bot or NotBot or Not
Bot or Not
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
 
Scrapy-101
Scrapy-101Scrapy-101
Scrapy-101
 
Web Scraping in Python with Scrapy
Web Scraping in Python with ScrapyWeb Scraping in Python with Scrapy
Web Scraping in Python with Scrapy
 
Fun! with the Twitter API
Fun! with the Twitter APIFun! with the Twitter API
Fun! with the Twitter API
 
Collaborative Filtering for fun ...and profit!
Collaborative Filtering for fun ...and profit!Collaborative Filtering for fun ...and profit!
Collaborative Filtering for fun ...and profit!
 
香港中文開源軟件翻譯
香港中文開源軟件翻譯香港中文開源軟件翻譯
香港中文開源軟件翻譯
 
real time real talk
real time real talkreal time real talk
real time real talk
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profit
 
Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4
 
摘星
摘星摘星
摘星
 
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghub
 
Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source tools
 
Catching the most with high-throughput screening
Catching the most with high-throughput screeningCatching the most with high-throughput screening
Catching the most with high-throughput screening
 
Optimizing Cypher Queries in Neo4j
Optimizing Cypher Queries in Neo4jOptimizing Cypher Queries in Neo4j
Optimizing Cypher Queries in Neo4j
 
快快樂樂學 Scrapy
快快樂樂學 Scrapy快快樂樂學 Scrapy
快快樂樂學 Scrapy
 
Quokka CMS - Content Management with Flask and Mongo #tdc2014
Quokka CMS - Content Management with Flask and Mongo #tdc2014Quokka CMS - Content Management with Flask and Mongo #tdc2014
Quokka CMS - Content Management with Flask and Mongo #tdc2014
 

Similar to Downloading the internet with Python + Scrapy

Code is Cool - Products are Better
Code is Cool - Products are BetterCode is Cool - Products are Better
Code is Cool - Products are Betteraaronheckmann
 
Beautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFacesBeautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFacesLincoln III
 
Django - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosDjango - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosIgor Sobreira
 
Jarv.us Showcase — SenchaCon 2011
Jarv.us Showcase — SenchaCon 2011Jarv.us Showcase — SenchaCon 2011
Jarv.us Showcase — SenchaCon 2011Chris Alfano
 
Advanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentAdvanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentMike Brittain
 
Rugalytics | Ruby Manor Nov 2008
Rugalytics | Ruby Manor Nov 2008Rugalytics | Ruby Manor Nov 2008
Rugalytics | Ruby Manor Nov 2008Rob
 
E2 appspresso hands on lab
E2 appspresso hands on labE2 appspresso hands on lab
E2 appspresso hands on labNAVER D2
 
E3 appspresso hands on lab
E3 appspresso hands on labE3 appspresso hands on lab
E3 appspresso hands on labNAVER D2
 
The Art of AngularJS in 2015 - Angular Summit 2015
The Art of AngularJS in 2015 - Angular Summit 2015The Art of AngularJS in 2015 - Angular Summit 2015
The Art of AngularJS in 2015 - Angular Summit 2015Matt Raible
 
Coding for marketers
Coding for marketersCoding for marketers
Coding for marketersRobin Lord
 
Sherlock Markup and Sammy Semantic - drupal theming forensic analysis
Sherlock Markup and Sammy Semantic - drupal theming forensic analysisSherlock Markup and Sammy Semantic - drupal theming forensic analysis
Sherlock Markup and Sammy Semantic - drupal theming forensic analysisAndreas Sahle
 
2012 03 27_philly_jug_rewrite_static
2012 03 27_philly_jug_rewrite_static2012 03 27_philly_jug_rewrite_static
2012 03 27_philly_jug_rewrite_staticLincoln III
 
02 integrate highchart
02 integrate highchart02 integrate highchart
02 integrate highchartErhwen Kuo
 
Metrics on the front, data in the back
Metrics on the front, data in the backMetrics on the front, data in the back
Metrics on the front, data in the backDiUS
 
High Performance Web Components
High Performance Web ComponentsHigh Performance Web Components
High Performance Web ComponentsSteve Souders
 
Pyramid Lighter/Faster/Better web apps
Pyramid Lighter/Faster/Better web appsPyramid Lighter/Faster/Better web apps
Pyramid Lighter/Faster/Better web appsDylan Jay
 
FrontInBahia 2014: 10 dicas de desempenho para apps mobile híbridas
FrontInBahia 2014: 10 dicas de desempenho para apps mobile híbridasFrontInBahia 2014: 10 dicas de desempenho para apps mobile híbridas
FrontInBahia 2014: 10 dicas de desempenho para apps mobile híbridasLoiane Groner
 
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...apidays
 
HTML5 after the hype - JFokus2015
HTML5 after the hype - JFokus2015HTML5 after the hype - JFokus2015
HTML5 after the hype - JFokus2015Christian Heilmann
 

Similar to Downloading the internet with Python + Scrapy (20)

Code is Cool - Products are Better
Code is Cool - Products are BetterCode is Cool - Products are Better
Code is Cool - Products are Better
 
Beautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFacesBeautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFaces
 
Django - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosDjango - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazos
 
Jarv.us Showcase — SenchaCon 2011
Jarv.us Showcase — SenchaCon 2011Jarv.us Showcase — SenchaCon 2011
Jarv.us Showcase — SenchaCon 2011
 
Advanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentAdvanced Topics in Continuous Deployment
Advanced Topics in Continuous Deployment
 
Rugalytics | Ruby Manor Nov 2008
Rugalytics | Ruby Manor Nov 2008Rugalytics | Ruby Manor Nov 2008
Rugalytics | Ruby Manor Nov 2008
 
E2 appspresso hands on lab
E2 appspresso hands on labE2 appspresso hands on lab
E2 appspresso hands on lab
 
E3 appspresso hands on lab
E3 appspresso hands on labE3 appspresso hands on lab
E3 appspresso hands on lab
 
The Art of AngularJS in 2015 - Angular Summit 2015
The Art of AngularJS in 2015 - Angular Summit 2015The Art of AngularJS in 2015 - Angular Summit 2015
The Art of AngularJS in 2015 - Angular Summit 2015
 
Coding for marketers
Coding for marketersCoding for marketers
Coding for marketers
 
Sherlock Markup and Sammy Semantic - drupal theming forensic analysis
Sherlock Markup and Sammy Semantic - drupal theming forensic analysisSherlock Markup and Sammy Semantic - drupal theming forensic analysis
Sherlock Markup and Sammy Semantic - drupal theming forensic analysis
 
2012 03 27_philly_jug_rewrite_static
2012 03 27_philly_jug_rewrite_static2012 03 27_philly_jug_rewrite_static
2012 03 27_philly_jug_rewrite_static
 
Test upload
Test uploadTest upload
Test upload
 
02 integrate highchart
02 integrate highchart02 integrate highchart
02 integrate highchart
 
Metrics on the front, data in the back
Metrics on the front, data in the backMetrics on the front, data in the back
Metrics on the front, data in the back
 
High Performance Web Components
High Performance Web ComponentsHigh Performance Web Components
High Performance Web Components
 
Pyramid Lighter/Faster/Better web apps
Pyramid Lighter/Faster/Better web appsPyramid Lighter/Faster/Better web apps
Pyramid Lighter/Faster/Better web apps
 
FrontInBahia 2014: 10 dicas de desempenho para apps mobile híbridas
FrontInBahia 2014: 10 dicas de desempenho para apps mobile híbridasFrontInBahia 2014: 10 dicas de desempenho para apps mobile híbridas
FrontInBahia 2014: 10 dicas de desempenho para apps mobile híbridas
 
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
 
HTML5 after the hype - JFokus2015
HTML5 after the hype - JFokus2015HTML5 after the hype - JFokus2015
HTML5 after the hype - JFokus2015
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Downloading the internet with Python + Scrapy

  • 1. Downloading the internet with Python + scrapy 💻🐍 Erin Shellman @erinshellman Puget Sound Programming Python meet-up January 14, 2015
  • 2. hi! I’m a data scientist in the Nordstrom Data Lab. I’ve built scrapers to monitor the product catalogs of various sports retailers.
  • 3. Getting data can be hard Despite the open-data movement and popularity of APIs, volumes of data are locked up in DOMs all over the internet.
  • 4. Monitoring competitor prices • As a retailer, I want to strategically set prices in relation to my competitors. • But they aren’t interested in sharing their prices and mark-down strategies with me. 😭
  • 5. • “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.” • scrapin’ on rails!
  • 9. class Product(Item):! ! product_title = Field()! description = Field()! price = Field() ! Define what to scrape in items.py
  • 10. protip: get to know the DOM.
  • 11. protip: get to know the DOM.
  • 12. Sometimes there are hidden gems. SKU-level inventory availability? Score!
  • 13. Spider design Spiders have two primary components: 1. Crawling (navigation) instructions 2. Parsing instructions
  • 14. Define the crawl behavior in spiders/backcountry.py After spending some time on backcountry.com, I decided the all brands landing page was the best starting URL.
  • 15. class BackcountrySpider(CrawlSpider):! name = 'backcountry'! def __init__(self, *args, **kwargs):! super(BackcountrySpider, self).__init__(*args, **kwargs)! self.base_url = 'http://www.backcountry.com'! self.start_urls = ['http://www.backcountry.com/Store/catalog/shopAllBrands.jsp']! ! def parse_start_url(self, response):! brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()! ! for brand in brands:! brand_url = str(self.base_url + brand)! self.log("Queued up: %s" % brand_url)! ! yield scrapy.Request(url = brand_url, ! callback = self.parse_brand_landing_pages)! Part I: Crawl Setup
  • 16. e.g. brand_url = http://www.backcountry.com/burton ! def parse_start_url(self, response):! brands = response.xpath("//a[@class='qa-brand-link']/@href").extract()! ! for brand in brands:! brand_url = str(self.base_url + brand)! self.log("Queued up: %s" % brand_url)! ! yield scrapy.Request(url = brand_url, ! callback = self.parse_brand_landing_pages)!
  • 17. ! def parse_brand_landing_pages(self, response):! shop_all_pattern = "//a[@class='subcategory-link brand-plp-link qa-brand-plp-link']/@href"! shop_all_link = response.xpath(shop_all_pattern).extract()! ! if shop_all_link:! all_product_url = str(self.base_url + shop_all_link[0]) ! ! yield scrapy.Request(url = all_product_url,! callback = self.parse_product_pages)! else: ! yield scrapy.Request(url = response.url,! callback = self.parse_product_pages)
  • 18. def parse_product_pages(self, response):! product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"! pagination_pattern = "//li[@class='page-link page-number']/a/@href"! ! product_pages = response.xpath(product_page_pattern).extract()! more_pages = response.xpath(pagination_pattern).extract()! ! # Paginate!! for page in more_pages:! next_page = str(self.base_url + page)! yield scrapy.Request(url = next_page,! callback = self.parse_product_pages)! ! for product in product_pages:! product_url = str(self.base_url + product)! ! yield scrapy.Request(url = product_url,! callback = self.parse_item)
  • 19. def parse_product_pages(self, response):! product_page_pattern = "//a[contains(@class, 'qa-product-link')]/@href"! pagination_pattern = "//li[@class='page-link page-number']/a/@href"! ! product_pages = response.xpath(product_page_pattern).extract()! more_pages = response.xpath(pagination_pattern).extract()! ! # Paginate!! for page in more_pages:! next_page = str(self.base_url + page)! yield scrapy.Request(url = next_page,! callback = self.parse_product_pages)! ! for product in product_pages:! product_url = str(self.base_url + product)! ! yield scrapy.Request(url = product_url,! callback = self.parse_item)
  • 20. # Paginate!! for page in more_pages:! next_page = str(self.base_url + page)! yield scrapy.Request(url = next_page,! callback = self.parse_product_pages)! ! for product in product_pages:! product_url = str(self.base_url + product)! ! yield scrapy.Request(url = product_url,! callback = self.parse_item)
  • 21. def parse_item(self, response):! ! item = Product()! dirty_data = {}! ! dirty_data['product_title'] = response.xpath(“//*[@id=‘product-buy-box’]/div/div[1]/h1/text()“).extract()! dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()! dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()! ! for variable in dirty_data.keys():! if dirty_data[variable]: ! if variable == 'price':! item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))! else: ! item[variable] = ''.join(dirty_data[variable]).strip()! ! yield item! Part II: Parsing
  • 22. for variable in dirty_data.keys():! if dirty_data[variable]: ! if variable == 'price':! item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))! else: ! item[variable] = ''.join(dirty_data[variable]).strip() Part II: Clean it now!
  • 24. 2015-01-02 12:32:52-0800 [backcountry] INFO: Closing spider (finished) 2015-01-02 12:32:52-0800 [backcountry] INFO: Stored json feed (38881 items) in: bc.json 2015-01-02 12:32:52-0800 [backcountry] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 33068379, 'downloader/request_count': 41848, 'downloader/request_method_count/GET': 41848, 'downloader/response_bytes': 1715232531, 'downloader/response_count': 41848, 'downloader/response_status_count/200': 41835, 'downloader/response_status_count/301': 9, 'downloader/response_status_count/404': 4, 'dupefilter/filtered': 12481, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2015, 1, 2, 20, 32, 52, 524929), 'item_scraped_count': 38881, 'log_count/DEBUG': 81784, 'log_count/ERROR': 23, 'log_count/INFO': 26, 'request_depth_max': 7, 'response_received_count': 41839, 'scheduler/dequeued': 41848, 'scheduler/dequeued/memory': 41848, 'scheduler/enqueued': 41848, 'scheduler/enqueued/memory': 41848, 'spider_exceptions/IndexError': 23, 'start_time': datetime.datetime(2015, 1, 2, 20, 14, 16, 892071)} 2015-01-02 12:32:52-0800 [backcountry] INFO: Spider closed (finished)
  • 25. { "review_count": 18, "product_id": "BAF0028", "brand": "Baffin", "product_url": "http://www.backcountry.com/baffin-cush-slipper-mens", "source": "backcountry", "inventory": { "BAF0028-ESP-S3XL": 27, "BAF0028-BKA-XL": 40, "BAF0028-NVA-XL": 5, "BAF0028-NVA-L": 7, "BAF0028-BKA-L": 17, "BAF0028-ESP-XXL": 12, "BAF0028-NVA-XXL": 6, "BAF0028-BKA-XXL": 44, "BAF0028-NVA-S3XL": 10, "BAF0028-ESP-L": 50, "BAF0028-ESP-XL": 52, "BAF0028-BKA-S3XL": 19 }, "price_high": 24.95, "price": 23.95, "description_short": "Cush Slipper - Men's", "price_low": 23.95, "review_score": 4 }
  • 27. –Monica Rogati, VP of Data at Jawbone “Data wrangling is a huge — and surprisingly so — part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
  • 28. Resources • Code here!: • https://github.com/erinshellman/backcountry-scraper • Lynn Root’s excellent end-to-end tutorial. • http://newcoder.io/Intro-Scrape/ • Web scraping - It’s your civic duty • http://pbpython.com/web-scraping-mn-budget.html
  • 29. Bring your projects to hacknight! http://www.meetup.com/Seattle-PyLadies Ladies!!
  • 30. Bring your projects to hacknight! http://www.meetup.com/Seattle-PyLadies Ladies!! Thursday, January 29th 6PM ! Intro to iPython and Matplotlib Ada Developers Academy 1301 5th Avenue #1350