Web Scraping 1-2-3 with Python + Scrapy (Summer BarCampHK 2012 version)

Sammy Fung

Today's Agenda
● Some Cases
● Python and Scrapy

Web Scraping
● a computer software technique of extracting
information from websites. (Wikipedia)
● For business, hobbies, research, and more.
● Business cases are NOT covered today.

CableTV & NOWTV Programme
(Past)
● 2004.
● Slow, slow, slow, or worse: can't connect.
● Used Flash.

HK Observatory and Joint Typhoon
Warning Center
● No easy data exchange format, e.g.
RSS/Atom.
● We won't check the websites every day.

Transportation - KMB, PTES
● No map view on the KMB website for a bus route
in the past.
● Extremely poor, ugly (or much worse) map
UI on PTES.

My experience with web scraping
● 2004: PHP
● The year after: Python
● Recent years: Python with Scrapy

Document Types
● HTML, XML, ...
● Text
● Others, e.g. pictures, videos, ...

Web Scraping
● Find the right URLs to scrape.
● Find the right content in the webpages.
● Save the data into a data store.
● When should the web scraping program run?
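
The four steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the speaker's code: the URL, the sample HTML, the `TitleParser` class, and the `pages` table are all hypothetical, and the page is passed in as a string so the sketch runs without network access.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Step 2: pick the wanted content out of a page."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save_page(conn, url, title):
    """Step 3: save the extracted data into a data store (SQLite here)."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, title))
    conn.commit()


# Step 1: the URL to scrape (hypothetical). A real run would download it,
# for example with urllib.request.urlopen(url).read().
url = "http://example.com/weather"
sample_html = "<html><head><title> Current Weather </title></head><body></body></html>"

conn = sqlite3.connect(":memory:")
save_page(conn, url, extract_title(sample_html))
rows = conn.execute("SELECT url, title FROM pages").fetchall()
print(rows)
```

Step 4, deciding when to run, is usually handled outside the program, for instance by a scheduler such as cron.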

What is Scrapy?
● An open source web scraping framework for
Python.
● Scrapy is a fast high-level screen scraping
and web crawling framework, used to crawl
websites and extract structured data from
their pages. It can be used for a wide range
of purposes, from data mining to monitoring
and automated testing.

Features of Scrapy
● Define the data you want to scrape
● Write a spider to extract the data
● Built-in: selecting and extracting data from
HTML and XML
● Built-in: JSON, CSV, XML output
● Interactive shell console
● Built-in: web service, telnet console, logging
● Others

Installation of Scrapy
● pip
● APT repo
● RPM
● tarball (binary/source)

Create a new Scrapy project
$ scrapy startproject mybot
mybot/
mybot/scrapy.cfg
mybot/mybot/items.py
mybot/mybot/pipelines.py
mybot/mybot/settings.py
mybot/mybot/spiders/myspider.py
etc.

items.py
from scrapy.item import Item, Field

class HKOCurrentItem(Item):
    time = Field()
    station = Field()
    temperature = Field()
    humidity = Field()
    #......

spiders/hko_spider.py (1/5)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from weatherhk.items import HKOCurrentItem

import datetime, re

spiders/hko_spider.py (2/5)
class HKOCurrentSpider(BaseSpider):
    name = "HKOCurrentSpider"
    #allowed_domains = ["www.weather.gov.hk"]
    start_urls = [
        "http://www.weather.gov.hk/textonly/forecast/chinesewx.htm"
    ]

spiders/hko_spider.py (3/5)
    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        stations = []
        # Get weather data for each station: split the first <pre> block into lines.
        tx = hxs.select("//pre[1]/text()").re('[^\n]*\n')

spiders/hko_spider.py (4/5)
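        for i in tx:
            # Keep only lines that contain a digit followed by " 度" (degrees).
            if re.search(u'\d 度', i):
                data = HKOCurrentItem()
                # dt and self.station are defined in code not shown on the slides.
                data['time'] = int(dt)
                data['station'] = self.station.code(i)
                # The first run of digits on the line is the temperature.
                data['temperature'] = int(re.findall(u'\d+', i)[0])
                stations.append(data)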

spiders/hko_spider.py (5/5)
        return stations
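
The parsing steps on slides 3/5 to 5/5 can be tried without Scrapy. The sketch below is only an illustration under stated assumptions: the sample text, the station names, and the `parse_temperatures` function are hypothetical, the station is taken to be the first token on the line (the slides use a helper that is not shown), and plain dicts stand in for `HKOCurrentItem`. It keeps the same three steps: split the text into lines, keep lines with a digit followed by " 度", and take the first integer on each line as the temperature.

```python
import re


def parse_temperatures(text):
    """Return one dict per line that reports a temperature."""
    readings = []
    # Split into lines, keeping each trailing newline.
    for line in re.findall(r'[^\n]*\n', text):
        # A temperature line has a digit, a space, then 度 (degrees).
        if re.search(r'\d 度', line):
            readings.append({
                # Assumption: the first whitespace-separated token is the station.
                'station': line.split()[0],
                # The first run of digits on the line is the temperature.
                'temperature': int(re.findall(r'\d+', line)[0]),
            })
    return readings


# Hypothetical sample resembling a text-only weather report.
sample = "天文台錄得:\n京士柏 28 度\n香港仔 27 度\n"
readings = parse_temperatures(sample)
print(readings)
```

Taking the first run of digits only works while nothing earlier on the line contains a number, so a real page may need a stricter pattern.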

pipelines.py (1/2)
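class HKOCurrentPipeline(object):
    def process_item(self, item, spider):
        # One collection per station; self.db is set up in code not shown on the slides.
        station = self.db[item['station']]
        # Copy the item's field values into a plain dict for storage.
        storeditem = dict(item)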

pipelines.py (2/2)
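        try:
            if 'temperature' in storeditem:
                # Look up the most recent stored reading for this station.
                lasttime = station.find({'temperature': {'$gt': 0}}).sort('time', -1).limit(1)
                # Insert only when the observation time has changed.
                if lasttime[0]['time'] != storeditem['time']:
                    id = self.insert(storeditem)
        # (the slide omits the matching except clause)

        return item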
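
The pipeline's idea, storing a reading only when its time differs from the latest stored one for that station, can be sketched without a database. This is not the speaker's implementation: a plain dict of lists stands in for the MongoDB collections, `store_if_new` is a hypothetical name, the slide's `temperature > 0` filter is left out, and storing the first reading for a station is an assumption, since the slides do not show that case.

```python
def store_if_new(db, item):
    """Append item to its station's history unless the latest entry has the same time.

    db maps a station code to a list of stored readings (a stand-in for one
    database collection per station). Returns True when the item was stored.
    """
    history = db.setdefault(item['station'], [])
    # Only readings that carry a temperature are considered.
    if 'temperature' not in item:
        return False
    # Find the most recent stored reading by time, if any.
    latest = max(history, key=lambda r: r['time'], default=None)
    # Assumption: an empty history means the reading is new.
    if latest is not None and latest['time'] == item['time']:
        return False  # same observation time: treat as a duplicate
    history.append(dict(item))
    return True


db = {}
first = store_if_new(db, {'station': 'HKO', 'time': 1200, 'temperature': 28})
repeat = store_if_new(db, {'station': 'HKO', 'time': 1200, 'temperature': 28})
later = store_if_new(db, {'station': 'HKO', 'time': 1300, 'temperature': 29})
print(first, repeat, later, len(db['HKO']))
```

Comparing against only the latest entry keeps the check cheap, at the cost of missing duplicates that arrive out of order.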

1. Web Scraping 1-2-3 with Python + Scrapy Sammy Fung sammy.hk, gownjob.com
