3. Web Scraping
● a computer software technique of extracting
information from websites. (Wikipedia)
● for business, hobbies, research, and more.
● This talk does NOT cover business cases.
4. CableTV & NOWTV Programme
(Past)
● 2004.
● slow, slow, slow, or worse: could not connect at all.
● used Flash.
5. HK Observatory and Joint Typhoon
Warning Center
● no easy data exchange format, e.g.
RSS/Atom.
● We won't check the websites every day.
6. Transportation - KMB, PTES
● no map view on KMB website for a bus route
in the past.
● Extremely poor, ugly (or much worse) map
UI on PTES.
7. My experiences on web scraping
● 2004: PHP
● years after: Python
● recent years: Python with Scrapy
9. Web Scraping
● Look for the right URLs to scrape.
● Look for the right content in the webpages.
● Save the data into a data store.
● When to run the web scraping program?
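The steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the speaker's code: the URL, the HTML snippet, the table layout, and the `reading` table name are all made up, and the page is held in a string instead of being downloaded so that the sketch runs offline.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical URL and page content; a real program would download the page.
SAMPLE_URL = "http://example.com/weather"
SAMPLE_HTML = """
<html><body>
<table>
<tr><td class="station">Station A</td><td class="temp">28</td></tr>
<tr><td class="station">Station B</td><td class="temp">26</td></tr>
</table>
</body></html>
"""


class CellParser(HTMLParser):
    """Collect the text of <td> cells, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_td and self.rows:
            self.rows[-1].append(data.strip())


def extract(html):
    """Step 2: pick the wanted content out of the page."""
    parser = CellParser()
    parser.feed(html)
    return [(name, int(temp)) for name, temp in parser.rows]


def save(records, conn):
    """Step 3: store the extracted records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reading (station TEXT, temperature INTEGER)"
    )
    conn.executemany("INSERT INTO reading VALUES (?, ?)", records)
    conn.commit()


records = extract(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
save(records, conn)
stored = conn.execute(
    "SELECT station, temperature FROM reading ORDER BY station"
).fetchall()
print(stored)
```

The last question, when to run, is usually answered outside the program, for example by a scheduler such as cron that starts the script at a fixed interval.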
10. What is Scrapy?
● An open source web scraping framework for
Python.
● Scrapy is a fast high-level screen scraping
and web crawling framework, used to crawl
websites and extract structured data from
their pages. It can be used for a wide range
of purposes, from data mining to monitoring
and automated testing.
11. Features of Scrapy
● define the data you want to scrape
● write a spider to extract the data
● Built-in: selecting and extracting data from
HTML and XML
● Built-in: JSON, CSV, XML output
● Interactive shell console
● Built-in: web service, telnet console, logging
● Others
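To give a feel for what "selecting and extracting" and "JSON, CSV output" mean in practice, here is a rough stand-in that uses only the Python standard library. It is not Scrapy's API: `xml.etree.ElementTree`, `json`, and `csv` are swapped in for Scrapy's selectors and exporters, and the XML snippet and station codes are invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up XML document; the station codes are placeholders.
SAMPLE_XML = """
<report>
  <station code="ST1"><temperature>28</temperature><humidity>80</humidity></station>
  <station code="ST2"><temperature>27</temperature><humidity>75</humidity></station>
</report>
""".strip()

root = ET.fromstring(SAMPLE_XML)

# Select every <station> element and extract its fields into a plain dict.
items = [
    {
        "station": node.get("code"),
        "temperature": int(node.findtext("temperature")),
        "humidity": int(node.findtext("humidity")),
    }
    for node in root.findall("./station")
]

# Export the same items as JSON ...
as_json = json.dumps(items)

# ... and as CSV, written to an in-memory buffer.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["station", "temperature", "humidity"])
writer.writeheader()
writer.writerows(items)
as_csv = buf.getvalue()

print(as_json)
print(as_csv)
```

With Scrapy, the spider only has to produce the items; the framework's built-in exporters take care of writing them out in the chosen format.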
14. items.py
from scrapy.item import Item, Field

class HKOCurrentItem(Item):
    time = Field()
    station = Field()
    temperature = Field()
    humidity = Field()
    #......
17. spiders/hko_spider.py (3/5)
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        stations = []
        # Get the weather data for each station, one line of text per entry.
        tx = hxs.select("//pre[1]/text()").re('[^\n]*\n')
18. spiders/hko_spider.py (4/5)
        for i in tx:
            # Keep only lines with a temperature reading ("度" means degrees).
            if re.search(u'\d 度', i):
                data = HKOCurrentItem()
                data['time'] = int(dt)
                data['station'] = self.station.code(i)
                data['temperature'] = int(re.findall(u'\d+', i)[0])
                stations.append(data)
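The line-by-line extraction in this spider can be tried without Scrapy. The sketch below is an approximation under stated assumptions: the sample text, the station names, and the timestamp are invented, plain dicts replace `HKOCurrentItem`, and taking the text before the first number as the station name is a hypothetical substitute for the `self.station.code()` helper, which is not shown in the slides.

```python
import re

# Invented sample of the <pre> text; the real page layout may differ.
SAMPLE_TEXT = "Alpha 28 度\nBeta 26 度\nNo reading available\n"

# Split the text into lines, the same way the spider's .re() call does.
lines = re.findall(r"[^\n]*\n", SAMPLE_TEXT)


def parse_lines(lines, timestamp):
    """Return one dict per line that contains a temperature reading."""
    stations = []
    for line in lines:
        # Skip lines without "<digit> 度" ("度" means degrees).
        if not re.search(r"\d 度", line):
            continue
        number = re.search(r"\d+", line)
        stations.append(
            {
                "time": timestamp,
                # Hypothetical rule: the name is whatever precedes the number.
                "station": line[: number.start()].strip(),
                "temperature": int(number.group()),
            }
        )
    return stations


result = parse_lines(lines, timestamp=1200)  # 1200 is a placeholder value
for row in result:
    print(row)
```

Keeping the parsing in a plain function like this makes it easy to test against saved copies of the page before wiring it into a spider.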