My first-crawler-in-python

Scraping financial report data with Python
• Practice Python
• Practice writing Python well
• Understand web architecture
• Compute stock valuations

Steps
• Fetch the web page
• Parse the content
• Compute on the data

Data source
Table type and stock ID
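The one URL shown in this deck (`.../zch/zch_2330.djhtm`) appears to encode a table code and a stock ID. A minimal sketch of building such URLs, assuming that pattern generalizes; `BASE`, `build_url`, and the parameter names are hypothetical and not from the slides.

```python
# Assumption: the pattern is inferred from a single example URL in this deck,
# http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm
BASE = 'http://jdata.yuanta.com.tw/z/zc'


def build_url(table, stock_id):
    """Return the report URL for a table code (e.g. 'zch') and a stock ID."""
    return '{base}/{table}/{table}_{stock_id}.djhtm'.format(
        base=BASE, table=table, stock_id=stock_id)


print(build_url('zch', 2330))
```

Keeping the table code and the stock ID as separate arguments makes it easy to loop over several stocks for the same table.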

Developer tools
• Practice the Google Python Style Guide

Life of a Middle-Aged Py
http://static.ettoday.net/images/206/206484.jpg

Python Modules
• Parse DOM
  • urllib + SGMLParser
  • requests + BeautifulSoup4
• Excel
  • xlutils

urllib
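import urllib  # Python 2 API; Python 3 moved urlopen to urllib.request

url = 'http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm'
webcode = urllib.urlopen(url)
if webcode.code == 200:
    self.webpage = webcode.read()  # runs inside a method of the crawler class
webcode.close()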
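The snippet above uses the Python 2 `urllib` API. A minimal Python 3 sketch of the same fetch step might look like this; `fetch` and its `opener` parameter are hypothetical names, and `opener` exists only so the function can be tried without a network connection.

```python
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the status is not 200.

    Note: urlopen itself raises HTTPError for 4xx/5xx responses, so the
    status check mainly guards against other non-200 replies.
    """
    with opener(url) as response:   # the connection is closed on exit
        if response.status == 200:
            return response.read()
    return None
```

Calling `fetch('http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm')` would return the raw page bytes, which still need decoding.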

SGMLParser
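from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

class AccountTable(SGMLParser):
    def feed(self, data):          # entry point: pass in the raw HTML
        SGMLParser.feed(self, data)
    def start_tr(self, attrs):     # called for each <tr>
        pass                       # body elided on the slide
    def end_tr(self):              # called for each </tr>
        pass                       # body elided on the slide
    def handle_data(self, data):   # called for the text between tags
        pass                       # body elided on the slide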
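Since `sgmllib` no longer exists in Python 3, the same callback style can be sketched with the standard library's `html.parser.HTMLParser`. This is an illustration rather than the author's code; `RowCollector` and its attributes are hypothetical names.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # cells of the row being read, or None
        self._in_td = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._in_td = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data.strip()


parser = RowCollector()
parser.feed('<table><tr><td class="t3n0">104/05</td>'
            '<td class="t3n1">70,154,763</td></tr></table>')
print(parser.rows)  # [['104/05', '70,154,763']]
```

The state flags play the role that `start_tr`, `end_tr`, and `handle_data` play in the SGMLParser version.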

Oops
def start_table(self, attrs):
    if len(attrs) > 0:
        for at in attrs:
            if at[0] == 'id' and at[1] == 'oMainTable':
                self.isTargetTbl = True
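The nested checks above can be flattened. Because the parser passes `attrs` as a list of `(name, value)` tuples, a membership test expresses the same condition; `is_target_table` is a hypothetical helper name used only for this sketch.

```python
def is_target_table(attrs, target_id='oMainTable'):
    """Return True if the tag carries id=target_id.

    attrs is the list of (name, value) pairs that the parser hands to
    its start-tag callbacks.
    """
    return ('id', target_id) in attrs


print(is_target_table([('class', 't01'), ('id', 'oMainTable')]))  # True
print(is_target_table([]))                                         # False
```

An empty list needs no special case, since a membership test on it is simply False.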

Converting Chinese text encoding
line.decode('big5').encode('utf8')  # Big5 bytes -> unicode -> UTF-8 bytes
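In Python 3 the direction of the conversion is easier to see, because bytes and text are separate types. A small sketch, assuming the page is served as Big5; the sample string is made up for illustration.

```python
raw = '財報'.encode('big5')         # stand-in for bytes read from a Big5 page
text = raw.decode('big5')           # bytes -> str (Unicode)
utf8_bytes = text.encode('utf-8')   # str -> UTF-8 bytes, e.g. for a file

print(text)
print(len(raw), len(utf8_bytes))    # 4 6
```

Decoding always goes from bytes to text and encoding from text to bytes, so the Big5 step has to come first.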

v2.0
• Coding style refinement
  • Google Python Style Guide
  • Python idioms

requests
import requests

def parse_url(url):
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        parse_html(r.text)

BeautifulSoup
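from bs4 import BeautifulSoup

def parse_html(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')  # name the parser explicitly
    table = soup.find('table', class_='t01')        # bs4 spells the keyword class_
    data = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        # Python 2: UTF-8 byte strings; in Python 3, e.text.strip() is enough.
        cols = [e.text.encode('utf-8').strip() for e in cols]
        data.append(cols)
    return data

Sample cells from the table:
<td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>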
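The scraped cells are still strings, so the "compute on the data" step needs them converted first. A minimal sketch, assuming that `104/05` is a Republic of China (Minguo) year and month and that the second cell is a comma-grouped integer; the slides do not state either, and both function names are hypothetical.

```python
def parse_amount(text):
    """'70,154,763' -> 70154763"""
    return int(text.replace(',', ''))


def parse_roc_month(text):
    """'104/05' -> (2015, 5), assuming ROC year + 1911 = Gregorian year."""
    year, month = text.split('/')
    return int(year) + 1911, int(month)


row = ['104/05', '70,154,763']
print(parse_roc_month(row[0]), parse_amount(row[1]))  # (2015, 5) 70154763
```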

Future Plan
• concurrent / gevent
• fake browser header
• free proxy
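Two of these items can be sketched with the standard library alone. This swaps in `concurrent.futures` threads rather than gevent, and the User-Agent string is a made-up example; `make_request` and `fetch_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request

# Assumption: an example browser-like User-Agent, not taken from the slides.
BROWSER_HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}


def make_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return Request(url, headers=BROWSER_HEADERS)


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL on a thread pool.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

A real run could pass something like `lambda u: urlopen(make_request(u)).read()` as `fetch`. Threads suit this job because the crawler spends most of its time waiting on the network.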
