My first-crawler-in-python

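  1. Viller Hsiao
  2. Scraping financial-report data with Python • Practice Python • Practice writing Python well
  3. Scraping financial-report data with Python • Practice Python • Practice writing Python well • Understand web architecture • Calculate stock value
  4. Steps • Fetch the web page • Parse the content • Compute on the data
  5. Data source: table type, stock ID
  6. Inspect Element
  7. Developer Tools • Practice the Google Python Style Guide
  8. The Fantastic Voyage of a Middle-Aged Py (a play on the Chinese title of "Life of Pi") http://static.ettoday.net/images/206/206484.jpg
  9. Python Modules • Parse DOM: urllib + SGMLParser, requests + BeautifulSoup4 • Excel: xlutils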
  10. urllib

          import urllib  # Python 2 API; Python 3 moved urlopen to urllib.request

          url = 'http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm'
          webcode = urllib.urlopen(url)
          if webcode.code == 200:
              self.webpage = webcode.read()
          webcode.close()

  11. SGMLParser (method outline; the bodies are not shown on the slide)

          class AccountTable(SGMLParser):
              def feed(self, data):
              def start_tr(self, attr):
              def end_tr(self):
              def handle_data(self, data):

  12. Oops

          def start_table(self, attrs):
              # attrs is a list of (name, value) pairs for the <table> tag
              if len(attrs) > 0:
                  for at in attrs:
                      if at[0] == 'id' and at[1] == 'oMainTable':
                          self.isTargetTbl = True
  13. Chinese character encoding conversion

          line.encode('big5').decode('utf8')

  14. v2.0 • Coding style refinement • Google Python Style Guide • Python idioms
  15. g0v project
  16. requests

          import requests

          def parse_url(url):
              r = requests.get(url)
              if r.status_code == requests.codes.ok:
                  parse_html(r.text)

  17. BeautifulSoup

          from bs4 import BeautifulSoup

          def parse_html(html_text):
              soup = BeautifulSoup(html_text)
              # class is a Python keyword, so bs4 takes the filter as class_
              rows = soup.find('table', class_='t01')
              rows = rows.find_all('tr')
              data = []
              for row in rows:
                  cols = row.find_all('td')
                  cols = [e.text.encode('utf-8').strip() for e in cols]
                  data.append(cols)

      Sample row markup: <td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>
  18. Future Plan • concurrent / gevent • fake browser header • free proxy
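
The parse and compute steps from the slides can be sketched in Python 3 using only the standard library. This is a rough illustration, not the author's code: `html.parser.HTMLParser` stands in for `SGMLParser` (which no longer exists in Python 3), and the names `TableRows`, `parse_roc_month`, and `parse_amount` are hypothetical. The sample markup wraps the row from the BeautifulSoup slide in a table with the `oMainTable` id from the "Oops" slide, which is an assumption about the page layout, as is reading `104/05` as a Republic of China calendar year and month (ROC year + 1911 = Gregorian year).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every row inside one table, chosen by id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0        # > 0 while inside the target table
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth:
                self.depth += 1  # nested table inside the target
            elif dict(attrs).get("id") == self.table_id:
                self.depth = 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


def parse_roc_month(text):
    """Turn 'YYY/MM' in the ROC calendar into (Gregorian year, month)."""
    year, month = text.split("/")
    return int(year) + 1911, int(month)


def parse_amount(text):
    """Turn a comma-grouped figure such as '70,154,763' into an int."""
    return int(text.replace(",", ""))


# Assumed sample: the row from the slide, wrapped in the target table.
SAMPLE = ('<table id="oMainTable"><tr>'
          '<td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>'
          '</tr></table>')

parser = TableRows("oMainTable")
parser.feed(SAMPLE)
month, amount = parser.rows[0]
print(parse_roc_month(month), parse_amount(amount))  # (2015, 5) 70154763
```

Tracking table depth keeps cells from unrelated tables on the same page out of the result, which is the problem the `start_table` check on the "Oops" slide appears to address.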

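Two items from the Future Plan slide, concurrent fetching and a browser-like request header, might look roughly like the sketch below. It uses a standard-library thread pool rather than gevent, and `urllib.request` rather than `requests`, so it runs without extra packages. The helper names and the User-Agent string are hypothetical, and the free-proxy item is not covered.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Hypothetical header value; any browser-like User-Agent string would do.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def build_request(url):
    """Attach the browser-like headers to a request for url."""
    return Request(url, headers=BROWSER_HEADERS)


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, fetch_one=fetch, workers=4):
    """Fetch URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Calling `fetch_all` with a list of page URLs returns the page bodies in the same order. The `fetch_one` parameter exists so the pool logic can be exercised with a stub function instead of real network calls.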