1. BBS Crawler
for Taiwan
bsdconv + pyte + telnetlib
by Buganini @ PyHUG
Sep. 2012
2. Obstacles
● Big5/UAO
● Segmented Big5
● Control Sequence
● Ambiguous Width
3. Big5/UAO
Gov.tw: BIG5-2003
Windows: CP950
Libiconv: BIG5(?), CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, BIG5-2003 (experimental)
Mozilla: UAO 2.41
BBS: UAO 2.50(?)
etc.
ref: http://moztw.org/docs/big5/
UAO
== Unicode At Once
== Unicode 補完計畫 (Unicode Completion Project)
!= Unicode
UAO is an extension of Big5 (by using PUA), including Chinese (Traditional/Simplified/Hong Kong), Japanese, and Cyrillic characters.
Ex: 喆 (95ED), 轮 (8879), Я (C854), か (C6F1)
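To see why a stock Big5 codec is not enough here, the sketch below tries two byte pairs from this deck with Python 3's built-in `big5` codec. The helper name `try_big5` is made up for illustration, and the expectation that 0x95ED is rejected assumes the standard-library table only covers plain Big5 (lead bytes 0xA1 and above).

```python
def try_big5(raw: bytes):
    """Decode with the standard-library big5 codec; return None on failure."""
    try:
        return raw.decode("big5")
    except UnicodeDecodeError:
        return None

standard = try_big5(b"\xae\xe1")   # a character from plain Big5
uao_only = try_big5(b"\x95\xed")   # 喆 in UAO, per the example above

print(repr(standard), repr(uao_only))
```

A crawler that relies on such a codec would lose every UAO-only character, which is the motivation for a converter that knows the UAO table.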
4. Segmented Big5
A two-byte Big5 character: \xAE\xE1
The same character split by a control sequence: \xAE \x1B[1;33m \xE1
PCMAN: Standard Tool
5. Control Sequence
08 08 20 20    ← ← SP SP
08 08 0a       ← ← ↓
e2 97 8f       ●
6. Ambiguous Width
7. Obstacles
Not anymore…
● Big5/UAO, Segmented Big5, Ambiguous Width: solved in bug5, using bsdconv
● Control Sequence: solved, using pyte
https://github.com/buganini/bug5
https://github.com/buganini/bsdconv
https://github.com/selectel/pyte
8. bsdconv (1/4)
import bsdconv
bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
Input: "\xAE\xE1\xAE\x1B[1;33m\xE1"
Raw bytes: AE E1 AE 1B 5B 31 3B 33 33 6D E1
After "ansi-control,byte": 03AE 03E1 03AE 1B5B313B33336D 03E1
After "big5-defrag": 03AE 03E1 03AE 03E1 1B5B313B33336D
After "byte,ansi-control": AE E1 AE E1 1B 5B 31 3B 33 33 6D
After "skip,big5": 016851 016851 1B5B313B33336D (U+6851 == 桑)
Bsdconv Internal Prefix:
01: Unicode
03: Byte
1B: ANSI Control Sequence
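The defragmentation step can be imitated in pure Python to make the idea concrete. This is only a sketch of the technique, not bsdconv's `big5-defrag` implementation: the function name `defrag_big5`, the simplified CSI pattern, and the lead-byte range 0x81 to 0xFE are assumptions chosen for illustration.

```python
import re

# Simplified ANSI CSI sequence: ESC [ parameters final-letter
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")

def defrag_big5(data: bytes) -> bytes:
    """Move control sequences that split a two-byte character to after it."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        m = CSI.match(data, i)
        if m:                                  # control sequence between characters
            out += m.group(0)
            i = m.end()
        elif 0x81 <= data[i] <= 0xFE:          # assumed Big5/UAO lead byte
            lead, j = data[i], i + 1
            pending = bytearray()
            m = CSI.match(data, j)
            while m:                           # sequences wedged inside the character
                pending += m.group(0)
                j = m.end()
                m = CSI.match(data, j)
            out.append(lead)
            if j < n:                          # trail byte found: rejoin the pair
                out.append(data[j])
                j += 1
            out += pending
            i = j
        else:                                  # plain single byte
            out.append(data[i])
            i += 1
    return bytes(out)

fixed = defrag_big5(b"\xAE\xE1\xAE\x1B[1;33m\xE1")
text = CSI.sub(b"", fixed).decode("big5")
```

With the sample input from the slide, `fixed` keeps the colour sequence but places it after the second character, so the remaining bytes decode cleanly.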
9. bsdconv (2/4)
import bsdconv
bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
>>> c=bsdconv.Bsdconv("ansi-control,byte:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
03AE
03E1
03AE
1B5B313B33336D ( FREE )
03E1
>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
03AE
03E1
03AE
03E1
1B5B313B33336D ( FREE )
Bsdconv Internal Prefix:
03: Byte
1B: ANSI Control Sequence
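The tagged output above can be mimicked with a few lines of plain Python, which may help when reading `bsdconv_stdout` dumps. This is an imitation of the printed format only, not bsdconv's internals; `tag_tokens` and the simplified CSI pattern are hypothetical names and assumptions.

```python
import re

# Simplified ANSI CSI sequence: ESC [ parameters final-letter
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")

def tag_tokens(data: bytes) -> list:
    """Split input into tokens printed like the dump above."""
    tokens = []
    i = 0
    while i < len(data):
        m = CSI.match(data, i)
        if m:
            # The hex dump of a control sequence already starts with 1B (ESC).
            tokens.append(m.group(0).hex().upper())
            i = m.end()
        else:
            # Any other byte is tagged with the 03 (Byte) prefix.
            tokens.append("03%02X" % data[i])
            i += 1
    return tokens

tokens = tag_tokens(b"\xAE\xE1\xAE\x1B[1;33m\xE1")
print("\n".join(tokens))
```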
10. bsdconv (3/4)
import bsdconv
bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|pass:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
AE
E1
AE
E1
1B5B313B33336D ( FREE SKIP )
>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
016851
016851
1B5B313B33336D ( FREE )
Bsdconv Internal Prefix:
01: Unicode
1B: ANSI Control Sequence
#U+6851 == 桑
11. bsdconv (4/4)
import bsdconv
bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
>>> s=c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
>>> s
'\xe6\xa1\x91\xe6\xa1\x91\x1b[1;33m'
>>> s.decode("utf-8")
u'\u6851\u6851\x1b[1;33m'
#U+6851 == 桑
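The session above is Python 2, where `str` holds raw bytes. As a hedged aside, the final decoding step would look roughly like this in Python 3, assuming the converter hands back a `bytes` object (not checked here against any particular bsdconv release):

```python
# UTF-8 output as shown on the slide, written as a Python 3 bytes literal.
s = b"\xe6\xa1\x91\xe6\xa1\x91\x1b[1;33m"

# Decoding yields two CJK characters followed by the untouched colour sequence.
text = s.decode("utf-8")
print(text == "\u6851\u6851\x1b[1;33m")
```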
12. pyte (1/2)
pyte: Python Terminal Emulator
import pyte
stream = pyte.Stream()
screen = pyte.Screen(80, 24)
screen.mode.discard(pyte.modes.LNM)
stream.attach(screen)
seq = SEQUENCE_FROM_SERVER
useq = c.conv(seq)  # c is the bsdconv converter from the previous slides
stream.feed(useq.decode("utf-8"))
RESULT_SCREEN = "\n".join(screen.display).encode("utf-8")
With pyte.modes.LNM:
\n → LF+CR (line feed, then carriage return to the first column)
Without pyte.modes.LNM:
\n → LF only (cursor moves down, column unchanged)
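The effect of the mode on cursor movement can be shown with a toy model. This is not pyte code: `ToyScreen` is a hypothetical stand-in that handles only CR, LF, and printable characters, written under the assumption that LNM decides whether a received line feed also returns the cursor to the first column.

```python
class ToyScreen:
    """Toy cursor model: handles CR, LF and printable characters only."""

    def __init__(self, columns=10, lines=3, lnm=False):
        self.rows = [[" "] * columns for _ in range(lines)]
        self.x = 0
        self.y = 0
        self.lnm = lnm

    def feed(self, text):
        for ch in text:
            if ch == "\r":
                self.x = 0                       # carriage return
            elif ch == "\n":
                self.y = min(self.y + 1, len(self.rows) - 1)
                if self.lnm:                     # LNM: line feed implies CR
                    self.x = 0
            elif self.x < len(self.rows[0]):
                self.rows[self.y][self.x] = ch
                self.x += 1

    @property
    def display(self):
        return ["".join(row).rstrip() for row in self.rows]

without_lnm = ToyScreen(lnm=False)
without_lnm.feed("ab\ncd")
with_lnm = ToyScreen(lnm=True)
with_lnm.feed("ab\ncd")
```

Without the mode, `cd` lands under the cursor column on the next line; with it, `cd` starts at the left edge. A server that sends a bare line feed to move straight down therefore needs the mode switched off.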
13. pyte (2/2)
#Ambiguous Width
screens.py
width_counter=bsdconv.Bsdconv("utf-8:width:null")
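For readers without bsdconv at hand, the ambiguous-width problem can be illustrated with the standard library. The sketch below is not the `width` codec used above; `display_width` is a hypothetical helper that treats East Asian "ambiguous" characters as two columns when asked to, which is the assumption a Big5 BBS terminal typically makes. Combining and control characters are ignored for brevity.

```python
import unicodedata

def display_width(text: str, ambiguous_wide: bool = True) -> int:
    """Count terminal columns, optionally treating ambiguous characters as wide."""
    width = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):          # wide or fullwidth: always two columns
            width += 2
        elif eaw == "A":               # ambiguous: depends on the terminal
            width += 2 if ambiguous_wide else 1
        else:
            width += 1
    return width

print(display_width("\u25cf"), display_width("\u25cf", ambiguous_wide=False))
```

The same character (●, U+25CF) occupies one or two columns depending on the setting, so the emulator and the server must agree or every later column on the line shifts.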
14. telnetlib (1/3)
What's wrong with read_until/expect?
What telnetlib does:
Server → telnetlib connection → telnetlib.read_until
What I need:
Server → telnetlib connection → bsdconv → telnetlib.read_until / Regular Expression
Solutions:
a) Implement bsdconv → telnetlib.read_until (current)
b) Hack telnetlib (maybe cleaner)
c) Other telnetlib implementation?
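Option (a) amounts to matching against converted text rather than raw bytes. A minimal sketch of that idea follows, with loud assumptions: `read_until_converted` is a hypothetical helper, `read_chunk` stands for any non-blocking reader (for example a telnet connection's `read_very_eager`), and Python's incremental `big5` decoder stands in for bsdconv's chunked conversion.

```python
import codecs
import re
import time

def read_until_converted(read_chunk, pattern, encoding="big5", timeout=5.0):
    """Read chunks, decode them incrementally, stop once pattern matches."""
    # The incremental decoder buffers a lead byte until its trail byte arrives,
    # so characters split across chunks are not corrupted.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    regex = re.compile(pattern)
    text = ""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        chunk = read_chunk()
        if not chunk:
            time.sleep(0.01)           # nothing yet: back off briefly
            continue
        text += decoder.decode(chunk)
        if regex.search(text):
            break
    return text

# Fake server output: one Big5 character split across two chunks.
chunks = iter([b"\xae", b"\xe1 login", b": "])
result = read_until_converted(lambda: next(chunks, b""), r"login: $", timeout=1.0)
```

Matching on decoded text avoids false hits where the second byte of a double-byte character happens to equal an ASCII character in the pattern.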
15. telnetlib (2/3)
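#Deal with lagging/noop
import time

# conn: telnetlib.Telnet connection; conv: bsdconv converter;
# stream/screen: pyte stream and screen (all set up beforehand)
def term_comm(feed=None, wait=None):
    if feed is not None:
        conn.write(feed)
        if wait:
            # blocking read: wait for the server to answer the input
            s = conn.read_some()
            s = conv.conv_chunk(s)
            stream.feed(s.decode("utf-8"))
    if wait is not False:
        # non-blocking read: pick up whatever arrived after a short pause
        time.sleep(0.1)
        s = conn.read_very_eager()
        s = conv.conv_chunk(s)
        stream.feed(s.decode("utf-8"))
    ret = "\n".join(screen.display).encode("utf-8")
    return ret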
Reading | Feed | No Feed
Wait=None | Non-blocking | Non-blocking
Wait=True | Blocking | Non-blocking (unused)
Wait=False | No | No
16. telnetlib (3/3)
#Deal with lagging/noop
Action with or without screen refresh:
term_comm('Action A', False)
term_comm('Action B', True)
#Action A+B cause screen refresh
Action with screen refresh (important content):
term_comm('Action', True)
Action with screen refresh:
term_comm('Action')
Wait+Retry
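The "Wait+Retry" pattern can be sketched as a small polling helper. Everything here is illustrative: `wait_for_screen` is a hypothetical name, and the `term_comm` argument is assumed to be a callable that performs a non-blocking read and returns the current screen text, in the spirit of the function on the previous slide.

```python
import time

def wait_for_screen(term_comm, predicate, retries=10, delay=0.1):
    """Poll term_comm() until predicate(screen_text) holds or retries run out."""
    for _ in range(retries):
        screen_text = term_comm()      # non-blocking read, returns the screen
        if predicate(screen_text):
            return screen_text
        time.sleep(delay)              # server is lagging: wait, then retry
    return None

# Fake terminal that needs a few polls before the expected screen shows up.
frames = iter(["", "Loading", "Main Menu"])
fake_term_comm = lambda: next(frames, "Main Menu")

result = wait_for_screen(fake_term_comm, lambda s: "Main Menu" in s, delay=0)
```

Returning `None` after the last retry lets the caller decide whether to resend the action or give up.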