BBS Crawler
     for Taiwan

bsdconv + pyte + telnetlib


 by Buganini @ PyHUG
      Sep. 2012
Obstacles
●   Big5/UAO
●   Segmented Big5
●   Control Sequence
●   Ambiguous Width
Big5/UAO
●   Gov.tw: BIG5-2003
●   Windows: CP950
●   Libiconv: BIG5(?), CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001,
    BIG5-HKSCS:1999, BIG5-2003 (experimental)
●   Mozilla: UAO 2.41
●   BBS: UAO 2.50(?)
●   etc.

ref: http://moztw.org/docs/big5/

UAO
   == Unicode At Once
   == Unicode 補完計畫 ("Unicode Completion Project")
   != Unicode

UAO
   is extended Big5 (by using PUA),
   including Chinese (Traditional/Simplified/Hong Kong), Japanese, Cyrillic

   Ex: 喆 (95ED), 轮 (8879), Я (C854), か (C6F1)
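
To see why the choice of Big5 variant matters, the sketch below decodes the same byte pairs with the Big5-family codecs that ship with Python. It is only an illustration: Python has no UAO codec, so none of these reproduces what a BBS sends, and `try_decode` is a hypothetical helper name. AE E1 is the sample pair used later in this deck; 95 ED is the UAO example for 喆 listed above.

```python
# Illustration only: compare how Python's built-in Big5-family codecs
# treat the same byte pairs. None of these codecs implements UAO.
CODECS = ("big5", "cp950", "big5hkscs")


def try_decode(raw, codecs=CODECS):
    """Return {codec name: decoded text, or None if decoding failed}."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    # AE E1 is plain Big5; 95 ED lies outside the standard Big5 range.
    for raw in (b"\xae\xe1", b"\x95\xed"):
        print(raw.hex(), try_decode(raw))
```

A pair that decodes to `None`, or to different characters under different codecs, is the kind of data that needs a UAO-aware converter such as bsdconv.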
Segmented Big5
   One double-byte character:                   \xAE\xE1
   The same character with a control sequence
   inserted between its two bytes:              \xAE  \x1B[1;33m  \xE1

   [Screenshots: PCMAN vs. a standard tool]
Control Sequence
   08 08 20 20   ← ← SP SP
   08 08 0a      ← ← ↓
   e2 97 8f      ●
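
The byte dumps above show why the raw stream is not the text on screen: backspaces and line feeds move the cursor, and later characters overwrite earlier ones. The toy model below is a sketch of that idea only, not how pyte is implemented. `MiniScreen` is a hypothetical class that handles just BS, CR and LF, and it assumes LF moves down without returning to column 0.

```python
class MiniScreen:
    """A deliberately tiny screen model: BS, CR, LF and plain characters."""

    def __init__(self, columns=10, lines=3):
        self.columns = columns
        self.buffer = [[" "] * columns for _ in range(lines)]
        self.x = 0
        self.y = 0

    def feed(self, data):
        for ch in data:
            if ch == "\x08":             # BS: cursor left, nothing is erased
                self.x = max(0, self.x - 1)
            elif ch == "\r":             # CR: back to column 0
                self.x = 0
            elif ch == "\n":             # LF: one line down, column unchanged
                self.y = min(len(self.buffer) - 1, self.y + 1)
            elif self.x < self.columns:  # anything else: overwrite the cell
                self.buffer[self.y][self.x] = ch
                self.x += 1

    @property
    def display(self):
        return ["".join(row) for row in self.buffer]


screen = MiniScreen()
screen.feed("ab")
screen.feed("\x08\x08  ")   # 08 08 20 20: step back twice, overwrite with spaces
screen.feed("\x08\x08\n")   # 08 08 0a: step back twice, then move down
screen.feed("cd")
print(screen.display)
```

After these feeds the first row is blank again and "cd" sits at the start of the second row, which a crawler reading the raw bytes would not see directly.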
Obstacles: not anymore…

●   Big5/UAO, Segmented Big5:            solved in bug5, using bsdconv
●   Ambiguous Width, Control Sequence:   solved, using pyte




https://github.com/buganini/bug5

https://github.com/buganini/bsdconv

https://github.com/selectel/pyte
bsdconv                           (1/4)
import bsdconv

bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

Input:  \xAE\xE1\xAE\x1B[1;33m\xE1
        AE E1 AE 1B 5B 31 3B 33 33 6D E1

Stage                  Data after the stage
ansi-control,byte      03AE 03E1 03AE 1B5B313B33336D 03E1
big5-defrag            03AE 03E1 03AE 03E1 1B5B313B33336D
byte,ansi-control      AE E1 AE E1 1B 5B 31 3B 33 33 6D
skip,big5              016851 016851 1B5B313B33336D

Bsdconv Internal Prefix:
   01: Unicode
   03: Byte
   1B: ANSI Control Sequence

#U+6851 == 桑
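
The defragmentation step traced above can be imitated in plain Python. This is a rough sketch, not bsdconv's implementation: `defrag_big5` is a hypothetical helper, it only recognises CSI sequences (ESC [ ... final byte), and it assumes any byte from 0x81 to 0xFE starts a double-byte character. Like the trace, it moves the control sequence to just after the character it interrupted.

```python
import re

# One CSI sequence such as ESC [ 1 ; 3 3 m.
CSI = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")


def defrag_big5(data):
    """Move CSI sequences that split a double-byte character to just after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x81 or byte == 0xFF:
            # ASCII or control byte (including ESC): copy unchanged.
            out.append(byte)
            i += 1
            continue
        # Lead byte: collect any CSI sequences between it and the trail byte.
        j = i + 1
        pending = bytearray()
        match = CSI.match(data, j)
        while match:
            pending += match.group()
            j = match.end()
            match = CSI.match(data, j)
        if j < len(data):
            out += bytes([byte, data[j]]) + pending
            i = j + 1
        else:
            # Truncated input: keep what we have, in the original order.
            out += bytes([byte]) + pending
            i = j
    return bytes(out)


raw = b"\xae\xe1\xae\x1b[1;33m\xe1"
fixed = defrag_big5(raw)
print(fixed)                        # b'\xae\xe1\xae\xe1\x1b[1;33m'
print(repr(fixed.decode("big5")))   # two U+6851 characters, then the sequence
```

Moving the sequence loses the two-colour rendering of the split character, which does not matter for a crawler that only wants the text.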
bsdconv                      (2/4)
 import bsdconv

 bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")


>>> c=bsdconv.Bsdconv("ansi-control,byte:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
03AE
03E1
03AE
1B5B313B33336D ( FREE )
03E1

>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
03AE
03E1
03AE
03E1
1B5B313B33336D ( FREE )
Bsdconv Internal Prefix:
03: Byte
1B: ANSI Control Sequence
bsdconv                      (3/4)
 import bsdconv

 bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")


>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|pass:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
AE
E1
AE
E1
1B5B313B33336D ( FREE SKIP )

>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
016851
016851
1B5B313B33336D ( FREE )
Bsdconv Internal Prefix:
01: Unicode
1B: ANSI Control Sequence

#U+6851 == 桑
bsdconv                      (4/4)
import bsdconv

bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")


>>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
>>> s=c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")

>>> s
'\xe6\xa1\x91\xe6\xa1\x91\x1b[1;33m'

>>> s.decode("utf-8")
u'\u6851\u6851\x1b[1;33m'




#U+6851 == 桑
pyte       (1/2)
Python Terminal Emulator

import pyte

stream = pyte.Stream()
screen = pyte.Screen(80, 24)
screen.mode.discard(pyte.modes.LNM)
stream.attach(screen)

# c is the bsdconv converter from the previous slides
seq=SEQUENCE_FROM_SERVER
useq=c.conv(seq)
stream.feed(useq.decode("utf-8"))

# one string per screen row, joined into a single text block
RESULT_SCREEN="\n".join(screen.display).encode("utf-8")

With pyte.modes.LNM:
 \r → CR+LF (CarriageReturn / LineFeed)
Without pyte.modes.LNM:
 \r → CR
pyte           (2/2)
                                   #Ambiguous Width
screens.py

width_counter=bsdconv.Bsdconv("utf-8:width:null")
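
The slide does not show what the `width` conversion returns, so the following is only a standard-library analogue of a width counter, not bsdconv's behaviour. `display_width` is a hypothetical helper. It assumes East Asian Ambiguous characters such as ● occupy two cells, which is an assumption about how the BBS client renders them; pass `ambiguous_wide=False` for the other convention.

```python
import unicodedata


def display_width(text, ambiguous_wide=True):
    """Count terminal columns, treating East Asian Ambiguous as 1 or 2 cells."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                      # combining marks take no cell
        kind = unicodedata.east_asian_width(ch)
        if kind in ("W", "F"):
            width += 2                    # always wide
        elif kind == "A":
            width += 2 if ambiguous_wide else 1
        else:
            width += 1                    # narrow, half-width, neutral
    return width


print(unicodedata.east_asian_width("\u25cf"))                 # 'A' for ●
print(display_width("\u25cf\u6851a"))                         # 2 + 2 + 1
print(display_width("\u25cf\u6851a", ambiguous_wide=False))   # 1 + 2 + 1
```

If the emulator and the server disagree on the width of an ambiguous character, every column after it on that row is shifted, which is why the counter has to match the server's convention.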
telnetlib           (1/3)




What's wrong with read_until/expect?
  What telnetlib does:
    Server → telnetlib connection → telnetlib.read_until

  What I need:
    Server → telnetlib connection → bsdconv → telnetlib.read_until
    Regular Expression

Solutions:
  a) Implement bsdconv → telnetlib.read_until (current)
  b) Hack telnetlib (maybe cleaner)
  c) Another telnetlib implementation?
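
One way to picture option (a) is a read loop that converts each chunk before matching. The sketch below is an assumption-laden stand-in: it uses Python's incremental `big5` decoder where the deck uses bsdconv, `read_chunk` is a hypothetical callable in place of a telnet read, and `read_until_match` is not part of telnetlib.

```python
import codecs
import re


def read_until_match(read_chunk, pattern, encoding="big5", max_reads=50):
    """Decode chunks incrementally and stop once `pattern` matches the text.

    `read_chunk` is any zero-argument callable returning bytes (b"" when
    nothing is available); it stands in for a telnet read call.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    regex = re.compile(pattern)
    text = ""
    for _ in range(max_reads):
        chunk = read_chunk()
        # The incremental decoder holds a dangling lead byte until the
        # trail byte arrives in a later chunk.
        text += decoder.decode(chunk)
        match = regex.search(text)
        if match:
            return text, match
    return text, None


# Fake server: the double-byte character AE E1 is split across two reads.
chunks = iter([b"login: \xae", b"\xe1", b" > "])
text, match = read_until_match(lambda: next(chunks, b""), r"\u6851 > $")
print(repr(text), bool(match))
```

Matching on decoded text rather than raw bytes is the point: a pattern written in Unicode can never line up with a character whose bytes arrive in different reads unless the conversion keeps state between chunks.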
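telnetlib             (2/3)
#Deal with lagging/noop
import time

def term_comm(feed=None, wait=None):
    # conn: telnetlib connection, conv: bsdconv converter,
    # stream/screen: pyte objects from the earlier slides
    if feed is not None:
        conn.write(feed)
        if wait:
            # wait=True: block until the server sends something
            s=conn.read_some()
            s=conv.conv_chunk(s)
            stream.feed(s.decode("utf-8"))
    if wait is not False:
        # wait=None or True: pause briefly, then drain whatever has arrived
        time.sleep(0.1)
        s=conn.read_very_eager()
        s=conv.conv_chunk(s)
        stream.feed(s.decode("utf-8"))
    ret="\n".join(screen.display).encode("utf-8")
    return ret

Reading        Feed            No Feed
wait=None      Non-blocking    Non-blocking
wait=True      Blocking        Non-blocking (unused)
wait=False     No read         No read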
telnetlib            (3/3)
                  #Deal with lagging/noop
Action with or without screen refresh
   term_comm('Action A', False)
   term_comm('Action B', True)
   #Action A+B cause screen refresh

Action with screen refresh (important content)
   term_comm('Action', True)

Action with screen refresh
   term_comm('Action')

Wait+Retry



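
The "Wait+Retry" idea can be sketched as a small polling helper. This is an illustration under assumptions, not code from the deck: `wait_for_screen` is a hypothetical name, and `refresh` stands in for a call such as `term_comm()` that reads pending data and returns the current screen text.

```python
import time


def wait_for_screen(refresh, predicate, retries=5, delay=0.1):
    """Call `refresh()` until `predicate(screen_text)` is true or retries run out."""
    screen_text = refresh()
    for _ in range(retries):
        if predicate(screen_text):
            return screen_text
        time.sleep(delay)          # give a lagging server time to respond
        screen_text = refresh()
    if predicate(screen_text):
        return screen_text
    raise TimeoutError("screen did not reach the expected state")


# Fake screen that only shows the menu on the third refresh.
frames = iter(["", "Loading...", "Main Menu"])
result = wait_for_screen(lambda: next(frames, "Main Menu"),
                         lambda text: "Main Menu" in text,
                         delay=0.01)
print(result)
```

Checking the rendered screen instead of the byte stream keeps the retry logic independent of how the server chose to redraw it.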
- Demo -
- End -
