Page Layout Analysis of 19th Century Siamese Newspapers
using Python and OpenCV
Mark Hollow
PyCon APAC, 2017
about me...
graduated in classical music · self-taught in computing
programming python since 2002 · 20 years working in IT
IT infrastructure · UNIX sysadmin · project management
software engineering · data systems · product management
2
once upon a time...
Dr Dan Beach Bradley - หมอ บรัดเลย
Born 18th July 1804, New York; died 23rd June 1873, Bangkok
Graduated as Doctor of Medicine from New York University
American Protestant missionary in Siam
Arrives in Bangkok on 18th July 1835 from Boston via Singapore
Brings with him the first printing press to Siam
Many notable achievements & firsts in Siam
first surgery, first vaccination, first printed Royal edict, first newspaper, first classified & commercial
advertising, first copyright transaction, first printing of Siamese law, first Siamese translation of the
Old Testament, first monolingual Siamese dictionary
3
the first siamese newspaper
The Bangkok Recorder - หนังสือจดหมายเหตุ
1844–1845 magazine-like, fact-based, introduces western
ideas, knowledge, science and Christianity
1865–1867 more social commentary and introduction of
western liberalism -- rather controversial
a lot of historical information...
thai society (seen from a western perspective)
regional and global news/information
prices of goods, services, imports and exports
4
there is no online
searchable database
of this historical
information.
5
DigitalBangkokRecorder
markhollow.com
digital bangkok recorder project
objectives
scan all the surviving editions
transcribe all text
make all text available online
learn how to do all of this
in this presentation
cleaning scanned images
detecting the page layout
extract all text lines
prepare for transcription
6
page layout
2 column layout
front page:
title & date lines
last page:
tabular data
some illustrations
some full-width
tables
7
a closer look...
large header on cover
dual-language headings
column separator line
topic separators
unique typeface
the first ever thai typeface
now-obsolete characters
not supported by modern ocr
8
basic workflow
1. SCAN
2. CLEAN
3. STRUCTURAL ANALYSIS
4. EXTRACT TEXT
5. TRANSCRIPTION
9
getting started
with opencv
10
what is opencv?
“OpenCV (Open Source Computer Vision Library) is an
open source computer vision and machine learning
software library.”
- opencv.org
Written in C++; bindings for Python and others
v3.2 used here; v3.1 probably works
v2.x won’t work - different API structure
many v2.x blogs/articles still online - beware!
11
opencv basics: installation
$ pip install opencv-python
or
$ pip install opencv-contrib-python
No FFmpeg, GTK or carbon support - limits some features.
Works well in jupyter/ipython.
Non-free
patented
stuff!!
12
opencv basics: loading/saving images
loading images…
>>> import cv2
>>> img = cv2.imread('image001.jpg')
>>> type(img)
<type 'numpy.ndarray'>
saving images...
>>> cv2.imwrite('newfile.png', img)
OpenCV images are numpy arrays!
All common formats supported.
Extra args supported for image formats.
13
document cleaning.
14
removing background noise (1)
- binarization: set pixel value based on threshold
- types: basic, adaptive
- both need experimentation with threshold value
# cv2.threshold returns (threshold used, binarized image), in that order
th, bin_image = cv2.threshold(image, 192, 255,
    cv2.THRESH_BINARY)
bin_image = cv2.adaptiveThreshold(image, 255,
    cv2.ADAPTIVE_THRESH_MEAN_C,
    cv2.THRESH_BINARY, 101, 2)
15
removing background noise (2)
# cv2.threshold returns (threshold chosen by Otsu, binarized image)
th, bin_image = cv2.threshold(img, 0, 255,
    cv2.THRESH_BINARY + cv2.THRESH_OTSU)
otsu binarization tries to find best threshold value
example:
manual threshold guessed at v=192
otsu selects v=177
improvement in number of artifacts
16
removing background noise (3)
* Contrast emphasized for display purposes.
17
structural analysis:
page margins
18
morphological transforms (1)
- erosion: erodes away the boundaries of the foreground object
- dilation: dilates/thickens the boundaries
NOTE: black = background, white = foreground
(scans are black ink on white paper, so invert with ~ before and after)
import numpy as np
kernel = np.ones((5, 5), np.uint8)
new_image = ~cv2.erode(~original_image, kernel, iterations=1)
19
morphological transforms (2)
- opening: erosion followed by dilation; used for removing noise
- closing: dilation followed by erosion; closes small holes in objects
kernel = np.ones((5, 5), np.uint8)
img2 = ~cv2.morphologyEx(~img1, cv2.MORPH_OPEN, kernel, iterations=1)
20
contours
“a curve joining all the
continuous points having same
color or intensity”
_, contours, hierarchy = cv2.findContours(
~img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
col_img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
cv2.drawContours(col_img, contours, -1, (255,0,0), 5)
findContours return values:
contours: list of contours
hierarchy: contour structure
21
finding page margins
“open” removes artifacts; “dilate” emphasizes text (opened & dilated)
findContours() to group blocks; filter out small contours.
get margin from contour edges
22
structural analysis:
identify page
sections
23
morphological transforms (revisited)
structuring element (kernel) array is made of 1’s & 0’s
the kernel is centred on each pixel in turn
erode: takes the minimum value under the kernel’s 1’s
dilate: takes the maximum value under the kernel’s 1’s
a linear structuring element will operate on linear patterns
[Diagram: dilation example showing kernel, input image and output image. Source: http://docs.opencv.org/3.1.0/d1/dee/tutorial_moprh_lines_detection.html]
>>> cv2.getStructuringElement(cv2.MORPH_RECT, (10, 1))
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=uint8)
24
page segmentation
1. erode & dilate with a long linear structuring element: extracts lines to mask
2. findContours() on the mask gets contour coordinates
3. draw contour to remove line
once for horizontal, then for vertical lines
centre line of contours used as page section boundaries
25
page segmentation (full page)
section boundaries
page margin
blank areas from average values of multiple adjacent lines
26
structural analysis:
topic separators
27
template matching: finding objects in an image
result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, maxval, _, maxloc = cv2.minMaxLoc(result)
a template is a small image segment
cv2.matchTemplate() returns a map of match scores, one per template position
28
structural analysis complete
- margins identified
- horizontal and vertical lines detected
- original lines removed
- blank areas identified
- decorative markers removed with templates
- titles identified with template matching
- and therefore page style (e.g. first or other page)
29
structural analysis: first edition
30
extract
text lines
31
extract text lines
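THRESHOLD = 248
# dim=1 reduces each row to a single value (one output column);
# dim=0 would reduce each column to a single value (one output row)
# rows averaging >= THRESHOLD are near-white, i.e. blank gaps between text lines
thresholds = cv2.reduce(image, 1, cv2.REDUCE_AVG) >= THRESHOLD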
32
workflow: page layout analysis all done!
1. SCAN
2. CLEAN
3. STRUCTURAL ANALYSIS
4. EXTRACT TEXT
5. TRANSCRIPTION
✓ ✓ ✓
33
what’s next?
34
transcription
- transcribe enough text for developing an OCR model
- regular ocr is very inaccurate due to the unique font
- hire typists or use amazon mechanical turk
- there are a few problems to solve:
  - transcription cost; guidelines needed due to archaic text & unique typeface
  - how to develop an OCR system?
    - retrain tesseract 4.x’s LSTM (Long Short Term Memory) neural network?
    - use tensorflow or similar?
- perhaps that’s my next PyCon presentation!
35
appendix
Not enough time to cover these topics… :-(
- Removing page frames *
- Skew correction *
- Detecting tables †
- Detecting pictures
* See https://markhollow.com/
† Coming soon
Other resources:
- ocropus / ocropy: python document analysis tools
- scantailor: GUI for cleaning scanned documents
- CE316 / CE866: Computer Vision, University of Essex, UK
  http://orb.essex.ac.uk/ce/ce316/
36
in summary...
opencv basics · thresholds · morphological
transformations · contours · masks · template
matching and a little bit of numpy
...plus a practical application
to document layout analysis
37
thank you for listening.
questions?
38
Mark Hollow
markhollow.com
DigitalBangkokRecorder
