PyCon APAC 2017: Page Layout Analysis of 19th Century Siamese Newspapers using Python and OpenCV
1. Page Layout Analysis of 19th Century Siamese Newspapers using Python and OpenCV
Mark Hollow
PyCon APAC, 2017
2. about me...
graduated in classical music · self-taught in computing
programming python since 2002 · 20 years working in IT
IT infrastructure · UNIX sysadmin · project management
software engineering · data systems · product management
3. once upon a time...
Dr Dan Beach Bradley - หมอ บรัดเลย
Born 18th July 1804, New York; died 23rd June 1873, Bangkok
Graduated as Doctor of Medicine from New York University
American Protestant missionary in Siam
Arrives in Bangkok on 18th July 1835 from Boston via Singapore
Brings with him the first printing press to Siam
Many notable achievements & firsts in Siam
first surgery, first vaccination, first printed Royal edict, first newspaper, first classified & commercial
advertising, first copyright transaction, first printing of Siamese law, first Siamese translation of the
Old Testament, first monolingual Siamese dictionary
4. the first siamese newspaper
The Bangkok Recorder - หนังสือจดหมายเหตุ
1844–1845 magazine-like, fact-based, introduces western
ideas, knowledge, science and Christianity
1865–1867 more social commentary and introduction of
western liberalism -- rather controversial
a lot of historical information...
thai society (seen from a western perspective)
regional and global news/information
prices of goods, services, imports and exports
5. there is no online searchable database of this historical information.
6. digital bangkok recorder project
DigitalBangkokRecorder · markhollow.com
objectives:
- scan all the surviving editions
- transcribe all text
- make all text available online
- learn how to do all of this
in this presentation:
- cleaning scanned images
- detecting the page layout
- extract all text lines
- prepare for transcription
7. page layout
2 column layout
front page: title & date lines
last page: tabular data
some illustrations
some full-width tables
8. a closer look...
large header on cover
dual-language headings
column separator line
topic separators
unique typeface
the first ever thai typeface
now-obsolete characters
not supported by modern ocr
11. what is opencv?
“OpenCV (Open Source Computer Vision Library) is an
open source computer vision and machine learning
software library.”
- opencv.org
Written in C++; bindings for Python and others
v3.2 used here; v3.1 probably works
v2.x won’t work - different API structure
many v2.x blogs/articles still online - beware!
12. opencv basics: installation
$ pip install opencv-python
or
$ pip install opencv-contrib-python
Note: the contrib package includes non-free, patented stuff!!
No FFmpeg, GTK or Carbon support, which limits some features.
Works well in jupyter/ipython.
13. opencv basics: loading/saving images
loading images…
>>> import cv2
>>> img = cv2.imread('image001.jpg')
>>> type(img)
<type 'numpy.ndarray'>
OpenCV images are numpy arrays! All common formats supported.
saving images...
>>> cv2.imwrite('newfile.png', img)
Extra args supported for image formats.
15. removing background noise (1)
- binarization: set pixel value based on threshold
- types: basic, adaptive
- both need experimentation with threshold value
# cv2.threshold returns (threshold used, binarized image), in that order
th, bin_image = cv2.threshold(image, 192, 255,
                              cv2.THRESH_BINARY)
bin_image = cv2.adaptiveThreshold(image, 255,
                                  cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, 101, 2)
16. removing background noise (2)
# th receives the threshold value that otsu selected
th, bin_image = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
otsu binarization tries to find best threshold value
example:
manual threshold guessed at v=192
otsu selects v=177
improvement in number of artifacts
19. morphological transforms (1)
- erosion: erodes away the boundaries of foreground objects
- dilation: dilates/thickens boundaries
NOTE: black = background, white = foreground
import numpy as np
kernel = np.ones((5, 5), np.uint8)
# ~ inverts the page so the ink becomes white foreground, then inverts back
new_image = ~cv2.erode(~original_image, kernel, iterations=1)
20. morphological transforms (2)
- opening: erosion+dilation; used for removing noise
- closing: dilation+erosion; closes small holes in objects
import numpy as np
kernel = np.ones((5, 5), np.uint8)
img2 = ~cv2.morphologyEx(~img1, cv2.MORPH_OPEN, kernel, iterations=1)
21. contours
“a curve joining all the continuous points having same color or intensity”
# OpenCV 3.x returns three values; 4.x returns only (contours, hierarchy)
_, contours, hierarchy = cv2.findContours(
    ~img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
col_img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
cv2.drawContours(col_img, contours, -1, (255,0,0), 5)
findContours return values:
contours: list of contours
hierarchy: contour structure
22. finding page margins
“open” removes artifacts; “dilate” emphasizes text (image: opened & dilated)
findContours() to group blocks; filter out small contours.
get margin from contour edges
24. morphological transforms (revisited)
structuring element (kernel) array is made of 1’s & 0’s
it’s compared to each pixel
erode: takes minimum value
dilate: takes maximum value
a linear structuring element will operate on linear patterns
Diagram: http://docs.opencv.org/3.1.0/d1/dee/tutorial_moprh_lines_detection.html
[Figure: dilation example showing kernel, input image and output image]
>>> cv2.getStructuringElement(cv2.MORPH_RECT, (10, 1))
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=uint8)
25. page segmentation
step 1: erode & dilate with a long linear structuring element: extracts lines to mask
step 2: findContours() on the mask gets contour coordinates
step 3: draw contour to remove line
centre line of contours used as page section boundaries
once for horizontal, then for vertical lines
26. page segmentation (full page)
figure labels: section boundaries · page margin
blank areas from average values of multiple adjacent lines
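One possible reading of "average values of multiple adjacent lines", sketched with numpy only: average each pixel row, smooth over a few neighbouring rows, and report long runs that stay near white. This interpretation, the function name blank_row_bands and all three parameters are assumptions, not details from the talk.

```python
import numpy as np

def blank_row_bands(bin_page, window=5, min_height=10, white_level=250):
    """Return (start, end) row ranges that are blank (end exclusive)."""
    row_means = bin_page.mean(axis=1)              # one average per pixel row
    # Average each row with its neighbours; pad with edge values so the
    # top and bottom of the page are not dragged towards zero.
    padded = np.pad(row_means, window // 2, mode="edge")
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")
    is_blank = smoothed >= white_level
    # Run boundaries show up as +1 / -1 steps in the 0/1 sequence.
    steps = np.diff(np.concatenate(([0], is_blank.astype(np.int8), [0])))
    starts = np.flatnonzero(steps == 1)
    ends = np.flatnonzero(steps == -1)
    return [(int(s), int(e)) for s, e in zip(starts, ends)
            if e - s >= min_height]

# Demo page: two bands of "text" separated by white space.
page = np.full((100, 50), 255, dtype=np.uint8)
page[10:30, 5:45] = 0
page[60:80, 5:45] = 0
bands = blank_row_bands(page)
print(bands)   # → [(32, 58), (82, 100)]
```

The narrow strip above the first text band is dropped because it is shorter than min_height.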
28. template matching: finding objects in an image
a template is a small image segment
result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, maxval, _, maxloc = cv2.minMaxLoc(result)
cv2.matchTemplate() returns match scores
29. structural analysis complete
- margins identified
- horizontal and vertical lines detected
- original lines removed
- blank areas identified
- removed decorative markers with templates
- use template matching to identify titles
- and therefore page style (e.g. first or other page)
35. transcription
- transcribe enough text for developing an OCR model
- regular ocr is very inaccurate due to the unique font
- hire typists or use Amazon Mechanical Turk
- there are a few problems to solve:
- transcription cost, guidelines needed due to archaic text & unique typeface
- how to develop an OCR system?
- retrain tesseract 4.x’s LSTM (Long Short Term Memory) neural network?
- use tensorflow or similar?
- perhaps that’s my next PyCon presentation!
36. appendix
Not enough time to cover these topics… :-(
- Removing page frames *
- Skew correction *
- Detecting tables †
- Detecting pictures
* See https://markhollow.com/
† Coming soon
Other resources:
- ocropus / ocropy: python document analysis tools
- scantailor: GUI for cleaning scanned documents
- CE316 / CE866: Computer Vision, University of Essex, UK
  http://orb.essex.ac.uk/ce/ce316/
37. in summary...
opencv basics · thresholds · morphological transformations · contours · masks · template matching and a little bit of numpy
...plus a practical application to document layout analysis
38. thank you for listening.
questions?
Mark Hollow
markhollow.com
DigitalBangkokRecorder