We propose a pipeline for text extraction from infographics that makes use of a novel combination of data mining and computer vision techniques. The pipeline defines a sequence of steps to identify characters, cluster them into text lines, determine their rotation angle, and apply state-of-the-art OCR to recognize the text. In this paper, we formally define the pipeline and present its current implementation. In addition, we have conducted preliminary evaluations over a data corpus of 121 manually annotated infographics from a broad range of illustration types such as bar charts, pie charts, line charts, maps, and others. We assess the results of our text extraction pipeline by comparing them with two baselines. Finally, we sketch an outline for future work and possibilities for improving the pipeline.
http://ceur-ws.org/Vol-1458/
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction From Infographics
1. Text Extraction from Infographics
Ansgar Scherp,
Kiel University and ZBW – Leibniz Information Centre for Economics, Germany
Falk Böschen,
Kiel University, Germany
LWA 2015 (KDML), Trier, Germany
2. Infographics Challenges
• Text with different font sizes
• Text with varying emphasis
• Text in different colors
• Text on different background colors
• Text rotated at different angles
• Text occluded by graphic elements
Slide [ 01 / 18 ]Falk Böschen and Ansgar Scherp
Initial presentation at [DocEng’15] → Now: Improve comparability and extensibility
3. Abstract Pipeline Idea
Input: Information Graphic
1. RE: Extract regions from graphic
2. RC: Cluster regions into text and non-text elements
3. LC: Computation of text lines for orientation estimation
4. PRE: Preprocessing of text elements for OCR
5. OCR: Optical Character Recognition
6. POST: Post-correction of OCR result
Output: Text
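The six steps above can be read as a chain of functions, each consuming the output of the previous one. The following is only a minimal sketch of that idea: `run_pipeline`, `make_stub`, and the stub stages are hypothetical names introduced for illustration and are not the authors' implementation.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (short name, function) pair, e.g. ("RE", extract_regions).
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(graphic: Any, stages: List[Stage]) -> Any:
    """Feed the output of each stage into the next one, in order."""
    data = graphic
    for _name, stage in stages:
        data = stage(data)
    return data


def make_stub(name: str) -> Callable[[Any], Any]:
    """Placeholder stage (assumption): only records that it was executed."""
    return lambda data: data + [name]


STAGE_NAMES = ["RE", "RC", "LC", "PRE", "OCR", "POST"]
stages = [(name, make_stub(name)) for name in STAGE_NAMES]
trace = run_pipeline([], stages)
print(trace)
```

Because every step has the same shape, a single stage (for example the OCR engine) could be swapped without touching the others, which is the kind of comparability and extensibility the talk aims for.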
RE: Region Extraction → RC: Region Clustering → LC: Text Line Computation → PRE: Preprocessing → OCR → POST: Postprocessing
4. Excerpt of Related Work
Authors | Title | RE RC LC PRE OCR POST
Chiang & Knoblock | Recognizing text in raster maps | ✍ ✍ ✔ ✔ ✔ ✔
Jayant et al. | Automated tactile graphics translation: in the field | ✍ ✍ ✍ ✔ ✔
Sas & Zolnierek | Three-Stage Method of Text Region Extraction from Diagram Raster Images | ✔ ✔ ✔ ✔
Huang et al. | Associating Text and Graphics for Scientific Chart Understanding | ✔ ✍ ✔ ✔ ✔ ✍
Lu et al. | Automated analysis of images in documents for intelligent document search | ✔ ✔ ✔
Xu & Krauthammer | A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images | ✔ ✔
Chen et al. | DiagramFlyer: A Search Engine for Data-Driven Diagrams | ? ? ? ? ? ?
Böschen & Scherp | Multi-oriented Text Extraction from Information Graphics | ✔ ✔ ✔ ✔ ✔
Gllavata et al. | Adaptive Fuzzy Text Segmentation in Images with Complex Backgrounds Using Color and Texture | ✔ ✔
Fraz et al. | Exploiting colour information for better scene text detection and recognition | ✔ ✔ ✔ ✔ ✔
Liu & Samarabandu | Multiscale Edge-Based Text Extraction from Complex Images | ✔ ✔
Olszewska | Active contour based optical character recognition for automated scene understanding | ✔ ✔ ✔
Lu et al. | Scene text extraction based on edges and support vector regression | ✔ ✔ ✍
5. Example: Adaptive Binarization and Labeling
• Binarization based on Otsu's method
• Extended by hierarchical computation using edge images for split-decision
• Connected Component Labeling with 8-neighbors
• Noise removal by region size thresholding
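The last two bullets can be illustrated with a small sketch, assuming the binarization has already produced a 0/1 image. It labels 8-connected foreground regions with a breadth-first search and drops regions below a size threshold. The function name `label_regions`, the toy image, and the value of `min_size` are assumptions for illustration; the hierarchical, edge-based extension of the binarization is not shown.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int]  # (row, column)


def label_regions(binary: List[List[int]], min_size: int = 2) -> List[List[Pixel]]:
    """Return the 8-connected foreground regions with at least min_size pixels."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions: List[List[Pixel]] = []
    for row in range(height):
        for col in range(width):
            if binary[row][col] != 1 or seen[row][col]:
                continue
            # Breadth-first search over the 8-neighborhood of each pixel.
            seen[row][col] = True
            queue = deque([(row, col)])
            region: List[Pixel] = []
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < height and 0 <= nc < width
                                and binary[nr][nc] == 1 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # Noise removal: keep only regions that are large enough.
            if len(region) >= min_size:
                regions.append(region)
    return regions


image = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
regions = label_regions(image, min_size=2)
print(regions)
```

In the toy image, the two diagonal pixels in the top-left corner form one region only because diagonal neighbors count under 8-connectivity, and the isolated pixel at (1, 4) is discarded as noise.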
6. Example: Grouping Regions
• Number of clusters unknown
• Text is “dense”
→ DBSCAN
• DBSCAN does not necessarily produce text lines which are required for reliable orientation estimation
$f = (x, y, w, h, r)^{\top}$
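To make the clustering step concrete, here is a compact sketch of DBSCAN over five-dimensional feature vectors of the form (x, y, w, h, r). The slide does not state how the components are scaled, so the feature values, `eps`, and `min_pts` below are assumptions chosen only so that the example produces two clusters and one noise point.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN with Euclidean distance; returns one cluster label per point."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return [label if label is not None else NOISE for label in labels]


# Hypothetical feature vectors (x, y, w, h, r) of seven regions.
features = [
    (10, 10, 5, 8, 0), (16, 10, 5, 8, 0), (22, 10, 5, 8, 0),
    (100, 80, 5, 8, 0), (106, 80, 5, 8, 0), (112, 80, 5, 8, 0),
    (300, 300, 40, 40, 0),
]
labels = dbscan(features, eps=8.0, min_pts=2)
print(labels)
```

The number of clusters is never passed in; it follows from the density parameters, which matches the motivation given on the slide.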
7. Example: Computing Text Lines
• Compute a Minimum Spanning Tree for each DBSCAN Cluster using a reduced feature vector
• Split each MST (if necessary) by using the edge orientations
$f' = (x, y)^{\top}$
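The two bullets can be sketched as follows: build a minimum spanning tree over the (x, y) positions of one cluster with Prim's algorithm, then cut edges whose orientation deviates from the dominant edge orientation. The slide does not specify the splitting criterion, so the "dominant orientation plus tolerance" rule, the `tolerance` value, and all function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int]


def minimum_spanning_tree(points: List[Point]) -> List[Edge]:
    """Prim's algorithm on the complete Euclidean graph over the points."""
    if not points:
        return []
    in_tree = {0}
    edges: List[Edge] = []
    while len(in_tree) < len(points):
        _, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(len(points)) if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def edge_angle(points: List[Point], edge: Edge) -> float:
    """Orientation of an edge in degrees, folded into [0, 180)."""
    (x1, y1), (x2, y2) = points[edge[0]], points[edge[1]]
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0


def angle_diff(a: float, b: float) -> float:
    """Smallest difference between two orientations (wraps at 180 degrees)."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def split_by_orientation(points: List[Point], edges: List[Edge],
                         tolerance: float = 20.0) -> List[List[int]]:
    """Assumed rule: keep only edges close to the dominant edge orientation."""
    if not edges:
        return [[i] for i in range(len(points))]
    angles = [edge_angle(points, e) for e in edges]
    dominant = max(angles, key=lambda a: sum(angle_diff(a, b) <= tolerance
                                             for b in angles))
    parent = list(range(len(points)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (u, v), angle in zip(edges, angles):
        if angle_diff(angle, dominant) <= tolerance:
            parent[find(u)] = find(v)
    groups: Dict[int, List[int]] = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two horizontal rows of character centers that ended up in one cluster.
centers = [(0, 0), (10, 0), (20, 0), (0, 12), (10, 12), (20, 12)]
mst = minimum_spanning_tree(centers)
lines = split_by_orientation(centers, mst)
print(lines)
```

In this toy case the tree contains a single vertical edge joining the two rows; removing it leaves one connected component per text line.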
8. Example: Estimating the Orientation of Text Lines
• Transform the center-of-mass coordinates of each element of every cluster into a discretized Hough space (one for each cluster)
→ a line/curve for each center of mass in Hough space
• The Hough space is discretized to 180 degrees in 1-degree steps
• Find the maximum in Hough space to obtain the orientation of the cluster
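A minimal sketch of this voting scheme, assuming the common normal form rho = x·cos(theta) + y·sin(theta) for lines: every center of mass votes once per angle, and the fullest accumulator cell determines the orientation. The function name, the `rho_step` bin width, and the sample coordinates are assumptions; the returned angle is expressed in the coordinate system of the input points.

```python
import math
from collections import Counter
from typing import List, Tuple


def estimate_orientation(centers: List[Tuple[float, float]],
                         rho_step: float = 1.0) -> int:
    """Return the line orientation in degrees (0..179) with the most votes."""
    votes: Counter = Counter()
    for x, y in centers:
        for theta in range(180):  # 180 degrees in 1-degree steps
            t = math.radians(theta)
            rho = x * math.cos(t) + y * math.sin(t)
            votes[(theta, round(rho / rho_step))] += 1
    (theta, _rho_bin), _count = max(votes.items(), key=lambda item: item[1])
    # theta is the angle of the line's normal; the line is perpendicular to it.
    return (theta + 90) % 180


horizontal = [(float(x), 5.0) for x in range(0, 201, 10)]
diagonal = [(float(x), float(x)) for x in range(0, 201, 10)]
print(estimate_orientation(horizontal), estimate_orientation(diagonal))
```

With very short text lines, several neighboring angles can collect the same number of votes, so the bin width and the tie-breaking rule matter in practice.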
9. Example: Rotating Text Lines and Applying OCR
• Cut each text element out of the original image
• Rotate it according to the estimated angle
• Send it to an OCR engine for recognition
• Reasonable OCR engine: Tesseract (also used in Google Books)
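The rotation step can be illustrated on pixel coordinates alone: rotating a text element by the negative of its estimated angle around its centroid brings the text line to the horizontal. This is only a geometric sketch with a hypothetical function name; cropping from the original image, resampling, and the call to the OCR engine are left out because they depend on external libraries.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate_to_horizontal(pixels: List[Point], angle_deg: float) -> List[Point]:
    """Rotate coordinates by -angle_deg around their centroid."""
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)
    a = math.radians(-angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [
        (cx + (x - cx) * cos_a - (y - cy) * sin_a,
         cy + (x - cx) * sin_a + (y - cy) * cos_a)
        for x, y in pixels
    ]


# Three points on a line tilted by 45 degrees.
tilted = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
upright = rotate_to_horizontal(tilted, 45.0)
print(upright)
```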
11. Evaluation Setup
Example: “item 1” vs. “Item 1”
Unigrams: {e, i, m, t, 1} vs. {e, m, t, I, 1}
Bigrams: {em, it, te} vs. {em, te, It}
Trigrams: {ite, tem} vs. {tem, Ite}
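The n-gram sets in the example can be reproduced with a few lines of code. The sketch assumes, as the example suggests, that n-grams are character n-grams built per whitespace-separated token and collected as sets; the function name `char_ngrams` is hypothetical.

```python
from typing import Set


def char_ngrams(text: str, n: int) -> Set[str]:
    """Set of character n-grams, computed separately for each token."""
    grams: Set[str] = set()
    for token in text.split():
        for start in range(len(token) - n + 1):
            grams.add(token[start:start + n])
    return grams


for n in (1, 2, 3):
    print(n, sorted(char_ngrams("item 1", n)), sorted(char_ngrams("Item 1", n)))
```

Comparing the two sets per n shows how a single misrecognized character ("i" vs. "I") affects one unigram, one bigram, and one trigram.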
12. Preliminary Evaluation Setup: Baselines
Baseline #1:
• OCR engine Tesseract with layout analysis
• Single execution on the whole infographic
Baseline #2:
• OCR engine Tesseract with layout analysis
• Multiple executions on the whole infographic at various angles
• Merging of the results of the different executions
13. Images or Graphics
Evaluation Set: 121 Infographics (Domain Economics)
14. Dataset/Result set Characteristics
All values are given as AVG (SD).
System | # 1-grams | # 2-grams | # 3-grams | # Words | Word Length
TX Pipeline | 177.20 (128.20) | 127.34 (100.51) | 89.34 (79.35) | 50.07 (31.95) | 3.63 (2.69)
Baseline #1 | 106.30 (87.71) | 80.17 (69.12) | 60.79 (54.54) | 25.21 (22.12) | 4.15 (2.25)
Baseline #2 | 135.08 (125.56) | 100.20 (98.20) | 75.08 (78.10) | 35.25 (33.94) | 4.08 (1.95)
Ground Truth | 150.65 (122.28) | 115.93 (103.09) | 84.95 (85.61) | 35.46 (22.24) | 4.22 (1.48)
• Our pipeline extracts more characters and words than are present in the data
→ Increased chance to recognize all the textual information
• The baselines extract fewer characters and words than are present in the data
→ They miss some text components
• There is a high standard deviation in general
→ Infographics are very heterogeneous
19. Conclusion and Future Work
Conclusion
• Automated pipeline for text extraction from infographics
• Independent of infographic type (no special knowledge required)
Future Work
• Improvements necessary for individual/broken characters, occlusion, dotted lines, shading, super-/subscripts, …
• Make different approaches comparable (implementations)
• Improved evaluation framework for different configurations
• Test of alternative OCR engines
• Expanding the ground truth set for extensive evaluation
20. Questions?
Ansgar Scherp
ZBW – Leibniz Information
Centre for Economics
and Kiel University
Germany
asc@informatik.uni-kiel.de
Falk Böschen
Kiel University
Germany
fboe@informatik.uni-kiel.de
http://www.kd.informatik.uni-kiel.de/en
22. Phase 1: Text Line Localization
Structure of our Text Extraction Pipeline
Adaptive Binarization and Labeling → Grouping Regions into Text Elements → Computing of Text Lines → Estimating the Orientation of Text Lines → Rotation of Text Lines and Applying OCR → Evaluation
Phase 2: Text Extraction and Evaluation
23. Otsu's Method
Input Image Output Image
Source: https://en.wikipedia.org/wiki/Otsu's_method
• Assumes two classes of pixels following a bimodal histogram (foreground pixels and background pixels)
• Calculates the optimum threshold separating the two classes so that their combined spread (intra-class variance) is minimal, or equivalently, so that their inter-class variance is maximal
• Extensions of the original method to multi-level thresholding exist
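The threshold search described above can be written down directly: try every gray level as a threshold and keep the one that maximizes the inter-class variance. This is a sketch of the basic single-threshold method only, with a hypothetical function name and made-up pixel values; it does not include the hierarchical, edge-based extension used in the pipeline.

```python
from typing import List


def otsu_threshold(pixels: List[int], levels: int = 256) -> int:
    """Return threshold t maximizing inter-class variance (class 0: value <= t)."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        # Proportional to the inter-class variance (factor 1 / total**2 omitted).
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Toy bimodal data: dark values around 10 and bright values around 200.
gray = [10] * 50 + [12] * 50 + [200] * 40 + [210] * 60
threshold = otsu_threshold(gray)
binary = [1 if value > threshold else 0 for value in gray]
print(threshold, sum(binary))
```

Every threshold between the two modes yields the same split; the loop keeps the first one it encounters.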
Editor's Notes
No uniform use of terms:
Biomedical Image Topographic/Geographic/Raster Map Scientific Chart Chart Image Chart Diagram Diagram [Raster] Image
Information Graphic Infographic Mathematical/Scholarly Figure Flow/Pie/Bar/Column Chart Column/Bar/Line Graph 2D Plot Scatterplot
No (automated) complete pipeline from infographic to text described
Technical description in many cases insufficient for reproduction
Comparison is difficult due to missing formalization
In computer vision and image processing, Otsu's method, named after Nobuyuki Otsu (大津展之, Ōtsu Nobuyuki), is used to automatically perform clustering-based image thresholding, that is, the reduction of a gray-level image to a binary image. The algorithm assumes that the image contains two classes of pixels following a bimodal histogram (foreground pixels and background pixels). It then calculates the optimum threshold separating the two classes so that their combined spread (intra-class variance) is minimal, or equivalently (because the sum of pairwise squared distances is constant), so that their inter-class variance is maximal. Consequently, Otsu's method is roughly a one-dimensional, discrete analog of Fisher's Discriminant Analysis.
The extension of the original method to multi-level thresholding is referred to as the Multi Otsu method.
https://en.wikipedia.org/wiki/Otsu's_method