Formalization and Preliminary Evaluation of a Pipeline for Text Extraction From Infographics

We propose a pipeline for text extraction from infographics that makes use of a novel combination of data mining and computer vision techniques. The pipeline defines a sequence of steps to identify characters, cluster them into text lines, determine their rotation angle, and apply state-of-the-art OCR to recognize the text. In this paper, we formally define the pipeline and present its current implementation. In addition, we have conducted preliminary evaluations over a data corpus of 121 manually annotated infographics from a broad range of illustration types such as bar charts, pie charts, line charts, maps, and others. We assess the results of our text extraction pipeline by comparing it with two baselines. Finally, we sketch an outline for future work and possibilities for improving the pipeline.
http://ceur-ws.org/Vol-1458/

Formalization and Preliminary Evaluation of a Pipeline for Text Extraction From Infographics

  1. Text Extraction from Infographics
     Ansgar Scherp, Kiel University and ZBW – Leibniz Information Centre for Economics, Germany
     Falk Böschen, Kiel University, Germany
     LWA 2015 (KDML), Trier, Germany
  2. Infographics Challenges
     • Text with different font sizes
     • Text with varying emphasis
     • Text in different colors
     • Text on different background colors
     • Text rotated at different angles
     • Text occluded by graphic elements
     Initial presentation at [DocEng’15] → Now: Improve comparability and extensibility
  3. Abstract Pipeline Idea
     Input: Information Graphic
     1. RE: Extract regions from graphic
     2. RC: Cluster regions into text and non-text elements
     3. LC: Computation of text lines for orientation estimation
     4. PRE: Preprocessing of text elements for OCR
     5. OCR: Optical Character Recognition
     6. POST: Post-correction of OCR result
     Output: Text
  4. Excerpt of Related Work
     Table columns: Authors, Title, RE, RC, LC, Pre, OCR, Post
     • Chiang & Knoblock, “Recognizing text in raster maps”: ✍ ✍ ✔ ✔ ✔ ✔
     • Jayant et al., “Automated tactile graphics translation: in the field”: ✍ ✍ ✍ ✔ ✔
     • Sas & Zolnierek, “Three-Stage Method of Text Region Extraction from Diagram Raster Images”: ✔ ✔ ✔ ✔
     • Huang et al., “Associating Text and Graphics for Scientific Chart Understanding”: ✔ ✍ ✔ ✔ ✔ ✍
     • Lu et al., “Automated analysis of images in documents for intelligent document search”: ✔ ✔ ✔
     • Xu & Krauthammer, “A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images”: ✔ ✔
     • Chen et al., “DiagramFlyer: A Search Engine for Data-Driven Diagrams”: ? ? ? ? ? ?
     • Böschen & Scherp, “Multi-oriented Text Extraction from Information Graphics”: ✔ ✔ ✔ ✔ ✔
     • Gllavata et al., “Adaptive Fuzzy Text Segmentation in Images with Complex Backgrounds Using Color and Texture”: ✔ ✔
     • Fraz et al., “Exploiting colour information for better scene text detection and recognition”: ✔ ✔ ✔ ✔ ✔
     • Liu & Samarabandu, “Multiscale Edge-Based Text Extraction from Complex Images”: ✔ ✔
     • Olszewska, “Active contour based optical character recognition for automated scene understanding”: ✔ ✔ ✔
     • Lu et al., “Scene text extraction based on edges and support vector regression”: ✔ ✔ ✍
  5. Example: Adaptive Binarization and Labeling
     • Binarization based on Otsu's method
     • Extended by hierarchical computation using edge images for split-decision
     • Connected Component Labeling with 8-neighbors
     • Noise removal by region size thresholding
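
The labeling and noise-removal bullets above can be illustrated with a small, dependency-free sketch. It assumes the image has already been binarized (1 = foreground, 0 = background); the function name `label_components` and the `min_size` threshold are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import deque

def label_components(binary, min_size=2):
    """Collect 8-connected foreground regions; drop regions below min_size.

    `binary` is a list of rows with 1 = foreground and 0 = background.
    Returns a list of regions, each a list of (row, col) pixels.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over the 8-neighbourhood.
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(region) >= min_size:  # noise removal by region size
                regions.append(region)
    return regions

image = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
]
regions = label_components(image, min_size=2)
print(len(regions))  # 2: the diagonal pair and the horizontal pair
```

The two diagonal pixels form one region only because diagonal neighbours count; the isolated pixel at the right edge is discarded as noise.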
  6. Example: Grouping Regions
     • Number of clusters unknown
     • Text is “dense” → DBSCAN
     • DBSCAN does not necessarily produce text lines, which are required for reliable orientation estimation
     • Feature vector: $f = (x, y, w, h, r)^\top$
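
As a rough illustration of why DBSCAN suits an unknown number of dense groups, here is a minimal pure-Python sketch. It assumes Euclidean distance and, for brevity, clusters plain 2-D positions rather than the five-component feature vector from the slide; the `eps` and `min_pts` values are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, does not expand
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:   # core point: keep expanding
                queue.extend(more)
    return labels

# Two groups of character positions and one isolated mark.
points = [(0, 0), (1, 0), (2, 0), (3, 0),
          (20, 10), (21, 10), (22, 10),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=2)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Because `math.dist` accepts tuples of any equal length, the same function would work on longer feature vectors, although the features would then need comparable scales.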
  7. Example: Computing Text Lines
     • Compute a Minimum Spanning Tree for each DBSCAN cluster using a reduced feature vector
     • Split each MST (if necessary) by using the edge orientations
     • Reduced feature vector: $f' = (x, y)^\top$
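
The slide does not spell out the splitting rule, so the following is only a sketch under stated assumptions: the MST is built with Prim's algorithm on 2-D positions, and an edge is cut when its orientation deviates from the median edge orientation by more than a hypothetical `tolerance`. The function names are illustrative.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n        # (distance to tree, nearest tree vertex)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        i = min((k for k in range(n) if not in_tree[k]),
                key=lambda k: best[k][0])
        in_tree[i] = True
        if best[i][1] >= 0:
            edges.append((best[i][1], i))
        for k in range(n):
            d = math.dist(points[i], points[k])
            if not in_tree[k] and d < best[k][0]:
                best[k] = (d, i)
    return edges

def orientation(p, q):
    """Undirected edge angle in degrees, in [0, 180)."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0

def split_by_orientation(points, edges, tolerance=30.0):
    """Cut edges far from the median orientation; return index groups."""
    if not edges:
        return [[k] for k in range(len(points))]
    angles = sorted(orientation(points[i], points[j]) for i, j in edges)
    reference = angles[len(angles) // 2]
    parent = list(range(len(points)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j in edges:
        diff = abs(orientation(points[i], points[j]) - reference)
        diff = min(diff, 180.0 - diff)     # angles wrap around at 180 degrees
        if diff <= tolerance:
            parent[find(i)] = find(j)      # keep the edge: same text line
    groups = {}
    for k in range(len(points)):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())

# Two stacked horizontal lines that a density-based clustering could merge.
points = [(0, 0), (1, 0), (2, 0), (3, 0),
          (0, 1.5), (1, 1.5), (2, 1.5), (3, 1.5)]
edges = minimum_spanning_tree(points)
lines = split_by_orientation(points, edges)
print(lines)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The single vertical MST edge that links the two rows is the only one cut, which separates the cluster into two text lines.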
  8. Example: Estimating the Orientation of Text Lines
     • Transform the center of mass coordinates of each element of every cluster into a discretized Hough space (one for each cluster) → a line/curve for each center of mass in Hough space
     • Hough space discretized to 180 degrees in 1-degree steps
     • Find the maximal value to obtain the orientation of the cluster
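
A minimal sketch of this voting scheme follows. It assumes the usual normal parameterization rho = x·cos(theta) + y·sin(theta), with theta sampled in 1-degree steps over 180 degrees as on the slide; the `rho_step` bin width and the conversion from the normal angle to a line direction are assumptions made for illustration.

```python
import math
from collections import Counter

def estimate_orientation(centers, rho_step=1.0):
    """Return the dominant line direction in degrees, in [0, 180)."""
    votes = Counter()
    for x, y in centers:
        for theta in range(180):                     # 1-degree steps
            t = math.radians(theta)
            rho = x * math.cos(t) + y * math.sin(t)  # one curve per center
            votes[(theta, round(rho / rho_step))] += 1
    # The fullest accumulator cell gives the angle of the line's normal.
    (theta, _), _ = max(votes.items(), key=lambda kv: kv[1])
    return (theta + 90) % 180                        # normal -> direction

diagonal = [(0, 0), (10, 10), (20, 20), (30, 30), (40, 40)]
horizontal = [(0, 5), (10, 5), (20, 5), (30, 5)]
print(estimate_orientation(diagonal))    # 45
print(estimate_orientation(horizontal))  # 0
```

In image coordinates, where the y-axis points down, the same code applies but the angle is mirrored, so the sign convention should be fixed before rotating any text element.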
  9. Example: Rotating Text Lines and Applying OCR
     • Cut each text element out of the original image
     • Rotate it according to the estimated angle
     • Send it to an OCR engine for recognition
     • Reasonable OCR engine: Tesseract (also used in Google Books)
  10. Ground Truth Generation
  11. Evaluation Setup
      Character n-grams of two example strings:
      • “item 1”: unigrams {e, i, m, t, 1}; bigrams {em, it, te}; trigrams {ite, tem}
      • “Item 1”: unigrams {e, m, t, I, 1}; bigrams {em, te, It}; trigrams {tem, Ite}
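
The n-gram sets in the example above can be reproduced with a short sketch. It assumes that n-grams are formed within whitespace-separated tokens and compared as sets, which matches the sets shown; how the authors aggregate scores per infographic is not stated, so the function `precision_recall_f1` is a hypothetical illustration.

```python
def char_ngrams(text, n):
    """Set of character n-grams, computed within whitespace-separated tokens."""
    grams = set()
    for token in text.split():
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

def precision_recall_f1(extracted, truth, n):
    """Compare the n-gram sets of an extraction result and the ground truth."""
    found, expected = char_ngrams(extracted, n), char_ngrams(truth, n)
    hits = len(found & expected)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

print(sorted(char_ngrams("item 1", 2)))  # ['em', 'it', 'te']
print(sorted(char_ngrams("Item 1", 3)))  # ['Ite', 'tem']

# Treating "Item 1" as OCR output and "item 1" as ground truth:
for n in (1, 2, 3):
    print(n, precision_recall_f1("Item 1", "item 1", n))
```

With these inputs, a single capitalization error costs one of five unigrams, one of three bigrams, and one of two trigrams, which shows how longer n-grams penalize the same error more heavily.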
  12. Preliminary Evaluation Setup: Baselines
      Baseline #1:
      • OCR engine Tesseract with layout analysis
      • Single execution on the whole infographic
      Baseline #2:
      • OCR engine Tesseract with layout analysis
      • Multiple executions on the whole infographic at various angles
      • Merging of the results of the different executions
  13. Evaluation Set: 121 Infographics (Domain: Economics)
  14. Dataset/Result Set Characteristics
      Each cell gives AVG (SD).

                     # 1-grams        # 2-grams        # 3-grams      # Words        Word Length
      TX Pipeline    177.20 (128.20)  127.34 (100.51)  89.34 (79.35)  50.07 (31.95)  3.63 (2.69)
      Baseline #1    106.30 (87.71)   80.17 (69.12)    60.79 (54.54)  25.21 (22.12)  4.15 (2.25)
      Baseline #2    135.08 (125.56)  100.20 (98.20)   75.08 (78.10)  35.25 (33.94)  4.08 (1.95)
      Ground Truth   150.65 (122.28)  115.93 (103.09)  84.95 (85.61)  35.46 (22.24)  4.22 (1.48)

      • Our pipeline extracts more characters and words than are present in the data → Increased chance to recognize all the textual information
      • The baselines extract fewer characters and words than are present in the data → They miss some text components
      • There is a high standard deviation in general → Infographics are very heterogeneous
  15. Preliminary Evaluation Results
      Each cell gives AVG (SD).

                                            n-gram  Precision    Recall       F1-measure
      TX Pipeline                           1       0.50 (0.41)  0.68 (0.36)  0.47 (0.39)
                                            2       0.58 (0.39)  0.54 (0.38)  0.54 (0.34)
                                            3       0.52 (0.39)  0.48 (0.37)  0.49 (0.37)
      Baseline #1                           1       0.37 (0.36)  0.48 (0.36)  0.36 (0.35)
                                            2       0.42 (0.33)  0.42 (0.34)  0.42 (0.33)
                                            3       0.42 (0.31)  0.42 (0.31)  0.36 (0.33)
      Relative improvement over Baseline #1 1       35.14 %      41.67 %      30.06 %
                                            2       38.10 %      28.57 %      28.57 %
                                            3       23.81 %      14.29 %      36.11 %
      Baseline #2                           1       0.37 (0.37)  0.51 (0.38)  0.36 (0.36)
                                            2       0.42 (0.34)  0.42 (0.35)  0.42 (0.34)
                                            3       0.42 (0.32)  0.42 (0.32)  0.42 (0.32)
      Relative improvement over Baseline #2 1       35.14 %      33.33 %      30.06 %
                                            2       38.10 %      28.57 %      28.57 %
                                            3       23.81 %      14.29 %      16.67 %
  16. Preliminary Evaluation: Orientation Distributions
      Here, “horizontal” means within ±15°, based on Tesseract's rotation tolerance.
  17. Preliminary Evaluation: Levenshtein Distance
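
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is the standard dynamic-programming formulation; the slide does not state how the reported distances are normalized or aggregated, so no normalization is applied here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]

print(levenshtein("item 1", "Item 1"))   # 1
print(levenshtein("kitten", "sitting"))  # 3
```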
  18. Extreme Examples

      Best Result (P/R/F):
                    TX              BL1             BL2
      Unigram       0.95/0.95/0.95  0.02/0.26/0.02  0.02/0.26/0.02
      Bigram        0.92/0.92/0.92  0.00/0.00/0.00  0.00/0.00/0.00
      Trigram       0.92/0.92/0.92  0.00/0.00/0.00  0.00/0.00/0.00
      Levenshtein   0.14            3.69            3.21

      Worst Result (P/R/F):
                    TX              BL1             BL2
      Unigram       0.02/0.45/0.02  0.00/0.00/0.00  0.00/0.00/0.00
      Bigram        0.00/0.00/0.00  0.00/0.00/0.00  0.00/0.00/0.00
      Trigram       0.00/0.00/0.00  0.00/0.00/0.00  0.00/0.00/0.00
      Levenshtein   3.47            0.14            0.14
  19. Conclusion and Future Work
      Conclusion
      • Automated pipeline for text extraction from infographics
      • Independent of infographic type (no special knowledge required)
      Future Work
      • Improvements necessary for individual/broken characters, occlusion, dotted lines, shading, super-/subscripts, …
      • Make different approaches comparable (implementations)
      • Improved evaluation framework for different configurations
      • Test of alternative OCR engines
      • Expanding the ground truth set for extensive evaluation
  20. Questions?
      Ansgar Scherp, ZBW – Leibniz Information Centre for Economics and Kiel University, Germany, asc@informatik.uni-kiel.de
      Falk Böschen, Kiel University, Germany, fboe@informatik.uni-kiel.de
      http://www.kd.informatik.uni-kiel.de/en
  21. The Road Ahead …
  22. Structure of our Text Extraction Pipeline
      Phase 1: Text Line Localization
      • Adaptive Binarization and Labeling
      • Grouping Regions into Text Elements
      • Computing of Text Lines
      • Estimating the Orientation of Text Lines
      Phase 2: Text Extraction and Evaluation
      • Rotation of Text Lines and Applying OCR
      • Evaluation
  23. Otsu's Method
      Figure: input image and output image. Source: https://en.wikipedia.org/wiki/Otsu's_method
      • Assumes two classes of pixels following a bi-modal histogram (foreground pixels and background pixels)
      • Calculates the optimum threshold separating the two classes so that their combined spread (intra-class variance) is minimal or, equivalently, their inter-class variance is maximal
      • Extensions of the original method to multi-level thresholding exist
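
The threshold search described above can be sketched in a few lines of plain Python. This is the basic single-threshold form only, maximizing the inter-class variance over a 256-level histogram; it does not include the hierarchical, edge-based extension mentioned earlier in the talk, and the sample pixel values are made up for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the inter-class variance.

    `pixels` is a flat sequence of integer gray values in [0, levels).
    Class 0 holds values <= t, class 1 holds values > t.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # sum of gray values in class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue                      # one class is empty: no split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2   # unnormalized variance
        if between > best_var:
            best_t, best_var = t, between
    return best_t

dark_text = [10, 12, 11, 13]
light_background = [200, 205, 210, 198, 202, 207]
t = otsu_threshold(dark_text + light_background)
binary = [1 if p <= t else 0 for p in dark_text + light_background]
print(t, binary)
```

Here dark pixels are mapped to 1 (foreground) on the assumption of dark text on a light background; for light text on a dark background the comparison would be inverted.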
