Presentation of
Hybrid Page Layout Analysis via Tab-Stop Detection
Ray Smith, Proc. ICDAR2009, Barcelona, Spain, 2009.
Javier de la Rosa {jdelaros at uwo dotca}
CS 9883
2 | Internal use only2 of 25 | Javier de la Rosa | Hybrid Page Layout Analysis via Tab-Stop Detection | CS 9883
Index.
1. Context and background.
2. Introduction.
3. Page layout via tab-stop detection.
4. Preprocessing.
5. Finding tab positions as line segments.
6. Finding the column layout.
7. Finding the regions.
8. Testing and results.
9. Conclusion and further work.
10. Criticism.
11. References.
1. Context and background.
• International Conference on Document Analysis and
Recognition [1].
• Page Segmentation competitions: 2001, 2003, 2005, 2007
and 2009 [2].
• Tesseract, the OCR from Google [3].
Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011) <http://www.icdar2011.org/> [1]
A. Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, Barcelona, Spain, 2009. <http://www.cse.salford.ac.uk/prima/ICDAR2009_pscomp/> [2]
The Tesseract OCR <http://code.google.com/p/tesseract-ocr/> [3]
2. Introduction.
Physical page layout analysis:
• Bottom-up [4].
• Top-down [5].
• Whitespaces [6].
Logical page layout analysis:
• Voronoi.
• Smearing.
• Etc.
M. Chen, X. Q. Ding, "Unified HMM-based Layout Analysis Framework and Algorithm," SCI CHINA Ser F, 46(6), Dec. 2003, pp401-408. [4]
G. Nagy, S.C. Seth, "Hierarchical Representation of Optically Scanned Documents" Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, 1984, pp347-349. [5]
T.M. Breuel, "Two Geometric Algorithms for Layout Analysis," Proc. of the 5th Int. Workshop on Document Analysis Systems V, Springer-Verlag 2002, pp188-199. [6]
3. Tab-stop detection.
• Regions bounded by tab-stops.
• Fixed x-positions.
• Vertical alignment.
Phases:
1. Preprocessing.
2. Bottom-up tab-stop detection.
3. Finding the column layout.
4. Set of typed regions.
4. Preprocessing
• Detection of vertical lines and image mask [7].
• Connected components (CCs) analysis.
• CCs filtering by width, w, and height, h:
– Small: h < 7 (@300ppi) or h < h75 / 2
– Large: h > 2h75 or w > 8h75
– Medium: the remainder.
Leptonica image processing and analysis library <http://www.leptonica.com> [7]
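The size filter above can be sketched as a small classifier. This is only an illustration of the thresholds as written on the slide: the slide does not define h75, so it is passed in as a given reference height, and the names `SizeClass`, `classify_cc` and `min_height`, as well as the order of the checks (small before large), are assumptions.

```python
from enum import Enum


class SizeClass(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


def classify_cc(h: float, w: float, h75: float, min_height: float = 7.0) -> SizeClass:
    """Classify one connected component using the slide's size thresholds.

    h, w       -- height and width of the CC in pixels
    h75        -- reference height (not defined on the slide; assumed given)
    min_height -- the fixed 7-pixel limit quoted for 300 ppi
    """
    # Small: h < 7 (at 300 ppi) or h < h75 / 2
    if h < min_height or h < h75 / 2:
        return SizeClass.SMALL
    # Large: h > 2 * h75 or w > 8 * h75
    if h > 2 * h75 or w > 8 * h75:
        return SizeClass.LARGE
    # Medium: everything else
    return SizeClass.MEDIUM
```

For example, with h75 = 30 a CC of height 14 falls below h75 / 2 and is classed as small, while a CC of height 30 and width 300 exceeds 8 * h75 and is classed as large.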
5. Finding tab positions as line segments. (1/3)
• Candidate tab-stop components:
– A CC is a tab-stop by default.
– Look for aligned neighbours.
– Mark each CC as left tab, right tab or
neither.
• Grouping candidate tabs:
– In lines and, if there are many, in groups.
– Least median of squares to fit the lines
(left or right).
– Refit lines to the page-mean direction.
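A least-median-of-squares fit can be sketched as follows. The slide only names the criterion, so the search strategy here (trying the line through every pair of points and keeping the one with the smallest median squared residual) is an assumption chosen for clarity, as is the function name `fit_tab_line_lmeds`. Because tab lines are close to vertical, the line is written as x = slope * y + intercept.

```python
from itertools import combinations
from statistics import median


def fit_tab_line_lmeds(points):
    """Fit x = slope * y + intercept by least median of squares.

    `points` is a list of (x, y) edge coordinates of candidate tab CCs.
    Returns (slope, intercept, median_squared_residual).
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if y1 == y2:
            continue  # this pair cannot define x as a function of y
        slope = (x2 - x1) / (y2 - y1)
        intercept = x1 - slope * y1
        # Median (not sum) of squared residuals: robust to outliers
        med = median((x - (slope * y + intercept)) ** 2 for x, y in points)
        if best is None or med < best[2]:
            best = (slope, intercept, med)
    if best is None:
        raise ValueError("need at least two points with different y")
    return best
```

The exhaustive pair search costs O(n^3) and is only practical for small groups; sampling pairs at random is a common cheaper variant. Using the median is what lets a few misaligned CCs be ignored instead of pulling the line towards them.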
5. Finding tab positions as line segments. (2/3)
• Tracking text lines to connect tab-stops:
– From one tab-stop to another.
– Associate tab-stops connected by text
lines.
– Discard tab-stops with no connections.
– Record the most frequently occurring
text-line widths.
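Recording the most frequent text-line widths amounts to building a histogram. The sketch below is one hypothetical way to do it: the slide does not say how widths are binned, so the `bucket` size, the `top` count and the function name are all assumptions.

```python
from collections import Counter


def common_line_widths(widths, bucket=10, top=3):
    """Return the `top` most frequent text-line widths.

    Widths are rounded to the nearest multiple of `bucket` pixels so that
    lines of nearly equal width are counted together.
    """
    counts = Counter(round(w / bucket) * bucket for w in widths)
    return [width for width, _ in counts.most_common(top)]
```

These widths are what the later column-finding step compares Column Partition widths against.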
5. Finding tab positions as line segments. (3/3)
• Cleaning up tab-stop ends:
– Make connected tab lines end at the same y
coordinate by moving the ends to a point between the
last member CC and the first non-member CC.
• Reclassify CC as “Text” or “Unknown”:
– A group of CCs of significant width forms a text
line.
• Create artificial CCs from the image mask.
6. Finding the column layout. (1/3)
• Scan CCs from left to right and top to bottom, gathering them into
Column Partitions (CPs).
• A CP may not cross a tab-stop line.
• Collections of CPs are stored in Column Partition Sets
(CPsets).
• Find the column layout → find an optimal set of CPsets that
best “explains” all the CPsets on the page.
6. Finding the column layout. (2/3)
• A good CP: it touches a tab line on both vertical edges.
• A good CP: its width is close to a frequently occurring width (slide 8).
• The coverage of CPset = total width of all the good CPs that it contains.
• A CPset A is better than CPset B if A has greater coverage.
• What does “explain” mean? In short:
– CPset A explains CPset B unless one or more of the following are true:
• B does not have more text than A.
• A has not split a column of common width.
• A does not have a different number of columns from B.
• A has not merged two columns of B.
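The “good CP” and coverage definitions on this slide can be sketched in a few lines. Two things here are assumptions rather than statements from the slide: the two “good CP” bullets are read as alternatives (either one suffices), and “close to” is implemented as a relative `tolerance` of 10%. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ColumnPartition:
    left: int
    right: int
    touches_left_tab: bool
    touches_right_tab: bool

    @property
    def width(self) -> int:
        return self.right - self.left


def is_good(cp, common_widths, tolerance=0.1):
    """Assumed reading: either criterion on the slide makes a CP good."""
    # Criterion 1: the CP touches a tab line on both vertical edges
    if cp.touches_left_tab and cp.touches_right_tab:
        return True
    # Criterion 2: its width is close to a frequently occurring width
    return any(abs(cp.width - w) <= tolerance * w for w in common_widths)


def coverage(cpset, common_widths):
    """Total width of all the good CPs in a CPset."""
    return sum(cp.width for cp in cpset if is_good(cp, common_widths))


def better(a, b, common_widths):
    """CPset a is better than CPset b if a has greater coverage."""
    return coverage(a, common_widths) > coverage(b, common_widths)
```

The “explains” rules are deliberately left out of this sketch, since the slide states them only informally.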
6. Finding the column layout. (3/3)
• List from set of CPsets on
the page.
• Ordered by best ones first.
• Duplicates eliminated by
the A explains B rules.
• Image CPs are ignored.
• Improve the candidates by
adding new CPs.
7. Finding the regions. (1/3)
• Create flows of CPs:
– Choose the best matching upper and
lower partner.
– Each list of partners is iteratively
reduced to zero or one entry.
– Different rules for image CPs and text
CPs.
– Each chain of CPs returned represents a
candidate region:
• Text is blue.
• Heading text is cyan.
• Heading image is magenta.
• Pull-out image is orange.
7. Finding the regions. (2/3)
• The rules to apply:
1. Type. If there are multiple types, text can only stay with its own (exact) type,
whereas an image can stay with any other image type.
2. Transitive partner shortcuts are broken. If A has 2 partners B and C, and also B
has C as a partner in the same direction, then delete C as a partner of A,
leaving a clean chain A-B-C. Also if A has a partner B, and B has a partner A in
the same direction, break the cycle.
3. (Text only) If A still has 2 partners B, C, chase B and C's partners to see which
has the longest chain. Delete from A the partner that has the shortest chain,
and convert the type of the shortest chain to pull-out.
4. (Image only) Choose the partner CP with the largest horizontal overlap.
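Rule 2 above (breaking transitive partner shortcuts) can be sketched on a plain dictionary. This is an illustration only: the data layout (a dict mapping each CP name to the set of its partners in one direction) and the function name are assumptions, and the sketch assumes the partner cycles mentioned in the same rule have already been broken.

```python
def break_transitive_shortcuts(partners):
    """Delete C from A's partners when A -> B and B -> C in the same direction.

    `partners` maps each CP name to the set of its partners in ONE direction
    (for example, its lower partners). The sets are modified in place.
    Assumes there are no cycles among partners.
    """
    for a, direct in partners.items():
        # Collect every partner of A that is also reachable through
        # another partner of A: these are the shortcuts to remove.
        shortcuts = {
            c
            for b in direct
            for c in partners.get(b, ())
            if c in direct and c != b
        }
        direct -= shortcuts
    return partners
```

With A having partners B and C, and B having partner C, the call leaves A with only B, giving the clean chain A-B-C described in the rule.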
7. Finding the regions. (3/3)
• Determine the reading order:
1. Flowing blocks follow by y position within a column.
2. Pull-out blocks follow by y position in an imaginary column
between the real columns that they touch.
3. A heading spans multiple columns and follows anything that is
above it in the columns spanned, or between them.
4. A change in column layout works just like a heading.
5. Between headings, the content of columns is ordered from
left to right.
• Find the polygon boundary for each region:
– Polygons are isothetic.
– Polygon edges are chosen to minimize the number of
vertices.
– All CPs are contained within their region.
8. Testing and results. (1/2)
• Algorithm implemented in C++.
• Part of Tesseract Open Source OCR system [3].
• Speed: one 8-megapixel image per second on a 3.4 GHz Pentium 4.
8. Testing and results. (2/2)
[Figure: four bar charts (PRImA Metric, F-Measure, Recall, Precision), each comparing the methods 2007-Besus, 2007-TH1, 2007-TH2 and Tesseract on the measures Noise, Sep, Text, Image and Overall.]
ICDAR 2007 set [2, 8]
A. Antonacopoulos, et al. “ICDAR2007 Page Segmentation Competition,” Proc 9th Int. Conf. on Doc. Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp1279-1283. [8]
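For readers unfamiliar with three of the reported measures, the standard definitions of precision, recall and F-measure are sketched below. How the competition counts a region as correctly detected is not stated on the slide, so the counts are taken as given inputs; the PRImA metric is not covered. The function name is hypothetical.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Standard precision, recall and F-measure from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```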
9. Conclusion and further work
• Tab-stops make an interesting and useful alternative to
white rectangles.
• It enables page layout analysis to easily handle the
complex non-rectangular layouts of modern magazines.
• Table detection will be added in the future.
10. Criticism. (1/4)
• The idea is totally new and it works reasonably well, but
• No references.
• No formulas.
• No algorithms.
• No mathematical justification.
• Excess text and literature.
• The process is too long, and on many occasions no
justification is given.
10. Criticism. (2/4)
• An example:
– Preprocessing: Small CCs: h < 7 (@300ppi) ...
• Why 7?
• Does it only work at 300ppi?
• Only on magazine papers (10.5” x78.5”)?
10. Criticism. (3/4)
• More:
– Reclassify CC as “Text” or “Unknown”: A group of CCs of
significant width forms a text line.
• What's a “significant width”?
– Find the polygon boundary for each region: Polygon
edges are chosen to minimize the number of
vertices.
• What's the algorithm or reference to do this?
10. Criticism. (4/4)
ICDAR 2009 Results [2]
11. References. (1/2)
1. Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011)
<http://www.icdar2011.org/>
2. A. Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, Barcelona,
Spain, 2009. <http://www.cse.salford.ac.uk/prima/ICDAR2009_pscomp/>
3. The Tesseract OCR <http://code.google.com/p/tesseract-ocr/>
4. M. Chen, X. Q. Ding, "Unified HMM-based Layout Analysis Framework and Algorithm,"
SCI CHINA Ser F, 46(6), Dec. 2003, pp401-408.
11. References. (2/2)
5. G. Nagy, S.C. Seth, "Hierarchical Representation of Optically Scanned Documents" Proc. 7th Int. Conf.
on Pattern Recognition, Montreal, Canada, 1984, pp347-349.
6. T. M. Breuel, "Two Geometric Algorithms for Layout Analysis," Proc. of the 5th Int. Workshop on
Document Analysis Systems V, Springer-Verlag 2002, pp188-199.
7. Leptonica image processing and analysis library
<http://www.leptonica.com>
8. A. Antonacopoulos, et al. “ICDAR2007 Page Segmentation Competition,” Proc 9th Int. Conf. on Doc.
Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp1279-1283.
Questions?
Thank you
