SlideShare a Scribd company logo
1 of 19
Download to read offline
Tesseract Tutorial: DAS 2014 Tours France
6. Character
Segmentation, Language
Models and Beam Search
The heart of Tesseract
Ray Smith, Google Inc.
Tesseract Tutorial: DAS 2014 Tours France
Approaches to Segmentation
● Segment first using only geometry.
● Maximally chop, then combine with a beam
search. (Over-segmentation.)
● Sliding window to "avoid" segmentation
altogether.
● Tesseract: Chop only as needed, then
combine as needed.
Tesseract Tutorial: DAS 2014 Tours France
Over-Segmentation
● Aim is to maximize recall of chops with the compromise of
reduction in precision.
Tesseract Tutorial: DAS 2014 Tours France
Segmentation Graph
● Segmentation possibilities and classifier results form a directed
graph
l } I 1
m
u nh b
l } I 1 i l 1 e c r T
h b
i l 1 i l 1 i l 1 l } I 1
m
m
g 9 q
n u %
Tesseract Tutorial: DAS 2014 Tours France
Searching the Segmentation Graph
l } I 1
m
u nh b
l } I 1 i l 1 e c r T
h b
i l 1 i l 1 i l 1 l } I 1
m
m
g 9 q
n u
higher
%
Tesseract Tutorial: DAS 2014 Tours France
Searching the Segmentation Graph
l } I 1
m
u nh b
l } I 1 i l 1 e c r T
h b
i l 1 i l 1 i l 1 l } I 1
m
m
g 9 q
n u
l } I 1
m
u nh b
l } I 1 i l 1 e c r T
h b
i l 1 i l 1 i l 1 l } I 1
m
m
g 9 q
n u
higher
}uglier
%
%
Tesseract Tutorial: DAS 2014 Tours France
Integration of Language Models (General Methods)
● Implement Language Model as Finite State Machine.
● Search Language Model and Segmentation Graph in parallel.
● Combine “probabilities” in some sensible way.
● Hidden Markov Model methods are good example.
Tesseract Tutorial: DAS 2014 Tours France
Segmentation Free = Extreme Over-Segmentation
● Slide over the word/textline with a classifier/HMM.
● Beam search + shape model probs + language model probs solves
the segmentation internally.
● Really just mega-over-segmentation.
Tesseract Tutorial: DAS 2014 Tours France
Tesseract Segmentation Approach based on observations:
● Initial segmentation is often correct or close.
● Classifier generally doesn’t like incorrectly segmented
text.
● Over-segmentation often leads to poor results., eg m->iii
Tesseract Tutorial: DAS 2014 Tours France
Tesseract Segmentation Approach
Classify Initial Segmentation
Search Word: OK? Yes => Done
while any Bad Blob has any Chops available
Chop and classify pieces of Worst Choppable Blob
Search Word: OK? Yes => Done
while any fixable “Pain point”
Associate adjacent blobs and classify
Search Word: OK? Yes => Done
Tesseract Tutorial: DAS 2014 Tours France
Types of Pain Point
● Initial: Join each adjacent pair
● Ambiguity: Eg m/rn
● Path: Neighbors of blobs in the current best path
Tesseract Tutorial: DAS 2014 Tours France
Ratings MATRIX = Segmentation Graph
l } I 1
m
u nh b
l } I 1 i l 1 e c r T
h b
i l 1 i l 1 i l 1 l } I 1
m
m
g 9 q
n u
%
l}Ii
hb il1
un il1
m nu il1
g9q l}I1
% l}I1
hb il1
m ec
m rT
Each entry holds a BLOB_CHOICE_LIST providing classifier choices
with rating and certainty.
Tesseract Tutorial: DAS 2014 Tours France
Live Demo of Segmentation Search
Chopper demo
api/tesseract unlv/mag.3B/2/8022_028.3B.tif test1 inter segdemo chopdebug3
Col 2, line 6, word 1
Associator demo
api/tesseract unlv/doe3.3B/4/2214_007.3B.tif test1 inter segdemo
Col 2, line 8, word 2
Tesseract Tutorial: DAS 2014 Tours France
Evaluation of a WORD_CHOICE (no params-model)
Word Rating = word_factor ∑blob_choice->rating()
word_factor = segmentation
Condition base word_factor Add-ons
Frequent dawg word 1.0 Inconsistent case +0.1
Other dawg word 1.1 Inconsistent case +0.1
Non-dawg word 1.25 +0.01 for each
char over 3.
Inconsistent case +0.1
Inconsistent punc +0.2
Inconsistent chartype +0.3
Inconsistent script +0.5
Inconsisted char spacing +0.01
All except script +0.01 for each
additional occurrence.
Tesseract Tutorial: DAS 2014 Tours France
Evaluation of a WORD_CHOICE (with params-model)
Word Rating = word_factor ∑outline length
word_factor = weighted sum of word features:
● mean blob rating
● num inconsistent spaces
● num inconsistent char type
● num x-height inconsistencies
● num case inconsistencies
● word length (in type categories)
Tesseract Tutorial: DAS 2014 Tours France
Tesseract Word Recognizer
Tesseract Tutorial: DAS 2014 Tours France
Example of Chopping (unlv/mag.3B/2/8022_028.3B.tif
Col 2, line 6, word 1)
Word Distance Worst blob
Momm 212.2 7.7
Mommn 186.3 8.3
Momtfln178.0 9.2
Momtain 124.9 5.3
Mounm 184.0 7.7
Mountain 80.6 3.1 ACCEPT!
Tesseract Tutorial: DAS 2014 Tours France
Example of Combining (unlv/doe3.3B/4/2214_007.3B.tif,
col 2, line 8, word 2)
Word Distance
lilllit- 77.58
limit57.1
Emit 89.7
Unfit 95.4
Hulk 122.7
Bulk 136.8
Word Distance
HUM 120.6
MUM 127.0
BUM 134.7
1mm 112.8
Milk147.2
huh 137.1
fink 140.7
Emu 129.7
BMW 140.3
Tesseract Tutorial: DAS 2014 Tours France
Thanks for Listening!
Questions?

More Related Content

Similar to 6 char segmentation

XKE - Programming Paradigms & Constructs
XKE - Programming Paradigms & ConstructsXKE - Programming Paradigms & Constructs
XKE - Programming Paradigms & ConstructsNicolas Demengel
 
Lecture 01 - Introduction and Review.ppt
Lecture 01 - Introduction and Review.pptLecture 01 - Introduction and Review.ppt
Lecture 01 - Introduction and Review.pptMaiGaafar
 
Refactor your specs! Øredev 2013
Refactor your specs! Øredev 2013Refactor your specs! Øredev 2013
Refactor your specs! Øredev 2013Cyrille Martraire
 
Basics of Programming - A Review Guide
Basics of Programming - A Review GuideBasics of Programming - A Review Guide
Basics of Programming - A Review GuideBenjamin Kissinger
 
Python breakdown-workbook
Python breakdown-workbookPython breakdown-workbook
Python breakdown-workbookHARUN PEHLIVAN
 
Reflective Plan Examples
Reflective Plan ExamplesReflective Plan Examples
Reflective Plan ExamplesMonica Turner
 
Weak Supervision.pdf
Weak Supervision.pdfWeak Supervision.pdf
Weak Supervision.pdfStephenLeo7
 
Grails Worst Practices
Grails Worst PracticesGrails Worst Practices
Grails Worst PracticesBurt Beckwith
 
Ruby for .NET developers
Ruby for .NET developersRuby for .NET developers
Ruby for .NET developersMax Titov
 
Openbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
Openbar Leuven // Less is more. Working with less data in NLP by Yves PeirsmanOpenbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
Openbar Leuven // Less is more. Working with less data in NLP by Yves PeirsmanOpenbar
 
Conducting ux research
Conducting ux researchConducting ux research
Conducting ux researchVina Sectiana
 
Query Linguistic Intent Detection
Query Linguistic Intent DetectionQuery Linguistic Intent Detection
Query Linguistic Intent Detectionbutest
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Pythontswr
 

Similar to 6 char segmentation (20)

XKE - Programming Paradigms & Constructs
XKE - Programming Paradigms & ConstructsXKE - Programming Paradigms & Constructs
XKE - Programming Paradigms & Constructs
 
Max diff
Max diffMax diff
Max diff
 
Lecture 01 - Introduction and Review.ppt
Lecture 01 - Introduction and Review.pptLecture 01 - Introduction and Review.ppt
Lecture 01 - Introduction and Review.ppt
 
Design smells
Design smellsDesign smells
Design smells
 
Refactor your specs! Øredev 2013
Refactor your specs! Øredev 2013Refactor your specs! Øredev 2013
Refactor your specs! Øredev 2013
 
Basics of Programming - A Review Guide
Basics of Programming - A Review GuideBasics of Programming - A Review Guide
Basics of Programming - A Review Guide
 
Decision tree
Decision tree Decision tree
Decision tree
 
Ninja Cursors
Ninja CursorsNinja Cursors
Ninja Cursors
 
Python breakdown-workbook
Python breakdown-workbookPython breakdown-workbook
Python breakdown-workbook
 
Reflective Plan Examples
Reflective Plan ExamplesReflective Plan Examples
Reflective Plan Examples
 
Weak Supervision.pdf
Weak Supervision.pdfWeak Supervision.pdf
Weak Supervision.pdf
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
12.6-12.9.pptx
12.6-12.9.pptx12.6-12.9.pptx
12.6-12.9.pptx
 
Grails Worst Practices
Grails Worst PracticesGrails Worst Practices
Grails Worst Practices
 
Ruby for .NET developers
Ruby for .NET developersRuby for .NET developers
Ruby for .NET developers
 
Openbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
Openbar Leuven // Less is more. Working with less data in NLP by Yves PeirsmanOpenbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
Openbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
 
Conducting ux research
Conducting ux researchConducting ux research
Conducting ux research
 
Code Metrics
Code MetricsCode Metrics
Code Metrics
 
Query Linguistic Intent Detection
Query Linguistic Intent DetectionQuery Linguistic Intent Detection
Query Linguistic Intent Detection
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 

More from Solin TEM

CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1Solin TEM
 
cs15 slide_v2
cs15 slide_v2cs15 slide_v2
cs15 slide_v2Solin TEM
 
4 downloading
4 downloading4 downloading
4 downloadingSolin TEM
 
1 intro history
1 intro history1 intro history
1 intro historySolin TEM
 
0 tutorial contents
0 tutorial contents0 tutorial contents
0 tutorial contentsSolin TEM
 
Khmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_dayKhmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_daySolin TEM
 
Khmer ocr using gfd
Khmer ocr using gfdKhmer ocr using gfd
Khmer ocr using gfdSolin TEM
 
Khmer ocr scientificday_itc
Khmer ocr scientificday_itcKhmer ocr scientificday_itc
Khmer ocr scientificday_itcSolin TEM
 
Khmer ocr itc
Khmer ocr itcKhmer ocr itc
Khmer ocr itcSolin TEM
 

More from Solin TEM (9)

CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1
 
cs15 slide_v2
cs15 slide_v2cs15 slide_v2
cs15 slide_v2
 
4 downloading
4 downloading4 downloading
4 downloading
 
1 intro history
1 intro history1 intro history
1 intro history
 
0 tutorial contents
0 tutorial contents0 tutorial contents
0 tutorial contents
 
Khmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_dayKhmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_day
 
Khmer ocr using gfd
Khmer ocr using gfdKhmer ocr using gfd
Khmer ocr using gfd
 
Khmer ocr scientificday_itc
Khmer ocr scientificday_itcKhmer ocr scientificday_itc
Khmer ocr scientificday_itc
 
Khmer ocr itc
Khmer ocr itcKhmer ocr itc
Khmer ocr itc
 

Recently uploaded

Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 

Recently uploaded (20)

Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 

6 char segmentation

  • 1. Tesseract Tutorial: DAS 2014 Tours France 6. Character Segmentation, Language Models and Beam Search The heart of Tesseract Ray Smith, Google Inc.
  • 2. Tesseract Tutorial: DAS 2014 Tours France Approaches to Segmentation ● Segment first using only geometry. ● Maximally chop, then combine with a beam search. (Over-segmentation.) ● Sliding window to "avoid" segmentation altogether. ● Tesseract: Chop only as needed, then combine as needed.
  • 3. Tesseract Tutorial: DAS 2014 Tours France Over-Segmentation ● Aim is to maximize recall of chops with the compromise of reduction in precision.
  • 4. Tesseract Tutorial: DAS 2014 Tours France Segmentation Graph ● Segmentation possibilities and classifier results form a directed graph l } I 1 m u nh b l } I 1 i l 1 e c r T h b i l 1 i l 1 i l 1 l } I 1 m m g 9 q n u %
  • 5. Tesseract Tutorial: DAS 2014 Tours France Searching the Segmentation Graph l } I 1 m u nh b l } I 1 i l 1 e c r T h b i l 1 i l 1 i l 1 l } I 1 m m g 9 q n u higher %
  • 6. Tesseract Tutorial: DAS 2014 Tours France Searching the Segmentation Graph l } I 1 m u nh b l } I 1 i l 1 e c r T h b i l 1 i l 1 i l 1 l } I 1 m m g 9 q n u l } I 1 m u nh b l } I 1 i l 1 e c r T h b i l 1 i l 1 i l 1 l } I 1 m m g 9 q n u higher }uglier % %
  • 7. Tesseract Tutorial: DAS 2014 Tours France Integration of Language Models (General Methods) ● Implement Language Model as Finite State Machine. ● Search Language Model and Segmentation Graph in parallel. ● Combine “probabilities” in some sensible way. ● Hidden Markov Model methods are good example.
  • 8. Tesseract Tutorial: DAS 2014 Tours France Segmentation Free = Extreme Over-Segmentation ● Slide over the word/textline with a classifier/HMM. ● Beam search + shape model probs + language model probs solves the segmentation internally. ● Really just mega-over-segmentation.
  • 9. Tesseract Tutorial: DAS 2014 Tours France Tesseract Segmentation Approach based on observations: ● Initial segmentation is often correct or close. ● Classifier generally doesn’t like incorrectly segmented text. ● Over-segmentation often leads to poor results., eg m->iii
  • 10. Tesseract Tutorial: DAS 2014 Tours France Tesseract Segmentation Approach Classify Initial Segmentation Search Word: OK? Yes => Done while any Bad Blob has any Chops available Chop and classify pieces of Worst Choppable Blob Search Word: OK? Yes => Done while any fixable “Pain point” Associate adjacent blobs and classify Search Word: OK? Yes => Done
  • 11. Tesseract Tutorial: DAS 2014 Tours France Types of Pain Point ● Initial: Join each adjacent pair ● Ambiguity: Eg m/rn ● Path: Neighbors of blobs in the current best path
  • 12. Tesseract Tutorial: DAS 2014 Tours France Ratings MATRIX = Segmentation Graph l } I 1 m u nh b l } I 1 i l 1 e c r T h b i l 1 i l 1 i l 1 l } I 1 m m g 9 q n u % l}Ii hb il1 un il1 m nu il1 g9q l}I1 % l}I1 hb il1 m ec m rT Each entry holds a BLOB_CHOICE_LIST providing classifier choices with rating and certainty.
  • 13. Tesseract Tutorial: DAS 2014 Tours France Live Demo of Segmentation Search Chopper demo api/tesseract unlv/mag.3B/2/8022_028.3B.tif test1 inter segdemo chopdebug3 Col 2, line 6, word 1 Associator demo api/tesseract unlv/doe3.3B/4/2214_007.3B.tif test1 inter segdemo Col 2, line 8, word 2
  • 14. Tesseract Tutorial: DAS 2014 Tours France Evaluation of a WORD_CHOICE (no params-model) Word Rating = word_factor ∑blob_choice->rating() word_factor = segmentation Condition base word_factor Add-ons Frequent dawg word 1.0 Inconsistent case +0.1 Other dawg word 1.1 Inconsistent case +0.1 Non-dawg word 1.25 +0.01 for each char over 3. Inconsistent case +0.1 Inconsistent punc +0.2 Inconsistent chartype +0.3 Inconsistent script +0.5 Inconsisted char spacing +0.01 All except script +0.01 for each additional occurrence.
  • 15. Tesseract Tutorial: DAS 2014 Tours France Evaluation of a WORD_CHOICE (with params-model) Word Rating = word_factor ∑outline length word_factor = weighted sum of word features: ● mean blob rating ● num inconsistent spaces ● num inconsistent char type ● num x-height inconsistencies ● num case inconsistencies ● word length (in type categories)
  • 16. Tesseract Tutorial: DAS 2014 Tours France Tesseract Word Recognizer
  • 17. Tesseract Tutorial: DAS 2014 Tours France Example of Chopping (unlv/mag.3B/2/8022_028.3B.tif Col 2, line 6, word 1) Word Distance Worst blob Momm 212.2 7.7 Mommn 186.3 8.3 Momtfln178.0 9.2 Momtain 124.9 5.3 Mounm 184.0 7.7 Mountain 80.6 3.1 ACCEPT!
  • 18. Tesseract Tutorial: DAS 2014 Tours France Example of Combining (unlv/doe3.3B/4/2214_007.3B.tif, col 2, line 8, word 2) Word Distance lilllit- 77.58 limit57.1 Emit 89.7 Unfit 95.4 Hulk 122.7 Bulk 136.8 Word Distance HUM 120.6 MUM 127.0 BUM 134.7 1mm 112.8 Milk147.2 huh 137.1 fink 140.7 Emu 129.7 BMW 140.3
  • 19. Tesseract Tutorial: DAS 2014 Tours France Thanks for Listening! Questions?