3 Training Tesseract
An Introduction to the Training Process
Ray Smith, Google Inc.
Tesseract Tutorial: DAS 2014 Tours France
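The Big Picture
[Diagram; box labels in flow order, arrows inferred from that order: Web Crawl Repository -> Language ID (Map-Reduce) -> Dirty Language Corpora -> Text Filtration -> Cleaned Language Corpora -> Language Model Generation and Realistic Text Rendering -> OCR Engine Training. Per-language outputs (e.g. Eng): OCR Shape Files and Language Model Files. A further input is labelled Manually generated Files.]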
The Open Source Parts
[Diagram: the Language Model Generation, Realistic Text Rendering and OCR Engine Training stages, annotated with their open-source tools.
Inputs: Bigram list, Word list, Punctuation patterns, Realistic Training text, Manually generated Files.
Tools: text2image, unicharset_extractor, set_unicharset_properties, mftraining, cntraining, shapetraining, wordlist2dawg, combine_tessdata.
Outputs: OCR Shape Files and Language Model Files, packaged as <lang>.traineddata.]
Training Fundamentals
● Character samples must be segregated by font
=> Trained on synthetic (rendered, distorted) data
● Few samples required (4-10 of each combination is good. 1 is OK)
● Not many fonts required. (32 used for Latin)
● Not many fonts allowed. (MAX_NUM_CONFIGS=64: Long story.)
● Number of different “characters” now limited only by memory.
What Data Needs to be Created by Training? (1)
Name             Type    Status     Creator               Description
config           Text    Optional   Manual                Lang-specific engine settings if needed
unicharset       Text    Mandatory  unicharset_extractor  The set of recognizable units
unicharambigs    Text    Optional   Manual*               Intrinsic ambiguities for the language
inttemp          Binary  Mandatory  mftraining            Classifier shape data
pffmtable        Text    Mandatory  mftraining            Extra classifier data (num expected features)
normproto        Text    Mandatory  cntraining            Classifier baseline position info
cube-unicharset  Text    Optional   unicharset_extractor  Cube’s set of recognizable units
shapetable       Binary  Optional   shapetraining         Indirection between classifier and unicharset
params-model     Text    Optional   Google tool           Alternative method for combining LM & classifier
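As a rough illustration of the Status column, the Python sketch below checks a set of produced component names against the entries marked Mandatory in the table. The function name `missing_mandatory` and the example set are hypothetical; only the component names and their statuses come from the table.

```python
# Components marked "Mandatory" in the table (unicharset, inttemp,
# pffmtable, normproto). Everything else in the table is optional.
MANDATORY = {"unicharset", "inttemp", "pffmtable", "normproto"}


def missing_mandatory(produced):
    """Return the mandatory components absent from `produced`, sorted by name."""
    return sorted(MANDATORY - set(produced))


if __name__ == "__main__":
    # Hypothetical training run that forgot to keep pffmtable.
    produced = {"unicharset", "inttemp", "normproto", "config"}
    print(missing_mandatory(produced))
```

A check like this could run just before the components are combined, so that a missing classifier file is reported by name rather than discovered at recognition time.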
Inside tesstrain.sh
Input                                                     Program                    Output
Realistic Training Text                                   text2image                 *.tif, *.box
*.box                                                     unicharset_extractor       unicharset
unicharset, <script>.unicharset                           set_unicharset_properties  unicharset
Word List                                                 wordlist2dawg              word-dawg
Frequent Word List                                        wordlist2dawg              freq-dawg
*.tif, *.box                                              tesseract                  *.tr
*.tr                                                      cntraining                 normproto
*.tr                                                      mftraining                 inttemp, pffmtable
unicharset, dawgs, normproto, inttemp, pffmtable, config  combine_tessdata           traineddata
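The input/program/output rows imply a dependency order between the tools. The Python sketch below transcribes the rows as data and derives one valid run order from them; it does not invoke any tool and is not how tesstrain.sh itself is written. The names `STEPS`, `START` and `schedule` are hypothetical, and "initial-unicharset" is a label borrowed from the overview diagram to tell the extractor's output apart from the final unicharset.

```python
# Each step is (program, inputs, outputs), transcribed from the table.
# Deliberately listed out of order to show that the order is derived.
STEPS = [
    ("combine_tessdata",
     {"unicharset", "word-dawg", "freq-dawg", "normproto",
      "inttemp", "pffmtable", "config"},
     {"traineddata"}),
    ("mftraining", {"tr"}, {"inttemp", "pffmtable"}),
    ("cntraining", {"tr"}, {"normproto"}),
    ("tesseract", {"tif", "box"}, {"tr"}),
    ("wordlist2dawg (freq)", {"frequent word list"}, {"freq-dawg"}),
    ("wordlist2dawg (word)", {"word list"}, {"word-dawg"}),
    ("set_unicharset_properties",
     {"initial-unicharset", "script unicharset"}, {"unicharset"}),
    ("unicharset_extractor", {"box"}, {"initial-unicharset"}),
    ("text2image", {"training text"}, {"tif", "box"}),
]

# Materials assumed to exist before training starts.
START = {"training text", "word list", "frequent word list",
         "script unicharset", "config"}


def schedule(steps, available):
    """Order steps so each runs only after all of its inputs exist."""
    available = set(available)
    pending = list(steps)
    order = []
    while pending:
        ready = [s for s in pending if s[1] <= available]
        if not ready:
            names = [s[0] for s in pending]
            raise ValueError("inputs can never be satisfied for: %r" % names)
        for step in ready:
            order.append(step[0])
            available |= step[2]
            pending.remove(step)
    return order


if __name__ == "__main__":
    for program in schedule(STEPS, START):
        print(program)
```

Whatever order the steps are listed in, combine_tessdata comes out last, because it is the only step that consumes the outputs of all the others.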
Overview of Tesseract Training Process
[Diagram; some stages are marked NOT open source. Flows recoverable from the box labels (arrows inferred from box order):
Training Text (real [frequent] words covering whole character set) + Fonts -> Rendering + Image Degradation -> Training Images + Box Files
Training Images + Box Files -> Tesseract box.train -> Extracted Features (.tr files), all from correctly segmented characters
Extracted Features (.tr files) -> cn Training, mf Training -> inttemp, pffmtable, normproto
Box Files -> Unicharset Extractor -> Initial Unicharset -> Set Unicharset Properties (with Universal Unicharset Properties, Script Unicharsets) -> Unicharset
Text Corpus, Training Data -> Text Filtration -> Filtered Corpus -> Text To Wordlist -> Words Sorted by Frequency -> Wordlist 2 DAWG -> punc dawg, word dawg, number dawg, freq dawg, bigram dawg
Ambigs Training -> Unichar Ambigs
All of the above -> Combine Tessdata -> Traineddata]
Language-Specific Data
● Training text: defines the character set.
● Wordlists: define the language model. (Including bigrams when present.)
● Pango layout: defines the grapheme clusters (recognition units).
Eg: 0xca6 + 0xccd + 0xca6 + 0xcc7 -> one grapheme cluster (shown as a single glyph on the slide)
● ICU: determines what is right-to-left.
● Script.unicharset: Stores typical font metrics for Unicode chars.
● Config files: in training/langdata/<lang>/<lang>.config.
● Vertical rendering: Determined manually.
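To make the Pango example concrete, the Python sketch below builds the string from the four code points given on the slide and prints their Unicode names. It only inspects the code points; deciding that they form one grapheme cluster is left to the layout engine, as the slide says. The names `CODE_POINTS`, `cluster` and `describe` are hypothetical.

```python
import unicodedata

# The four code points from the Pango example (Kannada script).
CODE_POINTS = [0x0CA6, 0x0CCD, 0x0CA6, 0x0CC7]

# One string of four code points that is treated as one recognition unit.
cluster = "".join(chr(cp) for cp in CODE_POINTS)


def describe(text):
    """Return (code point label, Unicode name) for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    for label, name in describe(cluster):
        print(label, name)
    print(len(cluster), "code points in the cluster")
```

The sequence is a consonant, a virama, the same consonant again and a dependent vowel sign, which is why a per-code-point character set would be the wrong unit for training.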
MFtraining
Input: font.tr files (stored features with UTF-8 labels)
Output: inttemp (kNN classifier data); pffmtable (helper information for class pruner)
Operation:
1. Independently cluster features in each font/char class combination.
2. Combine similar cluster means across fonts (single char class).
3. Define each font/char class as a combination of cluster means (a font config).
4. Build class pruner and main kNN classifier.
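Steps 1 to 3 can be sketched on toy data. The Python below is a minimal illustration, not Tesseract's clustering code: it assumes features are small numeric tuples, swaps in a simple greedy "leader" clustering with a fixed radius, and leaves out step 4 entirely. All names (`leader_cluster`, `build_configs`, `radius`) are hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def leader_cluster(points, radius):
    """Greedy clustering: a point joins the first cluster whose current mean
    lies within `radius`, otherwise it starts a new cluster."""
    clusters = []
    for p in points:
        for members in clusters:
            if math.dist(p, mean(members)) <= radius:
                members.append(p)
                break
        else:
            clusters.append([p])
    return [mean(c) for c in clusters]


def build_configs(samples, radius):
    """samples: {(font, char): [feature tuples]} -> (protos, configs).

    protos:  char -> list of cluster means shared across fonts
    configs: (font, char) -> sorted indices into protos[char]
    """
    protos = {}
    configs = {}
    for (font, char), feats in sorted(samples.items()):
        shared = protos.setdefault(char, [])
        used = set()
        for m in leader_cluster(feats, radius):       # step 1: per font/char
            for i, proto in enumerate(shared):        # step 2: reuse similar means
                if math.dist(m, proto) <= radius:
                    used.add(i)
                    break
            else:
                shared.append(m)
                used.add(len(shared) - 1)
        configs[(font, char)] = sorted(used)          # step 3: the font config
    return protos, configs


if __name__ == "__main__":
    samples = {
        ("Arial", "A"): [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)],
        ("Times", "A"): [(0.05, 0.0), (9.0, 9.0)],
    }
    protos, configs = build_configs(samples, radius=1.0)
    print(configs)
```

In the toy data both fonts share one cluster mean for 'A' and each adds one of its own, so the two font configs overlap in one index. That sharing is what keeps the number of stored means below the sum over fonts.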
Clustering Result
[Figure: two panels, labelled Protos of Arial ‘A’ and Protos of Times Italic ‘A’.]
CNtraining
Input: font.tr files (stored features with UTF-8 labels)
Output: normproto (GMM means of the CN feature)
Operation:
1. Independently cluster the CN feature of all fonts for each char class.
2. Write cluster means (a Gaussian Mixture Model) to normproto file.
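The pooling in step 1 differs from MFtraining: samples from all fonts are merged per char class before clustering. The Python sketch below shows that shape on toy data. It assumes the CN feature is a numeric tuple and swaps in plain k-means with a fixed number of clusters as a stand-in; it is not Tesseract's clustering and it does not write the normproto file format. The names `kmeans` and `train_normproto` are hypothetical.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means, using the first k points as initial centres."""
    centres = [points[i] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            groups[nearest].append(p)
        # Recompute each centre; an empty group keeps its previous centre.
        centres = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centres[i]
            for i, g in enumerate(groups)
        ]
    return centres


def train_normproto(samples, k):
    """samples: {(font, char): [CN feature tuples]} -> {char: cluster means}."""
    pooled = {}
    for (font, char), feats in samples.items():
        pooled.setdefault(char, []).extend(feats)   # all fonts pooled per char
    return {
        char: kmeans(feats, min(k, len(feats)))
        for char, feats in pooled.items()
    }


if __name__ == "__main__":
    samples = {
        ("Arial", "A"): [(0.0,), (0.2,)],
        ("Times", "A"): [(10.0,), (10.2,)],
    }
    print(train_normproto(samples, k=2))
```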
Shapetraining
Input: font.tr files (stored features with UTF-8 labels)
Output: shapetable (mapping from an index to a collection of unichar-ids, fonts)
Operation:
1. Cluster across all fonts and all char classes.
2. Merge ambiguous classes into shapes.
3. Works OK for Indic, but not so well for others.
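The shapetable can be pictured as a dictionary from a shape index to the (unichar-id, font) pairs that share that shape. The Python sketch below builds such a table on toy data by merging any two classes whose mean features lie within a threshold, using a small union-find. The distance test, the threshold and all names are assumptions for illustration; the real merge criterion is not described on the slide.

```python
import math


def build_shape_table(means, threshold):
    """means: {(unichar_id, font): feature tuple}
    returns {shape index: [(unichar_id, font), ...]}."""
    keys = sorted(means)
    parent = {k: k for k in keys}

    def find(k):
        # Follow parent links to the representative, halving the path.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # Merge every pair of classes whose means are within the threshold.
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if math.dist(means[a], means[b]) <= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return {index: members for index, members in enumerate(groups.values())}


if __name__ == "__main__":
    # Toy means: 'I' and 'l' look alike in this font, 'A' does not.
    means = {
        ("l", "Arial"): (0.0, 0.0),
        ("I", "Arial"): (0.1, 0.0),
        ("A", "Arial"): (5.0, 5.0),
    }
    print(build_shape_table(means, threshold=0.5))
```

With a table like this the classifier only has to pick a shape, and the choice between the unichar-ids inside that shape can be deferred to later stages.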
Overview of Tesseract Training Process with Shapes
[Diagram; some stages are marked NOT open source. Flows recoverable from the box labels (arrows inferred from box order):
Training Text (real [frequent] words covering whole character set) + Fonts -> Rendering + Image Degradation -> Training Images + Box Files
Training Images + Box Files -> Tesseract box.train -> Extracted Features (.tr files)
Feature groups shown: Chopped Fragments, Natural Fragments, Correctly Segmented, Naturally touching; also Correctly Segmented, Validation Set, Junk
Extracted Features (.tr files) -> Master Trainer File -> Shape Clustering -> Shape Unicharset, Shape Table
Master Trainer File -> cn Training, mf Training -> inttemp, pffmtable, normproto
Box Files -> Unicharset Extractor -> Initial Unicharset -> Set Unicharset Properties (with Universal Unicharset Properties, Script Unicharsets) -> Unicharset
Text Corpus, Training Data -> Text filtration -> Filtered Corpus -> Text To Wordlist -> Words Sorted by Frequency -> Wordlist 2 DAWG -> punc dawg, word dawg, number dawg, freq dawg, bigram dawg
All of the above -> Combine Tessdata -> Traineddata]
DAWGs (Directed Acyclic Word Graph)
From “The world’s fastest Scrabble program”, A. W. Appel, G. J. Jacobson, CACM 31(5), May 1988, pp. 572-585:
“The lexicon represented as a raw word list takes about 780 Kbytes, while our dawg can be represented in 175 Kbytes. The relatively small size of this data structure allows us to keep it entirely in core, even on a fairly modest computer.”
[Figure: side-by-side drawings labelled Trie and Dawg.]
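The size difference between a trie and a dawg comes from sharing identical subtrees, typically common suffixes. The Python sketch below builds a trie from a word list and counts how many nodes remain once identical subtrees are merged. It illustrates the idea only and is not the wordlist2dawg implementation or its file format; `Node`, `build_trie`, `count_trie_nodes` and `count_dawg_nodes` are hypothetical names.

```python
class Node:
    """One trie node: outgoing edges by character plus an end-of-word flag."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_word = True
    return root


def count_trie_nodes(node):
    return 1 + sum(count_trie_nodes(c) for c in node.children.values())


def count_dawg_nodes(root):
    """Count nodes left after merging identical subtrees of the trie."""
    ids = {}  # subtree signature -> canonical node id

    def canonical(node):
        # Two nodes are interchangeable when they agree on the end-of-word
        # flag and on every (character, canonical child) edge.
        signature = (
            node.is_word,
            tuple(sorted((ch, canonical(child))
                         for ch, child in node.children.items())),
        )
        return ids.setdefault(signature, len(ids))

    canonical(root)
    return len(ids)


if __name__ == "__main__":
    trie = build_trie(["tap", "taps", "top", "tops"])
    print("trie nodes:", count_trie_nodes(trie))
    print("dawg nodes:", count_dawg_nodes(trie))
```

For the four example words the trie has 8 nodes and the merged graph has 5, because the "p", "ps" endings after "ta" and "to" collapse into one shared path.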
What Data Needs to be Created by Training? (2)
Name                Type    Status      Creator        Description
punc-dawg           Binary  Optional    wordlist2dawg  Patterns of punctuation around words
word-dawg           Binary  Optional    wordlist2dawg  Main word-list/dictionary language model
number-dawg         Binary  Optional    wordlist2dawg  Acceptable number patterns (with units?)
freq-dawg           Binary  Optional    wordlist2dawg  Shorter dictionary of frequent words
fixed-length-dawgs  Binary  Deprecated  wordlist2dawg  Was used for CJK
cube-word-dawg      Binary  Optional    wordlist2dawg  Main word-list/dictionary language model for cube
bigram-dawg         Binary  Optional    wordlist2dawg  Word bigram language model
unambig-dawg        Binary  Optional    wordlist2dawg  List of unambiguous words (not used?)
Thanks for Listening!
Questions?
