Entering the Fourth Dimension of OCR with Tesseract

Hanno Embregts @hannotify

What is OCR?
/ first dimension

Atechnology that can take written
words and convert then aes mo
aulerseadans Yom, proved meyren ne
fgntrom sng tne cortec cars ometimes,
ate fant gait sae and ptch, dar engugn
onthe pase, ana Your Bropared to
spend Several cantinas covtecing al me
one PrP areee e SP es ts anine Oe tat
came autas zersee

A technology that can take written
words and convert them back into
computer-readable form, provided
they're in the right font, using the
correct colors sometimes, at the right
point size and pitch, dark enough on
the paper, and you're prepared to spend
several centuries correcting all the ones
that came out as l's, all the O's that
came out as zeroes, and all the colons
that come out like semicolons.

A Proper Definition
Optical character recognition (...) is
the mechanical or electronic conversion
of images of typed, handwritten or
printed text into machine-encoded text,
whether from a scanned document, a
photo of a document, a scene-photo
(...) or from subtitle text superimposed
on an image.

Pattern recognition

Feature detection

1929
Gustav Tauschek patents a basic OCR
'reading machine'.

1960s
Postal services start using OCR for
mail sorting.

1993
The Apple Newton becomes the first
handheld computer to feature
handwriting recognition.

Financial transfers
Catch me if you can!

Book digitization
Also supports Ctrl+F.

Passport scanning
Gets you to your gate in time.

Number plate recognition
Get your speeding ticket even faster!

Getting Started
/ second dimension

Tesseract
Development started at Hewlett-Packard in 1985
Ported to Windows in 1996
Released as open source in 2005
Google has sponsored development of Tesseract since 2006
(https://github.com/tesseract-ocr/tesseract)

Anthony Kay in "Linux Journal", July 2007:
"The core feature, text recognition, is
drastically better than anything else
I've tried from the Open Source
community."

Features
character recognition
support for Unicode
input: JPEG, GIF, PNG, TIFF or BMP
output: searchable PDF, TSV, plain text or hOCR

hOCR example
<p class="ocr_par" lang="deu" title="bbox930">
<span class="ocr_line" title="bbox 348 797 1482 838; baseline -
<span class="ocrx_word" title="bbox 348 805 402 832; x_wconf
<span class="ocrx_word" title="bbox 1199 797 1343 832; x_wcon
<span class="ocrx_word" title="bbox 1362 805 1399 823; x_wcon
<span class="ocrx_word" title="bbox 1417 x_wconf 96">ver-</sp
</span>
</p>
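
Each hOCR element carries its layout data in the title attribute: a bounding box and, for words, a confidence value (x_wconf). A minimal Java sketch of reading those values back out could look like this. The class name HocrTitleParser and the sample title string are made up for illustration; the sample combines a bounding box and a confidence value of the kind shown on the slide, whose lines are cut off in the original.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch (hypothetical helper): read bbox and x_wconf from an hOCR word title. */
final class HocrTitleParser {

    // Expects "bbox x0 y0 x1 y1;" followed somewhere later by "x_wconf N".
    private static final Pattern WORD_TITLE =
            Pattern.compile("bbox (\\d+) (\\d+) (\\d+) (\\d+);.*?x_wconf (\\d+)");

    /** Returns {x0, y0, x1, y1, confidence}, or null if the title has another shape. */
    static int[] parse(String title) {
        Matcher matcher = WORD_TITLE.matcher(title);
        if (!matcher.find()) {
            return null;
        }
        int[] values = new int[5];
        for (int i = 0; i < values.length; i++) {
            values[i] = Integer.parseInt(matcher.group(i + 1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed sample input, not copied from real Tesseract output.
        int[] word = parse("bbox 348 805 402 832; x_wconf 96");
        System.out.println("left=" + word[0] + " top=" + word[1]
                + " right=" + word[2] + " bottom=" + word[3]
                + " confidence=" + word[4]);
    }
}
```

For real documents an HTML parser is a safer way to reach the title attributes; the regular expression here only covers the simple word case.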

Used by Google
For text detection on mobile devices
In video
In Gmail image spam detection

New features
in v3.0:
support for over 100 languages
page layout analysis
in v4.0:
LSTM recognition engine

Tess4J
A Java JNA wrapper for Tesseract

Tess4J features
PDF input
Multi-page TIFF input
Image optimization
(https://github.com/nguyenq/tess4j)

Demo
Install Tesseract (for multilanguage support)
Add Tess4J dependency
Convert image to plain text (English)
Convert image to plain text (Greek)

Choosing the Right Library
/ third dimension

ABBYY FineReader
Development started at ABBYY in 1993
Supports 192 languages
20 million users worldwide
Outputs to MS Office, RTF, HTML, (searchable) PDF
and plain text
(https://www.abbyy.com/en-eu/finereader)

Google Cloud Vision API
Launched in 2016 by Google
Supports 56 languages
Outputs to JSON
Integrates nicely with Google Images and Google
SafeSearch
(https://cloud.google.com/vision/)

                         ABBYY                GCV                               Tesseract
costs                    $200 per computer    $1.50 per 1000 images, per month  $0
languages                192                  56                                102
Java integration         through SDK          through REST API                  through JNA wrapper
handwriting recognition  'handprinted' text   supported                         not supported
custom training          supported            not supported                     supported
accuracy                 9/10                 8/10                              7/10

Case study
Paper archives going digital.

Advanced Features
/ fourth dimension

What Advanced Features?
Reporting confidence
Multiple languages in a single document
Image optimization
Speed/accuracy tradeoffs
Training

Improving accuracy
To better recognize the expected
input documents.

What is confidence?

Reporting confidence
Tess4J supports two return types:
String (containing the OCR'ed text)
List<OCRResult> (OCR result is written to a file)
  int confidence
  List<Word> words
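
Once per-word confidence values are available, a common next step is to flag the words that need manual review. The sketch below shows that step in plain Java. ScoredWord is a hypothetical stand-in for a recognized word with its confidence, not the Tess4J Word class, and the threshold of 80 is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: collect the words whose confidence falls below a review threshold. */
final class ConfidenceFilter {

    /** Hypothetical stand-in for a recognized word and its confidence (0 to 100). */
    static final class ScoredWord {
        final String text;
        final float confidence;

        ScoredWord(String text, float confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /** Returns the words with a confidence strictly below the threshold, in input order. */
    static List<ScoredWord> belowThreshold(List<ScoredWord> words, float threshold) {
        List<ScoredWord> doubtful = new ArrayList<>();
        for (ScoredWord word : words) {
            if (word.confidence < threshold) {
                doubtful.add(word);
            }
        }
        return doubtful;
    }

    public static void main(String[] args) {
        // Made-up sample values for illustration only.
        List<ScoredWord> words = Arrays.asList(
                new ScoredWord("Tesseract", 96f),
                new ScoredWord("zersee", 41f));
        for (ScoredWord word : belowThreshold(words, 80f)) {
            System.out.println("review: " + word.text + " (" + word.confidence + ")");
        }
    }
}
```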

Multiple languages in a single document
Concatenate the language codes and separate them
by a plus sign:
tesseract.setLanguage("eng+nld");
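
When the set of languages is only known at runtime, the plus-separated string can be built from a list of codes. This is a small sketch; the LanguageCodes class and its combine method are hypothetical helpers, not part of Tess4J. The result is the kind of string passed to setLanguage above.

```java
/** Sketch (hypothetical helper): build a Tesseract language string such as "eng+nld". */
final class LanguageCodes {

    /** Joins the given language codes with a plus sign. */
    static String combine(String... codes) {
        if (codes.length == 0) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        System.out.println(combine("eng", "nld")); // prints eng+nld
    }
}
```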

Demo
Reporting confidence
Multiple languages in a single document

Image optimization
Tess4J is bundled with the ImageHelper class, which
contains a few image optimization tricks.

convertImageToBinary(BufferedImage image)
convertImageToGrayscale(BufferedImage image)
invertImageColor(BufferedImage image)
rotateImage(BufferedImage image, double angle)
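
To give an idea of what converting an image to binary means, here is a plain-Java sketch of fixed-threshold binarization. It is not the ImageHelper implementation. The ThresholdSketch class, the simple channel average used as brightness, and the threshold value are all assumptions made for illustration.

```java
import java.awt.image.BufferedImage;

/** Sketch: turn every pixel black or white based on a fixed brightness threshold. */
final class ThresholdSketch {

    /** Pixels at or above the threshold (0 to 255) become white, all others black. */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                // Simple average of the three channels as a brightness estimate.
                int brightness = (red + green + blue) / 3;
                result.setRGB(x, y, brightness >= threshold ? 0xFFFFFF : 0x000000);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two-pixel test image: one light gray pixel, one dark gray pixel.
        BufferedImage image = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xF0F0F0);
        image.setRGB(1, 0, 0x202020);
        BufferedImage binary = binarize(image, 128);
        System.out.printf("%06X %06X%n",
                binary.getRGB(0, 0) & 0xFFFFFF, binary.getRGB(1, 0) & 0xFFFFFF);
    }
}
```

A fixed threshold works for evenly lit scans; unevenly lit photos usually need an adaptive method instead.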

Still having problems?
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

Speed/accuracy tradeoffs
Two types of training data:
https://github.com/tesseract-ocr/tessdata_fast
https://github.com/tesseract-ocr/tessdata_best

Demo
Image optimization
Speed/accuracy tradeoffs

Training data
400,000 textlines
4500 fonts
(for Latin-based languages)

Custom training
Fine tune (e.g. for an unusual font)
Cut off the top layer (e.g. for a new language)
Retrain from scratch (e.g. don't do this!)

Further reading
"An Overview of the Tesseract OCR Engine" by Ray Smith
(https://research.google.com/pubs/archive/33418.pdf)

Useful resources
Tesseract on GitHub
(https://github.com/tesseract-ocr/tesseract)
Try Tesseract online
(newocr.com)

Any questions?

Thank you! ☺
https://hannotify.github.io
@hannotify
hanno.embregts@infosupport.com
