This document provides an overview of optical character recognition (OCR) and the Tesseract OCR engine. It discusses the basics of OCR, the history and development of Tesseract, its features such as character recognition and language support, and advanced topics like reporting confidence, image optimization, and custom training. The document also compares Tesseract to competitors like ABBYY FineReader and Google Cloud Vision in terms of cost, capabilities, and accuracy.
Entering the Fourth Dimension of OCR with Tesseract
1. Entering the Fourth Dimension of OCR with Tesseract
Hanno Embregts @hannotify
3. What is OCR?
/ first dimension
4. Atechnology that can take written
words and convert then aes mo
aulerseadans Yom, proved meyren ne
fgntrom sng tne cortec cars ometimes,
ate fant gait sae and ptch, dar engugn
onthe pase, ana Your Bropared to
spend Several cantinas covtecing al me
one PrP areee e SP es ts anine Oe tat
came autas zersee
7. A technology that can take written words and convert them back into computer-readable form, provided they're in the right font, using the correct colors sometimes, at the right point size and pitch, dark enough on the paper, and you're prepared to spend several centuries correcting all the ones that came out as l's, all the O's that came out as zeroes, and all the colons that come out like semicolons.
8. A Proper Definition
Optical character recognition (...) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (...) or from subtitle text superimposed on an image.
19. Number plate recognition
Get your speeding ticket even faster!
22. Tesseract
Development started at Hewlett-Packard in 1985
Ported to Windows in 1996
Released as open source in 2005
Google has sponsored development of Tesseract since 2006
https://github.com/tesseract-ocr/tesseract
23. "The core feature, text recognition, is drastically better than anything else I've tried from the Open Source community."
Anthony Kay in "Linux Journal", July 2007
29. Tess4J features
PDF input
Multi-page TIFF input
Image optimization
https://github.com/nguyenq/tess4j
30. Demo
Install Tesseract (for multilanguage support)
Add Tess4J dependency
Convert image to plain text (English)
Convert image to plain text (Greek)
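The conversion steps above can also be tried without Tess4J on the classpath. The following is a minimal sketch that calls the installed `tesseract` command-line tool from Java instead of going through the Tess4J wrapper used in the demo. It assumes the `tesseract` binary is on the PATH and that the language data for `eng` (English) and `ell` (Greek) is installed. The class name `TesseractCli` and its method names are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

final class TesseractCli {

    // Builds the command line: tesseract <image> stdout -l <language>
    // ("stdout" as the output base makes Tesseract print the text instead of writing a file).
    static List<String> buildCommand(Path image, String language) {
        return List.of("tesseract", image.toString(), "stdout", "-l", language);
    }

    // Runs Tesseract on the image and returns the recognized plain text.
    static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, language))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop progress messages
                .start();
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: TesseractCli <image> [language]");
            return;
        }
        // Language defaults to English; pass "ell" as the second argument for Greek.
        String language = args.length > 1 ? args[1] : "eng";
        System.out.println(recognize(Path.of(args[0]), language));
    }
}
```

Switching from English to Greek only changes the language code, which mirrors the last two demo steps.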
31. Choosing the Right Library
/ third dimension
33. ABBYY FineReader
Development started at ABBYY in 1993
Supports 192 languages
20 million users worldwide
Outputs to MS Office, RTF, HTML, (searchable) PDF and plain text
https://www.abbyy.com/en-eu/finereader
34. Google Cloud Vision API
Launched in 2016 by Google
Supports 56 languages
Outputs to JSON
Integrates nicely with Google Images and Google SafeSearch
https://cloud.google.com/vision/
41. Comparison
                         ABBYY                GCV                                Tesseract
costs                    $200 per computer    $1.50 per 1000 images, per month   $0
languages                192                  56                                 102
Java integration         through SDK          through REST API                   through JNA wrapper
handwriting recognition  'handprinted' text   supported                          not supported
custom training          supported            not supported                      supported
accuracy                 9/10                 8/10                               7/10
55. Reporting confidence
Tess4J supports two return types:
String (containing the OCR'ed text)
List<OCRResult> (OCR result is written to a file)
  int confidence
  List<Word> words
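One way to use per-word confidence is to flag uncertain words for manual review. The sketch below illustrates that idea only. `RecognizedWord` is a hypothetical stand-in for a recognized word and is not the Tess4J `Word` class, and the 0 to 100 confidence scale and the threshold of 60 are assumptions made for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ConfidenceReport {

    // Hypothetical stand-in for one recognized word and its confidence (assumed 0-100).
    static final class RecognizedWord {
        final String text;
        final float confidence;

        RecognizedWord(String text, float confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Returns the words whose confidence falls below the given threshold.
    static List<RecognizedWord> needsReview(List<RecognizedWord> words, float threshold) {
        return words.stream()
                .filter(word -> word.confidence < threshold)
                .collect(Collectors.toList());
    }

    // Returns the mean confidence over all words, or 0 for an empty list.
    static double averageConfidence(List<RecognizedWord> words) {
        return words.stream()
                .mapToDouble(word -> word.confidence)
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        List<RecognizedWord> words = List.of(
                new RecognizedWord("Tesseract", 96f),
                new RecognizedWord("0CR", 41f));
        for (RecognizedWord word : needsReview(words, 60f)) {
            System.out.println("review: " + word.text); // prints "review: 0CR"
        }
        System.out.println(averageConfidence(words)); // prints 68.5
    }
}
```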
56. Multiple languages in a single document
Concatenate the language codes and separate them by a plus sign:
tesseract.setLanguage("eng+nld");
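When the set of languages is only known at runtime, the plus-separated string can be built from a list. This is a small sketch of that step. The helper name `combine` is hypothetical, and its result is meant to be passed to `tesseract.setLanguage(...)` as on the slide.

```java
import java.util.List;

final class LanguageCodes {

    // Joins language codes with a plus sign, e.g. ["eng", "nld"] becomes "eng+nld".
    static String combine(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        System.out.println(combine(List.of("eng", "nld"))); // prints "eng+nld"
    }
}
```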
62. Image optimization
Tess4J is bundled with the ImageHelper class, which contains a few image optimization tricks.
convertImageToBinary(BufferedImage image)
convertImageToGrayscale(BufferedImage image)
invertImageColor(BufferedImage image)
rotateImage(BufferedImage image, double angle)
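To make two of these tricks concrete, here is a rough sketch of what color inversion and binarization do to the pixels, written with plain `java.awt.image` classes. It is not the ImageHelper implementation. The class `ImageTweaks`, its method names, the luminance weights and the fixed threshold are assumptions made for illustration.

```java
import java.awt.image.BufferedImage;

final class ImageTweaks {

    // Inverts every pixel, so light text on a dark background becomes dark on light.
    static BufferedImage invert(BufferedImage source) {
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                // Flip the red, green and blue bits; alpha is ignored for TYPE_INT_RGB.
                result.setRGB(x, y, ~source.getRGB(x, y) & 0xFFFFFF);
            }
        }
        return result;
    }

    // Makes each pixel pure black or pure white by comparing its luminance
    // (0-255, weighted sum of the color channels) with the threshold.
    static BufferedImage binarize(BufferedImage source, int threshold) {
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                int luminance = (299 * red + 587 * green + 114 * blue) / 1000;
                result.setRGB(x, y, luminance < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return result;
    }
}
```

A fixed threshold is the simplest choice. Scans with uneven lighting usually need something adaptive, which is one reason to try the bundled helpers first.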
63. Still having problems?
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
70. Custom training
Fine-tune (e.g. for an unusual font)
Cut off the top layer (e.g. for a new language)
Retrain from scratch (e.g. don't do this!)
75. Further reading
"An Overview of the Tesseract OCR Engine" by Ray Smith (https://research.google.com/pubs/archive/33418.pdf)
Useful resources
Tesseract on GitHub (https://github.com/tesseract-ocr/tesseract)
Try Tesseract online (newocr.com)