Tesseract OCR Engine
What it is, where it came from, where it is going.
Ray Smith, Google Inc
OSCON 2007
Contents
•   Introduction & history of OCR
•   Tesseract architecture & methods
•   Announcing Tesseract 2.00
•   Training Tesseract
•   Future enhancements
A Brief History of OCR
• What is Optical Character Recognition?
  [Figure: OCR applied to a sample of printed text. The sample reads: "My invention relates to statistical machines of the type in which successive comparisons are made between a character and a charac-" (cut off in the slide).]
A Brief History of OCR
• OCR predates electronic computers!
  [Figure: US Patent 1915993, Filed Apr 27, 1931]
A Brief History of OCR
•   1929 – Digit recognition machine
•   1953 – Alphanumeric recognition machine
•   1965 – US Mail sorting
•   1965 – British banking system
•   1976 – Kurzweil reading machine
•   1985 – Hardware-assisted PC software
•   1988 – Software-only PC software
•   1994-2000 – Industry consolidation
Tesseract Background
• Developed on HP-UX at HP between 1985
  and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS
  in the 1995 UNLV test.
 (See http://www.isri.unlv.edu/downloads/AT-1995.pdf )
• Never used in an HP product.
• Open sourced in 2005. Now on:
  http://code.google.com/p/tesseract-ocr
• Highly portable.
Tesseract OCR Architecture
[Diagram: processing pipeline]
1. Input: Gray or Color Image [+ Region Polygons]
2. Adaptive Thresholding -> Binary Image
3. Connected Component Analysis -> Character Outlines
4. Find Text Lines and Words -> Character Outlines Organized Into Words
5. Recognize Word Pass 1
6. Recognize Word Pass 2
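The connected component step in the architecture above can be illustrated with a minimal sketch. It assumes the binary image is a list of rows with 1 for ink and 0 for background, uses 8-connectivity, and returns only bounding boxes rather than the character outlines the slide mentions. The function name and data layout are hypothetical and are not taken from the Tesseract source.

```python
from collections import deque


def connected_components(binary):
    """Return one bounding box (left, top, right, bottom) per ink blob.

    Assumption: binary is a list of equal-length rows, 1 = ink, 0 = background.
    Pixels touching horizontally, vertically or diagonally share a blob.
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes
```

Boxes come back in row-major order of the first pixel found, which is enough for a later line-finding step to group them.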
Adaptive Thresholding is Essential
[Figure: some examples of how difficult it can be to make a binary image, taken from the UNLV Magazine set (http://www.isri.unlv.edu/ISRI/OCRtk)]
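To show why a single global threshold struggles on unevenly lit pages, here is a minimal sketch of a generic local-mean threshold. The slides do not say which method Tesseract uses, so this is a swapped-in illustration. The function name, the `radius` and `bias` parameters, and the list-of-rows image layout are all assumptions.

```python
def local_mean_threshold(gray, radius=1, bias=10):
    """Binarize a grayscale image against the mean of each pixel's neighbourhood.

    Assumption: gray is a list of equal-length rows of ints in 0..255.
    A pixel becomes ink (1) when it is more than `bias` levels darker than
    the mean of the (2*radius+1)-wide window around it, clipped at the edges.
    """
    height, width = len(gray), len(gray[0])
    # Integral image: integral[y][x] holds the sum of gray[0:y][0:x].
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            area = (y1 - y0) * (x1 - x0)
            # Integer form of: gray[y][x] < total / area - bias
            if gray[y][x] * area < total - bias * area:
                binary[y][x] = 1
    return binary
```

On a row whose background fades from 220 to 85 with dark strokes at 110 and 35, this marks only the two strokes, whereas any global threshold high enough to catch the 110 stroke also marks the darker background as ink.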
Baselines are rarely perfectly straight

• Text Line Finding – skew independent –
  published at ICDAR’95 Montreal.
  (http://scholar.google.com/scholar?q=skew+detection+smith)
• Baselines are approximated by quadratic splines
  to account for skew and curl.
• Meanline, ascender and descender lines are a
  constant displacement from baseline.
• Critical value is the x-height.
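As a rough illustration of the baseline model above, the sketch below fits a single quadratic (not a full spline) through the bottom points of the blobs on a line, then derives the meanline as a constant displacement. It assumes image coordinates with y increasing downward and uses exact rational arithmetic to sidestep numerical issues. All function names are hypothetical.

```python
from fractions import Fraction


def fit_quadratic(points):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c) as floats.

    Assumption: points is a list of (x, y) pairs with at least three distinct x.
    """
    pts = [(Fraction(x), Fraction(y)) for x, y in points]
    s = [sum(x ** k for x, _ in pts) for k in range(5)]      # sums of x^k
    t = [sum(y * x ** k for x, y in pts) for k in range(3)]  # sums of y*x^k
    # Normal equations for the unknowns (c, b, a), as an augmented matrix.
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination; exact because every entry is a Fraction.
    for col in range(3):
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    return float(a), float(b), float(c)


def displaced_line(coeffs, offset):
    """A line at a constant vertical displacement from the fitted baseline."""
    a, b, c = coeffs
    return a, b, c + offset


def evaluate(coeffs, x):
    """Height of the curve at horizontal position x."""
    a, b, c = coeffs
    return a * x * x + b * x + c
```

With an x-height of 20 pixels, `displaced_line(baseline, -20)` gives the meanline under the y-down assumption; ascender and descender lines would use other offsets.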
Spaces between words are tricky too

 • Italics, digits, punctuation all create
   special-case font-dependent spacing.
 • Fully justified text in narrow columns can
   have vastly varying spacing on different
   lines.
Tesseract: Recognize Word
[Diagram: Character Chopper -> Character Associator -> Done? (No: continue; Yes: Adapt to Word).
 Supporting components: Static Character Classifier, Dictionary, Adaptive Character Classifier, Number Parser.]
Outline Approximation
[Figure panels: Original Image, Outlines of components, Polygonal Approximation]
Polygonal approximation is a double-edged sword.
Noise and some pertinent information are both lost.
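The trade-off can be seen in a small sketch. The slides do not name the approximation algorithm, so this swaps in the well-known Ramer-Douglas-Peucker method on an open polyline. The function names and the `tolerance` parameter are assumptions, not Tesseract's own.

```python
def perpendicular_distance(point, start, end):
    """Distance from point to the straight line through start and end."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    dx, dy = bx - ax, by - ay
    length = (dx * dx + dy * dy) ** 0.5
    if length == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker: drop points closer than tolerance to the chord."""
    if len(points) < 3:
        return list(points)
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > worst:
            index, worst = i, d
    if worst <= tolerance:
        return [points[0], points[-1]]
    # Keep the farthest point and simplify the two halves independently.
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right


outline = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4)]
coarse = simplify(outline, 1.5)  # [(0, 0), (4, 0), (4, 4)]
fine = simplify(outline, 0.5)    # [(0, 0), (2, 1), (4, 0), (4, 4)]
```

At the coarse tolerance the one-pixel bump at (2, 1) disappears. That is welcome if the bump was noise and harmful if it was part of the character shape.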
Tesseract: Features and Matching
[Figure panels: Prototype, Character to classify, Extracted Features, Match of Prototype To Features, Match of Features To Prototype]

• Static classifier uses outline fragments as
  features. Broken characters are easily
  recognizable by a small->large matching
  process in classifier. (This is slow.)
• Adaptive classifier uses the same technique!
  (Apart from normalization method.)
Announcing tesseract-2.00
• Fully Unicode (UTF-8) capable
• Already trained for 6 Latin-based
  languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and documented process to train at
  http://code.google.com/p/tesseract-ocr
• UNLV regression test framework
• Other minor fixes
Training Tesseract
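[Diagram: training data flow into the Tesseract Data Files]
• Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg (alongside User-words)
• Training page images -> Tesseract -> Character Features (*.tr files)
• Character Features (*.tr files) -> mfTraining -> inttemp, pffmtable
• Character Features (*.tr files) -> cnTraining -> normproto
• Tesseract + manual correction -> Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset
• Manual Data Entry -> DangAmbigs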
Tesseract Dictionaries
[Diagram: word lists into the Tesseract Data Files]
• User-words (usually empty)
• Word List -> Infrequent Word List -> Wordlist2dawg -> Word-dawg
• Word List -> Frequent Word List -> Wordlist2dawg -> Freq-dawg
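For readers unfamiliar with the term, a DAWG (directed acyclic word graph) stores a word list as a trie whose identical subtrees are shared. The sketch below builds one by merging equivalent trie nodes bottom-up. It only illustrates the data structure and is not the algorithm or file format used by Wordlist2dawg. All names are hypothetical.

```python
class Node:
    """One state in the word graph."""
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}     # letter -> child Node
        self.final = False  # True if a word ends at this state


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for letter in word:
            node = node.edges.setdefault(letter, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up; returns the canonical node."""
    for letter, child in list(node.edges.items()):
        node.edges[letter] = minimize(child, registry)
    # Two nodes are equivalent if they agree on finality and on their
    # (letter, canonical child) pairs. Canonical children stay alive in
    # the registry, so their id() values are stable keys.
    key = (node.final,
           tuple(sorted((letter, id(child)) for letter, child in node.edges.items())))
    return registry.setdefault(key, node)


def contains(root, word):
    node = root
    for letter in word:
        node = node.edges.get(letter)
        if node is None:
            return False
    return node.final


def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)
```

For the words tap, taps, top and tops, the trie has 8 nodes and the minimized graph has 5, because the "ap(s)" and "op(s)" branches share their endings.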
Tesseract Shape Data
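[Diagram: shape training data flow into the Tesseract Data Files]
• Training page images -> Tesseract -> Character Features (*.tr files)
• Tesseract + manual correction -> Box files
• Character Features (*.tr files) -> mfTraining -> inttemp (Prototype Shape Features), pffmtable (Expected Feature Counts)
• Character Features (*.tr files) -> cnTraining -> normproto (Character Normalization Features)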
Tesseract Character Data
[Diagram: character set data flow into the Tesseract Data Files]
• Training page images -> Tesseract + manual correction -> Box files
• Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset (List of Characters + ctype information)
• Manual Data Entry -> DangAmbigs (typical OCR errors, e.g. e<->c, rn<->m)
Accuracy Results
Comparison of current results against 1995 UNLV results
(Change = percentage change in error count relative to the 1995 result; negative means fewer errors.)

                    Character                      Non-stopword
Testid   Testset    Errors   Accuracy   Change     Errors   Accuracy   Change
1995     Bus.3B       5959   98.14%                  1293   95.73%
1995     Doe3.3B     36349   97.52%                  7042   94.87%
1995     Mag.3B      15043   97.74%                  3379   94.99%
1995     News.3B      6432   98.69%                  1502   96.94%
Gcc4.1   Bus.3B       6258   98.04%       5.02%      1312   95.67%       1.47%
Gcc4.1   Doe3.3B     28589   98.05%     -21.35%      6692   95.12%      -4.97%
Gcc4.1   Mag.3B      14800   97.78%      -1.62%      3123   95.37%      -7.58%
Gcc4.1   News.3B      7524   98.47%      16.98%      1220   97.51%     -18.77%
Gcc4.1   Total       57171              -10.37%     12347               -6.58%
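The Change column can be reproduced from the error counts alone. The short sketch below recomputes it for the character errors, taking the change as the difference in errors relative to the 1995 count. The variable and function names are illustrative only.

```python
# Character error counts copied from the table above.
errors_1995 = {"Bus.3B": 5959, "Doe3.3B": 36349, "Mag.3B": 15043, "News.3B": 6432}
errors_gcc41 = {"Bus.3B": 6258, "Doe3.3B": 28589, "Mag.3B": 14800, "News.3B": 7524}


def percent_change(old_errors, new_errors):
    """Percentage change in error count; negative means fewer errors."""
    return 100.0 * (new_errors - old_errors) / old_errors


changes = {
    testset: round(percent_change(errors_1995[testset], errors_gcc41[testset]), 2)
    for testset in errors_1995
}
total_change = round(
    percent_change(sum(errors_1995.values()), sum(errors_gcc41.values())), 2
)

for testset, change in changes.items():
    print(f"{testset:8s} {change:+.2f}%")
print(f"{'Total':8s} {total_change:+.2f}%")
```

The same function applied to the non-stopword error counts gives the right-hand Change column.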
Commercial OCR v Tesseract
Commercial OCR                           Tesseract
• 100+ languages.                        • 6 languages + growing.
• Accuracy is good now.                  • Accuracy was good in 1995.
• Sophisticated app with complex UI.     • No UI yet.
• Works on complex magazine pages.       • Page layout analysis coming soon.
• Windows Mostly.                        • Runs on Linux, Mac, Windows, more...
• Costs $130-$500                        • Open source – Free!
Tesseract Future

•   Page layout analysis.
•   More languages.
•   Improve accuracy.
•   Add a UI.
The End
• For more information see:
  http://code.google.com/p/tesseract-ocr

 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 

Os Raysmith

  • 1. Tesseract OCR Engine What it is, where it came from, where it is going. Ray Smith, Google Inc OSCON 2007
  • 2. Contents • Introduction & history of OCR • Tesseract architecture & methods • Announcing Tesseract 2.00 • Training Tesseract • Future enhancements
  • 3. A Brief History of OCR • What is Optical Character Recognition? OCR My invention relates to statistical machines of the type in which successive comparisons are made between a character and a charac-
  • 4. A Brief History of OCR • OCR predates electronic computers! US Patent 1915993, Filed Apr 27, 1931
  • 5. A Brief History of OCR • 1929 – Digit recognition machine • 1953 – Alphanumeric recognition machine • 1965 – US Mail sorting • 1965 – British banking system • 1976 – Kurzweil reading machine • 1985 – Hardware-assisted PC software • 1988 – Software-only PC software • 1994-2000 – Industry consolidation
  • 6. Tesseract Background • Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner. • Came neck and neck with Caere and XIS in the 1995 UNLV test. (See http://www.isri.unlv.edu/downloads/AT-1995.pdf ) • Never used in an HP product. • Open sourced in 2005. Now on: http://code.google.com/p/tesseract-ocr • Highly portable.
  • 7. Tesseract OCR Architecture • Input: Gray or Color Image [+ Region Polygons] -> Adaptive Thresholding -> Binary Image -> Connected Component Analysis -> Character Outlines -> Find Text Lines and Words -> Character Outlines Organized Into Words -> Recognize Word Pass 1 -> Recognize Word Pass 2
  • 8. Adaptive Thresholding is Essential Some examples of how difficult it can be to make a binary image Taken from the UNLV Magazine set. (http://www.isri.unlv.edu/ISRI/OCRtk )
  • 9. Baselines are rarely perfectly straight • Text Line Finding – skew independent – published at ICDAR’95 Montreal. (http://scholar.google.com/scholar?q=skew+detection+smith) • Baselines are approximated by quadratic splines to account for skew and curl. • Meanline, ascender and descender lines are a constant displacement from baseline. • Critical value is the x-height.
  • 10. Spaces between words are tricky too • Italics, digits, punctuation all create special-case font-dependent spacing. • Fully justified text in narrow columns can have vastly varying spacing on different lines.
  • 11. Tesseract: Recognize Word • Diagram components: Character Chopper, Character Associator, Static Character Classifier, Adaptive Character Classifier, Dictionary, Number Parser • Decision: Done? (No / Yes); the Yes branch leads to Adapt to Word
  • 12. Outline Approximation • Figure panels: Original Image; Outlines of components; Polygonal Approximation • Polygonal approximation is a double-edged sword. Noise and some pertinent information are both lost.
  • 13. Tesseract: Features and Matching • Figure panels: Prototype; Character to classify; Extracted Features; Match of Prototype To Features; Match of Features To Prototype • Static classifier uses outline fragments as features. Broken characters are easily recognizable by a small->large matching process in classifier. (This is slow.) • Adaptive classifier uses the same technique! (Apart from normalization method.)
  • 14. Announcing tesseract-2.00 • Fully Unicode (UTF-8) capable • Already trained for 6 Latin-based languages (Eng, Fra, Ita, Deu, Spa, Nld) • Code and documented process to train at http://code.google.com/p/tesseract-ocr • UNLV regression test framework • Other minor fixes
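  • 15. Training Tesseract • Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg • User-words • Training page images -> Tesseract + manual correction -> Box files -> Tesseract -> Character Features (*.tr files) • Character Features -> mfTraining -> inttemp, pffmtable • Character Features -> cnTraining -> normproto • Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset • Manual Data Entry -> DangAmbigs • Tesseract Data Files: User-words, Word-dawg, Freq-dawg, inttemp, pffmtable, normproto, unicharset, DangAmbigs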
  • 16. Tesseract Dictionaries • Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg • Tesseract Data Files: User-words (Usually Empty); Word-dawg (Infrequent Word List); Freq-dawg (Frequent Word List)
  • 17. Tesseract Shape Data • Training page images -> Tesseract + manual correction -> Box files -> Tesseract -> Character Features (*.tr files) • Character Features -> mfTraining -> inttemp, pffmtable • Character Features -> cnTraining -> normproto • Tesseract Data Files: inttemp (Prototype Shape Features); pffmtable (Expected Feature Counts); normproto (Character Normalization Features)
  • 18. Tesseract Character Data • Training page images -> Tesseract + manual correction -> Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset • Manual Data Entry -> DangAmbigs • Tesseract Data Files: unicharset (List of Characters + ctype information); DangAmbigs (Typical OCR errors, e.g. e<->c, rn<->m, etc.)
  • 19. Accuracy Results • Comparison of current results against 1995 UNLV results
      Testid  Testset   Character                      Non-stopword
                        Errors  Accuracy  Change       Errors  Accuracy  Change
      1995    Bus.3B      5959   98.14%                  1293   95.73%
      1995    Doe3.3B    36349   97.52%                  7042   94.87%
      1995    Mag.3B     15043   97.74%                  3379   94.99%
      1995    News.3B     6432   98.69%                  1502   96.94%
      Gcc4.1  Bus.3B      6258   98.04%     5.02%        1312   95.67%     1.47%
      Gcc4.1  Doe3.3B    28589   98.05%   -21.35%        6692   95.12%    -4.97%
      Gcc4.1  Mag.3B     14800   97.78%    -1.62%        3123   95.37%    -7.58%
      Gcc4.1  News.3B     7524   98.47%    16.98%        1220   97.51%   -18.77%
      Gcc4.1  Total      57171            -10.37%       12347             -6.58%
  • 20. Commercial OCR v Tesseract • Commercial OCR: 100+ languages. Accuracy is good now. Sophisticated app with complex UI. Works on complex magazine pages. Windows mostly. Costs $130-$500. • Tesseract: 6 languages + growing. Accuracy was good in 1995. No UI yet. Page layout analysis coming soon. Runs on Linux, Mac, Windows, more... Open source – Free!
  • 21. Tesseract Future • Page layout analysis. • More languages. • Improve accuracy. • Add a UI.
  • 22. The End • For more information see: http://code.google.com/p/tesseract-ocr
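
The Change columns on the Accuracy Results slide (slide 19) are not defined in the transcript. They appear to be the relative change in error count against the 1995 run on the same test set, so a negative value means fewer errors. The sketch below works under that assumption and recomputes the figures from the error counts on the slide; the function names percent_change and change_table are made up for illustration and are not part of Tesseract or the UNLV tools.

```python
# Error counts transcribed from the Accuracy Results slide (slide 19).
# Tuple layout: (character errors, non-stopword errors).
ERRORS_1995 = {
    "Bus.3B": (5959, 1293),
    "Doe3.3B": (36349, 7042),
    "Mag.3B": (15043, 3379),
    "News.3B": (6432, 1502),
}
ERRORS_GCC41 = {
    "Bus.3B": (6258, 1312),
    "Doe3.3B": (28589, 6692),
    "Mag.3B": (14800, 3123),
    "News.3B": (7524, 1220),
}


def percent_change(old, new):
    """Relative change in error count, in percent (negative = fewer errors).

    Assumed formula: 100 * (new - old) / old. The slide does not state it.
    """
    return 100.0 * (new - old) / old


def change_table(old_runs, new_runs):
    """Return {testset: (char_change, word_change)} plus a 'Total' row."""
    table = {}
    for name, (old_char, old_word) in old_runs.items():
        new_char, new_word = new_runs[name]
        table[name] = (
            percent_change(old_char, new_char),
            percent_change(old_word, new_word),
        )
    # The Total row compares summed error counts, not averaged percentages.
    old_char_total = sum(c for c, _ in old_runs.values())
    old_word_total = sum(w for _, w in old_runs.values())
    new_char_total = sum(c for c, _ in new_runs.values())
    new_word_total = sum(w for _, w in new_runs.values())
    table["Total"] = (
        percent_change(old_char_total, new_char_total),
        percent_change(old_word_total, new_word_total),
    )
    return table


if __name__ == "__main__":
    for name, (char_chg, word_chg) in change_table(ERRORS_1995, ERRORS_GCC41).items():
        print(f"{name:8s} {char_chg:+7.2f}% {word_chg:+7.2f}%")
```

Under this reading, the summed character errors fall from 63783 to 57171 and the summed non-stopword errors from 13216 to 12347, which matches the Total row on the slide. The per-set rows show that Bus.3B and News.3B actually gained character errors, while the overall total is pulled down mainly by Doe3.3B.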