Online Server Support

Wednesday, 30 August 2006

Announcing Tesseract OCR

Posted on 12:25 by Unknown
Post by Luc Vincent, Uber Tech Lead

We wanted to let you all know that a few months ago we quietly released - or actually re-released - an Optical Character Recognition (OCR) engine into open source. You might wonder why Google is interested in OCR. In a nutshell, we are all about making information available to users, and when that information sits in a paper document, OCR is the process by which we convert the pages of the document into text that can then be indexed.

This particular OCR engine, called Tesseract, was in fact not originally developed at Google! It was developed at Hewlett Packard Laboratories between 1985 and 1995. In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by the University of Nevada, Las Vegas (UNLV). Shortly thereafter, however, HP decided to get out of the OCR business, and Tesseract was left collecting dust in an HP warehouse. Fortunately, a year or two ago some of our esteemed HP colleagues realized that rather than sit on this engine, it would be better for the world if they brought it back to life by open sourcing it, with the help of the Information Science Research Institute (ISRI) at UNLV. UNLV was happy to oblige, but they in turn asked for our help in fixing a few bugs that had crept in since 1995 (ever heard of bit rot?). We tracked down the most obvious ones and decided a couple of months ago that Tesseract OCR was stable enough to be re-released as open source.

A few things to know about Tesseract OCR: for now it only supports the English language, and it does not include a page layout analysis module (yet), so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages. Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other open source OCR package out there. If you know of one that is more accurate, please do tell us!
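Since the engine reportedly struggles with grayscale input, one common workaround in OCR pipelines generally is to convert the page to a two-level (black and white) image before recognition. The sketch below is not part of Tesseract and not something described in this post; it is a minimal, dependency-free illustration of Otsu's thresholding method, and the function names `otsu_threshold` and `binarize` are hypothetical. It assumes the page is already available as rows of 8-bit gray values (0 = black, 255 = white).

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that best separates dark from light.

    Otsu's method picks the t that maximizes the between-class variance
    of the two groups: pixels <= t and pixels > t.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break     # no light pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray_rows):
    """Map each pixel to 0 (dark, likely ink) or 1 (light, likely paper)."""
    flat = [p for row in gray_rows for p in row]
    t = otsu_threshold(flat)
    return [[1 if p > t else 0 for p in row] for row in gray_rows]


if __name__ == "__main__":
    # Tiny made-up "page": three dark pixels and three light ones.
    page = [[30, 40, 200],
            [35, 210, 220]]
    print(binarize(page))
```

A single global threshold like this is only a starting point; pages with uneven lighting usually need a locally adaptive threshold instead.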

We are grateful to all the people at HP who made it possible to release Tesseract into open source, and especially John Burns, who championed and babysat the project. We would also like to thank the original Tesseract development team, a partial list of whom is here. Last but not least, many thanks to our friends at UNLV's ISRI, including Tom Nartker, Kazem Taghva, Julie Borsack and Steve Lumos, for all their help with this project.

By the way, we are also hiring top-notch OCR engineers! See this job posting for more information.