OCR correction pitch

From pro-iBiosphere Wiki

The pitch

OCR correction is a long-standing problem, especially given the huge amount of literature being made available by BHL. A simple way to correct errors in text would be useful. The whole text doesn't need to be corrected; often just a few characters need to be edited to correct a taxonomic name, museum code, collection date, etc.


Examples of variation in a single name due to OCR errors:

  • Eleutherodactylus
  • Eleutheroclactylus
  • Eleuthewdactyliis
  • Eleiitherodactylus
  • Eleuthewdactylus
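
Because each variant differs from the correct name by only a few characters, approximate string matching against a list of known names can already suggest the right correction. The following is a minimal sketch using Python's standard difflib module; the list KNOWN_NAMES, the function name suggest_name, and the cutoff of 0.75 are assumptions made for illustration, not part of any existing tool.

```python
from difflib import get_close_matches

# Hypothetical reference list; a real tool would load a full taxonomic name index.
KNOWN_NAMES = ["Eleutherodactylus", "Pristimantis", "Craugastor"]

def suggest_name(ocr_word, known_names=KNOWN_NAMES, cutoff=0.75):
    """Return the closest known name for an OCR'd word, or None if nothing is close enough."""
    matches = get_close_matches(ocr_word, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# The OCR variants listed above.
variants = [
    "Eleutheroclactylus",
    "Eleuthewdactyliis",
    "Eleiitherodactylus",
    "Eleuthewdactylus",
]

for variant in variants:
    print(variant, "->", suggest_name(variant))
```

With this small list, all four variants resolve to Eleutherodactylus. A suggestion like this would still need a person to confirm it, which is where the editing interface comes in.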

There are lots of ways to approach the problem of OCR markup correction, but I think the following are desirable:

  • Simple user interface, ideally WYSIWYG
  • Use as much browser technology as we can, e.g. the "contenteditable" attribute
  • Allow multiple people to edit
  • Ability to work offline (e.g., can we sit on a plane and correct text, then sync when we get back online?)

The ability to sync data on and offline suggests a database like CouchDB, and its browser-based equivalent PouchDB. One implementation would be to display the page text in HTML, laid out as closely as possible to the original text. Each line is given the "contenteditable" attribute so a user can edit the text; when the user leaves that line, its contents are saved to the database.
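
To make the line-by-line idea concrete, here is a rough sketch of what one stored line might look like and what could happen when the user leaves it. It is written in Python with a plain dictionary standing in for the database; the field names, the "page-123" identifier, and the functions make_line_doc and on_leave_line are all hypothetical, and a browser version would do the equivalent in JavaScript against PouchDB.

```python
import json
from datetime import datetime, timezone

def make_line_doc(page_id, line_no, ocr_text):
    """Build one JSON-style document per line of OCR text (field names are illustrative)."""
    return {
        "_id": f"{page_id}/line/{line_no:04d}",
        "page": page_id,
        "line": line_no,
        "ocr": ocr_text,   # original OCR output, never overwritten
        "text": ocr_text,  # current, possibly corrected text
        "edits": [],       # history of corrections made to this line
    }

def on_leave_line(store, doc_id, new_text, user):
    """Save the line when the user leaves it; skip the write if nothing changed."""
    doc = store[doc_id]
    if new_text == doc["text"]:
        return False
    doc["edits"].append({
        "user": user,
        "from": doc["text"],
        "to": new_text,
        "time": datetime.now(timezone.utc).isoformat(),
    })
    doc["text"] = new_text
    return True

# A dict stands in for the database in this sketch.
store = {}
doc = make_line_doc("page-123", 7, "Eleuthewdactylus planirostris")
store[doc["_id"]] = doc
on_leave_line(store, doc["_id"], "Eleutherodactylus planirostris", "alice")
print(json.dumps(store[doc["_id"]], indent=2))
```

Keeping the original OCR string alongside the corrected text means a bad edit can always be undone, and the list of edits records who changed what.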

The use of CouchDB makes some design decisions simple. The data is stored as JSON, and it can be placed on a public server in the cloud so that anyone (such as BHL) could have the edited text. Users could edit the text offline in their browser and store those edits in PouchDB, eventually syncing them with the central database.
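
The offline-then-sync workflow can be illustrated with a deliberately simplified model. CouchDB and PouchDB have their own replication and conflict handling, so nothing below is their API; this Python sketch only shows the underlying idea, assuming each document carries an integer revision and a local "dirty" flag (both hypothetical names, as is push_edits).

```python
def push_edits(local, remote):
    """Push locally edited docs to the central store.

    Simplified model: a local edit is accepted only if it was based on the
    revision the server currently holds; otherwise it is reported as a
    conflict for a person to resolve.
    """
    pushed, conflicts = [], []
    for doc_id, doc in local.items():
        if not doc.get("dirty"):
            continue  # nothing edited offline for this line
        server = remote.get(doc_id)
        server_rev = server["rev"] if server else 0
        if doc["rev"] == server_rev:
            remote[doc_id] = {"text": doc["text"], "rev": server_rev + 1}
            doc["rev"] = server_rev + 1
            doc["dirty"] = False
            pushed.append(doc_id)
        else:
            conflicts.append(doc_id)
    return pushed, conflicts

# Central database: line 2 was already corrected by someone else (rev 2).
remote = {
    "p1/0001": {"text": "Eleuthewdactylus", "rev": 1},
    "p1/0002": {"text": "Mus. Comp. Zool.", "rev": 2},
}
# Edits made offline, both based on rev 1.
local = {
    "p1/0001": {"text": "Eleutherodactylus", "rev": 1, "dirty": True},
    "p1/0002": {"text": "Mus. Comp. Zoology", "rev": 1, "dirty": True},
}

pushed, conflicts = push_edits(local, remote)
print("pushed:", pushed)
print("conflicts:", conflicts)
```

Here the first line syncs cleanly, while the second is flagged because the central copy changed while the user was offline. Since edits are stored per line, such conflicts should be limited to cases where two people corrected the same line.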


Peter Hovenkamp: A lot of this sort of technology has already been developed by the Distributed Proofreaders project (http://www.pgdp.net/c/).

David Shorthouse: Could think about using PDF.js (https://github.com/mozilla/pdf.js) for this as well. Example from GN: http://demo.globalnames.org/reconciler?token=wN6HUIvWQuG1JdEVdaJ7cw

Marko Tähtinen: I added to the scope of the use case the need to check specimen labels from OCR for matches in a limited dictionary of common attributes, e.g. collector's name, country, etc. I created a simple REST/JSON service that takes a word from OCR as input and returns the attribute type and a possible correction. Example input: Latxia; output: [{"ocrCorrectedString":"countryFixed: Latvia"}]
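
The lookup behind such a service could be sketched roughly as follows. This is not the service described above, whose implementation is not shown here; it is an illustrative Python sketch using the standard difflib module. The dictionaries, the function name correct_label_word, the cutoff, and the "collectorFixed" prefix are assumptions; only the "ocrCorrectedString" key and the "countryFixed" prefix come from the example output.

```python
import json
from difflib import get_close_matches

# Hypothetical limited dictionaries, keyed by attribute type.
DICTIONARIES = {
    "country": ["Latvia", "Lithuania", "Estonia", "Finland"],
    "collector": ["Linnaeus", "Darwin", "Wallace"],
}

def correct_label_word(ocr_word, cutoff=0.75):
    """Return a JSON string listing possible corrections, tagged by attribute type."""
    results = []
    for attribute, words in DICTIONARIES.items():
        # Take at most one best match per attribute dictionary.
        for match in get_close_matches(ocr_word, words, n=1, cutoff=cutoff):
            results.append({"ocrCorrectedString": f"{attribute}Fixed: {match}"})
    return json.dumps(results)

print(correct_label_word("Latxia"))
```

For the input Latxia this returns a single country correction to Latvia, matching the example above; a word that is not close to any dictionary entry yields an empty list.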

Starting points


Very crude demo showing capture of edits.

  • demo on bionames [1]

What would be the outcome?

A usable tool to correct OCR errors and, just as importantly, to make those corrections available to anyone.