[Taxacom] Status of OCR for specimen labels

Brad Boyle bboyle at email.arizona.edu
Wed May 27 11:15:19 CDT 2009


Hi all,

The discussion of OCR on a different thread ("Centrally supported
electronic archive") has reminded me to ask: what is the current status
of OCR technology for scanning specimen labels?

I ask because at the University of Arizona Herbarium we finally have a
website that supports images, and are now gearing up for a large-scale
scanning effort. We are also looking for ways to increase the rate of
databasing of our specimens - still painfully slow despite the valiant
efforts of our data entry students.

I have heard rumors of various OCR applications suitable for reading
herbarium specimen labels. However, I have yet to talk to someone who
is actually using such an application on a production scale, as opposed
to developing or tinkering.

If anyone out there is actually using OCR to database specimens, would
you mind answering a few questions?

1) What software do you use?
2) Is the error rate acceptable? That is, is the time spent spotting and
correcting errors still less than the time that would have been spent
databasing the specimen manually?
3) Can the software handle different languages reliably? (We have a lot
of Spanish-language labels in particular.)
4) Can the software reliably recover information from hand-written
labels?

Any other comments or recommendations you might have would be greatly
appreciated.

Best wishes,
Brad

___________________________________________

Brad Boyle, Ph.D.
Research Associate, Biodiversity Informatics

The University of Arizona Herbarium (ARIZ)
Herring Hall
PO Box 210036
1130 E. South Campus Dr.
Tucson, AZ 85721, USA

520-621-7243 (herbarium)
520-626-7860 (office)
bboyle at email.arizona.edu

ARIZ: http://ag.arizona.edu/herbarium
Biodiversity Informatics Initiative (BDII):
   http://loco.biosci.arizona.edu/bdii
___________________________________________





