[Taxacom] Algorithms for misspelled taxon names (was: the hurdle for all biodiv informatics initiatives)

Mon Mar 1 02:51:01 CST 2010

From: Tony.Rees at csiro.au [mailto:Tony.Rees at csiro.au]
Sent: Mon 1-3-2010 5:55

[...]
So these are the two main areas in which my approach has been used to date. Searching taxonomic literature is another area again, which may indeed produce either a lower or a different pattern of errors; however, once OCR is introduced (e.g. in the Biodiversity Heritage Library and even Adobe/Google's own OCR of PDF documents presented as images), errors start to multiply again, sometimes wildly so depending on the quality of the original, so either my or others' algorithms will definitely be needed in that space as well. [...]

***
Yes, OCR raises a separate set of problems. Each font used may well result in an
idiosyncratic set of errors (the other day I was dealing with a text in which the
letter "d" had consistently been 'recognized' as being "el").
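
To make the "d" read as "el" case concrete, here is a minimal sketch of one way such a font-specific error could be undone: reverse the known confusion at every possible site and keep only the variants that match a list of accepted names. The confusion table, the function names and the sample names are all hypothetical, chosen for illustration; this is not the algorithm discussed earlier in the thread.

```python
from itertools import product

# Hypothetical confusion table: OCR output -> what the printed text likely said.
# In practice this would be built per font or per scanned volume.
OCR_CONFUSIONS = {"el": "d"}


def candidate_corrections(word, confusions=OCR_CONFUSIONS):
    """Return all strings obtained by reversing any subset of confusion sites."""
    results = {word}
    for wrong, right in confusions.items():
        expanded = set()
        for candidate in results:
            parts = candidate.split(wrong)
            # Each gap between parts either stays as `wrong` or becomes `right`.
            for choice in product((wrong, right), repeat=len(parts) - 1):
                rebuilt = parts[0]
                for sep, part in zip(choice, parts[1:]):
                    rebuilt += sep + part
                expanded.add(rebuilt)
        results = expanded
    return results


def correct_against(word, known_names, confusions=OCR_CONFUSIONS):
    """Keep only the candidate corrections that appear in the reference list."""
    return sorted(candidate_corrections(word, confusions) & set(known_names))


if __name__ == "__main__":
    known = ["Rhododendron", "Bellis"]  # illustrative reference list
    print(correct_against("Rhoeloelenelron", known))
    print(correct_against("Bellis", known))
```

Checking against a reference list matters because a blanket replacement of "el" by "d" would damage names in which "el" is genuine, such as Bellis. The number of candidates doubles with each confusion site, so this brute-force version only suits short strings such as single names.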

In the early days of IPNI I fairly often encountered scanning errors (not only of names, 
but also of page numbers, etc); these were reduced by what must have been a rather 
massive effort. Probably not completely eliminated, even now. 

Paul