Spelling detection and correction in Taxonomic Databases
Doug Yanega
dyanega at POP.UCR.EDU
Tue May 28 10:33:31 CDT 2002
>I'm focussing on the spelling errors approach at this moment and I'm
>wondering if anyone is working in any similar or related approach in
>order to share our experiences.
The *idea* of fully automated name checking is okay, but it is
utterly impractical to use this as a basis for making databasing
decisions, since there are way too many cases where the original
orthography was either technically incorrect or INTENTIONALLY
inappropriate (as are many of the names on my "Curious Scientific
Names" webpage, which is offline at the moment) but validly published
nonetheless - meaning changing the name in the database would
represent an unjustified emendation. How, for example, do you
determine the gender of the genera Batman, Iyaiyai, This, Leonardo,
Heerz, Lalapa, and Verae? These are not Latin names. Will your
spellchecker recognize that the combinations Leonardo davincii, Heerz
lukenatcha, Lalapa lusa, and Verae peculya are all actually correct?
You will NEVER be able to tell, with 100% certainty, whether an
apparently misspelled name in your database *is* misspelled unless
you consult the original descriptions, and that cannot be automated.
If the author of the genus Sphaeropthalma spelled it that way, then
that's the way it should appear in the database, even if it should
have been spelled Sphaerophthalma, and subsequent authors describing
species in that genus used the latter "emended" spelling - but you
have no way of distinguishing if all you're doing is running an
automated spellcheck. In this example, there will be names that
appear fine to the spellchecker, but do not match the original
*valid* spelling. Automation guarantees false negatives, too.
Ultimately, it's a nice concept, but at some point, a human being has
to step in and resolve the matter - using a spellchecking routine is
fine ONLY if you use it to flag certain names as *possible* errors,
and realize that there will be many missed in the process. There is
no way around requiring someone to actually go to the literature, do
the research, find the original spelling, and then make a note in the
database that indicates which one is correct (and also thereafter
excludes those names from triggering the spellchecker).
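
A minimal sketch of that flag-only workflow, assuming Python and its standard difflib module. The function name, the 0.9 similarity threshold, and the sample list are illustrative assumptions, not part of any existing tool. It only reports near-identical names for a person to check, never changes anything, and skips names that someone has already verified against the original description.

```python
from difflib import SequenceMatcher


def flag_possible_errors(names, verified, threshold=0.9):
    """Return {name: [lookalike names]} for a human to review.

    names     -- names currently in the database
    verified  -- names already checked against the original description;
                 these never trigger a flag themselves
    threshold -- hypothetical similarity cutoff (0..1), an assumption
    """
    all_names = sorted(set(names) | set(verified))
    candidates = sorted(set(names) - set(verified))
    flagged = {}
    for name in candidates:
        lookalikes = [
            other
            for other in all_names
            if other != name
            and SequenceMatcher(None, name, other).ratio() >= threshold
        ]
        if lookalikes:
            flagged[name] = lookalikes
    return flagged


# Sample data taken from the names mentioned in this message.
names = ["Sphaeropthalma", "Sphaerophthalma", "Lalapa lusa", "Batman"]

# Before anyone has gone to the literature, both spellings are flagged.
before = flag_possible_errors(names, verified=set())

# Once the original spelling is confirmed, it stops triggering the check;
# the other spelling is still flagged, pointing at the verified one.
after = flag_possible_errors(names, verified={"Sphaeropthalma"})

print(before)
print(after)
```

Note that the routine cannot say which of two similar spellings is the original one; that decision still comes from the literature and is recorded by adding the name to the verified set.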
Peace,
--
Doug Yanega Dept. of Entomology Entomology Research Museum
Univ. of California - Riverside, Riverside, CA 92521
phone: (909) 787-4315 (standard disclaimer: opinions are mine, not UCR's)
http://entmuseum9.ucr.edu/staff/yanega.html
"There are some enterprises in which a careful disorderliness
is the true method" - Herman Melville, Moby Dick, Chap. 82