Spelling detection and correction in Taxonomic Databases

christian thompson cthompson at SEL.BARC.USDA.GOV
Wed May 29 10:54:50 CDT 2002


Thanks, Rich & Doug, for the proper answer to the problems of taxonomic
databases.

Unfortunately, there remain many, especially new ones such as ALL, who
think there is some simple computerized solution to 250 years' worth of
biological names. Hire a computer guru, surf the web, suck in the names,
let the computer crunch them and you will have your Electronic Catalog of
Known Organisms or whatever. Unfortunately, those computer gurus forget
GIGO!

And when we tell them from experience that the only possible way to resolve
this history of names is to re-examine the ORIGINAL literature, we are
ignored as luddites.

Yes, there are short-cuts, such as registration systems which have been
proposed in both zoology and botany, but so long as the scientific community
rejects them, the only acceptable alternative that remains is to do the job
properly, which involves the costly, time-consuming task of re-examining our
history thru its original literature.

The picture, however, is not as bleak as it seems. Take, for example,
Diptera or flies, a group that contains 10% of the known biodiversity. Over
the years we have been building a BioSystematic Database of World Diptera
(see www.diptera.org/biosys.htm). We now have more than 200,000 names in our
Nomenclator and are about 90% finished with capture of names from secondary
sources, such as published catalogs, Neave, Sherborn, ZR, etc. These names
then are being reviewed by specialists, who check them against the original
literature and provide up-to-date taxonomy. Two datasets (fruit flies
(1999), soldier flies (2001)) have been peer-reviewed and published in
traditional format (books). Getting the rest reviewed by specialists will
take only time (25 years) or, given support (about $50K per year), can be
done more quickly (5 years). Our real problem is that we, as a community,
are losing our specialists. But with modest support (we are not seeking
the billions NASA wants for a Mars probe, just the few millions that NSF
is investing in the new Tree of Life program), the traditional
taxonomists can get their act together and deliver a comprehensive catalog
of names.

Oh, well ...

F. Christian Thompson
Systematic Entomology Lab., ARS, USDA
Smithsonian Institution
Washington, D. C. 20560-0169
(202) 382-1800 voice
(202) 786-9422 FAX
cthompso at sel.barc.usda.gov [NB: no terminal "n"]
visit our Diptera site at www.diptera.org


>>> Richard Pyle <deepreef at BISHOPMUSEUM.ORG> 05/28 3:04 PM >>>
> The *idea* of fully automated name checking is okay, but it is
> utterly impractical to use this as a basis for making databasing
> decisions,

I agree...but I think that the point wasn't so much to automatically
*correct* errors, as much as to find potential candidate errors that a
knowledgeable human would make decisions about.  At least, that was my
understanding of the purpose of the tools.

> using a spellchecking routine is
> fine ONLY if you use it to flag certain names as *possible* errors,
> and realize that there will be many missed in the process. There is
> no way around requiring someone to actually go to the literature, do
> the research, find the original spelling, and then make a note in the
> database that indicates which one is correct (and also thereafter
> excludes those names from triggering the spellchecker).

Yes -- exactly.
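
To make the "flag, don't fix" workflow described above concrete, here is
a minimal sketch in Python. It is only an illustration, not anyone's
actual tool: the function name, the similarity threshold, the `verified`
set and the sample names (including the misspelling "Musca domistica")
are all assumptions made up for this example. It uses the standard
library's difflib to flag pairs of similar names as *possible* errors,
and skips pairs in which both names have already been checked against
the original literature.

```python
from difflib import SequenceMatcher
from itertools import combinations


def flag_possible_misspellings(names, verified=frozenset(), threshold=0.9):
    """Return (name_a, name_b, similarity) for pairs worth a human look.

    `verified` is a hypothetical set of names that a specialist has
    already checked against the original literature. A pair is skipped
    when both of its names are verified, so resolved cases stop
    triggering the checker. Nothing is corrected automatically.
    """
    flagged = []
    for a, b in combinations(sorted(set(names)), 2):
        if a in verified and b in verified:
            continue  # already resolved by a person; do not flag again
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity >= threshold:
            flagged.append((a, b, round(similarity, 3)))
    return flagged


# Illustrative input only; the second name is an invented misspelling.
names = ["Musca domestica", "Musca domistica", "Drosophila melanogaster"]

for a, b, similarity in flag_possible_misspellings(names):
    print(f"possible variant spellings: {a!r} / {b!r} ({similarity})")
```

A pairwise comparison like this will miss errors (for instance, a
misspelling with no correctly spelled counterpart in the list) and will
flag valid names that merely look alike, which is why the decision stays
with a person who goes back to the literature.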

Aloha,
Rich



