Spelling error detection and correction in taxonomic databases

Fri May 24 14:10:01 CDT 2002

Dear taxacom,

I'm an MPhil/PhD student at the University of Southampton, UK, working on
computational techniques to detect (and possibly correct) "bad data" in
taxonomic databases. These techniques fall into three approaches,
targeting structural, contextual and spelling errors.

I'm working with several taxonomic databases, of which the most important are:

* Species 2000 - 51,918 "unique names"
* ILDIS - 15,616 "unique names"
* Northeast of Brazil Plants Checklist - 7,691 "unique names"
* Atlantic Rain Forest (Brazil) - 1,802 "unique names"

I'm focussing on the spelling-error approach at the moment, and I'm
wondering whether anyone is working on a similar or related approach, so
that we can share our experiences.

I would also like to know whether any taxacom members who maintain
taxonomic databases would like to have them checked by the tools I'm
working on. These tools use different algorithms to generate a list of
"suspect pairs of names" that could be spelling errors.

Here are some examples found in the databases listed above:

* Spirodela polyrhiza
  Spirodela polyrrhiza

* Inga brachystachya
  Inga brachystachys

* Squatina occulta
  Squatina oculata

* Steindachneria argentea
  Steindachnerina argentea

* Tephrosia clementii
  Tephrosia clementis

* Rhipsalis cassutha
  Rhipsalis cassyta

* Epidendrum cinnabarimum
  Epidendrum cinnabarinum

* Fleurya aestuans
  Fleurya aestyans
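
To give a concrete picture of how pairs like these can be flagged, here
is a minimal sketch in Python. It is an illustration only: it assumes a
plain Levenshtein edit distance and a threshold of 2, which are choices
made for this example and not a description of the algorithms used in
the tools mentioned above. The names edit_distance, suspect_pairs and
max_distance are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suspect_pairs(names, max_distance=2):
    """Return pairs of distinct names within max_distance edits.

    The threshold is an assumption for this sketch, not a tuned value.
    """
    unique = sorted(set(names))
    return [(a, b) for a, b in combinations(unique, 2)
            if edit_distance(a, b) <= max_distance]


# A few names taken from the examples in this message.
names = [
    "Spirodela polyrhiza", "Spirodela polyrrhiza",
    "Inga brachystachya", "Inga brachystachys",
    "Squatina occulta", "Squatina oculata",
    "Fleurya aestuans",
]

for a, b in suspect_pairs(names):
    print(a, "|", b)
```

Comparing every name with every other name is quadratic in the number
of names, so for a list the size of Species 2000 (51,918 names, over a
billion pairs) some pre-filtering, for example comparing only names
within the same genus, would probably be needed.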

Thank you in advance for any comments and contributions to my work.

-------------------
Eduardo Dalcin
edalcin at soton.ac.uk
-------------------