[Taxacom] Algorithms for misspelled taxon names (was: the hurdle for all biodiv informatics initiatives)

Fri Feb 26 03:58:28 CST 2010

From: Tony.Rees at csiro.au [mailto:Tony.Rees at csiro.au]
Sent: Fri 26-2-2010 2:23

I am happy to speak further to this topic, in case there is anyone on the list interested in the issues here.

To take Paul's points in reverse order:

- Misspelled taxon names (excluding authority portion):

Quote: "there is a limited set of variations that are very likely to be found, and it should be possible to deal with these efficiently." (Earlier it was also suggested that a fairly simple algorithm might suffice).

I built such a "simple algorithm" here in 2000-2001 (these days I call it the "Rees 2001 phonetic match" algorithm). Basically it looks for mismatches on single vs. repeated letters (e.g. Nicholsia vs. Nichollsia), dropped or added "h"s (e.g. Coelorynchus vs. Coelorhynchus), some "phonetic" vowel substitutions (e.g. Penaeus vs. Peneus, oe/ae, i/y and a few others), and some "phonetic" consonant substitutions (e.g. c/k, s/z, a few others). In this version, the leading letter of any name was considered to be correct (as per Soundex and similar non-specialised "phonetic" algorithms). This algorithm was first installed in my CAAB database (http://www.cmar.csiro.au/caab/) and later in Aphia in Belgium (ERMS, World Porifera database etc.), and OBIS in the USA. It turns out that this catches around 35-40% of actual errors, whether from user queries or in stored data, with very few false positives (unlike Soundex, which generates a massive number of false hits).
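The kinds of rule listed above can be sketched in a few lines of Python. This is only an illustration: the function name `phonetic_key` and the exact substitutions are assumptions chosen to cover the examples in this message, not the actual Rees 2001 rule set.

```python
import re

def phonetic_key(name: str) -> str:
    """Reduce a name to a rough 'phonetic' key (illustrative rules only)."""
    s = name.strip().lower()
    if not s:
        return s
    # The leading letter is trusted and left untouched, as in the 2001 version.
    first, rest = s[0], s[1:]
    rest = rest.replace("ae", "e").replace("oe", "e")  # vowel digraphs ae/oe
    rest = rest.replace("y", "i")                      # i/y
    rest = rest.replace("k", "c").replace("z", "s")    # c/k, s/z
    rest = rest.replace("h", "")                       # dropped or added "h"
    # Collapse runs of the same letter (single vs. repeated letters).
    return re.sub(r"(.)\1+", r"\1", first + rest)

def phonetic_match(a: str, b: str) -> bool:
    """Two names match 'phonetically' if they reduce to the same key."""
    return phonetic_key(a) == phonetic_key(b)
```

With these assumed rules, `phonetic_match("Nicholsia", "Nichollsia")` and `phonetic_match("Penaeus", "Peneus")` both hold, while unrelated names such as Pinus and Picea keep distinct keys.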

Later, in 2007, I added a few refinements to produce my "Rees 2007 phonetic algorithm". For this I allowed the leading character to vary in certain cases (Pteranodon/Teranodon, Euglena/Uglena etc.), and added a refinement to deal with variant gender endings in species epithets (not genera), e.g. to match Pinus radiata / -us / -um. This algorithm is currently installed in the Euro+Med Plantbase system in Germany. In testing it catches around 50% of errors, the balance being therefore "non-phonetic" (according to this definition) and non-gender errors.
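The two 2007 refinements might look roughly like this in Python. The ending list and the leading-character pairs are assumptions limited to the examples given here (-a/-us/-um, Pt/T, Eu/U), and both function names are hypothetical; a real rule set would need more cases.

```python
# Assumed, deliberately small rule tables (not the actual Rees 2007 rules).
GENDER_ENDINGS = ("us", "um", "a")
LEADING_VARIANTS = {"pt": "t", "eu": "u"}

def normalise_epithet(epithet: str) -> str:
    """Map -us/-um/-a endings of a species epithet onto a single form (-a)."""
    e = epithet.strip().lower()
    for ending in GENDER_ENDINGS:
        # Require a stem of at least three letters so very short words are left alone.
        if e.endswith(ending) and len(e) > len(ending) + 2:
            return e[: -len(ending)] + "a"
    return e

def normalise_leading(genus: str) -> str:
    """Allow the leading character(s) of a genus to vary in the listed cases."""
    g = genus.strip().lower()
    for prefix, replacement in LEADING_VARIANTS.items():
        if g.startswith(prefix):
            return replacement + g[len(prefix):]
    return g
```

Under these assumptions radiata, radiatus and radiatum all normalise to the same string, as do Pteranodon/Teranodon and Euglena/Uglena.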

For the balance of essentially non-phonetic errors, it is interesting to see what these comprise. Some are predictable e.g. transposed characters (e.g. Acropaginula/Arcopaginula) or the same with a character pair or syllable (serrulatus/serratulus). Some are generated by faulty OCR (optical character recognition) e.g. l/t, l/i, o/e, ri/n, rn/m. Some are keystroke errors from hitting a key adjacent to the intended one e.g. b/v, t/y. Some are simply bad characters inserted e.g. ; or * or $ or / for no obvious reason. Many are random missing characters or extra characters inserted anywhere (though rather rarely at the beginning or end). Sometimes a whole syllable is inserted or missing (e.g. triangulum/triangulatum).

In the end I decided it was not worth trying to cope with all of these cases separately, since virtually all of them can be treated as one or more characters either missing, altered, inserted, or transposed. Detecting these requires a test belonging to the family of "edit distance" tests, which are rather slow (e.g. 10 secs to test a single input name against 10,000 names, 16+ mins to test against 1m names). It therefore requires a lot of pre-filtering (using human-devised rules) to avoid testing lots of names that almost certainly will not match, then more filtering (again using human-set rules) of the resulting raw hits to eliminate false hits so far as possible while keeping a result "bag" that will include the true ones (if present).
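As a rough illustration of an edit-distance test with pre-filtering, here is a Python sketch. It uses the optimal string alignment variant of Damerau-Levenshtein distance, which counts a missing, altered, inserted, or adjacent-transposed character as one edit each. The pre-filter rules (length difference, agreement at the first or last character) and the names `osa_distance` and `candidates` are assumptions for illustration, not the TAXAMATCH rules.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b."""
    prev2 = None                      # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))    # row i-1
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # character missing
                         cur[j - 1] + 1,       # character inserted
                         prev[j - 1] + cost)   # character altered (or equal)
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent transposition
        prev2, prev = prev, cur
    return prev[len(b)]

def candidates(query: str, names, max_dist: int = 2):
    """Return (distance, name) pairs within max_dist, after cheap pre-filters."""
    q = query.lower()
    hits = []
    for name in names:
        n = name.lower()
        # Pre-filter 1: a large length difference cannot be within max_dist.
        if abs(len(n) - len(q)) > max_dist:
            continue
        # Pre-filter 2 (assumed rule): errors are rare at the ends of a name,
        # so skip names that disagree at both the first and last character.
        if n[:1] != q[:1] and n[-1:] != q[-1:]:
            continue
        d = osa_distance(q, n)
        if d <= max_dist:
            hits.append((d, name))
    return sorted(hits)
```

The length filter is exact (it never discards a true hit within `max_dist`), whereas the end-character rule is a heuristic that trades a little recall for speed.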

With these optimisation steps, my system currently only has to test maybe 1,000 names (out of 1.4 million held in the system) and generally returns a result in 1-2 seconds per input name. (Of course, if I had a supercomputer or a cloud of linked machines, that could come down further.) This is the non-phonetic portion of the algorithm I call TAXAMATCH, which in addition employs the "improved" (2007) phonetic algorithm as described above for its phonetic testing. Currently I have TAXAMATCH working in my IRMNG database, accessible at http://www.cmar.csiro.au/datacentre/irmng/, and some trial runs have also been undertaken with the 18m names in the global names index (www.globalnames.org) and GBIF China, as well as potentially elsewhere. To see it working e.g. in the IRMNG system, try this test (or anything else you may like to devise):

http://www.marine.csiro.au/mirrorsearch/ir_search.go?hlevel=species&searchtxt=hombo+sapient

There are a few cases that TAXAMATCH currently does not address; these include genus+species concatenated (think Homosapiens), broken words (think Fucus vesic ulosus), genus and species transposed, and subgenus with missing brackets (so it looks like the middle element of a trinomial name instead of a subgenus), all of which I encounter occasionally, but have generally picked up manually or with other approaches (such as testing otherwise unknown "species epithets" to see if they match any known genus name).
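Two of these cases lend themselves to simple checks against a list of known genus names. The Python sketch below is an assumption about how such checks could be written (the function names and the `known_genera` set are hypothetical); it is not part of TAXAMATCH.

```python
from typing import Optional, Set, Tuple

def split_concatenated(token: str, known_genera: Set[str]) -> Optional[Tuple[str, str]]:
    """Try to split e.g. 'Homosapiens' into ('Homo', 'sapiens').

    known_genera is assumed to hold lower-case genus names.
    """
    t = token.strip().lower()
    # Try the longest possible genus prefix first, always leaving a non-empty
    # remainder to serve as the species epithet.
    for i in range(len(t) - 1, 1, -1):
        if t[:i] in known_genera:
            return t[:i].capitalize(), t[i:]
    return None

def looks_transposed(first: str, second: str, known_genera: Set[str]) -> bool:
    """Flag 'sapiens Homo': first word is not a known genus but the second is."""
    return first.lower() not in known_genera and second.lower() in known_genera
```

For example, with `known_genera = {"homo", "fucus"}`, `split_concatenated("Homosapiens", known_genera)` yields the pair Homo / sapiens, and `looks_transposed("sapiens", "Homo", known_genera)` is true.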

I deal with authority matching using a separate routine; since this message has become rather lengthy maybe I will describe this in a separate post if persons are interested (otherwise feel free to stop me now...).

Anyway, hope the above is of interest so far - or maybe "too much information",

Regards - Tony

***
I am glad to see that Tony is "happy to speak further to this topic"; this is
quite informative. This approach is different in focus from what I had in mind. 
The errors that are made in filling in a query window (or a spreadsheet as in 
the case of the fossil molluscs) look substantially different in nature from 
the variation I would expect in the literature and in databases. These are 
indeed misspellings, resulting from the interaction between a person 
(apparently in haste) and a keyboard. On the other hand the variation I saw 
in the GNI has a much narrower range (in line with what I encounter in the
literature) and looks much more predictable.

Paul