[Taxacom] Algorithms for misspelled taxon names (was: the hurdle for all biodiv informatics initiatives)

Thu Feb 25 19:23:32 CST 2010

Paul van Rijckevorsel wrote:

***
Subject: Re: [Taxacom] the hurdle for all biodiv informatics initiatives
....

An algorithm is indeed a logical avenue to explore (so I am glad to 
see this was not disregarded), but it is not necessarily misspellings
that have to be dealt with. There is a huge variation in author citation
that does not fall in that category.

An algorithm should for the most part be customized rather strictly: 
there is a limited set of variations that are very likely to be found, 
and it should be possible to deal with these efficiently.

***

I am happy to speak further to this topic, in case there is anyone on the list interested in the issues here.

To take Paul's points in reverse order:

- Misspelled taxon names (excluding authority portion):

Quote: "there is a limited set of variations that are very likely to be found, and it should be possible to deal with these efficiently." (Earlier it was also suggested that a fairly simple algorithm might suffice).

I built such a "simple algorithm" here in 2000-2001 (these days I call it the "Rees 2001 phonetic match" algorithm). Basically it looks for mismatches on single vs. repeated letters (e.g. Nicholsia vs. Nichollsia), dropped or added "h"s (e.g. Coelorynchus vs. Coelorhynchus), some "phonetic" vowel substitutions (e.g. Penaeus vs. Peneus, oe/ae, i/y and a few others), and some "phonetic" consonant substitutions (e.g. c/k, s/z, a few others). In this version, the leading letter of any name was considered to be correct (as per Soundex and similar, non-specialised "phonetic" algorithms). This algorithm was first installed in my CAAB database (http://www.cmar.csiro.au/caab/) and later in Aphia in Belgium (ERMS, World Porifera database etc. etc.), and OBIS in the USA. It turns out that this catches around 35-40% of actual errors, either from user queries or in stored data, with very few false positives (unlike Soundex, which generates a massive number of false hits).
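
To make the idea concrete, here is a minimal Python sketch of a key-based "phonetic" comparison along the lines described above. It is not the Rees 2001 code itself: the function name phonetic_key and the exact rule set and rule order are assumptions for illustration, covering only the substitutions named in the text.

```python
import re

def phonetic_key(name: str) -> str:
    """Reduce a name to a rough sounds-alike key (illustrative rules only)."""
    s = name.strip().lower()
    # The leading letter is trusted, as in the 2001 version described above.
    first, rest = s[:1], s[1:]
    rest = rest.replace("ae", "e").replace("oe", "e")  # ae/oe vowel variants
    rest = rest.replace("y", "i")                      # i/y
    rest = rest.replace("h", "")                       # dropped or added h
    rest = rest.replace("k", "c").replace("z", "s")    # c/k and s/z
    rest = re.sub(r"(.)\1+", r"\1", rest)              # single vs. repeated letters
    return first + rest

def phonetic_match(a: str, b: str) -> bool:
    """Two names match 'phonetically' if they reduce to the same key."""
    return phonetic_key(a) == phonetic_key(b)
```

With these rules, phonetic_match("Nicholsia", "Nichollsia") and phonetic_match("Penaeus", "Peneus") both hold, while unrelated names such as Pinus and Picea keep distinct keys. In a database setting the key would normally be computed once per stored name and indexed, so that a query needs only one key lookup.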

Later in 2007 I added a few refinements, to produce my "Rees 2007 phonetic algorithm". For this I allowed the leading character to vary in certain cases (Pteranodon/Teranodon, Euglena/Uglena etc.), and added a refinement to deal with variant gender endings in species epithets (not genera), e.g. to match Pinus radiata / -us / -um. This algorithm is currently installed in the Euro+Med Plantbase system in Germany. On test it catches around 50% of errors, the balance being therefore "non-phonetic" (according to this definition) and non-gender errors.
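
The two 2007 refinements can be sketched in the same spirit. The code below is an assumption-laden illustration, not the installed algorithm: the ending list covers only the -a/-us/-um group mentioned above, the leading-character table contains only the two examples given (Pt/T and Eu/U), and all names (GENDER_ENDINGS, LEADING_VARIANTS, epithet_stem, normalise_leading) are hypothetical.

```python
# Assumption: only the -a / -us / -um gender group from the example above.
GENDER_ENDINGS = ("us", "um", "a")

# Assumption: only the two leading-character cases quoted in the text.
LEADING_VARIANTS = (("pt", "t"), ("eu", "u"))

def epithet_stem(epithet: str) -> str:
    """Strip a gender ending from a species epithet (not for genus names)."""
    e = epithet.lower()
    for ending in GENDER_ENDINGS:
        # Keep at least three characters of stem to avoid over-stripping.
        if e.endswith(ending) and len(e) > len(ending) + 2:
            return e[: -len(ending)]
    return e

def same_epithet_ignoring_gender(a: str, b: str) -> bool:
    return epithet_stem(a) == epithet_stem(b)

def normalise_leading(genus: str) -> str:
    """Map variable leading characters onto one form, e.g. Pt... -> T..."""
    g = genus.lower()
    for full, reduced in LEADING_VARIANTS:
        if g.startswith(full):
            return reduced + g[len(full):]
    return g
```

Here same_epithet_ignoring_gender("radiata", "radiatum") holds, and normalise_leading gives the same result for Pteranodon and Teranodon. A real rule set would need more ending groups (e.g. -is/-e) and more leading-character cases than this sketch includes.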

For the balance of essentially non-phonetic errors, it is interesting to see what these comprise. Some are predictable e.g. transposed characters (e.g. Acropaginula/Arcopaginula) or the same with a character pair or syllable (serrulatus/serratulus). Some are generated by faulty OCR (optical character recognition) e.g. l/t, l/i, o/e, ri/n, rn/m. Some are keystroke errors from hitting a key adjacent to the intended one e.g. b/v, t/y. Some are simply bad characters inserted e.g. ; or * or $ or / for no obvious reason. Many are random missing characters or extra characters inserted anywhere (though rather rarely at the beginning or end). Sometimes a whole syllable is inserted or missing (e.g. triangulum/triangulatum).
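
Two of these error classes are easy to illustrate in Python. The sketch below is hypothetical and only covers the stray characters and OCR confusion pairs listed above; the names OCR_CONFUSIONS, strip_stray_characters and differs_by_ocr_confusion are assumptions, and the check handles a single confusion per name only.

```python
import re

# Confusion pairs taken from the examples in the text above.
OCR_CONFUSIONS = [("rn", "m"), ("ri", "n"), ("l", "t"), ("l", "i"), ("o", "e")]

def strip_stray_characters(name: str) -> str:
    """Remove the stray characters mentioned above (; * $ /)."""
    return re.sub(r"[;*$/]", "", name)

def differs_by_ocr_confusion(a: str, b: str) -> bool:
    """True if b equals a with exactly one listed confusion applied (either direction)."""
    for x, y in OCR_CONFUSIONS:
        for src, dst in ((x, y), (y, x)):
            start = a.find(src)
            while start != -1:
                if a[:start] + dst + a[start + len(src):] == b:
                    return True
                start = a.find(src, start + 1)
    return False
```

For instance, strip_stray_characters("Fucus ves;iculosus") gives "Fucus vesiculosus", and differs_by_ocr_confusion("cornuta", "comuta") is true because of the rn/m pair.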

In the end I decided it was not worth trying to cope with all of these cases separately, since virtually all of them can be treated as one or more characters either missing, altered, inserted, or transposed. Detecting these requires a test belonging to the family of "edit distance" tests, which are rather slow (e.g. 10 secs to test a single input name against 10,000 names, 16+ mins to test against 1m names). It therefore requires a lot of pre-filtering (using human-devised rules) to avoid testing lots of names that almost certainly will not match, then more filtering (again using human-set rules) of the resulting raw hits, to eliminate false hits so far as possible while keeping a result "bag" that will include the true ones (if present).
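
As a rough illustration of this kind of test, the Python sketch below implements one member of the edit distance family (the "optimal string alignment" distance, which counts insertions, deletions, substitutions and adjacent transpositions) together with one very simple pre-filter on name length. The specific distance used, the names osa_distance and candidates, and the max_dist threshold are assumptions for the example and do not describe the actual rules in my system.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b."""
    prev2 = None                       # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))     # row i-1 (distance from empty prefix)
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # character missing
                         cur[j - 1] + 1,       # character inserted
                         prev[j - 1] + cost)   # character altered (or same)
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent pair transposed
        prev2, prev = prev, cur
    return prev[len(b)]

def candidates(query: str, names: list[str], max_dist: int = 2) -> list[tuple[int, str]]:
    """Return (distance, name) pairs within max_dist, nearest first."""
    q = query.lower()
    hits = []
    for name in names:
        n = name.lower()
        # Cheap pre-filter: the length difference is a lower bound on the
        # distance, so names that differ too much in length are skipped.
        if abs(len(n) - len(q)) > max_dist:
            continue
        d = osa_distance(q, n)
        if d <= max_dist:
            hits.append((d, name))
    return sorted(hits)
```

With this sketch, Acropaginula/Arcopaginula is at distance 1 (one transposition) and triangulum/triangulatum at distance 2 (two inserted characters). The length test is only the simplest possible pre-filter; it shows the principle of discarding most names before the expensive comparison is run.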

With these optimisation steps, my system currently only has to test maybe 1,000 names (out of 1.4 million held in the system) and generally returns a result in 1-2 seconds per input name. (Of course if I had a supercomputer or a cloud of linked machines that could come down further). This is the non-phonetic portion of the algorithm I call TAXAMATCH, which in addition employs the "improved" (2007) phonetic algorithm as described above for its phonetic testing. Currently I have TAXAMATCH working in my IRMNG database, access at http://www.cmar.csiro.au/datacentre/irmng/, and some trial runs have also been undertaken with the 18m names in the global names index (www.globalnames.org) and GBIF China, as well as potentially elsewhere. To see it working e.g. in the IRMNG system, try this test (or anything else you may like to devise):

http://www.marine.csiro.au/mirrorsearch/ir_search.go?hlevel=species&searchtxt=hombo+sapient

There are a few cases that TAXAMATCH currently does not address; these include genus+species concatenated (think Homosapiens), broken words (think Fucus vesic ulosus), genus and species transposed, and subgenus with missing brackets (so it looks like the middle of a trinomial name instead of a subgenus). I encounter all of these occasionally, but have generally picked them up manually or with other approaches (such as testing otherwise unknown "species epithets" to see if they match any known genus name).
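
Checks of this "other approaches" kind might look something like the following Python sketch. It is hypothetical throughout: the function names and the idea of passing in a list of known genus names are assumptions for illustration, and none of this is part of TAXAMATCH.

```python
def split_concatenated(name: str, known_genera: list[str]) -> list[tuple[str, str]]:
    """Try to split e.g. 'Homosapiens' into ('Homo', 'sapiens').

    Returns every split whose first part is a known genus name.
    """
    lowered = name.lower()
    splits = []
    for genus in known_genera:
        g = genus.lower()
        if lowered.startswith(g) and len(lowered) > len(g):
            splits.append((genus, lowered[len(g):]))
    return splits

def looks_transposed(name: str, known_genera: list[str]) -> bool:
    """True if the second word is a known genus but the first is not."""
    genera = {g.lower() for g in known_genera}
    parts = name.split()
    return (len(parts) == 2
            and parts[0].lower() not in genera
            and parts[1].lower() in genera)
```

For example, split_concatenated("Homosapiens", ["Homo", "Pinus"]) returns [("Homo", "sapiens")], and looks_transposed("sapiens Homo", ["Homo"]) is true. Any candidate produced this way would still need checking against the known species for that genus before being accepted.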

I deal with authority matching using a separate routine; since this message has become rather lengthy maybe I will describe this in a separate post if persons are interested (otherwise feel free to stop me now...).

Anyway, hope the above is of interest so far - or maybe "too much information",

Regards - Tony

Tony Rees
Manager, Divisional Data Centre,
CSIRO Marine and Atmospheric Research,
GPO Box 1538,
Hobart, Tasmania 7001, Australia
Ph: 0362 325318 (Int: +61 362 325318)
Fax: 0362 325000 (Int: +61 362 325000)
e-mail: Tony.Rees at csiro.au
Manager, OBIS Australia regional node, http://www.obis.org.au/ 
Biodiversity informatics research activities: http://www.cmar.csiro.au/datacentre/biodiversity.htm
Personal info: http://www.fishbase.org/collaborators/collaboratorsummary.cfm?id=1566

-----Original Message-----
From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of dipteryx at freeler.nl
Sent: Thursday, 25 February 2010 7:50 PM
To: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] the hurdle for all biodiv informatics initiatives

From: Tony.Rees at csiro.au [mailto:Tony.Rees at csiro.au]
Sent: Wed 24-2-2010 23:44

Dear Paul, all,

I like your postal analogy, but in fact the effort we are discussing (I think) is directed not towards misprinted stamps (which indeed would be mere curiosities) but mis-spelled or mis-addressed recipients (typically with something of value in the included content). The fact that the recipients can also change their names and addresses, or two persons can share a common name or address yet be distinct, is also relevant, and compounds the issue. 

***
The postal system is not a bad analogy, although for this purpose 
it would probably be better to take into account everything that has 
ever been mailed.
* * *

So either we end up with a lot of undeliverable or wrongly delivered mail, or we try to handle the various problems - in part indeed with algorithms to deal with misspellings (a personal interest here, e.g. see my own attempts at http://www.cmar.csiro.au/datacentre/taxamatch.htm)

***
An algorithm is indeed a logical avenue to explore (so I am glad to 
see this was not disregarded), but it is not necessarily misspellings
that have to be dealt with. There is a huge variation in author citation
that does not fall in that category.

An algorithm should for the most part be customized rather strictly: 
there is a limited set of variations that are very likely to be found, 
and it should be possible to deal with these efficiently.

Paul
* * *