[Taxacom] the hurdle for all biodiv informatics initiatives

Wed Feb 24 16:44:46 CST 2010

Dear Paul, all,

You wrote: 

> Obviously, when dealing with each occurrence of a name it is important to
> resolve homonyms, to circumvent misspellings, etc, etc, but piling lots 
> and lots of text strings on top of one another is diverting time and 
> attention from actually doing that. It looks to me like an attempt to 
> understand the postal system by collecting misprinted stamps; it may 
> be nice as a hobby but won't get anywhere.

> Instead it would be advisable to build a (fairly simple) algorithm to do
> a rough sort, at least that would save time for the inevitable human 
> operator who will have to do the actual work.

I like your postal analogy, but I think the effort we are discussing is directed not towards misprinted stamps (which indeed would be mere curiosities) but towards misspelled or misaddressed recipients (typically with something of value in the enclosed content). The fact that recipients can also change their names and addresses, or that two persons can share a common name or address yet be distinct, is also relevant, and compounds the issue. So either we end up with a lot of undeliverable or wrongly delivered mail, or we try to handle the various problems: in part indeed with algorithms to deal with misspellings (a personal interest here, e.g. see my own attempts at http://www.cmar.csiro.au/datacentre/taxamatch.htm), and in part by building reference lists of either "clean" names alone, or "clean plus dirty" (i.e. misspelled) names.
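As a rough illustration of the two-pronged handling described above (a reference list of clean and known-misspelled names, plus an algorithm for whatever is left over), here is a minimal Python sketch. It is not the TAXAMATCH algorithm; it uses the standard library's difflib similarity ratio as a stand-in, and the name lists, the function resolve_name and the 0.85 cutoff are all hypothetical choices made for illustration only.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical reference list of accepted ("clean") names.
CLEAN_NAMES = ["Aloe vera", "Aloe ferox", "Agave americana"]

# Hypothetical "dirty" list: known misspellings cross-referenced
# to the clean name they should resolve to.
KNOWN_MISSPELLINGS = {"Aloe verra": "Aloe vera"}


def resolve_name(raw: str, cutoff: float = 0.85) -> Optional[str]:
    """Return the clean name for an incoming string, or None if unresolved."""
    # Collapse stray whitespace before any comparison.
    name = " ".join(raw.split())

    # 1. Exact hit on the clean list.
    if name in CLEAN_NAMES:
        return name

    # 2. Exact hit on a stored, already-reconciled misspelling.
    if name in KNOWN_MISSPELLINGS:
        return KNOWN_MISSPELLINGS[name]

    # 3. Fall back to a fuzzy match (difflib ratio, an assumed stand-in
    #    for a purpose-built taxonomic name matcher).
    candidates = get_close_matches(name, CLEAN_NAMES, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


if __name__ == "__main__":
    for incoming in ["Aloe vera", "Aloe verra", "Aloe feroxx", "Quercus robur"]:
        print(incoming, "->", resolve_name(incoming))
```

Anything that comes back as None would be left for the human operator; a fuzzy hit would normally be flagged for review rather than accepted blindly.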

The extent to which the latter is required or desirable is a matter for debate. Obviously, if you have a lot of incoming data (such as museum specimens, field observations, and literature / nomenclator citations) labelled with such misspellings, you cannot throw them away, and it is probably advantageous to keep them, adequately reconciled and cross-referenced as required. More "secondary" misspellings, such as those arising from OCR, non-specialist authored web content and database errors, I am inclined not to keep (in my systems at least), since otherwise where do you stop? As with many such matters, however, there is no exact line that is easily drawn.

So my approach, developed and iterated over more than a decade of handling such issues, is pragmatic: it does not exclude "known" misspellings, nor those which are attached to data/information of interest. Of course one could try to predict and store the possible variants of even a four letter word such as "Aloe", but you soon run out of fingers or hands on which to count them, so that is definitely not worth it, and probably a waste of effort as you suggest...
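To put a rough number on the "Aloe" point, the following Python sketch enumerates the strings that lie one edit away from a word. It assumes a 26-letter lowercase alphabet and the usual four edit types (deletion, adjacent transposition, substitution, insertion); the function names are illustrative only.

```python
import string
from typing import Set

ALPHABET = string.ascii_lowercase  # assumption: plain a-z only


def single_edit_variants(word: str) -> Set[str]:
    """All distinct strings exactly one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutes = {
        left + c + right[1:] for left, right in splits if right for c in ALPHABET
    }
    inserts = {left + c + right for left, right in splits for c in ALPHABET}
    # Drop the original word, which substitution (and, for words with a
    # doubled letter, transposition) regenerates.
    return (deletes | transposes | substitutes | inserts) - {word}


def variants_within(word: str, max_edits: int) -> Set[str]:
    """All distinct strings reachable in at most `max_edits` edits."""
    seen = {word}
    frontier = {word}
    for _ in range(max_edits):
        frontier = {v for w in frontier for v in single_edit_variants(w)} - seen
        seen |= frontier
    return seen - {word}


if __name__ == "__main__":
    print(len(single_edit_variants("aloe")))  # 233
    print(len(variants_within("aloe", 2)))    # far larger again
```

Even at a single edit there are 233 candidate strings for a four letter word (4 deletions, 3 transpositions, 100 substitutions, 126 distinct insertions), almost none of which will ever turn up in real data, which is why storing predicted variants in advance scales so poorly compared with matching at query time.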

Regards

Tony Rees
Manager, Divisional Data Centre,
CSIRO Marine and Atmospheric Research,
GPO Box 1538,
Hobart, Tasmania 7001, Australia
Ph: 0362 325318 (Int: +61 362 325318)
Fax: 0362 325000 (Int: +61 362 325000)
e-mail: Tony.Rees at csiro.au
Manager, OBIS Australia regional node, http://www.obis.org.au/ 
Biodiversity informatics research activities: http://www.cmar.csiro.au/datacentre/biodiversity.htm
Personal info: http://www.fishbase.org/collaborators/collaboratorsummary.cfm?id=1566