[Taxacom] globalnames?

dipteryx at freeler.nl dipteryx at freeler.nl
Thu Sep 17 03:28:04 CDT 2009


From: David Remsen (GBIF) [mailto:dremsen at gbif.org]
Sent: Wed 16-9-2009 21:52
> Paul,

> You might try [...]

> http://code.google.com/p/taxon-name-processing/wiki/NameParsing

***
FWIW I would like to point out that the example at the top has 
the basionymAuthor and the combinationAuthor interchanged; 
the same example in the code further down is correct in that 
respect (although iffy in a different respect).
* * *

> You said there are likely 2-3 million names, at most.  In what 
> sense of the word "name" since I get different answers from 
> botanists than zoologists as to what they mean and it affects 
> the cardinality of the estimate.

***
I am using the term "scientific name" in what looks to me to be 
the accepted way (as indicated at http://www.globalnames.org/about);
that is, taking "biological" in a fairly strict sense 
(excluding many formalized ways to indicate organisms). This 
looks to me to exclude names of viruses (which follow a different
logic).

In the ICBN the definition is in Art. 6.3.:
"In this Code, unless otherwise indicated, the word "name" means
 a name that has been validly published, whether it is legitimate 
 or illegitimate (see Art. 12)." and in Art. 12.1:
"12.1. A name of a taxon has no status under this Code unless it
 is validly published [...]."

Obviously, this presents a problem to such projects as GNI in 
that strings like 'Faba faba' are not validly published, nor are 
many 'manuscript names' scattered through the literature, although
by form they are indistinguishable from actual scientific names.

Leaving this aside, authorship is emphatically not part of the 
scientific name. This is of more importance for botanical names
than for zoological names, as a zoological name has only one kind
of author (for a zoological name, variations in author 
representation will stop somewhere between half a dozen and 
a dozen?), while the authorship in a botanical name can include
up to five kinds of "authors" (authors in the sense of the ICBN). 
If one goes by the recommended form of at most two authors per 
kind-of-author that leads to a maximum of ten authors per name.
Most author-names have two fairly commonly used forms (some have 
more), which means that without anything out of the ordinary quite 
a few different representations of authorship will be possible. 
And this is per author attribution: with new research or a change 
in the Rules the attributed authorship itself may well change (with 
the publication of the 2006 Vienna Code a number of family names 
instantly became attributed to different authors). All in all, 
there are many possible ways to represent the authorship, for 
one particular scientific name, without any change to the
scientific name itself, or what it applies to.
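
To put a rough number on this, here is a minimal sketch in Python. It
assumes exactly two written forms per author and a simplified
"(basionym author) combination author" layout; the function name and the
author forms are only illustrative, and real authorship strings have
more parts ("ex", "in", and so on) than this covers.

```python
from itertools import product

def authorship_variants(basionym_forms, combination_forms):
    """List every '(basionym author) combination author' string obtainable
    by picking one written form for each of the two authors."""
    return [f"({b}) {c}" for b, c in product(basionym_forms, combination_forms)]

# Illustrative forms only: an abbreviated and a spelled-out form per author.
variants = authorship_variants(["L.", "Linnaeus"], ["Mill.", "Miller"])
for v in variants:
    print(v)

# Upper bound under the stated assumptions: five kinds of author,
# two authors per kind, two written forms per author.
max_authors = 5 * 2
max_variants = 2 ** max_authors
print(max_authors, max_variants)
```

With only two authors there are already four strings for one and the
same scientific name; with all ten author slots filled the same
assumptions give 2**10 = 1024.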

If more of the literature were to be scanned and processed, the 
number of 18 million text strings could be expanded enormously, 
without this adding one iota of information, or adding one 
single scientific name.
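
A small sketch of the point, in Python: several text strings collapse
onto a single scientific name once the authorship is set aside. The
split used here (first two words are the name, the rest is authorship)
is a deliberately naive assumption of mine, good only for simple
binomials, and is not how the GNI parser works.

```python
from collections import defaultdict

def split_name(text_string):
    """Naively split 'Genus epithet Authorship' into (name, authorship).
    Assumes a plain binomial; anything after the second word is authorship."""
    parts = text_string.split()
    return " ".join(parts[:2]), " ".join(parts[2:])

def build_index(text_strings):
    """Group the authorship variants found under each scientific name."""
    index = defaultdict(set)
    for s in text_strings:
        name, authorship = split_name(s)
        index[name].add(authorship)
    return index

strings = ["Vicia faba L.", "Vicia faba Linnaeus", "Vicia faba"]
index = build_index(strings)
print(len(strings), "text strings,", len(index), "scientific name")
```

Note that this grouping cannot tell a mere variant of the authorship
from a genuine homonym; that still needs separate disambiguation.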
* * *

> As to the interface and ordering of the GNI index I am not 
> sure how they should be organised and I suppose it's based on 
> what the index is for.  I want to use it to access links to things
> that I want dynamically updated.  

***
To me it looks as if the interface should be by scientific name
(which after all is what it claims to be indexing). The entries
will need to be 'disambiguated' anyway (for homonyms), no matter 
what.
* * *

> I want the orthographic matching to ensure that a link is made
> regardless of the specific orthography. 

***
That would be nice, but likely will take quite a bit of further 
work.
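
As a very rough illustration of what orthographic matching involves,
here is a sketch using Python's standard difflib module. The function
name, the cutoff of 0.85 and the list of known names are assumptions
for the example; plain string similarity knows nothing about
nomenclature, so this is a starting point at best.

```python
import difflib

def match_name(query, known_names, cutoff=0.85):
    """Return the known name closest to the query spelling, or None if
    nothing reaches the similarity cutoff."""
    matches = difflib.get_close_matches(query, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

known = ["Bos taurus", "Vicia faba"]
print(match_name("Bos tauros", known))     # a misspelled epithet
print(match_name("Quercus robur", known))  # a name not in the list
```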

I guess the parser algorithms work as well as could be expected
(the horrendous output of the "genus epitheton" aside), and that 
many of the silly errors are in the underlying databases (see the
amusing Bos taurus Skotsk hoejlandskvaeg caused by 
http://www.eol.org/pages/10199386).

Paul


