[Taxacom] the hurdle for all biodiv informatics initiatives

Fri Feb 19 03:52:39 CST 2010

From: Richard Pyle [mailto:deepreef at bishopmuseum.org]
Sent: Thu 18-2-2010 19:39

>> ***
>> OK, "biodiversity informatics" it is (BTW, when I took a 
>> course in bioinformatics it meant something different 
>> entirely from what is sketched above. There are a lot of 
>> terms that are confusing!)
>> * * *

> Yes -- same here.  The term "bioinformatics" existed in our sense 
> long before PCR was a mature technology.  But if you Google that 
> term, you'll see that it's almost always used sensu stricto for 
> DNA stuff.

***
Well, the course I took dealt with information processing by 
biotic systems.
* * *

> Without launching into another HUGE email, I'll just say that this 
> is how my own bio[diversity]informatics efforts began.  I spent 
> perhaps a decade staunchly avoiding any sort of "surrogate primary 
> key" (i.e., arbitrary number) in my data tables.  In fact, the first
> robust database I created -- the specimen database for Ichthyology 
> at Bishop Museum -- *still* uses the set of three fields "Genus", 
> "Species", "Subspecies" as the compound primary key for the taxonomy
> table.  But after a decade of beating my head against that 
> ideological wall, I finally had to re-evaluate my position and 
> embrace surrogate primary key fields (locally unique identifiers) 
> in my databases. Now that I have a better understanding of what is
> needed to allow taxonomic data to flow across the internet, I have
> come to embrace GUIDs.  Basically, over the years of dealing with 
> real-world taxonomy database issues, as I became both experienced 
> and educated, I transitioned from a staunch opposition to any kind 
> of identifier, to one of the loudest advocates of them.

***
Although I realize that using artificial identifiers (in the form 
of some kind of alphanumerical key/string) looks awkward and thus may 
raise instinctive objections with many who encounter them, I do not 
see any real problems with them, provided computers can read them 
unambiguously and as long as humans are not obliged to deal with them 
(although for safety's sake they should have access!).

The issue that I am focussing on is the information content that is
accessed: which label unlocks what information? Just making a stack
of labels (which is what any "name" is, a label) only results in a 
stack of labels, which is not necessarily of any use whatsoever.

From a nomenclatural point of view there are two (and only two, 
no more!) items of prime importance:
1) the scientific name (in its one correct spelling)
2) the type
In a database this (name + type) can be captured by a single 
alphanumerical key: it is a unique "entity", a nomenclatural entity.
Once this nomenclatural entity is included in a database, it is 
possible to attach a whole slew of nomenclatural information 
(what is its rank, who published it where, etc), which is very nice 
for completeness, and for quality control, but immaterial for
information-access purposes. This is the easy part, but absolutely
essential.
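
To make the "easy part" concrete, here is a minimal sketch in Python of a 
registry that hands out one arbitrary key per (name + type) pair. The class 
names, the key format ("N000001") and the type placeholder are all 
assumptions made up for illustration; no existing database is being 
described.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class NomenclaturalEntity:
    key: str        # arbitrary alphanumerical key, not meant for human eyes
    name: str       # the scientific name in its one correct spelling
    type_ref: str   # reference to the type (placeholder strings below)
    details: dict = field(default_factory=dict)  # rank, publication, etc.


class NameRegistry:
    """Hypothetical registry: one key per (name, type) pair, never reused."""

    def __init__(self):
        self._counter = count(1)
        self._by_pair = {}

    def register(self, name, type_ref, **details):
        pair = (name, type_ref)
        if pair not in self._by_pair:
            key = f"N{next(self._counter):06d}"
            self._by_pair[pair] = NomenclaturalEntity(key, name, type_ref)
        entity = self._by_pair[pair]
        # Extra nomenclatural information hangs off the entity; it is nice
        # for completeness but does not change the key.
        entity.details.update(details)
        return entity


registry = NameRegistry()
# "TYPE-PLACEHOLDER-1" stands in for a real type reference.
dianthus = registry.register("Dianthus L.", "TYPE-PLACEHOLDER-1", rank="genus")
# A homotypic synonym: a different name on the same type, so a different entity.
caryophyllus = registry.register("Caryophyllus Mill.", "TYPE-PLACEHOLDER-1")
# Registering the same pair again returns the same entity and key.
again = registry.register("Dianthus L.", "TYPE-PLACEHOLDER-1")
```

The key carries no meaning by itself; it only guarantees that everything 
attached to it is attached to exactly one (name + type) pair.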

Nomenclatural information, by itself, is not of practical value. 
Some names can be dealt with entirely from a nomenclatural 
perspective, by application of nomenclatural rules (Caryophyllus 
Mill. is an illegitimate name, a homotypic synonym of Dianthus L.) 
or by the Act of a higher nomenclatural authority. The difficult 
part in building a database is how to access actual information. 
Often it will be necessary to break things down into separate 
documented usages of a name, each of which is then linked to a 
particular taxon concept (like: Fantasia imaginensis as treated 
in the Flora Utopica), that is, a particular circumscription 
(which may correspond to a chresonym, or a collection of chresonyms,
if you like). A database had better have a unique alphanumerical 
key for each circumscription, or it becomes a meaningless mess.
* * *

>> Text strings will never become consistent, nor stop multiplying.
>> Not unless humans are excluded from the entire process.

> I think you misunderstood my point.  By "consistent" I didn't mean
> that we'd all share the same taxonomy and nomenclature.  I meant 
> that we wouldn't have to continue to deal with variations like this:

***
Actually, that is exactly how I understood it. For practical purposes,
there is an endless number of variations. Not only won't this ever 
stabilize, but such a variation of a text string does not unlock and
access useful information (for example, of the occurrences of spelling
variation umpteen-and-one there may be three that refer to 
circumscription A, two to circumscription B and five to circumscription 
C, while of the occurrences of spelling variation umpteen-and-two 
there are six that refer to circumscription A, one to circumscription 
B and two to circumscription C, and on-and-on for all the other 
spelling variations). Documenting variations of text strings gets you 
nothing, except lots and lots of such text strings.
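
The counts in that example can be put into a small Python sketch. The variant 
labels and the dictionary layout are just an illustration; the numbers are 
the ones given above.

```python
from collections import Counter

# (spelling variant, circumscription) -> number of documented occurrences
occurrences = {
    ("variant umpteen-and-one", "A"): 3,
    ("variant umpteen-and-one", "B"): 2,
    ("variant umpteen-and-one", "C"): 5,
    ("variant umpteen-and-two", "A"): 6,
    ("variant umpteen-and-two", "B"): 1,
    ("variant umpteen-and-two", "C"): 2,
}

by_string = Counter()
by_circumscription = Counter()
for (variant, circumscription), n in occurrences.items():
    by_string[variant] += n
    by_circumscription[circumscription] += n


def unlocked_by(variant):
    """What a lookup on one text string actually returns: a mix of concepts."""
    return {c: n for (v, c), n in occurrences.items() if v == variant}
```

Looking up either string returns occurrences of all three circumscriptions 
mixed together; only the grouping by circumscription separates the 
information that belongs to A from that belonging to B and C.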
* * *

>> ***
>> Actually, I do not see that the "myriad text strings... are 
>> our only link ... to important information about 
>> biodiversity" nor that they would be sufficient to access all 
>> the information. They are just what the "biodiversity 
>> informatics" people are dealing with.
>> * * *

> OK, I don't follow.  Can you elaborate on how else we index
> information about biodiversity in published (and unpublished) forms?
> It seems to me that the entire REASON we use taxon names is to 
> abstract the notion of a taxon concept in the form of a series 
> of text characters, and that we use those text-character strings 
> to label other information (specimens, images, DNA sequences,
> ecological datasets, taxonomic revisions, etc., etc.).  

***
Well, my point is a very general one. There is a lot of information 
"out there" and our link to it is the names that are used; these names 
can roughly be subdivided into scientific, purporting-to-be-scientific,
common and vernacular names. All of which may well have spelling
variations, without these necessarily meaning much (or anything at all).
* * *

> And it wasn't the "biodiversity informatics people" who created the mess. 

***
Oh no, it is probably safe to say that biodiversity is messy to begin with,
and as for the myriad people who have been dealing with it ...
* * *

> You can blame the "biodiversity informatics people" for a lot of 
> things, but the mess of text-strings purported to represent scientific
> names that we now have to deal with is definitely not among those
> things.

***
That depends on how you phrase it ...

Paul