[Taxacom] the hurdle for all biodiv informatics initiatives

Richard Pyle deepreef at bishopmuseum.org
Thu Feb 18 12:39:31 CST 2010


> ***
> OK, "biodiversity informatics" it is (BTW, when I took a 
> course in bioinformatics it meant something different 
> entirely from what is sketched above. There are a lot of 
> terms that are confusing!)
> * * *

Yes -- same here.  The term "bioinformatics" existed in our sense long
before PCR was a mature technology.  But if you Google that term, you'll see
that it's almost always used sensu stricto for DNA stuff.

> ***
> But how much more effective it would be if this was skipped, 
> and instead an infrastructure was built with actual 
> information (names, types, circumscriptions), with the text 
> strings left to be handled by a trivial algorithm to be 
> designed for the purpose!
> * * *

Without launching into another HUGE email, I'll just say that this is how my
own bio[diversity]informatics efforts began.  I spent perhaps a decade
staunchly avoiding any sort of "surrogate primary key" (i.e., arbitrary
number) in my data tables.  In fact, the first robust database I created --
the specimen database for Ichthyology at Bishop Museum -- *still* uses the
set of three fields "Genus", "Species", "Subspecies" as the compound primary
key for the taxonomy table.  But after a decade of beating my head against
that ideological wall, I finally had to re-evaluate my position and embrace
surrogate primary key fields (locally unique identifiers) in my databases.
Now that I have a better understanding of what is needed to allow taxonomic
data to flow across the internet, I have come to embrace GUIDs.  Basically,
over the years of dealing with real-world taxonomy database issues, as I
became both experienced and educated, I transitioned from a staunch
opponent of any kind of identifier to one of their loudest
advocates.
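
To make the surrogate-key idea concrete, here is a minimal sketch in Python. It is an illustration only, not the Bishop Museum schema: the TaxonName class, the "SPEC-0001" catalog number, and the genus change are all hypothetical. The point is that records pointing at an arbitrary identifier keep pointing at the same name record even when the text fields are edited.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TaxonName:
    """Hypothetical name record; the field names are assumptions."""
    genus: str
    species: str
    subspecies: str = ""
    # Surrogate key: an arbitrary identifier that carries no taxonomic meaning.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


name = TaxonName("Cyclotrachelus", "sodalis")

# Other records refer to the name by its GUID, never by its text fields.
names_by_guid = {name.guid: name}
specimen_to_name = {"SPEC-0001": name.guid}  # hypothetical catalog number

# Edit the text fields (an illustrative change of genus, not a taxonomic claim).
name.genus = "Evarthrus"

# The specimen still resolves to the same record, with no cascading updates.
linked = names_by_guid[specimen_to_name["SPEC-0001"]]
print(linked.genus, linked.species)
```

With a compound key of (Genus, Species, Subspecies), the same edit would have required updating every table that stored those three values as a foreign key.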


> Text strings will never become consistent, nor stop multiplying.
> Not unless humans are excluded from the entire process.

I think you misunderstood my point.  By "consistent" I didn't mean that we'd
all share the same taxonomy and nomenclature.  I meant that we wouldn't have
to continue to deal with variations like this:

Cyclotrachelus (Evarthrus) sodalis (LeConte)
Cyclotrachelus (Evarthrus) sodalis (Le Conte, 1848)
Cyclotrachelus (Evarthrus) sodalis (Le Conte 1848)
Cyclotrachelus (Evarthrus) sodalis (Le Conte)
Cyclotrachelus (Evarthrus) sodalis (LeC., 1848)
Cyclotrachelus (Evarthrus) sodalis (LeC. 1848)
Cyclotrachelus (Evarthrus) sodalis (LeC.)
Cyclotrachelus (Evarthrus) sodalis (LeC)
Cyclotrachelus (E.) sodalis (LeConte)
Cyclotrachelus (E.) sodalis (Le Conte, 1848)
Cyclotrachelus (E.) sodalis (Le Conte 1848)
Cyclotrachelus (E.) sodalis (Le Conte)
Cyclotrachelus (E.) sodalis (LeC., 1848)
Cyclotrachelus (E.) sodalis (LeC. 1848)
Cyclotrachelus (E.) sodalis (LeC.)
Cyclotrachelus (E.) sodalis (LeC)
Cyclotrachelus sodalis (LeConte)
Cyclotrachelus sodalis (Le Conte, 1848)
Cyclotrachelus sodalis (Le Conte 1848)
Cyclotrachelus sodalis (Le Conte)
Cyclotrachelus sodalis (LeC., 1848)
Cyclotrachelus sodalis (LeC. 1848)
Cyclotrachelus sodalis (LeC.)
Cyclotrachelus sodalis (LeC)
C. (Evarthrus) sodalis (LeConte)
C. (Evarthrus) sodalis (Le Conte, 1848)
C. (Evarthrus) sodalis (Le Conte 1848)
C. (Evarthrus) sodalis (Le Conte)
C. (Evarthrus) sodalis (LeC., 1848)
C. (Evarthrus) sodalis (LeC. 1848)
C. (Evarthrus) sodalis (LeC.)
C. (Evarthrus) sodalis (LeC)
C. (E.) sodalis (LeConte)
C. (E.) sodalis (Le Conte, 1848)
C. (E.) sodalis (Le Conte 1848)
C. (E.) sodalis (Le Conte)
C. (E.) sodalis (LeC., 1848)
C. (E.) sodalis (LeC. 1848)
C. (E.) sodalis (LeC.)
C. (E.) sodalis (LeC)
C. sodalis (LeConte)
C. sodalis (Le Conte, 1848)
C. sodalis (Le Conte 1848)
C. sodalis (Le Conte)
C. sodalis (LeC., 1848)
C. sodalis (LeC. 1848)
C. sodalis (LeC.)
C. sodalis (LeC)

...and that's just a fraction of the possible variations!
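
The list above follows a simple combinatorial pattern: two ways to write the genus, three ways to treat the subgenus, and eight ways to write the author. A rough sketch in Python reproduces it and then points every variant at a single identifier. The identifier value "urn:example:name:1" is a made-up placeholder, and this is only an illustration of the idea, not any existing name service.

```python
from itertools import product

# The forms below are taken from the list above.
genus_forms = ["Cyclotrachelus", "C."]
subgenus_forms = ["(Evarthrus)", "(E.)", ""]  # "" means subgenus omitted
author_forms = [
    "(LeConte)", "(Le Conte, 1848)", "(Le Conte 1848)", "(Le Conte)",
    "(LeC., 1848)", "(LeC. 1848)", "(LeC.)", "(LeC)",
]


def name_variants():
    """Build every genus/subgenus/author combination for the epithet 'sodalis'."""
    variants = []
    for genus, subgenus, author in product(genus_forms, subgenus_forms, author_forms):
        parts = [genus, subgenus, "sodalis", author]
        # Skip the empty subgenus slot so no double space appears.
        variants.append(" ".join(p for p in parts if p))
    return variants


all_variants = name_variants()

# Hypothetical identifier; every text-string variant resolves to the same one.
NAME_ID = "urn:example:name:1"
variant_index = {variant: NAME_ID for variant in all_variants}

print(len(all_variants))
print(variant_index["C. (E.) sodalis (LeC., 1848)"])
```

Three small sources of variation already multiply out to dozens of distinct strings for one name, before misspellings, alternate years, or other combinations enter the picture.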

> ***
> Actually, I do not see that the "myriad text strings... are 
> our only link ... to important information about 
> biodiversity" nor that they would be sufficient to access all 
> the information. They are just what the "biodiversity 
> informatics" people are dealing with.

OK, I don't follow.  Can you elaborate on how else we index information
about biodiversity in published (and unpublished) forms?  It seems to me
that the entire REASON we use taxon names is to abstract the notion of a
taxon concept in the form of a series of text characters, and that we use
those text-character strings to label other information (specimens, images,
DNA sequences, ecological datasets, taxonomic revisions, etc., etc.).  These
text strings existed LONG before computers existed, and certainly before
there were any "biodiversity informatics people" around.  

And it wasn't the "biodiversity informatics people" who created the mess.
Only a tiny fraction of the name-string variations are created during the
process of digitization/databasing.  The vast, vast majority of them exist
in printed form -- in published and unpublished documents, in natural
history specimen ledgers and labels, in notebooks, etc.  You can blame the
"biodiversity informatics people" for a lot of things, but the mess of
text-strings purported to represent scientific names that we now have to
deal with is definitely not among those things.

Aloha,
Rich





