[Taxacom] Sorry, but you are out-of-line

Mon Nov 15 12:30:08 CST 2010

Steve Gaimari wrote:

>You continually bring up GenBank as the model. There are differences,
>not the least of which is the relatively simple data structure. Also, I
>don't believe that GenBank will continue in its current conformation
>into perpetuity. They will upgrade their systems and migrate data and
>continue for as long as molecular biology is a critical field of study -
>I would say it will last a very very long time. But it will not be in a
>stagnant, *original* format into perpetuity. That is not something
>critical to molecular biology - access to the simple data is what is
>critical. However, it IS critical to the nomenclatural aspects of
>taxonomy - not just the simple data. Yes, there may be considerable time
>when a purely digital archive for taxonomy exists, and there will be
>continual upgrades to new technology for a while - maybe. But will
>taxonomy have the money and resources that the field of molecular
>biology has? Taxonomy sure hasn't demonstrated THAT, even with the
>world-recognized crisis in biodiversity. So I don't think setting up a
>system that will RELY on these resources into perpetuity is particularly
>forward-thinking. There is where the GenBank analogy falls apart, in my
>opinion.

You're raising two points here, and they're not 
really linked. Your first point, if I read it 
right, is that taxonomy *needs* a stagnant 
original form on file somewhere, but it's not 
clear what form you are referring to: is it the 
*paper* form, or a digital representation OF that 
paper form? If the former, I don't think it's 
fair to say we NEED the paper hard copy once we 
have created a securely-archived digital version. 
It's *better* to have a hard copy, it's 
*desirable* to have a hard copy, but I wouldn't 
use the word "need" once there is a secure 
digital version; at that point, the hard copy is 
effectively superfluous, in the same way that 
there is no longer a NEED for the metal meter 
stick that was THE standard of reference for a 
meter (at first there was a metal bar - the "hard 
copy" - then in 1960 it became "1,650,763.73 
wavelengths of the orange-red emission line in 
the electromagnetic spectrum of the krypton-86 
atom in a vacuum", and then in 1983 it became 
"the length of the path travelled by light in 
vacuum during a time interval of 1/299,792,458 of a 
second" - and I haven't heard of any physicists 
objecting on the grounds that we may someday lose 
the technology that allows us to measure 
wavelengths or laser beams).

If you're referring to the digital version of the 
hard copy not being maintained in perpetuity, 
that's literally trivial; if GenBank asked 
authors for PDFs of the papers in which their 
sequences were cited, then do you honestly 
believe that GenBank's archives would somehow be 
inadequate to the task? Digital is digital as far 
as storage, and as far as format, PDF is NOT 
proprietary, and if the technology ever 
"migrates", then the migration can be fully 
automated. Remember, a centralized archive is NOT 
stagnant; it isn't "storage" in the conventional 
sense of something being set aside and left 
untouched and then retrieved at some later, 
indeterminate point, which is how *private* 
archives work (and why private archives decay or 
become obsolete - and why I've been harping on 
private archives as irrelevant to the 
discussion); that sort of "storage" would only 
really apply to backups and mirrors - the main 
archive, however, is *dynamic*, with all of its 
elements up and running perpetually, constantly 
updating, error-checking, and so forth - there 
are no hiding places where some bit of data 
(e.g., a PDF) can slip through the cracks and NOT 
be converted to a different format when a 
different format upgrade is initiated. Again, you 
can't think of a centralized archive as "storage" 
- the entire archive changes every second of 
every day, and calling that "storage of data" is 
like saying that a guy juggling three balls is 
"storing" them. The bottom line is that any PDF 
of a paper is just as secure, permanent, and easy 
to archive and migrate as "simple data".
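
To make the "error-checking" part concrete, here is a 
minimal Python sketch of a checksum pass over an 
archive directory. It is only an illustration under 
stated assumptions: the function names (sha256_of, 
build_manifest, verify) and the one-directory layout 
are hypothetical, and nothing here describes what 
GenBank actually runs. The point is simply that a PDF 
and a sequence file are checked the same way, so no 
file type is skipped.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(archive_dir: Path) -> dict[str, str]:
    """Record a checksum for every file currently in the archive.

    Hypothetical layout: everything lives under one directory tree.
    """
    return {
        str(p.relative_to(archive_dir)): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }


def verify(archive_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return names of files that are missing or whose bytes have changed."""
    problems = []
    for name, expected in manifest.items():
        p = archive_dir / name
        if not p.is_file() or sha256_of(p) != expected:
            problems.append(name)
    return problems
```

Run on a schedule, a pass like verify() flags any item 
that has gone missing or been corrupted, and the same 
manifest gives a format migration a complete list of 
what has to be converted.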

Your second point is the one I *don't* have a 
simple answer for, and as such, is of greater 
general concern; "will taxonomy have the money 
and resources that the field of molecular biology 
has?"

To some extent, I think we may be selling our 
commodity short a bit there; when you put ALL of 
taxonomy together, and consider how essential 
taxonomy is to the rest of the scientific 
community, it's not of trivial importance. Just 
one observation alone can suffice to make the 
general point: none of the data in GenBank are 
legitimately valuable to anyone if they are not 
linked to an organism, and that link is taxonomy. 
True, the bulk of GenBank is of common organisms 
whose taxonomy is absolutely stable (like "Homo 
sapiens"), but there's a lot of stuff in there 
for which taxonomy is crucial. Another part of 
this is that we have never gotten together AS a 
community and said "We are unanimous in our 
desire to have a permanent centralized archive - 
will you fund it?" - how can we expect or imagine 
being given money if we haven't shown we can work 
together or agree on anything? Consider that 
there *has* been money and resources given to 
taxonomy (in the broad sense) - repeatedly - to a 
number of different initiatives, each with slightly 
different goals and approaches. Is it possible 
that overlap and/or competition between these 
initiatives has created an environment such that 
nothing that even *smells* like a cataloguing 
effort will attract new funding (because "so-n-so 
is already doing that")? The bottom line here is 
that your second point deals far, far more with 
politics than anything else, and - as such - is 
less about logic or practicality, and accordingly 
almost completely unpredictable. There are no 
easy, obvious answers, aside from this one: we 
won't have a centralized archive if we abandon 
the idea without even trying. THIS is the topic 
we most badly need to be discussing, instead of 
the technical stuff. I agree that the analogy to 
GenBank fails in *this* matter - the *politics* 
behind it - but previous iterations of the 
discussion, including my original statement of 
the analogy, were in reference to the 
*technical* side, which is what people were 
worrying about, and the analogy still holds there.

As for a solution to the political dilemma, one 
idea I have raised before, to limited and at best 
half-hearted response, is that - if creating our 
own GenBank-like archive seems genuinely beyond 
our means (either practically or politically) - 
we might consider riding on GenBank's coattails; 
approach them and see if they would be willing to 
incorporate taxonomic data in their archives. 
Then, in the best-case scenario, the only funding 
*we* would need is for the process of getting the 
data uploaded, and perhaps designing a 
taxonomist-friendly interface; the actual 
infrastructure (otherwise a significant expense) 
would be GenBank's, and already in place. As 
Donat has already demonstrated, our data are NOT 
very different from their data, as seen through 
the proverbial eyes of a computer. They already 
have several orders of magnitude more sequences 
archived than there are nomenclatural acts in all 
of recorded history; we would not make much of a 
dent in their dataspace.

Sincerely,
-- 

Doug Yanega        Dept. of Entomology         Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314        skype: dyanega
phone: (951) 827-4315 (standard disclaimer: opinions are mine, not UCR's)
              http://cache.ucr.edu/~heraty/yanega.html
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82