[Taxacom] Occurrence data...

Fri Feb 18 16:42:51 CST 2011

Dear Lyubo,

It sounds like your response to my comment

"A barrier to be overcome if DCAs are to appear more often in publications is that most data creators are either unfamiliar with the TDWG scheme for classifying and formatting data items, or are unwilling to spend time working out how their own preferred data fields relate to that scheme."

is

"Naturally, we are aware that at the present stage DwC-A would in many cases need some support from experienced data managers to be properly implemented. It will take some time. On the other side, the future comes often faster than anyone would expect.  Data managers become quickly wanted job positions even in not that large taxonomic institutions. Individual taxonomist will be facilitated by tools to export their datasets in DwC-A or in another interoperable formats."

But this avoids the questions: Is it necessary? Is it even desirable? ZooKeys already semantically marks up the text and assigns the all-important LSIDs. You are now encouraging authors to go to the next stage and structure their raw occurrence and nomenclatural data. How long will it be before you ask authors to digitally map their images, so that some aggregator (an 'Encyclopedia of Morphology' project) can pull up all the hind-leg tarsus image-elements in the digitised insect literature?

I am concerned that what is happening is flawed at two levels. First and foremost, there is a legacy feeling from the days of libraries, when you could create a single authoritative index and it would sit on a shelf in the Reference section, and it was the first place you went as an introduction to a topic. You can still find such things on the Web: lists of links, generally way out of date. There is far too much information on the Web to make a single authoritative index viable: there are too many data quality issues and updating is haphazard. The alternative is to let software find things for you - the Rod Page approach - so that there are as many indexes and compendia as there are occasions on which someone goes data-hunting. And to link (or allow software to link for you) and link again, until you have a densely interconnected network of data sources to facilitate that data-hunting.

The second level is that even today, 20 years into the new age, promoters of Gigantic All-Encompassing Biodiversity Databases (and indexes, Rich) still have no clear idea who wants the information and for what purposes. If I ask that question I sometimes get the sincere but vacuous answer that we don't know and it isn't important; the important thing is to have the data ready when someone, somewhere, wants it for some purpose. I can't think of any other major human enterprise that tolerates such vagueness in its aims.

The many bottom-up biodiversity databases on the Web typically have an audience in mind, namely the specialists who contribute to their creation, and who are the primary users of the data. They've been structured for those users, built with careful attention to detail, and can be 'handed down' from volunteer specialist to volunteer specialist, with some confidence that the same general aims and devotion will also be handed down. I don't think you could say that for any of the aggregation projects.

I see these bottom-up resources as high-use nodes in the future networks of linked biodiversity data. Their contents don't need to be aggregated, indexed, repackaged or otherwise fooled with. They can be accessed directly in an anarchic, unstructured Web. Like Pete DeVries, I don't see any good reason why the same can't be true for raw data. If raw data is made available this way, as in ZooKeys supplements, I'd prefer it *wasn't* marked up, so that I - as *user*, not aggregator - can pass an eye over it a la Chris Thompson.

Rich Pyle wrote (as I was writing the above):

"Criticize aggregators all you want, but one thing that they certainly *can* help with is in eliminating a lot of redundant effort."

Effort by whom? For what purpose? Do you really expect or want to have the background on every RCL Perkins collection in Hawaii and every other collector in every other place on Earth in another gigantic index-on-the-shelf? With no errors? How about just putting on the Web the individual results of careful scholarship and allowing *users* to find them through linking? Isn't the aim to connect user with datum, not to keep programmers and data managers employed?
-- 
Dr Robert Mesibov
Honorary Research Associate
Queen Victoria Museum and Art Gallery, and
School of Zoology, University of Tasmania
Home contact: PO Box 101, Penguin, Tasmania, Australia 7316
Ph: (03) 64371195; 61 3 64371195
Webpage: http://www.qvmag.tas.gov.au/?articleID=570