[Taxacom] Occurrence data...

Sat Feb 19 08:09:44 CST 2011

Dear Bob,

Please, please - we neither did nor do we intend to "ask" or "request"
authors to provide their occurrence data in Darwin Core Archive format! We
simply offer one more opportunity - perhaps the first one based on an
approved biodiversity standard - to those of our authors who would like to
publish their data. The author chooses how to publish the data, if he or
she wishes to do so. It can be done within the text (conventionally), as an
individual data table in a supplementary file (certainly not recommended but
possible), or through Darwin Core Archive.

" But this avoids the questions: is it necessary? is it even desirable?
ZooKeys already semantically marks up the text and assigns the all-important
LSIDs. You are now encouraging authors to go to the next stage, and
structure their raw occurrence and nomenclatural data. How long will it be
before you ask authors to digitally map their images, so that some
aggregator ('Encyclopedia of Morphology' project) can pull up all the
hind-leg tarsus image-elements in the digitised insect literature?"

Bob, from the above statement it looks as if we torture our authors
with never-ending requests to provide more and more details with their
manuscripts. Please note that ZooKeys and PhytoKeys implement most of these
innovations as routine practice during the in-house editorial process, with
the help of software tools we constantly develop. Naturally, some of
our authors come with good ideas on how to better publish their manuscripts.
The DwC-A files in the discussed paper, for example, were generated by Joe
Cora, the data manager of HOL, and the authors, in close cooperation with
people from GBIF and ZooKeys.

To answer your question whether all this is necessary or even desirable, I
dare to list at least five good reasons why authors would be motivated (some
are!) to publish their occurrence data in Darwin Core Archive format:

1. Publishing data in a standardized format opens new avenues for
collaboration - for example, easy collation with other datasets of interest
(the same taxon from another region, other taxa from the same region, etc.)
2. Publication of data registers priority and credits those who have
collected and processed the data
3. Publication of data may bring additional citations for the authors,
including through the invention of "data usage indexes"
4. Darwin Core Archive is not an "end" format; it opens the way for data to
be exported and re-used in different formats. For example, we are now working
on a method to convert DwC-A files into conventional-looking manuscripts,
e.g. checklists or catalogues. Could you imagine such a tool being designed
to handle the hundreds of different formats that individual authors use
to manage and preserve their data?
5. Last but not least - there are authors who simply do not want their data
to get lost on their personal computers, CDs, memory sticks, or whatever
carriers. Such authors would prefer to publish and open their data, and the
best way to do this is through a standardized format such as Darwin Core
Archive.

And one more significant technological reason - it is not easy, and I doubt
it will ever become easy, to mark up occurrence data scattered within the
text so that they can be harvested, indexed, linked, etc. by computers.
There are too many different elements within a single locality description
to make the effort of extraction and handling meaningful and
cost-efficient.

I am somewhat confused as to why data publication, which is obviously a good
thing (and it looks like we all agree on this), gets so mixed up with the way
data are or will be used in the future, e.g., with the old discussion on
"bottom-up" and "top-down" initiatives, "abbreviated monster aggregators",
etc. GBIF launched the DwC-A standard, the data are recorded and described in
their registry, and they can also be downloaded in a "non-marked-up" form
(e.g., as a delimited text file) from the journal's website. What is wrong
here?

Best regards,
Lyubo

On Sat, Feb 19, 2011 at 12:42 AM, Bob Mesibov <mesibov at southcom.com.au> wrote:

> Dear Lyubo,
>
> It sounds like your response to my comment
>
> "A barrier to be overcome if DCAs are to appear more often in publications
> is that most data creators are either unfamiliar with the TDWG scheme for
> classifying and formatting data items, or are unwilling to spend time
> working out how their own preferred data fields relate to that scheme."
>
> is
>
> "Naturally, we are aware that at the present stage DwC-A would in many
> cases need some support from experienced data managers to be properly
> implemented. It will take some time. On the other hand, the future often
> comes faster than anyone would expect. Data manager positions are quickly
> becoming sought after, even in smaller taxonomic institutions. Individual
> taxonomists will be helped by tools to export their datasets in DwC-A or
> in other interoperable formats."
>
> But this avoids the questions: is it necessary? is it even desirable?
> ZooKeys already semantically marks up the text and assigns the all-important
> LSIDs. You are now encouraging authors to go to the next stage, and
> structure their raw occurrence and nomenclatural data. How long will it be
> before you ask authors to digitally map their images, so that some
> aggregator ('Encyclopedia of Morphology' project) can pull up all the
> hind-leg tarsus image-elements in the digitised insect literature?
>
> I am concerned that what is happening is flawed at two levels. First and
> foremost, there is a legacy feeling from the days of libraries, when you
> could create a single authoritative index and it would sit on a shelf in the
> Reference section, and it was the first place you went as an introduction to
> a topic. You can still find such things on the Web: lists of links,
> generally way out of date. There is far too much information on the Web to
> make this viable, there are too many data quality issues and updating is
> haphazard. The alternative is to let software find things for you - the Rod
> Page approach - so that there are as many indexes and compendia as there are
> occasions on which someone goes data-hunting. And to link (or allow software
> to link for you) and link again, until you have a densely interconnected
> network of data sources to facilitate that data-hunting.
>
> The second level is that even today, 20 years into the new age, promoters
> of Gigantic All-Encompassing Biodiversity Databases (and indexes, Rich)
> still have no clear idea who wants the information and for what purposes. If
> I ask that question I sometimes get the sincere but vacuous answer that we
> don't know and it isn't important, the important thing is to have the data
> ready when someone, somewhere, wants it for some purpose. I can't think of
> any other major human enterprise that tolerates such vagueness in its aims.
>
> The many bottom-up biodiversity databases on the Web typically have an
> audience in mind, namely the specialists who contribute to their creation,
> and who are the primary users of the data. They've been structured for those
> users, built with careful attention to detail, and can be 'handed down' from
> volunteer specialist to volunteer specialist, with some confidence that the
> same general aims and devotion will also be handed down. I don't think you
> could say that for any of the aggregation projects.
>
> I see these bottom-up resources as high-use nodes in the future networks of
> linked biodiversity data. Their contents don't need to be aggregated,
> indexed, repackaged or otherwise fooled with. They can be accessed directly
> in an anarchic, unstructured Web. Like Pete DeVries, I don't see any good
> reason why the same can't be true for raw data. If raw data is made
> available this way, as in ZooKeys supplements, I'd prefer it *wasn't* marked
> up, so that I - as *user*, not aggregator - can pass an eye over it a la
> Chris Thompson.
>
> Rich Pyle wrote (as I was writing the above):
>
> "Criticize aggregators all you want, but one thing that they certainly
> *can* help with is in eliminating a lot of redundant effort."
>
> Effort by whom? For what purpose? Do you really expect or want to have the
> background on every RCL Perkins collection in Hawaii and every other
> collector in every other place on Earth in another gigantic
> index-on-the-shelf? With no errors? How about just putting on the Web the
> individual results of careful scholarship and allowing *users* to find them
> through linking? Isn't the aim to connect user with datum, not to keep
> programmers and data managers employed?
> --
> Dr Robert Mesibov
> Honorary Research Associate
> Queen Victoria Museum and Art Gallery, and
> School of Zoology, University of Tasmania
> Home contact: PO Box 101, Penguin, Tasmania, Australia 7316
> Ph: (03) 64371195; 61 3 64371195
> Webpage: http://www.qvmag.tas.gov.au/?articleID=570
>

-- 
Dr Lyubomir Penev
Managing Director
Pensoft Publishers
13a Geo Milev Street
1111 Sofia, Bulgaria
Fax +359-2-8704282
www.pensoft.net
info at pensoft.net