data sharing
Stuart G. Poss
sgposs at SEAHORSE.IMS.USM.EDU
Fri Dec 4 20:48:26 CST 1998
I've found myself agreeing with many of the excellent points made by
nearly all contributors to this thread.
In my opinion perhaps Donn McAlister and Julian Humphries have hit upon
the important context in which to view this problem, namely one of
logistics and cost-benefit. It should be apparent that no single
satisfactory answer will emerge for all users under all circumstances,
since different collections, curators, institutions, and potential
disseminators or users of information are not all equal with respect to
either their needs or their capabilities. To forcefully advocate one
perspective or another requires a finite expenditure of time and energy
to develop systems of wide utility, that may, in all probability, not be
fully transferable to others requiring alternate strategies or
addressing different problems.
It might be more productive to critically address this issue in specific
contexts. To be critical it might be most appropriate to criticize some
obvious shortcomings in one's own efforts.
As part of some Gulf of Mexico Program initiatives, we are building a
taxon-centric directory of information related to species of concern in
a couple of contexts (endangered species
http://www.ims.usm.edu/~musweb/endanger.html; non-indigenous species
http://www.ims.usm.edu/~musweb/invaders.html).
For these efforts (at least for the former, which is largely complete
for fishes), we have gathered distribution data from a variety of
collections containing records. These are largely gathered in various
native and ascii formats and then placed into a single off-line
database. From this, data are plotted on distribution maps or presented
as lists of records, which are then placed on-line.
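As a rough illustration of the merging step described above, the following
is a minimal sketch, not our actual procedure. It assumes each collection
supplies a comma-delimited ASCII export; the source names, column names, and
shared field names are all hypothetical.

```python
import csv
import io

# Hypothetical mapping from each source's column names to one shared schema.
# Source names and column names are illustrative assumptions only.
FIELD_MAPS = {
    "museum_a": {"CatNo": "catalog_number", "Taxon": "species",
                 "Lat": "latitude", "Long": "longitude"},
    "museum_b": {"catalog": "catalog_number", "sci_name": "species",
                 "lat_dd": "latitude", "lon_dd": "longitude"},
}

SHARED_FIELDS = ["source", "catalog_number", "species",
                 "latitude", "longitude", "needs_review"]


def normalize(source, text):
    """Parse one source's delimited export into shared-schema records."""
    mapping = FIELD_MAPS[source]
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        rec = dict.fromkeys(SHARED_FIELDS)  # every field starts as None
        rec["source"] = source
        for src_col, shared_col in mapping.items():
            value = (row.get(src_col) or "").strip()
            rec[shared_col] = value or None
        # Convert coordinates to numbers where possible.
        for key in ("latitude", "longitude"):
            if rec[key] is not None:
                try:
                    rec[key] = float(rec[key])
                except ValueError:
                    rec[key] = None
        # Records lacking usable coordinates are flagged for manual proofing.
        rec["needs_review"] = rec["latitude"] is None or rec["longitude"] is None
        records.append(rec)
    return records
```

Calling normalize() once per source and concatenating the results gives a
single list that can be plotted or listed. Records that are not
geo-referenced are flagged rather than dropped, since they still need
proofing by hand.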
We face several problems. While some of our collaborators are able to
provide us raw data via online searches from which we can derive
virtually all available data pertinent to a given record, for most
records data collection is a very labor intensive process, especially
for records not geo-referenced and whose identifications are suspect.
For both kinds of problems, it is extremely difficult to do a
comprehensive job of proofing, either by us or by collaborators. To
make the data ready for "online dissemination" we face problems involving
the lack of coordinated data structures and inconsistent use of attribute
"fields" (even when sharing the same data structures), as well as
problems with incomplete or inaccurate recording.
We must also take into account different views of how much of the
original data should be disseminated, and hence become restricted
essentially by the "least common denominator". In my view it would be
preferable if we could hyperlink from such compilations directly back
to on-line records residing at individual sites that share some common
calling convention and some minimum tracking features.
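To make the idea of a common calling convention concrete, here is a minimal
sketch under stated assumptions. The host names, the "catalog" parameter, and
the "ref" tracking parameter are all hypothetical; they stand in for whatever
convention participating sites might actually agree upon.

```python
from urllib.parse import urlencode

# Hypothetical registry of record-lookup addresses, one per collection.
# None of these hosts or paths are real.
SITE_ENDPOINTS = {
    "museum_a": "http://collections.example.org/museum_a/record",
    "museum_b": "http://collections.example.org/museum_b/record",
}


def record_url(site, catalog_number, referrer=None):
    """Build a link back to the on-line record held at its home site."""
    params = {"catalog": catalog_number}
    if referrer:
        # Minimal tracking: tell the home site which compilation sent the user.
        params["ref"] = referrer
    return SITE_ENDPOINTS[site] + "?" + urlencode(params)
```

With something of this kind, a compilation need store only the site and
catalog number for each record, and the home institution remains in control
of how much of the original data is shown.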
I am unconvinced that the risk facing species that may result from the
inappropriate use of our data ranks high among the primary threats posed
to these species, so I would advocate full disclosure, warts and all.
This is particularly true since my tax money paid to collect, record,
and study it, and will continue to pay to refine and further
correct it. Perhaps others also share my peeve as a taxpayer of too
often having to pay more than once to get something done right. That
some might misuse or complain about the current state of collections data
can be looked upon as opportunities to educate as well as to ask for
help in resolving problems, rather than excuses to never accept the
challenge of making them more widely available.
I share the opinion that if we don't get this data out and more widely
disseminated and appreciated among the broader public, appropriate
management agencies, as well as ourselves, many of these species will
disappear for reasons unrelated to collections-based research and
archival. This is not to deny that knowledge of some localities (usually
last remaining ones) could pose risks if improperly used. In any event,
not all the records are "my" data, so I must accept restrictions placed
on their use.
Another major problem with our approach is that despite a lot of
eyestrain and backache, we wind up with only a static view, hardly what
is needed to adequately manage futures for these species. However, a
static view does have some advantages, since it makes it easy for others
to recognize what records were available to us when we reached
our conclusions. Perhaps a bit of clever programming, freely shared,
could largely resolve this problem.
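One possible form such programming might take is sketched below, purely as
an assumption on my part: keep a dated snapshot of which records were in
hand, so that a dynamic view can still report what was available when a
conclusion was reached. The function names and the "catalog_number" field
are hypothetical.

```python
import hashlib
import json


def snapshot(records, as_of):
    """Record which catalog numbers were in hand on a given date."""
    ids = sorted(r["catalog_number"] for r in records)
    # A digest lets others confirm they are looking at the same record set.
    digest = hashlib.sha256(json.dumps(ids).encode("utf-8")).hexdigest()
    return {"as_of": as_of, "record_ids": ids, "digest": digest}


def added_since(old_snapshot, records):
    """List catalog numbers that were not present in an earlier snapshot."""
    known = set(old_snapshot["record_ids"])
    return sorted(r["catalog_number"] for r in records
                  if r["catalog_number"] not in known)
```

A published map or list could then cite the snapshot date and digest, while
the underlying database continues to change.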
I believe we need to emphasize that collections-related research is a
dynamic process and that like all good science it is continuously
advancing as new information is collected and previously available data
refined. It is pointless to worry that if we put our data on-line, then
somehow we will have given away reasons for continued support of our
science. To the contrary, it would be evident for all to see that these
data change dramatically, are in constant use, widely shared, and
integral to numerous aspects of science more broadly construed. Few
advocate unplugging atomic clocks because they just keep measuring time
in the same old way, second by second, hour by hour, without surprises.
Collections data too have a similar important referential component and
are as essential to Biology as atomic clocks are to Physics.
Certainly, we would like to overcome these numerous significant
limitations and inadequacies of "our" data, as well as in accessing these
data. We would love to have automatic mechanisms to more adequately
track changes to the data. For example, it is absolutely critical for
us to understand whether the lat/long values accompanying the data were taken
at the time of capture or subsequently, since we seek, in principle at
least, the ability to return to a site to confirm a record at a specific
location or be unable to confirm a record despite intensive sampling at
that precise location. One area where more effort would be helpful
would be for all collections to provide more ready access to
station/locality data that may not always be uniformly associated with
a specimen. This is often problematic for older records, which are less
likely to be fully computerized.
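As a sketch of the kind of tracking meant here, and no more than that, a
record might carry both a note on where its coordinates came from and a
history of changes. The class names, field names, and the three
coordinate_source values are hypothetical assumptions, not an existing
standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Change:
    """One logged edit to a record."""
    changed_on: date
    field_name: str
    old_value: object
    new_value: object
    note: str = ""


@dataclass
class LocalityRecord:
    catalog_number: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    # Assumed vocabulary: "at_capture", "assigned_later", or "unknown".
    coordinate_source: str = "unknown"
    history: List[Change] = field(default_factory=list)

    def update(self, field_name, new_value, changed_on, note=""):
        """Change a field and log the old and new values."""
        old_value = getattr(self, field_name)
        if old_value != new_value:
            self.history.append(
                Change(changed_on, field_name, old_value, new_value, note))
            setattr(self, field_name, new_value)
```

Anyone returning to a site could then check coordinate_source before
deciding how far to trust the position, and the history shows when and why
a value was revised.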
We would also love to skip all the time-consuming drudgery of having to
bring individual datasets into a single format or continually badger our
most patient colleagues yet again for corrections or additional records.
It would be much easier to simply click a mouse and have it all handed
to us painlessly so that we can get about analyzing implications, while
hopefully, not too shamelessly failing to acknowledge the considerable
efforts of those whose work has made ours possible. However, that is a
demand on their resources, which is difficult to make, since it would be
incumbent upon us to be able to reciprocate and, even with generous
support, we ourselves lack the resources to fully respond. But, perhaps
just as few of us still swing from tree to tree to get to work in the
morning, we might share new behaviors that might make life easier for
all of us.
One could go on to elaborate many specific nuances and additional
problems that greatly limit the usefulness of our work. We must
recognize that the value of particular solutions is largely relative to
the alternatives realistically available to us. Rather than arguing
these points in a vacuum, it might prove more useful for us to share
what works (and what doesn't) in various contexts, so that we can all
take advantage of new and innovative approaches, avoid common pitfalls,
and better appreciate the numerous differences that make each situation
unique. I suggest that by sharing the relative merits of specific
approaches under "real world" conditions, we might more quickly arrive
at what might form consensus concerning costs and benefits, not to
mention learn more about datasets that we can all share and appreciate.
It might also be worthwhile, for those having such data, to share how
often some of our favorite solutions or data sets are actually used by
others. It never ceases to amaze me that as I look at my webstats some
of what I perceive to be my best ideas are so consistently ignored.
Doesn't anybody else hear the drums?
--
_____________________________________________________________________
Stuart G. Poss E-mail: sgposs at seahorse.ims.usm.edu
Senior Research Scientist & Curator Tel: (228)872-4238
Gulf Coast Research Laboratory FAX: (228)872-4204
P.O. Box 7000
Ocean Springs, MS 39566-7000