Sound data design
Una Smith
una at MINERVA.CIS.YALE.EDU
Wed Nov 24 10:02:26 CST 1993
Re data exchange: So many other tedious aspects of research have been
made so much easier, thanks to computers, that now we wish this too could
be solved automatically, once and for all.
As database manager for a 50 hectare research tree plot, containing
250,000 stems, I struggled with the issue of data exchange and format
on a daily basis for several years. I found that it was in everyone's
best interest if I placed minimal conditions on the format in which
data was contributed by workers who added their own research to the
standard plot datasets. All datasets shared certain attributes, but
most also had special, unique features. Some of these, such as bias
in the sampling methods, could not be encoded in the data structure
in any meaningful way. There is simply a lot of information that
cannot be easily encoded as data, but it is always necessary to draw the
line somewhere. So, in effect, I encoded data in the strict sense in
a minimal structure, and captured the rest as free text. This meant
that I spent a significant part of my own time curating the data.
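
As a rough illustration of the split described above, a contributed dataset might pair a few strictly typed shared attributes with a free-text field for everything else. This is only a sketch in Python: the actual plot schema is not given in this post, and every name here (StemRecord, ContributedDataset, species_code, dbh_mm, find_by_species) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StemRecord:
    # Hypothetical shared attributes; the real plot schema is not described here.
    stem_id: int
    species_code: str
    dbh_mm: float  # assumed measurement field, for illustration only


@dataclass
class ContributedDataset:
    contributor: str
    records: List[StemRecord] = field(default_factory=list)
    # Whatever resists encoding (e.g. bias in the sampling method) stays as free text.
    notes: str = ""


def find_by_species(dataset: ContributedDataset, species_code: str) -> List[StemRecord]:
    """Query only the strictly encoded part; the notes must still be read by a person."""
    return [r for r in dataset.records if r.species_code == species_code]


dataset = ContributedDataset(
    contributor="example contributor",
    records=[StemRecord(1, "AAAA", 120.0), StemRecord(2, "BBBB", 45.5)],
    notes="Sampled only stems near trails; likely biased toward edge species.",
)
```

The structured part stays small enough to query mechanically, while the notes field carries the caveats that a curator or a careful requester still has to read.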
The process of atomizing and defining data is highly intellectual,
and the sorts of problems presented by scientific data are at present
far beyond the capabilities of even the best artificial intelligence.
Science is the process of discovering new things, so I consider it a
largely futile exercise to attempt to encode all information as strict
data; our ideas and our data change so rapidly!
It was always astonishing to me how some scientists assumed it must be
a simple thing to encode the plot data in an easy-to-use format that
anyone could query without assistance. Others had infinite patience
and went to great lengths to define their questions with care. Paradoxically,
it was this latter group whose data requests were easiest
to satisfy. When the researcher has a sophisticated and essentially
correct view of the available data, the data structure need not be as
sophisticated. It is for the people who cannot be bothered to grasp
the fine details that we struggle so hard to *fix* the details into
our data structures, where they cannot (we hope) be avoided.
When I read the SAS-L mailing list, both the abstract and practical
aspects of this issue were frequent topics of discussion. You may
find the list useful to you. Send "subscribe sas-l <Your Name>" to
listserv at uga.cc.uga.edu, or read it via Usenet as comp.soft-sys.sas.
Una Smith smith-una at yale.edu
Department of Biology, Yale University, New Haven, CT 06520-8104 USA