data sharing
Stan Blum
sblum at CALACADEMY.ORG
Mon Dec 7 12:17:48 CST 1998
At 09:28 AM 12/7/98 -0500, Hugh Wilson wrote:
>On 5 Dec 98 at 23:46, Dave Vieglais <vieglais at UKANS.EDU> wrote:
>> 2. Specimen based information systems.
>> Any distributed information system being developed using the WWW as the
>> backbone (ie. the http protocol) is likely to fail.
>
>Anyone using Alta-Vista or any other web-based search engine can
>quickly demonstrate that this is not true.
Dave MAY have overstated the case against HTTP -- I'm no expert on
stateless (e.g., HTTP) and stateful (e.g., Z39.50) protocols -- but
Alta-Vista and the other Web crawling and indexing services have absolutely nothing to
do with the argument. To see how reliable full-text indexing is in
information retrieval, all you have to do is issue the same query to
Alta-Vista, Hotbot, Lycos, etc., and compare the results. How come they're
so different? As a scientist trying to assemble a data set from
distributed sources, would you be happy with that kind of a hit-or-miss
approach?
There are lots of "horses" in the race to distributed information
retrieval, and at this point we would be wise to hedge our bets; i.e., keep
experimenting with a variety of approaches. In particular, we shouldn't
dismiss what library and information science has to contribute to the
solution.
>Web technologies are
>pre-adapted to process and display distributed data
*** but only one connection at a time ***
>and one aspect of
>this - full text indexing - provides a simple, fast, non-structured,
>and fully open option.
Full-text indexing has well known limitations. Within the collections data
context, I can't easily and reliably issue a query that distinguishes
between "Smith" as the collector and "Smith" as the identifier
(determiner). I can't search for specimens collected between 200 and 500
meters elevation. More importantly, I can't easily do any kind of
post-processing (re-sorting, sub-querying, etc.) on the results precisely
because the records are either unstructured (full-text) or heterogeneously
structured (i.e., structured, but according to different semantic standards).
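To make the point concrete, here is a minimal Python sketch. The record
layout, field names, and sample values are all invented for illustration and
do not come from any real collection or standard. It only shows why a
substring match over flattened text can't tell the collector "Smith" from the
determiner "Smith", while a query against named fields can, and can also
apply an elevation range.

```python
from dataclasses import dataclass


@dataclass
class Specimen:
    # Hypothetical fields, chosen only to mirror the examples in the text.
    collector: str
    determiner: str
    elevation_m: int

    def as_text(self) -> str:
        # What a full-text indexer sees: one undifferentiated string.
        return f"{self.collector} {self.determiner} {self.elevation_m} m"


# Made-up sample records.
records = [
    Specimen("Smith", "Jones", 350),
    Specimen("Jones", "Smith", 1200),
    Specimen("Lee", "Lee", 450),
]


def full_text_search(term):
    """Match the term anywhere in the flattened record text."""
    return [r for r in records if term.lower() in r.as_text().lower()]


def collected_by(name, min_m, max_m):
    """Match only on the collector field, within an elevation range."""
    return [
        r for r in records
        if r.collector == name and min_m <= r.elevation_m <= max_m
    ]


# Full-text search returns both "Smith" records, regardless of role.
print(len(full_text_search("Smith")))
# The fielded query returns only the specimen Smith collected at 200-500 m.
print(collected_by("Smith", 200, 500))
```

The fielded query only works because every record uses the same field names
with the same meanings, which is the agreement on semantics discussed in the
rest of this message.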
>Development has not required imposition of standards or data
>structures beyond creation of a simple data exchange format:
>
>http://www.csdl.tamu.edu/FLORA/ftc/ftcffld4.htm
Sorry, the FTC data exchange format is as much of a _standard_ as any other
(e.g., HISPID <http://www.rbgsyd.gov.au/HISCOM/HISPID3/hispidright.html> or
any of the TDWG standards). The data provider, the system software, and
the receiver ALL have to agree about a lot of stuff before a useful
interchange of information can occur. Which is exactly what you say here:
> via consensus among those working to computerize their collections
      ^^^^^^^^^
You did in fact arrive at a consensus -- or standard -- through
negotiations among the participants.
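A small Python sketch of what that agreement amounts to in practice. The
field names below are hypothetical; they are NOT taken from the FTC format,
HISPID, or any TDWG standard. Two providers label the same information
differently, and the receiver can only combine their records once everyone
has agreed on a common set of names and meanings.

```python
# Hypothetical per-provider field names mapped onto an agreed common set.
PROVIDER_A_TO_COMMON = {
    "coll": "collector",
    "det": "determiner",
    "elev_m": "elevation_m",
}
PROVIDER_B_TO_COMMON = {
    "CollectorName": "collector",
    "IdentifiedBy": "determiner",
    "Elevation": "elevation_m",
}


def to_common(record, mapping):
    """Rename a provider's fields to the agreed names; drop unmapped fields."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Made-up records describing the same specimen in two local layouts.
a = to_common({"coll": "Smith", "det": "Jones", "elev_m": 350}, PROVIDER_A_TO_COMMON)
b = to_common(
    {"CollectorName": "Smith", "IdentifiedBy": "Jones", "Elevation": 350},
    PROVIDER_B_TO_COMMON,
)

# After mapping, the two records are directly comparable.
print(a == b)
```

The mapping tables are the "standard" here: without them, or something
equivalent, the receiver has no reliable way to know that "coll" and
"CollectorName" mean the same thing.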
>We have established that specimen data development and expression
>systems are not comparable to the phone book. Application of this [Z39.50]
>ancient 'standard' brings up the library card catalog as a model.
Z39.50 absolutely does NOT imply a library card catalog as a model. Z39.50
may have been adopted and implemented first in the library community to
work with MARC records, but version 3 of the standard (and this is a
communication protocol standard, NOT a conceptual or semantic standard) can
use virtually ANY kind of data set syntax, from tag-value (e.g., MARC), to
flat, to hierarchical. I think even XML can be used as a data markup
syntax (but I may be wrong on that one). And given that version 3 was
"approved" this year (I think), you can hardly call this an ancient protocol.
>Are biological collections data and - more important - biological
>collections and staff structured in a way that will allow full
>adoption (large and small facilities) of this standard and its
>associated (pre-web) infrastructure.
Yes. Absolutely. Dave has very effectively developed software that
integrates the Z39.50 protocol into the PC environment. Put simply, if you
have Excel on your PC you can be a user. If you use Access or some other
ODBC-compliant database (which is just about anything) and have a dedicated
Internet connection, you can be a data provider. It's not that hard to do.
>Since this involves
>machine-to-machine interactions that must be conducted across a
>network that is becoming more congested every day, one wonders if
>this potential solution - which certainly appears to be popular with
>the U.S. Federal Government (NBII, ITIS, NSF, etc.) will function to
>link data resources (biodiversity collections) that differ, in many
>fundamental ways, from the established model (libraries).
The bandwidth arguments are irrelevant. People doing distributed searches
for biological specimen data aren't going to have a serious impact on net
traffic; certainly nothing like streaming video, which is the kind of
capacity the telecommunications companies are planning for.
I am very optimistic about the suitability of Z39.50 to our community, and
I think we're going to see some impressive results very soon.
-Stan