data sharing

Mon Dec 7 12:17:48 CST 1998

At 09:28 AM 12/7/98 -0500, Hugh Wilson wrote:
>On  5 Dec 98 at 23:46, Dave Vieglais <vieglais at UKANS.EDU> wrote:
>> 2. Specimen based information systems.
>> Any distributed information system being developed using the WWW as the
>> backbone (ie. the http protocol) is likely to fail.
>
>Anyone using Alta-Vista or any other web-based search engine can
>quickly demonstrate that this is not true.

Dave MAY have overstated the case against HTTP -- I'm no expert on
stateless (e.g., HTTP) and stateful (e.g., Z39.50) protocols -- but
Alta-Vista and the other Web crawling-and-indexing services have absolutely
nothing to do with the argument.  To see how reliable full-text indexing is in
information retrieval, all you have to do is issue the same query to
Alta-Vista, Hotbot, Lycos, etc., and compare the results.  How come they're
so different?  As a scientist trying to assemble a data set from
distributed sources, would you be happy with that kind of a hit-or-miss
approach?

There are lots of "horses" in the race to distributed information
retrieval, and at this point we would be wise to hedge our bets; i.e., keep
experimenting with a variety of approaches.  In particular, we shouldn't
dismiss what library and information science has to contribute to the
solution.

>Web technologies are
>pre-adapted to process and display distributed data

*** but only one connection at a time ***

>and one aspect of
>this - full text indexing - provides a simple, fast, non-structured,
>and fully open option.

Full-text indexing has well-known limitations.  Within the collections data
context, I can't easily and reliably issue a query that distinguishes
between "Smith" as the collector and "Smith" as the identifier
(determiner).  I can't search for specimens collected between 200 and 500
meters elevation.  More importantly, I can't easily do any kind of
post-processing (re-sorting, sub-querying, etc.) on the results precisely
because the records are either unstructured (full-text) or heterogeneously
structured (i.e., structured, but according to different semantic standards).
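To make the contrast concrete, here is a minimal Python sketch.  The three
records, the field names, and the two search functions are all made up for
illustration; they are not drawn from FTC, HISPID, or any real system.

```python
from dataclasses import dataclass


@dataclass
class Specimen:
    # Hypothetical field names, chosen only for this illustration.
    taxon: str
    collector: str
    determiner: str
    elevation_m: int

    def as_text(self) -> str:
        # Flatten the record into the kind of string a full-text indexer sees.
        return (f"{self.taxon} coll. {self.collector} "
                f"det. {self.determiner} elev. {self.elevation_m} m")


# Invented sample records.
RECORDS = [
    Specimen("Chenopodium berlandieri", "Smith", "Jones", 350),
    Specimen("Chenopodium album", "Jones", "Smith", 120),
    Specimen("Cucurbita pepo", "Smith", "Smith", 900),
]


def full_text_search(records, term):
    # Matches the term anywhere in the flattened text, whatever its role.
    term = term.lower()
    return [r for r in records if term in r.as_text().lower()]


def fielded_search(records, collector=None, min_elev=None, max_elev=None):
    # Filters on named fields, so "Smith as collector" and an elevation
    # range can each be expressed exactly.
    hits = []
    for r in records:
        if collector is not None and r.collector != collector:
            continue
        if min_elev is not None and r.elevation_m < min_elev:
            continue
        if max_elev is not None and r.elevation_m > max_elev:
            continue
        hits.append(r)
    return hits


# "Smith" appears somewhere in every record, as collector or determiner.
text_hits = full_text_search(RECORDS, "Smith")

# Smith as collector, between 200 and 500 meters elevation.
field_hits = fielded_search(RECORDS, collector="Smith",
                            min_elev=200, max_elev=500)

# Post-processing (re-sorting) works because the fields are typed.
by_elevation = sorted(fielded_search(RECORDS, collector="Smith"),
                      key=lambda r: r.elevation_m)
```

The full-text search cannot tell the collector from the determiner, while
the fielded search can, and it can also apply the elevation range and
re-sort the hits.  That only works because every record shares one structure.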

>Development has not required imposition of standards or data
>structures beyond creation of a simple data exchange format:
>
>http://www.csdl.tamu.edu/FLORA/ftc/ftcffld4.htm

Sorry, the FTC data exchange format is as much of a _standard_ as any other
(e.g., HISPID <http://www.rbgsyd.gov.au/HISCOM/HISPID3/hispidright.html> or
any of the TDWG standards).  The data provider, the system software, and
the receiver ALL have to agree about a lot of stuff before a useful
interchange of information can occur.  Which is exactly what you say here:

> via consensus among those working to computerize their collections
      ^^^^^^^^^
You did in fact arrive at a consensus -- or standard -- through
negotiations among the participants.

>We have established that specimen data development and expression
>systems are not comparable to the phone book.  Application of this [Z39.50]
>ancient 'standard' brings up the library card catalog as a model.

Z39.50 absolutely does NOT imply a library card catalog as a model.  Z39.50
may have been adopted and implemented first in the library community to
work with MARC records, but version 3 of the standard (and this is a
communication protocol standard, NOT a conceptual or semantic standard) can
use virtually ANY kind of data set syntax, from tag-value (e.g., MARC), to
flat, to hierarchical.  I think even XML can be used as a data markup
syntax (but I may be wrong on that one). And given that version 3 was
"approved" this year (I think), you can hardly call this an ancient protocol.

>Are biological collections data and - more important - biological
>collections and staff structured in a way that will allow full
>adoption (large and small facilities) of this standard and its
>associated (pre-web) infrastructure?

Yes. Absolutely.  Dave has very effectively developed software that
integrates the Z39.50 protocol into the PC environment.  Put simply, if you
have Excel on your PC you can be a user.  If you use Access or some other
ODBC-compliant database (which is just about anything) and have a dedicated
Internet connection, you can be a data provider.  It's not that hard to do.

>Since this involves
>machine-to-machine interactions that must be conducted across a
>network that is becoming more congested every day, one wonders if
>this potential solution - which certainly appears to be popular with
>the U.S. Federal Government (NBII, ITIS, NSF, etc.) will function to
>link data resources (biodiversity collections) that differ, in many
>fundamental ways, from the established model (libraries).

The bandwidth arguments are irrelevant.  People doing distributed searches
for biological specimen data aren't going to have a serious impact on net
traffic; certainly nothing like streaming video, which is the kind of
capacity the telecommunications companies are planning for.

I am very optimistic about the suitability of Z39.50 to our community, and
I think we're going to see some impressive results very soon.

-Stan