data sharing

Mon Dec 7 16:29:14 CST 1998

At 12:17 PM 12/7/98 -0800, Stan Blum wrote:
>Alta-Vista and the other Web crawling-indexing have absolutely nothing to
>do with the argument.  To see how reliable full-text indexing is in
>information retrieval, all you have to do is issue the same query to
>Alta-Vista, Hotbot, Lycos, etc., and compare the results.  How come they're
>so different?

Well, this is not really the issue.  They are different because they
use different strategies to try to index a *very* moving target,
hardly comparable to label data for specimens.   Even in our wildest
views of what specimen data are, they don't approach the content
variation of the web (by several orders of magnitude).  In fact, I dare say if
we put museum records up as "files", every major search engine
would produce nearly identical hits (once they found them, of course).

>
>>Web technologies are
>>pre-adapted to process and display distributed data
>
>*** but only one connection at a time ***

That's not quite true.  While it might not be relevant, not only are
browsers multi-threaded, but the servers you connect to can definitely
run multiple searches against lots of machines at once (see FishGopher).
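
A minimal sketch of that kind of server-side fan-out, in Python.  The
source names and records are invented for illustration, and in-memory
lists stand in for real network calls; this is not how FishGopher itself
is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "servers"; a real system would make network calls.
SOURCES = {
    "museum_a": ["Smith 1987 Texas", "Jones 1990 Ohio"],
    "museum_b": ["Smith 1975 Peru"],
    "museum_c": ["Lee 1992 Utah"],
}

def search_source(name, term):
    """Return (source name, matching records) for a single source."""
    hits = [rec for rec in SOURCES[name] if term.lower() in rec.lower()]
    return name, hits

def search_all(term):
    """Query every source concurrently and merge the results by source."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        return dict(pool.map(lambda n: search_source(n, term), SOURCES))

if __name__ == "__main__":
    print(search_all("smith"))
```

The user still makes one connection; the server is the one talking to
many machines at once.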

>
>Full-text indexing has well known limitations.  Within the collections data
>context, I can't easily and reliably issue a query that distinguishes
>between "Smith" as the collector and "Smith" as the identifier
>(determiner).

Well, I might say you can't reliably do fielded searches for
Tom Smith when the data are entered as T. Smith or Smith,T.
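
To make that concrete, here is a small Python sketch with invented
records.  An exact match on the collector field finds only one of the
three Smith variants; a crude normalization to (surname, first initial),
which is my own assumption and not any standard, finds all three.

```python
# Hypothetical records showing the same collector entered three ways.
RECORDS = [
    {"id": 1, "collector": "Tom Smith"},
    {"id": 2, "collector": "T. Smith"},
    {"id": 3, "collector": "Smith,T."},
    {"id": 4, "collector": "Jones, A."},
]

def exact_match(records, name):
    """Fielded search as usually done: the field must equal the query."""
    return [r["id"] for r in records if r["collector"] == name]

def name_key(name):
    """Reduce a name to (surname, first initial); a crude normalization."""
    if "," in name:
        surname, given = [p.strip() for p in name.split(",", 1)]
    else:
        parts = name.split()
        given, surname = " ".join(parts[:-1]), parts[-1]
    return surname.lower(), given[:1].upper()

def normalized_match(records, name):
    """Fielded search after normalizing both the query and the data."""
    key = name_key(name)
    return [r["id"] for r in records if name_key(r["collector"]) == key]

if __name__ == "__main__":
    print(exact_match(RECORDS, "Tom Smith"))
    print(normalized_match(RECORDS, "Tom Smith"))
```

Of course the normalized version will also match a Tim Smith, which is
the usual trade of precision for recall.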

>I can't search for specimens collected between 200 and 500
>meters elevation.  More importantly, I can't easily do any kind of
>post-processing (re-sorting, sub-querying, etc.) on the results precisely
>because the records are either unstructured (full-text) or heterogeneously
>structured (i.e., structured, but according to different semantic standards).
>

There is nothing inherent in full text indexing that says you can't
return structured data.  There are advantages and disadvantages
to both fielded and full text versions of databases.   For really high quality
data from homogeneous sources, I would guess you can more easily
find what you are looking for with a traditional fielded database.  But
with data from mixed sources and/or with annotation in note fields, it's
often easier to find the record you are looking for (even if it means looking
through extra records) with a full text engine.
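
As a sketch of the first point, here is a toy full text index in Python
that returns whole structured records rather than text snippets.  The
records and field names are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical specimen records; field names are illustrative only.
RECORDS = [
    {"id": 1, "collector": "T. Smith", "identifier": "A. Jones",
     "notes": "Taken at 350 m near the river"},
    {"id": 2, "collector": "A. Jones", "identifier": "T. Smith",
     "notes": "Label damaged"},
    {"id": 3, "collector": "B. Lee", "identifier": "B. Lee",
     "notes": "Donated by Smith family"},
]

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records):
    """Map each token to the ids of records containing it in any field."""
    index = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            if field == "id":
                continue
            for tok in tokenize(value):
                index[tok].add(rec["id"])
    return index

def search(index, records, query):
    """Return the full structured records that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in tokens))
    by_id = {r["id"]: r for r in records}
    return [by_id[i] for i in sorted(ids)]

if __name__ == "__main__":
    hits = search(build_index(RECORDS), RECORDS, "smith")
    # Post-process the structured hits: keep Smith as collector only.
    print([r for r in hits if "Smith" in r["collector"]])
```

The full text search pulls in the extra records (Smith as identifier,
Smith in the notes), but since the hits come back with their fields
intact, they can still be filtered or re-sorted afterwards.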

.....
>
>>Since this involves
>>machine-to-machine interactions that must be conducted across a
>>network that is becoming more congested every day, one wonders if
>>this potential solution ....

>The bandwidth arguments are irrelevant.  People doing distributed searches
>for biological specimen data aren't going to have a serious impact on net

I would think the question is not whether we impact the net, but the reverse:
how many distributed data sources are practical before net congestion makes
retrieval too painful to be useful... 5? 25? 100?  I think it is a great idea
to work on distributed search systems, but data warehousing remains
a viable alternative for the near future, particularly where the audience (and
data sources) are international.   I will be very interested in the performance
of Dave's system with realistic numbers of data sources spread around the
world.  I will be thrilled if it works.  It will also be interesting to see what
effect Z39.50-type transactions (vs. HTTP) have on performance.
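
A back-of-envelope way to frame the 5/25/100 question, sketched in
Python.  It assumes each source answers within the timeout independently
with the same probability p; both the independence and the value of p
are assumptions for illustration, not measurements of any real system.

```python
def prob_all_respond(p, n):
    """Chance that all n independent sources answer within the timeout."""
    return p ** n

if __name__ == "__main__":
    p = 0.95  # assumed per-source success rate, not a measured figure
    for n in (5, 25, 100):
        print(f"{n:>3} sources: {prob_all_respond(p, n):.3f}")
```

Under those assumptions the chance of a complete answer drops quickly as
sources are added, so a distributed system would need to tolerate
partial results.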

Julian Humphries