data sharing

Thu Dec 10 07:11:33 CST 1998

Comments interspersed below:

> -----Original Message-----
> From: Biological Systematics Discussion List
> [mailto:TAXACOM at cmsa.Berkeley.EDU]On Behalf Of Hugh Wilson
> Sent: Tuesday, December 08, 1998 11:04 AM
> To: Multiple recipients of list TAXACOM
> Subject: Re: data sharing
>
>
> On  7 Dec 98 at 12:17, Stan Blum <sblum at CALACADEMY.ORG> wrote:
>
> > Alta-Vista and the other Web crawling-indexing have absolutely nothing
> > to do with the argument.....
>
> aside from demonstrating fast web access to a large mass of info and
> great success in terms of simple utility.  While the need to 'cull'
> from indexed returns will probably always be part of the game, the
> trick - which needs more research - is to create, from traditional
> databases, 'clean' text files for use as base indices.

Yes, certainly these text search engines are fast, and they work nicely for
large amounts of unstructured textual information.  But I doubt any of the
developers of these search engines would suggest using them for ordered
access to structured data (though their marketing folks might have another
opinion), and they would probably be amused at the suggestion.  The reason
is quite simple: by exporting to a text file and providing access via web
search engines, you are effectively creating a new DBMS, albeit one of very
much reduced functionality.  You lose all the functionality for fielded
searching, and you lose control over the format of information that might be
returned from the search.  You are also creating a new headache for yourself
in maintaining more than a single operational copy of your database
(exporting the data to another format is essentially the same as creating a
replica).
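
To make the fielded-searching point concrete, here is a minimal sketch in
Python.  The records, field names and both functions are made up for
illustration and do not describe any real collection database; sqlite3
stands in for "a real DBMS" and a plain substring match stands in for a
full-text index over an exported text file.

```python
import sqlite3

# Hypothetical specimen records: (taxon, state, county, year).
# Note the second record: collected in Oklahoma, in a county named Texas.
RECORDS = [
    ("Quercus alba", "Texas", "Brazos", 1987),
    ("Quercus stellata", "Oklahoma", "Texas", 1992),
    ("Carya texana", "Texas", "Travis", 1975),
]


def fielded_search(state):
    """Return taxa whose *state* field equals the given value."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE specimen (taxon TEXT, state TEXT, county TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO specimen VALUES (?, ?, ?, ?)", RECORDS)
    rows = conn.execute(
        "SELECT taxon FROM specimen WHERE state = ? ORDER BY taxon", (state,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


def text_search(term):
    """Return every exported text line that mentions the term anywhere."""
    # Flattening each record to one line of text discards the field boundaries.
    lines = [" ".join(str(value) for value in record) for record in RECORDS]
    return [line for line in lines if term.lower() in line.lower()]


if __name__ == "__main__":
    print(fielded_search("Texas"))  # only records whose state is Texas
    print(text_search("Texas"))     # also matches the Oklahoma record
```

The text search has no way to tell the state "Texas" from the county
"Texas", so the user is left to cull the Oklahoma record by eye; the fielded
query never returns it.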

Obviously, it would be better to treat remote queries on the database
essentially the same as local queries, with no loss of functionality.
Retrieving structured information in response to a query would then provide
the opportunity of using that information directly, rather than just staring
at it in a web browser.

Certainly it is possible to provide this functionality through a web search
engine type of interface; however, a lot of functionality would need to be
built to provide this sort of mechanism.  Why not just use one of the
systems already available for handling access to distributed information
systems?

> >
> > Full-text indexing has well known limitations....
>
> and very conspicuous advantages, which include minimal processing
> between query and return and the ability to work with text files that
> can be generated by any user from any development system.

Processing time is not typically an issue with a properly set up DBMS.
Certainly, text searching will usually be faster, but at the loss of a
tremendous amount of functionality.  And yes, text files can be generated
from any development system, but it is widely recognized that information
loss is inevitable in doing so.

>
> > >Since this involves
> > >machine-to-machine interactions that must be conducted across a
> > >network that is becoming more congested every day, one wonders if
> > >this potential solution - which certainly appears to be popular with
> > >the U.S. Federal Government (NBII, ITIS, NSF, etc.) will function to
> > >link data resources (biodiversity collections) that differ, in many
> > >fundamental ways, from the established model (libraries).
> >
> > The bandwidth arguments are irrelevant.  People doing distributed
> > searches for biological specimen data aren't going to have a serious
> > impact on net traffic; certainly nothing like streaming video, which is
> > the kind of capacity the telecommunications companies are planning for.
>
> Am not worried about searches, evidently via local PC nodes, slowing
> down the network.  My point was that the act of searching will be
> *slow*, i.e., lots of lag between query and response.

How much lag time is there currently between a request for information about
a specimen from an arbitrarily selected collection and the response?  Query
response should not be slow in a properly set up database system (be it
based on a text search engine or a real DBMS).  Personally I wouldn't mind
waiting a few seconds for the information to flow straight into my
spreadsheet or GIS app or whatever in a nice orderly fashion, rather than
having to cut and paste oddly formatted lumps out of my web browser screen
(with the associated risk of messing things up).
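
As a rough illustration of what "flowing straight into my spreadsheet" could
look like, here is a small Python sketch.  The function names and the sample
record are hypothetical, and CSV is only one possible exchange format; the
point is that a structured response keeps its fields intact on the
receiving end, with no cutting and pasting.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize query results as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)  # values containing commas are quoted automatically
    return buffer.getvalue()


def csv_to_records(text):
    """Read the CSV text back into one dict per row, keyed by field name."""
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    # Hypothetical query result: one specimen with a comma in its locality.
    response = rows_to_csv(
        ["taxon", "locality"],
        [("Quercus alba", "Brazos Co., Texas")],
    )
    for record in csv_to_records(response):
        print(record["taxon"], "|", record["locality"])
```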

regards,
  Dave V.