data sharing

Dave Vieglais vieglais at UKANS.EDU
Thu Dec 10 12:33:14 CST 1998


> -----Original Message-----
> From: Biological Systematics Discussion List
> [mailto:TAXACOM at cmsa.Berkeley.EDU]On Behalf Of Hugh Wilson
> Sent: Thursday, December 10, 1998 10:20 AM
> To: Multiple recipients of list TAXACOM
> Subject: Re: data sharing
>
>
> On 10 Dec 98 at 7:11, Dave Vieglais <vieglais at UKANS.EDU> wrote:
>
> > large amounts of unstructured textual information.  But I doubt
> any of the
> > developers of these search engines would suggest using them for ordered
> > access to structured data (certainly their marketing folks might have
> > another opinion), and would probably be amused at the suggestion.  The
> > reason is quite simple- by exporting to a text file, and
> providing access
> > via web search engines......
>
> My reference to full text indexing was not intended to imply that we
> should set up text indices for Alta-vista to search.  The general
> notion is to create a structured text file, index it, and then create
> a 'front end' that sends queries to it.  For users able to frame up a
> query URL, the front end can be bypassed.

Hmm.  You have just described a DBMS, albeit a rather simple one.  Rule
number one of information management goes something like: never, ever
create your own database management system.  They already exist, and
people have invested huge amounts of time and intellectual effort into
creating them and optimizing them for all kinds of data.  Why disregard
all this tremendous achievement and simply put the data in a text file?
Is it not easier to keep the data in a database (Access, dBase,
whatever) and have your query posted directly against the database?
Are you suggesting that it is easier and less expensive to go through
the many steps of creating a text file and indexing it each time you
update the database?  If you *must* go through a web interface, at
least interface directly to the database (there are *many* tools for
this, free or otherwise), rather than go through the painful process of
rebuilding yet another flavor of the same thing.
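
A minimal sketch of what "interfacing directly" means in practice;
everything concrete here (the JDBC URL, the table, and the column
names) is an assumption for illustration, not a description of any
existing system:

  import java.sql.*;

  public class DirectQuery {
      public static void main(String[] args) throws SQLException {
          // Placeholder JDBC URL; the real value depends on which DBMS
          // (Access, dBase via ODBC, whatever) actually holds the data.
          String url = "jdbc:yourdriver:members";
          try (Connection con = DriverManager.getConnection(url);
               PreparedStatement ps = con.prepareStatement(
                   "SELECT name, email FROM members"
                 + " WHERE research_interest = ?")) {
              ps.setString(1, "Amaranthaceae");
              try (ResultSet rs = ps.executeQuery()) {
                  // The DBMS does its own indexing and matching; there
                  // is no dump-and-reindex step to repeat after every
                  // update to the data.
                  while (rs.next()) {
                      System.out.println(rs.getString("name")
                          + "\t" + rs.getString("email"));
                  }
              }
          }
      }
  }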

>
> > Processing time is not typically an issue with properly setup
> DBMS systems.
> > Certainly, text searching will usually be faster, but at the loss of a
> > tremendous amount of functionality.  And yes, text files can be
> generated
> > from any development system, but in doing so, it is widely
> recognized that
> > information loss is inevitable.
>
> Take a look at the membership directory query system at the ASPT
> website at:
>
> http://www.csdl.tamu.edu/FLORA/aspt/aspthome.htm
> (select 'membership')
>
> The query page carries links to 'front end' systems that allow the
> user to query a full text index using a mouse.  The index originates
> as a DbaseIV file sent to me by the ASPT treasurer.  I convert it to
> a data table using an archaic, cheap, DOS-based database program and,
> using a simple (150 lines of code), home-built program, generate a
> structured text file (2 min. run - 1300+ records).  This is shipped
> to a UNIX server and indexed by me using a locally built webpage.
> There is *no* loss of information in this process and programmed
> output allows the base data to be enhanced via insertion of
> browser-based email links and functional web links for ASPT members
> that cite a web page.  I update the system on almost a daily basis
> and full updates (new file from the treasurer) take about 10 minutes.
>  Also, the index is available for open URL query (avoiding the 'front
> end') if a user has an interest.  The URL:
>
> http://www.csdl.tamu.edu/FLORA/cgi/aspt_query?focus=Amaranthaceae
>
> will pull a listing of those ASPT members listing this family as an
> item of research interest.

I will argue that there quite likely IS a loss of information.  The
original database probably exists as more than a single table.  With
this come the relationships between data elements, plus the business
rules and so forth associated with the maintenance of that data.
Simply dumping the data to a text file reduces the information content,
since you do not propagate these relationships and rules, let alone the
field definitions and so forth.  Even if the characters are the same as
in the original DB (i.e. the data is the same), the information content
is reduced.  This information loss grows as the complexity of your data
increases.
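
To make the loss concrete: a membership database of the sort under
discussion might hold two related tables, and the relationship itself
is part of the information.  A hypothetical sketch (all names invented;
any JDBC-capable DBMS would do):

  import java.sql.*;

  public class RelationsMatter {
      public static void main(String[] args) throws SQLException {
          // Placeholder URL again, purely for illustration.
          try (Connection con =
                   DriverManager.getConnection("jdbc:yourdriver:demo");
               Statement st = con.createStatement()) {
              // One member may list many research interests.  The key,
              // the NOT NULL rules, and the one-to-many structure are
              // all information held by the DBMS itself.
              st.execute("CREATE TABLE members ("
                  + "id INT PRIMARY KEY, name VARCHAR(80) NOT NULL)");
              st.execute("CREATE TABLE interests ("
                  + "member_id INT NOT NULL REFERENCES members(id), "
                  + "family VARCHAR(80) NOT NULL)");
              // A text dump of the joined rows keeps the characters but
              // discards the keys, the constraints, and the field types.
          }
      }
  }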

Secondly, you argue that this is an inexpensive system.  Many steps are
taken to propagate the information from the Dbase IV file through to a
system that can be queried through your web interface.  This is fine if
one enjoys that kind of activity; however, automating exactly this sort
of repetitive work is what computers were originally developed for.  It
is a trivial matter to make your Dbase IV file directly accessible from
the web with a few lines of Java or the language of your choice.  It
would take, say, a day to set up a database interface through the web
(about the same as you initially expended setting up the convoluted
interface).  Then it is done.  You spend 10 minutes every day
maintaining your interface as described above.  That's approximately an
hour or so a week, or 50 hours a year.  With a direct connection to the
database, the additional maintenance time is zero.
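
In that spirit, a minimal sketch of a direct web-to-database interface
in Java, using the JDK's built-in HTTP server.  The port, JDBC URL,
table, and column names are all assumptions for illustration:

  import com.sun.net.httpserver.HttpServer;
  import java.io.OutputStream;
  import java.net.InetSocketAddress;
  import java.nio.charset.StandardCharsets;
  import java.sql.*;

  public class MemberWeb {
      public static void main(String[] args) throws Exception {
          HttpServer server =
              HttpServer.create(new InetSocketAddress(8080), 0);
          // e.g. http://localhost:8080/members?focus=Amaranthaceae
          server.createContext("/members", exchange -> {
              String q = exchange.getRequestURI().getQuery();
              String focus = (q != null && q.startsWith("focus="))
                  ? q.substring(6) : "";
              StringBuilder out = new StringBuilder();
              // Placeholder JDBC URL; the query runs against the live
              // database, so updates need no dump-and-reindex cycle.
              try (Connection con = DriverManager
                       .getConnection("jdbc:yourdriver:members");
                   PreparedStatement ps = con.prepareStatement(
                       "SELECT name, email FROM members"
                     + " WHERE research_interest = ?")) {
                  ps.setString(1, focus);
                  try (ResultSet rs = ps.executeQuery()) {
                      while (rs.next()) {
                          out.append(rs.getString(1)).append('\t')
                             .append(rs.getString(2)).append('\n');
                      }
                  }
              } catch (SQLException e) {
                  out.append("query failed: ")
                     .append(e.getMessage()).append('\n');
              }
              byte[] body =
                  out.toString().getBytes(StandardCharsets.UTF_8);
              exchange.sendResponseHeaders(200, body.length);
              try (OutputStream os = exchange.getResponseBody()) {
                  os.write(body);
              }
          });
          server.start();
      }
  }

Once something of this shape is running, a daily update to the
underlying data requires no further step at all.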

However, even with this system (direct connection to the database), you
are *still* restricted to working with the information through a web
browser.  By using a protocol designed for information retrieval (Z39.50
and others), rather than information perusal (HTTP), one could gain
access to the data elements directly, rather than going through further
tedious steps of translation.
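
The difference is easiest to see from the client side.  A hypothetical
sketch (the endpoint and the record layout are invented) of what
"access to the data elements directly" buys, compared with cutting and
pasting from a rendered page:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.nio.charset.StandardCharsets;

  public class StructuredClient {
      public static void main(String[] args) throws Exception {
          // Hypothetical endpoint returning tab-delimited records
          // rather than an HTML page.  A true retrieval protocol such
          // as Z39.50 goes further (typed fields, defined record
          // syntaxes), but the contrast with screen-scraping is the
          // same.
          URL url = new URL(
              "http://example.org/members?focus=Amaranthaceae");
          try (BufferedReader in = new BufferedReader(
                   new InputStreamReader(url.openStream(),
                                         StandardCharsets.UTF_8))) {
              String line;
              while ((line = in.readLine()) != null) {
                  String[] f = line.split("\t");
                  if (f.length < 2) continue;  // skip malformed rows
                  // Each field arrives in a known position, ready to
                  // flow into a spreadsheet or GIS application.
                  System.out.println(f[0] + " <" + f[1] + ">");
              }
          }
      }
  }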

>
> > search engine or a real DBMS).  Personally I wouldn't mind waiting a few
> > seconds for the information to flow straight into my
> spreadsheet or GIS app
> > or whatever in a nice orderly fashion, rather than having to
> cut and paste
> > oddly formatted lumps out of my web browser screen (with the
> associated risk
> > of messing things up).
>
> A 'real' DBMS requires 'real' server based DBMS costs, complexity,
> and administrative overhead that is beyond most collection managers.
> No doubt, for small applications - such as the ASPT membership
> directory cited above - Bill Gates and others will have local systems
> available that can be used by the general public, but larger datasets
> with broader functionality, if DBMS-based, will require a 'wait' at
> two levels, 1) on-line response time (seconds), and 2) development
> time (years and years, with endless symposia, workshops, and
> 'standards' debate so far).

Are you suggesting that collections are maintained as indexed text
files?  I think such a proposal doesn't require much response.  A small
DBMS such as Access or even FileMaker can handle a few hundred thousand
records easily.  There is really no need for Oracle or one of those big
systems unless you are working with really big databases, as in tables
with more than 10^6 records, and dozens of tables.

I'm a little unsure about your comments on the development time.  The
reason standards take a long time to complete is simply that they
require community agreement.  Standards for data interchange are very
tedious things to develop (primarily because no one person can
appreciate the needs of all others), but in those communities where
they exist, there have been tremendous benefits.  The bibliographic,
geospatial, astronomical, and financial domains have all adopted
standards for data interchange and have never looked back.  Standards
development continues to evolve in all of these as new protocols and
desires for data interchange come to the fore.

A standard for interchange of collections and taxonomic authority
information *could* be obtained very rapidly if we all decided there
was a pressing need for it, and focused on solving the problem rather
than expending energy on debating whether such standards are necessary.

In any case, for the type of system proposed by Hugh, standards are
still required if any compatibility between different instances is to
be expected.  The semantics of the query, the response of the server
(errors or OK), and the syntax of the result set all need to be defined
no matter what protocol is used.
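
As a sketch of how little actually has to be agreed for two instances
to interoperate, those three elements might be modelled as below.
Every name here is invented; the point is only that each piece needs a
shared definition:

  // What may be asked: attribute names drawn from an agreed list.
  interface CollectionQuery {
      String attribute();  // e.g. "family" or "collector"
      String value();      // the search term
  }

  // How the server answers: success or a defined error condition.
  enum QueryStatus { OK, UNSUPPORTED_ATTRIBUTE, SERVER_ERROR }

  // The result set: an agreed record layout, not free-form text.
  record QueryResult(QueryStatus status,
                     java.util.List<String[]> records) {}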


> In terms of relative on-line response time, three systems carry
> roughly similar sets of data and returns (text/map) for North
> American vascular plants:
>
> server-based DBMS:
>
> http://www.mip.berkeley.edu/bonap/checklist_intro.html
> http://plants.usda.gov/plants/
>
> and full text indexing (both text and map returns):
>
> http://www.csdl.tamu.edu/FLORA/b98/check98.htm
>
> a few queries to each will demonstrate differences in responses to
> the client with regard to both speed and content

> Hugh D. Wilson
> Texas A&M University - Biology
> h-wilson at tamu.edu (409-845-3354)
> http://www.csdl.tamu.edu/FLORA/Wilson/homepage.html

Yes, quite nice.  Each of them responded very rapidly, with no
discernible difference in response time except for the Plants system.
A quick check indicated a very slow network connection between Denver
and what appears to be Fort Collins.  Hence the slower response
perceived there is most likely due to network hardware problems rather
than the performance of the database as such.

regards,
  Dave V.

---------------------
Dave Vieglais
Natural History Museum and Biodiversity Research Center
phone (785) 864 4540 fax (785) 864 5335



