data sharing

Hugh Wilson wilson at BIO.TAMU.EDU
Thu Dec 10 11:19:55 CST 1998


On 10 Dec 98 at 7:11, Dave Vieglais <vieglais at UKANS.EDU> wrote:

> large amounts of unstructured textual information.  But I doubt any of the
> developers of these search engines would suggest using them for ordered
> access to structured data (certainly their marketing folks might have
> another opinion), and would probably be amused at the suggestion.  The
> reason is quite simple- by exporting to a text file, and providing access
> via web search engines......

My reference to full text indexing was not intended to imply that we
should set up text indices for AltaVista to search.  The general
notion is to create a structured text file, index it, and then create
a 'front end' that sends queries to it.  For users able to compose a
query URL themselves, the front end can be bypassed.
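
The idea can be sketched in a few lines of Python.  This is only an
illustration, not the code behind the ASPT system: the "field: value"
file layout, the function names, and the example.org address are all
assumptions, and the only detail taken from the real system is the
'focus' parameter name that appears in the ASPT query URL further down.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

# Hypothetical structured text: one "field: value" pair per line,
# records separated by a blank line. The real file layout is not given.
SAMPLE = """\
name: A. Example
family: Amaranthaceae

name: B. Sample
family: Asteraceae
"""


def parse_records(text):
    """Split structured text into a list of {field: value} dicts."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            field, _, value = line.partition(":")
            record[field.strip()] = value.strip()
        records.append(record)
    return records


def build_index(records):
    """Map each lowercased word to the set of record numbers containing it."""
    index = defaultdict(set)
    for number, record in enumerate(records):
        for value in record.values():
            for word in value.lower().split():
                index[word].add(number)
    return index


def query_from_url(url, records, index):
    """Answer a query URL directly, with no front end involved."""
    params = parse_qs(urlsplit(url).query)
    term = params.get("focus", [""])[0].lower()
    return [records[n] for n in sorted(index.get(term, ()))]
```

A 'front end' in this picture is just a form that builds the same URL
for the user; anyone who knows the parameter name can call
query_from_url-style lookups without it.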

> Processing time is not typically an issue with properly set up DBMS systems.
> Certainly, text searching will usually be faster, but at the loss of a
> tremendous amount of functionality.  And yes, text files can be generated
> from any development system, but in doing so, it is widely recognized that
> information loss is inevitable.

Take a look at the membership directory query system at the ASPT
website at:

http://www.csdl.tamu.edu/FLORA/aspt/aspthome.htm
(select 'membership')

The query page carries links to 'front end' systems that allow the
user to query a full text index using a mouse.  The index originates
as a dBASE IV file sent to me by the ASPT treasurer.  I convert it to
a data table using an archaic, cheap, DOS-based database program and,
using a simple (150 lines of code), home-built program, generate a
structured text file (2 min. run - 1300+ records).  This is shipped
to a UNIX server, where I index it using a locally built webpage.
There is *no* loss of information in this process, and programmed
output allows the base data to be enhanced via insertion of
browser-based email links and functional web links for ASPT members
that cite a web page.  I update the system on almost a daily basis,
and full updates (new file from the treasurer) take about 10 minutes.
Also, the index is available for open URL query (avoiding the 'front
end') if a user has an interest.  The URL:

http://www.csdl.tamu.edu/FLORA/cgi/aspt_query?focus=Amaranthaceae

will pull a listing of those ASPT members who cite this family as a
research interest.
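
For anyone curious what the conversion step described above might look
like today, here is a rough Python sketch.  It is not my 150-line
program; the field names ('email', 'url') and the output layout are
assumptions made for illustration.  The point it shows is that every
field is written out, so nothing is lost, while email and web page
values are wrapped in links on the way through.

```python
from html import escape


def format_record(record):
    """Render one record as structured text lines.

    Every input field is written out, so nothing is dropped; email and
    web-page values are additionally wrapped in HTML anchors.
    """
    lines = []
    for field, value in record.items():
        text = escape(value)
        # 'email' and 'url' are hypothetical field names.
        if field == "email" and value:
            text = f'<a href="mailto:{text}">{text}</a>'
        elif field == "url" and value:
            text = f'<a href="{text}">{text}</a>'
        lines.append(f"{field}: {text}")
    return "\n".join(lines)


def write_structured_text(records):
    """Join formatted records with blank lines between them."""
    return "\n\n".join(format_record(r) for r in records) + "\n"
```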


> search engine or a real DBMS).  Personally I wouldn't mind waiting a few
> seconds for the information to flow straight into my spreadsheet or GIS app
> or whatever in a nice orderly fashion, rather than having to cut and paste
> oddly formatted lumps out of my web browser screen (with the associated risk
> of messing things up).

A 'real' DBMS requires 'real' server-based DBMS costs, complexity,
and administrative overhead that are beyond most collection managers.
No doubt, for small applications - such as the ASPT membership
directory cited above - Bill Gates and others will have local systems
available that can be used by the general public, but larger datasets
with broader functionality, if DBMS-based, will require a 'wait' at
two levels: 1) on-line response time (seconds), and 2) development
time (years and years, with endless symposia, workshops, and
'standards' debate so far).

In terms of relative on-line response time, three systems carry
roughly similar sets of data and returns (text/map) for North
American vascular plants:

server-based DBMS:

http://www.mip.berkeley.edu/bonap/checklist_intro.html
http://plants.usda.gov/plants/

and full text indexing (both text and map returns):

http://www.csdl.tamu.edu/FLORA/b98/check98.htm

A few queries to each will demonstrate differences in responses to
the client with regard to both speed and content.

Hugh D. Wilson
Texas A&M University - Biology
h-wilson at tamu.edu (409-845-3354)
http://www.csdl.tamu.edu/FLORA/Wilson/homepage.html



