[Taxacom] formation of zoological names with Mc, Mac, etc.

Tue Sep 1 00:01:57 CDT 2009

> The question is whether one can have a machine 
> look at a text string and *decide* which link should be used, 
> rather than having a human *tell* it (a "time consuming 
> process"); this is the crux of Francisco's point:

Yes, and also the crux of the services being developed around the Global
Names Index, which are based on existing efforts to implement "fuzzy
matching" [Tony Rees -- this is your cue...]
That's not my forte, but what I'd like to see emerge from such services is
not that the computers *decide* how to link, but rather that they report some
sort of standard metric of likelihood, which a human can use to make the final
call.  Thus, though it may involve a human, I think the services could cut
down the "time consuming" part dramatically (with access to the right
databases, page images, etc.)  This is slightly different from how Doug
described it, but ultimately it would end up being the same thing.  The easy
matches (only one, high-probability match) would probably be accepted
automatically in most cases; but the not-so-certain matches would be flagged
for human scrutiny.  But even in those cases, if all the necessary resources
are one mouse-click away (e.g., page images of original literature through
BHL), then the job would be a LOT easier than it is now.

Yes, yes, I know that a certain small fraction of auto-accepted links will
be wrong, but I suspect this would be a tiny, tiny fraction, and even most
of those would be discovered soon enough, then easily resolved by a human.
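
To make the triage idea concrete, here is a rough sketch in Python.  It is
not how GNI or any existing fuzzy-matching service works; the similarity
score (difflib's ratio from the standard library), the two thresholds, the
function names and the sample names are all assumptions chosen purely for
illustration.  A lone high-scoring candidate is accepted automatically;
anything less clear-cut is handed to a human along with its scores.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a real likelihood metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(query, reference_names, accept=0.95, review=0.80):
    """Classify a name string as 'auto', 'review' or 'none'.

    The thresholds are arbitrary illustrative values, not tuned ones.
    Returns (status, candidates), candidates being (score, name) pairs
    sorted from best to worst.
    """
    scored = sorted(((score(query, n), n) for n in reference_names),
                    reverse=True)
    candidates = [(s, n) for s, n in scored if s >= review]
    strong = [c for c in candidates if c[0] >= accept]
    # Easy case: exactly one plausible candidate, and it scores very high.
    if len(candidates) == 1 and len(strong) == 1:
        return "auto", candidates
    # Anything else that is plausible is flagged for human scrutiny.
    if candidates:
        return "review", candidates
    return "none", []


if __name__ == "__main__":
    reference = ["Apis mellifera", "Apis mellifica", "Bombus terrestris"]
    for q in ["Bombus terrestris", "Apis mellifera", "Xylocopa violacea"]:
        status, cands = triage(q, reference)
        print(q, status, [n for _, n in cands])
```

With these sample names, the first query has a single strong candidate and
would be accepted automatically, the second has two plausible spellings and
would be flagged for review, and the third matches nothing.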

> While *full* automation may be unattainable, one can 
> certainly imagine that such a process can be
> *largely* automated, because names as a whole are *largely* 
> unique and therefore most would not require disambiguation 

Yes!  Exactly.

But my larger point was that while such algorithms can certainly help on the
initial linking step, I see the real advantage of a
common/shared/global/distributed GUID pool as being that once the database
links are made *once*, they *NEVER* need to be made again.  We may have
incremental savings of effort in the short run for building the initial
links from existing datasets; but we have HUGE advantages in the long run if
all new electronic datasets are automatically plugged into these GUIDs, and
therefore automatically plugged into *every* other dataset that shares the
same pool of GUIDs.
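
A toy sketch of the "link once, reuse everywhere" idea, again in Python.
Everything here is hypothetical: the GuidPool class, the join_on_guid helper
and the two made-up datasets are not part of GNI or any real service, and
"Aus macleayi" / "Aus mcleayi" are invented name strings.  The point is only
that once two spellings have been tied to the same GUID, any datasets that
use the pool can be joined without repeating the matching work.

```python
import uuid


class GuidPool:
    """Hypothetical shared pool mapping name strings to GUIDs."""

    def __init__(self):
        self._by_name = {}

    def link(self, name_string, guid=None):
        """Record a link once; later calls return the stored GUID.

        Passing an existing guid attaches another spelling to it.
        """
        if name_string in self._by_name:
            return self._by_name[name_string]
        guid = guid or str(uuid.uuid4())
        self._by_name[name_string] = guid
        return guid

    def lookup(self, name_string):
        """Return the GUID for a name string, or None if unlinked."""
        return self._by_name.get(name_string)


def join_on_guid(pool, dataset_a, dataset_b):
    """Pair up records from two datasets whose names share a GUID.

    Each dataset is a dict of name string -> record.
    """
    b_by_guid = {}
    for name, record in dataset_b.items():
        guid = pool.lookup(name)
        if guid is not None:
            b_by_guid[guid] = record
    joined = {}
    for name, record in dataset_a.items():
        guid = pool.lookup(name)
        if guid is not None and guid in b_by_guid:
            joined[name] = (record, b_by_guid[guid])
    return joined


if __name__ == "__main__":
    pool = GuidPool()
    g = pool.link("Aus macleayi")       # link made once (by human or machine)
    pool.link("Aus mcleayi", guid=g)    # variant spelling, same GUID
    museum = {"Aus macleayi": {"specimens": 12}}
    checklist = {"Aus mcleayi": {"region": "Hawaii"}}
    print(join_on_guid(pool, museum, checklist))
```

In practice the pool would be a shared network service rather than an
in-memory dictionary, but the join step would look much the same.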

> Since I am one of those people with a taxon-name database (mine has
> 194,363 insect names to date, each linked to a surrogate key 
> number), I am eager to see how this system will develop, and 
> hopeful that such a thing will come to pass in the near 
> future, to facilitate communication among museums and 
> taxonomists, among other benefits.

Well...how 'bout throwing those names at GNI (www.globalnames.org) and seeing
what you get?  It's amazingly easy to do with just a basic understanding of
how to output content from your database.  There's lots of information on
the site to help guide you through the process.

Aloha,
Rich