[Taxacom] formation of zoological names with Mc, Mac, etc.

Mon Aug 31 16:37:54 CDT 2009

Assuming I'm following this correctly, I may be able to expand a 
little on a point Rich was making about surrogate keys:

>We should use HUMAN IDENTIFIERS when interacting with Humans.  The
>opaque/obscure/surrogate identifiers only exist to serve the needs of
>computer databases, and as such should be optimized for use by computer
>databases.  No human should ever see one (I am guilty of violating this in
>exposing ZooBank identifiers in human interfaces -- and this was a VERY
>tough decision for me to make -- but I did it only because the
>bioinformatics community and its relationship to unique identifiers is not
>yet mature, so I reluctantly exposed them for the time being until such time
>as the community matures a bit more on this). CERTAINLY, no human should
>ever type one into a computer (this applies to ZooBank identifiers, even
>when exposed to human eyeballs).
>
>>  So
>>  I am asking myself if surrogate keys can combine elements of
>>  natural keys, in a way that each databaser and taxonomist
>>  would be able to create them easily without involving the
>>  time consuming process of consulting a central source. For
>>  that I would need to understand better where exactly you draw
>>  the line between a natural and a surrogate key.
>
>Fair enough.  My database credentials are not hard-core enough to know
>exactly where to draw that line.  You can see what Wikipedia says about it
>here:
>http://en.wikipedia.org/wiki/Surrogate_key
>
>I guess I lean more towards the first of the two presented definitions; but
>I still maintain that they should be visible only to the application, and
>(in most cases) not to the user. The points I am trying to make are embedded
>in the bulleted list of points at the bottom of the "Definition" section of
>the Wikipedia page.  The only part I'm a little fuzzy on is whether the
>identifier should be visible to an application.  Certainly I think they
>should be -- but I don't think they should be visible to humans (except the
>database developers themselves, and as embedded in URLs and such).

Using Wikipedia itself as an example, the distinction Rich is making 
here is like Wikipedia's use of disambiguated hyperlinks for any 
reference to "The Italian Job", rather than requiring the year in 
every citation. That is, in Wikipedia you can have two text strings, 
even side by side, that both read "The Italian Job"; if one is 
hyperlinked to the original movie and the other to the remake, they 
look identical to a human but act differently when clicked (they are 
different to the machine). The question is whether a machine can look 
at a text string and *decide* which link should be used, rather than 
having a human *tell* it (a "time consuming process"); this is the 
crux of Francisco's point:

>>  I imagine this could be superior to a system in which the two
>>  involved partners needed to consult a central source to find
>>  the globally used identifiers for the 20,000 names.

If you can *fully* automate the linkage of the "invisible" GUIDs to 
the visible names in both lists, then the "consultation" could be 
over and done in a matter of seconds. While *full* automation may be 
unattainable, one can certainly imagine that such a process can be 
*largely* automated, because names as a whole are *largely* unique 
and therefore most would not require disambiguation (e.g., if I have 
"Danaus plexippus" in one list and "D. plexippus L." in another, 
these should both be automatically cross-referenced to the same 
GUID). Realistically, the software would make whatever linkages it 
could and flag the names it could not resolve from the information 
on hand (those would then require human intervention and 
clarification). This would still be an immense improvement over 
having humans go name by name, and it gets back to Rich's response 
about the GNA and GNI:

>If the Global Names Architecture (GNA) delivers on what the crafters of it
>intend it to deliver, then there will be a single set of opaque, globally
>unique, computer-friendly, human-unfriendly identifiers that can be
>replicated across all database systems.  In this sense, it is not
>"centralized"; but rather "coordinated".
[snip]
>The permanent, shared, unique persistent identifiers will emerge from the
>Global Names Usage Bank (GNUB; still in very early-stage development).  In
>my vision for the GNA, the GNI will serve as a conduit to help people with
>taxon-name databases to link into the shared GNUB identifiers.  Any given
>database would only need to go through this process once; after which it
>would be immediately cross-linked to all other taxonomic databases that have
>gone through the same linking process.  In my view, *THIS* is how the
>world's biodiversity data will eventually get cross-linked together.
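
To make the "link what you can, flag the rest" idea concrete, here is 
a minimal sketch in Python. Everything in it is an assumption for 
illustration only: the REFERENCE index, the "guid-000N" identifiers 
(which are not real GNUB or ZooBank identifiers), and the function 
names parse_name and link_names are all hypothetical. It treats a name 
as genus plus epithet, ignores any trailing authorship, and accepts an 
abbreviated genus only when exactly one reference name fits.

```python
import re

# Hypothetical reference index: canonical binomial -> opaque identifier.
# The identifiers are invented for this sketch.
REFERENCE = {
    "Danaus plexippus": "guid-0001",
    "Danaus chrysippus": "guid-0002",
    "Musca domestica": "guid-0003",
    "Monodelphis domestica": "guid-0004",
}


def parse_name(raw):
    """Split a raw name into (genus, epithet, abbreviated).

    Only the first two tokens are used, so trailing authorship
    such as "L." is ignored. Returns None if there are fewer
    than two tokens.
    """
    tokens = raw.split()
    if len(tokens) < 2:
        return None
    genus, epithet = tokens[0], tokens[1].lower()
    abbreviated = re.fullmatch(r"[A-Z]\.", genus) is not None
    return genus, epithet, abbreviated


def link_names(names, reference=REFERENCE):
    """Return (linked, unresolved) for a list of raw name strings.

    linked maps each raw string to an identifier; unresolved lists
    the strings that need a human to look at them.
    """
    linked, unresolved = {}, []
    for raw in names:
        parsed = parse_name(raw)
        if parsed is None:
            unresolved.append(raw)
            continue
        genus, epithet, abbreviated = parsed
        if abbreviated:
            # "D. plexippus": match on genus initial plus epithet.
            candidates = [
                guid for name, guid in reference.items()
                if name[0] == genus[0] and name.split()[1] == epithet
            ]
        else:
            guid = reference.get(f"{genus} {epithet}")
            candidates = [guid] if guid is not None else []
        if len(candidates) == 1:
            linked[raw] = candidates[0]
        else:
            # Zero or several candidates: leave it for a human.
            unresolved.append(raw)
    return linked, unresolved
```

With this toy index, "Danaus plexippus" and "D. plexippus L." resolve 
to the same identifier, while "M. domestica" is flagged because it 
fits both Musca domestica and Monodelphis domestica. A real matcher 
would also need to handle synonyms, misspellings, subgenera, and 
homonyms, none of which this sketch attempts.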

Since I am one of those people with a taxon-name database (mine has 
194,363 insect names to date, each linked to a surrogate key number), 
I am eager to see how this system will develop, and hopeful that such 
a thing will come to pass in the near future, to facilitate 
communication among museums and taxonomists, among other benefits.

Sincerely,
-- 

Doug Yanega        Dept. of Entomology         Entomology Research Museum
Univ. of California, Riverside, CA 92521-0314        skype: dyanega
phone: (951) 827-4315 (standard disclaimer: opinions are mine, not UCR's)
              http://cache.ucr.edu/~heraty/yanega.html
   "There are some enterprises in which a careful disorderliness
         is the true method" - Herman Melville, Moby Dick, Chap. 82