[Taxacom] formation of zoological names with Mc, Mac, etc.

Mon Aug 31 15:36:25 CDT 2009

> But I still did not really understand the exact difference 
> between what you call natural and surrogate keys. Maybe an 
> example would be good to get the exact differences.

OK, imagine a database of movies. All movies have an official title, but
that title is not unique. Sometimes a movie gets re-made years later, so
Title + Year would be a better way to uniquely identify a film.

So, let's make a table of movies, that includes the Production Company and
some rating for how good the movie is:

TITLE				YEAR	COMPANY	RATING
-------------------------------------------------
Casablanca			1943	Warner Bros. 8.8
Jaws				1975	Universal	 8.3
Star Wars			1977	Lucasfilm	 8.8
The Italian Job		1969	Oakhurst 	 7.3
The Italian Job		2003	Paramount	 6.9
[etc...]
-------------------------------------------------

Now, suppose I had a list of Actor names, and I wanted to link Actors to
movies.  Do build this link, I need to uniquely identify each movie record.
In the list above, the Natural Key is the combination of TITLE+YEAR.  So, my
linked Actor list might look like this:

TITLE				YEAR	ACTOR
------------------------------------------------
The Italian Job		1969	Michael Caine
The Italian Job		1969	Noel Coward
The Italian Job		1969	Benny Hill
The Italian Job		1969	Raf Vallone
The Italian Job		1969	Tony Beckley
The Italian Job		2003	Mark Wahlberg
The Italian Job		2003	Charlize Theron
The Italian Job		2003	Donald Sutherland
The Italian Job		2003	Jason Statham
[etc...]
------------------------------------------------

In this case, the link between the movies table and the Actors table is made
via the Natural Key combination of TITLE+YEAR.

However, this process of linking tables via compound keys (=keys involving
more than one field) can be quite cumbersome to deal with when developing
databases.  Especially when you have 9 or 10 fields as was suggested for
taxon names -- and even more especially if some of thse fields can be Null
(empty) -- as in "subspecies epithet".

A much simpler solution is to create a Surrogate Key -- which is usually
(always?) an arbitrary number.  For example, using a Surrogate Key on the
Movies table above would give us:

ID TITLE				YEAR	COMPANY	RATING
------------------------------------------------------
1  Casablanca			1943	Warner Bros. 8.8
2  Jaws				1975	Universal	 8.3
3  Star Wars			1977	Lucasfilm	 8.8
4  The Italian Job		1969	Oakhurst 	 7.3
5  The Italian Job		2003	Paramount	 6.9
[etc...]
-------------------------------------------------

...which means to link to the Actors table, you only need to do this:

MovID	ACTOR
------------------------
4	Michael Caine
4	Noel Coward
4	Benny Hill
4	Raf Vallone
4	Tony Beckley
5	Mark Wahlberg
5	Charlize Theron
5	Donald Sutherland
5	Jason Statham
[etc...]
------------------------

The best kinds of Surrogate Keys are the ones that have no informatic
content.  That is, they exist only to uniquely identify a specific record.
The big problem with using Natural Keys to identify records (except as Human
identifiers), is that they rely on fields that contain actual information.
Sometimes, this is not a big problem.  But if a field contains information,
Murphy's Law of Database Integrity is that someone, sometime, will want to
change the information.  Sometimes, this is because the original record had
a mistake (e.g., what if it was later discovered that the actualy release
year for "Casablanca" was 1942).  Sometimes, people want to change the
standard format for information.  For example, what if the managers of this
database decided to change the title "The Italian Job" in the database to
"Italian Job, The".  Doing so in the Movie table would break all the links
to the Actors table, if the link was made using the Natural Key -- unless
referential integrity across all tables is maintained automatically or
manually (i.e., as system is established that automatically cascades updates
in the Movie table to the Actor table).

But if using a Surrogate Key, like the integers above, then there is no
information content in the ID field itself (its only purpose is to uniquely
identify a record). Thus, no one is ever tempted to change it, because it
can never be "incorrect", and there is no reason to change the format, etc.
When you see references to "opaque" identifiers, or identifier "opacity",
this is usually what is meant (i.e., that the object that the identifier
refers to is not self-evident from the identifier itself, and no meaningful
information about the object being identified is encoded or embedded within
the identifier).

Sorry for the long explanation -- but given that I foresee these kinds of
questions becoming more common place as the taxonomic community begins to
debate identifiers and LSIDs and UUIDs all that stuff, I thought it would be
worthwhile to explain some of this stuff in some detail.

> If I understood you correctly,
> 
> Helix aspersa O.F. Müller, 1774 should be a human identifier. 
> Others would be Helix aspersa O. F. Müller, 1774, Helix 
> aspersa Mueller, 1774, Helix aspersa Müller, 1774, Helix 
> aspersa O. Müller 1774, Helix aspersa OF Müller, 1774 or 
> Helix aspersa Müller O.F. 1774 .

Sure.  As long as a human can figure out what is meant, it's an appropriate
human identifier.

> Natural keys? or surrogate keys?:
> 
> Helix aspersa Müller, 1774
> Helix aspersa Müller 1774
> helix-aspersa-müller-1774
> helix_aspersa_müller_1774
> helix-aspersa-muller-1774
> helix_aspersa_muller_1774
> helix_aspersa_muller_1774_01
> helix_aspersa_muller_1774_p0059_01
> helix_aspersa_muller_1774_used
> helix_aspersa_muller_1774_p0059_01_orig
> helix_aspersa_muller_1774_p0059_01_used

I'm not sure exactly what you mean by the above.  If you concatinate all the
components as one long text string separated by underscores, then you're
collapsing a compound Natural Key into a single-field Natural Key.  But the
compound key/single-field key issue is different from the Natural/Surrogate
key issue; so in this case, I would still call all of these Natural Keys,
because they are formed using that actual information that uniquely
identifies a single record.

> hel7829_asp28473_mull747634
> 378thua3p4uht9p8q3ht
> h139857138957187
> 11702

These are what I would call Surrogate Keys, assuming they are arbitrarily
assigned to individual records and do not contain information about the
object being identified.

> A shortcoming of numeral keys is that errors cannot easily be 
> recognized by humans, and that a central source must be 
> consulted to combine the name of the species with the key. 

This "shortcoming" only applies if humans ever see or type these kinds of
identifiers.  In fact, this gets to one of my biggest frustrations when
discusing unique identifiers (hi, Rod...):

We should use HUMAN IDENTIFERS when interacting with Humans.  The
opaque/obscure/surrogate identifiers only exist to serve the needs of
computer databases, and as such should be optimized for use by computer
databases.  No human should ever see one (I am guilty of violating this in
exposing ZooBank identifiers in human interfaces -- and this was a VERY
tough decision for me to make -- but I did it only because the
bioinformatics community and its relationship to unique identifiers is not
yet mature, so I reluctantly exposed them for the time being until such time
as the community matures a bit more on this). CERTAINLY, no human should
ever type one into a computer (this applies to ZooBank identifiers, even
when exposed to human eyeballs).

> So 
> I am asking myself if surrogate keys can combine elements of 
> natural keys, in a way that each databaser and taxonomist 
> would be able to create them easily without involving the 
> time consuming process of consulting a central source. For 
> that I would need to understand better where exactly you draw 
> the line between a natural and a surrogate key.

Fair enough.  My database credentials are not hard-core enough to know
exactly where to draw that line.  You can see what Wikipedia says about it
here: 
http://en.wikipedia.org/wiki/Surrogate_key

I guess I lean more towards the first of the two presented definitions; but
I still maintain that they should be visible only to the application, and
(in most cases) not to the user. The points I am trying to make are embedded
in the bulleted list of points at the bottom of the "Definition" section of
the Wikipedia page.  The only part I'm a little fuzzy on is whether the
identifier should be visible to an application.  Certainly I think they
should be -- but I don't think they should be visible to humans (except the
database developers themselves, and as embedded in URLs and such).

> I could imagine that a databaser who likes to send out to 
> another database information connected to 20,000 names of 
> currently used specific taxa could easily create an automatic 
> program to convert data of 4 fields (1) Helix (2) aspersa (3) 
> Müller (4) 1774 to produce a key 
> helix_aspersa_muller_1774_used, and the recipient would 
> understand this key.

Yes, this would be a convenient way to collapse a compound natural key into
a single-field natural key, which removes some of the hassles that
databasers need to deal with.  But I would still not call this a Surrogate
Key -- I would call it a concatenated Natural Key.

> I imagine this could be superior to a system in which the two 
> involved partners needed to consult a central source to find 
> the globally used identifiers for the 20,000 names.

Superior in some contexts; inferior in others.

If the Global Names Architecture (GNA) delivers on what the crafters of it
intend it to deliver, then there will be a single set of opaque, globally
unique, computer-friendly, human-unfriendly identifiers that can be
replicated across all database systems.  In this sense, it is not
"centralized"; but rather "coordinated".

When linking a dataset into this system for the first time, a database nerd
such as myself would need to do a number of tricks along the lines of what
you suggest -- that is, in one way or another the new dataset has to be
reconciled against a dataset that incorporates the Globally shared
identifiers.  Indeed, the "Global Names Index" (GNI), which is part of the
GNA, exists exactly for this purpose -- to reconcile taxonomic databases
against each other using clever algorithms and fuzzy matching protocols to
reconcile taxonomic datasets.  Check it out at www.globalnames.org.  Still
in development, but it's already quite useful.

The permanent, shared, unique persistent identifiers will emerge from the
Global Names Usage Bank (GNUB; still in very early-stage development).  In
my vision for the GNA, the GNI will serve as a conduit to help people with
taxon-name databases to link into the shared GNUB identifiers.  Any given
database would only need to go through this process once; afterwhich it
would be immediately cross-linked to all other taxonomic databases that have
gone through the same linking process.  In my view, *THIS* is how the
world's biodiversity data will eventually get cross-linked together.

Well....darn it. I knew I should not have jumped into this thread!  Note to
self: no more checking email after midnight.....

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html