Taxacom: Tropicos and gender of names

Thu Feb 10 04:04:30 CST 2022

Donald said 'My message was mostly a response to the suggestion that our great problem is databases being too crude and imprecise to represent biological nomenclature'
Hah, yes, that is an ironic suggestion, since the truth is in fact that biological nomenclature is too crude and imprecise to be represented by databases (which, by their very nature, have to be extremely precise)!
Stephen

On Thursday, 10 February 2022, 10:02:13 pm NZDT, Donald Hobern <dhobern at gbif.org> wrote:

Thanks, Stephen.
I'm not going to disagree with much of what you say. Most uses of scientific names are imprecise exercises in ostension and rely on the same rubbery qualities as any word uses. And nomenclature gives us so many painful minutiae that can soak up our time. My message was mostly a response to the suggestion that our great problem is databases being too crude and imprecise to represent biological nomenclature. That really is not the problem. All the things you mention are bigger issues.
Donald

----------------------------------------------------------------------
Donald Hobern - dhobern at gbif.org
Global Biodiversity Information Facility http://www.gbif.org/
----------------------------------------------------------------------

From: Stephen Thorpe <stephen_thorpe at yahoo.co.nz>
Sent: Thursday, February 10, 2022 6:55 PM
To: Donald Hobern <dhobern at gbif.org>; taxacom at lists.ku.edu <taxacom at lists.ku.edu>; Richard Pyle <deepreef at bishopmuseum.org>
Subject: Re: Taxacom: Tropicos and gender of names

Hi Rich and Donald (and other Taxacomers),

I have to throw a bit of a spanner in the works here, I'm afraid. I've been thinking hard about these issues for nearly 3 decades now, and have come to the conclusion that an overly precise approach (which computerised data management seems to require) is the wrong approach. Instead, what we need is a human-generated narrative, or, at the very least, the facility to add annotations to computer-generated records. Taxonomy and nomenclature are just too "fuzzy". It is too easy to get bogged down in pointless minutiae and lose track of what is actually important. One of the main problems, as I see it, is that the nomenclatural codes (at least the ICZN) are too complex and ambiguous. The result is that we get authors claiming their own subjective interpretations of the Code to be objectively correct!

There are many possible problems that could trip up a computerised approach. Just for one minor example, Pselaphotheseus ihupuku is presumably described well enough to be able to recognise the species from the description, without any need to examine "the type". This is just as well, because two different specimens were designated as "the holotype" in the original description! There is no objective solution to this problem. One could say that the species name is unavailable, because no holotype was *unambiguously* designated. Alternatively, one could take Al Newton's interpretation that, effectively, syntypes were designated but erroneously referred to as holotypes, thus saving the availability of the name. There simply is no right or wrong answer, but it probably doesn't matter, since the species is presumably identifiable from the original description and illustrations.

Another example of pointless complexity involves an old paper from 1877 that I registered earlier today on ZooBank. The problem is that the running header changes from January 1877 to February 1877, part way through the paper, but without any other indication of interrupted publication. I really don't think it is clear how to interpret this. I suspect that the dates are printing dates, not publication dates, in which case it is one paper, not two, but, if they are publication dates, then that has potential nomenclatural implications for priority of potential synonyms, etc. However, the best approach, it seems to me, is just to make a brief note of the problem and push on, not worrying too much about it. Hence the need for annotation. Without that, someone is just going to have to make a subjective call, which opens up the possibility of other people making different calls, thus resulting in disagreement and confusion, with each side claiming that their interpretation is "the correct one"!

Cheers,
Stephen
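
The annotation facility suggested in the message above could be sketched roughly as follows. This is only an illustration: the `Record` and `Annotation` classes, their fields, and the placeholder title are hypothetical and are not taken from ZooBank or any other existing system. The idea is that an unresolved problem is written down next to the structured fields instead of being silently decided one way or the other.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    """A free-text note attached to a record (hypothetical structure)."""
    author: str
    note: str


@dataclass
class Record:
    """A publication record with structured fields plus human annotations."""
    title: str
    publication_year: Optional[int] = None  # kept coarse when the exact date is unclear
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, author: str, note: str) -> None:
        # Record the problem without forcing a subjective decision.
        self.annotations.append(Annotation(author, note))


# Placeholder record for the 1877 paper mentioned above (title not given in the thread).
paper = Record(title="(title not given)", publication_year=1877)
paper.annotate(
    "annotator",
    "Running header changes from January 1877 to February 1877 part way "
    "through; unclear whether these are printing or publication dates.",
)
```

Anyone reading the record later sees both the structured year and the caveat, and can make their own call if priority ever depends on it.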
On Thursday, 10 February 2022, 07:16:26 pm NZDT, Richard Pyle via Taxacom <taxacom at lists.ku.edu> wrote:

Donald,

This is perhaps the most succinct and well-worded description of the taxonomy-database landscape that I have ever read!  Well done, sir!

I'm going to hitch my wagon and add one more bit of elaboration.  As you note, some people build databases to track "names" (by any of a large number of definitions).  Some build databases to track "nomenclatural acts". Some build databases to track "species" (or taxa more broadly).  Some build databases about other important biodiversity concerns (e.g., conservation measures or regulations, ecology, biogeography, genomics, phylogeny, medical issues & diseases, etc., etc., etc.), which anchor critical bits of information to some sort of taxonomic or nomenclatural backbone.  We have a plethora of database solutions (and, consequently, a plethora of "things" to which identifiers are assigned) because we have a plethora of use-cases and data management needs in the broad spectrum of biodiversity inquiry and documentation. Most of these were purpose-built, as you note, to optimally address the relevant use case.

However, there is one "thing" that lies at the root of ALL of these use-cases, and to which ALL of the relevant information can be anchored and intelligently cross-referenced.  It's not "names", or "acts", or "taxa", or any of the other "things" around which most taxonomic databases (or taxonomic backbones to other biodiversity databases) are built.  Think of it as the grand unified elemental "unit" of taxonomy -- the "atoms" of ALL nomenclatural and taxonomic information worth documenting:  Taxonomic Name Usage (TNU) instances.  For those not actively engaged in this space, think of these TNUs as taxonomic treatments or scientific-name mentions within a specific "Reference" (i.e., a specific instance of published or unpublished literature and other forms of static documentation).  They serve as anchor-points for ALL nomenclatural assertions (both the Code-compliant ones, and the non-Code-compliant ones), as well as all assertions about taxonomic circumscriptions and classification (synonymies, combinations, hierarchies, etc.). In fact, they serve as anchor-points for every meaningful piece of biodiversity information that is referenced through scientific names.

TNUs are powerful and universal because they are granular.  But granularity is a double-edged sword.  While there are a few million accepted taxa, and perhaps tens of millions (+/-) of scientific names, there are hundreds of millions (billions?) of TNUs.  Indexing them all seems an overwhelming task. The good news is that you only need a few million of them to really do some powerful interlinking of biodiversity data. Basically, you need all the Protonyms (think Basionyms, except applying only to the terminal epithet, and applying to all ranks), plus at least one additional TNU for each name to represent its current "accepted" status (by whatever meta-authority you favor).  And this extra one only applies in cases where the accepted status differs from the original status.  More good news is that TNUs themselves consist entirely of facts (with a very tiny fraction of edge cases), so there is no need for subjective decision-making or endless debate in populating them in a database (the subjectivity comes in only when a meta-authority decides which among several optional statuses for any given Protonym represents the "accepted" status -- but that happens at a different layer that sits on top of TNUs).
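
A minimal Python sketch of such a seed set may make the idea concrete. It is illustrative only: the `TNU` class, its field names, the `accepted_name` helper, the "ExampleChecklist" authority and the placeholder names ("Aus bus", etc.) are all assumptions made for this example and do not reflect the schema of any existing database. Each name has a Protonym TNU, and an extra TNU is added only where the accepted status differs from the original.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TNU:
    """One Taxonomic Name Usage: a name as it appears in one Reference."""
    tnu_id: int
    name: str            # name string exactly as used in the reference
    reference: str       # the reference in which the usage appears
    protonym_id: int     # tnu_id of the original usage (its own id for a Protonym)
    accepted_by: Optional[str] = None  # meta-authority that treats this usage as accepted

    @property
    def is_protonym(self) -> bool:
        return self.tnu_id == self.protonym_id


def accepted_name(tnus: Iterable[TNU], protonym_id: int, authority: str) -> str:
    """Return the name the given authority accepts for a Protonym.

    Falls back to the Protonym's own name when no differing accepted
    usage has been recorded for that authority.
    """
    protonym = None
    for tnu in tnus:
        if tnu.protonym_id != protonym_id:
            continue
        if tnu.accepted_by == authority:
            return tnu.name
        if tnu.is_protonym:
            protonym = tnu
    if protonym is None:
        raise KeyError(f"no Protonym with id {protonym_id}")
    return protonym.name


# Hypothetical seed set: two Protonyms, one of which is now accepted in another genus.
seed = [
    TNU(1, "Aus bus", "Smith 1900", protonym_id=1),
    TNU(2, "Cus bus", "Jones 1950", protonym_id=1, accepted_by="ExampleChecklist"),
    TNU(3, "Dus eus", "Smith 1900", protonym_id=3),
]

print(accepted_name(seed, 1, "ExampleChecklist"))  # Cus bus
print(accepted_name(seed, 3, "ExampleChecklist"))  # Dus eus
```

The records themselves state only facts (which name appeared in which reference); the single subjective element, the choice of accepted usage, is isolated in the `accepted_by` field, mirroring the separate layer described above.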

In other words, with a reasonably-scaled "seed" set of TNUs (the majority of which already exist in digital structured form in some database or another), you can start doing some automated reasoning and cross-linking that borders on magic. And as a bonus, you get a common store of "facts" around which essentially everyone agrees, plus an index of names as they appear in literature (and how they are treated in literature over time).

The basic infrastructure for this already exists. It only contains about a half-million TNUs, and requires a fair bit of clean-up; but it's growing literally daily.  There are several million more TNUs waiting in the wings to be imported, and we're getting very close to fleshing out a layer that allows this same infrastructure to automatically cluster TNUs into congruent taxon circumscriptions (watch this space).

Forgive me if I sound preachy, but this has been a core sermon of mine for more than two decades.  It's both delightful and painful to see that we are ever-so-slowly moving towards the realization of this on a much broader scale (delightful because it is moving in the right direction, painful because it has been a slower process than it really needed to be).

I hadn't intended this message to be as long as it ended up being, and I'd wager that the majority of people who started reading it haven't got this far (either because they got bored or fell asleep along the way).  But Donald's EXCELLENT summary inspired me to ride his coat-tails and add my own personal cause to the discussion.

Aloha,
Rich

Richard L. Pyle, PhD
Senior Curator of Ichthyology | Director of XCoRE
Bernice Pauahi Bishop Museum
1525 Bernice Street, Honolulu, HI 96817-2704
Office: (808) 848-4115;  Fax: (808) 847-8252
eMail: deepreef at bishopmuseum.org
BishopMuseum.org
Our Mission: Bishop Museum inspires our community and visitors through the exploration and celebration of the extraordinary history, culture, and environment of Hawaiʻi and the Pacific.

> -----Original Message-----
> From: Taxacom <taxacom-bounces at lists.ku.edu> On Behalf Of Donald Hobern
> via Taxacom
> Sent: Wednesday, February 9, 2022 6:00 PM
> To: <taxacom at lists.ku.edu> <taxacom at lists.ku.edu>;dipteryx at freeler.nl
> Subject: Re: Taxacom: Tropicos and gender of names
> 
> I'm sure this will just serve to waste more of all of our breath, but I want to
> highlight a distinction that I feel is being glossed over. I have spent many years
> processing scientific names for use in databases. And I am familiar with
> countless projects that started by asserting that a good names
> management tool is simple to build and that then got bogged down for years in
> making it work adequately.
> 
> It is not hard - and it is highly desirable - to build a nomenclatural database
> that meets the needs of the taxonomists working on a particular group. If the
> database is to capture nomenclatural acts or name usages and support what is
> effectively row-by-row lookup, the challenges are ("simply") making sure that
> the data model can represent the range of special conditions considered
> important. A tool like this, particularly one that links to the original
> publications, is enormously valuable for taxonomists and for some other user
> groups. It should by and large converge over time on being an inarguable
> summary of facts. (Although, in the real world, the edge cases make this ideal
> hard to achieve.) Such a tool is of great importance to many people on this
> group and could benefit from large-scale cross-taxon effort, as with ZooBank,
> IPNI, Index Fungorum, LPSN, and ICTV.
> 
> However, a nomenclatural database does not meet the needs of 95% of
> consumers of biodiversity information and may in fact cause more confusion
> than it solves. For most other biologists, environmental scientists,
> conservationists, invasion biologists, field naturalists, molecular researchers,
> etc., nomenclature is just the archaeological remains of more than 260 years of
> taxonomic effort. These users have two basic types of information need (which
> are simply the gateways for them to answer many other more interesting
> questions): 1) I've found a scientific name somewhere - what species is being
> referenced? and 2) Find me all the relevant information of some type that
> relates to what I know as species X.
> 
> Binomials (plus authorship (plus page number (plus sensu reference (plus ...))))
> are dreadful keys for building data systems that meet these challenges. We
> have to document original names, known combinations and asserted
> synonymy, allow for normal fluidity around author spelling/abbreviation and
> publication years, perhaps strip away presumed Latin gender endings,
> accommodate a range of bespoke orthography to represent uncertainty,
> balance the probability of misspellings (including misspellings in the original
> epithets), etc. It is far from trivial to build a system that can do what a
> taxonomist familiar with a group does when encountering a novel combination
> (instantly recognising that the combination represents a transfer to another
> known genus, or a wholly new genus, or an obscure resurrection of a genus
> based on presumed priority, or ...).
> 
> The use cases and number of users for a taxonomic (rather than
> nomenclatural) information system make it likely that funding will be easier to
> attract for such cases. The quality of such systems will be improved if
> comprehensive nomenclatural datasets are available to underpin them. This is
> itself an important reason to support initiatives like ZooBank, along with other
> platforms that bring together a taxonomic community to create a single point-
> of-truth for the nomenclature of their group. Standardising how all these
> systems represent the messy parts would also be a big help.
> 
> However, any discussion about why databases struggle with biological
> nomenclature should acknowledge that the problem is not with representing
> nomenclatural acts. As noted, that can be done. Rather, it's with the fluidity of
> real-world references to the names that taxonomists publish. That's where
> names as lookup keys start to fail.
> 
> Donald
> 
> 
> ----------------------------------------------------------------------
> Donald Hobern - dhobern at gbif.org
> Global Biodiversity Information Facility
> http://www.gbif.org/
> ----------------------------------------------------------------------
> 
> ________________________________
> From: Taxacom <taxacom-bounces at lists.ku.edu> on behalf of dipteryx--- via
> Taxacom <taxacom at lists.ku.edu>
> Sent: Wednesday, February 9, 2022 8:48 PM
> To: <taxacom at lists.ku.edu> <taxacom at lists.ku.edu>
> Subject: Re: Taxacom: Tropicos and gender of names
> 
> The idea of adapting biological nomenclature so that it fits in databases
> reminds me of the early days of computers when software was written to fit
> the current model, and any new model made it necessary to throw out the
> previous work and start anew. Or, for that matter, cutting off sections of
> famous paintings to make them fit their newly allotted wall space.
> 
> It seems much simpler to wait until databasers grow up and can handle
> concepts beyond those aimed to fit in the grasp of a 3-year old?
> 
> Paul
> 
> > On 08-02-2022 17:36, Scott Thomson via Taxacom
> <taxacom at lists.ku.edu> wrote:
> > [...]
> > However, taking off my biologist or linguist hat for a moment. As a
> > computer programmer having designed databases, mostly in SQL I think
> > there are a lot of valid reasons to be rid of gender agreement and
> > just use original spelling. Mostly these come down to the accuracy of
> > pick up by databases of these issues. It is one aspect that could be
> > avoided making all databases far more accurate and with simpler rules.
> > It should be remembered that as I was taught when I did software
> > engineering, a computer program is a recipe designed for a 3 year old,
> > the computer may be faster than us, but do not equate that to more
> > intelligent. Famous movie quote, it just runs programs. The computer
> > cannot make any decision it is not told to make. Therefore if we want
> > high speed and excruciatingly accurate data reading by these
> > databases, then we should be making it easier for databases to read
> > and process data, not harder. [...]
> _______________________________________________
> Taxacom Mailing List
> 
> Send Taxacom mailing list submissions to: taxacom at lists.ku.edu For list
> information; to subscribe or unsubscribe, visit:
> https://lists.ku.edu/listinfo/taxacom
> You can reach the person managing the list at: taxacom-owner at lists.ku.edu
> The Taxacom email archive back to 1992 can be searched at:
> https://taxacom.markmail.org/
> 
> Nurturing nuance while assailing ambiguity for about 35 years, 1987-2022.
