[Taxacom] ZooBank Progress
Richard Pyle
deepreef at bishopmuseum.org
Sun Apr 28 14:08:56 CDT 2013
> If it's done manually it might be worth to correct some other names that
> were linked by ZooBank to the Linnean 1758 work.
Yes, exactly! Part of the process of cross-linking is to compare discrepancies. We've found that in most datasets that we cross-link against, there is a relatively small fraction of discrepancies. For example, out of 50,000 names, there might be only a few hundred discrepancies -- usually involving the date of publication, correct authorship, or the exact orthography of the name. This means that it's a very manageable task to investigate each one of the discrepancies. Also, I've found that no database is perfect. Some are better than others, to be sure -- but one can never assume that any one database is always correct -- which means that it's important to examine the discrepancies individually. Indeed, this is one of the main reasons why we wanted to establish this link with BHL -- to make it easier to resolve discrepancies.
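As a rough illustration of that comparison step, here is a minimal Python sketch. It assumes two datasets have already been cross-linked by some shared key; the `NameRecord` type, its three fields, and `find_discrepancies` are hypothetical names chosen for this example, not part of ZooBank or BHL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    """Hypothetical record: one name as held by one dataset."""
    name: str    # orthography as recorded in that dataset
    author: str  # authorship as recorded
    year: int    # date of publication as recorded


# The three kinds of discrepancy mentioned above.
FIELDS = ("name", "author", "year")


def find_discrepancies(ours, theirs):
    """Compare records that share a key in both datasets.

    Returns {key: {field: (our_value, their_value)}} and includes only
    the linked records where at least one field differs.
    """
    report = {}
    for key in ours.keys() & theirs.keys():
        a, b = ours[key], theirs[key]
        diffs = {
            f: (getattr(a, f), getattr(b, f))
            for f in FIELDS
            if getattr(a, f) != getattr(b, f)
        }
        if diffs:
            report[key] = diffs
    return report
```

The report deliberately keeps both values side by side rather than picking a winner, since neither dataset can be assumed correct and each difference still has to be examined by a person.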
> My understanding is that ZooBank is a data resource where available names
> are contained. Unavailable names should probably not be contained at all,
> and if yes, they should clearly be marked as such.
> I am not sure how names should be treated which were initially made
> available and later suppressed.
This is an issue that has been debated since ZooBank was first conceived. In 2008, at a Commissioners' meeting in Paris, it was determined that ZooBank *would* include unavailable names, and that those names would not only be clearly marked as unavailable, but would also carry the reason(s) why the name is unavailable. We already have a very robust data model to deal with this (which I'd be happy to describe, if anyone is interested). But as with most aspects of ZooBank development, the tricky part is how to implement it (devil is always in the details). One of the things in the works is a policy on data verification in ZooBank. Right now, the focus is on building the core infrastructure of ZooBank, populating it with retrospective content, and building tools to streamline the capture of prospective content. However, what people *really* want from ZooBank is a definitive declaration of whether or not any particular name is available under the Code. This is the entire process of content verification. So far, we have focused only on registration (these are two very different things).
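To make the distinction between registration and verification concrete, here is a minimal Python sketch of a record that keeps the two apart and stores the reason(s) alongside an "unavailable" status. This is not the actual ZooBank data model, which is not described here; `Availability`, `RegisteredName`, and their fields are hypothetical, and "Aus bus" is a placeholder name.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Availability(Enum):
    """Hypothetical status values; a freshly registered name is unverified."""
    UNVERIFIED = "registered, not yet verified"
    AVAILABLE = "available"
    UNAVAILABLE = "unavailable"


@dataclass
class RegisteredName:
    name: str
    status: Availability = Availability.UNVERIFIED
    reasons: List[str] = field(default_factory=list)

    def mark_unavailable(self, reason: str) -> None:
        """Flag the name as unavailable and record why."""
        self.status = Availability.UNAVAILABLE
        self.reasons.append(reason)

    def label(self) -> str:
        """Display string that always shows the status, and the reasons if any."""
        if self.status is Availability.UNAVAILABLE:
            return f"{self.name} [unavailable: {'; '.join(self.reasons)}]"
        return f"{self.name} [{self.status.value}]"
```

Keeping "unverified" as its own state means that a registered name never implies anything about availability until someone has actually checked it.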
> Example:
> http://zoobank.org/NomenclaturalActs/1E691819-76A8-492D-8AE1-DA84F9103CF8
> Acarus telarius Linnæus, 1758 - this name should somehow be marked as
> suppressed (ICZN Op. 968).
> There were many other such names established in the 1758 work, which
> were totally or partly suppressed by the Commission.
Yes, indeed! In fact, one of the projects we've been working on (with LARGE thanks to Charles Hussey, and also to Rod Page who defined the article boundaries of historical BZN volumes in BHL) is a complete database of Opinions. This is effectively complete (still needs some verification, though), and will be one of the new features added to ZooBank this summer. But again, we need to sort out exactly how this sort of thing will be implemented on the ZooBank website, and what the policy is for editing these sorts of things, etc.
Many thanks for pointing out the individual issues related to Linnaeus names. This sort of thing is EXTREMELY helpful! I will definitely use these as test cases when we implement the next set of features involving ZooBank record verification/validation. But again, it probably can't be implemented until later this summer (northern hemisphere summer, that is).
> Maybe some other systematic things could be fixed.
>
> - Remove the long s throughout the original spellings, and replace it at all
> instances by the short s.
In this case, we want to maintain the precise orthography as it originally appeared on the printed page -- in all respects. Basically, if a UTF-8 character exists for a particular glyph, we want to capture it as such. The main exceptions are that all-caps words are not faithfully captured as such, and other stylistic attributes (e.g., boldface, small-caps, when original names were not italicized, etc.) will not be captured. But characters such as the long s and diphthong "æ" will be captured as originally printed on the page.
The next step is to build the correct algorithm to transform these things, so that the Code-corrected "original spelling" can be generated automatically. In most cases, this is easy to do -- but there are some tricky ones (e.g., see Art. 32.5.2.1. -- which would require us to know whether the root word is German or not; or some of Art. 32.5.2.4.). This is one more example of features currently in the works, that will be introduced over time as they rise up the priority list, and as appropriate policies are drafted and ratified.
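For the easy cases, such a transformation might look like the Python sketch below. It is a simplified illustration, not a full implementation of the Code: it covers only the long s, the "æ"/"œ" ligatures, removal of diacritic marks, and lower-casing of the specific name. Whether a word has a German root cannot be detected from the string, so the caller must supply that as the `german_root` flag, which this sketch applies to both words of the binomen. The function names and the flag are hypothetical.

```python
import unicodedata

# Glyphs expanded to plain letters (long s and ligatures).
LIGATURES = {"ſ": "s", "æ": "ae", "Æ": "Ae", "œ": "oe", "Œ": "Oe"}
# Umlauts that become vowel + "e" when the word is known to be German.
GERMAN_UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def simplify_word(word, german_root=False):
    """Return the word with ligatures expanded and diacritic marks removed."""
    word = unicodedata.normalize("NFC", word)  # compose marks so lookups match
    for src, dst in LIGATURES.items():
        word = word.replace(src, dst)
    if german_root:
        for src, dst in GERMAN_UMLAUTS.items():
            word = word.replace(src, dst)
    # Decompose what is left and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def corrected_binomen(verbatim, german_root=False):
    """Turn a verbatim two-word name into a simplified corrected spelling."""
    genus, epithet = verbatim.split()
    genus = simplify_word(genus, german_root)
    epithet = simplify_word(epithet, german_root).lower()
    return f"{genus[:1].upper()}{genus[1:].lower()} {epithet}"
```

Under these assumptions, the verbatim string "Ostrea Puſio" would come out as "Ostrea pusio", while the verbatim form itself stays untouched in the database. The harder cases are exactly the ones where the flag, or some other outside knowledge about the word, cannot be derived automatically.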
> Example Musca Linnæus, 1758, this name was
> spelled Musca with long s at some occasions and MUSCA at others, MUSCA is
> usually converted to Musca with short s. So all specific names should
> correctly be combined with Musca with short s.
This is a slightly separate issue (multiple spellings of the same genus name, and how they map to the species they are combined with). The new GNUB data model (not yet implemented) deals with this by capturing separately the verbatim name-string, and the separate name components. At the moment, this sort of issue is rare enough that it has not risen up the priority "to-do" list. But it's definitely on the list.
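The general idea of keeping the verbatim string next to its parsed components can be sketched in Python as follows. This is not the GNUB data model itself; `NameString`, `parse`, `genus_key`, and `group_by_genus` are hypothetical names, and the sketch assumes simple two-word name-strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameString:
    """Hypothetical record: the printed string plus its parsed components."""
    verbatim: str  # exactly as printed, long s and capitals included
    genus: str     # genus component, still verbatim
    epithet: str   # species component, still verbatim


def parse(verbatim):
    """Split a two-word name-string into its components (assumes two words)."""
    genus, epithet = verbatim.split()
    return NameString(verbatim, genus, epithet)


def genus_key(verbatim_genus):
    """Key that ignores long s versus short s and letter case."""
    return verbatim_genus.replace("ſ", "s").casefold()


def group_by_genus(strings):
    """Group name-strings whose genus components differ only in rendering."""
    groups = {}
    for s in strings:
        groups.setdefault(genus_key(s.genus), []).append(s)
    return groups
```

With a key like this, differently rendered spellings of one genus land in the same group, so the species combined with them can be mapped to a single genus without altering any of the verbatim strings.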
> Also, the long s is not cited consistently. Example:
> http://zoobank.org/NomenclaturalActs/D2B4DA70-35AE-4D87-9E34-E219FC8E3DA0
> Ostrea Puſio Linnæus, 1758 - here Pusio with long s and Ostrea with
> short s, both had the long s in the original source.
This is another example of the previous. The genus was rendered as OSTREA on p. 696 (http://www.biodiversitylibrary.org/pagethumb/727611), so the genus is captured as such in the database (minus the all-caps). I only see "Oſtrea" in the page header. Is it rendered this way somewhere else?
> - Consider presenting a field "original spelling" and another field "correct
> spelling". This would probably reduce confusion. In the correct spelling field
> the species would not appear capitalised, and diacritics would be removed.
Yes! This is already part of the plan. It just needs to rise up the priority list for implementation.
> - I am confused by the statement "Fossil: No" in the ZooBank data result set.
> Is this nomenclaturally relevant?
It's not a Code-relevant issue, but it is a useful piece of information (just like type locality, figures, and page number).
> Is there an exact definition for the term "fossil"? Since when does a taxon
> need to be extinct for obtaining the attribute "fossil"?
If you read the help section for this particular field (click on the blue icon when registering a new name, or editing an existing name), it explains it thusly:
"If this new name is based on fossil material, select this checkbox. Otherwise, leave the checkbox unselected."
In other words, it only applies to species-group names, and it is a specific indication of the nature of the name-bearing type material. Technically, if the type specimen of Latimeria chalumnae had been a fossil, and then it was later discovered alive, this would be "Fossil: Yes". However, I am not aware of any case where a name is established based on a fossilized type, and then later discovered (at the species level) to be extant. Generally such cases are described as separate species.
> Can we be sure that all molluscs and brachiopods named in the early Linnean
> works were recent?
Nope. Neither can we be sure that all the page numbers are correct, or all the type localities are correct -- or any number of other things. That doesn't mean the data field should be eliminated. It just means we have to deal with cases that prove to be inaccurate (or unknown).
> Would it not be better to remove the statement, to avoid running the risk to
> give an incorrect information?
I don't think so, but I'd be interested in hearing opinions from others on this. As I already said, there is no such thing as a perfect database. One of the things Rob Whitton constantly reminds me of is not to let the "perfect" be the enemy of the "good". I tend to be a perfectionist on these sorts of things (as many database managers are). But sometimes it's better to just get what you have out there, and then provide a crowd-sourcing mechanism to get it corrected.
> Example:
> http://zoobank.org/NomenclaturalActs/04B5D5F4-648A-489F-ADE9-2C13971F8A69
> Anomia Gryphus Linnæus, 1758. Here a fossil species was described, and in
> Zoobank it was marked as "Fossil: No".
Many thanks for the correction! I have already implemented it on ZooBank (it took me 7 seconds to correct this -- but you did the hard part of finding the error, and made it extremely easy for me by providing the link).
I want to thank you again for providing all of these VERY VALUABLE corrections to names in ZooBank. I will study them in more detail (along with your other recent messages), and will likely come back to you with follow-up questions.
Aloha,
Rich