[Taxacom] prolific species describers
Richard Pyle
deepreef at bishopmuseum.org
Sat Apr 13 17:28:41 CDT 2013
These are the kinds of questions we could answer in a split second with a
fully-populated Global Names Usage Bank. I know those sound like very empty
and/or idealistic and/or naïve words (fair enough). But those words also
happen to be true.
I had never thought to do this sort of analysis before using GNUB, but it
took me only about five minutes to write a query that produced the following
list, filtered by "Actinopterygii" (only the top 20 records are shown; the
total results included 3899 records):
Name                                                 Total  Valid    %  Start   End
-----------------------------------------------------------------------------------
Bleeker, Pieter                                       1952    829   42   1836  1931
Valenciennes, A.                                      1893    695   36   1821  1986
Günther, Albert Charles Lewis Gotthilf                1646    913   55   1857  1910
Fowler, Henry Weed                                    1422    577   40   1899  1977
Jordan, David Starr                                   1325    684   51   1875  2000
Boulenger, George Albert                              1135    791   69   1881  1923
Cuvier, Georges                                       1108    419   37   1798  1986
Steindachner, F.                                      1040    591   56   1860  1970
Regan, C. T.                                           909    569   62   1902  1940
Gilbert, Charles H.                                    788    601   76   1878  2000
Eigenmann, C. H.                                       751    537   71   1887  1948
Bloch, Marcus Elieser                                  706    265   37   1779  1835
Lacépède, Bernard Germain Étienne de                   617    140   22   1797  1917
Randall, John Ernest                                   593    568   95   1955  2011
Richardson, J.                                         538    205   38   1823  1930
Linnaeus, Carolus                                      471    363   77   1758  1771
Poey, F.                                               469    108   23   1851  2000
de Laporte Castelnau La Ferté-Sénectère, Francis L.    464    106   22   1855  1879
Cope, Edward Drinker                                   437    168   38   1861  1910
Schneider, J. G.                                       431    118   27   1801  1801
I want to make something perfectly clear: GNUB deserves exactly ZERO credit
for the data behind these numbers. In this case, because I filtered on
Actinopterygii (ray-finned fishes), essentially all of the data come from
Bill Eschmeyer's "Catalog of Fishes". Those data include two major
components:
- Original descriptions of species (linked to literature citations)
- "Current status" of each species (i.e., whether the species is currently
regarded as valid, or as a junior synonym)
So, what does GNUB contribute? Two things:
1) A data model that is both highly normalized (generalized), and is
scalable to effectively all treatments of all taxon names throughout all
history
2) An architecture that allows leveraging myriad data sources (while
attributing appropriate credit to those sources) to be able to answer
questions that cannot be answered by any single data source
It's worth unpacking these two things a bit more.
1) The word "normalized" is geek-speak for how data models are designed. A
general trend is that normalized data models more finely atomize the data
content. More highly normalized data models tend to have far more tables
and links between those tables, and generally make life difficult for anyone
wanting to build a user-friendly interface to access and maintain data.
Also, many (most?) datasets relating to biodiversity data that currently
exist are not so highly normalized, and it is an amazingly difficult and
time-consuming process to transform non-normalized (i.e., "flat") data into
a highly normalized structure (at least with any semblance of accuracy).
However, once you get over that barrier of capturing & parsing the data in a
normalized form, a normalized data model will be able to answer questions
that weren't even conceived when the data model was designed.
The example above is a perfect case in point: because the GNUB data model
is highly normalized, it only took me about five minutes (far less time than
it took me to write this email) to ask the question, "How many species have
been named by each author, how many of those names are still regarded as
valid, and what was the range of years in which they published the new
names?" As I said above, I never considered asking that question when the
data model was originally designed, but because it's highly normalized, it
was amazingly easy for me to ask this novel question, and get a real &
meaningful answer.
I'll also point out a couple of things about the dataset above: first, it
includes names established by individual people -- meaning that names
established as "Carl Linnaeus" and "Carl von Linné" are both included for
"Linnaeus, Carolus". The same is true for names established by, for
example, "Rofen, Robert R.", who also published under "Harry, Robert R."
(e.g., see: http://zoobank.org/01E53574-BB1F-4371-A3E6-DC145CEDF6ED). Also,
I tossed in the "Start" and "End" year for publishing new species just
because it was incredibly easy to do. There are a near infinite number of
things I could also have calculated -- such as the average number of
co-authors, the average number of new species per publication, the ratio of
new species to new genera, or almost anything else someone might want to
ask.
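To make the idea concrete, here is a minimal sketch of such a query in Python with SQLite. The three-table layout (person, alias, name), every column name, and the handful of rows are invented for illustration; this is not GNUB's actual schema or data. It only shows how linking each published alias to a single person record lets one GROUP BY produce the Total, Valid, %, Start and End columns.

```python
import sqlite3

# Hypothetical, heavily simplified schema; NOT the real GNUB data model.
SCHEMA = """
CREATE TABLE person (person_id INTEGER PRIMARY KEY, display_name TEXT);
CREATE TABLE alias  (alias_id  INTEGER PRIMARY KEY,
                     person_id INTEGER REFERENCES person(person_id),
                     published_as TEXT);
CREATE TABLE name   (name_id   INTEGER PRIMARY KEY,
                     alias_id  INTEGER REFERENCES alias(alias_id),
                     year      INTEGER,
                     is_valid  INTEGER);  -- 1 = currently valid, 0 = not
"""

# Names are counted per person, so every alias of that person is pooled.
QUERY = """
SELECT p.display_name,
       COUNT(*)                                                   AS total,
       SUM(n.is_valid)                                            AS valid,
       CAST(ROUND(100.0 * SUM(n.is_valid) / COUNT(*)) AS INTEGER) AS pct,
       MIN(n.year)                                                AS start_year,
       MAX(n.year)                                                AS end_year
FROM name n
JOIN alias  a ON a.alias_id  = n.alias_id
JOIN person p ON p.person_id = a.person_id
GROUP BY p.person_id
ORDER BY total DESC, p.display_name
"""


def build_demo_db():
    """Create an in-memory database filled with made-up toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO person VALUES (?, ?)", [
        (1, "Rofen, Robert R."),
        (2, "Linnaeus, Carolus"),
    ])
    conn.executemany("INSERT INTO alias VALUES (?, ?, ?)", [
        (1, 1, "Rofen, Robert R."),
        (2, 1, "Harry, Robert R."),
        (3, 2, "Carl Linnaeus"),
        (4, 2, "Carl von Linné"),
    ])
    # (alias_id, year, is_valid): invented values, for illustration only.
    conn.executemany(
        "INSERT INTO name (alias_id, year, is_valid) VALUES (?, ?, ?)", [
            (1, 1963, 1), (1, 1966, 0),
            (2, 1951, 1), (2, 1953, 1),
            (3, 1758, 1), (3, 1758, 1),
            (4, 1766, 0),
        ])
    return conn


def author_summary(conn):
    """Return (name, total, valid, pct, start_year, end_year) per person."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in author_summary(build_demo_db()):
        print(row)
```

Other statistics of the kind listed above (co-authors per name, new species per publication) would be further aggregates over additional linking tables in the same style.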
2) Although the highly normalized data model makes it very easy to ask novel
and previously unanticipated questions of this sort, the REAL power of GNUB
comes from its core function to integrate information from myriad data
sources. I'll say this again for emphasis: GNUB deserves exactly ZERO
credit for the data behind these numbers! There is currently no logo for
GNUB, and there is no reason why GNUB should ever take any credit for the
data records themselves. I've said this many times before, but it's always
worth repeating: GNUB is *NOT* simply "yet another database; yet another
acronym" to compete with all the other alphabet soup of acronyms already
generating, harvesting, aggregating, etc. data about scientific names.
Rather, GNUB is intended to be the invisible architecture operating behind
the scenes that allows all the other biodiversity datasets in the world to
more seamlessly cross-link to each other. As I've also said many times
before, an analogy for GNUB is the DNS system -- anyone who ever uses the
internet relies on DNS, and almost nobody who uses the internet is aware of
it.
Why is there a need for something like GNUB? The world has many, many data
sources for taxon names. The only areas where we need more such databases
are the taxonomic groups that are not yet included in any of the existing
databases. That's not really GNUB's role (although it certainly could serve
in that capacity for taxonomic groups without another, more dedicated
repository). The real function of GNUB is to interconnect those different
datasets that already exist, and thereby increase their functional utility
by allowing them to cross-leverage each other.
For example, I chose to filter the above dataset on Actinopterygii because
the bulk of Bill Eschmeyer's Catalog of Fishes is already indexed by GNUB,
and therefore I knew I'd get a relatively comprehensive result set.
Obviously the same is not true for the many existing datasets that are not
yet included within the GNUB index. Unfortunately, that makes it hard to
see the advantage of GNUB, because you could get these same numbers directly
from the Catalog of Fishes itself. Granted, it would have taken a bit more
than five minutes to do that (largely because CoF doesn't assign GUIDs to
authors and cross-link aliases of authors), but that's more a function of
normalization than data integration. But CoF only tracks fish names, so
Carl Linnaeus comes out at #16 in the list, with only 471 species. What CoF
can't do, is generate the following:
Name                                     Total  Valid    %  Start   End
-----------------------------------------------------------------------
Linnaeus, Carolus                         4559   4448   97   1758  1789
Bleeker, Pieter                           2004    850   42   1836  1931
Valenciennes, A.                          1933    711   36   1821  1986
Günther, Albert Charles Lewis Gotthilf    1692    938   55   1857  1910
Fowler, Henry Weed                        1450    588   40   1899  1977
Jordan, David Starr                       1377    717   52   1875  2000
Sharp, David                              1188   1188  100   1873  1919
Boulenger, George Albert                  1139    794   69   1881  1923
Cuvier, Georges                           1132    434   38   1798  1986
Steindachner, F.                          1053    597   56   1860  1970
Regan, C. T.                               936    583   62   1902  1940
Gilbert, Charles H.                        823    623   75   1878  2000
Perkins, Robert Cyril Layton               788    788  100   1899  1938
Hardy, D. Elmo                             756    756  100   1953  1981
Bloch, Marcus Elieser                      753    281   37   1779  1835
Eigenmann, C. H.                           751    537   71   1887  1948
Lacépède, Bernard Germain Étienne de       650    146   22   1797  1917
Randall, John Ernest                       622    597   95   1955  2011
Yamaguti, S.                               581    581  100   1934  1971
Richardson, J.                             554    213   38   1823  1930
This is the same query except not filtered by Actinopterygii. Obviously,
the data are still skewed heavily towards authors of fish names, because
that's the skew of the content currently in GNUB. I'd be much happier to
see names like Charles Paul Alexander, Johann Christian Fabricius, Oldfield
Thomas, Theodore Cockerell, Francis Walker and Maurice Pic (among the other
names contributed so far in this thread). But that won't happen until we
focus more on importing larger datasets into the GNUB index. Most of the
effort so far has been building and testing the core infrastructure, rather
than bulk-importing content. And, as I said before, bulk import is tedious
when the content is coming from a flat dataset (because it must be parsed
and normalized and cross-compared to what is already included in the GNUB
index). But the infrastructure-building and -testing phase of GNUB
development is now winding down, so the content importing process (e.g.,
Sherborn, Hymenoptera Name Server, and many others already in the queue)
will now begin to ramp up. As it does, these sorts of questions will get
more and more meaningful cross-taxon answers.
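The "parse, normalize and cross-compare" step for a flat dataset can be sketched roughly as follows. This is only a toy under stated assumptions: the Index class, its method names, the whitespace-and-case matching rule, and the placeholder names ("Aus bus" and so on) are all hypothetical, and real author matching needs far more care than this.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    """Toy stand-in for a normalized index (hypothetical, not GNUB's design)."""
    people: dict = field(default_factory=dict)           # person_id -> display name
    alias_to_person: dict = field(default_factory=dict)  # normalized alias -> person_id
    names: list = field(default_factory=list)            # (name, person_id, year)

    @staticmethod
    def _key(author):
        # Crude normalization: lower-case and collapse runs of whitespace.
        return " ".join(author.lower().split())

    def resolve_author(self, author):
        """Return the person_id for an author string, creating one if new."""
        key = self._key(author)
        if key not in self.alias_to_person:
            person_id = len(self.people) + 1
            self.people[person_id] = " ".join(author.split())
            self.alias_to_person[key] = person_id
        return self.alias_to_person[key]

    def add_alias(self, alias, person_id):
        """Record that `alias` is another published name of an existing person."""
        self.alias_to_person[self._key(alias)] = person_id

    def import_flat(self, rows):
        """Import flat rows, skipping records already in the index."""
        added = 0
        for row in rows:
            person_id = self.resolve_author(row["author"])
            record = (row["name"].strip(), person_id, int(row["year"]))
            if record not in self.names:  # cross-compare with existing content
                self.names.append(record)
                added += 1
        return added


def demo():
    index = Index()
    rofen = index.resolve_author("Rofen, Robert R.")
    index.add_alias("Harry, Robert R.", rofen)
    flat_rows = [  # made-up flat records with placeholder taxon names
        {"name": "Aus bus", "author": "Harry, Robert R.", "year": "1951"},
        {"name": "Aus cus", "author": "  rofen,  robert r. ", "year": "1963"},
        {"name": "Aus bus", "author": "Harry, Robert R.", "year": "1951"},
        {"name": "Dus eus", "author": "Someone, New", "year": "1970"},
    ]
    return index, index.import_flat(flat_rows)


if __name__ == "__main__":
    index, added = demo()
    print(added, index.people, index.names)
```

In this toy run the duplicate row is skipped, both spellings of the first author resolve to the same person, and the unknown author gets a new person record.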
Finally, this cross-taxon analysis is only part of the value of the data
integration aspect of GNUB. The other part is cross-service integration.
For example, the query results above require information on nomenclature
(names described), information on taxonomy (percent currently regarded as
valid; and also if you want to cluster by major group of organism, like
Actinopterygii), and information on literature (Start and End years). In
the case of fishes, Bill Eschmeyer's database is primarily a nomenclator,
but it also makes assertions about the current status of names -- so I was
able to pull all that information together from essentially a single source.
In many cases, however, the same source does not provide all kinds of
information. What makes the data integration aspect of GNUB *really*
powerful is that it would allow you to re-run this query instantly using a
different authority for classification & taxonomy. For example, you could
switch to seeing the % valid values as per current status asserted by
FishBase, rather than CoF. And, of course, you could do all sorts of cool
analyses to compare individual taxonomies as treated by CoF vs. FishBase
vs. whoever. And wouldn't it be nice to also tie-in BHL pages? And
WikiSpecies pages? And CoL? And EoL? And WoRMS? And countless others? And
wouldn't it be nice to be able to jump directly to Museum records for type
specimens? And global distribution maps in GBIF based on current status
according to (e.g.) Species2000?
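Switching the authority behind the "% valid" column might look something like the sketch below, again in Python with SQLite. The status_assertion table, its columns, and all of the rows are assumptions made up for this example; "CoF" and "FishBase" are used purely as source labels, and none of the values are real assertions from either source.

```python
import sqlite3


def build_demo_db():
    """In-memory database with invented status assertions from two sources."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE name (name_id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE status_assertion (
        name_id  INTEGER REFERENCES name(name_id),
        source   TEXT,      -- which authority makes the assertion
        is_valid INTEGER    -- 1 = treated as valid by that source
    );
    """)
    conn.executemany("INSERT INTO name VALUES (?, ?)", [
        (1, "Author A"), (2, "Author A"), (3, "Author A"), (4, "Author B"),
    ])
    # Toy values only; not real data from either source.
    conn.executemany("INSERT INTO status_assertion VALUES (?, ?, ?)", [
        (1, "CoF", 1), (2, "CoF", 0), (3, "CoF", 1), (4, "CoF", 1),
        (1, "FishBase", 1), (2, "FishBase", 0),
        (3, "FishBase", 0), (4, "FishBase", 1),
    ])
    return conn


def percent_valid(conn, source):
    """Percent of each author's names that `source` treats as valid."""
    rows = conn.execute("""
        SELECT n.author,
               CAST(ROUND(100.0 * SUM(s.is_valid) / COUNT(*)) AS INTEGER)
        FROM name n
        JOIN status_assertion s ON s.name_id = n.name_id
        WHERE s.source = ?
        GROUP BY n.author
        ORDER BY n.author
    """, (source,)).fetchall()
    return dict(rows)


def disagreements(conn, source_a, source_b):
    """IDs of names whose validity differs between the two sources."""
    rows = conn.execute("""
        SELECT a.name_id
        FROM status_assertion a
        JOIN status_assertion b ON b.name_id = a.name_id
        WHERE a.source = ? AND b.source = ? AND a.is_valid <> b.is_valid
        ORDER BY a.name_id
    """, (source_a, source_b)).fetchall()
    return [name_id for (name_id,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    for source in ("CoF", "FishBase"):
        print(source, percent_valid(conn, source))
    print("differ on:", disagreements(conn, "CoF", "FishBase"))
```

Because the authority is just a query parameter here, re-running the analysis under a different taxonomy means changing one argument, and comparing two taxonomies is a self-join on the same table.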
The Global Names Architecture (of which GNUB is only one component) is
intended to facilitate exactly that sort of inter-database connectivity.
Personally, I feel that the biggest barrier to its success is the
understandable fear by many managers of large datasets that GNA/GNUB
represents some sort of threat to their future operations. My hope in this
extremely long message is to convey the opposite. Specifically, I will only
consider this endeavor a success when the ratio of people who utilize GNUB
services on a daily basis to people who *know* they are using GNUB services
on a daily basis approaches that same ratio for all internet users compared
to the tiny subset of users who understand the role that DNS and IP
addresses play in their internet activities. The goal is for users to be
able to jump between all the different relevant datasets seamlessly with a
single mouse click, without those users knowing why that jump is so
seamless. The users should only see the data-provider and service-provider
web pages; and these same providers should be able to proudly display how
often their datasets are used in these sorts of cross-discipline services
(in the same way that many web pages proudly display their usage
statistics).
Damn -- I guess I'll never get that hour back.....
Aloha,
Rich
P.S. Bonus points for those who: a) read all the way down to here; b)
noticed something odd about the end year for a few rows in the first table
above; and c) gave the literature citation that makes those odd values
technically correct. Hint: the numbers in the lists above include
unavailable names.
> -----Original Message-----
> From: taxacom-bounces at mailman.nhm.ku.edu On Behalf Of Ohl, Michael
> Sent: Saturday, April 13, 2013 6:21 AM
> To: Ken Kinman
> Cc: taxacom at mailman.nhm.ku.edu
> Subject: Re: [Taxacom] prolific species describers
>
> Hi Ken,
>
> I agree! Thanks for suggesting Oldfield Thomas for mammals.
>
> So non-entomologists, please post numbers for any group, which are
> extraordinary in any taxon framework, even if much smaller than in
> entomology.
>
> Cheers, Michael
>
> Sent from my iPad
>
> On 13.04.2013 at 18:07, "Ken Kinman" <kinman at hotmail.com> wrote:
>
> Hi Michael,
>
> I don't think any vertebrate zoologists could compete with the
> entomologists for such numbers. I remembered that Oldfield Thomas of the
> British Museum named large numbers of mammals, and I see that his
> biography on Wikipedia says that he named 2,000 species and subspecies. I
> don't know if that number includes only recognized species and subspecies
> (many of his names are now merely synonyms of recognized species and
> subspecies).
>
> -------------Ken
>
>
>
> --------------------------------------------------------------------------
>
> > From: Michael.Ohl at mfn-berlin.de
> > To: taxacom at mailman.nhm.ku.edu
> > Date: Sat, 13 Apr 2013 13:53:02 +0000
> > Subject: [Taxacom] prolific species describers
> >
> > Hi,
> >
> > I had expected that this request has already been posted in Taxacom, but
> > my (very) quick search in the Taxacom archive resulted in nothing. So
> > here it is (again?).
> >
> > I am trying to set up a list of the most prolific species describers in
> > zoology. A few names immediately came to my mind, like Charles Paul
> > Alexander with more than 11,000 names in Diptera, and Johann Christian
> > Fabricius with more than 10,000 names across several insect orders. But
> > there might be more!
> >
> > So please post your bid for higher or similar numbers! I will distribute
> > a top-10 list at the end.
> >
> > Although I am basically interested in the total number of names
> > published by a single person, any further statistics are welcome, if
> > available. Number of names still valid today, total number of genus- vs.
> > species-group names ....
> >
> > Cheers, Michael Ohl
> >
> > Museum fuer Naturkunde, Berlin
> > _______________________________________________
> > Taxacom Mailing List
> > Taxacom at mailman.nhm.ku.edu
> > http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
> >
> > The Taxacom Archive back to 1992 may be searched with either of these
> > methods:
> >
> > (1) by visiting http://taxacom.markmail.org
> >
> > (2) a Google search specified as:
> > site:mailman.nhm.ku.edu/pipermail/taxacom your search terms here
> >
> > Celebrating 26 years of Taxacom in 2013.