[Taxacom] prolific species describers
Boyle, Bradley L - (bboyle)
bboyle at email.arizona.edu
Mon Apr 15 12:51:46 CDT 2013
Rich, thank you for that wonderful rap on normalization. I'm going to print it, frame it, and mount it on the wall behind my desk. All I'll have to do is point to it the next time the denormalization mafia comes calling.
I'll claim a bonus point for (a). As for (b), I don't know what you're talking about. I thought it was normal for fish taxonomists to live 150 years or more.
Brad
On Apr 14, 2013, at 10:00 AM, <taxacom-request at mailman.nhm.ku.edu> wrote:
> ------------------------------
>
> Message: 10
> Date: Sat, 13 Apr 2013 12:28:41 -1000
> From: "Richard Pyle" <deepreef at bishopmuseum.org>
> Subject: Re: [Taxacom] prolific species describers
> To: "'Ohl, Michael'" <Michael.Ohl at mfn-berlin.de>, "'Ken Kinman'"
> <kinman at hotmail.com>
> Cc: taxacom at mailman.nhm.ku.edu
> Message-ID: <005101ce3896$41532160$c3f96420$@bishopmuseum.org>
> Content-Type: text/plain; charset="iso-8859-1"
>
> These are the kinds of questions we could answer in a split second with a
> fully-populated Global Names Usage Bank. I know those sound like very empty
> and/or idealistic and/or naïve words (fair enough). But those words also
> happen to be true.
>
> I had never thought to do this sort of analysis before using GNUB, but it
> took me only about five minutes to write a query that produced the following
> list, filtered by "Actinopterygii" (only the top 20 records are shown; the
> total results included 3899 records):
>
> Name                                               Total Valid    % Start   End
> -------------------------------------------------------------------------------
> Bleeker, Pieter                                     1952   829   42  1836  1931
> Valenciennes, A.                                    1893   695   36  1821  1986
> Günther, Albert Charles Lewis Gotthilf              1646   913   55  1857  1910
> Fowler, Henry Weed                                  1422   577   40  1899  1977
> Jordan, David Starr                                 1325   684   51  1875  2000
> Boulenger, George Albert                            1135   791   69  1881  1923
> Cuvier, Georges                                     1108   419   37  1798  1986
> Steindachner, F.                                    1040   591   56  1860  1970
> Regan, C. T.                                         909   569   62  1902  1940
> Gilbert, Charles H.                                  788   601   76  1878  2000
> Eigenmann, C. H.                                     751   537   71  1887  1948
> Bloch, Marcus Elieser                                706   265   37  1779  1835
> Lacépède, Bernard Germain Étienne de                 617   140   22  1797  1917
> Randall, John Ernest                                 593   568   95  1955  2011
> Richardson, J.                                       538   205   38  1823  1930
> Linnaeus, Carolus                                    471   363   77  1758  1771
> Poey, F.                                             469   108   23  1851  2000
> de Laporte Castelnau La Ferté-Sénectère, Francis L.  464   106   22  1855  1879
> Cope, Edward Drinker                                 437   168   38  1861  1910
> Schneider, J. G.                                     431   118   27  1801  1801
>
> I want to make something perfectly clear: GNUB deserves exactly ZERO credit
> for the data behind these numbers. In this case, because I filtered on
> Actinopterygii (ray-finned fishes), essentially all of the data come from
> Bill Eschmeyer's "Catalog of fishes". Those data include two major
> components:
> - Original descriptions of species (linked to literature citations)
> - "Current status" of each species (i.e., whether the species is currently
> regarded as valid, or as a junior synonym)
>
> So, what does GNUB contribute? Two things:
> 1) A data model that is both highly normalized (generalized), and is
> scalable to effectively all treatments of all taxon names throughout all
> history
> 2) An architecture that allows leveraging myriad data sources (while
> attributing appropriate credit to those sources) to be able to answer
> questions that cannot be answered by any single data source
>
> It's worth unpacking these two things a bit more.
>
> 1) The word "normalized" is geek-speak for how data models are designed. A
> general trend is that normalized data models more finely atomize the data
> content. More highly normalized data models tend to have far more tables
> and links between those tables, and generally make life difficult for anyone
> wanting to build a user-friendly interface to access and maintain data.
> Also, many (most?) datasets relating to biodiversity data that currently
> exist are not so highly normalized, and it is an amazingly difficult and
> time-consuming process to transform non-normalized (i.e., "flat") data into
> a highly normalized structure (at least with any semblance of accuracy).
> However, once you get over that barrier of capturing & parsing the data in a
> normalized form, a normalized data model will be able to answer questions
> that weren't even conceived when the data model was designed.
>
> The example above is a perfect case in point: because the GNUB data model
> is highly normalized, it only took me about five minutes (far less time than
> it took me to write this email) to ask the question, "How many species have
> been named by each author, how many of those names are still regarded as
> valid, and what was the range of years in which they published the new
> names?" As I said above, I never considered asking that question when the
> data model was originally designed, but because it's highly normalized, it
> was amazingly easy for me to ask this novel question, and get a real &
> meaningful answer.
>
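
The question quoted above maps naturally onto a grouped query over a normalized schema. What follows is only a minimal sketch using Python's built-in sqlite3 module: the table and column names (agent, reference, taxon_name, name_author) and the sample rows are invented for illustration, and the actual GNUB schema and query are not shown in this message.

```python
import sqlite3

# Hypothetical, much-simplified schema (an assumption, not the GNUB model).
SCHEMA = """
CREATE TABLE agent (agent_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reference (ref_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE taxon_name (
    name_id  INTEGER PRIMARY KEY,
    epithet  TEXT,
    ref_id   INTEGER REFERENCES reference(ref_id),  -- original description
    is_valid INTEGER                                 -- 1 = currently valid
);
CREATE TABLE name_author (
    name_id  INTEGER REFERENCES taxon_name(name_id),
    agent_id INTEGER REFERENCES agent(agent_id)
);
"""

# Per author: names established, names still valid, percent valid,
# and first/last year in which a new name was published.
QUERY = """
SELECT a.name,
       COUNT(*)                                                   AS total,
       SUM(n.is_valid)                                            AS valid,
       CAST(ROUND(100.0 * SUM(n.is_valid) / COUNT(*)) AS INTEGER) AS pct,
       MIN(r.year)                                                AS start_year,
       MAX(r.year)                                                AS end_year
FROM agent a
JOIN name_author na ON na.agent_id = a.agent_id
JOIN taxon_name n   ON n.name_id = na.name_id
JOIN reference r    ON r.ref_id = n.ref_id
GROUP BY a.agent_id
ORDER BY total DESC, a.name
"""


def author_summary(conn):
    """Return one row per author: (name, total, valid, pct, start, end)."""
    return conn.execute(QUERY).fetchall()


def demo():
    """Build a tiny in-memory database with placeholder rows and query it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO agent VALUES (?, ?)",
                     [(1, "Author A"), (2, "Author B")])
    conn.executemany("INSERT INTO reference VALUES (?, ?)",
                     [(1, 1850), (2, 1860), (3, 1900)])
    conn.executemany("INSERT INTO taxon_name VALUES (?, ?, ?, ?)",
                     [(1, "alpha", 1, 1), (2, "beta", 2, 0),
                      (3, "gamma", 2, 1), (4, "delta", 3, 1)])
    # Name 3 has two co-authors, so it counts once for each of them.
    conn.executemany("INSERT INTO name_author VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 1), (4, 2), (3, 2)])
    return author_summary(conn)
```

Because authorship sits in its own link table, a co-authored name is counted for every co-author without any change to the query; a flat table with a single free-text author column would first need parsing.
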
> I'll also point out a couple of things about the dataset above: first, it
> includes names established by individual people -- meaning that names
> established as "Carl Linnaeus" and "Carl von Linné" are both included for
> "Linnaeus, Carolus". The same is true for names established by, for
> example, "Rofen, Robert R.", who also published under "Harry, Robert R."
> (e.g., see: http://zoobank.org/01E53574-BB1F-4371-A3E6-DC145CEDF6ED). Also,
> I tossed in the "Start" and "End" year for publishing new species just
> because it was incredibly easy to do. There are a near infinite number of
> things I could also have calculated -- such as the average number of
> co-authors, the average number of new species per publication, the ratio of
> new species to new genera, or almost anything else someone might want to
> ask.
>
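
The alias handling described above can be illustrated with a small lookup from published author strings to a single person identifier. This is a sketch under stated assumptions: the identifiers "linnaeus" and "rofen" and the shape of the input are hypothetical, and the message does not say how GNUB itself stores aliases.

```python
from collections import Counter

# Hypothetical alias table: each author string as published points at one
# person identifier. The identifiers are made up for this sketch.
ALIASES = {
    "Carl Linnaeus": "linnaeus",
    "Carl von Linné": "linnaeus",
    "Rofen, Robert R.": "rofen",
    "Harry, Robert R.": "rofen",
}


def names_per_person(authorships, aliases=ALIASES):
    """Count names per person.

    `authorships` is a list of (taxon name, author string as published).
    Author strings missing from the alias table are counted under
    themselves rather than merged with anyone.
    """
    counts = Counter()
    for _taxon_name, author_string in authorships:
        person = aliases.get(author_string, author_string)
        counts[person] += 1
    return counts
```

For example, three placeholder names published under "Carl Linnaeus", "Carl von Linné" and "Harry, Robert R." would be tallied as two names for one person and one name for another, which is the behaviour a per-author count needs.
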
> 2) Although the highly normalized data model makes it very easy to ask novel
> and previously unanticipated questions of this sort, the REAL power of GNUB
> comes from its core function to integrate information from myriad data
> sources. I'll say this again for emphasis: GNUB deserves exactly ZERO
> credit for the data behind these numbers! There is currently no logo for
> GNUB, and there is no reason why GNUB should ever take any credit for the
> data records themselves. I've said this many times before, but it's always
> worth repeating: GNUB is *NOT* simply "yet another database; yet another
> acronym" to compete with all the other alphabet soup of acronyms already
> generating, harvesting, aggregating, etc. data about scientific names.
> Rather, GNUB is intended to be the invisible architecture operating behind
> the scenes that allows all the other biodiversity datasets in the world to
> more seamlessly cross-link to each other. As I've also said many times
> before, an analogy for GNUB is the DNS system -- anyone who ever uses the
> internet relies on DNS, and almost nobody who uses the internet is aware of
> it.
>
> Why is there a need for something like GNUB? The world has many, many data
> sources for taxon names. The only areas where we need more of those databases
> are the taxonomic groups that are not yet included in any of the existing
> databases. That's not really GNUB's role (although it certainly could serve
> in that capacity for taxonomic groups without another, more dedicated
> repository). The real function of GNUB is to interconnect those different
> datasets that already exist, and thereby increase their functional utility
> by allowing them to cross-leverage each other.
>
> For example, I chose to filter the above dataset on Actinopterygii because
> the bulk of Bill Eschmeyer's Catalog of Fishes is already indexed by GNUB,
> and therefore I knew I'd get a relatively comprehensive result set.
> Obviously the same is not true for the many existing datasets that are not
> yet included within the GNUB index. Unfortunately, that makes it hard to
> see the advantage of GNUB, because you could get these same numbers directly
> from the Catalog of Fishes itself. Granted, it would have taken a bit more
> than five minutes to do that (largely because CoF doesn't assign GUIDs to
> authors and cross-link aliases of authors), but that's more a function of
> normalization than data integration. But CoF only tracks fish names, so
> Carl Linnaeus comes out at #16 in the list, with only 471 species. What CoF
> can't do is generate the following:
>
> Name                                    Total Valid    % Start   End
> --------------------------------------------------------------------
> Linnaeus, Carolus                        4559  4448   97  1758  1789
> Bleeker, Pieter                          2004   850   42  1836  1931
> Valenciennes, A.                         1933   711   36  1821  1986
> Günther, Albert Charles Lewis Gotthilf   1692   938   55  1857  1910
> Fowler, Henry Weed                       1450   588   40  1899  1977
> Jordan, David Starr                      1377   717   52  1875  2000
> Sharp, David                             1188  1188  100  1873  1919
> Boulenger, George Albert                 1139   794   69  1881  1923
> Cuvier, Georges                          1132   434   38  1798  1986
> Steindachner, F.                         1053   597   56  1860  1970
> Regan, C. T.                              936   583   62  1902  1940
> Gilbert, Charles H.                       823   623   75  1878  2000
> Perkins, Robert Cyril Layton              788   788  100  1899  1938
> Hardy, D. Elmo                            756   756  100  1953  1981
> Bloch, Marcus Elieser                     753   281   37  1779  1835
> Eigenmann, C. H.                          751   537   71  1887  1948
> Lacépède, Bernard Germain Étienne de      650   146   22  1797  1917
> Randall, John Ernest                      622   597   95  1955  2011
> Yamaguti, S.                              581   581  100  1934  1971
> Richardson, J.                            554   213   38  1823  1930
>
> This is the same query except not filtered by Actinopterygii. Obviously,
> the data are still skewed heavily towards authors of fish names, because
> that's the skew of the content currently in GNUB. I'd be much happier to
> see names like Charles Paul Alexander, Johann Christian Fabricius, Oldfield
> Thomas, Theodore Cockerell, Francis Walker and Maurice Pic (among the other
> names contributed so far in this thread). But that won't happen until we
> focus more on importing larger datasets into the GNUB index. Most of the
> effort so far has been building and testing the core infrastructure, rather
> than bulk-importing content. And, as I said before, bulk import is tedious
> when the content is coming from a flat dataset (because it must be parsed
> and normalized and cross-compared to what is already included in the GNUB
> index). But the infrastructure-building and -testing phase of GNUB
> development is now winding down, so the content importing process (e.g.,
> Sherborn, Hymenoptera Name Server, and many others already in the queue)
> will now begin to ramp up. As it does, these sorts of questions will get
> more and more meaningful cross-taxon answers.
>
> Finally, this cross-taxon analysis is only part of the value of the data
> integration aspect of GNUB. The other part is cross-service integration.
> For example, the query results above require information on nomenclature
> (names described), information on taxonomy (percent currently regarded as
> valid; and also if you want to cluster by major group of organism, like
> Actinopterygii), and information on literature (Start and End years). In
> the case of fishes, Bill Eschmeyer's database is primarily a nomenclator,
> but it also makes assertions about the current status of names -- so I was
> able to pull all that information together from essentially a single source.
> In many cases, however, the same source does not provide all kinds of
> information. What makes the data integration aspect of GNUB *really*
> powerful is that it would allow you to re-run this query instantly using a
> different authority for classification & taxonomy. For example, you could
> switch to seeing the % valid values as per current status asserted by
> FishBase, rather than CoF. And, of course, you could do all sorts of cool
> analyses to compare individual taxonomies as treated by CoF vs. FishBase
> vs. whoever. And wouldn't it be nice to also tie-in BHL pages? And
> WikiSpecies pages? And CoL? And EoL? And WoRMS? And countless others? And
> wouldn't it be nice to be able to jump directly to Museum records for type
> specimens? And global distribution maps in GBIF based on current status
> according to (e.g.) Species2000?
>
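
Switching the taxonomic authority, as described above, amounts to treating "current status" as an assertion made by a named source rather than as a property of the name itself. The sketch below assumes a hypothetical mapping keyed by (authority, name); the function name and data shape are illustrative only and say nothing about how CoF, FishBase or GNUB actually expose status data.

```python
def percent_valid(names, assertions, authority):
    """Percent of `names` that `authority` treats as valid.

    `assertions` maps (authority, name) -> True/False. Names for which the
    chosen authority makes no assertion are skipped. Returns None when the
    authority treats none of the names.
    """
    statuses = [
        assertions[(authority, name)]
        for name in names
        if (authority, name) in assertions
    ]
    if not statuses:
        return None
    return round(100 * sum(statuses) / len(statuses))
```

Calling the same function with a different `authority` argument is then the whole "re-run": the list of names stays fixed and only the set of status assertions consulted changes.
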
> The Global Names Architecture (of which GNUB is only one component) is
> intended to facilitate exactly that sort of inter-database connectivity.
> Personally, I feel that the biggest barrier to its success is the
> understandable fear by many managers of large datasets that GNA/GNUB
> represents some sort of threat to their future operations. My hope in this
> extremely long message is to convey the opposite. Specifically, I will only
> consider this endeavor a success when the ratio of people who utilize GNUB
> services on a daily basis to people who *know* they are using GNUB services
> on a daily basis approaches that same ratio for all internet users compared
> to the tiny subset of users who understand the role that DNS and IP
> addresses play in their internet activities. The goal is for users to be
> able to jump between all the different relevant datasets seamlessly with a
> single mouse click, without those users knowing why that jump is so
> seamless. The users should only see the data-provider and service-provider
> web pages; and these same providers should be able to proudly display how
> often their datasets are used in these sorts of cross-discipline services
> (in the same way that many web pages proudly display their usage
> statistics).
>
> Damn -- I guess I'll never get that hour back.....
>
> Aloha,
> Rich
>
> P.S. Bonus points for those who: a) read all the way down to here; b)
> noticed something odd about the end year for a few rows in the first table
> above; and c) give the literature citation that makes those odd values
> technically correct. Hint: the numbers in the lists above include
> unavailable names.
>
__________________________________
Brad Boyle, Ph.D.
Dept. of Ecology and Evolutionary Biology
University of Arizona
BSW 310, P.O. Box 210088
Tucson, AZ 85721, USA
520-626-3336
bboyle at email.arizona.edu
Botanical Information & Ecology Network:
http://bien.nceas.ucsb.edu/bien/
The SALVIAS Project:
http://www.salvias.net
Taxonomic Name Resolution Service:
http://tnrs.iplantcollaborative.org/
__________________________________