[Taxacom] FW: Wikipedia classification

Thu Jul 2 19:45:39 CDT 2009

Dear Paul / Rod / all,

Taking the bait here, I may live to regret it...

We have had elements of this conversation before, no doubt will again, but as one of the aggregators either mentioned or implied, let me see if there is a way to shed light on my reasons for doing so and hence my "world view" on these matters.

First, the purpose of doing scientific research or supporting activities (like compiling lists of species names or taxonomic opinions, or asserting them) is surely to advance the science so that other workers (a) can use the results, and (b) do not have to repeat the work. In my view there is really no difference between a print source and a database / web source - you are putting the results into public accessibility for others to utilise (and let us hope, credit their source/s).

Second, for some purposes (like within-taxon comparisons and analyses), the "specialist group" databases and web resources are perfect - by this we mean the GSDs (Global Species Databases), of which AlgaeBase, Species Fungorum and all the others we know are examples. We can expect or at least hope (subject to the limits of efforts, sometimes heroic, usually under-valued, of their creators and maintainers) to find the most accurate, up-to-date, and comprehensive information on any relevant taxon or name here. No dispute. That is why these efforts should have a high claim on available funding from relevant authorities, since they feed into many other uses (what you might call secondary or tertiary uses, etc.) and are perhaps closest to the primary literature. Perhaps it is appropriate to call such projects "primary aggregators". (There is a separate question of the extent to which IP can be asserted over such lists of names and the data structures within which they are stored, which I will leave for others to address).

Then there are questions which can only be answered by tools that need to look across multiple such "primary aggregators" and / or add supplementary information - to take my own use case, to analyse incoming lists of taxa from any phylum or kingdom, extant and fossil, and make some logical inferences as to where they fit on a taxonomic hierarchy, should they be displayed in a "marine only" / "extant only" system, what is their geographic range, anything you like. For a start, most of this information may not be available from the primary aggregator either in a consistent way or at all, and in addition the things I like/need to do - such as developing algorithms to detect misspellings, near duplicates, mis- or alternately classified taxa (such as fossil Bryozoa in Porifera and vice versa) and so on - actually REQUIRE the base data (in this case the taxonomic names and authors, and their stated taxonomic positions) to be available locally. There is just no way you can do this with a real-time distributed search. IF such compilations as the Catalogue of Life were complete then the task would be somewhat easier, but even here the problem is not solved because of fossil taxa that are listed erroneously in the COL, and the fact that there are plenty of gradations between the Recent and Fossil worlds as well as homonyms between the two that the COL will never detect. So as I stated once before (with slight facetiousness but in reality entirely serious), if you want to do integrated analysis of such data, traversing the relevant trees to see the whole picture, and base new data systems upon such content, in fact there is no alternative but to build your own, at least if you want something *now* and do not want to wait however long for the product to miraculously appear - the universal catalogue of all life, misspellings and misclassifications all fixed, all synonyms, homonyms, basionyms etc. correctly flagged and cross referenced, and all the rest. (I'm not holding my breath).
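To make the "fossil Bryozoa in Porifera" kind of check concrete, here is a minimal Python sketch of one way such a cross-list comparison could work once all the records are held locally. It is an illustration only, not the method actually used for IRMNG: the function name `find_conflicting_placements`, the genus names and the list labels are all invented placeholders.

```python
from collections import defaultdict


def find_conflicting_placements(records):
    """Return {genus: {phylum: [sources]}} for genera placed in more than one phylum.

    `records` is an iterable of (genus, phylum, source) tuples held locally.
    """
    placements = defaultdict(lambda: defaultdict(list))
    for genus, phylum, source in records:
        placements[genus][phylum].append(source)
    # Keep only genera that at least two lists disagree about.
    return {
        genus: {phylum: sorted(sources) for phylum, sources in by_phylum.items()}
        for genus, by_phylum in placements.items()
        if len(by_phylum) > 1
    }


# Hypothetical records: genus names and list labels are placeholders, not real data.
records = [
    ("Examplegenus", "Bryozoa", "list A"),
    ("Examplegenus", "Porifera", "list B"),
    ("Othergenus", "Porifera", "list A"),
    ("Othergenus", "Porifera", "list C"),
]

conflicts = find_conflicting_placements(records)
for genus, by_phylum in conflicts.items():
    print(genus, by_phylum)
```

A flagged genus is only a candidate for review: it may be a misclassification in one list, or a genuine homonym that happens to exist in both groups, and telling the two apart still needs a person.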

NOW comes the interesting part. Some people are positively helpful in providing their "primary" aggregations to what we might call "secondary" compilers (such as myself in this instance), others are less so, for a number of completely understandable reasons. Actually I wear the same hat myself in another life - that of the webmaster of the Australian CAAB database of marine species in Australian waters - http://www.cmar.csiro.au/caab/ . We (more correctly: myself and a team of others at my agency over the past 20+ years) have put a not-insignificant quantum of effort into creating this list of 25,000+ taxa, checking and double checking entries and correcting them as needed, adding photos, maps, cross-referencing synonyms, and the rest, plus maintaining it as a searchable web resource as an enabling contribution so that local workers can see what (at least in our opinion) are the locally occurring members of family X or phylum Y, and so on.

Of course at intervals this is not enough; fairly regularly (at least half a dozen times a year) I get a request from some worker for the complete list because the web accessible version is not what they need (similar to my global database of names vs. single web accessible GSDs). What to do? People could "steal the list" and re-post it as their "OWN" resource without credit - possible but unlikely. People could post it with credit, but as a static snapshot that gradually goes out of date without their users realising. True, in fact this is the biggest danger in my view; the main mitigation options are first, a warning to local users to this effect, and second, a process for propagating updates at some reasonable interval, and facilitating the incorporation of these into at least a subset of such collaborations as are viewed favourably. The third possibility is that in giving away "my" list (actually: sharing it with people who actually have a different use for it than myself), I may weaken the case to get funding to maintain it and they may in effect divert some of the available funds towards themselves. I have heard this more than once, from persons whom I respect, and it has now been raised again. All I can say is that it is a poor funding agency that cannot distinguish between the requirement to compile primary information into a GSD or locally authoritative list, and the requirement to do further aggregation / analysis / tool building / etc. on top of such activities. In my mind we need both.
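The "process for propagating updates" mentioned above can be pictured as a periodic comparison of two snapshots of the same list. The following Python sketch is one possible shape for that step, under the assumption that each entry has a stable code; the function name `diff_snapshots`, the codes and the names are placeholders and do not come from CAAB.

```python
def diff_snapshots(old, new):
    """Compare two snapshots of a list, each a dict of {stable_code: name}.

    Returns three dicts: entries added, entries removed, and entries whose
    name changed (as (old_name, new_name) pairs).
    """
    added = {code: new[code] for code in new.keys() - old.keys()}
    removed = {code: old[code] for code in old.keys() - new.keys()}
    changed = {
        code: (old[code], new[code])
        for code in old.keys() & new.keys()
        if old[code] != new[code]
    }
    return added, removed, changed


# Placeholder snapshots: codes and names are invented for the example.
old_snapshot = {"A001": "Genus alpha", "A002": "Genus beta", "A003": "Genus gamma"}
new_snapshot = {"A001": "Genus alpha", "A002": "Genus betae", "A004": "Genus delta"}

added, removed, changed = diff_snapshots(old_snapshot, new_snapshot)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

The holder of a derived copy would only need the three result sets, not the whole list again, to bring their copy back in line with the source.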

One further point - you would not believe the number of errors and discrepancies between local or taxon-specific lists that are actually revealed by the process of secondary aggregation. I have a good working relationship with at least one of the "primary" aggregators (WoRMS), in that every month or two I can send them a list of suspect content in their system which would never have been discovered without the inter-list comparison that is an inevitable part of secondary data aggregation activities carried out with at least some scientific rigour - and by this I do not mean activities of the Google or Zip Code Zoo variety. And the numbers of errors or inconsistencies I have found in a supposedly reliable resource such as the online Nomenclator Zoologicus run into over a thousand and are still not all detected, I am sure. I think that this in itself is a valid argument for secondary level aggregation activities, at least when carried out in a scientific as opposed to a non-scientific environment.

The final point I would make is that of credit. The ultimate credit is of course to the authors who have published or revised the taxa, which is a non-issue for at least the former since the authority name should never be omitted. Next there is the source from which a secondary aggregator such as myself obtained the name and, perhaps, assertions about its taxonomic status and classification. In my CAAB database, every name is intended to be tied to 2 literature references: one to say that the taxon occurs in Australia, and a second to say that this is the taxon's correct current name; for example, see http://www.marine.csiro.au/caabsearch/caab_search.caab_report?spcode=41112001 (if this URL splits over 2 lines, the last portion should read "41112001".). So as you will see, I use "primary" compilations or sources and cite them, which is surely their intention as publications. In IRMNG (my attempt at "one list to rule them all...") I do the same; for example, let's look at a species list for a genus of decapod crustaceans, Periclimenes, http://www.marine.csiro.au/mirrorsearch/ir_search.list_species?gen_id=cru1003077 (again if the link splits, the last value should read "cru1003077"). This is in fact an interesting case, in that Crustacea are covered patchily in Catalogue of Life (chiefly from the ITIS database), and many relevant species are not held there.

The first half dozen entries or so now read:

[species, authority              family                     source]

  Periclimenes abolineatus Bruce & Coombes, 1997   Palaemonidae   Museum Victoria KEmu database (Oct 2006)/CAAB

  Periclimenes adularans Bruce, 2003  Palaemonidae   CAAB (Jul 2007)

  Periclimenes aegylios Grippa & d'Udekem d'Acoz, 1996   Palaemonidae   Aphia2006/ERMS

  Periclimenes aesopius (Bate, 1863)   Palaemonidae   CoL2006/ITS-612372

  Periclimenes aesopus   Palaemonidae   Aphia2006/E. Africa Mar. Species DB  -- syn. of Periclimenes psamathe

  Periclimenes affinis (Zehnter, 1894) cru10023391   Palaemonidae   CoL2006/ITS-612373

  Periclimenes affinis Borradaile, 1915   Palaemonidae   Australian Faunal Directory (August 2007) -- syn. of Kemponia longirostris

and so on.

Now, to my way of thinking (and actual required work) this is immeasurably more useful than consulting any one source for this information - in fact you could not do it (as some of the lists supplied to me are pre-publication lists or otherwise internal-only agency-level compilations). Also one can start to do useful QA such as add missing authorities from other sources as needed, detect that we have both "Periclimenes aesopius" and "Periclimenes aesopus" on different, equally "authoritative" lists that are almost certainly the same taxon, and much much more. Also I honestly believe that I am doing all that could be reasonably expected to credit my sources, as well as not deflect any funding towards myself that they may reasonably be able to access from elsewhere. As I see it, our tasks are different (but perhaps complementary) and have different foci and potential usages / clients. My main concern wearing the "aggregator" hat is the fact that some of my content may already be out of date with respect to its sources ("points of truth" in this context), which is why as of several months ago I have been displaying the following at the foot of each and every page of IRMNG (currently almost half a million):
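The "aesopius" / "aesopus" case can be reproduced with a very small near-duplicate check. The Python sketch below uses four of the entries quoted above and the standard-library `difflib.SequenceMatcher`; the function name `near_duplicates` and the 0.9 similarity threshold are assumptions chosen for the example, not the algorithm actually used in IRMNG.

```python
from difflib import SequenceMatcher
from itertools import combinations

# (name, authority, source) taken from the entries listed above;
# an empty authority means none was supplied by that source.
entries = [
    ("Periclimenes aegylios", "Grippa & d'Udekem d'Acoz, 1996", "Aphia2006/ERMS"),
    ("Periclimenes aesopius", "(Bate, 1863)", "CoL2006/ITS-612372"),
    ("Periclimenes aesopus", "", "Aphia2006/E. Africa Mar. Species DB"),
    ("Periclimenes affinis", "(Zehnter, 1894)", "CoL2006/ITS-612373"),
]


def near_duplicates(entries, threshold=0.9):
    """Return pairs of entries whose names differ but are very similar.

    The threshold is an arbitrary illustrative value and would need tuning.
    """
    flagged = []
    for first, second in combinations(entries, 2):
        name1, name2 = first[0], second[0]
        if name1 == name2:
            continue  # identical spellings are not misspelling candidates
        score = SequenceMatcher(None, name1, name2).ratio()
        if score >= threshold:
            flagged.append((first, second, round(score, 3)))
    return flagged


for first, second, score in near_duplicates(entries):
    print(f"{first[0]} [{first[2]}] ~ {second[0]} [{second[2]}] (similarity {score})")
    # Where one entry lacks an authority, the other's is a candidate to check.
    if not first[1] or not second[1]:
        print("  candidate authority:", first[1] or second[1])
```

With these four entries only the aesopius / aesopus pair exceeds the threshold. The output is a list of suspects for a person to review, not an automatic correction.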

"DISCLAIMER: Information in IRMNG is based on multiple sources and may not have been verified at time of writing; a small number of genus names may also be listed more than once under multiple classifications, to be rationalised in due course. Users are encouraged to seek independent confirmation of any IRMNG data before incorporating into their own systems."

As you can see, I really think that it is disingenuous to expect to be able to "publish" taxon lists in whatever form and for other persons not to want to copy them for their own purposes. Data sharing should be just that, with all facilitation possible to ensure that as far as reasonably possible, "slave" or derived copies have a mechanism to refer back to whatever source they have used in case this has changed in the meantime (no different from any hard copy publication that is fixed in time although science moves on, except that continuous propagation of updates is more possible in the new e-world).

Now here's a second question: if someone posts a GSD, but refuses either to share it or to answer email requests for the same, what is another worker who wishes to make use of their list for whatever purpose to do, short of recreating it (potentially less well) from the primary literature? (I will leave this for others to answer).

Just my opening statement for the defence :)

Regards - Tony

Tony Rees
Manager, Divisional Data Centre,
CSIRO Marine and Atmospheric Research,
GPO Box 1538,
Hobart, Tasmania 7001, Australia
Ph: 0362 325318 (Int: +61 362 325318)
Fax: 0362 325000 (Int: +61 362 325000)
e-mail: Tony.Rees at csiro.au
Manager, OBIS Australia regional node, http://www.obis.org.au/
Biodiversity informatics research activities: http://www.cmar.csiro.au/datacentre/biodiversity.htm
Personal info: http://www.fishbase.org/collaborators/collaboratorsummary.cfm?id=1566

-----Original Message-----
From: taxacom-bounces at mailman.nhm.ku.edu [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Paul Kirk
Sent: Thursday, 2 July 2009 9:27 PM
To: Roderic Page
Cc: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] FW: Wikipedia classification

The only evidence that this is likely to be the case is from our friends
at Google and the rankings I see when names of fungi are searched for.
Having said that, traffic has certainly risen, especially from 'overt
aggregators' (GBIF, CoL etc) and probably others.

It is an indication of value, but it's somewhat frustrating that the
source of these data is ranked below the 'covert aggregators' [I use
this phrase for those who have harvested content without first seeking
permission or exploring more efficient delivery mechanisms than server
hammering]. Perhaps I'm not being realistic ... perhaps the world would
be a better place if everyone and his dog started building their very
own Encyclopedia of Life ... ;-)

Some aggregators link without a prompt ... others need a prompt to do
so.

Regards,

Paul

-----Original Message-----
From: Roderic Page [mailto:r.page at bio.gla.ac.uk]
Sent: 02 July 2009 11:01
To: Paul Kirk
Cc: dipteryx at freeler.nl; taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] FW: Wikipedia classification

Dear Paul,

Can you demonstrate that 'covert aggregators' are taking traffic from
you (as opposed to being an additional source of traffic)? And, isn't
"link love" an indication that your resource is valued? Are you
suggesting aggregators don't link to you?

Regards

Rod

On 2 Jul 2009, at 09:16, Paul Kirk wrote:

> Just a quick point on the last paragraph ...
>
> There is nothing like a new product to attract customers - ask anyone
> in the real world. And in the case of the 'covert aggregators' (e.g.
> ZipCodeZoo, amongst others), who may add a link to the source web site
> ... thanks ... but ... taking traffic from the source has a detrimental
> effect on the profile of the source and thus the justification for
> maintaining it.
>
> Paul
>
> -----Original Message-----
> From: taxacom-bounces at mailman.nhm.ku.edu
> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of
> dipteryx at freeler.nl
> Sent: 02 July 2009 09:07
> To: taxacom at mailman.nhm.ku.edu
> Subject: [Taxacom] FW: Wikipedia classification
>
> Van: taxacom-bounces at mailman.nhm.ku.edu namens Bob Mesibov
> Verzonden: do 2-7-2009 8:20
>
> It sounds like there's agreement in this discussion and Rod Page's
> blog that Wikipedia/Wikispecies is emerging as a very useful taxonomic
> resource, that it's getting better, and that it has structural and
> administrative problems - top among these being rigidity of format and
> variable quality of expertise.
>
> ***
> It may sound like that, but it is not the case. Wikipedia/Wikispecies
> may be moderately useful for popular groups and for newly published
> matters, but so far its problems are bigger than its merits. The
> database mentality is very strong in (English) Wikipedia and
> ZipcodeZoo; accuracy and realism are rare enough (although it is a lot
> better once one gets away from the English Wikipedia, which is why
> there is more hope for Wikispecies).
>
> In the mean time professional sites are growing at a very respectable
> rate; I am always pleasantly surprised when I visit the USDA-sites,
> and the Angiosperm Phylogeny Website keeps improving (almost
> justifying the awe in which it is widely held on the www).
> * * *
>
> We get back to a question raised in earlier TAXACOM discussions: who
> will use which online resources, and for what purposes?
>
> I don't think this question has been asked often enough by the top-
> down compilers/developers of online biodiversity resources. Many
> people seem to think that information is information, and that the
> more you put up on the Web, and the more different ways the
> information can be shared and linked, the better.
>
> ***
> This often results in a database-orientation, copying data
> helter-skelter and let-the-devil-catch-the-reader (that is, copying
> good data from a good database and converting it to create a flawed,
> or misleading, 'new' entry).
>
> Paul