[Taxacom] data quality vs. data security

Stephen Thorpe s.thorpe at auckland.ac.nz
Thu Feb 11 22:59:29 CST 2010


>If you think that combining the work of more than 3,000 specialists into a single (more or less) coherent whole covering maybe 60% of the world's extant species (plus working to continually extend this coverage) is a value-less exercise

No, I don't. But you have worded it misleadingly IMHO. What advantage(s) does it have over Google? Combining is more than just gathering together; the bits have to be fashioned into a meaningful structure - a house is more than a pile of bricks. At the moment, CoL is just a pile of bricks (with many missing or broken). That could change in the future, but will require much manual effort to fashion the bricks into a meaningful structure. Wikispecies is an already existing infrastructure which allows meaningful content to be easily made available to everyone in the world with internet access. Shouldn't we be using it more? CoL has many clones or near-clones, perhaps all taking resources away from primary taxonomic research? It has to be easier to transfer semistructured Wiki data to structured databases than to put unstructured primary source data into them, so wouldn't it be sensible to put MORE effort into creating more meaningful content on Wikispecies, then transfer it at some later stage to a suitable structured database? And there are too many independent candidate databases in development, all just minor variations on the same theme. Shouldn't we try to limit the number of them?

________________________________________
From: Tony.Rees at csiro.au [Tony.Rees at csiro.au]
Sent: Friday, 12 February 2010 5:44 p.m.
To: Stephen Thorpe; taxacom at mailman.nhm.ku.edu
Subject: RE: [Taxacom] data quality vs. data security

Dear Stephen,

You wrote:

<snip>
66 taxon specific databases is still not much taxonomic coverage. You have actually confirmed my thoughts on what CoL is: namely a data aggregator of other taxon specific databases (66, in fact), and it is therefore no better or worse than those source databases.
</snip>

If that was a secret, well it's not a very well kept one :)

See e.g. the opening paragraph/s on http://www.catalogueoflife.org/info_about_col.php :

"The Species 2000 & ITIS Catalogue of Life is planned to become a comprehensive catalogue of all known species of organisms on Earth. Rapid progress has been made recently and this, the ninth edition of the Annual Checklist, contains 1,160,711 species. Please note that this is probably just more than half of the world's known species. This means that for many groups it continues to be deficient, and users will notice that many species are still missing from the Catalogue.

"The present Catalogue is compiled with sectors provided by 66 taxonomic databases from around the world. Many of these contain taxonomic data and opinions from extensive networks of specialists, so that the complete work contains contributions from more than 3,000 specialists from throughout the taxonomic profession. Species 2000 and ITIS teams peer review databases, select appropriate sectors and integrate the sectors into a single coherent catalogue with a single hierarchical classification."

If you think that combining the work of more than 3,000 specialists into a single (more or less) coherent whole covering maybe 60% of the world's extant species (plus working to continually extend this coverage) is a value-less exercise, well I guess that's your prerogative...

Regards - Tony

-----Original Message-----
From: Stephen Thorpe [mailto:s.thorpe at auckland.ac.nz]
Sent: Friday, 12 February 2010 3:36 PM
To: Rees, Tony (CMAR, Hobart); taxacom at mailman.nhm.ku.edu
Subject: RE: [Taxacom] data quality vs. data security

66 taxon specific databases is still not much taxonomic coverage. You have actually confirmed my thoughts on what CoL is: namely a data aggregator of other taxon specific databases (66, in fact), and it is therefore no better or worse than those source databases. I would say that any advantages of such a structure over Google, as a data aggregator, are unlikely to be of sufficient magnitude to justify the cost. By contrast, Wikispecies puts data together in intelligent ways, so although it will always lag well behind in absolute numbers of taxa covered, what is there is more useful. CoL only cites the source databases as sources, so you have to go there anyway to find the real sources. Wikispecies not only cites the primary sources, but also provides links to them whenever possible, and images of taxa whenever available. It is easy to get up numbers (of taxa covered) by sucking data out of multiple databases, and the numbers might impress the funders, but the content (or lack thereof) in CoL, EoL and the like is unlikely to impress anyone ...

________________________________________
From: Tony.Rees at csiro.au [Tony.Rees at csiro.au]
Sent: Friday, 12 February 2010 5:23 p.m.
To: Stephen Thorpe; taxacom at mailman.nhm.ku.edu
Subject: RE: [Taxacom] data quality vs. data security

Hi Stephen,

You write:

<snip>
I can make no sense of their "annual checklists", the annual checklist for 2009 has HUGE gaps for new taxa published in 2009, 2008, ... In fact, all they seem to have is what they can automatically suck out of the few taxon specific databases out there, and nothing much else!
</snip>

In that case, this is no doubt best explained by an appropriate CoL person, but by "gaps" I meant gaps in taxonomic coverage, i.e. not currently covered by any of their 66 (latest count) source databases, which with one significant exception aspire to "complete" coverage of particular taxonomic groups. Chronological gaps, e.g. for recently described taxa, are then the responsibility of the contributing databases, who in the main are progressing such gap filling as their resources and enthusiasm allow (and accepting that there will be a time lag before being uploaded into the next annual release of the CoL).

One mechanism being worked on, either now or soon I believe, is improving the "dynamic checklist" version of CoL so that live updates in the source databases become accessible more rapidly, in advance of the "static snapshot" annual checklist. So certainly some latency issues exist, but they are within scope to be worked on as part of the 4D4Life project, see http://www.4d4life.eu/ (more acronyms of course, fun fun fun).

That's 2 more cents gone, hopefully not wasted...

Cheers - Tony

-----Original Message-----
From: Stephen Thorpe [mailto:s.thorpe at auckland.ac.nz]
Sent: Friday, 12 February 2010 3:13 PM
To: Rees, Tony (CMAR, Hobart); taxacom at mailman.nhm.ku.edu
Subject: RE: [Taxacom] data quality vs. data security

> If your argument is that one is more likely to find a more complete species list for *any* genus in wikispecies than elsewhere

No, that is not my argument, and I agree that it is patently untrue.
My argument is that Wikispecies is a very easy and cheap way of providing verifiable and complete data to the whole world right now, and so people (especially those in the scientific community) would be foolish not to make the most of it, but, alas, I don't think they are ...
Compared to EOL, for example, the advantages of Wikispecies are too obvious to go over yet again ...
I'm glad you mentioned Catalogue of Life - for it isn't at all what it seems: you speak of "gaps", but actually I can make no sense of their "annual checklists", the annual checklist for 2009 has HUGE gaps for new taxa published in 2009, 2008, ... In fact, all they seem to have is what they can automatically suck out of the few taxon specific databases out there, and nothing much else!

________________________________________
From: taxacom-bounces at mailman.nhm.ku.edu [taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Tony.Rees at csiro.au [Tony.Rees at csiro.au]
Sent: Friday, 12 February 2010 4:56 p.m.
To: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] data quality vs. data security

Dear Stephen,

If your argument is that one is more likely to find a more complete species list for *any* genus in wikispecies than elsewhere, then this is patently untrue, since the 2009 online Catalogue of Life currently contains more than 1.1m valid species names and 0.7m synonyms, compared to wikispecies' presently quoted 210k taxa at all taxonomic ranks. Certainly CoL has gaps, and it is in these areas that wikispecies may gain some points; however, to generalise that one system is therefore "better" than the other seems a bit pointless, particularly as the likely winner is CoL.

Maybe a more fruitful area would be to see how the additional effort and content you and others are putting into wikispecies may also contribute to CoL and other related efforts (such as the upcoming GNA and GNUB, see e.g. http://code.google.com/p/gbif-ecat/wiki/GNUBIntro ), but that is unlikely to be advanced by repeated arguments that wikispecies is essentially better than the other initiatives that already feed into such compilations, some of which are considerably richer in species-level information than the equivalent wikispecies entries, and even (gasp) kept up-to-date by equally diligent workers...

Just my 2 cents' worth (I am drawing down my available cents here, though, probably will reach zero soon).

- Tony


