[Taxacom] data quality vs. data security: a survey

Stephen Thorpe s.thorpe at auckland.ac.nz
Sat Feb 13 15:44:59 CST 2010


Hi Rich,

No, I don't buy it! 

>Every time information about a species, a taxonomic publication citation, etc., etc. is typed by humans on a keyboard (whether it be typed into a manuscript, a database, a wikispecies page, or wherever), that's duplication of effort. Individually, it seems trivial -- but in aggregate it is most certainly *not* trivial

First off, if someone types a citation into a wikispecies page, it may in some sense be a duplication of effort if someone else has already typed it into something else, or an "acronym" or ten have already "harvested" it, but since it was typed into wikispecies free of charge, it isn't a SERIOUS duplication of effort (on the part of the wikispecies contributor). What is a SERIOUS duplication of effort is when science funding goes individually to several different aggregators to each put the citation in their own particular database, and even worse when all they are in fact doing is "harvesting" the information from an existing taxon-specific database. The aggregators are merely parasites ...

>While there is certainly some overlap among them, the duplication is by no means "massive".  To say so reveals a poor understanding about what these different initiatives actually do

I may not know what they do (behind the scenes), but I know what they give the end user, in terms of content, and it just isn't very much at all, at least for GBIF, EOL, COL, and the like. All they do is "harvest" names and create stubs. I don't want a nice looking map of the world on a species page if there are no points plotted on it, or if there are very few points plotted compared to the actual distribution. How "massive" is "massive", in terms of overlap?

>You seem to be confusing "Aggregation" with "Integration".  Google is an aggregator (an indexer, really -- like GBIF)

OK, so why do we need GBIF, when we already have Google? I am NOT, obviously, saying that Google is sufficient for all our needs - far from it! I am saying that an expensive entity like GBIF is not much better than Google.

This seems to be what is going on: dedicated taxonomists (like Bob, for example) work darn hard for relatively little reward, creating new taxonomic knowledge. Then, if you are lucky, that knowledge gets integrated into a taxon-specific database and/or (if I have anything to do with it) Wikispecies. So far, so good. It is what happens next that is the problem! Increasing numbers of "parasites" then make far more money and have a far easier life than Bob by "harvesting" the names from the taxon-specific databases, and creating skeleton pages on some site that promises so much, but never seems to end up delivering much in terms of actual content! If you could get actual useful content out of these sites, then fine, but all too often you just find a map devoid of points, and a page devoid of content!

Cheers,

Stephen

________________________________________
From: taxacom-bounces at mailman.nhm.ku.edu [taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Richard Pyle [deepreef at bishopmuseum.org]
Sent: Sunday, 14 February 2010 7:47 a.m.
To: 'TAXACOM'
Subject: Re: [Taxacom] data quality vs. data security: a survey

Hi Stephen,

> OMG! Did you really just say that! How is a massive
> duplication of effort increasingly allowing a massive
> reduction of redundant/duplicate effort????????

It appears you didn't understand my post.  As you say, "communication is a
very difficult thing, particularly on topics as complex as this", so I'll
try again.  You seem to characterize all the various large-scale data
aggregators (GBIF, EOL, COL, ALA, etc.) as "massive duplication of effort".
While there is certainly some overlap among them, the duplication is by no
means "massive".  To say so reveals a poor understanding about what these
different initiatives actually do.

Every time information about a species, a taxonomic publication citation,
etc., etc. is typed by humans on a keyboard (whether it be typed into a
manuscript, a database, a wikispecies page, or wherever), that's duplication
of effort. Individually, it seems trivial -- but in aggregate it is most
certainly *not* trivial.

> INTEGRATION is one thing, but MULTIPLE INTEGRATION
> INITIATIVES leading to numerous clone or near clone
> integrated databases is completely self-defeating!

You seem to be confusing "Aggregation" with "Integration".  Google is an
aggregator (an indexer, really -- like GBIF).  The DNS system is an
architecture for integration.  The equivalent of DNS for biodiversity
information is what I mean by integration.

Aloha,
Rich


