[Taxacom] saturday morning fun

Mon Nov 29 04:19:22 CST 2010

I tried to explain the state that data are in when they are published
from source collections and the challenge of organising them for
evaluation.  I also said that, since the taxonomic organisation of
these source data is so inconsistent and we lack the means to rank
individual data sets (something I think we need to revisit), we
rely on additional external sources to provide taxonomic authority
files.

- We use the Catalogue of Life because it is available and is  
purported to be curated by a network of taxonomic experts.   We have  
the capacity to utilise additional and alternative sources if they are  
made available to us.   What are they and where can I find them?     
And which sectors of the Catalogue of Life are completely worthless?   
Like GBIF,  the Catalogue of Life is composed of component parts that  
are curated elsewhere.   Like GBIF,  CoL does actually shoulder some  
responsibility for the organisation of these parts.    Is it the
higher (supra-familial) taxonomy that CoL uses to organise the
components?   Is it specific taxonomic sectors like the Porifera or
the legumes or weevils?   Or is it like the proverbial rotten egg,
where you don't need to eat the whole thing to know it's bad?
The latter requires the least effort but doesn't actually say much.

- We are aware of particulars regarding species concept issues but  
specific concept references are not provided in data sources and  
concept differentiation remains a problem for nearly all biodiversity  
data networks.   We know how to improve this, but it requires
concept identifiers, and the concept definitions themselves, to be
provided in a structured manner with the data.

- The GBIF data portal is in need of a make-over and will be getting  
one over the next year.   I agree we need to make the simple thing it  
is supposed to be doing simpler and with improved precision.    That  
simple thing is access to primary biodiversity data records shared  
through the network.   We will be doing a lot more processing of these
records to try to make clearer qualitative separations.  There are,
however, two basic issues that are very difficult to address, the
first being comparing two records and determining whether record A
and record B actually refer to the same thing.
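
As an illustration of the record-comparison problem only (this is not
a description of how GBIF processes records), the sketch below guesses
whether two occurrence records refer to the same specimen.  It assumes
records arrive as dictionaries keyed by Darwin Core term names; the
function names, the choice of fields and the 0.9 threshold are
hypothetical.

```python
from difflib import SequenceMatcher

# Darwin Core terms assumed to identify a specimen within its collection.
KEY_FIELDS = ("institutionCode", "collectionCode", "catalogNumber")
# Fields compared loosely when the identifying triplet is incomplete.
SOFT_FIELDS = ("scientificName", "recordedBy", "eventDate", "locality")


def norm(value):
    """Lower-case and collapse whitespace; None becomes an empty string."""
    return " ".join(str(value or "").lower().split())


def same_record(a, b, threshold=0.9):
    """Guess whether two occurrence records describe the same specimen."""
    key_a = tuple(norm(a.get(f)) for f in KEY_FIELDS)
    key_b = tuple(norm(b.get(f)) for f in KEY_FIELDS)
    # When both records carry a complete triplet, trust it outright.
    if all(key_a) and all(key_b):
        return key_a == key_b
    # Otherwise fall back to average string similarity over shared fields.
    scores = [
        SequenceMatcher(None, norm(a.get(f)), norm(b.get(f))).ratio()
        for f in SOFT_FIELDS
        if a.get(f) and b.get(f)
    ]
    return bool(scores) and sum(scores) / len(scores) >= threshold
```

Even this small sketch shows why the problem is hard: the exact match
depends on providers filling in the identifying fields consistently,
and the fallback can only ever give a guess.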

Lastly,  GBIF is an open source project and the data published through  
it are (or can be) available to anyone wishing to provide improved  
access to it.   This, of course, requires an agreement that access to
raw collections data from a federation of sources has some sort of
merit, something that, it seems to me, is not agreed by commenters here.
If this isn't agreed then the issue is far larger than GBIF and in my  
mind raises the question as to why NSF and others continue to wish to  
fund such digitisation and networking efforts.

http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=5448
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503559&org=&from_org=NSF

If you think there IS merit in sharing these data but that the problem  
is we just continually botch it up in Copenhagen,  we would support  
your proposal to improve access to these meritorious data.   I will
provide you with a copy of the current index (a ~75 GB text file) or  
any subset for evaluation and new ideas.  Just remember,  the fixes  
and organisations have to be re-applied next month to a new copy.
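
One way to cope with re-applying fixes each month, sketched here under
assumptions, is to keep the corrections in a separate table keyed by
record and stream each new copy of the index through it.  The layout
of the index is not described above, so the tab-delimited format and
the column names "record_id" and "scientific_name" are hypothetical.

```python
import csv
import io


def apply_fixes(index_lines, fixes, key="record_id"):
    """Yield rows from a tab-delimited index with stored corrections applied.

    `fixes` maps a record key to a dict of {column: corrected value}.
    Rows are streamed one at a time, so the file never has to fit in memory.
    """
    for row in csv.DictReader(index_lines, delimiter="\t"):
        row.update(fixes.get(row[key], {}))
        yield row


# Corrections are kept apart from the data so they survive the next dump.
fixes = {"42": {"scientific_name": "Dipteryx odorata"}}
november = io.StringIO(
    "record_id\tscientific_name\n"
    "41\tVicia faba\n"
    "42\tDipterix odorata\n"
)
cleaned = list(apply_fixes(november, fixes))
```

This only works if record keys stay stable from one copy to the next,
which is itself an assumption about the index.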

Lastly,  just what did you search for in Google that provided better  
recall and precision than a GBIF search in terms of specimen  
records?   If Google has access to a wider set of better curated
specimen records, then indeed the portal merits a real existential
look.

David Remsen

On Nov 29, 2010, at 9:46 AM, <dipteryx at freeler.nl>  
<dipteryx at freeler.nl> wrote:

> From: taxacom-bounces at mailman.nhm.ku.edu on behalf of Jim Croft
> Sent: Mon 29-11-2010 1:04
>
>> To be fair, the only reason GBIF is 'feeding us shit' is
>> because 'shit' is what we gave them.
>
> ***
> Not at all sure about that. What has been playing through my
> mind is the idea that a data aggregator is an agency which can
> be characterized by "Data in, garbage out". It is a complete
> mystery to me why GBIF uses something known to be so completely
> worthless as the taxonomy of the Catalogue of Life; nothing good
> can come of that ...
>
> Like some other list-members, I tried a small test, for which I
> selected a genus where it is known to be essential to be explicit
> about the species concept used in order to be able to interpret
> and handle data, in anything like a meaningful manner.
>
> Using the GBIF data portal, the most noticeable thing is how much
> work it is to use, before getting to any data. There is indeed a
> significant degree of completely irrelevant material linked from
> this entry (the wondrous ways of computers!), but this is easily
> identifiable, so not much of an actual problem. There is no apparent
> awareness of the species-concept issue, with more than one species
> concept used happily side by side. So, a lot of work (and 'expert'
> knowledge required), but basically usable. This in contrast to the
> Wikipedia entry, which requires very little work on the part of the
> reader for him to be completely misinformed. Wikispecies is
> preferable, although it offers only little information, with a
> 25% rate of error
> (as compared to the source it was copied from), but at least it
> indicates its source, and it has selected a relevant source.
>
> On the whole it proves that the casual user is best advised to just
> use Google (which not only did turn up the relevant information but
> quickly showed me a very nice site unknown to me): this is less work
> and yields more useful results (a higher ratio of information/amount
> -of-work) than trying one of the self-advertised high-profile sites
> (obviously, the 'expert' does not need advice).
>
> Paul van Rijckevorsel
> _______________________________________________
>
> Taxacom Mailing List
> Taxacom at mailman.nhm.ku.edu
> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
>
> The Taxacom archive going back to 1992 may be searched with either
> of these methods:
>
> (1) http://taxacom.markmail.org
>
> Or (2) a Google search specified as:
> site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
>