[Taxacom] saturday morning fun

Mon Nov 29 14:47:50 CST 2010

David,
You mention some key issues here. Let me focus on just one of them for the 
moment, namely COL and its suitability as a data provider for GBIF. I suspect 
that GBIF have basically just thought something like "well, COL is an 
aggregation of trusted specialist databases in a form that GBIF can use" - but 
the reality is *way* more complex. For me, when starting to compile a 
Wikispecies page, I will often use COL as a *starting point only*, actually 
little more than a convenient way of getting big lists of taxa formatted and put 
on Wikispecies pages for further scrutiny. Sometimes, the COL data is so 
obviously worse than useless, that I don't use it at all, not *even* as a 
starting point. The data providers for COL vary widely in nature. Some of them 
are near complete for their group, others are highly fragmentary. Some are 
*very* raw, others are quite well polished. Sometimes, there are problems in the 
way that COL interprets the data from sources, so all sorts of synonyms get 
interpreted as valid, etc. Another issue, which I don't fully understand yet, 
and I could perhaps be mistaken (???), is that even in COL 2010, much of the 
data seems to have been harvested in 2008 ... I would have thought that COL 2010 
would have harvested its data in 2010. If not, then COL is running a couple of 
years behind its own data providers, who will typically not be completely 
up-to-date either. So, in summary, I would say that COL is nothing more than a 
convenient *starting point* for building solid biodiversity data, and it 
requires a fair amount of careful and informed interpretation, not to mention a 
great deal of manual work to improve on it. I'm not sure that GBIF has fully 
grasped this? For example, in COL, the family Scarabaeidae is actually what 
would almost universally be called the subfamily Scarabaeinae of the family 
Scarabaeidae, and this is not at all obvious. So, COL is actually quite good if 
you want data on Scarabaeinae, but completely lacking in any data whatsoever on 
the *huge* scarabaeid subfamilies Melolonthinae and Rutelinae.
Cheers,
Stephen

________________________________
From: David Remsen (GBIF) <dremsen at gbif.org>
To: dipteryx at freeler.nl
Cc: taxacom at mailman.nhm.ku.edu
Sent: Mon, 29 November, 2010 11:19:22 PM
Subject: Re: [Taxacom] saturday morning fun

I tried to explain the state that data is in when it is published from  
source collections data and the challenge of organising it for  
evaluation.  I also said that since the taxonomic organisation of  
these source data is so inconsistent and we lack the means to rank  
individual data sets (something I think we need to revisit) that we  
rely on additional external sources to provide taxonomic authority  
files.

- We use the Catalogue of Life because it is available and is  
purported to be curated by a network of taxonomic experts.  We have  
the capacity to utilise additional and alternative sources if they are  
made available to us.  What are they and where can I find them?    
And which sectors of the Catalogue of Life are completely worthless?  
Like GBIF,  the Catalogue of Life is composed of component parts that  
are curated elsewhere.  Like GBIF,  CoL does actually shoulder some  
responsibility for the organisation of these parts.    Is it the  
higher (supra-familial) taxonomy that the COL uses to organise the  
components?  Is it specific taxonomic sectors like the Porifera or  
the legumes or weevils?  Or, is it like the proverbial rotten egg,  
where you don't need to eat the whole thing to just know it's bad?    
The latter requires the least effort but doesn't actually say much.

- We are aware of particulars regarding species concept issues but  
specific concept references are not provided in data sources and  
concept differentiation remains a problem for nearly all biodiversity  
data networks.  We know how to improve this, however, but it requires  
concept identifiers and concept definitions themselves to be provided  
in a structured manner, with the data.

- The GBIF data portal is in need of a make-over and will be getting  
one over the next year.  I agree we need to make the simple thing it  
is supposed to be doing simpler and with improved precision.    That  
simple thing is access to primary biodiversity data records shared  
through the network.  We will be doing a lot more processing of these  
records to try to make more clear qualitative separations.  There are  
two basic issues, however, that are very difficult to address, the first  
being comparing two records and determining if record A and record B  
actually refer to the same thing.

Lastly,  GBIF is an open source project and the data published through  
it are (or can be) available to anyone wishing to provide improved  
access to it.  This of course, requires an agreement that access to  
raw collections data from a federation of sources has some sort of  
merit - something that seems to me is not agreed by commenters here.  
If this isn't agreed then the issue is far larger than GBIF and in my  
mind raises the question as to why NSF and others continue to wish to  
fund such digitisation and networking efforts.

http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=5448
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503559&org=&from_org=NSF

If you think there IS merit in sharing these data but that the problem  
is we just continually botch it up in Copenhagen,  we would support  
your proposal to improve access to these meritorious data.  I will  
provide you with a copy of the current index (a ~75 GB text file) or  
any subset for evaluation and new ideas.  Just remember,  the fixes  
and organisations have to be re-applied next month to a new copy.

Lastly,  just what did you search for in Google that provided better  
recall and precision than a GBIF search in terms of specimen  
records?  If Google has access to a wider set of better curated  
specimen records, then indeed the portal merits a real existential  
look.

David Remsen

On Nov 29, 2010, at 9:46 AM, <dipteryx at freeler.nl>  
<dipteryx at freeler.nl> wrote:

> Van: taxacom-bounces at mailman.nhm.ku.edu namens Jim Croft
> Verzonden: ma 29-11-2010 1:04
>
>> To be fair, the only reason GBIF is 'feeding us shit' is
>> because 'shit' is what we gave them.
>
> ***
> Not at all sure about that. What has been playing through my
> mind is the idea that a data aggregator is an agency which can
> be characterized by "Data in, garbage out". It is a complete
> mystery to me why GBIF uses something known to be so completely
> worthless as the taxonomy of the Catalogue of Life; nothing good
> can come of that ...
>
> Like some other list-members, I tried a small test, for which I
> selected a genus where it is known to be essential to be explicit
> about the species concept used in order to be able to interpret
> and handle data, in anything like a meaningful manner.
>
> Using the GBIF data portal, the most noticeable thing is how much
> work it is to use, before getting to any data. There is indeed a
> significant degree of completely irrelevant material linked from
> this entry (the wondrous ways of computers!), but this is easily
> identifiable, so not much of an actual problem. There is no apparent
> awareness of the species-concept issue, with more than one species
> concept used happily side by side. So, a lot of work (and 'expert'
> knowledge required), but basically usable. This in contrast to the
> Wikipedia entry, which requires very little work on the part of the
> reader for him to be completely misinformed. Wikispecies is preferable,
> although it offers only little information, with a 25% rate of error
> (as compared to the source it was copied from), but at least it
> indicates its source, and it has selected a relevant source.
>
> On the whole it proves that the casual user is best advised to just
> use Google (which not only did turn up the relevant information but
> quickly showed me a very nice site unknown to me): this is less work
> and yields more useful results (a higher ratio of information/amount
> -of-work) than trying one of the self-advertised high-profile sites
> (obviously, the 'expert' does not need advice).
>
> Paul van Rijckevorsel
> _______________________________________________
>
> Taxacom Mailing List
> Taxacom at mailman.nhm.ku.edu
> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
>
> The Taxacom archive going back to 1992 may be searched with either of these methods:
>
> (1) http://taxacom.markmail.org
>
> Or (2) a Google search specified as:  site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
>
