[Taxacom] saturday morning fun

Mon Nov 29 15:30:05 CST 2010

Stephen,

Thanks for the summary.  I'd be interested to hear what various  
Catalogue of Life providers think of all this.  I know some taxonomic  
sectors,  like the Lepidoptera,  derived from LepIndex NHM-London,   
have not been thoroughly reviewed, falling into your 'raw' category.

You hit the nail on the head when you say it provides you with a  
starting point.   We use it as a starting point too.   We could forgo  
this and simply leave the raw data as it is but it seemed an  
improvement to go with it.  We are trying to expand the capacity to  
access other, perhaps more comprehensive or refined sources,  should  
they be offered or available.   At the moment, that starting place is  
one of the few places we can go.  Of course, flaking together  
disparate sets of even high quality data introduces additional  
complications but I'd be happy to take them on.

I'm sure we (at least I) have not fully grasped all the ramifications  
of this.  I've tried to relay some of the complexities and a rationale  
behind what we are faced with and do.   I failed to mention the  
constraints we are under to improve the issues raised this weekend.   
Until very recently we have had 2.5 programmers working on the  
entirety of our infrastructure with nearly no resources for the portal  
to fix these problems.   This will change in 2011.

Best,
David

On Nov 29, 2010, at 9:47 PM, Stephen Thorpe wrote:

> You mention some key issues here. Let me focus on just one of them  
> for the moment, namely COL and its suitability as a data provider  
> for GBIF. I suspect that GBIF have basically just thought something  
> like "well, COL is an aggregation of trusted specialist databases in  
> a form that GBIF can use" - but the reality is *way* more complex.  
> For me, when starting to compile a Wikispecies page, I will often  
> use COL as a *starting point only*, actually little more than a  
> convenient way of getting big lists of taxa formatted and put on  
> Wikispecies pages for further scrutiny. Sometimes, the COL data is  
> so obviously worse than useless, that I don't use it at all, not  
> *even* as a starting point. The data providers from COL vary widely  
> in nature. Some of them are near complete for their group, others  
> are highly fragmentary. Some are *very* raw, others are quite well  
> polished. Sometimes, there are problems in the way that COL  
> interprets the data from sources, so all sorts of synonyms get  
> interpreted as valid, etc. Another issue, which I don't fully  
> understand yet, and I could perhaps be mistaken (???), is that even  
> in COL 2010, much of the data seems to have been harvested in  
> 2008 ... I would have thought that COL 2010 would have harvested its  
> data in 2010. If not, then COL is running a couple of years behind  
> its own data providers, who will typically not be completely  
> up-to-date either. So, in summary, I would say that COL is nothing more  
> than a convenient *starting point* for building solid biodiversity  
> data, and it requires a fair amount of careful and informed  
> interpretation, not to mention a great deal of manual work to  
> improve on it. I'm not sure that GBIF has fully grasped this? For  
> example, in COL, the family Scarabaeidae is actually what would  
> almost universally be called the subfamily Scarabaeinae of the  
> family Scarabaeidae, and this is not at all obvious. So, COL is  
> actually quite good if you want data on Scarabaeinae, but completely  
> lacking in any data whatsoever on the *huge* scarabaeid subfamilies  
> Melolonthinae and Rutelinae.
> Cheers,
> Stephen