[Taxacom] Data quality of aggregated datasets

Thu Apr 25 06:04:23 CDT 2013

Rod Page wrote:

"There at least things we need to do to tackle this problem.." and I like to think the presumably unintentional gap between 'least' and 'things' shows that Rod had something else in mind, maybe a definition of who 'we', the annotators, might be?

Donald Hobern says 'It will also involve a commitment from us all to work collaboratively to manage digital knowledge of biodiversity', and he used 'we' quite a bit in his last Taxacom post. But Doug Yanega's awesome post (and lots of other examples known to Taxacom listers) shows that the hard grind of data inspection and cleaning isn't done by a rhetorical 'we,' it's done by a small number of individuals around the world with the interest and the time (often a lot of time) to do that work.

While persistent identifiers and an effective annotation mechanism can help with the 'time' bit, and ensure that any particular job only has to be done once, they do nothing for the 'interest' bit. Here and there in the landscape of biological data, isolated individuals appear, roll up their sleeves and try to bring order and accuracy to what's known so far. Is there a technical fix for increasing their numbers?

These people aren't going to appear out of nowhere, once effective data-item identifiers and annotation mechanisms are developed and agreed on, simply because their job has supposedly been made easier. It hasn't. In the same way that digital tools can't significantly accelerate taxonomy if 90% of taxonomists' time is spent quietly examining and curating specimens, the technical solutions for archiving/tagging/annotating data can't reduce the effort involved in the detective work needed to upgrade the data.

I used digital tools in my Australian millipede records audit to quickly identify potential problems, and I then contacted the data providers with quite a few queries, most of which only curatorial staff (sometimes particular curatorial staff) could answer. Dealing with those queries took time. Between us, the staff members and I had enough 'interest' to pursue these data issues.
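
For illustration only, a first-pass screen of that kind might look something like the Python sketch below. It is not the tooling used in the audit: the record fields ('id', 'species', 'lat', 'lon', 'date'), the rough bounding box and the function names are all assumptions made up for the example. The point is that a script can only flag candidates; someone still has to chase each one.

```python
from datetime import date

# Rough bounding box for Australia; an assumption for illustration only.
LAT_RANGE = (-44.0, -9.0)
LON_RANGE = (112.0, 154.0)


def flag_record(rec):
    """Return a list of reasons why a (hypothetical) record needs a human look."""
    problems = []
    lat, lon = rec.get("lat"), rec.get("lon")
    if lat is None or lon is None:
        problems.append("missing coordinates")
    elif not (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
              and LON_RANGE[0] <= lon <= LON_RANGE[1]):
        problems.append("coordinates outside expected region")
    if not rec.get("species"):
        problems.append("missing species name")
    collected = rec.get("date")
    if collected is not None and collected > date.today():
        problems.append("collection date in the future")
    return problems


def audit(records):
    """Map record id -> list of flags, keeping only records with at least one flag."""
    report = {}
    for rec in records:
        problems = flag_record(rec)
        if problems:
            report[rec["id"]] = problems
    return report


# Made-up records, purely to show the shape of the input.
sample = [
    {"id": "r1", "species": "species A", "lat": -41.1, "lon": 146.1,
     "date": date(1990, 1, 1)},
    {"id": "r2", "species": "", "lat": 41.1, "lon": 146.1, "date": None},
]
flags = audit(sample)
```

Here 'flags' holds an entry for r2 only (wrong-hemisphere latitude, no species name). Working out whether the latitude lost its minus sign or the specimen really came from somewhere else is the part that needs a curator.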

So a vanishingly small percentage of the world's biodiversity data items got upgraded. What about all the others? Who does those? Like Rod says, there's a lot of arm-waving about projects, but not much talk about fielding an army of data-checkers, except moonshine about 'the crowd', which properly interpreted means 'an additional small number of patient, capable people we hope to find by asking for help over the Internet'.

Which leaves us where? I can see a future with an increase in the quality of data 'crystallised' around particular taxa and geographical areas, because that's how human interest focuses. Maybe Rod has an idea?

It would be nice if those 'crystallisations' got supported. The money in recent years seems to have been going mainly to aggregating data with very patchy quality.
-- 
Dr Robert Mesibov
Honorary Research Associate
Queen Victoria Museum and Art Gallery, and
School of Agricultural Science, University of Tasmania
Home contact: PO Box 101, Penguin, Tasmania, Australia 7316
Ph: (03) 64371195; 61 3 64371195