[Taxacom] Data quality in aggregated datasets
Robert Mesibov
mesibov at southcom.com.au
Fri Apr 19 17:04:16 CDT 2013
There have been occasional grumblings here on Taxacom about data quality in the aggregator world, e.g. in GBIF, but what would happen if you methodically audited a sample of aggregated species occurrence records? What sorts of errors would you find? Would they be rare? Frequent?
I've done an audit of this kind for Australian millipede records in GBIF and the Atlas of Living Australia (ALA) and published the results in ZooKeys: http://www.pensoft.net/journals/zookeys/article/5111/a-specialist
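To give a concrete flavour of what one automated check in such an audit might look like, here is a minimal sketch in Python. This is not the method used in the paper; the file name, the Darwin Core column names (decimalLatitude, decimalLongitude, eventDate) and the rough Australian bounding box are all illustrative assumptions.

import csv

# Rough bounding box covering Australia, incl. Tasmania (illustrative only).
LAT_MIN, LAT_MAX = -44.0, -9.0
LON_MIN, LON_MAX = 112.0, 154.0

def problems(row):
    """Return a list of data-quality problems found in one occurrence record."""
    found = []
    try:
        lat = float(row["decimalLatitude"])
        lon = float(row["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return ["missing or non-numeric coordinates"]
    if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        found.append("coordinates fall outside Australia")
    if not row.get("eventDate"):
        found.append("no collection date")
    return found

# "occurrences.csv" stands in for a Darwin Core download from an aggregator.
with open("occurrences.csv", newline="") as f:
    for line_no, row in enumerate(csv.DictReader(f), start=2):  # line 1 = header
        for p in problems(row):
            print("record at line %d: %s" % (line_no, p))

Checks like this only catch the mechanical errors, of course; misidentifications and garbled locality text still need a specialist's eye.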
The audit results can't be generalised to all taxa and all parts of the world, but they're pretty disappointing. GBIF and ALA, however, disclaim all responsibility for data problems. If there's an error, it's the fault of the data provider. So how do errors in online databases get discovered and fixed?
In this particular case, an interested third party (me) finds problems and alerts the data provider directly. The data provider fixes the errors and in the fullness of time sends corrected records to the aggregator. (I found evidence, though, that erroneous records can persist through an update.)
What about aggregated datasets in general? What mechanisms are there for detecting and fixing errors besides (interested third party) > (data provider) > (aggregator)?
[Long silence.]
--
Dr Robert Mesibov
Honorary Research Associate
Queen Victoria Museum and Art Gallery, and
School of Agricultural Science, University of Tasmania
Home contact: PO Box 101, Penguin, Tasmania, Australia 7316
Ph: (03) 64371195; 61 3 64371195