[Taxacom] Data quality in aggregated datasets

Roderic Page r.page at bio.gla.ac.uk
Thu Apr 25 03:41:42 CDT 2013


Leaving aside the issues of what both providers and aggregators can do to clean the data, we seem trapped in an endless cycle of:

1. OMG the data is broken!

2. SOMETHING MUST BE DONE!

3. Wave arms frantically, mention projects currently underway that will almost certainly solve the problem "real soon now".

4. ... [tumbleweed]

5. Go to 1

There are at least two things we need to do to tackle this problem, and until we do we're not being serious about data quality.

1. Identifiers

In order to clean data, that data has to persist long enough for people or algorithms to act on it. If I add an annotation to a piece of data I want that information to persist, otherwise why would I bother? At the level of specimens we don't have identifiers, and few have shown any commitment to tackling this problem (a notable exception is Roger Hyam's work at the RBGE, see http://www.mapress.com/phytotaxa/content/2012/f/pt00073p030.pdf ). GBIF routinely deletes vast numbers (in some cases literally millions) of specimen URLs, so any attempt to attach annotations to those records is doomed.

2. Annotation tools

Of course there are tools being developed by our community, but I've not seen any that look at all usable. In the real world we are used to tracking packages being couriered around the world (there's an app for that), and many will have come across feedback tools online where you can notify a site of an issue and engage in a conversation to resolve it. There are also more general annotation tools being developed, e.g. Hypothes.is (http://hypothes.is/ ). Let's leverage these.

Annotation rests on being able to identify the thing being annotated, and on the web URLs serve that purpose. Until we have stable URLs for specimens, and these are used by everyone who has something to say about a specimen, we are doomed to repeat steps 1-5.

But of course, we know all this, and have done so for a while...

Regards

Rod


---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
Skype: rdmpage
Facebook: http://www.facebook.com/rdmpage
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Wikipedia: http://en.wikipedia.org/wiki/Roderic_D._M._Page
Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
ORCID id: http://orcid.org/0000-0002-7101-9767



