[Taxacom] Data quality in aggregated datasets
Stephen Thorpe
stephen_thorpe at yahoo.co.nz
Fri Apr 26 16:30:05 CDT 2013
The problem here is that "modelers" are not the only users of biodiversity data, and some of the *other* users may be more concerned about "odd misidentifications, wrong grid references and other
random errors", particularly if they want to know the answer to a particular question and the relevant data is error-ridden ...
Stephen
From: Quentin Groom <quentin.groom at br.fgov.be>
To:
Cc: TAXACOM <taxacom at mailman.nhm.ku.edu>
Sent: Friday, 26 April 2013 8:32 PM
Subject: Re: [Taxacom] Data quality in aggregated datasets
As someone who would like to use GBIF data for modeling and data
analysis, and as a collector of distributional data, I'm not that
flustered about odd misidentifications, wrong grid references and other
random errors. By far the largest problem is all the missing data, both
from countries that don't participate in GBIF and from participating
countries that shared their data once and don't update it.
Modelers expect the data to be ugly; this can even be factored into their
models. However, they can't do that if they don't have the data in the
first place.
While we should not be complacent about quality, I would much rather we
focus our efforts on data availability.
Quentin
Roderic Page wrote:
> Leaving aside the issues of what both providers and aggregators can do to clean the data, we seem trapped in an endless cycle of:
>
> 1. OMG the data is broken!
>
> 2. SOMETHING MUST BE DONE!
>
> 3. Wave arms frantically, mention projects currently underway that will almost certainly solve the problem "real soon now".
>
> 4. ... [tumble weed]
>
> 5. Go to 1
>
> There are at least two things we need to do to tackle this problem, and until we do we're not being serious about data quality.
>
> 1. Identifiers
>
> In order to clean data, that data has to persist long enough for people or algorithms to act on it. If I add an annotation to a piece of data I want that information to persist, otherwise why would I bother? At the level of specimens we don't have identifiers, and few have shown any commitment to tackling this problem (a notable exception is Roger Hyam's work at the RBGE, see http://www.mapress.com/phytotaxa/content/2012/f/pt00073p030.pdf ). GBIF routinely deletes vast numbers (in some cases literally millions) of specimen URLs, so any attempt to attach annotations to those records is doomed.
>
> 2. Annotation tools
>
> Of course there are tools being developed by our community, but I've not seen any that look at all usable. In the real world we are used to tracking packages being couriered around the world (there's an app for that), and many will have come across feedback tools online where you can notify a site of an issue and engage in a conversation to resolve it. There are also more general annotation tools being developed, e.g. http://hypothes.is/ Let's leverage these.
>
> Annotation rests on being able to identify the thing being annotated, and on the web URLs serve that purpose. Until we have stable URLs for specimens, and these are used by everyone who has something to say about that specimen, then we are doomed to repeat steps 1-5.
>
> But of course, we know all this, and have done so for a while...
>
> Regards
>
> Rod
>
>
> ---------------------------------------------------------
> Roderic Page
> Professor of Taxonomy
> Institute of Biodiversity, Animal Health and Comparative Medicine
> College of Medical, Veterinary and Life Sciences
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QQ, UK
>
> Email: r.page at bio.gla.ac.uk
> Tel: +44 141 330 4778
> Fax: +44 141 330 2792
> Skype: rdmpage
> Facebook: http://www.facebook.com/rdmpage
> Twitter: http://twitter.com/rdmpage
> Blog: http://iphylo.blogspot.com
> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> Wikipedia: http://en.wikipedia.org/wiki/Roderic_D._M._Page
> Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
> ORCID id: http://orcid.org/0000-0002-7101-9767
>
> _______________________________________________
> Taxacom Mailing List
> Taxacom at mailman.nhm.ku.edu
> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
>
> The Taxacom Archive back to 1992 may be searched with either of these methods:
>
> (1) by visiting http://taxacom.markmail.org
>
> (2) a Google search specified as: site:mailman.nhm.ku.edu/pipermail/taxacom your search terms here
>
> Celebrating 26 years of Taxacom in 2013.
--
Dr. Quentin Groom
(Botany and Information Technology)
National Botanic Garden of Belgium
Domein van Bouchout
B-1860 Meise
Belgium
Landline: +32 (0) 226 009 20 ext. 364
FAX: +32 (0) 226 009 45
E-mail: quentin.groom at br.fgov.be
Skype name: qgroom
Website: www.botanicgarden.be