[Taxacom] Data quality in aggregated datasets

Fri Apr 26 17:03:03 CDT 2013

I agree, Stephen.  Any erroneous data in an analysis are problematic and shouldn’t be considered insignificant.  For example, with all the introductions of species around the world, how does one know whether a distribution record is valid or not just because it seems far away from the known distribution?  It could be an introduced population or a disjunct, native one.  From similar discussions at scientific meetings, it seems that many biologists and ecologists who don’t specialize in taxonomy don’t comprehend the many potential problems with aggregated data.

A recent and somewhat related “Letter to the Editor” is:
Villegas Vallejos, Marcelo Alejandro and David Morimoto.  2013.  The importance of data verification: Unchecked errors in basic natural history sampling may greatly impair conservation research.  Biological Conservation 157: 437-438. [although not about aggregated data]

It seems that the most important point is that the initial data input (into any database) needs to be done more carefully and should be double checked.  However, a colleague has mentioned several instances where even double-entered and verified data were incorrect in database output later (database hiccups).  Only the original collector(s) are likely to spot such errors … after 20 years no one will be the wiser.

The misinformation that exists in some records can send one on a wild goose chase that lasts for hours or days trying to verify.  Because of an erroneous record, does one want to travel to a site many hours away to attempt the collection of a species that was never present at the site?

I’ve seen erroneous distribution records published in recent years.  It took quite a bit of sleuthing through old data (good old paper) to verify the erroneous nature of those records that were mixed up in a database.  How long do you suppose these distribution records will be perpetuated?

Old records often are listed under old name combinations.  I always try to search every possible name combination for taxa and often find old records that have not been updated for 50 or more years.  Searching misspelled names also locates records that otherwise would have slipped through the cracks (more of those “missing” data).
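
For anyone who wants to automate part of that search, here is a minimal sketch in Python.  It is only an illustration under stated assumptions: the function names, the 0.85 similarity cutoff, and the sample record list are all hypothetical, and the list of name combinations still has to be compiled by hand from the literature.  It uses the standard-library difflib module to pull out records whose names closely resemble any of the combinations supplied, which catches simple misspellings as well as exact matches.

```python
import difflib


def find_name_variants(query, recorded_names, cutoff=0.85):
    """Return recorded names that closely match `query`, ignoring case.

    The cutoff is an assumed starting value, not a recommended standard;
    lower it to catch rougher misspellings at the cost of more false hits.
    """
    # Map lower-cased names back to the spellings actually found in the records.
    by_lowercase = {}
    for name in recorded_names:
        by_lowercase.setdefault(name.lower(), []).append(name)

    close = difflib.get_close_matches(
        query.lower(),
        list(by_lowercase),
        n=max(1, len(by_lowercase)),  # return every match above the cutoff
        cutoff=cutoff,
    )
    return [spelling for key in close for spelling in by_lowercase[key]]


def search_all_combinations(combinations, recorded_names, cutoff=0.85):
    """Search the records under every supplied name combination."""
    hits = set()
    for combination in combinations:
        hits.update(find_name_variants(combination, recorded_names, cutoff))
    return sorted(hits)


# Hypothetical record list: one current name, one older combination,
# two misspellings, and one unrelated species.
records = [
    "Oncorhynchus mykiss",
    "Oncorhynchus mykis",
    "Salmo gairdneri",
    "Salmo gairdnerii",
    "Salmo trutta",
]

for hit in search_all_combinations(["Oncorhynchus mykiss", "Salmo gairdneri"], records):
    print(hit)
```

A fuzzy match like this only narrows the list of candidates; each hit still needs to be checked against the original labels or field notes before it is treated as the same taxon.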

Bill

________________________________________
From: taxacom-bounces at mailman.nhm.ku.edu [taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Stephen Thorpe [stephen_thorpe at yahoo.co.nz]
Sent: Friday, April 26, 2013 5:30 PM
To: Quentin Groom
Cc: TAXACOM
Subject: Re: [Taxacom] Data quality in aggregated datasets

The problem here is that "modelers" are not the only users of biodiversity data, and some of the *other* users may be more concerned about "odd misidentifications, wrong grid references and other random errors", particularly if they want to know the answer to a particular question, and the relevant data is error-ridden ...

Stephen

From: Quentin Groom <quentin.groom at br.fgov.be>
To:
Cc: TAXACOM <taxacom at mailman.nhm.ku.edu>
Sent: Friday, 26 April 2013 8:32 PM
Subject: Re: [Taxacom] Data quality in aggregated datasets

As someone that would like to use GBIF data for modeling and data
analysis, and as a collector of distributional data, I'm not that
flustered about odd misidentifications, wrong grid references and other
random errors. By far the largest problem is all the missing data, both
from countries that don't participate in GBIF and from participating
countries that shared their data once and don't update it.
Modelers expect the data to be ugly; it can even be factored into their
models. However, they can't do that if they don't have the data in the
first place.
While we should not be complacent about quality, I would much rather we
focus our efforts on data availability.
Quentin

Roderic Page wrote:
> Leaving aside the issues of what both providers and aggregators can do to clean the data, we seem trapped in an endless cycle of :
>
> 1. OMG the data is broken!
>
> 2. SOMETHING MUST BE DONE!
>
> 3. Wave arms frantically, mention projects currently underway that will almost certainly solve the problem "real soon now".
>
> 4. ... [tumble weed]
>
> 5. Go to 1
>
> There are at least two things we need to do to tackle this problem, and until we do we're not being serious about data quality.
>
> 1. Identifiers
>
> In order to clean data, that data has to persist long enough for people or algorithms to act on it. If I add an annotation to a piece of data I want that information to persist, otherwise why would I bother? At the level of specimens we don't have identifiers, and few have shown any commitment to tackling this problem (notable exception is Roger Hyam's work at the RBGE, see http://www.mapress.com/phytotaxa/content/2012/f/pt00073p030.pdf ). GBIF routinely deletes vast numbers (in some cases literally millions) of specimen URLs, so any attempt to attach annotations to those records is doomed.
>
> 2. Annotation tools
>
> Of course there are tools being developed by our community, but I've not seen any that look at all usable. In the real world we are used to tracking packages being couriered around the world (there's an app for that), and many will have come across feedback tools online where you can notify a site of an issue and engage in a conversation to resolve it. There are also more general annotation tools being developed, e.g. http://hypothes.is/ Let's leverage these.
>
> Annotation rests on being able to identify the thing being annotated, and on the web URLs serve that purpose. Until we have stable URLs for specimens, and these are used by everyone who has something to say about that specimen, then we are doomed to repeat steps 1-5.
>
> But of course, we know all this, and have done so for a while...
>
> Regards
>
> Rod
>
>
> ---------------------------------------------------------
> Roderic Page
> Professor of Taxonomy
> Institute of Biodiversity, Animal Health and Comparative Medicine
> College of Medical, Veterinary and Life Sciences
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QQ, UK
>
> Email: r.page at bio.gla.ac.uk
> Tel: +44 141 330 4778
> Fax: +44 141 330 2792
> Skype: rdmpage
> Facebook: http://www.facebook.com/rdmpage
> Twitter: http://twitter.com/rdmpage
> Blog: http://iphylo.blogspot.com
> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> Wikipedia: http://en.wikipedia.org/wiki/Roderic_D._M._Page
> Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
> ORCID id: http://orcid.org/0000-0002-7101-9767
>
> _______________________________________________
> Taxacom Mailing List
> Taxacom at mailman.nhm.ku.edu
> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
>
> The Taxacom Archive back to 1992 may be searched with either of these methods:
>
> (1) by visiting http://taxacom.markmail.org
>
> (2) a Google search specified as:  site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
>
> Celebrating 26 years of Taxacom in 2013.
>
>
> 

--
Dr. Quentin Groom
(Botany and Information Technology)

National Botanic Garden of Belgium
Domein van Bouchout
B-1860 Meise
Belgium

Landline; +32 (0) 226 009 20 ext. 364
FAX:      +32 (0) 226 009 45

E-mail:    quentin.groom at br.fgov.be
Skype name: qgroom
Website:    www.botanicgarden.be

William J. Poly 
Research Associate
Department of Ichthyology
California Academy of Sciences
55 Music Concourse Drive, Golden Gate Park
San Francisco, California 94118
wpoly at calacademy.org
http://research.calacademy.org/ichthyology/staff/wpoly