[Taxacom] What can Global Biodiversity Information Facility (GBIF) do for you?

Mon Oct 21 05:48:37 CDT 2013

Hi Lyubo,

One thing I'm not clear on is whether there is a pipeline from the Biodiversity Data Journal to GBIF. In other words, if I publish a data-rich paper in your journal, does it get automatically harvested by GBIF?

Regards

Rod

On 20 Oct 2013, at 12:36, Lyubomir Penev wrote:

> 'No, that's not the argument. Biodiversity data aren't like biodiversity books or papers, for which you can (in principle) generate a complete catalogue or index. Given such a catalogue or index, you can go further and digitise and make available on the Web all the content. Cool, yes?' 
> 
> Yes, cool, but between the "cool stuff" and the BHL scans or recently published PDFs there is such a thing called "markup".
> 
> Imho, the biggest problem of GBIF is that its main sources of data are still mostly large institutional collections. That is not bad at all, and it is definitely a great thing to index these and make them discoverable. Unfortunately, this comes with a trade-off: non-verified data are included as well. GBIF is still far away from mobilizing the properly published "small" data (that is, data that are well documented in the literature, peer-reviewed, registered for authorship and priority, citable, etc.). Why is this so? Because of the same "markup" problem.
> 
> Here comes the second big problem of GBIF: a huge amount of "small" biodiversity data are still outside the institutional collections of the GBIF member countries (e.g., published in the historical literature, in many cases not even in English, or stored in museums of non-member countries, smaller museums, private collections, observations not stored in collections, etc.). Nobody knows the real proportion between the "big" collections' data and "small" data, but according to some calculations, the "small" data constitute about 80% of all data. In other words, the "small" data are in fact the "big" data. When will GBIF be able to gather the precious, peer-reviewed, small data from historical and recently published literature, for example? The answer is: "probably never", as long as we continue to publish in an archaic way, such as paper/PDF, forgetting that someone has to spend a huge effort to extract data from PDFs, put these data into a database and then upload them to GBIF or anywhere else... while PDFs continue to pile up every day!
> 
> Even ZooKeys, despite being integrated with the GBIF Integrated Publishing Toolkit (IPT), which gives authors the option to publish the data associated with a taxonomic revision, or to publish them in the form of "data papers", still publishes a lot of non-marked-up occurrence records. Why so? Again, because of the huge effort associated with markup, especially of complex biodiversity data types, such as occurrence records and morphological descriptions.
> 
> Fortunately, it looks like there is a light at the end of the tunnel, called the Biodiversity Data Journal. Any kind of occurrence records (and other types of small data) must be published in both human-readable (HTML, PDF) and computer-readable (XML, Darwin Core and DwC Archive) formats. It is a piece of cake to download or harvest data published in this way and to make them not just "discoverable via open access" or "linked to taxon names, mentioned in the same article", but REUSABLE!
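
To give a rough idea of what "harvesting" such data could look like, here is a minimal Python sketch that reads occurrence rows out of a zipped archive. It is only an illustration under stated assumptions: the core file name `occurrence.txt`, the tab delimiter and the helper names are hypothetical (in a real Darwin Core Archive the layout is declared in its `meta.xml`), and the sample row is made up around the FMNH specimen mentioned elsewhere in this thread.

```python
import csv
import io
import zipfile


def read_occurrences(archive_bytes, core_file="occurrence.txt"):
    """Return the rows of a tab-delimited core file inside a zipped archive.

    Assumption: the file name and delimiter are fixed here for simplicity;
    a real reader would take them from the archive's meta.xml.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        with archive.open(core_file) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))


def build_sample_archive():
    """Build a tiny in-memory archive so the sketch runs without a download."""
    rows = (
        "occurrenceID\tinstitutionCode\tcatalogNumber\tscientificName\n"
        # Illustrative row only; the occurrenceID is invented.
        "urn:example:1\tFMNH\t147942\tCrunomys suncoides\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("occurrence.txt", rows)
    return buffer.getvalue()


records = read_occurrences(build_sample_archive())
for record in records:
    print(record["institutionCode"], record["catalogNumber"], record["scientificName"])
```

The column headers are Darwin Core term names, so each row comes back as a dictionary keyed by those terms and needs no manual markup before reuse.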
> 
> "Gold open access" is a good thing, but it is no longer sufficient. We need to switch to "platinum open access" publishing, which will eliminate the costs of markup and make data easily available and reusable for all.
> 
> Cheers,
> Lyubomir
> 
> 
> 
> On Sun, Oct 20, 2013 at 1:44 PM, Roderic Page <r.page at bio.gla.ac.uk> wrote:
> Hi Bob,
> 
> Having spent a lot of time trying to extract content from BHL for projects such as BioStor and BioNames, the kinds of issues you raise for specimens sound all too familiar.
> 
> BHL grabs physical things, scans them, associates whatever metadata library catalogues have, and puts them online. Simples. Ah, but then the fun starts. Locating articles (i.e., the things we actually cite) in BHL is sometimes straightforward, but often it is anything but. Journals can change names, may have multiple names (sometimes in multiple languages), concurrent or inconsistent volume and/or page numbering, etc. Notions that we take for granted today (that there are "articles" and that they have explicit titles) may not hold, and of course every taxonomist knows that determining the date of publication can be a challenge (as I'm sure Neal Evenhuis, among others, will testify). Much of the time I spend on BioNames consists of taking cryptic, often misleading (if not downright erroneous) citations to original descriptions and matching these to BHL (or other sources).
> 
> My point is that I don't think there's a world of difference between the two problems. For all the issues that you document in "A specialist’s audit of aggregated occurrence records" http://dx.doi.org/10.3897/zookeys.293.5111 , I could probably find equivalent horror stories for bibliographic data.
> 
> As you say, many of the basic elements of a GBIF occurrence are potentially contested, subject to uncertainty, error, etc. I guess it's for everyone to decide whether the trade-off involved in simplifying the data so it can be aggregated in bulk is worthwhile.
> 
> One thing I'd like to see is GBIF occurrence data integrated with the literature, for example by linking specimens to their citation in the literature (another reason to play with BHL). If we can go from a specimen to the associated literature we could then track some of the issues you mention, such as different identifications, discussion of whether the collection locality is correct, etc.
> 
> For a simple example, the specimen FMNH 147942 appears in at least three articles in BioStor (http://biostor.org/specimen/FMNH%20147942 ). Below are the three article links plus text extract around the specimen code:
> 
> http://biostor.org/reference/81423
> 
> Crunomys suncoides Rickart et al.,
> 1998. — Mindanao Island, Bukidnon Prov-
> ince, Mount Katanglad Range, 18.5 km S,
> 4 km E Camp Phillips, elev. 2,250 m,
> 8°9'30"N, 124°5rE, 1 male (FMNH
> 147942).
> 
> http://biostor.org/reference/65896
> 
> Crunomys suncoides Rickart, Heaney, Tabar-
> anza, and Balete, 1998
> The Kitanglad shrew-mouse is currently
> known only from the Kitanglad Range (Rickart
> et al., 1998), though we suspect that it is more
> widespread in mossy forest on Mindanao. The
> species was described based on a single adult
> male (FMNH 147942; 37 g) we captured in April
> 1993 in old-growth mossy forest at 2250 m (Site
> 6, Fig. 8). It had scrotal testes measuring 14 X
> 8 mm.
> 
> http://biostor.org/reference/95679
> 
> Crunomys suncoides, new species
> (Figs. 2, 4-9)
> HoLOTYPE — Adult male, fmnh 147942; collect-
> ed 10 April 1993 (original number 5330 of L. R.
> Heaney); initially fixed in formalin, now pre-
> served in ethyl alcohol with the skull removed
> and cleaned. The stomach and both femora have
> been removed; otherwise the specimen is in ex-
> cellent condition. It is deposited at fmnh but will
> be transferred to pnm.
> 
> Each tells us something about the specimen (and more than GBIF does). So, what if we linked this information together so that GBIF users could learn more about that record?
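
As a rough sketch of how such links might be built, the Python below scans text extracts for a specimen code and groups the references that mention it. This is not how BioStor does it; the pattern (hard-wired to the FMNH acronym) and the function names are hypothetical, and the extracts are shortened from the three quoted above. Matching is case-insensitive and tolerates a line break between acronym and number, because OCR text gives us things like "fmnh 147942".

```python
import re

# Hypothetical pattern: an institution acronym followed by a catalogue number.
# Only FMNH is handled here; a real matcher would need many acronyms and formats.
SPECIMEN_CODE = re.compile(r"\b(FMNH)\s+(\d+)\b", re.IGNORECASE)


def find_specimen_codes(text):
    """Return normalised codes such as 'FMNH 147942' found in the text."""
    return [f"{m.group(1).upper()} {m.group(2)}" for m in SPECIMEN_CODE.finditer(text)]


def index_by_specimen(extracts):
    """Map each specimen code to the sorted list of references that mention it."""
    index = {}
    for reference, text in extracts.items():
        # set() so that a reference is listed once even if it cites a code twice
        for code in set(find_specimen_codes(text)):
            index.setdefault(code, []).append(reference)
    return {code: sorted(refs) for code, refs in index.items()}


extracts = {
    "http://biostor.org/reference/81423": "1 male (FMNH\n147942).",
    "http://biostor.org/reference/65896": "male (FMNH 147942; 37 g) we captured in April",
    "http://biostor.org/reference/95679": "Adult male, fmnh 147942; collect-",
}

index = index_by_specimen(extracts)
for code, references in index.items():
    print(code, len(references))
```

With an index like this, a specimen record could carry a list of the articles that cite it, which is the kind of link I have in mind.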
> 
> Regards
> 
> Rod
> 
> 
> On 19 Oct 2013, at 22:53, Bob Mesibov wrote:
> 
> > Hi, Rod.
> >
> > 'So, if the argument is that GBIF should be looking beyond museum collections then I completely agree...'
> >
> > No, that's not the argument. Biodiversity data aren't like biodiversity books or papers, for which you can (in principle) generate a complete catalogue or index. Given such a catalogue or index, you can go further and digitise and make available on the Web all the content. Cool, yes? Anyone anywhere with access to the Web can view a biodiversity publication at the click of a mouse. This works because biodiversity publications are very well-defined objects which either exist or don't. BHL is hugely valuable and 'intrinsically' successful because the goal of digitising all biodiversity publications is achievable, in principle.
> >
> > GBIF is intrinsically unsuccessful because it treats occurrence records as very well-defined objects, which they aren't. Each record is instead an entry point into an investigation (minimally) of the identity of the organism(s) observed, of the location of the observation, of the timing of the observation, of the observer and of the fate of any specimen(s) or images which are hard evidence for the observation. I say 'minimally' because the museum records that wind up in GBIF often have more than these basics in their 'pre-GBIF' form, and are sometimes only condensed versions of even more information available elsewhere. You don't get that from GBIF.
> >
> > Records aren't open-ended, but some users will go much further with them than other users. GBIF best suits users who accept the data as-is and can find trivial purposes for which those untested, sparse data are 'fit'.
> >
> > The argument that GBIF in fact suits everyone - because it lets everyone know where to find out more - fails because GBIF is a lousy index. It contains lots of errors, it's taxonomically, geographically, ecologically and 'literature-wise' grossly incomplete, and for many biodiversity studies (see Meier and Dikow) you're better off starting with your own plan of attack and chasing sources independently.
> >
> > It would be possible to rebuild GBIF from scratch as the thing its title suggests (an information facility), namely a 'meta' resource that points to and introduces data sources, but I don't think that's going to happen, because it's too hard. GBIF has taken an easier approach and has been accumulating records as though they were coins, and measuring its usefulness by counting its 'wealth' of records, so that if it has twice as many records it must be twice as useful, right? Other people in this thread have pointed out how raw counts are meaningless for assessing usefulness. Here I just wanted to say that what works for BHL doesn't work for GBIF, because the items being made Web-available are inherently different.
> > --
> > Dr Robert Mesibov
> > Honorary Research Associate
> > Queen Victoria Museum and Art Gallery, and
> > School of Agricultural Science, University of Tasmania
> > Home contact:
> > PO Box 101, Penguin, Tasmania, Australia 7316
> > (03) 64371195; 61 3 64371195
> >
> 
> ---------------------------------------------------------
> Roderic Page
> Professor of Taxonomy
> Institute of Biodiversity, Animal Health and Comparative Medicine
> College of Medical, Veterinary and Life Sciences
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QQ, UK
> 
> Email:          r.page at bio.gla.ac.uk
> Tel:                    +44 141 330 4778
> Fax:            +44 141 330 2792
> Skype:          rdmpage
> Facebook:       http://www.facebook.com/rdmpage
> LinkedIn:       http://uk.linkedin.com/in/rdmpage
> Twitter:                http://twitter.com/rdmpage
> Blog:           http://iphylo.blogspot.com
> Home page:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> Wikipedia:      http://en.wikipedia.org/wiki/Roderic_D._M._Page
> Citations:      http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
> ORCID:          http://orcid.org/0000-0002-7101-9767
> 
> _______________________________________________
> Taxacom Mailing List
> Taxacom at mailman.nhm.ku.edu
> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
> 
> The Taxacom Archive back to 1992 may be searched with either of these methods:
> 
> (1) by visiting http://taxacom.markmail.org
> 
> (2) a Google search specified as:  site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
> 
> Celebrating 26 years of Taxacom in 2013.
> 
> 
> 
> -- 
> Dr. Lyubomir Penev
> Managing Director
> Pensoft Publishers
> 13a Geo Milev Street
> 1111 Sofia, Bulgaria
> Fax +359-2-8704282
> www.pensoft.net
> Publishing services for journals: http://www.pensoft.net/services-for-journals
> Books published by Pensoft: http://www.pensoft.net/books-published-by-Pensoft
> Services for scientific projects: http://www.pensoft.net/projects
