[Taxacom] saturday morning fun
Stephen Thorpe
stephen_thorpe at yahoo.co.nz
Mon Nov 29 14:47:50 CST 2010
David,
You mention some key issues here. Let me focus on just one of them for the
moment, namely COL and its suitability as a data provider for GBIF. I suspect
that GBIF have basically just thought something like "well, COL is an
aggregation of trusted specialist databases in a form that GBIF can use" - but
the reality is *way* more complex. For me, when starting to compile a
Wikispecies page, I will often use COL as a *starting point only*, actually
little more than a convenient way of getting big lists of taxa formatted and put
on Wikispecies pages for further scrutiny. Sometimes, the COL data is so
obviously worse than useless, that I don't use it at all, not *even* as a
starting point. The data providers for COL vary widely in nature. Some of them
are near complete for their group, others are highly fragmentary. Some are
*very* raw, others are quite well polished. Sometimes, there are problems in the
way that COL interprets the data from sources, so all sorts of synonyms get
interpreted as valid, etc. Another issue, which I don't fully understand yet,
and I could perhaps be mistaken (???), is that even in COL 2010, much of the
data seems to have been harvested in 2008 ... I would have thought that COL 2010
would have harvested its data in 2010. If not, then COL is running a couple of
years behind its own data providers, who will typically not be completely
up-to-date either. So, in summary, I would say that COL is nothing more than a
convenient *starting point* for building solid biodiversity data, and it
requires a fair amount of careful and informed interpretation, not to mention a
great deal of manual work to improve on it. I'm not sure that GBIF has fully
grasped this? For example, in COL, the family Scarabaeidae is actually what
would almost universally be called the subfamily Scarabaeinae of the family
Scarabaeidae, and this is not at all obvious. So, COL is actually quite good if
you want data on Scarabaeinae, but completely lacking in any data whatsoever on
the *huge* scarabaeid subfamilies Melolonthinae and Rutelinae.
Cheers,
Stephen
________________________________
From: David Remsen (GBIF) <dremsen at gbif.org>
To: dipteryx at freeler.nl
Cc: taxacom at mailman.nhm.ku.edu
Sent: Mon, 29 November, 2010 11:19:22 PM
Subject: Re: [Taxacom] saturday morning fun
I tried to explain the state that data is in when it is published from
source collections data and the challenge of organising it for
evaluation. I also said that since the taxonomic organisation of
these source data is so inconsistent and we lack the means to rank
individual data sets (something I think we need to revisit) that we
rely on additional external sources to provide taxonomic authority
files.
- We use the Catalogue of Life because it is available and is
purported to be curated by a network of taxonomic experts. We have
the capacity to utilise additional and alternative sources if they are
made available to us. What are they and where can I find them?
And which sectors of the Catalogue of Life are completely worthless?
Like GBIF, the Catalogue of Life is composed of component parts that
are curated elsewhere. Like GBIF, CoL does actually shoulder some
responsibility for the organisation of these parts. Is it the
higher (supra-familial) taxonomy that the COL uses to organise the
components? Is it specific taxonomic sectors like the Porifera or
the legumes or weevils? Or, is it like the proverbial rotten egg,
where you don't need to eat the whole thing to just know it's bad?
The latter requires the least effort but doesn't actually say much.
- We are aware of particulars regarding species concept issues but
specific concept references are not provided in data sources and
concept differentiation remains a problem for nearly all biodiversity
data networks. We do know how to improve this, but it requires
concept identifiers, and the concept definitions themselves, to be
provided in a structured manner, with the data.
- The GBIF data portal is in need of a make-over and will be getting
one over the next year. I agree we need to make the simple thing it
is supposed to be doing simpler and with improved precision. That
simple thing is access to primary biodiversity data records shared
through the network. We will be doing a lot more processing of these
records to try to make more clear qualitative separations. There are
two basic issues, however, that are very difficult to address, the first
being comparing two records and determining whether record A and record B
actually refer to the same thing.
Lastly, GBIF is an open source project and the data published through
it are (or can be) available to anyone wishing to provide improved
access to it. This, of course, requires agreement that access to
raw collections data from a federation of sources has some sort of
merit - something that, it seems to me, is not agreed by commenters here.
If this isn't agreed then the issue is far larger than GBIF and in my
mind raises the question as to why NSF and others continue to wish to
fund such digitisation and networking efforts.
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=5448
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503559&org=&from_org=NSF
If you think there IS merit in sharing these data but that the problem
is we just continually botch it up in Copenhagen, we would support
your proposal to improve access to these meritorious data. I will
provide you with a copy of the current index (a ~75 GB text file) or
any subset for evaluation and new ideas. Just remember, the fixes
and organisations have to be re-applied next month to a new copy.
Lastly, just what did you search for in Google that provided better
recall and precision than a GBIF search in terms of specimen
records? If Google has access to a wider set of better curated
specimen records then indeed, the portal merits a real existential
look.
David Remsen
On Nov 29, 2010, at 9:46 AM, <dipteryx at freeler.nl>
<dipteryx at freeler.nl> wrote:
> From: taxacom-bounces at mailman.nhm.ku.edu on behalf of Jim Croft
> Sent: Mon 29-11-2010 1:04
>
>> To be fair, the only reason GBIF is 'feeding us shit' is
>> because 'shit' is what we gave them.
>
> ***
> Not at all sure about that. What has been playing through my
> mind is the idea that a data aggregator is an agency which can
> be characterized by "Data in, garbage out". It is a complete
> mystery to me why GBIF uses something known to be so completely
> worthless as the taxonomy of the Catalogue of Life; nothing good
> can come of that ...
>
> Like some other list-members, I tried a small test, for which I
> selected a genus where it is known to be essential to be explicit
> about the species concept used in order to be able to interpret
> and handle data, in anything like a meaningful manner.
>
> Using the GBIF data portal, the most noticeable thing is how much
> work it is to use, before getting to any data. There is indeed a
> significant degree of completely irrelevant material linked from
> this entry (the wondrous ways of computers!), but this is easily
> identifiable, so not much of an actual problem. There is no apparent
> awareness of the species-concept issue, with more than one species
> concept used happily side by side. So, a lot of work (and 'expert'
> knowledge required), but basically usable. This in contrast to the
> Wikipedia entry, which requires very little work on the part of the
> reader for him to be completely misinformed. Wikispecies is preferable,
> although it offers only little information, with a 25% rate of error
> (as compared to the source it was copied from), but at least it
> indicates its source, and it has selected a relevant source.
>
> On the whole it proves that the casual user is best advised to just
> use Google (which not only did turn up the relevant information but
> quickly showed me a very nice site unknown to me): this is less work
> and yields more useful results (a higher ratio of information/amount
> -of-work) than trying one of the self-advertised high-profile sites
> (obviously, the 'expert' does not need advice).
>
> Paul van Rijckevorsel
> _______________________________________________
>
> Taxacom Mailing List
> Taxacom at mailman.nhm.ku.edu
> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
>
> The Taxacom archive going back to 1992 may be searched with either
> of these methods:
>
> (1) http://taxacom.markmail.org
>
> Or (2) a Google search specified as:
> site:mailman.nhm.ku.edu/pipermail/taxacom your search terms here
>