[Taxacom] saturday morning fun
David Remsen (GBIF)
dremsen at gbif.org
Sun Nov 28 11:28:43 CST 2010
I said in the previous email I would attempt to explain a bit about
the roots of the problems inherent to the taxonomic index within the
data accessed through the GBIF data portal. While Stephen laments
"'official' databases like GBIF ... feeding us sh!t", we should
collectively recall the old saw that when you point a finger at
someone, remember there are three pointing back at you. We (not sure
who you think 'we' are) may indeed be responsible for all of the messes
and problems you see in the data accessible via the portal but I
believe the causes can be distributed a bit more widely than a few
well-intentioned, but perhaps taxonomically under-informed,
programmers in Copenhagen.
Trying to organise 264 million-plus records from over 8000 different
databases is more complicated than you probably understand. I'll try
to explain, but even a brief explanation isn't as short as I'd like.
1. There is no "GBIF" data and GBIF is not a database. GBIF is an
agreement among participating countries and networks to publish and
share data.
The data originate in a growing number (~8000) of datasets derived
from natural history collections and observational data. The list is
here: http://data.gbif.org/datasets/
or here: http://gbrds.gbif.org/browse/agents?agentType=RESOURCE
2. GBIF provides a mechanism to share these data and enable them to
be discovered in a single place instead of going to 8000 different
URLs and asking yourself. This is done by A) registering that
datasets exist and B) by providing a URL that links to some sort of
access mechanism whereupon GBIF (or anyone else) can access the source
data. We access these data and store a cache of it in our data portal.
3. Every month, we rebuild the index of data by starting at the top
of the list of registered datasets, contacting each database, and re-
acquiring a new set of data in case there are updates or additions.
This super-index is then available through our portal. It provides
links back to individual data records where the real details lie.
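
The monthly rebuild in point 3 can be sketched roughly as below. This is only an illustration under stated assumptions, not the portal's actual code: the registry layout, the `fetch` callable and the field names (`access_url`, `dataset_id`, `record_id`) are all hypothetical, and the remote sources are replaced by an in-memory dictionary so the sketch runs on its own.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def rebuild_index(
    registry: Iterable[Dict[str, str]],
    fetch: Callable[[str], List[Record]],
) -> List[Record]:
    """Rebuild the cache from scratch by visiting every registered dataset."""
    index: List[Record] = []
    for dataset in registry:
        # Ask the source for its current records (hypothetical fetch function).
        for record in fetch(dataset["access_url"]):
            # Remember which dataset the row came from, so the portal can
            # link back to the source where the full details live.
            index.append({**record, "dataset_id": dataset["id"]})
    return index


# Toy stand-ins for the registry and for two remote sources.
REGISTRY = [
    {"id": "museum-a", "access_url": "http://example.org/a"},
    {"id": "survey-b", "access_url": "http://example.org/b"},
]
SOURCES = {
    "http://example.org/a": [{"record_id": "1", "species": "Mytilus edulis"}],
    "http://example.org/b": [
        {"record_id": "7", "species": "Mytilus edulis"},
        {"record_id": "8", "species": "Oenanthe oenanthe"},
    ],
}

index = rebuild_index(REGISTRY, lambda url: SOURCES[url])
print(len(index), "records cached")
```

Because the index is rebuilt from nothing on every pass, whatever the sources publish this month is what ends up in the cache, good or bad.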
Your problem is that we have yet to overcome problems inherent in the
data to properly organise it into neat taxonomic bundles for more
precise access. Here is why:
4. After visiting all 8000 datasets and pulling and processing the
XML data we receive every month, what we obtain, and cache, from each
database looks like a single table, such as you can put in a
spreadsheet. This month's table is 269 million rows deep. Among the
columns it contains are Kingdom, Phylum, Class, Order, Family, Genus
and Species. These will be the columns about which you here have the
most complaints.
5. While you may think that the world's collections data are sorted and
in good order, even in simple Linnaean ranks you would probably be
surprised at what shows up in the collective data. Last week we did
some basic analysis of this part of the index. Some basic numbers
can be found here:
http://code.google.com/p/gbif-occurrencestore/wiki/TaxonomicIntegration#Statistics
Row 2 of the Basic Figures tells me that the 269 million records are
represented by 5.5M distinct combinations of Kingdom, Phylum, Class,
Order, Family, Genus, Species. It tells me that even Lynn Margulis
would be surprised at over 9000 distinct Kingdom values for these
data, surpassing the 5000+ Order values listed in the same file.
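
The kind of counting behind those figures can be sketched as follows. This is a minimal illustration, not the analysis we actually ran: the column names and the four sample rows are made up to show how small spelling and classification differences inflate the number of distinct combinations.

```python
from typing import Dict, List, Tuple

RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def summarize(rows: List[Dict[str, str]]) -> Tuple[int, Dict[str, int]]:
    """Count distinct rank combinations and distinct non-empty values per rank."""
    combos = {tuple(row.get(rank, "") for rank in RANKS) for row in rows}
    per_rank = {
        rank: len({row[rank] for row in rows if row.get(rank)})
        for rank in RANKS
    }
    return len(combos), per_rank


def make_row(*values: str) -> Dict[str, str]:
    """Build one illustrative row from values given in RANKS order."""
    return dict(zip(RANKS, values))


# Made-up rows: the same species reported four times by different sources.
ROWS = [
    make_row("Animalia", "Mollusca", "Bivalvia", "Mytiloida", "Mytilidae", "Mytilus", "Mytilus edulis"),
    make_row("Animalia", "Mollusca", "Bivalvia", "Mytiloida", "Mytilidae", "Mytilus", "Mytilus edulis"),
    make_row("animalia", "Mollusca", "Bivalvia", "", "Mytilidae", "Mytilus", "Mytilus edulis"),
    make_row("Metazoa", "Mollusca", "Pelecypoda", "Mytiloida", "Mytilidae", "Mytilus", "Mytilus edulis"),
]

n_combos, per_rank = summarize(ROWS)
print(n_combos, per_rank["kingdom"])
```

Four records of one species already yield three distinct combinations and three "kingdoms" here, purely from case differences, a missing order and an older class name.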
6. Here is a sample entry for a single species (Mytilus edulis, the
common blue mussel):
http://code.google.com/p/gbif-ecat/wiki/Nom5ExampleMytilusedulis
This ignores misspellings of the species name itself, which would add
to the list, but it illustrates a lot of variation both in the
orthography and correctness of the authorship and in the higher taxonomy.
Some is completely incorrect, some out of date, some missing, some
transposed and some is, well, it's right.
Of course we could easily weed out all of the incorrect stuff,
right? Sure, if A) we had basic lists of the correct information
and B) these authoritative sources came anywhere near being
comprehensive enough to answer basic questions like: How many Mytilus
edulis are there really? We don't have either.
7. Sampling only from these specimen and observation datasets, and
with the limited metadata provided by the sources, we have no
effective way of distinguishing good taxonomy and orthography from bad
or for qualitatively ranking one dataset against another. So this
makes even determining how many real taxa are represented in these
data problematic.
8. We therefore look for organised collections of authoritative data
that we might rely on to organise this giant and messy beast of
primary data. The Catalogue of Life has played this role almost
exclusively. Why? It's available for us to use and it offers a
chance we might do better than the source data to answer what must
seem like simple questions like "What records exist for Coleoptera?"
9. This only provides a solution for the 54% of names that match the
Catalogue of Life (as well as IPNI and Index Fungorum). For taxa not
listed in those sources our choices have been to 1) ignore them or 2)
try to provisionally place them in a taxonomic hierarchy that uses the
Catalogue of Life as the starting point and hope we get it close.
In other words, if someone asked for all the Coleoptera, we could
provide only those records that the Catalogue of Life says are beetles
or what the source data say are beetles. Since the COL only matches
50% of the names, we felt that 1) was too restrictive.
10. We derive a classification based on the Catalogue of Life where
we try to place species not reported in it under a higher taxon that is.
For example, if Scombrus maximus (I made this up) isn't in the
Catalogue of Life but is reported in the Family Scombridae (which is)
we could provisionally link the taxon to the Scombridae. This is a
simple and easy case.
More problematic is when the placement is ambiguous. Oenanthe appears
as a genus among the plants and the birds but within the plants it's
reported as both an orchid and an umbelliferid. Plant species are
listed in the Graminaceae and other older higher taxa that don't
conform to the available higher taxonomy we rely on and make it hard
to place them.
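
The easy case and the ambiguous case can be sketched together. This is a simplified illustration, assuming a tiny hypothetical backbone of three entries and a rule that walks from genus upward until a reported name is found in it; it is not the matching code the portal uses.

```python
from typing import Dict, List, NamedTuple, Optional


class Taxon(NamedTuple):
    name: str
    rank: str
    kingdom: str


# Hypothetical mini-backbone standing in for the reference classification.
BACKBONE = [
    Taxon("Scombridae", "family", "Animalia"),
    Taxon("Oenanthe", "genus", "Animalia"),  # the bird genus
    Taxon("Oenanthe", "genus", "Plantae"),   # the plant genus
]

BY_NAME: Dict[str, List[Taxon]] = {}
for taxon in BACKBONE:
    BY_NAME.setdefault(taxon.name, []).append(taxon)

RANKS_LOW_TO_HIGH = ("genus", "family", "order", "class", "phylum")


def place(record: Dict[str, str]) -> Optional[Taxon]:
    """Return the lowest backbone taxon to attach the record to, or None."""
    for rank in RANKS_LOW_TO_HIGH:
        name = record.get(rank)
        if not name:
            continue
        candidates = [t for t in BY_NAME.get(name, []) if t.rank == rank]
        if len(candidates) > 1 and record.get("kingdom"):
            # Use the reported kingdom to choose between homonyms.
            candidates = [t for t in candidates if t.kingdom == record["kingdom"]]
        if len(candidates) == 1:
            return candidates[0]
        if len(candidates) > 1:
            return None  # ambiguous homonym: refuse to guess
    return None


easy = place({"species": "Scombrus maximus", "genus": "Scombrus", "family": "Scombridae"})
ambiguous = place({"species": "Oenanthe sp.", "genus": "Oenanthe"})
resolved = place({"species": "Oenanthe sp.", "genus": "Oenanthe", "kingdom": "Plantae"})
print(easy, ambiguous, resolved)
```

The sketch only resolves a homonym when the record carries a kingdom that agrees with the backbone. Records with missing, outdated or wrong higher taxa are exactly the ones it cannot place.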
"Phantom" genera appear where the same genus is listed in different
higher taxa according to different sources. Genera appear listed as
Families or Orders. How should we know which are legit? Download
the complete and official list of genera? Or the one of families or
orders? Where are they? Do we teach our programmes the higher taxa
suffixes and let them sort it out? I say "Hah," sirs, simply "Hah!"
11. Lastly, Wolfgang, and others, have offered to 'fix' the
taxonomy issues in the portal, to the extent that we even funded a
pilot to see what a review could find and do with the data. Like
everything else, this isn't as simple as just pointing out a bad
record. Since we rebuild the cache every month from a network of
sources, applying corrections either needs to be done at the source
or applied repeatedly by us. We are now working on the means to
support source-level remediation by
1) developing an annotation mechanism to enable suggested fixes to be
sent back to a source curator and
2) enabling taxonomic data-cleaning services to be offered by people
who have the sort of authority files to offer them.
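
Why a fix made only in the cache does not stick can be sketched like this. The correction store, its key of dataset and record identifiers, and the sample record are hypothetical and only illustrate the "applied repeatedly by us" option; they do not describe an existing GBIF mechanism.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Key = Tuple[str, str]  # (dataset_id, record_id)


def apply_corrections(index: List[Record], corrections: Dict[Key, Record]) -> List[Record]:
    """Return a patched copy of the index; the input is left unchanged."""
    patched = []
    for record in index:
        key = (record["dataset_id"], record["record_id"])
        # Overlay any stored correction on top of the harvested values.
        patched.append({**record, **corrections.get(key, {})})
    return patched


# A correction kept by the aggregator, outside the cache (hypothetical store).
CORRECTIONS = {("museum-a", "1"): {"kingdom": "Fungi"}}

# Each monthly harvest returns the uncorrected source row again.
this_month = [
    {"dataset_id": "museum-a", "record_id": "1", "genus": "Amanita", "kingdom": "Plantae"}
]
next_month = [dict(row) for row in this_month]

fixed_now = apply_corrections(this_month, CORRECTIONS)
fixed_later = apply_corrections(next_month, CORRECTIONS)
print(next_month[0]["kingdom"], fixed_later[0]["kingdom"])
```

Until the source itself is corrected, the same patch has to be replayed after every rebuild, which is why we would rather send the suggested fix back to the curator.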
We are also looking to have expanded access to a wider array of
taxonomic authority files that we can use as more and better authority
files. This has been hindered by the lack of an accessible data
standard to allow easy sharing of authority files, as well as of
incentives to make them available to GBIF and others. We are trying
to work on this as
well. I posted the link to a call to evaluate that format and
particularly those parts related to citation and attribution by all
who use the data to those who curate it. These could help address
the above immensely. Want to help? Offer us some authority files in
this format. We will ensure you are credited.
So, while your Saturday morning fun appears to make what we do seem
sort of senseless and value-less, my Sunday evening response is not
based on that view. You are seeing a processed version of what we
see that originates in the big dirty messy world. I think GBIF
(Copenhagen) is remiss in that we have not properly put the noise in
its proper proportion and done a better job of faceting and sorting
the majority "good" data from the bad. We work in a shifting
landscape of noisy stakeholders all shouting that everything is a
priority. I'm sorry we still haven't gotten this straightened out
but we hope to have a better solution very soon.
----------------------------------------------------------------------------
David Remsen, Senior Programme Officer
Electronic Catalog of Names of Known Organisms
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321472 Fax: +45-35321480
Mobile +45 28751472
Skype: dremsen
----------------------------------------------------------------------------
On Nov 28, 2010, at 3:34 PM, Dr. Rodham E. Tulloss wrote:
> Dr. Lorenz,
>
> Thank you for drawing my attention to GBIF.
>
> I am one of the few persons left in the world who actually work at
> morphological taxonomy of an agaric group worldwide and have over 30
> years of experience. I work on the Amanitaceae (Agaricales,
> Basidiomycetes, Fungi).
>
> I went to look at GBIF this morning. For fungi, it is worse than a
> bad joke. I immediately decided that it was better to do a few more
> dozen new taxa than it was to be a part of straightening out the
> hopeless mess related to my group on GBIF (such as it is...they have
> one species inexplicably listed both as a fungus and a synonym of a
> Pine tree).
>
> I PAY MY OWN WAY in retirement. They have millions. What a travesty.
>
> Rod Tulloss