[Taxacom] Biodiv Informatics Challenges

Tue Feb 17 10:21:50 CST 2009

Wolfgang and all,

This is Hong Cui from the University of Arizona. I am not a biologist but
have been working with biodiversity information. My research areas include
semantic markup of taxonomic descriptions.

A recent project in collaboration with the Flora of North America Project
may be relevant to THE challenge as Wolfgang stated, "In my mind, one of THE
challenges in biodiversity informatics - is how to distinguish unfinished
(manuscript) and finished content (reviewed reliable publication)", so I
thought I'd share some ideas with you.

This challenge can be translated into a quality-measurement problem: how can
a computer distinguish a quality description from problematic ones?

Our idea is to use semantic markup techniques to fully mark up document
content (our focus is on morphological descriptions), then have the computer
compare a manuscript with "finished content" and identify any anomalies.
We have developed an unsupervised semantic markup algorithm that works very
well at clause-level markup and generates very promising results at
character-level markup. The good thing about the algorithm is that no
training examples are needed: it works directly from raw text, and no
dictionaries etc. are needed either. The algorithm has been tested on Flora
of North America, Flora of China, and portions of the Treatise on
Invertebrate Paleontology. We have also started to develop some algorithms
that compare one semantically marked-up document with another.

We'd like to hear from others working on similar problems and also welcome
your comments.

Hong Cui
Assistant Professor, Information Technologies
School of Information Resources and Library Science
University of Arizona

On Tue, Feb 17, 2009 at 3:31 AM, <faunaplan at aol.com> wrote:

> Dear all,
> when following the interesting TAXACOM thread on 'species pages' I was
> waiting for someone to point to the enormous current problems in
> retrieving relevant information on the internet. I didn't see such postings
> so here are some of my thoughts (&dreams - I'm not IT-savvy enough, if at
> all...), on that issue:
>
> Most descriptions are published along with nomenclatural content which must
> be printed on paper since the internet is not (yet) accepted as a
> Code-compliant publication medium. Not sure it will be accepted in the near
> future, but even if the change comes, the printed information will continue
> to play a major role. Therefore, scanning, digitization and markup of
> literature seem to be what is needed in the very first place. We have tools
> for extracting at least some taxonomic names from PDFs, but are we really
> able to work with that and combine it with info from other web resources in
> an "industrialized" manner?
> Probably it is GBIF that could play a pioneer role in developing new
> strategies simply because it's in the nature of the project that it has to
> cope with more unresolved misspellings, synonyms, homonyms, in litteris
> names, etc. than any other major web project (just take a look into GBIF's
> "kingdom unknown" basket!). I believe we urgently do need what David Remsen
> has alluded to - an Electronic Catalogue of Names that can pick out at least
> available names from the salad.
>
> As for animal names, a first step could be an official ZooBank list of all
> principally available nomina, knowing that these are not simply extractable
> from printed and electronic resources (e.g., the available nomen
> <Cicindela trisignata> was published by the name string "C.Tri-Signata"; the
> printed string "N. Kratteri valonensis" is available as <Nebria valonensis>
> even if it has never been used in that form; etc.).
>
> Then, what can we do with available nomina strings? Are they not useful for
> assigning unique identifiers to ascertained names?
> For example, how about adding more information to LSIDs? In my imagination,
> a vetted (ascertained) name could be recognized as such by a human- &
> computer-readable LSID-substring carrying information on its ascertained
> nomenclatural basis (and, in case of misspellings etc., pointing to its
> correct name basis). The benefits would be, in my mind, that vetted/
> unvetted names could be distinguished easily and involvement of taxon
> experts would be facilitated (& attracted) very much better than with those
> *cryptic* LSIDs that are in use so far. Once a name is ascertained, you can
> see it and the computer can know it by the composition of its LSID;
> unvetted names would stay with computer-readable numbers only.
>
> E.g.,
> could an LSID for an ascertained taxonomic name in a GBIF occurrence record
> be something like:
> <urn:lsid:gbif.org:ecat:ZS-Carabus_depressus=Licinus_depressus.19020147> ?
>
> ... building upon a basic LSID nomen substring issued by ZooBank:
> <urn:lsid:iczn.org:zoobank:ZS-Carabus_depressus=Licinus_depressus> ?
>
> The human- & computer(?)-readable substrings in that imaginary example
> would be:
> "Z" = animal name governed by ICZN
> "S" = name of species-group ("G" for genus-, "F" for family-group)
> "Carabus depressus" original generic combination (linking to type material)
> "=Licinus depressus" subsequent generic combination (up to here all
> metadata incl. author, date, name history, etc. would be resolvable via
> ZooBank)
> "19020147" unique GBIF ID for a name associated with a dataset (where the
> name could be a misspelling that should not be deleted from the record if it
> is a verbatim citation e.g. from a specimen label).
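
To make the layout proposed in the quoted message concrete, here is a minimal
Python sketch that splits such an identifier into its substrings. It is only an
illustration under stated assumptions, not an existing GBIF or ZooBank format:
the function name parse_vetted_lsid, the VettedName record, the regular
expression, and the rule that an object part without the two-letter prefix
counts as "unvetted" are all hypothetical choices made here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Letter codes taken from the quoted proposal; only "Z" is defined there.
CODES = {"Z": "ICZN"}
RANK_GROUPS = {"S": "species-group", "G": "genus-group", "F": "family-group"}

# Hypothetical grammar for the object part of the LSID:
#   <code><rank>-<original>[=<subsequent>][.<numeric record id>]
OBJECT_RE = re.compile(
    r"^(?P<code>[A-Z])(?P<rank>[A-Z])-"
    r"(?P<original>[^=.]+)"
    r"(?:=(?P<subsequent>[^=.]+))?"
    r"(?:\.(?P<record_id>\d+))?$"
)


@dataclass
class VettedName:
    authority: str
    namespace: str
    code: str                  # governing Code, e.g. "ICZN"
    rank_group: str            # e.g. "species-group"
    original: str              # original generic combination
    subsequent: Optional[str]  # subsequent generic combination, if any
    record_id: Optional[str]   # dataset-specific ID, if any


def parse_vetted_lsid(lsid: str) -> Optional[VettedName]:
    """Return the parsed name, or None if the LSID carries no vetted substring."""
    parts = lsid.split(":")
    if len(parts) != 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        return None
    authority, namespace, obj = parts[2], parts[3], parts[4]
    m = OBJECT_RE.match(obj)
    if m is None or m["code"] not in CODES or m["rank"] not in RANK_GROUPS:
        return None  # e.g. a purely numeric ("unvetted") identifier
    subsequent = m["subsequent"]
    return VettedName(
        authority=authority,
        namespace=namespace,
        code=CODES[m["code"]],
        rank_group=RANK_GROUPS[m["rank"]],
        original=m["original"].replace("_", " "),
        subsequent=subsequent.replace("_", " ") if subsequent else None,
        record_id=m["record_id"],
    )


if __name__ == "__main__":
    example = ("urn:lsid:gbif.org:ecat:"
               "ZS-Carabus_depressus=Licinus_depressus.19020147")
    print(parse_vetted_lsid(example))
```

Under these assumptions, a number-only identifier (the authority and namespace
here are just reused from the example above) returns None, which is one way a
program could tell vetted from unvetted names by the composition of the LSID
alone.
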
>
> Not thinkable? I know the devil is always in the details..., but where
> would IT savvy taxacomers say 'principally impossible'?
> In my mind, one of THE challenges in biodiversity informatics - is how to
> distinguish unfinished (manuscript) and finished content (reviewed reliable
> publication). As for the issue of taxonomic names, maybe a difference in the
> LSID could be part of the solution?
>
> Cheers,
> Wolfgang
> ---------------------------------
> Wolfgang Lorenz, Tutzing, Germany