Character states 'present by misinterpretation'

Mike Dallwitz M.J.Dallwitz at NETSPEED.COM.AU
Mon Mar 20 14:46:33 CST 2006


Alessandra Baptista wrote (Re: [TAXACOM] interactive keys):

> A piece of advice I have for the ones who want to build their own
> interactive keys: even if you use LUCID to publish your key, build your
> database in DELTA, and only export it to LUCID when it is time to publish
> the key. Delta allows for better control of the database, and helps you to
> avoid stupid mistakes and build a more consistent database.

Kevin Thiele wrote (Re: [TAXACOM] interactive keys):

> There is one problem with the DELTA data format (with respect to interactive
> identification) which works against Alessandra's solution. With respect to
> IntKey, DELTA only allows you to record whether a character state is present
> or absent in a taxon (or uncertain for the whole character). Lucid is much
> richer in this respect (it allows a character state to be scored
> present/absent/rare/uncertain/not scored/present-by-misinterpretation). The
> last of these is very important - if you have a state that is actually
> absent in a taxon but a user may misinterpret it as present (or vice versa),
> then using DELTA you have only two choices - code for the strict truth (in
> which case most users will go astray here) or encode false data (clearly a
> bad idea). Think of a key that includes Euphorbia and has characters like
> petals:present/absent flowers:bisexual/unisexual. This circumstance (when
> you want to pre-empt a likely user mistake but maintain the integrity of the
> data while doing it) is very common in keys, and is pretty much the only
> reason why I advise people against using DELTA to build interactive keys.
> Many DELTA data sets have false data recorded in them because of this
> problem, and in my experience these false data are virtually unrecoverable.

I agree with Kevin that the 'present by misinterpretation' coding is very
important for data integrity (truth), while allowing for user errors in
identification. It was proposed, as part of a more general scheme, by the
DELTA team well before the advent of Lucid (see 'Dallwitz, M.J., Paine, T.A.
and Zurcher, E.J. 1993 onwards. Proposed new features for the DELTA system.
http://delta-intkey.com/www/proposal.htm', under 'Coded ‘comments’
(subsidiary information)'). Unfortunately, because of other priorities, it
was not implemented before the DELTA project was terminated.

Furthermore, the distinction between true and misinterpreted values is
important even _within_ the context of identification. In the Lucid Player,
users who are confident of not making errors in interpretation can instruct
the program to ignore values marked 'present by misinterpretation', and
this usually results in shorter paths to identifications. Therefore,
this feature (under the name 'Special values for keys') is given a high
weight in my 'Comparison of Interactive Identification Programs'
(http://delta-intkey.com/www/comparison.htm). Although Intkey doesn't have
this feature, it has other features that are important for identification
but are lacking in Lucid - for example, character reliabilities, diagnostic
descriptions, and finding the best characters for separating a given taxon
from the rest.

Although the current DELTA format doesn't have a special 'by
misinterpretation' coding, the information can be incorporated as a
free-text comment. In fact, the Lucid Translator supplied by the Lucid team
does this when translating from Lucid to DELTA format. Unfortunately, the
reverse translation from DELTA to Lucid ignores this comment. Perhaps this
could be remedied if anyone writes a DELTA/SDD translator.

As a stopgap, I've written a program which strips out values attached to a
<by misinterpretation ...> comment, so that DELTA users can pass the 'true'
data to applications that require it, such as classification programs or
(perhaps) natural-language generation. See
     http://delta-intkey.com/test/dismis.txt
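
The following is a minimal sketch of the idea only, not the dismis program
itself. It assumes a single DELTA attribute of the form
'number,value/value...', with alternative values separated by '/' and the
<by misinterpretation ...> comment written directly after the value it
qualifies, and it assumes that comments contain no '/'. The function name
strip_misinterpreted is hypothetical.

```python
import re
from typing import Optional

# Matches a comment such as <by misinterpretation> or
# <by misinterpretation of bracts>. The exact wording is an assumption.
_MISINTERPRETED = re.compile(r"<\s*by misinterpretation[^<>]*>", re.IGNORECASE)


def strip_misinterpreted(attribute: str) -> Optional[str]:
    """Return the attribute with misinterpreted values removed.

    Returns None when every value was marked 'by misinterpretation',
    meaning the attribute should be dropped from the 'true' data.
    """
    number, sep, values = attribute.partition(",")
    if not sep:
        # No value list (for example, a bare character number): leave as is.
        return attribute
    kept = [v for v in values.split("/") if not _MISINTERPRETED.search(v)]
    if not kept:
        return None
    return number + "," + "/".join(kept)
```

A real converter would also have to handle ranges, '&' combinations, and
comments that themselves contain separators, which this sketch ignores.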


DELTA and Intkey have many features that help authors produce accurate and
robust data. The particular one that Alessandra may have in mind is the
checking of character dependencies, which the Lucid Builder doesn't do. For
example, the Lucid Builder would allow you to record 'petals absent' and
'petals red' for the same taxon. This is a serious problem. A medium-sized
dataset (a few
hundred characters and taxa) typically contains hundreds of dependency
errors if produced without checking of dependencies. This can happen even in
DELTA if dependencies are not actually specified by the author. Bitter
experience led us to incorporate a warning when dependencies are not
specified in datasets containing more than 20 characters.

The 'present by misinterpretation' coding makes dependency checking slightly
more difficult. The dependencies effectively have to be checked twice: once
excluding all 'present by misinterpretation' codings, and once including
them all. Data would need to be recorded like this:
     petals absent; or present (by misinterpretation)
     petals red (by misinterpretation)
Recording simply 'petals red' would be an error, which should be detected by
the program.
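
The two-pass check described above can be sketched as follows. This is an
illustration under stated assumptions, not code from DELTA, Intkey, or
Lucid: the Value class, the DEPENDENCIES table, and the function names are
all hypothetical, and a dependent character is treated as inapplicable when
every recorded state of its controlling character is a controlling state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    state: str
    by_misinterpretation: bool = False


# Hypothetical dependency table:
# (controlling character, states making dependents inapplicable, dependents)
DEPENDENCIES = [("petals", {"absent"}, ["petal colour"])]


def violations(item, include_misinterpreted):
    """Check dependencies for one pass over a taxon's recorded values."""

    def states(character):
        return {
            v.state
            for v in item.get(character, [])
            if include_misinterpreted or not v.by_misinterpretation
        }

    errors = []
    for controller, controlling_states, dependents in DEPENDENCIES:
        recorded = states(controller)
        # Inapplicable only if all recorded states are controlling states.
        inapplicable = bool(recorded) and recorded <= controlling_states
        if not inapplicable:
            continue
        for dependent in dependents:
            if states(dependent):
                pass_name = "all values" if include_misinterpreted else "true values"
                errors.append(f"{dependent} recorded but inapplicable ({pass_name})")
    return errors


def check_item(item):
    """Run both passes: true values only, then all values."""
    return violations(item, False) + violations(item, True)
```

With the data shown above (petals absent, or present by misinterpretation;
petals red by misinterpretation), both passes succeed. If 'petals red' is
recorded without the marker, the first pass reports it, because the true
data then say both 'petals absent' and 'petals red'.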

--
Mike Dallwitz
Contact information: http://delta-intkey.com/contact/dallwitz.htm
DELTA home page: http://delta-intkey.com



