Characters; interactive identification (x Delta-l)

Wed Mar 29 14:06:28 CST 1995

                                                                 29 March 1995

> From: Renaud Fortuner <fortuner at MATH.U-BORDEAUX.FR>
> To: Taxacom
>
> Should we record all the terms that are attached to a particular character?
> ... Yes, absolutely. ... It is obvious that we are going to find new states.
> How do I know which one to record? Easy, I record them all. ... It is not
> necessary to redefine a character when you are just adding a new state
> (green, yellow, mauve). Changing `color' to `hue' and `saturation'; or
> `whether uniformly colored', `background (or only) color', `secondary color'
> is something else. This is no longer adding a state but modifying the
> character itself.

You can't do good taxonomy unless you get away from the idea that a character
is just a list of terms. In the words of my collaborator, Leslie Watson, `The
taxonomic wisdom [embodied in the character list] will define the standard of
the rest of the work' (DELTA Primer).

You can't `find' states; they are concepts which have to be thought up. If you
are trying to get data from the literature, you will have to accept the
thoughts of others, at least in part, but these thoughts need to be critically
evaluated and reconciled.

Adding a state DOES redefine or modify the character.

We have been using a very simple example here (flower colour with only a few
states), and I, at least, intended it only to illustrate the principles
involved. I wouldn't object to adding `pink' to `white, red, blue' (although
information is lost unless the program `knows' that pink is intermediate
between white and red). But what if the descriptors that have been used are
phrases? Are we going to call each one with a slightly different word order,
qualification, or terminology a different state?

The use of the word `pink' may have been a bad example, as I think I recall
reading that this occurs as a named `species' of color in many languages,
which presumably reflects our perception of its distinctness. The folly of
thoughtlessly adding each new descriptor as a `state' is more readily apparent
if we consider the term `light blue': not only does the other state have to be
changed to `dark blue' (to make the states mutually exclusive), but clearly
all the taxa that have been coded `blue' need to be re-examined. Having done
that, what if we later come across the term `azure'?

Anyway, a program cannot prevent people from doing bad taxonomy, but it CAN
prevent them from doing good taxonomy. That is why I said in my previous
posting that a program must be CAPABLE of using existing character states
plus a qualifying comment.
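
A minimal sketch of what `existing state plus a qualifying comment' might look
like as data. The `Attribute` class and its `render` method are hypothetical
names invented for illustration, and the `char,state<comment>` rendering is
only assumed to resemble DELTA attribute notation; check the DELTA standard
for the exact syntax. The character numbering follows the `flowers' example
quoted later in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """One coded character for one taxon (hypothetical structure)."""
    character: int
    state: int
    comment: str = ""  # original descriptor, kept but not treated as a state

    def render(self):
        """Format as 'char,state<comment>' (DELTA-like; an assumption)."""
        note = f"<{self.comment}>" if self.comment else ""
        return f"{self.character},{self.state}{note}"


# Character 6 'flowers': 1 white, 2 blue, 3 red.
records = [
    Attribute(6, 2),
    Attribute(6, 2, "light blue"),
    Attribute(6, 2, "azure"),
]

print([r.render() for r in records])
# Every record still matches state 2 when the character is used in a key.
print({r.state for r in records})
```

The descriptors `light blue' and `azure' are preserved for the reader, but the
set of states stays at three, so no previously coded taxon has to be
re-examined.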

> Will [recording every term] lead to a separate state for every taxon?
> ... No, it doesn't (I have checked).

I didn't say it would; I said `taken to an extreme, it could' (and I've seen
it happen). A character becomes less generally useful as it approaches this
extreme.

> We seem to agree about the error tolerance: this is for controlling the
> remaining taxa, it doesn't give "graceful" degradation. ... another
> difference between the **tomous keys which give THE answer and a similarity
> program which takes all the available evidence and offers a list of possible
> answers for the user to ponder.

I think we agree in that our programs allow essentially the same mode of
operation, but we disagree in that you don't realise this. I presume your
program would allow something like this:
    After entering information for a few characters, you find that there are a
    few taxa whose similarities to the specimen are in the range 90-100%. You
    browse through illustrations and/or descriptions of these. If necessary,
    you extend your browsing to some lower level of similarity.
INTKEY allows this too. Only the terminology is different: `number of
differences' instead of `% similarity'.

> Number of differences used by INTKEY as a dissimilarity measure: why not
> divide this number by the number of characters used? Surely, 2 differences
> when you use 2 characters is not the same as when you use 20? Also, why use
> only 0 and 1? There is a lot of room between "identical" and "completely
> different".

For classification, yes; for identification, no.

Firstly, a minor point of presentation. If you had used six characters in an
identification, would you prefer to be told that the specimen was `83%
similar' to the taxon, or had `1 difference' from the taxon?

Let's take as an example three taxa described as:
    #A/ 1,1 2,1 3,1 4,1/2 5,1/2
    #B/ 1,1 2,1 3,2 4,1/2 5,1/2
    #C/ 1,1 2,1 3,1 4,1   5,1
(I'm using DELTA notation for brevity: taxon A has state 1 of characters 1-3
and states 1 or 2 of characters 4 and 5, etc.) With a typical phenetic
distance measure, taxa B and C are both distant .2 from taxon A. But if B and
C are not taxa, but specimens, do we want to be told that both have 1
difference from taxon A (or the equivalent expressed as a similarity)? If
there is no variation within individual specimens, C is as close to A as it is
possible to get, and with the settings recommended for identification, INTKEY
would say that there are no differences between C and A.
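
The arithmetic in this example can be checked with a short sketch. The
dictionary representation and both function names are assumptions made for
illustration. The `phenetic' measure shown (mean per-character Jaccard
distance) is just one measure that yields the .2 figure; it is not claimed to
be the one any particular program uses. The `differences` function counts a
character only when specimen and taxon share no state, which is the behaviour
described above for identification.

```python
# Hypothetical representation: character number -> set of recorded states.
A = {1: {1}, 2: {1}, 3: {1}, 4: {1, 2}, 5: {1, 2}}
B = {1: {1}, 2: {1}, 3: {2}, 4: {1, 2}, 5: {1, 2}}
C = {1: {1}, 2: {1}, 3: {1}, 4: {1}, 5: {1}}


def phenetic_distance(x, y):
    """Mean per-character Jaccard distance (one possible phenetic measure)."""
    shared = sorted(x.keys() & y.keys())
    total = sum(1 - len(x[c] & y[c]) / len(x[c] | y[c]) for c in shared)
    return total / len(shared)


def differences(specimen, taxon):
    """Count characters where specimen and taxon share no state at all."""
    return sum(
        1
        for c in specimen.keys() & taxon.keys()
        if not (specimen[c] & taxon[c])
    )


print(phenetic_distance(A, B), phenetic_distance(A, C))  # both 0.2
print(differences(B, A), differences(C, A))              # 1 and 0
```

Under the first measure B and C are equally far from A; under the second,
specimen C is a perfect match for taxon A and only B shows a difference.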

> How do I calculate the "best" character? That depends on what you mean by
> "best": most reliable, most discriminating, easiest to record, or what?

Well, what do YOU mean by it? Does your program give the user any guidance on
what character to use next? INTKEY can sort the available characters on a
criterion involving the discriminating power and a subjective
ease-of-use/reliability rating, usually (but not necessarily) assigned by the
author of the data. To calculate a discriminating power, you need some
indication of the set of taxa that need to be discriminated.
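
As a rough sketch of such a criterion: the scoring rule below (number of taxon
pairs a character separates, multiplied by a reliability weight) is an
assumption chosen for simplicity, not INTKEY's actual formula, and the
function names and data are hypothetical.

```python
from itertools import combinations


def separating_power(char, taxa):
    """Number of taxon pairs that share no state of this character."""
    return sum(
        1
        for a, b in combinations(taxa.values(), 2)
        if not (a[char] & b[char])
    )


def rank_characters(taxa, reliability):
    """Sort characters by power weighted by reliability, best first."""
    chars = sorted(reliability)
    score = {c: separating_power(c, taxa) * reliability[c] for c in chars}
    return sorted(chars, key=lambda c: score[c], reverse=True)


# Hypothetical remaining taxa and author-assigned reliabilities.
remaining = {
    "A": {1: {1}, 2: {1}, 3: {1}},
    "B": {1: {1}, 2: {2}, 3: {2}},
    "C": {1: {1}, 2: {3}, 3: {1, 2}},
}
reliability = {1: 5.0, 2: 1.0, 3: 4.0}

print(rank_characters(remaining, reliability))  # [3, 2, 1]
```

Character 1 is rated most reliable but separates none of the remaining taxa,
so it ranks last. The ranking changes whenever the set of remaining taxa
changes, which is why that set has to be specified.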

> I fail to see how entering SET RELIABILITY COLOR,0 is going to affect the
> character:
>     #6. flowers/
>          1. white/
>          2. blue/
>          3. red/
> In other words, how is the computer going to know that this character is
> describing a color?

I must admit that I seem to have underestimated the capabilities of your
program here. INTKEY needs to be told which are the `color' characters. Does
your program infer this from the character description alone?
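
One way of `telling' a program which characters are colour characters is a
keyword table supplied with the data. The sketch below assumes such a table;
the names `keywords` and `set_reliability` are hypothetical and do not
describe how INTKEY is implemented.

```python
# Hypothetical keyword table supplied by the data author; nothing is inferred
# from the wording of the character itself.
keywords = {"color": {6}, "habit": {1, 2}}
reliability = {1: 5.0, 2: 5.0, 6: 5.0}


def set_reliability(reliability, keywords, keyword, value):
    """Return a copy of the reliabilities with one keyword's characters reset."""
    updated = dict(reliability)
    for char in keywords[keyword.lower()]:
        updated[char] = float(value)
    return updated


# Rough analogue of SET RELIABILITY COLOR,0
print(set_reliability(reliability, keywords, "COLOR", 0))
```

Character 6 is affected only because the table lists it under `color', not
because its states happen to be `white', `blue' and `red'.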

There may be no point in prolonging this discussion unless we get some input
from other people.

Mike Dallwitz                                  Internet md at ento.csiro.au
CSIRO Division of Entomology                   Fax +61 6 246 4000
GPO Box 1700, Canberra ACT 2601, Australia     Phone +61 6 246 4075