[Taxacom] was contamination
Jason Mate
jfmate at hotmail.com
Tue Apr 5 17:21:58 CDT 2011
Apologies for reposting.
>>The location of a nucleotide is the character and the state is the
>>particular nucleotide you find there. Similarly with proteins and
>>practically any character you can think of (segment 3 is defined in
>>relation to segments 2 and 4, unless you have developmental
>>information which is quite rare and even then the homology is
>>relational).
>Ummm...not quite. The character state of a nucleotide is free to
>change back and forth between states repeatedly (even in a conserved
>coding region this is true for roughly 1/3rd of all the nucleotides),
>in precisely such a way as to utterly confound attempts to establish
>homology.
Two aspects appear to be mixed up here. On the one hand, can we say that a certain position in a string of
base pairs in a gene of one taxon is homologous to a position in that of another? I would say yes, since the string of base pairs
encodes a complex structure that can be referred to if needed (a protein or an rDNA). Similarly the bits that encode this text can
The other aspect is practical and has to do with our ability to determine homology. This depends, in great part,
on comparing the right things to begin with (no multiple-copy genes, pseudogenes, or contamination). But this
is not enough; we must also make our comparisons at the right scale. Choosing the right gene or genes for the level of
divergence is far more important (and effective) than applying a complex model of evolution. Similarly, a
morphologist would have a tough time comparing the fin of a sardine with that of a whale!
>This can be true even within a species - and we all know
>how useless polymorphic or homoplastic characters tend to be in a
>morphological matrix. Most morphologists intentionally exclude
>characters that are too variable, and/or suspected to have high
>likelihood of homoplasy, whereas nucleotide states are accepted
>despite such shortcomings.
I have two issues with this approach. One is bias and the other is the definition of homoplasy.
Bias can express itself in the sampling of the characters as well as in the definition of the states.
The former comes from a certain hubris, a complacency born of familiarity with the organisms,
which can lead the researcher astray. The latter is practical in nature: it reflects
each individual's ability to "sense" (colours, shapes, etc.). As for homoplasy and its definition,
one man's homoplasy is another's homology. Again, the level at which you study the characters will
determine whether they are homoplasious or not, and this only becomes apparent in hindsight. So eliminating characters
beforehand risks not only losing data but also channelling the research in a particular direction (subconsciously or not).
>In fact, one cannot use nucleotide states
>in one's analyses unless one includes an additional level of modeling
>designed to account for the differential probabilities of state
>changes between different pairs of bases (on top of which there are
>other confounding meta-properties such as "A-T rich" genomes). I can
>think of lots of characters that do NOT require secondary models of
>transformation probabilities in order to be informative, and - as
>such - nucleotides are demonstrably a very different type of
>character.
It is advisable to employ models in some cases (highly divergent genes and/or taxa,
long branches, etc.), but to say that "you cannot use nucleotides..." is simply not true.
Nevertheless, the reason we can use models with molecular data has more to do with our better
understanding of molecular characters and with their simplicity. This simplicity does not imply,
as has been suggested in this forum, a lack of robustness, but rather an equivalence between
genes (can we compare the value of a hand and a foot as easily?) that makes them
easier to model. At any rate, many morphological characters behave similarly and can be
studied in this statistical manner: off the top of my head, continuous characters, morphometric
data, etc. Similarly, certain molecular characters behave like, and can be studied like, morphological
ones (e.g. a protein's form or rDNA secondary structure). Therefore the distinction between
"morphological" and "molecular" is hardly useful when it comes to analysing the data.
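
To make the point about models concrete, here is a minimal sketch in Python. It is only an illustration, not a method anyone in this thread has proposed: the function names are made up, and the choice of the Jukes-Cantor (JC69) and Kimura two-parameter (K80) corrections is an assumption made for the example. It compares two aligned sequences with no model at all (the raw proportion of differing sites), with a one-rate model, and with a model that treats transitions and transversions as different kinds of change.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of transitions and transversions.

    Only sites where both sequences carry A, C, G or T are counted;
    gaps and ambiguity codes are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def p_distance(seq_a, seq_b):
    """Uncorrected distance: the proportion of differing sites, no model."""
    P, Q = substitution_proportions(seq_a, seq_b)
    return P + Q


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor correction: one rate for every kind of change.

    math.log raises ValueError once p reaches 0.75 (saturation).
    """
    p = p_distance(seq_a, seq_b)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(seq_a, seq_b):
    """Kimura two-parameter correction: transitions and transversions differ."""
    P, Q = substitution_proportions(seq_a, seq_b)
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


if __name__ == "__main__":
    a = "ACGTACGTAC"
    b = "ACGTACGTAT"  # a single C->T transition at the last site
    print(p_distance(a, b), jc69_distance(a, b), k80_distance(a, b))
```

For closely related sequences the three numbers stay close to one another, and the corrections only start to matter as divergence grows and repeated changes at the same site become likely.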
>Myself, I tend to have greater faith in trees built by human beings
>who subjectively choose to ignore certain characters, than I do in
>algorithms that are objective and try to work with every single
>possible character and hope to find the signal amidst the noise -
>though, naturally, there are some human beings that are good at
>taxonomy, and some that are terrible.
This is a different topic altogether, and it affects any data. The underlying issue is control.
Human beings tend to be quite irrational about control, or the lack of it, and will strongly
believe that their abilities are somehow superior to those of "machines". As long as the methods
and results are verifiable, it does not matter.