[Taxacom] was contamination

Tue Apr 5 17:21:58 CDT 2011

Apologies for reposting. 

>The location of a nucleotide is the character and the state is the 
>particular nucleotide you find there. Similarly with proteins and 
>practically any character you can think of (segment 3 is defined in 
>relation to segments 2 and 4, unless you have developmental 
>information which is quite rare and even then the homology is 
>relational).

Ummm...not quite. The character state of a nucleotide is free to 
change back and forth between states repeatedly (even in a conserved 
coding region this is true for roughly 1/3rd of all the nucleotides), 
in precisely such a way as to utterly confound attempts to establish 
homology. 

There are two aspects here which appear to be mixed up. On the one hand, can we say that a certain position in a string of 
base pairs in one gene of a taxon is homologous to that of another? I should say yes, since the string of base pairs
encodes a complex structure that can be referred to if needed (a protein or an rDNA). Similarly the bits that encode this text can 

The other aspect is practical and has to do with our ability to determine homology. This depends, in great part, 
on comparing the right things to begin with (no multiple-copy genes, pseudogenes, or contamination). But this 
is not enough; we must also make our comparisons at the right scale. Choosing the right gene(s) for the level of 
divergence is far more important (and effective) than applying a complex model of evolution. Similarly, a 
morphologist would have a tough time comparing the fin of a sardine and that of a whale!

This can be true even within a species - and we all know 
how useless polymorphic or homoplastic characters tend to be in a 
morphological matrix. Most morphologists intentionally exclude 
characters that are too variable, and/or suspected to have high 
likelihood of homoplasy, whereas nucleotide states are accepted 
despite such shortcomings. 

I have two issues with this approach. One is bias and the other is the definition of homoplasy. 
Bias can express itself in the sampling of the characters as well as in the definition of the states. 
The former comes from a certain hubris, a result of complacency born of familiarity with the organisms, 
which can lead the researcher astray. The latter is practical in nature: it is the manifestation 
of each individual's ability to "sense" (colours, shapes, etc.). With regard to homoplasy and its definition, 
one man's homoplasy is another's homology. Again, the level at which you study the characters will 
determine whether they are homoplasious or not, and this becomes apparent in hindsight. So eliminating characters 
beforehand risks not only losing data but also channelling the research in a particular direction (subconsciously or not).

In fact, one cannot use nucleotide states 
in one's analyses unless one includes an additional level of modeling 
designed to account for the differential probabilities of state 
changes between different pairs of bases (on top of which there are 
other confounding meta-properties such as "A-T rich" genomes). I can 
think of lots of characters that do NOT require secondary models of 
transformation probabilities in order to be informative, and - as 
such - nucleotides are demonstrably a very different type of 
character.

It is advisable to employ models in some cases (highly divergent genes and/or taxa, 
long branches, etc.), but to say that "you cannot use nucleotides..." is simply not true. 
Nevertheless, the reason we can use models with molecular data has more to do with our better 
understanding of molecular characters and their simplicity. This simplicity does not imply, 
as has been suggested in this forum, a lack of robustness, but rather an equivalence between 
genes (can we compare the value of a hand and a foot as easily?), which lends them more 
readily to modelling. At any rate, many morphological characters behave similarly and can be 
studied in this statistical manner: off the top of my head, continuous characters, morphometric 
data, etc. Similarly, certain molecular characters behave, and can be studied, like morphological 
ones (e.g. a protein's form or rDNA secondary structure). Therefore the distinction between 
"morphological" and "molecular" is hardly useful when it comes to analysing the data.
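
For anyone who wants to see what such a model amounts to in practice, here is a minimal sketch of three standard pairwise distances: the raw proportion of differing sites (no model), the Jukes-Cantor correction (all changes equally likely) and the Kimura two-parameter correction (transitions and transversions treated separately). The function names and the toy sequences are my own, chosen only for illustration, and the sequences are assumed to be already aligned.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def site_proportions(seq1, seq2):
    """Return (P, Q): proportions of transitions and transversions.

    Sites with gaps or ambiguity codes in either sequence are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in VALID or b not in VALID:
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared


def p_distance(seq1, seq2):
    """Uncorrected proportion of differing sites (no model at all)."""
    P, Q = site_proportions(seq1, seq2)
    return P + Q


def jc69_distance(seq1, seq2):
    """Jukes-Cantor distance: corrects for repeated changes at a site."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("sequences too divergent (saturated)")
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance: separate transition/transversion rates."""
    P, Q = site_proportions(seq1, seq2)
    t1 = 1 - 2 * P - Q
    t2 = 1 - 2 * Q
    if t1 <= 0 or t2 <= 0:
        raise ValueError("sequences too divergent (saturated)")
    return -0.5 * math.log(t1) - 0.25 * math.log(t2)


if __name__ == "__main__":
    # Toy aligned sequences (invented): one transition, one transversion.
    s1 = "ACGTACGTACGTACGTACGT"
    s2 = "GCGTACGTACTTACGTACGT"
    print(p_distance(s1, s2), jc69_distance(s1, s2), k2p_distance(s1, s2))
```

With little divergence the three numbers are nearly the same, which is the point about scale made earlier: the correction only starts to matter as the sequences approach saturation.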

Myself, I tend to have greater faith in trees built by human beings 
who subjectively choose to ignore certain characters, than I do in 
algorithms that are objective and try to work with every single 
possible character and hope to find the signal amidst the noise - 
though, naturally, there are some human beings that are good at 
taxonomy, and some that are terrible. 

This is a different topic altogether, and it affects any kind of data. Ultimately the issue is control. 
Human beings tend to be quite irrational about control, or the lack of it, and will strongly 
believe that their abilities are somehow superior to those of "machines". As long as the methods 
and results are verifiable, it matters not.