[Taxacom] Character weighting

Wed Aug 19 16:13:18 CDT 2009

>Pierre has a point. It is the analysis that is phenetic, not the
traits. Okay, maybe I really meant analysis of molecular characters has
much in common with phenetic analysis, particularly in equal weighting
(excepting codon bias) to ensure no evolutionary contamination of the
automatic classification.

> In one way, molecular is indeed phenetic. There is no weighting for phyletic importance. Well, there is one case, codon bias, in which selection on a pool of messenger RNA emphasizes one synonymous codon over another (if I have this right), but all other weighting (I think) is purely part of the analysis, e.g. avoiding 3rd codon positions because they may be over-saturated with changes. Basically the Dirichlet priors are all 1 in Bayesian analysis. In some cases certain site positions are weighted differently but I'm not sure how this is part of pre-weighting for phyletic importance.
> 
> (Now ask me what phyletic importance is.)

I'll ask you if you tell me how you arrive at it. Thanks to Pierre for succinctly putting an end to the "molecular data is phenetic" mantra. I will get into hot water with many, but how do you weight your characters objectively? I can see objectivity in weighting based on codon position, on transitions vs transversions, or according to the amino acid encoded. I can also see the objectivity of character reweighting according to some measure of internal congruence (the idea being that you encourage the signal; of course, doing this between datasets encourages the selection of an "average" phylogeny, which might not be the species tree). But even these can give "funny" results. In the case of extreme weighting (i.e. effectively 0 weight to 3rd codon positions) you can lose all resolution.
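To make the weighting point concrete, here is a minimal sketch in Python of a weighted count of differences between two aligned coding sequences. The function names, the weights and the two toy sequences are my own assumptions for illustration, not taken from any phylogenetics package, and it assumes the alignment starts in reading frame.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True if a and b are both purines or both pyrimidines."""
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES


def weighted_differences(seq1, seq2, codon_weights=(1.0, 1.0, 1.0),
                         ts_weight=1.0, tv_weight=1.0):
    """Sum weighted differences between two aligned, in-frame sequences.

    codon_weights: hypothetical weights for 1st, 2nd and 3rd codon positions.
    ts_weight / tv_weight: hypothetical weights for transitions / transversions.
    Columns containing anything other than A, C, G, T are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        change = ts_weight if is_transition(a, b) else tv_weight
        total += codon_weights[i % 3] * change
    return total


# Toy sequences that differ only at third codon positions.
seq_a = "GCTGCCAAA"
seq_b = "GCAGCTAAG"

equal = weighted_differences(seq_a, seq_b)
tv_heavy = weighted_differences(seq_a, seq_b, tv_weight=2.0)
no_third = weighted_differences(seq_a, seq_b, codon_weights=(1.0, 1.0, 0.0))
print(equal, tv_heavy, no_third)
```

With equal weights the two toy sequences differ by three changes; giving zero weight to third positions makes them identical, which is the loss of resolution described above.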

>Richard wrote: Hope no one thought I agreed that differential lineage sorting meant
that ancestors were so undifferentiated molecularly that molecular
analysis was impossible if there was any evidence of different gene
histories. No, no. I tend to subscribe to the common understanding that
there is one species history, and that during speciation or at least
isolation of two lineages from one, a process goes on that eliminates
all but one of the ancestral polymorphisms in each of the new lineages,
a process called reciprocal monophyly. New sets of polymorphisms are
created, but these can be distinguished as derivative.
....
>Right now I repeat my take on this, Grehan is right for wrong reasons.
Molecular history shared by a vast majority of gene sequences is
correct but the traits shared by man and orang were salted away
epigenetically in the ancestors of chimp and gorilla. Is there evidence
for this? There are a lot of cases in the literature that major traits
and trait complexes are conserved for hundreds of thousands of years,
then apparently reactivated. I can give examples if anyone wants to
investigate. Morphology is the interface with selection, and we cannot
ignore it.

>Michael wrote: In fact retention of ancestral polymorphism is very common (there are
more than 12 000 hits for 'incomplete lineage sorting' on Google) and
'It is now well known that incomplete lineage sorting can cause serious
difficulties for phylogenetic inference' (Maddison & Knowles, 2006.
Syst Biol. 55, 21-30).

I would argue that the problems with polymorphic lineages can often be overcome with wider sampling (more species) or with several exemplars of each species when dealing with closely related taxa. In any case, polymorphism is only really a problem when dealing with sister species/subspecies. Extinction and drift will erase the information very quickly, as Michael pointed out.

>If you have indels, which are very common in most of the widely used,
longer sequences, two issues arise wrt identifying characters. The
first is how you align sequences from different sources, because
different multiple sequence alignment procedures (whether carried out
first, or as part of direct optimisation) can give you different
positional homologies. [It was interesting to see in that staphylinid
paper recommended by Stephen Thorpe that the authors did separate
analyses based on Clustal and MAFFT alignments. AFAIK this kind of
catholic approach to alignment is rare. Most labs seem to pick their
MSA method and stick with it.]

>The second issue is how you treat gaps in your analysis after
alignment. You can ignore them entirely, and this amounts to character
weighting because an indel is an evolutionary novelty. Alternatively,
you can treat a gap as a fifth character state. Someone more familiar
with the molecular phylogeny literature than I am may be able to say
how often analyses are done both ways, and the results compared.

>'Ignore third codon' weighting for coding sequences can be avoided by
doing an analysis of the amino acid sequence in its entirety. I'm not
sure whether enough proteins are known yet to allow AA analyses to be
useful at all taxonomic levels. There are also wonderful surprises
lurking in the 'proteome'. I used to think (as a non-molecular
taxonomist) that histone H3 was a very highly conserved nuclear protein
with wonderful base-level variety. A few weeks ago I learned that H3
paralogy is ...um ... a problem.

I don't know which approach is most common, but neither is rare, so take your pick. Indels are a nightmare if frequent and large but a blessing if uncommon. Many try to align insertions, thereby assuming a common origin, but given how fast they tend to evolve, aligning them can cause problems. Long branch attraction often happens when chance "homologies" emerge among sufficiently long insertions. To that end some prefer to cull them from the alignment and add them back as separate characters at the end. As for ignoring third codon positions, as I mentioned before, it is not very helpful in many instances. In intrageneric studies in particular (whatever a genus means) you can lose all the information. It also often happens that only a few amino acids flicker back and forth, so you could get spurious results, as they would be the only signal left. H3 shouldn't affect you, me or John for that matter, as it is too slowly evolving. Still, where did you read about the H3 paralogy? I would like to check it out.

Brian wrote> The question of indels is interesting because it depends on your  
approach. If you base your evaluation on similarity values you have to  
work out whether you include these regions in the analysis - one often  
removes them. However, these could be major evolutionary events with  
much greater effects than single base changes. Does treating the whole
of the indel as a fifth base do justice to its "significance"?

You could treat it as a whole "character", i.e. as the whole gene. Indels can be really good at keeping sister taxa together but often tell you nothing about their relationships to other taxa.
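As a rough illustration of how much the choice matters, here is a minimal Python sketch that counts differences between two aligned sequences under three gap treatments: gaps ignored as missing data, each gap column scored as a fifth state, and each contiguous indel scored as one event. The function name, the mode names and the toy sequences are hypothetical, and real indel-coding schemes work over the whole alignment rather than over one pair.

```python
def count_differences(seq1, seq2, gap_mode="missing"):
    """Count differences between two aligned sequences.

    gap_mode (hypothetical names):
      "missing": columns with a gap in exactly one sequence are skipped
      "fifth":   every such column counts as one difference
      "event":   each contiguous run of such columns counts once
    Columns where both sequences have a gap never count.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if gap_mode not in ("missing", "fifth", "event"):
        raise ValueError("unknown gap_mode: " + gap_mode)
    total = 0
    previous_owner = None  # which sequence held the gap in the previous column
    for a, b in zip(seq1, seq2):
        if a == "-" and b != "-":
            owner = 1
        elif b == "-" and a != "-":
            owner = 2
        else:
            owner = None
        if owner is not None:
            if gap_mode == "fifth":
                total += 1
            elif gap_mode == "event" and owner != previous_owner:
                total += 1
            previous_owner = owner
            continue
        previous_owner = None
        if a != b:
            total += 1
    return total


# Toy pair: one four-base indel plus one substitution in the last column.
seq_x = "ACGT----ACGA"
seq_y = "ACGTTTTTACGT"

as_missing = count_differences(seq_x, seq_y, "missing")
as_fifth = count_differences(seq_x, seq_y, "fifth")
as_event = count_differences(seq_x, seq_y, "event")
print(as_missing, as_fifth, as_event)
```

In this toy case the indel contributes nothing, four differences, or one difference depending on the mode, which is why the fifth-state option effectively upweights long indels relative to single base changes.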
