Invariant sites and reliability

Wed Jul 20 14:52:18 CDT 2005

This behavior occurs because, in a homogeneous ML model that averages
branch lengths over the tree, every invariant character added shortens
the edge lengths on the tree.  As a result, the probability of
erroneously interpreting a homology as homoplasy declines, and so
support for clades increases.  You can add invariant characters to any
matrix and get the same result.  In a Bayesian analysis, the relative
likelihoods of trees are used to accept a new tree or keep the old one,
and the disparity in likelihoods increases as invariant characters are
added, so the number of trees retained in the credible set declines.
This has only to do with the way ML treats branch lengths, and nothing
to do with character types.

This kind of support measure differs from resampling support like the
bootstrap, in which apparent support falls as invariant characters are
added (because the probability of resampling the informative character
falls as the total number of characters increases).  In a homogeneous
ML framework, the Bayesian support is working the way it "should,"
given the branch-length smoothing of that framework.  The unexpected
behavior of the bootstrap under this framework is partly why Goloboff
and Farris came up with the Poisson bootstrap, which is available in
TNT.  Much of this is discussed in:

Goloboff, P., J. S. Farris, M. Källersjö, B. Oxelman, M. J. Ramírez
and C. A. Szumik. 2003. Improvements to resampling measures of group
support. Cladistics 19: 324-332.
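
To put rough numbers on the two effects contrasted above, here is a
minimal sketch in Python.  It is my own illustration, not anything
taken from MrBayes or TNT, and it rests on assumptions that are not in
the discussion itself: Jukes-Cantor (JC69) as the simplest homogeneous
model, a pairwise distance standing in for a branch length, and a
matrix with exactly one variable character.  The function names are
made up for the example.

```python
import math


def jc69_distance(n_diff, n_sites):
    """JC69 distance between two sequences, in expected substitutions
    per site, given n_diff differing sites out of n_sites.
    Assumption: a pairwise distance stands in for a branch length."""
    p = n_diff / n_sites
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def bootstrap_inclusion_prob(n_sites):
    """Probability that one particular character is drawn at least once
    when n_sites characters are resampled with replacement n_sites
    times (the ordinary bootstrap)."""
    return 1.0 - (1.0 - 1.0 / n_sites) ** n_sites


if __name__ == "__main__":
    # One variable character; all remaining characters are invariant.
    print("sites  JC69 distance  P(informative char. resampled)")
    for n in (2, 5, 20, 100, 1000):
        print(f"{n:5d}  {jc69_distance(1, n):13.5f}  "
              f"{bootstrap_inclusion_prob(n):.4f}")
```

Under these assumptions the estimated distance keeps shrinking as
invariant characters are added, while the chance that the single
informative character shows up in a bootstrap pseudoreplicate falls
from 0.75 with two characters toward 1 - 1/e (about 0.63), not toward
zero.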

Kurt

> Interesting business, that of likelihood (e.g. max likelihood and
> Bayesian) and uninformative characters.
>
> I've used MrBayes to analyze a contrived data set like this:
>
> 1 CCCCCCCCCCCCCCCCCCC etc.
> 2 CACCCCCCCCCCCCCCCCC
> 3 CACCCCCCCCCCCCCCCCC
> 4 CCCCCCCCCCCCCCCCCCC
>
> (You can also use 4 sequences of totally random data, except for  
> one site.)
>
> If you set the model to recognize all parsimony uninformative sites as
> invariable, then you get a low (33%) Bayesian posterior probability
> for the tree ((2, 3)4, 1).
>
> If you set the model to treat all uninformative sites as variable,
> you get BPP of 100% for the same tree.
>
> Doubtless (I think) this last situation is because getting two "A"s
> at one site for taxon 2 and 3 is very improbable when all the other
> sites have no mutations to "A"s at all.
>
> This shows a very clear difference in what characters and character
> states really mean between analysis in morphology and molecular
> studies, though it's hard to put into words.
>
> Also, choice of model to recognize invariant sites or not makes a big
> difference, doesn't it? How might this affect measures of uncertainty?
>
> ______________________
> Richard H. Zander
> Bryology Group, Missouri Botanical Garden
> PO Box 299, St. Louis, MO 63166-0299 USA
> richard.zander at mobot.org
> Voice: 314-577-5180;  Fax: 314-577-9595
> Websites
> Bryophyte Volumes of Flora of North America:
> http://www.mobot.org/plantscience/bfna/bfnamenu.htm
> Res Botanica:
> http://www.mobot.org/plantscience/resbot/index.htm
> Shipping address for UPS, etc.:
> Missouri Botanical Garden
> 4344 Shaw Blvd.
> St. Louis, MO 63110 USA
>
>
> -----Original Message-----
> From: A.P. Jason de Koning [mailto:apjdk at ALBANY.EDU]
> Sent: Tuesday, July 19, 2005 9:31 PM
> To: TAXACOM at LISTSERV.NHM.KU.EDU
> Subject: Re: [TAXACOM] Molecular taxonomy: on way out?
>
>
>
>> Why bother expurging the data matrix from non-informative
>> characters when the algorithm doesn't take them into account?
>>
>>
>
> I agree that throwing out data here is unnecessary, and in general
> can be biasing.
>
> An unmentioned issue to consider with respect to probabilistic
> analyses (of molecular or phenotypic characters): the inclusion of
> characters with phylogenetically non-informative character state
> distributions can be beneficial because they may contribute
> information to the estimation of model parameters (such as relative
> rates of character state transformation), which can have an
> "indirect" effect on the selection of an optimal topology (though
> perhaps rarely, and not strongly).  Why?  Because the effect on
> relative rate estimates by 'non-informative' characters can alter the
> relative probabilities of trees for other 'informative' characters!
> For a very large dataset, inclusion of such characters could matter
> for any given phylogeny problem...certainly it matters in molecular
> evolutionary analyses where phylogeny is not the desired endpoint.
>
> Cheers,
> - Jason
>

============================
Kurt Milton Pickett
Theodore Roosevelt Fellow
Division of Invertebrate Zoology
American Museum of Natural History
79th Street at Central Park West
New York, NY 10024-5192
(212) 313-7622
kpickett at amnh.org