[Taxacom] (no subject)

Mon Jul 27 10:26:11 CDT 2009

Taxacom is where we can share curiosities and problems. Some of us lurk and learn, some lock and load, others get entertained, and we switch places depending on the topic. In honor of John Grehan's return to Taxacom, I submit an idea of what he might possibly mean when he says molecular data could be wrong and the true tree is ((man, orang) chimp, gorilla) or something on that order. Bear with me on this, since this has to do with self-taught statistics, which could be a good thing because it frees one from the competing schools, but is probably bad.

Different genes give different gene histories when their time of divergence is different from speciation events. Some genes seem to track species trees better than others, but given any three taxa terminal on a molecular cladogram, of say 40 DNA sequences analysed, maybe half or 2/3 support one of the three possible resolved trees, and the other two trees are supported about half and half by the other DNA sequences. If the null hypothesis is that each of the three possible gene trees has an equal chance of appearing, then a superfluity of one gene tree would tend to support that gene tree as the species tree.

E.g., in the meta-analysis by Satta et al. (2000) of 39 hominoid loci, 23 supported the ((Homo Pan) Gorilla) gene tree, 8 supported ((Homo Gorilla) Pan), and 8 supported ((Gorilla Pan) Homo).

This works fine if one assumes all apes have a similar distribution of right (matches to species tree) and wrong gene trees. The mechanism of delayed gene history is understood, and one should be able to show with a chi-square goodness-of-fit calculation that the above distribution would occur by chance alone well under 1 percent of the time (about 0.3 percent with a null proportion of 1/3 for each tree; chi-square = 11.54, 2 degrees of freedom). You can try this yourself:
http://faculty.vassar.edu/lowry/csfit.html 
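
For anyone who would rather check the arithmetic offline, here is a minimal sketch in Python of the same goodness-of-fit calculation on the Satta et al. counts. It assumes the null of 1/3 for each of the three gene trees, and it uses the fact that with 2 degrees of freedom the chi-square upper tail probability is simply exp(-x/2). The function names are my own and not from any package.

```python
import math

def chi_square_gof(observed, null_props):
    """Return (statistic, degrees of freedom) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in null_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

def chi2_upper_tail_df2(x):
    """Upper tail probability of chi-square with exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Counts from the post: ((Homo Pan) Gorilla), ((Homo Gorilla) Pan), ((Gorilla Pan) Homo)
counts = [23, 8, 8]
stat, df = chi_square_gof(counts, [1 / 3, 1 / 3, 1 / 3])

# The closed-form tail used here is only valid for df == 2 (three categories).
assert df == 2
p_value = chi2_upper_tail_df2(stat)

print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```

With 13 expected per tree, the statistic is (100 + 25 + 25)/13 = 150/13, or about 11.54, and the tail probability comes out near 0.003. This tests only whether the three trees are equally frequent; it says nothing about whether the equal-frequency null is the right one for great apes.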

But the number of terminals on the tree is small, and the null may be wrong. Do other apes have this kind of distribution of more right gene trees than wrong? How do you tell? One way is to examine more than 30 gene sequences and see what kind of distribution we have. Is it much skewed?

E.g. the magic number 30 is apparently the minimum number of observations that allows calculations without prior assumptions about the distribution. For instance, we test if a coin is loaded by flipping it. Under a binomial model with equal probabilities, heads and tails should each appear 50% of the time. But real coins are not exactly fair, since the head side is heavier, so the actual distribution (small as the departure is) requires not small-number analysis (with the assumption of a fair binomial distribution) but large-number analysis: to see if there is a difference between the expected number of heads from a coin naturally slightly loaded on the heads side and the number of heads from the particular coin you are flipping, that is, to see if it is additionally loaded and unlike other coins.
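
To make the coin point concrete, here is a rough sketch in Python of an exact one-sided binomial test run against two different nulls: a perfectly fair coin, and a coin with a slight natural heads bias. All the numbers are hypothetical and chosen only for illustration: 100 flips, 60 heads, and a baseline heads probability of 0.51 that I made up rather than measured.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips with heads probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_upper_tail(k, n, p):
    """Probability of k or more heads in n flips with heads probability p."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Hypothetical data, for illustration only.
n_flips = 100
heads = 60

p_fair = 0.5       # textbook null: a perfectly fair coin
p_baseline = 0.51  # assumed (not measured) natural heads bias of ordinary coins

tail_fair = binom_upper_tail(heads, n_flips, p_fair)
tail_baseline = binom_upper_tail(heads, n_flips, p_baseline)

print(f"P(>= {heads} heads | p = {p_fair}) = {tail_fair:.4f}")
print(f"P(>= {heads} heads | p = {p_baseline}) = {tail_baseline:.4f}")
```

The same 60 heads look less surprising against the biased baseline than against the fair coin, which is the point: the verdict depends on which null you choose, and estimating the right baseline takes a lot of flips of ordinary coins.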

Thus, gene tree histories may differ from species tree histories because of (1) chance alone, or (2) processes in each taxon, such as linkage, selection, and whatnot, that make gene histories depart from species histories and may confound analysis of shared ancestry.

As to (1), it could be that by chance alone the most common gene tree is not the species tree in the great ape complex. How might we tell this? A broad survey of lots of gene sequences might do it, so that we could maybe postulate a Poisson distribution under which a result as wrong as the one we see in great apes would essentially never turn up, but we don't have this.

Another way is to see if the gene tree built from many sequences, parsimoniously reflecting the most common gene tree, approximates a tree from a different class of data. Grehan suggests or implies or could imply, I think, that most of the ape/monkey molecular tree matches what is expected from morphology, but not with the great apes, and that this difference is statistically to be expected (an occasional wrong inference because such occurs by chance alone, say 2% of the time). Is this right? Does anybody who knows this kind of statistics better than I do (doubtless anyone who has actually taken a course) have an opinion?

As to (2), there may be a mechanism that warps the null of all different gene tree histories being equally likely. This could be selection, linkage, mistaken orthology, or maybe cussedness. Then, the most unlikely gene tree might match the species tree in great apes. Grehan has to at least demonstrate what these mechanisms might be and how they work to make his point better.  

_______________________
Richard H. Zander
Missouri Botanical Garden
PO Box 299
St. Louis, MO 63166 U.S.A.
richard.zander at mobot.org