RASA

Thu Mar 6 06:30:22 CST 1997

        I'm responding to Tom's request that I explain my
        probabilistic approach.  This is rather serendipitous,
        for I am close to releasing RASA 2.1, and I can take
        full advantage of this forum to explain in some
        detail the workings and my understanding of RASA theory
        to date.  Most of the extensions are currently under peer
        review, but most of this has been presented in verbal
        form elsewhere.  Given the informal nature of this list,
        I'll not cite literature when I should.

Signal:
        RASA provides a measure of how much cladistic hierarchy is present
        in a matrix of character states beyond that which is expected by
        chance alone.  It is similar in its goal to g1 and PTP, but
        quite different in its implementation and flexibility.

        It does this by taking advantage of the observable consequence
        that random data produces no differences between an observed
        and null rate of increase in apparent cladistic similarity
        (RAS) among pairs of taxa as a function of overall similarity, but
        data with signal produces a statistically significant difference
        between the slopes.

        [note that some folk balk at the use of the term
        cladistic similarity;  I don't really care what
        one calls the metric; for me, it's a semantic issue.]

        Cladistic similarity is defined for each taxon pair as the
        number of times that pair shares a character state to the
        exclusion of other taxa, all characters considered.
        Semantics aside, this metric incorporates all of the
        (A,B),X comparisons where A is taxon A, B is taxon B, and
        X is any other taxon, for all characters.  As a measure of
        unique similarity, the metric contains ALL of the information
        in the matrix.  Invariant sites are trivial.
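
        As a rough illustration of the definition above (a sketch, not
        the RASA program itself), the count can be written out directly.
        The row-per-taxon matrix layout and the function name
        cladistic_similarity are assumptions made for this example.

```python
from itertools import combinations


def cladistic_similarity(matrix):
    """Return {(i, j): count} for every taxon pair in `matrix`.

    `matrix` holds one row per taxon and one column per character
    (an assumed layout; rows may be strings or lists).
    """
    n_taxa = len(matrix)
    n_chars = len(matrix[0])
    ras = {}
    for i, j in combinations(range(n_taxa), 2):
        count = 0
        for k in range(n_chars):
            if matrix[i][k] != matrix[j][k]:
                continue  # A and B do not share a state for this character
            # One (A,B),X comparison for each other taxon X that lacks
            # the state shared by A and B.
            count += sum(
                1
                for x in range(n_taxa)
                if x not in (i, j) and matrix[x][k] != matrix[i][k]
            )
        ras[(i, j)] = count
    return ras


# Toy four-taxon, four-character matrix (invented for illustration).
toy = ["0011", "0010", "1100", "1101"]
for pair, value in cladistic_similarity(toy).items():
    print(pair, value)
```

        Note that an invariant character adds nothing to any pair under
        this reading, since no third taxon is excluded.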

        The phenetic metric used (on the x-axis) is the number of characters
        that are putatively informative for a taxon pair; i.e., the
        number of characters that contribute to the cladistic
        metric for that pair.  These two metrics will positively
        covary EVEN with random data, so a test of this slope
        against zero is trivial.
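
        One possible reading of the x-axis metric, sketched under the
        assumption that a character "contributes" for a pair when the
        pair shares a state that at least one other taxon lacks.  The
        function name and matrix layout (one row per taxon) are
        illustrative only.

```python
from itertools import combinations


def informative_characters(matrix):
    """Return {(i, j): number of characters contributing for that pair}."""
    n_taxa = len(matrix)
    n_chars = len(matrix[0])
    counts = {}
    for i, j in combinations(range(n_taxa), 2):
        counts[(i, j)] = sum(
            1
            for k in range(n_chars)
            # The pair shares a state ...
            if matrix[i][k] == matrix[j][k]
            # ... and at least one other taxon does not have it.
            and any(
                matrix[x][k] != matrix[i][k]
                for x in range(n_taxa)
                if x not in (i, j)
            )
        )
    return counts


# Toy matrix invented for illustration.
toy = ["0011", "0010", "1100", "1101"]
for pair, value in informative_characters(toy).items():
    print(pair, value)
```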

        The test statistic (tRASA) incorporates three parameters, each
        observed directly from the matrix: the observed slope, the
        null slope, and an error term (a function of scatter and slope
        in the observed model).  The null slope can be determined
        either by permutations of the matrix or by an analytical
        procedure. Each matrix has a characteristic null slope.  Details
        on the process that determines the null slope (reciprocal
        equiprobable redistribution) are as follows:

        Each taxon pair contributes a particular amount of information
        to each axis.  If no cladistic hierarchy is present, there is
        nothing remarkable about any of the three-taxon comparisons (i.e.,
        no synapomorphic information is represented).  To
        determine the expected rate of increase in the cladistic metric as
        a function of the phenetic, the total amount of information on
        each axis is summed, and the "matrix rate" is the ratio of the sum
        of the cladistic metric values to the sum of the phenetic values
        (D. Colless pointed out that my eq. in the RASA I paper was
        inelegant).  I am at present exploring the effects of a variety
        of factors other than randomness and phylogeny on the null.
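
        A minimal sketch of how the three ingredients might fit
        together, given the paired per-taxon-pair values for the two
        axes.  The matrix rate follows the ratio of sums described
        above.  The error term here is the ordinary least-squares
        standard error of the slope, which is an assumption made for
        illustration; the published tRASA error term may be defined
        differently.  All function names are hypothetical.

```python
import math


def matrix_rate(cladistic_values, phenetic_values):
    """Null slope: sum of the cladistic metric over sum of the phenetic metric."""
    return sum(cladistic_values) / sum(phenetic_values)


def ols_slope_and_se(x, y):
    """Least-squares slope of y on x and its standard error (needs n > 2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def t_like_statistic(cladistic_values, phenetic_values):
    """(observed slope - null slope) / error term; an assumed form."""
    slope, se = ols_slope_and_se(phenetic_values, cladistic_values)
    return (slope - matrix_rate(cladistic_values, phenetic_values)) / se


# Invented values, one entry per taxon pair.
phenetic = [1, 2, 3, 4]
cladistic = [1, 3, 7, 13]
print(matrix_rate(cladistic, phenetic))
print(t_like_statistic(cladistic, phenetic))
```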


        Simulation-based and empirical tests of the test for signal have
        been conducted.  tRASA has predictive power for the accuracy
        of methods of phylogenetic inference.  It tracks the decay
        of signal as internodal branches become long.  It, like
        mp, is sensitive to long branches, to outgroup selection,
        and to other pathologies that can reduce the accuracy of
        evolutionary trees.

        This sensitivity has been turned to an advantage.  But first,
        assumptions, falsifiability, and limitations:

        The assumptions of the test are that past processes of evolution
        may or may not produce pattern in the data that
        looks like signal, and that other processes only produce
        signal-like pattern rarely (explicitly, ca. 5% of the time).
        This makes the method falsifiable.  If other conditions can be
        shown to produce signal > 5% of the time, the test statistic
        may be misleading.

        On the other hand, the method assumes that processes of
        evolution that produce distributions of character states
        among taxa and cause methods of phylogenetic inference
        to retrieve accurate trees will result in a difference
        between the slopes.  Often there is a complex interplay
        between the cladistic hierarchy in a matrix, caused
        by the processes of evolution, and localized
        violations of the assumptions of phylogenetic
        methods.  This turns out not to be a problem,
        because these influences are readily identifiable.

        The method is falsifiable.  Specifically, some rather bold
        predictions can be made.  For example, anything one does to the
        matrix that increases tRASA will increase the accuracy of the
        phylogenetic inference.  This includes taxon sampling,
        and character sampling. tRASA therefore provides a ready
        criterion for determining the implications of things like
        removing apparent autapomorphies.

Testing Homology

        If a preponderance of hypotheses of homology are in error, then
        an mp tree will likely be wrong, and inferred synapomorphies
        will be erroneously adduced.  If a preponderance of hypotheses
        of homology are wrong, phylogenetic signal is weak, tRASA
        will not be significant, and the tree, which would otherwise
        have been used to falsify correct hypotheses of
        homology, would be recognized as a poor test, to be distrusted.

        In contrast, tRASA tracks the proportion of erroneous
        hypotheses of homology well, and can be applied to the
        task of trying to falsify individual hypotheses of
        homology.

        The impact of individual hypotheses of homology
        can be explored by comparing signal when they are
        included or excluded.  This particular application of
        RASA is not well explored, but it appears to be justified.
        I include it here because it is most relevant to the
        discussion.  From my experience with empirical data, tree-based
        indicators of accuracy can be improved dramatically by finding
        and removing apparently misleading characters.  Once an
        estimate of phylogeny is produced that uses apparently
        accurate hypotheses of homology, the evolution of the
        misleading characters can be better understood a posteriori.
        Naturally, this is similar to previous approaches because it
        has the same goal, but there are some fundamental differences
        here; the RASA algorithm is solved in polynomial time (therefore,
        the entire tree space doesn't have to be explored).
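
        The include/exclude comparison can be sketched generically.
        The helper below takes any signal statistic as a function
        argument; a real analysis would pass in tRASA, while the
        stand-in used in the example is a toy and not RASA.  The
        names drop_character and character_impact are hypothetical.

```python
def drop_character(matrix, k):
    """Return a copy of `matrix` (one row per taxon) without column k."""
    return [row[:k] + row[k + 1:] for row in matrix]


def character_impact(matrix, statistic):
    """Change in `statistic` when each character is removed in turn.

    A positive entry means signal went UP when that character was
    excluded, flagging it as a candidate misleading character.
    """
    baseline = statistic(matrix)
    n_chars = len(matrix[0])
    return [
        statistic(drop_character(matrix, k)) - baseline
        for k in range(n_chars)
    ]


def toy_statistic(matrix):
    """Stand-in only: number of characters on which taxa 0 and 1 agree."""
    return sum(1 for a, b in zip(matrix[0], matrix[1]) if a == b)


print(character_impact(["010", "011", "100"], toy_statistic))
```

        Each character costs one extra evaluation of the statistic, so
        the whole scan needs no tree search.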

Long branches

        Taxa at the end of long branches have undergone a great
        amount of anagenetic evolution (not necessarily an increased
        rate of evolution), without cladogenesis.  As a result,
        autapomorphies accumulate.  This quite effectively randomizes the
        character states (with respect to phylogenetic information).

        This can be a result either of true evolutionary history (i.e.,
        living fossils), or of taxon sampling.

        Taxa (and clades) at the ends of long branches tend to be
        not all that similar to other taxa in the matrix (in the phenetic
        sense); but for their low similarity, they tend to have a
        high value of pairwise cladistic similarity with other
        taxa in the matrix.  This tends to reduce the observed slope.
        The contribution of each individual taxon to the variance
        on each axis is summarized in a type of residual plot
        (a taxon variance plot), in which taxa on long branches
        are readily identified.

        Although such patterns are caused by long branch attraction,
        this application makes no assumptions of evolutionary process,
        for if ANY factor causes the characteristic pattern, the
        algorithms for trees will respond similarly, and whatever
        topologies are found to be optimal will be something other than
        an accurate estimation of phylogeny.  A direct analogy is the
        test for normality for parametric statistics; it is entirely
        irrelevant to the functioning, accuracy and utility of the
        test that a multitude of factors may cause a sample to be
        non-normal.

        Although the cause of the observed pattern may be multifarious,
        the method does not assume details of evolutionary processes
        because all inferences are based on observed character states, not
        on inferred transformations.  RASA does not assume homology;
        it tests for it.  For example, if an alignment of nucleotides is
        wrong owing to a computer copy error, that matrix will be
        inappropriate for phylogenetic inquiry; this type of error
        will be observable in the signal, and in other ways (see below).

        In another turn, I have found that when long branches have
        influenced the distribution of character states among
        taxa (via simulation and some empirical data sets as well),
        a bootstrap power analysis that uses tRASA as a criterion
        to ask the question "do I have enough characters?" has a
        predictable and intuitive effect.  When signal is strong,
        there are no long branches, and characters are resampled
        at various levels of Nc (number of characters), signal
        increases and asymptotes.  This is a more classical and
        better justified use of the bootstrap (power analysis).  When
        long branches are present, signal _decreases_ as one adds
        more characters, just as Felsenstein predicted; obviously,
        the taxon variance plot is direct evidence of which
        taxa may be problematic.   The efficiency of the algorithm
        pays off here, too.
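
        A sketch of the resampling loop behind such a power analysis,
        assuming characters are drawn with replacement at each level
        of Nc and the statistic is averaged over replicates.  The
        statistic is passed in as a function; the function names,
        the default of 100 replicates, and the seed are all choices
        made for this example.

```python
import random


def resample_characters(matrix, n_chars, rng):
    """Draw n_chars columns with replacement from `matrix` (row per taxon)."""
    total = len(matrix[0])
    columns = [rng.randrange(total) for _ in range(n_chars)]
    return [[row[k] for k in columns] for row in matrix]


def power_curve(matrix, statistic, sizes, replicates=100, seed=0):
    """Mean value of `statistic` at each number of resampled characters."""
    rng = random.Random(seed)
    curve = {}
    for n_chars in sizes:
        values = [
            statistic(resample_characters(matrix, n_chars, rng))
            for _ in range(replicates)
        ]
        curve[n_chars] = sum(values) / replicates
    return curve


# Placeholder statistic for demonstration: just the matrix width.
toy = ["0011", "0010", "1100", "1101"]
print(power_curve(toy, lambda m: len(m[0]), sizes=[2, 4, 8], replicates=10))
```

        Plotting the returned means against Nc would show whether the
        statistic rises to a plateau or falls as characters are added.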

Outgroup Selection
        In the calculation of RAS, comparisons of (A,B),X for a given
        character are ignored if X is a putative outgroup and if
        the state of that character is the same for A, B and X.
        Similarly, a comparison of (A,B),Y is ignored if Y is not an
        outgroup, but if A and B have the same state as X, which is a
        putative outgroup.
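
        The two rules above can be sketched as a filter on the count,
        under some assumptions of mine: a single outgroup taxon, only
        ingroup pairs are scored, and the outgroup may still serve as
        the excluded taxon X when the pair's shared state differs from
        its own.  The function name and matrix layout (one row per
        taxon) are illustrative.

```python
from itertools import combinations


def rooted_cladistic_similarity(matrix, outgroup):
    """Cladistic similarity for ingroup pairs, skipping shared outgroup states.

    `outgroup` is the row index of the putative outgroup taxon.
    """
    n_taxa = len(matrix)
    n_chars = len(matrix[0])
    ingroup = [t for t in range(n_taxa) if t != outgroup]
    ras = {}
    for i, j in combinations(ingroup, 2):
        count = 0
        for k in range(n_chars):
            state = matrix[i][k]
            if matrix[j][k] != state:
                continue  # the pair does not share a state
            if state == matrix[outgroup][k]:
                continue  # shared with the outgroup: all comparisons ignored
            count += sum(
                1
                for x in range(n_taxa)
                if x not in (i, j) and matrix[x][k] != state
            )
        ras[(i, j)] = count
    return ras


# Row 0 is the putative outgroup in this invented example.
toy = ["0000", "0011", "1011", "0100"]
print(rooted_cladistic_similarity(toy, outgroup=0))
```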

        When X is a poor outgroup choice (e.g., random data), the
        constraints on the comparisons of (A,B),Y will include (by chance)
        homoplasy, plesiomorphy, and synapomorphy.  When X has a sufficient
        plesiomorphy content, MOSTLY plesiomorphy will be constrained, and
        that plesiomorphy in the ingroup will include true plesiomorphy and
        some reversals.  When plesiomorphy is constrained, tRASA increases
        (i.e., less noise goes into RASij and more signal (i.e.,
        synapomorphy) is incorporated).  This procedure can be used to
        choose among possible outgroup candidates; this is important
        because of the way PAUP and other programs use outgroups: a poor
        outgroup selection can bend trees around and cause polarity
        inferences to be way off.

        It is my hope that TAXACOMERS don't misread this post as
        an advertisement.  My goal is to increase the accuracy
        of phylogenetic trees by helping people to see pitfalls
        in their data before they fall in.

        Cheers,


        James Lyons-Weiler