PLACE: 1652 Gilman
SPEAKER:
James Lyons-Weiler
The Pennsylvania State University
TITLE:
Data Exploration and Hypothesis Testing in Molecular Phylogenetics,
Molecular Evolution, and Beyond
ABSTRACT:
Evolutionary genomics draws upon phylogenetics, molecular evolution,
and functional and structural genomics. Both phylogenetics and molecular
evolution are improved by Tree-Independent Data Exploration, and molecular
evolution can be improved by a more recently devised Monte Carlo Test of
Purifying Selection. A matrix regression model is used in tree-independent
data exploration, allowing researchers to find noisy genes (measure signal),
perform optimal outgroup analysis, detect long branches, evaluate taxon
sampling, and perform noise reduction. An example where such
data exploration has led to markedly improved phylogenetic estimates is
the recent study by Culligan et al. (2000), who concluded that the eukaryotic
postreplication mismatch repair 'mutS homolog' multigene family (MSH2-6)
represents a monophyletic gene family derived from a mutS copy present
in the protomitochondrial endosymbiont. In a
similar manner, hypothesis testing in molecular evolution can be improved
by a novel, computationally intensive test and measure of purifying selection.
Classical, distribution-dependent tests measuring rates of synonymous and
nonsynonymous substitutions have low power. The new test compares
the observed amino acid divergence to a null distribution of amino
acid divergence predicted by a neutral substitution model and neutral rates
of nucleotide divergence. This Monte Carlo test is shown to have
remarkably higher power to detect purifying selection, leading to a dramatic
difference in the interpretation of the importance of selection during
molecular evolution. This also provides an example where caution
is warranted in the biological interpretation of negative statistical results.
Because most attempts to study
the importance of natural selection have been based on low power tests,
the importance of natural selection as a driving force behind molecular
evolution has been, and is likely to continue to be, underestimated. Evolutionary
genomics will be much improved by the careful construction and application
of powerful statistical approaches to hypothesis testing that focus on
the responses of relationships among measurable variables in addition
to those which focus primarily on simple parameter estimation.
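The Monte Carlo comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation. It assumes a deliberately crude null model: the observed number of nucleotide differences stands in for the neutral rate, substitutions fall uniformly on sites with no regard to codon position, and the standard genetic code applies. The function names (monte_carlo_purifying_test, mutate_neutrally) and the toy sequences are hypothetical.

```python
import random

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(seq):
    """Translate a coding sequence, ignoring any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))

def p_distance(a, b):
    """Proportion of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mutate_neutrally(seq, n_subs, rng):
    """Assumed null model: n_subs substitutions at distinct, uniformly chosen sites."""
    seq = list(seq)
    for site in rng.sample(range(len(seq)), n_subs):
        seq[site] = rng.choice([b for b in BASES if b != seq[site]])
    return "".join(seq)

def monte_carlo_purifying_test(seq_a, seq_b, n_reps=2000, seed=0):
    """Compare observed amino acid divergence with a simulated neutral null."""
    rng = random.Random(seed)
    n_subs = sum(x != y for x, y in zip(seq_a, seq_b))
    prot_a = translate(seq_a)
    observed = p_distance(prot_a, translate(seq_b))
    null = [p_distance(prot_a, translate(mutate_neutrally(seq_a, n_subs, rng)))
            for _ in range(n_reps)]
    # One-sided Monte Carlo p-value: how often the null model yields
    # amino acid divergence as small as, or smaller than, the observed value.
    p_value = (1 + sum(d <= observed for d in null)) / (n_reps + 1)
    return observed, p_value

# Toy data: 30 alanine codons; seq_b differs only at synonymous third positions.
seq_a = "GCT" * 30
seq_b = "".join("GCC" if i % 3 == 0 else "GCT" for i in range(30))
observed, p_value = monte_carlo_purifying_test(seq_a, seq_b)
print(observed, p_value)
```

A small p-value here means that the same number of nucleotide changes placed without constraint would rarely leave the protein as unchanged as observed, which is the signature of purifying selection under these simplified assumptions.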
When marker classes at a locus are coded 1, 0, -1 for MM, Mm, and mm, respectively, the multiple regression for data from large F2 populations has some elegant properties:
1. The vector of regression coefficients b = (X’X)^{-1}X’Y may be considered a product of marker relationship information contained in X’X and simple linear regression estimates for each marker locus, X’Y.
2. As sample size, n, gets large, 2X’X/n approaches the correlation matrix among markers, R.
3. The inverse of R has been derived for the no-interference case by Wright and Mowers (1994).
4. With evenly spaced markers on a chromosome, R has the same form as an error variance matrix for a first-order autoregressive process.
5. Marker-pair regressions using linked markers give reasonable estimates of positions and effects of additive genetic factors located between markers.
6. Less promising is the result that variances of multiple regression estimates are strongly affected by how close the nearest distal markers lie to the flanking markers.
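Properties 1, 2, and 4 can be checked numerically with a short simulation. The sketch below is illustrative only and rests on assumptions not stated above: a single chromosome with evenly spaced markers, recombination fraction r = 0.1 between adjacent markers, no interference, and one additive factor of effect 1.0 placed exactly at the third marker. The helper names (simulate_f2_markers, ar1_matrix) are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def simulate_f2_markers(n, n_markers, r, rng):
    """Simulate F2 marker scores coded 1 (MM), 0 (Mm), -1 (mm)."""
    def gametes():
        # Allele at the first locus: 1 for M, 0 for m, equally likely.
        first = rng.integers(0, 2, size=(n, 1))
        # A recombination between adjacent loci (probability r) flips the allele.
        flips = rng.random((n, n_markers - 1)) < r
        switches = np.cumsum(flips, axis=1) % 2
        rest = (first + switches) % 2
        return np.hstack([first, rest])
    # Sum of two independent gametes, shifted so that MM = 1, Mm = 0, mm = -1.
    return gametes() + gametes() - 1

def ar1_matrix(n_markers, rho):
    """First-order autoregressive correlation matrix with entries rho**|i - j|."""
    idx = np.arange(n_markers)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
n, m, r = 200_000, 5, 0.1
X = simulate_f2_markers(n, m, r, rng)

# Property 2: 2X'X/n should approach the marker correlation matrix R.
R_hat = 2 * X.T @ X / n
# Property 4: with even spacing and no interference, R has AR(1) form
# with rho = 1 - 2r between adjacent markers.
R_ar1 = ar1_matrix(m, 1 - 2 * r)
max_abs_diff = np.max(np.abs(R_hat - R_ar1))

# Property 1: b = (X'X)^{-1} X'Y, computed with a linear solve.
Y = 1.0 * X[:, 2] + rng.normal(size=n)  # assumed additive effect at marker 3
b = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.round(R_hat, 3))
print(np.round(b, 3))
```

With a sample this large, R_hat should sit close to the AR(1) matrix, and b should be near 1 at the third marker and near 0 elsewhere.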
A practical example of the use of marker-pair and multiple regressions is
given for gray leaf spot tolerance in maize. Magnitude of effects
and location of possible genetic factors are estimated from the regressions.
COFFEE: 3:45 p.m., 104 Snedecor Hall