Alignment- and Alignment-refining Algorithms: Effects on Branch-length Estimation and Selection Pattern Analyses
This dissertation studies the effects of multiple-sequence alignment methods on phylogenetic analyses. First, I investigated the effects of multiple-sequence alignment quality on branch-length estimation, which can influence downstream bioinformatic analyses such as the estimation of evolutionary rates and divergence times. To quantify the accuracy of branch-length estimates, I devised a scale-free measure of branch-length proportionality between two phylogenetic trees that share the same taxa and topology, which I named “normalized tree distance” (NTD). In addition to measuring the accuracy of branch-length estimates, NTD is well suited to detecting coevolutionary processes. Using NTD as an error measure, I simulated coding sequences and estimated the effects of multiple evolutionary parameters and of the choice of alignment- and alignment-filtering algorithms on the accuracy of branch-length estimation. I demonstrated that branch-length accuracy does depend on the alignment method. High-accuracy alignment algorithms combined with methods for filtering out unreliable sites produce significantly better branch-length estimates. The optimal combination of methods, however, depends on the evolutionary scenario: different alignment algorithms and different combinations of algorithms yield better branch-length estimates under different evolutionary conditions. A judicious choice of alignment- and alignment-filtering algorithms is therefore recommended for phylogenetic studies.
Second, I studied the correlation between two types of purifying selection, against nonsynonymous mutations and against deletions, using mammalian genomic protein-coding sequences. Intuitively, a codon that is intolerant of amino-acid-altering substitutions is expected to be intolerant of deletion as well. However, this purported correlation has not been studied comprehensively. In addition to analyzing nine-species alignments of 8,595 genes, I simulated coding sequences along the same phylogenetic trees. The real data showed a much stronger correlation than the simulated sequences. I demonstrated that the correlation between amino-acid replacement and deletion rates exists and cannot be explained solely by alignment errors. Further investigation of nonsynonymous and synonymous mutations showed that this correlation is most likely due to selection rather than to mutation rates. Understanding selection on different types of mutations would help strengthen the link between population genetics and sequence evolution.