Place of publication: KMK Scientific Press Ltd, Moscow, Russia
First page: 274
Last page: 277
Abstract: --- Motivation ---
The approaches and algorithms described here facilitate three tasks: improving gene phylogeny reconstruction, inferring species phylogenies and gene evolution scenarios from a set of gene trees, and constructing multi-gene datasets for phylogenomic studies.
Events in gene evolution are usually viewed as gene divergence during species differentiation, gene duplication, gene gain, gene loss and horizontal gene transfer (HGT). Mathematical models of gene evolution were formulated to account for the observed differences between gene and species trees, and optimization of model parameters is used as a tool to reconstruct the evolutionary history of gene families.
Disparate substitution rates across regions of homologous sequences and mutational saturation are well known to elevate the level of homoplasy in molecular data and to obscure the phylogenetic signal available in the alignment. Many types of analysis depend strongly on the reliability of individual gene trees. An original method was developed to detect and filter out noisy columns in a multiple protein alignment in order to increase the accuracy and resolution of phylogenetic analysis. The method was tested in simulation experiments.
A high-quality dataset of orthologous genes that broadly represents the diversity of Metazoa is a prerequisite for obtaining a reliable animal phylogeny. This is a high-priority task, given the need to interpret the bulk of experimental evidence accumulated on phylogenetically diverse model organisms in a robust evolutionary framework, yet a comprehensive phylogeny of Metazoa is still missing. A combination of conventional and original bioinformatic methods was used to analyze databases of annotated genomes and unannotated proteomes of a range of metazoan phyla in order to find genes suitable for inference of deep metazoan phylogeny. Such a dataset was constructed for the first time; it includes 51 genes for 30 representatives of 11 animal phyla.
--- Results and discussion ---
--- Refinement of phylogenetic signal in multiple protein alignment ---
The method is described in detail in Lyubetsky et al. (2005). It relies on originally developed programs that compute the objective scoring function and perform constrained generation of random trees. The scoring function ranks the columns of the alignment by how consistent each column’s content is with a list of reliable clades, and the least consistent columns are removed iteratively until the signal is refined enough to give better resolution of the tree. The list of reliable clades is the set of splits occurring in a 70% majority-rule consensus topology constructed after bootstrapping the intact alignment (i.e., before column removal). At each column-removal step the g1 statistic (Hillis and Huelsenbeck, 1992) is estimated on the current alignment, using the original algorithm for generating random trees strictly compatible with the list of reliable clades (Rusin et al., 2007), and is used to determine the step at which the procedure halts (the resulting alignment is capable of producing the maximum number of resolved splits in random trees). The resulting alignment is considered optimal for tree reconstruction (definitive phylogenetic analysis). The method’s performance was verified with simulations: 1000 datasets of 40 sequences, each 300 amino acids long, were generated using maximum-likelihood model parameters and branch lengths estimated from real COG data. In 100% of cases, removing noisy columns made it possible to reconstruct a tree closer to the known tree used to simulate the data than the tree obtained without refinement; in 53% of cases the difference in likelihood between the “best” tree found and the tree inferred from the intact data was statistically significant according to standard tests of phylogenies.
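The column-filtering loop described above can be illustrated with a minimal Python sketch. It is not the published implementation: the actual scoring function is given in Lyubetsky et al. (2005), and the `clade_consistency` score, the `refine_alignment` function and the `criterion` callable used here are hypothetical stand-ins. Only `g1_skewness` follows a standard definition, the sample skewness of a tree-length distribution, which is what the g1 statistic measures. The sketch also simplifies the halting rule by evaluating every removal step and keeping the best one.

```python
def g1_skewness(tree_lengths):
    """Sample skewness g1 = m3 / m2**1.5 of a tree-length distribution.

    Assumes the lengths are not all identical (otherwise m2 is zero).
    """
    n = len(tree_lengths)
    mean = sum(tree_lengths) / n
    m2 = sum((x - mean) ** 2 for x in tree_lengths) / n
    m3 = sum((x - mean) ** 3 for x in tree_lengths) / n
    return m3 / m2 ** 1.5


def clade_consistency(column, taxa, clades):
    """Toy score (an assumption, NOT the published function): the share of
    reliable clades whose residues do not also occur outside the clade."""
    residue = dict(zip(taxa, column))
    supported = 0
    for clade in clades:
        inside = {residue[t] for t in clade}
        outside = {residue[t] for t in taxa if t not in clade}
        supported += inside.isdisjoint(outside)
    return supported / len(clades)


def refine_alignment(columns, score, criterion):
    """Remove columns from least to most consistent and return the indices
    of the kept columns at the step where `criterion` is highest."""
    ranked = sorted(range(len(columns)), key=lambda i: score(columns[i]))
    best_kept = list(range(len(columns)))
    best_value = criterion(columns)
    removed = set()
    for idx in ranked[:-1]:  # always keep at least one column
        removed.add(idx)
        kept = [i for i in range(len(columns)) if i not in removed]
        value = criterion([columns[i] for i in kept])
        if value > best_value:
            best_kept, best_value = kept, value
    return best_kept


# Hypothetical toy data: four taxa, one reliable clade, three columns.
taxa = ["a", "b", "c", "d"]
clades = [{"a", "b"}]
columns = ["AACC", "ACAC", "AAGG"]
score = lambda col: clade_consistency(col, taxa, clades)
mean_score = lambda cols: sum(score(c) for c in cols) / len(cols)
kept = refine_alignment(columns, score, mean_score)
```

In a realistic setting `criterion` would be replaced by a statistic computed on random trees compatible with the reliable clades, for example one based on `g1_skewness` of their lengths; the mean toy score is used here only to keep the example self-contained.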
--- Reconstruction of the species tree and ancestral gene evolution events ---
Reconciliation of protein and source species phylogenies is based on analyzing the topological incongruence between the protein family tree G and the species tree S. Nodes introducing this incongruence are explained by events of horizontal gene transfer (or gene gain) and by ancestral gene duplication followed by loss in many descendant lineages. We developed an algorithm that infers the species tree from a set of gene trees and uses it to reconstruct gene evolution events. For example, it was used for bulk identification of HGTs in bacteria. It was also applied to infer the species tree of major bacterial groups from the set of 132 gene trees constructed for this study. We also developed an algorithm to identify ancestral duplications (Gorbunov, Lyubetsky, 2005) and used it in global searches. The total number of duplications was inferred at each node of the species tree. A duplication estimate that is high compared to the number of protein families might suggest a whole genome duplication at that node. Thus, 93 duplications were inferred at the root of the Archaea, which might suggest a whole genome duplication in their common ancestor. Total numbers of gene losses were estimated for taxonomic families of bacteria. Unlike duplication estimates, gene loss estimates drop considerably when HGTs are accounted for. The total number of gene losses over the entire tree is 9000 without accounting for HGTs and 8000 when HGTs are accounted for. Accounting for one HGT lowers the predicted number of losses by 4.4 on average.
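As an illustration of how duplications can be counted per species-tree node, the sketch below uses the standard LCA-mapping reconciliation: each gene-tree node is mapped to the lowest species-tree node containing all of its species, and it is scored as a duplication when it maps to the same node as one of its children. This is a generic textbook technique and is not claimed to be the algorithm of Gorbunov and Lyubetsky (2005). The nested-tuple tree encoding and all function names are assumptions made for the example, and HGT and loss inference are not covered.

```python
from collections import Counter


def leaf_set(tree):
    """Set of species names under a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def species_nodes(tree):
    """Leaf sets of every node of the species tree, one set per node."""
    nodes = [leaf_set(tree)]
    if not isinstance(tree, str):
        for child in tree:
            nodes.extend(species_nodes(child))
    return nodes


def lca(species_sets, taxa):
    """Lowest species-tree node (smallest leaf set) that contains `taxa`."""
    return min((s for s in species_sets if taxa <= s), key=len)


def count_duplications(gene_tree, species_tree):
    """Count duplication nodes of the gene tree per species-tree node.

    Gene-tree leaves are labelled with the species they were sampled from.
    """
    species_sets = species_nodes(species_tree)
    counts = Counter()

    def visit(node):
        mapped = lca(species_sets, leaf_set(node))
        if not isinstance(node, str):
            child_maps = [visit(child) for child in node]
            if mapped in child_maps:  # same mapping as a child: duplication
                counts[mapped] += 1
        return mapped

    visit(gene_tree)
    return counts


# Hypothetical example: species tree S and a gene tree G with two copies.
S = (("A", "B"), "C")
G = ((("A", "B"), "C"), ("A", "C"))
duplications = count_duplications(G, S)
```

Summing such counts over many gene trees gives a per-node duplication total of the kind discussed above; an unusually high total at one node relative to the number of families is the pattern that might suggest a whole genome duplication.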
--- Constructing the dataset to infer animal phylogeny ---
An integrated approach was taken to mine a variety of genomic resources for suitable phylogenetic markers (genes). The initial list of queries for mining the data banks was selected from among human genes contained in the KOGs (Tatusov et al., 2005) representing (a) ribosomal protein families and (b) monotypic protein families (represented by a single entry in each eukaryotic genome in a KOG). Human entries from the 178 protein families thus selected were used to query annotated animal genomes in the RefSeq portion of GenBank and obtain 178 sets of homologs for 5 representatives of 4 animal phyla. Analogously, homologs were retrieved for another 25 animal species from unannotated proteome resources (EST data), thus increasing the taxonomic sample to 11 animal phyla. At the latter step, several algorithms were employed for bulk assembly of EST sequences and translation of the resulting contigs, with corrections for base ambiguities and frameshift errors. The resulting homologous sets were analyzed with several methods to identify and sort out paralogous gene copies, with special precautions taken to identify paralogs in EST data. Protein families with more than 50% paralog content were omitted from the list. Multiple alignments constructed for each family were processed with the original methods (Lyubetsky et al., 2005; Rusin et al., 2007) to concentrate the available phylogenetic signal and assess their suitability for deep phylogeny reconstruction; non-informative protein families were also omitted from the list. The final list contains 51 families with representative sampling of 30 animal species from 11 phyla (Ctenophora, Cnidaria, Priapulida, Tardigrada, Platyhelminthes, Nematoda, Arthropoda, Annelida, Mollusca, Echinodermata, Chordata). The Table shows the distribution of the families among the NCBI functional annotation groups. A dataset of several dozen genes for more than four metazoan phyla has been constructed for the first time.
Intensive computational analyses are currently under way to infer metazoan relationships from this dataset, a result with far-reaching implications for various fields of fundamental and applied science.
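The paralog filter applied during dataset construction (families with more than 50% paralog content are omitted) can be sketched as follows. The abstract does not define how paralog content is measured, so this example assumes it is the fraction of a family's sequences flagged as paralogous; the family names, the flags and the function names are hypothetical.

```python
def paralog_fraction(flags):
    """Fraction of sequences in a family flagged as paralogous.

    `flags` is a list of booleans, one per sequence (assumed representation).
    """
    return sum(flags) / len(flags)


def filter_families(families, max_fraction=0.5):
    """Keep families whose paralog fraction does not exceed `max_fraction`."""
    return sorted(
        name
        for name, flags in families.items()
        if paralog_fraction(flags) <= max_fraction
    )


# Hypothetical input: three families with per-sequence paralog flags.
families = {
    "family_1": [False, False, False, True],  # 25% paralogs, kept
    "family_2": [True, False],                # exactly 50%, kept
    "family_3": [True, True, False],          # about 67%, omitted
}
retained = filter_families(families)
```

The threshold is inclusive at 50% because the text excludes only families above that level.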