iBet uBet web content aggregator. Adding the entire web to your favor.
iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: http://www.ncbi.nlm.nih.gov/pubmed/20534164
GIGA: a simple, efficient algorithm for gene tree inference in the genomic age - PubMed Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2010 Jun 9:11:312.
doi: 10.1186/1471-2105-11-312.

GIGA: a simple, efficient algorithm for gene tree inference in the genomic age

Affiliations

GIGA: a simple, efficient algorithm for gene tree inference in the genomic age

Paul D Thomas. BMC Bioinformatics. .

Abstract

Background: Phylogenetic relationships between genes are not only of theoretical interest: they enable us to learn about human genes through the experimental work on their relatives in numerous model organisms from bacteria to fruit flies and mice. Yet the most commonly used computational algorithms for reconstructing gene trees can be inaccurate for numerous reasons, both algorithmic and biological. Additional information beyond gene sequence data has been shown to improve the accuracy of reconstructions, though at great computational cost.

Results: We describe a simple, fast algorithm for inferring gene phylogenies, which makes use of information that was not available prior to the genomic age: namely, a reliable species tree spanning much of the tree of life, and knowledge of the complete complement of genes in a species' genome. The algorithm, called GIGA, constructs trees agglomeratively from a distance matrix representation of sequences, using simple rules to incorporate this genomic age information. GIGA makes use of a novel conceptualization of gene trees as being composed of orthologous subtrees (containing only speciation events), which are joined by other evolutionary events such as gene duplication or horizontal gene transfer. An important innovation in GIGA is that, at every step in the agglomeration process, the tree is interpreted/reinterpreted in terms of the evolutionary events that created it. Remarkably, GIGA performs well even when using a very simple distance metric (pairwise sequence differences) and no distance averaging over clades during the tree construction process.

Conclusions: GIGA is efficient, allowing phylogenetic reconstruction of very large gene families and determination of orthologs on a large scale. It is exceptionally robust to adding more gene sequences, opening up the possibility of creating stable identifiers for referring to not only extant genes, but also their common ancestors. We compared trees produced by GIGA to those in the TreeFam database, and they were very similar in general, with most differences likely due to poor alignment quality. However, some remaining differences are algorithmic, and can be explained by the fact that GIGA tends to put a larger emphasis on minimizing gene duplication and deletion events.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Decomposing a tree with a duplication event into orthologous subtrees (OS's). The example shows part of the methylene tetrahydrofolate reductase (MTHFR in human) gene family. This tree can be decomposed into OS's in two different ways: 1) the fungal MET13/met9 group remains in the same OS as its ancestors, while the MET12/met11 group founds a new OS, and 2) the MET12/met11 group remains in the same OS as its ancestors, while the MET13/met9 group founds a new OS. In both cases, the two OS's are sibling groups, because they contain genes descending from the duplication event, and in both cases the FCE of the more recent OS (the one with only genes from fungi) can be dated relative to speciation events, between the opisthokont common ancestor and the fungal common ancestor in this example. Species are abbreviated with the 5-letter UniProt code: CAEEL (C. elegans, nematode worm), CHICK (G. gallus, chicken), DANRE (D. rerio, zebrafish), DICDI (D. discoideum, cellular slime mold), HUMAN (H. sapiens, human), MOUSE (M. musculus, mouse), SCHPO (S. pombe, fission yeast), YEAST (S. cerevisiae, Baker's yeast).
Figure 2
Figure 2
GIGA rules for speciation events. Note that GIGA Rules 1 and 2 result in a different tree topology than standard agglomerative methods such as UPGMA. Because GIGA uses knowledge of the species tree, it postulates that the yeast MET13/met9 group is actually orthologous to the MTHFR genes from other organisms, but was not merged prior to the sequence from D. discoidieum (DICDI) due to accelerated evolutionary rate in the fungal lineage.
Figure 3
Figure 3
GIGA rules for duplication events. GIGA infers that a duplication must have occurred using Rule 2, as the two OS's being joined contain two genes from both Baker's yeast (YEAST) and fission yeast (SCHPO). It then places the duplication just prior to the most recent MRCA speciation event (fungi, in this case), which is the most parsimonious solution with respect to gene deletion events (Rule 3). Note that many other solutions are possible (two examples are shown below the most parsimonious case), but they require an increasing number of independent gene deletion events.
Figure 4
Figure 4
Core of the simple GIGA algorithm. OS = orthologous subtree, a portion of the gene tree containing only speciation events; FCE = founding copying event, the event (located relative to speciation events) that founded a given OS; in this first implementation of GIGA all FCEs are duplication events. The algorithm begins with each sequence in its own, separate OS. Each iteration operates on the currently closest pair of OS's. At each iteration, either 1) the two OS's are merged into a single OS (right side), or 2) one (or both) OS's are assigned FCEs. The tree is complete when all OS's have been assigned FCEs.
Figure 5
Figure 5
Characteristics of the TreeFam families used in this study (14,331 families with at least 4 sequences). (A) Distribution of average and minimum pairwise identity of families, (B) Distributions of number of sequences and protein alignment length.
Figure 6
Figure 6
CPU time required for tree reconstruction, note the log scale. GIGA is over 100 times faster than NJ and 1000 times faster than ML methods. (A) Dependence on number of sequences (alignment length is constant at 200-204). (B) Dependence on alignment length (number of sequences is constant at 20). The same alignments are used for each method.
Figure 7
Figure 7
Accuracy of GIGA trees: comparison with TreeFam clean trees for more than 14,000 TreeFam families. A) normalized RF distance comparing tree topology; B) ortholog pair difference (see text) is substantially smaller than the RF distance, indicating that many of the topological differences between TreeBeST and GIGA trees are due to disagreements in speciation event order.
Figure 8
Figure 8
Comparison of multiple different tree reconstruction methods. The RF distance of each pair of trees is plotted versus (A) the number of sequences in the family and (B) the length of the alignment to show the dependence on these parameters. GIGA and TreeBeST (blue diamonds) generally yield more similar trees than any other pair of methods, except for NJ-ML, which is of comparable similarity. The RF distance mean and standard deviation for each pair of methods is in the figure legend in parentheses. The same subsets of TreeFam families was used as for Figure 6.
Figure 9
Figure 9
Overlap between orthologs computed from GIGA and TreeBeST trees. GIGA infers 96% of orthologs inferred by TreeBeST, but also finds many additional orthologs, due mainly to minimization of implied gene duplication and deletion events.
Figure 10
Figure 10
Example of a tree with substantial disagreement in inferred duplication events, and corresponding orthologs, between TreeBeST (A) and GIGA (B), TreeFam family TF105095. The sequence alignment is of high quality according to PredictedSP, so this disagreement is due to algorithm differences rather than a problematic alignment. The main differences are in the inference of gene duplication events (orange nodes) in the CYP17A1 lineage (other than the recent duplications in the bovine lineage). (A) TreeBeST infers two duplication events (dup 1 and dup 2), both prior to the ray-finned fish-tetrapod divergence, followed by at least five separate deletion events: one prior to the frog-amniote divergence (del 1), one prior to the chicken-mammal divergence (del 2), one prior to the fish radiation (del 3), one following the divergence of the frog lineage (del 4), and one following the divergence of the chicken lineage (del 5). Note that according to this tree, there are no orthologs of human CYP17A1 in chicken, frog, or fish. (B) GIGA infers one duplication event, before the fish radiation (dup 1') and no deletion events. Note that according to this tree, there is one ortholog of human CYP17A1 in frog, one in chicken, and two in each fish species. Note also that tree (B) infers two periods of accelerated (potentially adaptive) molecular evolutionary rates, which may account for why a molecular evolution model would favor a topology with longer divergence times such as in (A).
Figure 11
Figure 11
Robustness of tree inference algorithms: histograms for GIGA and TreeBeST, for "clean" vs. "full" alignments for more than 14,000 TreeFam families. Full alignments include additional sequences, but the alignment is the same as for the clean set. An RF distance of 0 indicates that the tree topology is unchanged by adding more sequences. Overall, GIGA is more robust than TreeBeST to the perturbation of adding sequences.

Similar articles

Cited by

References

    1. Felsenstein J. Inferring Phylogenies. New York: Sinauer, Inc.; 2004.
    1. Barnabas J, Goodman M, Moore GW. Descent of mammalian alpha globin chain sequences investigated by the maximum parsimony method. J Mol Biol. 1972;69(2):249–278. doi: 10.1016/0022-2836(72)90229-X. - DOI - PubMed
    1. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–425. - PubMed
    1. Prager EM, Wilson AC. Construction of phylogenetic trees for proteins and nucleic acids: empirical evaluation of alternative matrix methods. J Mol Evol. 1978;11(2):129–142. doi: 10.1007/BF01733889. - DOI - PubMed
    1. Whelan S. Inferring trees. Methods Mol Biol. 2008;452:287–309. full_text. - PubMed

Publication types

LinkOut - more resources