GIGA: a simple, efficient algorithm for gene tree inference in the genomic age

BMC Bioinformatics. 2010 Jun 9;11:312.

Author

Paul D Thomas¹

Affiliation

¹ Evolutionary Systems Biology Group, SRI International, Menlo Park, CA, USA. pdthomas@usc.edu

PMID: 20534164
PMCID: PMC2905364
DOI: 10.1186/1471-2105-11-312

Abstract

Background: Phylogenetic relationships between genes are not only of theoretical interest: they enable us to learn about human genes through the experimental work on their relatives in numerous model organisms from bacteria to fruit flies and mice. Yet the most commonly used computational algorithms for reconstructing gene trees can be inaccurate for numerous reasons, both algorithmic and biological. Additional information beyond gene sequence data has been shown to improve the accuracy of reconstructions, though at great computational cost.

Results: We describe a simple, fast algorithm for inferring gene phylogenies, which makes use of information that was not available prior to the genomic age: namely, a reliable species tree spanning much of the tree of life, and knowledge of the complete complement of genes in a species' genome. The algorithm, called GIGA, constructs trees agglomeratively from a distance matrix representation of sequences, using simple rules to incorporate this genomic age information. GIGA makes use of a novel conceptualization of gene trees as being composed of orthologous subtrees (containing only speciation events), which are joined by other evolutionary events such as gene duplication or horizontal gene transfer. An important innovation in GIGA is that, at every step in the agglomeration process, the tree is interpreted/reinterpreted in terms of the evolutionary events that created it. Remarkably, GIGA performs well even when using a very simple distance metric (pairwise sequence differences) and no distance averaging over clades during the tree construction process.
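The "very simple distance metric (pairwise sequence differences)" mentioned above can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length and skips any column where either sequence has a gap; the gap handling, the function names (`p_distance`, `distance_matrix`, `closest_pair`), and the toy alignment are illustrative assumptions, not details taken from the GIGA implementation.

```python
from itertools import combinations


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Fraction of differing residues between two aligned sequences.

    Assumption: columns with a gap in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differences += 1
    # Assumption: pairs with no comparable columns get the maximal distance.
    return differences / compared if compared else 1.0


def distance_matrix(alignment: dict) -> dict:
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


def closest_pair(matrix: dict) -> tuple:
    """Return the pair of names with the smallest distance."""
    return min(matrix, key=matrix.get)


# Hypothetical toy alignment, for illustration only.
toy_alignment = {"g1": "MKTAY", "g2": "MKTAF", "g3": "MRSAF"}
toy_matrix = distance_matrix(toy_alignment)
```

An agglomerative method would repeatedly pick the closest pair from such a matrix; how GIGA then decides what to do with that pair is governed by its species-tree rules, which this sketch does not cover.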

Conclusions: GIGA is efficient, allowing phylogenetic reconstruction of very large gene families and determination of orthologs on a large scale. It is exceptionally robust to adding more gene sequences, opening up the possibility of creating stable identifiers for referring to not only extant genes, but also their common ancestors. We compared trees produced by GIGA to those in the TreeFam database, and they were very similar in general, with most differences likely due to poor alignment quality. However, some remaining differences are algorithmic, and can be explained by the fact that GIGA tends to put a larger emphasis on minimizing gene duplication and deletion events.

Figures

**Figure 1**
**Decomposing a tree with a duplication event into orthologous subtrees (OS's)**. The example shows part of the methylene tetrahydrofolate reductase (MTHFR in human) gene family. This tree can be decomposed into OS's in two different ways: 1) the fungal MET13/met9 group remains in the same OS as its ancestors, while the MET12/met11 group founds a new OS, or 2) the MET12/met11 group remains in the same OS as its ancestors, while the MET13/met9 group founds a new OS. In both cases, the two OS's are sibling groups, because they contain genes descending from the duplication event. In both cases, the FCE (founding copying event) of the more recent OS (the one with only genes from fungi) can be dated relative to speciation events: between the opisthokont common ancestor and the fungal common ancestor in this example. Species are abbreviated with the 5-letter UniProt code: CAEEL (*C. elegans*, nematode worm), CHICK (*G. gallus*, chicken), DANRE (*D. rerio*, zebrafish), DICDI (*D. discoideum*, cellular slime mold), HUMAN (*H. sapiens*, human), MOUSE (*M. musculus*, mouse), SCHPO (*S. pombe*, fission yeast), YEAST (*S. cerevisiae*, baker's yeast).

**Figure 2**
**GIGA rules for speciation events**. Note that GIGA Rules 1 and 2 result in a different tree topology than standard agglomerative methods such as UPGMA. Because GIGA uses knowledge of the species tree, it postulates that the yeast MET13/met9 group is actually orthologous to the MTHFR genes from other organisms, but was not merged prior to the sequence from *D. discoideum* (DICDI) due to an accelerated evolutionary rate in the fungal lineage.

**Figure 3**
**GIGA rules for duplication events**. GIGA infers that a duplication must have occurred using Rule 2, as the two OS's being joined each contain a gene from both baker's yeast (YEAST) and fission yeast (SCHPO). It then places the duplication just prior to the most recent MRCA speciation event (fungi, in this case), which is the most parsimonious solution with respect to gene deletion events (Rule 3). Note that many other solutions are possible (two examples are shown below the most parsimonious case), but they require an increasing number of independent gene deletion events.

**Figure 4**
**Core of the simple GIGA algorithm**. OS = orthologous subtree, a portion of the gene tree containing only speciation events; FCE = founding copying event, the event (located relative to speciation events) that founded a given OS; in this first implementation of GIGA all FCEs are duplication events. The algorithm begins with each sequence in its own, separate OS. Each iteration operates on the currently closest pair of OS's. At each iteration, either 1) the two OS's are merged into a single OS (right side), or 2) one (or both) OS's are assigned FCEs. The tree is complete when all OS's have been assigned FCEs.
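The figure captions state that a duplication is inferred when the two OS's being joined contain genes from the same species, and that otherwise the closest pair may be merged into a single OS. The sketch below shows only that species-overlap test. It is a simplified illustration, not the GIGA rule set: the actual rules also consult the species tree, which is not modeled here. The gene identifiers (name plus 5-letter UniProt species code) and the function names are assumptions made for this example.

```python
def species_of(gene_id: str) -> str:
    """Extract the species code from a hypothetical 'NAME_SPECIES' identifier."""
    return gene_id.rsplit("_", 1)[1]


def shared_species(os_a: set, os_b: set) -> set:
    """Species represented in both orthologous subtrees (OS's)."""
    return {species_of(g) for g in os_a} & {species_of(g) for g in os_b}


def join_action(os_a: set, os_b: set) -> str:
    """Decide what to do with the currently closest pair of OS's.

    Simplification: any shared species is taken as evidence of a copying
    event, so an FCE is assigned instead of merging; with no shared
    species, the pair is treated as mergeable.
    """
    return "assign_fce" if shared_species(os_a, os_b) else "merge"


# Illustrative OS's loosely based on the MTHFR example in the captions.
fungal_group_1 = {"MET13_YEAST", "met9_SCHPO"}
fungal_group_2 = {"MET12_YEAST", "met11_SCHPO"}
animal_group = {"MTHFR_HUMAN", "MTHFR_MOUSE"}
```

With these toy sets, the two fungal groups share both YEAST and SCHPO, so joining them calls for a duplication FCE, while a fungal group and the animal group share no species.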

**Figure 5**
**Characteristics of the TreeFam families used in this study (14,331 families with at least 4 sequences)**. (A) Distributions of average and minimum pairwise identity of families. (B) Distributions of number of sequences and protein alignment length.

**Figure 6**
**CPU time required for tree reconstruction (note the log scale)**. GIGA is over 100 times faster than NJ and 1000 times faster than ML methods. (A) Dependence on number of sequences (alignment length is constant at 200-204). (B) Dependence on alignment length (number of sequences is constant at 20). The same alignments are used for each method.

**Figure 7**
**Accuracy of GIGA trees: comparison with TreeFam clean trees for more than 14,000 TreeFam families**. (A) Normalized RF distance comparing tree topology. (B) Ortholog pair difference (see text), which is substantially smaller than the RF distance, indicating that many of the topological differences between TreeBeST and GIGA trees are due to disagreements in speciation event order.
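For readers unfamiliar with the RF (Robinson-Foulds) distance used in this comparison, here is a minimal sketch. It represents trees as nested tuples of leaf names, compares them as unrooted topologies, and normalizes by the total number of non-trivial splits in both trees. The tree encoding and this particular normalization are assumptions for illustration; the page does not state which normalization the paper used.

```python
def leaf_set(tree) -> frozenset:
    """All leaf names under a node; leaves are strings, internal nodes tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree) -> set:
    """Non-trivial bipartitions of the tree, viewed as unrooted."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # fixed leaf used to pick a canonical side
    found = set()

    def visit(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Store the side of the split that does not contain the reference leaf.
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b) -> float:
    """Symmetric difference of split sets divided by the total split count."""
    a, b = splits(tree_a), splits(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


# Hypothetical five-leaf trees, for illustration only.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = (("A", "B"), ("C", ("D", "E")))  # same unrooted topology as t1
```

Here `t1` and `t2` disagree on one of their two non-trivial splits, while `t1` and `t3` differ only in root placement and so have distance 0 under this unrooted definition.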

**Figure 8**
**Comparison of multiple different tree reconstruction methods**. The RF distance of each pair of trees is plotted versus (A) the number of sequences in the family and (B) the length of the alignment to show the dependence on these parameters. GIGA and TreeBeST (blue diamonds) generally yield more similar trees than any other pair of methods, except for NJ-ML, which is of comparable similarity. The RF distance mean and standard deviation for each pair of methods are given in parentheses in the figure legend. The same subsets of TreeFam families were used as for Figure 6.

**Figure 9**
**Overlap between orthologs computed from GIGA and TreeBeST trees**. GIGA infers 96% of orthologs inferred by TreeBeST, but also finds many additional orthologs, due mainly to minimization of implied gene duplication and deletion events.

**Figure 10**
**Example of a tree with substantial disagreement in inferred duplication events, and corresponding orthologs, between TreeBeST (A) and GIGA (B), TreeFam family TF105095**. The sequence alignment is of high quality according to PredictedSP, so this disagreement is due to algorithm differences rather than a problematic alignment. The main differences are in the inference of gene duplication events (orange nodes) in the CYP17A1 lineage (other than the recent duplications in the bovine lineage). (A) TreeBeST infers two duplication events (dup 1 and dup 2), both prior to the ray-finned fish-tetrapod divergence, followed by at least five separate deletion events: one prior to the frog-amniote divergence (del 1), one prior to the chicken-mammal divergence (del 2), one prior to the fish radiation (del 3), one following the divergence of the frog lineage (del 4), and one following the divergence of the chicken lineage (del 5). Note that according to this tree, there are no orthologs of human CYP17A1 in chicken, frog, or fish. (B) GIGA infers one duplication event, before the fish radiation (dup 1') and no deletion events. Note that according to this tree, there is one ortholog of human CYP17A1 in frog, one in chicken, and two in each fish species. Note also that tree (B) infers two periods of accelerated (potentially adaptive) molecular evolutionary rates, which may account for why a molecular evolution model would favor a topology with longer divergence times such as in (A).

**Figure 11**
**Robustness of tree inference algorithms: histograms for GIGA and TreeBeST, for "clean" vs. "full" alignments for more than 14,000 TreeFam families**. Full alignments include additional sequences, but the alignment is the same as for the clean set. An RF distance of 0 indicates that the tree topology is unchanged by adding more sequences. Overall, GIGA is more robust than TreeBeST to the perturbation of adding sequences.

Cited by

Stage-specific modulation of multinucleation, fusion, and resorption by the long non-coding RNA DLEU1 and miR-16 in human primary osteoclasts.
Moura SR, Sousa AB, Olesen JB, Barbosa MA, Søe K, Almeida MI. Cell Death Dis. 2024 Oct 11;15(10):741. doi: 10.1038/s41419-024-06983-1. PMID: 39389940.
PANTHER: Making genome-scale phylogenetics accessible to all.
Thomas PD, Ebert D, Muruganujan A, Mushayahama T, Albou LP, Mi H. Protein Sci. 2022 Jan;31(1):8-22. doi: 10.1002/pro.4218. Epub 2021 Nov 25. PMID: 34717010. Review.
Bayesian parameter estimation for automatic annotation of gene functions using observational data and phylogenetic trees.
Vega Yon GG, Thomas DC, Morrison J, Mi H, Thomas PD, Marjoram P. PLoS Comput Biol. 2021 Feb 18;17(2):e1007948. doi: 10.1371/journal.pcbi.1007948. eCollection 2021 Feb. PMID: 33600408.
PhyloGenes: An online phylogenetics and functional genomics resource for plant gene function inference.
Zhang P, Berardini TZ, Ebert D, Li Q, Mi H, Muruganujan A, Prithvi T, Reiser L, Sawant S, Thomas PD, Huala E. Plant Direct. 2020 Dec 30;4(12):e00293. doi: 10.1002/pld3.293. eCollection 2020 Dec. PMID: 33392435.
Unilateral L4-dorsal root ganglion stimulation evokes pain relief in chronic neuropathic postsurgical knee pain and changes of inflammatory markers: part II whole transcriptome profiling.
Kinfe TM, Asif M, Chakravarthy KV, Deer TR, Kramer JM, Yearwood TL, Hurlemann R, Hussain MS, Motameny S, Wagle P, Nürnberg P, Gravius S, Randau T, Gravius N, Chaudhry SR, Muhammad S. J Transl Med. 2019 Jun 19;17(1):205. doi: 10.1186/s12967-019-1952-x. PMID: 31217010. Clinical Trial.

Grants and funding

R01GM081084/GM/NIGMS NIH HHS/United States
