Abstract
Background
Genome level analyses have enhanced our view of phylogenetics in many areas of the tree of life. With the production of whole genome DNA sequences of hundreds of organisms and large-scale EST databases, a large number of candidate genes for inclusion into phylogenetic analysis have become available. In this work, we exploit the burgeoning genomic data being generated for plant genomes to address one of the more important plant phylogenetic questions concerning the hierarchical relationships of the several major seed plant lineages (angiosperms, Cycadales, Ginkgoales, Gnetales, and Coniferales), which continues to be a work in progress, despite numerous studies using single, few or several genes and morphology datasets. Although most recent studies support the notion that gymnosperms and angiosperms are monophyletic and sister groups, they differ on the topological arrangements within each major group.
Methodology
We exploited the EST database to construct a supermatrix of DNA sequences (over 1,200 concatenated orthologous gene partitions for 17 taxa) to examine non-flowering seed plant relationships. This analysis employed programs that offer rapid and robust orthology determination of novel, short sequences from plant ESTs based on reference seed plant genomes. Our phylogenetic analysis retrieved an unbiased (with respect to gene choice), well-resolved and highly supported phylogenetic hypothesis that was robust to various outgroup combinations.
Conclusions
We evaluated character support and the relative contribution of numerous variables (e.g. gene number, missing data, partitioning schemes, taxon sampling and outgroup choice) on tree topology, stability and support metrics. Our results indicate that while missing characters and order of addition of genes to an analysis do not influence branch support, inadequate taxon sampling and limited choice of outgroup(s) can lead to spurious inference of phylogeny when dealing with phylogenomic scale data sets. As expected, support and resolution increase significantly as more informative characters are added, until reaching a threshold, beyond which support metrics stabilize, and the effect of adding conflicting characters is minimized.
Introduction
Genome level analyses have enhanced our view of phylogenetics in many areas of the tree of life. With the production of whole genome DNA sequences of hundreds of organisms and large-scale EST databases as well as the incorporation of other genome-enhanced technologies [1]–[4], a large number of candidate genes for inclusion into phylogenetic analysis have become available. In this work, we exploit the burgeoning EST database and the steadily growing number of whole plant genomes to address one of the more important phylogenetic questions concerning the hierarchical relationships of the major seed plant lineages (angiosperms, Cycadales, Ginkgoales, Gnetales, and Coniferales).
The elucidation of spermatophyte phylogeny continues to be a work in progress, despite numerous studies using single, few or several genes and morphology datasets (morphological: [5]–[9]; and molecular: [10]–[16]) as recently and extensively reviewed [17]. Although most recent studies support the notion that gymnosperms and angiosperms are monophyletic and sister groups, they differ on the topological arrangements within each major group (Figure 1). Many current studies support the placement of Gnetales and conifers as closely-related groups, either as sister clades (Panel B), or with Gnetales as a nested group within the conifers (Panel D). In both of these hypotheses, cycads are the basal clade, followed by Ginkgo. A fourth hypothesis places the Gnetales as basal gymnosperms, with conifers and Ginkgo plus cycads as later-branching sister groups. This hypothesis first emerged through the analysis of the plastid genes rbcL and rpoC1 [18], [19] and multiple plastome genes [20], and again with phytochrome genes [13], [21] and some genes involved in development [16], [22], [23], but it has generally remained marginal and controversial.
In a previous publication [11], we incorporated Expressed Sequence Tags (ESTs) together with complete protein sequences plus a morphology matrix into a phylogenetic analysis of the seed plants. The concatenation and simultaneous analysis of 43 data partitions yielded a well resolved, single most parsimonious tree with reasonable bootstrap support. In that study we demonstrated the pertinence of using ESTs as a source of phylogenetic characters, provided there is adequate orthology determination. We also stressed the importance of assessing character support in more robust and consistent ways before declaring a phylogenetic question confidently resolved. Given the diverse origins, roles and evolutionary histories of all genes within a particular genome, issues of character support and conflict are relevant when considering the overall history of a taxonomic group, and it appears sensible to consider as many sources of evidence as possible (and available). In this context, the question of where to stop adding characters to a phylogenomic analysis [24] remains open and a high priority for the careful and efficient planning of sequencing projects across all phyla.
Although our earlier approach [11] proved to be very effective in estimating character support and conflict, as well as supporting the case for the use of ESTs in phylogenetic analysis, it was clear more character information was needed to provide stronger support in the resolution of spermatophyte phylogeny. An increase in total characters, but especially an increase in phylogenetically informative characters, would augment both apparent and hidden support in all gymnosperm clades, and provide stronger support for inferences on the hierarchical relationships among the taxa involved. The burgeoning EST and sequencing projects being conducted across genomes make such character information available at an accelerated and sustained pace. One of the main criticisms of phylogenetic projects employing whole- or partial-genome sequences is that with the scarcity of comprehensive genomic or subgenomic data for a large number of taxa, the analyses would retrieve phylogenies for very few taxa that, even if well-resolved and strongly-supported, would represent incorrect evolutionary reconstructions (e.g. [25]). Moreover, Gatesy et al. [26] showed that choice of ingroup taxa at the root of the tree and, more importantly, outgroup choice in deep phylogenomic studies is critical. In the current report, we have expanded taxonomic representation to 17 species, compared to the original six-ingroup, single-outgroup taxa study of de la Torre et al. [11], and expanded the number of gene partitions to more than 1,200.
Materials and Methods
Orthology prediction
In order to generate a comprehensive molecular matrix to address the phylogenetic questions of flowering versus non-flowering seed plants, we searched the TIGR Plant Transcript Assemblies database (http://plantta.jcvi.org) for well-sampled representatives of all major seed plant groups. Our database search for available EST/unigenes (from a total of 226,210 EST assemblies and singletons) from well-sampled representative members of major seed and seed-free plant groups retrieved a total of 158,358 genes from complete genomes (Arabidopsis, rice, and poplar), and between 16,000 and 22,000 total unigenes (depending on the dataset) from ESTs for all other species included in various versions of the analysis. In all, the following species were surveyed: Arabidopsis thaliana, Oryza sativa (common rice), Amborella trichopoda, Vitis vinifera (common grape vine), Populus trichocarpa (California poplar) (angiosperms); Cycas rumphii (Malayan fern palm), Zamia fischeri, Ginkgo biloba, Gnetum gnemon (melinjo, bago, peesae), Welwitschia mirabilis, Cryptomeria japonica (Japanese cedar), Pinus taeda (Loblolly pine) (gymnosperms) as ingroup taxa; Selaginella moellendorffii (lycophyte), Adiantum capillus-veneris (Filicalean fern), Marchantia polymorpha (liverwort), Physcomitrella patens (moss) and Chlamydomonas reinhardtii (unicellular green alga) as outgroups. All available assembled EST databases, independent of their source (tissue, developmental stage, or type of experiment) were surveyed. Using these unigenes, the OrthologID software pipeline ([27]; http://nypg.bio.nyu.edu/orthologid) was employed to predict orthologous groups resulting in fully aligned matrices composed of 926–1,600 gene or ortholog partitions. The variance in the number of orthologs depended on the filtering schemes discussed below. These ortholog groups consisted mostly of translated EST sequence data.
Ortholog filtering
OrthologID identifies all genes that are orthologous amongst the taxon set under examination [27]. Due to the incomplete nature of the EST database, oftentimes the resulting orthologous groups will include only a few taxa. In addition, the available orthologs can be distributed in specific and narrowly defined taxonomic groups. We reasoned that the inclusion of partitions with three or fewer orthologs would add little to the robustness of the present analysis, so we developed a filtering function in our informatics analysis pipeline that removed any ortholog sets that had fewer than four taxa with genes in the ortholog group. In addition, we restricted the distribution of this filtering to include only those ortholog groups with at least three ingroup taxa (specifically at least two gymnosperms and one angiosperm) and one outgroup taxon per partition. We arrived at a comprehensive dataset formed by 12 ingroup species and 4 outgroup species. We found that using all available outgroups resulted in the retrieval of the largest number of bona fide orthologous partitions (1,239) with the filtering scheme specifying the minimal presence of three ingroup taxa (two gymnosperms and one angiosperm) and one outgroup per partition. The resulting ortholog groups comprise genes that are randomly distributed throughout the genome as demonstrated by mapping the loci on the chromosome map of Arabidopsis thaliana (Figure S1). This somewhat compensates for the general bias of EST and transcriptome data, which most often show enrichment for genes implicated in metabolism, energy and general housekeeping, and an underrepresentation of functional categories such as gene regulation. Still, our dataset comprises an array of orthologous genes belonging to diverse functional categories (Figure S2) including transcriptional regulators and signaling genes.
The fact that statistical tests (z-scores, Sungear [28]; data not shown) show a lack of overrepresentation of these categories further suggests that our ortholog sample is more balanced (i.e. less biased) than any previously reported for similar studies of EST data.
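The minimum-representation rule described above can be illustrated with a short Python sketch. This is not the filtering function from the authors' pipeline; the genus-level taxon labels, the function names and the dictionary layout (partition name mapped to a taxon-to-sequence dictionary) are assumptions made for illustration only.

```python
# Hypothetical genus-level labels; group membership follows the species list in the text.
GYMNOSPERMS = {"Cycas", "Zamia", "Ginkgo", "Gnetum", "Welwitschia", "Cryptomeria", "Pinus"}
ANGIOSPERMS = {"Arabidopsis", "Oryza", "Amborella", "Vitis", "Populus"}
OUTGROUPS = {"Selaginella", "Adiantum", "Marchantia", "Physcomitrella"}


def passes_filter(taxa):
    """Return True if a partition's taxon set meets the minimum-representation rule:
    at least three ingroup taxa (two or more gymnosperms, one or more angiosperms)
    and at least one outgroup taxon."""
    taxa = set(taxa)
    n_gymno = len(taxa & GYMNOSPERMS)
    n_angio = len(taxa & ANGIOSPERMS)
    n_out = len(taxa & OUTGROUPS)
    return (n_gymno + n_angio) >= 3 and n_gymno >= 2 and n_angio >= 1 and n_out >= 1


def filter_partitions(partitions):
    """Keep only partitions (name -> {taxon: sequence}) that pass the rule."""
    return {name: seqs for name, seqs in partitions.items() if passes_filter(seqs)}
```

Under these assumptions, a partition containing Cycas, Pinus, Oryza and Marchantia is retained, whereas one with a single gymnosperm or without any outgroup sequence is dropped.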
Construction of a comprehensive seed plant phylogenomic matrix
Once the ortholog groups were established as detailed above, we used the Perl script ASAP (Automated Simultaneous Analysis Phylogenies; [29]) to organize and construct a matrix. This program automatically constructs a matrix with named partitions into gene name, GO category, and other informatics categories. The concatenated partitioned matrix can be found in Document S1.
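As a rough illustration of what concatenating partitions into a single matrix involves, the following Python sketch joins aligned partitions and pads absent taxa with a missing-data symbol. It is not ASAP itself; the function name, the "?" missing-data code and the data layout are assumptions.

```python
def concatenate(partitions, taxa, missing="?"):
    """Concatenate aligned partitions into one supermatrix.

    partitions: dict of partition name -> {taxon: aligned sequence}
    taxa: ordered list of all taxa in the final matrix
    Returns (matrix, ranges): matrix maps taxon -> concatenated sequence,
    ranges maps partition name -> (start, end) as 0-based half-open columns.
    """
    matrix = {t: [] for t in taxa}
    ranges = {}
    offset = 0
    for name, seqs in partitions.items():
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"partition {name!r} is not aligned")
        width = lengths.pop()
        for t in taxa:
            # Taxa absent from this partition are padded with the missing-data symbol.
            matrix[t].append(seqs.get(t, missing * width))
        ranges[name] = (offset, offset + width)
        offset += width
    return {t: "".join(parts) for t, parts in matrix.items()}, ranges
```

Keeping the column range of every partition alongside the matrix is what later makes partition-by-partition support measures possible.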
Phylogenomic analyses
The phylogenetic matrix was analyzed using maximum parsimony (MP) and maximum likelihood (ML) optimality criteria. Parsimony analysis was performed in PAUP* 4b10 [30] using equal weights. Node support was evaluated using the nonparametric bootstrap and jackknife methods in PAUP. Pairwise phylogenetic congruence across all partitions was tested using the ILD test (incongruence length difference; [31], [32]) in PAUP. While this measure has been criticized recently [33]–[36], we chose to use this test conservatively in the context of this study. Branch support measures, such as the Bremer index [37], partitioned branch support [38], and hidden branch support [39], were calculated in ASAP in conjunction with PAUP. Maximum likelihood inference was carried out in RAxML 7.0.4 [40] at the AMNH Computational Sciences facility on an 8-way server with 2.2 GHz AMD Opteron 846 processors and 128 GB RAM using the fine-grained parallel Pthreads (POSIX Threads Library; [41]) and on the CIPRES cluster (http://www.phylo.org) using the MPI (Message Passing Interface; [42], [43]) implementations. The substitution model best fitting the data was selected in ProtTest [44] by contrasting each model inference's log-likelihood score. The JTT model [45] yielded the highest likelihood score and therefore was used in ML inference taking into account empirical amino acid frequencies calculated directly from the data in hand (Document S2). Among-site rate heterogeneity was accounted for using the CAT approximation model [46] with 25 site rate categories. Node support was quantified with 1625 rapid bootstrap pseudo-replicates as implemented in the parallel versions of RAxML [47]. In order to explore outgroup choice on tree topology, we performed a series of searches, with different combinations of ingroup and outgroup taxa. These manipulations are summarized in Figure S3.
We also explored the effect of missing taxa on the overall phylogenetic hypothesis by measuring the amount of branch support (BS) and partitioned hidden branch support (PHBS) for trees generated by serial nested additions of ingroup taxa (3–11). This analysis involved serially adding partitions with up to 3 taxa, then up to 4 taxa, and so on, so that the matrix kept expanding as partitions with more taxa were added.
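The serial nested addition scheme can be sketched as follows. This is a minimal illustration of how the cumulative partition sets might be assembled, assuming partitions are stored as a dictionary of taxon-to-sequence dictionaries; the function name and defaults are hypothetical, and the tree searches and support calculations run on each set are not shown.

```python
def nested_partition_sets(partitions, ingroup, min_taxa=3, max_taxa=11):
    """Yield (k, names), where names lists every partition that has between
    min_taxa and k ingroup taxa. Each set contains the previous one, so the
    matrix only grows as k increases."""
    ingroup = set(ingroup)
    counts = {name: len(set(seqs) & ingroup) for name, seqs in partitions.items()}
    for k in range(min_taxa, max_taxa + 1):
        yield k, sorted(n for n, c in counts.items() if min_taxa <= c <= k)
```

Each yielded set would then be concatenated and analyzed, so that support values can be compared across increasingly inclusive matrices.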
Results
The impact of outgroup choice on seed plant phylogenetics
In order to address the issue of random rooting [26], [48] we chose to break up the long root to the seed plants by including additional outgroup taxa (Physcomitrella, Marchantia, Selaginella, and Adiantum). Species chosen to implement this approach fulfilled two criteria: known phylogenetic relevance and good representation in the database. The results are shown in Figure 2. The relative placement of gymnosperm groups changes as outgroup taxa are excluded or rooting is forced on certain seed plant taxa. If no outgroups are specified, trees behave differently depending on whether (and which) seedless taxa are included. When the unicellular green alga Chlamydomonas and/or the moss Physcomitrella are included, cycads and Ginkgo nest within the conifers, and Gnetales appear basal. When only the heterosporous lycophyte, Selaginella (or any of the seed plants) is used to root the tree, Gnetales and conifers group together, and form a sister group to cycads and Ginkgo. Forcing the latter to be the outgroup does not change the relative positions of the former. Gap-coding the matrix results in similar arrangements, except for Cryptomeria, which falls outside the gymnosperms – probably due to insufficient amounts of informative characters.
Figure S2 suggests that the effect of long branch attraction or random rooting can be neutralized by multiple outgroup analysis. In fact, our resulting tree topology remains stable and robust regardless of which outgroup or outgroup combination we use (including no outgroup, when rooting with any of the seed plants), suggesting we might have reached a large enough number of informative characters to render a highly robust topology, immune to outgroup choice. In all subsequent analyses we removed Chlamydomonas because it appears to have extreme random root effects [26], [48] and because we replaced it with four other more appropriate outgroups.
A robust phylogenomic hypothesis focused on the relationships of major seed plant groups
Phylogenetic analysis of the most inclusive matrix we constructed (72,900 informative characters from 16 species) resulted in a single most parsimonious tree with very high measures of branch support. Figure 2A shows the MP tree of 12 seed plant ingroup taxa rooted with all four outgroup taxa (non-seed plants). Bootstrap and jackknife support values are all at or near 100%. Bremer decay values vary, but all are above double-digits. Higher-level inferences of relationships are consistent with most previous molecular analyses, showing gymnosperms as a monophyletic group sister to the angiosperms. As expected, angiosperm species conform to the well-accepted view that Amborella is basal to all flowering plants, followed by the separation between monocots (Oryza) and the eudicots Arabidopsis, Vitis, and Populus [25], [49]. Not surprisingly, as two of these species are fully sequenced, all measures of support for angiosperm groupings are very high (Bremer indices in the triple-digits).
The grouping of gymnosperms in the expanded analysis shown in Figure 2A is different from the one observed in our previous study [11], which placed cycads as the earliest diverging branch followed by Ginkgo, and then the Gnetales and conifers as sister taxa deeper in the gymnosperm clade (i.e., a pectinate gymnosperm clade). We point out that in the present study, the tree generated differs from the previous one not only in the overall number of taxa, where the ingroup is doubled, and the outgroup is quadrupled, but also in the overall placement of gymnosperm taxa. The MP tree (Figure 2A) shows Gnetum and Welwitschia (which form a solid monophyletic group) branching early and forming a sister clade to all other gymnosperms.
Notably, the topology of the phylogenomic tree shown in Figure 2A does not agree with two prior hypotheses. The first proposes that all conifers are sister to Gnetales, and the second proposes that the Gnetales are nested within the conifers, in particular placed as sister to conifers I (e.g. [10]; see Figure 1B and 1D). In addition, our initial hypothesis [11] that cycads, followed by Ginkgo, could be the earliest diverging extant gymnosperms is not supported in this larger analysis. Instead, the present analysis seems to provide robust support for the hypothesis that Gnetales are the earliest diverging gymnosperm lineage (Figure 1C), even though they are the most recent group in the seed plant fossil record. This hypothesis was previously postulated using phytochrome genes as data sources [13] and in other analyses using the chloroplast gene rpoC1 [19], AGL6 [16], [22], and Floricaula/LEAFY [23]. Figure 2 shows the maximum likelihood (ML) tree that agrees entirely with the MP tree topology. This tree has robust (100%) likelihood bootstrap values at all nodes with the exception of the node supporting the clade (Selaginella,(Marchantia,Physcomitrella)) at 54%. The final log-likelihood score and branch lengths were optimized with the GAMMA model of rate heterogeneity in RAxML and yielded a score of −3989109.546056 and an α (alpha) shape parameter of the Γ (Gamma) distribution of 0.720925.
Missing taxa have a significant effect on tree topology and support – relevance to EST phylogenomics
Previous studies using both simulated (e.g. [50]) and real (using ESTs: e.g. [51], [52]) datasets have tested whether large amounts of missing taxa have a significant effect on the topology and support of a phylogenetic analysis. This type of analysis is particularly relevant to EST studies as the probability of obtaining a full complement of taxa for a particular ortholog is reduced as the number of taxa in the analysis increases (see [53] for an example in animals). This approach is generally accomplished by comparing support metrics and topology changes on datasets with and without given combinations of missing taxa. All existing results (with little change in these factors for compared datasets) have hitherto suggested that large numbers of missing taxa per se do not alter either the signal or support values. However, when “missing taxa” also means too few available characters for a correct call regarding taxon placement, the negative effect is indeed dramatic.
Our analysis on the 43-partition matrix [11] revealed that subtracting partitions with high taxon representation did collapse many branches or significantly lower overall support, although the exclusion of these taxon-dense partitions also meant the removal of crucially informative character information. We explored the effect these missing taxa had on the overall phylogenetic hypothesis by comparing the amount of branch support and hidden branch support for each node using partitions where information was available for 7, 6, 5, and 4 taxa.
As shown in Figure 3, tree support values increase dramatically as more partitions with fuller taxon complements are added. This result could argue for the exclusion of partitions with a low number of taxa. When analyzing individual partitions, it is clear that trees from those with fewer taxa have fewer informative characters, fewer resolved clades and, ultimately, lower support values across the board. However, we also suggest keeping those partitions with even minimal character information, as these partitions may often prove valuable in the resolution of a single clade or clades within the tree.
We also explored the effect of missing taxa on the overall phylogenetic hypothesis. Figure 4 depicts how the data in our study relate to the compromise between increasing the number of characters and the number of taxa. Given our choice of taxa, and the current sequence availability for each species (indicated on the X axis of Figure 4), a peak of informative characters and related bootstrap values (a “phylogenetic sweet spot” of sorts) is reached between 5 and 6 taxa. That is, genes that are found in five or six of the taxa in this study, when combined, have more parsimony-informative characters and higher overall bootstrap values. This result arises because there are fewer and fewer genes with fuller taxonomic representation in the EST database.
This result does not necessarily mean the incorporation of additional taxa is of no value. Potentially important character information is still obtained when adding more taxa. While this illustrates the effect of missing taxa for genes in the EST database, an analysis will benefit from the compounded information obtained from including all partitions containing 3 to 9 ingroup taxa, below and above which phylogenetic information will be null. In theory, the upper limit will shift to the right as more genomes are sequenced, until reaching an absolute limit, given by evolutionary – not technological – constraints, i.e. a real lack of overlap for several genes among species. As seen before, even when adding incomplete partitions (i.e. with varying amounts of taxon representation within the partition), support increases radically as more parsimony-informative sequence data are added. This result indeed argues for the inclusion of all information available, as long as a minimum of 3 ingroup and one outgroup species is maintained in each partition.
Analysis of individual partitions
As shown previously for seed plants [11] and yeast species [24], analysis of trees generated with individual data partitions reveals large disagreement with the simultaneous analysis tree hypothesis. Yet, as shown in earlier studies (e.g. [11], [54], [55]), most, if not all, of such apparent incongruence is statistically significant using the ILD test. We employed this test in order to explore the interaction among data partitions within our dataset and the degree of incongruence at the character level. Due to computational constraints within PAUP, we limited the number of individual pairwise comparisons: we generated random samples of paired ILD comparisons corresponding to 10% of the total dataset, and performed pairwise ILD tests on this random sample of combinations (data not shown).
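Drawing a random subset of pairwise comparisons can be sketched in Python as below. The text does not specify exactly how the 10% sample was drawn, so this sketch assumes it means 10% of all possible partition pairs; the function name, the fixed seed and the rounding rule are likewise assumptions. The ILD test itself would be run in PAUP on each sampled pair and is not reproduced here.

```python
import itertools
import random


def sample_partition_pairs(names, fraction=0.10, seed=0):
    """Return a reproducible random sample of unordered partition pairs.

    names: iterable of unique partition names
    fraction: share of all possible pairs to keep (assumed to be 10%)
    """
    pairs = list(itertools.combinations(sorted(names), 2))
    k = max(1, round(fraction * len(pairs))) if pairs else 0
    return random.Random(seed).sample(pairs, k)
```

For 1,239 partitions there are 766,941 possible pairs, which gives a sense of why exhaustive pairwise testing was impractical.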
We evaluated the effect of increasing character information (PI, parsimony-informative) on both bootstrap and Bremer support values. Figure 3 reflects a definite overall increase in bootstrap metrics as the number of PI characters goes up, but shows different behaviors for each. This trend continues without a clear limit or plateau. The variance makes sense, as the very nature of these metrics changes as a function of the addition of new data partitions with varying degrees of supporting and conflicting character information. By contrast, traditional Bremer support values show an overall upward trend but reach a clear plateau after the 900-partition mark (∼30,000 PI characters), and remain unchanged even after more data partitions are added. This trend holds well above the 40,000 PI character mark (Figure 3).
Bootstrap support values show a slightly different trend (Figure 3, and Figure S4). Bootstrap averages rise steadily at first and then plateau within a limited range between 91 and 96% past the 780-partition mark (∼20,000 PI characters) even as many more PI characters continue to be added. This result again suggests enough character information is present in the matrix to support the concatenated tree topology in >90% bootstrap replicates, but enough conflicting information is present to account for mild oscillations. Near-100% bootstrap and jackknife values are reached in most tree nodes (e.g. trees in Figure 2). Inclusion of differing character information in a concatenated approach is still preferred as a more accurate approximation to the true species phylogeny, as evidenced by the retrieval of a single, total evidence tree with high support values even though large amounts of significantly conflicting data (data not shown) are present in the combined dataset.
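For readers unfamiliar with the term, a character (alignment column) is parsimony-informative when at least two different states each occur in at least two taxa. The sketch below counts such columns in a matrix. It is an illustration only: treating "?", "-" and "X" as missing data is an assumption, as are the function names, and it does not reflect the gap-coded variant of the matrix.

```python
from collections import Counter

# Assumed codes for missing data, gaps and ambiguous residues.
MISSING = {"?", "-", "X"}


def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(c for c in column if c not in MISSING)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_pi(matrix):
    """Count parsimony-informative columns in a taxon -> aligned sequence dict."""
    seqs = list(matrix.values())
    return sum(is_parsimony_informative(col) for col in zip(*seqs))
```

Applied to successively larger concatenated matrices, a count of this kind gives the PI-character totals against which support values are plotted.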
Partitioned analysis reveals the behavior of character support and conflict
By using partitioned support metrics, both hidden and apparent, we were able to identify those individual partitions contributing various degrees of positive, negative or null support to the all-evidence topology (Figure 2B). Most partitions (>50%) contribute no hidden support to the concatenated analysis tree, while roughly 22% contribute positive hidden support, and about 15% contribute negative support to the simultaneous analysis tree. This means only about 1/6 of all data partitions contain characters that actually conflict with the concatenated analysis hypothesis and result in worse tree length scores, and only about half of these (i.e. ∼8% of all partitions) contribute more than three steps of negative hidden support. In contrast, more than half of the positively supporting partitions (i.e. >12% of all partitions) contribute more than three steps of positive hidden support to the simultaneous analysis hypothesis.
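The classification of partitions into positive, negative and null hidden support can be sketched as follows, using the usual definition in which a partition's hidden support is its partitioned branch support in the combined analysis minus its branch support in a separate analysis. The input values are hypothetical, and summing each partition's values over all nodes before calling these functions is an assumption; in this study the underlying quantities were computed with ASAP and PAUP.

```python
def hidden_support(pbs, bs_separate):
    """Partitioned hidden branch support per partition.

    pbs: partition -> partitioned branch support in the combined analysis
    bs_separate: partition -> branch support from that partition analyzed alone
    """
    return {name: pbs[name] - bs_separate[name] for name in pbs}


def summarize(phbs):
    """Fraction of partitions with positive, negative and null hidden support."""
    n = len(phbs)
    pos = sum(v > 0 for v in phbs.values())
    neg = sum(v < 0 for v in phbs.values())
    return {"positive": pos / n, "negative": neg / n, "null": (n - pos - neg) / n}
```

A further threshold (for example, an absolute value above three steps) can be applied to the same dictionary to separate strongly supporting or conflicting partitions from marginal ones.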
Discussion
Implications for seed plant phylogeny
While our initial approach used for 43 partitions and 7 seed plant species in a previous study [11] may have been appropriate to explore the utility of EST data in phylogenetic analysis, limited taxon sampling and choice of outgroups most definitely influenced the retrieval of a conflicting topology to that presented here which is based on >1,200 partitions from 16 plant species. The fact that we have used a relatively unbiased EST sampling method, the sheer number of informative characters and additional taxa, and the various tests for robustness described earlier, all make us prefer either of the current trees to any previous phylogenetic hypotheses for the seed plants. This result also supports a long-standing observation that high bootstrap values reflect the local concordance of the topology with the data, but provide little indication of the approximation of a particular topology/dataset to the true species phylogeny. Two completely different topologies reflecting relationships among comparable groups of data may both have equally high bootstrap values, and still fall far from the true species tree [24].
While our current hypothesis still reveals a few branches lacking in robustness – a problem that will most likely be solved by adding more sequence data from currently under-represented species – our analysis nonetheless puts forward several well corroborated hypotheses concerning seed plant phylogeny, namely:
Gymnosperms are a monophyletic group, sister to the angiosperms.
Amborella is confirmed as a basal angiosperm, sister to monocots and eudicots.
Gnetales and conifers are separate, monophyletic groups (i.e. not nested within one another).
Cycads and Ginkgo form a clade and therefore share a common ancestor.
Additionally, our results suggest Gnetales may indeed be the sister group to the rest of the extant gymnosperms. While it is conceivable that further taxon addition may falsify this hypothesis in the future, the high support values for the tree in Figure 2, together with our observations in serial addition experiments, support the basal placement of Gnetales within the gymnosperms. Furthermore, alternative hypotheses using conflicting partitions and partition sets are poorly resolved, do not agree on a particular alternative, and generally receive poor support values.
While the topology of the Gnetales as sister to the rest of the gymnosperms may be considered unconventional, it is quite interesting to note that this topology has been retrieved from individual gene trees such as rpoC1 and rbcL as well as the noncoding regions of the inverted repeat representing the plastome [18]–[20], and from the nuclear genome using phytochrome genes [13], [21], the agamous genes AGL6 and AGL-like genes [16], [22], and Floricaula/LEAFY [23]. Besides representing multiple genes from two different genomes, these data also represent a diversity of functions within the plants. Therefore, it should come as no surprise that this topology could be supported by our data set. Moreover, the analysis represents a substantial data set that is not only consistent with a basal position for Gnetales either as sister to all other seed plants or as sister to the rest of the gymnosperms but also when analyses include bryophyte, lycophyte and pteridophyte outgroups, as for example in the analysis of rbcL data [56]. It should also be noted that the ages of known fossils are minimum ages, so the young age for Gnetales is simply that: a minimum age.
Impact of outgroup choice
The observation that a denser ingroup taxon sampling did not have a major effect on tree topology beyond a certain point, but a change in outgroup taxa identity and number did make gymnosperm relationships vary significantly, stimulated us to look at outgroup choice in more detail. This problem is an issue observed in previous studies yet largely overlooked in the literature regarding this group [10]. In general, this issue has not been addressed in a systematic way, either because not all gymnosperm groups have been included, or because not enough taxa have been sampled (e.g. [10], [57], [58]). Alternatively, the problem is avoided altogether by rooting with a seed plant – either an angiosperm or with what are usually considered the more primitive gymnosperms (i.e. Gnetales; see Figure 2). Overall, our data indicate that outgroup choice can severely influence tree topology in datasets with lower numbers of informative characters, but that the addition of more informative characters can lead to a point where outgroup choice plays a minimal role.
Support and the seed plant phylogenetic tree
Due to the incomplete nature of EST information, both as it refers to the absence of full-length sequences as well as to the randomness of sequencing for each species sampled, many of our partitioned matrices had different degrees of missing data. Throughout the present analysis, we evaluated the effect of missing characters and missing taxa on tree topology and branch support. We conclude that increasing taxon sampling is crucial in retrieving precise and unambiguous phylogenies, and outgroup choice can be a determining factor in resolving controversial phylogenies by minimizing the effect of long branch attraction. In contrast, missing characters do not seem to play a significant role in altering support metrics, as long as informative characters are present to resolve species relationships. Similarly, gene order does not appear to be a determining factor, while the effect of gene identity becomes less and less significant as the number of (randomly-selected) partitions increases. Ultimately, and as a representative sample of the species' genomes is approached, this variable will end up playing a minimal, marginal role in influencing support values.
A single EST-based tree may have well supported clades that have reached a limit – or plateau – of support (such as nodes 4–8) coexisting with poorly supported nodes (e.g. 1, 2, 11), which do not have enough character information to support them. These comparisons also suggest how relative the character-to-support relationship may be. For instance, Zamia and Amborella, both with ∼14% matrix representation, do not change their position relative to Cycas or other angiosperms, respectively, while Pinus and Cryptomeria (with >80% and ∼15%, respectively) still struggle to “find their place”, vis-à-vis each other within the gymnosperms.
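The "matrix representation" figures quoted above can be understood as the share of alignment columns in which a taxon has data. The following is a minimal sketch of that calculation, not the procedure used in this study: the function name, the toy alignment and the choice of `?` and `-` as missing-data symbols are all assumptions made for illustration.

```python
from typing import Dict

# Assumption: '?' and '-' mark missing data in the concatenated matrix.
MISSING = set("?-")


def matrix_representation(alignment: Dict[str, str]) -> Dict[str, float]:
    """Return, for each taxon, the percentage of columns that contain data."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_cols = lengths.pop()
    if n_cols == 0:
        raise ValueError("alignment has no columns")
    return {
        taxon: 100.0 * sum(ch not in MISSING for ch in seq) / n_cols
        for taxon, seq in alignment.items()
    }


# Hypothetical toy matrix (not data from the study).
toy = {
    "TaxonA": "ACGTACGTAC",
    "TaxonB": "AC??????--",
    "TaxonC": "ACGT??GTAC",
}
coverage = matrix_representation(toy)
for taxon, pct in coverage.items():
    print(f"{taxon}: {pct:.0f}% represented")
```

Comparing such percentages with the support on the branches that subtend each taxon is one way to see that representation alone does not predict how stably a taxon is placed.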
The impact of missing data on approaching a support plateau
Rokas et al. [24] clearly show a plateau of support values for trees as sequence information increases. One of the many imperfections of an EST-based alignment matrix such as the one in this study is the randomness of sequencing, which results in suboptimal taxon representation for many of the individual gene partitions. However, this shortcoming allows us to visualize the behavior of various stages of character density that result (as seen earlier) in varying degrees of branch support. By examining the taxon and character density produced by the randomness of the EST approach, we can evaluate the degree of support on branches with different character-to-taxon ratios (Figures 3 and 4).
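One simple way to make the notion of a support plateau operational is to look for the point after which adding partitions no longer changes a node's support by more than a small tolerance. The sketch below is an illustration under stated assumptions, not the analysis performed here: the function name, the tolerance, the window length and the support values are all hypothetical.

```python
from typing import Optional, Sequence


def plateau_start(
    support: Sequence[float], tol: float = 1.0, window: int = 3
) -> Optional[int]:
    """Return the index of the first point followed by `window` consecutive
    steps that each change support by at most `tol`, or None if there is none."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # Absolute change in support between successive numbers of partitions.
    steps = [abs(b - a) for a, b in zip(support, support[1:])]
    for i in range(len(steps) - window + 1):
        if all(step <= tol for step in steps[i:i + window]):
            return i
    return None


# Hypothetical support values for one node as partitions are added.
support_by_partitions = [52, 61, 70, 84, 95, 99, 100, 100, 100]
idx = plateau_start(support_by_partitions)
print("plateau begins at index", idx)
```

A node that returns `None` under such a criterion would correspond to one that has not yet accumulated enough character information to stabilize.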
The yeast study [24] and the seed plant hypothesis presented here both suggest that studies with similar numbers of taxa may require different numbers of characters and genes to reach similarly robust topological inferences and high levels of support. This discrepancy probably reflects the different phylogenetic scales and divergence times of the groups involved: ingroup taxa in the yeast phylogeny diverged between 50 and 100 million years ago (Mya) [59] and were confined within a single genus. In contrast, the ingroup taxa in our plant study are at the level of families, if not orders, that diverged no earlier than 400 Mya [60]–[62]. Alternatively, tree balance dynamics may have an impact on resolution [63] or, as several studies with much larger numbers of ingroup taxa [39], [64], [65] suggest, larger numbers of characters than those of the yeast study are required for robust resolution of some simple phylogenetic hypotheses.
Acknowledgments
The authors thank Damon P. Little (NYBG), Rob Martienssen and Richard McCombie (CSHL) and other members of the New York Plant Genomics Consortium for valuable discussions during the course of this project, as well as Alexandros Stamatakis and Michael Ott (Technical University of Munich) for assistance with RAxML.
Footnotes
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work was supported by the NSF Plant Genome Program, DBI-0421604, and by the Lewis and Dorothy Cullman Program in Molecular Systematics at the American Museum of Natural History and the New York Botanical Garden. The Sackler Institute for Comparative Genomics at the American Museum of Natural History and the Korein Family Foundation also provided support for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Albert VA, Soltis DE, Carlson JE, Farmerie WG, Wall PK, et al. Floral gene resources from basal angiosperms for comparative genomics research. BMC Plant Biol. 2005;5:5. doi: 10.1186/1471-2229-5-5.
- 2. Goff SA, Ricke D, Lan TH, Presting G, Wang R, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science. 2002;296:92–100. doi: 10.1126/science.1068275.
- 3. Mayer K, Mewes HW. How can we deliver the large plant genomes? Strategies and perspectives. Curr Opin Plant Biol. 2002;5:173–177. doi: 10.1016/s1369-5266(02)00235-2.
- 4. Yu J, Hu S, Wang J, Wong GK, Li S, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science. 2002;296:79–92. doi: 10.1126/science.1068037.
- 5. Crane PR. Phylogenetic analysis of seed plants and the origin of angiosperms. Ann MO Bot Gard. 1985;72:716–793.
- 6. Doyle J, Donoghue M. Seed plant phylogeny and the origin of angiosperms: an experimental cladistic approach. Bot Rev. 1986;52:321–431.
- 7. Doyle JA. Molecules, morphology, fossils, and the relationship of angiosperms and Gnetales. Mol Phylogenet Evol. 1998;9:448–462. doi: 10.1006/mpev.1998.0506.
- 8. Loconte H, Stevenson DW. Cladistics of the Spermatophyta. Brittonia. 1990;42:197–211.
- 9. Rothwell GW, Serbet R. Lignophyte phylogeny and the evolution of spermatophytes: a numerical cladistic analysis. Syst Bot. 1994;19:443–482.
- 10. Bowe LM, Coat G, dePamphilis CW. Phylogeny of seed plants based on all three genomic compartments: extant gymnosperms are monophyletic and Gnetales' closest relatives are conifers. Proc Natl Acad Sci U S A. 2000;97:4092–4097. doi: 10.1073/pnas.97.8.4092.
- 11. de la Torre JE, Egan MG, Katari MS, Brenner ED, Stevenson DW, et al. ESTimating plant phylogeny: lessons from partitioning. BMC Evol Biol. 2006;6:48. doi: 10.1186/1471-2148-6-48.
- 12. Donoghue MJ, Doyle JA. Seed plant phylogeny: demise of the anthophyte hypothesis? Curr Biol. 2000;10:R106–109. doi: 10.1016/s0960-9822(00)00304-3.
- 13. Schmidt M, Schneider-Poetsch HA. The evolution of gymnosperms redrawn by phytochrome genes: the Gnetatae appear at the base of the gymnosperms. J Mol Evol. 2002;54:715–724. doi: 10.1007/s00239-001-0042-9.
- 14. Soltis PS, Soltis DE, Chase MW. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature. 1999;402:402–404. doi: 10.1038/46528.
- 15. Soltis PS, Soltis DE, Wolf PG, Nickrent DL, Chaw SM, et al. The phylogeny of land plants inferred from 18S rDNA sequences: pushing the limits of rDNA signal? Mol Biol Evol. 1999;16:1774–1784. doi: 10.1093/oxfordjournals.molbev.a026089.
- 16. Winter KU, Becker A, Munster T, Kim JT, Saedler H, et al. MADS-box genes reveal that gnetophytes are more closely related to conifers than to flowering plants. Proc Natl Acad Sci U S A. 1999;96:7342–7347. doi: 10.1073/pnas.96.13.7342.
- 17. Mathews S. Phylogenetic relationships among seed plants: persistent questions and the limits of molecular data. Am J Bot. 2009;96:228–236. doi: 10.3732/ajb.0800178.
- 18. Hasebe M, Kofuji R, Ito M, Kato M, Iwatsuki K, et al. Phylogeny of gymnosperms inferred from rbcL gene sequences. J Plant Res. 1992;105:673–679.
- 19. Samigullin TK, Martin WF, Troitsky AV, Antonov AS. Molecular data from the chloroplast rpoC1 gene suggest a deep and distinct dichotomy of contemporary spermatophytes into two monophyla: gymnosperms (including Gnetales) and angiosperms. J Mol Evol. 1999;49:310–315. doi: 10.1007/pl00006553.
- 20. Goremykin V, Bobrova V, Pahnke J, Troitsky A, Antonov A, et al. Noncoding sequences from the slowly evolving chloroplast inverted repeat in addition to rbcL data do not support gnetalean affinities of angiosperms. Mol Biol Evol. 1996;13:383–396. doi: 10.1093/oxfordjournals.molbev.a025597.
- 21. Mathews S, Donoghue MJ. Analyses of phytochrome data from seed plants: exploration of conflicting results from parsimony and Bayesian approaches [http://www.2002.botanyconference.org/section12/abstracts/238.shtml]. Botany 2002. August 4–7, 2002; University of Wisconsin, Madison, WI.
- 22. Becker A, Theissen G. The major clades of MADS-box genes and their role in the development and evolution of flowering plants. Mol Phylogenet Evol. 2003;29:464–489. doi: 10.1016/s1055-7903(03)00207-0.
- 23. Frohlich MW, Parker DS. The Mostly Male theory of flower evolutionary origins: from genes to fossils. Syst Bot. 2000;25:155–170.
- 24. Rokas A, Williams BL, King N, Carroll SB. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804. doi: 10.1038/nature02053.
- 25. Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu YL, et al. Genome-scale data, angiosperm relationships, and “ending incongruence”: a cautionary tale in phylogenetics. Trends Plant Sci. 2004;9:477–483. doi: 10.1016/j.tplants.2004.08.008.
- 26. Gatesy J, DeSalle R, Wahlberg N. How many genes should a systematist sample? Conflicting insights from a phylogenomic matrix characterized by replicated incongruence. Syst Biol. 2007;56:355–363. doi: 10.1080/10635150701294733.
- 27. Chiu JC, Lee EK, Egan MG, Sarkar IN, Coruzzi GM, et al. OrthologID: automation of genome-scale ortholog identification within a parsimony framework. Bioinformatics. 2006;22:699–707. doi: 10.1093/bioinformatics/btk040.
- 28. Poultney CS, Gutierrez RA, Katari MS, Gifford ML, Paley WB, et al. Sungear: interactive visualization and functional analysis of genomic datasets. Bioinformatics. 2007;23:259–261. doi: 10.1093/bioinformatics/btl496.
- 29. Sarkar IN, Egan MG, Coruzzi G, Lee EK, DeSalle R. Automated simultaneous analysis phylogenetics (ASAP): an enabling tool for phlyogenomics. BMC Bioinformatics. 2008;9:103. doi: 10.1186/1471-2105-9-103.
- 30. Swofford DL. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sunderland, MA, USA: Sinauer Associates; 2003.
- 31. Farris JS, Källersjö M, Kluge AG, Bult C. Testing significance of incongruence. Cladistics. 1994;10:315–319.
- 32. Farris JS, Källersjö M, Kluge AG, Bult C. Constructing a significance test for incongruence. Syst Biol. 1995;44:570–572.
- 33. Barker FK, Lutzoni FM. The utility of the incongruence length difference test. Syst Biol. 2002;51:625–637. doi: 10.1080/10635150290102302.
- 34. Darlu P, Lecointre G. When does the incongruence length difference test fail? Mol Biol Evol. 2002;19:432–437. doi: 10.1093/oxfordjournals.molbev.a004098.
- 35. Dolphin K, Belshaw R, Orme CD, Quicke DL. Noise and incongruence: interpreting results of the incongruence length difference test. Mol Phylogenet Evol. 2000;17:401–406. doi: 10.1006/mpev.2000.0845.
- 36. Hipp AL, Hall JC, Sytsma KJ. Congruence versus phylogenetic accuracy: revisiting the incongruence length difference test. Syst Biol. 2004;53:81–89. doi: 10.1080/10635150490264752.
- 37. Bremer K. Branch support and tree stability. Cladistics. 1994;10:295–304.
- 38. Baker RH, DeSalle R. Multiple sources of character information and the phylogeny of Hawaiian drosophilids. Syst Biol. 1997;46:654–673. doi: 10.1093/sysbio/46.4.654.
- 39. Gatesy J, O'Grady P, Baker RH. Corroboration among data sets in simultaneous analysis: hidden support for phylogenetic relationships among higher level artiodactyl taxa. Cladistics. 1999;15:271–313. doi: 10.1111/j.1096-0031.1999.tb00268.x.
- 40. Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–2690. doi: 10.1093/bioinformatics/btl446.
- 41. Stamatakis A, Ott M. Efficient computation of the phylogenetic likelihood function on multi-gene alignments and multi-core architectures. Phil Trans R Soc Lond B Biol Sci. 2008;363:3977–3984. doi: 10.1098/rstb.2008.0163.
- 42. Ott M, Zola S, Aluru S, Stamatakis A. Large-scale maximum likelihood-based phylogenetic analysis on the IBM BlueGene/L. Proceedings of IEEE/ACM Supercomputing Conference (SC2007). November 2007, Reno, NV. 2007.
- 43. Stamatakis A, Ott M. Springer Lectures in Bioinformatics. Vol. 5265. Berlin: Springer; 2008. Exploiting fine-grained parallelism in the phylogenetic likelihood function with MPI, Pthreads, and OpenMP: a performance study. 3rd IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB 2008). pp. 424–436.
- 44. Abascal F, Zardoya R, Posada D. ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 2005;21:2104–2105. doi: 10.1093/bioinformatics/bti263.
- 45. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992;8:275–282. doi: 10.1093/bioinformatics/8.3.275.
- 46. Stamatakis A. Phylogenetic models of rate heterogeneity: a high performance computing perspective. 20th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS2006). 25–29 April 2006, Rhodes, Greece. 2006. doi: 10.1109/IPDPS.2006.1639535.
- 47. Stamatakis A, Hoover P, Rougemont J. A rapid bootstrap algorithm for the RAxML web servers. Syst Biol. 2008;57:758–771. doi: 10.1080/10635150802429642.
- 48. Wheeler WC. Nucleic acid sequence phylogeny and random outgroups. Cladistics. 1990;6:363–367. doi: 10.1111/j.1096-0031.1990.tb00550.x.
- 49. Lockhart PJ, Penny D. The place of Amborella within the radiation of angiosperms. Trends Plant Sci. 2005;10:201–202. doi: 10.1016/j.tplants.2005.03.006.
- 50. Wiens JJ. Missing data, incomplete taxa, and phylogenetic accuracy. Syst Biol. 2003;52:528–538. doi: 10.1080/10635150390218330.
- 51. Philippe H, Lartillot N, Brinkmann H. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol. 2005;22:1246–1253. doi: 10.1093/molbev/msi111.
- 52. Philippe H, Snell EA, Bapteste E, Lopez P, Holland PW, et al. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol. 2004;21:1740–1752. doi: 10.1093/molbev/msh182.
- 53. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, et al. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature. 2008;452:745–749. doi: 10.1038/nature06614.
- 54. DeSalle R. Animal phylogenomics: multiple interspecific genome comparisons. Methods Enzymol. 2005;395:104–133. doi: 10.1016/S0076-6879(05)95008-8.
- 55. Jeffroy O, Brinkmann H, Delsuc F, Philippe H. Phylogenomics: the beginning of incongruence? Trends Genet. 2006;22:225–231. doi: 10.1016/j.tig.2006.02.003.
- 56. Albert VA, Backlund A, Bremer K, Chase MW, Manhart JR, et al. Functional constraints and rbcL evidence for land plant phylogeny. Ann MO Bot Gard. 1994;81:534–567.
- 57. Chase MW, Soltis DE, Olmstead RG, Morgan D, Les DH, et al. Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene rbcL. Ann MO Bot Gard. 1993;80:528–580.
- 58. Soltis DE, Soltis PS, Zanis MJ. Phylogeny of seed plants based on evidence from eight genes. Am J Bot. 2002;89:1670–1681. doi: 10.3732/ajb.89.10.1670.
- 59. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003;423:241–254. doi: 10.1038/nature01644.
- 60. Crane PR. Time for the angiosperms. Nature. 1993;366:631–632.
- 61. Magallón SA, Sanderson MJ. Angiosperm divergence times: the effect of genes, codon positions, and time constraints. Evolution. 2005;59:1653–1670. doi: 10.1554/04-565.1.
- 62. Sanderson MJ, Thorne JL, Wikstrom N, Bremer K. Molecular evidence on plant divergence times. Am J Bot. 2004;91:1656–1665. doi: 10.3732/ajb.91.10.1656.
- 63. Rohlf FJ, Chang WS, Sokal RR, Kim J. Accuracy of estimated phylogenies: effects of tree topology and evolutionary model. Evolution. 1990;44:1671–1684. doi: 10.1111/j.1558-5646.1990.tb03855.x.
- 64. Gatesy J, Baker RH. Hidden likelihood support in genomic data: can forty-five wrongs make a right? Syst Biol. 2005;54:483–492. doi: 10.1080/10635150590945368.
- 65. Planet PJ, Kachlany SC, Fine DH, DeSalle R, Figurski DH. The widespread colonization island of Actinobacillus actinomycetemcomitans. Nat Genet. 2003;34:193–198. doi: 10.1038/ng1154.
- 66. Nixon K, Crepet WL, Stevenson D, Friis EM. A reevaluation of seed plant phylogeny. Ann MO Bot Gard. 1994;81:484–533.
- 67. Chaw SM, Parkinson CL, Cheng Y, Vincent TM, Palmer JD. Seed plant phylogeny inferred from all three plant genomes: monophyly of extant gymnosperms and origin of Gnetales from conifers. Proc Natl Acad Sci U S A. 2000;97:4086–4091. doi: 10.1073/pnas.97.8.4086.