- Split View
-
Views
-
Cite
Cite
Frank T Burbrink, Felipe G Grazziotin, R Alexander Pyron, David Cundall, Steve Donnellan, Frances Irish, J Scott Keogh, Fred Kraus, Robert W Murphy, Brice Noonan, Christopher J Raxworthy, Sara Ruane, Alan R Lemmon, Emily Moriarty Lemmon, Hussam Zaher, Interrogating Genomic-Scale Data for Squamata (Lizards, Snakes, and Amphisbaenians) Shows no Support for Key Traditional Morphological Relationships, Systematic Biology, Volume 69, Issue 3, May 2020, Pages 502–520, https://doi.org/10.1093/sysbio/syz062
- Share Icon Share
Abstract
Genomics is narrowing uncertainty in the phylogenetic structure for many amniote groups. For one of the most diverse and species-rich groups, the squamate reptiles (lizards, snakes, and amphisbaenians), an inverse correlation between the number of taxa and loci sampled still persists across all publications using DNA sequence data and reaching a consensus on the relationships among them has been highly problematic. In this study, we use high-throughput sequence data from 289 samples covering 75 families of squamates to address phylogenetic affinities, estimate divergence times, and characterize residual topological uncertainty in the presence of genome-scale data. Importantly, we address genomic support for the traditional taxonomic groupings Scleroglossa and Macrostomata using novel machine-learning techniques. We interrogate genes using various metrics inherent to these loci, including parsimony-informative sites (PIS), phylogenetic informativeness, length, gaps, number of substitutions, and site concordance to understand why certain loci fail to find previously well-supported molecular clades and how they fail to support species-tree estimates. We show that both incomplete lineage sorting and poor gene-tree estimation (due to a few undesirable gene properties, such as an insufficient number of PIS), may account for most gene and species-tree discordance. We find overwhelming signal for Toxicofera, and also show that none of the loci included in this study supports Scleroglossa or Macrostomata. We comment on the origins and diversification of Squamata throughout the Mesozoic and underscore remaining uncertainties that persist in both deeper parts of the tree (e.g., relationships between Dibamia, Gekkota, and remaining squamates; among the three toxicoferan clades Iguania, Serpentes, and Anguiformes) and within specific clades (e.g., affinities among gekkotan, pleurodont iguanians, and colubroid families).
Well-supported phylogenies inferred using both thorough taxon-sampling and genome-scale sequence data are paramount for understanding phylogenetic structure and settling debates about higher-level taxonomy. Phylogenomic analyses can provide reliable trees for downstream use in comparative biology (Garland et al. 2005; Wortley et al., 2005; Heath et al., 2008; Ruane et al., 2015; Burbrink et al., 2019) and help unravel evolutionary complexity (Philippe et al., 2011), such as deep-time phylogenetic reticulation (Burbrink and Gehara, 2018). In recent years, well-resolved phylogenies of birds and mammals used both large numbers of genes and taxa (Prum et al., 2015; Liu et al., 2017). Unfortunately, among amniotes, squamates have fallen behind and all comparative and taxonomic studies still rely on phylogenetic structure estimated from either a handful of genes and a large number of species (Pyron et al., 2013), small number of lineages with phylogenomic data (Streicher and Wiens 2017), and a few of intermediate range with |$\sim $|50 loci and 161 taxa (Wiens et al., 2012; Reeder et al., 2015). Furthermore, phylogenetic thinking about such groups often reflects historical morphological hypotheses that are weakly congruent or incongruent with recent phylogenomic estimates (Conrad 2008; Gauthier et al., 2012; Losos et al., 2012).
The order Squamata comprises almost 10,800 extant lizards, snakes, and amphisbaenians (Uetz et al. 2018) showing continuous diversification since the Jurassic, with many groups surviving the Cretaceous/Tertiary mass extinction (Evans 2003; Evans and Jones 2010; Jones et al., 2013). Extant squamates occur in nearly all habitats globally and show huge variation in body size, body shape, limb types (including repeated complete limb loss), oviparous and viviparous reproductive modes, complex venoms, and extremely varied diets that include plants, invertebrates and vertebrates (Vitt and Pianka 2005; Vitt and Caldwell 2009; Colston et al., 2010; Pyron and Burbrink, 2014; Zaher et al., 2014; Fry 2015). Squamates are important research organisms in the fields of behavior, ecology, and macroevolution, for which major studies on their speciation, biogeography, and latitudinal richness gradients have contributed to the basic understanding of how diversity accumulates across the earth (O’Connor and Shine 2004; Ricklefs et al., 2007; Pyron and Burbrink 2012; Pyron, 2014; Burbrink et al., 2015; Esquerré and Scott Keogh 2016; Esquerré et al., 2017).
Attempts to understand relationships among squamates over the last 250 years reflect the input from hundreds of researchers spanning morphological, small-scale molecular, and now phylogenomic data sets, marked by several key milestones (Oppel 1811; Camp 1923; Underwood 1967; Estes et al., 1988; Townsend et al., 2004; Vidal and Hedges 2005, 2009; Conrad 2008; Wiens et al., 2010, 2012; Gauthier et al., 2012; Pyron et al., 2013, 2014; Reeder et al., 2015; Streicher and Wiens 2016). Although research from both morphological and molecular studies have converged toward similar content of most family-level groups, relationships among these groups and, more intriguingly, the deepest divisions within squamates remain highly contentious (Losos et al., 2012). Squamate relationships inferred from molecular sequence-based phylogenetics that are fundamentally at odds with historical interpretations of morphological data include: 1) monophyly of Toxicofera (snakes, iguanians, and anguiforms; Vidal and Hedges 2005), 2) polyphyly of Anilioidea (pipe-snakes) and Macrostomata (large-gaped snakes including Acrochordidae, Boidea, Bolyeriidae, Colubriformes, Pythonidae, Tropidophiidae, and Ungaliophiidae), 3) paraphyly of Scolecophidia (worm-snakes), 4) phylogenetic affinities of Dibamia and Amphisbaenia within squamates (Hallermann 1998; Rieppel and Zaher 2000; Conrad 2008; Gauthier et al., 2012), and 5) phylogenetic affinities of Heloderma and Shinisaurus within anguiformes (Gao and Norell 1998).
In brief, all cladistic morphological studies of extant and extinct taxa since the landmark analysis of Estes et al. (1988), which itself tested the original divisions of Camp (1923), have supported a basal split between Iguania, represented by Pleurodonta and Acrodonta, and Scleroglossa, which includes gekkotans and all remaining (“autarchoglossan”) squamate groups (see Conrad [2008] and Gauthier et al. [2012] for reviews). This arrangement at the root of crown-Squamata is supported by a suite of morphological characters that range from 2 to 10 unambiguous synapomorphies uniting Scleroglossa as the sister group of Iguania (Conrad 2008; Gauthier et al., 2012). Alternatively, all molecular phylogenetic studies since Saint et al. (1998), Townsend et al. (2004), and Vidal and Hedges (2005) have demonstrated that snakes, anguimorphs, and iguanians share a more-recent common ancestor to the exclusion of other squamates; this clade is collectively referred to as Toxicofera (Vidal and Hedges 2005). Importantly, re-analysis of combined molecular and morphological data have found quantitatively similar hidden morphological support for Toxicofera as well as for Scleroglossa (Reeder et al., 2015), which suggests that most traits supporting a basal Iguania/Scleroglossa split are the result of convergent ecological adaptations. Moreover, Simões et al. (2018) recently rejected this basal split into Scleroglossa using a new morphological data matrix containing characters that support the molecular phylogeny (at least in part). This suggests that the traditional coding of some of these morphological traits may have been in error or reflected homoplasy.
Many morphological studies recovered a single origin for large-gaped snakes, referred to as Macrostomata (Cundall et al., 1993; Rieppel et al., 2003; Conrad 2008; Wilson et al., 2010; Gauthier et al., 2012; Zaher and Scanferla 2012; Simões et al., 2018). This group has been defined by a large number of skull features, all involving the dentigerous upper and lower jaws, palatal, and suspensorium bones, which contribute to an increase of gape size (Rieppel 1988; Cundall and Irish 2008). However, Macrostomata was found to be paraphyletic based on fossil taxa and morphological data alone (Lee and Scanlon 2002; Rieppel et al., 2003; Scanlon 2006; Rieppel, 2012). Similarly, molecular studies using mtDNA and/or single-copy nuclear genes also demonstrate paraphyly in Macrostomata, where Aniliidae (non-Macrostomata) and Tropidophiidae (Macrostomata) represent sister taxa to the exclusion of other macrostomatan families (Vidal and Hedges, 2004; Pyron and Burbrink, 2012; Wiens et al., 2012; Streicher and Wiens, 2016); this has also been reinforced by some unconventional morphological traits (Siegel et al., 2011). Finally, a recent study combining morphological and molecular data from extant and fossil species using tip dating methods showed that macrostomatan morphological features evolved early in snakes and subsequently reversed multiple times (Harrington and Reeder, 2017).
Some authors have argued that because morphological data are subjectively coded and demonstrably susceptible to convergence among key traits, they should be considered biased against estimating correct relationships when compared with more voluminous and objectively coded molecular data (Wiens et al., 2010; Reeder et al., 2015). Molecular data, therefore, should not be as influenced by ontogeny, environment, convergence, or user-coded bias when compared with morphological traits. However, a few examples of molecular convergence within mtDNA in squamates exist (Castoe et al., 2009) and among particular genes in other organisms (Parker et al., 2013; Projecto-Garcia et al., 2013; Zou and Zhang, 2016). Although instances of genome-wide convergence occur (Foote et al., 2015), it nevertheless seems unlikely to expect convergence across nuclear and mtDNA genomes while yielding identical topologies with respect to groups like Toxicofera and Alethinophidia.
Although genome-scale data should thus be optimal for resolving squamate relationships, most molecular studies have been limited by either having many taxa for few genes (Pyron et al., 2013) or having few taxa and many genes (Streicher and Wiens, 2017). For many studies, understanding how many genes support a species tree and what signal exists for alternative arrangements is currently unknown for most genomic-scale data sets. Thus, a study using genome-scale molecular data and dense sampling of extant lineages is needed to confirm the robustness of the molecular signal, particularly with respect to methods that can interrogate phylogenetic signal across loci with respect to the species tree (e.g., Arcila et al., 2017).
Here, we sample 289 species from nearly all squamate families for 394 anchored phylogenomic loci (AHE; Lemmon et al., 2012) to investigate species-tree relationships within Squamata, assess support for all nodes, and estimate divergence dates. We then examine the influence of each locus on the overall topology. Specifically, we use machine-learning techniques (Lek et al., 1996; Zhang, 2010) to quantify and understand why particular genes do not support the standard molecular relationships and determine if there is hidden support for Scleroglossa and Macrostomata. Given past morphological evidence for these relationships, we expect at least some loci in any phylogenetic context (concatenated or species trees) should support these relationships. Thus, the genomic distribution of congruence and incongruence should highlight the underlying biological mechanisms for these topological disputes and potentially provide resolution. Results from our research solidify relationships among most extant squamates, further clarify their taxonomy, and highlight remaining problems.
Methods and Materials
Data set
Using anchored phylogenomics (Lemmon et al., 2012), we generated a genomic data set for 289 species representing 75 families of squamates and one Rhynchocephalian (see Supplemental material available on Dryad at http://dx.doi.org/10.5061/dryad.6392n3s). Our taxonomic arrangement followed Vidal and Hedges (2005, 2009), Conrad (2008), and Vidal et al. (2010) for higher-level squamatan clades; Pyron et al. (2013) and Barker et al. (2015) for Booidea and Pythonoidea familial and generic levels; Zaher et al. (2009) and Kelly et al. (2009), including nomenclatural suggestions of Savage (2015) and Rhodin et al. (2015) for higher and familial caenophidian clades. Whole genomic DNA was extracted from tissue using the Qiagen DNeasy kit following manufacturer’s protocols at the Center for Anchored Phylogenomics at Florida State University and data were assembled using Anchored Phylogenomics (www.anchoredphylogeny.com). Following Qubit fluorometer quantification using a dsDNA HS Assay kit (Invitrogen™), we sonicated up to 1 |$\mu $|g of the extracted DNA to a size range of 150–400 bp using a Covaris E220 Focused-ultasonicator. We then prepared libraries following Lemmon et al. (2012) using a Beckman-Coulter FXp liquid-handling robot. After ligating 8 bp indexes, we pooled libraries in groups of 16 and enriched the library pools using the AHE probes developed for Squamates by Ruane et al. (2015) and Tucker et al. (2016). After verifying the quality and quantity of the enriched libraries by bioanalysis and qPCR, we sequenced the libraries at the Florida State University translational laboratory on eight lanes of Illumina HiSeq 2500 with paired-end 150-bp protocol (|$\sim $|395 Gb of total data).
Following sequencing, we demultiplexed reads passing the Casava high-chastity filter, allowing for no index mismatches. Next, we merged overlapping read pairs to remove library adapters and correct for sequencing errors (Rokyta et al., 2012). We then assembled the reads using the quasi-de novo approach described by Ruane et al. (2015) and Prum et al. (2015), with Calamaria pavimentata and Anolis carolinensis as references. We avoided sequences derived from low-level contamination or misindexing by removing assembly clusters containing fewer than 500 reads. We also verified the identity of some tissue samples by comparing mitochondrial genes to previously published mitochondrial sequences (mainly cytb and the rRNAs genes). Mitochondrial sequences were assembled through the map to reference approach using the Geneious mapper in Geneious R9 (Biomatters Ltd.; Kearse et al., 2012). We established orthology by clustering consensus sequences using pairwise-distances, as described by Hamilton et al. (2016). For some loci, we removed aberrant sequences resulting from low-level contamination or a low-divergent paralog that passed by our primary filters by using a modified version of the orthology assessment method described by Hamilton et al. (2016). In this modified approach, we iteratively clustered the sequences using pairwise-distances for each nested taxonomic level. We then aligned orthologous sequences using MAFFT v7.023b (Katoh and Standley, 2013). We trimmed and masked the alignments following Hamilton et al. (2016); but with MINGOODSITES = 13, MINPROPSAME = 0.3, and MISSINGALLOWED = 82, then inspected the alignments manually in Geneious R9 to identify and remove remaining aberrant sequences.
Phylogeny
We estimated models of substitution for each locus using ModelFinder (Chernomor et al., 2016; Kalyaanamoorthy et al., 2017), which uses maximum likelihood to fit 22 substitutional models including up to 6 free-rate gamma categories. We first estimated phylogeny and tree support using the ultrafast nonparametric bootstrap approximation (|$n$| = 1000; Minh et al., 2013) over the partitioned concatenated data sets. Support was also estimated using the Shimodaira–Hasegawa-like approximate likelihood ratio test (SH; Shimodaira and Hasegawa, 1999; Anisimova et al., 2011). With the locus-partitions and substitution models, we generated gene trees with support for each locus in IQ-TREE v1.6.6 (Nguyen et al., 2015). We then generated a species tree using ASTRAL III with IQ gene trees as inputs and support estimated using local posterior probabilities on quadripartitions with default hyper-parameter inputs (Yule prior for branch lengths and the species tree set to 0.5; Mirarab and Warnow, 2015). We also ran ASTRAL using 1000 bootstrapped IQ trees under the multilocus bootstrapping feature. We compared these concatenated and species trees using Robinson–Foulds (RF) distances (Robinson and Foulds, 1981), which provide a measure of topological dissimilarity, and then determined if all measures of support were correlated among shared branches using Spearman rank correlation (Supplementary material available on Dryad). The Squamata tree was rooted with their closest living relative, the rhynchocephalian Sphenodon punctatus; outgroup status of this taxon has been discussed in other recent papers that used morphological and molecular data (Gauthier et al., 2012; Jones et al., 2013; Chen et al., 2015; Harrington et al., 2016).
Because bootstraps, SH, or posterior probabilities do not provide comprehensive measures of underlying agreements or disagreement among sites and genes for supporting any topological arrangement, we examined site and gene concordance factors (sCF and gCF, respectively). Here, gCF indicated the percentage of gene trees showing a particular branch from a species tree (Ané et al., 2007; Baum, 2007), whereas sCF shows the number of sites supporting that branch. Values of sCF have a lower bound of 33% given the three possible quartets for each node, whereas gCF values are calculated from a full gene tree and, therefore, may not resolve a particular node yielding a lower bound of 0%. We estimated gCF and sCF values across all nodes of the ASTRAL species tree and concatenated IQ trees using IQ-TREE v1.6.6 (Nguyen et al., 2015; Minh et al., 2018) and associated R code (http://www.robertlanfear.com/blog/files/concordance_factors.html).
Because discordance as estimated here between gene trees and the species tree may be caused by either incomplete lineage sorting (ILS) and/or poorly estimated gene trees, we attempted to isolate these issues across all nodes. If ILS is driving gene and species-tree differences, then the two discordant topologies at a particular node (i.e., not the primary node from the species-tree topology) should be equivalent for gCF and for sCF, though sites may be linked within single loci and provide unreliable estimates. Using a |$X^2$| test for genes and sites separately, we sum the number of genes and the number of sites supporting the two discordant topologies at each node to determine if they deviate significantly from being evenly represented. If they were not significant, this is a reasonable expectation that discordance among gene trees and among sites was due to ILS (Huson et al., 2005; Green et al., 2010; Martin et al., 2015). We estimated |$X^2$| between genes and sites using script provided here: http://www.robertlanfear.com/blog/files/concordance_factors.html.
Divergence Dates
We estimated divergence dates (with error) using the penalized-likelihood approach in TreePL (Smith and O’Meara, 2012) with three phylogenetic datasets. A full MCMC-based relaxed phylogenetics method was not computationally feasible for a dataset of this size. The first dataset used for dating was composed of 1000 concatenated bootstrapped (pseudoreplicated) trees generated from the IQTREE rapid bootstrap function (UFBoots). Second, because tree space may not have been widely explored using this method of generating bootstrapped trees (Smith et al., 2018), we also generated 100 pseudoreplicates in RAxML 1.6.7 (Stamatakis, 2014), and estimated phylogeny for each in IQTREE. Third, we also generated a test tree by fitting the concatenated data to the ASTRAL topology producing a species-tree topology with branch lengths in substitution rates. We then re-estimated the phylogeny and generated dated trees using TreePL; this method produced dates given tree uncertainty from the two bootstrap replicates and the ASTRAL tree. For TreePL, we chose the best smoothing parameter to balance a tradeoff between rates across the tree being clock-like or completely saturated by using a cross-validation approach. This approach sequentially removed terminal taxa, produced an estimate of these branches from the remaining data given an optimal smoothing parameter, and then produced an appropriate smoothing parameter from the fit of the real branch and the pruned branch (Sanderson, 2002). We iterated this 14 times over a range of smoothing parameters from 1 |$\times $| 10|$^{-7}$|–1 |$\times$| 10|$^{4}$| and implemented the thorough option to ensure that the run iterated until convergence. To calibrate these trees, we followed Jones et al. (2013), Alencar et al. (2016), and Zaher et al. (2018), and added other taxa, to utilize 26 fossils and locations on the phylogeny for estimating divergence dates (see Supplementary material available on Dryad).
Locus Support for Morphological Topology
We examined how often individual loci recovered the traditional Scleroglossa/Iguania division and a monophyletic Macrostomata (including Acrochordidae, Boidea, Bolyeriidae, Calabariidae, Candoidae, Charinidae, Erycidae, Loxocemidae, Pythonidae, Sanziniidae, Tropidophiidae, Ungaliophiidae, Xenopeltidae, and all 17 families of Colubroides recognized herein). For Scleroglossa, we tested for monophyly of all squamates excluding Iguania. For Macrostomata, we tested for monophyly of Amerophidia (Aniliidae |$+$| Tropidophiidae), a group that rejects Macrostomata, as aniliids are canonical non-macrostomatans, and tropidophiids are canonical macrostomatans (Vidal and Hedges, 2004). This grouping in particular is an example of conflict between molecular, osteological, and soft-tissue characters (see Siegel et al., 2011; Hsiang et al., 2015). To test a more difficult phylogenetic relationship, we also examined the placement of Dibamia, here inferred in the species tree as sister to Gekkota.
We first estimated the probability of Toxicofera and Amerophidia monophyly, respectively, using the measures of support in the species tree (quadripartition and bootstrapped IQ gene trees) and asked how many loci also recover these relationships. For the remaining loci that did not show these relationships, we asked how often they find the basic Scleroglossa/Iguania split and monophyly of Macrostomata and whether these outlier loci together recovered these traditional relationships using species-tree methods. Similarly, we also examined key placements for Dibamia as the sister group to: 1) all other squamates, 2) Gekkota, and 3) all other squamates excluding Gekkota.
Genomic Interrogation Using Neural Networks
Using a custom R script (R Core Team, 2015; script provided as Supplementary material available on Dryad), we examined the frequency of genes that support each node of the dated species tree, similar to Smith et al. (2018). We assessed this support over all dated nodes to understand where and when particular phylogenetic relationships were poorly supported by the density of gene trees that disagree with the species tree. We then attempted to investigate why certain loci did not show the same relationships as the species tree.
Using RF distances (Robinson and Foulds, 1981) between each gene tree and the species tree (RF|$_{gtst})$| as the response variable, we chose characteristics of genes known to affect resolution, support, or topological accuracy such as gene length, base-pair composition, alignment gappiness, and variable sites (Rosenberg and Kumar, 2003; Felsenstein, 2004; Wortley et al., 2005; Nagy et al., 2012; López-Giráldez et al., 2013; Som, 2015; Duchêne et al., 2017), which here included the following parameters as a standard set of predictor variables: 1) number of sites, 2) base-pair content, 3) number of parsimony-informative sites (PIS), 4) number of gaps, 4) maximum gap length, 5) mean gap length, 6) number of segregating sites with gaps, 7) standard deviation of gap length, 8) sites with more than one substitution, 9) number of bases observed at each site (1–4), 10) maximum or minimum phylogenetic informativeness, and 11) gene and site concordance (gCF and sCF). Metrics 1–9 were generated with a custom script using the Ape package in R (Paradis et al., 2004). RF distances were estimated using a custom script based on the package phangorn. Metric 10 was estimated using a custom script based on the package phyloinformR (Dornburg et al., 2016), which estimates rates per site from the web server Phydesign (Lopez-Giraldez and Townsend, 2011) and metric 11 was generated in IQ-TREE v1.6.6 (Nguyen et al., 2015).
To understand if these variables can predict RF|$_{gtst}$|, we used artificial neural network (NN) regressions in caret (Kuhn, 2008). Artificial NNs are a type of deep-learning method used as an alternative to likelihood techniques to address complex questions with many predictor variables in population genomics (Libbrecht and Noble, 2015; Sheehan et al., 2016). As such, they are a powerful tool for understanding non-linear interactions among numerous parameters to predict responses resulting in a range of simple to complex models regardless of the statistical distribution of those variables or the relationships among them (Lek et al., 1996; Zhang, 2010). These methods have been successfully used to address phylogenetic and comparative tree-based tests previously (Burbrink et al., 2017; Burbrink and Gehara, 2018).
In brief, in the typical NN with multilayer feed-forward back-propagation used here, the basic structure began with genetic input variables (input neurons), joined by weighted synapses to hidden neurons, and then finally ending in an output neuron. Every node was connected to every other node in the previous layer, and each node was provided with an activation value, defined as the difference between the weighted sum of all inputs and a bias (threshold) parameter; connected hidden neurons activate an output node that is compared with a known (dependent) value.
Specific to our NN, we scaled all predictor variables by the minimum and maximum range for each data category. We separated the data into 70% standard training and 30% test. We also tested accuracy at other training and test percentages, respectively: 50/50, 60/40, 70/30, 80/20, and 90/10. Each of these was run using 1000 maximum iterations, which ensured convergence. We resampled the data using the default 25 bootstrap replicates to reach convergence with the following tuning parameters: weight decay, root mean squared error (RMSE), |$r^{2}$|, and mean absolute error (MAE). We examined the power of these variables and models to predict the response (RF|$_{gtst})$| by recomposing the test and training sets over 100 iterations and comparing the resulting test statistics (RMSE, |$r^{2}$|, and MAE) to those from randomized response variables for each of these 100 estimated models. Over these replicates, we also identified the top five most important model variables (Supplementary Fig. S1 available on Dryad).
Multicollinearity among variables may be problematic for constructing model inferences if there is a linear dependence among these independent variables (De Veaux and Ungar, 1994; Hastie et al., 2009; Dormann et al., 2013). Although overparameterization is often not a problem for NN given that they are used for prediction of the system and not necessarily for interpretation, previous studies have demonstrated that machine-learning techniques in general may be sensitive to changes in variables over collinear data (Shan et al., 2006; Dormann et al., 2013). Our NN models contain 26 variables that mostly describe properties of genes; therefore, we used variance inflation factors (VIF; Montgomer et al., 2012) to filter collinearity data in the R package “Faraway” (Faraway, 2002). Here, we chose standard VIF |$>$|10 and removed all but one variable greater than this value. We then repeated the NN analyses as with all of the variables described above.
Similarly, we used NN to understand if the same variables used to describe genes, but now also including RF|$_{gtst}$|, could be used to classify why loci do or do not support the monophyly of Toxicofera, Amerophidia (which implies a lack of support for Macrostomata), and the Gekkota and Dibamia sister relationship (GD). Our binary predictions were classified as “1” if the locus supported the relationship and “0” if it did not. We again used the “train” function with the identical set up as above but replaced the tuning parameters with weighted decay and model accuracy. This was replicated 500 times for recomposing test and training data sets, estimating accuracy, and then testing this against the accuracy of randomized responses. For each of these replicates, we also tested performance using a confusion matrix (Townsend, 1971; Kuhn, 2008) comparing predicted with actual classification from the training data.
Results
Assemblies
We were able to recover 99.0% (SD = 2.3%) of the 394 target AHE loci (Supplementary material available on Dryad). An average of 23.5% (SD = 10.0%) of the reads mapped to the target region. The resulting consensus sequences averaged 1775 bp (SD = 180 bp). We obtained for each locus an average of 1.7 consensus sequences (assembly clusters, SD = 0.67), indicating that the loci were low copy but not all loci had single copies. The average coverage of these consensus sequences was 218 (SD = 83).
Data set
Loci had on average retained 92.4% (SD = 10.8) of all taxa represented in the species tree (|$n$| = 289). Mean length and number of PIS were 1302 bp (range: 170–2075, SD = 282.72) and 749 bp (range: 61–1455; SD = 282.7167), respectively. We found that 18 models provided a best fit for substitutions across all loci, with the TVM (transversion model, AG = CT, unequal base frequencies) fitted to 27% of loci and GTR fitted to 18.1 % of models, with either a five or six free-rate parameter (R) model describing rate heterogeneity.
Phylogeny
Tree estimates using either concatenated or species-tree methods produced similar topologies (Figs. 1 and 2), with RF distances of 44 between these trees being only 7.7% of the maximum RF distance. All methods showed strong nodal support. All measures of support—which included ASTRAL with local posterior probabilities, ASTRAL with 1000 IQ tree UF bootstraps, concatenated partitioned IQ trees with 1000 bootstraps, and concatenated partitioned IQ trees with SH likelihoods showed 0.90, 0.92, 0.93, and 0.95 percent of nodes were supported above 0.95 for each method, respectively (Supplementary material available on Dryad). For shared nodes, support was significantly correlated among methods (|$\rho $| = 0.55–0.66; |$P $|= 2.2 |$\times$| 10|$^{-16})$|.
All phylogenies generally showed relationships among groups that have been commonly recovered in previous genomic or combined molecular-morphological studies (Figs. 1 and 2; Vidal and Hedges, 2005; Wiens et al., 2012; Pyron et al., 2013; Reeder et al., 2015; Streicher and Wiens, 2017). We found unambiguous support (i.e., support values of 1.0) for all main groups, including Unidentata, Episquamata, Toxicofera, Gekkota, Scincomorpha, Laterata, Anguiformes, Iguania, and Serpentes (Fig. 1). The tree supported an early division between Gekkota/Dibamia and the remainder of Squamata, followed by a Scincomorpha and Episquamata sister relationship, and within Episquamata a sister relationship between Laterata and Toxicofera. Resolution and support within each of these groups was generally unambiguous (Supplementary material available on Dryad). For instance, most of the nodes within Colubroides received probability support values of 1.0, except for sister relationships between Colubridae + Grayiidae and Lamprophiidae + Pseudoxyrhophiidae, and the placement of Natricidae and Elapidae within Colubroidea and Elapoidea, respectively. Likewise, the remainder of Serpentes was well supported except for the placement of Candoiidae and Bolyeriidae, and the sister relationship between the two clades containing the most recent common ancestors (MRCA) of 1) Pythonidae and Bolyeriidae and 2) Calabariidae and Boidae. We also did not find strong support for relationships among some families in Pleurodonta (Iguania) and Gekkota, the placement of Dibamia as sister to Gekkota, and placement of Anguiformes or Iguania (or their MRCA) as sister to Serpentes within Toxicofera (Fig. 2).
Tree Support
Our estimates of sCF and gCF were correlated across the species tree of squamates (Fig. 3) but we note that both of these measures fell well below standard measures of Pp support. For many of the standard squamate relationships, including Toxicofera and Amerophidia, gCF were above 50%, whereas sCF remained low. This indicated that estimating support from sites alone, such as in bootstraps, may not provide credible estimates of support with genomic-scale data. We also demonstrated a significant relationship between gCF and sCF and branch length (Fig. 3).
Most of the gene-to-species-tree discordance can be described by ILS, where 82.6% of nodes did not show significant differences between the two discordant topologies. Although likely not as reliable given non-independence among sites, sCF shows 39.7% of nodes not showing significance among discordance sites. Finally, logistic regression testing for the presence or absence of Toxicofera and Amerophidia given sCF was significant (|$P $|= 0.012 and 3.99 |$\times $| 10|$^{-5})$|, whereas the presence of the Dibamia/Gekkota node showed no relationship with sCF (|$P $|= 0.45).
For each node of the species tree, more genes recovered a particular node than did not. This does not indicate, however, that the majority of gene topologies were the same as the species tree (Fig. 4); 72% of nodes were supported by the majority of gene trees. Most of the main groups showed high species-tree and individual-locus agreement, though we note that at 90–95 Ma, relationships among Pleurodonta families (Iguanians) and the placement of Candoiidae and Bolyeriidae among Pythonoidea and Booidea, respectively, showed strong discordance between the number of gene trees showing the species-trees relationship. Most loci yielded strong phylogenetic informativeness over substantially large substitution rates to infer the origin and diversification throughout the history of Squamata sampled in our date trees (Supplementary material available on Dryad).
Neither concatenated analyses nor species trees supported the traditional Scleroglossa or Macrostomata. In addition, no loci recovered those groups either. In contrast, Toxicofera and Amerophidia were well supported (100%) in concatenated and species-tree estimates (Supplementary material available on Dryad). Among individual genes, 75% and 69% of loci recovered monophyletic Toxicofera and Amerophidia, respectively. However, the sister relationship between Gekkota/Dibamia (GD) was inferred in all concatenated and species-tree estimates but with low support (only 42.5% of individual loci).
We also generated species trees using ASTRAL III from the loci not supporting Toxicofera, Amerophidia, and GD. We found that among these smaller sets of loci, species trees did not support Scleroglossa, but rather placed Laterata sister to Serpentes, followed by Iguania and then Anguimorpha (Supplementary material available on Dryad). When forcing a sister relationship between Laterata and Serpentes, estimating a best supporting IQTree for this constraint, and calculating gCF and sCF for that nodal constraint, we found very little support, essentially random, for this arrangement: gCF = 4.32 and sCF = 33.1. For those loci not supporting Amerophidia, we found Aniliidae as sister to Alethinophidia, but still not supporting Macrostomata, given the inclusion of Uropeltoidea in Alethinophidia (Supplementary material available on Dryad). Finally, we found that 73% of the loci did not recover the GD relationship but instead resolving Dibamia as sister to all squamates (43% of loci) or Dibamia as sister to the remaining Squamata after Gekkota (30% of loci; Supplementary material available on Dryad).
Divergence Dating
Our divergence dates were largely congruent with recent studies focused on estimating divergence times in Squamata (e.g., Mulcahy et al., 2012; Jones et al., 2013; Pyron, 2017). Our analysis, though, provided an unprecedented taxonomic and genomic coverage at lower hierarchical levels of the squamate tree that enabled new estimates of divergence dates among extant families. Between the different sets of the bootstrapped phylogeny, we found that the absolute mean difference for dated nodes was only 0.668 Ma, and all dates at all shared nodes were correlated (|$\rho$| = 0.925, |$P$| = 2.2 |$\times $| 10|$^{-16})$|. Our estimates of divergence dates (Figs. 1 and 2; Supplementary material available on Dryad) suggested the origin of crown-Squamata to be in the Early Jurassic (190 Ma), though recent fossil evidence suggested this may have occurred earlier in the Late Triassic (206 Ma; Simões et al., 2018). Deep divergences within the tree of Squamata occurred between the Early and Middle Jurassic, with the divergence among Gekkota–Dibamia, Scincomorpha–Episquamata, Laterata–Toxicofera, Lacertibaenia–Teiioidea, and Iguania–Anguiformes occurring between 190 and 155 Ma.
A large number of modern groups diverged within the Cretaceous: gekkotan, cordyloid, teiioid, anguiform, acrodontan, pleurodontan, and typhlopoid extant families. Divergence of crown-pleurodont iguanian families occurred within a short interval during the Early Late Cretaceous, between |$\sim $|98 and 79 Ma, following the end of the opening of the South Atlantic and during a period of isolation of the western Gondwanan landmasses (McLoughlin, 2001). Similarly, Chamaeleonidae and Agamidae, diverged at |$\sim $|100 Ma. Afrophidian stem-Uropeltoidea, Pythonoidea, Booidea, and Caenophidia also diverged within the Late Cretaceous, although most of their extant families diverged after the K/Pg boundary. Notable exceptions are the Cylindrophiidae, Bolyeriidae, Xenopeltidae, Calabariidae, Acrochordidae, and Xenodermidae, which diverged in the Late Cretaceous. All extant families of the most diverse group of squamates, the Colubriformes (|$>$|3600 extant species), diverged within the Paleogene (Pareidae, Viperidae, Homalopsidae) or throughout the Eocene (all remaining families).
Gene Interrogation via NN: Scleroglossa and Macrostomata
To understand discordance between gene trees and species trees beyond ILS, we used two NN analyses. For the first, we used a regression approach and modeled predictions for the RF|$_{gtst}$| discordance. All NN analyses converged and accuracy did not vary among training data sets ranging in size from 0.5 to 0.9 (mean and SD |$r^{2 }$| =0.87, 0.015). We estimated an average RMSE, |$r^{2}$|, and MAE (SD in parentheses) of 0.06 (0.013), 0.86 (0.07), 0.05 (0.009), respectively. These strongly suggested that the NN was accurate, particularly relative to random responses (test between real and random accuracy metrics; |$P<$| 2.2 |$\times $| 10|$^{-16})$|, which were on average 0.24 (RMSE), 0.01 (|$r^{2})$|, 0.17 (MAE; Fig. 5). The top five most important variables for predicting RF|$_{gtst}$| by each locus were: mean and SD of sCF, PIS, phylogenetic informativeness, frequency of sites with three observable alternate bases, and frequency of sites with four observable alternate bases (Fig. 6). Individually, Bayesian correlation (BayesFirstAid (https://github.com/rasmusab/bayesian_first_aid) between each of these variables and RF|$_{gtst}$| showed strong negative correlation (Fig. 6). We also tested the efficacy of the NN approach by filtering for multicollinearity using VIF, which removed all but five variables (mean and SD sCF, PIS, number of tips, and number of gaps). This essentially produced the same prediction as using all 26 variables (accuracy = RMSE = 0.06, |$r^{2}$| = 0.87, MAE = 0.047) with variable importance ranked at 1.0 for all uncorrelated variables: PIS, number of tips, number of gaps, and standard deviation and mean sCF.
For the second machine-learning approach, we used NN classification to understand if any of the properties of these loci along with RF|$_{gtst}$| could predict why particular genes failed to recover Toxicofera, Amerophidia, and Dibamia/Gekkota. We determined that accuracy (scaled between 0 and 1) when running these models over the training data set only had a modest increase over randomizing the responses (Fig. 5; |$\Delta $| accuracy for Toxicofera = 0.08, Amerophidia = 0.08, Dibamia/Gekkota = 0.13), suggesting difficulty determining why certain loci fail to find these three relationships. Similarly, area under the ROC curve was low (0.60–0.70), and confusion matrices were not significant (|$P$| = 0.16–0.44). The top-ranked importance variables were RF|$_{gtst}$| (0.96–1.0) and phylogenetic informativeness (0.92–0.96) for Toxicofera and Dibamia/Gekkota. For Amerophidia, we found that the variables RF|$_{gtst}$| (0.94) and frequency of adenosine bases (0.75) ranked highest.
Discussion
Using a genome-scale data set with a thorough and diverse sampling of taxa, we corroborate nearly all recent molecular studies that estimate strong support for several fundamental groupings in Squamata: Unidentata, Scincomorpha, Episquamata, Laterata, Toxicofera, and Amerophidia. Using gene-interrogation techniques, we do not find any support in the genome for the traditional morphology-based Scleroglossa/Iguania division nor for monophyly of large-gaped snakes, Macrostomata. Although the former has mainly fallen out of use in the literature after the rise of DNA sequence-based phylogenies, the latter has remained in use given analyses of morphological data that strongly support it (Conrad, 2008; Gauthier et al., 2012; Zaher and Scanferla, 2012; Hsiang et al., 2015).
Artificial NNs
Artificial NNs show that loci yielding discontinuity with the species tree, as measured by scaled RF distance, can be characterized by a few general properties. Although traditional support remains high across most parts of this phylogeny, both gCF and sCF provide another view where in several cases loci fail to infer key nodes (Figs. 3 and 4). We developed this NN approach to better understand if particular properties of genes or simple ILS can account for discordance (Figs. 5 and 6). Overall, most nodes reveal a pattern consistent with ILS. When using NN and filtering for multicollinearity to predict the degree of gene and species-tree discordance, we found the following properties of genes and samples were important for resolving species-tree phylogenies: the number of PIS, mean site concordance across all nodes in the tree, the number of terminals sampled, and, importantly, variance in concordance sites across all nodes of each gene tree.
As expected, genes with higher mean sCF across all nodes are better at producing trees concordant with the species trees. This gene property is correlated with the number of PIS (|$\rho $| = 0.400, |$P <$| 1.1 |$\times $| 10|$^{-15})$|. In turn, with our tests of multicollinearity, PIS is correlated with a large number of other properties including phylogenetic informativeness over time (related to substitution rate over time), length of the gene, base-pair frequencies, and number of segregating sites. Interestingly, high variance in sCF predicts higher discordance between the gene and species-tree topologies as measured by RF. This suggests for the main topology estimated using species-tree techniques, genes with high sCF variance are concordant with some nodes and not others, again likely associated with the number of PIS per locus. In addition, both gCF and sCF are correlated with branch lengths which indicate that difficult areas of a phylogeny to infer with credible support will more likely be those where the timing between divergences were small. This has been known from the phylogenetic literature using both concatenated and multispecies coalescent-based methods (Philippe et al., 1994; Xu and Yang, 2016).
In summary, gene-tree and species-tree discordance is a mix of ILS, which is easy to ameliorate given coalescent-based phylogenetic inference, various properties of the genes, like site concordance and PIS, and short times between divergences, which may be difficult to resolve with any amount of data. It is likely that this NN method could serve to filter loci prior to final species-tree estimation. Care should be taken in cases where a majority of loci feature poor properties for tree inference, such as low PIS; preliminary species trees would likely be poorly estimated and supported, thus providing unusable estimates of sCF and gCF.
Squamate Phylogenomics
Some authors have suggested that molecular convergence, such as that found in previous phylogenetic studies (e.g., Castoe et al., 2009), may account for an erroneous estimation of Toxicofera (see Losos et al., 2012). However, this suggestion, even when using a handful of loci, seems improbable given the overwhelming support for the group estimated across many individual markers—including nuclear, mitochondrial, and structural (SINE) loci—and from concatenated and species trees. In addition, there appears to be very little unambiguous evidence for Scleroglossa in most morphological data sets (Reeder et al., 2015). Molecular convergence has been found within loci—for instance, within burrowing squamates in mtDNA (Castoe et al., 2009), and even across the genome in some limited examples (Foote et al., 2015). However, expecting convergence across most or all heritable markers including ultraconserved elements (UCE) (Streicher and Wiens, 2017), AHE loci, and mtDNA is unlikely given independence and function of these loci and the extremely low probability that taxa would randomly show the same relationships across these genes.
We do not find a single locus supporting Scleroglossa, whereas a majority of all possible loci containing all taxa recover Toxicofera (Figs. 3 and 4). Moreover, both concatenated and species-tree estimates support Toxicofera at 100%. The loci that do not recover Toxicofera are also problematic for most relationships, yielding higher gene-to-species-tree RF values. Interestingly, species-tree estimates from these loci also do not support Scleroglossa and still show some support for Toxicofera (Supplementary material available on Dryad). Previous research with a smaller data set also failed to find gene-support for Scleroglossa and supported Toxicofera (Reeder et al., 2015). Given that characters supporting Scleroglossa (Conrad, 2008; Gauthier et al., 2012) may be problematic due potential convergence arising from independent adaptation to burrowing (Reeder et al., 2015), we discourage any further use of Scleroglossa as a nomen outside of a historical context. On the other hand, sister-group relationships within Toxicofera remain unresolved given that the clade formed by Iguania and Anguiformes received low support values in one method (quadripartition support), despite the large number of loci used in our study. This relationship has been inferred previously using genomic data with greater support but using less taxa (Streicher and Wiens, 2017). Therefore, given independent support between UCEs and AHE data, it may be likely that the node subtending Iguana/Anguiformes is correct.
It is important to note in our study Serpentes shows a long unbranched interval of almost 50 Ma separating it from the remaining two toxicoferan clades (Fig. 1). The lack of representation of a number of key extinct lineages capable of resolving the placement of snakes by filling this geochronological gap may also confound phylogeny. These remaining two toxicoferan clades are also subtended by an unusually small branch, |$<$|1.4 my, where 98% of branching times in this tree exceed this length. Unsolved relationships within Toxicofera and other difficult regions in this tree (e.g., relationships within Pleurodonta), therefore, may be due to excessively short internodes, which may require data from the entire genome to estimate with any credible support.
Similarly, we find strong support for a deep sister relationship between Aniliidae (“Microstomata”) and Tropidophiidae (Macrostomata) using both concatenated and species-tree methods (bootstrap and Pp support = 100%) and among loci (present in 69% of gene trees), indicating that the taxon Macrostomata defined by snakes with expanded gapes (Zaher, 1998; Conrad, 2008; Wilson et al., 2010; Gauthier et al., 2012; Hsiang et al., 2015) is invalid. This suggests that the origin of large gapes has either evolved or been lost multiple times and is consistent with Harrington and Reeder (2017), and it may indicate that previous hypotheses on the origin of a “macrostomatan” diet in Serpentes requires re-analyses (e.g., Rodriguez-Robles et al., 1999). Unfortunately, the phylogenetic affinities of a number of extinct alethinophidian lineages with key macrostomatan features—such as Pachyrhachis, Haasiophis, Eupodophis, Yurlunggur, Wonambi, and Sanajeh—remain in dispute (Conrad, 2008; Gauthier et al., 2012; Zaher and Scanferla, 2012; Reeder et al., 2015). A more accurate placement of these fossils at the base of the alethinophidian tree might help clarify higher-level affinities between amerophidian and afrophidian lineages.
Within Anguiformes, both concatenated and species-tree estimates support Shinisaurus as the sister taxon to varanids and Helodermatidae as the sister taxon to Anguioidea (Diploglossidae, Anniellidae, Anguidae, Xenosauridae). This topology conflicts with the one suggested by morphological data, including those with expanded fossil sampling, where helodermatids and Shinisaurus are traditionally recognized as the sister groups of varanids and Xenosaurus, respectively (McDowell and Bogert, 1954; Gao and Norell, 1998; Gauthier, 1998; Gauthier et al., 2012). However, Conrad (2008) has shown that Shinisaurus and Xenosaurus were not closely related, the former being the sister group of varanoids (including Helodermatidae) and the latter as the sister group of Anguidae (Conrad, 2008). Conrad’s (2008) morphological tree closely matches molecular estimates of anguiform affinities, including ours, with the exception of the phylogenetic position of Helodermatidae. The distinct and strongly supported evidence provided by morphological and molecular data regarding the phylogenetic affinities of Helodermatidae within Anguiforms is another point of conflict that still awaits a solution.
Finally, the placement of Dibamia within Squamata has been difficult to resolve; molecular studies place Dibamia as the sister to Gekkota, sister to the remaining Squamata, or sister to Unidentata (Townsend et al., 2004; Vidal and Hedges, 2005; Pyron et al., 2013; Reeder et al., 2015; Streicher and Wiens, 2017). Most morphological studies place Dibamia with other limbless, burrowing taxa such as Amphisbaenia or Serpentes (Evans and Barbadillo, 1998, 1999; Hallermann, 1998; Lee, 1998; Rieppel and Zaher, 2000; Evans et al., 2005; Conrad, 2008; Gauthier et al., 2012), though authors typically acknowledge that many characters showing this relationship may be due to convergence reflecting shared ecology. Our analyses inferred Dibamia as the sister to Gekkota, but with poor support among methods and loci, where both gCF and sCF were split evenly among primary and discordant nodes for both metrics. The next-most-probable placement is sister to all other Squamata, which was also found using UCEs in Streicher and Wiens (2017), and then sister to Episquamata. Importantly, our results never find Dibamia closely related to other limbless or burrowing taxa. In general, other methods characterizing genomic data sets have also had difficulty confidently estimating deep relationships, for which particular genes may have biased results (Brown and Thomson, 2016).
Phylogenetic Structure and the Origin of Squamates
The structure of our phylogeny is extremely similar between concatenated and species-tree methods, with RF distances being only 7.7% of the maximum RF. Support among all nodes for all measures of support is high, with |$>$|90% of nodes having 95% support or higher. Although we infer a tree similar to those published previously (Pyron et al., 2013; Reeder et al., 2015; Streicher and Wiens, 2017), where the tree is largely structured as (Unidentata(Episquamata(Toxicofera))), we note that several key nodes remain poorly supported (Figs. 1 and 2). For example, several deep relationships such as the sister to Serpentes (Anguiformes or Iguania, or both) within Toxicofera remain uncertain.
Similarly, support is low for resolving relationships among pleurodont families (primarily the New World iguanians), which is similar to results from previous molecular and morphological attempts to understanding these relationships (Etheridge and Frost, 1989; Townsend et al., 2004; Reeder et al., 2015; Streicher et al., 2015). Comparable with Streicher and Wiens (2016), we found excessively short branch lengths subtending relationships among iguanian families, here ranging from 0.3 to 1.6 my, which are in the lower 0.4–1.7% of shortest internodes on our dated tree. It is likely, given their rapid divergence in the Upper Cretaceous, that signal across the genome to confidently estimate this area of the tree may remain difficult to extract (Rokas and Carroll, 2008). We also find poor support for the placement of Candoiidae within Booidea and Bolyeriidae with respect to Booidea and Pythonoidea, relationships also dating to the Upper Cretaceous. However, dense taxon sampling within these relictual families is not possible. Finally, support for relationships among some of the families of the rapidly diverging and diverse Colubroidea and Elapoidea is lacking, mirroring numerous previous studies (Lawson et al., 2005; Zaher et al., 2009; Pyron et al., 2011). We are presently sampling Colubroidea and Elapoidea more densely to better estimate those interfamilial relationships. As with all regions in the Squamate Tree of Life with poorly supported, short nodes, it is hopeful that imminent whole genome studies may be capable of increasing support in that particular region of the tree, though strong signal for a specific arrangement may always remain elusive.
Our results provide a solid foundation for the origins of all groupings of Squamata, with major pulses of diversification occurring in the Jurassic, Cretaceous, and Paleogene (Figs. 1 and 2). These results generally agree with previous studies using genetic and morphological data (Mulcahy et al., 2012; Pyron and Burbrink, 2014; Harrington and Reeder, 2017), though disparity in sampling loci and taxa make direct comparisons among studies difficult. For example, the root times for Squamata in a recent paper were found to be older, with crown-Squamata originating in the Late Triassic (|$\sim $|206 Ma; Simões et al., 2018). However, Simões et al. (2018) expanded the concept of Squamata by including Megachirella and Marmoretta from the Middle Triassic and Early Jurassic, respectively; these two poorly preserved taxa have been identified previously as stem-lepidosaurs (Evans and Jones, 2010). Our results are concordant with divergence dates for the origin of Squamata and for the major divisions giving rise to higher-level groups within squamates given by Pyron (2017) and Jones et al. (2013), but we note many of our calibrations were taken from those studies. Unidentata, Episquamata, and Toxicofera all arose in the Early to Middle Jurassic. In addition, within these groups, diversification producing the primary taxonomic divisions for the following taxa also occurred in the Jurassic: Scincomorpha, Laterata, Iguania, Cordylioidea, Pleurodonta, Acrodonta, and the root of Dibamia and Gekkota.
Similar to recent combined fossil, morphological, and molecular studies (Jones et al., 2013; Pyron, 2017; Simões et al., 2018), a large number of diverse groups subsequently originated in the Cretaceous, including the crown Serpentes and their major divisions into Typhlopoidea, Amerophidia, Alethinophidia, Afrophidia, and Caenophidia. Within these divisions, well-known and widely distributed groups such as Pythonoidea, Booidea, and Uropeltoidea diversified as well. Within lizards, we see origins and diversification within Gekkota, Teiioidea, Pleurodonta, and Acrodonta. Although understanding how the K/Pg boundary affected rates of diversification within Squamata requires additional information from the fossil record, it is clear that the groups originating in the Mesozoic rapidly diversified into all major families during the Cenozoic, ultimately producing the 10,800 currently known extant species. Most of these extant families of squamates diversified massively throughout the Paleogene and Neogene, underscoring the origins and diversification of the many hyperdiverse families of Colubriformes.
Conclusion
We provide a robust, dated phylogenomic estimate of phylogenetic relationships among Squamata sampling widely across almost all major lizard and snake groups. This research provides a solid framework for understanding the relationships and dates of origins of all extant squamates showing that most major families diversified prior to the K/Pg boundary. We also provide a novel framework for interrogating genes using artificial-intelligence techniques to understand how particular loci differ from species-tree estimates. Importantly, we show that both ILS and poor tree estimation given properties of genes such as the number of informative sites, may produce significant discordance among gene and species trees. All analyses fail to support the two traditional groupings of squamates into Scleroglossa and Macrostomata, but rather a consistent pattern grouping the squamates into Unidentata, Scincomorpha, Episquamata, Laterata, and Toxicofera. We also highlight areas of topological uncertainty within particular groups, such as Pleurodonta family and deep toxicoferan relationships, that represent potential avenues of novel research using whole genomes and denser taxon sampling to properly infer these relationships.
Supplementary Material
Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.sm6jb0p.
Acknowledgments
We thank the following curators who kindly provided tissue samples for our study: G. Schneider (UMMZ), L. Densmore, L. Grismer, A. Bauer, T. Jackman, R. Brown and C. K. Onn (KU), C. Austin, R. Brumfield and D. Dittman (LSUMNS), J. Rosado (MCZ), M. Hagemann (BPBM), A. Wynn (USNM), J. Vindum (CAS), J. McGuire and C. Spencer (MVZ), D. Kizirian (AMNH), A. Resetar (FMNH), K. Krysko and T. Lott (UF). We thank M.Kortyna, Al. Bigelow, S. Holland, J. Cherry at Florida State University’s Center for Anchored Phylogenomics for assistance with data collection and analysis.
Funding
This research was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo [grant number BIOTA-FAPESP 2011/50206-9 to H.Z.], National Science Foundation [grant numbers DEB-1257926 to F.T.B., DEB-1441719 to R.A.P., DEB-1257610 to C.J.R.], and Australian Research Council Discovery [grant number DP120104146 to J.S.K. and S.C.D.]. F.G.G. benefited from a Postdoctoral grant from Fundação de Amparo à Pesquisa do Estado de São Paulo [FAPESP grant number 2012/08661-3].
References
R Core Team.