- Split View
-
Views
-
Cite
Cite
Elvis Han Cui, Dongyuan Song, Weng Kee Wong, Jingyi Jessica Li, Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime, Bioinformatics, Volume 38, Issue 16, August 2022, Pages 3927–3934, https://doi.org/10.1093/bioinformatics/btac423
- Share Icon Share
Abstract
Modeling single-cell gene expression trends along cell pseudotime is a crucial analysis for exploring biological processes. Most existing methods rely on nonparametric regression models for their flexibility; however, nonparametric models often provide trends too complex to interpret. Other existing methods use interpretable but restrictive models. Since model interpretability and flexibility are both indispensable for understanding biological processes, the single-cell field needs a model that improves the interpretability and largely maintains the flexibility of nonparametric regression models.
Here, we propose the single-cell generalized trend model (scGTM) for capturing a gene’s expression trend, which may be monotone, hill-shaped or valley-shaped, along cell pseudotime. The scGTM has three advantages: (i) it can capture non-monotonic trends that are easy to interpret, (ii) its parameters are biologically interpretable and trend informative, and (iii) it can flexibly accommodate common distributions for modeling gene expression counts. To tackle the complex optimization problems, we use the particle swarm optimization algorithm to find the constrained maximum likelihood estimates for the scGTM parameters. As an application, we analyze several single-cell gene expression datasets using the scGTM and show that scGTM can capture interpretable gene expression trends along cell pseudotime and reveal molecular insights underlying biological processes.
The Python package scGTM is open-access and available at https://github.com/ElvisCuiHan/scGTM.
Supplementary data are available at Bioinformatics online.
1 Introduction
Pseudotime analysis is one of the most important topics in single-cell transcriptomics. There has been fruitful work on inferring cell pseudotime (Bendall et al., 2014; Cao et al., 2019; Ji and Ji, 2016; Magwene et al., 2003; Mondal et al., 2021; Qiu et al., 2017; Shin et al., 2015; Street et al., 2018; Trapnell et al., 2014) and constructing statistical models for gene expression along the inferred cell pseudotime (Bacher et al., 2018; Campbell and Yau, 2017; Ren and Kuan, 2020; Song and Li, 2021; Van den Berge et al., 2020). Informative trends of gene expression along cell pseudotime may reflect molecular signatures in the biological processes. For instance, a gene may over time exhibit a hill-shaped trend (i.e. first-upward-then-downward) (Fig. 1b) or a valley-shaped trend (i.e. first-downward-then-upward) (Fig. 1c) trend, and both trends may indicate the occurrence of some biological event. Hence, it is of great interest to have a statistical model that can capture hill- and valley-shaped gene expression trends along cell pseudotime.
Two types of statistical methods have been developed to model the relationship between a gene’s expression in a cell (or a sample) and the cell pseudotime (or the sample’s physical time). Methods of the first type are based on statistical regression models, such as generalized linear models (GLM) and generalized additive models (GAM). Specifically, the GLM used in the Monocle3 method (Cao et al., 2019) assumes that a gene’s log-transformed expected expression in a cell is a linear function of the cell pseudotime, making it unable to capture hill- and valley-shaped trends. Consequently, most methods use nonparametric regression models, such as the GAM and piecewise linear models, to capture complex trends. For example, Storey et al. (2005) applied basis regression; Trapnell et al. (2014) considered the GAM with the Tobit likelihood; Ren and Kuan (2020) applied the GAM with Bayesian shrinkage dispersion estimates; Van den Berge et al. (2020) proposed tradeSeq using the spline-based GAM. More recently, Song and Li (2021) proposed the PseudotimeDE method, which fixes the P-value calibration issue in tradeSeq and also uses the spline-based GAM. Additionally, Bacher et al. (2018) used a piecewise linear model, which is more restrictive than the GAM. Locally estimated scatterplot smoothing (LOESS) is another nonparametric smoothing method that is often used for capturing gene expression trends. Although these nonparametric methods can fit complex gene expression trends, they are prone to over-fitting without proper hyper-parameter tuning (as we will show in Section 3), and their parameters either do not directly inform the shape of a trend (e.g. hill-shaped) or carry biological meanings.
Unlike the first type, methods of the second type use models with direct relevance to gene expression dynamics, and notable methods include ImpulseDE/ImpulseDE2 (Chechik and Koller, 2009; Sander et al., 2017; Fischer et al., 2018) and switchDE (Campbell and Yau, 2017). Specifically, ImpulseDE2 estimates a gene expression trend using a double-logistic curve to capture the non-monotonicity; however, even though the parameters have biological interpretations, they do not intuitively inform the shape of a trend. In contrast, switchDE uses a restrictive model with parameters that directly inform the shape of a trend (e.g. a gene’s activation time) but is unable to detect non-monotonic trends.
The above review suggests that there is no current model that can capture monotone, hill-shaped and valley-shaped trends with biologically interpretable and trend-informative parameters. To this end, we propose the scGTM that (i) can capture both hill- and valley-shaped trends and monotone trends, (ii) has interpretable and trend-informative parameters and (iii) has flexible modeling for count data.
To estimate the scGTM parameters, we apply particle swarm optimization (PSO) to find the constrained maximum likelihood estimates (MLE) of the model parameters (Supplementary Fig. S21). PSO has several advantages that make it suitable for our optimization problem: (i) it does not require the objective function to be convex or differentiable; (ii) it can handle boundary constraints and discrete parameters without having to re-formulate the objective function, and (iii) unlike the Newton-type algorithms used in the studies by Trapnell et al. (2014), Wood (2017) and Campbell and Yau (2017), PSO is gradient-free. In addition, PSO codes are freely available and easy to implement; PSO’s successes in tackling complex optimization problems are already well documented in computer science and engineering.
The rest of the article is organized as follows. In Section 2, we introduce the scGTM and briefly review the PSO algorithm. In Section 3, we compare the scGTM with the GLM, GAM, ImpulseDE2, switchDE and LOESS, and we show scGTM’s advantages in capturing informative, interpretable gene expression trends in two real datasets. Section 4 contains a discussion and future work.
2 Materials and methods
2.1 The scGTM formulation
Let be an observed G × C gene expression count matrix, where G is the number of genes, C is the number of cells (i.e. the number of pseudotime values), and ygc is the (g, c)-th element indicating the observed expression count of gene in cell . We consider gene expression counts as random variables whose randomness comes from experimental measurement uncertainty, so ygc is a realization of the random count variable Ygc. Given a particular gene g, for notation simplicity, we drop the subscript g and denote Ygc as Yc and ygc as yc. We denote by the inferred (normalized) pseudotime of cell c. In the scGTM, are treated as fixed values of pseudotime and serve as the covariate vector of interest.
Given tc, the scGTM can model the count variable Yc using four count distributions commonly used for gene expression data: the Poisson, negative binomial (NB), zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) distributions.
We design the parametric form (2) for the following reasons. First, on the left-hand side, is motivated by the logarithmic link function used in the GLM and GAM. The addition of 1 is to ensure that so we can use as the baseline of the hill-shaped trend (empirically, b is set to 0 and works well). Second, on the right-hand side, two partial Gaussian functions are adopted to model the trend’s increasing and decreasing parts separately, so that the trend is allowed to be asymmetric (e.g. the increasing trend may be steeper or flatter than the decreasing trend). We choose to piece two partial Gaussian functions at the maximum into one function for two reasons: (i) the function is smooth (differentiable everywhere) and has a zero derivative at the maximum; and (ii) the function has tails that converge to the baseline b as the pseudotime tc moves away from the mode t0, a pattern that agrees with many biological processes.
Figure 1a shows the roles of the four parameters , k1, k2 and t0 in (2) for modeling a hill-shaped trend. For a valley-shaped trend, there are four similar parameters, and we note that a monotone increasing trend is a special case of a hill-shaped trend with the increasing part only. The four parameters in the Figure 1a are the maximum log expected expression , the activation strength k1, the repression strength k2, and the change time t0 where the expected expression stops increasing. Figure 1b and c show the scGTM fitted to the gene Tmsb10 in the GYRUS dataset and the gene NFKBIA in the LPS dataset (Supplementary Table S1). The fitted trends are hill- and valley-shaped, respectively.
In the hill-shaped scGTM, we assume that a gene’s expression count Yc in cell c has mean parameter τc and zero-inflation parameter pc, and both depend on the pseudotime tc of cell c. In (2), we link τc to tc. In (3), we link pc to tc using a logistic regression with predictor , i.e. the logistic transformation of pc is a linear function of (with slope α and intercept β) and thus a function of tc.
Besides and , the following parameters of the hill-shaped scGTM shown in Figure 1a need to be estimated for biological interpretations:
: magnitude of the hill;
: activation strength (how fast the gene is up-regulated);
: repression strength (how fast the gene is down-regulated);
: change time (where the gene reaches the maximum expected expression). It is within because the pseudotime is normalized to .
: magnitude of the valley;
: activation strength (how fast the gene is up-regulated);
: repression strength (how fast the gene is down-regulated);
: change time (where the gene reaches the minimum expected expression).
Compared to the hill-shaped scGTM, the valley-shaped scGTM has an additional baseline parameter b that needs to be estimated. For simplicity, we estimate b by a plug-in estimator , where are the observed counts of a valley-shaped gene. This estimate is justified by the fact that the maximum likelihood estimate (MLE) of the upper bound parameter of a domain is the maximum order statistic; i.e. if are randomly sampled from a distribution with domain , then is the MLE of b. For the common parameters of the hill- and valley-shaped scGTMs, we next discuss how PSO can provide constrained likelihood estimates for these parameters.
2.2 Constrained MLE and the PSO algorithm
For a valley-shaped gene, the constrained MLE problem is similar, and we omit the discussion for space consideration.
There are two difficulties in the optimization problem (5). First, the likelihood function (6) is neither convex nor concave. Second, the constraint is linear in , k1, k2 and t0, but is a positive integer-valued variable. Hence, conventional optimization algorithms such as P-IRLS in GAM (Wood, 2011, 2017) and L-BFGS in switchDE (Campbell and Yau, 2017) are difficult to apply in this case. Metaheuristics is a class of assumptions-free general purpose optimization algorithms used to tackle challenging and high-dimensional optimization problems in quantitative sciences (Whitacre, 2011a,b; Yang, 2017). PSO is an exemplary metaheuristic algorithm, and it has effectively solved various types of optimization problems. Korani and Mouhoub (2021) is a recent review of metaheuristic algorithms and their applications across various disciplines.
PSO first generates a swarm of candidate solutions (known as particles) to the optimization problem (5). At each iteration, particles change their positions within the constraints, and the algorithm finds the best solution among all particle trajectories. We summarize the vanilla PSO algorithm (Bratton and Kennedy, 2007) for the constrained MLE of the scGTM in Algorithm 1, and we provide further details of PSO in the Supplementary Information.
Input data: a gene’s expression counts and cell pseudotime values y: a gene expression count vector; t: a cell pseudotime vector;
Input parameters:
F: count distribution: Poisson, NB, ZIP, or ZINB;
H: number of iterations in PSO; set to H = 100 by default; w, c1, and c2: hyperparameters of PSO; set to w = 0.9, , and by default;
Algorithm:
Randomly initialize Θ with B particles: ;
Randomly initialize velocity vectors for the B particles: ;
For h = 0 to H:
- Update the best solution of each particle iwhere ;
- Update the global best solution
- Update velocity of each particle iwhere and are independently generated from ;
- Update each particle
Set ;
Calculate 95% approximate confidence intervals of key parameters based on (Section 2.3).
Output:
: fitted negative log likelihood value;
: estimated parameters;
, and : 95% approximate confidence intervals.
2.3 Approximate confidence intervals of the four key parameters in the scGTM
3 Results
3.1 scGTM outperforms GAM, GLM, LOESS, switchDE and ImpulseDE2 in capturing informative and interpretable trends
As an example, we use the MAOA gene in the WANG dataset (Wang et al., 2020) (Supplementary Table S1) to compare the fitted trends of the scGTM, GAM, GLM, LOESS, switchDE and ImpulseDE2. In the original study, the gene was reported to have a hill-shaped trend. Our comparison results have several interesting observations. First, we show that the scGTM provides more informative and interpretable gene expression trends than the GAM and GLM do. Figure 2a shows that the scGTM robustly captures the hill-shaped trends by assuming the Poisson, ZIP, NB and ZINB distributions and consistently estimates the change time around 0.75, which is where the MAOA gene reaches its expected maximum expression. While the GAM also estimates the maximum expression around 0.75, its estimated trends are much more complex. This is likely due to possible overfitting (despite the use of penalization), and consequently, the GAM trends are more difficult to interpret than the scGTM trends (Fig. 2b). Unlike the scGTM and GAM, the GLM can only capture monotone trends, making it unable to detect the possible existence of expression change time (Fig. 2c). Second, we compare the scGTM with the two existing methods, switchDE and ImpulseDE2, that use models with direct relevance to gene expression dynamics. Although switchDE estimates the activation time around 0.75, similar to the scGTM’s estimated change time, switchDE cannot capture the downward expression trend as the cell pseudotime approaches 1.00 due to its monotone nature (Fig. 2d). ImpulseDE2 can theoretically capture a hill-shaped trend, but it only fits a monotone increasing trend for the MAOA gene (Fig. 2e). A likely reason is that the method was designed for time-course bulk RNA-seq data. Third, we compare the scGTM with the LOESS method commonly used for exploratory data analysis. While LOESS outputs a reasonable, though less smooth trend (Fig. 2f), it is not probability-based and thus does not have a likelihood. Hence, LOESS does not allow likelihood-based model selection, a functionality of the scGTM. To summarize, the scGTM outperforms the GAM, GLM, LOESS, switchDE and ImpulseDE2 by providing more informative and interpretable trends with less concern on model overfitting.
In addition to the MAOA gene, Wang et al. (2020) reported 19 other exemplary genes that define menstrual cycle phases and exhibit hill-shaped expression trends along the cell pseudotime. Supplementary Figures S1–S19 compare the various model fits for the 19 genes, and we observe that the scGTM consistently provides more informative, interpretable trends than the other methods do.
Besides visually inspecting the fitted expression trends, we compare the AIC values of the scGTM, GAM and GLM used with the four count distributions fitted to the aforementioned 20 genes. Note that a lower AIC value indicates a model’s better goodness-of-fit with the model complexity penalized. Supplementary Figure S20 shows that the scGTM has comparable or even lower AIC values than the GAM’s AIC values, confirming that the scGTM fits well to data despite its much simpler model than GAM’s. Based on Figure 2 and Supplementary Figures S1–S20, we use the scGTM with the Poisson distribution in the following applications for its goodness-of-fit and model simplicity. This choice is consistent with previous research on modeling sequencing data (Silverman et al., 2020; ,Jiang, 2022) and other count data (Campbell, 2021; Warton, 2005).
3.2 scGTM recapitulates gene expression trends of endometrial transformation in the human menstrual cycle
The WANG dataset contains 20 exemplar genes that exhibit temporal expression trends in unciliated epithelia cells in the human menstrual cycle (Wang et al., 2020). The original study also ordered the 20 genes by the estimated pseudotime at which they achieved the maximum expression (Fig. 3a; genes ordered from top to bottom), and it was found that the ordering agreed well with the menstrual cycle phases (Fig. 3a; the top bar indicates the phases). Comparing the fitted expression trends of the 20 genes by the scGTM, switchDE and ImpulseDE2, we observe that only the scGTM trends agree well with the data (Fig. 3). Additionally, we evaluate the 20 genes’ estimated change times (i.e. t0) by the scGTM and their estimated activation times by the switchDE. Although the change times and activation times are both expected to correlate well with the gene ordering in the original study, only the change times estimated by the scGTM fulfills this expectation (Fig. 3b and c). Compared with the scGTM, switchDE miscalculates the activation times for many hill-shaped genes whose maximum expression occurs in the middle of the cycle; this is likely due to the fact that switchDE can only capture monotone trends (Fig. 3c). Similarly, ImpulseDE2 cannot well capture the trends of those hill-shaped genes (Fig. 3d). Unlike switchDE and ImpulseDE2, the scGTM estimates the change times reasonably for almost all genes. For instance, the GPX3 gene has an estimated change time at 0.88, consistent with its role as a secretory middle/late phase marker gene (Wang et al., 2020).
Besides the 20 exemplar genes, we apply the scGTM, switchDE and ImpulseDE2 to fit the expression trends of all 1382 menstrual cycle genes reported in Wang et al. (2020). Supplementary Figure S28 shows that the scGTM still outperforms switchDE and ImpulseDE2 for capturing these genes’ expression trends. In summary, the scGTM provides useful summaries for gene expression trends in the human menstrual cycle.
3.3 scGTM identifies informative gene expression trends after immune cell stimulation
As the second real data application, we use the scGTM to categorize gene expression trends in mouse dendritic cells (DCs) after stimulation with lipopolysaccharide (LPS, a component of Gram-negative bacteria) (Shalek et al., 2014). First, we apply the likelihood ratio tests to screen the genes where the scGTM fits significantly better than the null Poisson model [in which τc and pc in (1) do not depend on cell pseudotime tc] does. Assuming that the likelihood ratio statistic of every gene follows as the null distribution, we retain 2405 genes whose Benjamini–Hochberg (BH) adjusted P-values .
Second, we use the scGTM’s confidence intervals of the three parameters t0, k1 and k2 to categorize the 2405 genes into three types: (i) hill-shaped and mostly increasing genes: (change time occurs at late pseudotime) and (strong activation strength), (ii) hill-shaped and mostly decreasing genes: (change time occurs at early pseudotime) and (strong repression strength) and (iii) valley-shaped genes. To demonstrate that this categorization is biologically meaningful, we perform gene ontology (GO) analysis on the three gene types and compare the enriched GO terms. Figure 4a shows that the three gene types are enriched with largely unique GO terms, verifying their functional differences. Notably, the hill-shaped and mostly increasing genes are related to immune response processes, showing consistency between their expression trends (activation after the LPS stimulation) and functions (immune response). Further, we visualize five illustrative genes from each gene type (Fig. 4b) and observe that the scGTM’s fitted trends agree well with the data. In conclusion, the scGTM can help users discern genes with specific trends by its trend-informative parameters.
Besides the above three real data applications, we conduct a simulation study to verify the robustness of the scGTM to gene expression trends not generated from the scGTM assumptions. The simulation results also show that, beyond good interpretability, the scGTM is flexible enough to fit various trends to a similar extent as the GAM does (Supplementary Information S3). Moreover, we use a bootstrap analysis to show that the fitted scGTM trends have a smaller variance than the fitted GAM trends do (Supplementary Information S3), at the cost of a larger bias.
4 Discussion
We propose the scGTM as a flexible and interpretable statistical model for studying single-cell gene expression trends along cell pseudotime. Using four count distributions and two real datasets, we demonstrate that the scGTM has interpretable parameters that can directly inform a trend for gene expression counts. The scGTM parameters are estimated by the constrained maximum likelihood estimation via PSO, one of the most popular metaheuristic algorithms for function optimization. We show that scGTM has distinct advantages over the classic models GLM and GAM and the two recent methods switchDE and ImpulseDE2 in that it can uniquely capture robust, informative and interpretable trends. In contrast, the GLM and switchDE can only estimate monotonic trends; the GAM often provides trends that are too complex to interpret, and ImpulseDE2 (a method designed for bulk RNA-seq data) does not have stable performance on single-cell data. We then use the estimated parameters and confidence intervals from the scGTM to characterize gene expression trends.
Note that we can extend the scGTM by assuming a more complicated mean function, whose estimation can still be achieved by the flexible PSO algorithm. To demonstrate this functionality of the scGTM, we conduct a simulation in Supplementary Information S10, where we use the sine function to generate a gene’s true expression trend along the pseudotime. With its mean function set as as the sine function, the scGTM accurately estimates the gene trend (Supplementary Fig. S25). In a future version of the scGTM package, we can allow users to input specified mean functions that reflect the gene expression trends of interest. On the other hand, if users do not have any prior preference for the gene expression trends, we would recommend the GAM that can capture flexible trends.
Strictly speaking, the inference of the scGTM has two caveats. First, the parameter estimation includes a double-dipping procedure: the same data are first used to decide whether a trend is hill- or valley-shaped and second used to estimate the parameters. Second, since only the key parameters , k1, k2 and t0 are inferential targets, the other parameters , α and β should be regarded ‘nuisance’ parameters. However, the construction of confidence intervals of the key parameters does not account for these two caveats and would thus result in overly optimistic confidence intervals. We will investigate how to obtain better-calibrated confidence intervals in future research.
In our previous work (Song and Li, 2021), we developed a method PseudotimeDE to account for the uncertainty of inferred pseudotime on the inference of differentially expressed genes along the pseudotime. Note that PseudotimeDE is directly extendable to the scGTM, by just replacing the GAM in PseudotimeDE by the scGTM. However, here our focus is on proposing the scGTM for interpreting a trend, instead of testing whether a trend is different from a horizontal line, i.e. the focus of PseudotimeDE. Hence, we leave the incorporation of the scGTM into PseudotimeDE to future work. Moreover, we have a simulation study to show that the fitted scGTM trends have shapes largely robust to noise added to pseudotime (Supplementary Information S4).
The current implementation of the scGTM is only applicable to a single pseudotime trajectory (i.e. cell lineage). A natural extension is to split a multiple-lineage cell trajectory into single lineages and fit the scGTM to each lineage separately.
In addition, the vanilla PSO algorithm in this article handles each parameter’s constraint separately. Hence, if we need a constraint on more than one parameter, e.g. should be within an user-specified range, then we have to develop a variant algorithm of PSO or use other metaheuristics algorithms.
Acknowledgements
The authors thank Dr Wanxin Wang for providing the data in Wang et al. (2020). They also appreciate the comments and feedback from the members of the Junction of Statistics and Biology at UCLA (http://jsb.ucla.edu).
Funding
This work was supported by the National Science Foundation [DBI-1846216 and DMS-2113754]; National Institutes of Health/NIGMS [R01GM120507 and R35GM140888]; Johnson and Johnson WiSTEM2D Award; Sloan Research Fellowship; UCLA David Geffen School of Medicine W.M. Keck Foundation Junior Faculty Award; and Chan-Zuckerberg Initiative Single-Cell Biology Data Insights Grant (to J.J.L.).
Conflict of Interest: none declared.
Data availability
The code and data for generating the results are available at Zenodo (doi: 10.5281/zenodo.5728341). The scGTM Python package is available at https://github.com/ElvisCuiHan/scGTM.
References
Author notes
The authors wish it to be known that, in their opinion, Elvis Han Cui and Dongyuan Song should be regarded as Joint First Authors.