Liguo Wang, Hyun Jung Park, Surendra Dasari, Shengqin Wang, Jean-Pierre Kocher, Wei Li, CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model, Nucleic Acids Research, Volume 41, Issue 6, 1 April 2013, Page e74, https://doi.org/10.1093/nar/gkt006
Abstract
Thousands of novel transcripts have been identified using deep transcriptome sequencing. The discovery of this large, ‘hidden’ transcriptome has renewed the demand for methods that can rapidly distinguish between coding and noncoding RNA. Here, we present a novel alignment-free method, Coding-Potential Assessment Tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT (sensitivity: 0.96, specificity: 0.97) outperformed other state-of-the-art alignment-based software such as Coding-Potential Calculator (sensitivity: 0.99, specificity: 0.74) and Phylogenetic Codon Substitution Frequencies (sensitivity: 0.90, specificity: 0.63). In addition to high accuracy, CPAT is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylogenetic Codon Substitution Frequencies, enabling its users to process thousands of transcripts within seconds. The software accepts input sequences in either FASTA- or BED-formatted data files. We also developed a web interface for CPAT that allows users to submit sequences and receive the prediction results almost instantly.
INTRODUCTION
Although the human genome sequence was released a decade ago, the role of functional noncoding RNAs (ncRNAs) is much less understood compared with their coding counterparts. Several previous studies have demonstrated that the human genome is pervasively transcribed (1–4), but thoroughly cataloging all the RNA species (especially ncRNA) is challenging. Undiscovered ncRNAs might be rare, transient or beyond the detection limits of conventional approaches. Furthermore, ncRNAs also tend to be idiosyncratic to species and tissues (5,6). Nevertheless, advances in RNA-Seq have provided a new method of surveying the whole transcriptome to an unprecedented degree. Recent genome-wide studies revealed tens of thousands of novel transcripts, the majority of which were long noncoding RNAs (lncRNAs, >200 nt) (4–9). Although a few dozen lncRNAs have been characterized to some extent and are reported to have critical roles in diverse cellular and disease development processes (6,10–14), the biogenesis and function of most lncRNAs remain unclear.
Accurate and quantitative assessment of coding potential is the first step toward comprehensive annotation of newly discovered transcripts. Until now, prediction of coding potential has relied heavily on sequence alignment, either pairwise homology search for protein evidence such as that used in the Coding-Potential Calculator (CPC) and PORTRAIT methods (15,16) or multiple alignments to calculate the phylogenetic conservation score such as that used in the Phylogenetic Codon Substitution Frequencies (PhyloCSF) and RNAcode methods (17,18). Alignment-based approaches are particularly useful for highly conserved protein-coding genes and, to a lesser extent, short genes encoding housekeeping or regulatory RNAs (e.g. snRNA, snoRNA, transfer RNA). However, these approaches cannot be applied directly to all novel transcripts because of several intrinsic limitations. First, most newly discovered transcripts are lncRNAs, which tend to be lineage specific and less conserved (5,6). This greatly limits the discriminatory power of alignment-based methods. For example, only 29 of 550 lncRNAs identified from zebrafish had detectable sequence similarity with putative mammalian orthologs (6), and only 993 of 8195 human lncRNAs have orthologous transcripts in other species (5). Second, a considerable fraction of lncRNAs overlap either the sense or antisense strand of protein-coding genes. These lncRNAs cannot be correctly classified by homology searching because they would have significant matches to protein-coding genes (3,8,19). Third, the reliability of alignment-based approaches largely depends on the quality of alignments (20). This is problematic because most widely used multiple-sequence alignment tools use heuristics and do not guarantee optimal alignments. Finally, alignment-based methods are extremely time-consuming. For instance, CPC and PhyloCSF took 2 days to evaluate the coding potential of 14 353 lncRNAs identified by Cabili et al. (5).
This problem is getting more attention as massive-scale RNA sequencing is increasingly being performed. Consequently, a more accurate, robust and faster method that does not rely on sequence alignment is needed to distinguish ncRNAs, especially lncRNAs, from protein-coding genes.
Here, we present Coding-Potential Assessment Tool (CPAT), an alignment-free program, which uses logistic regression to distinguish between coding and noncoding transcripts on the basis of four sequence features. CPAT is highly accurate (0.967) and extremely efficient (10 000 times faster than CPC and PhyloCSF, and 50 times faster than PORTRAIT). CPAT needs only the sequence or coordinate file as input, and it is straightforward to use. We expanded the availability of CPAT to a larger scientific audience via a web interface, which allows users to submit sequences and receive the prediction results back almost instantaneously (http://lilab.research.bcm.edu/cpat/index.php).
MATERIALS AND METHODS
Coding-potential prediction is essentially a binary decision problem, which makes logistic regression a suitable approach. Because CPAT is an alignment-free method, all selected features (predictor variables) were calculated directly from the sequence. The first feature was the maximum length of the open reading frame (ORF). ORF length is one of the most fundamental features used to distinguish ncRNA from messenger RNA because a long putative ORF is unlikely to be observed by random chance in noncoding sequences. Despite its simplicity, ORF length has high concordance with more sophisticated discrimination methods and remains the primary criterion in almost all coding-potential prediction methods (21). The second feature was ORF coverage, defined as the ratio of ORF length to transcript length. This feature also has good classification power, and it is highly complementary to, and independent of, the ORF length (Supplementary Figures S1 and Supplementary Data). Some large bona fide ncRNAs may contain putative long ORFs by random chance (5), and thus cannot be classified correctly by ORF length alone. Fortunately, those large ncRNAs usually have much lower ORF coverage than protein-coding RNAs (Figure 1B).
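The two ORF features can be illustrated with a minimal sketch. It assumes an ORF runs from the first in-frame ATG to the next in-frame stop codon on the given strand and that the stop codon counts toward ORF length; these conventions, and the function names `longest_orf` and `orf_features`, are illustrative assumptions and may differ from CPAT's own implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF; (0, 0) if none.

    Assumption: only the three forward frames of the given strand are scanned.
    """
    seq = seq.upper().replace("U", "T")
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)      # keep the longest ORF seen so far
                start = None                   # close the ORF and keep scanning
    return best

def orf_features(seq):
    """Return (ORF length in nt, ORF coverage = ORF length / transcript length)."""
    start, end = longest_orf(seq)
    orf_len = end - start
    coverage = orf_len / len(seq) if seq else 0.0
    return orf_len, coverage
```

For a long transcript that contains a long ORF by chance, `orf_len` can be large while `coverage` stays small, which is the complementarity described above.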
The third feature was the Fickett TESTCODE statistic (Fickett score). The Fickett score is independent of the ORF, and when the test region is ≥200 nt in length (which includes most lncRNAs), this feature alone can achieve 94% sensitivity and 97% specificity, with ‘no opinion’ on 18% of the sequences (22).
The fourth feature was the hexamer score, which measures the relative degree of hexamer usage bias in a particular sequence. Positive values indicate a coding sequence, whereas negative values indicate a noncoding sequence.
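One common way to obtain a score with this sign behaviour is an average log-likelihood ratio of in-frame hexamer frequencies between coding and noncoding training sequences. The sketch below follows that idea; the one-codon step, the pseudocount `pseudo` and all function names are assumptions made for illustration and are not taken from the CPAT source code.

```python
import math
from collections import Counter

def in_frame_hexamers(seq):
    """Yield hexamers (two adjacent codons), stepping one codon at a time."""
    seq = seq.upper()
    for i in range(0, len(seq) - 5, 3):
        yield seq[i:i + 6]

def hexamer_frequencies(seqs):
    """Relative in-frame hexamer frequencies over a set of training sequences."""
    counts = Counter(h for s in seqs for h in in_frame_hexamers(s))
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}

def hexamer_score(seq, coding_freq, noncoding_freq, pseudo=1e-6):
    """Mean log ratio of coding to noncoding hexamer frequency.

    Assumption: a small pseudocount avoids log(0) for unseen hexamers.
    """
    logs = [
        math.log((coding_freq.get(h, 0.0) + pseudo) /
                 (noncoding_freq.get(h, 0.0) + pseudo))
        for h in in_frame_hexamers(seq)
    ]
    return sum(logs) / len(logs) if logs else 0.0
```

A sequence whose hexamers are more frequent in the coding table yields a positive score, and one whose hexamers are more frequent in the noncoding table yields a negative score.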
We built a logistic regression model using these four linguistic features as predictor variables. A χ2 test was used to evaluate whether our logit model with predictors fits the training data significantly better than the null model, which had only an intercept. We built a high-confidence training data set to measure the prediction performance of our logit model. This data set contained 10 000 protein-coding transcripts selected from the RefSeq database; all transcripts had high-quality protein sequences annotated by the Consensus Coding Sequence project. We also added 10 000 randomly selected noncoding transcripts from the GENCODE database. We evaluated the model with 10-fold cross-validation and measured its sensitivity, specificity, accuracy, precision and area under the curve (AUC). The receiver operating characteristic (ROC) curve and precision–recall (PR) curve were generated using the ROCR package (24). We also built a nonparametric two-graph ROC curve for selecting the optimal CPAT score threshold that maximizes the sensitivity and specificity of CPAT while minimizing misclassifications.
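As a rough illustration of the model form and the evaluation measures named above, the following sketch computes a coding probability from four feature values and derives sensitivity, specificity, accuracy and precision at a given score threshold. The coefficients and intercept are caller-supplied placeholders, not CPAT's fitted values, and the function names are hypothetical.

```python
import math

def coding_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)               # numerically stable branch for negative z
    return e / (1.0 + e)

def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0

def classification_metrics(labels, scores, threshold):
    """labels: 1 = coding, 0 = noncoding; a score >= threshold is called coding."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        called_coding = s >= threshold
        if y == 1 and called_coding:
            tp += 1
        elif y == 1:
            fn += 1
        elif called_coding:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
    }
```

In a 10-fold cross-validation, these measures would be computed on each held-out fold using a model fitted on the remaining nine folds.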
RESULTS
All four selected features were concordantly higher in coding transcripts and lower in noncoding transcripts (Figure 1). We plotted three major features (ORF size, Fickett score and hexamer score) in a three-dimensional space to evaluate their combinatorial effect (Figure 2). Coding and noncoding transcripts in our training data set were grouped into two distinct clusters, indicating good concordance between features. The χ2 test P value was <0.001 (χ2 = 23 548.44; degrees of freedom = 4), indicating that the logit model as a whole fits significantly better than the null model. Ten-fold cross-validation showed that CPAT could achieve very high accuracy, with an AUC of 0.9927 (Figure 3A). We also provide the PR curve because the ROC curve can be misleading when the test data are largely skewed (Figure 3B). We used nonparametric two-graph ROC curves to determine an optimal CPAT score threshold that maximizes the discriminatory power (Figure 3C and D). According to Figure 3D, a score threshold of 0.364 gave the highest sensitivity and specificity (0.966 for both) for human data.
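The threshold selection can be sketched as follows, under the assumption that the two-graph ROC optimum is taken where the sensitivity and specificity curves cross (consistent with the equal values reported above). The function name and the choice of candidate thresholds are illustrative.

```python
def two_graph_roc_threshold(labels, scores):
    """Return (threshold, sensitivity, specificity) at the curve crossing.

    labels: 1 = coding, 0 = noncoding; a score >= threshold is called coding.
    Assumption: candidate thresholds are the observed scores, and the optimum
    is the candidate that minimizes |sensitivity - specificity|.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = None
    for t in sorted(set(scores)):
        sens = sum(s >= t for s in pos) / len(pos)
        spec = sum(s < t for s in neg) / len(neg)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, t, sens, spec)
    _, t, sens, spec = best
    return t, sens, spec
```

Plotting sensitivity and specificity against the threshold on the same axes gives the two graphs; the function returns the point at which they are closest.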
We compared the performance of CPAT with that of CPC, PhyloCSF and PORTRAIT (protein-independent support vector machine model) using an independent test data set composed of 4000 coding genes and 4000 noncoding genes. A multiple alignment of 45 vertebrate genomes, including that of human, was downloaded from the UCSC (University of California, Santa Cruz) Genome Browser and was used as the input alignment for PhyloCSF. In general, CPAT (sensitivity: 0.96, specificity: 0.97) had greater classification power than all other programs (Figure 4; Supplementary Tables S1 and Supplementary Data). Although CPC had the highest sensitivity (0.99), it suffered from poor specificity (0.74). One possible explanation is that a significant proportion of ncRNAs has a certain degree of sequence similarity to protein-coding genes. PhyloCSF had the lowest sensitivity (0.90) and the lowest specificity (0.63). Part of the reason for these outcomes is that nonconserved transcripts cannot be processed by PhyloCSF. If we consider those 528 nonconserved transcripts as noncoding, the specificity increases from 0.63 to 0.69, and the sensitivity remains unchanged. PORTRAIT had relatively balanced sensitivity (0.96) and specificity (0.87). CPAT achieved the highest overall accuracy (0.97) when compared with CPC (0.87), PhyloCSF (0.76) and PORTRAIT (0.92). CPAT’s excellent discriminatory power was further demonstrated by the greatest separation between the score distributions of coding and noncoding sequences (Figure 5). In contrast to CPC, PhyloCSF and PORTRAIT, CPAT allows a smaller score threshold to be chosen to increase sensitivity without sacrificing much specificity.
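Because the test set is balanced (4000 coding and 4000 noncoding genes), overall accuracy is simply the mean of sensitivity and specificity. The short check below recomputes accuracy from the rounded sensitivity and specificity values quoted above; since those inputs are rounded to two decimals, agreement with the reported accuracies is only expected to within about 0.01. The function name is hypothetical.

```python
def overall_accuracy(sensitivity, specificity, n_coding=4000, n_noncoding=4000):
    """Fraction of correctly classified genes given per-class rates."""
    correct = sensitivity * n_coding + specificity * n_noncoding
    return correct / (n_coding + n_noncoding)

# (sensitivity, specificity, reported accuracy) as quoted in the text.
REPORTED = {
    "CPAT": (0.96, 0.97, 0.97),
    "CPC": (0.99, 0.74, 0.87),
    "PhyloCSF": (0.90, 0.63, 0.76),
    "PORTRAIT": (0.96, 0.87, 0.92),
}

for tool, (sens, spec, reported) in REPORTED.items():
    recomputed = overall_accuracy(sens, spec)
    print(f"{tool}: recomputed {recomputed:.3f}, reported {reported:.2f}")
```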
One could argue that PhyloCSF underperformed in this study because we used whole transcripts for testing rather than consecutive protein-coding exons and intergenic regions as used in its original article (17). To address this issue, we compiled another single-exon test data set consisting of 184 protein-coding and 278 noncoding transcripts. The test results with this data set indicated that CPAT (sensitivity: 0.962, specificity: 0.842) still outperformed PhyloCSF (sensitivity: 0.832, specificity: 0.588, Supplementary Figure S2). However, when tested on PhyloCSF’s original data set in Lin et al. (25), PhyloCSF (sensitivity 0.91, specificity 0.99) performed better than CPAT (sensitivity 0.50, specificity 0.98). This is reasonable because lncRNAs in our test data set are poorly conserved, whereas lncRNAs in the Lin et al. test data set are highly conserved because they are taken from multiple-sequence alignments of three closely related Drosophila species. Hence, we argue that PhyloCSF works better if the transcripts are highly conserved, which is rare among lncRNAs (5,6). This also highlights the Achilles’ heel of the alignment-based methods for detecting lncRNAs. In contrast, the dramatic decrease in CPAT’s sensitivity is due to the lack of ORF information in the Lin et al. test data set, which is largely composed of individual exons rather than complete, full-length transcripts. This, however, will not limit the application scope of CPAT because most full-length transcripts can be constructed at the current sequencing depth (8).
We measured the computational speed of CPAT, CPC and PhyloCSF on a sample of 200 sequences randomly selected from the test data set. CPAT took 0.67 s to process the data, and it was four orders of magnitude faster than both CPC [11 945 s (3.3 h)] and PhyloCSF [11 737 s (3.3 h)]. Furthermore, the computational time for PhyloCSF did not include the time spent preparing multiple-alignment files for analysis. PORTRAIT was significantly faster than CPC and PhyloCSF, and therefore all 8000 test genes were used to evaluate its speed: CPAT took 23.83 s to process the test set, and it was 48 times faster than PORTRAIT [1146.30 s (19 min)].
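The speed comparisons follow directly from the reported timings, as the short check below illustrates. The constant and function names are illustrative; the numbers are the ones quoted above.

```python
import math

# Timings in seconds reported for the 200-sequence sample.
CPAT_200, CPC_200, PHYLOCSF_200 = 0.67, 11945.0, 11737.0
# Timings in seconds reported for the full 8000-gene test set.
CPAT_8000, PORTRAIT_8000 = 23.83, 1146.30

def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast tool is."""
    return slow_seconds / fast_seconds

def orders_of_magnitude(ratio):
    """Base-10 logarithm of a speed-up ratio."""
    return math.log10(ratio)

for name, slow in (("CPC", CPC_200), ("PhyloCSF", PHYLOCSF_200)):
    r = speedup(slow, CPAT_200)
    print(f"CPAT vs {name}: {r:.0f}x, log10 = {orders_of_magnitude(r):.2f}")
print(f"CPAT vs PORTRAIT: {speedup(PORTRAIT_8000, CPAT_8000):.1f}x")
```

Both 200-sequence ratios are roughly 1.8 × 10^4, that is, a little over four orders of magnitude, and the PORTRAIT ratio rounds to 48.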
DISCUSSION
A number of linguistic features characterizing coding RNA sequences have been developed over the past 30 years. These include maximum ORF size, dinucleotide usage, codon usage bias, hexamer usage bias, nucleotide composition bias between codon positions and imperfect periodicity in base occurrences (23,26). Among these features, we selected ORF features (size and coverage) because of their discriminatory power and ease of calculation (21). In-frame hexamer score was selected because it has the highest prediction accuracy (average of sensitivity and specificity) as evaluated by Fickett and Tung in 1992 (23). Fickett score was selected because it simultaneously captures the compositional bias and position asymmetry, which are orthogonal to the ORF features. Supplementary Figure S3 shows the performance of these individual features as well as the combined feature set. The combined feature set has very high sensitivity and specificity (>0.966), leaving very little room for further improvement.
Annotation of genomes has always been a challenging task for biologists, and these efforts have been accelerated by deep transcriptome sequencing. Distinguishing between protein-coding and noncoding sequences is the first and arguably the most crucial step in genome annotation. Most novel transcripts are poorly conserved, species-specific ncRNAs, and detecting their coding potential with alignment-based software is intractable. We developed CPAT, a highly accurate alignment-free method, which uses a logistic regression model to discriminate between coding and noncoding transcripts using pure linguistic features. Compared with other tools, CPAT is more robust, markedly faster and more convenient to use. Taken together, CPAT is able to accurately assess the coding potential of tens of thousands of transcripts in real time, and will be a valuable tool for the rapidly growing RNA-seq community.
AVAILABILITY AND IMPLEMENTATION
Source code was implemented in C and Python and is freely available at: http://code.google.com/p/cpat/. The web server was implemented in PHP, MySQL and Apache, with support for all major browsers: http://lilab.research.bcm.edu/cpat/index.php.
FUNDING
Department of Defense Prostate Cancer Program [PC094421 to W.L.]; the Cancer Prevention and Research Institute of Texas [RP110471-C3 to W.L.]; the Center for Individualized Medicine (CIM) at Mayo Clinic (to J.P.K.). Funding for open access charge: Cancer Prevention and Research Institute of Texas [RP110471-C3 to W.L.].
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors thank Chen Wang (Mayo Clinic) and two anonymous reviewers for their valuable suggestions. We also thank Mayo’s section of scientific publication for their copy-editing services.
REFERENCES
Author notes
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.