Ran Su, Jie Hu, Quan Zou, Balachandran Manavalan, Leyi Wei, Empirical comparison and analysis of web-based cell-penetrating peptide prediction tools, Briefings in Bioinformatics, Volume 21, Issue 2, March 2020, Pages 408–420, https://doi.org/10.1093/bib/bby124
Abstract
Cell-penetrating peptides (CPPs) facilitate the delivery of therapeutically relevant molecules, including DNA, proteins and oligonucleotides, into cells both in vitro and in vivo. This unique ability makes CPPs attractive as therapeutic delivery vehicles, with potential applications in clinical therapy. Over the last few decades, a number of machine learning (ML)-based prediction tools have been developed, and some of them are freely available as web portals. However, the predictions produced by the various tools are difficult to quantify and compare; in particular, there has been no systematic comparison of the web-based prediction tools in terms of performance, especially in practical applications. In this work, we provide a comprehensive review of the biological importance of CPPs, CPP databases and existing ML-based methods for CPP prediction. To evaluate current prediction tools, we conducted a comparative study of 12 models from 6 publicly available CPP prediction tools on 2 benchmark validation sets of CPPs and non-CPPs. Our benchmarking results demonstrate that one model from KELM-CPPpred, namely KELM-hybrid-AAC, shows a significant improvement in overall performance compared to the other 11 prediction models. Moreover, through a length-dependency analysis, we found that existing prediction tools tend to predict CPPs and non-CPPs of 20–25 residues more accurately than peptides in other length ranges.
Introduction
Cell-penetrating peptides (CPPs) are short peptides of approximately 5–30 amino acid residues [1]. One of their most distinctive characteristics is the ability to carry a variety of bioactive molecules into cells without specific receptor interaction [2–5]. The cargoes that CPPs carry vary widely in size and include small-molecule compounds, dyes, peptides, peptide nucleic acids, proteins, plasmid DNA, liposomes, phage particles and superparamagnetic particles [2, 4]. Fluorescence validation experiments have verified the cell-penetrating capability of CPPs [6–8]. Given this unique property, CPPs that improve the cellular uptake of bioactive molecules are expected to be promising therapeutic candidates. In view of this potential, the development and application of CPP-based delivery strategies have steadily grown over the past few years, demonstrating great promise for gene delivery and cancer therapy, as well as clinical efficacy [2, 9].
The first CPP was discovered by Frankel et al. in the 1980s, who demonstrated that the human immunodeficiency virus 1 (HIV-1) transactivating (Tat) protein was able to enter tissue-cultured cells, translocate into the nucleus and transactivate viral gene expression [10]. The α-helical domain of the Tat protein spanning residues 48–60, composed mainly of basic amino acids, was found to be the main determinant of cell internalization and nuclear translocation. Subsequently, the penetratin peptide, the third helix of the Antennapedia homeodomain, was found to efficiently cross cell membranes via an energy-independent mechanism. These observations opened up basic research on CPPs. Since then, research on CPPs has gained growing interest over the last 30 years [11, 12], leading to an exponential increase in the number of known CPPs. Currently, there are 1850 experimentally validated entries in CPPsite 2.0, the largest CPP database [13], roughly twice as many as in its previous version (CPPsite) [14]. Of the entries in CPPsite 2.0, roughly 90% are derived from natural proteins [15], while the remainder are synthetic proteins or chimeric peptides [13, 16]. With the rapid development and wide application of next-generation sequencing techniques [17], a wealth of novel protein sequences is being generated rapidly and at low cost [15, 16]. Among these novel and uncharacterized protein sequences, more functional peptides with cell-penetrating activity can be expected [16]. Unfortunately, traditional experimental methods are extremely difficult to apply at this scale, especially with the avalanche of protein sequences, as they have intrinsic limitations: they are expensive, labor intensive and time-consuming [16].
To address these limitations, computational methods have recently emerged as a promising alternative for accurately and efficiently predicting CPPs.
Over the past few decades, a variety of computational methods, especially machine learning (ML)-based methods, have been developed for the prediction of CPPs [1, 6, 18–21]. ML techniques extract useful patterns hidden in experimentally validated CPPs and use those patterns to predict whether new, uncharacterized peptides have cell-penetrating activity. Importantly, they require only primary protein sequences as input, without any prior knowledge (e.g. secondary structure), showing great potential for high-throughput prediction on large-scale proteomic data. So far, various ML techniques have been applied to CPP prediction, including Support Vector Machines (SVM) [7, 22–24], Random Forests (RF) [8, 11, 12, 16, 25, 26], Neural Networks (NN) [20, 27–29], Extremely Randomized Trees (ERT) [15, 30–32] and Kernel Extreme Learning Machines (KELM) [33, 34], yielding a number of prediction methods. A common phenomenon among existing methods is that each claims to outperform previously published methods in its own study. However, the comparisons carried out in different studies are somewhat biased, for three reasons. First, the comparisons were performed by the developers themselves; when re-implementing competing tools, the chosen algorithm parameters greatly affect performance, and in some studies no algorithmic details are given at all [1, 6–8, 20, 26, 28], making a fair comparison difficult. Second, performances reported in different studies are not directly comparable because the benchmark datasets differ: different methods use different training and validation datasets [12, 15, 23, 34].
Third, performance was usually evaluated by cross-validation; independent tests, which are arguably more important, were seldom performed. Moreover, most existing methods do not highlight the comparison of specificity (SP), which measures a predictor's ability to recognize non-CPPs (negatives). SP is of great importance to wet-lab researchers, because a predictor with low SP will produce a large number of false positives when applied to identify functional peptides in large-scale proteomic data, thereby increasing the expense of experimental validation. Consequently, there is also a need for a comparative analysis of existing prediction tools in terms of SP.
In this review, we first summarize existing CPP prediction methods based on different ML algorithms. We then carry out an unbiased evaluation of existing web-based prediction tools using two benchmark validation datasets. Six prediction tools have available web portals: CellPPD [23], SkipCPP-Pred [12], CPPred-RF [16], KELM-CPPpred [34], MLCPP [15] and CPPred-FL [11]. Since some of them provide more than one prediction model, we tested and compared a total of 12 CPP prediction models from the six web servers. Our comparative results demonstrate that the KELM-hybrid-AAC model from the KELM-CPPpred server significantly outperforms the other competing models in terms of SP, accuracy (ACC) and Matthews correlation coefficient (MCC). More importantly, it achieves a more balanced sensitivity (SE) and SP than the other prediction tools. In particular, its remarkably higher SP indicates that it can be applied to large-scale proteomics while drastically reducing false positives, which in turn reduces the cost and time of experimentally validating ML-generated predictions. Finally, we conducted a length-dependency analysis and found that existing prediction tools tend to predict CPPs and non-CPPs of 20–25 residues more accurately than peptides in other length ranges.
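The SE, SP, ACC and MCC metrics used throughout this comparison follow the standard binary-classification definitions. A minimal implementation from confusion-matrix counts (a generic sketch, not code from any of the reviewed tools):

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Compute SE, SP, ACC and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity: fraction of true CPPs recovered
    sp = tn / (tn + fp)                    # specificity: fraction of non-CPPs rejected
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, acc, mcc
```

A predictor with high SE but low SP inflates `fp`, which is exactly the failure mode that makes large-scale screening expensive.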
Materials and methods
Framework of CPP prediction using machine learning
The framework of CPP prediction using ML is illustrated in Figure 1 and involves three main stages. The first stage is dataset preparation. Candidate peptide sequences are generally collected from validated databases and the relevant literature [35]. To construct a high-quality prediction model, both training sets and independent testing sets are usually needed: training sets are used for model training, while testing sets validate the transferability and reliability of the trained model. The second stage is feature encoding, which comprises feature representation and feature optimization [36]. For feature representation, various feature descriptors are used to capture the characteristics of CPPs, including compositional features [e.g. amino acid composition (AAC) and dipeptide composition (DAC)], binary profiles, motif-based features and physicochemical features. To improve the representation, the features are often optimized by removing irrelevant ones [37]. The last stage is model construction and prediction. The optimized features from the previous stage are used to train ML algorithms (e.g. SVM and RF). Query peptide sequences are encoded as feature vectors and fed into the trained model, which then predicts whether each is a CPP or not.
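As an illustration of the feature-encoding stage, the AAC descriptor mentioned above maps a peptide to a 20-dimensional vector of residue frequencies. A minimal sketch using the standard definition:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(peptide):
    """Amino acid composition: 20 normalized residue frequencies."""
    peptide = peptide.upper()
    n = len(peptide)
    return [peptide.count(a) / n for a in AMINO_ACIDS]

# Example: encode the Tat 48-60 CPP into a fixed-length feature vector
features = aac("GRKKRRQRRRPPQ")
```

Fixed-length encodings like this are what allow peptides of different lengths to be fed to the same classifier.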
Cell-penetrating peptide databases
To date, there are two public CPP databases, CPPsite [14] and its successor CPPsite 2.0 [13]; their details are presented in Table 1. CPPsite, the first database of CPPs, was created by Gautam et al. [14] in 2012 and contains 843 entries with information on sequence, subcellular localization, physicochemical properties (PPs), uptake efficiency, etc. In 2015, Agrawal et al. [13] released CPPsite 2.0, an updated version containing 1850 entries with information on model systems, cargoes, chemical modifications, predicted tertiary structure and so on [38].
| Database | Year | Number of true CPPs | Website | Ref. |
|---|---|---|---|---|
| CPPsite | 2012 | 843 | http://webs.iiitd.edu.in/raghava/cppsite/ | [14] |
| CPPsite 2.0 | 2015 | 1850 | http://webs.iiitd.edu.in/raghava/cppsite2.0/ | [13] |
| Year | Methods | Feature representation | Feature selection | Predictor name | URL | Ref. |
|---|---|---|---|---|---|---|
| 2005 | Z-descriptors | Bulk properties of the constituent amino acids | N.A. | N.A. | N.A. | [6] |
| 2008 | Partial least squares | Chemical properties | Principal component analysis | N.A. | N.A. | [1] |
| 2010 | ANNs | Biochemical features | N.A. | N.A. | N.A. | [28] |
| 2011 | SMO-based SVMs with the Pearson VII universal kernel | Basic biochemical properties | Scatter search | N.A. | N.A. | [7] |
| 2011 | ANNs | N.A. | Principal component analysis | N.A. | N.A. | [20] |
| 2013 | SVMs | Sequence composition, binary profile of patterns and physicochemical properties | N.A. | CellPPD | http://crdd.osdd.net/raghava/cellppd/ | [23] |
| 2013 | N-to-1 NN | Motif information | N.A. | CPPpred | http://bioware.ucd.ie/cpppred | [29] |
| 2015 | RF | PseAAC and five properties of amino acids | mRMR and IFS | N.A. | N.A. | [26] |
| 2016 | SVMs | Dipeptide composition | Analysis of variance | C2Pred | http://lin.uestc.edu.cn/server/C2Pred | [24] |
| 2017 | RF | K-skip-2-gram | N.A. | SkipCPP-Pred | http://server.malab.cn/SkipCPP-Pred/Index.html | [12] |
| 2017 | RF | PC-PseAAC, SC-PseAAC, ASDC and PPs | MRMD and SFS | CPPred-RF | http://server.malab.cn/CPPred-RF/ | [16] |
| 2018 | ERT and RF | AAC, AAI, DPC, PCP and CTD | N.A. | MLCPP | http://www.thegleelab.org/MLCPP/ | [15] |
| 2018 | KELM | AAC, DAC, PseAAC and motif-based hybrid features | N.A. | KELM-CPPpred | http://sairam.people.iitgn.ac.in/KELM-CPPpred.html | [34] |
| 2018 | RF | Compositional information, position-specific information and physicochemical properties | mRMR and SFS | CPPred-FL | http://server.malab.cn/CPPred-FL/ | [11] |
| 2018 | RF | Sequence length, physicochemical properties and molecular properties | N.A. | N.A. | N.A. | [8] |
Note: Sequential minimal optimization (SMO); Pseudo amino acid composition (PseAAC); Minimum redundancy maximum relevance (mRMR); Incremental feature selection (IFS); Parallel correlation pseudo-amino-acid composition (PC-PseAAC); Series correlation pseudo-amino-acid composition (SC-PseAAC); Adaptive skip dipeptide composition (ASDC); Physicochemical properties (PPs); Maximal Relevance−Maximal Distance (MRMD); Sequential forward search (SFS); Amino acid composition (AAC); Amino acid index (AAI); Dipeptide composition (DPC); Physicochemical properties (PCP); Composition−transition−distribution (CTD); Dipeptide amino acid composition (DAC).
Existing CPP prediction methods
ML algorithms have been widely used to identify CPPs [38]. Existing ML-based CPP prediction methods are summarized in Table 2. According to the ML algorithm used, they fall into the following four classes, described in detail below.
Prediction methods based on Neural Network
An artificial NN (ANN) (Figure 1B) is an algorithmic model that simulates the structure of the brain's synaptic connections to process information and react to the real world [39]. ANNs have two useful properties: (1) they can learn from examples and adapt to changes in environmental parameters; and (2) they can generate highly nonlinear decision boundaries in a multidimensional input space [39, 40].
To date, there are three ANN-based CPP prediction methods. Dobchev et al. [28] specified biochemical features for true CPPs and non-CPPs and trained a prediction model using ANN algorithms and principal component analysis (PCA); PCA was used to select the most informative variables (used as network inputs) from the training set. Their model is reported to achieve an accuracy of 80–100% on a validation dataset of 101 peptides (penetrating and non-penetrating). The second ANN method, proposed by Karelson et al. [20], predicts the cell-penetrating capability of compounds or drugs. It combines quantitative structure–activity relationship principles with ANN algorithms, yielding an overall accuracy of 83%. Its limitation is that it requires structural information as input, which is not always available, especially when characterizing the cell-penetrating properties of arbitrary peptides. The third ANN-based method is CPPpred [29], whose model was trained on redundancy-reduced datasets and achieved an accuracy of 82.98% on an independent test. Notably, this was the first study to emphasize the importance of stringent training datasets for constructing a robust prediction model.
Prediction methods based on Support Vector Machine
The objective of an SVM (Figure 1C) is to find a maximum-margin separating hyperplane that separates positives from negatives with a minimal misclassification rate [41–43]. Briefly, it maps the input features into a high-dimensional space using a kernel function and finds the hyperplane that maximizes the distance between the hyperplane and the two classes [44, 45]. A test sample is mapped into the same high-dimensional space and classified according to the side of the hyperplane on which it falls. Several kernel functions are available, including the linear, polynomial and Gaussian radial-basis functions. An SVM has two critical parameters: C (controlling the trade-off between training error and margin) and g (controlling how peaked the Gaussians centered on the support vectors are). To achieve the best performance, these parameters are usually optimized by grid search.
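The grid search mentioned above simply evaluates every (C, g) pair and keeps the best-scoring one. A schematic sketch, where `train_fn` and `score_fn` are hypothetical stand-ins for the actual SVM training step and a cross-validation scorer:

```python
from itertools import product

def grid_search(train_fn, score_fn, C_grid, g_grid):
    """Exhaustively evaluate every (C, g) pair and keep the best scorer."""
    best_params, best_score = None, float("-inf")
    for C, g in product(C_grid, g_grid):
        model = train_fn(C=C, g=g)   # e.g. fit an RBF-kernel SVM with these parameters
        score = score_fn(model)      # e.g. mean cross-validation accuracy
        if score > best_score:
            best_params, best_score = (C, g), score
    return best_params, best_score
```

In practice the grids are usually exponential (e.g. C over powers of 2), since the useful range of both parameters spans several orders of magnitude.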
Several SVM-based tools have been proposed for predicting CPPs. Sanders et al. [7] developed an SVM-based approach for identifying potential CPPs, trained using basic biochemical properties of peptides as features. The authors used three different benchmark datasets to highlight the importance of balanced datasets for accurate prediction; accuracy on the balanced dataset reached 91.72%. Gautam et al. [23] proposed an SVM-based predictor called CellPPD and established a public web server for CPP prediction. In CellPPD, different feature representations, such as AAC, DAC, binary profiles, motif features and PPs, were used to train different predictive models. The model based on hybrid features is reported to achieve a maximum accuracy of 97.40%, better than the models based on individual features [34]. Tang et al. [24] developed C2Pred, a predictor based on an optimized DAC feature, with an overall accuracy of about 83.6%. They also developed a web server implementing C2Pred, but as of this writing the server is out of service.
Prediction methods based on Random Forest
RF (Figure 1D) is a powerful ML algorithm [25] with successful applications in bioinformatics [8, 11, 12, 16, 26, 46]. An RF is an ensemble of decision trees, trained briefly as follows. Given a training set of N samples with M features, RF draws N samples by bootstrapping to form a new training set and then randomly selects m (m ≪ M) features to train a decision tree on it; this procedure is repeated until all the decision trees in the forest are trained. The final prediction is determined by an ensemble of the scores of all the trees. The number of decision trees and the number of randomly selected features (mtry) are the two main parameters for training accurate RF models.
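The bootstrap-plus-random-feature-subset procedure described above can be illustrated with a toy sketch in which full decision trees are replaced by one-feature threshold stumps (a deliberate simplification for clarity, not how production RF implementations work):

```python
import random
from collections import Counter

def train_forest(X, y, n_trees=25, m=2, seed=0):
    """Toy random forest: each 'tree' is a one-feature threshold stump
    fit on a bootstrap sample using a random subset of m features."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: N samples with replacement
        feats = rng.sample(range(len(X[0])), m)      # random subset of m features
        best = None
        for f in feats:
            thr = sum(X[i][f] for i in idx) / n      # crude threshold: bootstrap mean
            acc = sum(int(X[i][f] > thr) == y[i] for i in idx) / n
            if best is None or acc > best[0]:
                best = (acc, f, thr)
        forest.append(best[1:])                      # keep (feature, threshold)
    return forest

def predict(forest, row):
    """Final prediction: majority vote over all stumps."""
    votes = Counter(int(row[f] > thr) for (f, thr) in forest)
    return votes.most_common(1)[0][0]
```

The two knobs exposed here (`n_trees` and `m`) correspond directly to the number-of-trees and mtry parameters discussed above.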
The RF algorithm has been widely applied to CPP prediction. Chen et al. [26] developed an RF-based CPP prediction model trained on a series of PPs, including pseudo-AAC (PseAAC) [18], molecular volume, polarity, codon diversity, electrostatic charge and secondary structure [26]. Optimized features were selected by minimum redundancy maximum relevance [47] and incremental feature selection [48], yielding an overall accuracy of 83.45%. Considering the long-range effect between residues, in a previous study we proposed an adaptive k-skip-2-gram algorithm to extract features and trained a predictor named SkipCPP-Pred with an improved accuracy of 90.6% [12]. In another work, we proposed a two-layer predictor called CPPred-RF, in which the first layer discriminates true CPPs from non-CPPs while the second layer predicts whether the uptake efficiency of a CPP is high or low [16]. The model was trained on integrative features combining four sequence-based descriptors: PC-PseAAC [49], SC-PseAAC [49], adaptive skip DAC (ASDC) and PPs [50–52]. Compared to SkipCPP-Pred, CPPred-RF increased the prediction accuracy (evaluated with 10-fold cross-validation) to 91.6% on the same benchmark dataset. Notably, CPPred-RF is the first tool that can predict the uptake efficiency of CPPs. Another work, by Wolfe et al. [8], focuses on the transport of phosphorodiamidate morpholino oligonucleotides by CPPs; peptide molecular weight, sequence length, theoretical net charge and amino acid physicochemical descriptors were used as input features to train an RF model. Recently, Qiang et al. [11] proposed a computational predictor called CPPred-FL.
More specifically, CPPred-FL introduces a feature representation learning strategy that learns class and probabilistic information from ML models built with multiple feature descriptors, such as PPs, compositional information and position-specific information. The best overall accuracy of CPPred-FL is 92.1% [11]. Although this accuracy is not significantly higher than in their previous study [16], far fewer features were needed to train the predictive models; this feature representation strategy offers an effective new way to extract highly expressive features.
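The skip-gram features behind SkipCPP-Pred, mentioned above, count residue pairs separated by up to k skipped positions, capturing longer-range pairings than plain dipeptide composition. A simplified sketch of the core idea (the adaptive weighting of [12] is omitted):

```python
from collections import Counter

def k_skip_2_grams(seq, k=2):
    """Count residue pairs (a, b) with 0..k residues skipped between them,
    normalized by the total number of pairs collected."""
    counts, total = Counter(), 0
    for skip in range(k + 1):
        gap = skip + 1                     # skip=0 recovers ordinary dipeptides
        for i in range(len(seq) - gap):
            counts[seq[i] + seq[i + gap]] += 1
            total += 1
    return {pair: c / total for pair, c in counts.items()}
```

With k = 0 this reduces to DAC; increasing k folds progressively longer-range residue pairings into the same fixed-size feature space.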
Prediction methods based on other machine learning algorithms
Besides the methods above, some prediction methods are based on other ML algorithms, such as ERT [15, 30] and KELM [33, 34]. In a recent study, Manavalan et al. [15] proposed a two-layer model for predicting CPPs and their uptake efficiency: the first-layer CPP prediction model was trained with the ERT algorithm (accuracy 89.6%), while the second-layer uptake efficiency model was trained with RF (accuracy 72.5%). Pandey et al. [34] developed a KELM-based model whose predictors utilize six feature descriptors: AAC, dipeptide AAC (DAC), PseAAC and three hybrid features (Hybrid-AAC, Hybrid-DAC and Hybrid-PseAAC) [34]. KELM-CPPpred achieved an accuracy of 83.10% on an independent dataset. In addition, some studies do not clearly describe their use of ML algorithms. For example, Hällbrink et al. [6] proposed a prediction method based on five z-descriptors [53] extracted from the physical characteristics of peptide sequences, and Hansen et al. [1] developed a method based on chemical properties to distinguish CPPs from non-CPPs.
| Predictor | Year | Classifier | Predicts uptake efficiency | Sequence length limitation | Upload sequence | Multiple input | URL | Ref. |
|---|---|---|---|---|---|---|---|---|
| CellPPD | 2013 | SVM | N.A. | 1–50 | No | Yes | http://crdd.osdd.net/raghava/cellppd/ | [23] |
| SkipCPP-Pred | 2017 | RF | N.A. | No less than 10 | No | Yes | http://server.malab.cn/SkipCPP-Pred/Index.html | [12] |
| CPPred-RF | 2017 | RF | Yes | No limitation | No | Yes | http://server.malab.cn/CPPred-RF | [16] |
| MLCPP | 2018 | ERT and RF | Yes | No limitation | Yes | Yes | www.thegleelab.org/MLCPP | [15] |
| KELM-CPPpred | 2018 | KELM | N.A. | 5–30 | No | Yes | http://sairam.people.iitgn.ac.in/KELM-CPPpred.html | [34] |
| CPPred-FL | 2018 | RF | N.A. | No limitation | Yes | Yes | http://server.malab.cn/CPPred-FL | [11] |
Web-accessible prediction tools
As described in the section Existing CPP prediction methods, there are 15 prediction methods in total, but only 6 of them provide publicly available web servers for CPP prediction: CellPPD, SkipCPP-Pred, CPPred-RF, MLCPP, KELM-CPPpred and CPPred-FL [11, 12, 15, 16, 23, 34]. The basic information of these web servers is summarized in Table 3. They are described in detail below.
CellPPD is an in silico method for predicting and designing CPPs. The web server provides users with two prediction models: (1) an SVM-based model and (2) an SVM + Motif-based model [23]. The 1st model was trained with an SVM classifier [14] using the binary N10-C10 descriptor, while the 2nd was trained with a hybrid descriptor combining binary profile patterns and motif features. It should be pointed out that CellPPD was the 1st server developed to predict CPPs. Additionally, CellPPD is able to identify potential CPPs within protein sequences, although input protein sequences are limited to 500 residues in length. Furthermore, the web server also allows users to design novel cell-penetrating peptides with desired physicochemical properties (PPs) according to their specific needs. The web server is freely available at http://crdd.osdd.net/raghava/cellppd/.
SkipCPP-Pred is an RF-based prediction method. The prediction model was trained with features extracted by an adaptive k-skip-2-gram algorithm [12]. Because it uses sequential features only, SkipCPP-Pred can rapidly predict whether input peptides are CPPs or not. Notably, this server places no upper limit on the length of input sequences, although inputs must be at least 10 residues long (see Table 3). The web server can be accessed via http://server.malab.cn/SkipCPP-Pred/Index.html.
CPPred-RF is a two-layer RF-based predictor that predicts CPPs and their uptake efficiency simultaneously [16]. It was the 1st server to make a breakthrough in predicting the uptake efficiency of CPPs. Similar to other servers, it supports the prediction of multiple sequences. CPPred-RF is publicly available at http://server.malab.cn/CPPred-RF.
MLCPP, similar to CPPred-RF, is also a two-layer predictor of CPPs and their uptake efficiency. For given peptide sequences, the 1st-layer model predicts whether each query sequence is a CPP; if so, the 2nd-layer model predicts its uptake efficiency [15]. The final results include the prediction labels and corresponding probability scores. MLCPP is freely available at www.thegleelab.org/MLCPP.
KELM-CPPpred is a KELM-based CPP prediction tool. The web server provides six prediction models based on different features: AAC, DAC, PseAAC, Hybrid-AAC, Hybrid-DAC and Hybrid-PseAAC [34]. Users can select one of the models to make predictions. KELM-CPPpred allows users to enter one or more query sequences of 5–30 residues in length as input. The web server can be accessed via http://sairam.people.iitgn.ac.in/KELM-CPPpred.html.
CPPred-FL is a recently developed predictor for CPP identification [11]. The server provides two prediction modes, based on class information and on probabilistic information, respectively. Different from the servers above, this server is designed to identify CPPs within proteins. When using this tool, users choose one prediction mode and set a confidence threshold and a cutting length. Multiple protein sequences can be submitted at once. The output of CPPred-FL contains all the peptide segments predicted to have cell-penetrating activity, together with their residue positions and prediction confidence. CPPred-FL is publicly available at http://server.malab.cn/CPPred-FL.
Validation datasets
Two benchmark validation datasets were used for the comparative study of existing methods. They were downloaded from the independent datasets of two recent studies: Pandey’s work [34] and Manavalan’s study [15]. For convenience of discussion, they are denoted as mlcpp and kelm, respectively. The kelm dataset includes 96 experimentally validated CPPs as positives and 96 non-CPPs as negatives, whereas the mlcpp dataset consists of 311 true CPPs (positives) and 311 non-CPPs (negatives). However, some servers impose strict length limitations on input sequences (see Table 3 for details). To test all the web-based prediction tools, we removed the sequences that did not meet the predictors’ length requirements. Moreover, to avoid the bias caused by high sequence similarity between training and validation datasets, we first removed the validation sequences having significant sequence similarity to training sequences using BLASTP (version 2.8.1+) with default settings. Afterwards, we used CD-HIT, a widely used sequence homology reduction tool in bioinformatics, to further remove validation sequences sharing >30% sequence identity with training sequences. After this filtering, only 71 CPPs and 48 non-CPPs from kelm, and 149 CPPs and 193 non-CPPs from mlcpp, were retained. It is worth noting that the positives of both validation datasets were derived from the CPPsite 2.0 database.
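The filtering logic above can be sketched as follows. Note that this is an illustrative stand-in for the actual BLASTP + CD-HIT pipeline: the function names are hypothetical, and "identity" here is a simple longest-common-subsequence proxy rather than a true alignment identity.

```python
# Simplified sketch of the homology-reduction step: drop validation
# sequences sharing >30% identity with any training sequence.
# NOTE: illustrative stand-in for BLASTP + CD-HIT; the LCS-based
# "identity" below is a proxy, not an alignment-based identity.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (DP, O(len(a)*len(b)))."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def identity(a: str, b: str) -> float:
    """Approximate pairwise sequence identity: LCS length / longer sequence length."""
    return lcs_length(a, b) / max(len(a), len(b))

def homology_reduce(validation, training, threshold=0.30):
    """Keep only validation sequences whose identity to every training
    sequence is at or below the threshold."""
    return [v for v in validation
            if all(identity(v, t) <= threshold for t in training)]

train = ["GRKKRRQRRRPPQ"]                # e.g. the TAT peptide
valid = ["GRKKRRQRRRPPQ", "ALWKTLLKKVLKA"]
print(homology_reduce(valid, train))     # the exact duplicate is removed
```

In the study itself, BLASTP first removes sequences with significant hits and CD-HIT then enforces the 30% identity cut-off; the sketch only conveys the keep/drop decision.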
Prediction tools | TP | FP | TN | FN | SE (%) | SP (%) | ACC (%) | MCC |
MLCPP | 53 | 5 | 43 | 18 | 74.65 | 89.58 | 80.67 | 0.63 |
CPPred-RF | 59 | 12 | 36 | 12 | 83.10 | 75.00 | 79.83 | 0.58 |
KELM-AAC | 49 | 5 | 43 | 22 | 69.01 | 89.58 | 77.31 | 0.58 |
KELM-hybrid-AAC | 49 | 5 | 43 | 22 | 69.01 | 89.58 | 77.31 | 0.58 |
CPPred-FL | 56 | 10 | 38 | 15 | 78.87 | 79.17 | 78.99 | 0.57 |
CellPPD | 45 | 3 | 45 | 26 | 63.38 | 93.75 | 75.63 | 0.57 |
CellPPD-motif | 45 | 3 | 45 | 26 | 63.38 | 93.75 | 75.63 | 0.57 |
KELM-PseAAC | 59 | 13 | 35 | 12 | 83.10 | 72.92 | 78.99 | 0.56 |
KELM-DAC | 40 | 1 | 47 | 31 | 56.34 | 97.92 | 73.11 | 0.56 |
SkipCPP-Pred | 58 | 13 | 35 | 13 | 81.69 | 72.92 | 78.15 | 0.55 |
KELM-hybrid-PseAAC | 59 | 14 | 34 | 12 | 83.10 | 70.83 | 78.15 | 0.54 |
KELM-hybrid-DAC | 49 | 8 | 40 | 22 | 69.01 | 83.33 | 74.79 | 0.51 |
Prediction tools | TP | FP | TN | FN | SE (%) | SP (%) | ACC (%) | MCC |
KELM-hybrid-AAC | 141 | 7 | 186 | 8 | 94.63 | 96.37 | 95.61 | 0.91 |
KELM-hybrid-DAC | 141 | 52 | 141 | 8 | 94.63 | 73.06 | 82.46 | 0.68 |
KELM-AAC | 140 | 52 | 141 | 9 | 93.96 | 73.06 | 82.16 | 0.67 |
KELM-PseAAC | 142 | 57 | 136 | 7 | 95.30 | 70.47 | 81.29 | 0.66 |
CPPred-FL | 144 | 62 | 131 | 5 | 96.64 | 67.88 | 80.41 | 0.65 |
MLCPP | 144 | 65 | 128 | 5 | 96.64 | 66.32 | 79.53 | 0.64 |
CPPred-RF | 146 | 74 | 119 | 3 | 97.99 | 61.66 | 77.49 | 0.62 |
SkipCPP-Pred | 148 | 81 | 112 | 1 | 99.33 | 58.03 | 76.02 | 0.60 |
CellPPD | 120 | 47 | 146 | 29 | 80.54 | 75.65 | 77.78 | 0.56 |
CellPPD-motif | 120 | 47 | 146 | 29 | 80.54 | 75.65 | 77.78 | 0.56 |
KELM-hybrid-PseAAC | 138 | 84 | 109 | 11 | 92.62 | 56.48 | 72.22 | 0.51 |
KELM-DAC | 138 | 87 | 106 | 11 | 92.62 | 54.92 | 71.35 | 0.50 |
Performance measurements
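The four metrics reported in Tables 4 and 5, sensitivity (SE), specificity (SP), accuracy (ACC) and the Matthews correlation coefficient (MCC), are standard functions of the confusion-matrix counts. A minimal sketch, reproducing the MLCPP row of Table 4:

```python
from math import sqrt

def metrics(tp: int, fp: int, tn: int, fn: int):
    """SE, SP, ACC and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)                       # sensitivity (recall on positives)
    sp = tn / (tn + fp)                       # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fp + tn + fn)     # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, acc, mcc

# MLCPP on the kelm validation set (Table 4): TP=53, FP=5, TN=43, FN=18
se, sp, acc, mcc = metrics(53, 5, 43, 18)
print(f"SE={se:.2%} SP={sp:.2%} ACC={acc:.2%} MCC={mcc:.2f}")
# -> SE=74.65% SP=89.58% ACC=80.67% MCC=0.63
```

MCC is the metric emphasized in the comparisons below because, unlike ACC, it remains informative on imbalanced datasets such as kelm.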
Results and discussion
Comparative results on benchmark validation datasets
Our aim in this work is to conduct an unbiased performance evaluation of existing prediction tools. To avoid potential evaluation biases from re-implementing existing predictors, we chose only the prediction tools with available web servers for comparison: CellPPD, SkipCPP-Pred, CPPred-RF, MLCPP, KELM-CPPpred and CPPred-FL [11, 12, 15, 16, 23, 34]. We noticed that some servers, such as CellPPD and KELM-CPPpred, offer more than one prediction model (refer to Web-accessible prediction tools for details); we thus collected a total of 12 prediction models from the 6 web servers. To conduct a comprehensive comparison, all the available prediction models were tested and compared. Moreover, it is instructive to compare the web-accessible prediction tools on independent tests, since this usage scenario is close to practical applications. Here, the two benchmark validation datasets, kelm and mlcpp, were used for the test. The comparison results on the two benchmarks are presented in Tables 4 and 5, respectively.
On the kelm dataset (see Table 4 and Figure 2), MLCPP achieved the best performance among the 12 tested prediction models, with the highest ACC of 80.67% and MCC of 0.63. Nevertheless, other prediction tools, such as CPPred-RF, KELM-AAC and KELM-hybrid-AAC, are quite competitive with MLCPP in terms of MCC; their MCCs are all 0.58, only slightly worse than that of MLCPP (MCC = 0.63). We place less emphasis on the comparison in terms of ACC here, since the kelm dataset is imbalanced, and on an imbalanced dataset MCC is a better measure of the overall performance of a prediction model. However, the numbers of positives and negatives in the kelm dataset are relatively small; given such a small performance gap, it is actually hard to determine which prediction tool is better.
Next, we further compared the performance on the larger mlcpp dataset, which contains more testing samples (149 positives and 193 negatives). The results are shown in Table 5 and Figure 3, from which the following observations were made. The 1st observation is that KELM-hybrid-AAC, trained with the KELM classifier and hybrid features (motif and AAC), outperforms the other competing prediction tools in three of the four metrics: SP, ACC and MCC. More specifically, KELM-hybrid-AAC achieved an SP, ACC and MCC of 96.37%, 95.61% and 0.91, higher than those of the 2nd best KELM-hybrid-DAC by 23.31%, 13.15% and 0.23, respectively. The 2nd observation is that the performance of MLCPP decreased on mlcpp (see Table 5 and Figure 3) compared to kelm (see Table 4 and Figure 2); among the 12 compared prediction models, MLCPP ranks 6th. Furthermore, we observed that the SP of KELM-hybrid-AAC reaches as high as 96.37%. This indicates that this model generates fairly few false positives; most random peptides without cell-penetrating properties would be filtered out by KELM-hybrid-AAC. This greatly facilitates the experimental validation of predictions, largely reducing its cost and time. In addition, we also investigated the ability of the prediction tools to identify the positives (true CPPs). For the prediction of true CPPs, SkipCPP-Pred and CPPred-RF are the top two predictors, with SEs of 99.33% and 97.99%, respectively; CPPred-FL and MLCPP achieved the 3rd best SE of 96.64%. This demonstrates that they can identify more CPPs than the other predictors.
Taken together, the comparison study on the two benchmarks shows that one of the prediction models on the KELM-CPPpred server, namely KELM-hybrid-AAC, generally outperforms the other state-of-the-art web-accessible predictors (11 prediction models in total). Importantly, KELM-hybrid-AAC provides a more balanced trade-off between SE and SP than the other prediction tools.
Length-dependency comparison of existing prediction tools
For CPP prediction tools, the identification of peptides with cell-penetrating activity within proteins is the main task in this field. As is known, experimentally validated CPPs are in the range of 10–50 residues long. It is therefore interesting to examine whether existing CPP prediction tools exhibit length dependency. For this purpose, the samples in the benchmark dataset were divided into four groups according to length: [10, 15], (15, 20], (20, 25] and (25, 30]. Note that the lengths of the samples in the 1st group [10, 15] range from 10 to 15; those in the 2nd group (15, 20] are longer than 15 and at most 20 residues; and so forth. Figure 4 illustrates the prediction results of the 12 prediction models from the six web servers over the different length ranges on the mlcpp dataset. Note that we conducted the length-dependency comparison only on the mlcpp dataset, since it contains more testing samples than the kelm dataset and is therefore more representative. As shown in Figure 4, almost all the prediction models performed better in the length range (20, 25] than in the other ranges. In other words, predicting CPPs and non-CPPs with lengths in (20, 25] is a relatively easy task compared to prediction in the other length ranges. For the prediction of CPPs outside the range (20, 25], no clear trend was observed. This result is quite interesting and might provide some insights for the design of CPPs.
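The binning used for this analysis can be sketched as follows (the function name is illustrative, not from the original study):

```python
# Bin a peptide into the four length groups used in the
# length-dependency analysis: [10, 15], (15, 20], (20, 25], (25, 30].
def length_group(seq: str) -> str:
    n = len(seq)
    if 10 <= n <= 15:
        return "[10, 15]"
    elif 15 < n <= 20:
        return "(15, 20]"
    elif 20 < n <= 25:
        return "(20, 25]"
    elif 25 < n <= 30:
        return "(25, 30]"
    raise ValueError(f"peptide length {n} outside the analysed range")

print(length_group("A" * 22))  # -> (20, 25]
```

Grouping the validation peptides this way and computing per-group accuracy yields the per-range comparison shown in Figure 4.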
Usage comparison of cell-penetrating peptides web servers
In this section, we investigated whether these servers are as user-friendly as claimed in their respective studies, since the usage experience of web servers is also quite important. We found several limitations when conducting the server tests. The 1st limitation is the restriction on input sequence length for some servers. For example, CellPPD accepts input sequences of 1–50 residues and KELM-CPPpred of 5–30 residues, while SkipCPP-Pred requires input sequences of at least 10 residues. Secondly, some servers do not support multiple sequences or uploading sequence files for batch processing. MLCPP and CPPred-FL offer a more convenient function for users to upload their data as files in a specified format (i.e. FASTA files). The CellPPD server has a file-uploading function, but it does not work well. Thirdly, we found that errors occurred frequently when testing the CellPPD server: the prediction results are overwritten by those of the 1st query sequence, yielding invalid predictions. We therefore had to run the independent tests one sequence at a time to ensure correct results, which is quite inconvenient for large-scale prediction. Fourthly, in terms of running time, SkipCPP-Pred is the fastest of these web servers, regardless of whether there are hundreds or thousands of query sequences. Although KELM-CPPpred claims to handle multiple sequences in a single run, the processing speed is quite slow, and the server usually returns timeout errors when more than 50 peptides are submitted as input. Finally, CPPred-RF and MLCPP are the only two servers that can predict the uptake efficiency of CPPs. In particular, CellPPD has the function of designing efficient CPPs and can locate CPP motifs within long query protein sequences.
Discussion
In this study, our aim was to conduct an empirical comparison and analysis of 12 prediction models from 6 state-of-the-art CPP prediction tools that are accessible as web servers. We evaluated the prediction models on two benchmark validation datasets in an unbiased way. The benchmarking results demonstrate that, among the 12 prediction models, KELM-hybrid-AAC from the KELM-CPPpred server provides by far the best performance. This might be due to its use of hybrid features, which integrate a motif-based descriptor with the compositional descriptor AAC. We further analyzed which motif features were used and found that the authors specified the most frequent amino acid motifs in their dataset, including RRRRRR, RRA, GRRX (where X = R, W, T), RRGRX (X = R, G, T) and KKRK. The results demonstrate that the fusion of amino acid composition and such sequence motifs can sufficiently capture the intrinsic characteristics of CPPs and non-CPPs. Another interesting finding is that the KELM-CPPpred server recommends a different model, KELM-hybrid-PseAAC, based on the results of the original study [34]; in our test, however, the recommended model was actually almost the worst of the six prediction models on the KELM-CPPpred server. Generally speaking, the KELM-hybrid-AAC model is the better choice for making predictions in practice.
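As a rough illustration of such a hybrid descriptor, one can concatenate the 20-dimensional AAC vector with binary indicators for the motifs listed above. Note that this is a sketch, not the exact encoding used by KELM-hybrid-AAC, whose implementation details may differ:

```python
# Illustrative hybrid AAC + motif descriptor: 20 AAC frequencies plus
# one binary flag per frequent CPP motif named in the text.
# NOT the exact KELM-CPPpred encoding; for illustration only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# GRRX with X in {R, W, T} and RRGRX with X in {R, G, T}, expanded:
MOTIFS = ["RRRRRR", "RRA", "GRRR", "GRRW", "GRRT",
          "RRGRR", "RRGRG", "RRGRT", "KKRK"]

def hybrid_aac(seq: str) -> list:
    """20 AAC frequencies followed by one 0/1 flag per motif."""
    aac = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    flags = [1.0 if m in seq else 0.0 for m in MOTIFS]
    return aac + flags

vec = hybrid_aac("GRKKRRQRRRPPQ")  # the TAT peptide
print(len(vec))  # 29 features: 20 AAC + 9 motif flags
```

A vector of this form would then be fed to the KELM classifier for training and prediction.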
To further improve the performance of CPP prediction, many aspects can still be explored. The 1st, and also the most important, is feature representation. As seen above, most prediction tools extract features from the information in primary sequences, including amino acid PPs, AAC, DAC, ASDC, etc. A natural question is: why not use structural information? The answer is that peptides are very short, usually 10–50 residues long, and such short peptides cannot form stable secondary structures; it is therefore extremely difficult to find structural characteristics that discriminate CPPs from non-CPPs. Basically, only sequential information can be exploited, and how to effectively use different types of sequential information remains an open question. Recently, Wei et al. [68] proposed a feature representation learning strategy that automatically learns the most informative features in a supervised way. This work might point researchers toward other strategies for more effective sequence-based feature representations. Additionally, feature selection, which facilitates the discovery of the most predictive features, can be another route to improved performance. Moreover, recent studies have shown that more powerful classifiers offer a complementary way to improve performance [15, 34, 70, 71].
Although much progress has been made in the development of CPP prediction tools, some limitations and challenges remain to be addressed. Firstly, the main challenge faced by all ML-based prediction tools is the selection of high-quality samples that adequately represent the positives (true CPPs) and negatives (non-CPPs). Currently, the number of positive samples is still limited: although public databases like CPPsite 2.0 collect almost 2000 experimentally validated CPPs, only 400–500 remain in benchmark datasets after the removal of high-identity sequences. For the selection of negative samples, random peptides from proteins not annotated as CPPs are frequently used. An alternative way to generate negative control samples is to shuffle the sequence content of existing CPPs (e.g. scrambled CPPs). However, it cannot be guaranteed that such random peptides are not true CPPs, although the probability is very low. Secondly, most current predictors focus on distinguishing true CPPs from non-CPPs, whereas few studies (only CPPred-RF and MLCPP) address uptake efficiency prediction, which is equally important, since the uptake efficiency of CPPs is closely associated with their practical applications in efficient drug delivery. One possible reason is that there is not enough experimental data for predicting the efficiency of CPPs. Moreover, even for these two tools, the simple high/low prediction of internalization efficiency may soon be outdated.
Funding
National Natural Science Foundation of China (Nos. 61701340, 61702361 and 61771331), the Natural Science Foundation of Tianjin City (Nos. 18JCQNJC00500 and 18JCQNJC00800), the National Key R&D Program of China (SQ2018YFC090002) and the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education, Science, and Technology (2018R1D1A1B07049572).
Ran Su is currently an associate professor in the School of Computer Software, College of Intelligence and Computing, at Tianjin University, China. Her research interests include pattern recognition, machine learning and bioinformatics.
Jie Hu received her BSc degree in Resource Environment and Urban Planning Management from Wuhan University of Science and Technology, China. She is currently a graduate student in the School of Computer Science and Technology, College of Intelligence and Computing, at Tianjin University, China. Her research interests are bioinformatics and machine learning.
Quan Zou is a professor of University of Electronic Science and Technology of China. He received his PhD in Computer Science from Harbin Institute of Technology, P.R. China in 2009. His research is in the areas of bioinformatics, machine learning and parallel computing, with focus on genome assembly, annotation and functional analysis from the next generation sequencing data with parallel computing methods.
Balachandran Manavalan received his PhD degree in 2011 from Ajou University, South Korea. Currently, he is working as a research professor in the Department of Physiology, Ajou University School of Medicine, Suwon, Korea. He is also an associate member of Korea Institute for Advanced Study (KIAS), Seoul, Korea. His main research interests include protein structure prediction, machine learning, data mining, computational biology, and functional genomics.
Leyi Wei received his PhD in Computer Science from Xiamen University, China. He is currently an assistant professor in the School of Computer Science and Technology, College of Intelligence and Computing, at Tianjin University, China. His research interests include machine learning and its applications to bioinformatics.