Abstract

Cell-penetrating peptides (CPPs) facilitate the delivery of therapeutically relevant molecules, including DNA, proteins and oligonucleotides, into cells both in vitro and in vivo. This unique ability makes CPPs attractive delivery vehicles with potential applications in clinical therapy. Over the last few decades, a number of machine learning (ML)-based prediction tools have been developed, and some of them are freely available as web portals. However, the predictions produced by different tools are difficult to quantify and compare; in particular, there has been no systematic comparison of the performance of the web-based prediction tools, especially in practical applications. In this work, we provide a comprehensive review of the biological importance of CPPs, CPP databases and existing ML-based methods for CPP prediction. To evaluate current prediction tools, we conducted a comparative study of a total of 12 models from 6 publicly available CPP prediction tools on 2 benchmark validation sets of CPPs and non-CPPs. Our benchmarking results demonstrate that one model from KELM-CPPpred, namely KELM-hybrid-AAC, shows a significant improvement in overall performance over the other 11 prediction models. Moreover, a length-dependency analysis shows that existing prediction tools tend to predict CPPs and non-CPPs of 20–25 residues more accurately than peptides in other length ranges.

Introduction

Cell-penetrating peptides (CPPs) are short peptides of approximately 5–30 amino acid residues in length [1]. One of the most distinctive characteristics of CPPs is their ability to carry a variety of bioactive molecules into cells without specific receptor interaction [2–5]. The cargoes that CPPs carry vary widely in size and type, including small-molecule compounds, dyes, peptides, peptide nucleic acids, proteins, plasmid DNA, liposomes, phage particles and superparamagnetic particles [2, 4]. Fluorescence-based validation experiments have verified the cell-penetrating capability of CPPs [6–8]. Given this unique property, CPPs that improve the cellular uptake of various bioactive molecules are expected to be promising therapeutic candidates. Accordingly, CPP-based delivery strategies have steadily emerged over the past few years, demonstrating great potential in gene delivery and cancer therapy, as well as effective clinical efficacy [2, 9].

The 1st CPP was discovered by Frankel et al. in the 1980s, who demonstrated that the human immunodeficiency virus 1 (HIV-1) transactivating (Tat) protein was able to enter tissue-cultured cells, translocate into the nucleus and transactivate viral gene expression [10]. The α-helical domain of the Tat protein spanning residues 48–60, mainly composed of basic amino acids, was identified as the main determinant of cell internalization and nuclear translocation. Subsequently, the Penetratin peptide, the 3rd helix of the Antennapedia homeodomain, was found to efficiently cross cell membranes through an energy-independent mechanism. These observations laid the foundation for basic research on CPPs. Since then, interest in CPPs has grown steadily over the last 30 years [11, 12], leading to an exponential increase in the number of known CPPs. Currently, there are a total of 1850 experimentally validated entries in CPPsite 2.0, the largest CPP database [13], nearly twice as many as in its previous version (CPPsite) [14]. It should be pointed out that of the entries in CPPsite 2.0, roughly 90% are derived from natural proteins [15], while the remaining entries are synthetic proteins or chimeric peptides [13, 16]. With the rapid development and wide application of next-generation sequencing techniques [17], large numbers of novel protein sequences are being generated rapidly and at low cost [15, 16]. Among these novel, uncharacterized protein sequences, more functional peptides with cell-penetrating activity can be expected [16]. Unfortunately, traditional experimental methods are extremely difficult to apply at this scale, especially given the avalanche of protein sequences, because of their intrinsic limitations: they are expensive, labor intensive and time-consuming [16]. To address these limitations, computational methods have recently emerged as a promising alternative for accurate and efficient CPP prediction.

Over the past few decades, a variety of computational methods, especially machine learning (ML)-based methods, have been developed for CPP prediction [1, 6, 18–21]. ML techniques can extract useful patterns hidden in experimentally validated CPPs and use these patterns to accurately predict whether new, uncharacterized peptides have cell-penetrating activity. Importantly, they require only primary protein sequences as input, without any prior knowledge (e.g. secondary structure), showing great potential for high-throughput prediction on large-scale proteomic data. So far, various ML techniques have been applied to the development of CPP prediction methods, such as the Support Vector Machine (SVM) [7, 22–24], Random Forest (RF) [8, 11, 12, 16, 25, 26], Neural Network (NN) [20, 27–29], Extremely Randomized Tree (ERT) [15, 30–32] and Kernel Extreme Learning Machine (KELM) [33, 34], generating a number of prediction methods. A common phenomenon across existing methods is that each claims, in its own study, to outperform previously published methods. However, the comparisons carried out in different studies are somewhat biased, for three main reasons. Firstly, the comparisons were performed by the developers themselves; when re-implementing the compared tools, the choice of algorithm parameters greatly impacts performance, yet some studies give no algorithmic details at all [1, 6–8, 20, 26, 28], making a fair comparison difficult. Secondly, the performances reported from one study to another are not directly comparable because different methods use different training and validation datasets [12, 15, 23, 34].
Thirdly, performance was usually evaluated by cross-validation, while the arguably more important independent test was seldom performed. Moreover, most existing methods do not highlight the comparison of specificity (SP). The SP of a predictor corresponds to its ability to predict non-CPPs (negatives). SP is of great importance for wet-lab researchers, because a predictor with low SP will produce a large number of false positives when applied to identify functional peptides in large-scale proteomic data, thereby increasing the expense of experimental validation. Consequently, a comparative analysis of existing prediction tools in terms of SP is also needed.
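
The evaluation metrics discussed in this review (SE, SP, ACC and MCC) all follow from the four confusion-matrix counts. A minimal sketch (the function name `evaluate` and the counts are illustrative, not from any of the cited tools):

```python
import math

def evaluate(tp, fn, tn, fp):
    """Compute SE, SP, ACC and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity: fraction of true CPPs recovered
    sp = tn / (tn + fp)                    # specificity: fraction of non-CPPs recovered
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return se, sp, acc, mcc

# e.g. 90 of 100 CPPs but only 70 of 100 non-CPPs predicted correctly:
# high SE, but the lower SP means 30 false positives to validate in the lab
se, sp, acc, mcc = evaluate(tp=90, fn=10, tn=70, fp=30)
```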

In this review, we first summarize existing CPP prediction methods based on different ML algorithms. We then carry out an unbiased evaluation of existing web-based prediction tools using two benchmark validation datasets. Six prediction tools provide available web portals: CellPPD [23], SkipCPP-Pred [12], CPPred-RF [16], KELM-CPPpred [34], MLCPP [15] and CPPred-FL [11]. Since some of them provide more than one prediction model, in this study we tested and compared a total of 12 CPP prediction models from the six web servers. Our comparative results demonstrate that the KELM-hybrid-AAC model from the KELM-CPPpred server significantly outperforms the other competing prediction models in terms of SP, accuracy (ACC) and Matthews correlation coefficient (MCC). More importantly, it achieves a more balanced sensitivity (SE) and SP than the other prediction tools. In particular, its remarkably higher SP indicates that it can be applied to large-scale proteomics while drastically reducing false positives, which will greatly reduce the cost and time of experimentally validating predictions generated by ML models. Finally, we conducted a length-dependency analysis and found that existing prediction tools tend to predict CPPs and non-CPPs of 20–25 residues more accurately than peptides in other length ranges.

Framework of CPP prediction using machine learning methods. (A) The pipeline of an ML-based CPP prediction method. The 1st stage is dataset preparation, forming a training dataset and an independent dataset; the 2nd stage is feature encoding, composed of feature representation and feature optimization; the 3rd stage is training and evaluating a prediction model. An independent test is usually needed to validate the ability of a trained model. Ultimately, for a given query sequence, the developed prediction model predicts whether it is a CPP or not. (B–D) Brief illustrations of the ANN, SVM and RF, respectively.
Figure 1


Materials and methods

Framework of CPP prediction using machine learning

The framework of CPP prediction using ML is illustrated in Figure 1 and involves three main stages. The 1st stage is dataset preparation. Candidate peptide sequences are generally collected from validated databases and the relevant literature [35]. To construct a high-quality prediction model, training sets and independent testing sets are usually needed: training sets are used for model training, while testing sets validate the transferability and reliability of the trained model. The 2nd stage is feature encoding, composed of feature representation and feature optimization [36]. For feature representation, various feature descriptors are used to capture the characteristics of CPPs, including compositional features [e.g. amino acid composition (AAC) and dipeptide composition (DAC)], binary profiles, motif-based features and physicochemical features. To improve the representation ability, the features are often optimized by removing irrelevant features [37]. The last stage is model construction and prediction. The optimal features from the previous stage are used to train ML algorithms (e.g. SVM and RF). Query peptide sequences are encoded with the same feature vectors and then fed into the trained model. Ultimately, the developed prediction model provides a prediction of whether each query is a CPP or not.
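
The three stages above can be sketched end to end with AAC features and an off-the-shelf classifier. The peptides and labels below are toy placeholders (a real study would draw CPPs and non-CPPs from CPPsite 2.0), and the helper `aac` is a hypothetical name:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aac(seq):
    """Stage 2, feature encoding: amino acid composition (20-dim frequency vector)."""
    return np.array([seq.count(a) / len(seq) for a in AA])

# Stage 1: toy dataset (labels are illustrative only, NOT experimentally validated)
peptides = ["GRKKRRQRRRPPQ", "RQIKIWFQNRRMKWKK", "ALWKTLLKKVLKA", "GSGSGSGSGSGS",
            "AAAAPPPPGGGG", "KLALKLALKALKAALKLA", "DEDEDEDEDEDE", "TTTTSSSSNNNN"]
labels   = [1, 1, 1, 0, 0, 1, 0, 0]   # 1 = CPP, 0 = non-CPP

X = np.vstack([aac(p) for p in peptides])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)

# Stage 3: train the model on the training split, then predict the held-out split
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
```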

Cell-penetration peptides database

To date, there are two public CPP databases, namely CPPsite [14] and CPPsite 2.0 [13], the latter being the successor of the former. The details of the two databases are presented in Table 1. CPPsite, the 1st CPP database, was created by Gautam et al. [14] in 2012 and contains 843 entries with information on sequence, subcellular localization, physicochemical properties (PPs), uptake efficiency and more. In 2015, Agrawal et al. [13] released an updated version, CPPsite 2.0, which contains 1850 entries, including information on the model system, cargo, chemical modifications, predicted tertiary structure and so on [38].

Table 1

The detailed information of CPP databases

| Databases | Year | Number of true CPPs | Website | Ref. |
| --- | --- | --- | --- | --- |
| CPPsite | 2012 | 843 | http://webs.iiitd.edu.in/raghava/cppsite/ | [14] |
| CPPsite 2.0 | 2015 | 1850 | http://webs.iiitd.edu.in/raghava/cppsite2.0/ | [13] |
Table 2

Summary of 15 existing machine learning based prediction tools in the literature

| Year | Methods | Feature representation | Feature selection | Predictor name | URL | Ref |
| --- | --- | --- | --- | --- | --- | --- |
| 2005 | Z-descriptors | Bulk properties of the constituent amino acids | N.A. | N.A. | N.A. | [6] |
| 2008 | Partial least squares | Chemical properties | Principal component analysis | N.A. | N.A. | [1] |
| 2010 | ANNs | Biochemical features | N.A. | N.A. | N.A. | [28] |
| 2011 | SMO-based SVMs and the Pearson VII universal kernel | Basic biochemical properties | Scatter search approach | N.A. | N.A. | [7] |
| 2011 | ANNs | N.A. | Principal component analysis | N.A. | N.A. | [20] |
| 2013 | SVMs | Sequence composition, binary profile of patterns and physicochemical properties | N.A. | CellPPD | http://crdd.osdd.net/raghava/cellppd/ | [23] |
| 2013 | N-to-1 NN | Motif information | N.A. | CPPpred | http://bioware.ucd.ie/cpppred | [29] |
| 2015 | RF | PseAAC and five properties of amino acids | mRMR and IFS | N.A. | N.A. | [26] |
| 2016 | SVMs | Dipeptide composition | Analysis of variance | C2Pred | http://lin.uestc.edu.cn/server/C2Pred | [24] |
| 2017 | RF | K-skip-2-gram | N.A. | SkipCPP-Pred | http://server.malab.cn/SkipCPP-Pred/Index.html | [12] |
| 2017 | RF | PC-PseAAC, SC-PseAAC, ASDC and PPs | MRMD and SFS | CPPred-RF | http://server.malab.cn/CPPred-RF/ | [16] |
| 2018 | ERT and RF | AAC, AAI, DPC, PCP and CTD | N.A. | MLCPP | http://www.thegleelab.org/MLCPP/ | [15] |
| 2018 | KELM | AAC, DAC, PseAAC and the motif-based hybrid features | N.A. | KELM-CPPpred | http://sairam.people.iitgn.ac.in/KELM-CPPpred.html | [34] |
| 2018 | RF | Compositional information, position-specific information and physicochemical properties | mRMR and SFS | CPPred-FL | http://server.malab.cn/CPPred-FL/ | [11] |
| 2018 | RF | Sequence length, physicochemical properties and molecular properties | N.A. | N.A. | N.A. | [8] |

Note: Sequential minimal optimization (SMO); Pseudo amino acid composition (PseAAC); Minimum redundancy maximum relevance (mRMR); Incremental feature selection (IFS); Parallel correlation pseudo-amino-acid composition (PC-PseAAC); Series correlation pseudo-amino-acid composition (SC-PseAAC); Adaptive skip dipeptide composition (ASDC); Physicochemical properties (PPs); Maximal Relevance−Maximal Distance (MRMD); Sequential forward search (SFS); Amino acid composition (AAC); Amino acid index (AAI); Dipeptide composition (DPC); Physicochemical properties (PCP); Composition−transition−distribution (CTD); Dipeptide amino acid composition (DAC).


Existing CPP prediction methods

ML algorithms have been widely used to identify CPPs [38]. We summarize existing ML-based CPP prediction methods in Table 2. According to the ML algorithm used, they can be categorized into four classes, described in detail below.

Prediction methods based on Neural Network

Artificial NNs (ANNs) (Figure 1B) are algorithmic models that simulate the brain's synaptic connections to process information and react to the real world [39]. ANNs have two distinctive properties: (1) they can learn from examples and adapt to changes in environmental parameters; and (2) they can generate highly nonlinear decision boundaries in a multidimensional input space [39, 40].
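
The second property, nonlinear decision boundaries, can be illustrated with the classic XOR problem, which no linear classifier can solve but a small network with one hidden layer can. This is a toy sketch using scikit-learn's generic `MLPClassifier`, not any of the CPP-specific networks cited below; whether it recovers XOR perfectly depends on the random initialization:

```python
from sklearn.neural_network import MLPClassifier

# XOR: the two classes are not linearly separable in the 2-d input space
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# one hidden layer of 8 tanh units; lbfgs converges reliably on tiny datasets
net = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=1)
net.fit(X, y)
preds = net.predict(X).tolist()
```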

To date, three ANN-based CPP prediction methods have been developed. Dobchev et al. [28] specified biochemical features for true CPPs and non-CPPs and trained a prediction model using ANN algorithms together with Principal Component Analysis (PCA); PCA was used to select the most informative variables (used as inputs to the network) from the training set. Their model was reported to achieve an accuracy of 80–100% on a validation dataset containing 101 penetrating and non-penetrating peptides. The 2nd ANN method, proposed by Karelson et al. [20], predicts the cell-penetrating capability of compounds or drugs; it combines quantitative structure–activity relationship principles with ANN algorithms, yielding a prediction model with an overall accuracy of 83%. Its limitation is that it requires structural information as input, which is not always available, especially when characterizing the cell-penetrating properties of arbitrary peptides. The 3rd ANN-based method is CPPpred [29], whose prediction model was trained on redundancy-reduced datasets and achieved an accuracy of 82.98% on an independent test. Notably, this was the 1st study to emphasize the importance of stringent training datasets for constructing a robust prediction model.

Prediction methods based on Support Vector Machine

The objective of an SVM (Figure 1C) is to construct a maximum-margin hyperplane that separates the positives from the negatives with the minimal misclassification rate [41–43]. Basically, the SVM maps the input features into a high-dimensional space using a kernel function and finds the hyperplane that maximizes the distance between the hyperplane and the two classes [44, 45]. A test sample, mapped into the same high-dimensional space, is then predicted according to which side of the hyperplane it falls on. Several kernel functions are available, including linear and polynomial functions and the Gaussian radial-basis function. An SVM has two critical parameters: C, which controls the trade-off between training error and margin, and g (gamma), which controls the width of the Gaussian kernels centered on the support vectors. To achieve the best performance, these parameters usually need to be optimized by a grid search.
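
The grid search over C and gamma can be sketched as follows. The synthetic dataset is a stand-in for encoded peptide feature vectors, and the parameter grid is an arbitrary illustrative choice:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# toy stand-in for encoded peptide feature vectors (20 features, 2 classes)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# exhaustive search over C (error/margin trade-off) and gamma (RBF kernel width),
# each candidate scored by 5-fold cross-validation
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "gamma": [0.001, 0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
best_C, best_gamma = grid.best_params_["C"], grid.best_params_["gamma"]
```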

Several SVM-based tools have been proposed for predicting CPPs. Sanders et al. [7] developed an SVM-based approach for identifying potential CPPs, trained on the basic biochemical properties of peptides. The authors used three different benchmark datasets to highlight the importance of balanced datasets for accurate prediction; the accuracy on the balanced dataset reached 91.72%. Gautam et al. [23] proposed an SVM-based predictor called CellPPD and established a public web server for CPP prediction. In CellPPD, different feature representations, such as AAC, DAC, binary profiles, motif features and PPs, were used to train different predictive models. The model based on hybrid features was reported to achieve a maximum accuracy of 97.40%, better than the models based on individual features [34]. Tang et al. [24] developed C2Pred, a predictor based on an optimized DAC feature set, with an overall accuracy of about 83.6%. They also released a web server implementing C2Pred, but as of the writing of this paper the server is out of service.

Prediction methods based on Random Forest

RF (Figure 1D) is a powerful ML algorithm [25] with successful applications in bioinformatics [8, 11, 12, 16, 26, 46]. An RF is an ensemble of decision trees, and its training procedure is briefly as follows. Assuming the training set contains N samples with M features, RF draws N samples by bootstrapping to form a new training dataset and randomly selects m (m ≪ M) features to train a decision tree on it; this procedure is repeated until all the decision trees in the forest are trained. The final prediction is determined by combining the scores of all the decision trees. The number of decision trees and the number of randomly selected features (mtry) are the two main parameters for training accurate RF models.
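
The bootstrap-and-vote procedure above can be sketched directly. This is a simplified illustration on synthetic data: a real RF re-samples the m candidate features at every split of every tree, whereas here each tree gets one fixed feature subset for brevity:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# toy stand-in for N=300 encoded peptides with M=20 features
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)
n, M = X.shape
m = int(np.sqrt(M))                     # m << M features per tree (simplification)

trees = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)                    # bootstrap N samples
    feats = rng.choice(M, size=m, replace=False)        # random feature subset
    t = DecisionTreeClassifier().fit(X[np.ix_(idx, feats)], y[idx])
    trees.append((t, feats))

# ensemble prediction: average the trees' votes and threshold at 0.5
votes = np.mean([t.predict(X[:, f]) for t, f in trees], axis=0)
pred = (votes >= 0.5).astype(int)
```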

The RF algorithm has been widely applied to CPP prediction. Chen et al. [26] developed an RF-based CPP prediction model trained on a series of PPs, including pseudo-AAC (PseAAC) [18], molecular volume, polarity, codon diversity, electrostatic charge and secondary structure [26]. Optimized features were selected by minimum redundancy maximum relevance [47] and incremental feature selection [48]; the overall accuracy of the model is 83.45%. Considering the long-range effect between residues, in a previous study we proposed an adaptive k-skip-2-gram algorithm to extract features and trained a predictor named SkipCPP-Pred with an improved accuracy of 90.6% [12]. In another work, we proposed a two-layer predictor called CPPred-RF, in which the 1st layer discriminates true CPPs from non-CPPs and the 2nd layer predicts whether the uptake efficiency of a CPP is high or low [16]. Its prediction model was trained on integrative features combining four sequence-based descriptors: PC-PseAAC [49], SC-PseAAC [49], adaptive skip DAC (ASDC) and PPs [50–52]. Compared with SkipCPP-Pred, CPPred-RF increased the prediction accuracy (evaluated with 10-fold cross-validation) to 91.6% on the same benchmark dataset. It is worth noting that CPPred-RF is the 1st tool that can predict the uptake efficiency of CPPs. Another work, by Wolfe et al. [8], focuses on the transport of phosphorodiamidate morpholino oligonucleotides by CPPs; peptide molecular weight, sequence length, theoretical net charge and amino acid physicochemical descriptors were used as input features to train an RF model. Recently, Qiang et al. [11] proposed a computational predictor called CPPred-FL. More specifically, CPPred-FL introduces a feature representation learning strategy that learns class and probabilistic information from ML models built with multiple feature descriptors, such as PPs, compositional information and position-specific information. The best overall accuracy of CPPred-FL is 92.1% [11]. Although this is not a significant improvement over the authors' previous study [16], the number of features used to train the predictive models is far smaller; this feature representation strategy opens a new, effective way to extract highly expressive features.
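
The skip-gram idea behind SkipCPP-Pred and ASDC, counting residue pairs separated by up to k positions instead of only adjacent dipeptides, can be sketched as follows. This is a simplified illustration of the concept, not the published algorithms, and the function name is hypothetical:

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AA, repeat=2)]   # all 400 possible dipeptides

def skip_dipeptide_composition(seq, k=3):
    """Count residue pairs (seq[i], seq[i+gap]) for gaps 1..k, normalized to
    frequencies: a 400-dim vector capturing short- and longer-range pairings."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for gap in range(1, k + 1):
        for i in range(len(seq) - gap):
            counts[seq[i] + seq[i + gap]] += 1
            total += 1
    return [counts[p] / total for p in PAIRS]

# Tat(48-60), a classic CPP, as an example input
vec = skip_dipeptide_composition("GRKKRRQRRRPPQ", k=3)
```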

Prediction methods based on other machine learning algorithms

Besides the methods above, some prediction methods are based on other ML algorithms, such as ERT [15, 30] and KELM [33, 34]. In a recent study, Manavalan et al. [15] proposed a two-layer model for predicting CPPs and their uptake efficiency: the 1st-layer model for CPP prediction was trained with the ERT algorithm and reached an accuracy of 89.6%, while the 2nd-layer uptake efficiency model was trained with RF and reached an accuracy of 72.5%. Pandey et al. [34] developed a KELM-based model, KELM-CPPpred, whose prediction models utilize six different feature descriptors: AAC, dipeptide AAC (DAC), PseAAC and three hybrid features (Hybrid-AAC, Hybrid-DAC and Hybrid-PseAAC) [34]. KELM-CPPpred achieved an accuracy of 83.10% on an independent dataset. Moreover, some studies do not clearly describe their use of ML algorithms. For example, Hällbrink et al. [6] proposed a prediction method based on five z-descriptors [53] extracted from physical characteristics of peptide sequences; likewise, Hansen et al. [1] developed a method based on chemical properties to predict CPPs and non-CPPs.
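
Unlike iteratively trained networks, a KELM has a closed-form solution: the output weights are obtained by solving a single regularized kernel system, beta = (K + I/C)^(-1) y. A minimal numpy sketch on synthetic data (the RBF kernel, C and gamma values, and labels are illustrative assumptions, not KELM-CPPpred's settings):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian RBF kernel matrix between row vectors of A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def kelm_fit(X, y, C=10.0, gamma=0.1):
    """Closed-form KELM training: beta = (K + I/C)^-1 y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + np.eye(len(X)) / C, y)

def kelm_predict(X_train, beta, X_new, gamma=0.1):
    """Score new samples; the sign of the score gives the predicted class."""
    return rbf_kernel(X_new, X_train, gamma) @ beta

# toy binary problem with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.sign(X[:, 0])            # class determined by the first feature
beta = kelm_fit(X, y)
scores = kelm_predict(X, beta, X)
```

The design appeal is that training costs one linear solve on the kernel matrix, with only C and the kernel parameter to tune.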

Table 3

Summary of six available web servers for CPP prediction

| Predictors | Year | Classifier | Predicting uptake efficiency | Sequence length limitation | Upload sequence | Multiple input | URL | Ref. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CellPPD | 2013 | SVM | N.A. | 1–50 | No | Yes | http://crdd.osdd.net/raghava/cellppd/ | [23] |
| SkipCPP-Pred | 2017 | RF | N.A. | No less than 10 | No | Yes | http://server.malab.cn/SkipCPP-Pred/Index.html | [12] |
| CPPred-RF | 2017 | RF | Yes | No limitation | No | Yes | http://server.malab.cn/CPPred-RF | [16] |
| MLCPP | 2018 | ERT and RF | Yes | No limitation | Yes | Yes | www.thegleelab.org/MLCPP | [15] |
| KELM-CPPpred | 2018 | KELM | N.A. | 5–30 | No | Yes | http://sairam.people.iitgn.ac.in/KELM-CPPpred.html | [34] |
| CPPred-FL | 2018 | RF | N.A. | No limitation | Yes | Yes | http://server.malab.cn/CPPred-FL | [11] |

Web-accessible prediction tools

As described in the section Existing CPP prediction methods, there are a total of 15 prediction methods, but only 6 of them provide available web servers for CPP prediction: CellPPD, SkipCPP-Pred, CPPred-RF, MLCPP, KELM-CPPpred and CPPred-FL [11, 12, 15, 16, 23, 34]. The basic information on these web servers is summarized in Table 3; they are described in detail below.

  • CellPPD is an in silico method for predicting and designing CPPs. The web server provides users with two prediction models: (1) an SVM-based model and (2) an SVM + Motif-based model [23]. The 1st model was trained with an SVM classifier [14] using the binary N10-C10 descriptor, while the 2nd was trained with a hybrid descriptor combining binary profile patterns and motif features. It should be pointed out that CellPPD is the 1st server to predict CPPs. Additionally, CellPPD is able to identify potential CPPs within protein sequences, although the length of input protein sequences is limited to 500 residues. Furthermore, the web server also allows users to design novel cell-penetrating peptides with certain PPs according to specific needs. The web server is freely available at http://crdd.osdd.net/raghava/cellppd/.

  • SkipCPP-Pred is an RF-based prediction method. The prediction model was trained with the features extracted by an adaptive k-skip-2-gram algorithm [12]. Because it uses sequential features only, SkipCPP-Pred can quickly predict whether input peptides are CPPs or not. Notably, this server does not limit the number of input sequences, although each input peptide must be at least 10 residues long. The web server can be accessed via http://server.malab.cn/SkipCPP-Pred/Index.html.

  • CPPred-RF is a two-layer RF-based predictor that predicts CPPs and their uptake efficiency simultaneously [16]. This is the 1st server to make a breakthrough in predicting the uptake efficiency of CPPs. Similar to other servers, it supports the prediction of multiple sequences. CPPred-RF is publicly available at http://server.malab.cn/CPPred-RF.

  • MLCPP, similar to CPPred-RF, is also a two-layer predictor of CPPs and their uptake efficiency. For given peptide sequences, the 1st-layer model predicts whether each query sequence is a CPP; if an input sequence is predicted as a CPP, the 2nd-layer model predicts its uptake efficiency [15]. The final results include the prediction labels and corresponding probability scores. MLCPP is freely available at www.thegleelab.org/MLCPP.

  • KELM-CPPpred is a KELM-based CPP prediction tool. The web server provides six prediction models based on different features: AAC, DAC, PseAAC, Hybrid-AAC, Hybrid-DAC and Hybrid-PseAAC [34]. Users can select one of the models to make predictions. KELM-CPPpred allows users to type one or multiple query sequences of 5–30 residues in length as inputs. The web server can be accessed via http://sairam.people.iitgn.ac.in/KELM-CPPpred.html.

  • CPPred-FL is a recent predictor for CPP prediction [11]. The server provides two prediction modes, based on class information and on probabilistic information, for CPP identification. Unlike the servers above, this server is designed to identify CPPs within proteins. When using this tool, users should choose a prediction mode and set a confidence threshold and a cutting length. It allows users to submit multiple protein sequences. The output of CPPred-FL contains all peptide sequences predicted to have cell-penetrating activity, together with the corresponding residue positions and prediction confidence. CPPred-FL is publicly available at http://server.malab.cn/CPPred-FL.

Validation datasets

Two benchmark validation datasets were used in this study for a comparative evaluation of existing methods. They were downloaded from the independent datasets of two recent studies: Pandey’s work [34] and Manavalan’s study [15]. For convenience of discussion, they are denoted as kelm and mlcpp, respectively. The kelm dataset includes 96 experimentally validated CPPs as positives and 96 non-CPPs as negatives, whereas the mlcpp dataset consists of 311 true CPPs (positives) and 311 non-CPPs (negatives). However, some servers impose strict length limitations on input sequences (see Table 3 for details). To test all the web-based prediction tools, we removed the sequences that did not meet the predictors’ length requirements. Moreover, to reduce the bias caused by high sequence similarity between training and validation datasets, we first removed the sequences in the validation datasets with significant sequence similarity to sequences in the training datasets using BLASTP (version 2.8.1+) under default settings. Afterwards, we used CD-HIT, a frequently used sequence homology reduction tool in bioinformatics, to further remove validation sequences sharing >30% sequence identity with the training sequences. As a result, only 71 CPPs and 48 non-CPPs from kelm, and 149 CPPs and 193 non-CPPs from mlcpp, were retained. It is worth noting that the positives in both validation datasets were derived from the CPPsite 2.0 database.
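As a concrete illustration of the first filtering step above, the following minimal Python sketch drops validation peptides that a given server would reject; the dictionary of limits simply transcribes Table 3, and the function name is our own:

```python
# Per-server accepted length ranges in residues, transcribed from Table 3.
# (min, max); None means unbounded on that side.
SERVER_LENGTH_LIMITS = {
    "CellPPD": (1, 50),
    "SkipCPP-Pred": (10, None),
    "CPPred-RF": (None, None),
    "MLCPP": (None, None),
    "KELM-CPPpred": (5, 30),
    "CPPred-FL": (None, None),
}

def filter_by_length(peptides, server):
    """Keep only the peptides a given web server will accept."""
    lo, hi = SERVER_LENGTH_LIMITS[server]
    lo = lo or 1
    hi = hi or float("inf")
    return [p for p in peptides if lo <= len(p) <= hi]

peptides = ["GRKKRRQRRRPPQ", "RQIKIWFQNRRMKWKK", "KW", "A" * 35]
print(filter_by_length(peptides, "KELM-CPPpred"))  # drops the 2-mer and the 35-mer
```

Peptides outside a server's range must be discarded before submission, otherwise the per-server confusion-matrix counts would not be comparable.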

Table 4

Performance of 12 CPP prediction models in six web-accessible predictors on kelm dataset

Prediction tools | TP | FP | TN | FN | SE (%) | SP (%) | ACC (%) | MCC
MLCPP | 53 | 5 | 43 | 18 | 74.65 | 89.58 | 80.67 | 0.63
CPPred-RF | 59 | 12 | 36 | 12 | 83.10 | 75.00 | 79.83 | 0.58
KELM-AAC | 49 | 5 | 43 | 22 | 69.01 | 89.58 | 77.31 | 0.58
KELM-hybrid-AAC | 49 | 5 | 43 | 22 | 69.01 | 89.58 | 77.31 | 0.58
CPPred-FL | 56 | 10 | 38 | 15 | 78.87 | 79.17 | 78.99 | 0.57
CellPPD | 45 | 3 | 45 | 26 | 63.38 | 93.75 | 75.63 | 0.57
CellPPD-motif | 45 | 3 | 45 | 26 | 63.38 | 93.75 | 75.63 | 0.57
KELM-PseAAC | 59 | 13 | 35 | 12 | 83.10 | 72.92 | 78.99 | 0.56
KELM-DAC | 40 | 1 | 47 | 31 | 56.34 | 97.92 | 73.11 | 0.56
SkipCPP-Pred | 58 | 13 | 35 | 13 | 81.69 | 72.92 | 78.15 | 0.55
KELM-hybrid-PseAAC | 59 | 14 | 34 | 12 | 83.10 | 70.83 | 78.15 | 0.54
KELM-hybrid-DAC | 49 | 8 | 40 | 22 | 69.01 | 83.33 | 74.79 | 0.51
Table 5

Performance of 12 CPP prediction models in six web-accessible predictors on mlcpp dataset

Prediction tools | TP | FP | TN | FN | SE (%) | SP (%) | ACC (%) | MCC
KELM-hybrid-AAC | 141 | 7 | 186 | 8 | 94.63 | 96.37 | 95.61 | 0.91
KELM-hybrid-DAC | 141 | 52 | 141 | 8 | 94.63 | 73.06 | 82.46 | 0.68
KELM-AAC | 140 | 52 | 141 | 9 | 93.96 | 73.06 | 82.16 | 0.67
KELM-PseAAC | 142 | 57 | 136 | 7 | 95.30 | 70.47 | 81.29 | 0.66
CPPred-FL | 144 | 62 | 131 | 5 | 96.64 | 67.88 | 80.41 | 0.65
MLCPP | 144 | 65 | 128 | 5 | 96.64 | 66.32 | 79.53 | 0.64
CPPred-RF | 146 | 74 | 119 | 3 | 97.99 | 61.66 | 77.49 | 0.62
SkipCPP-Pred | 148 | 81 | 112 | 1 | 99.33 | 58.03 | 76.02 | 0.60
CellPPD | 120 | 47 | 146 | 29 | 80.54 | 75.65 | 77.78 | 0.56
CellPPD-motif | 120 | 47 | 146 | 29 | 80.54 | 75.65 | 77.78 | 0.56
KELM-hybrid-PseAAC | 138 | 84 | 109 | 11 | 92.62 | 56.48 | 72.22 | 0.51
KELM-DAC | 138 | 87 | 106 | 11 | 92.62 | 54.92 | 71.35 | 0.50

Performance measurements

Four evaluation metrics, sensitivity (SE), specificity (SP), accuracy (ACC) and the Matthews correlation coefficient (MCC), have been widely used in several bioinformatic fields [54–67]. Here, we also utilized these four metrics for evaluating the prediction models; they are calculated by the following formulas:

SE = TP / (TP + FN) × 100%
SP = TN / (TN + FP) × 100%
ACC = (TP + TN) / (TP + TN + FP + FN) × 100%
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. SE and SP measure the predictive ability on the positive and negative classes, respectively, while ACC and MCC evaluate the overall performance of a predictive model [68, 69].
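These four metrics can be computed directly from the confusion-matrix counts. The following sketch (a helper of our own, not code from any of the reviewed tools) reproduces the MLCPP row of Table 4 from its TP/FP/TN/FN values:

```python
from math import sqrt

def cpp_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics used throughout this review.
    SE, SP and ACC are returned as percentages; MCC lies in [-1, 1]."""
    se = 100.0 * tp / (tp + fn)                    # sensitivity (recall on CPPs)
    sp = 100.0 * tn / (tn + fp)                    # specificity (recall on non-CPPs)
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)  # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, acc, mcc

# Reproduce the MLCPP row of Table 4 (kelm dataset): TP=53, FP=5, TN=43, FN=18
se, sp, acc, mcc = cpp_metrics(53, 5, 43, 18)
print(f"SE={se:.2f} SP={sp:.2f} ACC={acc:.2f} MCC={mcc:.2f}")
# SE=74.65 SP=89.58 ACC=80.67 MCC=0.63
```

Note the guard on the denominator: if any margin of the confusion matrix is zero, MCC is conventionally reported as 0.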
Figure 2

Performance of 12 CPP prediction models in six different web-accessible predictors on kelm dataset. (A–D) denote the performance in ACC, MCC, SE and SP, respectively. Note that the prediction model with the best performance in each sub-figure is marked in orange.

Results and discussion

Comparative results on benchmark validation datasets

In this work, our aim is to conduct an unbiased performance evaluation of existing prediction tools. To avoid the potential evaluation biases introduced by re-implementing existing predictors, we chose only the prediction tools with available web servers for comparison: CellPPD, SkipCPP-Pred, CPPred-RF, MLCPP, KELM-CPPpred and CPPred-FL [11, 12, 15, 16, 23, 34]. We noticed that some servers, such as CellPPD and KELM-CPPpred, provide more than one prediction model (refer to Web-accessible prediction tools for details). In total, we collected 12 prediction models from the 6 web servers. To conduct a comprehensive comparison, all available prediction models were tested and compared. Moreover, it is instructive to compare the web-accessible prediction tools on independent tests, since this reflects how they would be used in practical applications. Here, the two benchmark validation datasets, kelm and mlcpp, are used for the test. The comparison results on the two benchmarks are presented in Tables 4 and 5, respectively.

On the kelm dataset (see Table 4 and Figure 2), MLCPP achieved the best performance among the 12 tested prediction models, giving the highest ACC of 80.67% and MCC of 0.63. Despite this, other prediction tools, such as CPPred-RF, KELM-AAC and KELM-hybrid-AAC, are indeed competitive with MLCPP in terms of MCC; their MCCs are all 0.58, only slightly worse than that of MLCPP (MCC = 0.63). We place less emphasis on the comparison in terms of ACC, since the kelm dataset is imbalanced, and on an imbalanced dataset MCC is a better measure of the overall performance of a prediction model. However, given that the numbers of positives and negatives in the kelm dataset are relatively small, it is actually hard to determine, within such a small performance gap, which prediction tool is better.

Next, we further compared the performance on the larger mlcpp dataset, which contains more testing samples (149 positives and 193 negatives). The results are shown in Table 5 and Figure 3, from which the following observations can be made. The 1st observation is that KELM-hybrid-AAC, trained with the KELM classifier and hybrid features (motif and AAC), outperforms the competing prediction tools in three of the four metrics: SP, ACC and MCC. More specifically, KELM-hybrid-AAC achieved an SP, ACC and MCC of 96.37%, 95.61% and 0.91, higher than those of the 2nd best KELM-hybrid-DAC by 23.31%, 13.15% and 0.23, respectively. The 2nd observation is that the performance of MLCPP decreased on mlcpp (see Table 5 and Figure 3) compared with kelm (see Table 4 and Figure 2); among the 12 compared prediction models, MLCPP ranks 6th. Furthermore, we observed that the SP of KELM-hybrid-AAC reaches 96.37%. This indicates that using this model would generate fairly few false positives; most random peptides without cell-penetrating ability would be filtered out by KELM-hybrid-AAC. This greatly facilitates the experimental validation of predictions, largely reducing its cost and time. In addition, we also investigated the ability of the prediction tools to identify the positives (true CPPs). For the prediction of true CPPs, SkipCPP-Pred and CPPred-RF are the top two predictors, with SEs of 99.33% and 97.99%, respectively; CPPred-FL and MLCPP achieved the 3rd best SE of 96.64%. This demonstrates that they can identify more CPPs than the other predictors.

Figure 3

Performance of 12 CPP prediction models in 6 different web-accessible predictors on mlcpp dataset. (A–D) denote the performance in ACC, MCC, SE and SP, respectively. Note that the prediction model with the best performance in each sub-figure is marked in orange.

Taken together, the comparison study on the two benchmarks shows that one of the prediction models on the KELM-CPPpred web server, namely KELM-hybrid-AAC, generally outperforms the other 11 state-of-the-art web-accessible prediction models. Importantly, KELM-hybrid-AAC provides a more balanced trade-off between SE and SP than the other prediction tools.

Length-dependency comparison of existing prediction tools

For CPP prediction tools, the identification of peptides with cell-penetrating activity within proteins is the main task in this field. Experimentally validated CPPs are typically 10–50 residues long. It is therefore interesting to see whether existing CPP prediction tools exhibit any length dependency. For this purpose, the samples in the benchmark dataset were divided into four groups according to length: [10, 15], (15, 20], (20, 25] and (25, 30]. Note that the lengths of the samples in the 1st group [10, 15] lie in the range of 10–15 residues; the lengths in the 2nd group (15, 20] lie in the range of 16–20 residues; and so forth. Figure 4 illustrates the prediction results of the 12 prediction models from the six web servers over the different length ranges on the mlcpp dataset. We conducted the length-dependency comparison only on the mlcpp dataset, since it contains more testing samples than the kelm dataset and is therefore more representative. As shown in Figure 4, almost all the prediction models performed better in the length range (20, 25] than in the other ranges. In other words, predicting CPPs and non-CPPs with lengths in (20, 25] is a relatively easy task compared with the prediction in other length ranges. For the prediction of CPPs outside the range (20, 25], no clear trend was observed. This result is quite interesting and might provide some insights for the design of CPPs.
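The binning scheme above can be sketched as follows (a minimal illustration; the function name is our own):

```python
def length_bin(peptide):
    """Assign a peptide to one of the four length bins used in the
    length-dependency analysis: [10, 15], (15, 20], (20, 25], (25, 30]."""
    n = len(peptide)
    if 10 <= n <= 15:
        return "[10, 15]"
    if 15 < n <= 20:
        return "(15, 20]"
    if 20 < n <= 25:
        return "(20, 25]"
    if 25 < n <= 30:
        return "(25, 30]"
    return None  # outside the analysed range

print(length_bin("RQIKIWFQNRRMKWKK"))  # penetratin, 16 residues -> "(15, 20]"
```

The half-open intervals ensure each integer length falls into exactly one bin.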

Figure 4

Performance of 12 CPP prediction models in 6 web-accessible predictors with different length ranges on mlcpp dataset. (A–L) denote the overall ACC of the 12 CPP prediction models in predicting CPPs and non-CPPs within different ranges, respectively. Note that we divided the samples in the mlcpp dataset into four length ranges: [10, 15], (15, 20], (20, 25] and (25, 30]. In each sub-figure, T denotes the ratio of correctly predicted samples among the total samples in one specific range, and F denotes the ratio of incorrectly predicted samples among the total samples in one specific range.

Usage comparison of cell-penetrating peptides web servers

In this section, we investigated whether these servers are as user-friendly as claimed in their own studies, since the usability of web servers is quite important as well. We found several limitations when conducting the server tests. The 1st limitation is the restriction on input sequence length for some servers. For example, CellPPD allows users to input sequences of 1–50 residues, and KELM-CPPpred of 5–30 residues, while SkipCPP-Pred requires input sequences of >10 residues. Secondly, some servers do not support multiple sequences or uploading sequence files for batch processing. MLCPP and CPPred-FL offer a more convenient function for users to upload their data as files in a specified format (i.e. FASTA files). The CellPPD server has a file-upload function, but it does not work well. Thirdly, we found that errors happened frequently when testing the CellPPD server: the prediction results are overwritten by those of the 1st query sequence, resulting in invalid prediction results. We therefore had to conduct the independent test one sequence at a time to ensure correct prediction results, which is quite inconvenient for large-scale predictions. Fourthly, in terms of running time, SkipCPP-Pred is the fastest of these web servers, whether there are hundreds or thousands of query sequences. Although KELM-CPPpred claims to handle multiple sequences in a single run, its processing speed is quite slow, and the server usually returns timeout errors when more than 50 peptides are submitted as inputs. Finally, CPPred-RF and MLCPP are the only two servers that can predict the uptake efficiency of CPPs. In particular, CellPPD has the function of designing efficient CPPs and allows users to specify CPP motifs in long query protein sequences.

Discussion

In this study, our aim is to conduct an empirical comparison and analysis of 12 prediction models from 6 state-of-the-art CPP prediction tools that are accessible as web servers. We evaluated the prediction models on two benchmark validation datasets in an unbiased way. Benchmarking results demonstrate that, among the 12 prediction models, KELM-hybrid-AAC from the KELM-CPPpred web server provides the best overall performance. This might be due to the use of hybrid features, which integrate the motif-based descriptor with the compositional descriptor AAC. We further analyzed which motif features were used, and found that the authors specified the most frequent amino acid motifs in their dataset, including RRRRRR, RRA, GRRX (where X = R, W, T), RRGRX (X = R, G, T) and KKRK. The results demonstrate that fusing amino acid composition with such sequential motifs can sufficiently capture the intrinsic characteristics of CPPs and non-CPPs. Another interesting finding is that the KELM-CPPpred server recommends a different model, namely KELM-hybrid-PseAAC, based on the results in the original study [34]; in our test, however, the recommended model is actually almost the worst of the six prediction models on the KELM-CPPpred server. Generally speaking, it can be concluded that the KELM-hybrid-AAC model is the better choice for making predictions in terms of performance.
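As an illustration of how such motif indicators could be encoded (a hypothetical sketch, not the KELM-CPPpred authors' implementation; we render the X wildcards listed above as regular-expression character classes):

```python
import re

# Frequent CPP motifs discussed above; X wildcards become character classes.
CPP_MOTIFS = ["RRRRRR", "RRA", "GRR[RWT]", "RRGR[RGT]", "KKRK"]

def motif_features(peptide):
    """Return one binary feature per motif: 1 if the motif occurs anywhere
    in the peptide, 0 otherwise."""
    return [1 if re.search(m, peptide) else 0 for m in CPP_MOTIFS]

print(motif_features("AAGRRWAA"))  # -> [0, 0, 1, 0, 0] (only GRRW matches)
```

A hybrid descriptor would then concatenate such a binary motif vector with the 20-dimensional AAC vector.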

To further improve the performance of CPP prediction, there are still many aspects that can be explored. The 1st and most important aspect is feature representation. As we can see, most prediction tools extract features from primary sequences, including amino acid PPs, AAC, DAC and ASDC, etc. The following question is, why not use structural information? The answer is that peptides are very short, usually 10–50 residues long, and such short peptides cannot form stable secondary structures, making it extremely difficult to find structural characteristics that discriminate CPPs from non-CPPs. Basically, only sequential information can be exploited. Therefore, how to effectively use different types of sequential information is an open question. Recently, Wei et al. [68] proposed a feature representation learning strategy to automatically learn the most informative features in a supervised way. This work might give researchers a hint to explore other strategies for more effective sequence-based feature representations. Additionally, feature selection, which facilitates the discovery of the most predictive features, can be another route to improved performance. Moreover, recent studies have shown that more powerful classifiers are a complementary way to improve performance [15, 34, 70, 71].
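As an illustration of the simplest of the sequence descriptors mentioned above, AAC reduces a peptide to its 20-dimensional vector of residue frequencies (a minimal sketch; the function name is our own):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(peptide):
    """Amino acid composition (AAC): the fraction of each of the 20 standard
    residues in the peptide, in a fixed alphabetical order."""
    n = len(peptide)
    return [peptide.count(a) / n for a in AMINO_ACIDS]

features = aac("GRKKRRQRRRPPQ")  # HIV-1 TAT basic domain
print(features[AMINO_ACIDS.index("R")])  # arginine fraction: 6/13
```

DAC and PseAAC extend this idea to residue pairs and to composition augmented with sequence-order information, respectively.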

Although much progress has been made in the development of CPP prediction tools, some limitations and challenges remain to be addressed. Firstly, the main challenge that all ML-based prediction tools face is the selection of high-quality samples to adequately represent the positives (true CPPs) and negatives (non-CPPs). Currently, the number of positive samples is still limited. Although public databases such as CPPsite 2.0 collect almost 2000 experimentally validated CPPs, only 400–500 remain in benchmark datasets after the removal of high-identity sequences. For the selection of negative samples, random peptides from proteins not annotated as CPPs are frequently used. An alternative way to generate negative controls is to shuffle the sequence content of existing CPPs (e.g. scrambled CPPs). However, it cannot be guaranteed that such random peptides are not true CPPs, although the probability of a random sequence being a true CPP is very low. Secondly, most current predictors focus on distinguishing true CPPs from non-CPPs, whereas few studies (only CPPred-RF and MLCPP) address uptake efficiency prediction, which is equally important, since the uptake efficiency of CPPs is closely associated with their practical applications in efficient drug delivery. One possible reason is that there is not enough experimental data to predict the efficiency of CPPs. Moreover, for the only two tools offering efficiency prediction, a simple binary prediction of internalization efficiency may soon be outdated.
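The scrambled-CPP strategy described above can be sketched as follows (a minimal illustration; the function name and the explicit seeding are our own):

```python
import random

def scramble(peptide, seed=None):
    """Generate a scrambled negative control: same length and amino acid
    composition as the input CPP, but with the residue order randomised."""
    rng = random.Random(seed)
    residues = list(peptide)
    rng.shuffle(residues)
    return "".join(residues)

cpp = "RQIKIWFQNRRMKWKK"  # penetratin
neg = scramble(cpp, seed=0)
print(neg, sorted(neg) == sorted(cpp))  # same composition, different order
```

Because composition is preserved, such negatives force a classifier to learn order-dependent signals rather than composition alone; the caveat in the text still applies, since a scrambled peptide is not guaranteed to lack cell-penetrating activity.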

Key Points

  • We comprehensively review a variety of existing ML-based methods for the prediction of CPPs.

  • We conduct a comparative study and analyze available web servers for predicting CPPs.

  • Benchmarking results demonstrate that, among the 12 prediction models from 6 available CPP prediction servers, the KELM-hybrid-AAC model provides the best performance.

  • Our analysis demonstrates that existing prediction tools tend to predict CPPs and non-CPPs of 20–25 residues in length more accurately than peptides in other length ranges.

Funding

National Natural Science Foundation of China (Nos. 61701340, 61702361 and 61771331), the Natural Science Foundation of Tianjin City (Nos. 18JCQNJC00500 and 18JCQNJC00800), the National Key R&D Program of China (SQ2018YFC090002) and the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education, Science, and Technology (2018R1D1A1B07049572).

Ran Su is currently an associate professor in the School of Computer Software, College of Intelligence and Computing, at Tianjin University, China. Her research interests include pattern recognition, machine learning and bioinformatics.

Jie Hu received her BSc degree in Resource Environment and Urban Planning Management from Wuhan University of Science and Technology, China. She is currently a graduate student in the School of Computer Science and Technology, College of Intelligence and Computing, at Tianjin University, China. Her research interests are bioinformatics and machine learning.

Quan Zou is a professor of University of Electronic Science and Technology of China. He received his PhD in Computer Science from Harbin Institute of Technology, P.R. China in 2009. His research is in the areas of bioinformatics, machine learning and parallel computing, with focus on genome assembly, annotation and functional analysis from the next generation sequencing data with parallel computing methods.

Balachandran Manavalan received his PhD degree in 2011 from Ajou University, South Korea. Currently, he is working as a research professor in the Department of Physiology, Ajou University School of Medicine, Suwon, Korea. He is also an associate member of Korea Institute for Advanced Study (KIAS), Seoul, Korea. His main research interests include protein structure prediction, machine learning, data mining, computational biology, and functional genomics.

Leyi Wei received his PhD in Computer Science from Xiamen University, China. He is currently an Assistant Professor in the School of Computer Science and Technology, College of Intelligence and Computing, at Tianjin University, China. His research interests include machine learning and its applications to bioinformatics.

References

1.

Hansen
M
,
Kilk
K
,
Langel
Ü
.
Predicting cell-penetrating peptides
.
Adv Drug Deliv Rev
2008
;
60
:
572
579
.

2.

Kilk
K
.
Cell-penetrating peptides and bioactive cargoes: strategies and mechanisms.
Doctoral Thesis,
Institutionen för Neurokemi
,
2004
.

3.

Madani
F
,
Lindberg
S
,
Langel
Ü
, et al. 
Mechanisms of cellular uptake of cell-penetrating peptides
,
J Biophys
2011
;
2011
; DOI: .

4.

Milletti
F
.
Cell-penetrating peptides: classes, origin, and current landscape
.
Drug Disco Today
2012
;
17
:
850
860
.

5.

Raucher
D
,
Ryu
JS
.
Cell-penetrating peptides: strategies for anticancer treatment
.
Trends Mol Med
2015
;
21
:
560
570
.

6.

Hällbrink
M
,
Kilk
K
,
Elmquist
A
, et al. 
Prediction of cell-penetrating peptides
.
Int J Pept Res Ther
2005
;
11
:
249
259
.

7.

Sanders
WS
,
Johnston
CI
,
Bridges
SM
, et al. 
Prediction of cell penetrating peptides by support vector machines
.
PLoS Comput Biol
2011
;
7
:
e1002101
.

8.

Wolfe
JM
,
Fadzen
CM
,
Choo
Z-N
, et al. 
Machine learning to predict cell-penetrating peptides for antisense delivery
.
ACS Cent Sci
2018
;
4
:
512
520
.

9.

Heitz
F
,
Morris
MC
,
Divita
G
.
Twenty years of cell-penetrating peptides: from molecular mechanisms to therapeutics
.
Br J Pharmacol
2009
;
157
:
195
206
.

10.

Frankel
AD
,
Pabo
CO
.
Cellular uptake of the tat protein from human immunodeficiency virus
.
Cell
1988
;
55
:
1189
1193
.

11.

Qiang
X
,
Zhou
C
,
Ye
X
, et al. 
CPPred-FL: a sequence-based predictor for large-scale identification of cell-penetrating peptides by feature representation learning. A predictor for CPP identification
.
Briefings in Bioinformatics
2018
. DOI: .

12.

Wei
L
,
Tang
J
,
Zou
Q
.
SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides
.
BMC Genomics
2017
;
18
:
1
.

13.

Agrawal
P
,
Bhalla
S
,
Usmani
SS
, et al. 
CPPsite 2.0: a repository of experimentally validated cell-penetrating peptides
.
Nucleic Acids Res
2015
;
44
:
D1098
D1103
.

14.

Gautam
A
,
Singh
H
,
Tyagi
A
, et al. 
CPPsite: a curated database of cell penetrating peptides
.
Database
2012
;
2012
. DOI: .

15.

Manavalan
B
,
Subramaniyam
S
,
Shin
TH
, et al. 
Machine-learning-based prediction of cell-penetrating peptides and their uptake efficiency with improved accuracy
.
J Proteome Res
2018
;
17
:
2715
2716
.

16.

Wei
L
,
Xing
P
,
Su
R
, et al. 
CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency
.
J Proteome Res
2017
;
16
:
2044
2053
.

17.

Ansorge
WJ
.
Next-generation DNA sequencing techniques
.
N Biotechnol
2009
;
25
:
195
203
.

18.

Chou
KC
.
Prediction of protein cellular attributes using pseudo-amino acid composition
.
Proteins
2001
;
43
:
246
255
.

19.

Diener
C
,
Martínez
GGR
,
Blas
DM
, et al. 
Effective design of multifunctional peptides by combining compatible functions
.
PLoS Comput Biol
2016
;
12
:
e1004786
.

20.

Karelson
M
,
Dobchev
D
.
Using artificial neural networks to predict cell-penetrating compounds
.
Expert Opin Drug Discov
2011
;
6
:
783
796
.

21.

Wei
H
,
Yang
W
,
Tang
H
, et al. 
The development of machine learning methods in cell-penetrating peptides identification: a brief review
.
Curr Drug Metab
2018
. DOI: .

22.

Cortes
C
,
Vapnik
V
.
Support vector machine
.
Mach Learn
1995
;
20
:
273
297
.

23.

Gautam
A
,
Chaudhary
K
,
Kumar
R
, et al. 
In silico approaches for designing highly effective cell penetrating peptides
.
J Transl Med
2013
;
11
:
74
.

24.

Tang
H
,
Su
Z-D
,
Wei
H-H
, et al. 
Prediction of cell-penetrating peptides with feature selection techniques
.
Biochem Biophys Res Commun
2016
;
477
:
150
154
.

25.

Breiman
L
.
Random forests
.
Mach Learn
2001
;
45
:
5
32
.

26.

Chen
L
,
Chu
C
,
Huang
T
, et al. 
Prediction and analysis of cell-penetrating peptides using pseudo-amino acid composition and random forest models
.
Amino Acids
2015
;
47
:
1485
1493
.

27.

Specht
DF
.
A general regression neural network
.
IEEE Trans Neural Netw
1991
;
2
:
568
576
.

28.

A Dobchev
D
,
Mager
I
,
Tulp
I
, et al. 
Prediction of cell-penetrating peptides using artificial neural networks
.
Curr Comput Aided Drug Des
2010
;
6
:
79
89
.

29. Holton TA, Pollastri G, Shields DC, et al. CPPpred: prediction of cell penetrating peptides. Bioinformatics 2013;29:3094–3096.

30. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn 2006;63:3–42.

31. Basith S, Manavalan B, Shin TH, et al. iGHBP: computational identification of growth hormone binding proteins from sequences using extremely randomised tree. Comput Struct Biotechnol J 2018;16:412–420.

32. Manavalan B, Basith S, Shin TH, et al. MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget 2017;8:77121.

33. Huang G-B, Zhu Q-Y, Siew C-K. Extreme learning machine: theory and applications. Neurocomputing 2006;70:489–501.

34. Pandey P, Patel V, George NV, et al. KELM-CPPpred: kernel extreme learning machine based prediction model for cell-penetrating peptides. J Proteome Res 2018;17:3214–3222.

35. Stegmayer G, Di Persia LE, Rubiolo M, et al. Predicting novel microRNA: a comprehensive comparison of machine learning approaches. Brief Bioinform 2018. DOI: .

36. Liu B, Jiang S, Zou Q. HITS-PR-HHblits: protein remote homology detection by combining PageRank and hyperlink-induced topic search. Brief Bioinform 2018. DOI: .

37. Zou Q, Zeng J, Cao L, et al. A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing 2016;173:346–354.

38. Usmani SS, Kumar R, Bhalla S, et al. In silico tools and databases for designing peptide-based vaccine and drugs. Adv Protein Chem Struct Biol 2018;112:221–263.

39. Dreiseitl S, Ohno-Machado L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 2002;35:352–359.

40. Liu B, Li S. ProtDet-CCH: protein remote homology detection by combining long short-term memory and ranking methods. IEEE/ACM Trans Comput Biol Bioinform 2018. DOI: .

41. Noble WS. What is a support vector machine? Nat Biotechnol 2006;24:1565.

42. Liu B, Wu H, Wang X, et al. Pse-Analysis: a python package for DNA, RNA and protein peptide sequence analysis based on pseudo components and kernel methods. Oncotarget 2017;8:13338–13343.

43. Manavalan B, Lee J. SVMQA: support-vector-machine-based protein single-model quality assessment. Bioinformatics 2017;33:2496–2503.

44. Manavalan B, Shin TH, Lee G. PVP-SVM: sequence-based prediction of phage virion proteins using a support vector machine. Front Microbiol 2018;9:476.

45. Manavalan B, Shin TH, Lee G. DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest. Oncotarget 2018;9:1944.

46. Manavalan B, Lee J, Lee J. Random forest-based protein model quality assessment (RFMQA) using structural features and potential energy terms. PLoS One 2014;9:e106542.

47. Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 2005;27:1226–1238.

48. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell 1997;97:273–324.

49. Liu B, Liu F, Wang X, et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res 2015;43:W65–W71.

50. Lin C, Zou Y, Qin J, et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLoS One 2013;8:e56499.

51. Zou Q, Wang Z, Guan X, et al. An approach for identifying cytokines based on a novel ensemble classifier. BioMed Res Int 2013;2013. DOI: .

52. Liu B. BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches. Brief Bioinform 2018. DOI: .

53. Sandberg M, Eriksson L, Jonsson J, et al. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem 1998;41:2481–2491.

54. Zou Q, Xing P, Wei L, et al. Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA. RNA 2018. DOI: .

55. Zhang Z, Zhao Y, Liao X, et al. Deep learning in omics: a survey and guideline. Brief Funct Genomics 2018. DOI: .

56. Wei L, Su R, Wang B, et al. Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing 2019;324:3–9.

57. Long HX, Wang M, Fu HY. Deep convolutional neural networks for predicting hydroxyproline in proteins. Curr Bioinform 2017;12:233–238.

58. Yu L, Sun X, Tian SW, et al. Drug and nondrug classification based on deep learning with various feature selection strategies. Curr Bioinform 2018;13:253–259.

59. Wei L, Ding Y, Su R, et al. Prediction of human protein subcellular localization using deep learning. J Parallel Distrib Comput 2018;117:212–217.

60. Feng C-Q, Zhang Z-Y, Zhu X-J, et al. iTerm-PseKNC: a sequence-based tool for predicting bacterial transcriptional terminators. Bioinformatics 2018. DOI: .

61. Tang H, Zhao Y-W, Zou P, et al. HBPred: a tool to identify growth hormone-binding proteins. Int J Biol Sci 2018;14:957–964.

62. He Z, Yu W. Stable feature selection for biomarker discovery. Comput Biol Chem 2010;34:215–225.

63. Cabarle FGC, Adorna HN, Jiang M, et al. Spiking neural P systems with scheduled synapses. IEEE Trans Nanobioscience 2017;16:792–801.

64. Song T, Rodríguez-Patón A, Zheng P, et al. Spiking neural P systems with colored spikes. IEEE Trans Cogn Dev Syst 2017. DOI: .

65. Song T, Zeng X, Zheng P, et al. A parallel workflow pattern modelling using spiking neural P systems with colored spikes. IEEE Trans Nanobioscience 2018;17:474–484.

66. Yang H, Lv H, Ding H, et al. iRNA-2OM: a sequence-based predictor for identifying 2′-O-methylation sites in Homo sapiens. J Comput Biol 2018;25:1266–1277.

67. Zhu X-J, Feng C-Q, Lai H-Y, et al. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowledge-Based Syst 2018. DOI: .

68. Wei L, Zhou C, Chen H, et al. ACPred-FL: a sequence-based predictor based on effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 2018. DOI: .

69. Liu B, Weng F, Huang D-S, et al. iRO-3wPseKNC: identify DNA replication origins by three-window-based PseKNC. Bioinformatics 2018. DOI: .

70. Zou Q, Wan S, Ju Y, et al. Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy. BMC Syst Biol 2016;10:114.

71. Zeng XX, Liu L, Lu LY, et al. Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics 2018;34:2425–2432.
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)