Abstract

Cell-penetrating peptides (CPPs) facilitate the delivery of therapeutically relevant molecules, including DNA, proteins and oligonucleotides, into cells both in vitro and in vivo. This unique ability makes CPPs attractive delivery vehicles with potential applications in clinical therapy. Over the last few decades, a number of machine learning (ML)-based prediction tools have been developed, and some of them are freely available as web portals. However, the predictions produced by different tools are difficult to quantify and compare; in particular, there has been no systematic comparison of the performance of the web-based prediction tools, especially in practical applications. In this work, we provide a comprehensive review of the biological importance of CPPs, CPP databases and existing ML-based methods for CPP prediction. To evaluate current prediction tools, we conducted a comparative study of a total of 12 models from 6 publicly available CPP prediction tools on 2 benchmark validation sets of CPPs and non-CPPs. Our benchmarking results demonstrate that one model from KELM-CPPpred, namely KELM-hybrid-AAC, shows a significant improvement in overall performance over the other 11 prediction models. Moreover, a length-dependency analysis shows that existing prediction tools tend to predict CPPs and non-CPPs of 20–25 residues more accurately than peptides in other length ranges.

Introduction

Cell-penetrating peptides (CPPs) are short peptides of approximately 5–30 amino acid residues in length [1]. One of the most distinctive characteristics of CPPs is their ability to carry a variety of bioactive molecules into cells without specific receptor interaction [2–5]. The cargoes that CPPs carry vary widely in size and type, including small-molecule compounds, dyes, peptides, peptide nucleic acids, proteins, plasmid DNA, liposomes, phage particles and superparamagnetic particles [2, 4]. Fluorescence-based validation experiments have verified the cell-penetrating capability of CPPs [6–8]. Given this unique property, CPPs that improve the cellular uptake of various bioactive molecules are expected to be promising therapeutic candidates. Accordingly, CPP-based delivery strategies have steadily emerged over the past few years, demonstrating great potential in gene delivery and cancer therapy, as well as effective clinical efficacy [2, 9].

The 1st CPP was discovered by Frankel et al. in the 1980s, who demonstrated that the human immunodeficiency virus 1 (HIV-1) transactivating (Tat) protein was able to enter tissue-cultured cells, translocate into the nucleus and transactivate viral gene expression [10]. The α-helical domain of the Tat protein spanning residues 48–60, mainly composed of basic amino acids, was identified as the main determinant of cell internalization and nuclear translocation. Subsequently, the Penetratin peptide, the 3rd helix of the Antennapedia homeodomain, was found to efficiently cross cell membranes through an energy-independent mechanism. These observations laid the foundation for basic research on CPPs. Since then, interest in CPPs has grown steadily over the last 30 years [11, 12], leading to an exponential increase in the number of known CPPs. Currently, there are a total of 1850 experimentally validated entries in CPPsite 2.0, the largest CPP database [13], nearly twice as many as in its previous version (CPPsite) [14]. It should be pointed out that of the entries in CPPsite 2.0, roughly 90% are derived from natural proteins [15], while the remaining entries are synthetic proteins or chimeric peptides [13, 16]. With the rapid development and wide application of next-generation sequencing techniques [17], large numbers of novel protein sequences are being generated rapidly and at low cost [15, 16]. Among these novel, uncharacterized protein sequences, more functional peptides with cell-penetrating activity can be expected [16]. Unfortunately, traditional experimental methods are extremely difficult to apply at this scale, especially given the avalanche of protein sequences, because of their intrinsic limitations: they are expensive, labor intensive and time-consuming [16]. To address these limitations, computational methods have recently emerged as a promising alternative for accurate and efficient CPP prediction.

Over the past few decades, a variety of computational methods, especially machine learning (ML)-based methods, have been developed for CPP prediction [1, 6, 18–21]. ML techniques can extract useful patterns hidden in experimentally validated CPPs and use these patterns to accurately predict whether new, uncharacterized peptides have cell-penetrating activity. Importantly, they require only primary protein sequences as input, without any prior knowledge (e.g. secondary structure), showing great potential for high-throughput prediction on large-scale proteomic data. So far, various ML techniques have been applied to the development of CPP prediction methods, such as the Support Vector Machine (SVM) [7, 22–24], Random Forest (RF) [8, 11, 12, 16, 25, 26], Neural Network (NN) [20, 27–29], Extremely Randomized Tree (ERT) [15, 30–32] and Kernel Extreme Learning Machine (KELM) [33, 34], generating a number of prediction methods. A common phenomenon across existing methods is that each claims, in its own study, to outperform previously published methods. However, the comparisons carried out in different studies are somewhat biased, for three main reasons. Firstly, the comparisons were performed by the developers themselves; when re-implementing the compared tools, the choice of algorithm parameters greatly impacts performance, yet some studies give no algorithmic details at all [1, 6–8, 20, 26, 28], making a fair comparison difficult. Secondly, the performances reported from one study to another are not directly comparable because different methods use different training and validation datasets [12, 15, 23, 34].
Thirdly, performance was usually evaluated by cross-validation, while the arguably more important independent test was seldom performed. Moreover, most existing methods do not highlight the comparison of specificity (SP). The SP of a predictor corresponds to its ability to predict non-CPPs (negatives). SP is of great importance for wet-lab researchers, because a predictor with low SP will produce a large number of false positives when applied to identify functional peptides in large-scale proteomic data, thereby increasing the expense of experimental validation. Consequently, a comparative analysis of existing prediction tools in terms of SP is also needed.
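
The evaluation metrics discussed in this review (SE, SP, ACC and MCC) all follow from the four confusion-matrix counts. A minimal sketch (the function name `evaluate` and the counts are illustrative, not from any of the cited tools):

```python
import math

def evaluate(tp, fn, tn, fp):
    """Compute SE, SP, ACC and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity: fraction of true CPPs recovered
    sp = tn / (tn + fp)                    # specificity: fraction of non-CPPs recovered
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return se, sp, acc, mcc

# e.g. 90 of 100 CPPs but only 70 of 100 non-CPPs predicted correctly:
# high SE, but the lower SP means 30 false positives to validate in the lab
se, sp, acc, mcc = evaluate(tp=90, fn=10, tn=70, fp=30)
```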

In this review, we first summarize existing CPP prediction methods based on different ML algorithms. We then carry out an unbiased evaluation of existing web-based prediction tools using two benchmark validation datasets. Six prediction tools provide available web portals: CellPPD [23], SkipCPP-Pred [12], CPPred-RF [16], KELM-CPPpred [34], MLCPP [15] and CPPred-FL [11]. Since some of them provide more than one prediction model, in this study we tested and compared a total of 12 CPP prediction models from the six web servers. Our comparative results demonstrate that the KELM-hybrid-AAC model from the KELM-CPPpred server significantly outperforms the other competing prediction models in terms of SP, accuracy (ACC) and Matthews correlation coefficient (MCC). More importantly, it achieves a more balanced sensitivity (SE) and SP than the other prediction tools. In particular, its remarkably higher SP indicates that it can be applied to large-scale proteomics while drastically reducing false positives, which will greatly reduce the cost and time of experimentally validating predictions generated by ML models. Finally, we conducted a length-dependency analysis and found that existing prediction tools tend to predict CPPs and non-CPPs of 20–25 residues more accurately than peptides in other length ranges.

Framework of CPP prediction using machine learning methods. (A) The pipeline of an ML-based CPP prediction method. The 1st stage is dataset preparation, forming a training dataset and an independent dataset; the 2nd stage is feature encoding, composed of feature representation and feature optimization; the 3rd stage is training and evaluating a prediction model. An independent test is usually needed to validate the ability of a trained model. Ultimately, for a given query sequence, the developed prediction model predicts whether it is a CPP or not. (B–D) Brief illustrations of the ANN, SVM and RF, respectively.
Figure 1


Materials and methods

Framework of CPP prediction using machine learning

The framework of CPP prediction using ML is illustrated in Figure 1 and involves three main stages. The 1st stage is dataset preparation. Candidate peptide sequences are generally collected from validated databases and the relevant literature [35]. To construct a high-quality prediction model, training sets and independent testing sets are usually needed: training sets are used for model training, while testing sets validate the transferability and reliability of the trained model. The 2nd stage is feature encoding, composed of feature representation and feature optimization [36]. For feature representation, various feature descriptors are used to capture the characteristics of CPPs, including compositional features [e.g. amino acid composition (AAC) and dipeptide composition (DAC)], binary profiles, motif-based features and physicochemical features. To improve the representation ability, the features are often optimized by removing irrelevant features [37]. The last stage is model construction and prediction. The optimal features from the previous stage are used to train ML algorithms (e.g. SVM and RF). Query peptide sequences are encoded with the same feature vectors and then fed into the trained model. Ultimately, the developed prediction model provides a prediction of whether each query is a CPP or not.
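
The three stages above can be sketched end to end with AAC features and an off-the-shelf classifier. The peptides and labels below are toy placeholders (a real study would draw CPPs and non-CPPs from CPPsite 2.0), and the helper `aac` is a hypothetical name:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aac(seq):
    """Stage 2, feature encoding: amino acid composition (20-dim frequency vector)."""
    return np.array([seq.count(a) / len(seq) for a in AA])

# Stage 1: toy dataset (labels are illustrative only, NOT experimentally validated)
peptides = ["GRKKRRQRRRPPQ", "RQIKIWFQNRRMKWKK", "ALWKTLLKKVLKA", "GSGSGSGSGSGS",
            "AAAAPPPPGGGG", "KLALKLALKALKAALKLA", "DEDEDEDEDEDE", "TTTTSSSSNNNN"]
labels   = [1, 1, 1, 0, 0, 1, 0, 0]   # 1 = CPP, 0 = non-CPP

X = np.vstack([aac(p) for p in peptides])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)

# Stage 3: train the model on the training split, then predict the held-out split
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
```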

Cell-penetration peptides database

To date, there are two public CPP databases, namely CPPsite [14] and CPPsite 2.0 [13], the latter being the successor of the former. The details of the two databases are presented in Table 1. CPPsite, the 1st CPP database, was created by Gautam et al. [14] in 2012 and contains 843 entries with information on sequence, subcellular localization, physicochemical properties (PPs), uptake efficiency and more. In 2015, Agrawal et al. [13] released an updated version, CPPsite 2.0, which contains 1850 entries, including information on the model system, cargo, chemical modifications, predicted tertiary structure and so on [38].

Table 1

The detailed information of CPP databases

| Databases | Year | Number of true CPPs | Website | Ref. |
| --- | --- | --- | --- | --- |
| CPPsite | 2012 | 843 | http://webs.iiitd.edu.in/raghava/cppsite/ | [14] |
| CPPsite 2.0 | 2015 | 1850 | http://webs.iiitd.edu.in/raghava/cppsite2.0/ | [13] |
Table 2

Summary of 15 existing machine learning based prediction tools in the literature

| Year | Methods | Feature representation | Feature selection | Predictor name | URL | Ref |
| --- | --- | --- | --- | --- | --- | --- |
| 2005 | Z-descriptors | Bulk properties of the constituent amino acids | N.A. | N.A. | N.A. | [6] |
| 2008 | Partial least squares | Chemical properties | Principal component analysis | N.A. | N.A. | [1] |
| 2010 | ANNs | Biochemical features | N.A. | N.A. | N.A. | [28] |
| 2011 | SMO-based SVMs and the Pearson VII universal kernel | Basic biochemical properties | Scatter search approach | N.A. | N.A. | [7] |
| 2011 | ANNs | N.A. | Principal component analysis | N.A. | N.A. | [20] |
| 2013 | SVMs | Sequence composition, binary profile of patterns and physicochemical properties | N.A. | CellPPD | http://crdd.osdd.net/raghava/cellppd/ | [23] |
| 2013 | N-to-1 NN | Motif information | N.A. | CPPpred | http://bioware.ucd.ie/cpppred | [29] |
| 2015 | RF | PseAAC and five properties of amino acids | mRMR and IFS | N.A. | N.A. | [26] |
| 2016 | SVMs | Dipeptide composition | Analysis of variance | C2Pred | http://lin.uestc.edu.cn/server/C2Pred | [24] |
| 2017 | RF | K-skip-2-gram | N.A. | SkipCPP-Pred | http://server.malab.cn/SkipCPP-Pred/Index.html | [12] |
| 2017 | RF | PC-PseAAC, SC-PseAAC, ASDC and PPs | MRMD and SFS | CPPred-RF | http://server.malab.cn/CPPred-RF/ | [16] |
| 2018 | ERT and RF | AAC, AAI, DPC, PCP and CTD | N.A. | MLCPP | http://www.thegleelab.org/MLCPP/ | [15] |
| 2018 | KELM | AAC, DAC, PseAAC and the motif-based hybrid features | N.A. | KELM-CPPpred | http://sairam.people.iitgn.ac.in/KELM-CPPpred.html | [34] |
| 2018 | RF | Compositional information, position-specific information and physicochemical properties | mRMR and SFS | CPPred-FL | http://server.malab.cn/CPPred-FL/ | [11] |
| 2018 | RF | Sequence length, physicochemical properties and molecular properties | N.A. | N.A. | N.A. | [8] |

Note: Sequential minimal optimization (SMO); Pseudo amino acid composition (PseAAC); Minimum redundancy maximum relevance (mRMR); Incremental feature selection (IFS); Parallel correlation pseudo-amino-acid composition (PC-PseAAC); Series correlation pseudo-amino-acid composition (SC-PseAAC); Adaptive skip dipeptide composition (ASDC); Physicochemical properties (PPs); Maximal Relevance−Maximal Distance (MRMD); Sequential forward search (SFS); Amino acid composition (AAC); Amino acid index (AAI); Dipeptide composition (DPC); Physicochemical properties (PCP); Composition−transition−distribution (CTD); Dipeptide amino acid composition (DAC).


Existing CPP prediction methods

ML algorithms have been widely used to identify CPPs [38]. We summarize existing ML-based CPP prediction methods in Table 2. According to the ML algorithm used, they can be categorized into four classes, described in detail below.

Prediction methods based on Neural Network

Artificial NNs (ANNs) (Figure 1B) are algorithmic models that simulate the brain's synaptic connections to process information and react to the real world [39]. ANNs have two distinctive properties: (1) they can learn from examples and adapt to changes in environmental parameters; and (2) they can generate highly nonlinear decision boundaries in a multidimensional input space [39, 40].
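
The second property, nonlinear decision boundaries, can be illustrated with the classic XOR problem, which no linear classifier can solve but a small network with one hidden layer can. This is a toy sketch using scikit-learn's generic `MLPClassifier`, not any of the CPP-specific networks cited below; whether it recovers XOR perfectly depends on the random initialization:

```python
from sklearn.neural_network import MLPClassifier

# XOR: the two classes are not linearly separable in the 2-d input space
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# one hidden layer of 8 tanh units; lbfgs converges reliably on tiny datasets
net = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=1)
net.fit(X, y)
preds = net.predict(X).tolist()
```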

To date, three ANN-based CPP prediction methods have been developed. Dobchev et al. [28] specified biochemical features for true CPPs and non-CPPs and trained a prediction model using ANN algorithms together with Principal Component Analysis (PCA); PCA was used to select the most informative variables (used as inputs to the network) from the training set. Their model was reported to achieve an accuracy of 80–100% on a validation dataset containing 101 penetrating and non-penetrating peptides. The 2nd ANN method, proposed by Karelson et al. [20], predicts the cell-penetrating capability of compounds or drugs; it combines quantitative structure–activity relationship principles with ANN algorithms, yielding a prediction model with an overall accuracy of 83%. Its limitation is that it requires structural information as input, which is not always available, especially when characterizing the cell-penetrating properties of arbitrary peptides. The 3rd ANN-based method is CPPpred [29], whose prediction model was trained on redundancy-reduced datasets and achieved an accuracy of 82.98% on an independent test. Notably, this was the 1st study to emphasize the importance of stringent training datasets for constructing a robust prediction model.

Prediction methods based on Support Vector Machine

The objective of an SVM (Figure 1C) is to construct a maximum-margin hyperplane that separates the positives from the negatives with the minimal misclassification rate [41–43]. Basically, the SVM maps the input features into a high-dimensional space using a kernel function and finds the hyperplane that maximizes the distance between the hyperplane and the two classes [44, 45]. A test sample, mapped into the same high-dimensional space, is then predicted according to which side of the hyperplane it falls on. Several kernel functions are available, including linear and polynomial functions and the Gaussian radial-basis function. An SVM has two critical parameters: C, which controls the trade-off between training error and margin, and g (gamma), which controls the width of the Gaussian kernels centered on the support vectors. To achieve the best performance, these parameters usually need to be optimized by a grid search.
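
The grid search over C and gamma can be sketched as follows. The synthetic dataset is a stand-in for encoded peptide feature vectors, and the parameter grid is an arbitrary illustrative choice:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# toy stand-in for encoded peptide feature vectors (20 features, 2 classes)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# exhaustive search over C (error/margin trade-off) and gamma (RBF kernel width),
# each candidate scored by 5-fold cross-validation
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "gamma": [0.001, 0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
best_C, best_gamma = grid.best_params_["C"], grid.best_params_["gamma"]
```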

Several SVM-based tools have been proposed for predicting CPPs. Sanders et al. [7] developed an SVM-based approach for identifying potential CPPs, trained on the basic biochemical properties of peptides. The authors used three different benchmark datasets to highlight the importance of balanced datasets for accurate prediction; the accuracy on the balanced dataset reached 91.72%. Gautam et al. [23] proposed an SVM-based predictor called CellPPD and established a public web server for CPP prediction. In CellPPD, different feature representations, such as AAC, DAC, binary profiles, motif features and PPs, were used to train different predictive models. The model based on hybrid features was reported to achieve a maximum accuracy of 97.40%, better than the models based on individual features [34]. Tang et al. [24] developed C2Pred, a predictor based on an optimized DAC feature set, with an overall accuracy of about 83.6%. They also released a web server implementing C2Pred, but as of the writing of this paper the server is out of service.

Prediction methods based on Random Forest

RF (Figure 1D) is a powerful ML algorithm [25] with successful applications in bioinformatics [8, 11, 12, 16, 26, 46]. An RF is an ensemble of decision trees, and its training procedure is briefly as follows. Assuming the training set contains N samples with M features, RF draws N samples by bootstrapping to form a new training dataset and randomly selects m (m ≪ M) features to train a decision tree on it; this procedure is repeated until all the decision trees in the forest are trained. The final prediction is determined by combining the scores of all the decision trees. The number of decision trees and the number of randomly selected features (mtry) are the two main parameters for training accurate RF models.
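
The bootstrap-and-vote procedure above can be sketched directly. This is a simplified illustration on synthetic data: a real RF re-samples the m candidate features at every split of every tree, whereas here each tree gets one fixed feature subset for brevity:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# toy stand-in for N=300 encoded peptides with M=20 features
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)
n, M = X.shape
m = int(np.sqrt(M))                     # m << M features per tree (simplification)

trees = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)                    # bootstrap N samples
    feats = rng.choice(M, size=m, replace=False)        # random feature subset
    t = DecisionTreeClassifier().fit(X[np.ix_(idx, feats)], y[idx])
    trees.append((t, feats))

# ensemble prediction: average the trees' votes and threshold at 0.5
votes = np.mean([t.predict(X[:, f]) for t, f in trees], axis=0)
pred = (votes >= 0.5).astype(int)
```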

The RF algorithm has been widely applied to CPP prediction. Chen et al. [26] developed an RF-based CPP prediction model trained on a series of PPs, including pseudo-AAC (PseAAC) [18], molecular volume, polarity, codon diversity, electrostatic charge and secondary structure [26]. Optimized features were selected by minimum redundancy maximum relevance [47] and incremental feature selection [48]; the overall accuracy of the model is 83.45%. Considering the long-range effect between residues, in a previous study we proposed an adaptive k-skip-2-gram algorithm to extract features and trained a predictor named SkipCPP-Pred with an improved accuracy of 90.6% [12]. In another work, we proposed a two-layer predictor called CPPred-RF, in which the 1st layer discriminates true CPPs from non-CPPs and the 2nd layer predicts whether the uptake efficiency of a CPP is high or low [16]. Its prediction model was trained on integrative features combining four sequence-based descriptors: PC-PseAAC [49], SC-PseAAC [49], adaptive skip DAC (ASDC) and PPs [50–52]. Compared with SkipCPP-Pred, CPPred-RF increased the prediction accuracy (evaluated with 10-fold cross-validation) to 91.6% on the same benchmark dataset. It is worth noting that CPPred-RF is the 1st tool that can predict the uptake efficiency of CPPs. Another work, by Wolfe et al. [8], focuses on the transport of phosphorodiamidate morpholino oligonucleotides by CPPs; peptide molecular weight, sequence length, theoretical net charge and amino acid physicochemical descriptors were used as input features to train an RF model. Recently, Qiang et al. [11] proposed a computational predictor called CPPred-FL. More specifically, CPPred-FL introduces a feature representation learning strategy that learns class and probabilistic information from ML models built with multiple feature descriptors, such as PPs, compositional information and position-specific information. The best overall accuracy of CPPred-FL is 92.1% [11]. Although this is not a significant improvement over the authors' previous study [16], the number of features used to train the predictive models is far smaller; this feature representation strategy opens a new, effective way to extract highly expressive features.
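
The skip-gram idea behind SkipCPP-Pred and ASDC, counting residue pairs separated by up to k positions instead of only adjacent dipeptides, can be sketched as follows. This is a simplified illustration of the concept, not the published algorithms, and the function name is hypothetical:

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AA, repeat=2)]   # all 400 possible dipeptides

def skip_dipeptide_composition(seq, k=3):
    """Count residue pairs (seq[i], seq[i+gap]) for gaps 1..k, normalized to
    frequencies: a 400-dim vector capturing short- and longer-range pairings."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for gap in range(1, k + 1):
        for i in range(len(seq) - gap):
            counts[seq[i] + seq[i + gap]] += 1
            total += 1
    return [counts[p] / total for p in PAIRS]

# Tat(48-60), a classic CPP, as an example input
vec = skip_dipeptide_composition("GRKKRRQRRRPPQ", k=3)
```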

Prediction methods based on other machine learning algorithms

Besides the methods above, some prediction methods are based on other ML algorithms, such as ERT [15, 30] and KELM [33, 34]. In a recent study, Manavalan et al. [15] proposed a two-layer model for predicting CPPs and their uptake efficiency: the 1st-layer model for CPP prediction was trained with the ERT algorithm and reached an accuracy of 89.6%, while the 2nd-layer uptake efficiency model was trained with RF and reached an accuracy of 72.5%. Pandey et al. [34] developed a KELM-based model, KELM-CPPpred, whose prediction models utilize six different feature descriptors: AAC, dipeptide AAC (DAC), PseAAC and three hybrid features (Hybrid-AAC, Hybrid-DAC and Hybrid-PseAAC) [34]. KELM-CPPpred achieved an accuracy of 83.10% on an independent dataset. Moreover, some studies do not clearly describe their use of ML algorithms. For example, Hällbrink et al. [6] proposed a prediction method based on five z-descriptors [53] extracted from physical characteristics of peptide sequences; likewise, Hansen et al. [1] developed a method based on chemical properties to predict CPPs and non-CPPs.
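
Unlike iteratively trained networks, a KELM has a closed-form solution: the output weights are obtained by solving a single regularized kernel system, beta = (K + I/C)^(-1) y. A minimal numpy sketch on synthetic data (the RBF kernel, C and gamma values, and labels are illustrative assumptions, not KELM-CPPpred's settings):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian RBF kernel matrix between row vectors of A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def kelm_fit(X, y, C=10.0, gamma=0.1):
    """Closed-form KELM training: beta = (K + I/C)^-1 y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + np.eye(len(X)) / C, y)

def kelm_predict(X_train, beta, X_new, gamma=0.1):
    """Score new samples; the sign of the score gives the predicted class."""
    return rbf_kernel(X_new, X_train, gamma) @ beta

# toy binary problem with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.sign(X[:, 0])            # class determined by the first feature
beta = kelm_fit(X, y)
scores = kelm_predict(X, beta, X)
```

The design appeal is that training costs one linear solve on the kernel matrix, with only C and the kernel parameter to tune.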

Table 3

Summary of six available web servers for CPP prediction

| Predictors | Year | Classifier | Predicting uptake efficiency | Sequence length limitation | Upload sequence | Multiple input | URL | Ref. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CellPPD | 2013 | SVM | N.A. | 1–50 | No | Yes | http://crdd.osdd.net/raghava/cellppd/ | [23] |
| SkipCPP-Pred | 2017 | RF | N.A. | No less than 10 | No | Yes | http://server.malab.cn/SkipCPP-Pred/Index.html | [12] |
| CPPred-RF | 2017 | RF | Yes | No limitation | No | Yes | http://server.malab.cn/CPPred-RF | [16] |
| MLCPP | 2018 | ERT and RF | Yes | No limitation | Yes | Yes | www.thegleelab.org/MLCPP | [15] |
| KELM-CPPpred | 2018 | KELM | N.A. | 5–30 | No | Yes | http://sairam.people.iitgn.ac.in/KELM-CPPpred.html | [34] |
| CPPred-FL | 2018 | RF | N.A. | No limitation | Yes | Yes | http://server.malab.cn/CPPred-FL | [11] |

Web-accessible prediction tools

As described in the section Existing CPP prediction methods, there are a total of 15 prediction methods, but only 6 of them provide available web servers for CPP prediction: CellPPD, SkipCPP-Pred, CPPred-RF, MLCPP, KELM-CPPpred and CPPred-FL [11, 12, 15, 16, 23, 34]. The basic information on these web servers is summarized in Table 3; they are described in detail below.

  • CellPPD is an in silico method for predicting and designing CPPs. The web server provides users with two prediction models: (1) an SVM-based model and (2) an SVM + Motif-based model [23]. The 1st model was trained with an SVM classifier [14] using the binary N10-C10 descriptor, while the 2nd was trained with a hybrid descriptor combining binary profile patterns and motif features. It should be pointed out that CellPPD is the 1st server to predict CPPs. Additionally, CellPPD is able to identify potential CPPs within protein sequences, although the length of input protein sequences is limited to 500 residues. Furthermore, the web server also allows users to design novel cell-penetrating peptides with certain PPs according to specific needs. The web server is freely available at http://crdd.osdd.net/raghava/cellppd/.

  • SkipCPP-Pred is an RF-based prediction method. The prediction model was trained with the features extracted by an adaptive k-skip-2-gram algorithm [12]. Because it uses sequential features only, SkipCPP-Pred can quickly predict whether input peptides are CPPs or not. Notably, this server does not limit the number of input sequences, although each input peptide must be at least 10 residues long. The web server can be accessed via http://server.malab.cn/SkipCPP-Pred/Index.html.

  • CPPred-RF is a two-layer RF-based predictor that predicts CPPs and their uptake efficiency simultaneously [16]. This is the 1st server to make a breakthrough in predicting the uptake efficiency of CPPs. Similar to other servers, it supports the prediction of multiple sequences. CPPred-RF is publicly available at http://server.malab.cn/CPPred-RF.

  • MLCPP, similar to CPPred-RF, is also a two-layer predictor of CPPs and their uptake efficiency. For given peptide sequences, the 1st-layer model predicts whether each query sequence is a CPP; if an input sequence is predicted as a CPP, the 2nd-layer model predicts its uptake efficiency [15]. The final results include the prediction labels and corresponding probability scores. MLCPP is freely available at www.thegleelab.org/MLCPP.

  • KELM-CPPpred is a KELM-based CPP prediction tool. The web server provides six prediction models based on different features: AAC, DAC, PseAAC, Hybrid-AAC, Hybrid-DAC and Hybrid-PseAAC [34]. Users can select one of the models to make predictions. KELM-CPPpred allows users to type one or multiple query sequences of 5–30 residues in length as inputs. The web server can be accessed via http://sairam.people.iitgn.ac.in/KELM-CPPpred.html.

  • CPPred-FL is a recent predictor for CPP prediction [11]. The server provides two prediction modes, based on class information and on probabilistic information, for CPP identification. Unlike the servers above, this server is designed to identify CPPs within proteins. When using this tool, users should choose a prediction mode and set a confidence threshold and a cutting length. It allows users to submit multiple protein sequences. The output of CPPred-FL contains all peptide sequences predicted to have cell-penetrating activity, together with the corresponding residue positions and prediction confidence. CPPred-FL is publicly available at http://server.malab.cn/CPPred-FL.

Validation datasets

Two benchmark validation datasets were used in this study for a comparative evaluation of existing methods. They were downloaded from the independent datasets of two recent studies: Pandey’s work [34] and Manavalan’s study [15]. For convenience of discussion, they are denoted as kelm and mlcpp, respectively. The kelm dataset includes 96 experimentally validated CPPs as positives and 96 non-CPPs as negatives, whereas the mlcpp dataset consists of 311 true CPPs (positives) and 311 non-CPPs (negatives). However, some servers impose strict length limitations on input sequences (see Table 3 for details). To test all the web-based prediction tools, we removed the sequences that did not meet the predictors’ length requirements. Moreover, to reduce the bias caused by high sequence similarity between training and validation datasets, we first removed the sequences in the validation datasets with significant sequence similarity to sequences in the training datasets using BLASTP (version 2.8.1+) under default settings. Afterwards, we used CD-HIT, a frequently used sequence homology reduction tool in bioinformatics, to further remove validation sequences sharing >30% sequence identity with the training sequences. As a result, only 71 CPPs and 48 non-CPPs from kelm, and 149 CPPs and 193 non-CPPs from mlcpp, were retained. It is worth noting that the positives in both validation datasets were derived from the CPPsite 2.0 database.
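As a concrete illustration of the first filtering step above, the following minimal Python sketch drops validation peptides that a given server would reject; the dictionary of limits simply transcribes Table 3, and the function name is our own:

```python
# Per-server accepted length ranges in residues, transcribed from Table 3.
# (min, max); None means unbounded on that side.
SERVER_LENGTH_LIMITS = {
    "CellPPD": (1, 50),
    "SkipCPP-Pred": (10, None),
    "CPPred-RF": (None, None),
    "MLCPP": (None, None),
    "KELM-CPPpred": (5, 30),
    "CPPred-FL": (None, None),
}

def filter_by_length(peptides, server):
    """Keep only the peptides a given web server will accept."""
    lo, hi = SERVER_LENGTH_LIMITS[server]
    lo = lo or 1
    hi = hi or float("inf")
    return [p for p in peptides if lo <= len(p) <= hi]

peptides = ["GRKKRRQRRRPPQ", "RQIKIWFQNRRMKWKK", "KW", "A" * 35]
print(filter_by_length(peptides, "KELM-CPPpred"))  # drops the 2-mer and the 35-mer
```

Peptides outside a server's range must be discarded before submission, otherwise the per-server confusion-matrix counts would not be comparable.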

Table 4

Performance of 12 CPP prediction models in six web-accessible predictors on kelm dataset

Prediction tools | TP | FP | TN | FN | SE (%) | SP (%) | ACC (%) | MCC
MLCPP | 53 | 5 | 43 | 18 | 74.65 | 89.58 | 80.67 | 0.63
CPPred-RF | 59 | 12 | 36 | 12 | 83.10 | 75.00 | 79.83 | 0.58
KELM-AAC | 49 | 5 | 43 | 22 | 69.01 | 89.58 | 77.31 | 0.58
KELM-hybrid-AAC | 49 | 5 | 43 | 22 | 69.01 | 89.58 | 77.31 | 0.58
CPPred-FL | 56 | 10 | 38 | 15 | 78.87 | 79.17 | 78.99 | 0.57
CellPPD | 45 | 3 | 45 | 26 | 63.38 | 93.75 | 75.63 | 0.57
CellPPD-motif | 45 | 3 | 45 | 26 | 63.38 | 93.75 | 75.63 | 0.57
KELM-PseAAC | 59 | 13 | 35 | 12 | 83.10 | 72.92 | 78.99 | 0.56
KELM-DAC | 40 | 1 | 47 | 31 | 56.34 | 97.92 | 73.11 | 0.56
SkipCPP-Pred | 58 | 13 | 35 | 13 | 81.69 | 72.92 | 78.15 | 0.55
KELM-hybrid-PseAAC | 59 | 14 | 34 | 12 | 83.10 | 70.83 | 78.15 | 0.54
KELM-hybrid-DAC | 49 | 8 | 40 | 22 | 69.01 | 83.33 | 74.79 | 0.51
Table 5

Performance of 12 CPP prediction models in six web-accessible predictors on mlcpp dataset

Prediction tools | TP | FP | TN | FN | SE (%) | SP (%) | ACC (%) | MCC
KELM-hybrid-AAC | 141 | 7 | 186 | 8 | 94.63 | 96.37 | 95.61 | 0.91
KELM-hybrid-DAC | 141 | 52 | 141 | 8 | 94.63 | 73.06 | 82.46 | 0.68
KELM-AAC | 140 | 52 | 141 | 9 | 93.96 | 73.06 | 82.16 | 0.67
KELM-PseAAC | 142 | 57 | 136 | 7 | 95.30 | 70.47 | 81.29 | 0.66
CPPred-FL | 144 | 62 | 131 | 5 | 96.64 | 67.88 | 80.41 | 0.65
MLCPP | 144 | 65 | 128 | 5 | 96.64 | 66.32 | 79.53 | 0.64
CPPred-RF | 146 | 74 | 119 | 3 | 97.99 | 61.66 | 77.49 | 0.62
SkipCPP-Pred | 148 | 81 | 112 | 1 | 99.33 | 58.03 | 76.02 | 0.60
CellPPD | 120 | 47 | 146 | 29 | 80.54 | 75.65 | 77.78 | 0.56
CellPPD-motif | 120 | 47 | 146 | 29 | 80.54 | 75.65 | 77.78 | 0.56
KELM-hybrid-PseAAC | 138 | 84 | 109 | 11 | 92.62 | 56.48 | 72.22 | 0.51
KELM-DAC | 138 | 87 | 106 | 11 | 92.62 | 54.92 | 71.35 | 0.50

Performance measurements

Four evaluation metrics, sensitivity (SE), specificity (SP), accuracy (ACC) and the Matthews correlation coefficient (MCC), have been widely used in several bioinformatic fields [54–67]. Here, we also utilized these four metrics for evaluating the prediction models; they are calculated by the following formulas:

SE = TP / (TP + FN) × 100%
SP = TN / (TN + FP) × 100%
ACC = (TP + TN) / (TP + TN + FP + FN) × 100%
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. SE and SP measure the predictive ability on the positive and negative classes, respectively, while ACC and MCC evaluate the overall performance of a predictive model [68, 69].
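These four metrics can be computed directly from the confusion-matrix counts. The following sketch (a helper of our own, not code from any of the reviewed tools) reproduces the MLCPP row of Table 4 from its TP/FP/TN/FN values:

```python
from math import sqrt

def cpp_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics used throughout this review.
    SE, SP and ACC are returned as percentages; MCC lies in [-1, 1]."""
    se = 100.0 * tp / (tp + fn)                    # sensitivity (recall on CPPs)
    sp = 100.0 * tn / (tn + fp)                    # specificity (recall on non-CPPs)
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)  # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, acc, mcc

# Reproduce the MLCPP row of Table 4 (kelm dataset): TP=53, FP=5, TN=43, FN=18
se, sp, acc, mcc = cpp_metrics(53, 5, 43, 18)
print(f"SE={se:.2f} SP={sp:.2f} ACC={acc:.2f} MCC={mcc:.2f}")
# SE=74.65 SP=89.58 ACC=80.67 MCC=0.63
```

Note the guard on the denominator: if any margin of the confusion matrix is zero, MCC is conventionally reported as 0.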
Figure 2

Performance of 12 CPP prediction models in six different web-accessible predictors on kelm dataset. (A–D) denote the performance in ACC, MCC, SE and SP, respectively. Note that the prediction model with the best performance in each sub-figure is marked in orange.

Results and discussion

Comparative results on benchmark validation datasets

In this work, our aim is to conduct an unbiased performance evaluation of existing prediction tools. To avoid the potential evaluation biases introduced by re-implementing existing predictors, we chose only the prediction tools with available web servers for comparison: CellPPD, SkipCPP-Pred, CPPred-RF, MLCPP, KELM-CPPpred and CPPred-FL [11, 12, 15, 16, 23, 34]. We noticed that some servers, such as CellPPD and KELM-CPPpred, provide more than one prediction model (refer to Web-accessible prediction tools for details). In total, we collected 12 prediction models from the 6 web servers. To conduct a comprehensive comparison, all available prediction models were tested and compared. Moreover, it is instructive to compare the web-accessible prediction tools on independent tests, since this reflects how they would be used in practical applications. Here, the two benchmark validation datasets, kelm and mlcpp, are used for the test. The comparison results on the two benchmarks are presented in Tables 4 and 5, respectively.

On the kelm dataset (see Table 4 and Figure 2), MLCPP achieved the best performance among the 12 tested prediction models, giving the highest ACC of 80.67% and MCC of 0.63. Despite this, other prediction tools, such as CPPred-RF, KELM-AAC and KELM-hybrid-AAC, are indeed competitive with MLCPP in terms of MCC; their MCCs are all 0.58, only slightly worse than that of MLCPP (MCC = 0.63). We place less emphasis on the comparison in terms of ACC, since the kelm dataset is imbalanced, and on an imbalanced dataset MCC is a better measure of the overall performance of a prediction model. However, given that the numbers of positives and negatives in the kelm dataset are relatively small, it is actually hard to determine, within such a small performance gap, which prediction tool is better.

Next, we further compared the performance on the larger mlcpp dataset, which contains more testing samples (149 positives and 193 negatives). The results are shown in Table 5 and Figure 3, from which the following observations can be made. The 1st observation is that KELM-hybrid-AAC, trained with the KELM classifier and hybrid features (motif and AAC), outperforms the competing prediction tools in three of the four metrics: SP, ACC and MCC. More specifically, KELM-hybrid-AAC achieved an SP, ACC and MCC of 96.37%, 95.61% and 0.91, higher than those of the 2nd best KELM-hybrid-DAC by 23.31%, 13.15% and 0.23, respectively. The 2nd observation is that the performance of MLCPP decreased on mlcpp (see Table 5 and Figure 3) compared with kelm (see Table 4 and Figure 2); among the 12 compared prediction models, MLCPP ranks 6th. Furthermore, we observed that the SP of KELM-hybrid-AAC reaches 96.37%. This indicates that using this model would generate fairly few false positives; most random peptides without cell-penetrating ability would be filtered out by KELM-hybrid-AAC. This greatly facilitates the experimental validation of predictions, largely reducing its cost and time. In addition, we also investigated the ability of the prediction tools to identify the positives (true CPPs). For the prediction of true CPPs, SkipCPP-Pred and CPPred-RF are the top two predictors, with SEs of 99.33% and 97.99%, respectively; CPPred-FL and MLCPP achieved the 3rd best SE of 96.64%. This demonstrates that they can identify more CPPs than the other predictors.

Figure 3

Performance of 12 CPP prediction models in 6 different web-accessible predictors on mlcpp dataset. (A–D) denote the performance in ACC, MCC, SE and SP, respectively. Note that the prediction model with the best performance in each sub-figure is marked in orange.

Taken together, the comparison study on the two benchmarks shows that one of the prediction models on the KELM-CPPpred web server, namely KELM-hybrid-AAC, generally outperforms the other 11 state-of-the-art web-accessible prediction models. Importantly, KELM-hybrid-AAC provides a more balanced trade-off between SE and SP than the other prediction tools.

Length-dependency comparison of existing prediction tools

For CPP prediction tools, the identification of peptides with cell-penetrating activity within proteins is the main task in this field. Experimentally validated CPPs are typically 10–50 residues long. It is therefore interesting to see whether existing CPP prediction tools exhibit any length dependency. For this purpose, the samples in the benchmark dataset were divided into four groups according to length: [10, 15], (15, 20], (20, 25] and (25, 30]. Note that the lengths of the samples in the 1st group [10, 15] lie in the range of 10–15 residues; the lengths in the 2nd group (15, 20] lie in the range of 16–20 residues; and so forth. Figure 4 illustrates the prediction results of the 12 prediction models from the six web servers over the different length ranges on the mlcpp dataset. We conducted the length-dependency comparison only on the mlcpp dataset, since it contains more testing samples than the kelm dataset and is therefore more representative. As shown in Figure 4, almost all the prediction models performed better in the length range (20, 25] than in the other ranges. In other words, predicting CPPs and non-CPPs with lengths in (20, 25] is a relatively easy task compared with the prediction in other length ranges. For the prediction of CPPs outside the range (20, 25], no clear trend was observed. This result is quite interesting and might provide some insights for the design of CPPs.
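The binning scheme above can be sketched as follows (a minimal illustration; the function name is our own):

```python
def length_bin(peptide):
    """Assign a peptide to one of the four length bins used in the
    length-dependency analysis: [10, 15], (15, 20], (20, 25], (25, 30]."""
    n = len(peptide)
    if 10 <= n <= 15:
        return "[10, 15]"
    if 15 < n <= 20:
        return "(15, 20]"
    if 20 < n <= 25:
        return "(20, 25]"
    if 25 < n <= 30:
        return "(25, 30]"
    return None  # outside the analysed range

print(length_bin("RQIKIWFQNRRMKWKK"))  # penetratin, 16 residues -> "(15, 20]"
```

The half-open intervals ensure each integer length falls into exactly one bin.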

Figure 4

Performance of 12 CPP prediction models in 6 web-accessible predictors with different length ranges on mlcpp dataset. (A–L) denote the overall ACC of the 12 CPP prediction models in predicting CPPs and non-CPPs within different ranges, respectively. Note that we divided the samples in the mlcpp dataset into four length ranges: [10, 15], (15, 20], (20, 25] and (25, 30]. In each sub-figure, T denotes the ratio of correctly predicted samples among the total samples in one specific range, and F denotes the ratio of incorrectly predicted samples among the total samples in one specific range.

Usage comparison of cell-penetrating peptides web servers

In this section, we investigated whether these servers are as user-friendly as claimed in their own studies, since the usability of web servers is quite important as well. We found several limitations when conducting the server tests. The 1st limitation is the restriction on input sequence length for some servers. For example, CellPPD allows users to input sequences of 1–50 residues, and KELM-CPPpred of 5–30 residues, while SkipCPP-Pred requires input sequences of >10 residues. Secondly, some servers do not support multiple sequences or uploading sequence files for batch processing. MLCPP and CPPred-FL offer a more convenient function for users to upload their data as files in a specified format (i.e. FASTA files). The CellPPD server has a file-upload function, but it does not work well. Thirdly, we found that errors happened frequently when testing the CellPPD server: the prediction results are overwritten by those of the 1st query sequence, resulting in invalid prediction results. We therefore had to conduct the independent test one sequence at a time to ensure correct prediction results, which is quite inconvenient for large-scale predictions. Fourthly, in terms of running time, SkipCPP-Pred is the fastest of these web servers, whether there are hundreds or thousands of query sequences. Although KELM-CPPpred claims to handle multiple sequences in a single run, its processing speed is quite slow, and the server usually returns timeout errors when more than 50 peptides are submitted as inputs. Finally, CPPred-RF and MLCPP are the only two servers that can predict the uptake efficiency of CPPs. In particular, CellPPD has the function of designing efficient CPPs and allows users to specify CPP motifs in long query protein sequences.

Discussion

In this study, our aim is to conduct an empirical comparison and analysis of 12 prediction models from 6 state-of-the-art CPP prediction tools that are accessible as web servers. We evaluated the prediction models on two benchmark validation datasets in an unbiased way. Benchmarking results demonstrate that, among the 12 prediction models, KELM-hybrid-AAC from the KELM-CPPpred web server provides the best overall performance. This might be due to the use of hybrid features, which integrate the motif-based descriptor with the compositional descriptor AAC. We further analyzed which motif features were used, and found that the authors specified the most frequent amino acid motifs in their dataset, including RRRRRR, RRA, GRRX (where X = R, W, T), RRGRX (X = R, G, T) and KKRK. The results demonstrate that fusing amino acid composition with such sequential motifs can sufficiently capture the intrinsic characteristics of CPPs and non-CPPs. Another interesting finding is that the KELM-CPPpred server recommends a different model, namely KELM-hybrid-PseAAC, based on the results in the original study [34]; in our test, however, the recommended model is actually almost the worst of the six prediction models on the KELM-CPPpred server. Generally speaking, it can be concluded that the KELM-hybrid-AAC model is the better choice for making predictions in terms of performance.
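As an illustration of how such motif indicators could be encoded (a hypothetical sketch, not the KELM-CPPpred authors' implementation; we render the X wildcards listed above as regular-expression character classes):

```python
import re

# Frequent CPP motifs discussed above; X wildcards become character classes.
CPP_MOTIFS = ["RRRRRR", "RRA", "GRR[RWT]", "RRGR[RGT]", "KKRK"]

def motif_features(peptide):
    """Return one binary feature per motif: 1 if the motif occurs anywhere
    in the peptide, 0 otherwise."""
    return [1 if re.search(m, peptide) else 0 for m in CPP_MOTIFS]

print(motif_features("AAGRRWAA"))  # -> [0, 0, 1, 0, 0] (only GRRW matches)
```

A hybrid descriptor would then concatenate such a binary motif vector with the 20-dimensional AAC vector.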

To further improve the performance of CPP prediction, there are still many aspects that can be explored. The 1st and most important aspect is feature representation. As we can see, most prediction tools extract features from primary sequences, including amino acid PPs, AAC, DAC and ASDC, etc. The following question is, why not use structural information? The answer is that peptides are very short, usually 10–50 residues long, and such short peptides cannot form stable secondary structures, making it extremely difficult to find structural characteristics that discriminate CPPs from non-CPPs. Basically, only sequential information can be exploited. Therefore, how to effectively use different types of sequential information is an open question. Recently, Wei et al. [68] proposed a feature representation learning strategy to automatically learn the most informative features in a supervised way. This work might give researchers a hint to explore other strategies for more effective sequence-based feature representations. Additionally, feature selection, which facilitates the discovery of the most predictive features, can be another route to improved performance. Moreover, recent studies have shown that more powerful classifiers are a complementary way to improve performance [15, 34, 70, 71].
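As an illustration of the simplest of the sequence descriptors mentioned above, AAC reduces a peptide to its 20-dimensional vector of residue frequencies (a minimal sketch; the function name is our own):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(peptide):
    """Amino acid composition (AAC): the fraction of each of the 20 standard
    residues in the peptide, in a fixed alphabetical order."""
    n = len(peptide)
    return [peptide.count(a) / n for a in AMINO_ACIDS]

features = aac("GRKKRRQRRRPPQ")  # HIV-1 TAT basic domain
print(features[AMINO_ACIDS.index("R")])  # arginine fraction: 6/13
```

DAC and PseAAC extend this idea to residue pairs and to composition augmented with sequence-order information, respectively.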

Although much progress has been made in the development of CPP prediction tools, some limitations and challenges remain to be addressed. Firstly, the main challenge that all ML-based prediction tools face is the selection of high-quality samples to adequately represent the positives (true CPPs) and negatives (non-CPPs). Currently, the number of positive samples is still limited. Although public databases such as CPPsite 2.0 collect almost 2000 experimentally validated CPPs, only 400–500 remain in benchmark datasets after the removal of high-identity sequences. For the selection of negative samples, random peptides from proteins not annotated as CPPs are frequently used. An alternative way to generate negative controls is to shuffle the sequence content of existing CPPs (e.g. scrambled CPPs). However, it cannot be guaranteed that such random peptides are not true CPPs, although the probability of a random sequence being a true CPP is very low. Secondly, most current predictors focus on distinguishing true CPPs from non-CPPs, whereas few studies (only CPPred-RF and MLCPP) address uptake efficiency prediction, which is equally important, since the uptake efficiency of CPPs is closely associated with their practical applications in efficient drug delivery. One possible reason is that there is not enough experimental data to predict the efficiency of CPPs. Moreover, for the only two tools offering efficiency prediction, a simple binary prediction of internalization efficiency may soon be outdated.
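The scrambled-CPP strategy described above can be sketched as follows (a minimal illustration; the function name and the explicit seeding are our own):

```python
import random

def scramble(peptide, seed=None):
    """Generate a scrambled negative control: same length and amino acid
    composition as the input CPP, but with the residue order randomised."""
    rng = random.Random(seed)
    residues = list(peptide)
    rng.shuffle(residues)
    return "".join(residues)

cpp = "RQIKIWFQNRRMKWKK"  # penetratin
neg = scramble(cpp, seed=0)
print(neg, sorted(neg) == sorted(cpp))  # same composition, different order
```

Because composition is preserved, such negatives force a classifier to learn order-dependent signals rather than composition alone; the caveat in the text still applies, since a scrambled peptide is not guaranteed to lack cell-penetrating activity.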

Key Points

  • We comprehensively review a variety of existing ML-based methods for the prediction of CPPs.

  • We conduct a comparative study and analyze available web servers for predicting CPPs.

  • Benchmarking results demonstrate that, among the 12 prediction models from 6 available CPP prediction servers, the KELM-hybrid-AAC model provides the best performance.

  • Our analysis demonstrates that existing prediction tools tend to predict CPPs and non-CPPs of 20–25 residues in length more accurately than peptides in other length ranges.

Funding

National Natural Science Foundation of China (Nos. 61701340, 61702361 and 61771331), the Natural Science Foundation of Tianjin City (Nos. 18JCQNJC00500 and 18JCQNJC00800), the National Key R&D Program of China (SQ2018YFC090002) and the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education, Science, and Technology (2018R1D1A1B07049572).

Ran Su is currently an associate professor in the School of Computer Software, College of Intelligence and Computing, at Tianjin University, China. Her research interests include pattern recognition, machine learning and bioinformatics.

Jie Hu received her BSc degree in Resource Environment and Urban Planning Management from Wuhan University of Science and Technology, China. She is currently a graduate student in the School of Computer Science and Technology, College of Intelligence and Computing, at Tianjin University, China. Her research interests are bioinformatics and machine learning.

Quan Zou is a professor of University of Electronic Science and Technology of China. He received his PhD in Computer Science from Harbin Institute of Technology, P.R. China in 2009. His research is in the areas of bioinformatics, machine learning and parallel computing, with focus on genome assembly, annotation and functional analysis from the next generation sequencing data with parallel computing methods.

Balachandran Manavalan received his PhD degree in 2011 from Ajou University, South Korea. Currently, he is working as a research professor in the Department of Physiology, Ajou University School of Medicine, Suwon, Korea. He is also an associate member of Korea Institute for Advanced Study (KIAS), Seoul, Korea. His main research interests include protein structure prediction, machine learning, data mining, computational biology, and functional genomics.

Leyi Wei received his PhD in Computer Science from Xiamen University, China. He is currently an Assistant Professor in the School of Computer Science and Technology, College of Intelligence and Computing, at Tianjin University, China. His research interests include machine learning and its applications to bioinformatics.

References

1.

Hansen
M
,
Kilk
K
,
Langel
Ü
.
Predicting cell-penetrating peptides
.
Adv Drug Deliv Rev
2008
;
60
:
572
579
.

2.

Kilk
K
.
Cell-penetrating peptides and bioactive cargoes: strategies and mechanisms.
Doctoral Thesis,
Institutionen för Neurokemi
,
2004
.

3.

Madani
F
,
Lindberg
S
,
Langel
Ü
, et al. 
Mechanisms of cellular uptake of cell-penetrating peptides
,
J Biophys
2011
;
2011
; DOI: .

4.

Milletti
F
.
Cell-penetrating peptides: classes, origin, and current landscape
.
Drug Disco Today
2012
;
17
:
850
860
.

5.

Raucher
D
,
Ryu
JS
.
Cell-penetrating peptides: strategies for anticancer treatment
.
Trends Mol Med
2015
;
21
:
560
570
.

6.

Hällbrink
M
,
Kilk
K
,
Elmquist
A
, et al. 
Prediction of cell-penetrating peptides
.
Int J Pept Res Ther
2005
;
11
:
249
259
.

7.

Sanders
WS
,
Johnston
CI
,
Bridges
SM
, et al. 
Prediction of cell penetrating peptides by support vector machines
.
PLoS Comput Biol
2011
;
7
:
e1002101
.

8.

Wolfe
JM
,
Fadzen
CM
,
Choo
Z-N
, et al. 
Machine learning to predict cell-penetrating peptides for antisense delivery
.
ACS Cent Sci
2018
;
4
:
512
520
.

9.

Heitz
F
,
Morris
MC
,
Divita
G
.
Twenty years of cell-penetrating peptides: from molecular mechanisms to therapeutics
.
Br J Pharmacol
2009
;
157
:
195
206
.

10.

Frankel
AD
,
Pabo
CO
.
Cellular uptake of the tat protein from human immunodeficiency virus
.
Cell
1988
;
55
:
1189
1193
.

11.

Qiang
X
,
Zhou
C
,
Ye
X
, et al. 
CPPred-FL: a sequence-based predictor for large-scale identification of cell-penetrating peptides by feature representation learning. A predictor for CPP identification
.
Briefings in Bioinformatics
2018
. DOI: .

12.

Wei
L
,
Tang
J
,
Zou
Q
.
SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides
.
BMC Genomics
2017
;
18
:
1
.

13.

Agrawal
P
,
Bhalla
S
,
Usmani
SS
, et al. 
CPPsite 2.0: a repository of experimentally validated cell-penetrating peptides
.
Nucleic Acids Res
2015
;
44
:
D1098
D1103
.

14.

Gautam
A
,
Singh
H
,
Tyagi
A
, et al. 
CPPsite: a curated database of cell penetrating peptides
.
Database
2012
;
2012
. DOI: .

15.

Manavalan
B
,
Subramaniyam
S
,
Shin
TH
, et al. 
Machine-learning-based prediction of cell-penetrating peptides and their uptake efficiency with improved accuracy
.
J Proteome Res
2018
;
17
:
2715
2716
.

16.

Wei
L
,
Xing
P
,
Su
R
, et al. 
CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency
.
J Proteome Res
2017
;
16
:
2044
2053
.

17.

Ansorge
WJ
.
Next-generation DNA sequencing techniques
.
N Biotechnol
2009
;
25
:
195
203
.

18.

Chou
KC
.
Prediction of protein cellular attributes using pseudo-amino acid composition
.
Proteins
2001
;
43
:
246
255
.

19.

Diener
C
,
Martínez
GGR
,
Blas
DM
, et al. 
Effective design of multifunctional peptides by combining compatible functions
.
PLoS Comput Biol
2016
;
12
:
e1004786
.

20.

Karelson
M
,
Dobchev
D
.
Using artificial neural networks to predict cell-penetrating compounds
.
Expert Opin Drug Discov
2011
;
6
:
783
796
.

21.

Wei
H
,
Yang
W
,
Tang
H
, et al. 
The development of machine learning methods in cell-penetrating peptides identification: a brief review
.
Curr Drug Metab
2018
. DOI: .

22.

Cortes
C
,
Vapnik
V
.
Support vector machine
.
Mach Learn
1995
;
20
:
273
297
.

23.

Gautam
A
,
Chaudhary
K
,
Kumar
R
, et al. 
In silico approaches for designing highly effective cell penetrating peptides
.
J Transl Med
2013
;
11
:
74
.

24.

Tang
H
,
Su
Z-D
,
Wei
H-H
, et al. 
Prediction of cell-penetrating peptides with feature selection techniques
.
Biochem Biophys Res Commun
2016
;
477
:
150
154
.

25.

Breiman
L
.
Random forests
.
Mach Learn
2001
;
45
:
5
32
.

26.

Chen
L
,
Chu
C
,
Huang
T
, et al. 
Prediction and analysis of cell-penetrating peptides using pseudo-amino acid composition and random forest models
.
Amino Acids
2015
;
47
:
1485
1493
.

27.

Specht
DF
.
A general regression neural network
.
IEEE Trans Neural Netw
1991
;
2
:
568
576
.

28.

A Dobchev
D
,
Mager
I
,
Tulp
I
, et al. 
Prediction of cell-penetrating peptides using artificial neural networks
.
Curr Comput Aided Drug Des
2010
;
6
:
79
89
.

29. Holton TA, Pollastri G, Shields DC, et al. CPPpred: prediction of cell penetrating peptides. Bioinformatics 2013;29:3094–3096.

30. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn 2006;63:3–42.

31. Basith S, Manavalan B, Shin TH, et al. iGHBP: computational identification of growth hormone binding proteins from sequences using extremely randomised tree. Comput Struct Biotechnol J 2018;16:412–420.

32. Manavalan B, Basith S, Shin TH, et al. MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget 2017;8:77121.

33. Huang G-B, Zhu Q-Y, Siew C-K. Extreme learning machine: theory and applications. Neurocomputing 2006;70:489–501.

34. Pandey P, Patel V, George NV, et al. KELM-CPPpred: kernel extreme learning machine based prediction model for cell-penetrating peptides. J Proteome Res 2018;17:3214–3222.

35. Stegmayer G, Di Persia LE, Rubiolo M, et al. Predicting novel microRNA: a comprehensive comparison of machine learning approaches. Brief Bioinform 2018. DOI: .

36. Liu B, Jiang S, Zou Q. HITS-PR-HHblits: protein remote homology detection by combining PageRank and hyperlink-induced topic search. Brief Bioinform 2018. DOI: .

37. Zou Q, Zeng J, Cao L, et al. A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing 2016;173:346–354.

38. Usmani SS, Kumar R, Bhalla S, et al. In silico tools and databases for designing peptide-based vaccine and drugs. Adv Protein Chem Struct Biol 2018;112:221–263.

39. Dreiseitl S, Ohno-Machado L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 2002;35:352–359.

40. Liu B, Li S. ProtDet-CCH: protein remote homology detection by combining long short-term memory and ranking methods. IEEE/ACM Trans Comput Biol Bioinform 2018. DOI: .

41. Noble WS. What is a support vector machine? Nat Biotechnol 2006;24:1565.

42. Liu B, Wu H, Wang X, et al. Pse-Analysis: a python package for DNA, RNA and protein peptide sequence analysis based on pseudo components and kernel methods. Oncotarget 2017;8:13338–13343.

43. Manavalan B, Lee J. SVMQA: support-vector-machine-based protein single-model quality assessment. Bioinformatics 2017;33:2496–2503.

44. Manavalan B, Shin TH, Lee G. PVP-SVM: sequence-based prediction of phage virion proteins using a support vector machine. Front Microbiol 2018;9:476.

45. Manavalan B, Shin TH, Lee G. DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest. Oncotarget 2018;9:1944.

46. Manavalan B, Lee J, Lee J. Random forest-based protein model quality assessment (RFMQA) using structural features and potential energy terms. PLoS One 2014;9:e106542.

47. Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 2005;27:1226–1238.

48. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell 1997;97:273–324.

49. Liu B, Liu F, Wang X, et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res 2015;43:W65–W71.

50. Lin C, Zou Y, Qin J, et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLoS One 2013;8:e56499.

51. Zou Q, Wang Z, Guan X, et al. An approach for identifying cytokines based on a novel ensemble classifier. BioMed Res Int 2013;2013. DOI: .

52. Liu B. BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches. Brief Bioinform 2018. DOI: .

53. Sandberg M, Eriksson L, Jonsson J, et al. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem 1998;41:2481–2491.

54. Zou Q, Xing P, Wei L, et al. Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA. RNA 2018. DOI: .

55. Zhang Z, Zhao Y, Liao X, et al. Deep learning in omics: a survey and guideline. Brief Funct Genomics 2018. DOI: .

56. Wei L, Su R, Wang B, et al. Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing 2019;324:3–9.

57. Long HX, Wang M, Fu HY. Deep convolutional neural networks for predicting hydroxyproline in proteins. Curr Bioinform 2017;12:233–238.

58. Yu L, Sun X, Tian SW, et al. Drug and nondrug classification based on deep learning with various feature selection strategies. Curr Bioinform 2018;13:253–259.

59. Wei L, Ding Y, Su R, et al. Prediction of human protein subcellular localization using deep learning. J Parallel Distrib Comput 2018;117:212–217.

60. Feng C-Q, Zhang Z-Y, Zhu X-J, et al. iTerm-PseKNC: a sequence-based tool for predicting bacterial transcriptional terminators. Bioinformatics 2018. DOI: .

61. Tang H, Zhao Y-W, Zou P, et al. HBPred: a tool to identify growth hormone-binding proteins. Int J Biol Sci 2018;14:957–964.

62. He Z, Yu W. Stable feature selection for biomarker discovery. Comput Biol Chem 2010;34:215–225.

63. Cabarle FGC, Adorna HN, Jiang M, et al. Spiking neural P systems with scheduled synapses. IEEE Trans Nanobioscience 2017;16:792–801.

64. Song T, Rodríguez-Patón A, Zheng P, et al. Spiking neural P systems with colored spikes. IEEE Trans Cogn Dev Syst 2017. DOI: .

65. Song T, Zeng X, Zheng P, et al. A parallel workflow pattern modelling using spiking neural P systems with colored spikes. IEEE Trans Nanobioscience 2018;17:474–484.

66. Yang H, Lv H, Ding H, et al. iRNA-2OM: a sequence-based predictor for identifying 2′-O-methylation sites in Homo sapiens. J Comput Biol 2018;25:1266–1277.

67. Zhu X-J, Feng C-Q, Lai H-Y, et al. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowledge-Based Syst 2018. DOI: .

68. Wei L, Zhou C, Chen H, et al. ACPred-FL: a sequence-based predictor based on effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 2018. DOI: .

69. Liu B, Weng F, Huang D-S, et al. iRO-3wPseKNC: identify DNA replication origins by three-window-based PseKNC. Bioinformatics 2018. DOI: .

70. Zou Q, Wan S, Ju Y, et al. Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy. BMC Syst Biol 2016;10:114.

71. Zeng XX, Liu L, Lu LY, et al. Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics 2018;34:2425–2432.
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)