Abstract

Motivation: The number of protein families has been estimated to be as small as 1000. A recent study shows that the growth in the discovery of novel structures deposited into the PDB, and the related rate of increase in the number of SCOP categories, are slowing down. This indicates that the protein structure space will soon be covered, and thus we may be able to derive most of the remaining structures from the known folding patterns. Current tertiary structure prediction methods perform well when a homologous template is available, but give poorer results when no homologous templates can be found. At the same time, some proteins that share only twilight-zone sequence identity form similar folds. Therefore, determining structural similarity in the absence of sequence similarity would benefit the prediction of tertiary structures.

Results: The proposed PFRES method for automated protein fold classification from low-identity (<35%) sequences obtains 66.4% and 68.4% accuracy on two test sets, respectively, which is 6.3–12.4% higher than the accuracy of the existing methods. The prediction accuracy of PFRES is shown to be statistically significantly better than that of the competing methods. Our method adopts a carefully designed, ensemble-based classifier and a novel, compact, custom-designed feature representation that includes nearly 90% fewer features than the representation of the most accurate competing method (36 versus 283). The proposed representation combines evolutionary information, via the PSI-BLAST profile-based composition vector, with information extracted from the secondary structure predicted with PSI-PRED.

Availability: The method is freely available from the authors upon request.

Contact:  lkurgan@ece.ualberta.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Protein structures are being solved to answer key biological questions related to protein function, regulation and interactions. Outside of their biological context, the solved structures are increasingly useful for structure modeling/prediction for unsolved protein sequences that have a closely related (similar) sequence with known structure (Tress et al., 2005; Wang et al., 2005). Based on Chothia's estimate, which states that the number of different protein families is finite and perhaps as small as 1000 (Chothia, 1992), it seems feasible to derive most of the unsolved structures by homology modeling from a relatively small portion of the experimentally determined protein structures. This explains why novel structures are especially valuable, and it also served as the basis of the Protein Structure Initiative launched by the NIH in 1999 (Chandonia and Brenner, 2006); one of the aims of this project is to cover the protein structure space. These early findings are supported by a recent computational analysis of the Protein Data Bank, which showed that the growth of structural data has slowed down and that the rate of increase of the related SCOP categories (including the number of families, superfamilies and folds) is also slowing down (Levitt, 2007). Homology modeling is based on the assumption that homologous sequences share similar folding patterns (Ruan et al., 2006; Zhang and Skolnick, 2005). At the same time, sequences with low sequence identity can also share similar folding patterns (Paiardini et al., 2004) and can be used to predict tertiary structure (Bujnicki, 2006). Sequence alignment software is an important tool for finding homologous sequences among the known structures (Altschul et al., 1997; Yu et al., 2006), but it is of little use when no homologous sequences are available. Research also shows that finding similar folding patterns among low-identity sequences is beneficial for reconstruction of the tertiary structure (Reinhardt and Eisenberg, 2004; Tomii et al., 2005).

A comprehensive and detailed description of the structural relationships between all solved proteins is provided in the SCOP (Structural Classification of Proteins) database (Andreeva et al., 2004; Murzin et al., 1995). This database implements a hierarchy of relations between known protein and protein domain structures. The first level of the hierarchy is commonly known as the protein structural class, while the second level classifies proteins into folds, which are the classification target in this article. Several machine-learning methods have been applied to detect structurally similar proteins (protein folds) from sequences that share low identity; the prediction of the fold type for a given sequence is performed with an intermediate step that converts the sequence into a feature space representation. Ding and Dubchak investigated support vector machines (SVMs) and neural networks for protein fold classification (Ding and Dubchak, 2001). Shen and Chou studied ensemble models based on the nearest neighbor classifier (Shen and Chou, 2006). Several other ensemble models that applied the same feature space representation as the one proposed by Ding and Dubchak were also proposed (Bologna and Appel, 2002; Nanni, 2006; Okun, 2004). In these studies protein sequences were represented by the composition vector (CV), predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability and pseudo-amino acid composition. The fold classification success rate ranged between 56% and 62%. The dimensionality of the feature space was relatively high, i.e. 125 features were proposed by Ding and Dubchak and 283 features by Shen and Chou, when compared with the size of the dataset used in the experimental evaluation, i.e. 313 training and 385 test proteins. To this end, we propose a novel fold classification method, called PFRES, that provides significantly better prediction accuracy than the existing methods and that uses a small number of new and more effective features. The main source of the achieved improvement is the application of the PSI-BLAST profile-based composition vector (Altschul et al., 1997), which incorporates evolutionary information (Jones, 1999, 2007; Kim and Park, 2004), instead of the composition and pseudo-composition vectors used in the prior works. We also applied features generated from the secondary structure predicted with PSI-PRED (Jones, 1999), which are also shown to be beneficial in the context of fold classification. Finally, we note that PFRES, as well as all other relevant competing methods, addresses a simplified fold classification problem, i.e. it predicts 27 folds, due to the low counts of proteins that belong to the remaining folds.

2 MATERIALS AND METHODS

2.1 Datasets

The proposed method was designed on a training dataset with 313 domains proposed by Ding and Dubchak (Ding and Dubchak, 2001). Tests were performed on two datasets: test set 1, with 385 domains, was also taken from Ding and Dubchak (2001) and was used for comparison with the existing methods; test set 2, with 908 domains, was included to provide a larger-scale evaluation on more recently deposited domains and to ensure that the proposed method does not overfit the first, small test set.

The follow-up study by Shen and Chou excluded two training domains (2SCMC and 2GPS) and two domains from test set 1 (2YHX_1 and 2YHX_2) due to the lack of sequence records (Shen and Chou, 2006). We follow Shen and Chou's study and adopt the two datasets without these four sequences. The sequence identity for any pair of sequences in the training set is <35%. According to the dataset authors, the sequences in test set 1 share no more than 35% sequence identity with the sequences in the training set. Nevertheless, we found seven duplicates between these two sets, i.e. 1APLC, 3RUB2, 2REB1, 1DSBA2, 1GLCG1, 1GLCG2 and 1SLTA from the training set correspond to 1YRNB, 3RUBL2, 2REB_1, 1DSBA2, 1GLAG1, 1GLAG2 and 1SLAA from test set 1, respectively. We also found another 12 pairs that share over 50% identity. This redundancy may result in overestimated test results on test set 1, but it should not impact the ability to compare the relative differences between prediction accuracies achieved by various methods on this test set. The training and test set 1 sequences belong to the following 27 folds: (1) globin-like, (3) cytochrome c, (4) DNA-binding 3-helical bundle, (7) 4-helical up-and-down bundle, (9) 4-helical cytokines, (11) EF-hand, (20) immunoglobulin-like, (23) cupredoxins, (26) viral coat and capsid proteins, (30) conA-like lectin/glucanases, (31) SH3-like barrel, (32) OB-fold, (33) beta-trefoil, (35) trypsin-like serine proteases, (39) lipocalins, (46) (TIM)-barrel, (47) FAD (also NAD)-binding motif, (48) flavodoxin-like, (51) NAD(P)-binding Rossmann-fold, (54) P-loop, (57) thioredoxin-like, (59) ribonuclease H-like motif, (62) hydrolases, (69) periplasmic binding protein-like, (72) β-grasp, (87) ferredoxin-like and (110) small inhibitors, toxins and lectins. These 27 folds are the most populated in SCOP; each contains at least seven proteins. Based on the concept of the protein structural class proposed by Levitt and Chothia (Levitt and Chothia, 1976), folds 1–11 belong to the all-α structural class, folds 20–39 to the all-β class, folds 46–69 to the α/β class and folds 72–87 to the α + β class. The fold distribution can be found in Ding and Dubchak (2001) and the two datasets can be downloaded from the Supplementary Material in Shen and Chou (2006).

Test set 2 includes sequences that belong to the same 27 folds and that were deposited in the PDB between 2002 and 2004. The selected timeframe is the result of two factors: the newest version of SCOP assigns folds only to sequences deposited by January 2005, while the training set and test set 1 were generated before 2001, and we aimed to avoid overlap between these datasets. The sequences in test set 2 were filtered with CD-HIT (Li and Godzik, 2006) at 40% sequence identity. Next, the remaining sequences were aligned against the sequences in both the training set and test set 1 using the Smith–Waterman algorithm (Smith and Waterman, 1981). Only sequences that have <35% sequence identity with any sequence in these two sets were selected to form test set 2, as sketched below. The resulting 908 sequences are available from the authors upon request.
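The filtering pipeline can be summarized with the following sketch (illustrative only; the file names and the alignment routine are hypothetical stand-ins, while the CD-HIT options follow its documented interface):

```python
import subprocess

# Step 1: remove redundancy within the candidate set with CD-HIT.
# -c 0.4 sets the identity threshold; -n 2 is the word size CD-HIT
# requires for thresholds in the 0.4-0.5 range.
subprocess.run(["cd-hit", "-i", "candidates.fasta", "-o", "candidates_nr.fasta",
                "-c", "0.4", "-n", "2"], check=True)

def max_identity(seq, reference_seqs, align):
    """Highest pairwise identity between seq and any reference sequence,
    where align(a, b) is a Smith-Waterman routine returning fractional
    identity (e.g. from a library such as parasail, or EMBOSS water)."""
    return max(align(seq, ref) for ref in reference_seqs)

# Step 2: keep only sequences below 35% identity to the training set
# and test set 1 (candidates_nr, training_seqs and test1_seqs are
# assumed to be lists of sequences parsed from the FASTA files).
# test_set2 = [s for s in candidates_nr
#              if max_identity(s, training_seqs + test1_seqs, align) < 0.35]
```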

2.2 Feature space representation

2.2.1 PSI-BLAST profile-based composition vector

The composition vector (CV) is computed directly from the amino acid (AA) sequence (Chen et al., 2007; Chou, 2005). Given that the 20 AAs, ordered alphabetically (A, C, … , W, Y), are represented as AA1, AA2, … , AA19 and AA20, and that the number of occurrences of AAi in the entire sequence is denoted as ni, the composition vector is defined as

$$CV_i = \frac{n_i}{L}, \quad i = 1, 2, \ldots, 20$$

where L is the length of the sequence. This representation was used by the majority of the existing fold classification methods (Bologna and Appel, 2002; Ding and Dubchak, 2001; Nanni, 2006; Okun, 2004).
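As an illustration, a minimal sketch of the CV computation (the function name and the toy sequence are ours):

```python
from collections import Counter

# Composition vector: fraction of each of the 20 amino acids in the sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical one-letter codes

def composition_vector(sequence: str) -> list[float]:
    counts = Counter(sequence)
    L = len(sequence)
    return [counts[aa] / L for aa in AMINO_ACIDS]

# Example with a toy 6-residue sequence: A and C each occur twice (2/6),
# D and Y once (1/6), all other entries are 0.
print(composition_vector("ACCADY"))
```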

The new representation, which combines the PSI-BLAST profile with the concept of the composition vector, was developed for the proposed prediction method. Prior successful applications of the PSI-BLAST profile illustrate that evolutionary information is more informative than the query sequence itself (Jones, 1999, 2007; Kim and Park, 2004). PSI-BLAST aligns a given query sequence to a database of sequences. Using the resulting multiple sequence alignment, PSI-BLAST counts the frequency of each AA at each position of the query sequence and generates a 20-dimensional vector of AA frequencies for each position. The generated PSI-BLAST profile can be used to identify key positions of conserved AAs and positions that undergo mutations. Our approach combines the composition vector of the entire sequence with the PSI-BLAST profile into the so-called PSI-BLAST profile-based composition vector (PCV). The PSI-BLAST profile is an L × 20 matrix, denoted as [ai,j], where i = 1, 2, … , L denotes the position in the query sequence and j = 1, 2, … , 20 denotes a given AA. After applying the substitution matrix and the log function, the ai,j values range between −9 and 11. The proposed representation is related to the calculation of the composition vector based on binary coding, which uses a 20-dimensional vector to encode each AA: AAi is encoded as (0, 0, … , 0, 1, 0, … , 0, 0), where only the ith value equals 1. The binary coding matrix is denoted as [bi,j]. The binary coding and PSI-BLAST profile matrices have the same dimensionality (L × 20).

CV can be computed from the binary coding matrix in a straightforward way. For a given protein sequence A1A2 … AL,

$$CV_i = \frac{1}{L}\sum_{k=1}^{L} b_{k,i}, \quad i = 1, 2, \ldots, 20$$

where {CVi, i = 1, 2, … , 20} is the 20-dimensional composition vector. PCV is calculated in a similar way; the only difference is that the binary coding matrix [bi,j] is replaced by the PSI-BLAST profile [ai,j]. Therefore, PCV is defined as

$$PCV_i = \frac{1}{L}\sum_{k=1}^{L} a_{k,i}, \quad i = 1, 2, \ldots, 20$$

Since PSI-BLAST profile values can be negative, while the AA frequencies should not be negative, we redefine PCV as follows:

$$PCV_i = \frac{1}{L}\sum_{k=1}^{L} \max(a_{k,i},\, 0), \quad i = 1, 2, \ldots, 20$$

where the negative ak,i values are replaced by 0 and the 20-dimensional vector {PCVi, i = 1, 2, … , 20} corresponds to the PSI-BLAST profile-based composition vector.
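A minimal sketch of the PCV computation, assuming the profile has already been parsed into an L × 20 list of rows (the PSSM parsing itself is not shown):

```python
# PSI-BLAST profile-based composition vector (PCV): column-wise average of
# the L x 20 profile matrix, with negative log-odds scores clamped to zero,
# following the redefined formula above.

def pcv(profile: list[list[float]]) -> list[float]:
    L = len(profile)
    return [sum(max(row[i], 0) for row in profile) / L for i in range(20)]
```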

2.2.2 Secondary structure predicted with PSI-PRED

Predicted secondary structure has proven helpful in fold classification. Recently proposed fold classification studies (Ding and Dubchak, 2001; Shen and Chou, 2006) used secondary structure predicted with relatively old methods (Holley and Karplus, 1989; Qian and Sejnowski, 1988). In contrast, we use the more recent PSI-PRED method (Jones, 1999), which is shown to provide superior accuracy when compared with other state-of-the-art secondary structure prediction methods (Birzele and Kramer, 2006; Lin et al., 2005). We used PSI-PRED (version 2.5) with default parameters to predict the secondary structure from the protein sequences. The 3-state predictions (helix, strand and coil) are used to generate the features.
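A minimal sketch of extracting the 3-state prediction string, assuming the commonly used PSI-PRED vertical output format (.ss2), in which each non-comment line lists the residue position, the residue, the predicted state and three state probabilities (the function name is ours):

```python
def read_psipred_ss2(path: str) -> str:
    """Return the predicted secondary structure as a string over {H, E, C}."""
    states = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Skip headers/blank lines; data lines start with the position index.
            if len(fields) == 6 and fields[0].isdigit():
                states.append(fields[2])  # 'H', 'E' or 'C'
    return "".join(states)
```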

Secondary structure content (SSC) has been shown to improve classification accuracy for the related problem of structural class prediction (Kurgan and Chen, 2007). To this end, we introduce SSC calculated from the secondary structure predicted with PSI-PRED. Let us denote the AA sequence as {Ai, i = 1, 2, … , L} and the predicted secondary structure as {Si, i = 1, 2, … , L}. We count the occurrences of 'H', 'E' and 'C' predictions and denote the corresponding counts as COUNTH, COUNTE and COUNTC, respectively. The SSC is defined as

$$SSC_{class} = \frac{COUNT_{class}}{L}$$

where class = 'H', 'E' or 'C'.
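A one-function sketch of the SSC computation (names are ours):

```python
# Secondary structure content: fraction of helix, strand and coil states
# in the predicted secondary structure, following the SSC formula above.

def ssc(ss: str) -> dict[str, float]:
    L = len(ss)
    return {state: ss.count(state) / L for state in "HEC"}

print(ssc("CHHHHHHHCCEEEECC"))  # {'H': 0.4375, 'E': 0.25, 'C': 0.3125}
```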

Number of distinct secondary structure segments (DSSS). Although the secondary structure content reflects information about the secondary structure of the entire sequence, it does not provide information about individual secondary structure segments. At the same time, the size (length) of secondary structure segments is one of the deciding factors in the classification of structural classes and folds. To this end, we designed features that count the number of occurrences of distinct helix, strand and coil segments whose length (number of the corresponding AAs) is above a certain threshold. In this way, short secondary structure segments, which may be incorrectly predicted, are filtered out. We varied the threshold values between 2 and 9 for the strand and coil segments and between 3 and 9 for the helix segments, and ran predictions using the SVM classifier. The corresponding results are shown in Figure 1; a sketch of the segment counting follows the figure. Based on the graph, the threshold for counting helix segments was set to 7, and the thresholds for strand and coil segments were set to 5 and 6, respectively. We note that the accuracies resulting from different thresholds are relatively similar, i.e. within 1%, and thus the quality of the proposed method should not be sensitive to this parameter.

Fig. 1. Optimization of segment length thresholds to define DSSS features.
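A sketch of the segment counting with the selected thresholds (helper names are ours):

```python
from itertools import groupby

# Distinct secondary structure segments (DSSS): runs of each state whose
# length meets the selected thresholds (7 for helix, 5 for strand and
# 6 for coil, per Figure 1).
THRESHOLDS = {"H": 7, "E": 5, "C": 6}

def dsss_segments(ss: str) -> list[tuple[str, int]]:
    """Runs of identical states that pass the length threshold for their state."""
    runs = [(state, len(list(group))) for state, group in groupby(ss)]
    return [(state, n) for state, n in runs if n >= THRESHOLDS[state]]

def dsss_counts(ss: str) -> dict[str, int]:
    """Number of thresholded segments per state: the 3 DSSS features."""
    counts = {"H": 0, "E": 0, "C": 0}
    for state, _ in dsss_segments(ss):
        counts[state] += 1
    return counts
```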

Arrangement of DSSS. In some cases, structural folds cannot be distinguished based on the SSC and DSSS features alone. For instance, the α/β and α + β classes both contain α-helices and β-strands; however, the α/β class includes mainly parallel β-strands, while the α + β class mainly includes anti-parallel strands, a difference related to the arrangement of the secondary structure segments rather than to the SSC or DSSS values. Therefore, we also designed a set of features that encodes the arrangement of three neighboring secondary structure segments that meet the minimum length thresholds set for the DSSS features. There are 27 possible segment arrangements, i.e. class-class-class where class = 'H', 'E' or 'C'. We count the number of occurrences of each arrangement.
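A sketch of the arrangement counting, reusing dsss_segments from the previous sketch:

```python
from itertools import product

# Arrangement of DSSS: counts of the 27 possible ordered triples of states
# (e.g. 'HEH', 'ECH', ...) over consecutive thresholded segments.

def dsss_arrangements(ss: str) -> dict[str, int]:
    segs = [state for state, _ in dsss_segments(ss)]
    counts = {"".join(t): 0 for t in product("HEC", repeat=3)}
    for i in range(len(segs) - 2):
        counts[segs[i] + segs[i + 1] + segs[i + 2]] += 1
    return counts
```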

Finally, we also include the length of the sequence (L) as a feature. Table 1 summarizes the features used in this article.
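Putting the pieces together, a sketch of assembling the full 54-dimensional candidate vector from the helpers above (the feature selection described in Section 2.3 then reduces it to 36 features):

```python
# 20 PCV + 3 SSC + 3 DSSS counts + 27 arrangements + 1 length = 54 candidates.

def feature_vector(profile: list[list[float]], ss: str) -> list[float]:
    v = pcv(profile)                                  # 20 features
    v += [ssc(ss)[s] for s in "HEC"]                  # 3 features
    v += [float(dsss_counts(ss)[s]) for s in "HEC"]   # 3 features
    v += [float(c) for c in dsss_arrangements(ss).values()]  # 27 features
    v.append(float(len(ss)))                          # sequence length
    return v                                          # 54 values before selection
```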

Table 1. Summary of the feature selection results

Feature set            Total number of features   Selected features
PCV                    20                         20
SSC                    3                          3
Number of DSSS         3                          2
Arrangement of DSSS    27                         10
Length                 1                          1
Total                  54                         36

2.3 Feature selection

A feature selection method was used to reduce the dimensionality and potentially improve the prediction accuracy. We applied an entropy-based feature selection method (Yu and Liu, 2003), which evaluates each feature by measuring its information gain with respect to the class (the protein fold).

The entropy of a feature X is defined as

$$H(X) = -\sum_i P(x_i) \log_2 P(x_i)$$

where {xi} is the set of values of X and P(xi) is the prior probability of xi. The conditional entropy of X, given another feature Y (in our case the protein fold), is defined as

$$H(X \mid Y) = -\sum_j P(y_j) \sum_i P(x_i \mid y_j) \log_2 P(x_i \mid y_j)$$

where P(xi | yj) is the posterior probability of X given the value yj of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called the information gain:

$$IG(X \mid Y) = H(X) - H(X \mid Y)$$

According to this measure, Y has a stronger correlation with X than with Z if IG(X|Y) > IG(Z|Y). The feature selection was performed using 10-fold cross-validation on the training set. Among the original set of 54 features, the 36 with the best information gain values were selected; see Table 1. A sketch of this computation is given below.
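A minimal sketch of the information gain computation for one feature, assuming the feature values have already been discretized (e.g. by binning), which is not shown here:

```python
import math
from collections import Counter

def entropy(values) -> float:
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature_values, labels) -> float:
    """IG(X|Y) = H(X) - H(X|Y), where Y holds the fold labels."""
    cond = 0.0
    for y, ny in Counter(labels).items():
        subset = [x for x, lab in zip(feature_values, labels) if lab == y]
        cond += (ny / len(labels)) * entropy(subset)
    return entropy(feature_values) - cond

# Rank the 54 features and keep the 36 with the highest information gain,
# where column(j) would return the values of feature j over the training set:
# selected = sorted(range(54), key=lambda j: information_gain(column(j), folds),
#                   reverse=True)[:36]
```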

2.4 Proposed prediction method

The proposed prediction method was designed and tested in two steps. First, using the selected 36 features to represent sequences, we identified the best-performing classifiers among six state-of-the-art methods: SVM (Keerthi et al., 2001), Multiple Logistic Regression (le Cessie and van Houwelingen, 1992), the instance learning-based Kstar (Cleary and Trigg, 1995) and IB1 (Aha and Kibler, 1991) algorithms, Naïve Bayes (John and Langley, 1995) and Random Forest (Breiman, 2001). Second, three different ensembles of the selected classifiers, based on voting, grading and stacking (Seewald, 2002; Seewald and Fuernkranz, 2001), were tried, and the best-performing ensemble was used to implement our fold classification method. As a result, the voting-based ensemble, which combines predictions from the three classifiers using an unweighted average of the corresponding classification probability estimates, was selected. The architecture of the proposed PFRES method is shown in Figure 2. The classification algorithms used to develop and compare the proposed method were implemented in Weka (Witten and Frank, 2005).
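For illustration, a rough scikit-learn analogue of the selected voting ensemble is sketched below; the original implementation used Weka, and since Kstar has no scikit-learn counterpart, a k-nearest-neighbor classifier stands in for it here. The SVM and Random Forest settings follow those reported in Section 3.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf", gamma=0.8, C=5.0, probability=True)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),  # stand-in for Kstar
        ("rf", RandomForestClassifier(n_estimators=250)),
    ],
    voting="soft",  # unweighted average of class probability estimates
)
# ensemble.fit(X_train, y_train); predictions = ensemble.predict(X_test)
```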

Fig. 2. Architecture of the proposed fold classification method.

3 RESULTS AND DISCUSSION

The experiments first report results related to the design of the proposed fold classification method. We also test and discuss the effectiveness of the individual feature sets from the proposed sequence representation. Finally, the results of our ensemble method are compared with the results of five competing methods.

For test set 1, the fold classification accuracies of the six classifiers (SVM, Multiple Logistic Regression, Kstar, IB1, Naïve Bayes and Random Forest), using the selected 36 features to represent sequences, are shown in Table 2. Random Forest (with 250 trees) gives the highest accuracy among the six classifiers, i.e. 66.8%. The two runner-up classifiers, SVM (with an RBF kernel with γ = 0.8 and complexity parameter C = 5.0) and Kstar (with global blend = 96), obtained 66.1% and 65.0% accuracy, respectively. The same classifiers were also evaluated on test set 2 using the same set of features and the same parameters. Random Forest again gives the highest accuracy, i.e. 63.3%, with the same two runner-up classifiers, SVM and Kstar, which obtained 62.4% and 62.7% accuracy, respectively. The accuracies on test set 2 are slightly lower than the accuracies on test set 1. The remaining three classifiers obtained accuracies that are 3–10% lower than those of the three best classifiers, and thus were not used to implement the proposed fold classification method.

Table 2. Comparison of prediction accuracies (%) between different classifiers for the proposed sequence representation that includes the selected 36 features. Individual classifiers: SVM, Kstar, RF (Random Forest), IB1, NB (Naïve Bayes) and MLR (Multiple Logistic Regression). Ensemble classifiers: Grading, Voting and Stacking-C.

Fold     SVM    Kstar  RF     IB1    NB     MLR    Grading  Voting  Stacking-C
1        100    100    100    100    100    100    100      100     100
3        100    100    100    88.9   100    100    100      100     100
4        45     45     70     40     65     50     55       60      60
7        100    62.5   100    87.5   100    75     100      75      100
9        100    88.9   88.9   88.9   100    100    88.9     88.9    88.9
11       77.8   66.7   66.7   66.7   66.7   66.7   66.7     66.7    66.7
20       75     84.1   77.3   65.9   52.3   65.9   86.4     81.8    79.5
23       33.3   16.7   33.3   25     33.3   41.7   25       33.3    33.3
26       84.6   100    92.3   84.6   92.3   84.6   100      92.3    92.3
30       66.7   66.7   83.3   66.7   50     66.7   66.7     66.7    83.3
31       50     62.5   37.5   37.5   50     62.5   50       62.5    37.5
32       52.6   47.4   52.6   47.4   63.2   21.1   52.6     52.6    63.2
33       100    75     50     100    100    100    75       75      75
35       25     50     50     50     25     25     50       50      50
39       100    100    100    100    100    100    100      100     100
46       64.6   66.7   66.7   62.5   39.6   47.9   68.8     68.8    66.7
47       83.3   91.7   83.3   83.3   83.3   75     83.3     91.7    91.7
48       30.8   30.8   46.2   38.5   53.8   30.8   38.5     46.2    38.5
51       59.3   70.4   59.3   70.4   48.1   29.6   63       66.7    66.7
54       50     41.7   33.3   41.7   41.7   41.7   41.7     33.3    33.3
57       37.5   50     37.5   37.5   37.5   62.5   50       50      37.5
59       66.7   58.3   66.7   66.7   58.3   58.3   58.3     66.7    66.7
62       57.1   57.1   42.9   85.7   42.9   57.1   42.9     57.1    57.1
69       50     25     50     50     50     25     50       50      50
72       25     25     25     25     25     25     25       25      25
87       55.6   40.7   51.9   33.3   51.9   33.3   48.1     51.9    51.9
110      96.3   88.9   96.3   88.9   92.6   66.7   96.3     96.3    96.3
Overall  66.1   65     66.8   62.1   60.3   55.1   67.6     68.4    68.1

Among the 27 folds, fold 1 and fold 39 are the easiest to classify, i.e. all six classifiers achieved 100% accuracy for these two folds. Folds 3, 7, 9, 26, 33, 47 and 110 are also relatively easy to classify, i.e. the average accuracy of the six classifiers for these folds is above 80%. The average prediction accuracy for the all-α structural class (folds 1–11) is 77.1%, for the all-β class (folds 20–39) 64.3%, for the α/β class (folds 46–69) 55.6% and for the α + β class (folds 72–87) 40%. The folds that belong to the all-α and all-β structural classes are easier to classify, while the folds that belong to the α/β and α + β classes are more difficult to recognize correctly. This is expected, as the proposed features, and especially those based on the predicted secondary structure, should be able to successfully represent proteins that contain mainly α-helices or β-strands. At the same time, although still performing well, the proposed features are less effective in capturing the long-range interactions that are characteristic of the formation of parallel and anti-parallel β-strands.

3.1 Effectiveness of PCV features

The PSI-BLAST profile-based composition vector (PCV), which is proposed in this article, was directly compared with the corresponding sequence-based composition vector (CV) representation that was used in Bologna and Appel (2002), Ding and Dubchak (2001), Nanni (2006) and Okun (2004). PCV and CV were compared based on fold classification performed with the six classifiers. The prediction results are shown in Figure 3. The comparison shows the consistently superior quality of the PCV features, i.e. the results based on PCV features are at least 13% higher than the results based on CV features for all six classifiers. For test set 1, the average accuracy when using PCV features is 54.8%, while for CV features it drops to 39.1%. For test set 2, the average accuracy when using PCV features is 46.3%, while for CV features it drops to 27.3%. This illustrates that sequential evolutionary information is critical for successful classification of protein folds, even for sequences that share low sequence identity. The results also indicate that test set 2 is more challenging.

Fig. 3. Comparison of prediction accuracies (y axis) between the PSI-BLAST profile-based composition vector and the sequence-based composition vector. The two feature sets were tested with six classifiers (x axis) on both test sets.

3.2 Effectiveness of features based on the predicted secondary structure

Features generated from the predicted secondary structure that were proposed in this article, which include SSC, the number of DSSS and the arrangement of DSSS, are also shown to contribute to the improved fold classification. We compared the prediction accuracy of the three best classifiers when using the PCV features alone with the accuracy when the features computed from the predicted secondary structure are added; see Figure 4. For test set 1, accuracies of 55.4%, 59.5% and 59.3% were obtained for the Kstar, Random Forest and SVM classifiers, respectively, when using only PCV to represent sequences. After adding the SSC features, the accuracies increased to 57.7%, 62.4% and 60.1%. After adding the number of DSSS, the accuracies increased again, to 61.4%, 65.8% and 64.8%. Finally, adding the features related to the arrangement of DSSS results in accuracies of 63.4%, 65.8% and 65.5%. Similar results were observed for test set 2. The accuracies of the Kstar, Random Forest and SVM classifiers equal 43.6%, 50.9% and 50% when using only PCV, 49.7%, 57.3% and 57.9% after adding the SSC features, 55.7%, 63% and 61.5% after adding the number of DSSS, and finally 61%, 63% and 62.6% after adding the features related to the arrangement of DSSS, respectively. These consistent gains show that each of the proposed feature sets contributes an improvement and illustrate the importance of secondary structure information for the classification of protein folds.

Fig. 4. Comparison of classification accuracies (y axis) obtained by incrementally adding features calculated from the secondary structure predicted by PSI-PRED, i.e. PCV features only; PCV and SSC; PCV, SSC and number of DSSS; and PCV, SSC, number of DSSS and arrangement of DSSS. Results of the three best classifiers on both test sets (x axis) are shown.

3.3 Comparison of ensemble models

Several prior works on protein fold classification applied ensemble models to improve the prediction accuracy (Bologna and Appel, 2002; Nanni, 2006; Shen and Chou, 2006). The method by Shen and Chou (Shen and Chou, 2006) ensembles nine evidence-theoretic k-nearest neighbor classifiers that use different input feature sets. The ensemble proposed in Bologna and Appel (2002) applies four specialized neural networks that use different subsets of protein sequences from the training set. Finally, the ensemble developed by Nanni (Nanni, 2006) uses 27 k-local hyperplane-based nearest neighbor classifiers, each of which uses a different subset of the features proposed in Ding and Dubchak (2001). In contrast to the above methods, which ensemble multiple classifiers of the same type, our method ensembles three different classifiers that provide complementary predictions, i.e. SVM provides superior predictions for folds 9, 11, 33, 54 and 87; Kstar for folds 20, 26, 31, 47, 51 and 57; and Random Forest for folds 4, 30 and 48; see Table 2. Three methods for combining multiple classifiers, i.e. voting, grading and stacking, were compared on test set 1; see Table 2. All three ensembles provide better accuracy than the best single classifier, Random Forest. The proposed method adopts the best-performing voting-based ensemble, which achieves 68.4% accuracy on test set 1. For test set 2, the same voting-based ensemble achieves 66.4% accuracy. For both test sets, folds 1, 3, 9, 20 and 110 were predicted with accuracies above 80%, while accuracies below 50% were recorded for folds 23, 35, 48 and 72. Results on both test sets show that the application of the ensemble model results in a 2–3% improvement in prediction accuracy over prediction based on a single classifier. The lower prediction accuracy on test set 2 can be explained by the strict separation (<35% sequence identity) between this test set and the training set. In contrast, test set 1 is shown to share some redundant and similar sequences with the training set. When these 19 sequences were removed from test set 1, PFRES obtained 67% accuracy on this set, which is only 0.6% higher than its accuracy on test set 2.

3.4 Comparison with competing prediction methods

The proposed PFRES method was compared on test set 1 with five recent methods that address the same task; see Table 3. Ding and Dubchak's method uses a representation with 125 features and SVM and neural networks as the classifiers (Ding and Dubchak, 2001). Okun's method uses the features proposed in Ding and Dubchak (2001) and a k-local hyperplane nearest neighbor classifier (Okun, 2004). Bologna and Appel's and Nanni's methods again use the same features together with ensemble-based classifiers (Bologna and Appel, 2002; Nanni, 2006). Finally, the method by Shen and Chou uses a new representation that includes 283 features and an ensemble-based classifier; they substituted the composition vector in the feature set proposed by Ding and Dubchak with 178 features that implement the pseudo-amino acid composition. When compared with the competing methods, PFRES uses only 36 features, which is 70% fewer features than the representation applied in Ding and Dubchak (2001), Bologna and Appel (2002), Nanni (2006) and Okun (2004), and nearly 90% fewer features than the representation proposed in Shen and Chou (2006). Table 3 shows that PFRES provides 6.3–12.4% higher accuracy than the prior methods. When compared with the best-performing competing method, by Shen and Chou, prediction with PFRES results in a substantial 6.3/37.9 = 17% error rate reduction. PFRES provides superior accuracy for 13 out of the 27 folds, while the method by Shen and Chou provides the best predictions for nine folds.

Table 3. Comparison between PFRES and the competing fold classification methods on test set 1 (all values in %)

Fold     SVM(a)  HKNN(b)  DIMLP(c)  SE(d)   PFP(e)  PFRES (this article)
1        83.3    83.3     85.0      83.3    83.3    100
3        77.8    77.8     97.8      88.9    55.6    100
4        35.0    50.0     66.0      70.0    85.0    60.0
7        50.0    87.5     41.3      50.0    75.0    75.0
9        100     88.9     91.1      100     100     88.9
11       66.7    44.4     22.2      33.3    33.3    66.7
20       71.6    56.8     75.7      79.6    70.5    81.8
23       16.7    25.0     40.0      25.0    16.7    33.3
26       50.0    84.6     80.8      69.2    100     92.3
30       33.3    50.0     46.7      33.3    33.3    66.7
31       50.0    50.0     75.0      62.5    37.5    62.5
32       26.3    42.1     22.6      36.8    15.8    52.6
33       50.0    50.0     45.0      50.0    75.0    75.0
35       25.0    50.0     50.0      25.0    50.0    50.0
39       57.1    42.9     74.3      28.6    71.4    100
46       77.1    79.2     83.8      87.5    97.9    68.8
47       58.3    58.3     55.0      58.3    66.7    91.7
48       48.7    53.9     52.3      61.5    15.4    46.2
51       61.1    40.7     39.3      37.0    44.4    66.7
54       36.1    33.3     41.7      50.0    33.3    33.3
57       50.0    37.5     46.3      50.0    62.5    50.0
59       35.7    71.4     55.0      64.3    66.7    66.7
62       71.4    71.4     44.3      71.4    57.1    57.1
69       25.0    25.0     25.0      25.0    50.0    50.0
72       12.5    25.0     23.8      25.0    37.5    25.0
87       37.0    25.9     41.1      33.3    29.6    51.9
110      83.3    85.2     100       85.2    96.3    96.3
Overall  56.0    57.1     61.1      61.1    62.1    68.4

(a) Ding and Dubchak (2001). (b) Okun (2004). (c) Bologna and Appel (2002). (d) Nanni (2006). (e) Shen and Chou (2006).

The statistical significance of the differences between the accuracies obtained by the proposed and the competing methods over the 27 protein folds was investigated using the paired t-test; a sketch of the computation is shown below. The corresponding t-values for the differences between PFRES and the PFP (Shen and Chou, 2006), SE (Nanni, 2006), DIMLP (Bologna and Appel, 2002), HKNN (Okun, 2004) and SVM (Ding and Dubchak, 2001) methods equal 2.44, 3.18, 2.82, 3.12 and 4.12, respectively. As the critical t-value for the standard 0.05 significance level (with 26 degrees of freedom) equals 1.71, the test shows that the proposed method provides statistically significantly better predictions than the five competing methods. We also note that the critical t-values for the stronger 0.01 and 0.005 significance levels equal 2.48 and 2.78, respectively.
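A minimal sketch of this test, with placeholder per-fold accuracy vectors standing in for the Table 3 columns:

```python
from scipy import stats

# Paired t-test over the 27 per-fold accuracies (Table 3), comparing PFRES
# against one competitor; the short lists below are placeholders for the
# full 27-element table columns.
pfres_acc = [100, 100, 60.0, 75.0]  # ... per-fold accuracies of PFRES
pfp_acc = [83.3, 55.6, 85.0, 75.0]  # ... per-fold accuracies of a competitor

t_statistic, p_value = stats.ttest_rel(pfres_acc, pfp_acc)
# With 26 degrees of freedom, a one-sided test at the 0.05 level rejects the
# null hypothesis of equal accuracy when t > 1.71.
```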

3.5 Impact of the quality of the secondary structure predicted by PSI-PRED

Since 15 of the features proposed in this article were generated from the secondary structure predicted by PSI-PRED, we further analyzed the impact of the quality of the predicted secondary structure on the accuracy of the fold classification. For test set 2, the average accuracy of the predicted secondary structure was 75.4%. We divided test set 2 into two subsets containing the sequences for which the secondary structure was predicted with accuracy below and above the average, respectively. PFRES was evaluated on each of these subsets independently; see Table 4. The prediction accuracy for the second subset was 67.3%, while for the first subset it was slightly lower, at 65.2%. As expected, higher quality of the predicted secondary structure results in higher fold classification accuracy. At the same time, this difference is relatively small, i.e. about 2%, while the difference in the accuracy of the predicted secondary structure between the two subsets was much larger (over 13%; see Table 4). This shows that the proposed method provides relatively stable prediction quality with respect to the quality of the predicted secondary structure. We also note that current secondary structure prediction methods achieve average accuracies close to 80%, e.g. the EVA server reports that PSI-PRED provides an average accuracy of 77.9% for 224 proteins (tested between April 2001 and September 2005), and Porter provides an average accuracy of 79.8% for 77 proteins (February 2005 to March 2006) (Eyrich et al., 2001). Since the average accuracy of the predicted secondary structure for sequences in test set 2 was 75.4%, we believe that the presented test results provide a reliable estimate of the future performance of the proposed method.

Table 4. Average accuracy of the predicted secondary structure and accuracy of the fold classification for two subsets of test set 2; subset 1 includes sequences for which the secondary structure was predicted with accuracy below 75.4%; subset 2 includes the remaining sequences

           Number of sequences   Average accuracy of predicted        Accuracy of fold classification
                                 secondary structure (%)              with PFRES (%)
Subset 1   379                   67.6                                 65.2
Subset 2   529                   81.1                                 67.3
Total      908                   75.4                                 66.4

4 CONCLUSIONS

A high-quality predictor for protein fold classification would be beneficial for the in silico prediction of the tertiary structure of proteins with low sequence identity, since it would allow the determination of structural similarity without sequence similarity. To this end, we propose the PFRES method, which uses a novel protein sequence representation, consisting of a small set of 36 features, and applies a carefully designed ensemble classifier. The feature representation utilized by PFRES includes the PSI-BLAST profile-based composition vector, features based on the secondary structure predicted with PSI-PRED, and the sequence length. The experimental evaluation of the proposed fold classification method was performed on a standard benchmark dataset and on another large set of over 900 sequences, both containing chains with identity below 35% with respect to the training sequences. On the benchmark set, PFRES is shown to predict the protein folds with 68.4% accuracy, which is over 6% higher than the accuracy of the best existing method. The results also show that the fold classification accuracy of the proposed method is statistically significantly better than the accuracy of all competing methods. Similar accuracy, i.e. 66.4%, was achieved by the proposed method on the second test set. At the same time, PFRES uses 70–90% fewer features to represent sequences than the existing methods. The proposed PSI-BLAST profile-based composition vector, which embeds evolutionary information, was compared with the commonly used sequence-based composition vector. Our empirical tests with six machine-learning classifiers show that the PSI-BLAST profile-based composition vector is superior to the composition vector. The new representation can be extended to other protein prediction tasks that currently apply the AA composition, e.g. prediction of the structural class, secondary structure content, membrane protein type, enzyme family, etc., to improve their accuracy.

ACKNOWLEDGEMENTS

K.C.'s research was supported by the Alberta Ingenuity Scholarship and NSERC Canada. L.K. acknowledges support from NSERC Canada.

Conflict of Interest: none declared.

REFERENCES

Aha,D. and Kibler,D. (1991) Instance-based learning algorithms. Mach. Learn., 6, 37–66.
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Andreeva,A. et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, D226–D229.
Birzele,F. and Kramer,S. (2006) A new representation for protein secondary structure prediction based on frequent patterns. Bioinformatics, 22, 2628–2634.
Bologna,G. and Appel,R.D. (2002) A comparison study on protein fold recognition. In Proceedings of the 9th International Conference on Neural Information Processing, Vol. 5, pp. 2492–2496.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Bujnicki,J.M. (2006) Protein structure prediction by recombination of fragments. ChemBioChem, 7, 19–27.
Chandonia,J.M. and Brenner,S.E. (2006) The impact of structural genomics: expectations and outcomes. Science, 311, 347–351.
Chen,K. et al. (2007) Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Struct. Biol., 7, 25.
Chothia,C. (1992) Proteins. One thousand families for the molecular biologist. Nature, 357, 543–544.
Chou,K.C. (2005) Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Curr. Protein Pept. Sci., 6, 423–436.
Cleary,J.G. and Trigg,L.E. (1995) K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning, pp. 108–114.
Ding,C.H. and Dubchak,I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349–358.
Eyrich,V.A. et al. (2001) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242–1243.
Holley,L.H. and Karplus,M. (1989) Protein secondary structure prediction with a neural network. Proc. Natl Acad. Sci. USA, 86, 152–156.
John,G.H. and Langley,P. (1995) Estimating continuous distributions in Bayesian classifiers. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 338–345.
Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
Jones,D.T. (2007) Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics, 23, 538–544.
Keerthi,S.S. et al. (2001) Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput., 13, 637–649.
Kim,H. and Park,H. (2004) Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor. Proteins, 54, 557–562.
Kurgan,L. and Chen,K. (2007) Prediction of protein structural class for the twilight zone sequences. Biochem. Biophys. Res. Commun., 357, 453–460.
le Cessie,S. and van Houwelingen,J.C. (1992) Ridge estimators in logistic regression. Appl. Stat., 41, 191–201.
Levitt,M. (2007) Growth of novel protein structural data. Proc. Natl Acad. Sci. USA, 104, 3183–3188.
Levitt,M. and Chothia,C. (1976) Structural patterns in globular proteins. Nature, 261, 552–558.
Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
Lin,K. et al. (2005) A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics, 21, 152–159.
Murzin,A.G. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Nanni,L. (2006) A novel ensemble of classifiers for protein fold recognition. Neurocomputing, 69, 2434–2437.
Okun,O. (2004) Protein fold recognition with K-local hyperplane distance nearest neighbor algorithm. In Proceedings of the 2nd European Workshop on Data Mining and Text Mining in Bioinformatics, Vol. 1, pp. 51–57.
Paiardini,A. et al. (2004) Evolutionarily conserved regions and hydrophobic contacts at the superfamily level: the case of the fold-type I, pyridoxal-5′-phosphate-dependent enzymes. Protein Sci., 13, 2992–3005.
Qian,N. and Sejnowski,T.J. (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202, 865–884.
Reinhardt,A. and Eisenberg,D. (2004) DPANN: improved sequence to structure alignments following fold recognition. Proteins, 56, 528–538.
Ruan,J. et al. (2006) Quantitative analysis of the conservation of the tertiary structure of protein segments. Protein J., 25, 301–315.
Seewald,A.K. (2002) How to make stacking better and faster while also taking care of an unknown weakness. In Proceedings of the 19th International Conference on Machine Learning, pp. 554–561.
Seewald,A.K. and Fuernkranz,J. (2001) An evaluation of grading classifiers. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis, pp. 115–124.
Shen,H.B. and Chou,K.C. (2006) Ensemble classifier for protein fold pattern recognition. Bioinformatics, 22, 1717–1722.
Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
Tomii,K. et al. (2005) Protein structure prediction using a variety of profile libraries and 3D verification. Proteins, 61(Suppl. 7), 114–121.
Tress,M. et al. (2005) Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins, 61(Suppl. 7), 27–45.
Wang,G. et al. (2005) Assessment of fold recognition predictions in CASP6. Proteins, 61(Suppl. 7), 46–66.
Witten,I. and Frank,E. (2005) Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco.
Yu,L. and Liu,H. (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning, pp. 856–863.
Yu,Y.K. et al. (2006) Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res., 34, 5966–5973.
Zhang,Y. and Skolnick,J. (2005) The protein structure prediction problem could be solved using the current PDB library. Proc. Natl Acad. Sci. USA, 102, 1029–1034.

Author notes

Associate Editor: Alfonso Valencia
