Abstract

Motivation: Prediction of protein folding patterns is one level deeper than that of protein structural classes, and hence is much more complicated and difficult. To deal with this challenging problem, an ensemble classifier was introduced. It was formed by a set of basic classifiers, each trained on a different parameter system, such as predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, and pseudo-amino acid compositions of different dimensions, all extracted from a training dataset. The operation engine for the constituent individual classifiers was the OET-KNN (optimized evidence-theoretic k-nearest neighbors) rule. Their outcomes were combined through weighted voting to give a final determination for classifying a query protein. The recognition task was to identify the true fold among 27 possible patterns.

Results: The overall success rate thus obtained was 62% for a testing dataset in which most of the proteins have <25% sequence identity with the proteins used to train the classifier. Such a rate is 6–21% higher than the corresponding rates obtained by various existing NN (neural networks) and SVM (support vector machines) approaches, implying that the ensemble classifier is very promising and might become a useful vehicle in protein science, as well as in proteomics and bioinformatics.

Availability: The ensemble classifier, called PFP-Pred, is available as a web-server at Author Webpage for public usage.

Contact:  lifesci-sjtu@san.rr.com

Supplementary information: Supplementary data are available on Bioinformatics online.

INTRODUCTION

The avalanche of protein sequences generated in the post-genomic era has challenged us to develop computational methods by which structural information can be extracted from sequence databases in a timely manner. Direct prediction of the three-dimensional (3D) structure of a protein from its sequence based on the least-free-energy principle is scientifically quite sound, and some encouraging results have already been obtained in elucidating the handedness problems and packing arrangements in proteins (see, e.g., Chou and Carlacci, 1991; Chou et al., 1982, 1984, 1990); nevertheless, predicting the overall fold this way is very difficult owing to the notorious local-minimum problem. Likewise, although the homology modeling approach can predict the 3D structure of a protein quite successfully (Chou, 2004; Holm and Sander, 1999), a hurdle arises when the query protein has no structure-known homologous protein in the existing databases. Facing this situation, can we find a different approach to predict the fold of a protein? In this paper, we resort to the taxonomic approach, whose underpinning is the assumption that the number of protein folds is limited (Chou and Zhang, 1995; Dubchak et al., 1999; Finkelstein and Ptitsyn, 1987; Murzin et al., 1995). Accordingly, predicting the 3D structure of a protein may first be converted to a classification problem, i.e. identifying which fold pattern it belongs to. The present study was initiated in an attempt to introduce a novel approach, the ensemble classifier, to recognize the fold pattern of a query protein.

MATERIALS AND METHODS

The working (training and testing) datasets studied here were taken from Ding and Dubchak (2001). The original training and testing datasets contain 313 and 385 proteins, respectively. Of these, however, two proteins (2SCMC and 2GPS) in the training dataset and two (2YHX_1 and 2YHX_2) in the testing dataset have no sequence records, and were therefore excluded from further consideration. Accordingly, we have 311 proteins in the training dataset and 383 proteins in the testing dataset. The names and sequences of the training and testing proteins are given in Online Supplementary Materials AI and AII, respectively. None of the proteins in the testing dataset has >35% sequence identity to those in the training dataset (Ding and Dubchak, 2001). According to the SCOP database (Andreeva et al., 2004; Murzin et al., 1995), the proteins in the training and testing datasets (Online Supplementary Materials A) were further classified into the following 27 fold types (Ding and Dubchak, 2001; Dubchak et al., 1995, 1999): (1) globin-like, (2) cytochrome c, (3) DNA-binding 3-helical bundle, (4) 4-helical up-and-down bundle, (5) 4-helical cytokines, (6) EF-hand, (7) immunoglobulin-like, (8) cupredoxins, (9) viral coat and capsid proteins, (10) conA-like lectin/glucanases, (11) SH3-like barrel, (12) OB-fold, (13) beta-trefoil, (14) trypsin-like serine proteases, (15) lipocalins, (16) (TIM)-barrel, (17) FAD (also NAD)-binding motif, (18) flavodoxin-like, (19) NAD(P)-binding Rossmann-fold, (20) P-loop, (21) thioredoxin-like, (22) ribonuclease H-like motif, (23) hydrolases, (24) periplasmic binding protein-like, (25) β-grasp, (26) ferredoxin-like and (27) small inhibitors, toxins, lectins. Of these 27 fold types, types 1–6 belong to the all-α structural class, types 7–15 to the all-β class, types 16–24 to the α/β class and types 25–27 to the α+β class. The classification into 27 folds is therefore one level deeper than that into the 4 structural classes (Cai, 2001; Chou and Zhang, 1995; Zhou, 1998; Zhou and Assa-Munt, 2001), and it is naturally more challenging and difficult to conduct prediction among the 27 fold types than among the 4 structural classes (Chou, 1995; Chou and Maggiora, 1998).

To deal with the problem, Ding and Dubchak (2001) extracted the following six features from protein sequences: (1) amino acid composition, (2) predicted secondary structure, (3) hydrophobicity, (4) normalized van der Waals volume, (5) polarity and (6) polarizability. Of these six features, only the amino acid composition contains 20 components, each representing the occurrence frequency of one of the 20 native amino acids in a given protein (Chou and Zhang, 1994; Zhou and Doctor, 2003). Each of the other five features contains 3 + 3 + 5 × 3 = 21 components, as detailed in Ding and Dubchak (2001) and Dubchak et al. (1999). Based on these multiple parameter sets and a majority voting rule trained on the proteins in the training dataset, an overall success rate of 56% was reported (Ding and Dubchak, 2001) in predicting the fold types of the proteins in the testing dataset.
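To make the 21-component encoding concrete, the following minimal sketch shows such a feature vector for a single property, assuming the composition/transition/distribution scheme of Dubchak et al. (3 + 3 + 15 components), in which the 20 amino acids are partitioned into three property groups. The grouping shown for hydrophobicity and the exact distribution indexing are illustrative only; the authoritative definitions are those of Ding and Dubchak (2001) and Dubchak et al. (1999).

```python
# Sketch: composition (3) + transition (3) + distribution (5 x 3) = 21 components
# for one physicochemical property. The three-group assignment is illustrative.
HYDROPHOBICITY_GROUP = {          # 1 = polar, 2 = neutral, 3 = hydrophobic
    **dict.fromkeys("RKEDQN", 1),
    **dict.fromkeys("GASTPHY", 2),
    **dict.fromkeys("CLVIMFW", 3),
}

def ctd_features(seq, group=HYDROPHOBICITY_GROUP):
    labels = [group[aa] for aa in seq]
    n = len(labels)
    # Composition: fraction of residues falling in each of the 3 groups
    comp = [labels.count(g) / n for g in (1, 2, 3)]
    # Transition: frequency of adjacent residue pairs crossing between two groups
    pairs = list(zip(labels, labels[1:]))
    trans = [sum(1 for a, b in pairs if {a, b} == {g1, g2}) / (n - 1)
             for g1, g2 in ((1, 2), (1, 3), (2, 3))]
    # Distribution: chain position (normalized by n) of the 1st, 25%, 50%, 75%
    # and 100% occurrence of each group -> 5 x 3 = 15 components
    dist = []
    for g in (1, 2, 3):
        pos = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(int(frac * len(pos)) - 1, 0) if pos else 0
            dist.append(pos[k] / n if pos else 0.0)
    return comp + trans + dist    # 21 numbers in total
```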

In the present study, to avoid completely ignoring the sequence-order effects, the pseudo-amino acid composition (Chou, 2001) was used to replace the conventional amino acid composition (Chou and Zhang, 1993; Nakashima et al., 1986) used in Ding and Dubchak (2001). However, rather than using a combined correlation function (Chou, 2001), here the alternating correlation functions of hydrophobicity and hydrophilicity (Chou, 2005; Chou and Cai, 2005) are adopted to reflect the sequence-order effects. For the reader's convenience, a brief introduction to the amphiphilic pseudo-amino acid composition (PseAA) is given below.

Suppose a protein P with a sequence of L amino acid residues:
$$\mathbf{P} = R_1 R_2 R_3 R_4 R_5 \cdots R_L \tag{1}$$
where $R_1$ represents the residue at chain position 1, $R_2$ at position 2, and so forth. The hydrophobicity and hydrophilicity of the constituent amino acids of a protein play a very important role in its folding; e.g. many helices in proteins are amphiphilic, being formed by hydrophobic and hydrophilic amino acids arranged in a special order along the helix chain, as illustrated by the 'wenxiang' diagram (Chou et al., 1997). Therefore, these two indices may be among the optimal choices for reflecting sequence-order effects. In view of this, the sequence-order effects can be indirectly and partially, but quite effectively, reflected through the following equations (Fig. 1):
$$
\left\{
\begin{aligned}
\tau_1 &= \frac{1}{L-1}\sum_{i=1}^{L-1} H^{1}_{i,i+1}\\
\tau_2 &= \frac{1}{L-1}\sum_{i=1}^{L-1} H^{2}_{i,i+1}\\
\tau_3 &= \frac{1}{L-2}\sum_{i=1}^{L-2} H^{1}_{i,i+2}\\
\tau_4 &= \frac{1}{L-2}\sum_{i=1}^{L-2} H^{2}_{i,i+2}\\
&\;\;\vdots\\
\tau_{2\lambda-1} &= \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda} H^{1}_{i,i+\lambda}\\
\tau_{2\lambda} &= \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda} H^{2}_{i,i+\lambda}
\end{aligned}
\right.
\qquad (\lambda < L)
\tag{2}
$$
where $H^{1}_{i,j}$ and $H^{2}_{i,j}$ are the hydrophobicity and hydrophilicity correlation functions given by
$$
\begin{cases}
H^{1}_{i,j} = h_{1}(R_i)\cdot h_{1}(R_j)\\[2pt]
H^{2}_{i,j} = h_{2}(R_i)\cdot h_{2}(R_j)
\end{cases}
\tag{3}
$$
where $h_1(R_i)$ and $h_2(R_i)$ are, respectively, the hydrophobicity and hydrophilicity values of the $i$th ($i = 1, 2, \ldots, L$) amino acid in Equation (1), and the dot (·) denotes multiplication. In Equation (2), $\tau_1$ and $\tau_2$ are called the 1st-tier correlation factors, which reflect the sequence-order correlation between all the most contiguous residues along a protein chain through hydrophobicity and hydrophilicity, respectively [Fig. 1(a1) and (a2)]; $\tau_3$ and $\tau_4$ are the corresponding 2nd-tier correlation factors between all the 2nd most contiguous residues [Fig. 1(b1) and (b2)]; and so forth. Note that before the hydrophobicity and hydrophilicity values were substituted into Equation (3), they were all subjected to a standard conversion described by the following equation:
$$
h_1(R_i) = \frac{h_1^0(R_i) - \langle h_1^0 \rangle}{\mathrm{SD}(h_1^0)},
\qquad
h_2(R_i) = \frac{h_2^0(R_i) - \langle h_2^0 \rangle}{\mathrm{SD}(h_2^0)}
\tag{4}
$$
where $h_1^0(R_i)$ and $h_2^0(R_i)$ represent the original hydrophobicity value (Tanford, 1962) and hydrophilicity value (Hopp and Woods, 1981) of amino acid $R_i$, respectively (Table 1); $\langle h_1^0 \rangle$ and $\langle h_2^0 \rangle$ are their means over the 20 native amino acids; and $\mathrm{SD}(h_1^0)$ and $\mathrm{SD}(h_2^0)$ are their standard deviations. The converted hydrophobicity and hydrophilicity values obtained by Equation (4) have a zero mean over the 20 native amino acids and remain unchanged if put through the same conversion again. As can be seen from Equations (1)–(4), as well as Figure 1, a considerable amount of sequence-order information is incorporated into the $2\lambda$ correlation factors through the hydrophobicity and hydrophilicity values of the amino acid residues along a protein chain. By fusing the $2\lambda$ amphiphilic correlation factors into the classical amino acid composition, we obtain the following augmented discrete form to represent a protein sample P:
$$\mathbf{P} = \left[\, p_1,\; p_2,\; \ldots,\; p_{20},\; p_{20+1},\; \ldots,\; p_{20+2\lambda} \,\right]^{\mathrm{T}} \tag{5}$$
where
$$
p_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & 1 \le u \le 20\\[10pt]
\dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & 20+1 \le u \le 20+2\lambda
\end{cases}
\tag{6}
$$
where $f_i$ ($i = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in the protein P, $\tau_j$ is the $j$th sequence-correlation factor computed according to Equation (2), and $w$ is the weight factor. In the current study we chose $w = 0.5$, which keeps the components of Equation (6) within a range that is easy to handle (w can of course be assigned other values, but this would not substantially change the final results). Therefore, the first 20 numbers in Equation (5) represent the classic amino acid composition, and the next $2\lambda$ discrete numbers reflect the amphiphilic sequence correlation along the protein chain. Such a protein representation is called the 'amphiphilic pseudo-amino acid composition'; it has the same form as the conventional amino acid composition, but contains much more information. It is through the $2\lambda$ pseudo-amino acid components that the sequence order of a protein chain and the distribution of the hydrophobic and hydrophilic amino acids along the chain are indirectly and partially reflected. It should be pointed out that, by definition, all components of the classical amino acid composition must be ≥0; this is not always true for the pseudo-amino acid composition, whose components corresponding to the sequence-correlation factors may also be <0.
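The computation in Equations (2)–(6) is straightforward to express in code. Below is a minimal sketch (our own code, not the authors' implementation; the names standardize and pseaa are ours), which returns the (20 + 2λ)-D vector of Equation (5) from a sequence, using the Table 1 property values and the standard conversion of Equation (4):

```python
import math

# Table 1: Tanford (1962) hydrophobicity and Hopp-Woods (1981) hydrophilicity
TANFORD_H1 = {'A': 0.62, 'C': 0.29, 'D': -0.90, 'E': -0.74, 'F': 1.19,
              'G': 0.48, 'H': -0.40, 'I': 1.38, 'K': -1.50, 'L': 1.06,
              'M': 0.64, 'N': -0.78, 'P': 0.12, 'Q': -0.85, 'R': -2.53,
              'S': -0.18, 'T': -0.05, 'V': 1.08, 'W': 0.81, 'Y': 0.26}
HOPP_H2 = {'A': -0.5, 'C': -1.0, 'D': 3.0, 'E': 3.0, 'F': -2.5,
           'G': 0.0, 'H': -0.5, 'I': -1.8, 'K': 3.0, 'L': -1.8,
           'M': -1.3, 'N': 2.0, 'P': 0.0, 'Q': 0.2, 'R': 3.0,
           'S': 0.3, 'T': -0.4, 'V': -1.5, 'W': -3.4, 'Y': -2.3}

def standardize(h):
    """Equation (4): zero mean, unit SD over the 20 native amino acids."""
    vals = list(h.values())
    mean = sum(vals) / 20
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / 20)
    return {aa: (v - mean) / sd for aa, v in h.items()}

def pseaa(seq, lam=4, w=0.5):
    """Amphiphilic PseAA, Equations (2)-(6): a (20 + 2*lam)-D vector (lam < len(seq))."""
    h1, h2 = standardize(TANFORD_H1), standardize(HOPP_H2)
    L = len(seq)
    taus = []
    for k in range(1, lam + 1):            # k-th tier correlation factors, Equation (2)
        t1 = sum(h1[seq[i]] * h1[seq[i + k]] for i in range(L - k)) / (L - k)
        t2 = sum(h2[seq[i]] * h2[seq[i + k]] for i in range(L - k)) / (L - k)
        taus += [t1, t2]                   # 2*lam factors in total
    freqs = [seq.count(aa) / L for aa in sorted(TANFORD_H1)]
    denom = sum(freqs) + w * sum(taus)     # Equation (6); sum(freqs) is 1 here
    return [f / denom for f in freqs] + [w * t / denom for t in taus]
```

With λ = 1, 4, 14 and 30, this yields the 22D, 28D, 48D and 80D representations listed in Table 2 (the sequence must be longer than λ for Equation (2) to be defined).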
Fig. 1. A schematic drawing to show the amphiphilic correlation along a protein chain, where the values of $H^{1}_{i,j}$ and $H^{2}_{i,j}$ are given by Equations (3) and (4) and Table 1. The correlation via hydrophobicity is shown in red, and the correlation via hydrophilicity in blue (a colour version of this figure appears in the Supplementary data). Panels (a1) and (a2) reflect the coupling mode between all the most contiguous residues, panels (b1) and (b2) that between all the 2nd most contiguous residues, and panels (c1) and (c2) that between all the 3rd most contiguous residues.

Table 1

The amino acid parameters used for deriving the amphiphilic pseudo-amino acid components [cf. Equation (4)]

Code   Hydrophobicity^a h1_0   Hydrophilicity^b h2_0
A       0.62                   −0.5
C       0.29                   −1.0
D      −0.90                    3.0
E      −0.74                    3.0
F       1.19                   −2.5
G       0.48                    0.0
H      −0.40                   −0.5
I       1.38                   −1.8
K      −1.50                    3.0
L       1.06                   −1.8
M       0.64                   −1.3
N      −0.78                    2.0
P       0.12                    0.0
Q      −0.85                    0.2
R      −2.53                    3.0
S      −0.18                    0.3
T      −0.05                   −0.4
V       1.08                   −1.5
W       0.81                   −3.4
Y       0.26                   −2.3

^a The hydrophobicity values were taken from Tanford (1962).

^b The hydrophilicity values were taken from Hopp and Woods (1981).


In this study, the OET-KNN (optimized evidence-theoretic k-nearest neighbors) algorithm was adopted as the operation engine of a classifier (Shen and Chou, 2005). For the reader's convenience, a brief introduction to the OET-KNN classifier and its key equations is given in Appendix A. However, quite different from the case of Shen and Chou (2005), we now have many different input types: the (20+2λ)D PseAA, 21D predicted secondary structure, 21D hydrophobicity, 21D normalized van der Waals volume, 21D polarity and 21D polarizability (Ding and Dubchak, 2001). Since a basic classifier is defined by one operation engine and one input type, one way to use the information from the multiple input types would be to concatenate the above six into a single [(21×5)+(20+2λ)]D vector. However, doing so would introduce too many parameters into the input, thereby reducing the cluster-tolerant capacity (Chou, 1999) and the cross-validation success rate. Furthermore, a PseAA with a different value of λ constitutes a different input type. In the present study, λ was assigned the values 1, 4, 14 and 30. Therefore, we are actually facing 5 + 4 = 9 different input types (Table 2), and hence have nine basic classifiers. To deal with this situation, we introduce an ensemble classifier, by which not only the other five features described in Ding and Dubchak (2001) but also the pseudo-amino acid compositions with different λ values can be automatically fused into one prediction system, as sketched below.
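The dimension bookkeeping behind these nine input types can be summarized in a small sketch (variable names are our own; only the dimensions, not the feature extraction itself, are shown):

```python
# Sketch: the nine input types of Table 2; each basic OET-KNN classifier
# operates on exactly one of these representations.
PSEAA_LAMBDAS = [1, 4, 14, 30]
input_types = [(f"PseAA (lambda={lam})", 20 + 2 * lam) for lam in PSEAA_LAMBDAS]
input_types += [(name, 21) for name in (
    "predicted secondary structure", "hydrophobicity",
    "normalized van der Waals volume", "polarity", "polarizability")]
assert [dim for _, dim in input_types] == [22, 28, 48, 80, 21, 21, 21, 21, 21]
# By contrast, concatenating the five 21D features with a single PseAA would
# give one long [(21 * 5) + (20 + 2 * lambda)]-D input vector:
concat_dim = 21 * 5 + (20 + 2 * 30)   # = 185 for lambda = 30
```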

Table 2

List of nine features extracted from protein sequences for fold recognition

Feature                            Dimension
Pseudo-amino acid composition^a    22
Pseudo-amino acid composition^b    28
Pseudo-amino acid composition^c    48
Pseudo-amino acid composition^d    80
Predicted secondary structure      21
Hydrophobicity                     21
Normalized van der Waals volume    21
Polarity                           21
Polarizability                     21

^a The effects of the first rank of sequence-order correlation are incorporated [cf. Equation (5) with λ = 1].

^b The effects of the first 4 ranks of sequence-order correlation are incorporated [cf. Equation (5) with λ = 4].

^c The effects of the first 14 ranks of sequence-order correlation are incorporated [cf. Equation (5) with λ = 14].

^d The effects of the first 30 ranks of sequence-order correlation are incorporated [cf. Equation (5) with λ = 30].


The ensemble classifier framework was established by combining numerous basic classifiers in order to reduce the variance caused by the peculiarities of a single training set, and hence to learn a more expressive classification concept than any single classifier. Illustrated in Figure 2 is the basic framework of an ensemble classifier consisting of Ω = 9 basic classifiers. The final output of the ensemble is the weighted fusion of the outputs produced by the nine individual classifiers, as formulated below.

Fig. 2. Flowchart to show how the ensemble classifier ℂ [Equation (7)] is formed by fusing Ω = 9 basic individual classifiers: ℂ1, ℂ2, …, ℂ9. A colour version of this figure appears in the Supplementary data.

Suppose the ensemble classifier ℂ is expressed by
$$\mathbb{C} = \mathbb{C}_1 \oplus \mathbb{C}_2 \oplus \cdots \oplus \mathbb{C}_9 = \bigoplus_{i=1}^{9} \mathbb{C}_i \tag{7}$$
where ℂ1, ℂ2, …, ℂ9 represent the nine basic OET-KNN classifiers (Appendix A), each operating on the input derived from one of the nine features listed in Table 2; i.e. classifier ℂ1 operates on the 22D PseAA, ℂ2 on the 28D PseAA, ℂ3 on the 48D PseAA, ℂ4 on the 80D PseAA, ℂ5 on the 21D predicted secondary structure, ℂ6 on the 21D hydrophobicity, ℂ7 on the 21D normalized van der Waals volume, ℂ8 on the 21D polarity, and ℂ9 on the 21D polarizability. In Equation (7) the symbol ⊕ denotes the fusing operator. For the reader's convenience, the values of the nine input parameter systems (cf. Table 2) for each of the proteins in the training and testing datasets are given in Online Supplementary Materials BI and BII, respectively.
The process by which the ensemble classifier ℂ fuses the nine basic classifiers ℂi (i = 1, 2, …, 9) can be formulated as follows. Suppose
$$Y(\mathbf{P}, S_j) = \sum_{i=1}^{9} w_i\, \mathbb{R}_i(\mathbf{P}, S_j), \qquad j = 1, 2, \ldots, 27 \tag{8}$$
where $S_1$ is the set containing only proteins of fold type 1, $S_2$ the set of fold type 2, and so forth; $\mathbb{R}_i(\mathbf{P}, S_j)$ is the belief function, or supporting degree, for P belonging to $S_j$ obtained by the $i$th basic classifier, as defined by Equation (A5) in Appendix A; and $w_i$ is the weight factor, which in this study was assigned the value of the success rate obtained by the $i$th single basic classifier ℂi, as discussed further below.
The query protein P is then predicted to belong to the fold type for which the score of Equation (8) is the highest; i.e. suppose
$$Y(\mathbf{P}, S_\mu) = \mathrm{Max}\left\{\, Y(\mathbf{P}, S_1),\; Y(\mathbf{P}, S_2),\; \ldots,\; Y(\mathbf{P}, S_{27}) \,\right\} \tag{9}$$
where the operator Max means taking the maximum among the terms in the brackets, and the subscript μ is the fold type predicted for the query protein P. If there is a tie, the prediction is not uniquely determined and the query protein is randomly assigned among the tied fold types, but such cases rarely occur.
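As a minimal sketch of this fusion step, the following code (with our own hypothetical names fuse_and_predict, beliefs and weights) implements Equations (8) and (9): the per-fold scores are weighted sums of the basic classifiers' belief values, and the predicted fold is the one with the maximum fused score.

```python
# beliefs[i][j] stands for R_i(P, S_j), the supporting degree of basic
# classifier i for fold type j; weights[i] is that classifier's success rate.
def fuse_and_predict(beliefs, weights):
    n_folds = len(beliefs[0])                          # 27 fold types here
    fused = [sum(w * b[j] for w, b in zip(weights, beliefs))
             for j in range(n_folds)]                  # Equation (8)
    return max(range(n_folds), key=fused.__getitem__)  # Equation (9): argmax

# Toy usage: three classifiers, four fold types
beliefs = [[0.1, 0.6, 0.2, 0.1],
           [0.3, 0.3, 0.3, 0.1],
           [0.0, 0.5, 0.4, 0.1]]
weights = [0.40, 0.44, 0.42]     # single-classifier success rates as weights
print(fuse_and_predict(beliefs, weights))   # -> 1 (the second fold type)
```

Note that max() here simply returns the first maximum; the rare ties mentioned above would instead require a random assignment among the tied fold types.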

RESULTS AND DISCUSSION

To demonstrate the power of the ensemble classifier, predictions were conducted on the same training and testing datasets used by previous investigators (Chung and Huang, 2003; Ding and Dubchak, 2001). None of the proteins in these datasets has >35% sequence identity to any other, and most of the proteins in the testing dataset have <25% sequence identity with those in the training dataset (Ding and Dubchak, 2001). The overall success rate of the ensemble classifier in recognizing the fold among the 27 fold types for the 383 proteins in the independent dataset is given in Table 3 where, to facilitate comparison, the success rates of the other approaches are also listed. As can be seen from Table 3, the ensemble classifier, formed by fusing nine basic classifiers, clearly outperformed the other approaches.

Table 3

Overall success rates by different approaches in recognizing the fold types for proteins in the independent testing dataset

Classifier                                                          Success rate (%)
MLP (Multi-Layer Perceptron) (Chung and Huang, 2003)                48.8
GRNN (General Regression Neural Networks) (Chung and Huang, 2003)   44.2
RBFN (Radial Basis Function Networks) (Chung and Huang, 2003)       49.4
NN (Neural Networks)^a (Ding and Dubchak, 2001)                     41.8
SVM (Support Vector Machines)^b (Ding and Dubchak, 2001)            45.2
SVM^c (Ding and Dubchak, 2001)                                      51.1
SVM^d (Ding and Dubchak, 2001)                                      56.0
Ensemble classifier^e                                               62.1

^a The training method for the NN is 'one against others'.

^b The training method for the SVM is 'one against others'.

^c The training method for the SVM is 'unique one against others'.

^d The training method for the SVM is 'all against all'.

^e The ensemble classifier is constructed from nine OET-KNN classifiers [cf. Equation (7)]; the number of neighbors in each OET-KNN classifier is 8.


It is instructive to note that if each of the nine basic classifiers ℂ1, ℂ2, ℂ3, ℂ4, ℂ5, ℂ6, ℂ7, ℂ8 and ℂ9 were used alone to perform the same prediction, the success rates would be 0.40, 0.44, 0.40, 0.29, 0.42, 0.37, 0.32, 0.29 and 0.24, respectively. All of these are significantly lower than the rate of 0.62 = 62% obtained by the ensemble classifier (Table 3), indicating that a strong classifier can be generated by fusing many weak classifiers. Actually, as mentioned above, these single-classifier rates were the values assigned to the weights $w_i$ ($i = 1, 2, \ldots, 9$) in Equation (8) to form the ensemble classifier.

CONCLUSIONS

An ensemble classifier is formed by a set of basic classifiers whose individual outcomes are combined in some way, typically through weighted voting, to give a final determination in classifying a query sample. The current ensemble classifier consists of nine basic individual classifiers. Their common operation engine is the OET-KNN algorithm, but each was trained on one of nine different parameter systems extracted from the training dataset: the 22D PseAA, 28D PseAA, 48D PseAA, 80D PseAA, 21D predicted secondary structure, 21D hydrophobicity, 21D normalized van der Waals volume, 21D polarity and 21D polarizability.

It is instructive to note that although the operation engine adopted here for the basic classifiers is the OET-KNN algorithm, others, such as the covariant discriminant algorithm and the SVM algorithm, could also be used in place of OET-KNN to form different ensemble classifiers. Moreover, the constituent basic classifiers can be driven by completely different operation engines as well; an ensemble classifier thus formed would contain a mixture of operation engines. Similarly, an ensemble classifier can be designed by fusing both different input types and different operation engines. The present study shows that the ensemble classifier formed by fusing different input types, particularly pseudo-amino acid compositions of different dimensions [cf. Equation (5)], is very promising for enhancing the success rate in recognizing the fold types of proteins.

APPENDIX A

The optimized evidence-theoretic k-nearest neighbors (OET-KNN) classifier

For the reader's convenience, a brief introduction to the OET-KNN classifier is given below; for further explanation, refer to Shen and Chou (2005). Let us consider the problem of classifying N entities into 27 classes (fold types), which can be formulated as
$$F = \{\Phi_1, \Phi_2, \ldots, \Phi_{27}\} \tag{A1}$$
The available information is assumed to consist of a training dataset
$$\mathbb{N} = \{(\mathbf{P}_1, \theta_1), (\mathbf{P}_2, \theta_2), \ldots, (\mathbf{P}_N, \theta_N)\} \tag{A2}$$
where the N entities $\mathbf{P}_i$ ($i = 1, 2, \ldots, N$) have corresponding pattern (class) labels $\theta_i$ ($i = 1, 2, \ldots, N$) taking values in F of Equation (A1). According to the KNN (k-nearest neighbors) rule (Cover and Hart, 1967), an unclassified entity P is assigned to the class represented by the majority of its k nearest neighbors. Owing to its good performance and simplicity, the KNN rule, also known as the 'voting KNN rule', is quite popular in the pattern recognition community.
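As a point of reference, the classical voting KNN rule can be sketched in a few lines (knn_predict is our own hypothetical name; k = 8 matches the number of neighbors used by the basic classifiers in this study):

```python
from collections import Counter
import math

def knn_predict(train, labels, query, k=8):
    # Rank training points by Euclidean distance to the query
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    # Majority vote among the k nearest neighbors
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```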

The ET-KNN (evidence-theoretic k-nearest neighbors) rule is a pattern classification method based on the Dempster–Shafer theory of belief functions (Denoeux, 1995). In the classification process, each neighbor of a pattern to be classified is considered an item of evidence supporting certain hypotheses about the class membership of that pattern. Based on this evidence, basic belief masses are assigned to each subset concerned. Such masses are obtained for each of the k nearest neighbors of the pattern under consideration and aggregated using Dempster's rule of combination (Shafer, 1976). A decision is made by assigning a pattern to the class with the maximum credibility.

Suppose P is a query protein to be classified and $S_K^{\mathbf{P}}$ is the set of its k nearest neighbors in the training dataset ℕ of Equation (A2). Then, for any $\mathbf{P}_i \in S_K^{\mathbf{P}}$, the knowledge that $\mathbf{P}_i$ belongs to class $\Phi_\mu \in F$ can be regarded as a piece of evidence that increases our belief that P also belongs to $\Phi_\mu$. According to the basic belief assignment mapping theory (Shafer, 1976), this item of evidence can be formulated by
$$
\begin{cases}
m_i(\{\Phi_\mu\}) = \alpha_0 \exp\left[-\gamma_\mu D^2(\mathbf{P}_i, \mathbf{P})\right]\\[2pt]
m_i(F) = 1 - \alpha_0 \exp\left[-\gamma_\mu D^2(\mathbf{P}_i, \mathbf{P})\right]
\end{cases}
\tag{A3}
$$
where $\alpha_0$ is a fixed parameter, $\gamma_\mu$ is a parameter associated with class $\Phi_\mu$, and $D^2(\mathbf{P}_i, \mathbf{P})$ is the squared Euclidean distance between P and $\mathbf{P}_i$. The ET-KNN rule did not address how to select these parameters optimally. In 1998, an optimization procedure was proposed to determine optimal or near-optimal parameter values from the data by minimizing an error function (Zouhal and Denoeux, 1998). It was observed that the OET-KNN rule obtained through such an optimization treatment leads to a substantial improvement in classification accuracy. The optimal parameter thus obtained for $\alpha_0$ of Equation (A3) is 0.95, and those for $\gamma_\mu$ are given in Table A1.
Table A1

The optimal parameters γμ (μ = 1, 2, …, 27) in Equation (A3), obtained through the optimization procedure (Zouhal and Denoeux, 1998), for the nine basic individual classifiers in Equation (7)

       ℂ1      ℂ2      ℂ3      ℂ4      ℂ5      ℂ6      ℂ7      ℂ8      ℂ9
γ1     0.1028  0.0714  0.0398  0.0434  0.4848  0.1275  0.1977  0.1396  0.0682
γ2     0.2908  0.0727  0.0523  0.1324  0.0585  0.0784  0.3654  0.2218  0.0387
γ3     0.0656  0.0490  0.0240  0.0435  0.0469  0.0604  0.2482  0.2988  0.0445
γ4     0.0888  0.0480  0.0752  0.0597  0.4306  0.1541  0.4092  0.4026  0.0525
γ5     0.0798  0.0525  0.0274  0.0312  0.4146  0.1510  0.0866  0.1175  0.0543
γ6     0.0931  0.1176  0.0468  0.0581  0.1298  0.4586  0.0981  0.4580  0.1741
γ7     0.0954  0.0783  0.0520  0.0543  0.2871  0.0973  0.3723  0.1200  0.0456
γ8     0.1241  0.1013  0.0475  0.0462  0.1890  0.3587  0.5057  0.1365  0.0385
γ9     0.1476  0.1076  0.0700  0.0699  0.1386  0.1034  0.3606  0.1101  0.0425
γ10    0.1210  0.0787  0.0436  0.0448  0.4871  0.2142  0.5427  0.1234  0.0795
γ11    0.1002  0.0518  0.0265  0.0840  0.2862  0.2423  0.2530  0.3039  0.0436
γ12    0.1219  0.1014  0.0455  0.0594  0.3810  0.1911  0.3326  0.2352  0.0731
γ13    0.1331  0.0969  0.0449  0.0375  0.2096  0.1164  0.3723  0.1064  0.0645
γ14    0.1033  0.0899  0.0479  0.0484  0.0702  0.2026  0.0665  0.4069  0.1816
γ15    0.1108  0.0875  0.0523  0.0317  0.0961  0.1280  0.4249  0.1442  0.0407
γ16    0.1543  0.1146  0.0687  0.0578  0.1376  0.1298  0.1346  0.1472  0.0505
γ17    0.1665  0.1026  0.0706  0.1815  0.4662  0.1505  0.5627  0.1430  0.0419
γ18    0.6170  0.1548  0.1543  0.0502  0.0977  0.3061  0.1003  0.1334  0.0511
γ19    0.1664  0.1190  0.0616  0.0644  0.5661  0.1416  0.5118  0.1443  0.0419
γ20    0.1478  0.1097  0.0695  0.0682  0.1374  0.1119  0.4803  0.1240  0.0496
γ21    0.1183  0.0812  0.0479  0.1321  0.4207  0.2018  0.2975  0.1517  0.0409
γ22    0.1629  0.1226  0.0759  0.2221  0.1331  0.2484  0.1513  0.5593  0.0998
γ23    0.1553  0.1132  0.0716  0.0663  0.1950  0.1440  0.1581  0.1527  0.0662
γ24    0.1704  0.1144  0.0671  0.0577  0.1430  0.1243  0.1379  0.1454  0.0566
γ25    0.1313  0.1065  0.0519  0.1595  0.2768  0.1490  0.3159  0.1897  0.1788
γ26    0.1517  0.0953  0.0303  0.0831  0.4289  0.3329  0.3937  0.4259  0.0466
γ27    0.0262  0.0153  0.0133  −0.002  0.0093  0.0403  0.2920  0.0425  0.0492
The belief function for P belonging to class $\Phi_\mu$ is obtained by combining the items of evidence from its k nearest neighbors, and can be formulated as
$$m = m_1 \oplus m_2 \oplus \cdots \oplus m_K = \bigoplus_{i=1}^{K} m_i \tag{A4}$$
where ⊕ denotes the orthogonal sum, which is commutative and associative. According to Dempster's rule (Shafer, 1976), the combination in Equation (A4) can be expressed as
$$
(m_a \oplus m_b)(A) = \frac{\displaystyle\sum_{B \cap C = A} m_a(B)\, m_b(C)}{1 - \displaystyle\sum_{B \cap C = \emptyset} m_a(B)\, m_b(C)},
\qquad \emptyset \neq A \subseteq F
\tag{A5}
$$
where A, B and C are subsets of F, and ⊆, ∩ and ∅ are the set-theory symbols for 'contained in', 'intersection' and the empty set, respectively; the belief assigned to a singleton class $\{\Phi_\mu\}$ is obtained by applying this combination successively over all the neighbors in $S_K^{\mathbf{P}}$.
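A minimal sketch of Equations (A3)–(A5) follows, relying on the fact that each neighbor's mass function has only two focal elements, the singleton {Φμ} and the whole frame F, so the orthogonal sum can be accumulated iteratively with normalization deferred to the end. The function name et_knn_belief, the default γ value and the toy neighbor list are our own illustrations; the actual γ values are those of Table A1.

```python
import math

def et_knn_belief(neighbors, n_classes, alpha0=0.95, gamma=None):
    """Combine the evidence of the k nearest neighbors (Equations A3-A5).

    neighbors: list of (class_index, squared_distance) for the k nearest
    training proteins; gamma: per-class parameters (cf. Table A1), here
    defaulting to 0.1 for illustration. Returns normalized beliefs per class.
    """
    gamma = gamma or [0.1] * n_classes
    # Unnormalized masses: one slot per singleton class plus one for the
    # whole frame F (total ignorance). Start from the vacuous BBA m(F) = 1.
    m = [0.0] * n_classes
    m_F = 1.0
    for cls, d2 in neighbors:
        a = alpha0 * math.exp(-gamma[cls] * d2)   # Equation (A3)
        # Dempster's rule for a BBA with focal elements {Phi_cls} and F:
        new_m = [mj * (1 - a) for mj in m]        # {Phi_j} meets F, j != cls
        new_m[cls] = m[cls] + a * m_F             # {Phi_cls} absorbs mass from F
        m, m_F = new_m, m_F * (1 - a)             # F meets F
    total = sum(m) + m_F                          # normalize away the conflict
    return [mj / total for mj in m]

# Toy usage: 8 neighbors among 27 classes; the query is assigned to the class
# with the maximum combined belief (Equation A6).
nbrs = [(2, 0.4), (2, 0.7), (5, 0.9), (2, 1.1),
        (7, 1.3), (5, 1.5), (2, 1.8), (9, 2.0)]
beliefs = et_knn_belief(nbrs, n_classes=27)
print(max(range(27), key=beliefs.__getitem__))    # -> 2
```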
A decision is made by assigning the query protein P to the class for which the belief, or credibility, function of Equation (A5) attains its maximum value; i.e. if
$$\mathrm{Bel}(\{\Phi_\mu\}) = \mathrm{Max}\left\{\, \mathrm{Bel}(\{\Phi_1\}),\; \mathrm{Bel}(\{\Phi_2\}),\; \ldots,\; \mathrm{Bel}(\{\Phi_{27}\}) \,\right\} \tag{A6}$$
where μ = 1, 2, …, or 27 and the operator Max means taking the maximum among the terms in the brackets, then Φμ is the class predicted for the query protein.

Conflict of Interest: none declared.

REFERENCES

Andreeva,A. et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, D226–D229.
Cai,Y.D. (2001) Is it a paradox or misinterpretation. Proteins, 43, 336–338.
Chou,J.J. and Zhang,C.T. (1993) A joint prediction of the folding types of 1490 human proteins from their genetic codons. J. Theor. Biol., 161, 251–262.
Chou,K.C. (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins, 21, 319–344.
Chou,K.C. (1999) A key driving force in determination of protein structural classes. Biochem. Biophys. Res. Commun., 264, 216–224.
Chou,K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins, 43, 246–255. (Erratum: ibid., 2001, 44, 60.)
Chou,K.C. (2004) Review: structural bioinformatics and its impact to biomedical science. Curr. Med. Chem., 11, 2105–2134.
Chou,K.C. (2005) Using amphiphilic pseudo-amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21, 10–19.
Chou,K.C. and Cai,Y.D. (2005) Prediction of membrane protein types by incorporating amphipathic effects. J. Chem. Inform. Modeling, 45, 407–413.
Chou,K.C. and Carlacci,L. (1991) Energetic approach to the folding of alpha/beta barrels. Proteins, 9, 280–295.
Chou,K.C. and Maggiora,G.M. (1998) Domain structural class prediction. Protein Eng., 11, 523–538.
Chou,K.C. and Zhang,C.T. (1994) Predicting protein folding types by distance functions that make allowances for amino acid interactions. J. Biol. Chem., 269, 22014–22020.
Chou,K.C. and Zhang,C.T. (1995) Review: prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Chou,K.C. et al. (1982) Structure of beta-sheets: origin of the right-handed twist and of the increased stability of antiparallel over parallel sheets. J. Mol. Biol., 162, 89–112.
Chou,K.C. et al. (1984) Energetic approach to packing of a-helices: 2. General treatment of nonequivalent and nonregular helices. J. Am. Chem. Soc., 106, 3161–3170.
Chou,K.C. et al. (1990) Review: energetics of interactions of regular structural elements in proteins. Accounts Chem. Res., 23, 134–141.
Chou,K.C. et al. (1997) Disposition of amphiphilic helices in heteropolar environments. Proteins, 28, 99–108.
Chung,I.F. and Huang,C.D. (2003) Recognition of structure classification of protein folding by NN and SVM hierarchical learning architecture. In Kaynak,O., Alpaydin,E., Oja,E. and Xu,L. (eds), Lecture Notes in Computer Science, Vol. 2714. Springer, Istanbul, Turkey, pp. 1159–1167.
Cover,T.M. and Hart,P.E. (1967) Nearest neighbour pattern classification. IEEE Trans. Inform. Theory, IT-13, 21–27.
Denoeux,T. (1995) A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Trans. Syst. Man Cybern., 25, 804–813.
Ding,C.H. and Dubchak,I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349–358.
Dubchak,I. et al. (1995) Prediction of protein folding class using global description of amino acid sequence. Proc. Natl Acad. Sci. USA, 92, 8700–8704.
Dubchak,I. et al. (1999) Recognition of a protein fold in the context of the structural classification of proteins (SCOP) classification. Proteins, 35, 401–407.
Finkelstein,A.V. and Ptitsyn,O.B. (1987) Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. Mol. Biol., 50, 171–190.
Holm,L. and Sander,C. (1999) Protein folds and families: sequence and structure alignments. Nucleic Acids Res., 27, 244–247.
Hopp,T.P. and Woods,K.R. (1981) Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl Acad. Sci. USA, 78, 3824–3828.
Murzin,A.G. et al. (1995) SCOP: a structural classification of protein database for the investigation of sequence and structures. J. Mol. Biol., 247, 536–540.
Nakashima,H. et al. (1986) The folding type of a protein is relevant to the amino acid composition. J. Biochem., 99, 152–162.
Shafer,G. (1976) A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ.
Shen,H.B. and Chou,K.C. (2005) Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types. Biochem. Biophys. Res. Commun., 334, 288–292.
Tanford,C. (1962) Contribution of hydrophobic interactions to the stability of the globular conformation of proteins. J. Am. Chem. Soc., 84, 4240–4274.
Zhou,G.P. (1998) An intriguing controversy over protein structural class prediction. J. Protein Chem., 17, 729–738.
Zhou,G.P. and Assa-Munt,N. (2001) Some insights into protein structural class prediction. Proteins, 44, 57–59.
Zhou,G.P. and Doctor,K. (2003) Subcellular location prediction of apoptosis proteins. Proteins, 50, 44–48.
Zouhal,L.M. and Denoeux,T. (1998) An evidence-theoretic K-NN rule with parameter optimization. IEEE Trans. Syst. Man Cybern., 28, 263–271.

Author notes

Associate Editor: Keith A Crandall