Abstract

Motivation: Prediction of protein folding patterns is one level deeper than that of protein structural classes, and hence is much more complicated and difficult. To deal with this challenging problem, an ensemble classifier was introduced. It was formed by a set of basic classifiers, each trained on a different parameter system, such as predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, and pseudo-amino acid compositions of different dimensions, all extracted from a training dataset. The operation engine for the constituent individual classifiers was the OET-KNN (optimized evidence-theoretic k-nearest neighbors) rule. Their outcomes were combined through weighted voting to give a final determination for classifying a query protein. The recognition task was to identify the true fold among 27 possible patterns.

Results: The overall success rate thus obtained was 62% for a testing dataset in which most of the proteins have <25% sequence identity with the proteins used to train the classifier. Such a rate is 6–21% higher than the corresponding rates obtained by various existing NN (neural networks) and SVM (support vector machines) approaches, implying that the ensemble classifier is very promising and might become a useful vehicle in protein science, as well as in proteomics and bioinformatics.

Availability: The ensemble classifier, called PFP-Pred, is available as a web-server at Author Webpage for public usage.

Contact:  lifesci-sjtu@san.rr.com

Supplementary information: Supplementary data are available on Bioinformatics online.

INTRODUCTION

The avalanche of protein sequences generated in the post-genomic era has challenged us to develop computational methods by which structural information can be extracted from sequence databases in a timely manner. Direct prediction of the three-dimensional (3D) structure of a protein from its sequence based on the least-free-energy principle is scientifically quite sound, and some encouraging results have already been obtained in elucidating the handedness problems and packing arrangements in proteins (see, e.g., Chou and Carlacci, 1991; Chou et al., 1982, 1984, 1990); nevertheless, predicting the overall fold this way is very difficult owing to the notorious local-minimum problem. Likewise, although the homology modeling approach can predict the 3D structure of a protein quite successfully (Chou, 2004; Holm and Sander, 1999), a hurdle arises when the query protein has no structure-known homologous protein in the existing databases. Facing this situation, can we find a different approach to predict the fold of a protein? In this paper, we resort to the taxonomic approach, whose underpinning is the assumption that the number of protein folds is limited (Chou and Zhang, 1995; Dubchak et al., 1999; Finkelstein and Ptitsyn, 1987; Murzin et al., 1995). Accordingly, predicting the 3D structure of a protein may first be converted to a classification problem, i.e. identifying which fold pattern it belongs to. The present study was initiated in an attempt to introduce a novel approach, the ensemble classifier, to recognize the fold pattern of a query protein.

MATERIALS AND METHODS

The working (training and testing) datasets studied here were taken from Ding and Dubchak (2001). The original training and testing datasets contain 313 and 385 proteins, respectively. Of these, however, two proteins (2SCMC and 2GPS) in the training dataset and two (2YHX_1 and 2YHX_2) in the testing dataset have no sequence records, and were therefore excluded from further consideration. Accordingly, we have 311 proteins in the training dataset and 383 proteins in the testing dataset. The names and sequences of the training and testing proteins are given in Online Supplementary Materials AI and AII, respectively. None of the proteins in the testing dataset has >35% sequence identity to those in the training dataset (Ding and Dubchak, 2001). According to the SCOP database (Andreeva et al., 2004; Murzin et al., 1995), the proteins in the training and testing datasets (Online Supplementary Materials A) were further classified into the following 27 fold types (Ding and Dubchak, 2001; Dubchak et al., 1995, 1999): (1) globin-like, (2) cytochrome c, (3) DNA-binding 3-helical bundle, (4) 4-helical up-and-down bundle, (5) 4-helical cytokines, (6) EF-hand, (7) immunoglobulin-like, (8) cupredoxins, (9) viral coat and capsid proteins, (10) conA-like lectin/glucanases, (11) SH3-like barrel, (12) OB-fold, (13) beta-trefoil, (14) trypsin-like serine proteases, (15) lipocalins, (16) (TIM)-barrel, (17) FAD (also NAD)-binding motif, (18) flavodoxin-like, (19) NAD(P)-binding Rossmann-fold, (20) P-loop, (21) thioredoxin-like, (22) ribonuclease H-like motif, (23) hydrolases, (24) periplasmic binding protein-like, (25) β-grasp, (26) ferredoxin-like and (27) small inhibitors, toxins, lectins. Of these 27 fold types, types 1–6 belong to the all-α structural class, types 7–15 to the all-β class, types 16–24 to the α/β class and types 25–27 to the α+β class. The classification into 27 folds is therefore one level deeper than that into the 4 structural classes (Cai, 2001; Chou and Zhang, 1995; Zhou, 1998; Zhou and Assa-Munt, 2001), and it is naturally more challenging and difficult to conduct prediction among the 27 fold types than among the 4 structural classes (Chou, 1995; Chou and Maggiora, 1998).

To deal with the problem, Ding and Dubchak (2001) extracted the following six features from protein sequences: (1) amino acid composition, (2) predicted secondary structure, (3) hydrophobicity, (4) normalized van der Waals volume, (5) polarity and (6) polarizability. Of these six features, only the amino acid composition contains 20 components, each representing the occurrence frequency of one of the 20 native amino acids in a given protein (Chou and Zhang, 1994; Zhou and Doctor, 2003). Each of the other five features contains 3 + 3 + 5 × 3 = 21 components, as detailed in Ding and Dubchak (2001) and Dubchak et al. (1999). Based on these multiple parameter sets and a majority voting rule trained on the proteins in the training dataset, an overall success rate of 56% was reported (Ding and Dubchak, 2001) in predicting the fold types of the proteins in the testing dataset.
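To make the 21-component encoding concrete, the following minimal sketch shows such a feature vector for a single property, assuming the composition/transition/distribution scheme of Dubchak et al. (3 + 3 + 15 components), in which the 20 amino acids are partitioned into three property groups. The grouping shown for hydrophobicity and the exact distribution indexing are illustrative only; the authoritative definitions are those of Ding and Dubchak (2001) and Dubchak et al. (1999).

```python
# Sketch: composition (3) + transition (3) + distribution (5 x 3) = 21 components
# for one physicochemical property. The three-group assignment is illustrative.
HYDROPHOBICITY_GROUP = {          # 1 = polar, 2 = neutral, 3 = hydrophobic
    **dict.fromkeys("RKEDQN", 1),
    **dict.fromkeys("GASTPHY", 2),
    **dict.fromkeys("CLVIMFW", 3),
}

def ctd_features(seq, group=HYDROPHOBICITY_GROUP):
    labels = [group[aa] for aa in seq]
    n = len(labels)
    # Composition: fraction of residues falling in each of the 3 groups
    comp = [labels.count(g) / n for g in (1, 2, 3)]
    # Transition: frequency of adjacent residue pairs crossing between two groups
    pairs = list(zip(labels, labels[1:]))
    trans = [sum(1 for a, b in pairs if {a, b} == {g1, g2}) / (n - 1)
             for g1, g2 in ((1, 2), (1, 3), (2, 3))]
    # Distribution: chain position (normalized by n) of the 1st, 25%, 50%, 75%
    # and 100% occurrence of each group -> 5 x 3 = 15 components
    dist = []
    for g in (1, 2, 3):
        pos = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(int(frac * len(pos)) - 1, 0) if pos else 0
            dist.append(pos[k] / n if pos else 0.0)
    return comp + trans + dist    # 21 numbers in total
```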

In the present study, to avoid completely ignoring the sequence-order effects, the pseudo-amino acid composition (Chou, 2001) was used to replace the conventional amino acid composition (Chou and Zhang, 1993; Nakashima et al., 1986) used in Ding and Dubchak (2001). However, rather than using a combined correlation function (Chou, 2001), here the alternating correlation functions of hydrophobicity and hydrophilicity (Chou, 2005; Chou and Cai, 2005) are adopted to reflect the sequence-order effects. For the reader's convenience, a brief introduction to the amphiphilic pseudo-amino acid composition (PseAA) is given below.

Suppose a protein P with a sequence of L amino acid residues:
$$\mathbf{P} = R_1 R_2 R_3 R_4 R_5 \cdots R_L \tag{1}$$
where $R_1$ represents the residue at chain position 1, $R_2$ at position 2, and so forth. The hydrophobicity and hydrophilicity of the constituent amino acids of a protein play a very important role in its folding; e.g. many helices in proteins are amphiphilic, being formed by hydrophobic and hydrophilic amino acids arranged in a special order along the helix chain, as illustrated by the 'wenxiang' diagram (Chou et al., 1997). Therefore, these two indices may be among the optimal choices for reflecting sequence-order effects. In view of this, the sequence-order effects can be indirectly and partially, but quite effectively, reflected through the following equations (Fig. 1):
$$
\left\{
\begin{aligned}
\tau_1 &= \frac{1}{L-1}\sum_{i=1}^{L-1} H^{1}_{i,i+1}\\
\tau_2 &= \frac{1}{L-1}\sum_{i=1}^{L-1} H^{2}_{i,i+1}\\
\tau_3 &= \frac{1}{L-2}\sum_{i=1}^{L-2} H^{1}_{i,i+2}\\
\tau_4 &= \frac{1}{L-2}\sum_{i=1}^{L-2} H^{2}_{i,i+2}\\
&\;\;\vdots\\
\tau_{2\lambda-1} &= \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda} H^{1}_{i,i+\lambda}\\
\tau_{2\lambda} &= \frac{1}{L-\lambda}\sum_{i=1}^{L-\lambda} H^{2}_{i,i+\lambda}
\end{aligned}
\right.
\qquad (\lambda < L)
\tag{2}
$$
where $H^{1}_{i,j}$ and $H^{2}_{i,j}$ are the hydrophobicity and hydrophilicity correlation functions given by
$$
\begin{cases}
H^{1}_{i,j} = h_{1}(R_i)\cdot h_{1}(R_j)\\[2pt]
H^{2}_{i,j} = h_{2}(R_i)\cdot h_{2}(R_j)
\end{cases}
\tag{3}
$$
where $h_1(R_i)$ and $h_2(R_i)$ are, respectively, the hydrophobicity and hydrophilicity values of the $i$th ($i = 1, 2, \ldots, L$) amino acid in Equation (1), and the dot (·) denotes multiplication. In Equation (2), $\tau_1$ and $\tau_2$ are called the 1st-tier correlation factors, which reflect the sequence-order correlation between all the most contiguous residues along a protein chain through hydrophobicity and hydrophilicity, respectively [Fig. 1(a1) and (a2)]; $\tau_3$ and $\tau_4$ are the corresponding 2nd-tier correlation factors between all the 2nd most contiguous residues [Fig. 1(b1) and (b2)]; and so forth. Note that before the hydrophobicity and hydrophilicity values were substituted into Equation (3), they were all subjected to a standard conversion described by the following equation:
$$
h_1(R_i) = \frac{h_1^0(R_i) - \langle h_1^0 \rangle}{\mathrm{SD}(h_1^0)},
\qquad
h_2(R_i) = \frac{h_2^0(R_i) - \langle h_2^0 \rangle}{\mathrm{SD}(h_2^0)}
\tag{4}
$$
where $h_1^0(R_i)$ and $h_2^0(R_i)$ represent the original hydrophobicity value (Tanford, 1962) and hydrophilicity value (Hopp and Woods, 1981) of amino acid $R_i$, respectively (Table 1); $\langle h_1^0 \rangle$ and $\langle h_2^0 \rangle$ are their means over the 20 native amino acids; and $\mathrm{SD}(h_1^0)$ and $\mathrm{SD}(h_2^0)$ are their standard deviations. The converted hydrophobicity and hydrophilicity values obtained by Equation (4) have a zero mean over the 20 native amino acids and remain unchanged if put through the same conversion again. As can be seen from Equations (1)–(4), as well as Figure 1, a considerable amount of sequence-order information is incorporated into the $2\lambda$ correlation factors through the hydrophobicity and hydrophilicity values of the amino acid residues along a protein chain. By fusing the $2\lambda$ amphiphilic correlation factors into the classical amino acid composition, we obtain the following augmented discrete form to represent a protein sample P:
$$\mathbf{P} = \left[\, p_1,\; p_2,\; \ldots,\; p_{20},\; p_{20+1},\; \ldots,\; p_{20+2\lambda} \,\right]^{\mathrm{T}} \tag{5}$$
where
$$
p_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & 1 \le u \le 20\\[10pt]
\dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & 20+1 \le u \le 20+2\lambda
\end{cases}
\tag{6}
$$
where $f_i$ ($i = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in the protein P, $\tau_j$ is the $j$th sequence-correlation factor computed according to Equation (2), and $w$ is the weight factor. In the current study we chose $w = 0.5$, which keeps the components of Equation (6) within a range that is easy to handle (w can of course be assigned other values, but this would not substantially change the final results). Therefore, the first 20 numbers in Equation (5) represent the classic amino acid composition, and the next $2\lambda$ discrete numbers reflect the amphiphilic sequence correlation along the protein chain. Such a protein representation is called the 'amphiphilic pseudo-amino acid composition'; it has the same form as the conventional amino acid composition, but contains much more information. It is through the $2\lambda$ pseudo-amino acid components that the sequence order of a protein chain and the distribution of the hydrophobic and hydrophilic amino acids along the chain are indirectly and partially reflected. It should be pointed out that, by definition, all components of the classical amino acid composition must be ≥0; this is not always true for the pseudo-amino acid composition, whose components corresponding to the sequence-correlation factors may also be <0.
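The computation in Equations (2)–(6) is straightforward to express in code. Below is a minimal sketch (our own code, not the authors' implementation; the names standardize and pseaa are ours), which returns the (20 + 2λ)-D vector of Equation (5) from a sequence, using the Table 1 property values and the standard conversion of Equation (4):

```python
import math

# Table 1: Tanford (1962) hydrophobicity and Hopp-Woods (1981) hydrophilicity
TANFORD_H1 = {'A': 0.62, 'C': 0.29, 'D': -0.90, 'E': -0.74, 'F': 1.19,
              'G': 0.48, 'H': -0.40, 'I': 1.38, 'K': -1.50, 'L': 1.06,
              'M': 0.64, 'N': -0.78, 'P': 0.12, 'Q': -0.85, 'R': -2.53,
              'S': -0.18, 'T': -0.05, 'V': 1.08, 'W': 0.81, 'Y': 0.26}
HOPP_H2 = {'A': -0.5, 'C': -1.0, 'D': 3.0, 'E': 3.0, 'F': -2.5,
           'G': 0.0, 'H': -0.5, 'I': -1.8, 'K': 3.0, 'L': -1.8,
           'M': -1.3, 'N': 2.0, 'P': 0.0, 'Q': 0.2, 'R': 3.0,
           'S': 0.3, 'T': -0.4, 'V': -1.5, 'W': -3.4, 'Y': -2.3}

def standardize(h):
    """Equation (4): zero mean, unit SD over the 20 native amino acids."""
    vals = list(h.values())
    mean = sum(vals) / 20
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / 20)
    return {aa: (v - mean) / sd for aa, v in h.items()}

def pseaa(seq, lam=4, w=0.5):
    """Amphiphilic PseAA, Equations (2)-(6): a (20 + 2*lam)-D vector (lam < len(seq))."""
    h1, h2 = standardize(TANFORD_H1), standardize(HOPP_H2)
    L = len(seq)
    taus = []
    for k in range(1, lam + 1):            # k-th tier correlation factors, Equation (2)
        t1 = sum(h1[seq[i]] * h1[seq[i + k]] for i in range(L - k)) / (L - k)
        t2 = sum(h2[seq[i]] * h2[seq[i + k]] for i in range(L - k)) / (L - k)
        taus += [t1, t2]                   # 2*lam factors in total
    freqs = [seq.count(aa) / L for aa in sorted(TANFORD_H1)]
    denom = sum(freqs) + w * sum(taus)     # Equation (6); sum(freqs) is 1 here
    return [f / denom for f in freqs] + [w * t / denom for t in taus]
```

With λ = 1, 4, 14 and 30, this yields the 22D, 28D, 48D and 80D representations listed in Table 2 (the sequence must be longer than λ for Equation (2) to be defined).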
Fig. 1. A schematic drawing to show the amphiphilic correlation along a protein chain, where the values of $H^{1}_{i,j}$ and $H^{2}_{i,j}$ are given by Equations (3) and (4) and Table 1. The correlation via hydrophobicity is shown in red, and the correlation via hydrophilicity in blue (a colour version of this figure appears in the Supplementary data). Panels (a1) and (a2) reflect the coupling mode between all the most contiguous residues, panels (b1) and (b2) that between all the 2nd most contiguous residues, and panels (c1) and (c2) that between all the 3rd most contiguous residues.

Table 1

The amino acid parameters used for deriving the amphiphilic pseudo-amino acid components [cf. Equation (4)]

Code   Hydrophobicity^a h1_0   Hydrophilicity^b h2_0
A       0.62                   −0.5
C       0.29                   −1.0
D      −0.90                    3.0
E      −0.74                    3.0
F       1.19                   −2.5
G       0.48                    0.0
H      −0.40                   −0.5
I       1.38                   −1.8
K      −1.50                    3.0
L       1.06                   −1.8
M       0.64                   −1.3
N      −0.78                    2.0
P       0.12                    0.0
Q      −0.85                    0.2
R      −2.53                    3.0
S      −0.18                    0.3
T      −0.05                   −0.4
V       1.08                   −1.5
W       0.81                   −3.4
Y       0.26                   −2.3

^a The hydrophobicity values were taken from Tanford (1962).

^b The hydrophilicity values were taken from Hopp and Woods (1981).


In this study, the OET-KNN (optimized evidence-theoretic k-nearest neighbors) algorithm was adopted as the operation engine of a classifier (Shen and Chou, 2005). For the reader's convenience, a brief introduction to the OET-KNN classifier and its key equations is given in Appendix A. However, quite different from the case of Shen and Chou (2005), we now have many different input types: the (20+2λ)D PseAA, 21D predicted secondary structure, 21D hydrophobicity, 21D normalized van der Waals volume, 21D polarity and 21D polarizability (Ding and Dubchak, 2001). Since a basic classifier is defined by one operation engine and one input type, one way to use the information from the multiple input types would be to concatenate the above six into a single [(21×5)+(20+2λ)]D vector. However, doing so would introduce too many parameters into the input, thereby reducing the cluster-tolerant capacity (Chou, 1999) and the cross-validation success rate. Furthermore, a PseAA with a different value of λ constitutes a different input type. In the present study, λ was assigned the values 1, 4, 14 and 30. Therefore, we are actually facing 5 + 4 = 9 different input types (Table 2), and hence have nine basic classifiers. To deal with this situation, we introduce an ensemble classifier, by which not only the other five features described in Ding and Dubchak (2001) but also the pseudo-amino acid compositions with different λ values can be automatically fused into one prediction system, as sketched below.
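The dimension bookkeeping behind these nine input types can be summarized in a small sketch (variable names are our own; only the dimensions, not the feature extraction itself, are shown):

```python
# Sketch: the nine input types of Table 2; each basic OET-KNN classifier
# operates on exactly one of these representations.
PSEAA_LAMBDAS = [1, 4, 14, 30]
input_types = [(f"PseAA (lambda={lam})", 20 + 2 * lam) for lam in PSEAA_LAMBDAS]
input_types += [(name, 21) for name in (
    "predicted secondary structure", "hydrophobicity",
    "normalized van der Waals volume", "polarity", "polarizability")]
assert [dim for _, dim in input_types] == [22, 28, 48, 80, 21, 21, 21, 21, 21]
# By contrast, concatenating the five 21D features with a single PseAA would
# give one long [(21 * 5) + (20 + 2 * lambda)]-D input vector:
concat_dim = 21 * 5 + (20 + 2 * 30)   # = 185 for lambda = 30
```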

Table 2

List of nine features extracted from protein sequences for fold recognition

Feature                            Dimension
Pseudo-amino acid composition^a    22
Pseudo-amino acid composition^b    28
Pseudo-amino acid composition^c    48
Pseudo-amino acid composition^d    80
Predicted secondary structure      21
Hydrophobicity                     21
Normalized van der Waals volume    21
Polarity                           21
Polarizability                     21

^a The effects of the first rank of sequence-order correlation are incorporated [cf. Equation (5) with λ = 1].

^b The effects of the first 4 ranks of sequence-order correlation are incorporated [cf. Equation (5) with λ = 4].

^c The effects of the first 14 ranks of sequence-order correlation are incorporated [cf. Equation (5) with λ = 14].

^d The effects of the first 30 ranks of sequence-order correlation are incorporated [cf. Equation (5) with λ = 30].


The ensemble classifier framework was established by combining numerous basic classifiers in order to reduce the variance caused by the peculiarities of a single training set, and hence to learn a more expressive classification concept than any single classifier. Illustrated in Figure 2 is the basic framework of an ensemble classifier consisting of Ω = 9 basic classifiers. The final output of the ensemble is the weighted fusion of the outputs produced by the nine individual classifiers, as formulated below.

Fig. 2. Flowchart to show how the ensemble classifier ℂ [Equation (7)] is formed by fusing Ω = 9 basic individual classifiers: ℂ1, ℂ2, …, ℂ9. A colour version of this figure appears in the Supplementary data.

Suppose the ensemble classifier ℂ is expressed by
$$\mathbb{C} = \mathbb{C}_1 \oplus \mathbb{C}_2 \oplus \cdots \oplus \mathbb{C}_9 = \bigoplus_{i=1}^{9} \mathbb{C}_i \tag{7}$$
where ℂ1, ℂ2, …, ℂ9 represent the nine basic OET-KNN classifiers (Appendix A), each operating on the input derived from one of the nine features listed in Table 2; i.e. classifier ℂ1 operates on the 22D PseAA, ℂ2 on the 28D PseAA, ℂ3 on the 48D PseAA, ℂ4 on the 80D PseAA, ℂ5 on the 21D predicted secondary structure, ℂ6 on the 21D hydrophobicity, ℂ7 on the 21D normalized van der Waals volume, ℂ8 on the 21D polarity, and ℂ9 on the 21D polarizability. In Equation (7) the symbol ⊕ denotes the fusing operator. For the reader's convenience, the values of the nine input parameter systems (cf. Table 2) for each of the proteins in the training and testing datasets are given in Online Supplementary Materials BI and BII, respectively.
The process by which the ensemble classifier ℂ fuses the nine basic classifiers ℂi (i = 1, 2, …, 9) can be formulated as follows. Suppose
$$Y(\mathbf{P}, S_j) = \sum_{i=1}^{9} w_i\, \mathbb{R}_i(\mathbf{P}, S_j), \qquad j = 1, 2, \ldots, 27 \tag{8}$$
where $S_1$ is the set containing only proteins of fold type 1, $S_2$ the set of fold type 2, and so forth; $\mathbb{R}_i(\mathbf{P}, S_j)$ is the belief function, or supporting degree, for P belonging to $S_j$ obtained by the $i$th basic classifier, as defined by Equation (A5) in Appendix A; and $w_i$ is the weight factor, which in this study was assigned the value of the success rate obtained by the $i$th single basic classifier ℂi, as discussed further below.
The query protein P is then predicted to belong to the fold type for which the score of Equation (8) is the highest; i.e. suppose
$$Y(\mathbf{P}, S_\mu) = \mathrm{Max}\left\{\, Y(\mathbf{P}, S_1),\; Y(\mathbf{P}, S_2),\; \ldots,\; Y(\mathbf{P}, S_{27}) \,\right\} \tag{9}$$
where the operator Max means taking the maximum among the terms in the brackets, and the subscript μ is the fold type predicted for the query protein P. If there is a tie, the prediction is not uniquely determined and the query protein is randomly assigned among the tied fold types, but such cases rarely occur.
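As a minimal sketch of this fusion step, the following code (with our own hypothetical names fuse_and_predict, beliefs and weights) implements Equations (8) and (9): the per-fold scores are weighted sums of the basic classifiers' belief values, and the predicted fold is the one with the maximum fused score.

```python
# beliefs[i][j] stands for R_i(P, S_j), the supporting degree of basic
# classifier i for fold type j; weights[i] is that classifier's success rate.
def fuse_and_predict(beliefs, weights):
    n_folds = len(beliefs[0])                          # 27 fold types here
    fused = [sum(w * b[j] for w, b in zip(weights, beliefs))
             for j in range(n_folds)]                  # Equation (8)
    return max(range(n_folds), key=fused.__getitem__)  # Equation (9): argmax

# Toy usage: three classifiers, four fold types
beliefs = [[0.1, 0.6, 0.2, 0.1],
           [0.3, 0.3, 0.3, 0.1],
           [0.0, 0.5, 0.4, 0.1]]
weights = [0.40, 0.44, 0.42]     # single-classifier success rates as weights
print(fuse_and_predict(beliefs, weights))   # -> 1 (the second fold type)
```

Note that max() here simply returns the first maximum; the rare ties mentioned above would instead require a random assignment among the tied fold types.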

RESULTS AND DISCUSSION

To demonstrate the power of the ensemble classifier, predictions were conducted on the same training and testing datasets used by previous investigators (Chung and Huang, 2003; Ding and Dubchak, 2001). None of the proteins in these datasets has >35% sequence identity to any other, and most of the proteins in the testing dataset have <25% sequence identity with those in the training dataset (Ding and Dubchak, 2001). The overall success rate of the ensemble classifier in recognizing the fold among the 27 fold types for the 383 proteins in the independent dataset is given in Table 3 where, to facilitate comparison, the success rates of the other approaches are also listed. As can be seen from Table 3, the ensemble classifier, formed by fusing nine basic classifiers, clearly outperformed the other approaches.

Table 3

Overall success rates by different approaches in recognizing the fold types for proteins in the independent testing dataset

Classifier                                                          Success rate (%)
MLP (Multi-Layer Perceptron) (Chung and Huang, 2003)                48.8
GRNN (General Regression Neural Networks) (Chung and Huang, 2003)   44.2
RBFN (Radial Basis Function Networks) (Chung and Huang, 2003)       49.4
NN (Neural Networks)^a (Ding and Dubchak, 2001)                     41.8
SVM (Support Vector Machines)^b (Ding and Dubchak, 2001)            45.2
SVM^c (Ding and Dubchak, 2001)                                      51.1
SVM^d (Ding and Dubchak, 2001)                                      56.0
Ensemble classifier^e                                               62.1

^a The training method for the NN is 'one against others'.

^b The training method for the SVM is 'one against others'.

^c The training method for the SVM is 'unique one against others'.

^d The training method for the SVM is 'all against all'.

^e The ensemble classifier is constructed from nine OET-KNN classifiers [cf. Equation (7)]; the number of neighbors in each OET-KNN classifier is 8.


It is instructive to note that if each of the nine basic classifiers ℂ1, ℂ2, ℂ3, ℂ4, ℂ5, ℂ6, ℂ7, ℂ8 and ℂ9 were used alone to perform the same prediction, the success rates would be 0.40, 0.44, 0.40, 0.29, 0.42, 0.37, 0.32, 0.29 and 0.24, respectively. All of these are significantly lower than the rate of 0.62 = 62% obtained by the ensemble classifier (Table 3), indicating that a strong classifier can be generated by fusing many weak classifiers. Actually, as mentioned above, these single-classifier rates were the values assigned to the weights $w_i$ ($i = 1, 2, \ldots, 9$) in Equation (8) to form the ensemble classifier.

CONCLUSIONS

An ensemble classifier is formed by a set of basic classifiers whose individual outcomes are combined in some way, typically through weighted voting, to give a final determination in classifying a query sample. The current ensemble classifier consists of nine basic individual classifiers. Their common operation engine is the OET-KNN algorithm, but each was trained on one of nine different parameter systems extracted from the training dataset: the 22D PseAA, 28D PseAA, 48D PseAA, 80D PseAA, 21D predicted secondary structure, 21D hydrophobicity, 21D normalized van der Waals volume, 21D polarity and 21D polarizability.

It is instructive to note that although the operation engine adopted here for the basic classifiers is the OET-KNN algorithm, others, such as the covariant discriminant algorithm and the SVM algorithm, could also be used in place of OET-KNN to form different ensemble classifiers. Moreover, the constituent basic classifiers can be driven by completely different operation engines as well; an ensemble classifier thus formed would contain a mixture of operation engines. Similarly, an ensemble classifier can be designed by fusing both different input types and different operation engines. The present study shows that the ensemble classifier formed by fusing different input types, particularly pseudo-amino acid compositions of different dimensions [cf. Equation (5)], is very promising for enhancing the success rate in recognizing the fold types of proteins.

APPENDIX A

The optimized evidence-theoretic k-nearest neighbors (OET-KNN) classifier

For the reader's convenience, a brief introduction to the OET-KNN classifier is given below; for further explanation, refer to Shen and Chou (2005). Let us consider the problem of classifying N entities into 27 classes (fold types), which can be formulated as
$$F = \{\Phi_1, \Phi_2, \ldots, \Phi_{27}\} \tag{A1}$$
The available information is assumed to consist of a training dataset
$$\mathbb{N} = \{(\mathbf{P}_1, \theta_1), (\mathbf{P}_2, \theta_2), \ldots, (\mathbf{P}_N, \theta_N)\} \tag{A2}$$
where the N entities $\mathbf{P}_i$ ($i = 1, 2, \ldots, N$) have corresponding pattern (class) labels $\theta_i$ ($i = 1, 2, \ldots, N$) taking values in F of Equation (A1). According to the KNN (k-nearest neighbors) rule (Cover and Hart, 1967), an unclassified entity P is assigned to the class represented by the majority of its k nearest neighbors. Owing to its good performance and simplicity, the KNN rule, also known as the 'voting KNN rule', is quite popular in the pattern recognition community.
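As a point of reference, the classical voting KNN rule can be sketched in a few lines (knn_predict is our own hypothetical name; k = 8 matches the number of neighbors used by the basic classifiers in this study):

```python
from collections import Counter
import math

def knn_predict(train, labels, query, k=8):
    # Rank training points by Euclidean distance to the query
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    # Majority vote among the k nearest neighbors
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```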

The ET-KNN (evidence-theoretic k-nearest neighbors) rule is a pattern classification method based on the Dempster–Shafer theory of belief functions (Denoeux, 1995). In the classification process, each neighbor of a pattern to be classified is considered an item of evidence supporting certain hypotheses about the class membership of that pattern. Based on this evidence, basic belief masses are assigned to each subset concerned. Such masses are obtained for each of the k nearest neighbors of the pattern under consideration and aggregated using Dempster's rule of combination (Shafer, 1976). A decision is made by assigning a pattern to the class with the maximum credibility.

Suppose P is a query protein to be classified and $S_K^{\mathbf{P}}$ is the set of its k nearest neighbors in the training dataset ℕ of Equation (A2). Then, for any $\mathbf{P}_i \in S_K^{\mathbf{P}}$, the knowledge that $\mathbf{P}_i$ belongs to class $\Phi_\mu \in F$ can be regarded as a piece of evidence that increases our belief that P also belongs to $\Phi_\mu$. According to the basic belief assignment mapping theory (Shafer, 1976), this item of evidence can be formulated by
$$
\begin{cases}
m_i(\{\Phi_\mu\}) = \alpha_0 \exp\left[-\gamma_\mu D^2(\mathbf{P}_i, \mathbf{P})\right]\\[2pt]
m_i(F) = 1 - \alpha_0 \exp\left[-\gamma_\mu D^2(\mathbf{P}_i, \mathbf{P})\right]
\end{cases}
\tag{A3}
$$
where $\alpha_0$ is a fixed parameter, $\gamma_\mu$ is a parameter associated with class $\Phi_\mu$, and $D^2(\mathbf{P}_i, \mathbf{P})$ is the squared Euclidean distance between P and $\mathbf{P}_i$. The ET-KNN rule did not address how to select these parameters optimally. In 1998, an optimization procedure was proposed to determine optimal or near-optimal parameter values from the data by minimizing an error function (Zouhal and Denoeux, 1998). It was observed that the OET-KNN rule obtained through such an optimization treatment leads to a substantial improvement in classification accuracy. The optimal parameter thus obtained for $\alpha_0$ of Equation (A3) is 0.95, and those for $\gamma_\mu$ are given in Table A1.
Table A1

The optimal parameters γμ (μ = 1, 2, …, 27) in Equation (A3), obtained through the optimization procedure (Zouhal and Denoeux, 1998), for the nine basic individual classifiers in Equation (7)

       ℂ1      ℂ2      ℂ3      ℂ4      ℂ5      ℂ6      ℂ7      ℂ8      ℂ9
γ1     0.1028  0.0714  0.0398  0.0434  0.4848  0.1275  0.1977  0.1396  0.0682
γ2     0.2908  0.0727  0.0523  0.1324  0.0585  0.0784  0.3654  0.2218  0.0387
γ3     0.0656  0.0490  0.0240  0.0435  0.0469  0.0604  0.2482  0.2988  0.0445
γ4     0.0888  0.0480  0.0752  0.0597  0.4306  0.1541  0.4092  0.4026  0.0525
γ5     0.0798  0.0525  0.0274  0.0312  0.4146  0.1510  0.0866  0.1175  0.0543
γ6     0.0931  0.1176  0.0468  0.0581  0.1298  0.4586  0.0981  0.4580  0.1741
γ7     0.0954  0.0783  0.0520  0.0543  0.2871  0.0973  0.3723  0.1200  0.0456
γ8     0.1241  0.1013  0.0475  0.0462  0.1890  0.3587  0.5057  0.1365  0.0385
γ9     0.1476  0.1076  0.0700  0.0699  0.1386  0.1034  0.3606  0.1101  0.0425
γ10    0.1210  0.0787  0.0436  0.0448  0.4871  0.2142  0.5427  0.1234  0.0795
γ11    0.1002  0.0518  0.0265  0.0840  0.2862  0.2423  0.2530  0.3039  0.0436
γ12    0.1219  0.1014  0.0455  0.0594  0.3810  0.1911  0.3326  0.2352  0.0731
γ13    0.1331  0.0969  0.0449  0.0375  0.2096  0.1164  0.3723  0.1064  0.0645
γ14    0.1033  0.0899  0.0479  0.0484  0.0702  0.2026  0.0665  0.4069  0.1816
γ15    0.1108  0.0875  0.0523  0.0317  0.0961  0.1280  0.4249  0.1442  0.0407
γ16    0.1543  0.1146  0.0687  0.0578  0.1376  0.1298  0.1346  0.1472  0.0505
γ17    0.1665  0.1026  0.0706  0.1815  0.4662  0.1505  0.5627  0.1430  0.0419
γ18    0.6170  0.1548  0.1543  0.0502  0.0977  0.3061  0.1003  0.1334  0.0511
γ19    0.1664  0.1190  0.0616  0.0644  0.5661  0.1416  0.5118  0.1443  0.0419
γ20    0.1478  0.1097  0.0695  0.0682  0.1374  0.1119  0.4803  0.1240  0.0496
γ21    0.1183  0.0812  0.0479  0.1321  0.4207  0.2018  0.2975  0.1517  0.0409
γ22    0.1629  0.1226  0.0759  0.2221  0.1331  0.2484  0.1513  0.5593  0.0998
γ23    0.1553  0.1132  0.0716  0.0663  0.1950  0.1440  0.1581  0.1527  0.0662
γ24    0.1704  0.1144  0.0671  0.0577  0.1430  0.1243  0.1379  0.1454  0.0566
γ25    0.1313  0.1065  0.0519  0.1595  0.2768  0.1490  0.3159  0.1897  0.1788
γ26    0.1517  0.0953  0.0303  0.0831  0.4289  0.3329  0.3937  0.4259  0.0466
γ27    0.0262  0.0153  0.0133  −0.002  0.0093  0.0403  0.2920  0.0425  0.0492
The belief function for P belonging to class $\Phi_\mu$ is obtained by combining the items of evidence from its k nearest neighbors, and can be formulated as
$$m = m_1 \oplus m_2 \oplus \cdots \oplus m_K = \bigoplus_{i=1}^{K} m_i \tag{A4}$$
where ⊕ denotes the orthogonal sum, which is commutative and associative. According to Dempster's rule (Shafer, 1976), the combination in Equation (A4) can be expressed as
$$
(m_a \oplus m_b)(A) = \frac{\displaystyle\sum_{B \cap C = A} m_a(B)\, m_b(C)}{1 - \displaystyle\sum_{B \cap C = \emptyset} m_a(B)\, m_b(C)},
\qquad \emptyset \neq A \subseteq F
\tag{A5}
$$
where A, B and C are subsets of F, and ⊆, ∩ and ∅ are the set-theory symbols for 'contained in', 'intersection' and the empty set, respectively; the belief assigned to a singleton class $\{\Phi_\mu\}$ is obtained by applying this combination successively over all the neighbors in $S_K^{\mathbf{P}}$.
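A minimal sketch of Equations (A3)–(A5) follows, relying on the fact that each neighbor's mass function has only two focal elements, the singleton {Φμ} and the whole frame F, so the orthogonal sum can be accumulated iteratively with normalization deferred to the end. The function name et_knn_belief, the default γ value and the toy neighbor list are our own illustrations; the actual γ values are those of Table A1.

```python
import math

def et_knn_belief(neighbors, n_classes, alpha0=0.95, gamma=None):
    """Combine the evidence of the k nearest neighbors (Equations A3-A5).

    neighbors: list of (class_index, squared_distance) for the k nearest
    training proteins; gamma: per-class parameters (cf. Table A1), here
    defaulting to 0.1 for illustration. Returns normalized beliefs per class.
    """
    gamma = gamma or [0.1] * n_classes
    # Unnormalized masses: one slot per singleton class plus one for the
    # whole frame F (total ignorance). Start from the vacuous BBA m(F) = 1.
    m = [0.0] * n_classes
    m_F = 1.0
    for cls, d2 in neighbors:
        a = alpha0 * math.exp(-gamma[cls] * d2)   # Equation (A3)
        # Dempster's rule for a BBA with focal elements {Phi_cls} and F:
        new_m = [mj * (1 - a) for mj in m]        # {Phi_j} meets F, j != cls
        new_m[cls] = m[cls] + a * m_F             # {Phi_cls} absorbs mass from F
        m, m_F = new_m, m_F * (1 - a)             # F meets F
    total = sum(m) + m_F                          # normalize away the conflict
    return [mj / total for mj in m]

# Toy usage: 8 neighbors among 27 classes; the query is assigned to the class
# with the maximum combined belief (Equation A6).
nbrs = [(2, 0.4), (2, 0.7), (5, 0.9), (2, 1.1),
        (7, 1.3), (5, 1.5), (2, 1.8), (9, 2.0)]
beliefs = et_knn_belief(nbrs, n_classes=27)
print(max(range(27), key=beliefs.__getitem__))    # -> 2
```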
A decision is made by assigning the query protein P to the class for which the belief, or credibility, function of Equation (A5) attains its maximum value; i.e. if
$$\mathrm{Bel}(\{\Phi_\mu\}) = \mathrm{Max}\left\{\, \mathrm{Bel}(\{\Phi_1\}),\; \mathrm{Bel}(\{\Phi_2\}),\; \ldots,\; \mathrm{Bel}(\{\Phi_{27}\}) \,\right\} \tag{A6}$$
where μ = 1, 2, …, or 27 and the operator Max means taking the maximum among the terms in the brackets, then Φμ is the class predicted for the query protein.

Conflict of Interest: none declared.

REFERENCES

Andreeva,A. et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, D226–D229.
Cai,Y.D. (2001) Is it a paradox or misinterpretation. Proteins, 43, 336–338.
Chou,J.J. and Zhang,C.T. (1993) A joint prediction of the folding types of 1490 human proteins from their genetic codons. J. Theor. Biol., 161, 251–262.
Chou,K.C. (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins, 21, 319–344.
Chou,K.C. (1999) A key driving force in determination of protein structural classes. Biochem. Biophys. Res. Commun., 264, 216–224.
Chou,K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins, 43, 246–255. (Erratum: ibid., 2001, 44, 60.)
Chou,K.C. (2004) Review: structural bioinformatics and its impact to biomedical science. Curr. Med. Chem., 11, 2105–2134.
Chou,K.C. (2005) Using amphiphilic pseudo-amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21, 10–19.
Chou,K.C. and Cai,Y.D. (2005) Prediction of membrane protein types by incorporating amphipathic effects. J. Chem. Inform. Modeling, 45, 407–413.
Chou,K.C. and Carlacci,L. (1991) Energetic approach to the folding of alpha/beta barrels. Proteins, 9, 280–295.
Chou,K.C. and Maggiora,G.M. (1998) Domain structural class prediction. Protein Eng., 11, 523–538.
Chou,K.C. and Zhang,C.T. (1994) Predicting protein folding types by distance functions that make allowances for amino acid interactions. J. Biol. Chem., 269, 22014–22020.
Chou,K.C. and Zhang,C.T. (1995) Review: prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Chou,K.C. et al. (1982) Structure of beta-sheets: origin of the right-handed twist and of the increased stability of antiparallel over parallel sheets. J. Mol. Biol., 162, 89–112.
Chou,K.C. et al. (1984) Energetic approach to packing of a-helices: 2. General treatment of nonequivalent and nonregular helices. J. Am. Chem. Soc., 106, 3161–3170.
Chou,K.C. et al. (1990) Review: energetics of interactions of regular structural elements in proteins. Accounts Chem. Res., 23, 134–141.
Chou,K.C. et al. (1997) Disposition of amphiphilic helices in heteropolar environments. Proteins, 28, 99–108.
Chung,I.F. and Huang,C.D. (2003) Recognition of structure classification of protein folding by NN and SVM hierarchical learning architecture. In Kaynak,O., Alpaydin,E., Oja,E. and Xu,L. (eds), Lecture Notes in Computer Science, Vol. 2714. Springer, Istanbul, Turkey, pp. 1159–1167.
Cover,T.M. and Hart,P.E. (1967) Nearest neighbour pattern classification. IEEE Trans. Inform. Theory, IT-13, 21–27.
Denoeux,T. (1995) A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Trans. Syst. Man Cybern., 25, 804–813.
Ding,C.H. and Dubchak,I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349–358.
Dubchak,I. et al. (1995) Prediction of protein folding class using global description of amino acid sequence. Proc. Natl Acad. Sci. USA, 92, 8700–8704.
Dubchak,I. et al. (1999) Recognition of a protein fold in the context of the structural classification of proteins (SCOP) classification. Proteins, 35, 401–407.
Finkelstein,A.V. and Ptitsyn,O.B. (1987) Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. Mol. Biol., 50, 171–190.
Holm,L. and Sander,C. (1999) Protein folds and families: sequence and structure alignments. Nucleic Acids Res., 27, 244–247.
Hopp,T.P. and Woods,K.R. (1981) Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl Acad. Sci. USA, 78, 3824–3828.
Murzin,A.G. et al. (1995) SCOP: a structural classification of protein database for the investigation of sequence and structures. J. Mol. Biol., 247, 536–540.
Nakashima,H. et al. (1986) The folding type of a protein is relevant to the amino acid composition. J. Biochem., 99, 152–162.
Shafer,G. (1976) A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ.
Shen,H.B. and Chou,K.C. (2005) Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types. Biochem. Biophys. Res. Commun., 334, 288–292.
Tanford,C. (1962) Contribution of hydrophobic interactions to the stability of the globular conformation of proteins. J. Am. Chem. Soc., 84, 4240–4274.
Zhou,G.P. (1998) An intriguing controversy over protein structural class prediction. J. Protein Chem., 17, 729–738.
Zhou,G.P. and Assa-Munt,N. (2001) Some insights into protein structural class prediction. Proteins, 44, 57–59.
Zhou,G.P. and Doctor,K. (2003) Subcellular location prediction of apoptosis proteins. Proteins, 50, 44–48.
Zouhal,L.M. and Denoeux,T. (1998) An evidence-theoretic K-NN rule with parameter optimization. IEEE Trans. Syst. Man Cybern., 28, 263–271.

Author notes

Associate Editor: Keith A Crandall