Table 1.

Prediction performance of the method with different negative data sets, using AC with lg of 25 amino acids

Negative data set	Psub	Prcp	1-let	2-let	3-let
Sensitivity (%)	85.22	41.76	79.29	69.81	60.74
Precision (%)	87.83	62.64	82.67	85.14	80.15
Accuracy (%)	86.23 ± 1.95	58.42 ± 1.68	79.25 ± 7.80	77.30 ± 12.38	70.25 ± 10.40

Psub is the negative data set of non-interacting pairs of non-co-localized proteins; Prcp is the negative data set derived from the method of Shen et al. (26). The three negative data sets 1-let, 2-let and 3-let were obtained by shuffling the protein sequences while conserving the k-let counts, k = 1, 2, 3.
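To make the k-let terminology in the Table 1 note concrete, the following minimal sketch counts overlapping k-lets and performs the simplest case, a 1-let shuffle (a plain permutation of residues). It is not the Shufflet code used by the authors: conserving 2-let or 3-let counts needs a more elaborate algorithm that is not attempted here. The function names and the toy sequence are hypothetical.

```python
import random
from collections import Counter

def klet_counts(seq, k):
    """Count every overlapping k-let (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shuffle_1let(seq, rng):
    """Randomly permute the residues; this conserves 1-let counts only."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
original = "MKTAYIAKQRQISFVKSHFSRQ"    # hypothetical toy sequence
shuffled = shuffle_1let(original, rng)

# The amino acid composition (1-let counts) is unchanged by the shuffle.
same_composition = klet_counts(original, 1) == klet_counts(shuffled, 1)
```

Comparing `klet_counts(original, 2)` with `klet_counts(shuffled, 2)` is a quick way to see that a plain permutation does not, in general, conserve higher-order counts.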

Selecting optimal lg

Using AC with a larger lag produces more variables, which account for interactions between amino acids that are farther apart in the sequence. The maximal possible lg is bounded by the length of the shortest sequence in the data set (50 amino acids). In this study, several values of lg were tested in order to achieve the best characterization of the protein sequences. Using Psub as the negative data set, nine models were constructed, one for each of nine lgs (lg = 5, 10, 15, 20, 25, 30, 35, 40, 45). The prediction results for the nine models are shown in Figure 1. As the curve shows, the prediction accuracy increases as lg increases from 5 to 30, and then fluctuates slightly as lg increases from 30 to 45. The curve peaks at an lg of 30 amino acids, with an average accuracy of 87.36%. We conclude that AC with lg below 30 amino acids loses some useful features of the protein sequences, whereas larger lgs may introduce noise rather than improve the predictive power of the model. The optimal lg is therefore 30 amino acids.

Figure 1.

The average prediction accuracy of the method using AC with different lgs.
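The role of lg is easier to see in code. The sketch below assumes the commonly used auto covariance definition, in which each descriptor column is mean-centred and the products of values `lag` positions apart are averaged over the sequence; the exact normalization used by the authors is defined in the Methods and is not restated in this section, so treat the formula as an assumption. The function name and the toy input are hypothetical.

```python
def auto_covariance(desc, max_lag):
    """Auto covariance features for one protein.

    desc    -- list of rows, one per residue, each holding D descriptor values
    max_lag -- the largest lag (lg) to include
    Returns a flat list of max_lag * D values, ordered by lag, then descriptor.
    """
    n = len(desc)
    if max_lag >= n:
        # A lag of n or more leaves no residue pairs to average over.
        raise ValueError("max_lag must be smaller than the sequence length")
    d = len(desc[0])
    means = [sum(row[j] for row in desc) / n for j in range(d)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            total = sum(
                (desc[i][j] - means[j]) * (desc[i + lag][j] - means[j])
                for i in range(n - lag)
            )
            features.append(total / (n - lag))
    return features

# Toy input: 4 residues, 2 descriptors, alternating values.
toy = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
features = auto_covariance(toy, max_lag=2)
# features == [-0.25, -0.25, 0.25, 0.25]
```

Each additional lag adds D variables, which is why a larger lg describes longer-range relationships at the cost of a longer feature vector.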

Comparing the performance of AC with that of ACC

After each sequence was represented by the seven descriptors, a protein pair was converted into a 420-dimensional (2 × 30 × 7) vector by AC with lg of 30 amino acids. When ACC is used instead, a protein pair becomes a vector of 2940 dimensions (2 × 30 × 7 × 7). To reduce the computation time, only AC variables were used as the input of the SVM. Here, we also used ACC to transform the protein sequences and compared the performance of the ACC-based model with that of the AC-based model. Table 2 shows that the model based on the ACC transform gives good results, with an average sensitivity, precision and accuracy of 89.93, 88.87 and 89.33 ± 2.67%, respectively. However, when the dimension of the vector space is reduced from 2940 to 420 by using the AC transform, the performance of the AC-based model remains very close to that of the ACC-based model. This suggests that the CC variables contribute only a little to the performance of the model and that the AC variables are the principal components of the ACC variables.
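The dimension counts quoted above can be reproduced with a small sketch. It assumes the usual form of the transform, in which a covariance term is computed for every ordered pair of descriptors at every lag: pairs with the same descriptor give the AC variables, and pairs with different descriptors give the CC variables. The function name and the toy descriptor matrix are hypothetical, and the toy values carry no biological meaning.

```python
def acc_transform(desc, max_lag):
    """Return (ac, cc) covariance lists for one protein.

    desc is a list of per-residue rows with D descriptor values each.
    ac holds max_lag * D values; cc holds max_lag * D * (D - 1) values.
    """
    n, d = len(desc), len(desc[0])
    means = [sum(row[j] for row in desc) / n for j in range(d)]
    ac, cc = [], []
    for lag in range(1, max_lag + 1):
        for j1 in range(d):
            for j2 in range(d):
                cov = sum(
                    (desc[i][j1] - means[j1]) * (desc[i + lag][j2] - means[j2])
                    for i in range(n - lag)
                ) / (n - lag)
                # Same descriptor -> auto covariance; otherwise cross covariance.
                (ac if j1 == j2 else cc).append(cov)
    return ac, cc

# Toy protein: 50 residues, 7 descriptors (arbitrary numbers).
toy = [[((i + 1) * (j + 2) % 11) / 11.0 for j in range(7)] for i in range(50)]
ac, cc = acc_transform(toy, max_lag=30)

ac_pair_dim = 2 * len(ac)               # 2 x 30 x 7
acc_pair_dim = 2 * (len(ac) + len(cc))  # 2 x 30 x 7 x 7
```

With seven descriptors and lg = 30, the AC variables make up one seventh of the full ACC vector, which is the sevenfold reduction discussed in the text.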

Table 2.

The prediction results of the test sets based on the negative data set Psub and lg of 30 amino acids

	Test set	TP	FN	TN	FP	Sensitivity (%)	Precision (%)	Accuracy (%)
ACC	1	2096	282	2226	152	88.14	93.24	90.87
	2	2282	96	1741	637	95.96	78.18	84.59
	3	2023	355	2291	87	85.07	95.88	90.71
	4	2181	197	2099	279	91.72	88.66	89.99
	5	2111	267	2194	184	88.77	91.98	90.52
	Average	2138	240	2110	268	89.93	88.87	89.33 ± 2.67
AC	1	2161	217	1944	434	90.87	83.28	86.31
	2	2215	163	1890	488	93.15	81.95	86.31
	3	2062	316	2153	225	86.71	90.16	88.63
	4	1890	488	2221	157	79.48	92.33	86.44
	5	2052	326	2185	193	86.29	91.40	89.10
	Average	2076	302	2079	299	87.30	87.82	87.36 ± 1.38

TP, true positive; FP, false positive; TN, true negative; FN, false negative; Psub is the negative data set of non-interacting pairs of non-co-localized proteins.
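As a worked check of how the percentage columns in Table 2 follow from the counts, the sketch below recomputes the three measures for AC test set 1 using the standard definitions (sensitivity = TP/(TP + FN), precision = TP/(TP + FP), accuracy = (TP + TN)/total). The function name is hypothetical.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, precision, accuracy) as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)
    precision = 100.0 * tp / (tp + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, precision, accuracy

# AC, test set 1 from Table 2.
sens, prec, acc = confusion_metrics(tp=2161, fn=217, tn=1944, fp=434)
print(f"{sens:.2f} {prec:.2f} {acc:.2f}")  # prints "90.87 83.28 86.31"
```

Because every test set contains 2378 positive and 2378 negative pairs, TP + FN and TN + FP should each equal 2378 in every row, which is a convenient consistency check on the counts.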

In this work, therefore, the optimal model was based on the negative data set Psub and the AC transform with lg of 30 amino acids. The prediction results for the five test sets are listed in Table 2. The prediction accuracies of all five models are >86%, with a relatively low SD of 1.38%. On average, the sensitivity, precision and prediction accuracy of this model are 87.30, 87.82 and 87.36%, respectively. These results were obtained on the original data set, which contains homologous protein pairs. For statistical prediction, however, it is necessary to avoid redundancy and homology bias in the training data set (57). To determine the effect of homology, a non-redundant data set was constructed by removing the protein pairs with ≥40% pairwise sequence identity from the whole original data set. The performance of the five models based on this non-redundant data set is shown in Supplementary Table S2. The average prediction accuracy on the non-redundant data set is 86.55%.
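The average accuracy and its spread can be recomputed from the five AC accuracies in Table 2. This sketch assumes the reported SD is the sample standard deviation, which the text does not state; computed from the rounded table values it comes out at about 1.39, close to the reported 1.38.

```python
from statistics import mean, stdev

# Accuracy (%) of the AC model on test sets 1 to 5, from Table 2.
ac_accuracies = [86.31, 86.31, 88.63, 86.44, 89.10]

avg = mean(ac_accuracies)   # arithmetic mean of the five accuracies
sd = stdev(ac_accuracies)   # sample standard deviation (n - 1 denominator)

print(f"{avg:.2f} ± {sd:.2f}")
```

The small gap between the recomputed and reported SD is consistent with the table values having been rounded to two decimals before this calculation.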

The two SVM parameters, C and γ, were optimized to 32 and 0.03125, respectively. The final prediction model was then built on the whole data set with these optimal parameters.
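For orientation, the optimized values are exact powers of two (C = 2^5, γ = 2^-5). The sketch below shows where γ would enter a radial basis function (RBF) kernel. The kernel type is not stated in this section, so RBF is an assumption here, made because γ is the conventional name of its width parameter; the function name is hypothetical.

```python
import math

C = 2 ** 5        # 32, the penalty parameter reported in the text
GAMMA = 2 ** -5   # 0.03125, the kernel parameter reported in the text

def rbf_kernel(x, y, gamma=GAMMA):
    """RBF kernel exp(-gamma * ||x - y||^2); assumed, not confirmed by the text."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical vectors give a similarity of 1; it decays as the vectors move apart.
same = rbf_kernel([0.2, 0.4], [0.2, 0.4])
apart = rbf_kernel([0.0], [4.0])   # exp(-0.03125 * 16) = exp(-0.5)
```

A small γ such as this one makes the kernel decay slowly with distance, while C controls how heavily training errors are penalized.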

Performance on the independent data set

In order to evaluate the practical prediction ability of the final prediction model, a large independent data set was constructed. The yeast data set in DIP contained 17 491 interaction pairs; pairs containing a protein with <50 amino acids and pairs appearing in the training data set were excluded. Among the remaining 11 474 protein pairs, 10 108 PPIs were correctly predicted by the model, a success rate of 88.09%. In this article, the negative training set was generated by selecting non-interacting pairs of non-co-localized proteins. However, Ben-Hur and Noble (58) have noted that restricting negative examples to non-co-localized protein pairs leads to a biased estimate of the accuracy of a PPI predictor. It is therefore necessary to generate a test data set of non-interacting pairs with the same localization to test the effects of this bias. The yeast proteins used in the positive training set were assigned to the seven main types of localization. Non-interacting protein pairs with the same localization were generated, none of which occurs among the DIP yeast interacting pairs. The performance of the method in predicting such negative samples is summarized in Supplementary Table S3. For each of the cytoplasm and nucleus subsets, only 8000 non-interactions were randomly selected from the large-scale data set. The results show that the prediction model correctly predicts the non-interacting pairs of all subsets with >80% accuracy, except for the cytoplasm subset (77% accuracy) and the endoplasmic reticulum subset (69% accuracy). For all 27 204 non-interactions, the total prediction accuracy is 81.46%. In addition, using the model based on the non-redundant data set, the prediction accuracy for the 11 474 yeast PPIs is 93.25%, and the results for the non-interacting pairs are shown in Supplementary Table S4. All these results demonstrate that this method is also able to predict non-interacting pairs with the same localization.
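The construction of the independent data set and the reported success rate can be sketched as follows. The two exclusion rules (a protein shorter than 50 amino acids, or a pair already present in the training data) come from the text; the function names, the protein identifiers and the choice to treat pairs as unordered are assumptions made for illustration.

```python
def build_independent_set(pairs, seq_len, training_pairs, min_len=50):
    """Keep pairs whose proteins are long enough and that are unseen in training."""
    seen_in_training = {frozenset(p) for p in training_pairs}  # unordered pairs
    kept = []
    for a, b in pairs:
        if seq_len[a] < min_len or seq_len[b] < min_len:
            continue  # one of the proteins is shorter than the minimum length
        if frozenset((a, b)) in seen_in_training:
            continue  # pair was already used for training
        kept.append((a, b))
    return kept

def success_rate(n_correct, n_total):
    """Percentage of pairs predicted correctly."""
    return 100.0 * n_correct / n_total

# Hypothetical toy data: P2 is too short, and (P4, P1) was used in training.
lengths = {"P1": 120, "P2": 35, "P3": 300, "P4": 88}
candidates = [("P1", "P2"), ("P1", "P3"), ("P1", "P4")]
kept = build_independent_set(candidates, lengths, training_pairs=[("P4", "P1")])

rate = success_rate(10108, 11474)   # counts reported in the text
print(f"{rate:.2f}%")               # prints "88.09%"
```

Since the independent set contains only known interactions, this success rate measures how many true PPIs the model recovers and says nothing about false positives, which is why the same-localization negative sets are evaluated separately.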

CONCLUSION

In this article, we developed a new method for predicting PPIs using only the primary sequences of proteins. The prediction model was constructed based on SVM and AC. Shen et al. (26) noted that methods that ignore the local environments of amino acids are usually neither reliable nor robust, so they proposed a conjoint triad method that considers the properties of each amino acid and its two proximate amino acids. However, in most cases, long-range interactions are also important for representing the PPI information. In this article, AC was used to capture the interactions between amino acids that are farther apart in the sequence. A protein sequence was characterized by a series of ACs covering the interactions between each amino acid and its 30 neighbouring amino acids in the sequence, so the method adequately takes the neighbouring effect into account. As expected, this method improved the prediction accuracy compared with current methods. Moreover, three different negative data sets were compared, and the model trained using non-interacting pairs of non-co-localized proteins yielded the best performance, with a high accuracy of 87.36% when applied to predicting the PPIs of S. cerevisiae. The final prediction model was also tested on the independent data set of yeast PPIs and performed well. Overall, such a robust method will be a useful tool for elucidating the biological function of newly discovered proteins and for expediting the study of protein networks.

ACKNOWLEDGEMENTS

The authors gratefully thank Eivind Coward for sharing the Shufflet sequence-randomizing code. The work was funded by the National Natural Science Foundation of China (No. 20775052). Funding to pay the Open Access publication charges for this article was provided by the National Natural Science Foundation of China (No. 20775052).

Conflict of interest statement. None declared.

REFERENCES

1. Fields, Song. A novel genetic system to detect protein–protein interactions. Nature, 1989, vol. 340 (pg. 245-246)
2. Ito, Chiba, Ozawa, Yoshida, Hattori, Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 2001 (pg. 4569-4574)
3. Gavin, Boche, Krause, Grandi, Marzioch, Bauer, Schultz, Rick, Michon, Cruciat. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 2002, vol. 415 (pg. 141-147)
4. Gruhler, Heilbut, Bader, Moore, Adams, Millar, Taylor, Bennett, Boutilier, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 2002, vol. 415 (pg. 180-183)
5. Zhu, Bilgin, Bangham, Hall, Casamayor, Bertone, Lan, Jansen, Bidlingmaier, Houfek, et al. Global analysis of protein activities using proteome chips. Science, 2001, vol. 293 (pg. 2101-2105)
6. Han, Dupuy, Bertin, Cusick, Vidal. Effect of sampling on topology predictions of protein–protein interaction networks. Nat. Biotechnol., 2005 (pg. 839-844)
7. Pellegrini, Marcotte, Thompson, Eisenberg, Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 1999 (pg. 4285-4288)
8. Overbeek, Fonstein, D'Souza, Pusch, Maltsev. Use of contiguity on the chromosome to predict functional coupling. In Silico Biol., 1999 (pg. 108)
9. Marcotte. Detecting protein function and protein–protein interactions from genome sequences. Science, 1999, vol. 285 (pg. 751-753)
10. Enright, Iliopoulos, Kyrpides, Ouzounis. Protein interaction maps for complete genomes based on gene fusion events. Nature, 1999, vol. 402
11. Aloy, Russell. Interrogating protein interaction networks through structural biology. Proc. Natl Acad. Sci. USA, 2002 (pg. 5896-5901)
12. Aloy, Russell. InterPreTS: protein interaction prediction through tertiary structure. Bioinformatics, 2003 (pg. 161-162)
13. Ogmen, Keskin, Aytuna, Nussinov, Gursoy. PRISM: protein interactions by structural matching. Nucleic Acids Res., 2005 (pg. W331-W336)
14. Huang, Tien, Huang, Lee, Peng, Tseng, Kao, Huang. POINT: a database for the prediction of protein–protein interactions based on the orthologous interactome. Bioinformatics, 2004 (pg. 3273-3276)
15. Espadaler, Romero-Isart, Jackson, Oliva. Prediction of protein–protein interactions using distant conservation of sequence patterns and structure relationships. Bioinformatics, 2005 (pg. 3360-3368)
16. Sprinzak, Margalit. Correlated sequence-signatures as markers of protein–protein interaction. J. Mol. Biol., 2001, vol. 311 (pg. 681-692)
17. Kim, Park, Suh. Large scale statistical prediction of protein–protein interaction by potentially interacting domain (PID) pair. Genome Inform., 2002
18. Han, Kim, Jang, Lee, Suh. PreSPI: a domain combination based prediction system for protein–protein interaction. Nucleic Acids Res., 2004 (pg. 6312-6320)
19. Morrison, Breitling, Higham, Gilbert. A lock-and-key model for protein–protein interaction. Bioinformatics, 2006 (pg. 2012-2019)
20. Singhal, Resat. A domain-based approach to predict protein–protein interactions. BMC Bioinformatics, 2007, pg. 199
21. Bock, Gough. Predicting protein–protein interactions from primary structure. Bioinformatics, 2001 (pg. 455-460)
22. Martin, Roe, Faulon. Predicting protein–protein interactions using signature products. Bioinformatics, 2005 (pg. 218-226)
23. Cai, Chen, Chung MCM. Effect of training datasets on support vector machine prediction of protein–protein interactions. Proteomics, 2005 (pg. 876-884)
24. Pitre, Dehne, Chan, Cheetham, Duong, Emili, Gebbia, Greenblatt, Jessulat, Krogan, et al. PIPE: a protein–protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics, 2006, pg. 365
25. Chou, Cai. Predicting protein–protein interactions from sequences in a hybridization space. J. Proteome Res., 2006 (pg. 316-322)
26. Shen, Zhang, Luo, Zhu, Chen, Jiang. Predicting protein–protein interactions based only on sequences information. Proc. Natl Acad. Sci. USA, 2007, vol. 104 (pg. 4337-4341)
27. Xenarios, Salwinski, Duan, Higney, Kim, Eisenberg. DIP: the database of interacting proteins. A research tool for studying cellular networks of protein interactions. Nucleic Acids Res., 2002 (pg. 303-305)
28. Deane, Salwinski, Xenarios, Eisenberg. Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol. Cell. Proteomics, 2002 (pg. 349-356)
29. Jaroszewski, Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 2001 (pg. 282-283)
30. Kandel, Matias, Unger, Winkler. Shuffling biological sequences. Discrete Appl. Math., 1996 (pg. 171-185)
31. Coward. Shufflet: shuffling sequences while conserving the k-let counts. Bioinformatics, 1999 (pg. 1058-1059)
32. Tanford. Contribution of hydrophobic interactions to the stability of the globular conformation of proteins. J. Am. Chem. Soc., 1962 (pg. 4240-4274)
33. Hopp, Woods. Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl Acad. Sci. USA, 1981 (pg. 3824-3828)
34. Krigbaum, Komoriya. Local interactions as a structure determinant for protein molecules: II. Biochim. Biophys. Acta, 1979, vol. 576 (pg. 204-228)
35. Grantham. Amino acid difference formula to help explain protein evolution. Science, 1974, vol. 185 (pg. 862-864)
36. Charton. The structure dependence of amino acid hydrophobicity parameters. J. Theor. Biol., 1982 (pg. 629-644)
37. Rose, Geselowitz, Lesser, Lee, Zehfus. Hydrophobicity of amino acid residues in globular proteins. Science, 1985, vol. 229 (pg. 834-838)
38. Zhou, Tian. Genetic algorithm-based virtual screening of combinative mode for peptide/protein. Acta Chim. Sinica, 2006 (pg. 691-697)
39. Wold, Jonsson, Sjöström, Sandberg, Rännar. DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Anal. Chim. Acta, 1993, vol. 277 (pg. 239-253)
40. Guo, Wen, Huang. Predicting G-protein coupled receptors-G-protein coupling specificity based on autocross-covariance transform. Proteins, 2006
41. Wen, Guo, Wang. Delaunay triangulation with partial least squares projection to latent structures: a model for G-protein coupled receptors classification and fast structure recognition. Amino Acids, 2007 (pg. 277-283)
42. Doytchinova, Flower. VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics, 2007
43. Vapnik. Statistical learning theory. 1998, New York: Wiley