propy: a tool to generate various modes of Chou’s PseAAC

Abstract

Summary: Sequence-derived structural and physicochemical features are frequently used for analysing and predicting the structural, functional, expression and interaction profiles of proteins and peptides. To facilitate extensive studies of proteins and peptides, we developed a freely available, open source Python package called protein in python (propy) that calculates the widely used structural and physicochemical features of proteins and peptides from amino acid sequence. It computes five feature groups composed of 13 features: amino acid composition, dipeptide composition, tripeptide composition, normalized Moreau–Broto autocorrelation, Moran autocorrelation, Geary autocorrelation, sequence-order-coupling number, quasi-sequence-order descriptors, composition, transition and distribution of various structural and physicochemical properties and two types of pseudo amino acid composition (PseAAC) descriptors. These features can generally be regarded as different modes of Chou’s PseAAC. In addition, propy can easily compute the same descriptors from user-defined properties, which are automatically retrieved from the AAindex database.

Availability: The Python package, propy, is freely available at http://code.google.com/p/protpy/downloads/list and runs on Linux and MS-Windows.

Contact: yizeng_liang@263.net

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Sequence-derived structural and physicochemical features have been widely used in the development of machine learning models for predicting protein structural and functional classes (Chou, 2001, 2009), protein–protein interactions (Shen et al., 2007), protein–ligand interactions (Yu et al., 2012), subcellular locations and peptides of specific properties (Chou and Shen, 2008). These features are highly useful for representing and distinguishing proteins or peptides of different structural, functional and interaction profiles. Currently, these structural and physicochemical features of proteins and peptides are routinely used to characterize target proteins in drug–target pairs and to predict new drug–target associations in order to identify potential drug targets (He et al., 2010), following the spirit of chemogenomics.

Several programs for computing protein structural and physicochemical features have been developed (Du et al., 2012; Holland et al., 2008; Li et al., 2006); however, they are not comprehensive, and each is limited to certain kinds of features. Additionally, they are not freely and easily accessible.

We implemented a selection of sophisticated protein features and provide them as a package for the free and open source software environment Python. The propy package aims to provide the user with comprehensive implementations of these descriptors in a unified framework to allow easy and transparent computation. To our knowledge, propy is the first open source package that computes a large number of protein features based on user-defined structural and physicochemical properties. We recommend propy for analysing and representing the proteins or peptides under investigation. Further, we hope that the package will be helpful when exploring questions concerning the structures, functions and interactions of proteins and peptides in the context of systems biology.

2 PACKAGE DESCRIPTION

The propy package can compute a large number of structural and physicochemical features from amino acid sequence. The features for proteins and peptides covered by the current version of propy are summarized in Table 1. These features can be divided into five groups, each of which has been independently used for predicting protein- and peptide-related problems with machine-learning methods. The first group includes three features, amino acid composition, dipeptide composition and tripeptide composition, with three descriptors and 8420 descriptor values. The second group consists of three different autocorrelation features: normalized Moreau–Broto autocorrelation, Moran autocorrelation and Geary autocorrelation. The autocorrelation features describe the level of correlation between two protein or peptide sequences in terms of their specific structural or physicochemical property. Each of these features has eight descriptors and 240 descriptor values. The third group contains three feature sets: composition, transition and distribution, with 21 descriptors and 147 descriptor values. They represent the amino acid distribution pattern of a specific structural or physicochemical property along a protein or peptide sequence. Seven types of physicochemical properties have been used for calculating these features (Supplementary Material). The fourth group includes two sequence-order feature sets: one is the sequence-order-coupling number, with two descriptors and 60 descriptor values, and the other is the quasi-sequence-order, with two descriptors and 100 descriptor values. These features are derived from both the Schneider–Wrede physicochemical distance matrix and the Grantham chemical distance matrix. The fifth group contains two types of pseudo-amino acid composition (PseAAC): type I PseAAC with 50 descriptor values and type II PseAAC (i.e. amphiphilic PseAAC) with 50 descriptor values.
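As an illustration of the first feature group, the following is a minimal sketch of k-mer composition in plain Python. It is not propy's implementation: the function name `kmer_composition`, the example peptide and the choice to report percentages are assumptions made for this example. It shows where the 20, 400 and 8000 values (8420 in total) come from.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Return the percentage of each possible k-mer in the sequence as a dict.

    Hypothetical helper for illustration; k=1, 2, 3 give amino acid,
    dipeptide and tripeptide composition respectively.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1  # number of overlapping k-mers
    for i in range(max(windows, 0)):
        fragment = sequence[i:i + k]
        if fragment in counts:  # windows with non-standard letters are skipped
            counts[fragment] += 1
    if windows <= 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: 100.0 * n / windows for kmer, n in counts.items()}


seq = "MKTAYIAKQR"  # hypothetical example peptide
aac = kmer_composition(seq, 1)
dpc = kmer_composition(seq, 2)
tpc = kmer_composition(seq, 3)
print(len(aac), len(dpc), len(tpc))  # 20 400 8000
print(aac["A"], aac["K"])  # 20.0 20.0
```

Returning a dictionary keyed by the k-mer keeps each value self-describing, which mirrors the dictionary-style output the package description mentions.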
Apart from these descriptors, propy can also compute the above descriptors from user-defined properties, which are easily accessible from the AAindex database (Kawashima and Kanehisa, 2000). In fact, the aforementioned features can be regarded as different modes of Chou’s PseAAC. For example, amino acid, dipeptide, tripeptide or n-mer peptide (n = 4, 5, … ) compositions are just different modes of Chou’s PseAAC. Moreover, higher-level features, such as GO (Gene Ontology) information, FunD (Functional Domain) information and sequential evolution information, have also been fused into Chou’s PseAAC descriptors to characterize different protein information, an approach that is widely used for solving various biological problems. An excellent review by Chou (2011) has pointed out their relevance.
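To make the idea of a PseAAC mode concrete, here is a simplified sketch of type I PseAAC. It is an illustration under stated assumptions, not propy's code: Chou's original formulation averages over three properties (hydrophobicity, hydrophilicity and side-chain mass), whereas this sketch uses a single property (Kyte-Doolittle hydropathy) to stay short. The function names, the weight of 0.05 and the example sequence are likewise assumptions. With λ = 30 the result has 20 + 30 = 50 values.

```python
from collections import Counter

# Kyte-Doolittle hydropathy, used here as the single illustrative property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(prop):
    """Scale a property to zero mean and unit variance over the 20 residues."""
    values = list(prop.values())
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (v - mean) / std for aa, v in prop.items()}


def pseaac_type1(sequence, prop, lamda=30, weight=0.05):
    """Return 20 + lamda type I PseAAC values as a dict (single-property sketch)."""
    n = len(sequence)
    if n <= lamda:
        raise ValueError("sequence must be longer than lamda")
    h = standardize(prop)
    counts = Counter(sequence)
    freqs = {aa: counts[aa] / n for aa in sorted(prop)}
    # thetas[k-1]: mean squared property difference between residues k apart
    thetas = [
        sum((h[sequence[i]] - h[sequence[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lamda + 1)
    ]
    denom = sum(freqs.values()) + weight * sum(thetas)
    result = {aa: f / denom for aa, f in freqs.items()}
    for k, theta in enumerate(thetas, start=1):
        result[f"lambda{k}"] = weight * theta / denom
    return result


# Hypothetical 56-residue example sequence (standard residues only).
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
desc = pseaac_type1(seq, HYDROPATHY, lamda=30)
print(len(desc))  # 50
```

The first 20 values reduce to plain amino acid composition when the weight is zero, which is one way to see why composition-type features can be viewed as special modes of PseAAC.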

Table 1. List of various Chou’s PseAAC modes of proteins and peptides by propy

Feature groups	Features	No. of descriptors
Amino acid composition	Amino acid composition	20
	Dipeptide composition	400
	Tripeptide composition	8000
Autocorrelation	Normalized Moreau–Broto autocorrelation	240 ^a
	Moran autocorrelation	240 ^a
	Geary autocorrelation	240 ^a
Composition, transition and distribution	Composition	21
	Transition	21
	Distribution	105
Quasi-sequence order	Sequence-order-coupling number	60
	Quasi-sequence-order descriptors	100
Pseudo-amino acid composition	Type I pseudo-amino acid composition	50 ^a
	Type II pseudo-amino acid composition	50 ^a

^aThe number depends on the choice of the number of properties of amino acid and the choice of the parameter values in algorithms.
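The dependence on parameter choices noted in the footnote can be sketched with Moran autocorrelation, where the number of values is the number of properties times the number of lags (for instance, eight properties with 30 lags each would give the 240 listed). The code below is an illustrative sketch, not propy's implementation: the function name, the use of Kyte-Doolittle hydropathy as the property and the default `max_lag` are assumptions.

```python
# Kyte-Doolittle hydropathy, used here as an illustrative property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moran_autocorrelation(sequence, prop, max_lag=30):
    """Return Moran autocorrelation for lags 1..max_lag as a dict.

    For lag d: I(d) = [sum_i (P_i - mean)(P_{i+d} - mean) / (N - d)]
                      / [sum_i (P_i - mean)^2 / N]
    """
    values = [prop[aa] for aa in sequence]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    result = {}
    for lag in range(1, max_lag + 1):
        cov = sum(
            (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
        ) / (n - lag)
        result[f"lag{lag}"] = cov / variance if variance else 0.0
    return result


# Hypothetical 56-residue example sequence (standard residues only).
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
moran = moran_autocorrelation(seq, HYDROPATHY, max_lag=30)
print(len(moran))  # 30 values for one property
```

Changing `max_lag` or the number of properties changes the length of the feature vector accordingly, so the counts in the table should be read as defaults rather than fixed sizes.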

The propy package contains several functions and modules for manipulating proteins and peptides. To obtain protein sequences easily, propy provides a download module, with which the user can retrieve protein sequences from the UniProt website by providing UniProt IDs or a file containing UniProt IDs. A check module is also provided to ensure that the input for subsequent calculations is valid. To ease access to amino acid properties and distance matrices, propy provides an AAIndex module, which automatically downloads the required property from the AAindex database.

There are two ways to compute the structural and physicochemical features from protein or peptide sequences. One is to use the built-in modules of the propy package: five modules correspond to the five feature groups, the documentation for each module is provided in HTML form within propy, and the relevant functions can be imported to compute features as needed. The other is to use the GetProDes class from the PyPro module, which encapsulates the commonly used descriptor calculation methods: the user constructs a GetProDes object from a protein sequence and then calls the corresponding methods to calculate the features. A user guide included with propy explains how to calculate the required features (Supplementary Material). A main advantage of propy is that users can specify their own sets of amino acid properties in the form of a dictionary (a Python data structure). Conveniently, the output of the AAIndex module can be used directly as a user-defined property to calculate the aforementioned descriptors, greatly widening the applicability of the calculated features.
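The idea of passing a user-defined property as a dictionary can be sketched with the composition, transition and distribution (CTD) features. This is a stand-alone illustration, not the propy API: the function names `encode` and `ctd`, the key naming scheme, and the percentile indexing convention are assumptions, and the three-class hydrophobicity grouping is a commonly used division taken here as an example. Each property yields 3 + 3 + 15 = 21 values, so seven properties give 147.

```python
import math

# Three-class grouping by hydrophobicity (polar, neutral, hydrophobic).
# Assumed for illustration; any dict of class label -> residues works.
HYDROPHOBICITY_CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def encode(sequence, classes):
    """Replace each residue by the label of the class it belongs to."""
    lookup = {aa: label for label, members in classes.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence)


def ctd(sequence, classes):
    """Return composition, transition and distribution values as a dict."""
    encoded = encode(sequence, classes)
    n = len(encoded)
    labels = sorted(classes)
    result = {}
    # Composition: fraction of residues in each class.
    for a in labels:
        result[f"C{a}"] = encoded.count(a) / n
    # Transition: fraction of adjacent pairs that switch between two classes.
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            switches = sum(
                1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {a, b}
            )
            result[f"T{a}{b}"] = switches / (n - 1)
    # Distribution: position (as % of length) of the first residue of a class
    # and of the residues marking 25, 50, 75 and 100% of its occurrences.
    for a in labels:
        positions = [i + 1 for i, c in enumerate(encoded) if c == a]
        for pct in (0, 25, 50, 75, 100):
            if positions:
                idx = max(math.ceil(len(positions) * pct / 100) - 1, 0)
                value = 100.0 * positions[idx] / n
            else:
                value = 0.0
            result[f"D{a}{pct:03d}"] = value
    return result


features = ctd("RKGACL", HYDROPHOBICITY_CLASSES)  # hypothetical toy peptide
print(len(features))  # 21
print(features["T12"], features["T13"])  # 0.2 0.0
```

Because the grouping is just a dictionary, swapping in a different property (for example one derived from an AAindex entry) requires no change to the functions themselves.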

propy is written in pure Python. We chose Python because it is open source and packages for handling proteins already exist [e.g. Biopython (Cock et al., 2009), PyMol and Pythonscape]. propy can conveniently analyse proteins and peptides processed by Biopython. Moreover, it relies only on some of Python’s built-in modules, which greatly facilitates porting and applying the propy package. Because propy returns its output as a dictionary, users can clearly see the meaning of each feature.

3 DISCUSSION

Sequence analysis of proteins and peptides has become more and more important in various bioinformatics fields. Apart from the prediction of structural and functional classes of proteins or peptides, there exist a few stand-alone applications to calculate protein/peptide descriptors, which are designed to work with drug descriptors in the chemogenomics framework.

propy contains a selection of various Chou’s PseAAC descriptors to analyse, classify and compare complex proteins and peptides. They facilitate the use of machine-learning techniques to derive hypotheses from complex protein or peptide datasets. The usefulness of the features covered by propy for representing the structural and physicochemical characteristics of proteins and peptides has been validated by a number of published studies (Chou, 2009, 2011). The propy implementation of each of these algorithms was extensively tested on a number of test sequences. The computed descriptor values were compared with the known values for these sequences to ensure that the computation is accurate.

propy is a powerful open source package for extracting features of proteins and peptides. In future work, we plan to apply the integrated features to various biological research questions and to extend the range of functions with promising new descriptors in coming versions of propy.

ACKNOWLEDGEMENT

The authors thank two anonymous referees for their constructive comments, which greatly helped improve on the original version of the manuscript.

Funding: National Natural Science Foundation of China (21075138, 21275164 and 11271374). The studies meet with the approval of the university’s review board.

Conflict of Interest: none declared.

REFERENCES

Chou and Shen. Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protoc. 2008, pp. 153–162.

Chou. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001, pp. 246–255.

Chou. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteomics 2009, pp. 262–274.

Chou. Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J. Theor. Biol. 2011, vol. 273, pp. 236–247.

Cock PJA et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, pp. 1422–1423.

Du et al. PseAAC-builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. Anal. Biochem. 2012, vol. 425, pp. 117–119.

He et al. Predicting drug-target interaction networks based on functional groups and biological features. PLoS ONE 2010, p. e9603.

Holland RCG et al. BioJava: an open-source framework for bioinformatics. Bioinformatics 2008, pp. 2096–2097.

Kawashima and Kanehisa. AAindex: amino acid index database. Nucleic Acids Res. 2000, p. 374.

Li et al. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2006, pp. W32–W37.

Shen et al. Predicting protein-protein interactions based only on sequences information. Proc. Natl Acad. Sci. U S A 2007, vol. 104, pp. 4337–4341.

Yu et al. A systematic prediction of multiple drug-target interactions from chemical, genomic, and pharmacological data. PLoS ONE 2012, p. e37608.

Author notes

Associate Editor: Trey Ideker
