- Split View
-
Views
-
Cite
Cite
Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, Qing-Song Xu, protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences, Bioinformatics, Volume 31, Issue 11, June 2015, Pages 1857–1859, https://doi.org/10.1093/bioinformatics/btv042
- Share Icon Share
Abstract
Summary: Amino acid sequence-derived structural and physiochemical descriptors are extensively utilized for the research of structural, functional, expression and interaction profiles of proteins and peptides. We developed protr, a comprehensive R package for generating various numerical representation schemes of proteins and peptides from amino acid sequence. The package calculates eight descriptor groups composed of 22 types of commonly used descriptors that include about 22 700 descriptor values. It allows users to select amino acid properties from the AAindex database, and use self-defined properties to construct customized descriptors. For proteochemometric modeling, it calculates six types of scales-based descriptors derived by various dimensionality reduction methods. The protr package also integrates the functionality of similarity score computation derived by protein sequence alignment and Gene Ontology semantic similarity measures within a list of proteins, and calculates profile-based protein features based on position-specific scoring matrix. We also developed ProtrWeb, a user-friendly web server for calculating descriptors presented in the protr package.
Availability and implementation: The protr package is freely available from CRAN: http://cran.r-project.org/package=protr , ProtrWeb, is freely available at http://protrweb.scbdd.com/ .
Contact: oriental-cds@163.com or dasongxu@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Protein sequence is the ultimate resource for functional protein research. In order to apply various machine learning approaches on protein sequence data, it is common practice to encode sequence information as numerical features. The type of encoding, however, can significantly affect analyses, and choosing a precise and effective encoding is a critical step ( Chou, 2011 ).
In bioinformatics, sequence-derived structural and physicochemical features have been widely applied in the research of protein structure and functions during the past two decades, such as, protein structural and functional classes ( Chou and Fasman, 2006 ), protein–protein interactions ( Cao et al. , 2014 ; Shen et al. , 2007 ), subcellular locations and peptides of specific properties ( Chou and Shen, 2008 ), and post-translational modifications ( Xu et al. , 2013 ). In chemogenomics, these structural and physicochemical descriptors are also routinely used to characterize target proteins in drug–target pairs for potential drug–target interaction discovery ( Cao et al. , 2012 , 2013a , b ). In proteochemometric (PCM) modeling ( Wikberg et al. , 2004 ), continuous descriptors derived by various molecular descriptor sets and dimensionality reduction methods ( van Westen et al. , 2013a , b ) are successfully employed in several applications, such as DNA binding pattern analysis, compound selection in lead optimization ( van Westen et al. , 2011 ), novel ligand discovery ( van Westen et al. , 2012 ), proteases ligand selectivity modeling ( Ain et al. , 2014 ) and predicting resistance to pesticides for agrochemicals ( van Westen et al. , 2014 ).
Moreover, for protein and peptides, amino acid sequence and annotation-based similarity scores derived from sequence alignments and Gene Ontology (GO) annotation comparison are also useful representation schemes, which are widely used in modeling, such as genome-wide inference of protein–protein interactions ( Zhang et al. , 2012 ).
Several web servers and stand-alone programs, such as PROFEAT ( Li et al. , 2006 ), PseAAC ( Shen and Chou, 2008 ), propy ( Cao et al. , 2013c ) have been established to calculate such structural and physicochemical descriptors. However, currently available solutions are often limited to certain types of descriptors, lack flexibility and usually difficult to seamlessly integrate into the predictive modeling pipeline. We still urgently need a comprehensive and flexible toolkit to calculate and customize these descriptors. Here, we introduce protr and ProtrWeb, the R package and web server for calculating various numerical representation schemes of protein and peptides from amino acid sequence. We recommend using protr to represent the proteins or peptides under investigation. Besides, in the context of systems biology, we hope that protr will be useful for exploring the biological questions about structures, functions and interactions of proteins and peptides.
2 Package description
The protr package calculates various commonly used structural and physicochemical descriptors and PCMs modeling descriptors for amino acid sequences. A list of descriptors for proteins covered by protr is summarized in Table 1 . These descriptors can be generally divided into eight groups. The first group includes the amino acid composition, dipeptide composition, tripeptide composition. The second group consists of three types of autocorrelation descriptors: normalized Moreau–Broto autocorrelation, Moran autocorrelation and Geary autocorrelation. The third group contains the CTD (composition, transition, distribution) descriptors. The fourth group consists of the conjoint triad descriptor. The fifth group contains two sequence-order descriptor sets: sequence order coupling number and quasi-sequence order descriptor. The sixth group includes two types of pseudo-amino acid compositions (PseAAC): pseudo-amino acid composition (Type I PseAAC) and amphiphilic pseudo-amino acid composition (Type II PseAAC). The seventh group contains seven types of descriptors used for PCM modeling: including the scales-based descriptors derived by principal components analysis, factor analysis and multidimensional scaling, in combination with amino acid properties and 2D and 3D molecular descriptor sets, and BLOSUM/PAM matrix-derived descriptors. The eighth group calculates profile-based protein features based on position-specific scoring matrix (PSSM) ( Su et al. , 2006 ). For constructing customized descriptors of certain types, protr supports defining user-specified properties and selecting properties from the AAindex database ( Kawashima et al. , 2008 ). See the package vignette in the Supplementary data for computational details of the descriptors, datasets and the full workflow demonstration.
Descriptor groups . | Descriptor . | Number . |
---|---|---|
Amino acid composition | Amino acid composition | 20 |
Dipeptide composition | 400 | |
Tripeptide composition | 8000 | |
Normalized Moreau-Broto | 240 a | |
Autocorrelation | Moran | 240 a |
Geary | 240 a | |
Composition | 21 | |
CTD | Transition | 21 |
Distribution | 105 | |
Conjoint Triad | Conjoint Triad | 343 |
Quasi-sequence-order | Sequence-order-coupling number | 60 a |
Quasi-sequence-order descriptors | 100 a | |
Pseudo-amino acid composition | Type I | 50 a |
Type II | 80 a | |
Proteochemometric descriptors | Principal components analysis (amino acid properties based) | 175 b |
Principal components analysis (2D and 3D molecular descriptors based) | 4025 b | |
Factor analysis (amino acid properties based) | 175 b | |
Factor analysis (2D and 3D molecular descriptors based) | 4025 b | |
Multidimensional scaling (amino acid properties based) | 175 b | |
Multidimensional scaling (2D and 3D molecular descriptors based) | 4025 b | |
BLOSUM and PAM matrix-derived descriptors | 175 b | |
PSSM | PSSM profile | – |
Descriptor groups . | Descriptor . | Number . |
---|---|---|
Amino acid composition | Amino acid composition | 20 |
Dipeptide composition | 400 | |
Tripeptide composition | 8000 | |
Normalized Moreau-Broto | 240 a | |
Autocorrelation | Moran | 240 a |
Geary | 240 a | |
Composition | 21 | |
CTD | Transition | 21 |
Distribution | 105 | |
Conjoint Triad | Conjoint Triad | 343 |
Quasi-sequence-order | Sequence-order-coupling number | 60 a |
Quasi-sequence-order descriptors | 100 a | |
Pseudo-amino acid composition | Type I | 50 a |
Type II | 80 a | |
Proteochemometric descriptors | Principal components analysis (amino acid properties based) | 175 b |
Principal components analysis (2D and 3D molecular descriptors based) | 4025 b | |
Factor analysis (amino acid properties based) | 175 b | |
Factor analysis (2D and 3D molecular descriptors based) | 4025 b | |
Multidimensional scaling (amino acid properties based) | 175 b | |
Multidimensional scaling (2D and 3D molecular descriptors based) | 4025 b | |
BLOSUM and PAM matrix-derived descriptors | 175 b | |
PSSM | PSSM profile | – |
a The number of descriptor values depends on the choice of the number of properties of amino acid and the choice of the parameter
b The number of descriptor values depends on the choice of the number of components and the choice of the lag parameter
Descriptor groups . | Descriptor . | Number . |
---|---|---|
Amino acid composition | Amino acid composition | 20 |
Dipeptide composition | 400 | |
Tripeptide composition | 8000 | |
Normalized Moreau-Broto | 240 a | |
Autocorrelation | Moran | 240 a |
Geary | 240 a | |
Composition | 21 | |
CTD | Transition | 21 |
Distribution | 105 | |
Conjoint Triad | Conjoint Triad | 343 |
Quasi-sequence-order | Sequence-order-coupling number | 60 a |
Quasi-sequence-order descriptors | 100 a | |
Pseudo-amino acid composition | Type I | 50 a |
Type II | 80 a | |
Proteochemometric descriptors | Principal components analysis (amino acid properties based) | 175 b |
Principal components analysis (2D and 3D molecular descriptors based) | 4025 b | |
Factor analysis (amino acid properties based) | 175 b | |
Factor analysis (2D and 3D molecular descriptors based) | 4025 b | |
Multidimensional scaling (amino acid properties based) | 175 b | |
Multidimensional scaling (2D and 3D molecular descriptors based) | 4025 b | |
BLOSUM and PAM matrix-derived descriptors | 175 b | |
PSSM | PSSM profile | – |
Descriptor groups . | Descriptor . | Number . |
---|---|---|
Amino acid composition | Amino acid composition | 20 |
Dipeptide composition | 400 | |
Tripeptide composition | 8000 | |
Normalized Moreau-Broto | 240 a | |
Autocorrelation | Moran | 240 a |
Geary | 240 a | |
Composition | 21 | |
CTD | Transition | 21 |
Distribution | 105 | |
Conjoint Triad | Conjoint Triad | 343 |
Quasi-sequence-order | Sequence-order-coupling number | 60 a |
Quasi-sequence-order descriptors | 100 a | |
Pseudo-amino acid composition | Type I | 50 a |
Type II | 80 a | |
Proteochemometric descriptors | Principal components analysis (amino acid properties based) | 175 b |
Principal components analysis (2D and 3D molecular descriptors based) | 4025 b | |
Factor analysis (amino acid properties based) | 175 b | |
Factor analysis (2D and 3D molecular descriptors based) | 4025 b | |
Multidimensional scaling (amino acid properties based) | 175 b | |
Multidimensional scaling (2D and 3D molecular descriptors based) | 4025 b | |
BLOSUM and PAM matrix-derived descriptors | 175 b | |
PSSM | PSSM profile | – |
a The number of descriptor values depends on the choice of the number of properties of amino acid and the choice of the parameter
b The number of descriptor values depends on the choice of the number of components and the choice of the lag parameter
Similarity scores are another useful type of representation encoding the relational information between two proteins. In protr, we incorporated protein sequence alignment ( Pages et al. , 2014 ) and GO semantic similarity measures computation ( Yu et al. , 2010 ) to derive similarity scores. The parallelized version of functions for computing the pairwise similarity scores are provided to accelerate the computation speed.
Furthermore, protr provided several useful auxiliary functions, such as functions for loading sequences from FASTA/PDB files, batch downloading protein sequences from UniProt, amino acid type sanity checking, partitioning sequences to create sliding windows, etc. These functions make the tasks of protein sequence data retrieval, pre-processing and manipulation easier in R. For users without recourse to R scripting and requiring ad hoc analysis of protein sequences, we offered ProtrWeb, an easy-to-use web server for calculating the commonly used descriptors presented in protr.
3 Results
To the best of our knowledge, protr is currently the most comprehensive, flexible and integrated open-source toolkit for protein sequence-derived structural and physiochemical descriptor computation. Users can select appropriate descriptors calculated by protr or ProtrWeb according to their needs, and conveniently apply various statistical analysis and machine learning methods in R to solve various biological questions concerning the structures, functions and interactions of proteins and peptides.
Users of the protr package need to intelligently evaluate the underlying details of the descriptors provided, instead of using protr with their data blindly. It would be wise to use some negative and positive control comparisons where relevant to help guide interpretation of the results.
The protr package has been intensively tested to guarantee the computation correctness and speed. To ensure that our calculation is accurate, the calculated descriptor values were compared with the known values for these sequences.
In future development of protr, it is a potential direction to incorporate 3D structural information of proteins ( Grant et al. , 2006 ), which would be beneficial in several analysis and modeling scenerios.
Funding
This study was supported by the National Key Basic Research Program [2015CB910700] and the National Natural Science Foundation of China [Grants Nos. 81402853 and 11271374] and the Postdoctoral Science Foundation of Central South University. The studies meet with the approval of the university’s review board.
Conflict of Interest : none declared.
References
Author notes
Associate Editor: Ziv Bar-Joseph