Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions

Abstract

Summary: In chemoinformatics and bioinformatics, a central computational challenge in predictive modeling is finding a suitable way to represent the molecules under investigation, such as small molecules, proteins and even complex interactions. To address this problem, we developed a freely available R/Bioconductor package, Compound–Protein Interaction with R (Rcpi), for representing drugs, proteins and more complex interactions, including protein–protein and compound–protein interactions. Rcpi can calculate a large number of structural and physicochemical features of proteins and peptides from amino acid sequences, molecular descriptors of small molecules from their topology, and protein–protein and compound–protein interaction descriptors. In addition to these main functionalities, Rcpi provides a number of auxiliary utilities to support common user needs. With the descriptors calculated by this package, users can conveniently apply statistical machine learning methods in R to biological and drug research questions in computational biology and drug discovery.

Availability and implementation: Rcpi is freely available from the Bioconductor site (http://bioconductor.org/packages/release/bioc/html/Rcpi.html).

Contact: oriental-cds@163.com

1 INTRODUCTION

To develop a powerful model for prediction tasks, one of the most important considerations is how to effectively represent the molecules under investigation, such as small molecules, proteins and even complex interactions, by a descriptor. In chemoinformatics, molecular descriptors for small molecules have frequently been used in quantitative structure-activity/property relationship (QSAR/QSPR) modeling, virtual screening, database search, ranking, drug ADME/T prediction and other drug discovery processes (Cao et al., 2011; Cherkasov et al., 2014; Gola et al., 2006; Willett, 2014). These descriptors capture and magnify distinct aspects of molecular topology to investigate how molecular structures affect molecular properties. In bioinformatics, sequence-derived structural and physicochemical features have been widely used for predicting protein structural and functional classes, protein–protein interactions, subcellular locations and peptides with specific properties (Chou et al., 2008; Rangwala et al., 2005; Shen et al., 2007; Ye et al., 2011; Zhang et al., 2005; Zhang et al., 2012). These features are highly useful for representing and distinguishing proteins or peptides with different structural, functional and interaction profiles. Combinations of the two kinds of descriptors are now routinely used to characterize drug–target interactions and to predict new drug–target associations to identify potential drug targets (Cao et al., 2013b; He et al., 2010; Prado-Prado et al., 2011), following the spirit of chemogenomics.

Several programs for computing molecular features have been developed, such as TOPS-MODE, Cinfony, Dragon, CODESSA, PROFEAT, BioJava, BioPython, PseAAC and ProPy (Cao et al., 2013a, c; Cock et al., 2009; Du et al., 2012; Holland et al., 2008; Katritzky et al., 1994; Li et al., 2006; O’Boyle et al., 2008; Pérez-González et al., 2003; Todeschini et al., 2010). Although a number of tools, both open source and commercial, have been developed and widely used in the two fields, each focuses on the analysis of either small molecules or proteins. To the best of our knowledge, no open-source code or tool is currently available for the integration and analysis of increasingly popular interaction problems.

We developed a comprehensive molecular representation tool, called Compound–Protein Interaction with R (Rcpi), to emphasize the integration of chemoinformatics and bioinformatics into a chemogenomics platform for drug discovery. Rcpi focuses on molecular representation techniques not only for small molecules and proteins but also for protein–protein and compound–protein interactions. We recommend Rcpi for analyzing and representing the various complex molecular data under investigation. Further, we hope that the package will be helpful when exploring questions concerning the structures, functions and interactions of various molecular data in the context of systems biology.

2 PACKAGE DESCRIPTION

The Rcpi package aims to offer a unique and comprehensive toolkit for complex molecular representations of small molecules, proteins and more complex interactions (see Table 1). To make the Rcpi package fully functional, we recommend that users also install the packages listed under Enhances by using:

source("http://bioconductor.org/biocLite.R")
biocLite("Rcpi", dependencies = c("Imports", "Enhances"))
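The biocLite script reflects the Bioconductor release that was current when this note was written. On more recent releases, installation goes through the BiocManager package instead. The following is a minimal sketch under that assumption, requiring network access; `install_rcpi` is a hypothetical wrapper name introduced here for illustration and is not part of Rcpi.

```r
# Hypothetical helper (not part of Rcpi): installs Rcpi through BiocManager,
# the successor to the biocLite() script on recent Bioconductor releases.
install_rcpi <- function() {
  # Install BiocManager from CRAN only if it is not already available.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Same dependency selection as in the original biocLite() call.
  BiocManager::install("Rcpi", dependencies = c("Imports", "Enhances"))
}
```

Calling `install_rcpi()` once in an interactive session should be sufficient; nothing is installed until the function is called.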

Rcpi mainly covers the following four functionalities:
- (a) For small molecules, Rcpi can (i) calculate more than 300 molecular descriptors, including constitutional, topological, geometrical, electronic, hybrid and molecular property descriptors; (ii) calculate 10 types of molecular fingerprints, including standard and extended Daylight fingerprints, graph fingerprints based on simple connectivity, hybridization fingerprints based only on hybridization state, FP4 keys, E-state fingerprints, MACCS keys, PubChem fingerprints, KR fingerprints defined by Klekota and Roth, short path fingerprints, etc; (iii) perform parallelized pairwise similarity computation, based on fingerprints and five types of similarity measures, within a list of small molecules; (iv) perform parallelized chemical similarity search with selected similarity metrics, and maximum common substructure search, between one query molecule and one molecular database.
- (b) For protein sequences, Rcpi can (i) calculate a large number of commonly used structural and physicochemical descriptors, such as amino acid composition, autocorrelation, composition, transition, distribution, conjoint triad, quasi-sequence order and pseudo amino acid composition descriptors; (ii) calculate six types of generalized scale-based descriptors for proteochemometric (PCM) modeling, namely generalized scale-based descriptors derived by principal components analysis, amino acid properties, molecular descriptors, factor analysis and multidimensional scaling, and generalized BLOSUM/PAM matrix-derived descriptors; (iii) calculate profile-based protein features based on the position-specific scoring matrix (PSSM); (iv) perform parallelized similarity computation based on protein sequence alignment and Gene Ontology (GO) semantic similarity measures for a list of protein sequences, GO terms or Entrez Gene IDs.
- (c) For interaction data, by combining various types of descriptors for drugs and proteins, interaction descriptors representing protein–protein or compound–protein interactions can be conveniently generated with Rcpi, including (i) two types of compound–protein interaction descriptors; (ii) three types of protein–protein interaction descriptors.
- (d) Several useful auxiliary utilities are included in Rcpi: (i) parallelized molecule and protein sequence retrieval from several online databases, such as PubChem, ChEMBL, KEGG, DrugBank, UniProt, RCSB PDB, etc; (ii) reading and writing of molecules in SMILES/SDF formats for small molecules and FASTA/PDB formats for proteins; (iii) molecular format conversion between ∼140 types of molecular formats defined by OpenBabel.
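
The fingerprint-based pairwise similarity computation in item (a)(iii) can be illustrated with a short base R sketch. It does not use Rcpi's own API: the function names `tanimoto` and `pairwise_similarity` and the toy 8-bit fingerprints are assumptions made for illustration. The Tanimoto coefficient is used as one widely used similarity measure; the text does not list which five measures Rcpi implements. The sketch also runs serially rather than in parallel.

```r
# Tanimoto (Jaccard) similarity between two binary fingerprints,
# each given as a 0/1 vector of equal length.
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  both   <- sum(a == 1 & b == 1)  # bits set in both fingerprints
  either <- sum(a == 1 | b == 1)  # bits set in at least one fingerprint
  if (either == 0) return(1)      # convention assumed here for two empty fingerprints
  both / either
}

# Pairwise similarity matrix for a named list of fingerprints.
pairwise_similarity <- function(fps) {
  n <- length(fps)
  sim <- matrix(1, nrow = n, ncol = n,
                dimnames = list(names(fps), names(fps)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      sim[i, j] <- tanimoto(fps[[i]], fps[[j]])
    }
  }
  sim
}

# Toy fingerprints, invented for illustration only.
fps <- list(
  mol_a = c(1, 1, 0, 1, 0, 0, 1, 0),
  mol_b = c(1, 0, 0, 1, 0, 1, 1, 0),
  mol_c = c(0, 0, 1, 0, 1, 0, 0, 1)
)
sim <- pairwise_similarity(fps)
print(round(sim, 3))
```

Here `mol_a` and `mol_b` share three of the five bits set in either fingerprint, giving a similarity of 0.6, while `mol_c` shares no bits with either and scores 0. Because each matrix entry is independent of the others, this kind of computation lends itself to parallel execution.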

Table 1. List of various types of descriptors for complex molecular data by Rcpi

Data types	Feature groups	Number of descriptors
Proteins	Amino acid composition	8420
	Autocorrelation	720^a
	Composition, transition and distribution	147
	Conjoint triad descriptors	343
	Quasi-sequence order	160^a
	Pseudo-amino acid composition	130^a
	PSSM profile	–
	PCM	–
	GO similarity	–
	Sequence similarity	–
Compounds	Constitutional	15
	Topological	183
	Geometrical	49
	Electronic	34
	Hybrid	23
	Molecular property	4
	Fingerprints	10
	Maximum common substructure	–
Compound–protein interaction	Type 1	Nc + Np
	Type 2	Nc × Np
Protein–protein interaction	Type 1	Np + Np
	Type 2	Np + Np
	Type 3	Np × Np

Note: ^a The number of descriptors depends on the number of amino acid properties chosen and on the parameter settings of the corresponding algorithms. Nc and Np denote the number of molecular descriptors for compounds and proteins, respectively.
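
The descriptor counts in Table 1 can be made concrete with a base R sketch. It does not use Rcpi's own functions: all function names, the example sequence and the three compound descriptor values are hypothetical. It also assumes that the 8420 amino acid composition descriptors are the sum of single-residue (20), dipeptide (400) and tripeptide (8000) compositions, and that the Nc + Np and Nc × Np interaction descriptors correspond to concatenation and to the outer product of the two descriptor vectors.

```r
# The 20 standard amino acids (one-letter codes).
aa <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
        "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Amino acid composition: fraction of each standard residue (20 values).
aa_composition <- function(seq) {
  residues <- strsplit(seq, "")[[1]]
  counts <- table(factor(residues, levels = aa))
  setNames(as.numeric(counts) / length(residues), aa)
}

# Dipeptide composition: fraction of each adjacent residue pair (400 values).
# Assumes a sequence of at least two residues.
dipeptide_composition <- function(seq) {
  residues <- strsplit(seq, "")[[1]]
  n <- length(residues)
  pairs <- paste0(residues[-n], residues[-1])
  pair_levels <- as.vector(outer(aa, aa, paste0))
  counts <- table(factor(pairs, levels = pair_levels))
  setNames(as.numeric(counts) / (n - 1), pair_levels)
}

# Interaction descriptors built from a compound vector u and a protein vector v.
combine_descriptors <- function(u, v) c(u, v)                 # length Nc + Np
tensor_descriptors  <- function(u, v) as.vector(outer(u, v))  # length Nc * Np

prot_seq  <- "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up example sequence
prot_desc <- aa_composition(prot_seq)              # Np = 20
comp_desc <- c(d1 = 0.2, d2 = 1.5, d3 = 3.0)       # made-up values, Nc = 3

length(dipeptide_composition(prot_seq))            # 400
length(combine_descriptors(comp_desc, prot_desc))  # 3 + 20 = 23
length(tensor_descriptors(comp_desc, prot_desc))   # 3 * 20 = 60
20 + 20^2 + 20^3                                   # 8420
```

The concatenated form grows linearly with the number of descriptors, while the outer product form grows multiplicatively, which matters when Nc and Np are both in the hundreds.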

3 DISCUSSION

Rcpi contains a selection of molecular descriptors for analyzing, classifying and comparing complex molecular networks in the context of network biology and pharmacology. These descriptors make it easier to apply machine learning techniques to derive hypotheses from complex molecular datasets. Their usefulness for representing structural features of various molecular data has been demonstrated by a number of published studies on the development of machine learning prediction systems.

In future work, we plan to apply the integrated features to various biological and drug research questions, and to extend the range of functions with promising new descriptors in coming versions of Rcpi.

Funding: This study was supported by the National Key Basic Research Program (2015CB910700), the National Natural Science Foundation of China (Grant No. 81402853) and the Postdoctoral Science Foundation of Central South University. The studies meet with the approval of the university’s review board.

Conflict of Interest: none declared.

REFERENCES

Cao et al. (2011) In silico classification of human maximum recommended daily dose based on modified random forest and substructure fingerprint. Anal. Chim. Acta, 692.
