Seq2Feature: a comprehensive web-based feature extraction tool

Abstract

Motivation

Machine learning techniques require various descriptors from protein and nucleic acid sequences to understand and predict their structure and function, as well as to distinguish between disease-causing and neutral mutations. Hence, a feature extraction tool is needed to bridge the gap.

Results

We developed a comprehensive web-based tool, Seq2Feature, which computes 252 protein and 41 DNA sequence-based descriptors. These features include physicochemical, energetic and conformational properties of proteins, mutation matrices and contact potentials as well as nucleotide composition, physicochemical and conformational properties of DNA. We propose that Seq2Feature could serve as an effective tool for extracting protein and DNA sequence-based features as applicable inputs to machine learning algorithms.

Availability and implementation

https://www.iitm.ac.in/bioinfo/SBFE/index.html.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Owing to technical advances in handling large amounts of data and the development of efficient algorithms, machine learning techniques have been widely applied to several areas of biology, such as the structure and function of biological macromolecules, image processing and the recognition of disease patterns. Esteva et al. (2017) utilized deep neural networks for classifying skin cancers using biopsy-proven clinical images. Anoosha et al. (2015) developed a computational tool for discriminating between driver and passenger mutations in epidermal growth factor receptor in cancer using support vector machines. For protein structure and function, several machine learning algorithms have been proposed for predicting, from sequence information alone, the binding affinity of protein–protein complexes, binding sites in protein–protein, protein–DNA and protein–RNA complexes, protein folding rates, aggregation-prone regions in proteins, aggregation rates, secondary structure and the solvent accessibility of amino acid residues. In addition, mutation of an amino acid residue or a nucleotide may alter structure and function, and some mutations lead to disease. This problem has also been addressed successfully using machine learning methods.

Extracting sequence-based features from protein and nucleic acid sequences, together with the change in property upon mutation, is key to preparing input features for machine learning and deep learning methods. In our earlier work, we developed a tool, PDBparam, for extracting structure-based features from any protein structure as well as from protein complexes (Nagarajan et al., 2016). Dukka’s group developed a tool, FEPS (http://bcb.ncat.edu/Features/), for extracting features from protein sequences. Chen et al. developed a Python package and web server, iFeature, for the extraction and selection of features from protein and peptide sequences (Chen et al., 2018). These methods are limited to protein sequences, and no user-friendly server or standalone program is available for handling the effects of amino acid substitutions or for handling nucleic acid sequences.

In this work, we developed a web server and software package, Seq2Feature, which is capable of calculating both protein and DNA sequence-based descriptors. Supplementary Figure S1 shows the flowchart of the descriptors in Seq2Feature. It computes physicochemical properties, contact potential-based properties and substitution matrices for protein sequences, as well as physicochemical, conformational and nucleotide content-based properties for DNA sequences. The web server is freely available at https://www.iitm.ac.in/bioinfo/SBFE/index.html.

2 Descriptors for protein sequences

2.1 Average property value for protein sequences

We considered the physical, chemical, energetic and conformational properties of amino acid residues reported in the literature (Gromiha, 2005) and listed in the AAindex database (Kawashima et al., 2008) for computing the average value (Section 1 in Supplementary Material). The list of properties is presented in Supplementary Table S4.
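As a rough illustration of the averaging step, the sketch below computes the mean of a per-residue property over a sequence. The Kyte-Doolittle hydropathy scale is used only as an example property, and the function name `average_property` is hypothetical; the exact formula and property set used by Seq2Feature are those given in the Supplementary Material.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def average_property(sequence, scale):
    """Return the mean property value over all residues of a protein sequence."""
    residues = [aa for aa in sequence.upper() if not aa.isspace()]
    if not residues:
        raise ValueError("empty sequence")
    unknown = sorted(set(residues) - set(scale))
    if unknown:
        raise ValueError(f"no property value for residues: {unknown}")
    return sum(scale[aa] for aa in residues) / len(residues)


if __name__ == "__main__":
    # First ten residues of a protein sequence, used as a short demo input.
    print(average_property("MEEPQSDPSV", KYTE_DOOLITTLE))
```

Unknown residue codes raise an error here instead of being skipped silently, so that a non-standard character cannot distort the average unnoticed.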

2.2 Change in property values upon amino acid substitutions

The change in property value upon an amino acid mutation is calculated as the difference between the property values of the mutant and the wild-type amino acids (Section 2 in Supplementary Material).
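A minimal sketch of this calculation follows, assuming the sign convention "mutant minus wild type" and a mutation string of the form `E2C` (wild-type residue, 1-based position, mutant residue). The helper names `parse_mutation` and `property_change` are hypothetical, and the hydropathy values are only an example property.

```python
import re

# Kyte-Doolittle hydropathy for a few residues; an example property only.
HYDROPATHY = {"M": 1.9, "E": -3.5, "C": 2.5, "P": -1.6, "Q": -3.5}

MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(mutation):
    """Split a string such as 'E2C' into (wild_type, position, mutant)."""
    match = MUTATION_RE.match(mutation.strip().upper())
    if match is None:
        raise ValueError(f"cannot parse mutation: {mutation!r}")
    wild, position, mutant = match.groups()
    return wild, int(position), mutant


def property_change(sequence, mutation, scale):
    """Property of the mutant residue minus property of the wild-type residue."""
    wild, position, mutant = parse_mutation(mutation)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1].upper() != wild:
        raise ValueError(f"residue {position} is not {wild} in the sequence")
    return scale[mutant] - scale[wild]


if __name__ == "__main__":
    print(property_change("MEEPQ", "E2C", HYDROPATHY))
```

Checking the stated wild-type residue against the sequence catches off-by-one position errors before any value is reported.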

2.3 Substitution matrices

Amino acid mutation matrices were collected from the AAindex2 database (Kawashima et al., 2008), and the value for a mutation is read directly from the matrices. The list of substitution matrices is given in Supplementary Table S5.
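The lookup can be pictured as a simple table access. The sketch below uses a small excerpt of BLOSUM62 purely for illustration and assumes a symmetric matrix stored as one triangle; the function name `substitution_score` is hypothetical, and the matrices actually offered by the server are those in Supplementary Table S5.

```python
# A few BLOSUM62 entries (one triangle only), used purely as an example matrix.
BLOSUM62_EXCERPT = {
    ("C", "C"): 9, ("D", "D"): 6, ("E", "E"): 5,
    ("C", "D"): -3, ("C", "E"): -4, ("D", "E"): 2,
}


def substitution_score(wild, mutant, matrix):
    """Read the score for wild -> mutant from a symmetric substitution matrix.

    Only one triangle is stored, so both residue orders are tried. An
    asymmetric matrix would need both directions stored explicitly.
    """
    wild, mutant = wild.upper(), mutant.upper()
    for key in ((wild, mutant), (mutant, wild)):
        if key in matrix:
            return matrix[key]
    raise ValueError(f"no matrix entry for {wild}->{mutant}")


if __name__ == "__main__":
    print(substitution_score("E", "C", BLOSUM62_EXCERPT))
```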

2.4 Pairwise contact potentials

Pairwise contact potential matrices were collected from the AAindex3 database (Kawashima et al., 2008). The change in contact potential for a mutation is obtained by subtracting the contact potential between the N-/C-terminal neighbor of the mutated position and the wild-type residue from the contact potential between the same neighbor and the mutant residue (Anoosha et al., 2015). The list of contact potentials is given in Supplementary Table S6.
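One way to read this definition is sketched below: for each sequence neighbor of the mutated position, take the potential of the (neighbor, mutant) pair minus that of the (neighbor, wild-type) pair. The potential values are made-up placeholders, not entries from AAindex3, and the function names and the choice to report the N- and C-side changes separately are assumptions.

```python
# Placeholder pair potentials (NOT real AAindex3 values), one residue order only.
TOY_POTENTIAL = {
    ("M", "E"): -0.2, ("M", "C"): -0.5,
    ("E", "E"): 0.1, ("E", "C"): -0.1,
}


def pair_potential(potential, a, b):
    """Look up a symmetric pair potential stored under either residue order."""
    if (a, b) in potential:
        return potential[(a, b)]
    return potential[(b, a)]


def contact_potential_change(sequence, position, mutant, potential):
    """Return the change in contact potential for each neighbor of a position.

    `position` is 1-based. The result maps 'N' and 'C' to
    potential(neighbor, mutant) - potential(neighbor, wild_type);
    a side is omitted when the position is at that end of the sequence.
    """
    seq = sequence.upper()
    if not 1 <= position <= len(seq):
        raise ValueError(f"position {position} is outside the sequence")
    wild = seq[position - 1]
    changes = {}
    for side, index in (("N", position - 2), ("C", position)):
        if 0 <= index < len(seq):
            neighbor = seq[index]
            changes[side] = (pair_potential(potential, neighbor, mutant.upper())
                             - pair_potential(potential, neighbor, wild))
    return changes


if __name__ == "__main__":
    print(contact_potential_change("MEE", 2, "C", TOY_POTENTIAL))
```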

3 Descriptors for DNA sequences

Physicochemical, conformational and nucleotide-based properties

We considered 16 physicochemical properties, including enthalpy, entropy, melting temperature, free energy and stacking energy, as well as 18 conformational properties, including major groove width, minor groove width, rise, shift, slide, roll, tilt and twist (Friedel et al., 2009), to compute the average property value. The experimental property values are reported for dinucleotides; hence, we split the DNA sequence into overlapping dinucleotides and computed the average property value (Section 3 in Supplementary Material). In addition, we included nucleotide-based properties, namely the contents of A, C, T, G, AT, GC, keto (GT), purine (AG) and pyrimidine (TC). A list of DNA-based properties is given in Supplementary Table S7.
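The two DNA calculations can be sketched as follows. The dinucleotide table here is a placeholder (the number of G or C bases in each step) standing in for an experimental property, and the contents are assumed to be fractions of the sequence length, with AT and GC read as the combined content of the two bases; all function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Placeholder dinucleotide "property": number of G/C bases in the step.
# A real table would hold experimental values for the 16 dinucleotides.
TOY_TABLE = {
    a + b: float(sum(base in "GC" for base in (a, b)))
    for a, b in product("ACGT", repeat=2)
}


def overlapping_dinucleotides(sequence):
    """Split a DNA sequence into overlapping steps: ATGC -> AT, TG, GC."""
    seq = sequence.upper()
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def average_dinucleotide_property(sequence, table):
    """Mean of a dinucleotide property over all overlapping steps."""
    steps = overlapping_dinucleotides(sequence)
    if not steps:
        raise ValueError("sequence must contain at least two nucleotides")
    return sum(table[step] for step in steps) / len(steps)


def nucleotide_content(sequence):
    """Fractions of single bases and base groups in a DNA sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)

    def fraction(bases):
        return sum(counts[b] for b in bases) / len(seq)

    groups = {"A": "A", "C": "C", "T": "T", "G": "G", "AT": "AT", "GC": "GC",
              "keto": "GT", "purine": "AG", "pyrimidine": "TC"}
    return {name: fraction(bases) for name, bases in groups.items()}


if __name__ == "__main__":
    print(average_dinucleotide_property("ATGC", TOY_TABLE))
    print(nucleotide_content("ATGC"))
```

A sequence of length n yields n - 1 overlapping steps, so the average is taken over n - 1 values and not over n.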

4 Server description and implementation

The Seq2Feature server can calculate 252 protein and 41 DNA-based parameters; detailed statistics are presented in Supplementary Table S2. Based on their properties, the parameters are grouped into categories such as physicochemical, conformational and energetic. The scripts that calculate these parameters are written in Python, and Python-CGI scripts are used to render the HTML web pages. The Seq2Feature server accepts sequences in FASTA format. The output can be viewed in tabular form and downloaded as a comma-separated values (.csv) file.
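The input and output formats can be handled with the Python standard library alone. The sketch below reads FASTA text and writes a feature table as CSV; it is not the server's own code, and the function names and column names are hypothetical.

```python
import csv
import io


def read_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def features_to_csv(rows, fieldnames):
    """Serialize a list of feature dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sequences = read_fasta(">demo protein\nMEEP\nQSD\n")
    rows = [{"id": name, "length": len(seq)} for name, seq in sequences.items()]
    print(features_to_csv(rows, ["id", "length"]))
```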

Example: Compute property values and change due to mutation in a protein (Fig. 1a).

Fig. 1. (a) Input protein sequence and mutation, (b) output for change upon mutation and (c) average property value

Input or upload the sequence in FASTA format (TP53).
Enter the mutation with its position, for example ‘E2C’. Here, ‘E’ is mutated to ‘C’ at position 2. More than one mutation can be entered, separated by commas.
Choose the property (amino acid properties, substitution matrices or pairwise properties and contact potential) and then click on submit.
The output page (Fig. 1b and c) shows the input sequence, the selected properties, the formula used to calculate the values and the actual values for each property. The actual and normalized (between 0 and 1) property values can be downloaded by clicking on the ‘download now’ button.
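The text does not state how the normalized values are obtained. A common way to map values onto the range 0 to 1 is min-max scaling, sketched below as an assumption; the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(values):
    """Rescale a dict of property values linearly so that min -> 0 and max -> 1."""
    if not values:
        raise ValueError("no values to normalize")
    low, high = min(values.values()), max(values.values())
    if high == low:
        # All values are identical; map everything to 0 to avoid dividing by zero.
        return {key: 0.0 for key in values}
    return {key: (value - low) / (high - low) for key, value in values.items()}


if __name__ == "__main__":
    # Example input: three hydropathy values (R, I and an arbitrary midpoint).
    print(min_max_normalize({"R": -4.5, "I": 4.5, "mid": 0.0}))
```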

An additional example, for extracting features from a DNA sequence, is illustrated in Section 4 in Supplementary Material and Supplementary Figure S2.

Availability of Seq2Feature: Seq2Feature is freely available at https://www.iitm.ac.in/bioinfo/SBFE/ and the standalone version can be downloaded from https://www.iitm.ac.in/bioinfo/SBFE/help.html.

5 Applications

Seq2Feature directly calculates various sequence-based properties: physicochemical properties of individual amino acids, substitution matrices and contact potential-based properties for proteins, and physicochemical, conformational and nucleotide content-based properties for DNA. The features obtained for mutants can be used to gain insights into the relationship between features and changes in protein structure and function, to discriminate between disease-causing and neutral mutations, and so on (Section 5 in Supplementary Material). It can be used as a universal sequence-based feature extraction tool that will help biologists to use these features in the development of machine learning or statistical tools.

6 Conclusion

In this work, we have developed Seq2Feature, a comprehensive open-source tool for generating various physicochemical and structural features from protein and DNA sequences. In addition, we have designed a Python-based toolkit to compute the numerical values associated with DNA and protein sequences, which can be useful for extracting valuable descriptors for several prediction purposes.

Acknowledgements

We thank the Indian Institute of Technology Madras and the High-Performance Computing Environment (HPCE) for computational facilities.

Funding

The work is partially supported by the Department of Science and Technology (DST/INT/SWD/P-05/2016) and the Department of Biotechnology, Government of India [BT/PR16710/BID/7/680/2016] to M.M.G.

Conflict of Interest: none declared.

References

Anoosha et al. (2015) Discrimination of driver and passenger mutations in epidermal growth factor receptor in cancer. Mutat. Res., 780.

Chen et al. (2018) iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics, 2499–2502.

Esteva et al. (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115–118.

Friedel et al. (2009) DiProDB: a database for dinucleotide properties. Nucleic Acids Res., D37–D40.

Gromiha M.M. (2005) A statistical model for predicting protein folding rates from amino acid sequence with structural class information. J. Chem. Inf. Model., 494–501.

Kawashima et al. (2008) AAindex: amino acid index database, progress report. Nucleic Acids Res., 202–205.

Nagarajan et al. (2016) PDBparam: online resource for computing structural parameters of proteins. Bioinform. Biol. Insights.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
