Rahul Nikam, M Michael Gromiha, Seq2Feature: a comprehensive web-based feature extraction tool, Bioinformatics, Volume 35, Issue 22, November 2019, Pages 4797–4799, https://doi.org/10.1093/bioinformatics/btz432
Abstract
Machine learning techniques require various descriptors from protein and nucleic acid sequences to understand and predict their structure and function, as well as to distinguish between disease-causing and neutral mutations. Hence, a feature extraction tool is needed to bridge the gap.
We developed a comprehensive web-based tool, Seq2Feature, which computes 252 protein and 41 DNA sequence-based descriptors. These features include physicochemical, energetic and conformational properties of proteins, mutation matrices and contact potentials as well as nucleotide composition, physicochemical and conformational properties of DNA. We propose that Seq2Feature could serve as an effective tool for extracting protein and DNA sequence-based features as applicable inputs to machine learning algorithms.
Supplementary data are available at Bioinformatics online.
1 Introduction
Due to technical advancements in handling large amounts of data and the availability of efficient algorithms, machine learning techniques have been widely applied to several areas of biology, such as the structure and function of biological macromolecules, image processing and the recognition of disease patterns. Esteva et al. (2017) utilized deep neural networks for classifying skin cancers using biopsy-proven clinical images. Anoosha et al. (2015) developed a computational tool for discriminating between driver and passenger mutations in epidermal growth factor in cancer using support vector machines. On protein structure and function, several machine learning algorithms have been proposed for predicting the binding affinity of protein–protein complexes, binding sites in protein–protein, protein-DNA and protein-RNA complexes, protein folding rates, aggregation prone regions in proteins, aggregation rates, secondary structure and solvent accessibility of amino acid residues using sequence information alone. In addition, mutation of an amino acid residue or a nucleotide may alter structure and function, and some such mutations lead to disease. This problem has also been addressed successfully using machine learning methods.
Extraction of sequence-based features from protein/nucleic acid sequences, and of the change in property upon mutation, is key to preparing input features for machine learning as well as deep learning methods. In our earlier work, we developed a tool, PDBparam, for extracting structure-based features for any protein structure as well as protein complexes (Nagarajan et al., 2016). Dukka’s group developed a tool, FEPS (http://bcb.ncat.edu/Features/), for extracting features from protein sequences. Chen et al. developed a Python package and webserver, iFeature, for extraction and selection of features from protein and peptide sequences (Chen et al., 2018). These methods are limited to protein sequences, and no user-friendly server or standalone program is available for handling the effects of amino acid substitutions or for nucleic acid sequences.
In this work, we developed a web server and software package, Seq2Feature, which is capable of calculating both protein and DNA sequence-based descriptors. Supplementary Figure S1 shows the flowchart of the descriptions in Seq2Feature. It computes physicochemical properties, contact potential based properties and substitution matrices for protein sequences as well as physicochemical, conformational and nucleotide content based properties in DNA sequences. The webserver is freely available at https://www.iitm.ac.in/bioinfo/SBFE/index.html.
2 Descriptors for protein sequences
2.1 Average property value for protein sequences
We have considered physical, chemical, energetic and conformational properties of amino acid residues reported in the literature (Gromiha, 2005) and listed in the AAindex database (Kawashima et al., 2008) for computing the average value (Section 1 in Supplementary Material). The list of properties is presented in Supplementary Table S4.
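As an illustration of the averaging step, the following is a minimal sketch that assumes the average is the arithmetic mean of per-residue values; the exact formula is given in the Supplementary Material and may differ. The Kyte-Doolittle hydropathy index stands in for any one property, and the names `KYTE_DOOLITTLE` and `average_property` are hypothetical, not part of Seq2Feature.

```python
# Kyte-Doolittle hydropathy index, used here only as an example property;
# Seq2Feature draws its own property tables from the literature and AAindex.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def average_property(sequence, scale):
    """Return the arithmetic mean of `scale` over all residues (assumed formula)."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("sequence is empty")
    unknown = sorted(set(residues) - set(scale))
    if unknown:
        raise ValueError(f"no property value for residues: {unknown}")
    return sum(scale[r] for r in residues) / len(residues)


# Short example sequence; any string of standard one-letter codes works.
print(round(average_property("MEEPQ", KYTE_DOOLITTLE), 2))
```

Rejecting unknown residue codes, rather than skipping them, is a design choice of this sketch so that ambiguous characters such as 'X' do not silently change the denominator.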
2.2 Change in property values upon amino acid substitutions
The change in property value upon amino acid mutation is calculated using the difference between the property values of the mutant and the wild-type amino acid (Section 2 in Supplementary Material).
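The difference described above can be sketched in a few lines. The three-entry table below is an excerpt of the Kyte-Doolittle hydropathy index chosen purely for illustration, and `property_change` is a hypothetical helper name.

```python
# Excerpt of a property table (Kyte-Doolittle hydropathy values for E, C
# and A), just large enough to run the example.
HYDROPATHY = {"E": -3.5, "C": 2.5, "A": 1.8}


def property_change(wild_type, mutant, scale):
    """Return scale[mutant] - scale[wild_type] for a single substitution."""
    return scale[mutant] - scale[wild_type]


# Substitution of E by C: 2.5 - (-3.5)
print(property_change("E", "C", HYDROPATHY))  # 6.0
```

With this sign convention, a positive value means the mutant residue has a higher property value than the wild-type residue.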
2.3 Substitution matrices
Amino acid mutation matrices are collected from the AAindex2 database (Kawashima et al., 2008), and the value for a mutation is read directly from the matrix entry for the wild-type and mutant residue pair. The list of substitution matrices is given in Supplementary Table S5.
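A direct matrix lookup of this kind might look like the sketch below. The scores in `TOY_MATRIX` are placeholders rather than values from any AAindex2 matrix, and the sketch assumes a symmetric matrix stored as a single triangle; an asymmetric matrix would need both orderings stored explicitly.

```python
# Toy substitution scores for three residues; the numbers are placeholders,
# not values taken from any AAindex2 matrix.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "E"): -1,
    ("C", "C"): 9, ("C", "E"): -4,
    ("E", "E"): 5,
}


def substitution_score(wild_type, mutant, matrix):
    """Look up a score in a symmetric matrix stored as one triangle."""
    if (wild_type, mutant) in matrix:
        return matrix[(wild_type, mutant)]
    # Fall back to the mirrored entry (symmetry is an assumption here).
    return matrix[(mutant, wild_type)]


print(substitution_score("E", "C", TOY_MATRIX))  # -4
```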
2.4 Pairwise contact potentials
Pairwise contact potential matrices are collected from the AAindex3 database (Kawashima et al., 2008). The change in contact potential for a mutation is obtained for each sequence neighbor (N-terminal and C-terminal) of the mutated position by subtracting the contact potential between the neighbor and the wild-type residue from that between the neighbor and the mutant residue (Anoosha et al., 2015). The list of contact potentials is given in Supplementary Table S6.
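The neighbor-based difference can be sketched as follows, under the assumptions that the potential is symmetric and that a terminal residue simply has no value on its missing side. `TOY_POTENTIAL` holds placeholder numbers, and `contact_potential_change` is a hypothetical name, not the Seq2Feature implementation.

```python
# Placeholder contact potentials keyed by alphabetically sorted residue
# pairs; these are not values from any AAindex3 matrix.
TOY_POTENTIAL = {
    ("C", "E"): 0.5, ("C", "M"): -0.5,
    ("E", "E"): 1.0, ("E", "M"): 0.25,
}


def contact_potential_change(sequence, position, mutant, potential):
    """Return (delta_N, delta_C) for a substitution at 1-based `position`.

    Each delta is potential(neighbor, mutant) - potential(neighbor, wild type).
    A missing neighbor (first or last residue) yields None.
    """
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position outside sequence")
    wild_type = sequence[index]

    def pair(a, b):
        # Symmetric potential assumed: look up the sorted residue pair.
        return potential[tuple(sorted((a, b)))]

    def delta(neighbor):
        return pair(neighbor, mutant) - pair(neighbor, wild_type)

    n_neighbor = sequence[index - 1] if index > 0 else None
    c_neighbor = sequence[index + 1] if index + 1 < len(sequence) else None
    delta_n = delta(n_neighbor) if n_neighbor is not None else None
    delta_c = delta(c_neighbor) if c_neighbor is not None else None
    return delta_n, delta_c


# E at position 2 of "MEE" mutated to C: neighbors are M (N side) and E (C side).
print(contact_potential_change("MEE", 2, "C", TOY_POTENTIAL))  # (-0.75, -0.5)
```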
3 Descriptors for DNA sequences
Physicochemical, conformational and nucleotide-based properties
We have considered 16 physicochemical properties including enthalpy, entropy, melting temperature, free energy and stacking energy as well as 18 conformational properties including major groove width, minor groove width, rise, shift, slide, roll, tilt and twist (Friedel et al., 2009) to compute the average property value. The experimental property values are reported for dinucleotides and hence, we split the DNA sequence into overlapping dinucleotides and computed the average property value (Section 3 in Supplementary Material). In addition, we included the nucleotide-based properties which include the contents of A, C, T, G, AT, GC, keto (GT), purine (AG) and pyrimidine (TC). A list of DNA-based properties is given in Supplementary Table S7.
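The two DNA calculations can be sketched as below. The dinucleotide table is filled with placeholder numbers (0 to 15) rather than measured values, and reporting nucleotide contents as fractions of sequence length is an assumption of this sketch; the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Placeholder values for one dinucleotide property: AA=0.0, AC=1.0, ...,
# TT=15.0. Real tables list a measured value for each of the 16 steps.
TOY_DINUCLEOTIDE_PROPERTY = {
    "".join(pair): float(i)
    for i, pair in enumerate(product("ACGT", repeat=2))
}

# Base groups named in the text.
CONTENT_GROUPS = {
    "A": "A", "C": "C", "T": "T", "G": "G",
    "AT": "AT", "GC": "GC",
    "keto": "GT", "purine": "AG", "pyrimidine": "TC",
}


def dinucleotide_average(sequence, table):
    """Average a dinucleotide property over overlapping steps of `sequence`."""
    seq = sequence.upper()
    steps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not steps:
        raise ValueError("need at least two nucleotides")
    return sum(table[step] for step in steps) / len(steps)


def nucleotide_content(sequence):
    """Return each content as a fraction of sequence length (assumed unit)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    counts = Counter(seq)
    return {
        name: sum(counts[base] for base in bases) / len(seq)
        for name, bases in CONTENT_GROUPS.items()
    }


# "ACGT" yields the overlapping steps AC, CG and GT.
print(dinucleotide_average("ACGT", TOY_DINUCLEOTIDE_PROPERTY))  # 6.0
print(nucleotide_content("AACG")["purine"])  # 0.75
```

A sequence of length n gives n - 1 overlapping dinucleotides, which is why the average divides by the number of steps rather than by the sequence length.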
4 Server description and implementation
The Seq2Feature server can calculate 252 protein and 41 DNA-based parameters, and detailed statistics are presented in Supplementary Table S2. Based on their properties, the parameters are grouped into categories such as physicochemical, conformational and energetic. The scripts that calculate these parameters are written in Python, and Python-CGI scripts render the HTML web pages. The Seq2Feature server accepts sequences in FASTA format. The output can be viewed in tabular form and downloaded as a comma-separated values (.csv) file.
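The input and output formats can be illustrated with a small standard-library sketch. It is not the server's code: `read_fasta` and `to_csv` are hypothetical helpers, and the column names in the usage example are assumptions.

```python
import csv
import io


def read_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data before first '>' header")
        else:
            # Sequence lines may be wrapped; collect and join them later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sequences = read_fasta(">example\nMEEP\nQSD\n")
rows = [{"name": name, "length": len(seq)} for name, seq in sequences.items()]
print(to_csv(rows, ["name", "length"]))
```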
Input or upload the sequence in FASTA format (for example, TP53).
Enter the mutation with its position, for example ‘E2C’, where ‘E’ at position 2 is mutated to ‘C’. More than one mutation can be entered, separated by commas.
Choose the property (amino acid properties, substitution matrices or pairwise properties and contact potential) and then click on submit.
The output page (Fig. 1b and c) shows the input sequence, the selected properties, the formula used to calculate the values and the actual values for each property. The results for actual and normalized property values (between 0 and 1) can be downloaded by clicking the ‘download now’ button.
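The mutation notation used in the steps above (‘E2C’, comma-separated) can be parsed with a short sketch like the one below. Checking that the stated wild-type residue matches the sequence is an assumption about sensible input validation, not a documented behavior of the server, and `parse_mutations` is a hypothetical name.

```python
import re

# One-letter wild type, 1-based position, one-letter mutant, e.g. "E2C".
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutations(text, sequence):
    """Parse 'E2C,P4A' into [(wild_type, position, mutant), ...].

    Raises ValueError for malformed tokens, out-of-range positions, or a
    wild-type residue that does not match the sequence (assumed check).
    """
    mutations = []
    for token in text.split(","):
        token = token.strip().upper()
        match = MUTATION_PATTERN.match(token)
        if match is None:
            raise ValueError(f"cannot parse mutation: {token!r}")
        wild_type = match.group(1)
        position = int(match.group(2))
        mutant = match.group(3)
        if not 1 <= position <= len(sequence):
            raise ValueError(f"position {position} is outside the sequence")
        if sequence[position - 1] != wild_type:
            raise ValueError(
                f"{token}: sequence has {sequence[position - 1]!r} "
                f"at position {position}, not {wild_type!r}"
            )
        mutations.append((wild_type, position, mutant))
    return mutations


print(parse_mutations("E2C, P4A", "MEEPQ"))  # [('E', 2, 'C'), ('P', 4, 'A')]
```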
An additional example of extracting features from a DNA sequence is illustrated in Section 4 in Supplementary Material and Supplementary Figure S2.
Availability of Seq2Feature: Seq2Feature is freely available at https://www.iitm.ac.in/bioinfo/SBFE/ and a standalone version can be downloaded from https://www.iitm.ac.in/bioinfo/SBFE/help.html.
5 Applications
Seq2Feature directly calculates various sequence-based properties: physicochemical properties of individual amino acids, substitution matrices and contact potential-based properties for proteins, and physicochemical, conformational and nucleotide content-based properties for DNA. The features obtained for mutants can be used to gain insights into the relationship between features and changes in protein structure and function, to discriminate between disease-causing and neutral mutations, and so on (Section 5 in Supplementary Material). It can be used as a universal sequence-based feature extraction tool that will help biologists use these features in the development of machine learning or statistical tools.
6 Conclusion
In this work, we have developed Seq2Feature, a comprehensive open source tool for generating various physicochemical and structural features from protein/DNA sequences. In addition, we have designed a Python-based toolkit to compute the numerical values associated with DNA and protein sequences, which can be useful for extracting descriptors for several prediction purposes.
Acknowledgements
We thank the Indian Institute of Technology Madras and the High-Performance Computing Environment (HPCE) for computational facilities.
Funding
The work is partially supported by the Department of Science and Technology (DST/INT/SWD/P-05/2016) and the Department of Biotechnology, Government of India [BT/PR16710/BID/7/680/2016] to M.M.G.
Conflict of Interest: none declared.
References