Robson P Bonidia, Douglas S Domingues, Danilo S Sanches, André C P L F de Carvalho, MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors, Briefings in Bioinformatics, Volume 23, Issue 1, January 2022, bbab434, https://doi.org/10.1093/bib/bbab434
Abstract
One of the main challenges in applying machine learning algorithms to biological sequence data is how to represent a sequence as a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques, such as mathematical descriptors, are not available in existing packages. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (accuracy of 0.6350–0.9897), both when applying only mathematical descriptors and when hybridizing them with well-known descriptors from the literature. Finally, with MathFeature, we outperformed several studies on eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature advances the area by bringing descriptors not available in other packages, as well as allowing non-experts to use feature extraction techniques.
Background
Machine learning (ML) algorithms have been successfully applied to genomics, transcriptomics and proteomics problems [1, 2]. Nevertheless, their predictive performance depends on the representation of the sequences by relevant features, able to extract important aspects present in the original sequences. In [3, 4], the authors address the relevance of using an appropriate mathematical expression to extract features from biological data, which has been adopted by several studies [5–7], e.g. non-classical secreted proteins [8], phage virion proteins (PVP) [9], SARS-CoV-2 [10, 11], sigma70 promoters [12] and long non-coding RNAs [13, 14].
As a result, many techniques have been proposed and experimentally investigated [15, 16], and several of them were made available in public software packages, such as PROFEAT [17], PseAAC [18], propy [19], PseKNC-general [16], SPiCE [20], protr/ProtrWeb [21], ProFET [22], Pse-in-One [4], repDNA [23], Rcpi [24], repRNA [25], BioSeq-analysis [26], iFeature [27], PyBioMed [28], Seq2Feature [29], PyFeat [30], iLearn [7], periodicDNA [31] and iLearnPlus [32].
These software packages have been used to extract features from sequences. However, there are some aspects present in the sequences that the feature extraction techniques included in these tools cannot extract. These features, which were shown to be relevant in previous studies [33–36], describe mathematical aspects observed in biological sequences and will be named here mathematical descriptors [37]. These descriptors are based on several techniques, such as multiple numerical mappings, Fourier transform (FT), chaos game theory, entropy and complex networks (CN). To allow the extraction of these descriptors as features for the study of biological sequences, alongside conventional descriptors available in other packages, we created a novel open-source Python package named MathFeature.
Group | Initials | Application group | Study |
---|---|---|---|
Amino acid composition | AAC | Protein | [7] [4] [17] [19] [20] [21] [22] [24] [26] [27] [28] [29] [30] |
Pseudo-amino acid composition | PseAAC | Protein | [7] [4] [18] [19] [20] [21] [24] [26] [27] [28] |
Composition, transition, distribution | CTD | Protein | [7] [17] [19] [20] [21] [22] [24] [27] [28] |
Sequence-order | SO | Protein | [7] [17] [19] [20] [21] [24] [27] [28] |
Conjoint triad | CT | Protein | [7] [21] [24] [27] [28] |
Proteochemometric descriptors | PCM | Protein | [7] [21] [24] [27] |
Profile-based features | PF | Protein | [7] [20] [21] [24] [26] [27] |
Nucleic acid composition | NAC | DNA, RNA | [7] [4] [16] [23] [25] [26] [28] [30] |
Pseudo nucleic acid composition | PseNAC | DNA, RNA | [7] [4] [16] [23] [25] [26] [28] |
Structure composition | SC | DNA, RNA, Protein | [7] [25] [26] [27] |
Sequence similarity | SS | DNA, RNA, Protein | [24] |
Autocorrelation | – | DNA, RNA, Protein | [7] [17] [19] [16] [20] [21] [4] [23] [24] [26] [27] [28] |
Numerical mapping | – | DNA, RNA, Protein | [7] [27] |
K-nearest neighbor | KNN | DNA, RNA, Protein | [7] [27] |
Physicochemical property | PP | DNA, RNA, Protein | [7] [22] [27] [29] |
This package provides, in a single environment, many of the mathematical descriptors previously proposed for feature extraction from biological sequences [33–36]. MathFeature contains 37 descriptors, of which 20 are mathematical descriptors organized into five groups (numerical mapping, chaos game, FT, entropy and graphs). Additionally, MathFeature extends our preliminary investigation [36], in which we examined nine sets of mathematical features. MathFeature also includes descriptors for protein sequences, i.e. prediction of structural features along the primary sequence of amino acids. To the best of our knowledge, MathFeature is the first package to provide such a large and comprehensive set of feature extraction techniques based on mathematical descriptors for DNA, RNA and proteins.
Related works
Fundamentally, we consider feature engineering a key step to ML application success [38–40], mainly in biological sequence preprocessing [3, 41, 42]. In terms of terminology, according to [38], a feature is synonymous with an input variable or attribute. Nevertheless, studies also use the term ‘feature descriptor’ (the majority in our review, 15 studies), which is why we adopted it; here, a feature descriptor refers to a feature extraction method or technique that can produce several measures or values.
In this section, we describe 17 studies (cited in the Background section) on feature extraction packages (tools, web servers, toolkits, etc.) that provide feature descriptors for biological sequence analysis. We organized the selected studies into application categories (that is, DNA, RNA or protein; see Supplementary File S1). Furthermore, we plotted a Venn diagram (see Supplementary File S2) including all studies by application. In general, most studies focus on the representation of proteins (eight studies), while DNA and RNA have one study each. Considering the intersections of applications, we found four studies combining DNA, RNA and protein, two combining DNA and protein, and one combining DNA and RNA.
In our literature review, we found 173 feature descriptors. It is not feasible to individually analyze and describe each descriptor. For this reason, based on our review, we divided these descriptors into 15 large groups, as shown in Table 1. The group column classifies the feature descriptors based on the reviewed studies, and the study column includes packages that have at least one descriptor from the related group.
Considering the groups introduced in Table 1, we observed that most descriptors are based on AAC, PseAAC, CTD and SO for proteins, on NAC and PseNAC for DNA/RNA, and on autocorrelation for DNA, RNA and protein. Nevertheless, MathFeature surpasses other packages in the types of mathematical descriptors it offers (e.g. chaos game, FT, entropy and graphs), except for two numerical mapping descriptors that are available in only two packages [7, 27]. In addition, to better illustrate the advantages of MathFeature compared with other studies, we included Table 2, which shows the number of MathFeature descriptors that can also be found in other tools. In that table, iLearn shares the most, with 15 of the 37 descriptors available in MathFeature, while we found only a few (2 to 9) similar descriptors in the other packages. Based on this analysis, we highlight the novelty of MathFeature in providing different descriptors for biological sequences, which we believe to be an important contribution. Also, most studies (13, 76.47%) were dedicated to evaluating only one type of sequence, while 4 (23.53%) studies cover multiple types of sequences, including MathFeature. Finally, our package is also competitive in terms of the number of descriptors (37 in total).
Package | Mathematical descriptors | Conventional descriptors | Number of descriptors calculated |
---|---|---|---|
MathFeature | 20 | 17 | 37 |
PROFEAT | 0 | 2 | 2 |
PseAAC | 0 | 2 | 2 |
propy | 0 | 5 | 5 |
PseKNC-general | 0 | 5 | 5 |
SPiCE | 0 | 4 | 4 |
ProtrWeb | 0 | 5 | 5 |
ProFET | 2 | 3 | 5 |
Pse-in-One | 0 | 5 | 5 |
repDNA | 0 | 5 | 5 |
Rcpi | 0 | 3 | 3 |
repRNA | 0 | 5 | 5 |
BioSeq-analysis | 0 | 9 | 9 |
iFeature | 1 | 4 | 5 |
PyBioMed | 0 | 7 | 7 |
Seq2Feature | 0 | 0 | 0 |
PyFeat | 1 | 8 | 9 |
iLearn | 2 | 13 | 15 |
Package description
MathFeature is a user-friendly package that covers 20 mathematical descriptors, as illustrated in Figure 1. The MathFeature execution workflow can be divided into four simple steps, as shown in Figure 2. In Table 3, we organize the 20 descriptors into 5 groups according to their structure: numerical mapping (7), chaos game (2), FT (7), entropy (2) and graphs (2). MathFeature can be run from the console, but we also provide a graphical user interface (GUI)-based platform (see Supplementary File S3). We briefly describe each of the 5 groups representing the 20 descriptors:
Descriptor group | Descriptor | Dimension | Biological sequence |
---|---|---|---|
Numerical mapping | Binary | $L \cdot 4$ | DNA/RNA |
Numerical mapping | Z-curve | $L \cdot 3$ | DNA/RNA |
Numerical mapping | Real | $L$ | DNA/RNA |
Numerical mapping | Integer | $L$ | DNA/RNA/Protein |
Numerical mapping | EIIP | $L$ | DNA/RNA/Protein |
Numerical mapping | Complex Number | $L$ | DNA/RNA |
Numerical mapping | Atomic Number | $L$ | DNA/RNA |
FT | Binary + Fourier | $19$ | DNA/RNA |
FT | Z-curve + Fourier | $19$ | DNA/RNA |
FT | Real + Fourier | $19$ | DNA/RNA |
FT | Integer + Fourier | $19$ | DNA/RNA/Protein |
FT | EIIP + Fourier | $19$ | DNA/RNA/Protein |
FT | Complex Number + Fourier | $19$ | DNA/RNA |
FT | Atomic Number + Fourier | $19$ | DNA/RNA |
Chaos game | CGR | $L \cdot 2$ | DNA/RNA |
Chaos game | Chaos Game Signal (with Fourier) | $19$ | DNA/RNA |
Entropy | Shannon | $k$ | DNA/RNA/Protein |
Entropy | Tsallis | $k$ | DNA/RNA/Protein |
Graphs | CN (with threshold) | $12 \cdot t$ | DNA/RNA/Protein |
Graphs | CN (without threshold) | $26 \cdot k$ | DNA/RNA/Protein |
$L$ = length of the longest sequence, $k$ = frequencies of k-mer, $t$ = threshold (number of subgraphs).
Numerical mapping: Several sequence analysis studies require converting a biological sequence into a numerical sequence. Previous studies [43–45] have proposed descriptors for this purpose, which are able to represent important aspects of these sequences. This group contains 7 numerical mapping descriptors: Voss [46] (also known as binary mapping), integer [45], real [47], Z-curve [43], electron-ion interaction potential (EIIP) [48, 49], complex number [44, 50] and atomic number [35, 51].
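To make the idea of a numerical mapping concrete, the following is a minimal Python sketch of three of these mappings for DNA. It is an illustration only, not MathFeature's own code: the function names are hypothetical, the integer assignment (T=0, C=1, A=2, G=3) is inferred from the GAGAGTGACCA example given in Table 5, and the EIIP values are the ones commonly reported for nucleotides.

```python
# Illustrative numerical mappings for a DNA sequence (hypothetical helper
# names; this is not the MathFeature implementation).

# Integer mapping inferred from the article's example: T=0, C=1, A=2, G=3.
INTEGER_MAP = {"T": 0, "C": 1, "A": 2, "G": 3}

# EIIP values commonly reported for nucleotides (assumption of this sketch).
EIIP_MAP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def integer_mapping(seq):
    """Return one integer per nucleotide (dimension L)."""
    return [INTEGER_MAP[base] for base in seq.upper()]


def eiip_mapping(seq):
    """Return one EIIP value per nucleotide (dimension L)."""
    return [EIIP_MAP[base] for base in seq.upper()]


def binary_mapping(seq, alphabet="ACGT"):
    """Voss (binary) mapping: one 0/1 indicator sequence per nucleotide,
    i.e. L * 4 values in total for DNA."""
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in alphabet}


print(integer_mapping("GAGAGTGACCA"))
# -> [3, 2, 3, 2, 3, 0, 3, 2, 1, 1, 2]
```

Each position contributes exactly one `1` across the four indicator sequences of the binary mapping, which is why its dimension is four times the sequence length.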
FT: This group consists of feature extraction methods, which generate sequence features based on genomic signal processing (GSP), using FT, a widely applied approach in several biological sequence analysis problems [34–36, 52]. To implement GSP techniques, we used all numerical mappings. A mathematical exploration can be seen in [36].
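As a rough illustration of how a numerically mapped sequence becomes a fixed-length set of spectral features, the sketch below computes a power spectrum with a naive discrete Fourier transform and summarizes it with a few of the statistics listed in Table 5. It is a sketch under assumptions: the function names are hypothetical, the zero-frequency (DC) term is skipped by choice, and a real pipeline would normally use an FFT routine such as `numpy.fft.fft` instead of this O(N²) loop.

```python
import cmath
import statistics


def power_spectrum(signal):
    """Naive DFT power spectrum |X[k]|^2 for k = 0..N-1.

    O(N^2), which is acceptable for short illustrative sequences.
    """
    n = len(signal)
    spectrum = []
    for k in range(n):
        x_k = sum(
            value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(signal)
        )
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def spectrum_features(signal):
    """A few summary statistics of the power spectrum.

    Skipping the DC term (k = 0) is an assumption of this sketch.
    """
    power = power_spectrum(signal)[1:]
    average = statistics.fmean(power)
    return {
        "average_power": average,
        "peak_to_average": max(power) / average,
        "median": statistics.median(power),
        "maximum": max(power),
        "minimum": min(power),
        "sample_sd": statistics.stdev(power),
        "population_sd": statistics.pstdev(power),
    }


# Indicator sequence of 'A' in a toy sequence with period 3 (ACG repeated).
a_indicator = [1 if base == "A" else 0 for base in "ACG" * 4]
print(spectrum_features(a_indicator))
```

For this toy signal of length 12, the non-zero power outside the DC term sits at index N/3 = 4 and its mirror at index 8, which is the period-3 signature often discussed in genomic signal processing.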
Chaos game representation (CGR): This approach also maps a sequence to numbers, but as a scale-independent, iterative geometric representation of DNA sequences [53]. Based on available CGR representations, the MathFeature package considers classical CGR [34, 53], frequency CGR [54] and CGR signal with FT [34].
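The iterative construction behind classical CGR can be sketched in a few lines of Python. The corner assignment and the starting point at the centre of the unit square are assumptions made for illustration (conventions differ between papers), and `cgr_points` is a hypothetical name rather than a MathFeature function.

```python
# Corner of the unit square assigned to each nucleotide. This assignment is
# an assumption of the sketch; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return one (x, y) point per nucleotide.

    Each point is the midpoint between the previous point and the corner
    of the current nucleotide, so a sequence of length L yields L * 2 values.
    """
    x, y = start
    points = []
    for base in seq.upper():
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


print(cgr_points("ACGT"))
# -> [(0.25, 0.25), (0.125, 0.625), (0.5625, 0.8125), (0.78125, 0.40625)]
```

Because every step halves the distance to a corner, the position of a point encodes the suffix of the sequence read so far, which is what makes the representation scale-independent.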
Entropy: Different studies have applied concepts from information theory to sequence feature extraction, mainly Shannon’s entropy (SE) [33, 55]. According to [56], Tsallis entropy (TE) [57], which generalizes the traditional Boltzmann-Gibbs entropy, has also been successfully explored in several studies. This group includes these two descriptors [36].
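A minimal sketch of both entropies computed over k-mer frequencies is shown below, producing one value per k as described for this group in Table 5. The function names, the use of base-2 logarithms for Shannon entropy and the default entropic index q = 2 are assumptions of this illustration, not settings taken from MathFeature.

```python
import math
from collections import Counter


def kmer_probabilities(seq, k):
    """Relative frequency of each k-mer observed in the sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return [count / total for count in counts.values()]


def shannon_entropy(probs):
    """H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tsallis_entropy(probs, q=2.0):
    """S_q = (1 - sum(p^q)) / (q - 1), defined here for q != 1."""
    if q == 1:
        raise ValueError("q = 1 is the Shannon limit; use shannon_entropy")
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def entropy_features(seq, max_k, q=2.0):
    """One Shannon and one Tsallis value for each k in 1..max_k."""
    shannon, tsallis = [], []
    for k in range(1, max_k + 1):
        probs = kmer_probabilities(seq, k)
        shannon.append(shannon_entropy(probs))
        tsallis.append(tsallis_entropy(probs, q))
    return shannon, tsallis


print(entropy_features("ACGTACGTAC", max_k=3))
```

For a sequence in which the four nucleotides are equally frequent, the 1-mer Shannon entropy is 2 bits and the Tsallis entropy with q = 2 is 0.75.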
Graphs: This group has descriptors based on graph theory (CN), which has been successfully used to represent biological sequences for classification tasks [58, 59]. The descriptors implemented in this group include techniques proposed in [60] and explored in [36].
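This section does not spell out how the graph is built, so the following Python sketch rests on a stated assumption: nodes are k-mers, and an undirected edge links two k-mers that occur at adjacent positions, with the edge weight counting how often that happens. The thresholding rule (keep edges whose weight reaches the threshold) is likewise an assumption, and all function names are hypothetical. Only a handful of the simpler measures from Table 5 are computed.

```python
import statistics
from collections import defaultdict


def kmer_graph(seq, k=3):
    """Build an undirected weighted graph from adjacent k-mers.

    Returns the node set and a dict mapping each edge (a frozenset of two
    k-mers) to its weight. Self-loops are ignored in this sketch.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    weights = defaultdict(int)
    for left, right in zip(kmers, kmers[1:]):
        if left != right:
            weights[frozenset((left, right))] += 1
    return set(kmers), dict(weights)


def apply_threshold(weights, threshold):
    """Keep only edges whose weight is at least `threshold` (assumed rule)."""
    return {edge: w for edge, w in weights.items() if w >= threshold}


def degree_features(nodes, weights):
    """Simple topological measures based on node degree."""
    degree = {node: 0 for node in nodes}
    for edge in weights:
        for node in edge:
            degree[node] += 1
    values = list(degree.values())
    return {
        "number_of_vertices": len(nodes),
        "number_of_edges": len(weights),
        "average_degree": statistics.fmean(values),
        "minimum_degree": min(values),
        "maximum_degree": max(values),
        "degree_sd": statistics.pstdev(values),
    }


nodes, weights = kmer_graph("ACGTACGT", k=3)
print(degree_features(nodes, weights))
print(degree_features(nodes, apply_threshold(weights, 2)))
```

Computing the same measures on each thresholded subgraph is one way to arrive at a feature vector whose length grows with the number of thresholds, in line with the $12 \cdot t$ dimension reported for the thresholded descriptor.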
MathFeature also provides well-known descriptors from other studies of biological sequences (called conventional descriptors here, see Table 4; they were included because of their large number of implementations in the reviewed packages, see Table 1), such as NAC, dinucleotide composition (DNC), trinucleotide composition (TNC), pseudo K-tuple nucleotide composition (PseKNC) [16], accumulated nucleotide frequency (ANF; DNA, RNA and protein) [61], basic k-mer (DNA, RNA and protein) [62], AAC, dipeptide composition (DPC), tripeptide composition (TPC) and Xmer k-Spaced Ymer composition frequency (kGap; DNA, RNA and protein) [30]. In addition, we implemented two widely known descriptors from coding sequence studies, i.e. open reading frame (ORF) or coding features [36] and the Fickett score [63]. Finally, we summarize the set of features generated by each descriptor investigated in this study (mathematical and conventional) in Table 5. MathFeature is freely available at https://github.com/Bonidia/MathFeature, and its documentation is provided at https://bonidia.github.io/MathFeature/.
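Several of the composition-based conventional descriptors can be viewed as relative k-mer frequencies over a fixed alphabet: NAC, DNC and TNC correspond to k = 1, 2 and 3 on nucleotides, and AAC and DPC to k = 1 and 2 on amino acids. The Python sketch below illustrates this view; `kmer_frequencies` is a hypothetical helper, not part of the MathFeature interface, and skipping windows with ambiguous symbols is a choice made for the example.

```python
from itertools import product

DNA = "ACGT"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(seq, k, alphabet=DNA):
    """Relative frequency of every possible k-mer over `alphabet`.

    The result always has len(alphabet) ** k entries, so every sequence
    yields a feature vector of the same length.
    """
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous symbols such as N
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


nac = kmer_frequencies("AACG", k=1)
print(nac)
# -> {'A': 0.5, 'C': 0.25, 'G': 0.25, 'T': 0.0}
```

The vector lengths produced this way match the dimensions listed in Table 4: 4, 16 and 64 for NAC, DNC and TNC, and 20 and 400 for AAC and DPC.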
Descriptor group | Descriptor | Dimension | Biological sequence |
---|---|---|---|
Other descriptors | Basic k-mer | $4^k$ or $20^k$ | DNA/RNA/Protein |
Other descriptors | Customized k-mer | $4^k$ or $20^k$ | DNA/RNA/Protein |
Other descriptors | NAC | $4$ | DNA/RNA |
Other descriptors | DNC | $16$ | DNA/RNA |
Other descriptors | TNC | $64$ | DNA/RNA |
Other descriptors | ORF Features or Coding Features | $10$ | DNA/RNA |
Other descriptors | Fickett score | $2$ | DNA/RNA |
Other descriptors | PseKNC | – | DNA/RNA |
Other descriptors | ANF | $L$ | DNA/RNA/Protein |
Other descriptors | kGap | $4^X \cdot 4^Y$ or $20^X \cdot 20^Y$ | DNA/RNA/Protein |
Other descriptors | AAC | $20$ | Protein |
Other descriptors | DPC | $400$ | Protein |
Other descriptors | TPC | $8000$ | Protein |
$L$ = length of the longest sequence, $k$ = frequencies of k-mer.
Descriptors | Features |
---|---|
Binary, Z-curve, Real, Integer, EIIP, complex number, atomic number, CGR, ANF | Convert a biological sequence into a numerical sequence, e.g. Integer representation: GAGAGTGACCA == 3, 2, 3, 2, 3, 0, 3, 2, 1, 1, 2. |
Binary + Fourier, Z-curve + Fourier, real + Fourier, integer + Fourier, EIIP + Fourier, complex number + Fourier, atomic number + Fourier, Chaos Game Signal (with Fourier) | Peak to average power ratio (2 features), average power spectrum, median, maximum, minimum, sample SD, population SD, percentile (15/25/50/75), range, variance, interquartile range, semi-interquartile range, coefficient of variation (cv), skewness and kurtosis. |
Shannon, Tsallis | For each k-mer (e.g. 1-mer, 2-mers,..., k-mers), we generated an entropic measure. |
CN (with threshold) | Betweenness, assortativity, average degree, average path length, minimum degree, maximum degree, number of edges, degree SD, frequency of motifs (size 3 and 4), clustering coefficient (local and global). |
CN (without threshold) | Betweenness, assortativity, average degree, average path length, minimum degree, maximum degree, number of edges, degree SD, frequency of motifs (size 3 and 4), clustering coefficient (local and global), Kleinberg’s authority centrality scores, closeness centralities, Burt’s constraint scores, multiplicities, density, diameter, eccentricity, edge betweenness, Kleinberg’s hub score, maximum degree of a vertex set, neighborhood size, radius, strength (weighted degree), number of vertices. |
k-mer, Customized k-mer, NAC, DNC, TNC, AAC, DPC, TPC, kGap | Generation of nucleic acid or amino acid statistical information, e.g. NAC for DNA: relative frequency of A, C, T, G. |
ORF features or coding features | Maximum ORF length, minimum ORF length, std ORF length, average ORF length, cv ORF length, maximum GC content - ORF, minimum GC content - ORF, std GC content - ORF, average GC content - ORF, cv GC content - ORF. |
Fickett score | Fickett:orf, Fickett:full:sequence |
PseKNC | Modes of PseKNC with physicochemical properties |
Results
The main aim of this paper is to make publicly available a large set of feature extraction techniques for biological sequences, including mathematical descriptors not found in similar packages. These descriptors have been successfully applied to extract relevant features from biological sequences, as can be seen in [36], [34], [52], [33] and [60]. For this reason, to assess the relevance of MathFeature descriptors, we provide case studies, which are detailed and presented in the experimental scenario section.
Experimental scenario
We ran experiments for nine case studies with distinct scenarios for the classification of DNA, RNA and protein sequences, as shown in Table 6. These case studies compare the use of several descriptors in distinct problem domains. Furthermore, we did not include any feature selection or hyperparameter optimization technique. Hence, for a fair comparison, we selected descriptors using stratified random sampling (choosing descriptors in each group defined in the article, e.g. numerical mapping, FT, chaos game, entropy, graphs and conventional) in all case studies to avoid any biased choices according to the problem domain. In addition, to compare our results with state-of-the-art studies, we used different ML algorithms, performance measures and dataset partitions to adapt our pipeline to the benchmark dataset. Finally, we also selected hybridized features using stratified random sampling, to assess how these feature sets can improve the ML model prediction.
Problem | Reference | Case study | Application | Number of sequences | Classifier |
---|---|---|---|---|---|
Non-classical secreted proteins | [8] | I | Protein | 655 | CatBoost |
PVP | [64] | II | Protein | 626 | Support Vector Machines |
SARS-CoV-2 sequences | [65] | III | DNA | 24 815 | Random Forest |
Sigma70 promoters | [12] | IV | DNA | 2141 | Support Vector Machines |
Anticancer Peptides | [66] | V | Protein | 344 | Random Forest |
Protein lysine crotonylation | [67] | VI | Protein | 40 587 | Random Forest |
Long non-coding RNAs | [13] | VII | RNA | 21 000 and 12 000 | CatBoost |
Long non-coding RNAs | [68] | VIII | RNA | 36 000 | Deep Learning |
Sigma70 promoters | [69] | IX | DNA | 2141 | Random Forest |
Case study I-non-classical secreted proteins
Here, we induced a classifier for non-classical secreted proteins using the benchmark datasets provided by [8] (training: 141 positive and 446 negative samples; test: 34 positive and 34 negative samples). We extracted features using integer mapping, FT + integer mapping and AAC. Afterward, we applied the CatBoost algorithm to the new datasets and assessed the predictive performance using Accuracy (ACC), F1-score and Matthews Correlation Coefficient (MCC). Our performance (ACC: 0.8382, F1-score: 0.8070 and MCC: 0.7149) was superior to that of state-of-the-art tools, such as SecretomeP [70] (ACC: 0.5880, F1-score: 0.4620 and MCC: 0.2000) and PeNGaRoo [8] (ACC: 0.7790, F1-score: 0.7890 and MCC: 0.5610).
Case study II-PVP
This case study addresses the prediction of PVP, a problem reported in [9]. For the experiments, we used the benchmark data provided by [64], with 500 sequences for training (250 PVP and 250 non-PVP) and 126 for testing (63 PVP and 63 non-PVP). To numerically represent the sequences, we built a hybrid feature set with SE (|$k = 12$|), CN (|$k = 1$|, |$t = 2$|) and AAC. To generate our predictive model, we induced a classifier using an ensemble (bagging) of Support Vector Machines (SVMs) and assessed its predictive performance with the F1-score, ACC, area under the curve (AUC) and MCC. The model reached F1-score: 0.7934, ACC: 0.8016, AUC: 0.8661 and MCC: 0.6051. The results using the hybrid set of features were superior to the performance obtained using conventional features extracted from the same dataset [64]. The hybrid feature set also outperformed the feature sets used by PVPred [71] (ACC: 0.7300, AUC: 0.8570 and MCC: 0.5050), PVP-SVM [9] (ACC: 0.7460, AUC: 0.8440 and MCC: 0.5050) and PVPred-SCM [72] (ACC: 0.7140, AUC: – and MCC: 0.4320), and was slightly worse than Meta-iPVP [64] (ACC: 0.8170, AUC: 0.8700 and MCC: 0.6420).
Case study III-SARS-CoV-2 sequences
For this case study, we conducted experiments using a dataset to differentiate SARS-CoV-2 from other viruses (e.g. HIV, Influenza, hepatitis, Ebolavirus, SARS). We downloaded all available virus sequences (29 135) from the NCBI Viral Genome database [65] (complete genomic sequences (DNA), e.g. Nucleotide Completeness = ‘complete’ AND host = ‘homo sapiens’). In a preprocessing phase, we removed sequences shorter than 2000 bp and longer than 50 000 bp [73] to eliminate any bias in the sequence size, since SARS-CoV-2 has an average length of 29 838 bp. This resulted in a dataset with 22 442 sequences from other viruses and 2373 SARS-CoV-2 sequences. In this experiment, we extracted the TE-based features (|$k = 12$| and |$q = 6$|) and applied the Random Forest (RF) algorithm to the resulting dataset, reporting the mean over 10-fold cross-validation. It is important to note that we kept the dataset unbalanced and therefore used performance metrics such as F1-score, balanced accuracy (BACC) and Cohen’s kappa coefficient. The RF model discriminated SARS-CoV-2 from the other viruses with F1-score, BACC and kappa of 0.9873, 0.9919 and 0.9860, respectively. Moreover, we tested other conventional descriptors (e.g. k-mer, PseKNC, ORF features, Fickett score and TNC), which reached a balanced accuracy between 0.9800 and 0.9900. These results suggest that SARS-CoV-2 and the other viruses are linearly separable even when different feature vectors are used, which is also supported by [10, 11].
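The TE-based features can be sketched as follows, assuming the standard Tsallis entropy $S_q = (1 - \sum_i p_i^q)/(q - 1)$ computed over the relative frequencies of overlapping k-mers, with one value per k-mer size from 1 to k. The function names are hypothetical, and details such as how k-mers are counted may differ from the MathFeature implementation.

```python
from collections import Counter


def kmer_probabilities(seq, k):
    """Relative frequencies of the overlapping k-mers observed in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return [c / total for c in counts.values()]


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q = (1 - sum(p^q)) / (q - 1), defined for q != 1."""
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def tsallis_features(seq, max_k, q):
    """One entropy value per k-mer size, for k = 1 .. max_k.

    Assumes len(seq) >= max_k so that every k-mer size occurs at least once.
    """
    return [tsallis_entropy(kmer_probabilities(seq, k), q)
            for k in range(1, max_k + 1)]


if __name__ == "__main__":
    # Toy sequence; the case study uses max_k = 12 and q = 6 on full genomes.
    print(tsallis_features("ACGTACGTTGCAACGT", max_k=3, q=6))
```

With the setting of this case study (|$k = 12$|), each sequence is therefore represented by a vector of 12 entropy values, regardless of its length.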
Case study IV-Sigma70 promoters
In this case study, we trained an SVM classifier to induce a sigma70 promoter predictor based on the benchmark dataset from [12]. This dataset contains 741 positive samples (promoter) and 1400 negative samples (non-promoter). For the feature extraction, we used the CGR descriptor. The experiments were assessed by partitioning the dataset with 5-fold cross-validation (same as in [12]), which yielded the following mean performance values: 0.8594, 0.8346, 0.7872 and 0.6852 for ACC, BACC, F1-score and MCC, respectively. In [12], the authors report the performance of their tool, iPro70-PseZNC, also using SVM, for two of these metrics, ACC: 0.8450 and MCC: 0.6630. Thus, by using the mathematical descriptors, the results improved by 0.0144 (1.44%) for ACC and by 0.0222 (2.22%) for MCC.
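For readers unfamiliar with the CGR descriptor, the sketch below shows the basic chaos game iteration: each nucleotide moves the current point halfway toward the corner assigned to that nucleotide. The corner layout and the starting point are assumptions based on a commonly used convention, and the function name is illustrative. The layout used inside MathFeature may differ.

```python
# Corner assignment follows a commonly used CGR layout; it is an assumption
# here and may differ from the one MathFeature uses internally.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR trajectory of a DNA sequence.

    Each point is the midpoint between the previous point and the corner
    of the current nucleotide, starting from the centre of the unit square.
    """
    x, y = start
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("GAGAGTGACCA"):
        print(point)
```

The resulting coordinate series is a numerical sequence of the same length as the input, so it can be used directly as features or passed to a Fourier transform.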
Case study V-anticancer peptides
In this case study, our aim is to identify anticancer peptides based on [66]. To this end, we extracted the CN (|$k = 2$|, |$t = 1$|) and AAC features from the benchmark dataset provided by the authors (206 non-anticancer peptides and 138 anticancer peptides). The RF algorithm was applied to the transformed dataset using 10-fold cross-validation. The mean predictive performance of the trained model was assessed using ACC, F1-score and MCC. This model outperformed the one reported in [66] in ACC and MCC, but not in F1-score (ACC: 0.9300, F1-score: 0.9061 and MCC: 0.8563 against ACC: 0.9273, F1-score: 0.9270 and MCC: 0.8490).
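The 10-fold cross-validation protocol used here can be sketched with the standard library alone. This is a generic illustration: the shuffling, the seed and the function name are assumptions, since the text does not state whether the folds were shuffled or stratified.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and split them into n_folds near-equal folds.

    Yields (train_indices, test_indices) pairs, one per fold, so that every
    sample is used for testing exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 344 peptides, as in the benchmark dataset of this case study.
    for train, test in k_fold_indices(344):
        print(len(train), len(test))
```

A model is trained on each `train` split and evaluated on the matching `test` split, and the reported value is the mean of the per-fold scores.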
Case study VI-protein lysine crotonylation
Based on [67], we induced and assessed an RF classifier to identify protein lysine crotonylation sites. The benchmark data provided by the author contains 32 418 sequences for training (2742 positive and 29 676 negative peptides - papaya) and 8169 sequences for testing (711 positive and 7458 negative peptides - papaya). For feature extraction, we applied numerical mapping with EIIP. We assessed the predictive performance with BACC and MCC, which were 0.6450 and 0.1652, respectively. These results were better than those obtained with some of the feature extraction techniques used in [67], e.g. |$RF_{AAC}$| (MCC: 0.1030) and |$RF_{CKSAAP}$| (MCC: 0.1110).
Case study VII-long non-coding RNAs
In this case study, we trained the CatBoost algorithm to distinguish long non-coding RNA (lncRNA) sequences from protein-coding genes (mRNAs), using two datasets made available by [13]: Human (training set: 16 000 sequences and test set: 5000 sequences) and Wheat (training set: 8000 sequences and test set: 4000 sequences). From these datasets, we extracted the FT + real mapping, TNC and coding descriptors. Essentially, we followed the same pipeline as in the previous case studies. Once again, the predictive model induced using our descriptors showed a high predictive performance in the datasets, i.e. Human (ACC: 0.9652, F1-score: 0.9646, MCC: 0.9309) and Wheat (ACC: 0.8870, F1-score: 0.8907, MCC: 0.7757). Our results were better than those of several tools shown in [13] in all but one of the comparisons listed here (CPC on Wheat): CPC [74] (Human - ACC: 0.8304; Wheat - ACC: 0.9595), CNCI [75] (Human - ACC: 0.9450; Wheat - ACC: 0.6158), CPAT [63] (Human - ACC: 0.9642; Wheat - ACC: 0.8743), PLEK [76] (Human - ACC: 0.9274; Wheat - ACC: 0.8773) and CPC2 [77] (Human - ACC: 0.9614; Wheat - ACC: 0.7870).
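The FT + real mapping descriptor used in this case study can be illustrated as follows: the sequence is mapped to real numbers, the power spectrum of its discrete Fourier transform is computed, and summary statistics of that spectrum become the features. The mapping values below are an assumption for illustration, the function names are hypothetical, and only a subset of the statistics is shown. The naive transform is written out to keep the sketch dependency-free.

```python
import cmath
import statistics

# Real-number mapping; these values are an assumption for illustration.
REAL_MAP = {"A": -1.5, "C": 0.5, "G": -0.5, "T": 1.5}


def power_spectrum(signal):
    """Squared magnitude of the discrete Fourier transform (naive O(n^2))."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def fourier_features(seq):
    """A few summary statistics of the power spectrum of a mapped sequence."""
    ps = power_spectrum([REAL_MAP[b] for b in seq.upper()])
    return {
        "average_power": statistics.mean(ps),
        "median": statistics.median(ps),
        "maximum": max(ps),
        "minimum": min(ps),
        "sample_sd": statistics.stdev(ps),
        "range": max(ps) - min(ps),
    }


if __name__ == "__main__":
    print(fourier_features("GAGAGTGACCA"))
```

Because the statistics summarize the whole spectrum, sequences of different lengths yield feature vectors of the same size, which is what allows them to be fed to a standard classifier.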
Case study VIII-using MathFeature with deep learning
According to [78], deep learning (DL) is a field of ML responsible for several advances, due to its high predictive performance in big data [79]. Therefore, we assessed our descriptors with a DL architecture, using the same problem as in case study VII [lncRNAs versus mRNAs - feature vector (FT + real mapping and coding descriptors)], but with a benchmark dataset from [68], an article dedicated to a DL approach (Zea mays dataset with 36 000 sequences: 18 000 lncRNA and 18 000 mRNA). Our classifier was generated using Keras [80] (default parameters). Furthermore, we compared our model with three DL tools used in [68] (PlncRNA-HDeep [68], lncRNAnet [81] and LncADeep [82]), using the same pipeline [hold-out (80% of samples for training and 20% for testing), ACC, Recall and F1-score]. Our model showed a high predictive performance in the dataset (ACC: 0.9605, Recall: 0.9917 and F1-score: 0.9616), outperforming lncRNAnet (ACC: 0.7290, Recall: 0.7200, F1-score: 0.7260), LncADeep (ACC: 0.8000, Recall: 0.6660, F1-score: 0.7690) and, in Recall, PlncRNA-HDeep (Recall: 0.9790), but with a small loss (ACC: 0.0045 and F1-score: 0.0034) in relation to PlncRNA-HDeep (ACC: 0.9650 and F1-score: 0.9650). Therefore, based on our results, MathFeature can also generate robust and efficient feature vectors for DL approaches.
Case study IX-MathFeature versus other packages
So far, we have evaluated MathFeature with eight experiments in well-established problems. In this last case study, we also compared MathFeature with five packages, namely BioSeq-Analysis [26], Seq2Feature [29], PyFeat [30], iLearn [7] and SubFeat [69]. The experiments were carried out using the dataset provided by [69], which was the same dataset used in case study IV (Sigma70 promoters). For this study, we considered 741 positive samples (promoter) and 1400 negative samples (non-promoter) and three metrics (ACC, AUC, MCC), evaluating the RF classifier using 10-fold cross-validation (as our reference). We kept our CGR descriptor. MathFeature (ACC: 0.8576, AUC: 0.9252 and MCC: 0.6797) outperformed all packages: BioSeq-Analysis (ACC: 0.7637, AUC: 0.8297 and MCC: 0.4726), Seq2Feature (ACC: 0.7197, AUC: 0.7637 and MCC: 0.3723), PyFeat (ACC: 0.7842, AUC: 0.8589 and MCC: 0.5064), iLearn (ACC: 0.7597, AUC: 0.8173 and MCC: 0.5275) and SubFeat (ACC: 0.8098, AUC: 0.9232 and MCC: 0.5664). Moreover, based on the results obtained comparing MathFeature and Seq2Feature, we generated a hybrid vector with features from both packages (MathFeature: CGR and Seq2Feature: Nucleotide content, random choice), which provided the best result (ACC: 0.8627, AUC: 0.9332 and MCC: 0.6927). Therefore, we achieved a high predictive performance by applying either MathFeature alone or a hybrid combination of packages.
Discussion
We assessed the MathFeature package in nine case studies grouped by protein and DNA/RNA sequences. We considered four protein problems and three DNA/RNA problems in the experiments. The classification problems in each case were chosen based on recent articles with distinct domains. For example, for protein molecules, we used the following datasets: (i) non-classical secreted proteins, which, according to [8], are important for understanding the pathogenesis mechanisms of Gram-positive bacteria; (ii) PVP identification, e.g. to develop new antibacterial drugs [9]; (iii) anticancer peptides, which present a new direction in the treatment of cancer [66, 83]; and (iv) protein lysine crotonylation, a type of post-translational modification [67, 84]. In these studies, we noticed that the hybrid combination of mathematical and conventional descriptors (available in MathFeature) improves the performance of the models, mainly when applying CN, FT, numerical mapping (e.g. EIIP and integer) and AAC, with ACC/BACC ranging from 0.6450 to 0.9300 across all problems. For DNA/RNA molecules, the problems used are (i) SARS-CoV-2, a hot topic in bioinformatics [10, 11]; (ii) detection of sigma70 promoters to study the dynamics of gene expression [12, 85]; and (iii) lncRNA sequences, which can play essential roles in biological processes, e.g. transcriptional regulation [68, 86]. For these problems, we obtained highly robust results (ACC/BACC ranging from 0.8594 to 0.9900), applying either only mathematical descriptors or a hybrid combination, highlighting TE-based features, CGR, FT, TNC and coding descriptors. Finally, our findings report the relevance of MathFeature descriptors in several applications, e.g. human, plant and bacterial data.
Conclusion
In this study, we described a new package, called MathFeature, comprising an extensive set of 37 feature descriptors for biological sequences. Of these 37 descriptors, 20 are based on mathematical approaches and are not available in other feature extraction packages. The other 17 descriptors, called conventional descriptors, were selected from those often used in the literature. The main motivation for this new package was that, despite the relevance of the features extracted by mathematical descriptors, they are not available in current packages. Thus, MathFeature extends the existing packages by including mathematical techniques. To experimentally assess the descriptors implemented in this package, we conducted nine case studies, using several biological scenarios, e.g. DNA, RNA and proteins (primary sequence of amino acids), applied in different problem domains. Furthermore, we avoided introducing any type of bias in the selection of features, and hence, the quality assessment of each feature can be made by the community with regard to the specific problem of interest. In the experiments, we obtained high predictive performance, both applying only mathematical descriptors (e.g. case studies II, III, VI) and applying a hybrid combination of them with well-known conventional descriptors found in the literature (e.g. AAC, TNC, Coding). Finally, through MathFeature, we outperformed several studies in benchmark datasets, indicating that all descriptors within MathFeature can improve the performance of predictive models induced by ML algorithms. Regarding the limitations, we observed that some of these descriptors (e.g. Fourier, Shannon and Tsallis) have a low performance for short sequences. However, when mathematical descriptors are combined with conventional ones, in hybrid sets, there is a clear improvement in the predictive performance.
As future work, we intend to investigate descriptors for short sequences, especially in prokaryotic organisms, and also to include more protein descriptors.
Acknowledgments
The authors would like to thank USP, CAPES, CNPq and FAPESP (2013/07375-0) for the financial support for this research.
Availability of data and materials
The datasets, experiments and descriptors are available in the Github repository: https://github.com/Bonidia/MathFeature.
Financial support
This project was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) - Finance Code 001 and PROEX-11919694/D, USP, CNPq and FAPESP (2013/07375-0).
Availability and implementation
MathFeature is freely available at https://github.com/Bonidia/MathFeature. Documentation: https://bonidia.github.io/MathFeature/
Robson P. Bonidia received the M.Sc. degree in bioinformatics from the Federal University of Technology - Paraná (UTFPR), Brazil. He is currently pursuing the Ph.D. degree in computer science and computational mathematics with the University of São Paulo-USP. His main research topics are in computational biology and pattern recognition, feature extraction and selection, metaheuristics, and sports data mining.
Douglas S. Domingues graduated in Biology in the São Paulo State University at Botucatu, Brazil, in 2003. He received the PhD degree in Biotechnology from the University of São Paulo, Brazil, in 2009. He is currently a research professor of Plant Gene Expression in the Department of Biodiversity, São Paulo State University at Rio Claro, Brazil, in charge of the Genomics and Transcriptomics in Plants Group. He is the Head of the PhD in Plant Biology in São Paulo State University at Rio Claro, Brazil. In his research, he uses genomics and transcriptomics approaches in non-model plants to understand gene function, the evolution of gene families and genome components, as well as molecular responses to environmental constraints.
Danilo S. Sanches received the Ph.D. degree in electrical engineering from the University of Sao Paulo, in 2013. He is currently an Associate Professor with the Computer Science Department, Federal University of Technology - Paraná (UTFPR), Brazil. His research includes data mining, machine learning, evolutionary algorithms, bioinformatics, and pattern recognition approaches.
André C. P. L. F. de Carvalho is a full professor at the Department of Computer Science, University of São Paulo. He is the Vice Dean of the Mathematics and Computer Science Institute of the University of São Paulo, ICMC-USP, Vice Director of the Center for Mathematical Sciences Applied to Industry, USP and Vice President of the Brazilian Computer Society, SBC. His research interests are in machine learning, data mining and data science.
References
Richard F