Abstract

One of the main challenges in applying machine learning algorithms to biological sequence data is how to represent a sequence as a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques, such as mathematical descriptors, are not available in existing packages. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (accuracy of 0.6350–0.9897), both when applying only mathematical descriptors and when hybridizing them with well-known descriptors from the literature. Finally, using MathFeature, we outperformed several studies on eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature advances the area by bringing descriptors not available in other packages, as well as by allowing non-experts to use feature extraction techniques.

Keywords: package, feature extraction, mathematical descriptors, biological sequences, python, GUI-based platform

Background

Machine learning (ML) algorithms have been successfully applied to genomics, transcriptomics and proteomics problems [1, 2]. Nevertheless, their predictive performance depends on the representation of the sequences by relevant features, able to extract important aspects present in the original sequences. In [3, 4], the authors address the relevance of using an appropriate mathematical expression to extract features from biological data, which has been adopted by several studies [5–7], e.g. non-classical secreted proteins [8], phage virion proteins (PVP) [9], SARS-CoV-2 [10, 11], sigma70 promoters [12] and long non-coding RNAs [13, 14].

As a result, many techniques have been proposed and experimentally investigated [15, 16], and several of them were made available in public software packages, such as PROFEAT [17], PseAAC [18], propy [19], PseKNC-general [16], SPiCE [20], protr/ProtrWeb [21], ProFET [22], Pse-in-One [4], repDNA [23], Rcpi [24], repRNA [25], BioSeq-analysis [26], iFeature [27], PyBioMed [28], Seq2Feature [29], PyFeat [30], iLearn [7], periodicDNA [31] and iLearnPlus [32].

These software packages have been used to extract features from sequences. However, some aspects present in the sequences cannot be extracted by the feature extraction techniques included in these tools. These features, which were shown to be relevant in previous studies [33–36], describe mathematical aspects observed in biological sequences and will be named here mathematical descriptors [37]. These descriptors are based on several techniques, such as multiple numerical mappings, Fourier transform (FT), chaos game theory, entropy and complex networks (CN). To allow the extraction of these descriptors as features for the study of biological sequences, together with conventional descriptors available in other packages, we created a novel open-source Python package, named MathFeature.

Table 1

Descriptor groups in reviewed studies

Group	Initials	Application group	Study
Amino acid composition	AAC	Protein	[7] [4] [17] [19] [20] [21] [22] [24] [26] [27] [28] [29] [30]
Pseudo-amino acid composition	PseAAC	Protein	[7] [4] [18] [19] [20] [21] [24] [26] [27] [28]
Composition, transition, distribution	CTD	Protein	[7] [17] [19] [20] [21] [22] [24] [27] [28]
Sequence-order	SO	Protein	[7] [17] [19] [20] [21] [24] [27] [28]
Conjoint triad	CT	Protein	[7] [21] [24] [27] [28]
Proteochemometric descriptors	PCM	Protein	[7] [21] [24] [27]
Profile-based features	PF	Protein	[7] [20] [21] [24] [26] [27]
Nucleic acid composition	NAC	DNA, RNA	[7] [4] [16] [23] [25] [26] [28] [30]
Pseudo nucleic acid composition	PseNAC	DNA, RNA	[7] [4] [16] [23] [25] [26] [28]
Structure composition	SC	DNA, RNA, Protein	[7] [25] [26] [27]
Sequence similarity	SS	DNA, RNA, Protein	[24]
Autocorrelation	–	DNA, RNA, Protein	[7] [17] [19] [16] [20] [21] [4] [23] [24] [26] [27] [28]
Numerical mapping	–	DNA, RNA, Protein	[7] [27]
K-nearest neighbor	KNN	DNA, RNA, Protein	[7] [27]
Physicochemical property	PP	DNA, RNA, Protein	[7] [22] [27] [29]

This package provides, in a single environment, many of the mathematical descriptors previously proposed for feature extraction from biological sequences [33–36]. MathFeature contains 37 descriptors, of which 20 are mathematical descriptors organized into five groups (numerical mapping, chaos game, FT, entropy and graphs). Additionally, MathFeature extends our preliminary investigation [36], where we investigated nine sets of mathematical features. MathFeature also includes descriptors for protein sequences, i.e. prediction of structural features along the primary sequence of amino acids. To the best of our knowledge, MathFeature is the first package to provide such a large and comprehensive set of feature extraction techniques based on mathematical descriptors for DNA, RNA and proteins.

Related works

We consider feature engineering a key step for the success of ML applications [38–40], particularly in biological sequence preprocessing [3, 41, 42]. In terms of terminology, according to [38], a feature is synonymous with an input variable or attribute. Nevertheless, most of the studies in our review (15 studies) use the term ‘feature descriptor’, which is why we adopted it: a feature descriptor refers to a feature extraction method or technique that can produce several measures or values.

In this section, we describe 17 studies (cited in the Background section) related to feature extraction packages (tools, web servers, toolkits, etc.), which provide several feature descriptors for biological sequence analysis. We organized the selected studies into application categories (DNA, RNA or protein; see Supplementary File S1) and also plotted a Venn diagram (see Supplementary File S2) including all studies by application. In general, most studies focus on the representation of proteins (eight studies), while DNA and RNA have one study each. Considering the intersections of applications, we found four studies combining DNA, RNA and protein, two studies combining DNA and protein, and one study combining DNA and RNA.

In our literature review, we found 173 feature descriptors. It is not feasible to individually analyze and describe each descriptor. For this reason, based on our review, we divided these descriptors into 15 large groups, as shown in Table 1. The group column classifies the feature descriptors based on the reviewed studies, and the study column includes packages that have at least one descriptor from the related group.

Considering the groups introduced in Table 1, we found that most descriptors are based on AAC, PseAAC, CTD and SO for proteins, on NAC and PseNAC for DNA/RNA, and on autocorrelation for DNA, RNA and protein. MathFeature, in contrast, offers types of mathematical descriptors that the other packages lack (e.g. chaos game, FT, entropy and graphs), with the exception of two numerical mapping descriptors, which are available in only two packages [7, 27]. In addition, to better illustrate the advantages of MathFeature compared with other studies, we included Table 2, which shows the number of MathFeature descriptors that can also be found in other tools. The package with the largest overlap, iLearn, has only 15 of the 37 descriptors available in MathFeature, and the remaining packages share only a few descriptors (at most 9) with our study. Based on this analysis, we highlight the novelty of MathFeature in providing different descriptors for biological sequences, which we believe to be an important contribution. Also, most studies (13, 76.47%) were dedicated to evaluating only one type of sequence, while 4 (23.53%) studies cover multiple types of sequences, including MathFeature. Finally, our package is also competitive in terms of the number of descriptors (37 in total).

Table 2

Descriptors calculated by MathFeature compared to the available feature extraction packages. This table shows the number of MathFeature descriptors that existing packages have implemented

Package	Mathematical descriptors	Conventional descriptors	Number of descriptors calculated
MathFeature	20	17	37
PROFEAT	0	2	2
PseAAC	0	2	2
propy	0	5	5
PseKNC-general	0	5	5
SPiCE	0	4	4
ProtrWeb	0	5	5
ProFET	2	3	5
Pse-in-One	0	5	5
repDNA	0	5	5
Rcpi	0	3	3
repRNA	0	5	5
BioSeq-analysis	0	9	9
iFeature	1	4	5
PyBioMed	0	7	7
Seq2Feature	0	0	0
PyFeat	1	8	9
iLearn	2	13	15

Figure 1

Pipeline of descriptors calculated by MathFeature. A: Numerical mapping; B: FT; C: Chaos game representation; D: entropy; E: complex networks.

Package description

MathFeature is a user-friendly package that covers 20 mathematical descriptors, as illustrated in Figure 1. The MathFeature execution workflow can be divided into four simple steps, as shown in Figure 2. In Table 3, we organized the 20 descriptors into 5 groups according to their structure: numerical mapping (7), chaos game (2), FT (7), entropy (2) and graphs (2). MathFeature can be run from the console, but we also provide a graphical user interface (GUI)-based platform (see Supplementary File S3). We briefly describe each of the 5 groups representing the 20 descriptors:

Table 3

Mathematical descriptors calculated by MathFeature for DNA, RNA and Protein sequences

Descriptor group	Descriptor	Dimension	Biological sequence
Numerical mapping	Binary	$L \cdot 4$	DNA/RNA
Numerical mapping	Z-curve	$L \cdot 3$	DNA/RNA
Numerical mapping	Real	$L$	DNA/RNA
Numerical mapping	Integer	$L$	DNA/RNA/Protein
Numerical mapping	EIIP	$L$	DNA/RNA/Protein
Numerical mapping	Complex Number	$L$	DNA/RNA
Numerical mapping	Atomic Number	$L$	DNA/RNA
FT	Binary + Fourier	$19$	DNA/RNA
FT	Z-curve + Fourier	$19$	DNA/RNA
FT	Real + Fourier	$19$	DNA/RNA
FT	Integer + Fourier	$19$	DNA/RNA/Protein
FT	EIIP + Fourier	$19$	DNA/RNA/Protein
FT	Complex Number + Fourier	$19$	DNA/RNA
FT	Atomic Number + Fourier	$19$	DNA/RNA
Chaos game	CGR	$L \cdot 2$	DNA/RNA
Chaos game	Chaos Game Signal (with Fourier)	$19$	DNA/RNA
Entropy	Shannon	$k$	DNA/RNA/Protein
Entropy	Tsallis	$k$	DNA/RNA/Protein
Graphs	CN (with threshold)	$12 \cdot t$	DNA/RNA/Protein
Graphs	CN (without threshold)	$26 \cdot k$	DNA/RNA/Protein

$L$ = length of the longest sequence, $k$ = frequencies of k-mer, $t$ = threshold (number of subgraphs).

Figure 2

MathFeature execution workflow. Step 1: Select input sequence (DNA/RNA/Protein - MathFeature only accepts fasta format); Step 2: Choose the descriptor (mathematical or conventional); Step 3: It is necessary to run each descriptor separately; Step 4: The generated vectors can be used separately or they can be hybridized in a single vector.

Numerical mapping: Several sequence analysis studies require converting a biological sequence into a numerical sequence. Previous studies [43–45] have proposed descriptors for this purpose, which are able to represent important aspects of these sequences. This group contains 7 descriptors for numerical mapping: Voss [46] (known as binary mapping), integer [45], real [47], Z-curve [43], electron-ion interaction potential (EIIP) [48, 49], complex number [44, 50] and atomic number [35, 51].
FT: This group consists of feature extraction methods that generate sequence features based on genomic signal processing (GSP) using FT, a widely applied approach in several biological sequence analysis problems [34–36, 52]. To implement GSP techniques, we used all numerical mappings. A mathematical exploration can be seen in [36].
Chaos game representation (CGR): This approach is also a sequence mapping, but one that is scale-independent and iterative, producing a geometric representation of DNA sequences [53]. Based on available CGR representations, the MathFeature package considers classical CGR [34, 53], frequency CGR [54] and CGR signal with FT [34].
Entropy: Different studies have applied concepts from information theory for sequence feature extraction, mainly Shannon’s entropy (SE) [33, 55]. According to [56], Tsallis entropy (TE) [57], which generalizes the traditional Boltzmann-Gibbs entropy, has been successfully explored in several studies. This group includes these two descriptors [36].
Graphs: This group has descriptors based on graph theory (CN), which has been successfully used to represent biological sequences for classification tasks [58, 59]. The descriptors implemented in this group include techniques proposed in [60] and explored in [36].
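
To make the first two groups concrete, the following is a minimal sketch of an integer mapping, a binary (Voss) mapping and a Fourier power spectrum with a few summary statistics. It is not taken from the MathFeature source code: the function names are hypothetical, the integer codes follow the example given in Table 5 (T = 0, C = 1, A = 2, G = 3), and only a subset of the 19 Fourier features is computed.

```python
import numpy as np

# Integer codes consistent with the Table 5 example (assumed: T=0, C=1, A=2, G=3).
INTEGER_MAP = {"T": 0, "C": 1, "A": 2, "G": 3}


def integer_mapping(seq):
    """Map each nucleotide to its integer code (one value per position)."""
    return np.array([INTEGER_MAP[base] for base in seq.upper()], dtype=float)


def binary_mapping(seq, alphabet="ACGT"):
    """Voss (binary) mapping: one indicator signal per nucleotide, shape (4, L)."""
    seq = seq.upper()
    return np.array(
        [[1.0 if base == letter else 0.0 for base in seq] for letter in alphabet]
    )


def power_spectrum(signal):
    """Squared magnitude of the discrete Fourier transform of a 1-D signal."""
    return np.abs(np.fft.fft(signal)) ** 2


def spectrum_summary(spectrum):
    """A few of the summary statistics listed for the Fourier descriptors."""
    return {
        "average_power": float(np.mean(spectrum)),
        "median": float(np.median(spectrum)),
        "maximum": float(np.max(spectrum)),
        "minimum": float(np.min(spectrum)),
        "sample_sd": float(np.std(spectrum, ddof=1)),
        "population_sd": float(np.std(spectrum)),
        "range": float(np.max(spectrum) - np.min(spectrum)),
    }


seq = "GAGAGTGACCA"
signal = integer_mapping(seq)           # numerical mapping step
spectrum = power_spectrum(signal)       # genomic signal processing step
features = spectrum_summary(spectrum)   # fixed-length feature vector
```

The point of the second step is that the mapped signal has length L, which varies between sequences, whereas the summary statistics of its spectrum have a fixed length and can be fed directly to an ML algorithm.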

MathFeature also provides well-known descriptors from other studies with biological sequences. We call them conventional descriptors here (see Table 4) because of the large number of implementations in the reviewed packages (see Table 1). They include NAC, dinucleotide composition (DNC), trinucleotide composition (TNC), pseudo K-tuple nucleotide composition (PseKNC) [16], accumulated nucleotide frequency (ANF; DNA, RNA and protein) [61], basic k-mer (DNA, RNA and protein) [62], AAC, dipeptide composition (DPC), tripeptide composition (TPC) and Xmer k-Spaced Ymer composition frequency (kGap; DNA, RNA and protein) [30]. In addition, we implemented two widely known descriptors in coding sequence studies, namely open reading frame (ORF) or coding features [36] and Fickett score [63]. Finally, we summarized the set of features generated by each descriptor investigated in this study (mathematical and conventional), as described in Table 5. MathFeature is freely available at https://github.com/Bonidia/MathFeature, and its documentation is provided at https://bonidia.github.io/MathFeature/.
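
As an illustration of the composition-based conventional descriptors, the sketch below computes relative k-mer frequencies; with k = 1, 2 and 3 over the nucleotide alphabet it yields vectors of the NAC, DNC and TNC dimensions (4, 16 and 64). The function name and its handling of symbols outside the alphabet are assumptions made for this example, not the package's API.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequency of every possible k-mer over the given alphabet.

    The denominator is the number of sliding windows of size k, so k-mers
    containing symbols outside the alphabet still count towards the total
    but do not appear in the output.
    """
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: counts.get(kmer, 0) / total for kmer in kmers}


nac = kmer_frequencies("GAGAGTGACCA", 1)  # 4 values
dnc = kmer_frequencies("GAGAGTGACCA", 2)  # 16 values
```

Passing the 20-letter amino acid alphabet instead of "ACGT" gives the corresponding AAC, DPC and TPC vectors under the same assumptions.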

Table 4

Conventional descriptors calculated by MathFeature for DNA, RNA and Protein sequences

Descriptor groups	Descriptor	Dimension	Biological sequence
	Basic k-mer	$4^k$ or $20^k$	DNA/RNA/Protein
	Customized k-mer	$4^k$ or $20^k$	DNA/RNA/Protein
	NAC	$4$	DNA/RNA
Other descriptors	DNC	$16$	DNA/RNA
	TNC	$64$	DNA/RNA
	ORF Features or Coding Features	$10$	DNA/RNA
	Fickett score	$2$	DNA/RNA
	PseKNC	–	DNA/RNA
	ANF	$L$	DNA/RNA/Protein
	kGap	$4^X \cdot 4^Y$ or $20^X \cdot 20^Y$	DNA/RNA/Protein
	AAC	$20$	Protein
	DPC	$400$	Protein
	TPC	$8000$	Protein

$L$ = length of the longest sequence, $k$ = frequencies of k-mer.

Table 5

Features generated by each mathematical and conventional descriptor calculated by MathFeature

Descriptors	Features
Binary, Z-curve, Real, Integer, EIIP, complex number, atomic number, CGR, ANF	Convert a biological sequence into a numerical sequence, e.g. Integer representation: GAGAGTGACCA == 3, 2, 3, 2, 3, 0, 3, 2, 1, 1, 2.
Binary + Fourier, Z-curve + Fourier, real + Fourier, integer + Fourier, EIIP + Fourier, complex number + Fourier, atomic number + Fourier, Chaos Game Signal (with Fourier)	Peak to average power ratio (2 features), average power spectrum, median, maximum, minimum, sample SD, population SD, percentile (15/25/50/75), range, variance, interquartile range, semi-interquartile range, coefficient of variation (cv), skewness and kurtosis.
Shannon, Tsallis	For each k-mer (e.g. 1-mer, 2-mers,..., k-mers), we generated an entropic measure.
CN (with threshold)	Betweenness, assortativity, average degree, average path length, minimum degree, maximum degree, number of edges, degree SD, frequency of motifs (size 3 and 4), clustering coefficient (local and global).
CN (without threshold)	Betweenness, assortativity, average degree, average path length, minimum degree, maximum degree, number of edges, degree SD, frequency of motifs (size 3 and 4), clustering coefficient (local and global), Kleinberg’s authority centrality scores, closeness centralities, Burt’s constraint scores, multiplicities, density, diameter, eccentricity, edge betweenness, Kleinberg’s hub score, maximum degree of a vertex set, neighborhood size, radius, strength (weighted degree), number of vertices.
k-mer, Customized k-mer, NAC, DNC, TNC, AAC, DPC, TPC, kGap	Generation of nucleic acid or amino acid statistical information, e.g. NAC for DNA: relative frequency of A, C, T, G.
ORF features or coding features	Maximum ORF length, minimum ORF length, std ORF length, average ORF length, cv ORF length, maximum GC content - ORF, minimum GC content - ORF, std GC content - ORF, average GC content - ORF, cv GC content - ORF.
Fickett score	Fickett:orf, Fickett:full:sequence
PseKNC	Modes of PseKNC with physicochemical properties
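
The entropy row of Table 5 (one entropic measure per k-mer size) can be sketched as follows. This is an illustrative example rather than the package's implementation: the function names are hypothetical, Shannon entropy is computed in bits, and the Tsallis entropic index q = 2 is an assumed default chosen only for the example.

```python
import math
from collections import Counter


def kmer_probabilities(seq, k):
    """Observed relative frequencies of the k-mers present in the sequence."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return [count / total for count in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy: S_q = (1 - sum(p ** q)) / (q - 1), for q != 1."""
    if q == 1:
        raise ValueError("q = 1 is the Shannon limit; use shannon_entropy")
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def entropy_features(seq, max_k, q=2.0):
    """One Shannon and one Tsallis value for each k in 1..max_k."""
    seq = seq.upper()
    shannon = [shannon_entropy(kmer_probabilities(seq, k)) for k in range(1, max_k + 1)]
    tsallis = [tsallis_entropy(kmer_probabilities(seq, k), q) for k in range(1, max_k + 1)]
    return shannon, tsallis


shannon_vec, tsallis_vec = entropy_features("GAGAGTGACCA", max_k=3)
```

Each returned list has one entry per k-mer size, which matches the dimension reported for the Shannon and Tsallis descriptors in Table 3.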

Results

The main aim of this paper is to make publicly available a large set of feature extraction techniques for biological sequences, including mathematical descriptors not found in similar packages. These descriptors have been successfully applied to extract relevant features from biological sequences, as can be seen in [33, 34, 36, 52, 60]. To assess the relevance of the MathFeature descriptors, we provide case studies, which are detailed in the experimental scenario section.

Experimental scenario

We ran experiments for nine case studies with distinct scenarios for the classification of DNA, RNA and protein sequences, as shown in Table 6. These case studies compare the use of several descriptors in distinct problem domains. We did not apply any feature selection or hyperparameter optimization technique. For a fair comparison, we selected descriptors using stratified random sampling (choosing descriptors from each group defined in the article, e.g. numerical mapping, FT, chaos game, entropy, graphs and conventional) in all case studies, to avoid choices biased toward a particular problem domain. In addition, to compare our results with state-of-the-art studies, we used different ML algorithms, performance measures and dataset partitions, adapting our pipeline to each benchmark dataset. Finally, we also selected hybridized features using stratified random sampling, to assess how these feature sets can improve the ML model prediction.

Table 6

Experimental scenario in nine case studies

Problem	Reference	Case study	Application	Number of sequences	Classifier
Non-classical secreted proteins	[8]	I	Protein	655	CatBoost
PVP	[64]	II	Protein	626	Support Vector Machines
SARS-CoV-2 sequences	[65]	III	DNA	24 815	Random Forest
Sigma70 promoters	[12]	IV	DNA	2141	Support Vector Machines
Anticancer Peptides	[66]	V	Protein	344	Random Forest
Protein lysine crotonylation	[67]	VI	Protein	40 587	Random Forest
Long non-coding RNAs	[13]	VII	RNA	21 000 and 12 000	CatBoost
Long non-coding RNAs	[68]	VIII	RNA	36 000	Deep Learning
Sigma70 promoters	[69]	IX	DNA	2141	Random Forest

Case study I-non-classical secreted proteins

Here, we induced a classifier for non-classical secreted proteins using the benchmark datasets provided by [8] (training: 141 positive and 446 negative samples; test: 34 positive and 34 negative samples). We extracted features using integer mapping, FT + integer mapping and AAC. Afterward, we applied the CatBoost algorithm to the new datasets and assessed the predictive performance using Accuracy (ACC), F1-score and Matthews Correlation Coefficient (MCC). Our performance (ACC: 0.8382, F1-score: 0.8070 and MCC: 0.7149) was superior to that of state-of-the-art tools, such as SecretomeP [70] (ACC: 0.5880, F1-score: 0.4620 and MCC: 0.2000) and PeNGaRoo [8] (ACC: 0.7790, F1-score: 0.7890 and MCC: 0.5610).
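
For readers less familiar with these measures, the sketch below computes ACC, F1-score and MCC from confusion-matrix counts using their standard definitions. The counts in the usage line are illustrative only: the confusion matrix is not reported in this section, and they were chosen because they are one combination compatible with a balanced 34/34 test set. The function name is hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return (ACC, F1-score, MCC) from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total

    # F1 is the harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC is defined as 0 when any marginal total is zero.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return acc, f1, mcc


# Illustrative counts (an assumption, not reported by the authors):
# 34 positives and 34 negatives, with 11 positives missed.
acc, f1, mcc = classification_metrics(tp=23, tn=34, fp=0, fn=11)
print(f"ACC={acc:.4f} F1={f1:.4f} MCC={mcc:.4f}")
```

Because MCC uses all four cells of the confusion matrix, it is less forgiving than ACC when one class is predicted poorly, which is why it is reported alongside ACC throughout the case studies.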

Case study II-PVP

This case study addresses the prediction of PVP, as reported in [9]. For the experiments, we used benchmark data provided by [64], with 500 sequences for training (250 PVP and 250 non-PVP) and 126 for testing (63 PVP and 63 non-PVP). To numerically represent the sequences, we built a hybrid feature set with SE ($k = 12$), CN ($k = 1$, $t = 2$) and AAC. A classifier was then induced using an ensemble method (bagging) of Support Vector Machines (SVMs), and its predictive performance was assessed with the F1-score, ACC, area under the curve (AUC) and MCC. The model achieved F1-score: 0.7934, ACC: 0.8016, AUC: 0.8661 and MCC: 0.6051. The results using the hybrid set of features were superior to the performance obtained using conventional features extracted from the same dataset [64]. The hybrid feature set also outperformed PVPred [71] (ACC: 0.7300, AUC: 0.8570 and MCC: 0.5050), PVP-SVM [9] (ACC: 0.7460, AUC: 0.8440 and MCC: 0.5050) and PVPred-SCM [72] (ACC: 0.7140, AUC: – and MCC: 0.4320), and was slightly worse than Meta-iPVP [64] (ACC: 0.8170, AUC: 0.8700 and MCC: 0.6420).

Case study III-SARS-CoV-2 sequences

For this case study, we conducted experiments using a dataset to differentiate SARS-CoV-2 from other viruses (e.g. HIV, Influenza, hepatitis, Ebolavirus, SARS). We downloaded all available virus sequences (29 135) from the NCBI Viral Genome database [65] (complete genomic sequences (DNA), e.g. Nucleotide Completeness = ‘complete’ AND host = ‘homo sapiens’). In a preprocessing phase, we removed sequences shorter than 2000 bp and longer than 50 000 bp [73] to eliminate any bias in the sequence size, since SARS-CoV-2 has an average length of 29 838 bp. This resulted in a dataset with 22 442 sequences from other viruses and 2373 from SARS-CoV-2. In this experiment, we extracted the TE-based features ($k = 12$ and $q = 6$). We applied the Random Forest (RF) algorithm to the dataset represented by TE-based features, using 10-fold cross-validation and reporting mean values. It is important to note that we kept the dataset unbalanced and therefore used metrics suited to class imbalance: F1-score, balanced accuracy (BACC) and Cohen’s kappa coefficient. The RF model discriminated SARS-CoV-2 from the other viruses with F1-score, BACC and kappa of 0.9873, 0.9919 and 0.9860, respectively. Moreover, we tested other conventional descriptors (e.g. k-mer, PseKNC, ORF features, Fickett score and TNC). These descriptors reached balanced accuracies between 0.9800 and 0.9900, which suggests that, in this classification task, SARS-CoV-2 and the other viruses are linearly separable even when different feature vectors are used. In addition, these results are supported by [10, 11].
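
The TE-based features can be sketched as follows: for each k from 1 to 12, the k-mer frequency distribution of a sequence is computed and summarized by its Tsallis entropy with q = 6, giving one feature per k. The code assumes the standard Tsallis definition, S_q = (1 - Σ p_i^q) / (q - 1); the function names are hypothetical and the package's own implementation may differ in details such as normalization.

```python
from collections import Counter


def kmer_probabilities(sequence, k):
    """Relative frequency of each overlapping k-mer observed in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def tsallis_entropy(probabilities, q):
    """Standard Tsallis entropy, S_q = (1 - sum(p^q)) / (q - 1), for q != 1."""
    if q == 1:
        raise ValueError("q = 1 is the Shannon limit; use Shannon entropy instead")
    return (1.0 - sum(p ** q for p in probabilities)) / (q - 1.0)


def tsallis_features(sequence, max_k=12, q=6.0):
    """One entropy value per k-mer size, for k = 1..max_k."""
    if len(sequence) < max_k:
        raise ValueError("sequence is shorter than the largest k-mer size")
    sequence = sequence.upper()
    return [
        tsallis_entropy(kmer_probabilities(sequence, k), q)
        for k in range(1, max_k + 1)
    ]


features = tsallis_features("ACGTTGCAAGCTAGCTAGGATCCA", max_k=12, q=6.0)
print(len(features))  # one feature per k
```

With q > 1, the entropy is dominated by the most frequent k-mers, so the resulting vector stays compact (12 values here) regardless of sequence length.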

Case study IV-Sigma70 promoters

In this case study, we trained an SVM classifier to induce a sigma70 promoter predictor based on the benchmark dataset from [12]. This dataset contains 741 positive samples (promoter) and 1400 negative samples (non-promoter). For the feature extraction, we used the CGR descriptor. The experiments were assessed by partitioning the dataset with 5-fold cross-validation (as in [12]), which yielded the following mean performance values: 0.8594, 0.8346, 0.7872 and 0.6852 for ACC, BACC, F1-score and MCC, respectively. In [12], the authors report the performance of their tool, iPro70-PseZNC, which also uses SVM, for two of these metrics, ACC: 0.8450 and MCC: 0.6630. Thus, by using the mathematical descriptors, the results improved by 0.0144 (1.44 percentage points) for ACC and 0.0222 (2.22 percentage points) for MCC.
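
To make the CGR descriptor more concrete, the sketch below generates the classical chaos game coordinates for a DNA sequence: starting at the center of the unit square, each nucleotide moves the current point halfway toward that nucleotide's corner. The corner assignment is an assumption (a commonly used convention), and how the package turns these coordinates into a feature vector is not covered here.

```python
# Assumed corner convention for the unit square (not confirmed by this article).
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def chaos_game_representation(sequence):
    """Return the list of (x, y) CGR points, one per nucleotide."""
    x, y = 0.5, 0.5  # start at the center of the square
    points = []
    for base in sequence.upper():
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point toward the nucleotide's corner.
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


print(chaos_game_representation("AG"))
```

Each point encodes the whole prefix of the sequence read so far, so sequences sharing a suffix land in the same sub-square.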

Case study V-anticancer peptides

In this case study, our aim is to identify anticancer peptides based on [66]. To this end, we extracted the CN ($k = 2$, $t = 1$) and AAC features from the benchmark dataset provided by the authors (206 non-anticancer peptides and 138 anticancer peptides). The RF algorithm was applied to the transformed dataset using 10-fold cross-validation. The mean predictive performance of the trained model was assessed using ACC, F1-score and MCC. This model was superior to the one reported in [66] in ACC and MCC, although lower in F1-score (ACC: 0.9300, F1-score: 0.9061 and MCC: 0.8563 against ACC: 0.9273, F1-score: 0.9270 and MCC: 0.8490).
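
AAC, used here and in case studies I and II, is the protein analog of NAC: the relative frequency of each amino acid in the sequence. A minimal sketch, assuming the 20 standard amino acids in alphabetical one-letter order (the ordering and the function name are assumptions, not taken from the package):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def amino_acid_composition(peptide):
    """Return a 20-dimensional vector of relative amino acid frequencies.

    Non-standard symbols count toward the length but have no feature of
    their own in this sketch.
    """
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("empty sequence")
    length = len(peptide)
    return [peptide.count(residue) / length for residue in AMINO_ACIDS]


vector = amino_acid_composition("ACCA")
print(vector[:2])  # frequencies of A and C
```

Because the vector has a fixed length regardless of peptide size, it can be concatenated directly with other descriptors to form a hybrid feature set.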

Case study VI-protein lysine crotonylation

Based on [67], we trained and assessed an RF model to identify protein lysine crotonylation sites. The benchmark data provided by the authors contain 32 418 sequences for training (2742 positive and 29 676 negative peptides - papaya) and 8169 sequences for testing (711 positive and 7458 negative peptides - papaya). For feature extraction, we applied numerical mapping with EIIP. We assessed the predictive performance with BACC and MCC, which were 0.6450 and 0.1652, respectively. These results were better than those obtained with some of the feature extraction techniques used in [67], e.g. $RF_{AAC}$ (MCC: 0.1030) and $RF_{CKSAAP}$ (MCC: 0.1110).

Case study VII-long non-coding RNAs

In this case study, we trained the CatBoost algorithm to distinguish long non-coding RNA (lncRNA) sequences from protein-coding transcripts (mRNAs), using two datasets made available by [13]: Human (training set: 16 000 sequences and test set: 5000 sequences) and Wheat (training set: 8000 sequences and test set: 4000 sequences). From these datasets, we extracted the FT + real mapping, TNC and coding descriptors. Essentially, we followed the same pipeline as in the previous case studies. Once again, the predictive model induced using our descriptors showed a high predictive performance in both datasets: Human (ACC: 0.9652, F1-score: 0.9646, MCC: 0.9309) and Wheat (ACC: 0.8870, F1-score: 0.8907, MCC: 0.7757). Our results were better than those of several tools shown in [13], with the exception of CPC on the Wheat dataset: CPC [74] (Human - ACC: 0.8304; Wheat - ACC: 0.9595), CNCI [75] (Human - ACC: 0.9450; Wheat - ACC: 0.6158), CPAT [63] (Human - ACC: 0.9642; Wheat - ACC: 0.8743), PLEK [76] (Human - ACC: 0.9274; Wheat - ACC: 0.8773) and CPC2 [77] (Human - ACC: 0.9614; Wheat - ACC: 0.7870).
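
The FT + real mapping descriptor can be sketched in two steps: the sequence is first converted into a numerical signal, and summary statistics are then taken from its Fourier power spectrum. The real-mapping values below are an assumption (a commonly cited assignment), as are the function names; only a few of the spectrum statistics are shown, and a naive discrete Fourier transform is used for self-containment where an FFT routine would be used in practice.

```python
import cmath

# Assumed real-number mapping (not confirmed by this section of the article).
REAL_MAP = {"A": -1.5, "T": 1.5, "C": 0.5, "G": -0.5}


def power_spectrum(signal):
    """Squared magnitude of the discrete Fourier transform (naive O(n^2))."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coefficient = sum(
            value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(signal)
        )
        spectrum.append(abs(coefficient) ** 2)
    return spectrum


def fourier_features(sequence):
    """A few power-spectrum statistics for a nucleotide sequence."""
    # RNA written with U is treated like T in this sketch.
    signal = [REAL_MAP[base] for base in sequence.upper().replace("U", "T")]
    spectrum = power_spectrum(signal)
    average = sum(spectrum) / len(spectrum)
    return {
        "average_power": average,
        "max_power": max(spectrum),
        "min_power": min(spectrum),
        "peak_to_average": max(spectrum) / average if average else 0.0,
    }


print(fourier_features("ACGTACGTAACC"))
```

Since the statistics are computed over the spectrum rather than the raw signal, sequences of different lengths yield feature vectors of the same size.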

Case study VIII-using MathFeature with deep learning

According to [78], deep learning (DL) is a field of ML responsible for several advances, due to its high predictive performance in big data [79]. Therefore, we assessed our descriptors with a DL architecture, using the same problem as in case study VII (lncRNAs versus mRNAs, with a feature vector composed of FT + real mapping and coding descriptors), but with a benchmark dataset from [68], an article dedicated to a DL approach (Zea mays dataset with 36 000 sequences: 18 000 lncRNA and 18 000 mRNA). Our classifier was generated using Keras [80] (default parameters). Furthermore, we compared our model with three DL tools used in [68] (PlncRNA-HDeep [68], lncRNAnet [81] and LncADeep [82]), using the same pipeline: hold-out (80% of samples for training and 20% for testing), ACC, Recall and F1-score. Our model showed a high predictive performance in the dataset (ACC: 0.9605, Recall: 0.9917 and F1-score: 0.9616), outperforming lncRNAnet (ACC: 0.7290, Recall: 0.7200, F1-score: 0.7260) and LncADeep (ACC: 0.8000, Recall: 0.6660, F1-score: 0.7690) in all metrics and PlncRNA-HDeep in Recall (0.9790). It was slightly below PlncRNA-HDeep in ACC and F1-score (both 0.9650), by 0.0045 and 0.0034, respectively. Therefore, based on our results, MathFeature can also generate robust and efficient feature vectors for DL approaches.

Case study IX-MathFeature versus other packages

So far, we have evaluated MathFeature with eight experiments in well-established problems. In this last case study, we also compared MathFeature with five packages: BioSeq-Analysis [26], Seq2Feature [29], PyFeat [30], iLearn [7] and SubFeat [69]. The experiments were carried out using the dataset provided by [69], which was the same dataset used in case study IV (Sigma70 promoters). For this study, we considered 741 positive samples (promoter) and 1400 negative samples (non-promoter) and three metrics (ACC, AUC, MCC), evaluating the RF classifier using 10-fold cross-validation (as in our reference). We kept our CGR descriptor. MathFeature (ACC: 0.8576, AUC: 0.9252 and MCC: 0.6797) outperformed all packages: BioSeq-Analysis (ACC: 0.7637, AUC: 0.8297 and MCC: 0.4726), Seq2Feature (ACC: 0.7197, AUC: 0.7637 and MCC: 0.3723), PyFeat (ACC: 0.7842, AUC: 0.8589 and MCC: 0.5064), iLearn (ACC: 0.7597, AUC: 0.8173 and MCC: 0.5275) and SubFeat (ACC: 0.8098, AUC: 0.9232 and MCC: 0.5664). Moreover, based on the results obtained comparing MathFeature and Seq2Feature, we generated a hybrid vector with features from both packages (MathFeature: CGR and Seq2Feature: Nucleotide content, random choice), which provided the best result (ACC: 0.8627, AUC: 0.9332 and MCC: 0.6927). Therefore, we achieved a high predictive performance, applying either MathFeature alone or a hybrid combination of packages.

Discussion

We assessed the MathFeature package in nine case studies grouped by protein and DNA/RNA sequences. We considered four protein problems and three DNA/RNA problems in the experiments. The classification problems in each case were chosen based on recent articles with distinct domains. For protein molecules, we used the following datasets: (i) non-classical secreted proteins, which, according to [8], are important for understanding pathogenesis mechanisms of Gram-positive bacteria; (ii) PVP identification, e.g. to develop new antibacterial drugs [9]; (iii) anticancer peptides, which present a new direction in the treatment of cancer [66, 83]; and (iv) protein lysine crotonylation, a type of post-translational modification [67, 84]. In these studies, we noticed that the hybrid combination of mathematical and conventional descriptors (available in MathFeature) improves the performance of the models, mainly when applying CN, FT, numerical mapping (e.g. EIIP and integer) and AAC, with ACC/BACC ranging from 0.6450 to 0.9300 across all problems. For DNA/RNA molecules, the problems used are (i) SARS-CoV-2, a hot topic in bioinformatics [10, 11]; (ii) detection of sigma70 promoters to study the dynamics of gene expression [12, 85]; and (iii) lncRNA sequences, which can play essential roles in biological processes, e.g. transcriptional regulation [68, 86]. For these problems, we obtained highly robust results (ACC/BACC ranging from 0.8594 to 0.9900), whether applying only mathematical descriptors or a hybrid combination, highlighting TE-based features, CGR, FT, TNC and coding descriptors. Finally, our findings support the relevance of MathFeature descriptors in several applications, e.g. human, plant and bacterial data.

Conclusion

In this study, we described a new package, called MathFeature, comprising an extensive and comprehensive set of 37 feature descriptors for biological sequences. Of these 37 descriptors, 20 are based on mathematical approaches and are not available in other feature extraction packages. The other 17 descriptors, called conventional descriptors, were selected from those often used in the literature. The main motivation for this new package was that, despite the relevance of the features extracted by mathematical descriptors, they are not available in current packages. Thus, MathFeature extends the existing packages by including mathematical techniques. To experimentally assess the descriptors implemented in this package, we conducted nine case studies, using several biological scenarios, e.g. DNA, RNA and proteins (primary sequence of amino acids), applied in different problem domains. Furthermore, we avoided introducing any type of bias in the selection of features, and hence, the quality of each feature can be assessed by the community with regard to the specific problem of interest. In the experiments, we obtained high predictive performance, both applying only mathematical descriptors (e.g. case studies II, III, VI) and applying a hybrid combination of them with well-known conventional descriptors found in the literature (e.g. AAC, TNC, Coding). Through MathFeature, we outperformed several studies in benchmark datasets, indicating that the descriptors within MathFeature can improve the performance of predictive models induced by ML algorithms. Regarding the limitations, we observed that some of these descriptors (e.g. Fourier, Shannon and Tsallis) have a low performance for short sequences. However, when mathematical descriptors are combined with conventional ones, in hybrid sets, there is a clear improvement in the predictive performance.
As future work, we intend to investigate descriptors for short sequences, especially in prokaryotic organisms, and also to include more protein descriptors.

Key Points

A novel open-source Python package, called MathFeature.
MathFeature provides 37 descriptors, 20 of which are mathematical, organized into five categories.
MathFeature can be run on the console and also provides a GUI-based platform.
MathFeature is an extensive and comprehensive set of feature extraction techniques based on mathematical descriptors for encoding DNA, RNA and protein (primary sequence of amino acids) sequences.
MathFeature is the first package to provide a large set of features based on mathematical descriptors and also well-known descriptors from other studies with biological sequences.

Acknowledgments

The authors would like to thank USP, CAPES, CNPq and FAPESP (2013/07375-0) for the financial support for this research.

Availability of data and materials

The datasets, experiments and descriptors are available in the GitHub repository: https://github.com/Bonidia/MathFeature.

Financial support

This project was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) - Finance Code 001 and PROEX-11919694/D, USP, CNPq and FAPESP (2013/07375-0).

Availability and implementation

MathFeature is freely available at https://github.com/Bonidia/MathFeature. Documentation: https://bonidia.github.io/MathFeature/

Robson P. Bonidia received the M.Sc. degree in bioinformatics from the Federal University of Technology - Paraná (UTFPR), Brazil. He is currently pursuing the Ph.D. degree in computer science and computational mathematics with the University of São Paulo-USP. His main research topics are in computational biology and pattern recognition, feature extraction and selection, metaheuristics, and sports data mining.

Douglas S. Domingues graduated in Biology in the São Paulo State University at Botucatu, Brazil, in 2003. He received the PhD degree in Biotechnology from the University of São Paulo, Brazil, in 2009. He is currently a research professor of Plant Gene Expression in the Department of Biodiversity, São Paulo State University at Rio Claro, Brazil, in charge of the Genomics and Transcriptomics in Plants Group. He is the Head of the PhD in Plant Biology in São Paulo State University at Rio Claro, Brazil. In his research, he uses genomics and transcriptomics approaches in non-model plants to understand gene function, the evolution of gene families and genome components, as well as molecular responses to environmental constraints.

Danilo S. Sanches received the Ph.D. degree in electrical engineering from the University of São Paulo, in 2013. He is currently an Associate Professor with the Computer Science Department, Federal University of Technology - Paraná (UTFPR), Brazil. His research includes data mining, machine learning, evolutionary algorithms, bioinformatics, and pattern recognition approaches.

André C. P. L. F. de Carvalho is a full professor at the Department of Computer Science, University of São Paulo. He is the Vice Dean of the Mathematics and Computer Science Institute of the University of São Paulo, ICMC-USP, Vice Director of the Center for Mathematical Sciences Applied to Industry, USP and Vice President of the Brazilian Computer Society, SBC. His research interests are in machine learning, data mining and data science.

References

da Silva Diniz, Canduri. Bioinformatics: an overview and its applications. Genet Mol Res 2017.

MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors
