Bin Liu, Xin Gao, Hanyu Zhang, BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches, Nucleic Acids Research, Volume 47, Issue 20, 18 November 2019, Page e127, https://doi.org/10.1093/nar/gkz740
Abstract
BioSeq-Analysis was the first web server to analyze various biological sequences at sequence level based on machine learning approaches, and many powerful predictors in the field of computational biology have been developed with its assistance. However, BioSeq-Analysis can only be applied to sequence-level analysis tasks, which prevents its application to residue-level analysis tasks. An intelligent tool that can automatically generate various predictors for biological sequence analysis at both residue level and sequence level is therefore highly desired. In this regard, we publish an important updated server called BioSeq-Analysis2.0 (http://bliulab.net/BioSeq-Analysis2.0/), covering a total of 26 features at the residue level and 90 features at the sequence level. Users only need to upload the benchmark dataset, and BioSeq-Analysis2.0 generates the predictors for both residue-level and sequence-level analysis tasks. Furthermore, the corresponding stand-alone tool is also provided, which can be downloaded from http://bliulab.net/BioSeq-Analysis2.0/download/. To the best of our knowledge, BioSeq-Analysis2.0 is the first tool for generating predictors for biological sequence analysis tasks at residue level. The experimental results indicated that the predictors developed by BioSeq-Analysis2.0 can achieve performance comparable to, or even better than, that of the existing state-of-the-art predictors.
INTRODUCTION
Established in 2017, the platform BioSeq-Analysis (1) was the first to be proposed for analyzing various biological sequences at sequence level via machine learning approaches. It has been increasingly and extensively applied in many areas of computational biology, and many new and powerful predictors have been developed using it, such as iLearn (2) and QSPred-FL (3).
As shown in Figure 1, there are two main tasks in biological sequence analysis: residue-level analysis and sequence-level analysis. The aim of residue-level analysis is to study the properties of the residues, for instance protein-protein interaction site prediction (4), protein disordered region prediction (5) and N6-methyladenosine site prediction (6), while the aim of sequence-level analysis is to investigate the structural and functional characteristics of entire sequences, such as enhancer identification (7,8), protein remote homology detection and fold recognition (9–12), recombination spot identification (13,14) and DNA/RNA binding protein identification (15,16). All these biological sequence analysis tasks consist of three main steps: feature extraction, predictor construction, and performance evaluation. BioSeq-Analysis mainly focuses on analyzing biological sequences at the sequence level, meaning that it can only be applied to sequence-level analysis tasks. Can we construct an intelligent tool to generate predictors for both residue-level and sequence-level analysis by automatically implementing all three processes listed in Figure 1? To answer this question, we have decided to publish an important updated platform called BioSeq-Analysis2.0. Compared with BioSeq-Analysis and other existing tools, BioSeq-Analysis2.0 has the following novel functions and features:
Twenty-six new feature extraction methods at residue level were added, of which 7 are for DNA residues (17–21), 6 for RNA residues (17–19,22) and 13 for amino acid residues (11,17,18,23–32). Thirty-four new feature extraction methods at sequence level were also added, of which 9 are for DNA sequences (2,33–35), 7 for RNA sequences (2,33,35) and 18 for protein sequences (36–55). To the best of our knowledge, BioSeq-Analysis2.0 is the first web server proposed to generate various residue-level feature extraction methods. As a result, BioSeq-Analysis2.0 covers a total of 26 features at the residue level and 90 features at the sequence level.
For the residue-level analysis tasks, a sliding window approach was applied to extract the information of the sequentially neighboring residues, and a sequence labeling model, Conditional Random Field (CRF), was added to BioSeq-Analysis2.0 to capture the global sequence order information of residues.
MATERIALS AND METHODS
Residue-level analysis
Sequence-level analysis
The difficulty now is how to identify which category a given residue or sequence belongs to. To cope with this problem, we propose in this study a powerful and multifunctional web server named BioSeq-Analysis2.0, through which users can construct various sequence-level and residue-level predictors for analyzing DNA, RNA and protein sequences.
BioSeq-Analysis2.0 updates the three sub web servers (DNA-Analysis2.0, RNA-Analysis2.0, Protein-Analysis2.0) for analyzing DNA, RNA and protein sequences, respectively. Each of them is able to automatically implement the three main steps: feature extraction, predictor construction and performance evaluation (see Figure 1).
Feature extraction
The residue-level features explore the properties of the residues and the relationships among the residues within the sliding windows, while the sequence-level features focus on extracting the global information along the entire sequences. For residue-level analysis, in order to capture the properties of the residues, the sliding window strategy and the fragment strategy were used to extract the corresponding features via a user-defined fixed-length window. For sequence-level analysis, the biological sequences (see Equation 1) were converted into feature vectors via sequence information. BioSeq-Analysis2.0 provides, for the first time, 26 features for residue-level analysis, and adds 34 new features at sequence level, leading to 90 features for sequence-level analysis. This section mainly introduces the 26 features for residue-level analysis and the 34 new features for sequence-level analysis. For the other 56 sequence-level features, please refer to (1).
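The sliding-window idea can be illustrated with a short Python sketch. It is not the server's implementation: the function name `sliding_windows`, the padding symbol `X` used at the sequence ends, and the restriction to odd window sizes are all assumptions made for this example.

```python
def sliding_windows(sequence, window_size, pad="X"):
    """Return one fixed-length window per residue, centred on that residue.

    Positions that fall outside the sequence are filled with `pad`
    (an assumed convention; the real tool may treat sequence ends differently).
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd integer")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    # Window i covers padded[i : i + window_size]; its centre is sequence[i].
    return [padded[i:i + window_size] for i in range(len(sequence))]


windows = sliding_windows("ACGTAC", window_size=3)
for centre, window in zip("ACGTAC", windows):
    print(centre, window)
```

Each residue thus receives a context of fixed length, which can then be passed to any residue-level feature encoder.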
In DNA-Analysis2.0, there are seven different residue-level features for DNA sequences to generate various predictors, which can be further divided into three categories (Table 1).
| Category | Feature | Description |
|---|---|---|
| Residue composition | One-hot | Basic one-hot (17) |
| Residue composition | Position-specific-2 | Position-specific of two nucleotides (18) |
| Residue composition | Position-specific-3 | Position-specific of three nucleotides (18) |
| Residue composition | Position-specific-4 | Position-specific of four nucleotides (18) |
| Physicochemical property | DPC | Dinucleotide physicochemical (19,20) |
| Physicochemical property | TPC | Trinucleotide physicochemical (19) |
| Evolutionary information | BLAST-matrix | BLAST-matrix (21) |
The first category, residue composition, contains four features. The first is One-hot, in which the residue types are arranged in a fixed order and the ith residue type is represented by four binary bits, with the ith bit set to 1 and all other bits set to 0. The other three are Position-specific-2 (18), Position-specific-3 (18) and Position-specific-4 (18), which reflect the position specificity of two, three and four nucleotides, respectively, along a DNA sequence based on One-hot.
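As a minimal sketch of the One-hot encoding just described, the following Python snippet encodes DNA residues as four binary bits. The residue ordering `ACGT` and the function names are assumptions chosen for illustration; any fixed ordering gives an equivalent encoding.

```python
DNA_ALPHABET = "ACGT"  # assumed ordering of residue types


def one_hot(residue, alphabet=DNA_ALPHABET):
    """Encode one residue as len(alphabet) bits with exactly one bit set to 1."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    if residue not in index:
        raise ValueError(f"unknown residue: {residue!r}")
    bits = [0] * len(alphabet)
    bits[index[residue]] = 1
    return bits


def encode_window(window, alphabet=DNA_ALPHABET):
    """Concatenate the one-hot vectors of all residues in a window."""
    vector = []
    for residue in window:
        vector.extend(one_hot(residue, alphabet))
    return vector


encoded = encode_window("ACG")
print(encoded)
```

A window of w nucleotides therefore yields a vector of 4w bits.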
The second category, physicochemical property, contains two features, DPC and TPC. DPC represents residues using the 90 physicochemical indices of dinucleotides extracted from (19,20), while TPC uses the 12 physicochemical properties of trinucleotides extracted from (19). For both features, users can select physicochemical indices from the built-in index boxes.
The third category, evolutionary information, contains one feature, BLAST-matrix, based on (21), which can represent the local and global DNA sequence composition.
In RNA-Analysis2.0, there are six different residue-level features for RNA sequences to generate various predictors, which can be separated into three categories (Table 2).
| Category | Feature | Description |
|---|---|---|
| Residue composition | One-hot | Basic one-hot (17) |
| Residue composition | Position-specific-2 | Position-specific of two nucleotides (18) |
| Residue composition | Position-specific-3 | Position-specific of three nucleotides (18) |
| Residue composition | Position-specific-4 | Position-specific of four nucleotides (18) |
| Physicochemical property | DPC | Dinucleotide physicochemical (19) |
| Structure composition | SS | Secondary structure (22) |
The first category, residue composition, contains four features. Three of the four are Position-specific-2 (18), Position-specific-3 (18) and Position-specific-4 (18), which reflect the position specificity of two, three and four nucleotides, respectively, along an RNA sequence based on One-hot. The fourth is the basic One-hot.
The second category, physicochemical property, contains one feature, DPC, which represents residues using the 11 physicochemical properties of dinucleotides extracted from (19). Users can select physicochemical indices from the built-in index boxes.
The third category, structure composition, contains one feature, SS, which represents the secondary structure of each residue extracted from (22). SS can therefore represent the local RNA structure composition.
In Protein-Analysis2.0, there are 13 different residue-level features for protein sequences to generate various predictors, which can be further divided into the following four categories (Table 3).
| Category | Feature | Description |
|---|---|---|
| Residue composition | One-hot | Basic one-hot (17) |
| Residue composition | One-hot(6-bit) | 6-dimension One-hot method (23) |
| Residue composition | Binary(5-bit) | Encoding with five binary bits (24) |
| Residue composition | AESNN3 | Learned from alignments (25) |
| Residue composition | Position-specific-2 | Position-specific of two residues (18) |
| Physicochemical property | PP | Properties from AAindex (26) |
| Structure composition | SS | Secondary structure (27) |
| Structure composition | SASA | Solvent accessible surface area (28) |
| Evolutionary information | PAM250 | PAM250 matrix (29) |
| Evolutionary information | BLOSUM62 | BLOSUM62 matrix (30) |
| Evolutionary information | PSSM | PSSM matrix (31) |
| Evolutionary information | PSFM | Frequency profiles matrix (11) |
| Evolutionary information | CS | Conservation score (32) |
The first category, residue composition, contains five features. The first is One-hot, in which each residue is represented by a 20-dimensional vector. The next two features, One-hot (6-bit) (23) and Binary (5-bit) (24), reduce the dimension and complexity of One-hot. The fourth feature is Position-specific-2, which is based on One-hot and represents the local protein sequence composition, and the fifth feature is AESNN3 (25), which is based on characteristics generated by machine learning techniques.
The second category is PP, which represents residues using the 547 amino acid physicochemical indices from AAindex (26); users can select the physicochemical properties to use from the index boxes.
The third category, structure composition, contains two features: SS (27) and SASA (28), based on the secondary structure and the relative solvent accessibility of each residue, respectively.
The fourth category, evolutionary information, contains five features: PAM250 (29), BLOSUM62 (30), PSSM (31), PSFM (11), and CS (32). PAM250 is based on homologous protein sequences, and BLOSUM62 is based on the BLOCKS database of aligned protein sequences. Both the PSSM and PSFM features are based on sequence alignments, which were generated by PSI-BLAST searches against the NRDB90 database with num_iter of 3, evalue_threshold of 0.0001, and num_threads of 40. CS is based on the sequence conservation score.
Please note that nine new sequence-level features in the nucleotide acid composition category for DNA/RNA were added (Table 4), including multiple nucleic acid composition features, nucleotide chemical property, and electron-ion interaction pseudopotentials of trinucleotide for DNA. Eighteen new sequence-level features for proteins were added (Table 5) in three categories: amino acid composition, autocorrelation, and predicted structure features.
| Category | Feature | Type | Description |
|---|---|---|---|
| Nucleotide acid composition | NAC | DNA/RNA | Nucleic Acid Composition (2) |
| Nucleotide acid composition | DNC | DNA/RNA | Di-Nucleotide Composition (2) |
| Nucleotide acid composition | TNC | DNA/RNA | Tri-Nucleotide Composition (2) |
| Nucleotide acid composition | CKSNAP | DNA/RNA | Composition of k-spaced Nucleic Acid Pairs (2) |
| Nucleotide acid composition | NCP | DNA/RNA | Nucleotide Chemical Property (2) |
| Nucleotide acid composition | ANF | DNA/RNA | Accumulated Nucleotide Frequency (33) |
| Nucleotide acid composition | Zcurve | DNA/RNA | Representation of DNA/RNA sequence (35) |
| Nucleotide acid composition | EIIP | DNA | Electron-ion interaction pseudopotentials of trinucleotide only for DNA (34) |
| Nucleotide acid composition | PseEIIP | DNA | Electron-ion interaction pseudopotentials of trinucleotide only for DNA (2) |
| Category | Feature | Description |
|---|---|---|
| Amino acid composition | AAC | Amino Acid Composition (37) |
| Amino acid composition | GAAC | Grouped Amino Acid Composition (38) |
| Amino acid composition | CTDC | Composition (C), transition (T), and distribution (D) (39) |
| Amino acid composition | CTDT | Composition (C), transition (T), and distribution (D) (39,40) |
| Amino acid composition | CTDD | Composition (C), transition (T), and distribution (D) (39,40) |
| Amino acid composition | CTriad | Conjoint Triad (41) |
| Amino acid composition | SOCNumber | Sequence-Order-Coupling Number (42) |
| Amino acid composition | QSOrder | Quasi-sequence-order (43) |
| Amino acid composition | Z-Scale | ZSCALE (44,45) |
| Amino acid composition | TPC | Tri-Peptide Composition (37) |
| Amino acid composition | GTPC | Grouped Tri-Peptide Composition (37) |
| Amino acid composition | CKSAAP | Composition of k-spaced Amino Acid Pairs (46–49) |
| Amino acid composition | CKSAAGP | Composition of k-Spaced Amino Acid Group Pairs (46–49) |
| Amino acid composition | PAAC | Pseudo-Amino Acid Composition (50,51) |
| Autocorrelation | MAC | Moran autocorrelation (52,53) |
| Autocorrelation | GAC | Geary autocorrelation (54) |
| Autocorrelation | NMMAC | Normalized Moreau-Broto Autocorrelation (53) |
| Predicted structure features | SSEB | Secondary Structure Binary (55) |
The dimension of some feature extraction methods is extremely high, which results in the high-dimension disaster (57). To cope with this problem, users of BioSeq-Analysis2.0 can reduce the feature vector to a user-defined length by using mutual information (58) or the chi-square algorithm (59). Chi-square feature selection qualitatively measures the correlation of independent features and applies only to classification. Mutual information is the amount of information about one feature contained in another feature. The chi-square test tends to give high scores to features that occur infrequently. For example, a feature that appears only once in the benchmark dataset will get a relatively high chi-square score, while its mutual information score will be low.
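To make the two scoring criteria concrete, the sketch below computes a plain Pearson chi-square statistic and the mutual information (in bits) between a discrete feature and the class labels, and then keeps the k best-scoring features. It is an illustration under stated assumptions rather than the procedure used by BioSeq-Analysis2.0: features are assumed to be discrete (real-valued features would first need binning), and the function names are hypothetical.

```python
import math
from collections import Counter


def chi_square(feature, labels):
    """Pearson chi-square statistic between a discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    f_counts, l_counts = Counter(feature), Counter(labels)
    score = 0.0
    for f, nf in f_counts.items():
        for l, nl in l_counts.items():
            expected = nf * nl / n          # count expected under independence
            observed = joint.get((f, l), 0)
            score += (observed - expected) ** 2 / expected
    return score


def mutual_information(feature, labels):
    """Mutual information (bits) between a discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    f_counts, l_counts = Counter(feature), Counter(labels)
    return sum(
        (nfl / n) * math.log2(nfl * n / (f_counts[f] * l_counts[l]))
        for (f, l), nfl in joint.items()
    )


def select_top_k(columns, labels, k, score=chi_square):
    """Return the (sorted) indices of the k highest-scoring feature columns."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: score(columns[j], labels), reverse=True)
    return sorted(ranked[:k])


labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly informative
    [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the labels
    [1, 1, 1, 0, 0, 0, 0, 1],  # partly informative
]
selected = select_top_k(columns, labels, k=2)
print(selected)
```

Passing `score=mutual_information` to `select_top_k` switches the ranking criterion without changing the rest of the pipeline.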
Predictor construction
Most biological sequence analysis tasks at residue level and sequence level can be treated as classification tasks. Therefore, many classifiers have been applied to biological sequence analysis.
For residue-level analysis, BioSeq-Analysis2.0 incorporates two classification algorithms (Support Vector Machine (SVM) (60), Random Forest (RF) (61)), and a sequence labelling algorithm (Conditional Random Fields (CRF) (62)).
The SVM implementation is based on the LIBSVM package (63) with the Gaussian radial basis function (RBF) kernel. Users can select the values of c and g (c ranges from $2^{-1}$ to $2^{7}$, and g from $2^{-7}$ to $2^{3}$), or these parameters can be automatically optimized according to a specific performance measure, such as accuracy (Acc), Matthews correlation coefficient (MCC) or area under the ROC curve (AUC) (64). RF is a flexible and widely used supervised machine learning algorithm. Its implementation in BioSeq-Analysis2.0 uses the Python Scikit-learn package (65), and users can select the value of n_estimators (the number of decision trees, ranging from 100 to 800). This parameter can also be automatically optimized.
For residue-level analysis, the FlexCRFs toolkit (http://flexcrfs.sourceforge.net/documents.html, accessed June 2019) was used as the implementation of CRF, modified to deal with real-valued features following (67). The parameters num_iterations (the number of training iterations) and init_lambda_val (the initial value for the feature weights) were set to 50 and 0.05, respectively.
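Automatic optimization of c and g over the stated ranges can be pictured as an exhaustive grid search. The sketch below is only an assumption about how such a search might look: it steps the exponents by 1, and `evaluate` stands for any caller-supplied function that returns the chosen performance measure for a given (c, g) pair. The `toy_score` function is a stand-in with no biological meaning.

```python
import math
from itertools import product

C_EXPONENTS = range(-1, 8)   # c = 2^-1 ... 2^7
G_EXPONENTS = range(-7, 4)   # g = 2^-7 ... 2^3


def grid_search(evaluate, c_exponents=C_EXPONENTS, g_exponents=G_EXPONENTS):
    """Return (best_score, c, g) over every combination in the grid.

    `evaluate(c, g)` is expected to return a score to maximise, for example
    a cross-validated Acc, MCC or AUC.
    """
    best = None
    for c_exp, g_exp in product(c_exponents, g_exponents):
        c, g = 2.0 ** c_exp, 2.0 ** g_exp
        score = evaluate(c, g)
        if best is None or score > best[0]:
            best = (score, c, g)
    return best


def toy_score(c, g):
    """Illustrative stand-in for a real evaluation; peaks at c = 2^3, g = 2^-2."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(g) + 2) ** 2)


best_score, best_c, best_g = grid_search(toy_score)
print(best_c, best_g)
```

With a step of 1 in the exponents, the grid contains 9 × 11 = 99 parameter combinations, each of which requires one full evaluation.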
For sequence-level analysis, four classification algorithms were employed in BioSeq-Analysis2.0. For more details, please refer to (1).
Performance evaluation
With the two aforementioned processes, a predictor for a biological sequence analysis task can be generated. Evaluating the performance of the predictor is an important component (68). In BioSeq-Analysis2.0, two methods are used for this purpose, namely 5-fold cross-validation and the independent test.
In 5-fold cross-validation, the benchmark dataset is randomly partitioned into five subsets of roughly equal size. The training procedure is repeated five times, each time with a different subset held out as the test set and the remaining subsets used for training. Please note that, in order to avoid overestimating the performance of the residue-level predictors, all the residues in one sequence must be in the same subset, which is different from the sequence-level analysis. Besides 5-fold cross-validation, the independent test is usually adopted to evaluate a predictor for real-world applications. The predictor is trained with the benchmark dataset and tested on the independent dataset. The independent dataset should be fully independent of the benchmark dataset so that the performance is evaluated fairly.
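The requirement that all residues of one sequence fall into the same subset can be met by assigning whole sequences, rather than individual residues, to folds. The following sketch shows one way to do this; the function names, the round-robin assignment after shuffling, and the fixed random seed are assumptions for illustration.

```python
import random


def sequence_level_folds(sequence_ids, n_folds=5, seed=0):
    """Map each sequence id to a fold so that a sequence is never split."""
    unique_ids = sorted(set(sequence_ids))
    random.Random(seed).shuffle(unique_ids)
    return {sid: i % n_folds for i, sid in enumerate(unique_ids)}


def split_residues(residue_sequence_ids, fold_of, test_fold):
    """Return (train_indices, test_indices) of residues for one round."""
    train, test = [], []
    for index, sid in enumerate(residue_sequence_ids):
        (test if fold_of[sid] == test_fold else train).append(index)
    return train, test


# Example: 10 sequences with 3 labelled residues each.
residue_ids = [f"seq{s}" for s in range(10) for _ in range(3)]
fold_of = sequence_level_folds(residue_ids)
rounds = [split_residues(residue_ids, fold_of, k) for k in range(5)]
for k, (train, test) in enumerate(rounds):
    print(k, len(train), len(test))
```

Because folds are formed from sequences, no training set ever contains a neighbour of a test residue from the same sequence.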
The training sets are often imbalanced for some biological sequence analysis tasks. For example, in the protein disordered region prediction task, the number of residues in the ordered regions is much larger than the number of residues in the disordered regions (66), which inevitably biases the resulting predictor (66). For this reason, oversampling and undersampling techniques are also provided in BioSeq-Analysis2.0 to minimize this bias.
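The source does not specify which resampling algorithms are used, so the sketch below shows only the simplest variants, random undersampling and random oversampling, as an illustration. The function names and the fixed seed are assumptions; in practice such resampling would be applied to the training set only.

```python
import random
from collections import defaultdict


def _group_by_label(samples, labels):
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def undersample(samples, labels, seed=0):
    """Randomly drop samples until every class has the size of the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    balanced = [(s, label) for label, group in groups.items()
                for s in rng.sample(group, target)]
    rng.shuffle(balanced)
    return balanced


def oversample(samples, labels, seed=0):
    """Randomly duplicate samples until every class has the size of the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    balanced = []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((s, label) for s in group + extra)
    rng.shuffle(balanced)
    return balanced


samples = list(range(10))
labels = [1, 1] + [0] * 8          # 2 positives, 8 negatives
under = undersample(samples, labels)
over = oversample(samples, labels)
print(len(under), len(over))
```

Undersampling discards majority-class information, whereas oversampling repeats minority-class samples; which trade-off is preferable depends on the task.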
RESULTS AND DISCUSSION
Web server
BioSeq-Analysis2.0 is an updated platform for analyzing DNA, RNA, and protein sequences at sequence level and residue level based on machine learning approaches. The pipeline of BioSeq-Analysis2.0 is shown in Figure 3.
Input
The input page of the BioSeq-Analysis2.0 web server is shown in Figure 4. The input sequences should be in FASTA format, and can either be typed into the input box or uploaded as a separate file. For residue-level analysis tasks, the corresponding label for each residue should be given. DNA-Analysis2.0 provides 7 residue-level and 29 sequence-level DNA feature extraction methods; RNA-Analysis2.0 provides 6 residue-level and 21 sequence-level RNA feature extraction methods; and Protein-Analysis2.0 provides 13 residue-level and 40 sequence-level protein feature extraction methods. Users should select one of these features. For residue-level analysis, either the fragment method or the size of the sliding window should be selected. Two feature selection methods (mutual information or the chi-square algorithm) can be used to select representative features so as to avoid the high-dimension disaster. The next step is to choose one operation engine. The parameters of the feature extraction methods and the machine learning classifiers can be automatically optimized. Furthermore, the oversampling and undersampling techniques can be used to handle the imbalanced training set problem.
Output
Figure 5 is a result page using the One-hot feature in the sub web server DNA-Analysis2.0 with the provided example data (sliding window size = 7, c = $2^{-1}$, and g = $2^{-6}$) as the input.
Figure 5A includes two parts. The first part gives the parameters of the selected feature, containing the feature name and the size of the sliding window, and the other part gives the parameters of the selected machine learning algorithm, such as the values of c and g. Figure 5B shows the 5-fold cross-validation results as a $2 \times 5$ table listing the values of Acc, MCC, AUC, Sn (sensitivity), and Sp (specificity) to evaluate the performance of DNA-Analysis2.0. Figure 5C is the ROC curve generated by DNA-Analysis2.0; ROC curves are robust to the distribution of positive and negative samples. Figure 5D is an example output of the trained model, which can be directly downloaded for further analysis. The trained model includes the total number of categories (nr_class), the number of support vectors (total_sv), the number of support vectors for each category (nr_sv), the parameters of the machine learning algorithm (gamma), etc. Figure 5E shows an example output of the generated features in Scikit-learn format; for convenience, it can be downloaded directly as a separate file. For the stand-alone package, the output file format can be chosen from the tab-delimited format, the LIBSVM format, and the CSV format for further computational analysis. Figure 5F gives an example output of the generated features in Weka format, containing three parts: relation, attribute and data, which can also be downloaded directly as a separate file. Relation is the relationship name of the dataset, and attribute is an attribute description for each sample in the dataset.
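For readers unfamiliar with these output formats, the sketch below writes one feature vector in LIBSVM format (`label index:value ...`, with 1-based feature indices and zero-valued features omitted) and in delimited form. The helper names are hypothetical, and the exact column layout of the package's tab-delimited and CSV files is an assumption; only the LIBSVM line layout follows that format's published convention.

```python
def to_libsvm_line(label, features):
    """Format one sample as 'label index:value ...' (1-based, zeros omitted)."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def to_delimited_line(features, separator):
    """Format one feature vector as delimited text (assumed layout: values only)."""
    return separator.join(f"{value:g}" for value in features)


vector = [0, 1, 0, 0.5]
libsvm_line = to_libsvm_line(1, vector)
csv_line = to_delimited_line(vector, ",")
tab_line = to_delimited_line(vector, "\t")
print(libsvm_line)
print(csv_line)
```

The sparse LIBSVM layout is compact for One-hot style features, where most entries are zero.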
Stand-alone package of BioSeq-Analysis2.0
To deal with biological sequence analysis tasks involving large datasets, a stand-alone package of the BioSeq-Analysis2.0 web server is also provided, which can be accessed at http://bliulab.net/BioSeq-Analysis2.0/download. There are two main modules in the BioSeq-Analysis2.0 stand-alone package for residue-level analysis. One is the feature extraction module, with five executable Python scripts: ‘ei.py’, ‘ssc.py’, ‘rc.py’, ‘pp.py’ and ‘feature.py’. The other module consists of ‘train.py’ and ‘rf_method.py’ for predictor construction and performance evaluation. For the convenience of the user, the processes of feature extraction, predictor construction and performance evaluation were combined into one executable Python script, ‘analysis.py’. There are also some scripts that help users find the best predictor for a specific biological sequence analysis task. Please refer to the user manual for more details. Additionally, multiprocessing was employed to further reduce the computing time of the stand-alone package.
Applications of BioSeq-Analysis2.0
In this section, the BioSeq-Analysis2.0 stand-alone package was applied to three important residue-level biological sequence analysis tasks: protein disordered region prediction (66), enhancer prediction (8), and mRNA N6-methyladenosine (m6A) site prediction (6).
The predictors for these tasks can be easily generated using BioSeq-Analysis2.0. In particular, the performance of some predictors automatically generated by BioSeq-Analysis2.0 is comparable with or even better than that of the existing predictors, indicating that BioSeq-Analysis2.0 is a powerful tool for generating new predictors for biological sequence analysis tasks.
Identification of enhancers
An enhancer is a short DNA region that can be bound by proteins (activators) to activate gene transcription (7). Therefore, the identification of enhancers, which can be treated as a binary classification task, is important for studying the transcription process. In this study, the DNA-Analysis2.0 was used to generate 14 different predictors for enhancer prediction based on the 7 residue-level feature extraction methods for DNA sequences (Table 1) and two machine learning algorithms: SVM and RF. Each predictor can be easily generated by running the following command line:
python analysis.py sequence_file DNA –method feature_extraction_method –ml machine_learning_method –labels label_file –fragment 1 –model model_name
The 14 predictors were evaluated on a widely used benchmark dataset (7,8), and their ROC curves are shown in Figure 6. The SVM-One-hot predictor achieves the top performance with an AUC score of 0.8267, even outperforming the existing approach reported in (70), indicating that BioSeq-Analysis2.0 is useful for generating new predictors for enhancer identification.
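To make the idea of a residue-level One-hot feature concrete, the sketch below encodes each nucleotide as a 4-bit vector and builds one feature vector per residue from a window centred on it. It illustrates the general technique only. The zero padding at the sequence ends, the all-zero code for non-ACGT characters, and the names `one_hot_residue` and `window_features` are assumptions and may differ from the One-hot implementation in BioSeq-Analysis2.0.

```python
DNA_ALPHABET = "ACGT"


def one_hot_residue(base):
    """4-bit one-hot code for a nucleotide; unknown characters map to all zeros."""
    return [1 if base == b else 0 for b in DNA_ALPHABET]


def window_features(sequence, window_size):
    """Return one feature vector per residue, built from a centred sliding window.

    Positions that fall outside the sequence are padded with zeros (an assumption).
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number")
    half = window_size // 2
    features = []
    for i in range(len(sequence)):
        vector = []
        for j in range(i - half, i + half + 1):
            if 0 <= j < len(sequence):
                vector.extend(one_hot_residue(sequence[j]))
            else:
                vector.extend([0] * len(DNA_ALPHABET))
        features.append(vector)
    return features
```

Each residue thus receives a vector of length 4 × window_size, which can be passed together with the per-residue labels to a classifier such as SVM or RF.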
Identification of mRNA m6A sites
N6-Methyladenosine (m6A) is an RNA methylation modification at the nitrogen-6 position of the adenosine base (6). Research in cancer biology has shown that m6A mRNA modification plays a critical role in glioblastoma stem cell self-renewal and tumorigenesis (71,72). Therefore, the identification of m6A sites has become a hot topic.
In this study, the RNA-Analysis2.0 in BioSeq-Analysis2.0 was used to generate 12 different predictors for mRNA m6A site prediction based on the 6 residue-level feature extraction methods (Table 2) and two machine learning algorithms: SVM and RF. Each predictor can be easily generated by running the following command line:
python analysis.py sequence_file RNA –method feature_extraction_method –ml machine_learning_method –labels label_file –fragment 1 –model model_name
Figure 7 shows the ROC curves of the 12 predictors automatically generated by BioSeq-Analysis2.0. These experimental results further confirm that RNA-Analysis2.0 is also useful for developing new predictors for RNA sequence analysis tasks.
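The AUC scores used to compare predictors summarise a ROC curve as a single number. As a minimal sketch, the AUC can be computed directly from predicted scores as the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The function name `roc_auc` and the 1/0 label coding are assumptions for illustration; this is not the evaluation code of BioSeq-Analysis2.0.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(positives) * len(negatives))
```

This pairwise form is quadratic in the number of samples, so it is suited to small illustrations; rank-based implementations give the same value more efficiently on large datasets.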
Identification of protein disordered regions
Intrinsically disordered proteins lack stable three-dimensional structures in their native states (66) and are correlated with many diseases, such as genetic diseases and cancer. Therefore, the identification of disordered proteins and regions has become one of the most popular tasks in the study of protein structures and functions (66,69). Here, Protein-Analysis2.0 in BioSeq-Analysis2.0 was used to automatically generate various predictors for protein disordered region prediction based on the benchmark dataset (66). In total, 26 predictors were generated based on the 13 residue-level feature extraction methods for proteins (see Table 3) and two machine learning algorithms: CRF and SVM. Each predictor can be easily generated by running the following command line:
python analysis.py sequence_file Protein –method feature_extraction_method –ml machine_learning_method –labels label_file –model model_name –size sliding_window_size
The ROC curves of the 26 predictors are shown in Figure 8, from which we can see that both the feature extraction methods and the machine learning algorithms affect the performance of the corresponding predictors, and that the predictors based on the sequence labeling model CRF generally outperform those based on SVM, which is fully consistent with a recent study (66). In particular, the CRF-One-hot (6-bit) predictor achieves an AUC score of 0.7472, which is highly comparable with the existing state-of-the-art methods in this field (66).
As shown in some recent studies, machine learning techniques are playing increasingly important roles in biological sequence analysis (73,74), such as protein remote homology detection (75) and protein fold recognition (76). It can be anticipated that the proposed BioSeq-Analysis2.0 will become a very useful tool for researchers who are interested in developing new computational predictors for these tasks.
ACKNOWLEDGEMENTS
We are very much indebted to the three anonymous reviewers, whose constructive comments were very helpful for strengthening the presentation of this paper.
FUNDING
National Natural Science Foundation of China [61822306]; Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China [161063]; Scientific Research Foundation in Shenzhen [JCYJ20180306172207178]. Funding for open access charge: National Natural Science Foundation of China [61822306].
Conflict of interest statement. None declared.
Notes
Present address: Bin Liu, Beijing Institute of Technology, No. 5, South Zhongguancun Street, Haidian District, Beijing 100081, China.