Zhen Chen, Pei Zhao, Chen Li, Fuyi Li, Dongxu Xiang, Yong-Zi Chen, Tatsuya Akutsu, Roger J Daly, Geoffrey I Webb, Quanzhi Zhao, Lukasz Kurgan, Jiangning Song, iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization, Nucleic Acids Research, Volume 49, Issue 10, 4 June 2021, Page e60, https://doi.org/10.1093/nar/gkab122
Abstract
Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
INTRODUCTION
High-throughput sequencing has advanced significantly and seen widespread use over the past few decades, generating an unprecedented volume of DNA, RNA and protein sequence data. With the rapid accumulation of these data, effective analysis, mining and visualization of biological sequences have become non-trivial tasks (1). Among the variety of computational solutions, machine-learning methods offer a popular and efficient approach to accurate function prediction and analysis of biological sequences (2–7). Many sequence-based machine-learning approaches have been proposed, contributing to a better understanding of the functions and structures of DNAs, RNAs and proteins (8–10), particularly in the context of human disease (11–13). Despite the diversity of machine-learning frameworks for sequence analysis and prediction, they generally follow the same five major steps once the sequence data have been collected: feature extraction, feature analysis, classifier construction, performance evaluation, and data/result visualization, as illustrated in Figure 1.
Bioinformatics-driven data analysis is an essential part of biological studies. Sequence-based analysis and prediction often require complex processing steps, data science expertise, and access to sophisticated software. These requirements have become a significant hurdle, especially for biologists with limited bioinformatics expertise. Several web servers and standalone software packages for sequence-based analysis and prediction have been developed to meet these needs. Representative tools include iFeature (14), iLearn (15), Selene (16), Kipoi (17), Janggu (18), BioSeq-Analysis (19) and BioSeq-Analysis2.0 (20). Selene is a PyTorch-based deep-learning library for rapid development, training, and application of deep-learning models on biological sequences. Janggu is a Python package that similarly focuses on deep-learning models. Kipoi is a collaborative initiative that defines standards and fosters reuse of trained models. However, these three tools cover only a portion of the complete pipeline outlined in Figure 1; we provide a detailed side-by-side comparison in Supplementary Table S1. BioSeq-Analysis is regarded as the first automated platform for machine learning-based bioinformatics analysis and predictions at the sequence level (19). Its successor, BioSeq-Analysis2.0, adds residue-level analysis, further broadening the scope of the platform (20). In 2018, we released the first computational pipeline, iFeature, that generates features for both protein and peptide sequences. Later, we extended iFeature to design and implement iLearn, an integrated platform and meta-learner for feature engineering, machine-learning analysis and modelling of DNA, RNA and protein sequence data. Both platforms, iFeature and iLearn, have been applied in many areas of bioinformatics and computational biology, including but not limited to the prediction and identification of mutational effects (21), protein-protein interaction hotspots (22), drug-target interactions (23), protein crystallization propensity (24), DNA-binding sites (25) and DNA-binding proteins (26), protein families (27,28), and DNA, RNA and protein modifications (29–32). The breadth and number of these applications demonstrate a substantial need for such solutions. However, further work is needed. First, new platforms must overcome the limitations of current solutions in terms of streamlining and ease of use, so that sophisticated machine-learning analysis of biological sequences becomes accessible to both experienced bioinformaticians and biologists with limited programming background. This means that the development of complex predictive and analytical pipelines should be streamlined by a single platform that supports the entire computational process. Second, the current platforms offer limited facilities for feature extraction, feature analysis and classifier construction. This calls for new approaches that provide a more comprehensive and molecule-specific (DNA versus RNA versus protein) set of feature descriptors, a broader range of tools for feature analysis, and coverage of state-of-the-art machine-learning algorithms, including deep learning.
To this end, we release iLearnPlus, a comprehensive and automated sequence analysis and prediction platform implemented in Python/PyQt5. iLearnPlus works across all major operating systems (i.e. Windows, macOS and Linux). Our platform includes four modules: iLearnPlus-Basic, iLearnPlus-Estimator, iLearnPlus-AutoML and iLearnPlus-LoadModel. These modules support a wide range of functionality, including feature extraction, feature analysis, construction of machine-learning frameworks, training of machine-learning models/classifiers, assessment of predictive performance, statistical analysis, and data/result visualization. iLearnPlus is designed to serve both experienced bioinformaticians and biologists with limited bioinformatics expertise. Compared to the currently available tools (Supplementary Table S2), iLearnPlus offers the following key advantages:
- To the best of our knowledge, iLearnPlus is the first GUI-based platform that facilitates machine learning-based analysis and prediction of biological sequences;
- iLearnPlus surpasses the existing platforms in the number of available machine-learning algorithms and in the coverage of features produced from the input sequences: 21 machine-learning algorithms (12 conventional machine-learning methods, two ensemble-learning frameworks and seven deep-learning approaches) and 19 classes of features that cover 147 feature sets;
- iLearnPlus provides a variety of ways to visualize the user-defined data and prediction performance, including scatter plots, ROC (receiver operating characteristic) curves, PRC (precision-recall) curves, histograms, kernel density plots, heatmaps and boxplots;
- iLearnPlus supports two popular statistical tests, the Student's t-test and the bootstrap test (33), to assess the statistical significance of differences and improvements in model performance;
- iLearnPlus provides the iLearnPlus-AutoML module for evaluating the prediction performance of different machine-learning models and selecting the best-performing model via automatic parameter optimization, to support less data science-savvy users in maximizing the predictive capability of machine-learning pipelines;
- iLearnPlus facilitates the deployment of the developed models with the iLearnPlus-LoadModel module, which applies already trained machine-learning models to new data;
- iLearnPlus provides more options for model integration by exploring combinations of the prediction outcomes of separate models as input to retrain another machine-learning model (excluding the deep-learning approaches), to assess whether the prediction performance can be further improved; and
- iLearnPlus provides auxiliary functionality for data preprocessing, such as file format transformations and combining multiple feature encodings into one file.
iLearnPlus offers a user-friendly interface and integrates four functional modules that streamline the entire computational process of sequence-based analysis and prediction for DNA, RNA and protein sequences. This ‘one-stop’ solution facilitates the generation of biological hypotheses by supporting the design, testing and deployment of accurate predictive models. Below, we describe the features and capabilities of our platform for each of the five major steps defined above. We also demonstrate its application with two case studies that concern the development and testing of novel machine-learning models for the prediction of long noncoding RNAs (lncRNAs) and crotonylation sites in protein chains.
MATERIALS AND METHODS
Feature extraction
The feature extraction functionality in the iLearnPlus-Basic module generates numeric vectors from biological sequences. These vectors encode biochemical, biophysical, and compositional properties of the input sequences in a format that is compatible with the subsequent machine-learning tasks. iLearnPlus incorporates 19 major classes of features for protein, RNA and DNA sequences (Tables 1 and 2). For comparison, iLearnPlus offers 50, 94 and 31 more feature sets than the current platforms iLearn (15), iFeature (14) and BioSeq-Analysis2.0 (20), respectively. Supplementary Table S2 provides a detailed side-by-side comparison of the numbers of feature sets that these platforms offer for DNA, RNA and protein sequences. iLearnPlus requires input sequences in the FASTA format. We designed an extended version of the header line (standard FASTA format is also accepted; refer to the online instructions for more detail) that is processed by the graphical input data explorer. The biological sequence type (i.e. DNA, RNA or protein) is detected automatically from the input sequences. The feature sets are listed in Table 1 (for protein sequences) and Table 2 (for DNA and RNA sequences), and can be customized by users. The selected subset of feature sets is output in a convenient table widget, which includes the molecule name, molecule label, feature (column) names and the corresponding values. iLearnPlus supports four formats for saving the calculated features: the LIBSVM format, Comma-Separated Values (CSV), Tab-Separated Values (TSV), and the Waikato Environment for Knowledge Analysis (WEKA) format. This variety of popular formats facilitates the direct use of the features in third-party computational tools, such as scikit-learn (34), WEKA (35) and its web interface.
Table 1. Feature descriptors for protein sequences available in iLearnPlus

| Descriptor group | Descriptor (abbreviation) | Reference |
|---|---|---|
| Amino acid composition | Amino acid composition (AAC) | (37) |
| | Enhanced amino acid composition (EAAC) | (14,15) |
| | Composition of k-spaced amino acid pairs (CKSAAP) | (38,39) |
| | Kmer (dipeptides and tripeptides) composition (DPC and TPC) | (37,40) |
| | Dipeptide deviation from expected mean (DDE) | (40) |
| | Composition (CTDC) | (41–45) |
| | Transition (CTDT) | (41–45) |
| | Distribution (CTDD) | (41–45) |
| | Conjoint triad (CTriad) | (46) |
| | Conjoint k-spaced triad (KSCTriad) | (14,15) |
| | Adaptive skip dipeptide composition (ASDC) | (47) |
| | PseAAC of distance-pairs and reduced alphabet (DistancePair) | (20,48) |
| Grouped amino acid composition | Grouped amino acid composition (GAAC) | (14,15) |
| | Grouped enhanced amino acid composition (GEAAC) | (14,15) |
| | Composition of k-spaced amino acid group pairs (CKSAAGP) | (14,15) |
| | Grouped dipeptide composition (GDPC) | (14,15) |
| | Grouped tripeptide composition (GTPC) | (14,15) |
| Autocorrelation | Moran (Moran) | (49,50) |
| | Geary (Geary) | (51) |
| | Normalized Moreau-Broto (NMBroto) | (52) |
| | Auto covariance (AC) | (53–55) |
| | Cross covariance (CC) | (53–55) |
| | Auto-cross covariance (ACC) | (53–55) |
| Quasi-sequence-order | Sequence-order-coupling number (SOCNumber) | (56–58) |
| | Quasi-sequence-order descriptors (QSOrder) | (56–58) |
| Pseudo-amino acid composition | Pseudo-amino acid composition (PAAC) | (59,60) |
| | Amphiphilic PAAC (APAAC) | (59,60) |
| | Pseudo K-tuple reduced amino acids composition (PseKRAAC_type 1 to type 16) | (61) |
| Residue composition | Binary - 20bit (binary) | (62,63) |
| | Binary - 6bit (binary_6bit) | (20,64) |
| | Binary - 5bit (binary_5bit_type 1 and type 2) | (20,65) |
| | Binary - 3bit (binary_3bit_type 1 to type 7) | (47) |
| | Learn from alignments (AESNN3) | (20,66) |
| | Overlapping property features - 10 bit (OPF_10bit) | (47) |
| | Overlapping property features - 7 bit (OPF_7bit type 1 to type 3) | (47) |
| Physicochemical property | AAIndex (AAIndex) | (67) |
| BLOSUM matrix | BLOSUM62 (BLOSUM62) | (68) |
| Z-Scale index | Z-Scale (Zscale) | (69) |
| Similarity-based descriptor | K-nearest neighbor (KNN) | (70) |
Table 2. Feature descriptors for DNA and RNA sequences available in iLearnPlus

| Descriptor group | Descriptor (abbreviation) | Sequence type | Reference |
|---|---|---|---|
| Nucleic acid composition | Nucleic acid composition (NAC) | DNA/RNA | (15) |
| | Enhanced nucleic acid composition (ENAC) | DNA/RNA | (15) |
| | k-spaced nucleic acid pairs (CKSNAP) | DNA/RNA | (15) |
| | Basic kmer (Kmer) | DNA/RNA | (71) |
| | Reverse complement kmer (RCKmer) | DNA | (72,73) |
| | Accumulated nucleotide frequency (ANF) | DNA/RNA | (74) |
| | Nucleotide chemical property (NCP) | DNA/RNA | (74) |
| | The occurrence of kmers, allowing at most m mismatches (Mismatch) | DNA/RNA | (20) |
| | The occurrences of kmers, allowing non-contiguous matches (Subsequence) | DNA/RNA | (20) |
| | Adaptive skip dinucleotide composition (ASDC) | DNA/RNA | (47) |
| | Local position-specific dinucleotide frequency (LPDF) | DNA/RNA | (75) |
| | The Z curve parameters for frequencies of phase-specific mononucleotides (Z_curve_9bit) | DNA/RNA | (76) |
| | The Z curve parameters for frequencies of phase-independent dinucleotides (Z_curve_12bit) | DNA/RNA | (76) |
| | The Z curve parameters for frequencies of phase-specific dinucleotides (Z_curve_36bit) | DNA/RNA | (76) |
| | The Z curve parameters for frequencies of phase-independent trinucleotides (Z_curve_48bit) | DNA/RNA | (76) |
| | The Z curve parameters for frequencies of phase-specific trinucleotides (Z_curve_144bit) | DNA/RNA | (76) |
| Residue composition | Binary (binary) | DNA/RNA | (62,63) |
| | Dinucleotide binary encoding (DBE) | DNA/RNA | (75) |
| | Position-specific of two nucleotides (PS2) | DNA/RNA | (20,77) |
| | Position-specific of three nucleotides (PS3) | DNA/RNA | (20,77) |
| | Position-specific of four nucleotides (PS4) | DNA/RNA | (20,77) |
| Position-specific tendencies of trinucleotides | Position-specific trinucleotide propensity based on single-strand (PSTNPss) | DNA/RNA | (78,79) |
| | Position-specific trinucleotide propensity based on double-strand (PSTNPds) | DNA | (78,79) |
| Electron-ion interaction pseudopotentials | Electron-ion interaction pseudopotentials value (EIIP) | DNA | (80,81) |
| | Electron-ion interaction pseudopotentials of trinucleotide (PseEIIP) | DNA | (80,81) |
| Autocorrelation and cross-covariance | Dinucleotide-based auto covariance (DAC) | DNA/RNA | (53–55) |
| | Dinucleotide-based cross covariance (DCC) | DNA/RNA | (53–55) |
| | Dinucleotide-based auto-cross covariance (DACC) | DNA/RNA | (53–55) |
| | Trinucleotide-based auto covariance (TAC) | DNA | (53) |
| | Trinucleotide-based cross covariance (TCC) | DNA | (53) |
| | Trinucleotide-based auto-cross covariance (TACC) | DNA | (53) |
| | Moran (Moran) | DNA/RNA | (49,50) |
| | Geary (Geary) | DNA/RNA | (51) |
| | Normalized Moreau-Broto (NMBroto) | DNA/RNA | (52) |
| Physicochemical property | Dinucleotide physicochemical properties (DPCP type 1 and type 2) | DNA/RNA | (82) |
| | Trinucleotide physicochemical properties (TPCP type 1 and type 2) | DNA | (82) |
| Mutual information | Multivariate mutual information (MMI) | DNA/RNA | (83) |
| Similarity-based descriptor | K-nearest neighbor (KNN) | DNA/RNA | (83) |
| Pseudo nucleic acid composition | Pseudo dinucleotide composition (PseDNC) | DNA/RNA | (53,84) |
| | Pseudo k-tuple nucleotide composition (PseKNC) | DNA/RNA | (53,84) |
| | Parallel correlation pseudo dinucleotide composition (PCPseDNC) | DNA/RNA | (53,84) |
| | Parallel correlation pseudo trinucleotide composition (PCPseTNC) | DNA | (53,84) |
| | Series correlation pseudo dinucleotide composition (SCPseDNC) | DNA/RNA | (53,84) |
| | Series correlation pseudo trinucleotide composition (SCPseTNC) | DNA | (53,84) |
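To make concrete what a composition-type descriptor from these tables computes, the sketch below derives the 20-dimensional AAC vector (the frequency of each standard amino acid) from a protein sequence. This is an illustrative re-implementation, not iLearnPlus's internal code; the function name and example peptide are our own.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence: str) -> list[float]:
    """Amino acid composition: frequency of each standard residue in the sequence."""
    counts = Counter(sequence.upper())
    n = len(sequence) or 1  # guard against empty input
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

# Example with a short, hypothetical peptide
print(aac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```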
Besides the data tables, iLearnPlus provides advanced facilities to visualize the data. For instance, it generates hybrid plots that overlay kernel density curves and histograms (Figure 2), which can be used to shed light on the statistical distributions of the extracted features. The histogram provides a visual representation of feature values grouped into discrete intervals, while the kernel density approach produces a smooth curve that represents the probability density function for continuous variables (36) (Figure 2A). The visualization can be produced for a whole dataset as well as for a selected subset of features in that dataset.
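A hybrid histogram/kernel-density view of this kind can be approximated with matplotlib (which iLearnPlus uses for plotting) and SciPy's Gaussian KDE; the feature values below are synthetic stand-ins for one extracted feature column.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic values standing in for one extracted feature column
rng = np.random.default_rng(0)
values = rng.normal(loc=0.3, scale=0.1, size=500)

fig, ax = plt.subplots()
ax.hist(values, bins=30, density=True, alpha=0.5, label="histogram")

xs = np.linspace(values.min(), values.max(), 200)
ax.plot(xs, gaussian_kde(values)(xs), label="kernel density")
ax.set_xlabel("feature value")
ax.set_ylabel("density")
ax.legend()
plt.show()
```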
Feature analysis
Feature analysis is an optional but highly recommended step that helps to eliminate irrelevant, noisy, or redundant features from the original feature set, with the overarching goal of optimizing the predictive performance of the subsequently used machine-learning algorithm(s) (32). iLearnPlus provides multiple options to facilitate feature analysis, including ten feature-clustering, three dimensionality-reduction, two feature-normalization and five feature-selection approaches (Table 3). Compared with iLearn, currently the most comprehensive platform in the context of feature analysis (Supplementary Table S1), iLearnPlus provides four additional clustering algorithms: Mini-Batch k-means clustering (85,86), Markov clustering (MCL) (87), agglomerative clustering (88), and spectral clustering (89). The feature analysis tools support the same comprehensive list of file formats as the feature extraction tools (i.e. the LIBSVM format, CSV, TSV and the WEKA format).
Table 3. Feature analysis algorithms available in iLearnPlus

| Method | Algorithm (abbreviation) | Reference |
|---|---|---|
| Clustering | k-means (kmeans) | (85,86) |
| | Mini-Batch K-means (MiniBatchKMeans) | (85,86) |
| | Gaussian mixture (GM) | (85,86) |
| | Agglomerative (Agglomerative) | (88) |
| | Spectral (Spectral) | (89) |
| | Markov clustering (MCL) | (87) |
| | Hierarchical clustering (hcluster) | (85,90) |
| | Affinity propagation clustering (APC) | (91) |
| | Mean shift (meanshift) | (92) |
| | DBSCAN (dbscan) | (93) |
| Feature selection | Chi-square test (CHI2) | (38) |
| | Information gain (IG) | (38,39) |
| | F-score value (FScore) | (94) |
| | Mutual information (MIC) | (95) |
| | Pearson's correlation coefficient (Pearson) | (96) |
| Dimensionality reduction | Principal component analysis (PCA) | (97) |
| | Latent Dirichlet allocation (LDA) | (98) |
| | t-distributed stochastic neighbor embedding (t_SNE) | (99) |
| Feature normalization | Z-Score (ZScore) | (15) |
| | MinMax (MinMax) | (15) |
Clustering groups similar objects (molecules) in a given dataset described by a specific set of features. Upon completion of the clustering process, molecules are grouped and each group is assigned a cluster ID. The cluster IDs are displayed in the table widget. The feature selection and dimensionality reduction approaches serve to reduce the number of features, while potentially boosting the prediction performance by eliminating irrelevant (to a given predictive task) and redundant (mutually correlated) features. Finally, feature normalization rescales the feature values to a specific range, so that different features can be used together in the same dataset. We provide two widely used normalization algorithms: Z-score normalization and MinMax normalization. In the Z-score normalization, features are rescaled to have a mean of 0 and a standard deviation of 1. In the MinMax normalization, features are scaled to the unit range between 0 and 1. In iLearnPlus, the results produced by the feature selection and normalization methods can be conveniently visualized using the hybrid plots, while a scatter plot can be used to display the outputs produced by the clustering and dimensionality reduction tools (Figure 2B).
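As a minimal sketch of the two normalization schemes, using scikit-learn (one of the libraries iLearnPlus builds on); the toy feature matrix is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy feature matrix: two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_z = StandardScaler().fit_transform(X)   # Z-score: mean 0, std 1 per feature
X_mm = MinMaxScaler().fit_transform(X)    # MinMax: rescaled to [0, 1] per feature
```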
Classifier construction and integration
Many objectives related to the analysis of DNA/RNA/protein sequences can be formulated as classification problems. Examples include the prediction of structures and functions of protein and nucleic acid sequences (19,100,101). iLearnPlus supports both binary classification (two outcomes) and multi-class classification (multiple outcomes). It offers 12 conventional machine-learning algorithms, two ensemble-learning frameworks, and seven deep-learning approaches (Table 4). This selection of algorithms is more comprehensive than what the current platforms offer, i.e. 21 versus 5 in iLearn and BioSeq-Analysis2.0 (Supplementary Table S2). We use implementations from four popular third-party machine-learning libraries: scikit-learn (34), XGBoost (102), LightGBM (103) and PyTorch (104). The deep-learning approaches are implemented using the PyTorch library, the LightGBM and XGBoost algorithms are implemented using the LightGBM and XGBoost packages, respectively, and the scikit-learn library is used for the remaining algorithms.
Table 4. Machine-learning algorithms available in iLearnPlus

| Algorithm category | Algorithm | Reference |
|---|---|---|
| Conventional machine-learning algorithms | Random forest (RF) | (105) |
| | Decision tree (DecisionTree) | (106) |
| | Support vector machine (SVM) | (107) |
| | K-nearest neighbors (KNN) | (108) |
| | Logistic regression (LR) | (109) |
| | Gradient boosting decision tree (GBDT) | (110) |
| | Light gradient boosting machine (LightGBM) | (111) |
| | Extreme gradient boosting (XGBoost) | (102) |
| | Stochastic gradient descent (SGD) | (34) |
| | Naïve Bayes (NaïveBayes) | (112) |
| | Linear discriminant analysis (LDA) | (113) |
| | Quadratic discriminant analysis (QDA) | (113) |
| Ensemble-learning frameworks | Bagging (Bagging) | (114) |
| | Adaptive boosting (AdaBoost) | (115) |
| Deep-learning algorithms | Convolutional neural network (CNN) | (30) |
| | Attention-based convolutional neural network (ABCNN) | (116) |
| | Recurrent neural network (RNN) | (117) |
| | Bidirectional recurrent neural network (BRNN) | (118,119) |
| | Residual network (ResNet) | (120) |
| | Auto-encoder (AE) | (121) |
| | Multilayer perceptron (MLP) | (122) |
For the conventional classifiers, iLearnPlus supports automatic parameter optimization while still allowing users to specify their own parameters. We adopt the grid search strategy to automate the parametrization. For example, users can either directly specify the values of the penalty and gamma parameters for the RBF (Radial Basis Function) kernel of the SVM classifier, or select the ‘Auto optimization’ option to optimize these two parameters automatically. The default parameter search space, which for the gamma values ranges from 2^−10 to 2^5, can be modified to a user-defined range. iLearnPlus also provides two classifier-dependent ensemble-learning frameworks: Bagging and AdaBoost. These frameworks are typically used to boost predictive performance. Importantly, iLearnPlus supports parallelization (via the use of multiple processors) to improve the computational efficiency of parallelizable algorithms, such as RF, Bagging, XGBoost and LightGBM.
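A grid search of this kind can be expressed with scikit-learn's GridSearchCV. The sketch below mirrors the gamma range stated above (2^−10 to 2^5); the C (penalty) range and the 5-fold setting are illustrative assumptions, not iLearnPlus defaults.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# gamma range mirrors the default described above (2^-10 .. 2^5);
# the C range is an assumption made for illustration only
param_grid = {
    "C": [2.0**k for k in range(-5, 6)],
    "gamma": [2.0**k for k in range(-10, 6)],
}
search = GridSearchCV(SVC(kernel="rbf", probability=True),
                      param_grid, cv=5, n_jobs=-1)
# search.fit(X_train, y_train)   # X_train, y_train: user-supplied features/labels
# print(search.best_params_)
```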
Another key advantage of iLearnPlus is the availability of multiple modern deep-learning classifiers. Deep-learning techniques rely on multi-layer (deep) neural networks (NNs) to train complex predictive models from the high-dimensional data that can be produced from biological sequences (123,124). To facilitate applications of deep learning in the analysis of DNA, RNA and protein sequences, iLearnPlus incorporates deep-learning architectures including the convolutional NN, attention-based convolutional NN, recurrent NN, bidirectional recurrent NN, residual NN, auto-encoder NN and the traditional multilayer perceptron NN (Table 4). These frameworks rely on a wide range of recent advances, including the convolution operation, attention mechanism, stacked residual blocks, long short-term memory (LSTM) units, and gated recurrent units (GRUs). Their inclusion is motivated by recent successful applications in predicting protein contact maps (125,126), protein function (127), DNA-protein binding (128) and compound-protein affinity (129), to name but a few examples. Details concerning the architectures and parameters of these deep-learning networks can be found in the iLearnPlus online manual. Notably, iLearnPlus automatically detects and uses GPU devices to optimize performance and reduce the computational burden. When training deep-learning models, our platform uses the following default parameters: cross-entropy as the loss function, a learning rate of 10^−3, a maximum of 1000 epochs, termination of training after 100 epochs without performance improvement, and parameter optimization with the widely used Adam algorithm (130). Alternatively, these parameters can be configured manually by users.
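The stated training defaults translate into a loop of the following shape. This is a generic PyTorch sketch under those defaults (Adam, cross-entropy, at most 1000 epochs, early stopping after 100 epochs without validation improvement, automatic GPU detection), not iLearnPlus's actual implementation; the model and data loaders are assumed to be supplied by the caller.

```python
import torch
from torch import nn

def train(model, train_loader, val_loader,
          lr=1e-3, max_epochs=1000, patience=100):
    """Generic loop mirroring the stated defaults: Adam, cross-entropy,
    up to 1000 epochs, stop after 100 epochs without validation improvement."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_loss, stale = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
        # Validation loss drives the early-stopping decision
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(X.to(device)), y.to(device)).item()
                           for X, y in val_loader)
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return model
```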
iLearnPlus also provides an option to perform meta-learning (131), where the results produced by multiple predictive models (so-called base models) are used in tandem to train new machine-learning models; the deep-learning approaches are excluded from the meta-learning. The underlying objective is to improve predictive performance relative to the base models. iLearnPlus assesses the performance of the constructed meta-models and identifies the best one (i.e. the model producing the best predictive performance).
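This meta-learning scheme corresponds to what is often called stacking; a sketch with scikit-learn's StackingClassifier is shown below. The base models and the logistic-regression meta-learner are illustrative choices, not the platform's fixed configuration.

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Base models whose out-of-fold predictions feed a second-level learner,
# in the spirit of the meta-learning option described above
base_models = [
    ("rf", RandomForestClassifier(n_estimators=200)),
    ("svm", SVC(probability=True)),
]
meta_model = StackingClassifier(estimators=base_models,
                                final_estimator=LogisticRegression(),
                                cv=5)
# meta_model.fit(X_train, y_train)
```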
Performance evaluation
iLearnPlus implements the K-fold cross-validation and independent test assessments to evaluate the performance of the constructed classifiers. The K-fold cross-validation test divides the dataset at random into K equally-sized subsets of sequences (i.e. folds). One of these folds is used as the validation dataset, and the remaining K – 1 folds are used as the training dataset to train the machine-learning model and optimize its parameters. After repeating this process K times, each fold is used once as the validation dataset (132). The independent test aims to evaluate and compare the predictive performance of multiple classifiers using a non-overlapping test dataset. This allows users to control the level of similarity between the training and test sequences. In iLearnPlus, the samples labeled as ‘training’ are used to implement the K-fold cross-validation test, while the samples labeled as ‘testing’ are used as the test dataset.
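For readers who want an equivalent in code, the K-fold protocol described above maps directly onto scikit-learn's cross-validation utilities; the classifier, scoring metric and fold count below are illustrative.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validation: each fold serves once as the validation set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# scores = cross_val_score(RandomForestClassifier(), X, y, cv=cv, scoring="roc_auc")
# print(scores.mean())
```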
Data visualization and statistical analysis
iLearnPlus provides a wide range of tools to support the analysis and visualization of prediction results. It offers a variety of statistical plots, including histograms, kernel density curves, heatmaps, boxplots, and ROC and PRC curves, to help users interpret the prediction outcomes effectively (Table 5). As discussed above, histograms and kernel density plots are particularly suitable for visualizing data distributions, while scatter plots should be used to analyze feature clustering and dimensionality reduction results. We supplement the predictive performance quantified with AUROC and AUPRC with the corresponding ROC and PRC curves. Users should employ boxplots to illustrate the distribution of the evaluation metric values from the K-fold cross-validation experiments, allowing for comparison of predictive quality across different models (e.g. the performance based on different feature descriptors and/or machine-learning algorithms). We use the matplotlib library (136) to generate the plots in iLearnPlus. The corresponding graphics can be saved in a variety of image formats (e.g. PNG, JPG, PDF and TIFF).
Table 5. Data visualization and statistical analysis options in iLearnPlus

| Category | Type | Purpose |
|---|---|---|
| Graphic display options | Histogram | Display data distribution |
| | Kernel density plot | Display data distribution |
| | Heatmap | Display the P-value/correlation matrix between different models |
| | Scatter plot | Display clustering and dimensionality reduction results |
| | Boxplot | Depict the per-fold values from the K-fold cross-validation for each of the eight metrics |
| | ROC curve | Depict the overall performance of a model for balanced data |
| | PRC curve | Depict the overall performance of a model for unbalanced data |
| Statistical analysis | Student's t-test | Compare the means of an evaluation metric between two models |
| | Bootstrap test | Evaluate the significance of the performance difference between all pairs of ROC or PRC curves |
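A sketch of how ROC and PRC curves of this kind can be produced with scikit-learn and matplotlib, from binary labels and predicted positive-class probabilities (both assumed to be supplied by the caller):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, auc

def plot_roc_prc(y_true, y_score):
    """Plot ROC and PRC curves side by side from labels and positive-class scores."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(fpr, tpr, label=f"AUROC = {auc(fpr, tpr):.3f}")
    ax1.plot([0, 1], [0, 1], "k--")  # chance line
    ax1.set_xlabel("False positive rate")
    ax1.set_ylabel("True positive rate")
    ax2.plot(rec, prec, label=f"AUPRC = {auc(rec, prec):.3f}")
    ax2.set_xlabel("Recall")
    ax2.set_ylabel("Precision")
    ax1.legend()
    ax2.legend()
    plt.show()
```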
RESULTS AND DISCUSSION
The functions and modules in iLearnPlus
iLearnPlus covers the five major steps needed to build effective models for the analysis and prediction of nucleic acid and protein sequences: feature extraction, feature analysis, classifier construction, performance evaluation and data/result visualization (Figure 1). We implement these steps in four modules: iLearnPlus-Basic, iLearnPlus-Estimator, iLearnPlus-AutoML and iLearnPlus-LoadModel (Figure 3). The iLearnPlus-Basic module facilitates analysis and prediction using a selected feature-based representation of the input protein/RNA/DNA sequences (sequence descriptors) and a selected machine-learning classifier. This module is particularly instrumental when interrogating the impact of different sequence feature descriptors and machine-learning algorithms on the predictive performance. The iLearnPlus-Estimator module provides a flexible way to perform feature extraction by allowing users to select multiple feature descriptors. The iLearnPlus-AutoML module focuses on automated benchmarking and maximization of the predictive performance across different machine-learning classifiers applied to the same set or combined sets of feature descriptors. By combining the iLearnPlus-Estimator and iLearnPlus-AutoML modules, users can conveniently and efficiently evaluate and compare predictive quality across different sequence descriptors and different machine-learning algorithms. Moreover, models generated by iLearnPlus can be exported and saved as model files with the ‘.pkl’ extension, in both the stand-alone software and the web server. Using the iLearnPlus-LoadModel module, users can upload, deploy and test their models on new (test) data. The saved models that rely on conventional machine-learning algorithms can be directly applied in the scikit-learn environment, whereas the exported deep-learning models can be applied using the PyTorch library. Section 8 of the user manual provides detailed instructions.
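For conventional models, the exported ‘.pkl’ files can be loaded back in a standard Python session. The sketch below assumes pickle-compatible serialization (consistent with the ‘.pkl’ extension and the scikit-learn compatibility described above; consult the user manual for the authoritative procedure), and the file name is hypothetical.

```python
import pickle

# Load a conventional (scikit-learn based) model exported by iLearnPlus;
# "my_model.pkl" is a hypothetical file name
with open("my_model.pkl", "rb") as fh:
    model = pickle.load(fh)

# predictions = model.predict(X_new)          # predicted class labels
# probabilities = model.predict_proba(X_new)  # predicted class probabilities
```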
Building and customizing machine-learning pipelines using iLearnPlus
iLearnPlus makes it straightforward to design and optimize machine-learning pipelines that achieve competitive (if not the best) predictive performance. The design process typically boils down to two key objectives: extraction and selection of features, and selection and parametrization of machine-learning models, both of which are supported by iLearnPlus. Our platform tackles these objectives via the example procedure summarized in Figure 4. Users first apply the iLearnPlus-Estimator module to generate multiple sequence descriptors (feature sets) from the input sequences and test them by constructing and evaluating machine-learning models in batch mode. This allows users to establish a point of reference for the subsequent optimization/parameterization of the model; the corresponding results and models can be saved for future reference. Next, the iLearnPlus-Basic module is used to analyze and rank the feature descriptors. Based on the ranking, users select and evaluate a subset of well-performing features (e.g. the top N features). Finally, the evaluation is performed with the help of the iLearnPlus-AutoML module, which optimizes different machine-learning classifiers on the selected feature set, performs a statistical comparative analysis of the results, and provides the option to save the best model.
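For readers who prefer scripting, the same design logic can be sketched with generic scikit-learn primitives. This is an assumption-laden illustration, not the iLearnPlus API: the data, the value of N and the candidate classifiers are placeholders.

```python
# Sketch of the Figure 4 workflow using generic scikit-learn primitives;
# the data, the value of N and the candidate classifiers are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 100)               # placeholder feature matrix
y = np.random.randint(0, 2, 200)           # placeholder binary labels

# Step 1: baseline model on the full feature set (point of reference).
rf = RandomForestClassifier(n_estimators=1000).fit(X, y)
baseline_auc = cross_val_score(rf, X, y, cv=5, scoring='roc_auc').mean()

# Step 2: rank features by importance and keep the top N (here N = 20).
top_n = np.argsort(rf.feature_importances_)[::-1][:20]
X_top = X[:, top_n]

# Step 3: compare candidate classifiers on the selected features.
for clf in (RandomForestClassifier(n_estimators=1000),
            LogisticRegression(max_iter=1000)):
    auc = cross_val_score(clf, X_top, y, cv=5, scoring='roc_auc').mean()
    print(type(clf).__name__, round(auc, 3))
```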
The iLearnPlus web server and source code
The full version of iLearnPlus, which covers the four modules (iLearnPlus-Basic, iLearnPlus-Estimator, iLearnPlus-AutoML and iLearnPlus-LoadModel) and provides a graphical user interface, is available from the GitHub repository at https://github.com/Superzchen/iLearnPlus/. The GUI for the four modules is shown in Figure 5.
iLearnPlus is also freely available as a web server at http://ilearnplus.erc.monash.edu/. In this case, the calculations are performed on the server side, freeing users from engaging their own computational resources. The server runs on the Nectar (National eResearch Collaboration Tools and Resources) cloud computing infrastructure, which is managed by the eResearch Centre at Monash University, and was implemented using the open-source LAMP (Linux-Apache-MySQL-PHP) web platform on a machine with 16 CPU cores, 64 GB of memory and a 2 TB hard disk. The server supports five popular web browsers: Internet Explorer (version 7.0 or later), Microsoft Edge, Mozilla Firefox, Google Chrome and Safari. Given the high computational cost, the web server runs only the iLearnPlus-Basic module, which supports basic analysis and machine-learning modeling of DNA, RNA and protein sequence data. Figure 6 shows a screenshot of the main page of the iLearnPlus web server, where the inputs and parameters of the analysis are entered.
Case studies
We showcase real-world applications of iLearnPlus with two bioinformatic scenarios: identification of long noncoding RNAs (lncRNAs) and prediction of protein crotonylation sites. We emphasize that the underlying objective is to illustrate how to use our platform for two such diverse applications, rather than to secure top predictive performance relative to the state of the art.
lncRNAs are transcripts longer than 200 bp that do not code for proteins (137); approximately 70% of noncoding sequences are transcribed into lncRNAs. They regulate a variety of biological processes and are linked to several human diseases (138,139). We applied the iLearnPlus-Basic module to extract feature sets and train a classifier that accurately differentiates between lncRNA and mRNA sequences. We used the datasets from a recent study by Han et al. (137). The training and validation datasets contain 4200 lncRNA and 4200 mRNA sequences, while the test dataset includes 1800 lncRNA and 1800 mRNA sequences from Mus musculus. Several studies demonstrate that the distribution of adjoining bases differs between lncRNAs and mRNAs (140,141). We therefore selected the ‘Kmer’ (size = 3) feature descriptor to extract the features, applied the random forest algorithm (number of trees = 1000) to train the classifier, and optimized the classifier using 5-fold cross-validation. This simple design secured AUC = 0.897, Acc = 81.89% and MCC = 0.642 on the test dataset. The entire process took about 10 min to complete with the iLearnPlus web server using one CPU. Figures 6 and 7 show the parameter configurations and prediction results that we obtained using the online version of the iLearnPlus-Basic module. We note that the kernel density curves and histograms in Figure 7, which visualize the distributions of the extracted features, are new features of iLearnPlus that are not available in other current tools. This case study demonstrates that the entire process of producing a simple and well-performing model can be conveniently completed in a matter of minutes. We used the entire datasets (a total of 12 000 sequences) with the iLearnPlus web server; however, considering the computational burden on the server side, we set the maximum number of sequences that can be submitted to the server to 2000. When users need to process larger amounts of sequence data, we encourage them to use the GUI version of iLearnPlus, which does not limit the number of input sequences.
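To make the feature-extraction step concrete, the following Python sketch shows one plausible way to compute 3-mer frequency features and train the random forest described above. The two sequences, the labels, the ACGT alphabet and the commented-out evaluation call are illustrative assumptions, not the exact iLearnPlus implementation.

```python
# Illustrative sketch of the lncRNA pipeline: 3-mer frequency features plus
# a 1000-tree random forest. The sequences, labels and ACGT alphabet are
# placeholder assumptions, not the exact iLearnPlus implementation.
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

KMERS = [''.join(p) for p in product('ACGT', repeat=3)]  # all 64 3-mers

def kmer_features(seq, k=3):
    """Return the normalized k-mer frequency vector for one sequence."""
    counts = {kmer: 0 for kmer in KMERS}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = max(len(seq) - k + 1, 1)
    return [counts[kmer] / total for kmer in KMERS]

sequences = ['ATGCGTACGTTAGC', 'GGGCCCAAATTTGG']  # placeholder transcripts
labels = np.array([1, 0])                         # 1 = lncRNA, 0 = mRNA
X = np.array([kmer_features(s) for s in sequences])

clf = RandomForestClassifier(n_estimators=1000).fit(X, labels)
# With a realistically sized dataset, evaluate via 5-fold cross-validation:
# auc = cross_val_score(clf, X, labels, cv=5, scoring='roc_auc').mean()
```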
Lysine crotonylation (Kcr) is a post-translational modification (PTM) that was originally found in histone proteins (142). Recently, this PTM was discovered in non-histone proteins and was found to be involved in the regulation of cell cycle progression and DNA replication (143,144). Here, we used the iLearnPlus-Estimator module to comparatively assess the performance of different feature sets using a dataset retrieved from (145,146). We used the data preprocessing strategy from Chen et al. (32), which was developed for a similar prediction problem. Inspired by a recent study by Chen et al. (147), we collected 6687 Kcr sites (positives) and 67 240 non-Kcr sites (negatives) using sequence segments of 29 residues. We used the iLearnPlus-Estimator module in the standalone GUI version (Figure 8A) to load the data, produce seven feature sets (AAC, EAAC, EGAAC, DDE, binary, ZScale and BLOSUM) and select a machine-learning algorithm. As in the first case study, we used the random forest algorithm (with the default setting of 1000 trees) to construct the classifier via 10-fold cross-validation; the corresponding setup is shown in Figure 8A. Figure 8B summarizes the predictive performance, quantified with AUROC and AUPRC, for models that used each of the seven selected feature sets as inputs. This analysis reveals that the model built with the EGAAC feature descriptors achieved the best performance, and shows how easily iLearnPlus can be used to rationally select a well-performing feature encoding for the input sequences. Next, we used the iLearnPlus-AutoML module to comparatively evaluate the predictive performance across seven machine-learning algorithms: SGD, LR, XGBoost, LightGBM, RF, MLP and CNN. We used bootstrap tests to assess the statistical significance of the differences between the ROC curves produced by these algorithms. Figure 9 summarizes the corresponding performance evaluation. More specifically, it shows the evaluation metrics Sn, Sp, Pre, Acc, MCC and F1 computed at the default threshold values (panel A), the correlation matrix that quantifies mutual correlations between classifiers (panel B), ROC (panel C) and PRC (panel D) curves, and boxplots that directly compare results between classifiers (panel E). We found that the deep-learning model, CNN, achieved the best predictive performance among the seven machine-learning algorithms, with Acc = 85.4% and AUC = 0.823. Overall, this case study demonstrates how to effectively and efficiently address the two key objectives that lead to accurate models: extraction and selection of useful features, and selection of the best-performing machine-learning model. We considered seven different feature sets, selected the best set and comparatively evaluated seven machine-learning models using a broad and informative set of metrics, which ultimately led to an informed selection of an accurate solution. We emphasize that four of the seven selected algorithms (i.e. CNN, SGD, XGBoost and LightGBM), the ability to run statistical tests, and key methods for visualizing results (i.e. boxplots and heatmaps of the correlations between models) are among the new features offered by iLearnPlus that are not available in current platforms.
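The bootstrap comparison of ROC curves mentioned above can be illustrated with the following hedged sketch, which compares the AUROCs of two models on resampled test sets. The labels, scores, number of resamples and the centred two-sided p-value are illustrative choices, not necessarily the exact procedure implemented in iLearnPlus.

```python
# Hedged sketch of a bootstrap test for the difference between two models'
# AUROCs; labels, scores and the resampling scheme are placeholders, not
# necessarily the exact procedure implemented in iLearnPlus.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                  # placeholder labels
scores_a = 0.4 * y_true + 0.6 * rng.random(500)   # model A: informative
scores_b = rng.random(500)                        # model B: uninformative

observed = roc_auc_score(y_true, scores_a) - roc_auc_score(y_true, scores_b)

deltas = []
for _ in range(1000):                             # resample with replacement
    idx = rng.integers(0, len(y_true), len(y_true))
    if len(np.unique(y_true[idx])) < 2:           # AUC needs both classes
        continue
    deltas.append(roc_auc_score(y_true[idx], scores_a[idx])
                  - roc_auc_score(y_true[idx], scores_b[idx]))

# Two-sided p-value from the bootstrap distribution centred at zero.
deltas = np.asarray(deltas)
p_value = np.mean(np.abs(deltas - deltas.mean()) >= abs(observed))
print(f'delta AUC = {observed:.3f}, bootstrap p ~= {p_value:.3f}')
```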
These case studies demonstrate that iLearnPlus is a comprehensive platform for the design, evaluation and analysis of the predictive models for both nucleic acid and protein sequences. It can be used to produce an accurate model effectively and, at the same time, to run a fully-fledged design protocol that encompasses feature extraction, feature selection, model selection, comprehensive comparative assessment, and result visualization. Moreover, the trained models can be exported and deployed on new data using the iLearnPlus-LoadModel module.
CONCLUSION
The massive accumulation of sequence data calls for equally vigorous efforts to develop computational models that can analyze and draw inferences from these data. In this article, we addressed this need by delivering a comprehensive automated pipeline, iLearnPlus, which provides ‘one-stop’ services for machine learning-based predictions from DNA, RNA and protein sequence data. iLearnPlus includes four built-in modules, calculates a wide variety of feature sets and provides 21 machine-learning algorithms, including seven popular and modern deep-learning methods. Our platform offers a diverse collection of strategies to conceptualize, design, test, comparatively assess and deploy predictive models. iLearnPlus caters to a broad range of users, including biologists with limited bioinformatics expertise who can benefit from the easy-to-use web server. We provide two case studies using iLearnPlus: prediction of lncRNAs and of protein crotonylation sites. The first highlights the fact that our platform supports rapid development of accurate models, while the second demonstrates a sophisticated process that performs feature and classifier selection to maximize the predictive performance of the constructed model. We conclude that iLearnPlus is an effective tool for the design, testing and deployment of machine-learning pipelines for the analysis and prediction of the rapidly increasing volume of sequence data, for both biologists and bioinformaticians.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
National Health and Medical Research Council of Australia (NHMRC) [APP1127948, APP1144652]; Young Scientists Fund of the National Natural Science Foundation of China [31701142]; National Natural Science Foundation of China [31971846]; Australian Research Council [LP110200333, DP120104460]; National Institute of Allergy and Infectious Diseases of the National Institutes of Health [R01 AI111965]; Major Inter-Disciplinary Research project awarded by Monash University, and the Collaborative Research Program of Institute for Chemical Research, Kyoto University; Fundamental Research Funds for the Central Universities [3132020170, 3132019323]; National Natural Science Foundation of Liaoning Province [20180550307]; C.L. is supported by an NHMRC CJ Martin Early Career Research Fellowship [1143366]; L.K. is supported in part by the Robert J. Mattauch Endowment funds. Funding for open access charge: Grants in support of this submission.
Conflict of interest statement. None declared.
REFERENCES
Author notes
The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.