Deep4mC: systematic assessment and computational prediction for DNA N4-methylcytosine sites by deep learning

Xu, Haodong; Jia, Peilin; Zhao, Zhongming

doi:10.1093/bib/bbaa099

Abstract

DNA N4-methylcytosine (4mC) modification represents a novel epigenetic regulation. It is involved in various cellular processes, including DNA replication, cell cycle and gene expression, among others. In addition to experimental identification of 4mC sites, in silico prediction of 4mC sites in the genome has emerged as an alternative and promising approach. In this study, we first reviewed the current progress in the computational prediction of 4mC sites and systematically evaluated the predictive capacity of eight conventional machine learning algorithms as well as 12 feature types commonly used in previous studies in six species. Using a representative benchmark dataset, we investigated the contribution of feature selection and the stacking approach to model construction, and found that feature optimization and proper reinforcement learning could improve the performance. We next recollected newly added 4mC sites in the six species’ genomes and developed a novel deep learning-based 4mC site predictor, namely Deep4mC. Deep4mC applies convolutional neural networks with four representative features. For species with small numbers of samples, we extended our deep learning framework with a bootstrapping method. Our evaluation indicated that Deep4mC could obtain high accuracy and robust performance, with average area under curve (AUC) values greater than 0.9 in all species (range: 0.9005–0.9722). Compared with previous tools, Deep4mC improved AUC values by 10.14 to 46.21% in these six species. A user-friendly web server (https://bioinfo.uth.edu/Deep4mC) was built for predicting putative 4mC sites in a genome.

epigenetic modification, DNA N4-methylcytosine, 4mC, methyladenine, deep learning, feature selection

Introduction

The rapid development of genome sequencing technology has made it possible to examine the functional impact of DNA chemical modifications at high resolution. Catalyzed by DNA methyltransferases, methyl-base modifications, such as N4-methylcytosine (4mC), 5-methylcytosine (5mC) and N6-methyladenine (6mA), account for a large portion of DNA modification in the genomes of diverse species [1–4]. These epigenetic modifications greatly expand the diversity of genomic organization and regulation in various biological processes. In eukaryotic genomes, DNA 5mC modification has been widely explored, demonstrating that the dynamic regulation of 5mC plays critical roles in regulating chromatin architecture and gene expression [5]. Recent studies have also shed light on the distribution and regulatory function of extensive 6mA modification in eukaryotic genomes, although 6mA was previously considered to be predominantly a prokaryotic modification [6]. In addition to DNA 5mC and 6mA modifications, 4mC has been reported as a potent epigenetic modification that protects self-DNA from restriction enzyme-mediated degradation [7]. By adding a methyl group to the 4th position of cytosine in DNA, 4mC modification plays an important role in the regulation of DNA replication, cell cycle and gene expression levels and participates in genome stabilization, recombination and evolution [8, 9]. So far, the identification of 4mC modification and the understanding of its roles remain limited, especially because little experimental data have been generated. Therefore, there is a strong demand for approaches that can effectively identify or predict 4mC sites in a genome.

Several experimental approaches have been developed for identification of 4mC sites. In 2010, single-molecule real-time sequencing (SMRT), a mainstream platform of third-generation sequencing, emerged as a popular method with the advantages of long-read sequencing and the capability of detecting DNA modifications. SMRT has since been widely applied to detect 4mC sites from unknown DNA sequences in several bacterial genomes [10]. Later, Yu et al. introduced a next-generation sequencing method, called 4mC-Tet-assisted bisulfite-sequencing, to rapidly and cost-efficiently detect the genome-wide 4mC loci in bacterial species [1]. Rathi et al. [11] applied the transcription activator-like effectors method to reveal 4mC sites in DNA sequences. With the increasing number of experimental 4mC studies, the collection and integration of known 4mC sites have gradually become an important research topic for sharing and mining these data. Ye et al. [12] developed the MethSMRT database, the first resource hosting DNA 6mA and 4mC methylomes based on the publicly available SMRT sequencing datasets in 156 species. Later, the DNAmod database was built to annotate the chemical properties and structures of all curated modified DNA bases, including 4mC, which enables researchers to check previous studies and the identification methodology [13]. Recently, Liu et al. [14] released the MDR database to curate DNA 6mA and 4mC modifications for the Rosaceae family using SMRT sequencing datasets.

These high-quality datasets, although still limited, provide an opportunity for identification of potential 4mC sites in a DNA sequence by computational approaches, similar to the prediction of CpG methylation and 6mA modification [15]. Here, we first summarized the current progress in computational prediction of 4mC sites and, based on that, developed a novel deep learning-based species-specific 4mC site predictor, namely Deep4mC. In addition to the 4mC dataset that is commonly used in previous studies, which contains 7163 experimentally identified 4mC sites (benchmark data 1) from six organisms, we also recollected newly added 4mC sites in the six species’ genomes from the MethSMRT database [12] and compiled a nonredundant dataset, including 285 851 experimentally identified 4mC sites (benchmark data 2) from Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Geobacter pickeringii and Geoalkalibacter subterraneus. We encoded a total of 12 features, including nine sequence-derived features: accumulated nucleotide frequency (ANF), binary, composition of K-spaced nucleic acid pairs (CKSNAP), dinucleotide composition (DNC), enhanced nucleic acid composition (ENAC), Kmer, nucleic acid composition (NAC), reverse complement Kmer (RCKmer) and trinucleotide composition (TNC), and three physicochemical features: electron–ion interaction pseudopotentials (EIIPs) of trinucleotide, nucleotide chemical property (NCP) and pseudo dinucleotide composition (PseDNC). Eight conventional classification algorithms were assessed by 10-fold cross-validation (CV) in a pairwise way using benchmark data 1. Four features (binary, EIIP, ENAC and NCP) performed well across all six species, with average area under curve (AUC) values greater than 0.79. Among the classification algorithms, the support vector machine (SVM) was the most effective classifier, while other classifiers, i.e. logistic regression (LR), stochastic gradient descent (SGD) and random forests (RF), also held good classification ability with AUC values greater than 0.75. We also explored the contribution of two strategies to the final model construction, i.e. feature selection (recursive feature elimination) and model ensemble. Our results demonstrated that the optimal feature group could reduce noisy features and improve performance, while proper reinforcement learning was conducive to model construction. In addition to the above review and evaluation, a novel 4mC site predictor, namely Deep4mC, was built based on deep convolutional neural networks (CNNs) using four representative features. Specifically, guided by the feature evaluation results, Deep4mC uses the binary, ENAC, EIIP and NCP features. For species with small numbers of samples (sequences for 4mC sites), we extended our deep learning framework with a bootstrapping method to make full use of the large number of negative samples (Ns) and avoid false positives. The performance of Deep4mC was critically evaluated by multiple CVs and compared with the existing methods. The average AUC values of multiple CVs were all greater than 0.9 in these species, with the minimum (AUC = 0.9005) in A. thaliana and the maximum (AUC = 0.9722) in E. coli. Using the independent dataset, we found that Deep4mC improved AUC values by 10.14 to 46.21% when compared to previous tools in these six species. This evaluation demonstrated promising accuracy and robustness of Deep4mC. We developed a user-friendly, public online server for Deep4mC (https://bioinfo.uth.edu/Deep4mC).

Materials and methods

Our work has three components (Figure 1): (i) compile two 4mC site benchmark datasets from six organisms; (ii) transform DNA sequences in the benchmark data into mathematical vectors by using 12 types of sequence and physicochemical property features (see below), followed by assessment of these features, multiple machine learning algorithms and two approaches for model construction; and (iii) develop Deep4mC using deep CNNs with an attention mechanism. The method is provided as an online web service available at https://bioinfo.uth.edu/Deep4mC.

Figure 1

The workflow. It includes benchmark dataset processing, feature and approach evaluation and the development of Deep4mC and web server.

Data collection and processing

To facilitate fair comparison and develop a powerful prediction model, two benchmark datasets for 4mC sites were compiled from six species: A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii and G. subterraneus. Benchmark data 1, initially processed by Chen et al. from the MethSMRT database [12, 16], contained 7163 experimentally identified 4mC sites in six species: 1978 in A. thaliana, 1554 in C. elegans, 1769 in D. melanogaster, 388 in E. coli, 569 in G. pickeringii and 905 in G. subterraneus (Supplementary Table 1, see Supplementary Data available online at https://academic.oup.com/bib). Of note, this dataset has been widely used in the currently available prediction tools and has been preprocessed to remove sequences with high similarity. Benchmark data 2 were newly collected for the same six species from the MethSMRT database. Each 4mC site is represented by a 41 base pair (bp) DNA segment, with 20 bp flanking regions upstream and downstream of the 4mC site, respectively. We performed two strict filtering procedures to ensure the reliability of the benchmark datasets, following Chen et al. [16]. First, all 4mC sites were required to have a modification confidence score (QV) of 30 or higher based on the Methylome Analysis Technical Note [17]. Second, we excluded those 4mC sites that had a sequence similarity of more than 70% to others. The similarity score was calculated using the CD-HIT software [18]. After these quality check steps, we obtained a nonredundant, experimentally identified dataset with 285 851 4mC sites (Supplementary Table 2, see Supplementary Data available online at https://academic.oup.com/bib). The numbers of 4mC sites in the positive datasets were 111 927, 60 662, 90 333, 2067, 5727 and 15 135 in A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii and G. subterraneus, respectively (Supplementary Table 2, see Supplementary Data available online at https://academic.oup.com/bib).
For the three species with more than 50 000 4mC sites (A. thaliana, C. elegans and D. melanogaster), we randomly selected the same number of Ns as positive samples (Ps) to construct balanced datasets. Of note, Ns are sites that were not detected as 4mC by the SMRT sequencing technology. For the remaining species, we randomly selected five times as many Ns as Ps. The compiled dataset was divided into a training dataset (90% of the total sample) and an independent dataset (10% of the total sample) in each species. These collected and processed benchmark datasets can be downloaded at https://bioinfo.uth.edu/Deep4mC/Download.php.

Feature encoding scheme

We designed and tested a total of 12 types of features based on the sequence and physicochemical properties.

Accumulated nucleotide frequency

The accumulated nucleotide frequency (ANF) feature encoding system [17] represents the nucleotide density and the distribution of each nucleotide in a DNA segment. We first defined the DNA 4mC sequence, i.e. DCS (m, n), to represent each 4mC segment with m nucleotides upstream and n nucleotides downstream of a cytosine. In our case, each 4mC segment is represented as a DCS (20, 20). According to ANF, we calculated a density for each position in DCS (20, 20) following the formula below:

$$ {d}_l=\frac{1}{l}\sum_{j=1}^lf\left({n}_j\right),f\left({n}_j\right)=\left\{\begin{array}{@{}l}1,\ \mathrm{if}\ {n}_j=q\\{}0,\ \mathrm{other}\end{array}\right.,l=1,\dots, 41, $$

where |${n}_j$| represents the nucleotide at the j-th position and |$q\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\}$| is the nucleotide at the l-th position. Taking the sequence ‘CACAGTCG’ as an example, when l = 3, the nucleotide at the l-th position is C and the density of this position is calculated as |${d}_3=\frac{1}{3}\times \sum_{j=1}^3f\left({n}_j\right)=\frac{1}{3}\times \left[f(\mathrm{C})+f(\mathrm{A})+f(\mathrm{C})\right]=\frac{1}{3}\times \left[1+0+1\right]=0.667$|. The densities of all 41 positions can be calculated similarly. The number of features from the ANF representation of DCS (20, 20) is thus 41.
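The density calculation can be illustrated with a minimal Python sketch. The function name `anf_encode` is hypothetical and is not taken from the Deep4mC code; the sketch simply keeps a running count of each nucleotide and divides it by the current position.

```python
def anf_encode(segment):
    """Accumulated nucleotide frequency.

    For each 1-based position l, return the number of times the nucleotide
    found at position l has occurred in positions 1..l, divided by l.
    """
    counts = {}
    densities = []
    for position, nucleotide in enumerate(segment, start=1):
        counts[nucleotide] = counts.get(nucleotide, 0) + 1
        densities.append(counts[nucleotide] / position)
    return densities


# Worked example from the text: the third position of 'CACAGTCG' gives 0.667.
print([round(d, 3) for d in anf_encode("CACAGTCG")])
# [1.0, 0.5, 0.667, 0.5, 0.2, 0.167, 0.429, 0.25]
```

Applied to a 41-nucleotide DCS (20, 20), the function returns 41 values, one per position.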

Binary

The binary encoding [19, 20] denotes the position-specific composition of the nucleotides in a DNA segment, such as a DCS (20, 20). Each nucleotide is encoded by a four-digit binary vector. Specifically, A is encoded by (1, 0, 0, 0), C by (0, 1, 0, 0), G by (0, 0, 1, 0) and T by (0, 0, 0, 1). For a DCS (20, 20), a digital vector of length 41 × 4 = 164 is encoded as:

$$ B=\left({b}_{n1},{b}_{n2},{b}_{n3},{b}_{n4},\dots, {b}_{n41}\ \right),b\in \left\{\begin{array}{c}\mathrm{A}:1,0,0,0\\{}\mathrm{C}:0,1,0,0\\{}\mathrm{G}:0,0,1,0\\{}\mathrm{T}:0,0,0,1\end{array}\right.,n\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\} $$
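As a minimal sketch of this encoding (the names `ONE_HOT` and `binary_encode` are illustrative, not from the original implementation), each nucleotide is looked up in a table of four-digit vectors and the vectors are concatenated in sequence order:

```python
# Four-digit binary code for each nucleotide, as listed in the text.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def binary_encode(segment):
    """Concatenate the four-digit code of every nucleotide in the segment."""
    vector = []
    for nucleotide in segment:
        vector.extend(ONE_HOT[nucleotide])
    return vector
```

A 41-nucleotide segment therefore yields 41 × 4 = 164 binary values.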

Composition of K-spaced nucleic acid pairs

The composition of K-spaced nucleic acid pairs (CKSNAP) feature encoding [21, 22] represents the composition of nucleotide pairs that are K steps away from each other in a DCS (20, 20) segment. In our case, we used K = 0, 1, 2, 3, 4 and 5. Specifically, we calculated the frequency of a nucleotide pair with the two nucleotides at positions i and i + K + 1, respectively, where i = 1, …, (l − K − 1) and l = 41. For example, the nucleotide pair CG with K = 4 represents the following case: |$\mathrm{A}\underset{\_}{\mathrm{C}}\underset{\mathrm{K}}{\underbrace{\mathrm{G}\mathrm{TAC}}}\underset{\_}{\mathrm{G}}\mathrm{TAC}\mathrm{GT}$|, where C is located at position i = 2 and G is at the 7th position. Because there are a total of 16 possible nucleotide pairs in a DNA sequence regardless of K (i.e. ‘AA’, ‘AC’, ‘AG’, ‘AT’, ‘CA’, ‘CC’, ‘CG’, ‘CT’, ‘GA’, ‘GC’, ‘GG’, ‘GT’, ‘TA’, ‘TC’, ‘TG’ and ‘TT’), we generated 16 features for each value of K = 0, 1, 2, 3, 4 and 5 and a total of 16 × 6 = 96 features from the CKSNAP coding. For example, a feature vector is calculated as below for K = 0:

$$\begin{eqnarray*} &&\hskip-6pt{\left(\frac{N_{\mathrm{AA}}}{N_{\mathrm{Total}}},\frac{N_{\mathrm{AC}}}{N_{\mathrm{Total}}},\frac{N_{\mathrm{AG}}}{N_{\mathrm{Total}}},\frac{N_{\mathrm{AT}}}{N_{\mathrm{Total}}}, \dots, \frac{N_{\mathrm{TT}}}{N_{\mathrm{Total}}}\right)}_{K=0},\nonumber\\ &&\hskip-6pt\quad{N}_{\mathrm{Total}}={N}_{\mathrm{AA}}+{N}_{\mathrm{AC}}+\dots +{N}_{\mathrm{TT}} \end{eqnarray*}$$

This coding system reflects the short-range interactions of nucleic acids within a DNA sequence segment.
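The pair counting can be sketched in Python as follows. This is an illustrative reading of the definition above, not the authors' code: the name `cksnap_encode` and the alphabetical ordering of the 16 pairs are assumptions.

```python
from itertools import product

# The 16 possible nucleotide pairs, in alphabetical order (an assumed ordering).
PAIRS = ["".join(p) for p in product("ACGT", repeat=2)]


def cksnap_encode(segment, k_values=(0, 1, 2, 3, 4, 5)):
    """Frequencies of nucleotide pairs separated by K positions, for each K."""
    features = []
    for k in k_values:
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        # 0-based i pairs position i with position i + k + 1.
        for i in range(len(segment) - k - 1):
            pair = segment[i] + segment[i + k + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features
```

With the six default values of K, the function returns 16 × 6 = 96 features, and the 16 values belonging to one K sum to 1 whenever at least one pair is found.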

EIIPs of trinucleotide

Nair et al. [23] calculated the energy of delocalized electrons in nucleotides as the EIIP. The four EIIP values were set as A: 0.1260, C: 0.1340, G: 0.0806 and T: 0.1335. The EIIP encoding [24] directly uses the EIIP value to represent each nucleotide in the DNA sequence. Therefore, each DCS (20, 20) was characterized by a 41-dimensional digital vector as:

$$ D=\left({E}_{n1},{E}_{n2},{E}_{n3},{E}_{n4},\dots, {E}_{n41}\ \right),E\in \left\{\begin{array}{c}\mathrm{A}:0.1260\\{}\mathrm{C}:0.1340\\{}\mathrm{G}:0.0806\\{}\mathrm{T}:0.1335\end{array}\right.,n\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\} $$
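A minimal sketch of the per-nucleotide EIIP lookup, using the four values quoted above (the names `EIIP` and `eiip_encode` are illustrative):

```python
# EIIP value of each nucleotide, as given in the text.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_encode(segment):
    """Replace every nucleotide by its EIIP value."""
    return [EIIP[nucleotide] for nucleotide in segment]
```

For a 41-nucleotide segment the result is a 41-dimensional vector.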

Nucleic acid composition

As one of the commonly used methods to represent DNA sequences, the nucleic acid composition (NAC) encoding [25, 26] reflects the nucleotide frequencies of the sequence fragments surrounding the 4mC site. In this study, NAC feature encoding represents the frequency of each type of nucleotide in a DCS (20, 20). The frequencies of all four natural nucleotides (‘A’, ‘C’, ‘G’ and ‘T’) can be calculated as:

$$ f(i)=\frac{N_{(i)}}{N},i\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\} $$

where N_(i) denotes the number of nucleotides of type i and N represents the length of a DCS (20, 20).

Di-nucleotide composition

Di-nucleotide composition (DNC) feature encoding [27, 28] represents the composition of consecutive dinucleotides in the DCS (20, 20). There are 16 descriptors in DNC feature encoding, which can be defined as:

$$ D\left(i,j\right)=\frac{N_{(ij)}}{N-1},i,j\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\} $$

where N_ij denotes the number of dinucleotides formed by nucleotide types i and j.

Tri-nucleotide composition

Tri-nucleotide composition (TNC) feature encoding [29, 30] represents the composition of consecutive trinucleotides in the DCS (20, 20). There are 64 descriptors in TNC feature encoding (‘AAA’, ‘AAC’, ‘AAG’, ‘AAT’, …, ‘TTT’), which can be defined as:

$$ D\left(i,j,k\right)=\frac{N_{(ijk)}}{N-2},i,j,k\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\}, $$

where N_ijk denotes the number of trinucleotides formed by nucleotide types i, j and k.

Enhanced nucleic acid composition

The enhanced nucleic acid composition (ENAC) encoding [31–33] calculates the local NAC within a sequence window of fixed length (the window size was set to five in this study) that slides continuously from the 5′ to the 3′ terminus of each nucleotide sequence. It is usually applied to encode nucleotide sequences of equal length. The dimension of the ENAC encoding is determined by two parameters, the sequence length and the sliding window size, and is calculated as (sequence length − window size + 1) × 4. Thus, a DCS (20, 20) corresponds to 41 − 5 + 1 = 37 sliding windows and the vector dimension of its ENAC encoding is 4 × 37 = 148. The ENAC encoding can be defined as:

$$ E=\left({b}_1,{b}_2,\dots, {b}_n\ \right), $$

$$ b(i)=\frac{N_{(i)}}{N},i\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\}, $$

where N denotes the window size, N_(i) denotes the number of nucleotides of type i within a window and n equals sequence length − window size + 1.
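The sliding-window calculation can be sketched as follows. The function name `enac_encode` is illustrative; the default window size of five follows the setting stated in the text.

```python
def enac_encode(segment, window=5):
    """Nucleotide frequencies (A, C, G, T) inside each sliding window."""
    features = []
    for start in range(len(segment) - window + 1):
        chunk = segment[start:start + window]
        features.extend(chunk.count(nucleotide) / window for nucleotide in "ACGT")
    return features
```

For a 41-nucleotide segment and a window of five, this gives 37 windows × 4 frequencies = 148 features.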

Kmer

The Kmer encoding [34, 35] calculates the occurrence frequencies of k neighboring nucleotides in the DCS (20, 20) and has been commonly used in enhancer identification and regulatory sequence prediction (2). The Kmer (k = 4) descriptor can be defined as:

$$ K(i)=\frac{N_{(i)}}{N},i\in \left\{\mathrm{AAAA},\mathrm{AAAC},\mathrm{AAAG},\mathrm{AAAT},\dots, \mathrm{TTTT}\right\}, $$

where N_i denotes the number of occurrences of the Kmer of type i and N represents the length of the DCS (20, 20).
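The composition-type encodings (NAC, DNC, TNC and Kmer) all count overlapping words of a fixed length, so one hedged sketch can cover them. The function `kmer_frequencies` and its `denominator` parameter are hypothetical. By default the count is divided by the number of windows, which equals N for k = 1, N − 1 for k = 2 and N − 2 for k = 3, matching the NAC, DNC and TNC formulas; the Kmer formula as written divides by the segment length N, which can be reproduced by passing `denominator=len(segment)`.

```python
from itertools import product


def kmer_frequencies(segment, k, denominator=None):
    """Frequency of every possible k-mer over overlapping windows."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(segment) - k + 1):
        window = segment[i:i + k]
        if window in counts:
            counts[window] += 1
    if denominator is None:
        # Number of overlapping windows: N, N - 1, N - 2 for k = 1, 2, 3.
        denominator = len(segment) - k + 1
    return {kmer: counts[kmer] / denominator for kmer in kmers}
```

With k = 4 the dictionary has 4^4 = 256 entries, one per descriptor.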

Reverse complement Kmer

The reverse complement Kmer (RCKmer) encoding [36] is a variant of the Kmer descriptor in which a Kmer and its reverse complement are counted as the same type when calculating occurrence frequencies in the DCS (20, 20). For example, there are 16 types of 2-mers (i.e. ‘AA’, ‘AC’, ‘AG’, ‘AT’, ‘CA’, ‘CC’, ‘CG’, ‘CT’, ‘GA’, ‘GC’, ‘GG’, ‘GT’, ‘TA’, ‘TC’, ‘TG’ and ‘TT’) in a DNA sequence. Among them, ‘TT’ is the reverse complement of ‘AA’. Thus, there are only 10 types of 2-mers in the RCKmer approach (i.e. ‘AA’, ‘AC’, ‘AG’, ‘AT’, ‘CA’, ‘CC’, ‘CG’, ‘GA’, ‘GC’ and ‘TA’) after removing the reverse complementary Kmers.
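One way to sketch this merging in Python is to map every Kmer to a canonical representative, here the alphabetically smaller of the Kmer and its reverse complement. The function names and the normalization by the number of windows are assumptions made for illustration; the text does not specify the normalization.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Alphabetically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def rckmer_frequencies(segment, k=2):
    """Frequencies of k-mers after merging reverse-complementary pairs."""
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(segment) - k + 1):
        window = segment[i:i + k]
        if set(window) <= set("ACGT"):
            counts[canonical(window)] += 1
            total += 1
    return {key: (counts[key] / total if total else 0.0) for key in keys}
```

For k = 2 the keys are exactly the 10 types listed in the text.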

Nucleotide chemical property

There are four different kinds of nucleotides in a DNA sequence and each nucleotide has a different chemical structure and binding property. All types of nucleotides can be classified according to three chemical properties [16, 37] as follows:

$$ \mathrm{Ring}\ \mathrm{Structure}=\left\{\begin{array}{c}\mathrm{Purine},\mathrm{A},\mathrm{G}\ \\{}\mathrm{Pyrimidine},\mathrm{C},\mathrm{T}\ \end{array}\right. $$

$$ \mathrm{Functional}\ \mathrm{Group}=\left\{\begin{array}{c}\mathrm{Amino},\mathrm{A},\mathrm{C}\ \\{}\mathrm{Keto},\mathrm{G},\mathrm{T}\ \end{array}\right. $$

$$ \mathrm{Hydrogen}\ \mathrm{Bond}=\left\{\begin{array}{c}\mathrm{Weak},\mathrm{A},\mathrm{T}\ \\{}\mathrm{Strong},\mathrm{C},\mathrm{G}\ \end{array}\right. $$

Incorporating these chemical features, the following equation is used to denote the i-th nucleotide in a DNA sequence:

$$\begin{eqnarray*} &&\hskip-6pt {R}_i=\left\{\begin{array}{c}1,\mathrm{if}\ {N}_i\in \left\{\mathrm{A},\mathrm{G}\right\}\\{}0,\mathrm{if}\ {N}_i\in \left\{\mathrm{C},\mathrm{T}\right\}\end{array}\right.\ {F}_i=\left\{\begin{array}{c}1,\mathrm{if}\ {N}_i\in \left\{\mathrm{A},\mathrm{C}\right\}\\{}0,\mathrm{if}\ {N}_i\in \left\{\mathrm{G},\mathrm{T}\right\}\end{array}\right.\ \nonumber\\ &&\hskip-6pt {H}_i=\left\{\begin{array}{c}1,\mathrm{if}\ {N}_i\in \left\{\mathrm{A},\mathrm{T}\right\}\\{}0,\mathrm{if}\ {N}_i\in \left\{\mathrm{C},\mathrm{G}\right\}\end{array}\right. \end{eqnarray*}$$

Based on chemical properties, ‘A’ can be encoded to (1, 1, 1), ‘C’ to (0, 1, 0), ‘G’ to (1, 0, 0) and ‘T’ to (0, 0, 1), respectively.
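The three indicator values can be collected in a lookup table, as in this minimal sketch (the names `NCP` and `ncp_encode` are illustrative):

```python
# (ring structure, functional group, hydrogen bond) indicators per nucleotide:
# purine = 1, amino = 1, weak hydrogen bond = 1.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def ncp_encode(segment):
    """Concatenate the three chemical-property indicators of each nucleotide."""
    return [bit for nucleotide in segment for bit in NCP[nucleotide]]
```

A 41-nucleotide segment is thus encoded as 41 × 3 = 123 binary values.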

Pseudo dinucleotide composition

Pseudo dinucleotide composition (PseDNC) feature encoding [17, 38] incorporates both local and global sequence-order information into the feature vector of the DCS (20, 20). The PseDNC encoding is defined as follows:

$$ P={\left({p}_1,{p}_2,\dots, {p}_{16},{p}_{16+1},\dots, {p}_{16+\lambda }\ \right)}^T $$

$$ {p}_k=\left\{\begin{array}{c}\frac{f_k}{\sum_{i=1}^{16}{f}_i+w\sum_{j=1}^{\lambda }{\theta}_j},\left(1\le k\le 16\right)\\{}\frac{w{\theta}_{k-16}}{\sum_{i=1}^{16}{f}_i+w\sum_{j=1}^{\lambda }{\theta}_j},\left(17\le k\le 16+\lambda \right)\end{array}\right., $$

where f_k (k = 1, 2, …, 16) is the normalized occurrence frequency of the k-th dinucleotide in the DCS (20, 20), λ represents the highest counted rank of the correlation along the DCS (20, 20), w (0–1) is the weight factor, L is the length of the DCS (20, 20) and θ_j (j = 1, 2, …, λ) is the j-tier correlation factor, which is defined as:

$$ {\theta}_1=\frac{1}{L-2}{\sum}_{i=1}^{L-2}\varTheta \left({R}_i{R}_{i+1},{R}_{i+1}{R}_{i+2}\ \right), $$

$$ \dots $$

$$ {\theta}_{\lambda }=\frac{1}{L-1-\lambda}\sum_{i=1}^{L-1-\lambda}\varTheta \left({R}_i{R}_{i+1},{R}_{i+\lambda }{R}_{i+\lambda +1}\ \right), $$

where the correlation function is defined as:

$$ \varTheta \left({R}_i{R}_{i+1},{R}_j{R}_{j+1}\ \right)=\frac{1}{\mu}\sum_{u=1}^{\mu}{\left({C}_u\left({R}_i{R}_{i+1}\right)-{C}_u\left({R}_j{R}_{j+1}\right)\right)}^2, $$

where μ denotes the number of physicochemical indexes. Six physicochemical indexes, including rise, roll, shift, slide, tilt and twist, were considered in this work. C_u(R_iR_{i+1}) is the numerical value of the u-th physicochemical index of the dinucleotide R_iR_{i+1} at position i and C_u(R_jR_{j+1}) denotes the corresponding value of the dinucleotide R_jR_{j+1} at position j.
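The PseDNC definition can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the table `props` (mapping each dinucleotide to its physicochemical index values) must be supplied by the caller because the six index values are not reproduced here, and the defaults for `lam` (λ) and `w` are placeholders since this section does not state the values used.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def correlation(d1, d2, props):
    """Mean squared difference between two dinucleotides over all indexes."""
    v1, v2 = props[d1], props[d2]
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) / len(v1)


def psednc_encode(segment, props, lam=2, w=0.5):
    """Return the (16 + lam)-dimensional PseDNC vector of a segment.

    Requires len(segment) > lam + 1 so that every tier has at least one term.
    """
    length = len(segment)
    dinucs = [segment[i:i + 2] for i in range(length - 1)]
    freqs = [dinucs.count(d) / (length - 1) for d in DINUCLEOTIDES]
    thetas = []
    for tier in range(1, lam + 1):
        n_terms = length - 1 - tier
        total = sum(
            correlation(dinucs[i], dinucs[i + tier], props)
            for i in range(n_terms)
        )
        thetas.append(total / n_terms)
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

Because the 16 frequency terms and the λ correlation terms share one denominator, the components of the returned vector sum to 1.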

Two-step feature selection strategy via recursive feature elimination

Feature selection is a critical step to eliminate noisy features and improve performance. In this study, we performed a two-step feature selection procedure to identify the most prominent feature vectors [39, 40]. In the first step, we conducted statistical tests (t-test for quantitative features and chi-square test for categorical features) to identify features that are associated with the target labels. This procedure generated an index of feature ranking indicating their classification importance. In the second step, recursive feature elimination was adopted to determine the optimal feature representation by recursively eliminating a small number of the weakest features per loop [41]. More specifically, to determine the optimal group, features from the ranked index were eliminated in batches (batch size = 10) from lower rank to higher rank, so that the features with the least importance were gradually pruned. After each elimination, the remaining features were used to rebuild the SVM-based prediction model, which was evaluated by 10-fold CV. Finally, the feature subset with the best performance, measured by AUC value, was selected as the optimal feature subset to build the prediction model.
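The batch elimination loop of the second step can be sketched as follows. The function `batch_feature_elimination` and the callback `score_fn` are hypothetical: in the setting described above, `score_fn` would train an SVM on the given feature subset and return its 10-fold CV AUC, but any scoring function can be plugged in.

```python
def batch_feature_elimination(ranked_features, score_fn, batch_size=10):
    """Drop the lowest-ranked features in batches and keep the best subset.

    ranked_features: feature identifiers ordered from most to least important.
    score_fn: callable taking a list of features and returning a score
              (higher is better), e.g. a cross-validated AUC.
    """
    remaining = list(ranked_features)
    best_subset, best_score = list(remaining), score_fn(remaining)
    history = [(len(remaining), best_score)]
    while len(remaining) > batch_size:
        remaining = remaining[:-batch_size]  # prune the weakest batch
        score = score_fn(remaining)
        history.append((len(remaining), score))
        if score > best_score:
            best_subset, best_score = list(remaining), score
    return best_subset, best_score, history
```

The returned `history` records the score after each round, which corresponds to plotting performance against the round of feature selection.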

Figure 2

Pairwise evaluation of 12 features on eight machine learning algorithms using 10-fold CV in six species: (A) A. thaliana, (B) C. elegans, (C) D. melanogaster, (D) E. coli, (E) G. pickeringii and (F) G. subterraneus.

Development of the stacking framework

The stacking framework starts with a comprehensive assessment of eight classical machine learning algorithms, followed by an ensemble approach to integrate the predictions from each classifier. The eight classifiers were AdaBoost (AB), decision trees (DT), gradient boosting (GB), K-nearest neighbors (KNN), LR, RF, SGD and SVM. We trained each classification algorithm using the 12 types of features and calculated the AUC values based on 10-fold CV to assess the performance. This process was repeated 10 times to ensure the reliability of the results. Moreover, hyperparameter optimization was performed using RandomizedSearchCV of scikit-learn v0.21.3 (https://scikit-learn.org/) for each classification algorithm to obtain the best model. As a result, we obtained the optimal feature subset for each species for each of the tested algorithms. In addition, we obtained prediction models for six of the tested algorithms, including AB, GB, LR, RF, SGD and SVM, while two algorithms, KNN and DT, were dropped from further analyses due to relatively poor performance.

Figure 3

AUC values of 10-fold CV with decrease of feature dimension in (A) A. thaliana, (B) C. elegans, (C) D. melanogaster, (D) E. coli, (E) G. pickeringii and (F) G. subterraneus. X-axis denotes the round of feature selection. (G, H) t-distributed stochastic neighbor embedding visualization of G. pickeringii benchmark dataset in a two-dimensional feature space before and after feature optimization.

In the 2nd part, we implemented a stacking framework [42] to improve the model construction. The output from the six algorithms, i.e. the predicted probabilities, was taken as the input to these machine learning algorithms with five rounds of learning. The model with the best performance (AUC value) was selected as the final prediction model. Three measurements of sensitivity (Sn), specificity (Sp) and Matthews correlation coefficient (MCC) were calculated to evaluate the prediction performance. The three measurements were defined as shown below:

$$ \mathrm{Sn}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}\ \mathrm{Sp}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$

$$ \mathrm{MCC}=\frac{\left(\mathrm{TP}\times \mathrm{TN}\right)-\left(\mathrm{FN}\times \mathrm{FP}\right)}{\sqrt{\left(\mathrm{TP}+\mathrm{FN}\right)\times \left(\mathrm{TN}+\mathrm{FP}\right)\times \left(\mathrm{TP}+\mathrm{FP}\right)\times \left(\mathrm{TN}+\mathrm{FN}\right)}}. $$
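The three measurements can be computed directly from the confusion-matrix counts, as in this minimal sketch. The function name `classification_metrics` is illustrative; returning an MCC of 0 when the denominator is zero is a common convention adopted here as an assumption, and the sketch assumes that both classes are present so that Sn and Sp are defined.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity and Matthews correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Convention (assumption): MCC = 0 when a marginal count is zero.
    mcc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, mcc
```

For instance, `classification_metrics(90, 80, 20, 10)` gives Sn = 0.9 and Sp = 0.8, with an MCC of about 0.70.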

The 4-, 6-, 8- and 10-fold CVs were performed. The receiver operating characteristic curve (ROC) and AUC values were also calculated in this study.

Figure 4

Change of AUC values by increase of learning time in the stacking model in (A) A. thaliana, (B) C. elegans, (C) D. melanogaster, (D) E. coli, (E) G. pickeringii and (F) G. subterraneus.

Figure 5

Comparison of AUC values between stacking models and single feature, individually trained SVM algorithm in (A) A. thaliana, (B) C. elegans, (C) D. melanogaster, (D) E. coli, (E) G. pickeringii and (F) G. subterraneus.

Figure 6

Performance evaluation and comparison of Deep4mC with Meta-4mCpred. ROC curves and AUC values of Deep4mC upon the training data in (A) A. thaliana, (B) C. elegans, (C) D. melanogaster, (D) E. coli, (E) G. pickeringii and (F) G. subterraneus. (G) ROC curves and AUC values of Deep4mC using the independent dataset in six species. (H) ROC curves and AUC values of Meta-4mCpred using the independent dataset in six species. (I) Comparison of specificity between Deep4mC and Meta-4mCpred upon the independent dataset in six species.

Deep CNNs architecture

Recently, as a cutting-edge technique, deep learning has been widely used in many applications, such as natural language processing [43], image recognition [44] and a number of bioinformatics studies [45, 46]. A deep learning framework is essentially an artificial neural network composed of multiple nonlinear layers. In the field of bioinformatics, deep learning-based approaches, including CNNs, have been successfully applied to the prediction of protein phosphorylation sites [47, 48], RNA modification sites [33] and virus integration sites [49]. CNNs usually contain multiple parts, including the input layer, the convolutional layers, the fully connected layer and the output layer. In this work, we designed our model with an input layer, several convolutional layers, an attention layer and an output layer. We used the rectified linear unit (ReLU) as the activation function:

$$ \mathrm{ReLU}(x)=\left\{\begin{array}{c}x,\mathrm{if}\ x\ge 0\\{}0,\mathrm{if}\ x<0\end{array}\right., $$

where x denotes the weighted sum of a neuron.

Specifically, the input layer accepts the training dataset with labels and representative features, and the convolutional layers were adopted for feature extraction and representation. The attention layer was included to capture the underlying importance of positions in the DCS (20, 20) [49]. The attention layer takes the feature representation of the last convolutional layer as input and calculates a score suggesting whether the neural network should pay more attention to the features at that position. Subsequently, the feature vectors captured by the convolutional layers and the attention scores were integrated and fed to an LR classifier to acquire an output score that indicates the probability of a 4mC site, which can be defined as follows:

$$ \mathrm{prediction}\ (y)=\frac{1}{1+{e}^{-y}}, $$

where y denotes the input of the sigmoid node from the combination of convolutional feature vectors and attention scores. The prediction score ranged between 0 and 1, representing the probability of a DCS (20, 20) to be a 4mC site.
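The two activation functions, together with one generic way of combining position-wise features with attention scores, can be sketched in plain Python. The softmax-weighted pooling in `attention_pool` is a common textbook form of attention and is an assumption made for illustration; the exact attention layer used in Deep4mC is not specified in this section.

```python
import math


def relu(x):
    """Rectified linear unit: x for x >= 0, otherwise 0."""
    return x if x >= 0 else 0.0


def sigmoid(y):
    """Logistic function 1 / (1 + exp(-y)), written to avoid overflow."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    z = math.exp(y)
    return z / (1.0 + z)


def attention_pool(position_features, position_scores):
    """Weight each position's feature vector by the softmax of its score."""
    top = max(position_scores)
    exps = [math.exp(s - top) for s in position_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(position_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, position_features))
        for d in range(dim)
    ]
    return pooled, weights
```

Positions with higher scores contribute more to the pooled vector, and passing a weighted sum of the pooled values through `sigmoid` yields a score between 0 and 1.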

For species with unbalanced Ps and Ns, we extended our architecture with a bootstrapping method [48]. First, equal numbers of Ps and Ns were selected from the benchmark dataset to construct one model on this balanced dataset. To make full use of all Ns in training, the Ns were divided into t bins according to the number of Ps. In this study, five bootstrap iterations (t = 5) were executed to generate one classifier. This procedure was repeated five times to generate five classifiers. When predicting whether a query site is a 4mC site, the average of the outputs of the five classifiers was taken as the final prediction.
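One possible reading of this bootstrapping scheme is sketched below; it is an interpretation, not the authors' code. The helper `balanced_bins` splits the shuffled Ns into t bins and pairs each bin with all Ps, so that with five times as many Ns as Ps each bin is roughly balanced; a classifier would then be trained on these batches in turn. The helper `ensemble_score` averages the outputs of several trained classifiers, represented here as hypothetical callables that map a query to a score.

```python
import random


def balanced_bins(positives, negatives, t=5, seed=0):
    """Split shuffled negatives into t bins, each paired with all positives."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    bins = [shuffled[i::t] for i in range(t)]
    return [(list(positives), neg_bin) for neg_bin in bins]


def ensemble_score(classifiers, query):
    """Average the prediction scores of several classifiers for one query."""
    scores = [clf(query) for clf in classifiers]
    return sum(scores) / len(scores)
```

Repeating the procedure with different seeds produces differently partitioned batches, one set per classifier in the ensemble.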

Results

In silico prediction of DNA 4mC sites: current progress

Due to the rapid development of high-throughput techniques, such as SMRT sequencing, the genome-wide distribution patterns and functional roles of 4mC have been extensively investigated. In addition to experimental approaches, a number of computational tools have also been developed for identifying potential 4mC sites in genomes (Supplementary Table 3, see Supplementary Data available online at https://academic.oup.com/bib). Chen et al. [16] developed the first computational method to predict DNA 4mC sites, called iDNA4mC, which is based on an SVM model. For iDNA4mC, the authors collected a nonredundant benchmark dataset from the genomes of six species and considered the ANF and NCP features. Later, He et al. [50] developed 4mCPred, which used the datasets constructed by Chen et al. [16] and introduced the EIIP and nucleotide physicochemical properties as features, followed by feature selection using the F-score. Subsequently, Wei et al. [51] proposed 4mcPred-SVM for the genome-wide detection of DNA 4mC sites. 4mcPred-SVM uses different types of nucleotide composition features, including Kmer, NAC, DNC, TNC and ANF, with the SVM algorithm to train the computational models. Then, 4mcPred-IFL was released with an iterative feature representation [52]. A number of features were encoded, and the F-score method and SVM algorithm were adopted to determine the optimal features and train the models. Recently, Manavalan et al. [53] introduced a novel predictor called Meta-4mCpred by integrating different sequence-based and physicochemical-based features with an ensemble model. In summary, many efforts have been dedicated to the computational identification of 4mC sites, and multiple sequence and physicochemical features and classification algorithms have been employed. However, it remains unclear which features are the most informative and which machine learning algorithm is the most effective in different species.
Thus, a systematic analysis of feature contributions as well as the predictive ability of different classifiers upon distinct feature(s) is much needed. Such a study will provide a practical guide for future bioinformatics studies of DNA 4mC sites.

Pairwise assessment of 12 features upon multiple machine learning algorithms

To evaluate the contribution of individual features to the prediction of 4mC sites, we first performed a sequence preference analysis for 4mC modification sites in different species. The sequence context of 4mC modification differed strongly among species (Figure S1, see Supplementary Data available online at https://academic.oup.com/bib). Then, 12 features (Supplementary Table 4, see Supplementary Data available online at https://academic.oup.com/bib) were encoded, including nine sequence-based features (ANF, binary, CKSNAP, DNC, ENAC, Kmer, NAC, TNC and RCKmer) and three types of physicochemical property-based features (EIIP, NCP and PseDNC). All features were evaluated pairwise using eight classification algorithms, i.e. SVM, RF, LR, AB, SGD, DT, KNN and GB. The parameters of all classification algorithms were optimized so that the comparison would be as objective as possible. Although the performance of individual features varied across classifiers and species, all AUC values were greater than 0.5, that is, above the level expected from random guessing, indicating that all sequence and physicochemical features were efficient and informative for the prediction of 4mC sites. Moreover, the predictive capability of the eight classification algorithms was also investigated (Figure 2). Based on our results, SVM was the most powerful classifier, with an average AUC value of 0.7662 across the 12 types of features in various species. Other algorithms, i.e. LR, SGD, RF and GB, also performed well, with average AUC values of 0.7582, 0.7578, 0.7570 and 0.7531, respectively, whereas KNN and DT showed the worst performance. Moreover, the AUC values of each individual feature under each classification algorithm were calculated and illustrated based on 10-fold CV (Figure 2, Supplementary Table 5, see Supplementary Data available online at https://academic.oup.com/bib).
The results showed that the NCP, binary, ENAC and EIIP encodings achieved high performance in multiple classification algorithms in A. thaliana, C. elegans, D. melanogaster, E. coli and G. subterraneus, with average AUC values of 0.8445, 0.8421, 0.8035 and 0.7922, respectively. The other features (TNC, CKSNAP, RCKmer, Kmer, PseDNC, DNC and NAC) were less competitive, with average AUC values ranging from 0.6746 (NAC) to 0.7360 (TNC). The ANF encoding had the lowest average AUC value (0.5968) in these five species. For G. pickeringii, all features except ANF performed well in multiple classification algorithms. Taken together, our results revealed that all 12 types of sequence and physicochemical features are informative and that SVM is the most powerful classification algorithm for 4mC site prediction.
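The comparison above rests on the AUC, so it may help to see why a value of 0.5 marks the random-guessing baseline. The following is a minimal sketch in plain Python, not the evaluation code used in the study: it computes the AUC as the fraction of positive/negative pairs in which the positive sample receives the higher score. The function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the paper).
labels = [1, 1, 1, 0, 0, 0]
perfect = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # every positive outranks every negative
constant = [0.5] * 6                       # an uninformative scorer

print(roc_auc(labels, perfect))   # 1.0
print(roc_auc(labels, constant))  # 0.5
```

A scorer that carries no information about the label ties or splits the pairs evenly and lands at 0.5, which is why AUC values above 0.5 are read as evidence that a feature is informative.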

Two-step feature selection strategy contributes to performance improvement

Different features contribute unequally to model performance, which makes feature optimization a necessary step in machine learning. To this end, we performed a two-step feature selection via the recursive feature elimination method for 4mC prediction in each species. For each feature vector, we calculated a chi-square statistic to assess its association with the target labels. All features were then ranked by decreasing chi-square value, and the features at the lower end of the ranking were sequentially pruned. Figure 3 shows the change in AUC values from 10-fold CV as a function of the round of feature selection, with the best performance highlighted by the red dot in each curve. The numbers of optimal features were 313 in A. thaliana, 253 in C. elegans, 313 in D. melanogaster, 63 in E. coli, 153 in G. pickeringii and 233 in G. subterraneus. We found a common trend of feature optimization for all species, in which the performance of the model increased sharply at the beginning, reached its highest point and then gradually decreased. These results suggested that our recursive feature elimination strategy was effective in improving performance. More specifically, we used E. coli as an example to explore the data distribution using t-distributed stochastic neighbor embedding [54]. As shown in Figure 3, the positive (4mC sites) and negative (non-4mC sites) data points could be distinguished much better after feature selection (Figure 3H) than with all features (Figure 3G). After the recursive feature elimination process, the feature space tended to be relatively stable, and the distinction between positive and negative samples in feature space was clearer.
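The ranking and pruning steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the chi-square statistic uses one common formulation for non-negative features (observed versus expected per-class feature sums), and the function names `chi_square_scores` and `pruning_schedule`, the step size and the toy matrix are all hypothetical.

```python
def chi_square_scores(X, y):
    """Chi-square statistic of each non-negative feature column against the labels."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    class_prob = {c: sum(1 for t in y if t == c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        stat = 0.0
        for c in classes:
            observed = sum(row[j] for row, t in zip(X, y) if t == c)
            expected = class_prob[c] * total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
        scores.append(stat)
    return scores


def pruning_schedule(scores, step):
    """Yield feature-index subsets, dropping the `step` lowest-ranked features per round."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = len(ranked)
    while keep > 0:
        yield ranked[:keep]
        keep -= step


# Toy data: feature 0 tracks the label, feature 1 is constant, feature 2 is weak.
X = [[1, 1, 1], [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 1, 0, 0, 0]

scores = chi_square_scores(X, y)
subsets = list(pruning_schedule(scores, step=1))
print(subsets)  # [[0, 2, 1], [0, 2], [0]]
```

In a full workflow, each subset would be scored by cross-validated AUC and the subset at the peak of the curve retained, which corresponds to the red dots in Figure 3.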

The stacking strategy promoted the performance

In the stacking framework, only six machine learning algorithms were considered, i.e. RF, LR, AB, GB, SGD and SVM, because of their high performance on the 12 types of feature encodings, whereas the KNN and DT classifiers were dropped. Based on the optimal feature group, the predicted probabilities output by the six models were taken as a second-level feature vector and fed back into the six classifiers to build the corresponding stacking models over five rounds. The model with the best performance (AUC value) was selected as the final prediction model to construct Deep4mC. Interestingly, compared to the primary model, we found that the stacking model improved the performance in all cases except the SVM classifier in E. coli (Figure 4). For the RF classifier in particular, the stacking model led to an AUC value improvement of 3–7%. For the SVM classifier, the performance improvement was not as large as for RF, but it also contributed to the construction of the final model. It should be noted that in the stacking framework, the performance of the model did not always increase with the number of stacking rounds. We found that most of the stacking models peaked at the second round and then gradually declined, even though we re-optimized the parameters of the model each time the feature input changed. In addition, we compared the stacking models to single features individually trained with the SVM algorithm and observed that the stacking model improved the prediction performance for all benchmark datasets in these species (Figure 5). Taken together, the stacking model improved performance when compared to the best baseline model, indicating that the stacking strategy can combine the strengths of multiple predictors and thereby improve performance.
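The data flow of the stacking rounds can be sketched in a few lines. This is only an illustration under stated assumptions: two toy learners (`fit_centroid` and `fit_mean_threshold`, both hypothetical) stand in for the six classifiers named in the text, and the probabilities are computed on the training samples themselves, whereas a careful implementation would normally use out-of-fold predictions to limit leakage.

```python
import math


def fit_centroid(X, y):
    """Toy learner: score a sample by its relative distance to the class centroids."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c_pos, c_neg = centroid(1), centroid(0)

    def predict(x):
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        return 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)
    return predict


def fit_mean_threshold(X, y):
    """Toy learner: logistic squashing of the feature mean around the class midpoint."""
    means = [sum(x) / len(x) for x in X]
    pos = [m for m, t in zip(means, y) if t == 1]
    neg = [m for m, t in zip(means, y) if t == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    mid = (pos_mean + neg_mean) / 2
    sign = 1.0 if pos_mean >= neg_mean else -1.0

    def predict(x):
        return 1.0 / (1.0 + math.exp(-sign * (sum(x) / len(x) - mid)))
    return predict


def stack_round(learners, X, y):
    """Fit every learner on X; their probabilities become the next round's features."""
    models = [fit(X, y) for fit in learners]
    next_X = [[model(x) for model in models] for x in X]
    return models, next_X


# Toy data (hypothetical): two positives followed by two negatives.
X = [[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]]
y = [1, 1, 0, 0]
learners = [fit_centroid, fit_mean_threshold]

features = X
for round_no in range(2):  # the paper reports five rounds; two are shown here
    models, features = stack_round(learners, features, y)
```

After each round, `features` has one column per learner, so its width equals the number of base classifiers regardless of the size of the original feature space.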

Deep4mC accurately predicted DNA 4mC sites

In addition to the above review and evaluation, we developed a new deep learning-based DNA 4mC site predictor, namely Deep4mC, with an attention mechanism. Four representative features encoded from the sequence profile, namely binary, ENAC, EIIP and NCP, were taken as input. The input was followed by two convolutional layers without pooling, which performed feature extraction and representation. An attention layer was added to connect the last convolutional layer and the output layer. The hyperparameters of Deep4mC were optimized in each species with the tree-structured Parzen estimator approach using the Hyperas package [55]. Specifically, 100 evaluations were executed using separate training and validation sets. The optimal parameters for the different species are shown in Supplementary Table 6 (see Supplementary Data available online at https://academic.oup.com/bib).
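To illustrate what an attention layer between the last convolutional layer and the output layer does, the sketch below pools per-position feature vectors into a single context vector. It is a generic dot-product attention written in plain Python, not the Deep4mC layer itself; the exact form of the authors' attention is not specified in this section, and the `query` vector (a learned parameter in a real network) and the toy features are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(position_features, query):
    """Weight each sequence position by softmax(feature . query), then sum."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in position_features]
    weights = softmax(scores)
    dim = len(position_features[0])
    context = [
        sum(w * feat[k] for w, feat in zip(weights, position_features))
        for k in range(dim)
    ]
    return weights, context


# Three sequence positions with two convolutional channels each (toy values).
position_features = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
query = [1.0, 0.0]  # stands in for a learned scoring vector

weights, context = attention_pool(position_features, query)
```

The position with the largest score (the third one here) dominates the context vector, which is how attention lets the output layer focus on the most informative positions around a candidate site.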

To assess the accuracy and robustness of Deep4mC, we performed 4-, 6-, 8- and 10-fold CVs on the training dataset in each species (Figure 6). We found that Deep4mC achieved high performance: the average AUC values of the multiple CVs were greater than 0.9 in all six species, ranging from 0.9005 (A. thaliana) to 0.9722 (E. coli) (Figure 6). For E. coli, the AUC values of the 4-, 6-, 8- and 10-fold CVs were 0.9736, 0.9728, 0.9697 and 0.9726, respectively. Moreover, the average AUC value of the multiple CVs in C. elegans was 0.9526, and the n-fold CVs also generated similar results in D. melanogaster (0.9468), G. pickeringii (0.9235) and G. subterraneus (0.9285). The different CV results of Deep4mC were highly congruent, indicating that the computational models are accurate and robust.
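As a quick check of the arithmetic, the four E. coli values quoted above can be averaged directly. The snippet only restates numbers from the text; the variable names are hypothetical.

```python
from statistics import mean

# AUC values reported in the text for E. coli under 4-, 6-, 8- and 10-fold CV.
ecoli_auc = {4: 0.9736, 6: 0.9728, 8: 0.9697, 10: 0.9726}

average_auc = mean(ecoli_auc.values())
spread = max(ecoli_auc.values()) - min(ecoli_auc.values())

print(f"average AUC = {average_auc:.4f}")  # average AUC = 0.9722
print(f"max - min   = {spread:.4f}")       # max - min   = 0.0039
```

The average reproduces the 0.9722 reported for E. coli, and the spread of 0.0039 across fold counts gives a concrete sense of the congruence between the CV settings.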

To further demonstrate the advantage of Deep4mC, we compared it with previously reported 4mC site predictors using an independent dataset. Two recent predictors, 4mcPred-IFL [52] and Meta-4mCpred [53], were reported to outperform other tools. However, the web server of 4mcPred-IFL is not accessible. Therefore, we compared Deep4mC only with Meta-4mCpred. We submitted the independent dataset to the online service of Meta-4mCpred and downloaded its prediction results. We then compared the Meta-4mCpred output with that of Deep4mC, both of which were based on the same data. As shown in Figure 6G and H, Deep4mC achieved a large AUC value improvement, from 10.14% (E. coli) to 46.21% (G. subterraneus), when compared to the results of Meta-4mCpred across the six species. More importantly, using the independent dataset in each species, we calculated the Sp (see Materials and methods) to investigate the false positive issue. We found that Deep4mC achieved a higher Sp (indicating a lower false positive rate) than Meta-4mCpred in each species (Figure 6I). Together, these comparisons demonstrated the robustness and superiority of Deep4mC.
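The link between Sp and the false positive rate can be made explicit with a short sketch. It assumes the standard definition Sp = TN / (TN + FP); the paper's own definition is given in its Materials and methods, and the function name and toy predictions below are hypothetical.

```python
def specificity(labels, predictions):
    """Sp = TN / (TN + FP), computed over the truly negative samples only."""
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("no negative samples")
    return tn / (tn + fp)


# Toy example: four true negatives, one of which is wrongly called a 4mC site.
labels = [1, 1, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 1, 0]

sp = specificity(labels, predictions)
false_positive_rate = 1.0 - sp
print(sp, false_positive_rate)  # 0.75 0.25
```

Because the false positive rate equals 1 − Sp under this definition, a higher Sp on the same independent dataset directly implies fewer non-4mC sites reported as 4mC sites.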

Discussion

In this study, we first conducted a comprehensive assessment of state-of-the-art computational tools for predicting DNA 4mC modification sites. Based on the benchmark dataset that was widely used by the previous tools, we encoded 12 features, including nine sequence-based features and three physicochemical property-based features. To evaluate the contribution of individual features and the predictive power of various machine learning algorithms, all features were assessed with each of the eight classification algorithms, and the AUC values were calculated using 10-fold CV. The results demonstrated that the sequence and physicochemical features were all efficient and informative for the prediction of 4mC sites, and that four feature encodings, i.e. NCP, binary, ENAC and EIIP, achieved high performance in multiple classification algorithms across these species. Among the classification algorithms, SVM was the most powerful classifier across the 12 types of features in various species, followed by LR, SGD, RF and GB. Combining multiple features yields a large number of dimensions, but not all of them are equally important for model performance. Thus, we also explored whether a two-step feature selection approach could improve model accuracy. Our results suggested that recursive feature elimination contributed to the feature representation and improved performance. Based on the optimal feature subset in each species, we further introduced a stacking framework that combines the predicted probabilities from six advanced machine learning algorithms as new feature vectors to train a new model. The results indicated that the stacking strategy could combine the strengths of multiple predictors and thereby improve performance.

In addition to the above review and survey, we recollected a large number of newly added 4mC sites in the genomes of the six species and developed a novel online tool, Deep4mC, for identifying 4mC sites in different genomes. Multiple-fold CVs and the comparison with a previous tool demonstrated the robustness and superiority of Deep4mC. To better serve the wider biomedical research community, an online web server for Deep4mC was implemented and is freely accessible at https://bioinfo.uth.edu/Deep4mC. For future prediction of DNA 4mC sites, currently available tools, including Deep4mC, should be maintained to facilitate research. In addition, newly identified DNA 4mC sites in new species will be continuously collected to construct novel computational models, for better prediction and validation of computational approaches. However, a limitation of current prediction methods is that only sequence information and chemical properties are considered, because experimental investigation remains limited. More information, such as structural information and gene expression information, should be considered when these data for 4mC sites become available. Although a steady stream of DNA 4mC sites has been identified, the biological or regulatory functions of most of these sites and their substrates remain largely unknown. Thus, combining computational prediction with experimental validation will provide more insightful clues for future functional studies of the roles of 4mC.

Key Points

We conducted a comprehensive assessment of existing tools for DNA N4-methylcytosine (4mC) site prediction, particularly in terms of feature engineering and classification algorithm construction.
The two-step feature selection strategy and the stacking framework could enhance feature representation and contribute to the performance improvement for 4mC site prediction.
A novel deep learning-based 4mC site predictor, namely Deep4mC, was developed using convolutional neural networks with an attention mechanism.
A web portal (https://bioinfo.uth.edu/Deep4mC) was developed for online prediction of 4mC sites in multiple species.

Acknowledgements

The authors thank all members of the Bioinformatics and Systems Medicine Laboratory for their valuable help and insightful discussion.

Funding

This work was partially supported by the National Institutes of Health grant (R01LM012806). We acknowledge technical support from the Cancer Genomics Core funded by the Cancer Prevention and Research Institute of Texas (CPRIT RP170668 and RP180734). The funder had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.

Conflict of interest

The authors declare that they have no competing interests.

Haodong Xu is a postdoctoral fellow in the Center for Precision Health, School of Biomedical Informatics, the University of Texas Health Science Center at Houston. He obtained his PhD in Bioinformatics from Huazhong University of Science and Technology, China. His research interests include bioinformatics, machine learning and database construction.

Peilin Jia is an assistant professor of bioinformatics in the Center for Precision Health, School of Biomedical Informatics, the University of Texas Health Science Center at Houston. Her research interests include bioinformatics, machine learning, methodology development and integrative genomics.

Zhongming Zhao is Chair Professor for Precision Health and the founding director of the Center for Precision Health, School of Biomedical Informatics, the University of Texas Health Science Center at Houston. He directs the Bioinformatics and Systems Medicine Laboratory and the UTHealth Cancer Genomics Core. His research interests include bioinformatics, integrative genomics and methodology development.

References

1. Yu M, Ji L, Neumann DA, et al. Base-resolution detection of N4-methylcytosine in genomic DNA using 4mC-Tet-assisted-bisulfite-sequencing. Nucleic Acids Res 2015;43:e148.

2. Booth MJ, Branco MR, Ficz G, et al. Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science 2012;336:934–7.

3. Xiao C-L, Zhu S, He M, et al. N6-methyladenine DNA modification in the human genome. Mol Cell 2018;71:306–18.e7.

4. Ko M, Huang Y, Jankowska AM, et al. Impaired hydroxylation of 5-methylcytosine in myeloid cancers with mutant TET2. Nature 2010;468:839.

5. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8:24.

6. Zhang G, Huang H, Liu D, et al. N6-methyladenine DNA modification in Drosophila. Cell 2015;161:893–906.

7. Ehrlich M, Wilson G, Kuo K, et al. N4-methylcytosine as a minor base in bacterial DNA. J Bacteriol 1987;169:939–43.

8. Glickman BW, Radman M. Escherichia coli mutator mutants deficient in methylation-instructed DNA mismatch correction. P Natl Acad Sci 1980;77:1063–7.

9. Pukkila PJ, Peterson J, Herman G, et al. Effects of high levels of DNA adenine methylation on methyl-directed mismatch repair in Escherichia coli. Genetics 1983;104:571–82.

10. Flusberg BA, Webster DR, Lee JH, et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods 2010;7:461.

11. Rathi P, Maurer S, Summerer D. Selective recognition of N4-methylcytosine in DNA by engineered transcription-activator-like effectors. Philos Trans R Soc Lond B Biol Sci 2018;373:20170078.

12. Ye P, Luan Y, Chen K, et al. MethSMRT: an integrative database for DNA N6-methyladenine and N4-methylcytosine generated by single-molecular real-time sequencing. Nucleic Acids Res 2017;45:85–89.

13. Sood AJ, Viner C, Hoffman MM. DNAmod: the DNA modification database. J Chem 2019;11:30.

14. Liu Z-Y, Xing J-F, Chen W, et al. MDR: an integrative DNA N6-methyladenine and N4-methylcytosine modification database for Rosaceae. Hortic Res 2019;6:78.

15. Haodong X, Ruifeng H, Peilin J, et al. 6mA-Finder: a novel online tool for predicting DNA N6-methyladenine sites in genomes. Bioinformatics 2020;36:3257–3259.

16. Chen W, Yang H, Feng P, et al. iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 2017;33:3518–23.

17. Feng P, Yang H, Ding H, et al. iDNA6mA-PseKNC: identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics 2019;111:96–102.

18. Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.

19. Zhou Y, Zeng P, Li Y-H, et al. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res 2016;44:e91.

20. Xu H-D, Shi S-P, Chen X, et al. Systematic analysis of the genetic variability that impacts SUMO conjugation and their involvement in human diseases. Sci Rep 2015;5:10900.

21. Zhang W, Xu X, Yin M, et al. Prediction of methylation sites using the composition of K-spaced amino acid pairs. Protein Pept Lett 2013;20:911–7.

22. Liu B, Gao X, Zhang H. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res 2019;47:e127.

23. Nair AS, Sreenadhan SP. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation 2006;1:197.

24. He W, Jia C. EnhancerPred2.0: predicting enhancers and their strength based on position-specific trinucleotide propensity and electron–ion interaction potential feature selection. Mol Biosyst 2017;13:767–74.

25. Chen W, Feng P-M, Lin H, et al. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res 2013;41:e68.

26. Chen W, Feng P, Ding H, et al. iRNA-methyl: identifying N6-methyladenosine sites using pseudo nucleotide composition. Anal Biochem 2015;490:26–33.

27. Grabherr MG, Pontiller J, Mauceli E, et al. Exploiting nucleotide composition to engineer promoters. PLoS One 2011;6:e20136.

28. Panwar B, Raghava GP. Identification of protein-interacting nucleotides in a RNA sequence using composition profile of tri-nucleotides. Genomics 2015;105:197–203.

29. Qiu W-R, Xiao X, Chou K-C. iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int J Mol Sci 2014;15:1746–66.

30. Panwar B, Arora A, Raghava GP. Prediction and classification of ncRNAs using structural information. BMC Genomics 2014;15:127.

31. Chen Z, Zhao P, Li F, et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34:2499–502.

32. Chen Z, He N, Huang Y, et al. Integration of a deep learning classifier with a random forest approach for predicting malonylation sites. Genomics Proteomics Bioinformatics 2018;16:451–9.

33. Chen Z, Zhao P, Li F, et al. Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences. Brief Bioinform 2019. https://doi.org/10.1093/bib/bbz112.

34. Manavalan B, Basith S, Shin TH, et al. 4mCpred-EL: an ensemble learning framework for identification of DNA N4-methylcytosine sites in the mouse genome. Cell 2019;8:1332.

35. Huang Y, He N, Chen Y, et al. BERMP: a cross-species classifier for predicting m6A sites by integrating a deep learning algorithm and a random forest approach. Int J Mol Sci 2018;14:1669.

36. Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2019. https://doi.org/10.1093/bib/bbz041.

37. Chen W, Tang H, Ye J, et al. iRNA-PseU: identifying RNA pseudouridine sites. Mol Ther-Nucl Acids 2016;5:e332.

38. Fang T, Zhang Z, Sun R, et al. RNAm5CPred: prediction of RNA 5-methylcytosine sites based on three different kinds of nucleotide composition. Mol Ther-Nucl Acids 2019;18:739–47.

39. Xu H-D, Shi S-P, Wen P-P, et al. SuccFind: a novel succinylation sites online prediction tool via enhanced characteristic strategy. Bioinformatics 2015;31:3748–50.

40. Chen G, Cao M, Luo K, et al. ProAcePred: prokaryote lysine acetylation sites prediction based on elastic net feature optimization. Bioinformatics 2018;34:3999–4006.

41. Granitto PM, Furlanello C, Biasioli F, et al. Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometr Intell Lab 2006;83:83–90.

42. Liu X-J, Gong X-J, Yu H, et al. A model stacking framework for identifying DNA binding proteins by orchestrating multi-view features and classifiers. Genes 2018;9:394.

43. Li H. Deep learning for natural language processing: advantages and challenges. Natl Sci Rev 2017;5:24–26.

44. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–8.

45. Hu R, Pei G, Jia P, et al. Decoding regulatory structures and features from epigenomics profiles: a Roadmap-ENCODE Variational Auto-Encoder (RE-VAE) model. Methods 2019. https://doi.org/10.1016/j.ymeth.2019.10.012.

46. Du J, Jia P, Dai Y, et al. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 2019;20:82.

47. Wang C, Xu H, Lin S, et al. GPS 5.0: an update on the prediction of kinase-specific phosphorylation sites in proteins. Genomics Proteomics Bioinformatics 2020. https://doi.org/10.1016/j.gpb.2020.01.001.

48. Wang D, Zeng S, Xu C, et al. MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. Bioinformatics 2017;33:3909–16.

49. Hu H, Xiao A, Zhang S, et al. DeepHINT: understanding HIV-1 integration via deep learning with attention. Bioinformatics 2019;35:1660–7.

50. He W, Jia C, Zou Q. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction. Bioinformatics 2018;35:593–601.

51. Wei L, Luan S, Nagai LAE, et al. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species. Bioinformatics 2018;35:1326–33.

52. Wei L, Su R, Luan S, et al. Iterative feature representations improve N4-methylcytosine site prediction. Bioinformatics 2019;35:4930–7.

53. Manavalan B, Basith S, Shin TH, et al. Meta-4mCpred: a sequence-based meta-predictor for accurate DNA 4mC site prediction using effective feature representation. Mol Ther-Nucl Acids 2019;16:733–44.

54. Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–2605.

55. Pumperla M. Hyperas: a very simple convenience wrapper around hyperopt for fast prototyping with keras models (2017). 2019.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
