Haodong Xu, Peilin Jia, Zhongming Zhao, Deep4mC: systematic assessment and computational prediction for DNA N4-methylcytosine sites by deep learning, Briefings in Bioinformatics, Volume 22, Issue 3, May 2021, bbaa099, https://doi.org/10.1093/bib/bbaa099
Abstract
DNA N4-methylcytosine (4mC) modification represents a novel epigenetic regulation. It is involved in various cellular processes, including DNA replication, cell cycle and gene expression, among others. In addition to experimental identification of 4mC sites, in silico prediction of 4mC sites in the genome has emerged as an alternative and promising approach. In this study, we first reviewed the current progress in the computational prediction of 4mC sites and systematically evaluated the predictive capacity of eight conventional machine learning algorithms as well as 12 feature types commonly used in previous studies in six species. Using a representative benchmark dataset, we investigated the contribution of feature selection and a stacking approach to model construction, and found that feature optimization and proper reinforcement learning could improve the performance. We next recollected newly added 4mC sites in the six species’ genomes and developed a novel deep learning-based 4mC site predictor, namely Deep4mC. Deep4mC applies convolutional neural networks with four representative features. For species with small numbers of samples, we extended our deep learning framework with a bootstrapping method. Our evaluation indicated that Deep4mC could obtain high accuracy and robust performance, with average area under curve (AUC) values greater than 0.9 in all species (range: 0.9005–0.9722). Compared to previous tools, Deep4mC achieved an AUC value improvement of 10.14 to 46.21% in these six species. A user-friendly web server (https://bioinfo.uth.edu/Deep4mC) was built for predicting putative 4mC sites in a genome.
Introduction
The rapid development of genome sequencing technology has made it possible to examine the functional impact of DNA chemical modifications at high resolution. Catalyzed by DNA methyltransferases, methyl-base modifications, such as N4-methylcytosine (4mC), 5-methylcytosine (5mC) and N6-methyladenine (6mA), account for a large portion of DNA modification in the genomes of diverse species [1–4]. These epigenetic modifications greatly expand the diversity of genomic organization and regulation in various biological processes. In eukaryotic genomes, DNA 5mC modification has been widely explored, demonstrating that the dynamic regulation of 5mC plays critical roles in regulating chromatin architecture and gene expression [5]. Recent studies have also shed light on the distribution and regulatory function of extensive 6mA modification in eukaryotic genomes, although 6mA was previously considered to be predominantly a prokaryotic modification [6]. In addition to DNA 5mC and 6mA modifications, 4mC has been reported as a potent epigenetic modification that protects self-DNA from restriction enzyme-mediated degradation [7]. Formed by adding a methyl group to the 4th position of cytosine in DNA, 4mC modification plays an important role in the regulation of DNA replication, cell cycle and gene expression levels and participates in genome stabilization, recombination and evolution [8, 9]. So far, the identification of 4mC modification and the understanding of its roles remain limited, especially because few experimental data are available. Therefore, there is a strong demand for developing approaches that can effectively identify or predict 4mC sites in a genome.
Several experimental approaches have been developed for identification of 4mC sites. In 2010, single-molecule real-time sequencing (SMRT), a mainstream platform of third-generation sequencing, emerged as a popular method with the advantages of long-read sequencing and the capability of detecting DNA modifications. SMRT has since been widely applied to detect 4mC sites from unknown DNA sequences in several bacterial genomes [10]. Later, Yu et al. introduced a next-generation sequencing method, called 4mC-Tet-assisted bisulfite-sequencing, to rapidly and cost-efficiently detect genome-wide 4mC loci in bacterial species [1]. Rathi et al. [11] applied the transcription activator-like effectors method to reveal 4mC sites in DNA sequences. With the increasing number of experimental 4mC studies, the collection and integration of known 4mC sites has gradually become an important research topic for sharing and mining these data. Ye et al. [12] developed the MethSMRT database, the first resource hosting DNA 6mA and 4mC methylomes based on publicly available SMRT sequencing datasets in 156 species. Later, the DNAmod database was built to annotate the chemical properties and structures of all curated modified DNA bases, including 4mC, which enables researchers to check previous studies and the identification methodology [13]. Recently, Liu et al. [14] released the MDR database to curate DNA 6mA and 4mC modifications for the Rosaceae family using SMRT sequencing datasets.
These high-quality datasets, although still limited, provide an opportunity for identification of potential 4mC sites in a DNA sequence by computational approaches, similar to the prediction of CpG methylation and 6mA modification [15]. Here, we first summarized the current progress in computational prediction of 4mC sites and, based on that, we developed a novel deep learning-based species-specific 4mC site predictor, namely Deep4mC. In addition to the 4mC dataset that was commonly used in previous studies, which contains 7163 experimentally identified 4mC sites (benchmark data 1) from six organisms, we also recollected newly added 4mC sites in the six species genomes from the MethSMRT database [12] and compiled a nonredundant dataset, including 285 851 experimentally identified 4mC sites (benchmark data 2) from Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Geobacter pickeringii and Geoalkalibacter subterraneus. We encoded a total of 12 features, including nine sequence-derived features: accumulated nucleotide frequency (ANF), binary, composition of K-spaced nucleic acid pairs (CKSNAP), dinucleotide composition (DNC), enhanced nucleic acid composition (ENAC), Kmer, nucleic acid composition (NAC), reverse complement Kmer (RCKmer) and trinucleotide composition (TNC), and three physicochemical features: electron–ion interaction pseudopotentials (EIIPs) of trinucleotide, nucleotide chemical property (NCP) and pseudo dinucleotide composition (PseDNC). Eight conventional classification algorithms were assessed by 10-fold cross-validation (CV) in a pairwise way using benchmark data 1. Four features (binary, EIIP, ENAC and NCP) performed well across all six species, with average area under curve (AUC) values greater than 0.79. For the classification algorithms, SVM was the most effective classifier, while other classifiers, i.e. logistic regression (LR), stochastic gradient descent (SGD) and random forests (RF), also held good classification ability with AUC values greater than 0.75. We also explored the contribution of two strategies to the final model construction, i.e. feature selection (recursive feature elimination) and model ensemble. Our results demonstrated that the optimal feature group could reduce noisy features and improve performance, while proper reinforcement learning was conducive to model construction. In addition to the above review and evaluation, a novel 4mC site predictor, namely Deep4mC, was built based on deep convolutional neural networks (CNNs) using four representative features. Specifically, Deep4mC uses the four features that performed well in the feature evaluation: binary, ENAC, EIIP and NCP. For species with small numbers of samples (sequences for 4mC sites), we extended our deep learning framework with a bootstrapping method to make full use of the large number of negative samples (Ns) and reduce false positives. The performance of Deep4mC was critically evaluated by multiple CVs and compared with the existing methods. The average AUC values of multiple CVs were all greater than 0.9 in these species, with the minimum (AUC = 0.9005) in A. thaliana and the maximum (AUC = 0.9722) in E. coli. Using the independent dataset, we found that Deep4mC could achieve an AUC value improvement of 10.14 to 46.21% when compared to previous tools in these six species. This evaluation demonstrated the promising accuracy and robustness of Deep4mC. We developed a user-friendly, public online server for Deep4mC (https://bioinfo.uth.edu/Deep4mC).
Materials and methods
Our work has three components (Figure 1): (i) compile two 4mC site benchmark datasets from six organisms; (ii) transform DNA sequences in the benchmark data into mathematical vectors by using 12 types of sequence and physicochemical property features (see below), followed by assessment of these features, multiple machine learning algorithms and two approaches for model construction; and (iii) develop Deep4mC using deep CNNs with an attention mechanism. The method is provided as an online web service available at https://bioinfo.uth.edu/Deep4mC.
Data collection and processing
To facilitate fair comparison and develop a powerful prediction model, two benchmark datasets for 4mC sites were compiled from six species, including A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii and G. subterraneus. Benchmark data 1 were initially processed by Chen et al. from the MethSMRT database [12, 16] and contained 7163 experimentally identified 4mC sites in six species: 1978 in A. thaliana, 1554 in C. elegans, 1769 in D. melanogaster, 388 in E. coli, 569 in G. pickeringii and 905 in G. subterraneus (Supplementary Table 1, see Supplementary Data available online at https://academic.oup.com/bib). Of note, this dataset has been widely used in the currently available prediction tools and has been preprocessed to remove sequences with high similarity. Benchmark data 2 were newly collected for the same six species from the MethSMRT database. Each 4mC site is represented by a 41 base pair (bp) DNA segment, with 20 bp flanking regions upstream and downstream of the 4mC site, respectively. We performed two strict filtering procedures to ensure the reliability of the benchmark datasets, following Chen et al. [16]. First, all 4mC sites were required to have a modification confidence score (QV) of 30 or higher based on the Methylome Analysis Technical Note [17]. Second, we excluded those 4mC sites that had a sequence similarity of more than 70% to others. The similarity score was calculated using the CD-HIT software [18]. After these quality check steps, we obtained a nonredundant, experimentally identified dataset with 285 851 4mC sites (Supplementary Table 2, see Supplementary Data available online at https://academic.oup.com/bib). The numbers of 4mC sites in the positive datasets were 111 927, 60 662, 90 333, 2067, 5727 and 15 135 in A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii and G. subterraneus, respectively (Supplementary Table 2, see Supplementary Data available online at https://academic.oup.com/bib).
For the three species with more than 50 000 4mC sites (A. thaliana, C. elegans and D. melanogaster), we randomly selected the same number of Ns as positive samples (Ps) to construct balanced datasets. Of note, Ns are sites that were not detected as 4mC by the SMRT sequencing technology. For the remaining species, we randomly selected five times as many Ns as Ps. The compiled dataset was divided into a training dataset (90% of the total sample) and an independent dataset (10% of the total sample) in each species. These collected and processed benchmark datasets can be downloaded at https://bioinfo.uth.edu/Deep4mC/Download.php.
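The sampling and splitting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_dataset`, the toy sequences and the fixed random seed are hypothetical, and the text does not say whether the 90/10 split was stratified by label, so a plain shuffled split is used here.

```python
import random

def build_dataset(positives, negative_pool, neg_ratio=1, test_fraction=0.1, seed=0):
    """Return (training, independent) lists of (sequence, label) pairs.

    neg_ratio=1 mimics the balanced setting (large species);
    neg_ratio=5 mimics the 5:1 negative-to-positive setting (small species).
    """
    rng = random.Random(seed)
    # Draw neg_ratio negatives per positive, without replacement.
    negatives = rng.sample(negative_pool, neg_ratio * len(positives))
    labelled = [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]
    rng.shuffle(labelled)
    # Hold out test_fraction of all samples as the independent dataset.
    n_test = round(test_fraction * len(labelled))
    return labelled[n_test:], labelled[:n_test]

# Toy 41-bp segments with a central C; real segments would come from MethSMRT.
toy_pos = ["A" * 20 + "C" + "G" * 20] * 10
toy_neg = ["T" * 20 + "C" + "A" * 20] * 100
train, independent = build_dataset(toy_pos, toy_neg, neg_ratio=5)
print(len(train), len(independent))
```

With 10 positives and a 5:1 ratio, the sketch produces 60 labelled samples, of which 6 are held out.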
Feature encoding scheme
We designed and tested a total of 12 types of features based on the sequence and physicochemical properties.
Accumulated nucleotide frequency
Binary
Composition of K-spaced nucleic acid pairs
This coding system reflects the short-range interactions of nucleic acids within a DNA sequence segment.
EIIPs of trinucleotide
Nucleic acid composition
Di-nucleotide composition
Tri-nucleotide composition
Enhanced nucleic acid composition
Kmer
Reverse complement Kmer
The reverse complement Kmer (RCKmer) encoding [36] is a variant of the Kmer descriptor in which a Kmer and its reverse complement are treated as the same type when calculating the occurrence frequencies of k neighboring nucleotides in the DCS (20, 20). For example, there are 16 types of 2-mers (i.e. ‘AA’, ‘AC’, ‘AG’, ‘AT’, ‘CA’, ‘CC’, ‘CG’, ‘CT’, ‘GA’, ‘GC’, ‘GG’, ‘GT’, ‘TA’, ‘TC’, ‘TG’ and ‘TT’) in a DNA sequence. Among them, ‘TT’ is the reverse complement of ‘AA’. Thus, after removing the reverse complementary Kmers, there are only 10 types of 2-mers in the RCKmer approach (i.e. ‘AA’, ‘AC’, ‘AG’, ‘AT’, ‘CA’, ‘CC’, ‘CG’, ‘GA’, ‘GC’ and ‘TA’).
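As a minimal sketch of how the 16 2-mers collapse to 10 types, the following Python maps every Kmer to the lexicographically smaller of itself and its reverse complement. The helper names (`reverse_complement`, `canonical`, `rckmer`) are illustrative, and normalizing counts by the number of Kmers in the segment is an assumption rather than something stated in the text. The sketch also assumes that sequences contain only A, C, G and T.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Represent a Kmer and its reverse complement by a single key."""
    return min(kmer, reverse_complement(kmer))

def rckmer(sequence, k=2):
    """Return normalized frequencies of reverse-complement-merged Kmers."""
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(keys, 0)
    total = len(sequence) - k + 1
    for i in range(total):
        counts[canonical(sequence[i:i + k])] += 1
    return {key: counts[key] / total for key in keys}

print(list(rckmer("ACGTACGTAC")))
# ['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'GA', 'GC', 'TA']
```

The keys printed match the 10 types listed above; ‘TT’ is counted under ‘AA’, ‘GT’ under ‘AC’, and so on.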
Nucleotide chemical property
Based on chemical properties, ‘A’ can be encoded to (1, 1, 1), ‘C’ to (0, 1, 0), ‘G’ to (1, 0, 0) and ‘T’ to (0, 0, 1), respectively.
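The mapping above translates directly into a lookup table. The sketch below, with the illustrative name `ncp_encode`, flattens a sequence into one vector, so a 41-bp segment yields 41 × 3 = 123 values; flattening (rather than keeping a 41 × 3 matrix) is an assumption made for simplicity.

```python
# NCP codes exactly as given in the text.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}

def ncp_encode(sequence):
    """Concatenate the 3-bit NCP code of every nucleotide."""
    return [bit for base in sequence for bit in NCP[base]]

segment = "A" * 20 + "C" + "T" * 20  # toy 41-bp segment
vector = ncp_encode(segment)
print(len(vector))  # 123
```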
Pseudo dinucleotide composition
Two-step feature selection strategy via recursive feature elimination
Feature selection is a critical step to eliminate noisy features and improve performance. In this study, we performed a two-step feature selection procedure to identify the most prominent feature vectors [39, 40]. In the first step, we conducted statistical tests (t-test for quantitative features and chi-square for categorical features) to identify features that are associated with the target labels. This procedure generated an index of feature ranking indicating their classification importance. In the 2nd step, recursive feature elimination was adopted to determine the optimal feature representations by recursively eliminating a small number of the weakest features per loop [41]. More specifically, to determine the optimal group, features from the ranked index were eliminated in batches (batch size = 10) from lower rank to higher rank, so that the features with the least importance were gradually pruned. After each elimination, the remaining features were used to rebuild the SVM-based prediction model, which was evaluated by 10-fold CV. Finally, the feature subset with the best performance, measured by AUC value, was selected as the optimal feature subset to build the prediction model.
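The batch-wise elimination loop can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: `batch_feature_elimination` is a hypothetical name, and `evaluate` stands in for "train an SVM on this subset and return the 10-fold CV AUC", replaced here by a toy scoring function so that the block runs on its own.

```python
def batch_feature_elimination(ranked_features, evaluate, batch_size=10):
    """Prune the lowest-ranked features in batches and keep the best subset.

    ranked_features: feature identifiers, most important first.
    evaluate: callable mapping a feature subset to a score (e.g. CV AUC).
    """
    current = list(ranked_features)
    best_subset, best_score = list(current), evaluate(current)
    while len(current) > batch_size:
        current = current[:-batch_size]  # drop the weakest batch
        score = evaluate(current)
        if score > best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Toy stand-in for the CV AUC: features 0-24 help, the others add noise.
def toy_score(subset):
    informative = sum(1 for f in subset if f < 25)
    noisy = len(subset) - informative
    return informative - 0.1 * noisy

best, score = batch_feature_elimination(list(range(100)), toy_score)
print(len(best), score)
```

In the toy run the score rises while noisy features are pruned and falls once informative ones start to be removed, which is the same rise-then-fall pattern the selection procedure is designed to exploit.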
Development of the stacking framework
The stacking framework starts with a comprehensive assessment of eight classical machine learning algorithms, followed by an ensemble approach to integrate the predictions from each classifier. The eight classifiers included AdaBoost (AB), decision trees (DT), gradient boosting (GB), K-nearest neighbors (KNN), LR, RF, SGD and SVM. We trained each classification algorithm using the 12 types of features and calculated the AUC values based on 10-fold CV to assess the performance. This process was repeated 10 times to ensure the reliability of the results. Moreover, hyperparameter optimization was performed using RandomizedSearchCV of scikit-learn v0.21.3 (https://scikit-learn.org/) for each classification algorithm to obtain the best model. As a result, we obtained the optimal feature subset for each species for each of the tested algorithms. In addition, we obtained prediction models for six of the tested algorithms, including AB, GB, LR, RF, SGD and SVM, while two algorithms, KNN and DT, were dropped from further analyses due to relatively poor performance.
The 4-, 6-, 8- and 10-fold CVs were performed. The receiver operating characteristic curve (ROC) and AUC values were also calculated in this study.
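For readers less familiar with these measures, here is a dependency-free sketch of n-fold index splitting and of the AUC computed as the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties count as one half). The function names are illustrative; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used, and the pairwise loop below is quadratic in the number of samples.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

folds = kfold_indices(41, 4)
print([len(f) for f in folds])                       # [11, 10, 10, 10]
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))   # 0.75
```

Each fold serves once as the validation set while the remaining folds are used for training; repeating this for 4, 6, 8 and 10 folds gives the multiple CVs referred to above.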
Deep CNNs architecture
For species with unbalanced Ps and Ns, we extended our architecture with a bootstrapping method [48]. First, the same number of Ps and Ns from the benchmark dataset was selected to construct one model based on this balanced dataset. To train on all Ns, the Ns were divided into t bins according to the Ps. In this study, five bootstrap iterations (t = 5) were executed to generate one classifier. This procedure was repeated five times to generate five classifiers. When predicting 4mC for a query site, the average of the outputs of the five classifiers was taken as the final prediction.
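One possible reading of this bootstrapping scheme is sketched below: the negatives are shuffled and cut into t bins, each the size of the positive set, so that every bin can be paired with all positives in one balanced training round, and the final score is the mean of the classifiers' outputs. The text leaves some details open, so the bin size, the function names and the stand-in classifiers (plain callables returning a probability) are all assumptions.

```python
import random

def negative_bins(positives, negatives, t=5, seed=0):
    """Shuffle negatives and cut them into t bins of len(positives) each."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    size = len(positives)
    return [shuffled[i * size:(i + 1) * size] for i in range(t)]

def ensemble_predict(classifiers, query):
    """Average the predicted probabilities of all classifiers."""
    return sum(clf(query) for clf in classifiers) / len(classifiers)

# Toy data with a 5:1 negative-to-positive ratio.
positives = [f"pos{i}" for i in range(10)]
negatives = [f"neg{i}" for i in range(50)]
bins = negative_bins(positives, negatives)
print([len(b) for b in bins])  # [10, 10, 10, 10, 10]

# Stand-ins for five trained models, each returning a fixed probability.
models = [lambda q, p=p: p for p in (0.2, 0.4, 0.6, 0.8, 1.0)]
print(ensemble_predict(models, "query"))
```

With a 5:1 ratio and t = 5, every negative is used exactly once across the five balanced rounds.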
Results
In silico prediction of DNA 4mC sites: current progress
Due to the rapid development of high-throughput techniques, such as SMRT sequencing, the genome-wide distribution patterns and functional roles of 4mC have been extensively investigated. In addition to experimental approaches, a number of computational tools have also been developed for identifying potential 4mC sites in genomes (Supplementary Table 3, see Supplementary Data available online at https://academic.oup.com/bib). Chen et al. [16] developed the first computational method to predict DNA 4mC sites based on an SVM model, called iDNA4mC. In iDNA4mC, the authors collected a nonredundant benchmark dataset in the genomes of six species and considered the ANF and NCP features. Later, He et al. [50] developed 4mCPred, which used the datasets constructed by Chen et al. [16] and introduced the EIIP and nucleotide physicochemical properties as features, followed by feature selection using the F-score. Subsequently, Wei et al. [51] proposed 4mcPred-SVM for the genome-wide detection of DNA 4mC sites. 4mcPred-SVM combines different types of nucleotide composition features, including Kmer, NAC, DNC, TNC and ANF, with the SVM algorithm to train the computational models. Then, 4mcPred-IFL was released with an iterative feature representation [52]. A number of features were encoded, and the F-score method and SVM algorithm were adopted to determine the optimal features and train the models. Recently, Manavalan et al. [53] introduced a novel predictor called Meta-4mCpred by integrating different sequence-based features and physicochemical-based features with an ensemble model. In summary, many efforts have been dedicated to computational identification of 4mC sites, and multiple sequence and physicochemical features and classification algorithms have been employed. However, it remains unclear which features are the most informative and which machine learning algorithm is the most effective in different species.
Thus, a systematic analysis of feature contributions, as well as of the predictive ability of different classifiers with distinct features, is much needed. Such a study will provide a practical guide for future bioinformatics studies of DNA 4mC sites.
Pairwise assessment of 12 features upon multiple machine learning algorithms
To evaluate the contribution of individual features to the prediction of 4mC sites, we first performed a sequence preference analysis for 4mC modification sites in different species. The sequence patterns around 4mC sites differed strongly between species (Figure S1, see Supplementary Data available online at https://academic.oup.com/bib). Then, 12 features (Supplementary Table 4, see Supplementary Data available online at https://academic.oup.com/bib) were encoded, including nine sequence-based features (ANF, binary, CKSNAP, DNC, ENAC, Kmer, NAC, TNC and RCKmer) and three types of physicochemical properties-based features (EIIP, NCP and PseDNC). All features were evaluated pairwise using eight classification algorithms, i.e. SVM, RF, LR, AB, SGD, DT, KNN and GB. Note that the parameters of all classification algorithms were carefully optimized to achieve the most objective results. Although the performance of different features varied across classifiers and species, all AUC values were greater than 0.5, indicating that all sequence and physicochemical features were informative for the prediction of 4mC sites. Moreover, the predictive capability of the eight classification algorithms was also investigated (Figure 2). Based on our results, SVM was the most powerful classifier, with an average AUC value of 0.7662 across 12 types of features in various species. Other algorithms, i.e. LR, SGD, RF and GB, also performed well, with average AUC values of 0.7582, 0.7578, 0.7570 and 0.7531, respectively, whereas KNN and DT showed the worst performance. Moreover, the AUC values of each individual feature with each classification algorithm were calculated and illustrated based on 10-fold CV (Figure 2, Supplementary Table 5, see Supplementary Data available online at https://academic.oup.com/bib).
The results showed that the NCP, binary, ENAC and EIIP encodings achieved high performance in multiple classification algorithms in A. thaliana, C. elegans, D. melanogaster, E. coli and G. subterraneus, with average AUC values of 0.8445, 0.8421, 0.8035 and 0.7922, respectively. The performance of other features, such as TNC, CKSNAP, RCKmer, Kmer, PseDNC, DNC and NAC, was less competitive, with average AUC values ranging from 0.6746 (NAC) to 0.7360 (TNC). The ANF encoding had the lowest average AUC value (0.5968) in these five species. For G. pickeringii, all features except ANF performed well in multiple classification algorithms. Taken together, our results revealed that the 12 types of sequence and physicochemical features are all informative and that SVM is the most powerful classification algorithm for 4mC site prediction.
Two-step feature selection strategy contributes to performance improvement
Different features contribute unequally to model performance, which makes feature optimization a necessary step in machine learning. To this end, we performed a two-step feature selection via the recursive feature elimination method for 4mC prediction in each species. For each feature vector, we calculated a chi-square statistic to assess its association with the target labels. All features were then ranked by decreasing chi-square values. The features at the lower end of the rank were sequentially pruned. Figure 3 shows the change of AUC values from 10-fold CV as a function of the round of feature selection, with the best performance highlighted by the red dot in each curve. The numbers of optimal features were 313 in A. thaliana, 253 in C. elegans, 313 in D. melanogaster, 63 in E. coli, 153 in G. pickeringii and 233 in G. subterraneus. We found a common trend of feature optimization for all species, where the performance of the model increased sharply at the beginning, reached its highest point and then gradually decreased. These results suggested that our recursive feature elimination strategy was effective in improving performance. More specifically, we used E. coli as an example to explore the data distribution using t-distributed stochastic neighbor embedding [54]. As shown in Figure 3, the positive (4mC sites) and negative (non-4mC sites) data points could be much better distinguished after feature selection (Figure 3H) compared to the distribution using all features (Figure 3G). After the recursive feature elimination process, the feature space tended to be relatively stable, and the distinction between the Ps and Ns in feature space was clearer.
The stacking strategy promoted the performance
In the stacking framework, only six machine learning algorithms were considered, i.e. RF, LR, AB, GB, SGD and SVM, due to their high performance on the 12 types of feature encodings, whereas the KNN and DT classifiers were dropped. Based on the optimal feature group, the predicted probability outputs from the six models were taken as the 2nd feature vector and input again to the six different classifiers to develop their corresponding stacking models over five rounds. The model with the best performance (AUC value) was selected as the final prediction model to construct the Deep4mC. Interestingly, compared to the prime model, we found the stacking model could improve the performance except for the SVM classifier in E. coli (Figure 4). For the RF classifier in particular, the stacking model led to an AUC value improvement in the range of 3–7%. For the SVM classifier, the performance improvement was not as large as for RF, but it also contributed to the construction of the final model. It should be noted that in the stacking framework, the performance of the model did not always increase with the number of stacking rounds. We found that most of the stacking models reached their peak at the 2nd round and then gradually declined, although we re-optimized the parameters of the model each time because the feature input was different. In addition, we compared stacking models to single features individually trained by the SVM algorithm and observed that the stacking model improved the prediction performance for all benchmark datasets in these species (Figure 5). Taken together, the stacking model improved performance when compared to the best baseline model, indicating that the stacking strategy could combine the strengths of multiple predictors and thereby improve performance.
Deep4mC accurately predicted DNA 4mC sites
In addition to the above review and evaluation, we developed a new deep learning-based DNA 4mC site predictor, namely Deep4mC, with an attention mechanism. Four representative features encoded from the sequence profile, including binary, ENAC, EIIP and NCP, were taken as input. The input was followed by two convolutional layers without a pooling function, which perform feature extraction and representation. An attention layer was added to connect the last convolutional layer and the output layer. The hyperparameters of Deep4mC were optimized in each species with the tree-structured Parzen estimator approach using the Hyperas package [55]. Specifically, 100 evaluations were executed using separate training and validation sets. The optimal parameters across different species are shown in Supplementary Table 6 (see Supplementary Data available online at https://academic.oup.com/bib).
To assess the accuracy and robustness of Deep4mC, we performed 4-, 6-, 8- and 10-fold CVs on the training dataset in each species (Figure 6). We found that Deep4mC achieved high performance: the average AUC values of multiple CVs were greater than 0.9 in all six species, ranging from 0.9005 (A. thaliana) to 0.9722 (E. coli) (Figure 6). For E. coli, the AUC values of 4-, 6-, 8- and 10-fold CVs were 0.9736, 0.9728, 0.9697 and 0.9726, respectively. Moreover, the average AUC value of multiple CVs in C. elegans was 0.9526, and n-fold CVs also generated similar results in D. melanogaster (0.9468), G. pickeringii (0.9235) and G. subterraneus (0.9285). The different CV results of Deep4mC were highly congruent, indicating that the computational models are accurate and robust.
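As a quick arithmetic check, the E. coli average reported above follows from the four per-CV values. The snippet below only reproduces that calculation; the variable names are illustrative.

```python
from statistics import mean

# AUC values of 4-, 6-, 8- and 10-fold CV in E. coli, as reported in the text.
ecoli_auc = {4: 0.9736, 6: 0.9728, 8: 0.9697, 10: 0.9726}

average = mean(ecoli_auc.values())
spread = max(ecoli_auc.values()) - min(ecoli_auc.values())
print(round(average, 4))  # 0.9722
print(round(spread, 4))   # 0.0039
```

The small spread between the best and worst fold settings is what the text refers to as congruence across CVs.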
To further demonstrate the superiority of Deep4mC, we compared Deep4mC with previously reported 4mC site predictors using an independent dataset. Recently, two novel predictors, 4mcPred-IFL [52] and Meta-4mCpred [53], were reported to outperform other tools. However, the webserver of 4mcPred-IFL is not accessible. Therefore, we only compared Deep4mC with Meta-4mCpred. We submitted the independent dataset to the online service of Meta-4mCpred and downloaded its prediction results. We then compared the Meta-4mCpred output with that of Deep4mC, both of which were based on the same data. As shown in Figure 6G and H, Deep4mC achieved a large AUC value improvement, from 10.14% (E. coli) to 46.21% (G. subterraneus), when compared to the results of Meta-4mCpred across the six species. More importantly, using the independent dataset in each species, we calculated the Sp (see Materials and methods) to investigate the false positive issue. We found that Deep4mC achieved higher Sp (indicating a lower false positive rate) when compared to Meta-4mCpred in each species (Figure 6I). Together, these comparisons demonstrated the robustness and superiority of Deep4mC.
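To make the link between Sp and false positives concrete, the sketch below uses the standard definition of specificity, Sp = TN / (TN + FP), so that the false positive rate equals 1 − Sp. The function name and the toy labels are illustrative, and the binary predictions are assumed to come from thresholding the predicted scores (the threshold itself is not given in this section).

```python
def specificity(labels, predictions):
    """Sp = TN / (TN + FP), computed over the true negatives (label 0)."""
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return tn / (tn + fp)

# Toy example: four true negatives, one of them wrongly called positive.
labels      = [1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 0, 0]
sp = specificity(labels, predictions)
print(sp)      # 0.75
print(1 - sp)  # 0.25, the false positive rate
```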
Discussion
In this study, we first conducted a comprehensive assessment of the state-of-the-art computational tools for predicting DNA 4mC modification sites. Based on the benchmark dataset that was widely used for all the previous tools, we encoded 12 features, including nine sequence-based features and three physicochemical properties-based features. To evaluate the contribution of individual features and the predictive power of various machine learning algorithms, all features were assessed by each of the eight classification algorithms and the AUC values were calculated using 10-fold CV. The results demonstrated that sequence and physicochemical features were all informative for the prediction of 4mC sites and that four feature encodings, i.e. NCP, binary, ENAC and EIIP, achieved high performance in multiple classification algorithms across these species. Among the classification algorithms, SVM was the most powerful classifier across 12 types of features in various species, followed by LR, SGD, RF and AB. Multiple features contain a large number of dimensions, but they are not equally essential for model performance. Thus, we also explored whether a two-step feature selection approach can improve model accuracy. Our results suggested that recursive feature elimination contributed to the feature representation and improved performance. Based on the optimal feature subset in each species, we further introduced a stacking framework that combines the predicted probabilities from six advanced machine learning algorithms as new feature vectors to train a new model. The results indicated that the stacking strategy could combine the strengths of multiple predictors and thereby improve performance.
In addition to the above review and survey, we recollected a large number of newly added 4mC sites in the six species’ genomes and developed a novel online tool, Deep4mC, for identifying 4mC sites in different genomes. Multiple-fold CVs and the comparison with a previous tool demonstrated the robustness and superiority of Deep4mC. To better serve the wider biomedical research community, an online web server for Deep4mC was implemented and is freely accessible at https://bioinfo.uth.edu/Deep4mC. For future prediction of DNA 4mC sites, currently available tools, including Deep4mC, should be maintained to facilitate research. In addition, newly identified DNA 4mC sites in new species will be continuously collected to construct novel computational models and to better validate computational approaches. However, a limitation of current prediction methods is that only sequence information and chemical properties are considered, because few experimental investigations are available. More information, such as structural information and gene expression information, should be considered when these data for 4mC sites become available. Although a steady stream of DNA 4mC sites has been identified, the biological or regulatory functions of most of these sites and their substrates remain largely unknown. Thus, combining computational prediction and experimental validation will provide more insightful clues for future functional studies of 4mC.
Acknowledgements
The authors thank all members of the Bioinformatics and Systems Medicine Laboratory for their valuable help and insightful discussion.
Funding
This work was partially supported by the National Institutes of Health grant (R01LM012806). We acknowledge technical support from the Cancer Genomics Core funded by the Cancer Prevention and Research Institute of Texas (CPRIT RP170668 and RP180734). The funder had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.
Conflict of interest
The authors declare that they have no competing interests.
Haodong Xu is a postdoctoral fellow in the Center for Precision Health, School of Biomedical Informatics, the University of Texas Health Science Center at Houston. He obtained his PhD in Bioinformatics from Huazhong University of Science and Technology, China. His research interest includes bioinformatics, machine learning and database construction.
Peilin Jia is an assistant professor of bioinformatics in the Center for Precision Health, School of Biomedical Informatics, the University of Texas Health Science Center at Houston. Her research interest includes bioinformatics, machine learning, methodology development and integrative genomics.
Zhongming Zhao holds Chair Professor for Precision Health and is the founding director of the Center for Precision Health, School of Biomedical Informatics, the University of Texas Health Science Center at Houston. He directs the Bioinformatics and Systems Medicine Laboratory and UTHealth Cancer Genomics Core. His research interest includes bioinformatics, integrative genomics and methodology development.