Link to original content: http://www.ncbi.nlm.nih.gov/pubmed/26291518
Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast

PLoS Comput Biol. 2015 Aug 20;11(8):e1004418.
doi: 10.1371/journal.pcbi.1004418. eCollection 2015 Aug.

Zing Tsung-Yeh Tsai et al. PLoS Comput Biol. 2015.

Abstract

Transcription factor (TF) binding is determined by the presence of specific sequence motifs (SM) and chromatin accessibility, where the latter is influenced by both chromatin state (CS) and DNA structure (DS) properties. Although SM, CS, and DS have been used to predict TF binding sites, a predictive model that jointly considers CS and DS has not been developed to predict either TF-specific binding or general binding properties of TFs. Using budding yeast as a model, we found that machine learning classifiers trained with either CS or DS features alone perform better in predicting TF-specific binding compared to SM-based classifiers. In addition, simultaneously considering CS and DS further improves the accuracy of the TF binding predictions, indicating the highly complementary nature of these two properties. The contributions of SM, CS, and DS features to binding site predictions differ greatly between TFs, allowing TF-specific predictions and potentially reflecting different TF binding mechanisms. In addition, a "TF-agnostic" predictive model based on three DNA "intrinsic properties" (in silico predicted nucleosome occupancy, major groove geometry, and dinucleotide free energy) that can be calculated from genomic sequences alone has performance that rivals the model incorporating experiment-derived data. This intrinsic property model allows prediction of binding regions not only across TFs, but also across DNA-binding domain families with distinct structural folds. Furthermore, these predicted binding regions can help identify TF binding sites that have a significant impact on target gene expression. Because the intrinsic property model allows prediction of binding regions across DNA-binding domain families, it is TF-agnostic and likely describes the general binding potential of TFs. Thus, our findings suggest that it is feasible to establish a TF-agnostic model for identifying functional regulatory regions in potentially any sequenced genome.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Evaluation of features distinguishing between bound and unbound regions and between regions bound by a single TF compared to the other TFs.
(A,B) The p-values (color scale shown, adjusted by false discovery rate control for multiple testing) from two-sided Wilcoxon rank sum tests of differences in feature values (A) between bound and unbound regions of all the 40 analyzed TFs jointly (ALL) and separately, and (B) between bound regions of a single vs. the remaining TFs. The p-values for (A) and (B) are shown in S1 Fig and S2 Fig, respectively. (C,D) The value distributions of the 23 features for regions bound (black) and not bound (white) by (C) RAP1 and (D) ZAP1, respectively. The values were normalized into [0, 1] for each feature. The p-values of two-tailed Wilcoxon rank sum tests are shown below the boxplots: red, p < 10^-3; white, p = 10^-3; blue, p > 10^-3.
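
The Fig 1 caption says the Wilcoxon p-values were adjusted by false discovery rate control but does not name the procedure. The following is a minimal sketch that assumes the Benjamini-Hochberg step-up procedure; that choice, the function name `bh_adjust`, and the example p-values are all illustrative assumptions, not details taken from the paper.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    # Indices of the p-values, ordered from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # 1-based rank of this p-value in ascending order
        # Scale by m / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four features (illustrative only).
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
```

Adjusted values are returned in the same order as the input, so each one can be mapped back to its feature before applying a cutoff such as the 10^-3 threshold used in panels C and D.
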
Fig 2. Performance improvement in binding region prediction models by incorporating chromatin state (CS) and DNA structure (DS) features.
(A,C) The relationship between binding region prediction performance of models using sequence motif (SM) only and SM+CS+DS for each TF when contrasting (A) bound and unbound regions of a TF and (C) regions bound by one TF compared to regions bound by the other TFs. The triangle indicates the average performance. The line indicates the 1-to-1 relationship. (B,D) The relationship between the improvement in F-measure when incorporating CS and DS and the F-measures of random forest classifications using SM only when contrasting (B) bound and unbound regions of a TF and (D) regions bound by one TF compared to regions bound by the other TFs.
Fig 3. Contribution of SM, CS, and DS features to overall and individual TF binding region prediction.
(A) The F-measure distributions of random forest classifications with different individual features or combinations of features. The y-axis indicates the probability of a specific F-measure. The arrowheads indicate the average F-measures. (B) The relationship between F-measures of binding predictions based on CS features only and DS features only. The dotted line shows the 1-to-1 relationship. Points in green (yellow) represent the TFs for which performance is better in the DS-only (CS-only) model. (C) Heat map showing the relative performance (i.e., F-measures standardized to mean zero and variance one) in predicting the binding regions of each TF using individual features or combinations of features. The TFs are grouped into three classes: TFs with binding regions that can be predicted by either CS or DS (Group 1); TFs with binding regions that cannot be predicted well with only CS (Group 2) or only DS (Group 3) features.
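
The Fig 3 caption relies on two computations: the F-measure and its standardization to mean zero and variance one. The sketch below assumes the balanced F-measure (F1) and assumes standardization is applied per TF across feature sets using the population variance; the caption states neither detail. The function names and scores are hypothetical.

```python
from statistics import mean, pstdev


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1) from true/false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def standardize(values):
    """Rescale values to mean zero and variance one (population variance assumed)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical F-measures of one TF under three feature sets (illustrative only).
scores = {"SM": 0.60, "CS": 0.70, "DS": 0.80}
z = dict(zip(scores, standardize(list(scores.values()))))
```

Standardizing within a TF removes differences in overall predictability between TFs, so a heat map row shows only which feature sets do relatively better or worse for that TF.
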
Fig 4. The relative importance of features for predicting binding regions of different TFs.
Importance was defined as the decrease in accuracy after dropping a feature. The accuracy range was normalized to [0, 1] for each TF, where 0 is blue and 1 is red. The TFs were grouped into three classes as shown in Fig 3. Arrowheads indicate the most important features for predicting binding regions for most TFs.
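
The importance definition in the Fig 4 caption (decrease in accuracy after dropping a feature, rescaled to [0, 1] per TF) can be sketched as follows. This assumes min-max rescaling of the accuracy drops, which is one plausible reading of the caption; the function name, dictionary keys, and accuracy values are hypothetical and not taken from the paper.

```python
def normalized_importance(full_accuracy, accuracy_without):
    """Map each feature to its accuracy drop, min-max scaled to [0, 1]."""
    # Importance = accuracy with all features minus accuracy without the feature.
    drops = {f: full_accuracy - acc for f, acc in accuracy_without.items()}
    lo, hi = min(drops.values()), max(drops.values())
    if hi == lo:
        return {f: 0.0 for f in drops}
    return {f: (d - lo) / (hi - lo) for f, d in drops.items()}


# Hypothetical accuracies for one TF (illustrative numbers only).
full = 0.90
without = {"nucleosome_occupancy": 0.70, "major_groove": 0.80, "free_energy": 0.88}
importance = normalized_importance(full, without)
```

With this scaling, the feature whose removal hurts most maps to 1 (red in the heat map) and the feature whose removal hurts least maps to 0 (blue), independently for each TF.
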
Fig 5. The performances of cross-DBD validations based on predictions using in silico predicted nucleosome occupancy, DNA major groove geometry, and dinucleotide free energy.
The five DBD families examined were helix-turn-helix (HTH, 6130 sites), zinc finger (ZF, 8372 sites), leucine zipper (LZ, 3560 sites), winged helix (WH, 1070 sites), and helix-loop-helix (HLH, 2944 sites). Each value in the heat map is the F-measure of a model trained with the dataset of DBDx family member binding regions to predict the test dataset consisting of binding regions of TFs with DBDy. The F-measures on the diagonal are obtained by 10-fold cross-validation.
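
The cross-DBD evaluation in the Fig 5 caption (train on family x, test on family y, with k-fold cross-validation on the diagonal) has the structure sketched below. The classifier here is a deliberately simple stand-in, a single-score threshold, and is not the model used in the paper; the scores, labels, fold scheme, and function names are hypothetical, and k is set to 2 only because the toy data are tiny (the caption reports 10-fold).

```python
def train_threshold(samples):
    """Stand-in classifier: midpoint between the class means of one score."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def f_measure_at(threshold, samples):
    """Balanced F-measure when predicting 'bound' for scores above threshold."""
    tp = sum(1 for x, y in samples if x > threshold and y == 1)
    fp = sum(1 for x, y in samples if x > threshold and y == 0)
    fn = sum(1 for x, y in samples if x <= threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def k_fold_f_measure(samples, k):
    """Average F-measure over k folds (fold i holds every k-th sample)."""
    scores = []
    for i in range(k):
        test = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        scores.append(f_measure_at(train_threshold(train), test))
    return sum(scores) / k


def cross_family_matrix(datasets, k=10):
    """matrix[x][y]: train on family x, test on family y; diagonal uses k-fold CV."""
    matrix = {}
    for x, train in datasets.items():
        matrix[x] = {}
        for y, test in datasets.items():
            if x == y:
                matrix[x][y] = k_fold_f_measure(train, k)
            else:
                matrix[x][y] = f_measure_at(train_threshold(train), test)
    return matrix


# Hypothetical (score, label) pairs; label 1 = bound, 0 = unbound.
datasets = {
    "HTH": [(0.9, 1), (0.1, 0), (0.2, 0), (0.8, 1)],
    "LZ": [(0.45, 1), (0.05, 0), (0.1, 0), (0.4, 1)],
}
matrix = cross_family_matrix(datasets, k=2)
```

The off-diagonal entries need not be symmetric: in this toy data a threshold learned on one family can separate the other family's regions while the reverse fails, which is the kind of asymmetry a cross-family heat map makes visible.
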

Grants and funding

This work was partly supported by the Taiwan Ministry of Science and Technology (http://www.most.gov.tw) [MOST104-2917-I-564-070 to ZTYT and MOST103-2221-E-001-024-MY2 to HKT] and the US National Science Foundation (http://www.nsf.gov) [MCB-1119778 to SHS]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.