Abstract
Transcription start site (TSS) selection is a key step in gene expression and occurs at many promoter positions over a wide range of efficiencies. Here we develop a massively parallel reporter assay to quantitatively dissect contributions of promoter sequence, nucleoside triphosphate substrate levels and RNA polymerase II (Pol II) activity to TSS selection by ‘promoter scanning’ in Saccharomyces cerevisiae (Pol II MAssively Systematic Transcript End Readout, ‘Pol II MASTER’). Using Pol II MASTER, we measure the efficiency of Pol II initiation at 1,000,000 individual TSS sequences in a defined promoter context. Pol II MASTER confirms proposed critical qualities of S. cerevisiae TSS −8, −1 and +1 positions, quantitatively, in a controlled promoter context. Pol II MASTER extends quantitative analysis to surrounding sequences and determines that they tune initiation over a wide range of efficiencies. These results enabled the development of a predictive model for initiation efficiency based on sequence. We show that genetic perturbation of Pol II catalytic activity alters initiation efficiency mostly independently of TSS sequence, but selectively modulates preference for the initiating nucleotide. Intriguingly, we find that Pol II initiation efficiency is directly sensitive to guanosine-5′-triphosphate levels at the first five transcript positions and to cytosine-5′-triphosphate and uridine-5′-triphosphate levels at the second position genome wide. These results suggest individual nucleoside triphosphate levels can have transcript-specific effects on initiation, representing a cryptic layer of potential regulation at the level of Pol II biochemical properties. The results establish Pol II MASTER as a method for quantitative dissection of transcription initiation in eukaryotes.
Data availability
Raw sequencing data generated in this study are available in the National Center for Biotechnology Information BioProject, under the accession number PRJNA766624. Processed data are available in GEO, under the accession number GSE185290. Source data are provided with this paper.
Code availability
Code for analyses in this study is provided at https://github.com/Kaplan-Lab-Pitt/PolII_MASTER-TSS_sequence.
References
Zhang, Z. & Dietrich, F. S. Mapping of transcription start sites in Saccharomyces cerevisiae using 5′ SAGE. Nucleic Acids Res. 33, 2838–2851 (2005).
Park, D., Morris, A. R., Battenhouse, A. & Iyer, V. R. Simultaneous mapping of transcript ends at single-nucleotide resolution and identification of widespread promoter-associated non-coding RNA governed by TATA elements. Nucleic Acids Res. 42, 3736–3749 (2014).
Pelechano, V., Wei, W. & Steinmetz, L. M. Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497, 127–131 (2013).
Chia, M. et al. High-resolution analysis of cell-state transitions in yeast suggests widespread transcriptional tuning by alternative starts. Genome Biol. 22, 34 (2021).
Nepal, C. et al. Dynamic regulation of the transcription initiation landscape at single nucleotide resolution during vertebrate embryogenesis. Genome Res. 23, 1938–1950 (2013).
Consortium, F. et al. A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).
Yamashita, R. et al. Genome-wide characterization of transcriptional start sites in humans by integrative transcriptome analysis. Genome Res. 21, 775–789 (2011).
Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005).
Hoskins, R. A. et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 21, 182–192 (2011).
Zheng, H. et al. Global identification of transcription start sites in the genome of Apis mellifera using 5′ LongSAGE. J. Exp. Zool. B Mol. Dev. Evol. 316, 500–514 (2011).
Chen, R. A. et al. The landscape of RNA polymerase II transcription initiation in C. elegans reveals promoter and enhancer architectures. Genome Res. 23, 1339–1347 (2013).
Cheng, Z. et al. Pervasive, coordinated protein-level changes driven by transcript isoform switching during meiosis. Cell 172, 910–923 (2018).
Rojas-Duran, M. F. & Gilbert, W. V. Alternative transcription start site selection leads to large differences in translation activity in yeast. RNA 18, 2299–2305 (2012).
Batut, P., Dobin, A., Plessy, C., Carninci, P. & Gingeras, T. R. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 23, 169–180 (2013).
Zhang, P. et al. Relatively frequent switching of transcription start sites during cerebellar development. BMC Genomics 18, 461 (2017).
Lu, Z. & Lin, Z. Pervasive and dynamic transcription initiation in Saccharomyces cerevisiae. Genome Res. 29, 1198–1210 (2019).
Demircioglu, D. et al. A pan-cancer transcriptome analysis reveals pervasive regulation through alternative promoters. Cell 178, 1465–1477 (2019).
Thorsen, K. et al. Tumor-specific usage of alternative transcription start sites in colorectal cancer identified by genome-wide exon array analysis. BMC Genomics 12, 505 (2011).
Boyd, M. et al. Characterization of the enhancer and promoter landscape of inflammatory bowel disease from human colon biopsies. Nat. Commun. 9, 1661 (2018).
Giardina, C. & Lis, J. T. DNA melting on yeast RNA polymerase II promoter. Science 261, 759–762 (1993).
Qiu, C. et al. Universal promoter scanning by Pol II during transcription initiation in Saccharomyces cerevisiae. Genome Biol. 21, 132 (2020).
Kuehner, J. N. & Brow, D. A. Quantitative analysis of in vivo initiator selection by yeast RNA polymerase II supports a scanning model. J. Biol. Chem. 281, 14119–14128 (2006).
Kaplan, C. D., Jin, H., Zhang, I. L. & Belyanin, A. Dissection of Pol II trigger loop function and Pol II activity-dependent control of start site selection in vivo. PLoS Genet. 8, e1002627 (2012).
Miller, G. & Hahn, S. A DNA-tethered cleavage probe reveals the path for promoter DNA in the yeast preinitiation complex. Nat. Struct. Mol. Biol. 13, 603–610 (2006).
Fazal, F. M., Meng, C. A., Murakami, K., Kornberg, R. D. & Block, S. M. Real-time observation of the initiation of RNA polymerase II transcription. Nature 525, 274–277 (2015).
Hampsey, M. Molecular genetics of the RNA polymerase II general transcriptional machinery. Microbiol Mol. Biol. Rev. 62, 465–503 (1998).
Zhao, T. et al. Ssl2/TFIIH function in transcription start site scanning by RNA polymerase II in Saccharomyces cerevisiae. eLife 10, e71013 (2021).
Hahn, S., Hoar, E. T. & Guarente, L. Each of three ‘TATA elements’ specifies a subset of the transcription initiation sites at the CYC-1 promoter of Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA 82, 8562–8566 (1985).
Cortes, T. et al. Genome-wide mapping of transcriptional start sites defines an extensive leaderless transcriptome in Mycobacterium tuberculosis. Cell Rep. 5, 1121–1131 (2013).
Bucher, P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563–578 (1990).
Smale, S. T. & Baltimore, D. The ‘initiator’ as a transcription control element. Cell 57, 103–113 (1989).
Corden, J. et al. Promoter sequences of eukaryotic protein-coding genes. Science 209, 1406–1414 (1980).
McNeil, J. B. & Smith, M. Saccharomyces cerevisiae CYC1 mRNA 5′-end positioning: analysis by in vitro mutagenesis, using synthetic duplexes with random mismatch base pairs. Mol. Cell. Biol. 5, 3545–3551 (1985).
Malabat, C., Feuerbach, F., Ma, L., Saveanu, C. & Jacquier, A. Quality control of transcription start site selection by nonsense-mediated mRNA decay. eLife 4, e06722 (2015).
Policastro, R. A., Raborn, R. T., Brendel, V. P. & Zentner, G. E. Simple and efficient profiling of transcription initiation and transcript levels with STRIPE-seq. Genome Res. 30, 910–923 (2020).
Healy, A. M., Helser, T. L. & Zitomer, R. S. Sequences required for transcriptional initiation of the Saccharomyces cerevisiae CYC7 genes. Mol. Cell. Biol. 7, 3785–3791 (1987).
Furter-Graves, E. M. & Hall, B. D. DNA sequence elements required for transcription initiation of the Schizosaccharomyces pombe ADH gene in Saccharomyces cerevisiae. Mol. Gen. Genet 223, 407–416 (1990).
Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 38, 626–635 (2006).
Hashimoto, S. et al. 5′-end SAGE for the analysis of transcriptional start sites. Nat. Biotechnol. 22, 1146–1149 (2004).
Suzuki, Y. et al. Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO Rep. 2, 388–393 (2001).
Kim, D. et al. Comparative analysis of regulatory elements between Escherichia coli and Klebsiella pneumoniae by genome-wide transcription start site profiling. PLoS Genet. 8, e1002867 (2012).
Vvedenskaya, I. O. et al. Massively systematic transcript end readout, ‘MASTER’: transcription start site selection, transcriptional slippage, and transcript yields. Mol. Cell 60, 953–965 (2015).
Gleghorn, M. L., Davydova, E. K., Basu, R., Rothman-Denes, L. B. & Murakami, K. S. X-ray crystal structures elucidate the nucleotidyl transfer reaction of transcript initiation using two nucleotides. Proc. Natl Acad. Sci. USA 108, 3566–3571 (2011).
Basu, R. S. et al. Structural basis of transcription initiation by bacterial RNA polymerase holoenzyme. J. Biol. Chem. 289, 24549–24559 (2014).
Lu, Z. & Lin, Z. The origin and evolution of a distinct mechanism of transcription initiation in yeasts. Genome Res. 31, 51–63 (2020).
Maicas, E. & Friesen, J. D. A sequence pattern that occurs at the transcription initiation region of yeast RNA polymerase II promoters. Nucleic Acids Res. 18, 3387–3393 (1990).
Lubliner, S., Keren, L. & Segal, E. Sequence features of yeast and human core promoters that are predictive of maximal promoter activity. Nucleic Acids Res. 41, 5569–5581 (2013).
Dujon, B. The yeast genome project: what did we learn? Trends Genet. 12, 263–270 (1996).
Lubliner, S. et al. Core promoter sequence in yeast is a major determinant of expression level. Genome Res. 25, 1008–1017 (2015).
Blazeck, J., Garg, R., Reed, B. & Alper, H. S. Controlling promoter strength and regulation in Saccharomyces cerevisiae using synthetic hybrid promoters. Biotechnol. Bioeng. 109, 2884–2895 (2012).
Dhillon, N. et al. Permutational analysis of Saccharomyces cerevisiae regulatory elements. Synth. Biol. 5, ysaa007 (2020).
Wang, H., Schilbach, S., Ninov, M., Urlaub, H. & Cramer, P. Structures of transcription preinitiation complex engaged with the +1 nucleosome. Nat. Struct. Mol. Biol. 30, 226–232 (2022).
Vvedenskaya, I. O., Goldman, S. R. & Nickels, B. E. Analysis of bacterial transcription by ‘Massively Systematic Transcript End Readout,’ MASTER. Methods Enzymol. 612, 269–302 (2018).
Vvedenskaya, I. O. et al. Interactions between RNA polymerase and the core recognition element are a determinant of transcription start site selection. Proc. Natl Acad. Sci. USA 113, E2899–E2905 (2016).
Winkelman, J. T. et al. Multiplexed protein–DNA cross-linking: scrunching in transcription start site selection. Science 351, 1090–1093 (2016).
Hochschild, A. Mastering transcription: multiplexed analysis of transcription start site sequences. Mol. Cell 60, 829–831 (2015).
Faitar, S. L., Brodie, S. A. & Ponticelli, A. S. Promoter-specific shifts in transcription initiation conferred by yeast TFIIB mutations are determined by the sequence in the immediate vicinity of the start sites. Mol. Cell. Biol. 21, 4427–4440 (2001).
Deshpande, A. P. & Patel, S. S. Mechanism of transcription initiation by the yeast mitochondrial RNA polymerase. Biochim. Biophys. Acta 1819, 930–938 (2012).
Javahery, R., Khachi, A., Lo, K., Zenzie-Gregory, B. & Smale, S. T. DNA sequence requirements for transcriptional initiator activity in mammalian cells. Mol. Cell. Biol. 14, 116–127 (1994).
Arkhipova, I. R. Promoter elements in Drosophila melanogaster revealed by sequence analysis. Genetics 139, 1359–1369 (1995).
Yarden, G., Elfakess, R., Gazit, K. & Dikstein, R. Characterization of sINR, a strict version of the Initiator core promoter element. Nucleic Acids Res. 37, 4234–4246 (2009).
Wong, M. S., Kinney, J. B. & Krainer, A. R. Quantitative activity profile and context dependence of all human 5′ splice sites. Mol. Cell 71, 1012–1026 (2018).
Roca, X. et al. Features of 5′-splice-site efficiency derived from disease-causing mutations and comparative genomics. Genome Res. 18, 77–87 (2008).
Carmel, I., Tal, S., Vig, I. & Ast, G. Comparative analysis detects dependencies among the 5′ splice-site positions. RNA 10, 828–840 (2004).
McPhillips, C. C., Hyle, J. W. & Reines, D. Detection of the mycophenolate-inhibited form of IMP dehydrogenase in vivo. Proc. Natl Acad. Sci. USA 101, 12171–12176 (2004).
Hyle, J. W., Shaw, R. J. & Reines, D. Functional distinctions between IMP dehydrogenase genes in providing mycophenolate resistance and guanine prototrophy to yeast. J. Biol. Chem. 278, 28470–28478 (2003).
Kuehner, J. N. & Brow, D. A. Regulation of a eukaryotic gene by GTP-dependent start site selection and transcription attenuation. Mol. Cell 31, 201–211 (2008).
Rhee, H. S. & Pugh, B. F. Genome-wide structure and organization of eukaryotic preinitiation complexes. Nature 483, 295–301 (2012).
Vo Ngoc, L., Huang, C. Y., Cassidy, C. J., Medrano, C. & Kadonaga, J. T. Identification of the human DPR core promoter element using machine learning. Nature 585, 459–463 (2020).
Luse, D. S., Parida, M., Spector, B. M., Nilson, K. A. & Price, D. H. A unified view of the sequence and functional organization of the human RNA polymerase II promoter. Nucleic Acids Res. 48, 7767–7785 (2020).
Zhang, Y. et al. Structural basis of transcription initiation. Science 338, 1076–1080 (2012).
Walmacq, C. et al. Mechanism of translesion transcription by RNA polymerase II and its role in cellular resistance to DNA damage. Mol. Cell 46, 18–29 (2012).
Braberg, H. et al. From structure to systems: high-resolution, quantitative genetic analysis of RNA polymerase II. Cell 154, 775–788 (2013).
Malik, I., Qiu, C., Snavely, T. & Kaplan, C. D. Wide-ranging and unexpected consequences of altered Pol II catalytic activity in vivo. Nucleic Acids Res. 45, 4431–4451 (2017).
Kwapisz, M. et al. Mutations of RNA polymerase II activate key genes of the nucleoside triphosphate biosynthetic pathways. EMBO J. 27, 2411–2421 (2008).
Thiebaut, M. et al. Futile cycle of transcription initiation and termination modulates the response to nucleotide shortage in S. cerevisiae. Mol. Cell 31, 671–682 (2008).
Steinmetz, E. J. et al. Genome-wide distribution of yeast RNA polymerase II and its control by Sen1 helicase. Mol. Cell 24, 735–746 (2006).
Hein, P. P., Palangat, M. & Landick, R. RNA transcript 3′-proximal sequence affects translocation bias of RNA polymerase. Biochemistry 50, 7002–7014 (2011).
Cabart, P., Jin, H., Li, L. & Kaplan, C. D. Activation and reactivation of the RNA polymerase II trigger loop for intrinsic RNA cleavage and catalysis. Transcription 5, e28869 (2014).
Sainsbury, S., Niesser, J. & Cramer, P. Structure and function of the initially transcribing RNA polymerase II-TFIIB complex. Nature 493, 437–440 (2013).
Segal, E. & Widom, J. Poly(dA:dT) tracts: major determinants of nucleosome organization. Curr. Opin. Struct. Biol. 19, 65–71 (2009).
Tillo, D. & Hughes, T. R. G+C content dominates intrinsic nucleosome occupancy. BMC Bioinf. 10, 442 (2009).
Lee, W. et al. A high-resolution atlas of nucleosome occupancy in yeast. Nat. Genet. 39, 1235–1244 (2007).
Peckham, H. E. et al. Nucleosome positioning signals in genomic DNA. Genome Res. 17, 1170–1177 (2007).
Segal, E. et al. A genomic code for nucleosome positioning. Nature 442, 772–778 (2006).
Jin, H. & Kaplan, C. D. Relationships of RNA polymerase II genetic interactors to transcription start site usage defects and growth in Saccharomyces cerevisiae. G3 5, 21–33 (2014).
Amberg, D. C., Burke, D., Strathern, J. N., Burke, D. & Cold Spring Harbor Laboratory. Methods in Yeast Genetics: A Cold Spring Harbor Laboratory Course Manual, XVII (Cold Spring Harbor Laboratory Press, 2005).
Chee, M. K. & Haase, S. B. New and redesigned pRS plasmid shuttle vectors for genetic manipulation of Saccharomyces cerevisiae. G3 2, 515–526 (2012).
Gietz, R. D. & Schiestl, R. H. High-efficiency yeast transformation using the LiAc/SS carrier DNA/PEG method. Nat. Protoc. 2, 31–34 (2007).
Benatuil, L., Perez, J. M., Belk, J. & Hsieh, C. M. An improved yeast transformation method for the generation of very large human antibody libraries. Protein Eng. Des. Sel. 23, 155–159 (2010).
Schmitt, M. E., Brown, T. A. & Trumpower, B. L. A rapid and simple method for preparation of RNA from Saccharomyces cerevisiae. Nucleic Acids Res. 18, 3091–3092 (1990).
Vvedenskaya, I. O., Goldman, S. R. & Nickels, B. E. Preparation of cDNA libraries for high-throughput RNA sequencing analysis of RNA 5′ ends. Methods Mol. Biol. 1276, 211–228 (2015).
Ranish, J. A. & Hahn, S. The yeast general transcription factor TFIIA is composed of 2 polypeptide subunits. J. Biol. Chem. 266, 19320–19327 (1991).
Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina paired-end read merger. Bioinformatics 30, 614–620 (2014).
Smith, T., Heger, A. & Sudbery, I. UMI-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Res. 27, 491–499 (2017).
Tareen, A. & Kinney, J. B. Logomaker: beautiful sequence logos in Python. Bioinformatics 36, 2272–2274 (2020).
Crooks, G. E., Hon, G., Chandonia, J. M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004).
Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26 (2008).
Acknowledgements
The authors thank Kaplan lab members for helpful comments on the manuscript. We are deeply grateful to C. Qiu for discussions and comments on this project. We acknowledge J. Kinney (Cold Spring Harbor Laboratory) and S. Li (Statistical Consulting Center at University of Pittsburgh) for discussions on modeling. We thank C.D. Johnson, R. Metz (Texas A&M AgriLife Genomics and Bioinformatics Service), A. Hillhouse (Texas A&M Institute for Genome Sciences & Society), W.A. MacDonald and R. Elbakri (the University of Pittsburgh Health Sciences Sequencing Core at UPMC Children’s Hospital of Pittsburgh), Y. Pan (the UPMC Genome Center), D. Kumar (the Waksman Genomics Core Facility at Rutgers University) and L. Freeman (Illumina) for discussions and advice regarding deep sequencing strategies. We thank S.J. Mullett and S.G. Wendell (Metabolomics and Lipidomics Core, NIHS10OD023402) for performing NTP measurements. We acknowledge support from National Institutes of Health (NIH) grant R01GM097260 to C.D.K. for the early part of this work and NIH grants R01GM120450 and R35GM144116 to C.D.K. and R35GM118059 to B.E.N. This research was supported in part by the University of Pittsburgh Center for Research Computing, RRID:SCR_022735, through the resources provided. Specifically, this work used the HTC cluster, which is supported by NIH award number S10OD028483.
Author information
Contributions
Y.Z. designed the project, performed experiments, analyzed data, made figures, and drafted and revised the manuscript. I.O.V. generated libraries for TSS-seq. S-H.Z. analyzed data and discussed the analysis. B.E.N. provided funding and methodology of TSS-seq, and revised the manuscript. C.D.K. conceived and designed the project, guided analyses and interpretation of data, provided funding and revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Structural & Molecular Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Sara Osman was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. Peer reviewer reports are available.
Extended data
Extended Data Fig. 1 High level of reproducibility and coverage depth of library variants.
(A) Schematic of experimental approach. Promoter libraries with almost all possible sequences within a 9 nt randomized region were constructed on plasmids. Libraries were designated ‘AYR’, ‘BYR’ and ‘ARY’ based on randomized region composition. Plasmids were amplified in E. coli and transformed into yeast with wild-type (WT) or mutated Pol II. DNA and RNA were extracted and prepared for DNA-seq and TSS-seq. (B) Base frequencies at positions within the randomized region of promoter variants demonstrate unbiased synthesis of randomized regions. Bars are mean ± standard deviation of the mean for promoter variants in WT and four Pol II mutants. (C) Heat map illustrating hierarchical clustering of Pearson correlation coefficients of reads per promoter variant for E. coli libraries and three biological replicates of libraries transformed into yeast. (D) Example correlation plots of DNA read counts of promoter variants for E. coli and yeast WT biological replicates. Pearson r and number (N) of compared variants are shown. (E) Bulk primer extension for RNA produced from promoter variant libraries transformed into WT yeast. ‘No GFP’ control used RNA from an untransformed strain. ‘No RNA’ control used a sample of nuclease-free water. Dots represent three biological replicates. Bars are mean ± standard deviation of the mean. (F) TSS usage based on TSS-seq read lengths from transformed libraries. Dots represent three biological replicates. Bars are mean ± standard deviation of the mean. Distributions are similar to those in E. Note that primer extension will blur usage into the adjacent upstream position because of some level of non-templated addition of C to RNA 5′ ends. (G) Heat scatter plots of coefficient of variation (CoV, y axis) versus total RNA reads per promoter variant in each Pol II MASTER library. A cutoff of CoV = 0.5 was used to filter out higher-variance variants. (H) Heat scatter plots of relative expression versus TSS efficiency of major TSSs per promoter variant, with contour lines indicating deciles of data. Number (N) of promoter variants with [−1, +1] relative expression values (log2) and corresponding percentage of total promoter variants are shown.
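The CoV filter described in panel G can be illustrated with a minimal Python sketch. It is not the authors' code: the caption does not state which per-replicate quantity the CoV is computed on, whether a sample or population standard deviation is used, or whether the comparison is strict, so the sketch assumes per-replicate values per variant, the sample standard deviation and a "less than or equal to" comparison. The function names and the example read counts are hypothetical.

```python
from statistics import mean, stdev

COV_CUTOFF = 0.5  # cutoff stated in the caption


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (an assumed definition)."""
    m = mean(values)
    if m == 0:
        return float("inf")  # no signal: treat as maximally variable
    return stdev(values) / m


def filter_by_cov(values_by_variant, cutoff=COV_CUTOFF):
    """Keep variants whose replicate values have CoV <= cutoff."""
    return {
        variant: values
        for variant, values in values_by_variant.items()
        if coefficient_of_variation(values) <= cutoff
    }


# Hypothetical values for three biological replicates of two variants.
example = {
    "ACGTACGTA": [100, 110, 90],  # consistent across replicates
    "TTGACCGTA": [5, 60, 200],    # highly variable across replicates
}
kept = filter_by_cov(example)
```

In this toy input only the first variant passes the filter, because its replicate values are tightly clustered around the mean.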
Extended Data Fig. 2 Surrounding sequence of TSSs modulates initiation efficiency.
(A) +1 TSS efficiency of all −7 to −2 sequences within each N₋₈N₋₁N₊₁ motif in WT, rank ordered by efficiency of the A₋₈C₋₁A₊₁ version, shown as a heat map. The x axis is ordered based on median efficiency for each N₋₈N₋₁N₊₁ motif group, as shown in Fig. 2B. Spearman’s rank correlation tests between the A₋₈C₋₁A₊₁ group and all groups are shown beneath the heat map. (B) Efficiencies of designed +1 TSSs grouped by base identities between −8 and +1 positions. Statistical analyses by Kruskal–Wallis with Dunn’s multiple comparisons test for base preference at individual positions relative to +1 TSS are shown beneath plots. Lines represent median values of subgroups. ****, P ≤ 0.0001; ***, P ≤ 0.001; **, P ≤ 0.01; *, P ≤ 0.05. (C) Histogram showing the distribution of measured efficiencies for all designed −8 to +4 TSSs of all promoter variants from ‘AYR’, ‘BYR’ and ‘ARY’ libraries in WT. Dashed line marks the 5% efficiency cutoff. (D) A₊₂G₊₃G₊₄ motif enrichment is apparent for the top 10% most efficient designed −8 TSSs. A(/G)₊₂G(/C)₊₃G(/C)₊₄ motif enrichment was observed for the top 10% most efficient −8 TSSs but not for the next 10% most efficient TSSs. A(/G)₊₁ enrichment observed for the top 20% most efficient TSSs is consistent with the +1 R preference of TSSs. Numbers (N) of variants assessed are shown. Sequence logos were generated using WebLogo 3. Bars represent an approximate Bayesian 95% confidence interval. (E) An A at position −9 results in different sequence preferences at position −8. The dataset of designed +4 TSSs deriving from ‘AYR’, ‘BYR’ and ‘ARY’ libraries was used to detect the −9/−8 interaction. All variants were divided into 16 subgroups defined by bases at positions −9 and −8 relative to the designed +4 TSS, and then their TSS efficiencies were plotted. Lines represent median values of subgroups. (F) An A at position −8 results in different sequence preferences at position −7. The dataset of designed +1 TSSs deriving from ‘AYR’ and ‘BYR’ libraries was used to detect the −8/−7 interaction. Calculations are the same as for the −9/−8 interaction described in E.
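The subgroup analysis in panels E and F (16 subgroups defined by the bases at two adjacent positions, summarized by their medians) can be sketched as follows. This is an illustrative sketch only: the record layout, the function name and the mapping from promoter positions to string indices are assumptions, not taken from the authors' code.

```python
from collections import defaultdict
from statistics import median

BASES = "ACGT"


def median_efficiency_by_pair(records, idx_a, idx_b):
    """Group (sequence, efficiency) records by the bases found at two string
    indices and return the median efficiency of each of the 16 subgroups.
    Subgroups with no observations map to None."""
    groups = defaultdict(list)
    for seq, eff in records:
        groups[(seq[idx_a], seq[idx_b])].append(eff)
    return {
        (a, b): (median(groups[(a, b)]) if (a, b) in groups else None)
        for a in BASES
        for b in BASES
    }


# Hypothetical records: index 0 stands for position -9, index 1 for -8.
records = [("AT", 10.0), ("AT", 20.0), ("AA", 2.0), ("CT", 30.0)]
medians = median_efficiency_by_pair(records, 0, 1)
```

Comparing the medians across the four bases at the second position, separately for each base at the first position, is one simple way to see whether the preference at one position depends on the neighboring base.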
Extended Data Fig. 3 High level of reproducibility of library variants in Pol II mutants.
(A) Histograms showing the distribution of measured efficiencies for all designed −8 to +4 TSSs for MASTER libraries in Pol II mutants. Dashed lines mark the 5% efficiency cutoff with number (N) of TSS variants shown. (B) TSS usage distributions at designed −10 to +25 TSSs for MASTER libraries in Pol II mutants. Dots represent three biological replicates. Bars are mean ± standard deviation. (C) Hierarchical clustering of Pearson correlation coefficients of TSS efficiencies for major TSSs (designed +1 TSS for ‘AYR’ and ‘BYR’ libraries, +2 TSS for ‘ARY’ library) for WT or Pol II mutants illustrated as a heat map for three biological replicates. (D) Example correlation plots of TSS efficiency of major TSSs between representative biological replicates. Pearson r and number (N) of compared variants are shown. (E) Plots of CoV versus total RNA reads (three biological replicates) for Pol II mutants. The red dashed lines mark the CoV = 0.5 cutoff, an arbitrary cutoff for promoters with reasonable reproducibility across replicates. G1097D replicates contain outliers because this mutant is susceptible to genetic suppressors. A suppressor present in one biological replicate generates a high CoV, which allows the affected variants to be filtered out.
Extended Data Fig. 4 Pol II mutants alter TSS efficiency in general.
(A) TSS efficiency distributions of designed +1 TSSs of Pol II mutants for base subgroups at individual positions relative to +1. Identical analysis as in Extended Data Fig. 2B for WT Pol II. (B) Pol II GOF G1097D showed a greater increase in efficiency than GOF allele E1103G at upstream TSSs (designed −32 and −8 TSSs), while E1103G showed stronger effects at the designed +1 TSS than G1097D. (C) Pol II initiation sequence preference in Pol II mutants. Identical analysis as in Fig. 3B for WT Pol II. (D) Motif enrichment for the top 10% most efficient −8 TSSs for Pol II mutants. Identical motif enrichment analysis as in Extended Data Fig. 2D top panel for WT Pol II. Numbers (N) of variants assessed are indicated. Bars represent an approximate Bayesian 95% confidence interval.
Extended Data Fig. 5 High level of reproducibility of TSS usage and efficiency upon MPA treatment.
(A) TSS usage distributions at designed −10 to +25 TSSs in WT ‘NYR’ library (mixed ‘AYR’ and ‘BYR’ libraries) treated with 100% ethanol (EtOH) or with 20 μg/ml MPA. MPA treatment shifted TSS usage downstream relative to EtOH treatment. Dots represent three biological replicates. Bars are mean ± standard deviation of the mean. (B) Hierarchical clustering of Pearson correlation coefficients of TSS efficiencies for designed +1 TSS for three biological replicates for MPA or EtOH treatment, illustrated as a heat map. (C) Hierarchical clustering of Pearson correlation coefficients of TSS efficiencies for all genome positions within defined promoter windows with ≥3 reads in each replicate, illustrated as a heat map. (D) Correlation plots for combined biological replicates for TSS efficiency upon MPA treatment (y axes) versus EtOH treatment (x axes) for all TSSs ≥ 2% efficiency in the 25%–75% of the distribution for a curated set of 5,979 yeast promoters (see Methods). TSSs are separated into groups depending on base identity at positions −3 (control) or positions +1 to +6.
Extended Data Fig. 6 Modeling identifies sequence features for TSS selection in WT and Pol II mutants.
(A) Overview of TSS efficiency modeling. (1) TSS efficiencies including designed −8 to +2 and +4 TSSs deriving from ‘AYR’, ‘BYR’ and ‘ARY’ libraries were pooled for modeling. (2) Sequences from −11 to +9 relative to variant TSSs were extracted. (3) To identify robust features, a forward stepwise selection strategy coupled with five-fold cross-validation for logistic regression was used, with random splitting into training (80%) and test (20%) sets. Stepwise regression started with a constant term only, and variables were added one step at a time until a stopping criterion was met. Additive terms (sequences at positions −11 to +9) and interactions were tested in stages. Model performance was evaluated with R². The stopping criterion for adding additional variables was an increase in R² of < 0.01. (4) A logistic regression model containing selected robust features was trained using the training set and then evaluated with the test set. (B) Comparison of measured efficiencies and predicted efficiencies. Model performance R² on the entire test set and number (N) of data points shown in the plot are indicated. (C) Principal component analysis (PCA) for parameters of models trained using individual replicates of WT and Pol II mutants. Close clustering of individual replicates indicates that models are not overfit. The top 15 contributing variables are shown. GOF and LOF mutants were separated from WT by the first principal component. GOF G1097D and E1103G were further distinguished by the second principal component through additional position +2 information, which is consistent with results in Extended Data Fig. 4D, where G1097D and E1103G differentially altered +2 sequence enrichment. (D) A scatter plot comparing measured and predicted TSS efficiencies of all positions within 5,979 known genomic promoter windows (ref. 21) with available measured efficiency. Pearson r and number (N) of compared variants are shown. Most promoter positions (82%, 1,678,406 out of 2,047,205) showed no observed efficiency, which is expected because TSSs need to be specified by a core promoter and scanning occurs over some distance downstream.
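The forward stepwise selection in step (3) of panel A can be sketched in Python as a greedy loop with the stated stopping criterion (an R² gain below 0.01). This is a generic illustration under stated assumptions, not the authors' pipeline: the `score` callable stands in for a cross-validated R² from a fitted logistic regression, the staged testing of additive terms and interactions is not modeled, and the feature names and toy contributions are hypothetical.

```python
def forward_stepwise(candidates, score, min_gain=0.01):
    """Greedy forward selection.

    candidates: iterable of feature names.
    score: callable taking a tuple of selected features and returning a
        performance value such as cross-validated R^2 (caller-supplied).
    min_gain: stop when the best single addition improves the score by
        less than this amount.
    """
    selected = []
    remaining = list(candidates)
    best = score(tuple(selected))  # model with a constant term only
    while remaining:
        # Score every one-feature extension of the current model.
        trials = [(score(tuple(selected + [f])), f) for f in remaining]
        new_best, feature = max(trials, key=lambda t: t[0])
        if new_best - best < min_gain:
            break  # stopping criterion met
        selected.append(feature)
        remaining.remove(feature)
        best = new_best
    return selected, best


# Toy stand-in for cross-validated R^2: purely additive, made-up gains.
TOY_GAIN = {"pos-8": 0.30, "pos-1": 0.20, "pos+1": 0.25, "pos+5": 0.004}


def toy_score(features):
    return sum(TOY_GAIN[f] for f in features)


chosen, final_r2 = forward_stepwise(TOY_GAIN, toy_score)
```

With these toy gains the loop adds the three informative features in order of decreasing gain and stops before adding the last one, whose contribution falls below the 0.01 threshold.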
Source data
Source Data Figs. 1–7: statistical source data.
Source Data Extended Data Figs. 1–6/Tables 1–6: statistical source data.
About this article
Cite this article
Zhu, Y., Vvedenskaya, I.O., Sze, SH. et al. Quantitative analysis of transcription start site selection reveals control by DNA sequence, RNA polymerase II activity and NTP levels. Nat Struct Mol Biol 31, 190–202 (2024). https://doi.org/10.1038/s41594-023-01171-9
DOI: https://doi.org/10.1038/s41594-023-01171-9