Quantitative analysis of transcription start site selection reveals control by DNA sequence, RNA polymerase II activity and NTP levels

Zhu, Yunye; Vvedenskaya, Irina O.; Sze, Sing-Hoi; Nickels, Bryce E.; Kaplan, Craig D.

doi:10.1038/s41594-023-01171-9

Article
Published: 04 January 2024

Nature Structural & Molecular Biology volume 31, pages 190–202 (2024)

Abstract

Transcription start site (TSS) selection is a key step in gene expression and occurs at many promoter positions over a wide range of efficiencies. Here we develop a massively parallel reporter assay to quantitatively dissect contributions of promoter sequence, nucleoside triphosphate substrate levels and RNA polymerase II (Pol II) activity to TSS selection by ‘promoter scanning’ in Saccharomyces cerevisiae (Pol II MAssively Systematic Transcript End Readout, ‘Pol II MASTER’). Using Pol II MASTER, we measure the efficiency of Pol II initiation at 1,000,000 individual TSS sequences in a defined promoter context. Pol II MASTER confirms proposed critical qualities of S. cerevisiae TSS −8, −1 and +1 positions, quantitatively, in a controlled promoter context. Pol II MASTER extends quantitative analysis to surrounding sequences and determines that they tune initiation over a wide range of efficiencies. These results enabled the development of a predictive model for initiation efficiency based on sequence. We show that genetic perturbation of Pol II catalytic activity alters initiation efficiency mostly independently of TSS sequence, but selectively modulates preference for the initiating nucleotide. Intriguingly, we find that Pol II initiation efficiency is directly sensitive to guanosine-5′-triphosphate levels at the first five transcript positions and to cytosine-5′-triphosphate and uridine-5′-triphosphate levels at the second position genome wide. These results suggest individual nucleoside triphosphate levels can have transcript-specific effects on initiation, representing a cryptic layer of potential regulation at the level of Pol II biochemical properties. The results establish Pol II MASTER as a method for quantitative dissection of transcription initiation in eukaryotes.

**Fig. 1: A high-throughput system for studying TSS selection.**

**Fig. 2: Wide range of initiation efficiency measured using MASTER.**

**Fig. 3: Sequence contributions to Pol II initiation efficiency from positions surrounding the TSS.**

**Fig. 4: Pol II mutants alter TSS efficiency for all possible TSS motifs while showing selective effects for base at +1.**

**Fig. 5: Pol II initiation is sensitive to NTP pools.**

**Fig. 6: Learned initiation preferences are predictive of TSS efficiencies at genomic promoters.**

**Fig. 7: Logistic regression model of DNA sequence contribution to TSS efficiency.**

**Fig. 8: Model for TSS sequence preference regulated by multiple mechanisms.**

Data availability

Raw sequencing data generated in this study are available in the National Center for Biotechnology Information BioProject, under the accession number PRJNA766624. Processed data are available in GEO, under the accession number GSE185290. Source data are provided with this paper.

Code availability

Code for analyses in this study is provided at https://github.com/Kaplan-Lab-Pitt/PolII_MASTER-TSS_sequence.

Acknowledgements

The authors thank Kaplan lab members for helpful comments on the manuscript. We are deeply grateful to C. Qiu for discussions and comments on this project. We acknowledge J. Kinney (Cold Spring Harbor Laboratory) and S. Li (Statistical Consulting Center at University of Pittsburgh) for discussions on modeling. We thank C.D. Johnson, R. Metz (Texas A&M AgriLife Genomics and Bioinformatics Service), A. Hillhouse (Texas A&M Institute for Genome Sciences & Society), W.A. MacDonald and R. Elbakri (the University of Pittsburgh Health Sciences Sequencing Core at UPMC Children’s Hospital of Pittsburgh), Y. Pan (the UPMC Genome Center), D. Kumar (the Waksman Genomics Core Facility at Rutgers University) and L. Freeman (Illumina) for discussions and advice regarding deep sequencing strategies. We thank S.J. Mullett and S.G. Wendell (Metabolomics and Lipidomics Core, NIHS10OD023402) for performing NTP measurements. We acknowledge support from National Institutes of Health (NIH) grant R01GM097260 to C.D.K. for the early part of this work and NIH grants R01GM120450 and R35GM144116 to C.D.K. and R35GM118059 to B.E.N. This research was supported in part by the University of Pittsburgh Center for Research Computing, RRID:SCR_022735, through the resources provided. Specifically, this work used the HTC cluster, which is supported by NIH award number S10OD028483.

Author information

Authors and Affiliations

Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, USA
Yunye Zhu & Craig D. Kaplan
Department of Genetics and Waksman Institute, Rutgers University, Piscataway, NJ, USA
Irina O. Vvedenskaya & Bryce E. Nickels
Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, USA
Sing-Hoi Sze
Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA
Sing-Hoi Sze

Contributions

Y.Z. designed the project, performed experiments, analyzed data, made figures, and drafted and revised the manuscript. I.O.V. generated libraries for TSS-seq. S.-H.S. analyzed data and discussed the analysis. B.E.N. provided funding and methodology for TSS-seq, and revised the manuscript. C.D.K. conceived and designed the project, guided analyses and interpretation of data, provided funding and revised the manuscript.

Corresponding author

Correspondence to Craig D. Kaplan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Structural & Molecular Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Sara Osman was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 High level of reproducibility and coverage depth of library variants.

(A) Schematic of experimental approach. Promoter libraries with almost all possible sequences within a 9 nt randomized region were constructed on plasmids. Libraries were designated ‘AYR’, ‘BYR’, and ‘ARY’ based on randomized region composition. Plasmids were amplified in E. coli and transformed into yeast with wild-type or mutated Pol II. DNA and RNA were extracted and prepared for DNA-seq and TSS-seq. (B) Base frequencies at positions within the randomized region of promoter variants demonstrate unbiased synthesis of randomized regions. Bars are mean +/- standard deviation of the mean for promoter variants in WT and four Pol II mutants. (C) Heatmap illustrating hierarchical clustering of Pearson correlation coefficients of reads per promoter variant for E. coli libraries and three biological replicates of libraries transformed into yeast. (D) Example correlation plots of DNA read counts of promoter variants for E. coli and yeast WT biological replicates. Pearson r and number (N) of compared variants are shown. (E) Bulk primer extension for RNA produced from promoter variant libraries transformed into WT yeast. ‘No GFP’ control used RNA from an untransformed strain. ‘No RNA’ control used a sample of nuclease-free water. Dots represent three biological replicates. Bars are mean +/- standard deviation of the mean. (F) TSS usage based on TSS-seq read lengths from transformed libraries. Dots represent three biological replicates. Bars are mean +/- standard deviation of the mean. Distributions are similar to the distributions in E. Note that primer extension will blur usage into the adjacent upstream position due to some level of non-templated addition of C to RNA 5′ ends. (G) Heat scatter plots of coefficient of variation (CoV, y axis) versus total RNA reads per promoter variant in each Pol II MASTER library. A cutoff of CoV = 0.5 was used to filter out higher-variance variants. (H) Heat scatter plots of relative expression versus TSS efficiency of major TSSs per promoter variant, with contour lines indicating deciles of data. Number (N) of promoter variants with [−1, +1] relative expression values (log₂) and corresponding percentage of total promoter variants are shown.
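
The CoV filter in panel G can be illustrated with a minimal sketch. It assumes CoV is the sample standard deviation divided by the mean across biological replicates and that variants at or below the cutoff are retained; the caption does not state either detail, and the function names and read counts here are hypothetical rather than taken from the authors' code.

```python
from statistics import mean, stdev


def coefficient_of_variation(replicate_values):
    """Sample standard deviation divided by the mean for one promoter variant."""
    m = mean(replicate_values)
    if m == 0:
        # No signal in any replicate: treat as maximally variable (assumption).
        return float("inf")
    return stdev(replicate_values) / m


def filter_by_cov(values_by_variant, cutoff=0.5):
    """Keep variants whose CoV across replicates is at or below the cutoff."""
    return {
        variant: values
        for variant, values in values_by_variant.items()
        if coefficient_of_variation(values) <= cutoff
    }


# Hypothetical read counts for three biological replicates per variant.
example = {
    "ACGTACGCA": [100, 110, 90],   # consistent across replicates
    "ATTTGGACG": [5, 60, 200],     # highly variable across replicates
}
kept = filter_by_cov(example)
print(sorted(kept))
```

In this toy input only the first variant survives the filter, since the second has a CoV above 1.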

Extended Data Fig. 2 Surrounding sequence of TSSs modulates initiation efficiency.

(A) +1 TSS efficiency of all −7 to −2 sequences within each N₋₈N₋₁N₊₁ motif in WT, rank ordered by efficiency of the A₋₈C₋₁A₊₁ version, shown as a heat map. The x-axis is ordered based on median efficiency for each N₋₈N₋₁N₊₁ motif group, as shown in Fig. 2B. Spearman’s rank correlation tests between the A₋₈C₋₁A₊₁ group and all groups are shown beneath the heat map. (B) Efficiencies of designed +1 TSSs grouped by base identities between −8 and +1 positions. Statistical analyses by Kruskal-Wallis with Dunn’s multiple comparisons test for base preference at individual positions relative to +1 TSS are shown beneath plots. Lines represent median values of subgroups. ****, P ≤ 0.0001; ***, P ≤ 0.001; **, P ≤ 0.01; *, P ≤ 0.05. (C) Histogram showing the distribution of measured efficiencies for all designed −8 to +4 TSSs of all promoter variants from ‘AYR’, ‘BYR’ and ‘ARY’ libraries in WT. Dashed line marks the 5% efficiency cutoff. (D) A₊₂G₊₃G₊₄ motif enrichment is apparent for the top 10% most efficient designed −8 TSSs. A(/G)₊₂G(/C)₊₃G(/C)₊₄ motif enrichment was observed for the top 10% most efficient −8 TSSs but not for the next 10% most efficient TSSs. A(/G)₊₁ enrichment observed for the top 20% most efficient TSSs is consistent with the +1 R preference of TSSs. Numbers (N) of variants assessed are shown. Sequence logos were generated using WebLogo 3. Bars represent an approximate Bayesian 95% confidence interval. (E) An A at position −9 results in different sequence preferences at position −8. The dataset of designed +4 TSSs deriving from ‘AYR’, ‘BYR’ and ‘ARY’ libraries was used to detect the −9/−8 interaction. All variants were divided into 16 subgroups defined by bases at positions −9 and −8 relative to the designed +4 TSS, and then their TSS efficiencies were plotted. Lines represent median values of subgroups. (F) An A at position −8 results in different sequence preferences at position −7. The dataset of designed +1 TSSs deriving from ‘AYR’ and ‘BYR’ libraries was used to detect the −8/−7 interaction. Calculations are the same as for the −9/−8 interaction described in E.
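
To make the efficiency values and the 5% cutoff in panel C concrete, here is a minimal sketch. It assumes the definition commonly used for promoter scanning, which this page does not spell out: efficiency at a position is the usage at that position divided by the usage at that position plus all downstream positions. The function name and read counts are hypothetical.

```python
def tss_efficiencies(usage):
    """Per-position TSS efficiency for read counts ordered upstream to downstream.

    Efficiency at position i = usage[i] / (usage[i] + sum of downstream usage).
    """
    efficiencies = []
    remaining = sum(usage)  # reads at the current position and everything downstream
    for reads in usage:
        efficiencies.append(reads / remaining if remaining > 0 else 0.0)
        remaining -= reads
    return efficiencies


# Hypothetical read counts at four consecutive TSS positions of one promoter variant.
usage = [10, 30, 40, 20]
eff = tss_efficiencies(usage)

# Assumed form of the 5% efficiency cutoff (whether the bound is inclusive is a guess).
above_cutoff = [e >= 0.05 for e in eff]
print(eff)
```

Under this definition the most downstream position that has any reads always has efficiency 1, which is one reason efficiency and raw usage distributions look different.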

Extended Data Fig. 3 High level of reproducibility of library variants in Pol II mutants.

(A) Histograms showing the distribution of measured efficiencies for all designed −8 to +4 TSSs for MASTER libraries in Pol II mutants. Dashed lines mark the 5% efficiency cutoff with number (N) of TSS variants shown. (B) TSS usage distributions at designed −10 to +25 TSSs for MASTER libraries in Pol II mutants. Dots represent three biological replicates. Bars are mean +/- standard deviation. (C) Hierarchical clustering of Pearson correlation coefficients of TSS efficiencies for major TSSs (designed +1 TSS for ‘AYR’ and ‘BYR’ libraries, +2 TSS for ‘ARY’ library) for WT or Pol II mutants illustrated as a heat map for three biological replicates. (D) Example correlation plots of TSS efficiency of major TSSs between representative biological replicates. Pearson r and number (N) of compared variants are shown. (E) Plots of CoV versus total RNA reads (three biological replicates) for Pol II mutants. The red dashed lines mark the CoV = 0.5 cutoff, an arbitrary cutoff for promoters with reasonable reproducibility across replicates. G1097D replicates contain outliers because this mutant is susceptible to genetic suppressors. A suppressor existing in one biological replicate generates a high CoV allowing filtering.
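
The replicate comparisons in panels C and D rest on Pearson correlation over variants measured in both replicates. The sketch below shows that calculation with the standard library only; the dictionary layout, function names and efficiency values are illustrative assumptions, not the authors' pipeline.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def replicate_correlation(rep_a, rep_b):
    """Return (Pearson r, N) over the variants present in both replicates."""
    shared = sorted(set(rep_a) & set(rep_b))
    r = pearson_r([rep_a[v] for v in shared], [rep_b[v] for v in shared])
    return r, len(shared)


# Hypothetical TSS efficiencies keyed by promoter variant for two replicates.
rep1 = {"v1": 0.10, "v2": 0.20, "v3": 0.30, "v4": 0.90}
rep2 = {"v1": 0.20, "v2": 0.40, "v3": 0.60}
r, n = replicate_correlation(rep1, rep2)
print(round(r, 6), n)
```

Reporting N alongside r, as the panels do, matters because only variants passing filters in both replicates enter the comparison.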

Source data

Extended Data Fig. 4 Pol II mutants alter TSS efficiency in general.

(A) TSS efficiency distributions of designed +1 TSSs of Pol II mutants for base subgroups at individual positions relative to +1. Identical analysis as in Extended Data Fig. 2B for WT Pol II. (B) Pol II GOF G1097D showed a greater increase in efficiency than GOF allele E1103G at upstream TSSs (designed −32 and −8 TSSs), while E1103G showed stronger effects at the designed +1 TSS than G1097D. (C) Pol II initiation sequence preference in Pol II mutants. Identical analysis as in Fig. 3B for WT Pol II. (D) Motif enrichment for the top 10% most efficient −8 TSSs for Pol II mutants. Identical motif enrichment analysis as in Extended Data Fig. 2D top panel for WT Pol II. Numbers (N) of variants assessed are indicated. Bars represent an approximate Bayesian 95% confidence interval.

Source data

Extended Data Fig. 5 High level of reproducibility of TSS usage and efficiency upon MPA treatment.

(A) TSS usage distributions at designed −10 to +25 TSSs in WT ‘NYR’ library (mixed ‘AYR’ and ‘BYR’ libraries) treated with 100% ethanol (EtOH) or with 20 μg/ml MPA. MPA treatment shifted TSS usage downstream relative to EtOH treatment. Dots represent three biological replicates. Bars are mean +/- standard deviation of the mean. (B) Hierarchical clustering of Pearson correlation coefficients of TSS efficiencies for designed +1 TSS for three biological replicates for MPA or EtOH treatment, illustrated as a heat map. (C) Hierarchical clustering of Pearson correlation coefficients of TSS efficiencies for all genome positions within defined promoter windows with ≥ 3 reads in each replicate, illustrated as a heat map. (D) Correlation plots for combined biological replicates for TSS efficiency upon MPA treatment (y axes) versus EtOH treatment (x axes) for all TSSs ≥ 2% efficiency in the 25%–75% of the distribution for a curated set of 5,979 yeast promoters (see Methods). TSSs are separated into groups depending on base identity at positions −3 (control) or positions +1 to +6.

Source data

Extended Data Fig. 6 Modeling identifies sequence features for TSS selection in WT and Pol II mutants.

(A) Overview of TSS efficiency modeling. (1) TSS efficiencies including designed −8 to +2 and +4 TSSs deriving from ‘AYR’, ‘BYR’ and ‘ARY’ libraries were pooled for modeling. (2) Sequences from −11 to +9 relative to variant TSSs were extracted. (3) To identify robust features, a forward stepwise selection strategy coupled with five-fold cross-validation for logistic regression was used, with random splitting into training (80%) and test (20%) sets. Stepwise regression was performed starting from a constant term only, adding variables stepwise until a stopping criterion was met. Additive terms (sequences at positions −11 to +9) and interactions were tested in stages. Model performance was evaluated with R². The stopping criterion for adding additional variables was an increase in R² of < 0.01. (4) A logistic regression model containing selected robust features was trained using the training set and then evaluated with the test set. (B) Comparison of measured efficiencies and predicted efficiencies. Model performance (R²) on the entire test set and the number (N) of data points shown in the plot are indicated. (C) Principal component analysis (PCA) for parameters of models trained using individual replicates of WT and Pol II mutants. Close clustering of individual replicates indicates that models are not overfit. The top 15 contributing variables are shown. GOF and LOF mutants were separated from WT by the first principal component. GOF G1097D and E1103G were further distinguished by the second principal component through additional position +2 information, which is consistent with results in Extended Data Fig. 4D, where G1097D and E1103G differentially altered +2 sequence enrichment. (D) Scatter plot comparing measured and predicted TSS efficiencies of all positions within 5,979 known genomic promoter windows²¹ with available measured efficiency. Pearson r and number (N) of compared variants are shown. Most promoter positions (82%, 1,678,406 out of 2,047,205) showed no observed efficiency, which is expected because TSSs need to be specified by a core promoter and scanning occurs over some distance downstream.
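
The forward stepwise selection with a stopping criterion described in panel A, step (3), can be sketched in simplified form. This is only an illustration of the selection loop: in the workflow described above the score would be a cross-validated R² from a logistic regression, whereas here `score_fn` is a placeholder callable and `toy_gains` is invented data. The names `forward_stepwise`, `score_fn`, `min_gain` and the feature labels are hypothetical.

```python
def forward_stepwise(candidates, score_fn, min_gain=0.01):
    """Greedy forward selection.

    Starts from an empty feature set (constant-term-only model) and, at each
    step, adds the candidate giving the largest score. Stops when the best
    achievable improvement is smaller than `min_gain`.
    """
    selected = []
    best_score = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        scored = [(score_fn(selected + [feature]), feature) for feature in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score - best_score < min_gain:
            break  # stopping criterion: improvement below threshold
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Placeholder score (assumption): each feature contributes a fixed, additive
# amount. A real score_fn would fit and cross-validate a model instead.
toy_gains = {"pos-8": 0.30, "pos+1": 0.20, "pos-1": 0.10, "pos+5": 0.004}


def toy_score(features):
    return sum(toy_gains[f] for f in features)


selected, score = forward_stepwise(toy_gains, toy_score, min_gain=0.01)
print(selected, round(score, 3))
```

In this toy run the feature with a gain of 0.004 is never added, because its improvement falls below the 0.01 threshold, which mirrors how weakly contributing positions would be excluded from the final model.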

Source data

Supplementary information

Reporting Summary

Peer Review File

Supplementary Table

Supplementary Tables 1–7.

Source data

Source Data Fig. 1

Statistical source data.

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Source Data Fig. 6

Statistical source data.

Source Data Fig. 7

Statistical source data.

Source Data Extended Data Fig. 1/Table 1

Statistical source data.

Source Data Extended Data Fig. 2/Table 2

Statistical source data.

Source Data Extended Data Fig. 3/Table 3

Statistical source data.

Source Data Extended Data Fig. 4/Table 4

Statistical source data.

Source Data Extended Data Fig. 5/Table 5

Statistical source data.

Source Data Extended Data Fig. 6/Table 6

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhu, Y., Vvedenskaya, I.O., Sze, SH. et al. Quantitative analysis of transcription start site selection reveals control by DNA sequence, RNA polymerase II activity and NTP levels. Nat Struct Mol Biol 31, 190–202 (2024). https://doi.org/10.1038/s41594-023-01171-9

Received: 14 December 2021
Accepted: 03 November 2023
Published: 04 January 2024
Issue Date: January 2024
DOI: https://doi.org/10.1038/s41594-023-01171-9
