Abstract
Functionally linked genes in bacterial and archaeal genomes are often organized into operons. However, the composition and architecture of operons are highly variable and frequently differ even among closely related genomes. Therefore, to efficiently extract reliable functional predictions for uncharacterized genes from comparative analyses of the rapidly growing genomic databases, dedicated computational approaches are required. We developed a protocol to systematically and automatically identify genes that are likely to be functionally associated with a ‘bait’ gene or locus by using relevance metrics. Given a set of bait loci and a genomic database defined by the user, this protocol compares the genomic neighborhoods of the baits to identify genes that are likely to be functionally linked to the baits by calculating the abundance of a given gene within and outside the bait neighborhoods and the distance to the bait. We exemplify the performance of the protocol with three test cases, namely, genes linked to CRISPR–Cas systems using the ‘CRISPRicity’ metric, genes associated with archaeal proviruses and genes linked to Argonaute genes in halobacteria. The protocol can be run by users with basic computational skills. The computational cost depends on the sizes of the genomic dataset and the list of reference loci and can vary from one CPU-hour to hundreds of hours on a supercomputer.
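The scoring idea described in the abstract (how often a gene occurs inside versus outside the bait neighborhoods, and how far it sits from the bait) can be illustrated with a minimal sketch. This is not the Icity pipeline's code. The function name `relevance_scores`, the input layout, the gene-count `window` and the unweighted fraction are all assumptions made for illustration, and the published pipeline may differ in every detail.

```python
from collections import defaultdict
from statistics import median


def relevance_scores(genes, baits, window=10):
    """Score protein clusters by their association with bait genes.

    Illustrative sketch only; names and defaults are hypothetical.

    genes  : iterable of (contig, gene_index, cluster_id) tuples
    baits  : dict mapping contig -> list of bait gene indices
    window : neighborhood half-width, counted in genes (assumed value)

    Returns {cluster_id: (fraction_in_neighborhoods, median_distance)}.
    median_distance is None for clusters never found near a bait.
    """
    total = defaultdict(int)   # all occurrences of each cluster
    near = defaultdict(list)   # bait distances of occurrences inside a neighborhood

    for contig, index, cluster in genes:
        total[cluster] += 1
        bait_positions = baits.get(contig, [])
        if not bait_positions:
            continue  # contig has no bait, so this occurrence is "outside"
        distance = min(abs(index - b) for b in bait_positions)
        if distance <= window:
            near[cluster].append(distance)

    scores = {}
    for cluster, n_total in total.items():
        distances = near.get(cluster, [])
        fraction = len(distances) / n_total
        scores[cluster] = (fraction, median(distances) if distances else None)
    return scores


# Toy input: two contigs, one bait gene on each (made-up data).
example_genes = [
    ("c1", 0, "A"), ("c1", 1, "bait"), ("c1", 2, "B"), ("c1", 50, "A"),
    ("c2", 0, "B"), ("c2", 1, "bait"), ("c2", 30, "C"),
]
example_baits = {"c1": [1], "c2": [1]}
example_scores = relevance_scores(example_genes, example_baits, window=5)
```

In this toy input, cluster `B` always occurs next to a bait and so receives the highest fraction, cluster `A` is split between bait-adjacent and distant positions, and cluster `C` never occurs near a bait. A real analysis would also need to account for uneven sampling of closely related genomes, which this sketch ignores.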
Data and code availability
The source code of the Icity pipeline is freely available under the open-source NCBI license (https://github.com/ncbi/ICITY/blob/master/LICENSE.txt) at the NCBI GitHub page (https://github.com/ncbi/ICITY). Questions and comments can be addressed to the authors through the GitHub portal or by email. All example datasets and the results of their analysis presented in the paper are available at the NCBI FTP site (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/).
Acknowledgements
This research was funded through the Intramural Research Program of the National Institutes of Health of the USA, the RFBR (for research project 18-34-00012, S.A.S.), a systems biology fellowship funded by Philip Morris Sales and Marketing (to S.A.S.), the Ministry of Education and Science of the Russian Federation (subsidy agreement 14.606.21.0006; project identifier RFMEFI60617X0006; to S.A.S. and K.V.S.) and an NIH grant (R01 GM10407 to K.V.S.).
Author information
Contributions
S.A.S., Y.I.W. and E.V.K. designed the protocol; S.A.S. implemented the protocol with assistance from G.F.; S.A.S., G.F., K.S.M., Y.I.W. and K.V.S. analyzed the results; S.A.S., Y.I.W. and E.V.K. wrote the manuscript, which was read, edited and approved by all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Protocols thanks Christine Pourcel and other anonymous reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
Key references using this protocol
Shmakov, S. A., Makarova, K. S., Wolf, Y. I., Severinov, K. V. & Koonin, E. V. Proc. Natl Acad. Sci. USA 115, E5307–E5316 (2018): https://doi.org/10.1073/pnas.1803440115
Shmakov, S. et al. Nat. Rev. Microbiol. 15, 169–182 (2017): https://doi.org/10.1038/nrmicro.2016.184
Shmakov, S. et al. Mol. Cell 60, 385–397 (2015): https://doi.org/10.1016/j.molcel.2015.10.008
Supplementary information
Supplementary Data
Step-by-step explanation of the RunClust.sh script used for protein clustering, and an alternative iterative clustering procedure.
About this article
Cite this article
Shmakov, S.A., Faure, G., Makarova, K.S. et al. Systematic prediction of functionally linked genes in bacterial and archaeal genomes. Nat Protoc 14, 3013–3031 (2019). https://doi.org/10.1038/s41596-019-0211-1
DOI: https://doi.org/10.1038/s41596-019-0211-1