SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing

J Comput Biol. 2012 May;19(5):455-77.

doi: 10.1089/cmb.2012.0021. Epub 2012 Apr 16.

Authors

Anton Bankevich¹, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, Alexey V Pyshkin, Alexander V Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A Alekseyev, Pavel A Pevzner

Affiliation

¹ Algorithmic Biology Laboratory, St. Petersburg Academic University, Russian Academy of Sciences, St. Petersburg, Russia.

PMID: 22506599
PMCID: PMC3342519

Abstract

The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.

Figures

**FIG. 1.**
Notation for decomposing a de Bruijn graph into non-branching paths (h-paths). A de Bruijn graph on reads ACCGTCAGAAT and ACCGTGAGAAT with edge size k = 4, vertex size k − 1 = 3. *Hubs* are shown as solid vertices, while vertices with indegree 1, outdegree 1 are hollow. An *h-path* CGT → GTG → TGA → GAG → AGA (shown in red with h-edge denoted α) defines an *h-read* CGTGAGA. The whole path is denoted path(α), and consists of |path(α)| = 4 edges. The edges on this path have *offsets* 1, 2, 3, 4, as indicated. Each edge can be addressed by its path's h-edge and its offset.
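
The decomposition described in this caption can be reproduced on the two example reads. The following is a minimal sketch, not the SPAdes implementation: the function names `h_paths` and `h_read` are hypothetical, and the graph is built naively from the distinct k-mers of the reads, with a hub taken to be any vertex that does not have indegree 1 and outdegree 1.

```python
from collections import defaultdict

def h_paths(reads, k):
    """Split DB(reads, k) into maximal non-branching paths (lists of k-mer edges)."""
    edges = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    out_edges, indeg = defaultdict(list), defaultdict(int)
    for e in sorted(edges):
        out_edges[e[:-1]].append(e)   # the edge leaves its (k-1)-mer prefix
        indeg[e[1:]] += 1             # and enters its (k-1)-mer suffix
    vertices = sorted(set(out_edges) | set(indeg))

    def is_hub(v):
        # Hollow vertices in the figure have indegree 1 and outdegree 1.
        return not (indeg[v] == 1 and len(out_edges[v]) == 1)

    paths = []
    for v in vertices:
        if not is_hub(v):
            continue
        for first in list(out_edges[v]):
            path = [first]
            # Extend through non-hub vertices until the next hub is reached.
            while not is_hub(path[-1][1:]):
                path.append(out_edges[path[-1][1:]][0])
            paths.append(path)
    return paths  # cycles made only of non-hub vertices are not reported

def h_read(path):
    """Spell the string of a path: first k-mer plus the last letter of each later edge."""
    return path[0] + "".join(e[-1] for e in path[1:])

paths = h_paths(["ACCGTCAGAAT", "ACCGTGAGAAT"], 4)
for p in paths:
    print(h_read(p), len(p))
```

On these reads the sketch reports four h-paths, including the one spelling CGTGAGA with 4 edges, in agreement with the caption.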

**FIG. 2.**
Standard and multisized de Bruijn graph. A circular Genome CATCAGATAGGA is covered by a set Reads consisting of nine 4-mers, {ACAT, CATC, ATCA, TCAG, CAGA, AGAT, GATA, TAGG, GGAC}. Three out of 12 possible 4-mers from Genome are missing from Reads (namely {ATAG, AGGA, GACA}), but all 3-mers from Genome are present in Reads. **(A)** The outside circle shows a separate black edge for each 3-mer from Reads. Dotted red lines indicate vertices that will be glued. The inner circle shows the result of applying some of the glues. **(B)** The graph DB(Reads, 3) resulting from all the glues is tangled. The three h-paths of length 2 in this graph (shown in blue) correspond to h-reads ATAG, AGGA, and GACA. Thus Reads₃,₄ contains all 4-mers from Genome. **(C)** The outside circle shows a separate edge for each of the nine 4-mer reads. The next inner circle shows the graph DB(Reads, 4), and the innermost circle represents the Genome. The graph DB(Reads, 4) is fragmented into 3 connected components. **(D)** The multisized de Bruijn graph DB(Reads, 3, 4).
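
The counts in this caption can be checked directly. The sketch below is illustrative only: the helper names `circular_kmers`, `kmers`, and `count_components` are hypothetical, and connectivity is measured on the undirected version of the graph with a small union-find.

```python
def circular_kmers(genome, k):
    """All k-mers of a circular genome, including those wrapping around the end."""
    ext = genome + genome[:k - 1]
    return {ext[i:i + k] for i in range(len(genome))}

def kmers(reads, k):
    """All k-mers occurring in a collection of linear reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def count_components(edge_kmers):
    """Weakly connected components of the de Bruijn graph whose edges are the given k-mers."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for e in edge_kmers:
        a, b = find(e[:-1]), find(e[1:])
        if a != b:
            parent[a] = b
    return len({find(v) for v in list(parent)})

genome = "CATCAGATAGGA"
reads = ["ACAT", "CATC", "ATCA", "TCAG", "CAGA", "AGAT", "GATA", "TAGG", "GGAC"]

missing = circular_kmers(genome, 4) - set(reads)
all_3mers_present = circular_kmers(genome, 3) <= kmers(reads, 3)

print(sorted(missing))                 # the 4-mers of Genome absent from Reads
print(all_3mers_present)               # whether every 3-mer of Genome occurs in Reads
print(count_components(set(reads)))    # components of DB(Reads, 4)
```

The three missing 4-mers are exactly what breaks DB(Reads, 4) into three components, while the complete 3-mer coverage is what lets the smaller k recover them.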

**FIG. 3.**
Stage 2 of SPAdes. **(A)** Bireads are decomposed into pairs of k-mers with estimated genomic distances (B-transformation). These are tabulated into histograms of estimated genomic distances between pairs of h-edges (H-transformation), and peaks in the histograms and paths in the graph are used to reveal the actual genomic distances between h-edges (A-transformation). This may be converted back to genomic distances between k-mers on pairs of h-paths (E-transformation, used for presentation purposes but not needed in the implementation). **(B)** The h-biedge histogram (α|β,*) corresponding to the exact h-biedge (α|β, 72163) in the assembly graph. path(α) is an h-path (condensed edge representing 72049 edges) in the upper right, and path(β) is an h-path (representing 46097 edges) at the lower left. The histogram collects all distance estimates between α and β derived from bireads. The h-biedge histogram was smoothed using the Fast Fourier Transform (red curve). The peak in the smoothed histogram (marked red) well approximates the actual distance (marked blue). **(C)** The h-biedge histogram (α|β,*) estimates the distance between h-edges α and β (|path(α)| = 46054, |path(β)| = 72). Because of the directed cycle formed by the two h-paths of lengths 72 and 13, there may be multiple walks through the graph between α and β. The h-biedge histogram has been divided into clusters with centers at 46060 and 46145. Thus SPAdes transforms the entire histogram into two h-biedges: (α|β, 46054) and (α|β, 46139).
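
The numbers in panel (C) are consistent with a simple reading: each extra lap around the directed cycle adds 72 + 13 = 85 to the distance, and each cluster center is moved to the nearest distance realizable by a walk. The sketch below illustrates only that arithmetic. It assumes the shortest walk has length |path(α)| = 46054, the function name `snap_to_walk_lengths` and the `max_laps` bound are hypothetical, and the caption does not state that SPAdes computes the h-biedges this way.

```python
def snap_to_walk_lengths(centers, base, cycle_len, max_laps=10):
    """Move each histogram cluster center to the nearest candidate walk length.

    Candidate lengths are base + m * cycle_len for m = 0..max_laps
    (an assumed model of walks that traverse the cycle m extra times).
    """
    candidates = [base + m * cycle_len for m in range(max_laps + 1)]
    return [min(candidates, key=lambda c: abs(c - x)) for x in centers]

# Cluster centers and path lengths taken from the caption of panel (C).
snapped = snap_to_walk_lengths([46060, 46145], base=46054, cycle_len=72 + 13)
print(snapped)  # [46054, 46139]
```

Both snapped values match the two h-biedges given in the caption, (α|β, 46054) and (α|β, 46139).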

**FIG. 4.**
Construction of the paired assembly graph for bireads sampled from a circular 24 bp genome Genome = ACGTCAAGTTCTGACGTGGGTTCT (single reads referred to as Reads). The de Bruijn graph DB(Reads, 4) has four hubs (ACG, CGT, GTT, and TCT) **(A)** and six h-paths P₁, …, P₆, with lengths 1, …, 6 respectively **(B)**. The h-edge of path *P_i*, denoted *α_i*, is its first edge. The cycle C in DB(Reads, 4) that spells Genome passes through the h-paths in order P₁, P₆, P₂, P₄, P₁, P₅, P₂, P₃ (P₁ and P₂ represent repeats). **(B)** Reads are paired with separation d = 5, yielding estimated distances D between various h-edges *α_i* and *α_j*, denoted as the h-biedge (*α_i*|*α_j*, D). The 13 h-biedges constructed from all bireads are listed in the figure. **(C)** The *rectangle diagram* of h-biedge (α₆|α₂, 6) is a rectangle (R₃) with sides P₆ and P₂ and 45° line segment y = x + (d − D) = x − 1, from (1, 0) to (3, 2). Point (1, 0) is labeled by bivertex (GTC|GTT) formed by vertex 1 (GTC) in path P₆ and vertex 0 (GTT) in path P₂. Point (3, 2) is labeled by bivertex (CAA|TCT) formed by vertex 3 (CAA) in path P₆ and vertex 2 (TCT) in path P₂. **(D)** Vertices to glue together from different rectangle diagrams are indicated by dotted red lines. **(E)** Rectangles glued into a 24 × 24 grid, yielding a cycle (blue path) through the genome.
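
The rectangle diagram in panel (C) can be traced by hand or with a few lines of code. The sketch below is an illustration under stated assumptions: the h-read strings for P₆ (CGTCAAGTT) and P₂ (GTTCT) are derived here from Genome rather than quoted from the source, the shift of the line y = x − 1 is taken to be d − D = 5 − 6, and the function names are hypothetical.

```python
def path_vertices(h_read, k):
    """The (k-1)-mer vertices visited by an h-path, in order."""
    return [h_read[i:i + k - 1] for i in range(len(h_read) - k + 2)]

def rectangle_diagonal(px, py, d, D):
    """Points (x, y) on the line y = x + (d - D) inside the rectangle px x py,
    each labeled with its bivertex (vertex x of px, vertex y of py)."""
    shift = d - D
    return [((x, x + shift), (px[x], py[x + shift]))
            for x in range(len(px)) if 0 <= x + shift < len(py)]

genome = "ACGTCAAGTTCTGACGTGGGTTCT"
d = 5
p6 = path_vertices("CGTCAAGTT", 4)   # assumed h-read of P6 (6 edges)
p2 = path_vertices("GTTCT", 4)       # assumed h-read of P2 (2 edges)

diagonal = rectangle_diagonal(p6, p2, d=d, D=6)
for point, bivertex in diagonal:
    print(point, bivertex)

# Sanity check: each bivertex (a|b) occurs in the circular genome with b
# starting exactly d positions after a.
ext = genome + genome
consistent = all(
    any(ext[i:i + 3] == a and ext[i + d:i + d + 3] == b for i in range(len(genome)))
    for _, (a, b) in diagonal
)
print(consistent)
```

The segment runs from (1, 0), labeled (GTC|GTT), to (3, 2), labeled (CAA|TCT), matching the endpoints named in the caption.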

**FIG. 5.**
Topology of selected features within a de Bruijn graph. The red h-path, P, is the current h-path under consideration for deletion (tip removal, chimeric h-path removal) or projection to another path (bulge corremoval). The blue path(s), Q, are alternative paths. Note that other factors such as lengths and coverage are considered in addition to topology, and that the graphs continue past the regions shown. **(A)** A potential *bulge*. Q may contain hubs within it, though P does not. **(B)** A potential *tip*; h-path P starts or ends at a vertex of total degree 1 (represented as solid), and there is an alternative h-path Q. **(C)** A potential *chimeric h-path*. There must be alternative h-paths Q₁, Q₂ both for the entrance and the exit to P. **(D)** h-path is a *repeat*. Note that P starts with a vertex of outdegree one and ends with a vertex of indegree one and has no alternative h-path. These degree conditions differentiate it from (A,B,C).
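
The topological condition for a potential tip in panel (B) can be expressed as a small filter. This is a sketch of the topology test only: as the caption notes, lengths and coverage are also considered, and those are not modeled here. The function name `potential_tips`, the representation of each h-path as a (start hub, end hub) pair, and the extra dead-end path CGT → GTA in the toy input are all hypothetical.

```python
from collections import Counter

def potential_tips(h_paths):
    """Return h-paths that start or end at a vertex of total degree 1
    and have an alternative h-path at their other end.

    h_paths is a list of (start_vertex, end_vertex) pairs, one per h-path;
    parallel h-paths appear as repeated pairs.
    """
    outdeg = Counter(s for s, _ in h_paths)
    indeg = Counter(e for _, e in h_paths)
    tips = []
    for s, e in h_paths:
        dead_start = indeg[s] + outdeg[s] == 1
        dead_end = indeg[e] + outdeg[e] == 1
        alt_from_start = outdeg[s] > 1   # another h-path leaves the same start
        alt_into_end = indeg[e] > 1      # another h-path enters the same end
        if (dead_end and alt_from_start) or (dead_start and alt_into_end):
            tips.append((s, e))
    return tips

# Hubs and h-paths of the FIG. 1 graph, plus one invented dead-end branch.
toy = [("ACC", "CGT"), ("CGT", "AGA"), ("CGT", "AGA"), ("AGA", "AAT"),
       ("CGT", "GTA")]
print(potential_tips(toy))
```

In this toy input only the invented branch is flagged. The paths at the two ends of the linear sequence also touch a degree-1 vertex, but they have no alternative h-path and so are left alone.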

**FIG. 6.**
Example of parallel paths and bulges. Edges are labeled as vectors and vertices are labeled as scalars.

Cited by

Genome assembly of an endemic butterfly (Minois Aurata) shed light on the genetic mechanisms underlying ecological adaptation to arid valley habitat.
Hu W, Wang Y, Chen X, Huang J, Kuang J, Wang L, Mao K, Dou L. BMC Genomics. 2024 Nov 23;25(1):1134. doi: 10.1186/s12864-024-11058-8. PMID: 39580397. Free PMC article.
Phage cocktail amikacin combination as a potential therapy for bacteremia associated with carbapenemase producing colistin resistant Klebsiella pneumoniae.
Shein AMS, Wannigama DL, Hurst C, Monk PN, Amarasiri M, Wongsurawat T, Jenjaroenpun P, Phattharapornjaroen P, Ditcham WGF, Ounjai P, Saethang T, Chantaravisoot N, Badavath VN, Luk-In S, Nilgate S, Rirerm U, Srisakul S, Kueakulpattana N, Laowansiri M, Rad SMAH, Wacharapluesadee S, Rodpan A, Ngamwongsatit N, Thammahong A, Ishikawa H, Storer RJ, Leelahavanichkul A, Ragupathi NKD, Classen AY, Kanjanabuch T, Pletzer D, Miyanaga K, Cui L, Hamamoto H, Higgins PG, Kicic A, Chatsuwan T, Hongsing P, Abe S. Sci Rep. 2024 Nov 22;14(1):28992. doi: 10.1038/s41598-024-79924-9. PMID: 39578508. Free PMC article.
Methylocystis borbori sp. nov., a novel methanotrophic bacterium from the sludge of a freshwater lake and its metabolic properties.
Kaparullina EN, Agafonova NV, Suzina NE, Grouzdev DS, Doronina NV. Antonie Van Leeuwenhoek. 2024 Nov 22;118(1):29. doi: 10.1007/s10482-024-02039-8. PMID: 39576297.
The first complete chloroplast genome of Halodule uninervis (Forssk.) Boiss. 1882 (Cymodoceaceae).
Liu M, Shan R, Wu J, Shi Y, Zhao M. Mitochondrial DNA B Resour. 2024 Nov 20;9(11):1564-1568. doi: 10.1080/23802359.2024.2429635. eCollection 2024. PMID: 39575205. Free PMC article.
The Complete Genome Sequences of 3 Species of Malacoctenus Fishes (Labrisomidae, Blenniiformes).
Pedraza-Marron CD, Acero AP, Pirro S, Betancur R. Biodivers Genomes. 2024;2024:10.56179/001c.125783. doi: 10.56179/001c.125783. Epub 2024 Nov 10. PMID: 39574512. Free PMC article.

Grants and funding

3P41RR024851-02S1/RR/NCRR NIH HHS/United States
