Link to original content: https://pubmed.ncbi.nlm.nih.gov/21304684/
Stand Genomic Sci. 2010 Jan 28;2(1):117-34.
doi: 10.4056/sigs.531120.

Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison

Alexander F Auch et al. Stand Genomic Sci. 2010.

Abstract

The pragmatic species concept for Bacteria and Archaea is ultimately based on DNA-DNA hybridization (DDH). While enabling the taxonomist, in principle, to obtain an estimate of the overall similarity between the genomes of two strains, this technique is tedious and error-prone and cannot be used to incrementally build up a comparative database. Recent technological progress in the area of genome sequencing calls for bioinformatics methods to replace the wet-lab DDH by in-silico genome-to-genome comparison. Here we investigate state-of-the-art methods for inferring whole-genome distances in their ability to mimic DDH. Algorithms to efficiently determine high-scoring segment pairs or maximally unique matches perform well as a basis of inferring intergenomic distances. The examined distance functions, which are able to cope with heavily reduced genomes and repetitive sequence regions, outperform previously described ones regarding the correlation with and error ratios in emulating DDH. Simulation of incompletely sequenced genomes indicates that some distance formulas are very robust against missing fractions of genomic information. Digitally derived genome-to-genome distances show a better correlation with 16S rRNA gene sequence distances than DDH values. The future perspectives of genome-informed taxonomy are discussed, and the investigated methods are made available as a web service for genome-based species delineation.

Keywords: Archaea; BLAST; Bacteria; GBDP; MUMmer; genomics; phylogeny; species concept; taxonomy.

Figures

Figure 1
Non-parametric correlations of DDH values with distances based on HSP determination. Each boxplot comprises Kendall's correlation coefficients calculated for all GBDP distance formulas and the selected program for HSP determination. Note that lower values indicate better DDH prediction. For abbreviations of the method-parameter combinations, see Table 1; 'NF' was added if HSP filtering was not applied.
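As a rough illustration of the rank correlation summarized in these boxplots, the following Python sketch computes Kendall's coefficient for a few DDH values and intergenomic distances. The numbers are invented for illustration, the function name `kendall_tau_a` is hypothetical, and the simple tau-a variant (no correction for ties) is an assumption; the caption does not state which variant was used.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = 0
    discordant = 0
    # Compare every pair of observations once.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count as neither in tau-a.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented example values: DDH similarities (percent) and distances.
ddh = [95.0, 80.0, 60.0, 30.0]
dist = [0.02, 0.10, 0.25, 0.60]
tau = kendall_tau_a(ddh, dist)
print(tau)
```

Because a distance shrinks as DDH similarity grows, a distance that ranks all pairs exactly as DDH does gives a coefficient of -1, which is consistent with the note that lower values indicate better prediction.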
Figure 2
Error ratios of distances based on HSP determination in predicting whether DDH values are at least as large as 70% or lower. Each boxplot comprises error ratios calculated for all GBDP distance formulas and the selected program for HSP determination. For abbreviations of the method-parameter combinations, see Table 1; 'NF' was added if HSP filtering was not applied.
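One plausible reading of the error ratio is the fraction of genome pairs for which a distance cutoff and the 70% DDH cutoff disagree. The Python sketch below follows that reading; the function name `error_ratio`, the example values and the distance cutoffs are hypothetical, and the exact definition used in the paper may differ.

```python
def error_ratio(ddh, dist, dist_threshold, ddh_threshold=70.0):
    """Fraction of pairs where the distance-based call contradicts DDH.

    DDH call: same species if DDH >= ddh_threshold.
    Distance call: same species if distance <= dist_threshold.
    """
    if len(ddh) != len(dist) or not ddh:
        raise ValueError("ddh and dist must be non-empty and of equal length")
    errors = sum(
        (d >= ddh_threshold) != (g <= dist_threshold)
        for d, g in zip(ddh, dist)
    )
    return errors / len(ddh)


# Invented example values for four genome pairs.
ddh = [95.0, 72.0, 68.0, 30.0]
dist = [0.02, 0.20, 0.15, 0.60]
ratio_strict = error_ratio(ddh, dist, dist_threshold=0.18)
ratio_loose = error_ratio(ddh, dist, dist_threshold=0.30)
print(ratio_strict, ratio_loose)
```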
Figure 3
Non-parametric correlations of DDH values with distances based on MUMmer. Each boxplot comprises Kendall's correlation coefficients calculated for all GBDP distance formulas, the greedy-with-trimming algorithm and the selected MUMmer parameter combination. Note that lower values indicate better DDH prediction. The x-axis comprises the three investigated series of minimum MUM lengths ranging between 16 and 50, one series per setting for the treatment of matches in both forward and reverse strand, abbreviated max, mum and ref, respectively. For the meaning of these abbreviations, see Table 1.
Figure 4
Error ratios of distances based on MUMmer in predicting whether DDH values are at least as large as 70% or lower. Each boxplot comprises error ratios calculated for all GBDP distance formulas and the selected MUMmer parameter combination. The x-axis comprises the three investigated series of minimum MUM lengths ranging between 16 and 50, one series per setting for the treatment of matches in both forward and reverse strand, abbreviated max, mum and ref, respectively. For the meaning of these abbreviations, see Table 1.
Figure 5
Scatterplot of DDH (x-axis) and GGD inferred with BLAT under default values without HSP filtering, greedy-with-trimming and formula (3). The vertical line indicates the 70% DDH threshold, the horizontal line indicates the GGD threshold that results in the lowest error ratio for these settings. The result of a robust-line fit using the R function line() is also shown, indicating that regression and determination of a threshold with lowest error ratio may differ in their estimates of a GGD threshold to replace the DDH 70% cutoff.
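The horizontal line in this plot, the GGD threshold with the lowest error ratio, can be approximated by scanning candidate cutoffs. The Python sketch below is a minimal illustration under stated assumptions: the names `count_errors` and `best_distance_threshold` are hypothetical, the data are invented, and using the observed distances as the candidate cutoffs is an assumption, not the authors' documented procedure.

```python
def count_errors(ddh, dist, dist_threshold, ddh_threshold=70.0):
    """Number of pairs where the distance call contradicts the DDH call."""
    return sum(
        (d >= ddh_threshold) != (g <= dist_threshold)
        for d, g in zip(ddh, dist)
    )


def best_distance_threshold(ddh, dist, ddh_threshold=70.0):
    """Return (threshold, error_ratio) minimizing the error ratio.

    Candidates are the observed distances plus -inf, which stands for
    'never call two genomes the same species'. Ties go to the smallest
    candidate.
    """
    if len(ddh) != len(dist) or not ddh:
        raise ValueError("ddh and dist must be non-empty and of equal length")
    candidates = [float("-inf")] + sorted(set(dist))
    best = min(
        candidates,
        key=lambda t: count_errors(ddh, dist, t, ddh_threshold),
    )
    return best, count_errors(ddh, dist, best, ddh_threshold) / len(ddh)


# Invented example values for six genome pairs.
ddh = [95.0, 85.0, 72.0, 65.0, 40.0, 20.0]
dist = [0.02, 0.05, 0.12, 0.10, 0.30, 0.55]
threshold, ratio = best_distance_threshold(ddh, dist)
print(threshold, ratio)
```

A cutoff chosen this way depends only on how many pairs fall on the wrong side of the two lines, whereas a regression line depends on the overall trend, which is one reason the two estimates can differ.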
Figure 6
Scatterplot of DDH (x-axis) and GGD inferred with MUMmer using a minimal MUM length of 44 bp, no trimming and formula (1). The vertical line indicates the 70% DDH threshold, the horizontal line indicates the GGD threshold that results in the lowest error ratio. For the robust-line fit, see caption of Figure 5. The plot shows that MUMmer-based GGD more rapidly reaches saturation than HSP-based GGD and that it is not linearly related to DDH, underlining the need to rely on rank-based correlation coefficients for an unbiased comparison of distance functions.
Figure 7
Boxplots showing the error ratios in predicting a DDH value ≤70% or >70% if applied to genomes artificially made incomplete. GGD were calculated using NCBI-BLASTN without filtering and all ten GBDP distance functions. The x-axis indicates the combination of the retained proportion of the genome (in percent) and the distance formula; F1, F2 and F3 refer to formulas (1), (2) and (3) as described above.
Figure 8
Boxplots showing the Euclidean distances between the original GGD inferred from complete genomes and those inferred from genomes artificially made incomplete. GGD were calculated using NCBI-BLASTN without filtering and all ten GBDP distance functions. The x-axis indicates the combination of the retained proportion of the genome (in percent) and the distance formula; F1, F2 and F3 refer to formulas (1), (2) and (3) as described above.
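To make the robustness measure concrete, the following Python sketch computes the Euclidean distance between a vector of GGD values from complete genomes and the corresponding vector from reduced genomes. Treating each vector as one GGD value per genome pair is an assumption about how the comparison was set up, and the function name and the numbers are hypothetical.

```python
import math


def euclidean_distance(original, reduced):
    """Euclidean distance between two equally long vectors of GGD values."""
    if len(original) != len(reduced):
        raise ValueError("both vectors must cover the same genome pairs")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, reduced)))


# Invented GGD values for three genome pairs, before and after reduction.
ggd_complete = [0.10, 0.25, 0.60]
ggd_reduced = [0.13, 0.21, 0.60]
shift = euclidean_distance(ggd_complete, ggd_reduced)
print(shift)
```

A value near zero would mean that removing part of the genomes barely changed the inferred distances.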
