Abstract

With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.

INTRODUCTION

Sequence comparison continues to play crucial roles in molecular sequence analysis. The dominant approaches for sequence comparison are alignment-based including the Smith–Waterman algorithm [1] and BLAST [2]. Although alignment-based approaches generally yield excellent results when the molecular sequences of interest can be reliably aligned, their applications are limited when the sequences are divergent or come from different regions of various genomes and a reliable alignment cannot be obtained. Another drawback of alignment-based approaches is that they are generally time-consuming and thus, are limited in dealing with large-scale sequence data generated with the new sequencing technologies. The next-generation sequencing (NGS) technologies usually generate relatively short reads that can be difficult to assemble, and alignment-based approaches cannot be applied when the complete sequences are not known. Alignment-free sequence comparison approaches provide attractive alternatives when alignment-based approaches fail.

Studies of alignment-free sequence comparison began in the mid 80s [3, 4] and have been under continuous development. Two excellent reviews are available on this topic [5, 6]. Here we review new developments on alignment-free sequence comparison with emphasis on methods based on k-tuple (k-word, k-gram) frequencies. For studies on alignment-free sequence comparison based on other ideas such as chaos theory and common substrings, please see [6] for details.

Alignment-free sequence comparison has been successfully applied to many biological problems including (i) the study of evolution of organisms using whole-genome sequences, (ii) the evolution of regulatory sequences such as promoters, enhancers and inhibitors, (iii) the identification of cis-regulatory modules (CRM) and (iv) the comparison of metagenomic communities based on the sequence data using NGS technologies.

For alignment-free sequence comparison of two sequences using k-tuples, the first step is to count the number of occurrences of every k-tuple in both sequences separately and record these in a vector of k-tuple frequencies for each sequence; this counting can be carried out in linear time. Second, a measure d of difference between the two sequences A and B is defined based on the two frequency vectors. If the measure satisfies distance constraints, i.e. (a) formula and equality holds if and only if formula, and (b) for any sequences A, B and C, formula, then the measure is a distance measure. Otherwise, the measure is called a dissimilarity measure. Third, the sequences are then clustered based on the distance or dissimilarity measures and the resulting clusters are finally compared with current biological knowledge about the sequences to evaluate the effectiveness of the measures. Many measures have been developed over the years. Here we present a general review of such measures and their applications to molecular sequence analysis with an emphasis on NGS data.

The article is organized as follows. In Section 1, we review theoretical studies of the approximate distributions of the popular D2 statistic and of its power. As D2 may mainly measure background noise in each sequence separately, in Section 2 we describe adjusted similarity measures based on word counts. Section 3 focuses on alignment-free genome and metagenome comparison using NGS data. Section 4 contains a discussion and conclusions.

THEORETICAL STUDIES OF THE APPROXIMATE DISTRIBUTIONS OF formula AND ITS STATISTICAL POWER

Alignment-free sequence comparison by the number of word matches: the D2 statistic

The earliest development of alignment-free sequence comparison can be traced back to the 1980's. Blaisdell [4] used the likelihood ratio statistic or the corresponding Chi-square statistic, to determine the order of Markov chain (MC) of sequences of interest. More specifically, for a given sequence formula with letters from a finite alphabet formula, let formula be the number of occurrences of tuple w in A. The objective is to test if the sequence A can be modelled as a (k-2)-th order MC as opposed to a (k-1)-th order MC using the likelihood ratio statistic
or the Chi-square statistic
where formula is the expected count of the k-tuple formula under the (k-2)-th order MC, which can be estimated by
(1)

Both formula and formula have an approximate formula-distribution with formula degrees of freedom. Blaisdell [3] used similar ideas to test if a set of sequences have the same transition matrix, and thus are more likely to be related. These statistics can be considered as the origin of alignment-free sequence comparison statistics.

Torney et al. [7] used the number of k-tuple matches between two sequences A and B as a statistic to measure the similarity between them. Let
(2)

where formula and formula are the number of occurrences of tuple w in sequences A and B, respectively, and formula denotes the logical indicator: formula if event E is true, and 0 otherwise. The formula statistic has been used in many applications including sequence database searches [8] and clustering of expressed sequence tags [9]. Owing to its wide range of applications, extensive studies on the distributions of formula have been carried out.

The distribution of the D2 statistic under the null model of two independent sequences

Lippert et al. [10] studied the limiting distribution of formula under the independent identically distributed (i.i.d.) model for both sequences with the same nucleotide frequencies formula, where formula indicates the set of all the possible letters. When formula are not all equal, it was shown that when formula where n is the length of both sequences, formula has an approximate Poisson distribution, and when formula, formula has an approximate normal distribution. It was further suggested in [10] and explicitly proved in [11] that the variance of formula is dominated by the variance of the number of occurrences of each k-tuple in individual sequences. However, when formula, formula is approximately neither normal nor Poisson. Instead, formula tends to the sum of products of normal distributions.

The fundamental results in [10] were further extended to more general models for the sequences of interest [12–14]. Kantorovitz et al. [12] showed that formula is approximately normal as both k and n tend to infinity when the nucleotide frequencies are the same for both sequences. Burden et al. [14] extended the formula statistic to allow word matches with up to a certain number of mismatches and again showed that this new statistic is approximately normally distributed. Foret et al. [13] compared the empirical and theoretical distributions of formula and its variations and found that the approximations are consistent with the empirical distributions.

The power of the D2 statistic under the alternative model of two related sequences

The results from [10, 12–14] played key roles in estimating the statistical significance of formula between two sequences under the null hypothesis that the two sequences are unrelated. In contrast, these studies do not address what kind of relationship the statistic formula and its variants can detect, and what the statistical power is when the alternative hypothesis that the two sequences are related holds. To answer these questions, precise definitions of relatedness between the sequences of interest are needed. The alternative hypothesis depends on the scientific questions of interest. One of the key applications of alignment-free sequence comparison is to find CRMs that locate in the upstream regions of genes controlled by the same sets of transcription factors. Sequences in the same CRM tend to have the same transcription factor binding sites and thus share similar sets of k-tuples. Therefore, Reinert et al. [11] modelled the relatedness of sequences by the sharing of common formula, that is, word patterns that are significantly enriched in the sequences. They refer to the model as a common motif model. The model for each sequence consists of the following three components:

  1. The background sequences are modelled by an i.i.d. model.

  2. The foreground motifs are modelled by position weight matrices that give the nucleotide probability distribution at each position of the motifs. The foreground motif model can be easily extended to the situation that the nucleotides along the motifs depend on each other. The motifs can also be easily generalized to CRMs consisting of combinations of several motifs.

  3. The occurrences of the motifs are modelled as binomial random variables along the genome sequence, with formula denoting the probability that a motif instance starts at a nucleotide position; formula is referred as the formula. Once a motif is inserted, the nucleotide positions, which are now covered by the motif, are ignored, and the insertion process resumes at the end of the motif, so that inserted motifs do not overlap.

Intuitively, the relatedness between the two sequences increases with the motif density, and thus the power of a reasonable statistic should also increase with motif density. Simulation studies under the common motif model showed that for small values of k, the power of formula can be smaller than the type I error rate and can decrease with sequence length. Thus, formula is not an appropriate statistic to test the relationship between sequences under the common motif model for small values of k. When k is relatively large, the power does increase with sequence length. Thus, for many practical applications with relatively large values of k, formula can be useful. These observations were further proved theoretically in [15] based on the hidden Markov model for each sequence developed in [16].

ADJUSTED DISSIMILARITY MEASURES FOR ALIGNMENT-FREE SEQUENCE COMPARISON BASED ON K-TUPLE COUNTS

In addition to formula, many other distance and dissimilarity measures based on k-tuple counts have been proposed for the comparison of molecular sequences. Those include (a) normalization of formula by formula, (b) correlation of relative differences of the tuple counts from their expectations, (c) comparison of relative abundance of k-tuples and (d) comparison of k-tuple distributions based on Markov models.

Normalization of D2 by D2z

It was realized that the formula statistic defined in Equation (2) depends on the underlying sequence models [12]. Normalizing the formula statistic using its mean and standard deviation can potentially improve the power of detecting the relationships between the sequences and remove the biases due to the background models of the sequences. For normalization, the background models for the sequences of interest are needed. Kantorovitz et al. [12] modified the formula statistic to formula
(3)
where the expectation and variance of formula are calculated based on a Markov model for the sequences.

The authors compared the effectiveness of the formula statistic with several other distance or dissimilarity measures between the k-tuple vectors including (i) the Euclidean distance [3], (ii) Kullback-Leibler discrepancy (kld) [17], (iii) the linear Pearson correlation coefficient (lcc) between the count vectors, (iv) the cosine of the angle between the two k-tuple count vectors (cos) and (v) two other measures based on the approximate Poisson distribution of word counts [18]. Using four fly and three human CRMs as examples, it was shown that the formula statistic outperforms the statistics described above for the identification of CRMs [12]. Note though, the Euclidean, kld, lcc and cos measures need only one parameter, the value of k, while formula needs an additional parameter r, the order of the MC for the sequences. In the seven data sets studied, the combination of formula and formula yielded the best performance for formula in five of the data sets, and the combination of formula and formula has the best performance in two of the seven data sets.

Correlation of relative differences of k-tuple counts between two sequences

Instead of normalizing formula by formula, Hao and colleagues [19–21] considered the relative difference vector of the number of occurrences of every k-tuple w with its expected count under the (k-2)-th order MC model given in Equation (1) for each sequence. They then used the correlation coefficient between the relative difference vectors corresponding to two sequences to measure their similarity. We use the corresponding author's last name Hao as the short form of this measure to simplify the notation.
(4)
where formula and formula are defined as in Equation (1) based on sequences A and B, respectively.

The authors applied this dissimilarity measure to whole-genome phylogenetic analysis [21–25] and classification of metagenomic samples [26]. Liu et al. [27] used the composition vector approach to assign metagenomics reads to organisms or gene families. However, Jiang et al. [28] recently found that, although the formula statistic performs reasonably well with high-coverage data sets, it is not stable when the data are limited.

Alignment-free sequence comparison based on di-, tri- and tetra-nucleotide relative abundance

Karlin and colleagues observed that the relative dinucleotide frequencies defined by
(5)
where fa is the frequency of nucleotide a and, more generally, formula is the observed frequency of word pattern w, are relatively stable across different parts of the same genome and differ across different genomes [29–33]. Therefore, they proposed to use the l1-norm between the relative dinucleotide frequencies between two genome sequences as dissimilarity measure, that is,
(6)
This distance measure has been used to study the evolutionary relationships among viruses [29], bacteria [30], plasmids, prokaryotes [34] and eukaryotes [30, 32]. The dinucleotide frequencies have been extended to tri- and tetra-nucleotide biases [35, 36] as
and
where formula and formula indicate any letters. Under the MC model, the corresponding biases are defined by the ratio of the observed count of a tri- or tetra-nucleotide in the sequence over its expected count under the first and second order Markov models given in Equation (1). Similarly, the difference between two genomes can be calculated by the formula norm of the 64-dimensional vector formula or the 256-dimensional vector formula.

It can be seen that the formula dissimilarity measure is closely related to the dissimilarity measure developed from Karlin's group when the MC model was used to model the sequences. formula's group used one minus the correlation between the relative difference vectors while Karlin's group used the formula-norm between the relative difference vectors to measure the dissimilarity.

Alignment-free sequence comparison based on direct comparison of tuple frequencies with Markov models

Kim and colleagues [37–40] designed an approach based on the Jensen-Shannon entropy as a distance measure to study the evolutionary relationships of prokaryotes, Escherichia coli/Shigella, and viruses.

In a series of articles, Wang's group compared two sequences using the k-tuple frequency vectors coupled with MC models of order formula [41, 42]. The methods in [42] can be described as follows. Assuming an r-th order MC for each sequence, the transition probability matrix can be estimated using the sequence data. The probability of each k-tuple to occur in the sequence A can then be calculated based on the transition probabilities denoted as formula, where formula is the r-th order MC for sequence A. Similarly, the k-tuple probability distribution can be defined based on the second sequence B. The statistic formula is defined by the symmetric Jensen-Shannon divergence of the two probability measures formula and formula. They also considered a weighted version of formula denoted as formula that was defined by replacing formula with formula, where formula is the frequency of formula in sequence formula or B. It was shown in [41,42] that for the identification of CRMs with appropriate choices of k and r, the dissimilarity measure formula performs the best compared with the dissimilarity measures discussed above. We also tried S2 on the data sets used in [43] and found that S2 outperforms other dissimilarity measures by a significant margin for the identification of CRM sequences (see Supplementary Tables S1–S4).

The statistics formula and formula and their statistical power

Two relatively new normalization methods for the tuple counts have recently been proposed [11, 15]. The first statistic is based on the observation by Shepp [44] that for two independent normal random variables formula and formula with mean zero, formula is also normally distributed. We normalize formula and formula by
where formula and formula, and formula and formula are the probability of k-tuple w under the background model for sequences A and B, respectively. The statistic formula is defined by
(7)
where formula is the set of all k-tuples. The second statistic, formula, is based on the intuitive idea that the number of occurrences of tuple w is approximately Poisson and thus its mean and variance are approximately the same for relatively long tuples;
(8)

These statistics were used to test the null hypothesis H0 that the two sequences are not related versus the alternative hypothesis H1 that two sequences are related by the common motif model. It was shown by simulations [11] and theoretically [15] that the new statistics formula and formula have the following properties:

  1. The power of both formula and formula is generally higher than that of formula, and increases with the sequence length.

  2. The statistic formula has the highest power when the length of the tuples, k, equals the length of the inserted motif.

  3. When the sequence length is relatively short, the statistic formula is more powerful than formula, while when the sequence length is long, the power of formula is generally higher than that of formula.

Although formula and formula are powerful statistics for the comparison of genomic sequences, they have the drawback that their magnitudes depend on a variety of factors including sequence length and nucleotide frequencies. To overcome these problems, the statistics formula and formula are further normalized to formula and formula, respectively, with range from 0 to 1, where
and

When the two sequences are the same, both formula and formula equal to 0, while if they are anti-correlated, these statistics are close to 1. Therefore, they can be used as dissimilarity measures for two sequences and can be used to cluster a group of sequences of interest.

In addition to the alternative hypothesis that the two sequences share common motifs or CRMs, Reinert et al. [11] and Wan et al. [15] considered another alternative model formula, called the pattern transfer model, which relates two sequences by randomly transferring patterns from one sequence to the other. Under this model, however, the power of formula and formula first increases with sequence length and then approximates a value less than 1. Therefore, neither is appropriate for testing the alternative hypothesis formula. To overcome this problem, Liu et al. [45] developed two new statistics, formula and formula, corresponding to formula and formula, respectively. To define formula, the maximum similarity measured by formula between the fragment from formula to formula in sequence A and any fragment of length W in sequence B is calculated and it is denoted as formula. Similarly, let formula be the maximum similarity measured by formula between the fragment from formula to formula in sequence B and any fragments of length W in sequence A. Then

The statistic formula can be similarly defined by replacing formula with formula.

It was shown in [45] that the power of both formula and formula under the alternative model formula increases with sequence length and approximates 1 as sequence length tends to infinity.

Consideration of mismatched tuples

In the above studies, exact matches of the tuples are needed. During evolution, mutations can occur and hence it is natural to consider tuple matches allowing a giving number of mismatches. Also because genomic orientations of CRMs are usually unknown, both strands of the sequences need to be taken into account for alignment-free genome comparison. Thus, another line of extension of the statistics discussed above is the consideration of reverse complement and mismatched tuples [43]. For each tuple w, its neighbourhood, formula, is defined as the set of tuples with up to a certain number of mismatches with w. Each tuple formula in the neighbourhood formula is associated with a weight formula. The weighted tuple neighbourhood count formula for every tuple w in sequence A is defined by
The corresponding extensions of formula and formula in a mismatch model are given as
(9)
and
(10)
Here formula and formula is defined analogously. Göke et al. [43] also proposed another similarity measure formula defined as
(11)
where formula, the variance of formula, can be obtained as in [43]. The dissimilarity measure corresponding to formula is defined by formula.

The idea of using mismatched tuples can be applied to other dissimilarity measures such as the Jensen-Shannon divergence, formula, and the formula-distance between relative abundance vectors of two different sequences. The details are omitted here.

Empirical comparison of dissimilarity measures with mismatches

Because the consideration of mismatched tuples in alignment-free sequence comparison is relatively new, comprehensive evaluations of various dissimilarity measures allowing mismatched tuples have not been carried out yet. Here we use the comparison of CRM sequences as an example to study the effect of mismatch weight on formula, formula, formula and formula. We also consider two more statistics: the Jensen-Shannon information (JS) and S2 in [42] discussed above.

In our implementation, we combine both the reverse complement and one-word mismatches in [43], and the neighbourhood of a word w can be defined as
where formula is the reverse complement of w, and formula indicates the Hamming distance.

We used enhancers active in the forebrain, midbrain, limb and heart tissues of developing mouse embryo [43, 46, 47] to study the effectiveness of the different dissimilarity measures to cluster CRMs. These sequences form the set of formula samples. As in [12, 42], sequences of the same length as the positive samples are randomly chosen from the mouse genome, ensuring a maximum of formula of repetitive sequence. These sequences form the set of formula samples. As in [43], we randomly chose 500 sequences in both the positive set and the negative set. Each pair of sequences in the positive set was compared and so was each pair in the negative set using the dissimilarity measures described above. We tested if sequence pairs from the positive set were more similar than sequence pairs from the negative set.

For a given dissimilarity measure with a fixed mismatch weight, we calculated the area under the receiver operating characteristic curve (AUC-ROC) as follows. First, we randomly chose 500 positive sequences from the CRM sequences and 500 random sequences from the mouse genome as negative sequences. Second, we calculated the dissimilarity measures for all pairs of positive sequences and all pairs of negative sequences. Third, all the dissimilarity scores for the positive pairs and negative pairs were mixed together. If the dissimilarity score of a pair was lower than a given threshold, the pair was predicted as from the positive samples and otherwise, the pair was predicted as from the negative samples. Fourth, the predictions were compared with the real data to find the false positive rate and the true positive rate. By changing the threshold, the ROC curve can be plotted and thus the AUC score can be calculated. We repeated these four steps 25 times. We let the tuple size to be 4, 5 and 6, the order of MC to be 0, 1 and 2 and the mismatch weight to be 0, 0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75 and 1.00 as in [43]. Supplementary Tables S1–S4 give the average AUC-ROC scores for all the six dissimilarity measures for the four sets of CRM sequences.

We make the following three observations. First, the performances of formula, formula and formula are close and are better than that of JS and formula in general. Second, for formula, the performances of formula, formula and formula are the best when the mismatch weight is around 0.05, but the differences with respect to different mismatch weights are negligible. For formula or 6, the optimal performances are observed for the mismatch weight close to 1. These observations are consistent with that in [43]. Third, the statistic S2 of Dai et al. [42] performs surprisingly well when k = 6 with independent identically distributed model for the background sequences. One potential explanation for the good performance of S2 is that it does not consider k-tuples that are not present in the sequence. The number of such 6-tuples is large for CRM sequences of length around 1 kb.

ALIGNMENT-FREE GENOME AND METAGENOME COMPARISON USING NGS DATA

With the development of NGS technologies, many short sequence reads can be easily generated resulting in a huge amount of available sequencing data. Yet, the assembly of these sequence reads can be challenging owing to their short length. Sequence-signature–based methods for molecular sequence data analysis have the advantage over alignment-based methods in that they can be directly applied to NGS data. The sequence reads are homogeneously or heterogeneously sampled from the original genome. Here homogeneous means that the sequence reads start at each position of the genome with equal probability, while heterogeneous means that some regions may be preferentially sequenced.

To overcome these challenges, Song et al. [48] extended formula, formula and formula to develop new statistics to be applicable for NGS data and studied their corresponding power by both simulations and theoretical studies. Jiang et al. [28] used sequence signature methods to cluster microbial communities. The theoretical studies of the power of formula, formula and formula are based on the limiting distributions of tuple counts in NGS data [49]. It was shown that the qualitative relative performances of these statistics on a large set of short reads are the same as that for long sequences. However, owing to the additional randomness involved in the sampling of the reads from the genomes during NGS, the power of these statistics is lower than when the complete genomic sequences are known. When the sampling of reads is homogeneously distributed, it was shown that, as the sequencing depth increases, the power of these statistics for NGS data approximates that when the genome sequences are known. In addition, heterogeneous sampling of the reads along the genome can further decrease the power of formula and formula. On the other hand, NGS read length does not significantly affect their power. Further, the dissimilarity measures formula and formula were then extended to NGS data and were applied to cluster different tree species with unknown complete genome sequences [48, 50]. It was shown that among these three dissimilarity measures the resulting clustering tree based on the dissimilarity measure formula is the most consistent with the characteristics of the trees [48].

The three dissimilarity measures formula and formula were also used to compare complex microbial communities consisting of hundreds to thousands of species [28]; each microbial community was treated as a pan-genome. These dissimilarity measures were also compared with several other dissimilarity measures including the formula dissimilarity measure defined in Equation (4), measures based on relative di-, tri- and tetra-nucleotide frequencies as in Equation (6) [36], and the standard formula-measures between the frequency vectors.

Simulation studies were first used to see if these dissimilarity measures can recover the relationships among microbial communities using metagenomic short read data. Two types of relationships among the microbial communities were studied. First, they can be related through group relationships such as when the communities are divided into several groups where communities in each group have similar microbial species compositions. For any given dissimilarity measure, the dissimilarity between any pair of microbial communities can be calculated to form a dissimilarity matrix. Hierarchical clustering with average linkage was then used to cluster the microbial communities based on the dissimilarity matrix. The resulting clusters were compared with the simulated group relationships of the microbial communities. It was shown that among the three dissimilarity measures tested, the clustering tree derived based on formula is the most similar to the simulated group relationships among the microbial communities. Second, the microbial communities can be related through a gradient relationship for the abundance levels of microbial organisms. Principal coordinate analysis was then applied to the dissimilarity matrix. The correlation between the principal coordinate and the gradient can be calculated; high correlation indicates better performance of a dissimilarity measure. It was shown that when sequencing depth is low to moderate (1000–10 000 reads per community), the correlation is highest when based on formula and formula. When the sequencing depth is high, the correlations based on formula, formula and formula are similar and are higher than those based on other dissimilarity measures.

These dissimilarity measures were also used to analyse 39 fecal samples from 33 mammalian host species [51], 56 marine samples across the world [52] and 13 fecal samples from human individuals [53]. Using the formula dissimilarity measure, the fecal samples from carnivores can be separated from that of the herbivores. Even within the herbivores, the fecal samples from the hindgut-fermenting herbivores can be separated from that of the foregut-fermenting herbivores. In contrast, the fecal samples from the omnivore samples are diverse and are mixed together with the carnivore and herbivore samples. For the marine data, metagenomic samples from the same region tend to cluster together. For the human gut samples, the metagenomic samples from the adult can be separated from that of unweaned infants. These studies showed the importance of using alignment-free methods for the comparison of metagenomic samples.

In [28], the dissimilarity measure S2 of Dai et al. [42] was not evaluated with respect to the classification of microbial communities. Here we carried out the same analysis as in [28] with the three real data sets described above using S2, and the parsimony scores of the resulting trees are given in Supplementary Table S5 for the fecal samples of mammalian host species, Tables S6 and S7 for the open and coastal water samples, respectively, and Table S8 for the human fecal samples in the Supplementary Material, where the parsimony score is the minimum number of changes of the community labels needed to explain the clusters. The lower the parsimony score, the better the dissimilarity score is. In all these tables, only the results related to S2 are new and the results related to other statistics were presented in [28]. We found that for the comparison of microbial communities, formula outperforms S2, and formula has the best performance among all the dissimilarity measures we studied so far.

The programs for calculating most of the statistics described in this review are available online and the corresponding websites are given in Table 1.

DISCUSSION AND CONCLUSIONS

Alignment-based approaches for sequence comparison will continue to play dominant roles in genomic studies when the sequences of interest can be reliably aligned. On the other hand, alignment-free sequence comparison approaches have been shown to provide important information on the evolution of gene regulatory regions and the comparison of genomes and metagenomes in situations where reliable alignments are not available. Alignment-free approaches based on k-tuple counts are especially powerful for the analysis of NGS short read data from unknown genomes and metagenomes.

In this review, we summarized recent developments of alignment-free sequence comparison concentrating on approaches based on k-tuple count vectors. We emphasized the dissimilarity measures used for genome and metagenome comparisons, the statistical distributions of the measures and the power of these statistics when genomes of interest are related. Both theoretical studies and real data analysis showed that the newly developed statistics formula and formula are generally more powerful than the original statistic formula. The introduction of mismatched tuples in these statistics can further increase their power. For the comparison of relatively short (e.g. 1 kb) CRM sequences, the statistic S2 seems to perform well with appropriate choices of tuple length and order of MC for the sequences. For the comparison of long genome sequences and metagenomic communities, the formula dissimilarity measure seems to yield the best results.

Many other alignment-free sequence comparison approaches are available including those based on chaos game representation [54], common substrings between sequences [55], longest common words [56], the minimal words [57], the difference between the longest common and shortest absent words [58] and sequence representation based on natural vectors [59]. Also, there are recent results from the area of machine learning applied to alignment-free sequence comparison, see for example [60]. Because these approaches are based on different philosophies of sequence comparison from the k-tuple–based methods and it is not clear whether they can be applied to NGS data, we did not review them here. Further studies are needed to understand the advantages and disadvantages of the various alignment-free sequence comparison methods.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Key Points

  • Alignment-free sequence comparison is essential for the comparison of genomes based on NGS reads even if the reads are not from homologous regions.

  • Metagenomes can be effectively clustered based on NGS reads using sequence signatures.

  • Normalization of pattern counts by centralizing around their means can significantly increase the power of alignment-free genome comparison.

  • The newly developed dissimilarity measures formula and formula outperform others for genome and metagenome comparison.

Acknowledgements

We thank Dr Xuegong Zhang and Mr Bai Jiang for collaboration on the use of sequence signatures to compare microbial communities.

FUNDING

This work was supported by the US National Institutes of Health [P50HG002790, R21HG006199] and NSF [DMS-1043075, OCE 1136818]. M.D., K.S. and J.R. are supported by the National Natural Science Foundation of China [31171262, 11021463] and the National Key Basic Research Project of China [2009CB918503].

References

1
Smith
TF
Waterman
MS
Comparison of biosequences
Adv Appl Math
1981
, vol. 
2
 (pg. 
482
-
9
)
2
Altschul
SF
Gish
W
Miller
W
et al. 
Basic local alignment search tool
J Mol Biol
1990
, vol. 
215
 (pg. 
403
-
10
)
3
Blaisdell
BE
A measure of the similarity of sets of sequences not requiring sequence alignment
Proc Natl Acad Sci USA
1986
, vol. 
83
 (pg. 
5155
-
9
)
4
Blaisdell
BE
Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding
J Mol Evol
1985
, vol. 
21
 (pg. 
278
-
88
)
5
Vinga
S
Almeida
J
Alignment-free sequence comparison - a review
Bioinformatics
2003
, vol. 
19
 (pg. 
513
-
23
)
6
Vinga
S
Biological sequence analysis by vector-valued functions: revisiting alignment-free methodologies for DNA and protein classification
In: Advanced Computational Methods for Biocomputing and Bioimaging
2007
New York
Nova Science Publishers
(pg. 
71
-
107
)
7
Torney
DC
Burks
C
Davison
D
et al. 
Bell
GI
Marr
TG
Computation of d2: a measure of sequence dissimilarity
 
Computers and DNA1990:109–125: the Proceedings of the Interface between Computation Science and Nucleic Acid Sequencing Workshop, December 12–16, 1988 in Santa Fe, New Mexico, USA
8
Hide
W
Burke
J
Davison
DB
Biological evaluation of d2, an algorithm for high-performance sequence comparison
J Comput Biol
1994
, vol. 
1
 (pg. 
199
-
215
)
9
Miller
RT
Christoffels
AG
Gopalakrishnan
C
et al. 
A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base
Genome Res
1999
, vol. 
9
 (pg. 
1143
-
55
)
10
Lippert
RA
Huang
HY
Waterman
MS
Distributional regimes for the number of k-word matches between two random sequences
Proc Natl Acad Sci USA
2002
, vol. 
99
 (pg. 
13980
-
9
)
11
Reinert
G
Chew
D
Sun
F
et al. 
Alignment-free sequence comparison (I): statistics and power
J Comput Biol
2009
, vol. 
16
 (pg. 
1615
-
34
)
12
Kantorovitz
MR
Robinson
GE
Sinha
S
A statistical method for alignment-free comparison of regulatory sequences
Bioinformatics
2007
, vol. 
23
 (pg. 
I249
-
55
)
13
Foret
S
Wilson
SR
Burden
CJ
Empirical distribution of k-word matches in biological sequences
Pattern Recognit
2009
, vol. 
42
 (pg. 
539
-
548
)
14
Burden
CJ
Kantorovitz
MR
Wilson
SR
Approximate word matches between two random sequences
Ann Appl Probab
2008
, vol. 
18
 (pg. 
1
-
21
)
15
Wan
L
Reinert
G
Sun
F
et al. 
Alignment-free sequence comparison (II): theoretical power of comparison statistics
J Comput Biol
2010
, vol. 
17
 (pg. 
1467
-
90
)
16
Zhai
Z
Ku
S-Y
Luan
Y
et al. 
The power of detecting enriched patterns: an HMM approach
J Comput Biol
2010
, vol. 
17
 (pg. 
581
-
92
)
17
Wu
TJ
Hsieh
YC
Li
LA
Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition
Biometrics
2001
, vol. 
57
 (pg. 
441
-
8
)
18
Van Helden
J
Metrics for comparing regulatory sequences on the basis of pattern counts
Bioinformatics
2004
, vol. 
20
 (pg. 
399
-
406
)
19
Xu
Z
Hao
BL
CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes
Nucleic Acids Res
2009
, vol. 
37
 (pg. 
W174
-
8
)
20
Qi
J
Luo
H
Hao
BL
CVTree: a phylogenetic tree reconstruction tool based on whole genomes
Nucleic Acids Res
2004
, vol. 
32
 (pg. 
W45
-
7
)
21
Qi
J
Wang
B
Hao
BL
Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach
J Mol Evol
2004
, vol. 
58
 (pg. 
1
-
11
)
22
Gao
L
Qi
J
Whole genome molecular phylogeny of large dsDNA viruses using composition vector method
BMC Evol Biol
2007
, vol. 
7
 
1
pg. 
41
 
23
Wang
H
Xu
Z
Gao
L
et al. 
A fungal phylogeny based on 82 complete genomes using the composition vector method
BMC Evol Biol
2009
, vol. 
9
 
1
pg. 
195
 
24
Foret
S
Wilson
SR
Burden
CJ
Characterizing the D2 statistic: word matches in biological sequences
Stat Appl Genet Mol Biol
2009
, vol. 
8
 
1
(pg. 
1
-
21
)
25
Li
Q
Xu
Z
Hao
BL
Composition vector approach to whole-genome-based prokaryotic phylogeny: success and foundations
J Biotechnol
2010
, vol. 
149
 (pg. 
115
-
19
)
26
Hua
WY
Xu
Z
Zhang
MH
et al. 
The application of CVTree in structural analysis of microbial communities by 454 pyrosequencing
Chin J Microecol
2010
, vol. 
22
 
4
(pg. 
312
-
16
)
27
Liu
JM
Wang
HF
Yang
HX
et al. 
Composition-based classification of short metagenomic sequences elucidates the landscapes of taxonomic and functional enrichment of microorganisms
Nucleic Acids Res
2013
, vol. 
41
 
1
pg. 
e3
 
28
Jiang
B
Song
K
Ren
J
et al. 
Comparison of metagenomic samples using sequence signatures
BMC Genomics
2012
, vol. 
13
 
1
pg. 
730
 
29
Mrázek
J
Karlin
S
Distinctive features of large complex virus genomes and proteomes
Proc Natl Acad Sci USA
2007
, vol. 
104
 (pg. 
5127
-
32
)
30
Karlin
S
Mrazek
J
Compositional differences within and between eukaryotic genomes
Proc Natl Acad Sci USA
1997
, vol. 
94
 (pg. 
10227
-
32
)
31
Karlin
S
Burge
C
Dinucleotide relative abundance extremes: a genomic signature
Trends Genet
1995
, vol. 
11
 (pg. 
283
-
90
)
32
Gentles
AJ
Karlin
S
Genome-scale compositional comparisons in eukaryotes
Genome Res
2001
, vol. 
11
 (pg. 
540
-
6
)
33
Burge
C
Campbell
AM
Karlin
S
Over-and under-representation of short oligonucleotides in DNA sequences
Proc Natl Acad Sci
1992
, vol. 
89
 (pg. 
1358
-
62
)
34
Campbell
A
Mrazek
J
Karlin
S
Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA
Proc Natl Acad Sci USA
1999
, vol. 
96
 (pg. 
9184
-
89
)
35
Karlin
S
Mrazek
J
Campbell
AM
Compositional biases of bacterial genomes and evolutionary implications
J Bacteriol
1997
, vol. 
179
 (pg. 
3899
-
913
)
36
Willner
D
Thurber
RV
Rohwer
F
Metagenomic signatures of 86 microbial and viral metagenomes
Environ Microbiol
2009
, vol. 
11
 (pg. 
1752
-
66
)
37
Jun
SR
Sims
GE
Wu
GA
et al. 
Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution
Proc Natl Acad Sci USA
2010
, vol. 
107
 (pg. 
133
-
8
)
38
Wu
GA
Jun
SR
Sims
GE
et al. 
Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method
Proc Natl Acad Sci USA
2009
, vol. 
106
 (pg. 
12826
-
31
)
39
Sims
GE
Jun
SR
Wu
GA
et al. 
Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions
Proc Natl Acad Sci USA
2009
, vol. 
106
 (pg. 
2677
-
82
)
40
Sims
GE
Kim
SH
Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs)
Proc Natl Acad Sci USA
2011
, vol. 
108
 (pg. 
8329
-
34
)
41
Dai
Q
Wang
T
Comparison study on k-word statistical measures for protein: from sequence to ‘sequence space'
BMC Bioinformatics
2008
, vol. 
9
 
1
pg. 
394
 
42
Dai
Q
Yang
Y
Wang
T
Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison
Bioinformatics
2008
, vol. 
24
 (pg. 
2296
-
302
)
43
Göke
J
Schulz
MH
Lasserre
J
et al. 
Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts
Bioinformatics
2012
, vol. 
28
 (pg. 
656
-
63
)
44
Shepp
L
Normal functions of normal random variables
SIAM Rev
1964
, vol. 
6
 
4
(pg. 
459
-
460
)
45
Liu
X
Wan
L
Li
J
et al. 
New powerful statistics for alignment-free sequence comparison under a pattern transfer model
Journal of Theoretical Biology
2011
, vol. 
284
 (pg. 
106
-
16
)
46
Blow
MJ
McCulley
DJ
Li
Z
et al. 
ChIP-Seq identification of weakly conserved heart enhancers
Nat Genet
2010
, vol. 
42
 (pg. 
806
-
10
)
47
Visel
A
Blow
MJ
Li
Z
et al. 
ChIP-seq accurately predicts tissue-specific activity of enhancers
Nature
2009
, vol. 
457
 (pg. 
854
-
8
)
48
Song
K
Ren
J
Zhai
Z
et al. 
Alignment-free sequence comparison based on next-generation sequencing reads
J Comput Biol
2013
, vol. 
20
 (pg. 
64
-
79
)
49
Zhai
Z
Reinert
G
Song
K
et al. 
Normal and compound Poisson approximations for pattern occurrences in NGS reads
J Comput Biol
2012
, vol. 
19
 (pg. 
839
-
854
)
50
Cannon
CH
Kua
CS
Zhang
D
et al. 
Assembly free comparative genomics of short-read sequence data discovers the needles in the haystack
Mol Ecol
2010
, vol. 
19
 (pg. 
147
-
61
)
51
Muegge
BD
Kuczynski
J
Knights
D
et al. 
Diet Drives Convergence in gut microbiome functions across mammalian phylogeny and within humans
Science
2011
, vol. 
332
 (pg. 
970
-
4
)
52
Rusch
DB
Halpern
AL
Sutton
G
et al. 
The Sorcerer II Global ocean sampling expedition: northwest atlantic through eastern tropical pacific
PloS Biol
2007
, vol. 
5
 (pg. 
398
-
431
)
53
Kurokawa
K
Itoh
T
Kuwahara
T
et al. 
Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes
DNA Res
2007
, vol. 
14
 (pg. 
169
-
81
)
54
Jeffrey
HJ
Chaos game representation of gene structure
Nucleic Acids Res
1990
, vol. 
18
 (pg. 
2163
-
70
)
55
Ulitsky
I
Burstein
D
Tuller
T
et al. 
The average common substring approach to phylogenomic reconstruction
J Comput Biol
2006
, vol. 
13
 (pg. 
336
-
50
)
56
Haubold
B
Pierstorff
N
Möller
F
et al. 
Genome comparison without alignment using shortest unique substrings
BMC Bioinformatics
2005
, vol. 
6
 pg. 
123
 
57
Pinho
AJ
Ferreira
PJ
Garcia
SP
et al. 
On finding minimal absent words
BMC Bioinformatics
2009
, vol. 
10
 pg. 
137
 
58
Yang
L
Zhang
X
Wang
T
et al. 
Large local analysis of the unaligned genome and its application
J Comput Biol
2013
, vol. 
20
 (pg. 
19
-
29
)
59
Zhao
B
He
RL
Yau
SST
A new distribution vector and its application in genome clustering
Mol Phylogenet Evol
2011
, vol. 
59
 (pg. 
438
-
443
)
60
Didier
G
Corel
E
Laprevotte
I
et al. 
Variable length local decoding and alignment-free sequence comparison
Theor Comput Sci
2012
, vol. 
462
 (pg. 
1
-
11
)

Supplementary data