- Split View
-
Views
-
Cite
Cite
Kai Song, Jie Ren, Gesine Reinert, Minghua Deng, Michael S. Waterman, Fengzhu Sun, New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing, Briefings in Bioinformatics, Volume 15, Issue 3, May 2014, Pages 343–353, https://doi.org/10.1093/bib/bbt067
- Share Icon Share
Abstract
With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.
INTRODUCTION
Sequence comparison continues to play crucial roles in molecular sequence analysis. The dominant approaches for sequence comparison are alignment-based including the Smith–Waterman algorithm [1] and BLAST [2]. Although alignment-based approaches generally yield excellent results when the molecular sequences of interest can be reliably aligned, their applications are limited when the sequences are divergent or come from different regions of various genomes and a reliable alignment cannot be obtained. Another drawback of alignment-based approaches is that they are generally time-consuming and thus, are limited in dealing with large-scale sequence data generated with the new sequencing technologies. The next-generation sequencing (NGS) technologies usually generate relatively short reads that can be difficult to assemble, and alignment-based approaches cannot be applied when the complete sequences are not known. Alignment-free sequence comparison approaches provide attractive alternatives when alignment-based approaches fail.
Studies of alignment-free sequence comparison began in the mid 80s [3, 4] and have been under continuous development. Two excellent reviews are available on this topic [5, 6]. Here we review new developments on alignment-free sequence comparison with emphasis on methods based on k-tuple (k-word, k-gram) frequencies. For studies on alignment-free sequence comparison based on other ideas such as chaos theory and common substrings, please see [6] for details.
Alignment-free sequence comparison has been successfully applied to many biological problems including (i) the study of evolution of organisms using whole-genome sequences, (ii) the evolution of regulatory sequences such as promoters, enhancers and inhibitors, (iii) the identification of cis-regulatory modules (CRM) and (iv) the comparison of metagenomic communities based on the sequence data using NGS technologies.
For alignment-free sequence comparison of two sequences using k-tuples, the first step is to count the number of occurrences of every k-tuple in both sequences separately and record these in a vector of k-tuple frequencies for each sequence; this counting can be carried out in linear time. Second, a measure d of difference between the two sequences A and B is defined based on the two frequency vectors. If the measure satisfies distance constraints, i.e. (a) and equality holds if and only if , and (b) for any sequences A, B and C, , then the measure is a distance measure. Otherwise, the measure is called a dissimilarity measure. Third, the sequences are then clustered based on the distance or dissimilarity measures and the resulting clusters are finally compared with current biological knowledge about the sequences to evaluate the effectiveness of the measures. Many measures have been developed over the years. Here we present a general review of such measures and their applications to molecular sequence analysis with an emphasis on NGS data.
The article is organized as follows. In Section 1, we review theoretical studies of the approximate distributions of the popular D2 statistic and of its power. As D2 may mainly measure background noise in each sequence separately, in Section 2 we describe adjusted similarity measures based on word counts. Section 3 focuses on alignment-free genome and metagenome comparison using NGS data. Section 4 contains a discussion and conclusions.
THEORETICAL STUDIES OF THE APPROXIMATE DISTRIBUTIONS OF AND ITS STATISTICAL POWER
Alignment-free sequence comparison by the number of word matches: the D2 statistic
Both and have an approximate -distribution with degrees of freedom. Blaisdell [3] used similar ideas to test if a set of sequences have the same transition matrix, and thus are more likely to be related. These statistics can be considered as the origin of alignment-free sequence comparison statistics.
where and are the number of occurrences of tuple w in sequences A and B, respectively, and denotes the logical indicator: if event E is true, and 0 otherwise. The statistic has been used in many applications including sequence database searches [8] and clustering of expressed sequence tags [9]. Owing to its wide range of applications, extensive studies on the distributions of have been carried out.
The distribution of the D2 statistic under the null model of two independent sequences
Lippert et al. [10] studied the limiting distribution of under the independent identically distributed (i.i.d.) model for both sequences with the same nucleotide frequencies , where indicates the set of all the possible letters. When are not all equal, it was shown that when where n is the length of both sequences, has an approximate Poisson distribution, and when , has an approximate normal distribution. It was further suggested in [10] and explicitly proved in [11] that the variance of is dominated by the variance of the number of occurrences of each k-tuple in individual sequences. However, when , is approximately neither normal nor Poisson. Instead, tends to the sum of products of normal distributions.
The fundamental results in [10] were further extended to more general models for the sequences of interest [12–14]. Kantorovitz et al. [12] showed that is approximately normal as both k and n tend to infinity when the nucleotide frequencies are the same for both sequences. Burden et al. [14] extended the statistic to allow word matches with up to a certain number of mismatches and again showed that this new statistic is approximately normally distributed. Foret et al. [13] compared the empirical and theoretical distributions of and its variations and found that the approximations are consistent with the empirical distributions.
The power of the D2 statistic under the alternative model of two related sequences
The results from [10, 12–14] played key roles in estimating the statistical significance of between two sequences under the null hypothesis that the two sequences are unrelated. In contrast, these studies do not address what kind of relationship the statistic and its variants can detect, and what the statistical power is when the alternative hypothesis that the two sequences are related holds. To answer these questions, precise definitions of relatedness between the sequences of interest are needed. The alternative hypothesis depends on the scientific questions of interest. One of the key applications of alignment-free sequence comparison is to find CRMs that locate in the upstream regions of genes controlled by the same sets of transcription factors. Sequences in the same CRM tend to have the same transcription factor binding sites and thus share similar sets of k-tuples. Therefore, Reinert et al. [11] modelled the relatedness of sequences by the sharing of common , that is, word patterns that are significantly enriched in the sequences. They refer to the model as a common motif model. The model for each sequence consists of the following three components:
The background sequences are modelled by an i.i.d. model.
The foreground motifs are modelled by position weight matrices that give the nucleotide probability distribution at each position of the motifs. The foreground motif model can be easily extended to the situation that the nucleotides along the motifs depend on each other. The motifs can also be easily generalized to CRMs consisting of combinations of several motifs.
The occurrences of the motifs are modelled as binomial random variables along the genome sequence, with denoting the probability that a motif instance starts at a nucleotide position; is referred as the . Once a motif is inserted, the nucleotide positions, which are now covered by the motif, are ignored, and the insertion process resumes at the end of the motif, so that inserted motifs do not overlap.
Intuitively, the relatedness between the two sequences increases with the motif density, and thus the power of a reasonable statistic should also increase with motif density. Simulation studies under the common motif model showed that for small values of k, the power of can be smaller than the type I error rate and can decrease with sequence length. Thus, is not an appropriate statistic to test the relationship between sequences under the common motif model for small values of k. When k is relatively large, the power does increase with sequence length. Thus, for many practical applications with relatively large values of k, can be useful. These observations were further proved theoretically in [15] based on the hidden Markov model for each sequence developed in [16].
ADJUSTED DISSIMILARITY MEASURES FOR ALIGNMENT-FREE SEQUENCE COMPARISON BASED ON K-TUPLE COUNTS
In addition to , many other distance and dissimilarity measures based on k-tuple counts have been proposed for the comparison of molecular sequences. Those include (a) normalization of by , (b) correlation of relative differences of the tuple counts from their expectations, (c) comparison of relative abundance of k-tuples and (d) comparison of k-tuple distributions based on Markov models.
Normalization of D2 by D2z
The authors compared the effectiveness of the statistic with several other distance or dissimilarity measures between the k-tuple vectors including (i) the Euclidean distance [3], (ii) Kullback-Leibler discrepancy (kld) [17], (iii) the linear Pearson correlation coefficient (lcc) between the count vectors, (iv) the cosine of the angle between the two k-tuple count vectors (cos) and (v) two other measures based on the approximate Poisson distribution of word counts [18]. Using four fly and three human CRMs as examples, it was shown that the statistic outperforms the statistics described above for the identification of CRMs [12]. Note though, the Euclidean, kld, lcc and cos measures need only one parameter, the value of k, while needs an additional parameter r, the order of the MC for the sequences. In the seven data sets studied, the combination of and yielded the best performance for in five of the data sets, and the combination of and has the best performance in two of the seven data sets.
Correlation of relative differences of k-tuple counts between two sequences
The authors applied this dissimilarity measure to whole-genome phylogenetic analysis [21–25] and classification of metagenomic samples [26]. Liu et al. [27] used the composition vector approach to assign metagenomics reads to organisms or gene families. However, Jiang et al. [28] recently found that, although the statistic performs reasonably well with high-coverage data sets, it is not stable when the data are limited.
Alignment-free sequence comparison based on di-, tri- and tetra-nucleotide relative abundance
It can be seen that the dissimilarity measure is closely related to the dissimilarity measure developed from Karlin's group when the MC model was used to model the sequences. 's group used one minus the correlation between the relative difference vectors while Karlin's group used the -norm between the relative difference vectors to measure the dissimilarity.
Alignment-free sequence comparison based on direct comparison of tuple frequencies with Markov models
Kim and colleagues [37–40] designed an approach based on the Jensen-Shannon entropy as a distance measure to study the evolutionary relationships of prokaryotes, Escherichia coli/Shigella, and viruses.
In a series of articles, Wang's group compared two sequences using the k-tuple frequency vectors coupled with MC models of order [41, 42]. The methods in [42] can be described as follows. Assuming an r-th order MC for each sequence, the transition probability matrix can be estimated using the sequence data. The probability of each k-tuple to occur in the sequence A can then be calculated based on the transition probabilities denoted as , where is the r-th order MC for sequence A. Similarly, the k-tuple probability distribution can be defined based on the second sequence B. The statistic is defined by the symmetric Jensen-Shannon divergence of the two probability measures and . They also considered a weighted version of denoted as that was defined by replacing with , where is the frequency of in sequence or B. It was shown in [41,42] that for the identification of CRMs with appropriate choices of k and r, the dissimilarity measure performs the best compared with the dissimilarity measures discussed above. We also tried S2 on the data sets used in [43] and found that S2 outperforms other dissimilarity measures by a significant margin for the identification of CRM sequences (see Supplementary Tables S1–S4).
The statistics and and their statistical power
These statistics were used to test the null hypothesis H0 that the two sequences are not related versus the alternative hypothesis H1 that two sequences are related by the common motif model. It was shown by simulations [11] and theoretically [15] that the new statistics and have the following properties:
The power of both and is generally higher than that of , and increases with the sequence length.
The statistic has the highest power when the length of the tuples, k, equals the length of the inserted motif.
When the sequence length is relatively short, the statistic is more powerful than , while when the sequence length is long, the power of is generally higher than that of .
When the two sequences are the same, both and equal to 0, while if they are anti-correlated, these statistics are close to 1. Therefore, they can be used as dissimilarity measures for two sequences and can be used to cluster a group of sequences of interest.
The statistic can be similarly defined by replacing with .
It was shown in [45] that the power of both and under the alternative model increases with sequence length and approximates 1 as sequence length tends to infinity.
Consideration of mismatched tuples
The idea of using mismatched tuples can be applied to other dissimilarity measures such as the Jensen-Shannon divergence, , and the -distance between relative abundance vectors of two different sequences. The details are omitted here.
Empirical comparison of dissimilarity measures with mismatches
Because the consideration of mismatched tuples in alignment-free sequence comparison is relatively new, comprehensive evaluations of various dissimilarity measures allowing mismatched tuples have not been carried out yet. Here we use the comparison of CRM sequences as an example to study the effect of mismatch weight on , , and . We also consider two more statistics: the Jensen-Shannon information (JS) and S2 in [42] discussed above.
We used enhancers active in the forebrain, midbrain, limb and heart tissues of developing mouse embryo [43, 46, 47] to study the effectiveness of the different dissimilarity measures to cluster CRMs. These sequences form the set of samples. As in [12, 42], sequences of the same length as the positive samples are randomly chosen from the mouse genome, ensuring a maximum of of repetitive sequence. These sequences form the set of samples. As in [43], we randomly chose 500 sequences in both the positive set and the negative set. Each pair of sequences in the positive set was compared and so was each pair in the negative set using the dissimilarity measures described above. We tested if sequence pairs from the positive set were more similar than sequence pairs from the negative set.
For a given dissimilarity measure with a fixed mismatch weight, we calculated the area under the receiver operating characteristic curve (AUC-ROC) as follows. First, we randomly chose 500 positive sequences from the CRM sequences and 500 random sequences from the mouse genome as negative sequences. Second, we calculated the dissimilarity measures for all pairs of positive sequences and all pairs of negative sequences. Third, all the dissimilarity scores for the positive pairs and negative pairs were mixed together. If the dissimilarity score of a pair was lower than a given threshold, the pair was predicted as from the positive samples and otherwise, the pair was predicted as from the negative samples. Fourth, the predictions were compared with the real data to find the false positive rate and the true positive rate. By changing the threshold, the ROC curve can be plotted and thus the AUC score can be calculated. We repeated these four steps 25 times. We let the tuple size to be 4, 5 and 6, the order of MC to be 0, 1 and 2 and the mismatch weight to be 0, 0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75 and 1.00 as in [43]. Supplementary Tables S1–S4 give the average AUC-ROC scores for all the six dissimilarity measures for the four sets of CRM sequences.
We make the following three observations. First, the performances of , and are close and are better than that of JS and in general. Second, for , the performances of , and are the best when the mismatch weight is around 0.05, but the differences with respect to different mismatch weights are negligible. For or 6, the optimal performances are observed for the mismatch weight close to 1. These observations are consistent with that in [43]. Third, the statistic S2 of Dai et al. [42] performs surprisingly well when k = 6 with independent identically distributed model for the background sequences. One potential explanation for the good performance of S2 is that it does not consider k-tuples that are not present in the sequence. The number of such 6-tuples is large for CRM sequences of length around 1 kb.
ALIGNMENT-FREE GENOME AND METAGENOME COMPARISON USING NGS DATA
With the development of NGS technologies, many short sequence reads can be easily generated resulting in a huge amount of available sequencing data. Yet, the assembly of these sequence reads can be challenging owing to their short length. Sequence-signature–based methods for molecular sequence data analysis have the advantage over alignment-based methods in that they can be directly applied to NGS data. The sequence reads are homogeneously or heterogeneously sampled from the original genome. Here homogeneous means that the sequence reads start at each position of the genome with equal probability, while heterogeneous means that some regions may be preferentially sequenced.
To overcome these challenges, Song et al. [48] extended , and to develop new statistics to be applicable for NGS data and studied their corresponding power by both simulations and theoretical studies. Jiang et al. [28] used sequence signature methods to cluster microbial communities. The theoretical studies of the power of , and are based on the limiting distributions of tuple counts in NGS data [49]. It was shown that the qualitative relative performances of these statistics on a large set of short reads are the same as that for long sequences. However, owing to the additional randomness involved in the sampling of the reads from the genomes during NGS, the power of these statistics is lower than when the complete genomic sequences are known. When the sampling of reads is homogeneously distributed, it was shown that, as the sequencing depth increases, the power of these statistics for NGS data approximates that when the genome sequences are known. In addition, heterogeneous sampling of the reads along the genome can further decrease the power of and . On the other hand, NGS read length does not significantly affect their power. Further, the dissimilarity measures and were then extended to NGS data and were applied to cluster different tree species with unknown complete genome sequences [48, 50]. It was shown that among these three dissimilarity measures the resulting clustering tree based on the dissimilarity measure is the most consistent with the characteristics of the trees [48].
The three dissimilarity measures and were also used to compare complex microbial communities consisting of hundreds to thousands of species [28]; each microbial community was treated as a pan-genome. These dissimilarity measures were also compared with several other dissimilarity measures including the dissimilarity measure defined in Equation (4), measures based on relative di-, tri- and tetra-nucleotide frequencies as in Equation (6) [36], and the standard -measures between the frequency vectors.
Simulation studies were first used to see if these dissimilarity measures can recover the relationships among microbial communities using metagenomic short read data. Two types of relationships among the microbial communities were studied. First, they can be related through group relationships such as when the communities are divided into several groups where communities in each group have similar microbial species compositions. For any given dissimilarity measure, the dissimilarity between any pair of microbial communities can be calculated to form a dissimilarity matrix. Hierarchical clustering with average linkage was then used to cluster the microbial communities based on the dissimilarity matrix. The resulting clusters were compared with the simulated group relationships of the microbial communities. It was shown that among the three dissimilarity measures tested, the clustering tree derived based on is the most similar to the simulated group relationships among the microbial communities. Second, the microbial communities can be related through a gradient relationship for the abundance levels of microbial organisms. Principal coordinate analysis was then applied to the dissimilarity matrix. The correlation between the principal coordinate and the gradient can be calculated; high correlation indicates better performance of a dissimilarity measure. It was shown that when sequencing depth is low to moderate (1000–10 000 reads per community), the correlation is highest when based on and . When the sequencing depth is high, the correlations based on , and are similar and are higher than those based on other dissimilarity measures.
These dissimilarity measures were also used to analyse 39 fecal samples from 33 mammalian host species [51], 56 marine samples across the world [52] and 13 fecal samples from human individuals [53]. Using the dissimilarity measure, the fecal samples from carnivores can be separated from that of the herbivores. Even within the herbivores, the fecal samples from the hindgut-fermenting herbivores can be separated from that of the foregut-fermenting herbivores. In contrast, the fecal samples from the omnivore samples are diverse and are mixed together with the carnivore and herbivore samples. For the marine data, metagenomic samples from the same region tend to cluster together. For the human gut samples, the metagenomic samples from the adult can be separated from that of unweaned infants. These studies showed the importance of using alignment-free methods for the comparison of metagenomic samples.
In [28], the dissimilarity measure S2 of Dai et al. [42] was not evaluated with respect to the classification of microbial communities. Here we carried out the same analysis as in [28] with the three real data sets described above using S2, and the parsimony scores of the resulting trees are given in Supplementary Table S5 for the fecal samples of mammalian host species, Tables S6 and S7 for the open and coastal water samples, respectively, and Table S8 for the human fecal samples in the Supplementary Material, where the parsimony score is the minimum number of changes of the community labels needed to explain the clusters. The lower the parsimony score, the better the dissimilarity score is. In all these tables, only the results related to S2 are new and the results related to other statistics were presented in [28]. We found that for the comparison of microbial communities, outperforms S2, and has the best performance among all the dissimilarity measures we studied so far.
The programs for calculating most of the statistics described in this review are available online and the corresponding websites are given in Table 1.
Statistical measure . | Website . |
---|---|
D2, D2S, D2Star | http://www-rcf.usc.edu/∼fsun/Programs/D2/d2-all.html |
d2, d2S, d2Star | http://www-rcf.usc.edu/∼fsun/Programs/D2_NGS/D2NGSmain.html |
D2z | http://veda.cs.uiuc.edu/cgi-bin/d2z/download.pl |
Hao | http://tlife.fudan.edu.cn/cvtree/ |
S2 | http://math.dlut.edu.cn/daiqi/mplusd.html |
N2 | http://www.seqan.de/projects/alf/ |
Statistical measure . | Website . |
---|---|
D2, D2S, D2Star | http://www-rcf.usc.edu/∼fsun/Programs/D2/d2-all.html |
d2, d2S, d2Star | http://www-rcf.usc.edu/∼fsun/Programs/D2_NGS/D2NGSmain.html |
D2z | http://veda.cs.uiuc.edu/cgi-bin/d2z/download.pl |
Hao | http://tlife.fudan.edu.cn/cvtree/ |
S2 | http://math.dlut.edu.cn/daiqi/mplusd.html |
N2 | http://www.seqan.de/projects/alf/ |
Statistical measure . | Website . |
---|---|
D2, D2S, D2Star | http://www-rcf.usc.edu/∼fsun/Programs/D2/d2-all.html |
d2, d2S, d2Star | http://www-rcf.usc.edu/∼fsun/Programs/D2_NGS/D2NGSmain.html |
D2z | http://veda.cs.uiuc.edu/cgi-bin/d2z/download.pl |
Hao | http://tlife.fudan.edu.cn/cvtree/ |
S2 | http://math.dlut.edu.cn/daiqi/mplusd.html |
N2 | http://www.seqan.de/projects/alf/ |
Statistical measure . | Website . |
---|---|
D2, D2S, D2Star | http://www-rcf.usc.edu/∼fsun/Programs/D2/d2-all.html |
d2, d2S, d2Star | http://www-rcf.usc.edu/∼fsun/Programs/D2_NGS/D2NGSmain.html |
D2z | http://veda.cs.uiuc.edu/cgi-bin/d2z/download.pl |
Hao | http://tlife.fudan.edu.cn/cvtree/ |
S2 | http://math.dlut.edu.cn/daiqi/mplusd.html |
N2 | http://www.seqan.de/projects/alf/ |
DISCUSSION AND CONCLUSIONS
Alignment-based approaches for sequence comparison will continue to play dominant roles in genomic studies when the sequences of interest can be reliably aligned. On the other hand, alignment-free sequence comparison approaches have been shown to provide important information on the evolution of gene regulatory regions and the comparison of genomes and metagenomes in situations where reliable alignments are not available. Alignment-free approaches based on k-tuple counts are especially powerful for the analysis of NGS short read data from unknown genomes and metagenomes.
In this review, we summarized recent developments of alignment-free sequence comparison concentrating on approaches based on k-tuple count vectors. We emphasized the dissimilarity measures used for genome and metagenome comparisons, the statistical distributions of the measures and the power of these statistics when genomes of interest are related. Both theoretical studies and real data analysis showed that the newly developed statistics and are generally more powerful than the original statistic . The introduction of mismatched tuples in these statistics can further increase their power. For the comparison of relatively short (e.g. 1 kb) CRM sequences, the statistic S2 seems to perform well with appropriate choices of tuple length and order of MC for the sequences. For the comparison of long genome sequences and metagenomic communities, the dissimilarity measure seems to yield the best results.
Many other alignment-free sequence comparison approaches are available including those based on chaos game representation [54], common substrings between sequences [55], longest common words [56], the minimal words [57], the difference between the longest common and shortest absent words [58] and sequence representation based on natural vectors [59]. Also, there are recent results from the area of machine learning applied to alignment-free sequence comparison, see for example [60]. Because these approaches are based on different philosophies of sequence comparison from the k-tuple–based methods and it is not clear whether they can be applied to NGS data, we did not review them here. Further studies are needed to understand the advantages and disadvantages of the various alignment-free sequence comparison methods.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Alignment-free sequence comparison is essential for the comparison of genomes based on NGS reads even if the reads are not from homologous regions.
Metagenomes can be effectively clustered based on NGS reads using sequence signatures.
Normalization of pattern counts by centralizing around their means can significantly increase the power of alignment-free genome comparison.
The newly developed dissimilarity measures and outperform others for genome and metagenome comparison.
Acknowledgements
We thank Dr Xuegong Zhang and Mr Bai Jiang for collaboration on the use of sequence signatures to compare microbial communities.
FUNDING
This work was supported by the US National Institutes of Health [P50HG002790, R21HG006199] and NSF [DMS-1043075, OCE 1136818]. M.D., K.S. and J.R. are supported by the National Natural Science Foundation of China [31171262, 11021463] and the National Key Basic Research Project of China [2009CB918503].