New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing

2. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool, J Mol Biol, 1990, vol. 215 (pg. 403-10)

3. Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment, Proc Natl Acad Sci USA, 1986, vol. 83 (pg. 5155-9)

4. Blaisdell BE. Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding, J Mol Evol, 1985, vol. 21 (pg. 278-88)

5. Vinga S, Almeida J. Alignment-free sequence comparison - a review, Bioinformatics, 2003, vol. 19 (pg. 513-23)

6. Vinga S. Biological sequence analysis by vector-valued functions: revisiting alignment-free methodologies for DNA and protein classification, In: Advanced Computational Methods for Biocomputing and Bioimaging, 2007, New York, Nova Science Publishers (pg. 71-107)

7. Torney DC, Burks C, Davison D, et al. Computation of d2: a measure of sequence dissimilarity, In: Bell GI, Marr TG (eds). Computers and DNA: the Proceedings of the Interface between Computation Science and Nucleic Acid Sequencing Workshop, December 12–16, 1988 in Santa Fe, New Mexico, USA, 1990 (pg. 109-125)

8. Hide W, Burke J, Davison DB. Biological evaluation of d2, an algorithm for high-performance sequence comparison, J Comput Biol, 1994, vol. 1 (pg. 199-215)

9. Miller RT, Christoffels AG, Gopalakrishnan C, et al. A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base, Genome Res, 1999, vol. 9 (pg. 1143-55)

10. Lippert RA, Huang HY, Waterman MS. Distributional regimes for the number of k-word matches between two random sequences, Proc Natl Acad Sci USA, 2002, vol. 99 (pg. 13980-9)

11. Reinert G, Chew D, Sun F, et al. Alignment-free sequence comparison (I): statistics and power, J Comput Biol, 2009, vol. 16 (pg. 1615-34)

12. Kantorovitz MR, Robinson GE, Sinha S. A statistical method for alignment-free comparison of regulatory sequences, Bioinformatics, 2007, vol. 23 (pg. I249-55)

13. Foret S, Wilson SR, Burden CJ. Empirical distribution of k-word matches in biological sequences, Pattern Recognit, 2009, vol. 42 (pg. 539-548)

14. Burden CJ, Kantorovitz MR, Wilson SR. Approximate word matches between two random sequences, Ann Appl Probab, 2008, vol. 18 (pg. 1-21)

15. Wan L, Reinert G, Sun F, et al. Alignment-free sequence comparison (II): theoretical power of comparison statistics, J Comput Biol, 2010, vol. 17 (pg. 1467-90)

16. Zhai Z, Ku S-Y, Luan Y, et al. The power of detecting enriched patterns: an HMM approach, J Comput Biol, 2010, vol. 17 (pg. 581-92)

17. Wu TJ, Hsieh YC, Li LA. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition, Biometrics, 2001, vol. 57 (pg. 441-8)

18. Van Helden J. Metrics for comparing regulatory sequences on the basis of pattern counts, Bioinformatics, 2004, vol. 20 (pg. 399-406)

19. Xu Z, Hao BL. CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes, Nucleic Acids Res, 2009, vol. 37 (pg. W174-8)

20. Qi J, Luo H, Hao BL. CVTree: a phylogenetic tree reconstruction tool based on whole genomes, Nucleic Acids Res, 2004, vol. 32 (pg. W45-7)

21. Qi J, Wang B, Hao BL. Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach, J Mol Evol, 2004, vol. 58 (pg. 1-11)

22. Gao L, Qi J. Whole genome molecular phylogeny of large dsDNA viruses using composition vector method, BMC Evol Biol, 2007, vol. 7(1), pg. 41

23. Wang H, Xu Z, Gao L, et al. A fungal phylogeny based on 82 complete genomes using the composition vector method, BMC Evol Biol, 2009, vol. 9(1), pg. 195

24. Foret S, Wilson SR, Burden CJ. Characterizing the D2 statistic: word matches in biological sequences, Stat Appl Genet Mol Biol, 2009, vol. 8(1) (pg. 1-21)

25. Li Q, Xu Z, Hao BL. Composition vector approach to whole-genome-based prokaryotic phylogeny: success and foundations, J Biotechnol, 2010, vol. 149 (pg. 115-19)

26. Hua WY, Xu Z, Zhang MH, et al. The application of CVTree in structural analysis of microbial communities by 454 pyrosequencing, Chin J Microecol, 2010, vol. 22(4) (pg. 312-16)

27. Liu JM, Wang HF, Yang HX, et al. Composition-based classification of short metagenomic sequences elucidates the landscapes of taxonomic and functional enrichment of microorganisms, Nucleic Acids Res, 2013, vol. 41(1), pg. e3

28. Jiang B, Song K, Ren J, et al. Comparison of metagenomic samples using sequence signatures, BMC Genomics, 2012, vol. 13(1), pg. 730

29. Mrázek J, Karlin S. Distinctive features of large complex virus genomes and proteomes, Proc Natl Acad Sci USA, 2007, vol. 104 (pg. 5127-32)

30. Karlin S, Mrázek J. Compositional differences within and between eukaryotic genomes, Proc Natl Acad Sci USA, 1997, vol. 94 (pg. 10227-32)

31. Karlin S, Burge C. Dinucleotide relative abundance extremes: a genomic signature, Trends Genet, 1995, vol. 11 (pg. 283-90)

32. Gentles AJ, Karlin S. Genome-scale compositional comparisons in eukaryotes, Genome Res, 2001, vol. 11 (pg. 540-6)

33. Burge C, Campbell AM, Karlin S. Over- and under-representation of short oligonucleotides in DNA sequences, Proc Natl Acad Sci USA, 1992, vol. 89 (pg. 1358-62)

34. Campbell A, Mrázek J, Karlin S. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA, Proc Natl Acad Sci USA, 1999, vol. 96 (pg. 9184-89)

35. Karlin S, Mrázek J, Campbell AM. Compositional biases of bacterial genomes and evolutionary implications, J Bacteriol, 1997, vol. 179 (pg. 3899-913)

36. Willner D, Thurber RV, Rohwer F. Metagenomic signatures of 86 microbial and viral metagenomes, Environ Microbiol, 2009, vol. 11 (pg. 1752-66)

37. Jun SR, Sims GE, Wu GA, et al. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution, Proc Natl Acad Sci USA, 2010, vol. 107 (pg. 133-8)

38. Wu GA, Jun SR, Sims GE, et al. Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method, Proc Natl Acad Sci USA, 2009, vol. 106 (pg. 12826-31)

39. Sims GE, Jun SR, Wu GA, et al. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions, Proc Natl Acad Sci USA, 2009, vol. 106 (pg. 2677-82)

40. Sims GE, Kim SH. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs), Proc Natl Acad Sci USA, 2011, vol. 108 (pg. 8329-34)

41. Dai Q, Wang T. Comparison study on k-word statistical measures for protein: from sequence to ‘sequence space’, BMC Bioinformatics, 2008, vol. 9(1), pg. 394

42. Dai Q, Yang Y, Wang T. Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison, Bioinformatics, 2008, vol. 24 (pg. 2296-302)

43. Göke J, Schulz MH, Lasserre J, et al. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts, Bioinformatics, 2012, vol. 28 (pg. 656-63)

44. Shepp L. Normal functions of normal random variables, SIAM Rev, 1964, vol. 6(4) (pg. 459-460)

45. Liu X, Wan L, Li J, et al. New powerful statistics for alignment-free sequence comparison under a pattern transfer model, J Theor Biol, 2011, vol. 284 (pg. 106-16)

46. Blow MJ, McCulley DJ, Li Z, et al. ChIP-Seq identification of weakly conserved heart enhancers, Nat Genet, 2010, vol. 42 (pg. 806-10)

47. Visel A, Blow MJ, Li Z, et al. ChIP-seq accurately predicts tissue-specific activity of enhancers, Nature, 2009, vol. 457 (pg. 854-8)

48. Song K, Ren J, Zhai Z, et al. Alignment-free sequence comparison based on next-generation sequencing reads, J Comput Biol, 2013, vol. 20 (pg. 64-79)

49. Zhai Z, Reinert G, Song K, et al. Normal and compound Poisson approximations for pattern occurrences in NGS reads, J Comput Biol, 2012, vol. 19 (pg. 839-854)

50. Cannon CH, Kua CS, Zhang D, et al. Assembly free comparative genomics of short-read sequence data discovers the needles in the haystack, Mol Ecol, 2010, vol. 19 (pg. 147-61)

51. Muegge BD, Kuczynski J, Knights D, et al. Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans, Science, 2011, vol. 332 (pg. 970-4)

52. Rusch DB, Halpern AL, Sutton G, et al. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific, PLoS Biol, 2007, vol. 5 (pg. 398-431)

53. Kurokawa K, Itoh T, Kuwahara T, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes, DNA Res, 2007, vol. 14 (pg. 169-81)

54. Jeffrey HJ. Chaos game representation of gene structure, Nucleic Acids Res, 1990, vol. 18 (pg. 2163-70)

55. Ulitsky I, Burstein D, Tuller T, et al. The average common substring approach to phylogenomic reconstruction, J Comput Biol, 2006, vol. 13 (pg. 336-50)

56. Haubold B, Pierstorff N, Möller F, et al. Genome comparison without alignment using shortest unique substrings, BMC Bioinformatics, 2005, vol. 6, pg. 123

57. Pinho AJ, Ferreira PJ, Garcia SP, et al. On finding minimal absent words, BMC Bioinformatics, 2009, vol. 10, pg. 137

58. Yang L, Zhang X, Wang T, et al. Large local analysis of the unaligned genome and its application, J Comput Biol, 2013, vol. 20 (pg. 19-29)

59. Zhao B, He RL, Yau SST. A new distribution vector and its application in genome clustering, Mol Phylogenet Evol, 2011, vol. 59 (pg. 438-443)

60. Didier G, Corel E, Laprevotte I, et al. Variable length local decoding and alignment-free sequence comparison, Theor Comput Sci, 2012, vol. 462 (pg. 1-11)