CD-HIT: accelerated for clustering the next-generation sequencing data

Bioinformatics. 2012 Dec 1;28(23):3150-2. doi: 10.1093/bioinformatics/bts565. Epub 2012 Oct 11.

Limin Fu¹, Beifang Niu, Zhengwei Zhu, Sitao Wu, Weizhong Li

PMID: 23060610
PMCID: PMC3516142
DOI: 10.1093/bioinformatics/bts565

Limin Fu et al. Bioinformatics. 2012.

Affiliation

¹ Center for Research in Biological Systems, University of California San Diego, La Jolla, CA 92093, USA.

Abstract

Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.

Availability: http://cd-hit.org.

Contact: liwz@sdsc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
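
As a rough illustration of what clustering sequences to reduce redundancy can look like, the sketch below implements a greedy incremental scheme: sequences are processed from longest to shortest, and each one either joins the first cluster whose representative it resembles closely enough or starts a new cluster. This is a minimal sketch under stated assumptions, not the CD-HIT implementation: the names `toy_identity` and `greedy_cluster`, the 0.9 threshold, and the use of Python's `difflib` in place of a real sequence alignment are all hypothetical choices made for illustration, and nothing here reflects the parallelization strategy described in the abstract.

```python
from difflib import SequenceMatcher


def toy_identity(rep, seq):
    """Toy similarity: matched characters divided by the shorter length.

    Assumption: difflib matching blocks stand in for a real alignment.
    """
    matcher = SequenceMatcher(None, rep, seq, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(rep), len(seq))


def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering, longest sequence first.

    Returns a list of (representative, members) pairs.
    """
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if toy_identity(rep, seq) >= threshold:
                members.append(seq)  # redundant with an existing representative
                break
        else:
            clusters.append((seq, [seq]))  # seq becomes a new representative
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTA", "TTTTTTTTTT", "ACGTACGT"]
    for rep, members in greedy_cluster(reads):
        print(rep, len(members))
```

Only the representatives need to be kept for downstream analyses, which is how this kind of clustering shrinks a redundant dataset.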

Figures

**Fig. 1.**
Evaluation of CD-HIT parallelization: computational time speedup with respect to the number of CPU cores used.
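
Speedup of the kind plotted in such a figure is conventionally computed as the single-core run time divided by the run time on n cores, and parallel efficiency as that speedup divided by n. The sketch below shows the arithmetic. The function name `speedup_and_efficiency` and the timings are hypothetical and chosen only to make the calculation concrete; they are not measurements from the paper.

```python
def speedup_and_efficiency(timings):
    """Compute speedup and parallel efficiency per core count.

    timings: dict mapping core count -> wall-clock seconds.
    The dict must contain an entry for 1 core, used as the baseline.
    Returns a dict mapping core count -> (speedup, efficiency).
    """
    baseline = timings[1]
    result = {}
    for cores, seconds in sorted(timings.items()):
        speedup = baseline / seconds   # S(n) = T(1) / T(n)
        efficiency = speedup / cores   # E(n) = S(n) / n
        result[cores] = (speedup, efficiency)
    return result


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    example = {1: 800.0, 2: 400.0, 8: 125.0}
    for cores, (s, e) in speedup_and_efficiency(example).items():
        print(f"{cores} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency close to 1 corresponds to the quasi-linear regime; efficiency falling as cores are added indicates diminishing returns.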
