The human genome contracts again
- PMID: 23793748
- DOI: 10.1093/bioinformatics/btt362
Abstract
The number of individual human genomes that have been completely sequenced has increased rapidly in recent years. Storing complete genomes and transferring them between computers to run applications and analysis tools will soon become a major hurdle, hindering the analysis phase. Therefore, there is a growing need to compress these data efficiently. Here, we describe a technique to compress human genomes based on entropy coding, using a reference genome and known Single Nucleotide Polymorphisms (SNPs). Furthermore, we explore several intrinsic features of genomes and information in other genomic databases to further improve the compression attained. Using these methods, we compress James Watson's genome to 2.5 megabytes (MB), improving on recent work by 37%. Similar compression is obtained for most genomes available from the 1000 Genomes Project. Our biologically inspired techniques promise even greater gains for genomes of lower organisms and for human genomes as more genomic data become available.
Availability: Code is available at sourceforge.net/projects/genomezip/
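The general idea of reference-based compression with a known-SNP catalogue can be illustrated with a small sketch. This is not the authors' implementation (their code is at the address above); it is a toy Python example under loud assumptions: the sequences are tiny made-up strings, only substitutions are handled (no insertions or deletions), and every function name here is hypothetical. A genome is stored as a bitmap over an ordered list of known SNPs plus a short list of novel variants, and the Shannon entropy of the bitmap gives a rough lower bound on what an ideal order-0 entropy coder would spend per flag.

```python
import math
from collections import Counter


def diff_against_reference(reference, target):
    """List (position, base) for every substitution.

    Assumption: both sequences have the same length (no indels).
    """
    if len(reference) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def split_by_known_snps(variants, known_snps):
    """Encode variants as a bitmap over known SNPs plus a list of novel ones.

    known_snps is an ordered list of (position, alternative_base) pairs,
    standing in for a SNP database shared by encoder and decoder.
    """
    variant_set = set(variants)
    known_set = set(known_snps)
    bitmap = [1 if snp in variant_set else 0 for snp in known_snps]
    novel = [v for v in variants if v not in known_set]
    return bitmap, novel


def shannon_entropy_bits(symbols):
    """Order-0 Shannon entropy in bits per symbol (0.0 for empty input)."""
    if not symbols:
        return 0.0
    total = len(symbols)
    counts = Counter(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def reconstruct(reference, known_snps, bitmap, novel):
    """Rebuild the target from the reference and the compact description."""
    seq = list(reference)
    for flag, (pos, base) in zip(bitmap, known_snps):
        if flag:
            seq[pos] = base
    for pos, base in novel:
        seq[pos] = base
    return "".join(seq)


# Toy data, invented for illustration only.
reference = "ACGTACGTACGT"
target = "ACGTTCGTACGA"
known_snps = [(4, "T"), (7, "C")]

variants = diff_against_reference(reference, target)
bitmap, novel = split_by_known_snps(variants, known_snps)
restored = reconstruct(reference, known_snps, bitmap, novel)
bits_per_flag = shannon_entropy_bits(bitmap)
```

In this sketch the decoder needs only the bitmap and the novel variants, because the reference and the SNP list are assumed to be available on both sides. A real compressor would additionally entropy-code the bitmap and the gaps between novel positions, and would need to handle indels.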