A global reference for human genetic variation

1000 Genomes Project Consortium; Adam Auton, Lisa D Brooks, Richard M Durbin, Erik P Garrison, Hyun Min Kang, Jan O Korbel, Jonathan L Marchini, Shane McCarthy, Gil A McVean, Gonçalo R Abecasis

Nature. 2015 Oct 1;526(7571):68-74.

PMID: 26432245
PMCID: PMC4750478
DOI: 10.1038/nature15393

Abstract

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Conflict of interest statement

D.M.A. is affiliated with Vertex Pharmaceuticals, E.A. is on the speaker’s bureau for Illumina, P.A. is an advisor to Illumina and Ancestry.com, D.R.B., B.B., M.B., R.K.C., A.C., M.E., S.H., S.K., L.M., J.P. and R.S. are affiliated with Illumina, J.K.B. is affiliated with Ancestry.com, A.C. is on the Science Advisory Board of Biogen Idec. and the scientific advisory board of Affymetrix, A.W.C. is affiliated with DNAnexus, D.C. is affiliated with Personalis, C.J.D., J.G., J.P.S., T.W., B.W., and Y.Z. are affiliated with Affymetrix, E.T.D. is an advisor for DNAnexus, F.M.D.L.V. is employed by Real Time Genomics, M.A.D. is affiliated with SynapDx, P.D. is a co-founder and director of Genomics, and a partner in Peptide Groove, R.D. is a founder of Congenica and a consultant for Dovetail, E.E.E. is on the scientific advisory board of DNAnexus, and is a consultant for Kunming University of Science and Technology as part of the 1000 China Talent Program, P.F. is a member of the scientific advisory board of Omicia, M.G. is an advisor to Bina and DNAnexus, F.C.L.H. is affiliated with ThermoFisher Scientific, N.H. is affiliated with Life Technologies, C.L. is a scientific advisor for BioNano Genomics, H.Y.K.L. is affiliated with Bina Technologies which is part of Roche Sequencing, E.R.M. holds shares in Life Technologies, and G.M. is a co-founder of Genomics and a partner in Peptide Groove.

Figures

**Figure 1. Population sampling.**
a, Polymorphic variants within sampled populations. The area of each pie is proportional to the number of polymorphisms within a population. Pies are divided into four slices, representing variants private to a population (darker colour unique to population), private to a continental area (lighter colour shared across continental group), shared across continental areas (light grey), and shared across all continents (dark grey). Dashed lines indicate populations sampled outside of their ancestral continental region. b, The number of variant sites per genome. c, The average number of singletons per genome.

**Figure 2. Population structure and demography.**
a, Population structure inferred using a maximum likelihood approach with 8 clusters. b, Changes to effective population sizes over time, inferred using PSMC. Lines represent the within-population median PSMC estimate, smoothed by fitting a cubic spline passing through bin midpoints.

**Figure 3. Population differentiation.**
a, Variants found to be rare (<0.5%) within the global sample, but common (>5%) within a population. b, Genes showing strong differentiation between pairs of closely related populations. The vertical axis gives the maximum obtained value of the F_ST-based population branch statistic (PBS), with selected genes coloured to indicate the population in which the maximum value was achieved.
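
The caption names the F_ST-based population branch statistic but does not give its formula. The sketch below uses the commonly cited definition, in which each pairwise F_ST is converted to a branch length T = -log(1 - F_ST) and the statistic for a focal population A is (T_AB + T_AC - T_BC) / 2. Whether the paper computed PBS in exactly this form is an assumption here, and the function names `branch_length` and `pbs` are hypothetical.

```python
import math


def branch_length(fst):
    """Convert a pairwise F_ST value into a branch length T = -log(1 - F_ST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("F_ST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def pbs(fst_ab, fst_ac, fst_bc):
    """Population branch statistic for focal population A.

    fst_ab: F_ST between A and its close relative B
    fst_ac: F_ST between A and the outgroup C
    fst_bc: F_ST between B and the outgroup C
    (Assumed definition; the caption itself does not state the formula.)
    """
    t_ab = branch_length(fst_ab)
    t_ac = branch_length(fst_ac)
    t_bc = branch_length(fst_bc)
    return (t_ab + t_ac - t_bc) / 2.0
```

A large value indicates that the branch leading to population A is long relative to the other two, which is the pattern of differentiation the panel plots per gene.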

**Figure 4. Imputation and eQTL discovery.**
a, Imputation accuracy as a function of allele frequency for six populations. The insert compares imputation accuracy between phase 3 and phase 1, using all samples (solid lines) and intersecting samples (dashed lines). b, The average number of tagging variants (r² > 0.8) as a function of physical distance for common (top), low frequency (middle), and rare (bottom) variants. c, The proportion of top eQTL variants that are SNPs and indels, as discovered in 69 samples from each population. d, The percentage of eQTLs in TFBS, having performed discovery in the first population, and fine mapped by including an additional 69 samples from a second population (*P < 0.01, **P < 0.001, ***P < 0.0001, McNemar’s test). The diagonal represents the percentage of eQTLs in TFBS using the original discovery sample.
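
Panel b counts tagging variants using an r² > 0.8 threshold. As a rough illustration of that criterion, the sketch below computes the standard r² between two biallelic sites from phased haplotypes coded 0/1 and counts how many candidate sites tag a focal site. The input encoding and the function names `r_squared` and `count_tags` are assumptions for illustration, not the consortium's pipeline.

```python
def r_squared(hap_a, hap_b):
    """Squared allelic correlation between two biallelic sites.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (hypothetical input encoding).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n                                  # frequency of allele 1 at site A
    p_b = sum(hap_b) / n                                  # frequency of allele 1 at site B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n   # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b                                  # linkage disequilibrium coefficient D
    return d * d / denom


def count_tags(focal, candidates, threshold=0.8):
    """Count candidate sites whose r² with the focal site exceeds the threshold."""
    return sum(1 for site in candidates if r_squared(focal, site) > threshold)
```

For example, `count_tags([0, 0, 1, 1], [[0, 0, 1, 1], [0, 1, 0, 1]])` counts only the first candidate, since the second is uncorrelated with the focal site.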

**Extended Data Figure 1. Summary of the callset generation pipeline.**
Boxes indicate steps in the process and numbers indicate the corresponding section(s) within the Supplementary Information.

**Extended Data Figure 2. Power of discovery and heterozygote genotype discordance.**
a, The power of discovery within the main data set for SNPs and indels identified within an overlapping sample of 284 genomes sequenced to high coverage by Complete Genomics (CG), and against a panel of >60,000 haplotypes constructed by the Haplotype Reference Consortium (HRC). To provide a measure of uncertainty, one curve is plotted for each chromosome. b, Improved power of discovery in phase 3 compared to phase 1, as assessed in a sample of 170 Complete Genomics genomes that are included in both phase 1 and phase 3. c, Heterozygote discordance in phase 3 for SNPs, indels, and SVs compared to 284 Complete Genomics genomes. d, Heterozygote discordance for phase 3 compared to phase 1 within the intersecting sample. e, Sensitivity to detect Complete Genomics SNPs as a function of sequencing depth. f, Heterozygote genotype discordance as a function of sequencing depth, as compared to Complete Genomics data.

**Extended Data Figure 3. Variant counts.**
a, The number of variants within the phase 3 sample as a function of alternative allele frequency. b, The average number of detected variants per genome with whole-sample allele frequencies <0.5% (grey bars), with the average number of singletons indicated by colours.

**Extended Data Figure 4. The standardized number of variant sites per genome, partitioned by population and variant category.**
For each category, z-scores were calculated by subtracting the mean number of sites per genome (calculated across the whole sample), and dividing by the standard deviation. From left: sites with a derived allele, synonymous sites with a derived allele, nonsynonymous sites with a derived allele, sites with a loss-of-function allele, sites with a HGMD disease mutation allele, sites with a ClinVar pathogenic variant, and sites carrying a GWAS risk allele.
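
The standardization described in this caption can be sketched as follows. The caption does not say whether the population or sample standard deviation was used; the population form (`pstdev`) is assumed here, and the function name `standardize` is hypothetical.

```python
from statistics import mean, pstdev


def standardize(site_counts):
    """Return z-scores for per-genome site counts within one variant category.

    Each count has the whole-sample mean subtracted and is divided by the
    whole-sample standard deviation (population form, an assumption).
    """
    mu = mean(site_counts)
    sigma = pstdev(site_counts)
    if sigma == 0:
        raise ValueError("all counts are identical; z-scores are undefined")
    return [(count - mu) / sigma for count in site_counts]
```

Because the mean and standard deviation are taken across the whole sample, a genome's z-score shows how its count in a category compares with the global average, which makes the categories comparable despite very different absolute counts.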

**Extended Data Figure 5. Population structure as inferred using the admixture program for K = 5 to 12.**

**Extended Data Figure 6. Allelic sharing.**
a, Genotype covariance (above diagonal) and sharing of f₂ variants (below diagonal) between pairs of individuals. b, Quantification of average f₂ sharing between populations. Each row represents the distribution of f₂ variants shared between individuals from the population indicated on the left to individuals from each of the sampled populations. c, The average number of f₂ variants per haploid genome. d, The inferred age of f₂ variants, as estimated from shared haplotype lengths, with black dots indicating the median value.

**Extended Data Figure 7. Unsmoothed PSMC curves.**
a, The median PSMC curve for each population. b, PSMC curves estimated separately for all individuals within the 1000 Genomes sample. c, Unsmoothed PSMC curves comparing estimates from the low coverage data (dashed lines) to those obtained from high coverage PCR-free data (solid lines). Notable differences are confined to very recent time intervals, where the additional rare variants identified by deep sequencing suggest larger population sizes.

**Extended Data Figure 8. Genes showing very strong patterns of differentiation between pairs of closely related populations within each continental group.**
Within each continental group, the maximum PBS statistic was selected from all pairwise population comparisons within the continental group against all possible out-of-continent populations. Note the x axis shows the number of polymorphic sites within the maximal comparison.

**Extended Data Figure 9. Performance of imputation.**
a, Performance of imputation in 6 populations using a subset of phase 3 as a reference panel (n = 2,445), phase 1 (n = 1,065), and the corresponding data within intersecting samples from both phases (n = 1,006). b, Performance of imputation from phase 3 by variant class.

**Extended Data Figure 10. Decay of linkage disequilibrium as a function of physical distance.**
Linkage disequilibrium was calculated around 10,000 randomly selected polymorphic sites in each population, having first thinned each population down to the same sample size (61 individuals). The plotted line represents a 5 kb moving average.

Comment in

Human genomics: The end of the start for population sequencing.
Birney E, Soranzo N. Nature. 2015 Oct 1;526(7571):52-3. doi: 10.1038/526052a. PMID: 26432243.

Cited by

Hemochromatosis neural archetype reveals iron disruption in motor circuits.
Loughnan R, Ahern J, Boyle M, Jernigan TL, Hagler DJ Jr, Iversen JR, Frei O, Smith DM, Andreassen O, Zaitlen N, Sugrue L, Thompson WK, Dale A, Schork AJ, Fan CC. Sci Adv. 2024 Nov 22;10(47):eadp4431. doi: 10.1126/sciadv.adp4431. Epub 2024 Nov 22. PMID: 39576859.

Assessing the Causal Relationship Between Immune Cells and Temporomandibular Related Pain by Bi-Directional Mendelian Randomization Analysis.
He J, Chen X. J Pain Res. 2024 Nov 16;17:3791-3800. doi: 10.2147/JPR.S486817. eCollection 2024. PMID: 39574830.

The shared genetic architecture and evolution of human language and musical rhythm.
Alagöz G, Eising E, Mekki Y, Bignardi G, Fontanillas P; 23andMe Research Team; Nivard MG, Luciano M, Cox NJ, Fisher SE, Gordon RL. Nat Hum Behav. 2024 Nov 21. doi: 10.1038/s41562-024-02051-y. Online ahead of print. PMID: 39572686.

Metabolomic and genomic prediction of common diseases in 700,217 participants in three national biobanks.
Nightingale Health Biobank Collaborative Group. Nat Commun. 2024 Nov 21;15(1):10092. doi: 10.1038/s41467-024-54357-0. PMID: 39572536.

Human DNA from the oldest Eneolithic cemetery in Nalchik points the spread of farming from the Caucasus to the Eastern European steppes.
Zhur KV, Sharko FS, Leonova MV, Mey A, Prokhortchouk EB, Trifonov VA. iScience. 2024 Oct 16;27(11):110963. doi: 10.1016/j.isci.2024.110963. eCollection 2024 Nov 15. PMID: 39569382.
