Variance component model to account for sample structure in genome-wide association studies

Nat Genet. 2010 Apr;42(4):348-54. doi: 10.1038/ng.548. Epub 2010 Mar 7.

Hyun Min Kang¹, Jae Hoon Sul, Susan K Service, Noah A Zaitlen, Sit-Yee Kong, Nelson B Freimer, Chiara Sabatti, Eleazar Eskin

PMID: 20208533
PMCID: PMC3092069
DOI: 10.1038/ng.548

Hyun Min Kang et al. Nat Genet. 2010 Apr.

Affiliation

¹ Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.

Abstract

Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
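
The abstract does not spell out the algorithm, so the following is only a minimal sketch of a variance component association scan, not the EMMAX software itself. It assumes the commonly described strategy of estimating the variance components once under a null model (no SNP effect) and reusing them for every marker, which is what avoids a costly re-fit per SNP. The function names `fit_null_model` and `association_scan` are hypothetical, the kinship matrix `K` is taken as given, and a grid-search maximum likelihood fit stands in for a proper REML optimisation.

```python
import math
import numpy as np


def fit_null_model(y, X, K, log_delta_grid=None):
    """Estimate delta = sigma_e^2 / sigma_g^2 once, with no SNP in the model.

    Assumption: maximum likelihood over a grid of delta values, used here
    as a simple stand-in for the REML fit a production tool would perform.
    """
    if log_delta_grid is None:
        log_delta_grid = np.linspace(-5.0, 5.0, 101)
    n = y.shape[0]
    S, U = np.linalg.eigh(K)           # K = U diag(S) U^T
    y_r, X_r = U.T @ y, U.T @ X        # rotation makes the covariance diagonal
    best = (-np.inf, None, None)
    for log_delta in log_delta_grid:
        delta = math.exp(log_delta)
        w = 1.0 / (S + delta)          # inverse covariance, up to sigma_g^2
        XtWX = X_r.T @ (w[:, None] * X_r)
        beta = np.linalg.solve(XtWX, X_r.T @ (w * y_r))
        resid = y_r - X_r @ beta
        sigma_g2 = float(np.sum(w * resid**2) / n)
        loglik = -0.5 * (n * math.log(2 * math.pi * sigma_g2)
                         - np.sum(np.log(w)) + n)
        if loglik > best[0]:
            best = (loglik, delta, sigma_g2)
    _, delta, sigma_g2 = best
    return delta, sigma_g2, S, U


def association_scan(y, X, K, G):
    """Test each column of G by generalized least squares,
    reusing the variance estimates from the null model."""
    delta, sigma_g2, S, U = fit_null_model(y, X, K)
    w = 1.0 / (S + delta)
    y_r, X_r, G_r = U.T @ y, U.T @ X, U.T @ G
    p_values = np.empty(G.shape[1])
    for j in range(G.shape[1]):
        D = np.column_stack([X_r, G_r[:, j]])      # covariates plus one SNP
        DtWD = D.T @ (w[:, None] * D)
        beta = np.linalg.solve(DtWD, D.T @ (w * y_r))
        se = math.sqrt(sigma_g2 * np.linalg.inv(DtWD)[-1, -1])
        z = beta[-1] / se
        p_values[j] = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P
    return p_values
```

Here `y` is a phenotype vector of length n, `X` an n-by-c covariate matrix (at least a column of ones), `K` an n-by-n relatedness matrix, and `G` an n-by-m genotype matrix. The eigendecomposition of `K` is computed once, so each marker costs only a small weighted least-squares solve.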

Conflict of interest statement

The authors declare no competing financial interests.

Figures

**Figure 1**
Scatter plots of the first two principal components against latitude and longitude. Only individuals of known ancestry are included in the plot. Latitude and longitude are defined as the average latitude and longitude of the parents’ birthplaces. Colors indicate linguistic or geographic subgroups.

**Figure 2**
The genomic control parameters for ten traits change with the number of principal components used for adjustment. Sig PC, significant principal components, includes the principal components (PC) that have a t-test P value < 0.005 as predictors for each of the phenotypes. LDL, low-density lipoprotein; SBP, systolic blood pressure; HDL, high-density lipoprotein; GLU, glucose; BMI, body mass index; DBP, diastolic blood pressure; INS, insulin plasma levels; TG, triglyceride; CRP, C-reactive protein.
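
As an illustration of the quantity plotted here, the genomic control parameter is conventionally computed as the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null (about 0.455). The sketch below assumes that convention and two-sided P values as input; the function name `genomic_control_lambda` is hypothetical and is not taken from the paper's software.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_control_lambda(p_values):
    """Inflation factor: median observed 1-df chi-square over its null median.

    Each two-sided P value (0 < p <= 1) is converted back to a
    chi-square statistic by squaring the matching normal quantile.
    """
    stats = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(stats) / CHI2_1DF_MEDIAN
```

A value near 1 indicates little inflation; values above 1 suggest that test statistics are inflated, for example by unmodeled sample structure.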

**Figure 3**
Comparison of P value distributions across different methods with NFBC66 data. (a) Quantile-quantile plot of the height phenotype, which shows the largest inflation of test statistics, before application of genomic control. The shadowed region represents a conservative 95% confidence interval (CI) computed from the beta distribution assuming independent markers. ES100 indicates EIGENSOFT correcting for 100 principal components. (b) Comparison of LDL association P values between uncorrected and EMMAX analysis after application of genomic control, on a logarithmic scale.

**Figure 4**
Rank concordance comparison of strongly associated SNPs between different methods. The ten NFBC66 phenotypes (abbreviated as in Fig. 2) are ordered by their genomic control inflation factors. Rank concordance is presented as CAT plots. The proportion of SNPs shared between sets of the top k SNPs for different methods is shown for 10 ≤ k ≤ 5000. Pairs of sets being compared are indicated in the key at the bottom; for example, Uncorr-EMMAX, comparison of uncorrected set and EMMAX set. ES100 indicates EIGENSOFT correcting for 100 principal components.
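
The concordance described in this caption can be sketched as follows: rank the SNPs by P value under each method, take the top k from each ranking, and report the fraction of SNPs the two sets share. This is a minimal sketch under that reading of the caption; the function name `top_k_concordance` is hypothetical, and ties are broken by SNP index, which is an assumption.

```python
def top_k_concordance(p_values_a, p_values_b, ks):
    """Proportion of SNPs shared between the top-k sets of two methods.

    p_values_a and p_values_b hold one P value per SNP, in the same SNP order.
    Returns a dict mapping each k to |top_k(A) & top_k(B)| / k.
    """
    if len(p_values_a) != len(p_values_b):
        raise ValueError("both methods must score the same SNPs")
    # SNP indices sorted from most to least significant (stable sort, so
    # ties keep their original index order).
    order_a = sorted(range(len(p_values_a)), key=p_values_a.__getitem__)
    order_b = sorted(range(len(p_values_b)), key=p_values_b.__getitem__)
    concordance = {}
    for k in ks:
        shared = set(order_a[:k]) & set(order_b[:k])
        concordance[k] = len(shared) / k
    return concordance
```

Plotting the returned proportions against k gives one curve per pair of methods, in the style of a CAT plot.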

**Figure 5**
Distribution of the marker-specific inflation factors from NFBC66 data sets. (a) Box plots of the marker-specific inflation factors across ten phenotypes, in addition to the genomic control inflation factor for each phenotype. Abbreviations are as in Fig. 2. (b,c) Distributions of P values of the height phenotype association when the estimated per-marker inflation factors are less than 1.05 (35,988 SNPs; b) and when they are greater than 1.2 (15,874 SNPs; c).

Cited by

The 1000 Chinese Indigenous Pig Genomes Project provides insights into the genomic architecture of pigs.
Du H, Zhou L, Liu Z, Zhuo Y, Zhang M, Huang Q, Lu S, Xing K, Jiang L, Liu JF. Nat Commun. 2024 Nov 22;15(1):10137. doi: 10.1038/s41467-024-54471-z. PMID: 39578420.
Releasing a sugar brake generates sweeter tomato without yield penalty.
Zhang J, Lyu H, Chen J, Cao X, Du R, Ma L, Wang N, Zhu Z, Rao J, Wang J, Zhong K, Lyu Y, Wang Y, Lin T, Zhou Y, Zhou Y, Zhu G, Fei Z, Klee H, Huang S. Nature. 2024 Nov;635(8039):647-656. doi: 10.1038/s41586-024-08186-2. Epub 2024 Nov 13. PMID: 39537922.
Multiomics dissection of Brassica napus L. lateral roots and endophytes interactions under phosphorus starvation.
Liu C, Bai Z, Luo Y, Zhang Y, Wang Y, Liu H, Luo M, Huang X, Chen A, Ma L, Chen C, Yuan J, Xu Y, Zhu Y, Mu J, An R, Yang C, Chen H, Chen J, Li Z, Li X, Dong Y, Zhao J, Shen X, Jiang L, Feng X, Yu P, Wang D, Chen X, Li N. Nat Commun. 2024 Nov 10;15(1):9732. doi: 10.1038/s41467-024-54112-5. PMID: 39523413.
Genome-Wide Association-Based Identification of Alleles, Genes and Haplotypes Influencing Yield in Rice (Oryza sativa L.) Under Low-Phosphorus Acidic Lowland Soils.
James M, Tyagi W, Magudeeswari P, Neeraja CN, Rai M. Int J Mol Sci. 2024 Oct 30;25(21):11673. doi: 10.3390/ijms252111673. PMID: 39519225.
Structural variation reshapes population gene expression and trait variation in 2,105 Brassica napus accessions.
Zhang Y, Yang Z, He Y, Liu D, Liu Y, Liang C, Xie M, Jia Y, Ke Q, Zhou Y, Cheng X, Huang J, Liu L, Xiang Y, Raman H, Kliebenstein DJ, Liu S, Yang QY. Nat Genet. 2024 Nov;56(11):2538-2550. doi: 10.1038/s41588-024-01957-7. Epub 2024 Nov 5. PMID: 39501128.
