1. Genome-wide Association Mapping
Avjinder Singh Kaler
PhD Candidate
Department of Crop, Soil, and Environmental Sciences
University of Arkansas
Nov-15-2016
Plant Breeding Lecture
2. Identify genomic regions associated with
phenotypes
Phenotypic Data
• Flowering time
• Plant height
• Yield
• Phenotype Variation
• Phenotypes are response
variables
Genotypic Data
• Genomic markers that span the
entire genome
• Single nucleotide
polymorphisms (SNPs) are
commonly used as markers
• Markers are explanatory
variables
10. GWAS based on Linkage Disequilibrium (LD)
• LD is the non-random correlation or association of alleles at
two loci
• D, D′ (normalized D), and r² are commonly used summary statistics to estimate pairwise LD
• r² is preferred in association studies because it is more indicative of how markers might correlate with QTL
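To make the relationship between the three LD statistics concrete, here is a minimal sketch that computes D, |D′|, and r² for two biallelic loci from phased haplotypes. The function name `pairwise_ld` and the 0/1 allele coding are assumptions made for this illustration; they are not from the lecture.

```python
def pairwise_ld(haplotypes):
    """Return (D, |D'|, r2) for two biallelic loci.

    haplotypes: list of (a, b) tuples, one per phased haplotype, with each
    allele coded 0 or 1 (a hypothetical coding chosen for this sketch).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("LD is undefined when a locus is monomorphic")

    # D: deviation of the observed haplotype frequency from independence
    d = p_ab - p_a * p_b

    # D': D scaled by the largest magnitude it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max

    # r2: squared correlation between the alleles at the two loci
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2
```

Calling `pairwise_ld([(1, 1)] * 5 + [(0, 0)] * 5)` describes two loci in complete LD, while one copy of each of the four possible haplotypes describes two independent loci.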
11. Visualize extent of LD between pairs of loci
Figure panels: LD Decay; LD Block (Haplotype View)
13. Genome-wide association study (GWAS)
• Identify genomic regions associated with a phenotype
• Fit a statistical model at each SNP in genome
• Use the fitted models to test H0: no association between the SNP and the phenotype
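The per-SNP testing loop can be sketched as follows. This is a deliberately naive single-marker scan (ordinary least squares of phenotype on allele dosage, with no correction for population structure or kinship), so it only illustrates the idea of fitting one model per SNP. The function names and the 0/1/2 dosage coding are assumptions for this example.

```python
import math

def snp_t_statistic(dosages, phenotypes):
    """Regress phenotype on allele dosage (0/1/2) for one SNP.

    Returns (slope, t_statistic), or None if the SNP is monomorphic or
    there are too few individuals. Under H0 (no association), t follows
    a t distribution with n - 2 degrees of freedom.
    """
    n = len(dosages)
    if n < 3:
        return None
    x_bar = sum(dosages) / n
    y_bar = sum(phenotypes) / n
    sxx = sum((x - x_bar) ** 2 for x in dosages)
    if sxx == 0:
        return None  # monomorphic marker: no variation to test
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, phenotypes))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(dosages, phenotypes))
    if sse == 0:
        return slope, math.copysign(math.inf, slope)  # degenerate perfect fit
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / se_slope

def single_marker_scan(genotypes, phenotypes):
    """genotypes: list of SNPs, each a list of dosages across individuals."""
    return [snp_t_statistic(snp, phenotypes) for snp in genotypes]
```

A p-value for each SNP would then be obtained by comparing the t-statistic with the t distribution on n - 2 degrees of freedom.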
14. Associating SNPs with phenotypes
• At each SNP: Conduct a test of association with trait
• Significant SNP/trait association suggests:
– SNP has direct biological function (functional polymorphism)
– SNP in LD with functional polymorphism(s)
Figure: genotypes of Lines 1 to 6 at five SNPs (A/C, T/C, G/A, A/G, G/T)
15. Genetic diversity can lead to false positives in a GWAS
• Two sources for false positives:
– Population structure: allele frequency differences among individuals due to local adaptation or diversifying selection
– Familial relatedness: allele frequency differences among individuals due to recent co-ancestry
Figure: Genetic diversity of 2,815 maize inbreds, Principal Coordinate 1 vs. Principal Coordinate 2 (Romay et al., 2013)
16. Controlling False Positives due to Population
Structure
• STRUCTURE (Q)
– Identifies subpopulations within a sample of individuals collected from a population of unknown structure
– Estimates the Q matrix (subpopulation membership of each individual)
– Time-consuming
• Principal Component Analysis (PCA)
– Fast and effective approach to diagnose population structure
– Summarizes variation observed across all markers into a smaller number of underlying component variables
– Estimates the PC matrix
17. Principal Component Analysis
• Scree plot: shows the fraction of total variance in the data explained by each PC
• PCs are selected based on the L-curve (the elbow of the scree plot)
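As an illustration of how PCs and the scree-plot values might be obtained, the sketch below runs PCA on a small individuals-by-markers dosage matrix using a singular value decomposition. It assumes NumPy is available; the function name `genotype_pca` and the toy genotype matrix are hypothetical and not taken from the lecture.

```python
import numpy as np

def genotype_pca(genotypes):
    """PCA of an individuals x markers matrix of allele dosages (0/1/2).

    Returns (scores, var_explained): PC scores for each individual and the
    fraction of total variance explained by each PC (the scree-plot values).
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)              # center each marker
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                             # coordinates of individuals on the PCs
    var_explained = s ** 2 / np.sum(s ** 2)    # fraction of variance per PC
    return scores, var_explained

# Toy data: two groups of three individuals with contrasting genotypes
example_genotypes = [
    [0, 0, 0, 2, 2],
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [2, 2, 2, 0, 0],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
]
scores, var_explained = genotype_pca(example_genotypes)
```

Plotting `var_explained` against PC number gives the scree plot, and the leading columns of `scores` are the candidates for fixed-effect covariates in the association model.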
18. Controlling False Positives due to Familial
relatedness
• A kinship coefficient (F) is the probability that two homologous genes are identical by descent
• Kinship estimated from genetic markers is a relative kinship based on the probability of identity by state
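A minimal sketch of a marker-based, identity-by-state similarity matrix is shown below. It uses one simple definition (the average proportion of alleles shared across markers) chosen for illustration; it is an assumption of this example and not necessarily the estimator used by any particular GWAS package. The function name `ibs_kinship` is hypothetical.

```python
def ibs_kinship(genotypes):
    """Pairwise identity-by-state similarity.

    genotypes: list of individuals, each a list of allele dosages (0/1/2)
    with no missing values. Returns a symmetric n x n list of lists where
    entry [i][j] is the mean proportion of alleles shared across markers.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # 2 - |difference in dosage| counts shared alleles (0, 1, or 2) per marker
            shared = sum(2 - abs(a - b) for a, b in zip(genotypes[i], genotypes[j]))
            k[i][j] = k[j][i] = shared / (2 * m)
    return k
```

Two individuals with identical genotypes score 1, and two individuals that are opposite homozygotes at every marker score 0.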
19. Mixed models reduce false positives in GWAS
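$$Y_i = \mu + \sum_{j} \beta_j c_{ij} + \alpha x_i + \mathrm{Line}_i + \varepsilon_i$$
• $Y_i$: phenotype of the ith individual
• $\mu$: grand mean
• $\sum_j \beta_j c_{ij}$: fixed effects that account for population structure, where $c_{ij}$ are the population-structure covariates (Q-matrix columns or PCs)
• $\alpha$: marker effect
• $x_i$: observed SNP alleles of the ith individual
• $\mathrm{Line}_i$: random effects that account for familial relatedness, with $(\mathrm{Line}_1, \ldots, \mathrm{Line}_n) \sim \mathrm{MVN}(0, 2K\sigma^2_G)$
• $K$ = kinship matrix, which measures relatedness between individuals
• $\varepsilon_i$: random error term, with $\varepsilon_i \sim$ i.i.d. $N(0, \sigma^2_E)$
Yu et al. (2006)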
21. Germplasm Selection
• Choice of germplasm is critical to the success of the association analysis
• Phenotyping
• Experimental design
• Collection of high-quality phenotypic data
22. Phenotypic Outliers
• Outliers are "unusual" data points that deviate substantially from the mean and strongly influence parameter estimates
• ALWAYS check for outliers in your data sets
– Do NOT ignore outliers if detected
23. Phenotypic Outliers
• Outliers can
– increase error variance
– reduce the power of statistical tests
– distort estimates
– decrease normality if non-randomly distributed
• Potential causes of outliers
– Human errors in data collection, recording, or entry
– Technical errors from faulty or uncalibrated phenotyping equipment
– Intentional or motivated misreporting, such as "speed" phenotyping in a hot field environment
24. Evaluate Data for Outliers
• Histogram
• Box plot (box-and-whisker plot)
• Quantile-quantile (Q-Q) plot: graphical method for comparing two probability distributions to assess goodness of fit
Get to know your data!
25. Statistical Identification of Outliers
• Cook's distance: measures the influence of a data point, flagging points that substantially change the effect estimates when they are removed.
• Deleted studentized residuals: measure how far a data point falls from the least squares fit obtained without it, flagging points with unusual response values.
Two of several possible methods
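Both diagnostics can be sketched for the simplest case, a one-predictor least squares fit. In practice they would be computed from whatever model is fitted to the phenotypic data; the simple regression, the function name `influence_diagnostics`, and the toy data below are assumptions made to keep the example short.

```python
import math

def influence_diagnostics(x, y):
    """Cook's distance and deleted studentized residuals for y = a + b*x.

    Returns (cooks_d, deleted_t), two lists with one value per data point.
    Requires more than 3 points and some variation in x.
    """
    n, p = len(x), 2                      # p = number of fitted parameters
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    mse = sse / (n - p)
    # Leverage (hat value) of each point in simple linear regression
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    cooks_d = [e * e / (p * mse) * h / (1 - h) ** 2
               for e, h in zip(resid, leverage)]
    deleted_t = [e * math.sqrt((n - p - 1) / (sse * (1 - h) - e * e))
                 for e, h in zip(resid, leverage)]
    return cooks_d, deleted_t

# Toy data: the last observation is far above the trend of the others
x_demo = list(range(1, 11))
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 17.9, 40.0]
cooks_d, deleted_t = influence_diagnostics(x_demo, y_demo)
```

Points with a Cook's distance or an absolute deleted studentized residual far larger than the rest are the ones to inspect; any numeric cutoff is a rule of thumb to be chosen by the analyst.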
26. Removal of Outliers
• Removing anomalous data points from data sets is considered controversial by some.
• If outliers are not removed, inferences made from the fitted model may not be representative of the population under study.
• If you remove outliers, be sure to report it in the manuscript.
27. Non-Normal Trait Data
•When fitting a mixed model, two very important
assumptions are that the error terms follow a normal
distribution and that there is a constant variance.
•When data are non-normal, these two assumptions in
particular could be violated.
28. Analysis of Non-Normal Trait Data
• Generalized linear mixed models can be used to analyze non-normal data
• The Box-Cox procedure can be used to find the most appropriate transformation that corrects for non-normality of the error terms and unequal variances.
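The Box-Cox idea can be sketched as a grid search over the transformation parameter λ, keeping the value that maximizes the profile log-likelihood. For brevity this example applies the procedure to raw positive trait values (an intercept-only model) rather than to the residuals of a full model, which is a simplifying assumption; the function names and the grid from -2 to 2 are likewise choices made for this illustration.

```python
import math

def boxcox_transform(y, lam):
    """Box-Cox transform of positive values: log(y) if lam == 0, else (y**lam - 1) / lam."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

def boxcox_loglik(y, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    n = len(y)
    t = boxcox_transform(y, lam)
    mean = sum(t) / n
    var = sum((v - mean) ** 2 for v in t) / n      # maximum likelihood variance
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(v) for v in y)

def best_lambda(y, grid=None):
    """Return the grid value of lam with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]    # -2.0, -1.9, ..., 2.0
    return max(grid, key=lambda lam: boxcox_loglik(y, lam))
```

A selected λ near 0 points to a log transformation, near 0.5 to a square root, and near 1 to no transformation.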
32. Genotype-Quality Control
• Remove monomorphic markers
• Remove markers with minor allele frequency (MAF) < 5% (or < 3%)
• Remove markers with a high missing rate (e.g., > 10%)
• Impute the remaining missing data (LD-kNNi, FILLIN, FSFHap, BEAGLE)
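The filtering steps above can be sketched as a small function. The thresholds default to the values on the slide (MAF of 5%, missing rate of 10%); the function names, the 0/1/2 dosage coding, and the use of `None` for a missing call are assumptions for this example. Imputation is not covered here.

```python
def passes_qc(dosages, min_maf=0.05, max_missing=0.10):
    """Decide whether one marker passes quality control.

    dosages: allele dosages (0/1/2) across individuals, None = missing call.
    """
    called = [d for d in dosages if d is not None]
    missing_rate = 1 - len(called) / len(dosages)
    if not called or missing_rate > max_missing:
        return False                          # too many missing genotypes
    p = sum(called) / (2 * len(called))       # frequency of the counted allele
    maf = min(p, 1 - p)
    return maf >= min_maf                     # monomorphic markers have MAF 0

def filter_markers(markers, min_maf=0.05, max_missing=0.10):
    """Return the indices of the markers that pass QC."""
    return [i for i, m in enumerate(markers)
            if passes_qc(m, min_maf, max_missing)]
```

For example, with 20 individuals, a marker carrying a single heterozygote has MAF 1/40 = 2.5% and would be dropped at the 5% threshold but kept at 3% only if its MAF reached 3%.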
33. Controlling False Positives
• Population structure: allele frequency differences among individuals due to local adaptation or diversifying selection
• Familial relatedness: allele frequency differences among individuals due to recent co-ancestry
• If not properly controlled, both can cause spurious associations in GWAS
37. What is a significant association?
• Bonferroni correction: procedure to control the family-wise error rate (FWER), i.e., the probability of making one or more type I errors
– Simplest and most conservative method to control the FWER
– Significance threshold calculated as α/n, where n is the number of hypotheses (i.e., SNPs tested)
• False discovery rate (FDR): procedure to control the expected proportion of false discoveries among the tests declared significant
– Less stringent than Bonferroni
– The q-value is the FDR analogue of the p-value; e.g., calling all tests with q ≤ 0.10 significant is expected to yield about 10 false discoveries per 100 significant tests
• Use the list of p-values from ALL SNP tests as input to the R function p.adjust or to packages such as qvalue and fdrtool
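For readers who want to see the arithmetic, here is a minimal sketch of the Bonferroni threshold and of Benjamini-Hochberg adjusted p-values, the adjustment that R's `p.adjust` performs with `method = "BH"`. The Python function names are hypothetical. Note that BH-adjusted p-values are closely related to, but not the same as, the q-values estimated by the qvalue package.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls the FWER at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

p_demo = [0.01, 0.04, 0.03, 0.005]
threshold = bonferroni_threshold(0.05, len(p_demo))       # 0.05 / 4
bonferroni_hits = [i for i, p in enumerate(p_demo) if p < threshold]
bh_adjusted = benjamini_hochberg(p_demo)
```

With these four p-values, the Bonferroni threshold of 0.0125 retains only the two smallest, whereas all four BH-adjusted values fall below 0.05, which illustrates why FDR control is described as less stringent.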
Slide adapted from Prof. Jim Holland
39. Genome-wide Association Mapping Results
QQ-plot: assesses performance of the statistical model
Figure panels: simple model without correcting for population structure; mixed linear model
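The coordinates of such a QQ-plot can be sketched as below: observed -log10(p) values are sorted and paired with the values expected if every SNP followed the null hypothesis (uniformly distributed p-values). The function name `qq_points` and the (i - 0.5)/n plotting positions are choices made for this example; other plotting positions are also in common use.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p), sorted ascending.

    Expected values assume uniform p-values under the null hypothesis,
    using plotting positions (i - 0.5) / n. All p-values must be > 0.
    """
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))
```

Points that follow the diagonal suggest the model is well calibrated; observed values that rise above the diagonal across most of the range suggest inflation, for example from uncorrected population structure.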