3. Structure of this lecture
• Recap some concepts (SAS tutorial later)
• Discuss GWAS
• Look at the steps in running a GWAS & analyzing the results
• Lab – analyze a GWAS
• SAS tutorial
4. Recap from last session
• At an A/T locus:
• -What are the genotypes?
• -What are the alleles?
• Given the following genotype counts at a locus:
A/A: N=123
A/C: N=134
C/C: N=52
Which is the minor allele?
What are the allele frequencies?
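The allele-counting arithmetic behind these two questions can be sketched as follows. This is a minimal illustration, not part of the original slides; the function name `allele_frequencies` is hypothetical. Each homozygote contributes two copies of its allele and each heterozygote contributes one copy of each.

```python
def allele_frequencies(n_aa, n_ac, n_cc):
    """Return allele frequencies from genotype counts at an A/C locus."""
    total_alleles = 2 * (n_aa + n_ac + n_cc)   # every person carries 2 alleles
    freq_a = (2 * n_aa + n_ac) / total_alleles
    freq_c = (2 * n_cc + n_ac) / total_alleles
    return {"A": freq_a, "C": freq_c}

# Genotype counts from the slide: A/A = 123, A/C = 134, C/C = 52
freqs = allele_frequencies(123, 134, 52)
minor_allele = min(freqs, key=freqs.get)       # allele with the lower frequency
print(freqs, minor_allele)
```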
5. Recap from last session
• Additive model: Each ‘risk’ allele conveys risk for
the phenotype in an additive fashion
• Recessive effect of the minor allele: The minor
allele conveys risk, but you need 2 copies to have
the risk
• Dominant effect of the minor allele: The minor
allele conveys risk, and you only need 1 copy to
have the risk
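One way to make the three models concrete is to show how the same genotype is recoded as a predictor under each of them. The sketch below is illustrative only; `code_genotype` and the model labels are hypothetical names, and the input is assumed to be the number of minor alleles a person carries (0, 1 or 2).

```python
def code_genotype(minor_allele_count, model):
    """Recode a minor-allele count (0, 1, 2) under a genotype model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        return minor_allele_count              # 0, 1, 2: each copy adds risk
    if model == "dominant":
        return int(minor_allele_count >= 1)    # 0, 1, 1: one copy is enough
    if model == "recessive":
        return int(minor_allele_count == 2)    # 0, 0, 1: two copies are needed
    raise ValueError("unknown model: " + model)

for model in ("additive", "dominant", "recessive"):
    print(model, [code_genotype(g, model) for g in (0, 1, 2)])
```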
6. Recap from last session
• Given these genotype means, what genotype
models would you run? (Additive / recessive effect
of the minor allele / dominant effect of the minor
allele?):
#   aa  aA  AA
1   10  15  15
2   23  35  42
3   10  12  14
4    9  11  32
5    5   5  16
6    1   5   7
11. What is a GWAS
Basic idea: Genotype individuals for a large number (~1M) of SNPs spread in a
generally unspecified way throughout the genome. Look for association.
Advantage?
What does this table show?
15. Why do we question the success of GWAS?
• What is the median income in the US?
– $50,000
• What is federal tax?
– 25%
• Median tax/year is
– ~12,500
• What is the average cost of an NIH genetics grant?
– $477,215 / $350,000
• 30-40 working years
16. Huge numbers of published GWAS!
• http://www.genome.gov/gwastudies/
• http://www.ebi.ac.uk/fgpt/gwas/#
17. JAMA, 2007
• T2DB suspected that genes “related to insulin
resistance would be identified. Interestingly, not
one gene known to be associated with insulin
resistance has been found in the recent set of
genome-wide scans for diabetes, but genes
related to insulin secretion, insulin transport, zinc
binding to insulin, and pancreatic islet beta cell
development have been discovered”
18. And yet….
• Sir Alec Jeffreys: “One of the great hopes for GWAS was
that, in the same way that huge numbers of Mendelian
disorders were pinned down at the DNA level and the gene
and mutations involved identified, it would be possible to
simply extrapolate from single gene disorders to complex
multigenic disorders. That really hasn’t happened.
Proponents will argue that it has worked and that all sorts
of fascinating genes that predispose to or protect against
diabetes or breast cancer, for example, have been
identified, but the fact remains that the bulk of the
heritability in these conditions cannot be ascribed to loci
that have emerged from GWAS, which clearly isn’t going to
be the answer to everything.”
19. Cell, 2010
• “To date, genome-wide association
studies (GWAS) have published hundreds of
common variants whose allele frequencies are
statistically correlated with various illnesses and
traits. However, the vast majority of such variants
have no established biological relevance to
disease or clinical utility for prognosis or
treatment.”
22. • With each step think:
“What effect does this have on our ability to find
genes for traits and disorders?”
23. Running a GWAS: Getting your genotype data
• Select your chip
• Complete your genotyping
24. Running a GWAS: QC-ing SNP data
• Poor quality samples
– Sample genotype success rate < 95 to 97.5%
– Greater proportion of heterozygous genotypes than
expected
• Related individuals (if independent samples)
– Based on pair-wise comparisons of similarity of
genotypes
• Sample switches
– Wrong sex
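The first sample-level filter (genotype success rate) can be sketched in a few lines. This is a hypothetical illustration, not the pipeline used in the lecture: genotypes are assumed to be stored per sample as a list with `None` marking a failed call, and the 95% cutoff is the lower end of the range quoted on the slide.

```python
def call_rate(genotypes):
    """Fraction of SNPs with a non-missing genotype for one sample."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)

def failing_samples(samples, min_call_rate=0.95):
    """Return IDs of samples whose call rate is below the cutoff."""
    return [sid for sid, g in samples.items() if call_rate(g) < min_call_rate]

# Toy data (assumed): 20 SNPs per sample, None = failed genotype call
samples = {
    "S1": [0, 1, 2, 1] * 5,                  # all 20 called
    "S2": [0, 1, 2, 1] * 4 + [None, None, 0, 1],  # 18 of 20 called
}
flagged = failing_samples(samples)
print(flagged)
```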
25. Running a GWAS: QC-ing SNP data
• This can get very complex!
• “GOLDN SNPs that were monomorphic (55,530) or had a call rate <96 %
(82,462) were removed from the analysis. In addition, SNPs were
excluded from the analysis based on the number of families with
Mendelian errors as follows: for minor allele frequency (MAF) <20 %,
removed if errors were present in 3 families (1,486 SNPs); for MAF < 10
%, removed if errors were present in 2 families (1,338 SNPs); MAF <
5%, removed if errors were present in 1 family (1,767 SNPs); for MAF
<5 %, removed if any errors were present (9,592 SNPs). In families with
remaining errors, SNPs that exhibited Mendelian error were set to
missing (31,595 SNPs). Furthermore, 16 participants with call rates <96
% were also removed from any subsequent analyses. Subsequently,
748 SNPs failing the Hardy–Weinberg equilibrium (HWE) test at P
value < 10^-6 were excluded.”
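The HWE filter mentioned at the end of the quote can be illustrated with a simple 1-df chi-square test. This is a sketch under assumptions: the quoted study does not say which HWE test it used (exact tests are also common), the function name `hwe_chi_square` is hypothetical, and the genotype counts are illustrative. The SNP is assumed to be polymorphic so that no expected count is zero.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium for one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)            # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = hwe_chi_square(123, 134, 52)   # illustrative counts
fails_hwe = p_value < 1e-6                     # threshold from the quote
print(round(chi2, 3), round(p_value, 3), fails_hwe)
```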
26. Running a GWAS: Imputing more SNPs
Useful for:
• Combining data from different platforms (e.g., Affy & Illumina) (for
replication or meta-analysis)
• Estimating unmeasured or missing genotypes
• Based on measured SNPs and external info (e.g., haplotype
structure of HapMap)
• Imputation methods use the dense genotype data available from
HapMap samples (i.e., CEU) and the LD relationships of the
SNPs to impute (predict) genotypes for a large number of SNPs
that were not measured experimentally
27. “Short cuts”
[Figure: aligned haplotype sequences with the SNP positions marked]
SNPs 1, 3 and 4 are TagSNPs
28. Running a GWAS: Imputing more SNPs
• Requires large scale computing resources
• Can go from ~500,000 SNPs to ~2.2 million
• Need to assess quality of imputation
– Compare imputed genotypes to actual genotypes
• Error rates are higher than for genotyped SNPs
• Works less well for rarer alleles (e.g., MAF<3%)
• Best to take account of probabilities assigned to imputed
genotypes in the analysis
– “dosages” = expected allele counts computed from the genotype probabilities
• Allows association testing of untyped variation
• Allows for ease of combining data across genotyping platforms
Genotype:     AA    AG    GG
Probability:  80%   20%   0%
Coding:       2     1     0
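Using the probabilities and coding in the small table above, a dosage can be computed as the probability-weighted sum of the genotype codes. The sketch below is illustrative; the function name `dosage` is hypothetical.

```python
def dosage(prob_by_genotype, coding):
    """Expected allele count: sum of (genotype probability x genotype code)."""
    return sum(prob_by_genotype[g] * coding[g] for g in prob_by_genotype)

# Values from the slide: P(AA) = 80%, P(AG) = 20%, P(GG) = 0%
probs = {"AA": 0.80, "AG": 0.20, "GG": 0.00}
coding = {"AA": 2, "AG": 1, "GG": 0}

d = dosage(probs, coding)   # 0.8 * 2 + 0.2 * 1 + 0 * 0, i.e. about 1.8
print(round(d, 2))
```

The dosage (about 1.8 here) is then used as the predictor in place of a hard 0/1/2 genotype, so the uncertainty of the imputation carries into the association test.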
30. Running a GWAS: Imputing more SNPs
MACH, Markov Chain Haplotyping
– Developed by Goncalo Abecasis
– http://www.sph.umich.edu/csg/abecasis/MACH/
IMPUTE
– Developed by Jonathan Marchini
– http://www.stats.ox.ac.uk/~marchini/#software
BIMBAM
– http://quartus.uchicago.edu/~yguan/bimbam/index.html
31. Running a GWAS: QC-ing SNP data
• More QC!
• MAF again
• Imputation quality (e.g., exclude SNPs with RSQ < .3?)
32. Running a GWAS: Computing your p-values
• Commonly SNPs coded as the additive effect of alleles
• Logistic regression or linear regression (much like a candidate gene
study)
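For a quantitative trait, the additive test for a single SNP amounts to regressing the phenotype on the allele count (0, 1, 2). The following is a minimal sketch with made-up data and no covariates; the function name is hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples. A GWAS repeats this for every SNP.

```python
import math

def additive_linear_regression(allele_counts, phenotype):
    """Simple linear regression of phenotype on allele count (0/1/2)."""
    n = len(allele_counts)
    mean_x = sum(allele_counts) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(allele_counts, phenotype))
    beta = sxy / sxx                           # effect per copy of the allele
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2
              for x, y in zip(allele_counts, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)        # standard error of beta
    z = beta / se
    p_approx = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approx.
    return beta, se, p_approx

# Toy data (assumed): 9 people, 3 per genotype group
x = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y = [1.0, 1.2, 0.8, 1.9, 2.1, 2.0, 3.1, 2.9, 3.0]
beta, se, p_approx = additive_linear_regression(x, y)
print(beta, se, p_approx)
```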
34. Running a GWAS: Interpreting your data
• (Really… it is more QC)….
• 2 main issues:
• Population stratification
• Multiple testing
35. The Problems of population substructure
• Devlin and Roeder (1999) used
theoretical arguments to propose that
with population structure, the
distribution of Cochran-Armitage trend
tests, genome-wide, is inflated by a
constant multiplicative factor λ.
• We can estimate the multiplicative
inflation factor using the statistic
$\lambda = \mathrm{median}(X_i^2)/0.456$.
• Inflation factor λ > 1 indicates
population structure and/or
genotyping error.
• We can carry out an adjusted test of
association that takes account of any
mismatching of cases/controls at any
SNP using the statistic $X_i^2/\lambda$.
[Figure: QQ plot of genome-wide test statistics, inflation factor λ = 1.11. Annotations: “Population outliers and/or structure?” and “True hits?”]
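The genomic control calculation described on this slide is short enough to sketch directly. The function names and the five test statistics are made up for illustration; a real analysis would use the chi-square statistics from every SNP in the scan.

```python
import statistics

def genomic_inflation(chi2_stats):
    """Inflation factor: median observed chi-square divided by 0.456."""
    return statistics.median(chi2_stats) / 0.456

def genomic_control_adjust(chi2_stats):
    """Divide every test statistic by the estimated inflation factor."""
    lam = genomic_inflation(chi2_stats)
    return [x / lam for x in chi2_stats]

# Toy 1-df chi-square statistics (assumed values)
chi2_stats = [0.1, 0.3, 0.5016, 1.2, 6.0]
lam = genomic_inflation(chi2_stats)
adjusted = genomic_control_adjust(chi2_stats)
print(round(lam, 3), [round(x, 3) for x in adjusted])
```

After adjustment the median statistic sits back at 0.456, the value expected under the null.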
37. Solving population stratification
• Principal components analysis (PCA) has become a standard tool in genetics to
identify subpopulations
• Used to infer continuous axes of genetic
variation (eigenvectors) that reduce the data
to a small number of dimensions while
describing as much of the variability among
individuals as possible
• Utilizes a set of “neutral” SNPs (not
associated with phenotype, need many)
• Implemented in the EIGENSTRAT software
package.
39. Population substructure thought
experiment
• You run a GWAS and find a significant hit
within a gene. The SNP is genotyped,
biologically plausible, in HWE (p>.05) and has
a MAF = .13. However, your λ=1.56! So you
decide not to use the SNP.
• What would have happened if you had taken a
candidate gene approach and only analyzed
that SNP?
41. Multiple testing
• When all tests are independent,
– Probability to observe P<.01 is
– Probability to observe P<.05 is
– Probability to observe P<.5 is
43. Multiple testing
• At an alpha of .05, each test of a true null hypothesis has a
5% probability of giving a significant result by chance.
• How many significant findings do we expect to
arise by chance?
• How many tests does a GWAS have?
• So, how many significant findings do we expect to
arise by chance?
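The arithmetic behind these questions can be sketched as below, assuming independent tests, no true associations, and roughly one million SNPs (the figure given earlier in the lecture). The function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance findings when every null is true."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha):
    """Probability of one or more chance findings across independent tests."""
    return 1 - (1 - alpha) ** n_tests

n_snps = 1_000_000
by_chance = expected_false_positives(n_snps, 0.05)
any_hit_20_tests = prob_at_least_one(20, 0.05)
print(by_chance, round(any_hit_20_tests, 3))
```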
44. Multiple testing corrections
• Bonferroni:
• When k independent tests are used, the corrected
significance threshold should be: α / k
“Bonferroni adjustments are, at best, unnecessary, and, at worst,
deleterious to sound statistical inference.” Perneger, 1998
While it is computationally simple (single step method), and
stringent, maybe it is too stringent? High probability of Type
II errors.
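As a rough illustration of how stringent this becomes at GWAS scale, the sketch below computes the Bonferroni threshold for one million tests and the resulting family-wise error rate, assuming independent tests. The function names are hypothetical.

```python
def bonferroni_threshold(alpha, k):
    """Per-test significance threshold for k tests."""
    return alpha / k

def family_wise_error(per_test_alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - per_test_alpha) ** k

k = 1_000_000
threshold = bonferroni_threshold(0.05, k)   # about 5e-8
fwer = family_wise_error(threshold, k)      # stays just under 0.05
print(threshold, round(fwer, 4))
```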
45. Multiple testing corrections
• Holm
$\text{threshold} = \dfrac{\text{target } p \text{ value}}{n - \text{rank} + 1}$,
where rank is the rank number of the pair in terms of degree of
significance (1 = most significant)
• Assume 3 p-values, and α=.05
• the smallest p value has to be smaller than .05/3 =
.017 to be sig
• the second smallest p value has to be smaller than
.05/2 = .025 to be sig
• and the third smallest has to be smaller than .05/1 =
.05 to be sig.
• Low probability of Type II errors… but too low?
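The step-down procedure in the bullets above can be sketched as follows. The function name and the three example p-values are made up; the comparison uses "smaller than", as on the slide.

```python
def holm_significant(p_values, alpha=0.05):
    """Holm step-down: return True/False per test, in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # smallest p first
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        threshold = alpha / (n - rank + 1)
        if p_values[idx] < threshold:
            significant[idx] = True
        else:
            break                # once one test fails, all larger p-values fail
    return significant

# Thresholds for 3 tests are .05/3, .05/2 and .05/1
result = holm_significant([0.01, 0.04, 0.03])
print(result)
```

Here .01 passes .017, but .03 does not pass .025, so the procedure stops and .04 is never compared with .05.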
50. Multiple testing corrections
• Storey – q values
• For a given test, we estimate the q-value by
calculating the minimum estimated false discovery
rate among all thresholds at which that test is
called significant
• E.g. Rank 52 has a p-value of 0.01 and a q-value of
0.0141.
• A p-value threshold of 0.01 implies that 1% of all
tests are expected to be false positives. Therefore
839 * .01 = 8.39 expected false positives. A q-value
threshold of .01 implies that 1% of the tests called
significant at that threshold are expected to be false
positives. In this experiment 52 tests had a q-value of
0.0141 or less, and so 52*0.0141 = 0.7332 expected
false positives.
• False positives according to p-values take all 839
values into account, while q-values take into
account only those tests with q-values less than the
threshold we choose.
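A simplified version of the q-value calculation is sketched below. Note the swap: this is the Benjamini-Hochberg adjustment, which matches Storey's q-value only when the proportion of true nulls is assumed to be 1; Storey's method additionally estimates that proportion from the data. The function name and p-values are made up.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values with pi0 fixed at 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the minimum estimated FDR
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q

q = bh_q_values([0.001, 0.008, 0.039, 0.041, 0.9])
print([round(v, 5) for v in q])
```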
51. Multiple testing example
• *Think* it is a sliding scale, magic p-value of ‘no
error’. BUT, repeatedly taken a range of SNPs
• 1.0 × 10^-5 > p > 1.0 × 10^-9
• So, what correction would you use? A stringent
one? A lenient one?
• What is the purpose of GWAS?
52. Multiple testing thought experiment
• Jane comes to you, she is interested in whether SNPs in
LDL1R gene are associated with CVD. She uses all 7 SNPs
genotyped in LDL1R, and the P-values are as follows: .0001,
.01, .0006, .000006, .000001, .05, .09. How many SNPs are
significantly associated with CVD?
• John comes to you, and has run a GWAS on a million SNPs,
including Jane’s 7 in LDL1R. He does a Bonferroni
correction AND an FDR correction on his GWAS, and none
of the SNPs survive the correction for multiple testing. He
concludes that the LDL1R gene is not associated with CVD
• Who is right?
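To see how the same seven p-values fare under the two testing burdens, the sketch below applies a Bonferroni threshold for 7 tests and for 1,000,000 tests. It only shows the arithmetic and does not settle the question of who is right; the function name is hypothetical.

```python
def count_bonferroni_hits(p_values, n_tests, alpha=0.05):
    """Count p-values below the Bonferroni threshold alpha / n_tests."""
    threshold = alpha / n_tests
    return sum(p < threshold for p in p_values)

# Jane's seven p-values from the slide
jane_p = [0.0001, 0.01, 0.0006, 0.000006, 0.000001, 0.05, 0.09]

jane_hits = count_bonferroni_hits(jane_p, n_tests=7)           # threshold ~ .0071
john_hits = count_bonferroni_hits(jane_p, n_tests=1_000_000)   # threshold 5e-8
print(jane_hits, john_hits)
```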
53. Multiple testing: a final thought
• Power is a function of
-N
-Effect size (subsuming error)
-Alpha
56. Running a GWAS: Visualize your results
K Wang et al. Nature 000, 1-6 (2009) doi:10.1038/nature07999
57. Running a GWAS: Interpreting SNPs
• Look at the functionality of your SNP (SNPdoc)
• Literature search – can you give biological
plausibility?
• Other tests: pathway analysis / Gene based tests
• Tonne of free & very expensive tools out there.
Conduct your own quality control!!
59. Running a GWAS: Replication
Replication, replication, replication
Hirschhorn & Daly, Nat. Rev. Genet. 6: 95, 2005
NCI-NHGRI Working Group on Replication, Nature
447: 655, 2007
60. Winner’s Curse
• ‘Winner's curse’ = the phenomenon whereby winners at competitive auctions
are likely to pay in excess of the value of the item.
• In genetic association studies, the winner's curse is the phenomenon whereby
the disease risk of a newly identified genetic association is overestimated.
• Occurs when the statistical power of the original study is not sufficient.
• The winner's curse implies that the sample size required for a confirmatory
study will be underestimated, resulting in failure of the replication study to
corroborate the association.
• The winner's curse is common in genome-wide association (GWA) studies
because most single-GWA studies are underpowered to detect small genetic
effects at a stringent genome-wide significance level.
• What are the solutions? Large GWAS (or a meta-analysis)
61. Can you solve the case of the missing
heritability?
Hint. What have we discussed
about:
-Traits
-Statistics
-Genotype models
-Coverage of the genome
-Power?
63. Primer on SAS: Libraries
Find your libraries
here. Toggle between
libraries and results
Work is the default
library. SAS will pull files
from here if nothing else
is specified.
Here are other libraries
I have made
64. Primer on SAS: Reading in data
• For SAS datasets: Tell SAS to make a
library
Give SAS a path. SAS will read in
all SAS datasets in this folder
65. Primer on SAS: Reading in data
• For non-SAS datasets:
Follow on the import
wizard (I have code if you
prefer)
66. Primer on SAS: Move your file into ‘work’
‘Data’ command:
make a dataset. Call
it Goldn. There is no
library so put it in
‘work’.
‘set’ command: base the
new dataset off goldn in
the goldn library
Library
Dataset
68. Primer on SAS: Summarizing Data
• Continuous variables:
Command
‘proc
univariate’
Tell SAS
which
variables
Select a
subsample There are other commands. E.g. ‘BY’. Your data must be
sorted by the ‘BY’ variable to use this command
69. Primer on SAS: Summarizing Data
• Categorical variables:
Command
‘proc freq’
Tell SAS
which
variables
70. Primer on SAS: Manipulating data
Make your
data… set your
data. Think:
Libraries.
Options (look up
others)
71. Primer on SAS: Sorting your dataset
Command =
proc sort
Which dataset
to sort
What to call the
sorted dataset
What to sort by
72. Primer on SAS. One fundamental rule
• Always check your log!
74. Special commands: QQ plot
1. Make a new dataset (to preserve yours)
2. Set your data
3. Set to the number of SNPs
4. Select the smallest p-value power
(here 1×10^-25)
5. Select your data
NOTE: Your data must be sorted by p-value
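The logic behind a QQ plot is the same in any package: sort the p-values, then pair each observed -log10(p) with the value expected for its rank under the null. The sketch below uses Python rather than SAS and is not the lab code; the function name is hypothetical, and using rank / (n + 1) for the expected p-value is one common convention among several.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a QQ plot."""
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = -math.log10(rank / (n + 1))   # null expectation for this rank
        observed = -math.log10(p)
        points.append((expected, observed))
    return points

points = qq_points([0.5, 0.01, 0.2])   # toy p-values (assumed)
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

Points that rise above the diagonal at the right-hand end are the candidates for true hits; a line that departs from the diagonal along its whole length suggests inflation.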
76. Special commands: Correcting for multiple
testing
• 1. Make a dataset with ONLY your p-values in (no
other columns)
• 2. Make sure your P-values are called RAW_P
• 3. Set your data
• 4. Select your corrections
(look them up)