Session 7: Genome-wide Association
Studies (GWAS)
But first….
June 13th 2013
Structure of this lecture
• Recap some concepts (SAS tutorial later)
• Discuss GWAS
• Look at the steps in running a GWAS & analyzing the
results
• Lab – analyze a GWAS
• SAS tutorial
Recap from last session
• At an A/T locus:
– What are the genotypes?
– What are the alleles?
• Given the following frequencies at a locus:
A/A: N=123
A/C: N=134
C/C: N=52
Which is the minor allele?
What are the allele frequencies?
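One way to check the recap answers is to count alleles directly. This is a minimal sketch; the function name `allele_frequencies` is made up for illustration, and the counts are the ones on the slide.

```python
# Genotype counts from the recap slide (A/A, A/C, C/C).
genotype_counts = {"AA": 123, "AC": 134, "CC": 52}


def allele_frequencies(counts):
    """Return {allele: frequency} for one locus from genotype counts."""
    allele_totals = {}
    for genotype, n_people in counts.items():
        # Each person carries two alleles, one per letter of the genotype.
        for allele in genotype:
            allele_totals[allele] = allele_totals.get(allele, 0) + n_people
    total_alleles = sum(allele_totals.values())
    return {a: c / total_alleles for a, c in allele_totals.items()}


freqs = allele_frequencies(genotype_counts)
minor_allele = min(freqs, key=freqs.get)  # the allele with the lower frequency
print(freqs, "minor allele:", minor_allele)
```

With 309 people there are 618 alleles in total: 380 copies of A and 238 copies of C.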
Recap from last session
• Additive model: Each ‘risk’ allele conveys risk for
the phenotype in an additive fashion
• Recessive effect of the minor allele: The minor
allele conveys risk, but you need 2 copies to have
the risk
• Dominant effect of the minor allele: The minor
allele conveys risk, and you only need 1 copy to
have the risk
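The three models differ only in how the number of minor alleles is turned into a predictor. A small sketch, assuming genotypes are already stored as minor-allele counts (0, 1 or 2); the function name `code_genotype` is hypothetical.

```python
def code_genotype(minor_allele_count, model):
    """Code a genotype (0, 1 or 2 minor alleles) under a genetic model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        # Each extra copy of the minor allele adds the same amount of risk.
        return minor_allele_count
    if model == "dominant":
        # One copy of the minor allele is enough to carry the risk.
        return 1 if minor_allele_count >= 1 else 0
    if model == "recessive":
        # Two copies of the minor allele are needed to carry the risk.
        return 1 if minor_allele_count == 2 else 0
    raise ValueError("model must be 'additive', 'dominant' or 'recessive'")


for model in ("additive", "dominant", "recessive"):
    print(model, [code_genotype(g, model) for g in (0, 1, 2)])
```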
Recap from last session
• Given these genotype means, what genotype
models would you run? (Additive / recessive effect
of the minor allele / dominant effect of the minor
allele?):
     aa   aA   AA
1    10   15   15
2    23   35   42
3    10   12   14
4     9   11   32
5     5    5   16
6     1    5    7
All about GWAS
Back in the 90s…
Heralding the GWAS era
What is a GWAS
Basic idea: Genotype individuals for a large number (~1M) of SNPs spread in a
generally unspecified way throughout the genome. Look for association.
Advantage?
What does this table show?
The era of hypothesis generating research
Why do we question the success of GWAS?
• What is the median income in the US?
– $50,000
• What is federal tax?
– 25%
• Median tax/year is
– ~12,500
• What is the average cost of an NIH genetics grant?
– $477,215 / $350,000
• 30-40 working years
Huge numbers of published GWAS!
• http://www.genome.gov/gwastudies/
• http://www.ebi.ac.uk/fgpt/gwas/#
JAMA, 2007
• For T2D it was suspected that genes “related to insulin
resistance would be identified. Interestingly, not
one gene known to be associated with insulin
resistance has been found in the recent set of
genome-wide scans for diabetes, but genes
related to insulin secretion, insulin transport, zinc
binding to insulin, and pancreatic islet beta cell
development have been discovered”
And yet….
• Sir Alec Jeffreys: “One of the great hopes for GWAS was
that, in the same way that huge numbers of Mendelian
disorders were pinned down at the DNA level and the gene
and mutations involved identified, it would be possible to
simply extrapolate from single gene disorders to complex
multigenic disorders. That really hasn’t happened.
Proponents will argue that it has worked and that all sorts
of fascinating genes that predispose to or protect against
diabetes or breast cancer, for example, have been
identified, but the fact remains that the bulk of the
heritability in these conditions cannot be ascribed to loci
that have emerged from GWAS, which clearly isn’t going to
be the answer to everything.”
• Cell, 2010: “To date, genome-wide association
studies (GWAS) have published hundreds of
common variants whose allele frequencies are
statistically correlated with various illnesses and
traits. However, the vast majority of such variants
have no established biological relevance to
disease or clinical utility for prognosis or
treatment.”
Running (analyzing) a GWAS
• With each step think:
“What effect does this have on our ability to find
genes for traits and disorders?”
Running a GWAS: Getting your genotype data
• Select your chip
• Complete your genotyping
Running a GWAS: QC-ing SNP data
• Poor quality samples
– Sample genotype success rate < 95 to 97.5%
– Greater proportion of heterozygous genotypes than
expected
• Related individuals (if independent samples)
– Based on pair-wise comparisons of similarity of
genotypes
• Sample switches
– Wrong sex
Running a GWAS: QC-ing SNP data
• This can get very complex!
• “GOLDN SNPs that were monomorphic (55,530) or had a call rate <96%
(82,462) were removed from the analysis. In addition, SNPs were
excluded from the analysis based on the number of families with
Mendelian errors as follows: for minor allele frequency (MAF) <20%,
removed if errors were present in 3 families (1,486 SNPs); for MAF <10%,
removed if errors were present in 2 families (1,338 SNPs); MAF <5%,
removed if errors were present in 1 family (1,767 SNPs); for MAF
<5%, removed if any errors were present (9,592 SNPs). In families with
remaining errors, SNPs that exhibited Mendelian error were set to
missing (31,595 SNPs). Furthermore, 16 participants with call rates <96%
were also removed from any subsequent analyses. Subsequently,
748 SNPs failing the Hardy–Weinberg equilibrium (HWE) test at P
value <10^-6 were excluded.”
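To make the last QC step concrete, here is a rough sketch of an HWE check for one SNP. It assumes a 1-degree-of-freedom chi-square goodness-of-fit test, which may not be the test used in the quoted study (exact tests are also common). The function name `hwe_chi_square` is hypothetical, and the example counts are borrowed from the recap slide purely for illustration.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium for one SNP.

    Assumes the SNP is not monomorphic (monomorphic SNPs are removed
    earlier in QC), otherwise an expected count would be zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = hwe_chi_square(123, 134, 52)
fails_hwe = p_value < 1e-6  # threshold quoted on the slide
print(round(chi2, 3), round(p_value, 3), fails_hwe)
```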
Running a GWAS: Imputing more SNPs
Useful for:
• Combining data from different platforms (e.g., Affy & Illumina) (for
replication or meta-analysis)
• Estimating unmeasured or missing genotypes
• Based on measured SNPs and external info (e.g., haplotype
structure of HapMap)
• Imputation methods use the dense genotype data available from
HapMap samples (i.e., CEU) and the LD relationships of the
SNPs to impute (predict) genotypes for a large number of SNPs
that were not measured experimentally
“Short cuts”
[Figure: aligned haplotype sequences illustrating tag SNPs]
SNPs 1, 3 and 4 are TagSNPs
Running a GWAS: Imputing more SNPs
• Requires large scale computing resources
• Can go from ~500,000 SNPs to ~2.2 million
• Need to assess quality of imputation
– Compare imputed genotypes to actual genotypes
• Error rates are higher than for genotyped SNPs
• Works less well for rarer alleles (e.g., MAF<3%)
• Best to take account of probabilities assigned to imputed
genotypes in the analysis
– “dosages” = probabilities of the genotypes
• Allows association testing of untyped variation
• Allows for ease of combining data across genotyping platforms
Genotype: AA AG GG
Probability: 80% 20% 0%
Coding: 2 1 0
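The table above can be collapsed into a single number per person. A minimal sketch, assuming "dosage" is taken to mean the expected count of the coded allele (probability times coding, summed over genotypes); the function name `expected_dosage` is made up for illustration.

```python
def expected_dosage(genotype_probs, coding):
    """Expected allele count given genotype probabilities and their coding."""
    if abs(sum(genotype_probs.values()) - 1.0) > 1e-9:
        raise ValueError("genotype probabilities must sum to 1")
    return sum(genotype_probs[g] * coding[g] for g in genotype_probs)


# Values from the slide: P(AA)=80%, P(AG)=20%, P(GG)=0%; AA=2, AG=1, GG=0.
probs = {"AA": 0.80, "AG": 0.20, "GG": 0.00}
coding = {"AA": 2, "AG": 1, "GG": 0}
dosage = expected_dosage(probs, coding)
print(round(dosage, 2))
```

A hard call would code this person as 2 (AA); the dosage of about 1.8 keeps the uncertainty of the imputation in the analysis.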
Running a GWAS: Imputing more SNPs
MACH, Markov Chain Haplotyping
– Developed by Goncalo Abecasis
– http://www.sph.umich.edu/csg/abecasis/MACH/
IMPUTE
– Developed by Jonathan Marchini
– http://www.stats.ox.ac.uk/~marchini/#software
BIMBAM
– http://quartus.uchicago.edu/~yguan/bimbam/index.html
Running a GWAS: QC-ing SNP data
• More QC!
• MAF again
• Imputation quality (RSQ<.3??)
Running a GWAS: Computing your p-values
• Commonly SNPs coded as the additive effect of alleles
• Logistic regression or linear regression (much like a candidate gene
study)
• PLINK
– http://pngu.mgh.harvard.edu/~purcell/plink/
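For a quantitative trait, the additive test boils down to regressing the phenotype on the number of effect alleles. The sketch below only estimates the per-allele effect by ordinary least squares. It is not how PLINK is invoked, it produces no p-value, and the data are invented for illustration.

```python
def ols_slope_intercept(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: allele counts (additive coding) and a continuous trait.
genotypes = [0, 0, 1, 1, 2, 2]
phenotype = [10.0, 12.0, 14.0, 16.0, 19.0, 21.0]

beta, intercept = ols_slope_intercept(genotypes, phenotype)
print("per-allele effect:", beta)
```

In a GWAS the same model is fitted once per SNP, usually with covariates, which is why dedicated software is used.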
Running a GWAS: Interpreting your data
• (Really… it is more QC)….
• 2 main issues:
• Population stratification
• Multiple testing
The Problems of population substructure
• Devlin and Roeder (1999) used
theoretical arguments to propose that
with population structure, the
distribution of Cochran-Armitage trend
tests, genome-wide, is inflated by a
constant multiplicative factor λ.
• We can estimate the multiplicative
inflation factor using the statistic
λ = median(X_i^2)/0.456.
• Inflation factor λ > 1 indicates
population structure and/or
genotyping error.
• We can carry out an adjusted test of
association that takes account of any
mismatching of cases/controls at any
SNP using the statistic X_i^2/λ.
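The inflation factor and the adjusted statistic can be sketched in a few lines. This assumes 1-df chi-square statistics and uses the constant 0.456 from the slide; the helper names are hypothetical. `chi2_from_p` is included because GWAS results are often stored as p-values rather than test statistics.

```python
from statistics import NormalDist, median


def chi2_from_p(p):
    """1-df chi-square statistic whose upper-tail probability is p (0 < p <= 1)."""
    z = -NormalDist().inv_cdf(p / 2)  # two-sided normal quantile
    return z * z


def inflation_factor(chi2_stats):
    """Genomic inflation factor: median statistic over its expected median."""
    return median(chi2_stats) / 0.456


def gc_adjust(chi2_stats, lam):
    """Divide every statistic by lambda, as on the slide."""
    return [x / lam for x in chi2_stats]


# Hypothetical test statistics, for illustration only.
stats = [0.1, 0.3, 0.5, 0.9, 2.0]
lam = inflation_factor(stats)
adjusted = gc_adjust(stats, lam)
print(round(lam, 3), [round(x, 3) for x in adjusted])
```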
[Figure annotations: inflation factor λ = 1.11; “Population outliers and/or structure?”; “True hits?”]
Solving population stratification
• Principal components analysis (PCA) has become a
standard tool in genetics to identify subpopulations
• Used to infer continuous axes of genetic
variation (eigenvectors) that reduce the data
to a small number of dimensions while
describing as much of the variability among
individuals as possible
• Utilizes a set of “neutral” SNPs (not
associated with phenotype, need many)
• Implemented in the EIGENSTRAT software
package.
Solving population stratification - Eigenstrat
Population substructure thought
experiment
• You run a GWAS and find a significant hit
within a gene. The SNP is genotyped,
biologically plausible, in HWE (p>.05) and has
a MAF = .13. However, your λ=1.56! So you
decide not to use the SNP.
• What would have happened if you had taken a
candidate gene approach and only analyzed
that SNP?
Multiple testing
• When all tests are independent,
– Probability to observe P<.01 is
– Probability to observe P<.05 is
– Probability to observe P<.5 is
Multiple testing
• At an alpha of .05, each test of a true null hypothesis
has a 5% chance of being significant by chance alone.
• How many significant findings do we expect to
arise by chance?
• How many tests does a GWAS have?
• So, how many significant findings do we expect to
arise by chance?
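The questions above come down to simple arithmetic. A sketch, assuming every SNP is truly null and taking the ~1M SNPs mentioned earlier in the lecture as the number of tests; the second function also assumes the tests are independent.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of significant results if every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Family-wise error rate for independent tests of true null hypotheses."""
    return 1 - (1 - alpha) ** n_tests


print(expected_false_positives(1_000_000, 0.05))
print(round(prob_at_least_one_false_positive(7, 0.05), 3))
```

Real SNPs are correlated through LD, so the independence assumption is only a rough guide.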
Multiple testing corrections
• Bonferroni:
• When k independent tests are used, the corrected
significance threshold should be: α / k
“Bonferroni adjustments are, at best, unnecessary, and, at worst,
deleterious to sound statistical inference.” Perneger, 1998
While it is computationally simple (single step method), and
stringent, maybe it is too stringent? High probability of Type
II errors.
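As a quick illustration of how stringent this becomes at GWAS scale, the sketch below applies α / k with k set to one million tests (an assumed round number, in line with the ~1M SNPs mentioned earlier).

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold for this set of tests."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


gwas_threshold = bonferroni_threshold(0.05, 1_000_000)
print(gwas_threshold)
```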
Multiple testing corrections
• Holm
Target p value = α / (n − rank + 1), where rank is the
rank number of the p value in terms of degree of
significance (1 = smallest p value)
• Assume 3 p-values, and α=.05
• the smallest p value has to be smaller than .05/3 =
.017 to be sig
• the second smallest p value has to be smaller than
.05/2 = .025 to be sig
• and the third smallest has to be smaller than .05/1 =
.05 to be sig.
• Low probability of Type II errors… but too low?
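The step-down rule in the example can be written out as follows. This is a sketch that uses the strict "smaller than" comparison from the slide (many descriptions use "less than or equal to"); the function name `holm_reject` is hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one True/False per input p-value."""
    n = len(p_values)
    # Indices of the p-values from most to least significant.
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, i in enumerate(order, start=1):
        if p_values[i] < alpha / (n - rank + 1):
            reject[i] = True
        else:
            break  # once one test fails, all less significant tests fail too
    return reject


# Three hypothetical p-values, thresholds .05/3, .05/2 and .05/1 in turn.
print(holm_reject([0.01, 0.04, 0.03]))
```

Here 0.01 passes .05/3, but 0.03 fails .05/2, so the procedure stops and 0.04 is never tested against .05.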
Multiple testing corrections
Holm
Bonferroni
FDR
Multiple testing corrections
• FDR procedures are designed to control the
expected proportion of incorrectly rejected null
hypotheses
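One widely used FDR procedure is the Benjamini-Hochberg step-up method; the slide does not say which procedure is meant, so treat this as an assumed example. The function name and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns True/False per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        # Compare the rank-th smallest p-value with its own threshold.
        if p_values[i] <= rank * q / m:
            largest_k = rank
    reject = [False] * m
    # Reject every hypothesis up to the largest rank that passed.
    for i in order[:largest_k]:
        reject[i] = True
    return reject


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals))
```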
Multiple testing corrections
• Storey – q values
• For a given test, we estimate the q-value by
calculating the minimum estimated false discovery
rate among all thresholds at which that test
is called significant
• E.g. Rank 52 has a p-value of 0.01 and a q-value of
0.0141.
• A p-value of 0.01 implies that 1% of all the tests
are expected to be false positives. Therefore 839 × 0.01 =
8.39 false positives. A q-value of 0.0141 implies that 1.41%
of the tests with a q-value at or below 0.0141 are expected
to be false positives. In this experiment 52 tests had a
q-value of 0.0141 or less, and so 52 × 0.0141 = 0.7332
false positives.
• False positives according to p-values take all 839
values into account, while q-values take into
account only those tests with q-values less than the
threshold we choose.
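The arithmetic in the example can be reproduced directly. The numbers (839 tests, 52 tests at or below the threshold, p = 0.01, q = 0.0141) are the ones on the slide; the function names are hypothetical.

```python
def false_positives_from_p(n_all_tests, p_threshold):
    """Expected false positives when the p-value threshold applies to all tests."""
    return n_all_tests * p_threshold


def false_positives_from_q(n_called_significant, q_threshold):
    """Expected false positives among only the tests called significant."""
    return n_called_significant * q_threshold


by_p = false_positives_from_p(839, 0.01)
by_q = false_positives_from_q(52, 0.0141)
print(round(by_p, 2), round(by_q, 4))
```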
Multiple testing example
• *Think* of it as a sliding scale: there is no magic p-value of ‘no
error’. BUT, repeatedly taken a range of SNPs
• 1.0×10^-5 > p > 1.0×10^-9
• So, what correction would you use? A stringent
one? A lenient one?
• What is the purpose of GWAS?
Multiple testing thought experiment
• Jane comes to you, she is interested in whether SNPs in
LDL1R gene are associated with CVD. She uses all 7 SNPs
genotyped in LDL1R, and the P-values are as follows: .0001,
.01, .0006, .000006, .000001, .05, .09. How many SNPs are
significantly associated with CVD?
• John comes to you, and has run a GWAS on a million SNPs,
including Jane’s 7 in LDL1R. He does a Bonferroni
correction AND an FDR correction on his GWAS, and none
of the SNPs survive the correction for multiple testing. He
concludes that the LDL1R gene is not associated with CVD
• Who is right?
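The two analyses can be compared numerically. This sketch assumes Jane corrects for her 7 tests with Bonferroni and John for one million tests; the slide leaves the choice of correction open, so this is one reading of the thought experiment rather than the answer.

```python
# The 7 p-values from the slide.
p_values = [0.0001, 0.01, 0.0006, 0.000006, 0.000001, 0.05, 0.09]


def count_significant(p_values, alpha, n_tests):
    """Number of p-values below the Bonferroni threshold alpha / n_tests."""
    threshold = alpha / n_tests
    return sum(1 for p in p_values if p < threshold)


uncorrected_hits = count_significant(p_values, 0.05, 1)      # no correction
jane_hits = count_significant(p_values, 0.05, len(p_values))  # 7 tests
john_hits = count_significant(p_values, 0.05, 1_000_000)      # GWAS-wide
print(uncorrected_hits, jane_hits, john_hits)
```

The same data give different answers because the number of hypotheses tested, and so the burden of proof, differs between the two studies.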
Multiple testing: a final thought
• Power is a function of
-N
-Effect size (subsuming error)
-Alpha
Running a GWAS: Visualize your results
K Wang et al. Nature 000, 1-6 (2009) doi:10.1038/nature07999
Running a GWAS: Interpreting SNPs
• Look at the functionality of your SNP (SNPdoc)
• Literature search – can you give biological
plausibility?
• Other tests: pathway analysis / Gene based tests
• Tonne of free & very expensive tools out there.
Conduct your own quality control!!
Running a GWAS: Final steps
Running a GWAS: Replication
Replication
Hirschhorn & Daly Nat. Rev. Genet. 6: 95, 2005
NCI-NHGRI Working Group on Replication Nature
447: 655, 2007
Winner’s Curse
• ‘Winner's curse’ = the phenomenon whereby winners at competitive auctions
are likely to pay in excess of the value of the item.
• In genetic association studies, the winner's curse is the phenomenon whereby
the disease risk of a newly identified genetic association is overestimated.
• Occurs when the statistical power of the original study is not sufficient.
• The winner's curse implies that the sample size required for a confirmatory
study will be underestimated, resulting in failure of the replication study to
corroborate the association.
• The winner's curse is common in genome-wide association (GWA) studies
because most single-GWA studies are underpowered to detect small genetic
effects at a stringent genome-wide significance level.
• What are the solutions? Large GWAS (or a meta-analysis)
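A small simulation can show the effect. The sketch assumes each study estimates the same true effect with normally distributed error and counts a study as a "discovery" when estimate / standard error exceeds 1.96 (one-sided, for simplicity). All numbers are invented for illustration.

```python
import random


def simulate_winners_curse(true_effect, standard_error, n_studies, seed=0):
    """Compare the mean estimate across all studies with that of 'discoveries'."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    estimates = [rng.gauss(true_effect, standard_error) for _ in range(n_studies)]
    # Keep only the studies that would have been reported as significant.
    significant = [b for b in estimates if b / standard_error > 1.96]
    mean_all = sum(estimates) / len(estimates)
    mean_sig = sum(significant) / len(significant) if significant else float("nan")
    return mean_all, mean_sig, len(significant)


mean_all, mean_sig, n_sig = simulate_winners_curse(0.05, 0.05, 10_000)
print(round(mean_all, 3), round(mean_sig, 3), n_sig)
```

Averaged over all studies the estimate is close to the true effect, but the significant studies alone overestimate it, which is why a replication sized from the discovery estimate tends to be too small.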
Can you solve the case of the missing
heritability?
Hint. What have we discussed
about:
-Traits
-Statistics
-Genotype models
-Coverage of the genome
-Power?
Lab 7: Analyzing GWAS data
Primer on SAS: Libraries
Find your libraries
here. Toggle between
libraries and results
Work is the default
library. SAS will pull files
from here if nothing else
is specified.
Here are other libraries
I have made
Primer on SAS: Reading in data
• For SAS datasets: Tell SAS to make a
library
Give SAS a path. SAS will read in
all SAS datasets in this folder
Primer on SAS: Reading in data
• For non-SAS datasets:
Follow the import
wizard (I have code if you
prefer)
Primer on SAS: Move your file into ‘work’
‘Data’ command:
make a dataset. Call
it Goldn. There is no
library so put it in
‘work’.
‘set’ command: base the
new dataset off goldn in
the goldn library
Library
Dataset
Primer on SAS: Titles
Primer on SAS: Summarizing Data
• Continuous variables:
Command
‘proc
univariate’
Tell SAS
which
variables
Select a subsample
There are other commands, e.g. ‘BY’. Your data must be
sorted by the ‘BY’ variable to use this command
Primer on SAS: Summarizing Data
• Categorical variables:
Command
‘proc freq’
Tell SAS
which
variables
Primer on SAS: Manipulating data
Make your
data… set your
data. Think:
Libraries.
Options (look up
others)
Primer on SAS: Sorting your dataset
Command =
proc sort
Which dataset
to sort
What to call the
sorted dataset
What to sort by
Primer on SAS. One fundamental rule
• Always check your log!
Special commands for this session
Special commands: QQ plot
1. Make a new dataset (to preserve yours)
2. Set your data
3. Set to the number of SNPs
4. Select the smallest p-value power (here 1×10^-25)
5. Select your data
NOTE: Your data must be sorted by p-value
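The lab does this in SAS; the Python sketch below only illustrates the logic behind those steps. It assumes the expected p-value for the i-th smallest of n p-values is i / (n + 1), which is one common convention and may differ from the one in the lab code. The function name `qq_points` is hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p) for a QQ plot."""
    observed = sorted(p_values)  # the data must be ordered by p-value
    n = len(observed)            # the number of SNPs
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = rank / (n + 1)  # expected p-value under the null
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Three hypothetical p-values, for illustration only.
points = qq_points([0.5, 0.01, 0.2])
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

Plotting observed against expected, points above the diagonal indicate more small p-values than chance alone would give.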
Special commands: Calculating median P-val
1. Make a new dataset (to preserve yours)
2. Set your data
Special commands: Correcting for multiple
testing
• 1. Make a dataset with ONLY your p-values in (no
other columns)
• 2. Make sure your P-values are called RAW_P
3. Set your data
4. Select your corrections
(look them up)
Lab 7: The dataset
• Rs number of SNP
• Chromosome
• Base pair position
• Effect allele
• Frequency of the effect allele
• Imputation quality
• P value for Hardy-Weinberg
• Not sure?
• No. of people in analysis

SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 

Recently uploaded (20)

Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 

Lecture 7 gwas full

  • 1. Session 7: Genome-wide Association Studies (GWAS)
  • 3. Structure of this lecture • Recap some concepts (SAS tutorial later) • Discuss GWAS • Look at the steps in running & analyzing results GWAS • Lab – analyze a GWAS • SAS tutorial
  • 4. Recap from last session • At an A/T locus: • -What are the genotypes? • -What are the alleles? • Given the following frequencies at a locus: A/A: N=123 A/C: N=134 C/C N=52 Which is the minor allele? What are the allele frequencies?
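The allele-frequency question on this slide comes down to counting alleles. The following is a minimal sketch in Python (the function name is a hypothetical helper, not something from the lecture), using the genotype counts given on the slide:

```python
def allele_frequencies(n_aa, n_ac, n_cc):
    """Return (freq_A, freq_C) from genotype counts at a biallelic A/C locus."""
    total_alleles = 2 * (n_aa + n_ac + n_cc)    # every person carries two alleles
    freq_a = (2 * n_aa + n_ac) / total_alleles  # homozygotes contribute two copies
    freq_c = (2 * n_cc + n_ac) / total_alleles
    return freq_a, freq_c


# Genotype counts from the slide: A/A = 123, A/C = 134, C/C = 52
freq_a, freq_c = allele_frequencies(123, 134, 52)
minor_allele = "A" if freq_a < freq_c else "C"
print(minor_allele, round(freq_a, 3), round(freq_c, 3))
```

The minor allele is simply whichever allele has the lower frequency.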
  • 5. Recap from last session • Additive model: Each ‘risk’ allele conveys risk for the phenotype in an additive fashion • Recessive effect of the minor allele: The minor allele conveys risk, but you need 2 copies to have the risk • Dominant effect of the minor allele: The minor allele conveys risk, and you only need 1 copy to have the risk
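One way to picture the three genotype models is as three ways of recoding the number of minor alleles a person carries. This is a sketch only; the function and the 0/1/2 coding convention are assumptions for illustration, not taken from the slides:

```python
def code_genotype(minor_allele_count, model):
    """Recode a minor-allele count (0, 1 or 2) under a genotype model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        return minor_allele_count                   # each copy adds risk
    if model == "dominant":
        return 1 if minor_allele_count >= 1 else 0  # one copy is enough
    if model == "recessive":
        return 1 if minor_allele_count == 2 else 0  # two copies are needed
    raise ValueError("unknown model: " + model)


# Codes for genotypes carrying 0, 1 and 2 minor alleles under each model
codes = {
    model: [code_genotype(count, model) for count in (0, 1, 2)]
    for model in ("additive", "dominant", "recessive")
}
print(codes)
```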
  • 6. Recap from last session • Given these genotype means, what genotype models would you run? (Additive / recessive effect of the minor allele / dominant effect of the minor allele?): aa aA AA 1 10 15 15 2 23 35 42 3 10 12 14 4 9 11 32 5 5 5 16 6 1 5 7
  • 8. Back in the 90s…
  • 11. What is a GWAS Basic idea: Genotype individuals for a large number (~1M) of SNPs spread in a generally unspecified way throughout the genome. Look for association. Advantage? What does this table show?
  • 12. The era of hypothesis generating research
  • 13. The era of hypothesis generating research
  • 15. Why do we question the success of GWAS? • What is the median income in the US? – $50,000 • What is federal tax? – 25% • Median tax/year is – ~12,500 • What is the average cost of an NIH genetics grant? – $477,215 / $350,000 • 30-40 working years
  • 16. Huge numbers of published GWAS! • http://www.genome.gov/gwastudies/ • http://www.ebi.ac.uk/fgpt/gwas/#
  • 17. JAMA, 2007 • T2DB suspected that genes “related to insulin resistance would be identified. Interestingly, not one gene known to be associated with insulin resistance has been found in the recent set of genome-wide scans for diabetes, but genes related to insulin secretion, insulin transport, zinc binding to insulin, and pancreatic islet beta cell development have been discovered.”
  • 18. And yet…. • Sir Alec Jeffreys: “One of the great hopes for GWAS was that, in the same way that huge numbers of Mendelian disorders were pinned down at the DNA level and the gene and mutations involved identified, it would be possible to simply extrapolate from single gene disorders to complex multigenic disorders. That really hasn’t happened. Proponents will argue that it has worked and that all sorts of fascinating genes that predispose to or protect against diabetes or breast cancer, for example, have been identified, but the fact remains that the bulk of the heritability in these conditions cannot be ascribed to loci that have emerged from GWAS, which clearly isn’t going to be the answer to everything.”
  • 19. • Cell, 2010: “To date, genome-wide association studies (GWAS) have published hundreds of common variants whose allele frequencies are statistically correlated with various illnesses and traits. However, the vast majority of such variants have no established biological relevance to disease or clinical utility for prognosis or treatment.”
  • 22. • With each step think: “What effect does this have on our ability to find genes for traits and disorders?”
  • 23. Running a GWAS: Getting your genotype data • Select your chip • Complete your genotyping
  • 24. Running a GWAS: QC-ing SNP data • Poor quality samples – Sample genotype success rate < 95 to 97.5% – Greater proportion of heterozygous genotypes than expected • Related individuals (if independent samples) – Based on pair-wise comparisons of similarity of genotypes • Sample switches – Wrong sex
  • 25. Running a GWAS: QC-ing SNP data • This can get very complex! • “GOLDN SNPs that were monomorphic (55,530) or had a call rate < 96% (82,462) were removed from the analysis. In addition, SNPs were excluded from the analysis based on the number of families with Mendelian errors as follows: for minor allele frequency (MAF) < 20%, removed if errors were present in 3 families (1,486 SNPs); for MAF < 10%, removed if errors were present in 2 families (1,338 SNPs); MAF < 5%, removed if errors were present in 1 family (1,767 SNPs); for MAF < 5%, removed if any errors were present (9,592 SNPs). In families with remaining errors, SNPs that exhibited Mendelian error were set to missing (31,595 SNPs). Furthermore, 16 participants with call rates < 96% were also removed from any subsequent analyses. Subsequently, 748 SNPs failing the Hardy–Weinberg equilibrium (HWE) test at P value < 10^-6 were excluded.”
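To make the Hardy–Weinberg filter in the quoted QC pipeline concrete, here is a hedged Python sketch of a 1-degree-of-freedom chi-square HWE test. The quoted study does not say which HWE test it used, so the chi-square form is an assumption, and the genotype counts are reused from the recap slide purely as an example:

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for a biallelic SNP.

    Assumes both alleles are observed (monomorphic SNPs are removed first).
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p_value = hwe_chi_square(123, 134, 52)
keep_snp = p_value >= 1e-6  # threshold quoted on the slide
print(round(chi2, 2), keep_snp)
```

With a threshold as strict as 10^-6, only SNPs with very large departures from HWE are dropped, which usually points to genotyping problems rather than biology.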
  • 26. Running a GWAS: Imputing more SNPs Useful for: • Combining data from different platforms (e.g., Affy & Illumina) (for replication or meta-analysis) • Estimating unmeasured or missing genotypes • Based on measured SNPs and external info (e.g., haplotype structure of HapMap) • Imputation methods use the dense genotype data available from HapMap samples (i.e., CEU) and the LD relationships of the SNPs to impute (predict) genotypes for a large number of SNPs that were not measured experimentally
  • 27. “Short cuts” • [Figure: aligned haplotype sequences] • SNPs 1, 3 and 4 are TagSNPs
  • 28. Running a GWAS: Imputing more SNPs • Requires large scale computing resources • Can go from ~500,000 SNPs to ~2.2 million • Need to assess quality of imputation – Compare imputed genotypes to actual genotypes • Error rates are higher than for genotyped SNPs • Works less well for rarer alleles (e.g., MAF<3%) • Best to take account of probabilities assigned to imputed genotypes in the analysis – “dosages” = probabilities of the genotypes • Allows association testing of untyped variation • Allows for ease of combining data across genotyping platforms • Example: Genotype AA / AG / GG; Probability 80% / 20% / 0%; Coding 2 / 1 / 0
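The example at the end of this slide can be turned into a single dosage value by weighting each genotype's coding by its imputed probability. A minimal Python sketch, assuming the dosage is the expected allele count (the function name is hypothetical):

```python
def expected_dosage(probabilities, coding=(2, 1, 0)):
    """Expected allele count given genotype probabilities for (AA, AG, GG)."""
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return sum(prob * code for prob, code in zip(probabilities, coding))


# Probabilities from the slide: AA = 80%, AG = 20%, GG = 0%
dosage = expected_dosage((0.80, 0.20, 0.00))
print(round(dosage, 2))
```

A dosage between 1 and 2 reflects the remaining uncertainty between the AG and AA genotypes, which a hard genotype call would discard.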
  • 29. Running a GWAS: Imputing more SNPs
  • 30. Running a GWAS: Imputing more SNPs MACH, Markov Chain Haplotyping – Developed by Goncalo Abecasis – http://www.sph.umich.edu/csg/abecasis/MACH/ IMPUTE – Developed by Jonathan Marchini – http://www.stats.ox.ac.uk/~marchini/#software BIMBAM – http://quartus.uchicago.edu/~yguan/bimbam/index.html
  • 31. Running a GWAS: QC-ing SNP data • More QC! • MAF again • Imputation quality (RSQ<.3??)
  • 32. Running a GWAS: Computing your p-values • Commonly SNPs coded as the additive effect of alleles • Logistic regression or linear regression (much like a candidate gene study)
  • 34. Running a GWAS: Interpreting your data • (Really… it is more QC)…. • 2 main issues: • Population stratification • Multiple testing
  • 35. The Problems of population substructure • Devlin and Roeder (1999) used theoretical arguments to propose that with population structure, the distribution of Cochran-Armitage trend tests, genome-wide, is inflated by a constant multiplicative factor λ. • We can estimate the multiplicative inflation factor using the statistic λ = median(X_i^2)/0.456. • Inflation factor λ > 1 indicates population structure and/or genotyping error. • We can carry out an adjusted test of association that takes account of any mismatching of cases/controls at any SNP using the statistic X_i^2/λ. • [Figure labels: Inflation factor λ = 1.11; Population outliers and/or structure?; True hits?]
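The inflation factor and the adjusted statistic from this slide can be sketched in a few lines of Python. The test statistics below are made-up toy values chosen for illustration, and the function names are hypothetical; a real GWAS would use one chi-square statistic per SNP:

```python
import statistics


def genomic_inflation(chi2_stats, null_median=0.456):
    """Estimate lambda as median(X_i^2) / 0.456, as on the slide."""
    return statistics.median(chi2_stats) / null_median


def genomic_control(chi2_stats):
    """Divide every test statistic by lambda (the adjusted test X_i^2 / lambda)."""
    lam = genomic_inflation(chi2_stats)
    return [x / lam for x in chi2_stats]


toy_stats = [0.1, 0.3, 0.5016, 0.9, 12.0]  # toy values, not real data
lam = genomic_inflation(toy_stats)
adjusted = genomic_control(toy_stats)
print(round(lam, 2), [round(x, 3) for x in adjusted])
```

Because every statistic is divided by the same λ, the ranking of SNPs does not change; only the evidence for each one is scaled down.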
  • 37. Solving population stratification • Principal components analysis (PCA) has become a standard tool in genetics to identify subpopulations • Used to infer continuous axes of genetic variation (eigenvectors) that reduce the data to a small number of dimensions while describing as much of the variability among individuals as possible • Utilizes a set of “neutral” SNPs (not associated with phenotype, need many) • Implemented in the EIGENSTRAT software package.
  • 39. Population substructure thought experiment • You run a GWAS and find a significant hit within a gene. The SNP is genotyped, biologically plausible, in HWE (p>.05) and has a MAF = .13. However, your λ=1.56! So you decide not to use the SNP. • What would have happened if you had taken a candidate gene approach and only analyzed that SNP?
  • 41. Multiple testing • When all tests are independent, – Probability to observe P<.01 is – Probability to observe P<.05 is – Probability to observe P<.5 is
  • 43. Multiple testing • At an alpha of .05, a test of a true null hypothesis has a 5% probability of giving a significant finding by chance. • How many significant findings do we expect to arise by chance? • How many tests does a GWAS have? • So, how many significant findings do we expect to arise by chance?
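The questions on this slide reduce to simple arithmetic. Here is a sketch in Python, assuming all tests are independent and every null hypothesis is true; the figure of one million tests is taken from the "~1M SNPs" mentioned earlier in the lecture, and the function names are hypothetical:

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance findings when every null hypothesis is true."""
    return n_tests * alpha


def family_wise_error_rate(n_tests, alpha):
    """Probability of at least one chance finding among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


gwas_false_positives = expected_false_positives(1_000_000, 0.05)
fwer_20_tests = family_wise_error_rate(20, 0.05)
print(gwas_false_positives, round(fwer_20_tests, 3))
```

Even 20 independent tests make at least one chance finding more likely than not, which is why an uncorrected alpha of .05 is unusable at GWAS scale.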
  • 44. Multiple testing corrections • Bonferroni: • When k independent tests are used, the corrected significance threshold should be: α / k • “Bonferroni adjustments are, at best, unnecessary, and, at worst, deleterious to sound statistical inference.” (Perneger, 1998) • While it is computationally simple (single step method), and stringent, maybe it is too stringent? High probability of Type II errors.
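As a quick illustration of how the Bonferroni threshold α / k shrinks with the number of tests, here is a small Python sketch. The values of k are chosen for illustration only:

```python
def bonferroni_threshold(alpha, k):
    """Per-test significance threshold for k tests at overall level alpha."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return alpha / k


# Thresholds for a handful of SNPs versus a genome-wide scan (illustrative k)
thresholds = {k: bonferroni_threshold(0.05, k) for k in (1, 7, 1_000_000)}
for k, threshold in thresholds.items():
    print(k, threshold)
```

For one million tests this gives the familiar genome-wide significance level of 5×10^-8.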
  • 45. Multiple testing corrections • Holm: adjusted threshold = target p value / (n - rank number of the pair in terms of degree of significance + 1) • Assume 3 p-values, and α=.05 • the smallest p value has to be smaller than .05/3 = .017 to be sig • the second smallest p value has to be smaller than .05/2 = .025 to be sig • and the third smallest has to be smaller than .05/1 = .05 to be sig. • Low probability of Type II errors… but too low?
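The Holm procedure described on this slide can be sketched as a step-down loop: test the smallest p-value against α / n, the next against α / (n - 1), and stop at the first failure. This Python version follows the slide's strict "smaller than" comparison (many descriptions use "less than or equal"); the function name and example p-values are assumptions:

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects the null."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * n
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] < alpha / (n - rank + 1):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return reject


all_pass = holm_reject([0.04, 0.01, 0.02])   # thresholds .017, .025, .05
none_pass = holm_reject([0.03, 0.02, 0.04])  # smallest p already exceeds .017
print(all_pass, none_pass)
```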
  • 47. Multiple testing corrections • FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses
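The slide does not name a specific FDR procedure. As one well-known example, the Benjamini-Hochberg step-up procedure can be sketched in Python as follows; the function name and the p-values are made up for illustration:

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking tests rejected by the Benjamini-Hochberg procedure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / n:
            largest_passing_rank = rank  # step-up: keep the largest rank that passes
    reject = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            reject[idx] = True
    return reject


toy_p_values = [0.001, 0.025, 0.029, 0.039, 0.6]  # toy values, not real data
rejected = benjamini_hochberg(toy_p_values)
print(rejected)
```

Note that 0.025 is rejected even though it misses its own threshold of 0.02, because a larger p-value further down the ranking passes; this is what makes the procedure less stringent than Bonferroni or Holm.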
  • 50. Multiple testing corrections • Storey – q values • For a given test, we estimate the q-value by calculating the minimum estimated false discovery rate among all thresholds at which the test is called significant • E.g. Rank 52 has a p-value of 0.01 and a q-value of 0.0141. • A p-value threshold of 0.01 implies that 1% of all tests are expected to be false positives. Therefore 839 * .01 = 8.39 false positive tests. A q-value threshold of 0.0141 implies that 1.41% of the tests called significant are expected to be false positives. In this experiment 52 tests had a q-value of 0.0141 or less, and so 52 * 0.0141 = 0.7332 false positives. • False positives according to p-values take all 839 tests into account, while q-values take into account only those tests with q-values less than the threshold we choose.
  • 51. Multiple testing example • *Think* it is a sliding scale, magic p-value of ‘no error’. BUT, repeatedly taken a range of SNPs • 1.0*10^-5 > p > 1.0*10^-9 • So, what correction would you use? A stringent one? A lenient one? • What is the purpose of GWAS?
  • 52. Multiple testing thought experiment • Jane comes to you, she is interested in whether SNPs in LDL1R gene are associated with CVD. She uses all 7 SNPs genotyped in LDL1R, and the P-values are as follows: .0001, .01, .0006, .000006, .000001, .05, .09. How many SNPs are significantly associated with CVD? • John comes to you, and has run a GWAS on a million SNPs, including Jane’s 7 in LDL1R. He does a Bonferroni correction AND an FDR correction on his GWAS, and none of the SNPs survive the correction for multiple testing. He concludes that the LDL1R gene is not associated with CVD • Who is right?
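The arithmetic behind Jane's and John's analyses can be checked with a short Python sketch. It applies a Bonferroni correction to the seven p-values from the slide, once for 7 tests and once for one million tests; it illustrates the numbers only and does not settle the question of who is right:

```python
def count_bonferroni_hits(p_values, n_tests, alpha=0.05):
    """Count p-values below the Bonferroni threshold alpha / n_tests."""
    threshold = alpha / n_tests
    return sum(1 for p in p_values if p < threshold)


# The seven LDL1R p-values given on the slide
ldl1r_p_values = [0.0001, 0.01, 0.0006, 0.000006, 0.000001, 0.05, 0.09]

jane_hits = count_bonferroni_hits(ldl1r_p_values, n_tests=7)          # candidate gene
john_hits = count_bonferroni_hits(ldl1r_p_values, n_tests=1_000_000)  # GWAS
print(jane_hits, john_hits)
```

The same data give different conclusions purely because the number of tests, and therefore the threshold, differs between the two study designs.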
  • 53. Multiple testing: a final thought • Power is a product of -N -Effect size (subsuming error) -Alpha
  • 54. Running a GWAS: Visualize your results
  • 55. Running a GWAS: Visualize your results
  • 56. Running a GWAS: Visualize your results K Wang et al. Nature 000, 1-6 (2009) doi:10.1038/nature07999
  • 57. Running a GWAS: Interpreting SNPs • Look at the functionality of your SNP (SNPdoc) • Literature search – can you give biological plausibility? • Other tests: pathway analysis / Gene based tests • Tonne of free & very expensive tools out there. Conduct your own quality control!!
  • 58. Running a GWAS: Final steps
  • 59. Running a GWAS: Replication • Hirschhorn & Daly, Nat. Rev. Genet. 6: 95, 2005 • NCI-NHGRI Working Group on Replication, Nature 447: 655, 2007
  • 60. Winner’s Curse • ‘Winner's curse’ = the phenomenon whereby winners at competitive auctions are likely to pay in excess of the value of the item. • In genetic association studies, the winner's curse is the phenomenon whereby the disease risk of a newly identified genetic association is overestimated. • Occurs when the statistical power of original study is not sufficient. • The winner's curse implies that the sample size required for confirmatory study will be underestimated, resulting in failure of replication study to corroborate the association. • The winner's curse is common in genome-wide association (GWA) studies because most single-GWA studies are underpowered to detect small genetic effects at a stringent genome-wide significance level. • What are the solutions? Large GWAS (or a meta-analysis)
  • 61. Can you solve the case of the missing heritability? Hint. What have we discussed about: -Traits -Statistics -Genotype models -Coverage of the genome -Power?
  • 62. Lab 7: Analyzing GWAS data
  • 63. Primer on SAS: Libraries Find your libraries here. Toggle between libraries and results Work is the default library. SAS will pull files from here if nothing else is specified. Here are other libraries I have made
  • 64. Primer on SAS: Reading in data • For SAS datasets: Tell SAS to make a library Give SAS a path. SAS will read in all SAS datasets in this folder
  • 65. Primer on SAS: Reading in data • For non-SAS datasets: Follow on the import wizard (I have code if you prefer)
  • 66. Primer on SAS: Move your file into ‘work’ ‘Data’ command: make a dataset. Call it Goldn. There is no library so put it in ‘work’. ‘set’ command: base the new dataset off goldn in the goldn library Library Dataset
  • 67. Primer on SAS: Titles
  • 68. Primer on SAS: Summarizing Data • Continuous variables: Command ‘proc univariate’ Tell SAS which variables Select a subsample There are other commands. E.g. ‘BY’. Your data must be sorted by the ‘BY’ variable to use this command
  • 69. Primer on SAS: Summarizing Data • Categorical variables: Command ‘proc freq’ Tell SAS which variables
  • 70. Primer on SAS: Manipulating data Make your data… set your data. Think: Libraries. Options (look up others)
  • 71. Primer on SAS: Sorting your dataset Command = proc sort Which dataset to sort What to call the sorted dataset What to sort by
  • 72. Primer on SAS. One fundamental rule • Always check your log!
  • 73. Special commands for this session
  • 74. Special commands: QQ plot 1. Make a new dataset (to preserve yours) 2. Set your data 3. Set to the number of SNPs 4. Select the smallest p-value power (here 1*10^-25) 5. Select your data NOTE: Your data must be sorted by p-value
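The SAS code for this slide is not part of the transcript, so the following is only a Python sketch of the underlying idea: sort the p-values, pair each with the p-value expected at that rank under the null, and plot both on a -log10 scale. The expected value rank / (n + 1) is one common convention and is an assumption here, as are the function name and the toy p-values:

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p) for a QQ plot.

    Assumes every p-value is strictly greater than 0.
    """
    observed = sorted(p_values)  # the data must be sorted by p-value
    n = len(observed)            # number of SNPs
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = rank / (n + 1)  # expected p-value at this rank under the null
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


points = qq_points([0.5, 0.01, 0.2])  # toy values, not real data
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

Points that rise above the diagonal only in the upper tail suggest true associations, whereas a departure along the whole line suggests inflation such as population stratification.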
  • 75. Special commands: Calculating median P-val 1. Make a new dataset (to preserve yours) 2. Set your data
  • 76. Special commands: Correcting for multiple testing 1. Make a dataset with ONLY your p-values in (no other columns) 2. Make sure your P-values are called RAW_P 3. Set your data 4. Select your corrections (look them up)
  • 77. Lab 7: The dataset
  • 78. Rs number of SNP • Chromosome • Base pair position • Effect allele • Frequency of the effect allele • Imputation quality • P value for Hardy-Weinberg
  • 79. Not sure? No. of people in analysis

Editor's Notes

  1. Why the tail?