This presentation covered the discovery of structural variation from next-generation sequencing data. Structural variation refers to changes in chromosome structure, such as deletions, duplications, inversions, and translocations. Discovery approaches identify signatures of structural variation from read pairs, read depth, and split reads, and fine-map breakpoints through local assembly. Read pairs and read depth were discussed in more detail, including mapping, filtering, clustering, and known issues. Because results vary between approaches and algorithms, combining them is important, and validation is needed.
3. What is structural variation?
• “variation that changes the structure of
a chromosome”
• Mechanisms: NAHR, NHEJ, FoSTeS
• This presentation: focus on discovery
(not: genotyping)
“experiment 4” from last slide Thomas
10. RP - Workflow overview
Mapping
Identify discordant readpairs
Cluster on location
Filter on nr RPs/cluster
Filter on RD
Filter: mappingQ x #readpairs
Identify signatures
Alternative reference
Validate
11. RP - Mapping
• Provides raw data => crucial
• MAQ/bwa
– only report one hit (mappingQ = 0)
– MAQ might prefer mismatches to aberrant
distance!
• Insert size = distribution instead of exact
12. RP - Discordant readpairs
• Orientation
• Distance
– Plot insert size distribution for chromosome
– Very long tail! => difficult to set cutoff:
• 4mad or 0.01%?
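The "4mad" cutoff above can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names and the toy insert sizes are assumptions, and a real run would use the insert sizes observed for a whole chromosome.

```python
from statistics import median

def mad_cutoffs(insert_sizes, n_mads=4):
    """Return (lower, upper) insert-size bounds as median +/- n_mads * MAD."""
    med = median(insert_sizes)
    # MAD = median absolute deviation from the median; robust to the long tail
    mad = median(abs(size - med) for size in insert_sizes)
    return med - n_mads * mad, med + n_mads * mad

def is_discordant_by_distance(insert_size, bounds):
    """A pair is distance-discordant if its insert size falls outside the bounds."""
    lower, upper = bounds
    return insert_size < lower or insert_size > upper

# Toy data (hypothetical): one pair sits far out in the tail
sizes = [190, 195, 200, 200, 205, 210, 5000]
bounds = mad_cutoffs(sizes)
outliers = [s for s in sizes if is_discordant_by_distance(s, bounds)]
```

The median and MAD are used here instead of mean and standard deviation because a single pair in the long tail would otherwise inflate the cutoff.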
18. RP - Clustering
“standard clustering strategy”
– Only consider mate pairs that do not have
concordant mappings
– Ignore read pairs that have more than one
good mapping
Clustering: use insert size distribution
(e.g. 2x4 mad)
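A rough sketch of clustering discordant pairs on location, using a window of 2 x 4 MAD as on the slide. The greedy strategy, the tuple layout (chromosome, left position, right position), and all names are assumptions made for illustration; published tools use more careful clustering.

```python
def cluster_pairs(pairs, max_dist):
    """Greedily group discordant pairs whose two ends both lie close together.

    pairs: iterable of (chrom, left_pos, right_pos) tuples.
    max_dist: maximum distance between neighbouring pairs in a cluster.
    """
    clusters = []
    for chrom, left, right in sorted(pairs):
        last = clusters[-1] if clusters else None
        if (last
                and last[-1][0] == chrom
                and left - last[-1][1] <= max_dist
                and abs(right - last[-1][2]) <= max_dist):
            last.append((chrom, left, right))
        else:
            clusters.append([(chrom, left, right)])
    return clusters

def filter_clusters(clusters, min_pairs=2):
    """Keep clusters supported by at least min_pairs read pairs."""
    return [c for c in clusters if len(c) >= min_pairs]

# Toy input (hypothetical); mad would come from the insert size distribution
mad = 5
pairs = [("chr1", 1000, 6000), ("chr1", 1020, 6030), ("chr1", 1035, 5990),
         ("chr1", 50000, 90000), ("chr2", 1010, 6010)]
clusters = cluster_pairs(pairs, max_dist=2 * 4 * mad)
supported = filter_clusters(clusters, min_pairs=2)
```

Because each pair is only compared with the most recent cluster, this sketch can split clusters that overlap in complex regions.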
19. RP - Clustering: issues
• Ignores pairs that have >1 good mapping =>
no detection within repetitive regions
(segmental duplications)
• What cutoff for what is considered abnormal
distance? (4 mad? 0.01%? 2stdev?)
• Low library quality or mix of libraries =>
multiple peaks in size distribution
20. RP - Filtering
• On nr RPs/cluster
– Normally: n=2
– For high coverage (e.g. pilot 2: 80X): n=5
• On drop in RD & SR
• On (mappingQ x nrRP)
– If published data available: ROC for
different cutoffs mQxnrRP
– If not: very difficult
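The (mappingQ x nrRP) filter could look like the sketch below. The slide does not say how mapping quality is summarised per cluster; taking the mean is an assumption, as are the function names and the cutoff value of 100, which in practice would be chosen from a ROC curve against published calls.

```python
def cluster_score(mapqs):
    """Score a cluster as (mean mapping quality) x (number of read pairs)."""
    mean_mapq = sum(mapqs) / len(mapqs)
    return mean_mapq * len(mapqs)

def filter_by_score(clusters, cutoff):
    """Keep clusters whose score reaches the cutoff.

    clusters: dict mapping a cluster id to the list of mapping qualities
    of its supporting read pairs.
    """
    return {cid: mapqs for cid, mapqs in clusters.items()
            if cluster_score(mapqs) >= cutoff}

# Hypothetical clusters and cutoff
clusters = {"del_1": [20, 40, 60], "del_2": [10, 10]}
kept = filter_by_score(clusters, cutoff=100)
```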
21. RP - Issues
• Difficult => different groups = different results
“consensus set”
– RP & SR: many sets agree
– RD: totally different
• CEU (80X): sometimes drop in RD in all 3,
but RP spanning only in 2 => why??
• Mapper = critical; maq/bwa: only 1 mapping
(=> many false negatives); mosaik, mrFAST:
return more results
22. RP - Issues (2)
• Large insert size: low resolution for detecting
breakpoints
• Small insert size: low resolution for detecting
complex regions
24. RD - General principle
• Similar to aCGH: using reference RD
file (e.g. based on 1kG)
• In theory: higher resolution, but noisier
than aCGH
– Algorithms not mature yet
– More complex steps
=> Data binned
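By analogy with aCGH, binned read depth can be turned into a log2 ratio against a reference depth profile. The sketch below is an assumption about how that comparison might be set up: the median normalisation, the handling of empty bins, and the names are illustrative and not taken from the talk.

```python
import math
from statistics import median

def log2_ratios(sample_depth, reference_depth):
    """Per-bin log2 ratio of sample depth to reference depth.

    Both profiles are first scaled by their own median so that a bin with
    typical depth in both gets a ratio of 0.
    """
    sample_med = median(sample_depth)
    reference_med = median(reference_depth)
    ratios = []
    for s, r in zip(sample_depth, reference_depth):
        if r == 0:
            ratios.append(None)            # no reference signal: uninformative bin
        elif s == 0:
            ratios.append(float("-inf"))   # no reads in sample: possible homozygous loss
        else:
            ratios.append(math.log2((s / sample_med) / (r / reference_med)))
    return ratios

# Toy bins (hypothetical): the third bin has half the expected depth
ratios = log2_ratios([100, 100, 50, 100], [200, 200, 200, 200])
```

A heterozygous deletion is expected near -1 on this scale, which is where the noise mentioned above makes calling hard.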
30. RD - Filtering
• mapQ
– mapQ >= 0 (noisy; few FN, many FP)
– mapQ >= 10
– mapQ >= 30 (many FN, few FP)
• Mean depth exon (often: e.g. +/- 0.01)
– Mean depth > 1
– Mean depth > 5
31. RD - Filtering: what’s left
                     mapQ >= 0   mapQ >= 10   mapQ >= 30
all                  207,000     207,000      207,000
mean DP exon > 1     169,000     163,000      162,000
mean DP exon > 5     160,000     153,000      152,000
32. RD - correction
• Mainly: GC
– Other: repeat-rich regions, mapping Q, …
• Fit linear model GC-content exon and
RD of exon
=> noise decreases
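The GC correction described above can be sketched as an ordinary least-squares fit of exon read depth on exon GC content, followed by removal of the fitted trend. The function names and the toy values are assumptions; only the idea of a linear model comes from the slide.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_correct(gc_content, depth):
    """Subtract the GC trend from each exon's depth, keeping the overall mean."""
    slope, intercept = fit_line(gc_content, depth)
    mean_depth = sum(depth) / len(depth)
    return [d - (slope * g + intercept) + mean_depth
            for g, d in zip(gc_content, depth)]

# Toy exons (hypothetical): depth here is entirely explained by GC content
gc = [0.3, 0.4, 0.5, 0.6]
rd = [30.0, 40.0, 50.0, 60.0]
corrected = gc_correct(gc, rd)
```

In this toy case the corrected depths all collapse to the mean, which is the "noise decreases" effect in its most extreme form.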
35. RD - segmentation
• Identify spikes
• Many segmentation algorithms, e.g.
GADA
• Issues: setting parameters: when to cut
off peaks?
– Combine outputs from different runs with
different parameters
– Compare to known CNVs
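To make the parameter issue concrete, here is a deliberately simple threshold-based segmenter. It is not GADA and not the method used in the talk; the names, the threshold, and the minimum run length are all assumptions. Running it with different parameters shows how the calls change.

```python
def call_segments(log_ratios, threshold, min_bins=2):
    """Report runs of consecutive bins beyond +/- threshold.

    Returns (start, end, kind) tuples with end exclusive; threshold must be > 0.
    """
    segments = []
    start, sign = 0, 0
    # A trailing 0.0 acts as a sentinel so the last run is closed
    for i, value in enumerate(list(log_ratios) + [0.0]):
        current = 1 if value >= threshold else -1 if value <= -threshold else 0
        if current != sign:
            if sign != 0 and i - start >= min_bins:
                segments.append((start, i, "gain" if sign > 0 else "loss"))
            start, sign = i, current
    return segments

# Toy log2 ratios (hypothetical)
ratios = [0.0, 0.0, -1.0, -1.0, -1.0, 0.0, 0.8, 0.9, 0.0]
loose = call_segments(ratios, threshold=0.5)
strict = call_segments(ratios, threshold=0.95)
```

With the looser threshold both the loss and the gain are called; with the stricter one only the loss survives, which is why combining runs and comparing to known CNVs is suggested above.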
39. RD - Issues
• How to assess TP/FP/FN? => compare
with known CNVs
• Breakpoints: unknown
– 1 datapoint/exon
– Can be outside of exon
• Different parameters for rare vs
common CNVs => which?
42. SR - Mapping
Short subsequences => many possible
mappings
Solution: “anchored split mapping” (e.g.
Pindel)
44. D. Local reassembly
Aim: to determine breakpoints
Which reads?
– for deletions: local reads
– for insertions: hanging reads for read pairs with
only one read mapped
– (rather not: unmapped reads)
For large region: split up
50. Genotyping
• Create alternative reference => remap reads
– All reads vs reads covering variant loci
– Whole-genome vs concatenation of variant loci
• Homozygous insertions/deletions: should disappear
• Heterozygous insertions/deletions: should have different
signatures
• Bayesian approach: see what’s the most likely: do the reads
support wild-type/het/homnonref?
• Not exact mapping => local reassembly
– Microhomologies & non-template sequence => “breakpoint”
= region of 2-10 bp
• Convention: left-most position reported (but not always)
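The Bayesian genotyping bullet above can be illustrated with a small binomial model: given how many reads at a locus support the variant allele after remapping, compute the posterior for wild-type, het, and hom-nonref. The binomial likelihood, the flat prior, the 1% error rate, and the names are assumptions for this sketch, not the model from the talk.

```python
from math import comb

def genotype_posteriors(alt_reads, total_reads, error_rate=0.01):
    """Posterior probability of each genotype under a flat prior.

    alt_reads: reads supporting the variant allele.
    total_reads: all informative reads at the locus.
    """
    # Expected fraction of variant-supporting reads per genotype (assumed)
    expected_alt_fraction = {
        "wild-type": error_rate,
        "het": 0.5,
        "hom-nonref": 1.0 - error_rate,
    }
    likelihoods = {
        genotype: comb(total_reads, alt_reads)
        * p ** alt_reads
        * (1.0 - p) ** (total_reads - alt_reads)
        for genotype, p in expected_alt_fraction.items()
    }
    total = sum(likelihoods.values())
    return {genotype: lik / total for genotype, lik in likelihoods.items()}

def most_likely_genotype(alt_reads, total_reads):
    posteriors = genotype_posteriors(alt_reads, total_reads)
    return max(posteriors, key=posteriors.get)

# Hypothetical locus: 10 of 20 reads support the deletion
call = most_likely_genotype(10, 20)
```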
51. References and software
• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1541 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)