Giab ashg webinar 160224
1. Genome in a Bottle Consortium
February 24, 2016
Reference Materials for Human Genome Sequencing
Justin Zook, Ph.D. and Marc Salit, Ph.D.
National Institute of Standards and Technology
2. Outline
• Genome in a Bottle (GIAB) products
• Current and future work
• Best practices for using GIAB products to benchmark variant calls
• Genome in a Bottle
– Open consortium to develop well-characterized genomes for benchmarking
– 100-150 public, private, and academic participants at workshops
3. GIAB Scope
• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.
• Priority is authoritative characterization of human genomes.
GIAB steering committee, Aug 2015
4. Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
• Enable regulated applications
5. Analytical Performance
• Use well-characterized genomic DNA reference materials to benchmark performance
• Tools to facilitate their use
– With the Global Alliance Data Working Group Benchmarking Team
Generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis
6. High-confidence SNP/indel calls
• Methods to develop SNP/indel call set described in manuscript
• Broad and quick adoption of call set for benchmarking
– struck nerve
Zook et al., Nature Biotechnology, 2014.
7. Candidate NIST Reference Materials
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A
Note: RMs 8391 to 8393 are planned for release by end of Q2 2016
8. Dataset (columns: AJ Son, AJ Parents, Chinese son, Chinese parents, NA12878)
Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X
10. Data Release:
Real-time, Open, Public Release
Individual Datasets
• Uploaded to GIAB FTP site as data are collected
• Includes raw reads, aligned reads, and variant/reference calls
• 12 datasets described in bioRxiv paper
Integrated High-confidence Calls
• Develop SNP, indel, and homozygous reference calls similar to NA12878
• Developing methods to form high-confidence calls for difficult variant types and regions
• Released calls are versioned
• Preliminary call-sets will be made available to be critiqued
11. SNP/Indel Integration Method Update
• Implementing refined integration methods
– Developed so others can readily reproduce results
– Consistent results for all GIAB genomes
– Simpler process taking advantage of best practices for each technology
• Validating with released NA12878 RM data
– Preliminary comparisons show minor changes
• Application to PGP trios
– Plan to analyze AJ trio by Q2 2016
– Release of NIST RMs in Q2 2016
– Develop calls for GRCh38
12. Proposed approach to form high-confidence SV (and non-SV) calls
• Generate Candidate Calls (Aug/Dec 2015)
• Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection (Aug 2015-Jan 2016)
• Integrate new and revised calls; manual inspection (Planning in Jan-Feb 2016)
• Combine integrated calls; manual inspection; targeted experimental validation? (Feb 2016 and beyond)
13. Preliminary comparisons of 17 Deletion Callsets
Sensitivity to calls in 2 technologies
NOTE: These are preliminary comparisons of data under active development and likely differ from the true sensitivity of the callers
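One way such a comparison could be computed is sketched below. It assumes that deletion calls from all callsets have already been matched into shared events, and that each event records which technologies support it; the event IDs, caller names, and the `min_tech` parameter are hypothetical and not taken from the slides.

```python
from typing import Dict, Set


def sensitivity_to_multi_tech(
    callset_events: Dict[str, Set[str]],
    event_technologies: Dict[str, Set[str]],
    min_tech: int = 2,
) -> Dict[str, float]:
    """Fraction of multi-technology events that each callset contains.

    callset_events: callset name -> set of event IDs it called (hypothetical IDs).
    event_technologies: event ID -> set of technologies supporting that event.
    """
    # Events supported by at least `min_tech` technologies act as the comparison set.
    benchmark = {e for e, techs in event_technologies.items() if len(techs) >= min_tech}
    if not benchmark:
        raise ValueError("no events are supported by enough technologies")
    return {
        name: len(events & benchmark) / len(benchmark)
        for name, events in callset_events.items()
    }


event_technologies = {
    "del1": {"Illumina", "PacBio"},
    "del2": {"Illumina"},
    "del3": {"PacBio", "BioNano", "10X"},
}
callset_events = {"callerA": {"del1", "del2"}, "callerB": {"del1", "del3"}}
result = sensitivity_to_multi_tech(callset_events, event_technologies)
```

Because the comparison set is built from the callsets themselves, the resulting fractions are relative and, as the note says, probably not the true sensitivity.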
14. Preliminary comparisons of 17 Deletion Callsets
Difference between predicted size and median predicted size
NOTE: These are preliminary comparisons of data under active development and likely differ from the true size accuracy
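The size comparison can be illustrated for a single deletion event as follows. This is a minimal sketch, assuming the size predictions of the different callsets for the same event have already been collected; the callset names and sizes are made up for illustration.

```python
from statistics import median
from typing import Dict


def size_deviations(predicted_sizes: Dict[str, float]) -> Dict[str, float]:
    """Difference between each callset's predicted size and the median prediction.

    predicted_sizes: callset name -> predicted deletion size in bp for one event.
    """
    if not predicted_sizes:
        raise ValueError("at least one predicted size is required")
    med = median(predicted_sizes.values())
    return {name: size - med for name, size in predicted_sizes.items()}


# Hypothetical predictions for one deletion from three callsets.
deviations = size_deviations({"callsetA": 100, "callsetB": 110, "callsetC": 400})
```

The median is used as the reference point because it is less affected by a single outlying prediction than the mean would be.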
15. Preliminary comparisons of 17 Deletion Callsets
Number of unique calls
NOTE: These are preliminary comparisons of data under active development without filtering, and unique calls may be correct
16. GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of the data underlying each call
17. Global Alliance for Genomics and Health Benchmarking Task Team
Progress:
• Initial version of standardized definitions for performance metrics like TP, FP, and FN.
• Continued development of sophisticated benchmarking tools
– vcfeval – Len Trigg
– hap.py – Peter Krusche
– vgraph – Kevin Jacobs
• Standardized intermediate and final file formats
• Standardized bed files with difficult genome contexts for stratification
• github.com/ga4gh/benchmarking-tools
18. Proposed Performance Metrics Definitions
• Define TP/FP/FN/TN in 4 ways depending on required stringency of match:
– Loose match: TP if within x-bp of a true variant
– Allele match: TP if ALT allele matches
– Genotype match: TP if genotype and ALT allele match
– Phasing match: TP if genotype, ALT allele, and phasing with nearby variants all match
• True negatives are difficult to define because an infinite number of potential alleles exist
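The four stringency levels can be illustrated with a small sketch. It is not the GA4GH implementation. The `Call` record, the `window` default standing in for "x-bp", and the treatment of phasing (comparing ordered genotypes of two phased calls rather than comparing phase against neighboring variants) are all simplifying assumptions. It also assumes both calls use the same variant representation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Call:
    """Hypothetical single-ALT variant record used only for this illustration."""
    chrom: str
    pos: int
    alt: str
    genotype: Tuple[int, int]  # e.g. (0, 1); order matters only when phased
    phased: bool = False


def match_level(query: Call, truth: Call, window: int = 10) -> str:
    """Return the strictest level at which `query` matches `truth`."""
    if query.chrom != truth.chrom or abs(query.pos - truth.pos) > window:
        return "none"
    # Within the window but not the same position and ALT allele: loose match only.
    if query.pos != truth.pos or query.alt != truth.alt:
        return "loose"
    # Same allele, but the unordered genotype differs: allele match only.
    if sorted(query.genotype) != sorted(truth.genotype):
        return "allele"
    # Simplified phasing check: both calls phased with identical allele order.
    if not (query.phased and truth.phased and query.genotype == truth.genotype):
        return "genotype"
    return "phasing"


truth = Call("1", 100, "A", (0, 1), phased=True)
levels = [
    match_level(Call("1", 200, "A", (0, 1)), truth),
    match_level(Call("1", 105, "T", (0, 1)), truth),
    match_level(Call("1", 100, "A", (1, 1)), truth),
    match_level(Call("1", 100, "A", (1, 0)), truth),
    match_level(Call("1", 100, "A", (0, 1), phased=True), truth),
]
```

A call counted as a TP at a stricter level is also a TP at every looser level, so the same comparison can be reported at whichever stringency an application requires.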
19. Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
20. Challenges in Benchmarking Small Variant Calling
• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Challenges with benchmarking complex variants near boundaries of high-confidence regions
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
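As an illustration of the last point, a confidence interval on a proportion such as sensitivity, TP / (TP + FN), can be computed with the Wilson score interval. The slides do not prescribe an interval method, so the choice of Wilson here is an assumption, as are the example counts.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical stratum: 95 true positives and 5 false negatives.
tp, fn = 95, 5
low, high = wilson_interval(tp, tp + fn)
```

With only 100 benchmark variants in a stratum, an observed sensitivity of 95% is compatible with a true value anywhere from roughly 89% to 98%, which is why intervals matter most for small, stratified subsets.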
22. Acknowledgments
• FDA
• Many members of Genome in a Bottle
– New members welcome!
– Sign up on website for email newsletters
GIAB Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
23. For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov