4. Today we’re releasing 4 new
GIAB RM Genomes.
• PGP Human
Genomes
– AJ son
– AJ trio
– Asian son
• Parents also
characterized
• Available
immediately
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other extracted DNA sample a lab would process and
analyze. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
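As a minimal sketch of the true positive / false positive / false negative estimates described above, assuming variants can be matched exactly by position, alleles, and genotype and that comparison is restricted to a set of high-confidence regions. The names `in_regions` and `benchmark` and the toy data are hypothetical. Real comparisons also need to reconcile different representations of the same variant, which this sketch ignores, and true negatives are not counted here.

```python
from bisect import bisect_right

def in_regions(regions, chrom, pos):
    """Return True if 0-based pos lies in a half-open (start, end) interval."""
    intervals = regions.get(chrom, [])
    # Find the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

def benchmark(truth, query, regions):
    """Count TP/FP/FN using only calls inside the high-confidence regions."""
    truth_in = {v for v in truth if in_regions(regions, v[0], v[1])}
    query_in = {v for v in query if in_regions(regions, v[0], v[1])}
    tp = len(truth_in & query_in)
    fp = len(query_in - truth_in)
    fn = len(truth_in - query_in)
    return {
        "TP": tp, "FP": fp, "FN": fn,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }

# Toy data (hypothetical): variants are (chrom, 0-based pos, ref, alt, genotype).
regions = {"chr1": [(100, 200), (500, 600)]}  # sorted, non-overlapping
truth = {("chr1", 150, "A", "G", "0/1"), ("chr1", 550, "C", "T", "1/1")}
query = {
    ("chr1", 150, "A", "G", "0/1"),  # matches truth: TP
    ("chr1", 180, "G", "A", "0/1"),  # inside regions, not in truth: FP
    ("chr1", 300, "T", "C", "0/1"),  # outside regions: not assessed
}
result = benchmark(truth, query, regions)
print(result)
```

The truth call at position 550 is missing from the query, so it counts as a false negative. The query call at position 300 is neither right nor wrong here, because it falls outside the regions being assessed.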
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
5. Today we’re releasing 4 new
GIAB RM Genomes.
• New, reproducible
methods applied
to characterize
high-confidence
SNPs/indels in 85-90%
of each genome
6. We’re also releasing a
Microbial Genome RM
National Institute of Standards & Technology
Report of Investigation
Reference Material 8375
Microbial Genomic DNA Standards for Sequencing Performance Assessment
(MG-001, MG-002, MG-003, MG-004)
This Reference Material (RM) is intended for validation, optimization, process evaluation, and performance
assessment of whole genome sequencing. A unit of RM 8375 consists of four vials. Each vial contains a different
microbial genomic DNA sample (MG-001 Salmonella Typhimurium LT2, MG-002 Staphylococcus aureus, MG-003
Pseudomonas aeruginosa, and MG-004 Clostridium sporogenes). Each vial contains approximately 2 µg of microbial
genomic DNA; with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA;
in TE buffer (10 mM TRIS, 0.1 mM EDTA, pH 8.0).
This material is intended to help assess performance of high-throughput DNA sequencing methods. This genomic
DNA is intended to be analyzed in the same way as any other extracted DNA sample a laboratory would analyze, such
as through the use of genome assembly or variant calling bioinformatics pipelines. Because the RM is extracted
DNA, it does not assess pre-analytical steps such as DNA extraction. It does, however, challenge sequencing library
preparation, sequencing machines, base calling algorithms, and the subsequent bioinformatics analyses such as variant
calling. This RM is not intended to assess other bioinformatics steps such as genome assembly, strain identification,
phylogenetic analysis, or genome annotation.
Information Values: Information values are currently provided for the whole genome sequence to enable
performance assessment of variant calling and assembly methods. An information value is considered to be a value
that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty
associated with the value. We describe and disseminate our best, most confident, estimate of the assembly using the
data and methods available at present [1]. Information values cannot be used to establish metrological traceability.
The genome sequence files referenced in this Report of Investigation are available at:
MG-001 Salmonella Typhimurium LT2
https://github.com/usnistgov/NIST_Micro_Genomic_RM_Data/MG001/ref_genome/MG001_v1.00.fasta
MG-002 Staphylococcus aureus
This Reference Material
(RM) is intended for
validation, optimization,
process evaluation, and
performance assessment of
whole genome sequencing.
• Salmonella Typhimurium
• Pseudomonas
aeruginosa
• Staphylococcus aureus
• Clostridium sporogenes
8. What’s JIMB?
• Joint Initiative for
Metrology in Biology
– develop standards,
methods, tools and
measurement science
– make biology easier to
engineer
– make reproducibility and
reliability easier
• lower barriers to
translation of innovation
• enable scaling through
distribution of labor
Faculty
• Science
• Technology Development
• Innovation
NIST
• Metrology
• Standards
Realization Lab
• Measurement Science
Trainees
• Postdocs
• Coursework
• Graduate Trainees
Commercial
• Customers
• Technology
• Metrology Training
• Workforce
9. The Joint Initiative for
Metrology in Biology
is Genomics and Synthetic
Biology.
DNA Read and Write.
10. Genome in a Bottle Consortium
Whole Genome Variant Calling
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials to
evaluate performance
– materials certified for their
variants against a reference
sequence, with confidence
estimates
• established consortium to
develop reference
materials, data, methods,
performance metrics
• Characterized Pilot Genome
NA12878
• Ashkenazim Trio, Asian son
from PGP released today!
generic measurement process
11. Bringing Principles of Metrology
to the Genome
• Reference materials
– DNA in a tube you can buy
from NIST
– NA12878 pilot sample,
now 2 PGP-sourced trios
• Extensive state-of-the-art
characterization
– as good as we can get for
small variants
– arbitrated “gold standard”
calls for SNPs, small indels
• “Upgradable” as
technology develops
• Analysis of all samples
ongoing as technology
develops
• PGP genomes suitable for
commercial derived
products
• Developing benchmarking
tools and software
– with GA4GH
• Samples being used to
develop and demonstrate
new technology
12. We are liaising with…
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Genome Reference
Consortium
• 1000 Genomes SV group
• CAP/CLIA
• Global Alliance for
Genomics and Health
Benchmarking Team
• ABRF
• FDA
• SEQC
• Global metrology system
13. Agenda
Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy'
variants and regions of the genome
– Topic #2: Selecting future genomes for
Reference Materials
Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot
Reference Material
• Discussion of plans to release pilot
Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans
and discussion
• Steering committee Overview
• First meeting of the Steering
Committee (others adjourn)
Please Note
Slides will be made available on SlideShare after
the workshop (see genomeinabottle.org).
Tweets are welcome unless the speaker requests
otherwise. Please use #giab as the hashtag.
14. NIST Reference Materials
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A
15. Data for GIAB PGP Trios
Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome | | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR | | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Pseudo-long reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.02x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son | | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly
16. Dataset AJ Son AJ Parents Chinese son Chinese parents NA12878
Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X
19. Integration Methods to Establish
Reference Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of
bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
21. New calls (v3.3) vs. old calls (v2.19)
V3.3
• 3441361 match PG
• 550982 PG calls outside
high conf
• 124715 calls not in PG
• After excluding low
confidence regions and
regions around filtered PG
calls:
– 40 calls not in PG
– 60 extra PG calls
V2.19
• 3030717 match PG
• 1018795 PG calls outside
high conf
• 122359 calls not in PG
• After excluding low
confidence regions and
regions around filtered PG
calls:
– 87 calls not in PG
– 404 extra PG calls
22. New calls (v3.3) vs. old calls (v2.19)
More high-confidence calls match Platinum Genomes
23. New calls (v3.3) vs. old calls (v2.19)
Similar extra calls not in Platinum Genomes
24. New calls (v3.3) vs. old calls (v2.19)
~80% fewer differences from PG in high confidence regions
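The ~80% figure can be checked directly from the counts on the v3.3 vs. v2.19 comparison slide. A minimal sketch: the dictionary keys and helper names are hypothetical, and the share of PG calls inside the high-confidence regions is approximated here as matches divided by matches plus PG calls outside high confidence.

```python
# Counts copied from the v3.3 vs. v2.19 comparison (PG = Platinum Genomes).
v33 = {"match_pg": 3441361, "pg_outside_high_conf": 550982,
       "not_in_pg": 40, "extra_pg": 60}
v219 = {"match_pg": 3030717, "pg_outside_high_conf": 1018795,
        "not_in_pg": 87, "extra_pg": 404}

def differences(counts):
    """Differences from PG left after excluding low-confidence regions."""
    return counts["not_in_pg"] + counts["extra_pg"]

def share_of_pg_in_high_conf(counts):
    """Approximate share of PG calls that match a high-confidence call."""
    return counts["match_pg"] / (counts["match_pg"] + counts["pg_outside_high_conf"])

reduction = 1 - differences(v33) / differences(v219)
print(f"differences: {differences(v219)} -> {differences(v33)} ({reduction:.1%} fewer)")
print(f"PG calls matched: {share_of_pg_in_high_conf(v219):.1%} -> "
      f"{share_of_pg_in_high_conf(v33):.1%}")
```

The differences drop from 491 to 100, a reduction of about 79.6%, while the approximate share of PG calls matched rises from about 74.8% to about 86.2%.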
25. New calls (v3.3) vs. old calls (v2.19)
Example vcf (verily) Stratified
V3.3
• 17% of SNPs not assessed
– 23% of SNPs in RefSeq coding
– 53% of SNPs in “bad
promoters”
• 78% of indels not assessed
– 0.7% difference rate
• 17% FP in regions
homologous to decoy
V2.19
• 27% of SNPs not assessed
– 36% of SNPs in RefSeq coding
– 82% of SNPs in “bad
promoters”
• 78% of indels not assessed
– 1.2% difference rate
• 0.2% FP in regions
homologous to decoy
26. Principles of Integration Process
• Form sensitive variant
calls from each dataset
• Define “callable regions”
for each callset
• Filter calls from each
method with annotations
unlike concordant calls
• Compare high-confidence
calls to other callsets and
manually inspect subset
of differences
– vs. pedigree-based calls
– vs. common pipelines
– Trio analysis
• When benchmarking a
new callset against ours,
most putative FPs/FNs
should actually be
FPs/FNs
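The first two principles above can be illustrated with a toy concordance check. This is a simplified sketch and not the GIAB integration pipeline. It assumes each callset comes with a set of callable sites, treats a missing call at a callable site as homozygous reference (`0/0`), and requires at least two callable callsets to agree. The function name `integrate`, the threshold, and the data are all hypothetical. Discordant sites are set aside here rather than arbitrated.

```python
def integrate(candidates, callsets):
    """Classify candidate sites by agreement among callsets that can call them.

    callsets maps a name to {"callable": set of sites, "calls": {site: genotype}}.
    """
    high_conf, uncertain = {}, []
    for site in sorted(candidates):
        # Only callsets whose callable regions cover the site get a vote;
        # no call at a callable site is read as homozygous reference.
        votes = [cs["calls"].get(site, "0/0")
                 for cs in callsets.values() if site in cs["callable"]]
        if len(votes) >= 2 and len(set(votes)) == 1:
            high_conf[site] = votes[0]
        else:
            uncertain.append(site)
    return high_conf, uncertain

# Toy callsets (hypothetical names and genotypes).
a, b, c = ("chr1", 100), ("chr1", 200), ("chr1", 300)
callsets = {
    "illumina": {"callable": {a, b, c}, "calls": {a: "0/1", b: "1/1"}},
    "cg":       {"callable": {a, b},    "calls": {a: "0/1"}},
    "ion":      {"callable": {a, c},    "calls": {a: "0/1", c: "0/1"}},
}
candidates = set().union(*(cs["calls"] for cs in callsets.values()))
high_conf, uncertain = integrate(candidates, callsets)
print(high_conf, uncertain)
```

Only the first site is called identically by every callset that can see it. The other two sites each have one callable callset that disagrees, so they stay out of the high-confidence set.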
27. Criteria for including new callsets
• Form sensitive variant
calls from each dataset
• Define “callable regions”
for each callset
• Good coverage and MapQ
• Use knowledge about
technology and manual
inspection to exclude repetitive
regions difficult for each dataset
• For new callsets, ensure most
FNs in callable regions relative
to current high-confidence calls
are questionable in the current
calls
• Filter calls from each
method with annotations
unlike concordant calls
– Annotations for which
outliers are expected to
indicate bias should be
selected for each callset
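One way to picture "annotations unlike concordant calls" is a percentile cutoff learned from the concordant calls. A minimal sketch, assuming a single numeric annotation (read depth, `DP`) and a hypothetical 1% cutoff on each tail. The annotations and cutoffs actually used are chosen per callset and are not specified here.

```python
from statistics import quantiles

def outlier_bounds(concordant_values, tail_pct=1):
    """Return (low, high) cut points excluding tail_pct percent on each tail."""
    cuts = quantiles(concordant_values, n=100, method="inclusive")
    return cuts[tail_pct - 1], cuts[-tail_pct]

def filter_calls(calls, annotation, bounds):
    """Split calls into those inside and outside the concordant range."""
    low, high = bounds
    kept = [c for c in calls if low <= c[annotation] <= high]
    filtered = [c for c in calls if not (low <= c[annotation] <= high)]
    return kept, filtered

# Toy data (hypothetical): depths of concordant calls, then calls to screen.
concordant_depths = list(range(20, 41))
calls = [
    {"site": ("chr1", 100), "DP": 30},   # typical depth
    {"site": ("chr1", 200), "DP": 5},    # unusually low depth
    {"site": ("chr1", 300), "DP": 120},  # unusually high depth
]
bounds = outlier_bounds(concordant_depths)
kept, filtered = filter_calls(calls, "DP", bounds)
print(bounds, [c["site"] for c in kept], [c["site"] for c in filtered])
```

Calls with depth far below or above what concordant calls show are flagged as possibly biased, which matches the idea that outliers in a well-chosen annotation indicate bias.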
28. Global Alliance for Genomics and Health
Benchmarking Task Team
• Developed standardized
definitions for
performance metrics like
TP, FP, and FN.
• Developing sophisticated
benchmarking tools
• Integrated into a single
framework with
standardized inputs
and outputs
• Standardized bed files
with difficult genome
contexts for stratification
Credit: GA4GH, Abby Beeler, Ellie Wood
Stratification of FP Rates
Higher FP rates at Tandem Repeats
https://github.com/ga4gh/benchmarking-tools
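Stratification can be pictured as recomputing an error fraction inside each set of regions. The sketch below does not use the GA4GH benchmarking tools. It assumes query calls already labelled TP or FP, strata given as half-open `(chrom, start, end)` intervals, and reports FP / (TP + FP) per stratum. The function name `stratified_fp_fraction` and the data are hypothetical.

```python
from collections import Counter

def stratified_fp_fraction(calls, strata):
    """Per stratum, return FP / (TP + FP) among query calls falling in it."""
    tp, fp = Counter(), Counter()
    for chrom, pos, label in calls:
        for name, intervals in strata.items():
            if any(c == chrom and s <= pos < e for c, s, e in intervals):
                (tp if label == "TP" else fp)[name] += 1
    # Strata with no calls are left out instead of dividing by zero.
    return {name: fp[name] / (tp[name] + fp[name])
            for name in strata if tp[name] + fp[name]}

# Toy data (hypothetical): one stratum for tandem repeats, one for everything.
strata = {
    "tandem_repeats": [("chr1", 1000, 1100)],
    "all": [("chr1", 0, 10000)],
}
calls = [
    ("chr1", 1010, "FP"),
    ("chr1", 1050, "TP"),
    ("chr1", 2000, "TP"),
    ("chr1", 3000, "TP"),
    ("chr1", 4000, "TP"),
]
fractions = stratified_fp_fraction(calls, strata)
print(fractions)
```

In this toy case half the calls in the tandem repeat stratum are false positives, against one in five overall, which is the kind of contrast a stratified report is meant to surface.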
31. Acknowledgements
• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
• Genome in a Bottle
Consortium
• GA4GH Benchmarking
Team
• FDA
– Liz Mansfield
– Zivana Tezak
– David Litwack
32. For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis
Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://www.nature.com/articles/sdata201625
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
Public workshops
– Next one Sep 15-16 at NIST, MD, USA
NIST postdoc opportunities available! Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov
34. What is the standards architecture to
demonstrate safety and efficacy?
Preanalytical
Sequencing
Sequence
Bioinformatics
Functional Variant
Annotation
Clinical Variant
Knowledgebase
Query
Clinical
Interpretation
Reporting
EHR Archival