Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing
1. Building a platinum human genome assembly from single haplotype human genomes generated from long molecule sequencing
Karyn Meltz Steinberg
ASHG 2015
@KMS_Meltzy
6. How do we define platinum and gold standards?
Metric                                          GRCh38     Platinum (CHM1)   Gold (NA19240)
% Reference genome covered                      100        98.40             90.80
% Assigned chromosomes                          99.60      98.40             90.80
% gene models covered (>95% id, >90% length)    99.96      98.78             94.26
Contig N50                                      67.8 Mb    26.9 Mb           6.0 Mb
Number of gaps                                  875        3,640             3,568
Total Assembled size                            3.067 Gb   2.996 Gb          2.745 Gb
% haplotype blocks (>1kb) resolved              NA         >95               >80
http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
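The gene-model row above uses a two-part threshold (>95% identity, >90% of the transcript length). A minimal sketch of how such a filter could be applied is shown below; the function names and the input layout (identity in percent, aligned length, transcript length) are illustrative assumptions, not the pipeline used for the slide.

```python
# Hypothetical helper: decide whether one transcript alignment counts as
# "covered" under the thresholds quoted on the slide (>95% id, >90% length).
def gene_model_covered(identity_pct, aligned_len, transcript_len,
                       min_identity=95.0, min_length_frac=0.90):
    if transcript_len <= 0:
        raise ValueError("transcript_len must be positive")
    length_frac = aligned_len / transcript_len
    # Both comparisons are strict, matching the ">" signs on the slide.
    return identity_pct > min_identity and length_frac > min_length_frac


# Hypothetical helper: percentage of gene models that pass the filter.
# `alignments` is an iterable of (identity_pct, aligned_len, transcript_len).
def percent_covered(alignments):
    alignments = list(alignments)
    if not alignments:
        raise ValueError("no alignments given")
    passed = sum(gene_model_covered(*a) for a in alignments)
    return 100.0 * passed / len(alignments)


# Toy input, not real data.
toy = [
    (99.0, 950, 1000),   # passes both thresholds
    (94.0, 1000, 1000),  # fails identity
    (99.0, 900, 1000),   # fails length (exactly 90% is not >90%)
    (98.5, 1990, 2000),  # passes both thresholds
]
print(percent_covered(toy))
```

Whether the original analysis treated the thresholds as strict or inclusive is not stated on the slide; the strict comparison here is an assumption.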
7. CHM13 Draft Assembly (GCA_000983455.1)
• 60X PacBio (P5 and P6 chemistry)
• Average read length ~11kb
• Daligner/Falcon v 0.2
Total sequence length 2,851,367,788
Number of contigs 2,873
Contig N50 12,981,785
Contig L50 68
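For readers unfamiliar with the Contig N50 and L50 statistics reported above, the following is a minimal sketch of the usual definition: sort contigs from longest to shortest and walk down the list until half of the total assembled length is reached. The function name and the toy contig lengths are illustrative assumptions; this is not the tool that produced the slide's figures.

```python
# Hypothetical helper: compute (N50, L50) from a list of contig lengths.
# N50 is the length of the contig at which the running total first reaches
# half of the assembly size; L50 is how many contigs that took.
def n50_l50(lengths):
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Toy contig lengths (total 300, so the halfway point is 150).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_contigs))
```

Applied to the CHM13 draft, the same definition means that the 68 longest contigs, each at least 12,981,785 bp, together account for at least half of the 2,851,367,788 bp assembly.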
8. Gene Model (RefSeq) Analysis
                                    GRCh38   CHM1_1.1   CHM1_PB1   CHM1_PB2   CHM13
Number of sequences not aligning    21       88         67         67         125
Split Transcripts                   8        35         1,245      1,131      285
CDS coverage <95%                   17       266        1,339      1,212      265
Total Sequences Retrieved from Entrez: 49,680
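To put the counts above in proportion, each can be expressed as a percentage of the 49,680 RefSeq sequences retrieved. The sketch below does this for the "not aligning" row only; the dictionary layout and function name are illustrative assumptions, and the counts are copied from the slide.

```python
# Total RefSeq sequences retrieved from Entrez, as stated on the slide.
TOTAL_TRANSCRIPTS = 49_680

# "Number of sequences not aligning" row, copied from the slide.
not_aligning = {
    "GRCh38": 21,
    "CHM1_1.1": 88,
    "CHM1_PB1": 67,
    "CHM1_PB2": 67,
    "CHM13": 125,
}


# Hypothetical helper: express a count as a percentage of the total.
def percent_of_total(count, total=TOTAL_TRANSCRIPTS):
    return 100.0 * count / total


for assembly, count in not_aligning.items():
    print(f"{assembly}: {percent_of_total(count):.3f}% of sequences not aligning")
```

The same helper can be reused for the split-transcript and CDS-coverage rows by swapping in those counts.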
9. Short read sequence analysis
• 100X Illumina sequence
• Align with BWA-MEM to ordered and oriented assembly
• Variant calling via SpeedSeq (Chiang et al., 2015)
  • SNVs, indels: FreeBayes
  • SVs: LUMPY, SVTyper
  • CNV: CNVnator
10. CHM13 Illumina data aligned to CHM13 assembly
202,016 SNVs/indels on unplaced scaffolds
SV_TYPES       >10kb   5-10kb   1-5kb   <1kb
DELETIONS      174     131      430     2582
INVERSIONS     5       0        2       7
DUPLICATIONS   151     112      309     113
TOTAL          330     243      741     2702
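The table above groups structural variant calls by type and size class. A minimal sketch of that kind of tally is given below, assuming each call has already been reduced to a (type, length in bp) pair. The function names, the input format, and the handling of calls that fall exactly on a bin boundary are assumptions, since the slide does not specify them.

```python
from collections import Counter


# Hypothetical helper: map an SV length in bp to the size classes used in
# the table. Deletions are often reported with negative lengths, so the
# absolute value is taken first. Boundary handling is an assumption.
def size_bin(length_bp):
    length_bp = abs(length_bp)
    if length_bp < 1_000:
        return "<1kb"
    if length_bp < 5_000:
        return "1-5kb"
    if length_bp <= 10_000:
        return "5-10kb"
    return ">10kb"


# Hypothetical helper: count calls per (type, size class), plus a TOTAL row.
def tabulate(calls):
    table = Counter()
    for sv_type, length_bp in calls:
        bin_name = size_bin(length_bp)
        table[(sv_type, bin_name)] += 1
        table[("TOTAL", bin_name)] += 1
    return table


# Toy calls, not real data.
toy_calls = [
    ("DELETIONS", -350),
    ("DELETIONS", 2_400),
    ("DELETIONS", 800),
    ("INVERSIONS", 7_500),
    ("DUPLICATIONS", 12_000),
]
table = tabulate(toy_calls)
for key in sorted(table):
    print(key, table[key])
```

In the slide's table, each TOTAL entry equals the sum of the three type rows in its column (for example 174 + 5 + 151 = 330), which is the invariant the TOTAL counter in this sketch maintains.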
11. BioNano SV calls can be used to identify misassembly
[Figure: BioNano Map aligned against the PacBio Assembly. Labels: Collapse; Expansion in Assembly; Gap in Sequence]
BioNano alignment to CHM13
SV_TYPES     Count
DELETIONS    41
INVERSIONS   10
INSERTIONS   15
TOTAL        66
19. Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis
Rick Wilson, Bob Fulton, Wes Warren, Tina Graves-Lindsay, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Aye Wollam
The Finishing and Bioinformatics Teams at The Genome Institute
University of Washington
Evan Eichler, John Huddleston, Archana Raja
NCBI
Valerie Schneider
University of Pittsburgh School of Medicine (CHM13 cell line)
Urvashi Surti
Personalis
Deanna Church
BioNano Genomics
Palak Sheth
Pacific Biosciences
Jason Chin, Nick Sisneros