1. Review of:
“A draft human
pangenome reference”
Presented by:
Stuart MacGowan
Liao, WW., Asri, M., Ebler, J. et al.
A draft human pangenome reference.
Nature 617, 312–324 (2023).
https://doi.org/10.1038/s41586-023-05896-x
Source: Nature Vol. 617 Issue 7960 (Image: Darryl Leja/NHGRI)
2. Why isn’t one human genome enough?
Source: The Human Pangenome – NHGRI/ YouTube
3. A draft human
pangenome reference
• First draft of human pangenome reference!
• 47 phased, diploid assemblies from
genetically diverse individuals
• Coverage and accuracy:
• Covers over 99% of the expected sequence in each genome
• Over 99% accurate at the structural and base pair levels
• Contributions:
• Reveals new alleles at structurally complex loci
• Adds 119M base pairs of euchromatic polymorphic sequences
• Identifies 1,115 new gene duplications
• Benefits:
• Reduces small variant discovery errors by 34%
• Increases the number of structural variants detected per haplotype by
104% compared to GRCh38-based workflows
Chart. Variant Discovery in GRCh38 vs. Pangenome: small variant discovery errors (%) and detected structural variants (%), each shown as percent relative to GRCh38.
© 2023 Stuart A. MacGowan, CC BY 4.0
4. Introduction
• Limitations of current human genome reference (GRCh38):
• Contains ~210 Mb of unknown or simulated
sequences, limiting study scope.
• Achievements with T2T-CHM13 genome sequencing:
• Uncovered 3.7 million additional SNPs.
• Better representation of true copy number variants
(CNVs).
• Shortcomings of a single reference genome:
• Can't capture full human genetic diversity.
• Overlooks many structural variant (SV) regions,
which significantly impact gene function.
• The solution: Transition to a pangenomic reference:
• Overcomes reference bias.
• Current study presents a draft human pangenome.
• Ultimate goal: Capture global genomic diversity with
a panel of 700 haplotypes from 350 individuals.
Credit: Darryl Leja, NHGRI, from http://www.ensembl.info/
© 2023 Stuart A. MacGowan, CC BY 4.0. Content used under license as indicated.
5. The manuscript
• Assembling 47 diverse human genomes
• Assembly assessment
• Regional assembly assessment
• Completeness and CNV
• Annotating 47 diverse genomes
• Constructing a draft pangenome
• Measuring pangenome variation
• Pangenomes represent complex loci
• Applications of the pangenome
• Pangenome-based short variant discovery
• A pangenome variant resource
• SV genotyping
• Improved tandem repeat representation
• Improved RNA sequencing mapping
• Improved chromatin immunoprecipitation and sequencing analysis
Credit: NHGRI and Massive Science
© 2023 Stuart A. MacGowan, CC BY 4.0. Content used under license from Liao et al., Nature, 2023.
6. Assembling 47 diverse
human genomes
• 29 HPRC samples and 18 samples from other
efforts.
• Multimodal sequencing for all samples
• PacBio HiFi, ONT long-read, Bionano optical maps and Hi-C
Illumina short-read sequencing.
• On average, 39.7x HiFi sequence depth of
coverage
• N50 value averaged at 19.6 kb for the HiFi reads.
• Core assembler: Trio-Hifiasm, which uses PacBio
HiFi long reads and parental Illumina short reads to
produce phased contig assemblies.
Supplementary Figure 1. Trio-Hifiasm assembly pipeline.
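The read N50 quoted above (19.6 kb) is a length-weighted median: half of all sequenced bases sit in reads at least that long. A minimal sketch of the calculation follows; the function name and the toy read lengths are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy read lengths (hypothetical, for illustration only).
toy_read_lengths = [25_000, 21_000, 19_000, 15_000, 12_000, 8_000]
print(n50(toy_read_lengths))
```

The NG50 used for assemblies on the next slide follows the same idea, except that the halfway point is taken against an expected genome size rather than the sum of the assembled lengths.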
7. Assembly assessment
• Fixed misassemblies:
• Three large duplication errors
• One large phasing error
• 217 putative interchromosomal joins
• Haploid assemblies containing an X chromosome average 3.04 Gb
(99.3% of CHM13); those containing a Y chromosome average 2.93 Gb.
• Average NG50 value was 40 Mb (cf. 56 Mb for GRCh38)
• Yak k-mer QV of 53.57 ≈ 1 error per 227,509 bases.
• QVs for two samples confirmed against Genome in a Bottle
• 32% of indel errors were in homopolymers >5 bp
• 48% were in tandem repeats and low-complexity regions
• Average haplotype switch error rate of 0.67%
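The link between the k-mer QV and "1 error per N bases" is the standard Phred relation, QV = -10 · log10(p), where p is the per-base error probability. The sketch below checks the slide's figure; the function names are illustrative and not part of Yak.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability back to a Phred-scaled QV."""
    return -10 * math.log10(error_rate)


rate = qv_to_error_rate(53.57)
bases_per_error = 1 / rate
# Prints roughly 227,500, in line with the figure quoted on the slide.
print(f"about 1 error per {bases_per_error:,.0f} bases")
```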
Fig 1c. Total assembled sequence per assembly.
Fig 1d. Assembly contiguity NGx plot.
8. Ensembl GENCODE annotation
• Reference Gene Set (GENCODE)
• Identify Gene Clusters (100 kb windows)
• Define Anchor Points (3 per cluster)
• Map Anchors to Target Genome (minimap2)
• Determine High-Confidence Regions
• Align GRCh38 Region with Target Region (MAFFT)
• Reconstruct Transcripts
• Check Mapping Quality and Resolve Inconsistencies
• Look for Recent Duplications
9. Ensembl GENCODE annotation statistics
• Transcriptome mapping
• Median of 99.07% of protein-coding genes and 99.42% of protein-coding transcripts
• Median of 98.16% of noncoding genes and 98.96% of noncoding transcripts
• Median of 25 nonsense and 72 frameshifts per assembly
• Within expected range of loss-of-function mutations
• Over 80% supported by independent Illumina variant
callsets
• Suggests an upper bound of 18 transcript-altering errors per
transcriptome, or 1 per 1.7 million assembled transcriptome bases.
• Identified 1,115 protein-coding gene families with
copy number gain in at least one genome
Fig 2a. Percentages of coding and noncoding genes and transcripts
annotated from the reference set in each assembly.
10. Building A Draft
Human Pangenome
• Pangenomes can be represented as sequence
graphs: nodes are DNA segments, each usable in forward
or reverse orientation, and edges join two segments in
a given combination of orientations.
• Haplotype sequences are walks in the
graph and are implicitly aligned.
• Graph Construction:
• Minigraph: Reference-based,
gradually adds assemblies.
• Minigraph-Cactus (MC): Enhances
Minigraph with further alignments.
• PanGenome Graph Builder (PGGB):
All-to-all assembly alignments.
• Includes GRCh38 and T2T-CHM13
• Three samples held out for benchmarking.
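To make the graph model above concrete, here is a minimal sketch of a bidirected sequence graph in which haplotypes are stored as walks. The segment sequences, node IDs and helper names are toy assumptions for illustration; this is not the data model of Minigraph, Minigraph-Cactus or PGGB.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


# Toy graph: nodes 2 and 3 are alternative alleles of one SNP "bubble".
segments = {1: "GATT", 2: "A", 3: "C", 4: "CAGA"}

# Each edge joins two oriented nodes: (node_id, "+") is forward, "-" is reverse.
edges = {
    ((1, "+"), (2, "+")),
    ((1, "+"), (3, "+")),
    ((2, "+"), (4, "+")),
    ((3, "+"), (4, "+")),
}


def flip(step):
    node, orientation = step
    return (node, "-" if orientation == "+" else "+")


def has_edge(a, b):
    # An edge a->b can also be walked on the other strand as flip(b)->flip(a).
    return (a, b) in edges or (flip(b), flip(a)) in edges


def spell(walk):
    """Return the sequence spelled by a walk, checking every step is an edge."""
    for a, b in zip(walk, walk[1:]):
        if not has_edge(a, b):
            raise ValueError(f"no edge between {a} and {b}")
    return "".join(
        segments[node] if orientation == "+" else reverse_complement(segments[node])
        for node, orientation in walk
    )


hap_a = [(1, "+"), (2, "+"), (4, "+")]
hap_b = [(1, "+"), (3, "+"), (4, "+")]
print(spell(hap_a), spell(hap_b))
```

The two walks share nodes 1 and 4, which is what "implicitly aligned" means here: any segment visited by both haplotypes is aligned between them without a separate alignment step.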
Source: Li, H., Feng, X. & Chu, C. Genome Biol 21, 265 (2020).
Fig 3a. A pangenome variation graph. Source: Liao et al., Nature, 2023.
11. Exploring Human Genetic Variation with
Pangenome Reference Draft
• ~22 million small variants
• ~70k structural variants
• ~175-190 Mb of novel euchromatic
autosomal sequence
• High concordance of variants with
conventional genotyping
• > 97% HiFi reads align to the MC graph
• Annotated nearly 99.1% of protein-coding
transcripts per assembly
Fig 3g. Pangenome growth curves for PGGB. Depth measures how often a
segment is contained in any haplotype sequence. Core is present in ≥95% of
haplotypes, common is ≥5%.
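The caption's thresholds can be read as a simple classification rule on the fraction of haplotypes that contain a segment. The sketch below assumes that reading. The "rare" label for everything under 5% is an assumption added here, and the count of 94 haplotypes is used only for illustration (the number of haplotypes in the actual graph may differ).

```python
def classify_segment(n_with_segment, n_haplotypes, core=0.95, common=0.05):
    """Label a graph segment by the fraction of haplotypes that contain it."""
    fraction = n_with_segment / n_haplotypes
    if fraction >= core:
        return "core"
    if fraction >= common:
        return "common"
    return "rare"  # assumed label for segments below the 'common' threshold


# Illustrative haplotype count; treat it as a parameter, not a fact about the graph.
N_HAPLOTYPES = 94
for depth in (94, 90, 89, 5, 4, 1):
    print(depth, classify_segment(depth, N_HAPLOTYPES))
```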
12. Applications of the pangenome to
downstream analysis workflows
• Pangenome-based short variant
discovery
• A pangenome variant resource
• SV genotyping
• Tandem repeat representation
• RNA sequencing mapping
• Chromatin immunoprecipitation and
sequencing analysis
Credit: Darryl Leja, NHGRI.
© 2023 Stuart A. MacGowan, CC BY 4.0. Content used under license from Liao et al., Nature, 2023 and as indicated.
13. Improved Variant Calling Accuracy
with Pangenomic Approach
• Methods:
• Giraffe alignment to MC pangenome graph
• cf. alignments to GRCh38 and Dragen Graph.
• Pangenomic approach (Giraffe + DeepVariant)
outperformed others in calling small variants.
• Error comparison:
• 21,700 errors vs. 36,144 (GRCh38) and
26,852 (Dragen pipeline).
• Even better in complex medically relevant genes.
• Additive improvements with DeepTrio
Fig 6a. GIAB (v.4.2.1) HG005 benchmark
Fig 6b. CMRG (v.1.0) benchmark
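The error counts above can be turned into relative reductions with one line of arithmetic. The sketch below uses only the three counts quoted on this slide; the function and variable names are illustrative. Note that the result for this single comparison need not equal the 34% headline figure, which is presumably aggregated differently.

```python
def relative_reduction(baseline_errors, new_errors):
    """Fraction of baseline errors removed by the new pipeline."""
    return (baseline_errors - new_errors) / baseline_errors


pangenome_errors = 21_700  # Giraffe + DeepVariant, as quoted on the slide
baselines = {"GRCh38": 36_144, "Dragen pipeline": 26_852}

for name, errors in baselines.items():
    reduction = relative_reduction(errors, pangenome_errors)
    print(f"vs. {name}: {reduction:.1%} fewer errors")
```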
14. Pangenome Variant
Resource
• Applied the Giraffe + DeepVariant pipeline to high-coverage
short reads from the 1000 Genomes Project (1KG)
• Mendelian consistency across 100 trios
comparable to samples from GIAB
• On average 64,000 more variants per sample
compared to 1KG catalogue
• Improved performance in challenging regions
yields better allele frequencies at complex,
medically relevant loci
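Mendelian consistency asks whether a child's genotype can be formed from one allele of each parent. A minimal sketch of that check follows, assuming diploid autosomal sites with no missing calls; under this simple rule a true de novo mutation also counts as inconsistent. The function names and toy genotypes are illustrative, not the evaluation code used in the paper.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Genotypes are pairs of allele codes, e.g. (0, 1) for a heterozygote."""
    child_sorted = tuple(sorted(child))
    # Try every way of inheriting one maternal and one paternal allele.
    return any(
        tuple(sorted((m, f))) == child_sorted for m, f in product(mother, father)
    )


def consistency_rate(trio_genotypes):
    """Fraction of (child, mother, father) sites that are consistent."""
    trios = list(trio_genotypes)
    consistent = sum(is_mendelian_consistent(c, m, f) for c, m, f in trios)
    return consistent / len(trios)


# Toy sites for one trio (hypothetical genotypes).
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # inconsistent: mother has no allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((1, 0), (1, 1), (0, 1)),  # consistent, allele order does not matter
]
print(consistency_rate(sites))
```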
Link: Google Cloud Bucket
15. Accessing draft
pangenome
resources
• UCSC Genome Browser (http://hprc-browser.ucsc.edu)
• Ensembl Rapid Release Genome Browser
(https://rapid.ensembl.org)
• Ensembl HPRC project page
(https://projects.ensembl.org/hprc/)
• AnVIL_HPRC workspace
(https://anvilproject.org/)
• AWS Open Data Program in human-pangenomics
S3 bucket (https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html)
• Various BioProject, Zenodo and GitHub repos
16. FLG assemblies in
Jalview
• Extracted from protein FASTAs on the Ensembl HPRC
project page (https://projects.ensembl.org/hprc/)
• N.B. Ensembl IDs were mangled…
• Alignment properties:
• Sequences: 97
• Minimum Sequence Length: 2390
• Maximum Sequence Length: 4710
• Average Length: 3680
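Summary figures like the ones above can also be reproduced outside Jalview from a protein FASTA file. The sketch below is a small standard-library parser run on a toy in-memory FASTA; the record names and sequences are placeholders, not the FLG sequences.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_summary(records):
    """Count sequences and report minimum, maximum and mean length."""
    lengths = [len(seq) for _, seq in records]
    return {
        "sequences": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


# Toy FASTA content (placeholder sequences).
toy = io.StringIO(">hap1\nMSTLL\nENI\n>hap2\nMSTL\n>hap3\nMSTLLENIFA\n")
summary = length_summary(read_fasta(toy))
print(summary)
```

For a real file, replace the `io.StringIO` object with `open(path)` on the downloaded FASTA.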
17. Conclusion
• A Draft Human Pangenome: 94 diverse, high-quality de novo haplotype
assemblies.
• New Genetic Insights: Uncovered novel genetic variations and
mutational processes.
• Pangenomes are Powerful Tools: Enhanced mapping workflows and
error reduction.
• Future of SVs: Pangenome + long-reads = comprehensive SV
genotyping.
• Globalising genomics: Pangenomic workflows improve genotype
detection across diverse individuals and ancestries, and help mitigate
detection bias.
• Challenges Ahead: Assembly reliability, sequencing errors, and the need
for more diversity.
• Implications: Promises to improve understanding of genomics and the
ability to predict, diagnose, and treat disease. Sets new standards for
capturing variant diversity.
• Towards a Global Reference: Anticipated rapid pangenome
improvements and many applications.
Source: Nature Vol. 617 Issue 7960 (Image: Darryl Leja/NHGRI)
18. Sources
• Unless otherwise indicated, the figures used in this presentation are
sourced from the article:
• "A draft human pangenome reference" by Liao, WW., Asri, M., Ebler, J. et al.,
published in Nature, 2023.
• https://doi.org/10.1038/s41586-023-05896-x
• The figures are used in accordance with the Creative Commons Attribution
4.0 International License, which permits use, sharing, adaptation,
distribution and reproduction in any medium or format, as long as
appropriate credit is given to the original author(s) and the source.
• The Creative Commons license can be viewed here:
http://creativecommons.org/licenses/by/4.0/
• Figures were resized and cropped to fit the slide format.
19. License
• This presentation, including all original figures, is created by Stuart A.
MacGowan and is licensed under a Creative Commons Attribution 4.0
International License.
• You are free to share (copy and redistribute the material in any
medium or format) and adapt (remix, transform, and build upon the
material) for any purpose, even commercially, provided you give
appropriate credit, provide a link to the license, and indicate if
changes were made.
• For full details of the license, visit:
http://creativecommons.org/licenses/by/4.0/
• © 2023 Stuart A. MacGowan
Editor's Notes
Minigraph figure: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z
https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/vg/graph_to_grch38;tab=objects?pli=1&prefix=&forceOnObjectsSortingFiltering=false&pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))
E.g. detected gene conversion event in RHCE and variants within KCNE1 (previously inaccessible due to a false duplication in GRCh38)