Making One BIG Genome from Millions of Small Pieces

Genome Assembly:
the art of trying to make one
BIG thing from millions of
very small things
Keith Bradnam
@kbradnam
Image from Wellcome Trust
v1.1 June 2015

This was a talk given at UC Davis on 15th June 2015 as part of a Bioinformatics
Core teaching workshop.
Author: Keith Bradnam, Genome Center, UC Davis
This work is licensed under a Creative Commons Attribution-NonCommercial-
ShareAlike 4.0 International License.

flickr.com/incrediblehow/
Overview

1. What is genome assembly?
2. Why is it difficult?
3. Why is it important?
4. How do we know if an assembly is any good?

What is genome assembly?

A genome assembly is an attempt to accurately
represent an entire genome sequence from a
large set of very short DNA sequences.

Using a piece of bioinformatics software is just like running an experiment. Just
because you get an answer, it doesn't mean it will be the right answer. You should
always be prepared to tweak some parameters and re-run the experiment.

The ideal goal would be to end up with complete sequences for each chromosome at
each level of ploidy. E.g. diploid genomes would be assembled as two sets of
genome sequences.

'Large' is a relative term. We would expect that advances in sequencing technology
would mean that the number of sequences needed to assemble a genome is only
ever going to decrease.

'Short' is also a relative term. As technology improves, we expect to see our input
sequences get longer and longer until the steps of sequencing and assembly
essentially merge into one process.

It's a bit like trying to do the hardest
jigsaw puzzle you can imagine!

This is a jigsaw that I did for the benefit of your education! There are lots of
analogies that can be made between assembling genomes, and assembling jigsaws.

Sometimes we assemble regions of jigsaws that are locally accurate, but globally
misplaced (the top region circled in red). Sometimes we also assemble regions and
leave them to one side as we don't know where they should go. Many 'finished'
genome assemblies include sets of 'unanchored' sequences that are not positioned
on any chromosome.

Let's keep working on our jigsaw.

Repetitive regions are a big problem for genome assembly

The hardest parts of a jigsaw tend to be repetitive regions (skies, sea, forests etc.).
The same is true for genome assemblies.

Certain information can help pair together regions

Sometimes we can use information to pair together two different completed
sections of a jigsaw. In this case, we can use our understanding of what a bridge
looks like to give us an approximate spacing between the two completed sections at
the top of this puzzle. We do similar things with genome assemblies and also end up
inserting approximately sized gaps between regions of sequence.

Is this good enough?
For a jigsaw, we would never ever call this 'finished', but for a genome assembly this
would represent an almost perfect sequence! All of the main details are present, you
can identify what the picture is showing (San Francisco), the edges are detailed
enough that we can accurately calculate the size of the jigsaw, and the parts that are
missing are mostly minor details.

We often end up with some missing pieces

We often try to fit pieces in the wrong way

Jigsaws often end up with a few missing pieces meaning that it is impossible to
complete the puzzle. Genome assemblies also end up with missing pieces because
they were never in the input set of sequences to begin with. This is because not all
sequencing technologies capture all locations in a genome.

We never get to this point with genome assembly!

With the exception of bacterial genomes, we never reach this point with genome
assembly. All published eukaryotic genomes are incomplete and contain errors.
Maybe yeast (Saccharomyces cerevisiae) and worm (Caenorhabditis elegans) are
the best examples we have of near-complete reference genomes for eukaryotic
species.

Why is it difficult?

World's largest assembled genome
• Loblolly pine (Pinus taeda)
• 22 Gbp genome!
• ~80% repetitive
• 64x coverage
from tulsalandscape.com

This gargantuan effort featured the work of many people at UC Davis, led by the
efforts of David Neale's group.

What does 64x coverage mean?
Over 1.4 trillion bp of DNA were sequenced!

In other words, they had to sequence 64 times as much DNA as ended up in the final
output (22 Gbp x 64 is about 1.4 trillion bp). Imagine if baking a cake was like
this, and you had to use 64 times as many ingredients in order to make one cake.
Some genome assembly projects are done with >100x coverage.
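
The coverage arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and 'coverage' here means simple average depth (total sequenced bases divided by genome size).

```python
def total_bases_needed(genome_size_bp, coverage):
    """Total sequenced bases needed to reach a given average coverage."""
    return genome_size_bp * coverage


def coverage_from_reads(n_reads, read_length_bp, genome_size_bp):
    """Average coverage obtained from n_reads reads of a fixed length."""
    return n_reads * read_length_bp / genome_size_bp


# Figures from the slide: a 22 Gbp genome sequenced to 64x coverage.
loblolly_genome_bp = 22_000_000_000
bases = total_bases_needed(loblolly_genome_bp, 64)
print(f"{bases:.3e} bp sequenced")
```

The second function runs the calculation the other way round, which is handy when planning how many reads to order for a target coverage.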

Biological challenges for genome assembly

Problem | Description
Repeats | Many plant and animal genomes mostly consist of repetitive sequences, some of which are longer than length of sequencing reads.
Ploidy | For many species, you have at least two copies of the genome present. Level of heterozygosity is important.
Lack of reference genome | Reference-assisted assembly is a much easier problem than de novo assembly. Even having genome from a closely related species can help.

Ploidy is often a much bigger problem for plant genomes. E.g. some wheat species
are hexaploid. Genome assembly is sometimes performed on a genome for which
we already have a reference (e.g. if you sequenced your own genome, you could
align it to the human reference sequence). Otherwise, we are talking about de novo
assembly which is much, much harder.

from amazon.com
Returning to the jigsaw analogy…every jigsaw puzzle comes with a picture of the
puzzle on the box. This is a luxury not always available to genome assemblers.

When we are doing de novo assembly, it is a bit like doing a jigsaw without knowing
what it will look like.

Even with de novo assembly, we may have a distant relative with a known genome
sequence that can help with the assembly. A bit like assembling a jigsaw using a
blurred picture as a guide.

Jigsaws tell you how many pieces are in the puzzle (and what the dimensions of the
puzzle will be). We don't always know this for genome assembly. There are
measures for determining how big a genome might be, but these methods can
sometimes be misleading.

[Figure: C-value (pg) estimates plotted by year. Y-axis: 2.0 to 4.0 pg. X-axis labels: ?, 1949, 1959, 1971, 1972, 1980, 1981, 1983, 1985, 1990, 1994, 1998. Data from genomesize.com]

These are experimental estimates of the mouse genome size (taken from the animal
genome size database). There is a lot of variation! Many organisms only have one
experimental estimate of how big their genome is.
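
To see what this variation means in base pairs, here is a small sketch that converts a C-value in picograms to a genome size. It assumes the commonly used conversion of 1 pg of DNA to roughly 978 Mbp; that factor is not from the slides. The three input values are just the ends and middle of the y-axis in the plot, not individual data points.

```python
# Assumed conversion factor: 1 pg of DNA is roughly 978 million base pairs.
BP_PER_PG = 978_000_000


def c_value_to_bp(c_value_pg):
    """Convert a haploid DNA content (C-value, in pg) to base pairs."""
    return c_value_pg * BP_PER_PG


# Illustrative values spanning the y-axis of the plot (2.0 to 4.0 pg).
for pg in (2.0, 3.0, 4.0):
    print(f"{pg:.1f} pg is about {c_value_to_bp(pg) / 1e9:.2f} Gbp")
```

A spread of 2 pg between estimates corresponds to a difference of almost 2 Gbp in expected genome size, which matters a lot when you later ask whether an assembly is the 'right' size.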

Other challenges for genome assembly

Problem | Description
Cost | In 2014 Illumina claimed the $1,000 genome barrier had been broken (if you first spend ~$10 million on hardware).
Library prep | A critical, and often overlooked, step in the process.
Sequence diversity | Illumina, 454, Ion Torrent, PacBio, Oxford Nanopore: which mix of sequence data will you be using?
Hardware | Some genome assemblers have very high CPU/RAM requirements. Might need specialized cluster.
Expertise | Not always easy to even get assembly software installed, let alone understand how to run it properly.
Software | There is a lot of choice out there.

The PRICE genome
assembler has 52
command-line options!!!

This is probably not the most complex, nor the most simple, genome assembler that
is out there. But how much time do you have to explore some of those 52
parameters that could affect the resulting genome assembly?

You may need more than one tool
via Shaun Jackman

Modern genome assembly pipelines don't always rely on a single tool. This pipeline
consists of many different programs.

There are over 125 different tools available to
help assemble a genome!

Not all of these are comprehensive genome assemblers, some are tools to help with
specific aspects of the assembly process, or to help evaluate genome assemblies etc.
Still, this represents a bewildering amount of choice.

bambus2
Ray
Celera
MIRA
ALLPATHS-LG
SGA
Curtain
Metassembler
Phusion
ABySS
Amos
Arapan
CLC
Cortex
DNAnexus
DNA Dragon
Edena
Forge
Geneious
IDBA
Newbler
PRICE
PADENA
PASHA
Phrap
TIGR
Sequencher
SeqMan NGen
SHARCGS
SOPRA
SSAKE
SPAdes
Taipan
VCAKE
Velvet
Arachne
PCAP
GAM
Monument
Atlas
ABBA
Anchor
ATAC
Contrail
DecGPU
GenoMiner
Lasergene
PE-Assembler
Pipeline Pilot
QSRA
SeqPrep
SHORTY
fermi
Telescoper
Quast
SCARPA
Hapsembler
HapCompass
HaploMerger
SWiPS
GigAssembler
MSR-CA
MaSuRCA
GARM
Cerulean
TIGRA
ngsShoRT
PERGA
SOAPdenovo
REAPR
FRCBam
EULER-SR
SSPACE
Opera
mip
gapfiller
image
PBJelly
HGAP
FALCON
Dazzler
GGAKE
A5
CABOG
SHRAP
SR-ASM
SuccinctAssembly
SUTTA
Ragout
Tedna
Trinity
SWAP-Assembler
SILP3
AutoAssemblyD
KGBAssembler
MetAMOS
iMetAMOS
MetaVelvet-SL
KmerGenie
Nesoni
Pilon
Platanus
CGAL
GAGM
Enly
BESST
Khmer
GRIT
IDBA-MTP
dipSPAdes
WhatsHap
SHEAR
ELOPER
OMACC
Omega
GABenchToB
HiPGA
SAGE
HyDA-Vista
MHAP
Mapsembler 2
GAML
SAT-Assembler
RAMPART
VICUNA
CloudBrush
Which tool will you use?

This slide was made in 2014, and so is already out of date!

These six assembly tools were published in one month in 2014!

Before you assemble…
• You should remove adapter contamination
• You should remove sequence contamination
• You should trim sequences for low quality regions

After we have generated the raw sequence data, we still must run a few basic steps
to clean up our data prior to assembly. How straightforward are these steps?
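
To give a feel for what quality trimming involves, here is a deliberately naive sketch. It assumes Phred+33 encoded quality strings and simply chops low-quality bases off the 3' end of a read. The function name and the threshold of 20 are choices made for this example; real trimming tools use more careful (e.g. windowed) approaches.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred quality is below min_q.

    seq and qual are the sequence and quality lines of one FASTQ record;
    offset=33 assumes Phred+33 encoding.
    """
    end = len(seq)
    # Walk back from the 3' end until we hit a base of acceptable quality.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```

Even in this toy version there are already two parameters to choose (the quality threshold and the encoding offset), which hints at why the real tools end up with so many options.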

Tools for removing adapter
contamination from sequences
There are at least 34 different tools!
One of these tools has 27 different
command-line options

Even the first step of removing adapter contamination is something for which you
could spend a lot of time researching different software choices.
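
As a rough illustration of the problem these tools solve, the sketch below removes a 3' adapter using exact string matching only. It is not how any particular tool works: real adapter trimmers tolerate mismatches and indels, handle paired reads, and so on. The function name, the minimum overlap of 3 bases, and the adapter sequence used in the example are all assumptions made for illustration.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only."""
    # Case 1: the whole adapter occurs somewhere in the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends part-way through the adapter, so look for the
    # longest adapter prefix that matches the end of the read.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


adapter = "AGATCGGAAGAGC"  # example adapter sequence, assumed for this sketch
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTT", adapter))  # full adapter present
print(trim_adapter("ACGTACGTAGATC", adapter))            # partial adapter at end
```

The min_overlap setting is a trade-off: set it too low and you trim genuine sequence that happens to look like the start of the adapter, set it too high and short adapter remnants are left behind.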

Why is it important?

Saccharomyces cerevisiae
• 12 Mbp genome
• Published in 1997
• First eukaryotic genome sequence

Not the first published genome — there were several bacterial genomes sequenced
in the preceding couple of years — but this was the first eukaryotic genome
sequence. Furthermore, this genome sequence has undergone continual
improvements and corrections since publication (the last set of changes were in
2011).

Caenorhabditis elegans
• ~100 Mbp genome
• First animal genome sequence

Arabidopsis thaliana
• First plant genome sequence
• Size?
• 2000 = 125 Mbp
• 2007 = 157 Mbp
• 2012 = 135 Mbp

As alluded to earlier, we don't always know for sure how big (or small) a genome is.
The Arabidopsis genome size has been corrected upwards and downwards since
publication. The amount of sequenced information as of today is about 119 Mbp.
And this is for the best understood plant genome that we know about!

Homo sapiens
• ~3 Gbp genome
• Finished?
• 'working draft' announced in 2000
• 'working draft' published in 2001
• completion announced in 2003
• complete sequence published in 2004

The human genome has also undergone improvements since the (many)
announcements regarding its completion (or near completion). There are only a
small number of species for which there is a dedicated group of people who seek to
continually improve the genome sequence and get closer to 'the truth'.

There are lots of ongoing genome sequencing projects
• The 100,000 genomes project
• i5k Insect and other Arthropod Genome Sequencing Initiative

Bigger numbers must be better, right? Some projects sequence genomes to align
back to a reference to look for the differences, others seek to characterize genomes
for which we have very little genomic information. The 100,000 genomes project in
England heralds the start of the mass sequencing of patients to understand disease.

We no longer have one
genome per species
• We have genome sequences representing different
strains and varieties of a species
• We have genome sequences from multiple
individuals of a species
• We have multiple genomes from different tissues of
the same individual (e.g. cancer genomes)

Also, in the near future we can imagine having your genome sequenced at birth
(from different tissues) and getting 'genome health checks' throughout your life.

There is no point sequencing so many genomes
if we can't accurately assemble them!

Sequencing genomes is relatively easy. Putting that information together in a
meaningful way so as to make it useful to others…that's not so easy.

Bad genome assemblies #1
Length of 10 shortest sequences:
100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp!
The average
vertebrate gene is
about 25,000 bp

Everyone wants long sequences in a genome assembly. This may not always matter,
but in most cases they should hopefully be long enough to contain at least one gene.
These data are from a vertebrate genome sequence that someone asked me to look
at. Over half of the genome assembly was represented by sequences less than 150
bp! This is not much use to anyone.

Ns = 91% !!!
Genome sequences
usually contain
unknown bases (Ns)

From another assembly that I was asked to look at. Even the 9% of the genome
which wasn't an 'N' was split into tiny little fragments. Completely unusable
information.

Has anyone compared different assemblers to
work out which is the best?

Assemblathons

A genome assembly competition
This was a genome assembly assessment exercise that I was involved with.

@assemblathon
It spawned a sequel.

Coming soon to a cinema near you!

Work is currently underway to organize a third Assemblathon effort.

Assemblathon 2
Published in
Gigascience, 2013

Bird
Snake
Fish
Three species were used in Assemblathon 2. A budgie, a Lake Malawi cichlid fish,
and a boa constrictor snake.

Assemble this!

Species | Estimated genome size | Illumina | Roche 454 | PacBio
Bird | 1.2 Gbp | 285x (14 libraries) | 16x (3 libraries) | 10x (2 libraries)
Fish | 1.0 Gbp | 192x (8 libraries) | - | -
Snake | 1.6 Gbp | 125x (4 libraries) | - | -

Lots of sequence data were provided, especially for the bird.

3 species
21 teams
43 assemblies
52 Gbp of sequence!

Goals
• Assess 'quality' of genome assemblies
• Identify the best assemblers
• First need to define quality!

Who makes the best pizza in Davis?

An easy question to ask, but maybe not as straightforward as it seems…

Freshest?
Cheapest?
Biggest?
Gluten free?
Healthiest?
Choice of toppings?
Free sodas?
Delivery time?
Tastiest?

'Best' is subjective. If you are intolerant to gluten, then the best pizza place will be
the one that makes gluten-free pizzas.

Even if you focus on who makes the best 'tasting' pizzas, this is still very subjective.

Image from flickr.com/dullhunk/
Who makes the best genome assembler?

Who makes the best genome assembly?
But surely this is not such a subjective topic when it comes to genome assembly?

Longest contigs?
Fewest errors?
Lowest CPU demands?
Best deals with repeats?
Produces most genes?
Fastest?
Best resolves heterozygosity?
Easiest to install?
Longest scaffolds?

Who makes the best genome assembly?
Longest contigs?
Fewest errors?
Contains most genes?
Fastest?
Easiest to install?
Longest scaffolds?
It is less subjective, but there are still many different ways we can think of when
trying to determine what makes a good genome assembly.

Longest contigs?
Fewest errors?
Fastest?
Easiest to install?
Longest scaffolds?

The best assembler in the world may be no use to anyone if people can't get it
installed and understand how it should be run.

Metrics

Metric | Notes
Assembly size | How does it compare to expected size?
Number of sequences | How fragmented is your assembly?
N50 length (contigs & scaffolds) | Making contigs and making scaffolds are two different skills.
NG50 scaffold length | Becoming more common to see this used.
Coverage | How much of some reference sequence is present in your assembly?
Errors | Errors in alignment of assembly to reference sequence or to input read data.
Number of genes | From comparison to reference transcriptome and/or set of known genes

This is a very brief summary that lists just some of the ways in which you could
describe your genome assembly.

[Figure: assembly size (bp; y-axis from 0 to 2,000,000,000) for the Assemblathon 2 bird genome assemblies, labelled A to M]

In Assemblathon 2, one assembly of the bird genome (a parrot) was very, very
small. Conversely, one assembly was almost twice the size of the estimated genome
(~1.2 Gbp). Bigger is not always better when it comes to assembly size.

Using core genes
• All genomes perform some core functions
(transcription, replication, translation etc.)
• Proteins involved tend to be highly conserved
• They should be present in every genome

CEGMA
This was an approach developed by our lab, originally to find a handful of genes in a
newly sequenced genome which could be used to train a species-specific gene
finder. We then adapted the technique to assess the gene space of a draft genome.

What is CEGMA?
• CEGMA (Core Eukaryotic Gene Mapping Approach)
• defines a set of 248 'Core Eukaryotic Genes' (CEGs)
• CEGs identified from genomes of: S. cerevisiae, S. pombe,
A. thaliana, C. elegans, D. melanogaster, and H. sapiens
• How many full-length CEGs are present in an assembly?

We expect these 248 genes to be present in all eukaryotes. CEGMA uses a
combination of software tools to find these genes. The number of core genes
present is assumed to reflect the proportion of all genes that are present in the
assembly. Sometimes genes are split across contigs or scaffolds; CEGMA can find
some of these and reports them as partial matches.

Here are N50 scaffold lengths and number of core genes present in a variety of
genomes that I have looked at. There is a lot of variation. Some assemblies might
give you longer sequences (higher N50 values), but this is no guarantee that those
assemblies will contain more gene sequences. Likewise, assemblies with more gene
sequences may not necessarily have longer sequences.

Should you use CEGMA?
• CEGMA is not easy to install
• It is old and somewhat out of date
• You could use other transcript/protein data sets
instead of CEGMA

The principle of CEGMA could be used with a variety of different data. Maybe there
are a small number of full-length mRNAs available for your species of interest. If
you have multiple genome assemblies, you could simply see how they differ with
respect to the presence of those genes.

BUSCO is a recently developed tool that works along similar lines to CEGMA.

Other tools for evaluating assemblies
FRCbam (2012) REAPR (2013) kPAL (2014)

Just as it seems increasingly popular to develop new genome assemblers, there is a
growing demand (and supply) for tools to evaluate genome assemblies. Here are
three recent ones.

102 metrics per assembly
10 key metrics
1 final ranking

Starting from 102 metrics per assembly, the entries were ultimately judged on 10
'key' metrics, that largely captured different aspects of an assembly's 'quality'. The
results from these 10 were combined into a single overall ranking (for each species).

And the winner is…
• No winner!
• Some assemblers seemed to work well for one
species, but not for other species
• Some assemblies were good, as measured by one
metric, but not when measured by others

This result was disappointing to many who were hoping that we would provide a
resounding endorsement for assembler 'X'.

Assembly | Number of core genes | Rank | Z-score
CRACS | 438 | 1 | +0.68
SYMB | 436 | 2 | +0.59
PHUS | 435 | 3 | +0.54
BCM | 434 | 4 | +0.49
SGA | 433 | 5 | +0.44
MERAC | 430 | 6 | +0.30
ABYSS | 429 | 7 | +0.25
SOAP | 428 | 8 | +0.21
RAY | 422 | 9 | -0.08
GAM | 415 | 10 | -0.41
CURT | 360 | 11 | -3.02

Here are the CEGMA results. As well as ranking the assemblies by each metric, we
calculated a Z-score for each metric (how many standard deviations each assembly
was from the average) and summed the Z-scores to generate the final rankings.
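
The Z-score calculation for a single metric can be sketched as below, using the core gene counts from the table. This is a reconstruction rather than the code used for Assemblathon 2. It assumes the population standard deviation (statistics.pstdev), which appears to reproduce the Z-scores shown on the slide.

```python
from statistics import mean, pstdev

# Number of core genes found in each assembly (values from the slide).
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}


def z_scores(values):
    """Return {name: number of standard deviations from the mean}."""
    scores = list(values.values())
    mu = mean(scores)
    sigma = pstdev(scores)  # population SD: an assumption of this sketch
    return {name: (score - mu) / sigma for name, score in values.items()}


for name, z in z_scores(core_genes).items():
    print(f"{name:6s} {z:+.2f}")
```

To build an overall ranking you would repeat this for each metric (flipping the sign for metrics where smaller is better) and sum the Z-scores for each assembly. Note how one poor outlier drags the mean down, so that most of the other assemblies end up with small positive scores.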

The SGA team initially produced what looked like a clear winner for the snake
competition. Error bars show the maximum and minimum Z-scores that would be
produced if any combination of 9 of the 10 metrics were used.

Assemblathon 2 metric | Rank of snake SGA assembly
NG50 scaffold length | 2
NG50 contig length | 5
Amount of assembly in 'gene-sized' scaffolds | 7
Number of 'core genes' present | 5
Fosmid coverage | 2
Fosmid validity | 2
Short-range scaffold accuracy | 3
Optical map: level 1 | 2
Optical map: levels 1–3 | 1
REAPR summary score | 2

Even though the SGA entry ranked 1st overall, it only ranked 1st in one individual
metric. So it was a good assembler, on average.

The long and short of it

Technology | Date | Typical read lengths
Sanger | ~1970–2000 | 750–1,000 bp
Solexa/Illumina | ~2005 | ~25 bp
Illumina | ~2014 | ~150–250 bp
Pacific Biosciences | ~2014 | 10–15 Kbp
Oxford Nanopore | ~2014 | 5–??? Kbp
Revolocity | 2015 | 28 bp

Different technologies produce reads with very different length distributions, and
these technologies also increase the length of reads over time. Perhaps more
importantly, different technologies have different error profiles (where errors occur
in reads and types of error).

N50 length
The most widely used statistic for genome assemblies
First described in human genome paper (2001)

N50 is the length of the sequence that takes the running total of sequence lengths
to, or past, 50% of the total assembly size (when summing lengths from longest to
shortest).
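
The definition is easier to follow as code. Here is a minimal sketch (the function name and the toy lengths are made up for this example), which treats reaching exactly 50% as enough:

```python
def n50(lengths):
    """Length of the sequence at which the running total of lengths
    (longest first) reaches at least half of the total assembly size."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Toy assembly: total size 32, so we need a running total of at least 16.
contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(contig_lengths))
```

Notice that throwing away the shortest sequences shrinks the total and can raise N50 without the assembly getting any better, which is one reason not to obsess over this number.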

NG50 length
Use NG50 when making comparisons between
genome assemblies because N50 can be biased
Be warned…some people obsess over N50!

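In the Assemblathon contests, we used a new measure (NG50) which enables a fairer
comparison between different assemblies (of the same genome). It is calculated like
N50, but against the estimated genome size rather than the size of each assembly.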
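
A minimal sketch of NG50 follows; the function name and the toy numbers are made up for this example. The only change from N50 is that the 50% target comes from the estimated genome size, so every assembly of the same genome is measured against the same target.

```python
def ng50(lengths, genome_size):
    """Like N50, but the 50% target is half of the estimated genome size
    rather than half of the assembly size."""
    target = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None  # the assembly covers less than half of the genome


# Toy assembly of total size 32, for a genome estimated at 40.
contig_lengths = [8, 8, 4, 3, 3, 2, 2, 2]
print(ng50(contig_lengths, genome_size=40))
```

Returning None when the assembly is smaller than half of the estimated genome size is a choice made in this sketch; the point is that such an assembly cannot be given an NG50 at all.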

What you can do

My #1 piece
of advice
flickr.com/julia_manzerova

flickr.com/thomashawk
Look at your data!

Before you do anything with your assembly, look at it closely. Look at the
distribution of lengths (not just N50). Look at the %N. Are the sequences long
enough to contain genes? Are the shortest sequences just unassembled reads?
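
A first look at an assembly does not need any special software. The sketch below reads a FASTA file and reports the number of sequences, the ten shortest lengths, the %N, and how much of the assembly sits in very short sequences. The function names, the tiny inline example and the 150 bp cut-off (borrowed from the bad assembly described earlier) are all choices made for this illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(handle, short_bp=150):
    """Basic 'look at your data' statistics for an assembly."""
    lengths, n_count = [], 0
    for _, seq in read_fasta(handle):
        lengths.append(len(seq))
        n_count += seq.upper().count("N")
    total = sum(lengths)
    short_total = sum(length for length in lengths if length < short_bp)
    return {
        "sequences": len(lengths),
        "total_bp": total,
        "ten_shortest": sorted(lengths)[:10],
        "percent_n": 100 * n_count / total if total else 0.0,
        "percent_in_short_seqs": 100 * short_total / total if total else 0.0,
    }


# Tiny made-up example; for a real file use: with open(path) as handle: ...
example = StringIO(">seq1\nACGTNN\nNN\n>seq2\nACG\n")
print(summarize(example))
```

If the ten shortest sequences are only a few bp long, or the %N is very high, you have learned something important about the assembly before running any other analysis.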

The future of genome assembly

As new sequencing technologies mature, the associated tools also get developed.
People have recently published a de novo (bacterial) genome assembly using data
from the very new Oxford Nanopore MinION platform.

Other companies are developing promising long-range technologies which will be a
great resource for genome assemblers.

There are more companies out there, waiting to make their big entrance on the
world stage of genome sequencing and assembly.

Many of these companies promise the same key features.

Summary

In conclusion…
• Genome assembly is not a solved problem
• If possible, try different genome assemblers
• Don't rely on one metric to assess quality
• Different metrics assess different aspects of quality
• Look at your genome assembly!

http://acgt.me
@assemblathon
I frequently blog about some of the issues raised in this talk. I also use the
@assemblathon twitter account to publish links to lots of papers and other
resources that are related to this field.

Lex Nederbragt
@lexnederbragt
flxlexblog.wordpress.com
Nick Loman
@pathogenomenick
pathogenomic.bham.ac.uk/blog
Mick Watson
@BioMickWatson
biomickwatson.wordpress.com
Keith Robison
@OmicsOmicsBlog
omicsomics.blogspot.com

These people have a lot of useful things to say about genome sequencing and
assembly. Their blogs and twitter feeds are useful resources.
