2. Processing Reads
• As we’ve covered before, if we already have a
reference assembly, we can process reads by
aligning to the reference genome
3. The Sequencing Abstraction
It was the best of times, it was the worst of times…
• Sequencing performs a Poisson-distributed
sampling of substrings from a larger string
• Reads are exact substrings (i.e., error free)
[Figure: reads sampled from the sentence, e.g. “It was the”,
“the best of”, “best of times”, “times, it was”, “was the worst”,
“the worst of”, “worst of times”]
Metaphor borrowed from Michael Schatz
4. The Alignment Abstraction
It was the best of times, it was the worst of times…
[Figure: the sampled reads aligned back to their positions in the
reference sentence]
5. But!
• What do we do if we don’t have a reference
genome to map against?
• Can we use information in the reads to assemble
the reads together into a string?
6. Sequence Assembly
[Figure: the overlapping reads layered to reconstruct the full
sentence]
It was the best of times, it was the worst of times…
7. The Assembly Problem
• Given a set of reads, we want to assemble the
“best” contigs possible
• Contig = contiguous sequence
• Two general formulations for assembly:
• Overlap-layout-consensus (OLC)
• de Bruijn graph (DBG)
9. Assembly is Graph Traversal
• In OLC, we create an overlap graph, and find a
Hamiltonian path
• In DBG, we create a de Bruijn graph, and find an
Eulerian path
10. Overlap Graphs
• Given a set of reads, an overlap graph represents how
these reads overlap
Nodes are reads, edges are overlaps.
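A toy version of this graph can be sketched in a few lines of Python, assuming exact suffix-prefix overlaps (real overlappers allow mismatches); the `overlap` and `overlap_graph` helpers are illustrative names, not from any real assembler:

```python
def overlap(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_overlap=3):
    """Nodes are reads; a directed edge (a, b) records the overlap length."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                olen = overlap(a, b, min_overlap)
                if olen:
                    edges[(a, b)] = olen
    return edges

reads = ["It was the", "was the wor", "the worst of"]
print(overlap_graph(reads))
```

The quadratic all-pairs loop is exactly the cost problem discussed on the next slides.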
11. Example Overlap Graph
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
the best of
It was the
times, it was
worst of times
the worst of
best of times
was the worst
12. Hamiltonian Path
• A Hamiltonian Path is a path which visits each node
in the graph exactly once
13. Computing Overlaps
• To compute overlaps between two reads, we
compute the pairwise alignment of these two reads
• This can be done using dynamic programming
(Smith-Waterman) or a profile HMM
• We can accelerate this with indexing-based
methods, similar to those in SNAP
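As a sketch of the dynamic-programming idea, here is a minimal Smith-Waterman local alignment score in Python (match +1, mismatch and gap -1; real overlappers use banded or index-accelerated variants):

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # local alignment: never let a cell go below zero
            score[i][j] = max(0, diag, score[i-1][j] + gap, score[i][j-1] + gap)
            best = max(best, score[i][j])
    return best

print(smith_waterman("ACACTG", "CACT"))  # CACT occurs exactly in ACACTG -> 4
```

Filling the full score matrix is what makes a single overlap O(l²) in the read length.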
14. Two Problems
1. Overlapping is expensive:
• Must compute O(n²) overlaps, n = # reads
• Computing an overlap is O(l²), l = read length
2. Hamiltonian Path is NP-hard:
• Approximate solvers exist, but don’t scale up
to genomics datasets
15. de Bruijn Graphs
• In a de Bruijn graph, nodes are k-mers, and edges
represent observed transitions between k-mers
• k-mers are k-length substrings from reads
[Figure: 3-mers extracted from the read ACACTGCACT, e.g. ACA,
CAC, ACT, CTG, linked by observed transitions]
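A minimal sketch of k-mer extraction and de Bruijn graph construction, assuming the graph is stored as an adjacency list of k-mer strings (function names are illustrative):

```python
from collections import defaultdict

def kmers(read, k):
    """All k-length substrings of a read, in order."""
    return [read[i:i+k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Nodes are k-mers; each observed transition adds one edge."""
    graph = defaultdict(list)
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            graph[a].append(b)  # keep multiplicity: repeats add parallel edges
    return graph

g = de_bruijn(["ACACTGCACT"], 3)
print(dict(g))  # CAC -> ['ACT', 'ACT']: the repeat shows up as a doubled edge
```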
16. de Bruijn Graphs
• In a de Bruijn graph, we may have multiple paths
between two nodes
[Figure: the 3-mer de Bruijn graph of ACACTGCACT (ACA, CAC,
ACT, CTG, TGC, GCA); two distinct paths connect the same pair
of nodes]
17. Eulerian Path
• In an Eulerian path, we use every edge exactly once
• Preconditions for finding an Eulerian path on a DBG:
1. One node must have one more edge leaving than
entering
2. One node must have one more edge entering than
leaving
3. All other nodes must have equal numbers of edges
entering and leaving
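These preconditions can be checked directly from the in- and out-degrees; a minimal sketch, assuming the graph is a dict of adjacency lists:

```python
from collections import Counter

def euler_endpoints(graph):
    """Return (start, end) for an Eulerian path, or None if the
    degree conditions are violated. A fully balanced graph
    (Eulerian cycle) returns (None, None)."""
    out_deg = Counter({n: len(nbrs) for n, nbrs in graph.items()})
    in_deg = Counter()
    for nbrs in graph.values():
        in_deg.update(nbrs)
    start = end = None
    for n in set(out_deg) | set(in_deg):
        diff = out_deg[n] - in_deg[n]
        if diff == 1 and start is None:
            start = n      # one more edge leaving than entering
        elif diff == -1 and end is None:
            end = n        # one more edge entering than leaving
        elif diff != 0:
            return None    # too unbalanced: no Eulerian path exists
    return start, end

print(euler_endpoints({"ACA": ["CAC"], "CAC": ["ACT"], "ACT": ["CTG"]}))
# -> ('ACA', 'CTG')
```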
18. Finding an Eulerian Path
• Connect the two unbalanced nodes with an extra edge
• This turns the graph into one with an Eulerian cycle
• From an arbitrary node n, walk the graph until we return to
n, and save the path we’ve walked
• Until all edges have been used:
• Pick a point n’ from our path, where n’ has unused
edges
• Walk from n’ until we return to n’, and splice the new
cycle into our path
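A compact way to implement this walk (Hierholzer's algorithm) uses an explicit stack instead of splicing sub-cycles by hand; this sketch assumes the graph, given as a dict of adjacency lists, actually admits an Eulerian path from `start`:

```python
def eulerian_path(graph, start):
    """Stack-based Hierholzer walk: consume every edge exactly once."""
    # copy adjacency lists so we can consume edges as we walk
    adj = {n: list(nbrs) for n, nbrs in graph.items()}
    stack, path = [start], []
    while stack:
        n = stack[-1]
        if adj.get(n):
            stack.append(adj[n].pop())  # walk an unused edge
        else:
            path.append(stack.pop())    # dead end: backtrack into the path
    return path[::-1]

print(eulerian_path({"ACA": ["CAC"], "CAC": ["ACT"], "ACT": ["CTG"]}, "ACA"))
# -> ['ACA', 'CAC', 'ACT', 'CTG']
```

The backtracking stack performs the same sub-cycle splicing as the description above, just implicitly.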
19. Problems with Eulerian Path
• For a given graph, we may have multiple valid paths!
[Figure: two different Eulerian paths over the same de Bruijn
graph yield two different assemblies, ACACTGCACAATGC and
ACAATGCACACTGC]
21. How Do We Assemble
Multiple Reads?
• In practice, de Bruijn graphs are additive
• This allows us to merge graphs from multiple reads
• When do we keep/remove edges?
23. Errors!
• One of the key assumptions that we make in the
sequencing process is that reads are correct
• But, in reality, reads have a 2% error rate
• How does this impact us?
24. What Are The Errors Like?
ACATATAGAA
AGATATAGAN
• Currently, the most common sequencing technology
is called Illumina
• Errors tend to be a misread of a single base
• Errors tend to be clustered at the ends of reads
28. Help‽ What Can We Do?
• For some errors, we can inspect the de Bruijn
graph directly, and eliminate edges from the graph
• More generally, we can look at the distribution of
k-mers, and try to make corrections to the reads
29. Trimming Spurs
• Since errors are at the ends of reads, we see spurious branches
off of the graph
• Use heuristics to determine whether we can remove these nodes
• E.g., if these nodes are only present in 1 read, they are
probably safe to remove
30. The k-mer Spectrum
• If we look at the frequencies of k-mers, we see
something interesting…
32. Those Are Our Errors!
• Errors create low-frequency substrings
• We can identify errors with a mixture model:
• Mixture of Poissons
• Distribution with lowest mean → errors
• From here, we can remove those “erroneous”
strings, and pick likely replacements
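The simplest version of this idea replaces the Poisson mixture with a hard count cutoff; this Python sketch flags low-frequency k-mers as likely errors (the cutoff value is an illustrative choice):

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i+k]] += 1
    return counts

def weak_kmers(counts, cutoff=2):
    """k-mers seen fewer than `cutoff` times: likely sequencing errors."""
    return {kmer for kmer, c in counts.items() if c < cutoff}

reads = ["ACACT", "ACACT", "ACACT", "ACAGT"]  # one read with a C->G error
counts = kmer_spectrum(reads, 3)
print(weak_kmers(counts))  # only the k-mers introduced by the error
```

A real corrector would then replace each flagged k-mer with a nearby high-frequency one, which is the "likely replacement" step on the next slide.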
33. How Do We Define Likely?
• Can use edit distance of replacement as a heuristic
• Can define a probabilistic measure for the quality of
a replacement:
34. Dealing With Repeats
• A cycle in a de Bruijn graph is caused by repeated
sequence
• In real genomes, there is a lot of repetition:
• Structural variation → duplicated sequences
• Transposons/Mobile Elements
• Centromeres and Telomeres
35. Increased k-mer Length
[Figure: the 3-mer graph of ACACTGCACT (ACA, CAC, ACT, CTG,
TGC, GCA) vs. its six distinct 5-mers (ACACT, CACTG, ACTGC,
CTGCA, TGCAC, GCACT)]
• If we have a repeat which is less than b bases
long, we can resolve the repeat by using k-mers
with k > b
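This is easy to see by counting repeated k-mers at different k; a small sketch on the slide's example string:

```python
from collections import Counter

def repeated_kmers(seq, k):
    """k-mers occurring more than once: these collapse into shared nodes."""
    counts = Counter(seq[i:i+k] for i in range(len(seq) - k + 1))
    return sorted(km for km, c in counts.items() if c > 1)

seq = "ACACTGCACT"
print(repeated_kmers(seq, 3))  # ['ACT', 'CAC']: the repeat collapses these nodes
print(repeated_kmers(seq, 5))  # []: with k above the repeat length, it resolves
```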
36. Scaffolding
It was the best of times, it was the worst of times…
[Figure: paired reads spanning the sentence, with an approximately
known distance between the two reads in each pair]
• Current sequencing technology gives us paired reads,
with approximately known distance between reads
37. Scaffolding
• We can use this to estimate repeat sizes:
• Or, to estimate the size of gaps:
[Figure: pair spacing that comes out smaller or bigger than
expected reveals repeat and gap sizes]
40. Opportunities
• New read technologies are available
• Provide much longer reads (>10kbp, vs. ~250bp for Illumina)
• Different error model… (~15% indel errors, vs. ~2%
substitution errors)
• Generally, lower sequence specific bias
• But, need to improve OLC assembler performance!
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
41. Can we turn an expensive,
serial problem into a
cheap, parallel problem?
42. Fast Overlapping with
MinHashing
• Wonderful realization by Berlin et al.1: overlapping is
similar to the document similarity problem
• Use MinHashing to approximate similarity:
1: Berlin et al, bioRxiv 2014
Per document/read, compute a signature:
1. Cut into shingles
2. Apply random hashes to shingles
3. Take min over all random hashes
Hash into buckets: signatures of length l can be hashed into b
buckets, so we expect to compare all elements with similarity
≥ (1/b)^(b/l)
Compare: for two documents with signatures of length l, Jaccard
similarity is estimated by (# equal hashes) / l
Can reduce complexity from O(n²) to O(nb)!
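A minimal sketch of the signature and comparison stages in Python; the shingle length and XOR-mask hash functions are illustrative choices, not the scheme used by Berlin et al:

```python
import random

def shingles(read, k=4):
    """Cut a read into its set of k-character shingles."""
    return {read[i:i+k] for i in range(len(read) - k + 1)}

def make_hashes(num, seed=42):
    """A family of simple randomized hash functions (XOR masks)."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num)]
    return [lambda s, m=m: hash(s) ^ m for m in masks]

def signature(read, hashes, k=4):
    """MinHash signature: the minimum of each hash over all shingles."""
    return [min(h(s) for s in shingles(read, k)) for h in hashes]

def est_jaccard(sig_a, sig_b):
    """Fraction of equal signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

hashes = make_hashes(100)
a = signature("ACACTGCACTGGT", hashes)
b = signature("ACACTGCACTGCT", hashes)
print(est_jaccard(a, b))  # close to the true 4-mer Jaccard, 6/9 ≈ 0.67
```

Bucketing the signatures (locality-sensitive hashing) is the step that brings the all-pairs cost down from O(n²) to O(nb).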
43. MapReduce
• Intuition: if we have a data parallel algorithm, we can
run the algorithm across many computers
• Many popular systems:
• MapReduce at Google
• Hadoop
• (from Berkeley!)
• Provide special programming models for graphs…
44. MinHash On MR
• The three MinHash stages (compute signatures, hash into
buckets, compare within buckets) map directly onto
MapReduce operators:
• Compute signatures per read → map
• Hash signatures into buckets → groupBy
• Compare within buckets → map + filter
45. Transitive Reduction
• We can find a consensus between clique members
• Or, we can reduce down:
• Can be implemented efficiently using graph-optimized
MapReduce libraries!