1. Streaming approaches to sequence
data compression
(via normalization and error correction)
C. Titus Brown
Asst Prof, CSE and
Microbiology
Michigan State University
ctb@msu.edu
4. Side note: error correction is the
biggest “data” problem left in
sequencing.*
Both for mapping & assembly.
*paraphrased, E. Birney
5. My biggest research problem –
soil.
Est ~50 Tbp to comprehensively sample the microbial
composition of a gram of soil.
Bacterial species present at a 1:1M dilution, estimated by 16S
Does not include phage, etc. that are invisible to tagging
approaches
Currently we have approximately 2 Tbp spread across
9 soil samples, for one project; 1 Tbp across 10
samples for another.
Need 3 TB RAM on a single chassis to assemble 300
Gbp (Velvet).
…estimate 500 TB RAM for 50 Tbp of sequence.
That just won't do.
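The 500 TB estimate follows from assuming assembler memory grows roughly linearly with input size; a back-of-envelope sketch (numbers from the slide, linearity is the assumption):

```python
# Velvet needed ~3 TB RAM for 300 Gbp; scale linearly to 50 Tbp.
ram_tb, per_gbp = 3, 300    # 3 TB RAM per 300 Gbp assembled
target_gbp = 50_000         # 50 Tbp = 50,000 Gbp of soil sequence

needed_tb = ram_tb * target_gbp / per_gbp
print(needed_tb)  # → 500.0
```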
6. Online, streaming, lossy compression.
(Digital normalization)
Much of next-gen sequencing is redundant.
7. Uneven coverage => even more
redundancy
Suppose species A is 10x more
abundant than species B. To get
10x coverage of B, you need
100x coverage of A!
Overkill!!
This 100x will consume
disk space and, because
of errors, memory.
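The overkill arithmetic above can be made concrete (a toy sketch; the 10:1 ratio is from the slide, the variable names are illustrative):

```python
# If species A is 10x more abundant than species B, reads are sampled
# from A ten times as often, so hitting a coverage target for B forces
# proportionally deeper coverage of A.
abundance_ratio = 10       # A : B
target_coverage_B = 10     # desired coverage of the rare species

coverage_A = target_coverage_B * abundance_ratio
print(coverage_A)  # → 100
```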
15. Coverage after digital normalization:
Normalizes coverage
Discards redundancy
Eliminates majority of
errors
Scales assembly dramatically
Assembly is 98% identical
16. Digital normalization algorithm
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
Note: single pass; fixed memory.
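A minimal runnable sketch of the loop above, assuming `estimated_coverage` is the median k-mer abundance of the read (the diginorm estimator). The exact dict here stands in for khmer's probabilistic CountMin-style counting structure, and `digital_normalize` is an illustrative name:

```python
from collections import defaultdict
from statistics import median

K = 4        # toy k-mer size; real pipelines use k ~ 20
CUTOFF = 3   # target coverage

def kmers(read, k=K):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalize(reads, cutoff=CUTOFF):
    """Single pass; fixed-size counting structure (exact dict here,
    a probabilistic sketch in khmer)."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        # estimated coverage = median abundance of the read's k-mers
        est = median(counts[km] for km in kmers(read))
        if est < cutoff:
            for km in kmers(read):
                counts[km] += 1
            kept.append(read)
        # else: discard read
    return kept

# Ten identical reads: only the first few are kept.
reads = ["ACGTGCA"] * 10
print(len(digital_normalize(reads)))  # → 3
```

Redundant reads stop being saved once their estimated coverage reaches the cutoff, which is why highly redundant data compresses so well.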
18. Little-appreciated implications!!
Digital normalization puts both sequence and
assembly graph analysis on a streaming and
online basis.
Potentially really useful for streaming variant calling
and streaming sample categorization
Can implement (< 2)-pass error
detection/correction using locus-specific
coverage.
Error correction can be “tuned” to specific
coverage retention and variant detection.
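One way the locus-specific coverage idea can be sketched (illustrative names and thresholds, not the lab's implementation): within a read whose retained coverage is high, k-mers with very low abundance are likely to contain sequencing errors:

```python
from collections import Counter

K = 4
LOW = 1   # illustrative abundance threshold for "untrusted" k-mers

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def flag_error_positions(read, counts, low=LOW):
    """Streaming-style error *detection*: k-mers seen <= `low` times
    in an otherwise well-covered locus are assumed erroneous; every
    base they cover is flagged as suspect."""
    flagged = set()
    for i, km in enumerate(kmers(read)):
        if counts[km] <= low:
            flagged.update(range(i, i + K))
    return sorted(flagged)

# Toy abundance table: trusted k-mers seen 10x each.
counts = Counter()
for km in kmers("ACGTGCATT"):
    counts[km] += 10

read_with_error = "ACGAGCATT"   # T -> A at position 3
print(flag_error_positions(read_with_error, counts))
# → [0, 1, 2, 3, 4, 5, 6]
```

Intersecting the flagged ranges of the low-abundance k-mers would localize the error further; tuning `low` against the retained coverage is the "tuning" knob mentioned above.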
19. Local graph coverage
Diginorm provides the ability to
measure local graph coverage
efficiently, online.
(Theory still needs to be
developed)
20. Alignment of reads to graph
“Fixes” digital normalization
Aligned reads => error corrected reads
Can align longer sequences (transcripts?
contigs?) to graphs.
Original Sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG
Read: AGCCGGAGGTACCGAATCTGATGGGGAGGCG
[Figure: read aligned to the de Bruijn graph. Legend: Seed K-mer = CGAATCTGAT; Emission Base → A; K-mer Coverage → 19; Vertex Class → SN (also MN, ME); vertex coverages range ~1–20.]
Jason Pell
21. 1.2x pass error-corrected E. coli*
(Significantly more compressible)
* Same approach can be used on mRNAseq and metagenomes.
22. Some thoughts
Need fundamental measures of information
retention so that we can place limits on what
we're discarding with lossy compression.
Compression is most useful to scientists when it
makes analysis faster/lower memory/better.
Variant calling, assembly, and just deleting your
data are all just various forms of lossy
compression :)
23. How compressible is soil data?
De Bruijn graph overlap: 51% of the reads in prairie
(330 Gbp) have coverage > 1 in the corn sample's
de Bruijn graph (180 Gbp).
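The overlap statistic above can be approximated with a small sketch: build the k-mer set (the de Bruijn graph nodes) of one sample, then count reads of the other sample that share at least one k-mer with it. khmer does this with a Bloom filter; an exact set and all names here are illustrative:

```python
K = 4  # toy k-mer size

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def graph_overlap(reads_a, reads_b):
    """Fraction of reads in `reads_b` sharing at least one k-mer with
    the de Bruijn graph (k-mer set) built from `reads_a`."""
    graph_a = set()
    for r in reads_a:
        graph_a |= kmers(r)
    shared = sum(1 for r in reads_b if kmers(r) & graph_a)
    return shared / len(reads_b)

corn = ["ACGTGCAT", "TTGCACGT"]
prairie = ["ACGTGCAT", "GGGGCCCC"]
print(graph_overlap(corn, prairie))  # → 0.5
```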
24. Further resources
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/interests.html
See esp:
BIGDATA: Small: DA: DCM: Low-memory
Streaming Prefilters for Biological Sequencing
Data
Preprints: on arXiv, q-bio:
'diginorm arxiv', 'illumina artifacts arxiv', 'assembling …