Streaming approaches to sequence
        data compression
    (via normalization and error correction)



                 C. Titus Brown
               Asst Prof, CSE and
                  Microbiology
            Michigan State University
                 ctb@msu.edu
What Ewan said.
What Guy said.
Side note: error correction is the
biggest “data” problem left in
sequencing.*




        Both for mapping & assembly.

                            *paraphrased, E. Birney
My biggest research problem –
soil.
 Est ~50 Tbp to comprehensively sample the microbial
 composition of a gram of soil.
   Bacterial species in 1:1m dilution, est. by 16S
   Does not include phage, etc. that are invisible to tagging
   approaches

 Currently we have approximately 2 Tbp spread across
 9 soil samples, for one project; 1 Tbp across 10
 samples for another.

 Need 3 TB RAM on single chassis to do assembly of
  300 Gbp (Velvet).
 …estimate 500 TB RAM for 50 Tbp of sequence.

                    That just won't do.
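The 500 TB figure follows from the numbers above if memory is assumed to grow linearly with the amount of sequence. A minimal sketch of that extrapolation (the linear-scaling assumption and the function name are illustrative, not taken from the slides):

```python
# Linear extrapolation of assembly memory from the figures on the slide.
# Assumption: RAM grows in proportion to the number of bases assembled.
KNOWN_RAM_TB = 3.0          # RAM needed for the Velvet assembly
KNOWN_DATA_GBP = 300.0      # data assembled with that much RAM
TARGET_DATA_GBP = 50_000.0  # 50 Tbp expressed in Gbp


def extrapolate_ram_tb(target_gbp, known_gbp=KNOWN_DATA_GBP,
                       known_ram_tb=KNOWN_RAM_TB):
    """Scale the known RAM requirement to a larger data set."""
    return known_ram_tb * (target_gbp / known_gbp)


estimate_tb = extrapolate_ram_tb(TARGET_DATA_GBP)
print(f"{estimate_tb:.0f} TB")  # prints "500 TB"
```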
Online, streaming, lossy compression.
(Digital normalization)
        Much of next-gen sequencing is redundant.
Uneven coverage => even more
redundancy


                         Suppose you have a
                      dilution factor of A (10) to
                      B(1). To get 10x of B you
                        need to get 100x of A!
                                Overkill!!

                       This 100x will consume
                      disk space and, because
                         of errors, memory.
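The arithmetic behind the 100x figure, as a small sketch. It assumes sequencing depth is shared in proportion to abundance; the function name is hypothetical.

```python
def required_coverage(abundances, target, target_coverage):
    """Coverage every species receives once `target` reaches target_coverage.

    Assumes reads are sampled in proportion to relative abundance.
    """
    scale = target_coverage / abundances[target]
    return {name: abundance * scale for name, abundance in abundances.items()}


# A is 10 times as abundant as B; we want 10x coverage of B.
coverage = required_coverage({"A": 10, "B": 1}, target="B", target_coverage=10)
print(coverage)  # {'A': 100.0, 'B': 10.0}
```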
Coverage before digital
normalization:


                          (MD amplified)
Coverage after digital normalization:

                            Normalizes coverage

                            Discards redundancy

                            Eliminates majority of
                            errors

                            Scales assembly dramatically

                            Assembly is 98% identical
Digital normalization algorithm

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)   # count only the reads we keep
        save(read)
    else:
        pass                       # discard read

              Note, single pass; fixed memory.
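A minimal runnable sketch of the loop above. The slide does not say how coverage is estimated or how counts are stored, so the following are assumptions: coverage is the median count of the read's k-mers, and counts live in one fixed-size hashed table (collisions can only overestimate). The values of k, cutoff and table_size are placeholders.

```python
import statistics
import zlib


def normalize(dataset, k=20, cutoff=20, table_size=1_000_003):
    """Yield the reads kept by a single streaming pass over `dataset`."""
    counts = [0] * table_size  # fixed memory, independent of input size

    def slots(read):
        # Table index of every k-mer in the read (crc32 is deterministic).
        return [zlib.crc32(read[i:i + k].encode("ascii")) % table_size
                for i in range(len(read) - k + 1)]

    for read in dataset:
        read_slots = slots(read)
        # Reads shorter than k have no k-mers; treat their coverage as 0.
        coverage = (statistics.median(counts[s] for s in read_slots)
                    if read_slots else 0)
        if coverage < cutoff:
            for s in read_slots:
                counts[s] += 1  # update counts only for retained reads
            yield read
        # otherwise the read is discarded


read_a = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATTACA"
read_b = "TTGACCGGTATCGAGCTTAACCGGTTAAGCGCGATATCGGA"
kept = list(normalize([read_a] * 100 + [read_b], k=20, cutoff=5))
print(len(kept))  # 6: five copies of read_a, then read_b
```

Because the function is a generator, reads can be piped through it without holding the data set in memory.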
Digital normalization retains information, while
discarding data and errors
Little-appreciated implications!!
 Digital normalization puts both sequence and
 assembly graph analysis on a streaming and
 online basis.
   Potentially really useful for streaming variant calling
   and streaming sample categorization

 Can implement (< 2)-pass error
 detection/correction using locus-specific
 coverage.

 Error correction can be “tuned” to specific
 coverage retention and variant detection.
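One way to read "locus-specific coverage" is sketched below: if a read's median k-mer count shows that its locus is well covered, any k-mers in that read with very low counts are likely errors. The thresholds and function names are hypothetical, and an exact Counter stands in for whatever counting structure is actually used.

```python
import statistics
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def suspicious_kmer_positions(read, counts, k, min_coverage=10, low=2):
    """Start positions of low-count k-mers inside a well-covered read."""
    abundances = [counts[km] for km in kmers(read, k)]
    if not abundances or statistics.median(abundances) < min_coverage:
        return []  # locus not covered deeply enough to judge
    return [i for i, a in enumerate(abundances) if a <= low]


# Toy data: 19 error-free reads plus one read with a single substitution.
original = "AGCCGGAGGTCCCGAATCTGATGGGGAGGCG"
erroneous = "AGCCGGAGGTACCGAATCTGATGGGGAGGCG"
K = 10
counts = Counter(km for read in [original] * 19 + [erroneous]
                 for km in kmers(read, K))

flagged = suspicious_kmer_positions(erroneous, counts, K)
print(flagged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The flagged k-mers are exactly those that overlap the substituted base at position 10 (0-based). Raising or lowering `low` is one way such a detector could be tuned between error removal and variant retention.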
Local graph coverage
Diginorm provides the ability to
measure local graph coverage
online, very efficiently.
(Theory still needs to be
developed)
Alignment of reads to graph
 “Fixes” digital normalization
 Aligned reads => error corrected reads
 Can align longer sequences (transcripts?
 contigs?) to graphs.

 Original Sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG
              Read: AGCCGGAGGTACCGAATCTGATGGGGAGGCG

 [Figure: sequence read aligned to the k-mer graph, with the correction marked. Each vertex is labeled with its emission base, k-mer coverage and vertex class (e.g. A / 19 / SN); labels in the figure pair coverage 19 with class SN, 20 with MN, and 1 with ME. Seed k-mer: CGAATCTGAT. Credit: Jason Pell]
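The graph alignment itself cannot be recovered from the figure. As a stand-in, here is a much simpler k-mer spectrum correction applied to the same two sequences: try every single-base substitution and keep the one that makes the most k-mers "trusted". This is not the alignment method shown on the slide; the function names and the min_count threshold are illustrative assumptions.

```python
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trusted_kmers(seq, counts, k, min_count):
    """Number of k-mers in seq seen at least min_count times."""
    return sum(1 for km in kmers(seq, k) if counts[km] >= min_count)


def correct_single_substitution(read, counts, k, min_count=3):
    """Return the read, or the one-base variant with the most trusted k-mers."""
    best, best_score = read, trusted_kmers(read, counts, k, min_count)
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            score = trusted_kmers(candidate, counts, k, min_count)
            if score > best_score:
                best, best_score = candidate, score
    return best


original = "AGCCGGAGGTCCCGAATCTGATGGGGAGGCG"
read = "AGCCGGAGGTACCGAATCTGATGGGGAGGCG"
K = 10  # same length as the seed k-mer on the slide
counts = Counter(km for r in [original] * 19 + [read] for km in kmers(r, K))

corrected = correct_single_substitution(read, counts, K)
print(corrected == original)  # True
```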
1.2x pass error-corrected E. coli*
(Significantly more compressible)




               * Same approach can be used on mRNAseq and metagenomes
Some thoughts
 Need fundamental measures of information
 retention so that we can place limits on what
 we're discarding with lossy compression.

 Compression is most useful to scientists when it
 makes analysis faster/lower memory/better.

 Variant calling, assembly, and just deleting your
 data are all just various forms of lossy
 compression :)
How compressible is soil data?
De Bruijn graph overlap: 51% of the reads in prairie
(330 Gbp) have coverage > 1 in the corn sample's
            de Bruijn graph (180 Gbp).
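A toy sketch of how such an overlap could be measured. The slide does not define "coverage", so this assumes it means the median count of a read's k-mers in the other sample; the function names, k, and the exact Counter are illustrative only and would not scale to hundreds of Gbp.

```python
import statistics
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fraction_covered(query_reads, reference_reads, k, min_coverage=1):
    """Fraction of query reads whose median k-mer count in the
    reference sample is greater than min_coverage."""
    counts = Counter(km for read in reference_reads for km in kmers(read, k))
    covered = total = 0
    for read in query_reads:
        abundances = [counts[km] for km in kmers(read, k)]
        if not abundances:
            continue  # read shorter than k
        total += 1
        if statistics.median(abundances) > min_coverage:
            covered += 1
    return covered / total if total else 0.0


shared = "AGCCGGAGGTCCCGAATCTGATGGGGAGGCG"
unrelated = "TTTTTTTTTTTTTTTTTTTTTTTTT"
overlap = fraction_covered([shared, unrelated], [shared, shared], k=10)
print(overlap)  # 0.5
```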




             Corn          Prairie
Further resources
Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog ('titus brown blog')
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
   See esp:
         BIGDATA: Small: DA: DCM: Low-memory
    Streaming Prefilters for Biological Sequencing
                          Data
 Preprints: on arXiv, q-bio:
  'diginorm arxiv', 'illumina artifacts arxiv', 'assembling
Streaming Twitter analysis.

2012 wellcome-talk
