NGS de novo assembly progresses and challenges

NGS de novo assembly: progresses and challenges. Yingrui Li, BGI Shenzhen

Overall sketch of SOAPdenovo 2

Overall sketch of SOAPdenovo 3

Main issues in NGS de novo assembly: efficient graph building and reduction; contig construction; scaffold construction; gap closure (to solve repeats); iterative refinement of assemblies.

1. Reducing graph complexity: eliminate errors in the original raw reads.

K-mer frequency spectrum-based: reduce errors beforehand so that the graph can be constructed memory- and time-efficiently. This also significantly reduces the load in the graph-reduction step and improves the reliability of the primary contigs, which serve as the data basis for subsequent steps.
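The idea of a k-mer frequency spectrum can be sketched in a few lines. This is only a minimal illustration, not the SOAPdenovo corrector: the function names, the toy reads, and the `min_freq` cutoff are assumptions made for the example, and reverse complements are ignored for brevity.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def frequency_spectrum(counts):
    """Map each frequency to the number of distinct k-mers seen that often."""
    return Counter(counts.values())

def weak_kmers(counts, min_freq):
    """K-mers rarer than min_freq are treated as likely sequencing errors."""
    return {kmer for kmer, n in counts.items() if n < min_freq}

# Toy data: three identical reads and one read with a single substitution.
reads = ["ACGTACGA", "ACGTACGA", "ACGTACGA", "ACGTTCGA"]
counts = kmer_counts(reads, k=4)
spectrum = frequency_spectrum(counts)
weak = weak_kmers(counts, min_freq=2)
```

In this toy input the single substituted base creates four k-mers that occur only once, which is why low-frequency k-mers are a useful signal for locating errors before the graph is built.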

Recent progress: 1) A larger k-mer (up to 27) can be used with acceptable memory and speed. 2) The algorithm is optimized so that more erroneous bases can be corrected. 3) Error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two read lengths (e.g., an insert size of 170 bp with a read length of 100 bp), which further improves the result.
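With an insert size of 170 bp and two 100 bp reads, the mates overlap by 2 x 100 - 170 = 30 bp, which is what makes merging possible. The sketch below illustrates the merge under strong simplifying assumptions: exact-match overlap only, no quality scores, and hypothetical function names. It is not the merging procedure used in SOAPdenovo.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def expected_overlap(read_len, insert_size):
    """Overlap between two mates of equal length for a given insert size."""
    return 2 * read_len - insert_size

def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair into one long read, or return None if no overlap.

    read2 is assumed to come from the opposite strand, so it is reverse
    complemented first. The longest exact suffix/prefix overlap wins.
    """
    mate = reverse_complement(read2)
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-olen:] == mate[:olen]:
            return read1 + mate[olen:]
    return None

# Toy fragment of 17 bp sequenced with two 10 bp reads (3 bp overlap).
fragment = "ACGTTGCATGCCGATAG"
read1 = fragment[:10]
read2 = reverse_complement(fragment[-10:])
merged = merge_pair(read1, read2, min_overlap=3)
```

A real merger would tolerate mismatches in the overlap and use base qualities to pick the consensus base, which is where the combination with error correction comes in.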

Simulation results of Arabidopsis data using different k-mer sizes

Results of different versions for error correction * overlap_cor: combination of error correction and merging of PE reads

2. Contiging: for SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph.
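Finding unambiguous paths in a de Bruijn graph can be illustrated as follows. This is a minimal sketch under stated assumptions: nodes are (k-1)-mers, only the forward strand is used, isolated cycles are not reported, and the function names are hypothetical. It is not SOAPdenovo's implementation.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred

def unambiguous_paths(succ, pred):
    """Return the sequences of all maximal non-branching paths."""
    def simple(node):
        # Exactly one way in and one way out: the path cannot branch here.
        return len(succ.get(node, ())) == 1 and len(pred.get(node, ())) == 1

    contigs = []
    for start in sorted(set(succ) | set(pred)):
        if simple(start):
            continue  # paths only start at branch points or dead ends
        for node in sorted(succ.get(start, ())):
            path = start + node[-1]
            while simple(node):
                (node,) = succ[node]  # follow the only outgoing edge
                path += node[-1]
            contigs.append(path)
    return sorted(contigs)

# Two toy reads sharing a prefix and diverging after "GTC".
reads = ["AAGTCGGA", "AAGTCTTC"]
succ, pred = build_graph(reads, k=4)
contigs = unambiguous_paths(succ, pred)
```

The shared prefix becomes one contig and each diverging branch becomes its own contig, because the walk stops whenever a node has more than one predecessor or successor.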

Progress: 1) A larger k-mer, up to 127, can be used with merged PE reads or with the longer reads coming out soon (e.g., a read length of 150 bp from Illumina). 2) Longer repeats can be resolved using overhung PE reads.

3. Scaffolding: scaffolding links primary contigs into an unambiguous path in the relationship graph. It is the data basis for gap closure, is highly associated with final contig size, and its performance is hypersensitive to parameter settings.
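Linking contigs into unambiguous paths can be sketched as below. Everything here is an illustrative assumption rather than the SOAPdenovo scaffolder: links are plain ordered pairs of contig names, contig orientation and gap-size estimation are ignored, cyclic chains are not reported, and `min_support` stands in for the kind of parameter the slide describes as sensitive.

```python
from collections import Counter, defaultdict

def build_links(pair_hits, min_support):
    """Keep contig-to-contig links supported by at least min_support pairs."""
    support = Counter(pair_hits)
    return {link for link, n in support.items() if n >= min_support}

def chain_scaffolds(contigs, links):
    """Chain contigs wherever the link graph offers exactly one choice."""
    nxt, prv = defaultdict(set), defaultdict(set)
    for a, b in links:
        nxt[a].add(b)
        prv[b].add(a)

    def sole(mapping, contig):
        neighbours = mapping.get(contig, ())
        return next(iter(neighbours)) if len(neighbours) == 1 else None

    def unambiguous_next(contig):
        b = sole(nxt, contig)
        return b if b is not None and sole(prv, b) == contig else None

    def unambiguous_prev(contig):
        a = sole(prv, contig)
        return a if a is not None and sole(nxt, a) == contig else None

    scaffolds, used = [], set()
    for contig in contigs:
        if contig in used or unambiguous_prev(contig) is not None:
            continue  # not the first contig of a chain
        chain = [contig]
        used.add(contig)
        while (b := unambiguous_next(chain[-1])) is not None and b not in used:
            chain.append(b)
            used.add(b)
        scaffolds.append(chain)
    return scaffolds

# Toy paired-end evidence: each tuple is one read pair spanning two contigs.
pair_hits = ([("c1", "c2")] * 5 + [("c2", "c3")] * 4 + [("c3", "c4")] * 3
             + [("c3", "c5")] * 3 + [("c1", "c4")] * 1)
links = build_links(pair_hits, min_support=3)
scaffolds = chain_scaffolds(["c1", "c2", "c3", "c4", "c5"], links)
```

The weakly supported link is discarded, and the chain stops at c3 because two equally supported successors make the continuation ambiguous. Lowering `min_support` to 1 changes the result, which illustrates the sensitivity to parameter settings.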

Progress: 1) Repetitive contigs are handled more cautiously. 2) Some algorithmic logic is optimized to make fewer mistakes. * When one or more contigs in a scaffold are not in the correct position, there is an error.

4. Gap closure: based on conservatively constructed scaffolds, intra-scaffold gaps (between linked contigs) are filled where possible to form longer contigs (scaftigs). The gaps include unique regions that did not pass the stringent contiging threshold and repeat regions that are cut or not assembled in the original assemblies. This process carries a high risk of introducing errors.

Progress: 1) Overhung PE reads are used to span small gaps and fill them. 2) Gaps are treated specifically according to their characteristics, e.g., gap size, length of nearby contigs, number of reads that fall in the gap, whether there is a tandem repeat inside… 3) The local assembly strategy is optimized to make better decisions when encountering conflicts.

Results of different versions for gap filling * When the sequence of a fully filled gap is not exactly the same as the reference sequence, there is an error.

5. Post-processing: align reads back to the assembly to evaluate the reliability of each locus; correct artifacts in the assemblies; analyze the possibility of further improvement.
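One simple way to evaluate each locus after aligning reads back is per-base read depth. The sketch below assumes alignments are already available as 0-based, half-open (start, end) intervals on a single assembled sequence. The function names and the `min_depth` cutoff are hypothetical, and depth is only one of several possible reliability signals.

```python
def per_base_depth(assembly_length, alignments):
    """Read depth at every position, using a difference array."""
    diff = [0] * (assembly_length + 1)
    for start, end in alignments:
        diff[start] += 1  # a read begins covering here
        diff[end] -= 1    # and stops covering here (end is exclusive)
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth

def low_support_loci(depth, min_depth):
    """Positions whose depth falls below min_depth."""
    return [pos for pos, d in enumerate(depth) if d < min_depth]

# Toy example: four reads aligned to a 10 bp assembled sequence.
alignments = [(0, 5), (2, 8), (3, 10), (7, 10)]
depth = per_base_depth(10, alignments)
suspect = low_support_loci(depth, min_depth=2)
```

Positions flagged this way are candidates for closer inspection or correction rather than proven errors.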

6. Computational performance: a number of low-level optimizations have been achieved; one round of assembly now costs 1 day for a human genome on a node with 256G of memory. A cloud-based assembler is at dawn (dev code: Hecate): memory footprint cut to <32G; speed scalable with the number of nodes used.

Issues: achieving the theoretical upper limit in contiging (paired-end short reads + insert size ~= long reads); mixing up two haploids. Several key factors affect the quality of WGS assembly: heterozygous rate of the diploid genome; repetitive sequence distribution pattern of the species’ genome; k-mer size used when de Bruijn graph assembly is applied.

Revised Hierarchical Assembly. Build libraries hierarchically: use fosmid clones; avoid combining two haploids. Assemble hierarchically: combine de Bruijn graph and OLC strategies. Provides an affordable sequencing solution for diploid and complex genomes.

Flowchart of Revised Hierarchical Assembly

Revised Hierarchical de novo Assembly on an Asian Genome: data production

Optimally 30 fosmid clones per pool
