This document discusses the progress and challenges of de novo genome assembly using next-generation sequencing data. It covers improvements to error correction, contig construction, scaffolding, gap closure, and computational performance that have increased assembly quality and scalability. Resolving repeats and accurately assembling heterozygous diploid genomes remain open challenges.
4. Main issues in NGS de novo assembly: efficient graph building and reduction; contig construction; scaffold construction; gap closure (to resolve repeats); iterative refinement of assemblies.
6. K-mer frequency spectrum-based error correction: Reduce errors beforehand so that the graph can be constructed memory- and time-efficiently; this also significantly reduces the load in the graph-reduction step and improves the reliability of primary contigs, which serve as the data basis for subsequent steps.
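The idea behind a k-mer frequency spectrum can be sketched as follows: k-mers that occur very rarely across all reads are likely to contain sequencing errors. This is a minimal illustration only, not the assembler's actual implementation; the function names, the toy reads, and the `min_count` cutoff are hypothetical, and reverse-complement strands are ignored for brevity.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (likely erroneous)."""
    return {kmer for kmer, n in counts.items() if n < min_count}

# Toy data (assumption): five error-free copies of a read, plus one copy
# carrying a single substitution (G -> C at position 6).
reads = ["ACGTACGTAC"] * 5 + ["ACGTACCTAC"]
counts = kmer_counts(reads, k=5)
suspect = weak_kmers(counts, min_count=2)
```

In this toy input every k-mer covering the substituted base is seen only once, so it falls below the cutoff, while k-mers from the error-free copies are seen many times.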
7. Recent progress: 1) a larger k-mer (up to 27) can be used with acceptable memory and speed; 2) the algorithm is optimized so that more erroneous bases can be corrected; 3) error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two reads’ lengths (e.g., an insert size of 170 bp with a read length of 100 bp), which further improves the result.
9. Results of different versions for error correction. * overlap_cor: combination of error correction and merging of PE reads.
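With a 170 bp insert and two 100 bp reads, the mates overlap by 100 + 100 - 170 = 30 bp, so they can be merged into a single 170 bp read. The sketch below illustrates that overlap merge under simple assumptions: the function names, `min_overlap`, and `max_mismatch_rate` are hypothetical, the fragment is simulated, and base qualities are not used to resolve mismatches.

```python
import random

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a paired-end read pair whose 3' ends overlap.

    read2 is reverse-complemented so both reads are on the same strand,
    then the longest acceptable suffix/prefix overlap is used.
    Returns the merged sequence, or None if no overlap is acceptable.
    """
    mate = reverse_complement(read2)
    for olap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-olap:], mate[:olap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * olap:
            # Keep read1's bases in the overlap (a simplification).
            return read1 + mate[olap:]
    return None

# Simulated 170 bp fragment sequenced as 2 x 100 bp (assumed toy data).
rng = random.Random(0)
frag = "".join(rng.choice("ACGT") for _ in range(170))
read1 = frag[:100]
read2 = reverse_complement(frag[-100:])
merged = merge_pair(read1, read2)
```

Trying the longest overlap first is a design choice that suits this simulated case; real data with low-complexity sequence would need a more careful scoring scheme.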
10. 2. Contiging: For SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph.
11. Progress: 1) a larger k-mer (up to 127) can be used with merged PE reads or with the longer reads coming out soon, e.g., 150 bp reads from Illumina; 2) longer repeats can be resolved using overhung PE reads.
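To make "unambiguous paths" concrete, here is a minimal sketch that builds a k-mer graph and walks each path until it reaches a branch. It is not SOAPdenovo's implementation: the function names and toy reads are hypothetical, reverse-complement strands and error removal (tips, bubbles) are ignored, and isolated cycles are not reported.

```python
def build_graph(reads, k):
    """Nodes are k-mers; an edge joins k-mers that overlap by k-1 bases."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    succ = {km: [km[1:] + b for b in "ACGT" if km[1:] + b in kmers]
            for km in kmers}
    pred = {km: [b + km[:-1] for b in "ACGT" if b + km[:-1] in kmers]
            for km in kmers}
    return succ, pred

def unambiguous_paths(succ, pred):
    """Return sequences of maximal non-branching paths in the graph."""
    def is_start(km):
        # A path starts where the node cannot be merged into a unique predecessor.
        return len(pred[km]) != 1 or len(succ[pred[km][0]]) != 1

    contigs = []
    for km in succ:
        if not is_start(km):
            continue
        path, cur = km, km
        # Extend while the next step is unique in both directions.
        while len(succ[cur]) == 1 and len(pred[succ[cur][0]]) == 1:
            cur = succ[cur][0]
            path += cur[-1]
        contigs.append(path)
    return sorted(contigs)

# Two toy reads that share a prefix and then diverge (assumed data).
succ, pred = build_graph(["AACCGGTA", "AACCGGAT"], k=4)
contigs = unambiguous_paths(succ, pred)
```

The shared prefix forms one contig that stops at the branch point, and each diverging branch forms its own contig.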
12. 3. Scaffolding: Scaffolding links primary contigs into an unambiguous path in the relationship graph; it is the data basis for gap closure; it is highly associated with final contig size; performance is hyper-sensitive to parameter settings.
13. Progress: 1) repetitive contigs are handled more cautiously; 2) some algorithmic logic is optimized to make fewer mistakes. * When one or more contigs in a scaffold are not in the correct position, it is counted as an error.
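One way to picture how contigs are linked is to estimate the gap between two contigs from read pairs that span them. The sketch below is an illustration under stated assumptions, not the assembler's method: `estimate_gap`, the coordinate convention, and the `min_links` threshold are hypothetical, and `min_links` stands in for the kind of parameter to which results are sensitive.

```python
from statistics import median

def estimate_gap(links, len_a, insert_size, min_links=3):
    """Estimate the gap between contig A and the following contig B.

    Each link is (start_a, end_b): the start of the forward read on A and
    the end coordinate of its mate on B. For one pair,
        gap = insert_size - (len_a - start_a) - end_b.
    Returns None when too few pairs support the link.
    """
    if len(links) < min_links:
        return None
    gaps = [insert_size - (len_a - start_a) - end_b
            for start_a, end_b in links]
    return round(median(gaps))

# Toy links (assumed): contig A is 1000 bp, library insert size is 500 bp.
links = [(800, 150), (850, 210), (700, 40)]
gap = estimate_gap(links, len_a=1000, insert_size=500)        # 150
weak = estimate_gap(links[:2], len_a=1000, insert_size=500)   # None
```

Using the median rather than the mean limits the influence of a single mis-mapped pair.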
14. 4. Gap closure: Based on conservatively constructed scaffolds, the assembler attempts to fill intra-scaffold gaps (between linked contigs) to form longer contigs (scaftigs). These gaps include: unique regions that did not pass the stringent contiging threshold; repeat regions that were cut or not assembled in the original assemblies. This process carries a high risk of introducing errors.
15. Progress: 1) overhung PE reads are used to span small gaps and fill them; 2) gaps are treated specifically according to their characteristics, e.g., gap size, length of nearby contigs, number of reads falling in the gap, and whether a tandem repeat is present; 3) the local assembly strategy is optimized to make better decisions when encountering conflicts.
16. Results of different versions for gap filling. * When the sequence of a fully filled gap is not exactly the same as the reference sequence, it is counted as an error.
17. 5. Post-processing: Align reads back to the assembly to evaluate the reliability of each locus; correct artifacts in the assemblies; analyze the possibility of further improvement.
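Evaluating per-locus reliability from reads aligned back to the assembly can be sketched as a depth-of-coverage calculation. This is a minimal illustration only: the function names, the half-open `(start, end)` alignment intervals, and the `min_depth` cutoff are assumptions, and a real pipeline would read alignments from an aligner's output.

```python
def depth_profile(alignments, assembly_len):
    """Per-base read depth from half-open (start, end) alignment intervals."""
    diff = [0] * (assembly_len + 1)
    for start, end in alignments:
        diff[start] += 1   # coverage begins at start
        diff[end] -= 1     # and stops just before end
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth

def low_support_regions(depth, min_depth):
    """Return half-open regions where depth falls below min_depth."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d < min_depth and start is None:
            start = pos
        elif d >= min_depth and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

# Toy alignments on a 10 bp assembly (assumed data).
depth = depth_profile([(0, 6), (2, 8), (4, 10)], assembly_len=10)
weak_regions = low_support_regions(depth, min_depth=2)
```

Regions flagged this way are candidates for closer inspection, not proof of a misassembly.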
18. 6. Computational performance: A number of low-level optimizations have been made; one round of assembly now takes 1 day for a human genome on a node with 256 GB of memory. A cloud-based assembler is on the horizon (dev code: Hecate): memory footprint cut to <32 GB; speed scales with the number of nodes used.
19. Issues: Achieving the theoretical upper limit in contiging; paired-end short reads + insert size ~= long reads; mixing up two haploids. Several key factors affect the quality of WGS assembly: heterozygous rate of the diploid genome; repetitive sequence distribution pattern of the species’ genome; k-mer size used when de Bruijn graph assembly is applied.
20. Revised Hierarchical Assembly: Build libraries hierarchically: using fosmid clones; avoid combining two haploids. Assemble hierarchically: combines de Bruijn graph and OLC strategies. Provides an affordable sequencing solution for diploid and complex genomes.
38. Straw webhost on genomes: http://climb.genomics.org.cn/g10k/home.jsp Please advise what kind of functions to include, considering that genomes will be available at different levels of completeness: finished map; fine map with haploids resolved; draft map with physical map anchored.