2015 pag-chicken
1. C. Titus Brown
Associate Professor
School of Veterinary Medicine
UC Davis
Jan 2015
Adventures in improving the chicken genome &
transcriptome
2. Current state of chicken genome
● galGal2 (2004)
o Sanger sequencing (6.6X)
o Physical and genetic linkage maps
● galGal3 (2006)
o 198K additional reads (contig ends, regions of poor quality)
o SNP mapping
o chrZ and chrW
● galGal4 (2011)
o 454 (12X)
o -10Mb artifactual duplications
o +15Mb mapped to chromosomes
o increases in N50 contig size
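N50 contig size, the metric named above, is the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the assembly. A minimal sketch, assuming contig lengths are already available as plain integers; the toy lengths are hypothetical and are not chicken data:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines N50.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical toy assembly, for illustration only.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

Merging contigs raises N50 without changing the total assembly size, which is why it is commonly used to track contiguity between releases.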
3. 2. Microchromosomes...
● 10 macrochromosomes
● 28 microchromosomes
o GC rich
o high recombination rate
o high gene density
o low intron size
● not sequencing friendly!
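To see where a sequence is GC rich, one simple approach is to compute the GC fraction in fixed-size windows. This is a sketch under stated assumptions: the function names and the window size are illustrative, and ambiguous bases such as N are excluded from the denominator.

```python
def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # window made only of N or other ambiguity codes
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(sequence, window):
    """GC fraction of consecutive, non-overlapping windows."""
    return [
        gc_fraction(sequence[start:start + window])
        for start in range(0, len(sequence), window)
    ]


# Hypothetical toy sequence, not chicken data.
print(windowed_gc("GGGGCCCCATATATAT", 8))
```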
4. Moleculo vs PacBio
Moleculo
● Cheaper
o High throughput
● Low error rate
o ~0%
● Same problems as Illumina…
PacBio
● No 3' bias
● No PCR
● High error rate
o ~15%
● Lower throughput
● "$$-plated genome"
6. Exploring Moleculo
● 1,578,022 reads
● Covers 88% of galGal4
● 326 reads unmapped to galGal4 (0.02%)
o Searched 5 random unmapped reads in ENA (exonerate)
o 3 matched Sediminibacterium sp...
Luiz Irber
9. But Moleculo does not contain
missing genes… ;(
Search for de novo-assembled UniProt orthologs
from chicken in (a) galGal4 genome, and (b)
Moleculo data.
Luiz Irber
10. Not in Moleculo data. Might be in
PacBio.
So, now working with PacBio.
● Dealing with PacBio data
o Most tools break horribly
(It's getting better)
● Assembling PacBio data
o High error rate (~15%)
o Most assemblers target short reads
o PacBio recommended assemblers interact poorly
with MSU HPCC
Would like to produce a step-by-step protocol to
do genome improvement or assembly with
PacBio…
Luiz Irber
11. 2) Evaluating effects of gene models
on pathway prediction
Likit Preeyanon
Vertically integrated comparison.
12. GIMME: Software for Merging Gene Models
[Diagram labels: Assembly-based (Local Assembly); Reference-guided (Cufflinks can incorporate ENSEMBL); ENSEMBL; GIMME (in-house software); Merged Models]
23. RNAseq: your models matter
Our methods for generating hypotheses from mRNAseq
data are sensitive to references & technical details of the
approaches.
(This is expected but Bad.)
More RNAseq data coming every day.
…but we are not regularly updating gene models…
… and the genome that we have is Not Great.
Follow on Smith & Burt (2014) to continually regenerate
gene models for differential expression use.
A general model for vet/ag animals?
state of the chicken genome
galGal2
Sanger sequencing, 6.6X coverage
Aligned to chromosomal linkage groups using
physical maps
genetic linkage maps
galGal3
Additional 198K reads
contig ends
regions of poor quality
Improved using SNP mapping data
1.1 Gb
95% anchored to chromosomes: autosomes 1-28, 32, and the Z and W sex chromosomes
Z and W
3.3X coverage (hemizygous female bird: one Z and one W, so half of the 6.6X autosomal coverage)
chrZ: 33.6 -> 74.6 Mb
chrW: 4.9 -> 0.26 Mb
Contigs to chrW in galGal2 actually on chrZ
Particular problems
10 "Macrochromosomes"
28 "microchromosomes"
GC rich
high recombination rate
high gene density
low intron size
smallest: GGA{16, 25, 27-38}
Data still available:
70X Illumina data
Comparison between two approaches
moleculo: good and bad
pacbio: same
(a) Overview of the library preparation protocol. The subject's DNA (1) is sheared into fragments of about 10 kbp (2), which are then diluted and placed into 384 wells, at about 3,000 fragments per well (3). Within each well, fragments are amplified through long-range PCR, cut into short fragments and barcoded (4), before finally being pooled together and sequenced (5).
(b) Overview of the bioinformatics pipeline. Sequenced short reads are aligned and mapped back to their original well using the barcode adapters (1). Within each well, reads are grouped into fragments (2), which are assembled at their overlapping heterozygous SNVs into haplotype blocks (3). These blocks are assigned a phase statistically based on a phased reference panel (4), which produces very long haplotype contigs (5).
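Step (b)(1) of the caption, assigning short reads back to their well by barcode, can be sketched as a simple grouping. This is an illustration only and is not the Moleculo software: the read representation (barcode, sequence), the barcode-to-well table, and the function name are all hypothetical.

```python
from collections import defaultdict


def group_reads_by_well(reads, barcode_to_well):
    """Group (barcode, sequence) pairs by the well their barcode maps to.

    Returns (wells, unassigned): a dict of well -> list of sequences,
    and a list of sequences whose barcode is not in the table.
    """
    wells = defaultdict(list)
    unassigned = []
    for barcode, sequence in reads:
        well = barcode_to_well.get(barcode)
        if well is None:
            unassigned.append(sequence)
        else:
            wells[well].append(sequence)
    return dict(wells), unassigned


# Hypothetical barcodes and reads, for illustration only.
table = {"AAAC": "well_001", "CCGT": "well_002"}
reads = [("AAAC", "ACGTAC"), ("CCGT", "TTGACA"), ("AAAC", "GGATCC"), ("NNNN", "ACACAC")]
wells, unassigned = group_reads_by_well(reads, table)
print(wells)
print(unassigned)
```

Once reads are separated per well, each well holds only a few thousand source fragments, which is what makes the later per-well grouping and assembly tractable.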
Moleculo data alignment
There are 1578022 Moleculo reads in the input files.
There is a different number of unmapped reads for each reference genome:
- galGal4: 326 (0.02%)
- galGal3: 1504 (0.10%)
- galGal5: 6085 (0.39%)
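The percentages follow directly from the counts and the 1578022 total. The sketch below recomputes them and also shows one common way to count unmapped reads from SAM text, using the unmapped bit (0x4) of the FLAG field. How the original notebook counted is not stated here, so the SAM-parsing function is an assumption for illustration, as is the choice to skip secondary and supplementary records.

```python
TOTAL_READS = 1578022
UNMAPPED = {"galGal4": 326, "galGal3": 1504, "galGal5": 6085}  # counts from the notes


def percent(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


def count_unmapped(sam_lines):
    """Count (total, unmapped) primary records in SAM-formatted text lines."""
    total = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if flag & 0x4:  # 0x4 = segment unmapped
            unmapped += 1
    return total, unmapped


for reference, count in UNMAPPED.items():
    print(f"{reference}: {count} ({percent(count, TOTAL_READS):.2f}%)")
```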
Took 5 random unmapped sequences, searched on ENA:
Sequences 1 and 4 mapped to Gallus gallus sequences.
Sequences 2, 3 and 5 weirdly mapped to Sediminibacterium sp., a bacterium whose genome was published in January 2014.
http://nbviewer.ipython.org/github/luizirber/galGal/blob/9fcad08f652d7b29cc697fb2418cc5ad8580482b/notebooks/02.Exploring_moleculo.ipynb#ENA-exonerate-results
rnaseq intersection with Moleculo data
How many "real" mRNAseq genes, i.e. genes with orthology to UniProt, do or do not match the genome?
Followed eel-pond protocol
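The comparison described here reduces to set operations once each search has produced a list of transcript identifiers with hits. A minimal sketch, assuming such lists already exist; the identifiers and the function name are hypothetical and no real results are shown:

```python
def compare_hits(orthologs, genome_hits, moleculo_hits):
    """Split ortholog-supported transcripts by where they were found."""
    orthologs = set(orthologs)
    in_genome = orthologs & set(genome_hits)
    missing_from_genome = orthologs - in_genome
    # Of the transcripts absent from the genome, which appear in Moleculo reads?
    recovered = missing_from_genome & set(moleculo_hits)
    return {
        "in_genome": in_genome,
        "missing_from_genome": missing_from_genome,
        "recovered_in_moleculo": recovered,
        "still_missing": missing_from_genome - recovered,
    }


# Hypothetical identifiers, for illustration only.
result = compare_hits(
    orthologs=["t1", "t2", "t3", "t4"],
    genome_hits=["t1", "t2"],
    moleculo_hits=["t1", "t3"],
)
for category, members in result.items():
    print(category, sorted(members))
```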
pacbio efforts and bottlenecks
Mapping PacBio filtered reads to galGal4
Chicken_10Kb20Kb_40X_Filtered_Subreads.fastq
From now on, when we refer to reference-guided models, we mean reference-guided + Ensembl.
Translation initiation factor
Note that it is fortunate that GOSeq supports custom KEGG annotation. Most tools do not accept custom annotation, so you can only use annotation of one species at a time.