REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

REASSEMBLING 600+ MARINE TRANSCRIPTOMES:
AUTOMATED PIPELINE DEVELOPMENT AND
EVALUATION
Lisa Cohen, Harriet Alexander, C. Titus Brown
Lab for Data Intensive Biology (DIB), UC Davis
ASLO Aquatic Sciences meeting
Session 016: Advances in Aquatic Meta-Omics
March 3, 2017
@monsterbashseq
ljcohen@ucdavis.edu

Marine Microbial Eukaryotic Transcriptome
Sequencing Project (MMETSP)
- 678 Illumina RNA sequence datasets = 1 TB raw data
- Wide diversity spanning more than 40 phyla
- Original assemblies by the National Center for Genome Resources (NCGR)
Keeling et al. 2014
PMID: 24959919
Caron et al. 2016
PMID: 27867198

Need for a modularized, extensible RNA-seq pipeline:
o Software and best practices for RNA-seq analysis changing rapidly
(Conesa et al. 2016, PMID: 26813401)
o Accumulating more and more data!
o MMETSP: awesome data set to test software and pipelines!
o What to do if:
• New samples to add?
• New software tool is developed?

Metadata from
NCBI
PRJNA231566
download
data
Trinity
assembly
Trinity.fasta
evaluation
annotation
expression
quantification
Adapted from the Brown lab, “Eel Pond mRNA-seq Protocol”: http://eel-pond.readthedocs.io/en/latest/
Titus Brown, Camille Scott, and Leigh Sheneman
trim, fastqc diginorm
Cohen, Lisa; Alexander, Harriet; Brown, C. Titus
(2017): Marine Microbial Eukaryotic Transcriptome
Sequencing Project, re-assemblies. figshare.
https://doi.org/10.6084/m9.figshare.3840153.v6
1TB raw storage,
>8,000 computing hours

Numberofcontigs
17 610
48,005
25,059
NCGR DIB
Our re-assemblies have more contigs:
# higher in NCGR # higher in DIB

Questions:
1. Did we generate more biologically-
meaningful content with re-assemblies?
1. Are there phylogenetic patterns in the
assemblies?

Smith-Unna et al. 2016
PMID: 27252236
Transrate score = overall quality
of the final assembly (scale 0-1.0)
Qualities of our re-assemblies are higher:
1. Did we generate more biologically-meaningful content with re-assemblies?
Transratescore
0.31
0.22
NCGR DIB

Re-assemblies generally contain most of the information in the
NCGR assemblies, plus ~30% more content:
Comparison:
DIB vs. NCGR
DIB
NCGR
Proportionofcontigs(CRB-BLAST)
Comparison:
NCGR vs. DIB
NCGR DIB

Similar Open Reading Frame (ORF) and
Benchmarks of Universal Single Copy Orthologs (BUSCO)
MeanORFpercentage
CompleteBUSCOpercentage
NCGR DIBNCGR DIB

Scott, C. in prep. 2016.
www.camillescott.org/dammit
‘dammit’ annotation pipeline: Pfam, Rfam, OrthoDB
annotated absent transcripts
transcripts absent from NCGR
#Transcripts
MMETSP sample (sorted)
After annotation, ~30% extra content appears real
DIB
NCGR
Extra content

Some DIB assemblies have more unique content.
Unique k-mers (k=25), unique word combinations
Probably.
Unique k-mers
(DIB)
Unique k-mers
(NCGR)

Assemblies from Dinophyta have more unique k-mers and lower qualities.
Dinoflagellates: steady-state gene expression, translational gene regulation
Aranda et al. 2016 PMID: 28004835
Lin 2011 PMID: 21514379
Hou and Lin 2009. PMID: 27426948
N=
173
111
73
61
60
60
25
22
2. Can we detect phylogenetic differences in the assemblies?
Unique k-mers = unique word combinations (k=25)

Ciliophora have lower ORF percentagesN=
173
111
73
61
60
60
25
22
Ciliates: alternative triplet codon dictionary, STOP codon different purpose
Alkalaeva and Mikhailova 2016, PMID: 28009453
Heaphy et al. 2016, PMID: 27501944
Swart et al. 2016, PMID: 27426948
2. Are there phylogenetic differences in the assemblies?
Trends.
Mean % ORF
# contigs

Future work:
• In-depth annotation analysis
• Orthologous groupings of contigs
• Co-expression network analysis
• Better reference transcriptomes for MMETSP available:
• Strain-specific trends in assemblies support previously-reported
transcriptomic features
• De novo transcriptome assembly pipeline available:
https://github.com/dib-lab/dib-MMETSP
Conclusions
@monsterbashseq
ljcohen@ucdavis.edu
Contact:

Acknowledgements
• Data Intensive Biology Lab
– Camille Scott, Luiz Irber
• MSU iCER
• NSF’s XSEDE, Jetstream cloud
• Substituting for my NPB101D
sections today:
– Natalia Caporale, Sheryar
Siddiqui, Pearl Chen, Arik
Davidyan, Karl Larson Photo by James Word
Data Intensive Biology Summer Institute, applications due March 17th!
http://ivory.idyll.org/dibsi/

Files available for download!
Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial
Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare.
https://github.com/dib-lab/dib-MMETSP
@monsterbashseq
ljcohen@ucdavis.edu
Data Intensive Biology Summer Institute, applications due March 17th!
http://ivory.idyll.org/dibsi/

REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Recommended

Recommended

More Related Content

What's hot

What's hot (14)

Viewers also liked

Viewers also liked (20)

Similar to REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Similar to REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION (20)

Recently uploaded

Recently uploaded (20)

REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Editor's Notes