REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION
Lisa Cohen, Harriet Alexander, C. Titus Brown
Lab for Data Intensive Biology (DIB), UC Davis
ASLO Aquatic Sciences meeting
Session 016: Advances in Aquatic Meta-Omics
March 3, 2017
@monsterbashseq
ljcohen@ucdavis.edu
Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP)
- 678 Illumina RNA sequence datasets = 1 TB raw data
- Wide diversity spanning more than 40 phyla
- Original assemblies by the National Center for Genome Resources (NCGR)
Keeling et al. 2014, PMID: 24959919
Caron et al. 2016, PMID: 27867198
Need for a modularized, extensible RNA-seq pipeline:
o Software and best practices for RNA-seq analysis are changing rapidly (Conesa et al. 2016, PMID: 26813401)
o Accumulating more and more data!
o MMETSP: awesome data set to test software and pipelines!
o What to do if:
• New samples need to be added?
• A new software tool is developed?
[Flowchart] Metadata from NCBI (PRJNA231566) → download data → trim, fastqc → diginorm → Trinity assembly (Trinity.fasta) → evaluation, annotation, expression quantification
Adapted from the Brown lab, “Eel Pond mRNA-seq Protocol” (Titus Brown, Camille Scott, and Leigh Sheneman): http://eel-pond.readthedocs.io/en/latest/
Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare. https://doi.org/10.6084/m9.figshare.3840153.v6
1 TB raw storage, >8,000 computing hours
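The modular, rerunnable design motivated by the steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the dib-MMETSP code: the stage order mirrors the slide, but `Stage`, `placeholder`, `run_pipeline`, the sample name, and every output file name other than `Trinity.fasta` are hypothetical, and the placeholder actions stand in for real tool invocations. Each stage is skipped when its output already exists, so adding a sample or replacing one tool reruns only what is missing.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List
import tempfile


@dataclass
class Stage:
    """One pipeline step; `action` writes `output_name` into the sample directory."""
    name: str
    output_name: str
    action: Callable[[Path, Path], None]  # (sample_dir, output_path) -> None


def placeholder(label: str) -> Callable[[Path, Path], None]:
    """Hypothetical stand-in for a real tool call (e.g. a subprocess invocation)."""
    def run(sample_dir: Path, output_path: Path) -> None:
        output_path.write_text(f"{label} done for {sample_dir.name}\n")
    return run


# Stage order follows the slide; file names are illustrative assumptions
# (only Trinity.fasta is named in the talk).
STAGES: List[Stage] = [
    Stage("download", "reads.done", placeholder("download data")),
    Stage("trim_fastqc", "trim.done", placeholder("trim, fastqc")),
    Stage("diginorm", "diginorm.done", placeholder("diginorm")),
    Stage("assembly", "Trinity.fasta", placeholder("Trinity assembly")),
    Stage("evaluation", "evaluation.done", placeholder("evaluation")),
]


def run_pipeline(sample_dir: Path, stages: List[Stage]) -> List[str]:
    """Run each stage whose output is missing; return the names of stages that ran."""
    sample_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for stage in stages:
        output_path = sample_dir / stage.output_name
        if output_path.exists():
            continue  # output already present for this sample, skip the stage
        stage.action(sample_dir, output_path)
        ran.append(stage.name)
    return ran


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample_A"  # hypothetical sample name
        print(run_pipeline(sample, STAGES))  # first run: every stage executes
        print(run_pipeline(sample, STAGES))  # second run: nothing left to do
```

Deleting one stage's output (for example after switching assemblers) makes only that stage run again on the next call, which is one simple way to handle the "new software tool" case raised earlier in the talk.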
Our re-assemblies have more contigs:
[Figure: number of contigs per assembly, NCGR vs. DIB; labeled values: 17 610; 48,005; 25,059; legend: # higher in NCGR, # higher in DIB]
Questions:
1. Did we generate more biologically-meaningful content with re-assemblies?
2. Are there phylogenetic patterns in the assemblies?
Qualities of our re-assemblies are higher:
Transrate score = overall quality of the final assembly (scale 0-1.0) (Smith-Unna et al. 2016, PMID: 27252236)
1. Did we generate more biologically-meaningful content with re-assemblies?
[Figure: Transrate score, NCGR vs. DIB; labeled values: 0.31, 0.22]
Re-assemblies generally contain most of the information in the NCGR assemblies, plus ~30% more content:
1. Did we generate more biologically-meaningful content with re-assemblies?
[Figure: proportion of contigs (CRB-BLAST), NCGR and DIB; panels: comparison DIB vs. NCGR, comparison NCGR vs. DIB]
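To make the "proportion of contigs" comparison more concrete, here is a simplified sketch of a reciprocal-best-hit calculation. It is not CRB-BLAST itself: CRB-BLAST additionally rescues some non-reciprocal hits using a fitted e-value cutoff, which is omitted here. The function names, the toy best-hit tables, and the contig names are all hypothetical; in practice the tables would come from alignments run in both directions.

```python
from typing import Dict, Iterable


def reciprocal_best_hits(forward: Dict[str, str], reverse: Dict[str, str]) -> Dict[str, str]:
    """Return query -> target pairs where each sequence is the other's best hit."""
    return {q: t for q, t in forward.items() if reverse.get(t) == q}


def proportion_with_hit(contigs: Iterable[str], forward: Dict[str, str],
                        reverse: Dict[str, str]) -> float:
    """Fraction of `contigs` that have a reciprocal best hit in the other assembly."""
    contigs = list(contigs)
    if not contigs:
        return 0.0
    rbh = reciprocal_best_hits(forward, reverse)
    return sum(1 for c in contigs if c in rbh) / len(contigs)


# Toy best-hit tables (hypothetical contig names, not MMETSP data).
DIB_CONTIGS = ["d1", "d2", "d3", "d4"]
NCGR_CONTIGS = ["n1", "n2", "n3"]
DIB_TO_NCGR = {"d1": "n1", "d2": "n2", "d3": "n2"}
NCGR_TO_DIB = {"n1": "d1", "n2": "d2", "n3": "d4"}

if __name__ == "__main__":
    # Proportion of DIB contigs recovered in NCGR, and the reverse comparison.
    print(proportion_with_hit(DIB_CONTIGS, DIB_TO_NCGR, NCGR_TO_DIB))
    print(proportion_with_hit(NCGR_CONTIGS, NCGR_TO_DIB, DIB_TO_NCGR))
```

Running the comparison in both directions, as the two figure panels do, shows how much of each assembly is represented in the other; contigs without a reciprocal hit are candidates for "extra" content.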
Similar Open Reading Frame (ORF) and Benchmarking Universal Single-Copy Orthologs (BUSCO) percentages:
1. Did we generate more biologically-meaningful content with re-assemblies?
[Figure: mean ORF percentage and complete BUSCO percentage, NCGR vs. DIB]
‘dammit’ annotation pipeline: Pfam, Rfam, OrthoDB (Scott, C. in prep. 2016. www.camillescott.org/dammit)
1. Did we generate more biologically-meaningful content with re-assemblies?
After annotation, ~30% extra content appears real
[Figure: # transcripts per MMETSP sample (sorted); series: transcripts absent from NCGR, annotated absent transcripts]
[Diagram labels: DIB, NCGR, extra content]
Some DIB assemblies have more unique content.
Unique k-mers (k=25) = unique word combinations
1. Did we generate more biologically-meaningful content with re-assemblies? Probably.
[Figure: unique k-mers (DIB) vs. unique k-mers (NCGR)]
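The idea of "unique k-mers" as unique words can be sketched in a few lines. This is an exact, in-memory illustration under stated assumptions, not the method used for the talk: the function name `unique_kmers` and the toy sequences are hypothetical, reverse complements are not merged, and for full assemblies at k=25 a memory-efficient counter would normally be used instead of a Python set.

```python
from typing import Iterable, Set

VALID_BASES = frozenset("ACGT")


def unique_kmers(sequences: Iterable[str], k: int = 25) -> Set[str]:
    """Collect the distinct k-length words found in a set of contigs.

    Words containing anything other than A, C, G, T (e.g. N) are skipped.
    """
    kmers: Set[str] = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if VALID_BASES.issuperset(kmer):
                kmers.add(kmer)
    return kmers


if __name__ == "__main__":
    # Toy contigs with a small k so the words are easy to inspect.
    dib = unique_kmers(["ACGTACGG"], k=3)
    ncgr = unique_kmers(["ACGTAC"], k=3)
    print(len(dib), len(ncgr))
    print(sorted(dib - ncgr))  # words present only in the first assembly
```

Comparing the two set sizes per sample gives the kind of scatter shown on the slide, and the set difference indicates sequence content present in one assembly but not the other.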
Assemblies from Dinophyta have more unique k-mers and lower qualities.
Dinoflagellates: steady-state gene expression, translational gene regulation
Aranda et al. 2016 PMID: 28004835
Lin 2011 PMID: 21514379
Hou and Lin 2009. PMID: 27426948
2. Can we detect phylogenetic differences in the assemblies?
Unique k-mers = unique word combinations (k=25)
[Figure: assemblies by phylogenetic group; sample sizes N = 173, 111, 73, 61, 60, 60, 25, 22]
Ciliophora have lower ORF percentages
[Figure: assemblies by phylogenetic group; sample sizes N = 173, 111, 73, 61, 60, 60, 25, 22]
Ciliates: alternative triplet codon dictionary; some STOP codons serve a different purpose (they are read as amino acids)
Alkalaeva and Mikhailova 2016, PMID: 28009453
Heaphy et al. 2016, PMID: 27501944
Swart et al. 2016, PMID: 27426948
2. Are there phylogenetic differences in the assemblies? Trends.
[Figure axes: mean % ORF, # contigs]
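One plausible reading of the lower ORF percentages in ciliates is that ORFs are cut short when standard stop codons are applied to transcripts that use a different code. The sketch below illustrates that effect on a toy sequence. It is an assumption-laden illustration, not the ORF caller used for the talk: the function names and the toy contig are hypothetical, only the forward strand is scanned, and the ciliate stop set follows NCBI translation table 6 (TAA and TAG read as glutamine, TGA the only stop), which does not apply to every ciliate lineage.

```python
from typing import FrozenSet

STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
# NCBI translation table 6 (ciliate nuclear code): only TGA terminates.
CILIATE_STOPS = frozenset({"TGA"})


def longest_orf_length(seq: str, stops: FrozenSet[str]) -> int:
    """Length (nt) of the longest ATG-initiated ORF on the forward strand.

    The stop codon is not counted. An ORF with no in-frame stop is allowed
    to run to the last complete codon, since contigs can be partial.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in stops:
                if start is not None:
                    best = max(best, i - start)
                start = None
            elif codon == "ATG" and start is None:
                start = i
        if start is not None:
            end = frame + 3 * ((len(seq) - frame) // 3)
            best = max(best, end - start)
    return best


def orf_percentage(seq: str, stops: FrozenSet[str]) -> float:
    """Percentage of the contig covered by its longest ORF."""
    return 100.0 * longest_orf_length(seq, stops) / len(seq) if seq else 0.0


if __name__ == "__main__":
    # Toy contig: ATG, three codons, an in-frame TAA, three codons, then TGA.
    contig = "ATG" + "GCT" * 3 + "TAA" + "GCT" * 3 + "TGA"
    print(round(orf_percentage(contig, STANDARD_STOPS), 1))
    print(round(orf_percentage(contig, CILIATE_STOPS), 1))
```

With the standard stop set the ORF ends at the in-frame TAA, while with the ciliate stop set it continues to the TGA, so the same contig yields a much higher ORF percentage under the alternative code.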
Conclusions:
• Better reference transcriptomes for MMETSP available: https://doi.org/10.6084/m9.figshare.3840153.v6
• Strain-specific trends in assemblies support previously-reported transcriptomic features
• De novo transcriptome assembly pipeline available: https://github.com/dib-lab/dib-MMETSP
Future work:
• In-depth annotation analysis
• Orthologous groupings of contigs
• Co-expression network analysis
Contact: @monsterbashseq, ljcohen@ucdavis.edu
Acknowledgements
• Data Intensive Biology Lab
– Camille Scott, Luiz Irber
• MSU iCER
• NSF’s XSEDE, Jetstream cloud
• Substituting for my NPB101D sections today:
– Natalia Caporale, Sheryar Siddiqui, Pearl Chen, Arik Davidyan, Karl Larson
Photo by James Word
Data Intensive Biology Summer Institute, applications due March 17th!
http://ivory.idyll.org/dibsi/
Files available for download!
Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial
Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare.
https://doi.org/10.6084/m9.figshare.3840153.v6
https://github.com/dib-lab/dib-MMETSP

REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

  • 1. REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION Lisa Cohen, Harriet Alexander, C. Titus Brown Lab for Data Intensive Biology (DIB), UC Davis ASLO Aquatic Sciences meeting Session 016: Advances in Aquatic Meta-Omics March 3, 2017 @monsterbashseq ljcohen@ucdavis.edu
  • 2. Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) - 678 Illumina RNA sequence datasets = 1 TB raw data - Wide diversity spanning more than 40 phyla - Original assemblies by the National Center for Genome Resources (NCGR) Keeling et al. 2014 PMID: 24959919 Caron et al. 2016 PMID: 27867198
  • 3. Need for a modularized, extensible RNA-seq pipeline: o Software and best practices for RNA-seq analysis changing rapidly (Conesa et al. 2016, PMID: 26813401) o Accumulating more and more data! o MMETSP: awesome data set to test software and pipelines! o What to do if: • New samples to add? • New software tool is developed?
  • 4. Pipeline: metadata from NCBI (PRJNA231566) → download data → trim, fastqc → diginorm → Trinity assembly → Trinity.fasta → evaluation, annotation, expression quantification. Adapted from the Brown lab, “Eel Pond mRNA-seq Protocol”: http://eel-pond.readthedocs.io/en/latest/ (Titus Brown, Camille Scott, and Leigh Sheneman). Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare. https://doi.org/10.6084/m9.figshare.3840153.v6 1 TB raw storage, >8,000 computing hours
  • 5. Our re-assemblies have more contigs. [Figure: split violin plot of number of contigs, NCGR vs. DIB. Mean: NCGR 25,059, DIB 48,005. Assemblies with the higher value: 17 in NCGR, 610 in DIB.]
  • 6. Questions: 1. Did we generate more biologically-meaningful content with re-assemblies? 2. Are there phylogenetic patterns in the assemblies?
  • 7. Qualities of our re-assemblies are higher: 1. Did we generate more biologically-meaningful content with re-assemblies? Transrate score = overall quality of the final assembly (scale 0-1.0), Smith-Unna et al. 2016 PMID: 27252236. [Figure: Transrate score, NCGR vs. DIB. Mean: NCGR 0.22, DIB 0.31.]
  • 8. Re-assemblies generally contain most of the information in the NCGR assemblies, plus ~30% more content: 1. Did we generate more biologically-meaningful content with re-assemblies? [Figure: proportion of contigs (CRB-BLAST) for two comparisons, DIB vs. NCGR and NCGR vs. DIB.]
  • 9. Similar Open Reading Frame (ORF) content and Benchmarking Universal Single-Copy Orthologs (BUSCO) scores. 1. Did we generate more biologically-meaningful content with re-assemblies? [Figures: mean ORF percentage and complete BUSCO percentage, NCGR vs. DIB.]
  • 10. ‘dammit’ annotation pipeline: Pfam, Rfam, OrthoDB. Scott, C. in prep. 2016. www.camillescott.org/dammit 1. Did we generate more biologically-meaningful content with re-assemblies? After annotation, ~30% extra content appears real. [Figure: number of transcripts absent from NCGR (extra content in DIB) per MMETSP sample (sorted), with annotated transcripts highlighted.]
  • 11. Some DIB assemblies have more unique content. Unique k-mers (k=25), unique word combinations 1. Did we generate more biologically-meaningful content with re-assemblies? Probably. Unique k-mers (DIB) Unique k-mers (NCGR)
  • 12. Assemblies from Dinophyta have more unique k-mers and lower qualities. Dinoflagellates: steady-state gene expression, translational gene regulation Aranda et al. 2016 PMID: 28004835 Lin 2011 PMID: 21514379 Hou and Lin 2009. PMID: 27426948 N= 173 111 73 61 60 60 25 22 2. Can we detect phylogenetic differences in the assemblies? Unique k-mers = unique word combinations (k=25)
  • 13. Ciliophora have lower ORF percentages. N = 173 111 73 61 60 60 25 22. Ciliates: alternative triplet codon dictionary, STOP codon different purpose. Alkalaeva and Mikhailova 2016, PMID: 28009453 Heaphy et al. 2016, PMID: 27501944 Swart et al. 2016, PMID: 27426948 2. Are there phylogenetic differences in the assemblies? Trends. [Figure: mean % ORF vs. number of contigs.]
  • 14. Conclusions: • Better reference transcriptomes for MMETSP available: https://doi.org/10.6084/m9.figshare.3840153.v6 • Strain-specific trends in assemblies support previously-reported transcriptomic features • De novo transcriptome assembly pipeline available: https://github.com/dib-lab/dib-MMETSP Future work: • In-depth annotation analysis • Orthologous groupings of contigs • Co-expression network analysis Contact: @monsterbashseq ljcohen@ucdavis.edu
  • 15. Acknowledgements • Data Intensive Biology Lab – Camille Scott, Luiz Irber • MSU iCER • NSF’s XSEDE, Jetstream cloud • Substituting for my NPB101D sections today: – Natalia Caporale, Sheryar Siddiqui, Pearl Chen, Arik Davidyan, Karl Larson Photo by James Word Data Intensive Biology Summer Institute, applications due March 17th! http://ivory.idyll.org/dibsi/
  • 16. Files available for download! Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare. https://doi.org/10.6084/m9.figshare.3840153.v6 https://github.com/dib-lab/dib-MMETSP @monsterbashseq ljcohen@ucdavis.edu Data Intensive Biology Summer Institute, applications due March 17th! http://ivory.idyll.org/dibsi/
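
The unique k-mer comparison on slide 11 (k=25) can be illustrated with a minimal Python sketch. This is an assumption-laden toy, not the tool used for the talk: the function name `unique_kmers` is hypothetical, sequences are plain strings, and reverse complements are not merged, which dedicated k-mer counters often do.

```python
def unique_kmers(sequences, k=25):
    """Return the set of distinct k-length substrings across all sequences.

    Hypothetical helper for illustration; it does not collapse a k-mer
    with its reverse complement.
    """
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        # A sequence shorter than k yields an empty range and adds nothing.
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


if __name__ == "__main__":
    # Toy contigs standing in for two assemblies of the same sample.
    dib_contigs = ["ACGTACGTTTGA", "TTGACCA"]
    ncgr_contigs = ["ACGTACGT"]
    k = 4  # small k so the toy data produces k-mers; the slides use k=25
    print(len(unique_kmers(dib_contigs, k)), len(unique_kmers(ncgr_contigs, k)))
```

Plotting the two counts against each other for every sample gives the kind of scatter described on the slide, where points off the 1:1 line indicate that one assembly holds more distinct sequence content.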

Editor's Notes

  1. Hi, my name is Lisa Cohen, I’m a PhD student at UC Davis. Thank you for this opportunity to speak today. I would like to first acknowledge my co-authors, Harriet Alexander, who is sitting in the audience today and my advisor, Titus Brown.
  2. The Marine Microbial Eukaryotic Transcriptome Sequencing Project is a unique set of mRNA sequence data generated by a consortium of PIs who all got together and submitted their favorite marine microbial eukaryotes to one sequencing facility. These species represent more than 40 pelagic and endosymbiotic phyla, such as dinoflagellates, ciliates, and diatoms. They are both phylogenetically diverse and geographically diverse, collected from all over the world. This is a really exciting set of data for a few reasons. One is that it is one of the largest publicly available sets of RNA data with a standardized library preparation from different organisms, with a total of about 1 TB of raw sequence data. Second, it’s purposefully built, not a metatranscriptome. We technically know who is supposed to be in this data set, so we are generating reference transcriptomes for all of these species, some of which have never had any reference transcriptomes or genomes before. Right after data were sequenced, the NCGR assembled the transcriptomes as references with their own pipeline, using the genome assembler ABySS with some modifications and post-processing for transcriptomes.
     Image credits, bottom panel, left to right:
     1. Elphidium margaritaceum http://zoology.bio.spbu.ru/Eng/Sci/Korsun/Foram2_E-margaritaceum.jpg
     2. Acanthamoeba https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Parasite140120-fig3_Acanthamoeba_keratitis_Figure_3B.png/220px-Parasite140120-fig3_Acanthamoeba_keratitis_Figure_3B.png
     3. Gonyaulax spinifera http://www.sms.si.edu/IRLSpec/images/Gonyaulax_Lg.jpg
     4. Asterionellopsis glacialis http://www.smhi.se/oceanografi/oce_info_data/plankton_checklist/diatoms/asterionellopsis_glacialis.gif
     5. Tetraselmis http://cfb.unh.edu/phycokey/Choices/Chlorophyceae/unicells/flagellated/TETRASELMIS/Tetraselmis_06_500x345.jpg
     6. Oxyrrhis marina http://cfb.unh.edu/phycokey/Choices/Dinophyceae/NonPS-dinos/OXYRRHIS/Oxyrrhis_04_300x246_marina.jpg
     7. Alexandrium http://www.whoi.edu/cms/images/dfino/2006/6/Alexandrium_en_11187_26907.jpg
     8. Pseudonitzschia https://upload.wikimedia.org/wikipedia/commons/5/5e/Pseudonitzschia2.jpg
     9. Chlamydomonas https://web.mst.edu/~microbio/BIO221_2009/images_2009/chlamydomonas-3.jpg
     10. Emiliania_huxleyi https://upload.wikimedia.org/wikipedia/commons/d/d9/Emiliania_huxleyi_coccolithophore_(PLoS).png
     11. Symbiodinium http://www.personal.psu.edu/tcl3/index.html
     12. Phaeocystis antarctica http://www.esf.edu/antarctica/images/Phaeo_montage2.jpg
     13. Micromonas http://roscoff-culture-collection.org/sites/default/files/field/image/micromonas-colored-350_0.jpg
     14. Karenia brevis http://www.sms.si.edu/irlspec/images/Kareni_brevis_2.jpg
     15. Thalassiosira pseudonana http://genome.jgi.doe.gov/Thaps3/Tpseudonana.jpg
     16. Ditylum_brightwellii https://cimt.pmc.ucsc.edu/images/HAB%20ID/diatom/Ditylum_brightwellii.jpg
  3. So, when I was starting my PhD about a year and a half ago, it was becoming apparent that software and best practices for RNA sequencing analysis and de novo transcriptome assembly are not standardized and are changing rapidly. Pipelines developed for model animal species do not necessarily hold true for all species. We’re also collecting more and more data, RNA sequence data in particular. The MMETSP data are great to use to test software and analysis pipelines! Because of the data set's size and because the organisms are diverse, we can better understand how these tools are performing with data from different species. Some of the problems that I and others in Titus Brown’s lab think about are: what happens if a PI wants to submit just one more sample? What happens if shiny new tools are developed?
  4. Our modularized pipeline, which I wrote in Python, attempts to address these issues. It takes metadata from any data set in NCBI as input and decides which samples to run. Raw sequence reads are downloaded from NCBI, quality trimmed, checked with fastqc, run through digital normalization, then assembled using the Trinity transcriptome assembler. I’m glossing over a lot of details here because there is not enough time, but if you are interested please see me afterwards to talk. There is also a tutorial available, called the “Eel pond protocol”, which is open access and has a small subset of data to run through the steps of a de novo assembly with Trinity. A benefit of this pipeline to highlight is that you can pick up from where you left off if something crashes. As anyone who has used an institutional high performance computing cluster knows, stuff breaks and stops running. With this pipeline, if something stops, you can start it again. I also want to mention that this data set, with 1 TB of raw data, pushes the limits of high performance computing clusters in terms of storage and compute resources. This took more than 8,000 computing hours. We have found that the resources required for these >600 assemblies are not trivial, and should be a consideration when planning for a project of this size in the future.
  5. In evaluating our assemblies, it appears that our re-assemblies have more contigs. A contig is a linear prediction of a full transcript by the assembly software. In subsequent slides, I’ll be showing figures similar to this one, so I want to orient you first. On the y-axis is what we’re measuring; here it’s the number of contigs. This is a split violin plot showing the frequency distribution around the mean of each pipeline. In blue on the right are our re-assemblies, which I’ve labeled “DIB” because we’re the Data Intensive Biology lab. In gray on the left are assemblies from NCGR. The number on top in blue shows the number of assemblies where DIB has a higher value than NCGR, and the number in gray shows where NCGR has the higher value. In this case, we see that there were more DIB assemblies with higher numbers of contigs in comparison to the NCGR. The mean of DIB is around 48,000 contigs, with some samples producing up to 190,000 contigs up here towards the tail of the distribution. The mean of NCGR is around 25,000 contigs, and fewer assemblies have high numbers of contigs; the highest is about 100,000. So, these differences were interesting for us, and we came up with some questions (click)
  6. One, we’re interested to see whether we’ve assembled more things. It could be that this is just fragmented junk. But it could also be relevant, being able to resolve allelic variants or alternative splicing. Or just pieces of the same transcript. Theoretically, each contig is supposed to represent one transcript, but we can’t really say that yet. The second question has to do with the biological differences in our samples. They are from different taxonomic groupings. So, we’re wondering if the software is performing differently based on what species the data come from. The relationship between raw data content and assembly quality is not well understood. So, with this data set, we’re wondering whether the nucleic acid sequence information is being handled differently by the software tools.
  7. The qualities of our assemblies appear to be higher. Transrate is a software tool that was developed to help you understand your transcriptome based on a variety of metrics. One of those metrics is an overall synthetic quality score for a transcriptome, which is called the “transrate score”. Our mean transrate score is 0.31, while the NCGR transrate score is 0.22.
  8. In addition to having higher quality scores, there appears to be more content. The proportion of contigs from a comparison called a reciprocal best BLAST of NCGR vs. our DIB assemblies indicates that most of the content found in NCGR is also found in the DIB re-assemblies, but also that there is extra information in the DIB assemblies not found in the NCGR assemblies. This information was obtained by aligning the two assemblies against each other both ways: first with NCGR as the reference, then the reverse with DIB as the reference. (Engage with audience:) As you can see here, our peak is about 0.8, or 80%. This means that we’re capturing 80% of the content in the NCGR assemblies. On the other hand, NCGR assemblies capture about 50% of the content of our assemblies. The difference is about 30%. The ~30% difference between these two BLAST comparisons leads us to still question whether we have just assembled junk or if we actually have higher resolution assemblies.
  9. (Orient audience to graphs: ORF on the y-axis at left.) Even though we have more contigs, the proportion of open reading frame protein-coding regions detected is similar, if not more tightly distributed towards the upper range. Most of the assemblies have slightly higher ORF content. On the right are BUSCO percentages; BUSCO is a set of benchmarking universal single-copy orthologs expected to be found in all eukaryotic transcriptomes, like housekeeping genes. While there are problems with using BUSCO scores as an absolute measurement of assembly quality, they can serve as a comparative metric relative to another pipeline. Our assemblies have a similar if not slightly higher BUSCO content relative to NCGR. So, at least these haven’t gone down. The extra content we found is probably not all junk.
  10. In digging deeper into the extra content, this is a plot of ONLY this extra content in the blue part. Samples are across the x-axis, sorted by the number of extra contigs on the y-axis. (pause, let this sink in, take a drink or something) Highlighted in green is the number of these extra contigs that are actually annotated to a known gene. I annotated the re-assemblies using this really great tool out of our lab by Camille Scott called ‘dammit’. No, it’s not an acronym, it was named out of frustration: “Just annotate it, dammit!” The dammit pipeline uses the highly-curated Pfam protein domain and Rfam RNA family databases as well as OrthoDB with conserved orthology domains. About 1/3 of the extra content has annotations.
  11. This is a pared-down example of what the annotations look like from one of the dinoflagellate samples, to illustrate some of our frustrations with contigs and annotations. The assembler will recognize a contig as a transcript, then the dammit pipeline will find matches with the databases. There are usually multiple proteins that match, so I’ve chosen the top e-value match so that there is only one protein annotated per transcript. Here you can see that there are multiple contigs annotated as the same protein, glycoprotein glucosyltransferase. So, it has been a challenge to sift through all of this. But, again, it is great to have these annotations. Image: Prorocentrum minimum https://www.eoas.ubc.ca/research/phytoplankton/dinoflagellates/prorocentrum/p_minimum.html
  12. Here we are comparing the raw sequence content, regardless of annotation, in terms of the number of unique k-mers, or unique word combinations, with a k length of 25. We see that our assemblies fall above the 1:1 expectation, meaning that our assemblies have more unique words compared to the NCGR assemblies. This is kind of like taking two versions of the same book and digesting them down into the individual 25-letter words found in the book. We found that our assemblies have more unique words than NCGR. Therefore, we are able to answer that our assemblies probably have a bit more biologically-meaningful content.
  13. To address our second question about whether we can detect phylogenetic differences in the assemblies, we took a look at some of the assembly metrics grouped by taxa. (Explain figures: unique k-mers on the y, input reads on the x, colors indicate different taxa, plotting mean and stdev.) The dinoflagellates appear to have more unique k-mer content. This seems to make sense, knowing that dinoflagellates have this steady-state gene expression thing going on, where they just keep expressing genes on and on, then regulate more at the translational level. As far as the software goes, it might be useful to incorporate strain-specific information like this into assembly software.
  14. Here again, colors are different taxonomic groupings, with the mean percentage of open reading frame predictions on the y and the number of transcripts on the x. We see here that ciliate assemblies appear to have a lower open reading frame percentage. This is interesting since it has recently been found that ciliates have an alternative triplet codon dictionary, with codons normally encoding STOP serving a different purpose. Dinoflagellates here have this high open reading frame content, and lots of contigs. In this case, it is useful to know that our assembly evaluation tools might perform outside the range of what is normal for the organisms in question. The assemblies are not necessarily lower quality, but may be perceived as lower in quality because of cool and unique features like this.
  15. Strain-specific trends may lead to understanding how raw data content affects the overall assembly quality
  16. Thank the Moore Foundation first.
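
The "pick up from where you left off" behavior described in note 4 can be sketched with per-step marker files. This is only a minimal illustration of the idea, assuming one marker file per completed step; the function names and the `.done` convention are hypothetical and are not taken from the dib-MMETSP code.

```python
from pathlib import Path


def run_step(name, workdir, action):
    """Run `action` once; skip it if a marker from an earlier run exists."""
    marker = Path(workdir) / f"{name}.done"
    if marker.exists():
        return False  # finished in a previous run, nothing to do
    action()          # if this raises, no marker is written
    marker.touch()    # record success only after the step completed
    return True


def run_pipeline(workdir, steps):
    """Run (name, action) pairs in order; return the names that actually ran."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    return [name for name, action in steps if run_step(name, workdir, action)]
```

With steps such as download, trim, diginorm, and assembly passed in as `(name, action)` pairs, a crashed run can be restarted with the same call and only the unfinished steps execute again.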