NON-MODEL ORGANISMS AND 
DATA-INTENSIVE BIOLOGY 
C. Titus Brown 
Assistant Professor 
MMG / CSE
Outline 
• The Molgulid story: investigating non-model 
ascidians ( this is the biology) 
• Meditations on data analysis. 
• Methods, methods, methods. 
• Training, training, training.
• Concluding thoughts
The Molgula Story – an int’l collaboration 
Elijah Lowe 
(MSU; Naples?) 
Billie Swalla (UW, BEACON) 
Lionel Christiaen (NYU); 
Claudia Racioppi (Naples; NYU)
…to the urochordates we go!
Putnam et al., 2008, Nature.
Modified from Swalla 2001.
Filter feeding adults 
Molgula oculata 
Molgula occulta 
Molgula oculata Ciona intestinalis 
Elijah Lowe; collaboration w/Billie Swalla
Challenging organisms to work on! 
Molgula occulta & M. oculata: 
• Only spawn ~1 month out of the year 
• Located off the northern coast of France 
• Hybrids not found outside of lab conditions 
• Species cannot be cultured 
• Wet lab techniques are not fully developed for
these species
• No genomic resources (as of 2008).
Billie Swalla, Nadine Peyriéras, Alberto Stolfi
Tail loss and notochord 
a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta 
Notochord cells in orange
Swalla, B. et al., Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996
Molgula clades – tail loss is derived
Solitary ascidians 
have determinate
and invariant cleavage. 
Some species have 
colored cytoplasms. 
(Boltenia villosa) 
The cell lineage is very 
similar in Ciona, Phallusia, 
Halocynthia roretzi & 
Molgula oculata.
Molgula occidentalis 
Ciona intestinalis
Notochord formation (convergence & 
extension) in ascidians is highly 
conserved. 
Ciona savignyi Jiang and Smith, 2007
Notochord Formation 
in Molgulids 
Molgula oculata notochord 
(40 cells, converged & extended) 
Molgula occulta no notochord 
(20 cells, not converged & extended) 
Hybrid notochord 
(20 cells, converged & extended) 
Swalla and Jeffery, 1996
First we applied mRNAseq… 
Lowe et al., in review (PeerJ). https://peerj.com/preprints/505/
…which gave us entire transcriptomes… 
Lowe et al., in review (PeerJ). https://peerj.com/preprints/505/
…then we sequenced their genomes... 
• 3 species: 
Molgula occidentalis (tailed) – “MOXI” 
Molgula oculata (tailed) – “MOCU” 
Molgula occulta (tail-less) – “MOCC” 
• 3 lanes: 300-400 bp; 650-750 bp; 900-1000 bp 
• ≥ 200X coverage each genome 
De novo assembly by Elijah Lowe (MSU) 
Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
…which gave us most of their genes (and 
regulatory elements?) 
Genome assembly statistics: 
Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
Shift in differentially expressed genes from 
gastrulation to neurulation 
M. ocu vs. M. occ gastrula M. ocu vs. M. occ neurula 
Differentially expressed during neurulation in M. ocu vs M. occ 
Elijah Lowe
Notochord gene expression similar to tailed 
species 
Expression difference, hybrid vs. parent species
(plot axes: log2(hybrid) - log2(oculata) and log2(hybrid) - log2(occulta))
Elijah Lowe
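A rough sketch of the quantity plotted on this slide, the log2 expression difference between the hybrid and each parent species. The pseudocount, the gene name, and the expression values are illustrative assumptions and are not taken from the talk or the underlying data.

```python
import math


def log2_difference(hybrid, parent, pseudocount=1.0):
    """Return log2(hybrid) - log2(parent).

    The pseudocount is an assumption added here so that genes with zero
    expression in one sample do not produce log2(0).
    """
    return math.log2(hybrid + pseudocount) - math.log2(parent + pseudocount)


# Hypothetical normalized expression values for one gene.
gene = {"hybrid": 127.0, "oculata": 127.0, "occulta": 7.0}

vs_oculata = log2_difference(gene["hybrid"], gene["oculata"])
vs_occulta = log2_difference(gene["hybrid"], gene["occulta"])
```

A value near zero against one parent and a large value against the other would place the hybrid close to the first parent for that gene.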
Heterochronic Shift in Molgulidae Development
*79 genes examined 
across six species
Transgenics of reporter constructs 
(“Mutual intelligibility” across ~350 my) 
Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
Prickle is a key part of the notochord program. 
Veeman, M., et al., 2007 
• Planar cell
polarity (PCP)
pathway
• Involved in
convergence and
extension
Prickle expressed in notochord cells of 
tailless ascidians. 
Mita et al., Zool. Sci., 2010
M. occulta gastrulation 
Ciona intestinalis 
Satoh, Nature Reviews Genetics 4, 2003
FGF Bra Pk 
Elijah Lowe
(Re)booting the Molgula -- 
• Determined conservation of cardiopharyngeal 
developmental program, despite shifts in cis-regulatory 
sequences (Stolfi et al., eLife, 2014).
• Examining heterochronic shifts in developmental timing 
(tail loss) (Maliska et al., in preparation). 
• Connecting evolutionary shifts in developmental gene 
regulatory networks with conserved molecular profiles 
(Lowe et al., submitted; Lowe et al., in preparation).
More thoughts on Molgula 
• One grad student, two transcriptomes, three genomes, 
four years… 
• Genomic resources are enabling a sprawling international 
collaboration (UW/BEACON, MSU/BEACON, NYU, 
Naples, Paris) 
• Methods development is key!
How Science Works 
Data 
Analysis 
Data generation
Luckily, data analysis is cheap and easy!
Err, well, actually… 
Data generation 
Data Analysis 
http://www.pixelpog.com/ftpimages/GnomesAttack.jpg
It is now easy to generate sequencing 
data sets of such a size and scale that 
the first round analysis cannot even be 
completed.
My research: 
theoretical => applied solutions to scale. 
Theoretical advances 
in data structures and 
algorithms 
Practically useful & usable 
implementations, at scale. 
Demonstrated 
effectiveness on real data.
My research: three methods. 
1. Adaptation of a suite of probabilistic data structures for 
representing set membership and counting (Bloom filters 
and Count-Min Sketch). (Zhang et al., PLoS One, 2014.)
2. An online streaming approach to lossy compression of 
sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.) 
3. Compressible de Bruijn graph representation for 
assembly. (Pell et al., PNAS, 2012.)
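To make method 1 concrete, here is a minimal sketch of a Count-Min Sketch used for k-mer counting. It is not the implementation from Zhang et al.; the class name, the table sizes, and the use of SHA-256 as the hash function are assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate a count, never underestimates."""

    def __init__(self, width=1000, depth=4):
        self.width = width  # number of counters per row (illustrative size)
        self.depth = depth  # number of independent hash rows
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One position per row; the row number is mixed into the hash input
        # so that each row behaves like a different hash function.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._indexes(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._indexes(item))


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


sketch = CountMinSketch(width=64, depth=3)
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)
```

Memory use is fixed by `width * depth` no matter how many distinct k-mers arrive, which is the property that matters at sequencing scale. The price is that some counts come back too high.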
Method #2 - Digital normalization 
(a computational version of library normalization) 
Suppose you have a
dilution factor of A (10) to
B (1). To get 10x coverage of B you
need to get 100x coverage of A!
Overkill!! 
This 100x will consume 
disk space and, because 
of errors, memory. 
We can discard it for 
you…
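The arithmetic on this slide, written out as a small sketch. The function and parameter names are hypothetical; the numbers are the slide's own 10:1 example.

```python
def coverage_of_abundant(target_rare_coverage, abundance_ratio):
    """Coverage reached by the abundant component (A) once the rare
    component (B) reaches target_rare_coverage.

    abundance_ratio is A:B, e.g. 10 for the 10:1 case on the slide.
    """
    return target_rare_coverage * abundance_ratio


target = 10                                   # want 10x coverage of B
abundant = coverage_of_abundant(target, 10)   # A:B = 10:1
excess = abundant - target                    # coverage of A beyond the target
```

With these inputs A ends up at 100x, of which 90x is beyond what was needed for B.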
Digital normalization
Digital normalization retains information, while 
discarding data and errors
Digital normalization approach 
Diginorm, a digital analog to cDNA library normalization:
• Streaming & single pass: looks at each read at most 
once; 
• Does not “collect” the majority of errors; 
• Keeps all low-coverage reads; 
• Smooths out coverage of sequencing. 
=> 
Enables analyses that are otherwise completely 
impossible.
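The bullets above can be sketched in a few lines, following the published description of digital normalization: estimate a read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. This is a simplified illustration, not the production code. It uses an exact `Counter` where the real method uses a probabilistic counter, and the values of `k` and `cutoff` are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield the reads to keep, looking at each read once.

    A read is kept only while the median count of its k-mers is below
    `cutoff`; only kept reads update the counts, so most k-mers from
    discarded (redundant) reads are never stored.
    """
    counts = Counter()  # exact stand-in for a memory-efficient counter
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # assumption: reads shorter than k are skipped
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read


# Ten copies of a high-coverage read plus one low-coverage read.
reads = ["ACGTTGCA"] * 10 + ["GGGGCCCC"]
kept = list(digital_normalization(reads, k=4, cutoff=3))
# kept holds 3 copies of "ACGTTGCA" and the single "GGGGCCCC"
```

The high-coverage read is capped at the cutoff while the low-coverage read survives, which is the "smooths out coverage" behavior from the slide.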
Witness the power of this fully operational 
set of sequence analysis methods: 
1. Assembling soil metagenomes. 
Howe et al., PNAS, 2014 (w/Tiedje) 
2. Understanding bone-eating worm symbionts. 
Goffredi et al., ISME, 2014. 
3. An ultra-deep look at the lamprey transcriptome. 
Scott et al., in preparation (w/Li) 
4. Understanding development in Molgulid ascidians. 
Stolfi et al., eLife, 2014; etc.
Open science 
Guiding principle: methods that aren’t broadly 
available aren’t very useful. 
(=> Preprints, open source code, blog posts, Twitter, 
training, etc.) 
Estimated ~1000 users of our software. 
Diginorm now included in Trinity software from Broad 
Institute (~10,000 users) 
Illumina TruSeq long-read technology now 
incorporates our approach (~100,000 users)
Current research: 
Compressive algorithms for sequence 
analysis 
Raw data 
(~10-100 GB) Analysis 
"Information" 
~1 GB 
"Information" 
"Information" 
"Information" 
"Information" 
Database & 
integration 
Compression 
(~2 GB) 
Can we enable and accelerate sequence-based 
inquiry by making all basic analysis 
easier and some analyses possible?
The data challenge in biology 
In 5-10 years, we will have nigh-infinite data. 
(Genomic, transcriptomic, proteomic, metabolomic, 
…?) 
We currently have no good way of querying, 
exploring, investigating, or mining these data sets, 
especially across multiple locations.
Moreover, most data is unavailable until after 
publication… 
…which, in practice, means it will be lost.
Infrastructure: distributed graph database server 
Web interface + API 
Compute server 
(Galaxy? 
Arvados?) 
Data/ 
Info 
Raw data sets 
Public 
servers 
"Walled 
garden" 
server 
Private 
server 
Graph query layer 
Upload/submit 
(NCBI, KBase) 
Import 
(MG-RAST, 
SRA, EBI)
“Data Intensive Biology” 
• Increasingly, relevant data is out there or can be 
generated fairly inexpensively. 
• But what does the data mean? How can we get it to yield 
putative answers? How can we integrate it with other 
people’s data? 
• Virtually nobody in biology is trained to do this. 
• Virtually nobody in biology is being trained in how to do 
this.
Summer NGS workshop (2010-2017)
Perspectives on training 
• Prediction: The single biggest 
challenge facing biology over the 
next 20 years is the lack of data 
analysis training (see: NIH DIWG 
report) 
• Data analysis is not turning the 
crank; it is an intellectual exercise 
on par with experimental design or 
paper writing. 
• Training is systematically 
undervalued in academia (!?)
Training - looking forward 
• NIH “Big Data 2 Knowledge” (BD2K) will be investing 
~$20-40m in training each year (my estimate). 
Biomedical science increasingly depends on data 
analysis. 
• Moore, Sloan Foundations are investing heavily in training 
(see: Software Carpentry) 
• NSF BIO Centers have stated that “training is the second 
most important problem that all of us have”.
My training efforts – looking backwards 
• Approximately $600k of my funding has been received for 
developing and implementing training. 
• “Students” have included about a dozen associate & full 
professors; over 120 alumni of summer course in total. 
• Invited talks, collaborations, problem discovery, networking, 
interaction with program managers, and volleyball. 
• Strong pushback from every level of the administration at 
MSU!? But enthusiastic support from many research-active 
faculty. 
(Investing in data science should be part of MMG’s vision for the
future…)
About those STEM career paths… 
Quote: 
“…foisting graduates upon a carcass-strewn 
jobless dystopia.” 
Dr. Rebecca Schuman,
https://chroniclevitae.com/news/702-crimes-against-dissertation-humanity
Want a faculty job? 
http://www.ascb.org/ascbpost/index.php/compass-points/item/285-where-will-a-biology-phd-take-you
Want a faculty job? Don’t count on it. 
< 10% of entering PhD students will become 
tenure track faculty.* 
53% rank research professorships as their desired 
career.* 
(Optimism is great! But…) 
Note: universities have little provision for 
permanent non-tenure-track positions. 
* http://www.ascb.org/ascbpost/index.php/compass-points/item/285-where-will-a-biology-phd-take-you
(Sorry. I thought you should all know.)
Alternatives to tenure track. 
PhD research prepares you marvelously for 
tackling an immense range of problems!! 
Biotech, startups, research institutes, teaching, 
science communication… 
(PhD advisors generally do not do such a good job 
of preparing you for non-tenure track positions.) 
Papers are necessary to graduate but insufficient 
to get you a non-academic job afterwards.
Wrapping it all up 
• There are great opportunities in our increasing ability to 
generate data! 
• Data analysis is rapidly becoming a first class citizen in 
biology. 
• We aren’t training people in data analysis approaches. 
• …this would help them find jobs, too.
Funding
Students and postdocs 
Former: 
• Dr. Jason Pell (Google NYC) 
• Asst Professor Adina Howe (Iowa State) 
Current:
• Dr. Likit Preeyanon (MMG) 
• Elijah Lowe (CSE) 
• Qingpeng Zhang (CSE) 
• Jaron Guo (MMG) 
• Camille Scott (CSE) 
• Michael Crusoe 
• Luiz Irber (CSE) 
• Dr. Sherine Awad (MMG)
Support network 
Dr. Vivien Bonazzi, my fairy 
NIH program officer. 
Dr. Jim Tiedje, He Who 
Comes with Sequence
Co-conspirators / family 
Thanks! 
(1994)

2014 mmg-talk

  • 1. NON-MODEL ORGANISMS AND DATA-INTENSIVE BIOLOGY C. Titus Brown Assistant Professor MMG / CSE
  • 2. Outline • The Molgulid story: investigating non-model ascidians ( this is the biology) • Meditations on data analysis. • Methods, methods, methods. •Training, training, training. • Concluding thoughts
  • 3. The Molgula Story – an int’l collaboration Elijah Lowe (MSU; Naples?) Billie Swalla (UW, BEACON) Lionel Christiaen (NYU); Claudia Racioppi (Naples; NYU)
  • 4. …to the urochordateswe go! Putnam et al., 2008, Modified from SwaNllaa t2u0r0e1.
  • 5. Filter feeding adults Molgula oculata Molgula occulta Molgula oculata Ciona intestinalis Elijah Lowe; collaboration w/Billie Swalla
  • 6. Challenging organisms to work on! Molgula occulta & M. oculata: • Only spawn ~1 month out of the year • Located off the northern coast of France • Hybrids not found outside of lab conditions • Species cannot be cultured •Wet lab techniques are not fully developed for species • No genomic resources (as of 2008).
  • 7. Billie Swalla, Nadine Peyriéras, Alberto Stolfi
  • 8. Tail loss and notochord a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta Notochord cells in orange Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208 , 15 November 1996
  • 9. Molgula clades – tail loss is derived
  • 10. Solitary ascidians have determinant and invariant cleavage. Some species have colored cytoplasms. (Boltenia villosa) The cell lineage is very similar in Ciona, Phallusia, Halocynthia roretzi & Molgula oculata.
  • 12.
  • 13. Notochord formation (convergence & extension) in ascidians is highly conserved. Ciona savignyi Jiang and Smith, 2007
  • 14. Notochord Formation in Molgulids Molgula oculata notochord (40 cells, converged & extended) Molgula occulta no notochord (20 cells, not converged & extended) Hybrid notochord (20 cells, converged & extended) Swalla and Jeffery, 1996
  • 15. First we applied mRNAseq… Lowe et al., in review (PeerJ). https://peerj.com/preprints/505/
  • 16. …which gave us entire transcriptomes… Lowe et al., in review (PeerJ). https://peerj.com/preprints/505/
  • 17. …then we sequenced their genomes... • 3 species: Molgula occidentalis (tailed) – “MOXI” Molgula oculata (tailed) – “MOCU” Molgula occulta (tail-less) – “MOCC” • 3 lanes: 300-400 bp; 650-750 bp; 900-1000 bp • ≥ 200X coverage each genome De novo assembly by Elijah Lowe (MSU) Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
  • 18. …which gave us most of their genes (and regulatory elements?) Genome assembly statistics: Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
  • 19. Shift in differentially expressed genes from gastrulation to neurulation M. ocu vs. M. occ gastrula M. ocu vs. M. occ neurula Differentially expressed during neurulation in M. ocu vs M. occ Elijah Lowe
  • 20. Notochord gene expression similar to tailed species -10 -5 0 5 10 15 -10 -5 0 5 10 15 Expression difference Hybrid vs Parent species log2(hybrid)-log2(oculata) log2(hybrid)-log2(occulta) Elijah Lowe
  • 21. Heterochronic Shift in MolgulidaeDevelopment *79 genes examined across six species
  • 22. Transgenics of reporter constructs (“Mutual intelligibility” across ~350 my) Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
  • 23. Prickle is a key part of the notochord program. Veeman, M., et al., 2007 •Planar cell polarity (PCP) pathway •Involved in convergence and extension
  • 24. Prickle expressed in notochord cells of tailless ascidians. Mita et al Zool. Sci., 2010 M. occulta gastrulation Ciona intestinalis Satoh Nature Reviews Genetics 4, 2003 FGF Bra Pk Elijah Lowe
  • 25. (Re)booting the Molgula -- • Determined conservation of cardiopharyngeal developmental program, despite shifts in cis-regulatory sequences (Stolfi et al, eLife, 2014). • Examining heterochronic shifts in developmental timing (tail loss) (Maliska et al., in preparation). • Connecting evolutionary shifts in developmental gene regulatory networks with conserved molecular profiles (Lowe et al, submitted; Lowe et al., in preparation).
  • 26. More thoughts on Molgula • One grad student, two transcriptomes, three genomes, four years… • Genomic resources are enabling a sprawling international collaboration (UW/BEACON, MSU/BEACON, NYU, Naples, Paris) • !Methods development key!
  • 27. How Science Works Data Analysis Data generation
  • 28. Luckily, data analysis is cheap and easy!
  • 29. Err, well, actually… Data generation Data Analysis http://www.pixelpog.com/ftpimages/GnomesAttack.jpg
  • 30. It is now easy to generate sequencing data sets of such a size and scale that the first round analysis cannot even be completed.
  • 31. My research: theoretical => applied solutions to scale. Theoretical advances in data structures and algorithms Practically useful & usable implementations, at scale. Demonstrated effectiveness on real data.
  • 32. My research: three methods. 1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and CountMin Sketch). (Zhang et al., PLoS One, 2014.) 2. An online streaming approach to lossy compression of sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.) 3. Compressible de Bruijn graph representation for assembly. (Pell et al., PNAS, 2012.)
  • 33. Method #2 - Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
  • 40. Digital normalization retains information, while discarding data and errors
  • 41. Digital normalization approach A digital analog to cDNA library normalization, diginorm: • Streaming & single pass: looks at each read at most once; • Does not “collect” the majority of errors; • Keeps all low-coverage reads; • Smooths out coverage of sequencing. => Enables analyses that are otherwise completely impossible.
  • 42. Witness the power of this fully operational set of sequence analysis methods: 1. Assembling soil metagenomes. Howe et al., PNAS, 2014 (w/Tiedje) 2. Understanding bone-eating worm symbionts. Goffredi et al., ISME, 2014. 3. An ultra-deep look at the lamprey transcriptome. Scott et al., in preparation (w/Li) 4. Understanding development in Molgulid ascidians. Stolfi et al, eLife 2014; etc.
  • 43. Open science Guiding principle: methods that aren’t broadly available aren’t very useful. (=> Preprints, open source code, blog posts, Twitter, training, etc.) Estimated ~1000 users of our software. Diginorm now included in Trinity software from Broad Institute (~10,000 users) Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)
  • 44. Current research: Compressive algorithms for sequence analysis Raw data (~10-100 GB) Analysis "Information" ~1 GB "Information" "Information" "Information" "Information" Database & integration Compression (~2 GB) Can we enable and accelerate sequence-based inquiry by making all basic analysis easier and some analyses possible?
  • 45. The data challenge in biology In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?) We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.. Moreover, most data is unavailable until after publication… …which, in practice, means it will be lost.
  • 46. Infrastructure: distributed graph database server [Diagram labels: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI)]
  • 47. “Data Intensive Biology” • Increasingly, relevant data is out there or can be generated fairly inexpensively. • But what does the data mean? How can we get it to yield putative answers? How can we integrate it with other people’s data? • Virtually nobody in biology is trained to do this. • Virtually nobody in biology is being trained in how to do this.
  • 48. Summer NGS workshop (2010-2017)
  • 49. Perspectives on training • Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report) • Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing. • Training is systematically undervalued in academia (!?)
  • 50. Training - looking forward • NIH “Big Data 2 Knowledge” (BD2K) will be investing ~$20-40m in training each year (my estimate). Biomedical science increasingly depends on data analysis. • Moore, Sloan Foundations are investing heavily in training (see: Software Carpentry) • NSF BIO Centers have stated that “training is the second most important problem that all of us have”.
  • 51. My training efforts – looking backwards • Approximately $600k of my funding has been received for developing and implementing training. • “Students” have included about a dozen associate & full professors; over 120 alumni of summer course in total. • Invited talks, collaborations, problem discovery, networking, interaction with program managers, and volleyball. • Strong pushback from every level of the administration at MSU!? But enthusiastic support from many research-active faculty. (Investing in data science should be part of MMG’s vision for the future…)
  • 52. About those STEM career paths… Quote: “…foisting graduates upon a carcass-strewn jobless dystopia.” Dr. Rebecca Schuman, https://chroniclevitae.com/news/702-crimes-against-dissertation-humanity
  • 53. Want a faculty job? http://www.ascb.org/ascbpost/index.php/compass-points/item/285-where-will-a-biology-phd-take-you
  • 54. Want a faculty job? Don’t count on it. < 10% of entering PhD students will become tenure track faculty.* 53% rank research professorships as their desired career.* (Optimism is great! But…) Note: universities have little provision for permanent non-tenure-track positions. * http://www.ascb.org/ascbpost/index.php/compass-points/item/285-where-will-a-biology-phd-take-you
  • 55. (Sorry. I thought you should all know.)
  • 56. Alternatives to tenure track. PhD research prepares you marvelously for tackling an immense range of problems!! Biotech, startups, research institutes, teaching, science communication… (PhD advisors generally do not do such a good job of preparing you for non-tenure track positions.) Papers are necessary to graduate but insufficient to get you a non-academic job afterwards.
  • 57. Wrapping it all up • There are great opportunities in our increasing ability to generate data! • Data analysis is rapidly becoming a first class citizen in biology. • We aren’t training people in data analysis approaches. • …this would help them find jobs, too.
  • 59. Students and postdocs Former: • Dr. Jason Pell (Google NYC) • Asst Professor Adina Howe (Iowa State) Current: • Dr. Likit Preeyanon (MMG) • Elijah Lowe (CSE) • Qingpeng Zhang (CSE) • Jaron Guo (MMG) • Camille Scott (CSE) • Michael Crusoe • Luiz Irber (CSE) • Dr. Sherine Awad (MMG)
  • 60. Support network Dr. Vivien Bonazzi, my fairy NIH program officer. Dr. Jim Tiedje, He Who Comes with Sequence
  • 62. Co-conspirators / family Thanks! (1994)