Can we get scientists
to share data through
self-interest?
C. Titus Brown
UC Davis
ctbrown@ucdavis.edu
Thanks, Nick!
This is an attempt to explain why I pitched this:
http://ivory.idyll.org/blog/2014-moore-ddd-talk.html
and talk about what I’d like to do with the money.
The way data commonly
gets published
Gather data → Analyze data → Write paper → Publish paper and data
Many failure modes:
Gather data → Analyze data → Write paper → Publish paper and data
(X: the pipeline breaks)
Lack of expertise;
Lack of tools;
Lack of compute;
Bad experimental design.
Many failure modes:
Gather data → Analyze data → Write paper → Publish paper and data
(X: the pipeline breaks)
(The usual reasons)
One failure mode in
particular:
Gather data → Analyze data → Write paper → Publish paper and data
Other data
(X: the connection to other data is broken)
Lots of biological data doesn’t make sense, except in the light of other data.
This is especially true in two of the fields I work in: environmental metagenomics and non-model mRNAseq.
(For example: gene annotation by homology)
[Figure labels: anything else; mollusc; cephalopod; no similarity]
Hmm.
[Figure labels: data publication; data analysis]
I believe:
There are many interesting and useful data sets
immured behind lab walls by lack of:
• Expertise
• Tools
• Compute
• Well-designed experimental setup
• Pre-analysis data publication culture in biology
• Recognition that sometimes hypotheses just get in
the way
• Good editorial judgment
(Side note)
The existence of journals that will let you publish
virtually anything should have really helped data
availability!
Sadly, many of them don’t enforce data publication
rules.
Data publications!
The obvious solution: data pubs!
(“Pre-publication data sharing”)
Make your data available so that others can cite it!
GigaScience, Data Science, etc.
…but we don’t yet reward this culturally in biology.
(True story: no one cares, yet.)
I’m actually uncertain myself about how much we should
reward data and source code pubs. But we can talk later.
Pre-publication data
sharing?
There is no obvious reason to make data available prior to
publication of its analysis.
There is no immediate reward for doing so.
Neither is there much systematized reward for doing so.
(Citations and kudos feel good, but are cold comfort.)
Worse, there are good reasons not to do so.
If you make your data available, others can take advantage of
it…
…but they don’t have to share their data with you in order to do
so.
This bears some similarity
to the Prisoners’ Dilemma:
http://www.acting-man.com/?p=34313
“Confession” here is not
sharing your data.
Note: I’m not a game
theorist (but some of my
best friends are).
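A minimal sketch of the analogy, assuming made-up payoff numbers (they are illustrative and not from the talk): each lab either shares or withholds its data, and a lab that shares lets the other lab benefit without any obligation to reciprocate. With these hypothetical payoffs, withholding is the best response whatever the other lab does, even though both labs would be better off if both shared.

```python
# Hypothetical payoffs, chosen only to illustrate the structure of the dilemma.
# PAYOFF[(my_choice, their_choice)] is my payoff.
SHARE, WITHHOLD = "share", "withhold"

PAYOFF = {
    (SHARE, SHARE): 3,        # both labs benefit from pooled data
    (SHARE, WITHHOLD): 0,     # they use my data and give nothing back
    (WITHHOLD, SHARE): 5,     # I use their data and keep mine private
    (WITHHOLD, WITHHOLD): 1,  # status quo: everyone analyzes in isolation
}


def best_response(their_choice):
    """Return the choice that maximizes my payoff given the other lab's choice."""
    return max((SHARE, WITHHOLD), key=lambda mine: PAYOFF[(mine, their_choice)])


def is_dominant(choice):
    """True if `choice` is my best response to every choice the other lab can make."""
    return all(best_response(theirs) == choice for theirs in (SHARE, WITHHOLD))


if __name__ == "__main__":
    for theirs in (SHARE, WITHHOLD):
        print(f"other lab plays {theirs!r}: best response is {best_response(theirs)!r}")
    print("withholding is dominant:", is_dominant(WITHHOLD))
```

Changing the incentives means changing these numbers: if sharing returned something immediately, the (share, withhold) cell would no longer be the worst outcome.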
So, how do we get academics
to share their data!?
Two successful “systems” (send me more!!)
1. Oceanographic research
2. Biomedical research
1. Research cruises are
expensive!
In oceanography,
individual researchers cannot
afford to set up a cruise.
So, they form scientific consortia.
These consortia have data
sharing and preprint sharing
agreements.
(I’m told it works pretty well (?))
2. Some data makes more sense
when you have more data
Omberg et al., Nature Genetics, 2013.
Sage Bionetworks et al.:
Organize a consortium to
generate data;
Standardize data generation;
Share via common platform;
Store results, provenance,
analysis descriptions, and source
code;
Run a leaderboard for a subset of
analyses;
Win!
This “walled garden”
model is interesting!
“Compete” on analysis, not on data.
Some notes -
• Sage model requires ~similar data in common
format;
• Common analysis platform then becomes
immediately useful;
• Data is ~easily re-usable by participants;
• Publication of data becomes straightforward;
• Both models are centralized and coordinated.
The $1.5m question(s):
• Can we “port” this sharing model over to
environmental metagenomics, non-model
mRNAseq, and maybe even VetMed and
agricultural research?
• Can we use this model to drive useful pre-
publication data sharing?
• Can we take it from a coordinated and centralized
model to a decentralized model?
A slight digression -
Most data analysis models are based on centralizing data
and then computing on it there. This has several failure
points:
• Political: expect lots of biomedical, environmental data
to be restricted geopolitically.
• Computation: in the limit of infinite data…
• Bandwidth: in the limit of infinite data…
• Funding: in the limit of infinite data…
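One way to picture the alternative to centralizing, as a rough sketch under stated assumptions: each site computes a small summary where its data lives, and only the summaries travel. The site names, gene labels, and function names below are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical per-site data: reads that each site has already labeled with a gene name.
SITES = {
    "public": ["nirK", "nirK", "nifH"],
    "walled_garden": ["nirK", "amoA"],
    "private": ["nifH", "nifH", "amoA"],
}


def local_summary(reads):
    """Runs at the site that holds the data; returns only per-gene counts."""
    return Counter(reads)


def merge_summaries(summaries):
    """Runs at the coordinator; combines counts without seeing any raw reads."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


if __name__ == "__main__":
    summaries = [local_summary(reads) for reads in SITES.values()]
    print(dict(merge_summaries(summaries)))
```

The raw reads never leave their site, so restricted data stays where it is and the bandwidth cost scales with the size of the summary rather than the size of the data.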
Proposal: distributed graph database server
[Diagram components:]
• Compute server (Galaxy? Arvados?)
• Web interface + API
• Data/info
• Raw data sets
• Public servers; "walled garden" server; private server
• Graph query layer
• Upload/submit (NCBI, KBase)
• Import (MG-RAST, SRA, EBI)
Graph queries
across public & walled-garden data sets:
See Lee, Alekseyenko,
Brown, 2009, SciPy
Proceedings: ‘pygr’
project.
[Diagram: nodes "raw sequence", "assembled sequence", "nitrite reductase", "ppaZ"; edges labeled SIMILAR TO and ALSO CONTAINS]
Graph queries
across public & walled-garden data sets:
“What data sets contain <this gene>?”
“Which reads match to <this gene>, but not in
<conserved domain>?”
“Give me relative abundance of <gene X>
across all data sets, grouped by nitrogen
exposure.”
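A toy sketch of how queries like these might look over a graph of labeled relationships. Everything here is an assumption for illustration: the data set names, the read names, the nitrogen metadata, and the relation names CONTAINS and SIMILAR_TO are made up, and this is plain Python rather than the pygr API or any real graph database.

```python
# Hypothetical toy graph stored as (source, relation, target) triples.
EDGES = [
    ("soil_A", "CONTAINS", "read_1"),
    ("soil_A", "CONTAINS", "read_2"),
    ("soil_B", "CONTAINS", "read_3"),
    ("ocean_C", "CONTAINS", "read_4"),
    ("read_1", "SIMILAR_TO", "nitrite_reductase"),
    ("read_3", "SIMILAR_TO", "nitrite_reductase"),
    ("read_4", "SIMILAR_TO", "ppaZ"),
]

# Hypothetical metadata: nitrogen exposure per data set.
NITROGEN = {"soil_A": "high", "soil_B": "low", "ocean_C": "low"}


def targets(source, relation):
    """All nodes reachable from `source` by one edge labeled `relation`."""
    return {t for s, r, t in EDGES if s == source and r == relation}


def datasets_containing(gene):
    """'What data sets contain <this gene>?'"""
    return sorted(
        dataset
        for dataset in NITROGEN
        if any(gene in targets(read, "SIMILAR_TO") for read in targets(dataset, "CONTAINS"))
    )


def relative_abundance(gene, dataset):
    """Fraction of a data set's reads that are similar to `gene`."""
    reads = targets(dataset, "CONTAINS")
    hits = [read for read in reads if gene in targets(read, "SIMILAR_TO")]
    return len(hits) / len(reads) if reads else 0.0


def abundance_by_nitrogen(gene):
    """'Relative abundance of <gene X> across all data sets, grouped by nitrogen exposure.'"""
    groups = {}
    for dataset, level in NITROGEN.items():
        groups.setdefault(level, {})[dataset] = relative_abundance(gene, dataset)
    return groups


if __name__ == "__main__":
    print(datasets_containing("nitrite_reductase"))
    print(abundance_by_nitrogen("nitrite_reductase"))
```

In a distributed setting, each server would hold its own subset of the triples and the query layer would combine the per-server answers; this sketch keeps everything in one process to show only the shape of the queries.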
Thesis:
If we can provide immediate returns for data sharing,
researchers will do so, and do so immediately.
Not to do so would place them at a competitive
disadvantage.
(All the rest is gravy: open analysis system,
reproducibility, standardized data format, etc.)
Puzzle pieces.
1. Inexpensive and widely available cloud computing
infrastructure?
Yep. See Amazon, Google, Rackspace, etc.
Puzzle pieces.
2. The ability to do many or most sequence analyses
inexpensively in the cloud?
Yep. This is one reason for khmer & khmer-protocols.
Puzzle pieces.
3. Locations to persist indexed data sets for use in
search & retrieval?
figshare & dryad (?)
Puzzle pieces.
4. Distributed data mining approaches?
Some literature, but I know little about it.
In summary:
How will we do this?
I PLAN TO FAIL.
A LOT.
PUBLICLY.
(ht @ethanwhite)
In summary:
How will we know if (or when) we’ve “won”?
1. When people use, extend, and remix our software
and concepts without talking to us about it first.
(cf. khmer!)
2. When the system becomes so useful that people go
back and upload old data sets to it.
In summary:
The larger vision
Enable and incentivize sharing by providing
immediate utility; frictionless sharing.
Permissionless innovation for e.g. new data
mining approaches.
Plan for poverty with federated infrastructure
built on open & cloud.
Solve people’s current problems, while
remaining agile for the future.
Thanks!
References and pointers welcome!
https://github.com/ged-lab/buoy
(Note: there’s nothing there yet.)

2015 balti-and-bioinformatics


Editor's Notes

  1. Analyze data in cloud; import and export important; connect to other databases.
  2. Set up infrastructure for distributed query; base on graph database concept of standing relationships between data sets.
  3. Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.