A Data Biosphere for
Biomedical Research
Robert L. Grossman
University of Chicago &
Open Commons Consortium
AIRI IT Summit
Grand Rapids, Michigan
May 1, 2018
1. What is a Data Commons?
The challenge of big data in biomedical and behavioral
research…
The commoditization of sensors is
creating an explosive growth of data.
It can take weeks to download large datasets, it is difficult to
set up compliant computing infrastructure, and it can take
months to integrate & format the data for analysis.
There is not enough
funding for every
researcher to house all the
data they need
More challenges…
Data produced by different groups using different
methods is hard to integrate and compare.
There are few good software
platforms for researchers to use to
share their large datasets.
Most researchers don’t have the
bioinformatics support to process all
the data that could help their
research.
IT infrastructure challenges
• Data size
• Security & compliance
Limited funding
Growing importance of open
data, open reproducible
science & data ecosystems
Data commons co-locate data
with cloud computing
infrastructure and commonly used
software services, tools & apps
for managing, analyzing and
sharing data to create an
interoperable resource for the
research community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, IEEE
Computing in Science & Engineering, 2016. Source of image: The CDIS, GDC, & OCC data commons infrastructure at a University of Chicago data center.
Research ethics
committees (RECs) review
the ethical acceptability of
research involving human
participants. Historically,
the principal emphases of
RECs have been to protect
participants from physical
harms and to provide
assurance as to
participants’ interests and
welfare.*
[The Framework] is
guided by, Article 27 of
the 1948 Universal
Declaration of Human
Rights. Article 27
guarantees the rights
of every individual in
the world "to share in
scientific advancement
and its benefits"
(including to freely
engage in responsible
scientific inquiry)…*
Protect human
subject data
The right of human
subjects to benefit
from research.
*GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, see goo.gl/CTavQR
Data sharing with protections provides the evidence
so patients can benefit from advances in research.
Data commons balance protecting human subject data with open
research that benefits patients:
2. An Example of a Data Commons
NCI Genomic Data Commons*
• The GDC makes over 2.5 PB of
data available for access
via an API, analysis by
cloud resources on
public clouds, and
downloading.
• In a typical month, the
GDC is used by over
20,000 unique users and
over 2 PB of data are
accessed/downloaded.
• The GDC is based upon
an open source
software stack that can
be used to build other
data commons.
*See: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer
genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
The GDC consists of 1) a data exploration & visualization portal (DAVE), 2) a data
submission portal, 3) a data analysis and harmonization system, and 4) an API
so third parties can build applications.
Source: The NCI Genomic Data Commons, to appear, 2018.
Systems 1 & 2: Data Portals to Explore and Submit Data
• MuSE
(MD Anderson)
• VarScan2 (Washington
Univ.)
• SomaticSniper
(Washington Univ.)
• MuTect2
(Broad Institute)
Source: Zhenyu Zhang, et al. and the GDC Project Team, Uniform Genomic Data Analysis in the NCI
Genomic Data Commons, to appear.
System 3: Data Harmonization System to Analyze All of the
Submitted Data with Common Pipelines
System 4: An API to Support User Defined Applications and
Notebooks to Create a Data Ecosystem
https://api.gdc.cancer.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
• The GDC has a REST API so that researchers can develop their own
applications.
• There are third party applications that use the REST API for Python, R,
Jupyter notebooks and Shiny.
• The REST API drives the GDC data portal, data submission system, etc.
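As a minimal sketch of how a third-party application might call the REST API, the Python below rebuilds the example URL from this slide and reads the `state` field from the JSON reply. The helper names `build_file_url`, `extract_state` and `fetch_file_state` are hypothetical, and the response shape (a top-level `data` object holding the requested fields) is an assumption that should be checked against the GDC API documentation. Only the endpoint, the file UUID and the `fields=state` parameter come from the slide.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GDC_API = "https://api.gdc.cancer.gov"  # base URL taken from the slide


def build_file_url(file_id, fields=("state",)):
    """Build the /files/<uuid>?fields=... URL shown on the slide."""
    query = urlencode({"fields": ",".join(fields)})
    return f"{GDC_API}/files/{file_id}?{query}"


def extract_state(payload):
    """Read `state` from a decoded reply; assumes a top-level `data` object."""
    return payload.get("data", {}).get("state")


def fetch_file_state(file_id, timeout=30):
    """Call the API over HTTPS (needs network access) and return the state."""
    with urlopen(build_file_url(file_id), timeout=timeout) as response:
        return extract_state(json.load(response))
```

Calling `fetch_file_state("5003adf1-1cfd-467d-8234-0d396422a4ee")` would issue the request shown above. Keeping URL construction and parsing separate from the network call lets the first two helpers be exercised offline.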
Benefits of Data Commons and Data Sharing (1 of 2)
1. The data is available to other researchers for discovery,
which moves the research field faster.
2. Data commons support repeatable, reproducible and open
research.
3. Research on some diseases depends on a critical mass
of data to provide the statistical power required for
scientific evidence (e.g. to study combinations of rare
mutations in cancer).
4. With more data, smaller effects can be studied (e.g. to
understand the effect of environmental factors on disease).
Source: Robert L. Grossman, Supporting Open Data and Open Science With Data Commons: Some Suggested Guidelines for Funding Organizations,
2017, https://www.healthra.org/download-resource/?resource-url=/wp-content/uploads/2017/08/Data-Commons-
Guidelines_Grossman_8_2017.pdf
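Benefits 3 and 4 can be illustrated with a standard sample-size approximation. The sketch below is not from the talk: it assumes a two-sided, two-sample z-test on a standardized effect size, with the conventional defaults of a 5% significance level and 80% power, and the function name `samples_per_group` is hypothetical. It shows that the number of samples needed grows with the inverse square of the effect size, which is why pooled data makes smaller effects detectable.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


for d in (0.5, 0.25, 0.1):
    print(f"effect size {d}: about {samples_per_group(d)} samples per group")
```

Halving the effect size roughly quadruples the required sample, so a dataset that is adequate for a moderate effect can be far too small for a subtle environmental factor.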
Benefits of Data Commons and Data Sharing (2 of 2)
5. Data commons enable researchers to work with large
datasets at much lower cost to the funder than if each
researcher set up their own local environment.
6. Data commons generally provide higher security and greater
compliance than most local computing environments.
7. Data commons support large scale computation so that the
latest bioinformatics pipelines can be run.
8. Data commons can interoperate with each other so that over
time data sharing can benefit from a “network effect”
3. The Data Biosphere
Authors:
- Josh Denny
- David Glazer
- Robert L Grossman
- Ben Paten
- Anthony Philippakis
Source: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten, Anthony Philippakis, A Data Biosphere for Biomedical
Research, Medium, Oct 16, 2017, medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d
Concepts:
- Datasets
- Software Components
- Data Environments
Principles:
- Modular
- Open
- Community-driven
- Standards-based
Driver Projects:
- All of Us
- Human Cell Atlas
- NCI Cloud Resources
[Diagram: Data Generators, Ingest, Store and Explore (GDC & CRDC), Analysis Engine, Methods Repo, Portals and Workspaces (use in cloud), Researchers]
NCI GDC & CRDC Data Flow
- Generators upload to cloud-
based data store
- Process data with analysis
engine, and curate metadata
- Enable search & discovery
- Ecosystem of applications
Source: Anthony Philippakis
CRDC Team:
- UChicago
- Broad
- SBG
- ISB
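The four steps of the data flow can be sketched with a toy, in-memory model. Everything here is hypothetical and for illustration only: the `DataStore` class, the `analyze_and_curate` function and the metadata fields stand in for a cloud object store, an analysis engine and a metadata index, and none of it reflects the actual GDC or CRDC implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DataStore:
    """Stand-in for a cloud-based object store plus a metadata index."""
    objects: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def upload(self, object_id, payload):
        # Step 1: a data generator uploads raw data to the store.
        self.objects[object_id] = payload

    def search(self, **criteria):
        # Step 3: search & discovery over the curated metadata.
        return sorted(
            object_id
            for object_id, meta in self.metadata.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


def analyze_and_curate(store, object_id, project):
    # Step 2: the analysis engine processes the data and records metadata.
    payload = store.objects[object_id]
    store.metadata[object_id] = {"project": project, "size": len(payload)}


store = DataStore()
store.upload("sample-1", b"ACGT")
store.upload("sample-2", b"ACGTACGT")
analyze_and_curate(store, "sample-1", project="demo")
analyze_and_curate(store, "sample-2", project="other")

# Step 4: an application in the ecosystem builds on the search interface.
demo_objects = store.search(project="demo")
```

The point of the sketch is the separation of roles: generators only upload, the analysis engine is the only writer of metadata, and applications see the data only through search.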
[Diagram: Data Generators, Ingest, Store and Explore for GDC & CRDC and for HCA, Analysis Engine, Methods Repo, Portals and Workspaces (use in cloud), Researchers]
HCA Data Flow
Source: Anthony Philippakis
HCA Team:
- Broad
- UCSC
- EBI
- CZI
[Diagram: Data Generators, Ingest, Store and Explore for GDC & CRDC, for HCA and for AoU, Analysis Engine, Methods Repo, Portals and Workspaces (use in cloud), Researchers]
All of Us Data Flow
Source: Anthony Philippakis
AoU Team:
- Broad
- Verily
- Vandy
- U. Mich
- Columbia
[Diagram: HCA, GDC / CRDC and AoU, each with Ingest, Store and Explore; Analysis Engine, Methods Repo and Workspaces (Firecloud, AoU, NIH DC). Layers: differing implementations (CDR, Index-d datastore, Toil, Cromwell, Agora, Dockstore); GA4GH standardized APIs (DOS, WES, TES); Data Biosphere best practices (IDs, Meta, AuthN, AuthZ)]
Adapted from slide by Anthony Philippakis
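The idea of differing implementations behind a standardized API can be sketched in Python with a shared interface. This is an illustration under stated assumptions, not the GA4GH DOS specification: the `ObjectResolver` protocol, the two resolver classes, the `locate_inputs` function and the `example.org` URLs are all hypothetical.

```python
from typing import Protocol


class ObjectResolver(Protocol):
    """Hypothetical common interface: map a data object ID to an access URL."""

    def resolve(self, object_id: str) -> str: ...


class BucketResolver:
    """One illustrative implementation backed by a storage bucket name."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket

    def resolve(self, object_id: str) -> str:
        return f"https://{self.bucket}.example.org/{object_id}"


class TableResolver:
    """A different implementation backed by an explicit lookup table."""

    def __init__(self, table: dict[str, str]) -> None:
        self.table = table

    def resolve(self, object_id: str) -> str:
        return self.table[object_id]


def locate_inputs(resolver: ObjectResolver, object_ids: list[str]) -> list[str]:
    """Workspace code depends only on the shared interface."""
    return [resolver.resolve(object_id) for object_id in object_ids]
```

Because `locate_inputs` is written against the interface alone, the same workspace code could run against either store, which is the interoperability the standardized API layer is meant to provide.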
The Origins of the Data Biosphere
• Anthony, Ben & Bob met at a GA4GH meeting in Hinxton in May 2017
• Realized that this was a chance to drive interoperability.
• Goals of our collaboration are:
o Architect a federated data commons based on best practices, GA4GH
standards, and emerging standards, and see it reduced to practice.
o Nucleate an ecosystem of activity that goes beyond just our own
groups (“We are building a neighborhood, not a house.”)
o Bring interoperability among flagship NIH projects
Adapted from slide by Anthony Philippakis
4. Getting Involved with the Data Biosphere Project*
*This section represents my personal views, and not necessarily the views of the Data Biosphere Project.
Activity 1: Contribute Applications and Tools to
Current & Emerging Data Biosphere Ecosystem(s)
Diagram: Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The Cancer Journal: The Journal of Principles &
Practice of Oncology, 2018, to appear.
Activity 2: Participate in the GitHub Open Source
Software Community Building Data Biosphere Platforms
and Applications
Activity 3: Participate in GA4GH Standards
www.ga4gh.org
Activity 4: Participate in the Open Commons Consortium
& Build Your Own Data Commons
www.occ-data.org
5. Recommendations for Research Institutes*
*This section represents my personal views, and not necessarily the views of the Data Biosphere Project.
Rec. 1: Put a senior leader in charge of data and data strategy
for your institute (a chief data officer, chief analytics officer,
etc.) and develop and implement a data strategy.
Strategic planning is the continuous process of making
present entrepreneurial (risk-taking) decisions
systematically and with the greatest knowledge of their
futurity; organizing systematically the efforts needed to
carry out these decisions; and measuring the results of
these decisions against the expectations through
organized, systematic feedback.
Peter Drucker, Management: Tasks, Responsibilities, Practices, Harper and Row, 1974
Rec. 2: Establish internal best practices for data.
Examples
• Support data tiers
o Data catalog
o Data lake
o Data commons
• Practice data portability
• etc.
[Figure: Data Catalog, Data Lake and Data Commons tiers; axis labels: Number, Size & Complexity]
Rec. 3: Participate in the data biosphere community discussion.
6. Summary
Summary
1. Data commons co-locate data with cloud computing infrastructure and
commonly used software services, tools & apps for managing,
analyzing and sharing data to create an interoperable resource for the
research community.
2. The Data Biosphere Project is developing open, modular,
community-driven and standards-based data environments.
3. The Data Biosphere Project is working to develop open common APIs
across the NCI GDC / Cancer Research Data Commons, the NIH All of
Us Project, and the CZI Human Cell Atlas Project.
4. Contact us if you are interested in getting involved in the Data
Biosphere Project.
Questions?
Contact Information
Robert L. Grossman
robert.grossman@uchicago.edu
@BobGrossman
rgrossman.com
cdis.uchicago.edu
occ-data.org
For more information:
• To learn more about the Data Biosphere: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten, Anthony
Philippakis, A Data Biosphere for Biomedical Research, https://medium.com/@benedictpaten/a-data-biosphere-
for-biomedical-research-d212bbfae95d
• To learn more about data commons: Robert L. Grossman, et al. A Case for Data Commons: Toward Data Science
as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608
• To learn more about large-scale, secure, compliant cloud-based computing environments for biomedical data, see:
Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal
of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1.
• To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for
cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC was developed
using Bionimbus Gen2.
• To learn about the GDC / Gen3 API: Shane Wilson, Michael Fitzsimons, Martin Ferguson, Allison Heath, Mark
Jensen, Josh Miller, Mark W. Murphy, James Porter, Himanso Sahni, Louis Staudt, Yajing Tang, Zhining Wang,
Christine Yu, Junjun Zhang, Vincent Ferretti and Robert L. Grossman, Developing Cancer Informatics Applications
and Tools Using the NCI Genomic Data Commons API, Cancer Research, volume 77, number 21, 2017, pages e15-
e18.
Abstract
A Data Biosphere for Biomedical Research
As datasets grow in scale, the practice of downloading data is becoming
impractical in terms of cost (storing multiple copies of large datasets is
wasteful), accessibility (few researchers have the necessary
computational infrastructure) and security (many research laboratories
lack state-of-the-art security and access control). We propose the idea
of creating a vibrant ecosystem, which we call the “Data Biosphere.”
Building a Data Biosphere to propel progress in biomedicine will require
a community working together, including laboratory groups generating
data, software developers creating Biosphere Components, and
technical teams assembling and operating Data Environments.

More Related Content

What's hot

BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSMicah Altman
 
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinaiDataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinaidatascienceiqss
 
FAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeFAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeTom Plasterer
 
DataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE
 
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...datascienceiqss
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
FAIR Data Knowledge Graphs
FAIR Data Knowledge GraphsFAIR Data Knowledge Graphs
FAIR Data Knowledge GraphsTom Plasterer
 
DataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data SharingDataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data SharingDataONE
 
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Tom Plasterer
 
A Big Picture in Research Data Management
A Big Picture in Research Data ManagementA Big Picture in Research Data Management
A Big Picture in Research Data ManagementCarole Goble
 
DataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE
 
DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE
 
DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE
 
BioPharma and FAIR Data, a Collaborative Advantage
BioPharma and FAIR Data, a Collaborative AdvantageBioPharma and FAIR Data, a Collaborative Advantage
BioPharma and FAIR Data, a Collaborative AdvantageTom Plasterer
 
Linked Data for Biopharma
Linked Data for BiopharmaLinked Data for Biopharma
Linked Data for BiopharmaTom Plasterer
 
Dataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataDataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataTom Plasterer
 
Tag.bio aws public jun 08 2021
Tag.bio aws public jun 08 2021 Tag.bio aws public jun 08 2021
Tag.bio aws public jun 08 2021 Sanjay Padhi, Ph.D
 
FAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceFAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceCarole Goble
 
Building a Network of Interoperable and Independently Produced Linked and Ope...
Building a Network of Interoperable and Independently Produced Linked and Ope...Building a Network of Interoperable and Independently Produced Linked and Ope...
Building a Network of Interoperable and Independently Produced Linked and Ope...Michel Dumontier
 

What's hot (20)

BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
 
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinaiDataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
DataTags: Sharing Privacy Sensitive Data by Michael Bar-sinai
 
FAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeFAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to Practice
 
DataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE Education Module 08: Data Citation
DataONE Education Module 08: Data Citation
 
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
FAIR Data Knowledge Graphs
FAIR Data Knowledge GraphsFAIR Data Knowledge Graphs
FAIR Data Knowledge Graphs
 
DataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data SharingDataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data Sharing
 
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
 
A Big Picture in Research Data Management
A Big Picture in Research Data ManagementA Big Picture in Research Data Management
A Big Picture in Research Data Management
 
DataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management Planning
 
DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?
 
DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: Metadata
 
BioPharma and FAIR Data, a Collaborative Advantage
BioPharma and FAIR Data, a Collaborative AdvantageBioPharma and FAIR Data, a Collaborative Advantage
BioPharma and FAIR Data, a Collaborative Advantage
 
Linked Data for Biopharma
Linked Data for BiopharmaLinked Data for Biopharma
Linked Data for Biopharma
 
V3 i35
V3 i35V3 i35
V3 i35
 
Dataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataDataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* Data
 
Tag.bio aws public jun 08 2021
Tag.bio aws public jun 08 2021 Tag.bio aws public jun 08 2021
Tag.bio aws public jun 08 2021
 
FAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceFAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practice
 
Building a Network of Interoperable and Independently Produced Linked and Ope...
Building a Network of Interoperable and Independently Produced Linked and Ope...Building a Network of Interoperable and Independently Produced Linked and Ope...
Building a Network of Interoperable and Independently Produced Linked and Ope...
 

Similar to A Data Biosphere for Biomedical Research

Recognising data sharing
Recognising data sharingRecognising data sharing
Recognising data sharingJisc RDM
 
Nicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchNicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchGigaScience, BGI Hong Kong
 
BD2K and the Commons : ELIXR All Hands
BD2K and the Commons : ELIXR All Hands BD2K and the Commons : ELIXR All Hands
BD2K and the Commons : ELIXR All Hands Vivien Bonazzi
 
Data accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereData accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereAlex Hardisty
 
Big Data R&D Strategy - Ensure the long term sustainability, access, and deve...
Big Data R&D Strategy - Ensure the long term sustainability, access, and deve...Big Data R&D Strategy - Ensure the long term sustainability, access, and deve...
Big Data R&D Strategy - Ensure the long term sustainability, access, and deve...Sky Bristol
 
Open Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon HodsonOpen Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon HodsonAfrican Open Science Platform
 
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...dkNET
 
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET
 
Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...African Open Science Platform
 
Data commons bonazzi bd2 k fundamentals of science feb 2017
Data commons bonazzi   bd2 k fundamentals of science feb 2017Data commons bonazzi   bd2 k fundamentals of science feb 2017
Data commons bonazzi bd2 k fundamentals of science feb 2017Vivien Bonazzi
 
Democratising biodiversity and genomics research: open and citizen science to...
Democratising biodiversity and genomics research: open and citizen science to...Democratising biodiversity and genomics research: open and citizen science to...
Democratising biodiversity and genomics research: open and citizen science to...GigaScience, BGI Hong Kong
 
NIH Data Summit - The NIH Data Commons
NIH Data Summit - The NIH Data CommonsNIH Data Summit - The NIH Data Commons
NIH Data Summit - The NIH Data CommonsVivien Bonazzi
 
Komatsoulis internet2 executive track
Komatsoulis internet2 executive trackKomatsoulis internet2 executive track
Komatsoulis internet2 executive trackGeorge Komatsoulis
 
Rda nitrd 2015 berman - final
Rda nitrd 2015 berman  - finalRda nitrd 2015 berman  - final
Rda nitrd 2015 berman - finalKathy Fontaine
 

Similar to A Data Biosphere for Biomedical Research (20)

Recognising data sharing
Recognising data sharingRecognising data sharing
Recognising data sharing
 
Nicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchNicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do research
 
Hahnel "Open Data Policies: Opportunities, compliance and technology strategies"
Hahnel "Open Data Policies: Opportunities, compliance and technology strategies"Hahnel "Open Data Policies: Opportunities, compliance and technology strategies"
Hahnel "Open Data Policies: Opportunities, compliance and technology strategies"
 
BD2K and the Commons : ELIXR All Hands
BD2K and the Commons : ELIXR All Hands BD2K and the Commons : ELIXR All Hands
BD2K and the Commons : ELIXR All Hands
 
Data accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereData accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphere
 
Big Data R&D Strategy - Ensure the long term sustainability, access, and deve...
Big Data R&D Strategy - Ensure the long term sustainability, access, and deve...Big Data R&D Strategy - Ensure the long term sustainability, access, and deve...
Big Data R&D Strategy - Ensure the long term sustainability, access, and deve...
 
Open Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon HodsonOpen Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon Hodson
 
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
 
Nicole Nogoy at the Auckland BMC RoadShow
Nicole Nogoy at the Auckland BMC RoadShowNicole Nogoy at the Auckland BMC RoadShow
Nicole Nogoy at the Auckland BMC RoadShow
 
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
 
Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...
 
Data commons bonazzi bd2 k fundamentals of science feb 2017
Data commons bonazzi   bd2 k fundamentals of science feb 2017Data commons bonazzi   bd2 k fundamentals of science feb 2017
Data commons bonazzi bd2 k fundamentals of science feb 2017
 
Ilik - Beyond the Manuscript: Using IRs for Non Traditional Content Types
Ilik - Beyond the Manuscript: Using IRs for Non Traditional Content TypesIlik - Beyond the Manuscript: Using IRs for Non Traditional Content Types
Ilik - Beyond the Manuscript: Using IRs for Non Traditional Content Types
 
Democratising biodiversity and genomics research: open and citizen science to...
Democratising biodiversity and genomics research: open and citizen science to...Democratising biodiversity and genomics research: open and citizen science to...
Democratising biodiversity and genomics research: open and citizen science to...
 
Open Science - Global Perspectives/Simon Hodson
Open Science - Global Perspectives/Simon HodsonOpen Science - Global Perspectives/Simon Hodson
Open Science - Global Perspectives/Simon Hodson
 
NIH Data Summit - The NIH Data Commons
NIH Data Summit - The NIH Data CommonsNIH Data Summit - The NIH Data Commons
NIH Data Summit - The NIH Data Commons
 
Enabling simultaneous analysis of multiple cohort studies: A BRISSKit use case
Enabling simultaneous analysis of multiple cohort studies: A BRISSKit use caseEnabling simultaneous analysis of multiple cohort studies: A BRISSKit use case
Enabling simultaneous analysis of multiple cohort studies: A BRISSKit use case
 
ACRL STS Liaisons Forum - AIBS
ACRL STS Liaisons Forum - AIBSACRL STS Liaisons Forum - AIBS
ACRL STS Liaisons Forum - AIBS
 
Komatsoulis internet2 executive track
Komatsoulis internet2 executive trackKomatsoulis internet2 executive track
Komatsoulis internet2 executive track
 
Rda nitrd 2015 berman - final
Rda nitrd 2015 berman  - finalRda nitrd 2015 berman  - final
Rda nitrd 2015 berman - final
 


A Data Biosphere for Biomedical Research

  • 1. A Data Biosphere for Biomedical Research Robert L. Grossman University of Chicago & Open Commons Consortium AIRI IT Summit Grand Rapids, Michigan May 1, 2018
  • 2. 1. What is a Data Commons?
  • 3. The challenge of big data in biomedical and behavioral research… The commoditization of sensors is creating an explosive growth of data. It can take weeks to download large datasets, it is difficult to set up compliant computing infrastructure, and it can take months to integrate & format the data for analysis. There is not enough funding for every researcher to house all the data they need
  • 4. More challenges… Data produced by different groups using different methods is hard to integrate and compare. There are few good software platforms for researchers to use to share their large datasets. Most researchers don’t have the bioinformatics support to process all the data that could help their research.
  • 5. IT infrastructure challenges • Data size • Security & compliance Limited funding Growing importance of open data, open reproducible science & data ecosystems
  • 6. IT infrastructure challenges Limited funding Growing importance of open data, open reproducible science & data ecosystems Data commons co-locate data with cloud computing infrastructure and commonly used software services, tools & apps for managing, analyzing and sharing data to create an interoperable resource for the research community.* *Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering, 2016. Source of image: The CDIS, GDC, & OCC data commons infrastructure at a University of Chicago data center.
  • 7. Research ethics committees (RECs) review the ethical acceptability of research involving human participants. Historically, the principal emphases of RECs have been to protect participants from physical harms and to provide assurance as to participants’ interests and welfare.* [The Framework] is guided by Article 27 of the 1948 Universal Declaration of Human Rights. Article 27 guarantees the rights of every individual in the world "to share in scientific advancement and its benefits" (including to freely engage in responsible scientific inquiry)…* *GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, see goo.gl/CTavQR Data sharing with protections provides the evidence so patients can benefit from advances in research. Data commons balance protecting human subject data with open research that benefits patients: Protect human subject data. The right of human subjects to benefit from research.
  • 8. 2. An Example of a Data Commons
  • 9. NCI Genomic Data Commons* • The GDC makes over 2.5 PB of data available for access via an API, analysis by cloud resources on public clouds, and downloading. • In a typical month, the GDC is used by over 20,000 unique users and over 2 PB of data are accessed/downloaded. • The GDC is based upon an open source software stack that can be used to build other data commons. *See: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC consists of 1) a data exploration & visualization portal (DAVE), 2) a data submission portal, 3) a data analysis and harmonization system, and 4) an API so third parties can build applications.
  • 10. A B C D Source: The NCI Genomic Data Commons, to appear, 2018.
  • 11. Systems 1 & 2: Data Portals to Explore and Submit Data
  • 12. System 3: Data Harmonization System to Analyze All of the Submitted Data with Common Pipelines • MuSE (MD Anderson) • VarScan2 (Washington Univ.) • SomaticSniper (Washington Univ.) • MuTect2 (Broad Institute) Source: Zhenyu Zhang, et al. and the GDC Project Team, Uniform Genomic Data Analysis in the NCI Genomic Data Commons, to appear.
  • 13. System 4: An API to Support User Defined Applications and Notebooks to Create a Data Ecosystem https://api.gdc.cancer.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state • The GDC has a REST API so that researchers can develop their own applications. • There are third party applications that use the REST API for Python, R, Jupyter notebooks and Shiny. • The REST API drives the GDC data portal, data submission system, etc.
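The request URL on slide 13 can be rebuilt from Python as a minimal sketch of how a third party application might call the REST API. The base URL, the `/files/<id>` path, the `fields` query parameter and the file ID are taken from the slide; the function names `file_url` and `fetch_file_metadata` are hypothetical, and the slides do not describe the layout of the JSON response, so the sketch returns it as-is without interpreting it.

```python
import json
import urllib.parse
import urllib.request

# Base URL as shown on the slide.
GDC_API = "https://api.gdc.cancer.gov"


def file_url(file_id: str, fields: list[str]) -> str:
    """Build a /files/<id>?fields=... URL (hypothetical helper)."""
    query = urllib.parse.urlencode({"fields": ",".join(fields)})
    return f"{GDC_API}/files/{file_id}?{query}"


def fetch_file_metadata(file_id: str, fields: list[str], timeout: float = 30.0) -> dict:
    """Send the GET request and decode the JSON body (hypothetical helper).

    Requires network access; the response structure is not assumed here.
    """
    with urllib.request.urlopen(file_url(file_id, fields), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URL; call fetch_file_metadata(...) to contact the API.
    print(file_url("5003adf1-1cfd-467d-8234-0d396422a4ee", ["state"]))
```

Keeping URL construction separate from the network call lets the same helper be reused from a notebook or a larger application, and makes it testable offline.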
  • 14. Benefits of Data Commons and Data Sharing (1 of 2) 1. The data is available to other researchers for discovery, which moves the research field faster. 2. Data commons support repeatable, reproducible and open research. 3. Research on some diseases depends on having a critical mass of data to provide the required statistical power for the scientific evidence (e.g. to study combinations of rare mutations in cancer). 4. With more data, smaller effects can be studied (e.g. to understand the effect of environmental factors on disease). Source: Robert L. Grossman, Supporting Open Data and Open Science With Data Commons: Some Suggested Guidelines for Funding Organizations, 2017, https://www.healthra.org/download-resource/?resource-url=/wp-content/uploads/2017/08/Data-Commons-Guidelines_Grossman_8_2017.pdf
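Benefits 3 and 4 on slide 14 rest on a standard statistical relationship, which can be illustrated with a small sketch. It is not from the slides: it assumes a two-sided, two-sample z-test with equal group sizes and a standardized effect size d, for which the usual approximation is n ≈ 2(z₁₋α/₂ + z_power)² / d² samples per group. The function name `samples_per_group` and the default α and power are illustrative choices.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples per group for a two-sided, two-sample z-test.

    Assumption (not from the slides): equal group sizes and a
    standardized effect size d = (mean difference) / (standard deviation).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    for d in (0.5, 0.2, 0.1):
        print(f"effect size {d}: {samples_per_group(d)} samples per group")
```

Because the required sample size grows with 1/d², halving the effect of interest roughly quadruples the data needed, which is one reason pooled data in a commons can make small effects detectable.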
  • 15. Benefits of Data Commons and Data Sharing (2 of 2) 5. Data commons enable researchers to work with large datasets at much lower cost to the funder than if each researcher set up their own local environment. 6. Data commons generally provide higher security and greater compliance than most local computing environments. 7. Data commons support large scale computation so that the latest bioinformatics pipelines can be run. 8. Data commons can interoperate with each other so that over time data sharing can benefit from a “network effect”
  • 16. 3. The Data Biosphere
  • 17. Authors: - Josh Denny - David Glazer - Robert L Grossman - Ben Paten - Anthony Philippakis Source: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten, Anthony Philippakis, A Data Biosphere for Biomedical Research, Medium, Oct 16, 2017, medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d
  • 18. Concepts: - Datasets - Software Components - Data Environments Authors: - Josh Denny - David Glazer - Robert L Grossman - Ben Paten - Anthony Philippakis
  • 19. Principles: - Modular - Open - Community-driven - Standards-based Driver Projects: - All of Us - Human Cell Atlas - NCI Cloud Resources Authors: - Josh Denny - David Glazer - Robert L Grossman - Ben Paten - Anthony Philippakis
  • 20. NCI GDC & CRDC Data Flow [Diagram labels: Data Generators; Ingest, Store, Explore (GDC & CRDC); Analysis Engine; Methods Repo; Portals; Workspaces; Researchers; Use in cloud] - Generators upload to cloud-based data store - Process data with analysis engine, and curate metadata - Enable search & discovery - Ecosystem of applications Source: Anthony Philippakis CRDC Team: - UChicago - Broad - SBG - ISB
  • 21. HCA Data Flow [Diagram labels: Data Generators; Ingest, Store, Explore (GDC & CRDC); Ingest, Store, Explore (HCA); Analysis Engine; Methods Repo; Portals; Workspaces; Researchers; Use in cloud] - Generators upload to cloud-based data store - Process data with analysis engine, and curate metadata - Enable search & discovery - Ecosystem of applications Source: Anthony Philippakis HCA Team: - Broad - UCSC - EBI - CZI
  • 22. All of Us Data Flow [Diagram labels: Data Generators; Ingest, Store, Explore (GDC & CRDC); Ingest, Store, Explore (HCA); Ingest, Store, Explore (AoU); Analysis Engine; Methods Repo; Portals; Workspaces; Researchers; Use in cloud] - Generators upload to cloud-based data store - Process data with analysis engine, and curate metadata - Enable search & discovery - Ecosystem of applications Source: Anthony Philippakis AoU Team: - Broad - Verily - Vandy - U. Mich - Columbia
  • 23. [Diagram labels: Ingest, Store, Explore (HCA); Ingest, Store, Explore (GDC, CRDC); Ingest, Store, Explore (AoU); Analysis Engine; Firecloud, AoU, NIH DC; Methods Repo; Workspaces; DOS; Differing implementations; CDR; Index-d datastore; WES; Toil; Cromwell; TES; Agora; Dockstore; GA4GH Standardized APIs; IDs; Meta; Data Biosphere best practices; AuthN; AuthZ] Adapted from slide by Anthony Philippakis
  • 24. The Origins of the Data Biosphere • Anthony, Ben & Bob met at a GA4GH meeting in Hinxton in May 2017. • Realized that this was a chance to drive interoperability. • Goals of our collaboration are: o Architect a federated data commons based on best practices, GA4GH standards, and emerging standards, and see it reduced to practice. o Nucleate an ecosystem of activity that goes beyond just our own groups (“We are building a neighborhood, not a house.”) o Bring interoperability among flagship NIH projects. Adapted from slide by Anthony Philippakis
  • 25. 4. Getting Involved with the Data Biosphere Project* *This section represents my personal views, and not necessarily the views of the Data Biosphere Project.
  • 26. Activity 1: Contribute Applications and Tools to Current & Emerging Data Biosphere Ecosystem(s) Diagram: Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The Cancer Journal: The Journal of Principles & Practice of Oncology, 2018, to appear.
  • 27. Activity 2: Participate in the GitHub Open Source Software Community Building Data Biosphere Platforms and Applications
  • 28. Activity 3: Participate in GA4GH Standards www.ga4gh.org
  • 29. Activity 4: Participate in the Open Commons Consortium & Build Your Own Data Commons www.occ-data.org
  • 30. 5. Recommendations for Research Institutes* *This section represents my personal views, and not necessarily the views of the Data Biosphere Project.
  • 31. Rec. 1: Put a senior leader in charge of data and data strategy for your institute (a chief data officer, chief analytics officer, etc.) and develop and implement a data strategy. Strategic planning is the continuous process of making present entrepreneurial (risk-taking) decisions systematically and with the greatest knowledge of their futurity; organizing systematically the efforts needed to carry out these decisions; and measuring the results of these decisions against the expectations through organized, systematic feedback. Peter Drucker, Management: Tasks, Responsibilities, Practices, Harper and Row, 1974
  • 32. Rec. 2: Establish internal best practices for data. Examples • Support data tiers o Data catalog o Data lake o Data commons • Practice data portability • etc. [Diagram labels: Number; Size & Complexity; Data Commons; Data Lake; Data Catalog]
  • 33. Rec. 3: Participate in the data biosphere community discussion.
  • 35. Summary 1. Data commons co-locate data with cloud computing infrastructure and commonly used software services, tools & apps for managing, analyzing and sharing data to create an interoperable resource for the research community. 2. The Data Biosphere Project is developing open, modular, community-driven and standards-based data environments. 3. The Data Biosphere Project is working to develop open common APIs across the NCI GDC / Cancer Research Data Commons, the NIH All of Us Project, and the CZI Human Cell Atlas Project. 4. Contact us if you are interested in getting involved in the Data Biosphere Project.
  • 38. For more information: • To learn more about the Data Biosphere: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten, Anthony Philippakis, A Data Biosphere for Biomedical Research, https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d • To learn more about data commons: Robert L. Grossman, et al. A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608 • To learn more about large-scale, secure, compliant cloud-based computing environments for biomedical data, see: Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1. • To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC was developed using Bionimbus Gen2. • To learn about the GDC / Gen3 API: Shane Wilson, Michael Fitzsimons, Martin Ferguson, Allison Heath, Mark Jensen, Josh Miller, Mark W. Murphy, James Porter, Himanso Sahni, Louis Staudt, Yajing Tang, Zhining Wang, Christine Yu, Junjun Zhang, Vincent Ferretti and Robert L. Grossman, Developing Cancer Informatics Applications and Tools Using the NCI Genomic Data Commons API, Cancer Research, volume 77, number 21, 2017, pages e15-e18.
  • 39. Abstract A Data Biosphere for Biomedical Research As datasets grow in scale, the practice of downloading data is becoming impractical in terms of cost (storing multiple copies of large datasets is wasteful), accessibility (few researchers have the necessary computational infrastructure) and security (many research laboratories lack state-of-the-art security and access control). We propose the idea of creating a vibrant ecosystem, which we call the “Data Biosphere.” Building a Data Biosphere to propel progress in biomedicine will require a community working together, including laboratory groups generating data, software developers creating Biosphere Components, and technical teams assembling and operating Data Environments.