CNI 2018: A Research Object Authoring Tool for the Data Commons
1. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
FAIR4CURES
A Research Object Authoring Tool for the Data Commons
December 11, 2018
Anita de Waard (she, her)
VP Research Collaborations
2. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Overview:
1. The NIH Data Commons: a very short introduction
2. The FAIR4CURES Project
3. A Global Unique Identifier Broker
4. Research Objects: a very very short introduction
5. Building a Research Object Authoring Tool on Mendeley Data
3. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Overview:
The NIH Data Commons Pilot Phase aims to provide a marketplace for tools, data and workflows, based on existing technologies of commercial and academic platforms that strive to embody the FAIR Data principles.
4. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Data Commons Overview:
Goals of the project:
1. Advance the policies and protocols for accessing human subjects data
2. Support global identification, indexing and searching of available data sets
3. Provide a collection of computational pipelines that can be applied to data sets
4. Utilize standards to globally identify and access data sets, tools and workflows
5. Create policies for data citation, reuse and reproducibility
6. Enable researchers to port their own data and workflows into the cloud
Project structure:
• DCPPC research groups are addressing important Key Capabilities (KCs), listed below
• The Commons will be composed of four stacks, incorporating products from the KCs
Final output:
• Data from three large NIH Databases will be available through all of these systems
• Users can securely access data within all stacks, on multiple cloud providers
• Users have access to a basic set of applications that run the same way on all stacks.
https://public.nihdatacommons.us/ExecutiveSummary_4YP/
Key Capabilities:
1: FAIR Guidelines & Metrics
2: Global Unique IDs for FAIR Digital Objects
3: Open Standard APIs
4: Cloud Agnostic Architecture Framework
5: Workspaces for Computation
6: Research Ethics, Privacy, and Security
7: Indexing and Search
5. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Data Commons Guiding Principles:
• 1. Identifiers for data: Develop and implement an interoperable global unique identifier system for digital
objects.
• 2. Data access: Develop and implement authentication and authorization policies and protocols for controlled
access to digital objects and derivatives.
• 3. Findability: Enable search and indexing of digital objects and data sets.
• 4. Software stacks: The Commons will encompass multiple robust and sustainable software stacks
implementing Commons standards and systems.
• 5. Data use, standards: All tools will be built using standard application interfaces.
• 6. Use cases: The Commons will develop and utilize an extensive use case library.
• 7. Community: The Commons is developed through intense Community engagement and support across
multiple levels of expertise.
• 8. Community: Governance, membership, and coordination will be established with and through the
community.
• 9. Evaluation methods and metrics: We plan a culture of frequent release of products, with small iterations,
routine evaluation and redesign.
• 10. FAIR guidelines and metrics: Once FAIR metrics and rubrics are defined, these will be used to measure the
level of “FAIRness” of repositories, datasets, and other digital objects.
https://public.nihdatacommons.us/executive-summary/
6. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
The FAIR4CURES Collaboration:
Team Xenon – Four partner organisations
FAIR: Findable, Accessible, Interoperable, Reusable
CURES: Collaborative, Usable, Reproducible, Extendable, Scalable
Index 3 datasets:
• Trans-Omics for Precision Medicine (TOPMed)
• Genotype-Tissue Expression (GTEx)
• Model Organism Databases (MODs)
7. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
The FAIR4CURES System:
[Diagram: The FAIR4CURES Platform]
8. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Global Unique Identifier Broker:
• Identifiers for hosted data files within TOPMed studies, GTEx dataset, and MODs
• Feature for researchers to register identifiers for their derived data files on the platform, making the content public and searchable
• Selecting types of identifiers to support in the Data Commons ecosystem and the required identifier metadata
• Open Source tool, connected to the Seven Bridges Platform
• Also accessible via GitHub/SmartAPI
9. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Digital Object Types Identified following the KC2 Metadata Spec:
| Seven Bridges Object Type | DataCite Resource Type | Proposed Schema.Org CreativeWork Type | Supported Relationships | Notes |
|---|---|---|---|---|
| File | Dataset | Dataset | Source Of a Task (input file); Derived From a Task (output file); Part Of a Collection | One (or more) files packaged with metadata as a dataset |
| App (Tool) | Software | SoftwareSourceCode | Part of Task or Collection or Workflow | Same as dataset, but file is source code |
| App (Workflow) | Workflow | SoftwareSourceCode (?) | Has Part of Software | An aggregation of Tools (Software). File is CWL definition describing how the tools are chained. |
| Task | Collection | Collection | Composition of Files and Apps (Tools or Workflows) | An aggregation of Apps (either tools or workflows), plus files (input & output), plus a record of all the settings used for each App. |
| Collection (Study) | Collection | Collection | Composition of any object | An aggregation of heterogeneous objects for purpose of publishing. |
https://docs.google.com/document/d/1FD3aXr_uHnPy-YrFhQhuXET73tBVxu7F_Q5uS9TPUZs/edit
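The type mapping in the table above can be sketched as a small lookup structure. This is only an illustration: the class name `TypeMapping`, the dictionary `TYPE_MAP` and the helper `describe` are hypothetical and are not part of the KC2 specification or any Seven Bridges API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeMapping:
    """DataCite and schema.org types proposed for one Seven Bridges object type."""
    datacite_type: str
    schema_org_type: str


# Keys are the Seven Bridges object types listed on the slide.
TYPE_MAP = {
    "File": TypeMapping("Dataset", "Dataset"),
    "App (Tool)": TypeMapping("Software", "SoftwareSourceCode"),
    # The slide marks this schema.org choice with "(?)", i.e. still undecided.
    "App (Workflow)": TypeMapping("Workflow", "SoftwareSourceCode"),
    "Task": TypeMapping("Collection", "Collection"),
    "Collection (Study)": TypeMapping("Collection", "Collection"),
}


def describe(object_type: str) -> str:
    """Return a one-line summary of the mapping for a given object type."""
    mapping = TYPE_MAP[object_type]
    return (f"{object_type}: DataCite={mapping.datacite_type}, "
            f"schema.org={mapping.schema_org_type}")


print(describe("App (Tool)"))
```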
10. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Seven Bridges Data Publication Concept
Requirements Analysis:
1. Landing page URL including GUID
2. URL for page where file can be accessed (downloaded)
3. Metadata for object
4. Reference to the Task (zero or one) that this dataset was Derived From
5. Reference to the Task(s) (zero, one or more) that this dataset is the Source Of
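The five requirements above could be captured in a record along the following lines. This is a minimal sketch, assuming a hypothetical class name `DatasetPublication`, hypothetical field names and placeholder example.org URLs; it does not reflect the actual Seven Bridges data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetPublication:
    """Hypothetical record for a published dataset, one field per requirement."""
    guid: str
    landing_page_url: str                  # requirement 1: must include the GUID
    download_url: str                      # requirement 2
    metadata: Dict[str, str]               # requirement 3
    derived_from_task: Optional[str] = None                   # requirement 4: zero or one
    source_of_tasks: List[str] = field(default_factory=list)  # requirement 5: zero or more

    def __post_init__(self) -> None:
        # Enforce requirement 1: the landing page URL carries the GUID.
        if self.guid not in self.landing_page_url:
            raise ValueError("landing page URL must include the GUID")


# Placeholder identifiers and URLs, for illustration only.
record = DatasetPublication(
    guid="abc123",
    landing_page_url="https://example.org/objects/abc123",
    download_url="https://example.org/files/abc123/download",
    metadata={"title": "Example output file"},
    derived_from_task="task-1",
    source_of_tasks=["task-2", "task-3"],
)
print(record.landing_page_url)
```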
11. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Seven Bridges Workflow Configuration (CWL)
12. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
What are Research Objects?
Standards-based metadata framework for logically and physically bundling resources with context
http://researchobject.org
• Aggregates: link things together
• Annotations: about things & their relationships
• Container: Packaging content & links: Zip files, BagIt, Docker images
• Identification: locate things regardless where
13. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Research Objects and BDBags:
Research Objects can be used to capture outputs in a wide range of scopes
• Profiles help define the shape and form of a research object.
• A profile defines the general purpose of that type of Research Object:
  • A format (e.g. Research Object Bundle),
  • An expectation of what kind of resources it should contain,
  • A link to any specific vocabularies that should be used in its annotations.
Applications of Research Objects include BDBags (Big Data Bags):
• In digital libraries, preservation of source artifacts commonly uses the BagIt format for archive serialization, capturing digital resources like audio recordings, document scans and their transcriptions, provenance and annotations.
• The Research Object BagIt archive is a profile for describing a BagIt archive and its content as a Research Object, to structure the metadata and relate the captured resources.
• The NIH-funded Big Data for Discovery Science (BDDS) project captures Big Data bags (BDBag) of large complex datasets from genomics workflows (https://doi.org/10.1109/BigData.2016.7840618).
• A key aspect of BDBag is the ability to use Minimal Viable Identifiers (minid) for referencing potentially large data sources held in multiple remote repositories, effectively making a “Big Data” Research Object for large-scale workflows (https://doi.org/10.1101/268755).
• A bag of bags (minid:b9vx04) is a metadata skeleton which may be completed with tools like bdbag to download the big data.
• The bags’ Research Object manifests can be consumed independently, linking to the remote resources.
http://www.researchobject.org/scopes/
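The idea of a bag as a metadata skeleton that points at remote data can be illustrated with plain BagIt files. The sketch below uses only the Python standard library and writes the tag files defined by the BagIt specification (bagit.txt, manifest-sha256.txt, fetch.txt); it does not use the bdbag tool, does not resolve minids, and does not write a Research Object manifest. The names `RemoteFile` and `write_bag_skeleton`, the URL and the checksum are hypothetical placeholders.

```python
import tempfile
from pathlib import Path
from typing import List, NamedTuple


class RemoteFile(NamedTuple):
    """A payload file that lives elsewhere and is fetched on demand."""
    url: str
    length: int   # size in bytes
    sha256: str   # checksum the fetched file is expected to match
    path: str     # payload path inside the bag, e.g. "data/reads.bam"


def write_bag_skeleton(bag_dir: Path, files: List[RemoteFile]) -> None:
    """Write a BagIt skeleton whose payload is referenced, not included."""
    bag_dir.mkdir(parents=True, exist_ok=True)
    (bag_dir / "data").mkdir(exist_ok=True)  # payload directory, empty until fetched
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # The manifest lists the checksum of every payload file, fetched or not.
    manifest = "".join(f"{f.sha256}  {f.path}\n" for f in files)
    (bag_dir / "manifest-sha256.txt").write_text(manifest, encoding="utf-8")
    # fetch.txt tells a client where to download each missing payload file.
    fetch = "".join(f"{f.url} {f.length} {f.path}\n" for f in files)
    (bag_dir / "fetch.txt").write_text(fetch, encoding="utf-8")


# Placeholder entry, for illustration only.
example = RemoteFile(
    url="https://example.org/remote/reads.bam",
    length=1024,
    sha256="0" * 64,
    path="data/reads.bam",
)

with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "example-bag"
    write_bag_skeleton(bag, [example])
    print(sorted(p.name for p in bag.iterdir()))
```

A client that understands fetch.txt can later download each listed file into data/ and verify it against the manifest, which is the step the slide attributes to tools like bdbag.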
14. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Moving from Datasets to Research Objects in Mendeley Data:
In Mendeley Data Repository, datasets are lists of files (stored in our S3 bucket) with metadata packaging (e.g. Titles,
Description, Categories, License) and a persistent identifier (DOI).
We will introduce:
• Collections as an aggregation of Datasets. Similar to a Dataset, BUT, the contents are other datasets, not files.
• Software and Workflow as different types of Digital Objects. Similar to a Dataset, BUT files are source code or
workflow specifications (e.g. CWL) and metadata properties could be a bit different.
This forms the foundation for Research Objects, which are:
• Collections or aggregations of different types of Digital Objects (not just datasets)
• References to digital objects on other platforms, based on standard identifiers (e.g. DOIs or ARKs)
• A manifest which lists and describes the contents of the Research Object
• Exposed in JSON-LD:
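As a rough illustration of what a JSON-LD manifest for such a Research Object might look like, the sketch below builds one from schema.org terms (`Collection`, `hasPart`, `Dataset`, `SoftwareSourceCode`). This is an assumption made for illustration, not the format Mendeley Data exposes; the function `build_manifest` and all identifiers are hypothetical placeholders.

```python
import json
from typing import Dict, List


def build_manifest(identifier: str, name: str, parts: List[Dict[str, str]]) -> Dict:
    """Describe a Research Object as a schema.org Collection of typed parts."""
    return {
        "@context": "https://schema.org",
        "@type": "Collection",
        "@id": identifier,
        "name": name,
        # Each part is referenced by its identifier, wherever it is hosted.
        "hasPart": [
            {"@type": p["type"], "@id": p["id"], "name": p["name"]}
            for p in parts
        ],
    }


# Placeholder identifiers, for illustration only.
manifest = build_manifest(
    identifier="https://example.org/research-objects/ro-1",
    name="Example Research Object",
    parts=[
        {"type": "Dataset", "id": "https://example.org/datasets/ds-1",
         "name": "Input data"},
        {"type": "SoftwareSourceCode", "id": "https://example.org/workflows/wf-1",
         "name": "Analysis workflow (CWL)"},
    ],
)
print(json.dumps(manifest, indent=2))
```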
15. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
In Summary:
Objective 1 – support “Task” type Research Objects on Seven Bridges platform.
Objective 2 – support configurable Research Objects on Mendeley Data platform.
Phase 1: Pilot Project (Apr – Sep 2018)
• Component: GUID Broker (API Only)
• Platform: Seven Bridges FAIR4CURES Platform (diagram arrow: Uses)
• Register Datasets (Data Files)
• Register Software Objects
• Register Workflow Objects
• Register a Collection as a list of digital objects (data, sw, wf)
Phase 2: Project (Oct 2018 – 2019)
• Component: Research Object Composer
• Platform: Mendeley Data Platform (diagram arrows: Uses, Re-uses)
• Add annotation and relationships to collection to describe a research object
• Serialise Research Object in standard format based on BDBags and RO standards
http://smart-api.info/ui/bf9abe9c17c9c78c432832382ef9e16a#/
16. ELSEVIER | The Research Object Authoring Tool --- CNI 2018
Acknowledgements:
• This work is supported by the NIH Data Commons Pilot Phase under the Research Opportunity
Announcement (ROA) RM-17-026 https://commonfund.nih.gov/commons/:
• NIH Data Commons - 1 OT3 OD025463-01
• NHLBI STAGE Project - 1 OT3 HL142478-01
• The FAIR4CURES Project is led by Seven Bridges (Alison Leaf, Brandi Davis-Dusenbery and Sarper Avcil)
• We partner in the Project with Repositive UK and the US Dept of Veterans Affairs
• The metadata standards development was done by KC2, led by Team Sodium (esp. Merce Crosas, Tim
Clark, Trisha Cruse and Martin Fenner)
• The Research Objects Authoring Tool work is led by the University of Manchester, who pioneered work
on Research Objects (Stian Soiland-Reyes and Carole Goble)
• The Mendeley Data team has built the GUID Broker Prototype (Gabriel Oscares, Gareth Harvey