Metadata and Semantics Research Conference, Manchester, UK 2015
Research Objects: why, what and how
In practice the exchange, reuse and reproduction of scientific experiments are hard, depending on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is never “finished”: codes fork, data is updated, algorithms are revised, workflows break, service updates are released. Nor should they be viewed merely as second-class artifacts tethered to publications, but as the focus of research outcomes in their own right: articles clustered around datasets, methods with citation profiles. Many funders and publishers have come to acknowledge this, moving to data-sharing policies and provisioning e-infrastructure platforms. Many researchers recognise the importance of working with Research Objects, and the term has become widespread. However: what is a Research Object? How do you mint one, exchange one, build a platform to support one, curate one? How do we introduce them in a lightweight way that platform developers can migrate to? What is the practical impact of a Research Object Commons on training, stewardship, scholarship and sharing? How do we address the scholarly and technological debt of making and maintaining Research Objects? Are there any examples?
I’ll present our practical experiences of the why, what and how of Research Objects.
1. Research Objects: why, what and how
Professor Carole Goble CBE FREng FBCS
The University of Manchester, UK
The Software Sustainability Institute, UK
carole.goble@manchester.ac.uk
researchobject.org
Metadata and Semantics Research Conference 2015, 9-11 Sept 2015, Manchester, UK
2. Prologue
• e-Lab Collabs & Shared Asset Repositories
• Knowledge, Metadata, Linked Data, Ontologies
• Software Engineering for Scientists
• Computational Workflow Systems
• Reproducibility
• Micro Publications
• Open Science
• Research Objects
• Linked Data for Science
• Scholarly Comms
4. Knowledge Turning, Info Flow
Barriers to Cure:
• Access to scientific resources
• Coordination and collaboration
• Flow of information
http://fora.tv/2010/04/23/Sage_Commons_Josh_Sommer_Chordoma_Foundation
7. Virtual Witnessing*
Scientific publications:
• announce a result
• convince readers the result is correct
“papers in experimental [and computational science] should describe the results and provide a clear enough protocol [algorithm] to allow successful repetition and extension”
Jill Mesirov, Broad Institute, 2010**
**Accessible Reproducible Research, Science 22 January 2010, Vol. 327 no. 5964 pp. 415-416, DOI: 10.1126/science.1179653
*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.
8. Bramhall et al., Quality of Methods Reporting in Animal Models of Colitis, Inflammatory Bowel Diseases, 2015
“Only one of the 58 papers reported all essential criteria on our checklist. Animal age, gender, housing conditions and mortality/morbidity were all poorly reported…”
50 papers randomly chosen from 378 manuscripts in 2011 that use the Burrows-Wheeler Aligner for mapping Illumina reads:
• 31 gave no software version, parameters, or exact version of the genomic reference sequence
• 26 gave no access to primary data sets
Nekrutenko & Taylor, Next-generation sequencing data interpretation: enhancing reproducibility and accessibility, Nature Reviews Genetics 13 (2012)
9. “I can’t immediately reproduce the research in my own laboratory. It took an estimated 280 hours for an average user to approximately reproduce the paper.”
Prof Phil Bourne, Associate Director, NIH Big Data 2 Knowledge Program
10. “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.”
David Donoho, “Wavelab and Reproducible Research,” 1995
11. From Manuscripts to “Research Objects”
Multifarious, citable research products/assets
13. From manuscripts to “Research Objects”
Pre-packaged Docker images containing a bioinformatics tool and a standardised interface through which data and parameters are passed.
http://bioboxes.org
14. FAIR Research, crossing silos
From Manuscripts to “Research Objects”:
• Datasets, data collections
• Standard operating procedures
• Software, algorithms
• Configurations
• Tools and apps, services
• Codes, code libraries
• Workflows, scripts
• System software
• Infrastructure
• Compilers, hardware
Fragmentation
20. "Mapping present and future predicted distribution patterns for a meso-grazer guild in the Baltic Sea" by Sonja Leidenberger et al.
Workflow Commons
21. Instruments, Materials, Method
Data scopes: input data, output data, software, config, parameters.
Experiment setup:
• Methods: techniques, algorithms, specification of the steps
• Materials: datasets, parameters, algorithm seeds
• Instruments: codes, services, scripts, underlying libraries
• Laboratory: software and hardware infrastructure, systems software, integrative platforms
Drummond, Replicability is not Reproducibility: Nor is it Good Science, online
Peng, Reproducible Research in Computational Science, Science, 2 Dec 2011: 1226-1227.
22. Instruments, Materials, Method
Read. Run. Remake.
Science changes, experiments & results vary, and so do labs. Instruments break, labs decay.
Zhao et al., Why workflows break: understanding and combating decay in Taverna workflows, 8th Intl Conf on e-Science, 2012
http://atyourservice.blogs.xerox.com/files/2011/09/cloning-results-may-vary.jpg
26. FAIRDOM Metadata Framework
Link studies, link assets, map content.
• Common elements and relationships between things produced and used in experiments.
• Specific elements for specific data types.
Just Enough Results Model (JERM)
http://seek4science.org/JERMOntology
http://isatab.sourceforge.net/format.html
27. Penkler et al (2015) FEBS J 282:1481-1511
https://dx.doi.org/10.1111/febs.13237
29. Why Research Objects?
Preserved, portable research products: snapshots for inter-platform exchange and reproducibility.
Commons. New discovery.
30. Cross-Institutional e-Lab fragmentation
Parts scattered across subject-specific and general resources.
101 Innovations in Scholarly Communication: the Changing Research Workflow, Bosman and Kramer, 2015,
http://figshare.com/articles/101_Innovations_in_Scholarly_Communication_the_Changing_Research_Workflow/1286826
31. Why Research Objects?
Active research products, snapshots:
• Fork
• Merge
• Version
• Cite
• Snapshot
• Live
[Martin Scharm]
Haus et al., BMC Systems Biology, 2011, 5:10
Solvent production by Clostridium acetobutylicum
32. F1000Research Living Figures
Versioned articles, in-article data manipulation.
R. Lawrence, Force2015 Vision Award Runner Up
http://f1000.com/posters/browse/summary/1097482
Simply data + code: can change the definition of a figure, and ultimately the journal article.
Colomb J and Brembs B. Sub-strains of Drosophila Canton-S differ markedly in their locomotor behavior [v1; ref status: indexed, http://f1000r.es/3is] F1000Research 2014, 3:176
Other labs can replicate the study, or contribute their data to a meta-analysis or disease model; the figure automatically updates. Data updates are time-stamped. New conclusions are added via versions.
33. Publish, Release (like Software)
An “evolving manuscript” would begin with a pre-publication, pre-peer-review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”. Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.
Ottoline Leyser: […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
34. Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
Software-like release paradigm:
• Agile development methods
• Free Open Source Software methods
https://tctechcrunch2011.files.wordpress.com/2011/05/tcdisrupt_tc-9.jpg
36. Multifarious products, platforms, resources.
First-class citizens: id, manage, credit, track, profile, focus.
A framework to bundle, port and link (scattered) resources and related experiments. Metadata objects that carry research context. Units of exchange.
Bechhofer et al., Why linked data is not enough for scientists, DOI: 10.1016/j.future.2011.08.004
37. Metadata Objects
Evolving: multi-typed, stewarded, sited, authored; spanning research, researchers, platforms, time.
Contributions. Content. Stewardship. Citation. Scholarship.
closed <-> open
local <-> alien
embed <-> refer
Bigger on the inside than the outside: content may be logically or physically inside.
TARDIS: Time and Relative Dimension in Space
https://meditationsfromzion.files.wordpress.com/2013/05/tardis.jpg
38. What and How Framework
• Manifest: core model using standards
• Annotation profiles: progressive extensions
• Implementation profiles: using legacy & commodity platforms
• Policies, tools, lifecycle, stewardship, training
Principles & conventions. API specification. Metadata formats.
40. Manifests and Containers
Container packaging: Zip files, Docker images, BagIt, …
Catalogues & Commons platforms: FAIRDOM SEEK, Farr Commons, CKAN, STELAR eLab, myExperiment
Manifest metadata: describes the aggregated resources, their annotations and their provenance.
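The container-plus-manifest split above can be sketched in a few lines. The example below zips some hypothetical resource files together with a minimal JSON manifest; the file names and manifest layout are simplified illustrations, not the full Research Object specification.

```python
import json
import tempfile
import zipfile
from pathlib import Path

def bundle_research_object(archive_path, resources, metadata):
    """Package resources plus a JSON manifest into a single zip container.

    `resources` maps archive-internal paths to byte content. The manifest
    layout here is a simplified illustration of the idea, not the RO spec.
    """
    manifest = {
        "id": metadata.get("id"),
        "title": metadata.get("title"),
        "aggregates": sorted(resources),  # paths of the bundled resources
    }
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for name, content in resources.items():
            zf.writestr(name, content)
    return manifest

# Example: bundle a tiny (made-up) workflow run
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "experiment.ro.zip"
    manifest = bundle_research_object(
        archive,
        {"data/input.csv": b"a,b\n1,2\n", "workflow/run.sh": b"echo run\n"},
        {"id": "urn:example:ro-1", "title": "Example run"},
    )
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
```

The same pattern transfers to BagIt bags or Docker images: the container format changes, the manifest idea does not.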
41. Manifest Metadata
Manifest construction:
• Identification: id, title, creator, status, …
• Aggregates: list of ids/links to resources
• Annotations: list of annotations about resources
Manifest description:
• Checklists: what should be there
• Provenance: where it came from
• Versioning: its evolution
• Dependencies: what else is needed
42. Manifest Construction
Unique identifiers as
names for things.
doi, epic, orcid, purl, RII,
Identifiers.org
Mechanism of
aggregation to group
things together.
OAI-ORE
Metadata about those
things & how they relate
to each other.
W3C OADM
http://w3id.org/ro/
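Put together, a minimal manifest is identification plus an aggregation plus annotations. The sketch below builds one as a JSON-LD-style Python dict; the field names loosely follow the researchobject.org manifest vocabulary but are simplified, and every identifier in it is a made-up example.

```python
import json

# An illustrative manifest: identify the RO, aggregate resources by
# identifier, and attach annotations linking them (OADM-style).
# All URIs, the ORCID and the context value are placeholder examples.
manifest = {
    "@context": "https://w3id.org/ro/context",   # JSON-LD context (example)
    "id": "https://example.org/ro/42/",          # unique identifier for the RO
    "createdBy": {"orcid": "https://orcid.org/0000-0000-0000-0000"},
    "aggregates": [
        {"uri": "https://doi.org/10.5281/zenodo.10439"},  # external resource
        {"uri": "data/results.csv"},                       # bundled resource
    ],
    "annotations": [
        {   # an annotation about an aggregated thing
            "about": "data/results.csv",
            "content": "annotations/results-provenance.ttl",
        }
    ],
}
serialized = json.dumps(manifest, indent=2)
```

The construction mirrors the three bullets above: identifiers name things, `aggregates` groups them, and `annotations` records how they relate.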
44. Checklists aka Reporting Guidelines
Consistent reporting, standardised cataloguing, validation.
Gamble, Goble, Klyne, Zhao, MIM: A Minimum Information Model vocabulary and framework for Scientific Linked Data, IEEE 8th Intl Conf on eScience, 2012
Example: MeanWhealDiameter reports must include values for the properties SubjectId, SptSolution, Date, FollowUp, and should include values for the property VariableLabel.
47. RO Unzip
• Reproducibility
• Versioning
• Systematic and extensible metadata collection
• Cross-platform exchange
• Publishing
Living snapshot: Sys and Syn Bio experiment management and publishing.
49. Sys & Syn Biology Community Standards
Bergmann, Rodriguez, Le Novère, COMBINE archive specification, <http://identifiers.org/combine.specifications/omex.version-1> (2014)
Bergmann et al., COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project, BMC Bioinformatics 2014, 15:369
Combine with RO: standardised metadata & API.
http://co.mbine.org/documents/archive
https://github.com/stain/ro-combine-archive doi:10.5281/zenodo.10439
Martin Scharm, Universität Rostock
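A COMBINE archive is a zip whose manifest.xml declares the location and format of every entry. The sketch below builds a tiny in-memory example and reads its entries back; the namespace follows the OMEX specification cited above, but the archive content is a hand-made illustration, not a real model.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace from the COMBINE archive (OMEX) specification
OMEX_NS = "http://identifiers.org/combine.specifications/omex-manifest"

def read_omex_entries(archive_bytes):
    """Return {location: format} for each <content> entry in manifest.xml."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        root = ET.fromstring(zf.read("manifest.xml"))
    return {
        el.get("location"): el.get("format")
        for el in root.iter(f"{{{OMEX_NS}}}content")
    }

# Build a minimal example archive in memory (illustrative content)
manifest_xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<omexManifest xmlns="{OMEX_NS}">
  <content location="./model.xml"
           format="http://identifiers.org/combine.specifications/sbml"/>
  <content location="./manifest.xml"
           format="{OMEX_NS}"/>
</omexManifest>"""
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("manifest.xml", manifest_xml)
    zf.writestr("model.xml", "<sbml/>")
entries = read_omex_entries(buf.getvalue())
```

The ro-combine-archive work linked above layers RO manifest metadata over exactly this kind of archive.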
50. ATLAS Collider Data Analytics
Portable, lightweight application runtime and packaging tool (image).
ATLAS and CMS detector data: all data and files of the execution + instructions.
Convert, bundle, manifest:
• Relate files and layers
• Add provenance and annotations
• Link in other content
Exchange. Reproducibility: same data, same code, same run-time environment. Systematic and extensible metadata collection.
Charles Vardeman, Da Huo, University of Notre Dame
52. STELAR Asthma Research e-Lab
STELAR e-Lab: requests for data, data exports, comments, questions.
On-going data collection: ALSPAC, MAAS, SEATON, Ashford, Isle of Wight.
STELAR researchers: data collection, methods and results.
STELAR Team, Farr Institute @ Manchester
54. NIH BD2K Commons and Research Objects
Metadata profiles:
• RO Model API
• Community IDs
• RO Model manifest profile
• Implementation profiles
https://datascience.nih.gov/commons
56. Many outstanding issues…
Social & cultural. Technical.
Tragedy of the Commons.
https://doctorwhothing.files.wordpress.com/2014/01/doctor-who-fan-girl-group.jpg
59. RO Ramps. Born RO.
Commodity tooling, libraries, lightweight.
• Making and auto-making manifest descriptions
• Making containers
• Literate programming, electronic lab notebooks
• Rendering & using manifests
60. FAIR Citation, Credit, Tracking
• Citation: resolution and semantics
• Tamper-proof currency: blockchain, Ethereum
• RO trajectories: data trajectories [Missier], provenance propagation
• Credit trajectories: micro-credit tracking
• Social-political acceptance: all research products valued, FAIR publishing effort recognised

• Defend it (snapshot)
• Locate it (most recent)
• Reuse it (a version, a component)
• Credit it (contributory authorship)
• Cross-link it (connections)
61. Knowledge Turning with ROs
A simple approach, towards transparent FAIR principles.
https://d2t1xqejof9utc.cloudfront.net/screenshots/pics/1ddf584eb4cf6b1283baf9aa6d380cff/original.jpg
62. Knowledge Turning with ROs (inspired by Bob Harrison)
• Incremental shift for infrastructure providers.
• Moderate shift for policy makers and stewards.
• Paradigm shift for researchers, their institutions and publishers.
63. All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.software.ac.uk
http://www.datafairport.org
Alan Williams
Jo McEntyre
Norman Morrison
Stian Soiland-Reyes
Paul Groth
Tim Clark
Juliana Freire
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
Kristian Garza
Barend Mons
Sean Bechhofer
Philip Bourne
Matthew Gamble
Raul Palma
Jun Zhao
Neil Chue Hong
Josh Sommer
Matthias Obst
Jacky Snoep
David Gavaghan
Rebecca Lawrence
Stuart Owen
Finn Bacall