INSERM - Data Management & Reuse of Health Data - May 2017

On community-standards, FAIR data and
scholarly communication
Susanna-Assunta Sansone, PhD
ORCID: 0000-0001-5306-5690
INSERM Workshop 246 “Management and reuse of health data: methodological issues”, Bordeaux, 14-17 May 2017
Data Consultant,
Founding Academic Editor
Associate Director,
Principal Investigator
www.slideshare.net/SusannaSansone

Source: https://www.dataone.org/best-practices
Simplified research data life cycle

To do better science, more efficiently
we need data that are…
• Available in a public repository
• Findable through some sort of search facility
• Retrievable in a standard format
• Self-describing so that third parties can make sense of it
• The product of careful planning, organization and stewardship
• Intended to outlive the experiment for which they were collected

Key problem: low findability and understandability
• Not always well cited and stored
o True for data as well as for any other digital asset
• Poorly described for third party reuse
o Different levels of detail and annotation
• Reporting and annotation activities are perceived as time-consuming
o Often rushed and minimally done

We need content or reporting standards
• To harmonize the datasets with respect to the structure
and level of annotation of their:
§ experimental components (e.g., design, conditions, parameters),
§ fundamental biological entities (e.g., samples, genes, cells),
§ complex concepts (such as bioprocesses, tissues, diseases),
§ analytical process and the mathematical models, and
§ their instantiation in computational simulations (from the
molecular level through to whole populations of individuals)

Minimum information reporting
requirements, checklists
o Report the same core, essential
information
o e.g. MIAME guidelines
Controlled vocabularies, taxonomies, thesauri,
ontologies etc.
o Unambiguous identification and definition of
concepts
o e.g. Gene Ontology
Conceptual model, schema,
exchange formats etc
o Define the structure and
interrelation of information, and
the transmission format
o e.g. FASTA
Formats Terminologies Guidelines
Types of content standards
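
The three types of content standards can be illustrated with a minimal sketch, assuming a toy metadata record. The field names, the vocabulary and the tab-separated layout below are hypothetical and are not taken from MIAME, Gene Ontology, FASTA or any other real standard.

```python
# Hypothetical checklist (guideline): the core fields every record must report.
REQUIRED_FIELDS = ("sample_id", "organism", "experimental_design")

# Hypothetical controlled vocabulary (terminology) for the "organism" field.
ALLOWED_ORGANISMS = {"Homo sapiens", "Mus musculus"}


def check_guideline(record: dict) -> list:
    """Return the required fields that are missing or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def check_terminology(record: dict) -> bool:
    """Return True if the organism is a term from the controlled vocabulary."""
    return record.get("organism") in ALLOWED_ORGANISMS


def to_format(record: dict) -> str:
    """Serialize the record as a header line plus one tab-separated row (format)."""
    header = "\t".join(REQUIRED_FIELDS)
    row = "\t".join(str(record.get(field, "")) for field in REQUIRED_FIELDS)
    return header + "\n" + row


if __name__ == "__main__":
    record = {"sample_id": "S1", "organism": "Homo sapiens"}
    print("Missing fields:", check_guideline(record))
    print("Organism term recognised:", check_terminology(record))
    print(to_format(record))
```

Each function stands in for one type: the checklist says what to report, the vocabulary says which terms to use, and the serializer fixes the structure in which the information is exchanged.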

de jure: standard organizations
de facto: grass-roots groups
Nanotechnology Working Group
Community-driven efforts, just a few examples

Counts shown: 224, 115, 500+
Guidelines: MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …
Formats: SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA, CML, MITAB, …
Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, …
Content standards in numbers

How to discover the ‘right’ standards for your data?

A web-based, curated and searchable portal that monitors the development and
evolution of standards, their use in databases and the adoption of both in data
policies, to inform and educate the user community

Data policies by
funders, journals and
other organizations
Content standards
Map this complex and evolving landscape
Databases
All records are manually curated in-house
and verified by the community behind each resource

Using indicators to describe ‘status’
Ready for use, implementation, or recommendation
In development
Status uncertain
Deprecated as subsumed or superseded
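
The four status indicators lend themselves to a small enumeration. This is a minimal sketch, assuming records are plain dictionaries; the record names and the `recommendable` helper are hypothetical and do not reflect the portal's actual data model.

```python
from enum import Enum


class Status(Enum):
    """The four status indicators named on the slide."""
    READY = "Ready for use, implementation, or recommendation"
    IN_DEVELOPMENT = "In development"
    UNCERTAIN = "Status uncertain"
    DEPRECATED = "Deprecated as subsumed or superseded"


# Hypothetical records, for illustration only.
RECORDS = [
    {"name": "Standard A", "status": Status.READY},
    {"name": "Standard B", "status": Status.IN_DEVELOPMENT},
    {"name": "Standard C", "status": Status.DEPRECATED},
]


def recommendable(records):
    """Return the names of records whose status is READY."""
    return [r["name"] for r in records if r["status"] is Status.READY]


if __name__ == "__main__":
    print(recommendable(RECORDS))
```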

Understanding how standards are used

(Figure labels: Guideline, Formats, Terminology)

Using indicators to show ‘adoption’

Standard developing groups:
Journals, publishers:
Cross-links, data exchange:
Societies and organisations:
Institutional RDM services:
Projects, programmes:

Technologically-delineated
views of the world
Biologically-delineated
views of the world
Generic features (‘common core’)
- description of source biomaterial
- experimental design components
(Diagram labels: arrays, scanning, columns, gels, MS, FTIR, NMR; transcriptomics, proteomics, metabolomics; plant biology, epidemiology, microbiology)
Duplications & lack of interoperability among standards

Hard to use them in combination, e.g. to represent:
Proteomics-based gut microbiota profiling
Proteomics- and metabolomics-based gut microbiota profiling

Enhancing modularization
Proteomics-based gut microbiota profiling
Proteomics- and metabolomics-based gut microbiota profiling

biosharing:ReportingGuideline records: bsg-000174, bsg-000161
Standards: MINSEQE, MIMARKS
High-level information about the metadata standards
Representations of the standards elements: sample information, sample identifier, taxonomy identifier, sequence read, geo location
Template elements for
Element identifiers: el-000001, el-000002, el-000003
Provenance labels: MINSEQE; MINSEQE and MIMARKS; MIMARKS
Serve machine-readable content metadata standards, providing provenance for
their elements, rendering standards invisible to the researchers
Inform the creation of metadata templates

How to discover the datasets relevant to your work?

OmicsDI: Nature Biotechnology 35, 406–409 (2017) doi:10.1038/nbt.3790
omicsdi.org

datamed.org
DataMed: bioRxiv 094888; https://doi.org/10.1101/094888 Nature Genetics (in press)
DATS: bioRxiv 103143; https://doi.org/10.1101/103143 Scientific Data (in press)

• Discoverability and reusability
o Complementing community
databases
• Incentive, credit for sharing
o Big and small data
o Unpublished data
o Long tail of data
o Curated aggregation
• Peer review of data
• Value of data vs. analysis
Growing number of data papers and data journals, e.g.:

nature.com/scientificdata
Honorary Academic Editor
Managing Editor
Andrew L Hufton, PhD
Editorial Curator
Varsha Khodiyar
Publisher
Iain Hrynaszkiewicz
A new open-access, online-only publication for
descriptions of scientifically valuable datasets
Supported by

• A peer reviewed description of data, to maximize usage
• Citable publications that give credit for reusable data
• It requires data deposition in the appropriate repository(ies)
• It is complementary to, and may or may not be associated with, traditional article(s)
New article type

Research
papers
Data
records
Data
Descriptors
Value added component – complementing
articles and repositories

• Title
• Abstract
• Background & Summary
• Methods
• Data Records
• Technical Validation
• Usage Notes
• Figures & Tables
• References
• Data Citations
o following the Joint Declaration of Data Citation Principles
Detailed description of the methods and
technical analyses supporting the
quality of the measurements;
no scientific hypotheses
Article structure

Focus on data peer review
• Completeness = can others reproduce?
• Consistency = were community standards followed?
• Integrity = are data in the best repository?
• Experimental rigour, technical quality = were the methods sound?
Does not focus on perceived impact, importance, size, complexity of data

Credit for data producers, data managers/curators etc.
Credit to: Varsha Khodiyar

“The Data Descriptor made it easier to use
the data, for me it was critical that everything
was there…all the technical details like voxel
size.”
Professor Daniele Marinazzo
Credit to: Varsha Khodiyar
Data (re)use made easier

Decades
old dataset
Aggregated or
curated data
resources
Computationally
produced data
products
Large
consortium
dataset
Data from a
single
experiment
Data that YOU
find valuable
and that others
might find
useful too
Data associated
with a high impact
analysis article
What makes a good Data Descriptor?

Experimental metadata or
structured component
(in-house curated, machine-
readable formats)
Article or
narrative component
(PDF and HTML)
Data Descriptors have two components

The Data Curation Editor is responsible for creating and
curating the machine-readable structured component
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository
records
Curation and discoverability

Created with the input of the
authors, includes value-added
semantic annotation of the
experimental metadata
analysis
method
script
Data file or
record in a
database
Data Descriptors: structured component

Complementary roles of ISA and
nanopublications
From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles
of Data Models and Workflows in Bioinformatics. https://doi.org/10.1371/journal.pone.0127612
PLOS ONE (2015)

Responsibilities lie across several stakeholder groups
Understand the benefits of sharing
FAIR datasets and enact them
Engage and assist researchers to
enable them to share FAIR datasets
Release or endorse practices
and policies, but also incentive
and credit mechanisms for
researchers, curators and
developers

“As Data Science culture grows,
digital research outputs (such as
data, computational analysis and
software) are being established as
first-class citizens.
This cultural shift is required to go
one step further: to recognize
interoperability standards as digital
objects in their own right, with their
associated research, development
and educational activities”.
Sansone, Susanna-Assunta; Rocca-Serra, Philippe (2016).
Interoperability Standards - Digital Objects in Their Own
Right. Wellcome Trust.
https://dx.doi.org/10.6084/m9.figshare.4055496.v1

Philippe
Rocca-Serra, PhD
Senior Research Lecturer
Alejandra
Gonzalez-Beltran, PhD
Research Lecturer
Milo
Thurston, DPhil
Research Software Engineer
Massimiliano
Izzo, PhD
Peter
McQuilton, PhD
Knowledge Engineer
Allyson
Lister, PhD
Knowledge Engineer
Eamonn
Maguire, DPhil
Contractor
David
Johnson, PhD
Melanie
Adekale, PhD
Biocurator Contractor
Delphine
Dauga, PhD
Biocurator Contractor
We work with and for
to make data and other
digital research assets
Principal Investigator, Associate Director
and Data Consultant for Springer Nature
enabling open science,
driving science and discoveries
