INSERM Workshop 246 - Management and reuse of health data: methodological issues: https://ateliersinserm.dakini.fr/en/workshop.246.management.and.reuse.of.health.data.methodological.issues-66-22.php
INSERM - Data Management & Reuse of Health Data - May 2017
1. On community standards, FAIR data and scholarly communication
Susanna-Assunta Sansone, PhD
ORCID: 0000-0001-5306-5690
INSERM Workshop 246 “Management and reuse of health data: methodological issues”, Bordeaux, 14-17 May 2017
Data Consultant,
Founding Academic Editor
Associate Director,
Principal Investigator
www.slideshare.net/SusannaSansone
4. To do better science, more efficiently, we need data that are…
• Available in a public repository
• Findable through some sort of search facility
• Retrievable in a standard format
• Self-describing so that third parties can make sense of it
• The product of careful planning, organization and stewardship
• Intended to outlive the experiment for which they were collected
5. Key problem: low findability and understandability
• Not always well cited and stored
o True for data as well as for any other digital asset
• Poorly described for third party reuse
o Different levels of detail and annotation
• Reporting and annotation activities are perceived as time
consuming
o Often rushed and minimally done
6. We need content or reporting standards
• To harmonize datasets with respect to the structure
and level of annotation of their:
§ experimental components (e.g., design, conditions, parameters),
§ fundamental biological entities (e.g., samples, genes, cells),
§ complex concepts (such as bioprocesses, tissues, diseases),
§ analytical process and the mathematical models, and
§ their instantiation in computational simulations (from the
molecular level through to whole populations of individuals)
7. Types of content standards
Guidelines: minimum information reporting requirements, checklists
o Report the same core, essential information
o e.g. MIAME guidelines
Terminologies: controlled vocabularies, taxonomies, thesauri, ontologies etc.
o Unambiguous identification and definition of concepts
o e.g. Gene Ontology
Formats: conceptual models, schemas, exchange formats etc.
o Define the structure and interrelation of information, and the transmission format
o e.g. FASTA
8. Community-driven efforts, just a few examples
De jure: standards organizations
De facto: grass-roots groups
Nanotechnology Working Group
Formats, Terminologies, Guidelines
9. Content standards in numbers
Counts shown, each with a source link: 224, 115, 500+
Formats: SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA, CML, MITAB…
Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO…
Guidelines: MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT…
14. Map this complex and evolving landscape
• Content standards (formats, terminologies, guidelines)
• Databases
• Data policies by funders, journals and other organizations
All records are manually curated in-house and verified by the community behind each resource
15. Using indicators to describe ‘status’
• Ready for use, implementation, or recommendation
• In development
• Status uncertain
• Deprecated as subsumed or superseded
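The four status indicators above can be modelled as a small enumeration. The following is a minimal sketch in Python, not the registry's actual data model: the `StandardRecord` type, its fields, the `recommendable` helper and the placeholder records are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The four status indicators listed on the slide."""
    READY = "Ready for use, implementation, or recommendation"
    IN_DEVELOPMENT = "In development"
    UNCERTAIN = "Status uncertain"
    DEPRECATED = "Deprecated as subsumed or superseded"


@dataclass(frozen=True)
class StandardRecord:
    """Hypothetical registry record; not the registry's real schema."""
    name: str
    kind: str  # "format", "terminology" or "guideline"
    status: Status


def recommendable(records):
    """Return the names of records whose status is READY."""
    return [r.name for r in records if r.status is Status.READY]


# Placeholder records, invented for illustration only.
records = [
    StandardRecord("example-format", "format", Status.READY),
    StandardRecord("example-terminology", "terminology", Status.IN_DEVELOPMENT),
    StandardRecord("example-guideline", "guideline", Status.DEPRECATED),
]

print(recommendable(records))
```

A funder or journal drafting a data policy could use a filter of this kind to recommend only standards marked as ready, while still being able to see what is in development or deprecated.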
21. Using indicators to indicate ‘adoption’
25. • Standard developing groups
• Journals, publishers
• Cross-links, data exchange
• Societies and organisations
• Institutional RDM services
• Projects, programmes
26. Duplications & lack of interoperability among standards
Technologically-delineated views of the world: arrays & scanning, columns, gels, MS, FTIR, NMR
Biologically-delineated views of the world: transcriptomics, proteomics, metabolomics, plant biology, epidemiology, microbiology
Generic features (‘common core’)
- description of source biomaterial
- experimental design components
27. Hard to use them in combinations, e.g. to represent:
• Proteomics-based gut microbiota profiling
• Proteomics and metabolomics based gut microbiota profiling
28. Enhancing modularization
• Proteomics-based gut microbiota profiling
• Proteomics and metabolomics based gut microbiota profiling
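One way to picture modularization is as a shared common core plus small technology and domain modules that are combined per study. The sketch below is illustrative only: the two common-core entries come from the ‘common core’ slide, while the module contents and the `compose` helper are hypothetical and not taken from any published standard.

```python
# Generic features shared by every study (from the 'common core' slide).
COMMON_CORE = [
    "description of source biomaterial",
    "experimental design components",
]

# Hypothetical modules; the field names are invented for illustration.
MODULES = {
    "proteomics": ["protein identification"],
    "metabolomics": ["metabolite identification"],
    "microbiology": ["microbial community description"],
}


def compose(*module_names):
    """Build one checklist: common core first, then each module's fields.

    Fields already present are skipped, so shared items appear only once.
    """
    fields = list(COMMON_CORE)
    for name in module_names:
        for field in MODULES[name]:
            if field not in fields:
                fields.append(field)
    return fields


# Proteomics-based gut microbiota profiling
print(compose("proteomics", "microbiology"))

# Proteomics and metabolomics based gut microbiota profiling
print(compose("proteomics", "metabolomics", "microbiology"))
```

Because the common core is declared once and reused, adding a technology to a study means adding one module rather than reconciling two overlapping standards.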
33. datamed.org
DataMed: bioRxiv 094888; https://doi.org/10.1101/094888; Nature Genetics (in press)
DATS: bioRxiv 103143; https://doi.org/10.1101/103143; Scientific Data (in press)
34. Growing number of data papers and data journals
• Discoverability and reusability
o Complementing community databases
• Incentive, credit for sharing
o Big and small data
o Unpublished data
o Long tail of data
o Curated aggregation
• Peer review of data
• Value of data vs. analysis
35. nature.com/scientificdata
A new open-access, online-only publication for descriptions of scientifically valuable datasets
Honorary Academic Editor: Susanna-Assunta Sansone, PhD
Managing Editor: Andrew L Hufton, PhD
Editorial Curator: Varsha Khodiyar
Publisher: Iain Hrynaszkiewicz
36. New article type
• A peer-reviewed description of data, to maximize usage
• A citable publication that gives credit for reusable data
• Requires data deposition in the appropriate repository(ies)
• Complementary to traditional article(s), with which it may or may not be associated
38. Article structure
• Title
• Abstract
• Background & Summary
• Methods
• Data Records
• Technical Validation
• Usage Notes
• Figures & Tables
• References
• Data Citations
o following the Joint Declaration of Data Citation Principles
Detailed description of the methods and technical analyses supporting the quality of the measurements; no scientific hypotheses
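The section list above lends itself to a simple completeness check before submission. The following is a minimal sketch under stated assumptions: the `missing_sections` helper and the dictionary representation of a draft are hypothetical, and this is not a tool provided by the journal.

```python
# Sections of a Data Descriptor, as listed on the slide.
SECTIONS = [
    "Title",
    "Abstract",
    "Background & Summary",
    "Methods",
    "Data Records",
    "Technical Validation",
    "Usage Notes",
    "Figures & Tables",
    "References",
    "Data Citations",
]


def missing_sections(draft):
    """Return the listed sections that are absent or empty in `draft`.

    `draft` is assumed to be a dict mapping section name to its text.
    """
    return [name for name in SECTIONS if not draft.get(name, "").strip()]


# Invented draft with only three sections filled in.
draft = {
    "Title": "An example dataset",
    "Abstract": "A short summary.",
    "Methods": "How the data were collected.",
    "Usage Notes": "   ",  # whitespace only, so treated as empty
}

print(missing_sections(draft))
```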
39. Focus on data peer review
• Completeness = can others reproduce?
• Consistency = were community standards followed?
• Integrity = are data in the best repository?
• Experimental rigour, technical quality = were the methods sound?
Does not focus on perceived impact, importance, size, complexity of data
40. Credit for data producers, data managers/curators etc.
Credit to: Varsha Khodiyar
41. Data (re)use made easier
“The Data Descriptor made it easier to use the data, for me it was critical that everything was there…all the technical details like voxel size.”
Professor Daniele Marinazzo
Credit to: Varsha Khodiyar
42. What makes a good Data Descriptor?
• Decades-old dataset
• Aggregated or curated data resources
• Computationally produced data products
• Large consortium dataset
• Data from a single experiment
• Data associated with a high-impact analysis article
• Data that YOU find valuable and that others might find useful too
43. Data Descriptors have two components
• Article or narrative component (PDF and HTML)
• Experimental metadata or structured component (in-house curated, machine-readable formats)
44. Curation and discoverability
The Data Curation Editor is responsible for creating and curating the machine-readable structured component
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository records
45. Data Descriptors: structured component
Created with the input of the authors, includes value-added semantic annotation of the experimental metadata
Diagram labels: analysis, method, script, data file or record in a database
49. Complementary roles of ISA and nanopublications
From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles of Data Models and Workflows in Bioinformatics. PLoS ONE (2015). https://doi.org/10.1371/journal.pone.0127612
51. Responsibilities lie across several stakeholder groups
• Understand the benefits of sharing FAIR datasets and enact them
• Engage and assist researchers to enable them to share FAIR datasets
• Release or endorse practices and policies, but also incentive and credit mechanisms for researchers, curators and developers
52. “As Data Science culture grows, digital research outputs (such as data, computational analysis and software) are being established as first-class citizens. This cultural shift is required to go one step further: to recognize interoperability standards as digital objects in their own right, with their associated research, development and educational activities”.
Sansone, Susanna-Assunta; Rocca-Serra, Philippe (2016). Interoperability Standards - Digital Objects in Their Own Right. Wellcome Trust. https://dx.doi.org/10.6084/m9.figshare.4055496.v1
53. Susanna-Assunta Sansone, PhD: Principal Investigator, Associate Director and Data Consultant for Springer Nature
Philippe Rocca-Serra, PhD: Senior Research Lecturer
Alejandra Gonzalez-Beltran, PhD: Research Lecturer
Milo Thurston, DPhil: Research Software Engineer
Massimiliano Izzo, PhD: Research Software Engineer
Peter McQuilton, PhD: Knowledge Engineer
Allyson Lister, PhD: Knowledge Engineer
Eamonn Maguire, DPhil: Contractor
David Johnson, PhD: Research Software Engineer
Melanie Adekale, PhD: Biocurator Contractor
Delphine Dauga, PhD: Biocurator Contractor
We work with and for
to make data and other digital research assets
enabling open science, driving science and discoveries