Metadata management for data storage spaces
INDEXATOR is a metadata management tool that addresses the problems of organising, documenting, storing and sharing data in a research unit or infrastructure, and fits naturally into a collective's data management plan.
The central idea is that the storage space becomes the data repository, so the metadata should go to the data and not the other way around.
Given the diversity of domains, the approach chosen is to be both as flexible and as pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary corresponding to the reality of its field and activities. The main idea is to be able to "capture" the user's metadata as easily as possible using their vocabulary. It is possible to define the whole terminology using a spreadsheet.
JSON was chosen as the metadata format because it is well suited to describing metadata and is readable by both humans and machines.
The tool is built around a web interface coupled with a MongoDB database. The web interface lets you i) describe a dataset using metadata of various types (Description) and ii) search datasets by their metadata (Accessibility).
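As an illustration of these two functions, the sketch below builds metadata records as plain JavaScript objects, serialises one to JSON, and filters the list by metadata criteria. The field names (title, keywords, datatype), the sample values and the matchesCriteria and search helpers are assumptions made for this example; they are not the tool's actual schema or its MongoDB queries.

```javascript
// Hypothetical metadata records; the field names are illustrative only.
const records = [
  { title: "Atacama", keywords: ["soil", "metabolomics"], datatype: "Image" },
  { title: "Oak survey", keywords: ["forest"], datatype: "Report" }
];

// Description: a record serialises directly to the body of a JSON metadata file.
const json = JSON.stringify(records[0], null, 2);

// Accessibility: keep the records whose fields match every criterion.
// A scalar field matches by equality, a list field by membership.
function matchesCriteria(record, criteria) {
  return Object.entries(criteria).every(([field, wanted]) => {
    const value = record[field];
    return Array.isArray(value) ? value.includes(wanted) : value === wanted;
  });
}

function search(all, criteria) {
  return all.filter((record) => matchesCriteria(record, criteria));
}

console.log(json);
console.log(search(records, { keywords: "metabolomics" }).map((r) => r.title));
```

In the tool itself the search runs against the MongoDB database; the in-memory filter only shows the idea of matching datasets on their metadata.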
1. Metadata management for data storage spaces
Contributors:
François Ehrenmann (UMR BioGECO)
Philippe Chaumeil (UMR BioGECO)
Daniel Jacob (UMR BFP)
2. INRAE - Indexator – October 2022
Data Management Plan
How to encourage the structures (units, platforms, ...) to better manage their data?
• Implementing a Data Management Plan (DMP) has some prerequisites, such as moving the data to be preserved out of the users' own disk space.
• This concerns not only published data but all data produced during the course of a project.
• It is even more necessary when temporary staff (doctoral students, post-docs, trainees, fixed-term contracts) are involved in producing the data.
3. INRAE - Indexator – October 2022
• The central idea is that the storage space becomes the data repository, so the metadata should go to the data and not the other way around.
[Figure labels: Metadata; Data storage; Your data repository]
• A first concern is the organisation of these storage spaces.
• Should they be harmonised by imposing good practices such as i) folder and file naming, ii) folder structure (docs, data, scripts, etc.), iii) the use of README files, and so on?
• Using at least a README file seems the simplest and least restrictive option. But what should go in it?
• How can they be used effectively when you want to find information? With what vocabulary?
4. INRAE - Indexator – October 2022
Project data storage space: put a metadata file (JSON format) describing the project data within each subdirectory.
JSON was chosen because it is well suited to describing metadata and is readable by both humans and machines.
[Figure labels: Data storage; Your data repository]
5. INRAE - Indexator – October 2022
Because producing JSON files by hand is tricky for users, a web interface makes it possible to create them.
[Figure labels: Web interface; Generate the metadata file (JSON); deposit; Data storage]
6. INRAE - Indexator – October 2022
Then, find projects and/or data corresponding to your criteria.
[Figure labels: Web interface; Generate the metadata file (JSON); deposit; Data storage; scan; Search datasets based on some metadata; View Metadata]
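The scan step in this workflow can be pictured as a recursive walk of the storage space that collects the metadata file found in each project directory. The sketch below uses Node.js. The PGD_ prefix follows the PGD_XXXXX.json file name that appears in the architecture slides, while the scanStorage function and the structure it returns are assumptions for illustration and not the tool's actual scanner.

```javascript
const fs = require("fs");
const path = require("path");

// Recursively collect metadata files (PGD_*.json) under a storage root.
// Returns one entry per file found: the directory it sits in and its parsed content.
function scanStorage(root) {
  const found = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) {
      found.push(...scanStorage(full));
    } else if (/^PGD_.*\.json$/.test(entry.name)) {
      found.push({ directory: root, metadata: JSON.parse(fs.readFileSync(full, "utf8")) });
    }
  }
  return found;
}
```

A call such as scanStorage("/mnt/data") (a hypothetical mount point) would return the entries to insert into the database. A real scanner would also need to handle unreadable or malformed files, which this sketch lets fail with an exception.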
7. INRAE - Indexator – October 2022
Questions immediately raised
• What metadata?
• How to specify it?
• From which vocabulary?
• How to generate a JSON file?
8. INRAE - Indexator – October 2022
• Given the diversity of domains, the approach chosen is to be both as flexible and as
pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary
corresponding to the reality of its field and activities.
• The main idea is to be able to "capture" the user's metadata as easily as possible using their
vocabulary.
9. INRAE - Indexator – October 2022
The web interface must therefore correspond to the scientific and experimental context of the collective (research unit, project, platform, ...).
10. INRAE - Indexator – October 2022
Web interface for metadata entry
[Screenshot of the entry form, built up over several slides and annotated with: Sections; Fields; Type (textbox, dropbox, checkbox); Features (width=350px, width=500px, open); Predefined terms. The form is used to generate the metadata file (JSON).]
16. INRAE - Indexator – October 2022
Definition of metadata
• Terminology definition file (config_terms.txt) in Tabulation-Separated Values (TSV) format.
• Based on the (controlled) vocabulary specified by the data manager of a collective (research unit, platform, ...).
• All the metadata to be entered can be fully configured using only this one configuration file.
• The whole terminology can be defined using a spreadsheet.
[Screenshot of config_terms.txt with columns: Fields; Sections; Type; Features; Predefined terms]
17. INRAE - Indexator – October 2022
Structure of the terminology definition file (config_terms.txt)
• column 1 - Field: short name of the field
• column 2 - Section: short name of the section
• column 3 - Search: indicates whether the field can be used as a search criterion ('Y') or not ('N')
• column 4 - Shortview: ordered numbers indicating whether the field appears in the overview table shown after a search (empty by default)
• column 5 - Type: indicates how the field is entered via the web interface (possible values: textbox, dropbox, checkbox and areabox)
• column 6 - Features: depending on the Type value, some specific features can be given; several features must be separated by commas
  • for checkbox: open=0 or open=1 indicates whether the selection is open or not
  • for textbox & checkbox: autocomplete=item; the corresponding item.js file must be present under web/js/autocomplete
  • for textbox & dropbox: width=NNNpx specifies the width of the box, which is useful if you want to put several fields on the same line
  • for areabox: row=NN and cols=NN specify the row and column size of the textarea
• column 7 - Label: label of the field as it will appear in the web interface
• column 8 - Predefined terms: for fields of type 'checkbox' or 'dropbox', a list of terms separated by commas
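To make the column layout concrete, here is a minimal sketch of how such a file could be read into field definitions. It assumes a header row followed by one row per field, with the eight columns in the order listed above; the function names, the output structure and the two sample rows are hypothetical and do not reproduce the tool's own configuration code.

```javascript
// Turn a Features cell such as "width=350px,open=1" into an object.
function parseFeatures(text) {
  const features = {};
  for (const item of text.split(",").map((s) => s.trim()).filter(Boolean)) {
    const [key, value] = item.split("=");
    features[key.trim()] = value === undefined ? true : value.trim();
  }
  return features;
}

// Parse a terminology definition file (TSV) into field definitions.
// Assumption: the first non-empty line is a header row and is skipped.
function parseTerminology(tsv) {
  const rows = tsv.split(/\r?\n/).filter((line) => line.trim() !== "");
  return rows.slice(1).map((line) => {
    const [field, section, search, shortview, type, features, label, terms] = line.split("\t");
    return {
      field,
      section,
      search: search === "Y",
      shortview: shortview ? Number(shortview) : null,
      type,
      features: parseFeatures(features || ""),
      label,
      terms: (terms || "").split(",").map((t) => t.trim()).filter(Boolean)
    };
  });
}

// Hypothetical two-field terminology, for illustration only.
const sample = [
  "Field\tSection\tSearch\tShortview\tType\tFeatures\tLabel\tPredefined terms",
  "title\tdefinition\tY\t1\ttextbox\twidth=350px\tTitle\t",
  "status\tdefinition\tY\t\tdropbox\twidth=350px\tStatus\tDraft,Published"
].join("\n");

console.log(parseTerminology(sample));
```

Because the file is plain TSV, it can be exported from a spreadsheet and parsed with nothing more than string splitting.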
18. INRAE - Indexator – October 2022
Architecture diagram
[Diagram labels. Configuration / Initialization steps: config_terms.txt (terminology definition file, Tabulation-Separated Values); Web interface (config); generate (three arrows); config_terms.json; pgd-mmdt-schema.json; initdb. Normal operating mode: Web interface; options; linked; create; PGD_XXXXX.json; deposit; Data storage; scan (cron); insert; MongoDB; search; View Metadata. Legend: Docker Containers; Input / Output files; Data storage; Web server.]
Terminology definition file (Tabulation-Separated Values). Important: it must be defined in the first step and then no longer changed.
19. INRAE - Indexator – October 2022
Architecture diagram: Configuration / Initialization steps
http://mysite.org/pgd-mmdt/config
[Diagram labels: config_terms.txt (terminology definition file, Tabulation-Separated Values); Web interface (config); generate (two arrows); config_terms.json; pgd-mmdt-schema.json; web/json; initdb; MongoDB. Legend: Docker Containers; Input / Output files.]
Terminology definition file (Tabulation-Separated Values). Important: it must be defined in the first step and then no longer changed.
20. INRAE - Indexator – October 2022
Architecture diagram: Metadata entry
[Diagram labels: config_terms.json; pgd-mmdt-schema.json; web/json; options; linked; Web interface; create; PGD_XXXXX.json; deposit; Data storage. Legend: Docker Containers; Input / Output files.]
22. INRAE - Indexator – October 2022
Web interface for search
http://mysite.org/pgd-mmdt/search
23. INRAE - Indexator – October 2022
Web interface for search: Short View
http://mysite.org/pgd-mmdt/search#results
24. INRAE - Indexator – October 2022
Web interface for metadata
http://mysite.org/pgd-mmdt/metadata/Atacama
25. INRAE - Indexator – October 2022
Web interface: add new predefined terms
[Screenshots: the first time a new term is needed; the new term is then available for other users / datasets]
[Diagram labels: Terminology definition file; Web interface; options; PGD_XXXXX.json; deposit; Data storage; scan (cron)]
26. INRAE - Indexator – October 2022
Web interface: autocompletion
Example with the API « Découpage administratif » (Administrative division)
web/js/autocomplete/cities.js
    // Fill the autocompletion list with the names ('nom') of the communes returned by the API
    var cities = [];
    $.getJSON("https://geo.api.gouv.fr/communes", function (data) {
        $.each(data, function (index, value) { cities.push(value['nom']); });
    });
[Diagram labels: Terminology definition file; Web interface]
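Once a list such as cities is loaded, autocompletion amounts to filtering it as the user types. The following sketch shows one simple way to do that. The suggest function, its case-insensitive prefix matching and the sample list are assumptions for illustration; they are not the widget actually used by the web interface.

```javascript
// Return up to `limit` terms that start with what the user has typed so far.
function suggest(terms, typed, limit = 10) {
  const prefix = typed.trim().toLowerCase();
  if (prefix === "") return [];
  return terms.filter((term) => term.toLowerCase().startsWith(prefix)).slice(0, limit);
}

// Hypothetical sample; in cities.js the list is filled from the API response.
const sampleCities = ["Bordeaux", "Bordères", "Paris", "Pau"];
console.log(suggest(sampleCities, "bor"));
```

Returning an empty list for empty input avoids showing the entire vocabulary before the user has typed anything, which matters for lists with hundreds of terms.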
27. INRAE - Indexator – October 2022
Web interface: autocompletion
Example with the EDAM ontology in BioPortal: https://bioportal.bioontology.org/ontologies/EDAM/?p=classes
Information about the BioPortal API: https://data.bioontology.org/documentation
web/js/autocomplete/edam_data.js
    // Get all descendant classes of the 'Data' class
    edam_data = [];
    get_terms_from_bioportal('EDAM', 'http://edamontology.org/data_0006', 'edam_data');
web/json/config_terms.json
    "datatype": {
        "titre": "Data type",
        "autocomplete": "edam_data",
        "width": "350px"
    }
Autocompletion in the web interface: choose from 947 terms.
28. INRAE - Indexator – October 2022
Web interface: autocompletion
Example with the Thesaurus INRAE: https://vocabulaires-ouverts.inrae.fr/a-propos-du-thesaurus-inrae/
29. INRAE - Indexator – October 2022
Web interface: autocompletion
Example with the Thesaurus INRAE API: https://consultation.vocabulaires-ouverts.inrae.fr/api/
web/js/autocomplete/VOvocab.js
    // Keywords used to collect terms from the Thesaurus INRAE
    keywords = [
        'data', 'report', 'simulation', 'model', 'image', 'script',
        'omics', 'statistic', 'scientific', 'research', 'document',
        'experiment', 'video', 'spatial', 'instrument'
    ];
    VOvocab = [];
    get_terms_from_voinrae(keywords, 'VOvocab');
Autocompletion in the web interface: choose from 405 terms.
[Diagram label: Terminology definition file]
30. INRAE - Indexator – October 2022
Web interface: Resources
The "description" field should make it possible to better annotate the data, while the "location" field should make it possible to
1) extend the perimeter of the data beyond the local space,
2) eventually become independent of the local space when one wishes to disseminate the metadata alone.
A location can be anything: a text, an absolute path in a tree, a URL, ...
A link to a publication can thus be given: Type=article, link=DOI.
[Diagram label: Terminology definition file]
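Since a location is free-form, a viewer has to decide how to present each one. The sketch below shows one possible way to tell a DOI link, a URL, an absolute path and plain text apart. The classifyLocation function, its four categories, the resource field names and the sample values (including the placeholder DOI) are all assumptions for illustration, not part of the tool.

```javascript
// Classify a resource location so that a viewer can decide how to render it.
function classifyLocation(location) {
  const value = location.trim();
  if (/^(https?:\/\/(dx\.)?doi\.org\/|doi:)/i.test(value)) return "doi";
  if (/^https?:\/\//i.test(value)) return "url";
  // Absolute paths: POSIX (/...), Windows drive (C:\...) or UNC (\\server\...).
  if (/^(\/|[A-Za-z]:\\|\\\\)/.test(value)) return "path";
  return "text";
}

// Hypothetical resources; the DOI is a placeholder, not a real publication.
const resources = [
  { type: "article", description: "Related publication", location: "https://doi.org/10.1234/example" },
  { type: "data", description: "Raw files", location: "/mnt/nas/project/raw" },
  { type: "data", description: "Archive", location: "Ask the data manager" }
];
for (const r of resources) console.log(r.type, classifyLocation(r.location));
```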
31. INRAE - Indexator – October 2022
Resource example 1: Atacama
[Screenshots: Creation; JSON metadata file; metadata viewer]
32. INRAE - Indexator – October 2022
Resource example 2: link to NextCloud
Put a NextCloud link pointing to the data repository.
Access is thus limited to those who have access rights.
33. INRAE - Indexator – October 2022
Resource example 3: indicate the path on an external storage
If providing a URL is not possible, nevertheless give clear indications of where the data are located.
34. INRAE - Indexator – October 2022
Installation: Local, Remote or Mixed
[Diagram labels: Local VM; Remote VM (Datacenter); VM with 2 CPU, 2 GB RAM, 10 GB disk; Web server; Data storage; Storage located on the VM; Local storage mounted on the VM; NAS Server; VPN (GlobalProtect); WinSCP; Successful testing]
35. INRAE - Indexator – October 2022
Installation: Local, Remote or Mixed
[Same diagram as on the previous slide, with one additional label: Google Drive]
36. INRAE - Indexator – October 2022
scan
rclone configuration:
    [ncloud]
    type = webdav
    url = https://nextcloud.inrae.fr/remote.php/webdav/
    vendor = nextcloud
    user = XXXXX
    pass = XXXXX
Mount command:
    rclone mount ncloud:MTH2-PF-Bordeaux/DATA/ /mnt/ncloud/ \
        --allow-other --vfs-cache-mode minimal \
        --read-only --no-checksum --no-modtime \
        --daemon --daemon-wait 15s
https://pmb-bordeaux.fr/ncloud/search
https://nextcloud.inrae.fr/apps/files/?dir=/MTH2-PF-Bordeaux/DATA
37. INRAE - Indexator – October 2022
Pre-fill a dataset in the INRAE DATA dataverse (via API)
[Diagram labels: Web Interface; Creation of the JSON file; Mapping of JSON file sections/terms with the metadata structure in DATA INRAE; Push; Metadata JSON file + pgd-mmdt-schema.json (JSON Schema); Metadata JSON-LD file (JSON-LD)]
• A good approach is to use only controlled vocabulary, i.e. a relevant and sufficient vocabulary used as a reference in the field concerned, so that users can describe a project and its context without having to add additional terms.
• A mapping of terms based on controlled vocabulary can then be done more easily to generate formats corresponding to different standards (MIAPPE, JSON-LD, ...).
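A term mapping of this kind can be sketched as a lookup table applied to a metadata record. In the example below, the local field names, the sample record and the choice of schema.org as the target vocabulary are assumptions made purely for illustration; the actual mapping used for the dataverse or for MIAPPE is not shown in these slides.

```javascript
// Hypothetical mapping from local field names to properties of a target vocabulary.
// schema.org is used here only as a familiar example of a target.
const mapping = {
  title: "name",
  description: "description",
  keywords: "keywords"
};

// Build a JSON-LD document from a flat metadata record; unmapped fields are skipped.
function toJsonLd(record, fieldMap) {
  const doc = { "@context": "https://schema.org", "@type": "Dataset" };
  for (const [field, value] of Object.entries(record)) {
    if (Object.prototype.hasOwnProperty.call(fieldMap, field)) {
      doc[fieldMap[field]] = value;
    }
  }
  return doc;
}

// Hypothetical record; internal_code has no mapping and is therefore dropped.
const record = { title: "Atacama", keywords: ["soil"], internal_code: "X1" };
console.log(JSON.stringify(toJsonLd(record, mapping), null, 2));
```

The more the local vocabulary is controlled, the smaller and more stable such a table remains, which is the point made in the two bullets above.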
39. INRAE - Indexator – October 2022
Example of mapping from a controlled vocabulary based on an ontology in BioPortal
Pre-fill a dataset in the INRAE DATA dataverse (via API)
• Autocompletion: get terms from the BioPortal ontology API (EDAM, http://edamontology.org/data_0006).
• Mapping: search for the term with the BioPortal Search API (get):
https://data.bioontology.org/search?q=Gene%20expression%20profile&ontology=EDAM&subtree_root_id=http%3A%2F%2Fedamontology.org%2Fdata_0006&apikey=…
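The search URL above can be assembled programmatically rather than by hand. This sketch uses the standard URL and URLSearchParams classes with the parameter names that appear in the slide (q, ontology, subtree_root_id, apikey). The bioportalSearchUrl function name and the YOUR_API_KEY placeholder are assumptions; note that URLSearchParams encodes spaces as "+" rather than "%20", which is an equivalent encoding in a query string.

```javascript
// Build a BioPortal search URL restricted to a subtree of one ontology.
function bioportalSearchUrl(term, ontology, subtreeRootId, apiKey) {
  const url = new URL("https://data.bioontology.org/search");
  url.search = new URLSearchParams({
    q: term,
    ontology: ontology,
    subtree_root_id: subtreeRootId,
    apikey: apiKey
  }).toString();
  return url.toString();
}

// "YOUR_API_KEY" is a placeholder; a real key must be obtained from BioPortal.
console.log(
  bioportalSearchUrl("Gene expression profile", "EDAM", "http://edamontology.org/data_0006", "YOUR_API_KEY")
);
```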
41. INRAE - Indexator – October 2022
Example of mapping from a controlled vocabulary based on the Thesaurus INRAE
Pre-fill a dataset in the INRAE DATA dataverse (via API)
• Autocompletion: get terms from the Thesaurus INRAE API (https://consultation.vocabulaires-ouverts.inrae.fr/api/).
• Mapping: search for the term with the Thesaurus INRAE API (get):
https://consultation.vocabulaires-ouverts.inrae.fr/rest/v1/search?vocab=thesaurus-inrae&lang=en&type=skos%3AConcept&query=metabolomics&offset=0
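The mapping step then has to choose one concept among the search results. The sketch below picks the result whose preferred label matches the local term exactly, ignoring case. The response shape (a results array of objects with uri and prefLabel) is an assumption about what the search endpoint returns, and the sample URIs are made up; both should be checked against the API documentation before being relied on.

```javascript
// Pick the URI of the concept whose label equals the local term (case-insensitive).
// Returns null when there is no exact match, so that such terms are left unmapped.
function pickConcept(term, response) {
  const wanted = term.trim().toLowerCase();
  const hit = (response.results || []).find(
    (result) => String(result.prefLabel).trim().toLowerCase() === wanted
  );
  return hit ? hit.uri : null;
}

// Hypothetical response, shaped as assumed above.
const sampleResponse = {
  results: [
    { uri: "http://example.org/concept/1", prefLabel: "metabolomics data" },
    { uri: "http://example.org/concept/2", prefLabel: "Metabolomics" }
  ]
};
console.log(pickConcept("metabolomics", sampleResponse));
```

Requiring an exact label match is a deliberately conservative choice: a term that only partially matches is better flagged for a human than mapped to the wrong concept.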
42. INRAE - Indexator – October 2022
"Machine-Actionable Metadata"
[Data life-cycle diagram. Stages: Create the project; Create the data; Data analysis; Preserving data; Access to data; Reuse of data. Labels: Descriptive metadata (Project); Web-based metadata entry tool; TSV; PGD_XXX.json; JSON with a Schema; Observations, Samples, Experimentation, Instrumentation; Storage space for the project associated with the metadata file; NAS; Adding new metadata; Adding Resources; Saving data with their metadata; Convert to a suitable format (JSON-LD); Mapping; Push; National and international data repositories; Pre-fill a dataset in the INRAE DATA dataverse (via API); Metadata query (Web interface and/or API)]
43. INRAE - Indexator – October 2022
Conclusion
The "INDEXATOR" tool allows a collective to:
• Have visibility of what is produced within the collective: data sets, software, databases, images, sounds, videos, analyses, codes, ...
• Use a controlled vocabulary specific to the domain of the collective, with mapping to other formats embedding ontologies to be done downstream as required,
• Offer an alternative/complement to external data repositories or other thematic warehouses in order to have knowledge of and access to ALL data, not only those that are published,
• Promote FAIR (at least the Findable & Accessible criteria) within the collective,
• Make newcomers and students aware of the need to better describe what they produce.
44. INRAE - Indexator – October 2022
Thank you for your attention
Metadata Management for Storage Spaces: metadata aggregation & indexation
Source code: https://github.com/inrae/pgd-mmdt