| www.eudat.eu | B2FIND Integration Version 4 February 2017: The aim of this presentation is to illustrate how metadata can be published in the B2FIND catalogue and how EUDAT’s B2FIND metadata catalogue can be integrated.
B2FIND is EUDAT’s simple, user-friendly metadata catalogue, allowing users to discover metadata from a wide range of scientific communities.
B2FIND is the metadata service of EUDAT and comprises two components: a comprehensive joint metadata catalogue of research data collections stored in EUDAT data centres and other repositories, and a powerful, user-friendly discovery service on metadata that covers a wide range of research communities.
B2FIND interacts with other EUDAT services in different ways.
For example, B2FIND harvests metadata from B2SHARE to provide access to data objects stored within the EUDAT CDI. B2FIND is also used in inter-service use cases, for example when staging EUDAT data onto a high-performance computing (HPC) platform for processing. B2FIND is used in the first workflow step to identify links to data collections. These references are then used by B2STAGE to transfer the data objects to HPC platforms. B2STAGE then deposits the results of the computation back into EUDAT. Depending on community arrangements, B2SAFE replicates these data into other EUDAT centres and B2HANDLE assigns persistent identifiers to them.
There are various reasons to publish your metadata in EUDAT. The key benefits are visibility, discoverability and re-use of your research data. Your research data become searchable, viewable and accessible to the public, and thus become popular in a cross-disciplinary and international scope. This leads to improved interoperability with, and re-use of, data by scientists from your own and other research fields. Furthermore, users are able to give feedback on and annotate your research output. Finally, you benefit from the validation and quality assurance performed during the B2FIND ingestion process and from the added value thus given to your metadata.
EUDAT, and thus B2FIND, has a truly cross-community approach: metadata are harvested from a wide range of research fields, from climate research to social sciences, from biodiversity to linguistics, and from archaeology to seismology, and the catalogue can extend to any other data-related research area you can think of. To browse and search these fields, we need a common vocabulary for the whole catalogue. B2FIND transforms and homogenises the diverse metadata to achieve this; we will come back to this aspect later.
B2FIND originally indexed metadata harvested from some EUDAT core communities, for example ENES and CLARIN, and metadata stored through other EUDAT services, at the moment B2SHARE. EUDAT has extended, and continues to extend, the service to other external and reliable data and metadata providers that are interested in publishing their metadata in the international, cross-disciplinary scope of EUDAT. A snapshot of the list of communities indexed by B2FIND is shown here. The most up-to-date listing can be found on the B2FIND website (http://b2find.eudat.eu/group/).
[ Note to trainer: this slide and the Upcoming Improvements slide may get out of date quickly. Please ensure that these are up to date by contacting the B2FIND team with a few weeks’ notice: http://eudat.eu/support-request?service=B2FIND ]
More than [470,000] datasets from [17] sources (16 external communities plus B2SHARE) have been uploaded to the metadata catalogue and are available in the discovery portal.
The communities listed on the x-axis demonstrate the wealth and variety of the research data available through B2FIND. They range from humanities and social sciences, such as CLARIN and CESSDA, through natural sciences, such as ENES and ALEPH, to aggregators such as DataCite that themselves provide cross-disciplinary metadata. Note that B2SHARE, highlighted in colour in the histogram, is not a community but a source within the EUDAT CDI whose metadata are indexed regularly by B2FIND. That means that each time a data object is uploaded to B2SHARE the associated metadata are automatically indexed in the B2FIND catalogue.
Note the logarithmic scale here, which masks the large differences in the number of indexed records. ENES is an example of a community with a low count: fewer than a thousand metadata records are harvested from the data provider, but each of them refers to underlying data collections in the order of terabytes. CLARIN, on the other hand, contributes more than a hundred thousand metadata records, though these refer to small data objects in the order of kilobytes or megabytes.
This diagram shows the roadmap of the metadata ingestion. Before B2FIND can start the ‘ingestion workflow’, two preconditions must be fulfilled on the community site: metadata must be generated, and they must be provided and made accessible for external ‘harvesting’. Then the B2FIND workflow on the EUDAT site can start. B2FIND first pulls the community-specific ‘raw’ metadata from the data provider servers. Then the metadata are converted, semantically mapped, checked and validated. Finally they are uploaded to the B2FIND catalogue. All these processes will be explained in detail in the following slides.
As I said previously, the community is in charge of metadata generation. In case you are not familiar with metadata generation, let me provide a short introduction. The generation of metadata has to be done in close proximity to the data production and should be part of your data management plan. In some cases this process must be checked and possibly enhanced to aim towards a comprehensive data description. The quality of the created metadata benefits from quality control at an early stage and should be based on common ontologies and metadata formats.
The other requirement for the community is to host a service that allows metadata harvesting. EUDAT uses OAI-PMH as the technology to share metadata. Although we have a strong preference for OAI-PMH, other data transfer techniques are supported if needed. EUDAT can support you in installing your OAI-PMH endpoint.
After the OAI end-point is available, B2FIND harvests regularly and incrementally from OAI endpoints to assure synchronization between the catalogues of the communities and B2FIND. The frequency and the harvested sets will be negotiated with the community. Initially the B2FIND team will do a pilot harvest, and after early issues have been addressed, regular harvest can begin.
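To make the idea of regular, incremental harvesting more concrete, here is a minimal sketch of how OAI-PMH requests for such a harvest could be built. It is not the B2FIND harvester itself. The endpoint URL, the set name and the function names are assumptions made for illustration; only the request arguments (verb, metadataPrefix, from, set, resumptionToken) come from the OAI-PMH protocol.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Incremental harvest: only records changed since this date.
        params["from"] = from_date
    if set_spec:
        # Restrict the harvest to one negotiated OAI set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def build_resume_url(base_url, token):
    """Build the follow-up request for the next page of a long result list.

    In OAI-PMH the resumptionToken is an exclusive argument, so it is
    sent together with the verb only.
    """
    return base_url + "?" + urlencode(
        {"verb": "ListRecords", "resumptionToken": token})


if __name__ == "__main__":
    # Placeholder endpoint and set name, not a real community server.
    print(build_list_records_url("https://repo.example.org/oai",
                                 from_date="2017-01-01", set_spec="climate"))
```

Storing the date of the last successful harvest and passing it as from_date on the next run is one simple way to keep the two catalogues synchronised without pulling every record again.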
I mentioned previously that the metadata are transformed so that they are in a format suitable for faceted search and discovery. Here is an overview of several metadata formats that are supported by B2FIND. The first three columns list the name, the specification and the description of each metadata schema, and the rightmost column lists the related communities. The variety of these metadata schemas already makes the need for homogenisation to a common B2FIND schema evident.
We homogenise and process the existing community-specific metadata formats to map them to the B2FIND schema in the following steps. First, the harvested XML records are parsed and entries are selected by metadata-format-specific XPath rules. Then the values are analysed, parsed and mapped onto key-value pairs (JSON) using controlled vocabularies. As far as available and possible, community-specific ontologies and thesauri are used. The resulting JSON records satisfy the specifications of the B2FIND schema. The important thing we want to stress here is that there is no change to the content of the metadata during this workflow. The original metadata are only restructured, reformatted and indexed to allow discovery and faceted search.
In this table some central fields of the B2FIND metadata schema are listed. There are only two mandatory fields, ‘title’ and ‘source’, to keep the barriers low. Others, such as ‘PID’ and ‘DOI’, are recommended but not mandatory. For some fields, such as ‘Discipline’ and ‘Language’, we use controlled vocabularies. While the schema is in principle extensible, we intend to keep the list of common core facets limited.
I said previously that for ‘Discipline’ we use controlled vocabularies. This means that we map the disciplines of communities to terms from a controlled set. The mapping takes place at different levels of granularity. In some cases whole groups, such as all data records of a community or an OAI set, are assigned to a discipline. For example, all records belonging to GBIF, a repository for biodiversity data, are mapped to 'Biology'. In other cases the mapping is more fine-grained. For example, for the European Library the values of the Dublin Core element dc:subject are analysed, and if a value matches a term in the vocabulary, the record is assigned to that term.
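The mapping steps just described can be sketched in a few lines. This is an illustration only, assuming a Dublin Core record and a made-up set of XPath rules; the rule table, the sample record and the choice of target field names are assumptions and not the actual B2FIND mapping files.

```python
import json
import xml.etree.ElementTree as ET

# Namespaces of the OAI Dublin Core format.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical rule table: target field -> XPath into the harvested record.
XPATH_RULES = {
    "title": ".//dc:title",
    "Creator": ".//dc:creator",
    "Language": ".//dc:language",
    "Source": ".//dc:identifier",
}

# Invented sample record, for illustration only.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sea surface temperature 1990-2000</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:language>en</dc:language>
  <dc:identifier>http://example.org/dataset/42</dc:identifier>
</oai_dc:dc>"""


def map_record(xml_text):
    """Select entries by XPath and map them onto key-value pairs."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, xpath in XPATH_RULES.items():
        values = [el.text.strip() for el in root.findall(xpath, NS) if el.text]
        if values:
            # Multiple values are joined into one ';'-separated text field.
            record[field] = "; ".join(values)
    return record


if __name__ == "__main__":
    print(json.dumps(map_record(SAMPLE), indent=2))
```

Note that the values themselves are copied unchanged; only the structure changes, which mirrors the point that the content of the metadata is not altered.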
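The two levels of granularity can be illustrated with a small sketch. The vocabulary excerpt, the community table and the function name are assumptions for illustration; the real B2FIND vocabulary is larger and hierarchical.

```python
# Hypothetical excerpt of a closed discipline vocabulary.
DISCIPLINES = ["Biology", "Earth Sciences", "Linguistics", "Social Sciences"]

# Coarse mapping: every record of these communities gets one discipline.
COMMUNITY_DEFAULTS = {"GBIF": "Biology", "ENES": "Earth Sciences"}


def map_discipline(community, subjects=()):
    """Return a vocabulary term for a record, or None if nothing matches."""
    # Level 1: the whole community is assigned to a discipline.
    if community in COMMUNITY_DEFAULTS:
        return COMMUNITY_DEFAULTS[community]
    # Level 2: compare each subject value (e.g. from dc:subject)
    # with the vocabulary, ignoring case and surrounding whitespace.
    lookup = {term.lower(): term for term in DISCIPLINES}
    for subject in subjects:
        match = lookup.get(subject.strip().lower())
        if match:
            return match
    return None


if __name__ == "__main__":
    print(map_discipline("GBIF"))
    print(map_discipline("SomeLibrary", ["poetry", " linguistics "]))
```

A record whose subjects match no term is left without a discipline in this sketch rather than being given a guessed value.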
Each field is examined with respect to coverage, consistency and validity. [ Coverage means the percentage of datasets for which a value can be assigned for the field in question. Consistency means conformance to the given metadata formats and structures. Validity means analysing and parsing values semantically. ] Semantic validation means inspecting field values using controlled vocabularies and standard libraries, e.g. the standard ISO 639 for languages. Furthermore, several ‘technical’ checks are carried out, for example: Date and time fields such as publication year or temporal coverage are checked for conformity with the UTC format. Spatial coverage is checked against geographic names, and given coordinates are checked for correctness. For example, if the spatial name ‘London’ and the coordinates ‘-0.1’ and ‘51.5’ are given, we check the consistency of these data by sending a name-resolving request to geonames.org. We check the resolvability of links to the data objects, i.e. we use urlopen to check the values of ‘Source’, ‘PID’ and ‘DOI’ and mark them as ‘ok’ (the link leads to an existing website) or ‘deny’ (e.g. if an HTTP error 404 ‘Page not found’ is returned).
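Two of these checks, coverage and link resolvability, can be sketched as follows. This is a simplified illustration and not the B2FIND validation code: the function names and the injectable opener parameter are assumptions, while urlopen and the ‘ok’ and ‘deny’ states are taken from the description above.

```python
from urllib.request import urlopen


def coverage(records, field):
    """Percentage of records that carry a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field))
    return 100.0 * filled / len(records)


def check_link(url, opener=urlopen, timeout=10):
    """Return 'ok' if the URL can be opened, otherwise 'deny'.

    `opener` defaults to urllib's urlopen; it is a parameter only so
    that the check can be tried without network access.
    """
    try:
        with opener(url, timeout=timeout):
            return "ok"
    except (OSError, ValueError):
        # HTTP errors such as 404 and unreachable hosts are raised as
        # URLError, a subclass of OSError; malformed URLs raise ValueError.
        return "deny"


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    records = [{"DOI": "10.1234/abc"}, {"DOI": ""}, {}, {"DOI": "10.1234/xyz"}]
    print(coverage(records, "DOI"))
```

A low coverage value for a recommended field such as ‘DOI’ is useful feedback for the community, because it points to records that could be improved at the source.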
Finally the checked and mapped JSON records are uploaded as datasets to the metadata (MD) catalogue. The repository and the portal of the service are based on the open source software CKAN. CKAN provides a rich RESTful JSON API and indexes each dataset during upload using the Apache Solr platform. That enables users to query and search the catalogue.
Several enhancements and developments are planned for the future. We will continue with the integration of communities and aggregators. The functionality of the portal will be further improved: for instance, an annotation function is planned, and taxonomies will be used to allow hierarchical searching and filtering. While B2FIND aims at homogenisation to a common metadata schema, we also want to address customisation of the service to specific requirements. Options for implementing this are templates and extendable facets for specific community needs, the use of particular vocabularies and ontologies, and individually adapted user interfaces. Improving the quality of metadata is one of the central and most challenging tasks. Here B2FIND can help by enhancing and further developing the mapping and the validation, and by improving the feedback mechanism between the communities and the B2FIND developers.
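As an illustration of what querying a CKAN-based catalogue through its JSON API can look like, here is a minimal sketch that uses CKAN's package_search action. Whether the B2FIND portal exposes this action at the base URL shown is an assumption, and the helper names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, rows=10):
    """Build a request URL for CKAN's package_search action."""
    params = urlencode({"q": query, "rows": rows})
    return base_url.rstrip("/") + "/api/3/action/package_search?" + params


def titles_from_response(payload):
    """Extract the dataset titles from a package_search JSON response."""
    return [dataset.get("title", "") for dataset in payload["result"]["results"]]


def search(base_url, query, rows=10):
    """Run a free-text search against the catalogue (needs network access)."""
    with urlopen(build_search_url(base_url, query, rows), timeout=30) as resp:
        return titles_from_response(json.load(resp))


if __name__ == "__main__":
    # Assumed base URL; only the request URL is printed, nothing is sent.
    print(build_search_url("http://b2find.eudat.eu", "climate", rows=5))
```

The free-text query is answered from the search index that is built while the datasets are uploaded, which is why newly ingested records become searchable without a separate step.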
That brings me to the end of my presentation.
For more info please visit: http://eudat.eu/services/b2find. The user documentation can be found at: https://eudat.eu/services/userdoc/b2find-integration
Thank you for your attention!
B2FIND Integration | www.eudat.eu |
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Publish Your Metadata
How to publish metadata in EUDAT’s
This work is licensed under the Creative Commons CC-BY 4.0 licence
What is B2FIND?
B2FIND
• is the metadata service of EUDAT
• is based on a comprehensive joint metadata catalogue of research data collections stored in EUDAT data centres and other repositories
• provides a powerful and user-friendly discovery service on metadata covering a wide range of research communities
Where is B2FIND in the EUDAT suite?
B2FIND
• harvests metadata from EUDAT services such as B2SHARE to provide access to data objects within the EUDAT CDI
• is used in inter-service use cases, e.g. to identify links to data collections, which will be transferred to HPC platforms
Why should you publish your metadata in EUDAT B2FIND?
• Make your research data searchable, viewable and accessible to the public, and thus popular in a cross-disciplinary and international scope
• Improve interoperability and re-use of data
• Allow feedback and annotations on your research output
• Benefit from validation, quality assurance and added value of your metadata
Data from a great selection of subjects
• B2FIND has a truly cross-community approach
• Metadata are harvested from a wide range of research areas
• From Climate Research to Social Sciences
• From Biodiversity to Linguistics
• From Archaeology to Seismology
• This necessitates the transformation and homogenisation of the diverse metadata to achieve the usage of a common vocabulary for the whole catalogue
• B2FIND initially indexed metadata harvested from EUDAT core communities (such as ENES and CLARIN) and metadata stored through EUDAT services such as B2SHARE
• EUDAT has extended, and is extending, the service to other external and reliable data and metadata providers
• The list of currently integrated communities is available at
B2FIND MD Catalogue
• > 470000 records
• 17 communities
• (16 external + B2SHARE)
The Metadata (MD) Ingestion Roadmap
How can you get your metadata published in EUDAT B2FIND?
MD Mapping and
MD Uploading and
Metadata generation
• has to be done in close proximity to the data production
• should be part of the data management plan
• must be checked and possibly enhanced to aim at a comprehensive data description
• benefits from quality control at an early stage
• should be based on common ontologies and metadata formats
Metadata repository and provider
• The community site needs to be set up to allow metadata harvesting
• The standard protocol OAI-PMH is to be used as a
• Other data transfer techniques are supported, if needed
• EUDAT offers support for the installation
• B2FIND harvests regularly and incrementally from OAI endpoints
• The frequency and the harvested sets will be negotiated with the community
• Initially the B2FIND team will do a first trial harvest on a given and accessible OAI endpoint
MD Schemas (excerpt)
Name Specification Description Used by B2FIND to harvest
Dublincore Specification: See at
fications/ and in the
•IETF RFC 5013
•ISO Standard 15836-2009
•NISO Standard Z39.85
The Dublin Core Schema is a small set of vocabulary terms that can
be used to describe web resources (video, images, web pages,
etc.), as well as physical resources such as books or CDs, and
objects like artworks. The full set of Dublin Core metadata terms
can be found on the Dublin Core Metadata Initiative (DCMI)
website, see left.
ISO 19115 http://www.iso.org/iso/ho
ISO 19115-1:2014 defines the schema required for describing
geographic information and services by means of metadata. It
provides information about the identification, the extent, the
quality, the spatial and temporal aspects, the content, the spatial
reference, the portrayal, distribution, and other properties of
digital geographic data and services.
MARC (MAchine-Readable Cataloging) standards are a set of digital
formats for the description of items catalogued by libraries, such as
books. It was developed by Henriette Avram at the US Library of
Congress during the 1960s to create records that can be used by
computers, and to share those records among libraries.
CMDI (Component MetaData Infrastructure) was initiated by
CLARIN to provide a framework to describe and reuse metadata
blueprints. Description building blocks (“components”, which
include field definitions) can be grouped into a ready-made
description format (a “profile”).
DDI http://www.ddialliance.org DDI (Data Documentation Initiative) is an effort to create an
international standard for describing data from the
social, behavioural, and economic sciences.
The community-specific ‘raw’ metadata are processed and mapped to the B2FIND schema in the following steps:
• Parse harvested XML records and select entries by MD format specific XPath rules
• Analyse and parse values and map them onto key-value pairs (JSON) against given controlled vocabularies
• Use (community-specific) ontologies and thesauri
This results in JSON records satisfying the specification of the B2FIND schema
B2FIND MD Schema (excerpt)
Semantic definition Allowed values / CV Level of
Title A name or title a resource
Free text Mandatory 1
Description All additional textual
CKAN2.0 only supports plain text Recommended 1
Data Access Source URI of the related resource Valid URL Mandatory 1
PID Persistent Identifier Recommended 1
DOI Digital Object Identifier Recommended 1
Creator List of the main researchers
involved in producing the
Text field (‘;’ list of citied names,
Discipline Field of research Text field (mapped and validated
Publisher The person or institution
publishes the data
PublicationYear The year when the data
was or will be made public
YYYY Recommended 1
Data coverage TemporalCoverage Relation to or Coverage of
a specific interval in time.
Interval between two UTC Date
Timestamps : [ BeginDateTime ,
SpatialCoverage The spatial limits of a
A spatial point or box specification,
CKAN representation :
Mapping of the Facet ‘Discipline’
[Diagram. Excerpt of the B2FIND closed vocabulary: 1.4.1 Performing arts; 2. Social sciences; 3. Natural sciences; 3.3 Earth sciences; 4. Formal sciences; 4.2 Computer sciences; 5.6.1 Chemical Eng.; 5.12 Library studies. Diagram labels: Community; Filter by Subsets (e.g. OAI set=); Map by specific; ENES, Earth Sciences.]
• Examine each field for coverage, consistency and validity
• Semantic validation by using standard libraries, e.g. iso639 library
• ‘Technical’ checks, e.g.:
• Conformance of date-time fields with the UTC format
• Test spatial coverage by geonames.org and consistency of given coordinates
• Online checks of URLs to the data objects (‘Source’, ‘PID’ and ‘DOI’)
Finally the checked and mapped JSON records are uploaded as datasets to the MD catalogue, which is based on the open source code CKAN. CKAN:
• provides a rich RESTful JSON API and
• uses SOLR for dataset indexing
That enables users to query and search in the catalogue
• Address more communities and aggregators
• Improve functionality of portal
• Include annotating function
• Templates and extendable facets for specific community needs
• Usage of vocabularies and ontologies
• Individually adapted user interfaces
• Improve quality of the metadata by enhancement of the mapping and validation
• Continued exchange and feedback between the communities and the B2FIND developers
For more info: http://eudat.eu/services/b2find
User documentation: https://eudat.eu/services/userdoc/b2find-integration
Heinrich Widmann, DKRZ Hannes Thiemann, DKRZ