Persistent Identifiers in EUDAT services | www.eudat.eu
1. www.eudat.eu EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Persistent Identifiers in EUDAT
services
PIDs in EUDAT
Version 2
June 2017
This work is licensed under the Creative
Commons CC-BY 4.0 licence
2. The EUDAT data domain handles registered data
Each digital object should have a persistent identifier
This persistent identifier is used for:
Replica identification
Identification of the repository of record (in the case of
replication)
Querying of additional information
Checksum (time stamped)
Actionable PIDs:
Of the form http://<resolver>/<prefix>/<suffix>
PIDs in EUDAT
3. The EUDAT Service Suite + PIDs
http://www.eudat.eu/services
B2DROP: supports living objects, no PIDs
B2SHARE: PIDs (collections, files), referable
B2SAFE: PIDs (collections, files), long-term preservation
B2ACCESS: user access, no PIDs
B2STAGE: PIDs fetch data
B2FIND: PIDs refer to data
B2HANDLE: PID management
4. B2SHARE
A user-friendly, reliable and trustworthy tool for researchers, scientific
communities and citizen scientists to store and share small-scale
research data coming from diverse contexts.
PIDs to every data collection, to make them
referable, and to every file to ease
automatic downloads
5. B2SHARE: The process
- Assigns PIDs to every data collection
to allow citation
- Assigns PIDs to every file and also stores the checksum
to allow automatic download and
integrity checks
8. B2SAFE
A robust, safe and highly available service which allows community
and departmental repositories to implement data management
policies on research data across multiple administrative domains, in a
trustworthy manner
PIDs at file level, for long-term preservation and
linking replicas and their originals
9. B2SAFE: What happens step by step?
[Figure: the community centre assigns a unique identifier (PID) to the Digital Object (DO) in the community repository, using its own PID system or iRODS rules. Based on community policy, data ingestion and data replication copy the DO via iRODS to Data Center Store 1 and Data Center Store 2, where each copy is assigned a PID.]
11. B2STAGE
A reliable, efficient, light-weight and easy-to-use service to transfer
research data sets between EUDAT storage resources and high-
performance computing (HPC) workspaces.
PIDs, to fetch data
Transfer large data collections
In conjunction with B2SAFE,
replicate community data sets,
ingesting them onto EUDAT
storage resources for long-term
preservation
Ingest computation results into the
EUDAT infrastructure
14. B2FIND
a discovery service offering a simple, user-friendly metadata
catalogue of research data collections stored in EUDAT data centres
and other repositories.
PIDs, as source identifier
Find collections of scientific data
quickly and easily, irrespective of
their origin, discipline or community
Get quick overviews of available
data
Browse through collections using
standardised facets
15. B2FIND
Metadata are harvested from various research
community repositories spanning a wide scope of
research disciplines. The benefit for the communities
publishing metadata in EUDAT is improved visibility
and discoverability of their research data in an
interdisciplinary, pan-European scope.
16. B2FIND – B2SHARE Community
PID Training
The Source is an identifier, therefore a unique string that identifies the resource. It may link to the data resource itself or to a landing page that points to the data.
You may also find PID as an alternate identifier.
B2FIND uses B2SHARE PIDs
17. B2FIND – SDL Community
The SDL community supports DOI as alternate identifier
B2FIND uses PID and DOI from the SDL Community.
18. B2HANDLE
EUDAT has adopted Handle-based persistent identifiers based on a
solution combining the Handle technology and the EPIC federation.
B2HANDLE is a central service for managing persistent identifiers at
EUDAT.
PID management
Why Handles?
Stable globally unique IDs,
stable cross-links
Simple Integration
19. PIDs created with B2HANDLE provide the abstraction
layer between a globally unique persistent identifier and
physical location of data objects
Follows policies to register data and make it long term
referable and citable
Assignment of prefix via one of the EUDAT partners
Hosting of PIDs, i.e. operation and maintenance of
Handle servers and technical services
Benefits of the B2HANDLE service
20. Replication for reliability and safe-keeping of PIDs via the
EPIC federation
Resolution mechanism based on Handle
Easy maintenance and programmatic resolving of PIDs by
the B2HANDLE Python library for general interaction with
Handle servers
Benefits of the B2HANDLE service
21. B2HANDLE – The Python library
b2handle: A Python library for interaction with the EUDAT
B2HANDLE service
setuptools-enabled Python package; easy to deploy
Requires access to one of the EUDAT Handle server
sites
Technical documentation:
http://eudat-b2safe.github.io/B2HANDLE
22. B2HANDLE – B2SAFE example
Where: Offers integration into iRODS via a script.
This comes out of the box with a dedicated script
employing the B2HANDLE Python library
How: The script takes credentials as input
Supplied on the command line (or)
Stored in a configuration file (iRODS or local fs)
What: The script supports the following actions
Searching
Resolving
Creation of PIDs with metadata specific to B2SAFE
Modification
23. Conclusions
PIDs run through the EUDAT services
B2HANDLE aids the creation and management of PIDs,
through web and programmatic interfaces
The B2SHARE, B2SAFE and B2STAGE services create
PIDs for digital objects created within EUDAT.
B2FIND lists PIDs together with the rest of the metadata
it collects
EUDAT data can be accessed through the use of PIDs.
25. www.eudat.eu
Authors Contributors
Themis Zamani, GRNET
Willem Elbers, CLARIN
Christine Staiger, SURFsara
Ellen Leenarts, DANS
Kostas Kavoussanakis, EPCC
Thank you
Editor's Notes
As we have already mentioned, a persistent identifier is a long-lasting reference to a digital object. The EUDAT data domain handles registered data, and each digital object should have a persistent identifier.
This persistent identifier is used for
- Replica identification
- Identification of the repository of record (in the case of replication)
- Querying of additional information
- Checksum (time stamped)
A persistent identifier helps you
- access
- use and re-use
- verify
your data
We also mentioned that a PID can take an actionable form, which redirects to the address stored in the URL field of the PID. Actionable PIDs have the form http://<resolver>/<prefix>/<suffix>
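To make the actionable form concrete, here is a minimal Python sketch that assembles and takes apart such a URL. The function names are made up for this illustration, and the prefix and suffix values are placeholders rather than registered identifiers; hdl.handle.net is the global Handle resolver.

```python
from typing import Tuple
from urllib.parse import urlsplit


def make_actionable_pid(resolver: str, prefix: str, suffix: str) -> str:
    """Build an actionable PID of the form http://<resolver>/<prefix>/<suffix>."""
    return f"http://{resolver}/{prefix}/{suffix}"


def split_actionable_pid(url: str) -> Tuple[str, str, str]:
    """Split an actionable PID back into (resolver, prefix, suffix)."""
    parts = urlsplit(url)
    # The first path segment is the prefix; everything after it is the
    # suffix, which may itself contain slashes.
    prefix, _, suffix = parts.path.lstrip("/").partition("/")
    if not (parts.netloc and prefix and suffix):
        raise ValueError(f"not an actionable PID: {url!r}")
    return parts.netloc, prefix, suffix


# Placeholder prefix and suffix, for illustration only.
pid_url = make_actionable_pid("hdl.handle.net", "12345", "example-object-0001")
resolver, prefix, suffix = split_actionable_pid(pid_url)
```

Opening the resulting URL in a browser (or with curl) would ask the resolver to redirect to the location stored in the PID record, provided the PID actually exists.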
These are the main EUDAT Services.
PIDs are used for different purposes and at different levels.
The services B2SHARE and B2SAFE generate and manage PIDs throughout the lifetime of the managed data objects and beyond.
B2SAFE uses PIDs directly at file level, whereas B2SHARE employs PIDs to label and refer to data collections (several files plus a metadata entry). In this case the PIDs point to a landing page describing the collection.
B2FIND and B2STAGE employ the PID resolving mechanism of B2HANDLE/Handle.
B2STAGE transfers data based on their PIDs. PIDs may refer to folders (collections) or single files.
B2FIND shows metadata entries and expects that PIDs mentioned in the records shown can be followed over HTTP, i.e. by clicking the link in the browser or executing a curl command.
B2HANDLE itself is EUDAT’s PID service, including a resolver and a handy Python library to create and manage PIDs.
B2DROP, as a sort of workspace environment, and B2ACCESS do not work with persistent identifiers.
B2SHARE is a user-friendly, reliable and trustworthy tool for researchers, scientific communities and citizen scientists to store and share small-scale research data from diverse contexts.
B2SHARE is open to all researchers and scientists who are affiliated with research institutions or universities, as well as to individual researchers (citizen scientists).
Researchers who want to deposit research data must register; this is a requirement for the upload service and for access to restricted data, but unregistered users can still access public data.
B2SHARE added-value features:
With the use of B2SHARE your data is …
- Hosted so there are no hardware or network worries on the depositor side
- Assigned a PID and therefore is always retraceable
- Stored alongside queryable & findable metadata and automatically available via the B2FIND metadata catalogue
- Managed and stored by a trusted and certified data centre
The basic idea of the B2SHARE service is to assign PIDs to every data collection (with at least one file), to make it referable.
Let’s suppose that you are an individual researcher and you want to upload and share your research data to B2SHARE.
The first thing you have to do is register for an account.
After registering to the service you want to upload your data:
You may select and upload one or several data resources. When ready, click the button marked "Start upload".
The second step is to select a domain- or project-specific metadata set to describe the resource(s). Datasets will be annotated with the selected domain’s metadata schema.
The final step is to fill in the metadata form. Fields marked with an asterisk are mandatory. The selection of fields that appear, and also whether they are mandatory, depends on the chosen description set. Once you have completed the metadata, deposit the selected resource and its metadata elements by clicking the button "Deposit". The deposit takes a moment to be processed. You will get a reference URL immediately.
While processing your request, the EUDAT infrastructure:
- assigns a PID to each object,
- checks your authorisation and registers the objects in the EUDAT domain of registered data, and
- stores the data according to the B2SHARE service statements.
What actually happened?
The B2SHARE service made a request for two new PIDs.
It created a Digital Object Identifier for the collection. The DOI system provides a technical and social infrastructure for the registration and use of persistent interoperable identifiers, called DOIs, for use on digital networks. This DOI is resolvable with the DOI resolver and will always resolve to the landing page in B2SHARE for the collection.
B2SHARE also created a Handle for the collection, which also resolves to the landing page.
Moreover, B2SHARE creates a PID (Handle) for each data file in the collection.
These PIDs resolve directly to the file and can be used to automatically download the files in an unambiguous way.
The PID entry also contains the MD5 checksum. This field can be easily queried with e.g. the B2HANDLE Python library and be employed for integrity checks.
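As a rough sketch of such an integrity check, the Python below recomputes the MD5 digest of a downloaded file and compares it with the value taken from the PID entry. It uses only the standard library; the PID record is represented as a plain dictionary, and the key name "CHECKSUM" is an assumption made for this illustration, as are all function names.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def read_chunks(path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the contents of a file in fixed-size chunks."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk


def md5_hex(chunks: Iterable[bytes]) -> str:
    """Return the hexadecimal MD5 digest of a sequence of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def matches_pid_checksum(chunks: Iterable[bytes], pid_record: Dict[str, str]) -> bool:
    """Compare downloaded data with the checksum stored in its PID record.

    The "CHECKSUM" key is an assumed field name, not a documented one.
    """
    expected = pid_record["CHECKSUM"].strip().lower()
    return md5_hex(chunks) == expected
```

For a file on disk, the check would read `matches_pid_checksum(read_chunks("downloaded.dat"), record)`, where `record` holds the values retrieved from the PID entry and "downloaded.dat" is a placeholder file name.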
In today’s rich data-storage ecosystems, large data centres must offer a robust, safe and highly available replication service to allow community and departmental repositories to replicate their research data:
- to guard against data loss in long-term archiving and preservation,
- to optimize access for users from different regions, and
- to bring data closer to powerful computers for compute-intensive analysis.
B2SAFE: a robust, safe and highly available service which allows community and departmental repositories to implement data management policies on research data across multiple administrative domains in a trustworthy manner
B2SAFE uses persistent identifiers for long-term preservation.
The B2SAFE module is a set of iRODS rules which can be put together in workflows enabling data replication and PID management (via the Handle- and EPIC-based EUDAT PID solution).
Prerequisites for a Community center
Persistent Identifiers (PIDs): The community center is responsible for assigning a unique identifier (PID) to the Digital Object.
iRODS (recommended) or similar data management technology for federation
B2SAFE module enabled
Steps
The community center assigns a PID to the DO.
Based on community policy, and driven by iRODS rules, the replication starts.
We want to replicate the DO to EUDAT Data Center 1. The replication process is triggered by invoking the B2SAFE replication rule on the client side. A predefined B2SAFE rule is called which sends a PID creation request to the PID service in use. The B2SAFE module ensures that the replica at EUDAT DC1 is assigned a unique PID (handle). The EUDAT DC1 replica is ready.
We want to replicate the DO to EUDAT Data Center 2. We follow the same process. The B2SAFE module ensures that the replica at EUDAT DC2 is assigned a unique PID (handle). The EUDAT DC2 replica is ready.
Now let’s discuss the actual DO and its replicas.
Main Acronyms:
ROR: Repository of Record, the repository where the data was stored first (it controls the replication process).
PID: Persistent identifier associated with a digital object or with a whole collection.
PARENT: Parent PID, the persistent identifier associated with the source object in a replication chain. If the chain has only two elements, the master copy and the first replica, then PARENT = ROR.
REPLICA: List of PIDs pointing to direct replicas.
Procedure:
The Community Center owns a DO that it wants to replicate across different data centres (EUDAT Data Center Y and EUDAT Data Center Z, as shown in the picture).
The community has to obtain a prefix for the Handle system. Optionally, the community can attach its own identifier to the object and link all replicas to this identifier; this is what we assume in this flow. The identifier of the original object will be used as the value of the ROR (Repository of Record) in the handle record of the first replica and as a parent PID in the handle records of all other replicas. So the community center assigns PID1x to the DO.
Now we want to replicate the DO to EUDAT Data Center Y. The B2SAFE module ensures that the replica in EUDAT DC Y is assigned a unique PID (handle). The PID is PID1y and the handle record contains:
RoR: a reference to the original source of the replica (typically the community centre). RoR = PID1x or a community-specific identifier.
Now we want to replicate the DO to EUDAT Data Center Z. The B2SAFE module ensures that the replica in EUDAT DC Z is assigned a unique PID (handle). The PID is PID1z and the handle record contains:
RoR: a reference to the original source of the replica (typically the community centre). RoR = PID1x
PARENT: a reference to the replica created by B2SAFE from this Community Center in another Data Center (in our example EUDAT Data Center Y). PARENT = PID1y.
In addition, the PID on DC Y acquires a new field:
REPLICA: points to the PID of the DC Z replica
This results in a tree structure of PID records identifying all replicas and the "flow" of replication.
PID handle records: For each replica, links are stored in the instance at the data centre. In addition to the links, the B2SAFE service stores a checksum for that specific replica. This information is intended to be used to perform integrity checks.
REPLICA: list of PIDs pointing to direct replicas
CHECKSUM: checksum
FIO: First ingested object in EUDAT; here the DO at data centre Y
PARENT: data object that served as origin for the replication, i.e. direct parent of a replica
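The linking between PID1x, PID1y and PID1z described in these notes can be sketched as a small in-memory model. This is only an illustration of the record fields (ROR, PARENT, REPLICA), not the B2SAFE implementation; the function names and the use of plain dictionaries are assumptions.

```python
from typing import Dict, List, Optional

Records = Dict[str, dict]


def register_replica(records: Records, pid: str, ror: str,
                     parent: Optional[str] = None) -> dict:
    """Create the record for a new replica and link it from its parent."""
    record = {"ROR": ror, "REPLICA": []}
    if parent is not None:
        record["PARENT"] = parent
        # The parent's record gains a REPLICA entry pointing to the new copy.
        if parent in records:
            records[parent]["REPLICA"].append(pid)
    records[pid] = record
    return record


def replication_chain(records: Records, pid: str) -> List[str]:
    """Walk from a replica back to the repository of record."""
    chain = [pid]
    while pid in records:
        record = records[pid]
        # Without an explicit PARENT, the ROR is the direct origin.
        pid = record.get("PARENT", record["ROR"])
        chain.append(pid)
    return chain


# The original object (PID1x) lives in the community's own PID system,
# so it is deliberately not stored in this registry.
records: Records = {}
register_replica(records, "PID1y", ror="PID1x")                  # replica at DC Y
register_replica(records, "PID1z", ror="PID1x", parent="PID1y")  # replica at DC Z
```

After these two calls, the record for PID1y lists PID1z under REPLICA, and `replication_chain(records, "PID1z")` walks back through PID1y to PID1x, which is the tree structure the notes refer to.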
B2STAGE is a reliable, efficient, light-weight and easy-to-use service to transfer research data sets between EUDAT storage resources and high-performance computing (HPC) workspaces.
B2STAGE is open to all researchers and community managers:
Researchers can transfer large data collections from EUDAT storage resources to HPC facilities for processing.
Community Managers can replicate community data through a lightweight service and ingest data sets to EUDAT storage resources for long term preservation.
The B2STAGE service is deployed on EUDAT datacentres and many HPC nodes. Access to EUDAT nodes is automatic for all EUDAT registered users, though users would need to arrange access to HPC nodes separately. The user can use clients running on their desktop or on other log-in servers that they have access to.
The main B2STAGE options:
Globus Online, which provides a GUI and a command-line interface,
the native GridFTP command-line interface, or other GridFTP clients like UberFTP.
EUDAT is also working on an HTTP interface.
In all cases, it is up to the user to select a client of their choice to initiate transfers between B2STAGE instances on EUDAT and HPC centres.
(This is a continuation from the last sentence of the previous slide).
This is better depicted in this figure. The user employs the client of their choice, which interacts with B2STAGE instances on the sites involved in the transfer. Underneath the B2STAGE hood is a GridFTP server, enriched with the EUDAT Data Storage Interface component. When data arrive at an EUDAT node to be deposited, the B2STAGE service ensures that a PID is generated for each artefact, and this is recorded in the EUDAT PID Register. The iRODS Server also handles any replication required for these artefacts, according to the community policies that apply to the user who initiated the transfer. Given this PID, the user can access this digital object.
B2FIND
is the metadata service of EUDAT
is based on a comprehensive joint metadata catalogue of research data collections stored in EUDAT data centres and other repositories
provides a powerful and user-friendly discovery service on metadata covering a wide range of research communities
Why should you publish your metadata in EUDAT B2FIND?
- Make your research data searchable, viewable, and accessible to the public
- Make it available for use in a cross-disciplinary and international scope
- Improve interoperability and re-use of data
- Allow feedback and annotations on your research output
- Benefit from validation and quality assurance of your metadata
Here you see an example of B2FIND indexing an object in B2SHARE, using a record from the general “B2SHARE” community.
The B2SHARE service hosts final versions of files, and assigns PIDs to every object, to make them referable. As a result, when a user uploads a file, B2SHARE will create a PID and associate the object with it. B2FIND indexes B2SHARE records.
When accessing a B2SHARE record through B2FIND, the user will see:
The Source: an identifier, therefore a unique string that identifies the resource. It may link to the data resource itself or to a landing page that points to the data and may include more information about it.
The PID: it is the same as the Source, including in form, because B2SHARE uses the “actionable” form of the PID, i.e. a URL.
This example shows B2FIND indexing a record from the SDL Community.
Search Digital Libraries (SDL) is hosted by the Documentation Research and Training Centre (DRTC), which strives to contribute to the development of Library and Information Science in India. They have a wide range of interests, but place special emphasis on the Information and Systems Sciences, Documentation and Library Science.
When accessing an SDL record through B2FIND, the user will see:
The Source as an identifier, therefore a unique string that identifies the resource.
PID as an identifier that is the same as Source, like we saw on B2SHARE.
Because the community also uses DOI, B2FIND also lists it as an alternate, distinct identifier.
B2HANDLE is a service for Groups, Communities and Centres who want to make their data referenceable in a stable way.
Persistent identifiers (PIDs) are an abstraction layer that arbitrates between the reference to a digital object and its location. URLs are ubiquitously given as the locations of digital objects, but since they tend to change over time, they are inappropriate as stable references. The PID abstraction layer can be seen as a pointer to the location, and it is a necessary precondition for any stable referencing on the web.
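The pointer idea can be shown with a toy in-memory registry. This is a conceptual sketch only and not how a Handle server works; the class, the PID string and the example.org URLs are all invented for illustration.

```python
class PidRegistry:
    """Toy stand-in for a PID resolver: maps stable PIDs to current URLs."""

    def __init__(self) -> None:
        self._locations = {}

    def register(self, pid: str, url: str) -> None:
        """Record the first location of a newly minted PID."""
        if pid in self._locations:
            raise ValueError(f"PID already registered: {pid}")
        self._locations[pid] = url

    def relocate(self, pid: str, new_url: str) -> None:
        """Update the location; the PID itself never changes."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        """Return the current location for a PID."""
        return self._locations[pid]


registry = PidRegistry()
registry.register("12345/dataset-42", "http://old-server.example.org/data/42")

# A paper cites the PID, not the URL.
citation = "12345/dataset-42"

# The data later moves to another server; only the registry entry is updated.
registry.relocate(citation, "http://new-server.example.org/archive/42")
current_location = registry.resolve(citation)
```

The citation string stays valid after the move, which is the property that a bare URL reference lacks.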
The EUDAT services that create PIDs follow a predefined process of when and how to mint a PID. B2HANDLE follows this policy to register data and make it long term referable and citable.
B2HANDLE allows communities to delegate the assignment of PIDs for their prefix to EUDAT centres, which facilitates their PID management.
This outsources the maintenance and operation costs for Handle servers.
Once PIDs have been created, B2HANDLE benefits from the EPIC federation replicating the PIDs. This allows for a service resilient in the face of server failure, which keeps the PIDs safe.
As B2HANDLE is based on the Handle technology, there is an established service that resolves these PIDs.
Additionally, B2HANDLE offers a Python library that allows PIDs to be managed programmatically.
(as per slide)
An example of use of B2HANDLE is the B2SAFE service.
Where: Integration via B2HANDLE with the B2SAFE instance. B2SAFE uses a dedicated script employing the B2HANDLE Python library. The actual complexity of the structure is hidden.
How: The script takes credentials as input
Supplied on the command line (or)
Stored in a configuration file (iRODS or local fs)
What: The script supports the following actions
Searching
Resolving
Creation of PIDs with metadata specific to B2SAFE
Modification
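To give a feel for what such a command-line interface could look like, here is a hypothetical argparse sketch with the two ways of supplying credentials and the four actions listed above. It is not the script shipped with B2SAFE: every option name, sub-command name and argument below is invented for this illustration, and no Handle server is contacted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical command-line layout; not the actual B2SAFE script."""
    parser = argparse.ArgumentParser(description="PID helper (illustrative sketch)")

    # Credentials: either given on the command line or read from a file.
    creds = parser.add_mutually_exclusive_group(required=True)
    creds.add_argument("--login", nargs=2, metavar=("USER", "PASSWORD"),
                       help="credentials supplied on the command line")
    creds.add_argument("--config", metavar="PATH",
                       help="configuration file holding the credentials")

    actions = parser.add_subparsers(dest="action", required=True)

    search = actions.add_parser("search", help="find PIDs by key and value")
    search.add_argument("key")
    search.add_argument("value")

    resolve = actions.add_parser("resolve", help="look up the record of a PID")
    resolve.add_argument("pid")

    create = actions.add_parser("create", help="create a PID for a location")
    create.add_argument("location")
    create.add_argument("--checksum")
    create.add_argument("--ror")
    create.add_argument("--parent")

    modify = actions.add_parser("modify", help="change one value of a PID record")
    modify.add_argument("pid")
    modify.add_argument("key")
    modify.add_argument("value")

    return parser


# Example invocation with placeholder values.
args = build_parser().parse_args(
    ["--config", "credentials.json", "create",
     "http://example.org/data/file1", "--ror", "PID1x"]
)
```

In a real script, each action would then be dispatched to the corresponding call of the PID library; reading credentials from a configuration file avoids exposing a password in the shell history.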