Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Introduction to Persistent Identifiers| |


Published on

| | What are persistent identifiers? Why use persistent identifiers? Different persistent identifier systems; The HANDLE system; EPIC PID system; Policies; Use cases
Ver 2 July 2017

Published in: Technology
  • Login to see the comments

Introduction to Persistent Identifiers| |

  1. 1. www.eudat.euEUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 Introduction to Persistent Identifiers PIDs in EUDAT Version 2 July 2017 This work is licensed under the Creative Commons CC-BY 4.0 licence
  2. 2. Content What are persistent identifiers? Why use persistent identifiers? Different persistent identifier systems The HANDLE system EPIC PID system B2HANDLE Policies Use cases
  4. 4. Science Data Data generation is getting easier/cheaper Complexity-shift from data generation to data processing & analysis The amount of data output is increasing, quality is getting better  How to stimulate reuse and enable reproducibility? Data needs to be ReusableAccessible Findable Interoperable
  5. 5. Briefly, what are PIDs? Pointers to data resources Data files, metadata files, documents … Globally unique With infinite lifespan Can be used to identify and retrieve resources Can be resolved to the resource Examples: ISBN, DOIs, PURLs, Handles… PID Training
  6. 6. Data Creation Cycle temporarydata citabledata referabledata raw data registration & preservation analysis & enrichment Citable publication Persistent and robust identification
  7. 7. What is the Problem? Why not use simple URLs? The URL specifies the location, on a particular server, from which the resource could be retrieved. Strictly network locations for digital resources. domain may change resource may be relocated link may change B2SAFE Training BUT URLs a year or two later, often no longer workin the long term In the long term URLs a year later, often no longer work “link rot”
  8. 8. Persistent over time today … ... ... 2030 11839/abc123 11839/abc123 11100 00100 01111 11100 00100 01111 Supports access to resource as it moves from one location to another. .. by design
  9. 9. Why can Persistent Identifiers help? A Persistent Identifier is distinct from a URL not strictly bound to a specific server or filename “A persistent identifier (PID) is a long- lasting reference to a digital object—a single file or set of files.“ 11100 00100 01111 11839 / abc123 resolution prefix suffix Identifier points to a resource with no actual knowledge of the resource Responsibility of the PID owner to keep it up-to-date when the resource changes
  10. 10. Structure of a Persistent Identifier points to a resource Is globally unique 11100 00100 01111 11839 / abc123 resolution prefix sufffix 11839 / abc123 prefix sufffix Once the PID is created, the resource is globally addressable. Data Metadata Document Code Prefix: designates administrative domain, comes from an issuing instance Suffix: unique in the realm of the prefix
  11. 11. Persistent over time today … ... 2030 11839/abc123 11839/abc123 11100 00100 01111 11100 00100 01111 .. by design Update information Redirection Stable
  12. 12. PID Benefits Persistent Identity via Indirection Static references into fluid systems over time Data on networks moves Ownership/responsibility change Formats change Embedded IDs For data object in hand – current state data Updates New related entities Networks of Persistent Links Data / metadata links Provenance chains
  13. 13. PID Costs Extra level of effort / cost on creation Analysis – what to identify (granularity) Folders, files Single measurements in a time series experiment Coordination across organisations Maintain resolution system Persistence requires sustained effort Organisational discipline Technology necessary but not sufficient Analyse cost/benefit ratio Don’t start unless it is worthwhile Is your data worth it?
  14. 14. PID SYSTEMS
  15. 15. Persistent Identifier structure Every persistent identifier consists of two parts: its prefix and a unique local name under the prefix known as its suffix Prefix - designates administrative domain, is generated by an issuer, which makes sure that all prefixes are unique Suffix - local name must be unique under its prefix. The uniqueness of a prefix and the local name under that prefix ensure that any identifier is globally unique within the context of the System. PID Training < PREFIX > / < SUFFIX > (e.g. 11111/123456745)
  16. 16. PID Systems Persistent URLs (PURLs)a Cost: no Metadata: No additional metadata purl: GPO/gpo46189 EPIC Systemb Cost: $50 annual fee per prefix Metadata: Associate any metadata hdl:11210/123 Digital Object Identifier (DOI)d Cost: fee per DOI + annual fee Metadata: The INDECS schema, stored in separate database DOI: 10.1000/182 Archival Resource Key (ARK)c Cost: no Metadata: ERC (Electronic Resource Citation) metadata ark: /12025/654xz321 Based on: Handle System
  17. 17. PID system Requirements Attach multiple URLs to a PID Allow part identifiers for complex objects. Granularity issue Allow attaching of extra metadata to the PID (MD5 check, etc) Actionable (i.e. converted to URL) PIDs HTTP proxy for resolving (use port 80 only) Controlled by community Programmable interface for administration of PIDs from applications Delegation of PID administration to other organisations Distributed, robust, highly- available, scalable No single-point of failure, distributed system with mirroring Acceptable non-commercial business model PID Training
  18. 18. Identifier String Requirements Not based on any changeable attributes of the entity, e.g.: Location Ownership Any other attribute that may change without changing identity Unique Avoid conflicts and referential uncertainty A good PID system should not allow you to use the same suffix twice Opaque, preferably a “dumb number” A well known pattern invites assumptions that may be misleading Meaningful semantics invite IP wars, language problems Nice to have Human-readable Cut-able, paste-able Fits common systems, e.g. URI specification PID Training that contribute to persistence
  19. 19. PIDs in EUDAT EUDAT has adopted Handle-based persistent identifiers A combined solution of Handle system and EPIC service Employing the latest Handle v.8 EUDAT developed a library to interact with Handle v.8  B2HANDLE PID Training
  21. 21. The Handle System The Handle System is a technology specification for assigning, managing, and resolving persistent identifiers for digital objects and other resources. The protocols specified enable a distributed computer system to store identifiers (names, known as Handles) of digital resources and resolve those Handles to the information necessary to locate, access, and otherwise make use of the resources. That information can be changed as needed to reflect the current state or location of the identified resource without changing the Handle. PID Training
  22. 22. The Handle System The main goal of the Handle system is to contribute to persistence. The Handle system is: reliable scalable flexible trusted built on open architecture transparent PID Training
  23. 23. A Handle Record Handle Data Type/KEY Index Handle data Timestamp 10232/1234 URL 1 2014-04- 09 12:46:53Z DOMAIN 2 EUDAT 2014-04- 09 12:46:53Z HS_ADMIN 100 eudat/user1 2014-04- 09 12:46:53Z PID Training PID – handle: 10232/1234 Actionable PID (URL/resolving):
  24. 24. Resolving Handle Record PID Training Global Registry E.g. Handle system 3. Client gets request to resolve hdl:10232/1234 1. Client sends request to Global to resolve 0.NA/10232 (prefix handle for 10232/1234) 2. Global Responds with Service Information for 10232 #1 #1 #2 #3 Secondary Site A Secondary Site B Local Service #1 #2 Primary Site 4. Server responds with handle data Service Information Local Handle Service IP xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc xc .. .. .. xc xc xc .. .. .. xc xc xc .. .. .. ... xcccxv xccx xccx xcccxv xccx xccx xcccxv xccx xccx
  25. 25. HANDLE Record Types Common types URL: the location the HANDLE should resolve to HS_ADMIN: special record encoding the permissions configured for this HANDLE 10320/LOC: supports multiple locations based on intelligent decision. Custom EUDAT types EUDAT/CHECKSUM: Useful for integrity verification EUDAT/ROR: Repository of Records, the ID of the community repository. EUDAT/FIO: PID to first ingested object in the EUDAT domain. EUDAT/PARENT: PID associated with the source object in a replication chain. EUDAT/REPLICA: List of PIDs pointing to replicas
  26. 26. PID SYSTEM IN EUDAT Handle system and EPIC
  27. 27. PID System: How does it work? PID Service generate and manage PIDs for digital objects according to policies Example: B2HANDLE python library (next section) PID Training PID Replication (as one of the EPIC policies) replicate the database of Handles to partners in EPIC to guarantee robust and highly available PID resolution Resolution Service EPIC uses the distributed network provided by Handle and extends it with own local Handle servers. Global Handle Mirror A mirror of the Global Handle in Europe Handle system and EPIC
  28. 28. Resolution Service The web address for the Handle resolution service that EUDAT uses is PID Training
  29. 29. EUDAT options for PIDs In order to access a data object stored in EUDAT, an associated persistent identifier is needed. EUDAT requires integration of Handle in your infrastructure. Before your community or data centre can create PIDs you need a prefix. There are two options: you can run your own Handle system; or you can pass the details to EUDAT partners to manage it on your behalf. additional benefit of using the EUDAT systems is access to a Python library to manage your PID Handles PID Training
  30. 30. B2HANDLE
  31. 31. What is B2HANDLE? B2HANDLE is EUDAT’s PID service based on Handle as technology EPIC as federation B2HANDLE offers: Assignment of prefix via one of the EUDAT partners Hosting of PIDs, i.e. operation and maintenance of Handle servers and technical services Replication and safe-keeping of PIDs via the EPIC federation Resolution mechanism based on Handle Easy maintenance and programmatic resolving of PIDs by the B2HANDLE Python library for general interaction with Handle servers PID Training
  32. 32. B2HANDLE in other EUDAT services In the EUDAT ecosystem, EUDAT services make use of B2HANDLE to: guarantee data access provide long lasting references to data and facilitate data publishing. PID Training B2SAFE and B2SHARE use the service to create and manage PIDs for their hosted data objects. B2FIND and B2STAGE use the resolving mechanism of B2HANDLE to retrieve and refer to objects.
  33. 33. The B2HANDLE Python library b2handle: A Python library for interaction with EUDAT Handle services (Handle version 8) Setup tools-enabled Python package easy installation Can be employed by end-users to programmatically resolve handles Credentials to one of the EUDAT Handle servers are required for creation and maintenance of PIDs Stable state; official release of v1.0 also for use by EUDAT user communities
  34. 34. B2HANDLE: Available at GitHub Code repository:
  35. 35. B2HANDLE documentation Technical documentation:
  36. 36. B2HANDLE library features Methods to read, create, modify Handles and their records Queries against native Handle REST interface Support for multiple locations per object (10320/loc entries) Automatic management of Handle value indexes Support for Handle reverse-lookup via additional Java servlet Support for resolving any Handle from any issuing instance
  37. 37. POLICIES
  38. 38. How may I use a PID When you have a PID use it: To cite the data behind the PID: In publications On web-pages Include actionable PIDs in linked data Retrieve the data: By using the corresponding resolver Via the actionable PID E.g. PID Training
  39. 39. Policy Document When to use Persistent Identifiers? What should the PID resolve to? There is no “one-size fits all” strategy for implementing PIDs! Create a Policy Document of What & When Analyze the use of PIDs, create a policy for the management What to register When it enters the data management life cycle PID Training analysis and thought
  40. 40. Policy Document Simple Questions Which data objects need a PID (collections, files, metadata records)? What kinds of data are likely to stay online long enough? What kinds of data are likely to be linked to your PIDs? What kinds of data are likely to be analysed/processed with tools? What will happen after data goes off-line? etc.. PID Training analysis and thought
  41. 41. PID Policies for EUDAT services Each Service follows its own Policy for managing PIDs. One of the main policies they all follow is the non- deletion policy: Once a PID is generated it is not allowed to delete it. E.g. B2SAFE and B2SHARE use the service to create and manage PIDs for their hosted data objects. They both create their own PID types (Keys) in the PID record. PID Training
  42. 42. USE CASES
  43. 43. Example 1: B2SHARE The persistent identifier for files, download single files Cite the whole data publication
  44. 44. Example 2: B2SAFE B2SAFE employs PIDs to keep track and link replicas of data in the EUDAT network
  45. 45. Example 3: Enable data flows PID Training Link directly to the data (?locatt=id:n) Optionally include a (mime) type in the Handle record – can be used to select appropriate tooling
  46. 46. Summary Persistent Identifiers provide a solution to the “link rot” problem by providing an extra layer of indirection Several systems are available with different conditions PIDs do it yourself: Use a Policy Document The HANDLE system - via EPIC policies - is the foundation for EUDAT’s B2HANDLE service: Low cost, only a flat annual fee Robust, scalable and performing Flexible, allows addition of any metadata Provides a global resolver
  47. 47. Thanks
  48. 48. Authors Contributors This work is licensed under the Creative Commons CC-BY 4.0 licence EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 Themis Zamani, GRNET Willem Elbers, CLARIN Christine Staiger, SURFsara Ellen Leenarts, DANS Kostas Kavoussanakis, EPCC Thank you