Presentation at Digital Humanities in the Nordics 2020 conference in panel: Towards deterioration, disappearance or destruction? Discussing the critical issue of long-term sustainability of digital humanities projects
Towards a FAIR lifecycle
1. Towards a FAIR data lifecycle
Jessica Parland-von Essen
22.10.2020
https://orcid.org/0000-0003-4460-3906
2. FAIR
FINDABLE
• Described in relevant catalog with enough detail
• Landing page with globally unique persistent identifier
ACCESSIBLE
• Can be retrieved over the internet
• Versioning and lifecycle are documented
• Tombstone page if data is deleted
INTEROPERABLE
• Common, documented, and open formats
RE-USABLE
• Well documented and intelligible
• Rights clearly stated
https://doi.org/10.5281/zenodo.4045402
3. FAIR Ecosystem Components and FAIR Digital Objects
http://doi.org/10.5281/zenodo.3565428 https://doi.org/doi:10.2777/1524
4. Shallow FAIR and Deep FAIR
• Shallow FAIR (Research Information): necessary research information, PIDs, machine readable license
• Deep FAIR (Research Data): all data elements are machine accessible
5. Research Data Types
• ACTIVE DATA: raw, continuously updated
• DYNAMIC RESEARCH DATA: version controlled, possible to cite
• RESEARCH DATASET PUBLICATION: immutable
Diagram labels: Documentation, validation; Research
https://doi.org/10.23978/inf.77419
7. Data requirements on different levels for enabling FAIR?
• LEVEL 0: Output from automated data collection
• LEVEL 1: Near Real Time data
• LEVEL 1: Internal Working data
• LEVEL 2: Final quality-checked gap-filled dataset
• LEVEL 3: Elaborated Data Products
Diagram labels: Metadata, Control, EXTERNAL
8. Interoperability and persistence
• SSHOC reference ontology
• FAIRsFAIR Recommendations for semantic artefacts
• Choosing open formats and protocols
• Good data lifecycle management planning
• Using FAIR enabling services
• Managing reproducibility vs citations
A PID should be globally unique, i.e. nobody else in the world should use the same string to refer to anything else. In practice this means that a PID has a controlled syntax and a governed namespace (generally consisting of a namespace indicator (prefix) and a local identifier (suffix)) and is issued and managed by a clearly specified registration authority.
A PID should be resolvable, i.e. provide a way for both machines and humans to access the digital object itself, the state information and/or landing page (in current practice this means the identifier can be translated to a fully defined URI, at any moment, without the requirement that it resolves to the same URL over time).
A PID should be persistent, i.e. remain unique and resolvable with a persistent syntax. The object it represents should ideally also be persistent, but even if that last persistence is broken the PID should guarantee not to be reused for any other object in the future.
Persistent Identifiers
https://doi.org/10.5281/zenodo.4001631
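The three PID properties above (unique, resolvable, persistent) can be illustrated with a minimal in-memory sketch. This is not how any real registration authority works; the class `PidRegistry`, the prefix `10.99999` and all `example.org` URLs are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PidRegistry:
    """Toy registration authority for one governed namespace (hypothetical)."""
    prefix: str        # namespace indicator, e.g. "10.99999" (made up)
    landing_base: str  # base URL used for tombstone pages (made up)
    _targets: Dict[str, Optional[str]] = field(default_factory=dict)

    def mint(self, suffix: str, target_url: str) -> str:
        """Issue prefix/suffix once; a PID string is never reused."""
        pid = f"{self.prefix}/{suffix}"
        if pid in self._targets:
            raise ValueError(f"{pid} was already issued and cannot be reused")
        self._targets[pid] = target_url
        return pid

    def move(self, pid: str, new_url: str) -> None:
        """The target URL may change over time; the PID itself does not."""
        if pid not in self._targets:
            raise KeyError(pid)
        self._targets[pid] = new_url

    def retire(self, pid: str) -> None:
        """The object is gone, but the PID stays registered (tombstone)."""
        if pid not in self._targets:
            raise KeyError(pid)
        self._targets[pid] = None

    def resolve(self, pid: str) -> str:
        """Return the current URL, or a tombstone page for deleted objects."""
        target = self._targets[pid]
        if target is None:
            return f"{self.landing_base}/tombstone/{pid}"
        return target
```

In this sketch a retired PID still resolves (to a tombstone page) and `mint` refuses to issue the same string again, which mirrors the requirement that a PID is not reused even when the object itself is no longer available.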
9. Co-creation & co-development
23/10/2020
Always design a thing by considering it in its next
larger context – a chair in a room, a room in a
house, a house in an environment, an
environment in a city plan.
Eliel Saarinen, Finnish architect (1873–1950)
LA2 / CC BY-SA, Wikimedia
(https://creativecommons.org/licenses/by-sa/4.0)
Editor's Notes
F = Findable: when the dataset has a persistent identifier, e.g. a DOI, the link to the dataset always works even if the storage location changes
A = Accessible: the research dataset's identifier works as a hyperlink that gives access to the data and its descriptive metadata through a web browser
I = Interoperable: the principle of interoperability requires open file formats and common standards, no more files that will not open
R = Re-usable (the description of the data supports this): data can be used when it has metadata and a license stating the terms of use
Figure 8 source: TFiR https://doi.org/doi:10.2777/1524
Diagram 2 source: http://doi.org/10.5281/zenodo.3565428
The first use case is the visibility of your work and outputs. When reporting on your work to funders and publishing outputs, a basic level of FAIRness and PID use is sufficient to enable findability, simple citation and output registration with core descriptive metadata. This is the context of what is usually called research information (sometimes referred to as current research information). The most common and useful PIDs for this are the research output DOI and the ORCID for the creator(s)/contributor(s). There are also other systems available to identify other kinds of entities to help further linking of information, such as organisations or protocols. Funders and employers might for instance require linking to some other contextual reference data like lists of grants, funders and affiliated organisations. This kind of information is becoming more important, but the actual data quality depends on the functionalities each service provides. If the services used for dataset publication or reporting don’t require PIDs or don’t offer reference (meta)datasets or integration with PIDs for these kinds of things, it is difficult for the researcher to provide this information in an unambiguous way.
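As a small illustration of how a service could check creator PIDs before accepting them, the sketch below validates the check character of an ORCID iD using the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers. The function names are hypothetical, and the check only confirms that the string is well formed, not that the iD is registered or belongs to the named person.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Accept a bare iD or the full https://orcid.org/... form."""
    # Keep only the part after the last slash, then drop the hyphens.
    compact = orcid.strip().rsplit("/", 1)[-1].replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15]


# The iD given on the title slide passes the syntactic check.
print(is_valid_orcid("https://orcid.org/0000-0003-4460-3906"))
```

A check like this catches typing errors at input time, which is one way a reporting or publication service can help researchers provide unambiguous information.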
The other use case for PIDs is the management of the research data itself. Here the PIDs can have different functions: (a) creating deep FAIR research datasets as research outputs, where all individual data elements are machine accessible, see panel F in Figure 1, or (b) managing and documenting the actual workflow, data and related information during research to ensure reproducibility of research results.
The archive or generic repository usually operates with research dataset publications, which are a kind of publication: complex, but immutable, and archived as output and evidence for research. This case is quite easy, PID-wise. But in real life there are many steps and varieties of data before this stage, and this should be taken into account when citing, for instance. How can we support sufficient reproducibility without flooding all systems with PIDs that must be kept and maintained forever?
Dynamic data citation
It is NOT recommended that the researcher or any other individual person is the PID owner; ownership, as well as management, should be governed in a sustainable way.
● Data Versioning: For retrieving earlier states of datasets, the data needs to be versioned. Markers shall indicate inserts, updates and deletes of data in the database.
● Data Timestamping: Ensure that operations on data are timestamped, i.e. any additions and deletions are marked with a timestamp.
● Data Identification: The data used shall be identified via a PID pointing to a time-stamped query, resolving to a landing page.
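The three points above can be sketched with a small versioned table. This is a minimal illustration under stated assumptions, not a reference implementation: the table layout (`valid_from`/`valid_to` columns), the function names and the `hypothetical-prefix` identifier are all made up, and a real service would mint the PID through a registration authority and store the query so it can be re-executed from the landing page.

```python
import hashlib
import sqlite3


def create_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Each row is one version of a record; valid_to is NULL while current.
    conn.execute(
        "CREATE TABLE obs (key TEXT, value REAL, "
        "valid_from TEXT NOT NULL, valid_to TEXT)"
    )
    return conn


def upsert(conn, key, value, ts):
    """Insert or update: close the current version, then add a new one."""
    conn.execute(
        "UPDATE obs SET valid_to = ? WHERE key = ? AND valid_to IS NULL",
        (ts, key),
    )
    conn.execute(
        "INSERT INTO obs (key, value, valid_from) VALUES (?, ?, ?)",
        (key, value, ts),
    )


def delete(conn, key, ts):
    """Deletion only marks the end of validity; no row is removed."""
    conn.execute(
        "UPDATE obs SET valid_to = ? WHERE key = ? AND valid_to IS NULL",
        (ts, key),
    )


AS_OF_SQL = (
    "SELECT key, value FROM obs "
    "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
    "ORDER BY key"
)


def query_as_of(conn, ts):
    """Return the dataset exactly as it was at timestamp ts."""
    return conn.execute(AS_OF_SQL, (ts, ts)).fetchall()


def query_identifier(sql, ts):
    """Derive a stable local identifier for a time-stamped query (illustrative)."""
    digest = hashlib.sha256(f"{sql}|{ts}".encode("utf-8")).hexdigest()[:12]
    return f"hypothetical-prefix/{digest}"


conn = create_store()
upsert(conn, "a", 1.0, "2020-10-01T00:00:00Z")
upsert(conn, "b", 2.0, "2020-10-02T00:00:00Z")
upsert(conn, "a", 1.5, "2020-10-03T00:00:00Z")
delete(conn, "b", "2020-10-04T00:00:00Z")
print(query_as_of(conn, "2020-10-02T12:00:00Z"))
print(query_as_of(conn, "2020-10-05T00:00:00Z"))
```

Because updates and deletes only close a version instead of overwriting it, the same time-stamped query returns the same rows later on, so a single identifier per cited query can stand in for a PID on every intermediate state of the data. Timestamps are compared as strings here, which assumes they all use the same ISO 8601 UTC format.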