Oxford Common File Layout (OCFL)
1. Oxford Common File Layout
Rosalyn Metz (Emory),
Simeon Warner (Cornell)
Samvera Connect 2018
http://bit.ly/ocfl-samcon2018
2. Not just us...
OCFL Editorial Group
● Andrew Hankinson (Oxford)
● Neil Jefferies (Oxford)
● Julian Morley (Stanford)
● Andrew Woods (DuraSpace)
● and us (Rosalyn and Simeon)
Community input from pasig-discuss and
ocfl-community groups, and from others
4. BagIt
Well established and implemented specification for handling sets of files
● Being formally standardized as RFC:
https://tools.ietf.org/html/draft-kunze-bagit-17
● Used for transfer and (somewhat less) for files at rest
● Good fixity support
● No explicit versioning support
○ Could use local conventions for version inside a bag
○ Could use bag-per-version
5. Moab: A Brief History
Slides adapted from Julian Morley's in the OR2018 OCFL presentation
● Moab is the closest ancestor of OCFL
● Developed at Stanford Libraries by Richard Anderson
○ Article: http://journal.code4lib.org/articles/8482
● Named after Moab, UT
6. Moab: A Brief History
● Moab is a versioned, forward delta file
structure that supports fixity and file
de-duplication.
● You can preserve anything with it (even
cat pictures found on the internet)
● The tools to manage and create Moabs
are an open source Ruby gem
○ https://github.com/sul-dlss/moab-versioning
7. Moab is part of the
Stanford Digital Repository
Here be Moabs!
8. Moab in Practice @ Stanford
We have many Moabs in the SDR
● 1.6 million Moab objects
● 5 million version directories
● 50+ million files
● 500+ TB of data (25TB added last month)
● Spread across 15 NFS volumes on NetApp filers
● Backed up by IBM Spectrum Protect (formerly TSM)
○ 1 tape copy kept in local tape frame; 1 sent to Iron Mountain
12. CULAR @ 2017
It worked, what
now?
● Fedora 3 no longer being
developed, Fedora 4 not an
appropriate option
● Decision not to buy
"preservation services",
primarily on cost grounds
● Decision that we want one local
copy for legal access reasons
Short term ⇒ use local disk and
AWS S3. Build tools over
filesystem and object stores
14. Those files sure
are piling up!
Nearly 100TB now, planning
100TB/year digitization
● Plan to purchase a scalable
local (object) storage system
for 1 copy
● Two more copies in cloud
(perhaps tape)
● Content will outlast any
application or software system
● Content will outlast any storage
system
● Expect change and hence
migration ⇒ KISS
18. Shared Cornell and OCFL Goals
● Provide an application and vendor neutral storage arrangement that can be
used with filesystems and object stores
○ Allow easy replication between multiple storage environments
○ Allow easy migration between storage systems (modulo the inherent burdens)
○ Allow use with multiple and changing applications
● Support package versioning at low cost (complexity and storage use)
● Support internal package validation for completeness and fixity
● Support audit and self-description of entire store
● Have an easy migration path from current archival storage arrangements
● Develop a shared model that is useful at multiple institutions so that all benefit
from community developed tools and expertise.
20. Lessons from Emory: Deliverables
Actively engaged in a multi-year effort to gather requirements, design, and
develop a digital repository based on the Samvera framework.
Selected deliverables included...
Develop object definitions/types (e.g.
collections, objects, other entities) and their
relationships to one another; determine
preservation objects inside and outside of
Fedora.
Identify needs for AIP structure.
Identify storage requirements (e.g. number of
copies, file access scenarios)
21. Lessons from Emory: Identified requirements
The means to distribute digital objects to third-party preservation services.
A well understood and well documented model for storing digital objects.
Ability to place multiple copies of digital objects into diverse storage services
(AWS, local storage, etc.).
Easily allow for fixity checking of digital objects.
22. Digital Object
Content Files
(Primary or Supplemental)
Content file 1
Content file 2
Content file 3...
… + additional
… + additional
The content itself:
relationships provided in
structural metadata
Metadata (Actionable/Indexed)
Desc. metadata
Technical metadata (File-level)
Preservation Events/Audits
Administrative metadata
Structural metadata (PCDM)
Metadata converted to RDF
for Hyrax/Fedora - editable
and/or searchable
Supplemental Preservation Files
(Metadata/Administrative Files)
Source Metadata (binary file)
Desc. Metadata record (binary file)
METS (binary file)
License/agreement (binary file)
Supplemental PREMIS (binary file)
Variable supplemental info
stored as files (not directly
system-readable):
staff can view or download
file to read it
23. Major Emory Entities PCDM Context - Simple Example
Entities in the example:
● Collection: Ancient Egyptian Collection
● Administrative Collection: Carlos Museum Administrative
● Individual Agreement: Carlos Museum Agreement
● Digital Object: Statuette of a Cat.
● Collection: Divine Felines Exhibition
Relationship between entities: Is a member of
Collections reflect the process the libraries followed when deciding to
collect materials.
Digital Objects must be a part of an Administrative Collection and
optionally in one or more Collections
Digital Objects may contain one or more files
Digital Objects, Collections receive Emory-defined metadata and
relationships
Individual Agreements contain information about the Administrative
Collection.
Individual Agreements may contain one or more files
Individual Agreements are assigned to objects through their parent
Collection
25. OCFL Requirements
1) Completeness, so that a repository can be
rebuilt from the files it stores,
2) Parsability, both by humans and machines,
most importantly in the absence of original
software,
3) Robustness, against errors, corruption, and
migration between storage technologies, and
4) Storage, on a variety of infrastructures
including cloud object stores.
Many existing digital preservation
standards like:
● TDR (ISO 16363)
● OAIS (ISO 14721)
● NDSA Levels of Preservation
● BagIt
discuss the need for these
requirements, but none provides a
standardized way to meet them.
27. OCFL Object
A group of one or more content files and
administrative information identified by a
URI.
The object may contain a sequence of versions
of the files organized into version directories.
The base directory of the object may contain a
logs directory.
A NAMASTE file indicating conformance.
An object contains an inventory digest file
which provides a digest for the
inventory.json file.
[object root]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1
│   ├── empty.txt
│   ├── foo
│   │   └── bar.xml
│   ├── image.tiff
│   ├── inventory.json
│   └── inventory.json.sha512
├── v2
│   ├── foo
│   │   └── bar.xml
│   ├── inventory.json
│   └── inventory.json.sha512
└── v3
    ├── inventory.json
    └── inventory.json.sha512
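The inventory digest file shown in this layout can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `inventory_digest_ok` is made up for illustration, and it assumes the first whitespace-separated token in `inventory.json.sha512` is the hex-encoded sha512 digest of `inventory.json`.

```python
import hashlib
from pathlib import Path


def inventory_digest_ok(object_root: str) -> bool:
    """Return True if inventory.json matches the digest in its sidecar file."""
    root = Path(object_root)
    inventory_bytes = (root / "inventory.json").read_bytes()
    # Assumption: the sidecar's first whitespace-separated token is the
    # hex-encoded sha512 digest; anything after it (e.g. a filename) is ignored.
    sidecar_text = (root / "inventory.json.sha512").read_text(encoding="utf-8")
    expected = sidecar_text.split()[0].lower()
    actual = hashlib.sha512(inventory_bytes).hexdigest()
    return actual == expected
```

The same check could be run against the inventory copy inside each version directory by passing that directory instead of the object root.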
28. OCFL Object
An object contains an inventory.json file
which inventories the contents of an object.
The manifest block lists all the digests and
existing file paths for all of the object’s content.
The versions block identifies the logical file path
and the digest for each version of the object’s
content.
Separating the logical file path from the
existing file path and using digests to refer to
files allows for deduplication of content.
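The link between the two kinds of path can be sketched in Python: for a given version, each logical file path is resolved to an existing file path by looking up its digest in the manifest. This is an illustrative sketch only. It assumes the inventory shape used in this talk (a `versions` list whose entries carry `version` and `state`), and the inventory and its digests (`aaa111` and so on) are hypothetical placeholders, not real sha512 values.

```python
def existing_paths(inventory: dict, version: str) -> dict:
    """Map each logical file path in `version` to an existing path in the object."""
    manifest = inventory["manifest"]
    for entry in inventory["versions"]:
        if entry["version"] == version:
            return {
                # The digest is the join key between state and manifest.
                logical: manifest[digest][0]
                for digest, logical_paths in entry["state"].items()
                for logical in logical_paths
            }
    raise KeyError(f"no such version: {version}")


# Hypothetical two-version inventory with placeholder digests.
EXAMPLE_INVENTORY = {
    "head": "v2",
    "manifest": {
        "aaa111": ["v1/foo/bar.xml"],
        "bbb222": ["v1/empty.txt"],
        "ccc333": ["v2/foo/bar.xml"],
    },
    "versions": [
        {"version": "v1", "state": {"aaa111": ["foo/bar.xml"], "bbb222": ["empty.txt"]}},
        {"version": "v2", "state": {"ccc333": ["foo/bar.xml"], "bbb222": ["empty.txt"]}},
    ],
}
```

With this data, `existing_paths(EXAMPLE_INVENTORY, "v2")` resolves `empty.txt` to `v1/empty.txt`: the unchanged file is not stored again in `v2`.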
{
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/foo/bar.xml" ],
    "7dcc35...c31": [ "v1/foo/bar.xml" ],
    "cf83e1...a3e": [ "v1/empty.txt" ],
    "ffccf6...62e": [ "v1/image.tiff" ]
  },
  "type": "Object",
  "versions": [
    {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty.txt" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "type": "Version",
      "user": {
        "address": "alice@example.com",
        "name": "Alice"
      },
      "version": "v1"
    },
    {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml, remove image.tiff,
29. OCFL Storage Root
The base directory of an OCFL storage layout.
Should also contain the OCFL specification in
human-readable plain-text format.
Should contain the conformance declaration.
OCFL Objects may conform to the same or
earlier version of the specification.
The storage hierarchy must terminate with an
OCFL Object Root.
[storage root]
├── 0=ocfl_1.0
├── ocfl_1.0.txt (optional)
├── ab12cd34
│   ├── 0=ocfl_object_1.0
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── v1
│       ├── file.txt
│       ├── inventory.json
│       └── inventory.json.sha512
└── ef56gh78
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1
    │   ├── empty.txt
    │   ├── foo
    │   │   └── bar.xml
    │   ├── image.tiff
    │   ├── inventory.json
    │   └── inventory.json.sha512
    └── v2
        ├── foo
        │   └── bar.xml
        ├── inventory.json
        └── inventory.json.sha512
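Because every object root carries its own conformance declaration, the objects in a storage root can be found by walking the directory tree. A minimal Python sketch, assuming the declaration filename starts with `0=ocfl_object_` as in the listing above; the function name `find_object_roots` is hypothetical.

```python
import os


def find_object_roots(storage_root: str) -> list:
    """Return sorted paths of directories that hold an object conformance declaration."""
    roots = []
    for dirpath, dirnames, filenames in os.walk(storage_root):
        if any(name.startswith("0=ocfl_object_") for name in filenames):
            roots.append(dirpath)
            # The hierarchy terminates here: do not descend into the object.
            dirnames[:] = []
    return sorted(roots)
```

This works for both flat and nested layouts, since it relies only on the declaration file and not on any particular directory pattern.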
30. OCFL Storage Root
Storage hierarchies must not include files
within intermediate directories
Storage hierarchies must be terminated by
OCFL Object Roots
Storage hierarchies within the same OCFL
Storage Root should use just one layout
pattern
Storage hierarchies within the same OCFL
Storage Root should consistently use either a
directory hierarchy of OCFL Objects or
top-level OCFL Objects
[storage root]
├── 0=ocfl_1.0
├── ocfl_1.0.txt (optional)
└── ab
    └── 12
        └── cd
            └── 34
                └── ab12cd34
                    ├── 0=ocfl_object_1.0
                    ├── inventory.json
                    ├── inventory.json.sha512
                    ├── v1
                    │   ├── empty.txt
                    │   ├── foo
                    │   │   └── bar.xml
                    │   ├── image.tiff
                    │   ├── inventory.json
                    │   └── inventory.json.sha512
                    └── v2
                        ├── foo
                        │   └── bar.xml
                        ├── inventory.json
                        └── inventory.json.sha512
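The nested layout in this example can be reproduced, and the "no files in intermediate directories" rule checked, with a short Python sketch. Both function names are hypothetical, and splitting the identifier into two-character segments is simply the pattern visible in the listing, not something OCFL mandates. Files directly in the storage root (such as `0=ocfl_1.0`) are treated as allowed.

```python
import os
from pathlib import PurePosixPath


def object_path(object_id: str, segment_length: int = 2) -> str:
    """Build a nested path such as ab/12/cd/34/ab12cd34 from an identifier."""
    if not object_id:
        raise ValueError("object_id must not be empty")
    segments = [
        object_id[i:i + segment_length]
        for i in range(0, len(object_id), segment_length)
    ]
    return str(PurePosixPath(*segments, object_id))


def files_in_intermediate_dirs(storage_root: str) -> list:
    """List files found between the storage root and the object roots."""
    stray = []
    top = os.path.abspath(storage_root)
    for dirpath, dirnames, filenames in os.walk(storage_root):
        if any(name.startswith("0=ocfl_object_") for name in filenames):
            dirnames[:] = []  # object root reached; its files belong to the object
            continue
        if os.path.abspath(dirpath) != top:
            stray.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(stray)
```

For the identifier in the listing, `object_path("ab12cd34")` gives `ab/12/cd/34/ab12cd34`. An empty result from `files_in_intermediate_dirs` means the hierarchy contains only directories until each object root.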
32. Rebuildability
● Key OCFL goal -- be able to rebuild repo
from an OCFL storage root
● Therefore, in OAIS terms: must include
all the descriptive, administrative,
structural, representation, and
preservation metadata relevant to the
object.
● Optionally include copy of spec in top level
of OCFL storage root
● More complete option would be a specific
OCFL object that contains this
documentation and to have a pointer to its
location in the storage root.
Filesystem metadata
e.g. permissions, access, and
creation times
● not portable between filesystems
● not preservable through file
transfer operations
● ill-defined fixity
⇒ out-of-scope
If important, use filesystem image
format or extract as metadata
33. Empty Directories
● OCFL preserves files and their
content
● Directories serve as an
organizational convention
● Empty directories not directly
supported
⇒ Use a zero-length `.keep` file as
necessary (à la `git`, BagIt)
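The `.keep` convention can be applied before content is written into an object. A minimal Python sketch, assuming the content is staged in an ordinary directory first; `add_keep_files` is a hypothetical helper name.

```python
import os


def add_keep_files(content_dir: str, keep_name: str = ".keep") -> list:
    """Create a zero-length marker file in every empty directory under content_dir."""
    created = []
    for dirpath, dirnames, filenames in os.walk(content_dir):
        # A directory is empty when it has neither files nor subdirectories.
        if not dirnames and not filenames:
            marker = os.path.join(dirpath, keep_name)
            open(marker, "wb").close()
            created.append(marker)
    return sorted(created)
```

All zero-length marker files share one digest, so with deduplication they cost a single stored file per object.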
Data and Metadata
The only special files are the inventory,
its digest file, and conformance
declaration files
Otherwise OCFL makes no
distinction between different types of
files.
⇒ Use local conventions as
needed
34. Storage
● Filesystem or Object Store -- you choose
● Original filename or Normalized filename -- you choose
● Deduplication & Forward delta differencing (at file level) --
optional but likely desirable/normal
"logical file path" - path of file in content as part of state for a particular version
"existing file path" - path of file in OCFL object
content addressing ties these two together
36. File operations
(mungification?)
● Inheritance
● Addition
● Updating
● Renaming
● Deletion
● Reinstatement
● Purging ⇒ choices:
a. rebuild new object
b. break immutability and
rewrite (not recommended)
Yes - OCFL supports that...
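Most of these operations reduce to small edits of a version's state, the mapping from digest to logical file paths. The Python sketch below is one reading of the operation names on this slide, not a definition from the specification. The function names and the placeholder digests in the usage note are hypothetical, and manifest updates are left out.

```python
import copy


def inherit(previous_state: dict) -> dict:
    """Start a new version from a copy of the previous version's state."""
    return copy.deepcopy(previous_state)


def add_file(state: dict, digest: str, logical_path: str) -> None:
    """Addition (or reinstatement, when the digest comes from an older version)."""
    state.setdefault(digest, []).append(logical_path)


def delete_file(state: dict, logical_path: str) -> str:
    """Deletion: drop the path from this version's state and return its digest."""
    for digest, paths in list(state.items()):
        if logical_path in paths:
            paths.remove(logical_path)
            if not paths:
                del state[digest]
            return digest
    raise KeyError(logical_path)


def rename_file(state: dict, old_path: str, new_path: str) -> None:
    """Renaming: same digest, new logical path; no content is copied."""
    add_file(state, delete_file(state, old_path), new_path)


def update_file(state: dict, logical_path: str, new_digest: str) -> None:
    """Updating: same logical path, new digest."""
    delete_file(state, logical_path)
    add_file(state, new_digest, logical_path)
```

For example, starting from `{"d1": ["foo/bar.xml"], "d2": ["image.tiff"]}`, calling `update_file(state, "foo/bar.xml", "d3")` and then `delete_file(state, "image.tiff")` leaves `{"d3": ["foo/bar.xml"]}`. The deleted file's content stays in the earlier version, which is why reinstatement is just `add_file` with the old digest.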
37. Version Immutability
OCFL supports systems where
versions (everything in a given
version directory) are immutable once
written.
● It is recommended to follow this
practice
● BUT you can rewrite objects if
you really want to, but
Deduplication
OCFL supports (in fact, enforces for
internal references) deduplication
through digests
● Only within an object
● File level
● sha512 digest recommended
38. Forward Delta
Each version need only include new
and changed files
● Files from previous version
included by reference
● Reference by content (digest)
supports renaming without
duplicating
(You can avoid this and include files again if you
really want. But why?)
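A forward-delta version can be planned by hashing the incoming files and storing only those whose digest is not yet in the manifest. This is a rough Python sketch under stated assumptions: `plan_version` and `sha512_of` are hypothetical names, the new content is staged in a plain directory, and existing file paths take the form `<version>/<logical path>` as in the example object earlier in this deck.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path) -> str:
    """Hex sha512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_version(content_dir: str, manifest: dict, version: str):
    """Return (state, to_store) for a new version of an object.

    state    maps digest -> logical paths for every file in content_dir.
    to_store maps digest -> existing path, only for content not already
             present in the manifest (the forward delta).
    """
    base = Path(content_dir)
    state, to_store = {}, {}
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        logical = path.relative_to(base).as_posix()
        digest = sha512_of(path)
        state.setdefault(digest, []).append(logical)
        if digest not in manifest and digest not in to_store:
            to_store[digest] = [f"{version}/{logical}"]
    return state, to_store
```

A renamed file shows up in `state` under its new logical path but not in `to_store`, because its digest is already in the manifest.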
Fixity
1. Digests used for reference
already provide a basis for strong
fixity checks (pref. sha512)
2. Additional digests may be
included to support legacy fixity
information (e.g. md5)
(Fixity of inventory files themselves handled by
sidecar file, e.g. inventory.json.sha512)
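Since the manifest keys are content digests, a fixity check amounts to re-hashing every existing file path and comparing. A minimal Python sketch: `check_fixity` is a hypothetical name, and it needs full hex digests rather than the abbreviated ones shown on the slides. The `algorithm` parameter is there so a legacy digest such as md5 could be checked the same way.

```python
import hashlib
from pathlib import Path


def check_fixity(object_root: str, manifest: dict, algorithm: str = "sha512") -> list:
    """Return existing file paths that are missing or do not match their digest."""
    failures = []
    for digest, existing_paths in manifest.items():
        for relative in existing_paths:
            file_path = Path(object_root) / relative
            if not file_path.is_file():
                failures.append(relative)
                continue
            hasher = hashlib.new(algorithm)
            with open(file_path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    hasher.update(chunk)
            if hasher.hexdigest() != digest.lower():
                failures.append(relative)
    return sorted(failures)
```

An empty list means every file listed in the manifest is present and unchanged.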
39. Log Information
logs directory in OCFL object
available for information not in
the object's content and not versioned
● form not specified
● will be ignored in object
validation
Small Files
Objects with many small files may
cause problems with some storage
infrastructures and may make
validation/fixity time consuming
● package in single file (ZIP
recommended)
(Options for a later version of the OCFL spec
are ZIPped objects and/or ZIP by version)
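Packaging many small files into one ZIP before adding it to an object can be done with the Python standard library. A minimal sketch: `package_small_files` is a hypothetical name, the ZIP is assumed to be written outside the source directory, and storing entries uncompressed (`ZIP_STORED`) is an illustrative choice rather than anything the slides prescribe.

```python
import zipfile
from pathlib import Path


def package_small_files(source_dir: str, zip_path: str) -> int:
    """Write every file under source_dir into one ZIP; return the number of files added."""
    base = Path(source_dir)
    count = 0
    # Assumption: zip_path lies outside source_dir, so the archive
    # never tries to include itself.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in sorted(p for p in base.rglob("*") if p.is_file()):
            # Store paths relative to source_dir so the ZIP is relocatable.
            archive.write(path, arcname=path.relative_to(base).as_posix())
            count += 1
    return count
```

The resulting ZIP is then a single content file as far as OCFL is concerned, with one digest in the manifest.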
40. Roadmap
Alpha (yesterday)
● Released(ish) on October 10 community call
(OCFL Editors and PASIG Discuss)
● Feedback for November community call
Beta (date based on feedback)
● Experimental validation tool
● Determine what other groups/communities to
seek input from
Release 1.0 (2019)
● One production-ready validator
● Test suite and fixture objects
● Two institutions committed to backing the
initiative (should define that)