Digital preservation and institutional repositories
for the digital arts and
University of Wisconsin
Preservation for the
digital arts and
And I said...
... you’re giving me how
much time for this?
• “Preservation” means nothing unmodified.
• This is why it becomes such a bogeyman!
• Two things you need to know first:
• why you’re preserving what you’re preserving, and
• what you’re preserving it against.
• Your collection-development policy should
inform the first question.
• Your coll-dev policy doesn’t include local born-digital or
digitized materials? This is a problem. Fix it.
• The second question is your “threat model.”
Why did I just make you
• I’m weird.
• I’m trying to destroy the myth that any given
medium “preserves itself.”
• Media do not preserve themselves. People preserve media
—or media get bizarrely lucky.
• We need not panic over digital preservation
any more than we panic about print.
• Approach digital preservation the same way you approach
• Strategically: this approach helps your colleagues get a
grip, too. Your colleagues may well be the biggest barrier
to digital preservation in your library!
In your groups...
List important threats
to digital data.
• “It’s in Google, so it’s preserved.” (Not even
• “I make backups, so I’m fine.”
• “I have a graduate student who takes care of
• “Metadata? What’s that? I have to have it?”
• “Digital preservation is an unsolvable problem,
so why even try?” (I’ve heard this one from
librarians. I bet you have too.)
But first, a word about
• “We can’t save everything digital!”
• Well, no, we can’t.
• We can’t save everything printed either.
• That’s no excuse, in either medium. Why do we
let it be one for digital materials?
• Yes, we will lose some stuff. That’s life in the
big city. Dive in anyway.
And a word about scale
• Many of those currently panicking about digital
preservation are thinking about huge scales.
• At some repository size, bitrot happens faster than you can
detect and fix it.
• Last I heard, this was somewhere in the exabyte range.
• We’re not. So let’s relax about some of this
stuff. At our scale, many problems are solvable.
• Unless your problem is digital video. Good luck with that.
• Our scale problems happen on the front end, as
we’ve been learning this week.
Physical medium failure
• Gold CDs are not the panacea we thought.
• They’re not bad; they’re just hard to audit, so they fail
(when they fail) silently. Silent failure is DEADLY.
• How long will hardware be able to read them?
• ALL such physical media are risky, for the same reasons!
• Current state of the art: get it on spinning disk.
• Back up often. Distribute your backups
geographically. Test them now and then.
• Consider a LOCKSS cooperative agreement. Others have.
• Any physical medium WILL FAIL. Have a plan
for when it does.
Bitrot
• Sometimes used for “file format obsolescence.”
• I use it for “the bits flipped unexpectedly.”
• Checking a file bit-by-bit against a backup copy
is computationally impractical to do every day.
• Though on ingest it’s a good idea to verify bit-by-bit!
• A file is, fundamentally, a great big number.
• Do math on that number. Store the result as metadata.
• To check for bitrot, redo the math and check the answer
against the stored result. If they’re different, scream.
• Several checksum algorithms; for our purposes, which one
you use doesn’t matter much.
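The "do the math, store the result, redo the math" routine above can be sketched in a few lines of Python. This is a minimal sketch, not a production fixity tool: SHA-256 is an assumed choice of algorithm, and the function names `file_checksum`, `build_manifest`, and `audit` are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading in chunks so big files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Do the math once: map each file's relative path to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_checksum(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, manifest, algorithm="sha256"):
    """Redo the math; report files that are missing or whose checksum changed."""
    root = Path(root)
    problems = []
    for relative_path, stored in manifest.items():
        target = root / relative_path
        if not target.is_file():
            problems.append((relative_path, "missing"))
        elif file_checksum(target, algorithm) != stored:
            problems.append((relative_path, "checksum mismatch"))
    return problems
```

Store the manifest alongside your other metadata, run `audit` on a schedule (and against your backups), and scream whenever it returns anything other than an empty list.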
File format obsolescence
• When possible, prefer file formats that are:
• Open/non-proprietary. (If a software vendor goes out of
business, does their format?)
• Standardized, non-patent-encumbered
• In widespread use. (If the format dies, lots of people have
incentive to solve the problem.)
• For text, non-binary
• For everything else, lossless rather than lossy
• For compound objects, compound documents rather than
• Realistically? We often have to take what
Lossless? Lossy? What?
• Essential tradeoff: quality and fidelity vs. file size
• Clipping information out makes the file size
smaller! But once it’s gone, it’s gone.
• Tremendous problem with video. Lossless video
formats are HUGE.
• Lossy image formats: JPEG, JPEG2000 (much
• (more or less) Lossless: TIFF, PNG, GIF
• Compression may be lossless or lossy. Find out!
• I am NOT going to talk about codecs vs.
container formats. Consider it homework.
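To make the lossless/lossy distinction concrete, here is a toy Python sketch. Assumptions: zlib stands in for any lossless compressor, and "drop the low-order bits of every byte" is a deliberately crude stand-in for what a lossy encoder does (real codecs are far smarter about what they throw away). The function names are made up for illustration.

```python
import zlib


def lossless_roundtrip(data: bytes) -> bytes:
    """Compress and decompress with zlib; every original byte comes back."""
    return zlib.decompress(zlib.compress(data))


def lossy_roundtrip(data: bytes, keep_bits: int = 4) -> bytes:
    """Zero the low-order bits of each byte: smaller once compressed, but detail is gone."""
    mask = (0xFF << (8 - keep_bits)) & 0xFF
    return bytes(value & mask for value in data)


samples = bytes(range(256)) * 4  # stand-in for raw image or audio samples

restored = lossless_roundtrip(samples)
degraded = lossy_roundtrip(samples)

print("lossless copy identical:", restored == samples)
print("lossy copy identical:   ", degraded == samples)
print("compressed size, original:", len(zlib.compress(samples)))
print("compressed size, degraded:", len(zlib.compress(degraded)))
```

The degraded copy compresses better precisely because information was clipped out, and nothing in the degraded bytes lets you get the original back.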
• For audio, there is no ideal choice; lossless formats are
patent-encumbered and/or proprietary
• WAV and AIFF are okay. Ogg Vorbis is ideal on openness
(though it is lossy), but nobody supports it.
• MP3: use it if you must, but it’s lossy.
Migration vs. emulation
• Migration: move the file to a new format
• Don’t throw away your original! You may have made the
wrong migration decision.
• Not necessarily a lossless process. (Fonts!)
• Emulation: create a modern hardware/software
environment that can deal with the old format
• For some cultural artifacts such as games, this is the only
• Emulation advocates make big claims that I’m not sure
they can back up. Proceed with caution.
Normalization
• Migration of a dataset toward a well-defined
• “Treat the same thing the same way.”
• E.g. census data... define a set of data tables, move all
data into them.
• Great for interoperability and preservation!
• Pitfall: “the same thing”?
• Humanities: TEI is a de facto normalizer for
humanities textual data.
• (Other XML formats in other fields: e.g. ChemML, NLM
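A rough sketch of the "define a set of data tables, move all data into them" idea, in Python. Everything here is hypothetical: the target fields, the two source layouts (`survey_a`, `survey_b`), and the sample values are invented purely to show the shape of the work.

```python
# Hypothetical target schema: every record ends up with exactly these fields.
TARGET_FIELDS = ("place", "year", "population")

# Hypothetical source layouts: each maps its own column names onto the target schema.
FIELD_MAPS = {
    "survey_a": {"town": "place", "yr": "year", "pop": "population"},
    "survey_b": {"Place Name": "place", "Census Year": "year", "Total Persons": "population"},
}


def normalize(record, source):
    """Rename a source record's fields to the target schema and coerce the numbers."""
    mapping = FIELD_MAPS[source]
    renamed = {mapping[key]: value for key, value in record.items() if key in mapping}
    missing = [field for field in TARGET_FIELDS if field not in renamed]
    if missing:
        raise ValueError(f"{source} record lacks {missing}")
    return {
        "place": str(renamed["place"]).strip(),
        "year": int(renamed["year"]),
        "population": int(str(renamed["population"]).replace(",", "")),
    }


# Made-up records in two different layouts normalize to the same shape.
print(normalize({"town": " Exampleville ", "yr": "1900", "pop": "1,234"}, "survey_a"))
print(normalize({"Place Name": "Exampleville", "Census Year": 1900, "Total Persons": 1234}, "survey_b"))
```

The code is the easy part. The field maps are where somebody has to decide whether two columns really are "the same thing," and that decision deserves documentation of its own.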
• Migration can preserve information content
and (often but not always) appearance.
• Preserving interaction patterns is much
• Or a database with a query engine
• Or an applet or Flash object
• Or a collection whose interactions are based on an
obsolete software system. (DynaText anyone?)
• Hard problem. No obvious solutions; certainly
no easy ones.
When is a PDF not a PDF?
• When it’s a .doc with the wrong file extension
• When there’s no file extension on it at all
• When it’s so old it doesn’t follow the
standardized PDF conventions
• When it’s otherwise malformed, made by a
bad piece of software.
• How do you know whether you have a good
PDF? (Or .doc, or .jpg, or .xml, or anything else.)
File format registries and
• JHOVE: JSTOR/Harvard Object Validation
• Java software intended to be pluggable into other
• Answers “What format is this thing?” and “Is this thing a
good example of the format?”
• Limited repertoire of formats
• PRONOM/DROID + GDFR = Unified Digital
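To make the "What format is this thing?" question concrete, here is a minimal Python sketch that ignores the file extension and looks at the first few bytes instead. It is not how any of the tools above are implemented, and it only identifies; it does not validate. The signature table is tiny and incomplete, and the name `sniff_format` is made up.

```python
# Leading "magic" bytes for a handful of common formats (deliberately incomplete).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2"),  # container used by legacy .doc/.xls/.ppt
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]


def sniff_format(path):
    """Guess a file's format from its first bytes, ignoring whatever the extension says."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown"
```

A file named `thesis.pdf` that comes back as `OLE2` is the ".doc with the wrong file extension" case. A file that comes back as `PDF` has only cleared the lowest bar; whether it is a good example of the format is a much harder question.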
Forgetting what you have
• Absolutely pernicious problem. We don’t know
what we have to begin with!
• Do you know how much Faculty Stuff is scattered
throughout your institution’s .edu domain? Me neither.
But I know it’s a lot. How much of that is irreplaceable?
• We’re also bad at labelling and tracking what
• No easy answer to this one; the solution lies in
a complete praxis reinvention.
• Yeah. Good luck with that.
... but I thought you meant
in libraries, Dorothea!
• Come on, we’ve solved that one: Metadata!
• Once it’s in the library, it’s probably fine. The
real problem is all that Other Stuff Out There.
• This is a collection-development problem and
should be treated as one.
• Don’t dump it on some poor “digital preservation
librarian!” That flat out doesn’t scale.
• Don’t make the mistake of drawing thick lines around
“our stuff” and “their stuff.” Like it or not, our coll-dev
universe has moved beyond what’s published and what’s
What the stuff you have
• Collect whatever it takes to answer this
• If the owner of this material were hit by a bus tomorrow,
what would be needed for others to use it?
• Nasty discipline-specific problem.
• This is what the NARA/RLG Trusted Digital Repository
checklist is aiming at with “designated community.”
• Where NARA/RLG goes off the rails is assuming you have
to go through this exercise with EVERYTHING YOU HAVE.
• Data-dictionaries, algorithms, specifications, tech
metadata, whatever it takes. Use common sense!
Rights and DRM
• Not having IP rights to something may mean
you can’t preserve it.
• Brian Lavoie writes well about this problem.
• Copyright law and its exceptions haven’t caught up to the
• Third-party services (e.g. blogs, iTunes U, Slideshare) are a
• DRM means that no matter the rights
situation, you’re stuck.
• PDFs: Users turn on “security” features. This is DRM. Tell
them not to do that!
• Huge headache with third-party services, again.
... and other hassles
• Privacy, confidentiality, and human-subject
• Think “we’re the humanities; IRBs don’t happen to us”?
Think again. One word: FOLKLORE.
• Third-party copyright
• Campus musical or dramatic performances
• Issues of cultural sensitivity, heritage,
• You need a dark (or at least dim) archive if
you’re serious about digital preservation.
There is no way around this. Sorry.
• There is only one answer: POLICY.
• Unfortunately, it’s not a quick, easy, or
• Digital preservation costs money.
• People in high places are scared of it.
• It requires process and staff change.
• You have to make the case. And then make it
again. And again. Until they get it!
• Where I am, Somebody Else’s Problem fields are
everywhere around this issue.
You are probably the
of last resort.
Be prepared for anything
excluded from your policy
When organizations fail
• Remember Geocities? We’re worse.
• Mellon: Can’t make a list of its funded on-the-web
projects, because most of them are GONE. G-O-N-E.
• We do not, as a profession, have a safety net
for each other’s projects and materials.
• This is, frankly, unconscionable.
• I don’t know how to fix it; I am just warning
you that project rescues are and will continue
to be necessary.
• Institutional boundaries are a barrier here.
Great policy guidance
• Policy-making for research data in repositories:
• Practical data management: a legal and policy
• Australian, so take “legal” with a grain of salt
• Guide to social science data preparation and
Summary: the OAIS model
• “Reference model” for archival systems
• All theory, no praxis, by design. (Because praxis changes!)
• Four parts
• Data (and interaction) model
• Required responsibilities of an archive
• Recommended functions (in the computer-programming
sense) for carrying out those responsibilities
• My favorite distillation: Ockerbloom
For our purposes...
• We’re talking about the software.
• I’m not going to rant (much) about what IRs
are for or how they’re run.
• If you want that, read Roach Motel. Better yet, read
Palmer et al. 2009.
• We’re interested in the application (or lack
thereof ) of IRs to data curation in the arts and
humanities. Right? Right.
• I’m not afraid of the technical, and neither
should you be.
The IR content use-case
• A research paper
• In a single file; possibly more than one format
• Is not related to any other item in the history
• The user can download it, and... um... just
download it, really.
How much of our stuff
does that work for?
• Image collections
• Page-scanned books (with or without OCR)
• Marked-up books
• Theses and dissertations
• Website preservation
• Audio and video
• Complex multimedia
• Databases (linguistic, geographic...)
One metadata standard
does not fit all
• VRA Core
• MODS
• TEI Header
• Dublin Core
• ... the beat goes on.
• The simple fact is that EPrints and DSpace do
Dublin Core, METS, and nothing else natively.
This is purely inadequate for humanities data.
One file format does not
• Yes, we have to take what we get.
• With discrete files, most IR software is fine.
• Forget about streaming audio/video.
• DSpace is good with static websites.
• For other composite objects, you’re in trouble.
• For anything like a database, you’re in trouble.
The DSpace/EPrints view
of the universe
• Communities and collections
• must be given explicit permission to add or edit materials
• Metadata entry forms
• DSpace: fields configurable by collection
• EPrints: auto-configures fields based on content type
• Many permitted per item; must upload one by one in DSpace!
• Get friendly with the DSpace batch importer. You’ll need it.
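If you do end up feeding the DSpace batch importer, most of the work is building the input directories. Below is a minimal Python sketch that writes one item in DSpace's Simple Archive Format as I understand it (one directory per item holding a `dublin_core.xml`, a `contents` file listing bitstream filenames, and the files themselves). Check the layout against the documentation for your DSpace version; the function name `write_saf_item` and the item directory naming are my own inventions.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def write_saf_item(archive_dir, item_name, metadata, files):
    """Write one item directory: dublin_core.xml, a contents file, and the bitstreams.

    metadata is a list of (element, qualifier, value) tuples; use "none" for no qualifier.
    """
    item_dir = Path(archive_dir) / item_name
    item_dir.mkdir(parents=True, exist_ok=True)

    # One <dcvalue> per metadata statement, wrapped in <dublin_core>.
    root = ET.Element("dublin_core")
    for element, qualifier, value in metadata:
        dcvalue = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy each file in and list its name, one per line, in the contents file.
    names = []
    for source in files:
        source = Path(source)
        shutil.copy2(source, item_dir / source.name)
        names.append(source.name)
    (item_dir / "contents").write_text("\n".join(names) + "\n", encoding="utf-8")
    return item_dir
```

Loop this over a spreadsheet export and you have a whole archive directory ready for import, instead of uploading files one by one through the web form.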
The Fedora view of the
• You can do anything at all with anything at all
as long as you’re willing to tell Fedora how to
do it. Infinite flexibility! But also infinite
• “Content model”: what’s in this thing?
• “Service”: what should the user-interface do
with what’s in this thing?
• Metadata, relationships, stuff
Can you use Fedora for
• Yes, but not alone; you need all the Content
Models and Services bolted on top.
• Try Islandora or Muradora. Fez is a last resort; it
acts like DSpace, and this is not a good thing.
• Even if you can’t build a real Fedora digital
library now, sticking with DSpace may keep you
from building one in future...
• ... but the Fedora/DSpace merger may change
What is this FOXML
• Think of it as the Fedora batch-import format.
• It’s complex! But it can absorb any amount or
type of XML metadata or data, which is really
• Out-of-the-box IR software will handle some
A&H data-curation jobs adequately...
• ... but by no means all of them.
• If you need sophisticated UI, bite the bullet
and go with Fedora. Islandora and Muradora
make Fedora simpler for simple things than it
• If you don’t need sophisticated user-facing UI,
go with EPrints.
• DSpace is a loser choice.