Institutional repositories
  for the digital arts and
        humanities


                          Dorothea Salo
                University of Wisconsin
                 dsalo@library.wisc.edu
Preservation for the
  digital arts and
    humanities


                       Dorothea Salo
             University of Wisconsin
              dsalo@library.wisc.edu
Preservation and
institutional repositories
  for the digital arts and
        humanities



                           Dorothea Salo
                 University of Wisconsin
                  dsalo@library.wisc.edu
And I said...




... you’re giving me how
   much time for this?
Threat model
• “Preservation” without a modifier means nothing.
  • This is why it becomes such a bogeyman!
• Two things you need to know first:
  • why you’re preserving what you’re preserving, and
  • what you’re preserving it against.
• Your collection-development policy should
  inform the first question.
  • Your coll-dev policy doesn’t include local born-digital or
    digitized materials? This is a problem. Fix it.
• The second question is your “threat model.”
What is your threat
 model for print?
Homelessness
Water
Bad materials
Flora and fauna
Physical damage
Loss or destruction
Armageddon
Why did I just make you
          do that?
• I’m weird.
• I’m trying to destroy the myth that any given
  medium “preserves itself.”
  • Media do not preserve themselves. People preserve media
    —or media get bizarrely lucky.
• We need not panic over digital preservation
  any more than we panic about print.
  • Approach digital preservation the same way you approach
    print preservation.
  • Strategically: this approach helps your colleagues get a
    grip, too. Your colleagues may well be the biggest barrier
    to digital preservation in your library!
In your groups...

List important threats
    to digital data.
Physical medium failure
“Bitrot”
File format obsolescence
Forgetting what you have
Forgetting what the
stuff you have means
Rights and DRM
Lack (or disappearance)
of organizational commitment
One word: Geocities.
Ignorance




                 ?
• “It’s in Google, so it’s preserved.” (Not even
  “Google Books!”)
• “I make backups, so I’m fine.”
• “I have a graduate student who takes care of
  these things.”
• “Metadata? What’s that? I have to have it?”
• “Digital preservation is an unsolvable problem,
  so why even try?” (I’ve heard this one from
  librarians. I bet you have too.)
Apathy
Armageddon
Salo’s needs pyramid
(top of the pyramid: less immediate, less tractable)
• Fidelity to original
• Usability
• Format viability
• Bitrot
• Physical medium issues
• Acquisition issues
(base of the pyramid: more immediate, more tractable)
Mitigating the risks
But first, a word about
            failure

• “We can’t save everything digital!”
• Well, no, we can’t.
• We can’t save everything printed either.
• That’s no excuse, in either medium. Why do we
  let it be one for digital materials?
• Yes, we will lose some stuff. That’s life in the
  big city. Dive in anyway.
And a word about scale
• Many of those currently panicking about digital
  preservation are thinking about huge scales.
  • At some repository size, bitrot happens faster than you can
    detect and fix it.
  • Last I heard, this was somewhere in the exabyte range.
• We’re not. So let’s relax about some of this
  stuff. At our scale, many problems are solvable.
  • Unless your problem is digital video. Good luck with that.
• Our scale problems happen on the front end, as
  we’ve been learning this week.
Physical medium failure
• Gold CDs are not the panacea we thought.
  • They’re not bad; they’re just hard to audit, so they fail
    (when they fail) silently. Silent failure is DEADLY.
  • How long will hardware be able to read them?
  • ALL such physical media are risky, for the same reasons!
• Current state of the art: get it on spinning disk.
• Back up often. Distribute your backups
  geographically. Test them now and then.
  • Consider a LOCKSS cooperative agreement. Others have.
• Any physical medium WILL FAIL. Have a plan
  for when it does.
Bitrot
• Sometimes used for “file format obsolescence.”
• I use it for “the bits flipped unexpectedly.”
• Checking a file bit-by-bit against a backup copy
  is computationally impractical to do every day.
  • Though on ingest it’s a good idea to verify bit-by-bit!
• Checksums
  • A file is, fundamentally, a great big number.
  • Do math on the number. Store the result as metadata.
  • To check for bitrot, redo the math and check the answer
    against the stored result. If they’re different, scream.
  • Several checksum algorithms; for our purposes, which one
    you use doesn’t matter much.
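
A minimal sketch of the checksum routine described above, in Python. The function names (`file_checksum`, `audit`) and the choice of SHA-256 are assumptions made for illustration; as the slide says, the particular algorithm doesn't matter much for catching bitrot.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Do math on the file: return its checksum as a hex string."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(path, stored_checksum, algorithm="sha256"):
    """Redo the math and compare against the stored result."""
    return file_checksum(path, algorithm) == stored_checksum


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sample = Path(workdir) / "sample.txt"
        sample.write_bytes(b"Media do not preserve themselves.")

        # On ingest: compute the checksum and store it as metadata.
        stored = file_checksum(sample)
        print("fresh copy passes audit:", audit(sample, stored))

        # Simulate bitrot by flipping a single bit in the file.
        damaged = bytearray(sample.read_bytes())
        damaged[0] ^= 0x01
        sample.write_bytes(bytes(damaged))
        print("damaged copy passes audit:", audit(sample, stored))
```

If `audit` returns False, scream, then restore from one of those geographically distributed backups.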
File format obsolescence
• When possible, prefer file formats that are:
  • Open/non-proprietary. (If a software vendor goes out of
    business, does their format?)
  • Documented
  • Standardized, non-patent-encumbered
  • In widespread use. (If the format dies, lots of people have
    incentive to solve the problem.)
  • For text, non-binary
  • For everything else, lossless rather than lossy
  • For compound objects, compound documents rather than
    embedded
• Realistically? We often have to take what
  we’re given.
Lossless? Lossy? What?
• Essential tradeoff: quality and fidelity vs. file size
• Clipping information out makes the file size
  smaller! But once it’s gone, it’s gone.
• Tremendous problem with video. Lossless video
  formats are HUGE.
• Lossy image formats: JPEG, JPEG2000 (much
  less so)
• (more or less) Lossless: TIFF, PNG, GIF
• Compression may be lossless or lossy. Find out!
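
A toy illustration of the tradeoff, in Python. The byte string below is a made-up stand-in for image or audio samples, and "keep only the top four bits" is a deliberately crude stand-in for what a real lossy codec does; neither comes from the original slides.

```python
import zlib

# Hypothetical sample data standing in for pixel or audio values.
samples = bytes(range(256)) * 4


def lossless_round_trip(data):
    """Compress and decompress with zlib; nothing is clipped out."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(data, keep_bits=4):
    """Throw away the low-order bits of every byte. Once gone, gone."""
    mask = 0xFF & ~((1 << (8 - keep_bits)) - 1)
    return bytes(value & mask for value in data)


if __name__ == "__main__":
    restored = lossless_round_trip(samples)
    degraded = lossy_round_trip(samples)

    print("lossless copy identical to original:", restored == samples)
    print("lossy copy identical to original:", degraded == samples)

    # Clipping information out usually makes the stored file smaller.
    print("compressed size, original:", len(zlib.compress(samples)))
    print("compressed size, after clipping:", len(zlib.compress(degraded)))
```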
Example: JPG
Audio formats

• I am NOT going to talk about codecs vs.
  container formats. Consider it homework.
• No ideal choice here; lossless formats are
  patent-encumbered and/or proprietary
• WAV and AIFF are okay. Ogg Vorbis is open and
  unencumbered (though lossy), but nobody supports it.
• MP3: only if you must; it’s lossy.
Migration vs. emulation
• Migration: move the file to a new format
  • Don’t throw away your original! You may have made the
    wrong migration decision.
  • Not necessarily a lossless process. (Fonts!)
• Emulation: create a modern hardware/software
  environment that can deal with the old format
  • For some cultural artifacts such as games, this is the only
    reasonable option.
  • Emulation advocates make big claims that I’m not sure
    they can back up. Proceed with caution.
Normalization
• Migration of a dataset toward a well-defined
  target.
  • “Treat the same thing the same way.”
  • E.g. census data... define a set of data tables, move all
    data into them.
  • Great for interoperability and preservation!
• Pitfall: “the same thing”?
• Humanities: TEI is a de facto normalizer for
  humanities textual data.
  • (Other XML formats in other fields: e.g. CML, NLM
    DTD.)
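
A small sketch of normalization in Python, assuming a made-up target schema. The field names, the alias table, and the sample record are all hypothetical. Note where the pitfall shows up: somebody has to decide which source fields count as "the same thing," and anything nobody decided about should be kept, not silently dropped.

```python
# Hypothetical alias table: source field name -> field in the target schema.
FIELD_ALIASES = {
    "family_name": "family_name",
    "surname": "family_name",
    "last_name": "family_name",
    "birth_year": "birth_year",
    "born": "birth_year",
    "yob": "birth_year",
}


def normalize(record):
    """Move one source record into the target schema.

    Returns (normalized, unmapped). Unmapped fields are returned rather
    than discarded so a human can decide what they mean.
    """
    normalized = {}
    unmapped = {}
    for key, value in record.items():
        target = FIELD_ALIASES.get(key.strip().lower())
        if target is None:
            unmapped[key] = value
        elif target == "birth_year":
            normalized[target] = int(value)
        else:
            normalized[target] = str(value).strip()
    return normalized, unmapped


if __name__ == "__main__":
    row, leftovers = normalize({"Surname": " Smith ", "born": "1901", "shoe_size": "9"})
    print(row)
    print(leftovers)
```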
Problem: BEHAVIOR.
• Migration can preserve information content
  and (often but not always) appearance.
• Preserving interaction patterns is much
  harder!
  • E.g. a web page containing Javascript
  • Or a database with a query engine
  • Or an applet or Flash object
  • Or a collection whose interactions are based on an
    obsolete software system. (DynaText anyone?)
• Hard problem. No obvious solutions; certainly
  no easy ones.
When is a PDF not a PDF?

• When it’s a .doc with the wrong file extension
• When there’s no file extension on it at all
• When it’s so old it doesn’t follow the
  standardized PDF conventions
• When it’s otherwise malformed, made by a
  bad piece of software.
• How do you know whether you have a good
  PDF? (Or .doc, or .jpg, or .xml, or anything else.)
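
One cheap first check, sketched in Python: ignore the file extension and look at the first few bytes. The signatures below are the well-known ones for each format; the function names and the short labels are assumptions for illustration. This only answers "what does this claim to be?" It does not tell you whether the file is well-formed, which is the job of a real validator such as JHOVE.

```python
# Well-known leading byte signatures ("magic numbers").
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # container used by legacy .doc
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def sniff_format(header):
    """Guess a format from the leading bytes; None if unrecognized."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read just enough of a file to compare against the signatures."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))


if __name__ == "__main__":
    # A ".pdf" that is really a Word document gets caught here.
    print(sniff_format(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 8))
    print(sniff_format(b"%PDF-1.4\n"))
    print(sniff_format(b"no idea what this is"))
```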
File format registries and
         testing tools
• JHOVE: JSTOR/Harvard Object Validation
  Environment
  • Java software intended to be pluggable into other
    software environments
  • Answers “What format is this thing?” and “Is this thing a
    good example of the format?”
  • Limited repertoire of formats
• PRONOM/DROID + GDFR = Unified Digital
  Format Registry (UDFR)
Forgetting what you have
• Absolutely pernicious problem. We don’t know
  what we have to begin with!
  • Do you know how much Faculty Stuff is scattered
    throughout your institution’s .edu domain? Me neither.
    But I know it’s a lot. How much of that is irreplaceable?
• We’re also bad at labelling and tracking what
  we have.
• No easy answer to this one; the solution lies in
  a complete praxis reinvention.
  • Yeah. Good luck with that.
... but I thought you meant
     in libraries, Dorothea!
• Come on, we’ve solved that one: Metadata!
• Once it’s in the library, it’s probably fine. The
  real problem is all that Other Stuff Out There.
• This is a collection-development problem and
  should be treated as one.
  • Don’t dump it on some poor “digital preservation
    librarian!” That flat out doesn’t scale.
  • Don’t make the mistake of drawing thick lines around
    “our stuff” and “their stuff.” Like it or not, our coll-dev
    universe has moved beyond what’s published and what’s
    canonically “library.”
What the stuff you have
           means
• Collect whatever it takes to answer this
  question:
  • If the owner of this material were hit by a bus tomorrow,
    what would be needed for others to use it?
• Nasty discipline-specific problem.
  • This is what the NARA/RLG Trusted Digital Repository
    checklist is aiming at with “designated community.”
  • Where NARA/RLG goes off the rails is assuming you have
    to go through this exercise with EVERYTHING YOU HAVE.
  • Data-dictionaries, algorithms, specifications, tech
    metadata, whatever it takes. Use common sense!
Rights and DRM
• Not having IP rights to something may mean
  you can’t preserve it.
  • Brian Lavoie writes well about this problem.
  • Copyright law and its exceptions haven’t caught up to the
    digital age!
  • Third-party services (e.g. blogs, iTunes U, SlideShare) are a
    headache here.
• DRM means that no matter the rights
  situation, you’re stuck.
  • PDFs: Users turn on “security” features. This is DRM. Tell
    them not to do that!
  • Huge headache with third-party services, again.
... and other hassles
• Privacy, confidentiality, and human-subject
  research issues
  • Think “we’re the humanities; IRBs don’t happen to us”?
    Think again. One word: FOLKLORE.
• Third-party copyright
  • Campus musical or dramatic performances
• Issues of cultural sensitivity, heritage,
  repatriation
• You need a dark (or at least dim) archive if
  you’re serious about digital preservation.
  There is no way around this. Sorry.
Organizational commitment
• There is only one answer: POLICY.
• Unfortunately, it’s not a quick, easy, or
  uncomplicated answer.
  • Digital preservation costs money.
  • People in high places are scared of it.
  • It requires process and staff change.
• You have to make the case. And then make it
  again. And again. Until they get it!
  • Where I am, Somebody Else’s Problem fields are
    everywhere around this issue.
You are probably the
  preservation option
     of last resort.


Be prepared for anything
excluded from your policy
      to disappear.
When organizations fail
• Remember Geocities? We’re worse.
  • Mellon: Can’t make a list of its funded on-the-web
    projects, because most of them are GONE. G-O-N-E.
• We do not, as a profession, have a safety net
  for each other’s projects and materials.
• This is, frankly, unconscionable.
• I don’t know how to fix it; I am just warning
  you that project rescues are and will continue
  to be necessary.
  • Institutional boundaries are a barrier here.
Great policy guidance
• Policy-making for research data in repositories:
  a guide
  • http://www.disc-uk.org/docs/guide.pdf
• Practical data management: a legal and policy
  guide
  • http://eprints.qut.edu.au/archive/00014923/01/
    Microsoft_Word_-_Practical_Data_Management_-
    _A_Legal_and_Policy_Guide_doc.pdf
  • Australian, so take “legal” with a grain of salt
• Guide to social science data preparation and
  archiving
  • http://www.icpsr.umich.edu/ICPSR/access/dataprep.pdf
Summary: the OAIS model
• “Reference model” for archival systems
  • All theory, no praxis, by design. (Because praxis changes!)
• Four parts
  • Vocabulary
  • Data (and interaction) model
  • Required responsibilities of an archive
  • Recommended functions (in the computer-programming
    sense) for carrying out those responsibilities
• My favorite distillation: Ockerbloom
  • http://everybodyslibraries.com/2008/10/13/what-
    repositories-do-the-oais-model/
Institutional
repositories
For our purposes...
• We’re talking about the software.
• I’m not going to rant (much) about what IRs
  are for or how they’re run.
  • If you want that, read Roach Motel. Better yet, read
    Palmer et al. 2009.
• We’re interested in the application (or lack
  thereof) of IRs to data curation in the arts and
  humanities. Right? Right.
• I’m not afraid of the technical, and neither
  should you be.
IR software
• Open source
  • Fedora Commons: http://fedora-commons.info/
  • DSpace: http://dspace.org/
  • EPrints: http://eprints.org/
• Commercial
  • ContentDM: http://contentdm.com/
  • VTLS/Vital: http://www.vtls.com/products/vital
• Hosted
  • ContentDM: http://contentdm.com/
  • BePress: http://bepress.com/
  • Open Repository (based on DSpace): http://
    www.openrepository.com/
  • Digitool: http://www.exlibrisgroup.com/category/
    DigiToolOverview
In your groups...

Please brainstorm common
  examples of A&H digital
     content requiring
       preservation.
Common A&H use-cases
• Image collections
• Page-scanned books (with or without OCR)
• Marked-up books
• Theses and dissertations
• Website preservation
• Audio and video
• Complex multimedia
• Database (linguistic, geographic...)
• Software
In your groups...

Please brainstorm how you
and your patrons expect to
use and interact with these
      genres of data.

   Make a list of verbs.
What they’ll tell you
“We have an institutional repository.
You can put everything there!”
How you must not respond
The IR content use-case

• A research paper
• In a single file; possibly more than one format
  available
• Is not related to any other item in the history
  of ever
• The user can download it, and... um... just
  download it, really.
How much of our stuff
      does that work for?
• Image collections
• Page-scanned books (with or without OCR)
• Marked-up books
• Theses and dissertations
• Website preservation
• Audio and video
• Complex multimedia
• Database (linguistic, geographic...)
• Software
One user interface does
      not fit all
One metadata standard
        does not fit all
• EAD
• METS
• VRA Core
• MODS
• TEI Header
• Dublin Core
• MARC
• ... the beat goes on.
• The simple fact is that EPrints and DSpace do
  Dublin Core, METS, and nothing else natively.
  This is purely inadequate for humanities data
  curation.
One file format does not
             fit all

• Yes, we have to take what we get.
• With discrete files, most IR software is fine.
• Forget about streaming audio/video.
• DSpace is good with static websites.
• For other composite objects, you’re in trouble.
• For anything like a database, you’re in trouble.
The DSpace/EPrints view
         of the universe
• Communities and collections
• “EPeople”
  • must be given explicit permission to add or edit materials
• Metadata entry forms
  • DSpace: fields configurable by collection
  • EPrints: auto-configures fields based on content type
• Files/bitstreams
  • Many permitted per item; must upload one by one in DSpace!
  • Get friendly with the DSpace batch importer. You’ll need it.
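
A rough Python sketch of preparing a batch for the DSpace importer, assuming the "Simple Archive Format" layout as I understand it: one directory per item, holding a `dublin_core.xml` of `dcvalue` elements, a `contents` file listing one bitstream filename per line, and the files themselves. The function name, the `item_000` naming pattern, and the sample metadata are hypothetical; check the batch-import documentation for your DSpace version before trusting any of it.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_item(archive_dir, index, metadata, files):
    """Write one item directory; metadata is (element, qualifier, value) tuples."""
    item_dir = Path(archive_dir) / f"item_{index:03d}"
    item_dir.mkdir(parents=True, exist_ok=True)

    # Assumed layout: <dublin_core><dcvalue element=... qualifier=...>value</dcvalue>
    root = ET.Element("dublin_core")
    for element, qualifier, value in metadata:
        dcvalue = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy each file in and list it in "contents", one name per line.
    names = []
    for source in files:
        source = Path(source)
        shutil.copy2(source, item_dir / source.name)
        names.append(source.name)
    (item_dir / "contents").write_text("\n".join(names) + "\n", encoding="utf-8")
    return item_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        scan = Path(workdir) / "page001.tif"
        scan.write_bytes(b"II*\x00")  # placeholder bytes, not a real image
        item = write_item(
            Path(workdir) / "archive",
            0,
            [("title", None, "A hypothetical scanned page")],
            [scan],
        )
        print(sorted(path.name for path in item.iterdir()))
```

Scripting this beats uploading a few hundred page images one by one through the web form.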
The Fedora view of the
            universe
• You can do anything at all with anything at all
  as long as you’re willing to tell Fedora how to
  do it. Infinite flexibility! But also infinite
  responsibility.
• “Content model:” what’s in this thing?
• “Service:” what should the user-interface do
  with what’s in this thing?
• Metadata, relationships, stuff
Can you use Fedora for
            an IR?
• Yes, but not alone; you need all the Content
  Models and Services bolted on top.
• Try Islandora or Muradora. Fez is a last resort; it
  acts like DSpace, and this is not a good thing.
• Even if you can’t build a real Fedora digital
  library now, sticking with DSpace may keep you
  from building one in future...
• ... but the Fedora/DSpace merger may change
  things.
What is this FOXML
        stuff anyway?


• Think of it as the Fedora batch-import format.
• It’s complex! But it can absorb any amount or
  type of XML metadata or data, which is really
  quite nice.
Summing up
• Out-of-the-box IR software will handle some
  A&H data-curation jobs adequately...
• ... but by no means all of them.
• If you need sophisticated UI, bite the bullet
  and go with Fedora. Islandora and Muradora
  make Fedora simpler for simple things than it
  once was.
• If you don’t need sophisticated user-facing UI,
  go with EPrints.
• DSpace is a loser choice.
Credits
• Watch: http://www.flickr.com/photos/fdecomite/406635986/
• Wet book: http://www.flickr.com/photos/dno1967/2979040762/
• “Bookworm and Bug Juice”: http://www.flickr.com/photos/modestospeed/576659116/
• Moldy books: http://www.flickr.com/photos/umjanedoan/496656416/
• Damaged book: http://www.flickr.com/photos/donabelandewen/3375108358/
• Carnegie library: http://www.flickr.com/photos/jhoweaa/436923541/
• Floppy box: http://www.flickr.com/photos/rintakumpu/2684989757/
• Floppy art: http://www.flickr.com/photos/bludgeoner86/2507833950/
• Bitrot: http://www.flickr.com/photos/raver_mikey/2865543940/
• Escape the ring: http://www.flickr.com/photos/hydropeek/2611071166/
• Obsolete grownups: http://www.flickr.com/photos/nietsdoener/1091201075/
• Confusion: http://www.flickr.com/photos/flavinsky/3411791256/
• Confusion II: http://www.flickr.com/photos/demibrooke/2550349404/
• Axeman: http://www.flickr.com/photos/27888428@N00/3163030403/
• Lazy dazy: http://www.flickr.com/photos/hmk/2742398590/
• DRM/Orwell: http://www.flickr.com/photos/jbonnain/523672080/
• Mushroom cloud: http://www.flickr.com/photos/nicholas_t/543334336/
• Pollock: http://www.flickr.com/photos/redneck/215447253/
Thank you!
• This presentation is available under a Creative
  Commons Attribution 3.0 United States
  license.
• Please remember to credit images if you reuse
  individual slides. Thank you!

More Related Content

Viewers also liked

Biblio to Fedora Commons REST API
Biblio to Fedora Commons REST APIBiblio to Fedora Commons REST API
Biblio to Fedora Commons REST APIcmoyers
 
Fedora Overview
Fedora OverviewFedora Overview
Fedora Overvieweposthumus
 
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”DuraSpace
 
eprints digital library software
eprints digital library softwareeprints digital library software
eprints digital library softwaresonia naomi bandao
 
Repositories and digital preservation
Repositories and digital preservationRepositories and digital preservation
Repositories and digital preservationMichael Day
 
Introduction to fedora 20cat
Introduction to fedora   20catIntroduction to fedora   20cat
Introduction to fedora 20catMedo EL-Masry
 
2.28.17 Introducing DSpace 7 Webinar Slides
2.28.17 Introducing DSpace 7 Webinar Slides2.28.17 Introducing DSpace 7 Webinar Slides
2.28.17 Introducing DSpace 7 Webinar SlidesDuraSpace
 
DSpace Training Presentation
DSpace Training PresentationDSpace Training Presentation
DSpace Training PresentationThomas King
 
3.7.17 DSpace for Data: issues, solutions and challenges Webinar Slides
3.7.17 DSpace for Data: issues, solutions and challenges Webinar Slides3.7.17 DSpace for Data: issues, solutions and challenges Webinar Slides
3.7.17 DSpace for Data: issues, solutions and challenges Webinar SlidesDuraSpace
 
DSpace Tutorial : Open Source Digital Library
DSpace Tutorial : Open Source Digital LibraryDSpace Tutorial : Open Source Digital Library
DSpace Tutorial : Open Source Digital Libraryrajivkumarmca
 
What is Greenstone Digital Library and Tips for Development
What is Greenstone Digital Library and Tips for DevelopmentWhat is Greenstone Digital Library and Tips for Development
What is Greenstone Digital Library and Tips for DevelopmentAshok Kumar Satapathy
 
Introduction To Fedora
Introduction To FedoraIntroduction To Fedora
Introduction To FedoraArindam Ghosh
 
Greenstone Digital Library
Greenstone Digital LibraryGreenstone Digital Library
Greenstone Digital LibraryImran Mansuri
 
Digital libraries power point
Digital libraries power pointDigital libraries power point
Digital libraries power pointckdozier
 
DSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDuraSpace
 
DSpace repositories today and tomorrow
DSpace repositories today and tomorrowDSpace repositories today and tomorrow
DSpace repositories today and tomorrowBram Luyten
 
DSpace Today and Tomorrow
DSpace Today and TomorrowDSpace Today and Tomorrow
DSpace Today and TomorrowBram Luyten
 

Viewers also liked (17)

Biblio to Fedora Commons REST API
Biblio to Fedora Commons REST APIBiblio to Fedora Commons REST API
Biblio to Fedora Commons REST API
 
Fedora Overview
Fedora OverviewFedora Overview
Fedora Overview
 
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
 
eprints digital library software
eprints digital library softwareeprints digital library software
eprints digital library software
 
Repositories and digital preservation
Repositories and digital preservationRepositories and digital preservation
Repositories and digital preservation
 
Introduction to fedora 20cat
Introduction to fedora   20catIntroduction to fedora   20cat
Introduction to fedora 20cat
 
2.28.17 Introducing DSpace 7 Webinar Slides
2.28.17 Introducing DSpace 7 Webinar Slides2.28.17 Introducing DSpace 7 Webinar Slides
2.28.17 Introducing DSpace 7 Webinar Slides
 
DSpace Training Presentation
DSpace Training PresentationDSpace Training Presentation
DSpace Training Presentation
 
3.7.17 DSpace for Data: issues, solutions and challenges Webinar Slides
3.7.17 DSpace for Data: issues, solutions and challenges Webinar Slides3.7.17 DSpace for Data: issues, solutions and challenges Webinar Slides
3.7.17 DSpace for Data: issues, solutions and challenges Webinar Slides
 
DSpace Tutorial : Open Source Digital Library
DSpace Tutorial : Open Source Digital LibraryDSpace Tutorial : Open Source Digital Library
DSpace Tutorial : Open Source Digital Library
 
What is Greenstone Digital Library and Tips for Development
What is Greenstone Digital Library and Tips for DevelopmentWhat is Greenstone Digital Library and Tips for Development
What is Greenstone Digital Library and Tips for Development
 
Introduction To Fedora
Introduction To FedoraIntroduction To Fedora
Introduction To Fedora
 
Greenstone Digital Library
Greenstone Digital LibraryGreenstone Digital Library
Greenstone Digital Library
 
Digital libraries power point
Digital libraries power pointDigital libraries power point
Digital libraries power point
 
DSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & Configuration
 
DSpace repositories today and tomorrow
DSpace repositories today and tomorrowDSpace repositories today and tomorrow
DSpace repositories today and tomorrow
 
DSpace Today and Tomorrow
DSpace Today and TomorrowDSpace Today and Tomorrow
DSpace Today and Tomorrow
 

Similar to Digital preservation and institutional repositories

Risk management and auditing
Risk management and auditingRisk management and auditing
Risk management and auditingDorothea Salo
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013Ken Mwai
 
Meditations/Metadata in an Emergency
Meditations/Metadata in an EmergencyMeditations/Metadata in an Emergency
Meditations/Metadata in an Emergencykramsey
 
Do We Need Better Presentations
Do We Need Better PresentationsDo We Need Better Presentations
Do We Need Better PresentationsJose Ramon Macias
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Going mobile - tip, tricks and tools for building mobile web-apps
Going mobile - tip, tricks and tools for building mobile web-appsGoing mobile - tip, tricks and tools for building mobile web-apps
Going mobile - tip, tricks and tools for building mobile web-appsJoshua May
 
Embracing The Straightjacket
Embracing  The  StraightjacketEmbracing  The  Straightjacket
Embracing The StraightjacketEmma Hamer
 
The Most Important Thing: How Mozilla Does Security and What You Can Steal
The Most Important Thing: How Mozilla Does Security and What You Can StealThe Most Important Thing: How Mozilla Does Security and What You Can Steal
The Most Important Thing: How Mozilla Does Security and What You Can Stealmozilla.presentations
 
Hackability: Free/Open Source Assistive Tech
Hackability: Free/Open Source Assistive TechHackability: Free/Open Source Assistive Tech
Hackability: Free/Open Source Assistive TechLiz Henry
 
Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Aron Ahmadia
 
Object Based Storage
Object Based StorageObject Based Storage
Object Based StorageEMC
 
ForgetIT – Some store to remember, some store to forget
ForgetIT – Some store to remember, some store to forgetForgetIT – Some store to remember, some store to forget
ForgetIT – Some store to remember, some store to forgetSøren Schaffstein
 

Similar to Digital preservation and institutional repositories (20)

Risk management and auditing
Risk management and auditingRisk management and auditing
Risk management and auditing
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013
 
Preserve or preserve not
Preserve or preserve notPreserve or preserve not
Preserve or preserve not
 
Meditations/Metadata in an Emergency
Meditations/Metadata in an EmergencyMeditations/Metadata in an Emergency
Meditations/Metadata in an Emergency
 
Knowledge Management 2.0
Knowledge Management 2.0Knowledge Management 2.0
Knowledge Management 2.0
 
What lies beneath
What lies beneathWhat lies beneath
What lies beneath
 
Do We Need Better Presentations
Do We Need Better PresentationsDo We Need Better Presentations
Do We Need Better Presentations
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Tonethatplone
TonethatploneTonethatplone
Tonethatplone
 
Going mobile - tip, tricks and tools for building mobile web-apps
Going mobile - tip, tricks and tools for building mobile web-appsGoing mobile - tip, tricks and tools for building mobile web-apps
Going mobile - tip, tricks and tools for building mobile web-apps
 
Embracing The Straightjacket
Embracing  The  StraightjacketEmbracing  The  Straightjacket
Embracing The Straightjacket
 
NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run ...
NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run ...NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run ...
NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run ...
 
The Most Important Thing: How Mozilla Does Security and What You Can Steal
The Most Important Thing: How Mozilla Does Security and What You Can StealThe Most Important Thing: How Mozilla Does Security and What You Can Steal
The Most Important Thing: How Mozilla Does Security and What You Can Steal
 
Hackability: Free/Open Source Assistive Tech
Hackability: Free/Open Source Assistive TechHackability: Free/Open Source Assistive Tech
Hackability: Free/Open Source Assistive Tech
 
Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013
 
The alignment
The alignmentThe alignment
The alignment
 
Object Based Storage
Object Based StorageObject Based Storage
Object Based Storage
 
Progressing and enhancing
Progressing and enhancingProgressing and enhancing
Progressing and enhancing
 
ForgetIT – Some store to remember, some store to forget
ForgetIT – Some store to remember, some store to forgetForgetIT – Some store to remember, some store to forget
ForgetIT – Some store to remember, some store to forget
 


Digital preservation and institutional repositories

  • 1. Institutional repositories for the digital arts and humanities Dorothea Salo University of Wisconsin dsalo@library.wisc.edu
  • 2. Preservation for the digital arts and humanities Dorothea Salo University of Wisconsin dsalo@library.wisc.edu
  • 3. Preservation and institutional repositories for the digital arts and humanities Dorothea Salo University of Wisconsin dsalo@library.wisc.edu
  • 4. And I said... ... you’re giving me how much time for this?
  • 5. Threat model • “Preservation” means nothing unmodified. • This is why it becomes such a bogeyman! • Two things you need to know first: • why you’re preserving what you’re preserving, and • what you’re preserving it against. • Your collection-development policy should inform the first question. • Your coll-dev policy doesn’t include local born-digital or digitized materials? This is a problem. Fix it. • The second question is your “threat model.”
  • 6. What is your threat model for print?
  • 14. Why did I just make you do that? • I’m weird. • I’m trying to destroy the myth that any given medium “preserves itself.” • Media do not preserve themselves. People preserve media —or media get bizarrely lucky. • We need not panic over digital preservation any more than we panic about print. • Approach digital preservation the same way you approach print preservation. • Strategically: this approach helps your colleagues get a grip, too. Your colleagues may well be the biggest barrier to digital preservation in your library!
  • 15. In your groups... List important threats to digital data.
  • 20. Forgetting what the stuff you have means
  • 22. Lack (or disappearance) of organizational commitment
  • 24. Ignorance ? • “It’s in Google, so it’s preserved.” (Not even “Google Books!”) • “I make backups, so I’m fine.” • “I have a graduate student who takes care of these things.” • “Metadata? What’s that? I have to have it?” • “Digital preservation is an unsolvable problem, so why even try?” (I’ve heard this one from librarians. I bet you have too.)
  • 27. Salo’s needs pyramid (top: less immediate, less tractable; bottom: more immediate, more tractable) • Fidelity to original • Usability • Format viability • Bitrot • Physical medium issues • Acquisition issues
  • 29. But first, a word about failure • “We can’t save everything digital!” • Well, no, we can’t. • We can’t save everything printed either. • That’s no excuse, in either medium. Why do we let it be one for digital materials? • Yes, we will lose some stuff. That’s life in the big city. Dive in anyway.
  • 30. And a word about scale • Many of those currently panicking about digital preservation are thinking about huge scales. • At some repository size, bitrot happens faster than you can detect and fix it. • Last I heard, this was somewhere in the exabyte range. • We’re not. So let’s relax about some of this stuff. At our scale, many problems are solvable. • Unless your problem is digital video. Good luck with that. • Our scale problems happen on the front end, as we’ve been learning this week.
  • 31. Physical medium failure • Gold CDs are not the panacea we thought. • They’re not bad; they’re just hard to audit, so they fail (when they fail) silently. Silent failure is DEADLY. • How long will hardware be able to read them? • ALL such physical media are risky, for the same reasons! • Current state of the art: get it on spinning disk. • Back up often. Distribute your backups geographically. Test them now and then. • Consider a LOCKSS cooperative agreement. Others have. • Any physical medium WILL FAIL. Have a plan for when it does.
  • 32. Bitrot • Sometimes used for “file format obsolescence.” • I use it for “the bits flipped unexpectedly.” • Checking a file bit-by-bit against a backup copy is computationally impractical for every day. • Though on ingest it’s a good idea to verify bit-by-bit! • Checksums • A file is, fundamentally, a great big number. • Do math on the number file. Store the result as metadata. • To check for bitrot, redo the math and check the answer against the stored result. If they’re different, scream. • Several checksum algorithms; for our purposes, which one you use doesn’t matter much.
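The checksum recipe on slide 32 (do math on the file, store the result, redo the math later and compare) can be sketched in a few lines of Python. This is a minimal sketch, not any repository platform's actual fixity mechanism: the function names `file_checksum` and `has_bitrot` are made up for illustration, and SHA-256 is an assumed choice, since the slide says the algorithm doesn't matter much for these purposes.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_bitrot(path, stored_checksum, algorithm="sha256"):
    """True when the recomputed checksum no longer matches the stored one."""
    return file_checksum(path, algorithm) != stored_checksum
```

In practice you would compute `file_checksum` once on ingest, store the digest as metadata, and run `has_bitrot` on a schedule. A mismatch says only that the file changed, not which copy is the good one, which is why the backups still matter.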
  • 33. File format obsolescence • When possible, prefer file formats that are: • Open/non-proprietary. (If a software vendor goes out of business, does their format?) • Documented • Standardized, non-patent-encumbered • In widespread use. (If the format dies, lots of people have incentive to solve the problem.) • For text, non-binary • For everything else, lossless rather than lossy • For compound objects, compound documents rather than embedded • Realistically? We often have to take what we’re given.
  • 34. Lossless? Lossy? What? • Essential tradeoff: quality and fidelity vs. file size • Clipping information out makes the file size smaller! But once it’s gone, it’s gone. • Tremendous problem with video. Lossless video formats are HUGE. • Lossy image formats: JPEG, JPEG2000 (much less so) • (more or less) Lossless: TIFF, PNG, GIF • Compression may be lossless or lossy. Find out!
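To make the lossless/lossy distinction on slide 34 concrete, here is a small Python sketch. The lossless half uses the standard `zlib` module; the "lossy" half is a toy stand-in invented for illustration (it simply throws away the low four bits of every byte) and is not how JPEG or any real codec works.

```python
import zlib

original = bytes(range(256)) * 4

# Lossless: decompressing returns exactly the bytes that went in.
restored = zlib.decompress(zlib.compress(original))
lossless_ok = restored == original

# Toy "lossy" step (hypothetical): drop the low 4 bits of every byte.
# The discarded bits cannot be recovered from the result.
quantized = bytes(b & 0xF0 for b in original)
lossy_ok = quantized == original
```

Here `lossless_ok` is true and `lossy_ok` is false: once the information is clipped out, no later step can put it back.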
  • 36. Audio formats • I am NOT going to talk about codecs vs. container formats. Consider it homework. • No ideal choice here; lossless formats are patent-encumbered and/or proprietary • WAV and AIFF are okay. Ogg Vorbis is ideal, but nobody supports it. • mp3: if you must, it’s lossy.
  • 37. Migration vs. emulation • Migration: move the file to a new format • Don’t throw away your original! You may have made the wrong migration decision. • Not necessarily a lossless process. (Fonts!) • Emulation: create a modern hardware/software environment that can deal with the old format • For some cultural artifacts such as games, this is the only reasonable option. • Emulation advocates make big claims that I’m not sure they can back up. Proceed with caution.
  • 38. Normalization • Migration of a dataset toward a well-defined target. • “Treat the same thing the same way.” • E.g. census data... define a set of data tables, move all data into them. • Great for interoperability and preservation! • Pitfall: “the same thing”? • Humanities: TEI is a de facto normalizer for humanities textual data. • (Other XML formats in other fields: e.g. ChemML, NLM DTD.)
  • 39. Problem: BEHAVIOR. • Migration can preserve information content and (often but not always) appearance. • Preserving interaction patterns is much harder! • E.g. a web page containing Javascript • Or a database with a query engine • Or an applet or Flash object • Or a collection whose interactions are based on an obsolete software system. (DynaText anyone?) • Hard problem. No obvious solutions; certainly no easy ones.
  • 40. When is a PDF not a PDF? • When it’s a .doc with the wrong file extension • When there’s no file extension on it at all • When it’s so old it doesn’t follow the standardized PDF conventions • When it’s otherwise malformed, made by a bad piece of software. • How do you know whether you have a good PDF? (Or .doc, or .jpg, or .xml, or anything else.)
  • 41. File format registries and testing tools • JHOVE: JSTOR/Harvard Object Validation Environment • Java software intended to be pluggable into other software environments • Answers “What format is this thing?” and “Is this thing a good example of the format?” • Limited repertoire of formats • PRONOM/DROID + GDFR = Unified Digital Formats Registry
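The first question from slides 40 and 41, "What format is this thing?", can be roughly approximated by looking at a file's opening bytes instead of trusting its extension. The sketch below is an assumption-laden illustration, not the JHOVE or DROID API: the `SIGNATURES` table and the function names are made up here, and it covers only four well-known signatures. It says nothing about whether a file is a *good* example of its format; that validation step is what the real tools are for.

```python
# Leading "magic" bytes for a handful of common formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.xls/.ppt container
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def sniff_format(header):
    """Guess a format name from the first bytes of a file, or return None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read the first 16 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))
```

A file named `report.pdf` for which `sniff_file` returns `"ole2"` is the ".doc with the wrong file extension" case from slide 40.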
  • 42. Forgetting what you have • Absolutely pernicious problem. We don’t know what we have to begin with! • Do you know how much Faculty Stuff is scattered throughout your institution’s .edu domain? Me neither. But I know it’s a lot. How much of that is irreplaceable? • We’re also bad at labelling and tracking what we have. • No easy answer to this one; the solution lies in a complete praxis reinvention. • Yeah. Good luck with that.
  • 43. ... but I thought you meant in libraries, Dorothea! • Come on, we’ve solved that one: Metadata! • Once it’s in the library, it’s probably fine. The real problem is all that Other Stuff Out There. • This is a collection-development problem and should be treated as one. • Don’t dump it on some poor “digital preservation librarian!” That flat out doesn’t scale. • Don’t make the mistake of drawing thick lines around “our stuff” and “their stuff.” Like it or not, our coll-dev universe has moved beyond what’s published and what’s canonically “library.”
  • 44. What the stuff you have means • Collect whatever it takes to answer this question: • If the owner of this material were hit by a bus tomorrow, what would be needed for others to use it? • Nasty discipline-specific problem. • This is what the NARA/RLG Trusted Digital Repository checklist is aiming at with “designated community.” • Where NARA/RLG goes off the rails is assuming you have to go through this exercise with EVERYTHING YOU HAVE. • Data-dictionaries, algorithms, specifications, tech metadata, whatever it takes. Use common sense!
  • 45. Rights and DRM • Not having IP rights to something may mean you can’t preserve it. • Brian Lavoie writes well about this problem. • Copyright law and its exceptions haven’t caught up to the digital age! • Third-party services (e.g. blogs, iTunes U, SlideShare) are a headache here. • DRM means that no matter the rights situation, you’re stuck. • PDFs: Users turn on “security” features. This is DRM. Tell them not to do that! • Huge headache with third-party services, again.
  • 46. ... and other hassles • Privacy, confidentiality, and human-subject research issues • Think “we’re the humanities; IRBs don’t happen to us”? Think again. One word: FOLKLORE. • Third-party copyright • Campus musical or dramatic performances • Issues of cultural sensitivity, heritage, repatriation • You need a dark (or at least dim) archive if you’re serious about digital preservation. There is no way around this. Sorry.
  • 47. Organizational commitment • There is only one answer: POLICY. • Unfortunately, it’s not a quick, easy, or uncomplicated answer. • Digital preservation costs money. • People in high places are scared of it. • It requires process and staff change. • You have to make the case. And then make it again. And again. Until they get it! • Where I am, Somebody Else’s Problem fields are everywhere around this issue.
  • 48. You are probably the preservation option of last resort. Be prepared for anything excluded from your policy to disappear.
  • 49. When organizations fail • Remember Geocities? We’re worse. • Mellon: Can’t make a list of its funded on-the-web projects, because most of them are GONE. G-O-N-E. • We do not, as a profession, have a safety net for each others’ projects and materials. • This is, frankly, unconscionable. • I don’t know how to fix it; I am just warning you that project rescues are and will continue to be necessary. • Institutional boundaries are a barrier here.
  • 50. Great policy guidance • Policy-making for research data in repositories: a guide • http://www.disc-uk.org/docs/guide.pdf • Practical data management: a legal and policy guide • http://eprints.qut.edu.au/archive/00014923/01/Microsoft_Word_-_Practical_Data_Management_-_A_Legal_and_Policy_Guide_doc.pdf • Australian, so take “legal” with a grain of salt • Guide to social science data preparation and archiving • http://www.icpsr.umich.edu/ICPSR/access/dataprep.pdf
  • 51. Summary: the OAIS model • “Reference model” for archival systems • All theory, no praxis, by design. (Because praxis changes!) • Four parts • Vocabulary • Data (and interaction) model • Required responsibilities of an archive • Recommended functions (in the computer-programming sense) for carrying out those responsibilities • My favorite distillation: Ockerbloom • http://everybodyslibraries.com/2008/10/13/what-repositories-do-the-oais-model/
  • 53. For our purposes... • We’re talking about the software. • I’m not going to rant (much) about what IRs are for or how they’re run. • If you want that, read Roach Motel. Better yet, read Palmer et al. 2009. • We’re interested in the application (or lack thereof) of IRs to data curation in the arts and humanities. Right? Right. • I’m not afraid of the technical, and you shouldn’t be either.
  • 54. IR software • Open source • Fedora Commons: http://fedora-commons.info/ • DSpace: http://dspace.org/ • EPrints: http://eprints.org/ • Commercial • ContentDM: http://contentdm.com/ • VTLS/Vital: http://www.vtls.com/products/vital • Hosted • ContentDM: http://contentdm.com/ • BePress: http://bepress.com/ • Open Repository (based on DSpace): http://www.openrepository.com/ • Digitool: http://www.exlibrisgroup.com/category/DigiToolOverview
  • 55. In your groups... Please brainstorm common examples of A&H digital content requiring preservation.
  • 56. Common A&H use-cases • Image collections • Page-scanned books (with or without OCR) • Marked-up books • Theses and dissertations • Website preservation • Audio and video • Complex multimedia • Database (linguistic, geographic...) • Software
  • 57. In your groups... Please brainstorm how you and your patrons expect to use and interact with these genres of data. Make a list of verbs.
  • 58. What they’ll tell you • “We have an institutional repository. You can put everything there!”
  • 59. How you must not respond
  • 60. The IR content use-case • A research paper • In a single file; possibly more than one format available • Is not related to any other item in the history of ever • The user can download it, and... um... just download it, really.
  • 61. How much of our stuff does that work for? • Image collections • Page-scanned books (with or without OCR) • Marked-up books • Theses and dissertations • Website preservation • Audio and video • Complex multimedia • Database (linguistic, geographic...) • Software
  • 62. One user interface does not fit all
  • 63. One metadata standard does not fit all • EAD • METS • VRA Core • MODS • TEI Header • Dublin Core • MARC • ... the beat goes on. • The simple fact is that EPrints and DSpace do Dublin Core, METS, and nothing else natively. This is purely inadequate for humanities data curation.
  • 64. One file format does not fit all • Yes, we have to take what we get. • With discrete files, most IR software is fine. • Forget about streaming audio/video. • DSpace is good with static websites. • For other composite objects, you’re in trouble. • For anything like a database, you’re in trouble.
  • 65. The DSpace/EPrints view of the universe • Communities and collections • “EPeople” • must be given explicit permission to add or edit materials • Metadata entry forms • DSpace: fields configurable by collection • EPrints: auto-configures fields based on content type • Files/bitstreams • Many permitted per item; must upload one by one in DSpace! • Get friendly with the DSpace batch importer. You’ll need it.
  • 66. The Fedora view of the universe • You can do anything at all with anything at all as long as you’re willing to tell Fedora how to do it. Infinite flexibility! But also infinite responsibility. • “Content model:” what’s in this thing? • “Service:” what should the user-interface do with what’s in this thing? • Metadata, relationships, stuff
  • 67. Can you use Fedora for an IR? • Yes, but not alone; you need all the Content Models and Services bolted on top. • Try Islandora or Muradora. Fez is a last resort; it acts like DSpace, and this is not a good thing. • Even if you can’t build a real Fedora digital library now, you may not be able to do so in future if you stick with DSpace... • ... but the Fedora/DSpace merger may change things.
  • 68. What is this FOXML stuff anyway? • Think of it as the Fedora batch-import format. • It’s complex! But it can absorb any amount or type of XML metadata or data, which is really quite nice.
  • 69. Summing up • Out-of-the-box IR software will handle some A&H data-curation jobs adequately... • ... but by no means all of them. • If you need sophisticated UI, bite the bullet and go with Fedora. Islandora and Muradora make Fedora simpler for simple things than it once was. • If you don’t need sophisticated user-facing UI, go with EPrints. • DSpace is a loser choice.
  • 70. Credits • Watch: http://www.flickr.com/photos/fdecomite/406635986/ • Wet book: http://www.flickr.com/photos/dno1967/2979040762/ • “Bookworm and Bug Juice”: http://www.flickr.com/photos/modestospeed/576659116/ • Moldy books: http://www.flickr.com/photos/umjanedoan/496656416/ • Damaged book: http://www.flickr.com/photos/donabelandewen/3375108358/ • Carnegie library: http://www.flickr.com/photos/jhoweaa/436923541/ • Floppy box: http://www.flickr.com/photos/rintakumpu/2684989757/ • Floppy art: http://www.flickr.com/photos/bludgeoner86/2507833950/ • Bitrot: http://www.flickr.com/photos/raver_mikey/2865543940/ • Escape the ring: http://www.flickr.com/photos/hydropeek/2611071166/ • Obsolete grownups: http://www.flickr.com/photos/nietsdoener/1091201075/ • Confusion: http://www.flickr.com/photos/flavinsky/3411791256/ • Confusion II: http://www.flickr.com/photos/demibrooke/2550349404/ • Axeman: http://www.flickr.com/photos/27888428@N00/3163030403/ • Lazy dazy: http://www.flickr.com/photos/hmk/2742398590/ • DRM/Orwell: http://www.flickr.com/photos/jbonnain/523672080/ • Mushroom cloud: http://www.flickr.com/photos/nicholas_t/543334336/ • Pollock: http://www.flickr.com/photos/redneck/215447253/
  • 71. Thank you! • This presentation is available under a Creative Commons Attribution 3.0 United States license. • Please remember to credit images if you reuse individual slides. Thank you!

Editor's Notes