1. Future of Web Archiving
Stephen Abrams
California Digital Library
Martin Klein
Los Alamos National Laboratory
Jimmy Lin
University of Maryland
Michael Nelson
Old Dominion University
Digital Preservation 2014, Washington, July 22-24
2. www.flickr.com/photos/adesigna/4090782772
Agenda
Web archiving problems and opportunities
Memento tools
WarcBase platform
Assessing quality of archives
Discussion
3. Web archiving is important but (really) hard
Why web archiving?
Continuation of longstanding mission to
collect, preserve, and provide access to the
scholarly record and our cultural heritage
Publishing/dissemination platform of
choice
But …
www.flickr.com/photos/alaig/3522953697
www.flickr.com/photos/hier_gibt_es_nichts_zu_sehen_bitte_gehen_sie_weiter/840587382
the web isn’t the web anymore
4. Web in transition
Document retrieval → Programming environment
Document viewer → Virtual machine
HTML → JavaScript
Common → Personalized
Desktop → Mobile/handheld/wearable
Information → Things
www.flickr.com/photos/swamibu/2223726960 www.flickr.com/photos/sharples/79222765
“A ‘web’ of notes with links (like references) between them …”
– Tim Berners-Lee, March 1989
5. (Some) other issues
Crawlers don’t act like browsers
► Need robots that act more like people
www.flickr.com/photos/benhusmann/5126030385
6. (Some) other issues
Crawlers don’t act like browsers
Responsiveness to time-sensitive content
► Need to bypass v-e-r-y deliberate collection development
procedures
Guardian News and Media Limited
9. (Some) other issues
Crawlers don’t act like browsers
Responsiveness to time-sensitive content
Policies, rights, and permissions
Difficult integration into traditional management
and discovery services
Siloed collections
www.flickr.com/photos/54159370@N08/7148880783
11. Supporting research
Little awareness in the scholarly community
Poorly understood use cases
Few tools
Traditional find→download→manipulate locally
workflows may not be feasible at web scale
► Need APIs and business models for in situ analysis
berkeley.edu/teach www.flickr.com/photos/infocux/8450190120
12. Technological opportunities
Better capture mechanisms
► Headless browsers
► API harvesters
…
Better discovery modalities
► Browsing the past should be as simple and intuitive as the now
…
www.flickr.com/photos/bartelomeus/4184705426 www.flickr.com/photos/shebalso/6357626617
13. Cooperative opportunities
Complementary collection development
Coordinated infrastructure support and operation
► Or perhaps centralized – a HathiTrust for web archives?
Crowd sourcing selection, description, quality
assurance
www.flickr.com/photos/chiotsrun/4115059294 www.flickr.com/photos/sagesolar/9230445157
First of all, why is web archiving important?
As members of memory institutions, it is the continuation, in a new technological context, of our longstanding mission and obligation to collect, preserve, and provide access to the scholarly record and our collective cultural heritage.
Since the web is where the content is, that is where we have to go to acquire it.
But the fundamental problem is that the web isn’t the web anymore.
As soon as you think you have quantified or characterized it, it has changed into something else; and as soon as you have processes in place to capture web content, the content is not available in the same way.
What a tangled web we weave, https://www.flickr.com/photos/alaig/3522953697
Thorsten Hartmann, Untitles, https://www.flickr.com/photos/hier_gibt_es_nichts_zu_sehen_bitte_gehen_sie_weiter/840587382
It’s different from what anyone – Tim Berners-Lee included – had in mind 25 years ago.
The web is no longer a giant document retrieval system, but a programming environment.
The browser is no longer a document viewer, but a general-purpose virtual machine; its fundamental language is no longer HTML but JavaScript.
The mode of experience has shifted from a common to a highly personalized one; whose web are we archiving?
Crumbled paper, https://www.flickr.com/photos/84564583@N08/11167321155
The great pyramid: Size matters, https://www.flickr.com/photos/swamibu/2223726960
A pile of rocks, https://www.flickr.com/photos/sharples/79222765
Paywalls, robot exclusions, crawler traps, … What we need is a collection mechanism that acts like a person
Ben Husmann, The FREE HUGS robot says “I am here for you”, https://www.flickr.com/photos/benhusmann/5126030385
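To make the robot exclusion point concrete, here is a minimal sketch of how a polite crawler decides whether it may fetch a URL. It uses Python's standard urllib.robotparser. The robots.txt content, the "examplebot" user agent, and the example.org URLs are all made up for illustration and do not come from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/public/index.html",
                "https://example.org/private/report.html"):
        print(url, allowed(ROBOTS_TXT, "examplebot", url))
```

A person with a browser can still read the excluded page, which is one reason a crawl can differ from what a visitor sees.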
Event-driven content doesn’t mesh well with established – meaning v-e-r-y deliberate – collection development processes
Search is simple if you know the URL
Hossam el-Hamalawy, Tahrir Square, https://www.flickr.com/photos/elhamalawy/6378330927
U Can’t Touch This, https://www.flickr.com/photos/vblibrary/7414544704
Dan Storey, Square peg in a round hole, https://www.flickr.com/photos/21664580@N04/2095574414
How to find enough good people? (We’re hiring!)
“You’re collecting that?”
Researchers may need programmatic or API access so that collections can be analyzed in situ rather than downloaded first
Headless browsers (PhantomJS, Umbra, etc.), API harvesters
Make browsing the past web as simple and intuitive as browsing the live web
Net casting at disk Contarf Pelican Park, https://www.flickr.com/photos/shebalso/6357626617
Bart van de Biezen, Goed Zoekveld, https://www.flickr.com/photos/bartelomeus/4184705426
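One existing approach to browsing the past with the same URLs as the live web is Memento datetime negotiation (RFC 7089). The client asks a TimeGate for an original URL and sends the desired date in an Accept-Datetime header. The sketch below only builds such a request and does not send it. The TimeGate base URL is a hypothetical placeholder, as are the function and variable names.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate endpoint; substitute the one offered by a real archive.
TIMEGATE = "https://archive.example.org/timegate/"


def timegate_request(original_url: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for a Memento datetime-negotiation request.

    `when` must be a timezone-aware datetime.
    """
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + original_url, {"Accept-Datetime": http_date}


if __name__ == "__main__":
    url, headers = timegate_request(
        "http://www.loc.gov/",
        datetime(2014, 7, 22, tzinfo=timezone.utc),
    )
    print(url)
    print(headers)
```

A TimeGate that implements the protocol would typically answer with a redirect to the archived copy closest to the requested date.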
Avoid needless duplication of effort
As librarians we have historically given perhaps inordinate priority to content creators and curators and not enough to consumers. But over significant timespans it is the users who affirmatively seek out and exploit content who may be best positioned to contribute towards its successful management.
Meyer lemons, https://www.flickr.com/photos/chiotsrun/4115059294
We sit in the shade and drink lemonade, https://www.flickr.com/photos/sagesolar/9230445157
Michael Harries, Drawing back the curtain, http://cdn.ws.citrix.com/wp-content/uploads/2012/05/iStock_000010348904XSmall.jpg