Future of web archiving

University of California Curation Center

Future of Web Archiving
Stephen Abrams
California Digital Library
Martin Klein
Los Alamos National Laboratory
Jimmy Lin
University of Maryland
Michael Nelson
Old Dominion University
Digital Preservation 2014, Washington, July 22-24

www.flickr.com/photos/adesigna/4090782772
Agenda
 Web archiving problems and opportunities
 Memento tools
 WarcBase platform
 Assessing quality of archives
 Discussion

Web archiving is important but (really) hard
 Why web archiving?
Continuation of longstanding mission to
collect, preserve, and provide access to the
scholarly record and our cultural heritage
Publishing/dissemination platform of
choice
 But …
www.flickr.com/photos/alaig/3522953697
www.flickr.com/photos/hier_gibt_es_nichts_zu_sehen_bitte_gehen_sie_weiter/840587382
the web isn’t the web anymore

Web in transition
Document retrieval → Programming environment
Document viewer → Virtual machine
HTML → JavaScript
Common → Personalized
Desktop → Mobile/handheld/wearable
Information → Things
www.flickr.com/photos/swamibu/2223726960 www.flickr.com/photos/sharples/79222765
“A ‘web’ of notes with links (like
references) between them …”
– Tim Berners-Lee, March 1989

(Some) other issues
 Crawlers don’t act like browsers
► Need robots that act more like people
www.flickr.com/photos/benhusmann/5126030385
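
One concrete way in which crawlers differ from browsers is the robots exclusion protocol, which the editor's notes list among the obstacles: a polite crawler consults robots.txt before fetching, whereas a person's browser never does. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the user-agent name, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt as a site might serve it (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler that honors these rules would skip everything under `/private/`, even though the same pages may be reachable by a person clicking through the site.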

(Some) other issues
 Crawlers don’t act like browsers
 Responsiveness to time-sensitive content
► Need to bypass v-e-r-y deliberate collection development
procedures
Guardian News and Media Limited

www.flickr.com/photos/vblibrary/7414544704
(Some) other issues
 Crawlers don’t act like browsers
 Responsiveness to time-sensitive content
 Policies, rights, and permissions
► Need to overcome legal barriers that follow the
monetization of content

www.flickr.com/photos/21664580@N04/2095574414
(Some) other issues
 Crawlers don’t act like browsers
 Responsiveness to time-sensitive content
 Policies, rights, and permissions
 Difficult integration into traditional management
and discovery services
► Leading to …

(Some) other issues
 Crawlers don’t act like browsers
 Responsiveness to time-sensitive content
 Policies, rights, and permissions
 Difficult integration into traditional management
and discovery services
 Siloed collections
www.flickr.com/photos/54159370@N08/7148880783

Supporting research
 Little awareness in the scholarly community
 Poorly understood use cases
 Few tools
 Traditional find→download→manipulate locally
workflows may not be feasible at web scale
► Need APIs and business models for in situ analysis
berkeley.edu/teach www.flickr.com/photos/infocux/8450190120

www.flickr.com/photos/bartelomeus/4184705426
www.flickr.com/photos/shebalso/6357626617
Technological opportunities
 Better capture mechanisms
► Headless browsers
► API harvesters
…
 Better discovery modalities
► Browsing the past should be as
simple and intuitive as the now
…
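
One existing approach to browsing the past as easily as the present is Memento datetime negotiation (RFC 7089), in which a client asks a TimeGate for the capture of a URL closest to a desired moment by sending an `Accept-Datetime` header. The sketch below only builds such a request and does not send it. The TimeGate base URL and the function name are hypothetical, and the URL pattern (base followed by the original URL) is an assumption that varies by archive.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate base URL; substitute the one your archive documents.
TIMEGATE_BASE = "http://archive.example.org/timegate/"


def timegate_request(original_url, when, timegate_base=TIMEGATE_BASE):
    """Build the URL and headers for a Memento datetime-negotiation request.

    `when` must be a timezone-aware datetime.
    """
    # Accept-Datetime carries the desired moment as an HTTP date in GMT.
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return timegate_base + original_url, headers
```

Passing the returned URL and headers to any HTTP client would, under these assumptions, let the TimeGate redirect to the archived capture nearest the requested date.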

Cooperative opportunities
 Complementary collection development
 Coordinated infrastructure support and operation
► Or perhaps centralized – a HathiTrust for web archives?
 Crowd sourcing selection, description, quality
assurance
www.flickr.com/photos/chiotsrun/4115059294 www.flickr.com/photos/sagesolar/9230445157

And now …
cdn.ws.citrix.com/wp-content/uploads/2012/05/iStock_000010348904XSmall.jpg

(Some) other issues
 Crawlers don’t act like browsers
 Responsiveness to time-sensitive content
 Policies, rights, and permissions
 Difficult integration into traditional management
and discovery services
 Siloed collections
 Scale
► Storage capacity
► Full-text indexing
► De-duplication
► Resources
Raiders of the Lost Ark © Paramount Pictures
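
De-duplication, listed under Scale above, is often approached by hashing each captured payload and storing identical content only once. The following is a minimal sketch of that idea, not a description of any particular archive's implementation; the function name, the choice of SHA-1, and the in-memory dictionary are assumptions for illustration.

```python
import hashlib


def deduplicate(captures):
    """Split (url, payload_bytes) captures into stored payloads and references.

    Returns a dict mapping digest -> payload (each distinct payload kept once)
    and a list of (url, digest) pairs, one per capture.
    """
    stored = {}
    references = []
    for url, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        # Keep the payload only the first time this digest is seen.
        stored.setdefault(digest, payload)
        references.append((url, digest))
    return stored, references
```

Every capture keeps a reference, so the crawl history is preserved while the storage cost grows only with distinct content.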


Editor's Notes

Checklist, https://www.flickr.com/photos/adesigna/4090782772
First of all, why is web archiving important? For members of memory institutions, it is the continuation, in a new technological context, of our longstanding mission and obligation to collect, preserve, and provide access to the scholarly record and our collective cultural heritage. Since the web is where the content is, that is where we have to go to acquire it. But the fundamental problem is that the web is no longer the web. As soon as you think you have quantified or characterized it, it has changed into something else; and as soon as you have processes in place to capture web content, the content is no longer available in the same way.
What a tangled web we weave, https://www.flickr.com/photos/alaig/3522953697
Thorsten Hartmann, Untitles, https://www.flickr.com/photos/hier_gibt_es_nichts_zu_sehen_bitte_gehen_sie_weiter/840587382
It’s different from what anyone – Tim Berners-Lee included – had in mind 25 years ago. The web is no longer a giant document retrieval system, but a programming environment. The browser is no longer a document viewer, but a general-purpose virtual machine; its fundamental language is no longer HTML but JavaScript. The mode of experience has shifted from a common one to a highly personalized one; whose web are we archiving?
Crumbled paper, https://www.flickr.com/photos/84564583@N08/11167321155
The great pyramid: Size matters, https://www.flickr.com/photos/swamibu/2223726960
A pile of rocks, https://www.flickr.com/photos/sharples/79222765
Paywalls, robot exclusions, crawler traps, … What we need is a collection mechanism that acts like a person.
Ben Husmann, The FREE HUGS robot says “I am here for you”, https://www.flickr.com/photos/benhusmann/5126030385
Event-driven content doesn’t mesh well with established – meaning v-e-r-y deliberate – collection development processes. Search is simple if you know the URL.
Hossam el-Hamalawy, Tahrir Square, https://www.flickr.com/photos/elhamalawy/6378330927
U Can’t Touch This, https://www.flickr.com/photos/vblibrary/7414544704
Dan Storey, Square peg in a round hole, https://www.flickr.com/photos/21664580@N04/2095574414
Silos, https://www.flickr.com/photos/54159370@N08/7148880783
Paywalls, robot exclusions, crawler traps, … Event-driven content doesn’t mesh well with established – meaning v-e-r-y deliberate – collection development processes. Search is simple if you know the URL. How to find enough good people? (We’re hiring!)
“You’re collecting that?” May need programmatic or API access to in situ collection analysis.
Headless browsers (PhantomJS, Umbra, etc.) and API harvesters. Make browsing the past web as simple and intuitive as browsing the live web.
Net casting at disk Contarf Pelican Park, https://www.flickr.com/photos/shebalso/6357626617
Bart van de Biezen, Goed Zoekveld, https://www.flickr.com/photos/bartelomeus/4184705426
Avoid needless duplication of effort. As librarians we have historically given perhaps inordinate priority to content creators and curators and not enough to consumers. But over significant timespans it is the users who affirmatively seek out and exploit content who may be best positioned to contribute towards its successful management.
Meyer lemons, https://www.flickr.com/photos/chiotsrun/4115059294
We sit in the shade and drink lemonade, https://www.flickr.com/photos/sagesolar/9230445157
Michael Harries, Drawing back the curtain, http://cdn.ws.citrix.com/wp-content/uploads/2012/05/iStock_000010348904XSmall.jpg
