
Mind the gap! Reflections on the state of repository data harvesting

A 24x7 presentation at Open Repositories 2017 in Brisbane, Australia.

I start with an opinionated history of the evolution of repository data harvesting from the late 1990s to the present. One conclusion is that we are currently in danger of creating a repository environment with fewer cross-repository services than before, with the potential to reinforce the silos we hope to open. I suggest that the community needs to agree upon a new solution, and further suggest that the solution should be ResourceSync.


  1. Mind the gap! Reflections on the state of repository data harvesting. Simeon Warner (Cornell University)
  2. Long, long ago, when XML was hard, Unicode was merely one possible character set, a big hard drive was 10GB, and HotBot & AltaVista had a new competitor...
  3. ... it was 1999 and the UPS meeting in Santa Fe aimed to “... identify technologies to stimulate the adoption of the concept of [Open Access] author self-archived systems in scholarly communication; theorize a framework for the integration of e-print services in the academic document system ...”
  4. Thus was born OAI-PMH: v1.0 2001, v1.1 2002, v2.0 2003
  5. OAI-PMH was great!
       • It works
       • Scales to millions of items
       • Easy to implement (good s/w libraries)
       • XML, which brought UTF-8 (hurrah!)
       • Widely deployed, stable since 2003 (v2.0)
       • Registries & validators
       • Community & documentation
  6. BASE harvests >5000 sources, >112M documents
  7. BUT...
       • Not RESTful
       • Repository-centric
       • XML metadata only
       • Metadata is wrapped
       • Dynamic set membership bug
  8. "Currently, OAI-PMH is the only behavior that is uniformly exposed by most repositories. [But], its focus on metadata, its pull-based paradigm, and its technological roots that date back to the web of the nineties put it at odds with ... current web technologies." (COAR Next Generation Repositories)
  9. Photo by drivethrucafe, CC BY-SA
  10. Google Scholar is great, but not the answer
  11. Replacement with no gap. New approach must:
       • Meet existing OAI-PMH use cases
       • Support content as well as metadata
       • Scale better
       • Follow web standards
       • Be modern, developer friendly
  12. Push-me pull-you
       • many items / sources, low latency / efficiency => push/notification
       • modest size, low barrier => pull
  13. Conclusion v1: We, the repository community, need to discuss and agree on a new approach to harvesting
  14. ResourceSync (ANSI/NISO Z39.99-2017): Sitemaps +
       • multiple sets
       • fixity
       • links
       • changes only
       • dumps
  15. + Notifications (Push): PubSubHubbub (now WebSub)
       • low latency
       • efficiency
  16. CORE: >6000 journals, >2400 repositories, >77M articles (>6M full text), metadata + content
  17. Slide from Petr Knoth / CORE, DPLAfest 2017 presentation. Tested with resync client: 20 x 25MB sitemaps, 1M items ✔
  18. IIIF & Europeana
       • 500,000,000+ IIIF resources: how to find them?
       • JSON-LD documents and related web pages
       • Europeana experiments with NLW and UCD
           o ResourceSync, Sitemaps and native structures
  19. Hyku & DPLA
       • Extension of Hydra/Samvera codebase to provide in-the-box repository
       • Native ResourceSync support
           o Both resource lists and change lists
       • Successful harvesting tests with DPLA
           o Desire for resource dumps and change dumps for efficiency (see new report: )
  20. Conclusion v2: We, the repository community, should agree on & transition to ResourceSync as the new approach to harvesting
  21. Repository prescription
       • Metadata and content should be web resources
           o stable URIs, follow web standards, not hidden behind query interfaces
       • Support ResourceSync as the primary harvesting interface
           o OAI-PMH as secondary where necessary
       • Distinguish and relate metadata and content entries
  22. That’s all folks! @zimeon
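
As a rough illustration of the "Sitemaps +" idea on slide 14, the sketch below parses a small ResourceSync resource list using only Python's standard library. The sample document, the example.com URIs, and the function name `parse_resource_list` are made up for illustration; the two XML namespaces and the `rs:md` attributes (`capability`, `hash`, `length`) follow the ResourceSync specification. This is not the resync client mentioned on slide 17, just a minimal sketch of what a harvester reads.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Sitemaps and by the ResourceSync extension elements.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical resource list: an ordinary Sitemap plus rs:md metadata.
# The hash values are placeholders that carry fixity information.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-06-01T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2017-05-30T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2017-05-31T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e" length="14599"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return (capability, resources) for a Sitemap-based ResourceSync document."""
    root = ET.fromstring(xml_text)

    # The top-level rs:md element states which capability this document is.
    top_md = root.find("rs:md", NS)
    capability = top_md.get("capability") if top_md is not None else None

    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        length = md.get("length") if md is not None else None
        resources.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "hash": md.get("hash") if md is not None else None,
            "length": int(length) if length is not None else None,
        })
    return capability, resources


if __name__ == "__main__":
    capability, resources = parse_resource_list(SAMPLE)
    print(capability)
    for resource in resources:
        print(resource["uri"], resource["lastmod"], resource["length"])
```

A harvester doing the "pull" side of slide 12 could compare each `lastmod` or `hash` against its local copy and fetch only the resources that differ; change lists use the same structure to advertise changes only.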