Mind the gap! Reflections on the state of repository data harvesting

A 24x7 presentation at Open Repositories 2017 in Brisbane, Australia. I start with an opinionated history of the evolution of repository data harvesting from the late 1990s to the present. One conclusion is that we are currently in danger of creating a repository environment with fewer cross-repository services than before, which could reinforce the silos we hope to open. I suggest that the community needs to agree upon a new solution, and further suggest that this solution should be ResourceSync.

Mind the gap!
Reflections on the state of
repository data harvesting
Simeon Warner (Cornell University)
http://orcid.org/0000-0002-7970-7855

Long long ago,
when XML was hard,
Unicode was merely one
possible character set,
a big hard drive was 10GB,
and HotBot & AltaVista
had a new competitor...

... it was 1999 and the UPS meeting in
Santa Fe aimed to
“... identify technologies to stimulate
the adoption of the concept of [Open
Access] author self-archived systems in
scholarly communication; theorize a
framework for the integration of
e-print services in the academic
document system ...”
https://www.openarchives.org/meetings/SantaFe1999/ups-invitation-ori.htm

Thus was born
OAI-PMH
v1.0 2001, v1.1 2002, v2.0 2003

OAI-PMH was great!
•  It works
•  Scales to millions of items
•  Easy to implement (good s/w libraries)
•  XML, which brought UTF-8 (hurrah!)
•  Widely deployed, stable since 2003 (v2.0)
•  Registries & validators
•  Community & documentation

BASE harvests
>5000 sources
>112M documents

BUT...
•  Not RESTful
•  Repository-centric
•  XML metadata only
•  Metadata is wrapped
•  Dynamic set membership bug
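
To make "Not RESTful" and "Metadata is wrapped" concrete, here is a minimal Python sketch. Every OAI-PMH request goes to one base URL and selects an operation with a `verb` query parameter, and each metadata record comes back nested inside protocol envelope elements. The base URL `http://repo.example.org/oai`, the sample response, and the helper names are assumptions made up for illustration; they are not taken from the talk.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    The operation is chosen by the `verb` parameter, not by the URL path.
    When resuming, resumptionToken replaces the other arguments.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hand-written sample response; the repository and record are invented.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2017-06-28T00:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://repo.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repo.example.org:1</identifier>
        <datestamp>2017-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Unwrap (identifier, title) pairs and the resumption token, if any."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        identifier = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        # The Dublin Core title sits several envelope levels below <record>.
        title = rec.findtext(f".//{{{DC_NS}}}title")
        records.append((identifier, title))
    token = root.findtext(f".//{{{OAI_NS}}}resumptionToken")
    return records, token
```

A harvester would call `list_records_url` once without a token, then keep passing the token returned by `parse_records` until none is returned. The metadata itself is never addressable at its own URL, which is the contrast with the web-resource approach discussed later.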

"Currently, OAI-PMH is the only
behavior that is uniformly exposed by
most repositories.
[But], its focus on metadata, its
pull-based paradigm, and its technological
roots that date back to the web of the
nineties put it at odds with ... current
web technologies."
COAR Next Generation Repositories
http://comment.coar-repositories.org/2-next-generation-repositories/

Photo by drivethrucafe CC BY-SA
https://www.flickr.com/photos/128758398@N07/15836296662

Google Scholar
is great, but
not the answer

Replacement with no gap
New approach must:
•  Meet existing OAI-PMH use cases
•  Support content as well as metadata
•  Scale better
•  Follow web standards
•  Be modern, developer friendly

Push-me pull-you
many items / sources
low latency / efficiency
=> push/notification
modest size
low barrier
=> pull

Conclusion v1
We, the repository
community, need to
discuss and agree on
a new approach to
harvesting

ResourceSync
ANSI/NISO Z39.99-2017
Sitemaps +
•  multiple sets
•  fixity
•  links
•  changes only
•  dumps
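
As an illustration of "Sitemaps +", the following Python sketch reads a small ResourceSync Resource List, which is a sitemap with extra `rs:md` (metadata, including fixity) and `rs:ln` (link) elements, and checks a resource's bytes against the declared hash. The sample document, its URLs, and the helper names are assumptions for illustration only. The sketch handles a single `md5:` hash value, although the attribute may carry more than one.

```python
import hashlib
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.openarchives.org/rs/terms/}"

# Invented content and resource list; the md5 is computed so the sample is consistent.
CONTENT = b"example PDF bytes"
SAMPLE = f"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-06-28T00:00:00Z"/>
  <url>
    <loc>http://repo.example.org/item/1.pdf</loc>
    <lastmod>2017-06-01T12:00:00Z</lastmod>
    <rs:md hash="md5:{hashlib.md5(CONTENT).hexdigest()}" length="{len(CONTENT)}" type="application/pdf"/>
    <rs:ln rel="describedby" href="http://repo.example.org/item/1.xml"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return (capability, resources) from a Resource List or Change List."""
    root = ET.fromstring(xml_text)
    md = root.find(RS + "md")
    capability = md.get("capability") if md is not None else None
    resources = []
    for url in root.findall(SM + "url"):
        item_md = url.find(RS + "md")
        resources.append({
            "uri": url.findtext(SM + "loc"),
            "lastmod": url.findtext(SM + "lastmod"),
            "hash": item_md.get("hash") if item_md is not None else None,
            # Links relate resources, e.g. content to its metadata record.
            "links": [(ln.get("rel"), ln.get("href")) for ln in url.findall(RS + "ln")],
        })
    return capability, resources


def fixity_ok(content, declared_hash):
    """Compare downloaded bytes with a declared 'md5:<hex>' hash."""
    algo, _, digest = declared_hash.partition(":")
    if algo != "md5":
        raise ValueError("this sketch only handles md5 hashes")
    return hashlib.md5(content).hexdigest() == digest
```

The same parsing applies to a Change List, where each entry's `rs:md` additionally carries a `change` attribute (created, updated, or deleted), so a harvester only has to fetch what changed.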

+ Notifications (Push)
PubSubHubbub
WebSub
•  low latency
•  efficiency
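
A rough sketch of the push side, in Python, assuming a harvester that subscribes to a repository's change list through a WebSub hub. It builds the form-encoded subscription request and shows how the harvester's callback would answer the hub's verification request by echoing the challenge. The topic and callback URLs and both function names are hypothetical; no network calls are made.

```python
from urllib.parse import urlencode, parse_qs


def subscription_body(topic_url, callback_url, lease_seconds=None):
    """Form-encoded body for a POST to the hub asking to subscribe to a topic."""
    params = {
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
    }
    if lease_seconds is not None:
        params["hub.lease_seconds"] = str(lease_seconds)
    return urlencode(params)


def verify_intent(query_string, expected_topic):
    """Decide how the callback answers the hub's verification GET.

    Returns (status, body): echo the challenge with 200 if the request
    matches a subscription we asked for, otherwise 404.
    """
    q = parse_qs(query_string)
    mode = q.get("hub.mode", [""])[0]
    topic = q.get("hub.topic", [""])[0]
    challenge = q.get("hub.challenge", [""])[0]
    if mode == "subscribe" and topic == expected_topic and challenge:
        return 200, challenge
    return 404, ""
```

Once the subscription is verified, the hub POSTs to the callback whenever the topic changes, so the harvester does not need to poll. That is where the low latency and efficiency come from.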

CORE
>6000 journals
>2400 repositories
>77M articles
(>6M full text)
metadata + content

Slide from Petr Knoth / CORE – DPLAfest 2017 presentation -- https://goo.gl/vz3zuJ
Tested with resync client.
20 x 25MB sitemaps, 1M items ✔

IIIF & Europeana
•  500,000,000+ IIIF resources – how to
find them?
•  JSON-LD documents and related web
pages
•  Europeana experiments with NLW and
UCD
o  ResourceSync, Sitemaps and native
structures

Hyku & DPLA
•  Extension of Hydra/Samvera codebase
to provide in-the-box repository
•  Native ResourceSync support
o  Both resource lists and change lists
•  Successful harvesting tests with DPLA
o  Desire for resource dumps and change
dumps for efficiency
(see new report:
http://hydrainabox.projecthydra.org/2017/06/22/resourcesync.html )

Conclusion v2
We, the repository
community, should
agree on & transition to
ResourceSync as the
new approach to
harvesting

Repository prescription
•  Metadata and content should be web
resources
o  stable URIs, follow web standards, not hidden
behind query interfaces
•  Support ResourceSync as the primary
harvesting interface
o  OAI-PMH as secondary where necessary
•  Distinguish and relate metadata and content
entries

That’s
all
folks
@zimeon
simeon.warner@cornell.edu
