@mart1nkle1n @hvdsomp
TPDL2018, Porto, Portugal, 12 Sep 2018
Martin Klein
Los Alamos National Laboratory
@mart1nkle1n
https://orcid.org/0000-0003-0130-2097
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
https://orcid.org/0000-0002-0715-6126
A Web-Centric Pipeline for Archiving Scholarly Artifacts
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
Scholarly Orphans – Project Motivation
Research and Research Communication on the Web
• Consideration
  • Researchers are increasingly using a variety of web platforms for collaboration and communication
• Why?
  • Many of these platforms have desirable characteristics
    • Versioning
    • Time stamping
    • Social embedding
  • Their institutions do not provide platforms that have global reach
    • Collaboration, cf. GitHub ~ productivity
    • Communication, cf. SlideShare ~ visibility
Emma Schymanski
https://orcid.org/0000-0001-6868-8145
https://github.com/schymane
https://www.slideshare.net/EmmaSchymanski
https://figshare.com/authors/Emma_Schymanski/5087039
https://publons.com/author/1538491/emma-schymanski#profile
Shawn Jones
https://orcid.org/0000-0002-4372-870X
http://www.shawnmjones.org/
https://github.com/shawnmjones
https://www.slideshare.net/shawnmjones
https://en.wikipedia.org/wiki/User:Shawnmjones
https://www.blogger.com/profile/17827543974149663194
Research and Research Communication on the Web
• Consideration
  • Researchers deposit artifacts in these web platforms
• Web platforms:
  • Dedicated to scholarship:
    • Commercial: e.g., FigShare, Publons
    • Not for profit: e.g., OSF, Zenodo
  • General purpose:
    • Commercial: e.g., GitHub, SlideShare
    • Not for profit: e.g., Wikipedia, Wikidata
Research and Research Communication on the Web
• Consideration
  • Researchers deposit artifacts in these web platforms
• Status quo - The researchers’ institutions commonly:
  • Do not know about the existence of these artifacts
  • Do not have a copy of these artifacts
Research and Research Communication on the Web
• Consideration
  • Researchers deposit artifacts in these web platforms
• Status quo – Uncertainty regarding long-term accessibility of these artifacts:
  • General purpose platforms don’t provide long-term access guarantees; platforms dedicated to scholarship commonly do
  • Uncertainty regarding the sustainability of unhindered long-term access to artifacts in these platforms:
    • Commercial: when is the change in business model coming?
    • Not for profit: will the next round of grant applications, member contributions be successful?
Research and Research Communication on the Web
• Consideration
  • Researchers deposit artifacts in these web platforms
• Status quo - These artifacts are not systematically archived:
  • No frameworks like LOCKSS/Portico exist for these artifacts
  • Researchers only selectively deposit artifacts in portals that provide archival guarantees, in order to obtain a citable DOI
  • Can’t expect researchers to (also) upload all artifacts in IRs
  • Web archives only incidentally archive these artifacts
    • Anecdotal & Hiberlink evidence
Emma’s SlideShare Artifact: 0 Mementos
https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge
http://timetravel.mementoweb.org/
Shawn’s GitHub Artifact: 1 Memento
https://github.com/shawnmjones/mediawiki
http://web.archive.org/
Hiberlink Evidence
Web resources referenced in Elsevier corpus (1996-2012)
without representative Memento in public web archives
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
The Need for an Archiving Infrastructure
Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web
https://hvdsomp.info/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
Recording versus Archiving
Recording               | Archiving
Short-term              | Longer-term
No guarantees provided  | Attempt to provide guarantees
Write many/read many    | Write once/read many
Scholarly process       | Scholarly record
Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web
https://hvdsomp.info/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
Scholarly Orphans – Project Overview
The Scholarly Orphans Project
• Funded by the Andrew W. Mellon Foundation
  • Los Alamos National Laboratory & New Mexico Consortium
  • Old Dominion University
  • 04/2016 - 03/2019
• How to capture Scholarly Orphans (i.e., the scholarly artifacts deposited in web portals) for long-term archiving?
• Experimental project, aimed at exploring technical possibilities
The Scholarly Orphans Project
• Explores an institution-driven paradigm
  • Academic institutions typically have a long shelf life
    • A basic premise underlying e.g., LOCKSS, perma.cc
  • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web
    • Collecting and archiving such artifacts aligns with the mission of academic libraries
An Institutional Perspective
The Scholarly Orphans Project
• Explores a paradigm inspired by web archiving
  • Scale of the problem
    • Can’t expect researchers to upload all artifacts in an institutional repository
    • Bilateral agreements for archival purposes with most web portals unlikely
A Web Archiving Perspective
Inspiration
• LOCKSS
  • Web crawling approach
  • Focused on journal literature
• Archive-It
  • On-demand, subscription-based web archiving
  • Not focused on scholarly orphans
• Institutional repository, auto-discovery of journal articles
  • Capture an institution’s output
  • Focused on journal literature
• The Locker Project & Amy Guy’s Personal Web Observatory work
  • Capture an individual’s web presence
  • Not focused on scholarly orphans
http://rhiaro.co.uk/
https://rhiaro.github.io/thesis/
Scholarly Orphans – Prototype Pipeline Overview
Prototype Pipeline
Demo - myresearch.institute
myresearch.institute - Researchers
• Uniquely identified by ORCIDs
• Web identities in multiple portals
• Create various types of artifacts
myresearch.institute - Portals
• Tracking started August 27, 2018
• Tracking artifacts created since August 1, 2018
• >2,200 artifacts tracked to date for all 16 researchers
myresearch.institute - Artifacts
• schema.org typology:
  • Answer
  • Article
  • BlogPosting
  • Comment
  • Dataset
  • PresentationDigitalDocument
  • Question
  • Review
  • SoftwareSourceCode
Tracking Artifacts
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs:
  • The web identity of the researcher in the portal
    • Algorithmic discovery
    • Discovery via a registry
Algorithmic Discovery of Web Identities
James Powell, Harihar Shankar, Marko Rodriguez, and Herbert Van de Sompel (2014)
EgoSystem: Where are our alumni? In: code4lib http://journal.code4lib.org/articles/9519
Discovery of Web Identities via a Registry (ORCID)
Martin Klein and Herbert Van de Sompel (2017)
Discovering Scholarly Orphans Using ORCID. In: JCDL2017 https://arxiv.org/abs/1703.09343
Shawn’s ORCID Record
https://orcid.org/0000-0002-4372-870X
Emma’s ORCID Record
https://orcid.org/0000-0001-6868-8145
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs:
  • The web identity of the researcher in the portal
    • Algorithmic discovery
    • Discovery via a registry
  • A portal API that supports:
    • Access by web identity
    • Access to contributions “since …” for the web identity
• Result of tracking:
  • URI(s) of new artifact(s) discovered in the portal
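The "since" requirement can be illustrated with a minimal sketch. It assumes a portal API has already returned a list of contributions for one web identity; the `Contribution` type, the `new_artifact_uris` helper, and the `portal.example` URIs are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class Contribution:
    """One item returned by a (hypothetical) portal API for a web identity."""
    uri: str
    created: datetime


def new_artifact_uris(contributions: Iterable[Contribution], since: datetime) -> List[str]:
    """Return URIs of contributions created strictly after `since`, oldest first."""
    recent = [c for c in contributions if c.created > since]
    return [c.uri for c in sorted(recent, key=lambda c: c.created)]


# Hypothetical API response for one researcher's web identity.
items = [
    Contribution("https://portal.example/u/alice/a1", datetime(2018, 7, 30, tzinfo=timezone.utc)),
    Contribution("https://portal.example/u/alice/a3", datetime(2018, 9, 2, tzinfo=timezone.utc)),
    Contribution("https://portal.example/u/alice/a2", datetime(2018, 8, 15, tzinfo=timezone.utc)),
]
last_run = datetime(2018, 8, 1, tzinfo=timezone.utc)
print(new_artifact_uris(items, last_run))
```

When a portal cannot filter by date on the server side, a tracker would have to page through all contributions and filter locally like this, which is one reason tracking frequency and scale matter.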
Tracking Artifacts - Architecture
Tracking Artifacts - Implementation
• Tracker event notifications:
  • Linked Data Notifications (JSON-LD) using AS2, PROV-O, schema.org
  • Identifiers: Unique tracker event identifier per notification
  • Dates: artifact publication date & artifact tracked date
  • URIs: 1+ artifact URI
• Event database:
  • Notifications stored/indexed in Elasticsearch
• Researcher database:
  • SQLite
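As a rough illustration of what such a notification might carry, the sketch below builds a JSON-LD-style payload with a unique event identifier, the two dates, and one or more artifact URIs. The property layout is an assumption made for illustration (the slides name the vocabularies but not the exact structure), and `tracker_notification` is a hypothetical helper.

```python
import json
import uuid
from datetime import datetime, timezone


def tracker_notification(artifact_uris, published, tracked, researcher_orcid):
    """Build a tracker event notification as a JSON-LD-style dict (illustrative layout)."""
    if not artifact_uris:
        raise ValueError("a tracker event needs at least one artifact URI")
    return {
        "@context": [
            "https://www.w3.org/ns/activitystreams",
            {"prov": "http://www.w3.org/ns/prov#", "schema": "http://schema.org/"},
        ],
        # Unique tracker event identifier per notification.
        "id": f"urn:uuid:{uuid.uuid4()}",
        "type": "Create",
        "actor": {"id": researcher_orcid, "type": "Person"},
        # Artifact publication date (assumed mapping to the AS2 `published` term).
        "published": published.isoformat(),
        # Artifact tracked date (assumed mapping to a PROV-O term).
        "prov:generatedAtTime": tracked.isoformat(),
        # One or more artifact URIs.
        "object": [{"id": uri, "type": "schema:CreativeWork"} for uri in artifact_uris],
    }


note = tracker_notification(
    ["https://github.com/shawnmjones/mediawiki"],
    published=datetime(2018, 8, 15, tzinfo=timezone.utc),
    tracked=datetime(2018, 8, 27, tzinfo=timezone.utc),
    researcher_orcid="https://orcid.org/0000-0002-4372-870X",
)
print(json.dumps(note, indent=2))
```

A capture or archiver notification could reuse the same shape and add the identifier of the preceding event as provenance.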
Tracking Artifacts - Demo
Demo: https://myresearch.institute/
Tracking Artifacts - Challenges
• Discovery of web identities of researchers
  • Algorithmic and registry-based discovery are currently not adequate
  • Fallback: manual discovery and entry
    • With help of researcher
• Portal API access by web identity
  • Broadly supported by general purpose portals
  • Typically not supported by scholarly portals
    • Some lack an API altogether
    • Should add ORCID access to APIs
    • OAI-PMH and ResourceSync need sets per web identity
• Professional versus personal contributions
• Tracking frequency/scale
Capturing Artifacts
Capturing Artifacts - Description
• The capture process takes as input the URI of a new artifact discovered in a portal
• Its task is to create a representative institutional capture of the artifact
• Result of capture:
  • WARC file for new artifact in an institutional archive
Capturing Artifacts - Description
• Challenges:
  • Delineate the web boundary of the artifact
    • More than the input artifact URI
    • The boundary is in the eye of the beholder
  • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts
    • Unsolved problem
Capturing Artifacts
Memento Tracer - Framework
http://tracer.mementoweb.org
Capturing Artifacts - Architecture
Capturing Artifacts - Implementation
• Capture event notifications:
  • Identifiers: Unique capture event identifier per notification; preceding tracker event identifier conveyed as provenance
  • Dates: Datetime of WARC file creation
  • URIs: 1+ WARC file URI
• Tracer, client-side:
  • Tracer Chrome extension leveraging Selenium IDE
• Tracer, server-side:
  • StormCrawler; Selenium (Chrome) with Tracer plug-in; WarcProxy; file-system storage for WARC files
http://stormcrawler.net/
https://www.seleniumhq.org/projects/webdriver/
https://github.com/odie5533/WarcProxy
Capturing Artifacts - Demo
Demo: https://myresearch.institute/
Capturing Artifacts - Challenges
• Memento Tracer:
  • Language used to express Traces (interoperability)
  • Organization of the shared repository for Traces
  • Limitations of the browser event listener approach for recording Traces
  • Selection of a Trace for capturing a web publication by other means than URI pattern
• Legal constraints
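For context, URI-pattern-based Trace selection could look roughly like the sketch below. The registry contents, the Trace names, and the `select_trace` helper are hypothetical and only illustrate the idea; they are not the Memento Tracer implementation.

```python
import re
from typing import Optional

# Hypothetical registry: URI pattern -> name of a recorded Trace.
TRACES = [
    (re.compile(r"^https://github\.com/[^/]+/[^/]+/?$"), "github-repository"),
    (re.compile(r"^https://www\.slideshare\.net/[^/]+/[^/]+/?$"), "slideshare-deck"),
]


def select_trace(artifact_uri: str) -> Optional[str]:
    """Return the name of the first Trace whose URI pattern matches, else None."""
    for pattern, name in TRACES:
        if pattern.match(artifact_uri):
            return name
    return None


print(select_trace("https://github.com/shawnmjones/mediawiki"))
print(select_trace("https://example.org/some/page"))
```

A pattern-only lookup like this cannot tell apart two page types that share a URI shape, which is one way to read the challenge of selecting a Trace by other means.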
Archiving Artifacts
Archiving Artifacts - Description
• The archiving process takes as input the URI of a WARC file generated by the capture process
• Its task is to ingest the WARC file in a cross-institutional web archive
  • This can be achieved using off-the-shelf web archiving software, e.g., pywb, OpenWayback
• Result of archiving:
  • Mementos pertaining to newly discovered artifact in a cross-institutional, Memento-compliant web archive
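A Memento-compliant archive lists the Mementos of a resource in a TimeMap, commonly serialized in link format. The sketch below counts Mementos in such a TimeMap; the sample TimeMap and the `archive.example` host are made up, the parser assumes all link parameters are quoted, and `memento_uris` is a hypothetical helper.

```python
import re

# One link-value: <URI> followed by zero or more ; name="value" parameters.
LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
REL = re.compile(r'rel="([^"]*)"')


def memento_uris(timemap: str):
    """Return the URIs of all entries whose rel value includes 'memento'."""
    uris = []
    for uri, params in LINK.findall(timemap):
        rel = REL.search(params)
        if rel and "memento" in rel.group(1).split():
            uris.append(uri)
    return uris


# Made-up TimeMap for illustration only.
sample = (
    '<https://github.com/shawnmjones/mediawiki>; rel="original",\n'
    '<https://archive.example/timegate/https://github.com/shawnmjones/mediawiki>; rel="timegate",\n'
    '<https://archive.example/20180901000000/https://github.com/shawnmjones/mediawiki>; '
    'rel="first last memento"; datetime="Sat, 01 Sep 2018 00:00:00 GMT"'
)
print(len(memento_uris(sample)))
```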
Archiving Artifacts - Architecture
Archiving Artifacts - Implementation
• Archiver event notifications:
  • Identifiers: Unique archiver event identifier per notification; preceding tracker/capturer event identifiers conveyed as provenance
  • Dates: WARC file ingest date; Memento-Datetime values
  • URIs: 1+ Memento URI, each corresponding to an artifact URI
• Web Archive:
  • pywb
• Social card:
  • MementoEmbed
https://github.com/webrecorder/pywb
https://github.com/oduwsdl/MementoEmbed
Archiving Artifacts - Demo
Demo: https://myresearch.institute/
Archiving Artifacts - Challenges
• Attempted to use ipwb, a pywb version that uses IPFS
  • Cross-institutional distributed file system with redundancy
  • Ran out of time to get it operationally stable
Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive
https://doi.org/10.1145/2910896.2925467
Scholarly Orphans – Summary
Summary (1/2)
• The Scholarly Orphans project explores an institution-driven approach to capture scholarly artifacts deposited in web portals
  • Artifacts out of scope of existing archival approaches such as LOCKSS, Portico, web archives
  • Institutions have a long shelf life, should be interested in collecting these artifacts, and have feasible scale for identity/artifact discovery
Summary (2/2)
• Components of the experimental pipeline:
  • Tracker: Automatically discover artifacts because researchers will not upload them to the institution
  • Capturer: High-fidelity artifact captures through crowd-sourcing navigation patterns with Memento Tracer
  • Archiver: Cross-institutional, Memento-compliant scholarly web archive
Acknowledgments
• Los Alamos National Laboratory:
  • Lyudmila Balakireva
  • Martin Klein
  • James Powell
  • Harihar Shankar
  • Herbert Van de Sompel
• Old Dominion University:
  • Sawood Alam
  • Grant Atkins
  • Shawn Jones
  • Mat Kelly
  • Michael L. Nelson
• myresearch.institute – all volunteering researchers

ย 
Call Now โ˜Ž 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now โ˜Ž 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now โ˜Ž 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now โ˜Ž 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
ย 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
ย 
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
ย 
Busty DesiโšกCall Girls in Vasundhara Ghaziabad >เผ’8448380779 Escort Service
Busty DesiโšกCall Girls in Vasundhara Ghaziabad >เผ’8448380779 Escort ServiceBusty DesiโšกCall Girls in Vasundhara Ghaziabad >เผ’8448380779 Escort Service
Busty DesiโšกCall Girls in Vasundhara Ghaziabad >เผ’8448380779 Escort Service
ย 
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
ย 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
ย 
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
ย 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
ย 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
ย 

A Web-Centric Pipeline for Archiving Scholarly Artifacts

  • 1. @mart1nkle1n @hvdsomp TPDL2018, Porto, Portugal, 12 Sep 2018 Martin Klein Los Alamos National Laboratory @mart1nkle1n https://orcid.org/0000-0003-0130-2097 Herbert Van de Sompel Los Alamos National Laboratory @hvdsomp https://orcid.org/0000-0002-0715-6126 A Web-Centric Pipeline for Archiving Scholarly Artifacts The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  • 2. Scholarly Orphans – Project Motivation
  • 3. Research and Research Communication on the Web • Consideration • Researchers are increasingly using a variety of web platforms for collaboration and communication • Why? • Many of these platforms have desirable characteristics • Versioning • Time stamping • Social embedding • Their institutions do not provide platforms that have global reach • Collaboration, cf. GitHub ~ productivity • Communication, cf. SlideShare ~ visibility
  • 4. Emma Schymanski https://orcid.org/0000-0001-6868-8145 https://github.com/schymane https://www.slideshare.net/EmmaSchymanski https://figshare.com/authors/Emma_Schymanski/5087039 https://publons.com/author/1538491/emma-schymanski#profile
  • 5. Shawn Jones https://orcid.org/0000-0002-4372-870X http://www.shawnmjones.org/ https://github.com/shawnmjones https://www.slideshare.net/shawnmjones https://en.wikipedia.org/wiki/User:Shawnmjones https://www.blogger.com/profile/17827543974149663194
  • 6. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in these web platforms • Web Platforms: • Dedicated to scholarship: • Commercial: e.g., FigShare, Publons • Not for profit: e.g., OSF, Zenodo • General purpose: • Commercial: e.g., GitHub, SlideShare • Not for profit: e.g., Wikipedia, Wikidata
  • 7. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in these web platforms • Status quo - The researchers’ institutions commonly: • Do not know about the existence of these artifacts • Do not have a copy of these artifacts
  • 8. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in these web platforms • Status quo – Uncertainty regarding long-term accessibility of these artifacts: • General purpose platforms don’t provide long-term access guarantees; platforms dedicated to scholarship commonly do • Uncertainty regarding the sustainability of unhindered long-term access to artifacts in these platforms: • Commercial: when is the change in business model coming? • Not for profit: will the next round of grant applications, member contributions be successful?
  • 9. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in these web platforms • Status quo - These artifacts are not systematically archived: • No frameworks like LOCKSS/Portico exist for these artifacts • Researchers only selectively deposit artifacts in portals that provide archival guarantees, to obtain a citable DOI • Can’t expect researchers to (also) upload all artifacts in IRs • Web archives only incidentally archive these artifacts • Anecdotal & Hiberlink evidence
  • 10. Emma’s SlideShare Artifact: 0 Mementos https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge http://timetravel.mementoweb.org/
  • 11. Shawn’s GitHub Artifact: 1 Memento https://github.com/shawnmjones/mediawiki http://web.archive.org/
  • 12. Hiberlink Evidence: Web resources referenced in the Elsevier corpus (1996-2012) without a representative Memento in public web archives. Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  • 13. The Need for an Archiving Infrastructure. Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web https://hvdsomp.info/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
  • 14. Recording versus Archiving • Recording: short-term; no guarantees provided; write many/read many; scholarly process • Archiving: longer-term; attempt to provide guarantees; write once/read many; scholarly record • Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web https://hvdsomp.info/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
  • 15. Scholarly Orphans – Project Overview
  • 16. The Scholarly Orphans Project • Funded by the Andrew W. Mellon Foundation • Los Alamos National Laboratory & New Mexico Consortium • Old Dominion University • 04/2016 - 03/2019 • How to capture Scholarly Orphans (i.e., the scholarly artifacts deposited in web portals) for long-term archiving? • Experimental project, aimed at exploring technical possibilities
  • 17. The Scholarly Orphans Project • Explores an institution-driven paradigm • Academic institutions typically have a long shelf life • A basic premise underlying e.g., LOCKSS, perma.cc • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web • Collecting and archiving such artifacts aligns with the mission of academic libraries
  • 18. An Institutional Perspective
  • 19. The Scholarly Orphans Project • Explores a paradigm inspired by web archiving • Scale of the problem • Can’t expect researchers to upload all artifacts in an institutional repository • Bilateral agreements for archival purposes with most web portals unlikely
  • 20. A Web Archiving Perspective
  • 21. Inspiration • LOCKSS • Web crawling approach • Focused on journal literature • Archive-It • On-demand, subscription-based web archiving • Not focused on scholarly orphans • Institutional repository, auto-discovery of journal articles • Capture an institution’s output • Focused on journal literature • The Locker Project & Amy Guy’s Personal Web Observatory work • Capture an individual’s web presence • Not focused on scholarly orphans http://rhiaro.co.uk/ https://rhiaro.github.io/thesis/
  • 22. Scholarly Orphans – Prototype Pipeline Overview
  • 23. Prototype Pipeline
  • 24. Prototype Pipeline
  • 25. Demo - myresearch.institute
  • 26. myresearch.institute - Researchers • Uniquely identified by ORCIDs • Web identities in multiple portals • Create various types of artifacts
  • 27. myresearch.institute - Portals • Tracking started August 27 2018 • Tracking artifacts created starting August 1 2018 • >2,200 artifacts tracked to date for all 16 researchers
  • 28. myresearch.institute - Artifacts • schema.org typology: • Answer • Article • BlogPosting • Comment • Dataset • PresentationDigitalDocument • Question • Review • SoftwareSourceCode
  • 29. Tracking Artifacts
  • 30. Tracking Artifacts - Description • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs: • The web identity of the researcher in the portal • Algorithmic discovery • Discovery via a registry
  • 31. Algorithmic Discovery of Web Identities. James Powell, Harihar Shankar, Marko Rodriguez, and Herbert Van de Sompel (2014) EgoSystem: Where are our alumni? In: code4lib http://journal.code4lib.org/articles/9519
  • 32. Discovery of Web Identities via a Registry (ORCID). Martin Klein and Herbert Van de Sompel (2017) Discovering Scholarly Orphans Using ORCID In: JCDL2017 https://arxiv.org/abs/1703.09343
  • 33. Shawn’s ORCID Record https://orcid.org/0000-0002-4372-870X
  • 34. Emma’s ORCID Record https://orcid.org/0000-0001-6868-8145
  • 35. Tracking Artifacts - Description • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs: • The web identity of the researcher in the portal • Algorithmic discovery • Discovery via a registry • A portal API that supports: • Access by web identity • Access to contributions “since …” for the web identity • Result of tracking: • URI(s) of new artifact(s) discovered in the portal
  • 36. Tracking Artifacts - Architecture
  • 37. Tracking Artifacts - Implementation • Tracker event notifications: • Linked Data Notifications (JSON-LD) using AS2, PROV-O, schema.org • Identifiers: Unique tracker event identifier per notification • Dates: artifact publication date & artifact tracked date • URIs: 1+ artifact URI • Event database: • Notifications stored/indexed in ElasticSearch • Researcher database: • SQLite
  • 38. Tracking Artifacts - Demo • Demo: https://myresearch.institute/
  • 39. Tracking Artifacts - Challenges • Discovery of web identities of researchers • Algorithmic, registry-based currently not adequate • Fallback: manual discovery and entry • With help of researcher • Portal API access by web identity • Broadly supported by general purpose portals • Typically not supported by scholarly portals • Some lack an API altogether • Should add ORCID access to APIs • OAI-PMH and ResourceSync need sets per web identity • Professional versus personal contributions • Tracking frequency/scale
  • 40. Capturing Artifacts
  • 41. Capturing Artifacts - Description • The capture process takes as input the URI of a new artifact discovered in a portal • Its task is to create a representative institutional capture of the artifact • Result of capture: • WARC file for new artifact in an institutional archive
  • 42. Capturing Artifacts - Description • Challenges: • Delineate the web boundary of the artifact • More than the input artifact URI • The boundary is in the eye of the beholder • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts • Unsolved problem
  • 43. Capturing Artifacts
  • 44. Capturing Artifacts
  • 45. Capturing Artifacts
  • 46. Memento Tracer - Framework http://tracer.mementoweb.org
  • 47. Capturing Artifacts - Architecture
  • 48. Capturing Artifacts - Implementation • Capture event notifications: • Identifiers: Unique capture event identifier per notification; preceding tracker event identifier conveyed as provenance • Dates: Datetime of WARC file creation • URIs: 1+ WARC file URI • Tracer, client-side: • Tracer Chrome extension leveraging Selenium IDE • Tracer, server-side: • Stormcrawler; Selenium (Chrome) with Tracer plug-in; WarcProxy; file-system storage for WARC files http://stormcrawler.net/ https://www.seleniumhq.org/projects/webdriver/ https://github.com/odie5533/WarcProxy
  • 49. Capturing Artifacts - Demo • Demo: https://myresearch.institute/
  • 50. Capturing Artifacts - Challenges • Memento Tracer: • Language used to express Traces (interoperability) • Organization of the shared repository for Traces • Limitations of the browser event listener approach for recording Traces • Selection of a Trace for capturing a web publication by other means than URI pattern • Legal constraints
  • 51. Archiving Artifacts
  • 52. Archiving Artifacts - Description • The archiving process takes as input the URI of a WARC file generated by the capture process • Its task is to ingest the WARC file in a cross-institutional web archive • This can be achieved using off-the-shelf web archiving software, e.g., pywb, OpenWayback • Result of archiving: • Mementos pertaining to newly discovered artifact in a cross-institutional, Memento-compliant web archive
  • 53. Archiving Artifacts - Architecture
  • 54. Archiving Artifacts - Implementation • Archiver event notifications: • Identifiers: Unique archiver event identifier per notification; preceding tracker/capturer event identifiers conveyed as provenance • Dates: WARC file ingest date; Memento-Datetime values • URIs: 1+ Memento URI, each corresponding to an artifact URI • Web Archive: • pywb • Social card: • MementoEmbed https://github.com/webrecorder/pywb https://github.com/oduwsdl/MementoEmbed
  • 55. Archiving Artifacts - Demo • Demo: https://myresearch.institute/
  • 56. Archiving Artifacts - Challenges • Attempted to use ipwb, a pywb version that uses IPFS • Cross-institutional distributed file system with redundancy • Ran out of time to get it operationally stable • Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive https://doi.org/10.1145/2910896.2925467
  • 57. Scholarly Orphans – Summary
  • 58. Summary (1/2) • The Scholarly Orphans project explores an institution-driven approach to capture scholarly artifacts deposited in web portals • Artifacts out of scope of existing archival approaches such as LOCKSS, Portico, web archives • Institutions have a long shelf life, should be interested in collecting these artifacts, and have feasible scale for identity/artifact discovery
  • 59. Summary (2/2) • Components of the experimental pipeline: • Tracker: Automatically discover artifacts because researchers will not upload them to the institution • Capturer: High fidelity artifact captures through crowd-sourcing navigation patterns with Memento Tracer • Archiver: Cross-institutional, Memento-compliant scholarly web archive
  • 60. Acknowledgments • Los Alamos National Laboratory: • Lyudmila Balakireva • Martin Klein • James Powell • Harihar Shankar • Herbert Van de Sompel • Old Dominion University: • Sawood Alam • Grant Atkins • Shawn Jones • Mat Kelly • Michael L. Nelson • myresearch.institute – all volunteering researchers
  • 61. Martin Klein Los Alamos National Laboratory @mart1nkle1n https://orcid.org/0000-0003-0130-2097 Herbert Van de Sompel Los Alamos National Laboratory @hvdsomp https://orcid.org/0000-0002-0715-6126 A Web-Centric Pipeline for Archiving Scholarly Artifacts The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
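
Illustrative sketch (not from the slides): slides 35 and 37 describe a tracker that asks a portal for a researcher's contributions "since" a given date and emits one JSON-LD notification per new artifact, carrying a unique event identifier, the artifact publication date, the tracked date, and the artifact URI. The Python below is one possible shape for that step, under stated assumptions. The slides do not show the project's actual payload, so the field layout, the function names new_artifacts and tracker_notification, the input dictionary keys, the example.org URIs, and the placeholder ORCID are all hypothetical. Only the AS2 context URL and the AS2 terms used (Create, actor, object, published) come from the ActivityStreams 2.0 vocabulary; PROV-O terms are left out to keep the sketch short.

```python
import json
import uuid
from datetime import datetime, timezone

# ActivityStreams 2.0 JSON-LD context (a real, published context URL).
AS2_CONTEXT = "https://www.w3.org/ns/activitystreams"


def new_artifacts(contributions, since):
    """Keep only contributions published at or after `since`.

    `contributions` is a hypothetical, already-fetched list of dicts with
    the keys "uri", "published" (datetime) and "schema_type".
    """
    return [c for c in contributions if c["published"] >= since]


def tracker_notification(orcid, artifact, tracked_at):
    """Build one tracker event notification as a JSON-serializable dict.

    The layout is an assumption made for illustration only.
    """
    return {
        "@context": [AS2_CONTEXT, {"schema": "http://schema.org/"}],
        # Unique tracker event identifier per notification.
        "id": "urn:uuid:" + str(uuid.uuid4()),
        "type": "Create",
        "actor": {"id": orcid, "type": "Person"},
        # Date the tracker discovered the artifact.
        "published": tracked_at.isoformat(),
        "object": {
            "id": artifact["uri"],
            "type": ["Document", "schema:" + artifact["schema_type"]],
            # Date the artifact was published in the portal.
            "schema:datePublished": artifact["published"].isoformat(),
        },
    }


if __name__ == "__main__":
    utc = timezone.utc
    contributions = [
        {"uri": "https://example.org/artifact/1",
         "published": datetime(2018, 7, 15, tzinfo=utc),
         "schema_type": "Dataset"},
        {"uri": "https://example.org/artifact/2",
         "published": datetime(2018, 8, 20, tzinfo=utc),
         "schema_type": "SoftwareSourceCode"},
    ]
    since = datetime(2018, 8, 1, tzinfo=utc)
    for artifact in new_artifacts(contributions, since):
        note = tracker_notification(
            "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
            artifact,
            datetime(2018, 8, 27, tzinfo=utc),
        )
        print(json.dumps(note, indent=2))
```

In this sketch the "since" filter runs on the client side for simplicity; a portal API that supports access to contributions "since …" for a web identity, as slide 35 asks for, would apply the same cut-off on the server and return only the new artifact URIs.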